TnovoMS is a combined de-novo/database-search tool for the interpretation of MSMS spectra of peptides obtained by protein digestion. For each input spectrum, it finds the best de-novo prediction (i.e., the peptide that optimizes the match to the spectrum, according to a score function) and a probability profile. The latter indicates which parts of the predicted peptide are more reliable. The profile is also used to perform a search in the Uniprot database, to find which sequence in the database best matches the profile in terms of a suitably defined search score. Comparing this search score with those of the other sequences in the database, and with those obtained from a decoy database, makes it possible to estimate the number of false positives.

Core aspects that make TnovoMS different from other de-novo and database search methods

  1. The interpretation of MSMS spectra is mapped onto the identification of the minimal-energy state of a suitably defined one-dimensional statistical-physics system on a discrete-mass lattice. Every possible sequence corresponds to a configuration of the system variables, and is assigned an energy (that is, our score function) according to how well the spectrum it would generate matches the experimental one.
  2. The best-scoring (= minimal-energy) state is found as the low-"temperature" limit of the thermodynamic equilibrium of the system. This approach offers a different perspective from the standard alternatives, which usually produce a best-scoring sequence and an arbitrarily long list of suboptimal ones. Here an exact profile, obtained from the weighted average of ALL the alternative predictions, is provided along with the best-scoring sequence.

    In physical terms, the standard search for the sequence with the maximum score (i.e., minimum energy, in our mapping) is equivalent to finding the ground state of the system, while the suboptimal sequences map onto the first excited states of the energy spectrum. Upon introducing a fictitious temperature T in the system, the ground-state sequence is recovered as the equilibrium state at T=0, while for small but non-vanishing T the system explores all the alternative, suboptimal states (weighting each with the Boltzmann factor exp(-E/T)), yielding average values of their characteristics. Notice, however, that at higher temperatures the equilibrium is dominated by the entropy: the system prefers exploring huge regions of suboptimal sequences to sticking with a few high-scoring ones. Finally, in the limit of infinite temperature, all sequences fitting the total precursor mass are given the same probability. The optimal temperature is therefore one that is low enough to allow a clear identification of the ground state, yet high enough that the ground state is not the only populated state and suboptimal states also contribute to the probability profiles.

  3. The profile makes it possible to identify which parts of the best-scoring peptide are more reliably predicted.
  4. In the database-search step, the predicted profile is matched against each sequence in the database: thus, the comparison is made at the level of the profile (already including information from the de-novo prediction) vs sequence, and not of the theoretical spectrum vs the experimental one.
  5. The match score in the database search is used to produce an indicator (z-score) of how much the best-matching sequence stands out from all the others. Comparing it with the match against a decoy database makes it possible to estimate a safe threshold for the z-score, so as to minimize the number of false-positive predictions.
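
The role of the fictitious temperature described in point 2 can be illustrated with a minimal Python sketch. It is not the TnovoMS implementation: it simply enumerates a handful of hypothetical candidate sequences with invented energies, whereas the tool works on a discrete-mass lattice and averages over all sequences. The function name `boltzmann_profile`, the candidate list, and the energy values are all assumptions made for illustration.

```python
import math

def boltzmann_profile(candidates, temperature):
    """Per-position residue probabilities from (sequence, energy) pairs.

    candidates: list of (sequence, energy); sequences must have equal length
                (a toy stand-in for "all sequences fitting the precursor mass").
    temperature: fictitious temperature T > 0.
    """
    # Shift energies by the minimum so that the largest weight is exp(0) = 1.
    e_min = min(energy for _, energy in candidates)
    weights = [math.exp(-(energy - e_min) / temperature) for _, energy in candidates]
    z = sum(weights)  # partition function, up to the constant shift

    length = len(candidates[0][0])
    profile = [dict() for _ in range(length)]
    for (sequence, _), w in zip(candidates, weights):
        for position, residue in enumerate(sequence):
            profile[position][residue] = profile[position].get(residue, 0.0) + w / z
    return profile

# Hypothetical candidates: permutations of the same residues (same total mass),
# with energies invented purely for illustration.
toy_candidates = [("PEPTK", -10.0), ("EPPTK", -9.0), ("PETPK", -6.0)]

for t in (0.1, 1.5, 1000.0):
    profile = boltzmann_profile(toy_candidates, t)
    # Print the most probable residue at each position with its probability.
    print(t, [max(p, key=p.get) + f":{max(p.values()):.2f}" for p in profile])
```

At the lowest temperature the profile collapses onto the minimal-energy sequence; at the highest, the three candidates are weighted almost equally, so only the positions on which they agree keep a probability close to 1.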

Warnings and "To Do" list

  • Only spectra with charge q less than or equal to 3 can be dealt with. Higher charges will generate an error.
  • In de-novo predictions, I and L residues are indistinguishable (they have the same mass and characteristics): the user should remember this degeneracy when reading the results.
  • The algorithm calculates the thermodynamic equilibrium properties of the system, dealing with exponentials of the form exp(-E/T), which will yield overflow errors for T going to zero. This does not prevent the identification of the minimal-energy, best-scoring de-novo sequence, since upon lowering the temperature the ground state is easily identified as the most populated one, at temperatures well above the onset of numerical underflow or overflow problems. However, the user should be aware that choosing a very small temperature will inevitably generate NaNs (Not a Number) in the results. The temperature threshold at which numerical problems start to appear depends on the spectrum, so we cannot provide a universal recipe for selecting a safe temperature; our experience suggests that values of T between 1 and 2 are reasonable in terms of de-novo sequence and profile predictions.
  • At present, only tryptic rules are implemented: it is implicitly assumed that the precursor peptide was obtained by tryptic digestion of the precursor protein.
  • Post-Translational Modifications (PTMs) can be dealt with in the de-novo method simply by extending the species alphabet, adding the modified species to the 20 residues. However, we have not yet tested the performance of the method when including PTMs, so searches including PTMs are disabled at present.
  • The present version uses a mass discretization of roughly 1 amu: a finer discretization can be set for better prediction of spectra obtained with more precise instruments.
  • The present algorithm should be considered as a core-algorithm for a more general tool: it interprets the spectra one by one, and does not try to infer the whole protein: an extra layer should be added to score whole proteins against the predicted profiles.
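
The numerical warning above, about exponentials of the form exp(-E/T) at small T, can be reproduced with a short Python sketch. The energies are invented, and the function names `naive_weights` and `stable_weights` are hypothetical. Subtracting the minimum energy before exponentiating is a standard rescaling shown here only for illustration; the source does not say whether TnovoMS applies anything of the kind.

```python
import math

def naive_weights(energies, temperature):
    """Direct Boltzmann weights exp(-E/T); can overflow at small T."""
    raw = [math.exp(-e / temperature) for e in energies]
    total = sum(raw)
    return [r / total for r in raw]

def stable_weights(energies, temperature):
    """Same normalized weights, computed after subtracting the minimum energy."""
    e_min = min(energies)
    # Every exponent is now <= 0, so no term can overflow.
    raw = [math.exp(-(e - e_min) / temperature) for e in energies]
    total = sum(raw)  # at least 1.0: the ground state contributes exp(0)
    return [r / total for r in raw]

energies = [-50.0, -48.0, -30.0]  # invented values, for illustration only

for t in (1.5, 0.05):
    try:
        print(t, "naive: ", naive_weights(energies, t))
    except OverflowError as err:
        print(t, "naive failed:", err)
    print(t, "stable:", stable_weights(energies, t))
```

With these toy numbers the direct evaluation fails at T = 0.05 because exp(1000) exceeds the double-precision range, while at T = 1.5 both functions agree. In array libraries that return inf instead of raising an error, the same overflow shows up as NaN once inf is divided by inf.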


Please note that this is still a beta version, and we appreciate feedback and suggestions. The T-novoMS server is an ongoing project: collaborations, sponsorship or any kind of help are welcome. If you want to get involved, please contact Pierpaolo Bruscolini pier _AT_ unizar.es

Copyright © 2014 BIFI