Input files format: Description of DTA files
(Text adapted from Mascot help[http://www.matrixscience.com/help/data_file_help.html#DTA]: see the source for more detail)
In the DTA format, the first line contains the singly protonated peptide mass (MH+) and the peptide charge state as a pair of space-separated values. Subsequent lines contain space-separated pairs of fragment ion m/z and intensity values. Note that in a DTA file the precursor peptide mass is always an MH+ value, independent of the charge state. For example, if the peak of a doubly charged precursor peptide is observed at m/z=1000, then M+2H+ =2000, so the molecular mass M of the parent peptide is 1998 amu (taking the proton mass as 1 amu). The corresponding DTA file starts with the line: 1999 2 where the provided mass is that of M+H+.
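The conversion described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the proton-mass constant are assumptions introduced here, not part of the program, and the parser assumes a well-formed file with exactly two values per line.

```python
# Approximate proton mass in amu (assumption; the text above rounds it to 1).
PROTON_MASS = 1.00728

def mh_plus_from_mz(mz, charge, proton=PROTON_MASS):
    """Convert an observed precursor m/z at a given charge into the MH+ value."""
    neutral_mass = charge * (mz - proton)  # M = z * (m/z) - z * H
    return neutral_mass + proton           # MH+ = M + H

def parse_dta(text):
    """Parse DTA text into (MH+, charge, [(fragment m/z, intensity), ...])."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    mh_plus, charge = float(rows[0][0]), int(float(rows[0][1]))
    peaks = [(float(mz), float(intensity)) for mz, intensity in rows[1:]]
    return mh_plus, charge, peaks
```

With the proton mass rounded to 1, mh_plus_from_mz(1000.0, 2, proton=1.0) reproduces the value 1999 of the example above.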
Output files format
A jobname.zip or jobname.tar.gz file is sent to the specified email address, with the following contents:
- spectrum-list.txt: the list of the input files and the index associated with each of them.
- prediction-list.txt: the list of the best-scoring de-novo and database-search sequences, together with the scores obtained.
- de-novo_probabilities.zip (or tar.gz): the probabilities associated with the de-novo prediction.
- FDR_result.txt: an estimate of the False Discovery Rate for the whole set of submitted spectra, for different values of the confidence threshold.
- warning_FDR-result.txt: a list of spectra for which the z-score, for either the target or the decoy database, is negative. This usually indicates a problem with the probability profile of that spectrum, possibly due to numerical errors at exceedingly low temperatures. Since the threshold for output in FDR_result.txt is always positive, these problematic spectra do not contribute to the FDR_result.txt file.
- job.log: a log file of the run, which also reports any spectra that were ignored because they did not satisfy the charge criterion q<=3.
For completeness, two extra files, of little interest to the ordinary user, are also included:
- param.ini: file with the complete set of parameters used by the program.
- fullprofiles.zip (or tar.gz): the full probability profiles, along the whole mass array, for each spectrum.
The complete description of the output files is as follows:
spectrum-list.txt
                column 1: index associated with the spectrum
                
                column 2: filename of the dta file containing the spectrum
                
            
prediction-list.txt
                column 1: index
                
                column 2: filename
                
                column 3: de-novo predicted precursor sequence
                
                column 4: average energy (de-novo score) of the predicted de-novo sequence, U=<H> (see Ref[1], text above Eq.10)
                
                column 5: best database sequence, P
                
                column 6: “energy per residue” score e(P)= -log(pdb(P))/L(P)  associated to the best-scoring database sequence (Eq. 12 in Ref[1] and text below)
                
                column 7: z-score zT (Eq. 13 in Ref[1]), measuring how far e(P) is from the average calculated over all the sequences in the database.
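As an illustration of how the seven columns above might be read, here is a minimal Python sketch. It assumes that the columns are whitespace-separated and that sequences contain no spaces, neither of which is stated on this page; the names Prediction, parse_prediction_line and above_threshold are hypothetical.

```python
from typing import NamedTuple

class Prediction(NamedTuple):
    index: int                 # column 1
    filename: str              # column 2
    denovo_sequence: str       # column 3
    denovo_energy: float       # column 4, U=<H>
    db_sequence: str           # column 5, P
    energy_per_residue: float  # column 6, e(P)
    z_score: float             # column 7, zT

def parse_prediction_line(line):
    """Split one line of prediction-list.txt into its seven columns (assumed whitespace-separated)."""
    f = line.split()
    return Prediction(int(f[0]), f[1], f[2], float(f[3]), f[4], float(f[5]), float(f[6]))

def above_threshold(predictions, z0):
    """Keep only the predictions whose z-score exceeds the threshold z0."""
    return [p for p in predictions if p.z_score > z0]
```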
                
            
de-novo_probabilities.zip (or tar.gz)
Contains one file for each spectrum, with the format:
                column 1: position “k”, along the discretized mass array, of the peptide bonds of the de-novo predicted sequence
                
                column 2: probability of the residue N-terminal to the peptide-bond “k”, in the de-novo predicted sequence; see Eq. 10 in Ref.[1]
                
                column 3: X (not used)
                
            
FDR_result.txt
                column 1: z0 , threshold value for the z-score (z> z0 is considered meaningful)
                
                column 2: Ns, number of spectra
                
                column 3: NT,  number of identifications in target database with z> z0
                
                column 4: ND,  number of identifications in decoy database with z> z0
                
                column 5: FDR=ND/NT False Discovery Rate (classical definition)
                
                column 6: coverage = NT/Ns: fraction of spectra identified with z> z0
                
                column 7: FDR_NavarroVazquez=(2*decoy_best+decoy_only)/(target_best+target_only+decoy_best) (see J. Proteome Res. 2009, 8, 1792–6)
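The quantities in the columns above can be illustrated with a short Python sketch that computes one row from per-spectrum z-scores. The function name fdr_row and its inputs are hypothetical. The way each spectrum is classified as target_only, decoy_only, target_best or decoy_best (and the treatment of ties) is an assumption based on a reading of the cited paper, not something specified on this page.

```python
import math

def fdr_row(z0, target_z, decoy_z):
    """Return (z0, Ns, NT, ND, FDR, coverage, FDR_NavarroVazquez) for one threshold.

    target_z[i] and decoy_z[i] are the best z-scores of spectrum i against the
    target and the decoy database, respectively (hypothetical input layout).
    """
    ns = len(target_z)
    nt = sum(1 for z in target_z if z > z0)   # target identifications above z0
    nd = sum(1 for z in decoy_z if z > z0)    # decoy identifications above z0
    fdr = nd / nt if nt else math.nan         # classical definition ND/NT
    coverage = nt / ns if ns else math.nan

    # Assumed classification: a spectrum passing in both databases is assigned
    # to whichever scores higher (ties go to the target).
    target_only = decoy_only = target_best = decoy_best = 0
    for zt, zd in zip(target_z, decoy_z):
        t_hit, d_hit = zt > z0, zd > z0
        if t_hit and d_hit:
            if zt >= zd:
                target_best += 1
            else:
                decoy_best += 1
        elif t_hit:
            target_only += 1
        elif d_hit:
            decoy_only += 1

    denom = target_best + target_only + decoy_best
    fdr_nv = (2 * decoy_best + decoy_only) / denom if denom else math.nan
    return (z0, ns, nt, nd, fdr, coverage, fdr_nv)
```

Calling fdr_row for a range of z0 values would produce a table with the same seven columns, under the assumptions stated above.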
                
            
fullprofiles.zip (or tar.gz)
Contains one file for each spectrum, with the format:
                column 1: position “k”, along the discretized mass array
                
                columns 2-20: probabilities that “k” is the peptide bond C-terminal to each of the 19 residues, in the order GASPVTCINDQEMHFYWKR (L has the same mass as I, and is indistinguishable from it).
                
                column 21: '0.0' (not used)
                
                column 22: '*' (not used)
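A line of a full-profile file could be mapped onto residues with a sketch like the following, in Python. It assumes whitespace-separated columns and an integer position “k”, which this page does not state explicitly; parse_profile_line and best_residue are hypothetical helper names.

```python
RESIDUES = "GASPVTCINDQEMHFYWKR"  # residue order of columns 2-20 (I also stands for L)

def parse_profile_line(line):
    """Return (k, {residue: probability}) for one line of a full-profile file."""
    fields = line.split()
    if len(fields) < 1 + len(RESIDUES):
        raise ValueError("expected at least 20 columns")
    k = int(fields[0])  # assumption: k is written as an integer index
    # Columns 2-20 hold the 19 probabilities; columns 21 and 22 are not used.
    probs = {res: float(p) for res, p in zip(RESIDUES, fields[1:20])}
    return k, probs

def best_residue(probs):
    """Return the residue with the highest probability at this position."""
    return max(probs, key=probs.get)
```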
                
            
Reference
[1] M. Faccin; P. Bruscolini, "MS/MS spectra interpretation as a statistical-mechanics problem", Analytical Chemistry 85 (10), 4884-4892 (2013)