PrionScan Help & Credits
Background
Prion proteins are a special class of amyloids that, in their aggregated forms, act as heritable elements: self-replicating entities that can be perpetuated and transmitted over generations. Prions are generally ubiquitous proteins that have specific functions when folded and that also perform important beneficial functions in cells after amyloid conversion. They act as epigenetic elements, evolutionary capacitors and bet-hedging devices during adaptation to environmental fluctuations in microorganisms, and take part in mechanisms crucial for maintaining long-term physiological states in invertebrates. Prions are also involved in a diverse group of serious, and in some cases incurable, pathologies caused by infectious prions in humans and other mammals. However, few prions have been characterized so far: only a handful in some microorganisms and mammals. There is also evidence suggesting that the number of proteins in the genomes of organisms that might behave as prions could be considerably higher. There is therefore a need for computational and experimental methodologies to uncover the complete set of prions in the genomes of organisms and to reach a more complete understanding of the functional regulatory mechanisms of prions.
Initially, researchers tried to use methodologies developed to estimate the propensity of protein sequences to form amyloid aggregates, based on the prediction of β-sheet-forming regions from the primary sequence [Fernandez-Escamilla et al., 2004; Trovato et al., 2007; Zibaee et al., 2007; Bryan et al., 2009], but these algorithms proved quite ineffective for analyzing prion sequences. Accordingly, there have been some attempts to generate models of prionogenicity and use them to evaluate the prion-forming ability of some proteins, and even to scan large sets of protein sequences or complete proteomes [Michelitsch & Weissman, 2000; Harrison & Gerstein, 2003]. However, only recently has the number of experimentally characterized prions increased to a level that gives us the opportunity to begin to understand the sequential and structural determinants of prion conversion. In a recent report, a new set of prion sequences was computationally predicted and experimentally tested, yielding an ample set of prion domains in yeast [Alberti et al., 2009]. An equally interesting finding of the same work was a second set of sequences that are compositionally similar to prions and were predicted by the authors' computational model, but that showed no prion-like behavior. These sequences are therefore very useful as a negative test set when building subsequent prion models. This groundbreaking work has served as the basis for subsequent studies aimed at generating improved computational models to assess the similarity of protein segments to prion-forming regions. An example is the algorithm PAPA [Toombs et al., 2012], a strategy based on mutational libraries of a prion (SUP35) whose prionogenicity is tested in vivo [Toombs et al., 2010], ultimately resulting in an experimental technique to measure the prion propensities of individual amino acids.
We took a different approach: we concentrated on extracting as much information as possible from the prions and non-prions reported by Alberti et al. [2009] to generate our computational model, and then thoroughly benchmarked it to test its suitability for searching large sequence databases [Angarica et al., 2013]. That article gives a detailed description of the main characteristics of our method and a comparison with other available methods.
The Methodology
We started with a group of 29 proteins that were tested as prions in vivo and in vitro in the report by Alberti et al. [2009], which we used as the training set for obtaining the amino acid propensities in prion domains. Another set of 18 high-scoring prion predictions reported in that article was used as the negative test set for benchmarking the methodology. From these data we obtained a model containing the statistical significance of the propensities of all the amino acids in prion domains, which can be used to assess the prionogenicity of protein sequences. We then developed a sliding-window program to scan protein sequences (using a window size of 60 amino acids). We first used it to benchmark the model by estimating the tradeoff between Sensitivity and Specificity, and between Precision and Recall, in recovery tests against non-prions, Intrinsically Unstructured Proteins from Disprot, and large datasets extracted from PDB and Uniprot (see the figures below). After confirming the good performance of our model, we scanned all the complete proteomes annotated in UniprotKB, the main and most complete repository of protein sequences available. Based on our statistical analysis we set a cutoff score of 50 bits for accepting putative prion sequences, and from all the proteins in UniprotKB we retrieved those predicted to have prion-like domains.
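The sliding-window scan described above can be illustrated with a minimal sketch. It assumes that a window's score is the sum of per-residue log-odds values in bits, which is an assumption of this example and not a statement of the published scoring function. The table `toy_log_odds` is a made-up placeholder, not the PrionScan model, and the function names are hypothetical. Only the window size (60) and the cutoff (50 bits) come from the text.

```python
from typing import Dict, List, Optional, Tuple

WINDOW_SIZE = 60     # window size stated in the text
CUTOFF_BITS = 50.0   # acceptance cutoff stated in the text


def scan_sequence(seq: str, log_odds: Dict[str, float],
                  window: int = WINDOW_SIZE) -> List[Tuple[int, float]]:
    """Return (start, score) for every window of the sequence.

    ASSUMPTION: the window score is the sum of per-residue log-odds values.
    Residues missing from the table contribute 0.
    """
    if len(seq) < window:
        return []  # sequences shorter than the window cannot be scanned
    per_residue = [log_odds.get(aa, 0.0) for aa in seq]
    current = sum(per_residue[:window])
    profile = [(0, current)]
    for start in range(1, len(seq) - window + 1):
        # Slide by one residue: add the entering one, drop the leaving one.
        current += per_residue[start + window - 1] - per_residue[start - 1]
        profile.append((start, current))
    return profile


def best_window(seq: str, log_odds: Dict[str, float],
                window: int = WINDOW_SIZE) -> Optional[Tuple[int, float]]:
    """Return the (start, score) of the highest scoring window, or None."""
    profile = scan_sequence(seq, log_odds, window)
    return max(profile, key=lambda item: item[1]) if profile else None


def is_putative_prion(seq: str, log_odds: Dict[str, float],
                      cutoff: float = CUTOFF_BITS) -> bool:
    """True if the best window reaches the cutoff."""
    best = best_window(seq, log_odds)
    return best is not None and best[1] >= cutoff


# Made-up values for illustration only; NOT the real propensities.
toy_log_odds = {"Q": 1.0, "N": 1.0, "A": -0.5}
toy_sequence = "A" * 10 + "Q" * 60 + "A" * 10
```

With these toy values, the full profile returned by `scan_sequence` plays the role of the score profile plotted on the Detailed Output Page, and `best_window` gives the position and score of the highest scoring domain.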
ROC plots of the PrD recovery and bootstrapping assays.
The score histograms of the negative and positive datasets were processed, and the true positive rate (TPR) was plotted against the false positive rate (FPR) in a test in which the known PrDs (i.e. positives in all four experimental tests) are picked out from a test dataset of non-prions (i.e. negatives in all four experimental tests). In red we show the plot obtained using our model, which has an area under the curve (AUC) of 0.90. We also include the result of a bootstrap assay in which the 18 prions used as the training set were resampled 10^6 times, forming partial training sets of 9 prions and generating positive test sets for the ROC plot analysis from the remaining 9 prions. One million ROC plots were generated, always using the same negative set, and the average ROC curve was calculated (shown in blue); its AUC is 0.85.
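As a hedged illustration of how a ROC curve and its AUC can be derived from two score distributions, the sketch below sweeps a threshold over the observed scores and integrates with the trapezoidal rule. The function names and the toy scores are assumptions of this example. It does not reproduce the bootstrap procedure or the values reported above.

```python
from typing import List, Tuple


def roc_points(pos: List[float], neg: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) pairs, from the strictest to the most lenient threshold.

    A sequence is called positive when its score is >= the threshold.
    """
    thresholds = sorted(set(pos) | set(neg), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is accepted
    for t in thresholds:
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        points.append((fpr, tpr))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores for illustration only.
toy_pos = [3.0, 1.0]
toy_neg = [2.0, 0.0]
toy_auc = auc(roc_points(toy_pos, toy_neg))
```

In this toy case three of the four positive/negative pairs are ranked correctly, so `toy_auc` is 0.75. An average curve like the blue one would be obtained by repeating the calculation for each resampled test set against the same negative set.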
Precision-recall plots for the comparison of PrD and non-prionogenic sequence distributions.
For each of the three additional negative datasets, which include proteins from Uniprot, Disprot and the PDB, we track how the classifier's Precision evolves as known PrD segments are identified within a pool of non-prionogenic sequences. These values are plotted against the TPR (i.e. recall) of the corresponding classification step. The ratio between the number of instances in each positive and negative distribution is also shown.
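The precision-recall calculation can be sketched in the same spirit. The function name and toy scores below are assumptions made for this illustration and are not taken from the PrionScan benchmark.

```python
from typing import List, Tuple


def precision_recall(pos: List[float],
                     neg: List[float]) -> List[Tuple[float, float]]:
    """Return (recall, precision) pairs for each threshold, strictest first.

    A sequence is called positive when its score is >= the threshold.
    """
    thresholds = sorted(set(pos) | set(neg), reverse=True)
    curve = []
    for t in thresholds:
        tp = sum(s >= t for s in pos)   # known PrDs recovered
        fp = sum(s >= t for s in neg)   # non-prionogenic sequences accepted
        # tp + fp >= 1 here because t is itself one of the observed scores.
        curve.append((tp / len(pos), tp / (tp + fp)))
    return curve


# Toy scores for illustration only.
toy_curve = precision_recall([3.0, 1.0], [2.0, 0.0])
```

Unlike the FPR, precision depends on the relative sizes of the positive and negative sets, which is why the ratio between the two distributions matters when reading these plots.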
Searching the Database
Our database is automatically updated every four weeks to keep our predictions synchronized with the updates of UniprotKB. We systematically scan the repositories of complete proteomes within UniprotKB (taxonomic divisions), including proteins resulting from genome sequencing and annotation projects, which are subdivided into two complementary and non-redundant datasets: a) Swissprot, for fully annotated curated entries, and b) TrEMBL, formed by computer-generated entries enriched with automated classification and annotation. There are two different ways of querying the database:
Simple Searches
A simple search gives direct access to the information for a single protein, given its UniprotKB identifier or principal accession number. This option is also the best alternative for querying the database with information from one of the searchable fields: Taxon, Organism Name, Protein Name (Recommended Name, Alternative Name and Submitted Name) and the Gene Ontology Terms for Molecular Function, Biological Process and Cellular Component. For example, it is possible to retrieve all the putative prion proteins in the genome of an organism by providing the complete or partial organism name.
Complex Searches
In a complex search, the query can be refined by combining multiple fields from the database, i.e. Taxon, Organism Name, Protein Name (Recommended Name, Alternative Name and Submitted Name) and the Gene Ontology Terms for Molecular Function, Biological Process and Cellular Component. These fields can be combined as needed by entering the search terms in the rightmost tabs and selecting the field to be considered in the leftmost tabs. You can also choose the logical operators that combine the query terms. Using this option it is possible, for example, to retrieve all the prion-like proteins that have a similar Molecular Function or are related to a specific Biological Process in the genome of a specific organism.
All the search tabs have help buttons for in-site help.
The Output
After a search for a specific protein using its UniprotKB identifier or principal accession number, if the selected protein has prion-like domains the output is a Detailed Output Page (for a graphical view please see the following Figure). This page includes the UniprotKB identifier (ID) and principal accession number (AC), the source (Source) of the protein (Swissprot or TrEMBL), the organism name (Organism) and taxon (Taxon), the names of the protein (recommended names: RecName and/or alternative names: AltName and/or submission names: Subname), the highest scoring prion domain in the sequence (PrD), the score of that domain (Score), its position in the protein sequence (Position), a representation of the complete protein sequence with the highest scoring prion domain highlighted in green (Sequence), and a graphical representation of the scan of the complete protein sequence (Plot), a chart of the score profile along the sequence that also shows the score used for making the predictions. In addition to these fields, the Detailed Output Page may also include the Gene Ontology Terms associated with the protein for Molecular Function, Biological Process and/or Cellular Component, and cross-references to other databases such as EMBL, Refseq, Pfam and so on. However, if the search, whether a Simple Search or a Complex Search, retrieves more than one entry, the output is a General Output Page (for a graphical view please see the following Figure) with columns and rows whose contents depend on the search conducted; some columns can be sorted dynamically in ascending or descending order. Every row shown in the General Output Page links to a Detailed Output Page as described above.
At the bottom of the General Output Page we include a short summary of the number of results retrieved by the query. This area is also used to browse forwards and backwards through the pages of the General Output Page, either by using the page links or by entering the exact page number in the ‘Go to page’ box. Regardless of the type of query, the retrieved results can be downloaded as a compressed file containing all the information displayed in the web version, including all the information for the entries and the associated scanning plots. This information is in HTML format and can be displayed locally using any web browser. The TXT file can be easily parsed by ad hoc scripts written by users for large-scale in-house offline analysis of our data. In this file, the first two letters of each line indicate the contents of the rest of the line. There may be several consecutive lines starting with the same pair of letters. For example, if there are several 'RecNames', each one appears on a different line. This can happen for the following lines: 'RecNames', 'AltNames', 'SubNames', 'Molecular_Function', 'Biological_Process', 'Cellular_Component', 'EMBL', 'RefSeq', 'KEGG', 'Pfam', 'PDB'. As the sequences are very long, they are split into blocks of 70 characters; each block is displayed on a separate line, preceded by the two letters denoting that it is a sequence. The pairs of letters and what they mean are shown below:
ID => ID | AC => AC | SO => Source |
OR => Organism | TA => Taxon | RN => RecNames |
AN => AltNames | SN => SubNames | PR => PrD |
SC => Score | PO => Position | SE => Sequence |
MF => Molecular_Function | BP => Biological_Process | CC => Cellular_Component |
EM => EMBL | RS => RefSeq | KE => KEGG |
PF => Pfam | PD => PDB | // => Entry end |
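As a starting point for such ad hoc scripts, here is a minimal parsing sketch. The two-letter codes and the `//` entry terminator come from the table above. The exact spacing between the code and the value is not specified here, so the sketch assumes the code occupies the first two characters and strips any whitespace that follows. The function names and the sample lines are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

# Two-letter line codes and their meanings, as listed in the table above.
FIELD_NAMES = {
    "ID": "ID", "AC": "AC", "SO": "Source",
    "OR": "Organism", "TA": "Taxon", "RN": "RecNames",
    "AN": "AltNames", "SN": "SubNames", "PR": "PrD",
    "SC": "Score", "PO": "Position", "SE": "Sequence",
    "MF": "Molecular_Function", "BP": "Biological_Process",
    "CC": "Cellular_Component",
    "EM": "EMBL", "RS": "RefSeq", "KE": "KEGG",
    "PF": "Pfam", "PD": "PDB",
}


def parse_entries(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping each field name to a list of values.

    Repeated codes (e.g. several RN or SE lines) accumulate in order.
    ASSUMPTION: the value is whatever follows the two-letter code.
    """
    entry: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        code, value = line[:2], line[2:].strip()
        if code == "//":          # end of entry
            if entry:
                yield entry
            entry = {}
            continue
        name = FIELD_NAMES.get(code)
        if name is None:
            continue              # unknown code: ignored in this sketch
        entry.setdefault(name, []).append(value)
    if entry:                     # tolerate a missing final terminator
        yield entry


def full_sequence(entry: Dict[str, List[str]]) -> str:
    """Join the 70-character SE blocks back into one sequence string."""
    return "".join(entry.get("Sequence", []))


# Made-up lines for illustration; not real PrionScan output.
sample_lines = [
    "ID  EXAMPLE_1", "AC  X00001",
    "RN  First name", "RN  Second name",
    "SE  MSDSNQ", "SE  GNNQQN", "//",
    "ID  EXAMPLE_2", "//",
]
sample_entries = list(parse_entries(sample_lines))
```

To process a downloaded file, the same function can be fed an open file handle, since it only iterates over lines.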
Processing your own sequences
Users have complete flexibility to test the prionogenicity of their own protein sequences using the Sequence Analysis (from text or file) functionality. First, select the appropriate option in the Submission Form, either to paste a limited number of sequences in FASTA format or to upload a file with a large number of protein sequences. The file can be a flat file or a compressed file (gzip or bzip2) in FASTA format; the limit is 500MB for compressed files, which we estimate can contain approximately one million sequences. Users can also select the prediction cutoff that best suits their needs. If only one of the submitted sequences bears prion-like domains, the output is a Detailed Output Page with the specific information for that protein. If the analysis yields more than one protein with prion domain predictions, the output is a General Output Page with one row for each protein with predictions. As with results obtained by searching the database, each row links to a Detailed Output Page with the specific information for the selected sequence. If the number of sequences is less than 5000, the output is generated in a few seconds in HTML format as just described; when the number of sequences is higher than this value, the job is submitted to our computer cluster for processing, and the results are sent to the user by e-mail upon completion. Please note that, because we defined a window size of 60 amino acids, only sequences with at least 60 residues are processed by our algorithm.
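Before uploading a large file, it can be useful to check locally which sequences meet the 60-residue minimum. The sketch below reads a plain, gzip or bzip2 FASTA file and keeps only sequences long enough to be scanned. Choosing the decompressor from the file extension, and all function names, are assumptions of this example and not part of the PrionScan service.

```python
import bz2
import gzip
from typing import IO, Iterable, Iterator, Tuple

MIN_LENGTH = 60  # window size stated in the text; shorter sequences are skipped


def open_fasta(path: str) -> IO[str]:
    """Open a FASTA file as text, choosing the reader from the extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)   # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(chunks)


def scannable(records: Iterable[Tuple[str, str]],
              min_length: int = MIN_LENGTH) -> Iterator[Tuple[str, str]]:
    """Keep only records long enough to contain at least one full window."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


# Made-up records for illustration.
toy_fasta = [">short", "MQNQ", ">long", "Q" * 30, "N" * 30]
toy_kept = list(scannable(read_fasta(toy_fasta)))
```

For a real file, `scannable(read_fasta(open_fasta(path)))` streams the records without loading the whole file into memory, and counting the yielded records indicates whether a submission falls under or over the 5000-sequence threshold mentioned above.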
Credits
This database is the result of a collaboration between the following groups:
- Group of Protein Folding and Molecular Design, at the Department of Biochemistry and Molecular and Cellular Biology, University of Zaragoza, led by Prof. Javier Sancho.
- Group of Protein Folding and Conformational Diseases, at the Institut de Biotecnologia i de Biomedicina from the Universitat Autònoma de Barcelona, led by Prof. Salvador Ventura.
The site is hosted at the Institute for Biocomputation and Physics of Complex Systems, University of Zaragoza. The site maintenance and management is coordinated by the Protein Folding and Molecular Design group (ProtMol) led by Javier Sancho.
Contact
For information, suggestions or comments please contact:
Javier Sancho
or
Salvador Ventura
For troubleshooting please contact:
Alfonso López