Protposer more information
Help & Credits
The stabilization of proteins for their use in vivo or ex vivo constitutes a major goal of the protein industry, as it is key to developing novel and more efficient analytic, synthetic and therapeutic tools. While protein stabilization can be approached through trial-and-error solution formulation or through directed evolution techniques, structure-based approaches remain central to the field because they can readily suggest rational strategies based on quantitative structural and thermodynamic knowledge, and because they are expected to progress and become consistently reliable. Understanding protein stability and stabilization has very practical biotechnological implications, such as extending the shelf life of protein-based products or generating thermostable proteins for industrial uses.
Many approaches have been described for the rational stabilization of proteins. Protposer combines some of the more successful ones and automates their application to any protein. From a specified PDB file, Protposer generates a list of potentially stabilizing mutations and uses a machine learning algorithm based on Logistic Regression to score them according to their potential for being stabilizing. This is performed by evaluating features such as similarity to the consensus sequence, composition of α-helices, exposure of polar and apolar residues, size of internal cavities, Solvent Accessible Surface Area, or electrostatic interactions. At the end of the evaluation, Protposer returns a list of proposed mutations in order of decreasing probability of being stabilizing.
The references for Protposer, and others related to the calculations it performs, are listed in the Bibliography section.
Methodology
To propose potentially stabilizing mutations, this web application extracts and calculates different sequence and structural features of the protein indicated by the user in the form of a Protein Data Bank (PDB) file or just a PDB ID. The two key phases of the workflow are proposal of mutations and evaluation of the mutations (including scoring).
Protposer functioning
Figure 1 shows the basic steps performed when servicing a request to Protposer. Once the input form is submitted and its correctness is checked, the request is queued until previous requests are finished. Protposer then prepares the input file for the calculations and runs the algorithm for the proposal of mutations. Then, it evaluates the mutations, scores them according to the logistic regression model, and generates an HTML summary of the results, which is e-mailed to the user. If an error is found in the structure or during the calculations, an error report is generated and sent to the user instead. The section “Theory behind the project” describes in detail the theoretical basis and functioning of the different modules participating in the proposal and evaluation steps.
Figure 1. General workflow of the program
Input
There are two ways to indicate the protein structure to Protposer: as a PDB file or as a PDB ID referring to an entry in the worldwide Protein Data Bank (wwPDB). As the algorithm is designed to work with only one chain, a chain must be selected too. A general explanation of the fields of the input form follows.
Title of the job
A short title (max. 200 characters) to help the user identify the request when receiving the e-mail with the results. In addition, Protposer provides a unique identifier for the job request.
Protein structure
The protein structure can be provided in two ways, which can be selected using the radio buttons. The option that is not selected appears disabled.
- The structure is provided by the user as a text file following the PDB format (see http://www.wwpdb.org/docs.html). Protposer lets the user browse their computer for the PDB file to upload. If the PDB file contains several chains, Protposer will generate a file containing only the chain on which the calculations will be performed. If the PDB file contains modified residues, they must be indicated in a MODRES entry in the header part of the file (e.g. as in the PDB entry 3TEJ).
- A PDB identifier (4 characters) is provided, identifying an entry in the worldwide Protein Data Bank (http://www.wwpdb.org), from which the PDB file will be downloaded and processed as in the alternative case above.
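As an illustration of the chain selection described above, the following minimal Python sketch keeps only the coordinate records of one chain from PDB-format text. It is not Protposer's actual code: the function name `extract_chain` and the sample records are hypothetical, and the sketch relies only on the fixed-column PDB layout, in which the chain identifier occupies column 22.

```python
def extract_chain(pdb_text, chain_id):
    """Return the ATOM/HETATM lines of pdb_text that belong to chain_id."""
    kept = []
    for line in pdb_text.splitlines():
        # Coordinate records carry the chain identifier in column 22 (index 21).
        if (line.startswith(("ATOM", "HETATM"))
                and len(line) > 21
                and line[21] == chain_id):
            kept.append(line)
    return kept


# Hypothetical three-atom structure with two chains (A and B).
SAMPLE = "\n".join([
    "ATOM      1  N   MET A   1      11.104  13.207   2.100  1.00 20.00           N",
    "ATOM      2  CA  MET A   1      12.560  13.100   2.300  1.00 20.00           C",
    "ATOM      3  N   GLY B   1      21.000  23.000   5.000  1.00 20.00           N",
])

chain_a = extract_chain(SAMPLE, "A")  # the two records of chain A
```

A real preprocessing step would also have to deal with header records such as MODRES, which this sketch ignores.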
Chain
The chain on which the calculations will be performed must be indicated by the user, as the algorithm does not currently support the analysis of interactions between the different chains of the protein. To obtain a proposal of mutations in different chains of the same PDB structure, individual jobs must be sent for each chain.
E-mail address
An e-mail address to which the results will be sent must be provided. Please double-check that it is written correctly: if the address is misspelled, the results will be lost. We use the e-mail address only to send the results to the user and to monitor the fair use of this service.
Academic use checkbox
Protposer is free only for academic, non-commercial use, because the same restriction applies to some of the software used in the calculations. We must therefore ensure that our whole web application is used only under those terms, so checking this box is required before the calculations can start. If you work for a non-academic corporation, get in touch with us to enquire about possibilities of use.
Results
You will receive an e-mail with the results of the calculations, with “Protposer” as the subject. If the calculations finish properly, the body of the e-mail will contain the main data of the job (job ID, name of the project, name of the PDB file used and selected chain). The e-mail will also include a summary of the calculations as an attached HTML file, as well as a link to the web results visualizer, which includes a 3D viewer of the protein structure. The results are kept available on the web for only 14 days.
If Protposer experienced problems when doing the calculations, you will receive instead an e-mail with no attached files, with the same main data as before, but with an additional field indicating the error that has occurred and some suggestions to avoid that error when resending the job.
HTML summary
You can save the HTML page attached to the e-mail and open it with a web browser. Alternatively, you can open the HTML page in your e-mail client, but most e-mail clients do not render HTML pages correctly.
The HTML file comprises a table of proposed single point mutations in decreasing values of probability, and two self-explanatory legends. When passing the mouse cursor over them, the headers of the table show the meaning of the columns. The contents of the table are the following:
- Mutation: Proposed mutation, in one letter code
- Success probability: Estimated Positive Predictive Value, TP/(TP+FP), where TP and FP are the numbers of true and false positives. It is calculated from the probability provided by the Logistic Regression model using experimental data, and it is more accurate than that probability
- Monomer contact: Warning of contact of the mutated residue with residues in another monomer in the most probable biological structure predicted by PISA (https://www.ebi.ac.uk/pdbe/pisa/) or in the other chains present in the PDB file
- CSA: Warning of presence of the mutated residue in the Catalytic Site Atlas (http://www.ebi.ac.uk/thornton-srv/databases/CSA/). If it is present, the mutation proposed may drastically reduce the activity of the protein
- SITE: Warning of presence of the mutated residue in a region annotated as a site in the PDB file. If it is present, the mutation may drastically reduce the activity of the protein
- Proposed by: Module that has proposed the mutation. More information can be found in the “Theory behind the project” section of this help page
  - Cons: Prediction of the consensus sequence
  - Hel: Study of alpha-helix stability
  - Exp: Study of the exposure and polarity of the residues
  - Cav: Study of the volume of internal cavities in the protein
  - SASA: Study of the Solvent Accessible Surface Area
  - Ele: Study of the electrostatic interactions of the protein
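The success probability column reports a positive predictive value, TP/(TP+FP). The short Python sketch below shows that formula on its own; the function name and the counts are hypothetical and are not taken from Protposer's validation data.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP): fraction of predicted stabilizing mutations that really are."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("PPV is undefined when no mutation is predicted as stabilizing")
    return true_positives / total


# Hypothetical counts: 30 proposed mutations confirmed as stabilizing, 10 not confirmed.
ppv = positive_predictive_value(30, 10)
```

With these illustrative counts, 30 of the 40 proposals are correct, so the PPV is 0.75.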
Version log
Protposer v1.1
Version changes:
- Implemented local BLAST search, for enhanced robustness and reproducibility
- Improved CD-HIT threshold selection algorithm for more efficient selection of representative homologous sequences, for improved speed of calculation
- Built database of precalculated results for PDB ID codes, for a faster display of results and reduced workload on the server
Performance check:
Protposer v1.1 was executed on the ED+ dataset described in García-Cebollada et al. (2022). The number of proposed mutations increased by around 14%. The performance of v1.1 was compared with that of v1.0 in terms of relative number of proposed mutations, PPV and mean ΔΔG (Figure 6 in the aforementioned paper). Although a slight improvement for eSR values under 42% and a slight decline for eSR values over 42% were observed, no significant changes in predictive performance were detected.
Protposer v1.0
As presented in García-Cebollada et al. (2022).
Theory behind the project
Protposer makes use of some empirically tested strategies for proposing potentially stabilizing mutations. Several sequence and structural features are evaluated: similarity with consensus or with ancestral sequences, amino acid composition of alpha helices, exposure and polarity of residues, presence of exposed acidic residues forming H-bonds, size of internal cavities, electrostatic interactions, Solvent Accessible Surface Area (SASA) and possibility of introducing steric clashes. Thermodynamic or evolutionary reasons supporting the use of these strategies are given below.
Consensus sequence
An approach widely used for protein stabilization, due to its simplicity and efficacy, is to look for differences, at the sequence level, between the protein one wishes to stabilize and its family of homologous proteins, as exemplified by Paatero et al. [1]. Changes in sequence reveal a different evolutionary path between two proteins and may be related to changes in function and/or in stability. This approach relies on the hypothesis that the most conserved residues in a homologous protein family (the ones in the consensus sequence) are those most responsible for maintaining the activity and the structure of the protein. Therefore, replacement of non-consensus residues present in the wild type sequence by those in the consensus sequence can lead to protein stabilization [2].
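The consensus idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the module used by Protposer: the sequences are taken to be already aligned and of equal length, the function name `consensus_mutations` is hypothetical, and the 50% frequency threshold (`min_fraction`) is an arbitrary choice for the example.

```python
from collections import Counter


def consensus_mutations(wild_type, homologs, min_fraction=0.5):
    """Propose wild type -> consensus substitutions from pre-aligned sequences."""
    proposals = []
    for pos, wt_res in enumerate(wild_type):
        # Residues observed at this alignment column, ignoring gaps.
        column = [seq[pos] for seq in homologs if seq[pos] != "-"]
        if not column:
            continue
        residue, count = Counter(column).most_common(1)[0]
        # Propose a change only where the wild type deviates from a clear consensus.
        if residue != wt_res and count / len(column) >= min_fraction:
            proposals.append(f"{wt_res}{pos + 1}{residue}")
    return proposals


# Toy alignment: position 2 is K in the wild type but R in three of four homologs.
proposals = consensus_mutations("MKTA", ["MRTA", "MRSA", "MRTA", "MKTG"])
```

In this toy alignment only position 2 differs from the consensus, so the single proposal is K2R, written in the same one-letter notation as the results table.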
Ancestral sequence
In the same vein as the consensus sequence approach, ancestral sequence reconstruction (ASR) builds a phylogenetic tree of the extant family of homologous proteins in order to find the common ancestral sequence, according to the most likely phylogenetic history of the family of proteins. This approach relies on the hypothesis that the temperature on Earth was higher when the ancestral organisms were alive than it is now, so that the ancestral sequences most likely correspond to thermophilic variants of the current proteins. Therefore, replacement of wild type protein residues by those present at equivalent positions in the ancestral sequence is expected to stabilise the protein [3,4].
Alpha helices
Among the elements of secondary structure in proteins, the energetics of α-helices is relatively well understood. Experimental studies and theoretical models aimed at understanding the propensity of the different residues to form α-helices suggest that three different helix segments should be distinguished in order to understand mutational effects on helix (hence protein) stability [5, 6]. They are the N-cap residue (located at the N-terminus of the helix, where a partial positive charge resides), the C-cap residue (at the opposite helix end, where a partial negative charge is located), and the internal residues of the helix (which experience the influence of the helix dipole less strongly). The effect on stability of replacing a residue at any of the three helical segments by any other residue has been determined and modelled in helical peptides [7], and it can be used to propose replacement of α-helical wild type residues making a suboptimal contribution to protein stability. Mutations are therefore proposed for those suboptimal helical residues, provided their side chains do not establish additional hydrogen bonds with other protein atoms.
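To make the three helix segments concrete, the following Python sketch labels residues from a per-residue secondary structure string in which `H` marks helical residues. It is a simplification made for illustration: the first and last residues of each helical run are taken as the N-cap and C-cap, the function name `label_helix_segments` is hypothetical, and Protposer's own assignment of caps may differ.

```python
def label_helix_segments(secondary_structure):
    """Label residues as 'N-cap', 'internal' or 'C-cap' (None outside helices)."""
    n = len(secondary_structure)
    labels = [None] * n
    i = 0
    while i < n:
        if secondary_structure[i] != "H":
            i += 1
            continue
        start = i
        # Advance to the end of this run of helical residues.
        while i < n and secondary_structure[i] == "H":
            i += 1
        end = i - 1
        for j in range(start, end + 1):
            labels[j] = "internal"
        labels[start] = "N-cap"      # simplification: first helical residue
        if end > start:
            labels[end] = "C-cap"    # simplification: last helical residue
    return labels


labels = label_helix_segments("CCHHHHHCC")
```

Each label would then select which set of empirical substitution energies applies to a residue at that position.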
Exposure and polarity of residues
The polarity of residues and their position in the three-dimensional structure of the protein influence stability. Protein cores are characteristically composed of hydrophobic residues, which do not interact favourably with water and perturb water-water interactions when exposed to the solvent. Burial of polar residues in the hydrophobic core is destabilizing, as the favourable interactions with water that they established in the unfolded state of the protein are lost [8]. Also, overexposed apolar residues interfere with water-water interactions, which destabilizes the structure of the protein too [9]. Therefore, mutations are proposed to replace buried polar residues by apolar ones, and to replace overexposed apolar residues by polar ones.
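The rule above can be sketched as a simple filter on relative solvent exposure. Everything specific in this Python example is an assumption made for illustration: the polar and apolar residue sets, the exposure thresholds (0.10 and 0.50) and the function name `flag_residues` are hypothetical and are not Protposer's actual parameters.

```python
# Illustrative residue classes (an assumption; other classifications exist).
POLAR = set("STNQDEKRH")
APOLAR = set("AVLIMFW")


def flag_residues(residues, buried_below=0.10, exposed_above=0.50):
    """Flag buried polar and overexposed apolar residues.

    residues: iterable of (label, one-letter code, relative SASA between 0 and 1).
    """
    flagged = []
    for label, aa, rel_sasa in residues:
        if aa in POLAR and rel_sasa < buried_below:
            flagged.append((label, "buried polar"))
        elif aa in APOLAR and rel_sasa > exposed_above:
            flagged.append((label, "overexposed apolar"))
    return flagged


# Hypothetical residues with made-up relative exposures.
flags = flag_residues([
    ("S10", "S", 0.02),   # polar and buried: flagged
    ("L25", "L", 0.80),   # apolar and exposed: flagged
    ("V40", "V", 0.05),   # apolar and buried: fine
    ("K7", "K", 0.90),    # polar and exposed: fine
])
```

Flagged buried polar residues would be candidates for replacement by apolar ones, and flagged overexposed apolar residues for replacement by polar ones.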
Exposed hydrogen-bonded acidic residues
A hydrogen bond is a type of electrostatic interaction established between a hydrogen atom bonded to an electronegative atom and a neighbouring acceptor electronegative atom. Despite the abundance of hydrogen bonds in proteins, we still lack clear experimental evidence proving that protein stabilization can be achieved through the engineering of novel hydrogen bonds. A more conservative approach indicates that the substitution of surface-exposed acidic residues, whose side chains form hydrogen bonds with other protein groups, by their neutral isosteres (Asp by Asn, or Glu by Gln) can stabilise the protein [10]. Therefore, mutations of exposed acidic residues engaged in hydrogen bonding to their neutral isosteres are proposed.
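The substitution rule for exposed, hydrogen-bonded acidic residues is easy to express as a lookup. In this Python sketch the exposure and hydrogen-bond flags are assumed to have been computed beforehand, and the function name `isosteric_mutations` and the input tuples are hypothetical.

```python
# Neutral isosteres of the acidic residues named in the text: Asp -> Asn, Glu -> Gln.
NEUTRAL_ISOSTERE = {"D": "N", "E": "Q"}


def isosteric_mutations(residues):
    """Propose Asp->Asn / Glu->Gln for exposed acidic residues whose side chain is H-bonded.

    residues: iterable of (position, one-letter code, is_exposed, side_chain_h_bonded).
    """
    return [
        f"{aa}{pos}{NEUTRAL_ISOSTERE[aa]}"
        for pos, aa, is_exposed, h_bonded in residues
        if aa in NEUTRAL_ISOSTERE and is_exposed and h_bonded
    ]


# Hypothetical residues: only D12 and E45 meet all three conditions.
mutations = isosteric_mutations([
    (12, "D", True, True),
    (30, "E", True, False),   # exposed but not hydrogen-bonded
    (45, "E", True, True),
    (50, "K", True, True),    # not acidic
    (60, "D", False, True),   # buried
])
```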
Internal cavities
Cavities are void spaces inside the protein or close to its surface where solvent molecules (e.g. water) could fit. External cavities, such as pockets, clefts or channels, can be linked to the function of the protein, as they may be binding sites. Therefore, only internal cavities should be modified to increase protein stability [11]. Some studies show that aliphatic deletions in the protein hydrophobic core lead to less stable protein variants [12], while filling internal cavities may have a stabilizing effect [13]. Therefore, small-to-large aliphatic mutations are proposed for residues in contact with internal cavities.
Electrostatic interactions
Electrostatic interactions are due to the electric force established between any two charged groups of a molecule. In the case of proteins, methods that try to improve the electrostatic balance have proved effective [14, 15]. They do so either by reducing the electrostatic repulsions that arise in the native structure from the spatial concentration of equally charged residues, or by engineering ionisable residues that can establish new favourable electrostatic interactions. Different models have been developed to guide the optimization of these interactions. Our program uses a method that applies the Poisson-Boltzmann model to the folded structure to calculate the contribution of each charged residue to protein stability [16]. Then, charge inversion and neutralization mutations are proposed for the charged residues that make unfavourable contributions to the stability of the protein.
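The idea of finding charged residues that contribute unfavourably can be illustrated with a much cruder model than the one Protposer uses. The Python sketch below deliberately replaces the Poisson-Boltzmann calculation with plain Coulomb's law in a uniform dielectric, so it only shows the bookkeeping, not the actual method. The function name, the point charges and their coordinates (in Å) and the dielectric constant of 80 are assumptions.

```python
import math

COULOMB_KCAL = 332.06  # Coulomb constant in kcal*Å/(mol*e^2)


def charge_contributions(charges, dielectric=80.0):
    """Sum pairwise Coulomb energies (kcal/mol) for each charged group.

    charges: list of (label, charge in units of e, (x, y, z) in Å).
    NOTE: a uniform-dielectric stand-in, not a Poisson-Boltzmann calculation.
    """
    totals = {}
    for i, (label_i, q_i, xyz_i) in enumerate(charges):
        total = 0.0
        for j, (_, q_j, xyz_j) in enumerate(charges):
            if i != j:
                distance = math.dist(xyz_i, xyz_j)
                total += COULOMB_KCAL * q_i * q_j / (dielectric * distance)
        totals[label_i] = total
    return totals


# Hypothetical charges: two nearby positive groups and one distant negative group.
charges = [
    ("K5", 1, (0.0, 0.0, 0.0)),
    ("R9", 1, (4.0, 0.0, 0.0)),
    ("E30", -1, (0.0, 20.0, 0.0)),
]
contributions = charge_contributions(charges)
# Positive totals indicate net repulsion, i.e. an unfavourable contribution.
unfavourable = [label for label, energy in contributions.items() if energy > 0]
```

In this toy case the two neighbouring positive charges repel each other, so K5 and R9 come out as candidates for charge inversion or neutralization, while E30 does not.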
Solvent Accessible Surface Area
Solvent Accessible Surface Area (SASA) is the area of the protein surface that is accessible to a probe solvent molecule (typically a water molecule). In our program, SASA calculations are used to detect and discard mutations that would connect existing cavities of the wild type protein to the solvent, thus giving rise to large apparent changes in SASA.
Steric clashes
Each atom of a protein occupies a certain space. When one residue is replaced by another, some atoms of the newly introduced residue may lie too close to other atoms of the protein, which greatly destabilizes the structure. All proposed mutations are checked to ensure they do not generate steric clashes.
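A steric clash check reduces to comparing interatomic distances against a minimum allowed separation. The following Python sketch shows that idea under stated assumptions: the single 2.5 Å cutoff, the function name `find_clashes` and the coordinates are hypothetical, and a realistic check would use element-specific radii.

```python
import math


def find_clashes(new_atoms, other_atoms, min_distance=2.5):
    """Return (new atom, other atom) name pairs closer than min_distance (in Å).

    new_atoms, other_atoms: lists of (atom name, (x, y, z) in Å).
    """
    clashes = []
    for name_a, xyz_a in new_atoms:
        for name_b, xyz_b in other_atoms:
            if math.dist(xyz_a, xyz_b) < min_distance:
                clashes.append((name_a, name_b))
    return clashes


# Hypothetical side-chain atoms of a mutated residue and two neighbouring atoms.
side_chain = [("CB", (0.0, 0.0, 0.0)), ("CG", (1.5, 0.0, 0.0))]
neighbours = [("O", (1.5, 1.0, 0.0)), ("N", (10.0, 0.0, 0.0))]
clashes = find_clashes(side_chain, neighbours)
```

Here both side-chain atoms lie within 2.5 Å of the neighbouring oxygen, so a mutation modelled this way would be rejected.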
Scoring model
Using Machine Learning techniques, a Logistic Regression model has been developed that evaluates the conservation of the mutated residue in its protein family, the composition of the protein helices if the mutation is located in one, the exposure and polarity of the residue, the size of internal cavities, the SASA, and the changes in electrostatic interactions upon mutation. With these values, it returns a probability of the mutation being stabilizing by more than 0.5 kcal/mol. However, our subsequent analysis indicates that the probability provided by the Logistic Regression model underestimates the real performance of the program. We thus provide the Positive Predictive Value (PPV) associated with the calculated probability of a proposed mutation, obtained by fitting probabilities and PPV for a large set of described mutations. The reported PPV of any proposed mutation is the actual probability of the mutation being stabilizing by at least 0.5 kcal/mol. For further details on the calculations and on the training and testing process, please check this article: [ARTICLE]
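The scoring step can be pictured as a logistic function applied to a weighted sum of features, followed by sorting. The Python sketch below shows only that mechanism: the weights, bias, feature vectors and mutation names are invented for the example and have nothing to do with the trained Protposer model, and the mapping from probability to PPV is not reproduced.

```python
import math


def stabilizing_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model parameters and two made-up features per mutation.
WEIGHTS = [1.0, -0.5]
BIAS = 0.0
candidates = {
    "K2R": [2.0, 0.0],
    "D12N": [0.0, 0.0],
    "L25S": [0.0, 2.0],
}

# Rank mutations by decreasing probability, as in the results table.
ranked = sorted(
    candidates,
    key=lambda name: stabilizing_probability(candidates[name], WEIGHTS, BIAS),
    reverse=True,
)
```

With these invented numbers, the ranking is K2R, D12N, L25S, because their weighted sums are 2, 0 and -1 respectively.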
Bibliography
[1] Paatero, A. et al. Crystal Structure of an Engineered LRRTM2 Synaptic Adhesion Molecule and a Model for Neurexin Binding. Biochemistry 55, 914–926 (2016).
[2] Koonin, E. V. Orthologs, Paralogs, and Evolutionary Genomics. Annu. Rev. Genet. 39, 309–338 (2005).
[3] Merkl, R. & Sterner, R. Ancestral protein reconstruction: Techniques and applications. Biol. Chem. 397, 1–21 (2016).
[4] Boussau, B., Blanquart, S., Necsulea, A., Lartillot, N. & Gouy, M. Parallel adaptations to high temperatures in the Archaean eon. Nature 456, 942–945 (2008).
[5] Doig, A. J. Recent advances in helix–coil theory. Biophys. Chem. 102, 281–293 (2002).
[6] Serrano, L., Sancho, J., Hirshberg, M. & Fersht, A. R. Alpha-Helix stability in proteins. I. Empirical correlations concerning substitution of side-chains at the N and C-caps and the replacement of alanine by glycine or serine at solvent-exposed surfaces. J. Mol. Biol. 227, 544–559 (1992).
[7] Muñoz, V. & Serrano, L. Elucidating the Folding Problem of Helical Peptides Using Empirical Parameters. Nat. Struct. Biol. 1, 399–409 (1994).
[8] Ayuso-Tejedor, S., Abián, O. & Sancho, J. Underexposed polar residues and protein stabilization. Protein Eng. Des. Sel. 24, 171–177 (2011).
[9] Strub, C. et al. Mutation of exposed hydrophobic amino acids to arginine to increase protein stability. BMC Biochem. 5, 9 (2004).
[10] Irun, M. P., Maldonado, S. & Sancho, J. Stabilization of apoflavodoxin by replacing hydrogen-bonded charged Asp or Glu residues by the neutral isosteric Asn or Gln. Protein Eng. 14, 173–181 (2001).
[11] Krone, M. et al. Visual Analysis of Biomolecular Cavities: State of the Art. Comput. Graph. Forum 35, 527–551 (2016).
[12] Bueno, M., Campos, L. A., Estrada, J. & Sancho, J. Energetics of aliphatic deletions in protein cores. Protein Sci. 15, 1858–1872 (2006).
[13] Bueno, M., Cremades, N., Neira, J. L. & Sancho, J. Filling Small, Empty Protein Cavities: Structural and Energetic Consequences. J. Mol. Biol. 358, 701–712 (2006).
[14] Loladze, V. V, Ibarra-Molero, B., Sanchez-Ruiz, J. M. & Makhatadze, G. I. Engineering a Thermostable Protein via Optimization of Charge−Charge Interactions on the Protein Surface. Biochemistry 38, 16419–16423 (1999).
[15] Campos, L. A., Garcia-Mira, M. M., Godoy-Ruiz, R., Sanchez-Ruiz, J. M. & Sancho, J. Do proteins always benefit from a stability increase? Relevant and residual stabilisation in a three-state protein by charge optimisation. J. Mol. Biol. 344, 223–237 (2004).
[16] Estrada, J., Echenique, P. & Sancho, J. Predicting stabilizing mutations in proteins using Poisson–Boltzmann based models: study of unfolded state ensemble models and development of a successful binary classifier based on residue interaction energies. Phys. Chem. Chem. Phys. 17, 31044–31054 (2015).