Protposer more information

Help & Credits

Stabilizing proteins has long been a major goal in structural biology. It has theoretical implications for a better understanding of protein folding as well as practical biotechnological uses, such as extending the shelf life of protein-based products or generating thermoresistant protein variants for industrial use. Many approaches to the rational stabilization of proteins have been described in the literature, some with a higher success rate than others. Protposer takes some of these rational approaches and automates them, proposing a list of potentially stabilizing mutations for a specified PDB file.

Protposer then uses a machine learning algorithm based on Logistic Regression to score all proposed mutations according to their potential to be stabilizing. This prediction is performed by evaluating features such as similarity to the consensus sequence, composition of alpha helices, exposure of polar and apolar residues, size of internal cavities, Solvent Accessible Surface Area (SASA) and electrostatic interactions. Finally, it returns the proposed mutations in an easy-to-interpret report, in order of decreasing probability of being stabilizing.

The references for Protposer and for related work appear in the Bibliography section.

Methodology

To propose potentially stabilizing mutations, this web application extracts and calculates different features of the protein, from both its structure and its sequence, using the information supplied by the user in the form of a Protein Data Bank (PDB) file or ID. The two key phases of the workflow are the proposal of mutations and the evaluation of those mutations (including scoring).

Protposer functioning

Figure 1 shows the basic steps performed when servicing a request to Protposer. Once the input form is submitted and checked for correctness, the request is queued until previous requests have finished. Protposer then prepares the input file for the calculations and runs the algorithm for the proposal of mutations. Next, all proposed mutations are evaluated and scored according to the logistic regression model. After the calculations, the HTML summary of the results is generated and sent to the e-mail address provided by the user. The section “Theory behind the project” describes in detail the theoretical basis and functioning of the modules in the proposal and evaluation steps.

Input

There are two ways of supplying the protein structure to Protposer: as a PDB file or as a PDB ID referring to an entry in the worldwide Protein Data Bank (wwPDB). The algorithm is designed to work with a single chain, so a chain must be selected as well. A general explanation of the fields in the input form follows.

Title of the job

A short title (max. 200 characters) to help the user identify the request when receiving the e-mail with the results. In addition, Protposer provides a unique identifier for the job request.

Protein structure

The protein structure can be provided in two ways, which are selected using the radio buttons. The option that is not selected is disabled.

  • The structure is provided by the user as a text file following the PDB format (see http://www.wwpdb.org/docs.html). The file can be located on the user’s computer and then uploaded. If the PDB file contains several chains, Protposer will generate a file with only the chain on which calculations will be performed.
  • A PDB identifier (4 characters) is provided, identifying an entry in the worldwide Protein Data Bank (http://www.wwpdb.org), from which the PDB file will be downloaded and processed as in the previous case.

Chain

The chain on which the calculations are to be performed must be indicated, as the algorithm does not currently support the study of interactions between two chains of a protein. To study the stabilization of different chains of the same PDB structure, a separate job must be submitted for each chain.
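The single-chain preparation step can be illustrated with a minimal sketch. This is not Protposer's actual code: the function name `extract_chain` is hypothetical, and the only thing taken from the PDB format itself is that the chain identifier of ATOM, HETATM and TER records sits in column 22.

```python
def extract_chain(pdb_text: str, chain_id: str) -> str:
    """Return PDB-format text holding only the coordinate records of one chain.

    Hypothetical helper for illustration; header records are dropped.
    """
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        # Column 22 of the PDB format (index 21) holds the chain identifier.
        if record in ("ATOM", "HETATM", "TER") and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

Calling `extract_chain(text, "A")` on the contents of a multi-chain file would keep only the records of chain A, which mirrors why one job per chain is needed.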

E-mail address

An e-mail address to which the results will be sent must be provided. Please double-check that you have entered it correctly: if the address is misspelled, the results will be lost. We use the e-mail address only to send the results and to monitor the fair use of this service.

Academic use checkbox

As some of the software used for the calculations is free only for academic, non-commercial use, the use of our whole web application is likewise restricted to academic, non-commercial use, so checking this box is required to start the calculations. If you belong to a non-academic organization, get in touch with us to enquire about possibilities of use.

Results

You will receive an e-mail with the results of the calculations. This e-mail will contain a body section, with a summary of the calculations attached as an HTML file. The e-mail will have the subject Protposer Results (job 47) (with your job id instead of 47). If Protposer encounters problems during the calculations, you will instead receive an e-mail with the subject Protposer error (job 47).

The body section will contain XXXXXX

HTML summary

You can save the HTML page attached to the e-mail and open it with a web browser. Alternatively, you can open the HTML page in your e-mail client, but most e-mail clients do not render HTML pages correctly.

The HTML file comprises a table of proposed single point mutations in order of decreasing probability and two self-explanatory legends. Hovering the mouse cursor over a column header shows the meaning of that column. The contents of the table are the following:

  • Mutation: Proposed mutation, in one letter code
  • PPV: Estimated Positive Predictive Value, calculated from probability using experimental data (more accurate than probability)
  • Probability: Estimated probability of the mutation being stabilizing according to the Logistic Regression Model
  • CSA: Presence of the mutated residue in the Catalytic Site Atlas (http://www.ebi.ac.uk/thornton-srv/databases/CSA/). If it is present, the mutation may drastically reduce the activity of the protein
  • SITE: Presence of the mutated residue in a region annotated as a site in the PDB file. If it is present, the mutation may drastically reduce the activity of the protein
  • Proposed by: Module that has proposed the mutation. More information can be found in the “Theory behind the project” section of the help page
  • Cons: Prediction of the consensus sequence
  • Hel: Study of alpha-helix stability
  • Exp: Study of the exposure and polarity of the residues
  • Cav: Study of the volume of internal cavities in the protein
  • SASA: Study of the Solvent Accessible Surface Area
  • Ele: Study of the electrostatic interactions of the protein
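As an illustration of how the rows of such a table could be handled, the following minimal sketch sorts proposals by decreasing probability and flags those whose residue appears in the CSA or in a SITE annotation. The class and field names are hypothetical and are not taken from Protposer's report format.

```python
from dataclasses import dataclass


@dataclass
class ProposedMutation:
    # Hypothetical record mirroring some of the columns described above.
    mutation: str
    probability: float
    in_csa: bool = False
    in_site: bool = False


def rank_by_probability(rows):
    """Return the rows sorted from most to least probable."""
    return sorted(rows, key=lambda row: row.probability, reverse=True)


def may_reduce_activity(row):
    """True when the residue is listed in the CSA or in a SITE annotation."""
    return row.in_csa or row.in_site
```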

Theory behind the project

Protposer makes use of some empirically proven approaches for proposing mutations that are potentially stabilizing. Several criteria are considered: similarity with the consensus and ancestral sequences, composition of alpha helices, exposure and polarity of the residues, presence of exposed acidic residues forming H-bonds, size of internal cavities, electrostatic interactions, Solvent Accessible Surface Area (SASA) and possibility of steric clashes. All those criteria and their bases are explained below.

Consensus sequence

A widely used approach for protein stabilization, owing to its effectiveness and simplicity, is the search for differences between the protein to be stabilized and its family of homologous proteins. Changes in sequence reveal different evolutionary paths between two proteins and may reflect changes in function or in stability. This approach relies on the hypothesis that the most conserved residues in a family of homologous proteins (the ones in the consensus sequence) are those most responsible for maintaining the activity and structure of the protein. Therefore, positions at which the protein differs from the consensus sequence are candidate sites for potentially stabilizing mutations.
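The consensus idea can be sketched in a few lines. This is a simplified illustration, not Protposer's implementation: it assumes a ready-made multiple sequence alignment given as equal-length strings with "-" for gaps, takes the most frequent residue per column as the consensus, and writes proposals in a wild type, position, new residue notation (for example K2R) chosen here for illustration. The function names are hypothetical.

```python
from collections import Counter


def consensus_sequence(alignment):
    """Most frequent residue in each alignment column, ignoring gaps."""
    consensus = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


def propose_consensus_mutations(aligned_target, alignment):
    """List positions where the target differs from the consensus."""
    consensus = consensus_sequence(alignment)
    proposals = []
    position = 0  # residue numbering of the ungapped target, starting at 1
    for target_res, consensus_res in zip(aligned_target, consensus):
        if target_res == "-":
            continue
        position += 1
        if consensus_res != "-" and target_res != consensus_res:
            proposals.append(f"{target_res}{position}{consensus_res}")
    return proposals
```

For a target aligned as `MKTA` against homologs in which arginine dominates the second column, the sketch proposes `K2R`.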

Ancestral sequence

In the same vein as the consensus sequence approach, there is ancestral sequence reconstruction (ASR). ASR reconstructs the phylogenetic tree of the extant family of homologous proteins in order to infer their common ancestral sequence, according to the most likely phylogenetic history of the family. This approach relies on the premise that temperatures on Earth were higher when those ancestors were alive, so that ancestral sequences are most likely thermophilic variants of current proteins. Therefore, positions at which the protein differs from the ancestral sequence are candidate sites for potentially stabilizing mutations.

Alpha helices

Alpha helices are one of the main motifs of secondary structure found in proteins. Several studies have been performed to understand the propensity of a sequence of amino acids to form a helix and how mutations affect the helix and, therefore, protein stability. The main common conclusion of these studies is that, in terms of the effect of mutations on stability, three different zones must be distinguished in an alpha helix: the N-cap, the end of the helix closer to the N-terminus, which carries a partial positive charge due to the free amine groups not hydrogen bonded within the helix; the C-cap, the end closer to the C-terminus, which carries a partial negative charge due to the free carbonyl groups; and the internal residues, which carry no partial charge, as both their amine and carbonyl groups form hydrogen bonds with other residues in the helix. Muñoz et al. designed, using empirical parameters, formulas for estimating which residues are the most appropriate for a helix and what their contribution to the Gibbs free energy would be. Therefore, mutations are proposed for those residues that are neither optimal nor suboptimal for their position in the helix, provided their side chains do not establish additional hydrogen bonds with other protein atoms.

Exposure and polarity of residues

The polarity of residues and their position in the three-dimensional structure are relevant for the stability of the protein. Usually, the inner core of the protein is composed of hydrophobic residues, as they do not interact favorably with water and perturb water-water interactions when exposed to the solvent. Burial of polar residues in the hydrophobic core is therefore destabilizing, as the favorable interactions with water established in the unfolded states of the protein are lost. Likewise, overexposed apolar residues interfere with water-water interactions, which also destabilizes the structure. Therefore, neutralization mutations are proposed for buried polar residues, and mutations to polar residues are proposed for overexposed apolar residues.
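A minimal sketch of this criterion follows. The residue groupings and the relative exposure thresholds (0.2 and 0.5) are assumptions made for illustration only; the values actually used by Protposer are not stated here.

```python
# Simplified, assumed groupings of one-letter residue codes.
POLAR = set("DEKRHNQST")
APOLAR = set("AVLIMFW")


def exposure_issue(residue, relative_sasa, buried_below=0.2, exposed_above=0.5):
    """Classify a residue given its relative solvent exposure (0 to 1).

    Returns "buried_polar", "exposed_apolar" or None. Thresholds are
    hypothetical defaults, not Protposer's parameters.
    """
    if residue in POLAR and relative_sasa < buried_below:
        return "buried_polar"
    if residue in APOLAR and relative_sasa > exposed_above:
        return "exposed_apolar"
    return None
```

A buried aspartate would be flagged as a candidate for a neutralization mutation, while an overexposed leucine would be flagged as a candidate for a mutation to a polar residue.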

Exposed hydrogen-bonded acidic residues

A hydrogen bond is a type of electrostatic interaction between a donor (a hydrogen atom bonded to a small, highly electronegative atom) and an acceptor (an electronegative atom near the donor). In proteins, hydrogen bonds contribute to the stabilization of some elements of secondary structure, such as alpha helices or beta sheets, and also to the tertiary structure. However, automating the modification of hydrogen bonds in a protein is risky, as a small change may modify the whole network of hydrogen bonds and, therefore, the structure of the protein, which may also affect its function, without any clear experimental evidence that stabilization can be achieved this way. Nevertheless, more conservative approaches have been developed. It has been reported that replacing exposed acidic residues whose side chains establish hydrogen bonds with other protein groups by their neutral isosteres (Asp to Asn, Glu to Gln) increases the strength of such bonds. Therefore, for exposed acidic residues that establish hydrogen bonds, mutations to their isosteric neutral form are proposed.
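The substitution rule stated above (Asp to Asn, Glu to Gln) is simple enough to sketch directly. The input layout, the function name and the mutation notation (wild type, position, new residue) are assumptions for illustration; how exposure and hydrogen bonding are determined is outside the scope of the sketch.

```python
# Acidic residue -> neutral isostere, in one-letter code (D->N, E->Q).
ISOSTERIC_NEUTRAL = {"D": "N", "E": "Q"}


def propose_isosteric_mutations(residues):
    """Propose neutral isosteres for exposed, hydrogen-bonded acidic residues.

    `residues` is an iterable of (one_letter_code, position, is_exposed,
    side_chain_is_h_bonded) tuples; this layout is hypothetical.
    """
    proposals = []
    for code, position, is_exposed, is_h_bonded in residues:
        if code in ISOSTERIC_NEUTRAL and is_exposed and is_h_bonded:
            proposals.append(f"{code}{position}{ISOSTERIC_NEUTRAL[code]}")
    return proposals
```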

Internal cavities

Cavities are spaces inside or close to the surface of a protein where solvent molecules (e.g. water) can fit. External cavities, such as pockets, clefts or channels, are usually linked to the function of the protein, as they may be binding sites. Therefore, only internal cavities should be modified. Some studies show that aliphatic deletions in the hydrophobic core of a protein lead to less stable variants, which suggests that internal cavities may have a destabilizing effect. Therefore, small-to-large aliphatic mutations are proposed for residues in contact with internal cavities.

Electrostatic interactions

Electrostatic interactions are due to the electric force between any two charged groups of a molecule. In the case of proteins, methods that try to improve the protein electrostatics are central as, to some extent, they constitute the basis for many other methods (e.g. helix composition takes into account the dipolar behavior of the helix and the establishment of hydrogen bonds). Different models have been developed for the optimization of these interactions. Our program uses the method developed by Estrada et al., based on the Poisson-Boltzmann model applied to the folded structure only, to calculate the contribution of each charged residue to protein stability. Then, charge inversion and neutralization mutations are proposed for unfavorably charged residues.

Solvent Accessible Surface Area

Solvent Accessible Surface Area (SASA) is the area of the surface of the protein that is accessible to a solvent molecule (e.g. water). The exposure of a mutated residue has sometimes been related to the effect of that mutation on stability. However, no clear protein design techniques have been developed using that approach. In our program, SASA is used to ensure that large changes in the volume of internal cavities do not correspond to the opening of an internal cavity (which would result in a large increase in SASA).

Steric clashes

Each atom in the protein occupies a certain space. When one residue is mutated to another, one of the new atoms may end up too close to another atom of the protein, destabilizing the structure. Because of this, all proposed mutations are checked to ensure they do not generate large steric clashes.
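A distance-based clash check can be sketched as follows. The 2.5 Å cutoff and the function name are assumptions chosen for illustration; the criterion Protposer applies to decide what counts as a large clash is not specified here.

```python
import math


def has_steric_clash(new_atoms, neighbor_atoms, min_distance=2.5):
    """Return True if any new atom lies closer than `min_distance` (in Å)
    to any neighboring atom.

    Atoms are (x, y, z) coordinate tuples. The cutoff is a hypothetical
    default, not Protposer's parameter.
    """
    return any(
        math.dist(new_atom, neighbor) < min_distance
        for new_atom in new_atoms
        for neighbor in neighbor_atoms
    )
```

Under this sketch, a proposed mutation whose modeled side chain returns `True` against the rest of the structure would be discarded.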

Scoring model

Using Machine Learning techniques, a Logistic Regression model has been developed that evaluates the conservation of the mutated residue, the composition of helices (if the mutation is located in one), the exposure and polarity of the residue, the size of internal cavities, the SASA and the changes in electrostatic interactions upon mutation. With these values, it returns the probability of the mutation being stabilizing. However, our research showed that these probabilities underestimate the real potential of the program, so a PPV (Positive Predictive Value, i.e. the percentage of true positives out of the predicted positives) is calculated by extrapolating from our calculated data. For further details on the calculations and on the training and testing process, please check this article: [ARTICLE]
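The two quantities mentioned above can be illustrated with a minimal sketch. It shows the generic logistic regression formula and the textbook definition of PPV only; the feature values, weights and bias are placeholders, and the extrapolation Protposer uses to convert a probability into a PPV is not reproduced here.

```python
import math


def stabilizing_probability(features, weights, bias):
    """Generic logistic regression: sigmoid of a weighted sum of features.

    The weights and bias are placeholders, not Protposer's trained values.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def positive_predictive_value(true_positives, false_positives):
    """PPV = true positives / all predicted positives."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        raise ValueError("PPV is undefined without predicted positives")
    return true_positives / predicted_positives
```

For example, if 3 out of 4 mutations predicted as stabilizing were confirmed experimentally, `positive_predictive_value(3, 1)` gives 0.75.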

Bibliography

TBA