Model | Sorbonne-U-Paris

Model Page

The enzymes PETase and MHETase, derived from Ideonella sakaiensis, have demonstrated a promising synergy in the degradation of polyethylene terephthalate (PET) into its constituent monomers, terephthalic acid (TPA) and ethylene glycol (EG) [1]. The engineering of chimeric proteins fusing these two enzymes has already led to enhanced PET degradation and MHET hydrolysis rates. However, there remains untapped potential for optimization through structural modifications. Drawing inspiration from these studies, we aim to investigate the effects of targeted mutations on the PETase-MHETase fusion protein. The objective is to explore how these mutations could improve enzymatic catalysis, particularly by optimizing substrate specificity, stabilizing enzyme-substrate complexes, and enhancing overall catalytic efficiency. This approach could yield significant advancements in enzymatic polymer depolymerization, bolstering the prospects for effective synthetic polymer recycling. Genetic variations and their effects on proteins are at the core of current biomedical research. Rapidly and accurately understanding how each amino acid substitution impacts protein function would allow for better control of these biomolecules to treat diseases, design new proteins, and identify novel therapeutic targets.

Build a prediction model (LSTM)

Initially, we undertook the development of a prediction model for mutations and their effects using deep learning methods. To achieve this, we designed a model based on a Long Short-Term Memory (LSTM) recurrent neural network to predict the impact of mutations on the MHETase-PETase protein and identify potentially beneficial mutations.

1. Data preparation

We began by collecting data from several CSV files containing wild-type protein sequences, their mutated variants, and associated scores indicating the effects of mutations (classified as pathogenic or benign). This data was sourced from the datasets provided by ProteinGym. Once extracted, the data was prepared for further processing.

Next, we created a specific vocabulary to transform each amino acid into a unique numerical index. This conversion is essential for representing protein sequences in a numerical format that can be utilized by a neural network, thereby enabling the effective training of our LSTM model. This approach was chosen to capture the sequential relationships between amino acids and their influence on the effects of mutations, which is crucial in protein analysis [2].

Example:
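A minimal sketch of such a vocabulary is given here. It assumes the 20 standard amino acids in alphabetical one-letter order and reserves index 0 for padding; both choices are illustrative and may differ from the vocabulary actually used in our model.

```python
# Assumption: the 20 standard amino acids, one-letter codes, alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: index 0 is reserved for padding, so residues start at index 1.
PAD_INDEX = 0
vocab = {aa: index for index, aa in enumerate(AMINO_ACIDS, start=1)}


def encode(sequence):
    """Map a protein sequence (string) to a list of integer indices."""
    return [vocab[aa] for aa in sequence.upper()]


indices = encode("MKTIIALSY")
print(indices)
```

Any unknown character raises a KeyError, which is a simple way to catch malformed sequences before training.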

2. Encoding of sequences and scores

The protein sequences were converted into numerical indices using the previously constructed vocabulary, with each amino acid assigned a unique index. This numerical representation of sequences is essential for processing by the LSTM model. Concurrently, the DMS scores, which indicate whether a mutation is pathogenic or benign, were transformed into binary values: 1 for pathogenic mutations and 0 for benign ones. This binary encoding facilitates the evaluation of the model's predictions regarding the effects of mutations.

Example:
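The sketch below illustrates this step under stated assumptions: sequences are right-padded with 0 to a fixed length, and a score below a cutoff is labeled pathogenic (1). The cutoff value, the direction of the comparison, and the two records are hypothetical and only serve to show the mechanics.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet, index 0 kept for padding
vocab = {aa: index for index, aa in enumerate(AMINO_ACIDS, start=1)}


def encode_sequence(sequence, max_len):
    """Encode a sequence as indices, truncated or right-padded with 0 to max_len."""
    indices = [vocab[aa] for aa in sequence[:max_len]]
    return indices + [0] * (max_len - len(indices))


def binarize_score(dms_score, threshold):
    """Return 1 (pathogenic) if the score is below the threshold, else 0 (benign).

    Assumption: lower DMS scores correspond to lower fitness.
    """
    return 1 if dms_score < threshold else 0


# Hypothetical records: (mutated sequence, DMS score)
records = [("MKTIIALSY", -1.2), ("MKTIA", 0.4)]
encoded = [
    (encode_sequence(seq, max_len=10), binarize_score(score, threshold=0.0))
    for seq, score in records
]
print(encoded)
```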

3. Construction of the LSTM Model

Imagine we have a string representing a sequence of amino acids from a protein. Each amino acid is represented by a letter, for example, "MKTIIALSY".

Step 1: Sequence Representation

First, we need to transform this sequence into a numerical format that the model can understand. Each amino acid is assigned a unique number in a vocabulary. Here’s an example of a vocabulary for amino acids:

M (Methionine): 0
K (Lysine): 1
T (Threonine): 2
I (Isoleucine): 3
A (Alanine): 4
L (Leucine): 5
S (Serine): 6
Y (Tyrosine): 7

For our sequence "MKTIIALSY", it would be represented numerically as:

[0, 1, 2, 3, 3, 4, 5, 6, 7]

Step 2: Model Construction

Embedding Layer: This layer takes the numerical indices and transforms them into dense vectors (a sort of "richer description" of the amino acids). This allows the model to better understand the relationships between different amino acids.

LSTM Layer: This layer examines the amino acid vectors while considering the order in which they appear. This means it can understand how one amino acid can influence others based on their position in the sequence. For example, it might learn that when Methionine (M) is followed by Lysine (K), it might have a particular impact on the protein's function.

Output Layers:

The first output layer predicts potential mutations in the sequence. For example, it might indicate that the sequence "MKT" could become "MKT*" (where * represents a mutation).
The second output layer predicts whether a mutation is pathogenic (negative) or benign (positive). For instance, if the mutation "MKT*" is considered pathogenic, the output would be 1; otherwise, it would be 0.

Step 3: Prediction

Once the model has been trained with numerous sequences and their outcomes, we can input a new sequence, and the model predicts:

What mutations might occur.
Whether these mutations are likely harmful or not.

Application

Suppose we have a new sequence "MKTIA". After passing this sequence through the model, it might predict that:

Potential mutation: "MKTIA" → "MKTIB" (where B is only a placeholder for the substituted residue, not a standard amino acid)
Effect of the mutation: Pathogenic (output 1)

4. Model training

In this step, we trained our model using small groups of data called mini-batches. Each mini-batch contained original protein sequences, their mutated versions, and scores that indicate whether the mutations are harmful or not.

To evaluate how well the model was performing, we used a method called Binary Cross Entropy (BCE). This method helps us see the difference between what the model predicted and what the actual scores were, specifically focusing on whether the mutation is pathogenic (harmful) or benign (not harmful).
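For illustration, the binary cross entropy can be written out in a few lines of plain Python. This is a sketch of the standard formula, not the code of our training loop; the clipping constant eps is an assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(predictions, targets, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(predictions)


# A confident correct prediction is penalized less than a confident wrong one.
print(binary_cross_entropy([0.9], [1]))
print(binary_cross_entropy([0.1], [1]))
```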

To improve the model, we applied the Adam algorithm. This is a technique that helps adjust the model's settings (called weights) efficiently so it can learn better from the data.

As the training progressed, we kept track of the average loss after each training cycle (called an epoch). By monitoring this average loss, we could see if the model was getting better at predicting the effects of mutations over time.

5. Mutation prediction

Once the model was trained, we implemented a function to predict mutations and their effects on a new protein sequence. The input sequence is first encoded into numerical indices and then passed through the model. The model returns two types of scores:

A probability score for each potential mutation, enabling the prediction of the most likely mutations.
An effect score indicating whether the mutation is likely to be beneficial or pathogenic, through binary classification.

6. Model Application

To test the model, the sequence of our protein of interest is inputted, and the model predicts both the probable mutations and the potential impact of these mutations. The results can be utilized to identify beneficial mutations for the studied protein, prioritizing those with a favorable effect score.

The results obtained from our model provide insights into the potential effects of mutations on the analyzed protein sequence. The raw mutation scores range from −0.3673 to 0.5865, suggesting that some mutations may have a favorable impact, while others could be less beneficial or even detrimental. After rescaling these scores to the interval from 0 to 1, the mutation with a normalized score of 1.0 is the most promising candidate, while a score of 0.0 indicates no predicted beneficial effect. Additionally, the normalized effect score of 0.0 suggests a non-beneficial or neutral potential. These predictions still need to be validated through further experimental work to explore the functional implications of the identified mutations and to optimize protein design strategies.
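The normalization step is not described in detail here. A minimal sketch, assuming a min-max rescaling (consistent with normalized scores of exactly 0.0 and 1.0), is shown below; the two extreme values are the ones reported above, while the intermediate values are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores identical: no ranking information, return zeros.
        return [0.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


# Extremes taken from the text; the two middle values are made up for illustration.
raw_scores = [-0.3673, 0.0120, 0.2500, 0.5865]
normalized = min_max_normalize(raw_scores)
print(normalized)
```

With this rescaling a score of 1.0 only means "best within this set", so it says nothing about the absolute size of the predicted effect.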

However, our model still has limitations regarding the prediction of mutation effects with a sufficiently high level of confidence. This lack of reliability can be partly attributed to the limited size of our dataset. To enhance the robustness of the predictions, it would be necessary to train the model on larger databases. Unfortunately, the time constraints did not allow us to complete this step. Moreover, the specificity of the training data must be taken into account. The effects of mutations observed on sequences homologous to MHETase and PETase will yield more reliable predictions than those based on genetically distant sequences in the phylogenetic tree. These considerations highlight the improvements needed to optimize the model's performance.

Furthermore, new protein design techniques have emerged in recent years, offering promising solutions for effectively predicting beneficial mutations, even in the absence of large amounts of data [3]. While our LSTM model represents an initial step in predicting mutation effects, there are still improvements to be made. By integrating recent techniques and expanding our dataset, we could enhance the predictive capabilities of our model, making it more effective in identifying beneficial mutations for our protein of interest.

Example:

Suppose the mutation scores (mutation_preds) and effect scores (effect_preds) are as follows:


    mutation_preds = [0.15, 0.65, 0.35, 0.90, 0.20]
    effect_preds = [0.05, 0.80, 0.50, 0.95, 0.10]

Interpretation

First position:

Mutation Score: 0.15
This score shows a low probability that this mutation is beneficial. In other words, it is unlikely that this mutation will have a positive effect on the protein.

Effect Score: 0.05
The effect of the mutation on the protein is close to neutral or potentially harmful. A score this low means that this mutation is probably disadvantageous or has no significant impact.

Second position:

Mutation Score: 0.65
This score shows a moderate probability that this mutation is beneficial. It is not certain that this mutation is positive, but there is a chance that it may have a favorable effect.

Effect Score: 0.80
This effect score is relatively high, suggesting that if the mutation occurs, it could have a very positive impact on the protein's function.

Third position:

Mutation Score: 0.35
Here, the mutation score shows a rather low probability of benefit. It is unlikely that this mutation is advantageous, but it is also not harmful.

Effect Score: 0.50
The effect score suggests that the mutation could be neutral or have a moderate effect on the protein. There is no strong indication of a positive or negative impact.

Fourth position:

Mutation Score: 0.90
This very high score indicates that there is a strong probability that this mutation is beneficial. This mutation is likely favorable and worth further study.

Effect Score: 0.95
The effect of this mutation is considered highly positive, with a strong potential to improve the function or stability of the protein. This is likely the most promising mutation in this set.

In this example, the mutation in the fourth position (with a mutation score of 0.90 and an effect score of 0.95) appears the most promising, and could be experimentally studied to see if it indeed improves the protein's function. On the other hand, the mutations in the first and fifth positions seem less advantageous and might be less relevant for future experimentation.
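The same comparison can be made programmatically. The sketch below ranks the five positions by the product of the two scores; using the product as a combined score is an assumption made for illustration, since no combination rule is defined above.

```python
mutation_preds = [0.15, 0.65, 0.35, 0.90, 0.20]
effect_preds = [0.05, 0.80, 0.50, 0.95, 0.10]


def rank_positions(mutation_scores, effect_scores):
    """Return 1-based positions ordered from most to least promising.

    Assumption: the combined score is the product of both predictions.
    """
    combined = [m * e for m, e in zip(mutation_scores, effect_scores)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return [i + 1 for i in order]


ranking = rank_positions(mutation_preds, effect_preds)
print(ranking)
```

Under this rule the fourth position comes first and the first and fifth positions come last, in line with the interpretation given above.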

GEMME

To complete our work and make more reliable predictions, we used GEMME (Global Epistatic Model for predicting Mutational Effects), a rapid and innovative tool that predicts the effects of mutations by leveraging the evolutionary history of natural sequences. Unlike other methods, GEMME considers all sequence positions to assess the impact of mutations. Tested on 50 experimental datasets, it demonstrated performance comparable to, or even surpassing, current predictive tools. GEMME can generate mutational landscapes in minutes for various protein families and is available as both a package and a web server [4].

Our study focuses on a chimeric MHETase-PETase fusion protein (886 amino acids), created by linking the C-terminal end of MHETase to the N-terminal end of PETase through flexible glycine-serine linkers. The following sequence of this protein will be used as input for GEMME:

          MQTTVTTMLLASVALAACAGGGSTPLPLPQQQPPQQEPPPPPVPLASRAACEALKDGNGDMV
          WPNAATVVEVAAWRDAAPATASAAALPEHCEVSGAIAKRTGIDGYPYEIKFRLRMPAEWNGR
          FFMEGGSGTNGSLSAATGSIGGGQIASALSRNFATIATDGGHDNAVNDNPDALGTVAFGLDP
          QARLDMGYNSYDQVTQAGKAAVARFYGRAADKSYFIGCSEGGREGMMLSQRFPSHYDGIVAG
          APGYQLPKAGISGAWTTQSLAPAAVGLDAQGVPLINKSFSDADLHLLSQAILGTCDALDGLA
          DGIVDNYRACQAAFDPATAANPANGQALQCVGAKTADCLSPVQVTAIKRAMAGPVNSAGTPL
          YNRWAWDAGMSGLSGTTYNQGWRSWWLGSFNSSANNAQRVSGFSARSWLVDFATPPEPMPMT
          QVAARMMKFDFDIDPLKIWATSGQFTQSSMDWHGATSTDLAAFRDRGGKMILYHGMSDAAFS
          ALDTADYYERLGAAMPGAAGFARLFLVPGMNHCSGGPGTDRFDMLTPLVAWVERGEAPDQIS
          AWSGTPGYFGVAARTRPLCPYPQIARYKGSGDINTEANFACAAPPGGGSGGSGGGSGQTNPY
          ARGPNPTAASLEASAGPFTVRSFTVSRPSGYGAGTVYYPTNAGGTVGAIAIVPGYTARQSSI
          KWWGPRLASHGFVVITIDTNSTLDQPSSRSSQQMAALRQVASLNGTSSSPIYGKVDTARMGV
          MGWSMGGGGSLISAANNPSLKAAAPQAPWDSSTNFSSVTVPTLIFACENDSIAPVNSSALPI
          YDSMSRNAKQFLEINGGSHSCANSGNSNQALIGKKGVAWMKRFMDNDTRYSTFACENPNSTR
          VSDFRTANCSLEHHHHHH

Computational Methods and Limitations

Existing computational methods for predicting the effects of mutations often rely on multiple sequence alignments (MSA) to estimate mutation frequencies. However, these tools typically assess each position independently, whereas the residues within a protein are interdependent, a phenomenon known as epistasis. GEMME addresses this limitation by modeling these interdependencies to improve prediction accuracy. Its approach is based on the evolutionary history of proteins and on overall sequence similarities. By reconstructing phylogenetic trees, GEMME evaluates positional conservation within a sequence. Unlike traditional methods, GEMME accounts for covariation between amino acid positions and assumes that conserved positions are more sensitive to mutations.

Operating Principle of GEMME [4]

GEMME combines two factors to predict mutational effects:

Minimal Evolutionary Adjustment: GEMME evaluates the changes required to accommodate a mutation across the entire sequence, which serves as an indicator to quantify the "evolutionary adjustment" associated with the mutation. This is achieved by examining the distance of natural sequences displaying the mutation in relation to the query sequence within the evolutionary tree. The hypothesis is that the further these sequences are, the more deleterious the mutation is likely to be.
Relative Mutation Frequency: GEMME also considers the frequency of mutations, giving priority to rare mutations when assessing their impact. If a mutation is both rare and distant within the evolutionary tree, it is predicted to be deleterious, while a frequent mutation close to the target sequence will have a low impact.

In cases where there is a divergence between these two factors, GEMME defaults to prioritizing evolutionary adjustment (epistatic term), assigning more weight to site interdependencies.

To compare mutations at different positions, it is postulated that more conserved positions will be more sensitive to mutations than less conserved positions. Therefore, GEMME will re-weight the predicted mutational effects based on degrees of evolutionary conservation: highly deleterious mutations are expected to predominantly occur at highly conserved positions.

Analysis of GEMME Results

To analyze the potential mutations in our fusion protein, GEMME produces a mutational score matrix. This matrix has 20 rows (one per amino acid) and one column per sequence position, so each column holds the predicted effects of the 19 possible substitutions at that position. The wild type at each position is marked as NA, while the other values represent the effects of mutations: the more negative the score, the more deleterious the mutation; the closer the score is to zero, the more neutral the effect.

For instance, to predict the effect of the mutation D330A (where D is the wild type amino acid at position 330), one would refer to column 330 and the row corresponding to the amino acid A. The amino acids are arranged alphabetically in the matrix: ["a", "c", "d", "e", "f", "g", "h", "i", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "y"]. Therefore, in this example, one would look at the score located at column 330, row 1.
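The lookup described above can be sketched as follows. The matrix is represented here as a plain list of 20 rows (one per amino acid, in the order given above) with one entry per position; the toy matrix and its values are hypothetical, and NA is represented by None.

```python
import re

AA_ORDER = ["a", "c", "d", "e", "f", "g", "h", "i", "k", "l",
            "m", "n", "p", "q", "r", "s", "t", "v", "w", "y"]


def parse_mutation(mutation):
    """Split a label such as 'D330A' into (wild type, position, mutant)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)([A-Za-z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation label: {mutation}")
    wild_type, position, mutant = match.groups()
    return wild_type.upper(), int(position), mutant.upper()


def lookup_score(matrix, mutation):
    """Return the score for a mutation; positions in the label are 1-based."""
    _, position, mutant = parse_mutation(mutation)
    row = AA_ORDER.index(mutant.lower())  # 0-based row, 'a' is the first row
    return matrix[row][position - 1]      # 0-based column


# Hypothetical toy matrix: 20 rows x 330 positions, all scores 0.0 by default.
toy_matrix = [[0.0] * 330 for _ in AA_ORDER]
toy_matrix[AA_ORDER.index("d")][329] = None   # wild type D at position 330 (NA)
toy_matrix[AA_ORDER.index("a")][329] = -1.5   # made-up score for D330A
score = lookup_score(toy_matrix, "D330A")
print(score)
```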

For clearer visualization of the scores, GEMME also generates graphical representations in the form of colored matrices (Fig.1). Each mutation is represented by a square: the darker the color, the more impactful the mutation is for the protein.

Figure 1: Score matrix of predicted substitution mutation effects on the MHETase-PETase fusion protein.

Prediction of mutations in MHETase-PETase (fasta file)

In our initial trials with GEMME, we used the full sequence of the fusion protein MHETase-PETase as input. GEMME relies on HMMER to generate a multiple sequence alignment (MSA) by comparing our protein with its homologs. However, this approach has a limitation: the fusion protein MHETase-PETase does not naturally exist, which results in a partially incorrect alignment. Specifically, only the MHETase portion is properly aligned, while the PETase portion is completely gapped (absence of information). As a result, GEMME assigns a score of zero to the entire section corresponding to PETase due to the lack of usable data.

To address this issue, we attempted to generate a new MSA using ColabFold, which is integrated with AlphaFold2. Unfortunately, this tool also failed to identify enough homologous sequences for PETase. As shown in Figure 1, the score matrix is devoid of color beyond position 603, which corresponds to the end of the MHETase sequence. Consequently, we only have mutation effect predictions for the MHETase enzyme, leaving our results incomplete due to the absence of reliable data for PETase.

Prediction of mutations in MHETase

We therefore decided to analyze the two proteins separately to obtain a more comprehensive spectrum of mutations. To achieve this, we first performed a multiple sequence alignment (MSA) of the MHETase protein sequence using ColabFold. This MSA file was then submitted to GEMME for the necessary predictions. The resulting score matrix is shown below:

Figure 2: Score matrix of predicted substitution mutation effects on the MHETase protein.

Positive Effects

To further analyze the data, we employed Python tools, as detailed in the Jupyter Notebook "GEMME_Analysis.ipynb". We began by extracting the relevant data from the results file and reorganizing it into a DataFrame for easier processing.

The main objective of our study was to identify beneficial mutations for the MHETase-PETase fusion protein, so we focused on the positive mutation scores. In total, we observed only five mutations with positive scores, which is relatively low given the number of possible mutations. Among these mutations, the S131G substitution (Serine 131 → Glycine) stands out with a score almost twice as high as the others (~0.56).
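The filtering step can be illustrated with a short sketch. It is not the code of the notebook: to stay dependency-free it uses nested lists instead of a DataFrame, and the two-residue wild-type sequence and all scores are hypothetical.

```python
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"  # row order of the score matrix


def positive_mutations(matrix, wild_type):
    """List (mutation label, score) pairs with a positive score, best first."""
    hits = []
    for col, wt_residue in enumerate(wild_type):
        for row, mutant in enumerate(AA_ORDER):
            score = matrix[row][col]
            if score is not None and score > 0:  # None stands for NA (wild type)
                hits.append((f"{wt_residue}{col + 1}{mutant}", score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical toy data: two positions, mostly deleterious scores.
wild_type = "SG"
toy_matrix = [[-1.0, -2.0] for _ in AA_ORDER]
toy_matrix[AA_ORDER.index("S")][0] = None
toy_matrix[AA_ORDER.index("G")][1] = None
toy_matrix[AA_ORDER.index("G")][0] = 0.56  # made-up positive score for S1G
toy_matrix[AA_ORDER.index("A")][0] = 0.10  # made-up positive score for S1A
top = positive_mutations(toy_matrix, wild_type)
print(top)
```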

To better understand the impact of this mutation, we consulted the structural data available in the Protein Data Bank (PDB) [5]. We found that this residue is spatially very close to the active site of the MHETase enzyme (Figure 3).

Figure 3: Spatial location of Serine 225 (in bright green) and residue S131 (in pink), where a substitution to glycine could be beneficial for the protein.

Hypothesis: Why might the S131G substitution improve the protein's efficiency?

Glycine is one of the smallest amino acids, lacking a bulky side chain, which often enhances flexibility in adjacent regions of the protein fold. In this case, replacing serine (which has a bulkier hydroxyl side chain) with glycine could increase local flexibility around position 131. This enhanced flexibility may facilitate substrate access to the active site or allow for beneficial conformational adjustments that improve enzymatic activity. Additionally, the proximity of S131 to the active site suggests that this mutation could directly affect the enzyme-substrate interaction. The reduction in steric hindrance around the active site might improve the enzyme’s ability to catalyze the reaction by enabling better positioning of the molecules involved in the enzymatic mechanism.

Further analysis through molecular dynamics simulations or crystallographic studies would be valuable to confirm the structural and functional impact of this mutation. A deeper understanding of the effects of this substitution could potentially lead to additional modifications aimed at further optimizing the efficiency of the MHETase enzyme.

Negative Effects

In a second phase of our analysis, we focused on identifying the most deleterious mutations in MHETase. The score matrix in Figure 2 shows black vertical bars in certain regions of the protein sequence, indicating that any variation of the residues at these positions, regardless of the specific substitution, is predicted to be highly detrimental to the protein’s function. These particularly sensitive regions include residues 115–120, 160–170, 220–230, 485–495, and 520–530, as well as position 320.
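Such positions can also be located automatically rather than by eye. The sketch below flags a position when even its least harmful substitution scores below a cutoff; the cutoff of -5.0 and the toy matrix are assumptions chosen for illustration only.

```python
def sensitive_positions(matrix, threshold):
    """Return 1-based positions where every substitution scores below threshold.

    matrix: 20 rows (amino acids) x N columns (positions); None marks the wild type.
    """
    flagged = []
    for col in range(len(matrix[0])):
        scores = [row[col] for row in matrix if row[col] is not None]
        # max(scores) is the least deleterious substitution at this position.
        if scores and max(scores) < threshold:
            flagged.append(col + 1)
    return flagged


# Hypothetical toy matrix with three positions.
toy_matrix = [[-6.0, -6.0, -7.0] for _ in range(20)]
toy_matrix[0][0] = None   # wild-type entry at position 1
toy_matrix[3][1] = -0.5   # one tolerated substitution at position 2
sensitive = sensitive_positions(toy_matrix, threshold=-5.0)
print(sensitive)
```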

As expected, the most negative mutation scores correspond to key positions in the active site, such as Ser225, Asp492, and His528—all crucial for catalytic activity. Additionally, neighboring residues like Gly227 and Gly228 also exhibit highly deleterious mutation effects, further emphasizing the importance of maintaining structural integrity around the active site.

Interestingly, several other deleterious mutations are located outside the immediate sequence of the active site, yet exhibit spatial proximity to it in the three-dimensional structure of the protein, as observed through structural data from the Protein Data Bank (PDB). However, other mutations, such as the one at position C320, are spatially distant from the active site but result in some of the most detrimental effects on the protein (Figure 4). These findings suggest that mutations in these regions may indirectly disrupt the catalytic function by destabilizing the overall protein folding or by altering key electrostatic and steric interactions critical for the enzyme's activity.

Figure 4: Spatial location of the MHETase active site (in bright green) and residue C320 (in pink).

Prediction of mutations in PETase

We applied the same analysis procedure for PETase:

Figure 5: Score matrix of predicted substitution mutation effects on the PETase protein.

Positive Effects

The GEMME analysis of PETase reveals a total of 15 mutations with positive scores, three times as many as observed for MHETase, although this still represents a low rate of beneficial mutations overall. Among these, the substitutions Q156M (Glutamine 156 → Methionine) and S188H (Serine 188 → Histidine) stand out with the highest mutation scores (0.56 and 0.47, respectively).

To better understand the impact of these mutations, we consulted structural data available in the Protein Data Bank (PDB) [6]. Similar to the beneficial mutations identified in MHETase, these advantageous mutations in PETase are located in regions spatially close to the enzyme’s active site (Figure 6).

Figure 6: (a) [Left] Spatial location of the active site (in neon green) and the Glutamine residue (in pink). (b) [Right] Spatial location of the active site (in neon green) and the helical secondary structure containing the Serine residue (in pink).

Hypothesis: Why might these substitutions improve PETase efficiency?

The substitution of glutamine (Q) to methionine (M) at position 156 introduces a hydrophobic amino acid into a potentially flexible region near the active site. Methionine, with its larger and more hydrophobic side chain, may improve interactions with the surrounding substrate or stabilize the active site region, thereby enhancing the enzyme’s catalytic efficiency.

For the S188H substitution, this residue is positioned close to a known mutagenesis site with a high reported functional impact. Substituting serine (a small, polar residue) with histidine, which can participate in catalytic mechanisms and stabilize interactions, may increase the enzyme's activity by optimizing the local charge distribution or facilitating substrate binding. Given the proximity of S188 to this critical site, the introduction of histidine could enhance catalytic efficiency by contributing to the overall stability and flexibility of the active site.

Negative Effects

Similar to MHETase, the residues of PETase that are most sensitive to variation are located at or near the active site (S134, D180, and H211). For instance, the most deleterious mutation predicted by GEMME occurs at position G132, where any substitution, regardless of the substituted amino acid, is predicted to drastically reduce enzymatic activity, with a score of -8.34. This position lies close to the catalytic residue S134, reinforcing the idea that even slight changes in this region can severely disrupt the protein's function (Figure 7).

Figure 7: Spatial location of the active site of PETase (fluorescent green) and the β-sheet secondary structure containing residue G132 (pink).

The G132 residue is part of a β-sheet that constitutes the core of the protein, and its substitution could destabilize the three-dimensional architecture of the enzyme, thereby disrupting the positioning of the residues at the active site. Glycine, being the smallest amino acid, often provides the necessary flexibility in critical structural regions. Its substitution with any other bulkier amino acid may introduce steric hindrance or a loss of flexibility, which would account for the significant reduction in enzymatic efficiency.

GEMME results also indicate that most of the predicted deleterious mutations are located in regions already annotated as having significant impacts, according to mutagenesis data available on the PDB site. These regions include essential residues for the proper folding of the protein or for critical interactions with the substrate, thereby underscoring the importance of local structural elements in the stability and enzymatic activity.

Prediction of mutations in MHETase-PETase (concatenated)

However, studying the two proteins separately might give an incomplete picture of the mutation effects on our fusion protein, MHETase-PETase, which connects the two enzymes via flexible linkers composed of glycine and serine. To better capture the interactions between the two enzymatic domains, we concatenated the aligned sequences of the two proteins row by row (MSA1[i].seq + MSA2[i].seq), thereby creating a multiple sequence alignment (MSA) for the chimeric protein. Since PETase has fewer homologs than MHETase, we limited the analysis to the first 5084 sequences from the MHETase alignment to maintain a comparable number of homologous sequences between the two enzymes. While this reduces sequence coverage for MHETase, a sample of roughly 5000 sequences should still be rich enough for a meaningful analysis.
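The row-wise concatenation can be sketched as follows. Each alignment is assumed to be already loaded as a list of (header, aligned sequence) pairs with the query in the first row; the function name, the header format, and the toy alignments are hypothetical.

```python
def concatenate_msas(msa1, msa2, max_sequences=None):
    """Pair row i of msa1 with row i of msa2 and join the aligned sequences.

    The result is truncated to the shorter alignment, and optionally
    to max_sequences rows.
    """
    n_rows = min(len(msa1), len(msa2))
    if max_sequences is not None:
        n_rows = min(n_rows, max_sequences)
    combined = []
    for i in range(n_rows):
        header1, seq1 = msa1[i]
        header2, seq2 = msa2[i]
        combined.append((f"{header1}|{header2}", seq1 + seq2))
    return combined


# Hypothetical toy alignments ('-' denotes a gap).
msa_mhetase = [("query_MHETase", "MQT-"), ("hom1", "MQTA"), ("hom2", "M-TA")]
msa_petase = [("query_PETase", "QTN"), ("homA", "Q-N")]
fusion_msa = concatenate_msas(msa_mhetase, msa_petase)
print(fusion_msa)
```

Pairing rows purely by index means that paired homologs do not necessarily come from the same organism, which is one limitation of this shortcut.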

Below is the mutation score matrix we obtained for the MHETase-PETase protein:

Figure 8: Score matrix of predicted substitution mutation effects on the MHETase-PETase fusion protein.

Positive Effects

The GEMME analysis revealed only three mutations with positive scores, fewer than in the predictions made for MHETase or PETase alone. Furthermore, although these scores are positive, their values remain low, in the range of 10⁻³ to 10⁻². Among these mutations, the S131G substitution was already identified in the analysis of MHETase alone. The two other mutations, M192F and M192Y, affect residue 192, which is also part of the MHETase domain of the fusion protein. GEMME therefore did not detect any beneficial mutation in the PETase domain of the fusion protein, whereas the analysis of PETase alone identified 15 positive scores with higher values. Moreover, even the beneficial mutations identified in the MHETase region have such low scores that they are unlikely to have a significant impact on the activity of the fusion protein; any beneficial effect would probably be minimal.

Negative Effects

GEMME primarily identified deleterious mutations in areas near the active sites of both enzymes, particularly at key positions such as S225, D492, H528, and S748. Interestingly, the mutation scores in the MHETase domain are generally more negative than those observed in the PETase domain. This suggests that variations in the MHETase region have a more severe impact on the stability and enzymatic activity of the fusion protein compared to mutations affecting PETase. This result aligns with previous analyses, which already indicated greater sensitivity to mutations in MHETase. This increased sensitivity could be explained by a stronger structural and functional dependence on critical residues in MHETase, which may play a more central role in maintaining the optimal conformation required for catalysis. The reduced flexibility of this domain could thus account for the more significant functional disruptions caused by variations.

Discussion

The analysis of mutation effects on the MHETase-PETase protein using the GEMME tool has provided insights into the mechanisms underlying the enzymatic activity and stability of this protein. By modeling epistasis and considering the evolutionary interdependencies among amino acid residues, GEMME predicted mutational effects across this bifunctional enzyme used for the degradation of PET and MHET; these predictions remain to be validated experimentally.

Our study identified several intriguing mutations, notably S131G, which exhibits a positive score and may enhance catalytic efficiency. In particular, mutations located near the active sites could stabilize electrostatic interactions or facilitate hydrogen bonding, thereby increasing enzymatic efficiency. These findings open promising avenues for protein engineering aimed at optimizing plastic degradation capabilities.

Conversely, we also identified positions that are highly sensitive to variation, where any substitution is predicted to have a significant deleterious effect. This can be attributed to their critical roles in enzymatic catalysis (active site). Mapping these sensitive regions onto the three-dimensional structure of the protein supports the view that these residues play a crucial role in maintaining the functional architecture.

The mutation analysis of the chimeric PETase-MHETase protein, based on the concatenation of the two sequences, revealed distinct results compared to the predictions made on the individual proteins. Although a few beneficial mutations were identified, the overall positive mutation scores were low and few in number, with only three mutations detected in the fusion protein analysis, compared to a larger number in the analyses of the individual proteins.

Fusing MHETase and PETase via flexible glycine-serine linkers can disrupt the overall three-dimensional structure of the protein. The fusion of two distinct protein domains may lead to non-native interactions between them, affecting the folding and stability of the protein. This could limit the efficacy of mutations that would be beneficial in a non-fused structure but lose their effect in a protein with an altered architecture. Additionally, the increased flexibility introduced by the glycine-serine segments may allow the fusion protein to adopt alternative or more dynamic conformations, potentially impacting the prediction of mutation effects. In a fused protein, both enzymatic domains must coexist in the same constrained space, which could lead to interference between them. Beneficial mutations in one domain (such as MHETase) may be less effective if nearby residues in the other domain (PETase) disturb the local environment, a situation that does not arise in the analysis of the individual proteins.

Perspectives

These results pave the way for several avenues of research and development. First, it would be essential to investigate the effects of positive mutations in in vitro and in vivo experiments to confirm their potential to improve enzymatic activity. The study of crystallographic structures of mutated variants could also shed additional light on the atomic mechanisms driving the observed functional improvements. Moreover, rational engineering could be applied to combine multiple beneficial mutations into a single protein variant, thereby maximizing PET degradation efficiency. This approach could lead to the design of more robust and active enzymes for plastic recycling under industrial conditions.

AlphaFold2

We predicted the 3D structure of the MHETase-PETase fusion protein using AlphaFold2 (see the figure below):

Figure: Prediction of the 3D Structure of the MHETase-PETase fusion protein using AlphaFold2

AlphaFold2 is a state-of-the-art deep learning model designed for protein structure prediction. It provides highly accurate predictions of protein folding and structure based on amino acid sequences.

To initiate the prediction, we input the complete amino acid sequence of the MHETase-PETase fusion protein into AlphaFold2. The model processed this input by analyzing evolutionary data and integrating the physical and chemical principles governing protein structure. The output consisted of several predicted structures, each accompanied by confidence scores, including the pLDDT score (predicted local distance difference test), which indicates the per-residue reliability of the predicted model. Here, we observe that the most reliably predicted regions, represented in blue, closely correspond to the 3D structures of the MHETase and PETase proteins. However, it is important to highlight that certain segments, likely the flexible linkers fusing the domains of MHETase and PETase, exhibit relatively low structural prediction reliability.
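Low-confidence segments can be listed directly from the predicted model, since AlphaFold2 writes the per-residue confidence score (pLDDT) into the B-factor column of its PDB output. The sketch below reads that column from C-alpha records; the cutoff of 70 is an assumption based on a commonly used rule of thumb, and the file name in the usage note is hypothetical.

```python
def residue_plddt(pdb_lines):
    """Return {residue number: pLDDT} read from C-alpha ATOM records.

    Uses the fixed-column PDB layout: atom name in columns 13-16,
    residue number in columns 23-26, B-factor (pLDDT) in columns 61-66.
    """
    scores = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_residues(scores, cutoff=70.0):
    """Return sorted residue numbers whose pLDDT is below the cutoff."""
    return sorted(res for res, value in scores.items() if value < cutoff)
```

A typical use would be to open the model file (for example a hypothetical "fusion_model.pdb"), pass its lines to residue_plddt, and check whether the flagged residues fall within the linker region.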

Special thanks

We would like to thank Marina Abakarova (PhD Student in the Laboratory of Computational and Quantitative Biology), who helped us to fully understand the GEMME tool for predicting the effects of mutations on our fusion protein. She also suggested several mutation prediction tools, which were a great help in completing our work.

Bibliography

[1] Knott, B. C., Erickson, E., Allen, M. D., Gado, J. E., Graham, R., Kearns, F. L., et al. (2020). Characterization and engineering of a two-enzyme system for plastics depolymerization. Proc. Natl. Acad. Sci. 117, 25476–25485. doi: https://doi.org/10.1073/pnas.2006753117

[2] Zhang J, Zhu M, Qian Y. protein2vec: Predicting Protein-Protein Interactions Based on LSTM. IEEE/ACM Trans Comput Biol Bioinform. 2022 May-Jun;19(3):1257-1266. doi: 10.1109/TCBB.2020.3003941. Epub 2022 Jun 3. PMID: 32750870.

[3] Biswas, S., Khimulya, G., Alley, E.C. et al. Low-N protein engineering with data-efficient deep learning. Nat Methods 18, 389–396 (2021). https://doi.org/10.1038/s41592-021-01100-y

[4] Laine, E.; Karami, Y.; Carbone, A. GEMME: A Simple and Fast Global Epistatic Model Predicting Mutational Effects. Mol. Biol. Evol. 2019, 36 (11), 2604–2619. doi: https://doi.org/10.1093/molbev/msz179

[5] RCSB Protein Data Bank. Structure of MHETase from Ideonella sakaiensis. https://www.rcsb.org/3d-sequence/6QZ2?assemblyId=1

[6] RCSB Protein Data Bank. Crystal Structure of Ideonella sakaiensis PET Hydrolase. https://www.rcsb.org/3d-sequence/6ANE?assemblyId=1