Contribution | Sorbonne-U-Paris

Contributions

In our quest to advance the field of protein engineering, we developed a Long Short-Term Memory (LSTM) model designed specifically to predict the effects of mutations in the PETase-MHETase fusion protein derived from Ideonella sakaiensis. This endeavor was motivated by the pressing need to enhance our understanding of how genetic variations influence protein functionality, especially in the context of enzymatic degradation of synthetic polymers like polyethylene terephthalate (PET).

Mutation prediction model

Development of the Model & Key features

Our model uses deep learning to relate amino acid sequences to the effects of their mutations. It combines the following components:

Data preparation: we meticulously curated a dataset from various sources, transforming wild-type and mutated protein sequences into numerical representations that can be processed by neural networks.

Sequence encoding: each amino acid is assigned a unique numerical index, so that a protein sequence becomes an ordered list of integers from which the model can learn sequential dependencies. This step translates biological sequences into a format suitable for computational analysis. Because the encoding does not depend on any particular protein, the model can later be adapted to other proteins.

LSTM Neural network: the model’s architecture includes an embedding layer and recurrent LSTM layers, enabling it to learn complex features and contextual relationships within protein sequences. This design gives us the opportunity to refine and expand the model for diverse applications, including other enzymes and proteins of interest.

Binary classification output: our dual-output system predicts both potential mutations and their effects, providing a comprehensive view of how changes in the protein sequence may influence function. This capability could be beneficial for others seeking to identify advantageous mutations for protein optimization.
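As an illustration of the sequence encoding step described above, here is a minimal sketch in plain Python. It is not our exact implementation: the index assignment, the use of 0 as a padding value, the mutation notation (wild-type residue, 1-based position, mutant residue) and the toy sequence are all assumptions made for the example.

```python
# Illustrative sketch only: the alphabet order, padding scheme and toy
# sequence below are assumptions, not the values used in the actual model.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD_INDEX = 0  # reserved for padding, so residue indices start at 1
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_length):
    """Map each residue to its integer index and right-pad with PAD_INDEX."""
    if len(sequence) > max_length:
        raise ValueError("sequence is longer than max_length")
    indices = [AA_TO_INDEX[aa] for aa in sequence.upper()]
    return indices + [PAD_INDEX] * (max_length - len(indices))


def apply_mutation(sequence, mutation):
    """Apply a point mutation such as 'K2A' (wild type, 1-based position, mutant)."""
    wild_type, position, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
    if sequence[position - 1] != wild_type:
        raise ValueError(f"expected {wild_type} at position {position}")
    return sequence[:position - 1] + mutant + sequence[position:]


wild_type_seq = "MKTAY"  # toy sequence, not the real fusion protein
mutant_seq = apply_mutation(wild_type_seq, "K2A")
encoded_wild_type = encode_sequence(wild_type_seq, max_length=8)
encoded_mutant = encode_sequence(mutant_seq, max_length=8)
```

Reserving index 0 for padding is a common choice when the encoded sequences feed an embedding layer, since sequences of different lengths can then be batched together. Other schemes would work equally well.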

Implications for future projects

Data expansion & Model refinement: our model produces preliminary results, but we recognize that larger datasets are needed to improve prediction accuracy. Future teams are encouraged to expand the dataset by incorporating sequences homologous to PETase and MHETase, thereby increasing the model's reliability.

Integration of novel techniques: recent advancements in protein design methodologies hold great promise for enhancing mutation prediction capabilities, even with limited data. Future research could benefit from integrating these innovative approaches into our existing model.

By providing a detailed account of our contributions and methodologies [link to the Model Page], we hope to inspire and facilitate future research initiatives focused on enzyme engineering and protein design. We believe that this model, with its inherent flexibility and potential for refinement, will be a tool for those wishing to harness the power of computational predictions in the search for sustainable solutions to the degradation of synthetic polymers.

The GEMME tool

As part of our project, we developed a set of scripts and methodologies to analyze the results generated by GEMME, a tool for predicting the effects of mutations on proteins. This initiative aims to make GEMME easier to use for future research teams by providing clear and adaptable analysis tools.

Objectives and motivation

One of our main objectives was to make the results provided by GEMME accessible and understandable. Although GEMME offers valuable predictions, processing and analyzing its output is not always straightforward. We wrote analysis code that allows users to visualize mutation scores, interpret them, and draw conclusions about the potential impact on enzymatic activity. This approach aims to shorten the learning curve for new teams and help them use the tool efficiently.

Clarity & structure: the scripts are organized logically, so they can be run step by step. Each section of the code is accompanied by explanatory comments that describe the corresponding step of the analysis.

Flexibility & adaptability: the code is designed to be easily modified by users. Future teams can tailor the analyses to their specific needs, adding new analytical methods or modifying the criteria for score interpretation.
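To give an idea of the kind of analysis such code can perform, here is a minimal sketch in plain Python. It is not our actual scripts. It assumes that the scores have already been loaded into a mapping from each substituted amino acid to a list of per-position scores, and that a higher score means a better tolerated substitution. Both points are assumptions to check against the GEMME documentation and output files. The function names and the toy values are hypothetical.

```python
# Illustrative sketch only: the data layout, score orientation and toy values
# are assumptions; they are not taken from real GEMME output.

def rank_mutations(wild_type, scores, top_n=3):
    """Return the top_n substitutions, sorted from highest to lowest score."""
    ranked = []
    for mutant_aa, per_position in scores.items():
        for position, score in enumerate(per_position, start=1):
            wild_type_aa = wild_type[position - 1]
            # Skip missing values and "mutations" to the wild-type residue.
            if score is None or mutant_aa == wild_type_aa:
                continue
            ranked.append((f"{wild_type_aa}{position}{mutant_aa}", score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def mean_score_per_position(wild_type, scores):
    """Average the available scores at each position (a simple sensitivity profile)."""
    means = []
    for index, wild_type_aa in enumerate(wild_type):
        values = [
            per_position[index]
            for mutant_aa, per_position in scores.items()
            if mutant_aa != wild_type_aa and per_position[index] is not None
        ]
        means.append(sum(values) / len(values) if values else None)
    return means


toy_wild_type = "MKT"  # toy sequence, not the real fusion protein
toy_scores = {
    "A": [-2.0, -0.5, -1.0],
    "K": [-3.0, None, -0.1],
    "T": [-1.5, -4.0, None],
}
best = rank_mutations(toy_wild_type, toy_scores, top_n=3)
profile = mean_score_per_position(toy_wild_type, toy_scores)
```

Under these assumptions, `best` lists the substitutions predicted to be best tolerated, and the lowest values in `profile` point to the positions most sensitive to mutation. If the scores are oriented the other way, only the sort order needs to change.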

By providing a robust and adaptable analytical framework, our work contributes to the creation of a collaborative research environment. Teams that build upon our work will benefit from our code and can also use it as a starting point to develop even more sophisticated analyses. This could help them explore the effects of mutations more thoroughly, identify potential targets for protein research, and optimize sequence modifications to enhance enzymatic activity.