Model

An explanation of the model we made for the project!


Best Model

Novel PFAS-degrading Enzymes

We designed a series of novel potential PFAS-degrading enzymes, drawing on the key features of natural dehalogenases. We fine-tuned a generative protein language model (esm2-t33-650M-UR50D) on 63 enzymes sharing less than 60% sequence identity, which we discovered through Foldseek as discussed on our Project Description page, and then generated novel enzymes through an iterative unmasking process. The enzymes generated with this design strategy preserved several critical structural and functional characteristics essential for dehalogenation, predicted to enable breakdown of per- and polyfluoroalkyl substances (PFAS). Each enzyme includes an active-site pocket for binding PFAS molecules, facilitating contact between the enzyme and the PFAS carbon-fluorine bonds. By preserving these key features while optimizing the enzyme's specificity for PFAS, this design holds promise for effective bioremediation strategies targeting persistent environmental pollutants.

Generation of Novel Enzymes: pLLM ESM2 Produces Structurally Similar but Sequentially Diverse Proteins with Cofactor Binding and Active Sites Intact

Enzyme Generation Process

The group completed an initial generation run using the protein language model (pLLM) ESM2, with our candidate enzymes as training data.

Graph 1

This enzyme generation process included a step where the training data was made more sequentially diverse by removing any sequence sharing greater than 60% identity with another.

Graph 2

Five of the 68 training sequences were removed, leaving 63. We then masked our training data using the beta linear function (cite source).
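The redundancy filter described above can be sketched as a greedy pass over the sequences. This is a minimal illustration, not the team's pipeline: `difflib`'s ratio is used as a rough stand-in for alignment-based percent identity, and the toy sequences are placeholders.

```python
from difflib import SequenceMatcher

def percent_identity(a: str, b: str) -> float:
    """Rough percent identity between two sequences (difflib ratio
    as a proxy for a true pairwise alignment)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def filter_redundant(seqs, threshold=60.0):
    """Greedily keep each sequence only if its identity to every
    already-kept sequence is at or below the threshold."""
    kept = []
    for s in seqs:
        if all(percent_identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept

toy = ["MKVLITGA", "MKVLITGS", "GGGGPPPP"]  # 2nd is ~88% identical to 1st
print(len(filter_redundant(toy)))  # 2 sequences survive the 60% cutoff
```

A production filter would typically use a clustering tool or pairwise alignments instead of `difflib`, but the keep-or-drop logic is the same.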

This masking strategy was chosen because we generate novel proteins with ESM2 by gradually unmasking amino acids.

Graph 3

To effectively train a model that generates this way, we must train it on data masked at different frequencies and positions. This strategy ensured that, during generation, we produced sequences distinct from the training data.
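Varying the masking frequency and positions can be sketched as below. This is an assumption-laden simplification: each example's masking rate is drawn uniformly as a stand-in for the beta linear schedule the text cites, and `mask_sequence` is a hypothetical helper.

```python
import random

MASK = "<mask>"

def mask_sequence(seq: str, rate: float, rng: random.Random):
    """Replace each residue with the mask token with probability
    `rate`, returning the masked tokens and the masked positions."""
    tokens, positions = [], []
    for i, aa in enumerate(seq):
        if rng.random() < rate:
            tokens.append(MASK)
            positions.append(i)
        else:
            tokens.append(aa)
    return tokens, positions

rng = random.Random(0)
# Draw a fresh masking rate per training example so the model sees
# masks at many different frequencies and positions.
rate = rng.uniform(0.0, 1.0)
tokens, positions = mask_sequence("MKVLITGAGQ", rate, rng)
```

The loss during fine-tuning is then computed only over the masked positions, which is what teaches the model to fill in residues it cannot see.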

Graph 4

Using this training method, the model was trained for one epoch, generating a sequence after every 250 training examples.

Graph 5 Graph 6

Each sequence generated by this model is set to the length of one of the training-data sequences. The produced sequences were saved and then reviewed computationally to determine whether they retained key catalytic features characteristic of reductive dehalogenases.
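The iterative unmasking procedure itself can be sketched independently of the model: start from a fully masked sequence at a chosen training-data length, then repeatedly commit the most confident prediction until no masks remain. The `predict` function below is a random stand-in for the real ESM2 forward pass, so this only illustrates the control flow.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"

def generate(length: int, predict):
    """Iteratively unmask: ask `predict` for (position, residue,
    confidence) proposals over the masked positions and commit the
    most confident one, until no masks remain."""
    tokens = [MASK] * length
    while MASK in tokens:
        proposals = predict(tokens)  # one proposal per masked position
        pos, aa, _conf = max(proposals, key=lambda p: p[2])
        tokens[pos] = aa
    return "".join(tokens)

def random_predict_factory(rng):
    """Hypothetical stand-in for an ESM2 forward pass."""
    def predict(tokens):
        return [
            (i, rng.choice(AMINO_ACIDS), rng.random())
            for i, t in enumerate(tokens)
            if t == MASK
        ]
    return predict

seq = generate(12, random_predict_factory(random.Random(0)))
print(len(seq))  # 12 - generated at the chosen training-sequence length
```

With the real model, `predict` would come from the masked-language-model head's logits, so each committed residue is conditioned on everything unmasked so far.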

Graph 9

After training, the model was prompted to produce sequences, and these were also computationally analyzed.

Graph 8

This generation method produced multiple protein sequences that preserved the necessary cofactor binding sites and active sites of the corrinoid iron-sulfur-containing reductive dehalogenases that comprise the training data. Despite this structural similarity, all of the generated enzymes had very low sequence similarity to the training-data enzymes, suggesting that they are novel.

Weights for the fine-tuned dehalogenase-generating language model and associated code can be found at https://doi.org/10.5281/zenodo.13879527. The original notebook for training the model can be found at https://gitlab.igem.org/2024/software-tools/iea/-/blob/main/iGEM_ESMDesign.ipynb

Selection of Promising Novel Enzymes: Three Generated Sequences Chosen for Expression in MyTXTL, with DNA Ordered for Preparation

Of the sequences produced, the three most promising generated enzymes in terms of structure and novelty were chosen for expression in MyTXTL. The DNA coding for these novel enzymes was ordered and will be prepared for expression.


Percent Sequence Identity Matrix of Enzymes T7RdhA, AI10, AI1, and AI4.
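A matrix like the one above can be computed with a simple all-vs-all loop. As before, `difflib`'s ratio is only a rough proxy for alignment-based percent identity, and the sequences below are placeholders, not the actual T7RdhA or AI designs.

```python
from difflib import SequenceMatcher

def percent_identity(a: str, b: str) -> float:
    """Rough percent identity (difflib ratio as an alignment proxy)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def identity_matrix(named_seqs):
    """All-vs-all percent identity for a dict of {name: sequence}."""
    names = list(named_seqs)
    return {
        (x, y): round(percent_identity(named_seqs[x], named_seqs[y]), 1)
        for x in names
        for y in names
    }

toy = {"T7RdhA": "MKVLITGAGQ", "AI1": "GGSSPPTTAA"}  # placeholder sequences
matrix = identity_matrix(toy)
print(matrix[("T7RdhA", "T7RdhA")])  # 100.0 on the diagonal
```

Low off-diagonal values in such a matrix are what support the novelty claim: each generated enzyme shares little sequence identity with the parent and with each other.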