We completed an initial generation run using the protein language model (pLM) ESM2, with our candidate enzymes as training data.
As part of this generation pipeline, we first increased the sequence diversity of the training data by removing any sequence that shared greater than 60% identity with another sequence in the set.
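In practice this kind of redundancy filtering is usually done with a dedicated clustering tool such as CD-HIT or MMseqs2; the sketch below only illustrates the idea with a greedy pure-Python filter, assuming Biopython's pairwise aligner with default scoring. The function names and alignment parameters are illustrative, not the pipeline actually used here.

```python
from Bio import Align  # Biopython: pip install biopython

def percent_identity(a: str, b: str) -> float:
    """Global-alignment identity between two sequences, as a percentage.
    Uses Biopython's default scoring for brevity; a real pipeline would
    tune substitution and gap scores or use a clustering tool instead."""
    aligner = Align.PairwiseAligner(mode="global")
    aln = aligner.align(a, b)[0]
    matches = sum(x == y for x, y in zip(aln[0], aln[1]))
    return 100.0 * matches / max(len(a), len(b))

def greedy_identity_filter(seqs: list[str], cutoff: float = 60.0) -> list[str]:
    """Keep each sequence only if it is < cutoff % identical to every
    sequence already kept (greedy, order-dependent)."""
    kept: list[str] = []
    for seq in seqs:
        if all(percent_identity(seq, k) < cutoff for k in kept):
            kept.append(seq)
    return kept
```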
Five of the 68 training sequences were removed, leaving 63. We then masked the training data using the beta-linear masking schedule (cite source).
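A minimal sketch of such a masking step, assuming the beta-linear schedule mixes a Beta-distributed mask rate with a uniform ("linear") one; the mixture weight and Beta parameters below are illustrative placeholders, not the values from the cited source.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = "<mask>"  # ESM2's mask token string

def sample_mask_rate(beta_a=3.0, beta_b=9.0, beta_weight=0.8) -> float:
    """Beta-linear schedule: with probability beta_weight draw the mask
    rate from Beta(beta_a, beta_b), otherwise draw it uniformly on [0, 1).
    All three parameters are illustrative placeholders."""
    if rng.random() < beta_weight:
        return rng.beta(beta_a, beta_b)
    return rng.random()

def mask_sequence(seq: str) -> tuple[list[str], list[int]]:
    """Mask a random subset of positions at a rate drawn from the schedule."""
    rate = sample_mask_rate()
    n_mask = max(1, int(rate * len(seq)))
    idx = rng.choice(len(seq), size=n_mask, replace=False)
    tokens = list(seq)
    for i in idx:
        tokens[i] = MASK
    return tokens, sorted(idx)
```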
We chose this masking strategy because ESM2 generates novel proteins here by gradually unmasking amino acid positions. To train a model to generate in this way, it must see training data masked at varying rates and positions; this also helped ensure that the sequences produced during generation were distinct from the training data.
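A minimal sketch of this iterative-unmasking generation loop, using the Hugging Face `EsmForMaskedLM` interface to a small public ESM2 checkpoint. The checkpoint name, the `per_step` schedule, and the greedy (argmax) commitment rule are illustrative choices, not necessarily those used in our run; a sampling rule could be substituted for argmax to get varied outputs.

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

NAME = "facebook/esm2_t12_35M_UR50D"  # a small public ESM2 checkpoint
tok = AutoTokenizer.from_pretrained(NAME)
model = EsmForMaskedLM.from_pretrained(NAME).eval()

@torch.no_grad()
def generate(length: int, per_step: int = 5) -> str:
    """Start from an all-mask sequence; at each step, commit the per_step
    highest-confidence predictions among still-masked positions, repeating
    until no masked positions remain."""
    ids = torch.full((1, length + 2), tok.mask_token_id)
    ids[0, 0] = tok.cls_token_id
    ids[0, -1] = tok.eos_token_id
    while (ids == tok.mask_token_id).any():
        logits = model(input_ids=ids).logits[0]
        logits[:, tok.all_special_ids] = float("-inf")  # never emit special tokens
        probs = logits.softmax(-1)
        masked = (ids[0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
        conf, best = probs[masked].max(-1)
        top = conf.argsort(descending=True)[:per_step]
        ids[0, masked[top]] = best[top]
    return tok.decode(ids[0], skip_special_tokens=True).replace(" ", "")
```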
With this procedure the model was trained for a single epoch, generating a sequence after every 250 training examples.
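A sketch of how the training loop and the periodic generation checkpoint could fit together, reusing `sample_mask_rate` and `generate` from the sketches above. Here `train_seqs` stands for the stream of training examples, and the optimizer, learning rate, and batch size of one are illustrative assumptions rather than the settings actually used.

```python
import random
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative settings

model.train()
for step, seq in enumerate(train_seqs, start=1):      # a single pass = one epoch
    enc = tok(seq, return_tensors="pt")
    inputs, labels = enc.input_ids.clone(), enc.input_ids.clone()

    # Mask interior positions at a rate drawn from the beta-linear schedule.
    rate = sample_mask_rate()
    interior = torch.arange(1, inputs.shape[1] - 1)   # skip <cls> and <eos>
    picked = interior[torch.rand(interior.shape) < rate]
    if picked.numel() == 0:
        continue
    inputs[0, picked] = tok.mask_token_id
    labels[inputs != tok.mask_token_id] = -100        # score only masked positions

    loss = model(input_ids=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 250 == 0:                               # periodic generation checkpoint
        model.eval()
        print(generate(len(random.choice(train_seqs))))  # length copied from training data
        model.train()
```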
Each generated sequence was fixed to the length of one of the training sequences. The generated sequences were saved and reviewed computationally to determine whether they retained key catalytic features characteristic of reductive dehalogenases.
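As one illustration of such a computational review, the sketch below scans generated sequences for two features reported for reductive dehalogenases: a twin-arginine (Tat) signal motif and a ferredoxin-type iron-sulfur cluster binding motif. The regular expressions are rough approximations, not validated RdhA signatures; a real screen would more likely use profile HMMs (e.g., hmmscan against Pfam).

```python
import re

# Illustrative patterns only: approximate Tat signal and 4Fe-4S binding motifs.
MOTIFS = {
    "tat_signal": re.compile(r"RR.FLK"),
    "fe_s_cluster": re.compile(r"C..C..C...C"),
}

def screen(seq: str) -> dict[str, bool]:
    """Report which of the expected motifs a generated sequence contains."""
    return {name: bool(pat.search(seq)) for name, pat in MOTIFS.items()}

demo = "MSRRDFLKGGAALGCADCKLCSNVCPAAK"  # hypothetical fragment
print(screen(demo))  # {'tat_signal': True, 'fe_s_cluster': True}
```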
After training, the model was again prompted to produce sequences, and these were analyzed computationally in the same way.