Software

Motivation to Build An Expression Classifier

The most common impediment to producing a protein is failure to express it. An AI-generated protein is even less likely to express, because its sequence is, by definition, unnatural and has not been refined by evolution to express well in a given expression system. As AI becomes an increasingly popular tool within bioengineering, AI protein generation will be increasingly limited by the expressibility of its designs. An advantage of AI design, however, is that models can be trained to output more “expressible” proteins, which is where an expression classifier comes in. The problem is that there is not enough data available to train such a classifier: predicting the expression of AI-generated proteins requires expression data from AI-generated proteins, and collecting that data at scale is time-consuming and often expensive.

Our team addressed this with a new mass spectrometry assay that collected expressibility data for 8,000 proteins. The proteins were split into 4 libraries of 2,000 proteins each: 3 of the libraries were generated by 3 different large language models (list them), while the 4th was randomly generated. The proteins were designed specifically to be identifiable by mass spectrometry through tryptic digestion: trypsin cleaves at arginine (R) and lysine (K) residues, and the resulting peptides serve as unique identifiers for each protein. We designed each sequence to yield 3 or more unique identifiers after cleavage.
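The design rule above (3 or more unique identifiers per sequence after cleavage) can be checked in silico. The following is a minimal sketch, not our actual design pipeline: it assumes a simplified cleavage rule (cut after every K or R, no missed cleavages, no proline exception), and the function names, the `min_length` filter, and the toy sequences are all hypothetical.

```python
from collections import Counter


def tryptic_peptides(sequence, min_length=6):
    """Cut after every K or R (simplified rule) and drop very short peptides.

    min_length is a hypothetical filter for peptides too short to be
    informative; it is not a value taken from the assay.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return [p for p in peptides if len(p) >= min_length]


def unique_identifier_counts(library, min_length=6):
    """For each protein, count peptides that occur in no other protein."""
    digests = {name: set(tryptic_peptides(seq, min_length))
               for name, seq in library.items()}
    owners = Counter(p for peps in digests.values() for p in peps)
    return {name: sum(1 for p in peps if owners[p] == 1)
            for name, peps in digests.items()}


# Toy sequences for illustration only (not from our libraries).
library = {
    "design_1": "MAAAGLVKTTTSSEWRGPLLGGDDKQQ",
    "design_2": "MAAAGLVKHHHYYFFWRGPLLGGDDKQQ",
}
counts = unique_identifier_counts(library)
needs_redesign = [name for name, n in counts.items() if n < 3]
```

In this toy example each design shares two peptides with the other and has only one peptide of its own, so both would be flagged for redesign under a 3-identifier rule.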

Creating an Expression Classifier

AI-generated proteins often fail to express in practice, so we started work on an expression classifier, which requires data on generated proteins that were successfully expressed in TXTL. To gather this data, we generated groups of 2,000 proteins per language model. We started with proteins provided by Dr. Wojtowicz that had already confirmed expression levels, and fed them to the classifier as a spreadsheet. The classifier learns by finding patterns in the sequences and associating those patterns with the expression levels. Below is the spreadsheet that we fed to the classifier.

The spreadsheet uses 5 qualitative expression levels: undetectable, faint, low, medium, and high. To make the classifier’s job easier, we translated these into numbered bins, trying schemes with 2 to 5 states. Our current best model used 3 states: undetectable, faint, and low went into bin 0, medium into bin 1, and high into bin 2. We then fed it sequences from a similar data set and asked it to assign bins to them. Here are the results from our best model.
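The 3-state binning described above can be written as a small lookup table. This is a minimal sketch assuming the spreadsheet rows have been read into (sequence, level) pairs; the names `THREE_STATE_BINS` and `to_bin` and the placeholder sequences are hypothetical.

```python
# Mapping from the 5 qualitative levels to the 3-state scheme in the text.
THREE_STATE_BINS = {
    "undetectable": 0,
    "faint": 0,
    "low": 0,
    "medium": 1,
    "high": 2,
}


def to_bin(level, mapping=THREE_STATE_BINS):
    """Convert a qualitative expression level to its numbered bin."""
    key = level.strip().lower()
    if key not in mapping:
        raise ValueError(f"unknown expression level: {level!r}")
    return mapping[key]


# Placeholder rows for illustration only (not real training data).
rows = [("SEQ_A", "Faint"), ("SEQ_B", "medium"), ("SEQ_C", "High ")]
labeled = [(seq, to_bin(level)) for seq, level in rows]
```

Keeping the mapping in one dictionary makes it easy to swap in another binning scheme by passing a different `mapping`.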

We compiled the data from all protein mass spectrometry runs and labeled each sequence 1 if the variant was detected in any of the runs, and 0 otherwise. The full data can be accessed here: https://gitlab.igem.org/2024/software-tools/iea/-/blob/main/Updated_Classifier_Training_Data.xlsx
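The detected-in-any-run labeling rule amounts to taking the union of detections across runs. The sketch below assumes each run has been reduced to a collection of detected sequence IDs; the function name and the IDs are hypothetical.

```python
def label_detected(all_sequence_ids, runs):
    """Return {sequence_id: 1 if detected in any run, else 0}."""
    detected = set().union(*runs)  # union of detections across all runs
    return {seq_id: int(seq_id in detected) for seq_id in all_sequence_ids}


# Placeholder IDs for illustration only.
all_ids = ["p001", "p002", "p003", "p004"]
runs = [{"p001"}, {"p001", "p003"}, set()]
labels = label_detected(all_ids, runs)
```

Here `p001` and `p003` receive label 1 and the other two receive label 0, even though `p003` appears in only one run.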

A6RdhA and 68 similar proteins

As we switched to PFAS degradation, we started an extensive literature review to find as many PFAS degraders as possible for training data (see sources in acknowledgments). We began with A6RdhA, a protein from a previous Air Force team. Unfortunately, the version that team uploaded is incomplete: a significant part of its structure is missing due to limitations in reading its sequence. To fix this, we put it in Foldseek and found a total of 68 proteins similar to it. To narrow our search, we decided to aim for only one type of protein, choosing among three: laccases, peroxidases, and dehalogenases. Laccases and peroxidases can release radicals that break down molecules, including PFAS, but they can be toxic because those radicals can also break down essential molecules in cells. We therefore decided on dehalogenases, which have specific binding sites and can break down molecules more similar to PFAS. We examined the 68 proteins manually in PyMOL, took the parts that seemed to be missing from A6RdhA, and copied them onto the end of A6RdhA. The resulting chimeras should be representative of the full A6RdhA molecule, giving us a large list of possible PFAS degraders that can be fed to our LLM to generate even better ones.

Docking

We checked how well these chimeras docked using DiffDock and AutoDock Vina. These programs take a receptor protein and a ligand and simulate their interaction. We used them to simulate the enzymes’ interactions with PFOA. Each run produces a viewable file of the docked complex, which helped us train the protein generation LLM even further. Additionally, we received docking scores (units not yet confirmed), which allowed us to easily quantify and organize the data by the strength of the interaction between receptor and ligand.

PFAS ligand docked to A6T7 Chimera

Affinity Scores of Candidate Enzymes

Here are the affinity scores of candidate enzymes as they relate to PFAS degradation:

MD tests

To better predict whether our generated proteins would be able to hold PFAS-like ligands in their binding and active sites, we simulated the interactions in various molecular dynamics (MD) tests. Using OpenMM via TRILL, we produced multiple states of each interaction, ordered chronologically, that show the receptor and ligand interacting in simulated aqueous conditions. Proteins that successfully hold the ligand in their binding site are predicted to be better degraders, while those that do not are predicted to be worse.
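One way to turn "holds the ligand in its binding site" into a number is the fraction of trajectory frames in which the ligand stays near the site. The sketch below is an illustration rather than the analysis we ran: it assumes per-frame ligand and binding-site centroid coordinates (in angstroms) have already been extracted from the trajectory, and the 5.0 cutoff, the function name, and the coordinates are hypothetical.

```python
import math


def fraction_bound(ligand_centroids, site_centroids, cutoff=5.0):
    """Fraction of frames where the ligand centroid is within `cutoff`
    of the binding-site centroid. Frames are assumed to be in time order."""
    frames = list(zip(ligand_centroids, site_centroids))
    if not frames:
        raise ValueError("trajectory has no frames")
    within = sum(1 for lig, site in frames if math.dist(lig, site) <= cutoff)
    return within / len(frames)


# Placeholder coordinates for illustration only.
site = [(0.0, 0.0, 0.0)] * 4
stays = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 1.0)]
drifts = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (9.0, 0.0, 0.0), (15.0, 0.0, 0.0)]

retained = fraction_bound(stays, site)    # ligand never leaves the site
released = fraction_bound(drifts, site)   # ligand drifts away over time
```

A candidate with a retained fraction near 1 would count as holding the ligand; where to draw the line between "holds" and "releases" is a modeling choice.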

PFAS ligand bound to a potential PFAS degrader