The project aimed to address the environmental persistence of PFAS by discovering and designing novel enzymes capable of degrading these compounds, focusing on reductive dehalogenases like A6RdhA. To enhance protein expression prediction, we designed an expression classifier applicable to projects beyond our own.
Per- and polyfluoroalkyl substances (PFAS) are a group of man-made chemicals that are resistant to heat, water, and oil.
Image source: Klinger International. (n.d.). PFAS regulation & its impact on the sealing industry. Retrieved from https://www.klinger-international.com/en/news/pfas-regulation-its-impact-on-the-sealing-industry
Current methods to degrade PFAS are costly and can produce highly toxic byproducts. Our high school iGEM team is using AI to design enzymes that break down PFAS through a four-step process:
1. Using protein large language models (pLLMs) and an enzyme classifier to generate novel PFAS-degrading enzymes.
2. Computationally validating the predicted catalytic activity of these enzymes against PFAS.
3. Assessing the expressibility of the enzymes in a cell-free system to ensure feasibility for lab testing.
4. Sending successfully expressed enzymes to a specialized lab to confirm PFAS degradation activity, and as input for an expression classifier.
By combining AI-driven protein design with experimental validation, our approach seeks to provide an efficient and cost-effective solution to combat the environmental and health risks posed by PFAS.
Our enzyme generation pipeline contains four interconnected steps, or modules. In the first step, we use a protein large language model (pLLM) to generate novel PFAS-degrading enzymes. Next, we computationally validate that each enzyme has activity with PFAS. In the third step, we plan to use an expression classifier that predicts whether sequences coming from the second step are expressible in a cell-free expression system called TXTL. Enzymes predicted to be expressible are then sent to the wet lab team to be expressed in TXTL. Whether each enzyme is successfully expressed is determined experimentally, both to confirm enzyme yield and to strengthen future expression predictions. Lastly, we will send our enzymes to a separate lab to test them on a PFAS substrate. The interactions and feedback between modules allow us to tailor the process toward an optimal system for creating novel PFAS-degrading enzymes.
In the first step, we begin by fine-tuning protein large language models (pLLMs). Fine-tuning is further training of an existing model on a specific dataset so that it produces a specific kind of output. For our pLLMs, the training data are natural enzyme candidate sequences predicted to have PFAS-degrading capabilities. This biases the models toward generating PFAS-degrading enzymes, since that is what we feed into them.
We use three pLLMs for novel enzyme generation: ZymCTRL, ProtGPT2, and ESM2. After fine-tuning, the pLLM will be biased toward the protein structures within the training data. This means that its output should retain specific key catalytic and structural motifs necessary for PFAS degradation while still having novel aspects. Any sequences generated at this step with erroneous amino acids are filtered out, and sequences with a folded structure dissimilar to natural proteins are also filtered out.
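The first of the two filters described above can be sketched in a few lines. This is a minimal illustration, assuming that "erroneous amino acids" means any character outside the 20 canonical one-letter codes; the function names and example sequences are hypothetical, and the structure-based filter is not shown.

```python
# The 20 canonical amino acids in one-letter code.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def has_only_canonical_residues(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only canonical residues."""
    seq = sequence.strip().upper()
    return len(seq) > 0 and set(seq) <= CANONICAL_AMINO_ACIDS


def filter_generated_sequences(sequences):
    """Drop sequences containing ambiguous or invalid characters (e.g. X, B, Z, *)."""
    return [s for s in sequences if has_only_canonical_residues(s)]


if __name__ == "__main__":
    # Hypothetical generated sequences; only the first one passes the filter.
    generated = ["MKTAYIAKQR", "MKTXYIAKQR", "MKT*", ""]
    print(filter_generated_sequences(generated))
```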
In the second step, vetted generated sequences (gen-seqs) are put through a computational process that gauges their predicted catalytic activity with PFAS. If a sequence is predicted not to have activity with PFAS, we will alter our training data and other aspects of the protein generation procedure. This creates a feedback loop within the system that selects sequences that are more likely to degrade PFAS.
Sequences predicted to have activity with PFAS move on to the third step, where their expressibility in a cell-free expression system (TXTL) is predicted computationally. Sequences predicted to not be expressible in TXTL influence our initial generation run so that we are not only selecting for sequences that have activity with PFAS but that are also expressible. Sequences predicted to be expressible are then expressed in TXTL in the lab. The results of this in-lab expression experiment will allow us to make more informed computational predictions for future generation runs.
In the fourth step, successfully expressed enzymes will be sent to a professional lab that can handle PFAS. The lab’s team of experts will use our enzymes in a PFAS degradation assay to validate their catalytic activity. If degradation is not observed, we will alter our generation procedure and our prediction mechanisms.
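The gating and feedback between the computational steps can be summarized as control flow. The sketch below is only an outline under stated assumptions: `generate`, `is_valid`, `predicts_activity`, and `predicts_expression` are hypothetical placeholders for the pLLM sampler, the sequence filters, the activity validation, and the expression classifier, none of which are implemented here.

```python
def run_generation_round(generate, is_valid, predicts_activity,
                         predicts_expression, n_sequences):
    """Run one pass through steps 1-3 and report what each gate rejected.

    All four callables are hypothetical stand-ins supplied by the caller.
    """
    generated = list(generate(n_sequences))                       # step 1: pLLM output
    vetted = [s for s in generated if is_valid(s)]                # sequence filters
    active = [s for s in vetted if predicts_activity(s)]          # step 2 gate
    expressible = [s for s in active if predicts_expression(s)]   # step 3 gate
    return {
        "to_wet_lab": expressible,
        # Rejections are returned so they can inform the next generation run.
        "rejected_activity": [s for s in vetted if s not in active],
        "rejected_expression": [s for s in active if s not in expressible],
    }
```

Returning the rejected sequences alongside the accepted ones mirrors the feedback loops in the text: each list is information that could be used to adjust training data or generation settings before the next round.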
To begin generating novel enzymes to degrade per- and polyfluoroalkyl substances (PFAS), we first needed to identify candidate enzymes to serve as effective training data. Fine-tuning our pLLMs on known PFAS-degrading enzymes, or on enzymes predicted to degrade PFAS, would bias the models toward structures capable of PFAS degradation. These candidate enzymes would have to exhibit some level of catalytic activity with PFAS species, particularly the ability to cleave the strong carbon-fluorine bonds that make PFAS so difficult to degrade. Candidate enzymes also needed to be sequentially diverse from each other to ensure that, when trained on this data, the pLLM would be capable of generating appropriately novel sequences and would not overfit to a particular structure. To this end, we completed an extensive literature review in which we identified two enzyme families with the potential to meet the requirements for effective training data: radical-producing enzymes and dehalogenases.
Radical-producing enzymes produce highly reactive radical species that disrupt the structure of PFAS, leading to degradation. Dehalogenases have ligand-specific active sites that can defluorinate certain species of PFAS. The inherently destructive nature of radical-producing enzymes makes them difficult to express recombinantly in microbes. A potential approach to remedy this challenge would be to express them within a cell-free extract. However, because the radical species produced by these enzymes are generally reactive, they do not react only with PFAS, which makes the reactions they cause less controlled and more unpredictable. This made radical-producing enzymes less appealing to the team.
With that, we pivoted our focus to dehalogenases. In our search for specific dehalogenases proven to degrade PFAS, the team discovered the corrinoid iron-sulfur reductive dehalogenase A6RdhA. A6RdhA was identified within Acidimicrobium sp. Strain A6, a soil-dwelling microbe that, when incubated with a PFOA substrate, could partially defluorinate PFOA's structure. PFOA is a legacy PFAS species that the EPA mandated in 2024 be kept below 4 parts per trillion (ppt), the minimal detectable limit, in drinking water.
When the gene coding for A6RdhA was knocked out, Acidimicrobium sp. Strain A6 lost its ability to degrade PFOA. This kind of concrete proof of PFAS degradation caught the team's attention, and we decided to focus our efforts towards reductive dehalogenases. Because A6RdhA has only recently been identified, its properties and complete structure have not been well characterized; in fact, the only known amino acid sequence of A6RdhA is missing a C-terminus of more than 100 amino acids. This incomplete fragment is predicted to be incapable of completing catalysis because the missing C-terminus makes up a large part of its
To find enzymes structurally similar to A6RdhA, we used the program Foldseek, which matches a query protein structure against proteins with similar structures within various protein structure and sequence databases. Through this method and a secondary literature review, the team identified 68 proteins with high structural similarity to A6RdhA (Fig. 2B).
The majority of these enzymes had a relatively low annotation score on UniProt and were not well characterized structurally or functionally. However, many were classified as reductive dehalogenases, which aligns with the classification of A6RdhA and fits within the family of enzymes we were researching. In addition to having high structural similarity to a known PFAS-degrading enzyme, the 68 were also sequentially diverse from each other.
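One way to check that a candidate set is sequentially diverse is to look at the highest similarity between any two members. The sketch below is a rough illustration only: it uses Python's `difflib.SequenceMatcher` ratio as a stand-in for alignment-based percent identity, which is an assumption and not the method the team used. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def approximate_similarity(seq_a: str, seq_b: str) -> float:
    """Rough similarity in [0, 1]; a proxy for alignment-based percent identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def max_pairwise_similarity(sequences) -> float:
    """Return the highest similarity between any two sequences (needs at least 2)."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences to compare")
    return max(approximate_similarity(a, b) for a, b in combinations(sequences, 2))
```

A low maximum suggests that no two candidates are near-duplicates; a proper analysis would replace the similarity function with a pairwise or multiple sequence alignment tool.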
At this point, the computational team formulated a separate plan to reconstruct the missing C-terminus of A6RdhA, allowing us to produce a functional novel enzyme closer to A6RdhA in structure and, therefore, function than any other candidate. To achieve this goal, the team would produce a chimera enzyme made up of the fragment of A6RdhA and one of the structurally similar candidates. The candidate chosen to reconstruct A6RdhA was enzyme T7RdhA, a reductive dehalogenase with the second-highest structural similarity to A6RdhA out of the 68 candidates. Another research group computationally verified T7RdhA to have PFAS-degrading abilities similar to A6RdhA.
To construct the chimera, the structures of T7RdhA and the A6RdhA fragment were aligned, and two points were identified: the position at which the fragment ends and the corresponding position at which T7RdhA's sequence continues. Every amino acid in T7RdhA's sequence from that position onward was grafted onto the end of the A6RdhA fragment. This alteration of A6RdhA's structure reconstructed its active site (Fig. 3A).
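At the sequence level, the graft amounts to concatenating the fragment with the tail of the donor. The following is a minimal sketch, assuming the structural alignment has already produced a 0-based index in the donor sequence where the fragment leaves off. The function name, the index parameter, and the toy sequences are hypothetical and are not the real A6RdhA or T7RdhA sequences.

```python
def graft_c_terminus(fragment: str, donor: str, donor_resume_index: int) -> str:
    """Append the donor residues from donor_resume_index onward to the fragment.

    donor_resume_index is the 0-based position in the donor that aligns with
    the first residue missing from the fragment (taken from a prior alignment).
    """
    if not 0 <= donor_resume_index <= len(donor):
        raise ValueError("donor_resume_index is outside the donor sequence")
    return fragment + donor[donor_resume_index:]


if __name__ == "__main__":
    # Toy example: the fragment covers 5 residues, so the donor resumes at index 5.
    print(graft_c_terminus("MKTAY", "MRSGYLLPQW", 5))
```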
When expressed, A6RdhA and other reductive dehalogenases form a dimer structure. To test whether the novel chimera dimerized properly, the structural file for the chimera was also input into AlphaFold3, where its dimerized structure was predicted. Its dimerized structure followed the binding formation found in other similarly structured dehalogenases and even had a higher pLDDT score than that of the A6RdhA fragment's dimerized structure. The pLDDT score (predicted Local Distance Difference Test) measures the confidence of a predicted protein structure at each residue, with higher scores indicating greater reliability. This novel chimera was dubbed the A6T7 Chimera.
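Comparing pLDDT between two predictions usually means comparing an average over residues. The sketch below assumes the predicted structure is available as PDB-format text with per-atom pLDDT stored in the B-factor column, as AlphaFold-style PDB files do; a prediction delivered as mmCIF would need converting first. The function name is hypothetical.

```python
def mean_plddt_from_pdb(pdb_text: str) -> float:
    """Average the B-factor column (pLDDT) over C-alpha atoms, one per residue."""
    scores = []
    for line in pdb_text.splitlines():
        # PDB fixed-width columns: atom name is 13-16, B-factor is 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)
```

Using only C-alpha atoms gives each residue equal weight regardless of side-chain size, which is one reasonable convention among several.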
To characterize the candidate enzymes computationally and validate their activity with PFAS, the computation group completed numerous ligand-protein docking simulations using their three-dimensional structures. The results from these tests were inconclusive and showed no clear pattern in binding across various PFAS ligands, specific negative control ligands, and negative control proteins. In a second attempt to computationally validate our candidates, we began running molecular dynamics simulations to gauge affinity by observing whether a ligand docked within the active site of a receptor would start to drift out over time.
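The drift criterion can be made concrete by tracking the distance between the ligand and the active site across trajectory frames. This is a minimal sketch, assuming the ligand and pocket centroids have already been extracted per frame as (x, y, z) tuples in angstroms. The 5.0 angstrom cutoff, the 20% tail window, and the function names are illustrative assumptions, not values from the team's analysis.

```python
import math


def ligand_distances(ligand_centroids, pocket_centroids):
    """Per-frame distance between the ligand centroid and the pocket centroid."""
    return [math.dist(lig, pocket)
            for lig, pocket in zip(ligand_centroids, pocket_centroids)]


def ligand_stays_bound(distances, max_distance_angstrom=5.0, tail_fraction=0.2):
    """True if the mean distance over the final frames stays under the cutoff."""
    if not distances:
        raise ValueError("trajectory has no frames")
    tail_len = max(1, int(len(distances) * tail_fraction))
    tail = distances[-tail_len:]
    return sum(tail) / len(tail) <= max_distance_angstrom
```

Averaging over the end of the trajectory, instead of checking a single frame, makes the call less sensitive to momentary fluctuations of the ligand within the pocket.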
Molecular dynamics simulations show that PFOA docked within the active sites of enzyme candidate sequences does not drift out over time.
The amplification of the expression classifier libraries successfully produced a high yield of the correctly sized product.
Chiavola, A., Clément, J. C., Escapa, A., Huang, S., Lath, S., Lenka, S. P., Montagnolli, R. N., Ruiz-Urigüen, M., Sawayama, S., Senevirathna, S. T. M. L. D., Shuai, W., Sima, M. W., Yang, G., Zhang, D. Q., Chaudhuri, M. K., Cornell, R. M., Ding, L. J., & Gilson, E. R. (2024, February 15). Defluorination of PFAS by Acidimicrobium sp. strain A6 and potential applications for remediation. Methods in Enzymology. https://www.sciencedirect.com/science/article/pii/S0076687924000168
Guo, H.-B., Varaljay, V. A., Kedziora, G., Taylor, K., Farajollahi, S., Lombardo, N., Harper, E., Hung, C., Gross, M., Perminov, A., Dennis, P., Kelley-Loughnane, N., & Berry, R. (2023, March 11). Accurate prediction by AlphaFold2 for ligand binding in a reductive dehalogenase and implications for PFAS (per- and polyfluoroalkyl substance) biodegradation. Scientific Reports. https://www.nature.com/articles/s41598-023-30310-x
Marciesky, M., Aga, D. S., Bradley, I. M., Aich, N., & Ng, C. (2023, December 11). Mechanisms and opportunities for rational in silico design of enzymes to degrade per- and polyfluoroalkyl substances (PFAS). Journal of Chemical Information and Modeling. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10716909/
Nakamura, R., Obata, T., Nojima, R., Hashimoto, Y., Noguchi, K., Ogawa, T., & Yohda, M. (2018, August 10). Functional expression and characterization of tetrachloroethene dehalogenase from Geobacter sp. Frontiers in Microbiology. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6095959/
Wackett, L. P. (2021, October 27). Why is the biodegradation of polyfluorinated compounds so rare? mSphere. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8513679/
Ferruz, N., Schmidt, S., & Höcker, B. (2022, July 27). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications. https://www.nature.com/articles/s41467-022-32007-7
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574
Martinez, Z. A., Murray, R. M., & Thomson, M. W. (2023). TRILL: Orchestrating modular deep-learning workflows for democratized, scalable protein analysis and engineering. bioRxiv. https://doi.org/10.1101/2023.10.24.563881
Munsamy, G., Lindner, S., Lorenz, P., & Ferruz, N. (2022). ZymCTRL: a conditional language model for the controllable generation of artificial enzymes. Machine Learning for Structural Biology Workshop, NeurIPS 2022. https://www.mlsb.io/papers_2022/ZymCTRL_a_conditional_language_model_for_the_controllable_generation_of_artificial_enzymes.pdf
Simulating 500 million years of evolution with a language model (n.d.).
van Kempen, M., Kim, S. S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C. L. M., Söding, J., & Steinegger, M. (2023, May 8). Fast and accurate protein structure search with Foldseek. Nature Biotechnology. https://www.nature.com/articles/s41587-023-01773-0