PFAS, or "forever chemicals," are toxic compounds linked to cancers, infertility, and reduced vaccine response, even at concentrations below 1 ppt.

Per- and polyfluoroalkyl substances (PFAS) are a group of man-made chemicals that are resistant to heat, water, and oil.

Image source: Klinger International. (n.d.). PFAS regulation & its impact on the sealing industry. Retrieved from https://www.klinger-international.com/en/news/pfas-regulation-its-impact-on-the-sealing-industry

Current methods to degrade PFAS are costly and can produce highly toxic byproducts. Our high school iGEM team is using AI to design enzymes that break down PFAS through a four-step process:

1. Using protein large language models (pLLMs) and an enzyme classifier to generate novel PFAS-degrading enzymes.

2. Computationally validating the predicted catalytic activity of these enzymes against PFAS.

3. Assessing the expressibility of the enzymes in a cell-free system to ensure feasibility for lab testing.

4. Sending successfully expressed enzymes to a specialized lab to confirm PFAS degradation activity, and using the expression results as input for an expression classifier.

By combining AI-driven protein design with experimental validation, our approach seeks to provide an efficient and cost-effective solution to combat the environmental and health risks posed by PFAS.

Our Four Step Plan

Our enzyme generation pipeline contains four interconnected steps, or modules. In the first step, we use a protein large language model (pLLM) to generate novel PFAS-degrading enzymes. In the second, we computationally validate that each enzyme has activity with PFAS. In the third, we plan to use an expression classifier to predict whether sequences coming from the second step are expressible in a cell-free expression system called TXTL. Enzymes predicted to be expressible are sent to the wet lab team to be expressed in TXTL, and whether each enzyme is successfully expressed is determined experimentally, both to ensure enzyme yield and to strengthen future expression predictions. Lastly, we will send our enzymes to a separate lab to test them on a PFAS substrate. The interactions and feedback between our modules allow us to tailor the process toward an optimal system for creating novel PFAS-degrading enzymes.
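The computational part of this plan can be sketched as a chain of filters. The following is a minimal illustration, not the team's actual code: `run_pipeline`, `PipelineResult`, and the three callables are hypothetical names, and the wet-lab expression and external PFAS assay are outside what code can show.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class PipelineResult:
    """Sequences that survive each computational stage (hypothetical container)."""
    generated: List[str] = field(default_factory=list)
    active: List[str] = field(default_factory=list)
    expressible: List[str] = field(default_factory=list)


def run_pipeline(
    generate: Callable[[], Iterable[str]],
    predicts_activity: Callable[[str], bool],
    predicts_expression: Callable[[str], bool],
) -> PipelineResult:
    """Run steps 1-3: generate, keep predicted-active, keep predicted-expressible."""
    result = PipelineResult()
    # Step 1: sequences from a fine-tuned pLLM (stubbed by the caller here).
    result.generated = list(generate())
    # Step 2: keep sequences predicted to have activity with PFAS.
    result.active = [s for s in result.generated if predicts_activity(s)]
    # Step 3: keep sequences predicted to be expressible in TXTL.
    result.expressible = [s for s in result.active if predicts_expression(s)]
    return result
```

Sequences rejected at steps 2 and 3 are kept in the result (as the difference between the lists) so they can inform the next generation run, which is the feedback the plan describes.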

First Step

In the first step, we begin by fine-tuning protein large language models (pLLMs). Fine-tuning means further training an existing model on a specific dataset so that it produces a specific kind of output. For our pLLMs, the training data are natural enzyme candidate sequences predicted to have PFAS-degrading capabilities. Because the model learns from these sequences, it is biased toward generating PFAS-degrading enzymes.
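As a rough illustration of preparing fine-tuning data like this, the sketch below reads candidate sequences from FASTA text, drops exact duplicates, and writes each sequence as a separate training record. The function names, the 60-residue line width, and the `<|endoftext|>` separator are assumptions for illustration (the separator is a common convention for GPT-2-style models); the team's actual preprocessing is not described here, and each pLLM may expect a different input format.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            # Fall back to a generated id if the header is empty.
            name = parts[0] if parts else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def build_training_text(
    records: Dict[str, str],
    separator: str = "<|endoftext|>",  # assumed record separator
    width: int = 60,                   # assumed line width
) -> str:
    """Write unique sequences as separator-delimited, line-wrapped records."""
    chunks = []
    seen = set()
    for seq in records.values():
        if not seq or seq in seen:
            continue  # skip empty records and exact duplicates
        seen.add(seq)
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        chunks.append(f"{separator}\n{wrapped}")
    return "\n".join(chunks) + "\n"
```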

Quick Explanation - Protein Large Language Model(s) (pLLM)

We use three pLLMs for novel enzyme generation: ZymCTRL, ProtGPT2, and ESM2. After fine-tuning, the pLLM will be biased toward the protein structures within the training data. This means that its output should retain specific key catalytic and structural motifs necessary for PFAS degradation while still having novel aspects. Any sequences generated at this step with erroneous amino acids are filtered out, and sequences with a folded structure dissimilar to natural proteins are also filtered out.
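The first of these filters, removing sequences with erroneous amino acids, is simple to sketch. The example below assumes "erroneous" means any character outside the 20 canonical amino acid letters; the function names are hypothetical, and the structure-based filter is not shown because it depends on a structure predictor.

```python
from typing import Iterable, List

# The 20 canonical amino acids in one-letter code.
CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")


def has_only_canonical_residues(seq: str) -> bool:
    """Return True if the sequence is non-empty and uses only canonical residues."""
    seq = seq.strip().upper()
    return bool(seq) and set(seq) <= CANONICAL


def filter_generated(seqs: Iterable[str]) -> List[str]:
    """Keep generated sequences that pass the residue check, normalized to upper case."""
    return [s.strip().upper() for s in seqs if has_only_canonical_residues(s)]
```

Ambiguity codes such as X or B are rejected under this assumption, as are empty strings.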

Second Step

In the second step, vetted generated sequences (gen-seqs) are put through a computational process that gauges their predicted catalytic activity with PFAS. If a sequence is predicted not to have activity with PFAS, we will alter our training data and other aspects of the protein generation procedure. This creates a feedback loop within the system that selects sequences that are more likely to degrade PFAS.

Third Step

Sequences predicted to have activity with PFAS move on to the third step, where their expressibility in a cell-free expression system (TXTL) is predicted computationally. Sequences predicted to not be expressible in TXTL influence our initial generation run so that we are not only selecting for sequences that have activity with PFAS but that are also expressible. Sequences predicted to be expressible are then expressed in TXTL in the lab. The results of this in-lab expression experiment will allow us to make more informed computational predictions for future generation runs.

Fourth Step

In the fourth step, successfully expressed enzymes will be sent to a professional lab that can handle PFAS. The lab’s team of experts will use our enzymes in a PFAS degradation assay to validate their catalytic activity. If degradation is not observed, we will alter our generation procedure and our prediction mechanisms.

PFAS-Degrading Enzyme Design

To begin generating novel enzymes to degrade per- and polyfluoroalkyl substances (PFAS), we first needed to identify candidate enzymes to serve as training data. Fine-tuning our pLLMs on known PFAS-degrading enzymes, or on enzymes predicted to degrade PFAS, biases the models toward structures capable of PFAS degradation. These candidate enzymes would have to exhibit some level of catalytic activity with PFAS species, particularly the ability to cleave the strong carbon-fluorine bonds that make PFAS so difficult to degrade. Candidate enzymes also needed to be diverse in sequence so that, when trained on this data, the pLLM would be capable of generating appropriately novel sequences and would not overfit to a particular structure. To this end, we completed an extensive literature review in which we identified two enzyme families with the potential to meet the requirements for effective training data: radical-producing enzymes and dehalogenases.

Radical-Producing Enzymes Offer Potential but Pose Significant Problems in Microbial Expression

Radical-producing enzymes produce highly reactive radical species that disrupt the structure of PFAS, leading to degradation. Dehalogenases, by contrast, have ligand-specific active sites that can defluorinate certain species of PFAS. The inherently destructive nature of radical-producing enzymes makes them difficult to express recombinantly in microbes. A potential remedy would be to express them in a cell-free extract. However, because the radical species these enzymes produce are broadly reactive, they do not degrade only PFAS, which makes the reactions they cause less controlled and more unpredictable. This made radical-producing enzymes less appealing to the team.

Shift to Dehalogenases: Discovery of A6RdhA for Targeted PFAS Defluorination in Acidimicrobium sp. Strain A6

With that, we pivoted our focus to dehalogenases. In our search for specific dehalogenases proven to degrade PFAS, the team came across the corrinoid iron-sulfur reductive dehalogenase A6RdhA. A6RdhA was identified within Acidimicrobium sp. Strain A6, a soil-dwelling microbe that, when incubated with a PFOA substrate, could partially defluorinate PFOA's structure. PFOA is a legacy PFAS species; in 2024 the EPA mandated that its concentration in drinking water be less than 4 parts per trillion (ppt), the minimum detectable limit.

Figure 2A. Enzyme A6RdhA is a promising starting point for enzymatic PFAS degradation because of its demonstrated activity. Enzyme A6RdhA was identified within Acidimicrobium sp. Strain A6. When Acidimicrobium sp. Strain A6 is incubated with a PFOA substrate, it is capable of partially defluorinating PFOA's structure. The products of this degradation are shorter-chain PFAS species. When the gene coding for A6RdhA is knocked out, Acidimicrobium sp. Strain A6 loses its ability to degrade PFOA. These results suggest that enzyme A6RdhA is the key catalyst in this reaction. This concrete evidence of PFAS degradation makes enzyme A6RdhA a strong basis for finding other enzymes that could potentially degrade PFAS. Enzyme A6RdhA has provided researchers with a significant stepping stone in developing biotechnological approaches for environmental cleanup of PFAS compounds. Understanding how A6RdhA functions allows for the possibility of engineering similar or more effective enzymes to target a broader range of PFAS pollutants.

Focus on Reductive Dehalogenases: A6RdhA Knockout Reveals Key Role in PFOA Degradation, Inspiring Search for Similar Enzymes

When the gene coding for A6RdhA was knocked out, Acidimicrobium sp. Strain A6 lost its ability to degrade PFOA. This kind of concrete proof of PFAS degradation caught the team's attention, and we decided to focus our efforts on reductive dehalogenases. Because A6RdhA has only recently been identified, its properties and complete structure have not been well characterized; in fact, the only known amino acid sequence of A6RdhA is missing a C-terminus of more than 100 amino acids. This incomplete fragment is predicted to be incapable of completing catalysis because the missing C-terminus makes up a large part of its active site.

Identifying PFAS-Degrading Enzymes: Foldseek Analysis Reveals 68 Proteins Structurally Similar to A6RdhA

To find enzymes structurally similar to A6RdhA, we used the program Foldseek, which matches a query protein structure against similar structures in various protein structure and sequence databases. Through this method and a secondary literature review, the team identified 68 proteins with high structural similarity to A6RdhA (Fig. 2B).

We identified 68 putative PFAS reductive dehalogenases structurally similar to A6RdhA. Using the program Foldseek, A6RdhA was matched with hundreds of similar proteins. Of these, 68 were chosen for their especially high structural similarity and their UniProt profiles. Because structure and function are closely linked, we assume functional similarity as well, making these 68 enzymes putative PFAS reductive dehalogenases and, in turn, strong candidates for effective training data.
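A search like this produces a table of hits that then has to be ranked. The sketch below assumes Foldseek's default tab-separated output, in which the columns are query, target, fraction identity, alignment length, mismatches, gap openings, query start/end, target start/end, E-value, and bit score. The `Hit` type, the function names, and ranking by bit score are assumptions for illustration; the team's exact selection criteria (which also involved UniProt profiles) are not reproduced here.

```python
import csv
import io
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    target: str
    fident: float   # fraction of identical residues in the alignment
    evalue: float
    bits: float


def parse_hits(m8_text: str) -> List[Hit]:
    """Parse tab-separated search output (assumed default 12-column layout)."""
    hits = []
    for row in csv.reader(io.StringIO(m8_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        hits.append(Hit(row[1], float(row[2]), float(row[10]), float(row[11])))
    return hits


def top_hits(hits: List[Hit], n: int) -> List[Hit]:
    """Keep the best-scoring hit per target, then return the n highest by bit score."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.target not in best or h.bits > best[h.target].bits:
            best[h.target] = h
    return sorted(best.values(), key=lambda h: h.bits, reverse=True)[:n]
```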

Structural Analysis of A6RdhA-like Enzymes: Conserved Active Sites Across 68 Diverse Reductive Dehalogenases Suggest Potential for PFAS Degradation

The majority of these enzymes had a relatively low annotation score on UniProt and were not well characterized structurally or functionally. However, many were classified as reductive dehalogenases, which aligns with the classification of A6RdhA and fits within the family of enzymes we were researching. In addition to having high structural similarity to a known PFAS-degrading enzyme, the 68 were also diverse in sequence.

To ensure that the enzyme generation process did not yield homogeneous sequences and instead produced sufficiently novel enzymes, it was necessary to ensure that the training data was appropriately diverse. All 68 candidate enzymes were aligned against each other by sequence to gauge percent sequence identity. Using the data gathered from this analysis, a heat map was generated. Every pixel within the heat map represents a comparison between two of the 68 enzymes. The darker the pixel, the lower the percent identity. As can be seen from the overall dark color of the heat map, all 68 enzymes are sufficiently diverse from each other. Notably, when the three-dimensional structures of the 68 enzymes were aligned using protein visualization software, the basic structure of the active site was conserved across all of them, suggesting once again that they have similar catalytic residues.
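The pairwise comparison behind such a heat map can be sketched in a few lines. The example below is a minimal illustration, not the team's method: it uses a basic Needleman-Wunsch global alignment with an assumed scoring scheme (match +1, mismatch -1, gap -1) and defines percent identity as identical columns divided by alignment length. Real analyses typically use substitution matrices and dedicated alignment tools.

```python
from typing import List, Tuple


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -1) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identical aligned columns as a percentage of alignment length."""
    aligned_a, aligned_b = global_align(a, b)
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


def identity_matrix(seqs: List[str]) -> List[List[float]]:
    """Symmetric matrix of pairwise percent identities (the heat map's data)."""
    return [[percent_identity(a, b) for b in seqs] for a in seqs]
```

Each cell of the returned matrix corresponds to one pixel of the heat map; low values throughout indicate a diverse training set.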

Chimera Design: Combining A6RdhA and T7RdhA to Reconstruct a Functional Enzyme for PFAS Degradation

At this point, the computational team formulated a separate plan to reconstruct the missing C-terminus of A6RdhA, allowing us to produce a functional novel enzyme closer to A6RdhA in structure and, therefore, function than any other candidate. To achieve this goal, the team would produce a chimera enzyme made up of the A6RdhA fragment and one of the structurally similar candidates. The candidate chosen to reconstruct A6RdhA was enzyme T7RdhA, a reductive dehalogenase with the second-highest structural similarity to A6RdhA out of the 68 candidates. Another research group computationally verified T7RdhA to have PFAS-degrading abilities similar to those of A6RdhA.

Visualization of PFOA within the active site of the A6T7 Chimera

Chimera Construction: Grafting T7RdhA onto A6RdhA Fragment to Rebuild Active Site for PFAS Degradation

To construct the chimera, the structures of T7RdhA and the A6RdhA fragment were aligned, and the position in T7RdhA's sequence corresponding to the end of the fragment was identified. Every amino acid in T7RdhA's sequence after that position was grafted onto the end of the A6RdhA fragment. This alteration reconstructed A6RdhA's active site (Fig. 3A).
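At the sequence level, the grafting step amounts to finding where the donor continues past the end of the fragment and appending that tail. The sketch below assumes the structural alignment has already been converted into two equal-length gapped strings; the function names and the toy sequences in the usage note are hypothetical and are not the real A6RdhA or T7RdhA sequences.

```python
def resume_index_from_alignment(aligned_fragment: str, aligned_donor: str) -> int:
    """Return the 0-based donor index just past the fragment's last aligned residue."""
    if len(aligned_fragment) != len(aligned_donor):
        raise ValueError("aligned strings must have equal length")
    # Last alignment column that contains a fragment residue (not a gap).
    last_col = max(i for i, c in enumerate(aligned_fragment) if c != "-")
    # Count donor residues up to and including that column.
    return sum(1 for c in aligned_donor[: last_col + 1] if c != "-")


def graft_c_terminus(fragment: str, donor: str, donor_resume_index: int) -> str:
    """Append the donor's residues from donor_resume_index onward to the fragment."""
    if not 0 <= donor_resume_index <= len(donor):
        raise ValueError("donor_resume_index out of range")
    return fragment + donor[donor_resume_index:]
```

For example, with the toy alignment `MKTAY----` against `MRSAYGGHW`, the resume index is 5 and the chimera is `MKTAYGGHW`: the fragment is kept intact and only the donor's unmatched C-terminal tail is added.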

The A6T7 Chimera reconstructed the A6RdhA fragment and was successfully expressed in a cell-free extract. The A6T7 chimera was produced by reconstructing the missing C-terminus of the original A6RdhA fragment with the C-terminus of the T7RdhA enzyme. The addition of this C-terminus re-formed the active site of the A6RdhA fragment, making it more likely to be capable of binding essential cofactors.

A6T7 Chimera: AlphaFold3 Confirms Successful Dimerization of Novel PFAS-Degrading Enzyme with Enhanced Structural Integrity

When expressed, A6RdhA and other reductive dehalogenases form a dimer structure. To test whether the novel chimera dimerized properly, the chimera was input into AlphaFold3, where its dimerized structure was predicted. The predicted dimer followed the binding formation found in other similarly structured dehalogenases and even had a higher pLDDT score than the dimerized structure of the A6RdhA fragment. The pLDDT score (predicted Local Distance Difference Test) measures the confidence of a predicted protein structure at each residue, with higher scores indicating greater reliability. This novel chimera was dubbed the A6T7 Chimera.
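Because pLDDT is reported per residue, comparing two predictions usually means comparing an average. AlphaFold stores pLDDT in the B-factor field of its structure output; the sketch below assumes the prediction has been converted to PDB format (AlphaFold3 itself writes mmCIF) and averages the B-factor of each residue's alpha carbon. The function name is hypothetical, and how the team summarized pLDDT is not stated here.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column (pLDDT) over CA atoms in PDB-format text."""
    values = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16, B-factor in 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no CA atoms found")
    return sum(values) / len(values)
```

Running this on the chimera dimer and on the fragment dimer would give two numbers that can be compared directly, with the caveat that a mean can hide low-confidence regions.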

The computationally predicted dimerized structure of A6T7 (above) has a higher pLDDT score than the dimerized structure of the A6RdhA fragment (below).

Validating Enzyme Candidates: Molecular Dynamics Simulations Used to Assess Ligand Affinity in PFAS-Degrading Enzymes

To characterize the candidate enzymes computationally and validate their activity with PFAS, the computational group completed a myriad of ligand-protein docking simulations with their three-dimensional structures. The results from these tests were inconclusive and showed no clear pattern in binding across various PFAS ligands, specific negative control ligands, and negative control proteins. In a second attempt to computationally validate our candidates, we began running molecular dynamics simulations to gauge affinity by observing whether a ligand docked within the active site of a receptor would start to drift out over time.
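One simple way to quantify drift is to track the distance between the ligand and the active-site pocket across trajectory frames. The sketch below is an assumption about how this could be measured, not the team's analysis: it takes per-frame coordinates (in angstroms) as plain lists, uses unweighted centroids, and flags drift when the average distance over the final frames exceeds the starting distance by a hypothetical 5 Å cutoff.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Unweighted geometric center of a set of 3D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def ligand_pocket_distances(ligand_frames: Sequence[Sequence[Point]],
                            pocket_frames: Sequence[Sequence[Point]]) -> List[float]:
    """Centroid-to-centroid distance between ligand and pocket for each frame."""
    return [math.dist(centroid(lig), centroid(poc))
            for lig, poc in zip(ligand_frames, pocket_frames)]


def has_drifted(distances: Sequence[float], cutoff: float = 5.0,
                tail_fraction: float = 0.2) -> bool:
    """True if the mean distance over the final frames exceeds start + cutoff."""
    if not distances:
        raise ValueError("no frames given")
    tail_len = max(1, int(len(distances) * tail_fraction))
    tail = distances[-tail_len:]
    return sum(tail) / len(tail) > distances[0] + cutoff
```

Averaging over the last fraction of frames, rather than using only the final frame, makes the check less sensitive to a single noisy snapshot.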

Molecular dynamics simulations show that PFOA docked within the active sites of enzyme candidate sequences does not drift out over time.

Expression Classifier

Developing an Expression Classifier: Training a Model to Predict Protein Expression in Cell-Free Systems Using DNA Libraries and Mass Spectrometry

Please refer to our Software page for more information about our Expression Classifier.

The amplification of the expression classifier libraries successfully produced a high yield of the correctly sized product.

Citations and References

Enzymes

Chiavola, A., Clément, J. C., Escapa, A., Huang, S., Lath, S., Lenka, S. P., Montagnolli, R. N., Ruiz-Urigüen, M., Sawayama, S., Senevirathna, S. T. M. L. D., Shuai, W., Sima, M. W., Yang, G., Zhang, D. Q., Chaudhuri, M. K., Cornell, R. M., Ding, L. J., & Gilson, E. R. (2024, February 15). Defluorination of PFAS by Acidimicrobium sp. strain A6 and potential applications for remediation. Methods in Enzymology. https://www.sciencedirect.com/science/article/pii/S0076687924000168

Guo, H.-B., Varaljay, V. A., Kedziora, G., Taylor, K., Farajollahi, S., Lombardo, N., Harper, E., Hung, C., Gross, M., Perminov, A., Dennis, P., Kelley-Loughnane, N., & Berry, R. (2023, March 11). Accurate prediction by AlphaFold2 for ligand binding in a reductive dehalogenase and implications for PFAS (per- and polyfluoroalkyl substance) biodegradation. Scientific Reports. https://www.nature.com/articles/s41598-023-30310-x

Marciesky, M., Aga, D. S., Bradley, I. M., Aich, N., & Ng, C. (2023, December 11). Mechanisms and opportunities for rational in silico design of enzymes to degrade per- and polyfluoroalkyl substances (PFAS). Journal of Chemical Information and Modeling. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10716909/

Nakamura, R., Obata, T., Nojima, R., Hashimoto, Y., Noguchi, K., Ogawa, T., & Yohda, M. (2018, August 10). Functional expression and characterization of tetrachloroethene dehalogenase from Geobacter sp. Frontiers in Microbiology. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6095959/

Wackett, L. P. (2021, October 27). Why is the biodegradation of polyfluorinated compounds so rare? mSphere. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8513679/

Language Model Papers

Ferruz, N., Schmidt, S., & Höcker, B. (2022, July 27). ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications. https://www.nature.com/articles/s41467-022-32007-7

Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574

Martinez, Z. A., Murray, R. M., & Thomson, M. W. (2023). TRILL: Orchestrating modular deep-learning workflows for democratized, scalable protein analysis and engineering. bioRxiv. https://doi.org/10.1101/2023.10.24.563881

Munsamy, G., Lindner, S., Lorenz, P., & Ferruz, N. (2022). ZymCTRL: a conditional language model for the controllable generation of artificial enzymes. Machine Learning for Structural Biology Workshop, NeurIPS 2022. https://www.mlsb.io/papers_2022/ZymCTRL_a_conditional_language_model_for_the_controllable_generation_of_artificial_enzymes.pdf

Simulating 500 million years of evolution with a language model (n.d.).

van Kempen, M., Kim, S. S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C. L. M., Söding, J., & Steinegger, M. (2023, May 8). Fast and accurate protein structure search with Foldseek. Nature Biotechnology. https://www.nature.com/articles/s41587-023-01773-0