Part 1: Background

1.1 Current Dilemma

PET microplastic pollution has emerged as a global environmental issue that poses significant threats to ecosystems and human health. Although biocatalysts such as PETase have demonstrated considerable potential for degrading microplastics, their practical application faces numerous challenges, most notably the limited binding affinity between PETase and PET plastics. To address this issue, incorporating binding peptides has been proposed as an effective strategy to enhance PETase-PET binding efficiency and thereby improve degradation capability. However, screening for highly efficient PET-binding peptides remains a key rate-limiting step.


Traditional biological experiments for identifying peptides with high affinity for plastics have several limitations: the procedures are laborious and time-consuming, and personnel costs are high. This inefficient screening process significantly hampers the application and wider adoption of PETase in microplastic degradation.


Fig. 1   Traditional biological experimental procedures[1]

In summary, the primary challenges in current microplastic treatment are centered around the insufficient binding capacity between PETase and PET, the low efficiency of screening effective binding peptides, and the high costs associated with biological experimentation. These issues present significant barriers to achieving efficient microplastic degradation. Therefore, developing a technology that can rapidly screen for high-affinity PET-binding peptides is a crucial research direction in this field.

1.2 Previous Work

In recent years, machine learning, and deep learning in particular, has rapidly emerged as a powerful tool. Deep learning uses multi-layered artificial neural networks, loosely inspired by the neural networks of the human brain, to learn features directly from data and make predictions, and it excels at processing large datasets. It has achieved remarkable success in fields such as computer vision and natural language processing. In biology, deep learning has also demonstrated tremendous potential, particularly in synthetic biology. It can analyze vast amounts of biological data, assist in the design of novel biomolecules, optimize genes and metabolic pathways, and even predict protein structures, significantly reducing labor and time costs and thereby driving advances in protein engineering. In metabolic engineering, deep learning helps identify key enzymes and metabolic pathways, improving biosynthesis efficiency.


Fig. 2   Deep learning enables synthetic biology applications[2]

Indeed, a growing number of researchers are employing deep learning methods, such as neural network models, to provide new insights into the design of biomolecules. For instance, Yuan and colleagues constructed ESM and Transformer models to accurately predict metal ion binding sites in proteins, which is of critical importance for understanding protein function and drug design [3]. Similarly, Wang et al. evaluated the predictive accuracy of various models, including Graph Convolutional Networks (GCN), Graph Neural Networks (GNN), and Transformers, in forecasting the binding affinity between target proteins and small molecule drugs, thus enabling more precise and effective affinity predictions [4]. Boonyarit and colleagues developed a GCN integrated with graph attention mechanisms, termed GraphEGFR, to improve the prediction of the bioactivity of tyrosine kinase inhibitors against both wild-type and mutant human epidermal growth factor receptor (EGFR) family proteins [5].

1.3 Future Work

We found that plastic-binding peptides bind efficiently to plastic substrates and can therefore significantly enhance the binding of enzymes to those substrates. We thus planned to search for binding peptides with high affinity for PET plastics in order to enhance the efficiency of PETase in degrading PET. However, traditional biological experiments for finding such short peptides are cumbersome and costly in labor and time, so we plan to use deep learning algorithms to mine binding peptides with high plastic affinity efficiently.

Part 2: Related Work

What work did we do?


We proposed a composite long short-term memory (LSTM)-GCN model for screening high-affinity PET-binding peptides. First, the LSTM was used to capture long-distance interactions within each sequence, narrowing 386,226 amino acid sequences down to 480 potential target peptides. Then, the GCN was employed to capture three-dimensional spatial information, yielding 16 high-affinity peptides with high confidence for experimental validation. Because the expression and binding ability of the resulting fusion proteins were limited, Mutcompute-super [6] was then applied. By recoding the protein's microenvironments and optimizing protein folding through MLP classifiers, Mutcompute-super enhanced both expression levels and PET microplastic degradation.
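The two-stage funnel described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the trained pipeline: `sequence_score` and `structure_score` are hypothetical placeholder heuristics standing in for the LSTM and the GCN, the toy `library` is invented, and selecting the top k candidates at each stage is an assumed selection rule.

```python
from typing import Callable, List

# Hypothetical residue groups used only by the placeholder scorers below.
AROMATIC = set("FWY")
HYDROPHOBIC = set("AVILMFWY")


def sequence_score(peptide: str) -> float:
    """Placeholder for the LSTM stage: fraction of aromatic residues."""
    return sum(aa in AROMATIC for aa in peptide) / len(peptide)


def structure_score(peptide: str) -> float:
    """Placeholder for the GCN stage: fraction of hydrophobic residues."""
    return sum(aa in HYDROPHOBIC for aa in peptide) / len(peptide)


def top_k(peptides: List[str], scorer: Callable[[str], float], k: int) -> List[str]:
    """Return the k best-scoring peptides (ties keep their input order)."""
    return sorted(peptides, key=scorer, reverse=True)[:k]


def two_stage_screen(peptides: List[str], k_first: int, k_second: int) -> List[str]:
    """Cheap sequence-level filter first, then structure-level re-ranking."""
    shortlist = top_k(peptides, sequence_score, k_first)
    return top_k(shortlist, structure_score, k_second)


# Toy library (invented sequences, for illustration only).
library = ["WFYAG", "GGSGG", "AVILM", "WWKKE", "FYGSA", "KKEED"]
finalists = two_stage_screen(library, k_first=3, k_second=2)
print(finalists)
```

With the numbers reported above, the same cascade would correspond to `k_first=480` and `k_second=16` applied to the 386,226 input sequences. Running the cheaper sequence-level stage first means the more expensive structure-level stage only has to process the shortlist.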


What help did our model provide to the wet lab?


The dry lab's contribution is crucial: it provides the wet lab with candidate peptides predicted to have high affinity and high specificity, significantly accelerating the entire experimental process. Conventional experimental screening of PET-binding peptides often requires a large amount of manpower, materials, and time. Using the LSTM-GCN model, we screened out 16 short peptides with potentially high binding ability to PET plastic. This screening method, based on multi-dimensional prediction, laid a solid foundation for subsequent experimental verification in the wet lab. The dry lab's work not only significantly narrowed the scope of wet-lab verification but also improved its efficiency and accuracy. On this basis, the wet lab carried out detailed biological experiments to evaluate the actual binding effect and biological function of these peptides. This verification work both tests the practical application potential of the screened peptides and provides reliable, high-quality data for subsequent functional research.


What contribution did the measurements of the wet lab make to model development?


In follow-up experiments on the 16 screened short peptide sequences with predicted high affinity for PET plastics, the wet-lab measurements indicated that the expression level of the fusion protein and its binding ability to the microplastic substrate were still limited. This made us realize that the first-generation model can screen out relatively optimal sequences but cannot guarantee that those sequences perform satisfactorily in actual tests. The model therefore needed further development. To address this problem, we adopted Mutcompute-super [6] as an extension of our model. By introducing beneficial point mutations at specific sites of the short peptide sequence, we used molecular modification to further improve the expression level of the fusion protein and its ability to degrade PET microplastics. We recoded the C, N, O, and S microenvironments of the protein's amino acid residues and used the extracted features as the input of the Mutcompute-super model. Then, combined with a deep learning classifier, we predicted the optimal combination of amino acid mutations to optimize the protein's folding conformation and enhance its functional performance. In this way, the measurements fed back from the wet lab contributed substantially to the development of our model.
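One way a per-site classifier output can be turned into point-mutation suggestions is sketched below. This is an illustrative assumption, not the Mutcompute-super algorithm itself: the function `rank_point_mutations`, the `site_probs` table, the example sequence, and the log-ratio ranking rule are all hypothetical. The idea is that, for each site, a classifier assigns a probability to every amino acid given the local microenvironment, and substitutions judged more likely than the wild-type residue are proposed first.

```python
import math
from typing import Dict, List


def rank_point_mutations(
    sequence: str,
    site_probs: Dict[int, Dict[str, float]],
    top_n: int = 3,
) -> List[str]:
    """Rank substitutions by log(p_mutant / p_wild_type), best first.

    site_probs maps a 0-based position to the (hypothetical) classifier
    probabilities for each amino acid at that position. Only substitutions
    that the classifier prefers over the wild type (score > 0) are kept.
    """
    candidates = []
    for pos, probs in site_probs.items():
        wild_type = sequence[pos]
        # Guard against a zero or missing wild-type probability.
        p_wt = max(probs.get(wild_type, 0.0), 1e-9)
        for aa, p in probs.items():
            if aa == wild_type or p <= 0.0:
                continue
            score = math.log(p / p_wt)
            if score > 0.0:
                candidates.append((score, pos, wild_type, aa))
    candidates.sort(reverse=True)
    # Report in the usual notation: wild type, 1-based position, mutant.
    return [f"{wt}{pos + 1}{aa}" for _, pos, wt, aa in candidates[:top_n]]


# Invented example: a 5-residue peptide and made-up probabilities.
peptide = "GSWKA"
site_probs = {
    1: {"S": 0.1, "T": 0.6, "A": 0.3},  # classifier disfavors wild-type S
    3: {"K": 0.7, "R": 0.2, "E": 0.1},  # classifier agrees with wild-type K
}
suggestions = rank_point_mutations(peptide, site_probs)
print(suggestions)
```

In this toy case only position 2 yields suggestions, because the classifier already prefers the wild-type lysine at position 4. Combinations of mutations would still need to be evaluated together, since single-site scores do not account for interactions between sites.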


Our model approach is an effective example in synthetic biology.


In fact, LSTM and GCN have also achieved remarkable results in other areas of synthetic biology. For example, an LSTM can capture the dynamic interactions between metabolites and enzymes and predict changes in metabolite concentration from long time-series data, while a GCN can capture the topological structure of metabolites and enzymes in a metabolic network. This shows that our modeling approach is broadly applicable in synthetic biology.


At the same time, our approach is distinctive in that we connected the LSTM and GCN models in series. First, we used the LSTM for a preliminary screening based on features of the one-dimensional sequences. Then, we used the graph structure of the GCN to re-screen the candidates based on their three-dimensional PDB files. Finally, we used Mutcompute-super to predict amino acid mutation combinations and improve the expression of the fusion proteins. In this way, our approach closely couples the strengths of deep learning with the needs of synthetic biology.
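To make the structure-level step more concrete, the sketch below shows how 3D coordinates can be turned into a residue contact graph and passed through a single graph-convolution layer, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). It is a minimal illustration in plain Python, not our trained GCN: the 8 Å distance cutoff, the toy coordinates, the node features, and the weights are all assumptions chosen for readability.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def contact_adjacency(coords: Sequence[Sequence[float]], cutoff: float = 8.0) -> Matrix:
    """Build A + I: residues are linked if within `cutoff`; self-loops included."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = 1.0
    return adj


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph-convolution layer with symmetric normalization and ReLU."""
    n = len(adj)
    f_in, f_out = len(features[0]), len(weights[0])
    deg = [sum(row) for row in adj]  # always >= 1 because of self-loops
    norm = [
        [adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [
        [sum(norm[i][k] * features[k][c] for k in range(n)) for c in range(f_in)]
        for i in range(n)
    ]
    return [
        [max(0.0, sum(agg[i][c] * weights[c][o] for c in range(f_in))) for o in range(f_out)]
        for i in range(n)
    ]


# Toy input: three residue positions (in Å); the third is far from the others.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up node features
weights = [[1.0], [1.0]]                         # made-up 2 -> 1 weights

adj = contact_adjacency(coords)
hidden = gcn_layer(adj, features, weights)
print(adj)
print(hidden)
```

Here the first two residues are in contact, so each ends up with the average of their two feature vectors, while the isolated third residue keeps its own features. In practice the coordinates would be parsed from the PDB files and the weights would be learned during training.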


Our modeling approach therefore offers other researchers in synthetic biology a new and effective example.


Fig. 3   Model general workflow diagram

References

[1] Visual China. Design Concept of Science Laboratories [Z]. https://www.vcg.com/creative/1346174541.html

[2] Goshisht M K. Machine learning and deep learning in synthetic biology: Key architectures, applications, and challenges [J]. ACS Omega, 2024, 9(9): 9921-9945.

[3] Yuan Q, Chen S, Wang Y, et al. Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning [J]. Briefings in Bioinformatics, 2022, 23(6): bbac444.

[4] Wang Y, Jiao Q, Wang J, et al. Prediction of protein-ligand binding affinity with deep learning [J]. Computational and Structural Biotechnology Journal, 2023, 20(1): 1026-1035.

[5] Boonyarit B, Yamprasert N, Kaewnuratchadasorn P, et al. GraphEGFR: Multi-task and transfer learning based on molecular graph attention mechanism and fingerprints improving inhibitor bioactivity prediction for EGFR family proteins on data scarcity [J]. Journal of Computational Chemistry, 2024, 45(23): 2001-2023.

[6] Deng Z H, Cai C, Wang S T, et al. Protein design method based on amino acid microenvironment and EMO neural network, CN118136092A [P/OL].
