Model

Model 1 Identification of microplastics using machine learning on Raman spectroscopic data


1. Background

It is important for us to detect and identify microplastics in the environment. In our wet experiments, we developed a rapid and convenient method to detect microplastics; however, we cannot distinguish the type of microplastic because the anchoring peptide has low specificity. Therefore, we plan to combine Raman spectroscopy and machine learning to identify different microplastics. Although some researchers have carried out similar work in recent years [1], the prediction accuracy might still be improved.

2. Modeling Process


2.1 Data Preprocessing


Raman spectroscopy data primarily consist of two columns: wavenumber and intensity. The wavenumber, measured in inverse centimeters (cm⁻¹), corresponds to the frequency difference between the incident and scattered light and is characteristic of the functional groups of each compound. The other column, intensity, reflects the magnitude of the scattered light energy. In the model-building process, since the wavenumber values are mainly determined by the settings of the Raman instrument and contribute little by themselves to compound identification, we primarily extract the intensity values for subsequent analysis. The Raman signal is rather weak and is heavily affected by fluorescent background and various other noise signals, mainly originating from the CCD array detector, including photon shot noise and dark noise. Therefore, we first employ a denoising algorithm based on Peak Extraction and Retention (PEER) to improve data quality and reduce noise [2]. The first and second derivatives of the Raman spectrum are analyzed to ensure that the Raman peaks are unaffected during noise reduction. Subsequently, an optimized window smoothing technique is applied to the remaining part of the spectrum, which is then combined with the untreated Raman peaks to obtain high-quality denoised Raman spectra. This method not only effectively reduces background noise but also retains the height and shape of the peaks as far as possible, thus improving the quality of the spectral data.
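The peak-preserving idea can be illustrated with a minimal Python sketch. This is not the PEER implementation [2]; it is a simplified stand-in that assumes peaks can be found from the sign change of the first difference plus a height threshold, and that a plain moving average is an acceptable smoother. The function names and the parameters `min_height`, `protect`, and `window` are hypothetical.

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def find_peaks(intensity, min_height):
    """Indices where the first difference changes from positive to
    non-positive and the intensity reaches min_height."""
    peaks = []
    for i in range(1, len(intensity) - 1):
        d_left = intensity[i] - intensity[i - 1]
        d_right = intensity[i + 1] - intensity[i]
        if d_left > 0 and d_right <= 0 and intensity[i] >= min_height:
            peaks.append(i)
    return peaks


def denoise_keep_peaks(intensity, min_height, protect=2, window=5):
    """Smooth everything except a small region around each detected peak."""
    protected = set()
    for p in find_peaks(intensity, min_height):
        lo = max(0, p - protect)
        hi = min(len(intensity), p + protect + 1)
        protected.update(range(lo, hi))
    smoothed = moving_average(intensity, window)
    return [
        intensity[i] if i in protected else smoothed[i]
        for i in range(len(intensity))
    ]


# Toy spectrum: small noise around a baseline with one sharp peak.
spectrum = [1, 2, 1, 2, 1, 2, 50, 2, 1, 2, 1, 2, 1]
cleaned = denoise_keep_peaks(spectrum, min_height=10)
```

In this sketch the protected half-width is chosen to be at least half the smoothing window, so the peak value does not leak into the smoothed baseline next to it.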

2.2 Model Selection and Training


The objective of this study is to use the specificity of Raman spectroscopy data to identify types of microplastics, which is a typical classification problem. In selecting an appropriate classifier model, we considered the following common algorithms:

  • Logistic Regression
  • Naive Bayes
  • Decision Tree
  • Support Vector Machines

Among numerous candidate models, we selected Random Forest as our primary classifier due to its ability to enhance the accuracy and stability of predictions by constructing multiple decision trees and integrating their outcomes. Random Forest demonstrates excellent performance and generalization capabilities when dealing with datasets that contain high-dimensional features.

During the model training phase, the dataset was split into a training set and a test set at an 8:2 ratio. To assess the stability and accuracy of the model, the K-Fold Cross-Validation method was used. Additionally, to optimize model parameters and further improve performance, we conducted a Grid Search to determine the best model parameters, including the number of decision trees (n_estimators), the maximum number of features to consider when splitting a node (max_features), the maximum depth of the trees (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples required for a leaf node (min_samples_leaf). The results of the grid search helped us choose a higher number of trees, a moderate tree depth, and an ideal number of features to achieve accurate microplastic classification while maintaining high computational efficiency.
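As an illustration of what the 8:2 split and the grid search involve, the following Python sketch shuffles sample indices into training and test sets and enumerates every parameter combination of a grid. The parameter names follow those listed above; the candidate values in `PARAM_GRID`, the sample count, and the helper names are hypothetical and are not the values used in this project. In practice a library routine such as scikit-learn's `GridSearchCV` would fit and score a model for each combination.

```python
import itertools
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Hypothetical candidate values, for illustration only.
PARAM_GRID = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}


def grid_candidates(grid):
    """Return one dict per combination of parameter values."""
    keys = list(grid)
    value_lists = [grid[k] for k in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


train_idx, test_idx = train_test_split_indices(100)
candidates = grid_candidates(PARAM_GRID)
```

The number of candidates is the product of the list lengths, which is why the grid size has a direct effect on computational cost.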

Tab.1 K-Fold Cross-Validation Results of the Model

Fold | Accuracy | PET Precision | PET Recall | PET F1-Score | PP Precision | PP Recall | PP F1-Score | PS Precision | PS Recall | PS F1-Score
1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
4 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
5 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 1.00 | 0.98 | 1.00 | 0.94

To ensure consistency and stability in the model's performance across different data subsets, K-Fold Cross-Validation was employed to validate the effectiveness of the model. In this method, the original data are evenly divided into K subsets; in each iteration, K-1 subsets are used for training and the remaining subset is used for testing. This process is repeated K times, each time with a different test set, which assesses the model's generalization capability on unseen data. According to the cross-validation results, the model performed consistently across all folds, demonstrating a high degree of robustness and accuracy. As shown in Tab.1, despite a slight decline in some of the PP and PS metrics in the fifth fold, the model's average accuracy still reached 0.9967, indicating its reliability and effectiveness in practical applications.
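The splitting scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function name `k_fold_indices` is hypothetical, the samples are taken in their given order (in practice they would usually be shuffled first), and the sample count is arbitrary.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k folds of (train, test) index lists.

    Each sample appears in exactly one test set; fold sizes differ by
    at most one when n_samples is not divisible by k.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(10, 5)
for fold_number, (train, test) in enumerate(folds, start=1):
    print(fold_number, "train:", train, "test:", test)
```

A model would be trained on `train` and scored on `test` in each iteration, and the per-fold scores then averaged.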

Conclusion

This study utilized the combination of Raman spectroscopy and the Random Forest model to demonstrate efficient and accurate performance in the task of microplastic type identification. Although the model has not yet been applied directly to precise estimation of microplastic concentrations, its high specificity in identifying microplastic types has proven its potential as an effective auxiliary tool. By integrating with traditional colorimetric methods, this approach offers a more comprehensive solution for our research project, enhancing the accuracy of microplastic type identification and expanding the scope of microplastic monitoring and analysis.

Overall, the integration of Raman spectroscopy with machine learning technology provides an innovative analytical tool in the field of environmental science, especially showing great potential in handling complex environmental samples. Future research may explore combining this model with other spectroscopic techniques or further optimizing algorithms to expand its applications in environmental monitoring and pollution assessment. Additionally, enhancing the model's capability for estimating microplastic concentrations will be an important direction for subsequent research, potentially providing a more precise tool for the quantitative analysis of microplastic pollution.



Model 2 Virtual Screening of anchoring peptide TA2


1. Background


As found in our wet experiments, the anchoring peptide TA2 exhibits high affinity for various microplastics. This means that TA2 has no specificity for any one type of microplastic, for example PS, which is common in the environment. In order to enhance the specificity of the TA2 anchoring peptide, we conducted a molecular docking-based mutation-scanning virtual screening and obtained some useful information.

2. Theoretical basis of docking and Molecular Dynamics Simulation


Docking and MD are carried out in the following five steps:

  1. Receptor selection and preparation
  2. Ligand selection and preparation
  3. Docking
  4. Molecular Dynamics (MD)
  5. Evaluation of docking and MD results

Molecular docking is a computational method that predicts the optimal conformation and orientation of a protein-ligand complex, i.e. the pose with the lowest free energy. It consists of two main components: a scoring function and a search algorithm. The scoring function evaluates the binding energy and stability of a given protein-ligand pose by considering factors such as intermolecular interactions, solvation effects, and entropy. The search algorithm explores the conformational and rotational space of the protein and ligand to find the best-scoring pose [3].

Molecular docking can also predict how mutations affect the binding performance and stability of complexes by comparing the docking poses of wild-type and mutant proteins with the same ligand. However, molecular docking has some limitations: the effects of solvation, entropy, and flexibility on binding are treated only approximately or ignored.

Therefore, molecular docking results need to be validated through experimental data or further refined through more accurate methods, such as molecular dynamics (MD).

MD simulation can capture the dynamic behavior of proteins and ligands in solution, and explain their conformational changes, fluctuations, and interactions with solvent molecules [4].

MD simulation includes three main steps: initialization, integration, and analysis. The initialization step sets the initial coordinates, velocities, forces, and parameters of the system. The integration step uses numerical methods (such as the Verlet or leapfrog algorithms) to propagate the system forward in time. The analysis step extracts relevant information from the trajectory, such as structural, thermodynamic, dynamic, and binding properties.
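To make the integration step concrete, here is a minimal Python sketch of the velocity Verlet scheme applied to a one-dimensional harmonic oscillator. It is a toy model chosen only because its exact solution is known; the spring constant, mass, time step, and function names are assumptions for illustration, and a production MD engine such as GROMACS handles many particles, force fields, and thermostats.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one particle with the velocity Verlet integrator.

    Returns the list of (position, velocity) pairs, including the start.
    """
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


SPRING_K = 1.0  # assumed spring constant
MASS = 1.0      # assumed mass


def spring_force(x):
    return -SPRING_K * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * SPRING_K * x * x


# Initialization: x = 1, v = 0; integration: 1000 steps of dt = 0.01.
traj = velocity_verlet(1.0, 0.0, spring_force, MASS, dt=0.01, n_steps=1000)

# Analysis: compare with the exact solution x(t) = cos(t) and check energy drift.
x_end, v_end = traj[-1]
position_error = abs(x_end - math.cos(10.0))
energy_drift = abs(total_energy(x_end, v_end) - total_energy(1.0, 0.0))
```

The three stages in the sketch (setting the initial state, stepping forward, and inspecting the trajectory) mirror the initialization, integration, and analysis steps.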

3. Receptor selection and preparation

The structure of the TA2 peptide was predicted using AlphaFold3, a state-of-the-art deep learning method that predicts protein structure from amino acid sequences. The predicted structure has a high confidence level, as shown by the pLDDT score, which estimates the per-residue confidence of the prediction. The predicted structure is converted into a PDB file using PyMOL.

model-f2.png
Fig.2 AlphaFold3 prediction results for mutant TA2 peptides, including the predicted structure of the best conformation, the pLDDT plot, the PAE plot, and the pTM score.

The mutant protein structures are derived from the wild-type structure using FoldX, a software tool mainly used for modeling mutations and analyzing protein stability. The PDB file of each mutant is output by FoldX.

We wrote a Python script that uses MGLTools to batch convert PDB format to PDBQT format, simplifying the tedious step of receptor preprocessing in molecular docking.
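A batch conversion of this kind might look like the sketch below. It is not the script used in this project: it assumes that the MGLTools interpreter `pythonsh` and the MGLTools utility `prepare_receptor4.py` (with its `-r` input and `-o` output options) can be called from the working directory, and the directory names and helper functions are hypothetical.

```python
import subprocess
from pathlib import Path


def receptor_command(pdb_path, out_dir,
                     pythonsh="pythonsh", script="prepare_receptor4.py"):
    """Build the command that converts one PDB file to PDBQT.

    The locations of pythonsh and prepare_receptor4.py are assumptions;
    adjust them to the local MGLTools installation.
    """
    pdb_path = Path(pdb_path)
    out_path = Path(out_dir) / (pdb_path.stem + ".pdbqt")
    return [pythonsh, script, "-r", str(pdb_path), "-o", str(out_path)]


def batch_convert(pdb_dir, out_dir):
    """Convert every .pdb file in pdb_dir, skipping files already converted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdb_path in sorted(Path(pdb_dir).glob("*.pdb")):
        if (out_dir / (pdb_path.stem + ".pdbqt")).exists():
            continue  # already converted in an earlier run
        subprocess.run(receptor_command(pdb_path, out_dir), check=True)


# Example of the command built for one hypothetical mutant file.
example = receptor_command("mutants/A13G.pdb", "pdbqt")
```

Calling `batch_convert("mutants", "pdbqt")` would then process a whole folder of FoldX output files in one go.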

4. Ligand selection and preparation

  1. Retrieve the three-dimensional structure of styrene from the PubChem database (CID: 798).
  2. Add hydrogen atoms to the styrene molecule using Materials Studio 2020.
  3. Convert SDF files of styrene to PDBQT format using MGLTools.

5. Molecular docking


In this part, we used the AutoDock Vina program to carry out comprehensive molecular docking of 817 single-point mutants of the TA2 anchoring peptide, using styrene as the ligand [5]. We also wrote a batch (.bat) file for batch docking that supports resuming at any point after an interruption. We aimed to explore the effect of these mutations on the binding affinity between the TA2 peptide and styrene. Fig.3 shows the heatmap of the interaction between the TA2 mutants (residues 1-43) and styrene. The deeper the colour, the stronger the binding affinity. From it we can quickly identify and compare the binding affinity changes of different mutants, providing valuable information for further protein engineering. For example, Fig.3 shows three mutants, namely ALA13, ARG6, and ARG24, whose binding affinity was lower than that of the wild-type TA2 peptide. These mutants may become valuable if we want to evolve TA2 into an anchoring peptide for other microplastics. More work is needed to obtain further useful information.
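The size of such a mutation scan follows from simple counting: each position can be replaced by any of the 19 other standard amino acids, so a 43-residue peptide gives 43 × 19 = 817 single-point mutants. The Python sketch below enumerates them and shows one simple way to resume an interrupted batch. The sequence used here is a hypothetical 43-residue placeholder, not the TA2 sequence, and the helper names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_point_mutants(sequence):
    """Map mutation labels such as 'A1C' to the mutated sequences."""
    mutants = {}
    for position, wild_type in enumerate(sequence, start=1):
        for residue in AMINO_ACIDS:
            if residue == wild_type:
                continue
            label = f"{wild_type}{position}{residue}"
            mutants[label] = sequence[:position - 1] + residue + sequence[position:]
    return mutants


def pending_jobs(labels, finished):
    """Return the labels that still need docking, preserving order."""
    done = set(finished)
    return [label for label in labels if label not in done]


# Hypothetical 43-residue placeholder sequence (NOT the real TA2 sequence).
placeholder = (AMINO_ACIDS * 3)[:43]
mutants = single_point_mutants(placeholder)

# After an interruption, only the jobs without a result are run again.
remaining = pending_jobs(list(mutants), finished=["A1C", "A1D"])
```

In a real run, the `finished` list would be derived from the docking output files that already exist on disk.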

model-f3.jpg
Fig.3 Heatmap of single-point mutations of residues 1-43 of the TA2 anchoring peptide when binding to styrene.


Model 3 AI-guided evolution of MHETase and Identification


1. Background

In recent years, with the rapid development of structural biology, computational biology, and artificial intelligence technology, computer-guided protein design strategies have greatly promoted protein modification.

In our PET-degradation wet experiments, we expressed two PET degradation enzymes, i.e. Z1-PETase and MHETase. The latter enzyme catalyzes the conversion of MHET into TPA, but its catalytic activity is limited. In order to identify beneficial mutants exhibiting higher enzymatic activity, we used the recently published unsupervised protein evolution model ESM-IF1, a protein language model enhanced with structural information [6]. This model can effectively guide the evolution of proteins by combining amino acid sequence information with structural information. Next, we combined molecular docking and molecular dynamics simulations to theoretically compare and screen 20 variants recommended by the ESM-IF1 model. Following these steps, we expressed and purified 9 mutants and measured their catalytic activities.

model-f4.png
Fig.4 Technical roadmap of Model 3

2. Predicted mutation sites of MHETase based on the ESM-IF1 model


After setting up the operating environment of the ESM-IF1 model (https://github.com/varun-shanker/structural-evolution/blob/main/README.md), we input the PDB file of MHETase obtained from the PDB database (https://www.rcsb.org/) into the model and finally selected 33 suggested mutation sites, as shown in Tab.2.

Tab.2 33 suggested mutation sites obtained by the ESM-IF1 model

S235A H293E M192F K437A W453Y Y191I H467A S196A A181S W466V R411M
A553L W397E W397A N134A V341A M433A A119V K256R A310V A380Q T638D
D555S S305T A139D S260Q R318A R375S R116N Q458A N195L S286A W397A
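The labels in Tab.2 use the common one-letter notation: wild-type residue, position, substituted residue. The Python sketch below parses such a label and applies it to a sequence while checking that the wild-type residue matches. The helper names and the short test sequence are hypothetical and are used only to show the notation; they are not MHETase data.

```python
import re

MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a label such as 'S235A' into ('S', 235, 'A')."""
    match = MUTATION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a single-point mutation label: {label!r}")
    wild_type, position, substitute = match.groups()
    return wild_type, int(position), substitute


def apply_mutation(sequence, label):
    """Return the sequence with the mutation applied (1-based position)."""
    wild_type, position, substitute = parse_mutation(label)
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[position - 1] != wild_type:
        raise ValueError(
            f"expected {wild_type} at position {position}, "
            f"found {sequence[position - 1]}"
        )
    return sequence[:position - 1] + substitute + sequence[position:]


parsed = parse_mutation("S235A")
mutated = apply_mutation("MKSA", "S3T")  # hypothetical 4-residue sequence
```

The wild-type check catches numbering mismatches between a label and the structure file early, before any docking is run.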

3. Molecular docking and MD between MHETase variants and MHET


To explore how these suggested variants of MHETase affect the interaction with the substrate MHET, we first used AlphaFold 3 to predict the structures of these variants [7]. Fig.5A gives an example: the predicted structure of the MHETase R318A mutant. Fig.5B shows the PAE plot of the best conformation and some other parameters, reflecting the high accuracy of the predicted structure. The 3D structure of the substrate MHET was obtained from the PubChem database (CID: 22062452).

model-f5.png
Fig.5 AlphaFold 3 prediction results for an MHETase mutant, including the predicted structure of the best conformation, the pLDDT plot, the PAE plot, and the pTM score.

After preparing both the MHETase variants and the MHET molecule, we used AutoDock Vina to dock MHET into each MHETase mutant. We found that 9 mutants exhibit higher affinity than WT MHETase; the results are shown in Fig.6. Fig.7 displays some of the interaction changes between MHETase and the MHET molecule. We also carried out MD simulations using the GROMACS software package, with the temperature (300 K) and pressure (1 bar) controlled by the Berendsen algorithm [9,10], to help explain why some variants exhibit higher affinity than others. As seen from Fig.8, the positive mutants S350T and N134A have smaller RMSF values in the active pocket region (residues 150-250). In contrast, the negative mutants R318A and R116N have higher RMSF values in this region, which may indicate that the structural stability of the active pocket contributes to the high affinity between MHETase and the MHET molecule.
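For reference, the RMSF of an atom is the square root of the time-averaged squared deviation of its position from its mean position. A minimal Python sketch of this definition follows; it assumes the frames have already been aligned to a common reference and uses a hypothetical two-atom toy trajectory, whereas in practice the values would come from an MD analysis tool such as `gmx rmsf` in GROMACS.

```python
import math


def rmsf(trajectory):
    """Root-mean-square fluctuation per atom.

    trajectory: list of frames, each frame a list of (x, y, z) tuples,
    one tuple per atom. Frames are assumed to be pre-aligned.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    values = []
    for atom in range(n_atoms):
        # Mean position of this atom over all frames.
        mean = [
            sum(frame[atom][axis] for frame in trajectory) / n_frames
            for axis in range(3)
        ]
        # Mean squared deviation from that mean position.
        msd = sum(
            sum((frame[atom][axis] - mean[axis]) ** 2 for axis in range(3))
            for frame in trajectory
        ) / n_frames
        values.append(math.sqrt(msd))
    return values


# Toy trajectory: atom 0 is rigid, atom 1 oscillates by 1 unit along x.
toy_trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
fluctuations = rmsf(toy_trajectory)
```

A lower value for the residues lining a pocket therefore means that those residues stay closer to their average positions during the simulation.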

model-f6.jpg
Fig.6 Comparison of affinity between MHETase mutants and MHET molecule calculated by molecular docking

model-f7.png
Fig.7 Interaction information between mutants and MHET in MHET-binding pocket of MHETase

model-f8.jpg
Fig.8 Comparison of RMSF between MHETase variants and MHET obtained by MD

4. PCR-based site-directed mutagenesis and enzymatic activity detection


After theoretically screening 9 MHETase variants, we carried out PCR-based site-directed mutagenesis; the PCR results are shown in Fig.9. After DNA sequencing, we expressed and purified these 9 variants. Enzymatic activity was measured using HPLC, and the results are shown in Tab.3. Six variants showed higher enzymatic activity than WT MHETase (Fig.10). Among them, the K442A variant is the most beneficial candidate, with its activity increased ~3.2-fold relative to WT.

model-f9.png
Fig.9 Agarose gel electrophoresis of the constructed mutants

model-f10.png
Fig.10 Comparison of enzyme activity between wild type and mutant using MHET as substrate

Tab.3 Comparison of enzyme activity between wild type and mutants

Name | Amino acid substitution | Activity (IU/mL) | Fold (relative to WT)
WT | - | 17.4333 | 1
K442A | Lys442Ala | 55.5567 | 3.19
A380Q | Ala380Gln | 45.8500 | 2.63
W397A | Trp397Ala | 39.0967 | 2.24
N134A | Asn134Ala | 25.8867 | 1.48
S286A | Ser286Ala | 20.4333 | 1.17
S350T | Ser350Thr | 19.9087 | 1.14
W397E | Trp397Glu | 15.7100 | 0.90
R318A | Arg318Ala | 14.6933 | 0.84
R116N | Arg116Asn | 11.3887 | 0.65
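The fold values in Tab.3 are the ratio of each variant's activity to the WT activity. The short Python sketch below recomputes them from the activities listed in the table; the variable and function names are illustrative only.

```python
# Activities in IU/mL, copied from Tab.3.
ACTIVITY = {
    "WT": 17.4333,
    "K442A": 55.5567,
    "A380Q": 45.8500,
    "W397A": 39.0967,
    "N134A": 25.8867,
    "S286A": 20.4333,
    "S350T": 19.9087,
    "W397E": 15.7100,
    "R318A": 14.6933,
    "R116N": 11.3887,
}


def fold_changes(activity, reference="WT"):
    """Activity of each variant divided by the reference activity."""
    base = activity[reference]
    return {
        name: round(value / base, 2)
        for name, value in activity.items()
        if name != reference
    }


folds = fold_changes(ACTIVITY)
improved = [name for name, fold in folds.items() if fold > 1]
print(folds)
print("variants above WT:", improved)
```

Six variants come out above 1, matching the number reported in the text.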

We have summarized the computational design and experimental results for this PET-degrading enzyme MHETase in a Chinese patent application, as shown in Fig.11. The application has currently been accepted.

model-f12.jpg
Fig.11 Proof of acceptance of the patent application issued by the China National Intellectual Property Administration (CNIPA)

References

  1. Luo, Y.; Su, W.; Xu, X.; Xu, D.; Wang, Z.; Wu, H.; Chen, B.; Wu, J. Raman Spectroscopy and Machine Learning for Microplastics Identification and Classification in Water Environments. IEEE J. Sel. Top. Quantum Electron. 2023, 29 (4: Biophotonics), 1–8.
  2. Si-heng Luo. PEER-denoising-algorithm. 2023. https://github.com/3331822w/PEER-denoising-algorithm (accessed on Sep. 20, 2024).
  3. Eberhardt, J.; Santos-Martins, D.; Tillack, A. F.; Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 2021, 61 (8), 3891–3898.
  4. Collier, T. A.; Piggot, T. J.; Allison, J. R. Molecular Dynamics Simulation of Proteins. Methods Mol. Biol. Clifton NJ 2020, 2073, 311–327.
  5. https://2016.igem.org/Team:UESTC-China/Model/Docking-Simulation (accessed on Sep. 20, 2024).
  6. Shanker, V. R.; Bruun, T. U. J.; Hie, B. L.; Kim, P. S. Unsupervised Evolution of Protein and Antibody Complexes with a Structure-Informed Language Model. Science 2024, 385 (6704), 46–53.
  7. Abramson, J.; Adler, J.; Dunger, J.; Evans, R.; Green, T.; Pritzel, A.; Ronneberger, O.; Willmore, L.; Ballard, A. J.; Bambrick, J.; Bodenstein, S. W.; Evans, D. A.; Hung, C.-C.; O’Neill, M.; Reiman, D.; Tunyasuvunakool, K.; Wu, Z.; Žemgulytė, A.; Arvaniti, E.; Beattie, C.; Bertolli, O.; Bridgland, A.; Cherepanov, A.; Congreve, M.; Cowen-Rivers, A. I.; Cowie, A.; Figurnov, M.; Fuchs, F. B.; Gladman, H.; Jain, R.; Khan, Y. A.; Low, C. M. R.; Perlin, K.; Potapenko, A.; Savy, P.; Singh, S.; Stecula, A.; Thillaisundaram, A.; Tong, C.; Yakneen, S.; Zhong, E. D.; Zielinski, M.; Žídek, A.; Bapst, V.; Kohli, P.; Jaderberg, M.; Hassabis, D.; Jumper, J. M. Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 2024, 630 (8016), 493–500.
  8. Leman, J. K.; Weitzner, B. D.; Lewis, S. M.; Adolf-Bryfogle, J.; Alam, N.; Alford, R. F.; Aprahamian, M.; Baker, D.; Barlow, K. A.; Barth, P.; Basanta, B.; Bender, B. J.; Blacklock, K.; Bonet, J.; Boyken, S. E.; Bradley, P.; Bystroff, C.; Conway, P.; Cooper, S.; Correia, B. E.; Coventry, B.; Das, R.; De Jong, R. M.; DiMaio, F.; Dsilva, L.; Dunbrack, R.; Ford, A. S.; Frenz, B.; Fu, D. Y.; Geniesse, C.; Goldschmidt, L.; Gowthaman, R.; Gray, J. J.; Gront, D.; Guffy, S.; Horowitz, S.; Huang, P.-S.; Huber, T.; Jacobs, T. M.; Jeliazkov, J. R.; Johnson, D. K.; Kappel, K.; Karanicolas, J.; Khakzad, H.; Khar, K. R.; Khare, S. D.; Khatib, F.; Khramushin, A.; King, I. C.; Kleffner, R.; Koepnick, B.; Kortemme, T.; Kuenze, G.; Kuhlman, B.; Kuroda, D.; Labonte, J. W.; Lai, J. K.; Lapidoth, G.; Leaver-Fay, A.; Lindert, S.; Linsky, T.; London, N.; Lubin, J. H.; Lyskov, S.; Maguire, J.; Malmström, L.; Marcos, E.; Marcu, O.; Marze, N. A.; Meiler, J.; Moretti, R.; Mulligan, V. K.; Nerli, S.; Norn, C.; Ó’Conchúir, S.; Ollikainen, N.; Ovchinnikov, S.; Pacella, M. S.; Pan, X.; Park, H.; Pavlovicz, R. E.; Pethe, M.; Pierce, B. G.; Pilla, K. B.; Raveh, B.; Renfrew, P. D.; Burman, S. S. R.; Rubenstein, A.; Sauer, M. F.; Scheck, A.; Schief, W.; Schueler-Furman, O.; Sedan, Y.; Sevy, A. M.; Sgourakis, N. G.; Shi, L.; Siegel, J. B.; Silva, D.-A.; Smith, S.; Song, Y.; Stein, A.; Szegedy, M.; Teets, F. D.; Thyme, S. B.; Wang, R. Y.-R.; Watkins, A.; Zimmerman, L.; Bonneau, R. Macromolecular Modeling and Design in Rosetta: Recent Methods and Frameworks. Nat. Methods 2020, 17 (7), 665–680.
  9. Eberhardt, J.; Santos-Martins, D.; Tillack, A. F.; Forli, S. AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. J. Chem. Inf. Model. 2021, 61 (8), 3891–3898.
  10. Chen, I. J.; Yin, D.; MacKerell, A. D. Combined Ab Initio /Empirical Approach for Optimization of Lennard‐Jones Parameters for Polar‐neutral Compounds. J. Comput. Chem. 2002, 23 (2), 199–213.