Model 1 Identification of microplastics using machine learning on Raman spectroscopic data
1. Background
It is important to detect and identify microplastics in the environment. In our wet-lab experiments, we developed a rapid and convenient method to detect microplastics; however, it cannot distinguish the type of microplastic because the anchoring peptide has low specificity. We therefore plan to combine Raman spectroscopy with machine learning to identify different microplastics. Although some researchers have carried out similar work in recent years [1], the prediction accuracy might still be improved.
2. Modeling Process
2.1 Data Preprocessing
Raman spectroscopy data primarily consist of two columns: wavenumber and intensity. The wavenumber, measured in inverse centimeters (cm⁻¹), is the frequency difference between the incident and scattered light and identifies the functional groups of each compound. The other column, intensity, reflects the magnitude of the scattered light energy. In the model-building process, since the wavenumber values are mainly determined by the settings of the Raman instrument and by themselves contribute little to compound identification, we primarily extract the intensity values for subsequent analysis. The Raman signal is rather weak and is heavily affected by the fluorescent background and by various other noise sources, mainly originating from the CCD array detector, including photon shot noise and dark noise. We therefore first employ a denoising algorithm based on Peak Extraction and Retention (PEER) to improve data quality and reduce noise [2]. The first and second derivatives of the Raman spectrum are analyzed to locate the Raman peaks, ensuring that they are unaffected during noise reduction. Subsequently, an optimized window smoothing technique is applied to the remaining part of the Raman spectral data, which is then combined with the untreated Raman peaks to obtain a high-quality denoised Raman spectrum. This method not only effectively reduces background noise but also retains the height and shape of the peaks as far as possible, thus improving the accuracy of the spectral data.
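The idea of smoothing the background while leaving peaks untouched can be illustrated with a minimal sketch. This is not the PEER implementation from [2]: the derivative threshold, the moving-average window, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def peak_mask(intensity, threshold):
    """Flag points whose first or second derivative is large (assumed peak regions)."""
    d1 = np.gradient(intensity)
    d2 = np.gradient(d1)
    return (np.abs(d1) > threshold) | (np.abs(d2) > threshold)

def moving_average(intensity, window):
    """Centered moving average; `window` must be odd. Edges are padded with edge values."""
    pad = window // 2
    padded = np.pad(intensity, pad, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

def denoise_keep_peaks(intensity, threshold, window=9):
    """Smooth the non-peak points and keep the flagged peak points unchanged."""
    intensity = np.asarray(intensity, dtype=float)
    mask = peak_mask(intensity, threshold)
    smoothed = moving_average(intensity, window)
    return np.where(mask, intensity, smoothed)
```

Both `threshold` and `window` would need to be tuned to the noise level and peak width of the real spectra; too low a threshold marks noise as peaks, too high a threshold lets peak tops be smoothed.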
2.2 Model Selection and Training
The objective of this study is to use the specificity of Raman spectroscopy data to identify types of microplastics, which is a typical classification problem. In selecting an appropriate classifier, we considered the following common algorithms:
- Logistic Regression
- Naive Bayes
- Decision Tree
- Support Vector Machines
Among the candidate models, we selected Random Forest as our primary classifier because it improves the accuracy and stability of predictions by constructing multiple decision trees and aggregating their outputs. Random Forest also performs and generalizes well on datasets with high-dimensional features.
During the model training phase, the dataset was split into a training set and a test set at an 8:2 ratio. To assess the stability and accuracy of the model, the K-Fold Cross-Validation method was used. Additionally, to optimize model parameters and further improve performance, we conducted a Grid Search to determine the best model parameters, including the number of decision trees (n_estimators), the maximum number of features to consider when splitting a node (max_features), the maximum depth of the trees (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples required for a leaf node (min_samples_leaf). The results of the grid search helped us choose a higher number of trees, a moderate tree depth, and an ideal number of features to achieve accurate microplastic classification while maintaining high computational efficiency.
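A minimal sketch of the split-and-search procedure described above, assuming scikit-learn is used (the hyperparameter names in the text match its `RandomForestClassifier`). The stratified split, the random seed, the synthetic demo data, and the helper names are assumptions for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def make_demo_spectra(n_per_class=20, n_points=30, seed=0):
    """Hypothetical stand-in for real intensity vectors: three shifted Gaussian clouds."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(["PET", "PP", "PS"], n_per_class)
    centers = {"PET": 0.0, "PP": 5.0, "PS": 10.0}
    X = np.vstack([rng.normal(centers[c], 1.0, size=n_points) for c in labels])
    return X, labels

def tune_random_forest(X, y, param_grid, cv=5, seed=0):
    """8:2 train/test split, then a cross-validated grid search on the training set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid,
        cv=cv,
        scoring="accuracy",
    )
    search.fit(X_train, y_train)
    # The held-out 20 % is scored with the refitted best estimator.
    return search.best_params_, search.score(X_test, y_test)
```

A grid such as `{"n_estimators": [100, 300], "max_depth": [5, 10, None], "max_features": ["sqrt", "log2"]}` could be passed as `param_grid`; these values are placeholders, not the grid searched here.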
Fold | Accuracy | PET Precision | PET Recall | PET F1-Score | PP Precision | PP Recall | PP F1-Score | PS Precision | PS Recall | PS F1-Score |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
4 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
5 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 1.00 | 0.98 | 1.00 | 0.94 |
To check that the model performs consistently across different data subsets, K-Fold Cross-Validation was used to validate the model. In this method, the original data is divided evenly into K subsets; in each iteration, K-1 subsets are used for training and the remaining subset is used for testing. This process is repeated K times, each time with a different test subset, which assesses the model’s generalization capability on unseen data. According to the cross-validation results, the model performed consistently across all folds, demonstrating a high degree of robustness and accuracy. As shown in Tab.1, despite a slight decline in precision for PP in the fifth fold, the model's overall average accuracy still reached 0.9967, indicating its reliability and effectiveness in practical applications.
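The fold construction and the per-class metrics reported in Tab.1 can be sketched in plain NumPy as follows. The function names and the shuffling seed are assumptions; in practice a library routine (for example scikit-learn's K-fold utilities) would normally be used instead.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from the predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    result = {}
    for label in labels:
        tp = int(np.sum((y_pred == label) & (y_true == label)))
        fp = int(np.sum((y_pred == label) & (y_true != label)))
        fn = int(np.sum((y_pred != label) & (y_true == label)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        result[label] = (float(precision), float(recall), float(f1))
    return result
```

For each fold, a classifier would be fitted on the `train` indices, used to predict the `test` indices, and the predictions passed to `per_class_metrics` with the labels PET, PP, and PS.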
3. Conclusion
This study utilized the combination of Raman spectroscopy and the Random Forest model to demonstrate efficient and accurate performance in the task of microplastic type identification. Although the model has not yet been applied directly to precise estimation of microplastic concentrations, its high specificity in identifying microplastic types has proven its potential as an effective auxiliary tool. By integrating with traditional colorimetric methods, this approach offers a more comprehensive solution for our research project, enhancing the accuracy of microplastic type identification and expanding the scope of microplastic monitoring and analysis.
Overall, the integration of Raman spectroscopy with machine learning technology provides an innovative analytical tool in the field of environmental science, especially showing great potential in handling complex environmental samples. Future research may explore combining this model with other spectroscopic techniques or further optimizing algorithms to expand its applications in environmental monitoring and pollution assessment. Additionally, enhancing the model's capability for estimating microplastic concentrations will be an important direction for subsequent research, potentially providing a more precise tool for the quantitative analysis of microplastic pollution.
Model 2 Virtual Screening of anchoring peptide TA2
1. Background
As found in our wet-lab experiments, the anchoring peptide TA2 exhibits high affinity for various microplastics. This means that TA2 has no specificity for any one type of microplastic, for example PS, which is common in the environment. To enhance the specificity of the TA2 anchoring peptide, we conducted a molecular docking-based mutation-scanning virtual screen and obtained some useful information.
2. Theoretical basis of docking and Molecular Dynamics Simulation
Docking and MD were carried out in the following five steps:
- Receptor selection and preparation
- Ligand selection and preparation
- Docking
- Molecular Dynamics (MD)
- Evaluation of docking and MD results
Molecular docking is a computational method that aims to predict the optimal conformation and orientation of a ligand bound to a protein, i.e. the pose that gives the complex the lowest free energy. Molecular docking consists of two main components: a scoring function and a search algorithm. The scoring function evaluates the binding energy and stability of a given protein-ligand pose by considering various factors such as intermolecular interactions, solvation effects, and entropy. The search algorithm explores the conformational and rotational space of the protein and ligand to find the best-scoring pose [3].
Molecular docking can also predict how mutations affect the binding and stability of a complex by comparing the docking poses of wild-type and mutant proteins with the same ligand. However, molecular docking has some limitations: the effects of solvation, entropy, and flexibility on binding are neglected or treated only approximately.
Therefore, molecular docking results need to be validated through experimental data or further refined through more accurate methods, such as molecular dynamics (MD).
MD simulation can capture the dynamic behavior of proteins and ligands in solution, and explain their conformational changes, fluctuations, and interactions with solvent molecules [4].
MD simulation includes three main steps: initialization, integration, and analysis. The initialization step sets the initial coordinates, velocities, forces, and parameters of the system. The integration step uses numerical methods (such as the Verlet or leapfrog algorithms) to propagate the system forward in time. The analysis step extracts relevant information from the trajectory, such as structural, thermodynamic, dynamic, and binding properties.
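The integration step can be illustrated with a minimal velocity Verlet sketch for a single one-dimensional particle. It shows only the update rule; the function name, the toy harmonic force used in the usage note, and the unit-free parameters are assumptions and bear no relation to the force fields used in production MD packages.

```python
import numpy as np

def velocity_verlet(x0, v0, force, mass, dt, n_steps):
    """Propagate one 1D particle; returns an array of (position, velocity) rows."""
    x, v = float(x0), float(v0)
    f = force(x)
    trajectory = np.empty((n_steps + 1, 2))
    trajectory[0] = x, v
    for step in range(1, n_steps + 1):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory[step] = x, v
    return trajectory
```

For example, `velocity_verlet(1.0, 0.0, lambda x: -x, mass=1.0, dt=0.01, n_steps=1000)` propagates a harmonic oscillator; the total energy stays close to its initial value, which is the property that makes this family of integrators suitable for long simulations.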
3. Receptor selection and preparation
The structure of the TA2 peptide was constructed using AlphaFold 3, a state-of-the-art deep learning method that predicts protein structure from the amino acid sequence. The predicted structure has a high confidence level, as shown by the pLDDT score, which measures the confidence of the prediction for each residue. The structure was converted into a PDB file using PyMOL.
The mutant protein structures were derived from the wild-type structure using FoldX, a software package mainly used to model mutations and analyze protein stability. The PDB file of each mutant was output by FoldX.
We wrote a Python script that uses MGLTools to batch convert PDB format to PDBQT format, simplifying the tedious step of receptor preprocessing in molecular docking.
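A batch conversion of this kind might look like the following sketch. It is not the script used in this project: it assumes that the MGLTools interpreter `pythonsh` and the AutoDockTools utility `prepare_receptor4.py` (with its `-r` input and `-o` output options) are reachable at the given paths, and the function names and directory layout are hypothetical.

```python
from pathlib import Path
import subprocess

def build_receptor_command(pdb_path, out_dir,
                           pythonsh="pythonsh",
                           script="prepare_receptor4.py"):
    """Build the command that converts one PDB file to PDBQT (paths are assumptions)."""
    pdb_path = Path(pdb_path)
    out_path = Path(out_dir) / (pdb_path.stem + ".pdbqt")
    return [pythonsh, script, "-r", str(pdb_path), "-o", str(out_path)]

def convert_all(pdb_dir, out_dir, run=subprocess.run, **tool_paths):
    """Convert every *.pdb file in pdb_dir; returns the list of executed commands."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [
        build_receptor_command(pdb, out_dir, **tool_paths)
        for pdb in sorted(Path(pdb_dir).glob("*.pdb"))
    ]
    for command in commands:
        run(command, check=True)  # raises if the conversion tool fails
    return commands
```

The `run` argument exists only so that the command construction can be checked without MGLTools installed; in normal use it is left at its default.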
4. Ligand selection and preparation
- Retrieve the three-dimensional structure of styrene from the PubChem database (CID: 798).
- Add hydrogens to the styrene molecule using Materials Studio 2020.
- Convert SDF files of styrene to PDBQT format using MGLTools.
5. Molecular docking
In this part, we used the AutoDock Vina program to dock 817 single-point mutants of the TA2 anchoring peptide against styrene as the ligand [5]. We also wrote a batch (.bat) file that runs the docking jobs in bulk and supports resuming at any point after an interruption. We aimed to explore the effect of these mutations on the binding affinity between the TA2 protein and styrene. Fig.3 shows the heatmap of the interaction between the TA2 mutants (1-40AA) and styrene. The deeper the colour, the stronger the binding affinity. From it we can quickly identify and compare the binding affinity changes of different mutants, providing valuable information for further protein engineering. For example, Fig.3 shows three mutants, namely ALA13, ARG6, and ARG24, whose binding affinity was lower than that of the wild-type TA2 protein. These mutants will probably become valuable if we want to evolve TA2 into an anchoring peptide for other microplastics. More work needs to be done to obtain more useful information.
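The resumable batch docking idea can be sketched in Python as follows. This is not the .bat file used in this project: the `vina` executable name, the `_out.pdbqt` naming scheme, the config file, and all function names are assumptions. The command uses AutoDock Vina's `--receptor`, `--ligand`, `--config`, and `--out` options, and the affinity is read from the `REMARK VINA RESULT:` line that Vina writes into its output file.

```python
from pathlib import Path
import subprocess

def output_path(receptor, out_dir):
    """Assumed naming scheme: <receptor stem>_out.pdbqt inside out_dir."""
    return Path(out_dir) / (Path(receptor).stem + "_out.pdbqt")

def pending_receptors(receptor_dir, out_dir):
    """Receptors without an output file yet; this is what makes the batch resumable."""
    receptors = sorted(Path(receptor_dir).glob("*.pdbqt"))
    return [r for r in receptors if not output_path(r, out_dir).exists()]

def vina_command(receptor, ligand, config, out_dir, vina="vina"):
    return [vina, "--receptor", str(receptor), "--ligand", str(ligand),
            "--config", str(config), "--out", str(output_path(receptor, out_dir))]

def run_batch(receptor_dir, ligand, config, out_dir, run=subprocess.run):
    """Dock every pending receptor; returns the receptors processed in this call."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    todo = pending_receptors(receptor_dir, out_dir)
    for receptor in todo:
        run(vina_command(receptor, ligand, config, out_dir), check=True)
    return todo

def best_affinity(pdbqt_text):
    """Affinity (kcal/mol) of the first, i.e. best-scoring, pose, or None if absent."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            return float(line.split()[3])
    return None
```

One caveat of this simple scheme: a job interrupted while its output file is being written would leave a partial file that is then skipped on restart, so such a file should be deleted before rerunning.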
Model 3 AI-guided evolution of MHETase and Identification
1. Background
In recent years, with the rapid development of structural biology, computational biology, and artificial intelligence technology, computer-guided protein design strategies have greatly promoted protein modification.
In our PET-degradation wet-lab experiments, we expressed two PET degradation enzymes, i.e. Z1-PETase and MHETase. The latter enzyme catalyzes the conversion of MHET into TPA, but its catalytic activity is limited. To identify beneficial mutants with higher enzymatic activity, we used the newly published unsupervised protein evolution model ESM-IF1, a protein language model enhanced with structural information [6]. This model can effectively guide the evolution of proteins by combining amino acid sequence information with structural information. Next, we combined molecular docking and molecular dynamics simulations to theoretically compare and screen 20 variants recommended by the ESM-IF1 model. Following these steps, we expressed and purified 9 mutants and measured their catalytic activities.
2. Predicted mutation sites of MHETase based on the ESM-IF1 model
After setting up the operating environment of the ESM-IF1 model (https://github.com/varun-shanker/structural-evolution/blob/main/README.md), we supplied the PDB file of MHETase, obtained from the PDB database (https://www.rcsb.org/), to the model and finally selected 33 suggested mutations, as shown in Tab.2.
S235A | H293E | M192F | K437A | W453Y | Y191I | H467A | S196A | A181S | W466V | R411M |
A553L | W397E | W397A | N134A | V341A | M433A | A119V | K256R | A310V | A380Q | T638D |
D555S | S305T | A139D | S260Q | R318A | R375S | R116N | Q458A | N195L | S286A | W397A |
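Mutation codes of the form used in Tab.2 (wild-type residue, position, new residue) can be parsed and applied to a sequence with a short helper like the one below. This is an illustrative sketch only: the function names and the `first_residue` numbering parameter are assumptions, and the toy sequence in the usage note is not the MHETase sequence.

```python
import re

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_mutation(code):
    """Split a code such as 'S235A' into ('S', 235, 'A')."""
    match = MUTATION_RE.match(code.strip())
    if match is None:
        raise ValueError(f"not a single-point mutation code: {code!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant

def apply_mutation(sequence, code, first_residue=1):
    """Return the mutated sequence after checking the wild-type residue.

    first_residue is the number of the first residue in `sequence`
    (structure files do not always start numbering at 1).
    """
    wild_type, position, mutant = parse_mutation(code)
    index = position - first_residue
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"{code}: expected {wild_type} at {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]
```

For instance, `apply_mutation("MKSA", "S3A")` returns `"MKAA"`. The wild-type check catches numbering mismatches between a mutation list and the sequence before any structure is predicted.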
3. Molecular docking and MD between MHETase variants and MHET
To explore how these suggested variants of MHETase affect the interaction with the substrate MHET, we first used AlphaFold 3 to predict the structures of these variants [7]. Fig.5A gives an example of the structure of the MHETase R318A mutant predicted by AlphaFold 3. Fig.5B shows the PAE score of the best conformation, together with some other parameters, reflecting the high accuracy of the predicted structure. The 3D structure of the substrate MHET was obtained from the PubChem database (CID: 22062452).
After preparing both the MHETase variants and the MHET molecule, we used AutoDock Vina to dock the MHET molecule to these MHETase mutants. We found that 9 mutants exhibited higher affinity than WT MHETase, and the results are shown in Fig.6. Fig.7 displays some of the interaction changes between MHETase and the MHET molecule. We also carried out MD using the GROMACS software package, with a temperature of 300 K and a pressure of 1 bar controlled by the Berendsen algorithm [9,10], to help explain why these variants exhibited higher affinity than other mutants. As seen from Fig.6, the positive mutants S350T and N134A have smaller RMSF values in the 150-250AA active pocket region. In contrast, the negative mutants R318A and R116N both have higher RMSF values in the active pocket region, which may indicate that the structural stability of the active pocket contributes to high affinity between MHETase and the MHET molecule.
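For reference, the per-residue RMSF compared above is the root-mean-square deviation of each residue's position from its time-averaged position. A minimal NumPy sketch is given below; it assumes the trajectory has already been aligned to a reference structure and reduced to one coordinate per residue (for example the Cα atom), and the function names are hypothetical. GROMACS provides this analysis through `gmx rmsf`.

```python
import numpy as np

def rmsf(coords):
    """Per-residue RMSF from an aligned trajectory of shape (frames, residues, 3)."""
    coords = np.asarray(coords, dtype=float)
    mean_position = coords.mean(axis=0)                        # (residues, 3)
    squared_dev = ((coords - mean_position) ** 2).sum(axis=2)  # (frames, residues)
    return np.sqrt(squared_dev.mean(axis=0))                   # (residues,)

def mean_rmsf_in_region(rmsf_values, first, last):
    """Average RMSF over residues first..last (1-based, inclusive)."""
    return float(np.mean(np.asarray(rmsf_values)[first - 1:last]))
```

Calling `mean_rmsf_in_region(values, 150, 250)` for each variant would give one number per variant for the pocket region discussed above, which makes the comparison between mutants easier to tabulate.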
4. PCR-directed mutagenesis and enzymatic activity detection
After theoretically screening 9 MHETase variants, we carried out PCR-directed mutagenesis; the PCR results are shown in Fig.9. After confirming the constructs by DNA sequencing, we expressed and purified these 9 variants. Enzymatic activity was measured using HPLC, and the results are shown in Tab.3. Six variants showed higher enzymatic activity than WT MHETase (see Fig.10). Among them, the K442A variant is the most beneficial candidate, with its activity increased to about 3.2-fold that of the WT.
Name | Amino acid substitution position | IU/mL | Fold |
---|---|---|---|
WT | - | 17.4333 | 1 |
K442A | Lys442Ala | 55.5567 | 3.19 |
A380Q | Ala380Gln | 45.8500 | 2.63 |
W397A | Trp397Ala | 39.0967 | 2.24 |
N134A | Asn134Ala | 25.8867 | 1.48 |
S286A | Ser286Ala | 20.4333 | 1.17 |
S350T | Ser350Thr | 19.9087 | 1.14 |
W397E | Trp397Glu | 15.7100 | 0.90 |
R318A | Arg318Ala | 14.6933 | 0.84 |
R116N | Arg116Asn | 11.3887 | 0.65 |
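The Fold column of Tab.3 is the activity of each variant divided by the WT activity. A small sketch of this arithmetic, using the IU/mL values from the table (the function name is hypothetical):

```python
def fold_changes(activities, reference="WT", digits=2):
    """Activity of each entry relative to the reference entry, rounded to `digits`."""
    base = activities[reference]
    return {name: round(value / base, digits) for name, value in activities.items()}

# IU/mL values copied from Tab.3
ACTIVITIES = {
    "WT": 17.4333, "K442A": 55.5567, "A380Q": 45.8500, "W397A": 39.0967,
    "N134A": 25.8867, "S286A": 20.4333, "S350T": 19.9087, "W397E": 15.7100,
    "R318A": 14.6933, "R116N": 11.3887,
}
```

Calling `fold_changes(ACTIVITIES)` reproduces the Fold column, and counting the entries greater than 1 gives the six variants with higher activity than WT.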
We have summarized the computational design and experimental results for the PET-degrading enzyme MHETase in a Chinese patent application, as shown in Fig.11. The application has currently been accepted.