Model 1 Identification of microplastics using machine learning on Raman spectroscopic data

1. Background

It is important to detect and identify microplastics in the environment. In our wet-lab experiments, we developed a rapid and convenient method to detect microplastics; however, it cannot distinguish the type of microplastic because the anchoring peptide has low specificity. We therefore combine Raman spectroscopy with machine learning to identify different microplastics. Although some researchers have carried out similar work in recent years [1], the prediction accuracy may still be improved.

2. Modeling Process

2.1 Data Preprocessing

Raman spectroscopy data primarily consist of two columns: wavenumber and intensity. The wavenumber, measured in inverse centimeters (cm⁻¹), is the frequency difference between incident and scattered light and provides specific identification of the functional groups of each compound. The other column, intensity, reflects the magnitude of the scattered light energy. In the model-building process, since the wavenumber axis is mainly determined by the settings of the Raman instrument and by itself contributes little to compound identification, we primarily extract the intensity values for subsequent analysis. The Raman signal is rather weak and is heavily affected by fluorescent background and various other noise signals, which mainly originate from the CCD array detector and include photon shot noise and dark noise. We therefore first employ a denoising algorithm based on Peak Extraction and Retention (PEER) to improve data quality and reduce noise ². The first and second derivatives of the Raman spectrum are analyzed to locate the Raman peaks and to ensure that they are unaffected during noise reduction. Subsequently, an optimized window smoothing technique is applied to the remaining (non-peak) spectral data, which is then combined with the untreated Raman peaks to obtain a high-quality denoised Raman spectrum. This method not only effectively reduces background noise but also retains the height and shape of the peaks as far as possible, thus improving the quality of the spectral data.
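The idea of smoothing only the non-peak regions can be illustrated with a minimal sketch. This is not the published PEER algorithm: the peak detection below uses a simple threshold on the second derivative, and the function name, window size, threshold `k`, and padding `pad` are all assumptions made for illustration.

```python
import numpy as np

def peak_preserving_smooth(intensity, window=9, k=3.0, pad=3):
    """Moving-average smoothing of non-peak regions; peak regions are kept raw.

    Assumed parameters (not from the source): odd `window` size, threshold
    factor `k`, and `pad` points protected on each side of a detected peak.
    """
    y = np.asarray(intensity, dtype=float)
    # Second derivative: strongly negative at the apex of a sharp peak
    d2 = np.gradient(np.gradient(y))
    # Robust noise scale of the second derivative (median absolute deviation)
    mad = np.median(np.abs(d2 - np.median(d2))) + 1e-12
    peak_mask = d2 < -k * 1.4826 * mad
    # Widen each detected point so the whole peak top is protected
    widened = np.convolve(peak_mask.astype(float),
                          np.ones(2 * pad + 1), mode="same") > 0.5
    # Moving average with edge padding so the output keeps the input length
    padded = np.pad(y, window // 2, mode="edge")
    smoothed = np.convolve(padded, np.ones(window) / window, mode="valid")
    # Untreated peaks are combined with the smoothed remainder
    return np.where(widened, y, smoothed), widened
```

The function returns both the denoised spectrum and the boolean peak mask, so the protected regions can be inspected before the spectrum is passed on to the classifier.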

2.2 Model Selection and Training

The objective of this study is to use the specificity of Raman spectroscopy data to identify types of microplastics, which is a typical classification problem. In selecting an appropriate classifier, we considered the following common algorithms:

Logistic Regression
Naive Bayes
Decision Tree
Support Vector Machines

Among numerous candidate models, we selected Random Forest as our primary classifier due to its ability to enhance the accuracy and stability of predictions by constructing multiple decision trees and integrating their outcomes. Random Forest demonstrates excellent performance and generalization capabilities when dealing with datasets that contain high-dimensional features.

During the model training phase, the dataset was split into a training set and a test set at an 8:2 ratio. To assess the stability and accuracy of the model, the K-Fold Cross-Validation method was used. Additionally, to optimize model parameters and further improve performance, we conducted a Grid Search to determine the best model parameters, including the number of decision trees (n_estimators), the maximum number of features to consider when splitting a node (max_features), the maximum depth of the trees (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples required for a leaf node (min_samples_leaf). The results of the grid search helped us choose a higher number of trees, a moderate tree depth, and an ideal number of features to achieve accurate microplastic classification while maintaining high computational efficiency.
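The training procedure described above can be sketched with scikit-learn, whose `RandomForestClassifier` uses the same parameter names. The synthetic spectra, the grid values, the stratified split, and the function names are assumptions for illustration only; they are not the data or the grid used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def make_synthetic_spectra(n_per_class=40, n_points=200, seed=0):
    """Hypothetical stand-in data: one Gaussian peak per class plus noise."""
    rng = np.random.default_rng(seed)
    axis = np.arange(n_points)
    centers = {"PET": 50, "PP": 100, "PS": 150}  # arbitrary peak positions
    X, y = [], []
    for label, center in centers.items():
        peak = np.exp(-0.5 * ((axis - center) / 5.0) ** 2)
        for _ in range(n_per_class):
            X.append(peak + rng.normal(0.0, 0.1, n_points))
            y.append(label)
    return np.array(X), np.array(y)

def tune_random_forest(X, y, param_grid=None, seed=0):
    """8:2 split, then grid search with 5-fold CV on the training part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    if param_grid is None:
        # Illustrative values only; the actual grid is not given in the text
        param_grid = {
            "n_estimators": [50, 100],
            "max_features": ["sqrt", "log2"],
            "max_depth": [None, 10],
            "min_samples_split": [2, 5],
            "min_samples_leaf": [1, 2],
        }
    search = GridSearchCV(RandomForestClassifier(random_state=seed),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # Accuracy of the refitted best model on the held-out 20 %
    return search.best_params_, search.score(X_test, y_test)
```

Keeping the test split out of the grid search means the reported test accuracy is not influenced by the parameter selection.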

Tab.1 K-Fold Cross-Validation Results of the Model

Fold	Accuracy	PET Precision	PET Recall	PET F1-Score	PP Precision	PP Recall	PP F1-Score	PS Precision	PS Recall	PS F1-Score
1	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
2	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
3	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
4	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
5	0.98	1.00	1.00	1.00	1.00	0.96	1.00	0.98	1.00	0.94

To check that the model performs consistently and stably across different data subsets, K-Fold Cross-Validation was used. In this method, the original data are divided evenly into K subsets; in each iteration, K-1 subsets are used for training and the remaining subset for testing. The process is repeated K times, each time with a different test subset, which assesses the model's generalization capability on unseen data. According to the cross-validation results, the model performed consistently across all folds, demonstrating a high degree of robustness and accuracy. As shown in Tab.1, despite a slight decline in the PP and PS metrics in the fifth fold, the model's average accuracy still reached 0.9967, indicating its reliability and effectiveness in practical applications.
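The partitioning step of K-Fold Cross-Validation can be written out in a few lines of plain Python. This is a minimal sketch with hypothetical function names; it shows only how the folds are formed and averaged, not the classifier itself.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Fold sizes differ by at most one when n_samples is not divisible by k
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

def mean_accuracy(fold_accuracies):
    """Average of the per-fold accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)
```

Applied to the rounded accuracies in Tab.1, `mean_accuracy([1.00, 1.00, 1.00, 1.00, 0.98])` gives approximately 0.996; the reported 0.9967 was presumably computed from unrounded fold scores.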

Conclusion

This study combined Raman spectroscopy with a Random Forest model and achieved efficient and accurate identification of microplastic types. Although the model has not yet been applied to the precise estimation of microplastic concentrations, its high specificity in identifying microplastic types shows its potential as an effective auxiliary tool. Integrated with traditional colorimetric methods, this approach offers a more comprehensive solution for our research project, improving the accuracy of microplastic type identification and expanding the scope of microplastic monitoring and analysis.

Overall, the integration of Raman spectroscopy with machine learning technology provides an innovative analytical tool in the field of environmental science, especially showing great potential in handling complex environmental samples. Future research may explore combining this model with other spectroscopic techniques or further optimizing algorithms to expand its applications in environmental monitoring and pollution assessment. Additionally, enhancing the model's capability for estimating microplastic concentrations will be an important direction for subsequent research, potentially providing a more precise tool for the quantitative analysis of microplastic pollution.

Model 2 Virtual Screening of anchoring peptide TA2

1. Background

As found in our wet-lab experiments, the anchoring peptide TA2 exhibits high affinity for various microplastics. This means that TA2 has no specificity for any one type of microplastic, for example PS, which is common in the environment. To enhance the specificity of the TA2 anchoring peptide, we conducted molecular docking-based mutation scanning (virtual screening) and obtained some useful information.

2. Theoretical basis of docking and Molecular Dynamics Simulation

Docking and MD are carried out in the following five steps.

Receptor selection and preparation
Ligand selection and preparation
Docking
Molecular Dynamics (MD)
Evaluation of docking and MD results

Molecular docking is a computational method that predicts the optimal conformation and orientation of a protein and its ligand, i.e. the pose that gives the complex its lowest free energy. It consists of two main components: a scoring function and a search algorithm. The scoring function evaluates the binding energy and stability of a given protein-ligand pose by considering factors such as intermolecular interactions, solvation effects, and entropy. The search algorithm explores the conformational and rotational space of the protein and ligand to find the best-scoring pose ³.

Molecular docking can also predict how mutations affect the binding performance and stability of complexes by comparing the docking poses of wild-type and mutant proteins with the same ligand. However, molecular docking has some limitations: the effects of solvation, entropy, and flexibility on binding are neglected or treated only approximately.

Therefore, molecular docking results need to be validated through experimental data or further refined through more accurate methods, such as molecular dynamics (MD).

MD simulation can capture the dynamic behavior of proteins and ligands in solution, and explain their conformational changes, fluctuations, and interactions with solvent molecules ⁴.

MD simulation includes three main steps: initialization, integration, and analysis. The initialization step sets the initial coordinates, velocities, forces, and parameters of the system. The integration step uses numerical methods (such as the Verlet or leapfrog algorithms) to propagate the system forward in time. The analysis step extracts relevant information from the trajectory, such as structural properties, thermodynamic properties, dynamic properties, and bonding properties.
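The integration step can be illustrated with the velocity Verlet scheme on a one-dimensional particle. This is a toy sketch for a single degree of freedom, not the integrator configuration used in our simulations; the function name and the harmonic test force are assumptions.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one particle in time and return the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        # Half-step velocity update with the current force
        v_half = v + 0.5 * dt * f / mass
        # Full-step position update
        x = x + dt * v_half
        # New force at the updated position, then the second half-step
        f = force(x)
        v = v_half + 0.5 * dt * f / mass
        trajectory.append((x, v))
    return trajectory
```

For a harmonic force `lambda x: -x` with unit mass, the total energy `0.5 * v**2 + 0.5 * x**2` stays close to its initial value over many steps, which is the main reason Verlet-type integrators are preferred in MD.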

3. Receptor selection and preparation

The structure of the TA2 peptide was constructed using AlphaFold 3, a state-of-the-art deep learning method that predicts protein structure from amino acid sequence. The predicted structure has a high confidence level, as shown by the pLDDT score, which measures the per-residue confidence of the prediction. The predicted structure was converted into a PDB file using PyMOL.

Fig.2 AlphaFold 3 prediction results for mutant TA2 peptides, including the predicted structure of the best conformation, the pLDDT plot, the PAE plot, and the pTM score

The mutant structures are derived from the wild-type structure using FoldX, software mainly used for protein structure modeling and stability analysis. FoldX outputs a PDB file for each mutant.

We wrote a Python script that uses MGLTools to batch convert PDB format to PDBQT format, simplifying the tedious step of receptor preprocessing in molecular docking.
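A batch conversion of this kind can be sketched as follows. This is not our original script: it assumes the `prepare_receptor4.py` utility shipped with MGLTools is called through the bundled `pythonsh` interpreter, and the installation paths and function names are placeholders that must be adapted.

```python
import subprocess
from pathlib import Path

def build_receptor_commands(pdb_files, out_dir, pythonsh, script):
    """Build one prepare_receptor4.py command per PDB file (nothing is run)."""
    commands = []
    for pdb in sorted(Path(p) for p in pdb_files):
        out = Path(out_dir) / (pdb.stem + ".pdbqt")
        # -r: input receptor file, -o: output PDBQT file
        commands.append([str(pythonsh), str(script),
                         "-r", str(pdb), "-o", str(out)])
    return commands

def run_all(commands):
    """Run the conversions one by one; stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Separating command construction from execution makes it easy to print and check the commands before converting hundreds of mutant structures.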

4. Ligand selection and preparation

Retrieve the three-dimensional structure of styrene from the PubChem database (CID: 798).
Add hydrogens to the styrene molecule using Materials Studio 2020.
Convert SDF files of styrene to PDBQT format using MGLTools.

5. Molecular docking

In this part, we used AutoDock Vina to dock styrene, as the ligand, against all 817 single-point mutants of the TA2 anchoring peptide ⁵. We also wrote a batch (.bat) file that runs the docking jobs in bulk and can resume at any point after an interruption. Our aim was to explore the effect of these mutations on the binding affinity between the TA2 peptide and styrene. Fig.3 shows the heatmap of the interaction between the TA2 mutants (residues 1-43) and styrene. The deeper the colour, the stronger the binding affinity. The heatmap lets us quickly identify and compare the binding affinity changes of different mutants, providing valuable information for further protein engineering. For example, Fig.3 shows three mutants, namely ALA13, ARG6, and ARG24, whose binding affinity is lower than that of the wild-type TA2 peptide. These mutants may become valuable if we want to evolve TA2 into an anchoring peptide for other microplastics. More work is needed to obtain further useful information.

Fig.3 Heatmap of single-point mutations at residues 1-43 of the TA2 anchoring peptide docked with styrene.
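The mutation scan and the resumable batch run can be sketched in Python. The sketch assumes a 43-residue peptide with 19 substitutions per position, which matches the 817 mutants; the mutant labels, file names, and function names are hypothetical, and the actual batch file is not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def single_point_mutants(sequence):
    """Return labels such as 'A13G' for every single substitution."""
    mutants = []
    for position, wild_type in enumerate(sequence, start=1):
        for residue in AMINO_ACIDS:
            if residue != wild_type:
                mutants.append(f"{wild_type}{position}{residue}")
    return mutants

def pending_jobs(mutants, finished):
    """Mutants that still need docking; allows restarting after an interruption."""
    done = set(finished)
    return [m for m in mutants if m not in done]

def vina_command(mutant, ligand="styrene.pdbqt", config="conf.txt"):
    """AutoDock Vina command line for one mutant (file names are assumptions)."""
    return ["vina", "--receptor", f"{mutant}.pdbqt", "--ligand", ligand,
            "--config", config, "--out", f"{mutant}_out.pdbqt"]
```

In practice, `finished` can be collected from the output files that already exist on disk, so a restarted run only docks the mutants that are still missing.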

Model 3 AI-guided evolution of MHETase and Identification

1. Background

In recent years, with the rapid development of structural biology, computational biology, and artificial intelligence technology, computer-guided protein design strategies have greatly promoted protein modification.

In our PET-degradation wet-lab experiments, we expressed two PET-degrading enzymes, i.e. Z1-PETase and MHETase. The latter catalyzes the conversion of MHET into TPA, but its catalytic activity is limited. In order to identify beneficial mutants exhibiting higher enzymatic activity, we used the newly published unsupervised protein evolution model ESM-IF1, which is a protein language model with enhanced structural information ⁶. This model can effectively guide the evolution of proteins by combining protein amino acid sequence information and structural information. Next, we combined molecular docking and molecular dynamics simulations to theoretically compare and screen 20 variants recommended by the ESM-IF1 model. Following these steps, we expressed and purified 9 mutants and measured their catalytic activities.

2. Predicted mutation checkpoints of MHETase based on the ESM-IF1 model

After setting up the operating environment of the ESM-IF1 model (https://github.com/varun-shanker/structural-evolution/blob/main/README.md), we input the PDB file of MHETase obtained from the PDB database (https://www.rcsb.org/) into the model and finally selected 33 suggested mutation checkpoints, as shown in Tab.2.

Tab.2 33 suggested mutation checkpoints obtained by the ESM-IF1 model

S235A	H293E	M192F	K437A	W453Y	Y191I	H467A	S196A	A181S	W466V	R411M
A553L	W397E	W397A	N134A	V341A	M433A	A119V	K256R	A310V	A380Q	T638D
D555S	S305T	A139D	S260Q	R318A	R375S	R116N	Q458A	N195L	S286A	W397A

3. Molecular docking and MD between MHETase variants and MHET

To explore how these suggested variants of MHETase affect the interaction with the substrate MHET, we first used AlphaFold 3 to predict the structures of these variants ⁷. Fig.5A gives an example, the predicted structure of the MHETase R318A mutant. Fig.5B shows the PAE plot of the best conformation together with some other parameters, reflecting the high accuracy of the predicted structure. The 3D structure of the substrate MHET was obtained from the PubChem database (CID: 22062452).

Figure 5: AlphaFold 3 prediction results for the MHETase R318A mutant, including the predicted structure of the best conformation, the pLDDT plot, the PAE plot, and the pTM score

After preparing both the MHETase variants and the MHET molecule, we used AutoDock Vina to dock MHET against these MHETase mutants. We found that 9 mutants exhibit higher affinity than WT MHETase; the results are shown in Fig.6. Fig.7 displays some of the interaction changes between MHETase and MHET. We also carried out MD using the GROMACS software package, with a temperature of 300 K and a pressure of 1 bar controlled by the Berendsen algorithm [9,10], to help explain why these variants exhibited higher affinity than the other mutants. As seen from Fig.8, the positive mutants S350T and N134A have smaller RMSF values in the active pocket region (residues 150-250). In contrast, the negative mutants R318A and R116N have higher RMSF values in this region, which may indicate that the structural stability of the active pocket is responsible for the high affinity between MHETase and MHET.

Fig.6 Comparison of affinity between MHETase mutants and MHET molecule calculated by molecular docking

Fig.7 Interaction information between mutants and MHET in MHET-binding pocket of MHETase

Fig.8 Comparison of RMSF of MHETase variants in complex with MHET, obtained by MD
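The RMSF compared in Fig.8 is the root-mean-square fluctuation of each atom (or residue) around its time-averaged position. A minimal NumPy sketch is given below; it assumes the trajectory frames have already been aligned to a reference structure, and the function name is hypothetical. In our workflow the values were obtained from the GROMACS trajectories rather than from this code.

```python
import numpy as np

def rmsf(trajectory):
    """Per-atom RMSF for coordinates of shape (n_frames, n_atoms, 3).

    Assumes the frames are already superimposed on a common reference,
    so that overall rotation and translation do not contribute.
    """
    coords = np.asarray(trajectory, dtype=float)
    mean_position = coords.mean(axis=0)
    # Squared distance of each atom from its mean position, per frame
    squared_dev = ((coords - mean_position) ** 2).sum(axis=2)
    # Average over frames, then take the square root
    return np.sqrt(squared_dev.mean(axis=0))
```

A lower RMSF over the pocket residues means those residues move less during the simulation, which is how the stability of the active pocket is compared between variants.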

4. PCR-directed mutagenesis and enzymatic activity detection

After theoretically screening 9 MHETase variants, we carried out PCR-directed mutagenesis; the PCR results are shown in Fig.9. After DNA sequencing, we expressed and purified these 9 variants. Enzymatic activity was measured using HPLC, and the results are shown in Tab.3. Six variants showed higher enzymatic activity than WT MHETase (Fig.10). Among them, the K442A variant is the most beneficial candidate, with its activity increased to ~3.2-fold that of the WT.

Figure 9: Agarose gel electrophoresis of the constructed mutants

Fig.10 Comparison of enzyme activity between the wild type and the mutants using MHET as substrate

Table 3: Comparison of enzyme activity between the wild type and the mutants

Name	Amino acid substitution	Activity (IU/mL)	Fold change vs. WT
WT	-	17.4333	1
K442A	Lys442Ala	55.5567	3.19
A380Q	Ala380Gln	45.8500	2.63
W397A	Trp397Ala	39.0967	2.24
N134A	Asn134Ala	25.8867	1.48
S286A	Ser286Ala	20.4333	1.17
S350T	Ser350Thr	19.9087	1.14
W397E	Trp397Glu	15.7100	0.90
R318A	Arg318Ala	14.6933	0.84
R116N	Arg116Asn	11.3887	0.65
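The fold values in Table 3 are the ratio of each variant's activity to the WT activity. The short sketch below recomputes them from the measured IU/mL values; the dictionary and function names are only illustrative.

```python
# Measured activities from Table 3, in IU/mL
ACTIVITY_IU_PER_ML = {
    "WT": 17.4333, "K442A": 55.5567, "A380Q": 45.8500, "W397A": 39.0967,
    "N134A": 25.8867, "S286A": 20.4333, "S350T": 19.9087, "W397E": 15.7100,
    "R318A": 14.6933, "R116N": 11.3887,
}

def fold_changes(activity, reference="WT"):
    """Activity of each variant divided by the reference, rounded to 2 decimals."""
    ref = activity[reference]
    return {name: round(value / ref, 2) for name, value in activity.items()}

def improved_variants(activity, reference="WT"):
    """Names of the variants whose activity exceeds the reference."""
    ref = activity[reference]
    return [name for name, value in activity.items()
            if name != reference and value > ref]
```

For example, K442A gives 55.5567 / 17.4333 ≈ 3.19, and `improved_variants(ACTIVITY_IU_PER_ML)` lists the six variants with higher activity than the WT.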

We have summarized the computational design and experimental results for the PET-degrading enzyme MHETase in a Chinese patent application, as shown in Fig.11. The application has currently been accepted.

Fig.11 Proof of acceptance of the patent application issued by the China National Intellectual Property Administration (CNIPA)

References