We took 60,286 one-dimensional amino acid sequences as the original dataset and divide them into training and testing sets in the ratio of 7:3. The label distribution results are displayed as shown below:
Fig. 1 training data labeling classification display
Three columns: Features, Len and Label.
(1). Features: each amino acid was converted into a 25-dimensional vector (including 20-dimensional unique heat codes and 5-dimensional physical and chemical features).
(2). Len: number of amino acids in each sequence.
(3). Label: classification of each sequence.
(1). Define the mapping dictionary of physicochemical attributes and unique heat codes.
(2). Accept the amino acid sequences and mapping dictionary as inputs and merge the unique heat coding vector and the physicochemical properties into a 25-dimensional vector.
(3). Normalize the physicochemical properties.
(4). Performs exception handling on amino acid sequences, skipping sequences containing unknown amino acids, which helps maintain clean and consistent data.
The input dataset consists of a total of 386,226 one-dimensional amino acid sequences, each of which is a sequence with a length between 40 and 70. The data to be filtered is shown in the table below:
Table 1 Introduction to the data set to be filtered
Fig. 2 Demonstration of the length distribution of amino acid sequences to be screened
A total of 386,226 one-dimensional amino acid sequences from the dataset and their corresponding model predictions were entered. The model is a three-category problem (labels 0, 1, and 2 correspond to different categories). In this case, the total number of one-dimensional amino acid sequences with label 0, i.e. the target category, is 285,052.
Table 2 Introduction to output datasets
Fig. 3 Model predicted label result distribution display
After that, the output results with label as 0 were analyzed for homology, and the sequences with homology greater than 70% were deleted to get 72,978 sequences. These sequences were subjected to score comparison, and 120 sequences each of 40, 50, 60, and 70 lengths were filtered out, and finally 480 amino acid sequences to be filtered were obtained.
We used 7,386 3D PDB files as the original dataset, each file corresponds to one amino acid sequence, and the number of amino acid sequences of different lengths under different labels were counted separately.
Table 3 Introduction of 3D PDB dataset
Fig. 4 Statistics of amino acid sequences of different lengths labeled “0”
Fig. 5 Amino acid sequence statistics for different lengths of label “1”
Fig. 6 Statistics of amino acid sequences with different lengths of label “2”
We screened amino acid sequences with lengths of 40-70 and removed 4 sequences containing unknown amino acids to obtain 5,852 amino acid sequences corresponding to PDB files and used them as input data sets. Among them, 3,436 PDB files were labeled “0”, 797 PDB files were labeled “1”, and 1,619 PDB files were labeled “2”.
Fig. 7 Number of PDB files with different labels
As can be seen from the above figure, the screening data set is not balanced in terms of the number of categories, in order to avoid the bias of the model in training and prediction, the data set was equalized by using random sampling and resampling methods. The processed dataset has 2,436 PDB files labeled “0”, 1,500 PDB files labeled “1”, and 1,619 PDB files labeled “2”. We divided the processed dataset into GCN training set and evaluation set according to the ratio of 4:1.
Fig. 8 Number of files with different labels after resampling and random sampling
We used the 480 amino acid sequences output from the LSTM as a test set for the GCN model using the 480 PDB dataset simulated from AlphaFold3, and used the trained GCN model to predict the 480 amino acid sequences for the bathtub, obtaining a three-dimensional output [p_1,p_2,p_3 ] with p_1,p_2,p_3=0,1,p1+p2+p3≡1, where the corresponding value is at 1 and the class of amino acid sequence predicted by the model, and Label 0 is the target class. Then they were sorted according to the score, and finally the top 16 amino acid sequences with the highest scores based on the insolubility index are selected as the optimized samples for the wet experimental group.
Contains two files: adjacency matrix, feature matrix.
Adjacency matrix (L×L dimension): stores the connection relationship between the nodes of the graph and the weight information of the edges, reflecting the overall structure of the graph data.
Feature matrix (L×34 dimensions): feature information of the nodes in the graph corresponding to each amino acid in the PDB file, consisting of 6 types of features (20-dimensional unique heat codes, 5-dimensional physical and chemical features, 4-dimensional torsion angle features, 1-dimensional solvent-accessible surface area features, 1-dimensional relative solvent accessibility features, and 3 as secondary structure features).
(1). Calculate each class of features and process into a consistent data format.
(2). Normalize the features.
(3). Splicing each class of features to form a total 34-dimensional feature file.
(4). Calculate the adjacency matrix of the corresponding amino acid sequence for each PDB file.
The GCN model yielded a table of scores for the 480 test sets, and they were sorted from high to low based on the scores, as presented in the table below:
Table 4 480 amino acid sequence scoring table
Based on the insolubility index as well as the score ranking, we successfully screened out 16 short peptide sequences with high affinity to plastics, which are presented in the following table:
Table 5 Target peptides after rescreening
Through the calculation and prediction of this model, the dry lab successfully screened 16 short peptides with potential PET-binding capabilities. This multi-dimensional prediction-based screening method significantly enhanced the accuracy and efficiency of the screening process, providing strong support for the scientific validity and specificity of the results, while also laying a solid foundation for the wet lab's subsequent experimental validation. It can be said that the work of the dry lab not only substantially narrowed the scope of the wet lab's validation but also greatly improved the efficiency and precision of the experimental process. Building on this foundation, the wet lab further conducted refined biological experiments to assess the actual binding effects and biological functions of these peptides. The contribution of the dry lab is critical, as it provided the wet lab with high-affinity and high-specificity candidate peptides, significantly accelerating the entire experimental process. These validation efforts not only confirmed the practical application potential of the screened peptides but also provided reliable and high-quality data support for subsequent functional research.
The LSTM architecture design for this project consists of a bi-directional LSTM layer, a hidden layer, and two layers of stacked LSTM units. In this design, the number of hidden layer units was set to 128 neurons. To further enhance the expressiveness of the model, we introduced a fully connected layer after the LSTM layer and added a dropout layer with the dropout rate set to 0.1 to reduce the risk of overfitting. Finally, the learning rate was set to 0.001 to balance the convergence speed and performance of the model. Such a design effectively captured the dependencies in the time series through the long and short-term memory mechanism, ensuring that the model has stronger memory and robustness when dealing with long series data. We conducted experiments on our dataset in a supervised learning setting. Our dataset contained 60,285 amino acid sequences. Meanwhile, the model was parameterized, and the main tuning parameters include batch_size, hidden_size, learning rate, dropout rate, and the number of training rounds (Epochs). Table 6 shows the results of the comparative experiments with different LSTM model parameters and the model performance is measured by accuracy (Acc).
Table 6 Comparison of LSTM model parameters
The optimal parameter settings for this LSTM model are shown in Table 7, with the following parameters: the number of training cycles (Epochs) is 100, the number of network layers (Num_layers) is 2, the learning rate (Learning rate) is 0.001, and the dropout rate (Dropout rate) is 0.1. These parameter combinations have been experimentally verified to provide the optimal model performance.
Table 7 Optimal parameters of the LSTM model
We divided the training set and test set according to the ratio of 7:3. The number of label0, label1, and label2 were 32.2%, 30.2%, and 37.6% in the training and test sets, respectively. We trained the model on the training set and evaluated the performance of the model on the test set. The evaluation metrics were the accuracy rate of each type of label and the macro average F1 score.
The results showed that the model performs best with an accuracy of 0.934 when the learning rate was 0.001, the discard rate was 0.1, and the training period was 100. Higher learning and discard rates lead to a decrease in the model accuracy. Overall, moderate parameter tuning can effectively improve model performance and prevent overfitting.
The following graph shows the trend of training loss and validation loss. The training loss continues to decrease and stabilize.
Fig. 9 LSTM loss variation at 100epoch
Fig. 10 illustrates the variation of training accuracy (Train Acc) and validation accuracy (Val Acc) with training epochs (Epoch number). The training accuracy approaches 1.0 after about 20 epochs and remains stable.
Fig. 10 Variation of LSTM training precision under 100epoch
Table 8 shows the classification report of the optimal model on the test set, and the evaluation metrics include precision, recall, and F1-score. category 0 has the best performance with an F1 value of 0.95, precision of 0.94, and recall of 0.96. category 1 performs relatively poorly, with an F1 value of 0.91, and both precision and recall of 0.91. category 2 has an F1 value of 0.93 and performs close to category 0. The overall accuracy of the model is 0.93, and the F1 values for both macro-averaging and weighted averaging are both 0.93.
Table 8 Remaining indicators of the LSTM model
By analyzing the confusion matrix, we can see that the model performs best in correctly predicting the 0-label, which indicates that our LSTM model has a high accuracy in identifying 0-labeled data. This is consistent with our task goal of focusing on the 0-label category, and demonstrates that the model has a strong discriminative ability in this key category.
Fig. 11 Confusion matrix for LSTM test results
The architecture of the GCN involved in this project consists of a two-headed structure, a hidden layer, and two layers of graph attention convolutional layers. In this design, the number of hidden layer units was set to 256 neurons. In addition, we introduced a fully connected layer and added a dropout layer afterwards to improve the generalization ability of the model, where the dropout rate was 20%. Finally, we chose 0.005 as the learning rate. Such a design aims to improve the model performance and ensure its robustness to changes in the data through multi-level feature extraction and effective regularization strategies.
Table 9 GCN model accuracy with different parameters
Table 10 Optimal parameters of the GCN model
We divided the dataset into training set and evaluation set according to the ratio of 4:1, and used resampling and random sampling to ensure the balance of the data with different labels, and the 0-label, 1-label, and 2-label data after sampling were 43.85%, 27.00%, and 29.15%, respectively. We trained the GCN model on the training set and set the number of training rounds to 100epoch, and utilized the evaluation set to evaluate the model performance, and the evaluation metrics were mainly Accuracy, and Precision, Recall, and F1-score.
After training through 100epoch, the loss of the model gradually decreased, and the model had basically converged at the beginning of 20epoch, and the model loss was 0.0054 at 100epoch.
Fig. 12 GCN model training loss at 100epoch
After training by 100epoch, the accuracy of our model in the evaluation set can reach 90.10%.
Fig. 13 Acc for GCN model testing under 100epoch
By looking at the confusion matrix, we can find that the model correctly predicts the most number of 0 labels, which proved that our GCN model had excellent performance in predicting 0 labels. And this coincides with our goal of finding target peptides with label 0.
Fig. 14 GCN model confusion matrix
Table 11 Remaining indicators of the GCN model
In addition, we also performed ablation experiments on the 34-dimensional features of the GCN dataset separately to explore which features play an important role in the learning of the GCN model and used Acc as the model performance index.
Table 12 Performance index of GCN model with different features
According to the above table, we can find that the unique heat coding feature, amino acid physicochemical property feature, and torsion angle feature play indispensable roles in the task of GCN learning and successfully predicting amino acid categories.
(*Note: OH: amino acid sole heat coding feature; LH: amino acid physicochemical property feature; TA: torsion angle feature; ASA: solvent accessibility feature; SS: protein secondary structure feature)
In the subsequent experiments on the 16 short peptide sequences selected for high affinity with PET plastics, the feedback test data revealed that the expression level of the fusion protein and its binding capacity to the PET microplastic substrate were insufficiently high, presenting certain limitations. This made us recognize that the current model merely functions to screen out relatively optimal sequences but cannot guarantee satisfactory actual test results of the sequences. Hence, we need to further develop the model. Confronted with this issue, Mutcompute-super, as an advanced computational biology tool, predicts possible mutation sites in proteins through deep learning algorithms and analyzes the three-dimensional structure of proteins via 3D deconvolution neural networks [1], thereby optimizing the performance of proteins. Therefore, we employed this model to introduce beneficial point mutations at specific sites of the short peptide sequences and further enhance the expression level of the fusion protein and its degradation capacity towards the PET microplastic matrix through molecular modification. We recoded the C, N, O, and S microenvironments of the amino acid residues in the protein and extracted the biological information features as the inputs for the Mutcompute-super model. Subsequently, in combination with the deep learning classifier, we predicted the optimal amino acid mutation combination to optimize the protein folding conformation and elevate its functional performance.
Based on the prediction results of Mutcompute-super, we scored the amino acid mutations at each locus and generated the following heat maps. These graphs demonstrate the effects of different mutations on the expression level and function of the fusion protein: the color changes in the graphs reflect the mutation effects, with blue indicating a lower score, suggesting a weaker or ineffective mutation, and red indicating a higher score, suggesting that the mutation significantly enhances the function of the protein. The left side shows the original amino acid sequence, and the right side shows the mutated amino acid sequence.
Fig. 15 Heatmap 1 of the mutation scoring situation
Fig. 15 illustrates the mutation scoring for specific regions of PET-binding peptide 60-2. As can be seen from this figure, the Mutconpute-super model gives higher scores to mutations at loci L7, A13, S26, and N29, etc. (red areas), suggesting that mutations at these loci may enhance the function of the protein.
Fig. 16 Heatmap 2 of the mutation scoring situation2
Fig. 16 reflects the effect of mutations at the 50-4 peptide site. In particular, loci R4, S14, V19, L23, Y33, and S61 show high ratings, indicating that these mutations can enhance the expression level of the fusion protein or its ability to bind to PET microplastic substrates. In contrast, other regions have lower scores (blue areas) and the effect of mutations on function is not significant.
Fig. 17 Heatmap 3 of the mutation scoring situation
Fig. 17 illustrates the mutation scoring of the 50-5 peptide. In particular, more distinct red areas can be seen at sites I283, R287, C293, R301, and A325, etc., indicating that mutations at these sites significantly enhance the properties of the protein. The blue areas in the graph, on the other hand, indicate that the mutations at some sites have less effect.
Through these heat maps, we can clearly see the effects of each mutation on different short peptides, thus providing guidance for further mutation optimization.
We proposed a composite LSTM-GCN model to screen PET-binding peptides with high affinity, initially identifying 480 potential target peptides from 386,226 amino acid sequences using LSTM to capture long-distance interactions, followed by GCN to capture three-dimensional spatial information, ultimately yielding 16 potential high-affinity peptides. These peptides demonstrated high confidence for experimental validation. Additionally, considering the limited expression and binding ability of fusion proteins, we used Mutcompute-super to introduce beneficial mutations to PET-binding peptides by recoding the protein's microenvironments and optimizing protein folding via MLP classifiers, enhancing both expression levels and PET microplastic degradation.
In fact, LSTM and GCN have also achieved remarkable results in other fields of synthetic biology. For example, LSTM can capture the dynamic interaction between metabolites and enzymes and predict the change of metabolite concentration through long time series data, while GCN can capture the topological structure information of metabolites and enzymes in the metabolic network.
According to the research of Cai et al., GCN can be applied to synthetic lethality prediction in cancer, demonstrating its application potential in complex gene interaction networks [2]. In addition, the combined model of GCN and bidirectional long short-term memory network (Bi-LSTM) has also been applied to protein secondary structure prediction. Bi-LSTM can capture the temporal dynamic characteristics of sequence information, while GCN can effectively simulate the topological relationship between protein molecules [3]. In comparison, we used LSTM to conduct preliminary screening of the characteristics of one-dimensional sequences. Then, we used the graph structure of GCN to conduct re-screening of three-dimensional PDB files. Moreover, we used Mutcompute-super to predict amino acid mutation combinations and improve the expression ability of fusion proteins. Our model approach fully links the characteristics of deep learning with synthetic biology organically.
Thus, it can be seen that our model approach provides a good example for others.
In the future, we intend to use deep learning neural network models with larger parameters, such as xLSTM and hypergraph neural network model (HGCN) to better capture the interactions in the amino acid sequences of proteins, in order to obtain more accurate and compliant target amino acid sequences.
[1] Deng Z H, Cai C, Wang S T, et al. Protein design method based on amino acid microenvironment and
EMO neural network,
CN118136092A [P/OL].
[2] Breitwieser L, Hesam A, DE Montigny J, et al. BioDynaMo: a modular platform for high-performance
agent-based
simulation [J]. Bioinformatics, 2022, 38(2): 453-460.
[3] Cai R, Chen X, Fang Y, et al. Dual-dropout graph convolutional network for predicting synthetic
lethality in human
cancers [J]. Bioinformatics, 2020, 36(16): 4458-4465.
To Top