Part 1: Long Short-Term Memory

1.1 Reasons for Choosing LSTM

The traditional Fully Connected Layer (FCL) cannot effectively capture the forward and backward relationships in amino acid sequences, and feed-forward deep learning models have significant limitations when processing sequential data with order dependencies, making it difficult to capture long-range dependencies in sequences. The LSTM model, by contrast, is a special type of recurrent neural network (RNN) designed to mitigate the vanishing- and exploding-gradient problems that standard RNNs encounter when processing long sequences. For amino acid sequences, interactions between distant residues are crucial for characterizing short peptides. Through its gating mechanism, the LSTM can selectively remember and forget information, allowing it to handle long-range dependencies in amino acid sequences and making it suitable for complex short-peptide prediction tasks.

1.2 Feature Engineering

The features of the LSTM model fall into two categories:


One-hot encoding features:


In one-hot encoding, each amino acid is converted into a binary vector whose length equals the number of amino acid types (i.e., 20), where exactly one element is 1 and the rest are 0. The position of the 1 corresponds to the index of the amino acid in the dictionary. In this way, each amino acid has a unique binary representation that captures its identity without implying any quantitative or ordinal relationship.
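As a minimal illustration, the sketch below (assuming the dictionary indexes the 20 standard amino acids alphabetically by one-letter code) encodes a peptide sequence as an L×20 one-hot matrix; the function name is illustrative.

```python
import numpy as np

# One-letter codes of the 20 standard amino acids; index = position in the dictionary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode an amino acid sequence as an (L, 20) one-hot matrix."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence):
        encoding[pos, AA_INDEX[aa]] = 1.0   # single 1 at the dictionary index of this residue
    return encoding

# Example: a 5-residue peptide becomes a 5 x 20 binary matrix.
print(one_hot_encode("ACDKW").shape)  # (5, 20)
```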


Amino acid physicochemical property characteristics:


The amino acid physicochemical property features comprise the following five characteristics (a small encoding sketch follows the list):
(1). Molecular size: the first number indicates the molecular size or volume of the amino acid. Molecular size affects the stacking and spatial arrangement of amino acids within a protein, which in turn affects the structural stability of the protein.


(2). Charge: indicates the charge of the amino acid under physiological pH conditions. Positive, negative, or neutral charges are critical to the three-dimensional structure and function of proteins because they affect the charge interactions between amino acids.


(3). Polarity: this number indicates the polarity of the amino acid. The distribution of polar and non-polar amino acids in a protein is critical to its water solubility and interaction with other biomolecules.


(4). pKa value: this number indicates the dissociation constant of the side chain of an amino acid. pKa value affects the charged state of an amino acid at different pH values, which in turn affects the structure and function of the protein.


(5). Hydrophobicity/hydrophilicity: this value is usually used to indicate the hydrophobicity or hydrophilicity of an amino acid. These properties determine whether the amino acid is close to an aqueous environment or favors the formation of a protein core during protein folding.
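The sketch below shows how these five properties could be assembled into a per-residue feature vector; the numeric values are placeholders for illustration only and would in practice come from a curated physicochemical property table. One common choice, not spelled out above, is to concatenate these five values with the 20-dimensional one-hot code into a 25-dimensional per-residue feature.

```python
import numpy as np

# Placeholder table: {one-letter code: (molecular size, charge, polarity, pKa, hydrophobicity)}.
# The numbers are illustrative placeholders, not the values used in this project.
PHYSCHEM = {
    "A": (0.31, 0.0, 0.0, 0.00, 1.8),
    "D": (0.45, -1.0, 1.0, 3.65, -3.5),
    "K": (0.68, 1.0, 1.0, 10.53, -3.9),
    # ... remaining 17 amino acids filled in the same way from the property table
}

def physchem_features(sequence: str) -> np.ndarray:
    """Return an (L, 5) matrix of the five physicochemical features per residue."""
    # Unknown residues fall back to zeros so the sketch stays runnable.
    return np.array([PHYSCHEM.get(aa, (0.0,) * 5) for aa in sequence], dtype=np.float32)
```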

1.3 Model Architecture

The LSTM [1] unit consists of three main gates that control the flow of information: the input gate, the forget gate, and the output gate. The key to the LSTM is its memory cell (cell state), which is maintained and updated through the gating mechanism. The main workflow of the LSTM is shown in Fig. 1.


Fig. 1   Memory cells for LSTM

Forget Gate


It controls the proportion of information in the current memory unit that is retained or forgotten. Its main formula is as follows:

\( f_{t} = \sigma\big(W_{f} \cdot [h_{t-1}, x_{t}] + b_{f}\big) \)

Where \( f_{t} \) is the output of the forget gate, \( W_{f} \) is the weight matrix, \( h_{t-1} \) is the hidden state at the previous time step, \( x_{t} \) is the current input, \( b_{f} \) is the bias, and \( \sigma \) is the sigmoid activation function.


Input Gate


It controls the proportion of new information that is added to the memory cell. The main formula is as follows:

\( i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \)
\( \widetilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \)

Where \( i_t \) is the output of the input gate, \( \widetilde{C}_t \) is the candidate memory cell state, \( W_i \) and \( W_C \) are the weight matrices, and \( b_i \) and \( b_C \) are the biases.

Meanwhile, the formula for updating the memory cell state is as follows:

\( C_t = f_t \cdot C_{t-1} + i_t \cdot \widetilde{C}_t \)

Where \( C_t \) is the current memory cell state.


Output Gate


It controls the amount of information output from the memory cell. Its main formula is as follows:

\( o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \)
\( h_t = o_t \cdot \tanh(C_t) \)

Where \( o_t \) is the output of the output gate and \( h_t \) is the current hidden state.
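To make the gate equations concrete, the following numpy sketch performs a single LSTM time step exactly as written above; in practice a framework implementation such as torch.nn.LSTM would be used, so this is purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following the gate equations above.

    Each W_* has shape (hidden, hidden + input); [h_{t-1}, x_t] is their concatenation.
    """
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    c_tilde = np.tanh(W_C @ z + b_C)       # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # cell state update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(c_t)               # hidden state
    return h_t, c_t
```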

1.4 Model Training

The training process for LSTM models is usually done using the Backpropagation Through Time (BPTT) algorithm. The following are the key mathematical formulas used in the training process:


Loss Function:


The cross-entropy loss function measures the difference between the model prediction and the actual target. Since this is a multi-class classification problem, the loss function is:

\( L = -\sum_i y_i \log\left(\widehat{y}_i\right) \)


Gradient Calculation:


In BPTT, the gradient of the loss function is backpropagated through the time steps. The parameter update is:

\( \theta \leftarrow \theta - \eta \nabla_\theta L \)

Where \( \theta \) denotes the parameters of the model, \( \eta \) is the learning rate, and \( \nabla_{\theta} L \) is the gradient of the loss function with respect to the parameters.


Optimizer:


The Adam optimizer was used, combining the advantages of the momentum method and RMSprop to automatically adjust the learning rate.
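A minimal training-loop sketch combining these three elements (cross-entropy loss, BPTT via backward(), and the Adam update) is shown below; the PeptideLSTM class, its layer sizes, the learning rate and the data loader are illustrative assumptions rather than the project's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical classifier: an LSTM over per-residue features followed by a linear layer.
class PeptideLSTM(nn.Module):
    def __init__(self, input_dim=25, hidden_dim=64, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                  # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])            # class logits

model = PeptideLSTM()
criterion = nn.CrossEntropyLoss()          # L = -sum_i y_i log(y_hat_i)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate assumed

def train_epoch(loader):
    """One epoch over a loader yielding (features, labels) batches."""
    model.train()
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()                    # BPTT: gradients flow back through the time steps
        optimizer.step()                   # theta <- theta - eta * grad
```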


By using the LSTM model, amino acid sequences can be efficiently screened and important long-range dependency information in the sequences can be captured. This approach provides a solid foundation for subsequent homology analysis and 3D structure prediction.

Part 2: Graph Convolutional Networks

2.1 Reasons for Choosing GCN

During our experiments, we found a key limitation of the pure LSTM model in processing protein sequence data: it mainly focuses on the one-dimensional sequence information of proteins (the amino acid sequence text) and ignores their three-dimensional structural information, which is crucial for understanding protein functions and properties. Indeed, we found that this 3D structural information is indispensable for a deeper understanding of protein function, stability and interactions, and thus for finding PET microplastic-binding peptides. To overcome this limitation, we introduce a GCN that takes protein 3D structural information as input to better identify PET-binding peptides.


2.2 Feature Engineering

Characterization of amino acid residue classes


First, we sorted the 20 amino acids in the following order: alanine (A), cysteine (C), aspartic acid (D), glutamic acid (E), phenylalanine (F), glycine (G), histidine (H), isoleucine (I), lysine (K), leucine (L), methionine (M), asparagine (N), proline (P), glutamine (Q), arginine (R), serine (S), threonine (T), valine (V), tryptophan (W), and tyrosine (Y). From this order, a 20-dimensional one-hot vector \( [a_1, a_2, \ldots, a_{20}] \) can be constructed for each amino acid residue \( r \) in each PDB file. If residue \( r \) corresponds to the \( i \)-th amino acid in the above order, then \( a_i = 1 \) and all other positions are 0, i.e., \( a_j = 0 \) for \( j \in \{ x \mid 1 \leq x \leq 20, x \in \mathbb{Z}, x \neq i \} \).
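A minimal sketch of this residue-level one-hot encoding, assuming Biopython is available for parsing the PDB file, might look as follows (function and variable names are illustrative):

```python
import numpy as np
from Bio.PDB import PDBParser

# Amino acids in the order listed above (alphabetical by one-letter code).
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"
THREE_TO_ONE = {
    "ALA": "A", "CYS": "C", "ASP": "D", "GLU": "E", "PHE": "F",
    "GLY": "G", "HIS": "H", "ILE": "I", "LYS": "K", "LEU": "L",
    "MET": "M", "ASN": "N", "PRO": "P", "GLN": "Q", "ARG": "R",
    "SER": "S", "THR": "T", "VAL": "V", "TRP": "W", "TYR": "Y",
}

def residue_one_hot(pdb_path: str) -> np.ndarray:
    """Return an (n_residues, 20) one-hot matrix for the standard residues of a PDB file."""
    structure = PDBParser(QUIET=True).get_structure("s", pdb_path)
    rows = []
    for residue in structure.get_residues():
        name = residue.get_resname()
        if name not in THREE_TO_ONE:       # skip water and other hetero residues
            continue
        vec = np.zeros(20, dtype=np.float32)
        vec[AA_ORDER.index(THREE_TO_ONE[name])] = 1.0
        rows.append(vec)
    return np.stack(rows)
```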


Characterization of solvent accessibility


Two features, the solvent Accessible Surface Area (ASA) and the Relative Solvent Accessibility (RSA), are widely used in the detection of various types of protein functional sites. They measure the solvent exposure of amino acid residues; because most residues located at protein-protein interaction interfaces occur at or near the protein surface, these features play a key role in this project. For each residue, a two-dimensional feature vector consisting of RSA and ASA was extracted. RSA measures the extent to which an amino acid is exposed to solvent in the protein structure; specifically, it is the ratio of the actual solvent-accessible surface area of a given amino acid to the maximum ASA of that amino acid in the fully extended state. This ratio is usually normalized to a value between 0 and 1 so that the exposure of different amino acids in different environments can be compared. ASA, in turn, is the area of the residue exposed to solvent, expressed in square angstroms (Å²).
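The RSA definition above amounts to a simple normalization of ASA by a per-residue maximum; a minimal sketch is given below, where the max-ASA values shown are placeholders and would in practice come from a published max-ASA scale.

```python
# MAX_ASA maps each residue type to its maximum accessible surface area (Å^2) in the
# fully extended state; the two values shown are placeholders for illustration only.
MAX_ASA = {
    "A": 113.0,   # placeholder value
    "K": 211.0,   # placeholder value
    # ... remaining amino acids from a published max-ASA scale
}

def relative_solvent_accessibility(residue_type: str, asa: float) -> float:
    """RSA = ASA / max ASA, clipped to the usual [0, 1] range."""
    rsa = asa / MAX_ASA[residue_type]
    return min(max(rsa, 0.0), 1.0)

print(relative_solvent_accessibility("A", 56.5))  # e.g. 0.5 for a half-exposed alanine
```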


Physicochemical characterization


Amino acid physicochemical characterization represents the physicochemical characteristics of the side-chain group (R group) of each amino acid residue, which can be used to characterize both the overall protein and the local physicochemical environment in which the residue is located. Amino acid physicochemical property information also plays an important role in the prediction of protein-protein interactions; therefore, the following five common physicochemical characteristics were extracted for each amino acid residue of each protein in this project: number of atoms, amount of charge, number of hydrogen bonds, dissociation constant of the side chain, and hydrophobicity value. The values of these five physicochemical properties for the 20 amino acids are shown in Table 1.


Table 1   Physicochemical property values of the side-chain R groups of the 20 amino acid residues


PHI and PSI torsion angle sine-cosine characteristics


The PHI (φ) and PSI (ψ) torsion angles in the protein backbone are important parameters for describing the three-dimensional conformation of proteins; they represent the rotation angles about specific bonds in the protein backbone. The PHI torsion angle describes rotation about the N–Cα bond on the N-terminal side of the α-carbon, while the PSI torsion angle describes rotation about the Cα–C bond on the C-terminal side. These two angles together determine the local conformation of the protein chain. Therefore, four features, the sine and cosine of PHI and the sine and cosine of PSI, were selected as part of the residue characterization in this project.
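A small sketch of this sine-cosine encoding is shown below (numpy, angles assumed to be given in degrees); encoding each angle as its sine and cosine avoids the artificial discontinuity at ±180°.

```python
import numpy as np

def torsion_features(phi_deg: np.ndarray, psi_deg: np.ndarray) -> np.ndarray:
    """Encode backbone torsion angles (degrees) as [sin(phi), cos(phi), sin(psi), cos(psi)]."""
    phi = np.radians(phi_deg)
    psi = np.radians(psi_deg)
    return np.stack([np.sin(phi), np.cos(phi), np.sin(psi), np.cos(psi)], axis=-1)

# Example: two residues with (phi, psi) of (-60, -45) and (-120, 130) -> a (2, 4) feature matrix.
print(torsion_features(np.array([-60.0, -120.0]), np.array([-45.0, 130.0])).shape)
```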


Characterization of protein secondary structure


In biology, structure and function are inextricably linked, and the function of proteins likewise depends on their structure. Secondary structure is a classical characterization of protein structural information. The secondary structure of proteins can be broadly categorized into α-helix, β-strand, and loop according to the arrangement of amino acids. There are many mature tools for predicting secondary structure, and the PSIPRED tool was chosen for this project. For a protein sequence of length L, the output of PSIPRED is an L×3 matrix, where the 3 dimensions for each residue correspond to the 3 secondary-structure classes. Each residue's secondary structure is then encoded with a 3-dimensional one-hot code: α-helix as [1,0,0], β-strand as [0,1,0], and loop (coil) as [0,0,1].

\(\mathrm{SS} = \begin{bmatrix} S_{1,1} & S_{1,2} & S_{1,3} \\ \vdots & \vdots & \vdots \\ S_{i,1} & S_{i,2} & S_{i,3} \\ \vdots & \vdots & \vdots \\ S_{L,1} & S_{L,2} & S_{L,3} \end{bmatrix}\)
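Assuming the three PSIPRED output columns are ordered helix, strand, loop, a per-residue one-hot matrix such as the SS matrix above can be obtained by taking the class with the highest score for each residue; the argmax step is an assumption of this sketch, not something stated by PSIPRED itself.

```python
import numpy as np

def ss_one_hot(psipred_scores: np.ndarray) -> np.ndarray:
    """Convert an (L, 3) PSIPRED score matrix into (L, 3) one-hot rows.

    Assumes columns are ordered helix, strand, loop, so the result rows are
    [1,0,0] for helix, [0,1,0] for strand and [0,0,1] for loop.
    """
    one_hot = np.zeros_like(psipred_scores, dtype=np.float32)
    one_hot[np.arange(len(psipred_scores)), psipred_scores.argmax(axis=1)] = 1.0
    return one_hot
```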

2.3 Amino Acid Residue Graph Network

In this project, the amino acid residues in each PDB file serve as the nodes of a graph describing residue-residue associations. For a simple representation, we construct an adjacency matrix \( M \in \mathbb{R}^{n_A \times n_A} \), where \( n_A \) is the number of residue nodes. We set a threshold of 3 Å on the distance between atoms of two residues: if any pair of atoms from two residues lies within 3 Å, the two residues are considered associated ("1"); if no atom pair falls within the 3 Å threshold, they are considered unassociated ("0"). Specifically, we count the number of atom pairs within the threshold for each residue pair and use this count as the corresponding entry of the adjacency matrix M, taking self-connections into account. Finally, the adjacency matrix is uniformly normalized to produce a graph network, so N PDB files yield N graph networks.
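A sketch of this construction, assuming Biopython for PDB parsing, is shown below; the row normalization at the end is one possible choice and stands in for the project's exact normalization.

```python
import numpy as np
from Bio.PDB import PDBParser

def residue_adjacency(pdb_path: str, threshold: float = 3.0) -> np.ndarray:
    """Build the residue-residue adjacency matrix M described above.

    M[i, j] counts the atom pairs of residues i and j that lie within `threshold` Å
    (the diagonal counts intra-residue pairs, i.e. self-connections); M is then normalized.
    """
    structure = PDBParser(QUIET=True).get_structure("s", pdb_path)
    residues = [r for r in structure.get_residues() if r.id[0] == " "]  # standard residues only
    coords = [np.array([atom.coord for atom in r]) for r in residues]
    n = len(residues)
    M = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i, n):
            # pairwise distances between all atoms of residue i and residue j
            d = np.linalg.norm(coords[i][:, None, :] - coords[j][None, :, :], axis=-1)
            M[i, j] = M[j, i] = np.sum(d < threshold)
    deg = M.sum(axis=1, keepdims=True)
    return M / np.maximum(deg, 1.0)        # simple row normalization (one possible choice)
```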

2.4 Model Architecture

Fig. 2   GCN model architecture

Based on the traditional GCN [2] network, GATConv incorporates a multi-head attention mechanism to capture the complex interactions among nodes and enhance the richness of node representations, thereby improving the flexibility and adaptability of the model. Through this mechanism, the independent computation of multiple attention heads enables the model to extract node features from different perspectives for a more comprehensive understanding of complex relationships and patterns in the data.


For each node \( i \) and its neighbor node \( j \), the attention coefficient of the \( k \)-th attention head is calculated as follows:


\( e_{ij}^{(k)} = \text{LeakyReLU}\left(a^{(k) \top} \left[ W^{(k)} h_i \parallel W^{(k)} h_j \right]\right) \)


Where: \( W^{(k)} \) is the weight matrix of the \( k \)-th attention head; \( a^{(k)} \) is the attention weight vector of the \( k \)-th attention head; and \( \parallel \) denotes concatenation of the feature vectors.


Subsequently, the attention coefficients of all neighboring nodes are normalized using the softmax function:


\( \alpha_{ij}^{(k)} = \frac{\exp\left(e_{ij}^{(k)}\right)}{\sum_{j \in \mathcal{N}(i)} \exp\left(e_{ij}^{(k)}\right)} \)

Where: \( \mathcal{N}(i) \) denotes the set of neighboring nodes of node \( i \).


The feature representation of node \( i \) output by the \( k \)-th attention head is then:

\( h_i^{(k)^{\prime}} = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(k)} W^{(k)} h_j\right) \)

The model uses a 2-head attention mechanism (\( K = 2 \)) and concatenates the output features of the attention heads to obtain:

\( h_i^{\prime} = \big\|_{k=1}^{K} h_i^{(k)^{\prime}} \)

Then, in the feature aggregation step, average pooling is used so that the influence of different nodes is considered in a balanced way, yielding the final output feature of node \( i \):

\( h_{i}^{\prime\prime} = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} h_{j}^{\prime} \)

The LeakyReLU activation function (with a negative slope of 0.2) is then applied to introduce nonlinearity:

\( h_i = \mathrm{LeakyReLU}(h_{\mathrm{i}}^{\prime\prime}) \)
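The following numpy sketch walks through a single attention head along these equations (attention scores, softmax over neighbors, weighted aggregation, LeakyReLU); shapes and names are illustrative, and the adjacency is assumed to include self-loops so that every softmax row is well defined.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(H, A, W, a):
    """One attention head: H (N, F) node features, A (N, N) adjacency with self-loops,
    W (F, F') head weight matrix, a (2*F',) attention vector."""
    Wh = H @ W                                          # W h_i for every node
    N = H.shape[0]
    e = np.full((N, N), -np.inf)                        # non-neighbors get zero attention
    for i in range(N):
        for j in range(N):
            if A[i, j] != 0:
                e[i, j] = leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
    alpha = np.exp(e - e.max(axis=1, keepdims=True))    # softmax over each node's neighbors
    alpha /= alpha.sum(axis=1, keepdims=True)
    return leaky_relu(alpha @ Wh)                       # weighted aggregation, then nonlinearity
```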

This model uses two GATConv layers, and the propagation between neighboring layers is calculated as follows:

\( H^{(l+1)} = \sigma\left(\widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \)

Where: \( \widetilde{A} = A + I_N \) is the adjacency matrix of the undirected graph \( G \) with added self-connections, \( A \) is the \( N \times N \) adjacency matrix of \( G \), \( I_N \) is the identity matrix, and \( N \) is the number of nodes; \( \widetilde{D} \) is the degree matrix of \( \widetilde{A} \), computed as \( \widetilde{D}_{ii} = \sum_{j} \widetilde{A}_{ij} \); \( W^{(l)} \) is the weight matrix of the \( l \)-th layer; \( H^{(l)} \) is the feature matrix of the \( l \)-th layer, which for the input layer is the model input \( X \) of dimensions \( N \times D \), where \( D \) is the number of input features; and \( \sigma \) denotes the activation function, here LeakyReLU (negative slope 0.2).
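The symmetric normalization \( \widetilde{D}^{-\frac{1}{2}} \widetilde{A} \widetilde{D}^{-\frac{1}{2}} \) can be computed directly from the raw adjacency matrix, as in the short sketch below.

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """Compute D^{-1/2} (A + I) D^{-1/2} as used in the propagation rule above."""
    A_tilde = A + np.eye(A.shape[0])            # add self-connections
    d = A_tilde.sum(axis=1)                     # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

# One propagation step would then be: H_next = LeakyReLU(normalized_adjacency(A) @ H @ W)
```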


The GATConv layers are followed by a fully connected layer, after which a Dropout layer with a dropout rate of 0.2 is added to prevent overfitting. The final classification layer produces a three-class prediction output.
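A sketch of this architecture using PyTorch Geometric's GATConv is given below; the hidden layer size and the activation after the fully connected layer are assumptions, while the 34-dimensional input (20 one-hot + 2 solvent accessibility + 5 physicochemical + 4 torsion + 3 secondary structure), the two heads, the 0.2 dropout and the three-class output follow the text. A graph-level readout could be added if whole-structure labels are used.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class PeptideGAT(nn.Module):
    """Sketch of the two-layer GATConv architecture described above."""

    def __init__(self, in_dim=34, hidden_dim=64, num_classes=3, heads=2):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden_dim, heads=heads)             # output: hidden_dim * heads
        self.gat2 = GATConv(hidden_dim * heads, hidden_dim, heads=heads)
        self.fc = nn.Linear(hidden_dim * heads, hidden_dim)              # fully connected layer
        self.dropout = nn.Dropout(p=0.2)                                 # dropout after the FC layer
        self.classifier = nn.Linear(hidden_dim, num_classes)             # 3-class prediction

    def forward(self, x, edge_index):
        x = F.leaky_relu(self.gat1(x, edge_index), negative_slope=0.2)
        x = F.leaky_relu(self.gat2(x, edge_index), negative_slope=0.2)
        x = self.dropout(F.leaky_relu(self.fc(x), negative_slope=0.2))
        return self.classifier(x)                                        # per-node class logits
```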

2.5 Model Training

1. Loss Function


The model uses the cross-entropy loss as the objective function for the multi-class classification task. The loss function is:

\( L = -\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic}) \)

Where: \( N \) is the number of samples and \( C \) is the number of categories; \( y_{ic} \) is the true label indicator of sample \( i \) for class \( c \), and \( \hat{y}_{ic} \) is the corresponding probability predicted by the model.


2. Optimization and Learning Rate Scheduling


  • Optimizer: the Adam optimizer is used for parameter updates, with the learning rate initially set to 0.01. Adam combines the advantages of the momentum method and RMSprop to adaptively adjust the learning rate.

  • Learning rate scheduling: the StepLR learning rate scheduler is used and set to reduce the learning rate to 10% of the current rate every 20 epochs (see the sketch after this list).
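A minimal sketch of this optimization setup is shown below; the `model` and `train_loader` arguments are assumed to be supplied elsewhere (e.g., the GAT model above and a PyTorch Geometric DataLoader of residue graphs), and the epoch count is illustrative.

```python
import torch
from torch.optim.lr_scheduler import StepLR

def train_gcn(model, train_loader, num_epochs=100):
    """Training sketch following the settings above: Adam (lr=0.01) + StepLR (x0.1 every 20 epochs)."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = StepLR(optimizer, step_size=20, gamma=0.1)
    for epoch in range(num_epochs):
        model.train()
        for batch in train_loader:                  # each batch: a graph with x, edge_index, y
            optimizer.zero_grad()
            out = model(batch.x, batch.edge_index)  # class logits
            loss = criterion(out, batch.y)
            loss.backward()
            optimizer.step()
        scheduler.step()                            # decay the learning rate on schedule
```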

References

[1] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[2] Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks[J]. arXiv preprint arXiv:1609.02907, 2016.
