5. From SVM to GNN: Building an Accurate Peptide-PE Binding Predictor (PEBP)
According to the wet lab's requirements, we needed to build a predictor to determine whether a peptide sequence can bind to PE. An initial literature search yielded 174 relevant data points. Because the dataset was small, we first trained a conventional machine learning model and then applied additional processing before introducing a neural network. Specifically, we first trained an SVM model on the original data. We then expanded the data at a ratio of 1:10, yielding nearly two thousand candidate data points, and used the trained SVM model to filter the expanded data according to specific criteria, producing a new dataset. To exploit the sequence information more deeply, we converted the sequences into structural graphs, extracted features with a Graph Neural Network (GNN) and a BLOSUM substitution matrix, and introduced an attention module for further learning, ultimately obtaining a predictor. In the experimental section, we divided the dataset into a training set and a test set, ran experiments with the constructed model, and measured metrics such as accuracy, recall, and F1 score. Finally, we analyzed the test results and adjusted and optimized the model parameters.
5.2 Our Method
5.2.1 Features and Feature Selection
Extracting appropriate features is a crucial step in building a
high-performance predictive model. Common feature types include amino acid composition (AAC), dipeptide composition (DPC), physicochemical properties of amino acids, and pseudo-amino acid composition, all of which have been widely used in the development of protein and peptide classifiers and have shown excellent performance in those contexts.
Representing protein sequences by calculating the frequency of
amino acids is an effective method. Differences in the frequency
distribution of amino acids among different sequences can be used
to distinguish between different types of proteins. This also
applies to peptide sequences; therefore, we selected AAC as a feature. To compensate for the lack of information about connections between adjacent amino acids, we also introduced DPC. A peptide sequence can be composed of any of the 20 standard amino acids (ACDEFGHIKLMNPQRSTVWY), so a peptide with $L$ amino acids can be represented as: $$\beta = (\beta_1, \beta_2, \dots, \beta_L)$$ where $\beta_1$, $\beta_2$, and $\beta_L$ represent the first, second, and $L$-th amino acids of the peptide sequence, respectively. The definitions of AAC and DPC are as follows:
$$AAC(i) = \frac{x_i}{\sum_{i=1}^{20} x_i} \tag{33}$$
$$DPC(j) = \frac{y_j}{\sum_{j=1}^{400} y_j} \tag{34}$$
where $i$ represents one of the 20 amino acids and $j$ represents one of the 400 dipeptides; $x_i$ is the count of residues of each type, and $y_j$ is the count of dipeptides of each type. To build an efficient PE-binder predictor (PEBP), we further refined AAC and DPC using the fselect.py script in LIBSVM 3.22, removing irrelevant, redundant, and noisy features. The feature selection process was as follows: features were added one at a time, in descending order of their ranking scores, to an initially empty set, and the prediction accuracy of the set was calculated after each addition. The set that achieved the highest prediction accuracy was selected as the optimal feature subset. After these steps, we obtained the optimized AAC (OAAC) and optimized DPC (ODPC).
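As a concrete illustration of Eqs. (33) and (34), the following Python sketch computes the AAC and DPC vectors for a single peptide; the function names and the toy sequence are ours, not part of any published PEBP code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides

def aac(seq: str) -> list[float]:
    """Amino acid composition: frequency of each of the 20 residues (Eq. 33)."""
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]

def dpc(seq: str) -> list[float]:
    """Dipeptide composition: frequency of each of the 400 dipeptides (Eq. 34)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = [pairs.count(d) for d in DIPEPTIDES]
    total = sum(counts) or 1
    return [c / total for c in counts]

if __name__ == "__main__":
    peptide = "ACDKWLYR"  # toy example
    print(len(aac(peptide)), len(dpc(peptide)))  # 20 400
```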
5.2.2 Support Vector Machine
Support Vector Machine (SVM) is a supervised learning algorithm widely used for classification and regression analysis, and it has received increasing attention and application in bioinformatics. We applied SVM in the development of PEBP. The SVM model was built with LIBSVM 3.22, an integrated support vector classification package. The optimal cost parameter c and RBF kernel parameter g were determined using the bundled Python script grid.py. To obtain probability estimates for the predictions (needed to plot the ROC curve), the LIBSVM option -b was set to 1 during model training.
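The same workflow can be sketched with scikit-learn's SVC in place of the LIBSVM command-line tools; this is an illustrative substitute rather than the exact scripts we used, and the feature matrix, labels, and parameter grid below are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# X: n_samples x n_features matrix of OAAC+ODPC features; y: 1 = binds PE, 0 = does not.
X = np.random.rand(174, 40)            # placeholder feature matrix
y = np.random.randint(0, 2, size=174)  # placeholder labels

# Grid over cost C and RBF gamma g, mirroring what grid.py searches in LIBSVM.
param_grid = {"C": [2 ** k for k in range(-5, 16, 2)],
              "gamma": [2 ** k for k in range(-15, 4, 2)]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf", probability=True),  # probability=True ~ LIBSVM -b 1
                      param_grid, scoring="accuracy", cv=cv)
search.fit(X, y)
print("best C/gamma:", search.best_params_, "cv accuracy:", round(search.best_score_, 3))
```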
5.2.3 Prediction Evaluation
In this study, all models were evaluated using five-fold
cross-validation. The entire dataset was randomly divided into
five groups, each containing an equal number of peptides. Four
groups were used for training, and the remaining one was used for
testing, with the process repeated five times to ensure that each
group had the opportunity to serve as the test set. The average
prediction accuracy across the five combinations was calculated as
the model's final performance. To comprehensively evaluate the
PEBP model, sensitivity (Sn), specificity (Sp), accuracy (Acc),
and Matthews Correlation Coefficient (MCC) were used as evaluation
metrics.
$$Sn = \frac{TP}{TP + FN} \tag{35}$$
$$Sp = \frac{TN}{TN + FP} \tag{36}$$
$$Acc = \frac{TP + TN}{TP + FN + FP + TN} \tag{37}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{38}$$
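For completeness, a short Python sketch of Eqs. (35)-(38), evaluated here on the confusion-matrix counts reported below; the function itself is only an illustration.

```python
import math

def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and MCC as defined in Eqs. (35)-(38)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Counts from the confusion matrix reported below (TP, TN, FP, FN).
print(evaluate(tp=25, tn=15, fp=6, fn=10))
```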
The results show that the area under the ROC curve (AUC) is 0.75,
indicating that the model has good but not perfect predictive
performance. The confusion matrix shows TP, TN, FP, and FN as 25,
15, 6, and 10, respectively, consistent with the calculations for
sensitivity, specificity, and MCC. Classification performance
metrics such as precision, recall, and F1 score demonstrate
reasonable predictive capability across different categories.
Overall, the results align with the model evaluation approach
described for five-fold cross-validation, supporting the model's
effectiveness in predicting PE-binders.
The results are as follows.
Figure 21: Receiver Operating Characteristic
Figure 22: Classification Metrics by Class
Figure 23: Confusion Matrix
5.2.4 Peptide Data Augmentation
To make fuller use of the peptide data, we conducted further work. Given the limited amount of source data, we first considered peptide data augmentation. The general process was as follows:
• Generate candidate sequences: Methods such as random
mutation, sequence concatenation, and noise addition were used to
generate candidate sequences.
• Screen candidate sequences: An SVM model was used to
perform preliminary screening of the candidate sequences,
identifying sequences that might bind to PE.
• Expand the dataset: The screened sequences were added to the dataset until the desired expansion size was reached (around 2,000 sequences).
Generating candidate sequences is the key step in data
augmentation, aimed at creating more diverse peptide samples to
improve the generalization ability of the model. Specific methods
include random mutation, sequence concatenation, and noise
addition. Random mutation involves selecting a random amino acid
in the original peptide sequence and replacing it with another
amino acid, introducing minor variations. Sequence concatenation
combines two or more peptide fragments to create new peptide
sequences, increasing sequence diversity while maintaining the
original biological characteristics. Noise addition involves
randomly inserting or deleting amino acids in the peptide chain or
introducing small perturbations between amino acids to simulate
possible variations during experiments. These methods work
together to generate a large number of candidate sequences,
providing a rich data foundation for subsequent screening and
training.
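The three generation strategies can be sketched as follows; this is a simplified illustration (the insertion/deletion choices and the maximum concatenated length are assumptions), and in our pipeline the resulting candidates would still be screened by the trained SVM.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_mutation(seq: str) -> str:
    """Replace one randomly chosen residue with a different amino acid."""
    i = random.randrange(len(seq))
    new = random.choice([a for a in AMINO_ACIDS if a != seq[i]])
    return seq[:i] + new + seq[i + 1:]

def concatenate(seq_a: str, seq_b: str, max_len: int = 50) -> str:
    """Join two peptide fragments into a new candidate sequence."""
    return (seq_a + seq_b)[:max_len]

def add_noise(seq: str) -> str:
    """Randomly insert or delete a single residue to mimic experimental variation."""
    i = random.randrange(len(seq))
    if random.random() < 0.5 and len(seq) > 2:               # deletion
        return seq[:i] + seq[i + 1:]
    return seq[:i] + random.choice(AMINO_ACIDS) + seq[i:]    # insertion

def generate_candidates(seeds: list[str], n: int) -> list[str]:
    """Produce n candidate sequences by randomly applying one of the three operations."""
    ops = [random_mutation, add_noise,
           lambda s: concatenate(s, random.choice(seeds))]
    return [random.choice(ops)(random.choice(seeds)) for _ in range(n)]

candidates = generate_candidates(["ACDKWLYR", "GILKKLLG"], n=10)
print(candidates)
```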
5.2.5 Network architecture of the proposed PEBP
To further explore the sequence information, the model first
converts the sequence into a structural graph to capture its
spatial and topological features, and then performs layer-wise
feature extraction through Graph Neural Networks (GNN).
Additionally, the model derives features from a BLOSUM matrix and employs a CNN to learn them at the sequence level. Subsequently, the model introduces an attention
mechanism to fuse and optimize the extracted features, enhancing
the discriminative power of the feature representation. Finally,
the output module generates an efficient predictor for accurate
target property prediction.
Sequence processing module
The sequence processing module represents peptide sequences from
two perspectives:
structural information and evolutionary information.
Structural information is obtained using the RDKit tool,
which converts peptide sequences in FASTA format into their
corresponding chemical structures in MOL format. These structures
contain molecular data such as atoms, bonds, and coordinates. To
encode the structural information, the peptide sequence is
represented as a molecular graph G = (V,E) , where V is the set of
nodes (atoms), and E is the set of edges (chemical bonds). N(v)
includes all nodes u connected to node v by edge (u,v), and it
describes the neighborhood structure and relationships of node v.
This molecular graph representation captures the spatial and chemical interactions within the peptide sequence, aiding in feature extraction. An example graph is shown below.
Figure 24: The structure of one peptide
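A minimal sketch of this conversion with RDKit, assuming the peptide is given as a one-letter sequence; the adjacency matrix built here corresponds to the matrix A used later in Eq. (40).

```python
from rdkit import Chem
from rdkit.Chem import rdmolops

def peptide_to_graph(sequence: str):
    """Convert a one-letter peptide sequence into a molecular graph G = (V, E)."""
    mol = Chem.MolFromSequence(sequence)          # peptide -> RDKit molecule
    mol_block = Chem.MolToMolBlock(mol)           # MOL-format record (atoms, bonds, coordinates)
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]                  # node set V
    bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
             for b in mol.GetBonds()]                                       # edge set E
    adjacency = rdmolops.GetAdjacencyMatrix(mol)  # A, used later in Eq. (40)
    return atoms, bonds, adjacency, mol_block

atoms, bonds, A, _ = peptide_to_graph("ACDK")
print(len(atoms), "atoms,", len(bonds), "bonds")
```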
Evolutionary information: represented through the BLOSUM matrix (Blocks Substitution Matrix). The BLOSUM matrix quantifies
the similarity between protein sequences based on the conservation
of residues observed across multiple species. It records the
substitution rates of amino acids observed during sequence
alignment across species. Higher scores in the matrix indicate
common substitutions, while lower scores represent rare or
unlikely substitutions. Different BLOSUM matrices, such as
BLOSUM62, are used to balance sequence similarity based on
evolutionary distances. BLOSUM62, for example, is often used to
compare and quantify the similarity between protein sequences. The
BLOSUM matrix is represented as follows:$$\text{BLOSUM} =
\begin{bmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,20} \\ b_{2,1} &
b_{2,2} & \cdots & b_{2,20} \\ \vdots & \vdots & \ddots & \vdots
\\ b_{20,1} & b_{20,2} & \cdots & b_{20,20} \end{bmatrix}$$
Here, each row and column of the matrix corresponds to a standard amino acid, and each element $b_{i,j}$ represents the substitution score between amino acids $i$ and $j$. The BLOSUM
matrix can help reveal the evolutionary history of sequences,
playing an important role in tasks like sequence alignment and
classification.
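One way to turn this into a per-residue feature matrix is to look up, for every position of the peptide, its row of BLOSUM62; the sketch below uses Biopython's bundled BLOSUM62 table and is an illustration rather than our exact encoding (the fixed length of 50 matches the padding described later).

```python
import numpy as np
from Bio.Align import substitution_matrices

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BLOSUM62 = substitution_matrices.load("BLOSUM62")

def blosum_features(sequence: str, max_len: int = 50) -> np.ndarray:
    """Encode a peptide as a (max_len x 20) matrix of BLOSUM62 rows, zero-padded to max_len."""
    feats = np.zeros((max_len, len(AMINO_ACIDS)), dtype=float)
    for i, residue in enumerate(sequence[:max_len]):
        feats[i] = [BLOSUM62[residue, other] for other in AMINO_ACIDS]
    return feats

B = blosum_features("ACDKWLYR")
print(B.shape)  # (50, 20)
```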
Feature extraction module
Graph Neural Networks (GNNs) for Molecular Graphs
In this study, we employed Graph Neural Networks (GNNs) to extract
structural information from the molecular graph of a given
peptide. Since a GNN cannot directly process raw symbolic graph data, we introduced an embedding layer on top of the GNN to convert the graph-structured data into numerical vectors. Following this method, we considered the atomic and edge information of the molecular graph and applied the 1-dimensional Weisfeiler-Lehman (1-WL)
algorithm. The 1-WL algorithm iteratively assigns a label to each
node based on the previous labels of both the node itself and its
neighbors. However, in this study, we also considered the previous
labels of the edges between the nodes and their neighbors.
Additionally, we updated the label of each edge by considering the
labels of the two nodes it connects. To formalize this embedding
method, let $h^{(t)}(v) \in \mathbb{N}$ and $g^{(t)}(u,v) \in \mathbb{N}$ represent the labels of node $v$ and edge $(u,v)$ at iteration $t$, where $\mathbb{N}$ denotes the set of natural numbers. At iteration $t=0$, we initialize the nodes and edges based on their types in the molecular graph, such as $h^{(0)}(\text{oxygen}) = 0$, $h^{(0)}(\text{nitrogen}) = 1$, and so on. For $t>0$, the node and edge labels are updated as follows:
$$h^{(t)}(v) = \text{HASH}\left(h^{(t-1)}(v), \{(h^{(t-1)}(u), g^{(t-1)}(u,v)) \mid u \in N(v)\}\right)$$
$$g^{(t)}(u,v) = \text{HASH}\left(g^{(t-1)}(u,v), h^{(t)}(u), h^{(t)}(v)\right)$$
where $N(v)$ denotes the set of neighbors of node $v$, and HASH is a function that maps the node and edge information to a unique value in $\mathbb{N}$, ensuring uniqueness across iterations.
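A minimal sketch of this node/edge relabeling, using a Python dictionary as the HASH function that assigns a fresh integer to each unseen label tuple; the data structures are ours, not taken from any reference implementation.

```python
from collections import defaultdict
from itertools import count

def wl_labels(node_types, edges, iterations=2):
    """1-WL style relabeling of nodes and edges.

    node_types: dict node -> initial type label (e.g. atom symbol)
    edges: dict (u, v) -> initial edge label (e.g. bond type), with u < v
    """
    fresh = count()
    table = defaultdict(lambda: next(fresh))          # HASH: label tuple -> new integer label
    h = {v: table[("node", t)] for v, t in node_types.items()}
    g = {e: table[("edge", t)] for e, t in edges.items()}

    neighbors = defaultdict(list)
    for (u, v) in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    def edge_label(u, v):
        return g[(u, v)] if (u, v) in g else g[(v, u)]

    for _ in range(iterations):
        # Update node labels from their own label plus neighbor (node, edge) label pairs.
        h_new = {v: table[(h[v], tuple(sorted((h[u], edge_label(u, v)) for u in neighbors[v])))]
                 for v in h}
        # Update edge labels from their own label plus the labels of the two endpoints.
        g_new = {(u, v): table[(g[(u, v)], frozenset((h_new[u], h_new[v])))] for (u, v) in g}
        h, g = h_new, g_new
    return h, g

# Toy graph: an O=C-N fragment
h, g = wl_labels({0: "O", 1: "C", 2: "N"}, {(0, 1): "double", (1, 2): "single"})
print(h, g)
```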
Through this process, we obtain a “fingerprint” for the molecular
graph of a given peptide. This fingerprint is then encoded into a
matrix $X$ of size $n \times m$, where $n$ is the number of nodes
in the molecular graph, and $m$ is a hyperparameter of the
embedding layer. After embedding, $X$ is processed by the GNN layers to extract structural information, defined as: $$R = \text{ReLU}(X^{(k)} W_{gnn}) \tag{39}$$ where $k$ denotes the $k$-th GNN layer, $W_{gnn}$ is a weight matrix of size $m \times m$, and ReLU is a non-linear activation function.
To propagate this information through the layers, we use the
following update rule: $$X^{(k+1)} = X^{(k)} + AR \tag{40}$$
where $A$ is the adjacency matrix of the molecular graph. To
obtain the output $y_{graph}^{(k)}$ at each GNN layer, we compute
the mean value of each column of $X^{(k)} = (x_1^{(k)}, x_2^{(k)},
\cdots, x_n^{(k)})$ as follows: $$y_{graph}^{(k)} = \frac{1}{n}
\sum_{i=1}^{n} x_i^{(k)} \tag{41}$$
where $x_i^{(k)}$ represents the vector representation of the
$i$-th node in the molecular graph at the $k$-th GNN layer, and
$n$ is the number of nodes in the graph. The vector
$y_{graph}^{(k)}$ provides the representation of the peptide
sequence at the $k$-th GNN layer.
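Equations (39)-(41) amount to a few matrix operations; the NumPy sketch below mirrors them for a single forward pass (the layer count, embedding size, and random weights are placeholders).

```python
import numpy as np

def gnn_forward(X0: np.ndarray, A: np.ndarray, num_layers: int = 3, seed: int = 0):
    """Forward pass following Eqs. (39)-(41).

    X0: n x m node embedding matrix; A: n x n adjacency matrix of the molecular graph.
    Returns the list of per-layer graph representations y_graph^(k) (each of length m).
    """
    rng = np.random.default_rng(seed)
    n, m = X0.shape
    X = X0
    outputs = []
    for _ in range(num_layers):
        W = rng.standard_normal((m, m)) * 0.1      # W_gnn for this layer (random placeholder)
        R = np.maximum(X @ W, 0.0)                 # Eq. (39): R = ReLU(X^(k) W_gnn)
        X = X + A @ R                              # Eq. (40): X^(k+1) = X^(k) + A R
        outputs.append(X.mean(axis=0))             # Eq. (41): column-wise mean over the n nodes
    return outputs

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-atom chain
X0 = np.random.default_rng(1).standard_normal((3, 8))         # n=3 nodes, m=8 embedding dims
y_layers = gnn_forward(X0, A)
print([y.shape for y in y_layers])
```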
CNN Network for BLOSUM
To capture specific features, we designed a Convolutional Neural
Network (CNN) that leverages evolutionary information from BLOSUM.
The network mainly consists of a two-dimensional convolutional
layer and a non-linear activation function.
In this CNN network, the BLOSUM of a peptide is first normalized
through a scaling layer. The scaled BLOSUM is then processed by a
two-dimensional convolutional layer with a non-linear activation
function to extract local hidden features of the peptide sequence.
The process is described as follows:
First, the BLOSUM is scaled by the normalization layer:
$$B_{scaled}= normalize(B) \tag{42}$$
Next, it is processed by a two-dimensional convolutional layer and
ReLU activation function: $$H= ReLU(conv2d(B_{scaled})) \tag{43}$$
where $B_{scaled}$ is the normalized BLOSUM, conv2d represents the two-dimensional convolution operation, and $H$ is the hidden feature matrix produced by the convolution.
To handle the variable lengths of input sequences, we applied
zero- padding to standardize all sequences to a fixed length of
50, ensuring compatibility with the neural network, which requires
fixed-length inputs. To further prevent overfitting, we employed
dropout during training.
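A minimal PyTorch sketch of this branch, assuming the padded 50 x 20 BLOSUM encoding described above; the channel count, kernel size, and dropout rate are placeholders.

```python
import torch
import torch.nn as nn

class BlosumCNN(nn.Module):
    """Normalization + 2-D convolution + ReLU over a padded BLOSUM encoding (Eqs. 42-43)."""
    def __init__(self, out_channels: int = 16, kernel_size: int = 3, dropout: float = 0.3):
        super().__init__()
        self.conv = nn.Conv2d(1, out_channels, kernel_size, padding=kernel_size // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, blosum: torch.Tensor) -> torch.Tensor:
        # blosum: (batch, 50, 20) matrix, zero-padded to a fixed length of 50 residues.
        b_scaled = (blosum - blosum.mean()) / (blosum.std() + 1e-8)  # Eq. (42): normalization
        h = torch.relu(self.conv(b_scaled.unsqueeze(1)))             # Eq. (43): Conv2D + ReLU
        h = self.dropout(h)                                          # regularization against overfitting
        return h.flatten(start_dim=1)                                # flatten for feature fusion

x = torch.randn(4, 50, 20)           # batch of 4 peptides encoded with BLOSUM rows
print(BlosumCNN()(x).shape)          # torch.Size([4, 16000])
```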
Attention module
The attention mechanism is an attempt to mimic the human brain's ability to selectively concentrate on a few important parts while ignoring the rest. In our network architecture, the
attention module cooperates with the feature extraction module to
capture interpretable features. It assigns different weights to
the concatenated feature vectors obtained by separately combining
the output of the CNN network with the output of each GNN layer in
the feature extraction module. The higher the weight, the more
important the corresponding feature vector. For the convenience of
description, we denote all the concatenated feature vectors as F
with the size $h \times u$: $$F = (f_1, f_2, \cdots, f_h) $$
where $f_i$ denotes the $i$-th concatenated feature vector; $h$ denotes the number of concatenated feature vectors, which is determined by the number of GNN layers; and $u$ denotes the length of each concatenated feature vector. In this work, the attention
mechanism takes all the concatenated feature vectors F as input to
calculate a weight vector $a$, in which each element is the weight
of the corresponding feature vector: $$a = \text{softmax}(w_2
\tanh(w_1 F^T)) \tag{44}$$ $$z = aF \tag{45}$$
where $w_1$ denotes a weight matrix with the size $d \times u$;
$w_2$ is a weight vector of size $d$. Note that $d$ is a hyperparameter that can be chosen freely and tuned during network training. The softmax ensures that all the computed weights sum to 1, and tanh is a non-linear activation function. Finally, we multiply the weight vector $a$ by $F$, yielding the final representation $z$ of the peptide sequence.
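A compact PyTorch sketch of Eqs. (44)-(45); the hyperparameter d and the feature sizes are placeholders.

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Weights the h concatenated feature vectors and pools them into one representation z."""
    def __init__(self, u: int, d: int = 32):
        super().__init__()
        self.w1 = nn.Linear(u, d, bias=False)   # w1: d x u weight matrix
        self.w2 = nn.Linear(d, 1, bias=False)   # w2: weight vector of size d

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # F: (batch, h, u) concatenated feature vectors from the CNN and GNN branches.
        scores = self.w2(torch.tanh(self.w1(F))).squeeze(-1)   # w2 tanh(w1 F^T)
        a = torch.softmax(scores, dim=-1)                       # Eq. (44): attention weights
        z = torch.bmm(a.unsqueeze(1), F).squeeze(1)             # Eq. (45): z = a F, shape (batch, u)
        return z

F = torch.randn(4, 3, 128)   # batch of 4, h = 3 GNN layers, u = 128 per concatenated vector
print(FeatureAttention(u=128)(F).shape)   # torch.Size([4, 128])
```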
Output module
The feature vector $z$, which contains the structural and evolutionary information of the peptide, is produced by the upstream attention module and passed into a fully connected layer with a ReLU activation function: $$y = \text{ReLU}(Wz + b) \tag{46}$$
where $W$ is the weight matrix and $b$ is the bias vector.
The output of the fully connected layer, $y = [y_1, y_2]$, is fed
into a softmax layer to compute the binding probabilities. Here:
$y_1$ represents the score indicating the likelihood of the
peptide binding to PE. $y_2$ represents the score indicating the
likelihood of the peptide not binding to PE. The softmax function
transforms these scores into probabilities: $$p_1 =
\frac{\exp(y_1)}{\exp(y_1) + \exp(y_2)} \tag{47}$$ $$p_2 =
\frac{\exp(y_2)}{\exp(y_1) + \exp(y_2)} \tag{48}$$
where: $p_1$ is the predicted probability that the peptide binds
to PE, and $p_2$ is the predicted probability that the peptide
does not bind to PE.
The model will predict that the peptide binds to PE if $p_1 >
p_2$, and predict no binding if $p_2 > p_1$.
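The output module thus reduces to a linear layer followed by softmax; a sketch under the same placeholder sizes as above.

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    """Fully connected layer + softmax giving binding / non-binding probabilities (Eqs. 46-48)."""
    def __init__(self, u: int = 128):
        super().__init__()
        self.fc = nn.Linear(u, 2)    # W and b of Eq. (46); two outputs y = [y1, y2]

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        y = torch.relu(self.fc(z))            # Eq. (46): y = ReLU(Wz + b)
        return torch.softmax(y, dim=-1)       # Eqs. (47)-(48): probabilities [p1, p2]

z = torch.randn(4, 128)                       # representations from the attention module
p = OutputModule()(z)
binds_pe = p[:, 0] > p[:, 1]                  # predict binding when p1 > p2
print(p, binds_pe)
```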