TransPed-Mutant

A model based on Transformer for the design and optimization of immune checkpoint binding peptide vaccines.

Background

In wet lab experiments, to alleviate immune dysregulation at sites of colitis, we utilized the PD-1/PD-L1 immune checkpoint and, through the Lpp-OmpA membrane display system, displayed PD-L1 on the surface of the EcN outer membrane. Only the functional domain that binds to PD-1 was retained to construct the fusion protein. In patients with IBD, T-cell activity is excessively strong, leading to an overactive immune response at the colitis site. By utilizing the PD-1/PD-L1 immune checkpoint, we designed an expression system where engineered bacteria express and secrete PD-L1 protein. The PD-L1 protein binds to the PD-1 protein on CD8+ T cells, inhibiting T-cell activity and thereby suppressing the excessive immune response at the colitis site, achieving immune modulation. However, activating the PD-1/PD-L1 immune checkpoint also inhibits the anti-tumor activity of tumor-infiltrating T cells. This paper will use the PD-1/PD-L1 immune checkpoint as an example to demonstrate the TransPed-Mutant model for designing binding peptide vaccines targeting immune checkpoints.

Figure 1 : Mechanism of action of PD-1 and PD-L1

PD-1 binding peptides do not activate the inhibitory signaling pathway of PD-1 like PD-L1 does. Instead, they competitively bind to PD-1, thereby preventing the binding of PD-L1 on the surface of tumor cells to PD-1. The effect of this competitive blockade is that although PD-1 binding peptides bind to PD-1, they do not transmit inhibitory signals. Unlike PD-L1, PD-1 binding peptides do not trigger the mechanism that suppresses T-cell activity, so they do not inhibit T-cell function. As a result, T cells can still remain active, recognizing and killing tumor cells, thereby achieving an anti-tumor effect. Whether it's using probiotics engineered to bind PD-1 and PD-L1 to suppress T-cell attacks on normal cells and reduce inflammation in wet lab experiments, or designing peptide vaccines to block the interaction between PD-1 and natural PD-L1, TransPed-Mutant will effectively predict and design such interventions.

Blocking the interaction between PD-1 and PD-L1 can activate tumor-infiltrating T cells and restore their anti-tumor activity, offering promising therapeutic effects for malignant tumors. Although antibody drugs targeting the PD-1/PD-L1 pathway have shown certain efficacy in cancer treatment, current antibody drugs have several issues, including high production costs, significant individual variability, the potential to trigger inappropriate immune responses, and inevitable drawbacks such as poor organ or tumor penetration and low oral bioavailability. Therefore, the development of peptide inhibitors to compensate for the shortcomings of PD-1/PD-L1 antibody blockers is of great significance. This paper is the first to apply deep learning technology, using attention scores to analyze PD-1 binding peptide sites. We conducted analyses of the amino acid composition, position-based amino acid preference, and simulated site analysis of PD-1 binding peptides. The PD-1 binding peptides selected through this approach are expected to serve as candidate peptide drugs for blocking the PD-1/PD-L1 interaction. This research provides meaningful insights into the computational design of PD-1 binding peptides.

TransPed-Mutant is a pan-specific approach that can be applied to various immune checkpoint interactions, such as PD-1/PD-L1, CTLA-4/CD80, LAG-3/MHC II, TIGIT/CD155, and others. The core concept of the TransPed-Mutant model is to apply self-attention to peptides in order to obtain binding scores.

Data Sources

We found a large amount of data in databases such as ImMunoGeneTics Information System, IUPHAR/BPS Guide to PHARMACOLOGY, PeptideAtlas, PepBank, and IEDB. To further expand the dataset, we used web crawlers to define related functions in the Protein Data Bank, sending HTTP requests and parsing the returned HTML documents to extract PDB IDs, obtain PDB files, and convert them into fasta files for later use. Then, we traversed each residue in the chain and used the is_aa function to check whether it was a standard amino acid. If it was, we used PPBuilder to build peptide sequences from the chain and obtained the sequence strings. We analyzed protein-protein binding sites, traversed each pair of binding residues, encapsulated and returned the sequence information, added a PDB parser, and merged all the data into a DataFrame. We converted the three-letter sequences into single-letter sequences. In total, we obtained 36,774 valid sequence pairs of binding sites such as PD-1/PD-L1, CTLA-4/CD80, LAG-3/MHC II, and TIGIT/CD155, which served as positive samples for the model. We also added noise data from various sources, and a 1:1 sample ratio was used to construct the dataset. Some of the code and data have been uploaded to Github (https://github.com/Charliserein/TransPed-Mutant).

Model Architecture

Figure 2: Transformer Basic Architecture

Transformer is a deep learning model architecture used for natural language processing (NLP) and other sequence-to-sequence tasks, first proposed by Vaswani et al. in 2017. The Transformer architecture introduced the self-attention mechanism, a key innovation that enables outstanding performance in handling sequential data. Since the Transformer structure itself lacks the ability to process sequence order, we introduced Position Embeddings to capture the positional information of amino acids.

Figure 3: TransPed Model Architecture

First, TransPed is developed by constructing a transformer-based model to predict immune checkpoint protein binding. For vaccine design, the attention scores generated by TransPed are then used to launch the DeepMutant program, which automatically optimizes mutant peptides with higher affinity for the target peptide. The proposed framework can automatically generate potential peptide vaccines for experimenters.

In this paper, five-fold cross-validation is used to optimize the model during the training phase. The training set is divided into five equal parts, with four parts used for model training and the remaining part used for model evaluation with the same parameters. The training and evaluation process is repeated five times, ensuring that each part of the data participates in model training four times and in model evaluation once. Finally, the average result of the five model evaluations is taken as the final evaluation result. This approach helps to avoid overfitting to some extent.

The attention mechanism is the core of the Transformer. It focuses on important information from a large amount of data, reducing the influence of irrelevant information. Essentially, it maps a query \( Q \) to a set of key-value ( \( K - V \) ) pairs and then produces an output, where the \( K - V \) pairs represent the sequence elements stored in memory. This process reflects the attention scores (i.e., weights) based on the correlation or similarity between \( Q \) and \( K \). The attention scores represent the importance of the information (i.e., \( V \) ). The larger the attention score, the more focused the corresponding information. Self-attention is a variant of the attention mechanism that captures internal correlations within a sequence, reducing dependency on external information.

The following are the main modules of our model.

1. Input Embedding Block

The input consists of two peptide sequences, pepA and pepB (i.e., binding residue sequences). Before each sequence enters the model, it needs to be transformed into a vector representation through an embedding layer, with position encoding added. First, we pad the peptide sequences to a maximum length of 12 and 15, respectively, to handle variable input lengths. Then, a character embedding model is used to create a unique embedding for each amino acid, with the embedding dimension defined as \( d_X \). On the other hand, the order of amino acids is crucial for the structure and function of peptides, but the aforementioned embedding method does not account for this. Therefore, we apply position embeddings to encode the position of amino acids in the sequence. Given a position \( p \) in the sequence, the position embedding is encoded as a \( d_X \)-dimensional vector, and the value of the \( i \)-th element of this vector is defined as \( PE(p) \). \[ \begin{align} PE(p)_{2i} &= \sin \left( \frac{p}{10,000^{\frac{2i}{d_x}}} \right) \\ PE(p)_{2i+1} &= \cos \left( \frac{p}{10,000^{\frac{2i}{d_x}}} \right) \end{align} \] In this method, \( 2i \) represents the even dimensions, and \( 2i + 1 \) represents the odd dimensions. This positional embedding method not only captures the absolute position information of amino acids but also reflects their relative position. Finally, the amino acid embeddings and positional embeddings are summed to obtain the sequence embedding.

2. Amino-Acid Embedding

Compared to RNNs, which input words one by one and naturally preserve the sequential information of each word, the Transformer uses a self-attention mechanism to extract information. In the Transformer, each word in a sentence is processed in parallel. Although it considers the influence of all words on each word, it does not take into account the positional relationships between words, i.e., the context. Therefore, it is necessary to introduce positional information.

The implementation idea of the attention mechanism is to first calculate the attention score between the first letter and each letter in the sequence (including the first letter itself). Then, the attention scores are multiplied by the information of the corresponding letters and summed up. The result is the weighted sum of the first letter with all the letters in the sentence. This process is repeated for the second letter, third letter, and so on.

3. Encoder Block

The task of the encoder is to encode the input sequence information. Each encoder block contains a multi-head self-attention mechanism, two linear layers, the Swish activation function, and layer normalization. In this step, we will process the input sequentially.

Figure 4: Self Attention Mechanism

3.1 Self Attention Mechanism

As shown in the figure above, the word vector \(\alpha^i\), which includes positional information, is used as the input to the Self-Attention Mechanism. \(\alpha^i\) represents the word vector of the \(i+1\)-th word in a sentence. \(\alpha^i\) is multiplied by three matrices \(W^Q\), \(W^K\), and \(W^V\) to obtain \(q^i\), \(k^i\), and \(v^i\), respectively. Here, - \(q\) is the query vector, - \(k\) is the "key" or "being queried" vector, - \(v\) is the "value" or "content" vector.

Next, we calculate the attention information for each word. Taking the attention information of the first word with all the words in the sequence as an example, we first multiply \(q^0\) by \[ k^0, \quad k^1, \quad k^2, \quad k^3 \] to obtain the constant attention values \(\alpha_{00}, \quad \alpha_{01}, \quad \alpha_{02}, \quad \alpha_{03}\). Then, these values are passed through Softmax normalization to get the attention scores \(\alpha_{00}, \quad \alpha_{01}, \quad \alpha_{02}, \quad \alpha_{03}\) for the first word with all the words. Their sum is 1. Finally, the attention scores are multiplied by the corresponding word vectors \(v^0, \quad v^1, \quad v^2, \quad v^3\), resulting in the weighted information of the first word with all the words in the sentence. The weighted sum is: \[ b^0 = \alpha_{00} * v^0 + \alpha_{01} * v^1 + \alpha_{02} * v^2 + \alpha_{03} * v^3 \] The weighted sum of the 2nd, 3rd, and 4th words with all the words in the sequence is: \[ b^1, \quad b^2, \quad b^3 \] and so on.

In summary, the above describes the core idea of the attention mechanism. To speed up computation, computers typically handle these calculations in matrix form.

As shown in the figure below, the word vector matrix \(\alpha^i\) is multiplied by three matrices \(W^Q\), \(W^K\), and \(W^V\), resulting in \(q^i\), \(k^i\), and \(v^i\). The dimensions of the \(W^Q\), \(W^K\), and \(W^V\) matrices are [word vector length, word vector length].

Figure 5: Attention Matrix

Next, the \(q\) matrix is multiplied by the \(k\) matrix to obtain the attention value matrix \(\alpha\), as shown in the figure below. Where, \[ \alpha_{i,j} = \frac{q * k^T}{\sqrt{d}} \] \(k^T\): the transpose of the \(k\) matrix.
\(d\): the word vector length.

Then, each row of matrix \(\alpha\) undergoes a Softmax calculation to compute the attention score matrix \(\hat{\alpha}_{ij}\). The formula is as follows: \[ \hat{\alpha}_{i,j} = \frac{(i, j)}{\sum_{j=s}^{j=0}(i, j)} \] \newpage Finally, the attention score matrix \(\hat{\alpha}_{ij}\) is multiplied by the \(v^j\) matrix to obtain the output matrix \(b^i\), where: \[ i = \sum_{j=0}^{j=s} \hat{\alpha}_{i,j} \] \[ b^0 = \hat{\alpha}_{00} * v_0 + \hat{\alpha}_{01} * v_1 + \hat{\alpha}_{02} * v_2 + \hat{\alpha}_{03} * v_3 \] This represents the dot product of the attention score matrix \(\hat{\alpha}_{i,j}\) and the \(v^j\) matrix, which is also the weighted sum. The above outlines the complete process of attention mechanism calculation.

3.2 Multi-Head Self-Attention in Decoder Block

For the input sequence (here represented as \(H\) concat, obtained by concatenating the output from the encoder), it first passes through the multi-head self-attention mechanism in the decoder. This process is similar to the multi-head attention in the encoder.

Figure 6: Multi-Head Self-Attention Mechanism

The multi-head attention mechanism splits the \(q^i\), \(k^i\), and \(v^i\) matrices into smaller matrices of the same shape along the feature dimension (word vector length). As shown in the figure below, splitting into 2 smaller matrices results in two-headed attention. Next, the \(b\) matrix is computed as described above for each head, and then the \(b\) matrices from each head are concatenated, forming the final attention matrix.

For the concatenated input sequence \(H_{\text{concat}}\), it first passes through the multi-head self-attention mechanism in the decoder. The steps are as follows:
1.1 Multi-head Self-attention Score Calculation The query (Q), key (K), and value (V) are derived from the input sequence \(H_{\text{concat}}\).
1. Calculate Query \(Q\), Key \(K\), and Value \(V\): \[ Q_{\text{concat}} = H_{\text{concat}} W_Q, \quad K_{\text{concat}} = H_{\text{concat}} W_K, \quad V_{\text{concat}} = H_{\text{concat}} W_V \] 2. Calculate the attention scores (before Softmax): \[ Z_{\text{concat}} = \frac{Q_{\text{concat}} K_{\text{concat}}^T}{\sqrt{d_k}} V_{\text{concat}} \] 1.2 Multi-head Attention Concatenation:
Concatenate the outputs from multiple attention heads, and apply a linear transformation: \[ H_{\text{concat}}^{\text{multihead}} = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) W_O \]

The Add operation adopts the concept from residual neural networks, where the input \(\alpha\) matrix of the Multi-Head Attention is directly added to the output \(b\). This allows the network to train deeper, resulting in the \(b\) matrix. It then undergoes Layer Normalization to speed up the training process, ensuring that each row of \(b\) (i.e., each sentence) is normalized to a standard normal distribution, with the output being \(\hat{b}\). The formulas are as follows: Mean: \[ \mu_i = \frac{1}{s} \sum_{j=1}^{s} b_{ij}, \quad \text{where } s = \text{the length of } b_i. \] Variance: \[ \sigma_i = \frac{1}{s} \sum_{j=0}^{s} (b_{ij} - \mu_i)^2 \] Normalization: \[ \text{LayerNorm}(x) = \frac{b_{ij} - \mu_i}{\sqrt{\sigma_i + \epsilon}} * \gamma + \beta \]

Figure 7: Layer normalization

4. Feedforward Neural Network (FFN)

The Feedforward Neural Network (FNN) in the decoder consists of two linear layers, with the ReLU activation function used in between.

1. First linear transformation: \[ H^{\text{concat}}_{\text{linear1}} = H^{\text{concat}}_{\text{multihead}} W_1 + b_1 \] 2. ReLU activation function: \[ H^{\text{concat}}_{\text{relu}} = \text{ReLU}(H^{\text{concat}}_{\text{linear1}}) \] 3. Second linear transformation: \[ H^{\text{concat}}_{\text{linear2}} = H^{\text{concat}}_{\text{relu}} W_2 + b_2 \] Residual Connection and Layer Normalization:
Then, the output of the feedforward neural network is connected to the output of the multi-head attention through residual connection, followed by layer normalization: \[ H^{\text{concat}}_{\text{decoder}} = \text{LayerNorm}(H^{\text{concat}}_{\text{linear2}} + H^{\text{concat}}_{\text{multihead}}) \]

After the decoder, we calculate the final output through an additional linear layer, Swish activation function, layer normalization, and Softmax.

3.1 Linear Layer: \[ H^{\text{linear}}_{\text{final}} = H^{\text{concat}}_{\text{decoder}} W_3 + b_3 \] 3.2 Swish Activation Function: \[ \text{Swish}(x) = x \cdot \sigma(x) \] \[ H^{\text{swish}}_{\text{final}} = \text{Swish}(H^{\text{linear}}_{\text{final}}) \] 3.3 Layer Normalization: \[ H^{\text{norm}}_{\text{final}} = \text{LayerNorm}(H^{\text{swish}}_{\text{final}}) \] 3.4 Softmax Output for True Binding Probability and Attention Distribution:

We calculate the predicted true binding probability \( P(\text{bind}) \) and attention distribution \( \alpha \) using Softmax:

1. True Binding Probability: \[ P(\text{bind}) = \text{Softmax}(H^{\text{norm}}_{\text{final}}) \] The Softmax formula is: \[ P(y=1|x) = \frac{\exp(z_1)}{\exp(z_0) + \exp(z_1)} \] 2. Attention Distribution: The attention distribution \( \alpha \) from the decoder is extracted from the multi-head attention mechanism, representing the model's focus at each position.

Result

Compared to Recurrent Neural Networks (RNNs), Transformer achieves parallelization and solves the long-term dependency problem, allowing it to process data faster than RNNs. In contrast to Convolutional Neural Networks (CNNs), which excel at extracting local information, Transformer extracts more global information, making it suitable for exploring the full sequence of peptides. In experiments, Transformer, as the encoder block in TransPed, demonstrated better performance than both RNNs and CNNs, providing more accurate predictions of peptide binding scores.

Figure 8: Attention heatmap of PD-1 and PD-L1 in positive samples

Through the TransPed framework, we successfully identified key residues, and the attention scores corresponding to the amino acids at each residue. We extracted the attention scores for the peptides binding to PD-1 and PD-L1. Based on these scores, we found that, compared to other amino acids such as alanine (A), leucine (L), proline (P), serine (S), and threonine (T), arginine (R) obtained higher scores when PD-1 binds to PD-L1. This indicates that these amino acids play a crucial role in the binding to PD-1.

Figure 9: Distribution of amino acids with higher scores

From the figure, it can be observed that there are both similarities and differences in the amino acid composition across the three PD-1 binding peptide datasets (400, 300, and 100 peptides). Alanine shows a relatively high proportion in all datasets, particularly in the 400 and 300 peptide PD-1 binding datasets, where it approaches 0.08. Cysteine has a low proportion across all three datasets, especially in the 100 peptide PD-1 binding dataset, where its proportion is nearly zero. Threonine has a significantly higher proportion in the 400 peptide dataset, close to 0.09, which may suggest that the occurrence frequency of threonine increases in larger datasets. Overall, the proportion of some amino acids varies across datasets of different sizes, with proline (P) and threonine (T) showing the most noticeable changes, indicating that the proportion of these amino acids in PD-1 binding peptides changes as the sample size increases.

DeepMutant Module

Figure 10: DeepMutant Module

TransPed and DeepMutant together form a framework that applies transformers to the field of biomolecular binding and mutation. This framework can be applied to biomolecular mutation tasks such as epitope optimization or drug design, making it especially suitable for vaccine development. For example, we designed a PD-1/PD-L1 targeting vaccine in wet lab experiments, where an overly strong T-cell immune response can trigger inflammation in the body. The core challenge in vaccine development is how to enhance specific binding capability while maintaining sufficient immunogenicity. The DeepMutant program is well-suited for this task. First, a transformer-derived model is deployed to train mutation direction data for biomolecules, and then attention scores for mutation directions are obtained. Based on these attention scores, the DeepMutant program will identify better mutants.

Main Process Steps:

The image shows two peptides: the target peptide and the original peptide. The target peptide has a weaker binding ability, while the original peptide serves as a reference. After inputting the peptides, the TransPed model analyzes the binding between the target and source peptides, obtaining attention scores for the target peptide in relation to the original peptide, as well as the contribution ratio of the target peptide. Higher attention scores are indicated in red for significant contributions, with deeper colors signifying higher contributions for the original peptide. The contribution ratio of the target peptide is marked in green, indicating higher contributions. The DeepMutant model further optimizes the target peptide sequence, offering three strategies for generating mutant peptides: Strategy 1: Prioritize the contributions of negative samples and mutate sites with high negative contributions. Strategy 2: Prioritize the contributions of negative samples while avoiding mutations at sites with high contributions in positive samples. Strategy 3: Focus only on mutating specific sites, disregarding the contributions of positive and negative samples. This approach leverages attention scores to identify important amino acid sites, applying different strategies for mutations, ultimately generating mutant peptides with improved binding capabilities.

The image above shows mutations performed on 20 PD-1 binding peptides using Strategy 1, which prioritizes the contributions of negative samples and mutates sites with high negative contributions. Each original peptide segment underwent mutations of 2 to 4 amino acids, primarily at different positions. The mutated amino acids are mostly hydrophobic or have strong structural stability, such as alanine (A), leucine (L), and proline (P), which aligns with our previous analysis of PD-1 specific binding. The sequence similarity scores after mutation range from 0.6 to 0.8, indicating that the mutated sequences largely retain a certain degree of similarity to the original sequences.

Figure 11: Attention scores in negative samples are used for mutations

Taking the Y68 and H98 binding sites as examples, we compared the experimentally determined 3D structure of PD-1/PD-L1 with the attention map from the sample. We mutated the binding peptide VFNWGHFQE simulated for PD-L1. As shown in the figure, the glycine (G) and asparagine (N) residue positions have lower scores, so the model prioritized and mutated these sites with high negative contributions, resulting in the mutant peptide VFAWLHFQE.

Figure 12: Comparison of site affinity before and after mutations

We simultaneously used AlphaFold3 to predict the binding affinity of the original protein and the protein modified by the mutant peptide at the corresponding binding sites. Among every 100 mutant peptides, 86 showed improved binding energy. The chart above compares the binding affinity (K_D values) between mutant and original sites, with lower K_D values indicating stronger binding affinity. For each category of sites, we selected 10 control experiments and averaged the K_D values. Overall, the impact of mutations on binding affinity varies across four immune checkpoint targets. All mutant sites demonstrated varying degrees of improved binding affinity compared to the original sites, indicating that all mutation designs are more effective for these targets. The K_D values of the mutant sites significantly decreased, suggesting enhanced binding capability, particularly for LAG-3/MHC II and TIGIT/CD155 mutations, which exhibited notable enhancement effects.

We analyzed the model's performance using the following classification evaluation metrics:

1. Accuracy: \[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

2. MCC: \[\text{MCC} = \frac{(TP \times TN) - (FN \times FP)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}} \]

3. F1 Score: \[F_1 \text{ score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{where} \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN} \]

Here, TP represents true positives, FP stands for false positives, FN denotes false negatives, and TN indicates true negatives. Additionally, we used the area under the ROC curve (AUC) as another performance evaluation metric.

Figure 13: Performance evaluation metrics compared with other model

The image above displays the performance metrics of the model on the independent test set. AUC reflects the model's ability to distinguish between positive and negative samples, with values closer to 1 indicating better performance. Accuracy represents the proportion of correct predictions out of all predictions made. MCC is used to assess the performance of binary classification models, with higher values indicating better predictive effectiveness, particularly when dealing with imbalanced datasets. F1 Score reflects the balance between precision and recall, making it especially useful for handling imbalanced data. For predicting the binding capabilities of specific protein types (such as PD-1 and PD-L1), our model shows a clear performance advantage over other models on the independent test set, achieving an AUC value of 0.92. However, this advantage is not as pronounced in the external test set.

Limitations and Future Improvements:

PD-1 binding peptides are considered potential candidate drugs in cancer therapy. This study presents a method for predicting the affinity of PD-1 binding peptides and optimizing mutations using deep learning techniques. These peptides are promising candidates for blocking the PD-1/PD-L1 interaction. The attention scores corresponding to the amino acids and site-based preference analysis indicate that alanine (A), leucine (L), proline (P), serine (S), and threonine (T) play significant roles in binding to PD-1. Furthermore, PD-1 binding peptides demonstrate notable amino acid site preferences. Fortunately, we performed position embedding for the residues, and we will further analyze the simulated sites. The selected PD-1 binding peptides partially mimic certain characteristics of the PD-L1 protein. The study of simulated sites will provide valuable references for the computational design of PD-1 binding peptides.

Xue Chen et al. categorized 89 PD-1 binding peptides from the EpiSearch Ph.D.-12 phage display library into four clusters for simulated site analysis. Using the default parameters of EpiSearch, only Cluster 1 and Cluster 2 yielded results for simulated site analysis; no relevant results were provided for Cluster 3 and Cluster 4. The residues highlighted in bold in the table appear in both the true PD-L1 epitope residues and the simulated site residues. A comparison revealed that nine residues are consistent between the simulated site residues and the true epitope residues of PD-L1. This result indicates a certain level of reliability in the simulated site analysis based on the data obtained from phage display technology. Moreover, despite the clustering analysis dividing Cluster 1 and Cluster 2 into two groups, the simulated site analysis showed that the conformational epitope residues for both clusters are identical and highly consistent with the true epitope residues on PD-L1. Therefore, it is inferred that the PD-1 binding peptides in Cluster 1 and Cluster 2 may partially mimic certain structural characteristics of the PD-L1 protein, making these peptides promising candidates for future drug development. This provides a positive reference for our subsequent work on simulated sites, and the TransPed-Mutant model is expected to address this issue.

However, this study still has some limitations, as the mutated PD-1 binding peptides have not yet undergone in vitro and in vivo experimental validation. Therefore, it remains uncertain whether these peptides can effectively bind to PD-1 or block the PD-1/PD-L1 interaction. In future work, we will conduct further experimental validations to fill the gap in computational tools for obtaining PD-1 binding peptides. We believe that TransPed-Mutant will contribute significantly to the field of binding peptide design!

Reference

[1] S. H. Kim, F. Castro, Y. Paterson, C. Gravekamp, Cancer Res. 2009, 69, 5860.

[2] M. P. Brynildsen, J. A. Winkler, C. S. Spina, I. C. MacDonald, J. J. Collins, Nat. Biotechnol. 2013, 31, 160.

[3] D. Chandra, A. Jahangir, W. Quispe-Tintaya, M. Einstein, C. Gravekamp, Br. J. Cancer 2013, 108, 2281.

[4] C. Bettegowda, X. Huang, J. Lin, I. Cheong, M. Kohli, S. A. Szabo, X. Zhang, L. A. Diaz, V. E. Velculescu, G. Parmigiani, Nat. Biotechnol. 2006, 24, 1573;

[5] I. Cheong, X. Huang, C. Bettegowda, L. A. Diaz, K. W. Kinzler, S. Zhou, B. Vogelstein, Science 2006, 314, 1308.

[6] M. Shinoh, M. Horinaka, T. Yasuda, S. Yoshikawa, M. Morita, T. Yamada, T. Miki, T. Sakai, Int. J. Oncol. 2013, 42, 903.

[7] Y. Guo, Y. Chen, X. Liu, J.-J. Min, W. Tan, J. H. Zheng, Cancer Lett. 2020, 469, 102.

[8] S. Zhou, Z. Zhao, Y. Lin, S. Gong, F. Li, J. Pan, X. Li, Z. Gao, A. Z. Zhao, Cancer Biol. Ther. 2016, 17, 732;

[9] A. Uchugonova, Y. Zhang, R. Salz, F. Liu, A. Suetsugu, L. Zhang, K. Koenig, R. M. Hoffman, M. Zhao, Anticancer Res. 2015, 35, 5225.

[10] T. X. Phan, V. H. Nguyen, M. T. Q. Duong, Y. Hong, H. E. Choy, J. J. Min, Modul. Immunol. 2015, 59, 664.

[11] S. Felgner, D. Kocijancic, M. Frahm, J. Heise, M. Rohde, K. Zimmermann, C. Falk, M. Erhardt, S. Weiss, Oncoimmunology 2018, 7, e1382791;

[12] M. Sedighi, A. Zahedi Bialvaei, M. R. Hamblin, E. Ohadi, A. Asadi, M. Halajzadeh, V. Lohrasbi, N. Mohammadzadeh, T. Amiriani, M. Krutova, A. Amini, E. Kouhsari, Cancer Med. 2019, 8, 3167.

[13] A. Kupz, R. Curtiss III, S. Bedoui, R. A. Strugnell, PLoS One 2014, 9, e97418.

[14] F. Saccheri, C. Pozzi, F. Avogadri, S. Barozzi, M. Faretta, P. Fusi, M. Rescigno, Sci. Transl. Med. 2010, 2, 44ra57;

[15] W. W. Chang, C. H. Lai, M. C. Chen, C. F. Liu, Y. D. Kuan, S. T. Lin, C. H. Lee, Int. J. Cancer 2013, 133, 1926.

[16] Y. Ma, R. Conforti, L. Aymeric, C. Locher, O. Kepp, G. Kroemer, L. Zitvogel, Cancer Metastasis Rev. 2011, 30, 71.

[17] J. F. Toso, V. J. Gill, P. Hwu, F. M. Marincola, N. P. Restifo, D. J. Schwartzentruber, R. M. Sherry, S. L. Topalian, J. C. Yang, F. Stock, J. Clin. Oncol. 2002, 20,

ROSce-Clearance-enzyme-classifier

Background

Reactive oxygen species (ROS) are highly reactive molecules that play important roles in microbial biological processes. However, excessive accumulation of ROS can lead to oxidative stress and cellular damage. Microorganisms have evolved a diverse array of enzymes to mitigate the harmful effects of ROS. Accurately predicting the categories of ROS clearance enzymes (ROSce) is crucial for understanding the mechanisms of oxidative stress and developing strategies to combat related diseases. However, existing classification methods for ROS-related proteins exhibit certain shortcomings in accuracy and inclusiveness. To address this issue, we propose a novel multi-label ensemble learning framework called the Clearance-enzyme-classifier. This framework integrates three component methods and employs a soft voting-based approach to simultaneously predict multiple ROSce categories. It can identify whether a given protein sequence is a ROSce and determine its type. The three component methods used in this framework are ROSce-RNN, which extracts raw sequence encoding features; ROSce-Bilstm, which predicts protein functions based on sequence information; and ROSce-XGBoost. Comprehensive experiments demonstrate the superior performance and robustness of our method. The Clearance-enzyme-classifier can be accessed at (https://github.com/Charliserein/clearance-enzyme-classifier) for predicting ROSce categories.

Organisms are continuously exposed to environmental stressors that can cause oxidative damage to their cells and tissues. To mitigate the effects of ROS and induced antagonists, microorganisms rely on a well-evolved ROS clearance system composed of enzymes that balance ROS levels under homeostasis. ROS molecules, including superoxide anions, hydrogen peroxide, hydroxyl radicals, and singlet oxygen, are generated from normal cellular metabolism or exposure to environmental stressors. While these ROS molecules play critical roles in various cellular processes, their excessive accumulation can lead to oxidative stress and cellular damage. Consequently, microorganisms have developed multiple defense mechanisms to combat ROS-induced damage, with ROS clearance enzymes (ROSce) playing a vital role in removing ROS and protecting cells from oxidative injury.

With the rapid development of high-throughput sequencing technology, the amount of genomic and proteomic data generated has increased exponentially, leading to an explosive growth in sequencing data in recent years. Despite the availability of numerous bioinformatics tools and databases for protein annotation, the identification of ROS clearance enzymes (ROSce) remains a significant challenge. This is due to the complex functional diversity of ROSce proteins and the lack of clearly defined sequence motifs specific to these proteins. Consequently, traditional protein annotation methods have become increasingly inefficient in keeping pace with the exponential growth of sequencing data. There is an urgent need to develop rapid and effective computational methods to accurately annotate ROSce proteins, which can be achieved through machine learning and deep learning techniques.

Sequence alignment-based methods are widely used for protein functional annotation. However, these methods have limitations in accurately predicting protein functions based solely on sequence similarity. For instance, proteins with highly similar sequences may have different functions, while proteins with low sequence similarity may perform similar functions. Additionally, these methods may not effectively identify remote homologs or proteins with divergent sequences, leading to incorrect functional annotations. Furthermore, sequence alignment methods do not take into account other critical factors, such as protein structure, post-translational modifications, and protein-protein interactions, which can significantly impact protein function.

To overcome these limitations, alternative computational methods, such as those based on machine learning, have been developed. These methods can integrate multiple data sources, including sequence information from various databases, protein structures, and functional annotations, to improve the accuracy of protein function predictions. By leveraging these methods, we can accelerate the process of determining protein functions and unlock the potential of exponentially growing protein sequence data. This study proposes a novel integrated method called Clearance-enzyme-classifier for predicting antioxidant protein categories, which addresses the limitations of existing ROS classification methods. Clearance-enzyme-classifier is a multi-task deep learning framework that can simultaneously predict multiple ROS properties, including whether an input protein sequence is ROSce, and if so, which type of ROSce protein it belongs to. To enhance the accuracy and robustness of the integrated model, this framework employs a voting-based approach that integrates three component methods: ROSce-RNN-attention, ROSce-Bilstm, and ROSce-XGBOOST. ROSce-CNN uses raw sequence encoding for feature extraction, while ROSce-NN is suitable for predicting protein functions based on sequence information by learning complex nonlinear relationships. ROSce-XGBOOST is an ensemble machine learning algorithm that combines the outputs of many decision trees to make final predictions, making it a valuable tool for functional classification. The combination of these three methods has the potential to provide more comprehensive and reliable functional predictions for proteins, while the voting-based approach reduces the impact of individual classifier errors, thus mitigating the risk of overfitting.

Figure 14: The framework of the Clearance-enzyme-classifier

Data Acquisition

To acquire effective data, we constructed positive and negative samples related to ROSce based on UniProt (https://www.uniprot.org), National Library of Medicine (https://www.ncbi.nlm.nih.gov/), Enzyme Database (https://www.brenda-enzymes.org), and Pfam (http://pfam.xfam.org). This comparison generated a refined set of 16,455 ROSce sequences from the databases, while using MMseqs2 sequence alignment to filter out 13,645 non-ROSce sequences with the minimum score for positive matches.

We named the dataset CEDB. To construct the CEDB dataset, we performed UCLUST clustering on the CEDB dataset with a 97% identity threshold. The identity threshold is a set percentage used to determine whether two sequences should be grouped together or considered homologous in clustering or alignment. If the identity value between two sequences is equal to or greater than the set identity threshold, they are regarded as similar sequences and assigned to the same cluster or group. We obtained sequences with higher identity scores by removing identical and redundant sequences. The representative sequences from the clusters generated by UCLUST were then retained in a FASTA file. Next, we labeled these sequences from three perspectives and manually checked the categories of the ROS clearance enzymes to which they belong.

The resulting CEDB dataset consists of 16,455 high-quality sequences, each labeled with one of 26 ROS clearance enzyme categories. This type of multi-class database ensures that the trained model can automatically capture the most relevant features associated with ROS clearance enzymes.

Enzyme	中文名称
thioredoxin reductase	硫氧还蛋白还原酶
cytochrome c peroxidase	细胞色素c过氧化物酶
peroxidase	过氧化物酶
glutathione peroxidase	谷胱甘肽过氧化物酶
nickel superoxide dismutase	镍超氧化物歧化酶
alkyl hydroperoxide reductase	烷基过氧化氢还原酶
thioredoxin 1	硫氧还蛋白1
thioredoxin 2	硫氧还蛋白2
glutaredoxin 1	谷胱甘肽还原蛋白1
glutaredoxin 2	谷胱甘肽还原蛋白2
catalase	过氧化氢酶
catalase-peroxidase	过氧化氢酶-过氧化物酶
superoxide dismutase 2	超氧化物歧化酶2
superoxide dismutase 1	超氧化物歧化酶1
NADH peroxidase	NADH过氧化物酶
superoxide reductase	超氧化物还原酶
Mn-containing catalase	锰含量的过氧化氢酶
monothiol glutaredoxin	单硫基谷胱甘肽还原蛋白
thiol peroxidase	硫氧化物过氧化物酶
peroxiredoxin 5	过氧还蛋白5
peroxiredoxin 6	过氧还蛋白6
peroxiredoxin 1	过氧还蛋白1
alkyl hydroperoxide reductase 1	烷基过氧化氢还原酶1
rubrerythrin	红铁素
peroxiredoxin 3	过氧还蛋白3
glutaredoxin 3	谷胱甘肽还原蛋白3

Clearance-enzyme-classifier is a supervised machine learning framework specifically designed to identify ROS clearance enzymes by analyzing annotations within the ROS annotation space. The framework employs a hierarchical prediction strategy, deploying a tiered structure for the classification of ROS clearance enzymes. Given a protein sequence, Clearance-enzyme-classifier first classifies it as either a ROS clearance enzyme or a non-ROS clearance enzyme. If the input sequence is identified as a ROS clearance enzyme, we predict which ROS clearance enzyme category it belongs to. Therefore, for any sequence analyzed by the Clearance-enzyme-classifier framework, the determination of whether it is a ROS clearance enzyme will also involve predicting its ROS clearance enzyme category.

Figure 15: The framework of RNN-attention-based model

Recurrent Neural Networks (RNNs) are widely used for modeling intrinsic patterns in time series data; however, classical RNNs struggle with long-term dependency issues when processing long sequences.

The attention mechanism's solution.

The introduction of the attention mechanism provides an effective method for addressing the long-term dependency problem in processing long sequences. The core idea of the attention mechanism is that the model can dynamically select dependencies from other positions in the sequence based on current needs, rather than relying on a fixed chain propagation structure.

Specifically, the advantages of the attention mechanism include:

Direct connection of sequence information at different positions: Traditional RNNs pass information one timestep at a time, making it difficult to transmit information when handling long sequences. The attention mechanism allows the model to establish direct connections between any two timesteps, regardless of the distance between them. This direct connection significantly alleviates the long-term dependency problem.

Global information retrieval: The attention mechanism can consider all positions in the sequence simultaneously, dynamically selecting relevant positional information that aids the current timestep rather than relying solely on local information. This enables the model to flexibly choose key information positions when processing long sequences, preventing information from being diluted as the sequence length increases.

Parallel computation: Traditional RNN computations are sequential, meaning that the computation at the current timestep depends on the state of the previous timestep, resulting in lower training efficiency. In contrast, the attention mechanism allows for parallel computation, thereby accelerating training and inference speed, particularly for long sequence tasks.

For RNN-attention-based models, the input is a protein sequence consisting of 23 characters, with each character representing a different amino acid. To convert the protein sequence into a format suitable for deep learning models, data preprocessing is necessary. The Keras Tokenizer function is used to tokenize the protein sequence, aiming to convert each character into its corresponding numerical representation. This function calculates the frequency of each character in the input sequence and maps the top N characters with the highest frequencies to numbers from 1 to N, while subsequent characters are incrementally mapped until all characters are assigned numbers. Next, these numerical sequences are padded to a fixed length to ensure that the neural network can consistently process the input data.

Each character vector is represented by the following equation: \[ c_i = \sum_{j=1}^{4} \alpha_{ij} h_j \] The formula for calculating attention weights is as follows: \[\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{4} \exp(e_{ik})} \] \[e_{ij} = fc(s_{i-1}, h_j) \] After preprocessing, the protein sequence data is converted into a format suitable for the RNN-attention model and transformed into tensors using PyTorch for training in the neural network. The RNN-attention model processes the sequence information through recurrent neural networks (RNNs) and uses the attention mechanism to weight important positions in the sequence, thereby capturing key features for more accurate predictions.

In addition, we also employed a recurrent neural network (RNN) with an attention mechanism to perform text classification tasks. This model is implemented using PyTorch. We defined an RNN-based text classification model that includes an embedding layer, three recurrent layers, a linear layer, and a self-attention layer. The embedding layer maps the input text sequences into a low-dimensional vector space, while the recurrent and linear layers are used for feature extraction and classification, and the self-attention layer is utilized to extract important information from the sequence.

During the training of the model, we defined the necessary hyperparameters and optimizer. The optimizer used is SGD (Stochastic Gradient Descent), along with the Focal Loss loss function, specifically designed to address class imbalance issues. Focal Loss increases the penalty weight on hard-to-classify samples, encouraging the model to better handle imbalanced data. SGD is a classic optimization algorithm that provides stable updates to the model parameters and improves generalization during the training process.

Finally, we loaded the training and testing datasets and began the model training process. Through iterative optimization, the model gradually improved its accuracy in classifying protein sequences or text.

Figure 16: The framework of BiLSTM-based model

For the BiLSTM-based model

CKSAAGP is a feature descriptor used in bioinformatics to represent protein sequences. The acronym stands for "Composition of k-spaced amino acid pairs with gap penalty." This descriptor takes into account the amino acid composition within the sequence and k-spaced amino acid pairs, applying a penalty for gaps between them. We computed CKSAAGP as the chemical features of each protein sequence.

The model is based on a Bidirectional Long Short-Term Memory network (BiLSTM) for classification. The BiLSTM network can simultaneously process information from both the forward and backward directions, enabling the model to capture contextual dependencies in the sequence more comprehensively. The model consists of multiple BiLSTM layers to extract temporal features from the sequence, followed by a linear layer for classification. Each BiLSTM layer effectively retains important sequence information through its memory units and gating mechanisms while suppressing irrelevant parts.

Calculate the forget gate to determine what information to forget.

Input: The hidden state at the previous time step \( h_{t-1} \), the input word at the current time step \( X_t \)

Output: The value of the forget gate \( f_t \)

Figure 17: Calculation of the Forget Gate

\[f_t = \sigma (W_f \cdot [h_{t-1}, x_t] + b_f) \] Calculate the Memory Gate to select the information to be remembered.

Input: The hidden state at the previous time step \( h_{t-1} \), the input word at the current time step \( X_t \) Output: The value of the input gate \( i_t \), the candidate cell state \( \tilde{C}_t \)

Figure 18: Calculation of Memory Gate and Temporary Cell State

\[i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i) \] \[\tilde{C}_t = \tanh (W_C \cdot [h_{t-1}, x_t] + b_C) \] Calculate the Output Gate and the Current Hidden State.

Figure 19: Calculates the output gate and the state of the hidden layer at the current moment

Input: The hidden state at the previous time step \( h_{t-1} \), the input word at the current time step \( X_t \), the cell state at the current time step \( C_t \) Output: The value of the output gate \( O_t \), the hidden state \( h_t \) \[o_t = \sigma (W_o [h_{t-1}, x_t] + b_o) \] \[h_t = o_t * \tanh (C_t) \]

Finally, we can obtain a hidden state sequence \( h_0, h_1, ..., h_{n-1} \) that corresponds to the length of the sentence.

In the forward propagation process, the ReLU activation function is used. ReLU helps address the vanishing gradient problem and accelerates convergence during training. The introduction of this nonlinear activation function enhances the neural network's expressive power, thereby improving its classification performance.

During the training process, we use the Adam optimizer to iteratively adjust the model parameters and minimize the loss function. This optimizer updates the parameters using a small batch of training data at each step, moving in the direction of the steepest descent of the loss function. The loss function employs the cross-entropy loss, which quantifies the difference between the model's predicted class probabilities and the actual labels. The cross-entropy loss is particularly effective for multiclass classification tasks, encouraging the model to assign higher probabilities to the correct classes while penalizing incorrect predictions, thus improving the model's classification accuracy.

Figure 20: The framework of Xgboost

XGBoost, short for "Extreme Gradient Boosting," is a powerful and efficient machine learning algorithm widely used for supervised learning tasks, particularly in classification and regression problems. Its core idea is to build a strong model by combining the predictions of multiple weak models, typically decision trees. In each iteration, XGBoost trains a new decision tree to fit the residuals from the previous iterations, which represent the differences between the predicted values and the actual target values. With the introduction of each tree, the model's overall predictive capability improves gradually. This process continues iterating until the loss function is minimized or predetermined stopping criteria are met.

To optimize the model's performance and prevent overfitting, XGBoost integrates various regularization techniques, such as L1 and L2 regularization, tree pruning, and early stopping. These techniques effectively control the model's complexity, enhancing its generalization ability across different datasets.

In this model training, the protein sequences were first subjected to feature extraction. We utilized Dipeptide Composition (DPC) to process the protein sequences. The DPC method considers the combinations of two adjacent amino acids by calculating the occurrence frequency of each dipeptide in the sequence to generate features. This approach captures local information and interaction characteristics within the protein sequences, enhancing the model's ability to analyze the sequences. The feature vectors extracted through DPC provide a foundation for the subsequent input to the XGBoost model.

When building the XGBoost model, we set specific parameters to optimize the model's performance on our protein dataset. The specific parameters include: a learning rate of 0.1 to control the step size for each iteration; a maximum depth of 20 to limit the depth of each decision tree and prevent overfitting; a total number of decision trees (n_estimators) set to 150 to ensure sufficient iterations; a gamma parameter set to 0, indicating no additional penalty for tree splits; and a subsample ratio of 0.8 to control the proportion of samples used for building each tree, enhancing the model's robustness. These parameter choices are based on preliminary experimental results, aimed at providing optimal model performance.

Through this feature extraction and model training process, the XGBoost model can effectively learn the chemical characteristics of protein sequences and achieve accurate classification predictions.

Finally, we used a voting algorithm for classification, which combines the predictions of multiple classifiers to produce more accurate results. In the Clearance-enzyme-classifier framework, we integrated three different algorithms (RNN with attention, BiLSTM, and XGBoost) strategically. This intentional inclusion of various algorithms enhances the overall robustness of the model. The underlying principle is that the variability in predictions from different algorithms can help offset potential errors, thereby reducing the overall error rate and ultimately improving the model's accuracy and reliability. This involves employing a soft voting strategy to identify specific categories within the ROSce classes. This more refined classification process leverages the combined strengths of all three algorithms, resulting in more detailed and accurate classification outcomes.

Prediction performance for each ROSce class.

Figure 21: Comparison of the prediction performance of the three models for each ROSce class

Next, we present the accuracy and recall of the three algorithms in classifying different categories (we randomly selected 20 categories). RNN-attention: This model shows high stability in both accuracy and recall, making it suitable for tasks that strongly rely on temporal or sequential information. Its fluctuations are relatively small; although there was a significant drop during the 15th iteration, it quickly recovered and performed excellently overall. BiLSTM: While it performs outstandingly in accuracy, close to RNN-attention, the recall shows larger fluctuations, particularly with notable drops during the 10th and 15th iterations, possibly due to instability in handling some difficult samples. Nevertheless, it remains a very robust model overall. XGBoost: This model performed slightly worse in early training; although its accuracy gradually caught up with the other models, it showed significantly greater volatility. The notable drop during the 15th iteration suggests that this model may be sensitive to certain patterns or features in the training dataset. For tasks requiring high stability, XGBoost may not perform as well as the first two models.

The Clearance-enzyme-classifier has several limitations to consider. Firstly, it has constraints in handling raw reads, which may affect its prediction accuracy in certain cases. Secondly, it requires substantial computational resources, which can be a challenge for researchers without access to high-performance computing. Lastly, its predictions are influenced by the length of the input sequences, and its utilization of non-sequential information is also limited. Despite these limitations, the Clearance-enzyme-classifier remains an effective deep learning algorithm for classifying ROSce. However, careful evaluation of these limitations is necessary when applying this method, and appropriate measures may need to be taken to enhance its prediction accuracy and expand its application scope.

Here are the 9 test sequences for ROSce under the model:

Sequence ID	Sequence
seq1	MKLYIYQHCPFCARVRYVAGMLNIPLEIINLAYDDDTTTNDLIGAKQVPLLLKNDGQALA ESLDIIAYFIELAQSSESHQPSESVLDWQRSAFLPLQKIGYPRWSNMDFPEFSTQSARQA WRMKKETDALNFDALLQDTPNIATEVEALIERAKSVLNLDSYQHATLVDEAIVFSILRGF FSAAEIQWDSVVKDWMESVSDKTHVPLLK
seq2	MAMKAVCVLKGDGPVQGTIHFEQKASGEPVVLSGQITGLTEGQHGFHVHQYGDNTQGCTS AGPHFNPHSKKHGGPADEERHVGDLGNVTAGKDGVANVSIEDRVISLSGEHSIIGRTMVV HEKQDDLGKGGNEESTKTGNAGSRLACGVIGIAQ
seq3	MSSDKYMTNSTVENLVIIGSGPAGYTAAIYAGRANLKPVVFEGFQAGGLPGGQLMTTTEV ENFPGFPQGITGPDLMDNMKAQAERWGAELYTEDVISVDLSQRPFIICSQEREVKAHSIV IATGATAKRLGLPSEQQFWSRGISACAICDGATSIFRGAELAVVGAGDSAAEEAIYLTKY GSSVHLLVRSEEMRASKAMQDRVLSNPKIQIHWHTEAVDVFGNDFLSGVKVHNNQTGEVR ELPVKGLFYAVGHTPNTGLFKGQLELDE VGYIVTKHGGVETGIDGVFAAGDVQDHEFRQAITAAGTGCMAMLAERWLSSKGLIHEFHH LPETTDNELKPQPETQTTETEAEFNLQATRHEGGYALRKLFHDSDRLLLVKYISPGCGPC HTLKPILNKVVDEFDGKIYFVEIDIDKDRDIAENAGVTGTPTVQLFKDQELVKEVKGVKQ KSEYRQLIESNL
seq4	MAASSKVIVSLVLCLMMAVSVRSQLSSTFYDTTCPNVSSIVHGVMQQALQSDDRAGAKIIR LHFHDCFVDGCDGSVLLEDQDGITSELGA PGNGGITGFNIVNDIKTAVENVCPGVVSCAD ILALGSRDAVTLASGQGWTVQLGRRDSRTANLQGARDRLPSPFESLSNIQGIFRDVGLNDN TDLVALSGAHTFGRSRCMFFSGRLNNNPNADDSPIDSTYASQLNQTCQSGSGTFVDLDPTT PNTFDRNYYTNLQNNQGLLRSDQVLFSTPGASTIATVNSLASSESAFADAFAQSMIRM GNLDPKTGTTGEIRTNCRRLN
seq5	MGGKSVASKYTGRVTVLVNTASLCSFAASNLEKLIHIQASYGPRGVTVLAFPCTQFANQEP KTDEAVAAWAREWGVNFDMFDKVKVRGGDAHPLFHMLQDSLGPIRWNYTKFVCDRDGIPRVKL APSCPREELERSIEQLLR
seq6	MSDDVQSWIKETVTENPVVLFMKGTAQFPQCGFSGKAIALLKESGVTDLITVNVLDDAEV RQGIKDYSQWPTVPQLYIKGEFIGGSDIMNEMFASGELQALLKA
seq7	MVLVTRQAPDFTAAAVLGSGEIVDKFNFKQHTNGKTTVLFFWPMDFTFVCPSELIAFDKR YEEFQKRGVEVVGVSFDSEFVHNAWRNTPVDKGGIGPVKYAMVADVKREIQKAYGIEHPD GVALRGSFLIDANGIVRHQVVNDLPLGRNIDEMLRMDALQFHEEHGDVCPAQWEKGKEGM NASPDGVAKYLAENISSL
seq8	MTTKDDAEAFFAGESQANRKYKIFSEKADAEGYSNVAKLYRAASEAEAIAHKRLLFLLQQ VNSTEENLKKSIEGETSEYTSMYPSFIKDAEKENDKEAATIFTHAMKAEEVHATNYGRALE AVAGGNDLEISKVFLCPVCGNIVFDSVPDTCPICGVPGRMFKEIE
seq9	MAKITLKGNPINTIGELPAIGNTAPDFSLTTTDLGDVHLKDFSGQRLVLNIFPSLDTEVCA TSVRKFNEMAANLENTKVLCISMDLPFAHKRFCTTEGIENVISLSELHERGFGESYGVRII DGPLRGLFTRAVVVIDEKGKVIYTELVPEITQEPNYEKALKAKLE

Testing process

Prediction results:

The results show the prediction scores for each enzyme class in the test data. This file contains the index of the maximum predicted category for each row, indicating which enzyme category each sample is predicted to belong to.

Thus, the Clearance-enzyme-classifier successfully classified the 9 ROSce into their corresponding categories.

Challenges and Limitations:

High computational resource requirements: Due to the complexity of the model and the large amount of data, the Clearance-enzyme-classifier requires robust computational resources, which may pose a barrier for some researchers.

Input sequence dependency: The predictive performance of the model is partially dependent on the length and quality of the input sequences, which can affect its accuracy in certain cases. Future research could focus on reducing the computational complexity of the model, enhancing its robustness to sequence length, while also considering the integration of more non-sequence information (such as protein structure, post-translational modifications, etc.) to further improve the accuracy of ROSce classification.

In summary, we proposed a multi-label ensemble model framework named Clearance-enzyme-classifier specifically for predicting reactive oxygen species clearance enzyme (ROSce) categories. By integrating three different machine learning algorithms—ROSce-RNN-attention, ROSce-BiLSTM, and ROSce-XGBoost—we constructed the CEDB dataset based on authoritative databases such as UniProt, Pfam, and NCBI, ensuring the quality and diversity of the data, which covers 26 categories of ROSce enzymes and provides a solid foundation for model training. Using a soft voting method for decision-making, the Clearance-enzyme-classifier effectively overcomes the shortcomings of traditional methods in protein function prediction, especially excelling in handling distantly related homologs and proteins with complex functional diversity. The Clearance-enzyme-classifier demonstrates the immense potential of deep learning in addressing complex biological problems, particularly in the field of ROS clearance enzyme prediction, providing a powerful tool for future oxidative stress research and related disease prevention and treatment.

Reference

[1] Chandra, A., Tünnermann, L., Löfstedt, T., and Gratz, R. (2023). Transformer-based deep learning for predicting protein properties in the life sciences. elife 12 . doi: 10.7554/eLife.82819

[2] Chen, C., Hou, J., Tanner, J. J., and Cheng, J. (2020). Bioinformatics methods for mass spectrometry-based proteomics data analysis. Int. J. Mol. Sci. 21:2873. doi: 10.3390/ijms21082873

[3] Ejigu, G. F., and Jung, J. (2020). Review on the computational genome annotation of sequences obtained by next-generation sequencing. Biology (Basel) 9:295. doi: 10.3390/biology9090295

[4] Ho Thanh Lam, L., Le, N. H., Van Tuan, L., Tran Ban, H., Nguyen Khanh Hung, T., Nguyen, N. T. K., et al. (2020). Machine learning model for identifying antioxidant proteins using features calculated from primary sequences. Biology (Basel) 9:325. doi: 10.3390/biology9100325

[5] Jang, B., Kim, I., and Kim, J. W. (2019). Word2vec convolutional neural networks for classification of news articles and tweets. PLoS One 14 . doi: 10.1371/journal.pone.0220976

[6] Johnson, L. A., and Hug, L. A. (2019). Distribution of reactive oxygen species defense mechanisms across domain bacteria. Free Radic. Biol. Med. 140, 93–102. doi: 10.1016/j.freeradbiomed.2019.03.032

[7] Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M., and Tanabe, M. (2021a). KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 49, D545–d551. doi: 10.1093/nar/gkaa970

[8] Kanehisa, M., Furumichi, M., Sato, Y., Ishiguro-Watanabe, M., and Tanabe, M. (2021b). KEGG: integrating viruses and cellular organisms in 2021. Nucleic Acids Res. 49, D545–D551. doi: 10.1093/nar/gkaa970

[9] Kathuria, R. S., Gautam, S., Singh, A., Khatri, S., and Yadav, N., (2019). Real time sentiment analysis on twitter data using deep learning (Keras). 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), pp. 69–73.

[10] Kattenborn, T., Leitloff, J., Schiefer, F., and Hinz, S. (2021). Review on convolutional neural networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 173, 24–49. doi: 10.1016/j.isprsjprs.2020.12.010

[11] Keskin, O., and Nussinov, R. (2005). Favorable scaffolds: proteins with different sequence, structure and function may associate in similar ways. Protein Eng. Des. Sel. 18, 11–24. doi: 10.1093/protein/gzh095

[12] Ko, Y. S., Kim, J. W., Lee, J. A., Han, T., Kim, G. B., Park, J. E., et al. (2020). Tools and strategies of systems metabolic engineering for the development of microbial cell factories for chemical production. Chem. Soc. Rev. 49, 4615–4636. doi: 10.1039/D0CS00155D

[13] Kuhlman, B., and Bradley, P. (2019). Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 20, 681–697. doi: 10.1038/s41580-019-0163-x

[14] Li, Z., Liu, F., Yang, W., Peng, S., and Zhou, J. (2022). A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 33, 6999–7019. doi: 10.1109/TNNLS.2021.3084827

[15] Ma, M., Zhao, G., He, B., Li, Q., Dong, H., Wang, S., et al. (2021). XGBoost-based method for flash flood risk assessment. J. Hydrol. 598:126382. doi: 10.1016/j.jhydrol.2021.126382

[16] Manavalan, B., and Patra, M. C. (2022). MLCPP 2.0: an updated cell-penetrating peptides and their uptake efficiency predictor. J. Mol. Biol. 434:167604. doi: 10.1016/j.jmb.2022.167604

[17] Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G. A., Sonnhammer, E. L. L., et al. (2021). Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–d419. doi: 10.1093/nar/gkaa913

[18] Pai, J. A., and Satpathy, A. T. (2021). High-throughput and single-cell T cell receptor sequencing technologies. Nat. Methods 18, 881–892. doi: 10.1038/s41592-021-01201-8

[19] Qiu, Y., Zhou, J., Khandelwal, M., Yang, H., Yang, P., Li, C., et al. (2022). Performance evaluation of hybrid WOA-XGBoost, GWO-XGBoost and BO-XGBoost models to predict blast-induced ground vibration. Eng. Comput. 38, 4145–4162. doi: 10.1007/s00366-021-01393-9

SOD-ROS enzyme-catalyzed reaction model

1.Abstract

Inflammatory bowel disease (IBD) is a chronic condition characterized by recurrent episodes, and its pathogenesis remains a focal point of clinical research. Ulcerative colitis (UC) is one of the major forms of IBD, with both incidence and prevalence rising globally. Although the molecular basis and mechanisms underlying UC are not yet fully understood, it is believed that they result from the complex interplay of environmental factors, genetic susceptibility, epithelial barrier defects, dysbiosis, and immune response dysregulation. Extensive research indicates that UC patients exhibit abnormalities in the structure and function of colonic epithelial cells, with excessive activation of apoptosis in intestinal epithelial cells (IECs) ^[1]. Reactive oxygen species (ROS), a class of oxygen-containing chemical molecules, are natural byproducts of normal oxygen metabolism. External factors such as ultraviolet radiation and infections can lead to excessive ROS production, causing oxidative stress and damage. Studies have shown that in IBD models, increased ROS production correlates strongly with the severity of the disease ^[2]. Additionally, ROS-induced endoplasmic reticulum stress contributes to excessive apoptosis of cells, further exacerbating the condition ^[3].

Therefore, our experimental group utilized engineered bacteria expressing heterologous superoxide dismutase (SOD) to eliminate excess ROS in the IBD intestinal environment, thereby aiming to alleviate IBD symptoms. For this therapeutic approach, we sought to visualize the therapeutic effects and predict treatment outcomes. To this end, our computational group developed a ROS clearance model using the Michaelis-Menten equation. This enzymatic kinetics model illustrates the relationship between the enzyme-catalyzed reaction rate (v), substrate concentration ([S]), and enzyme concentration ([E]). The model effectively represents the ROS-clearing activity of SOD and helps validate the accuracy of the experimental results obtained in the wet lab.

Additionally, we investigated and expanded upon the effects of oxidative stress on SOD transcription and gene regulation. We developed a model to examine the impact of ROS concentration under oxidative stress on SOD transcription. Furthermore, we conducted a comprehensive review of how structural changes in SOD protein residues affect the rate of the dismutation reaction, and we established an extended SOD-ROS model to account for variations in protein residue structures.

You can view our code through the following link: (https://github.com/Charliserein/Mutant-enzyme-reaction-rate).

2. SOD-ROS enzymatic reaction model:

2.1Generation of Superoxide Dismutase (SOD) and Cell Apoptosis:

In patients with inflammatory bowel disease (IBD), excessive apoptosis of cells is often a significant factor contributing to disease deterioration. The following is a brief overview of the fundamental process of apoptosis in vivo and the impact of ROS on this process.Within cells, O₂ is reduced to O₂⁻ in the mitochondria, where O₂⁻ is readily converted into peroxynitrite (ONOO⁻) by nitric oxide (NO•). An increase in ROS leads to the opening of the mitochondrial permeability transition pore (PTP), which triggers apoptosis. Within the cell, PERK can promote the translation of ATF4 by phosphorylating eukaryotic initiation factor 2α (eIF2α). ATF4, in turn, enhances the transcription of CHOP, which upregulates the pro-apoptotic protein Bim and downregulates the anti-apoptotic protein Bcl 2, thereby leading to cell apoptosis. Another apoptotic pathway involves the IRE1α-TRAF2-ASK1 complex, which can be activated by ROS, resulting in JNK phosphorylation and the initiation of apoptosis signaling pathways ^[3].Thus, ROS commonly induce excessive apoptosis in IBD patients. Superoxide anions (O₂⁻) can be converted into hydrogen peroxide (H₂O₂) by superoxide dismutase (SOD). H₂O₂ can then produce hydroxyl radicals (HO•) via the Fenton reaction, and H₂O₂ can be reduced to water (H₂O) in the CAT and GSH-Px systems. Finally, under chloride (Cl⁻) conditions, H₂O₂ reacts with myeloperoxidase (MPO) to form hypochlorous acid (HOCl) and advanced oxidation protein products (AOPPs) [3]. This reaction effectively reduces ROS concentrations, thereby decreasing their activation of apoptosis.

Figure 22: Excessive apoptosis caused by reactive oxygen species and endoplasmic reticulum stress

^[3]Yue Wan, Lei Yang, Shu Jiang, Dawei Qian, Jinao Duan, Excessive Apoptosis in Ulcerative Colitis: Crosstalk Between Apoptosis, ROS, ER Stress, and Intestinal Homeostasis, Inflammatory Bowel Diseases, Volume 28, Issue 4, April 2022, Pages 639–648.

2.2 The Relationship Between ROS and Cancer Risk, and Their Connection to Cell Apoptosis

In populations with inflammatory bowel disease (IBD), ulcerative colitis is often directly associated with rectal cancer in later stages. In some cases, severe intestinal inflammation is a significant risk factor for cancer development. Additionally, the risk of developing colon cancer increases with the duration of the disease ^[4]^[5]^[6]. Within this context, the role of reactive oxygen species (ROS) is notably significant.

In both normal and cancer cells, reactive oxygen species (ROS) and their scavengers play a critical role in maintaining redox homeostasis. In normal cells, ROS levels are maintained at a low and dynamic equilibrium ^[7], with stable generation and elimination processes essential for preserving redox balance. However, when ROS are overexpressed, redox homeostasis is disrupted, leading to oxidative stress and varying degrees of cell death ^[8].In contrast, human tumor cells produce substantial amounts of hydrogen peroxide, resulting in elevated ROS concentrations ^[9]. Cancer cells exploit the interplay between cellular metabolism and redox homeostasis to generate building blocks for cell growth or to produce antioxidant capacity to prevent oxidative damage. By redirecting energy substrates and metabolic intermediates into biochemical pathways that produce key antioxidant molecules, malignant cancer cells can directly support ROS detoxification mechanisms, thereby surviving in conditions of high ROS concentration^[8].

In general cells, apoptosis is primarily associated with mitochondrial dysfunction. Increasing oxidative stress can damage mitochondrial function and homeostasis, leading to excessive production of ROS ^[10]. Under normal conditions, ROS, such as superoxide anions, hydrogen peroxide, and hypochlorous acid, are byproducts of normal cellular metabolism, while antioxidant enzymes such as superoxide dismutase (SOD) and glutathione are crucial in counteracting ROS. As ROS generation increases, the levels of endogenous antioxidants in the colonic mucosa of animal models with IBD decrease ^[11], which directly results in elevated ROS levels and triggers a series of pathways that signal excessive cell apoptosis. Overall, there is a close relationship between ROS-induced excessive apoptosis and increased cancer incidence. In our experiments to eliminate ROS, we fitted the SOD enzymatic reaction equation, which allows for an approximate prediction of its clearing effect.

2.3 Michaelis-Menten Equation and Enzyme Reaction Kinetics

The Michaelis-Menten equation was proposed by Michaelis and Menten based on the kinetics of enzyme-catalyzed reactions, \[\mathbf{E}+\mathbf{S} \underset{k_{-1}}{\stackrel{k_{1}}{\rightleftharpoons}} \mathbf{E S} \xrightarrow{k_{2}} \mathbf{E}+\mathbf{P}\] The basic equation is shown in the figure below: \[v=C \cdot \phi \cdot \frac{[\mathbf{S}]}{[\mathbf{S}]+k}\] Nowadays, the equation is often written in the form depicted in the figure. \[v=\frac{\left(V_{\max } \times[\mathrm{S}]\right)}{\left(K_{\mathrm{m}}+[\mathrm{S}]\right)}\] In this equation, v represents the reaction velocity, Vmax denotes the maximum velocity when all enzyme molecules are saturated with substrate, [S]is the substrate concentration, and Km is the equilibrium constant, known as the Michaelis constant ^[12]. Our analysis of the SOD enzymatic reaction is entirely based on this velocity equation.

In normally physiological cells, SOD catalyzes the dismutation of ROS. The high-valent metal SOD complex reacts with superoxide anions (O₂⁻) to undergo an oxidation reaction, generating a low-valent metal SOD complex and releasing oxygen. The low-valent metal SOD complex then reacts with superoxide anions (O₂⁻) and protons to undergo a reduction reaction, oxidizing the metal ions back to their higher oxidation state while producing hydrogen peroxide (H₂O₂). The specific reaction equations are as follows: \[\mathrm{M}^{(\mathrm{n}+1)+}-\mathrm{SOD}+\mathrm{O}_{2}^{-} \rightarrow \mathrm{M}^{\mathrm{n+}}-\mathrm{SOD}+\mathrm{O}_{2} \] \[\mathrm{M}^{\mathrm{n}+}-\mathrm{SOD}+\mathrm{O}_{2}^{-}+2 \mathrm{H}^{+} \rightarrow \mathrm{M}^{(\mathrm{n}+1)+}-\mathrm{SOD}+\mathrm{H}_{2} \mathrm{O}_{2} .\] In this reaction, M represents a metal ion, which can be Cu (n=1), Mn (n=2), Fe (n=2), or Ni (n=2). During the reaction, the oxidation state of the metal ion M alternates between n and n+1.

In our practical wet experiments, we used Cu/Zn-SOD, which is superior and commonly found in the cytoplasm and mitochondrial intermembrane space of most eukaryotes, and is the most prevalent SOD type in mammalian cells. Additionally, Cu/Zn-SOD exhibits high catalytic efficiency under neutral pH conditions, rapidly dismutating superoxide anions into hydrogen peroxide and oxygen, demonstrating significant clearing efficiency. Since the accumulation of superoxide anions can lead to oxidative damage of proteins, lipids, and DNA, subsequently triggering various diseases, Cu/Zn-SOD can prevent oxidative stress-related pathological conditions, including cardiovascular diseases, neurodegenerative disorders, and cancer ^[13]. Its catalytic constant (kcat) is extremely high, making it one of the fastest-reacting enzymes known.

2.4 SOD-ROS Enzyme-Catalyzed Reaction Model

This is the reaction equation for a single-substrate, single-product system: \[E + S \underset{k_{-1}}{\stackrel{k_1}{\rightleftharpoons}} ES \stackrel{k_2}{\longrightarrow} E + P \] Assuming V1 is the rate of formation of the enzyme-substrate complex (ES) and V2 is the rate of its dissociation, then V1=k1[E][S] and V2 =K-1[ES]+K2[ES]. At steady state, the rate of formation of ES equals the rate of its dissociation, thus V1= V2, which yields: \[K_{1}[E][S]=\left(K_{-1}+K_{2}\right)[E S]\] The following is the dismutation reaction equation for the SOD-ROS enzymatic reaction: \[\mathrm{Cu} / \mathrm{Zn}^{n+1} \cdot S O D+R O S \stackrel{K_{1}}{\rightleftharpoons} \mathrm{Cu} / \mathrm{Z} n^{n+1} \cdot S O D \cdot R O S \xrightarrow{K_{2}} \mathrm{Cu} / \mathrm{Zn}^{n} \cdot S O D+O_{2}\] \[\mathrm{Cu} / \mathrm{Z} n^{n} \cdot S O D+R O S \stackrel{K_{3}}{\rightleftharpoons} C u / Z n^{n} \cdot S O D \cdot R O S \stackrel{K_{4}}{\rightarrow} \mathrm{Cu} / \mathrm{Zn}^{n+1} \cdot S O D+H_{2} \mathrm{O}_{2}\] Based on the single-substrate, single-product model described above, the following two rate equations can be derived: \[K_{1}\left[C u / Z n^{n+1} \cdot S O D\right][R O S]=\left(K_{-1}+K_{2}\right)\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]\] \[K_{3}\left[C u / Z n^{n} \cdot S O D\right][R O S]=\left(K_{-3}+K_{4}\right)\left[C u / Z n^{n} \cdot S O D \cdot R O S\right]\] Assuming [Et] represents the total enzyme concentration, [E] denotes the concentration of free enzyme, and [ES] is the concentration of the enzyme-substrate complex, then: \[\left[\mathbf{E}_{t}\right]=[\mathbf{E}]+[\mathbf{E S}]\] Since [Et] represents the concentration of all forms of the enzyme, according to our SOD enzymatic reaction equation, the above expression can be substituted with: \[\begin{aligned}{\left[S_{O D}\right] } & =\left[C u / Z n^{n} \cdot S O D\right]+\left[C u / Z n^{n+1} \cdot S O D\right]+\left[C u / Z n^{n} \cdot S O D \cdot R O S\right] \\& +\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]\end{aligned}\] Rearranging Equation 1 yields: \[\begin{array}{c}K_{1}\left(\left[S O D_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]-\left[C u / Z n^{n} \cdot S O D \cdot R O S\right]-\left[C u / Z n^{n+1} \cdot S O D\right.\right. \\\cdot R O S])[R O S]=\left(K_{-1}+K_{2}\right)\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]\end{array}\] Rearranging Equation 2 yields: \[\begin{array}{c}K_{3}\left(\left[S O D_{t}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]-\left[C u / Z n^{n} \cdot S O D \cdot R O S\right]-\left[C u / Z n^{n+1} \cdot S O D\right.\right. \\\cdot R O S])[R O S]=\left(K_{-3}+K_{4}\right)\left[C u / Z n^{n} \cdot S O D \cdot R O S\right]\end{array}\] This leads to the derivation of the concentrations of the substrate ROS and the enzyme SOD complex under different metal oxidation states: \[\begin{array}{l}{\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]} \\=\frac{\left\{\left(\left[S O D_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]\right)\left(K_{m_{1}}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]\right)\right\} \times[R O S]}{[R O S]+\left(K_{m_{1}}+[R O S]\right)^{2}}\end{array}\] \[\begin{array}{l}{\left[C u / Z n^{n} \cdot S O D \cdot R O S\right]} \\=\frac{\left\{\left(\left[S O D_{t}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]\right)\left(K_{m_{2}}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]\right)\right\} \times[R O S]}{[R O S]+\left(K_{m_{2}}+[R O S]\right)^{2}}\end{array}\] Two Michaelis constants are defined as follows: \[\begin{array}{l}K_{m_{2}}=\frac{K_{-3}+K_{4}}{K_{3}} \\K_{m_{1}}=\frac{K_{-1}+K_{2}}{K_{1}}\end{array}\] Finally, the relationship between the reaction rate and substrate concentration is obtained as follows: \[\begin{array}{c}V=K_{2}\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]+K_{4}\left[C u / Z n^{n} \cdot S O D \cdot R O S\right] \\=K_{2} \frac{\left\{\left(\left[S O D_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]\right)\left(\frac{K_{-1}+K_{2}}{K_{1}}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]\right)\right\} \times[R O S]}{[R O S]+\left(\frac{K_{-1}+K_{2}}{K_{1}}+[R O S]\right)^{2}} \\+K_{4} \frac{\left\{\left(\left[S O D_{t}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]\right)\left(\frac{K_{-3}+K_{4}}{K_{3}}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]\right)\right\} \times[R O S]}{[R O S]+\left(\frac{K_{-3}+K_{4}}{K_{3}}+[R O S]\right)^{2}}\end{array}\] In this equation, V represents the total reaction rate of SOD consuming ROS, SODt is the amount of enzyme in different forms, K denotes the reaction rate constants for each reaction, and K-1 and K-3are the rate constants for the reverse reactions of SOD binding with ROS. Km is the Michaelis constant, which in our model is defined byK1、K-1、K2、K3、K-3 and K4.

By consulting relevant databases, we found that the SOD activity in the supernatant of Nissle 1917 culture medium is 3.4903 U, with other values set within the default range. The figure below presents the preliminary simulation results:

Figure 23: Simulated ROS-SOD response function image

The figure shows that the reaction rate of the enzyme is positively correlated with substrate concentration, approaching a maximum value as substrate concentration increases. The fitted reaction curve aligns closely with both the experimental results and the actual expectations. In future work, we will simplify the fitting equations and conduct targeted optimization for different experimental conditions. Additionally, environmental parameters may be introduced to simulate corresponding reaction rates under various human physiological conditions.

SOD Activity Measurement

Through the measurement of its antioxidant activity, functional verification of SOD was performed:

PJSG-transformed Nissle 1917: The intracellular SOD activity after transformation was 23.56 U (17.91 U after background subtraction), while the activity on the medium was relatively low, at 1.48 U, indicating possible inefficiency in secretion (Figure 5).

PJSG-transformed DH5α: The intracellular SOD activity was relatively high, at 41.36 U, and 12.40 U after background subtraction (Figure 6).

PET-S-transformed BL21(DE3): The intracellular activity after transformation was 404.65 U, with the medium activity at 111.79 U. After background subtraction, this indicates strong expression and secretion capabilities (Figure 7).

3. The Impact of Oxidative Stress on SOD Transcription and Gene Regulation

3.1 Nuclear Localization and Gene Regulation of SOD1

Studies have shown that SOD1 is not limited to its antioxidant function in the cytoplasm but can also translocate to the nucleus, where it directly participates in gene regulation. When cells are exposed to oxidative stress, such as hydrogen peroxide (H₂O₂), SOD1 rapidly moves from the cytoplasm to the nucleus ^[14]. In the nucleus, SOD1 can bind to specific DNA promoters, regulating the expression of various antioxidant and DNA repair genes. In this manner, SOD1 not only eliminates ROS but also protects the genome from oxidative damage by enhancing antioxidant and repair capabilities.

The discovery of this nuclear localization function has broadened our understanding of SOD1's role within the cell. Traditionally, SOD1 was regarded primarily as an antioxidant enzyme combating ROS through its enzymatic activity. However, the nuclear localization function indicates that SOD1 also plays a significant role in regulating cellular responses to oxidative stress, particularly at the level of gene expression. These genes include key factors for antioxidant enzymes and DNA repair, thereby enhancing the cell's ability to respond to oxidative stress. This nuclear localization function not only extends the biological scope of SOD1 but also suggests a more complex regulatory role in managing oxidative stress ^[15].

3.2 Signal Transduction Pathways and the Regulation of SOD, and the Impact of ROS on SOD Gene Expression

The regulation of SOD1 function by oxidative stress extends beyond its direct enzymatic activity to involve complex signal transduction pathways. The ATM/Mec1 kinase and Dun1 kinase play crucial roles in this process. When ROS levels increase, ATM/Mec1 kinase is first activated, which in turn activates the downstream effector Dun1 kinase ^[15].

Dun1 regulates SOD1's nuclear localization and gene regulatory functions by directly interacting with SOD1 and phosphorylating specific sites (S60 and S99). This phosphorylation modification not only facilitates the translocation of SOD1 to the nucleus but also enhances its ability to bind DNA, thereby effectively regulating the expression of antioxidant genes. This mechanism illustrates how SOD1 dynamically adjusts its gene regulatory functions in response to oxidative stress through signal transduction pathways ^[15].

Oxidative stress not only regulates SOD1 function through the aforementioned signal transduction pathways but also modulates its intracellular levels by directly affecting the expression of the SOD gene. Research indicates that ROS can regulate SOD gene expression by influencing the activity of transcription factors. For example, oxidative stress conditions activate oxidative-sensitive transcription factors such as NF-kB and AP-1 ^[14]. These transcription factors then bind to the promoter regions of the SOD gene, thereby promoting its transcription.

This mechanism, wherein ROS directly regulates SOD gene expression, further exemplifies the multifaceted roles of oxidative stress within the cell. On one hand, ROS, as a substrate for SOD, needs to be eliminated. On the other hand, ROS also acts as a signaling molecule and an upstream regulator of its own gene expression. Cells enhance their antioxidant capacity by regulating gene expression, thereby enabling a rapid response to oxidative stress, strengthening their antioxidant defense systems, and mitigating damage caused by ROS accumulation ^[14]^[15]. Through this dual mechanism, cells are able to more effectively confront the challenges posed by oxidative stress.

3.3 ROS-SOD Transcription Kinetics Model

As observed, the concentration of ROS has a significant impact on the transcription and gene regulation of SOD. To address the effect of ROS concentration on SOD transcription efficiency, a ROS-SOD transcription kinetics model has been established.

To describe the effect of ROS concentration on SOD transcription efficiency, the Hill equation can be employed for fitting. The Hill equation is commonly used in gene regulatory networks to describe the interaction between transcription factors and gene promoters. We aim to increase the Hill coefficient to enhance the positive cooperativity of the reaction.

Figure 24: The relationship between transcription rate and Hill coefficient

[27]Santillán, M. (2008). On the Use of the Hill Functions in Mathematical Models of Gene Regulatory Networks. Mathematical Modelling of Natural Phenomena, 3(2), 85–97.

The figure illustrates the relationship between the steepness of the curve, i.e., the transcription rate, and the Hill coefficient. In the figure, panel A represents cooperative binding, where the slope of the curve increases with Hill coefficients of 1, 2, 5, 10, and 20. Panel B represents negative cooperativity. Cooperative binding is based on the concept that when a ligand (such as a transcription factor) binds to a receptor (such as a binding site on DNA), it can alter the binding affinity at other sites. A higher Hill coefficient nnn indicates a steeper curve, reflecting stronger cooperativity ^[27].

The kinetic equation representing the time-dependent changes in ROS concentration can be expressed as: \[\frac{d R O S}{d t}=k_{R O S}-k_{d e g} \cdot R O S \cdot S O D\] Where k_ROS is the generation rate of ROS, and k_deg is the degradation rate constant of ROS. The degradation rate of ROS is proportional to both the SOD concentration and ROS concentration, thus the SOD degradation rate can be expressed as k_deg·ROS·SOD. The basic form of the Hill equation is as follows: \[f(x)=\frac{x^{n}}{K_{d}^{n}+x^{n}}\] In this context, f(x) represents the reaction output, xxx is the input signal (such as transcription factor concentration), Kd is the dissociation constant (indicating the input value at which half of the maximum response is achieved), and n is the Hill coefficient. For the SOD-ROS transcription model, the following equation can be obtained: \[S(t)=\frac{k_{S O D} \cdot \text { ROS }^{n}}{K_{R O S}^{n}+\text { ROS }^{n}}\] In this equation, k_SOD represents the maximum value of SOD transcription efficiency, determining the upper limit of SOD transcription efficiency. k_SOD·ROS^n describes the partial contribution to SOD transcription efficiency at a given ROS concentration. K_ROS^ is the half-maximal effective concentration, where the transcription efficiency of SOD reaches half of its maximum value when ROS equals K_ROS^ .

By combining the kinetic equation of ROS concentration with the equation for SOD transcription efficiency, a comprehensive model can be derived: \[\frac{d R O S}{d t}=k_{R O S} \cdot \frac{k_{S O D} \cdot \text { ROS }^{n}}{K_{R O S}^{n}+\text { ROS }^{n}} \cdot S O D\] This equation can be solved using numerical methods to obtain the curves for ROS concentration over time and SOD transcription efficiency as a function of ROS concentration. Due to its superior accuracy compared to the Euler method, the Runge-Kutta method will be used to solve for the variation of SOD transcription efficiency over time.

Select the initial condition: ROS(0) = ROS₀. The fourth-order Runge-Kutta method is used to calculate four slopes (k₁, k₂, k₃, k₄), and the ROS concentration is updated using their weighted average: \[\begin{array}{c}k_{1}=\Delta t \cdot\left(k_{\mathrm{ROS}}-k_{\mathrm{deg}} \cdot \frac{k_{S O D} \cdot \operatorname{ROS}(t)^{n}}{K_{R O S}^{n}+\operatorname{ROS}(t)^{n}} \cdot \operatorname{ROS}(t)\right) \\k_{2}=\Delta t \cdot\left(k_{\mathrm{ROS}}-k_{\mathrm{deg}} \cdot \frac{k_{\text {SOD }} \cdot\left(\operatorname{ROS}(t)+0.5 k_{1}\right)^{n}}{K_{R O S}^{n}+\left(\operatorname{ROS}(t)+0.5 k_{1}\right)^{n}} \cdot\left(\operatorname{ROS}(t)+0.5 k_{1}\right)\right) \\k_{3}=\Delta t \cdot\left(k_{\mathrm{ROS}}-k_{\mathrm{deg}} \cdot \frac{k_{S O D} \cdot\left(\operatorname{ROS}(t)+0.5 k_{2}\right)^{n}}{K_{R O S}^{n}+\left(\operatorname{ROS}(t)+0.5 k_{2}\right)^{n}} \cdot\left(\operatorname{ROS}(t)+0.5 k_{2}\right)\right) \\k_{4}=\Delta t \cdot\left(k_{\mathrm{ROS}}-k_{\mathrm{deg}} \cdot \frac{k_{S O D} \cdot\left(\operatorname{ROS}(t)+k_{3}\right)^{n}}{K_{R O S}^{n}+\left(\operatorname{ROS}(t)+k_{3}\right)^{n}} \cdot\left(\operatorname{ROS}(t)+k_{3}\right)\right) \\\operatorname{ROS}(t+\Delta t)=\operatorname{ROS}(t)+\frac{1}{6}\left(k_{1}+2 k_{2}+2 k_{3}+k_{4}\right)\end{array}\] This allows for the derivation of the equation describing how ROS changes over time. By substituting this equation into the basic value fitting function, an approximate graph can be obtained:

Figure 25: ROS generation function graph over time

The ROS generation constant is set at 1.0, the ROS dissociation constant is 0.1, the maximum transcription efficiency of SOD is 2.0, the half-inhibition concentration of ROS is 1.0, and the cooperativity coefficient (Hill coefficient) is 2, with an initial ROS concentration of 0.1. As shown in the figure, the concentration of ROS is positively correlated with time, approaching a maximum value as time progresses.

At each time point, the corresponding SOD transcription efficiency S(t) can be calculated based on the current ROS concentration ROS(t). The conversion equation is as follows: \[S(t) = \frac{k_{SOD} \cdot ROS^n}{K_{ROS}^n + ROS^n} \] The fitted graph is obtained as follows:

Figure 26: Functional graph of SOD transcription efficiency and ROS concentration

From the graph, it can be observed that SOD transcription efficiency is positively correlated with ROS concentration and increases continuously with rising ROS levels. This indicates that, under normal physiological conditions, there is a correlation between SOD and ROS concentration, with a certain feedback capability. It is therefore hypothesized that within a healthy organism, the transcription efficiency of SOD varies with ROS concentration, thereby stabilizing internal homeostasis and mitigating oxidative stress effects. In contrast, in patients with IBD, the surge in ROS leads to an inability of SOD to promptly respond to transcription signals, resulting in significant ROS accumulation. This, in turn, causes excessive cell apoptosis and triggers inflammation.

4. The Effect of SOD Protein Residue Structural Changes on Dismutation Reaction Rate

Superoxide dismutase (SOD) is a crucial antioxidant enzyme that catalyzes the dismutation of superoxide anion radicals (O₂⁻) to produce oxygen and hydrogen peroxide, thereby protecting cells from oxidative stress damage. This review discusses how structural changes in key amino acid residues within SOD affect its catalytic activity, particularly by altering the microenvironment around the active site to influence the reaction rate.

Superoxide dismutase (SOD) plays a crucial role in the cellular defense system by catalyzing the dismutation reaction of superoxide anion radicals to remove these harmful substances. The activity of SOD is influenced by various factors, including the amino acid residues near its active site. Recent studies on the structure and function of the SOD active site have revealed the role of several key residues in enhancing enzyme activity.

4.1 The Role of Key Amino Acid Residues

To investigate the impact of SOD protein residue variations on the dismutation reaction, it is essential to identify key genes. Notably, amino acids associated with redox reactions include arginine 141 (R141), tyrosine 34 (Y34), and glutamine 70 (Q70).

Arginine 141 in Cu/Zn-SOD alters the redox chemistry of superoxide anions by providing a network of hydrogen bonds. In the absence of R141, superoxide anions can reduce copper ions and dissociate as oxygen. In contrast, in the presence of R141, superoxide anions form a stable complex, further oxidizing another superoxide anion to oxygen while reducing the copper ion from its trivalent to its divalent state.

The tyrosine 34 (Y34) and glutamine 70 (Q70) residues are believed to play a role in mediating the protonation of superoxide anions, thereby affecting the catalytic efficiency of SOD. For other key residues, such as phenylalanine 66 (Phe66) in Cu/Zn-SOD, adjustments in the microenvironment around the neighboring tyrosine 34 and the active site can influence the extent of product inhibition by SOD ^[16]. Thus, changes in the rate of the SOD dismutation reaction may result from mutations in residues such as arginine 141 (R141), tyrosine 34 (Y34), and glutamine 70 (Q70). To modulate the rate of the dismutation reaction, adjustments can be focused on related residues, such as Phe66.

4.2 The Impact of Structural Changes on Catalytic Efficiency

The variations in the microscopic structures of catalysts have a significant impact on catalytic efficiency. Electronic transitions determine the reaction rate, and adjusting the electronic structure of the catalyst can enhance its catalytic activity ^[17]. The active site of an enzyme is central to the catalytic reaction, and inducing the fit model and conformational selection can improve enzymatic catalytic efficiency ^[18]^[19]. Additionally, the dynamics of proteins not only affect their catalytic functions but also influence other biological functions such as signal transduction, molecular recognition, and regulation ^[20]. This also indirectly suggests that catalytic reactions might alter the steady-state environment, leading to changes in related product concentrations. These examples illustrate that changes in catalyst structure can influence catalytic efficiency, and the following sections will specifically analyze several influencing factors.

Stable complexes can significantly influence the overall reaction rate. In the SOD-ROS enzymatic reaction, such complexes often refer to the SOD-ROS complexes. The formation of stable complexes generally reduces the overall reaction rate, as alternative pathways may require more time to complete the catalytic cycle ^[14]. The formation of stable complexes may lead to changes in the enzyme-catalyzed pathway, potentially affecting the overall reaction rate. Therefore, while stable complexes may enhance the accuracy of substrate positioning, they might also introduce new reaction steps or pathways, thereby influencing the overall catalytic efficiency ^[21].

Changes in electron density within protein structures can significantly impact catalysis, as electronic effects are associated with variations in electron density within the protein structure ^[22]. The catalytic activity of an enzyme is often related to the optimization of the electronic environment within its active site. Structural changes can alter the electron density of the active site, thereby affecting the mechanism and rate of the catalytic reaction. Certain metals can influence the electron density of substrate molecules; for instance, metal centers such as molybdenum and tungsten can affect the surrounding electronic environment. This electronic effect is crucial for the enzyme's catalytic function, thus influencing the mechanism and rate of the catalytic reaction ^[23]. For example, in the Trx-SOD model, the arginine side chain, through the formation of salt bridges, may provide an electron transfer pathway that affects catalytic efficiency ^[24]. Therefore, the electron density within proteins has a notable impact on overall catalysis.

Furthermore, the construction of hydrogen bond networks significantly impacts protein catalytic activity ^[25]. Hydrogen bonds, which are non-covalent interactions between amino acid residues or between residues and substrates, play a crucial role in the structure and function of proteins. In enzyme catalysis, hydrogen bond networks maintain the three-dimensional structure of the protein, which is essential for stabilizing the composition of the complex and enhancing overall stability, thereby playing a critical role in the catalytic process ^[25]. The involvement of hydrogen bond networks determines the protein's conformation. Traditional high-resolution X-ray crystallography methods typically provide static structures of proteins, representing the average conformation of proteins within frozen crystals.

However, proteins are not static under physiological conditions but possess dynamic characteristics. These dynamic properties enable proteins to transition between different conformations to meet various functional demands. The function of a protein relies on its ability to rapidly switch between multiple conformational states. Such conformational changes often involve significant molecular rearrangements or smaller local structural adjustments. The presence of dynamics means that proteins can flexibly adjust their structure as needed, thereby effectively executing biological functions ^[26].The conformational space of a protein can be viewed as a complex energy landscape, where each conformational state corresponds to a different energy level. These energy levels can be visualized as a terrain, with low-energy states representing "valleys" and high-energy states representing "peaks" or "ridges." Proteins typically tend to reside in low-energy valleys, and the depth and relative position of these valleys determine the speed and reversibility of functional conformational transitions. However, due to thermodynamic fluctuations, proteins can also cross energy barriers (ridges) to reach other valleys. Thus, different conformational states and their interconversion pathways (i.e., the height of energy barriers) determine the protein's functionality ^[26]. In enzymatic reactions, the active site of an enzyme may need to undergo conformational changes to accommodate substrates or release products. These changes are often regulated by networks of hydrogen bonds and other molecular interactions, meaning that the enzyme's dynamics play a decisive role.

4.3 Enzymatic Reaction Rates under SOD Protein Residue Variants

As previously mentioned, the Michaelis-Menten equation is a classical model for describing enzyme-catalyzed reaction rates. We can extend this by applying the SOD-ROS enzymatic reaction model established earlier to develop an enzymatic reaction model under SOD protein residue mutations. \[v=\frac{\left(V_{\max } \times[\mathrm{S}]\right)}{\left(K_{\mathrm{m}}+[\mathrm{S}]\right)} \] In the extended model, structural changes in enzyme protein residues may affect Vmax and Km. Structural mutations often lead to changes in enzyme catalytic efficiency and alter the enzyme's affinity for the substrate. Therefore, we can introduce two additional parameters to represent the effects of structural changes: ΔV_((n+1)max), ΔV_((n)max) ΔK_((n+1)m), and ΔK_((n)m). ΔV_((n+1)max) and ΔV_((n)max) represent the effects of mutations on the maximum reaction rate in the two-step reaction, while ΔK_((n+1)m) and ΔK_((n)m) represent the effects of mutations on the Michaelis constant in the two-step reaction.

Based on the derivations above, the SOD-ROS enzymatic reaction equations are: \[\begin{array}{l}V=K_{2}\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]++K_{4}\left[C u / Z n^{n} \cdot S O D \cdot R O S\right]\\\begin{array}{l}{\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]} \\=\frac{\left\{\left(\left[\text { SOD }_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]\right)\left(K_{m_{1}}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / \mathrm{Zn}^{n+1} \cdot \text { SOD }\right]\right)\right\} \times[R O S]}{[R O S]+\left(K_{m_{1}}+[R O S]\right)^{2}} \\\text { [Cu/Zn } \left.n^{n} \cdot S O D \cdot R O S\right] \\=\frac{\left\{\left(\left[S_{D}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]\right)\left(K_{m_{2}}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / \mathrm{Zn}^{n} \cdot S O D\right]\right)\right\} \times[R O S]^{2}}{[R O S]+\left(K_{m_{2}}+[R O S]\right)^{2}} \\K_{m_{2}}=\frac{K_{-3}+K_{4}}{K_{3}} \\K_{m_{1}}=\frac{K_{-1}+K_{2}}{K_{1}}\end{array}\end{array}\] After simplification, it can be expressed as: \[V=\frac{V_{(n+1) \max } \times[R O S]}{[R O S]+\left(K_{m_{1}}+[R O S]\right)^{2}}+\frac{V_{(n) \max } \times[R O S]}{[R O S]+\left(K_{m_{2}}+[R O S]\right)^{2}}\] Here, V_(〖(n+1)max〗_ ) and V_(〖(n)max〗_ ) represent the maximum rates of the two-step reactions, respectively. Additionally, a coefficient for the impact of mutations on the reaction is introduced: \[\begin{array}{l}{\left[C u / Z n^{n+1} \cdot S O D \cdot R O S\right]} \\=\frac{\left\{\left(\left[S O D_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]\right)\left(K_{m_{1}}+\Delta K_{(n+1) m}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]\right)\right\} \times[R O S}{[R O S]+\left(K_{m_{1}}+\Delta K_{(n+1) m}+[R O S]\right)^{2}} \\\begin{array}{l} {\left[C u / Z n^{n} \cdot S O D \cdot R O S\right] } \frac{\left\{\left(\left[S O D_{t}\right]-\left[C u / Z n^{n+1} \cdot S O D\right]\right)\left(K_{m_{2}}+\Delta K_{(n) m}+[R O S]\right)-\left(\left[S O D_{t}\right]-\left[C u / Z n^{n} \cdot S O D\right]\right)\right\} \times[R O S]}{[R O S]+\left(K_{m_{2}}+\Delta K_{(n) m}+[R O S]\right)^{2}} \\K_{m_{2}}=\frac{K_{-3}+K_{4}}{K_{3}} \\K_{m_{1}}=\frac{K_{-1}+K_{2}}{K_{1}}\end{array}\end{array}\] The results are summarized as follows: \[\begin{array}{c}V=\frac{\left(V_{(n+1) \max }+\Delta V_{(n+1) \max }\right) \times[R O S]}{\left.[R O S]+\left(K_{m_{1}}+\Delta K_{(n+1}\right) m+[R O S]\right)^{2}}+\frac{\left(V_{(n) \max }+\Delta V_{(n) \max }\right) \times[R O S]}{[R O S]+\left(K_{m_{2}}+\Delta K_{(n) m}+[R O S]\right)^{2}} \\=\frac{\left(V_{(n+1) \max }+\Delta V_{(n+1) \max }\right) \times[R O S]}{[R O S]+\left(\frac{K_{-1}+K_{2}}{K_{1}}+\Delta K_{(n+1) m}+[R O S]\right)^{2}} \\+\frac{\left(V_{(n) \max }+\Delta V_{(n) \max }\right) \times[R O S]}{[R O S]+\left(\frac{K_{-3}+K_{4}}{K_{3}}+\Delta K_{(n) m}+[R O S]\right)^{2}}\end{array}\] For the determination of newly introduced variables, experimental data fitting can be employed, followed by the calculation of relative changes. Using preliminary data, we can approximately simulate the enzyme reaction rate as a function of substrate concentration under mutation conditions:

img1

img2

img3

img4

Figure 27: Graph of enzyme reaction rate as a function of substrate concentration under mutation

In the figure above, from Panel 1 to Panel 4, the additional data are as follows: \[\begin{align} \Delta V_{(n+1)\text{max}} & : -2, -8, -2, -8; \\ \Delta V_{(n)\text{max}} & : 3, 9, 3, 9; \\ \Delta K_{(n+1)m} & : 5, 5, 15, 15; \\ \Delta K_{(n)m} & : 3, 3, 13, 13 \end{align} \] It can be predicted that changes in the structure, rate, and Michaelis constant will have a significant impact on the overall reaction rate. Therefore, during the directed mutation of the SOD enzyme structure, the dynamics and stability of the enzyme-substrate complex are crucial factors in enzyme engineering.

5. Conclusions and Future Work

This study focuses on the production of reactive oxygen species (ROS) and the role of superoxide dismutase (SOD) in inflammatory bowel disease (IBD), particularly ulcerative colitis (UC). A SOD-ROS enzymatic reaction model based on the Michaelis-Menten equation has been established, providing a clear representation of the relationship between enzyme reaction rate and substrate concentration. The paper also details how excessive accumulation of ROS in IBD patients leads to excessive apoptosis through mitochondrial mechanisms, and explores how SOD mitigates oxidative stress-induced cellular damage by eliminating ROS. Additionally, the study expands on the understanding of SOD's nuclear localization and gene regulatory functions, as well as the regulation of SOD activity through signal transduction pathways, highlighting SOD's critical role in the cellular antioxidant defense system.

By simulating the SOD enzymatic reaction, we can predict the efficiency of ROS elimination by SOD under various conditions. Subsequent work will involve simplifying the model, incorporating additional relevant parameters, and enhancing the model's robustness to environmental factors. This provides a crucial basis for understanding the function of SOD under both physiological and pathological conditions and offers theoretical support for potential therapeutic strategies for IBD.

Reference

[1]. Du L, Ha C. Epidemiology and pathogenesis of ulcerative colitis.Gastroenterol Clin North Am. 2020;49:643–654.
[2]. Xia Tingting, Zhong Liang, Rong Lan, et al. Reactive Oxygen Species and Inflammatory Bowel Disease . International Journal of Digestion, 2015, 35(04): 234.
[3]. Yue Wan, Lei Yang, Shu Jiang, Dawei Qian, Jinao Duan, Excessive Apoptosis in Ulcerative Colitis: Crosstalk Between Apoptosis, ROS, ER Stress, and Intestinal Homeostasis, Inflammatory Bowel Diseases, Volume 28, Issue 4, April 2022, Pages 639–648.
[4]. Eaden, J.A., Abrams, K.R., & Mayberry, J.F. (2001). The risk of colorectal cancer in ulcerative colitis: a meta-analysis. Gut, 48(4), 526-535.
[5]. Rutter, M., Saunders, B., Wilkinson, K., Rumbles, S., Schofield, G., Kamm, M., & Williams, C.B. (2006). Severity of inflammation is a risk factor for colorectal neoplasia in ulcerative colitis. Gastroenterology, 130(7), 1850-1861.
[6]. Itzkowitz, S.H., & Yio, X. (2004). Inflammation and cancer IV. Colorectal cancer in inflammatory bowel disease: the role of inflammation. American Journal of Physiology-Gastrointestinal and Liver Physiology, 287(1), G7-G17.
[7]. Glasauer A, Chandel NS. Targeting antioxidants for cancer therapy. Biochem Pharmacol 2014; 92: 90–101.
[8]. Panieri, E., Santoro, M. ROS homeostasis and metabolism: a dangerous liason in cancer cells. Cell Death Dis 7, e2253 (2016).
[9]. Szatrowski, T.P., & Nathan, C.F. (1991). Production of large amounts of hydrogen peroxide by human tumor cells. Cancer Research, 51(3), 794-798.
[10]. Ballard O, Towarnicki G. Mitochondria, the gut microbiome and ROS. Cell Signal. 2020;75:1–11.
[11]. Zhu H, Li YR. Oxidative stress and redox signaling mechanisms of inflammatory bowel disease: updated experimental and clinical evidence. Exp Biol Med (Maywood). 2012;237:474–480.
[12]. Srinivasan B. A guide to the Michaelis-Menten equation: steady state and beyond. FEBS J. 2022 Oct;289(20):6086-6098. doi: 10.1111/febs.16124. Epub 2021 Jul 31. PMID: 34270860.
[13]. Mates, J.M., & Perez-Gomez, C. (1999). Antioxidant enzymes and human diseases. Clinical Biochemistry, 32(8), 595-603.
[14]. Tsang, C., Liu, Y., Thomas, J. et al. Superoxide dismutase 1 acts as a nuclear transcription factor to regulate oxidative stress resistance. Nat Commun 5, 3446 (2014).
[15]. Guan, G., & Lan, S. (2018). Implications of Antioxidant Systems in Inflammatory Bowel Disease. BioMed Research International, 2018, 1–7. doi:10.1155/2018/1290179
[16]. Osman, R., & Basch, H. (1984). On the mechanism of action of superoxide dismutase: a theoretical study. Journal of the American Chemical Society, 106(19), 5710–5714.
[17]. Nilsson, A., Pettersson, L.G.M., Hammer, B. et al. The electronic structure effect in heterogeneous catalysis. Catal Lett 100, 111–114 (2005).
[18]. Koshland, D. E. (1958). “Application of a Theory of Enzyme Specificity to Protein Synthesis.” Proceedings of the National Academy of Sciences of the United States of America, 44(2), 98-104.
[19]. Boehr, D. D., Nussinov, R., & Wright, P. E. (2009). “The role of dynamic conformational ensembles in biomolecular recognition.” Nature Chemical Biology, 5(11), 789-796.
[20]. Henzler-Wildman, K. A., & Kern, D. (2007). “Dynamic personalities of proteins.” Nature, 450(7172), 964-972.
[21]. Finkel, T., & Holbrook, N. J. (2000). “The Role of Protein Complex Formation in Enzyme Catalysis: Evidence from Superoxide Dismutase.” Nature Reviews Molecular Cell Biology, 1(6), 512-520.
[22]. Adachi, S., & Saito, Y. (2011). “Electronic effects in enzyme catalysis: Insights from quantum mechanical studies.” Journal of Physical Chemistry B, 115(15), 4482-4490.
[23]. Hille R. Molybdenum and tungsten in biology[J]. Trends in biochemical sciences, 2002, 27(7): 360-367.
[24]. Pinto AL, Hellinga HW, Caradonna JP. Construction of a catalytically active iron superoxide dismutase by rational protein design. Proc Natl Acad Sci U S A. 1997 May 27;94(11):5562-7.
[25]. James, T. L., & Mowshowitz, D. (2013). “The role of hydrogen bonding in enzyme catalysis.” Biochemistry, 52(20), 3700-3712.
[26]. Henzler-Wildman, K., Kern, D. Dynamic personalities of proteins. Nature 450, 964–972 (2007). https://doi.org/10.1038/nature06522
[27]. Santillán, M. (2008). On the Use of the Hill Functions in Mathematical Models of Gene Regulatory Networks. Mathematical Modelling of Natural Phenomena, 3(2), 85–97.