1 Overview

Due to the low efficiency of experimentally measuring the optimal temperature for proteins, a large number of proteins lack optimal temperature data. Our software can predict the optimal temperature from protein sequences, offering superior speed and accuracy compared to existing models. In the following introduction, you can learn more about this software and obtain guidance on how to use it.

1.1 Background

The optimal temperature of a protein refers to the temperature at which it exhibits its maximum activity. Temperature affects the kinetic energy of molecules and, in turn, the rates of biochemical reactions. For enzymes, which are a common type of protein, the optimal temperature is key to their ability to accelerate reactions efficiently.

In nature, organisms have evolved to have proteins that are optimized for their environmental temperatures. For example, thermophilic organisms have proteins with higher optimal temperatures, allowing them to thrive in hot environments. Conversely, psychrophilic organisms have proteins with lower optimal temperatures suited for cold conditions. Understanding and manipulating protein optimal temperature is important in fields like biotechnology, synthetic biology, and medicine, where proteins are engineered or used in varying conditions.

1.2 Motivation

Determining a protein's optimal temperature typically requires conducting numerous experiments across different temperature ranges. Each condition must be assessed for the protein's activity, stability, and risk of denaturation. This process can be highly resource-intensive, especially for high-throughput systems or novel proteins, requiring significant time and laboratory resources. As a result, optimal temperature data for many proteins are often lacking. To address this issue, we developed HEATMAP, software designed to directly predict a protein's optimal temperature from its sequence, helping to bridge these data gaps efficiently.

1.3 Highlights

1. Easy and fast prediction

2. Improved prediction accuracy compared to previous models

3. Extendable

4. User-friendly

5. The software code is open source, making it convenient to further improve the model for specific prediction targets.


2 Model

2.1 Introduction

Our HEATmap AI, designed to predict the optimal temperature for enzymes, is built on the advanced Hyena architecture. This framework provides several unique advantages over conventional models:

1. Efficient Long-Sequence Handling:

Unlike traditional models that struggle with increasing sequence lengths, Hyena is optimized for processing long sequences with minimal computational cost. While typical architectures scale inefficiently as sequence length increases, Hyena reduces this overhead by using a more streamlined approach, making it possible to handle longer sequences without the usual resource drain.

2. A Novel Alternative to Self-Attention:

Hyena moves away from the conventional self-attention mechanism seen in Transformers, which becomes cumbersome with extended sequences. Instead, it employs a convolution-based structure that can capture relationships across sequences without relying on attention. This allows it to maintain performance even when managing very long input data.

3. Optimized Memory Usage:

One of Hyena's standout features is its ability to manage memory more effectively. By breaking down sequences into manageable blocks and processing them in parallel, it significantly reduces the memory load, enabling smooth handling of large datasets such as genomic or protein sequences, even on limited hardware.

4. Hierarchical Information Extraction:

Hyena incorporates a hierarchical approach to extract information at different levels within a sequence. This layered approach ensures that the model captures both detailed local interactions and broader global patterns, making it well-suited for tasks where both short-term and long-term dependencies are important.

These features position HEATmap AI as a powerful and efficient tool for enzyme temperature prediction, offering a faster and more scalable solution compared to traditional methods.

2.2 Hyena Architecture

The internal architecture of Hyena is designed to optimize the processing of long-sequence data, offering a unique approach compared to traditional Transformer-based models. Here's an overview of its key components:

1. Convolution-Based Sequence Processing:

Unlike the self-attention mechanism in Transformers, which compares every token in the sequence to every other token, Hyena uses a convolutional mechanism to process the sequence more efficiently. The core idea is that the convolution captures dependencies between tokens in a local region and can be stacked to capture global dependencies.

For a 1D convolution applied to a sequence, the output at each position, \( y_i \), is calculated as:

\( y_i = \sum_{k=0}^{K-1} w_k \, x_{i+k} \)

Where:

\( x_i \) is the input at position \( i \),

\( w_k \) is the \( k \)-th weight of the convolutional filter of size \( K \),

\( y_i \) is the output of the convolution at position \( i \).

This mechanism allows Hyena to process long sequences with lower computational complexity compared to Transformers, as it avoids quadratic scaling.
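To make the operation concrete, here is a minimal sketch (illustrative only, not the HEATMAP code) of the formula above applied to a toy sequence with NumPy:

```python
# A minimal sketch (not the HEATMAP code): the 1D convolution
# y_i = sum_{k=0}^{K-1} w_k * x_{i+k}, applied to a toy sequence.
import numpy as np

def conv1d(x, w):
    """Slide a filter w of length K over x and sum the element-wise products."""
    K = len(w)
    return np.array([np.dot(w, x[i:i + K]) for i in range(len(x) - K + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy input sequence
w = np.array([0.5, 0.25, 0.25])          # filter of size K = 3
print(conv1d(x, w))                      # one output per valid position: [1.75 2.75 3.75]
```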

2. Efficient Kernel Function:

Hyena’s convolutional layers are complemented by a specialized kernel function designed to efficiently manage interactions between tokens in long sequences. The kernel operates in a hierarchical manner, capturing both short- and long-range dependencies, enabling scalable sequence processing.

The computation of a kernelized convolution in Hyena can be written as:

\( y_i = \sum_{k=0}^{K-1} \phi\!\left(x_{i+k}\right) w_k \)

Where:

\( \phi(x) \) is a kernel function that applies a nonlinear transformation to the input,

\( w_k \) are the learnable weights of the kernel.

This kernel function reduces the computational cost of handling long sequences from the \( O(n^2) \) complexity of attention in Transformers to \( O(n \log n) \), which allows Hyena to process longer sequences efficiently.
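Hyena evaluates its long, implicit convolutions with FFT-based algorithms, which is where the \( O(n \log n) \) cost comes from. The sketch below is illustrative only (the random filter stands in for Hyena's learned filter) and shows a length-n convolution computed via the FFT:

```python
# Illustrative sketch: a long convolution evaluated via the FFT in O(n log n),
# instead of the O(n^2) pairwise comparisons of self-attention.
import numpy as np

def fft_conv(x, w):
    """Linear convolution of x with filter w using the FFT."""
    n = len(x) + len(w) - 1                   # pad so circular conv equals linear conv
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w, n), n)
    return y[: len(x)]                        # keep one output per input position

x = np.random.randn(4096)                     # long input sequence
w = np.random.randn(4096)                     # filter as long as the input
y = fft_conv(x, w)                            # costs O(n log n), not O(n^2)
```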

3. Hierarchical Feature Extraction:

Hyena adopts a hierarchical structure to extract features from the input sequence at different layers of abstraction, similar to Convolutional Neural Networks (CNNs). This helps the model capture both local and global patterns in the data. The idea here is to progressively pool information from small windows (local) to larger windows (global).

If we define the output of each layer as \( h^{(l)} \), where \( l \) is the layer index, the computation at layer \( l+1 \) is:

\( h_i^{(l+1)} = \sigma\!\left( \sum_{k=0}^{K-1} w_k^{(l)} \, h_{i+k}^{(l)} + b^{(l)} \right) \)

Where:

\( h^{(l)}_i \) is the input from the previous layer,

\( w_k^{(l)} \) are the convolutional weights at layer \( l \),

\( b^{(l)} \) is the bias term,

\( \sigma \) is a nonlinear activation function, typically ReLU or GELU.

This allows the model to gradually aggregate information from neighboring tokens, capturing increasingly larger context windows.
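As an illustration of this layered aggregation, the sketch below stacks a few 1D convolutions in PyTorch; the layer count, width, and kernel size are arbitrary assumptions, not the HEATMAP settings:

```python
# Illustrative sketch (layer count, width, and kernel size are assumptions,
# not the HEATMAP settings): stacking 1D convolutions so each layer sees a
# progressively larger context window.
import torch
import torch.nn as nn

class HierarchicalConv(nn.Module):
    def __init__(self, d_model=128, n_layers=4, kernel_size=5):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
            for _ in range(n_layers)
        ])
        self.act = nn.GELU()                  # the nonlinearity sigma

    def forward(self, h):                     # h: (batch, d_model, seq_len)
        for conv in self.layers:              # receptive field grows layer by layer
            h = self.act(conv(h))
        return h

h = torch.randn(2, 128, 1024)                 # toy batch of embedded sequences
out = HierarchicalConv()(h)                   # same shape, larger effective context
```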

4. Block-Based Parallel Computation:

Hyena divides long sequences into smaller blocks, which are processed independently in parallel. This block-based strategy enables more efficient computation, especially for long sequences that would otherwise overwhelm memory resources in traditional models. Each block \( B_j \) contains a subset of the sequence tokens, and these blocks are processed in parallel.

The total computational cost is reduced because the sequence is split into \( N \) blocks and the operations within each block are independent:

\( \text{Total Cost} = O\!\left( \frac{n}{N} \log \frac{n}{N} \right) \)

where \( N \) is the number of blocks and \( n \) is the sequence length.

This parallelism reduces the latency and memory overhead associated with processing very long sequences in Transformers.
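A minimal sketch of the block idea (illustrative only; real block-parallel execution would use batched GPU kernels rather than a Python loop):

```python
# Illustrative sketch: splitting a length-n sequence into N blocks of n/N
# tokens and processing each block independently.
import numpy as np

def process_in_blocks(x, n_blocks, block_fn):
    """Reshape x into (n_blocks, n // n_blocks) and apply block_fn per block."""
    blocks = x.reshape(n_blocks, -1)
    return np.concatenate([block_fn(b) for b in blocks])

x = np.random.randn(8192)
y = process_in_blocks(x, n_blocks=8, block_fn=lambda b: b - b.mean())
```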

5. Dynamic Memory Management:

Hyena’s memory management is dynamic and adjusts to the length of the input sequence. Instead of statically reserving memory for the maximum sequence length, Hyena dynamically allocates memory based on the actual sequence length, thus making it more efficient in memory-constrained environments.

If the sequence length is \( n \), the memory footprint \( M \) for Hyena scales as:

\( M = O(n \log n) \)

In contrast to the \( O(n^2) \) memory complexity of standard Transformers, this makes Hyena significantly more memory-efficient.

6. Sequence-to-Sequence Encoding:

Hyena is designed for sequence-to-sequence tasks, where the input is encoded into a sequence representation suitable for various downstream tasks like classification or regression. The convolutional layers and kernel-based processing replace the need for self-attention, enabling Hyena to perform encoding more efficiently.

If \( h^{(l)} \) is the encoded sequence at layer \( l \), the final prediction for a task (e.g., classification or regression) is typically obtained by feeding the encoded sequence from the last layer into a fully connected layer:

\( \hat{y} = W^{(f)} h^{(L)} + b^{(f)} \)

Where:

\( W^{(f)} \) and \( b^{(f)} \) are the weights and bias of the fully connected layer,

\( h^{(L)} \) is the sequence representation from the final layer \( L \),

\( \hat{y} \) is the model's prediction.

This sequence-to-sequence architecture allows Hyena to generate predictions for various sequence-based tasks efficiently.
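For a regression task such as optimal temperature prediction, this final step can be sketched as follows; the mean-pooling over positions and the layer sizes are assumptions for illustration, not the HEATMAP head:

```python
# Illustrative sketch: mapping the final encoded sequence h^(L) to a single
# regression output via a fully connected layer, y_hat = W^(f) h^(L) + b^(f).
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    def __init__(self, d_model=128):
        super().__init__()
        self.fc = nn.Linear(d_model, 1)       # W^(f) and b^(f)

    def forward(self, h):                     # h: (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                # summarize the whole sequence
        return self.fc(pooled).squeeze(-1)    # one predicted value per sequence

h = torch.randn(4, 512, 128)                  # encoded sequences from the final layer
temps = RegressionHead()(h)                   # shape: (4,)
```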

2.3 Model Training

Deep learning training is fundamentally about guiding a neural network to approximate a target function through data-driven optimization. The process involves defining the model architecture, forward propagation, loss calculation, backpropagation, and optimization. Below is an explanation of these steps, along with the relevant mathematical formulas.

1. Forward Propagation:

In deep learning, the input data \( \mathbf{x} \) passes through multiple layers of neurons, with each layer applying a linear transformation followed by a nonlinear activation function. The final output is a prediction \( \hat{y} \). For a fully connected neural network, the forward propagation for the \( l \)-th layer is given by:

\( z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \)

Where:

\( W^{(l)} \) is the weight matrix of layer \( l \),

\( a^{(l-1)} \) is the output of the previous layer (for the input layer, \( a^{(0)} = \mathbf{x} \)),

\( b^{(l)} \) is the bias vector of layer \( l \),

\( z^{(l)} \) is the result of the linear transformation at layer \( l \).

Next, an activation function \( \sigma \) is applied to \( z^{(l)} \) to get the layer's output:

\( a^{(l)} = \sigma\!\left( z^{(l)} \right) \)

Common activation functions include:

Sigmoid: \( \sigma(z) = \dfrac{1}{1 + e^{-z}} \)

ReLU: \( \sigma(z) = \max(0, z) \)

Tanh: \( \sigma(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \)

The final layer output is \( \hat{y} \), the prediction of the model.
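A toy NumPy sketch of this forward pass through two fully connected layers (sizes are arbitrary, chosen only for illustration):

```python
# Toy sketch of forward propagation through two fully connected layers,
# following z^(l) = W^(l) a^(l-1) + b^(l) and a^(l) = sigma(z^(l)).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                        # a^(0) = x, one 8-dimensional input
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)

a1 = relu(W1 @ x + b1)                        # hidden layer output a^(1)
y_hat = W2 @ a1 + b2                          # final prediction (linear output layer)
```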

2. Loss Function:

The loss function quantifies the difference between the predicted value \( \hat{y} \) and the true value \( y \). Common loss functions include:

Mean Squared Error (MSE) for regression tasks:

\( L = \dfrac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \)

where \( N \) is the number of samples, \( y_i \) is the true value, and \( \hat{y}_i \) is the predicted value.

Cross-Entropy Loss for classification tasks:

\( L = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + \left(1 - y_i\right) \log\!\left(1 - \hat{y}_i\right) \right] \)

where \( y_i \) is the true label (0 or 1) and \( \hat{y}_i \) is the predicted probability.
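Both losses can be computed in a few lines of NumPy (toy values only):

```python
# Toy sketch of the two losses above, computed on made-up values.
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, p):
    eps = 1e-12                               # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

print(mse(np.array([50.0, 43.7]), np.array([48.2, 45.0])))           # regression
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))  # classification
```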

3. Backpropagation:

Backpropagation is used to calculate the gradients of the loss function with respect to the model's parameters (weights and biases) so that they can be updated during optimization. The gradients are calculated using the chain rule, propagating the error from the output layer backward through the network.

For the output layer \( L \), the error term is

\( \delta^{(L)} = \nabla_{z^{(L)}} L = \left( a^{(L)} - y \right) \odot \sigma'\!\left( z^{(L)} \right) \)

and it is propagated backward through each earlier layer as

\( \delta^{(l)} = \left( W^{(l+1)} \right)^{\top} \delta^{(l+1)} \odot \sigma'\!\left( z^{(l)} \right) \)

The gradients of the loss with respect to the weights and biases are then

\( \dfrac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left( a^{(l-1)} \right)^{\top}, \qquad \dfrac{\partial L}{\partial b^{(l)}} = \delta^{(l)} \)

These gradients are used to update the weights and biases of the model.
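Continuing the toy two-layer network from the forward-pass sketch (weights redefined here so the block runs on its own), the gradients for an MSE loss \( L = \tfrac{1}{2}(\hat{y} - y)^2 \) can be written out by hand:

```python
# Toy sketch: backpropagation through the two-layer network, with
# L = 0.5 * (y_hat - y)^2 so that dL/dy_hat = y_hat - y.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x, y = rng.normal(size=8), 43.7               # one input and its true value
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)

z1 = W1 @ x + b1                              # forward pass
a1 = relu(z1)
y_hat = W2 @ a1 + b2

delta2 = y_hat - y                            # error at the (linear) output layer
dW2, db2 = np.outer(delta2, a1), delta2       # gradients for W^(2), b^(2)
delta1 = (W2.T @ delta2) * (z1 > 0)           # propagate error through the ReLU
dW1, db1 = np.outer(delta1, x), delta1        # gradients for W^(1), b^(1)
```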

4. Optimization:

After computing the gradients, the model’s parameters are updated to minimize the loss function. This process is repeated over multiple iterations (epochs). Common optimization algorithms include:

Gradient Descent:

\( W^{(l)} := W^{(l)} - \eta \, \dfrac{\partial L}{\partial W^{(l)}} \)

where \( \eta \) is the learning rate.

5. Training Process:

The entire training process can be summarized as follows:

1. Perform forward propagation to compute the model's output \( \hat{y} \).

2. Calculate the loss \( L \) by comparing \( \hat{y} \) to the true output \( y \).

3. Use backpropagation to compute the gradients of the loss with respect to the model's parameters.

4. Apply an optimization algorithm to update the parameters based on the computed gradients.

5. Repeat the process over multiple iterations (epochs) until the loss converges or a stopping criterion is met.

Through this iterative process, the model learns to minimize the loss function and improve its predictions on new, unseen data.
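The five steps above map directly onto a standard training loop. The sketch below is illustrative only (not the HEATMAP training script) and uses PyTorch with an MSE loss and plain gradient descent:

```python
# Illustrative sketch: the five training steps as a standard PyTorch loop.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # eta = learning rate
loss_fn = nn.MSELoss()

X = torch.randn(64, 8)                        # toy features
y = torch.randn(64, 1) * 10 + 50              # toy regression targets

for epoch in range(100):                      # 5. repeat until the loss converges
    y_hat = model(X)                          # 1. forward propagation
    loss = loss_fn(y_hat, y)                  # 2. loss calculation
    optimizer.zero_grad()
    loss.backward()                           # 3. backpropagation
    optimizer.step()                          # 4. parameter update
```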


3 Engineering

3.1 Design

We identified several candidate model architectures, including:

ESM
Saprot
RosettaFold
ProteinBERT
ProtTrans
PMLM
ProtFlash
ProteinBPT
HyenaDNA

Ultimately, we chose HyenaDNA as our final architecture.

HyenaDNA Architecture

The model architecture is based on the Hyena architecture, a novel deep learning framework designed to accelerate natural language processing (NLP) tasks, particularly excelling in long-sequence modeling. Unlike traditional Transformer architectures, Hyena reduces both computational and memory complexity, making it more efficient at handling longer sequences.

The Hyena architecture has the following features:

1. Lightweight Long-Sequence Modeling:

Hyena is optimized for efficiently processing long-sequence data. In standard Transformer models, handling long sequences demands significant computational power and memory, with complexity scaling as O(n²), where *n* is the sequence length. Hyena introduces a more efficient computation mechanism, reducing the complexity to O(n log n). This makes it far more memory- and resource-efficient, enabling it to handle much longer input sequences.

2. Alternative to Self-Attention:

In Transformer architectures, the self-attention mechanism is the core component but becomes inefficient when dealing with very long sequences. Hyena replaces self-attention with a new convolution-based structure and kernel design, which can capture global dependencies without relying on the traditional attention mechanism. This alleviates the bottleneck of Transformer models in long-sequence tasks.

3. Dynamic Memory Management:

Hyena’s architecture is designed with flexible and efficient memory usage, making it particularly effective for long-sequence processing. It employs block-based computation, allowing the model to handle long sequences in parallel without consuming excessive GPU resources. This feature is especially useful for tasks that require processing large texts, genomic data, or other massive sequence data.

4. Efficient Hierarchical Feature Extraction:

Hyena incorporates convolutional neural network (CNN)-like feature extraction techniques that capture information at different layers within the sequence. This hierarchical feature extraction allows the model to retain both local and global context more efficiently when processing long sequences. As a result, it is better at balancing short-range and long-range dependencies.

3.2 Build
3.2.1 Data Collection

We collected data from current research articles and several open-access protein databases such as UniProt and BRENDA. After filtering out low-quality entries and removing duplicates, we retained 6,091 protein sequences along with their optimal temperature values. These sequences were then split into training and test datasets at a specified ratio, as sketched below.
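A minimal sketch of such a split (the file names, column names, and the 8:2 ratio are assumptions, not necessarily the exact ones we used):

```python
# Illustrative sketch: splitting the curated sequence/temperature pairs into
# training and test sets.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("protein_topt.csv")        # assumed columns: sequence, topt
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```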

3.2.2 Model Training

The Python environment requirements can be found in the hyena-dna repository.

However, the hyena-dna model is designed for DNA sequences, so we made some adaptations to make it work on protein sequences.

The complete code is here.
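For readers curious what such an adaptation involves, the sketch below shows one possible character-level amino-acid tokenizer; it is illustrative only, and the actual adaptation is in the linked code:

```python
# Illustrative sketch of one possible adaptation (not the linked code): a
# character-level amino-acid vocabulary replacing the nucleotide vocabulary,
# so that protein sequences can be encoded as token ids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, UNK = "<pad>", "<unk>"
vocab = {tok: i for i, tok in enumerate([PAD, UNK] + list(AMINO_ACIDS))}

def encode(sequence, max_length=1024):
    """Map a protein sequence to token ids, truncated and padded to max_length."""
    ids = [vocab.get(aa, vocab[UNK]) for aa in sequence[:max_length]]
    return ids + [vocab[PAD]] * (max_length - len(ids))

print(encode("MKTLLVLAV")[:12])               # first few token ids of a toy sequence
```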

The key hyperparameters are as follows:

layer: number of layers in the model

d_model: dimension of the hidden states in each layer of the model

max_length: maximum length of the input protein sequence

To test the influence of these hyperparameters, we conducted multiple trials.
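For reference, these hyperparameters can be collected in a simple configuration like the one below; the concrete values shown are placeholders, not the tuned settings of our released model:

```python
# Illustrative configuration only; the values below are placeholders.
config = {
    "layer": 4,            # number of layers in the model
    "d_model": 256,        # hidden-state dimension in each layer
    "max_length": 1024,    # maximum input protein sequence length
}
```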





3.3 Test
3.3.1 Cross-validation

To evaluate the performance of our model, we utilized the cross-validation method. This robust statistical technique assesses how well the results of a model generalize to an independent dataset. Below are the results obtained from our cross-validation process:
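Before the results, a minimal sketch of the k-fold procedure for reference (scikit-learn KFold; the fold count, toy data, and placeholder training/evaluation calls are assumptions, not our exact script):

```python
# Illustrative k-fold cross-validation sketch; train_model / evaluate_model
# are hypothetical placeholders for the actual training and evaluation code.
import numpy as np
from sklearn.model_selection import KFold

sequences = np.array(["MKT...", "GAV...", "LLQ...", "TTS...", "PPA..."])  # toy data
topts = np.array([43.7, 55.0, 37.2, 60.1, 48.9])

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(sequences)):
    # model = train_model(sequences[train_idx], topts[train_idx])
    # score = evaluate_model(model, sequences[val_idx], topts[val_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation sequences")
```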



3.3.2 Wet-lab

In the wet-lab section, we validated the accuracy of the HEATMAP model's predicted optimal temperature for enzymes by investigating the catalytic efficiency of trypsin in hydrolyzing casein at different temperatures. Based on the amino acid sequence of trypsin, the HEATMAP model predicted an optimal temperature of 43.71°C. Consequently, we designed three reaction systems at 33.5°C, 43.5°C, and 53.5°C, respectively. The catalytic efficiency under these temperature conditions was characterized by measuring the absorbance of the reaction product, L-tyrosine, at its characteristic absorption peak of 275 nm, using a Thermo Scientific NanoDrop spectrophotometer to monitor the trypsin cleavage of α-casein. The absorbance values of the 43.5°C reaction group at wavelengths of 270 nm, 275 nm, and 280 nm were 0.243, 0.285, and 0.283, respectively, which were higher than those of the 33.5°C reaction group (0.221, 0.265, 0.270) and the 53.5°C reaction group (0.236, 0.277, 0.280). This likely indicates that more L-tyrosine was produced at 43.5°C, suggesting that the catalytic efficiency of trypsin was highest at this temperature, potentially corresponding to its optimal temperature. This observation aligns with the HEATMAP model's predicted optimal temperature of 43.71°C, thereby validating the HEATMAP model's accuracy through wet-lab experiments.

3.4 Learn

Following our wet lab experiments, we are excited to announce our plans to release optimized versions of our model that will be specifically tailored for different species. This customization will enable us to significantly enhance the prediction accuracy for each species, allowing researchers and practitioners to obtain more precise insights into protein behavior and characteristics.

In addition to this tailored approach, we are particularly focused on species that demonstrate high prediction accuracy after thorough testing. For these selected species, we aim to generate and provide optimal temperature data for protein sequences that currently lack such crucial information. This effort not only addresses gaps in existing knowledge but also supports ongoing research in the field of synthetic biology and protein engineering.

To ensure that this valuable information is easily accessible to the scientific community, we will update our website with these new predictions. Users will be able to access this data directly, facilitating their research and applications in various domains. By making our findings readily available, we hope to contribute to the advancement of knowledge in protein science and aid researchers in their endeavors.


4 Implementation

4.1 Web UI

Click here to use.

4.2 Why we built it

HEATMAP is a tool that uses protein sequences to predict optimal temperatures, providing a much more efficient and accurate way of obtaining temperature data. This data can be applied in many ways, for instance, to harmonize the metabolic pathway of \( Saccharopolyspora\:spinosa \) by evolving key enzymes whose optimal temperatures differ substantially from the fermentation temperature. Targeted evolution of these key enzymes brings their optimal temperatures closer to the actual fermentation temperatures, thus improving productivity.

The application mentioned above, which we carried out in our project, is just one use of HEATMAP. Beyond that, optimal temperature, as a fundamental characteristic of enzymes, has long been a research focal point, serving as a key link between biological activity and temperature. HEATMAP can offer precise data to support wet-lab research, industrial applications, and more.

4.3 How to use

HEATMAP provides two main functions. On the one hand, users can choose ‘AI’ to enter protein sequences or upload files and obtain the predicted optimal temperature. On the other hand, users can choose ‘Database’ to search for information on proteins whose optimal temperatures have already been determined and verified.

In the AI part, users can choose to type in protein sequences or upload a file.

If users click ‘Sequences’, the ‘please type in the sequence’ input box will turn white, allowing users to enter any protein sequence.



Subsequently, users can click ‘Submit’, and the website will show the resulting optimal temperature in the table on the right, along with an FPS value indicating the prediction speed of HEATMAP.



If users click ‘File’, the ‘Choose file’ box will turn white so that users can upload a file containing protein sequences. After users click ‘Submit’, the website will display the resulting optimal temperatures in the table on the right, along with an FPS value indicating the prediction speed of HEATMAP.







As for ‘Database’, users can input an ‘Entry’, a ‘Sequence’, or an ‘Optimal temperature’, and the website will return the corresponding values for the other fields.

Click ‘Download all’ to save them on your laptop!


5 Result

We tested our model on the test dataset, and here are the results:


6 Reference

[1] Nguyen E, Poli M, Faizi M, et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution[J]. Advances in Neural Information Processing Systems, 2024, 36.

[2] Wang X, Zong Y, Zhou X, et al. Artificial intelligence-powered construction of a microbial optimal growth temperature database and its impact on enzyme optimal temperature prediction[J]. The Journal of Physical Chemistry B, 2024, 128(10): 2281-2292.
[2]Wang X, Zong Y, Zhou X, et al. Artificial Intelligence-Powered Construction of a Microbial Optimal Growth Temperature Database and Its Impact on Enzyme Optimal Temperature Prediction[J]. The Journal of Physical Chemistry B, 2024, 128(10): 2281-2292.