Software | NTU-Taiwan

Introduction

In the field of synthetic biology and bioinformatics, the design and optimization of proteins are of critical significance, especially in the generation of sequences. The repetitive segments in spider silk protein make it particularly challenging for biologists to determine which segments are most crucial. With advances in protein structure prediction and the continuous growth of sequence data, along with various deep learning models such as AlphaFold and OmegaFold, the possibilities for protein design have greatly expanded. Nevertheless, the challenge remains to develop a comprehensive tool that can efficiently manipulate protein sequences, evaluate their properties, and accurately display their structures.

Our innovative software tool, GPSS, enables users to effortlessly input spider silk sequence properties through an intuitive interface. Upon input, the tool generates multiple sequences and provides detailed visualizations of their structures and predicted properties. Additionally, this tool offers functionalities to assess the stability, functionality, and potential applications of the designed proteins, making it an invaluable resource for researchers and scientists. Furthermore, it integrates advanced algorithms to analyze sequence variability and potential critical motifs, facilitating more targeted and effective protein engineering.

GPT for Sequence Generation: Benefits and Mathematical Foundations

Generative Pretrained Transformer (GPT) is a state-of-the-art model that has significantly advanced natural language processing and sequence generation tasks. In the context of this project, GPT plays a pivotal role in generating spider silk protein sequences with specific mechanical properties. GPT's strengths lie in its ability to generate coherent and contextually appropriate sequences based on prior training, making it highly suitable for bioinformatics tasks that require pattern recognition in sequences.

Benefits of Using GPT in Protein Design

Sequence Coherence: GPT uses the Transformer architecture, which excels at capturing long-range dependencies. This is particularly important for protein sequences, where the order of amino acids can significantly impact their properties.
Scalability: GPT can handle large datasets and generate long sequences, making it ideal for bioinformatics applications involving long protein chains.
Flexibility: GPT is flexible and can be fine-tuned on specific datasets, such as spider silk proteins, to generate sequences that exhibit desired physical properties.
Context Awareness: Unlike traditional sequence generation models, GPT takes into account the context of the entire sequence, allowing it to generate more biologically relevant outputs.
Sampling Control: With hyperparameters like temperature and top-k/top-p sampling, GPT allows fine-grained control over the diversity and randomness of generated sequences, ensuring that the outputs can be both novel and specific to the input criteria.

Mathematical Foundations of GPT

GPT is based on the Transformer architecture, which relies on self-attention mechanisms. The core idea is to compute attention scores that determine how each token (or amino acid in our case) in the sequence relates to every other token. This allows GPT to efficiently capture relationships between distant elements in a sequence.

The key mathematical concepts in GPT include:

Self-Attention Mechanism: At each layer of the Transformer, GPT applies a self-attention operation where each token in the input attends to every other token, allowing it to weigh their importance in the sequence. The attention score is calculated as follows:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V \]

where Q (query), K (key), and V (value) are learned representations of the input sequence, and d_k is the dimension of the key vectors.
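Positional Encoding: Since Transformers do not have a built-in understanding of sequence order, positional information is added to the input embeddings, which helps the model capture the order of amino acids in the sequence. GPT-style models typically learn these position embeddings during training; the fixed sinusoidal encodings of the original Transformer illustrate the idea and are computed as: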

\[ PE_{\text{pos}, 2i} = \sin\left( \frac{\text{pos}}{10000^{2i/d_{\text{model}}}} \right) \quad \text{and} \quad PE_{\text{pos}, 2i+1} = \cos\left( \frac{\text{pos}}{10000^{2i/d_{\text{model}}}} \right) \]

where pos is the position in the sequence, \(i\) indexes the pairs of embedding dimensions, and \(d_{\text{model}}\) is the dimension of the model.
Language Modeling Objective: GPT is trained using a causal language modeling objective, meaning it predicts the next token in a sequence based on the preceding tokens. The model learns to maximize the likelihood of the next token given the prior context:

\[ P(x) = \prod_{t=1}^{T} P(x_t | x_{ < t}; \theta) \]

where \(x_t\) is the token at position \(t\), \(x_{<t}\) denotes the tokens preceding it, \(T\) is the sequence length, and \(\theta\) denotes the model parameters.
Sampling Techniques: During sequence generation, GPT uses sampling methods like top-k and nucleus sampling (top-p) to generate diverse yet coherent outputs. In top-k sampling, the model samples from the \(k\) most likely tokens at each step, while top-p sampling samples from the smallest set of tokens whose cumulative probability exceeds a threshold \(p\).
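
The two sampling filters can be sketched in a few lines of plain Python. This is a minimal illustration, not the code GPSS runs: the function names, the default values, and the toy probabilities are assumptions made for the example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p. Returns {index: renormalized prob}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample the index of the next token from the filtered distribution."""
    kept = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(kept)
    return rng.choices(indices, weights=[kept[i] for i in indices], k=1)[0]
```

With probabilities (0.5, 0.3, 0.15, 0.05), `top_k=2` keeps only the first two tokens, while `top_p=0.9` keeps the first three because their cumulative probability is the first to reach the threshold.
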
GPT's self-attention mechanism allows it to handle long-range dependencies, which is important for generating biologically plausible sequences. By utilizing these mechanisms, GPT predicts each next amino acid from the preceding context, which helps the generated spider silk proteins exhibit the desired properties.
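
To make the attention and positional-encoding formulas concrete, here is a small pure-Python sketch. It is illustrative only: the function names are hypothetical, the inputs are lists of small vectors rather than learned representations, and the causal mask (each position attends only to itself and earlier positions) is added because GPT generates left to right, although the mask does not appear in the attention formula above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with a causal mask.

    Q, K, V are lists of T vectors (lists of floats)."""
    d_k = len(K[0])
    outputs = []
    for t, query in enumerate(Q):
        # Position t only sees positions 0..t.
        scores = [
            sum(q * k for q, k in zip(query, K[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * V[s][j] for s, w in enumerate(weights))
            for j in range(len(V[0]))
        ])
    return outputs


def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for j in range(d_model):
        i = j // 2  # index of the (sin, cos) pair
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return encoding
```

Each output row is a weighted average of the value vectors the position is allowed to see, so the first position simply returns its own value vector.
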

Motivation for Code Design

The primary driving force behind the development of this code was to bridge the gap between designing spider silk sequences and analyzing their structures for protein optimization. Traditional approaches often handle these steps separately, forcing researchers to juggle multiple tools and platforms, which can be inefficient and error-prone. By consolidating these tasks into a single platform, our solution enables the straightforward creation of necessary spider silk protein sequences, thereby streamlining subsequent biological experiments and minimizing redundant efforts.

There was a clear demand for a unified solution that could:

Create spider silk sequences according to specific needs.
Evaluate their properties to inform further optimization.
Integrate advanced algorithms for sequence analysis.
Predict the structures of these sequences.

This code was designed to serve as a comprehensive solution to meet these needs, simplifying the process and reducing the time and complexity involved in protein optimization. It provides researchers with a powerful tool to efficiently design and optimize spider silk proteins, facilitating groundbreaking advancements in synthetic biology and bioinformatics.

Workflow of the Code

The code’s workflow can be broadly divided into four main phases:

Phase 1: Spider Silk Sequence Generation

The code begins with an initial set of expected spider silk physical properties: toughness, elastic modulus, tensile strength, and strain at break.
It uses a model developed by the MIT team, available on Hugging Face, for sequence generation, with some code modifications to suit our needs.
The generated sequence variants are stored for further analysis.

Phase 2: Reverse Task Validation

The model from https://github.com/pandeyakash23/spider_silk_codes is loaded to perform a reverse task that validates the sequence generated in Phase 1. We have also adjusted and further trained this model.
If the predicted physical properties in this phase do not meet the initial input criteria from Phase 1, the process returns to Phase 1 for sequence regeneration until the desired properties are achieved.
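
The interplay between Phase 1 and Phase 2 can be sketched as a simple loop. Everything named here is hypothetical: `generate` stands in for the sequence generator, `predict` for the reverse-task model, and the `tolerance` and `max_rounds` parameters are assumptions, since the exact acceptance criterion is not specified above.

```python
def generate_until_valid(target, generate, predict, tolerance=1.0, max_rounds=10):
    """Regenerate sequences until predicted properties match the target.

    target     -- dict of property name -> requested level
    generate   -- callable(target) -> sequence string (Phase 1 stand-in)
    predict    -- callable(sequence) -> dict of predicted levels (Phase 2 stand-in)
    tolerance  -- assumed maximum allowed deviation per property
    max_rounds -- assumed cap so the loop always terminates
    """
    for round_index in range(1, max_rounds + 1):
        sequence = generate(target)
        predicted = predict(sequence)
        if all(abs(predicted[name] - level) <= tolerance
               for name, level in target.items()):
            return sequence, predicted, round_index
    # No candidate met the criteria within the allowed number of rounds.
    return None, None, max_rounds
```

Capping the number of rounds is a design choice for the sketch: without it, a target the generator cannot reach would loop forever.
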

Phase 3: Structure Prediction Using OmegaFold

This phase is primarily based on the work by Mirdita et al., with some modifications to the code.
Within this phase, the structure of the generated sequence is predicted using the OmegaFold model.
The code is designed to handle different versions and configurations of the OmegaFold model, ensuring flexibility and adaptability.
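
As a rough sketch of how a generated sequence could be handed to OmegaFold: the sequence is written as a FASTA record, and the `omegafold` command-line tool is called with an input FASTA file and an output directory, which is the basic invocation its README describes. The helper names, the record name, and the file paths are assumptions for illustration and are not taken from the GPSS code.

```python
def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def omegafold_command(fasta_path, output_dir):
    """Build the argument list for the basic OmegaFold call (paths are hypothetical)."""
    return ["omegafold", fasta_path, output_dir]


if __name__ == "__main__":
    record = to_fasta("silk_candidate_1", "GA" * 35)
    print(record, end="")
    # The list below could be passed to subprocess.run once the FASTA file exists.
    print(omegafold_command("silk_candidate_1.fasta", "structures"))
```
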

Phase 4: Multiple Property Evaluation

Various properties are predicted using a combination of libraries:

Biopython Library.
Scikit-bio Library.
Protein Dynamics and Sequence Analysis.

Here are the mathematical equations used for predicting different properties:

Molecular Weight (MW) Calculation

\[ MW = \sum_{i=1}^{n} (MW_i \times n_i) \]

where \(MW_i\) is the molecular weight of the \(i\)th amino acid and \(n_i\) is the number of occurrences of the \(i\)th amino acid in the sequence.

Isoelectric Point (pI) Calculation

\[ pI = \frac{pK_a + pK_b}{2} \]

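where \(pK_a\) and \(pK_b\) are the pK values of the acidic and basic groups. This closed form applies to a molecule with one acidic and one basic group; for a full protein with many ionizable side chains, libraries such as Biopython instead search numerically for the pH at which the net charge is zero.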

Instability Index Calculation

\[ II = \left( \frac{10}{L} \right) \sum_{i=1}^{L-1} DIWV(X_i, X_{i+1}) \]

where \(L\) is the length of the sequence, and \(DIWV(X_i, X_{i+1})\) is the dipeptide instability weight value for a pair of amino acids.

Aromaticity Calculation

\[ Aromaticity = \frac{(Y + W + F)}{L} \]

where Y, W, and F are the numbers of tyrosine, tryptophan, and phenylalanine residues, respectively, and L is the length of the sequence.

These calculations help in understanding and predicting the various properties of the generated spider silk sequences, ensuring that they meet the desired physical characteristics.
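
Two of these formulas are simple enough to sketch directly in Python. This is an illustration of the equations, not the GPSS implementation: the mass table holds approximate average masses of the free amino acids, and the optional water correction (one water molecule lost per peptide bond) is an addition that the molecular weight formula above does not include.

```python
from collections import Counter

# Approximate average masses (g/mol) of the 20 free amino acids.
AA_MASS = {
    "G": 75.07, "A": 89.09, "S": 105.09, "P": 115.13, "V": 117.15,
    "T": 119.12, "C": 121.16, "L": 131.17, "I": 131.17, "N": 132.12,
    "D": 133.10, "Q": 146.15, "K": 146.19, "E": 147.13, "M": 149.21,
    "H": 155.16, "F": 165.19, "R": 174.20, "Y": 181.19, "W": 204.23,
}
WATER_MASS = 18.015


def molecular_weight(sequence, subtract_water=False):
    """MW = sum(MW_i * n_i); optionally remove one water per peptide bond."""
    counts = Counter(sequence)  # n_i for each amino acid i
    total = sum(AA_MASS[aa] * n for aa, n in counts.items())
    if subtract_water and sequence:
        total -= (len(sequence) - 1) * WATER_MASS
    return total


def aromaticity(sequence):
    """Fraction of residues that are Tyr (Y), Trp (W) or Phe (F)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(sequence.count(aa) for aa in "YWF") / len(sequence)
```

The instability index follows the same pattern but needs the full table of dipeptide weights, so it is omitted here. In practice, Biopython's `ProteinAnalysis` class in `Bio.SeqUtils.ProtParam` provides `molecular_weight()`, `aromaticity()`, `instability_index()` and `isoelectric_point()`.
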

How It Works

We have developed an AI model that predicts customized spider silk protein sequences to fulfill users' needs. However, considering that our target audience consists of individuals with a background in biology or other non-AI fields, we decided to simplify the user experience. To achieve this, we developed an intuitive and user-friendly front-end interface and connected it to the AI model through an API. An API is like a waiter in a restaurant. After receiving the customer's order, the waiter sends it to the kitchen, and once the meal is prepared, the waiter delivers it back to the customer. Similarly, our interface, upon receiving the user's input of the four physical property parameters, sends them to the GPSS model via the API. The model processes the parameters and returns the results through the API to be displayed on the front end. In this way, users can easily obtain the desired information without having to navigate complex programming interfaces or modify code.

This allows users to easily interact with our AI model through simple operations, without requiring extensive technical knowledge. Our goal is to make cutting-edge technology accessible to more specialized fields.

We use four physical properties related to spider silk (Toughness, Elastic Modulus, Tensile Strength, Strain at Break) as input parameters, with their strength levels categorized from 1 to 10 (1 being the lowest, 10 the highest). Users can adjust the parameters they need by dragging the scroll bar, and after clicking the "Generate" button, they can expect to wait around 30 seconds (the waiting time may vary depending on the computational capacity of the device running GPSS). Once the process is complete, the corresponding spider silk protein sequence is generated and shown in the result area.

When the user clicks "Generate", the interface sends a POST request to the "getsequence" API (part of the code is shown below). The parameters of the POST request are passed to the backend, which triggers the GPSS model. After execution, the GPSS model generates a .txt file containing the spider silk protein sequence. The API reads the sequence from the file and returns it to be displayed on the front-end interface. (Note: POST (the HTTP POST method) is used to send data to the server, such as form submissions or API requests that modify or process the submitted data, triggering specific server-side operations.)


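    import json
    import subprocess

    from flask import Flask, request  # assumed: request.get_json() is Flask's API

    app = Flask(__name__)

    # The decorator and route path are assumed; the original excerpt starts at the def.
    @app.route('/getsequence', methods=['POST'])
    def getsequence():
        # Read the four property levels (1-10) from the JSON body of the POST request.
        data = request.get_json()
        toughness = data.get('toughness', 0)
        elastic_modulus = data.get('elastic_modulus', 0)
        tensile_strength = data.get('tensile_strength', 0)
        strain_at_break = data.get('strain_at_break', 0)

        # Run the generation script, passing the parameters as JSON on stdin.
        result = subprocess.run(
            ['python', 'genseq.py'],
            input=json.dumps(data).encode('utf-8'),
        )
        # (The excerpt ends here; the API then reads the generated .txt file
        # and returns the sequence to the front end.)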
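
For readers who want to call the API without the front end, the request can be sketched with Python's standard library. The key names match the ones read by `getsequence` above; the URL, the port, and the helper name are assumptions, and the 1 to 10 range check mirrors the slider limits described earlier.

```python
import json
import urllib.request

PROPERTIES = ("toughness", "elastic_modulus", "tensile_strength", "strain_at_break")


def build_request(levels, url="http://localhost:5000/getsequence"):
    """Build (but do not send) the POST request for the four property levels.

    The default URL is a placeholder for wherever the backend is running."""
    for name in PROPERTIES:
        if not 1 <= levels[name] <= 10:
            raise ValueError(name + " must be between 1 and 10")
    body = json.dumps({name: levels[name] for name in PROPERTIES}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request({"toughness": 9, "elastic_modulus": 9,
                         "tensile_strength": 9, "strain_at_break": 1})
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it once the backend is running.
```

Setting the `Content-Type` header to `application/json` matters because Flask's `request.get_json()` only parses the body as JSON when that header is present.
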

Below are demonstrations of our software in action:

Note: This GIF demonstrates how to use the GPSS tool to generate custom protein sequences based on desired mechanical properties.

In this example, with the parameters for Toughness, Elastic Modulus, Tensile Strength, and Strain at Break set at (9, 9, 9, 1) respectively, the predicted spider silk sequence is expected to exhibit strong mechanical properties.

If you would like to learn more about GPSS or run it yourself, please visit our Software Tool Repository. It contains the GPSS-related code and a comprehensive guide to running our software tool.

Future Outlook

Although the current implementation of GPSS is highly effective in generating and analyzing spider silk protein sequences, we acknowledge certain limitations primarily related to computational resources. Training large-scale language models (LLMs) like GPT from scratch or running highly complex computations for extensive datasets requires significant computational power, which we currently lack.

In the future, we hope to overcome these limitations by gaining access to more robust computational resources. With such capabilities, we aim to train our own LLMs that are specifically tailored to spider silk protein sequences, enabling us to further optimize the generation and evaluation process. Additionally, this would allow us to perform more complex analyses, ultimately leading to more refined and accurate predictions of protein properties.

By building our own models, we can fully control the fine-tuning process, ensuring that the models are specifically designed for our bioinformatics applications. This will further enhance the functionality of GPSS and push the boundaries of what is possible in protein design and synthetic biology.

Conclusion

In conclusion, the GPSS software tool represents a significant advancement in the field of synthetic biology and bioinformatics, specifically in the design and optimization of spider silk proteins. By integrating sequence generation, property evaluation, and structure prediction into a single, streamlined platform, this tool addresses several key challenges faced by researchers.

The comprehensive workflow of GPSS ensures that researchers can efficiently generate spider silk sequences with desired physical properties, validate these sequences through reverse task modeling, and predict their structures using advanced models such as OmegaFold. The integration of multiple property evaluation libraries further enhances the tool's capability to provide detailed insights into the stability, functionality, and potential applications of the designed proteins.

Looking ahead, we are excited about the possibility of expanding GPSS's capabilities by training our own large-scale models and utilizing more advanced computational techniques. This will not only improve the accuracy and efficiency of protein sequence generation and analysis but also open up new avenues for research in synthetic biology.

Overall, GPSS not only simplifies the process of protein design and optimization but also reduces the time and effort required for these tasks. This tool is intended as a practical resource for scientists and researchers working on the understanding and application of spider silk proteins. As the field of synthetic biology continues to evolve, tools like GPSS will play a crucial role in driving innovation and discovery.

Generative Personalized Spider Silk (GPSS)