Model

Introduction

We developed an AI agent that integrates iGEM parts information using a Retrieval-Augmented Generation (RAG) system built on a Large Language Model (LLM). It reorganizes how parts information is managed and retrieved, replacing the tedious and time-consuming literature searches and data lookups of the past. The system provides a ChatGPT-like question-and-answer interface, where users can enter queries in a chat window and ask for specific information about related biological parts.

RAG Database

Introduction

Retrieval-augmented generation (RAG) is a technique that combines information retrieval with a generative model. When a developer, our team, or a user uploads a PDF of a paper, the loaded document is converted into plain text, and the text is divided into multiple chunks by a text splitter. Each chunk is fed into an embedding model to generate a corresponding vector representation, and these vectors are stored in a vector store. When a user enters a prompt, it is first embedded into a query vector; the similarity between the query vector and the chunk vectors in the store is then computed to find the chunks most relevant to the query. Next, the relevant chunks are inserted into a prompt template, which produces a prompt containing those chunks that is passed to a Large Language Model (LLM), from which the final answer is generated. This whole process enhances the accuracy and relevance of the generated content [1].

RAG Model Construction

Figure 1: Overview of the RAG model construction, showing the flow from data retrieval to final response generation

1. Data Preparation

Data is the foundation of any model. Our first step was to organize knowledge about biological parts and synthetic biology. We used a Python crawler to collect parts information from the iGEM Parts Registry and from each team's wiki, including sequence information, experimental data, and parameters. In addition, we gathered other synthetic biology material, such as introductory books and newly published papers found through Google Scholar and PubMed, as well as information on synthetic biology companies and the wider industry.
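As an illustration of this step, the sketch below shows how such a crawler might fetch the text of a single Registry part page. The part name, URL handling, and HTML parsing here are simplified assumptions rather than our production crawler.

```python
# Minimal sketch of fetching one iGEM Registry part page with requests and
# BeautifulSoup; a real crawler needs selectors matched to each page layout.
import requests
from bs4 import BeautifulSoup

def fetch_part_page(part_name: str) -> str:
    """Download the Registry page for one part and return its visible text."""
    url = f"https://parts.igem.org/Part:{part_name}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content elements
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    # BBa_B0034 (a well-known ribosome binding site) is used purely as an example.
    print(fetch_part_page("BBa_B0034")[:500])
```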

As we mentioned earlier, by combining RAG and large language model, we hope to give all the participants and researchers more timely and professional answers while quickly giving enterprises and investors information they need. Therefore, data collection never stops from the beginning to the end of the project and will continue after the competition.

2. Question Vectorization and Vector Database Construction

We chose LangChain to build the basic framework for managing and running the RAG system. LangChain is a framework for large language model applications that provides a rich set of tools and interfaces to manage, optimize, and extend the capabilities of language models. In our project, the LangChain framework implements the RAG system by managing document loading, text processing, and vector generation; at query time it retrieves the relevant document chunks and combines them with the generative model to produce the final answer.

The user's input is transformed into a numerical format, typically a vector, that can be processed by computational models. This vectorization allows the input to be quantitatively analyzed and compared against other data. The knowledge base document is also segmented into smaller, manageable parts, often referred to as chunks. Then they are vectorized to build the vector database.
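A condensed sketch of this step using LangChain is shown below. The specific import paths, the FAISS vector store, and the sentence-transformers embedding model are illustrative choices; exact module names vary between LangChain versions.

```python
# Sketch of building the vector database with LangChain (import paths differ
# between LangChain releases; this follows the langchain_community layout).
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Load a paper PDF and convert it to text documents.
docs = PyPDFLoader("example_paper.pdf").load()

# 2. Split the text into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3. Embed every chunk and store the vectors in a vector store (FAISS here).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 4. At query time, the user's prompt is embedded the same way and matched
#    against the stored chunk vectors.
results = vectorstore.similarity_search("What promoter does this part use?", k=4)
for doc in results:
    print(doc.page_content[:200])
```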

3. Alignment to Database

The prompt vector is aligned (compared) with the vector database, which contains the vectorized chunks of the knowledge base documents. This alignment is crucial for identifying the parts of the knowledge base most relevant to the user's query. The system then retrieves from the vector database the entries that most closely match the prompt vector, typically using a similarity measure such as cosine similarity. The top k results are returned, where k is a predefined number; these are the chunks of the knowledge base that best match the user's query in vector space.
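To make the retrieval step concrete, here is a minimal NumPy sketch of brute-force cosine-similarity top-k search over chunk vectors; production vector stores replace this loop with optimized indexes such as FAISS.

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k chunk vectors most similar to the query.

    query_vec has shape (d,); chunk_vecs has shape (n_chunks, d).
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                       # shape (n_chunks,)
    return np.argsort(scores)[::-1][:k]  # indices of the highest scores

# Toy example with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
chunks = rng.normal(size=(100, 384))
query = rng.normal(size=384)
print(top_k_chunks(query, chunks, k=3))
```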

4. Combine the Prompts and Search Results

Finally, the retrieved results are combined with the original user prompt to generate responses that are not only relevant but also contextually enriched. The prompt passed to the model typically interleaves AI-generated content with human-written messages, so the final output provides a comprehensive response to the user's initial query.
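The sketch below shows one way this combination step can look in code; the template wording is an illustrative assumption rather than our exact production prompt.

```python
# Sketch: stuff the retrieved chunks and the user's question into one prompt.
def build_prompt(user_question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "You are an assistant for synthetic biology and iGEM parts.\n"
        "Answer the question using only the context below and cite the chunk numbers.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the function of BBa_B0034?",
    ["BBa_B0034 is a ribosome binding site ...", "Measured strength relative to ..."],
)
# `prompt` is then sent to the LLM (e.g., our fine-tuned Llama 3.1 model)
# to generate the final answer.
```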

5. Design UI Webpage

We designed a user-friendly interactive interface so that users can easily query information and interact with the system. On the web side, after a user enters a prompt, they receive more reliable answers in the field of synthetic biology, because a large amount of vectorized expert knowledge is stored in the RAG database.
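As a rough illustration, a chat webpage wrapping the pipeline could be wired up with a framework such as Gradio, as sketched below; our actual frontend implementation is not shown here and may differ.

```python
# Hypothetical sketch of a chat webpage wrapping the RAG pipeline with Gradio.
import gradio as gr

def answer(message, history):
    # In the real system: embed `message`, retrieve top-k chunks from the
    # vector store, build the prompt, and call the LLM.
    retrieved = ["...retrieved chunk text..."]
    return f"(answer generated from {len(retrieved)} retrieved chunk(s))"

demo = gr.ChatInterface(fn=answer, title="iGEM Parts Assistant")
demo.launch()
```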

6. Offline Knowledge Base Package

Many researchers worry about their data property rights and privacy, which is why many research institutions use intranets to share data. In addition, we want to share what we built with others, such as interested secondary developers. Taking these situations into account, we made a package that can be used offline. With this package, users can create their own private knowledge bases, decide freely on their content without worrying about problems like data leakage, and modify the open-source software package according to their individual needs.

Large Language Model

Introduction

Large Language Models (LLMs) are broad models of language, vision, speech, and other modalities that can support a wide range of AI activities. They are the foundation of many modern AI systems [2].

Modern LLMs are developed in two stages:

  1. Pre-training (foundation model): The foundation model is trained on simple tasks such as next-word prediction or captioning. In doing so, the model learns the structure of language and absorbs large amounts of knowledge about the world from the text it is "reading".
  2. Post-training (fine-tuned model): The model is then tuned to improve specific capabilities such as coding and reasoning. The pre-trained language model has a rich understanding of language, but it does not yet follow instructions or behave the way we would expect an assistant to. It is therefore aligned with human feedback over several rounds, each involving supervised fine-tuning (SFT) on instruction-tuning data followed by Direct Preference Optimization (DPO).

Foundation Model - Llama 3.1

Model Construction

Llama 3 is a Transformer language model trained to predict the next token of a textual sequence [3]. Separate encoders for images and speech can be trained on top of it: the image encoder is trained on large amounts of image-text pairs, which teaches the model how to relate visual content to natural-language descriptions, while the speech encoder is trained with a self-supervised method that masks out portions of the speech input and attempts to reconstruct them through a discrete-token representation.

Figure 2: The self-attention mechanism of Llama 3

Interpretation of self-attention model structure

The core formula of the key-value pair Attention is shown below. Let's try to understand the formula.

Attention(Q, K, V) = softmax(QK^T / √d_k) · V, where d_k is the dimension of the key vectors.

Let's first recall the concept of the inner product. The inner product characterises the angle between two vectors and the projection of one vector onto the other; in other words, it measures the similarity between one row vector and another.

The matrix of these inner products (QK^T) is a square matrix. Read in terms of row vectors, its (i, j) entry holds the inner product of the i-th row vector with the j-th one, i.e., the similarity of each vector with itself and with every other vector [4].

The role of softmax is normalisation:

softmax(x_i) = exp(x_i) / Σ_j exp(x_j)

These normalised numbers sum to 1. So what is the core mechanism of attention? It is simply a weighted sum of the value vectors. And where do the weights come from? They are exactly these normalised similarity scores.

Figure 3: Illustration of the Query (Q), Key (K), and Value (V) matrices in attention mechanisms

Q (Query), K (Key), and V (Value) matrices

The so-called Q, K, and V matrices (the query, key, and value vectors) mentioned throughout this article are each derived as the product of X with a weight matrix, which is essentially a linear transformation of X. So why not use X directly instead of transforming it linearly? Because the trainable matrices W improve the fit of the model, acting as an adjustable buffer between the raw input and the attention computation.

Figure 4: Illustration of the Query (Q), Key (K), and Value (V) matrices obtained by linear embedding
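Putting these pieces together, the following NumPy sketch implements single-head scaled dot-product attention, with randomly initialised projection matrices standing in for the trained weights W_Q, W_K, and W_V.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X: np.ndarray, W_q: np.ndarray, W_k: np.ndarray, W_v: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # linear transformations of X
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # pairwise similarities (inner products)
    weights = softmax(scores, axis=-1)           # each row of weights sums to 1
    return weights @ V                           # weighted sum of the value vectors

# Toy example: 4 tokens with embedding dimension 8.
rng = np.random.default_rng(42)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)    # (4, 8)
```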

Model Parameters

The Llama 3.1 8B model configuration is defined by the following key hyperparameters [5]:

  • Layers: 32
  • Model dimension: 4,096
  • Feed-Forward Network (FFN) dimension: 14,336
  • Attention heads: 32
  • Key/value heads: 8
  • Peak learning rate: 3×10^−4
  • Activation function: SwiGLU
  • Vocabulary size: 128,000
  • Positional encoding: Rotary Position Embedding (RoPE) with θ = 500,000

The depth of 32 layers and the large FFN dimension allow complex transformations within the model, the 32 attention heads let the model focus on different parts of the input sequence, and RoPE improves the handling of sequence positions and long contexts.

Why Llama 3.1

Data, model scale, and complexity management are the key factors to optimize in a foundation model. The Llama 3 herd of models supports multilingualism, coding, reasoning, and tool use [6]. The largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. We use Llama 3.1 8B Instruct as the foundation model for our project [7]. Here is an overview of the Llama 3 model structure:

Figure 5: Overview of the Llama 3 herd of models
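For reference, here is a minimal sketch of loading this model with the Hugging Face transformers library; the repository id and generation settings are assumptions (access to the gated Meta Llama repository must be granted first).

```python
# Sketch: loading Llama 3.1 8B Instruct with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "You are an assistant for synthetic biology."},
    {"role": "user", "content": "What is a ribosome binding site?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```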

Fine-tuning Model - LoRA

Model Construction

In the Transformer architecture, each layer usually contains multiple sublayers, such as a self-attention layer and a feed-forward network (FFN) layer. "Injecting trainable rank decomposition matrices at each layer" refers to an optimization technique, or model adaptation strategy, called LoRA (Low-Rank Adaptation), which is used for fine-tuning or online updating of large pre-trained Transformer models.

Figure 6: Overview of the LoRA fine-tuning process using low-rank matrices

  1. Data Preparation
    Fine-tuning is an essential part of model training after RAG; its aim is to make the model perform better in particular domains. The necessary preparation material is a fine-tuning dataset for the model to learn from. As in the earlier steps, we used a Python crawler to collect all the information about protocols, experiments, and models from team wiki pages. All of this is plain text, so we need to adjust it before it can serve as a fine-tuning dataset. Our team members read the text manually and sorted it into JSON-format files of context, prompt, and answer (see the sketch after this list). For example, if the original text on a team's wiki Model page states that the team used simple empirical models to predict a parameter, we can create a question-answer pair with the question "What model did the team use to predict parameter A?" and the answer "The team used simple empirical models." [8]

    This prompt-answer format teaches the model how to organize its language when it encounters similar contexts and user questions. After fine-tuning, the model becomes specialized: it is applicable to parts, models, and experiments in synthetic biology and is better suited to specific needs or research problems within this field than a generic model.

  2. LoRA Model Construction
    The basic idea of the LoRA method is that, instead of directly modifying the original dense weight matrix of the pre-trained model during the adaptation phase, additional low-rank matrices are introduced to represent the portion of the weight change [9]; a code sketch is given after this list.

    1. Rank Decomposition:
      For an originally high-dimensional weight matrix W, the change ΔW can be approximated as the product of two low-rank matrices U and V, ΔW ≈ U · V^T, where the inner dimension (the rank) of U and V is much smaller than the dimensions of the original matrix W. This reduces the number of trainable parameters by orders of magnitude.
    2. Injecting Low-Rank Matrices:
      In some layers of the Transformer (e.g., the weight matrices in the FFN layer or the attention mechanism), the original weight matrices are not updated directly; instead, the contribution of the two low-rank matrices U and V is added to the frozen weights during inference or fine-tuning, so the output is affected indirectly.
    3. Keeping the Pre-training Weights Unchanged:
      A key advantage of this approach is that it allows us to personalize the model or quickly adapt it to a new task without changing the parameters of the pre-trained model, thus preserving the generic linguistic features captured by the pre-trained model and significantly reducing storage and computational requirements due to the use of low-rank matrices.

    Injecting trainable rank decomposition matrices at each layer of the Transformer therefore provides an efficient and concise way to express changes in the model parameters without destroying the original model structure or its pre-trained knowledge, enabling lightweight fine-tuning of the model and effective adaptation to different downstream tasks.
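To illustrate the dataset format from step 1, here is a hedged example of writing a single context/prompt/answer record; storing one JSON object per line (JSONL) is a common convention and not necessarily our exact file layout.

```python
# One record of the fine-tuning dataset in the context/prompt/answer format
# described in step 1 (the wiki text mirrors the example given above).
import json

record = {
    "context": "The team used simple empirical models to predict parameter A.",
    "prompt": "What model did the team use to predict parameter A?",
    "answer": "The team used simple empirical models.",
}

with open("finetune_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```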
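And as a sketch of step 2, the snippet below attaches LoRA adapters to the foundation model using the Hugging Face peft library; the rank, scaling factor, and target modules are illustrative assumptions rather than our exact training configuration.

```python
# Sketch: wrapping the frozen Llama 3.1 base model with LoRA adapters (peft).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # rank of the decomposition
    lora_alpha=32,                        # scaling factor for the low-rank update
    target_modules=["q_proj", "v_proj"],  # inject into the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable
# The wrapped model can then be fine-tuned on the JSONL dataset above with a
# standard training loop; the pre-trained weights stay frozen.
```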

Loss function for Llama 3.1 - cross entropy

Cross-entropy loss is a widely used loss function in natural language processing models like Llama 3. In the context of language models, it measures the discrepancy between the predicted probability distribution over tokens and the true distribution (i.e., the actual word or token in the sequence).

In Llama 3, the cross-entropy loss function is employed during both the pre-training and fine-tuning stages to optimize the model. It guides the model by penalizing incorrect predictions and rewarding correct ones, allowing the model to gradually improve its predictions [10].

The formula for cross-entropy loss in the context of multi-class classification is defined as:

L = − Σ_{i=1}^{N} y_i · log(p_i)

Where:

  • N is the total number of classes (or tokens in the vocabulary for language models like Llama 3),
  • y_i is a binary indicator (0 or 1) of whether class i (the true token) is the correct classification for the current observation,
  • p_i is the predicted probability of the model for class i.

In this formula, we take the logarithm of the predicted probability p_i to emphasize the penalty for making incorrect predictions. The true label y_i is 1 for the correct class and 0 for all other classes, meaning the model will only be penalized based on its predicted probability for the correct class.
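As a worked example, the snippet below computes this loss for a single next-token prediction over a toy vocabulary; in practice, frameworks such as PyTorch compute it directly from the model's logits.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, true_index: int) -> float:
    """Cross-entropy loss for one prediction over a vocabulary of size N."""
    # Softmax turns the logits into a probability distribution p.
    logits = logits - logits.max()
    p = np.exp(logits) / np.exp(logits).sum()
    # y_i is 1 only for the true token, so the sum reduces to -log(p_true).
    return float(-np.log(p[true_index]))

# Toy vocabulary of 5 tokens; the true next token has index 2.
logits = np.array([1.0, 0.5, 3.0, -1.0, 0.0])
print(cross_entropy(logits, true_index=2))  # small loss: index 2 has the highest logit
```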

References

[1]: Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv preprint arXiv:2005.11401. https://arxiv.org/abs/2005.11401

[2]: Vaswani, A., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762

[3]: Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971

[4]: Brown, T., et al. (2020). "Language Models are Few-Shot Learners." arXiv preprint arXiv:2005.14165. https://arxiv.org/abs/2005.14165

[5]: Radford, A., et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[6]: Raffel, C., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 21(140), 1-67. https://jmlr.org/papers/v21/20-074.html

[7]: Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805

[8]: Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106.09685

[9]: Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.

[10]: Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv preprint arXiv:1301.3781. https://arxiv.org/abs/1301.3781