Engineering Cycle

Overview

To address the challenges of filtering information in the iGEM parts library and integrating knowledge in the synthetic biology field, we propose iterative updates to our model in the following areas:

  • Collecting comprehensive and high-quality data
  • Enhancing the model's search and generation performance
  • Designing user-friendly software interfaces

High-Quality Data Collection

Building a database and organizing training data for the model is a critical but complex task. To progressively enhance the model's understanding and response quality, we have designed several engineering cycles to ensure that the data undergoes multiple rounds of screening and iterative updates.

Cycle 1: Insufficient data volume for model training

Design:
To address the issues of information search and integration, we considered using RAG (Retrieval-Augmented Generation) technology. To validate our initial concept, we collected some basic data to test our prototype system.

Build:
The RAG system we tested could read structured documents (formatted as JSON arrays with annotations) and unstructured documents (such as Word, PPT, and PDF files). We therefore collected data from iGEM wiki pages, the Parts database, and high-quality literature, extracted the portion that met our criteria, and manually formatted it into the required input format.
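
As an illustration, a single manually formatted entry looked roughly like the sketch below; the field names and the example part are assumptions for illustration, not the exact schema we used.

    import json

    # A simplified sketch of one manually formatted entry; the actual field
    # names and schema used in our prototype may differ (assumption).
    entry = {
        "source": "iGEM Parts database",            # where the text came from
        "title": "BBa_K123456 promoter part",       # hypothetical part name
        "summary": "A constitutive promoter characterized by ...",
        "keywords": ["promoter", "constitutive", "E. coli"],
        "content": "Full cleaned text extracted from the wiki or Parts page ..."
    }

    # Entries were stored together as a JSON array for the RAG system to load.
    with open("structured_data.json", "w", encoding="utf-8") as f:
        json.dump([entry], f, ensure_ascii=False, indent=2)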

Test:
With a relatively small amount of data, the model was able to recall the knowledge we searched for, but the answers were of low quality and insufficient accuracy. In addition, information from certain teams sometimes existed in the database but could not be retrieved.

Learn:
We reviewed the platform's performance and determined that, after appropriate optimization, this technical route could potentially achieve our goal: quickly querying search keywords to obtain high-quality answers. However, the data quality and scale were far from sufficient to meet the needs of answering most questions in synthetic biology.

Cycle 2: Automated web crawler to replace manual data collection

Design:
Manually collecting data at scale is time-consuming, so we needed an automated program to quickly gather large amounts of data from wikis and the Parts database. This was aimed at addressing both the insufficient data volume and the limited coverage of relevant areas.

Build:
Using the Python bs4 (BeautifulSoup) package, we built a web crawler that recursively traversed the server's web directories and downloaded the source code of every page. After stripping the HTML tags around the content, we obtained the plain text of each webpage.
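
A minimal sketch of this kind of crawler is shown below. It assumes the pages expose links as ordinary <a> tags; the start URL, output directory, and depth limit are placeholders rather than our actual settings.

    import os
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = "https://2024.igem.wiki/example-team/"  # placeholder start page
    OUT_DIR = "crawled_text"
    os.makedirs(OUT_DIR, exist_ok=True)

    visited = set()

    def crawl(url, depth=0, max_depth=3):
        """Recursively follow links on the same host and save each page's text."""
        if url in visited or depth > max_depth:
            return
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            return
        soup = BeautifulSoup(resp.text, "html.parser")

        # Strip the HTML tags and keep only the visible text of the page.
        text = soup.get_text(separator="\n", strip=True)
        fname = urlparse(url).path.strip("/").replace("/", "_") or "index"
        with open(os.path.join(OUT_DIR, fname + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)

        # Follow links that stay on the same host.
        for a in soup.find_all("a", href=True):
            nxt = urljoin(url, a["href"])
            if urlparse(nxt).netloc == urlparse(START_URL).netloc:
                crawl(nxt, depth + 1, max_depth)

    crawl(START_URL)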

Test:
We captured over 100,000 text entries, including data from wikis, Parts pages, literature, reports, and books.

Learn:
Although we obtained a large amount of data, after filtering we found that the usable high-quality portion was limited. A likely reason was that the crawler indiscriminately captured many index pages, so many of the text files lacked real content and contained only short, unstructured snippets. In addition, because the crawler simply traversed the file directories, different types of data were mixed together, which made subsequent processing inconvenient.

Cycle 3: Structured document processing to enhance data quality

Design:
Based on software development needs, the captured texts needed to be filtered and organized into structured documents. A new function was required to format the text captured by the crawler.

Build:
We used Python's Natural Language Toolkit (nltk) package to summarize documents, extract keywords, generate tags, and format each paragraph summary as JSON.
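
A simplified sketch of this step, using a basic frequency-based extractive approach with nltk, is shown below; our actual summarization and tagging logic may have differed in detail, and the input path is a placeholder.

    import json
    from collections import Counter

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("punkt_tab", quiet=True)  # needed by newer nltk versions
    nltk.download("stopwords", quiet=True)

    def summarize(text, n_sentences=3, n_keywords=5):
        """Frequency-based extractive summary plus keyword list for one document."""
        stops = set(stopwords.words("english"))
        words = [w.lower() for w in word_tokenize(text)
                 if w.isalpha() and w.lower() not in stops]
        freq = Counter(words)

        # Score each sentence by the frequency of its content words.
        sentences = sent_tokenize(text)
        scored = sorted(sentences,
                        key=lambda s: sum(freq.get(w.lower(), 0) for w in word_tokenize(s)),
                        reverse=True)

        return {
            "summary": " ".join(scored[:n_sentences]),
            "keywords": [w for w, _ in freq.most_common(n_keywords)],
            "content": text,
        }

    with open("page.txt", encoding="utf-8") as f:
        doc = summarize(f.read())
    print(json.dumps(doc, ensure_ascii=False, indent=2))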

Test:
We obtained tens of thousands of structured documents with summaries, keywords, and other basic information.

Learn:
This step ensured the quality of the data fed into the RAG system, improving both the knowledge recall rate and the quality of the generated answers.

Cycle 4: Data collection and filtering workflow

Design:
To generate structured documents with summaries in bulk from the texts captured by the crawler, and at the same time create datasets for training the generation side of the underlying large language model, we needed a workflow covering data collection, processing, classification, and generation that could handle complex, large-scale data requirements.

Build:
We built a complete data processing workflow using Python, allowing us to automate data capture, processing, summary generation, and the creation of structured documents and training question-answer pairs. We replaced the summary generation function of the nltk package with an LLM API, utilizing a high-performance large language model to ensure the quality of the generated data. This greatly improved the efficiency of data collection and processing. Additionally, since the API was flexible, the final results could be easily tailored to specific tasks by modifying prompts. For more details, see gitlab.igem.org/2024/xjtlu-software/software.
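
The core idea of the API-based step can be sketched as follows. The model name, prompt wording, and use of an OpenAI-compatible client are assumptions for illustration; the actual workflow in the GitLab repository above differs in detail.

    import json

    from openai import OpenAI  # any OpenAI-compatible client works here (assumption)

    client = OpenAI()  # reads the API key from the environment

    def summarize_and_generate_qa(text, n_pairs=3):
        """Ask an LLM for a summary, keywords, and training question-answer pairs."""
        prompt = (
            "You are preparing training data for a synthetic biology assistant.\n"
            "1. Summarize the text below in 2-3 sentences.\n"
            "2. List 5 keywords.\n"
            f"3. Write {n_pairs} question-answer pairs grounded only in the text.\n"
            "Return strict JSON with keys: summary, keywords, qa_pairs.\n\n"
            f"Text:\n{text}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)

Because the task is expressed entirely in the prompt, swapping in a different output format or target task only requires editing the prompt string, which is what made the API-based workflow more flexible than the nltk-based one.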

Test:
We collected 20,000 high-quality structured texts and generated 60,000 question-answer pairs for training the model.

Learn:
The workflow ensured high-quality and efficient data acquisition and also provided a practical tool for future teams interested in further developing RAG and fine-tuning large language models on the generation side.

Enhancing Model Performance for Search and Answer Generation

Our software design consists of two critical components: database search performance and model answer generation performance. These two components significantly impact the software's search functionality and user experience. To incrementally improve the software's overall performance, we also designed several engineering cycles based on user feedback from the HP group and actual needs to iteratively enhance our software.

Cycle 1: Model selection and comparison

Design:
To achieve fast search and answer generation for specific domain knowledge, we believed that the currently popular RAG technology was particularly well-suited to solve such problems. During the early proof-of-concept phase of the project, we chose a one-click deployment technology stack for brief testing.

Build:
We used the ChatGLM large language model open platform developed by the Zhipu AI team for testing, inputting a small amount of manually organized data and asking targeted questions to verify whether our idea was feasible.

Test:
The model was able to generate the results we needed, searching for keywords and generating answers based on document summaries.

       
Figure 1: Principle test on Zhipu AI

Learn:
This iteration demonstrated that our hypothesis was generally feasible, and the native model could already understand the data to a certain extent, producing outputs with reference value. However, we observed that the AI had insufficient understanding of some detailed issues, such as confusing different years for the same team and occasionally fabricating information that did not exist in the database.

Cycle 2: AI system's response quality optimization

Design:
To explore a solution for AI confusion and hallucination problems, we continued testing and improving the system based on the previous online platform. We chose to focus on improving and optimizing the system at the prompt level.

Build:
We optimized the data structure, adopting a more reasonable chunking method for unstructured documents to avoid breaking text mid-sentence and interrupting its semantics. Additionally, we introduced prompt engineering techniques such as Chain of Thought (CoT) and worked examples (few-shot prompting) into the prompts, so that the model did not have to answer purely zero-shot, thereby improving the quality of its responses.
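
As a rough illustration of the prompt-level changes, a system prompt combining a worked example (few-shot) with an explicit step-by-step reasoning instruction might look like the sketch below; the wording and the example part are placeholders, not our exact prompt.

    SYSTEM_PROMPT = """You are an assistant for iGEM and synthetic biology questions.
    Answer ONLY from the retrieved documents below. If the answer is not in the
    documents, say "I could not find this in the knowledge base."

    Think step by step before answering:
    1. Identify which retrieved chunks are relevant.
    2. Extract the facts (team, year, part ID) from those chunks.
    3. Compose a concise answer citing the source chunk.

    Example:
    Question: Which promoter did Team X use in 2023?
    Relevant chunk: "Team X (2023) built their sensor on the BBa_J23100 promoter."
    Answer: Team X (2023) used the constitutive promoter BBa_J23100. [source: chunk 2]
    """

    def build_prompt(question, retrieved_chunks):
        """Combine the system prompt, retrieved context, and the user question."""
        context = "\n\n".join(f"[chunk {i+1}] {c}" for i, c in enumerate(retrieved_chunks))
        return f"{SYSTEM_PROMPT}\n\nRetrieved documents:\n{context}\n\nQuestion: {question}"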

Test:
After improving the system prompts, we achieved noticeably better knowledge recall. Specifically, as shown in the figure, the model correctly returned answers to our queries, and when asked a question that was not covered by the input data, it honestly acknowledged this limitation in its response.

Learn:
From this test, we discovered that for models with a sufficiently large parameter count, the system prompts could significantly influence the model's output. Techniques like Chain of Thought (CoT) and few-shot learning effectively leveraged the model's reasoning capabilities, helping us optimize its performance.

Cycle 3: Model fine-tuning to acquire domain knowledge

(Refer to model page for detailed information)

Design:
To address the pretrained model's limited understanding of specific domain knowledge, we consulted with our advisor and relevant experts. Their recommendation was to attempt model fine-tuning. We decided to retrain the model using targeted question-answer datasets to improve its performance. The resulting fine-tuned model was named "ChatParts".

Build:
We selected Meta's recently released open-source large language model, Llama 3.1 8B Instruct, for its strong overall performance among current pretrained models. We chose the 8-billion-parameter version because, after quantization to int4 or int8, it can run on consumer-grade GPUs with 8 GB of VRAM or more, allowing local inference. For training we used Llama Factory, an open-source fine-tuning project from GitHub, in supervised fine-tuning mode with LoRA (Low-Rank Adaptation), which freezes the pretrained weights and trains only small low-rank adapter matrices. This significantly reduced training time and computational resources while keeping the trained weights small. The training data consisted of the large number of question-answer pairs collected through the data processing workflow described earlier.
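
We ran the training through Llama Factory's supervised fine-tuning pipeline rather than writing the loop by hand. As a hedged illustration of the same LoRA idea, an equivalent setup using Hugging Face transformers and peft might look roughly like the sketch below; the hyperparameters, paths, and target modules are placeholders and this is not our actual configuration.

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    BASE = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(BASE)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")

    # LoRA freezes the pretrained weights and trains only small low-rank adapter
    # matrices attached to the attention projections; r and alpha are placeholders.
    lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    def tokenize(example):
        text = f"Question: {example['question']}\nAnswer: {example['answer']}"
        return tokenizer(text, truncation=True, max_length=1024)

    dataset = load_dataset("json", data_files="qa_pairs.json")["train"].map(tokenize)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="chatparts-lora",
                               per_device_train_batch_size=2,
                               num_train_epochs=3,
                               learning_rate=1e-4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained("chatparts-lora")  # only the small adapter weights are saved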

Test:
In the early stages of training, due to various constraints, we used fewer than 20,000 question-answer pairs to train the model. After training, inference validation showed a BLEU score of around 85, which fell short of our expectations. Validation also revealed that the model's hallucinations had actually increased, although it did show a noticeably better understanding of synthetic biology questions.

Learn:
This iteration indicated that fine-tuning the model to enhance its fundamental understanding is feasible. However, it also introduced new hallucination issues. In addition, the accuracy of the model's answers was insufficient, and for many questions it had already been trained on, the model showed some forgetting when asked again, failing to reproduce the learned responses.

Cycle 4: Expanding Pretraining Data with Additional Text Files

Design:
After analyzing the previous cycles, we realized that the quality of the generated answers was still not satisfactory. The model's knowledge base was limited by the scope of the existing data, leading to incomplete or inaccurate responses. To address this, we decided to augment the model's training with a broader range of text data, including continued pretraining on plain-text files to broaden the model's knowledge scope. Additionally, we increased the volume of fine-tuning data to further improve the model's precision on domain-specific questions.

Build:
We incorporated a new set of unstructured text data, including public datasets, research papers, and relevant resources in synthetic biology, to expand the model's general understanding. This data was processed and fed into the model for pretraining. On top of that, we significantly increased the amount of fine-tuning data, focusing specifically on iGEM competition materials and synthetic biology research. The fine-tuning data was carefully curated and structured to ensure it aligned with our objectives.

Test:
After retraining the model with the expanded dataset, we observed improvements in the model’s ability to recall relevant knowledge. The additional pretraining data allowed the model to provide more well-rounded answers, and the increased fine-tuning data contributed to more accurate responses to domain-specific queries.

Learn:
This iteration demonstrated that expanding both pretraining and fine-tuning data was an effective strategy for improving the model’s performance. However, we found that while the model’s knowledge recall was improved, some hallucination issues still persisted. The need for continued refinement in the data processing and filtering stages became evident, especially in ensuring the relevance and quality of the text data used for training.

Please refer to the result page for model training results.

Cycle 5: Increasing Training Steps for Llama 3.1

Design:
In this cycle, we identified that the training of the Llama 3.1 model had not yet reached full convergence, as indicated by the loss function's failure to stabilize. The model was undertrained, resulting in suboptimal performance in generating answers. To address this issue, we decided to increase the number of pretraining steps to ensure that the model could learn more effectively and reduce the loss to a satisfactory level.

Build:
We extended the pretraining phase by increasing the number of training steps for the Llama 3.1 model. This adjustment allowed the model to process more data iterations and further optimize the weights. During this process, we monitored the loss function to ensure it continued decreasing and showed signs of convergence. Additionally, we used the same expanded dataset introduced in the previous cycle, focusing on maintaining high-quality input throughout the process.
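
As a simple example of how such monitoring can be done offline, the sketch below compares the average training loss over the most recent logging window with the previous one, assuming the trainer_state.json log written by the Hugging Face Trainer (which Llama Factory builds on); the path and threshold are placeholders.

    import json

    # Compare average training loss over the last logging window with the
    # previous one to judge whether training has roughly converged.
    with open("saves/chatparts-pretrain/trainer_state.json", encoding="utf-8") as f:
        losses = [e["loss"] for e in json.load(f)["log_history"] if "loss" in e]

    window = max(1, min(50, len(losses) // 2))
    recent = sum(losses[-window:]) / window
    earlier = sum(losses[-2 * window:-window]) / window

    print(f"earlier avg loss: {earlier:.4f}, recent avg loss: {recent:.4f}")
    if earlier - recent < 0.01:  # arbitrary placeholder threshold
        print("Loss has roughly plateaued; more steps may give diminishing returns.")
    else:
        print("Loss is still decreasing; additional training steps may help.")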

Test:
After extending the training steps, we observed an improvement in the model’s loss reduction. The extended training allowed the model to reach a more stable and lower loss, which positively impacted its performance. The model became more accurate in understanding complex questions and provided more reliable answers across various test cases.

Learn:
This cycle reinforced the importance of sufficient training steps in large language model pretraining. By increasing the number of steps, we achieved better convergence and enhanced model performance. However, it also underscored the need for careful balance: while more steps improved results, overly long training without significant gains could lead to diminishing returns. Thus, monitoring the loss during training became essential to determine the optimal stopping point.

       
Figure 2: Llama 3.1 8B Instruct pretraining loss, which did not reach full convergence

       
Figure 3: Qwen2.5 14B Instruct pretraining loss, which did not reach convergence

Cycle 6: Demand research for an offline RAG AI agent

Design:
Based on the HP group's research on user needs (refer to the Human Practices page for detailed information), many users expressed a desire for an AI agent that could be deployed offline and run on their own computers. In response to this demand, we began developing a fully self-contained RAG software package.

Build:
We chose LangChain as the technology stack for building the RAG software package. We utilized LangChain's PDF loader and other document loaders to read both unstructured and structured documents, splitting the text into chunks. We then called an embedding model to vectorize the text and used Chroma as our vector database for storage and retrieval. Finally, we combined the search results with the system prompts and passed them to a large language model via API to generate answers.
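
A condensed sketch of this pipeline is shown below. Exact import paths vary between LangChain versions, and the file paths, embedding model, chunk sizes, and example query are placeholders rather than our production settings.

    from langchain_community.document_loaders import PyPDFLoader
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Load and chunk the documents.
    docs = PyPDFLoader("knowledge_base/example_paper.pdf").load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    chunks = splitter.split_documents(docs)

    # 2. Embed the chunks and store them in a Chroma vector database.
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectordb = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_db")

    # 3. Retrieve the most relevant chunks for a query and build the LLM prompt.
    query = "Which chassis did teams commonly use for biosensor projects?"
    hits = vectordb.similarity_search(query, k=4)
    context = "\n\n".join(doc.page_content for doc in hits)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

    # 4. The prompt is then sent to a large language model (an online API in this
    #    cycle, replaced by the local ChatParts model in the next cycle).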

Test:
We successfully developed an AI agent capable of loading a knowledge base, searching, and generating answers.

Learn:
The agent could comprehensively answer questions based on the data already collected. However, it still provided incorrect answers for similar types of questions it had not encountered before. Additionally, there were instances where relevant documents existed, but the model failed to understand them correctly, resulting in off-topic answers. Our team suspects that these errors stem from the pretrained model’s limited understanding of this specific domain knowledge.

Cycle 7: 'ChatParts' software was born

Design:
To resolve the issues of hallucinations and imprecise answers in the fine-tuned model, we decided to combine the RAG system with the fine-tuned large model. In the previous approach, after retrieving documents from the vector database, we passed the retrieved text and prompts to an online large language model via an API call to generate responses. This approach had drawbacks, most importantly that it could not run fully offline. To enable full local deployment, we replaced the online model API with our locally fine-tuned language model, specialized in synthetic biology and the iGEM competition, for generating answers.

Build:
We reused the previous technical framework to construct the RAG system and replaced the online large model API with a locally running inference module for our fine-tuned model, ChatParts.
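
As a rough sketch of the swap (the model path and generation parameters are placeholders), the API call from the previous cycle can be replaced by a local transformers pipeline that runs the fine-tuned ChatParts weights.

    from transformers import pipeline

    # Load the locally fine-tuned ChatParts model instead of calling an online API;
    # the path and generation settings below are placeholders.
    generator = pipeline("text-generation", model="./chatparts-merged", device_map="auto")

    def generate_answer(prompt):
        """Generate an answer locally from the RAG prompt built in the previous step."""
        out = generator(prompt, max_new_tokens=512, do_sample=False, return_full_text=False)
        return out[0]["generated_text"]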

Test:
We successfully built a fully functional local AI agent system fine-tuned for synthetic biology. This system essentially fulfilled our software design goals.

Learn:
After this iteration, we successfully implemented all the basic software features we initially envisioned. The system could correctly answer questions related to synthetic biology, custom-load and learn from documents, and perform full-process offline inference as an AI agent system. This became the final software product we developed, ChatParts.

       
Figure 4: ChatParts user interface

Please refer to the result and measurement page for model training results and output evaluation.