The 2024 NJU-CHINA team has developed Prometheus, a large language model for synthetic biology.

In the following sections, we describe the entire process, from the conception of the model to its successful build.

Overall Engineering

Our project as a whole went through the DBTL (Design, Build, Test, Learn) cycle.

Design1:

Through interviews and questionnaires with experts, we identified a key challenge in synthetic biology: the lack of a structured knowledge base and organized data. Searching for the right parts and combining them was difficult and had a high barrier to entry. We therefore planned to combine knowledge graphs with large language models, uniting part data with synthetic biology knowledge frameworks while lowering the threshold for learning and using synthetic biology, and so promote the field in a practical sense.

Build1:

To address this, we developed a synthetic biology model that processes natural language inputs and outputs part combinations and plasmid designs based on the user's functional requirements.

Test1:

We tested our model by gathering feedback from synthetic biology experts, teachers, and students, as well as individuals with no background in the field. Additionally, we conducted wet lab experiments to validate the model and provide further insights. During this testing, we realized that the current version still had a steep learning curve, even for life sciences students, which limited its broader application in advancing synthetic biology.

Learn1&Design2:

This experience inspired us to improve the model by incorporating real-time feedback from users and using experimental data to continuously train and optimize it. As a result, we developed two modes: Expert Mode for users with a deep understanding of synthetic biology, and Freshman Mode, designed for beginners.

Build2:

Freshman Mode allows users to interact through natural language dialogues, with the model performing part searches and plasmid construction in the background, making the tool more accessible and beginner-friendly.

Test2&Learn2:

Freshman Mode was introduced in educational activities, allowing experts, students with a life sciences background, middle school students, and even the public to use the model, and it received widespread praise. It successfully lowered the entry barrier and helped to popularize synthetic biology. With this success, we decided to continue optimizing and iterating the model by reviewing the latest literature and gathering additional feedback through further outreach activities.

We Build Prometheus Step by Step

Milestones

We reached several milestones while constructing the Prometheus model:

1. We first extracted the part data from the database using Llama3 with prompt engineering.
2. We constructed the part database in the form of a Knowledge Graph.
3. We matched the functional descriptions of the parts with the functional requirements of the users using BioSentVec.
4. We completed the output of the plasmids.
5. We developed Freshman Mode, which supports natural language dialogue and automatic extraction of user requirements. We also built a part scoring system and a data feedback system to prepare for further optimization of the model.

Our iGEM project embodies several iterative DBTL (Design, Build, Test, Learn) cycles, where each small loop drives the continuous improvement of our synthetic biology platform. These cycles are connected and form the backbone of our engineering success. Below, we highlight the key DBTL cycles that showcase the progression of our work, each contributing to a larger, interconnected system.

Expert Mode Milestone 1: Fine-Tuning Models for Parts Data Extraction

Prometheus utilizes the iGEM official database for its part data. However, the extracted metadata often contains excessive text, noise, or truncation issues, making it unsuitable for direct matching with large language models.

To address this, we used ChatGLM3 with fine-tuning and Llama3 with prompt engineering to extract functional descriptions of parts, condensing them into 2-3 sentences that retain key functional information. This extraction significantly improved the matching success rate of the Sent2Vec model compared to raw data.

Design1 (Dry Lab):

The focus here was on improving the extraction of detailed part functions and environmental context using advanced models such as ChatGLM3-6B and Llama3.1.

Build1 (Dry Lab):

We manually annotated 1,000 parts, with two annotators per part. The data were then split into training and testing sets for fine-tuning ChatGLM3. During training, the loss converged at around epoch 1.05. Based on the loss measured on the validation set, we chose to stop early at epoch 0.8.
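
The early-stopping decision can be illustrated with a minimal sketch. This is not our training code: `pick_checkpoint` is a hypothetical helper and the loss values are invented placeholders, not our measured logs. It only shows the idea of keeping the checkpoint whose validation loss is lowest.

```python
from typing import List, Tuple


def pick_checkpoint(history: List[Tuple[float, float]]) -> float:
    """Return the (fractional) epoch whose validation loss is lowest.

    `history` holds (epoch, validation_loss) pairs, one per saved checkpoint.
    If several checkpoints tie, the earliest one is returned.
    """
    if not history:
        raise ValueError("history must contain at least one checkpoint")
    best_epoch, _ = min(history, key=lambda pair: pair[1])
    return best_epoch


# Placeholder numbers for illustration only (NOT measured values).
validation_history = [
    (0.2, 1.90),
    (0.4, 1.52),
    (0.6, 1.31),
    (0.8, 1.18),
    (1.0, 1.22),
    (1.2, 1.27),
]
best = pick_checkpoint(validation_history)
print(f"keep the checkpoint saved at epoch {best}")
```

In this toy history the validation loss starts rising after the selected checkpoint, which is the usual signal that further training would overfit.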

Test1 (Dry Lab):

However, ChatGLM3 has a maximum token limit that is too small for analyzing some iGEM web pages. In addition, we found that ChatGLM3 could not precisely extract the function of a biological part: its extractions often contained irrelevant information.

Learn1 (HP)&Design2:

We interviewed Prof. Zhen Wu from the School of Artificial Intelligence, Nanjing University. He suggested that we use Llama3, which was then a state-of-the-art large language model, to perform the extraction.

Build2:

Due to limited computational power, we chose prompt engineering instead of fine-tuning Llama3. The specially designed prompt covers multiple aspects, from biological background information to the workflow and the output format.
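
To make the three-part prompt structure concrete, here is a minimal sketch. The wording of each section, the function name `build_extraction_prompt`, the `max_chars` cut-off, and the placeholder part ID are all assumptions for illustration; they are not the actual prompt used in Prometheus.

```python
def build_extraction_prompt(part_name: str, page_text: str, max_chars: int = 4000) -> str:
    """Assemble a prompt with background, workflow, and output-format sections."""
    background = (
        "You are an assistant familiar with synthetic biology and with "
        "part pages from the iGEM parts database."
    )
    workflow = (
        "1. Read the part page text.\n"
        "2. Ignore navigation text, team stories, and unrelated notes.\n"
        "3. Identify what the part does and in which context it works."
    )
    output_format = "Answer with 2-3 sentences describing only the part's function."
    # Truncate very long pages so the prompt stays within the model's context window.
    excerpt = page_text[:max_chars]
    return (
        f"### Background\n{background}\n\n"
        f"### Workflow\n{workflow}\n\n"
        f"### Output format\n{output_format}\n\n"
        f"### Part: {part_name}\n{excerpt}"
    )


# "BBa_EXAMPLE" is a made-up identifier, not a real registry entry.
prompt = build_extraction_prompt("BBa_EXAMPLE", "This part is a constitutive promoter ...")
print(prompt)
```

Keeping the sections in separate variables makes it easy to revise one aspect of the prompt, for example the output format, without touching the others.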

Test2:

Llama3 with prompt engineering performed much better than fine-tuned ChatGLM3.

Expert Mode Milestone 2: Building the Knowledge Graph for Part Organization

The knowledge graph is a method for representing and organizing information through nodes and edges, where nodes represent entities or concepts, and edges denote their relationships. Knowledge graphs are widely used in natural language processing, search engines, and recommendation systems, helping machines understand and reason about information for smarter question-answering and information retrieval. Therefore, we designed a knowledge graph to organize the database of Prometheus.

Design:

Through our pre-project investigation, we identified the need for a more structured and accessible database within iGEM to facilitate efficient part searches. The challenge was that iGEM’s vast data was disorganized and difficult to utilize effectively.

Build:

We developed a knowledge graph using data from the iGEM API and web-scraped content. The knowledge graph organizes synthetic biology parts, categorizing them into distinct types such as promoters, terminators, and operators.
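
The node-and-edge organization can be sketched as follows. This is a small in-memory illustration, not our production database: the class `PartGraph`, the relation name `IS_A`, and the part identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class PartGraph:
    """A tiny knowledge graph: nodes with attributes plus labelled, directed edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {}
        self.edges: Set[Tuple[str, str, str]] = set()
        # Index (relation, target) -> sources, so category lookups are fast.
        self._incoming: Dict[Tuple[str, str], List[str]] = defaultdict(list)

    def add_node(self, node_id: str, **attrs: str) -> None:
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, source: str, relation: str, target: str) -> None:
        triple = (source, relation, target)
        if triple not in self.edges:  # ignore duplicate edges
            self.edges.add(triple)
            self._incoming[(relation, target)].append(source)

    def parts_of_type(self, part_type: str) -> List[str]:
        """Return all parts linked to a category node by an IS_A edge."""
        return sorted(self._incoming.get(("IS_A", part_type), []))


graph = PartGraph()
for category in ("Promoter", "Terminator", "Operator"):
    graph.add_node(category, kind="category")

# Placeholder parts; descriptions stand in for the extracted 2-3 sentence summaries.
graph.add_node("part_A", kind="part", description="placeholder promoter description")
graph.add_node("part_B", kind="part", description="placeholder terminator description")
graph.add_node("part_C", kind="part", description="placeholder promoter description")
graph.add_edge("part_A", "IS_A", "Promoter")
graph.add_edge("part_B", "IS_A", "Terminator")
graph.add_edge("part_C", "IS_A", "Promoter")

print(graph.parts_of_type("Promoter"))
```

Storing the category as an edge rather than as a plain attribute means further relations between parts can be added later without changing the node schema.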

Test:

We ran initial tests using BioSentVec for semantic matching of user input with database parts. The model was able to return relevant parts but struggled with fine-grained distinctions between similar elements. After interviewing Prof. Bo Zhang from the School of Life Science of NJU, we developed an algorithm to merge other formatted databases into our knowledge graph, enriching the set of biological parts that Prometheus can use.

Learn:

Feedback showed the need for enhanced precision in part searches. We learned that part categorization and more structured descriptions were necessary to improve the search function.

Expert Mode Milestone 3: Semantic Search Enhancement for Part Matching

Prometheus allows users to describe parts in natural language and recommends the closest matching parts. Traditional matching algorithms often struggle with varied wording and complex sentence structures, limiting accuracy in real-world applications. To address this, we employed Sent2Vec technology. Given that Prometheus is frequently used in contexts highly relevant to synthetic biology, we specifically used the BioSentVec model, a Sent2Vec model trained on biomedical text, including PubMed.

Design:

After addressing extraction accuracy, we focused on improving semantic search to match user needs with the right parts. A core challenge was inaccurate matches, such as returning terminators when users searched for promoters. We interviewed assistant researcher Zhen Wu and Professor Xinyu Dai from the School of Artificial Intelligence, Nanjing University, who gave us professional advice, such as using word and sentence vector techniques like Word2Vec and Sent2Vec.

Build:

We used BioSentVec to create embeddings for both the part descriptions and user queries. By calculating semantic distances, we aimed to enhance the matching algorithm.
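
The matching step can be sketched with cosine similarity between a query vector and each description vector. To keep the example self-contained, `embed` below is a toy stand-in that counts words; in Prometheus the vectors come from BioSentVec instead. All part identifiers and descriptions are placeholders.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_parts(query: str, descriptions: Dict[str, str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every description against the query and return the best matches first."""
    query_vec = embed(query)
    scored = [
        (part_id, cosine_similarity(query_vec, embed(text)))
        for part_id, text in descriptions.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Placeholder descriptions, not real registry entries.
descriptions = {
    "part_A": "strong constitutive promoter for E. coli",
    "part_B": "transcription terminator with high efficiency",
    "part_C": "green fluorescent protein reporter",
}
ranking = rank_parts("a strong promoter that works in E. coli", descriptions)
print(ranking[0][0])
```

A word-count vector only rewards exact word overlap, which is precisely the weakness that a trained sentence embedding addresses: it can place differently worded descriptions of the same function close together.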

Test:

Initial tests showed that while the system could return relevant parts, mismatches still occurred. We implemented prompt engineering to refine search results, improving the precision of matches.

Learn:

We learned that categorization and embedding calibration were essential. We built a category mapping rule based on previous part classifications, allowing for better matching of parts to user needs.
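
A category mapping rule of this kind can be sketched as a filter applied before similarity ranking. The keyword table, function names, and part identifiers below are hypothetical and chosen only to show the idea; the actual rule set in Prometheus is based on previous part classifications.

```python
import re
from typing import Dict, List, Optional

# Hypothetical mapping from words a user might type to part categories.
CATEGORY_KEYWORDS: Dict[str, str] = {
    "promoter": "Promoter",
    "terminator": "Terminator",
    "ribosome binding site": "RBS",
    "rbs": "RBS",
    "reporter": "Coding",
}


def infer_category(query: str) -> Optional[str]:
    """Return the first category whose keyword appears as a whole word in the query."""
    text = query.lower()
    for keyword, category in CATEGORY_KEYWORDS.items():
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            return category
    return None


def filter_by_category(query: str, part_categories: Dict[str, str]) -> List[str]:
    """Keep only parts in the inferred category; keep everything if none is inferred."""
    category = infer_category(query)
    if category is None:
        return sorted(part_categories)
    return sorted(pid for pid, cat in part_categories.items() if cat == category)


# Placeholder parts for illustration.
part_categories = {"part_A": "Promoter", "part_B": "Terminator", "part_C": "Promoter"}
print(filter_by_category("I need a strong promoter", part_categories))
```

Filtering first means a terminator can no longer be returned for a promoter query, however similar its description is in embedding space.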

Expert Mode Milestone 4: Plasmid Construction Automation

Under the guidance of the HP group, we identified that the model has significant utility for non-experts and beginners. However, their limited background in synthetic biology means that simply providing recommended parts is insufficient for them to create the plasmids they need. Therefore, we require a more comprehensive design for downstream tasks that can directly provide complete plasmids for user reference.

Design:

After communicating with Prof. Shan Chang from the Institute of Bioinformatics and Biomedical Engineering at Jiangsu University of Technology, we found that better downstream task design was needed to give professionals a more convenient experience when designing plasmids. Initially, the system only provided lists of matched parts, but we realized the need for automatic plasmid design. To streamline the plasmid construction process, we incorporated plasmid backbone selection into the platform.

Build:

We integrated plasmid backbones into our system, allowing automatic assembly of parts based on user-defined criteria. This step improved the platform’s usability and functionality.
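
As a simplified illustration of automatic assembly, the sketch below orders the selected parts into an expression cassette and inserts it into a backbone sequence. The cassette order, the `Part` class, the insertion position, and the short toy sequences are all assumptions; real assembly must also handle restriction sites, scars, and orientation, which this example ignores.

```python
from dataclasses import dataclass
from typing import List

# Assumed order of a basic expression cassette.
CASSETTE_ORDER = ["Promoter", "RBS", "Coding", "Terminator"]


@dataclass
class Part:
    part_id: str
    category: str
    sequence: str


def assemble_plasmid(backbone: str, parts: List[Part], insert_at: int) -> str:
    """Insert the ordered cassette into the backbone at position `insert_at`."""
    by_category = {part.category: part for part in parts}
    missing = [c for c in CASSETTE_ORDER if c not in by_category]
    if missing:
        raise ValueError(f"missing part categories: {missing}")
    if not 0 <= insert_at <= len(backbone):
        raise ValueError("insert_at must lie within the backbone")
    cassette = "".join(by_category[c].sequence for c in CASSETTE_ORDER)
    return backbone[:insert_at] + cassette + backbone[insert_at:]


# Toy sequences for illustration only; parts are deliberately given out of order.
selected = [
    Part("part_T", "Terminator", "TTTTTT"),
    Part("part_P", "Promoter", "TTGACA"),
    Part("part_G", "Coding", "ATGTAA"),
    Part("part_R", "RBS", "AGGAGG"),
]
plasmid = assemble_plasmid("GGGGCCCC", selected, insert_at=4)
print(plasmid)
```

Raising an error when a category is missing reflects the point that a list of matched parts alone is not yet a usable plasmid design.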

Test:

We tested the new system with a variety of users and tasks, discovering that adding more backbone options and auxiliary parts like antibiotic resistance genes further enhanced the design process.

Learn:

This cycle taught us that flexibility in plasmid construction was critical. The inclusion of multiple backbones and auxiliary features allowed users to build more complex, functional plasmids efficiently.

Freshman Mode Milestone 5: Improving the User Interface with Freshman Mode

After completing the construction of Expert Mode, we validated its adaptability through wet lab experiments and promoted it through HP activities. While it received widespread praise, we also recognized its high barrier to entry. Our team’s original intention is to promote synthetic biology to everyone, enabling all to use synthetic biology for their own purposes. Expert Mode, which is not particularly friendly to newcomers, clearly does not align with this goal. We therefore needed a more user-friendly way to interact with the model, based on the conversational formats people use in daily life, to lower the threshold for using our model and truly popularize synthetic biology. As a result, we packaged the existing system as Expert Mode and developed a more accessible version, Freshman Mode, with conversational capabilities that can automatically extract user requirements. Additionally, we were inspired to optimize the model by collecting data feedback for continuous improvement.

Design:

During the teaching support activities conducted by the HP team, they noticed that the Expert Mode was not suitable for middle and primary school students. Due to their relative lack of foundational knowledge in synthetic biology, they would find it difficult to express their needs for complete synthetic biology products in a structured way, and this frustration could further lead to unnecessary anxiety and resistance toward synthetic biology.

During the wet lab trial, we demonstrated the utility of the Prometheus model. However, we recognized that using it effectively and entering the correct requirements still requires some background in synthetic biology. We also realized that experimental data from Prometheus-guided experiments can be fed back into the model to further enhance its performance, and that scoring the output parts can optimize Prometheus's part selection process, improving its overall functionality.

Build:

We designed a Freshman Mode by using Llama 3, where the system could guide beginners in natural language, offering step-by-step assistance from part search to plasmid design. This mode was inspired by user feedback from our outreach activities. Freshman Mode can also accept user feedback, experimental data, and user ratings on the model's output parts. We plan to collect sufficient data in the future to further optimize the model.
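
One way user ratings could feed back into part selection is sketched below. This is a hypothetical design, not the implemented scoring system: the 1-5 rating scale, the neutral default of 3.0, the `weight` parameter, and the class `PartScoreBoard` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class PartScoreBoard:
    """Collect user ratings per part and use them to adjust search rankings."""

    def __init__(self) -> None:
        self._ratings: Dict[str, List[int]] = defaultdict(list)

    def add_rating(self, part_id: str, rating: int) -> None:
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self._ratings[part_id].append(rating)

    def mean_score(self, part_id: str, default: float = 3.0) -> float:
        """Average rating of a part, or a neutral default if it was never rated."""
        ratings = self._ratings.get(part_id)
        if not ratings:
            return default
        return sum(ratings) / len(ratings)

    def rerank(self, candidates: List[Tuple[str, float]], weight: float = 0.1) -> List[Tuple[str, float]]:
        """Sort (part_id, similarity) pairs by similarity plus a rating bonus or penalty."""
        def combined(item: Tuple[str, float]) -> float:
            part_id, similarity = item
            return similarity + weight * (self.mean_score(part_id) - 3.0)

        return sorted(candidates, key=combined, reverse=True)


board = PartScoreBoard()
board.add_rating("part_A", 5)
board.add_rating("part_A", 4)
board.add_rating("part_B", 1)

# Placeholder similarity scores from the semantic search step.
reranked = board.rerank([("part_B", 0.62), ("part_A", 0.60)])
print([part_id for part_id, _ in reranked])
```

Centering the adjustment on a neutral score means unrated parts are neither favored nor penalized, so new parts can still surface in results.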

Test:

We deployed the Freshman Mode on our web platform, allowing non-experts, including middle school students and new researchers, to test it. Feedback showed that the simplified interface significantly lowered the learning curve.

Learn:

Through this cycle, we learned that accessibility was key. The integration of natural language input and a user-friendly interface helped make synthetic biology more approachable, fulfilling our goal of broad promotion. In addition, we plan to gather sufficient feedback and laboratory data to further optimize Prometheus.

Integration and Cumulative Impact

Each of these DBTL cycles is interconnected and feeds into the overall evolution of our project. Accurate data extraction populates the knowledge graph, which in turn powers semantic search. Freshman Mode lowers the barrier for entry, while automated plasmid construction enhances the system's practical utility for both beginners and experts. Together, these cycles create a robust, user-friendly platform that continually evolves through feedback and learning, driving the successful application of synthetic biology in a broader context.