Description

Abstract

The iGEM Parts Registry is an open database of biological components that can be assembled to create new systems[1]. Over the years, the database has grown significantly thanks to the continuous contributions of past iGEM teams. However, the quality of its data varies. Some parts lack essential information, such as detailed descriptions or optimization notes, and provide only raw DNA sequences, which limits their usability. Additionally, search results are presented as unstructured text and images, which makes it difficult for users to quickly browse and filter relevant information[2].

After analyzing user feedback, we identified the need for a tool that helps users sift through the data efficiently and work more productively. ChatParts is our solution: an AI-powered tool that integrates iGEM Parts information using a RAG (Retrieval-Augmented Generation) system. This system reorganizes how parts data is managed and retrieved, providing more comprehensive and user-friendly access to the information[3]. We also fine-tune the model to improve its performance in specific areas[4].

To support the adoption of our AI tool, we also created the iGEM ChatParts Community, a platform designed to promote collaboration and expand the reach of our project.

Data Quality Varies in iGEM Parts Registry

The Background

Since its establishment, iGEM has attracted teams from all over the world to upload and contribute biological parts. After many rounds of competition, the iGEM Parts Registry now contains a very large number of parts. However, the quality of the data varies across the registry. High-integrity parts include detailed descriptions, optimization information, and contribution records, while low-integrity parts contain only basic sequence information[2].
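The difference between a well-documented part and a sequence-only part can be made concrete with a small check. The sketch below is only an illustration: the field names (`sequence`, `description`, `optimization`, `contributions`) are hypothetical and do not reflect the registry's actual schema.

```python
# Hypothetical documentation fields; the real registry schema may differ.
DOC_FIELDS = ("description", "optimization", "contributions")


def completeness(part):
    """Return the fraction of documentation fields that are non-empty."""
    filled = [f for f in DOC_FIELDS if (part.get(f) or "").strip()]
    return len(filled) / len(DOC_FIELDS)


def is_sequence_only(part):
    """True when a part has a sequence but no documentation at all."""
    return bool(part.get("sequence")) and completeness(part) == 0.0


# Illustrative records, not real registry entries.
documented = {
    "sequence": "ATGGCT",
    "description": "Reporter coding sequence",
    "optimization": "Codon-optimized for E. coli",
    "contributions": "",
}
bare = {"sequence": "ATGGCT"}

print(completeness(documented), is_sequence_only(bare))
```

A score like this could be used to rank or flag parts before they are shown to a user, although how completeness should be weighted is a design decision this sketch does not settle.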

The Problems

In our iGEM 2024 project, we identified several significant issues that hinder iGEM teams.

First, information is highly dispersed. Finding biobricks and experimental parameters means consulting multiple sources such as literature, reports, blogs, and the wikis of previous years' competitions. This makes integration extremely difficult, and the process of searching and integrating is both time-consuming and labor-intensive.
Second, assessing the performance of biological parts is difficult. There is a lack of tools for quickly evaluating biobricks, so it is often unclear whether a part suits the scenario at hand. Worse still, some parts have only a sequence without any functional description, which further complicates assessment.
Third, competition information is complex. The official website does not clearly present the project structure and content summaries of previous years' competitions. Furthermore, the website's complex structure takes experience to navigate quickly and accurately, which makes it less friendly for beginners.
Finally, communication among iGEM teams and team members is difficult. In the past, teams could communicate only at public exchange conferences such as the annual iGEM Jamboree or CCIC, opportunities that occur only twice a year, or through one-on-one offline exchanges between teams, which are small-scale and inefficient. When a team encounters a problem it cannot solve alone, it is difficult to get attention and help from a wider group of people.

Meet ChatParts

Figure 1. General description of our project

Why do we need ChatParts?

We conducted research on modern techniques commonly used for solving search-related tasks and summarized their advantages and disadvantages. We found that traditional search techniques and the currently popular RAG technology each have their own strengths and weaknesses. However, for designing a user-friendly query system and a versatile AI assistant, the RAG system better meets the practical needs identified in our research.

RAG

We developed an AI agent that integrates iGEM parts information using a RAG (Retrieval-Augmented Generation) system built on an LLM (Large Language Model) to reorganize how parts information is managed and retrieved. This system replaces the tedious and time-consuming process of literature query and data retrieval with a ChatGPT-like question-answer interface where users can enter queries or ask for specific information about biological parts. The RAG system improves the accuracy and relevance of the generated content by combining an information retrieval step with a generation model[3].
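The retrieve-then-generate flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production pipeline: the part records are made-up toy entries, a bag-of-words cosine similarity stands in for a real embedding model and vector store, and the final call to the LLM is left out so that only the retrieval and prompt-building steps are shown.

```python
import math
import re
from collections import Counter

# Toy corpus: hypothetical records, not real registry entries.
PARTS = {
    "demo_promoter": "constitutive promoter with strong expression in E. coli",
    "demo_reporter": "green fluorescent protein reporter coding sequence",
    "demo_terminator": "double terminator that stops transcription efficiently",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Return up to k part IDs whose text overlaps the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), pid) for pid, doc in PARTS.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [pid for score, pid in scored[:k] if score > 0]


def build_prompt(query, k=2):
    """Assemble the retrieved records and the question into one LLM prompt."""
    context = "\n".join(f"[{pid}] {PARTS[pid]}" for pid in retrieve(query, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


print(build_prompt("fluorescent reporter protein"))
```

Because the model is asked to answer from the retrieved records, its output stays tied to registry content rather than to whatever it memorized during pre-training, which is the property the paragraph above describes.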

Fine-Tuning

Fine-tuning is another essential part of our model training, intended to make the model perform better in particular domains. After implementing the RAG system, we prepared fine-tuning datasets by manually sorting and organizing context, prompts, and answers into JSON format. This process helps the model specialize in synthetic biology, making it better suited to specific research problems or needs. We based the fine-tuning on the LoRA method, which trains small low-rank matrices alongside the frozen weights. This allows efficient adaptation without changing the parameters of the pre-trained model, preserving the original model's structure and knowledge[4].
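The LoRA idea can be sketched in plain Python. The pre-trained weight matrix W0 stays frozen, and two small matrices A and B form a trainable low-rank update, so the effective weight is W0 + (alpha / r) * B A, where r is the adapter rank. The matrix values, sizes, and the helper names `matvec`, `lora_forward`, and `trainable_params` below are illustrative assumptions, not our training code.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def lora_forward(x, w0, a, b, alpha=1.0):
    """Compute W0 x + (alpha / r) * B (A x) without modifying W0."""
    r = len(a)                      # adapter rank = number of rows in A
    base = matvec(w0, x)            # frozen pre-trained path
    delta = matvec(b, matvec(a, x)) # trainable low-rank path
    return [h + (alpha / r) * d for h, d in zip(base, delta)]


def trainable_params(d_out, d_in, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen weights (toy 2 x 2 example)
a = [[1.0, 1.0]]                # A: r x d_in with r = 1
b_zero = [[0.0], [0.0]]         # B starts at zero, so the update is a no-op
b_trained = [[1.0], [2.0]]      # made-up values standing in for a trained B

x = [1.0, 2.0]
print(lora_forward(x, w0, a, b_zero))     # same as the base model output
print(lora_forward(x, w0, a, b_trained))  # base output plus low-rank update
print(trainable_params(1000, 1000, 8), "vs", 1000 * 1000)
```

Only A and B are updated during fine-tuning. For a 1000 x 1000 layer at rank 8 that is 16,000 trainable values instead of 1,000,000, which is why the original model's weights and knowledge are left intact.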

Our Engagement with the World

Human Practice

Our human practices work focused on understanding local issues related to data quality and communication within the iGEM community. We visited leading companies such as GenScript to learn about the challenges in plasmid synthesis, and we discussed the data quality issues in the iGEM Parts Registry with experts such as Professor Dechang Xu. These insights guided our project design, which focuses on creating an integrated and user-friendly database to solve these problems.

Education

We actively engaged in education and outreach by organizing events like a science outreach program at the Suzhou Customs Biosafety Museum, targeting primary and secondary school students. These activities aimed to increase public awareness and understanding of synthetic biology, sparking interest in biosafety and the protection of biodiversity.

Community

To foster communication and collaboration within the iGEM community, we developed the iGEM ChatParts Community. This platform allows iGEM participants and synthetic biology enthusiasts to exchange ideas, discuss challenges, and share experiences. It also provides a channel for knowledge acquisition and engagement beyond the traditional competition format.

Biosafety

We co-hosted a roundtable event titled "Roundtable on AI, Biosecurity, and Bioethics" with several leading universities to address potential biosafety and ethical issues related to the application of AI in the biological field. The forum led to the drafting of a white paper titled "Biosafety & Bioethics in Synthetic Biology," which established ethical guidelines for the responsible application of these technologies, demonstrating our commitment to advancing technology while upholding social responsibilities.

Outlooks and Future Perspectives

Stage 1: Pioneering Customization and Intelligent Retrieval

Our goal in this stage is to develop customized services and intelligent retrieval systems to promote productization and market application.

Collaboration with Biotech Startups
We will collaborate with the IT departments of biotech startups to jointly enhance our research and development technology. In response to market demands, we will develop customized services or intelligent retrieval systems to provide professional biological data solutions.
Productization and Market Promotion
We will initially develop open-source software packages, then test and promote them on our platform. We will also improve the RAG biological intelligent retrieval platform website to achieve intelligent services in multiple fields.
Application Potential of RAG + LLM
We will combine the RAG system with large language models (LLMs) to provide more intelligent question answering and data analysis services. By leveraging the natural language processing capabilities of LLMs, we will improve how users interact with the system, making the retrieval and analysis of biological data more convenient and efficient. We will also implement automated data summarization and report generation to help researchers quickly obtain key information.

Stage 2: Leading with Innovation and Market Expansion

Our goal is to conduct market promotion and continuous optimization to maintain a leading position in the industry.

Purchase of Independent Servers
We will purchase and install independent servers to ensure the stable operation and efficient service of the platform. This will provide a stable and secure system environment and enhance the user experience.
Patent Application and Intellectual Property Protection
We will apply for patents to protect our independently developed core technologies and intellectual property. Through cooperation with investors, we will productize the model and bring it to the market.
Continuous Optimization and Innovation
We will continuously optimize and maintain the system to ensure technological leadership. As AI technology advances, we will keep introducing new techniques to improve system performance and service quality. By building on the continuous progress of LLMs, we will steadily improve the intelligence of the RAG system and provide users with higher-quality services.

References

[1]: Help:Parts, https://parts.igem.org/Help:Parts.

[2]: W. Hersh, Information Retrieval, 2008. https://link.springer.com/book/10.1007/978-0-387-78703-9

[3]: Y. Park and M. Kellis, Nature Biotechnology, 2015, 33, 825–826. https://www.nature.com/articles/nbt.3313

[4]: I. A. Tamposis, K. D. Tsirigos, M. C. Theodoropoulou, P. I. Kontou and P. G. Bagos, Bioinformatics, 2018, 35, 2208–2215. https://academic.oup.com/bioinformatics/article/35/13/2208/5184961