Software

Deploy a local version of our software tool by following the instructions in our GitLab repository.

Overview

Welcome to the iGEM Data Explorer! This tool makes the rich history of iGEM projects quicker and more intuitive to search. Instead of sifting through countless wiki pages or relying on traditional databases like the Phoenix Project, you ask questions in natural language. Our Retrieval-Augmented Generation (RAG) system, tailored to iGEM projects, retrieves the wiki passages most relevant to your query and uses them as context for its answer. Simply enter your query as if you were discussing it with a colleague, and receive a focused, relevant answer. Designed for the specific demands of synthetic biology and iGEM data, the iGEM Data Explorer helps you delve deeper into past projects and build on them.

Visual Example: Below is a screenshot demonstrating how the iGEM Data Explorer responded to the query: “What is HiBiT Luciferase tag and which team used it and why?”

Features

Pioneering Interactive Wiki Queries: As the first tool of its kind in the iGEM community, the iGEM Data Explorer revolutionizes how researchers access historical project data. Say goodbye to manually searching through team wikis. Now, just ask and receive instant, relevant answers directly through our application.
User-Centric Query Interface: A simple, user-friendly query interface that fetches relevant data quickly.
Robust Documentation and Logging: Benefit from a fully documented, thoroughly commented codebase that ensures ease of use, maintenance, and scaling. Paired with our detailed logging system, you can troubleshoot with confidence and clarity.
LLM Validation: Our system uses a large language model as a ‘judge’ to evaluate and optimize various Retrieval-Augmented Generation (RAG) settings. This lets us select the most accurate and efficient configuration before deployment.

Scope and Expansion Potential

The current version of the iGEM Data Explorer includes data exclusively from the 2019 iGEM wikis as part of our proof of concept. This specific year was chosen to demonstrate the capabilities of the RAG system in a controlled environment. However, the architecture and design of our tool are built with scalability in mind. It is fully capable of expanding to include all wikis from all years of the iGEM competition. This expansion can be seamlessly integrated as the project scales, providing an ever-growing repository of iGEM projects for research and analysis.

Demo

To give users a real-world feel of the iGEM Data Explorer in action, we have prepared a demo video showcasing the application’s capabilities.

Architecture

The iGEM Data Explorer harnesses a robust architecture designed to streamline the process of accessing and utilizing the vast knowledge base of iGEM wikis.

The following image shows our architecture. Each step is explained in detail below.

Pre-production Phase

Chunking: The first step in preparing the iGEM Data Explorer is data “chunking.” This process involves segmenting the extensive text from iGEM team wikis into smaller, manageable pieces.
Summarization: After the texts are chunked, each piece undergoes a summarization process. Here, a summarization model condenses the text into its most essential points, stripping away less critical details. This step is crucial as it refines the data, ensuring that only the most important information is retained for the next stages.
Embedding: Once summarized, the texts are processed through an embedding model. This model transforms each summary into a numerical embedding, a vector that computers can efficiently process and compare.
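The chunking step can be sketched as follows. This is a minimal illustration, assuming chunks are cut purely by word count (250 words is the size reported in the evaluation section); the function name `chunk_words` is hypothetical and the project's actual chunking code may differ.

```python
def chunk_words(text: str, chunk_size: int = 250) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# A synthetic 600-word text stands in for a scraped wiki page.
sample = " ".join(f"word{i}" for i in range(600))
chunks = chunk_words(sample, chunk_size=250)
print([len(chunk.split()) for chunk in chunks])  # [250, 250, 100]
```

Each chunk would then be summarized and embedded; those two steps depend on the chosen models and are not shown here.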

Production Phase

Retriever (Vector Database): At the center of the production system is a vector database that stores and manages the embeddings created from the summarized documents. When a user submits a query, the system first converts the query into an embedding using the same model that embedded the wiki texts. Because both come from the same model, the query and the stored data lie in the same vector space and can be compared directly.
Finding Relevant Data: The system searches through its vector database to find the document embeddings that most closely match the user’s query embedding.
Reader and Generator: Once the system identifies the most relevant embeddings, it moves to the next component, the Reader. This Reader uses the context from the top matching documents and a Large Language Model (LLM) to generate a coherent and detailed response based on the summarized information. The LLM effectively reads the summarized points and crafts an answer that is both comprehensive and easy to understand, directly addressing the user’s query.
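The retrieval and prompt-building steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: cosine similarity is assumed as the similarity measure, the three-dimensional vectors and team summaries are made up, and `retrieve` and `build_prompt` are hypothetical names. A real vector database would perform the search itself.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, contexts):
    """Assemble the retrieved summaries and the question into one LLM prompt."""
    context_block = "\n".join(f"- {text}" for text in contexts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}"
    )


# Toy vectors and invented summaries stand in for real embeddings and wiki text.
store = [
    ("Team A used a luciferase reporter.", [0.9, 0.1, 0.0]),
    ("Team B built a biosensor for lead.", [0.1, 0.9, 0.0]),
    ("Team C modelled plasmid copy number.", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]  # would come from the same embedding model

hits = retrieve(query_vec, store, k=2)
prompt = build_prompt(
    "Which team used a luciferase reporter?",
    [text for _, text in hits],
)
print(prompt)
```

The prompt string would then be sent to the LLM, which writes the final answer from the retrieved context.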

Data Management

In the iGEM Data Explorer, we’ve implemented a robust data management system that integrates rich metadata with precisely summarized and embedded (vectorized) textual data to facilitate effective semantic searches. Here’s how we manage the data:

The core content in our vector database consists of summarized chunks of text from iGEM team wikis (called ‘wiki_content’ in the vector store). Before embedding, these texts are summarized to distill the most pertinent information, so that each embedded vector represents a concentrated form of the original content. This keeps our search both efficient and relevant. Each wiki_content chunk is then vectorized using the all-MiniLM-L6-v2 embedding model (the evaluation results below explain why we chose this model).

For each chunk of data stored in our vector database, we include associated metadata derived from the Phoenix Project. This metadata adds context to each data point, such as the team’s institution, region, project title, abstract, and more (see under ‘Example of Data Structure in Vector Store’).

Currently, our vector database is populated with data exclusively from the 2019 iGEM teams.
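One way to picture a single entry is as a small record holding the summarized text, its vector, and the metadata. The sketch below is illustrative only: the class name `WikiChunkRecord`, the metadata key names, and all values are placeholders, not the actual schema of our vector store. The vector length of 384 matches the output dimension of all-MiniLM-L6-v2.

```python
from dataclasses import dataclass, field


@dataclass
class WikiChunkRecord:
    """One entry in the vector store (hypothetical layout)."""
    wiki_content: str                 # summarized chunk of wiki text
    embedding: list[float]            # vector produced by the embedding model
    metadata: dict[str, str] = field(default_factory=dict)  # Phoenix Project fields


record = WikiChunkRecord(
    wiki_content="Placeholder summary of one wiki chunk.",
    embedding=[0.0] * 384,  # placeholder vector with the model's dimensionality
    metadata={
        # Key names and values below are assumptions for illustration.
        "institution": "Example University",
        "region": "Europe",
        "project_title": "Example Project",
        "abstract": "Placeholder abstract.",
        "year": "2019",
    },
)
print(len(record.embedding), sorted(record.metadata))
```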

Example of Data Structure in Vector Store

Here’s a look at how data is structured within our vector store interface:

Currently, 8,594 data points are stored in our vector store, which collectively represent all available team wikis from the year 2019.

Benchmarking and Evaluation

In the development of the iGEM Data Explorer, benchmarking and evaluation were essential to optimize the system’s performance. We utilized a large language model (LLM) as a judge to systematically test various configurations of the Retrieval-Augmented Generation (RAG) settings. Below is an overview of the methodologies and results from our evaluation:

Evaluation Setup

Chunk Sizes: We experimented with different chunk sizes to determine how the granularity of data affects the retrieval quality. The sizes tested were 250, 500, and 800 words per chunk.
Embedding Models: To understand the impact of different embedding strategies, we utilized three prominent embedding models:

Scope of Evaluation: Due to cost constraints, the evaluation was limited to 20 iGEM wikis, with 20 targeted question-and-answer pairs prepared for each. This focused approach kept the evaluation at a manageable scale while still giving meaningful insights into the system’s performance.
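To give a sense of the scale of this setup, the arithmetic can be sketched as below. It assumes that every embedding model was combined with every chunk size, which is not stated explicitly here; only the first model name appears in this document, and the other two entries are placeholders.

```python
from itertools import product

CHUNK_SIZES = [250, 500, 800]  # words per chunk, as listed above
EMBEDDING_MODELS = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "placeholder-model-b",  # placeholder name, not an actual tested model
    "placeholder-model-c",  # placeholder name, not an actual tested model
]
WIKIS = 20
QUESTIONS_PER_WIKI = 20

# Assumption: a full grid of model x chunk size was evaluated.
configurations = list(product(EMBEDDING_MODELS, CHUNK_SIZES))
qa_pairs = WIKIS * QUESTIONS_PER_WIKI
judged_answers = len(configurations) * qa_pairs

print(len(configurations))  # 9
print(qa_pairs)             # 400
print(judged_answers)       # 3600
```

Under that assumption, each additional model or chunk size adds hundreds of judged answers, which is why the scope was kept small.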

Evaluation Method

The effectiveness of each configuration was assessed based on how well the embedded vectors facilitated the accurate retrieval of information in response to simulated user queries. The normalized evaluation scores (from 1 to 5) were calculated to provide a comparative analysis of each setting’s performance.
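Comparing settings then comes down to aggregating the judge's scores per configuration. The sketch below assumes a plain arithmetic mean over 1-to-5 scores; the exact normalization we applied is not reproduced here, and the model names and scores in the example are invented placeholders, not evaluation results.

```python
from collections import defaultdict
from statistics import mean


def average_scores(results):
    """Average judge scores (1 to 5) per (model, chunk_size) configuration."""
    grouped = defaultdict(list)
    for model, chunk_size, score in results:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        grouped[(model, chunk_size)].append(score)
    return {config: mean(scores) for config, scores in grouped.items()}


# Invented scores for two placeholder models; not real evaluation data.
results = [
    ("placeholder-model-a", 250, 5),
    ("placeholder-model-a", 250, 4),
    ("placeholder-model-a", 500, 3),
    ("placeholder-model-b", 250, 2),
    ("placeholder-model-b", 250, 4),
]

averages = average_scores(results)
best = max(averages, key=averages.get)
print(best, averages[best])
```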

Example of Evaluation

To illustrate our evaluation process, here is a specific example of how the iGEM Data Explorer performs under one of the tested settings (embedding model: sentence-transformers/all-MiniLM-L6-v2; chunk size: 250 words).

Query: “Have any iGEM teams utilized the BsaI-HFv2 cleavage enzyme and T7 DNA Ligation in their cloning or plasmid construction processes?”
Reference Answer: “Based on the provided context, it appears that at least one iGEM team has indeed used BsaI-HFv2 and T7 DNA Ligation for their cloning or plasmid construction process.”
Actual iGEM Data Explorer answer: “Yes, the Tunghai TAPG team has utilized the BsaI-HFv2 cleavage enzyme and T7 DNA Ligation in their cloning or plasmid construction processes.”
LLM-as-a-judge score assigned: 5 (out of 5)

Detailed results for this query, along with others under various settings, are systematically compiled and can be accessed at src/evaluation/output. Each file within this directory corresponds to different RAG settings tested, with results showcasing how each configuration performed.

Results

The following graph displays the average evaluation scores across different settings:

Configurations using all-MiniLM-L6-v2 consistently showed strong performance, suggesting that this model is particularly effective at capturing semantic meaning in smaller, more focused chunks of text.
Currently, our application employs the sentence-transformers/all-MiniLM-L6-v2 model with a chunk size of 250 words.

Future Directions

Looking forward, we see plenty of opportunities to enhance the iGEM Data Explorer, addressing both current limitations and future technological advancements. Here are the key areas of focus:

Improved Data Scraping and Cleaning: To enhance the quality and usability of the data, we will advance our web scraping and data cleaning techniques. By retaining more contextual information, such as references and laboratory data, and improving the way we chunk text, the system will be able to provide richer and more accurate outputs.
Enhanced Metadata Filtering: Currently, our search relies primarily on wiki content. We plan to integrate more sophisticated natural language processing techniques to interpret user queries. This will allow us to apply filters that combine metadata from the Phoenix Project with similarity search, providing more targeted and relevant search results.
Source Tracking and Display: We intend to implement a feature that records and displays the source webpage of each data chunk within the search results. This will provide users with direct access to the original content for deeper investigation and verification.
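As a rough sketch of how metadata filtering and source tracking could fit together, the example below first keeps only the records whose metadata matches the requested filters, then ranks them by similarity score and returns each result with its source page. These features are planned rather than implemented, so everything here is hypothetical: the function `filtered_search`, the `source_url` key, the precomputed scores, and the example records.

```python
def filtered_search(records, scores, filters, k=3):
    """Keep records matching every metadata filter, then rank by similarity score."""
    matches = [
        (score, record)
        for record, score in zip(records, scores)
        if all(record["metadata"].get(key) == value for key, value in filters.items())
    ]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "wiki_content": record["wiki_content"],
            "source_url": record["metadata"].get("source_url"),
            "score": score,
        }
        for score, record in matches[:k]
    ]


# Invented records; URLs are placeholders on example.org.
records = [
    {"wiki_content": "Summary 1", "metadata": {"region": "Asia", "source_url": "https://example.org/team-1"}},
    {"wiki_content": "Summary 2", "metadata": {"region": "Europe", "source_url": "https://example.org/team-2"}},
    {"wiki_content": "Summary 3", "metadata": {"region": "Europe", "source_url": "https://example.org/team-3"}},
]
scores = [0.9, 0.8, 0.4]  # similarity of each record to the query (made up)

results = filtered_search(records, scores, {"region": "Europe"})
for item in results:
    print(item["score"], item["source_url"])
```

Filtering before ranking means a highly similar chunk from a non-matching team (the first record here) is excluded from the results.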

How to Contribute

The iGEM Data Explorer project thrives on collaborative efforts from the community. Whether you are a developer, a researcher, or simply someone passionate about synthetic biology, there are numerous ways you can contribute:

Code Contributions: If you are interested in contributing code, you can start by reviewing our src/main.py file for areas that might need improvement or features that you could add. Check our project repository on GitLab for open issues or feature requests.
Data Quality: Contributions to data cleaning, formatting, or any enhancements to the way data is processed are always welcome. If you have expertise in data science or natural language processing, your input can be invaluable.
Documentation and Tutorials: Help new users get started by improving documentation or creating tutorials that explain how to use the iGEM Data Explorer effectively. Clear, comprehensive guides are crucial for user engagement and retention.
Testing and Feedback: Use the tool and provide feedback on any bugs or user experience issues you encounter. User feedback is crucial for continuous improvement.

For more details on how to get involved, visit our project page on GitLab or contact us directly through our project website. We are excited to welcome new contributors who share our vision of advancing synthetic biology research through better data tools.