
Software

Deploy a local version of our software tool by following the instructions in our GitLab repository.
If you prefer, try the deployed version, directly accessible here!

Overview

Welcome to the iGEM Data Explorer! This innovative tool transforms how you access the rich history of iGEM projects, making it quicker and more intuitive than ever. Instead of sifting through countless wiki pages or relying on traditional databases like the Phoenix Project, our platform offers a more dynamic and interactive experience. Leveraging advanced AI specifically tailored for iGEM projects, our Retrieval-Augmented Generation (RAG) system allows you to interact with software that understands your queries and contextualizes the information it retrieves. Simply enter your query as if you’re discussing it with a colleague, and receive precise, relevant information instantly. Designed specifically for the unique demands of synthetic biology and iGEM data, the iGEM Data Explorer provides customized insights that help you delve deeper into past projects and foster innovative solutions.

Visual Example: Below is a screenshot demonstrating how the iGEM Data Explorer responded to the query: “What is HiBiT Luciferase tag and which team used it and why?”

Features

Scope and Expansion Potential

The current version of the iGEM Data Explorer includes data exclusively from the 2019 iGEM wikis as part of our proof of concept. This specific year was chosen to demonstrate the capabilities of the RAG system in a controlled environment. However, the architecture and design of our tool are built with scalability in mind. It is fully capable of expanding to include all wikis from all years of the iGEM competition. This expansion can be seamlessly integrated as the project scales, providing an ever-growing repository of iGEM projects for research and analysis.

Demo

To give users a real-world feel of the iGEM Data Explorer in action, we have prepared a demo video showcasing the application’s capabilities.

Architecture

The iGEM Data Explorer harnesses a robust architecture designed to streamline the process of accessing and utilizing the vast knowledge base of iGEM wikis.

The following image shows our architecture. Each step is explained in detail in the paragraphs below.

Pre-production Phase

Production Phase

Data Management

In the iGEM Data Explorer, we’ve implemented a robust data management system that integrates rich metadata with precisely summarized and embedded (vectorized) textual data to facilitate effective semantic searches. Here’s how we manage the data:

The core content in our vector database consists of summarized chunks of text from iGEM team wikis (called ‘wiki_content’ in the vector store). Before embedding, these texts are summarized to distill the most pertinent information, so that each embedded vector represents a concentrated form of the original content. This keeps our search mechanism efficient and its results relevant. Each wiki_content chunk is then vectorized using the all-MiniLM-L6-v2 embedding model (see the evaluation results below for why we chose this model).
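As a minimal sketch of the chunking step, assuming chunks are formed by a simple word count (the 250-word default matches the chunk size reported in our evaluation; the function name `chunk_words` is hypothetical and not taken from our repository):

```python
def chunk_words(text: str, chunk_size: int = 250) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    # Step through the word list in strides of chunk_size.
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


if __name__ == "__main__":
    page = "word " * 600  # stand-in for the text of one wiki page
    chunks = chunk_words(page)
    print([len(chunk.split()) for chunk in chunks])  # [250, 250, 100]
```

With the sentence-transformers library, the resulting texts could then be embedded with `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(chunks)`. The summarization step is omitted here.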

For each chunk of data stored in our vector database, we include associated metadata derived from the Phoenix Project. This metadata adds context to each data point, such as the team’s institution, region, project title, abstract, and more (see ‘Example of Data Structure in Vector Store’).
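One way to picture a stored data point is as a record that pairs the summarized text and its vector with a metadata dictionary. The sketch below is illustrative only: apart from `wiki_content`, the class name, field names, and metadata keys are assumptions, and the values are placeholders rather than real entries.

```python
from dataclasses import dataclass, field


@dataclass
class WikiChunk:
    """One stored data point: summarized text, its vector, and team metadata."""
    wiki_content: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def filter_by_metadata(chunks: list[WikiChunk], **criteria: str) -> list[WikiChunk]:
    """Keep chunks whose metadata matches every key=value pair in criteria."""
    return [
        chunk for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in criteria.items())
    ]


# Placeholder values for illustration only, not real entries from the store.
example = WikiChunk(
    wiki_content="Summary of one wiki section.",
    embedding=[0.1, 0.2, 0.3],
    metadata={
        "institution": "Example University",
        "region": "Europe",
        "project_title": "Example Project",
        "abstract": "Example abstract.",
    },
)
```

Keeping the metadata next to each vector means a search can be narrowed, for example with `filter_by_metadata(chunks, region="Europe")`, before or after similarity ranking.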

Currently, our vector database is populated with data exclusively from the 2019 iGEM teams.
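The semantic search that this data supports can be sketched as a nearest-neighbour lookup over the stored vectors. The example assumes cosine similarity as the metric, which is a common choice for sentence embeddings but is not confirmed here. The function names and toy two-dimensional vectors are hypothetical, and a real vector database would handle this step internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
    store = [("chunk a", [1.0, 0.0]), ("chunk b", [0.0, 1.0]), ("chunk c", [1.0, 1.0])]
    for score, text in top_k([1.0, 0.0], store, k=2):
        print(f"{score:.3f}  {text}")
```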

Example of Data Structure in Vector Store

Here’s a look at how data is structured within our vector store interface:


Currently, 8,594 data points are stored in our vector store, which collectively represent all available team wikis from the year 2019.

Benchmarking and Evaluation

In the development of the iGEM Data Explorer, benchmarking and evaluation were essential to optimize the system’s performance. We utilized a large language model (LLM) as a judge to systematically test various configurations of the Retrieval-Augmented Generation (RAG) settings. Below is an overview of the methodologies and results from our evaluation:

Evaluation Setup

Embedding models compared:

  1. sentence-transformers/all-MiniLM-L6-v2
  2. sentence-transformers/all-mpnet-base-v2
  3. thenlper/gte-small

Evaluation Method

The effectiveness of each configuration was assessed based on how well the embedded vectors facilitated the accurate retrieval of information in response to simulated user queries. The normalized evaluation scores (from 1 to 5) were calculated to provide a comparative analysis of each setting’s performance.
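To compare settings, the per-query judge scores can be averaged per configuration. The sketch below assumes each result is a (configuration name, score) pair with the score already on the 1 to 5 scale. The function name, configuration labels, and scores are placeholders, not our actual evaluation output.

```python
from collections import defaultdict
from statistics import mean


def average_scores(results):
    """Average 1-5 judge scores per configuration.

    `results` is an iterable of (config_name, score) pairs.
    """
    by_config = defaultdict(list)
    for config, score in results:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for {config}: {score}")
        by_config[config].append(score)
    return {config: mean(scores) for config, scores in by_config.items()}


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    results = [("config_a", 4), ("config_a", 5), ("config_b", 3)]
    averages = average_scores(results)
    best = max(averages, key=averages.get)
    print(averages, best)
```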

Example of Evaluation

To illustrate our evaluation process, here is a specific example showing how the iGEM Data Explorer performs under one of the tested settings (embedding model: sentence-transformers/all-MiniLM-L6-v2; chunk size: 250 words):

Detailed results for this query, along with others under various settings, are compiled in src/evaluation/output. Each file in this directory corresponds to one of the tested RAG settings and shows how that configuration performed.

Results

The following graph displays the average evaluation scores across different settings:

Configurations using all-MiniLM-L6-v2 consistently showed strong performance, suggesting that this model was particularly effective at capturing semantic meaning in smaller, more focused chunks of text.
Currently, our application employs the sentence-transformers/all-MiniLM-L6-v2 model with a chunk size of 250 words.

Future Directions

Looking forward, we see plenty of opportunities to enhance the iGEM Data Explorer, addressing both current limitations and future technological advancements. Here are the key areas of focus:

How to Contribute

The iGEM Data Explorer project thrives on collaborative efforts from the community. Whether you are a developer, a researcher, or simply someone passionate about synthetic biology, there are numerous ways you can contribute:

For more details on how to get involved, visit our project page on GitLab or contact us directly through our project website. We are excited to welcome new contributors who share our vision of advancing synthetic biology research through better data tools.
