Synthetic biology is growing thanks to new design methods that let researchers build customized systems more efficiently from existing parts. Standardized biological parts are vital to the field, and the iGEM community has built a large collection of them, making these parts easy to access and reuse. However, the sheer size of the BioBrick database can be overwhelming: there are simply too many options to browse.
Recently, large language models have advanced rapidly, particularly at semantic querying, that is, finding the right term from a description of its meaning. We decided to apply these models to BioBrick search. Last year, we developed Ask NOX, a platform that lets users describe the part they need in plain language, making it easier to find the right components. However, that model had several shortcomings.
This year, we've developed a better model called Ask Lantern, trained with more complete data, and added a feature to collect user feedback, helping us continuously improve the tool.
Welcome to our online platform! You can start experiencing our service directly via the webpage we provide.
To search for a BioBrick, enter a plain-language description of the part you need and submit your query. The search results are presented line by line, each containing the name of a potentially matching BioBrick and a brief description. You can click a name link to view detailed information about that BioBrick. If any result satisfies your need, please click the thumbs-up icon next to it; your feedback will help us continuously improve our model's performance.
All of our source code has been open-sourced on GitLab. If you wish to deploy the platform yourself, please refer to the README file in the repository for instructions. If you want to utilize the API for quick development, you can click on the button at the bottom of the webpage to access detailed information about the API.
A reverse dictionary identifies the appropriate target term from a description of its meaning, relying on semantic matching. Research in this area has grown substantially in recent years. The development of large-scale pre-trained models, such as BERT (Devlin et al., 2019), which stacks multiple Transformer encoder layers, has been pivotal: these models have been adapted to reverse-dictionary tasks, enabling effective reverse lookups and supporting cross-lingual queries.
We used BioBrick data from parts.igem.org, collected by Team Fudan 2022 in their repository. By fine-tuning a reverse-dictionary model on this data, we created a semantic space for BioBricks. When a user inputs a query, BERT encodes it and maps it into this space; the model then finds and returns the most relevant BioBricks based on their proximity in that space.
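The retrieval step can be sketched as follows. This is a minimal illustration, not our training code: it assumes the BioBrick embeddings have already been computed by the fine-tuned encoder (toy random vectors stand in for them here), and it ranks candidates by cosine similarity to the encoded query, the same nearest-neighbour principle used in the learned semantic space.

```python
import numpy as np

def rank_biobricks(query_vec, part_vecs, part_names, top_k=10):
    """Rank candidate BioBricks by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    p = part_vecs / np.linalg.norm(part_vecs, axis=1, keepdims=True)
    scores = p @ q                       # cosine similarity per part
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(part_names[i], float(scores[i])) for i in order]

# Toy example with hypothetical 4-dimensional embeddings.
rng = np.random.default_rng(0)
names = ["BBa_K1021005", "BBa_K3963005", "BBa_E0040"]
parts = rng.normal(size=(3, 4))
query = parts[1] + 0.05 * rng.normal(size=4)  # a query close to the 2nd part
print(rank_biobricks(query, parts, names, top_k=3))
```

In the real system the query embedding comes from the fine-tuned BERT encoder rather than toy vectors; the ranking logic is the same.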
To evaluate the model's accuracy, we split the data into two groups: "test_seen" (descriptions used during training) and "test_unseen" (descriptions held out from training). The model was trained on the "test_seen" group and then evaluated on both groups. Each test prompt corresponds to one BioBrick as its label. We sorted the output BioBricks by relevance and measured accuracy by the rank of the correct BioBrick (e.g., top 1, top 10, top 20). Our wet lab team also wrote queries based on their own needs (the "test_human" group); in most cases they found a suitable BioBrick within the first ten results. The results of our reverse dictionary model for BioBricks are as follows:
| test data   | top-1 hit rate | top-10 hit rate | top-20 hit rate |
|-------------|----------------|-----------------|-----------------|
| test_seen   | 0.971          | 0.996           | 0.998           |
| test_unseen | 0.047          | 0.338           | 0.433           |
| seen+unseen | 0.509          | 0.667           | 0.716           |
| test_human  | 0.476          | 0.634           | 0.794           |
The top-10 hit rate is the probability that the BioBrick you want appears among the first ten results returned by the model; the other hit rates are defined analogously.
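The hit-rate metric can be computed with a short helper. A sketch under the assumption that each test query has exactly one correct BioBrick label and the model returns a ranked list of candidates:

```python
def hit_rate_at_k(ranked_lists, labels, k):
    """Fraction of queries whose correct BioBrick appears in the top-k results."""
    hits = sum(1 for ranked, label in zip(ranked_lists, labels)
               if label in ranked[:k])
    return hits / len(labels)

# Toy example: three queries, each with the same gold label "BBa_A".
ranked = [
    ["BBa_A", "BBa_B", "BBa_C"],  # gold ranked 1st
    ["BBa_B", "BBa_A", "BBa_C"],  # gold ranked 2nd
    ["BBa_C", "BBa_B", "BBa_A"],  # gold ranked 3rd
]
gold = ["BBa_A", "BBa_A", "BBa_A"]
print(hit_rate_at_k(ranked, gold, 1))  # only the first query hits at k=1
print(hit_rate_at_k(ranked, gold, 3))  # all three hit at k=3
```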
After training the model, our wet lab teammates incorporated it into their work. It helps them quickly locate the BioBricks they need when designing their experimental setups, such as BBa_K1021005 and BBa_K3963005, saving them time.
Ask Lantern is designed to be user-friendly, enabling researchers without a computer science background to search BioBricks easily. The front end uses Gradio to provide an intuitive web user interface and APIs. Our software is compatible with most modern browsers, including Chrome, Firefox, Edge, and Safari. The source code is available on GitLab, allowing users to deploy and modify it according to their needs.
In the future, we aim to apply our model and workflow to other databases, such as UniProt, to assist researchers in sequence searches. Additionally, we plan to develop a publicly accessible model for the iGEM community, which would facilitate experimental design and enhance model iteration.
We have also noted that many teams, such as the DiKST project (Leiden, 2021) and the PartHub project (Fudan, 2022), are actively developing software for searching and visualizing BioBrick relationships. We hope to collaborate with these projects to provide more user-friendly software to the iGEM community. By working together, we can integrate our various functionalities into a comprehensive toolset that enhances the experience of users engaged in synthetic biology projects.