Software

Introduction

Synthetic biology is growing thanks to new design methods that let researchers reuse existing parts to customize systems more efficiently. Standardized biological parts are vital to this field, and the iGEM community has helped by creating a large collection of such parts, making it easier for others to access and reuse them. However, the sheer size of the BioBrick registry can make finding the right part difficult.

Recently, large language models have advanced quickly, especially at semantic querying, i.e., finding the right term from a free-text description. We decided to use these models to help search for BioBricks. Last year, we developed Ask NOX, a platform that lets users describe what they need in plain language, making it easier to find the right components. However, that model had some issues:

  1. Lack of good, well-labeled data. We only had part of the BioBrick data, and too much important information was lost during data cleaning, leading to inaccurate results.
  2. No user feedback loop. Gathering and analyzing user feedback is crucial for improving a search tool, but Ask NOX did not collect any.
  3. An outdated model. The underlying model was no longer state of the art; better reverse-dictionary models have since been published.

This year, we've developed a better model called Ask Lantern, trained with more complete data, and added a feature to collect user feedback, helping us continuously improve the tool.

Usage

Welcome to our online platform! You can try the service directly on the webpage we provide.

To search for a BioBrick:

  1. Enter the description of the BioBrick you require.
  2. Click the search button.
  3. Wait for the model to display the results.

The search results are presented line by line, each containing the name of a potentially matching BioBrick and a brief description. You can click the name link to view detailed information about that BioBrick. If you are satisfied with a search result, please click the thumbs-up icon next to it. Your feedback will help us continuously improve the model's performance.

All of our source code is open-sourced on GitLab. If you wish to deploy the platform yourself, please refer to the README file in the repository for instructions. If you want to use the API for rapid development, click the button at the bottom of the webpage for detailed API documentation.
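For example, since the front end is built with Gradio, the platform can also be queried from Python with the gradio_client package. The sketch below shows the general pattern only; the server URL and endpoint name are placeholders, so please check the API page on the website for the actual values.

```python
# A minimal sketch of calling the platform programmatically with Gradio's
# Python client. The server URL and api_name below are placeholders, not
# the actual Ask Lantern deployment details.
from gradio_client import Client

client = Client("https://example-asklantern.igem.org")   # placeholder URL
results = client.predict(
    "a constitutive promoter for E. coli",   # free-text description of the part you need
    api_name="/predict",                     # assumed default Gradio endpoint name
)
print(results)
```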

Method

Workflow of Software

A reverse dictionary (also called an inverted dictionary) identifies the appropriate target term from a description, relying on semantic matching. Research in this area has grown substantially in recent years, driven by large-scale pre-trained models such as BERT (Devlin et al., 2019), which stacks multiple Transformer encoder layers. These models have been adapted to reverse dictionary tasks, enabling effective reverse lookups and supporting cross-lingual queries (Yan et al., 2020).

We used BioBrick data from parts.igem.org, which was collected by Team Fudan 2022 in their repository. By fine-tuning a reverse dictionary model with this data, we created a semantic space for BioBricks. When a user inputs a query, BERT encodes it and maps it into this space. The model then finds and returns the most relevant BioBricks based on their proximity in this space.
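Conceptually, the retrieval step looks like the sketch below: a fine-tuned BERT encoder maps the query into the semantic space, and the closest BioBrick embeddings are returned. The checkpoint and file names here are illustrative placeholders rather than the actual Ask Lantern artifacts.

```python
# Minimal sketch of the retrieval step, assuming a fine-tuned BERT encoder and a
# precomputed matrix of BioBrick embeddings. Model and file names are placeholders.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Encode a description into the shared semantic space ([CLS] vector)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] embedding, shape (1, hidden)

# brick_vectors: (num_bricks, hidden) embeddings of all BioBrick descriptions
# brick_names:   the corresponding part names, e.g. "BBa_K1021005"
brick_vectors = torch.load("brick_vectors.pt")            # hypothetical file
brick_names = open("brick_names.txt").read().split()      # hypothetical file

def search(query: str, top_k: int = 10) -> list[str]:
    """Return the top_k BioBricks closest to the query in the semantic space."""
    q = torch.nn.functional.normalize(embed(query), dim=-1)
    m = torch.nn.functional.normalize(brick_vectors, dim=-1)
    scores = (q @ m.T).squeeze(0)          # cosine similarity to every BioBrick
    best = scores.topk(top_k).indices
    return [brick_names[i] for i in best]

print(search("constitutive promoter for E. coli"))
```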

Result

To test the model's accuracy, we split the data into two groups: a "seen" group used for training and an "unseen" group held out for testing. The trained model was then evaluated on both the seen inputs ("test_seen") and the held-out inputs ("test_unseen"). Each test prompt corresponds to one BioBrick as its label. We ranked the returned BioBricks by relevance and measured accuracy by whether the correct BioBrick appeared within a given rank (top 1, top 10, top 20). Our wet lab team also wrote queries based on their actual needs (the "test_human" group); in most cases they found a suitable BioBrick within the first ten results. The test results of our reverse dictionary model for BioBricks are as follows:

test data      top-1 hit rate   top-10 hit rate   top-20 hit rate
test_seen      0.971            0.996             0.998
test_unseen    0.047            0.338             0.433
seen+unseen    0.509            0.667             0.716
test_human     0.476            0.634             0.794

The top-10 hit rate is the fraction of queries for which the correct BioBrick appears among the first ten results; the top-1 and top-20 hit rates are defined analogously.
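For clarity, the hit rates above can be computed as in the short sketch below, assuming each query has a single gold BioBrick label and the model returns a relevance-ranked list of part names.

```python
# Illustrative computation of the top-k hit rates reported above; the part
# names used in the example are placeholders.
def hit_rate(ranked_results: list[list[str]], labels: list[str], k: int) -> float:
    """Fraction of queries whose gold BioBrick appears in the top k results."""
    hits = sum(label in ranked[:k] for ranked, label in zip(ranked_results, labels))
    return hits / len(labels)

# Example with two queries: the first is a top-1 hit, the second is missed entirely.
ranked = [["BBa_K1021005", "BBa_B0034"], ["BBa_J23100", "BBa_B0015"]]
gold = ["BBa_K1021005", "BBa_K3963005"]
print(hit_rate(ranked, gold, k=1))   # 0.5
print(hit_rate(ranked, gold, k=10))  # 0.5
```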

Evaluation

After we trained the model, our wet lab teammates incorporated it into their work. The model helps them quickly locate the BioBricks they need when designing their experimental setups, such as BBa_K1021005 and BBa_K3963005, saving them time.

User-friendliness

Ask Lantern is designed to be user-friendly, enabling researchers without a computer science background to search BioBricks easily. The front end uses Gradio to provide an intuitive web user interface and APIs. Our software is compatible with most modern browsers, including Chrome, Firefox, Edge, and Safari. The source code is also available on GitLab, allowing users to deploy and modify it according to their needs.
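As a rough illustration of this design, a Gradio front end for the retrieval routine can be wired up as in the sketch below. The component layout and labels are illustrative rather than the exact Ask Lantern interface, and `search` refers to the retrieval sketch in the Method section.

```python
# Minimal sketch of a Gradio front end wrapping the retrieval routine.
# The layout and labels are illustrative, not the exact Ask Lantern UI.
import gradio as gr

def lookup(description: str) -> str:
    hits = search(description, top_k=10)   # retrieval routine sketched earlier
    return "\n".join(hits)

demo = gr.Interface(
    fn=lookup,
    inputs=gr.Textbox(label="Describe the BioBrick you need"),
    outputs=gr.Textbox(label="Candidate BioBricks"),
    title="Ask Lantern",
)

demo.launch()  # launching also exposes a programmatic client API
```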

Future

In the future, we aim to apply our model and workflow to other databases, such as UniProt, to assist researchers in sequence searches. Additionally, we plan to develop a publicly accessible model for the iGEM community, which would facilitate experimental design and enhance model iteration.

We have also noted that many teams, such as the DiKST project (Leiden, 2021) and the PartHub project (Fudan, 2022), are actively developing software for searching and visualizing BioBrick relationships. We hope to collaborate with these projects to provide more user-friendly software to the iGEM community. By working together, we can integrate various functionalities and create a comprehensive toolset that improves the experience of users working on synthetic biology projects.

References

  1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, pages 4171-4186.
  2. Slaven Bilac, Wataru Watanabe, Taiichi Hashimoto, Takenobu Tokunaga, and Hozumi Tanaka. 2004. Dictionary Search Based on the Target Word Description. In Proceedings of NLP.
  3. Hang Yan, et al. 2020. BERT for Monolingual and Cross-Lingual Reverse Dictionary. arXiv preprint, 30 Sept. 2020.
  4. Team Fudan. 2022. GitLab repository. https://gitlab.igem.org/2022/software-tools/fudan