Software

The RAG ChatBot is a software tool developed for the iGEM competition, designed to enhance information retrieval and question answering for synthetic biology research teams. By leveraging natural language processing (NLP) and machine learning techniques, this chatbot provides a robust platform for querying and retrieving relevant information from user-provided datasets. It is a standalone tool; we are not applying for the Software award with it.

Motivation

The primary goal of this side-project is to make information retrieval and decision-making in synthetic biology research more efficient. Imagine you are a Dry Lab member who isn't up to date with what's happening in the Wet Lab. You could deploy this RAG Q/A ChatBot to answer any questions you might have about what the other parts of your research team are working on, and you don't even have to wake them up at 4 AM just because you are a night owl!

Features

Below are some of the features of the RAG Q/A ChatBot.

It is completely free!


This chatbot uses Hugging Face to query an open-source LLM hosted on their platform. LLMs can be huge, in some cases taking up hundreds of gigabytes of space; however, accessing the model through Hugging Face's hosted endpoint lets you use it without ever downloading it! All you need is to create a Hugging Face account and generate a Hugging Face token (which ensures people don't misuse the LLM), and you are good to go! You don't need to pay anyone anything!

Privacy


Since the chatbot runs on your own machine rather than on a third-party server, your research documents and the vector database built from them never leave your computer. Only your question and the small snippets of retrieved context are sent to the Hugging Face endpoint that generates the answer, so the bulk of your hard work stays with you!

Advanced NLP Integration


The chatbot utilizes Hugging Face's BGE embeddings model (via the HuggingFaceBgeEmbeddings class) for efficient sentence embeddings, ensuring accurate and context-aware responses.
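Sentence embeddings map text to vectors so that semantically similar sentences land close together, which is what makes retrieval work. The toy example below illustrates the idea with tiny hand-made vectors and cosine similarity; the vectors and sentences are purely illustrative stand-ins for what a real BGE model would produce.

```python
import math

# Toy 3-dimensional "embeddings" -- illustrative stand-ins for the
# high-dimensional vectors a real BGE model would produce.
embed = {
    "How do I run a PCR?":          [0.9, 0.1, 0.0],
    "What is the PCR protocol?":    [0.8, 0.2, 0.1],
    "Where is the coffee machine?": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = embed["How do I run a PCR?"]
# The PCR-related sentence scores far higher than the unrelated one.
for text, vec in embed.items():
    print(f"{cosine_similarity(query, vec):.3f}  {text}")
```

In the real chatbot these similarity scores are what decide which chunks of your documents get handed to the LLM as context.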

Vector Store Management


Employing the FAISS (Facebook AI Similarity Search) library, the chatbot maintains a vector store that supports rapid similarity searches, retrieving the most relevant documents for each user query. Searching a vector store this way is much faster than scanning the raw text files for every question.
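Conceptually, FAISS answers the question "which stored vectors are closest to this query vector?" The brute-force sketch below shows that core idea (the document titles and tiny vectors are made up for illustration); FAISS does the same job with optimized index structures that stay fast at millions of vectors.

```python
# Brute-force nearest-neighbour search: what FAISS does, minus the speed.
# Titles and vectors are made-up illustrations, not real embeddings.
documents = [
    ("Wet Lab protocol for plasmid extraction", [1.0, 0.0, 0.2]),
    ("Dry Lab model of promoter strength",      [0.1, 1.0, 0.3]),
    ("Team meeting notes from June",            [0.2, 0.3, 1.0]),
]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query_vector, k=2):
    """Return the k documents whose vectors are closest to the query."""
    ranked = sorted(documents, key=lambda doc: squared_distance(doc[1], query_vector))
    return [text for text, _ in ranked[:k]]

# A query vector near the first document retrieves it first.
print(search([0.9, 0.1, 0.1]))
```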

Customizable Prompt Templates


The chatbot uses a customizable prompt template to guide the language model, ensuring responses are grounded in the provided context and adhere to specific response rules.
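A prompt template is simply text with slots for the retrieved context and the user's question. The template wording below is a hypothetical example in the spirit of the one shipped in 'chatbot.py', not the exact shipped template:

```python
# Hypothetical prompt template; the wording of the rules is illustrative.
PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    """Fill the template's slots with retrieved context and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    context="The Wet Lab finished cloning the reporter construct on Tuesday.",
    question="When was the reporter construct cloned?",
)
print(prompt)
```

Grounding rules like "use ONLY the context" are what keep the model's answers tied to your documents instead of its general training data.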

Technical Implementation

The RAG ChatBot is built using the LangChain library, which facilitates the integration of the various components needed for retrieval-augmented generation. Key components include:
(1). HuggingFace Endpoint: The chatbot connects to the HuggingFace API using an endpoint configured with the Phi-3-mini-4k-instruct model, optimized for generating informative and contextually relevant answers.
(2). Document Processing: Documents are processed using a Recursive Character Text Splitter to create manageable chunks for embedding and retrieval, enhancing the system's ability to handle large datasets efficiently.
(3). Security and Efficiency: The system ensures secure handling of API tokens and employs efficient algorithms for document retrieval, prioritizing both performance and security.
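The document-processing step (2) can be illustrated with a simplified splitter. LangChain's RecursiveCharacterTextSplitter is more sophisticated (it supports chunk overlap and length functions), but the sketch below shows the core idea: recursively break text on ever-smaller separators until every chunk fits a size limit.

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters,
    preferring to break on the earliest separator in the list."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                elif len(part) > chunk_size:
                    # The piece is still too big on its own: recurse on it
                    # with the remaining, finer-grained separators.
                    if current:
                        chunks.append(current)
                    chunks.extend(recursive_split(part, chunk_size, separators))
                    current = ""
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator found at all: fall back to a hard cut.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Chunking matters because the embedding model and the LLM's context window both work best on short, focused passages rather than whole documents.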

Applications

The RAG ChatBot is particularly beneficial for synthetic biology teams participating in the iGEM competition, offering:
(1). Enhanced Research Capabilities: By providing quick access to relevant research documents, the chatbot aids in hypothesis generation and experimental design.
(2). Educational Resource: The chatbot serves as an educational tool, helping team members understand complex biological concepts through interactive Q&A sessions.
(3). Collaboration and Communication: Facilitates better communication among team members by providing a centralized platform for information retrieval and sharing.

Setting up RAG ChatBot

This section will guide you through the setup required to get the chatbot up and running, including installing necessary libraries and obtaining a Hugging Face API token.

1. Install Required Libraries


First, download the 'chatbot.py' and 'vectorizer.py' files from the button below. Then, place them inside a separate empty folder.

To run the RAG ChatBot, you need to install several Python libraries. Open a command prompt or terminal inside that folder (or in your IDE's terminal) and execute the following commands:

pip install langchain
pip install langchain-community
pip install langchain-huggingface


These commands install the LangChain and Hugging Face libraries the chatbot needs. Depending on your environment, you may also need to install 'faiss-cpu' (for the vector store) and 'sentence-transformers' (for the BGE embeddings) if they are not pulled in automatically.

2. Obtain a Hugging Face API Token


To use Hugging Face models, you need an API token. Follow these steps to create one:

Visit the Hugging Face website and log in to your account. If you don't have an account, sign up for free.

Once logged in, navigate to your profile and select "Settings."

In the settings menu, find "Access Tokens" and create a new token. Copy this token as you will need it to access Hugging Face models.

3. Set Up the ChatBot


With the libraries installed and the API token ready, you can now set up the chatbot:

Open the 'chatbot.py' file and replace the placeholder API token with your actual Hugging Face API token:

API_TOKEN = "your_huggingface_api_token_here"

Save the changes.
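Hardcoding the token works, but it is easy to leak if you ever share or publish the file. A common alternative is to read the token from an environment variable instead; the variable name HUGGINGFACEHUB_API_TOKEN used below is an assumption (it is also the name LangChain's Hugging Face integration checks by default):

```python
import os

# Prefer an environment variable over a hardcoded token; fall back to the
# placeholder so the script can still tell you what is missing.
API_TOKEN = os.environ.get("HUGGINGFACEHUB_API_TOKEN",
                           "your_huggingface_api_token_here")
if API_TOKEN == "your_huggingface_api_token_here":
    print("Warning: no Hugging Face token set; endpoint calls will fail.")
```

With this pattern you can share 'chatbot.py' freely and keep the secret in your shell configuration.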

4. Running the ChatBot


In the folder where you placed both files, create a new folder called 'text', and inside it a new text file called 'train.txt'. This is where you place all the information that you want the ChatBot to use to answer questions.

Then run 'vectorizer.py', which converts the text file into a vector database that the ChatBot can search quickly. A separate folder containing the database will be created.

Finally, run 'chatbot.py' and ask away! It uses a simple input statement to read your question from the command line.
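The interaction in 'chatbot.py' boils down to a read-question, print-answer loop. The sketch below is a generic version with an assumed 'exit' command and a pluggable answer function (the lambda stands in for the real retrieval chain); ask/reply are parameters so the loop can be driven from a script as well as a live terminal. The actual script may differ.

```python
def chat_loop(answer_fn, ask, reply):
    """Read questions until the user types 'exit', answering each one.
    Pass ask=input, reply=print for a live terminal session."""
    while True:
        question = ask("Your question (or 'exit'): ").strip()
        if question.lower() == "exit":
            reply("Goodbye!")
            break
        reply(answer_fn(question))

# Drive the loop with scripted questions; the lambda stands in for the
# real retrieval-augmented answer chain.
questions = iter(["What is the Wet Lab doing?", "exit"])
replies = []
chat_loop(lambda q: f"(stub answer for: {q})",
          ask=lambda prompt: next(questions),
          reply=replies.append)
print(replies)
```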

By following these steps, you can successfully set up and use the RAG ChatBot. This setup allows you to leverage powerful AI models for efficient information retrieval and decision-making in synthetic biology research.

Future Developments

Future iterations of the RAG ChatBot will focus on expanding its dataset integration capabilities, improving the accuracy of its NLP components, and enhancing the user interface for a more intuitive experience. This work could be carried out by other teams in the future, or by us, time permitting, once iGEM 2024 is over.