CoralGenie: RAG-Augmented Coral Knowledge Explorer
Software in iGEM should make synthetic biology based on standard parts easier, faster, better or more accessible to our community.
1. Background
ChatGPT is an advanced language model that has transformed natural language processing by generating coherent, context-aware, human-like text, opening new possibilities for human-computer interaction. Its strong performance across a wide range of language tasks and benchmarks has made it one of the leading language models worldwide [1]. However, while ChatGPT can address topic-specific inquiries by drawing on its training data and comprehending context, it is limited by data constraints, gaps in knowledge, and model biases; overcoming these limitations requires integrating external resources and committing to continuous learning to improve its accuracy [2]. Large language models such as InstructGPT also struggle to make use of the full contents of long documents when answering queries. LlamaIndex was introduced to address this issue by expanding the contextual scope of such models, incorporating large documents into response generation [3]. Building on this, our team set out to enhance ChatGPT's conversational functionality by expanding its corpus through LlamaIndex, enabling it to answer coral-related questions with greater professionalism.
2. Our Pipeline
Project Overview
In the educational component of our project this year, team members conducted several interviews with secondary school groups. The questions posed during these interviews included: "What are your usual sources of information regarding coral knowledge?" A common response was books and magazines. This prompted us to explore the possibility of integrating popular large language models with coral-related literature to design a chatbot capable of answering questions about coral knowledge based on the content of these books.
Data Selection
We consulted the school's online library and selected several books and magazines as the foundation for our large language model, using criteria such as high readership, strong professional relevance, and close correlation to coral-related topics. (If you wish to change the sources from which CoralGenie draws its answers, feel free to substitute any materials you deem appropriate.) Since we want LlamaIndex to answer user inquiries from a database built from our collected materials, treated as proprietary knowledge, a further challenge arises: converting the database's PDF files into a TXT format compatible with LlamaIndex.
LlamaParse Document Parser
The LlamaParse document parser processes PDF reports through several key steps. Initially, users load the target PDF file into the LlamaParse system, which decodes the file to read and parse its byte stream. The parser identifies text layers, distinguishing readable text from images and other non-text elements. During extraction, LlamaParse addresses character encoding issues, automatically recognizing and converting various formats (e.g., UTF-8, ISO-8859-1) to a unified encoding for enhanced readability. Once the text is extracted, it undergoes formatting to remove extraneous spaces, line breaks, and unnecessary characters. The output is structured and includes relevant metadata, such as page numbers and paragraphs, facilitating subsequent information retrieval.
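The whitespace-cleanup step described above can be sketched in plain Python. The function below is our own illustration of the idea, not part of the LlamaParse API: it joins hard-wrapped lines within a paragraph, collapses runs of spaces, and keeps blank lines as paragraph breaks.

```python
import re

def clean_extracted_text(raw: str) -> str:
    """Normalize whitespace in text extracted from a PDF page.

    Joins hard line breaks inside each paragraph, collapses runs of
    spaces and tabs, and preserves blank lines as paragraph breaks.
    """
    # Split into paragraphs on blank lines
    paragraphs = re.split(r"\n\s*\n", raw)
    cleaned = []
    for para in paragraphs:
        # Join the hard-wrapped lines of one paragraph into a single line
        joined = " ".join(line.strip() for line in para.splitlines())
        # Collapse repeated spaces/tabs left over from the PDF layout
        joined = re.sub(r"[ \t]+", " ", joined).strip()
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)

raw_page = "Coral reefs   are marine\necosystems.\n\nThey host  high\nbiodiversity."
print(clean_extracted_text(raw_page))
```

In the real pipeline this cleanup happens inside LlamaParse; the sketch only shows why the step matters for downstream chunking and retrieval.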
RAG Pipeline Integration
After text extraction, the content is divided into smaller segments using predefined separators, resulting in document nodes that retain associated metadata. These nodes are then integrated into a Retrieval-Augmented Generation (RAG) pipeline, enabling the organization and retrieval of information via a VectorStoreIndex and a reranking module.
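The splitting step above can be sketched as follows. The separator and chunk-size values are illustrative placeholders, not the exact parameters our pipeline uses, and the node structure is simplified to a plain dictionary.

```python
def split_into_nodes(text, page, separator="\n\n", chunk_size=200):
    """Split cleaned text into document nodes, each carrying metadata."""
    nodes = []
    buffer = ""
    for segment in text.split(separator):
        # Flush the buffer when adding the next segment would exceed chunk_size
        if buffer and len(buffer) + len(segment) > chunk_size:
            nodes.append({"text": buffer.strip(), "metadata": {"page": page}})
            buffer = ""
        buffer += segment + " "
    if buffer.strip():
        nodes.append({"text": buffer.strip(), "metadata": {"page": page}})
    return nodes
```

Keeping the page number in each node's metadata is what later lets the pipeline report which source page an answer came from.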
The output structure in the RAG pipeline is defined as a Pydantic model, encompassing response content, source page numbers, confidence levels, and explanations of confidence, ensuring that the information returned is both accurate and comprehensible.
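The response schema can be sketched with Python dataclasses. Our pipeline defines it as a Pydantic model; dataclasses are used here only to keep the example dependency-free, and the field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class RAGResponse:
    """Structured output returned by the RAG pipeline."""
    answer: str                                        # generated response content
    source_pages: list = field(default_factory=list)   # pages the answer draws on
    confidence: float = 0.0                            # reported confidence in [0, 1]
    confidence_reason: str = ""                        # brief explanation of the confidence level

resp = RAGResponse(
    answer="The Coral Triangle spans the waters of six countries.",
    source_pages=[12, 13],
    confidence=0.9,
    confidence_reason="Directly supported by the indexed source text.",
)
```

Forcing every answer into this structure is what guarantees that CoralGenie's replies always carry their source pages and a stated confidence, rather than free-form text alone.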
Connecting Data Sources
Subsequently, LlamaIndex connects to these data sources and integrates the processed data with the existing data accessible to LLMs, a process commonly referred to as Retrieval-Augmented Generation (RAG). RAG is a critical framework that enhances the capabilities of large language models (LLMs) by incorporating user-specific data into the generation process. While LLMs are trained on vast datasets, they often lack access to specialized or recent information. RAG addresses this limitation by incorporating a user’s data to provide relevant context during the model's response generation.
Stages in RAG
The key stages in Retrieval-Augmented Generation (RAG) begin with Loading, where data is gathered from diverse sources such as text files, PDFs, databases, and APIs, facilitated by LlamaHub's numerous connectors. Next, the Indexing stage organizes the loaded data into a query-friendly structure, typically by creating vector embeddings that represent the data's meaning, along with additional metadata to improve retrieval accuracy. Once indexed, the Storing stage saves both the index and its associated metadata to avoid redundant re-indexing, improving efficiency. In the Querying stage, users can employ various strategies, including sub-queries and multi-step queries, to extract tailored information from the indexed data. Finally, the Evaluation stage measures the effectiveness of the retrieval process, providing objective assessments of the accuracy, relevance, and speed of the responses generated for user queries. The figure below illustrates the primary RAG workflow, clarifying the interrelationships and functions among these stages; a further figure details the LlamaParse parsing process.
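A toy version of the Indexing and Querying stages can make the workflow concrete. The sketch below uses bag-of-words term-frequency vectors and cosine similarity in place of learned embeddings; a real deployment would use LlamaIndex's VectorStoreIndex with model-generated embeddings, and the sample node texts are our own.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexing: embed each document node once and store the vectors
nodes = [
    "The Coral Triangle hosts over 600 species of reef-building corals.",
    "Coral bleaching is driven by elevated sea surface temperatures.",
]
index = [(node, embed(node)) for node in nodes]

# Querying: embed the question and return the best-matching node
def query(index, question):
    q = embed(question)
    return max(index, key=lambda pair: cosine(q, pair[1]))[0]

print(query(index, "How many coral species live in the Coral Triangle?"))
```

The Storing stage would simply persist `index` to disk so the embedding step is not repeated on every run, and the Evaluation stage would score retrieved nodes against reference answers.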
Coral Knowledge Model
Our large language model dedicated to coral knowledge uses LlamaIndex as a GPT-based indexing system, aimed at providing context-enhanced natural language processing capabilities. LlamaIndex works by encoding large text collections, such as internet documents, books, and news articles, with a pre-trained language model, then building an index over these encoded representations for quick retrieval of relevant information when needed. By feeding coral-related literature into the LlamaIndex framework, this approach enables more accurate and efficient information retrieval and natural language processing. The result is greater professionalism in handling coral-related queries and well-founded, substantiated responses, improving the performance and expertise of generative AI applications.
3. Evaluation
Testing CoralGenie
After constructing CoralGenie using the steps above, we posed the same question, "What is the Coral Triangle reef?", to both ChatGPT-4.0 Mini and our CoralGenie. The response from ChatGPT-4.0 Mini was:
The Coral Triangle is a marine area located in Southeast Asia, renowned for its extraordinary biodiversity. It encompasses parts of Indonesia, Malaysia, Papua New Guinea, the Philippines, Solomon Islands, and Timor-Leste. This region is home to more than 600 species of reef-building corals and over 2,000 species of fish, making it the most diverse marine ecosystem on the planet. The Coral Triangle plays a crucial role in supporting local fisheries, protecting coastal communities, and maintaining marine health, but it is threatened by overfishing, climate change, and habitat degradation. Conservation efforts are essential to safeguard its unique biodiversity and the livelihoods of millions of people who depend on these ecosystems.
And the response from our CoralGenie was:
The Coral Triangle reef is a region in the western Pacific Ocean that is known for its high marine biodiversity and is considered the global center of marine life. It encompasses the waters of six countries: Indonesia, Malaysia, the Philippines, Papua New Guinea, Timor-Leste, and the Solomon Islands [4].
Advantages of CoralGenie
Compared to ChatGPT-4.0 Mini, the responses generated by our large language model (LLM) exhibit several advantages in clarity and detail. Our LLM clearly defines the Coral Triangle as a distinct region located in the western Pacific Ocean, explicitly emphasizing its geographical boundaries, which helps to situate the Coral Triangle within a broader geographical context. The response identifies the Coral Triangle as the "global center of marine life," thereby highlighting its importance not only within Southeast Asia but also on a global scale, underscoring the region's ecological relevance.
Additionally, the LLM’s answer is notably concise, effectively conveying key information without extraneous details, facilitating a quick understanding of the essential points. By referencing specific academic literature, the information gains credibility and authority, enhancing reader trust. Furthermore, the LLM directly underscores high marine biodiversity as a defining characteristic of the Coral Triangle, providing a clear context for its ecological significance. Overall, these attributes contribute to a more effective and informative response that meets the standards of academic rigor and clarity.
Future Directions
In our research, we have successfully incorporated literature on coral-related knowledge into a large language model, significantly improving the professionalism of its answers. Our method has shown higher accuracy and efficiency in handling coral-related queries. In the future, we will work on quantifying the expected results, such as measuring gains in information-retrieval accuracy or reductions in processing time, to further improve the model's performance. We also plan to expand the model's application scenarios to demonstrate its broader applicability and value.
Research Outcomes
These advancements will be detailed and analyzed in our paper, aiming to showcase our research findings and future directions. We will refine and expand the existing model's case examples or application scenarios to make them more persuasive and practical. Through these efforts, we aim to provide a high-quality solution for knowledge processing in the coral field and promote the application and development of artificial intelligence in practice.
References:
[1] Brown, T. B., Mann, B., Ryder, N., et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
[2] Wu, T., et al. A Brief Overview of ChatGPT: The History, Status Quo and Potential Future Development. IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, May 2023. doi:10.1109/JAS.2023.123618.
[3] Zirnstein, B. (2023). Extended Context for InstructGPT with LlamaIndex.
[4] Martin-Garin, B., & Montaggioni, L. F. (2023). Corals and Reefs: From the Beginning to an Uncertain Future (1st ed.). Coral Reefs of the World.