Overview
The figure below visualizes the lifecycle of each issue encountered during the development of our model. Each glowing line marks one issue of major significance to Prometheus. The update log below the figure shows how the model was built up step by step.
Initial Design - Knowledge Graph
Data Source: Data scraped from the web (parts.igem.org), with no cleaning.
Graph Design:
- ※ Node Types: Category (catalog), Component (part).
- ※ Internal Attributes: Original webpage text (webpage), extracted using HTML2txt.
- ※ Relationships: [Component] belongs to [Category]; [Component] is a twin of, or uses, another [Component].
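The graph design above can be pictured with a minimal in-memory sketch. The class names, method names, and part IDs below are hypothetical and chosen only for illustration; the source does not say how the graph was stored, and treating the twin relation as symmetric is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str


@dataclass
class Component:
    part_id: str
    webpage: str = ""                               # raw page text (the "webpage" attribute)
    categories: set = field(default_factory=set)    # "belongs to" edges
    twins: set = field(default_factory=set)         # "twin" edges
    uses: set = field(default_factory=set)          # "uses" edges


class KnowledgeGraph:
    """Toy container for the two node types and three relationship types."""

    def __init__(self):
        self.categories = {}
        self.components = {}

    def add_component(self, part_id, webpage=""):
        # Return the existing node if present, otherwise create it.
        return self.components.setdefault(part_id, Component(part_id, webpage))

    def belongs_to(self, part_id, category):
        self.categories.setdefault(category, Category(category))
        self.add_component(part_id).categories.add(category)

    def add_twin(self, a, b):
        # Assumption: the twin relation is symmetric.
        self.add_component(a).twins.add(b)
        self.add_component(b).twins.add(a)

    def add_use(self, user, used):
        self.add_component(used)
        self.add_component(user).uses.add(used)

    def components_in(self, category):
        return sorted(c.part_id for c in self.components.values()
                      if category in c.categories)
```

With only these three edge types, a query such as `components_in("promoter")` is about as far as traversal goes, which matches the "too simple" network structure noted in the issues for this version.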
Introduction to Knowledge Graph.
Issues:
- ※ Internal attributes are chaotic and cannot accurately describe the functions and characteristics of components (fixed in V1.0)
- ※ The network structure of relationships is too simple, making global indexing and related reasoning inconvenient (fixed in V1.0)
V1.0
Improvements:
- ※ Changed internal attributes to: a description cleaned and extracted from the webpage (using ChatGLM3), and the nucleotide sequence (sequence) (found in V0.0)
Transformer Principles:
- ※ Used Sent2Vec for global matching (found in V0.0).
Sent2Vec Principles:
Highlights:
- ※ Introduced language models: Sent2Vec enables fuzzy semantic matching, and ChatGLM3 removes irrelevant content from the webpage.
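Global matching of this kind usually comes down to embedding the query and every component description, then ranking by cosine similarity. The sketch below illustrates only that ranking step. The `embed` function is a bag-of-words stand-in, not Sent2Vec, so it matches shared words rather than meaning; all function names and part IDs are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def global_match(query, descriptions, top_k=3):
    """Score the query against every description and return the best top_k."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), part_id)
              for part_id, text in descriptions.items()]
    # Highest similarity first; ties broken by part ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(part_id, round(score, 3)) for score, part_id in scored[:top_k]]
```

Swapping `embed` for a real sentence encoder would leave `global_match` unchanged, which is the main reason for separating the two steps.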
Issues:
- ※ Extraction results have inconsistent formats and lengths, making it difficult to capture key content (fixed in V1.1).
- ※ Cannot handle long texts (fixed in V2.3).
V1.1
HP: Collected expectations for the large model design and organized dataset annotations.
Improvements:
- ※ Collected manually annotated data and fine-tuned ChatGLM3 (found in V1.0).
Fine-tuning Principles.
Highlights:
- ※ Unified output formats and lengths.
- ※ Output is more likely to include the key descriptive sentences.
Issues:
- ※ (HP) Need to showcase the model externally, dynamically and in real time, which requires a user-friendly and convenient interface (fixed in V1.2).
- ※ (HP) During the annotation process, discovered that the database content was incomplete, not fully covering all components recorded by iGEM (fixed in V2.0).
- ※ Output content contains excessive noise (e.g., experimental details, component evaluations) (fixed in V2.3).
V1.2
HP: CCIC/Roundtable Meeting.
Improvements:
- ※ Created a frontend webpage for using the model (found in V1.1).
Issues:
- ※ (HP) From discussions with other teams, learned that the prompt design needed improvement: description extraction should pay more attention to expression context and incorporate safety and ethical standards (fixed in V1.3).
- ※ (HP) Other teams also expressed a need for a way to submit feedback and modifications (fixed in V1.3).
- ※ The frontend is a thin wrapper around the interface, so users need some biological knowledge and the ability to learn the tool (fixed in V3.0).
V1.3
Improvements:
- ※ Added an environment field to [Component] to indicate its expression environment (found in V1.2).
- ※ Improved the prompt (found in V1.2).
Prompt Design Ideas:
- ※ Provided a feedback page (found in V1.2).
Highlights:
- ※ Introduced a component scoring feature, intended to support ranking the components returned by fuzzy searches.
V2.0
Event: Tried out the kernel model released by Asimov and learned about downstream task design (complete plasmid design).
HP: Contacted Asimov and learned about related database interfaces.
Improvements:
- ※ (HP) Integrated with the official iGEM database, consolidated two databases, increasing the number of components from 16k to 80k (found in V1.1).
Issues:
- ※ Inconsistencies in the [Category] field between the two databases: (1) some categories conflict; (2) some categories are not included in the original database; (3) some components are uncategorized (fixed in V2.2).
- ※ Cannot yet design complete plasmids (fixed in V2.1).
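The three kinds of category inconsistency listed above can be detected mechanically while merging. The following is a minimal sketch, assuming each database can be reduced to a `{part_id: category or None}` mapping. The function name, the data layout, and the choice to keep the originally scraped category when the two sources disagree are all assumptions, not the team's documented procedure.

```python
def merge_parts(scraped, official):
    """Merge two {part_id: category-or-None} maps and report category issues.

    Returns (merged, conflicts, new_categories, uncategorized).
    """
    merged = {}
    conflicts = []           # (part_id, scraped category, official category)
    new_categories = set()   # official categories unknown to the scraped data
    uncategorized = []       # parts with no category in either source
    known = {c for c in scraped.values() if c}

    for part_id in sorted(scraped.keys() | official.keys()):
        old, new = scraped.get(part_id), official.get(part_id)
        if old and new and old != new:
            conflicts.append((part_id, old, new))
        if new and new not in known:
            new_categories.add(new)
        # Assumption: prefer the scraped category when both are present.
        merged[part_id] = old or new
        if merged[part_id] is None:
            uncategorized.append(part_id)
    return merged, conflicts, new_categories, uncategorized
```

Keeping the three reports separate makes it possible to handle each case with a different strategy later, for example mapping rules for conflicts and a classifier for uncategorized parts.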
V2.1
HP: Discussed with Professor XX and learned that the model is of limited use to professionals but valuable to amateurs and beginners, which calls for more complete downstream task design.
Improvements:
- ※ Added plasmid stitching as a downstream task (found in V2.0).
Implementation Plan.
Highlights:
- ※ Beginners can independently use the model to complete the entire process from component searching to plasmid design.
Issues:
- ※ The selection of plasmids and component combinations is not ideal, and the compatibility of plasmids with the usage environment is limited (fixed in V3.1).
V2.2
Improvements:
- ※ Analyzed the categories of the components that appear in both databases and derived category mapping rules from them; the remaining uncategorized components still needed classification (found in V2.0)
Inductive Mapping Rule Process:
- ※ Used ChatGLM3 fine-tuned on classified components along with prompt engineering for the classification of remaining components (found in V2.0)
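One simple way to derive mapping rules from the overlap is a majority vote: for each category in one database, pick the category it most often co-occurs with in the other. The sketch below shows that idea under stated assumptions. The function names, the `min_support` threshold, and the example category labels are hypothetical, and the source does not confirm that a vote-based rule was what the team used.

```python
from collections import Counter, defaultdict


def infer_mapping_rules(official, scraped, min_support=2):
    """Map each official category to the scraped category it co-occurs with most.

    Both arguments are {part_id: category} dicts; only parts present in
    both databases contribute votes.
    """
    votes = defaultdict(Counter)
    for part_id in official.keys() & scraped.keys():
        votes[official[part_id]][scraped[part_id]] += 1

    rules = {}
    for source, counter in votes.items():
        target, count = counter.most_common(1)[0]
        if count >= min_support:   # ignore rules backed by too few parts
            rules[source] = target
    return rules


def apply_rules(official, rules):
    """Return (mapped parts, part IDs that no rule covers)."""
    mapped = {p: rules[c] for p, c in official.items() if c in rules}
    remaining = sorted(p for p, c in official.items() if c not in rules)
    return mapped, remaining
```

The `remaining` list corresponds to the components that no rule covers, which in this version were handed to the fine-tuned ChatGLM3 classifier.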
Highlights:
- ※ Truly expanded the database, which now contains the vast majority of components in the iGEM library, greatly enhancing performance (reason: the previous approach of web scraping by broad categories missed a significant number of components).
Issues:
- ※ The prompts needed to meet the benchmarks are too long for ChatGLM3 to handle, so some prompts had to be abandoned (fixed in V2.3)
V2.3
HP: Teaching assistance activities.
Improvements:
- ※ Replaced the ChatGLM3 model used for feature extraction with Llama 3 and retrained it (found in V2.2).
Comparison of Llama 3 and ChatGLM3.
Highlights:
- ※ Improved model comprehension, allowed significantly longer prompts, and improved extraction quality.
- ※ Handled components that ChatGLM3 could not process because their webpages were too long.
Issues:
- ※ During teaching assistance, discovered that people without a background in synthetic biology (including middle and elementary school students) cannot use the model, which may lead to anxiety about and resistance to synthetic biology; this runs contrary to the team's goal of promoting synthetic biology (fixed in V3.0).
V3.0
HP: Interview in Changzhou.
Improvements:
- ※ Proposed "Freshman Mode," enabling plasmid design through natural language dialogue (found in V2.3).
Freshman Mode Algorithm Diagram (Follow-up, Summary)
- ※ Improved the frontend, adopting a ChatGPT-like dialogue box design (found in V2.3)
Highlights:
- ※ Made it beginner-friendly, removing barriers to use, effectively promoting synthetic biology.
Issues:
- ※ (HP) Scoring system: (1) scores may be influenced by subjectivity and require further design; (2) the same component can give different results in different combinations, so scoring needs to consider combinations as a whole (Unsolved).
V3.1
Improvements:
- ※ Used Llama 3 to sort components by their compatibility with the usage environment (found in V2.1).
- ※ Added plasmid backbones for more usage environments (found in V2.1).
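The sorting step can be illustrated with a small sketch. Here compatibility is reduced to an exact match on the `environment` field introduced in V1.3, which is a deliberate simplification standing in for the Llama 3 based judgment; the function name, dictionary keys, and `score` field are hypothetical.

```python
def rank_components(candidates, target_environment):
    """Order candidates: environment-compatible first, then by descending score.

    Each candidate is a dict with a "part_id" and optional "environment"
    and "score" keys.
    """
    def sort_key(component):
        compatible = component.get("environment") == target_environment
        # False sorts before True, so negate compatibility to put matches first.
        return (not compatible, -component.get("score", 0.0), component["part_id"])

    return sorted(candidates, key=sort_key)
```

Incompatible components are pushed to the end rather than dropped, so a user can still pick them when no compatible alternative exists.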
Highlights:
- ※ The model's output considers more biological factors, increasing the success rate of plasmid construction.
Issues:
- ※ Some bugs affect user experience, and low code quality makes maintenance difficult (fixed in V3.2).
V3.2
Improvements:
- ※ Organized code, summarized documentation, and fixed known bugs (found in V3.1).
Issues:
- ※ The collected scoring data is too limited to support further fine-tuning and retraining (Unsolved).
Engineering Highlights
The lifecycle of bugs: We focus not only on current issues but also on systematically tracking every problem we encounter, in order to uncover the deeper dependencies among them. This lets us optimize the order of fixes, resolve issues at a higher level, and reduce how much bugs hinder development progress.
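The "found in" and "fixed in" tags in the update log are enough to reconstruct each issue's lifecycle programmatically. The sketch below uses a handful of entries copied from the log; the short issue names, the data layout, and the helper functions are illustrative assumptions, not the team's tooling.

```python
# Release order as it appears in the update log (V0.0 is the initial design).
VERSIONS = ["V0.0", "V1.0", "V1.1", "V1.2", "V1.3", "V2.0",
            "V2.1", "V2.2", "V2.3", "V3.0", "V3.1", "V3.2"]

# (short name, found in, fixed in); None means still unsolved.
ISSUES = [
    ("inconsistent extraction format", "V1.0", "V1.1"),
    ("cannot handle long texts", "V1.0", "V2.3"),
    ("incomplete database", "V1.1", "V2.0"),
    ("frontend requires expertise", "V1.2", "V3.0"),
    ("scoring data too limited", "V3.2", None),
]


def lifespan(found, fixed):
    """Number of releases an issue stayed open; None if it is unsolved."""
    if fixed is None:
        return None
    return VERSIONS.index(fixed) - VERSIONS.index(found)


def open_issues(version):
    """Names of issues already found but not yet fixed at the given release."""
    at = VERSIONS.index(version)
    return [name for name, found, fixed in ISSUES
            if VERSIONS.index(found) <= at
            and (fixed is None or at < VERSIONS.index(fixed))]
```

Listing the open issues per release gives a textual counterpart to the lifecycle figure, and long lifespans point to the problems that depended on later work, such as the model replacement in V2.3.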