
Results
Introduction
In our iGEM 2024 project Chatparts, we aim to optimize the iGEM parts database by utilizing advanced Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. Our foundational model is based on pre-trained LLM architectures, which we further fine-tuned using specific data from the iGEM Parts Registry. The dataset for model training was collected by scraping, processing, and categorizing relevant information from previous years' iGEM Wikis, ensuring both accurate and comprehensive coverage. After data preparation, we trained our model using state-of-the-art machine learning methods, optimizing key parameters for improved retrieval accuracy and contextual understanding. The outcomes of our project underscore the potential of LLMs to significantly boost the usability and accessibility of the iGEM parts database, thereby paving the way for more efficient synthetic biology research. Detailed results and the performance of our final trained model after thorough evaluation are presented below.
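The data-collection step described above can be pictured with a short sketch. The snippet below is only an illustration of how a single Registry part page might be fetched and reduced to plain text; the scraping and categorization pipeline we actually used, as well as the part ID and page structure assumed here, are placeholders rather than our implementation.

```python
# Illustrative sketch only: the real scraping/cleaning pipeline is not reproduced here.
import requests
from bs4 import BeautifulSoup

def fetch_part_description(part_id: str) -> dict:
    """Fetch one iGEM Registry part page and extract its visible paragraph text."""
    url = f"https://parts.igem.org/Part:{part_id}"  # assumed URL pattern
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Keep only paragraph text; real preprocessing would also clean and categorize it.
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return {"part_id": part_id, "text": text}

# Example: collect one record for a well-known RBS part.
record = fetch_part_description("BBa_B0034")
```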
Training validation
Model initial training loss
The loss function serves as a critical metric for optimization during training. In language model training, the loss function typically evaluates how well the model predicts the next word or token in a sequence based on its prior knowledge and learned representations. A high loss indicates significant prediction errors, while a lower loss suggests that the model is learning and adapting effectively to the training data. The figures present both the original and smoothed loss values; the smoothed curve helps reveal the underlying trend by filtering out the noise inherent in the raw data (refer to the model page for a detailed explanation of the loss function).
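As a rough illustration of these two quantities, the sketch below shows a standard next-token cross-entropy loss and an exponential-moving-average smoothing of the raw loss values. It is not the exact code used in training (see the model page for that); the smoothing weight and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy of predicting token t+1 from position t.

    logits: (batch, seq_len, vocab_size); targets: (batch, seq_len) token ids.
    """
    shifted_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shifted_targets = targets[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, shifted_targets)

def smooth(values, weight: float = 0.9):
    """Exponential moving average, as used for the 'smoothed' curves in the figures."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed
```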
Training loss with refined model parameters
Both models, qwen2.5-14b and llama3.1-8b, exhibit consistent decreases in loss during both the pretraining and fine-tuning phases. The loss curves for both models converge well, indicating that the models are steadily minimizing prediction errors over time, and the smoothed curves give a clearer picture of the overall learning progress.
After evaluating qwen2.5-14b and llama3.1-8b against our measurement criteria, we chose llama3.1-8b as the Chatparts LLM. To further optimize the results, we reduced the learning rate and re-trained the model to improve the optimization process. The new results, shown in the figure below, indicate that the model converges closer to the global minimum: the training loss decreased from roughly 0.9 to 0.4.
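In practice, this refinement amounts to re-running the fine-tuning job with a smaller learning rate. The sketch below uses Hugging Face TrainingArguments purely as an illustration; the framework, the concrete learning-rate values, and the other hyperparameters are placeholders, since only the direction of the change (a lower learning rate on the second run) is stated above.

```python
from transformers import TrainingArguments

# First run (placeholder values).
args_initial = TrainingArguments(
    output_dir="chatparts-llama3.1-sft",
    learning_rate=2e-5,                 # assumed initial learning rate
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

# Refined run: identical setup except for the reduced learning rate.
args_refined = TrainingArguments(
    output_dir="chatparts-llama3.1-sft-refined",
    learning_rate=5e-6,                 # smaller step size for finer optimization
    num_train_epochs=3,
    per_device_train_batch_size=4,
)
```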
Evaluation Results
This evaluation compares the performance of various language models across five different tasks: Question Type, Task Type, Domain Category, Experimental Techniques, and Tools and Databases (Refer to the measurement page for the detailed explanation of the evaluation process). Each model is assessed based on several key metrics, including Context Relevance, Comprehensiveness (BLEU), Groundedness (ROUGE-L), Context Precision, and Context Recall.
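For the overlap-based metrics, the sketch below shows how Comprehensiveness (BLEU) and Groundedness (ROUGE-L) could be computed with common off-the-shelf libraries; the evaluation pipeline we actually ran is documented on the measurement page.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def bleu(reference: str, candidate: str) -> float:
    """Sentence-level BLEU between a reference answer and a model answer."""
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smoothing)

def rouge_l(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference answer and a model answer."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, candidate)["rougeL"].fmeasure
```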
Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases |
---|---|---|---|---|---|
Context Relevance | | | | | |
qwen plus | 0.5039 | 0.4184 | 0.5354 | 0.5437 | 0.4517 |
qwen 2.5 | 0.4247 | 0.4483 | 0.4494 | 0.4205 | 0.3488 |
llama3.1 no sft | 0.4252 | 0.4141 | 0.4520 | 0.4203 | 0.3354 |
chatgpt 3.5 turbo | 0.5028 | 0.5046 | 0.5342 | 0.4929 | 0.3629 |
chatparts-llama3.1 sft | 0.4940 | 0.4867 | 0.5145 | 0.4953 | 0.4537 |
chatparts-qwen2.5 sft | 0.4635 | 0.4737 | 0.4935 | 0.4581 | 0.4195 |
Comprehensiveness (BLEU) | | | | | |
qwen plus | 0.2406 | 0.2016 | 0.3024 | 0.1700 | 0.0943 |
qwen 2.5 | 0.0618 | 0.0772 | 0.0707 | 0.0507 | 0.0638 |
llama3.1 no sft | 0.0843 | 0.1089 | 0.1019 | 0.0673 | 0.1468 |
chatgpt 3.5 turbo | 0.2171 | 0.2469 | 0.2418 | 0.1617 | 0.2554 |
chatparts-llama3.1 sft | 0.2536 | 0.2889 | 0.2688 | 0.2128 | 0.5350 |
chatparts-qwen2.5 sft | 0.1960 | 0.2036 | 0.2144 | 0.1827 | 0.1998 |
Groundedness (ROUGE-L) | | | | | |
qwen plus | 0.5136 | 0.4612 | 0.5882 | 0.4544 | 0.3439 |
qwen 2.5 | 0.2284 | 0.2775 | 0.2254 | 0.2043 | 0.2627 |
llama3.1 no sft | 0.2705 | 0.3212 | 0.2929 | 0.2440 | 0.4098 |
chatgpt 3.5 turbo | 0.4740 | 0.5102 | 0.5037 | 0.3438 | 0.5038 |
chatparts-llama3.1 sft | 0.5294 | 0.5773 | 0.5366 | 0.4806 | 0.7516 |
chatparts-qwen2.5 sft | 0.4395 | 0.4376 | 0.4547 | 0.4290 | 0.4396 |
Context Precision | | | | | |
qwen plus | 0.4884 | 0.4072 | 0.5377 | 0.5271 | 0.4418 |
qwen 2.5 | 0.2762 | 0.2979 | 0.3126 | 0.2539 | 0.2392 |
llama3.1 no sft | 0.3034 | 0.2895 | 0.3553 | 0.2794 | 0.2898 |
chatgpt 3.5 turbo | 0.4264 | 0.4341 | 0.4776 | 0.3925 | 0.3480 |
chatparts-llama3.1 sft | 0.4373 | 0.4255 | 0.4858 | 0.4258 | 0.4032 |
chatparts-qwen2.5 sft | 0.3745 | 0.3639 | 0.4295 | 0.3662 | 0.3430 |
Context Recall | | | | | |
qwen plus | 0.4769 | 0.4783 | 0.5363 | 0.4283 | 0.3789 |
qwen 2.5 | 0.4848 | 0.4855 | 0.5443 | 0.4372 | 0.3778 |
llama3.1 no sft | 0.4449 | 0.4183 | 0.4794 | 0.4348 | 0.3800 |
chatgpt 3.5 turbo | 0.4787 | 0.4803 | 0.5372 | 0.4303 | 0.3705 |
chatparts-llama3.1 sft | 0.4782 | 0.4794 | 0.5370 | 0.4292 | 0.3671 |
chatparts-qwen2.5 sft | 0.4807 | 0.4819 | 0.5393 | 0.4321 | 0.3728 |
For each task, the per-metric scores are aggregated into a total score using weights calculated by the entropy weight method, providing an overall performance comparison (refer to the measurement page for a detailed explanation of the evaluation process).
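A minimal sketch of the entropy weight method is given below, assuming the standard formulation (column-wise proportions, entropy per metric, weights proportional to one minus entropy); the normalization actually applied in our evaluation is described on the measurement page.

```python
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """scores: (n_models, n_metrics) matrix of positive metric values."""
    n_models = scores.shape[0]
    p = scores / scores.sum(axis=0, keepdims=True)        # proportion of each model per metric
    entropy = -(p * np.log(p)).sum(axis=0) / np.log(n_models)
    divergence = 1.0 - entropy                             # metrics that discriminate more get more weight
    return divergence / divergence.sum()

def total_score(scores: np.ndarray) -> np.ndarray:
    """Weighted sum of each model's metric values using the entropy weights."""
    return scores @ entropy_weights(scores)
```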
Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases |
---|---|---|---|---|---|
qwen plus | 0.4455 | 0.3963 | 0.5003 | 0.4234 | 0.3417 |
qwen 2.5 | 0.3081 | 0.3295 | 0.3337 | 0.2862 | 0.2674 |
llama3.1 no sft | 0.3154 | 0.3196 | 0.3447 | 0.3001 | 0.3169 |
chatgpt 3.5 turbo | 0.4248 | 0.4397 | 0.4635 | 0.3701 | 0.3692 |
chatparts-llama3.1 sft | 0.4422 | 0.4555 | 0.4716 | 0.4120 | 0.5018 |
chatparts-qwen2.5 sft | 0.3978 | 0.4003 | 0.4326 | 0.3796 | 0.3590 |
Key Observations:
- qwen plus demonstrates consistently high performance across most tasks, particularly excelling in Domain Category and Experimental Techniques with total scores of 0.5003 and 0.4234, respectively. It also maintains strong context relevance and precision.
- qwen 2.5 exhibits lower overall performance, with a total score of 0.3081 for Question Type and 0.2674 for Tools and Databases. This model struggles particularly in the Comprehensiveness (BLEU) metric.
- llama3.1 no sft delivers moderate results, with its highest total score in Domain Category (0.3447). However, its Comprehensiveness (BLEU) score remains relatively low compared to other models.
- chatgpt 3.5 turbo performs consistently well across all metrics, achieving a total score of 0.4635 in Domain Category and 0.4397 in Task Type, reflecting its well-rounded capabilities.
- chatparts-llama3.1 sft leads in several categories, especially Tools and Databases, with a total score of 0.5018. Its groundedness (ROUGE-L) also stands out, making it particularly robust in terms of providing reliable, grounded information.
- chatparts-qwen2.5 sft shows decent results, performing better than qwen 2.5, but it still ranks lower in terms of total performance, particularly in Comprehensiveness (BLEU).
In summary, chatparts-llama3.1 sft performs best overall in this evaluation, showing a strong balance of context relevance, groundedness, and precision across the various task types.