Results

Introduction

In our iGEM 2024 project Chatparts, we aim to optimize the iGEM parts database by utilizing advanced Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. Our foundational model is based on pre-trained LLM architectures, which we further fine-tuned using specific data from the iGEM Parts Registry. The dataset for model training was collected by scraping, processing, and categorizing relevant information from previous years' iGEM Wikis, ensuring both accurate and comprehensive coverage. After data preparation, we trained our model using state-of-the-art machine learning methods, optimizing key parameters for improved retrieval accuracy and contextual understanding. The outcomes of our project underscore the potential of LLMs to significantly boost the usability and accessibility of the iGEM parts database, thereby paving the way for more efficient synthetic biology research. Detailed results and the performance of our final trained model after thorough evaluation are presented below.
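As an illustration of the data-collection step, the sketch below shows how a single Parts Registry page could be fetched and stripped to plain text before processing and categorization. The URL pattern, part IDs, and output format are assumptions for illustration, not the exact pipeline used in the project.

```python
# Minimal sketch of the data-collection step; URLs, part IDs, and the
# output format are illustrative assumptions, not the project's pipeline.
import json
import requests
from bs4 import BeautifulSoup

def scrape_part_page(part_id: str) -> dict:
    """Fetch one Parts Registry page and extract its plain-text content."""
    url = f"https://parts.igem.org/Part:{part_id}"  # assumed URL pattern
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    return {"part_id": part_id, "text": text}

if __name__ == "__main__":
    records = [scrape_part_page(p) for p in ["BBa_B0034", "BBa_E0040"]]
    with open("parts_corpus.jsonl", "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```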

Training validation

Model initial training loss

The loss function serves as a critical metric for optimization during training. In language model training, the loss function typically evaluates how well the model predicts the next word or token in a sequence based on its prior knowledge and learned representations. A high loss indicates significant prediction errors, while a lower loss suggests that the model is learning and adapting to the training data effectively. The figures present both the original and smoothed loss values; the smoothed curve helps reveal the underlying trend by filtering out the noise inherent in the raw data (refer to the model page for a detailed explanation of the loss function).
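For a concrete picture of the two ideas above, the sketch below shows a standard next-token cross-entropy loss and an exponential-moving-average smoothing of a raw loss curve. The smoothing factor is an assumed value, not necessarily the one used to generate our figures.

```python
# Sketch: next-token cross-entropy loss and EMA smoothing of a loss curve.
# The smoothing weight (0.9) is an assumed value for illustration.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between predicted token distributions and true next tokens.
    logits: (batch, seq_len, vocab_size); targets: (batch, seq_len)."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

def smooth(losses, weight: float = 0.9):
    """TensorBoard-style exponential moving average over a raw loss curve."""
    smoothed, last = [], losses[0]
    for x in losses:
        last = weight * last + (1 - weight) * x
        smoothed.append(last)
    return smoothed
```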

Figure 1: Loss function of qwen2.5 14b and llama3.1 8b in the pretraining and fine-tuning phases

Training loss with refined model parameters

Both models, qwen2.5 14b and llama3.1 8b, exhibit consistent decreases in loss during both the pretraining and fine-tuning phases. The loss functions of both models converge well, indicating that the models are minimizing prediction errors over time, and the smoothed loss curves give a clearer view of the overall learning progress.

Figure 2: Training loss after reducing the learning rate and retraining the model

After evaluating qwen2.5-14b and llama3.1-8b against the measurement criteria, we chose llama3.1 8b as our Chatparts LLM. To further optimize the results, we reduced the learning rate and retrained the model to refine the optimization process. The new results, shown in Figure 2, demonstrate that the model converges toward a lower minimum: the refined approach yielded better optimization results, with the model's loss decreasing from 0.9 to 0.4.
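The snippet below is a hedged sketch of what such a learning-rate reduction can look like in a Hugging Face-style training configuration; the framework, hyperparameter values, and output paths are assumptions and may differ from our actual training scripts.

```python
# Hedged sketch of the learning-rate reduction step using Hugging Face
# TrainingArguments as an assumed training setup; values are illustrative.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="chatparts-llama3.1-sft-refined",  # assumed path
    learning_rate=2e-5,                 # reduced from an assumed initial 2e-4
    lr_scheduler_type="cosine",         # smaller steps help settle into a lower minimum
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_steps=10,                   # log loss often enough to plot curves like Figure 2
)
```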

Evaluation Results

This evaluation compares the performance of various language models across five different tasks: Question Type, Task Type, Domain Category, Experimental Techniques, and Tools and Databases (Refer to the measurement page for the detailed explanation of the evaluation process). Each model is assessed based on several key metrics, including Context Relevance, Comprehensiveness (BLEU), Groundedness (ROUGE-L), Context Precision, and Context Recall.
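As a rough illustration of two of the reference-based metrics (the retrieval-oriented metrics are covered on the measurement page), the sketch below computes sentence-level BLEU and ROUGE-L F1 between a reference answer and a model answer. The library choices and example strings are assumptions for illustration only.

```python
# Illustrative computation of BLEU (Comprehensiveness) and ROUGE-L
# (Groundedness); library choices are assumptions, not the project's exact stack.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def bleu(reference: str, candidate: str) -> float:
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)

def rouge_l(reference: str, candidate: str) -> float:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, candidate)["rougeL"].fmeasure

# Hypothetical reference answer and model output
answer = "BBa_B0034 is a ribosome binding site commonly used in E. coli."
model_output = "BBa_B0034 is a widely used ribosome binding site for E. coli."
print(bleu(answer, model_output), rouge_l(answer, model_output))
```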

Context Relevance

Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases
qwen plus | 0.5039 | 0.4184 | 0.5354 | 0.5437 | 0.4517
qwen2.5 | 0.4247 | 0.4483 | 0.4494 | 0.4205 | 0.3488
llama3.1 no sft | 0.4252 | 0.4141 | 0.4520 | 0.4203 | 0.3354
chatgpt 3.5 turbo | 0.5028 | 0.5046 | 0.5342 | 0.4929 | 0.3629
chatparts-llama3.1 sft | 0.4940 | 0.4867 | 0.5145 | 0.4953 | 0.4537
chatparts-qwen2.5 sft | 0.4635 | 0.4737 | 0.4935 | 0.4581 | 0.4195

Comprehensiveness (BLEU)

Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases
qwen plus | 0.2406 | 0.2016 | 0.3024 | 0.1700 | 0.0943
qwen2.5 | 0.0618 | 0.0772 | 0.0707 | 0.0507 | 0.0638
llama3.1 no sft | 0.0843 | 0.1089 | 0.1019 | 0.0673 | 0.1468
chatgpt 3.5 turbo | 0.2171 | 0.2469 | 0.2418 | 0.1617 | 0.2554
chatparts-llama3.1 sft | 0.2536 | 0.2889 | 0.2688 | 0.2128 | 0.5350
chatparts-qwen2.5 sft | 0.1960 | 0.2036 | 0.2144 | 0.1827 | 0.1998

Groundedness (ROUGE-L)

Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases
qwen plus | 0.5136 | 0.4612 | 0.5882 | 0.4544 | 0.3439
qwen2.5 | 0.2284 | 0.2775 | 0.2254 | 0.2043 | 0.2627
llama3.1 no sft | 0.2705 | 0.3212 | 0.2929 | 0.2440 | 0.4098
chatgpt 3.5 turbo | 0.4740 | 0.5102 | 0.5037 | 0.3438 | 0.5038
chatparts-llama3.1 sft | 0.5294 | 0.5773 | 0.5366 | 0.4806 | 0.7516
chatparts-qwen2.5 sft | 0.4395 | 0.4376 | 0.4547 | 0.4290 | 0.4396

Context Precision

Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases
qwen plus | 0.4884 | 0.4072 | 0.5377 | 0.5271 | 0.4418
qwen2.5 | 0.2762 | 0.2979 | 0.3126 | 0.2539 | 0.2392
llama3.1 no sft | 0.3034 | 0.2895 | 0.3553 | 0.2794 | 0.2898
chatgpt 3.5 turbo | 0.4264 | 0.4341 | 0.4776 | 0.3925 | 0.3480
chatparts-llama3.1 sft | 0.4373 | 0.4255 | 0.4858 | 0.4258 | 0.4032
chatparts-qwen2.5 sft | 0.3745 | 0.3639 | 0.4295 | 0.3662 | 0.3430

Context Recall

Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases
qwen plus | 0.4769 | 0.4783 | 0.5363 | 0.4283 | 0.3789
qwen2.5 | 0.4848 | 0.4855 | 0.5443 | 0.4372 | 0.3778
llama3.1 no sft | 0.4449 | 0.4183 | 0.4794 | 0.4348 | 0.3800
chatgpt 3.5 turbo | 0.4787 | 0.4803 | 0.5372 | 0.4303 | 0.3705
chatparts-llama3.1 sft | 0.4782 | 0.4794 | 0.5370 | 0.4292 | 0.3671
chatparts-qwen2.5 sft | 0.4807 | 0.4819 | 0.5393 | 0.4321 | 0.3728

The final scores for each task and metric are aggregated into a Total score using weights calculated by the entropy weight method, providing an overall performance comparison (refer to the measurement page for a detailed explanation of the evaluation process).
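The sketch below illustrates the entropy weight method on a small, hypothetical score matrix: metrics whose values vary more across models (lower entropy) receive larger weights, and each model's Total score is the weighted sum of its metric scores. The demo numbers are illustrative and are not taken from the tables above.

```python
# Minimal numpy sketch of the entropy weight method used for aggregation.
# The demo matrix is hypothetical; the real aggregation uses the full results.
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """scores: (n_models, n_metrics) matrix of positive metric values."""
    n = scores.shape[0]
    p = scores / scores.sum(axis=0, keepdims=True)          # column-normalize
    e = -(p * np.log(p + 1e-12)).sum(axis=0) / np.log(n)    # per-metric entropy
    d = 1.0 - e                                              # degree of divergence
    return d / d.sum()                                       # normalized weights

def total_scores(scores: np.ndarray) -> np.ndarray:
    """Weighted sum of each model's metric scores."""
    return scores @ entropy_weights(scores)

# Example: 3 hypothetical models x 2 metrics
demo = np.array([[0.50, 0.24],
                 [0.42, 0.06],
                 [0.49, 0.25]])
print(entropy_weights(demo), total_scores(demo))
```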

Model | Question Type | Task Type | Domain Category | Experimental Techniques | Tools and Databases
qwen plus | 0.4455 | 0.3963 | 0.5003 | 0.4234 | 0.3417
qwen2.5 | 0.3081 | 0.3295 | 0.3337 | 0.2862 | 0.2674
llama3.1 no sft | 0.3154 | 0.3196 | 0.3447 | 0.3001 | 0.3169
chatgpt 3.5 turbo | 0.4248 | 0.4397 | 0.4635 | 0.3701 | 0.3692
chatparts-llama3.1 sft | 0.4422 | 0.4555 | 0.4716 | 0.4120 | 0.5018
chatparts-qwen2.5 sft | 0.3978 | 0.4003 | 0.4326 | 0.3796 | 0.3590

Key Observations:

  • qwen plus demonstrates consistently high performance across most tasks, particularly excelling in Domain Category and Experimental Techniques with total scores of 0.5003 and 0.4234, respectively. It also maintains strong context relevance and precision.
  • qwen 2.5 exhibits lower overall performance, with a total score of 0.3081 for Question Type and 0.2674 for Tools and Databases. This model struggles particularly in the Comprehensiveness (BLEU) metric.
  • llama3.1 no sft delivers moderate results, with its highest total score in Domain Category (0.3447). However, its Comprehensiveness (BLEU) score remains relatively low compared to the other models.
  • chatgpt 3.5 turbo performs consistently well across all metrics, achieving a total score of 0.4635 in Domain Category and 0.4397 in Task Type, reflecting its well-rounded capabilities.
  • chatparts-llama3.1 sft leads in several categories, especially Tools and Databases, with a total score of 0.5018. Its groundedness (ROUGE-L) also stands out, making it particularly robust in terms of providing reliable, grounded information.
  • chatparts-qwen2.5 sft shows decent results, performing better than qwen 2.5, but it still ranks lower in terms of total performance, particularly in Comprehensiveness (BLEU).

In summary, chatparts-llama3.1 sft performs best in this evaluation, showing a balance of context relevance, groundedness, and precision across the various task types.