
Measurement
Introduction
To evaluate the accuracy of our model, we designed a comprehensive and rigorous evaluation benchmark. It assesses the model with quantitative metrics covering data collection, database retrieval, and answer generation. In addition, we compared our model with several popular large language models currently in use (e.g., ChatGPT and Qwen), assessing their performance and answer quality on the same tasks. This comparison highlights the strong performance of our model and demonstrates its value.
Our benchmark is also groundbreaking: to our knowledge, there is currently no comparable large language model AI assistant for synthetic biology, let alone a systematic evaluation process for one. Our assessment therefore provides a solid foundation for future teams building or improving their own models. They can use this benchmark to evaluate whether their model's performance meets the requirements of a particular scenario and compare it with the model we have developed, or they can directly evaluate aspects of their own model using the categorized dataset we designed. Since we have made our metrics and methodology public, other teams can adapt and refine the evaluation criteria to their own needs and share them with us, ensuring that the benchmark remains applicable to a variety of use cases.
Evaluation Pipeline
Data Collection
Our model is trained on data publicly available on the official iGEM website from past years, including but not limited to wikis, parts, protocols, and models. Because high-quality data is fundamental to generative AI, we adopted a novel and highly accurate data filtering and cleaning method: a combination of AI assistance and manual review.
1. Data Crawling
To avoid the omissions that can occur when browsing wikis manually, we used a web spider to capture all the information from each team's wiki pages for every year and stored it as txt files. Although raw txt files are not particularly easy to read, this approach ensures that no information is missed. We collected data on all teams from 2018 to 2023 from the official iGEM website, amounting to almost 60,000 entries.
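The sketch below illustrates the kind of crawler used at this stage. It is a minimal example, assuming a pre-built list of wiki URLs and a plain text dump per page; the URLs, file names, and output folder are illustrative rather than the exact ones we used.

# Minimal crawler sketch: fetch each team wiki page and save its visible text as a txt file.
# The URL list and output layout are illustrative assumptions.
import requests
from bs4 import BeautifulSoup
from pathlib import Path

TEAM_WIKI_URLS = [
    "https://2023.igem.wiki/example-team/parts",   # hypothetical page
    "https://2023.igem.wiki/example-team/model",   # hypothetical page
]

OUTPUT_DIR = Path("raw_wiki_txt")
OUTPUT_DIR.mkdir(exist_ok=True)

for url in TEAM_WIKI_URLS:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Strip the HTML and keep only the visible text of the page.
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    # Derive a file name from the URL, e.g. "2023.igem.wiki_example-team_parts.txt".
    filename = url.replace("https://", "").replace("/", "_") + ".txt"
    (OUTPUT_DIR / filename).write_text(text, encoding="utf-8")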
2. Data Structuring
After data extraction, we used JSON templates to structure the data. Once we had a general understanding of the crawled information, we created a JSON template for each section of the wiki and then fed the crawled text together with the pre-set template into ChatGPT to expedite the initial organization. This ensured that all data retrieved by the crawler was properly formatted and systematically organized (a sketch of this step follows the templates below).
JSON template for RAG (Parts)
Parts:
{
"document_id": "Document ID",
"title": "Document Title",
"Group": "Team Name",
"author": "Author",
"date": "Publication Year",
"tags": ["Data Category", "Data Function"],
"abstract": "Document Abstract",
"metadata": {
"Part Abstract": "Brief description of the part content",
"Part Name": "Part Name",
"Part ID": "Part ID",
"Base Pairs": "Number of base pairs in the part (you can directly calculate the character count of the sequence attached to the text)",
"Origin": "Chassis Bacteria",
"Part Type": "Part Type",
"Part Description": "Part Description",
"Part Contribution": "Part Contribution",
"Sequence": "Sequence",
"Experiment Parameters": {
"Assembly": "Assembly Method",
"Transformation Medium": "Transformation Medium",
"Expression Medium": "Expression Medium",
"Expression Temperature": "Expression Temperature",
"Primers Used": {
"forward primer": "Forward Primer",
"backward primer": "Reverse Primer"
}
},
"Reference": "Reference Literature"
},
"URL": "https://part.igem.org/Part:PartID"
}
JSON template for RAG (Protocols)
{
"team": "iGEM_Team_Name_Year",
"project": {
"title": "Project_Title",
"description": "Project_Description",
"goals": [
"Goal1",
"Goal2",
"Goal3"
]
},
"experiments": [
{
"title": "Experiment_Title",
"objective": "Objective of the experiment",
"concepts": "Concepts underlying the experiment",
"materials": [
{
"name": "Material_Name",
"dosage": "Dosage or amount"
}
],
"methods": [
{
"step": "Step_Description"
}
],
"protocols": [
{
"protocol_name": "Protocol_Name",
"steps": [
"Step_1_Description",
"Step_2_Description"
]
}
],
"results": [
"Result_1_Description",
"Result_2_Description"
]
}
]
}
JSON template for RAG (Models)
Models:
{
"document_id": "Document ID",
"title": "File title",
"author": "Team name ",
"date": "year of publication ",
"tags": [" Data categories ", "data features "],
"abstract": "Document summary ",
"metadata": {
"Model Overview": "Model Overview",
"Model Assumptions": "Model assumptions ",
"Mathematical Description": "Mathematical description of the model ",
"Model Implementation": "Model implementation ",
"Parameter Estimation": "Parameter estimation",
"Model Validation and Verification": "Model validation and verification ",
"Simulation Results": "Simulation results ",
"Robustness and Sensitivity Analysis": "Robustness and Sensitivity Analysis",
"Model Limitations": "Limitations of the model",
"Conclusions": "What are the conclusions?",
"References": "References"
},
"url": "https:// year.igem.wiki/team name /model"
}
JSON template for LoRA (the same for parts, protocols, and models)
[
{
"context": "Conetext information",
"question": "Questions fromm users",
"answer": "Predicted answer from model"
}
]
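The sketch below shows how a crawled txt file can be passed to ChatGPT together with one of the templates above to produce a structured JSON record. It is a minimal example: the prompt wording, model name, and file paths are illustrative assumptions rather than our exact prompts.

# Minimal structuring sketch: ask ChatGPT to reorganize crawled wiki text into a JSON template.
# Prompt wording, model name, and paths are illustrative assumptions.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

template = Path("templates/parts_template.json").read_text(encoding="utf-8")
raw_text = Path("raw_wiki_txt/2023.igem.wiki_example-team_parts.txt").read_text(encoding="utf-8")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You reorganize iGEM wiki text into the given JSON template. "
                    "Fill every field you can, leave unknown fields empty, and return JSON only."},
        {"role": "user", "content": f"Template:\n{template}\n\nWiki text:\n{raw_text}"},
    ],
)

structured = json.loads(response.choices[0].message.content)  # manual review follows this step
Path("structured").mkdir(exist_ok=True)
(Path("structured") / "parts_example.json").write_text(json.dumps(structured, indent=2), encoding="utf-8")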
3. Manual Review
After converting the txt files obtained by the crawler into JSON format, we compared the information in each JSON file with the official website entry by entry, to catch mismatches or factual errors introduced by the AI assistant.
Through this systematic data collection process, we can ensure that the data used to train the model is of relatively high quality, laying a solid foundation for a well-performing model.
Quantitative Evaluation Metrics
For the evaluation of our model, we did not choose any existing tools or software packages for direct assessment. The field of large language models in synthetic biology is still relatively new, and it lacks both standardized evaluation packages and relevant evaluation datasets. We therefore employed the most fundamental approach: manual evaluation. This manual process involved two aspects: manually writing evaluation functions and manually classifying the evaluation dataset.
There are many ways to evaluate the performance of large models, and quantitative metrics are the most straightforward method to provide a numerical score that tells us how well a model performs. In our evaluation process, we used five quantitative metrics: Faithfulness (Context Relevance), BLEU score (Answer Correctness or Answer Comprehensiveness), ROUGE-L F1 score (Answer Relevance or Groundedness), Context Precision, and Context Recall. These five metrics evaluate the model’s retrieval and generation capabilities[1].
Generation: the ability of the model to generate appropriate and coherent responses based on the given context or input question.
Retrieval: the ability of the model to accurately and comprehensively retrieve information relevant to the input question from a large knowledge base, database, or document set.
The following diagram shows the roles played by our five quantitative metrics in the relationships among the three key elements of large language models: the generated Answer, the Context needed to produce the answer, and the Query provided by the user. These elements are interconnected: the answer should accurately respond to the query, the context should provide sufficient knowledge for the answer, and the context should also align with the query. The five quantitative metrics we designed cover all of the relationships among these three elements, demonstrating that this evaluation system is comprehensive and valuable[2].
1. Faithfulness (Context Relevancy)
This metric evaluates the degree to which the generated answer aligns with the given context. It is calculated as a ratio based on the consistency between the answer and the retrieved context, with values ranging from 0 to 1, where higher scores indicate better faithfulness[3]. If all the ideas in the answer can be inferred from the provided context, the generated answer is considered faithful.
2. BLEU Score (Answer Correctness)
The BLEU score (Bilingual Evaluation Understudy Score) is mainly used to assess the similarity between the generated text (the answer) and the reference text (the ground truth)[4]. It was originally designed for evaluating machine translation. By calculating the match of n-grams (continuous sequences of words), BLEU measures the similarity between the generated and reference texts. The BLEU score thus reflects both the comprehensiveness and the accuracy of the generated answer.
Comprehensiveness
Since BLEU calculates the matching rate of n-grams, if the generated text includes phrases and structures that are the same as the reference text, it indicates that the generated answer is more comprehensive and captures the main content of the reference answer. Therefore, a high BLEU score means that the generated text performs well in covering the content of the reference text.
Answer Correctness
The BLEU score is computed by comparing the n-grams in the generated answer with those in the reference answer. If there are many matching n-grams, the BLEU score will be high. This level of matching can be seen as a measure of 'correctness.'
Calculation
Step 1: Calculating n-gram Precision
For each n-gram (a sequence of n words), calculate the matching degree between the generated text and the reference text. For example, for 1-gram, precision is calculated using the following formula:
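$$ p_1 = \frac{\text{number of 1-grams in the generated text that also appear in the reference text}}{\text{total number of 1-grams in the generated text}} $$
Matches are clipped, so a 1-gram in the generated text is counted at most as many times as it appears in the reference text.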
This process can be extended to different n values (e.g., 1-gram, 2-gram, 3-gram, etc.).
Step 2: Calculating the Weighted n-gram Precision
To balance the importance of different n-grams, BLEU typically calculates a weighted average of the 1-gram to 4-gram precisions. The formula is as follows:
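$$ p_{\text{avg}} = \exp\!\left( \sum_{n=1}^{4} w_n \log p_n \right), \qquad w_n = \frac{1}{4} $$
where $p_n$ is the n-gram precision from Step 1 and the weights $w_n$ are typically uniform.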
Step 3: Brevity Penalty (BP)
To avoid giving high scores to overly short generated answers, BLEU introduces a Brevity Penalty (BP). The formula for this is:
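$$ BP = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases} $$
where $c$ is the length of the generated text and $r$ is the length of the reference text.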
The brevity penalty is mainly used to penalize shorter generated texts that might artificially boost precision scores.
Step 4: Final BLEU Score Formula
Combining the n-gram precision and the brevity penalty, the final BLEU score formula is as follows:
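$$ \text{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right) $$
with $N = 4$ and the uniform weights $w_n$ from Step 2.
Because we wrote our evaluation functions by hand rather than relying on an off-the-shelf package, a function along the following lines implements Steps 1 to 4. It is a minimal sketch: tokenization is simplified to whitespace splitting, a single reference is assumed, and the function and variable names are our own illustrations.

# BLEU sketch following Steps 1-4: clipped n-gram precision, uniform weights, brevity penalty.
import math
from collections import Counter

def bleu_score(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU against a single reference, with whitespace tokenization."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()

    # Step 1: clipped n-gram precision for n = 1 .. max_n.
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1))
        ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
        overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)

    if min(precisions) == 0:   # log(0) is undefined; without smoothing the score is 0
        return 0.0

    # Step 2: weighted (geometric) mean of the n-gram precisions.
    log_avg = sum(math.log(p) for p in precisions) / max_n

    # Step 3: brevity penalty.
    c, r = len(cand_tokens), len(ref_tokens)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))

    # Step 4: final score.
    return bp * math.exp(log_avg)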
3. F1 Score of ROUGE-L (Answer Relevancy)
The F1 score of ROUGE-L measures the matching degree of the longest common subsequence (LCS) between the generated text and the reference text[5]. It combines precision (how much of the generated text matches the reference) and recall (how much of the reference text is covered by the generated text). By balancing these two, the F1 score assesses the groundedness of the generated text, serving as a quantitative measure of answer relevancy.
Groundedness (Answer Relevancy)
The F1 score of ROUGE-L measures how much of the generated answer can be traced back to the original text (through recall) while also considering whether irrelevant or additional information is minimized (through precision), ensuring the generated answer aligns with the source text[5]. This balance makes it a good metric for evaluating whether the generated answer is well-grounded in the reference text. A high ROUGE-L F1 score indicates that the generated content is well-supported and accurately reflects information from the original or contextual text, showing strong answer relevancy.
Calculation
Step 1: Precision
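$$ P_{LCS} = \frac{LCS(X, Y)}{|Y|} $$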
Where LCS(X, Y) is the length of the longest common subsequence between the generated text Y and the reference text X, and |Y| is the number of words in the generated text Y.
Step 2: Recall
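$$ R_{LCS} = \frac{LCS(X, Y)}{|X|} $$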
Similarly, LCS(X, Y) is the length of the longest common subsequence between the generated and reference texts, and |X| is the number of words in the reference text X.
Step 3: F1 Score
The F1 score is the harmonic mean of Precision and Recall, and its formula is as follows:
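$$ F1_{LCS} = \frac{2 \, P_{LCS} \, R_{LCS}}{P_{LCS} + R_{LCS}} $$
As with BLEU, a hand-written function can follow the three steps directly. The sketch below is a minimal example with whitespace tokenization; the names are illustrative.

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: precision and recall based on the longest common subsequence (LCS)."""
    y = candidate.split()   # generated text Y
    x = reference.split()   # reference text X

    # Dynamic-programming table for the LCS length.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(x)][len(y)]

    if lcs == 0 or not x or not y:
        return 0.0
    precision = lcs / len(y)   # Step 1
    recall = lcs / len(x)      # Step 2
    return 2 * precision * recall / (precision + recall)   # Step 3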
4. Context Precision
Context precision is a metric that assesses whether all elements relevant to the ground truth are ranked higher within the retrieved context. Ideally, all relevant items should appear at the top. This metric is calculated based on the query, ground truth, and context, with a value ranging from 0 to 1. A higher score indicates better precision and relevance[6].
5. Context Recall
Context recall evaluates the consistency between the retrieved context and the ground truth (i.e., the annotated answer). It is calculated as a ratio based on the question, ground truth (GT), and the retrieved context, with values ranging from 0 to 1. A higher value indicates better alignment, showing that the context used to generate the answer is well-supported by factual information[7].
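Context precision and context recall do not have a single closed-form formula in the way BLEU and ROUGE-L do, so we illustrate the idea with a rough sketch of context recall: the fraction of ground-truth sentences that are supported by the retrieved context. The sentence splitting and word-overlap threshold below are deliberate simplifications and are not our exact implementation.

def context_recall(ground_truth: str, contexts: list[str], overlap_threshold: float = 0.5) -> float:
    """Fraction of ground-truth sentences supported by at least one retrieved context,
    where 'supported' is approximated by word overlap (illustrative simplification)."""
    context_words = set(" ".join(contexts).lower().split())
    sentences = [s.strip() for s in ground_truth.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        if words and len(words & context_words) / len(words) >= overlap_threshold:
            supported += 1
    return supported / len(sentences)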
Categorical Evaluation Dataset
To evaluate the model’s answer generation capability across various aspects, we manually categorized the evaluation dataset. The dataset was classified based on five key aspects: question type, task type, domain category, experimental techniques, and data sources. Our test dataset, consisting of approximately 5,500 entries, was further subdivided into these main categories. For more details, please refer to the table.
1. Categories Design
Category Dimension | Label | Description | Example Question |
---|---|---|---|
Question Type | Experimental Steps | Questions involving experimental procedures and workflows | How is PCR amplification performed? |
Question Type | Gene Design | Questions related to gene editing, gene circuit design, etc. | How is a synthetic gene circuit designed? |
Question Type | Protein Function | Questions about protein structure, function, and activity | Where is the active site of this protein? |
Question Type | Data Analysis | Questions about data interpretation and modeling | How is FBA used to analyze metabolic pathways? |
Question Type | Tool Usage | Questions on using specific biological tools or databases | How do you search for genetic parts in the iGEM Registry? |
Task Type | Question Answer | Tasks where the model generates a specific answer | How does CRISPR perform gene editing? |
Task Type | Classification | Tasks where the model categorizes genes, parts, etc. | How are synthetic biology parts categorized? |
Task Type | Generation | Tasks where the model generates detailed descriptions | Describe the steps for plasmid transformation. |
Domain Category | Metabolic Engineering | Involves questions related to pathway optimization and design | How can metabolic engineering be used to improve biofuel production? |
Domain Category | Protein Engineering | Questions related to protein function, synthesis, and modification | How can the expression level of a recombinant protein be increased? |
Domain Category | Pathway Design | Designing new metabolic pathways or biosensors | How do you design a biosensor based on a gene circuit? |
Experimental Techniques | PCR Techniques | Experimental steps related to PCR (Polymerase Chain Reaction) | How do you design PCR primers for gene amplification? |
Experimental Techniques | Transformation and Cloning | Experimental techniques related to plasmid transformation and gene cloning | How is plasmid transformation performed in E. coli? |
Experimental Techniques | Protein Purification | Experimental procedures for protein purification and analysis | How do you purify a protein using His-tag purification? |
Experimental Techniques | Gene Knockout | Techniques related to gene knockout | How is gene knockout performed using CRISPR? |
Tools and Databases | iGEM Registry | Tasks related to searching and operating within the iGEM Parts Registry | How do you find BioBrick parts in the iGEM Registry? |
Tools and Databases | Addgene | Tasks related to retrieving, storing, and submitting DNA parts | How do you obtain plasmids from Addgene? |
Tools and Databases | Benchling | Tasks related to experimental design and data management on Benchling | How do you use Benchling to design experiments and record data? |
2. Categorical label in JSON
Considering that some questions may belong to multiple categories, we assigned up to three category labels to each data entry during manual annotation. These labels are named label_1, label_2, and label_3, representing the primary, secondary, and tertiary categories of the entry, respectively. If an entry belongs to only one category, the second and third labels are set to "none". The template and examples are shown below.
{
"contexts": "original contexts",
"question": "original question",
"ground_truth": "original ground_truth",
"answer": "original answer",
"label_1": "your proposed label",
"label_2": "your proposed second label, if none write 'none'",
"label_3": "your proposed third label, if none write 'none'",
"category": "your proposed category dimension"
}
Comparison Between Models
In the model comparison section, to highlight the performance of our model, we compared a total of six models. Their details are as follows:
a. Chatparts - Llama3.1 sft: This is our model, Chatparts, fine-tuned from Meta's Llama3.1 8B model. Released in 2024, Llama3.1 is known for its excellent performance in understanding complex language structures and subtle semantic differences. Building on the strengths of the Llama architecture, this model understands iGEM and synthetic biology-related knowledge very well.
b. Chatparts - Qwen2.5 sft: This is also our model, Chatparts, but fine-tuned from the open-source Qwen2.5 model developed by Alibaba Cloud. This fine-tuning allows Chatparts to leverage Qwen's capabilities in natural language understanding and generation for specific tasks.
c. Qwen 2.5: An open-source large language model developed by Alibaba Cloud and released in 2024. It has been widely used in the open-source community for various natural language processing tasks.
d. Qwen-plus: A large language model released by Alibaba Cloud at the end of 2023. Compared with the open-source Qwen 2.5, it offers enhanced capabilities, but it is not open-source.
e. Llama3.1 no sft: The Llama3.1 8B base model without our fine-tuning. Comparing this version with the fine-tuned model highlights the impact of fine-tuning on model performance.
f. ChatGPT 3.5 turbo: Released by OpenAI in March 2023, ChatGPT 3.5 turbo is a powerful closed-source language model that has been widely adopted across various industries and has become a benchmark for large language models.
1. Overall Performance Comparison
First, we examine the performance of the six models across the five quantitative metrics.
Our initial analysis of the five quantitative metrics clearly shows that chatparts-llama3.1 sft (highlighted in red) outperforms the other models in several categories. Notably, it consistently achieves the highest scores in Context Relevance, Comprehensiveness (BLEU), Groundedness (ROUGE-L), and Context Precision. Although our model doesn't have the highest score in Context Recall, its performance is still quite strong.
Metric | chatparts-llama3.1 sft | chatparts-qwen2.5 sft | qwen-plus | qwen 2.5 | llama3.1 no sft | chatgpt 3.5 turbo |
---|---|---|---|---|---|---|
Context Relevance | 0.505676 | 0.451055 | 0.488147 | 0.416728 | 0.413907 | 0.490677 |
Comprehensiveness (BLEU) | 0.288867 | 0.203761 | 0.233562 | 0.066117 | 0.092745 | 0.236093 |
Groundedness (ROUGE-L) | 0.561990 | 0.444764 | 0.507912 | 0.239010 | 0.287860 | 0.492046 |
Context Precision | 0.423080 | 0.360883 | 0.471571 | 0.270054 | 0.295718 | 0.414071 |
Context Recall | 0.481943 | 0.484913 | 0.466638 | 0.488848 | 0.486847 | 0.465446 |
Because tabular data is not very intuitive, a bar chart is used to show each model's value on each metric. Our model, chatparts-llama3.1 sft (dark blue bars), is clearly superior to the other models.
2. Categorical Comparison
In this step of model evaluation and comparison, we also assess every category dimension in the table using the five quantitative metrics described earlier and compute each metric's average value for every category. Using the entropy weight method, we then determine the weight of each metric. Finally, the score for each category is the weighted average of the five metrics within its top-level category dimension. (Refer to the results page for the detailed evaluation results by category.)
Why the entropy weight method?
The entropy weight method is a highly objective weighting approach, widely used in multi-criteria evaluation. It calculates the information entropy of each indicator to assess its variability and uncertainty. When the variability of an indicator is greater, its entropy value is smaller, indicating that the indicator provides more information, and thus its weight is higher. The entropy weight method avoids the influence of subjective human factors, making the weight distribution more objective and reasonable, which helps improve the scientific accuracy of the evaluation results.
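The sketch below shows the standard entropy weight calculation applied to a matrix of metric averages (rows are categories, columns are the five metrics). It is a generic implementation of the method described above; the variable names and the example numbers are illustrative.

# Entropy weight method sketch: derive one weight per metric from the variability of its values.
import numpy as np

def entropy_weights(scores: np.ndarray) -> np.ndarray:
    """Rows = items being scored (e.g. categories), columns = indicators (our five metrics).
    Returns one weight per indicator, summing to 1."""
    # 1. Min-max normalise each indicator so all values lie in [0, 1].
    mins, maxs = scores.min(axis=0), scores.max(axis=0)
    norm = (scores - mins) / np.where(maxs - mins == 0, 1, maxs - mins)

    # 2. Convert each indicator column to proportions p_ij.
    col_sums = norm.sum(axis=0)
    p = norm / np.where(col_sums == 0, 1, col_sums)

    # 3. Information entropy of each indicator (0 * log 0 treated as 0).
    n = scores.shape[0]
    with np.errstate(divide="ignore", invalid="ignore"):
        log_p = np.where(p > 0, np.log(p), 0.0)
    entropy = -(p * log_p).sum(axis=0) / np.log(n)

    # 4. Greater variability (lower entropy) -> larger weight.
    d = 1 - entropy
    return d / d.sum()

# Example with illustrative numbers: five metric averages for three categories.
category_scores = np.array([
    [0.51, 0.29, 0.56, 0.42, 0.48],
    [0.45, 0.20, 0.44, 0.36, 0.48],
    [0.49, 0.23, 0.51, 0.47, 0.47],
])
weights = entropy_weights(category_scores)
weighted_score = category_scores @ weights   # weighted average score per category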
The radar plot in Figure 9 further confirms the strong performance of the chatparts-llama3.1 sft model. Compared with the other models, it shows more comprehensive and balanced advantages across the task type, question type, and experimental technique and tool dimensions. Especially in key areas such as tools and databases and the domain categories, its coverage is wider and its performance more stable, indicating the model's strong ability across different tasks and technical tools.
References
[1]: Metrics | Ragas, https://docs.ragas.io/en/stable/concepts/metrics/index.html.
[2]: Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv preprint arXiv:2309.15217. https://doi.org/10.48550/arXiv.2309.15217.
[3]: Faithfulness | Ragas, https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html.
[4]: Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). “BLEU: a Method for Automatic Evaluation of Machine Translation.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pp. 311–318. https://aclanthology.org/P02-1040.pdf.
[5]: Dalal, A., Ranjan, S., Bopaiah, Y., and others, "Text summarization for pharmaceutical sciences using hierarchical clustering with a weighted evaluation methodology," Sci. Rep., 2024, 14, 20149. https://doi.org/10.1038/s41598-024-70618-w
[6]: Context Precision | Ragas, https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html.
[7]: Context Recall | Ragas, https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html.