Purpose

Prometheus utilizes the iGEM official database[1] for its component data. However, the extracted metadata often contains excessive text, noise, or truncation issues, making it unsuitable for direct matching with large language models.

To address this, we employed ChatGLM3 (before V2.3) and Llama 3 (after V2.3) with fine-tuning or prompt engineering to extract functional descriptions of components, condensing them into 2-3 sentences that retain key functional information. This extraction significantly improved the matching success rate of the Sent2Vec model compared to raw data.

Design

We manually annotated 1,000 components, with two annotators per component. The data was then split into training and testing sets for fine-tuning ChatGLM3. The outputs from these models are stored in a knowledge graph for Prometheus to facilitate matching searches.

Implementation

The first round of data was collected in markdown format, and we formatted it for input into ChatGLM3 for fine-tuning.

CUDA_VISIBLE_DEVICES=0,1,2,3 python finetune_hf.py data/Cleared $ChatGLM3_HOME/models/chatglm3-6b configs/lora.yaml

The loss function during training is shown below:

For the second round, we applied prompt engineering with Llama 3 to extract functional and environmental details. The prompts included:

The prompts used are structured as follows:

prompt100 = f'''
#Background#
    1. We will provide you with a text after 'Here is the text:', which is converted from web page file, and may contain excessive and non-essential information.
    2. The first line of the text is the name of a synthetic biology component, '{target_part}'.
    3. The content of this web page is very irregular at present. Originally this page was supposed to describe only the function of '{target_part}', but now it contains descriptions of many other operating components related to '{target_part}', or of the entire project. It is necessary to extract the functional description and operating environment of '{target_part}' from the web page in order to better understand the characteristics and applicability of '{target_part}'.

#Task#
    1. Please summarize the function of the synthetic biology component named '{target_part}' from the provided text, and provide a clear and concise description focusing on the function and usage of '{target_part}'. Your extraction should be no more than 7 sentences. The function here includes how can it affect how specific proteins are expressed, what strains it works, how do it interact with other elements, and any synthetic biological function that the element itself plays. Be careful to output few numbers, more technical terms, and summarize the content of the text.
    2. Please provide a detailed summary of the necessary operating environments for the use of '{target_part}', focusing on the strains employed. Describe the essential settings, conditions, and biological systems required for its effective use, especially any unique environmental parameters and relevant microbial or genetic strains.
    3. Please ignore and remove the HTML contents that are not removed, and response with natural, fluid and accurate language.
    4. Please remove meaningless and confusing symbols around words, such as '_', '\\', etc

#Role#
    You are an excellent synthetic biologist, especially good at summarizing the functions of synthetic biological components, and you are always able to accurately summarize the functions of components and the required chemical and operating environment.

#Profile#
    You have the the ability to work as a researcher or analyst in the field with an in-depth understanding of synthetic biology components and their applications.

#Skills#
    You are familiar with synthetic biology terminology, understanding web information extraction techniques, bioinformatics background. You can accurately distinguish between the description of '{target_part}' and the description of other components.

#Goals#
    Accurately extract the functional description and operating environment of '{target_part}' from the web page. If the function or operating environment of '{target_part}' is related to other components, feel free to write other components' contents, functions, characteristics, etc.

#Constrains#
    1. You should ensure that the extracted information is accurate to avoid any possible misinterpretation or information loss.
    2. If the function or operating environment of '{target_part}' is related to other components, feel free to write other components' contents, functions, characteristics, etc.
    3. Please try to use the original sentences and words to summarize and keep professional terms.
    4. Please do not include any web links or html text in the output.
    5. In the process of summary, try to use more technical terms and comparative words from the original text, less use of numbers.
    6. If there is not much information in the text, do not expand the sentence and add information yourself.
    7. If the text is short and contains no more than 3 sentences of useful information, please output all the useful sentences.

#OutputFormat#
    1. Please response with "FUNCTION: (functional description of '{target_part}') \nEnvironment: (operating environment of '{target_part}')"
    2. Your answer should only contain the functional description and operating environment, no need for any other words.
    3. If the function details are not clear, respond with 'No description of the function'.
    4. If information on the operating environment or strains is insufficient, respond with 'No description of the environment'.
    
#Workflow#
    1. Identify and locate the functional description and operating environment related to '{target_part}' in the web page.
    2. Extract and organize information to ensure its accuracy and completeness.
    3. If you cannot extract enough information, just output 'No description of the function' or 'No description of the environment'.

    Here is the text:'''
fun, env = async_process_chatbot(f'{prompt100}\n{page}', [])
while fun == 'None' or env == 'None':
	fun, env = async_process_chatbot(f'{prompt100}\n{page}', [])

Both phases of function and environment extraction were successfully completed. The final results were saved in JSON format as knowledge graph nodes in ./API_import/part_llama3.json.

[1]: https://parts.igem.org/.

BACK TO
TOP !