Purpose

A knowledge graph represents and organizes information as nodes and edges, where nodes stand for entities or concepts and edges denote the relationships between them. Knowledge graphs are widely used in natural language processing, search engines, and recommendation systems, helping machines understand and reason about information for more accurate question answering and retrieval. By capturing complex semantic connections, they allow users to obtain more accurate and relevant results when querying. With the rise of artificial intelligence, knowledge graphs have become a key building block of intelligent services.

Design

We adopted a knowledge graph as the database, with the following node and relationship types (a minimal illustrative sketch follows this list):

Category nodes include:

Component nodes include:

Relationships between components and categories include:

Relationships between components include:
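
As a minimal sketch of this schema, the graph could be represented with networkx as shown below. The node kinds and relationship names used here (such as `BELONGS_TO` and `PART_OF`) are placeholders for illustration, not the exact labels used in Prometheus.

```python
import networkx as nx

# Illustrative schema sketch only: the node kinds and relationship names
# ('BELONGS_TO', 'PART_OF') are placeholders, not the exact Prometheus labels.
kg = nx.MultiDiGraph()

# Category nodes
kg.add_node('Promoter', kind='Category')
kg.add_node('Composite_Part', kind='Category')

# Component nodes (BBa_J23100 is a real Registry promoter;
# BBa_EXAMPLE_DEVICE is a made-up composite part for illustration)
kg.add_node('BBa_J23100', kind='Component')
kg.add_node('BBa_EXAMPLE_DEVICE', kind='Component')

# Component-category relationships
kg.add_edge('BBa_J23100', 'Promoter', relation='BELONGS_TO')
kg.add_edge('BBa_EXAMPLE_DEVICE', 'Composite_Part', relation='BELONGS_TO')

# Component-component relationships
kg.add_edge('BBa_J23100', 'BBa_EXAMPLE_DEVICE', relation='PART_OF')
```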

Initially, we built the knowledge graph from the official iGEM database[1], using its classification index. Because that classification is incomplete, our crawl covered only about 16,000 components. During subsequent HP (Human Practices) activities, we identified many iGEM components that were not included in Prometheus. After discussions with Asimov and iGEM, we learned that iGEM provides an open, formatted database[2] containing about 80,000 records. However, the two databases use different classification standards. To make full use of the available information while remaining compatible with the V1.0 Prometheus model, we merged the two databases.

Implementation

In the construction of V1.0, we manually created a download index for the default iGEM database classifications and wrote a web scraper to recursively crawl the webpage content. During this process, we also separately extracted each component's sequence information, which is difficult for large language models to handle.

```python
# Excerpt from the crawler's per-part download routine
# (requests, re and time are imported at module level).
try:
    resp = requests.get(url, headers=headers, timeout=20)
    resp.encoding = 'utf-8'
    time.sleep(20)  # throttle requests to avoid overloading the Registry server
    with open('$iGEM_HOME/pachong/saved_url_5.txt', 'a+') as file:
        file.write(f'{url}\n')  # record successfully crawled URLs
except Exception:
    print('Request Failure.')
    with open('$iGEM_HOME/pachong/wrong_url_5.txt', 'a+') as file:
        file.write(f'{url}\n')  # record failed URLs for a later retry
    return

# Fetch the part's XML record (uid is the part identifier), which holds its sequence
sub_url = f"https://parts.igem.org/cgi/xml/part.cgi?part=recursive.{uid}"
sub_resp = requests.get(sub_url, headers=headers, timeout=20)
time.sleep(6)
# Extract the sequence from the <seq_data> element of the Registry's part XML
sub_obj = re.compile(r"<seq_data>(?P<seq>.*?)</seq_data>", re.S)
seq = sub_obj.search(sub_resp.text).group('seq')
sub_save(resp.text, seq, url)
```

After crawling the webpages, we used the html2text tool to convert all HTML pages to markdown format.
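
For reference, a minimal conversion step with the html2text package might look like the following; the `html/` and `md/` directory names are illustrative placeholders, not our actual pipeline paths.

```python
import os
import html2text

# Convert every crawled HTML page into Markdown.
# The html/ and md/ directory names are placeholders for illustration.
converter = html2text.HTML2Text()
converter.ignore_links = False   # keep hyperlinks to other parts
converter.body_width = 0         # do not hard-wrap lines

os.makedirs('md', exist_ok=True)
for name in os.listdir('html'):
    if not name.endswith('.html'):
        continue
    with open(os.path.join('html', name), encoding='utf-8') as f:
        markdown = converter.handle(f.read())
    with open(os.path.join('md', name.replace('.html', '.md')), 'w', encoding='utf-8') as f:
        f.write(markdown)
```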

When downloading data from the iGEM formatted database, we found that some component descriptions were likely truncated, and the short descriptions were often too vague to accurately convey component functions. Thus, we conducted a secondary crawl of all components referenced in the formatted database (./API_import/xml_parts.xml).
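
As a rough sketch of how the list of parts for this secondary crawl can be pulled out of the XML dump, the snippet below walks the file with ElementTree; the `part_name` element name is an assumption about the dump's layout, not its documented schema.

```python
import xml.etree.ElementTree as ET

# Collect the part names referenced in the formatted-database dump so they
# can be fed back into the crawler. The 'part_name' element name is an
# assumption about the dump's layout, not a documented schema.
tree = ET.parse('./API_import/xml_parts.xml')
root = tree.getroot()

part_names = []
for record in root.iter():
    name_elem = record.find('part_name')
    if name_elem is not None and name_elem.text:
        part_names.append(name_elem.text.strip())

print(f'{len(part_names)} parts queued for the secondary crawl')
```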

Since the classification standards from the second crawl significantly differed from the first, we needed to map the categories in the XML database to those used in the V1.0 Prometheus model. We identified shared components between the two databases and counted their pairings across different categories:

```python
import json

def analyze():
    # part_API_path: parts from the formatted (XML/API) database
    # part_ultra_path: parts crawled for the V1.0 Prometheus model
    with open(part_API_path, 'r') as f:
        data_API = json.load(f)
    with open(part_ultra_path, 'r') as f:
        data_ultra = json.load(f)

    BBa_ultra = set()        # part names present in the V1.0 database
    BBa_ultra_type = {}      # part name -> V1.0 category
    relationship = set()     # (API category, V1.0 category) pairs observed
    examples = {}            # pair -> shared parts exhibiting that pairing

    for part in data_ultra:
        BBa_ultra.add(part['name'])
        BBa_ultra_type[part['name']] = part['catalog'][0]

    for part in data_API:
        if part['name'] in BBa_ultra:
            if BBa_ultra_type[part['name']] == 'Unknown':
                continue
            cur_pair = (part['catalog'][0], BBa_ultra_type[part['name']])
            if cur_pair not in relationship:
                examples[cur_pair] = []

            examples[cur_pair].append(part['name'])
            relationship.add(cur_pair)

    def custom_sort(pair):
        # Group by the API category, then sort by descending pair frequency
        first_key = pair[0]
        example_length = -len(examples[pair])
        return (first_key, example_length)

    for i in sorted(list(relationship), key=custom_sort):
        print(f"{len(examples[i])}: {i}")
```

From these pair counts we generated a heatmap and used it to derive our mapping rules.
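
As an illustration, the pair counts can be turned into a heatmap roughly as follows; the `pair_counts` dictionary, its category labels, and the counts are hypothetical stand-ins for the output of `analyze()`, not our exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

# pair_counts maps (API category, V1.0 category) -> number of shared parts.
# The entries below are hypothetical placeholders for the analyze() output.
pair_counts = {('Coding', 'Protein_Coding_Sequence'): 1200,
               ('Regulatory', 'Promoter'): 800,
               ('Regulatory', 'Terminator'): 90}

api_cats = sorted({p[0] for p in pair_counts})
v1_cats = sorted({p[1] for p in pair_counts})
matrix = np.zeros((len(api_cats), len(v1_cats)))
for (a, b), count in pair_counts.items():
    matrix[api_cats.index(a), v1_cats.index(b)] = count

fig, ax = plt.subplots()
im = ax.imshow(matrix, cmap='viridis')
ax.set_xticks(range(len(v1_cats)), labels=v1_cats, rotation=90)
ax.set_yticks(range(len(api_cats)), labels=api_cats)
fig.colorbar(im, ax=ax, label='shared parts')
plt.tight_layout()
plt.savefig('category_heatmap.png')
```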

![[Pasted image 20240928014059.png]]

We obtained mapping rules for certain categories with high confidence, but many low-confidence categories still could not be mapped reliably. To address this, we incorporated the specific category names into the system prompt and fine-tuned ChatGLM3 on the classification task:

String_A = "You are an excellent synthetic biologist, especially good at categorizing synthetic biological components through their functions. You are not given a description of a function of a synthetic biological component, please categorize it among these categories: Promoter, Ribosome_Binding_Site, Protein_Domain, Protein_Coding_Sequence, Translational_Unit, Terminator, DNA, Plasmid_Backbone, Plasmid, Primer, Composite_Part. Only contain exactly one category name in your output. DO NOT CONTAIN ANY OTHER WORDS ELSE IN YOUR OUTPUT. If you cannot decide how to categorize, output Unknown"

allowed_category = ['Protein_Domain', 'Protein_Coding_Sequence', 'Composite_Part', 'DNA', 'Plasmid', 'Plasmid_Backbone', 'Primer', 'Protein_Domain', 'Ribosome_Binding_Site', 'Promoter', 'Protein_Domain', 'Terminator', 'Translational_Unit', 'Unknown', 'Cell', 'Conjugation', 'Inverter', 'RNA', 'T7']
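
To give a concrete sense of the fine-tuning setup, the sketch below assembles training samples that pair the system prompt `String_A` (defined above) with a part description and its known category, in a ChatGLM3-style multi-turn format. The `parts` list, output file name, and exact field names are illustrative assumptions, not our actual training script.

```python
import json

# Build fine-tuning samples: system prompt + part description -> category.
# The parts list and the output file name are hypothetical placeholders.
parts = [
    {'description': 'Constitutive sigma70 promoter from the Anderson collection.',
     'catalog': ['Promoter']},
]

samples = []
for part in parts:
    samples.append({
        'conversations': [
            {'role': 'system', 'content': String_A},
            {'role': 'user', 'content': part['description']},
            {'role': 'assistant', 'content': part['catalog'][0]},
        ]
    })

with open('finetune_train.jsonl', 'w', encoding='utf-8') as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + '\n')
```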

After fine-tuning, we organized the correctly classified components into ./API_import/part_llama3.json, for a total of 77,563 components.
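
The filtering step can be sketched as follows: predictions outside `allowed_category` (defined above) are treated as failures and dropped before the records are written out. The `predict()` helper, the `unclassified_parts` list, and the record fields are assumptions used for illustration, not our exact code.

```python
import json

def predict(description):
    # Hypothetical stand-in for the fine-tuned ChatGLM3 classifier:
    # in the real pipeline this returns one category name per description.
    return 'Promoter'

# Placeholder input; in practice these are the parts whose categories
# could not be mapped by the rules derived from the heatmap.
unclassified_parts = [
    {'name': 'BBa_EXAMPLE1', 'description': 'A constitutive promoter.', 'catalog': ['Unknown']},
]

classified = []
for part in unclassified_parts:
    category = predict(part['description']).strip()
    if category not in allowed_category or category == 'Unknown':
        continue   # drop outputs the model could not place
    part['catalog'] = [category]
    classified.append(part)

with open('./API_import/part_llama3.json', 'w', encoding='utf-8') as f:
    json.dump(classified, f, ensure_ascii=False, indent=2)
```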

[1]: https://parts.igem.org/

[2]: https://parts.igem.org/Registry_API
