Purpose

Prometheus lets users describe a component in natural language and recommends the closest-matching components. Traditional matching algorithms often struggle with varied wording and complex sentence structures, which limits their accuracy in real-world use. To address this, we adopted Sent2Vec; because Prometheus is used mainly in contexts highly relevant to synthetic biology, we specifically chose BioSentVec, a Sent2Vec model trained on PubMed data.

Design

When building the model, we first vectorized every component description with BioSentVec and saved the results as binary arrays under ./Sent2Vec/embedding. At query time, Prometheus vectorizes the user's input with BioSentVec in the same way, searches the stored vectors for the one with the highest cosine similarity, and returns the corresponding component as the query result.
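
The lookup itself is a nearest-neighbour search over the stored vectors. The sketch below illustrates the idea with NumPy and the sent2vec Python bindings; the file names under ./Sent2Vec/embedding, the part-name index, and the helper best_match are illustrative assumptions for this sketch, not the exact Prometheus code.

import numpy as np
import sent2vec

# Load BioSentVec (file name as released by the BioSentVec authors).
model = sent2vec.Sent2vecModel()
model.load_model('BioSentVec_PubMed_MIMICIII-bigram_d700.bin')

# Precomputed description vectors and the matching part names
# (file names are assumptions for this sketch).
embeddings = np.load('./Sent2Vec/embedding/descriptions.npy')   # shape (n_parts, 700)
part_names = open('./Sent2Vec/embedding/parts.txt').read().splitlines()

def best_match(query):
    # Embed the query; in practice it goes through the same preprocessing
    # as the stored descriptions (see Implementation below).
    q = model.embed_sentence(query.lower())[0]
    # Cosine similarity between the query and every stored description.
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q) + 1e-12)
    best = int(np.argmax(sims))
    return part_names[best], float(sims[best])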

Implementation

To vectorize text segments, we first split them into individual sentences to meet BioSentVec's input requirements. Sentences are split using a period as a delimiter, which can lead to some incorrect segmentations:

  • Numbers like "2.15 mg/L" may be improperly split.
  • Proper nouns like "E.coli" might also be incorrectly separated.

For the first case, since numbers are of limited value as tokens, we do not expect a significant impact on the results. For the second case, we observed that BioSentVec's own tokenizer makes no special adjustment for such proper nouns either, so aligning our tokenization logic with the model's training preprocessing should improve performance. Notably, the model has seen terms like "coli" extensively during training, which allows it to still recognize the fragment as a proper noun.

import re
from string import punctuation

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def preprocess_sentence(text):
    # Pad punctuation with spaces so it tokenizes into separate symbols,
    # then lowercase to match BioSentVec's training preprocessing.
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace('\'', ' \' ')
    text = text.lower()

    # Drop punctuation and stop words, then rejoin into the single
    # whitespace-separated string that model.embed_sentence expects.
    tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]
    return ' '.join(tokens)

def split_sentences(text):
    # Split on a period only when it is followed by whitespace, so that
    # decimals such as "2.15" and names such as "E.coli" stay intact.
    parts = re.split(r'\.(?=\s)', text)
    sentences = []
    for part in parts:
        part = part.strip()
        if sentences and re.match(r'^\d+', part):
            # A fragment starting with digits was split off a number;
            # glue it back onto the previous sentence.
            sentences[-1] += '.' + part
        else:
            sentences.append(part)
    return sentences
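
For illustration, on a made-up description (the example text and the printed output below are ours, not taken from the Registry) the two helpers behave roughly as follows:

text = "This promoter drives expression at 2.15 mg/L in E.coli. It is constitutive."
for sent in split_sentences(text):
    print(preprocess_sentence(sent))
# roughly: "promoter drives expression 2 15 mg l e coli"
#          "constitutive"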

After preprocessing, we vectorize the segmented sentences using BioSentVec.

from tqdm import tqdm

# df holds one row per component, with its description in the 'function' column.
for _, row in tqdm(df.iterrows(), total=len(df)):
    cur_output = row['function']
    sent_list = split_sentences(cur_output)
    # Append a trailing period when missing so every sentence is embedded
    # in full-sentence form; very short fragments (< 4 characters) are skipped.
    output_embedding = [model.embed_sentence(preprocess_sentence(sent if sent.endswith('.') else sent + '.'))
                        for sent in sent_list if len(sent) >= 4]
    # Keep only embeddings that pass validation (check_embedding is defined elsewhere).
    flushed_embedding = [embedding for embedding in output_embedding if check_embedding(embedding)]
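
check_embedding is not shown in this excerpt. A plausible version, assuming that an unusable sentence yields an all-zero or NaN vector (sent2vec returns a zero vector when none of the tokens are known), would simply reject such vectors:

import numpy as np

def check_embedding(embedding):
    # Accept only vectors that are not all zeros and contain no NaNs.
    return bool(np.any(embedding)) and not np.isnan(embedding).any()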

With this approach, our model achieves rapid fuzzy matching between natural language segments, significantly outperforming traditional rule-based algorithms in accuracy.
