Purpose

Guided by the HP group and by discussions with Professor Zhang, we found that the model is especially useful for non-experts and beginners. However, because these users have limited background in synthetic biology, a list of recommended components alone is not enough for them to build the plasmids they need. We therefore designed a downstream step that outputs a complete plasmid for the user to reference.

Design

In traditional synthetic biology, we use relatively fixed engineering plasmids to achieve the expression of target genes. These plasmids have undergone extensive experimentation and optimization, making them efficient expression tools supported by substantial experimental data. While more customized plasmids might yield better results, the lack of experimental data makes customization inherently challenging. Balancing these factors, we decided to use existing engineering plasmids as a foundation, adding or replacing components to meet user functionality requirements.

Moreover, because plasmids differ significantly between organisms, no single plasmid can be used in every expression environment. We considered the most common prokaryotic and eukaryotic hosts, E. coli and yeast, as well as animal and plant systems, which gives four distinct expression environments, each with its own plasmid backbone. We also implemented code to identify the expression environment the user requires; if no specific environment is requested, we default to E. coli, which is the most common host and the richest in available components.

Initially, we retained only the essential structures on the plasmid that matched user needs—such as promoters, regulators, target genes, and terminators. However, feedback from the HP and wet lab teams indicated that including additional components could aid in plasmid construction (e.g., directly indicating that the plasmid should include the AmpR gene simplifies the selection of marker genes). Consequently, we further refined the plasmid design to include fixed content that is necessary for expression alongside variable content determined by user needs.

Implementation

In practice, we decompose each plasmid backbone into three parts: a fixed region, the user-defined region that follows it, and a second fixed region after the user-defined region. For the fixed regions, we store the components in sequence order. Because users may have specific requirements for promoters and terminators, these belong to the user-defined region, and we store default options that are used only when the user does not specify a preference. For instance, in the E. coli plasmid, the fixed region before the user-defined region contains the AmpR promoter, AmpR, and ori, while the fixed region after it is empty. For the user-defined region, the defaults are the T7 promoter and the T7 terminator.

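# Each backbone is [fixed region before, user-region defaults, fixed region after].
# Each component is [type, name, description].
backbones = {
    "ecoli": [
        # Fixed region placed before the user-defined region
        [
            ["Promoter", "AmpR promoter", "These are the original parts needed for the plasmid to function properly."],
            ["Tag", "AmpR", "These are the original parts needed for the plasmid to function properly."],
            ["DNA", "ori", "These are the original parts needed for the plasmid to function properly."],
        ],
        # Defaults used when the user does not specify a promoter or terminator
        {
            "Promoter": ["Promoter", "T7 promoter", "This is a default promoter..."],
            "Terminator": ["Terminator", "T7 terminator", "This is a default terminator..."],
        },
        # Fixed region placed after the user-defined region (empty for E. coli)
        [],
    ],
    # Additional expression environments...
}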
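
To illustrate how the three parts might be combined, the following is a minimal sketch rather than our actual assembly code. The helper name assemble_plasmid, the format of user_parts, the "CDS" type label, and the GFP example gene are all assumptions made for illustration; the component descriptions are shortened.

```python
# Hypothetical sketch: assemble_plasmid and the user_parts format are
# illustrative assumptions, not the project's actual implementation.
backbones = {
    "ecoli": [
        [
            ["Promoter", "AmpR promoter", "fixed part"],
            ["Tag", "AmpR", "fixed part"],
            ["DNA", "ori", "fixed part"],
        ],
        {
            "Promoter": ["Promoter", "T7 promoter", "default promoter"],
            "Terminator": ["Terminator", "T7 terminator", "default terminator"],
        },
        [],
    ],
}


def assemble_plasmid(env, user_parts):
    """Return the ordered component list for one expression environment.

    user_parts is a list of [type, name, description] entries selected
    for the user (assumed format).
    """
    fixed_before, defaults, fixed_after = backbones[env]
    user_types = {part[0] for part in user_parts}

    middle = []
    # Fall back to the default promoter only if the user supplied none.
    if "Promoter" not in user_types:
        middle.append(defaults["Promoter"])
    middle.extend(user_parts)
    # Same fallback rule for the terminator.
    if "Terminator" not in user_types:
        middle.append(defaults["Terminator"])

    return fixed_before + middle + fixed_after


plasmid = assemble_plasmid("ecoli", [["CDS", "GFP", "example target gene"]])
print([part[1] for part in plasmid])
# ['AmpR promoter', 'AmpR', 'ori', 'T7 promoter', 'GFP', 'T7 terminator']
```

Keeping the defaults in a dictionary keyed by component type makes the fallback a simple membership check, so a user-specified promoter or terminator replaces the default without touching the fixed regions.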

Besides E. coli, we prepared corresponding plasmids for the other expression systems. To determine the expression environment, we count how often each environment name appears in the user requirements and component descriptions and select the most frequent one. Because the same host can be written in several ways, such as ecoli, E.coli, and Escherichia coli, we use the FuzzyWuzzy library for fuzzy matching instead of exact string comparison.

from collections import Counter

from fuzzywuzzy import fuzz


def get_env(environment):
    """Return the backbone key mentioned most often in the text, or "ecoli" if none match."""
    match_counts = {}
    long_string = environment.lower()
    for short_string in backbones.keys():
        short_string = short_string.lower()
        # The window is slightly wider than the key to tolerate extra characters such as "." or " ".
        window = len(short_string) + 2
        i = 0
        while i < len(long_string):
            substring = long_string[i:i + window]
            ratio = fuzz.ratio(short_string, substring)
            if ratio >= 60:
                match_counts[short_string] = match_counts.get(short_string, 0) + 1
                # Skip past this match so overlapping windows are not counted again.
                i += len(short_string) + 1
            i += 1
    most_common = Counter(match_counts).most_common(1)
    return most_common[0][0] if most_common else "ecoli"
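
For readers who want to try the sliding-window idea without installing FuzzyWuzzy, the sketch below swaps in the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio. This is an assumption made for illustration: the scores may differ slightly from FuzzyWuzzy's, and the names ENV_KEYS, similarity, count_env_matches, and pick_env are hypothetical. Only "ecoli" appears in the backbones excerpt above; the "yeast" key is assumed.

```python
from difflib import SequenceMatcher

# Assumed environment keys; only "ecoli" is shown in the backbones excerpt.
ENV_KEYS = ["ecoli", "yeast"]


def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def count_env_matches(text, keys=ENV_KEYS, threshold=60):
    """Count fuzzy occurrences of each key with a sliding window."""
    text = text.lower()
    counts = {}
    for key in keys:
        window = len(key) + 2  # allow a little slack around the key
        i = 0
        while i < len(text):
            if similarity(key, text[i:i + window]) >= threshold:
                counts[key] = counts.get(key, 0) + 1
                i += len(key) + 1  # skip past the matched region
            i += 1
    return counts


def pick_env(text, default="ecoli"):
    """Return the most frequently matched key, or the default if none match."""
    counts = count_env_matches(text)
    return max(counts, key=counts.get) if counts else default


print(pick_env("E.coli BL21"))
print(pick_env("expressed in yeast"))
print(pick_env(""))
```

The threshold of 60 and the window width of the key length plus two mirror the values in get_env; both trade tolerance for spelling variants against the risk of false matches on short, unrelated substrings.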

In this phase, we did not use large models; instead, we relied on simple algorithms, which keeps execution fast. We did, however, use the third-party Python library FuzzyWuzzy, whose usage details are available in its documentation^[1].

[1]: https://pypi.org/project/fuzzywuzzy