Software

Tips

To try PartHub 3.0 - please follow the installation tutorial to deploy our software on your own device!

Highlights

Efficiently uses the iGEM Registry and supports relevant synthetic biology standards such as the GenBank and FASTA formats
Validation against both published and new experimental data
Flexible, adaptable design that can be easily tailored to a wide range of application scenarios
Well-documented APIs; easy integration with SnapGene
Intuitive web UI and detailed tutorial

Parts are at the core of synthetic biology, and over the years, Fudan's software has been dedicated to providing useful tools for parts management and analysis (Table 1). Our initial efforts, PartHub 1.0 and PartHub 2.0, were well-received for their features in displaying parts' citation relationships and enhancing search functionality. However, a critical gap remained: the importance of sequence information for parts. While citation relationships and search functionalities are valuable, the sequence of a part is arguably the most essential piece of information. The sequence not only defines the functional properties of a part but also influences its compatibility and performance in host organisms.

Therefore, we are excited to introduce PartHub 3.0 this year, which addresses this gap by focusing on two critical aspects of parts: burden prediction and similarity estimation (Table 1).

Table 1: Comparison of PartHub 1.0-3.0

PartHub Version	Main Feature	Use case
1.0	Display Citation Relationships of Parts	Researchers can quickly find the upstream and downstream citations of a specific part, helping them understand its performance and usage in different experiments.
2.0	Enhanced Search Functionality	Users can quickly find the parts they are interested in thanks to advanced search and recommendation algorithms.
3.0	Burden Prediction and Similarity Estimation	Apart from the features above, researchers can also understand the burden of different parts and find parts similar to the target part.

As illustrated in Figure 1, PartHub 3.0 consists of two main components: the Burden Predictor and the Similarity Estimator. The Burden Predictor can predict the metabolic burden of a composite part, which can be either a monocistron or a pRAP system-based polycistron. The Similarity Estimator allows users to search for specific parts within PartHub and identify similar parts.

Figure 1. Schematic figure of PartHub 3.0.

Features

PartHub 3.0 is designed with a strong emphasis on user-friendliness, so that researchers without a computer science background can also easily navigate and use its advanced features. The frontend of our software is built with Vue.js 3.4 and Ant Design Vue 4.2.3 to enhance the visual appeal and usability of the application. On the backend, we employ Flask for efficient and scalable development. For data storage, we use Neo4j 5.11, a powerful graph database that excels in managing large datasets with complex relationships. This combination of technologies ensures that PartHub 3.0 is not only robust but also intuitive and accessible.

Our software is compatible with the following browsers:

Edge	Firefox	Chrome	Safari	Opera
version≥116	version≥116	version≥116	version≥18	version≥100

API

To facilitate integration with other tools and platforms, we have also created comprehensive and easy-to-use RESTful APIs for our software. The API documentation provides detailed information on all available endpoints, including request and response formats and example usage.

Integration with SnapGene

Our software supports commonly used synthetic biology formats, including GenBank and FASTA. It can directly use sequence files exported from SnapGene as input, so it fits seamlessly into the SnapGene workflow.

Figure 2. Integration of PartHub 3.0 with the SnapGene Workflow.

Code availability

We have also uploaded our source code to GitLab and created a Docker image for easier installation.

Tutorial

1. Installation

How to install?

You can directly visit our live demo at http://47.97.85.37:5000/.

Install with readily available docker image

Please install Docker first, and create a file named docker-compose.yml with the following content in your working directory:

services:
  flask:
    image: chc1234567890/fudanigem2024:1.0.0
    ports:
      - "5000:5000"
    restart: always
    depends_on:
      - parthub
    environment:
      - SERVER_URL=bolt://parthub:7687
      - SERVER_USER=neo4j
      - SERVER_PASSWORD=igem2024
  parthub:
    image: neo4j:5.11
    restart: always
    environment:
      - NEO4J_AUTH=neo4j/igem2024
      - NEO4J_PLUGINS=["graph-data-science"]
      - NEO4J_dbms_security_procedures_allowlist=gds.*
      - NEO4J_dbms_security_procedures_unrestricted=gds.*
    ports:
      - "7474:7474"
      - "7687:7687"
    deploy:
      resources:
        reservations:
          memory: 2G

Then, open a terminal (cmd or PowerShell on Windows; bash on Linux and macOS), change the working directory to the one containing docker-compose.yml, and run the following command:

docker compose up -d

Once the deployment is complete, PartHub 3.0 will be running at http://localhost:5000.

Install from source code on GitLab

The software uses Docker, Git, and Node.js for deployment, so please install them first!

On Linux and macOS, run the following commands in bash:

git clone https://gitlab.igem.org/2024/software-tools/fudan.git
cd fudan/webUI
npm install
cd ..
bash pack.sh

On Windows, run the following commands in cmd or PowerShell:

git clone https://gitlab.igem.org/2024/software-tools/fudan.git
cd fudan/webUI
npm install
cd ..
pack.bat

Once the deployment is complete, PartHub 3.0 will be running at http://localhost:5000.

2. Burden Predictor

After opening the browser, you will see the home page of our software, as shown in Figure 3. Click "Burden Predictor" at the top of the screen.

Figure 3. Home page of PartHub 3.0.

Next, select basic parts from the basic parts library and click "Add basic part", as shown in Figure 4.

Figure 4. Select basic parts.

If the part you are interested in is not in the library, you can manually input its sequence, upload its sequence file, or click "Search in PartHub" to find it in PartHub, as described below.

Figure 4. Input or upload parts.

After selecting the parts, you can view them under the "Current parts" line. Next, set the copy number of the plasmid where the parts will be located. When you hover over the copy number input box, a tooltip listing common plasmid copy numbers will appear. This feature helps you accurately specify the copy number, ensuring optimal expression of the selected parts.

On the right side of the copy number input box, there is a switch that allows you to indicate whether the composite part should be considered as part of the pRAP system. If your part is polycistronic, please turn this switch on to ensure accurate prediction.

Once you have configured the settings, click the "Calculate" button to predict the burden, and the result will be displayed on the right side of the interface.

Figure 5. Burden calculation interface.

Warning: Do not include non-ASCII characters in the sequence file, or it may cause an error!
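A quick check before uploading can catch this problem. The sketch below is not part of PartHub; the function names `find_non_ascii` and `check_file` are hypothetical, and it simply reports the position of every character outside the ASCII range.

```python
from pathlib import Path


def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


def check_file(path):
    """Read a sequence file and list its non-ASCII characters.

    Undecodable bytes are replaced by U+FFFD, which is itself non-ASCII,
    so they are reported as well.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_non_ascii(text)
```

An empty list means the file is safe to upload in this respect.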

3. Similar Parts Searching

To enter the similar parts searching page, please click "PartHub" at the top of the screen. Once there, you can input your search content in the input bar at the top and select the search type, which will navigate you to the "Search Results" page. Alternatively, you can choose the type of parts you want to query from the options below, and input the sequence or upload a sequence file, which will directly take you to the detailed information page for the part.

Figure 6. The similar parts searching page.

The "Search Results" page lists the parts that match your search query. To view detailed information about a specific part, click the orange title to navigate to the detailed information page.

If you wish to add a part to the Burden Predictor, click the "Add to Burden Predictor" button located below the part's information. Unfortunately, many parts in the iGEM registry do not adhere to the required format, and only those with a standardized format can be successfully imported into the Burden Predictor. We appreciate your understanding and encourage you to use parts with proper formatting for the best results.

Figure 7. The search results interface.

When you enter the detailed information page, the software automatically begins the process of finding similar parts to the queried part in PartHub. This process may take about one minute. Once the similarity calculation is complete, the interface will update as shown in Figure 8.

On the left side of the page, a tree map displays the queried part along with parts that have reference, twin, or similarity relationships with it. In the tree map, purple nodes represent basic parts, and blue nodes represent composite parts. You can scroll in or out to adjust the size of the tree map or drag it to change its position for better visibility.

On the right side of the detailed information page, a list showcases the parts most similar to the queried part, including three types of similarity scores, which are detailed below. Due to performance considerations, we only display the top 100 similar parts, and in the tree map, we show the top 30 most similar parts.

Figure 8. Detailed information page with similar parts shown.

To view specific information about a part or relationship, click the "Part/Relationship info" tab in the top-right corner. You can also click on nodes or edges in the tree map to view detailed information about the parts or relationships. For example, in Figure 9, the similarity information of the two parts is displayed after clicking on the edge indicated by the red arrow.

Figure 9. The similarity information of two parts is shown after clicking on an edge in the tree map.

Burden Predictor

Introduction

As synthetic biology continues to advance, the parts being introduced into cells are becoming increasingly complex. However, introducing complex parts into cells can increase the metabolic burden, thereby slowing down the growth rate of the cells. Excessive burden can lead to significant selective pressure, causing engineered bacteria to mutate back to their wild-type forms more quickly, which can result in the engineered cells being out-competed by their less functional or non-functional mutants^[1]. Therefore, it is crucial to investigate why some parts impose a greater burden than others.

One of the primary reasons for this metabolic burden is the depletion of cellular resources such as ribosomes, tRNAs, and ATP, which are essential for gene expression. A recent study quantified the burden of 301 BioBrick plasmids and found that the depletion of gene expression resources was the main cause of the observed burden^[2]. Despite this understanding, there are currently no methods available to predict the burden of a genetic part based on its sequence and structure.

To address these issues, we have developed Burden Predictor, a tool that can predict the burden caused by gene expression of a certain part. It takes into account the allocation of gene expression resources within the cell, using only the sequence and structure of the part as input.

The pRAP system

Our team has been using the pRAP (polycistronic Ribozyme-Activated) system for several years. This system addresses the lower expression of downstream genes in polycistronic vectors by incorporating self-cleaving ribozyme RNA sequences^[3]. Self-cleavage converts the polycistronic mRNA transcript into individual monocistrons post-transcriptionally, ensuring that each gene is translated with comparable efficiency.

Initially designed for monocistronic parts, Burden Predictor has been extended to incorporate the prediction of the burden of parts using the pRAP system, thereby enhancing its usability and applicability in a broader range of genetic constructs.

Implementation

Model Construction

The core of Burden Predictor is an improved version of the mechanistic model framework from Weiße et al. and Nikolados et al.^[4]^[5] We did not choose genome-scale models or machine learning models because of their computational intensity and time consumption, as well as the scarcity of high-quality data.

Moreover, we selected Escherichia coli as our host organism for initial development and validation. However, the model is designed to be flexible and adaptable, allowing researchers to conveniently model the burden in other host organisms by simply changing the parameters of the model. This flexibility ensures that our software can be widely applied across different situations, making it a versatile tool for synthetic biology research.

As illustrated in Figure 10, our model incorporates the basic gene expression processes, including transcription (TX) and translation (TL). Transcription is simplified into a lumped process, while translation contains two stages: initiation, when the ribosome binds to the ribosome binding site (RBS) of the mRNA, and elongation, when the ribosome moves along the open reading frame and produces the polypeptide chain.

Figure 10. Schematic figure of Burden Predictor. TX, transcription; TL, translation.

To capture how cells allocate their resources across various types of proteins, we partition the proteome into five components: heterologous proteins (expressed by the parts introduced into the cell), ribosomes, metabolic enzymes, transporters, and housekeeping proteins. In our model, mRNA transcripts compete for free ribosomes and energy for translation, so genes from the introduced parts must contend with native genes for cellular resources. This competition, coupled with the dilution of heterologous proteins due to the predicted growth rate, creates a two-way interaction between the parts and the cellular host. The burden of a part is defined as the percentage reduction in the growth rate of the host organism compared to the wild type.

To make our software more useful to synthetic biologists, we have introduced the commonly used parameters of parts into our software, including plasmid copy number, promoter strength, RBS strength, and the length of the coding sequence (CDS). In our software, plasmid copy number and promoter strength are modeled as the maximal TX rate of a gene, RBS strength as the binding affinity and dissociation constant between transcripts and ribosomes, and the codon adaptation index (CAI) as the maximum TL elongation rate. These features allow synthetic biologists to explore how different environmental and genetic conditions influence the behavior of gene circuits within the host cell.

We have also built a small library containing several commonly used basic parts, including promoters, RBS, and CDS. These parts have been experimentally validated for their promoter strength and RBS strength. To know more information about this library, please visit this link.

Didn't find the parts you are interested in in our library? You can easily add them to Burden Predictor using one of the following methods:

Input the Part's Sequence: Manually enter the sequence of the part directly into the software.
Upload the Sequence File: Upload sequence files from your device in GenBank or FASTA format.
Search the iGEM Registry by PartHub: Use PartHub to search the iGEM Registry for parts of interest and add them to the Burden Predictor.

This flexible approach ensures you can incorporate any part you need for your research.
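To illustrate what a FASTA input looks like once parsed, here is a minimal sketch of a FASTA reader in plain Python. It is not the parser used by PartHub; the function name `parse_fasta` and the choice to use the full header line as the record name are assumptions made for this example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary.

    Sequence lines are concatenated and upper-cased; blank lines are skipped.
    """
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the header (without '>') as the name; fall back to a counter.
            name = line[1:].strip() or f"record_{len(records) + 1}"
            records[name] = []
        else:
            if name is None:
                raise ValueError("sequence data found before any '>' header")
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}
```

For example, `parse_fasta(">demo_part\nATG\ncca\n")` returns `{"demo_part": "ATGCCA"}`.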

For detailed instructions on how to use these features, please refer to our tutorial.

For basic parts not included in our library, we employ the Promoter Calculator and RBS Calculator to estimate the promoter strength and RBS strength based on their sequences, respectively^[6]^[7]. These tools use advanced algorithms to predict the functional properties of sequences, ensuring that users can accurately assess the performance of new parts.

Methods

Variables

The core variables of our model are shown below.

Table 2: Core variables of our model

Name	Definition
$s_{i}$	internalized nutrient
$a$	total pool of energy molecules (e.g. ATP)
$p_{r}$	ribosomes
$p_{t}$	transporters
$p_{m}$	metabolic enzymes
$p_{q}$	housekeeping proteins
$m_{x}$	free mRNAs, $x \in \{r, t, m, q\}$
$c_{x}$	ribosome-bound mRNAs

Mechanistic model

The nutrient uptake and consumption are modeled by equations (1) to (3):

\begin{matrix} (1) & \dot{s_{i}} = v_{i m p} - v_{c a t} - λ s_{i} \end{matrix}

\begin{matrix} (2) & v_{i m p} = p_{t} \frac{v_{t} s}{K_{t} + s} \end{matrix}

\begin{matrix} (3) & v_{c a t} = p_{m} \frac{v_{m} s_{i}}{K_{m} + s_{i}} \end{matrix}

Here, $v_{i m p}$ is the nutrient import rate, $v_{c a t}$ is the catalytic rate of the metabolic enzymes, $s$ is the external nutrient level, and $λ$ is the growth rate. All intracellular molecules are assumed to be diluted at rate $λ$ because of cell growth.
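Equations (1) to (3) can be written out directly. The following is a minimal sketch rather than our production code: the function names are hypothetical, the default values of $v_{t}$, $v_{m}$, $K_{t}$ and $K_{m}$ are taken from Table 3, and the external nutrient level `s` is a free input whose value is not specified in this document.

```python
def import_rate(p_t, s, v_t=726.0, K_t=1000.0):
    """Equation (2): nutrient import by transporters p_t at external level s."""
    return p_t * v_t * s / (K_t + s)


def catabolism_rate(p_m, s_i, v_m=5800.0, K_m=1000.0):
    """Equation (3): catabolism of internalized nutrient s_i by enzymes p_m."""
    return p_m * v_m * s_i / (K_m + s_i)


def ds_i_dt(s_i, p_t, p_m, s, lam):
    """Equation (1): import minus catabolism minus dilution by growth."""
    return import_rate(p_t, s) - catabolism_rate(p_m, s_i) - lam * s_i
```

Both rates follow Michaelis-Menten kinetics, so each saturates at `p_t * v_t` or `p_m * v_m` when the substrate is abundant.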

The rate of energy molecule production and consumption is described by equation (4):

\begin{matrix} (4) & \dot{a} = n_{s} v_{c a t} - \sum_{x \in \{r, t, m, q\}} n_{x} v_{x} - λ a \end{matrix}

Here, $n_{s}$ refers to the nutrient efficiency of the growth medium, $n_{x}$ the length of proteins in terms of amino acids, and $v_{x}$ is the TL rate. We assume that the TL process is the dominant source of energy consumption, and other energy-consuming processes are negligible^[8].

The synthesis and degradation of proteins and mRNAs are captured by equations (5) to (8):

\begin{matrix} (5) & \dot{m_{x}} = w_{x} - (λ + d_{m}) m_{x} + v_{x} - k_{b} p_{r} m_{x} + k_{u} c_{x} \end{matrix}

\begin{matrix} (6) & \dot{c_{x}} = - λ c_{x} - v_{x} + k_{b} p_{r} m_{x} - k_{u} c_{x} \end{matrix}

\begin{matrix} (7) & \dot{p_{x}} = v_{x} - λ p_{x}, x \in \{t, m, q\} \end{matrix}

\begin{matrix} (8) & \dot{p_{r}} = v_{r} - λ p_{r} + \sum_{x \in \{r, t, m, q\}} (v_{x} - k_{b} p_{r} m_{x} + k_{u} c_{x}) \end{matrix}

where $w_{x}$ represents the TX rate, $k_{b}$ and $k_{u}$ are the binding and unbinding rates of a ribosome and an mRNA, and $d_{m}$ is the degradation rate of mRNAs. Here, we assume that endogenous proteins are not subject to active degradation.

For ribosome, transporter, and metabolic enzyme mRNAs, the TX rate $w_{x}$ is the product of the maximum TX rate and a saturating function of energy:

\begin{matrix} (9) & w_{x} = w_{x, max} \frac{a}{θ_{x} + a}, x \in \{r, t, m\} \end{matrix}

where $w_{x, max}$ is the maximum TX rate and $θ_{x}$ is a transcriptional threshold.

For housekeeping proteins, the TX rate is modeled under the assumption that housekeeping mRNA TX is under negative autoregulation in order to keep constant expression levels in the cell:

\begin{matrix} (10) & w_{q} = w_{q, max} \frac{a}{θ_{q} + a} \cdot \frac{1}{1 + (\frac{p_{q}}{K_{q}})^{h_{q}}} \end{matrix}

Here, $K_{q}$ and $h_{q}$ are regulatory parameters for the negative autoregulation model.
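As an illustration of equations (9) and (10), the sketch below evaluates both transcription rates. The function names are hypothetical, and the defaults for the housekeeping class are the Table 3 values.

```python
def tx_rate(a, w_max, theta):
    """Equation (9): energy-dependent transcription rate."""
    return w_max * a / (theta + a)


def tx_rate_housekeeping(a, p_q, w_max=949.0, theta=4.38,
                         K_q=152219.0, h_q=4):
    """Equation (10): equation (9) multiplied by a negative autoregulation term."""
    repression = 1.0 / (1.0 + (p_q / K_q) ** h_q)
    return tx_rate(a, w_max, theta) * repression
```

When $p_{q}$ equals $K_{q}$, the repression term is exactly 1/2, so the housekeeping TX rate is half of what equation (9) alone would give.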

The TL rate is modeled as follows:

\begin{matrix} (11) & v_{x} = c_{x} \frac{γ (a)}{n_{x}} \end{matrix}

where $γ (a)$ is an energy-dependent function:

\begin{matrix} (12) & γ (a) = \frac{γ_{max} a}{K_{γ} + a} \end{matrix}

where $γ_{max}$ is the maximum TL elongation rate, and $K_{γ}$ is the energy level at which the elongation rate is half-maximal.

Weiße et al. pointed out that, under the assumption of constant average cell mass, the growth rate $λ$ is given by equation (13)^[4]:

\begin{matrix} (13) & λ = \frac{γ (a)}{M} \cdot \sum_{x \in \{r, t, m, q\}} c_{x} \end{matrix}

where $M$ is the constant cell mass.
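Equations (11) to (13) translate into a few lines of code. This is a minimal sketch with hypothetical function names; the defaults for $γ_{max}$, $K_{γ}$ and $M$ come from Table 3, and the ribosome-bound mRNA levels are passed in as a dictionary keyed by protein class.

```python
def gamma(a, gamma_max=1260.0, K_gamma=7.0):
    """Equation (12): energy-dependent elongation rate."""
    return gamma_max * a / (K_gamma + a)


def tl_rate(c_x, n_x, a):
    """Equation (11): translation rate of a protein of length n_x."""
    return c_x * gamma(a) / n_x


def growth_rate(c, a, M=1e8):
    """Equation (13): growth rate from all ribosome-bound mRNAs c[x]."""
    return gamma(a) * sum(c.values()) / M
```

Because the growth rate depends on the total number of translating ribosomes, any heterologous transcript that sequesters ribosomes or drains energy lowers $λ$.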

Introducing heterologous part

As mentioned before, the parts introduced into the system should be either monocistronic or based on the pRAP system for polycistronic expression. For polycistronic parts using the pRAP system, we assume that the translation rate of each gene is proportional to the strength of its RBS.

The dynamics of the proteins, mRNAs, and ribosome-bound mRNAs for each gene in the parts are modeled by the following equations:

\begin{matrix} (12) & \dot{p_{i}^{c}} = v_{i}^{c} - (λ + d_{p}) p_{i}^{c} \end{matrix}

\begin{matrix} (13) & \dot{m_{i}^{c}} = w_{i}^{c} - (λ + d_{m}) m_{i}^{c} + v_{i}^{c} - k_{b, i}^{c} p_{r} m_{i}^{c} + k_{u, i}^{c} c_{i}^{c} \end{matrix}

\begin{matrix} (14) & \dot{c_{i}^{c}} = - λ c_{i}^{c} + k_{b, i}^{c} p_{r} m_{i}^{c} - k_{u, i}^{c} c_{i}^{c} - v_{i}^{c} \end{matrix}

Here, $d_{p}$ represents the degradation rate of the proteins expressed by the parts, and $p_{i}^{c}, m_{i}^{c}, c_{i}^{c}$ denote the concentrations of the i-th gene's protein, mRNA, and ribosome-bound mRNA, respectively.

We model TX rate of the i-th heterologous gene as:

\begin{matrix} (15) & w_{i}^{c} = w_{i, max}^{c} \frac{a}{θ_{x} + a} \end{matrix}

The maximum TX rate is determined by the copy number of the plasmid and the promoter strength:

\begin{matrix} (16) & w_{i, max}^{c} = \frac{N}{β_{n}} \cdot \frac{H_{p r o m}}{β_{p r o m}} \end{matrix}

where $N$ is the plasmid copy number, $H_{p r o m}$ is the promoter strength from our basic parts library, and $β_{n}$ and $β_{p r o m}$ are scaling parameters.

TL initiation rate is modeled by the ratio of the binding rate constant $k_{b, i}^{c}$ to the unbinding rate constant $k_{u, i}^{c}$ :

\begin{matrix} (17) & \frac{k_{b, i}^{c}}{k_{u, i}^{c}} = \frac{H_{R B S}}{β_{R B S}} \end{matrix}

where $H_{R B S}$ is the RBS strength from our basic parts library, and $β_{R B S}$ is the scaling parameter.
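The mapping from part parameters to model parameters in equations (16) and (17) can be sketched as follows. The function names are hypothetical and the scaling defaults are the Table 3 values. Equation (17) only fixes the ratio of the two rate constants, so this sketch additionally assumes that the unbinding constant stays at the host value $k_{u} = 1$; that choice is an assumption of the example, not something stated above.

```python
def max_tx_rate(copy_number, h_prom, beta_n=200.0, beta_prom=69.49):
    """Equation (16): maximum TX rate from copy number and promoter strength."""
    return (copy_number / beta_n) * (h_prom / beta_prom)


def binding_rate(h_rbs, k_u=1.0, beta_rbs=50.31):
    """Equation (17) solved for k_b, with k_u fixed (assumption of this sketch)."""
    return k_u * h_rbs / beta_rbs
```

With these scalings, a part with `copy_number=200`, `h_prom=69.49` and `h_rbs=50.31` maps to a maximum TX rate of 1 and a binding-to-unbinding ratio of 1.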

TL elongation rate is modeled as follows:

\begin{matrix} (18) & v_{x} = c_{x} \frac{γ (a)}{n_{x}} \end{matrix}

where $c_{x}$ is the amount of ribosome-mRNA complex, $n_{x}$ the length of polypeptide chain, and $γ (a)$ is an energy-dependent function:

\begin{matrix} (19) & γ (a) = \frac{γ_{max} a}{K_{γ} + a} \end{matrix}

where $γ_{max}$ is the maximum TL elongation rate, and $K_{γ}$ is the energy level at which the elongation rate is half-maximal.

Finally, the burden $b$ of the introduced part is:

\begin{matrix} (20) & b = 1 - \frac{λ}{λ_{0}} \end{matrix}

where $λ$ is the growth rate of host with the part introduced, and $λ_{0}$ is the growth rate of wild type (i.e. without any introduced parts).
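Equation (20) is simple enough to state in code. In this sketch the function name `burden` is hypothetical, and the result is returned as a fraction; multiply by 100 to express it as a percentage.

```python
def burden(growth_rate_with_part, growth_rate_wild_type):
    """Equation (20): relative reduction in growth rate caused by the part."""
    if growth_rate_wild_type <= 0:
        raise ValueError("wild-type growth rate must be positive")
    return 1.0 - growth_rate_with_part / growth_rate_wild_type
```

A part that leaves the growth rate unchanged has a burden of 0, and a part that halves it has a burden of 0.5.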

Parameters

The parameters of our model are listed in Table 3.

Table 3: Parameters of our model

Parameter	Value	Source
$n_{s}$	$10^{4}$	[3]
$M$	$10^{8}$	[3]
$γ_{max}$	1260	[3]
$v_{t}$	726	[3]
$v_{m}$	5800	[3]
$w_{r, max}$	930	[3]
$w_{q, max}$	949	[3]
$w_{m, max}, w_{t, max}$	4.14	[3]
$K_{q}$	152219	[3]
$h_{q}$	4	[3]
$θ_{q}, θ_{t}, θ_{m}$	4.38	[3]
$n_{q}, n_{t}, n_{m}$	300	[3]
$θ_{r}$	427	[3]
$n_{r}$	7459	[3]
$k_{b}$	0.0095	[3]
$k_{u}$	1	[3]
$K_{γ}$	7	[3]
$K_{t}, K_{m}$	1000	[3]
$d_{m}$	0.1	[3]
$β_{p r o m}$	69.49	parameter fitting
$β_{R B S}$	50.31	parameter fitting
$β_{n}$	200	Set manually

Results

Parameter fitting

To calibrate our model against experimental data, we performed parameter fitting using the curve_fit function from the SciPy library. The data used for this process are detailed in the fitting data table. Given the characteristics of these data, we manually set the parameter $β_{n}$ to a constant. The remaining parameters, $β_{p r o m}$ and $β_{R B S}$ , were fitted using curve_fit. The values of these parameters can be found in Table 3.

To validate the fitting, we plotted the experimental burden against the predicted burden and performed a linear regression analysis, which yielded an $R^{2}$ value of 0.7304, indicating a good fit between the experimental and predicted data (Figure 11).

Figure 11. Fitting results of burden
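For a simple linear regression of predicted against experimental values, $R^{2}$ equals the squared Pearson correlation between the two series. The sketch below computes it in plain Python; the function name `r_squared` is hypothetical and this is not necessarily the routine we used to produce Figure 11.

```python
def r_squared(experimental, predicted):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(experimental)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_e = sum(experimental) / n
    mean_p = sum(predicted) / n
    cov = sum((e - mean_e) * (p - mean_p)
              for e, p in zip(experimental, predicted))
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_e * var_p)
```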

Benchmarking of Promoter Calculator and RBS Calculator

To test the accuracy of the Promoter Calculator and the RBS Calculator, we conducted a benchmarking process for both of the calculators. This benchmarking involves converting the predicted data from these calculators to the same magnitude as the experimental data, ensuring direct comparability and integration into our software.

For the Promoter Calculator, we benchmarked its accuracy by converting the predicted promoter strength to the same magnitude of the experimental data. This conversion was achieved using the following equation:

H_{p r o m} = K \cdot v_{p r o m} + B

where $H_{p r o m}$ represents the converted promoter strength that matches the magnitude of the experimental data, and $v_{p r o m}$ is the promoter strength calculated by the Promoter Calculator. The constants $K$ and $B$ are determined through linear regression, and the fitting results are shown in Figure 12.

Figure 12. Experimental promoter strength vs. promoter strength calculated by Promoter Calculator
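The constants $K$ and $B$ of the linear conversion can be obtained by ordinary least squares. The following is a minimal sketch with a hypothetical function name; it takes the calculator output as `x` and the experimental strength as `y`.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = K * x + B; returns (K, B)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    K = s_xy / s_xx
    B = mean_y - K * mean_x
    return K, B
```

Once `K` and `B` are known, a new prediction `v` from the Promoter Calculator is converted with `K * v + B`.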

Unlike the Promoter Calculator, the RBS Calculator calculates the total Gibbs free energy ( $Δ G$ ) of the entire translation initiation process. Therefore, we benchmarked the accuracy of the RBS Calculator by converting the calculated $Δ G$ to match the experimental data. This conversion was performed using the following equation:

H_{R B S} = K \cdot \exp (- B \cdot Δ G)

where $H_{R B S}$ is the converted RBS strength that matches the magnitude of the experimental data, and $Δ G$ is the Gibbs free energy calculated by the thermodynamic model of the RBS Calculator. The constants $K$ and $B$ are determined through least squares fitting, and the fitting results are shown in Figure 13.

Figure 13. Experimental RBS strength vs. RBS strength calculated by RBS Calculator
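One common way to fit an exponential of this form is to take logarithms, which gives the straight line $\ln H_{RBS} = \ln K - B \cdot ΔG$, and then apply linear least squares. The sketch below does exactly that. Note that this log-linear fit weights the data differently from a direct nonlinear least squares fit, so it is shown as an illustration of the idea and may not reproduce the values behind Figure 13; the function name is hypothetical.

```python
import math


def fit_exponential(delta_g, strength):
    """Fit strength = K * exp(-B * delta_g) by least squares on log(strength).

    Returns (K, B). All strength values must be positive.
    """
    n = len(delta_g)
    if n != len(strength) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    log_h = [math.log(h) for h in strength]
    mean_g = sum(delta_g) / n
    mean_l = sum(log_h) / n
    s_gg = sum((g - mean_g) ** 2 for g in delta_g)
    s_gl = sum((g - mean_g) * (l - mean_l) for g, l in zip(delta_g, log_h))
    slope = s_gl / s_gg               # slope equals -B
    intercept = mean_l - slope * mean_g  # intercept equals ln K
    return math.exp(intercept), -slope
```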

The data used for benchmarking can be accessed here.

Validation against experimental data

We selected a set of parts whose burden had previously been measured experimentally^[2], predicted their burden using our software, and compared the predictions with the experimental data. The results are summarized in Figure 14, which plots the experimental burden against the predicted burden.

Figure 14. Experimental burden of parts from [2] vs. Predicted burden calculated by Burden Predictor

We also investigated the burden of a series of parts in our Measurement. All of these parts are expressed using the pRAP system.

Figure 15. Burden of parts from our experiment vs. Predicted burden calculated by Burden Predictor

The detailed information and data of the parts we used in this section can be accessed here.

Similarity Estimator

Introduction

Sequences are fundamental in synthetic biology, and any improvements or optimizations based on them can significantly enhance the design and experimental efficiency of biological constructs. Recognizing that similar parts are more likely to have similar biological characteristics and functions, we have developed the Similarity Estimator, a tool that can estimate the similarity of parts in the iGEM Registry.

This tool is seamlessly integrated into our previously developed PartHub 2.0, allowing users to visualize both the citation and similarity relationships between different parts. Like the Burden Predictor, the Similarity Estimator allows users to manually input sequences, upload sequence files in GenBank or FASTA format, or search for parts of interest within PartHub. Our software enables researchers to easily identify and analyze parts with similar biological properties, facilitating more efficient synthetic biology design and experimentation.

Implementation

Sequence similarity

Initially, we explored estimating the similarity of parts using k-mer embeddings, a machine learning-based method for sequence similarity calculation. However, this approach tends to focus too much on global features and fails to capture local characteristics effectively. Therefore, we employ BLAST (Basic Local Alignment Search Tool), a well-established method for sequence alignment and similarity comparison, to align sequences and estimate similarity scores. The score is based on BLAST's key indicator, the bitscore, which measures the quality of an alignment and allows a more precise and biologically meaningful comparison of sequences than k-mer embedding methods.

Our software begins by constructing a database of part sequences. When a query is received, the software aligns the sequence of the queried part against the database. To ensure high-quality alignments, the results are filtered by an E-value threshold of less than $10^{- 5}$ . The E-value is a statistical measure in BLAST that indicates the number of alignments expected to occur by chance. A lower E-value signifies a higher confidence in the alignment, indicating a stronger similarity between the query and the database sequences.

After filtering the alignments, our software scales the bitscore of each result by the maximum bitscore among all the results, ensuring that the similarity scores are comparable across different alignments.

The sequence similarity score, denoted $SeqScore$ , is then calculated by combining the scaled bitscore and the identity, each weighted differently. The identity represents the percentage of nucleotides that are identical between the query and the database sequence. The formula for the sequence similarity score between part i and part j is as follows:

\begin{matrix} (21) & {SeqScore}_{i, j} = \max_{id (k) = (i, j)} \left( 0.7 \times \frac{{BitScore}_{k}}{\max_{l} {BitScore}_{l}} + 0.3 \times {identity}_{k} \right) \end{matrix}

If two parts have multiple alignments, the software selects the maximum sequence similarity score among all the alignments.
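The filtering, scaling and scoring steps above can be sketched as follows. This is an illustration rather than our production code: each BLAST alignment is assumed to be a dictionary with the hypothetical keys `subject`, `bitscore`, `identity` and `evalue`, and the identity is assumed to be expressed as a fraction between 0 and 1.

```python
def seq_scores(hits, max_evalue=1e-5, w_bits=0.7, w_identity=0.3):
    """Return {subject: SeqScore} for one query, following formula (21)."""
    # Keep only alignments below the E-value threshold.
    kept = [h for h in hits if h["evalue"] < max_evalue]
    if not kept:
        return {}
    # Scale every bitscore by the largest bitscore among the kept alignments.
    top = max(h["bitscore"] for h in kept)
    scores = {}
    for h in kept:
        score = w_bits * h["bitscore"] / top + w_identity * h["identity"]
        # Several alignments to the same part: keep the best one.
        scores[h["subject"]] = max(score, scores.get(h["subject"], 0.0))
    return scores
```

The returned dictionary can then be sorted by value to obtain the list of most similar parts.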

Category similarity

Our software also considers the category information from the iGEM Registry, where each part can have multiple classification labels, and each category includes several levels of subcategories.

Specifically, our software counts the shared categories and assigns a weight to each category based on its hierarchical level. This approach ensures that parts with similar functional annotations are given higher similarity scores.

The category similarity score, denoted as $CatScore$ , is calculated using the following formula:

\begin{matrix} (22) & CatScore = \sum_{i} \sum_{j = 1}^{d_{i}} B^{j - 1} \end{matrix}

where:

$\{c_{i}\}$ is the array of category labels that the two parts share.
$\{d_{i}\}$ is the array of hierarchical levels of the shared categories $c_{i}$ .
$B = 1.5$ is a constant that determines the weight of each hierarchical level.
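Formula (22) can be evaluated directly from the hierarchical levels of the shared categories. The sketch below takes those levels as input; the function name `cat_score` is hypothetical, and how the shared categories are extracted from the Registry labels is outside the scope of this example.

```python
def cat_score(depths, base=1.5):
    """Formula (22): sum of base**(j - 1) for j = 1..d over all shared depths d."""
    return sum(base ** (j - 1) for d in depths for j in range(1, d + 1))
```

A single shared category at level 1 contributes 1, and one at level 2 contributes 1 + 1.5 = 2.5, so deeper shared categories count for more.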

Example calculation of category similarity score

Let's consider two parts, Part A and Part B, with the following category labels from the iGEM Registry:

Part A:
- //cds/enzyme/DNApolymerase/
- //plasmid/expression/T7/
- //function/biosynthesis/
Part B:
- //cds/enzyme/DNApolymerase/
- //plasmid/expression/T3/
- //function/isoamplification/

The shared categories between Part A and Part B are:

//cds/enzyme/DNApolymerase/ (Level 3)
//plasmid/expression/ (Level 2)

In this example, the hierarchical levels of the shared categories are 3 and 2, respectively.

Based on the information above, we have $B = 1.5$ and $\{d_{i}\} = \{3, 2\}$ .

Using formula (22), the category similarity score is:

CatScore = B^{0} + B^{1} + B^{2} + B^{0} + B^{1} = 1 + 1.5 + 2.25 + 1 + 1.5 = 7.25

Overall similarity

The overall similarity score is calculated by combining the sequence similarity score and the category similarity score:

OverallScore = min (SeqScore + CatScore, 100)

This ensures that the overall score does not exceed 100, providing a balanced and interpretable measure of similarity between parts.

Visualization

After calculating the similarity scores, our software will list the top similar parts of the queried part, and show the similarity relationships of the queried part along with the reference relationships built in PartHub 2.0.

The similarity relationships and reference relationships between parts are stored in a Neo4j database, a powerful graph database well-suited for handling complex relationships. For visualization, the software uses Neovis.js, a JavaScript library that provides an interactive way to explore the graph data.

For examples of visualization, please refer to Figure 8.

Exclusion of reference relationships

When calculating the similarity between parts, our software excludes parts that have reference or twin relationships, to avoid redundancy and keep the similarity scores meaningful. Parts linked by a reference relationship are often duplicates of, or highly similar to, each other, and twins are exact copies. Including either would skew the similarity scores.

Moreover, our software does not distinguish between basic parts and composite parts when calculating similarity. Both types of parts can have significant biological relevance, and treating them separately could cause important similarities to be missed.
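The exclusion rule can be sketched as a simple filter in Python. The data structures (a score dictionary and a dictionary of related part IDs) and the part names are hypothetical and chosen only for illustration.

```python
def filter_candidates(query_id, scored_parts, related):
    """Drop candidates linked to the query by a reference or twin relationship.

    scored_parts: dict mapping part ID -> similarity score
    related: dict mapping part ID -> set of part IDs with a reference
             or twin link to it (hypothetical structure)
    Returns the remaining (part ID, score) pairs, best first.
    """
    excluded = related.get(query_id, set())
    kept = [(pid, s) for pid, s in scored_parts.items() if pid not in excluded]
    return sorted(kept, key=lambda item: item[1], reverse=True)


scores = {"part_X": 92.0, "part_Y": 88.5, "part_Z": 75.0}
links = {"part_Q": {"part_X"}}  # part_X is a twin or reference of part_Q
print(filter_candidates("part_Q", scores, links))
# [('part_Y', 88.5), ('part_Z', 75.0)]
```

Note that the filter takes no part-type argument, in line with basic and composite parts being treated alike.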

DBTL Cycle

During the development of PartHub 3.0, we strictly followed the Design-Build-Test-Learn (DBTL) cycle to ensure continuous improvement and alignment with user needs. This iterative process allowed us to systematically refine our software, incorporating feedback and making improvements at each stage.

First Round

Design

In the initial design phase, we focused on developing a method to calculate the burden of a given part and the similarity between biological parts. We initially decided to compute the similarity scores using k-mer embeddings combined with K-Nearest Neighbors (KNN). We chose k-mer embeddings because they capture global features of the sequences, which we believed would provide a robust basis for similarity calculations. The goal was to create a tool that could predict the burden of parts based on their sequences and integrate this functionality into PartHub 2.0.

Build

During the build phase, we implemented the k-mer embeddings and KNN algorithm to calculate the similarity scores. We developed a basic WebUI framework to input sequences, upload files, and display the similarity results. The burden calculation module was also initiated, but it was not yet fully integrated into the WebUI. We focused on ensuring that the similarity calculation functioned correctly and that the results were displayed in a clear and understandable format.
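The general idea of this first approach can be sketched in Python as follows. This is not the team's implementation: plain k-mer count vectors with cosine similarity stand in for whatever embedding was actually used, and the toy sequences, names, and parameter values are assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k=3):
    """Count overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(u, v):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(count * v[kmer] for kmer, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def nearest_parts(query, library, k=3, n_neighbors=2):
    """Return the n_neighbors library entries most similar to the query."""
    q = kmer_vector(query, k)
    scored = [(name, cosine(q, kmer_vector(seq, k))) for name, seq in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n_neighbors]


# Toy library (made-up sequences, not Registry parts).
library = {
    "toy_A": "ATGCGTACGTTAGC",
    "toy_B": "ATGCGTACGTTAGG",
    "toy_C": "TTTTTTTTTTTTTT",
}
for name, score in nearest_parts("ATGCGTACGTTAGC", library):
    print(name, round(score, 3))
```

A k-mer count vector summarizes the whole sequence and discards where each k-mer occurs, which is what makes this a global feature representation.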

Test

We conducted initial testing of the similarity calculation using a set of known parts and their sequences. The k-mer embeddings and KNN algorithm produced similarity scores, but the results did not match our expectations: the global features captured by k-mer embeddings did not accurately reflect the biological similarity between parts, particularly where functional annotations and specific sequence features mattered more. The burden calculation module was also tested, but it was still incomplete and did not yet provide reliable predictions.

Learn

From the initial testing, we learned that k-mer embeddings, while useful for capturing global sequence features, were not sufficient for accurately measuring the biological similarity between parts. We therefore switched to BLAST for similarity calculations, as it is better suited to aligning sequences and identifying local similarities, which are crucial for biological function. The corresponding commit on GitLab can be accessed here.
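To illustrate what "local similarity" means, the Python sketch below scores the best local alignment between two sequences with the Smith-Waterman recurrence. This is a stand-in for illustration only: BLAST is a faster seed-and-extend heuristic, not this exact algorithm, and the scoring parameters and toy sequences are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences sharing a 14-base motif inside unrelated flanks.
motif = "GATTACAGATTACA"
seq_a = "CCCCC" + motif + "CCCCC"
seq_b = "TTTTT" + motif + "TTTTT"
print(smith_waterman(seq_a, seq_b))  # 28 (14 matches x 2)
```

The shared motif is scored in full regardless of the unrelated flanking sequence, which is the behavior a local alignment method provides and a whole-sequence summary does not.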

Second Round

Design

In the second round, we refocused our design efforts on using BLAST for similarity calculations. We also addressed the feedback from our advisor regarding the WebUI. The goal was to create a more intuitive and user-friendly interface that would facilitate the use of the similarity and burden prediction tools. We planned to integrate the similarity results more effectively into the search pipeline to provide a comprehensive view of part relationships.

Build

During the build phase, we implemented the BLAST algorithm to calculate the similarity scores. We also refined the WebUI by optimizing the layout and size of components, making the interface more intuitive. We added more descriptive text and annotations to help users understand the functions and features of the tool. The burden calculation module was fully integrated into the WebUI, allowing users to input sequences, upload files, and view burden predictions alongside similarity results.

Test

We conducted extensive testing of the updated similarity and burden prediction tools. The BLAST algorithm produced more accurate and biologically relevant similarity scores, reflecting the specific functional annotations and sequence features of the parts. The WebUI was tested by our team members (Figure 16) and advisors, who gave us valuable feedback on the usability and intuitiveness of the interface. We made additional adjustments based on this feedback to further improve the user experience.

Figure 16. Discussion with our team members on our software

Learn

From the second round of testing, we learned that the BLAST algorithm significantly improved the accuracy of the similarity calculations. We also integrated the similarity results into the search pipeline, which enhanced the overall functionality of the tool.

Throughout both rounds, suggestions and feedback from our team members and advisors were crucial in refining our software. We recorded each improvement in the Changelog.

References

Rugbjerg, P., Myling-Petersen, N., Porse, A., Sarup-Lytzen, K., & Sommer, M. O. A. (2018). Diverse genetic error modes constrain large-scale bio-based production. Nature Communications, 9(1), 787. https://doi.org/10.1038/s41467-018-03232-w
Radde, N. (2024). Measuring the burden of hundreds of BioBricks defines an evolutionary limit on constructability in synthetic biology. Nature Communications. https://doi.org/10.1038/s41467-024-50639-9
Liu, Y., Wu, Z., Wu, D., Gao, N., & Lin, J. (2023). Reconstitution of Multi-Protein Complexes through Ribozyme-Assisted Polycistronic Co-Expression. ACS Synthetic Biology, 12(1), 136–143. https://doi.org/10.1021/acssynbio.2c00416
Weiße, A. Y., Oyarzún, D. A., Danos, V., & Swain, P. S. (2015). Mechanistic links between cellular trade-offs, gene expression, and growth. Proceedings of the National Academy of Sciences, 112(9), E1038–E1047. https://doi.org/10.1073/pnas.1416533112
Nikolados, E.-M., Weiße, A. Y., Ceroni, F., & Oyarzún, D. A. (2019). Growth Defects and Loss-of-Function in Synthetic Gene Circuits. ACS Synthetic Biology, 8(6), 1231–1240. https://doi.org/10.1021/acssynbio.8b00531
LaFleur, T. L., Hossain, A., & Salis, H. M. (2022). Automated model-predictive design of synthetic promoters to control transcriptional profiles in bacteria. Nature Communications, 13(1), 5159. https://doi.org/10.1038/s41467-022-32829-5
Tian, T., & Salis, H. M. (2015). A predictive biophysical model of translational coupling to coordinate and control protein expression in bacterial operons. Nucleic Acids Research, 43(14), 7137–7151. https://doi.org/10.1093/nar/gkv635
Bremer, H., & Dennis, P. P. (2008). Modulation of Chemical Composition and Other Parameters of the Cell at Different Exponential Growth Rates. EcoSal Plus, 3(1). https://doi.org/10.1128/ecosal.5.2.3