Locus screening program
Compared to plasmid-based expression, genomic expression offers multiple advantages:
1. It is generally more stable, as it is integrated into the host cell's chromosome, reducing the risk of plasmid loss.
2. It can better regulate gene expression levels through endogenous regulatory mechanisms.
3. It avoids issues related to plasmid replication and segregation, thereby enhancing product uniformity and reproducibility.
Based on our literature review, we summarized the characteristics of high-quality genomic integration loci and their rational reasons:
Step 1.The length of the intergenic region (IGR) is crucial and excessively short IGRs may lead to interference with regulatory mechanisms.
Step 2.IGRs containing sequences that code for tRNA, rRNA, and sRNA should be excluded from screening, as regions with abundant transcripts may harbor unknown regulatory sequences.
Step 3.IGRs containing TFbs (transcription factor binding sites) should be excluded from screening, as the loss of transcription factor binding may hinder effective transcription of the target gene and disrupt normal cellular functions.
Step 4.IGRs containing promoters should be excluded from screening, as the reasons are similar to those stated in step 3.
Step 5.The transcription levels of genes adjacent to the IGR should be relatively high in the transcriptome, as gene transcription levels are position-dependent, and neighboring genes may share certain regulatory elements, resulting in elevated transcription levels.
The screening methods for step 1, step 2, and step 5 are relatively simple, requiring only basic parameter adjustments based on needs. According to wet lab experience, suitable parameter choices include IGRs ≥ 200 bp, with transcription intensity of adjacent genes in the top 50% of all genes, and exclusion of RNA based on genomic annotations.
Due to the temporary absence of TFbs annotations for Halomonas, and given the strong homology between E. coli and Halomonas, we chose to use E. coli's TFbs annotations to identify TFbs on the Halomonas genome.[2,3] However, the large number of bases in bacterial genomes significantly extends the time required for the identification process.Therefore, we innovatively modified the KMP algorithm and applied it to this process.[4]
modified KMP algorithm:
T[i,i+n−1] = Target strings
P[i,i+n−1] = Pattern strings
positions=[] (list to store starting indices of matches)
m = Number of matching characters
Similarity definition Sim(T[i,i+n−1],P[i,i+n−1])=m/n
Sim(T[i,i+n−1],P[i,i+n−1])=m/n ≥ s → m ≥ s * n
If we shift the pattern string one base to the right, the similarity (sim) will be reduced to at most (m - 2*1) / n. If we shift the pattern string k bases to the right, the similarity will be reduced to at most (m - 2*k) / n.
(m - 2*k) / n ≥ s → k ≤ (m-s*n)/2
Then k+1 would be the number of bases that should be skipped.
If j=len(P),positions.append(i−j+1)
Update j to find the next possible match, j = LSP[j−1]
Based on the above reasoning, we have been able to maximize the efficiency of skipping while ensuring that no eligible strings are missed, by comparing the number of bases that need to be skipped with the two methods. Through this algorithm, the time complexity of the TFbs identification process is reduced from O(m*n) to O(m+n).
Utilizing this algorithm, when searching for TFbs in Halomonas TD using E.coli as the template, the final output results are identical to those of the original pattern matching method. However, the required running time has been reduced from the original 12 hours to just 30 minutes.
Besides Halomonas TD, we also ran the program on genomes such as Nissel and BL21. The results obtained indicate that the improved KMP algorithm can enhance the efficiency of TFbs identification by about 95%, significantly increasing the speed of parameter iteration.
Apart from the identification of TFbs, the prediction of promoters is a vital step that constrains the efficiency and precision of locus screening program. Although the methods for promoter prediction are quite mature, they often target short sequences of 80 base pairs in length and employ complex neural network algorithms for computation. Complex machine learning methods have been employed to identify promoters, which have improved accuracy in certain cases.[5] However, these improvements may not justify the heavy computational demands posed by training the classifiers. Additionally, the selection and optimization of parameters (such as the type of kernel function, the number of hidden layer nodes, etc.) require sufficient prior knowledge of the statistical properties of the samples, which is impractical for analyzing novel genomes. Deploying these tools locally requires extensive training of the model to obtain reliable results, whereas using the online services provided by other researchers is greatly limited by the scale of the data and the server's carrying capacity.
To address this issue, we implemented the vwz-curve algorithm, which is an enhancement of the z-curve algorithm.[6] Based on the Z-curve theory, this method maps DNA sequences into a two-dimensional space, leveraging the spectral properties of the sequences to identify gene structures. It represents DNA sequences as a 3D curve or point plot, producing a zigzag pattern, which is why it is referred to as the Z-curve. It is a rapid method for calculating the signal-to-noise ratio, eliminating the need for computing the Discrete Fourier Transform (DFT), thereby significantly enhancing computational efficiency.
The advantages of vwz-curve are as follows:
1. The mapping of DNA sequences reduces from three dimensions to two dimensions compared to the traditional z-curve method, significantly decreasing complexity.
2. It has low requirements for the training set, making it easy to deploy locally, and with a little training, it can achieve an accuracy of over 0.9.
3. Compared to neural network computations, it significantly reduces the runtime.
We developed a feature extraction tool for prokaryotic promoter recognition using the vwz-curve method. Features extracted from the DNA sequences (with window sizes w = 1, 2, ..., 6 ) are used as input variables for further classification with a partial least squares (PLS) classifier.
Unlike the variables extracted by algorithms based on position weight matrices (PWMs), the vw Z-curve parameters derive from the distribution of w-nucleotide patterns occurring within the same sequence fragment, rather than from the frequencies observed across different sequence fragments. Therefore, the vw Z-curve parameters are not affected by the uncertainty related to the motif position relative to the transcription start site (TSS).
To validate the performance of the model, we tested it using promoter fragments from E. coli and B. subtilis. The complete genomic sequences of E. coli K-12 and Bacillus subtilis were obtained from NCBI GenBank. The TSS positions were sourced from RegulonDB version 7.0 and DBTBS. The promoter regions defined as [TSS-60...TSS+19] (with the TSS site designated as +1) were considered as positive examples. In E. coli K-12, 81% of known TSSs are located in non-coding regions, while 19% are found in coding regions. Negative examples include two types: coding region and non-coding region. Negative samples from the coding region are extracted from the beginning of open reading frames (ORFs), while negative samples from the non-coding region are randomly selected from intergenic regions. It is important to note that this data has not been validated through wet lab experiments as negative, so the obtained negative examples may contain a small number of positive examples.
The PLS (Partial Least Squares) algorithm is an essential method for modeling the linear relationship between output variables (known class labels) and input variables (predictors). It generates orthogonal latent variables (LVs) as linear combinations of the original input variables. The key principle is to find weights for these combinations that maximize the covariance between input and output variables. This projection compresses the n-dimensional input space (X) into a v-dimensional latent variable space (where v << n ), effectively reducing noise and mitigating multicollinearity issues. Although PLS may produce biased regression coefficient estimates compared to ordinary least squares, it generally has lower variance. PLS is particularly effective in situations where the number of observed variables (n) greatly exceeds the number of samples (m) and where high multicollinearity exists among the variables.
For E. coli, the accuracies are 96.05% (s70 promoters, coding negative samples), 90.44% (s70 promoters, non-coding negative samples), 92.13% (known sigma-factor promoters, coding negative samples), 92.50% (known sigma-factor promoters, non-coding negative samples), respectively; for B. subtilis, the accuracies are 95.83% (known sigma-factor promoters, coding negative samples) and 99.09% (known sigma-factor promoters, non-coding negative samples)
By integrating these two improvements into our genomic integration site screening program, we achieved significant enhancements in locus screening program's performance. Using this , we screened the genomes of Halo6.0, Nissel, BL21, Halomonas LY01, and Halomonas LY03, identifying sites with markedly increased stability and expression strength. Below are the data obtained from the validation experiments.
locus | FI/OD(600) | the mean of FI/OD(600) | ||
---|---|---|---|---|
LY01_1 | 466 | 475 | 467 | 469.3 |
LY01_2 | 540 | 560 | 555 | 551.7 |
LY01_3 | 248 | 275 | 267 | 263.3 |
LY01_4 | 105 | 126 | 138 | 123.0 |
LY01_5 | 355 | 377 | 390 | 374.0 |
LY01_1 | 466 | 475 | 467 | 469.3 |
LY01_6 | 362 | 377 | 400 | 379.7 |
LY01_7 | 302 | 320 | 336 | 319.3 |
LY01_8 | 126 | 160 | 145 | 143.7 |
LY01_9 | 530 | 517 | 545 | 530.7 |
LY01_10 | 710 | 690 | 670 | 690.0 |
LY01_11 | 645 | 685 | 665 | 665.0 |
LY03_1 | 412 | 445 | 430 | 429.0 |
LY03_2 | 369 | 395 | 383 | 382.3 |
LY03_3 | 435 | 450 | 467 | 450.7 |
LY03_4 | 472 | 488 | 505 | 488.3 |
LY03_5 | 333 | 347 | 359 | 346.3 |
LY03_6 | 284 | 297 | 310 | 297.0 |
LY03_6 | 284 | 297 | 310 | 297.0 |
LY03_7 | 515 | 530 | 540 | 528.3 |
LY03_8 | 650 | 675 | 665 | 663.3 |
LY03_9 | 695 | 720 | 718 | 711.0 |
Nissel_1 | 150 | 165 | 172 | 162.3 |
Nissel_2 | 360 | 370 | 386 | 372.0 |
Nissel_3 | 285 | 300 | 310 | 298.3 |
Nissel_4 | 530 | 550 | 560 | 546.7 |
Nissel_5 | 415 | 430 | 440 | 428.3 |
Nissel_6 | 603 | 620 | 630 | 617.7 |
BL21_1 | 290 | 302 | 318 | 303.3 |
BL21_2 | 410 | 427 | 445 | 427.3 |
BL21_3 | 525 | 532 | 545 | 534.0 |
BL21_4 | 635 | 645 | 660 | 645.0 |
Based on the data from the table, the fluorescence intensity of the 28 loci ranges from 123.0 to 711.0, with a median of 380.5 and a mean of 468.6. Among these, 22 sites have a mean of FI/OD(600) greater than 300, and 6 sites have a mean greater than 600. This indicates that the expression potential of the introduced loci is relatively high while minimizing the impact on cell proliferation. Overall, the selected loci show good results, with a few particularly outstanding cases. These data demonstrate the applicability and effectiveness of the screening method for genomic loci.
We also developed a user-friendly graphical interface for the locus screening program, making it easy for users to quickly and conveniently screen for genomic integration loci.
Motrol —— automatic fermentation control model
A genome-scale metabolic model (GEM) is a computational representation that allows for mathematical exploration of metabolic behaviors under cellular and environmental constraints. Traditional GEMs often suffer from low accuracy and an inability to correctly predict cellular states. Subsequently, methods that incorporate enzyme constraints were developed to provide a more detailed characterization of cellular metabolic network models. However, these methods are limited by two key issues——insufficient and excessive omics data.[7]
This brings about several new challenges, such as the lack of proteomic data for many strains (e.g., Halo6.0), which is particularly severe for less-studied organisms. The absence of parameters that describe the physical and chemical properties of enzymes in metabolic processes makes it impossible to construct ec-GEMs for these strains. Additionally, even for well-researched bacteria like E. coli, there are still numerous issues to address. For example, quantifying enzymes can reduce the flexibility of ec-GEMs, and an excess of proteomic data may limit the model's ability to adapt to environmental changes.[7]
Therefore, the key to constructing the ec-GEM for our chassis strain Halo 6.0 lies in how to supplement kinetic and omics data while addressing the flexibility issues of the ec-GEM.
We used the GO-term method to classify enzymes into n categories. Based on the proteomic data of Halo 6.0, we set a limit on the total enzyme amount for each category, allowing the specific enzyme quantities within each category to exceed their corresponding omics content range by up to 25%. We also utilized a deep learning-based tool called ProtParam to supplement the characteristic parameters of the enzymes.We used it to predict the necessary enzyme parameters, including molecular weight (M), estimated half-life (t₀), and average hydration value (S). Finally, we used DL-Kcat to predict the kcat values for the enzymes.
First, we need to establish the basic GEM for Halo 6.0. Based on wet-lab experiments, we processed the sequencing reads and performed de novo assembly, obtaining the complete genome map of Halo 6.0. The genome was functionally annotated using a total of 13 databases, including GO, RAST, KEGG, and SwissProt, resulting in a comprehensive genome annotation file with descriptions of CDS, rRNA, sRNA, and tRNA.[8] The reactions in the GEM were constructed using a draft model based on MetaCyc. Transport reactions were manually collected from the MetaCyc database and added to the draft model. Due to the significant homology between Halo 6.0 and E. coli (with 223 homologous sequence pairs identified through BLAST, showing 86.84% identity). We extracted relevant reactions using the E. coli template model iML1515 and the orthologous pairs identified between E. coli and Halo 6.0. We used the orthologous pairs between the E. coli template model iML1515 and Halo 6.0 to extract relevant reactions, supplementing the completeness of the model. Additionally, we incorporated substrate uptake reactions and PHA synthesis reactions, enabling the model to interact with the substrate.[9] In the end, we obtained a basic GEM containing 2,269 reactions.
In the second step, we need to add enzyme constraints to the GEM. The physical and chemical properties of enzymes significantly influence their catalytic efficiency, so we incorporated these characteristics into the considerations for the ec-GEM. We defined a pseudo-metabolite parameter for the enzyme, which is related to the molecular weight M, catalytic constant kcat, half-life t0, and hydrophilicity S. Based on this definition, we propose a hypothesis:
where V represents the reaction rate, fakeM is the enzyme parameter, and e is the enzyme concentration.
We used ProtParam and DL-kcat to predict all these parameters. Then, to provide some flexibility to the ec-GEM, we classified various enzymes based on GO-terms. We impose the following constraints on the GO-term-based ec-GEM: the total protein quantity for each term is set to be within ±25% of the total enzyme quantity within that term, and the quantity range for each specific enzyme is set to be ±25% of the range provided by the omics data.
After the aforementioned modifications, our GO-term-based ec-GEM is now operational. The next step is to establish a connection between cellular states and the fermentation process.Define the objective optimization function Z as follows:
Since the GO-term has already categorized enzymes based on their functions, to optimize the fermentation process of Halo6.0, we need to assume that the relevant terms are in an optimal state. Using this optimal state, we can back-calculate the substrate uptake by Halo6.0. Since toolboxes like COBRA, RAVEN, and GECKO do not offer this type of analysis tool, we opted to use an optimization function Z for the calculations. This method is linear, allowing for significant time savings during the computation process, and can be integrated with hardware sensors for continuous feeding fermentation control.We separately determined the optimal substrate concentration ranges for cell growth and PHA synthesis. It is worth noting that this approach is still in its early stages, and future improvements can focus on optimizing the computation to achieve higher precision while maintaining efficiency.[10]
There are certain limitations to the material exchange between cells and their environment, and we make the following assumptions:
The lower limit for the exchange reaction flux is set at 0 mmol/gDW/h, and the upper limit is set at 1000 mmol/gDW/h, with similar constraints applied to other substrates. The parameters a, b, and Vmax are derived by fitting fermentation data obtained from wet-lab experiments. When the substrate uptake rate calculated by the ec-GEM exceeds v, it indicates that the substrate absorption of Halo6.0 is limited by hypoxia; therefore, the uptake rate should be restricted to v.
Let the uptake rate be A, then A can be expressed as:
Therefore, accurately predicting the cell growth curve is also crucial for automated fermentation control. The cell growth curve we use incorporates a feedback control mechanism. The fitting data of the growth curve can be composed of three parts: 1. Fermentation data obtained from wet experiments, 2. Fermentation data from sensor feedback during the fermentation process, and 3. Fermentation data derived from GO-term-based ec-GEM calculated through the cobra toolbox using Flux Balance Analysis (FBA).[11] Here's how it works:
calculate the similarity between each pair of curves using the Pearson correlation coefficient:
where:
Determine the weights for each curve based on their average similarity to other curves:
where:
mixed growth curve can be expressed as:
The initial value of Curve 2 is consistent with that of Curve 1, meaning that at the start of fermentation, it is effectively a mixture of Curves 1 and 3. As the fermentation process progresses, the parameters of Curves 2 and 3 are updated, and the weights of the curves change, ultimately resulting in a new mixed growth curve that feeds back into the model.
Once we have obtained the cell growth curve, the next consideration is the switching of substrate concentration. Taking glucose as an example, the optimal glucose concentration during the cell proliferation phase and the PHA synthesis phase is clearly different. We assume that the feeding flow rate is F the optimal concentration for growth is c1 and the optimal concentration for synthesis is c2. t-switch is the time point when fermentation begins to transition to a different state. Below, we will introduce the calculation method for t-switch
When it is necessary to maintain a stable state of cell proliferation or PHA synthesis:
When transitioning from the cell proliferation state to the PHA synthesis state:
Equation (2) is the difference equation of (1). Since we have obtained the time t2 when fermentation enters the stationary phase from the growth curve, these two equations allow us to calculate the time t1 to start transitioning based on the determined rate of concentration change. t1 is the t-switch we need.
Through the aforementioned GO-term-based ec-GEM and derivations, we can achieve rational predictions of the fermentation process, which is called Motrol.
Reference
[2]. Sultana, T., et al., Integration site selection by retroviruses and transposable elements in eukaryotes. NATURE REVIEWS GENETICS, 2017. 18(5): p. 292-308.
[3]. Liu, X., et al., Prediction and analysis of prokaryotic promoters based on sequence features. Biosystems, 2020. 197: p. 104218.
[4]. Lu, X., The Analysis of KMP Algorithm and its Optimization. Journal of Physics: Conference Series, 2019. 1345(4): p. 042005.
[5]. Bin Safdar, L., et al., Genome-Wide Association Study and QTL Meta-Analysis Identified Novel Genomic Loci Controlling Potassium Use Efficiency and Agronomic Traits in Bread Wheat. FRONTIERS IN PLANT SCIENCE, 2020. 11.
[6]. Song, K., Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. NUCLEIC ACIDS RESEARCH, 2012. 40(3): p. 963-971.
[7]. Chen, Y., et al., Reconstruction, simulation and analysis of enzyme-constrained metabolic models using GECKO Toolbox 3.0. NATURE PROTOCOLS, 2024. 19(3): p. 629-667.
[8]. Ma, Y.Y., et al., Flux optimization using multiple promoters in Halomonas bluephagenesis as a model chassis of the next generation industrial biotechnology. METABOLIC ENGINEERING, 2024. 81: p. 249-261.
[9]. Ling, C., et al., Engineering NADH/NAD+ ratio in Halomonas bluephagenesis for enhanced production of polyhydroxyalkanoates (PHA). METABOLIC ENGINEERING, 2018. 49: p. 275-286.
[10]. Moreno-Paz, S., et al., Enzyme-constrained models predict the dynamics of Saccharomyces cerevisiae growth in continuous, batch and fed-batch bioreactors. MICROBIAL BIOTECHNOLOGY, 2022. 15(5): p. 1434-1445.
[11]. Bi, X.Y., et al., A Multi-Omics, Machine Learning-Aware, Genome-Wide Metabolic Model of Bacillus Subtilis Refines the Gene Expression and Cell Growth Prediction. ADVANCED SCIENCE, 2024.