In our project, we optimized promoters, signal peptides, strains, and vectors through multiple engineering cycles, ultimately constructing a highly efficient system for ICCG expression.
In the optimization of strains and vectors, we conducted one round of the "Design-Build-Test-Learn" cycle. We screened the performance of different strains and vectors and selected the most suitable expression system.
To further improve expression efficiency, we conducted two rounds of the "Design-Build-Test-Learn" cycle on culture conditions. By adjusting the culture conditions, we optimized the expression levels of ICCG.
We performed three rounds of the "Design-Build-Test-Learn" cycle for promoter optimization. In the first round, we generated the Mtac promoter library through random mutagenesis and trained a promoter prediction model using the screening data. In the second round, we designed new promoter mutants based on the model, validated their expression efficiency through experiments, and used the results to optimize the model. In the third round, we further refined the model, designed and tested more mutants, and eventually obtained a highly efficient promoter sequence and an optimized prediction model, providing a solid foundation for ICCG expression.
The optimization of signal peptides also went through three rounds of the "Design-Build-Test-Learn" cycle. By improving the signal peptide sequences and combining experimental data, we gradually enhanced the efficiency of signal peptides in ICCG expression. Ultimately, we built a signal peptide design model and obtained an optimized signal peptide sequence.
Through these engineering cycles, we not only successfully optimized the promoters and signal peptides but also constructed a highly efficient ICCG expression system, providing a solid foundation for future expression studies.
We selected four common Escherichia coli protein expression host strains and eight commonly used expression plasmids, and tested them in single-factor experiments.
Strain Table
Strain | Source | Genotype | Description | Antibiotic Resistance |
---|---|---|---|---|
BL21(DE3) | B | F– ompT hsdSB (rB–mB–) gal dcm (DE3) | Common expression host strain | None |
Solu BL21(DE3) | B | F– ompT hsdSB (rB–mB–) gal dcm (DE3) | Improves the solubility of many recombinant target proteins | None |
BL21(DE3)pLysS | B | F– ompT hsdSB (rB–mB–) gal dcm (DE3)pLysS(CmR) | Host strain for tightly controlled expression | Chloramphenicol |
Origami2(DE3) | K-12 | F− λ− ilvG− rfb-50 rph-1 Δgor ΔtrxB lacY1 ompT hsdS(rB− mB−) gal dcm lacI(DE3) | An E. coli strain specialized for enhancing disulfide bond formation within the cell. Provides an oxidative environment for proper folding of disulfide bond-containing proteins. | Tetracycline, Streptomycin |
Vectors:
Vectors |
---|
pET-22b(+) |
pET-26b(+) |
pET-28a(+) |
pET-28a-SUMO |
pET-32a(+) |
pGEX-6p-1 |
pMAT9-SUMO |
pCold-TF |
The ICCG-GFP fusion protein gene (BBa_K5292405) was cloned into each of the eight vectors, and the successfully constructed recombinant plasmids were each transformed into the BL21 (DE3) strain. After screening with the high-throughput screening system, pET-26b(+) was identified as the optimal vector.
The pET-26b-ICCG-GFP recombinant plasmid was then transformed into four different host strains. Based on a comprehensive consideration of the screening data and culture efficiency (induction time, induction temperature, etc.), we selected the BL21 (DE3) strain as the chassis for subsequent mutant screening.
For detailed experimental data, please refer to the Results page; for the high-throughput screening protocol, refer to the Experiments page. The same applies to subsequent sections.
After this round of chassis and vector screening, we have determined that the chassis for subsequent experiments will be BL21 (DE3) and the vector will be pET-26b(+).
We optimized the culture conditions (induction duration, induction temperature, and inducer concentration) by designing an orthogonal experiment. The experimental conditions are shown in the table below:
Table 1 Cycle 2.1 Orthogonal Experiment Table
induction temperature/℃ | induction duration/h | inducer concentration/mM | |||||||
---|---|---|---|---|---|---|---|---|---|
16 | 25 | 30 | 8 | 12 | 16 | 0.2 | 0.5 | 1 | |
1 | + | + | + | ||||||
2 | + | + | + | ||||||
3 | + | + | + | ||||||
4 | + | + | + | ||||||
5 | + | + | + | ||||||
6 | + | + | + | ||||||
7 | + | + | + |
We constructed a high-throughput screening system and characterized protein expression levels using fluorescence intensity.
Based on the characterization results, we designed the second round of orthogonal experiments.
We designed a second orthogonal experiment over the same three factors (induction temperature, induction duration, and inducer concentration), shifting the duration and inducer concentration ranges upward. The experimental conditions are shown in the table below:
Table 2 Cycle 2.2 Orthogonal Experiment Table
induction temperature/℃ | induction duration/h | inducer concentration/mM | |||||||
---|---|---|---|---|---|---|---|---|---|
16 | 25 | 30 | 12 | 16 | 24 | 0.5 | 1 | 1.5 | |
1 | + | + | + | ||||||
2 | + | + | + | ||||||
3 | + | + | + | ||||||
4 | + | + | + | ||||||
5 | + | + | + | ||||||
6 | + | + | + | ||||||
7 | + | + | + |
We established a high-throughput screening system and characterized protein expression levels using fluorescence intensity.
After the second round of orthogonal experiments, we fixed the culture conditions as follows: induction temperature of 25 °C, induction duration of 12 h, and a final IPTG concentration of 1 mM.
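A three-factor, three-level orthogonal design of this kind can be sketched as a standard L9 array. The factor levels below come from Table 2; the run layout is an assumption, because the run tables above do not preserve which levels were combined in each run. `l9_design`, `is_balanced`, and `range_analysis` are hypothetical helper names, and the responses passed to `range_analysis` would be the measured fluorescence values.

```python
from itertools import combinations, product

# Factor levels taken from Table 2 (Cycle 2.2).
LEVELS = {
    "temperature_C": [16, 25, 30],
    "duration_h": [12, 16, 24],
    "iptg_mM": [0.5, 1.0, 1.5],
}


def l9_design(levels):
    """Return 9 runs in which every pair of factors sees each level pair once."""
    names = list(levels)
    runs = []
    for a, b in product(range(3), repeat=2):
        # The third column, (a + b) mod 3, keeps all column pairs balanced.
        indices = (a, b, (a + b) % 3)
        runs.append({name: levels[name][i] for name, i in zip(names, indices)})
    return runs


def is_balanced(runs):
    """Check that each pair of factors covers all 9 level combinations."""
    return all(
        len({(run[f1], run[f2]) for run in runs}) == 9
        for f1, f2 in combinations(runs[0], 2)
    )


def range_analysis(runs, responses):
    """Mean response per level and the max-min range for each factor."""
    summary = {}
    for name in runs[0]:
        by_level = {}
        for run, y in zip(runs, responses):
            by_level.setdefault(run[name], []).append(y)
        means = {level: sum(v) / len(v) for level, v in by_level.items()}
        summary[name] = {
            "means": means,
            "range": max(means.values()) - min(means.values()),
        }
    return summary


runs = l9_design(LEVELS)
```

In a range analysis, the factor with the largest range has the strongest effect on the response, and the level with the highest mean is the candidate optimum for that factor.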
To construct the CNN-LSTM model (Convolutional Neural Network - Long Short-Term Memory Network) for predicting the strength of tac promoter mutants, we first needed to perform random mutagenesis on the tac promoter. Based on several literature reports, we selected the 16 bp region between the -35 and -10 elements for mutagenesis. The mutant tac promoters were designed as shown below, where the red 16 bp represents the mutation region and the blue regions represent the fixed 15 bp sequences flanking it.
Table 1. Mutation sequence design
To construct the Mtac promoter mutants, we used a random mutagenesis approach targeting the 16 bp sequence between the -35 and -10 regions of the tac promoter, a critical area for RNA polymerase binding and transcription initiation. We designed a pair of degenerate primers, with the mutation region encoded by 16 consecutive N nucleotides, to introduce diversity into this sequence, generating a series of different promoter mutants.
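The size of the sequence space opened up by 16 consecutive N positions can be estimated with a short sketch. It assumes each N position incorporates A, C, G, and T independently and with equal probability, which real degenerate primers only approximate; the function names and the clone count of 96 are illustrative.

```python
import math
import random

BASES = "ACGT"
SPACER_LEN = 16  # length of the randomized region between the -35 and -10 elements


def library_diversity(length=SPACER_LEN):
    """Number of distinct sequences encoded by `length` consecutive N positions."""
    return len(BASES) ** length


def sample_spacers(n_clones, length=SPACER_LEN, seed=0):
    """Simulate picking clones: each N position is an independent, uniform base."""
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n_clones)]


def expected_unique(n_clones, length=SPACER_LEN):
    """Expected number of distinct spacers among n_clones uniform draws."""
    d = library_diversity(length)
    # d * (1 - (1 - 1/d) ** n), written with log1p/expm1 for numerical stability.
    return -d * math.expm1(n_clones * math.log1p(-1.0 / d))


spacers = sample_spacers(96)
print(library_diversity())  # 4294967296
```

Because the theoretical diversity (4^16) is far larger than any number of clones that can be screened, essentially every picked clone is expected to carry a different spacer, which is why sequencing the selected samples is needed to identify each variant.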
We used PCR amplification to obtain a linearized vector and seamless cloning to obtain recombinant plasmids carrying the mutated tac promoters. The mixture of plasmids generated by random mutagenesis was then transformed into the expression strain BL21 (DE3). After expression screening through a high-throughput screening system, we selected samples for sequencing of the mutated regions, resulting in the identification of the Mtac sequence.
After obtaining the dataset of tac mutant sequences and their expression levels from the experiment, we constructed a CNN-LSTM model to serve as a predictor for promoter strength.
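Before a sequence can be fed to a convolutional or recurrent network, it has to be turned into numbers. The sketch below uses one-hot encoding, a common choice for DNA inputs; the encoding actually used for the CNN-LSTM predictor is not stated here, so this is an assumption, and the two records are toy values rather than measured data.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows in A, C, G, T order."""
    rows = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected base: {base!r}")
        rows.append([1.0 if base == b else 0.0 for b in BASES])
    return rows


def encode_dataset(records):
    """Turn (sequence, expression) pairs into model inputs X and targets y."""
    X = [one_hot(seq) for seq, _ in records]
    y = [float(level) for _, level in records]
    return X, y


# Toy records for illustration only; not measured data.
toy = [("ACGTACGTACGTACGT", 1.0), ("TTTTAAAACCCCGGGG", 0.4)]
X, y = encode_dataset(toy)
print(len(X), len(X[0]), len(X[0][0]))  # 2 16 4
```

Each 16 bp mutation region becomes a 16 x 4 matrix, which a convolutional layer can scan for local motifs before a recurrent layer models their order.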
After constructing the predictor, we used a GA (Genetic Algorithm) to design a generator, integrating the CNN-LSTM predictor as the evaluation module for the genetic algorithm. The previously obtained tac mutant sequences were selected as the initial population, and through operations such as crossover and mutation in the genetic algorithm, along with evaluation by the predictor, a new set of tac mutant promoters was designed based on the initial tac sequences. These promoters included both high-performing and low-performing variants, with the purpose of validating our model.
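The generator loop described above can be sketched as a small genetic algorithm. This is a simplified illustration, not the project's implementation: `toy_predictor` is a hypothetical stand-in for the trained CNN-LSTM predictor, and the selection scheme (keep the top half), mutation rate, and population size are assumptions.

```python
import random

BASES = "ACGT"


def toy_predictor(seq):
    """Hypothetical stand-in for the trained predictor: rewards A/T content."""
    return sum(base in "AT" for base in seq) / len(seq)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(seq, rate, rng):
    """Replace each base with a randomly drawn base with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)


def evolve(population, predictor, generations=20, rate=0.05, seed=0):
    """Keep the top half each generation and refill with mutated offspring."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=predictor, reverse=True)
        parents = ranked[: max(2, size // 2)]
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = parents + children
    return sorted(population, key=predictor, reverse=True)


# Random 16 bp sequences stand in for the sequenced mutants used as the initial population.
_rng = random.Random(1)
initial = ["".join(_rng.choice(BASES) for _ in range(16)) for _ in range(20)]
designed = evolve(initial, toy_predictor)
```

Because the best sequences are carried over unchanged, the top predicted score never decreases between generations. The quality of the designs is bounded by the quality of the predictor that scores them.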
To construct the DMtac mutants, we designed single-point mutation primers and performed PCR amplification on the template to obtain a linearized vector. Seamless cloning technology was used to obtain recombinant plasmids carrying the DMtac mutants. The recombinant plasmids were then transformed into DH5α for plasmid amplification, followed by plasmid sequencing to confirm the correctly constructed recombinant plasmids.
To characterize the activity of the DMtac promoters, we constructed an ICCG-GFP fusion protein, using fluorescence intensity as a quantitative measure of expression. The following steps were taken to characterize the mutants:
The constructed DMtac promoter mutant plasmids were transformed into Escherichia coli BL21 (DE3) expression strains. The transformed strains were cultured in a high-throughput screening system, and the expression levels of different mutants were compared by measuring fluorescence/OD600.
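The fluorescence/OD600 comparison can be written out as a short calculation. The sketch assumes optional blank subtraction and reports each mutant as a percentage of a reference promoter; the function names and plate readings are made up for illustration and are not measured data.

```python
def normalized_expression(fluorescence, od600, blank_fluorescence=0.0, blank_od600=0.0):
    """Blank-corrected fluorescence per unit of cell density."""
    density = od600 - blank_od600
    if density <= 0:
        raise ValueError("OD600 must exceed the blank")
    return (fluorescence - blank_fluorescence) / density


def relative_to_reference(samples, reference):
    """Express each sample as a percentage of the reference promoter."""
    ref = normalized_expression(*samples[reference])
    return {
        name: 100.0 * normalized_expression(*values) / ref
        for name, values in samples.items()
    }


# Made-up plate readings (fluorescence, OD600) for illustration only.
readings = {"tac": (5000.0, 1.0), "mutant_a": (3000.0, 0.5), "mutant_b": (1200.0, 0.8)}
relative = relative_to_reference(readings, "tac")
print(relative)  # {'tac': 100.0, 'mutant_a': 120.0, 'mutant_b': 30.0}
```

Dividing by OD600 separates promoter strength from differences in growth, so a culture that simply grew denser does not appear to have a stronger promoter.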
The experimental results showed that while our GA generator had a certain level of accuracy, it was still not satisfactory and lacked interpretability. Therefore, we decided to modify the generator.
Because the GA generator was neither accurate enough nor interpretable, we replaced it with a statistical model based on the binomial distribution, an approach whose accuracy has been validated in several studies. Using confidence intervals and hypothesis testing, we identified a series of site-base pair combinations that occur at high frequency among strong promoters. The statistical model is also considerably more interpretable than the GA generator.
Table 2. Mutation sequence design by binomial statistical model
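The idea behind a binomial test for site-base enrichment can be sketched as follows. It assumes a null probability of 0.25 for each base at each position (unbiased random mutagenesis), tests single site-base pairs rather than combinations, and applies no multiple-testing correction; the threshold, function names, and sequences are illustrative, not the project's actual model.

```python
from math import comb

BASES = "ACGT"


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enriched_site_bases(strong_seqs, p_null=0.25, alpha=0.05):
    """List (position, base, count, p_value) entries over-represented in strong promoters."""
    n = len(strong_seqs)
    hits = []
    for pos in range(len(strong_seqs[0])):
        column = [seq[pos] for seq in strong_seqs]
        for base in BASES:
            k = column.count(base)
            p_value = binom_sf(k, n, p_null)
            if p_value < alpha:
                hits.append((pos, base, k, p_value))
    return hits


# Toy sequences for illustration only: position 0 is always A, position 1 is always T.
toy_strong = ["ATCG", "ATGC", "ATTA", "ATAC", "ATCT", "ATGG"]
hits = enriched_site_bases(toy_strong)
print([(pos, base) for pos, base, _, _ in hits])  # [(0, 'A'), (1, 'T')]
```

Each reported hit is a readable statement ("base A at position 0 is over-represented among strong promoters"), which is the interpretability advantage over a black-box generator.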
Same as Cycle 3.2.
The best mutant generated with the binomial statistical model reached 108% of the tac promoter's expression level in the experiment. Because the predictor was trained on Z-score-normalized data to limit the influence of extreme values, the range of expression levels it can predict is relatively conservative, and it cannot predict mutants with expression far above that of the tac promoter. We also found, unexpectedly, that the model had predicted a hidden outlier. Overall, these results are in line with our expectations and are encouraging.
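The Z-score normalization mentioned above, and the back-transformation needed to read a prediction on the original scale, can be sketched as follows. The use of the population standard deviation and the expression values are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Center values on their mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("values have zero variance")
    return [(v - mu) / sigma for v in values], mu, sigma


def inverse_z(z, mu, sigma):
    """Map a predicted Z-score back to the original expression scale."""
    return z * sigma + mu


# Toy expression levels (percent of tac) for illustration only.
levels = [20.0, 40.0, 60.0, 80.0, 100.0]
scaled, mu, sigma = z_scores(levels)
```

A predictor trained on targets scaled this way outputs values in Z-score units, so `mu` and `sigma` from the training set must be stored to convert predictions back to expression levels.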
In this experiment, structural and physicochemical features of the N-region, H-region, and C-region were extracted. Using the Random Forest (RF) regression model developed by Stefano Grasso et al. in 2023, which evaluates Gram-positive bacterial signal peptides with high accuracy, we scored and screened Escherichia coli signal peptides from the signal peptide library. The selected high-scoring signal peptides were then constructed and expressed with ICCG.
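Region-level feature extraction can be illustrated with a few simple descriptors. The sketch assumes the region boundaries are already known and computes length, mean Kyte-Doolittle hydropathy, and net charge per region; the full feature set used by the Random Forest model is not listed here, and the peptide and boundaries are toy values.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def region_features(seq, n_end, h_end):
    """Split a signal peptide at the given indices and describe each region."""
    regions = {"N": seq[:n_end], "H": seq[n_end:h_end], "C": seq[h_end:]}
    features = {}
    for name, region in regions.items():
        features[f"{name}_length"] = len(region)
        # Mean hydropathy; an empty region is reported as 0.0.
        features[f"{name}_hydropathy"] = (
            sum(KD[a] for a in region) / len(region) if region else 0.0
        )
        # Net charge counts K/R as +1 and D/E as -1.
        features[f"{name}_charge"] = sum(a in "KR" for a in region) - sum(
            a in "DE" for a in region
        )
    return features


# Toy peptide and region boundaries chosen for illustration only.
toy_peptide = "MKKTALLLLAAAVSAQA"
feats = region_features(toy_peptide, n_end=4, h_end=12)
```

In this toy example the N-region carries the positive charge and the H-region has the highest mean hydropathy, which matches the general architecture of signal peptides.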
We used PCR amplification to obtain a linearized vector and seamless cloning to obtain recombinant plasmids carrying the signal peptide. The recombinant plasmids were then transformed into the Escherichia coli strain DH5α. We then sequenced the amplified recombinant plasmid and obtained the correctly constructed recombinant plasmid. To characterize the activity of the signal peptide, we constructed an ICCG-GFP fusion protein, using fluorescence intensity as a quantitative measure of expression.
We obtained a series of signal peptide sequences and their corresponding expression levels based on feature values. Among them, the signal peptide **nfaA**, which we predicted to have the best expression, indeed showed the highest expression in the experiment. Although the overall accuracy was somewhat unsatisfactory because the model transferred poorly from Gram-positive to Gram-negative bacteria, it still provided valuable guidance for the experiment.
A comprehensive scoring matrix was constructed from the standard BLOSUM62 amino acid substitution matrix and a hydrophobicity matrix. Starting from the well-performing signal peptide nfaA from the previous round, we used the Markov transition frequency matrix to build a generator based on a Hidden Markov Model (HMM). This generator was used to modify the H-region of the signal peptide, producing artificially designed and optimized mutant signal peptides.
Table 1. Comprehensive score matrix for amino acid substitutions
Table 2. Markov transition frequency matrix
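The role of a transition frequency matrix in sequence generation can be shown with a simplified sketch. It uses a plain first-order Markov chain rather than a full Hidden Markov Model and leaves out the BLOSUM62 and hydrophobicity scoring, so it illustrates the principle only; the training regions, peptide, and boundaries are toy values.

```python
import random
from collections import Counter, defaultdict


def transition_frequencies(sequences):
    """Estimate P(next residue | current residue) from example sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for current, nxts in counts.items()
    }


def sample_region(start, length, transitions, rng):
    """Walk the chain from `start`, stopping early if a residue has no successor."""
    region = start
    while len(region) < length and region[-1] in transitions:
        options = transitions[region[-1]]
        region += rng.choices(list(options), weights=list(options.values()))[0]
    return region


def mutate_h_region(peptide, h_start, h_end, transitions, seed=0):
    """Resample the H-region with the chain, keeping the peptide length unchanged."""
    rng = random.Random(seed)
    new_h = sample_region(peptide[h_start], h_end - h_start, transitions, rng)
    return peptide[:h_start] + new_h + peptide[h_start + len(new_h):]


# Toy H-regions and peptide for illustration only.
training_h_regions = ["ALLLLAAA", "LLALLAVL", "AVLLALLA"]
transitions = transition_frequencies(training_h_regions)
toy_peptide = "MKKTALLLLAAAVSAQA"
mutant = mutate_h_region(toy_peptide, h_start=4, h_end=12, transitions=transitions)
```

Only the H-region is resampled; the N-region and C-region are copied from the parent peptide, so the charged N-terminus and the cleavage site context are preserved.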
We used PCR amplification to obtain a linearized vector and seamless cloning to obtain recombinant plasmids carrying the mutated nfaA signal peptide. The recombinant plasmids were then transformed into the Escherichia coli strain DH5α. We then sequenced the amplified recombinant plasmid and obtained the correctly constructed recombinant plasmid.
To characterize the activity of the MnfaA signal peptide, we constructed an ICCG-GFP fusion protein, using fluorescence intensity as a quantitative measure of expression.
We realized that relying solely on the Hidden Markov generator to modify signal peptides is insufficient; it is still necessary to screen them through a predictor.
We generated a large number of mutant signal peptides using the Hidden Markov Model and evaluated them with the previous prediction model. We then selected mutant signal peptides that performed better than the original nfaA for experimentation.
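The generate-then-filter step can be sketched as a small selection function. `toy_score` is a hypothetical stand-in for the prediction model, and the peptides and the `top_n` cutoff are illustrative assumptions.

```python
def select_improved(candidates, wild_type, predictor, top_n=10):
    """Return up to top_n unique candidates scoring above the wild type, best first."""
    baseline = predictor(wild_type)
    scored = {c: predictor(c) for c in set(candidates) if c != wild_type}
    better = [(c, s) for c, s in scored.items() if s > baseline]
    # Sort by descending score, then alphabetically so ties are deterministic.
    better.sort(key=lambda item: (-item[1], item[0]))
    return better[:top_n]


def toy_score(peptide):
    """Hypothetical stand-in for the signal peptide predictor: counts leucines."""
    return peptide.count("L")


# Toy peptides for illustration only.
wild_type = "MKKTALLLLAAAVSAQA"
candidates = [
    "MKKTLLLLLLAAVSAQA",
    "MKKTAAAAAAAAVSAQA",
    "MKKTALLLLLLAVSAQA",
    wild_type,
]
selected = select_improved(candidates, wild_type, toy_score)
print(selected)  # [('MKKTALLLLLLAVSAQA', 6), ('MKKTLLLLLLAAVSAQA', 6)]
```

Filtering against the parent's own predicted score keeps the comparison relative, which matters when the predictor's absolute values are not well calibrated for the host.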
Same as Cycle 4.2.
We once again confirmed that the model's transferability from Gram-positive bacteria to Gram-negative bacteria is not strong, so adjustments to the model are needed. Fortunately, we have obtained sufficient data from two rounds of testing to support the next step of model fine-tuning.
We trained our own Random Forest model on the large dataset provided by Stefano Grasso et al., and then combined the data from our two rounds of testing into a new small dataset. Using this small dataset, we fine-tuned the larger model to obtain a predictor better suited to the signal peptides used in our experiments, namely those of Gram-negative bacteria. After applying transfer learning, we obtained a model that predicts the protein secretion capacity of Gram-negative bacteria more accurately. When we re-evaluated the previous signal peptides with the new model, the results showed a satisfactory fit. For a more detailed introduction to the signal peptide models in the three design sections above, please refer to our Model page.
1. Zhang, S., Liu, D., Mao, Z., Mao, Y., Ma, H., Chen, T., Zhao, X., & Wang, Z. (2018). Model-based reconstruction of synthetic promoter library in Corynebacterium glutamicum. Biotechnology Letters, 40(5), 819–827.
2. Van Brempt, M., Clauwaert, J., Mey, F., Stock, M., Maertens, J., Waegeman, W., & De Mey, M. (2020). Predictive design of sigma factor-specific promoters. Nature Communications, 11, 5822.
3. Li, Z.-J., Zhang, Z.-X., Xu, Y., Shi, T.-Q., Ye, C., Sun, X.-M., & Huang, H. (2022). CRISPR-based construction of a BL21 (DE3)-derived variant strain library to rapidly improve recombinant protein production. ACS Synthetic Biology, 11(1), 343-352.
4. Cui-Fang, G., Sen, W., Feng-Wei, T., et al. (2017). An artificial design technique to optimize signal peptide. American Journal of Biochemistry and Biotechnology, 13(3).