In this project, we optimized the promoter, signal peptide, and culture conditions over multiple engineering cycles to construct a highly efficient ICCG expression system. First, we created an ICCG-GFP fusion protein and used its fluorescence intensity as a readout of ICCG expression levels. For the promoter, we generated an Mtac promoter library through random mutagenesis and used a CNN-LSTM model to design the novel DMtac promoter. For the signal peptide, we screened candidate signal peptides, selected nfaA, and designed MnfaA mutants using a Markov transfer frequency matrix model. By combining a high-throughput screening system with predictive models, we significantly improved the expression efficiency of ICCG and provided components and tools for future synthetic biology research.
We used seamless cloning to attach the GFP protein to the C-terminus of ICCG through a GS linker.
Figure 1.1 PCR Amplification Results. According to the agarose gel electrophoresis results, the PCR amplification yielded fragments of the expected length.
Through shake flask cultivation and protein purification, we obtained ICCG-GFP fusion protein samples.
The above results indicate that we have successfully constructed the correct ICCG-GFP fusion protein.
After two rounds of orthogonal experiments, considering that increasing inducer concentration and extending induction time would raise protein production costs (which is unfavorable for large-scale production of ICCG), we selected an induction temperature of 25°C, an induction time of 12 hours, and a final IPTG concentration of 1 mM as the culture conditions for our subsequent high-throughput screening system.
After purifying the ICCG-GFP fusion protein, we measured the fluorescence intensity of protein samples at final concentrations of 0, 2, 4, 8, 16, 32, and 64 µg/mL, using an excitation wavelength of 488 nm and an emission wavelength of 507 nm. The resulting fluorescence intensity-protein concentration standard curve is shown below:
Figure 1.3 Fluorescence Intensity-Protein Concentration Standard Curve
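A standard curve of this kind is typically used by fitting a line to the calibration points and inverting it to estimate the concentration of an unknown sample. The sketch below illustrates that idea with an ordinary least-squares fit. The concentrations are those listed above, but the fluorescence readings are made-up placeholder values (not our measured data), and the function names are hypothetical.

```python
# Concentrations from the text (µg/mL).
concentrations = [0, 2, 4, 8, 16, 32, 64]
# ASSUMPTION: placeholder fluorescence readings (arbitrary units),
# chosen only to illustrate the calculation; not measured data.
fluorescence = [50, 450, 850, 1650, 3250, 6450, 12850]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fluorescence_to_concentration(signal, slope, intercept):
    """Invert the standard curve to estimate protein concentration."""
    return (signal - intercept) / slope


slope, intercept = fit_line(concentrations, fluorescence)
# Estimate the concentration of a hypothetical sample reading 2050 a.u.
estimate = fluorescence_to_concentration(2050, slope, intercept)
print(f"slope={slope:.1f}, intercept={intercept:.1f}, estimate={estimate:.2f} µg/mL")
```

The inversion is only meaningful within the calibrated range (here 0 to 64 µg/mL) and as long as the response stays linear.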
We characterized the expression efficiency of four strains and eight vectors. The results are as follows:
Figure 2.1 Strain and Vector Screening Results
In the end, we selected *Escherichia coli* BL21(DE3) as the chassis and pET-26b(+) as the vector to construct the high-throughput screening system.
Using seamless cloning, we inserted the tac promoter into the pET-26b(+) vector carrying the ICCG-GFP fusion protein gene and characterized its activity.
Figure 3.1 Agarose Gel Electrophoresis Results of tac Promoter Construction
Figure 3.2 Characterization Results of tac Promoter
Compared to the T7 promoter, the tac promoter increased ICCG-GFP expression by 37.94%.
To construct the Mtac promoter mutants, we employed a random mutagenesis approach targeting the 16 bp sequence between the -35 and -10 regions of the tac promoter. We designed a pair of degenerate primers with the mutation region consisting of 16 consecutive N nucleotides to introduce diversity into that sequence, resulting in a series of different promoter mutants. We used PCR amplification to obtain a linearized vector and generated recombinant plasmids through seamless cloning. A total of 12 PCR amplification reactions were conducted, yielding 160 µL of recombinant product.
Figure 3.3 Agarose Gel Electrophoresis Results
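To give a sense of the sequence space opened up by 16 consecutive N positions, the sketch below computes the theoretical library size and generates a few random spacer variants in silico. It assumes the consensus -35 (TTGACA) and -10 (TATAAT) hexamers of the tac promoter as fixed flanks; the function names are hypothetical, and the random sequences are illustrative, not sequences from our library.

```python
import random

MINUS_35 = "TTGACA"   # consensus -35 hexamer (kept fixed)
MINUS_10 = "TATAAT"   # consensus -10 hexamer (kept fixed)
SPACER_LENGTH = 16    # number of degenerate N positions, from the text


def theoretical_library_size(n_positions, alphabet_size=4):
    """Number of distinct sequences encoded by n fully degenerate positions."""
    return alphabet_size ** n_positions


def random_spacer_variant(rng, length=SPACER_LENGTH):
    """Return -35 + random spacer + -10, mimicking one draw from an N16 primer."""
    spacer = "".join(rng.choice("ACGT") for _ in range(length))
    return MINUS_35 + spacer + MINUS_10


rng = random.Random(0)  # fixed seed so the illustration is reproducible
variants = [random_spacer_variant(rng) for _ in range(5)]
library_size = theoretical_library_size(SPACER_LENGTH)

# Fraction of the theoretical space sampled when 1,037 clones are picked.
coverage = 1037 / library_size

for v in variants:
    print(v)
print(f"theoretical size: {library_size:,}; fraction screened: {coverage:.2e}")
```

Because any screen samples only a tiny fraction of this space, a predictive model trained on the screened clones is a natural way to explore the rest.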
We picked a total of 1,037 single clones from 16 plate cultures for trial expression in the high-throughput screening system. Based on the fluorescence intensity of the mutants, we selected for sequencing a set of clones distributed evenly across the fluorescence range, together with clones whose fluorescence intensity exceeded that of the wild-type tac promoter, resulting in 88 Mtac promoter data points.
Figure 3.4 Normalized Fluorescence of Mtac Mutants
The obtained mutant sequences are as follows:
Figure 3.5 Mtac513
Mtac662 increased ICCG-GFP expression by 36.84% compared to the wild-type tac promoter and by 77.68% compared to the T7 promoter.
We generated the Mtac promoter library through random mutagenesis and trained a promoter prediction model on the experimental screening data. Based on this machine learning model, we designed the DMtac promoter, predicting mutants that could potentially enhance ICCG expression efficiency. Using seamless cloning, we successfully obtained correctly constructed recombinant plasmids. The trial expression results are as follows:
Figure 3.6 Normalized Fluorescence of DMtac Mutants
Compared to the original tac promoter, the DMtac promoter significantly improved expression efficiency. Fluorescence/OD600 data indicated that ICCG expression increased by 8.2% under the DMtac promoter. Compared to the T7 promoter, DMtac increased ICCG expression by 30%.
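For readers who want to reproduce this kind of comparison, the sketch below shows one way to normalize fluorescence by OD600 and express a variant relative to a reference promoter. The raw readings are placeholder values chosen for illustration (not our measurements), and the function names are hypothetical.

```python
def normalized_fluorescence(fluorescence, od600):
    """Fluorescence per unit of cell density (Fluorescence/OD600)."""
    if od600 <= 0:
        raise ValueError("OD600 must be positive")
    return fluorescence / od600


def percent_increase(sample, reference):
    """Relative change of sample over reference, in percent."""
    return (sample / reference - 1.0) * 100.0


# ASSUMPTION: placeholder plate-reader values, for illustration only.
tac = normalized_fluorescence(fluorescence=20000, od600=2.0)
dmtac = normalized_fluorescence(fluorescence=21640, od600=2.0)

increase = percent_increase(dmtac, tac)
print(f"DMtac vs tac: {increase:.1f}% increase")
```

Normalizing by OD600 before comparing keeps differences in culture density from being mistaken for differences in promoter strength.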
Given the limited amount of data, we used various regularization techniques (e.g., Dropout, Batch Normalization, and Early Stopping) to prevent overfitting and conducted multiple training sessions to find the optimal model.
Figure 3.7 Model Training Results
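Of the regularization techniques mentioned above, early stopping is the simplest to illustrate without a deep learning framework. The sketch below is a minimal, framework-independent version: training stops once the validation loss has failed to improve for a set number of epochs. The class name, the `patience` value, and the validation losses are all assumptions made for illustration and do not reflect our actual training configuration.

```python
class EarlyStopping:
    """Stop training when validation loss stops improving (illustrative)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def update(self, epoch, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# ASSUMPTION: made-up validation losses standing in for a real training run.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}; best epoch {stopper.best_epoch} "
      f"(val loss {stopper.best_loss})")
```

In practice the model weights from the best epoch are restored after stopping, which is what limits overfitting on a small dataset.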
Final Model Performance on the test set:
These results significantly outperform the predictor reported by Wang et al. in Nucleic Acids Research in 2018 (PCC = 0.514).
Figure 3.8 Heatmap of the Binomial Statistical Model
In our experiments, the mutants generated using the Binomial Statistical model reached an expression level of 108% relative to the tac promoter. Note that the predictor uses Z-score normalization to reduce the impact of extreme values, which results in a relatively conservative predicted expression range and makes the model less likely to forecast mutants with expression levels far above that of the tac promoter. Nonetheless, the outcome meets our expectations and is very promising.
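As a reminder of what Z-score normalization does, the sketch below standardizes a set of expression values to zero mean and unit standard deviation and maps them back again. The values are hypothetical placeholders, the function names are our own, and this is a generic illustration rather than the exact preprocessing used in the predictor.

```python
import statistics


def z_score_normalize(values):
    """Return (z-scores, mean, standard deviation) for a list of values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        raise ValueError("values must not all be identical")
    return [(v - mean) / sd for v in values], mean, sd


def z_score_invert(z_values, mean, sd):
    """Map z-scores back to the original scale."""
    return [z * sd + mean for z in z_values]


# ASSUMPTION: placeholder expression values, with one high outlier.
expression = [12000.0, 15000.0, 18000.0, 21000.0, 40000.0]

z, mean, sd = z_score_normalize(expression)
restored = z_score_invert(z, mean, sd)

for raw, score in zip(expression, z):
    print(f"{raw:>8.0f} -> z = {score:+.2f}")
```

On this scale an extreme raw value is expressed as a modest number of standard deviations from the mean, so a model trained on Z-scores sees a much narrower target range than the raw data spans.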
One specific data point stood out to us: a mutated tac promoter with a measured expression level exceeding 40,000 was consistently predicted by our model to be around 30,000 across several training sessions. Initially, we suspected this discrepancy was due to poor model fitting. However, in the final round of experimental validation, the protein expression level for this promoter was measured at 30,279, confirming that the previous value of over 40,000 was erroneous. This result closely aligns with our model's prediction of 30,000, indicating that our model successfully identified a hidden data anomaly. This finding is exciting and further supports the effectiveness of our model. Details of this part can be found in the Model section.
Using seamless cloning, we inserted 40 signal peptides into the pET-26b(+) vector carrying the ICCG-GFP fusion protein gene and characterized their activity.
Figure 4.1 Signal Peptide Characterization Results
The characterization results of the signal peptides confirmed the model's prediction for the nfaA signal peptide: nfaA was the best-performing signal peptide during the characterization process. Although the secretion efficiency of these signal peptides did not significantly improve in the experiments, they provided valuable reference data for further optimizing PETase expression.
The experimental data showed that while the MnfaA signal peptide did slightly enhance ICCG secretion, the secretion level did not achieve the expected significant improvement. Compared to the control group, the fluorescence signal exhibited only minor changes, indicating that MnfaA's effect on improving ICCG secretion efficiency was limited.
Figure 4.2 Normalized Fluorescence Intensity of MnfaA Mutants
This suggests that MnfaA may require further optimization, particularly in the context of the *Escherichia coli* expression system, where unforeseen factors may be limiting its efficiency.
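The MnfaA mutants were designed with a Markov transfer frequency matrix model, whose details are given in the Model section. As a rough illustration of the general idea only, the sketch below builds a first-order transition frequency matrix from a set of peptide sequences and uses it to score a candidate. The toy sequences, the function names, and the log-probability scoring are all assumptions made for illustration; they are not the signal peptides or the scoring scheme used in our model.

```python
import math
from collections import defaultdict


def transition_frequencies(sequences):
    """Row-normalized frequencies of residue b following residue a."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    matrix = {}
    for a, row in counts.items():
        total = sum(row.values())
        matrix[a] = {b: n / total for b, n in row.items()}
    return matrix


def log_score(seq, matrix, floor=1e-6):
    """Sum of log transition frequencies; unseen transitions get a small floor."""
    return sum(
        math.log(matrix.get(a, {}).get(b, floor))
        for a, b in zip(seq, seq[1:])
    )


# ASSUMPTION: toy peptide fragments, not real signal peptides.
training = ["MKKLL", "MKLLA"]
matrix = transition_frequencies(training)

candidate_score = log_score("MKL", matrix)
unseen_score = log_score("MAK", matrix)
print(matrix["K"], candidate_score, unseen_score)
```

Candidates built from transitions that are frequent in well-secreted peptides score higher, which gives a simple way to rank proposed mutations before testing them experimentally.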
The final training results on the dataset are as follows:
Figure 4.3 Fitted curve optimized with the new dataset
These metrics indicate that the model exhibits both high accuracy and predictive capability. The low MAE and RMSE values suggest that the model effectively minimizes prediction errors, while the relatively high R² demonstrates its substantial capacity to explain the variance within the data. Additionally, the Pearson Correlation Coefficient of 0.8275 indicates a strong positive correlation between the predicted and actual values. Taken together, these results validate the model as a robust tool for predicting *Escherichia coli* protein secretion efficiency.
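For reference, the four metrics discussed above can be computed from paired actual and predicted values as in the sketch below. It uses the standard definitions of MAE, RMSE, R², and the Pearson correlation coefficient; the input values are placeholders for illustration (not our test-set data), and the function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R^2, and Pearson correlation for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n

    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)      # total sum of squares
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))

    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
        "pcc": cov / math.sqrt(ss_tot * ss_pred),
    }


# ASSUMPTION: placeholder values; predictions are offset by a constant 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the predictions are perfectly correlated with the actual values, yet R² is below 1 because of the constant offset. This is why the correlation coefficient and the error-based metrics are best read together.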
This innovative approach significantly advances a relatively underexplored field, opening new avenues for the study and development of *Escherichia coli* protein secretion models while offering valuable insights for future research.
Throughout the project, we successfully optimized several key components, including promoters, signal peptides, and culture conditions, to significantly enhance ICCG expression. Each phase of the project yielded important results, contributing to the overall success of the system.
Using random mutagenesis, we generated the Mtac promoter library and identified mutants that significantly improved expression levels compared to the wild-type tac promoter. By integrating machine learning models, such as CNN-LSTM and binomial statistical models, we further optimized the promoter sequences, leading to the development of the DMtac series. These optimized promoters demonstrated higher expression levels, validating the effectiveness of our approach.
In the signal peptide optimization process, the nfaA peptide was identified as the top performer from a library of *E. coli* signal peptides. Further refinement using a Markov transfer frequency matrix model generated MnfaA mutants, which were experimentally tested for their impact on protein secretion. Although improvements in secretion efficiency were observed, some mutants did not meet the expected results, highlighting areas for further optimization.
We successfully constructed a high-throughput screening system, using fluorescence intensity to quantify ICCG expression levels via ICCG-GFP fusion proteins. This system allowed for rapid screening of various genetic constructs, facilitating efficient identification of high-performing variants.
Two rounds of orthogonal experiments were conducted to optimize the induction temperature, induction time, and IPTG concentration. The conditions of 25°C, 12-hour induction, and 1 mM IPTG were identified as the most suitable for ICCG expression, balancing production cost and efficiency.
Our results indicate that combining random mutagenesis with machine learning models can lead to significant improvements in gene expression. The DMtac promoter series and optimized signal peptides present new opportunities for enhancing protein production in synthetic biology. However, while we achieved higher expression levels and improved secretion efficiency, some of the experimental results did not fully align with model predictions, particularly in the case of signal peptide mutations. This highlights the need for further refinement of predictive models and optimization of genetic components.
Overall, the project successfully demonstrated the potential of integrating experimental and computational approaches to optimize genetic elements, providing a robust framework for future research in protein expression and synthetic biology.