
Promoter Optimization

Introduction

To further leverage the advantages of VersaTobacco as a chassis for synthetic biology, we aim to optimize promoters to enhance product yield.

Promoter optimization is a key task in synthetic biology, and recent advances in deep learning have opened new avenues for it. We therefore trained a deep learning model to predict promoter strength from the promoter sequence. To generate promoter variants, we introduced random mutations into the region from -165 to +5 relative to the annotated TSS; the trained model then predicted the strength of these variants, and variants were screened by predicted strength to optimize the promoter. Selected variants with excellent predicted performance were further tested experimentally to validate the feasibility of this optimization approach.

Data Source and Preprocessing

We utilized data published by Jores, Tobias et al. in 2021 [1], who measured the expression levels of 72,158 core promoters from various sources in mature tobacco leaves using self-transcribing active regulatory region sequencing (STARR-seq). Depending on the requirements of the different models, we encoded the sequences with either one-hot or k-mer representations.
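
As a concrete illustration of the two encodings, here is a minimal sketch; the k-mer size of 3 and the all-zero handling of ambiguous bases are our own illustrative choices, not necessarily the exact settings we used.

```python
import numpy as np

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (len(seq), 4) one-hot matrix; ambiguous bases (e.g. N) stay all-zero."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            mat[i, j] = 1.0
    return mat

def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers (stride 1), the usual input for k-mer embeddings."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Example: a 170 bp promoter window (-165 to +5 around the TSS)
promoter = "ACGT" * 42 + "AC"          # placeholder 170 bp sequence
x = one_hot_encode(promoter)           # shape (170, 4)
tokens = kmer_tokens(promoter, k=3)    # 168 overlapping 3-mers
```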

Promoter Strength Predictor

To more accurately predict promoter strength, we designed and tested various deep learning models, including Bidirectional Convolutional Neural Networks (BiCNN), Residual Networks (ResNet), and models based on the Conformer architecture. Each of these models leverages its unique structural features in an effort to achieve better performance in the promoter strength prediction task.

BiCNN

First, we reproduced the bidirectional convolutional model used by Jores, Tobias et al. as the baseline for this task. The model consists of two forward- and reverse-strand scan layers adapted from DeepGMAP (Onimaru, Koh et al., 2020), with 128 filters and a kernel width of 13.

Fig1: Architecture of our BiCNN model
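
A minimal PyTorch sketch of such a forward/reverse scan is shown below. Only the 128 filters and kernel width of 13 come from the description above; the pooling, regression head, and strand-combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiCNN(nn.Module):
    """Sketch of a bidirectional convolutional scan: the same filters are
    applied to the forward strand and its reverse complement, and the two
    feature maps are combined before a small regression head."""

    def __init__(self, n_filters: int = 128, kernel: int = 13):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel)  # shared filters for both strands
        self.head = nn.Sequential(
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
            nn.Linear(n_filters, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    @staticmethod
    def reverse_complement(x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, length) one-hot with channels ordered A, C, G, T.
        # Complement = flip channel order (A<->T, C<->G); reverse = flip length axis.
        return x.flip(dims=[1, 2])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fwd = torch.relu(self.conv(x))
        rev = torch.relu(self.conv(self.reverse_complement(x)))
        return self.head(torch.maximum(fwd, rev)).squeeze(-1)
```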

ResNet

To further test the feasibility of convolutional neural networks on promoter sequences, we considered ResNet (K. He et al., 2016), which addresses the degradation problem in model training through shortcut connections, allowing complex models to better leverage their advantages.

Fig2: Architecture of our ResNet model
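
To make the shortcut connection concrete, here is a minimal 1-D residual block in PyTorch; the channel count and kernel size are placeholders rather than our model's exact values.

```python
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """One 1-D residual block for sequence input: two convolutions plus a
    shortcut connection, so the block learns a residual on top of the
    identity mapping rather than a full transformation."""

    def __init__(self, channels: int = 64, kernel: int = 7):
        super().__init__()
        pad = kernel // 2  # "same" padding keeps the sequence length fixed
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)  # shortcut: add the input back before activation
```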

Conformer

Convolutional neural networks (CNNs) are widely used in various applications, such as motif and binding site recognition, due to their strengths in pattern recognition. Meanwhile, transformers excel in capturing long-range dependencies, making them highly effective for sequence analysis. To combine the strengths of CNN and transformer, we selected Conformer (Z. Peng et al., 2023) to couple local features with global representations.

Fig3: Architecture of Conformer

Conformer is a dual-network architecture consisting of a CNN branch and a Transformer branch, following the ResNet and ViT architectures, respectively. Due to feature misalignment between the CNN and transformer branches, we redesigned the Feature Coupling Unit (FCU). In the FCU, 1x1 convolutional kernels are used for up/down sampling of features to enable interaction between the different branches.
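
Below is a minimal sketch of such a coupling unit, assuming the dimensions from Table 1; the exact wiring of our redesigned FCU may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCouplingUnit(nn.Module):
    """Sketch of an FCU: 1x1 convolutions project between the CNN feature map
    (batch, C_cnn, L) and the transformer token embedding (batch, T, D), with
    linear interpolation to reconcile the two sequence lengths."""

    def __init__(self, cnn_channels: int = 64, embed_dim: int = 128):
        super().__init__()
        self.cnn_to_trans = nn.Conv1d(cnn_channels, embed_dim, kernel_size=1)  # channel up-sampling
        self.trans_to_cnn = nn.Conv1d(embed_dim, cnn_channels, kernel_size=1)  # channel down-sampling

    def to_tokens(self, feat: torch.Tensor, n_tokens: int) -> torch.Tensor:
        # (B, C_cnn, L) -> (B, n_tokens, embed_dim) for the transformer branch
        x = self.cnn_to_trans(feat)
        x = F.interpolate(x, size=n_tokens, mode="linear", align_corners=False)
        return x.transpose(1, 2)

    def to_feature_map(self, tokens: torch.Tensor, length: int) -> torch.Tensor:
        # (B, T, embed_dim) -> (B, C_cnn, length) for the CNN branch
        x = self.trans_to_cnn(tokens.transpose(1, 2))
        return F.interpolate(x, size=length, mode="linear", align_corners=False)
```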

Table1: Important parameters in Conformer

Parameter              Value
Embedding dimension    128
Conformer layers       6
CNN channels           64
CNN kernel size        7
Transformer dimension  256
Number of heads        16
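
For reproducibility, these settings can be grouped into a single configuration object; the field names below are our own.

```python
from dataclasses import dataclass

@dataclass
class ConformerConfig:
    """Hyperparameters from Table 1."""
    embedding_dim: int = 128
    n_layers: int = 6
    cnn_channels: int = 64
    cnn_kernel_size: int = 7
    transformer_dim: int = 256
    n_heads: int = 16
```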

Training Results

For model training and evaluation, we used Pearson's r and the coefficient of determination (R²) as metrics to compare the performance of the different models. The BiCNN model accurately reproduced the results from the reference (R² = 0.71), while the Conformer achieved the best performance, indicating that coupling CNN and Transformer branches effectively captures the sequence patterns underlying promoter strength. We used this model for subsequent promoter optimization.

Fig4: Comparison of Model Performance
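
Both metrics can be computed with standard libraries; a minimal evaluation helper on a held-out set might look like this.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Pearson's r (linear correlation) and R^2 (variance explained) on held-out promoters."""
    r, _ = pearsonr(y_true, y_pred)
    return {"pearson_r": r, "r2": r2_score(y_true, y_pred)}
```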

Promoter Optimization

We performed single-point mutations on the sequence from -165 to +5 relative to the annotated TSS of protein-coding and microRNA (miRNA) genes and used the Conformer model trained in the previous step to predict the strength of each variant, carrying the highest-scoring mutated sequence forward into the next iteration. Ultimately, we chose two mutated sequences from different evolutionary trajectories, obtained after ten and fifteen rounds of iteration, respectively.

Fig5: Predicted strength and mutation sequence
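
A greedy version of this procedure fits in a few lines; here `predict` stands in for the trained Conformer model, and the dummy predictor in the usage line is purely illustrative.

```python
import random

BASES = "ACGT"

def single_point_mutants(seq: str) -> list[str]:
    """All sequences one substitution away from seq (3 * len(seq) variants)."""
    return [seq[:i] + b + seq[i + 1:]
            for i in range(len(seq)) for b in BASES if b != seq[i]]

def optimize(seq: str, predict, rounds: int = 10) -> str:
    """Greedy in-silico evolution: each round, score every single-point mutant
    with the trained predictor and keep the strongest one. A beam or stochastic
    variant would explore the different evolutionary trajectories mentioned above."""
    for _ in range(rounds):
        seq = max(single_point_mutants(seq), key=predict)
    return seq

# Illustrative usage with a dummy predictor (a real run would call the Conformer model):
best = optimize("ACGT" * 42 + "AC", predict=lambda s: random.random(), rounds=10)
```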


References

[1] Jores, Tobias et al. “Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters.” Nature plants vol. 7,6 (2021): 842-855. doi:10.1038/s41477-021-00932-y

[2] Onimaru, Koh et al. “Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information.” PloS one vol. 15,7 e0235748. 23 Jul. 2020, doi:10.1371/journal.pone.0235748

[3] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90

[4] Z. Peng et al., "Conformer: Local Features Coupling Global Representations for Recognition and Detection," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 8, pp. 9454-9468, Aug. 2023, doi: 10.1109/TPAMI.2023.3243048
