
Promoter Optimization

Introduction

To further leverage the advantages of VersaTobacco as a chassis for synthetic biology, we aim to optimize promoters to enhance product yield.

Promoter optimization is a key task in synthetic biology, and recent advances in deep learning have opened new avenues for it. We therefore trained a deep learning model to predict promoter strength from the promoter sequence. To generate promoter variants, we introduced random mutations into the region from -165 to +5 relative to the annotated TSS; the trained model then predicted the strength of these variants, and variants were screened by predicted strength to optimize the promoter. Selected variants with excellent predicted performance were further tested experimentally to validate the feasibility of this optimization approach.

Data Source and Preprocessing

We utilized data published by Jores, Tobias et al. in 2021 [1], who measured the expression levels of 72,158 core promoters from various sources in mature tobacco leaves using self-transcribing active regulatory region sequencing (STARR-seq). Depending on the requirements of the different models, we encoded the sequences with either one-hot or k-mer representations.
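
As a concrete illustration of the two encodings, here is a minimal sketch; the k-mer size of 3 and the all-zero handling of ambiguous bases are our own illustrative choices, not necessarily the exact settings we used.

```python
import numpy as np

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (len(seq), 4) one-hot matrix; ambiguous bases (e.g. N) stay all-zero."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            mat[i, j] = 1.0
    return mat

def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers (stride 1), the usual input for k-mer embeddings."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Example: a 170 bp promoter window (-165 to +5 around the TSS)
promoter = "ACGT" * 42 + "AC"          # placeholder 170 bp sequence
x = one_hot_encode(promoter)           # shape (170, 4)
tokens = kmer_tokens(promoter, k=3)    # 168 overlapping 3-mers
```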

Promoter Strength Predictor

To more accurately predict promoter strength, we designed and tested various deep learning models, including Bidirectional Convolutional Neural Networks (BiCNN), Residual Networks (ResNet), and models based on the Conformer architecture. Each of these models leverages its unique structural features in an effort to achieve better performance in the promoter strength prediction task.

BiCNN

First, we reproduced the bidirectional convolutional model used by Jores, Tobias et al. as the baseline for this task. The model consists of two forward- and reverse-strand scan layers adapted from DeepGMAP (Onimaru, Koh et al., 2020), with 128 filters and a kernel width of 13.

Fig1: Architecture of our BiCNN model
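
A minimal PyTorch sketch of such a forward/reverse scan is shown below. Only the 128 filters and kernel width of 13 come from the description above; the pooling, regression head, and strand-combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiCNN(nn.Module):
    """Sketch of a bidirectional convolutional scan: the same filters are
    applied to the forward strand and its reverse complement, and the two
    feature maps are combined before a small regression head."""

    def __init__(self, n_filters: int = 128, kernel: int = 13):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel)  # shared filters for both strands
        self.head = nn.Sequential(
            nn.AdaptiveMaxPool1d(1), nn.Flatten(),
            nn.Linear(n_filters, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    @staticmethod
    def reverse_complement(x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, length) one-hot with channels ordered A, C, G, T.
        # Complement = flip channel order (A<->T, C<->G); reverse = flip length axis.
        return x.flip(dims=[1, 2])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fwd = torch.relu(self.conv(x))
        rev = torch.relu(self.conv(self.reverse_complement(x)))
        return self.head(torch.maximum(fwd, rev)).squeeze(-1)
```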

ResNet

To further test the feasibility of convolutional neural networks on promoter sequences, we considered ResNet (K. He et al., 2016), which addresses the degradation problem in model training through shortcut connections, allowing complex models to better leverage their advantages.

Fig2: Architecture of our ResNet model
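
To make the shortcut connection concrete, here is a minimal 1-D residual block in PyTorch; the channel count and kernel size are placeholders rather than our model's exact values.

```python
import torch
import torch.nn as nn

class ResidualBlock1D(nn.Module):
    """One 1-D residual block for sequence input: two convolutions plus a
    shortcut connection, so the block learns a residual on top of the
    identity mapping rather than a full transformation."""

    def __init__(self, channels: int = 64, kernel: int = 7):
        super().__init__()
        pad = kernel // 2  # "same" padding keeps the sequence length fixed
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)  # shortcut: add the input back before activation
```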

Conformer

Convolutional neural networks (CNNs) are widely used in various applications, such as motif and binding site recognition, due to their strengths in pattern recognition. Meanwhile, transformers excel in capturing long-range dependencies, making them highly effective for sequence analysis. To combine the strengths of CNN and transformer, we selected Conformer (Z. Peng et al., 2023) to couple local features with global representations.

Fig3: Architecture of Conformer

Conformer is a dual-network architecture consisting of a CNN branch and a Transformer branch, following the ResNet and ViT architectures, respectively. Due to feature misalignment between the CNN and transformer branches, we redesigned the Feature Coupling Unit (FCU). In the FCU, 1x1 convolutional kernels are used for up/down sampling of features to enable interaction between the different branches.
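
Below is a minimal sketch of such a coupling unit, assuming the dimensions from Table 1; the exact wiring of our redesigned FCU may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCouplingUnit(nn.Module):
    """Sketch of an FCU: 1x1 convolutions project between the CNN feature map
    (batch, C_cnn, L) and the transformer token embedding (batch, T, D), with
    linear interpolation to reconcile the two sequence lengths."""

    def __init__(self, cnn_channels: int = 64, embed_dim: int = 128):
        super().__init__()
        self.cnn_to_trans = nn.Conv1d(cnn_channels, embed_dim, kernel_size=1)  # channel up-sampling
        self.trans_to_cnn = nn.Conv1d(embed_dim, cnn_channels, kernel_size=1)  # channel down-sampling

    def to_tokens(self, feat: torch.Tensor, n_tokens: int) -> torch.Tensor:
        # (B, C_cnn, L) -> (B, n_tokens, embed_dim) for the transformer branch
        x = self.cnn_to_trans(feat)
        x = F.interpolate(x, size=n_tokens, mode="linear", align_corners=False)
        return x.transpose(1, 2)

    def to_feature_map(self, tokens: torch.Tensor, length: int) -> torch.Tensor:
        # (B, T, embed_dim) -> (B, C_cnn, length) for the CNN branch
        x = self.trans_to_cnn(tokens.transpose(1, 2))
        return F.interpolate(x, size=length, mode="linear", align_corners=False)
```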

Table1: Important parameters in Conformer

Parameter              Value
Embedding dimension    128
Conformer layers       6
CNN channels           64
CNN kernel size        7
Transformer dimension  256
Number of heads        16
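
For reproducibility, these settings can be grouped into a single configuration object; the field names below are our own.

```python
from dataclasses import dataclass

@dataclass
class ConformerConfig:
    """Hyperparameters from Table 1."""
    embedding_dim: int = 128
    n_layers: int = 6
    cnn_channels: int = 64
    cnn_kernel_size: int = 7
    transformer_dim: int = 256
    n_heads: int = 16
```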

Training Results

For model training and evaluation, we used Pearson's r and the coefficient of determination (R²) as metrics to compare the performance of the different models. The BiCNN model accurately reproduced the results from the reference (R² = 0.71), while the Conformer achieved the best performance, indicating that coupling CNN and Transformer branches effectively captures the sequence patterns underlying promoter strength. We used this model for subsequent promoter optimization.

Fig4: Comparison of Model Performance
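
Both metrics can be computed with standard libraries; a minimal evaluation helper on a held-out set might look like this.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Pearson's r (linear correlation) and R^2 (variance explained) on held-out promoters."""
    r, _ = pearsonr(y_true, y_pred)
    return {"pearson_r": r, "r2": r2_score(y_true, y_pred)}
```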

Promoter Optimization

We performed single-point mutations on the sequence from -165 to +5 relative to the annotated TSS of protein-coding and microRNA (miRNA) genes and used the Conformer model trained in the previous step to predict the strength of each variant, carrying the highest-scoring mutated sequence forward into the next iteration. Ultimately, we chose two mutated sequences from different evolutionary trajectories, obtained after ten and fifteen rounds of iteration, respectively.

Fig5: Predicted strength and mutation sequence
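
A greedy version of this procedure fits in a few lines; here `predict` stands in for the trained Conformer model, and the dummy predictor in the usage line is purely illustrative.

```python
import random

BASES = "ACGT"

def single_point_mutants(seq: str) -> list[str]:
    """All sequences one substitution away from seq (3 * len(seq) variants)."""
    return [seq[:i] + b + seq[i + 1:]
            for i in range(len(seq)) for b in BASES if b != seq[i]]

def optimize(seq: str, predict, rounds: int = 10) -> str:
    """Greedy in-silico evolution: each round, score every single-point mutant
    with the trained predictor and keep the strongest one. A beam or stochastic
    variant would explore the different evolutionary trajectories mentioned above."""
    for _ in range(rounds):
        seq = max(single_point_mutants(seq), key=predict)
    return seq

# Illustrative usage with a dummy predictor (a real run would call the Conformer model):
best = optimize("ACGT" * 42 + "AC", predict=lambda s: random.random(), rounds=10)
```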


References

[1] Jores, Tobias et al. “Synthetic promoter designs enabled by a comprehensive analysis of plant core promoters.” Nature plants vol. 7,6 (2021): 842-855. doi:10.1038/s41477-021-00932-y

[2] Onimaru, Koh et al. “Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information.” PloS one vol. 15,7 e0235748. 23 Jul. 2020, doi:10.1371/journal.pone.0235748

[3] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90

[4] Z. Peng et al., "Conformer: Local Features Coupling Global Representations for Recognition and Detection," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 8, pp. 9454-9468, Aug. 2023, doi: 10.1109/TPAMI.2023.3243048
