Promoter strength and Ribosome Binding Site (RBS) strength are both essential in genetic engineering and synthetic biology,
because they strongly influence gene expression levels. However, traditional lab methods for measuring promoter and RBS
strength are time-consuming and resource-intensive, since they involve laborious experimental
assays and complex bioinformatics analyses.
Software is a versatile tool that can be deployed to tackle significant problems in the field of synthetic biology and accelerate
traditional lab methods, by incorporating state-of-the-art technologies, such as machine learning, and increasing time efficiency
while avoiding potential human error.
Our team, wishing to contribute to this field, decided to develop a software tool that tackles the above-mentioned problem.
Drawing inspiration from the Software Tool of the iGEM Tsinghua 2023 Team 1, the iGEM Thessaloniki 2024 Team presents
Sequence AI-nalyzer, an AI-powered software tool designed to streamline the analysis of genetic regulatory elements and assist in the improvement
of synthetic biology parts.
Sequence AI-nalyzer comprises two machine learning models that together form a single software tool.
More specifically, the two models are:
Promoter Strength Prediction Model
RBS Strength Prediction Model
Our goal when we started developing our software was to use it for promoter and RBS selection during the cloning process and, furthermore,
to contribute to the iGEM Tsinghua 2023 Software Tool by enriching it with more functions and improving its efficiency, thus creating a more
well-rounded software tool for part enhancement. Our code can be accessed through our GitLab Repository.
Figure 1: Illustration of the gene expression process and how the concept of strength arises
Promoter Strength Prediction Model
Overview
The first software tool of Sequence AI-nalyzer is the Promoter Strength Prediction Model (PSP Model). The PSP
Model is designed to predict the strength of prokaryotic (E. coli) promoters by analyzing the promoter sequence and detecting motifs
that affect the strength value. We define this promoter strength value by the level of downstream gene expression.
The iGEM Tsinghua 2023 Team has already developed two different machine learning models for this task. They chose to treat it as a regression task
rather than a classification task (strong versus weak promoter), because of the difficulties faced when a classification model is applied in labs 1.
Our team also decided to treat promoter strength prediction as a regression task, and to create a deep learning model that differs from the machine learning models the Tsinghua 2023 team developed.
We applied the Engineering Cycle Methodology during every step of our model design and creation. More information about our design process can be found on our Engineering Cycle Page.
Here, we describe our proposed model in detail.
Our model has a total of 3,864,577 parameters (14.74 MB of memory). We trained and deployed it on Google Colab,
using the T4 GPU that Google Colab provides.
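The reported memory figure can be cross-checked with a short calculation. This is a minimal sketch, assuming each parameter is stored as a 32-bit float and that the size is expressed in units of 2**20 bytes; the function name is hypothetical.

```python
# Assumption: every parameter is stored as a 32-bit (4-byte) float.
NUM_PARAMETERS = 3_864_577  # total parameter count reported in the text
BYTES_PER_PARAMETER = 4     # float32 (assumed)


def parameter_memory_mib(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Return the memory needed to store the parameters, in units of 2**20 bytes."""
    return num_parameters * bytes_per_parameter / 2**20


memory_mib = parameter_memory_mib(NUM_PARAMETERS, BYTES_PER_PARAMETER)
print(f"{memory_mib:.2f}")
```

Under these assumptions the result rounds to 14.74, which agrees with the figure quoted for the model.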
Training Dataset
To train our model, we used the same training dataset as the Tsinghua 2023 Team, which we obtained from their GitLab Repository 2.
This dataset comes from a 2015 study by Thomason et al. 3 and comprises 11884 E. coli
promoter sequences, along with their respective strengths. We looked for additional training data that matched our requirements and the
training data the Tsinghua 2023 Team used, but we were unable to find any.
Since the promoter sequences are categorical data, we employed the One-Hot Encoding Method 4 to transform
them into numerical data that our model can process. Each promoter sequence is 50 bases long (A, C, G, and T). By using one-hot
encoding, each sequence is transformed into a matrix of shape (50, 4), as shown in the figure below.
Figure 2: One-Hot Encoding Method Application for Promoter Sequences
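The encoding step can be illustrated with a minimal sketch in plain Python. The column order A, C, G, T, the function name, and the example sequence are assumptions made for illustration; they are not taken from the team's repository.

```python
from typing import List

BASES = "ACGT"  # column order is an assumption; any fixed order works


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode a DNA sequence as a (len(sequence), 4) matrix of 0s and 1s."""
    index = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in sequence.upper():
        if base not in index:
            raise ValueError(f"Unexpected base: {base!r}")
        row = [0] * len(BASES)
        row[index[base]] = 1  # exactly one column is set per position
        matrix.append(row)
    return matrix


# Hypothetical 50-base promoter sequence, used only for illustration.
example = "TTGACA" + "ACGT" * 5 + "TATAAT" + "GCGCATAT" * 2 + "AC"
encoded = one_hot_encode(example)
print(len(encoded), len(encoded[0]))  # rows, columns
```

For a 50-base input the result has 50 rows and 4 columns, matching the (50, 4) shape described above.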
After carefully examining the training dataset, we noticed that the target values (the promoter strengths) vary significantly,
and that there are far fewer promoter sequences with high strength values than promoter sequences with low
strength values (more information regarding this issue can be found at our Engineering Cycle Page, Software Cycle 1, Learn Step). This can lead to instability during model
training and cause gradient vanishing. Thus, we applied a logarithmic transformation to the target values to ensure numerical stability. We also created
two different training datasets, one that contains all 11884 promoter sequences and one that contains only promoter sequences
with low and medium strength values, so that we can assess the model performance effectively. We defined the low, medium, and high
strength ranges arbitrarily, by observing the range of the strength values of all sequences.
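A logarithmic treatment of the targets could look like the sketch below. The exact form of the transformation is not stated in the text, so the use of log(1 + x), the function names, and the example values are all assumptions.

```python
import math
from typing import List


def log_transform(strengths: List[float]) -> List[float]:
    """Compress the target range with log(1 + x); assumes non-negative strengths."""
    return [math.log1p(s) for s in strengths]


def inverse_log_transform(values: List[float]) -> List[float]:
    """Map transformed values (or model outputs) back to the original strength scale."""
    return [math.expm1(v) for v in values]


# Hypothetical strength values spanning several orders of magnitude.
strengths = [12.0, 350.0, 7000.0, 90000.0]
transformed = log_transform(strengths)
restored = inverse_log_transform(transformed)
print([round(t, 2) for t in transformed])
```

The transformed targets span a much narrower range than the raw values, and the inverse function recovers the original scale when predictions need to be reported as strengths.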
Both encoded training datasets were randomly split into training (85% of sequences) and validation (15% of sequences) sets using
the train_test_split function from scikit-learn.
Model Architecture
For our task, we designed our model architecture to be similar to ResNet (Residual Network) 5. ResNet
uses convolutional blocks (Convolutional Neural Network Architecture) and is enriched with residual (or skip) connections,
forming what are known as Residual Blocks. In these blocks, the input to a layer is added directly to the output after passing
through a few convolutional layers. This skip connection allows the network to retain information across layers and prevents
issues like vanishing gradients during training. In other words, our model consists of convolutional blocks that are connected
by residual connections. Each convolutional block in our model consists of a 1D Convolutional Layer, a Batch Normalization layer, and a ReLU
layer; the first convolutional block additionally ends with a MaxPooling1D layer.
Figure 3: Promoter Strength Prediction Model Architecture
Figure 4: Residual Block Architecture
The ResNet architecture has several advantages over a traditional CNN architecture. The residual connections
address the problem of diminishing gradients by allowing gradients to flow directly through the network, so that even
very deep layers can be trained efficiently. Furthermore, because layers can be stacked efficiently, ResNet is well suited
to learning complex features. With residual connections, each layer learns an incremental change to its input, which
lets deeper layers learn increasingly complex representations.
This helps our model capture subtle patterns and motifs.
In the tables below, we have listed the parameters we used for our model.
Initial Convolutional Layer 1D
filters: 64
kernel_size: 7
strides: 2
padding: 'same'

Initial MaxPooling1D Layer
pool_size: 3
strides: 2
padding: 'same'
Learning Rate
initial_learning_rate: 0.001
decay_steps: 10000
decay_rate: 0.9
staircase: True
Model Training
epochs: 10
batch_size: 32
optimizer: 'Adam'
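The learning rate settings listed above describe an exponential decay schedule. The sketch below shows how the rate evolves over training steps. It reads the 0.9 entry of the Learning Rate table as the decay factor, and it assumes an 85% training share of the 11884 sequences with the validation share rounded up. The function and parameter names are hypothetical.

```python
import math


def exponential_decay_lr(step, initial_learning_rate=0.001, decay_steps=10000,
                         decay_factor=0.9, staircase=True):
    """Learning rate at a given optimizer step under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        exponent = math.floor(exponent)  # decay in discrete jumps
    return initial_learning_rate * decay_factor ** exponent


# Assumed step count: 85% of 11884 sequences, batch size 32, 10 epochs.
train_size = 11884 - math.ceil(0.15 * 11884)  # validation share rounded up (assumption)
steps_per_epoch = math.ceil(train_size / 32)
total_steps = steps_per_epoch * 10

print(total_steps, exponential_decay_lr(total_steps))
print(exponential_decay_lr(10000), exponential_decay_lr(25000))
```

Under these assumptions training runs for about 3,160 steps, which is fewer than decay_steps, so with staircase decay the rate would stay at its initial value for the whole run. The first drop (to about 0.0009) would only occur at step 10,000.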
Residual Block Parameters
filters: {64, 64, 128, 128, 256, 256, 512, 512}
strides: {2, 1, 2, 1, 2, 1, 2, 1}
conv_shortcut: {1, 0, 1, 0, 1, 0, 1, 0}
kernel_size: 1 for the shortcut conv. block and 3 for the others
padding: 'valid' for conv. block and 'same' for the others
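To see what these strides mean for a 50-base input, the sketch below traces the sequence length through the network. It assumes the layer order suggested by the tables (initial convolution, initial pooling, then eight residual blocks) and the usual output-length formulas for 'same' and 'valid' padding. The function names are hypothetical.

```python
import math
from typing import List


def conv_output_length(length: int, kernel_size: int, stride: int, padding: str) -> int:
    """Output length of a 1D convolution or pooling layer."""
    if padding == "same":
        return math.ceil(length / stride)
    if padding == "valid":
        return (length - kernel_size) // stride + 1
    raise ValueError(f"Unknown padding: {padding!r}")


def trace_lengths(input_length: int, block_strides: List[int]) -> List[int]:
    """Track the sequence length through the initial layers and each residual block."""
    lengths = [input_length]
    # Initial convolution: kernel_size=7, strides=2, padding='same'
    lengths.append(conv_output_length(lengths[-1], 7, 2, "same"))
    # Initial max pooling: pool_size=3, strides=2, padding='same'
    lengths.append(conv_output_length(lengths[-1], 3, 2, "same"))
    # Residual blocks: only the block stride changes the length (assumption:
    # the strided layer in each block uses kernel_size=3 with 'same' padding).
    for stride in block_strides:
        lengths.append(conv_output_length(lengths[-1], 3, stride, "same"))
    return lengths


lengths = trace_lengths(50, [2, 1, 2, 1, 2, 1, 2, 1])
print(lengths)
```

Under these assumptions the length shrinks from 50 to 1 by the last block, while the number of filters grows from 64 to 512. A kernel-size-1 shortcut convolution with 'valid' padding and the same stride produces the same length as the main path, which is what allows the two to be added.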
Model Performance
As mentioned in “Training Dataset”, due to the large range of the target values, we trained our model on two
different datasets: one that contains all 11884 sequences, and one that contains only the sequences with low and
medium strength values. We trained on these two datasets independently to compare their
performance and find out whether the large range of target values negatively influences the model’s efficacy. As the
metric for our regression task, we used Mean Absolute Error (MAE).
For the whole dataset, the model MAE was approximately 1154.79. To examine whether this large value resulted
from the small number of training examples with high strength, we calculated the MAE for
low-medium strength promoters and high strength promoters independently (we arbitrarily set the boundary between
those two categories at a strength value of 7000, after carefully analyzing the range of target values in the
training data).
For the promoters whose real strength value was less than or equal to 7000, the MAE was approximately 363.94.
For the promoters whose real strength value was more than 7000, the MAE was approximately 27685.07, a much higher
number. This shows that our model is unable to accurately predict the strength of high strength promoters. We hypothesize that this is
due to a lack of training data for high strength promoters, since our model predicts the strength of low-medium strength
promoters, for which we have an adequate amount of training data, with much greater accuracy.
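The split evaluation described above can be sketched in plain Python as follows. The helper names and the example values are hypothetical; only the threshold of 7000 comes from the text.

```python
from typing import Dict, List, Optional


def mean_absolute_error(y_true: List[float], y_pred: List[float]) -> Optional[float]:
    """Average of |true - predicted|; returns None for an empty group."""
    if not y_true:
        return None
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_by_threshold(y_true: List[float], y_pred: List[float],
                     threshold: float = 7000.0) -> Dict[str, Optional[float]]:
    """MAE over all samples, over true values <= threshold, and over true values > threshold."""
    low = [(t, p) for t, p in zip(y_true, y_pred) if t <= threshold]
    high = [(t, p) for t, p in zip(y_true, y_pred) if t > threshold]
    return {
        "all": mean_absolute_error(y_true, y_pred),
        "low_medium": mean_absolute_error([t for t, _ in low], [p for _, p in low]),
        "high": mean_absolute_error([t for t, _ in high], [p for _, p in high]),
    }


# Hypothetical true and predicted strengths, for illustration only.
y_true = [100.0, 2500.0, 6900.0, 15000.0, 80000.0]
y_pred = [150.0, 2000.0, 7400.0, 9000.0, 30000.0]
print(mae_by_threshold(y_true, y_pred))
```

Grouping by the true value rather than the predicted one keeps the two groups fixed regardless of how the model behaves, so a few large errors on high strength promoters can be seen to dominate the overall MAE.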
For the second dataset, which contained only promoters whose real strength values were less than or equal to 7000, the MAE was
approximately 418.94. Since this value is higher than 363.94, it suggests that including all the available sequences in the
training dataset helps the model capture more motifs and patterns and predict more accurately the strength of promoters whose
strength values are less than or equal to 7000.
Final Remarks
Although we have submitted our model, we acknowledge that it is imperative to continue our work and enhance the PSP Model further.
We can achieve this by integrating other architectures, such as LSTM and Attention Layers, that can improve the model performance and
decrease its current prediction errors. Further, we can enhance the model by searching for more training data, in order to address one
of the limitations of our current training dataset.
Despite the fact that our model has room for improvement, it can still be used to predict the promoter strength of sequences that match
our training dataset with great accuracy.
You can access our code through our GitLab Repository. We are open to contributions!
RBS Strength Prediction Model
Overview
The second tool of the Sequence AI-nalyzer is the RBS Strength Prediction Model (RSP Model). The purpose of the model is to
qualitatively characterize the sequences from the end of the promoter to the beginning of the gene (conventionally "RBS" sequences) as "strong", "medium" or
"weak". We define these strength levels by the expression rate of the gene downstream of the RBS sequence.
Training Dataset
To train our model, we used a training dataset that we obtained from the Athens 2022 Team’s wiki page 6. This dataset consists of 163 RBS sequences along with
their strength values. We were unable to find more training data, so the volume was not sufficient to predict a numerical strength value rather than a strength level.
We therefore annotated the original dataset with a strength class (weak, medium, strong) next to each strength value, chosen so as to produce three equally sized categories.
Furthermore, because of the particular variation in strength values and the limited data, we created a second dataset without the medium class by eliminating some sequences
with intermediate values and assigning high-medium values to the strong class and low-medium values to the weak class.
The feature extraction process is based on k-mer generation. K-mers are overlapping substrings of a fixed length k extracted from the RBS sequences, which capture short
patterns of nucleotides within each sequence. For this implementation, the model uses quad-nucleotide k-mers (k=4), meaning that each k-mer is composed of four consecutive
nucleotides. The k-mers are generated using a sliding window approach over each sequence, allowing for overlapping substrings. By applying this method, the RBS sequence is broken
down into smaller pieces that contain biologically meaningful patterns, which can be useful in determining the strength of ribosome binding. Further information about the data
treatment can be found in the next subsection.
Figure 5: K-mer Encoding Method Application for RBS Sequences
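The sliding-window extraction can be sketched in a few lines of Python. The function name and the example sequence are hypothetical; only k=4 and the overlapping window come from the description above.

```python
from typing import List


def generate_kmers(sequence: str, k: int = 4) -> List[str]:
    """Return all overlapping k-mers of a sequence, using a sliding window of step 1."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical short RBS-like sequence, for illustration only.
kmers = generate_kmers("AAGGAGGT", k=4)
print(kmers)            # ['AAGG', 'AGGA', 'GGAG', 'GAGG', 'AGGT']
print(" ".join(kmers))  # space-joined form, convenient for a text vectorizer
```

A sequence of length n yields n - k + 1 k-mers, so an 8-base sequence gives five 4-mers. Sequences shorter than k give an empty list.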
Model Architecture
Once the k-mers are generated, they are joined into a single string for each sequence, and the model applies the CountVectorizer technique to convert these k-mer strings into numerical
data. CountVectorizer transforms the k-mer sequences into a sparse matrix where each row corresponds to an RBS sequence, and each column represents a unique k-mer. The value in the
matrix represents the frequency of the k-mer in the sequence. This transformation is important because machine learning models cannot directly process textual or biological data, so the
sequences are converted into a format that the model can interpret. Essentially, CountVectorizer works like a bag-of-words model used in text classification, treating the k-mers as features.
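To make the count matrix concrete, the sketch below builds one by hand with the standard library. It is a plain-Python stand-in for the idea behind CountVectorizer, not the library itself; the function name and example strings are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def kmer_count_matrix(kmer_strings: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Build a dense count matrix: one row per sequence, one column per unique k-mer."""
    counts = [Counter(s.split()) for s in kmer_strings]
    vocabulary = sorted(set().union(*counts)) if counts else []
    # A k-mer that is absent from a sequence gets a count of 0.
    matrix = [[c[kmer] for kmer in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical space-joined 4-mer strings for two sequences.
docs = ["AAGG AGGA GGAG GAGG AGGT", "GGAG GAGG AGGA GGAG"]
vocabulary, matrix = kmer_count_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

scikit-learn's CountVectorizer produces the same kind of matrix, but stores it in sparse form and lowercases tokens by default.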
The strength labels need to be encoded into numerical values before the model can use them. LabelEncoder is employed to convert these categories into integers, where each unique label is mapped
to a corresponding number. This ensures that the labels are in a suitable format for classification. After the feature and label encoding steps are complete, the data is split into training and
testing sets using an 80-20 split. The training set is used to train the model, while the testing set is reserved for evaluating the model’s performance on unseen data, which helps reveal whether the model has
overfit to the training data.
For this classification task, we had planned to create a neural network model, but due to our limited amount of data we chose an SVM instead. Neural networks require large amounts of data to avoid
overfitting because of their complex structure and the many parameters that need to be trained. With insufficient data, neural networks are more likely to memorize the training set than to generalize.
SVMs, on the other hand, are better suited to smaller datasets because they are less prone to overfitting, require fewer hyperparameters, and handle high-dimensional feature spaces more effectively 7, 8.
A Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel was used for the multiclass classification task (weak, medium, strong) and one with a linear kernel for the binary classification task (weak, strong). SVM
is a powerful algorithm that finds the optimal hyperplane to separate data points into different classes. The RBF kernel is particularly useful for capturing non-linear relationships between features, which may
exist in complex biological data like nucleotide sequences 9. In addition, the regularization parameter C is set to 10.0; C controls the trade-off between achieving a large margin for separating classes and minimizing
classification errors. A larger C allows the model to fit more tightly to the training data, which can improve performance on complex datasets where linear separation is insufficient 10, 11.
Figure 6: RBS Strength Prediction Model Architecture
Figure 7: Linear and RBF Support Vector Machine representation
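The pipeline described in this subsection can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the team's actual code: the toy sequences and labels are invented to make the example runnable, and `random_state` and `stratify` are added here only for reproducibility.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC


def to_kmer_string(sequence, k=4):
    """Join the overlapping k-mers of a sequence into one space-separated string."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy data: NOT the real dataset, only here to make the sketch runnable.
sequences = ["AAGGAGGTAAAT", "TAAGGAGGTTTA", "AGGAGGAAACAT", "AAAGGAGGTCAT", "GGAGGTATACAT",
             "TTTACACATGCA", "CTCTCTACGTCA", "TACTACGTCTCT", "CACATCTCTGCA", "TCTCACGTACTA"]
labels = ["strong"] * 5 + ["weak"] * 5

# Features: k-mer counts. Labels: integers.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([to_kmer_string(s) for s in sequences])
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# 80-20 split; random_state and stratify are assumptions added for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# RBF-kernel SVM with C=10.0, as described for the three-class task.
model = SVC(kernel="rbf", C=10.0)
model.fit(X_train, y_train)

predicted = label_encoder.inverse_transform(model.predict(X_test))
print(list(predicted))
```

The sketch fits the vectorizer before splitting, following the order given in the text. Passing `kernel="linear"` instead gives the variant described for the two-class task.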
Once the SVM model is trained on the training set, it makes predictions on the test set. These predictions are compared with the true labels of the test set to evaluate the model’s performance. The accuracy score is computed,
which gives the proportion of correct predictions made by the model. Additionally, a classification report is generated, which provides detailed metrics such as precision, recall, and F1-score for each class. These metrics give
insight into how well the model performs for each category, allowing the user to assess whether the model is equally effective at predicting high, medium, and low strength sequences.
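The per-class metrics in such a report follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them by hand on hypothetical labels; the function names and the example data are illustrative only.

```python
from typing import Dict, List


def per_class_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1-score for every class that appears in y_true or y_pred."""
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Proportion of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical true and predicted labels, for illustration only.
y_true = ["strong", "strong", "medium", "medium", "weak", "weak"]
y_pred = ["strong", "strong", "weak", "medium", "weak", "weak"]
print(round(accuracy(y_true, y_pred), 2))
for cls, scores in per_class_metrics(y_true, y_pred).items():
    print(cls, {name: round(value, 2) for name, value in scores.items()})
```

In this toy case one medium sequence is predicted as weak, which lowers the recall of the medium class and the precision of the weak class while leaving the strong class unaffected.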
Finally, the accuracy and the classification report are printed to provide a summary of the model’s performance. This architecture provides a robust framework for classifying biological sequences based on their nucleotide patterns
and demonstrates the power of using machine learning to solve bioinformatics problems. By employing k-mer generation, vectorization, and a non-linear SVM classifier, the model can efficiently handle sequence data and make predictions
based on the inherent structure of the RBS sequences.
Model Performance
For Multiclassification Task (Weak – Medium – Strong)
The Support Vector Machine (SVM) model trained to classify DNA sequences into "medium," "strong," and "weak" categories yielded an overall accuracy of 69.7% on the test data. This indicates that the model correctly predicted the class of
the sequences nearly 70% of the time. While this accuracy shows the model is somewhat effective, there is still room for improvement, particularly in distinguishing the "medium" class from the others.
The "strong" class had the best overall performance with a precision of 0.82 and a recall of 0.82, meaning that the model was highly accurate in identifying strong sequences and correctly labeled 82% of the actual strong sequences. This results
in a strong F1-score of 0.82, indicating a good balance between precision and recall for this class.
The "weak" class also performed relatively well, with a precision of 0.64 and an impressive recall of 0.90. The high recall shows that the model correctly identified 90% of the actual weak sequences, though its lower precision indicates that 36% of the sequences
labeled as weak were actually misclassified. The F1-score of 0.75 for the weak class demonstrates a reasonable balance between precision and recall.
However, the model struggled with the "medium" class, achieving only 0.62 precision and 0.42 recall. This means that the model was less reliable in identifying medium sequences and missed 58% of actual medium sequences, resulting in an F1-score of 0.50. This lower
performance suggests that the model has difficulty distinguishing medium sequences from the other classes.
Figure 8: Accuracy and Classification Report for Multiclassification Task
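The F1-scores quoted above are the harmonic mean of precision and recall, so they can be cross-checked from the reported precision and recall values. The sketch below does this; the function name is hypothetical, and the inputs are the rounded figures from the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the text for the three-class task.
reported = {"strong": (0.82, 0.82), "weak": (0.64, 0.90), "medium": (0.62, 0.42)}
for name, (precision, recall) in reported.items():
    print(name, round(f1_score(precision, recall), 2))
```

The recomputed values (0.82, 0.75, and 0.50) match the F1-scores stated for the strong, weak, and medium classes.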
For Classification Task (Weak – Strong)
The Support Vector Machine (SVM) model trained to classify DNA sequences as "strong" or "weak" achieved a commendable overall accuracy of 96%. This indicates that the model correctly predicted the strength of the sequences 96% of the time when tested on unseen data.
Such a high accuracy highlights the model's effectiveness in distinguishing between the two classes based on the k-mer encoded features of the sequences.
The classification report offers deeper insights into the model's performance for each class. For the "strong" class, the model achieved a perfect precision of 1.00, meaning that all sequences predicted as "strong" were indeed strong. The recall for this class was 0.93,
implying that the model identified 93% of all actual strong sequences correctly. The F1-score, which balances precision and recall, was 0.97, reflecting the strong overall performance for this class.
For the "weak" class, the model performed slightly worse but still very well, with a precision of 0.91, meaning that 91% of sequences predicted as weak were accurate. The recall was 1.00, indicating that the model successfully identified all weak sequences without missing any.
The corresponding F1-score for the weak class was 0.95, which reflects the strong balance between precision and recall.
The model handled both classes (strong and weak) efficiently, maintaining balanced performance as shown by the macro average F1-score of 0.96. This average ensures that each class contributes equally to the final score, and it demonstrates that the model performs consistently
well across the different classifications, regardless of class imbalance.
Overall, the SVM model's performance is impressive. With high precision and recall across both classes, particularly for the strong class, this model can be confidently used to predict the strength of new DNA sequences. Its high accuracy and robust metrics demonstrate its capacity
to generalize well from the training data to unseen sequences, making it a reliable tool for classification tasks in this domain.
Figure 9: Accuracy and Classification Report for Classification Task
Final Remarks
The model clearly performs very well when separating only two categories, but only moderately well when the intermediate category is added. Ideally, with more data, we would try to include more classes, or even replace classification with a model that predicts strength numerically.
With the existing data we could perhaps also try a Random Forest or a Logistic Regression model, but the literature suggested that an SVM was more likely to give better results.
Despite the fact that our model has room for improvement, it can still be used to predict the RBS strength class with great accuracy.
You can access our code through our GitLab Repository. We are open to contributions!
Conclusion
Comments on the Software
Our software, Sequence AI-nalyzer, demonstrates the potential of AI in synthetic biology, particularly in gene expression prediction. The integration of machine learning in promoter and RBS strength prediction accelerates traditional experimental approaches, making genetic engineering more efficient. However, the software is not without its limitations. The most significant challenge is the limited availability of high-quality, diverse training datasets, especially for high-strength promoters and medium-strength RBS sequences. This data scarcity impacts model accuracy and generalizability, particularly in the multiclass classification tasks for RBS strength. Another limitation is the reliance on computational resources, as some model architectures, such as deep neural networks, demand significant GPU power. Despite these challenges, the software still provides useful predictions and serves as a powerful starting point for more refined genetic analyses.
Future Work
Moving forward, several improvements can be made to the Sequence AI-nalyzer. First, expanding the training datasets by sourcing new promoter and RBS sequences would help enhance model performance, particularly for underrepresented categories. Additionally, we plan to integrate additional features into our RBS analysis. Specifically, we want to include data from the gene downstream of the RBS, as any sequence similarity between the RBS and the gene may reduce expression efficiency by promoting ribosome binding to the gene instead of the intended RBS sequence. Moreover, we aim to incorporate organism-specific information (e.g., E. coli) to evaluate whether restriction endonucleases in the host organism might cleave the RBS, thereby affecting expression.
Another planned advancement is to create a unified or combinatorial "strength value" that considers both the RBS and promoter together. This would allow for a more holistic evaluation of gene expression potential by integrating the contributions of both regulatory elements. Additionally, we plan to experiment with integrating machine learning architectures, such as Long Short-Term Memory (LSTM) networks and attention mechanisms, to capture sequence-based features more effectively.
Overall, our model can be a valuable asset to future iGEM teams. It offers flexibility for teams to use as is or to further develop it according to their project needs. By enhancing it, they can contribute to its evolution and extend its functionality. Additionally, Sequence AI-nalyzer has the potential to be incorporated into various synthetic biology projects as a supplementary tool, assisting in the design and optimization of genetic parts. These improvements would ensure the software evolves alongside developments in synthetic biology and machine learning.
Real-life Applications
The practical implications of Sequence AI-nalyzer in synthetic biology are vast. Researchers working on gene expression optimization in prokaryotic systems can use the software to predict promoter and RBS strengths before initiating wet-lab experiments. This saves time and resources while minimizing trial-and-error processes. Furthermore, Sequence AI-nalyzer can be a valuable tool in metabolic engineering, where fine-tuning gene expression levels is crucial to maximizing production yields of desired biomolecules. With its capacity to handle large datasets, the software can also be used to screen synthetic biology parts at scale, streamlining the process of genetic circuit design for various biotechnological applications.
References
iGEM Tsinghua 2023 Wiki - Software page https://2023.igem.wiki/tsinghua/software
Thomason M. K., Bischler T., Eisenbart S. K., Förstner K. U., Zhang A., Herbig A., Nieselt K., Sharma C. M., & Storz G. Global transcriptional start site mapping using differential RNA sequencing reveals novel antisense RNAs in Escherichia coli. Journal of bacteriology. 2015 https://doi.org/10.1128/JB.02096-14
Jamell Alvah Samuels. One-Hot Encoding and Two-Hot Encoding: An Introduction. 2024 http://dx.doi.org/10.13140/RG.2.2.21459.76327
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016 https://ieeexplore.ieee.org/document/7780459
iGEM Athens 2022 Wiki - Software page. 2022. https://2022.igem.wiki/athens/model
Gönen M., Alpaydın E. Multiple kernel learning algorithms. Journal of Machine Learning Research. 2011;12:2211-2268. http://jmlr.org/papers/volume12/gonen11a/gonen11a.pdf
Vert J. P., Tsuda K., Schölkopf B. A primer on kernel methods. Kernel Methods in Computational Biology. 2004.
Mangkunegara I. S., Purwono P. Analysis of DNA sequence classification using SVM model with hyperparameter tuning grid search CV. 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom). Malang, Indonesia. 2022:427-432. https://doi.org/10.1109/CyberneticsCom55287.2022.9865624
Seo T. K. Classification of nucleotide sequences using support vector machines. Journal of Molecular Evolution. 2010;71(4):250-267. https://doi.org/10.1007/s00239-010-9380-9
Hsu C. W., Chang C. C., Lin C. J. A practical guide to support vector classification. Department of Computer Science, National Taiwan University. 2016.