Promoter strength prediction is a valuable topic in synthetic biology. To screen promoters with a required strength, the traditional approach is to perform a large number of wet-lab experiments, which is extremely time-consuming. In the course of our own project, we encountered difficulties in obtaining promoters with strengths that met our needs. We therefore wanted to develop a computational method that could quickly estimate promoter strength without extensive experiments. With artificial intelligence technologies reshaping the overall landscape of biological research, we believe that machine learning techniques are well suited to this purpose.
The result is Deepro, a promoter strength prediction toolkit we developed based on multiple machine learning models. All script files, data files, and parameter files related to the Deepro project can be found at https://gitlab.igem.org/2024/software-tools/pekinghsc, and every model mentioned below is available in that repository.
Deepro is made up of a series of project files stored in our repository. With it, you can do all of the following:
Typically, the performance of a machine learning model is directly related to the quality and size of its dataset, so obtaining high-quality data is the first step to success. During the construction of the dataset for the Deepro project, we encountered the following difficulties:
In the project, we mainly used the following datasets:
The first step in building a model from scratch is to decide how the computer should read the promoter sequence information, that is, to choose an appropriate encoding method that produces inputs suitable for machine learning. In total, we tried three different encoding approaches in the Deepro project:
The first strategy we tried was to encode each base directly as a single number. Intuitively, this approach is largely irrational because it conveys information to the model that is not there: there is no inherent ordering among the bases, yet this encoding implies such an illusory size relationship (Figure 5).
We then tried an encoding approach based on one-hot vectors, which represents each base as a vector with exactly one bit set to 1 and the rest set to 0. In terms of model performance, this encoding is effective but not optimal (Figure 6).
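A compact way to build this representation (a sketch; the function name is ours for illustration) is to index into a 4×4 identity matrix, giving an (L, 4) matrix per sequence with no artificial ordering between bases:

```python
import numpy as np

BASES = "ACGT"

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as an (L, 4) one-hot matrix."""
    idx = [BASES.index(b) for b in seq.upper()]
    return np.eye(4, dtype=np.float32)[idx]

mat = one_hot_encode("ACGT")
print(mat.shape)  # (4, 4)
```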
Embedding vectors are a better choice: each base is encoded as a continuous vector of fixed length, and compared with one-hot vectors these vectors carry richer semantic information and higher feature dimensions. This is the encoding we eventually adopted (one-hot encoding is still used in the classical machine learning models below) (Figure 7).
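Conceptually, an embedding is just a trainable lookup table with one row per base. The sketch below uses a fixed random NumPy table for clarity (the dimension 8 is illustrative, not Deepro's actual setting); in a deep learning model this table would be a learned layer such as `torch.nn.Embedding(4, EMBED_DIM)` trained jointly with the rest of the network:

```python
import numpy as np

EMBED_DIM = 8  # illustrative dimension, not the value used in Deepro
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}

# A lookup table: one dense row vector per base. In practice this
# table is a trainable parameter, learned during model training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(4, EMBED_DIM)).astype(np.float32)

def embed(seq: str) -> np.ndarray:
    """Look up a dense vector for each base -> (L, EMBED_DIM) matrix."""
    return embedding_table[[BASE_TO_INT[b] for b in seq.upper()]]

print(embed("TTGACA").shape)  # (6, 8)
```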
The next step was to determine how to divide the dataset and what training strategy to use. Because of the small size of the data, we decided to train the models using k-fold cross-validation, with log-RMSE and MAE as metrics.
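The evaluation loop can be sketched as follows. This is a self-contained illustration with synthetic placeholder data and a trivial mean-predictor baseline standing in for the real models; the metric definitions (log-RMSE on strengths, plain MAE) are the point:

```python
import numpy as np
from sklearn.model_selection import KFold

def log_rmse(y_true, y_pred, eps=1e-8):
    """RMSE computed on log-transformed promoter strengths."""
    return np.sqrt(np.mean((np.log(y_true + eps) - np.log(y_pred + eps)) ** 2))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # placeholder features
y = rng.uniform(0.1, 10.0, size=100)  # placeholder strengths

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    # Baseline "model": predict the mean strength of the training fold.
    pred = np.full(len(val_idx), y[train_idx].mean())
    scores.append((log_rmse(y[val_idx], pred), mae(y[val_idx], pred)))

print(np.mean(scores, axis=0))  # averaged (log-RMSE, MAE) over the 5 folds
```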
After deciding on the encoding method and the training strategy, we first constructed an MLP (multilayer perceptron) model as a preliminary attempt. The MLP is one of the most classical neural network architectures, but it is not good at processing sequential data, because it learns separate weight and bias parameters for every position in the sequence. On one hand, this means the number of parameters is huge; on the other hand, it means the model is not equivariant: if the relative positions of the inputs change, the MLP cannot capture the new distribution. In terms of performance, the MLP indeed lacked the ability to capture sequence features, but this was just the beginning of our attempts.
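A minimal PyTorch sketch makes the parameter-count problem visible (layer sizes here are illustrative, not Deepro's configuration): the flatten step ties every weight to a fixed position, so the first linear layer alone has `SEQ_LEN * EMBED_DIM * 128` weights.

```python
import torch
import torch.nn as nn

SEQ_LEN, EMBED_DIM = 50, 8  # illustrative sizes, not Deepro's exact config

# Flattening ties each weight to one absolute position, so the
# parameter count grows with sequence length and the model is not
# equivariant to shifts of the input.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(SEQ_LEN * EMBED_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, 1),  # scalar strength prediction
)

x = torch.randn(4, SEQ_LEN, EMBED_DIM)  # a batch of 4 embedded sequences
print(mlp(x).shape)  # torch.Size([4, 1])
```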
Our next attempt was a CNN (convolutional neural network) model, which uses convolutional layers to capture sequence features with far fewer parameters than an MLP; in addition, it reads sequence information in order, which is more in line with human intuition. From the results, the MAE of the CNN was 25% lower than that of the MLP.
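The key difference from the MLP is weight sharing: a 1-D convolution slides the same small filter along the sequence, so the parameter count is independent of sequence length. A minimal sketch (channel counts and kernel size are illustrative):

```python
import torch
import torch.nn as nn

EMBED_DIM = 8  # illustrative embedding dimension

class PromoterCNN(nn.Module):
    """1-D CNN sketch: filter weights are shared across positions,
    so the parameter count does not grow with sequence length."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(EMBED_DIM, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the length dimension
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x):         # x: (batch, seq_len, EMBED_DIM)
        x = x.transpose(1, 2)     # Conv1d expects (batch, channels, length)
        return self.head(self.conv(x).squeeze(-1))

model = PromoterCNN()
out = model(torch.randn(4, 50, EMBED_DIM))
print(out.shape)  # torch.Size([4, 1])
```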
LSTM (Long Short-Term Memory) is a model specialized for processing sequence information. Its design allows it to cope with the long-range correlations present in sequences and to flexibly handle inputs of different lengths, without the padding operations required when using other models, and its parameter sharing keeps the model relatively small. Surprisingly, however, the LSTM did not perform as we expected in our attempts.
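A sketch of this design (hidden size is illustrative): the same cell parameters are reused at every time step, and because the recurrence simply runs for as many steps as the input has bases, sequences of different lengths pass through the same model without padding.

```python
import torch
import torch.nn as nn

EMBED_DIM = 8  # illustrative embedding dimension

class PromoterLSTM(nn.Module):
    """LSTM sketch: one shared cell is applied at every step, and the
    final hidden state summarizes the whole sequence."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(EMBED_DIM, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):                 # x: (batch, seq_len, EMBED_DIM)
        _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, 64)
        return self.head(h_n.squeeze(0))  # last hidden state -> strength

model = PromoterLSTM()
# Two different sequence lengths, no padding required:
print(model(torch.randn(4, 50, EMBED_DIM)).shape)  # torch.Size([4, 1])
print(model(torch.randn(4, 35, EMBED_DIM)).shape)  # torch.Size([4, 1])
```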
As can be seen, although we made extensive attempts with deep learning models, their performance did not meet our requirements well. After discussing the matter thoroughly with the faculty involved, we realized that our dataset may be too small for overly complex models to converge adequately, so we began to pin our hopes on simpler machine learning models with fewer parameters.
The first model we tried was XGBoost, a powerful model widely used in classification and regression tasks. XGBoost is a gradient boosting decision tree model that improves itself by iteratively adding trees, each one trying to correct the prediction errors of the previous ones. It introduces a number of improvements over the classic gradient boosting decision tree, such as adding regularization terms to the loss function to prevent overfitting, and its superior performance in promoter strength prediction has already been demonstrated in the literature.3 We can easily build the model by calling the Python package xgboost. Initially we assumed that XGBoost excels at tabular rather than sequential data; however, on the test set, XGBoost in fact reached the SOTA level among all our models.
Subsequently, we constructed a series of machine learning models, including GBDT, Random Forest, AdaBoost, and SVR, using the classic machine learning library scikit-learn, and optimized their hyperparameters with grid search. Among these, GBDT performed relatively well, while AdaBoost performed extremely poorly, and none of them outperformed XGBoost.
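The grid-search procedure can be sketched with scikit-learn's `GridSearchCV`, shown here for the GBDT case on synthetic placeholder data; the grid below is deliberately tiny for illustration, whereas the grids we actually searched were larger:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # placeholder features
y = rng.uniform(0.1, 10.0, size=100)  # placeholder strengths

# Illustrative grid; every combination is scored by cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # matches our MAE metric
    cv=3,
)
search.fit(X, y)
print(search.best_params_)  # best combination found on this data
```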
Although, as of now, the classical machine learning models clearly outperform deep learning models such as the MLP and CNN in predictive performance, we suspect that more sophisticatedly designed models may better capture latent features in DNA sequences and achieve better predictive accuracy, so the Deepro story is not yet over.
The first such model we tried was the famous Transformer, a deep learning model based on the self-attention mechanism, which allows it to capture the global features of a sequence while avoiding the many iterations required by RNN-like models (e.g., LSTM). Its unique encoder-decoder structure makes it extremely good at sequence-to-sequence (Seq2Seq) problems such as machine translation. However, given the nature of our task, we do not need the model to generate sequences (e.g., new DNA sequences), so we use only the encoder part of the model (an encoder-only Transformer).
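An encoder-only regressor can be sketched directly with PyTorch's built-in encoder modules. Sizes below are illustrative, and positional encoding is omitted for brevity; self-attention lets every position attend to every other in a single pass, with no recurrence:

```python
import torch
import torch.nn as nn

EMBED_DIM = 32  # illustrative; must be divisible by the number of heads

class PromoterTransformer(nn.Module):
    """Encoder-only Transformer sketch: self-attention captures global
    sequence features; no decoder is needed for scalar regression."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, dim_feedforward=64,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(EMBED_DIM, 1)

    def forward(self, x):                # x: (batch, seq_len, EMBED_DIM)
        h = self.encoder(x)              # (batch, seq_len, EMBED_DIM)
        return self.head(h.mean(dim=1))  # pool over positions -> strength

model = PromoterTransformer()
print(model(torch.randn(4, 50, EMBED_DIM)).shape)  # torch.Size([4, 1])
```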
The parameter count and complexity of the Transformer are significantly higher than those of the previous models. On one hand, this makes it easier for the model to fit the distribution of the training set; on the other hand, it may generalize poorly given our limited training data. (We would also like to thank our Secondary PI, Mr. Zhengwei Xie, for providing us with the server resources to train the model.)
From the final results, the Transformer's performance is essentially on par with XGBoost. This does not mean the Transformer is redundant: we believe its performance would improve further with more data.
Finally, we additionally tried a joint CNN-LSTM model, because we found that such models are widely used in the literature for regression problems on DNA sequences. The results show that this model does improve somewhat over the CNN or LSTM alone, but at the cost of increased computational complexity and training time. We do not intend to optimize it further, as we believe the accuracy gain is not enough to compensate for this cost.
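The usual composition in such models is a convolutional front end that extracts local motifs, feeding an LSTM that models long-range dependencies over those motif features. A minimal sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

EMBED_DIM = 8  # illustrative embedding dimension

class CNNLSTM(nn.Module):
    """Joint sketch: Conv1d extracts local motif features, then an
    LSTM models long-range dependencies over those features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(EMBED_DIM, 32, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):  # x: (batch, seq_len, EMBED_DIM)
        h = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, 32, seq_len)
        _, (h_n, _) = self.lstm(h.transpose(1, 2))    # back to batch-first
        return self.head(h_n.squeeze(0))

model = CNNLSTM()
print(model(torch.randn(4, 50, EMBED_DIM)).shape)  # torch.Size([4, 1])
```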
We validated the model in two ways:
The first was to use the previously mentioned data published by the Berkeley team in 2006 as a test set. However, since our training data was measured against a different benchmark than the Berkeley dataset, predicting the absolute values of the relative strengths of their promoters was unrealistic, so we simplified our goal to predicting the order of magnitude of the relative strengths. Also, because the promoters in this dataset are short, we chose to re-train the XGBoost model on a truncated version of Dataset 3.0 (sequence length of 35 bp, starting at the TTGACA box). As expected, the model's performance on both the training and validation sets degraded considerably due to the reduced feature dimensions, and no amount of hyperparameter tuning brought it back to its level on the full-sequence version of the dataset.
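The truncation step can be sketched as a simple string operation (the helper name and its fallback behavior when the box is absent are our illustrative choices, not necessarily how the Deepro scripts handle it):

```python
def truncate_at_box(seq: str, box: str = "TTGACA", length: int = 35) -> str:
    """Truncate a promoter to `length` bp starting at the -35 (TTGACA) box.
    Returns the sequence unchanged if the box is not found."""
    i = seq.upper().find(box)
    return seq if i == -1 else seq[i:i + length]

promoter = "AAAA" + "TTGACA" + "G" * 40  # toy promoter with the box at offset 4
print(len(truncate_at_box(promoter)))   # 35
```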
Subsequent tests on the Berkeley dataset confirmed this: the model failed to capture the distribution of that dataset and performed poorly. We believe the reasons for the large difference in the model's performance between Dataset 3.0 (full length) and Dataset 4.0 include:
In addition, we validated model performance in another way. We used our model to predict the strength of the JWW promoter and compared the prediction with that of the arabinose promoter; the model concluded that the JWW promoter is much weaker than the arabinose promoter, a conclusion that was then further confirmed by wet-lab experiments.
More details can be found in the promoter verification part. In the Deepro project, we used 10 machine learning or deep learning methods to build promoter strength prediction models, and in the end XGBoost had the best overall performance on the available dataset. Taking the JWW promoter as the study object, we further validated the model with wet-lab experiments. However, on other datasets with significantly different data distributions, the model cannot achieve the desired performance.
For the current version of Deepro, we believe it is still far from perfect, with at least the following flaws and shortcomings:
In response to these issues, we have proposed some directions for improvement and feasible options:
1. Huang, Y.-K., C.-H. Yu, and I.S. Ng, Precise strength prediction of endogenous promoters from Escherichia coli and J-series promoters by artificial intelligence. Journal of the Taiwan Institute of Chemical Engineers, 2024. 160: p. 105211.
2. Yang, W., D. Li, and R. Huang, EVMP: enhancing machine learning models for synthetic promoter strength prediction by Extended Vision Mutant Priority framework. Front Microbiol, 2023. 14: p. 1215609.
3. Zhao, M., et al., Precise Prediction of Promoter Strength Based on a De Novo Synthetic Promoter Library Coupled with Machine Learning. ACS Synthetic Biology, 2022. 11(1): p. 92-102.
4. Zhou, Z., et al., DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. ArXiv, 2023. abs/2306.15006.