Machine Learning | PekingHSC

Background

Data Acquisition

Initial Attempt

Traditional Machine Learning Models

Transformer and so on

Tests of Model Validity

Conclusion & Looking Forward

References

Background

Promoter strength prediction is a valuable topic in synthetic biology. Screening for promoters of a required strength traditionally means performing a large number of wet-lab experiments, which is extremely time-consuming. In the course of our own project, we had difficulty obtaining promoters whose strengths met our needs. We therefore wanted to develop a computational method that could quickly estimate promoter strength without extensive experiments. Given that artificial intelligence is reshaping biological research, we believe machine learning is an appropriate tool for this purpose.

The result is Deepro, a promoter strength prediction toolkit that we developed based on multiple machine learning models. All script files, data files, and parameter files for the Deepro project are available at https://gitlab.igem.org/2024/software-tools/pekinghsc. All models mentioned below can be found in that repository.

About Deepro

Deepro consists of the project files stored in our repository. With it, you can:

Predict promoter strength by filling the .xlsx file with one or more promoter sequences.
Train a model from scratch using your own dataset and any of the model structures described below. The built-in plotting section of the scripts visualizes the training process, and Deepro makes it easy to save the trained model.
Use the preset tools in the scripts for data preprocessing, such as generating one-hot encoded DNA sequences for other downstream work or as input to your own models.
Use our collected dataset for your own modeling.

Data Acquisition

Typically, the performance of a machine learning model is directly related to the quality and size of its dataset, so obtaining high-quality data is the first step to success. During the construction of the dataset for the Deepro project, we encountered the following difficulties:

Most of the existing datasets are small in size.
Promoter strength is traditionally measured by the expression (fluorescence intensity) of GFP or another fluorescent protein. Different datasets therefore use different reference baselines, so their strength values do not share a common distribution and the datasets cannot simply be merged to obtain a larger one.
A single dataset typically contains strength information for promoters from one organism, or for artificial promoters derived from wild-type promoters. Models trained on a single dataset may therefore generalize poorly to promoters of other species (e.g., a model trained on bacterial promoters may fail to capture the strength distribution of viral promoters).

In the project, we mainly used the following datasets:

Dataset 1.0: From the work of Tianjin University two years ago (https://github.com/mesb-tju/models_tools). The dataset contains 592 promoter sequences, each 61 bp long, with relative strength information (Figure 1).

Figure 1. Dataset 1.0

Dataset 2.0: A dataset from the literature containing 936 promoter sequences, 47 to 50 bp long, with their relative strengths [1] (Figure 2).

Figure 2. Dataset 2.0

Dataset 3.0: Also from the literature, containing 3512 promoter sequences of length 78 bp with their strength information [2]. This is the largest dataset we used. In this dataset, missing bases are replaced with the character 'B' (Figure 3).

Figure 3. Dataset 3.0

Dataset 4.0: The dataset created by the 2006 Berkeley iGEM team (Part:BBa_J23100). It is small, containing only 20 promoter sequences of length 35 bp with their relative strengths, but it has been verified by rigorous wet-lab experiments, so we wanted to use it as the final test set for the model (Figure 4).

Figure 4. Dataset 4.0

Initial Attempt

The first step in building a model from scratch is to decide how the computer should read promoter sequences, that is, to choose an encoding that produces inputs suitable for machine learning. In total, we tried three encoding approaches in the Deepro project:

The first strategy was to encode each base directly as a single number. Intuitively, this approach is flawed because it conveys information to the model that does not exist: the bases have no inherent numerical order, yet this encoding implies one (Figure 5).

We then tried one-hot encoding, which represents each base as a vector in which exactly one element is 1 and the rest are 0. In terms of model performance, this encoding is effective but not optimal (Figure 6).

Embedding vectors are a better choice. They encode each base as a continuous (dense) vector of fixed length, which carries richer semantic information and has a higher feature dimension than a one-hot vector. This is the encoding we eventually adopted, although one-hot encoding is still used in the classical machine learning models below (Figure 7).

Figure 5. Single-number encoding

Figure 6. One-hot vector encoding

Figure 7. Embedding vector encoding
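The three encodings described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the Deepro repository: the function names, the base order A, C, G, T, and the embedding size of 8 are all assumptions, and the embedding table is filled with random numbers where a real model would learn the values during training. The placeholder character 'B' from Dataset 3.0 is not handled here.

```python
import random

BASES = "ACGT"  # assumed base order; any fixed order works
BASE_TO_INDEX = {base: i for i, base in enumerate(BASES)}

def integer_encode(seq):
    """Map each base to one integer (implies a spurious order A < C < G < T)."""
    return [BASE_TO_INDEX[b] for b in seq.upper()]

def one_hot_encode(seq):
    """Map each base to a length-4 vector containing exactly one 1."""
    vectors = []
    for idx in integer_encode(seq):
        row = [0] * len(BASES)
        row[idx] = 1
        vectors.append(row)
    return vectors

def make_embedding_table(dim, seed=0):
    """Create one dense vector per base.
    In a real model these values are trainable; here they are random."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in BASES]

def embedding_encode(seq, table):
    """Look up the dense vector of each base in the table."""
    return [table[idx] for idx in integer_encode(seq)]

seq = "TTGACA"
ints = integer_encode(seq)
onehot = one_hot_encode(seq)
table = make_embedding_table(dim=8)
embedded = embedding_encode(seq, table)
```

Note that the embedding lookup is equivalent to multiplying the one-hot vector by the table, so the embedding can be viewed as a learned linear layer on top of the one-hot encoding.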

Next, we had to decide how to split the dataset and which training strategy to use. Because the datasets are small, we trained the models with k-fold cross-validation, using log-RMSE and MAE as metrics.

With the encoding and training strategy settled, we first built an MLP as a preliminary attempt. The MLP (Multilayer Perceptron) is one of the most classical neural network structures, but it is not well suited to sequential data because it has separate weights and biases for every sequence position. On the one hand, this makes the number of parameters large; on the other hand, the model is not translation-equivariant: if a feature shifts to a different position in the input, the MLP cannot recognize it with what it learned at the original position. In terms of performance, the MLP lacked the ability to capture sequence features. But this was only the beginning of our attempts.
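As a rough illustration of this evaluation scheme, the sketch below splits sample indices into k folds and computes both metrics for each fold. It makes several assumptions that are not taken from the Deepro scripts: log-RMSE is taken to mean the RMSE between log-transformed strengths, the helper names are hypothetical, the strength values are made up, and the "model" is a trivial baseline that predicts the training-set mean.

```python
import math
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices once and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def log_rmse(y_true, y_pred):
    """RMSE between log-transformed values (assumed definition; inputs must be positive)."""
    squared = [(math.log(t) - math.log(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def cross_validate(y, k, fit_predict):
    """Run k-fold CV; fit_predict(train_idx, val_idx) returns predictions for val_idx."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        preds = fit_predict(train_idx, val_idx)
        truth = [y[j] for j in val_idx]
        scores.append((log_rmse(truth, preds), mae(truth, preds)))
    return scores

# Made-up relative strengths, for illustration only.
strengths = [0.2, 0.5, 1.0, 1.5, 2.0, 0.8, 1.2, 0.4, 0.9, 1.1]

def mean_baseline(train_idx, val_idx):
    """Trivial stand-in for a model: always predict the training-set mean."""
    mean = sum(strengths[j] for j in train_idx) / len(train_idx)
    return [mean] * len(val_idx)

cv_scores = cross_validate(strengths, k=5, fit_predict=mean_baseline)
```

Each sample is used for validation exactly once, which makes better use of a small dataset than a single fixed train/test split.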

Figure 8. Structure of MLP

Figure 9. Train set performance of MLP

Figure 10. Test set performance of MLP

Our next attempt was a CNN (Convolutional Neural Network), which uses convolutional layers to capture sequence features with fewer parameters than an MLP. In addition, it scans the sequence position by position, which is more in line with human intuition. In our results, the MAE of the CNN was 25% lower than that of the MLP.

Figure 11. Structure of CNN

Figure 12. Train set performance of CNN

Figure 13. Test set performance of CNN
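To make the difference between the MLP and the CNN concrete, the sketch below slides a single convolutional filter over a one-hot encoded sequence and compares the parameter counts of a dense layer and a convolutional layer. It is only an illustration: the hidden size of 64, the kernel size of 5, and the hand-built "TTG" filter are assumptions, not the settings used in Deepro.

```python
def one_hot(seq, alphabet="ACGT"):
    """One-hot encode a DNA string as a list of length-4 float vectors."""
    return [[1.0 if base == a else 0.0 for a in alphabet] for base in seq]

def conv1d_single_filter(x, kernel, bias=0.0):
    """Slide one filter over the sequence (no padding, stride 1).
    x: L vectors of size C; kernel: k vectors of size C; returns L - k + 1 values."""
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * v for w, v in zip(kernel[offset], x[start + offset]))
        out.append(total)
    return out

def dense_layer_params(seq_len, channels, hidden):
    """A dense layer has one weight per (position, channel, hidden unit) plus biases."""
    return seq_len * channels * hidden + hidden

def conv_layer_params(kernel_size, channels, filters):
    """A conv layer reuses the same kernel weights at every position."""
    return kernel_size * channels * filters + filters

# A hand-built filter that responds most strongly to the motif "TTG".
motif_kernel = one_hot("TTG")
activations = conv1d_single_filter(one_hot("ACTTGACA"), motif_kernel)
shifted = conv1d_single_filter(one_hot("TTGACAAC"), motif_kernel)

# 78 bp input as in Dataset 3.0; 64 units/filters and kernel size 5 are assumed.
mlp_params = dense_layer_params(78, 4, 64)
cnn_params = conv_layer_params(5, 4, 64)
```

The same filter finds the motif at position 2 in the first sequence and at position 0 in the shifted one, which is the equivariance that the MLP lacks, and under these assumed sizes the convolutional layer needs 1344 parameters against 20032 for the dense layer.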

LSTM (Long Short-Term Memory) is a model designed specifically for sequence data. Its design lets it handle long-range correlations within sequences and accept inputs of different lengths without the padding that other models require, and its parameter sharing keeps the model relatively small. Surprisingly, however, the LSTM did not perform as well as we expected.

Figure 14. Structure of LSTM
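The two properties mentioned above, handling inputs of any length and sharing parameters across positions, can be seen in a minimal LSTM layer written in plain Python. This is a sketch with random, untrained weights; the class name, the hidden size of 8, and the example sequences are assumptions and do not reflect the Deepro implementation.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_hot(seq, alphabet="ACGT"):
    return [[1.0 if base == a else 0.0 for a in alphabet] for base in seq]

class TinyLSTM:
    """A single LSTM layer with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        cols = input_size + hidden_size
        # One weight matrix and bias vector per gate:
        # input (i), forget (f), output (o) and candidate (g).
        self.weights = {g: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                            for _ in range(hidden_size)] for g in "ifog"}
        self.biases = {g: [0.0] * hidden_size for g in "ifog"}

    def _gate(self, name, xh, activation):
        return [activation(sum(w * v for w, v in zip(row, xh)) + b)
                for row, b in zip(self.weights[name], self.biases[name])]

    def forward(self, sequence):
        """Process input vectors one position at a time; return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            xh = list(x) + h  # current input concatenated with previous hidden state
            i = self._gate("i", xh, sigmoid)
            f = self._gate("f", xh, sigmoid)
            o = self._gate("o", xh, sigmoid)
            g = self._gate("g", xh, math.tanh)
            c = [f_t * c_t + i_t * g_t for f_t, c_t, i_t, g_t in zip(f, c, i, g)]
            h = [o_t * math.tanh(c_t) for o_t, c_t in zip(o, c)]
        return h

    def parameter_count(self):
        n_weights = sum(len(row) for m in self.weights.values() for row in m)
        n_biases = sum(len(b) for b in self.biases.values())
        return n_weights + n_biases

lstm = TinyLSTM(input_size=4, hidden_size=8)
h_short = lstm.forward(one_hot("TTGACA" + "ACGT" * 7 + "A"))  # 35 bases
h_long = lstm.forward(one_hot("ACGT" * 19 + "AC"))            # 78 bases
```

The same 416 parameters process both the 35-base and the 78-base input, and both produce a hidden state of size 8 that a regression head could consume.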

Traditional Machine Learning Models

As can be seen, despite extensive attempts with deep learning models, their performance did not meet our requirements. After discussing this in depth with the faculty involved, we realized that our dataset may be too small for overly complex models to converge adequately, so we turned to simpler machine learning models with fewer parameters.

The first model we tried was XGBoost, a powerful model widely used in classification and regression tasks. XGBoost is a gradient boosting decision tree model that improves its predictions by iteratively adding trees, each of which tries to correct the prediction error of the previous ones. It introduces a number of improvements over classic gradient boosting decision trees, such as adding regularization terms to the loss function to prevent overfitting. Its strong performance in promoter strength prediction has been demonstrated in the literature [3]. The model is easy to build with the Python module xgboost. We initially thought that XGBoost excelled at tabular data rather than sequential data; however, in our tests XGBoost achieved the best test-set performance of all our models.

Figure 15. Structure of XGBoost

Figure 16. Train set performance of XGBoost

Figure 17. Test set performance of XGBoost
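The core idea of boosting, where each new tree corrects the remaining error of the ensemble, can be illustrated without any library. The sketch below is plain gradient boosting on squared error with depth-1 trees (stumps) over flattened one-hot features. It is deliberately not XGBoost: it has no regularization term, no second-order gradients, and no deeper trees. The sequences and strengths are made up, and all function names are hypothetical.

```python
def flat_one_hot(seq, alphabet="ACGT"):
    """Flatten the one-hot encoding into one binary feature per (position, base)."""
    return [1 if base == a else 0 for base in seq for a in alphabet]

def fit_stump(X, residuals):
    """Pick the binary feature whose split best reduces squared error."""
    best = None
    for j in range(len(X[0])):
        left = [r for x, r in zip(X, residuals) if x[j] == 0]
        right = [r for x, r in zip(X, residuals) if x[j] == 1]
        if not left or not right:
            continue  # feature is constant, so it cannot split the data
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, j, left_mean, right_mean)
    if best is None:
        return (0, 0.0, 0.0)  # no usable feature: contribute nothing
    return best[1:]  # (feature index, value if feature is 0, value if feature is 1)

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        right if x[j] == 1 else left for j, left, right in stumps)

def fit_boosting(X, y, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    for _ in range(n_rounds):
        model = (base, learning_rate, stumps)
        residuals = [t - predict(model, x) for x, t in zip(X, y)]
        stumps.append(fit_stump(X, residuals))
    return (base, learning_rate, stumps)

def mse(model, X, y):
    return sum((t - predict(model, x)) ** 2 for x, t in zip(X, y)) / len(y)

# Made-up toy data, for illustration only.
seqs = ["TTGA", "TTGC", "ATGA", "ACGA", "TCGT", "AAGC"]
y = [2.0, 1.8, 1.0, 0.4, 1.1, 0.5]
X = [flat_one_hot(s) for s in seqs]

baseline = (sum(y) / len(y), 0.3, [])  # predicts the mean, no trees
model = fit_boosting(X, y)
mse_baseline = mse(baseline, X, y)
mse_boosted = mse(model, X, y)
```

Because each stump is fitted to the residuals of the current ensemble, the training error shrinks round by round. This also shows why tree models cope with one-hot encoded sequences: every (position, base) pair simply becomes one tabular feature.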

Subsequently, we built a series of machine learning models, including GBDT, Random Forest, AdaBoost, and SVR, using the classic machine learning library scikit-learn, and optimized their hyperparameters with grid search. Among them, GBDT performed relatively well, whereas AdaBoost performed extremely poorly. None of these models outperformed XGBoost.

Figure 18. Test set performance of GBDT

Figure 19. Test set performance of Random Forest

Figure 20. Test set performance of AdaBoost

Figure 21. Test set performance of SVR

Transformer and so on

Although classical machine learning models had so far shown significantly better predictive performance than deep learning models such as the MLP and CNN, we suspected that more sophisticated architectures might capture latent features in DNA sequences better and thus achieve higher predictive accuracy. The Deepro story was therefore not yet over.

The first model we tried was the well-known Transformer, a deep learning model based on the self-attention mechanism, which allows it to capture global features of a sequence while avoiding the step-by-step recurrence required by RNN-like models (e.g., LSTM). Its encoder-decoder structure performs extremely well on sequence-to-sequence (Seq2Seq) problems such as machine translation. However, our task does not require the model to generate sequences (e.g., new DNA sequences), so we used only the encoder part of the model (an encoder-only Transformer).

The Transformer has significantly more parameters and higher complexity than the previous models. On the one hand, this allows it to fit the training data distribution more easily; on the other hand, it may generalize poorly because of the small training set. (We would also like to thank our Secondary PI, Mr. Zhengwei Xie, for providing the server resources used to train the model.)

In the final results, the Transformer performed essentially on par with XGBoost. This does not mean that the Transformer is redundant: we expect its performance to improve further with more data.
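The self-attention step at the heart of the encoder can be sketched in plain Python as follows. This is a single-head, untrained illustration under several assumptions: the model width of 8 is arbitrary, positional encodings, multiple heads, residual connections and feed-forward layers are omitted, and mean pooling is just one plausible way to turn the per-position outputs into a fixed-size vector for regression (we do not claim it is the one used in Deepro).

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.
    x: L x d input; returns (L x d output, L x L attention weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def mean_pool(rows):
    """Average over positions to obtain one fixed-size vector per sequence."""
    return [sum(col) / len(rows) for col in zip(*rows)]

rng = random.Random(0)
d_model = 8                          # assumed embedding width
x = random_matrix(6, d_model, rng)   # stand-in for 6 embedded bases
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
attended, attn = self_attention(x, w_q, w_k, w_v)
pooled = mean_pool(attended)
```

Every row of `attn` sums to 1 and spans all positions, so each output vector mixes information from the whole sequence in a single step rather than through repeated recurrence.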

Figure 22. Structure of Transformer

Figure 23. Train set performance of Transformer

Figure 24. Test set performance of Transformer

Finally, we tried a joint CNN-LSTM model, because we found that such models are widely used in the literature for regression problems on DNA sequences. The results show that this model does improve somewhat on the CNN or LSTM alone, but at the cost of higher computational complexity and longer training time. We do not plan to optimize it further, as we believe the accuracy gain does not justify this cost.

Tests of Model Validity

We validated the model in two ways:

The first was to use the previously mentioned data published by the Berkeley team in 2006 as a test set. However, because our training data use a different reference benchmark from the Berkeley dataset, predicting the absolute values of the relative strengths in their dataset was unrealistic. We therefore simplified our goal to predicting the order of magnitude of the relative strength of each promoter in their dataset. Because the promoters in this dataset are short, we also retrained the XGBoost model on a truncated version of Dataset 3.0 (sequence length 35 bp, starting at the TTGACA box). As expected, the model's performance on both the training and validation sets degraded considerably because of the reduced feature dimension. No matter how much hyperparameter tuning we did, the model never recovered the performance it achieved on the full-length version of the dataset.
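The two preprocessing ideas in this test, truncating sequences from the TTGACA box and comparing strengths by order of magnitude, might look like the sketch below. The function names are hypothetical, the example sequence is made up, and defining the order of magnitude as the floor of the base-10 logarithm is our assumption about how such a comparison could be done, not a description of the Deepro scripts.

```python
import math

def truncate_from_box(seq, box="TTGACA", length=35):
    """Return `length` bases starting at the first occurrence of `box`,
    or None if the box is absent or too few bases follow it."""
    start = seq.upper().find(box)
    if start == -1 or start + length > len(seq):
        return None
    return seq[start:start + length]

def order_of_magnitude(strength):
    """Power of ten of a positive strength value, e.g. 0.03 -> -2, 250 -> 2."""
    return math.floor(math.log10(strength))

def same_order(predicted, measured):
    """True if prediction and measurement share the same power of ten."""
    return order_of_magnitude(predicted) == order_of_magnitude(measured)

# Made-up sequence: 10 leading bases, the box, then 40 more bases.
full_seq = "GC" * 5 + "TTGACA" + "ACGT" * 10
truncated = truncate_from_box(full_seq)
```

Sequences without the box, or with fewer than 35 bases from the box onward, return None and would have to be dropped or handled separately.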

Figure 25. Train set performance of truncated XGBoost

Figure 26. Test set performance of truncated XGBoost

Subsequent tests on the Berkeley dataset confirmed this: the model failed to capture the distribution of that dataset and performed poorly. We believe the reasons for the large difference in performance between Dataset 3.0 (full length) and Dataset 4.0 include:

Sequence length has a large impact on model performance: longer sequences give the model more information to capture.
The model's generalization ability is insufficient, because the sequence data in Dataset 3.0 do not reflect the true promoter strength distribution.
The promoter sequences in this dataset are too similar to one another, which makes the range sampled by the model very limited.

We also validated the model in a second way. We used it to predict the strength of the JWW promoter and compared the prediction with that for the arabinose promoter. The model predicted that the JWW promoter is much weaker than the arabinose promoter, a conclusion that was subsequently confirmed in wet-lab experiments.

More details can be found in the promoter verification part.

Conclusion & Looking Forward

In the Deepro project, we tried 10 machine learning and deep learning methods to build a promoter strength prediction model; in the end, XGBoost had the best overall performance on the available datasets. Taking the JWW promoter as the study object, we further validated the model through wet-lab experiments. However, on other datasets with significantly different data distributions, the model does not deliver the desired performance.

We believe the current version of Deepro is far from perfect, with at least the following shortcomings:

Because data distributions differ between datasets, it is difficult to obtain sizable training data through extensive literature review and direct merging, which greatly limits the generalization ability of the model.
The quality of existing data from the literature or other publicly available datasets is difficult to ensure.
The design of the model does not take the promoter homology region into account. If the influence of non-homologous sequences on the output were increased appropriately, model performance should improve.
We originally planned to use a pre-trained DNA language model, DNABERT-2 [4], as the sequence-embedding part of our model, which would allow it to represent promoter sequences more effectively. This was not implemented in the end because we lacked the computational resources and time for fine-tuning.
For time reasons, we did not create a GUI or web app (although the scripts stored at https://gitlab.igem.org/2024/software-tools/pekinghsc are fully available), so our models may be inconvenient for researchers who are not familiar with Python.

In response to these issues, we propose the following directions for improvement:

First, conduct extensive wet-lab experiments, both to validate model performance and check the data quality of the training set, and to generate more usable training data.
Second, the three datasets we have found so far cannot simply be merged. We would like to find a way to bring their labels onto a common distribution so that they can be merged, or to introduce semi-supervised learning to make full use of all the data.
In addition, a GUI should be developed to make the tool easier to use.
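As one possible starting point for the dataset-merging direction, the sketch below standardizes the log-transformed strengths within each dataset before pooling them. This is a common generic technique and not something we have implemented or validated in Deepro: it assumes strengths are positive and that each dataset covers a comparable range of true promoter strengths, which may not hold. The dataset names, sequences and values are made up.

```python
import math

def standardize(values):
    """Shift and scale values to zero mean and unit variance.
    Assumes the values are not all identical."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

def merge_datasets(datasets):
    """datasets: dict mapping a name to a list of (sequence, strength) pairs.
    Returns a list of (sequence, standardized log strength, source name)."""
    merged = []
    for name, records in datasets.items():
        log_strengths = [math.log(strength) for _, strength in records]
        for (seq, _), z in zip(records, standardize(log_strengths)):
            merged.append((seq, z, name))
    return merged

# Made-up data on two very different scales, for illustration only.
datasets = {
    "dataset_a": [("TTGACA", 0.2), ("TTGACT", 0.8), ("TTGCCA", 1.6)],
    "dataset_b": [("TTTACA", 120.0), ("TAGACA", 900.0),
                  ("TTGACG", 45.0), ("CTGACA", 300.0)],
}
merged = merge_datasets(datasets)
```

After this step, labels from both sources have mean 0 and variance 1, so a model sees them on a shared scale. The cost is that the label now only expresses strength relative to the other promoters in the same source dataset.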

References

1. Huang, Y.-K., C.-H. Yu, and I.S. Ng, Precise strength prediction of endogenous promoters from Escherichia coli and J-series promoters by artificial intelligence. Journal of the Taiwan Institute of Chemical Engineers, 2024. 160: p. 105211.

2. Yang, W., D. Li, and R. Huang, EVMP: enhancing machine learning models for synthetic promoter strength prediction by Extended Vision Mutant Priority framework. Frontiers in Microbiology, 2023. 14: p. 1215609.

3. Zhao, M., et al., Precise Prediction of Promoter Strength Based on a De Novo Synthetic Promoter Library Coupled with Machine Learning. ACS Synthetic Biology, 2022. 11(1): p. 92-102.

4. Zhou, Z., et al., DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome. arXiv, 2023. arXiv:2306.15006.