Modeling
Overview
Machine learning (ML) has become an essential tool in modern biology, providing powerful methods for analyzing large and complex datasets. In fields like genomics, transcriptomics, and synthetic biology, ML algorithms enable researchers to uncover hidden patterns, make predictions, and optimize biological systems. Specifically, in synthetic biology, ML can help design regulatory elements like promoters or predict the outcomes of genetic modifications based on experimental data.
Among the commonly used ML techniques, Random Forests (RF) and Support Vector Machines (SVM) play prominent roles. RF is an ensemble learning method that builds multiple decision trees and aggregates their results to improve accuracy and reduce overfitting. It is particularly useful for handling high-dimensional data with complex interactions. SVM, on the other hand, is a powerful classifier that finds the optimal hyperplane separating different classes; it can be extended to non-linear problems using kernel functions such as the radial basis function (RBF) kernel, while a linear kernel recovers the standard linear classifier.
In this study, we employed ML techniques to analyze the relationship between promoter sequences and their responses using our experimental dataset of 88 samples. Our initial objective was to establish a regression model to predict fluorescence intensity from sequence features. We implemented a feature-extraction strategy based on k-mers, constructing datasets from several k-mer lengths. However, we found that using 4-mers and 6-mers as features did not yield sufficient predictive power.
In response, we shifted our approach by changing the prediction target: we classified promoters into three categories based on fluorescence intensity (high, medium, and low). Using the k-mer feature extraction method for this classification task, we still had limited success. When we instead employed one-hot encoding of the promoter sequences, we achieved notable results, with our LogitBoost model reaching a classification accuracy of 0.96 and an AUC greater than 0.9. This indicates that our refined approach significantly improved the model's ability to predict promoter activity from sequence information, paving the way for more effective designs in synthetic biology applications.
Regression Model Task
We established a dataset based on experimental data and sequencing results. The promoter sequences in this dataset were derived from the first 51 bp upstream of the transcription start site (TSS) of E. coli, as identified in the study referenced in [1] through RNA-seq. After randomly inserting GGA, we selected 96 single colonies from plates, and sequencing revealed 88 unique sequences, with corresponding fluorescence data provided in the appendix.
Initially, we used k-mer methods for modeling, extracting both 4-mers and 6-mers as input features. We allocated 30% of the dataset as a test set and used the remaining 70% for training, repeating the random split five times to improve the reliability of the results. After training four different models, we found that the R² values for both 4-mers and 6-mers did not exceed 0.2, indicating a poor fit [2].
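The k-mer counting step described above can be sketched as follows. This is a minimal illustration assuming NumPy is available; the function name `kmer_counts` is our own. For k = 4, each 51 bp promoter becomes a 256-dimensional count vector that can then be fed to a regressor:

```python
from itertools import product
import numpy as np

def kmer_counts(seq, k=4):
    """Count occurrences of every possible k-mer in a DNA sequence.

    Returns a vector of length 4**k, one entry per k-mer, in
    lexicographic order (A < C < G < T)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    # Slide a window of width k over the sequence and count each k-mer
    for i in range(len(seq) - k + 1):
        vec[index[seq[i:i + k]]] += 1
    return vec
```

A 51 bp sequence yields 48 overlapping 4-mers, so most of the 256 entries are zero, which is one reason such features can carry little signal with only 88 training examples.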
We then shifted to a one-hot encoding approach. The advantage of one-hot encoding lies in its ability to represent each nucleotide independently and non-sequentially, avoiding potential relationships and biases introduced by the k-mer method. This approach can capture the diversity and complexity of sequence features more comprehensively. However, the final model performance did not improve, suggesting that further refinement of feature extraction and selection strategies may be necessary to enhance predictive capabilities.
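One-hot encoding as described above can be sketched like this. A minimal version assuming NumPy and sequences containing only A/C/G/T (the function name `one_hot` is ours); each 51 bp promoter becomes a flat vector of 4 × 51 = 204 binary features:

```python
import numpy as np

def one_hot(seq):
    """One-hot encode a DNA sequence into a flat vector of length 4*len(seq).

    Each position contributes four binary features, one per base (A, C, G, T),
    so no ordinal relationship is imposed between nucleotides."""
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        mat[i, mapping[base]] = 1.0
    return mat.flatten()
```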
We analyzed the distribution of our fluorescence data and found that most values were below 5000. This indicates that the majority of the promoter insertions had a negative impact on fluorescence output. This skewed distribution can significantly affect our predictive model, as models often rely on balanced datasets for accurate training. When the data is heavily biased towards lower values, it may lead to overfitting on the negative cases and diminish the model's ability to generalize to sequences with potentially positive effects. Consequently, addressing this imbalance—perhaps by employing techniques such as data augmentation or re-sampling—could enhance the predictive power and robustness of our model.
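One simple re-sampling strategy of the kind mentioned above is to up-sample the under-represented high-fluorescence samples. A minimal sketch, assuming scikit-learn and NumPy; the helper name `upsample_minority` is ours, not part of any library:

```python
import numpy as np
from sklearn.utils import resample

def upsample_minority(X, y, label, n_target, seed=0):
    """Up-sample rows of class `label` (with replacement) to n_target rows,
    then append them to the rest of the dataset."""
    mask = (y == label)
    # resample() draws rows from X and y consistently
    X_min, y_min = resample(X[mask], y[mask], replace=True,
                            n_samples=n_target, random_state=seed)
    return np.vstack([X[~mask], X_min]), np.concatenate([y[~mask], y_min])
```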
Classification Model Task
To mitigate the impact of the skewed distribution, we transitioned to a classification model. This approach allows us to categorize promoter sequences into three distinct classes based on their fluorescence intensity—high, medium, and low—rather than focusing on continuous output values. By doing so, we can better capture the underlying relationships between sequence features and their corresponding functional outcomes. Classification models are generally more robust to imbalances in data, as they focus on distinguishing between predefined categories. This shift enables us to leverage the information contained in the data more effectively and improve the model's predictive accuracy. Additionally, this method allows for a clearer interpretation of how specific sequence characteristics influence promoter activity, which is critical for our subsequent promoter engineering efforts.
We also incorporated the Logit Boost (LB) model as an additional method. LB is a boosting algorithm that builds a series of weak learners, typically decision stumps, by focusing on the examples that are hardest to predict. Each subsequent model is trained on the errors of the previous one, gradually improving accuracy. This iterative process makes LB particularly effective in handling complex datasets, especially where there might be subtle patterns or non-linear relationships that traditional models may struggle to capture.
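The boosting loop described above can be sketched for the two-class case. This is a minimal, illustrative implementation of LogitBoost (Friedman, Hastie, and Tibshirani's additive logistic regression algorithm) using regression stumps from scikit-learn; it is a sketch for intuition, not the exact implementation we used:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleLogitBoost:
    """Two-class LogitBoost with decision stumps as weak learners."""

    def __init__(self, n_rounds=50):
        self.n_rounds = n_rounds
        self.stumps = []

    def fit(self, X, y):
        # y must contain 0/1 labels
        F = np.zeros(len(y), dtype=float)
        for _ in range(self.n_rounds):
            p = 1.0 / (1.0 + np.exp(-2.0 * F))       # current class probabilities
            w = np.clip(p * (1.0 - p), 1e-5, None)   # example weights
            z = np.clip((y - p) / w, -4.0, 4.0)      # working response: large for
                                                     # hard-to-predict examples
            stump = DecisionTreeRegressor(max_depth=1)
            stump.fit(X, z, sample_weight=w)         # weighted least-squares fit
            F += 0.5 * stump.predict(X)
            self.stumps.append(stump)
        return self

    def predict(self, X):
        F = sum(0.5 * s.predict(X) for s in self.stumps)
        return (F > 0).astype(int)
```

Because the working response grows for misclassified examples, each round concentrates on the errors of the previous ones, as described above.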
By categorizing the promoters based on fluorescence intensity into high, medium, and low groups, we aimed to simplify the prediction task. We defined thresholds such that fluorescence intensity below 2500 is classified as low, between 2500 and 10,000 as medium, and above 10,000 as high. This classification approach allows us to focus on distinguishing between discrete outcomes, which can be more manageable than predicting precise values. Additionally, the one-hot encoding method proved more effective in this context, as it captures the presence of specific nucleotides without imposing sequential dependencies, making it well-suited for identifying patterns that correlate with these categorical outcomes. This simplification ultimately enhances our model's predictive power and interpretability.
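The thresholding rule above is straightforward to express in code (the function name `fluorescence_class` is ours):

```python
def fluorescence_class(value):
    """Map a fluorescence intensity to the three activity classes
    used in this study: low (< 2500), medium (2500-10000), high (> 10000)."""
    if value < 2500:
        return "low"
    if value <= 10000:
        return "medium"
    return "high"
```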
The table below lists the classification accuracy of each model on the held-out test set for each of the five runs and both feature sets (RF: random forest; SVMR/SVML: SVM with RBF/linear kernel; LB: LogitBoost):

Run | Time | RF | SVMR | SVML | LB | Dataset |
---|---|---|---|---|---|---|
1 | 2.629882 | 0.44 | 0.4 | 0.36 | 0.5263158 | 4mers |
2 | 2.160985 | 0.28 | 0.4 | 0.4 | 0.3181818 | 4mers |
3 | 2.003939 | 0.32 | 0.4 | 0.36 | 0.2142857 | 4mers |
4 | 2.061656 | 0.56 | 0.4 | 0.64 | 0.5263158 | 4mers |
5 | 2.054982 | 0.48 | 0.4 | 0.32 | 0.2941176 | 4mers |
1 | 1.753595 | 0.96 | 0.4 | 0.56 | 0.96 | onehot |
2 | 1.47514 | 0.92 | 0.4 | 0.44 | 1 | onehot |
3 | 1.487382 | 0.96 | 0.4 | 0.52 | 0.96 | onehot |
4 | 1.504315 | 0.96 | 0.4 | 0.24 | 0.88 | onehot |
5 | 1.460813 | 0.96 | 0.4 | 0.56 | 0.96 | onehot |
The pairwise agreement between the models' predictions provides additional insight into their behavior. The low agreement of 0.40 between SVMR and LB indicates a significant divergence in their predictions, suggesting that the two models interpret the data or weight the features differently. The moderate agreement of 0.56 between LB and SVML implies some shared behavior, while the agreement of 0.44 between SVML and SVMR suggests that the two kernels capture different aspects of the data, possibly influenced by their hyperparameter settings. The agreement of 0.52 between LB and RF indicates partial overlap in the predictive features these models exploit. Overall, these metrics highlight varying degrees of alignment: some models complement each other, while others offer distinct views of the data, which can guide further refinement of the modeling process.
Model pair | Agreement |
---|---|
SVMR vs. LB | 0.40 |
LB vs. SVML | 0.56 |
SVML vs. SVMR | 0.44 |
LB vs. RF | 0.52 |
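Agreement in this sense is simply the fraction of test samples on which two models output the same class label. A minimal sketch assuming NumPy (the function name `agreement` and the example labels are ours):

```python
import numpy as np

def agreement(pred_a, pred_b):
    """Fraction of samples on which two models predict the same class."""
    pred_a, pred_b = np.asarray(pred_a), np.asarray(pred_b)
    return float(np.mean(pred_a == pred_b))
```

Note that with three classes, two unrelated classifiers would already agree about a third of the time by chance, so values near 0.4-0.5 indicate only modest alignment.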
We ultimately evaluated the models using ROC curves and AUC metrics, revealing that the LB model achieved the highest classification AUC of 0.96. This indicates exceptional predictive performance, suggesting that LB effectively distinguishes between the different classes of promoter sequences. This strong performance may be attributed to the model's ability to capture complex relationships within the data, making it a valuable tool for future experimental validation and system refinement.
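For a three-class problem, AUC can be computed in a one-vs-rest fashion from the models' class probabilities. A small sketch using scikit-learn's `roc_auc_score`; the labels and probabilities below are illustrative placeholders, not our experimental outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical test-set labels and class-probability outputs of a classifier
y_true = np.array([0, 0, 1, 1, 2, 2])        # 0 = low, 1 = medium, 2 = high
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.6, 0.2],
                   [0.1, 0.7, 0.2],
                   [0.1, 0.2, 0.7],
                   [0.2, 0.2, 0.6]])
# One-vs-rest, macro-averaged AUC across the three classes
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```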
These modeling efforts provide a novel reference for future promoter modifications in our system. By shifting our focus to classification, we were able to better capture the distinct functional groups of promoters based on their fluorescence intensity, which is inherently a categorical outcome. This approach allows us to leverage patterns in the data that correlate specific sequences with their respective activity levels.
The superiority of one-hot encoding in our analysis can be attributed to its ability to represent categorical data without imposing any ordinal relationships between features. Unlike k-mer representations, which may inadvertently suggest that certain sequences are more similar or relevant than others, one-hot encoding treats each nucleotide position independently. This independence allows the model to learn more nuanced interactions between the sequence and the resulting promoter activity.
Furthermore, one-hot encoding captures all possible features of the sequence without losing information about individual bases, making it particularly effective for classification tasks where the relationship between the sequence and its biological function is complex and multifaceted. Ultimately, this methodology not only enhanced the model's predictive performance but also provided valuable insights for the strategic design of synthetic promoters in our ongoing research.
References
[1] Thomason MK, Bischler T, Eisenbart SK, Förstner KU, Zhang A, Herbig A, Nieselt K, Sharma CM, Storz G. 2015. Global Transcriptional Start Site Mapping Using Differential RNA Sequencing Reveals Novel Antisense RNAs in Escherichia coli. J Bacteriol 197.
[2] Greener, J.G., Kandathil, S.M., Moffat, L. et al. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 23, 40–55 (2022).
[3] Bioinformatics 1: K-mer Counting: https://medium.com/swlh/bioinformatics-1-k-mer-counting-8c1283a07e29
[4] Modeling DNA Sequences with PyTorch: https://towardsdatascience.com/modeling-dna-sequences-with-pytorch-de28b0a05036
[5] Wu C, Balakrishnan R, Braniff N, et al. Cellular perception of growth rate and the mechanistic origin of bacterial growth law[J]. Proceedings of the National Academy of Sciences, 2022, 119(20): e2201585119.
[6] Gunasekaran H, Ramalakshmi K, Rex Macedo Arokiaraj A, et al. Analysis of DNA sequence classification using CNN and hybrid models[J]. Computational and Mathematical Methods in Medicine, 2021, 2021(1): 1835056.