In this study, we investigate the dynamics of our transcriptional detection system through two complementary modeling approaches. The first uses an ordinary differential equation (ODE) model to analyze how varying concentrations of the sensor protein LuxR affect the system's behavior. This model provides insight into the system's kinetic responses, highlighting how LuxR concentration can be tuned to optimize sensitivity and specificity in detecting target ligands.
The second approach employs machine learning techniques to analyze experimental data from different promoters controlling LuxR expression. By uncovering the relationship between promoter sequences and system responses, this model addresses the need for predictive capabilities in synthetic biology. Understanding these relationships can guide the design of more effective genetic circuits, ultimately enhancing the performance and reliability of biotechnological applications. Together, these modeling strategies not only provide a comprehensive framework for our system but also contribute to advancing the field of synthetic biology by facilitating the rational design of transcriptional devices.
Methods to modify an induction-based transcriptional detection system include adjusting the concentration of the sensor protein, altering the protein-promoter relationship, and mutating the sensor protein. Among these, adjusting the sensor protein concentration is the most straightforward approach. We referenced a study that examined how varying the concentration of LuxR (the sensor protein) affects the final response curve. The study used modeling and experimental methods to demonstrate that increasing LuxR concentration significantly reduces the background signal and lowers the half-maximal induction concentration, offering new insights for optimizing our detection system toward higher sensitivity and accuracy.
The mathematical model for the inducible device is built on several premises: the transfer function is assessed at steady state, with constant concentrations of the molecular species related to the genetic device. The total promoter concentration is much smaller than that of transcriptional activators, and the ligand concentration is assumed to be invariant and sufficiently large that the bound portion can be neglected. The model also disregards metabolic burden, assuming that the device's load does not impact cellular behavior. Additionally, the genetic device components are designed for orthogonal functionality, avoiding unwanted interactions with other cellular components. RFP expression is modulated by lactone concentration, considering both LuxR-independent and LuxR-dependent leakiness.
Building on this modeling framework, we first analyze the organization of pathways within the cell. Treating the total amount of LuxR as constant, it can exist in three interchangeable states: the free monomer \(R\), the dimer \(R_2\), and the dimer \(Q\) bound to a signaling molecule. These states correspond to three outcomes: no induction of RNA transcription, low-level induction (leaky expression), and high-level induction of mRNA transcription. The resulting mRNA produces the fluorescent protein mVenus (\(G\)), with expression controlled by a constant factor \(K_p\). Throughout the system, we ignore LuxR degradation.
Different promoters correspond to varying total amounts of LuxR \([R + R_2 + Q]\), and we define promoter strength as \(\phi\). This framework integrates the enzymology-based model premises with the specific mechanisms of the LuxR device, enhancing our understanding of the regulatory network and the dynamic characteristics of gene expression.
In this streamlined model, we assume that neither the ligand-free receptor nor the unbound promoter produces leaky expression, so \(k_{m0} = 0\) and \(K_2 = 0\). The reactions we consider are the binding of the ligand to the receptor to form a receptor–ligand complex, the dimerization of the receptor to generate an active transcriptional complex, the initiation of transcription from the promoter, and the translation of the resulting mRNA into the fluorescent protein mVenus. This framework gives rise to the following differential equations:
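Schematically, these reactions can be sketched as follows. This is an illustrative reconstruction from the description above: \(SL\) denotes the active ligand-bound dimer, \(M\) the transcript, and \(P\) the free promoter; the stoichiometry and rate symbols are assumptions, not the fitted model.

\[
2R \;\rightleftharpoons\; R_2, \qquad
R_2 + 2L \;\rightleftharpoons\; SL, \qquad
P + SL \;\rightleftharpoons\; P{\cdot}SL \;\xrightarrow{k_m}\; P{\cdot}SL + M, \qquad
M \;\xrightarrow{K_p}\; M + G
\]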
At steady state, where
The steady-state concentration of the reporter output [G∗] can be expressed as a function of the active complex [SL]:
Next, we aim to derive a mathematical expression for \([G^\ast]\) in relation to the control variables of this model: \([L]\), \([R_T]\), and \([P_T]\). These represent the concentration of 3OC6HSL (the ligand), the total amount of LuxR (maintained constant via constitutive expression), and the total number of Lux promoters (determined by a stable population of plasmids), respectively.
For simplicity, and in line with the biological context, we assume that dimerization takes place before ligand binding. Additionally, we assume that the interaction with the ligand has minimal impact on the equilibrium between the dimer and monomer forms. Therefore, we focus solely on the quantity of LuxR dimers, \([R_{2T}]\).
Based on this, we define the conservation equations below:
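As a hedged sketch consistent with the assumptions above (treating \(SL\) as the ligand-bound dimer and \(P{\cdot}SL\) as the occupied promoter), the conservation relations would take the form:

\[
[R_{2T}] = [R_2] + [SL], \qquad
[P_T] = [P] + [P{\cdot}SL]
\]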
To put this model in a compact expression of the familiar Michaelis–Menten form, we introduce the maximum expression of the reporter, \(G_{app}\), and the apparent affinity constant of the device, \(K_{app}\).
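Written out, this compact Michaelis–Menten form reads (with \(K_{app}\) acting as an apparent dissociation constant):

\[
[G^\ast] = \frac{G_{app}\,[L]}{K_{app} + [L]}
\]

so that \(G_{app}\) sets the saturating output and \(K_{app}\) the ligand concentration at half-maximal induction.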
In this model, the kinetic constants are lumped together to express their total effect on the system, and the total dimer concentration \([R_{2T}]\) can be represented as a function of the normalized promoter strength \(\phi\). This highlights the influence of promoter design on gene expression efficiency and sensitivity.
We replicated the parameters of the original model and fitted the experimental data for the original and best-performing curves to calculate their \(\phi\) values. This allowed us to determine appropriate \(\phi\) settings for our system.
Discussion of the Relationship Between \(\phi\), \(G_{app}\), and \(K_{app}\)
In our model, \(\phi\) serves as a parameter that scales the activity of the promoter. As \(\phi\) increases, the effective affinity of the device for its ligand improves: the apparent dissociation constant \(K_{app}\) falls roughly in proportion to \(1/\phi\), producing a more pronounced output signal \(G_{app}\). As \(\phi\) approaches zero, \(G_{app}\) tends toward zero while \(K_{app}\) reaches its maximum value \(K_1^{-1}\), indicating minimal responsiveness. Conversely, higher \(\phi\) values drive \(G_{app}\) toward its maximum, limited by the efficiency of the expression machinery.
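The limiting behavior described above can be illustrated numerically. This is a minimal sketch, not the fitted model: the functional forms for \(G_{app}(\phi)\) and \(K_{app}(\phi)\) below are assumptions chosen only to reproduce the stated limits (output vanishing as \(\phi \to 0\), \(K_{app}\) capped at \(1/K_1\), and \(K_{app} \propto 1/\phi\) decay at large \(\phi\)), and all parameter values are arbitrary.

```python
# Illustrative trends of Gapp and Kapp versus promoter strength phi.
# All constants are assumed, arbitrary-unit placeholders.
K1 = 2.0       # assumed affinity constant; Kapp is capped at 1/K1
G_MAX = 100.0  # assumed maximal reporter output
PHI_0 = 0.5    # assumed promoter-strength scale

def g_app(phi):
    """Apparent maximal expression: 0 at phi = 0, saturating at G_MAX."""
    return G_MAX * phi / (phi + PHI_0)

def k_app(phi):
    """Apparent dissociation constant: 1/K1 at phi = 0, ~1/phi decay."""
    return (1.0 / K1) / (1.0 + phi / PHI_0)

def response(ligand, phi):
    """Michaelis-Menten-style transfer function G* = Gapp*[L]/(Kapp+[L])."""
    return g_app(phi) * ligand / (k_app(phi) + ligand)

for phi in (0.01, 0.1, 1.0, 10.0):
    print(f"phi={phi:5.2f}  Gapp={g_app(phi):7.2f}  Kapp={k_app(phi):.4f}")
```

Stronger promoters thus simultaneously raise the plateau and shift the curve toward lower ligand concentrations, which is the behavior we exploit below.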
This behavior parallels concepts in enzyme kinetics, particularly the effects of allosteric activators. In enzyme systems, an activator increases the reaction rate achieved at a given substrate concentration, akin to how an increase in \(\phi\) boosts the output of our genetic device. Conversely, in inhibitory scenarios, rising inhibitor concentrations would raise \(K_{app}\) and decrease \(G_{app}\), demonstrating the contrasting effects of activation and inhibition on system performance. This underscores the importance of regulatory mechanisms in biological systems and highlights the versatility of promoter dynamics in influencing gene expression.
In our case, we need a more sensitive system to detect ~10 mM AHL in expired milk samples, which means we need a stronger promoter than J23119 driving LuxR.
We introduced a library of 50 promoters from the iGEM Parts Registry, inserting them upstream of the luxR gene by Golden Gate assembly to create systems with varying LuxR concentrations. We obtained 16 clones with different promoters and tested them in 96-well plates with overnight cultures at six 3OC6HSL (AHL) concentrations: 50, 25, 12.5, 6.25, 3.125, and 0 nM. Fluorescence intensity (excitation at 510 nm, emission at 550 nm) and OD600 were measured for each condition. The resulting induction curves for each promoter showed that the half-maximal induction concentration shifted with the different promoters.
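One simple way to read the half-maximal induction concentration off such a curve is linear interpolation between the two measured points that bracket half of the maximal signal. The sketch below uses the assay's concentration series but made-up fluorescence values for illustration; it is not our measured data or fitting pipeline.

```python
# Estimate the half-maximal induction concentration (EC50) from a
# dose-response curve by linear interpolation between the two points
# that bracket the half-maximal signal.

def ec50_interpolate(doses, responses):
    """Dose at which the response crosses half of its maximum.

    Assumes `doses` is sorted ascending and the curve crosses the
    half-maximal level exactly once.
    """
    half = (min(responses) + max(responses)) / 2.0
    points = list(zip(doses, responses))
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if (r0 - half) * (r1 - half) <= 0:  # this segment brackets half-max
            if r1 == r0:
                return d0
            return d0 + (half - r0) * (d1 - d0) / (r1 - r0)
    raise ValueError("response never crosses half-maximum")

doses = [0.0, 3.125, 6.25, 12.5, 25.0, 50.0]  # nM, as in the assay
signal = [5.0, 8.0, 20.0, 60.0, 95.0, 100.0]  # illustrative fluorescence
print(f"EC50 ~ {ec50_interpolate(doses, signal):.2f} nM")
```

In practice a Hill-function fit over all points is more robust, but the interpolation above conveys what "half-maximal induction concentration" means for these curves.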
Promoter sequences were verified by Sanger sequencing. Since the promoters varied in length, we first extracted length-independent sequence features: GC content, sequence entropy, and 3-mer frequency distributions. We then framed a regression task and trained three machine learning models, Random Forest (RF) Regression, Support Vector Regression (SVR), and Linear Regression, to compare their performance and identify the best model for this task. Each model underwent feature engineering, hyperparameter tuning, and cross-validation. The key evaluation metrics were root mean squared error (RMSE), mean absolute error (MAE), and R² (coefficient of determination), which assess the accuracy and robustness of the predictions. We also performed 10-fold cross-validation, training on different subsets of the data and validating on the remaining portions, to evaluate each model's stability and generalization ability and to minimize overfitting.
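The feature extraction step can be sketched as follows; the `k` parameter of the k-mer counter is the knob we later turn from 3 to 4. The entropy here is Shannon entropy in bits over the nucleotide frequencies, which reproduces the values in the table below.

```python
# Length-independent sequence features: GC content, Shannon entropy
# (bits, over nucleotide frequencies), and overlapping k-mer counts.
import math
from collections import Counter
from itertools import product

def gc_content(seq):
    seq = seq.lower()
    return (seq.count("g") + seq.count("c")) / len(seq)

def entropy(seq):
    counts = Counter(seq.lower())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def kmer_counts(seq, k=3):
    seq = seq.lower()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # fixed-length feature vector over all 4**k possible k-mers
    return {"".join(p): counts.get("".join(p), 0)
            for p in product("acgt", repeat=k)}

promoter = "tttacagctagctcagtcctaggtattatgctagc"  # first row of the table
print(round(gc_content(promoter), 9), round(entropy(promoter), 9))
print(kmer_counts(promoter)["agc"])  # AGC count, matching the table
```

Concatenating these per-sequence dictionaries row-wise yields exactly the kind of feature table shown below.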
In our first engineering cycle, the average RMSE for the RF model was 4.759 with an R² of 0.383; the SVR model had an RMSE of 4.456 and an R² of 0.345; and the Linear Regression model had an RMSE of 6.048 and an R² of 0.387. All models exhibited relatively high RMSE values and low R² values (below 0.5), and no model was clearly superior across both metrics; none of the predictions were meaningful.
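For reference, the metrics used throughout this comparison can be computed as below. This is a minimal pure-Python sketch so the arithmetic is explicit; the `y_true`/`y_pred` values are illustrative numbers, not our experimental results.

```python
# Regression metrics: RMSE, MAE, and R^2 (coefficient of determination).
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 12.0, 9.0, 15.0, 11.0]   # illustrative targets
y_pred = [11.0, 11.5, 10.0, 13.0, 11.5]  # illustrative predictions
print(rmse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```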
The extracted feature table for the sequenced promoters (GC content, sequence length, Shannon entropy, and 3-mer counts) is shown below:

sequence | GC_Content | Sequence_Length | Entropy | AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | AGG | AGT | ATA | ATC | ATG | ATT | CAA | CAC | CAG | CAT | CCA | CCC | CCG | CCT | CGA | CGC | CGG | CGT | CTA | CTC | CTG | CTT | GAA | GAC | GAG | GAT | GCA | GCC | GCG | GCT | GGA | GGC | GGG | GGT | GTA | GTC | GTG | GTT | TAA | TAC | TAG | TAT | TCA | TCC | TCG | TCT | TGA | TGC | TGG | TGT | TTA | TTC | TTG | TTT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
tttacagctagctcagtcctaggtattatgctagc | 0.428571429 | 35 | 1.967253461 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 3 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 1 |
ttgacagctagctcagtcctaggtactgtgctagc | 0.514285714 | 35 | 1.988442576 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
ctgatagctagctcagtcctagggattatgctagc | 0.485714286 | 35 | 1.993608561 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
ttgacagctagctcagtcctaggtattgtgctagc | 0.485714286 | 35 | 1.979724238 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 |
tttacggctagctcagtcctaggtactatgctagc | 0.485714286 | 35 | 1.979724238 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
ttgacggctagctcagtcctaggtattgtgctagc | 0.514285714 | 35 | 1.964060052 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 |
tttacggctagctcagccctaggtattatgctagc | 0.485714286 | 35 | 1.979724238 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 1 |
taatacgactcactatagggaga | 0.391304348 | 23 | 1.925796479 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
tttacagctagctcagtcctagggactgtgctagc | 0.514285714 | 35 | 1.988442576 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
tttacggctagctcagtcctaggtacaatgctagc | 0.485714286 | 35 | 1.993608561 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
ttgacggctagctcagtcctaggtatagtgctagc | 0.514285714 | 35 | 1.983853121 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
ctgatagctagctcagtcctagggattatgctagc | 0.485714286 | 35 | 1.993608561 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
ctgatggctagctcagtcctagggattatgctagc | 0.514285714 | 35 | 1.983853121 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
tttatggctagctcagtcctaggtacaatgctagc | 0.457142857 | 35 | 1.984890223 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
tttaattatatatatatatatataatggaagcgtttt | 0.135135135 | 37 | 1.524004205 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 8 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 3 |
tataagatcatacgccgttatacgttgtttacgctttg | 0.368421053 | 38 | 1.920673491 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 3 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 2 | 2 |
We hypothesized that too much information was lost during feature extraction. To increase the information fed into the models, we expanded the k-mers from 3 to 4 (growing the k-mer feature space from 64 to 256 columns), recalculated the frequency distributions, and retrained the three models. This adjustment markedly improved performance: the RF model's RMSE decreased to 3.535 with an R² of 0.765, the SVR model's RMSE decreased to 2.824 with an R² of 0.443, and the Linear Regression model's RMSE dropped to 5.464 with an R² of 0.269. The improvements in RMSE and R² for the RF and SVR models suggest that incorporating 4-mer features enhanced the models' accuracy and explanatory power.
The improved performance of our model using 4-mers over 3-mers likely stems from the fact that 4-mers capture more detailed and biologically relevant patterns in the sequence. While 3-mers may miss important motifs or structural features that are predictive of our task, 4-mers provide a richer representation of the sequence, allowing the model to better recognize key patterns and associations. This additional information likely enhances the model's ability to generalize and make more accurate predictions.
With our best-performing model, we can virtually screen a larger library of promoters to identify suitable sequences. However, our current experimental timeline and budget do not allow us to validate this phase. We have made our dataset available online so that future teams can reference our training methods and continue this work.