Model


Overview

In this study, we investigate the intricate dynamics of our transcriptional detection system through two complementary modeling approaches, each serving to deepen our understanding of gene regulation. The first approach utilizes an ordinary differential equation (ODE) model to analyze how varying concentrations of LuxR, the sensor protein, affect the system's behavior. This model provides valuable insights into the kinetic responses of the system, highlighting how LuxR concentration can optimize sensitivity and specificity in detecting target ligands.

The second approach employs machine learning techniques to analyze experimental data from different promoters controlling LuxR expression. By uncovering the relationship between promoter sequences and system responses, this model addresses the need for predictive capabilities in synthetic biology. Understanding these relationships can guide the design of more effective genetic circuits, ultimately enhancing the performance and reliability of biotechnological applications. Together, these modeling strategies not only provide a comprehensive framework for our system but also contribute to advancing the field of synthetic biology by facilitating the rational design of transcriptional devices.


ODE model of system dynamics with different LuxR concentration

Methods to modify an induction-based transcriptional detection system include adjusting the concentration of the sensor protein, altering the protein-promoter relationship, and mutating the sensor protein. Among these, adjusting the sensor protein concentration is the most straightforward approach. We referenced a study that examined how varying LuxR (the sensor protein) concentration affects the final response curve. The study utilized modeling and experimental methods to demonstrate that increasing LuxR concentration significantly reduces the background signal and lowers the half-maximal induction curve, offering new insights for optimizing our detection system to achieve higher sensitivity and accuracy.

The mathematical model for the inducible device is built on several premises: the transfer function is assessed at steady state, with constant concentrations of molecular species related to the genetic device. The total promoter concentration is much smaller than that of transcriptional activators, and the ligand concentration is assumed to be invariant and sufficiently large to neglect the bound portion. The model also disregards metabolic burden, assuming that the device's load does not impact cellular behavior. Additionally, the genetic device components are designed for orthogonal functionality, avoiding unwanted interactions with other cellular components. RFP expression is modulated by lactone concentration, considering both luxR-independent and luxR-dependent leakiness.

Building on this modeling framework, we first analyze the organization of pathways within the cell. Treating the total amount of LuxR as constant, it can exist in three interchangeable states: the free monomer \(R\), the dimer \(R_2\), and the dimer \(Q\) bound to a signaling molecule. These states correspond to three outcomes: no induction of RNA transcription, low-level induction (leaky expression), and high-level induction of mRNA transcription. The resulting mRNA produces the fluorescent protein mVenus (\(G\)), with expression controlled by a constant factor \(K_p\). Throughout the system, we ignore LuxR degradation.

Different promoters correspond to varying total amounts of LuxR \([R + R_2 + Q]\), and we define promoter strength as \(\phi\). This framework integrates the enzymology-based model premises with the specific mechanisms of the LuxR device, enhancing our understanding of the regulatory network and the dynamic characteristics of gene expression.

This schematic illustrates the genetic regulation mechanisms for the inducible LuxR-pLux engineered devices employed in this study. The notation adheres to the mathematical model, where R denotes the LuxR receptor, L represents the lactone, and P indicates the pLux promoter. R2 refers to the dimerized receptor in the absence of lactone, while Q represents the dimerized receptor in the presence of lactone. SR signifies the transcriptional complex that is not mediated by lactone, and SL denotes the lactone-mediated transcriptional complex. mG is the mRNA of the reporter gene, and G refers to the reporter protein. The notation Ki indicates equilibrium constants, whereas ki denotes kinetic constants.

In this streamlined model, we assume that receptor binding and the unbound promoter do not exhibit leakiness, which results in km0=0 and K2=0. The reactions we consider include the binding of the ligand to form a receptor-ligand complex, the dimerization of the receptor to generate an active transcriptional complex, the initiation of RNA transcription from the promoter, and the subsequent translation of the RNA into the fluorescent protein mVenus. This framework gives rise to the following differential equations:

\begin{equation} \frac{d[mRNA]}{dt} = k_{mL}[S_{L}] - k_{md}[mG] \quad \end{equation}
\begin{equation} \frac{d[G]}{dt} = k_p[mG] - k_{pd}[G] \end{equation}

At steady state, where

\begin{equation} \frac{d[mG]}{dt} = \frac{d[G]}{dt} = 0 \end{equation}

The concentration of the active complex [G∗] can be expressed as a function of the active complex [SL]:

\begin{equation} [G^*] = \frac{k_p k_{mL}}{k_{pd} k_{md}} [S_{L}] \end{equation}

Next, we aim to derive a mathematical expression for [G∗] in relation to the control variables of this model: [L], [RT] and [PT]. These variables represent the concentration of 3OC6HSL (the ligand), the amount of LuxR (maintained constant via constitutive expression, [RT]), and the total number of Lux promoters ([PT]), determined by a stable population of plasmids.

For simplicity and in line with the biological context, we assume that dimerization takes place before ligand binding. Additionally, we mathematically assert that the interaction with the ligand has minimal impact on the equilibrium between the dimer and monomer forms. Therefore, we will focus solely on the quantity of LuxR dimers, [R2T].

Based on this, we define conservation equations below:

\begin{equation} [SL] = K_1 K_3 [P][R_2][L] \end{equation}
\begin{equation} [R_2T] = [R_2](1 + K_1[L]) \end{equation}
\begin{equation} [PT] = [P](1 + K_1 K_3 R_2[L]) \end{equation}

To put this model in a compact expression in the familiar Michaelis–Menten form, we introduce the maximum expression of the reporter Gapp and the affinity constant of the device Kapp.

\begin{equation} K_{\text{app}} = K_1^{-1} \frac{(K_{3}r)^{-1}}{(K_{3}r)^{-1} + \phi} \end{equation}
\begin{equation} G_{\text{app}}^m = g k_{mL} [PT] \frac{\phi}{(K_{3}r)^{-1} + \phi} \end{equation}
\begin{equation} [G^*] = \frac{G_{\text{app}}^m [L]}{K_{\text{app}} + [L]} \end{equation}

In this model, the kinetic constants \(g\) are combined to express the total effect on the system. The total concentration \([R_2T]\) can be represented as a function of the normalized promoter strength ϕ. This highlights the influence of promoter design on gene expression efficiency and sensitivity.

Figure2. Transfer function for five different φ according to the simplified model without leakiness. Upper dotted line corresponds to the maximum achievable expression. Lower dotted line shows the Gappₘ/2 of φ=1000 and 100 . The intersections with the transfer functions (little open circles) highlight the Gappₘ/2 for the four transfer functions. Their projections over the [L] axis show their respective Kapp values (bigger open circles). Parameters for the simulation were: gkmL[PT] = 5.0 ,K1 = 1.0,K3*r = 5.0

We replicated the parameters of the original model and fitted the experimental data for the best and original curves to calculate the ϕ values. This allowed us to determine the appropriate ϕ settings for our system.

Discussion of the Relationship Between phi, G, and K

In our model, phi serves as a parameter that enhances the activity of the promoter. As phi increases, the effective affinity of the device for its ligand improves, leading to a reduction in the apparent dissociation constant Kapp proportional to phi inverse. This change results in a more pronounced output signal Gapp. When phi approaches zero, Gapp m tends toward zero while Kapp reaches its maximum value K1 inverse, indicating minimal responsiveness. Conversely, higher phi values lead to maximum Gapp, driven by the efficiency of the machinery.

This behavior parallels concepts in enzyme kinetics, particularly the effects of allosteric activators. In enzyme systems, an increase in substrate concentration enhances reaction rates, akin to how an increase in phi boosts the output of our genetic device. Conversely, in inhibitory scenarios, rising inhibitor concentrations would lead to a higher Kapp and a decrease in Gapp, demonstrating the contrasting effects of activation and inhibition on system performance. This underscores the importance of regulatory mechanisms in biological systems and highlights the versatility of promoter dynamics in influencing gene expression.

Caption: The dependence of Kapp and Gappm on the value of φ is illustrated. As φ approaches 0, Kapp ∼ K1-1 and Gappm ∼ 0. The vertical dashed line indicates the position of φ = (K3r)-1. At this position, Gappm reaches its maximum value, while Kapp decays inversely with respect to φ.

In our case, we need a more sensitive system to detect ~10mM AHL in expired milk sample, which means we need a stronger promoter than J23119 for LuxR.


Machine learning model of our high-throughput screening data

We introduced a library of 50 promoters from iGEM parts, which were inserted upstream of the LuxR gene using Golden Gate assembly to create a system that contains varying concentrations of LuxR. We obtained 16 clones with different promoters and tested them in 96-well plates with overnight cultures at six different LuxR concentrations: 50, 25, 12.5, 6.25, 3.125, and 0 nM. Fluorescence intensity (excitation at 510 nm, emission at 550 nm) and OD600 were measured for each condition. The resulting induction curves for each promoter were plotted, showing that the half-maximal induction concentration was altered by the different promoters.

Promoter sequences were verified using Sanger sequencing. Since the promoters varied in length, we first extracted sequence features that were independent of length, including GC content, sequence entropy, and 3-mer frequency distributions. We established a regression task and trained three different machine learning models to evaluate their performance. By comparing the results, we aim to identify the model that performs the best for this task. The three models include Random Forest Regression, Support Vector Regression (SVR), and Linear Regression. Each model underwent feature engineering, hyperparameter tuning, and cross-validation. The key evaluation metrics we focused on are Mean Squared Error (MSE), Mean Absolute Error (MAE), and R² (coefficient of determination), to assess the accuracy and robustness of the models in the prediction task. We also performed 10-fold cross-validation to ensure that the models were evaluated on a broader range of data and to minimize overfitting. This approach allowed us to assess the stability and generalization ability of each model by training on different subsets of the data and validating on the remaining portions.

In our first circle, the average RMSE for the RF model was 4.759, with an R-squared of 0.383324, indicating a low explanatory power. The SVMR model had an RMSE of 4.456 and an R-squared of 0.3447, while the SVMl model had an RMSE of 6.048 and an R-squared of 0.3873. All models exhibited relatively high RMSE values and low R-squared values (<0.5), suggesting poor predictive performance. While the RF model outperformed SVMR and SVMl on all metrics, its predictions were still not meaningful.

Table1. RMSE and R square of 3 model trained on 3-mers character in 10 interation
Table1. Features used in first round machine learning
sequence GC_Content Sequence_Length Entropy AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
tttacagctagctcagtcctaggtattatgctagc 0.428571429 35 1.967253461 0 0 0 0 1 0 0 0 0 3 1 1 0 0 1 1 0 0 2 0 0 0 0 1 0 0 0 0 3 1 0 0 0 0 0 0 0 0 0 3 0 0 0 1 1 1 0 0 0 1 3 2 1 1 0 0 0 1 0 0 2 0 0 1
ttgacagctagctcagtcctaggtactgtgctagc 0.514285714 35 1.988442576 0 0 0 0 1 0 0 1 0 3 1 1 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0 3 1 1 0 0 1 0 0 0 0 0 3 0 0 0 1 1 1 1 0 0 1 3 0 1 1 0 0 1 1 0 1 0 0 1 0
ctgatagctagctcagtcctagggattatgctagc 0.485714286 35 1.993608561 0 0 0 0 0 0 0 0 0 3 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 3 1 1 0 0 0 0 2 0 0 0 3 1 0 1 0 0 1 0 0 0 0 4 1 1 1 0 0 1 1 0 0 1 0 0 0
ttgacagctagctcagtcctaggtattgtgctagc 0.485714286 35 1.979724238 0 0 0 0 1 0 0 0 0 3 1 1 0 0 0 1 0 0 2 0 0 0 0 1 0 0 0 0 3 1 0 0 0 1 0 0 0 0 0 3 0 0 0 1 1 1 1 0 0 0 3 1 1 1 0 0 1 1 0 1 0 0 2 0
tttacggctagctcagtcctaggtactatgctagc 0.485714286 35 1.979724238 0 0 0 0 0 0 1 1 0 2 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 4 1 0 0 0 0 0 0 0 0 0 3 0 1 0 1 1 1 0 0 0 2 3 1 1 1 0 0 0 1 0 0 1 0 0 1
ttgacggctagctcagtcctaggtattgtgctagc 0.514285714 35 1.964060052 0 0 0 0 0 0 1 0 0 2 1 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 3 1 0 0 0 1 0 0 0 0 0 3 0 1 0 1 1 1 1 0 0 0 3 1 1 1 0 0 1 1 0 1 0 0 2 0
tttacggctagctcagccctaggtattatgctagc 0.485714286 35 1.979724238 0 0 0 0 0 0 1 0 0 3 1 0 0 0 1 1 0 0 1 0 0 1 0 1 0 0 1 0 3 1 0 0 0 0 0 0 0 1 0 3 0 1 0 1 1 0 0 0 0 1 3 2 1 0 0 0 0 1 0 0 2 0 0 1
taatacgactcactatagggaga 0.391304348 23 1.925796479 0 0 0 1 0 0 1 2 1 0 1 0 2 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
tttacagctagctcagtcctagggactgtgctagc 0.514285714 35 1.988442576 0 0 0 0 1 0 0 1 0 3 1 1 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0 0 3 1 1 0 0 1 0 0 0 0 0 3 1 0 1 0 0 1 1 0 0 1 3 0 1 1 0 0 0 1 0 1 1 0 0 1
tttacggctagctcagtcctaggtacaatgctagc 0.485714286 35 1.993608561 0 0 0 1 1 0 1 0 0 2 1 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1 0 3 1 0 0 0 0 0 0 0 0 0 3 0 1 0 1 1 1 0 0 0 2 3 0 1 1 0 0 0 1 0 0 1 0 0 1
ttgacggctagctcagtcctaggtatagtgctagc 0.514285714 35 1.983853121 0 0 0 0 0 0 1 0 0 2 1 2 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 3 1 0 0 0 1 0 0 0 0 0 3 0 1 0 1 1 1 1 0 0 0 4 1 1 1 0 0 1 1 0 0 0 0 1 0
ctgatagctagctcagtcctagggattatgctagc.1 0.485714286 35 1.993608561 0 0 0 0 0 0 0 0 0 3 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 3 1 1 0 0 0 0 2 0 0 0 3 1 0 1 0 0 1 0 0 0 0 4 1 1 1 0 0 1 1 0 0 1 0 0 0
ctgatggctagctcagtcctagggattatgctagc 0.514285714 35 1.983853121 0 0 0 0 0 0 0 0 0 2 1 1 0 0 2 1 0 0 1 0 0 0 0 1 0 0 0 0 3 1 1 0 0 0 0 2 0 0 0 3 1 1 1 0 0 1 0 0 0 0 3 1 1 1 0 0 1 1 1 0 1 0 0 0
tttatggctagctcagtcctaggtacaatgctagc 0.457142857 35 1.984890223 0 0 0 1 1 0 0 0 0 2 1 1 0 0 2 0 1 0 1 0 0 0 0 1 0 0 0 0 3 1 0 0 0 0 0 0 0 0 0 3 0 1 0 1 1 1 0 0 0 1 3 1 1 1 0 0 0 1 1 0 1 0 0 1
tttaattatatatatatatatataatggaagcgtttt 0.135135135 37 1.524004205 0 0 1 2 0 0 0 0 0 1 0 0 8 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 2 0 0 8 0 0 0 0 0 0 1 0 2 0 0 3
tataagatcatacgccgttatacgttgtttacgctttg 0.368421053 38 1.920673491 0 0 1 0 0 0 3 0 1 0 0 0 3 1 0 0 0 0 0 1 0 0 1 0 0 2 0 2 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 3 1 3 0 2 1 0 0 0 0 0 0 1 2 0 2 2


We hypothesized that too much information was lost during feature extraction. To increase the information fed into the models, we expanded the k-mers from 3 to 4, recalculating the frequency distributions, and retrained the three models. This adjustment significantly improved performance: the RF model's RMSE decreased to 3.535 with an R-squared of 0.7652, the SVMR model's RMSE decreased to 2.824 with an R-squared of 0.44306, and the SVMl model's RMSE dropped to 5.464 with an R-squared of 0.2687327. The improvements in RMSE and R-squared for the RF and SVMR models suggest that incorporating 4-mers motif enhanced the models' accuracy and explanatory power.

Figure4. Boxplot of prediction performance with different features, with 4-mers showed the training features contained 4-mers motif frequence. Error bar refers to std.

The improved performance of our model using 4-mers over 3-mers likely stems from the fact that 4-mers capture more detailed and biologically relevant patterns in the sequence. While 3-mers may miss important motifs or structural features that are predictive of our task, 4-mers provide a richer representation of the sequence, allowing the model to better recognize key patterns and associations. This additional information likely enhances the model's ability to generalize and make more accurate predictions.

With our high-performing model, we can virtually screen a larger library of promoters to identify suitable sequences. However, our current experimental timeline and budget do not support the validation of this phase. We have made our dataset available online, allowing future teams to reference our training methods and continue this work.


References


[1] Carbonell-Ballestero, Max & Duran-Nebreda et al. (2014). A bottom-up characterization of transfer functions for synthetic biology designs: Lessons from enzymology. Nucleic Acids Research. 42. 10.1093/nar/gku964.

[2]Hooshangi S , Thiberge S , Weiss R .Ultrasensitivity and noise propagation in a synthetic transcriptional cascade[J].Proceedings of the National Academy of Sciences of the United States of America, 2005, 102(10):p.3581-3586.DOI:10.1073/pnas.0408507102.

[3]Weber M , Buceta J .Dynamics of the quorum sensing switch: stochastic and non-stationary effects[J].BMC Systems Biology, 2013, 7(1):1-15.DOI:10.1186/1752-0509-7-6.

[4]Kuhn, Max (2008). “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software, 28(5), 1–26. doi:10.18637/jss.v028.i05

[5]Zucca S, Pasotti L, Mazzini G, et al. Characterization of an inducible promoter in different DNA copy number conditions[J]. BMC bioinformatics, 2012, 13: 1-12.