Model

Explain your model's assumptions, data, parameters, and results in a way that anyone could understand.

Introduction

In the computational part of our project, we set out to understand how mutations in promoter binding sites can influence gene expression controlled by different transcription factors (TFs) in response to reactive oxygen species (ROS). We began by developing a computational model that qualitatively demonstrated how decreasing TF binding affinity shifts gene activation to higher ROS levels . This model showed that reducing binding affinity is indeed a viable strategy to fine-tune gene expression in response to different ROS concentrations.

Our initial attempt to quantify the direct impact of promoter mutations on the binding free energy (\(\Delta\Delta G\)) using structural modeling tools was unsuccessful. The limitations arose because these tools could not accurately represent oxidized cysteine residues, which are crucial for the activation of certain TFs like OxyR. Recognizing this challenge, we turned to an alternative approach using Position Specific Scoring Matrices (PSSMs) derived from DNA sequences of known binding sites. By applying this method to the OxyR transcription factor, we were able to design promoter mutations that reduce TF binding affinity. This approach provided a practical way to engineer promoter sequences that modulate gene expression in response to oxidative stress.


We hosted the full code for the modeling and plotting on our GitLab repository. You can visit the repository to see more details on the implementation:

https://gitlab.igem.org/2024/software-tools/eth-zurich.git


1. Modeling the Impact of Transcription Factor Binding Affinity on Gene Expression

To determine whether changing the binding affinity of the TFs to their respective promoter regions, we developed a computational model based on ordinary differential equations (ODEs). This type of model allows us to simulate the dynamic interactions between ROS, transcription factors, and gene expression within the cell. We aimed to model the transcription factor OxyR including some known kinetic parameters for OxyR. Nevertheless, the qualitative results can still apply to other TFs, as the kinetics can be modeled using adapted parameters.

  1. ROS Dynamics: The influx of external ROS into the cell and its detoxification by antioxidants.
  2. Transcription Factor Activation: The activation of transcription factors by ROS and their subsequent binding to promoter regions to initiate gene expression.
  3. Gene Expression: The transcription and translation of target genes, leading to the production of proteins such as GFP (Green Fluorescent Protein) and antioxidants.

By simulating these processes, we can investigate how mutations that reduce the binding affinity of transcription factors to their promoters influence the cell's response to oxidative stress.

For full details on the ODEs and parameters, please see the blue collapsed boxes at the end of this section.

1. Internal ROS Dynamics

We modeled the internal concentration of ROS (\(\text{ROS}_{\text{int}}\)) by considering two main factors:

  • Diffusion of ROS into the Cell: External ROS (\(\text{ROS}_{\text{ext}}\)) can diffuse into the cell, increasing the internal ROS concentration.
  • Detoxification of ROS: Antioxidants produced by the cell can detoxify ROS, decreasing its concentration.

Since we could not find specific values for the diffusion and degredation rate of ROS in the literature, we treated ROS levels as unitless measures. This simplification allows us to focus on the relative changes in ROS concentrations without needing exact units.

2. Transcription Factor Activation

TFs like OxyR are activated by ROS through oxidation. The activated TFs can then bind to promoter regions to regulate gene expression. We assume in our model that the total TF concentration is in steady state, i.e. that the total concentration of activated and inactivated TF remains constant over time

We modeled the activation of the TF by ROS using Hill-like kinetics that incorporate cooperativity:

  • Activation Rate: The rate at which the inactive TF (\(\text{TF}_{\text{inact}}\)) is converted to its active form (\(\text{TF}_{\text{act}}\)) depends on the internal ROS concentration.
  • Michaelis Constant (\(K_{m_{\text{TF}}}\)): This parameter represents the ROS concentration at which the activation rate is half-maximal. From the literature, we used a value of 43 [1].
  • Hill Coefficient (\(n_{\text{TF}}\)): The Hill coefficient reflects the cooperativity of the activation process. Based on the literature, we used a Hill coefficient of 1.4 [1], suggesting slight cooperativity in the activation of OxyR by ROS. A value greater than 1 indicates positive cooperativity, meaning for this case that the binding of one ROS molecule enhances the binding of others.

3. Gene Expression

The activated transcription factors bind to promoter regions to initiate the transcription of target genes, like an antioxidant or GFP.

We used GFP as a reporter gene and modeled its transcription and translation:

  • Transcription of GFP mRNA: The rate of GFP mRNA production depends on the concentration of active transcription factor and its binding affinity to the promoter.
  • Translation of GFP Protein: GFP mRNA is translated into GFP protein at a certain rate.
  • Cooperativity in Transcription: We assumed a higher Hill coefficient (\(n_{\text{gfp}} = 4\)) for the transcription process to reflect cooperativity. This is a reasonable assumption because multiple transcription factor molecules may bind to the promoter region, and OxyR forms dimers of dimers that interact with two binding sites[2]. We will discuss the impact of this assumption in the limitations section.

4. Simulating the Impact of Binding Affinity

To simulate the impact of promoter mutations that reduce transcription factor binding affinity, we adjusted the Michaelis constant \(K_{m_{\text{gfp}}}\) in our model: a higher \(K_{m_{\text{gfp}}}\) represents a lower binding affinity, meaning that a higher concentration of active transcription factor is needed to achieve the same level of binding to the promoter. Figure 1 shows the ratio of GFP expression at steady state at ROS concentrations of 1 divided by the GFP expression at ROS concentrations of 0.5 for different \(K_{m_{\text{gfp}}}\) values ranging from 0 to 1 (see Model Details for more information). Here, a higher ratio indicates a clearer signal specific to higher ROS levels. We can see that reducing the binding affinity shifts the threshold of gene activation to higher ROS concentrations, while starting to plateau at higher \(K_{m_{\text{gfp}}}\) values.


Ratio of simulated GFP concentrations at 1 and 0.5 units of ROS
Figure 1. Ratio of simulated GFP concentrations at 1 and 0.5 units of ROS. The x-axis represents the \(K_{m_{\text{gfp}}}\) value, which is inversely proportional to the binding affinity of the transcription factor to the promoter. A higher \(K_{m_{\text{gfp}}}\) value indicates lower binding affinity of the activated TF to the promoter.

Figure 2 illustrates the simulated GFP expression profiles under different \(K_{m_{\text{gfp}}}\) values represented by curves in different colors. The ROS concentration was increased from 0.1 to 0.5 at time 100, and to 1 at time 200. Again, we observe that lower binding affinities shift the activation of GFP to higher ROS levels. For example, when setting \(K_{m_{\text{gfp}}}\) to 0.01, we can see a clear increase in GFP expression already at a ROS concentration of 0.1, while for \(K_{m_{\text{gfp}}}\) = 1, clear activation occurs only at a ROS concentration of 1. Additionally we can clearly observe a practical problem when setting the binding affinity too low, as the maximal GFP expression remains low even at high ROS concentrations.

When comparing to experimental data where the H2O2 concentration was increased after 2h, we can observe that the qualitative shape of the fluorescence curve is similar to the GFP expression profiles in Figure 2. This indicates that the model captures essential dynamics of the system. You can find the relevant experimental data in Figure 3 of the results page.

Overall, our model demonstrates that reducing the binding affinity of transcription factors to their promoters is a viable strategy to control gene expression in response to ROS. But it also highlights the importance of finding the right balance in finding the optimal binding affinity to achieve specificity to high ROS levels while maintaining sufficient gene expression levels.


Simulated GFP expression profiles under different Km values
Figure 2. Simulated GFP expression profiles under different \(K_{m_{\text{gfp}}}\) values. The ROS concentration was increased from 0.1 to 0.5 at time 100, and to 1 at time 200.

5. Limitations and Future Directions

While our computational model provides qualitative insights into how decreasing transcription factor (TF) binding affinity can shift gene activation thresholds in response to reactive oxygen species (ROS), several limitations need to be acknowledged:

  • Simplified ROS Dynamics: The model treats ROS concentrations as unitless measures due to the lack of specific values for diffusion and degradation rates of ROS in the literature. This simplification limits the model's quantitative predictive power.
  • Assumptions of Steady-State Conditions: We assume that the total concentration of active and inactive TF remains constant over time. In a real cellular environment, TF synthesis and degradation, as well as other dynamic processes, can affect TF concentrations.
  • No Additional Regulatory Mechanisms: The model focuses on the direct activation of gene expression by TFs in response to ROS, without accounting for other regulatory factors such as interactions with other proteins that could influence gene expression.
  • Sensitivity to Cooperativity: The model assumes a relatively high Hill coefficient (\( n_{\text{gfp}} = 4 \)) for the transcription of GFP to reflect cooperativity. The choice of this parameter determines the switch like behavior of the system. As seen in Figure 3 for \(n_{\text{gfp}} = 1 \), the activation of GFP is less sharp, reaching only a ratio of about 1.55 at \(K_{m_{\text{gfp}}} = 1 \) instead of 11. Experimental determination of the actual cooperativity would thus greatly improve the model's accuracy.
  • Experimental Validation: The model's predictions have not been validated experimentally. Laboratory experiments are essential to confirm whether mutations that reduce TF binding affinity indeed shift gene activation thresholds as predicted. Such validation would not only confirm the model's accuracy but also provide data to refine model parameters.

In total, while our model provides useful qualitative insights into the modulation of gene expression through TF binding affinity in response to ROS, addressing these limitations is necessary for improving its accuracy and applicability. Future work should focus on incorporating more accurate parameters, considering additional regulatory mechanisms, and validating predictions experimentally.


Simulated GFP expression profiles under different Km values
Figure 3. Simulated GFP expression profiles under different \(K_{m_{\text{gfp}}}\) values. The ROS concentration was increased from 0.1 to 0.5 at time 100, and to 1 at time 200.

Model Details

Parameter Details
Parameter Definitions
Variable Description
\(\text{ROS}_{\text{int}}\) Internal concentration of ROS inside the cell.
\(\text{ROS}_{\text{ext}}\) External concentration of ROS outside the cell. This represents the ROS level in the environment that can diffuse into the cell.
\(\text{TF}_{\text{act}}\) Concentration of the active form of the transcription factor. This form is capable of binding to promoter regions and initiating gene expression.
\(\text{TF}_{\text{inact}}\) Concentration of the inactive form of the transcription factor. This form cannot bind DNA until activated by ROS.
\(\text{mRNA}_{\text{gfp}}\) Concentration of mRNA encoding GFP. This reflects the level of gene expression for GFP.
\(\text{mRNA}_{\text{antiox}}\) Concentration of mRNA encoding an antioxidant protein. This reflects the level of gene expression for antioxidant proteins, which help detoxify ROS.
\(\text{GFP}\) Concentration of GFP protein. GFP is used as a reporter protein to measure gene expression levels.
\(\text{Antioxidant}\) Concentration of antioxidant protein. This protein helps neutralize ROS and reduce oxidative stress within the cell.
\(D\) Diffusion rate of ROS into the cell. This controls how fast ROS enters the cell from the external environment.
\(d_{\text{ROS}_{\text{int}}}\) Detoxification rate of internal ROS. It determines how quickly ROS is neutralized within the cell under the presence of Antioxidants.
\(k_{\text{ROS}_{\text{int}}}\) Rate constant for ROS-induced activation of the transcription factor. This determines how fast ROS activates the transcription factor.
\(K_{m_{\text{TF}}}\) Michaelis constant for transcription factor activation. It represents the concentration of ROS at which the activation rate is half of its maximum.
\(n_{\text{TF}}\) Hill coefficient for transcription factor activation. It indicates the cooperativity of the activation process; higher values suggest stronger cooperative binding.
\(V_{\text{max}_{\text{gfp}}}\) Maximum transcription rate of GFP mRNA. This is the highest possible rate at which GFP mRNA can be produced.
\(K_{m_{\text{gfp}}}\) Michaelis constant for GFP transcription. It represents the transcription factor concentration at which GFP mRNA production is half of its maximum rate.
\(n_{\text{gfp}}\) Hill coefficient for GFP transcription. It reflects the cooperativity of transcription factor binding to the GFP promoter.
\(V_{\text{max}_{\text{antiox}}}\) Maximum transcription rate of antioxidant mRNA. This is the highest possible rate at which antioxidant mRNA can be produced.
\(K_{m_{\text{antiox}}}\) Michaelis constant for antioxidant mRNA transcription. It represents the transcription factor concentration at which antioxidant mRNA production is half of its maximum rate.
\(k_{\text{trans}_{\text{gfp}}}\) Translation rate constant for GFP protein. This controls the rate at which GFP protein is produced from its mRNA.
\(d_{\text{GFP}}\) Degradation rate of GFP protein. This represents how quickly GFP protein is broken down and removed from the cell.
\(k_{\text{trans}_{\text{antiox}}}\) Translation rate constant for antioxidant protein. This controls the rate at which antioxidant protein is produced from its mRNA.
\(d_{\text{Antioxidant}}\) Degradation rate of the antioxidant protein. This represents how quickly the antioxidant protein is broken down and removed from the cell.
Parameter Values
Parameter Value Unit Explanation
\(D\) 0.1 min\(^{-1}\) Represents an arbitrary diffusion rate.
\(d\_ROS\_int\) 0.02 min\(^{-1}\) Represents an arbitrary degredation rate. Chosen to roughly match the experimentally observed time until steady state was reached.
\(k\_ROS\_int\) 582 \(\mu\)M\(^{-1}\) min\(^{-1}\) Derived from the OxyR activation rate, ensuring rapid and efficient transcription factor activation in response to ROS[1].
\(d\_TF\_act\) 3 min\(^{-1}\) Set to facilitate timely deactivation of the active transcription factor, aligning with experimental observations of steady-state dynamics.
\(V\_max\_gfp\) 4.2 \(\mu\)M·min\(^{-1}\) Represents the maximum transcription rate of GFP mRNA, estimated based on typical transcription rates in *E. coli*.
\(d\_mRNA\_gfp\) 0.23 min\(^{-1}\) Increased to reflect a realistic mRNA degradation rate, ensuring proper turnover and preventing excessive accumulation of GFP mRNA.
\(V\_max\_antiox\) 2.1 \(\mu\)M·min\(^{-1}\) Estimated maximum transcription rate of antioxidant mRNA, accounting for the induction of protective mechanisms against ROS.
\(K\_m\_antiox\) 0.4 \(\mu\)M Represents a moderate affinity for antioxidant mRNA synthesis, balancing responsiveness to ROS levels.
\(d\_mRNA\_antiox\) 0.23 min\(^{-1}\) Set to ensure proper degradation of antioxidant mRNA, maintaining dynamic regulation of antioxidant protein levels.
\(k\_trans\_gfp\) 3.78 min\(^{-1}\) Estimated translation rate of GFP, facilitating adequate protein synthesis in response to mRNA levels.
\(d\_GFP\) 0.2 min\(^{-1}\) Increased to ensure timely degradation of GFP protein, preventing unrealistic accumulation and aligning with experimental steady-state observations.
\(k\_trans\_antiox\) 3.78 min\(^{-1}\) Estimated translation rate of antioxidant protein, ensuring efficient synthesis in response to antioxidant mRNA levels.
\(d\_Antioxidant\) 0.2 min\(^{-1}\) Increased antioxidant protein degradation rate to maintain appropriate protein levels and dynamic response to oxidative stress.
\(n\_TF\_act\) 1.4 Dimensionless Hill coefficient for TF activation, indicating slight positive cooperativity in response to ROS levels[1].
\(K\_m\_TF\_act\) 43.0 \(\mu\)M Michaelis constant for TF activation, representing the ROS concentration at which TF activation is half-maximal[1].
\(n\_mRNA\_gfp\) 4 Dimensionless Fixed Hill coefficient for GFP mRNA synthesis, reflecting cooperative binding effects in transcriptional regulation.
Full Model Equations
Internal ROS Dynamics

$$ \frac{d \text{ROS}_{\text{int}}}{dt} = D \left( \text{ROS}_{\text{ext}}(t) - \text{ROS}_{\text{int}} \right) - d_{\text{ROS}_{\text{int}}} \cdot \text{ROS}_{\text{int}} \cdot \frac{ \text{Antioxidant} }{1 + \text{Antioxidant}} $$ This equation describes the change in internal ROS concentration, where \(D\) is the diffusion rate (treated as a scaling factor), and \(d_{\text{ROS}_{\text{int}}}\) represents the detoxification rate of ROS, enhanced by the presence of antioxidants.

Transcription Factor Activation

$$ \frac{d \text{TF}_{\text{act}}}{dt} = \frac{ k_{\text{ROS}_{\text{int}}} \left( \text{ROS}_{\text{int}} \right)^{ n_{\text{TF}} } \text{TF}_{\text{inact}} }{ K_{m_{\text{TF}}}^{ n_{\text{TF}} } + \left( \text{ROS}_{\text{int}} \right)^{ n_{\text{TF}} } } - d_{\text{TF}} \cdot \text{TF}_{\text{act}} $$ This equation models the activation of the transcription factor by ROS, considering cooperativity in the activation process.

Inactive Transcription Factor Dynamics

$$ \frac{d \text{TF}_{\text{inact}}}{dt} = -\frac{ k_{\text{ROS}_{\text{int}}} \left( \text{ROS}_{\text{int}} \right)^{ n_{\text{TF}} } \text{TF}_{\text{inact}} }{ K_{m_{\text{TF}}}^{ n_{\text{TF}} } + \left( \text{ROS}_{\text{int}} \right)^{ n_{\text{TF}} } } + d_{\text{TF}} \cdot \text{TF}_{\text{act}} $$ This accounts for the decrease in inactive TF as it is converted to the active form.

mRNA Dynamics for GFP

$$ \frac{d \text{mRNA}_{\text{gfp}}}{dt} = \frac{ V_{\text{max}_{\text{gfp}}} \left( \text{TF}_{\text{act}} \right)^{ n_{\text{gfp}} } }{ K_{m_{\text{gfp}}}^{ n_{\text{gfp}} } + \left( \text{TF}_{\text{act}} \right)^{ n_{\text{gfp}} } } - d_{\text{mRNA}_{\text{gfp}}} \cdot \text{mRNA}_{\text{gfp}} $$ This models the transcription of GFP mRNA, incorporating cooperativity in TF binding to the promoter.

mRNA Dynamics for Antioxidant

$$ \frac{d \text{mRNA}_{\text{antiox}}}{dt} = \frac{ V_{\text{max}_{\text{antiox}}} \cdot \text{TF}_{\text{act}} }{ K_{m_{\text{antiox}}} + \text{TF}_{\text{act}} } - d_{\text{mRNA}_{\text{antiox}}} \cdot \text{mRNA}_{\text{antiox}} $$ This describes the transcription of antioxidant mRNA.

Protein Dynamics for GFP

$$ \frac{d \text{GFP}}{dt} = k_{\text{trans}_{\text{gfp}}} \cdot \text{mRNA}_{\text{gfp}} - d_{\text{GFP}} \cdot \text{GFP} $$ This models the translation of GFP from its mRNA and its degradation.

Protein Dynamics for Antioxidant

$$ \frac{d \text{Antioxidant}}{dt} = k_{\text{trans}_{\text{antiox}}} \cdot \text{mRNA}_{\text{antiox}} - d_{\text{Antioxidant}} \cdot \text{Antioxidant} $$ This models the translation of the antioxidant protein and its degradation.

Derivation of the Steady State

To determine the steady-state concentrations of \( \text{ROS}_{\text{int}} \), \( \text{TF}_{\text{act}} \), and GFP, we set the derivatives of these variables to zero and solve the resulting equations.

1. Steady-State of \( \text{ROS}_{\text{int}} \): At steady-state, the rate of change of \( \text{ROS}_{\text{int}} \) is zero:
$$ 0 = D (\text{ROS}_{\text{ext}} - \text{ROS}_{\text{int}}) - d_{\text{ROS}_{\text{int}}} \cdot \text{ROS}_{\text{int}} \left( \frac{\text{Antioxidant}}{1 + \text{Antioxidant}} \right) $$
Rearranging terms to solve for \( \text{ROS}_{\text{int}} \):
$$ \text{ROS}_{\text{int}} = \frac{D \cdot \text{ROS}_{\text{ext}}}{D + d_{\text{ROS}_{\text{int}}} \cdot Q} $$
where \( Q = \frac{\text{Antioxidant}}{1 + \text{Antioxidant}} \).

2. Steady-State of \( \text{TF}_{\text{act}} \): Balancing the activation and deactivation of the transcription factor at steady-state:
$$ \text{TF}_{\text{act}} = \frac{k_{\text{ROS}_{\text{int}}} \cdot \text{ROS}_{\text{int}}^{n_{\text{TF}}}}{d_{\text{TF}} \cdot (K_{m_{\text{TF}}}^{n_{\text{TF}}} + \text{ROS}_{\text{int}}^{n_{\text{TF}}}) + k_{\text{ROS}_{\text{int}}} \cdot \text{ROS}_{\text{int}}^{n_{\text{TF}}}} $$

3. Steady-State of Antioxidant: The concentration of Antioxidant at steady-state is derived from the steady-state of its mRNA:
$$ \text{Antioxidant} = C_{\text{antiox}} \cdot \frac{\text{TF}_{\text{act}}}{K_{m_{\text{antiox}}} + \text{TF}_{\text{act}}} $$
where \( C_{\text{antiox}} = \frac{k_{\text{trans}_{\text{antiox}}} \cdot V_{\text{max}_{\text{antiox}}}}{d_{\text{Antioxidant}} \cdot d_{\text{mRNA}_{\text{antiox}}}} \).

4. Steady-State of GFP: The steady-state GFP concentration is determined from \( \text{mRNA}_{\text{gfp}} \):
$$ \text{mRNA}_{\text{gfp}} = \frac{V_{\text{max}_{\text{gfp}}} \cdot \text{TF}_{\text{act}}^{n_{\text{gfp}}}}{K_{m_{\text{gfp}}}^{n_{\text{gfp}}} + \text{TF}_{\text{act}}^{n_{\text{gfp}}}} \cdot \frac{1}{d_{\text{mRNA}_{\text{gfp}}}} $$
$$ \text{GFP} = \frac{k_{\text{trans}_{\text{gfp}}} \cdot \text{mRNA}_{\text{gfp}}}{d_{\text{GFP}}} $$

Why Numerical Methods Are Necessary: The steady-state equations exhibit circular dependencies, making it impossible to isolate and solve for each variable independently:

  • \( \text{ROS}_{\text{int}} \) depends on \( \text{Antioxidant} \) through the term \( Q = \frac{1}{1 + \text{Antioxidant}} \).
  • \( \text{Antioxidant} \) depends on \( \text{TF}_{\text{act}} \), which in turn depends on \( \text{ROS}_{\text{int}} \).
  • \( \text{TF}_{\text{act}} \) is influenced by \( \text{ROS}_{\text{int}} \), creating a loop where each variable's steady-state value is interdependent.
These interdependencies create a system of non-linear equations that cannot be decoupled or simplified into a form solvable by basic algebraic methods.

The presence of non-linear relationships and circular dependencies between \( \text{ROS}_{\text{int}} \), \( \text{TF}_{\text{act}} \), and Antioxidant prevents the derivation of closed-form analytical solutions. Numerical methods, such as `fsolve` from the `scipy.optimize` package, help us to iteratively approximate the steady-state concentrations.

Implementation Overview: The steady-state estimation is implemented using the following open source Python packages:

  • NumPy: For numerical operations and handling arrays.
  • SciPy: Specifically, the `fsolve` function from `scipy.optimize` is used to solve the non-linear equations numerically.
  • Matplotlib: For plotting the resulting GFP concentration ratios.
The implementation involves defining a residual function that uses the steady-state equations. `fsolve` iteratively adjusts the guess for \( \text{ROS}_{\text{int}} \) until the residuals approach zero, effectively finding the roots of the system. Once \( \text{ROS}_{\text{int}} \) is determined, \( \text{TF}_{\text{act}} \) and GFP concentrations are calculated based on the derived relationships.


3D Structure Based Modeling the Impact of Mutations on TF-DNA Binding Affinity

Next, we aimed to model the change in binding affinity of transcription factors to the promoter for mutations in the binding site. We started out by determining the structure of the DNA-TGA2 complex. TGA2 is a repressor that binds to DNA in its reduced state, and disassociates upon oxidation by ROS. We employed AlphaFold 3, a deep learning based tool for protein structure predictions [3]. . Figure 4 illustrates the 3D structures of the TGA2-DNA complex generated from five different AlphaFold model runs. These aligned 3D structures as predicted by AlphaFold show that the TF consistently binds to the same DNA site, demonstrating the robustness of the predictions across model runs.


Alphafold structure of TGA2
Figure 4. 3D visualization of five aligned AlphaFold models of the TGA2-DNA complex.

Using this structural data, we employed SampDI 3D to calculate changes in binding free energy (ΔΔG) for single nucleotide mutations within the binding site [4]. Figure 5 shows the SampDI 3D output, displaying only positive ΔΔG values, which indicate changes that do not enhance binding strength. Our analysis revealed that none of the mutations increased binding strength between TGA2 and its DNA binding site. This suggests that enhancing the sensitivity of TGA2 to its binding site through mutational changes in the promoter sequence is not feasible. Thus, the only viable mechanism to decrease TGA2's sensitivity to ROS is by increasing the occurrence of its binding site within the genome.


SampDI-3D results
Figure 5. SampDI 3D output showing positive ΔΔG values for single nucleotide mutations around the TGA2 binding site. Each box plot represents the distribution of ΔΔG values for mutations at a specific position in the binding site. The y-axis indicates the ΔΔG values, where higher values suggest a decrease in binding affinity. The x-axis represents the nucleotide positions within the promoter.

Following our progress with TGA2, we extended our modeling efforts to other transcription factors, specifically OxyR and Yap1, which are central regulators in bacterial and yeast oxidative stress responses, respectively. Both OxyR and Yap1 require oxidized cysteine residues to undergo conformational changes that enable them to bind DNA. However, modeling these oxidized states presented a significant challenge. AlphaFold 3 cannot explicitly represent oxidized cysteine residues. To overcome this limitation, we explored by the recommendation of Prof. Dr. Pablo Rivera-Fuentes the use of RoseTTAFold All-Atom. This is a structure prediction model similar to AlphaFold 3, but capable of representing atomized versions of amino acid sidechains, including their oxidized states. For this, we had to set up the RoseTTAFold on an HPC cluster, as the no web server was available for this model. Unfortunately, while the provided test examples resulted in high quality structures, the structures for TGA2, OxyR and Yap1 were of insufficient quality for reliable ΔΔG calculations. Figure 6 illustrates one of these predicted structures, where significant inaccuracies are apparent, such as a broken DNA strand. These structural inconsistencies prevented us from obtaining reliable absolute values for changes in ΔΔG, thus halting our progress with these specific transcription factors using this method.


RoseTTAFold results
Figure 6. 3D structure of TGA2-DNA complex as predicted by RoseTTAFold All Atom.

Sequence Based TF-DNA Binding Modeling

Given the challenges with structural modeling, we adopted an alternative approach to design meaningful mutations for the promoters of these transcription factors, focusing on OxyR as a case study. To start, we obtained a Position Frequency Matrix (PFM) from RegulonDB. RegulonDB is a database that curates regulatory interactions in E. coli, including detailed information about transcription factor binding sites[5]. PFMs are derived from experimental data by aligning known binding site sequences of a given transcription factor. The PFM simply provides the count of each of the four nucleotides at every position in the aligned binding motifs.

Next, we used this data to derive a Position Specific Scoring Matrix (PSSM)[8] as implemented in biopython [9]. The entries of the matrix are calculated based on the observed frequencies of each nucleotide at each position in the PFM, adjusted by the background frequencies of nucleotides in the genome. To account for the general nucleotide composition of the genome, we introduce background frequencies \(p_b\), representing the expected probability of nucleotide \(b\) in a random sequence.

The PSSM value for position \(i\) and base \(b\)is derived by calculating the logarithm of the ratio of the observed frequency \(f_{b,i}\) to the background frequency \(p_b\). Specifically, when using base-2 logarithms to express the scores in bits, the PSSM can be defined as:

$$\text{PSSM}[i,b] = \log_2 \left( \frac{f_{b,i}}{p_b} \right)$$

Positive scores indicate that a nucleotide is more frequent at that position in the binding sites than expected by chance, suggesting a favorable binding interaction. Negative scores suggest a lower-than-expected frequency, potentially weakening the interaction.

The score for a specific sequence is computed by summing the individual scores from the PSSM for each nucleotide in the sequence:

$$\text{Score} = \sum_{i=1}^{L} \text{PSSM}[i, \text{sequence}[i]]$$

By applying the PSSM to our promoter sequences, we can systematically assess how single nucleotide mutations affect the sequence similarity to known OxyR binding sites, providing indirect information on the expected effect on binding affinity. This analysis allows us to predict and design mutations that reduce the sensitivity of the promoter to oxidative stress, aligning with our goal of fine-tuning the expression profile of target genes in response to varying reactive oxygen species (ROS) levels.


Motif logo
Figure 7. Sequence logo representing the binding motif of E. coli OxyR.

To visualize the nucleotide preferences at each position, we generated a sequence logo (Figure 7) based on the PFM with the help of the MEME Suite [7]. In a sequence logo, the overall height of the stack at each position represents the information content (measured in bits), indicating the degree of conservation at that position, while the height of each nucleotide symbol within the stack reflects its relative frequency[8]. This provides an intuitive visual summary of the binding motif, highlighting the most conserved positions and the preferred nucleotides.

The sequence logo (Figure 7) graphically represents the PFM of OxyR binding sites. Each position in the logo corresponds to a position in the binding site, and the size of each letter indicates the relative frequency of that nucleotide at that position. The total height of the letters at each position reflects the information content, calculated using the formula:

$$I_{\text{seq}}(i) = \sum_{b} f_{b,i} \log_2 \left( \frac{f_{b,i}}{p_b} \right)$$

This equation computes the information content \(I_{\text{seq}}(i)\) at position \(i\) by summing over all nucleotides \(b\), where \(f_{b,i}\) is the frequency of nucleotide \(b\) at position \(i\), and \(p_b\) is the background frequency of nucleotide \(b\)[7]. The information content measures how much the nucleotide distribution at a position differs from the background distribution. Positions with high information content are highly conserved and likely crucial for binding specificity.

By analyzing the sequence logo, we can identify positions within the binding site that are highly conserved and therefore critical for OxyR binding. This insight guides the design of promoter mutations by indicating which positions are less likely to tolerate changes without affecting binding affinity.

Limitations

While the PSSM-based modeling approach provides a quantitative method to predict the impact of mutations on OxyR binding affinity, there are some limitations to consider in the context of our project. Firstly, the PSSM assumes that nucleotide positions contribute independently to the overall binding affinity. However, in reality, there may be interactions between positions (positional dependencies) that affect binding, which the PSSM does not capture.

Secondly, the PSSM is derived from a finite set of known binding sites obtained from RegulonDB. This dataset may not fully represent the complete spectrum of OxyR binding preferences, potentially biasing the PSSM. Any uncharacterized binding sites or sequence variations not included in the dataset could lead to gaps in our understanding of OxyR specificity, affecting the reliability of our predictions.

Additionally, the background nucleotide frequencies \(p_b\) used in the PSSM calculation are assumed to be uniform or based on general genomic composition. However, local variations in nucleotide composition around promoter regions could influence binding site characteristics. Using a more context-specific background model might improve the accuracy but requires additional data that may not be readily available.

Finally, while PSSM scores provide an indication of binding affinity changes due to mutations, they do not directly translate to changes in gene expression levels. The relationship between transcription factor binding affinity and transcriptional output is complex and can be non-linear. Environmental factors, cellular conditions, and feedback mechanisms can all modulate gene expression independently of binding affinity. Here a cell model as described above can help to assess the impact of the mutations on gene expression.

Therefore, experimental validation is essential to confirm the expected effects of designed mutations on promoter activity and gene expression in response to oxidative stress. This we aimed to achieve as described in the next section.

Verification in the Wet Lab

Finally we were able to show that the PSSM scores can be successfully used to design mutations in the promoter region of the OxyR. We used the scores to design mutations in the promoter region of the OxyR promoter. In Figure 8, we show the PSSM scores for the promoter region between the GFP and OxyR coding region. The y-axis represents the scores, and the x-axis represents nucleotide positions relative to the start of the GFP coding region. Red dots show scores higher than seven. We can see that a few targeted point mutations could successfully be used to modifiy the scores. Next, we had to synthesize these sequences with the computationally determined mutations. A technical limitation here is that longer repetitive sequences, like the duplicated promoter region, cannot be synthesized. To avoid this limitation, we had to add additional mutations that should not alter the promoter function. To make sure not only the OxyR, but also the RNA polymerase binding continues to function, we also scored the impact of potential changes on the -10 and -35 binding sites using the same methodology as described above for OxyR. In the wet lab, we were able to verify the functionality of the mutated promoter regions by measuring the expression of the downstream GFP gene. This shows that this method The results of these experiments are presented on the results page.


score_no_mutation score_mut
Figure 8. PSSM-score for for OxyR for the promoter region between the GFP and OxyR coding region. The y-axis represents the scores, and the x-axis represents nucleotide positions relative to the start of the GFP coding region. Red dots show scores higher than seven. The top plot shows the scores for the original promoter region, the bottom plot shows the scores for the mutated promoter region.

Bibliography

  1. Lee, C., et al. Redox regulation of OxyR requires specific disulfide bond formation involving a rapid kinetic reaction path. Nat Struct Mol Biol 11, 1179–1185 (2004).
  2. Pomposiello, P. J. & Demple, B. Redox-operated genetic switches: the SoxR and OxyR transcription factors. Trends in Biotechnology 19, 109–114 (2001).
  3. Abramson, Josh, et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature (2024): 1-3.
  4. Li, Gen, et al. "SAMPDI-3D: predicting the effects of protein and DNA mutations on protein–DNA interactions." Bioinformatics 37.21 (2021): 3760-3765.
  5. Salgado, Heladia, et al. "RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions." Nucleic acids research 34.suppl_1 (2006): D394-D397.
  6. Ogilvie, Huw A. "Position-Specific Score Matrices." Species and Gene Evolution, Department of Computer Science, Rice University, 11 Sept. 2018, cs.rice.edu/~ogilvie/comp571/pssm/. Accessed 2 May 2024.
  7. Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.
  8. Stormo, G.D. (2000). DNA binding sites: representation and discovery. Bioinformatics, 16(1), 16–23.
  9. Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 https://doi.org/10.1093/bioinformatics/btp163 pmid:19304878