Project Description SJTU-software

1 Introduction

Our project's primary objective is to identify key enzymes that limit spinosad production in the $ Saccharopolyspora\:spinosa $ metabolic pathway due to mismatches between the optimal working temperature of the enzymes and the bacterial growth temperature, and to perform targeted experimental modifications on these enzymes. This requires the development of two modeling components: one being a HEATMAP-AI, the main part of our project, which utilizes deep learning networks to predict the optimal working temperature of enzymes based on their sequence information (detailed modeling content will be introduced in the AI section), and the other being the construction of an enzyme and temperature constrained genome-scale metabolic model, etcGEM in short, for $ Saccharopolyspora\:spinosa $. In this page, we mainly discuss the modeling principles, while the specific process for constructing the ecGEM will be presented in the Engineering section.

2 GEM

Genome-scale metabolic models (GEMs) provide a computational representation of the gene-protein-reaction associations for an organism's entire metabolic network and enable simulations to predict metabolic fluxes across various system-level metabolic studies. Within an organism, metabolic genes derived from annotated genomes and metabolic knowledge lead to metabolic reactions, which progressively convert basic elements such as carbon, hydrogen, and oxygen into cellular components and secondary metabolites or facilitate the degradation of compounds, forming metabolic reaction pathways. By integrating all metabolic reactions through shared metabolites, a metabolic network of the organism of interest can be constructed. This network is then converted into a stoichiometric matrix (S matrix), where rows represent metabolites, columns represent reactions, and each entry represents the reaction coefficient of a particular metabolite in a reaction. A cellular objective function is required to compute a feasible metabolic flux that optimizes the model's objective (in our modeling, the objective function chosen is to optimize the growth rate, represented by the biomass function). Using the previously constructed S matrix and the model's objective function, the flux distribution can be solved. Finally, additional constraints (such as the steady-state assumption and feasible ranges of metabolic flux) are applied to narrow the solution space and find the optimal rates for all metabolic reactions. In this project, we use an existing GEM model for $ Saccharopolyspora\:spinosa $ constructed with cobrapy as the initial data for our modeling efforts.

The enzyme-constrained genome-scale metabolic model (ecGEM) provides a more accurate simulation of cellular metabolic processes compared to traditional GEMs by incorporating enzymatic kinetic parameters, such as enzyme catalytic rates and enzyme quantity limitations. In an ecGEM model, the kcat parameter is introduced, transforming the stoichiometry of a reaction ‘$ {A\longrightarrow B} $’ into ‘$ {A+1/\left ( k_{cat}\cdot E \right ) \longrightarrow B} $’ thereby integrating the required enzyme usage into the metabolic reaction pathways, with protein levels serving as the upper limit for each enzyme's usage. Consequently, in the ecGEM model, submatrices appear within the new stoichiometric matrix: the upper left submatrix corresponds to the original stoichiometric matrix, the upper right submatrix contains only zeros, the lower left submatrix holds the kinetic information, and the lower right submatrix is the identity matrix.

In this project, to identify key enzyme systems affecting spinosad production in $ Saccharopolyspora\:spinosa $ from the genome-scale metabolic network, we integrate previously obtained GEM data with UniProt data for $ S.\:spinosa $ (uniprot.tsv) and KEGG pathway information (kegg.tsv). The ecGEM model is constructed using functions from the GECKO Toolbox 3.0 developed by Yu Chen et al., implemented in MATLAB. We then populate the kcat values for each enzyme either by similarity search from the BRENDA database or by prediction using DLkcat. The protein costs, expressed as molecular weight over kcat, are incorporated into all enzymatic reactions in the model. Finally, the model parameters are adjusted based on physiological data , resulting in a functional ecModel.

3 Parameter

4 Principles&Assumptions

4.1 Principle of Enzyme Resource Allocation

The sum of enzyme resources ($ {R_{E}} $) allocated to all metabolic reactions must not exceed the total available resource pool within the cell ($ {R_{pool}} $). Mathematically, this is expressed as: $R_{E, 1} + R_{E, 2} + . . . + R_{E, p} \leq R_{P o o l}$ This principle ensures that the total amount of enzyme resources assigned to various metabolic reactions is within the limits of the cell's total resource capacity.

4.2 Steady-state assumption

Under the steady-state assumption, the product of the stoichiometric matrix (S-matrix) and the reaction rate vector (v-vector) is assumed to be zero: $S \cdot v = 0$ This implies that the concentrations of metabolites remain constant over time, as there are no net changes in metabolite concentrations under steady-state conditions.

4.3 Feasible ranges of metabolic flux

The constraints on the flux $ v_{i} $ for each of the i-th metabolic reaction must fall between $ {a_{i}} $ and $ {b_{i}} $ : $a_{i} \leq v_{i} \leq b_{i}$

4.4 Enzyme demand constraints

The enzyme demand for each reaction is given by: $(\frac{k_{c a t, j i}}{M W_{j}}) \cdot v_{i} - e_{j} = 0$ This constraint ensures that the enzyme demand (i.e., enzyme concentration or abundance) equals the reaction rate multiplied by the time each enzyme takes to catalyze a substrate molecule (i.e., the ratio of molecular weight to catalytic constant).

4.5 a maximum constraint on the protein pool:

Protein pool should be less than or equal to $ {\sigma \times f\times P_{tot}} $ and must also be less than or equal to 0: $σ \times f \times P_{t o t} \leq p o o l \leq 0$

5 Model Construction

In the ecGEM model, the following three fundamental principles are applied:

5.1 Mass Conservation Principle for Metabolites

$S_{m 1} \times v_{1} + \dots + S_{m n} \times v_{n} = 0$ Under steady-state conditions, the mass conservation principle dictates that the sum of the stoichiometric coefficients of a metabolite across all reactions, multiplied by their respective fluxes, must be zero. This ensures that the rate of production of the metabolite equals its rate of consumption, thereby maintaining a constant concentration of the metabolite, i.e., the metabolite is at a steady state.

5.2 Enzyme Resource Allocation Principle

$(\frac{k_{c a t, j i}}{M W_{j}}) \times v_{i} - e_{j} = 0$ Within the cell, metabolic reactions require enzymatic catalysis, and both the production and maintenance of enzymes consume cellular resources. The ratio of molecular weight ($ {WM_{j}} $) to catalytic constant ($ {k_{cat}} $) reflects the time and resources needed for each enzyme molecule to complete one catalytic cycle. The enzyme demand for a reaction is determined by multiplying the reaction flux by the time required for each enzyme to catalyze one substrate molecule, which is the ratio of molecular weight to catalytic constant.

5.3 Resource Pool Constraint

$R_{E, 1} + \dots R_{E, p} = R_{p o o l}$ Metabolic reactions consume cellular resources in the form of enzyme requirements. Here, REidenotes the allocation of enzyme resources for the i-th metabolic reaction. The total enzyme resources allocated to all metabolic reactions must not exceed the total available resource pool ($ {R_{pool}} $) within the cell.

Figure 1:A. Metabolic network of an enzyme-constrained metabolic model, where the enzymes catalyzing reactions are treated as reactants (denoted as E), drawing protein mass from a shared protein pool through enzyme usage reactions. B. A system of equations derived from the metabolic network. (Image adapted from Fig. 1 of the GECKO Toolbox 3.0 documentation, with the magenta-highlighted sections representing the enzyme-constrained extensions in the GEM model.)

5.4 Main Steps for Model Construction

Step 1:

Import the existing GEM model into GECKO. Download the corresponding proteome and metabolic pathway information for $ Saccharopolyspora\:sp.\: ASAGF58 $ from KEGG and UniProt websites in the form of uniprot.tsv and kegg.tsv files, respectively. Subsequently, use DIAMOND to map the gene IDs in the GEM model to those in the UniProt database. Adjust the compatibility of the original GEM model using MATLAB, and construct the ecGEM model.

Step 2:

Integrate enzyme information obtained from the BRENDA database or predicted kcat values from the DLKcat package into the empty ecModel. Note that the use of DLKcat relies on Python and Docker images. Incorporate the flux of the enzyme reaction, represented by $ {-MW/k_{cat}} $, into the ecGEM model.

Step 3:

Fine-tune the existing parameters of the model by applying physiological data parameters. This step aims to better align protein-related data with the growth conditions of the organism, thereby enhancing the reliability of the ecGEM model. Specific methods include relaxing the maximum constraints on protein pools to optimize $ k_{cat} $ values for all proteins or performing sensitivity tuning on key $ k_{cat} $ values for minor optimization.

Step 4:

Evaluate the model's ability to replicate physiological states by simulating exchange fluxes at various growth rates. This final assessment determines the effectiveness of the model in mimicking physiological conditions.

6 Main Results

6.1 ecGEM models constructed at different stages

Initial GEM Model Integrated with Information from Uniprot.tsv and Kegg.tsv: ecGEM_1.yml

This model represents the initial GEM (Genome-scale Metabolic Model) that incorporates data from Uniprot.tsv and Kegg.tsv.

ecGEM Model Enhanced with BRENDA Database Information and $ k_{cat} $ Values Predicted via DLKcat Package: ecGEM_2.yml

This model integrates information from the BRENDA database and includes enzyme $ k_{cat} $ values predicted through the DLKcat package.

Final ecGEM Model with Sensitivity Tuning of $ k_{cat} $ Values and Adjustments for Better Replication of Physiological States: ecGEM_3.yml

This final model has undergone sensitivity tuning of $ k_{cat} $ values and adjustments to relax protein pool constraints to better replicate physiological states.

6.2 Partial Data Results of DIAMOND Alignment

By using the DIAMOND tool to perform a BLAST alignment between the fasta file (019D.fasta) of the microorganism used for modeling and the genome of that microorganism in the Uniprot database, we have successfully replaced the gene ID systems in model.genes and model.grRules with the standard gene ID system from Uniprot. This process resolves the issue of mismatched gene ID systems between the model.genes and model.grRules variables in the GEM model and the gene ID system in Uniprot.

The table below shows partial results: the first column displays the original gene.id in the GEM model, the second column lists the standard gene.id from the Uniprot database, and the subsequent columns present the relevant parameters for the successful alignment of the two gene sequences.

See diamond_matches.xlsx for complete data.

6.3 DLKcat Prediction Partial results for Enzyme $ k_{cat} $ Values

In Step 3, we used the DLKcat package to perform high-throughput $ k_{cat} $ value predictions for all metabolic enzymes present in the $ Saccharopolyspora\:spinosa $ eGEM model, based solely on substrate structures and protein sequences. The figure below shows a portion of the predicted results for various enzyme reactions. The "rxns" column indicates the biochemical reactions in which the predicted metabolic enzymes are involved, the "genes" column lists the Uniprot IDs of these enzymes, the "substrates" column specifies the substrates used by the enzymes in their catalytic reactions, and the "$ k_{cat} $" column presents the $ k_{cat} $ values predicted by DLKcat.

See DLKcatoutput.xlsx for complete data.

6.4 The Necessity of Model Tuning and Two Specific Approaches for Fine-Tuning

Figure A illustrates that under conditions where nutrient supply is not restricted, the growth rate of $ Saccharopolyspora\:spinosa $ is only $ {0.027674/hour} $, which does not accurately replicate the growth conditions of the microorganism. This indicates the need for fine-tuning the model to improve its simulation of real conditions. As shown in Figure B, relaxing the protein pool and nutrient constraints and setting the Protein Pool Usage to $ {1000 mg/g DCW} $ can enhance the simulation by increasing the overall $ k_{cat} $ values. Figure C demonstrates sensitivity tuning for $ k_{cat} $, which involves small-scale optimization of $ k_{cat} $ values. This approach iteratively increases the $ k_{cat} $ values of growth-limiting enzymes until the desired growth rate is achieved (the figure blow only shows 13 iterations, resulting in a 24-fold increase in simulated growth rate).

6.5 Simulating Exchange Fluxes at Various Growth Rates to Evaluate the Model's Replication of Physiological States

Figure A shows the simulation of exchange fluxes at various growth rates using the ecGEM model. Figure B illustrates the proportional usage of protein pools in the ecModel as the growth rate changes for the final constructed ecGEM model. Figure C presents the proportional usage of protein pools in the ecModel as the growth rate changes for the ecGEM model without adjusted protein pool constraints. Figure D depicts the proportional usage of protein pools in the ecModel as the growth rate changes for the ecGEM model without adjustments to kcat values.

The results demonstrate that the final ecGEM model adequately simulates the normal physiological state of $ Saccharopolyspora\:spinosa $ and progressively achieves a matched proportion of protein usage corresponding to the growth rates.

7 etcGEM

On the basis of ecGEM model, we introduced thermal parameters to ecGEM with the help of HEATMAP-AI to construct etcGEM, which is able to predict the grow situation in different temperature.

The constrution of etcGEM mainly focuses on two basic facts. First, the $ k_{cat} $ of single enzyme is different while temperature changes. Second, As the temperature rises or falls, more enzymes will lose activity, which leads to a decrease in the proportion of enzymes that can catalyze normally. However, quantifying these two changes is extremely challenging. If a laboratory experiment approach is taken, researchers would need to measure the $ k_{cat} $ values of the enzyme and the proportion of enzyme denaturation at multiple temperatures simultaneously, which is virtually impossible.

Therefore, we have introduced two functions to calculate this value.

7.1 Macromolecular rate theory

$k_{c a t} (T) \propto \frac{k_{B} T}{h} e^{- \frac{△ G^{‡} (T)}{R T}}$

The effect of temperature on $ k_{cat} $ values can be described with an expanded Arrhenius equation, by including a nonzero heatcapacity change between the transition state and the ground state of the enzyme catalytic process.

7.2 Two-state model

${[E]}_{N, i} = \frac{1}{1 + e^{- \frac{△ G_{u} (T)}{R T}}} {[E]}_{t . i}$

In such a model, a protein molecule could be either in a native state (N) or a denatured state (U), and an equilibrium state was assumed: N ↔ U.

${[E]}_{t, i} = {[E]}_{N, i} + {[E]}_{U, i}$

$ \left [ E \right ] _{t,i} $ is the concentration of enzyme i and $ \bigtriangleup G_{u} \left ( T \right ) $ is the free energy difference between the denatured state and the native state.

In summary, the values of $ \bigtriangleup G^{\ddagger } \left ( T \right ) $ and $ \bigtriangleup G_{u} \left ( T \right ) $ need to be determined in order to model the temperature dependence of enzyme activities, and they can be associated with six unknown parameters: $\bigtriangleup H_{T_{0} }^{\ddagger } $, $\bigtriangleup S_{T_{0} }^{\ddagger } $, $\bigtriangleup C_{p }^{\ddagger } $, $ \bigtriangleup H^{\ast } $, $ \bigtriangleup S^{\ast } $, $ \bigtriangleup C_{p,u } $.

To calculate the six unknown parameters and two functions, the enzyme optimal temperature $ T_{opt} $ and the melting temperature $ T_{m} $ is neccessary. However, due to the need for numerous repetitive experiments to measure the optimal temperature in wet labs, the currently available data is very scarce. Considering the low accuracy of existing models, we used our HEATMAP-AI to make high accuracy prediction.

With the help of cobrapy, we changed the primary matrix of metabolism in ecGEM in python. first, we replace the previous $ k_{cat} $ with $ k_{cat} \left ( T \right ) $ to simulate the change in enzyme activity. The value is caculated by Macromolecular rate theory as we mentioned before. Second, we calculate $ fN\left ( T \right ) =\left [ E \right ] _{N,i} /\left [ E \right ] _{t,i} $ by the Two-state model mentioned before to simulate the situation of denaturation.

8 Conclusion

By incorporating enzyme flux efficiency and protein pool constraints into the ecGEM model, we have successfully improved the simulation accuracy of $ Saccharopolyspora\:spinosa $ growth conditions compared to the original GEM. However, due to some initial differences between the GEM used and our target microorganism, coupled with the incomplete annotation of the proteome associated with this strain and the lack of sufficient wet-lab data for its growth under different physiological conditions, directly predicting key enzymes for spinosad production using the constructed ecGEM has limited significance. Nevertheless, the current ecGEM model for $ Saccharopolyspora\:spinosa $ can provide enzymatic information and guide metabolic engineering strategies under spinosad production-related chemical reactions. It can also quantitatively characterize the microbial phenotypic responses under industrial conditions for spinosad production and serve as a Heatmap to identify key enzymes whose activity is limited by temperature mismatches affecting spinosad yield.

9 Acknowledgments

We would like to thank Professor Hongzhong Lu from Shanghai Jiao Tong University for providing the GEM model of $ Saccharopolyspora\:spinosa $ as the initial material for our modeling efforts. We also extend our gratitude to Dr. Yu Chen from the Shenzhen Institute of Synthetic Biology, the principal designer of GECKO Toolbox 3.0, for his assistance in addressing various issues encountered during the model development

10 Reference

[1] Gu C, Kim GB, Kim WJ, Kim HU, Lee SY. Current status and applications of genome-scale metabolic models. Genome Biol. 2019 Jun 13;20(1):121. doi: 10.1186/s13059-019-1730-3. PMID: 31196170; PMCID: PMC6567666.

[2] Chen Y, Li G, Nielsen J. Genome-Scale Metabolic Modeling from Yeast to Human Cell Models of Complex Diseases: Latest Advances and Challenges. Methods Mol Biol. 2019;2049:329-345. doi: 10.1007/978-1-4939-9736-7_19. PMID: 31602620.

[3] Heinken A, Basile A, Hertel J, Thinnes C, Thiele I. Genome-Scale Metabolic Modeling of the Human Microbiome in the Era of Personalized Medicine. Annu Rev Microbiol. 2021 Oct 8;75:199-222. doi: 10.1146/annurev-micro-060221-012134. Epub 2021 Jul 27. PMID: 34314593.

[4] Heirendt L, Arreckx S, Pfau T, Mendoza SN, Richelle A, Heinken A, Haraldsdóttir HS, Wachowiak J, Keating SM, Vlasov V, Magnusdóttir S, Ng CY, Preciat G, Žagare A, Chan SHJ, Aurich MK, Clancy CM, Modamio J, Sauls JT, Noronha A, Bordbar A, Cousins B, El Assal DC, Valcarcel LV, Apaolaza I, Ghaderi S, Ahookhosh M, Ben Guebila M, Kostromins A, Sompairac N, Le HM, Ma D, Sun Y, Wang L, Yurkovich JT, Oliveira MAP, Vuong PT, El Assal LP, Kuperstein I, Zinovyev A, Hinton HS, Bryant WA, Aragón Artacho FJ, Planes FJ, Stalidzans E, Maass A, Vempala S, Hucka M, Saunders MA, Maranas CD, Lewis NE, Sauter T, Palsson BØ, Thiele I, Fleming RMT. Creation and analysis of biochemical constraint-based models using the COBRA Toolbox v.3.0. Nat Protoc. 2019 Mar;14(3):639-702. doi: 10.1038/s41596-018-0098-2. PMID: 30787451; PMCID: PMC6635304.

[5] Chen Y, Gustafsson J, Tafur Rangel A, Anton M, Domenzain I, Kittikunapong C, Li F, Yuan L, Nielsen J, Kerkhoven EJ. Reconstruction, simulation and analysis of enzyme-constrained metabolic models using GECKO Toolbox 3.0. Nat Protoc. 2024 Mar;19(3):629-667. doi: 10.1038/s41596-023-00931-7. Epub 2024 Jan 18. PMID: 38238583.

[6] Chang A, Jeske L, Ulbrich S, Hofmann J, Koblitz J, Schomburg I, Neumann-Schaal M, Jahn D, Schomburg D. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res. 2021 Jan 8;49(D1):D498-D508. doi: 10.1093/nar/gkaa1025. PMID: 33211880; PMCID: PMC7779020.

[7] Li F, Yuan L, Lu H, Li G, Chen Y, Engqvist MKM, Kerkhoven EJ, Nielsen J. Deep learning based kcat prediction enables improved enzyme constrained model reconstruction. Nature Catalysis. 2022;5(1):64-75. doi: 10.1038/s41929-022-00798-z.

[8] Xiaotao Wang, Yuwei Zong, Xuanjie Zhou, Li Xu, Wei He, and Shu Quan The Journal of Physical Chemistry B 2024 128 (10), 2281-2292 DOI: 10.1021/acs.jpcb.3c06526

[9] Japheth E. Gado, Gregg T. Beckham, and Christina M. Payne Journal of Chemical Information and Modeling 2020 60 (8), 4098-4107 DOI: 10.1021/acs.jcim.0c00489