Regarding the issues mentioned before, our proposed technical approach primarily revolves around the classic design, construction, testing, and learning cycle system in synthetic biology, which is DBTL in short.
On the one hand, as for the etcGEM model, we did a DBTL cycle surrounding dry lab. On the other hand, in terms of the HEATMAP-AI model, the core part of our project, we engineered a comprehensive DBTL model involving wet lab, which you can see detailedly in AI section.
The use of genome-scale models was a valuable approach to mathematically explore metabolism. A conventional genome-scale metabolic model (GEM) was reconstructed from all metabolic reactions of the organism of interest and converted into a computational format. By employing GEM models, we were able to numerically represent the metabolic pathways of \( Saccharopolyspora\:spinosa \), which allowed us to better identify key enzymes in the spinosad biosynthetic pathway for yield optimization. However, GEMs were limited by their lack of kinetic and omics data, which meant they could not accurately simulate the specific physiological state of the organism. Therefore, we used the GECKO3 software to integrate protein data and enzymatic parameters, applying enzyme constraints to construct an ecGEM model.
In the building-ecGEM module, we integrated protein data and enzyme information downloaded from the UniProt database with KEGG information for the organism retrieved from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the previously obtained GEM model. This process involved expanding from a starting metabolic model to an ecModel structure. Next, we incorporated kcat values matched with the EC numbers from the BRENDA database, as well as kcat values predicted by the DLKcat package that were independent of EC numbers,, into the model. We then added the protein cost of enzymes in the form of MW / kcat to all enzyme-catalyzed reactions in the model. Finally, enzyme constraints were imposed using methods such as setting a maximum constraint on the protein pool. The model parameters were adjusted according to physiological data, resulting in a functional ecModel. This model was then used to predict key enzymes affecting spinosad yield, and combined with deep learning modeling results, it guided wet lab modifications of key enzymes whose optimal and growth temperatures were mismatched.
When importing the \( Saccharopolyspora\:spinosa \) GEM provided by our collaborator, Prof. Hongzhong Lu, into GECKO, we also completed the necessary adapters for constructing the ecGEM, with specific details provided in the table below (adapter for building the ecGEM). Notably, the parameters related to KEGG information and UniProt information were of particular importance.
Table 1:adapter for building the ecGEM for Saccharopolyspora spinosa
The KEGG database was used to assign EC number annotations. First, we located \( {Actinomycetota} \), \( {Saccharopolyspora} \) in the Category section on the KEGG official website (https://www.genome.jp/kegg/catalog/org_list.html). We then selected \( {Saccharopolyspora sp. ASAGF58} \) from the Organisms list. This strain, isolated and analyzed by Chao Guo and colleagues from Zhejiang Province, is used industrially to produce dodecane antibiotics, and its complete genome sequence is stored in GenBank. Although it differs from our wet lab strain, it was selected as the source of KEGG data for its industrial relevance. We set the KEGG ID to 'sacg' and KEGG gene ID to 'kegg', directing us to the necessary Organism ID in the KEGG database.
The UniProt database provided enzyme molecular weights, amino acid sequences, and other protein information. To ensure consistency between UniProt and KEGG data, we searched for the organism-specific proteome datasets on UniProt (https://www.uniprot.org/proteomes) and successfully retrieved the reference proteome for \( {Saccharopolyspora sp. ASAGF58} \). We set the UniProt ID to 'UP000501808' and UniProt type to 'proteome' to achieve complete gene coverage in the ecModel. Note: Initially, in constructing the ecGEM, we set the UniProt ID to 'UP000233786' in an attempt to match the variant, which led to mismatches between the two database files. This issue was subsequently optimized in the species design phase.
After importing the GEM model using the code 'model = loadConventionalGEM()', we discovered that the gene ID systems used in the GEM model’s workspace variables 'model.genes' and 'model.grRules' did not match the gene ID system in UniProt. To resolve this issue, where the gene names in the FASTA sequence file used for modeling did not match those in the database, we utilized DIAMOND to perform a BLAST alignment between the FASTA file ('019D.fasta') of the modeling strain and the genome in UniProt. Based on the alignment results, we could replace the gene ID systems in 'model.genes' and 'model.grRules'. According to its GitHub description, DIAMOND is a sequence aligner designed for high-performance analysis of large sequence datasets, offering pairwise alignment of proteins and translated DNA at speeds 100x-10,000x faster than BLAST with low resource requirements, allowing it to run on a standard laptop. Specific instructions for installing and running DIAMOND in WSL are detailed in the attached document (diamond.dox). We then modified the information format in Excel (matches.xlsx) retaining only the original gene ID numbers in the first column (e.g., '1_2') and the corresponding UniProt gene ID numbers in the second column (e.g., 'A0A6H1RF29'). We used the functions 'replaceGenesWithMap.m' and 'replaceGrRulesWithMap.m' to make the gene ID systems in 'model.genes' and 'model.grRules' with the UniProt format starting with 'A0A6'.
Next, we addressed other discrepancies between the original GEM model and the adapter, such as standardizing the 'Compartment name in which the added enzymes should be located' to 'cytoplasm' and 'extracellular' in the adapter. After completing these adjustments, our GEM model and adapter met the basic requirements for constructing the ecGEM. We then used '[ecModel, noUniprot] = makeEcModel(model, false);' to build the ecGEM model. During this construction, the 'loadDatabases' function was called to gather enzyme information via UniProtDB from the adapter and parse reactions associated with genes. Finally, proteins and protein pools were added as pseudometabolites to the GEM model, expanding it to an ecGEM matrix by incorporating protein usage reactions and protein pool reactions.It was observed that the constructed ecGEM model was a structure containing 28 fields, with specific variable information stored in 'ecGEM_1.yml '.
Figure 1: Basic information of \( Saccharopolyspora\:sp.\:ASAGF58 \) on the KEGG website, with its KEGG ID being 'sacg'.
Figure 2: Basic information of \( Saccharopolyspora\:sp.\:ASAGF58 \) on the UniProt website, with its Proteomics UniProt ID being 'UP000501808'.
The main task at this step was the integration of 'kcat' into the 'ecModel' structure, specifically by incorporating 'kcat' values obtained from the BRENDA database or predicted by the DLKcat package into the empty 'ecModel'. The BRENDA Enzyme Database (https://www.brenda-enzymes.org/) originated from the National Center for Biotechnology Research (GBF) established in Braunschweig, Germany, in 1987 and is currently operated by the Institute of Biochemistry, University of Cologne. It serves as a comprehensive database of enzymes, metabolic pathways, structures, functions, and reactions. BRENDA contains extensive information on enzyme functions and properties, including enzyme classification (EC numbers), functional data (reactions, substrates and products, kinetic parameters), and the sources of enzymes. This information allows GECKO to directly retrieve 'kcat' values for target gene enzymes from the BRENDA database. If the enzyme's EC number and reaction substrates match those in BRENDA, the 'kcat' value is copied directly. If some parameters differ, the 'phylogenetically closest organism, any alternative substrate, or a fuzzy EC number' is selected as a substitute. The specific process involves first using 'getECfromGEM()' and 'getECfromDatabase()' to assign EC numbers to reactions. After populating the variable 'ecModel.ec.eccodes' with the corresponding EC values, we gathered 'kcat' values from BRENDA using 'kcatList_fuzzy = fuzzyKcatMatching(ecModel);'.
Since there were discrepancies between the strain used in GEM modeling and the strains in the UniProt database, some reactions still lacked 'kcat' values. Therefore, we used the DLKcat package to predict 'kcat' values for all reactions. DLKcat, developed by Feiran Li et al. in 2022 in Nature Catalysis, predicts high-throughput 'kcat' values for metabolic enzymes of any organism using only substrate structures and protein sequences. DLKcat’s deep learning approach combines graph neural networks (GNN) for substrates with convolutional neural networks (CNN) for proteins. DLKcat relies on the simplified molecular-input line-entry system (SMILES) format to gather 'kcat' through deep learning predictions. We used the '[ecModel, noSMILES] = findMetSmiles(ecModel);' function to query PubChem and collect these annotations, creating the 'data/smilesDB.tsv' file. DLKcat predictions do not run natively in MATLAB but require Python. We prepared the initial files with 'writeDLKcatInput(ecModel, [], [], [], [], true);' to create the 'DLKcat.tsv' file. Once all initial files were ready, we called 'runDLKcat();' , which executes the DLKcat algorithm via a Docker image. Docker is a platform that helps developers build, share, and run applications using containers. After DLKcat completes, it fills the predicted 'kcat' values into 'DLKcat.tsv', and finally, we read the DLKcat predicted 'kcat' values into the model using 'kcatList_DLKcat = readDLKcatOutput(ecModel);'.
After obtaining 'kcat' values through fuzzy matching from the BRENDA database and deep learning predictions from DLKcat, we needed to merge DLKcat and BRENDA structures. The merging rule was to assign a single 'kcat' value to each reaction, prioritizing 'kcat' values obtained from BRENDA through complete EC number matches. If no complete match was available, 'kcat' values derived from DLKcat were used. If DLKcat could not assign 'SMILES' due to missing substrate information, fuzzy matching with EC number wildcards was used. This resulted in 'kcatList_merged'. Since the previous GEM model did not have standard 'kcat' values set, we used the 'selectKcatValue();' function to fill 'kcatList_merged' as 'kcat' information into the model, expanding the protein 'kcat' information. Subsequently, 'kcat' and MW were combined to form the stoichiometric coefficient: \( {-MW/k_{cat}} \), representing the flux used by the enzyme in reactions. After the Build step, the ecGEM model with added protein usage information was saved with specific variable information stored in 'ecModel_2.yml'.
Figure 3: Compilation of 'kcat' values through fuzzy matching from the BRENDA database and deep learning predictions from DLKcat, resulting in 'kcatList_merged'. A. Basic information contained in the 'kcatList_merged' structure variable. B. 'kcat' data in 'kcatList_merged', arranged according to the order of reactions in 'rxns'.
In the 'Build' module, the ecGEM model constructed was merely a listing of protein costs and other parameters along with metabolic reaction information. It was not able to accurately replicate the physiological conditions under which \( Saccharopolyspora\:spinosa \) grows, with its growth rate being only \( 0.027674 /hour \), significantly lower than the standard growth rate for \( Saccharopolyspora\:spinosa \). The specific testing approach involved checking whether the model could achieve its maximum growth rate under unlimited nutrient supply conditions. For unlimited nutrient conditions, constraints on nutrient exchange were removed, and biomass production was maximized using the 'RAVEN toolbox'. The RAVEN (Reconstruction, Analysis and Visualization of Metabolic Networks) Toolbox is a software suite for MATLAB that allows for semi-automated reconstruction of genome-scale models (GEMs), and it has quality control features and capabilities for system-wide data analysis in metabolic environments.According to the DBLT model, we needed to find and learn other constraints that could optimize the ecGEM model, design them as specific parameters, incorporate them into the model's build, and perform functionality tests on the optimized ecGEM. The iterative adjustment of parameters and addition of new parameters to better simulate real-world conditions effectively demonstrated the Design-Build-Test-Learn Cycle of synthetic biology.
To achieve a resulting 'functional ecModel', the model's parameters were tuned based on physiological data if available, such as imposing a maximum constraint on the protein pool. For the maximum constraint on the protein pool, following Sánchez, B. J. et al.'s definition in the construction of yeast ecGEM models, the value of the maximum constraint was calculated by \( {\sigma \times f\times P_{tot}} \). Here, \( σ \) represents the average saturation of all enzymes in the model, \( f \) is the mass fraction of metabolic enzymes in the total proteome, and \( P_{tot} \) is the total protein mass (often the ratio of protein mass to cell dry weight). In the 'Build' module, all three variables were set to 0.5, resulting in a default maximum constraint of 0.5 × 0.5 × 0.5 = 0.125.
Relaxing constraints on protein pool usage effectively removed all enzyme constraints. Specifically, this was done by using the RAVEN toolbox to set 'prot_pool_exchange' with the lower bound to -1,000. After relaxing the protein pool constraint, the model had no protein pool or nutrient constraints. The minimum protein pool usage needed to support the observed maximum growth rate was then calculated using the 'strcmp()' function from the RAVEN toolbox, and this predicted protein pool usage was used as the new parameter value. However, this global relaxation approach might result in unrealistic high protein pools due to some key enzymes having very low or incorrect 'kcat' values, and thus might not significantly optimize the model.
Furthermore, sensitivity tuning of 'kcat' values was performed. Unlike relaxing the maximum constraint on the protein pool, sensitivity tuning involved making small incremental adjustments to 'kcat' values, iteratively increasing the 'kcat' values of growth-limiting enzymes until the desired growth rate was achieved. During each deterministic iteration, the enzyme with the highest constraint (which corresponds to the enzyme that requires the maximum proportion of the protein pool) was identified, and its 'kcat' value, which contributes most to overconstraining the model, was increased. This step was performed using the 'sensitivityTuning()' function. The ecGEM model constructed with these adjustments and its specific variable information were stored in 'ecModel_3.yml'.
Both methods described above are automated, meaning they do not manually specify significant metabolic reactions for the growth state of the organism as starting points for adjustments. Sometimes, manually increasing 'kcat' values by 1-10 times for specific reactions can significantly enhance model accuracy. This approach can be applied in our future research process after obtaining sufficient wet lab data to target specific enzyme functions.
Finally, to assess the model’s ability to replicate physiological states, we simulated 'exchange fluxes at various growth rates'. The approach involved loading our constructed 'ecModel' to simulate the impact of protein pool constraints on metabolism at different growth rates. During the depiction of 'exchange fluxes', we discovered that the naming of metabolic reactions in the original GEM model did not match standard reaction names. For example, the standard system describes the Reaction ID for biomass pseudoreaction as 'r_4041', while the corresponding reaction ID in the original GEM data was 'BIOMASS_SCO'. We manually replaced reaction IDs, such as changing 'EX_o2_e', 'EX_co2_e', 'EX_glc__D_e', and 'EX_etoh_e' to 'r_1992', 'r_1672', 'r_1714', and 'r_1761' respectively. We then used the 'plotCrabtree()' function to predict the chemical reaction fluxes for \( Saccharopolyspora\:spinosa \) at different growth rates and plotted the results.
This demonstrated that the model was able to simulate the organism’s growth conditions to some extent. The remaining discrepancies between the simulated and real conditions might be due to minor differences between the original GEM and our target strain, as well as incomplete annotation of the strain’s proteome and insufficient wet lab data for different physiological conditions. Despite these challenges, we finally successfully built an enzyme-constrained metabolic model for \( Saccharopolyspora\:spinosa \) through the DBLT modeling process. The ecGEM model provided genomic and metabolic pathway guidance for predicting and modifying enzymes with mismatched optimal temperatures and growth temperatures for spinosad production pathways.
Figure 4: Simulation of growth conditions for \( Saccharopolyspora\:spinosa \) using the ecGEM model. A. Simulation of exchange fluxes at various growth rates using the ecGEM model. B. Proportional usage of protein pools in the ecModel as growth rate changes (final constructed ecGEM model). C. Proportional usage of protein pools in the ecModel as growth rate changes (ecGEM model without adjusted protein pool constraints). D. Proportional usage of protein pools in the ecModel as growth rate changes (ecGEM model without adjustments to kcat values).
During the constructing the etcGEM of \( Saccharopolyspora\:spinosa \), we found our ecGEM can not be recognized by cobrapy in python, which may be caused by the lack of reaction data. To make HEATMAP project complete, we build etcGEM based on the ecGEM of \( Actinoplanes\:sp.SE50/110 \), which is construct by IGEM 2022 SJTU-software team (https://gitlab.igem.org/2022/software-tools/sjtu-software).
Most of the data used in constructing etcGEM have been elaborated in the Learn phase in ecGEM, except optimal temperature and melting temperature.On the basis of ecGEM, the consruction of etcGEM requires six unknown parameters: \(\bigtriangleup H_{T_{0} }^{\ddagger } \), \(\bigtriangleup S_{T_{0} }^{\ddagger } \), \(\bigtriangleup C_{p }^{\ddagger } \), \( \bigtriangleup H^{\ast } \), \( \bigtriangleup S^{\ast } \), \( \bigtriangleup C_{p,u } \) for each enzyme. Considering the thermodynamic data are hard to collect, we can calculate them through basic protein properties such as optimal temperature and melting temperature. Considering the low accuracy of existing models, we used our HEATMAP-AI to make high accuracy prediction of \( T_{opt} \) and used the deep learning model DARWINS which is built by IGEM 2023 SJTU-software team (https://gitlab.igem.org/2023/software-tools/sjtu-software) to predict \( T_{m} \).
After the Test phase, we can collect the data of the growth situation of the \( Actinomycetes \), which can instruct the improvement of both etcGEM and HEATMAP-AI. In the future, to resolve the challenges arising from the uncertainties in the parameter values, Approximate Bayesian Computation can be used in the improvement of etcGEM, which is a probabilistic framework that has been successfully applied for quantifying and reducing uncertainties in various fields, including deep learning, ordinary differential equations, and biochemical kinetic models. The approach uses experimental observations (\( D \)) to update Prior distributions (\( P\left ( \theta \right ) \)) of model parameters to Posterior ones (\( P\left ( \theta |D \right ) \)).
To built etcGEM, optimal temperature and melting temperature should be added into the data of enzymes. If it is hard to search, we will use deeplearning model to predict as mentioned before. After that, With the help of cobrapy, we changed the primary matrix of metabolism in ecGEM in python. first, we replace the previous \( k_{cat} \) with \( k_{cat} \left ( T \right ) \) to simulate the change in enzyme activity. The value is caculated by Macromolecular rate theory. Second, we calculate \( fN\left ( T \right ) =\left [ E \right ] _{N,i} /\left [ E \right ] _{t,i} \) by the Two-state model to simulate the situation of denaturation, after which we will increase the uasge of the protein in protein pool according to \( fN\left ( T \right ) \).
Different from ecGEM, the construction of etcGEM is mainly based on python language, instead of MATLAB, so the first step is to save the ecGEM model with matlab file and then load it in the python project with cobrapy.
Subsequently, for each enzyme involved in ecGEM, we will search its optimal temperature and melting temperature in the database based on the sequence information. If the information cannot be retrieved, we will supplement it through HEATMAP-AI and DARWINS.
With \( T_{opt} \), \( T_{m} \), \( k_{cat} \), \( fN\left ( T \right ) \) and \( k_{cat} \left ( T \right ) \) of each enzyme in any given temperature T can be calculated through specific functions.
Calculated \( fN\left ( T \right ) \) and \( k_{cat} \left ( T \right ) \) will be used to replace the coefficient and \( k_{cat} \) of each protein in every reaction in ecGEM. A simple schematic diagram helps to explain it clearly.
In the buiding phase, we have successfully construct an etcGEM model. In order to test the model, we used etcGEM to simulate the growth situation in multiple temperture with function in cobrapy.
The results are close to what we have assumed before, which shows a trend that rises first and then falls.
The simulation shows that the growth situation differs in different temperature. What’s more, It can be observed that the most suitable growth temperature for the bacterial strain is around 56 degrees Celsius. When the temperature rises or falls, the activity of its internal proteins decreases, thereby affecting the growth and development of the entire organism.
[1] Gu C, Kim GB, Kim WJ, Kim HU, Lee SY. Current status and applications of genome-scale metabolic models. Genome Biol. 2019 Jun 13;20(1):121. doi: 10.1186/s13059-019-1730-3. PMID: 31196170; PMCID: PMC6567666.
[2] Heinken A, Basile A, Hertel J, Thinnes C, Thiele I. Genome-Scale Metabolic Modeling of the Human Microbiome in the Era of Personalized Medicine. Annu Rev Microbiol. 2021 Oct 8;75:199-222. doi: 10.1146/annurev-micro-060221-012134. Epub 2021 Jul 27. PMID: 34314593.
[3] Lewis NE, Nagarajan H, Palsson BO. Constraining the metabolic genotype-phenotype relationship using a phylogeny of in silico methods. Nat Rev Microbiol. 2012 Feb 27;10(4):291-305. doi: 10.1038/nrmicro2737. PMID: 22367118; PMCID: PMC3536058.
[4] UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023 Jan 6;51(D1):D523-D531. doi: 10.1093/nar/gkac1052. PMID: 36408920; PMCID: PMC9825514.
[5] Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 2021 Jan 8;49(D1):D545-D551. doi: 10.1093/nar/gkaa970. PMID: 33125081; PMCID: PMC7779016.
[6] Li, G., Hu, Y., Jan Zrimec et al. Bayesian genome scale modelling identifies thermal determinants of yeast metabolism. Nat Commun 12, 190 (2021). https://doi.org/10.1038/s41467-020-20338-2
[7] Stourac, J., Dubrava, J., Musil, M., Horackova, J., Damborsky, J., Mazurenko, S., Bednar, D., 2020: FireProtDB: Database of Manually Curated Protein Stability Data. Nucleic Acids Research 49: D319-D324
[8] Chang A, Jeske L, Ulbrich S, Hofmann J, Koblitz J, Schomburg I, Neumann-Schaal M, Jahn D, Schomburg D. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res. 2021 Jan 8;49(D1):D498-D508. doi: 10.1093/nar/gkaa1025. PMID: 33211880; PMCID: PMC7779020.
[9] Guo C, Guo W, Liu Y, Wang C. Complete genome sequence of butenyl-spinosyn-producing Saccharopolyspora strain ASAGF58. Ann Microbiol. 2020;70:46. doi:10.1186/s13213-020-01587-4.
[10] Buchfink B, Reuter K, Drost HG, "Sensitive protein alignments at tree-of-life scale using DIAMOND", Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x
[11] Docker Inc, https://www.docker.com/
[12] Wang H, Marcišauskas S, Sánchez BJ, Domenzain I, Hermansson D, Agren R, Nielsen J, Kerkhoven EJ. (2018) RAVEN 2.0: A versatile toolbox for metabolic network reconstruction and a case study on Streptomyces coelicolor. PLoS Comput Biol 14(10): e1006541. doi:10.1371/journal.pcbi.1006541.
[13] Jeschek M, Gerngross D, Panke S. Rationally reduced libraries for combinatorial pathway optimization minimizing experimental effort. Nat Commun. 2016 Mar 31;7:11163. doi: 10.1038/ncomms11163. PMID: 27029461; PMCID: PMC4821882.
[14] Orth JD, Thiele I, Palsson BØ. What is flux balance analysis? Nat Biotechnol. 2010 Mar;28(3):245-8. doi: 10.1038/nbt.1614. PMID: 20212490; PMCID: PMC3108565.