//quote
The goal of our modeling: reconstruction life to analysis life.
//quote
Introduction
For a long time, people have been seeking mathematical methods to quantify life. Although it is challenging—since even the complexity of a single bacterium rivals that of the most intricate human-made factories—we havenever ceased our efforts to explain life through mathematics.
In our project, we explore the relationship between microscopic metabolic flux and the macroscopic fermentation system through a “remodeling” approach. Part 1 involves our use of the GSSM model to simulate metabolic flux, model PHB fermentation, improve fermentation conditions, analyze the differences in metabolic flux between polyploid and haploid, and identify key genes that affect growth. Part 2 focuses on our Deep Learning method to predict PHB yield. Part 3 involves our analysis of microplastic pollution and influencing factors in Qingdao.
Part 1: Metabolic Network Model analysis based on COBRApy
Why do we choose COBRApy?
Our wet experiment started with Escherichia coli DGF-298[1] and constructed its prokaryotic polyploid strain, E. coli DGF-298-103Z. DGF-298 is a minimal strain with a streamlined genome. The goal of the first part was to develop metabolic network models for both DGF-298 and its polyploid strain, analyze the factors influencing PHB production, and assess the impact of polyploidy on metabolic flux.
Traditional methods, such as chemical reaction kinetics modeling, are limited by various reaction parameters and the scale of the metabolic network. These constraints make it difficult to simulate metabolic flux on a larger scale.
So we turned our attention to genome-scale metabolic models (GSSM), which can predict an organism’s metabolic behavior, the production or consumption of metabolites, and gene functions and their impacts on a genome-wide scale under different conditions. Therefore, GSSM significantly broadened the scope of our modeling, allowing us to analyze the metabolic flux of DGF-298 and its polyploid variant on a genome-wide scale.
We chose COBRApy[2] as the tool for GSSM analysis. To simulate metabolic flux on a larger scale, a commonly used method is Flux Balance Analysis (FBA) (Fig. 1), which assumes that metabolite concentrations remain constant over time. Based on this assumption, constraints are established and solved using linear programming.
//img_wrap
Fig. 1: The S matrix represents the metabolic network, which includes the reactants, products, and their corresponding coefficients. The flux vector represents the flux of each metabolic flow.
//img_wrap
Part 1.1: Establishing the Metabolic Network Model for DGF-298
The first step was to construct the GSSM model for DGF-298. We built the metabolic network model of DGF-298 based on E.coli.core, which is a model that includes all the core metabolic pathways of E. coli, encompassing 72 metabolites, 95 reactions, and 137 genes.
Next, we compared the genes of E.coli.core with DGF-298 using whole-genome sequencing data[3]. We discovered that DGF-298 lacks 22 genes and 7 metabolic pathways compared to E.coli.core. After removing these 7 pathways, we obtained the metabolic network model of DGF-298.
//img_wrap
Fig. 2a: Metabolic network of MG1655
//img_wrap
//img_wrap
Fig. 2b: Metabolic network of DGF-298
//img_wrap
Gene deletions did not alter the core metabolic pathways of DGF-298 (Fig. 2b), such as the TCA cycle, glycolysis, or the HMP pathway. Additionally, simulations showed that removing these pathways did not significantly change the metabolic flux distribution of DGF-298 under certain levels of oxygen and glucose supply. This indicates that despite the reduction of numerous genes, the metabolic flux remains stable.
Part 1.2 Multi-Objective Optimization for Predicting PHB Yield
In this section, we explored the PHB fermentation capability of DGF-298 and the factors influencing PHB fermentation.
In FBA simulations, to solve for the entire metabolic flux distribution, we need to define an objective function F(x) that the model maximizes. This objective function is usually the reaction rate of a specific reaction.
If we define only a single objective function, several issues may arise, such as the inability to simulate a reasonable distribution of metabolic fluxes within the bacteria. For example, as shown in Fig. 3a, if we use PHB production as the sole objective function, the model maximizes the rate of the PHB production reaction to the extent that the rate of the biomass reaction is set to 0 mmol/gCDW/h, which is clearly unrealistic.
So to address this issue, we performed multi-objective optimization. In FBA, the objective function F(x) is the rate of a chemical reaction, and the model solves for the maximum value of the reaction rate through linear programming. Here, F(x) is defined as a linear combination of n objective functions. The model will solve for the maximum value of F(x) using this approach.
\[F\left(x\right)=\omega_1v_1+\omega_2v_2 + ... + \omega_nv_n\] \[\text{Maximize } F(x)= \omega^Tv\]Here, we defined \(f(x)_1\) as the PHB synthesis reaction and \(f(x)_2\) as the biomass reaction, and then optimized the objective functions. The results are shown in Fig. 3b. This multi-objective optimization method reflects the bacterial distribution of metabolic fluxes. The biomass reaction rate increases overall with the intake of glucose and oxygen, while the PHB production reaction rate differs from single-objective optimization in that it increases with the glucose intake rate (as shown in Fig. 3b). However, under high oxygen concentrations, PHB production is limited (Fig. 3b, bottom). The PHB yield reaches its maximum when glucose concentration is 20.0 mmol/gCDW/h and oxygen concentration is 8.7 mmol/gCDW/h. In contrast, the biomass reaction rate peaks when both glucose and oxygen intakes are at 20 mmol/gCDW/h (with maximum constraints for glucose and oxygen intake set at 20 mmol/gCDW/h).
//img_wrap
Fig. 3a Single-objective function optimization results, where the model only optimizes PHB production while neglecting individual growth.
Fig. 3b Results of multi-objective function optimization, where the model considers both PHB synthesis and individual growth.
//img_wrap
To more clearly illustrate the relationship between oxygen and glucose consumption with PHB yield and biomass reaction rate, in Fig. 4a and Fig. 4b we set only one variable at a time (either glucose uptake rate or oxygen uptake rate), effectively slicing Fig. 3a and Fig. 3b along the x or y axis. The results show that when the glucose uptake rate is low (< 6.0 mmol/gCDW/h), glucose is primarily used for bacterial growth rather than for PHB production. When glucose uptake ranges from 6.0 mmol/gCDW/h to 9.0 mmol/gCDW/h, PHB yield and growth are inversely related, indicating competition. However, when glucose uptake exceeds 9.0 mmol/gCDW/h, the growth rate and PHB yield become positively correlated (Fig. 4a).
For the relationship between oxygen uptake rate and PHB yield, we found that when the oxygen uptake rate is less than 5.0 mmol/gCDW/h, PHB yield is positively correlated with the oxygen uptake rate. However, when the oxygen uptake rate exceeds 5.0 mmol/gCDW/h, PHB yield actually decreases as the oxygen uptake rate increases (Fig. 4b).
//img_wrap
Fig. 4a: Relationship between PHB production rate and biomass reaction rate at different glucose uptake rates
//img_wrap
//img_wrap
Fig. 4b: Relationship between PHB yield and biomass reaction rate at different oxygen uptake rates
//img_wrap
The above conclusion is consistent with Katie A. Third’s findings [4], which suggest that high oxygen conditions are more favorable for bacterial growth, while low oxygen conditions are more advantageous for PHB accumulation. Katie A. Third posits that under low dissolved oxygen levels, the availability of adenosine triphosphate (ATP) is limited, preventing significant biomass growth. Instead, most ATP is used to transport acetate into the cell, with acetate being utilized as a precursor for PHB production(Reaction ACS). In contrast, higher oxygen consumption provides surplus ATP, leading to increased growth rates but reduced PHB yields.
\[\text{Reaction ACS:}Acetate + ATP + CoA → Acetyl-CoA + AMP + PPi\]Acetate is converted to Acetyl-CoA (reaction ACS) with the consumption of ATP, and Acetyl-CoA is a precursor for PHB production. Therefore, the next step was to analyze the relationship between acetate utilization and the consumption of oxygen and glucose. As shown in Fig. 5, at higher glucose consumption and lower oxygen consumption, the ACS reaction rate is higher. Conversely, when both glucose and oxygen consumption are high, the ACS reaction rate decreases (dark areas in Fig. 5). This indicates that at lower oxygen levels, more ATP is consumed for acetate and directed towards PHB production, leading to higher PHB yields, whereas at higher oxygen levels, more ATP is used for cell growth. From this analysis, it can be concluded that higher PHB yields require higher glucose concentrations and lower oxygen levels.
//img_wrap
//img_wrap
//img_wrap
Fig. 5: Relationship between ACS reaction rate and oxygen and glucose consumption
//img_wrap
Subsequently, we calculated the conversion rate as a function of oxygen and glucose levels. The conversion rate was higher under high glucose concentrations and low oxygen levels. The maximum conversion rate was observed when the glucose concentration was 4.3 mmol/gCDW/h and the oxygen concentration was 2.1 mmol/gCDW/h (Fig. 6). This indicates that when oxygen is more abundant, more glucose is allocated to cell growth rather than PHB production. Thus, higher glucose concentrations and lower oxygen levels are favorable for both conversion rate and yield.
//img_wrap
//img_wrap
//img_wrap
Fig. 6: Relationship between PHB conversion rate and O2 uptake rate and glucose uptake rate
//img_wrap
Part 1.3: Incorporate transcriptomic data into the metabolic network model
//img_wrap
Fig. 7: Integration of transcriptomic data with the GSSM of DGF-298
//img_wrap
First, we calculated the fold changes of numerous differentially expressed genes (Fig. 7). Under FBA conditions and according to the Michaelis-Menten equation, the reaction rate 𝑣 is positively correlated with enzyme concentration.
\[\text{Michael-Menten equation: }v=\frac{k_{cat}\cdot[E]\cdot[S]}{K_M+[S]}\] \[\text{under flux balance analysis: }v=k'\cdot[E]\]For more complex mechanisms, such as sequential and ping-pong mechanisms, or even more intricate reaction types, they can all be represented under FBA as \(v=k′[E]\)
\[\text{Sequential Reaction Mechanism: } v=\frac{k_{cat}\cdot[E_{total}]\cdot[A]\cdot[B]}{(K_A\cdot K_B)+(K_B\cdot[A])+(K_A\cdot[B])+[A]\cdot[B]}\] \[\text{Ping-Pong Mechanism: }v=\frac{k_{cat}\cdot E_{total}\cdot[A][B]}{K_A\cdot[B]+K_B\cdot[A]+[A]\cdot[B]}\]Therefore, assuming that gene expression fold changes are positively correlated with enzyme concentration, we incorporated the gene expression fold changes into GSSM.
Therefore, assuming that gene expression fold changes are positively correlated with enzyme concentration, we incorporated the gene expression fold changes into GSSM. In FBA (Flux Balance Analysis), we set:
\[S\ast V=0\]where we assume the reaction rate matrix is \(W\) . The entire metabolic network can then be represented as follows://no-indent
\[S\ast V ∘ W=0\]//img_wrap
Figure 8: Gene Expression Fold Change
//img_wrap
//img_wrap
Figure 9: The impact of upregulating or downregulating different reactions on cellular growth rate in the model
//img_wrap
Through simulation, we found that the impact of gene expression fold changes on bacterial growth rate is not uniform. We identified two reactions, NADH16 and PKF. Consequently, we selected the genes encoding these reactions, pfkB and nuoF, respectively, and constructed plasmids containing these genes. After transforming these plasmids into DGF-298 to enhance the expression levels of these genes, we performed RT-PCR to measure the expression levels. The results indicated significant increases in the expression levels of pfkB and nuoF under various combinations of promoters and RBSs.
//img_wrap
//img_wrap
Next, we incorporated the enhanced expression levels into the GSSM model of DGF-298 and calculated the rates of PHB synthesis and biomass production. Before the enhancement, the model predicted a maximum PHB synthesis rate of 8.72 mmol/gCDW/hr. After enhancing the expression levels, the maximum PHB synthesis rate increased to 17.02 mmol/gCDW/hr. Similarly, the maximum biomass production rate increased from 0.37 mmol/gCDW/hr to 1.21 mmol/gCDW/hr.
Part 2 Real-time Prediction of Fermentation Products Based on Deep Learning
Many fermentation products’ concentrations are difficult to predict in real-time. For example, the concentration of PHB produced during fermentation can only be determined by sampling and analyzing it through gas chromatography. The significance of predicting fermentation products lies in optimizing fermentation conditions (such as temperature, pH, nutrient supply, etc.), thereby improving production efficiency, and in making real-time adjustments to parameters during fermentation to avoid resource wastage and maximize yield.
Traditional predictive models struggle to handle the dynamic changes in environmental conditions, and parameter estimation is challenging. Therefore, we utilize deep learning methods to address the aforementioned issues.
In this section, we trained an LSTM-based model to predict PHB concentration, exploring the feasibility of using such machine learning models for fermentation product prediction and their potential applications in industrial settings.
2.1 Real-time Product Prediction Using Multi-Sensor Fermentation Tanks Combined with LSTM
LSTM (Long Short-Term Memory) is a deep learning model designed for processing sequential data, particularly suitable for time series, natural language processing, and speech recognition. Its core feature is that each unit can process data through forget gates and memory gates, dynamically updating and forgetting information based on input, thus better remembering long-term dependencies.
The update process of LSTM can be divided into the following steps:
The forget gate decides how much of the previous memory to forget, using the current input and the previous hidden state calculated through a sigmoid function:
\[f_t=\sigma\left(W_f\left[h_{t-1},x_t\right]+b_f\right)\]where://no-indent
- \(W_f\) is the weight matrix of the forget gate.
- \(b_f\) is the bias of the forget gate.
The input gate \(i_t\) determines the degree to which the current input \(x_t\) updates the cell state. The candidate cell state is \(\widetilde{C_t}\) the new candidate information obtained by transforming the current input, determining the new information to be added to the cell state:
\[i_t=\sigma\left(W_i\left[h_{t-1},x_t\right]+b_i\right)\] \[\widetilde{C_t}=\tanh{\left(W_C\left[h_{t-1},x_t\right]+b_C\right)}\]where://no-indent
- \(W_i\) and \(W_C\) are the weight matrices for the input gate and candidate cell state, respectively, and \(b_i\) and \(b_C\) are their corresponding biases.
The cell state \(C_t\) is determined by the outputs of the forget gate and the information controlled by the input gate. First, the forget gate \(f_t\) multiplies the previous cell state \(C_{t-1}\) to determine how much past information to retain; then, the input gate;controls the addition of the current candidate cell state \(i_t\) to the new cell state \(\widetilde{C_t}\) to the new cell state:
\[C_t=f_t\odot C_{t-1}+i_t\odot\widetilde{C_t}\]The output gate \(o_t\) determines the content of the current hidden state \(h_t\). The output gate is also calculated based on the current input \(x_t\) and the previous hidden stat \(h_{t-1}\) using a sigmoid function:
\[o_t=\sigma\left(W_o\left[h_{t-1},x_t\right]+b_o\right)\]The final hidden state \(h_{t-1}\) is a combination of the output gate \(o_t\) and the updated cell state. The updated cell state \(C_t\) undergoes a nonlinear transformation through the tanh function and is controlled by the output gate \(o_t\):
\[h_t=o_t\odot\tanh{\left(C_t\right)}\]We utilized data on temperature, pH, glucose consumption, and dissolved oxygen levels taken every minute as features. Through this method, we can predict the entire fermentation process in real time.
//img_wrap
Fig. 10 Real-time Prediction of Products by Combining Multi-Sensors with Deep Learning Models.2.2: Model Training and Evaluating
//img_wrap
Our data comes from three complete fermentation runs. The data from the first two fermentations are used for training, while the data from the third fermentation are used for validation. After training for 100 epochs, the loss function value is 0.000174.
//img_wrap
Fig. 11: Relationship between Loss and Epochs.
//img_wrap
After training, we made predictions for the three fermentation processes. The prediction results indicate that our model has good performance and generalization ability. On the validation set of the third fermentation, the accuracy reached 0.9871, and the mean squared error was 0.19. The prediction results are shown in Fig. 12.
//img_wrap
Fig. 12 prediction result
//img_wrap
Model Evaluation Metrics//caption
Metric | Value |
---|---|
Mean Square Error (MSE) | 0.1914 |
Root Mean Square Error (RMSE) | 0.4375 |
Mean Absolute Error (MAE) | 0.3833 |
R2 Score | 0.9871 |
Model Parameters//caption
Parameter | Value |
---|---|
Model Type | LSTM |
Number of LSTM layers | 2 |
Number of Units (LSTM) | 50 per layer |
Input Features | OD, Glucose |
Output | Yield |
Sequence Length | 10 |
Number of Epochs | 500 |
Batch Size | 32 |
Optimizer | Adam |
Loss Function | Mean Square Error (MSE) |
Tran/Test Split | 80% Train, 20% Test |
Validation Data | Yes (used during training) |
Model File | istm_fermentation_model_1_dataset.h5 |
Part 3: Investigation of Microplastic Distribution and Influencing Factors in Qingdao
Microplastic pollution has become a severe global environmental issue, affecting ecosystems and human health. Microplastics are plastic particles with a diameter of less than 5 millimeters, originating from various sources, including cosmetics, clothing fibers, and the degradation of plastic products. These tiny plastic particles can enter the environment through various pathways, such as rivers, oceans, the atmosphere, and even soil.
Shandong University is located in the coastal city of Qingdao, and we aim to investigate the microplastic pollution around the university and its influencing factors. Our microplastic distribution data come from two papers by YIN Shiqi et al.[5] and ZHANG Ying et al.[6]. These studies investigated the distribution of microplastics in the nearshore surface seawater and intertidal sediment in Qingdao.
//img_wrap
//img_wrap
Next, we analyzed the relationship between microplastic distribution and factors such as population density, the number of streets, seawater flow rate, distance to rivers, river flow rate, and the number of surrounding factories.
Our population data were obtained from the Citypopulation website, seawater flow rates were sourced from the Marine Development and Management official website, and the number of streets and factories were retrieved from Gaode Map.
//img_wrap
//img_wrap
The Spearman correlation coefficient analysis reveals that population density has the highest correlation with microplastic content, reaching 0.90, followed by the number of streets and seawater flow rate. This indicates that a high population near the coast significantly increases the surrounding microplastic content. Our results also show that there is almost no correlation between microplastic content and the distance from sampling sites to rivers, which may suggest that the dispersion of urban microplastics into the ocean is primarily through the atmosphere rather than rivers.
Daily activities of urban residents generate significant amounts of microplastics, including the use of cosmetics and skincare products, cleaning agents and laundry detergents, synthetic fiber clothing, wear and tear of plastic products, automotive tire wear, and plastic waste. Therefore, the use of alternatives to plastics, such as PHB, is crucial for combating microplastic pollution.
Reference
//referrence
[1] Hirokawa Y, Kawano H, Tanaka-Masuda K, Nakamura N, Nakagawa A, Ito M, Mori H, Oshima T, Ogasawara N. Genetic manipulations restored the growth fitness of reduced-genome Escherichia coli. J Biosci Bioeng. 2013 Jul;116(1):52-8. doi: 10.1016/j.jbiosc.2013.01.010. Epub 2013 Mar 7. PMID: 23477741.
[2] Ebrahim A, Lerman JA, Palsson BO, Hyduke DR. COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Syst Biol. 2013 Aug 8;7:74. doi: 10.1186/1752-0509-7-74. PMID: 23927696; PMCID: PMC3751080.
[3] Matteau D, Champie A, Grenier F, Rodrigue S. Complete sequence of the genome-reduced Escherichia coli DGF-298. Microbiol Resour Announc. 2023 Nov 16;12(11):e0066523. doi: 10.1128/MRA.00665-23. Epub 2023 Oct 16. PMID: 37843363; PMCID: PMC10652928.
[4] Third KA, Newland M, Cord-Ruwisch R. The effect of dissolved oxygen on PHB accumulation in activated sludge cultures. Biotechnol Bioeng. 2003 Apr 20;82(2):238-50. doi: 10.1002/bit.10564. PMID: 12584766.
[5] Yin S Q, Jia F L, Liu X Y, et al. 2021. The distribution of microplastics and their influence factors in surface seawater and tidal flat sediments in Qingdao coast. Acta Scientiae Circumstantiae, 41(4): 1410-1418.
[6] 张莹,赵信国,隋琪,等.青岛崂山近岸海域表层沉积物中微塑料的污染特征[J].渔业科学进展,2024,45(02):61-69.DOI:10.19663/j.issn2095-9869.20231019001.
//referrence