5. From SVM to GNN: Building an Accurate Peptide-PE Binding Predictor (PEBP)
According to the wet lab's requirements, we needed to build a predictor to determine whether a peptide sequence can bind to PE. An initial literature search yielded 174 relevant data points. Because the dataset was small, we first trained a conventional machine learning model and then applied additional processing before introducing a neural network. Specifically, we first trained an SVM model on the original data. We then expanded the data at a ratio of 1:10, yielding nearly two thousand candidate data points, and used the trained SVM model to filter the expanded data according to specific criteria, producing a new dataset. To exploit the sequence information more deeply, we converted the sequences into structural graphs, extracted features with a Graph Neural Network (GNN) and a BLOSUM substitution matrix, and introduced an attention module for further learning, ultimately obtaining a predictor. In the experimental section, we divided the dataset into a training set and a test set, ran experiments with the constructed model, and measured metrics such as accuracy, recall, and F1 score. Finally, we analyzed the test results and adjusted and optimized the model parameters.
5.2 Our Method
5.2.1 Features and Feature Selection
Extracting appropriate features is a crucial step in building a
high-performance predictive model. Common feature types include amino acid composition (AAC), dipeptide composition (DPC), physicochemical properties of amino acids, and pseudo-amino acid composition, all of which have been widely used in the development of protein and peptide classifiers and have shown excellent performance in those contexts.
Representing protein sequences by calculating the frequency of
amino acids is an effective method. Differences in the frequency
distribution of amino acids among different sequences can be used
to distinguish between different types of proteins. This also
applies to peptide sequences; therefore, we selected AAC as a feature. To compensate for the lack of information about connections between adjacent amino acids, we also introduced DPC. A peptide sequence can be composed of any of the 20 standard amino acids (ACDEFGHIKLMNPQRSTVWY), so a peptide with $L$ amino acids can be represented as: $$\beta = (\beta_1, \beta_2, \dots, \beta_L)$$ where $\beta_1$, $\beta_2$, and $\beta_L$ represent the first, second, and $L$-th amino acids of the peptide sequence, respectively. The definitions of AAC and DPC are as follows:
$$AAC(i) = \frac{x_i}{\sum_{i=1}^{20} x_i} \tag{33}$$
$$DPC(j) = \frac{y_j}{\sum_{j=1}^{400} y_j} \tag{34}$$
where $i$ represents one of the 20 amino acids and $j$ represents one of the 400 dipeptides; $x_i$ is the count of residues of each type, and $y_j$ is the count of dipeptides of each type. To build an efficient PE-binder predictor (PEBP), we further refined AAC and DPC using the fselect.py script in LIBSVM 3.22, removing irrelevant, redundant, and noisy features. The feature selection process was as follows: features were added one at a time, in descending order of their ranking scores, to an initially empty set, and the prediction accuracy of the set was calculated after each addition. The set that achieved the highest prediction accuracy was selected as the optimal feature subset. After these steps, we obtained the optimized AAC (OAAC) and optimized DPC (ODPC).
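As a concrete illustration of Eqs. (33) and (34), the following Python sketch computes the AAC and DPC vectors for a single peptide; the function names and the toy sequence are ours, not part of any published PEBP code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides

def aac(seq: str) -> list[float]:
    """Amino acid composition: frequency of each of the 20 residues (Eq. 33)."""
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]

def dpc(seq: str) -> list[float]:
    """Dipeptide composition: frequency of each of the 400 dipeptides (Eq. 34)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = [pairs.count(d) for d in DIPEPTIDES]
    total = sum(counts) or 1
    return [c / total for c in counts]

if __name__ == "__main__":
    peptide = "ACDKWLYR"  # toy example
    print(len(aac(peptide)), len(dpc(peptide)))  # 20 400
```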
5.2.2 Support Vector Machine
Support Vector Machine (SVM) is a supervised learning algorithm widely used for classification and regression analysis, and it has received increasing attention and application in bioinformatics. We applied SVM in the development of PEBP. The SVM model was built with LIBSVM 3.22, an integrated support vector classification package. The optimal cost parameter c and RBF kernel parameter g were determined using the bundled Python script grid.py. To obtain probability estimates for the predictions (needed to plot the ROC curve), the LIBSVM option -b was set to 1 during model training.
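The same workflow can be sketched with scikit-learn's SVC in place of the LIBSVM command-line tools; this is an illustrative substitute rather than the exact scripts we used, and the feature matrix, labels, and parameter grid below are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# X: n_samples x n_features matrix of OAAC+ODPC features; y: 1 = binds PE, 0 = does not.
X = np.random.rand(174, 40)            # placeholder feature matrix
y = np.random.randint(0, 2, size=174)  # placeholder labels

# Grid over cost C and RBF gamma g, mirroring what grid.py searches in LIBSVM.
param_grid = {"C": [2 ** k for k in range(-5, 16, 2)],
              "gamma": [2 ** k for k in range(-15, 4, 2)]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf", probability=True),  # probability=True ~ LIBSVM -b 1
                      param_grid, scoring="accuracy", cv=cv)
search.fit(X, y)
print("best C/gamma:", search.best_params_, "cv accuracy:", round(search.best_score_, 3))
```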
5.2.3 Prediction Evaluation
In this study, all models were evaluated using five-fold
cross-validation. The entire dataset was randomly divided into
five groups, each containing an equal number of peptides. Four
groups were used for training, and the remaining one was used for
testing, with the process repeated five times to ensure that each
group had the opportunity to serve as the test set. The average
prediction accuracy across the five combinations was calculated as
the model's final performance. To comprehensively evaluate the
PEBP model, sensitivity (Sn), specificity (Sp), accuracy (Acc),
and Matthews Correlation Coefficient (MCC) were used as evaluation
metrics.
$$Sn = \frac{TP}{TP + FN} \tag{35}$$
$$Sp = \frac{TN}{TN + FP} \tag{36}$$
$$Acc = \frac{TP + TN}{TP + FN + FP + TN} \tag{37}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{38}$$
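For completeness, a short Python sketch of Eqs. (35)-(38), evaluated here on the confusion-matrix counts reported below; the function itself is only an illustration.

```python
import math

def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and MCC as defined in Eqs. (35)-(38)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Counts from the confusion matrix reported below (TP, TN, FP, FN).
print(evaluate(tp=25, tn=15, fp=6, fn=10))
```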
The results show that the area under the ROC curve (AUC) is 0.75,
indicating that the model has good but not perfect predictive
performance. The confusion matrix shows TP, TN, FP, and FN as 25,
15, 6, and 10, respectively, consistent with the calculations for
sensitivity, specificity, and MCC. Classification performance
metrics such as precision, recall, and F1 score demonstrate
reasonable predictive capability across different categories.
Overall, the results align with the model evaluation approach
described for five-fold cross-validation, supporting the model's
effectiveness in predicting PE-binders.
The results are as follows.
Figure 21: Receiver Operating Characteristic
Figure 22: Classification Metrics by Class
Figure 23: Confusion Matrix
5.2.4 Peptide Data Augmentation
To make fuller use of the peptide data, we conducted further work. Given the limited amount of source data, we first considered peptide data augmentation. The general process was as follows:
• Generate candidate sequences: Methods such as random
mutation, sequence concatenation, and noise addition were used to
generate candidate sequences.
• Screen candidate sequences: An SVM model was used to
perform preliminary screening of the candidate sequences,
identifying sequences that might bind to PE.
• Expand the dataset: The screened sequences were added to the dataset until the desired expansion size was reached (around 2,000 sequences).
Generating candidate sequences is the key step in data
augmentation, aimed at creating more diverse peptide samples to
improve the generalization ability of the model. Specific methods
include random mutation, sequence concatenation, and noise
addition. Random mutation involves selecting a random amino acid
in the original peptide sequence and replacing it with another
amino acid, introducing minor variations. Sequence concatenation
combines two or more peptide fragments to create new peptide
sequences, increasing sequence diversity while maintaining the
original biological characteristics. Noise addition involves
randomly inserting or deleting amino acids in the peptide chain or
introducing small perturbations between amino acids to simulate
possible variations during experiments. These methods work
together to generate a large number of candidate sequences,
providing a rich data foundation for subsequent screening and
training.
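The three generation strategies can be sketched as follows; this is a simplified illustration (the insertion/deletion choices and the maximum concatenated length are assumptions), and in our pipeline the resulting candidates would still be screened by the trained SVM.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_mutation(seq: str) -> str:
    """Replace one randomly chosen residue with a different amino acid."""
    i = random.randrange(len(seq))
    new = random.choice([a for a in AMINO_ACIDS if a != seq[i]])
    return seq[:i] + new + seq[i + 1:]

def concatenate(seq_a: str, seq_b: str, max_len: int = 50) -> str:
    """Join two peptide fragments into a new candidate sequence."""
    return (seq_a + seq_b)[:max_len]

def add_noise(seq: str) -> str:
    """Randomly insert or delete a single residue to mimic experimental variation."""
    i = random.randrange(len(seq))
    if random.random() < 0.5 and len(seq) > 2:               # deletion
        return seq[:i] + seq[i + 1:]
    return seq[:i] + random.choice(AMINO_ACIDS) + seq[i:]    # insertion

def generate_candidates(seeds: list[str], n: int) -> list[str]:
    """Produce n candidate sequences by randomly applying one of the three operations."""
    ops = [random_mutation, add_noise,
           lambda s: concatenate(s, random.choice(seeds))]
    return [random.choice(ops)(random.choice(seeds)) for _ in range(n)]

candidates = generate_candidates(["ACDKWLYR", "GILKKLLG"], n=10)
print(candidates)
```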
5.2.5 Network architecture of the proposed PEBP
To further explore the sequence information, the model first
converts the sequence into a structural graph to capture its
spatial and topological features, and then performs layer-wise
feature extraction through Graph Neural Networks (GNN).
Additionally, the model derives features from a BLOSUM matrix and employs a CNN to learn them at the sequence level. Subsequently, the model introduces an attention
mechanism to fuse and optimize the extracted features, enhancing
the discriminative power of the feature representation. Finally,
the output module generates an efficient predictor for accurate
target property prediction.
Sequence processing module
The sequence processing module represents peptide sequences from
two perspectives:
structural information and evolutionary information.
Structural information is obtained using the RDKit tool,
which converts peptide sequences in FASTA format into their
corresponding chemical structures in MOL format. These structures
contain molecular data such as atoms, bonds, and coordinates. To
encode the structural information, the peptide sequence is
represented as a molecular graph G = (V,E) , where V is the set of
nodes (atoms), and E is the set of edges (chemical bonds). N(v)
includes all nodes u connected to node v by edge (u,v), and it
describes the neighborhood structure and relationships of node v.
This molecular graph representation captures the spatial and chemical interactions within the peptide sequence, aiding in feature extraction. An example graph is shown below.
Figure 24: The structure of one peptide
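A minimal sketch of this conversion with RDKit, assuming the peptide is given as a one-letter sequence; the adjacency matrix built here corresponds to the matrix A used later in Eq. (40).

```python
from rdkit import Chem
from rdkit.Chem import rdmolops

def peptide_to_graph(sequence: str):
    """Convert a one-letter peptide sequence into a molecular graph G = (V, E)."""
    mol = Chem.MolFromSequence(sequence)          # peptide -> RDKit molecule
    mol_block = Chem.MolToMolBlock(mol)           # MOL-format record (atoms, bonds, coordinates)
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]                  # node set V
    bonds = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), str(b.GetBondType()))
             for b in mol.GetBonds()]                                       # edge set E
    adjacency = rdmolops.GetAdjacencyMatrix(mol)  # A, used later in Eq. (40)
    return atoms, bonds, adjacency, mol_block

atoms, bonds, A, _ = peptide_to_graph("ACDK")
print(len(atoms), "atoms,", len(bonds), "bonds")
```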
Evolutionary information: represented through the BLOSUM matrix (Blocks Substitution Matrix). The BLOSUM matrix quantifies
the similarity between protein sequences based on the conservation
of residues observed across multiple species. It records the
substitution rates of amino acids observed during sequence
alignment across species. Higher scores in the matrix indicate
common substitutions, while lower scores represent rare or
unlikely substitutions. Different BLOSUM matrices, such as
BLOSUM62, are used to balance sequence similarity based on
evolutionary distances. BLOSUM62, for example, is often used to
compare and quantify the similarity between protein sequences. The
BLOSUM matrix is represented as follows:$$\text{BLOSUM} =
\begin{bmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,20} \\ b_{2,1} &
b_{2,2} & \cdots & b_{2,20} \\ \vdots & \vdots & \ddots & \vdots
\\ b_{20,1} & b_{20,2} & \cdots & b_{20,20} \end{bmatrix}$$
Here, each row and column of the matrix corresponds to a standard amino acid, and each element $b_{i,j}$ represents the substitution score between amino acids $i$ and $j$. The BLOSUM
matrix can help reveal the evolutionary history of sequences,
playing an important role in tasks like sequence alignment and
classification.
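One way to turn this into a per-residue feature matrix is to look up, for every position of the peptide, its row of BLOSUM62; the sketch below uses Biopython's bundled BLOSUM62 table and is an illustration rather than our exact encoding (the fixed length of 50 matches the padding described later).

```python
import numpy as np
from Bio.Align import substitution_matrices

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BLOSUM62 = substitution_matrices.load("BLOSUM62")

def blosum_features(sequence: str, max_len: int = 50) -> np.ndarray:
    """Encode a peptide as a (max_len x 20) matrix of BLOSUM62 rows, zero-padded to max_len."""
    feats = np.zeros((max_len, len(AMINO_ACIDS)), dtype=float)
    for i, residue in enumerate(sequence[:max_len]):
        feats[i] = [BLOSUM62[residue, other] for other in AMINO_ACIDS]
    return feats

B = blosum_features("ACDKWLYR")
print(B.shape)  # (50, 20)
```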
Feature extraction module
Graph Neural Networks (GNNs) for Molecular Graphs
In this study, we employed Graph Neural Networks (GNNs) to extract
structural information from the molecular graph of a given
peptide. Since a GNN cannot directly process raw symbolic graph data, we introduced an embedding layer on top of the GNN to convert the graph-structured data into numerical vectors. Following this method, we considered the atomic and edge information of the molecular graph and applied the 1-dimensional Weisfeiler-Lehman (1-WL)
algorithm. The 1-WL algorithm iteratively assigns a label to each
node based on the previous labels of both the node itself and its
neighbors. However, in this study, we also considered the previous
labels of the edges between the nodes and their neighbors.
Additionally, we updated the label of each edge by considering the
labels of the two nodes it connects. To formalize this embedding
method, let $h^{(t)}(v) \in \mathbb{N}$ and $g^{(t)}(u,v) \in \mathbb{N}$ represent the labels of node $v$ and edge $(u,v)$ at iteration $t$, where $\mathbb{N}$ denotes the set of natural numbers. At iteration $t=0$, we initialize the nodes and edges based on their types in the molecular graph, such as $h^{(0)}(\text{oxygen}) = 0$, $h^{(0)}(\text{nitrogen}) = 1$, and so on. For $t>0$, the node and edge labels are updated as follows:
$$h^{(t)}(v) = \text{HASH}\left(h^{(t-1)}(v), \{(h^{(t-1)}(u), g^{(t-1)}(u,v)) \mid u \in N(v)\}\right)$$
$$g^{(t)}(u,v) = \text{HASH}\left(g^{(t-1)}(u,v), h^{(t)}(u), h^{(t)}(v)\right)$$
where $N(v)$ denotes the set of neighbors of node $v$, and HASH is a function that maps the node and edge information to a unique value in $\mathbb{N}$, ensuring uniqueness across iterations.
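A minimal sketch of this node/edge relabeling, using a Python dictionary as the HASH function that assigns a fresh integer to each unseen label tuple; the data structures are ours, not taken from any reference implementation.

```python
from collections import defaultdict
from itertools import count

def wl_labels(node_types, edges, iterations=2):
    """1-WL style relabeling of nodes and edges.

    node_types: dict node -> initial type label (e.g. atom symbol)
    edges: dict (u, v) -> initial edge label (e.g. bond type), with u < v
    """
    fresh = count()
    table = defaultdict(lambda: next(fresh))          # HASH: label tuple -> new integer label
    h = {v: table[("node", t)] for v, t in node_types.items()}
    g = {e: table[("edge", t)] for e, t in edges.items()}

    neighbors = defaultdict(list)
    for (u, v) in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    def edge_label(u, v):
        return g[(u, v)] if (u, v) in g else g[(v, u)]

    for _ in range(iterations):
        # Update node labels from their own label plus neighbor (node, edge) label pairs.
        h_new = {v: table[(h[v], tuple(sorted((h[u], edge_label(u, v)) for u in neighbors[v])))]
                 for v in h}
        # Update edge labels from their own label plus the labels of the two endpoints.
        g_new = {(u, v): table[(g[(u, v)], frozenset((h_new[u], h_new[v])))] for (u, v) in g}
        h, g = h_new, g_new
    return h, g

# Toy graph: an O=C-N fragment
h, g = wl_labels({0: "O", 1: "C", 2: "N"}, {(0, 1): "double", (1, 2): "single"})
print(h, g)
```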
Through this process, we obtain a “fingerprint” for the molecular
graph of a given peptide. This fingerprint is then encoded into a
matrix $X$ of size $n \times m$, where $n$ is the number of nodes
in the molecular graph, and $m$ is a hyperparameter of the
embedding layer. After embedding, $X$ is processed by the GNN layers to extract structural information, defined as: $$R = \text{ReLU}(X^{(k)} W_{gnn}) \tag{39}$$ where $k$ denotes the $k$-th GNN layer, $W_{gnn}$ is a weight matrix of size $m \times m$, and ReLU is a non-linear activation function.
To propagate this information through the layers, we use the
following update rule: $$X^{(k+1)} = X^{(k)} + AR \tag{40}$$
where $A$ is the adjacency matrix of the molecular graph. To
obtain the output $y_{graph}^{(k)}$ at each GNN layer, we compute
the mean value of each column of $X^{(k)} = (x_1^{(k)}, x_2^{(k)},
\cdots, x_n^{(k)})$ as follows: $$y_{graph}^{(k)} = \frac{1}{n}
\sum_{i=1}^{n} x_i^{(k)} \tag{41}$$
where $x_i^{(k)}$ represents the vector representation of the
$i$-th node in the molecular graph at the $k$-th GNN layer, and
$n$ is the number of nodes in the graph. The vector
$y_{graph}^{(k)}$ provides the representation of the peptide
sequence at the $k$-th GNN layer.
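Equations (39)-(41) amount to a few matrix operations; the NumPy sketch below mirrors them for a single forward pass (the layer count, embedding size, and random weights are placeholders).

```python
import numpy as np

def gnn_forward(X0: np.ndarray, A: np.ndarray, num_layers: int = 3, seed: int = 0):
    """Forward pass following Eqs. (39)-(41).

    X0: n x m node embedding matrix; A: n x n adjacency matrix of the molecular graph.
    Returns the list of per-layer graph representations y_graph^(k) (each of length m).
    """
    rng = np.random.default_rng(seed)
    n, m = X0.shape
    X = X0
    outputs = []
    for _ in range(num_layers):
        W = rng.standard_normal((m, m)) * 0.1      # W_gnn for this layer (random placeholder)
        R = np.maximum(X @ W, 0.0)                 # Eq. (39): R = ReLU(X^(k) W_gnn)
        X = X + A @ R                              # Eq. (40): X^(k+1) = X^(k) + A R
        outputs.append(X.mean(axis=0))             # Eq. (41): column-wise mean over the n nodes
    return outputs

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-atom chain
X0 = np.random.default_rng(1).standard_normal((3, 8))         # n=3 nodes, m=8 embedding dims
y_layers = gnn_forward(X0, A)
print([y.shape for y in y_layers])
```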
CNN Network for BLOSUM
To capture specific features, we designed a Convolutional Neural
Network (CNN) that leverages evolutionary information from BLOSUM.
The network mainly consists of a two-dimensional convolutional
layer and a non-linear activation function.
In this CNN network, the BLOSUM of a peptide is first normalized
through a scaling layer. The scaled BLOSUM is then processed by a
two-dimensional convolutional layer with a non-linear activation
function to extract local hidden features of the peptide sequence.
The process is described as follows:
First, the BLOSUM is scaled by the normalization layer:
$$B_{scaled}= normalize(B) \tag{42}$$
Next, it is processed by a two-dimensional convolutional layer and
ReLU activation function: $$H= ReLU(conv2d(B_{scaled})) \tag{43}$$
where $B_{scaled}$ is the normalized BLOSUM, conv2d represents the two-dimensional convolution operation, and $H$ is the hidden feature matrix produced by the convolution.
To handle the variable lengths of input sequences, we applied
zero- padding to standardize all sequences to a fixed length of
50, ensuring compatibility with the neural network, which requires
fixed-length inputs. To further prevent overfitting, we employed
dropout during training.
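A minimal PyTorch sketch of this branch, assuming the padded 50 x 20 BLOSUM encoding described above; the channel count, kernel size, and dropout rate are placeholders.

```python
import torch
import torch.nn as nn

class BlosumCNN(nn.Module):
    """Normalization + 2-D convolution + ReLU over a padded BLOSUM encoding (Eqs. 42-43)."""
    def __init__(self, out_channels: int = 16, kernel_size: int = 3, dropout: float = 0.3):
        super().__init__()
        self.conv = nn.Conv2d(1, out_channels, kernel_size, padding=kernel_size // 2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, blosum: torch.Tensor) -> torch.Tensor:
        # blosum: (batch, 50, 20) matrix, zero-padded to a fixed length of 50 residues.
        b_scaled = (blosum - blosum.mean()) / (blosum.std() + 1e-8)  # Eq. (42): normalization
        h = torch.relu(self.conv(b_scaled.unsqueeze(1)))             # Eq. (43): Conv2D + ReLU
        h = self.dropout(h)                                          # regularization against overfitting
        return h.flatten(start_dim=1)                                # flatten for feature fusion

x = torch.randn(4, 50, 20)           # batch of 4 peptides encoded with BLOSUM rows
print(BlosumCNN()(x).shape)          # torch.Size([4, 16000])
```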
Attention module
The attention mechanism is an attempt to mimic the human brain's ability to selectively concentrate on a few important parts while ignoring the rest. In our network architecture, the
attention module cooperates with the feature extraction module to
capture interpretable features. It assigns different weights to
the concatenated feature vectors obtained by separately combining
the output of the CNN network with the output of each GNN layer in
the feature extraction module. The higher the weight, the more
important the corresponding feature vector. For the convenience of
description, we denote all the concatenated feature vectors as F
with the size $h \times u$: $$F = (f_1, f_2, \cdots, f_h) $$
where $f_i$ denotes the $i$-th concatenated feature vector; $h$ denotes the number of concatenated feature vectors, which is determined by the number of GNN layers; and $u$ denotes the length of each concatenated feature vector. In this work, the attention
mechanism takes all the concatenated feature vectors F as input to
calculate a weight vector $a$, in which each element is the weight
of the corresponding feature vector: $$a = \text{softmax}(w_2
\tanh(w_1 F^T)) \tag{44}$$ $$z = aF \tag{45}$$
where $w_1$ denotes a weight matrix with the size $d \times u$;
$w_2$ is a weight vector of size $d$. Note that $d$ is a hyperparameter that can be chosen freely and tuned during network training. The softmax ensures that all the computed weights sum to 1, and tanh is a non-linear activation function. Finally, we multiply the weight vector $a$ by $F$, yielding the final representation $z$ of the peptide sequence.
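A compact PyTorch sketch of Eqs. (44)-(45); the hyperparameter d and the feature sizes are placeholders.

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Weights the h concatenated feature vectors and pools them into one representation z."""
    def __init__(self, u: int, d: int = 32):
        super().__init__()
        self.w1 = nn.Linear(u, d, bias=False)   # w1: d x u weight matrix
        self.w2 = nn.Linear(d, 1, bias=False)   # w2: weight vector of size d

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        # F: (batch, h, u) concatenated feature vectors from the CNN and GNN branches.
        scores = self.w2(torch.tanh(self.w1(F))).squeeze(-1)   # w2 tanh(w1 F^T)
        a = torch.softmax(scores, dim=-1)                       # Eq. (44): attention weights
        z = torch.bmm(a.unsqueeze(1), F).squeeze(1)             # Eq. (45): z = a F, shape (batch, u)
        return z

F = torch.randn(4, 3, 128)   # batch of 4, h = 3 GNN layers, u = 128 per concatenated vector
print(FeatureAttention(u=128)(F).shape)   # torch.Size([4, 128])
```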
Output module
The feature vector $z$, which contains the structural and evolutionary information of the peptide, is produced by the upstream attention module and passed into a fully connected layer with a ReLU activation function: $$y = \text{ReLU}(Wz + b) \tag{46}$$
where $W$ is the weight matrix and $b$ is the bias vector.
The output of the fully connected layer, $y = [y_1, y_2]$, is fed
into a softmax layer to compute the binding probabilities. Here:
$y_1$ represents the score indicating the likelihood of the
peptide binding to PE. $y_2$ represents the score indicating the
likelihood of the peptide not binding to PE. The softmax function
transforms these scores into probabilities: $$p_1 =
\frac{\exp(y_1)}{\exp(y_1) + \exp(y_2)} \tag{47}$$ $$p_2 =
\frac{\exp(y_2)}{\exp(y_1) + \exp(y_2)} \tag{48}$$
where: $p_1$ is the predicted probability that the peptide binds
to PE, and $p_2$ is the predicted probability that the peptide
does not bind to PE.
The model will predict that the peptide binds to PE if $p_1 >
p_2$, and predict no binding if $p_2 > p_1$.
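The output module thus reduces to a linear layer followed by softmax; a sketch under the same placeholder sizes as above.

```python
import torch
import torch.nn as nn

class OutputModule(nn.Module):
    """Fully connected layer + softmax giving binding / non-binding probabilities (Eqs. 46-48)."""
    def __init__(self, u: int = 128):
        super().__init__()
        self.fc = nn.Linear(u, 2)    # W and b of Eq. (46); two outputs y = [y1, y2]

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        y = torch.relu(self.fc(z))            # Eq. (46): y = ReLU(Wz + b)
        return torch.softmax(y, dim=-1)       # Eqs. (47)-(48): probabilities [p1, p2]

z = torch.randn(4, 128)                       # representations from the attention module
p = OutputModule()(z)
binds_pe = p[:, 0] > p[:, 1]                  # predict binding when p1 > p2
print(p, binds_pe)
```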