Modeling

~Preface~

Figure 1: “I rebel-therefore we exist” -Albert Camus

Welcome to the Modeling page of iGEM Thessaly 2024, where we navigate the frontiers of synthetic biology through innovative and cutting-edge modeling techniques, in our pursuit of combatting Verticillium dahliae and defending olive trees.

Our aim is driven by a spirit of rebellion-captured in the words of Albert Camus: "I rebel-therefore we exist.". This quote embodies our commitment to challenging conventional wisdom, pushing the boundaries of what is possible and advancing knowledge via innovative scientific inquiry. In synthetic biology, modeling is not just a tool; It’s a transformative process that allows us to question, rethink and redefine the boundaries of our understanding.

~Overview~

In synthetic biology, diverse modeling techniques are essential to simulate complex biological processes accurately. These models provide both a mathematical and a computational framework to represent dynamic interactions between several molecules, offering valuable insights into how biological functions respond to various parameters. Our project aims to combat Verticillium dahliae, a harmful fungus that affects destructively olive trees in the Mediterranean region, using RNA interference (RNAi) technology. We have developed a comprehensive modeling strategy, integrating an Ordinary Differential Equation (ODE) model, Machine Learning as well as Deep Learning prediction models, and finally an agent-based model simulation. We chose to utilize three diverse modeling techniques to gain a more holistic and well-rounded view of our project design. Our primary objective is to enhance the wet lab design process by providing a detailed, modular in silico representation of our project components. This approach enables the effective utilization of data provided by our laboratory, which will be used to analyze various aspects of the biological pathway and compare experimental results with bibliographical data.

~Biological Timeline~

Figure 2: Model Biological Timeline Diagram

The Model page is structured to mirror the natural biological timeline of the pathway we’re studying, transitioning from molecular-level events to more macroscopic processes. This flow ensures that readers from any scientific field can easily follow the logical progression of our project’s modeling, while comprehending the purpose of each step within the context.

~Section 1: Unseen Bonds - an AI approach~

"The machine does not isolate man from the great problems of nature but plunges him more deeply into them."
- Antoine de Saint-Exupéry

4.1. System

Our strategy to combat V. dahliae relies on delivering dsRNA through outer membrane vesicles (OMVs), but a major challenge in our iGEM project was finding an efficient way to encapsulate the double-stranded RNA (dsRNA) within these OMVs. As we explored potential solutions, we discovered that simplicity often holds the answer. This realization led us to design a two-domain chimeric protein capable of binding dsRNA and transferring it into the periplasmic space, where it could be encapsulated into OMVs for delivery. This two-domain system offered an efficient solution by combining both dsRNA binding and transport functions into one structure, making it a practical and effective choice for RNA delivery. Instead of relying on multiple separate proteins or complex mechanisms to first bind and then encapsulate the dsRNA, the two-domain design allowed both tasks to be carried out simultaneously and efficiently.

Figure 3: The chimeric protein that leads to the encapsulation of the dsRNA into the OMVs. With light blue is depicted the dsRNA Binding Domain and with deep blue is depicted the Outer Membrane Associated protein that leads to the encapsulation of the dsRNA inside the OMVs

However, the real challenge was identifying a dsRNA binding domain with strong enough affinity to consistently capture and encapsulate the dsRNA. Unlike sequence-specific RNA motifs, dsRNA binding domains (dsRBDs) provide a universal solution by interacting with dsRNA based on its structure rather than its sequence. This versatility is important for maintaining our system's modularity, ensuring it can be easily adapted to encapsulate various dsRNA molecules regardless of their sequence. Solving this challenge was essential for creating a reliable, flexible system capable of delivering dsRNA efficiently.

4.2. Aim

Our main goal was to better understand how the chimeric protein would function, particularly because producing and evaluating protein domains is a highly time-consuming process that is difficult to execute in the lab. Simultaneously, we aimed to finalize our project by identifying the optimal dsRNA binding domain and seamlessly incorporating it into the chimeric protein design. This approach would not only streamline our system but also enhance its effectiveness in delivering dsRNA.

4.3. Requirements

While developing AI models to predict the binding efficiency between dsRNA and proteins, we encountered several challenges due to gaps in scientific research and infinitesimal data available for dsRNA, as well as its interactions with various proteins. Since our training data are limited to single-stranded RNA (ssRNA)- protein interactions, including RNA motif sequences, protein sequences, secondary structure of RNA and docking scores, we had to take into consideration some assumptions regarding our models.

Assumptions

• Homogeneity of Interaction Mechanisms: We assume that the binding patterns learned from ssRNA can also apply to dsRNA-protein interactions, due to limited data regarding dsRNA overall.

• Label Quality: Docking scores used as labels are assumed to accurately reflect the strength of the binding interactions.

• Independence of Features: We assume that the features provide independent information about binding interactions without significant overlap.

• Stability of Biological Systems: We assume that biological systems remain stable enough for patterns observed in training data to hold true in different contexts.

• Computational Resources: We assume that we have enough computational resources available for training.

• Error Tolerance: We acknowledge that some level of prediction error is acceptable due to the variability and complexity in biological systems, especially in molecular standards.

By defining these requirements and assumptions, we can gain a better understanding of our models, identify their limitations and better develop them in the future.

4.4. Methodology

The Turing Test: A brief Introduction to Artificial Intelligence(AI)

"You're programmed to be curious, just like we are."
- Ex Machina(2014)

Figure 4: Alan Turing, possibly the first person that gave the earliest public lecture (London, 1947) to mention computer intelligence, saying, “We want a machine that can learn from experience”.[2]

Artificial Intelligence refers to the capability of machines to perform tasks that typically require human intelligence, such as understanding natural language, recognizing patterns, solving problems, and making decisions. A key concept in evaluating AI is the Turing Test, proposed by Alan Turing.

In 1950 the British mathematician and computer scientist Alan Turing, proposed a test that was introduced in his seminal paper titled “Computing Machinery and Intelligence”[3].

How the Turing Test works:

In a typical Turing Test scenario , there are three participants involved:

1) A Human Judge(Interrogator): This role engages in the conversation with both the machine and a human through text-based interface, ensuring they cannot see or hear either of the participants.

2) A Machine:The AI system attempts to respond to questions in a way that mimics human conversation.

3) A Human:This participant’s role is to respond to the questions, providing a baseline comparison.

The interrogator's task is to determine which participant is the machine based solely on their responses. If the judge cannot reliably identify the machine, it is said to have passed the Turing Test, demonstrating its ability to simulate human-like intelligence.

Although influential, the Turing Test has faced criticism. Some argue that passing the test does not equate to true understanding or consciousness and that the test measures only conversational ability rather than genuine intelligence or comprehension[3]. However, the Turing Test remains a significant benchmark for evaluating AI systems and continues to inspire discussions about machine intelligence. As AI technology evolves, so do methods for assessing its capabilities, but the foundational ideas introduced by Turing still remain relevant to this day.

Domains of AI

Figure 5: Domains of Artificial Intelligence[4]

1) Artificial Intelligence (AI): AI generally refers to the simulation of human intelligence in machines, enabling them to perform tasks that typically require human cognitive understanding [4].

2) Machine Learning (ML): ML is a subset of AI that focuses on developing algorithms that allow computers to learn from data and improve over time without being explicitly programmed [4].

3) Deep Learning (DL): Subset of ML which makes the computation of multi-layer neural networks feasible[4].

Use of Artificial Intelligence in Synthetic Biology

Figure 6: AI instances in Synthetic Biology

Artificial Intelligence (AI) has been a field of interest across a wide range of scientific and non-scientific disciplines. Specifically regarding Synthetic Biology, AI tools are nowadays essential for the modeling of biological processes and the conduction of predictions, to assist in experimental validation and project designs. One of their key characteristics that make them extremely useful, is their ability to combine both efficiency and automation in a wide variation of tasks such as data analysis and experimental design betterment. This leads to faster and more practical engineering of biological systems and minimizes costs associated with research and development [5]. Moreover, their forecasting capability is crucial for optimizing the laboratory process, which can lead to breakthroughs [7]. Lastly, by integrating AI into synthetic biology, an interdisciplinary approach that fosters innovation can flourish. That happens due to the involvement of numerous scientific fields during the development of these kind of models, such as biologists, data analysts, computer scientists and many more[6]. As AI technology continues to evolve and assist Synthetic Biology, we can expect more groundbreaking advancements that will reshape our understanding of Biology overall and its applications in major problems that affect everyday human life.

We chose to utilize AI technology because it is particularly well-suited for predicting dsRNA-dsRBD protein binding efficiency, thanks to its ability to replicate the inherent stochastic nature of biological processes. RNA-protein interactions, like those involving dsRNA and dsRBD, often exhibit a degree of randomness, which AI models can capture effectively through their probabilistic frameworks. Additionally, AI models provide a substantial speed advantage over traditional docking tools, which can take hours or even days to compute binding scores. By leveraging deep learning and other AI techniques, we can obtain predictions within minutes, ensuring both speed and accuracy. This efficiency allows for rapid iteration and refinement in our project design, ultimately accelerating our understanding of dsRNA-dsRBD interactions.

Building an AI model

Figure 7: AI building steps

Developing an AI model can be a complex task that requires a variety of skills; coding experience, deep mathematical understanding, visualization capability, logical thinking and many more. During the development of such models, there are some distinct steps that need to be followed in order to successfully achieve the construction of such tools.

Here are the 5 most important steps that need to be taken during this process:

• Step 1: Defining the Objective

Our objective for these particular models was to predict the binding efficiency of our selected dsRNAs with specific domain candidates of our chimeric protein, for ensuring the encapsulation of dsRNA into the OMVs. This objective was clearly defined by our Wet Lab team, with the understanding that the utilization of this model would enhance the overall project design and strengthen the engineering aspects of the project.

• Step 2: Data Gathering and Preprocessing

By carefully gathering and preprocessing data related to our objective, such as RNA sequence motifs, protein sequences, secondary structure information and docking scores, we established a robust foundation for developing AI models capable of accurately predicting binding efficiencies. This structured approach ensures that our models are trained on high-quality data, enhancing their ability to provide valuable insights for our Lab.

• Step 3: Dataset Building

Building a robust dataset regarding our objective, required careful planning along with some validation steps. Ultimately, a well-constructed dataset not only improves the predictive power of AI models but also facilitates informed decision-making in the Lab project design while advancing our understanding of RNA biology and binding mechanics.

• Step 4: Model Development

For our project, we developed models specifically designed to predict the binding efficiency between dsRNA sequences and candidate protein domains. We utilized machine learning models like Random Forest, XGBoost, and CatBoost, as well as an LSTM deep learning model to handle the sequential nature of RNA and protein data. The architecture was designed to capture intricate patterns between RNA motifs, protein sequences, and secondary structures. During development, we tuned hyperparameters like learning rate, batch size, and network depth to achieve optimal performance for accurate binding efficiency predictions.

• Step 5: Model Performance Evaluation

Once our models were trained, we evaluated their performance using metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE) for regression tasks. Cross-validation techniques were applied to ensure the model's predictions generalized well across different data subsets. We also compared predictions from our four distinct models to identify the most accurate one, refining the models based on their test set performance. This evaluation helped confirm the reliability of our predictions and guided further steps in our project’s experimental design.

Our AI Development Approach

Data Gathering

During the data gathering process we navigated through various databases in order to find a suitable dataset with the required information for our models. After an extensive bibliographical review, we found a database (RBPDB: RNA Binding Protein Database[8]) that provided the necessary information, from which we would be able to construct and shape our custom dataset.

The needed data for our dataset, which we were urged to revive by our Wet Lab subteam are the following:

RNA motif Sequence: The nucleotide sequences that serve as potential binding sites.

RNA Secondary Structure: Predicted folding structures of the RNA, necessary for understanding the crucial role of them during the binding process.

Protein ID: Unique Identifiers of the RNA-binding proteins involved.

Protein Sequence: The amino acid sequences of the RNA-binding proteins.

Docking Score of the Binding: Quantitative measures of how strongly the RNA and the protein are likely to bind, based on computational docking simulations.

By curating this data, we could accurately represent the biophysical properties of RNA-protein interactions for further computational analysis.

After obtaining FASTA files, we aimed to investigate the binding interactions between RNAs and their associated protein counterparts. To achieve this, we employed a multi-step computational approach, integrating RNA and protein structure prediction tools, followed by docking simulations to evaluate RNA-protein binding energies.

RNA Data Preprocessing

Initially, the sequences of single-stranded RNAs were provided in FASTA format. After we cleaned our data by removing any rows containing missing or invalid sequence motifs, we came across another task. We encountered complex RNA motif patterns such as (XXXX)(n), which needed to be converted into continuous sequences. Using regular expressions, we transformed these patterns. Furthermore, we established a threshold for the length of the RNA motif sequences.

Regular Expressions converter.py

import re
  
  def convert_pattern(sequence):
  # Define the regex pattern to match:
  # - Single character with repeats: X(n)
  # - Single character with range: X(n-k)
  # - Multiple characters with repeats: (XXXX)(n)
  # - Multiple characters with range: (XXXX)(n-k)
  pattern = r'([A-Z])\((\d+|n)(?:-(\d+|n))?\)|\(([^)]+)\)\((\d+|n)(?:-(\d+|n))?\)'
  
  def replacement(match):
    if match.group(1):
      letter = match.group(1)
      n1 = match.group(2)
      n2 = match.group(3)
  
      # Determine the maximum number to repeat the character
      n1 = 10 if n1 == 'n' else int(n1)
      if n2:
        n2 = 10 if n2 == 'n' else int(n2)
        n = max(n1, n2)
      else:
        n = n1
  
      return letter * n
  
    elif match.group(4):
      sequence_part = match.group(4)
      n1 = match.group(5)
      n2 = match.group(6)
  
      n1 = 10 if n1 == 'n' else int(n1)
      if n2:
        n2 = 10 if n2 == 'n' else int(n2)
        n = max(n1, n2)
      else:
        n = n1
  
      return sequence_part * n
  
  transformed_sequence = re.sub(pattern, replacement, sequence)
  return transformed_sequence if 50 <= len(transformed_sequence) <= 220 else "Sequence length out of range"

To obtain three-dimensional structures of the RNAs, we first predicted their secondary structures using RNAfold, a well-established tool that calculates RNA folding based on minimum free energy (MFE) [9]. These secondary structure predictions were then used as input for RNAComposer, a reliable tool that generates three-dimensional RNA models in PDB (Protein Data Bank) format [10]. The combination of RNAfold and RNAComposer allowed us to efficiently derive tertiary structures from the RNA sequences.

Protein Data Preprocessing

In parallel, the corresponding protein sequences, also provided in FASTA format, were used to predict protein tertiary structures. For this purpose, we utilized AlphaFold, a state-of-the-art tool for protein structure prediction based on deep learning[11]. AlphaFold produces highly accurate 3D models of proteins, which are crucial for understanding protein-RNA interactions. The output of AlphaFold, in PDB format, provided the necessary structural information to proceed with the docking simulations.

Docking Data

The final step of the methodology involved evaluating the binding interactions between the predicted RNA and protein structures. For this, we employed the HDock docking tool, which is specifically designed for protein-ligand and protein-nucleic acid docking [12]. HDock computes the optimal binding configurations and calculates the Docking Score of the RNA-protein complexes, providing insights into their potential interaction strength and stability. The docking results, allowed us to assess the affinity between each RNA and its associated protein.

This integrative approach, combining RNA structure prediction, protein modeling, and RNA-protein docking, provided a robust framework to explore RNA-protein interactions at a structural level, offering insights into their binding mechanisms and energetics.

Once we gathered these data, we proceeded in creating two dataframes:

The first dataframe contains protein sequences, RNA motifs and secondary secondary structures of the RNA motifs.

The second dataframe includes docking scores corresponding to specific RNA-protein interactions

For the models to process the sequences, we utilized a one-hot encoding technique to convert categorical data (RNA motifs, protein sequences and secondary structure) into numerical matrices.

RNA Motif Encoding

Each nucleotide in the RNA sequence is represented as a 4-dimensional binary vector. Nucleotides that can be any multiple bases (based on IUPAC codes) are encoded with fractional values to reflect uncertainty.

RNA IUPAC one-hot encoding map.py


  IUPAC_RNA = {
      'A': 'A', 'C': 'C', 'G': 'G', 'U': 'U',
      'R': 'AG', 'Y': 'CU', 'S': 'GC', 'W': 'AU',
      'K': 'GU', 'M': 'AC', 'B': 'CGU', 'D': 'AGU',
      'H': 'ACU', 'V': 'ACG', 'N': 'ACGU'
  }
  
  def onehot_RNA(motif):
      unique_bases = ['A', 'C', 'G', 'U']
      one_hot_matrix = np.zeros((len(motif), len(unique_bases)), dtype=float)
      for i, char in enumerate(motif):
          if char in IUPAC_RNA:
              for base in IUPAC_RNA[char]:
                  one_hot_matrix[i, unique_bases.index(base)] = 1 / len(IUPAC_RNA[char])
      return one_hot_matrix

Protein Sequence Encoding

Protein sequences are made of 20 different amino acids, each represented as a 20-dimensional vector, where each position corresponds to a specific amino acid.

RNA secondary structure Encoding.py


  def onehot_secondary_structure(structure_seq):
      structure_chars = '<[.>](){}'
      one_hot_matrix = []
      for char in structure_seq:
          one_hot_vector = [0] * len(structure_chars)
          if char in structure_chars:
              one_hot_vector[structure_chars.index(char)] = 1
          one_hot_matrix.append(one_hot_vector)
      return np.array(one_hot_matrix)

Protein Sequence encoding.py


  def onehot_protein(sequence):
      amino_acids = 'ACDEFGHIKLMNPQRSTVWY'
      one_hot_matrix = []
      for char in sequence:
          one_hot_vector = [0] * len(amino_acids)
          if char in amino_acids:
              one_hot_vector[amino_acids.index(char)] = 1
          one_hot_matrix.append(one_hot_vector)
      return np.array(one_hot_matrix)

Normalization of Docking Scores

To ensure the model processes the data effectively, we normalized the docking scores to fall in a range between 0 and 1 using MinMaxScaler.

Docking Scores Normalization .py

from sklearn.preprocessing import MinMaxScaler
  scaler = MinMaxScaler()
  docking_score_normalized = scaler.fit_transform(df["docking_score"].values.reshape(-1 ,1 ))

Mean Squared Error and Mean Absolute Error

Mean Squared Error

The Mean Squared Error(MSE) is used in regression tasks and is defined as:

Where:

•N is the total number of instances

•y_i is the actual label for instance i.

•μ is the mean of the labels, calculated as:

Mathematical Explanation:

1) For each data point i, you subtract the predicted mean μ from the actual value y_i .

2) Square the result to penalize larger errors more heavily.

3) Sum these squared errors for all N data points.

4) Divide the sum by N to calculate the average squared error, resulting in the Mean Squared Error.

Interpretation:

•The squaring of differences emphasizes larger errors, making MSE more sensitive to outliers. MSE grows quadratically as the prediction error increases, penalizing predictions that are further off the mark more than those closer to the actual values.

Mean Absolute Error

The Mean Absolute Error (MAE) is defined as:

Where:

• N is the total number of instances.

• y_i is the actual label for instance i.

•μ is the mean of the labels, calculated as:

Mathematical Explanation:

1) For each data point i, subtract the predicted mean μ from the actual value y_i .

2) Take the absolute value of this difference (ignoring whether the error is positive or negative).

3) Sum these absolute errors for all N data points.

4) Divide the sum by N to calculate the average absolute error, resulting in the Mean Absolute Error.

Interpretation:

• MAE is less sensitive to outliers compared to MSE because it uses absolute values rather than squaring the errors. It treats all errors equally, providing a linear penalty to errors, regardless of their size.

Comparison

• MSE penalizes larger errors more severely than MAE due to the squaring of errors.

• MAE provides a more balanced view of prediction errors and is not influenced as much by large outliers.

Both metrics are useful in regression tasks, depending on whether you want to emphasize larger errors (MSE) or treat all errors equally (MAE).

Model 1

Deep Learning Bidirectional LSTM Model BAF (Binding Affinity Forecasting)

Architecture

Figure 8: BAF model Architecture

We used a deep learning model with Bidirectional LSTM (Long Short-Term Memory) layers to handle the sequential form of RNA and protein data.

LSTM networks are particularly effective at capturing temporal dependencies in sequential data, which is crucial when modeling biological sequences, where each nucleotide or amino acid in a sequence impacts subsequent interactions. Additionally, LSTM networks can retain information over long sequences and manage vanishing gradient problems better than traditional recurrent neural networks (RNNs), making them ideal for our long RNA motifs and protein sequences.

Bidirectional LSTM was employed to allow the model to learn from both the forward and backward directions of the sequence. This bidirectionality improves the model's ability to capture context within the biological sequence, which is essential in understanding RNA-protein interactions, where both ends of a sequence may have crucial roles in binding affinity.

The model processes three different inputs: RNA motifs, protein sequences, and RNA secondary structure. Each input sequence passes through its own LSTM layer, followed by a Dropout to prevent overfitting. In our model architecture, we merge the outputs of the three different LSTM layers (protein, RNA, and secondary structure). After the merge, we add two Dense layers, with 128 and 64 units, respectively, both using ReLU activation. The Adam optimizer is applied for training, ensuring efficient convergence. The model is evaluated using Mean Squared Error (MSE) and Mean Absolute Error (MAE) to track its performance in predicting binding efficiency.

Model Architecture.py

from tensorflow.keras.models import Model
  from tensorflow.keras.layers import Input, LSTM, Bidirectional, Dropout, Dense, concatenate
  
  # Define input shapes
  protein_input_shape = (X_protein.shape[1], X_protein.shape[2])
  rna_input_shape = (X_rna.shape[1], X_rna.shape[2])
  structure_input_shape = (X_structure.shape[1], X_structure.shape[2])
  
  # Protein input and LSTM layers
  protein_input = Input(shape=protein_input_shape, name='protein_input')
  protein_lstm = Bidirectional(LSTM(128))(protein_input)
  protein_output = Dropout(0.4)(protein_lstm)
  
  # RNA input and LSTM layers
  rna_input = Input(shape=rna_input_shape, name='rna_input')
  rna_lstm = Bidirectional(LSTM(64))(rna_input)
  rna_output = Dropout(0.4)(rna_lstm)
  
  # Secondary structure input and LSTM layers
  structure_input = Input(shape=structure_input_shape, name='structure_input')
  structure_lstm = Bidirectional(LSTM(64))(structure_input)
  structure_output = Dropout(0.4)(structure_lstm)
  
  # Merge the outputs
  merged = concatenate([protein_output, rna_output, structure_output])
  dense_1 = Dense(128, activation='relu')(merged)
  dropout_1 = Dropout(0.5)(dense_1)
  dense_2 = Dense(64, activation='relu')(dropout_1)
  dropout_2 = Dropout(0.5)(dense_2)
  
  # Output layer
  output = Dense(1, activation='linear', name='output')(dropout_2)
  
  # Compile the model
  model = Model(inputs=[protein_input, rna_input, structure_input], outputs=output)
  model.compile(optimizer='adam', loss='mse', metrics=['mae'])
  model.summary()

Model Training

We used early stopping to prevent overfitting. By stopping the training once the validation loss stopped improving. The model was trained on 80% of the training data, with 20% used for validation.

Figure 9: 3D Visualization of Model Loss over Epochs and Batch Size

Model Training.py

from tensorflow.keras.callbacks import EarlyStopping
  
  early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
  
  history = model.fit(
      [X_protein_train, X_rna_train, X_structure_train],
      y_train,
      validation_split=0.2,
      epochs=50,
      batch_size=8,
      callbacks=[early_stopping],
      verbose=1
  )

Model 2

Machine Learning Random Forest Regressor

PEGGY(Protein Engineering and Genetic Guided Yield) V1

For our second model, we implemented Random Forest Regressor model to predict RNA-protein docking scores.Unlike deep learning, this model operates with flattened one-hot encoded inputs, providing a robust baseline for predicting binding affinity.

The Random Forest Regressor was chosen due to its efficiency and capability in handling high-dimensional input data like our encoded protein, RNA motif and secondary structure sequences. Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the average prediction of the individual trees. This provides us with a solid approach for reducing overfitting and improving the model’s generalizability.

Architecture and Training

Feature Preparation: The protein sequence was flattened into a single vector, while the RNA motifs and secondary structures were concatenated for double-stranded RNA (dsRNA). This flattened feature vector was fed into the Random Forest model for training.

We trained the Random Forest model on the same 80% training set used for the deep learning model, with the remaining 20% reserved for testing and evaluation.

Random Forest Regressor.py

# Initialize the Random Forest Regressor
  rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
  
  # Train the Random Forest model
  rf_model.fit(X_train_ml, y_train_ml)
  
  # Predict on the test set with Random Forest
  y_pred_rf = rf_model.predict(X_test_ml)
  
  # Evaluate the Random Forest model
  rf_mse = mean_squared_error(y_test_ml, y_pred_rf)
  print(f"Random Forest Mean Squared Error: {rf_mse}")

Evaluation

After training, the Random Forest model yielded a Mean Squared Error (MSE) of 0.044, indicating its ability to predict RNA-protein docking scores with reasonable accuracy. This model provides a straightforward machine learning alternative to the deep learning model for cases where computational resources are limited or interpretability is prioritized.

Model 3

Machine Learning XGBoost PEGGY V2

XGBoost was the third model we used to predict docking scores. XGBoost, a highly efficient implementation of gradient boosting, builds trees sequentially, where each subsequent tree attempts to correct the errors made by the previous ones. This iterative approach allows XGBoost to perform well even on complex datasets, making it an excellent choice for predicting RNA-protein interactions.

Architecture and Training

For the XGBoost model, we again flattened the protein, RNA, and secondary structure sequences and fed them into the model as feature vectors. We trained the model using the same training and test split as the Random Forest model, ensuring consistency across all our models.

XGBoost Regressor.py

# Initialize the XGBoost Regressor
  xgb_model=XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
  
  # Train the XGBoost model
  xgb_model.fit(X_train_ml, y_train_ml)
  
  # Predict on the test set with XGBoost
  y_pred_xgb = xgb_model.predict(X_test_ml)
  
  # Evaluate the XGBoost model
  xgb_mse = mean_squared_error(y_test_ml, y_pred_xgb) 
  print(f"XGBoost Mean Squared Error: {xgb_mse}")

Evaluation

The XGBoost model achieved a Mean Squared Error (MSE) of 0.058, making it a strong competitor alongside Random Forest. The model is particularly suited for larger datasets and provides more flexibility in handling overfitting due to its inherent regularization techniques.

Model 4

Machine Learning CatBoost Regressor PEGGY V2

The final machine learning model we implemented was CatBoost. CatBoost is designed to handle categorical data and reduce overfitting in small datasets. Its ability to automatically encode categorical features and deal with missing data made it a fitting choice for our relatively complex RNA-protein interaction dataset.

Architecture and Training

Like the other models, the protein, RNA, and secondary structure sequences were flattened and used as input features. The CatBoost model was trained using the same dataset and evaluation methods to maintain consistency.

CatBoost Regressor.py

# Initialize the CatBoost Regressor
  catboost_model = CatBoostRegressor(iterations=100, learning_rate=0.1, random_state=42, verbose=0)
  
  # Train the CatBoost model
  catboost_model.fit(X_train_ml, y_train_ml)
  
  # Predict on the test set with CatBoost
  y_pred_catboost = catboost_model.predict(X_test_ml)
  
  # Evaluate the CatBoost model
  catboost_mse = mean_squared_error(y_test_ml, y_pred_catboost)
  print(f"CatBoost Mean Squared Error: {catboost_mse}")

Evaluation

The CatBoost model resulted in a Mean Squared Error (MSE) of 0.053. This performance demonstrates that CatBoost can effectively model RNA-protein interactions, particularly when there is a need for robust handling of categorical data and a balanced performance across different datasets.

4.5. Results

We employed our four models to predict the docking score between our dsRNA sequences and our primary domain candidate, RNAse III. Based on insights from the Post Doctoral Researcher, Rodolfo Rasia, was identified as the ideal domain candidate, considering its compatibility with our biological system and the dsRNA sequences. To validate this hypothesis, we tested the dsRNA Binding Domain (dsRBD) of the RNAse III, using our models to confirm the accuracy of this selection. More specifically, we tested the sequence of dsRBD of RNAse III identified from P. putida KT2440, as it is codon optimized for our chassis.

These are the following genomic sequences from which our dsRNAs occur:

• RGS1 (V.dahliae gene (MG583845.1 (Accession number- NCBI)) : ATGGCCGCCATCTCCTATTCCAACCGCTCCTCAATCCGAGATCAACTACCGCAGTCGACATCGACCTCACG
ATCTTCCACCAGAGTTGCCAGTCGCCGCTCTTCACTCGCCTCGCCTTCGGCTCCCTCATCACCCGCCGTCG
CAGTCTGCGCTGTCCACCCTGCAGTCACAGTCACAGTCACAACCCGACTACTCCATACGTCAGCGTTCCTC
CAATTCCGCTTTGGGCAGAGTCGCCTCCAGTGCCGATTCCGCTTTCAGCCCCACCGACTTGGACAAG

• AAC (V.dahliae gene (NW_009276936.1:165449-167157 (Accession number- NCBI)):

ATCCAGTCGACGACTGCCTCTATTACCTGTCTTCATCCACCTCCGGTTCATCGAAACCACTCTCCCCTCAATC
CTTTCTTCGTCAGGTTATTCCGTTGAAGCAGTCACAATGTCCGTCGAGAAGCAGACCGTTCTGGGCATGCCG
GTATGTGAATCCCGTCCTCCTTTTTTTCCACGGCTCGCCAAAATCCCTCGCGACGACAACTCTCGAGCCCGA
TTGTCCATCATCACCAATGCTTCGTCGGATAAGCTGCTCATCCAATTGGCCCTTTGTGGGCGA

• THI20 (V.dahliae gene (XM_009655087.1 (Accession number- NCBI)):

ATGGCACAGCAGATGGGCCGAGTGCTCGTAATCGCGGGGTCGGACAATTCAGGCGGGGCCGGGCTCGAGGCCG
ACCAAAAGGTCATCGCCGCCCACGGCTGCTATGCGCAGACGGCAACGACGGCGCTCACGGCGCAGGACACAA
AGGGCGTGTACGATATTCACCACGTGCCGGGCGCCTTCCTGCGCAGGCAGATTGACCTCTGCTTCGGAGACG
TGGGCGTCGACGTTGTAAAGACGGGCATGCTCGCGTCGGCCGAGACGATTCGCGTCGTGGCGA

Our Metric

In this study, we aimed to predict the binding efficiency between dsRNA sequences and the dsRBD of RNAse III. As our metric, we utilized the Docking Score scale established by Hdock. These scores are calculated based on ITScorePP or ITScorePR functions, according to Hdock's documentation.

In RNA-protein complexes, an average docking score is approximately -200, with more negative values indicating a higher binding likelihood. However, it's important to note that these scores do not represent the true binding affinity between two molecules, as they are not calibrated against experimental data.

Our Results’ Matrices

Our first goal was to test the dsRBD from P. putida KT2440:

First Prediction for the dsRBD of RNaseIII fromP. putida KT2440:
Sequence from Uniprot (Accession number:Q88E47 · HGD_PSEPK):

MSASLAGLERKLGYTFKNQDQMLLALTHRSYAGRNNERLEFLGDAILNFVAGEALFERFPQAREGQLSRLRA RLVKGETLARLARGFDLGDYLRLGSGELKSGGFRRESILADALEALIGAIYLDADMDTARERILAWLADEFE GLTLVDTNKDPKTRLQEFLQSRSCELPRYEVVDIQGEPHCRTFFVECEVVLLNNKSRGQGVSRRIAEQVAAA SALIALGVENGND

Initially, we tested the dsRNA Binding Domain of RNAse III. The docking scores from our first prediction showed strong binding potential between the dsRNAs and the dsRBD of RNAse III, suggesting favorable interactions.

Docking Score	RGS1	AAC	THI20
BAF	-449.97	-445.14	-463.03
PEGGY V1	–420.71	-425.98	-438.56
PEGGY V2	-418.27	-419.13	-428.45
PEGGY V3	-431.03	-435.60	-433.33

Figure 10: First Prediction for RNAse III KT2440

Second Prediction for the dsRBD of RNaseIII fromP. putida KT2440 - Repeat
Sequence from Uniprot (Accession number:Q88E47 · HGD_PSEPK):

Considering the stochastic nature of our models, we decided to repeat the first prediction to verify that the outcome wasn’t random. By applying the concept of repeatability, we discovered that not only were our predictions consistent, but the results of this second prediction closely mirrored those of the first, reinforcing the reliability of our model.

Docking Score	RGS1	AAC	THI20
BAF	-451.09	-449.23	-458.05
PEGGY V1	–424.71	-426.68	-439.34
PEGGY V2	-420.72	-420.43	-427.91
PEGGY V3	-435.56	-438.57	-442.28

Figure 11: Second Prediction for RNAse III KT2440

Third Prediction (TolB was used as a Negative Control):

Sequence of TolB (Uniprot Accession number: P00509 · AAT_ECOLI):

MFENITAAPADPILGLADLFRADERPGKINLGIGVYKDETGKTPVLTSVKKAEQYLLENETTKNYLGIDG IPEFGRCTQELLFGKGSALINDKRARTAQTPGGTGALRVAADFLAKNTSVKRVWVSNPSWPNHKSVFNSA GLEVREYAYYDAENHTLDFDALINSLNEAQAGDVVLFHGCCHNPTGIDPTLEQWQTLAQLSVEKGWLPLF DFAYQGFARGLEEDAEGLRAFAAMHKELIVASSYSKNFGLYNERVGACTLVAADSETVDRAFSQMKAAIR ANYSNPPAHGASVVATILSNDALRAIWEQELTDMRQRIQRMRQLFVNTLQEKGANRDFSFIIKQNGMFSF SGLTKEQVLRLREEFGVYAVASGRVNVAGMTPDNMAPLCEAIVAVL

For our third prediction, we utilized the TolB protein as a negative control to ensure that our model wasn’t simply generating good docking scores for any protein. Since it was already known that TolB does not interact favorably with dsRNAs, we expected lower scores. As predicted, the docking scores fell below -200, confirming that binding was unlikely. This validated our model's accuracy in passing the negative control test.

Docking Score	RGS1	AAC	THI20
BAF	-145.32	-165.67	-126.54
PEGGY V1	–134.45	-147.94	-110.85
PEGGY V2	-128.47	-139.11	-101.43
PEGGY V3	-136.97	-130.36	-106.76

Figure 12: Third Prediction for TOLb (Negative Control)

Fourth Prediction (testing Double-stranded RNA-binding protein 2 (DRB2_ORYSJ)):

Sequence of DRB2_ORYSJ (Uniprot Accession number: Q0DKP4 · DRB2_ORYSJ):

MYKNQLQELAQRSCFNLPSYACIREGPDHAPRFKATVNFNGETFESPAFCSTLRLAEHAAAEVALNELS KRGPSSSLAAKVLDETGIYKNLLQETAHRAGLKLPVYTTIRSGPGHTPVFTCTVELAGMTFTGNPGKTKK QAQKNAAMAAWSELKQLPRVGEPSSSSCPPDHDDDDQEQIIVARTLASLNQTNGGKTPQQKEKQQSSNRP SSRRPSYPKSNASFYGRLHLQKHAYPSVPPEQAMYHMWHQVQATQQKPHFPMVPTMGSTGFPPPPTVLHM YPPPRGQFTMPSSQDGLGLIPCYPEASPVLPRYFSPYPASFVPRRPLPVNVHKIHEKRLVGADMVELPDA AVFSRYTAPDFSGTSENAVQDNKKEEYTESSPASEQESKSHTASSSATRSPSQQLESNQDIEIMGGLRLE SKKPAEQPPESSPSRVNPVLLCETGQRHHYSSVRHGDPVHRNSPQISVATSPSPIRRGDPAHINIPQISV ATPPECRSPRAQAPPRFGTRMPVNLPSSLYQQRPPWLAASVTIRTTIPVCSARPNVVNSSAGAAQPAVQI LSASPRKEEPEARTNTSDTSNAATASSELNKLHI

Lastly, we tested another dsRBD candidate, DRB2, to compare it with the dsRBD of RNAse III. While DRB2 achieved respectable docking scores, the results reaffirmed that RNAse III remained the superior candidate for binding efficiency, making it the best option for our project.

Docking Score	RGS1	AAC	THI20
BAF	-409.64	-387.18	-401.83
PEGGY V1	–400.38	-371.41	-392.63
PEGGY V2	-402.52	-382.08	-386.48
PEGGY V3	-391.09	-368.03	-398.91

Figure 13: Fourth Prediction for DRB2_ORYSJ

The results clearly support the dsRBD of RNAse III as the ideal domain candidate, with the BAF model showing the highest binding affinities for all tested dsRNA sequences, ranging from -445.14 to -463.03. This validation has not only confirmed the initial hypothesis but also provided critical insights that guided the Wet Lab in refining our project design, ensuring the experimental framework is based on strong and data-backed predictions.

Comparison of Machine Learning Models

Model	MSE (Mean Squared Error)
Random Forest	0.044
XGBoost	0.058
CatBoost	0.053

Figure 14: Comparison of Machine Learning Models

The Random Forest model had the lowest MSE at 0.044, followed by CatBoost at 0.053, and XGBoost at 0.058. Each model offered its advantages, with Random Forest showing better generalization on this dataset. However, XGBoost and CatBoost provided competitive results, demonstrating their ability to handle more complex feature relationships and regularization.

Comparison Between BAF and PEGGY(V1,2,3)

When comparing the deep learning model with the machine learning models, each approach has its distinct advantages:

Deep Learning BAF model (LSTM):

• Strengths: LSTM networks excel at learning patterns within sequential data. BAF model was particularly suited for complex, non-linear interactions between RNA and proteins. The ability to handle both RNA motifs and secondary structure data in a unified model made it a flexible solution, especially for more complex cases like double-stranded RNA (dsRNA) interactions.

• Weaknesses: While deep learning models, like our BAF model, showed strong performance in terms of capturing sequential dependencies, they are more computationally intensive and require more training data. They also take longer to train and are harder to interpret compared to traditional machine learning models.

Machine Learning PEGGY V1, V2, V3 (Random Forest, XGBoost, CatBoost):

• Strengths: PEGGY V1, V2, and V3 models provided strong baseline performances, especially when handling relatively smaller datasets. These models are faster to train, more interpretable, and less likely to overfit due to their inherent regularization techniques. Random Forest performed particularly well, demonstrating its ability to generalize on this task.

• Weaknesses: Nevertheless, our Machine learning models struggled with capturing long-term dependencies inherent in sequential data. Since they flatten the data into fixed feature vectors, they are less capable of understanding complex relationships within sequences, which is why they tend to underperform in comparison to LSTM models for more intricate RNA-protein interactions.

In summary, while the PEGGY models provided us with a quick and interpretable solution, BAF, our Bidirectional LSTM-based deep learning model, proved to be more powerful for complex biological sequence prediction tasks, especially when handling long-range dependencies and interactions in RNA-protein binding.

4.6. Future Aspects

There are several exciting potential directions to explore in the future to enhance the capabilities of our models and expand their applicability:

1. Data Augmentation via GANs (Generative Adversarial Networks):

One challenge in RNA-protein interaction modeling is the lack of sufficient labeled data, particularly for double-stranded RNA (dsRNA) interactions. A potential solution is to employ Generative Adversarial Networks (GANs) to generate synthetic RNA-protein interaction data. GANs could help simulate additional dsRNA interactions, allowing the models to be trained on a larger and more diverse dataset, improving their generalization capabilities.

2. Creating a dsRNA Database from Scratch:

To strengthen the predictive capabilities of our models, it would be valuable to develop a comprehensive database specifically focused on dsRNA interaction data. Currently, there is a notable gap in scientific research regarding dsRNA binding interactions. By creating a dedicated database that includes detailed dsRNA information, we could address this gap and provide a crucial resource for the scientific community. This effort would involve gathering experimental data from existing literature and public databases, ensuring the dataset is of high quality and centered on dsRNA interactions with RNA-binding proteins. Such a dataset would significantly enhance our model training, making dsRNA predictions more accurate and robust.

3. Developing a Web-Based Prediction Tool:

As a practical extension, creating a web interface for scientists to input RNA and protein sequences to predict their docking scores could be valuable. This tool could serve the scientific community and industries working on dsRNA-protein interactions.

~ Section 2: Decoding Silencing: an ODE simulation of RNAi pathway Dynamics~

“The beauty of the Universe lies in the details, where the infinitesimal becomes pivotal.”
-Giorgos Grammatikakis

5.1 The System

THAELIA is a bacterial system engineered to produce double-stranded RNA (dsRNA) molecules that target essential genes in V. dahliae. The goal is to silence these genes by delivering dsRNA directly to it and triggering its fungal endogenous RNA interference (RNAi) mechanism.

Figure 15: bmRNAi technology

So, once inside the fungus, the dsRNA is processed by the enzyme DICER, which cleaves it into small interfering RNAs (siRNAs). Next, the siRNAs are loaded into the Argonaute (AGO) protein, forming the RNA-induced silencing complex (RISC) as additional proteins are recruited. Under specific thermodynamic conditions, the dsRNA separates into single strands, and the RISC complex, now bound to one strand, targets complementary mRNA. This leads to one of two outcomes: either the target mRNA is cleaved and degraded, effectively silencing the gene, or its translation is blocked, preventing protein synthesis. In particular, RNAi prevents mRNAs from fulfilling their function by eliminating them before they would naturally break down.

Figure 16: RNAi mechanism. In the picture, the DICER protein (depicted in blue) is necessary to process the dsRNA. The AGO proteins (depicted in green and yellow) form the RISC complex with the siRNAs and target the fungus mRNA for silencing either with mRNA cleavage or translation blockage

Our team carefully designed dsRNA sequences that would generate siRNAs, focusing on optimizing thermodynamic properties like GC content and seed region specificity to minimize off-target effects. At the same time, we sought to deepen our understanding of the RNAi mechanism within the fungus, exploring the underlying reactions and protein components. This raises an exciting question: “How interesting would it be to simulate the RNAi process using mathematical equations to model the very mechanism we're aiming to harness?”

Simulating RNAi proteinic reactions—such as those involving Dicer, Ago and the RISC—using an ordinary differential equation (ODE) model provides critical insights into the kinetics and dynamics of gene silencing. By translating these molecular interactions into mathematical equations, we can capture the complex behavior of RNAi processes in a highly controlled and predictable manner. This allows us to analyze how different concentrations of RNA, Dicer, and RISC affect the overall silencing efficiency, offering a deeper understanding of dose-response relationships and system stability.

5.2 Our Aim

Our goal is to gain a deeper understanding of the RNAi mechanism that occurs within the fungus. By extracting insights from lab experiments and analyzing experimental data on the fungal DICER enzyme, we aim to uncover how this mechanism specifically operates in V. dahliae.

5.3 Assumptions

Our model does not entirely reflect realistic conditions, as representing numerous intricate processes through a deterministic method proves highly challenging. To ensure our project’s successful progression, we have established some key assumptions to simplify the system.

• Molecule Cell Distribution: All molecules are uniformly distributed within the cell.

• Independent Reactions: Each step in the pathway is treated as an independent reaction.

• Quasi-Steady State: The system reaches a quasi-steady state where the rate of production equals the rate of degradation for each component.

• Negligible External Disturbances: There are no external disturbances that could affect the system, and environmental conditions are fixed.

• Negligible mRNA Exit from Nucleus: The exit of mRNAs from the nucleus is considered negligible for this model.

• Complete Processing of dsRNA: It is assumed that every molecule of dsRNA that enters the cell is processed through the RNAi pathway without loss or inhibition.

• Sufficient Concentrations of Reactants: The concentrations of AGO and siRNAs are assumed to be sufficient to ensure effective binding and cleavage.

5.4 Methodology

Ordinary Differential Equation (ODE) based models are particularly effective for examining the RNA interference (RNAi) pathway of our project design due to some key reasons. One of them is the ability to simplify complex biological systems, by focusing on key components and their interactions. This is particularly important in RNAi studies, where the interactions between various proteins (like DICER) and other molecules (like various RNA species) can be very intricate[13]. Furthermore, ODE models provide a quantitative framework to describe dynamic interactions and regulatory mechanisms involved in RNAi. They can capture rates like degradation and production of specific molecular components, allowing researchers to simulate and predict the behavior of the RNAi pathway under various circumstances and conditions[13]. Additionally, this type of model gives us the chance to analyze the system’s dynamics over time, which is crucial for understanding the kinetics of mRNA hydrolysis and the overall efficiency of gene silencing[14]. Lastly, ODE-based models can be validated against experimental data, enhancing their reliability. They can also be used to make predictions about the gene silencing efficacy, which is valuable for designing experiments[13].

Our approach for this simulation was to utilize experimental data from our Lab regarding DICER concentrations in the cell while simultaneously using information extracted from bibliographical retrospection for other parameters involved in the pathway. By using experimental data for DICER, we have the ability to validate and calibrate our model specifically for this critical component of the RNAi pathway. Dicer plays a vital role in processing dsRNA (double-stranded RNA) into small interfering RNA (siRNA), so having accurate data for this enzyme can significantly enhance the reliability of our model.

Data Extraction by the Lab

During the data collection phase for the validation of our ODE simulation, we requested the Wet Lab team to provide quantitative data related of DICER. They not only employed real-time quantitative PCR (rt-qPCR) to measure the expression levels of DICER but also utilized computational methods to convert these measurements into a final concentration of 1.07 μM. This precise quantification is critical to our simulation, as DICER plays a key role in processing double-stranded RNA (dsRNA) into small interfering RNA (siRNA) within the RNA interference (RNAi) pathway. By integrating the 1.07 μM concentration into our in silico model, we ensured that it accurately represents the biological environment, significantly improving the effectiveness and precision of our RNAi simulation. This data is essential for fine-tuning the model, enhancing our understanding of gene silencing mechanisms.

Equations

5.5 Results

In our RNAi pathway simulations, SimBiology runs models not sequentially but simultaneously, meaning that all molecular interactions and processes are computed in parallel rather than in a step-by-step manner. Due to this fact, we decided to analyze our plots individually, allowing us to isolate and interpret the behavior of each molecular component in the pathway more effectively. This approach ensured that we could thoroughly examine the dynamics of the RNAi process for each species without interference from simultaneous processes.

mRNA VdRGS1 Plot Analysis

We focused our analysis on the VdRGS1 mRNA in the RNAi pathway because the outcomes for the other mRNAs were very similar. The close alignment of results across all mRNAs indicates that the RNAi mechanism functions consistently in the model. By concentrating on VdRGS1, we were able to conduct a detailed analysis with confidence, knowing that the behavior of the other mRNAs followed a nearly identical degradation pattern. This allowed us to streamline the validation process without sacrificing accuracy in the simulation.

Figure 18: mRNA VdRGS1 Plot, X AXIS: Time(hours), Y AXIS:1*(10-7) M

This Plot shows the concentration of mRNA over time during and after the RNAi pathway activation. Initially, the mRNA concentration is high, indicating abundant mRNA molecules present in the system. As the RNAi pathway activates, the mRNA levels rapidly decline due to the degradation mediated by the RNA-induced silencing complex (RISC). The concentration approaches near zero after a certain period, signifying the completion of mRNA degradation and the silencing of gene expression, which is the intended outcome of our RNAi mechanism.

RISC - mRNA VdRGS1 Plot Analysis

Figure 19: RISC - mRNA VdRGS1 Plot, X AXIS: Time(hours), Y AXIS:1*(10-7) M

Our second plot represents the concentration of the RISC-mRNA complex over time. Initially, as RISC begins to bind to the mRNA, the concentration of the complex rises steadily, peaking around the midpoint of the time interval. This peak indicates the maximum formation of the RISC-mRNA complex, where the majority of the available mRNA is bound to RISC. Following this peak, the concentration begins to decrease as the complex breaks down, and mRNA degradation proceeds to completion.

RISC Plot Analysis

Figure 20: RISC - mRNA VdRGS1 Plot, X AXIS: Time(hours), Y AXIS:1*(10-7) M

This plot illustrates the concentration of free RISC over time. At the beginning of the process, the concentration of free RISC is low because it is actively being recruited to form complexes with mRNA. As the RNAi process progresses and more mRNA molecules are targeted, the concentration of free RISC increases, reaching a peak when the availability of free mRNA is maximized. After this peak, the concentration of free RISC begins to decline as it becomes fully engaged in mRNA binding and degradation. This drop indicates that the RNAi pathway is approaching completion, with fewer free RISC molecules available as the mRNA is successfully degraded.

In summary, our RNAi pathway simulations effectively demonstrate the progression of mRNA degradation and the involvement of RISC over time. The models reveal a distinct pattern of rapid mRNA decay, accompanied by the formation and subsequent dissociation of RISC-mRNA complexes. However, while our simulations successfully capture the broader molecular interactions, they do not account for the sequence specificity of RNAi, which is a key determinant of the accuracy and efficiency of gene silencing. This limitation arises from the software's inability to simulate sequence-level variations, which are known to significantly influence the RNAi mechanism. Despite this, the results offer a solid foundation for understanding the general dynamics of RNAi pathway.

5.6 Future Aspects

1. RNAi in Multi-Gene Network:

Expanding the model to study RNAi in complex gene networks would allow for a deeper understanding of how silencing one gene affects other genes and regulatory pathways. This is especially useful for exploring cross-talk between pathways and feedback loops within cells, contributing to research on gene network dynamics and multi-target RNAi therapies.

2. Stochastic RNAi approach:

Incorporating stochastic elements into the RNAi model can provide insights into single-cell variability. A stochastic approach would capture the random fluctuations in RNAi processes, revealing how different cells exhibit variability in gene silencing outcomes. This could help explain RNAi efficiency at the single-cell level.

~ 3. Section: Chained to Growth - A NetLogo Exploration of Auxotrophic Dependencies~

"The essence of survival is need; every life, however small, is a story of searching and finding."
— Richard Dawkins

6.1 System

While developing THAELIA, a key priority was ensuring the safety of our system. Our chosen chassis, P. putida, is a bacterium with the natural ability to colonize plant roots, and our system harnesses this capability to combat V. dahliae at its source: within the roots of olive trees, where the infection begins! But as much as we’re focused on defeating this fungus effectively, it was equally important for us to create a solution that is safe and tightly regulated. To achieve this, we developed a dual biocontainment strategy, combining two complementary approaches: chemotaxis-based control and double auxotrophy.

Figure 21: THAELIA: Engineered bacteria colonizing roots to combat V. dahliae

The first approach, chemotaxis, involves a carefully engineered biological gate that stops our bacteria from reproducing unless triggered by the presence of the V. dahliae-specific protein, Ave1. This protein serves as a signal, ensuring our system only becomes active when it detects the pathogen. Without the pathogen, the bacteria remain in a state of minimal metabolic activity, conserving energy in a lag phase. Once the bacteria encounter Ave1, their replication is triggered, allowing them to effectively combat V. dahliae while remaining dormant otherwise. This highly regulated system minimizes the chance of unintended bacterial activity outside the target environment, prolonging their useful lifespan within the rhizosphere.

The second aspect of our biocontainment strategy is double auxotrophy. Auxotrophy refers to the inability of a microorganism to synthesize essential growth factors, meaning it cannot survive without external supplementation. We engineered P. putida to be auxotrophic for both Vitamin B12 and L-tyrosine—two nutrients crucial for survival. This means that vitamin B12, a cofactor essential for many bacterial enzymes, and L-tyrosine, an important amino acid, are not synthesized by our engineered bacteria and must be supplied through our irrigation system. This controlled provision of nutrients further strengthens the containment of our bacteria, allowing them to thrive only when these specific conditions are met.

In our efforts to design a safe bacterial system, it was crucial to ensure that our approach was on the right track. To accomplish this, we decided to begin our work in silico, developing simulations that mimic the complex soil conditions our engineered P. putida would encounter. By analyzing factors such as nutrient availability and root colonization dynamics, we gained invaluable insights into the potential performance of our system. Additionally, by predicting potential challenges and outcomes in advance, we can refine both our biocontainment strategies—like chemotaxis-based activation and double auxotrophy—and enhance the overall effectiveness of our system. This step is essential for ensuring that our solution is not only effective but also safe and predictable when deployed in agricultural settings.

6.2 Aim

Our goal for these simulations is to develop a comprehensive understanding of our engineered bacterial system and how it behaves under various soil conditions. By modeling the interactions of engineered P. putida with its environment, we can investigate how different factors—such as nutrient availability, moisture levels, and the presence of V. dahliae—impact bacterial growth, colonization, and overall effectiveness. Additionally, we will compare these outcomes with those of non-engineered P. putida to gain valuable insights into the advantages of our modifications. This in-depth analysis will enable us to optimize the performance of our biocontrol strategy while anticipating potential challenges, ensuring that our approach is both effective and safe for agricultural deployment.

6.3 Assumptions

In developing our agent-based model (ABM) of Pseudomonas putida colonization and reproduction in NetLogo[6.1], we made several simplifying assumptions to enable computational efficiency and to focus on the primary mechanisms of interest. These assumptions help reduce the complexity of modeling numerous biochemical and environmental interactions in a natural setting. The key assumptions are as follows:

Uniform Nutrient Distribution: Nutrients (e.g., tyrosine, vitamin B12) are assumed to be evenly distributed across the soil patches where the bacteria colonize. This simplification overlooks potential nutrient gradients in natural soil environments but allows us to focus on bacterial behavior and competition.

Independent Reproduction and Movement: The reproduction and movement of Pseudomonas putida (both regular and modified) are treated as independent processes. There are no interactions between these behaviors other than through nutrient depletion and competition.

No External Environmental Disturbances: The model assumes a closed and static environment with no external factors (e.g., temperature fluctuations, pH changes, or microbial predators) affecting bacterial growth or nutrient levels.

Negligible Horizontal Gene Transfer: Horizontal gene transfer (HGT) between regular and modified Pseudomonas putida is not considered in this model. In reality, HGT could influence genetic diversity and resource-sharing mechanisms but is excluded here for simplicity.

Auxotrophy-Dependent Growth: The modified Pseudomonas putida strains require external sources of tyrosine and vitamin B12 to grow and reproduce. This assumption is based on experimental genetic modifications made to make the bacteria auxotrophic for these compounds.

6.4 Methodology

An agent-based model (ABM) was chosenin order to simulate the behavior of both regular and genetically modified Pseudomonas putida in a soil environment and in the root, with a particular emphasis on the auxotrophic characteristics of the genetically modified strain. ABMs are particularly effective for biological systems, as they can capture the behavior of individual entities (in this case, bacteria) and the emergent patterns that arise from local interactions.

The justification for utilizing ABM lies in its ability to model complex systems where heterogeneity and spatial distribution play critical roles, such as in bacterial colonization. In contrast to continuous models like ODEs, ABMs can simulate discrete entities and their interactions with their environment, making it highly suitable for modeling bacterial behavior in heterogeneous environments.

Our ABM incorporates theoretical insights regarding the auxotrophic properties of the modified Pseudomonas putida strains (e.g., dependence on external tyrosine, vitamin B12 and AVE1). These genetic modifications were modeled based on findings from scientific literature that suggest auxotrophy can be an effective biocontainment strategy. Additionally, environmental conditions were simplified to exclude fluctuations, focusing primarily on nutrient availability.

Key elements of our Methodology

• Auxotrophy-Driven Growth: Reproduction of the genetically modified bacteria is dependent on the external presence of both tyrosine and vitamin B12, as described in literature.

• Patch Environment: The model environment consists of soil patches with varying levels of nutrients and metals. Regular Putida strains colonize freely, while the modified strains require specific external compounds (tyrosine and vitamin B12) to grow.

• Ave1 Influence: Due to the limited availability of relevant data in the literature, constructing an accurate model to simulate the behavior of the Ave1 protein was not feasible. As a result, we refrained from including any plots or simulations related to Ave1 protein dynamics in our analysis, as they would not provide reliable or meaningful insights.

6.5 Results

Test 1: Stable Presence of B12 and Tyrosine

Figure 22: Modified Pseudomonas Putida-Tyrosine and b12 Stable Presence

In this simulation, modified Pseudomonas putida is grown with a constant supply of both Tyrosine and Vitamin B12, which are necessary for its survival due to its auxotrophic nature.

The bacteria follow a typical growth cycle, starting with slow growth in the lag phase, followed by rapid expansion during the exponential phase due to the stable availability of both nutrients. The stationary phase is extended, as the constant supply of Tyrosine and Vitamin B12 allows the population to remain high. Eventually, a death phase occurs after the bacteria reach resource limitations, but the population continues to cycle due to the stable nutrient availability.

In this scenario, the bacteria can sustain growth over multiple cycles, showing the critical role of nutrient availability in population stability.

Test 2: Tyrosine reachs 0 level of Concetration

Figure 23: Modified Pseudomonas Putida dies due to lack of Tyrosine Resoures

In this simulation, we observe the behavior of modified Pseudomonas putida when Tyrosine levels deplete to zero while Vitamin B12 remains available.

The bacteria exhibit normal growth in the lag and exponential phases as both Tyrosine and Vitamin B12 are available. However, as Tyrosine becomes depleted, the population begins to decline, entering the death phase. Without Tyrosine, the auxotrophic bacteria cannot maintain their growth, leading to a sharp population drop. A brief plateau is observed before the population sharply declines to zero.

Figure 24: Tyrosine Levels reach 0 level

The second figure shows the depletion of Tyrosine while Vitamin B12 remains present. As Tyrosine depletes completely, the population cannot sustain itself, causing the modified bacteria to die off, even though Vitamin B12 is still available.

Test 3: B12 reachs 0 level of Concetration

Figure 25: Modified Pseudomonas Putida dies due to lack of B12 Resoures

In this simulation, we examine the behavior of modified Pseudomonas putida when Vitamin B12 levels deplete to zero while Tyrosine remains available.

Figure 26: B12 Levels reach 0 level

This figure illustrates the depletion of Vitamin B12 (red line), while Tyrosine (blue line) remains available. Once Vitamin B12 is fully depleted, the modified bacteria can no longer sustain their population, leading to an immediate population crash.

The simulations demonstrated that modified Pseudomonas putida relies heavily on both Tyrosine and Vitamin B12 for survival. When both nutrients were available, the bacteria followed typical growth cycles. However, when Tyrosine was depleted, the population rapidly declined despite the presence of Vitamin B12. Similarly, Vitamin B12 depletion caused an immediate population crash, even with Tyrosine available. These results highlight the bacteria’s auxotrophic dependency on both nutrients for growth and survival.

6.6 Future Aspects

1. Expanded Nutrient Interactions: Future models could incorporate more complex nutrient cycling and feedback loops in the soil environment, simulating more realistic conditions. This includes interactions between bacteria and other soil organisms that could affect nutrient availability and competition. A broader array of environmental conditions and microbial competitors could provide insights into the ecological interactions that influence bacterial colonization and survival in natural environments.

2. Incorporating Horizontal Gene Transfer (HGT): The current model assumes no genetic exchange between bacterial strains, but future work could explore the role of HGT in spreading auxotrophic traits or resistance mechanisms between bacterial populations. This would give insights into how genetic modifications, such as auxotrophy, spread or are maintained in bacterial communities.

~7. Conclusions~

Our project successfully combined computational modeling with synthetic biology to address Verticillium dahliae, a significant threat to olive trees. Using a blend of AI, ODE, and agent-based modeling, we gained valuable insights into RNA-protein interactions, RNA interference (RNAi) pathways, and the behavior of engineered bacteria.

The AI-driven models helped identify effective dsRNA binding domains for encapsulation, improving the dsRNA delivery efficiency. Meanwhile, the ODE simulations in SimBiology provided a fundamental understanding of RNAi pathway, incorporating precise data on DICER concentration to enhance accuracy. Our NetLogo agent-based model simulated the colonization of engineered Pseudomonas putida, demonstrating the effectiveness of biocontainment strategies while ensuring safety in agricultural environments.

This integrative approach demonstrates the power of computational tools in optimizing biological processes, allowing us to predict and refine outcomes before experimental validation. These models will continue to guide improvements in RNAi delivery and biocontrol systems, contributing to sustainable solutions for agriculture.

~8. References~

[1] (Eskelin, K., Lampi, M., Coustau, C., Imani, J., Kogel, K.-H., & Poranen, M. M. (2022). Analysis and purification of ssRNA and dsRNA molecules using asymmetrical flow field flow fractionation. In Journal of Chromatography A (Vol. 1683, p. 463525). Elsevier BV. ) https://doi.org/10.1016/j.chroma.2022.463525

[2] history of artificial intelligence (AI) Also known as: history of AI Written by .J. Copeland Fact-checked by The Editors of Encyclopaedia Britannica : https://www.britannica.com/science/history-of-artificial-intelligence

[3] Turing Test in Artificial Intelligence via geeksforgeeks.org : https://www.geeksforgeeks.org/turing-test-artificial-intelligence/

[4] Domains of AI : https://suryamaddula.medium.com/domains-of-artificial-intelligence-8046d0778f1a

[5] Unlocking the Transformative Power of Synthetic Biology Amaan Arif, Prekshi Garg and Prachi Srivastav* : https://www.biotechmedjournal.com/articles/abb-aid1039.php

[6] Synthetic Biology + Artificial Intelligence = Next Generation Therapeutics:Artificial intelligence is bridging the gap between synthetic biology and medicine - Meet the innovative companies leading the way : https://www.synbiobeta.com/read/synthetic-biology-artificial-intelligence-next-generation-therapeutics

[7] Artificial Intelligence for Synthetic Biology The opportunities and challenges of adapting and applying AI principles to synbio. By , , , , , , and : https://cacm.acm.org/research/artificial-intelligence-for-synthetic-biology/

[8] RBPDB: a database of RNA-binding specificities Kate B. Cook, Hilal Kazan, Khalid Zuberi, Quaid Morris, Timothy R. Hughes Nucleic Acids Research, Volume 39, Issue suppl_1, 1 January 2011, Pages D301–D308, https://doi.org/10.1093/nar/gkq1069

[9] Lorenz, R., Bernhart, S. H., Höner Zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1), 26. https://doi.org/10.1186/1748-7188-6-26

[10] Popenda, M., Szachniuk, M., Antczak, M., Purzycka, K. J., Lukasiak, P., Bartol, N., ... & Adamiak, R. W. (2012). Automated 3D structure composition for large RNAs. Nucleic Acids Research, 40(14), e112. https://doi.org/10.1093/nar/gks339

[11] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2

[12] Yan, Y., Zhang, D., Zhou, P., Li, B., & Huang, S. Y. (2020). HDOCK: A Web Server for Protein–Protein and Protein–DNA/RNA Docking Based on a Hybrid Strategy. Nucleic Acids Research, 48(W1), W329-W334. https://doi.org/10.1093/nar/gkx407

[13] Modeling RNA interference in mammalian cells Giulia Cuccato1†, Athanasios Polynikis2†, Velia Siciliano1, Mafalda Graziano1, Mario di Bernardo2,3, Diego di Bernardo1,3

[14] Mechanisms of epigenetic regulation by C. elegans nuclear RNA interference pathways Uri Seroussi, Chengyin Li, Adam E. Sundby, Tammy L. Lee, Julie M. Claycomb, Arneet L. Saltzman https://doi.org/10.1016/j.semcdb.2021.11.018

[15] NetLogo: A simple environment for modeling complexity Seth Tisue, Uri Wilensky https://www.researchgate.net/publication/230818221_NetLogo_A_simple_environment_for_modeling_complexity

[16] Agent-Based Model in Ecology (2019). "The use of ABMs in simulating microbial populations." https://doi.org/10.1093/sysbio/syv093

Modeling

~Preface~

Figure 1: “I rebel-therefore we exist” -Albert Camus

~Overview~

~Biological Timeline~

Figure 2: Model Biological Timeline Diagram

~Section 1: Unseen Bonds - an AI approach~

"The machine does not isolate man from the great problems of nature but plunges him more deeply into them." - Antoine de Saint-Exupéry

4.1. System

Figure 3: The chimeric protein that leads to the encapsulation of the dsRNA into the OMVs. With light blue is depicted the dsRNA Binding Domain and with deep blue is depicted the Outer Membrane Associated protein that leads to the encapsulation of the dsRNA inside the OMVs

4.2. Aim

4.3. Requirements

Assumptions

4.4. Methodology

The Turing Test: A brief Introduction to Artificial Intelligence(AI)

"You're programmed to be curious, just like we are." - Ex Machina(2014)

Figure 4: Alan Turing, possibly the first person that gave the earliest public lecture (London, 1947) to mention computer intelligence, saying, “We want a machine that can learn from experience”.[2]

Domains of AI

Figure 5: Domains of Artificial Intelligence[4]

Use of Artificial Intelligence in Synthetic Biology

Figure 6: AI instances in Synthetic Biology

Building an AI model

Figure 7: AI building steps

Our AI Development Approach

Data Gathering

Mean Squared Error and Mean Absolute Error

Mean Squared Error

Mean Absolute Error

Model 1

Deep Learning Bidirectional LSTM Model BAF (Binding Affinity Forecasting)

Architecture

Figure 8: BAF model Architecture

Model Training

Figure 9: 3D Visualization of Model Loss over Epochs and Batch Size

Model 2

Machine Learning Random Forest Regressor

PEGGY(Protein Engineering and Genetic Guided Yield) V1

Architecture and Training

Evaluation

Model 3

Machine Learning XGBoost PEGGY V2

Architecture and Training

Evaluation

Model 4

Machine Learning CatBoost Regressor PEGGY V2

Architecture and Training

Evaluation

4.5. Results

Our Metric

Our Results’ Matrices

Figure 10: First Prediction for RNAse III KT2440

Figure 11: Second Prediction for RNAse III KT2440

Figure 12: Third Prediction for TOLb (Negative Control)

Figure 13: Fourth Prediction for DRB2_ORYSJ

Figure 14: Comparison of Machine Learning Models

Comparison Between BAF and PEGGY(V1,2,3)

Deep Learning BAF model (LSTM):

Machine Learning PEGGY V1, V2, V3 (Random Forest, XGBoost, CatBoost):

4.6. Future Aspects

~ Section 2: Decoding Silencing: an ODE simulation of RNAi pathway Dynamics~

“The beauty of the Universe lies in the details, where the infinitesimal becomes pivotal.” -Giorgos Grammatikakis

5.1 The System

Figure 15: bmRNAi technology

Figure 16: RNAi mechanism. In the picture, the DICER protein (depicted in blue) is necessary to process the dsRNA. The AGO proteins (depicted in green and yellow) form the RISC complex with the siRNAs and target the fungus mRNA for silencing either with mRNA cleavage or translation blockage

5.2 Our Aim

5.3 Assumptions

5.4 Methodology

Data Extraction by the Lab

Equations

5.5 Results

mRNA VdRGS1 Plot Analysis

Figure 18: mRNA VdRGS1 Plot, X AXIS: Time(hours), Y AXIS:1*(10-7) M

RISC - mRNA VdRGS1 Plot Analysis

Figure 19: RISC - mRNA VdRGS1 Plot, X AXIS: Time(hours), Y AXIS:1*(10-7) M

RISC Plot Analysis

Figure 20: RISC - mRNA VdRGS1 Plot, X AXIS: Time(hours), Y AXIS:1*(10-7) M

5.6 Future Aspects

~ 3. Section: Chained to Growth - A NetLogo Exploration of Auxotrophic Dependencies~

"The essence of survival is need; every life, however small, is a story of searching and finding." — Richard Dawkins

6.1 System

"The machine does not isolate man from the great problems of nature but plunges him more deeply into them."
- Antoine de Saint-Exupéry

"You're programmed to be curious, just like we are."
- Ex Machina(2014)

“The beauty of the Universe lies in the details, where the infinitesimal becomes pivotal.”
-Giorgos Grammatikakis

"The essence of survival is need; every life, however small, is a story of searching and finding."
— Richard Dawkins