Modelling

The ProgRAM system presents a high level of complexity, necessitating comprehensive modeling to ensure that its design is functional and scalable. Given the intricacy and precise sequence constraints of our system’s components, such as the promoter, 5’ UTR, recording tape, coding region, 3’UTR, and RNA aptamers, dry lab modeling has become an integral part of our work, allowing us to predict and optimize the system’s behavior before experimental validation. Each component was modeled in silico to evaluate integrity, efficiency, and functionality. This combined dry and wet lab approach ensures that we can generate ProgRAM’s components by accurately predicting the performance of the system and efficiently iterate on its design.

Tape Construction Algorithm

Motivation

Our system, ProgRAM, utilizes specially designed RNA tapes to record biological events. The tapes are engineered with specific sequence elements crucial to their successful operation, including consecutive adenine deamination (writing) and multi-frame in vivo reporter expression (reading). These essential features impose certain constraints that limit tape design. However, within these constraints, the tape offers a highly customizable minimal consensus sequence. The nucleotides in the variable protospacer regions must, therefore, be strategically selected to optimize both the guide RNAs (gRNAs) necessary for writing, as well as the embedded translation initiation elements (Kozak sequences) for reading.

This optimization process was thus carried out using a custom computational algorithm, the main objectives of which can be summarized as follows:

Folding: ensure correct folding of the guide RNA by promoting proper stem-loop formation, which is necessary for binding to dPspCas13b.
Accessibility: prevent formation of unwanted secondary structures within the tape sequence enhancing accessibility for translation initiation and deamination machineries.
Selectivity: enhace binding affinity of gRNAs to their targets within the tape while minimizing affinity for seqeunce-similar mismatches.
Specificity: promote bioorthogonality and reduce global off-targets by ensuring unique targeting to the tape within the host transcriptome.
Precision: control nucleotide composition to minimize unintended local deamination events, ensuring consistent and accurate recording.
Compatibility: ensure compatibility with bioengineering standards (BioBrick) and the host organism by avoiding motifs that affect tape stability and expression.

Implementation

ProgRAM employs recording tapes with three distinct protospacers, each targeted by a unique guide RNA (gRNA). Our tape construction algorithm systematically generates all possible tape sequences from a minimal consensus using a brute-force approach and evaluates them based on their associated gRNAs.

The tape generation process begins with the creation of all possible protospacers, which are variable regions within each tape, by substituting the Ns (wildcards) in the consensus sequence with non-adenine. This is followed by constructing actual tapes from all possible permutations of the generated protospacers. Once a tape sequence is obtained, three gRNA sequences are extracted from it. In silico folding of these gRNA sequences, which include the stem-loop necessary for dPspCas13b binding and spacers targeting corresponding regions of the tape, is performed using ViennaRNA package (Lorenz et al., 2011) to ensure correct structure.

If structural verification is passed, gRNA binding affinities are calculated. It is crucial to ensure that each of the three gRNAs exhibits high affinity toward its respective target sequence while displaying lower affinity toward the two remaining mismatches. Therefore, a total of nine duplex folding simulations are performed, three for each gRNA, and binding energies are calculated. These energies are then combined using a partition function to obtain an arbitrary score for each of the three gRNAs. Finally, an aggregate tape score is derived by averaging all its associated gRNA scores.

The tape is then further analyzed for the absence of BioBrick-incompatible restriction sites, splice sites, polyadenylation signals, RNA instability motifs, and other elements that could interfere with proper expression and stability. Additionally, the tape is BLASTed (NCBI, 1994) against the host transcriptome to check for potential off-target effects. In the current implementation of the algorithm, these last verification steps are performed manually on the top hits resulting from the brute-force approach.

Further optimizations of the algorithm aimed at reducing the combinatorial space have been implemented. Their goal is to ensure that only non-cyclic permutations of protospacers are evaluated. Due to occasionally prolonged computation times, regular export of intermediate results is performed. The complete code can be reviewed below.

Tape Construction Code;


  # Load dependencies
  import ViennaRNA as rna
  from itertools import product, combinations, permutations, cycle
  import json
  import math
  from typing import Generator
  import regex as re
  
  # Generate sequence variants by replacing a pattern
  def variant(sequence: str, pattern: str, replacements: list[str]) -> str:
      """
      Substitute consecutive 'N's in the consensus sequence with provided replacements to generate a seqeunce variant.
  
      Args:
          sequence (str): The original sequence containing 'N's to be replaced.
          replacements (list[str]): A list of strings to replace the 'N's.
  
      Returns:
          str: The sequence with 'N's replaced by corresponding strings from the replacements list.
      """
      for replacement in cycle(replacements):
          if pattern not in sequence:
              break
          sequence = sequence.replace(pattern, replacement, 1)
      return sequence
  
  # Generate deaminated sequence
  def deaminate(sequence: str, state: int) -> str:
      """
      Deaminate the input sequence to the specified state.
  
      Args:
          sequence (str): The original sequence to be deaminated.
          state (int): The state to which to deaminate the sequence.
  
      Returns:
          str: The deaminated sequence.
      """
      return sequence.replace("GCCAUGG", "GCCGUGG", state)
  
  # Generate a set of 3 related gRNA sequence replacements
  def cyclic_permutation(group: list) -> Generator[list, None, None]:
      """
      Generate cyclic permutations of the input list.
  
      Args:
          group (list): The original list to be permuted.
  
      Yields:
          Generator[list, None, None]: A generator that yields cyclic permutations of the input list.
      """
      for _ in group:
          group.insert(0, group.pop())
          yield group
  
  # Generate a complementary sequence
  def complement(sequence: str) -> str:
      """
      Generate the complementary sequence of the input sequence.
  
      Args:
          sequence (str): The original sequence to be complemented.
  
      Returns:
          str: The complementary sequence of the input sequence.
      """
      return sequence.translate(str.maketrans("AUGC", "UACG"))
  
  # Reverse complement a sequence
  def reverse_complement(sequence: str) -> str:
      """
      Generate the reverse complement of the input sequence.
  
      Args:
          sequence (str): The original sequence to be reverse complemented.
  
      Returns:
          str: The reverse complement of the input sequence.
      """
      return complement(sequence)[::-1]
  
  # Calculate the grna score based on binding energy
  def score(targets: list[float], offtargets: list[float]) -> float:
      """
      Calculate the score of a gRNA based on its target and off-target binding energies.
  
      Args:
          target (float): The binding energy of the gRNA to the target.
          offtarget (list[float]): The binding energies of the gRNA to the off-targets.
      
      Returns:
          float: The score of the gRNA.
      """
      alpha = 5.0
      beta = 0.05
      gamma = -1
      delta = -0.1
    
      return alpha * sum([math.exp(-beta * target) for target in targets]) + gamma * sum([math.exp(delta * offtarget) for offtarget in offtargets])
  
  # Define a flag for sequence length
  gRNA_length = 30  # Set to 50 if you want to use the 50bp sequence
  
  # Define the recording tape sequence and its states based on the sequence length
  if gRNA_length == 30:
      tape = "GCCGUGGNNNGCCGUGGNNNGCCAUGGNNNGCCAUGGNNNGCCAUGGNNNGC"  # 30bp, single mismatch
      spacer = reverse_complement(deaminate(tape, 1)[2:32])  # 30bp, single mismatch
      pattern = "NNN"  # 30bp, single mismatch
  elif gRNA_length == 50:
      tape = "NNNGGNNNNNGCCGCCGUGGCGNNNGGNNNNNGCCGCCAUGGCGNNNGGNNNNNGCCGCCAUGGCGNNNGGNNNNNGCCGCCAUGGCGNNNGGNNNNNG"  # 50bp, single mismatch
      spacer = reverse_complement(deaminate(tape, 1)[5:55])  # 50bp, single mismatch
      pattern = "NNNGGNNNNN"  # 50bp, single mismatch
  else:
      raise ValueError("Unsupported sequence length. Please use 30 or 50.")
  
  # Define the gRNA sequence elements
  stemloop = "GUUGUGGAAGGUCCAGUUUUGAGGGGCUAUUACAAC"
  
  # Define global parameters
  rna.cvar.temperature = 37
  
  # Obtain the secondary structure of the direct repeat
  reference = rna.fold(stemloop)[0]
  
  # Calculate the number of occurrences and positions of the spacer in the tape
  occurrences = re.finditer(f'(?={reverse_complement(spacer)})', deaminate(tape, tape.count("A")))
  positions = [slice(occurrence.start(), occurrence.start() + len(spacer)) for occurrence in occurrences]
  
  # Generate all possible replacements for the pattern
  replacements = product("GCU", repeat = pattern.count("N"))
  replacements = ["".join(next(replacement) if nt == "N" else nt for nt in pattern) for replacement in map(iter, replacements)]
  replacements = combinations(replacements, len(positions))
  
  # Initialize the data structure
  data = []
  loop_counter = 1
  batches_per_saving_cycle = 120000
  # Generate sequence variants and their secondary structures, identify those that match the reference
  
  # Iterate trough all possible replacement combinations (unordered)
  for replacement in replacements:
      # For each combination generate all groupings that cannot be obtained by cyclic permutations 
      for grouping in map(lambda x: (replacement[0],) + x, permutations(replacement[1:], len(positions)-1)):
          
          # Generate the tape variant using the replacement grouping
          sequence = variant(tape, pattern, grouping)
          
          # Extract the gRNA sequences and their bidning sites from the tape
          targets = {
              i+1: deaminate(sequence, i)[positions[i]]
              for i in range(len(positions))
              }
          grnas = {
              i+1: reverse_complement(deaminate(sequence, sequence.count("A"))[positions[i]])
              for i in range(len(positions))
              }
          
          # Initialize the subdata structure
          subdata = {
              "tape": sequence,
              "replacements": grouping,
              "grna": []
          }
          
          # Iterate through all gRNA sequences and generate their secondary structures
          for i, grna in grnas.items():
              structure, energy = rna.fold(grna+stemloop)
              
              # Check if the direct repeat stemloop of the gRNA folds correctly
              if structure[-len(stemloop):] != reference:
                  break
              
              # Append the complete gRNA sequence, its secondary structure, and binding energy to the subdata structure
              subdata["grna"].append({
                  "sequence": grna+stemloop,
                  "structure": structure,
                  "energy": energy,
                  "binding": []
              })
              
              # Iterate through all target sequences and generate their binding duplexes
              for j, target in targets.items():
                  duplex = rna.duplexfold(target, grna)
                  
                  # Determine exact basepairing regions for the duplex
                  length = [len(fold) for fold in  duplex.structure.split("&")]
                  region = [
                      slice(duplex.i - length[0], duplex.i),
                      slice(duplex.j - 1, duplex.j - 1 + length[1])
                  ]
                  
                  # Append the duplex sequence, structure, and energy to the subdata structure
                  subdata["grna"][-1]["binding"].append({
                      "sequence": target[region[0]] + "+" + grna[region[1]],
                      "structure": duplex.structure.replace("&", "+"),
                      "energy": duplex.energy,
                      "target": i == j  # corrected typo
                  })
              
              # Calculate the score for the gRNA from on- and off-target binding energies
              subdata["grna"][-1]["score"] = score(
                  [binding["energy"] for binding in subdata["grna"][-1]["binding"] if binding["target"] == True],
                  [binding["energy"] for binding in subdata["grna"][-1]["binding"] if binding["target"] == False]
              )
              
              # Calculate the aggregate score for the grouping of gRNAs
              subdata["score"] = sum([grna["score"] for grna in subdata["grna"]])/len(subdata["grna"])
  
          else:
              # Append the subdata to the data structure
              data.append(subdata)
      
      loop_counter += 1
      if loop_counter % batches_per_saving_cycle == 0:
          # Sort the data by score
          data = sorted(data, key=lambda x: x["score"], reverse=True)
          # Write the data to a JSON file
          with open("output.json", "w") as file:
              json.dump(data, file, indent=4)
          print(f"saved another batch, current loop_cycle: {loop_counter}")

Evaluation

The provided algorithm was used to generate a set of tape sequences with spacers of both 30 and 50 nucleotides in length. We provide the complete result file for the 30 bp spacer tapes below. This file contains the tape sequences, replacement patterns, extracted gRNA sequences, their secondary structures (dot-bracket notation), duplex structures with the (off-)targets, binding energies, and calculated scores: download output. The geenrate parts, including both the tape and its gRNAS, were submitted as BBa_K5102018–BBa_K5102052 for 30 bp design and BBa_K5102101–BBa_K5102108 for 50 bp design. The experimental results obtained with some of these parts can be explored on our Results page.

Overall, while the designed algorithm performed well and met its goals, further optimizations are possible to reduce runtime for longer sequences. This may be achieved by introducing rational constraints on the sequence, such as moderating its GC content or excluding protospacers prone to disrupting gRNA folding, among other strategies. The algorithm can also be improved by implementing a genetic algorithm to reduce the number of evaluated sequences. Additionally, manual verification steps following the brute-force approach can be automated to further streamline the process by interfacing with the NCBI BLAST API to check for global off-targets and utilizing various computational tools available to identify RNA motifs and other sequence features.

Codon Optimization

Modeling the genetic sequences of our synthetic mRNA recording tape was essential to our project. As the frameshift-driven translation of multiple fluorescent proteins (XFPs) from a single construct required optimized codon usage and the elimination of premature stop codons across all three reading frames, we needed to modify the genetic sequence without altering the resulting proteins. Initially, we searched for existing tools to achieve this in all three frames, but soon realized we had to develop our own optimizer due to the lack of tools capable of handling multi-frame codon optimization while allowing for custom motif exclusion such as splice or restriction sites.

Due to multiple downstream tasks, modeling had to begin as early as possible to maintain the project timeline. Much of our wet lab activities depended on the results of the modeling, as the optimized sequences were necessary for proceeding to experimental validation. Specifically, we first needed to:

Order the optimized partial sequences from gene synthesis providers, which involved significant delivery delays.
Assemble these sequences via cloning
Sequence the assembled plasmid, before moving towards testing the whole construct.

This made the dry lab work a critical part of ensuring that enough time was left for our wet lab experiments.

Additionally, we needed to not only optimize the actual protein sequences but also the nonsense peptides produced in frameshifted regions before the T2A sites. These peptides must be efficiently produced (being a rate-limiting bottleneck) before being discarded, making their codon optimization relevant for the entire construct.

Finally, we had to make sure that no problematic motifs such as non-RFC compliant restriction sites, splice sites, repeats, or other sequences that could negatively affect mRNA stability were present. Factors like GC content were carefully controlled to maintain optimal mRNA stability and prevent secondary structures that could lead to increased activation of mRNA degradation pathways.

The code for the multi-frame codon optimization, including necessary data files is available here.

Evolutionary Algorithm for Multi-Frame Optimization

To address the challenge of optimizing codons across multiple reading frames, we developed a custom evolutionary algorithm based on Multi-objective Optimization (MOO) using the NGSA-III approach, implemented with the DEAP Python framework.

Evolutionary algorithms mimic natural selection, where a population of solutions evolves through mutation, crossover, and selection to improve performance. In our case, the algorithm optimized codon sequences without altering the amino acid sequence.

The algorithm’s key operations are:

Mutation:
- Swapped synonymous codons to optimize codon usage.
- A mutation probability of 0.7 controlled how often sequences underwent mutation at all, while a mutation rate of 0.1 determined the likelihood of any single codon changing.
Crossover:
- Combined genetic material from two parent sequences to generate offspring.
- Crossover points were restricted to occur between codons to maintain codon integrity and avoid frame shifts.
- A crossover probability of 0.5 ensured that 50% of sequences were recombined each generation.
Selection: Sequences were evaluated based on 4 main objectives:
- Global optimization: Addressing frame-independent factors such as avoiding problematic motifs (e.g., restriction sites, splice sites), controlling GC content, and preventing repeats.
- Local frame optimization: Three separate objectives, one for each reading frame, focused on preventing premature stop codons and ensuring efficient translation, tailored for expression in our model cell line, HEK cells.
- The top 10 sequences of each generation were stored in a Hall of Fame for further analysis and benchmarking.
Population Initialization: The algorithm was run with a population size of 500 sequences over 300 generations, exploring a wide range of solutions and converging on optimized sequences.

To initialize the first generation, we back-translated the target protein sequences into DNA. This method randomly selected synonymous codons for each amino acid, generating valid DNA sequences from the protein sequence.

We also provided the option to seed the population with preset sequences. This allowed us to:

Reuse high-performing sequences from the Hall of Fame of previous runs.
Seed with known sequences to better estimate convergence and reduce the search space.

The constructs we optimized were:

T2A_miRFP670nano3_P2A_eUnaG (BBa_K5102012-BBa_K5102001-BBa_K5102015-BBa_K5102009)
T2A_mScarlet-3_P2A_eUnaG (BBa_K5102013-BBa_K5102002-BBa_K5102016-BBa_K5102010)
T2A_mTagBFP2-mut_P2A_eUnaG (BBa_K5102014-BBa_K5102003-BBa_K5102017-BBa_K5102011)

In all cases, eUnaG was included to serve as a stoichiometric signal baseline, ensuring equal amounts of eUnaG and the produced XFPs.

Fitness Functions and Implementation

The fitness functions in our evolutionary algorithm were designed to optimize both frame-specific and global objectives. These functions were combined hierarchically into a single fitness score, using weighted sums to ensure each objective was appropriately balanced. Here’s how the system was designed:

Frame-Dependent Fitness Functions: For each of the three reading frames, the following metrics were calculated:
- Codon adaptation index (CAI): Measures how closely the codons in a sequence match those used in highly expressed genes in HEK cells. A higher CAI indicates better alignment with natural, efficiently expressed mRNAs in our model cell line.
- Frequency of optimal codons: This calculates the proportion of codons in the sequence that are considered “optimal” for HEK cell translation. Codons most frequently used in the model organism’s protein synthesis machinery are preferred, and this metric rewards sequences with a higher percentage of these codons.
- Mean of typical decoding rates: This metric assesses the average translation rates (speed) for the codons in the sequence. Faster decoding rates help ensure efficient protein synthesis, and the algorithm favors sequences that maintain a high mean rate.
- Stop codon exclusion (local): Ensures no premature stop codons are introduced in the active reading frame, which would halt translation prematurely.
- Splice site exclusion (global): Although this is a global concern, it was integrated into frame-specific evaluations to enhance evolutionary pressure against splice sites. Reasoning is provided in the relevant section of key challenges.
Global Fitness Functions: In addition to frame-specific evaluations, the following global metrics were considered:
- GC content: Sequences with GC content outside the target range (52% for HEK cells) were penalized, as improper GC content could negatively affect mRNA stability and structure.
- Stop codon exclusion (global): This penalized stop codons across the entire sequence, not just a single reading frame.
- Restriction site exclusion: Penalized sequences that contained problematic restriction sites, which could interfere with cloning and BioBrick RFC [10] and RFC [1000] standards.
- Repeat exclusion: Penalized sequences with repetitive elements that could destabilize the mRNA or increase the likelihood of errors in protein synthesis.
- Splice site exclusion (global): Screened for canonical splice sites to avoid unintentional splicing during expression.
- Sum of squared residuals (SSR): This measured the deviation of fitness scores between frames, ensuring translation efficiency is balanced across all frames by minimizing disparities.

These global fitness scores were combined with frame-specific scores to define a Pareto-optimal front across candidate sequences within a generation. This ensured sequence diversity across generations, preventing premature convergence toward local optima.

Key Challenges and Solutions

Comparability of Fitness Functions

Making different fitness functions comparable for proper weighting was a challenge, as they needed to contribute meaningfully to the overall fitness score.

Solution: All fitness functions were normalized to a range of 0 to 1. Depending on the desired behavior, intermediate values were scaled using different template functions (linear, quadratic, or sigmoid) to ensure proper comparability and impact on the overall optimization. After scaling, the scores were weighted differently to establish a hierarchical priority of criteria within the multi-objective approach. This allowed us to emphasize critical objectives, such as avoiding premature stop codons, while balancing other factors like codon usage and GC content.

Unavoidable Motifs

One challenge we encountered was the presence of unavoidable problematic motifs that consistently occurred at the intersection of certain amino acids in other reading frames. These motifs could not be eliminated through the substitution of synonymous codons without disrupting the protein in other frames.

Solution: We identified these problematic positions and introduced targeted mutations in the affected protein. To ensure the functionality of the protein, we researched conserved domains and replaced the problematic amino acids with functionally similar ones. Additionally, we used AlphaFold 3 to model the protein structure, confirming that the introduced mutations did not affect the overall structural integrity of the protein.

Convergence to Zero Penalized Motifs

A key challenge was reducing the number of penalized motifs (e.g., stop codons, restriction sites) to zero. As the absolute number of these motifs decreases, the likelihood of introducing new problematic motifs by random mutation becomes higher than removing the last few remaining penalized motifs, making it harder to converge.

Solution: We included penalized motifs in all four fitness objectives (global and frame-specific) to increase evolutionary pressure, driving the convergence toward cleaner sequences with fewer problematic motifs. This, in combination with a large population and high number of generations, resulted in a motif count of 0 at the end of every single optimization run.

Efficient Identification of Repeats and Motifs

Identifying direct and inverted repeats or long motifs in DNA sequences is computationally challenging. A naive approach for finding repeats involves comparing every possible substring of the DNA sequence with every other substring, resulting in a time complexity of O(n²), where n is the length of the DNA sequence. For motif searching, the naive method requires scanning the sequence for each motif, leading to a time complexity of O(m * n * l), where m is the number of motifs, n is the sequence length, and l is the average motif length. As both the sequence length and the number of motifs increase, these naive approaches become computationally expensive.

Solution: We developed a k-mer-based algorithm that created a dictionary of unique sequences (k-mers) and their positions, including inverted versions. These k-mers are short, fixed-length subsequences of DNA (of length k) that are used to break down longer DNA sequences into manageable parts for analysis.

For Repeats: By tracking consecutive k-mers, the algorithm efficiently identifies repeat regions, reducing the complexity to O(n) for generating the k-mer dictionary and locating repeats, as comparisons are limited to matched k-mers rather than all substrings.
For Motifs: We reused this approach to detect motifs and splice sites of length ≥ k by leveraging the same k-mer dictionary. Checking k-mer segments for all m motifs takes O(m), which is significantly more efficient than the naive method.

This method avoids reprocessing sequences, significantly improving efficiency in identifying repeats and motifs in large DNA sequences.

Non-Canonical Splice Sites

While we could screen for canonical splice motifs within the evolutionary algorithm, non-canonical splice sites were more difficult to predict and could not be efficiently screened using pattern matching. Instead, AI tools like SpliceAI were required, but not feasible to run for every sequence at every generation due to computational costs.

Solution: We incorporated a manual quality control step after the optimization run concluded, to screen for non-canonical splice sites. We did this by running the sequences through SpliceAI-based detection code, replacing any detected splice sites with synonymous codons. The code was written in python and designed for running on Google Colab. It can be reviewed below.

SpliceAI Screening Code;

  # Code adapted from Martin Grosshauser's input after the human practices meeting with him
  !pip install spliceai
  !pip install logomaker
  
  #import of packages
  from keras.models import load_model
  from pkg_resources import resource_filename
  from spliceai.utils import one_hot_encode
  import numpy as np
  import tensorflow as tf
  import logomaker
  import pandas as pd
  from termcolor import colored
  
  import matplotlib.pyplot as plt
  
  context = 10000 #context for spliceai: 10kb
  paths = ('models/spliceai{}.h5'.format(x) for x in range(1, 6))
  models = [load_model(resource_filename('spliceai', x)) for x in paths]

  
  def predict_spliceai(input_sequence):
    '''
    splicing prediction by SpliceAI
    input: nucleotide sequence
    output: donor and acceptor probabilities for each nucleotide.
    Attention: The probability for acceptor sites is after the splice site (at the first nucleotide that is spliced in')
    '''
    x = one_hot_encode('N'*(context//2) + input_sequence + 'N'*(context//2))[None, :]
    y = np.mean([models[m].predict(x) for m in range(5)], axis=0)
    
    acceptor_prob = y[0, :, 1]
    donor_prob = y[0, :, 2]
    return acceptor_prob, donor_prob
  
  input_sequence = 'GCCGTGGTCCGCCGTGGCTTGCCATGGCGGGCCATGGTCCGCCATGGCTTGC'
  acceptor_prob,donor_prob = predict_spliceai(input_sequence)
  
  plt.plot(donor_prob,label="donor probability")
  plt.plot(acceptor_prob,label="acceptor probability")
  plt.legend()
  plt.show()
  
  donor_sorting = np.argsort(donor_prob)
  acceptor_sorting = np.argsort(acceptor_prob)
  
  top_donor_sites = np.flip(donor_sorting[-30:])
  top_acceptor_sites = np.flip(acceptor_sorting[-30:])
  
  print("\n Top splice candidates 5'")
  for top_donor_index in top_donor_sites:
    centered_11mer = input_sequence[top_donor_index-5:top_donor_index+6]
    colored_11mer = centered_11mer[0:3]+colored(centered_11mer[3], 'cyan')+centered_11mer[4:]
    print("Position: {} \t Score: {:2.4f} \t {}".format(top_donor_index,donor_prob[top_donor_index],colored_11mer))
  
  print("\n Top splice candidates 3'")
  for top_acceptor_index in top_acceptor_sites:
    centered_7mer = input_sequence[top_acceptor_index-3:top_acceptor_index+4]
    colored_7mer = centered_7mer[0:3]+colored(centered_7mer[3], 'cyan')+centered_7mer[4:]
    print("Position: {} \t Score: {:2.4f} \t {}".format(top_acceptor_index,acceptor_prob[top_acceptor_index],colored_7mer))

Final Screening and Logging

Automated final screening and logging of sequences (including their fitness scores in each generation) were critical for benchmarking and fine-tuning of the fitness function weights for optimal performance. These records allowed us to monitor how well the algorithm was performing, ensuring that we maintained diversity of solutions in our population of sequences and avoided premature convergence toward local optima. Given that an average optimization run could take more than one hour of computation time, this screening process was crucial for post-run analysis and future iterations. A second manual screening step was performed to quality control the obtained optimized sequences with commercially available tools.

To verify the findings presented by the Hall of Fame, a secondary manual screening was conducted utilizing various bioinformatics tools. This was aimed at assessing the feasibility of the produced DNA sequences. The first step involved comparing the top 10 candidates from the Hall of Fame, optimized across three reading frames, with the codon-optimized sequences obtained from Genscript.

Rare Codon Parameters Analysis

The rare codon analysis revealed a clear trend: our model’s codon optimization capability is highly comparable to the commercial Genscript codon optimizer. The commercial control achieved a maximum codon adaptation index (CAI) of 0.94, while our model reached a CAI of 0.83. This result is consistent with the challenge of optimizing three different frames simultaneously. Furthermore, the second and third reading frames demonstrated a CAI reduction to 0.75, reflecting an impressive optimization in all frames. In contrast, available tools either performed poorly, with CAI values as low as 0.3, or failed to produce results due to the presence of stop codons.

In addition to CAI, other essential parameters from our model confirmed an unbiased approach in selecting optimized codons. This unbiased selection is likely to enhance cellular fitness during expression across the three reading frames, facilitating continuous protein production without overwhelming the cellular machinery or generating protein aggregates.A secondary validation was performed using SpliceAI to ensure the absence of both direct and inverted splice sites within the designed constructs. This confirmed the potential for uninterrupted protein expression, free from splice-related disruptions.

Results of Rare Codon Analysis

Due to Confidentiality of the software used, we cant upload the files, but we are eager to explain and discuss about the raw data privately

Send us an E-mail

Outlook

Future improvements could include an adaptive mutation mechanism, where fitness functions generate a mutation probability map based on penalty scores in the sequence. This would concentrate mutations in problematic regions, enhancing convergence speed without significantly increasing computational overhead. Ensuring efficiency in the selection step while computing these probability maps will be critical, as this defines the rate-limiting operation of the evolutionary algorithm.

Molecular Dynamics Simulation

To evaluate whether the introduced rational mutations, necessary for hidden stop codon exclusion, affect the final structure of the XFPs (fluorescent proteins), a biophysical analysis was conducted. The structure of the AF3-predicted mutants was compared with published structures of mScarlet (PDB ID: 7ZCT), Nano3RFP670 (PDB ID: 7LSC), and mTagBFP (PDB ID: 3M24). Mutations were introduced rationally based on homologous sequences of the respective XFPs. The root mean square deviation (RMSD) between the reference PDB structures and the XFP regions indicated only minor differences, ranging from 0.4 to 1 Å.

Molecular dynamics (MD) simulations were performed under constant pressure and temperature to asses the stability of the chromophore cavity. These simulations were aimed at determining whether mutations introduced outside the chromophore pocket exerted an allosteric effect on the region where the chromophore forms. Due to technical challenges with the HPC cluster, the full results of these simulations will be presented following a wiki freeze. Preliminary wet-lab data, highlighted on our Results page, suggests successful production of XFPs across all three reading frames, indicating that the introduced mutations did not compromise the primary function of the proteins. However, further theoretical validation and a more detailed analysis are essential to fully show our model’s performance.

Conclusion

Overall, our results indicate strong potential for successful codon optimization and protein expression in multiple reading frames. These findings are highly encouraging and create the way for further advancements in synthetic biology.

Structural modelling of tape’s 3’UTR

ProgRAM is a complex system, yet it offers significant modularity and supports the incorporation of various add-ons, as outlined on our Project Description page. This flexibility is particularly evident in the 3’ UTR, which can accommodate additional elements with minimal impact on deamination or reporter expression. This region of the RNA is highly suitable for introducing RNA elements such as barcodes, ribozymes, or RNA aptamers.

To ensure that the ProgRAM design can integrate elements into its 3’UTR while preserving their structural integrity, we conducted an in silico study. In this analysis, we structurally modeled our tape with the addition of the Okra aptamer system (BBa_K5102070), which we obtained as a tool for RNA visualization and quantification. The Okra aptamer is a bright and stable aptamer designed for mRNA imaging. Our structural modeling (as demonstrated with Okra-appended BBa_K5102075) confirmed that all eight fluorophore-binding sites in the 796 bp 4xdOkra region were preserved in their original conformation, with minimal structural abberations.

Visualisation of ViennaRNA folding results (Lorenz et al., 2011) of BBa_K5102108 using VARNA (Darty et al., 2009). RNA tape regions are highlighted as follows (from 5’ to 3’): 5’UTR in gray, recording tape 3.0 in gold, miRFP670nano3 in dark red, mScarlet3 in bright red, mTagBFP2 in light blue, eUnaG in pale green, WPRE in magenta, 4xdOkra in purple.

Original folding results in dot-bracket notation can be reviewed here:

Folding results;

Original folding of 4xdOkra (BBa_K5102070):

GGUCCAUGUGUAUGUGGGGUCGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACCCCACAUACUUCUGAUGAUCCCGGGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGGGAUCAUUUCAUGGACCGCAAUCAAAAACCGCGAUGUCUAUGCGGGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCCCGCAUAGUUCUGAUCAUCCGUCGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACGGAUGAUUUCAUCGCGGAUAUUACUACUUGCCAUGUGUAUGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCAUACUUCUGAUGAUCCCGGGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGGGAUCAUUUCAUGGCACAAAAACAAAAACGUCGCGAUGUCUAUGCGGGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCCCGCAUAGUUCUGAUCAUCCGUCGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACGGAUGAUUUCAUCGCGACG
((((((((.((((((((((((((.(((((....)))))...((((((...)))))).....((((...)))).......))..))))))))))))....((((((((((((..(((((....)))))...(((((.....))))).....((((...))))..........)))))))))))).))))))))...........((((((((.(((((((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))))))....(((((((((((((.(((((....)))))...(((((.....))))).....((((...)))).......))..))))))))))).))))))))...........(((((((.(((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))....((((((((((((..(((((....)))))...(((((.....))))).....((((...))))..........)))))))))))).)))))))............((((((((((.(((((((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))))))....(((((((((((((.(((((....)))))...(((((.....))))).....((((...)))).......))..))))))))))).))))))))))

Predicted 4xdOkra (exctracted from BBa_K5102108):

GGUCCAUGUGUAUGUGGGGUCGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACCCCACAUACUUCUGAUGAUCCCGGGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGGGAUCAUUUCAUGGACCGCAAUCAAAAACCGCGAUGUCUAUGCGGGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCCCGCAUAGUUCUGAUCAUCCGUCGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACGGAUGAUUUCAUCGCGGAUAUUACUACUUGCCAUGUGUAUGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCAUACUUCUGAUGAUCCCGGGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGGGAUCAUUUCAUGGCACAAAAACAAAAACGUCGCGAUGUCUAUGCGGGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCCCGCAUAGUUCUGAUCAUCCGUCGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACGGAUGAUUUCAUCGCGACG
((((((((.((((((((((((((.(((((....)))))...((((((...)))))).....((((...)))).......))..))))))))))))....((((((((((((..(((((....)))))...(((((.....))))).....((((...))))..........)))))))))))).))))))))(((((......((((((((.(((((((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))))))....(((((((((((((.(((((....)))))...(((((.....))))).....((((...)))).......))..))))))))))).))))))))...........(((((((.(((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))....((((((((((((..(((((....)))))...(((((.....))))).....((((...))))..........)))))))))))).)))))))...((((((...((((((((((.(((((((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))))))....(((((((((((((.(((((....)))))...(((((.....))))).....((((...)))).......))..))))))))))).))))))))))

pRAM_ProgRAM-recording-tape3.0-4xdOkra (BBa_K5102108):

GUCAGAUCGCCUGGACACGCCAUCCACGCUGUUUUGACCUCCAUAGAAGACACCGGGACCGAUCCAGCCUCCGGACUCUAUAAUACGACUCACUAUAGGGCCGUGGUCCGCCGUGGCUUGCCAUGGCGUGCCAUGGUCCGCCAUGGCUUGCUGGAUCUGGGGAGGGCAGAGGCUCGCUGCUCACCUGCGGCGACGUGGAGGAGAACCCUGGCCCCAUGGCCAACCUGGACAAGAUGCUCAACACAACAGUCACCGAGGUUCGCAAAUUUCUCCAGGCAGAUCGGGUGUGCGUCUUCAAAUUCGAAGAGGAUUAUUCGGGAACAGUGUCCCACGAGGCCGUGGACGACAGGUGGAUUUCAAUCCUCAAGACUCAGGUCCAGGAUCGGUACUUCAUGGAGACACGGGGAGAGGAAUAUGUGCACGGCCGAUACCAGGCCAUUGCAGACAUAUACACGGCCAACCUGGUGGAGUGCUACAGGGAUCUGCUCAUCGAGUUCCAGGUUCGCGCCAUUCUCGCAGUGCCCAUCCUCCAGGGCAAAAAGCUGUGGGGGCUGCUGGUCGCCCACCAGCUGGCAGGCCCCCGGGAAUGGCAGACCUGGGAGAUCGACUUUCUCAAGCAACAGGCUGUGGUCAUGGGGAUCGCCAUACAGCAGAGCGGCUCAGGCGCCACCAAUUUUUCACUGCUCAAACAGGCGGGGGACGUGGAGGAGAAUCCCGGCCCCCUGCUGGAAAAGUUCGUGGGGACCUGGAAGAUCGCCGAUUCUCACAACUUCGGAGAGUACCUCAAGGCCAUCGGUGCCCCCAAGGAACUUUCCGAUGGUGGCGAUGCCACCACCCCUACGCUGUACAUCUCACAGAAGGAUGGCGACAAGAUGUCCGUCAAAAUCGAGAAUGGCCCCCCUACCUUCCUGGAUACACAAGUCAAAUUCAAGCUGGGGGAAGAGUUCGACGAGUUUCCUUCCGAUCGGAGAAAAGGGGUCAAGUCAGUCGUCAAUCUGGUCGGCGAGAAGCUCGUUUAUGUCCAGAAAUGGGAUGGCAAGGAGACAACCUAUGUUCGCGAGAUCAAAGAUGGCAAACUGGUGGUCACGCUUACCAUGGGCGAUGUCGUGGCAGUUCGCAGCUACCGCCGCGCCACCGAGUGACGGCUCUGGGGAAGGCCGCGGCUCCUUACUCACGUGCGGGGAUGUCGAAGAAAACCCGGGCCCGAUGGAUUCGACGGAGGCGGUCAUCAAGGAGUUCAUGCGAUUCAAAGUCCACAUGGAGGGAUCUGUCAACGGACACGAGUUCGAAAUCGAAGGCGAAGGGGAGGGCAGACCUUACGAGGGGACUCAAACGGCGAAGCUUCGAGUUACAAAGGGUGGCCCAUUGCCUUUUUCCUGGGAUAUACUGUCUCCGCAGUUUAUGUACGGUUCGCGCGCGUUCACAAAGCAUCCCGCAGACAUCCCGGACUACUGGAAGCAGUCGUUUCCCGAAGGCUUCAAAUGGGAGAGGGUUAUGGUAUUCGAGGAUGGCGGUGCCGUUUCUGUUGCCCAAGACACAUCACUGGAGGACGGAACUCUCAUUUACAAGGUCAAGCUGCGCGGGACCAACUUCCCCCCGGAUGGCCCCGUUAUGCAGAAAAAGACCAUGGGUUGGGAGGCGUCUACGGAACGCCUCUACCCAGAGGACGUGGUUCUCAAGGGGGACAUCAAAAUGGCCUUGCGGCUCAAGGAUGGAGGGCGUUAUCUCGCGGAUUUCAAGACAACCUAUCGGGCAAAGAAGCCGGUCCAGAUGCCUGGAGCAUUCAAUAUCGAUCGGAAGCUCGAUAUCACCUCCCACAACGAAGAUUAUACAGUCGUCGAGCAAUACGAGCGUUCUGUCGCCCGCCACAGCACGGGAGGGUCCGGCGGUUCAGGAUCCGGUGCCACGAACUUUUCUUUACUCAAGCAAGCGGGGGACGUGGAAGAAAACCCCGGCCCGCUCCUCGAGAAGUUCGUCGGUACCUGGAAGAUCGCAGACUCACACAACUUCGGCGAGUACCUCAAAGCUAUUGGGGCCCCCAAAGAGCUCUCAGACGGUGGCGACGCGACUACUCCCACCCUAUAUAUUUCUCAGAAAGAUGGGGACAAGAUGUCGGUCAAGAUCGAGAACGGGCCUCCGACGUUCCUCGACACCCAGGUCAAAUUCAAGUUGGGGGAAGAGUUCGACGAAUUUCCGAGCGAUCGCCGGAAGGGGGUCAAGUCCGUCGUCAACCUGGUCGGGGAGAAACUGGUGUACGUCCAAAAGUGGGACGGGAAAGAAACCACUUAUGUGCGGGAGAUCAAGGACGGGAAGCUCGUGGUUACCCUCACAAUGGGGGACGUUGUCGCUGUACGAUCGUACCGCCGAGCAACGGAGUAACGGCAGCGGAGAAGGCCGUGGAUCUCUACUCACCUGUGGCGACGUGGAAGAAAACCCCGGCCCCAUGGUUUCGAAGGGGGAGGAACUUAUCAAAGAGAACAUGCACAUGCGCCUCUACAUGGAGGGCACAGUGGACAACCACCAUUUCAAGUGCACGAGCGAAGGCGAGGGAAAGCCCUACGAAGGGACACAAACAAUGCGGAUCAAGGUGGUCGAGGGUGGCCCCCUGCCCUUCGCUUUCGACAUCCUGGCUACCUCAUUCCUCUACGGCAGCAAAACAUUCAUCAACCACACGCAGGGAAUCCCGGAUUUCUUCAAGCAGUCCUUUCCAGAGGGCUUCACGUGGGAGCGCGUUACCACAUACGAGGACGGCGGCGUCCUCACGGCCACUCAAGACACCUCACUCCAGGAUGGCUGCCUUAUAUACAAUGUCAAGAUCAGAGGCGUCAACUUCACAUCCAACGGGCCGGUUAUGCAGAAGAAGACGCUCGGCUGGGAGGCUUUCACAGAAACCCUGUACCCUGCGGACGGAGGCCUGGAAGGGCGAAACGACAUGGCCCUCAAGCUGGUUGGGGGAUCCCAUCUUAUCGCCAACGCCAAGACCACAUACAGAUCCAAAAAGCCUGCGAAGAACCUCAAAAUGCCCGGCGUCUACUACGUCGACUACCGGUUGGAGCGGAUCAAGGAAGCCAACAACGAGACAUAUGUGGAGCAGCACGAAGUGGCAGUGGCCCGAUACUGCGAUCUUCCAUCCAAACUGGGCCACAAACUCAAUGGCUCCGGUGCUACAAACUUCAGCUUACUCAAACAGGCUGGGGACGUGGAGGAGAACCCUGGCCCCCUUCUGGAAAAAUUUGUCGGCACCUGGAAAAUCGCCGACAGCCACAAUUUUGGCGAGUACCUCAAAGCCAUCGGCGCCCCCAAGGAGCUCAGCGAUGGAGGAGAUGCGACCACACCUACACUCUAUAUCAGCCAGAAAGACGGGGACAAAAUGUCCGUCAAGAUCGAGAACGGCCCGCCGACUUUCCUGGACACCCAGGUCAAAUUCAAGCUGGGCGAGGAGUUCGACGAGUUCCCCAGCGACCGCCGGAAGGGGGUCAAAAGCGUGGUCAACCUGGUCGGAGAGAAACUGGUGUACGUGCAAAAGUGGGAUGGCAAAGAAACCACCUACGUUCGCGAGAUCAAGGACGGGAAACUGGUGGUCACACUUACUAUGGGAGAUGUCGUGGCGGUGCGCUCCUACCGGAGAGCUACCGAGUAGGCUCGCUUUCUUGCUGUCCAAUUUCUAUUAAAGGUUCCUUUGUUCGCGUUGCUGGACCAGAAGAUAUGGCUUCGGUUUACCGCUGAGCAAUAACUAGCAUAACCCCUUGGGGCCUCUAAACGGGUCUUGAGGGGUUUUUUGUCGACAAUCAACCUCUGGAUUACAAAAUUUGUGAAAGAUUGACUGGUAUUCUUAACUAUGUUGCUCCUUUUACGCUAUGUGGAUACGCUGCUUUAAUGCCUUUGUAUCAUGCUAUUGCUUCCCGUAUGGCUUUCAUUUUCUCCUCCUUGUAUAAAUCCUGGUUGCUGUCUCUUUAUGAGGAGUUGUGGCCCGUUGUCAGGCAACGUGGCGUGGUGUGCACUGUGUUUGCUGACGCAACCCCCACUGGUUGGGGCAUUGCCACCACCUGUCAGCUCCUUUCCGGGACUUUCGCUUUCCCCCUCCCUAUUGCCACGGCGGAACUCAUCGCCGCCUGCCUUGCCCGCUGCUGGACAGGGGCUCGGCUGUUGGGCACUGACAAUUCCGUGGUGUUGUCGGGGAAGCUGACGUCCUUUCCAUGGCUGCUCGCCUGUGUUGCCACCUGGAUUCUGCGCGGGACGUCCUUCUGCUACGUCCCUUCGGCCCUCAAUCCAGCGGACCUUCCUUCCCGCGGCCUGCUGCCGGCUCUGCGGCCUCUUCCGCGUCUUCGCCUUCGCCCUCAGACGAGUCGGAUCUCCCUUUGGGCCGCCUCCCCGCCUGGAAGGUCCAUGUGUAUGUGGGGUCGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACCCCACAUACUUCUGAUGAUCCCGGGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGGGAUCAUUUCAUGGACCGCAAUCAAAAACCGCGAUGUCUAUGCGGGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCCCGCAUAGUUCUGAUCAUCCGUCGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACGGAUGAUUUCAUCGCGGAUAUUACUACUUGCCAUGUGUAUGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCAUACUUCUGAUGAUCCCGGGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGGGAUCAUUUCAUGGCACAAAAACAAAAACGUCGCGAUGUCUAUGCGGGCGGGGACCAUCCAGCGAUGGAUACUGCUGCUUCGGCAGAGCAAGGAUAAUAUCUAGAACGACCAUCCGCCCGCAUAGUUCUGAUCAUCCGUCGGACCAUCCAGCGAUGGAUACUGCUGCGAAAGCAGAGCAAGGAUAAUAUCUAGAACGACCAUGACGGAUGAUUUCAUCGCGACGACAACUUGUUUAUUGCAGCUUAUAAUGGUUACAAAUAAAGCAAUAGCAUCACAAAUUUCACAAAUAAAGCAUUUUUUUCACUGC
((((((.(((.((((.......)))).)).).))))))((((((.((((..((((((((((..........(((.(((((((...........))))))))))))))))((((((((..(((((((....)))))))..)))))))).(.((((((((((((((((..((((.(((((..((..(((((((..(((((.((((((...(((((....)))))((((.(....(((.............)))..).))))........)))))).((((..(((((((..(.(((((......))))).)..)))))))......)))))))))..)))))))..))..))))).))))..)))))....))))))))))).).)))).))))))))))...((((((((((((((((((((.((((.......))))..)))..)))))).(((((((..((((.(((....)))))))....((((..((((.(((((((((...((((((((((((((((((........)))).....)))))(((((((((((((.....)))))).....)))))))))))))))).)))))))))..))))........))))...))))))).....((((...((((((((((((((((((((((((.((((.((((((((.(((......)))(((((((.((.(((......)))))..)))))))..)))))))).)))).).)))))......)))).)))).......(((((((((.(((...((((....).)))....))).)))))))))(((((((...))))))).......)))))...((((.......((((((........).)))))......))))))))).))))....((((.((........(((.((((((.(((.(((......(((((((....(((..(((..(((((..((((((((......((((((..(((((.(((((((...)))))...)).))))).....)))))).....((((.((((((...(((.(((((.((........))..))))).)))....))))))...))))(((((..(((.(((.....(((((((......))).)))).))).)))..))))))))))))).))).))..))))))))))))).....))).)))....)))))).)))((((((((((.....((((...((.((((((...((((........))))..(((..(((((....))))).....((.((((((((.((((((.........(((....)))..((.((....(((((((((((.....)))))((((.((((((((((((((((.(.((((....((.((..((((.((.....)).))))..))...)).....))))).))))))........(((((..(((.....)))..)))))....)))))))))).))))...))))))...)).))(((((((((((((((((...........)))..((((((((.(((........((((...(((((((((.(((...(((....)))))).)))))...)))).....)))).(((((..(((((((((....).)))))))))))))((((((...))))))..(((((.((((.....((((....))))....)))).(((..(((((((...)))).....)))..)))....(((......)))(.(((((....))))).).....(((((((.(....).)))))))...)))))....(((.((((....)))).)))........))).)))))))).((((........)))).(((((..((((((.....((((((((((((((((((((........(((..((.((((.............)))).))..)))....))))))))))).)))))).))).)))))).)))))........((((((.((.((((((.....))))))((((((.....(((((.(((((((((((((.((((((((((.((((((...((((...)))).))))))....(((....)))....(((.((((...((((((((.......(((......)))........))))))))....))))..)))..(((((.((....)))))))))))))..))))))))))).)))..))).)))))....(((.......))).((((((.............)))))).((((.(((.((((....(((((...))))))))).))).)))).))))))(((..((((.....)))))))))))))))))))))))..)))))).((((((.(((((((((((((((((((.((...)).)).)))).)))...((((((((....)))).....))))(((((((((......)))....(((((.(((.(((((....))))))))..((((....))))........))))).(((((((((((((((....((((....))))...........((((.....((.((((......)))))))))))))))))))))).).))............))))))..((((((((..................(((((((....))(((((.((...)).)))))..(((((..(((..((((.((((((..((........))(((((((....)))))))..(((((..........(((.(((((((((((((((.(((((((...((((..(......(((.((....((...((((((.((..((((..((.(.((.(((((((.(((...((((.((((.(((.....((((.((....)))))).(((.(((...(((((...((.(((((((((((.....)))))))...)))).)).)))))....))).).))...................))).))))..)))).....))).))))))).)).))).)))).)).))))))...)).....)).))).....)..))))))))))).)))))..(((((.((((((......)))))).).)))))))))..(((((((((.........)))).)))))........(((((((((........)))))))))...)))))))).....))))).((....)).......(((((((............)))))))((((......))))((((.(((....(((...))).......))).))))(((.((((((((((((((((((((((....((((...((.(((((...((((((.(.....).))))))............(((((((.......(((((....)))))..........((((((.(((..((...))...)))))))))((((.((....)).))))....))).))))...))))).)))))).....))))).((((.((..(((((.(((........)))))))).))))))....((((((.((((((.(((.(((((.......((((((((..((.((((((((((.((((....)))).)).))).((((((........))))))........................(((((((((..(((((((((.......))).))))))....))).))))))..((((((..(((((((((((((((......)))))))))))))))((((((....((((........))))))))))...(((((((...(((((((.......)))).)))....))))))))))).)).((((((..((......))...))))))..))))).))))))))))))))).))).)))))).))))))..........)))))).))))))....(((((((((..(((((((((...)))))).)))((((((.(((.((((((....))))))..(((((.....)))))...)))))))))....)))))))))...(((((.((.((((((((............(((.((((((.(((..(((....(((..((((.(((.....)))))))..))).....)))..)))...)))))).))).....)))))))).)).))))).))))).))))))).))))))..)))..)))))..))))).(((((((.........)))))))...((((.........((((..........))))))))))))))))..)))))))))).)))))))))))).)))))))).)).))))))))).)))))).....))))))))))...)).))))((((((((.((((((((((((((.(((((....)))))...((((((...)))))).....((((...)))).......))..))))))))))))....((((((((((((..(((((....)))))...(((((.....))))).....((((...))))..........)))))))))))).))))))))(((((......((((((((.(((((((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))))))....(((((((((((((.(((((....)))))...(((((.....))))).....((((...)))).......))..))))))))))).))))))))...........(((((((.(((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))....((((((((((((..(((((....)))))...(((((.....))))).....((((...))))..........)))))))))))).)))))))...((((((...((((((((((.(((((((((((((..(((((....)))))...((((((...)))))).....((((...))))..........)))))))))))))....(((((((((((((.(((((....)))))...(((((.....))))).....((((...)))).......))..))))))))))).)))))))))).....))))))))))).((((....((....))....))))..............................)))))))).))).

Mathematical Modeling of RNA Transcription and Degradation Dynamics

Overview

To understand and optimize the performance of our engineered RNA constructs within cellular environments, we developed a predictive mathematical model to simulate RNA transcription and degradation dynamics in living cells. This model allows us to understand how our construct behaves over time and to optimize its performance accordingly.

Model Development

The model is based on a first-order ordinary differential equation (ODE) that describes the rate of change of RNA concentration X(t) over time:

dX/dt = α(t) - β(t) * X(t)

α(t): Transcription rate function.
β(t): Degradation rate function.

Transcription Rate α(t)

To capture the complex dynamics of transcription initiation, peak expression, and eventual decline, we modeled the transcription rate α(t) using a double-sigmoid function:

α(t) = (1/h1) * [h0 + (h1 - h0) / (1 + exp(-lambda * (t - t1)))] * [h2 + (h1 - h2) / (1 + exp(-lambda * (t - t2)))]

Where:

h0: Initial transcription rate.
h1: Peak transcription rate.
h2: Final transcription rate after decline.
t1: Time when transcription rate increases.
t2: Time when transcription rate decreases.
λ: Steepness of the sigmoid transitions.

def alpha(t, h0, h1, h2, t1, t2, lambda_param):
    """
    Calculate the transcription rate alpha(t) using a double-sigmoid model.

    Parameters:
    - t: Time
    - h0, h1, h2: Sigmoid heights
    - t1, t2: Transition times
    - lambda_param: Steepness of the sigmoid transitions

    Returns:
    - alpha(t): Transcription rate at time t
    """
    return (1/h1) * (h0 + (h1 - h0) / (1 + np.exp(-lambda_param * (t - t1)))) * \
           (h2 + (h1 - h2) / (1 + np.exp(-lambda_param * (t - t2))))

Degradation Rate β(t)

We considered two scenarios for the degradation rate:

Constant Degradation Rate:

beta(t) = β_0
- β_0: Constant degradation rate.

def rna_dynamics_varying_degradation(t, X, h0, h1, h2, t1, t2, lambda_param, beta0, k):
    """
    Model RNA dynamics with a varying degradation rate.

    Parameters:
    - t: Time
    - X: Current RNA concentration
    - h0, h1, h2, t1, t2, lambda_param: Parameters for the alpha function (initial, peak and steady state)
    - beta0, k: Parameters for the exponentially varying degradation rate

    Returns:
    - dX/dt: Rate of change of RNA concentration
    """
    return alpha(t, h0, h1, h2, t1, t2, lambda_param) - beta0 * np.exp(k * t) * X

Exponentially Varying Degradation Rate:

β(t) = β_0 * exp(k * t)
- β_0: Base degradation rate.
- k: Exponential rate of change of the degradation rate.

def rna_dynamics_constant_degradation(t, X, h0, h1, h2, t1, t2, lambda_param, beta0):
    """
    Model RNA dynamics with a constant degradation rate.

    Parameters:
    - t: Time
    - X: Current RNA concentration
    - h0, h1, h2, t1, t2, lambda_param: Parameters for the alpha function (initial, peak and steady state)
    - beta0: Constant degradation rate

    Returns:
    - dX/dt: Rate of change of RNA concentration
    """
    alpha_t = alpha(t, h0, h1, h2, t1, t2, lambda_param)
    return alpha_t - beta0 * X

Implementation

We implemented the model in Python, utilizing the solve_ivp function from the scipy.integrate library to numerically solve the ODE. The complexity of α(t) necessitated a numerical solution rather than an analytical one.

Solving of ODE

def solve_rna_dynamics(t, X0, params):
    """
    Solve the ODE for RNA dynamics with varying degradation rate.

    Parameters:
    - t: Array of time points
    - X0: Initial RNA concentration
    - params: Dictionary of parameters for the model

    Returns:
    - X: RNA concentrations at the time points t
    """
    args = (
        params['h0'],
        params['h1'],
        params['h2'],
        params['t1'],
        params['t2'],
        params['lambda_param'],
        params['beta0'],
        params['k']
    )
    sol = solve_ivp(
        lambda t, X: rna_dynamics_varying_degradation(t, X, *args),
        [t[0], t[-1]],
        X0,
        t_eval=t,
        method='RK23',
        rtol=1e-6,
        atol=1e-9
    )
    return sol.y[0]

Parameter Estimation

To estimate the model parameters from data, we employed Maximum Likelihood Estimation (MLE). We defined a negative log-likelihood function assuming normally distributed errors between the observed data and model predictions. The optimization process minimizes this function to find the parameter values that best fit the data.

Negative Log-Likelihood Function

def negative_log_likelihood(params_array, t, X_obs, X0, sigma):
    """
    Negative log-likelihood function for RNA dynamics model.

    Parameters:
    - params_array: Array of parameters to estimate
    - t: Time points
    - X_obs: Observed RNA concentrations
    - X0: Initial RNA concentration
    - sigma: Standard deviation of measurement errors

    Returns:
    - nll: Negative log-likelihood
    """
    # Unpack parameters
    h0, h1, h2, t1, t2, lambda_param, beta0, k = params_array

    # Ensure parameters remain within valid ranges
    if any([
        h0 < 0, h1 < 0, h2 < 0, t1 < 0, t2 < 0,
        lambda_param < 0, beta0 < 0
    ]):
        return np.inf  # Penalize invalid parameter values

    # Pack parameters into a dictionary
    params = {
        'h0': h0,
        'h1': h1,
        'h2': h2,
        't1': t1,
        't2': t2,
        'lambda_param': lambda_param,
        'beta0': beta0,
        'k': k
    }

    # Solve the ODE to get predicted RNA concentrations
    X_pred = solve_rna_dynamics(t, X0, params)

    # Compute residuals
    residuals = X_obs - X_pred

    # Compute negative log-likelihood
    nll = (1 / (2 * sigma ** 2)) * np.sum(residuals ** 2)

    return nll

Optimization Procedure

We used the minimize function from scipy.optimize to perform the optimization:

result = minimize(
    negative_log_likelihood,
    initial_guess,
    args=(t_data, X_obs, X0, sigma),
    method='L-BFGS-B',
    bounds=bounds,
    options={'maxiter': 1000}
)

Data Generation

Synthetic Data

To validate and test our model, we first generated synthetic data:

Generation: We solved the ODE using known “true” parameters to generate the RNA concentration over time.
Adding Noise: We added normally distributed noise to simulate experimental variability:

np.random.seed(0)  # For reproducibility
sigma_noise = 0.1  # Standard deviation of the noise
X_obs = X_true + np.random.normal(0, sigma_noise, size=X_true.shape)
X_obs = np.clip(X_obs, a_min=0, a_max=None)  # Ensure non-negative concentrations

Purpose: This synthetic dataset serves as a proxy for experimental observations, allowing us to test the parameter estimation process in a controlled setting where the true parameters are known.

Experimental Data

Additionally, we plan to conduct quantitative PCR (qPCR) analysis to evaluate the behavior of our construct in HEK293T cells. This experimental data will provide real-world observations to further refine our model.

Results

The graph shown reflects the results from synthetic data generated to test our RNA dynamics model. The good agreement between the fitted model (red dashed line) and the true model parameters (solid blue line), along with the observed synthetic data points (blue dots), underscores the model’s effectiveness.

Key Takeaways:

Validation of the Model’s Predictive Accuracy: The close match between the model predictions and the synthetic data validates the mathematical model’s capability to accurately simulate RNA transcription and degradation processes. This confirms that our equations and numerical methods are well-suited to capture the underlying biological phenomena.
Preparation for Experimental Validation: The success with synthetic data sets the stage for subsequent validation with real experimental data. It ensures that the model is robust enough to handle real-world complexity and variability, which will be crucial in translating these findings into practical applications.

Future Directions

Use of Experimental Data: Incorporating experimental data from qPCR analyses will allow us to refine the model parameters further, increasing the model’s predictive power.
Model Refinement: Incorporate additional biological factors such as RNA-binding proteins like dCAS and microRNAs to enhance model accuracy.

Appendices

Parameter Definitions

h0: Initial transcription rate (before induction).
h1: Maximum transcription rate (peak expression).
h2: Final transcription rate (after decline).
t1: Time at which transcription rate begins to increase.
t2: Time at which transcription rate begins to decrease.
lambda: Controls the steepness of the increase and decrease in transcription rate.
beta0: Base degradation rate of RNA.
k: Rate at which the degradation rate changes over time.

Synthetic 5’UTR design

To enhance mRNA translation, we explored the use of synthetic 5’ untranslated regions (UTRs) as indicated by existing research in gene editing applications delivered via mRNA. We utilized a deep learning model developed by (Castillo-Hair et al., 2024), which is designed to optimize 5’UTRs for efficient mRNA translation. This model employs generative neural networks combined with gradient descent and is trained on polysome profiling data from randomized 5’UTR libraries across multiple cell types. This training allows the model to identify sequence features that significantly enhance translation efficiency.

We input our target coding sequence into the model, which then generated optimized 5’UTR sequences that aimed to maximize translation efficiency. The optimization process accounted for various factors affecting translation initiation, including Kozak consensus sequences, upstream open reading frames (uORFs), and RNA secondary structures. By applying gradient descent, the model iteratively adjusted the UTR sequences to reduce potential impediments to ribosome binding and scanning, such as stable secondary structures.

To validate the model’s predictions, we calculated the mean ribosome load (MRL) for each designed UTR, serving as an indicator of translation efficiency. We also assessed the minimum free energy (MFE) of these sequences using RNA folding prediction tools like RNAfold (Lorenz et al., 2011), where a lower MFE suggests fewer stable secondary structures that might negatively impact translation. Our analyses confirmed that the optimized UTRs had minimal secondary structures that could hinder translation initiation.

Our results showed that 50 bp UTRs provided the best balance between including necessary regulatory elements and minimizing secondary structures. These optimized UTRs exhibited higher predicted mean ribosome loads and favorable MFE values, indicating efficient translation initiation. The generated part was submitted as BBa_K5102065.

References

Castillo-Hair, S., Fedak, S., Wang, B., Linder, J., Havens, K., Certo, M., & Seelig, G. (2024). Optimizing 5’UTRs for mRNA-delivered gene editing using deep learning. Nature Communications, 15(1), 5284. https://doi.org/10.1038/s41467-024-49508-2
GenScript. (n.d.). GenSmart™ Free Gene Codon Optimization. Retrieved from https://www.genscript.com/gensmart-free-gene-codon-optimization.html
GenScript. (2020). Codon Optimization Patent: WO2020024917A1. Retrieved from https://www.genscript.com/gsfiles/techfiles/WO2020024917A1_Codon_optmization_Patent.pdf
Raab, D., Graf, M., Notka, F., Schödl, T., & Wagner, R. (2010).** The GeneOptimizer Algorithm: Using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization. Systems and Synthetic Biology, 4(3), 215-225. https://doi.org/10.1007/s11693-010-9062-3
Sharp, P. M., & Li, W. H. (1987). The codon Adaptation Index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15(3), 1281-1295. https://doi.org/10.1093/nar/15.3.1281
Ikemura, T. (1981). Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: A proposal for a synonymous codon choice that is optimal for the E. coli translational system. Journal of Molecular Biology, 151(3), 389-409. https://doi.org/10.1016/0022-2836(81)90003-6
Sørensen, M. A., Kurland, C. G., & Pedersen, S. (1989). Codon usage determines translation rate in Escherichia coli. Journal of Molecular Biology, 207(2), 365-377. https://doi.org/10.1016/0022-2836(89)90260-X
Plotkin, J. B., & Kudla, G. (2011). Synonymous but not the same: the causes and consequences of codon bias. Nature Reviews Genetics, 12(1), 32-42. https://doi.org/10.1038/nrg2899
DEAP Team. (n.d.). Multi-objective Optimization using NSGA-III. Retrieved from https://deap.readthedocs.io/en/master/examples/nsga3.html
Deb, K., & Jain, H. (2014). An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach. IEEE Transactions on Evolutionary Computation, 18(4), 6577–622. https://doi.org/10.1109/TEVC.2013.2281535
Jaganathan, K., Kyriazopoulou Panagiotopoulou, S., McRae, J. F., Darbandi, S. F., Knowles, D., Li, Y. I., … & Farh, K. K. (2019). Predicting Splicing from Primary Sequence with Deep Learning. Cell, 176(3), 535-548.e24. https://doi.org/10.1016/j.cell.2018.12.015
Lorenz, R., Bernhart, S. H., Höner zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1), 26. https://doi.org/10.1186/1748-7188-6-26
NCBI. (1994). BLAST: Basic Local Alignment Search Tool. https://blast.ncbi.nlm.nih.gov/Blast.cgi
Darty, K., Denise, A., & Ponty, Y. (2009). VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics, 25(15), 1974–1975. https://doi.org/10.1093/bioinformatics/btp250