Optimizing Wetlab with Software Analysis
This year, Lambert iGEM simulated CRISPRi and Toehold reactions using MATLAB, and also optimized toeholds using machine learning.
Toehold switches are versatile biosensors used across various applications, demonstrating significant utility in detecting infectious diseases and in precisely diagnosing rare genetic disorders and cancers (Chau et al., 2020). This year, our project uses toeholds to rapidly validate our CRISPRi-based antimicrobial. However, the design of novel toehold switches is the most significant barrier to widespread adoption. Current design tools, such as NUPACK and ViennaRNA, utilize thermodynamic models to predict RNA secondary structures, but these methods are far from perfect. Researchers must design and experimentally test multiple toeholds before identifying a suitable candidate. NUPACK’s predictive accuracy is low, achieving an R² value of only 0.15 out of 1. Motivated by this challenge and recognizing the transformative power of machine learning, we set out to develop SWORD (Structured Workbench for Optimized RNA Design): a better way to design toehold switches with machine learning. By integrating state-of-the-art language models and data-driven techniques, our tool not only predicts but also generates highly efficient toehold switches.
SWORD is designed to overcome the inefficiencies in traditional toehold switch design by utilizing machine learning to predict and generate optimized sequences. It consists of a predictive model that forecasts the on-off values of toehold switches based on their nucleotide sequences, and a generative model that creates new, optimized sequences with a higher probability of success.
The predictive model is built on a regression framework, trained on a dataset of over 191,000 validated toehold sequences, which includes information on the efficiency of each sequence in both active (on) and inactive (off) states. This dataset was obtained from Angenent-Mari et al.’s research and provided a solid foundation for training (Angenent-Mari et al., 2020).
Initially, the regression model was designed to predict the likelihood of a toehold functioning effectively by analyzing its nucleotide composition and sequence properties. By experimenting with different model architectures, hyperparameter tuning, and training using advanced GPUs, we were able to improve its predictive accuracy. The final model incorporates insights from the sequence features that are critical for toehold functionality, including secondary structure stability, binding motifs, and nucleotide interactions. This was done by analyzing secondary structure through ViennaRNA and finding connections through convolutional neural networks. This predictive model serves as the backbone for assessing whether a designed toehold switch is likely to perform well in experimental settings, significantly reducing the need for repeated testing.
To automate and optimize the creation of new toehold sequences, we implemented a modified version of a Generative Adversarial Network (GAN). The GAN consists of two neural networks: a generator and the predictive model from before. The generator is responsible for creating new toehold sequences based on an input gene sequence (Del Pra, 2023). The predictive model assesses the generated sequences by comparing them with the dataset of toehold switches to evaluate their effectiveness.
During training, the generator learns to produce increasingly better toehold sequences, optimizing the on-target score and reducing the off-target score by learning from the characteristics of effective switches. The predictive model acts as a reward function, providing feedback to the generator as a composite score from 0 to 1 for each toehold sequence. This process enables the generator to evolve and create sequences that are more likely to pass experimental validation, drastically reducing the time and cost associated with trial-and-error methods (see Fig. 1).
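As a rough illustration of how two predicted values could be folded into a single reward between 0 and 1, here is a minimal sketch. The formula and the names `clip01` and `composite_score` are assumptions made for this example; the text above does not state how SWORD actually combines its predictions.

```python
def clip01(value: float) -> float:
    """Clamp a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def composite_score(on_value: float, off_value: float) -> float:
    """Hypothetical reward: high when 'on' is high and 'off' is low.

    Assumes both predictions are scaled to [0, 1]. This is NOT SWORD's
    documented formula, only one plausible way to build such a score.
    """
    on_value = clip01(on_value)
    off_value = clip01(off_value)
    # (on - off) lies in [-1, 1]; shift and scale it into [0, 1].
    return (on_value - off_value + 1.0) / 2.0


if __name__ == "__main__":
    # An ideal switch (fully on, never leaky) gets the top reward.
    print(composite_score(1.0, 0.0))
    # A switch that is equally on and off sits in the middle.
    print(composite_score(0.5, 0.5))
```

With a score of this shape, a generator is rewarded for widening the gap between the on and off states rather than for raising the on value alone.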
This dual-model approach, combining both the predictive and generative components, allows SWORD to not only evaluate existing toehold designs but also design optimized sequences tailored to specific gene targets. This innovation accelerates the design process and increases the likelihood of success in real-world applications.
SWORD can currently be run on your local machine by following the steps in this repository: https://gitlab.igem.org/2024/software-tools/lambert-ga
Although SWORD currently operates without a user interface, our future plans include developing an accessible platform that researchers can use to input gene sequences and receive both predictions and newly generated toehold designs without requiring advanced technical knowledge.
By integrating machine learning into the design of toehold switches, SWORD represents a significant leap forward in RNA biosensor development, offering a more efficient, scalable, and accurate solution compared to traditional methods.
Through the implementation of these models, we aimed to address the inefficiencies in current toehold design methodologies, particularly overcoming the limitations of traditional tools such as NUPACK, which had an R² of 0.15 in predicting sequence functionality. The results presented here demonstrate the performance of both models, highlighting the success of SWORD in achieving its design goals.
Our predictive model demonstrated substantial improvements over existing methodologies, achieving an R² score of 0.75 for the “on” state and 0.49 for the “off” state. These scores exceed both the R² values previously reported and the performance benchmarks established by NUPACK as documented in Angenent-Mari et al.’s study. STORM (Sequence-based Toehold Regulator Optimization and Modeling) is a CNN model made by Angenent-Mari et al. that uses multiple layers and filters, as well as large datasets of experimental RNA sequences and their efficiency, to predict the performance of toeholds. SWORD improves on STORM by incorporating a Generative Adversarial Network (GAN), which allows for more dynamic and refined toehold switch design. While STORM relies on convolutional neural networks (CNNs) to predict functionality, SWORD’s GAN structure, which uses a generator and a discriminator, iteratively improves designs through continuous feedback. Additionally, SWORD optimizes hyperparameters for more accurate and efficient results, surpassing STORM’s CNN-based approach. This combination of GANs and optimization makes SWORD a more powerful tool for generating and refining toehold switches (see Fig. 2).
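For readers less familiar with the metric, the sketch below computes R² using one common definition, the coefficient of determination. The function name `r_squared` and the toy numbers are illustrative assumptions; the text does not say which R² definition or evaluation code was used for the reported scores.

```python
from typing import Sequence


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(measured) / len(measured)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measurements are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    measured_on = [1.0, 2.0, 3.0]          # toy values, not real data
    print(r_squared(measured_on, [1.0, 2.0, 3.0]))  # perfect predictions
    print(r_squared(measured_on, [2.0, 2.0, 2.0]))  # always predicting the mean
```

Under this definition, a model that always predicts the mean scores 0 and a perfect model scores 1, which is the scale on which values such as 0.15 and 0.75 can be compared.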
The model utilizes two Conv1D layers with 48 and 16 filters to capture key sequence patterns, followed by batch normalization. After flattening, a dense layer with 256 units and L2 regularization processes the features. Dropout (0.4) was added to prevent overfitting.
The model outputs two predictions—“on” and “off” values—using linear activation functions for regression. We employed the AdamW optimizer with a fine-tuned learning rate of 0.000527, and mean absolute error (MAE) as the loss function.
This architecture, combined with hyperparameter tuning, resulted in a more accurate and robust model for predicting toehold switch efficiency. However, the model can be improved specifically in tracking non-linear relationships between toeholds.
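To give a feel for what a Conv1D filter does with a nucleotide sequence, the following pure-Python sketch one-hot encodes a sequence and slides a single hand-written filter along it. It is not the SWORD model: the `ACGT` channel order, the filter width of 3, and the helper names are assumptions chosen for illustration, and a trained network learns its filter weights rather than having them written by hand.

```python
from typing import List

NUCLEOTIDES = "ACGT"  # assumed channel order for the one-hot encoding


def one_hot(seq: str) -> List[List[float]]:
    """Encode each base as a 4-element row; unknown bases become all zeros."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in seq.upper()]


def conv1d_valid(encoded: List[List[float]],
                 kernel: List[List[float]]) -> List[float]:
    """Slide one filter (width x 4) along the sequence without padding.

    Applies a ReLU to each window sum, as a convolutional layer typically does.
    """
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(4):
                total += encoded[start + offset][channel] * kernel[offset][channel]
        activations.append(max(0.0, total))
    return activations


if __name__ == "__main__":
    # A filter whose weights equal the one-hot pattern of "ATG"
    # responds most strongly where that motif occurs.
    atg_filter = one_hot("ATG")
    print(conv1d_valid(one_hot("CCATGCC"), atg_filter))
```

The activation peaks at the window that contains the motif, which is the sense in which stacked filters can pick out sequence patterns before the dense layers combine them.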
We used our SWORD predictive model to evaluate the efficacy of six custom-designed toehold sequences from Lambert iGEM targeting the inhA gene, a crucial factor in tuberculosis management. The toehold structures generated by our software differ from those designed using NUPACK, as our software integrates key components like the ribosome binding site, linker region, and switch directly within the 59-nucleotide switch sequence (see Fig. 3). In contrast, NUPACK separates these elements from the switch sequence, leading to distinct structural designs.
We applied our predictive model to these specific segments to gauge their performance. The results, summarized in the table, provide the predicted ‘on’ and ‘off’ values for each toehold, highlighting the operational potential of these synthetic sequences.
Particularly, Toeholds 4 and 5 demonstrated notable efficacy in predictive assessments, suggesting a higher likelihood of successful function in practical applications. Experimental validations in the lab supported these findings, with Toehold 4 emerging as the most efficient in actual tests. This synergy between predictive modeling and empirical results demonstrates the utility of SWORD in improving the accuracy of toehold design for critical health applications and presents a major step forward for Toehold research.
Our generative model also showed improvement over previous technologies. It achieved an R² of 0.44 for the “on” state and 0.20 for the “off” state, improving upon previous studies that lacked effective toehold generation capabilities. This model’s loss, measured using Mean Squared Error (MSE), was 0.005, indicating its ability to generate sequences with high accuracy.
To validate our model, we used NUPACK to assess the structural accuracy of 20 toehold sequences generated by our software. Each sequence successfully folded into its designated secondary structure, supporting the model’s reliability and precision. The use of NUPACK, widely recognized for its robust algorithms, justified this approach by providing a trusted benchmark in synthetic biology. This validation demonstrates our model’s capacity to produce reliable sequences. The example below is one of the generated sequences with high predicted on and off values, validated in NUPACK. Its structure suggests that the toehold would be functional and would initiate the desired activity. Figure 4 below shows the correct structure of the toehold, indicating that this toehold would bind efficiently to the target gene.
Input Trigger Sequence: AAAAAAATGATAAAATAGATTAGTTTTATT
Generated Toehold Sequence: AAAAAAATGATAAAATAGATTAGTTTTATTAACAGAGGAGAAATAAAATGAATCTATTT
Predicted On Value: 0.6586666703224182
Predicted Off Value: 0.7447994947433472
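The example above can also be given a quick sanity check in code before running a full folding tool. The sketch below tests a generated 59-nucleotide sequence for a few structural features of a toehold switch hairpin. The index positions (a 30-nt leading region, an 11-nt loop containing the ribosome binding site, a start codon at positions 47 to 49, and two complementary stem segments) are assumptions read off this single example, not a layout documented by SWORD, and the check is no substitute for NUPACK.

```python
from typing import Dict

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA-alphabet sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def check_switch_layout(trigger: str, switch: str) -> Dict[str, bool]:
    """Run simple layout checks; all slice positions are assumed, see lead-in."""
    return {
        "length_59": len(switch) == 59,
        "starts_with_input": switch.startswith(trigger),
        "rbs_in_loop": "AGAGGAGA" in switch[30:41],
        "start_codon": switch[47:50] == "ATG",
        # Upper stem: 6 nt before the loop pair with 6 nt after it.
        "upper_stem_pairs": reverse_complement(switch[24:30]) == switch[41:47],
        # Lower stem: 9 nt pair with the final 9 nt of the switch.
        "lower_stem_pairs": reverse_complement(switch[12:21]) == switch[50:59],
    }


if __name__ == "__main__":
    trigger = "AAAAAAATGATAAAATAGATTAGTTTTATT"
    switch = "AAAAAAATGATAAAATAGATTAGTTTTATTAACAGAGGAGAAATAAAATGAATCTATTT"
    for name, passed in check_switch_layout(trigger, switch).items():
        print(f"{name}: {passed}")
```

For the sequence shown above, every one of these checks passes, which is consistent with the hairpin layout described for the 59-nucleotide switch.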
Our generator model is part of a Generative Adversarial Network (GAN), where the discriminator is the predictive model, evaluating the generated sequences for “on” and “off” efficiency. The generator uses two Conv1D layers with 64 and 128 filters to extract features from the input trigger sequence, followed by batch normalization and dropout (0.3) to ensure stability and prevent overfitting. After flattening, a dense layer generates a 59-nucleotide sequence in one-hot encoding.
We used the RMSprop optimizer (learning rate 0.001) and Mean Squared Error (MSE) loss to improve sequence generation, achieving an R² of 0.44 for “on” and 0.20 for “off.” This R² greatly improved upon Angenent-Mari et al.’s results reported in their figures.
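Because the generator emits a one-hot style output rather than letters, a decoding step is needed to recover a nucleotide string. A minimal sketch of that step is shown here, assuming a flat output of 59 × 4 values, an `ACGT` channel order, and a simple argmax per position; these details and the helper names are hypothetical and may differ from the code in the SWORD repository.

```python
from typing import List, Sequence

NUCLEOTIDES = "ACGT"   # assumed channel order
SWITCH_LENGTH = 59     # switch length stated in the text


def reshape_flat(flat: Sequence[float],
                 length: int = SWITCH_LENGTH) -> List[List[float]]:
    """Split a flat dense-layer output into one 4-value row per position."""
    if len(flat) != length * 4:
        raise ValueError(f"expected {length * 4} values, got {len(flat)}")
    return [list(flat[i * 4:(i + 1) * 4]) for i in range(length)]


def decode(rows: Sequence[Sequence[float]]) -> str:
    """Pick the highest-scoring base at each position (argmax decoding)."""
    return "".join(
        NUCLEOTIDES[max(range(4), key=lambda channel: row[channel])]
        for row in rows
    )


if __name__ == "__main__":
    # Two positions with soft scores: the first favors A, the second favors G.
    print(decode([[0.7, 0.1, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1]]))
```

Argmax decoding is the simplest option; sampling from the per-position scores would be an alternative if more diverse candidate sequences were wanted.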
SWORD represents a significant advancement in the design of RNA biosensors. Integrating machine learning into the process offers a more efficient, scalable, and accurate solution compared to traditional methods. The predictive model enhances the accuracy of identifying functional toehold sequences, while the generative model accelerates the creation of new, viable switches. The results highlight the potential of SWORD to streamline toehold design, opening new possibilities for diagnostics, therapeutics, and biosensor applications. Future developments will focus on making SWORD more accessible, thereby empowering researchers with an effective tool for RNA biosensor design.