


ogRNA design assistant




This project builds on a deaminase-based RNA recognition and editing system and aims to develop tools for in situ RNA sensing and regulation. Artificially designed RNA sequences that are complementary to endogenous cellular RNAs form hybrid double-stranded RNA (dsRNA) regions, which ADAR enzymes recognize and edit. A key challenge is identifying regions within the target RNA that form stable dsRNA with the designed RNA and are efficiently edited by ADAR. To address this challenge, we plan to develop and apply RNA sensor design assistance models that guide end users in designing efficient dsRNA structures and in formulating RNA sensing and regulation strategies. By combining experimental validation with computational simulation, we aim to improve the specificity and efficiency of RNA editing and to provide a theoretical foundation for in situ RNA sensing and regulation tools.

To achieve this goal, we need to employ RNA structure and function prediction techniques. In the field of molecular biology, predicting RNA structure and function has long been a hotspot and challenge for researchers. The diversity and dynamic nature of RNA molecules pose significant challenges for functional prediction. Although traditional experimental methods, such as nuclease protection assays and X-ray crystallography, can provide precise structural information, these methods are time-consuming, costly, and generally limited to known RNA structures, often proving inadequate for newly discovered or designed RNAs.

With the advancement of computational biology and machine learning technologies, the development of RNA structure and function prediction models has become increasingly important. These models can utilize known RNA sequence and structural data to algorithmically predict the three-dimensional structure and function of previously uncharacterized RNA. For instance, deep learning-based models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated strong capabilities in RNA structure prediction, capable of identifying conserved domains within RNA and predicting interactions with other molecules, such as proteins. This provides robust support for establishing predictive models tailored to the design of double-stranded RNA.

In summary, RNA structure and function prediction provides critical support for our research, aiding in the identification of appropriate regions within the target RNA, thereby enhancing ADAR enzyme-mediated RNA editing efficiency. By utilizing these tools, we can better construct in situ RNA sensing and regulation models suitable for systems such as *Saccharomyces cerevisiae*. Through a combination of computational simulations and experimental validation, these models will provide a solid theoretical basis for optimizing RNA editing and regulation strategies, opening new possibilities for applications in synthetic biology and biotechnology.





Technical Analysis


Existing RNA Prediction Tools


In the realm of RNA secondary structure prediction, various computational methods have been proposed, primarily including energy-based methods, co-evolutionary methods, and deep learning-based methods. Energy-based prediction methods rely on thermodynamic principles to predict the most stable secondary structures in RNA molecules by minimizing free energy. Tools such as Mfold, RNAstructure, and MC-Fold utilize free energy minimization algorithms to calculate the most probable secondary structures. The advantage of these methods lies in their solid physical foundations, enabling them to provide high stability in structure predictions.

Co-evolutionary prediction methods leverage covariation between homologous RNA sequences to predict secondary structures. They assume that structurally important base pairs are preserved during evolution, so that a mutation at one position tends to be compensated by a mutation at its pairing partner, and they infer secondary structure from such co-varying positions across homologous sequences. Tools such as Dynalign II, R-scape, and CaCoFold use this information and can improve prediction accuracy when base pairing is conserved across the available homologs.

Deep learning-based prediction methods have made significant strides in recent years. These methods train models using a large amount of known RNA sequence and structural data to efficiently predict unknown RNA secondary structures. For example, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can automatically learn sequence features of RNA and predict their secondary structures without explicit encoding rules. These methods are especially suited for handling large-scale RNA data and demonstrate stronger generalization capabilities compared to traditional approaches.

For RNA tertiary structure prediction, many computational tools have been developed, such as iFold, SimRNA, and FARNA. These tools attempt to predict three-dimensional structures by simulating the folding processes of RNA molecules. However, despite their excellent performance in predicting the structures of small RNA molecules, accurately predicting the complex topologies of larger RNA molecules remains a significant challenge. In particular, when dealing with long-chain RNA or molecules with complex secondary structures, the computational burden of existing algorithms increases substantially, leading to decreased prediction accuracy.

Deep learning techniques have also achieved important breakthroughs in RNA tertiary structure prediction. Inspired by the success of protein structure prediction tools such as AlphaFold2 and RoseTTAFold, new deep learning tools such as DeepFoldRNA, RoseTTAFoldNA, and RhoFold have been developed. These methods improve RNA tertiary structure prediction by integrating RNA sequences, multiple sequence alignments (MSA), and known structural information. Compared to traditional methods, deep learning tools can better handle sequence variation and automatically identify important structural domains within sequences, thereby enhancing prediction accuracy.

Moreover, integrating experimentally determined RNA structural data into computational models can further enhance prediction accuracy. For example, the RNAstructure software can convert experimental probe data (such as SHAPE reactivity scores) into "pseudo-energy terms," incorporating them into energy or statistical models to improve prediction performance. This integration of experimental data with computational models offers an effective pathway to enhance prediction quality, especially in the face of complex experimental conditions or a lack of high-quality sequence data.

In summary, as RNA structure detection technologies advance and computational methods develop rapidly, our understanding of RNA structure and function will deepen. This progress will not only promote basic life sciences research but also provide new avenues for RNA-based drug design. For instance, utilizing these advanced prediction tools can facilitate the design of RNA molecules with specific functions or the development of small-molecule drugs targeting RNA structures, thereby creating more opportunities for biomedical research and applications.








Existing RNA Sensor Design Support Models


Previous research offers guidance on identifying suitable regions within the target RNA. Based on how ADAR enzymes act, a UAG stop codon is introduced on the artificial RNA opposite a CCA triplet on the endogenous RNA (ACC when read 3'→5'), so that the A of UAG forms a single A-C mismatch that recruits ADAR and triggers its editing activity. In the code of previously proposed models, this principle is translated into a series of automated steps:

1. Identification of Target RNA Sequences: By searching for specific patterns (such as the CCA sequence) within the target RNA sequence, potential editing sites are determined.

2. Generation of Guide Sequences: Guide sequences capable of forming complementary double strands with the target RNA are designed around these sites. These sequences include MS2 binding sequences and other necessary RNA structural domains.

3. Optimization of Editing Sites: Mutation strategies are employed to optimize guide sequences, avoiding non-specific editing and enhancing editing efficiency.

4. Validation of Guide Sequences: Guide sequences are checked to ensure they meet design requirements, such as avoiding the formation of undesired homologous polymer regions, thereby ensuring specificity and functionality.

5. Output and Application: The designed guide sequences are outputted for experimental validation and application in the laboratory, achieving precise editing of the target RNA.
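The first two steps above can be sketched in Python as follows. This is a minimal illustration rather than the published implementation: the function names `find_cca_sites` and `design_guide`, the flank length, and the omission of MS2 hairpins and of the later optimization and validation steps are all assumptions made for brevity.

```python
from __future__ import annotations

# Illustrative sketch only: names and the default flank length are assumptions.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (written 5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_cca_sites(target: str, flank: int = 30) -> list[int]:
    """Return 0-based starts of CCA triplets that have `flank` nt on both sides."""
    target = target.upper().replace("T", "U")
    sites = []
    pos = target.find("CCA", flank)  # start late enough to leave an upstream flank
    while pos != -1 and pos + 3 + flank <= len(target):
        sites.append(pos)
        pos = target.find("CCA", pos + 1)
    return sites


def design_guide(target: str, site: int, flank: int = 30) -> str:
    """Build a guide complementary to the window around a CCA site.

    The perfect complement of CCA is UGG; it is replaced by UAG so that the
    A sits opposite the central C of the target and forms an A-C mismatch.
    """
    target = target.upper().replace("T", "U")
    window = target[site - flank: site + 3 + flank]
    guide = reverse_complement(window)
    # After reverse-complementing, the UGG triplet sits at offset `flank`.
    assert guide[flank:flank + 3] == "UGG"
    return guide[:flank] + "UAG" + guide[flank + 3:]


if __name__ == "__main__":
    example_target = "GGAUCGAUCCAGGCUAGCUA"  # made-up sequence for illustration
    for site in find_cca_sites(example_target, flank=5):
        print(site, design_guide(example_target, site, flank=5))
```

A real design pipeline would additionally append the MS2 and other structural elements and screen the resulting guides, as described in steps 2 to 4.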



The proposed models automate the design of RNA guide sequences, accurately pinpointing target RNA sequences and introducing specific base mismatches to activate the editing function of ADAR enzymes. This strategy not only enhances the precision and efficiency of RNA editing but also significantly accelerates the research process through automated design workflows. Additionally, the flexibility and customizability of the model allow it to adapt to various experimental conditions and target RNA sequences, while the direct potential for experimental validation provides a solid foundation for the application of RNA editing technologies.

However, the model also has limitations. It primarily focuses on the complementarity at the sequence level and may not sufficiently consider the impacts of RNA secondary and tertiary structures on editing efficiency. Moreover, the model has yet to incorporate the complexities of RNA-protein interactions, which could significantly affect the delivery and editing effectiveness of guide sequences. Interactions between RNA chains have also been overlooked, which may influence the formation and stability of double-stranded RNA. Additionally, the model's design does not adequately simulate the effects of intracellular environments on the RNA editing process, such as the stability and accessibility of RNA molecules and the activity of editing enzymes. Lastly, the performance of the model may be limited by the diversity and quality of the datasets used for training and validation.



Proposed Solutions


To overcome the limitations of existing RNA sensor design support models and enhance their efficiency and specificity in practical applications, we plan to undertake a series of innovative improvements to the current guide RNA models. We will systematically optimize various aspects of the process, ensuring that each step contributes to overall editing efficiency and accuracy.

First, we will integrate advanced RNA secondary structure prediction tools to optimize the design of guide sequences. The secondary structure of RNA is crucial for its function and interactions, particularly during the binding of guide RNA to target RNA. If the guide RNA cannot form a stable double-stranded structure with the target RNA, editing efficiency will be significantly compromised. Therefore, by utilizing RNA secondary structure prediction tools, we can design guide sequences with higher affinity and stability, as well as assess easily editable regions by predicting the secondary structure of the target RNA. Furthermore, non-specific binding is a common issue in RNA editing, as there is a plethora of non-target RNA in cells. If the guide RNA non-specifically binds to these non-target RNAs, unexpected editing events may occur, leading to off-target effects. Through secondary structure prediction, we can effectively reduce such non-specific binding, thus enhancing the specificity of editing and ensuring that guide RNA preferentially binds to target RNA.

Secondly, we will delve into RNA-protein interactions, particularly the binding between ADAR enzymes and target double-stranded RNA. By integrating experimental data and computational methods, we will predict and optimize the delivery and editing effects of guide sequences. We aim to enhance the interaction between ADAR enzymes and guide sequences, as this will directly improve RNA editing efficiency and specificity. Molecular dynamics simulations will be utilized to predict the dynamic interactions between ADAR enzymes and the designed RNA guide sequences. These simulations will help us understand the dynamic changes during the binding process and predict the binding stability of different guide sequence variants. Through these simulations, we can optimize the design of guide sequences at the molecular level to achieve optimal binding with ADAR enzymes.

Thirdly, we recognize that in the intracellular environment, the hybridized double-stranded RNA is often unstable, and its degradation rate can significantly affect the effectiveness of RNA editing. The degradation of double-stranded RNA is mediated by endogenous enzymes (such as nucleases) within cells. Therefore, if the double-stranded structure formed by guide RNA and target RNA cannot remain stable for a sufficiently long time, the efficiency of RNA editing will be greatly reduced.

Finally, we will adopt a multi-factorial balancing strategy to comprehensively consider the interrelationships among four key factors through experimental data fitting and modeling, aiming to identify the optimal guide RNA. These four factors include: (1) the interaction between guide RNA and ADAR enzymes; (2) the binding efficiency of guide RNA to target RNA; (3) non-specific binding of guide RNA to non-target RNA; and (4) the degradation rate of hybrid double-stranded RNA. We will assign different weights to each factor and use experimental data to fit and optimize these weights, ensuring that each step achieves optimal balance. Through this approach, we aim to identify a theoretically optimal guide RNA sequence that can efficiently bind ADAR enzymes and stably form double strands with target RNA while minimizing non-specific binding and rapid degradation of double-stranded RNA. The basic idea is illustrated in the following figure.






Figure 1: Integration Diagram of RNA Sensor Design Assistance Model


We organize the above ideas into the equation for ADAR-mediated RNA editing in cells, as shown in Figure 2.


Figure 2: Equation for ADAR-Mediated RNA Editing in Cells


In this equation, $mRNA$ represents the target RNA within the cell, $Sensor RNA$ denotes the ogRNA, $Endo-RNA$ refers to endogenous non-target RNA, $dsRNA1$ is the hybrid duplex formed between the target RNA and the ogRNA, $dsRNA3$ is the hybrid duplex after editing by ADAR, and $Sensor RNA'$ is the edited ogRNA released when that duplex dissociates. Editing converts the UAG stop codon in the ogRNA to UIG, which the ribosome typically reads as the non-stop codon UGG, so translation of the downstream transcript proceeds and produces the fluorescent signal $signal$. We take $signal$ to represent the overall activity of the RNA sensor.


In the equation, $\Delta G_1$ denotes the Gibbs free energy change of the interaction between $mRNA$ and $Sensor RNA$, $\Delta G_2$ that of the interaction between $Endo-RNA$ and $Sensor RNA$, $\Delta G_3$ that of the interaction between ADAR and the hybrid RNA duplex, and $\Delta G_4$ that of the interaction between ADAR and the edited hybrid RNA duplex.


Through the ADAR-mediated RNA editing equation, we can correlate intracellular substance concentrations with Gibbs free energy constants, as expressed in the following equations:


$$ \begin{align*}
\frac{d(dsRNA1)}{dt} & = k_1 \cdot mRNA \cdot sensor\_RNA - k_{1}^{-1} \cdot dsRNA1 - \frac{V_{\text{max}} \cdot dsRNA1}{K_m + dsRNA1} \cdot K_3, \\
\frac{d(dsRNA2)}{dt} & = k_2 \cdot endo\_RNA \cdot sensor\_RNA - k_{2}^{-1} \cdot dsRNA2, \\
\frac{d(dsRNA3)}{dt} & = \frac{V_{\text{max}} \cdot dsRNA1}{K_m + dsRNA1} \cdot K_3 - k_4 \cdot dsRNA3 + k_{4}^{-1} \cdot mRNA \cdot sensor\_RNA', \\
\frac{d(mRNA)}{dt} & = -k_1 \cdot mRNA \cdot sensor\_RNA + k_{1}^{-1} \cdot dsRNA1 + k_4 \cdot dsRNA3 - k_{4}^{-1} \cdot mRNA \cdot sensor\_RNA', \\
\frac{d(sensor\_RNA)}{dt} & = -k_1 \cdot mRNA \cdot sensor\_RNA + k_{1}^{-1} \cdot dsRNA1 - k_2 \cdot endo\_RNA \cdot sensor\_RNA + k_{2}^{-1} \cdot dsRNA2, \\
\frac{d(sensor\_RNA')}{dt} & = k_4 \cdot dsRNA3 - k_{4}^{-1} \cdot mRNA \cdot sensor\_RNA' - k_5 \cdot sensor\_RNA', \\
\frac{d(endo\_RNA)}{dt} & = -k_2 \cdot endo\_RNA \cdot sensor\_RNA + k_{2}^{-1} \cdot dsRNA2, \\
\frac{d(n)}{dt} & = k_5 \cdot sensor\_RNA' - k_6 \cdot n, \\
\frac{d(f)}{dt} & = k_6 \cdot n, \\
K_1 & = e^{-\frac{\Delta G_1}{RT}}, \quad K_2 = e^{-\frac{\Delta G_2}{RT}}, \quad K_3 = e^{-\frac{\Delta G_3}{RT}}, \quad K_4 = e^{-\frac{\Delta G_4}{RT}}, \\
k_{1}^{-1} & = \frac{k_1}{K_1}, \quad k_{2}^{-1} = \frac{k_2}{K_2}, \quad k_{4}^{-1} = \frac{k_4}{K_4}.
\end{align*} $$


In these equations, $V_{\text{max}}$, $K_m$, $k_1$, $k_2$, $k_4$, $k_5$, and $k_6$ are the parameters to be fitted. $V_{\text{max}}$ and $K_m$ come from the Michaelis-Menten equation: $V_{\text{max}}$ is the maximum reaction rate reached at saturating substrate concentration, and $K_m$ is the substrate concentration at which the reaction rate reaches half of its maximum value, reflecting the enzyme's affinity for the substrate. $k_1$, $k_2$, $k_4$, $k_5$, and $k_6$ are the rate constants of the corresponding reactions.
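As a worked illustration of the thermodynamic relations $K_i = e^{-\Delta G_i / RT}$ and $k_i^{-1} = k_i / K_i$, the following Python sketch converts a predicted free energy into an equilibrium constant and a reverse rate constant. The function names, the temperature of 310.15 K, the use of kcal/mol, and the numerical inputs are placeholders chosen for illustration, not values from this project.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K); assumes dG is given in kcal/mol


def equilibrium_constant(delta_g: float, temperature: float = 310.15) -> float:
    """K = exp(-dG / (R*T)), with dG in kcal/mol and T in kelvin."""
    return math.exp(-delta_g / (R_KCAL * temperature))


def reverse_rate(k_forward: float, delta_g: float, temperature: float = 310.15) -> float:
    """Reverse rate constant k^{-1} = k / K, as defined in the model."""
    return k_forward / equilibrium_constant(delta_g, temperature)


if __name__ == "__main__":
    dg = -10.0  # hypothetical hybridization free energy in kcal/mol
    print("K      =", equilibrium_constant(dg))
    print("k^{-1} =", reverse_rate(1.0, dg))
```

A more negative $\Delta G$ gives a larger $K$ and therefore a smaller reverse rate constant, so a more stable duplex dissociates more slowly in the model.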


A Python program fits this system of differential equations to the experimental data. The objective function is the sum of squared errors between the predicted activity of the RNA sensor and the measured values (denoted f_values). Parameters are optimized with bound-constrained, gradient-based minimizers: the TNC (Truncated Newton Conjugate-Gradient) algorithm and the L-BFGS-B algorithm. L-BFGS-B is a widely used limited-memory quasi-Newton method suited to large optimization problems with bound constraints. Rather than storing the complete Hessian matrix, it approximates second-order information from recent gradient history, which keeps memory use low and speeds convergence when several parameters of a stiff differential equation model are fitted at once, as in this case.

The differential equations are solved with solve_ivp using the BDF (Backward Differentiation Formula) method, which is appropriate for stiff problems. For each candidate parameter set, the program computes the model's predicted values and compares them with the experimental data. The fit is assessed with the Mean Squared Error (MSE), which measures the average squared discrepancy between predictions and data, and the coefficient of determination (R²), which measures the goodness of fit.
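The fitting procedure can be sketched as below. This is a minimal illustration under stated assumptions, not the project's actual program: the right-hand side is written in mass-balanced form (each reaction flux enters its reactants and products with opposite signs), and the initial concentrations, the equilibrium constants, the "true" parameters, and the synthetic measurements are all made up so that the script runs on its own.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize


def rhs(t, y, theta, K):
    """Mass-balanced right-hand side of the editing model (assumed form)."""
    ds1, ds2, ds3, mrna, sensor, sensor_p, endo, n, f = y
    vmax, km, k1, k2, k4, k5, k6 = theta
    K1, K2, K3, K4 = K
    v1 = k1 * mrna * sensor - (k1 / K1) * ds1      # sensor + target <-> dsRNA1
    v2 = k2 * endo * sensor - (k2 / K2) * ds2      # sensor + off-target <-> dsRNA2
    v_edit = vmax * ds1 / (km + ds1) * K3          # ADAR editing, Michaelis-Menten
    v4 = k4 * ds3 - (k4 / K4) * mrna * sensor_p    # dsRNA3 <-> target + edited sensor
    return [v1 - v_edit, v2, v_edit - v4, -v1 + v4, -v1 - v2,
            v4 - k5 * sensor_p, -v2, k5 * sensor_p - k6 * n, k6 * n]


# Hypothetical initial concentrations, ordered as in `rhs`.
Y0 = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 5.0, 0.0, 0.0]


def simulate_signal(theta, K, t_end=5.0):
    """Integrate with the stiff BDF solver and return the final signal f."""
    sol = solve_ivp(rhs, (0.0, t_end), Y0, method="BDF", args=(theta, K),
                    rtol=1e-6, atol=1e-9)
    return sol.y[-1, -1]


def objective(theta, K_list, f_values):
    """Sum of squared errors between predicted and measured sensor activity."""
    pred = np.array([simulate_signal(theta, K) for K in K_list])
    return float(np.sum((pred - f_values) ** 2))


def mse_and_r2(pred, obs):
    """Mean squared error and coefficient of determination."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    mse = np.mean((pred - obs) ** 2)
    r2 = 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    return mse, r2


# Synthetic stand-ins: one (K1, K2, K3, K4) tuple per sensor, and measurements
# generated from made-up "true" parameters instead of wet-lab data.
K_list = [(50.0, 0.5, 1.0, 2.0), (10.0, 0.5, 0.8, 2.0), (200.0, 0.5, 1.2, 2.0)]
theta_true = np.array([1.0, 0.5, 2.0, 1.0, 1.0, 1.0, 0.5])
f_values = np.array([simulate_signal(theta_true, K) for K in K_list])

theta0 = np.full(7, 0.8)       # initial guess for (Vmax, Km, k1, k2, k4, k5, k6)
bounds = [(1e-3, 10.0)] * 7    # keep every parameter positive
fit = minimize(objective, theta0, args=(K_list, f_values), method="L-BFGS-B",
               bounds=bounds, options={"maxiter": 10, "eps": 1e-4})

pred = [simulate_signal(fit.x, K) for K in K_list]
print("fitted parameters:", fit.x)
print("MSE, R^2:", mse_and_r2(pred, f_values))
```

Passing `method="TNC"` to `minimize` switches to the truncated Newton optimizer with the same bounds. The finite-difference step `eps` is set larger than the default here because the objective inherits numerical noise from the ODE solver.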


Data Source


Endogenous Target mRNA Sequences


The mRNA sequences used in this study (IL6, EGFP, and NPY) were obtained from the NCBI (National Center for Biotechnology Information) database.




Wet Lab Data


The data used in this work were taken from the 2023 Nature Biotechnology publication by the Jiang Kaiyi team and its associated datasets, as shown in the figure below:


Figure 3. RADARS data targeting IL6, EGFP, and NPY


We thank the Jiang Kaiyi team for providing the RADARS fold-activation data targeting IL6, EGFP, and NPY (using exogenous ADAR1p150). For each transcript, 12 (NPY) to 14 (IL6 and EGFP) oligonucleotide guide RNAs (ogRNAs) were designed to target different CCA sites. Each point in the figure represents the mean of three technical replicates for a single sensor, and the horizontal solid line marks the mean across all ogRNAs for that transcript. These 40 data points serve as the f_values mentioned earlier.


RNA Secondary and Tertiary Structure Prediction Data


There are various methods for predicting RNA secondary structures, each with its own advantages and limitations in different application contexts. However, our research presents a unique requirement: predicting the secondary structure of hybrid RNA. The secondary structure prediction of hybrid RNA is relatively complex, as it needs to consider the interactions between two RNA strands while accurately reflecting their internal structures.


To meet this requirement, we selected ViennaRNA-2.5.0, an advanced RNA secondary structure prediction tool. ViennaRNA is capable of efficiently predicting the secondary structure of a single RNA strand and can also handle the complex structures formed by hybrid RNA composed of two or more RNA strands. This capability is crucial for us, as the binding efficiency of ogRNA to the target RNA and the stability of its secondary structure directly impact the success of RNA editing. Through ViennaRNA, we can accurately predict the secondary structure of hybrid RNA, identify potential hotspot regions, and optimize guide sequence design to enhance editing specificity and efficiency. Additionally, ViennaRNA provides efficient energy computation capabilities, offering detailed information on the changes in free energy during the formation of hybrid RNA, further aiding our understanding of its stability and feasibility.


In terms of predicting RNA tertiary structures, we employed both 3dRNA and trRosettaRNA software to improve prediction accuracy and diversity. 3dRNA uses a fragment assembly-based modeling approach, starting from the RNA secondary structure (predicted by ViennaRNA-2.5.0) to construct tertiary structures of different fragments, which are then optimized through energy minimization. This method's advantage lies in its ability to quickly generate multiple candidate structures and effectively construct large-scale RNA molecules. By inputting the secondary structure information of the RNA, we generate multiple potential tertiary structures and output PDB files for further analysis.


trRosettaRNA, on the other hand, relies on deep learning techniques to predict distances and torsional angles between base pairs, thereby providing a more precise prediction of RNA's global conformation. By combining these two methods, we can compare the PDB files generated by different software to assess structural accuracy and select the model that best matches experimental results or exhibits the lowest energy. Furthermore, additional energy optimization or experimental validation can be employed to confirm the final RNA tertiary structure.


Gibbs Free Energy Data


- $\Delta G_1$ and $\Delta G_4$

As previously mentioned, $\Delta G_1$ and $\Delta G_4$ represent the Gibbs free energy changes for the interaction between mRNA and sensor RNA and for the interaction between ADAR and the ADAR-edited hybrid RNA duplex, respectively. To choose an RNA interaction prediction tool, we compared several candidates (results shown below) and selected IntaRNA for this study.


Figure 4. Comparison of Various RNA Interaction Software


IntaRNA is an RNA interaction prediction tool based on pairing energy calculations, designed to identify binding sites between RNA molecules. Its core principle involves predicting the thermodynamic stability of the binding by calculating the energy scores of RNA strands, including structural features such as base pairings, internal loops, and bulges. IntaRNA employs a dynamic programming algorithm to efficiently search all possible pairing combinations to identify the optimal binding modes. This software can handle interactions between single-stranded RNA as well as assess interactions between double-stranded RNA, providing robust support for the study of functional relationships between RNAs.


- $\Delta G_2$


$\Delta G_2$ is the Gibbs free energy change for the interaction between Endo-RNA and Sensor RNA. To assess this interaction, we chose CopraRNA as our tool. CopraRNA is designed to predict interactions between a single RNA molecule and all endogenous RNAs within a cell. Its methodology integrates RNA sequence and structural information to identify potential binding sites and predict interactions between different RNA molecules. Specifically, CopraRNA first analyzes the sequence features of the target RNA and compares them with all endogenous RNAs in the cell. The tool then uses sequence conservation and structural similarity to calculate potential binding patterns and evaluate the binding free energy change for each pair of RNAs.


- $\Delta G_3$


After obtaining the RNA tertiary structure, we further investigated the interaction between hybrid RNA and ADAR enzymes using HDOCK software for molecular docking. HDOCK is a powerful molecular docking tool suitable for docking analyses of various molecular systems, including protein-protein, protein-nucleic acid, and nucleic acid-nucleic acid complexes. Its docking mechanism is based on global search methods, quickly identifying binding regions through Fourier transformation and employing specific scoring algorithms to evaluate molecular binding affinities. The multifunctionality and flexibility of HDOCK make it one of the preferred tools for docking studies, especially in RNA-protein interaction research, where its precise predictions can reveal complex molecular interaction networks.


The docking simulations conducted with HDOCK not only provide preliminary binding modes but also allow for further energy optimization and structural adjustments, revealing possible conformational changes of RNA molecules during the binding process. This information is crucial for gaining a deeper understanding of the interaction mechanisms between RNA and ADAR enzymes, laying a solid foundation for subsequent experimental validation and molecular dynamics simulations.


Evaluate the capability of monitoring splice variants


Overview of RNA Splicing Isoforms


RNA splicing is a crucial process in eukaryotic gene expression, during which introns (non-coding regions) are removed from pre-mRNA, and exons (coding regions) are joined to produce mature mRNA. The process can generate multiple mRNA variants, known as RNA splicing isoforms, from a single gene through alternative splicing. This creates protein diversity and allows one gene to encode different proteins. RNA splicing isoforms play a significant role in gene regulation and are implicated in many diseases, including cancer and neurodegenerative disorders.


Our Project's Logic


In our project, we analyze whether the nucleotide sequence "ACC" occurs at splice junctions across the isoforms of a gene. For each gene, we examine all of its transcripts and check a defined base range around every splice junction. If "ACC" appears within this range for any splice junction in a transcript, that transcript is considered successful. Under the strict criterion, a gene is "successful" only if every one of its transcripts is successful. Because we do not necessarily need to cover all transcripts of a gene, we also apply a relaxed criterion under which a gene is successful when at least 50% of its transcripts are successful.
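The decision logic can be sketched in Python as follows. This is an illustrative outline under assumptions: the function names, the default window of 10 nt, and the representation of a transcript as a cDNA string plus a list of exon lengths in transcript order are hypothetical and may differ from the project's code.

```python
from itertools import accumulate


def junction_positions(exon_lengths):
    """Positions in the spliced transcript where one exon ends and the next begins."""
    return list(accumulate(exon_lengths))[:-1]


def transcript_is_successful(cdna, exon_lengths, motif="ACC", window=10):
    """True if `motif` lies within `window` nt on either side of any junction."""
    for junction in junction_positions(exon_lengths):
        start = max(0, junction - window)
        end = min(len(cdna), junction + window)
        if motif in cdna[start:end]:
            return True
    return False


def gene_is_successful(transcripts, min_fraction=1.0, **kwargs):
    """Apply the gene-level rule to a list of (cdna, exon_lengths) pairs.

    min_fraction=1.0 is the strict rule (every transcript must succeed);
    min_fraction=0.5 is the relaxed rule (at least half must succeed).
    """
    if not transcripts:
        return False
    hits = sum(transcript_is_successful(seq, exons, **kwargs)
               for seq, exons in transcripts)
    return hits / len(transcripts) >= min_fraction


if __name__ == "__main__":
    # Two made-up transcripts, each with a single junction after position 6.
    gene = [("GGGGGACCGGGGG", [6, 7]), ("GGGGGGGGGGGGG", [6, 7])]
    print(gene_is_successful(gene, window=3))                    # strict rule
    print(gene_is_successful(gene, min_fraction=0.5, window=3))  # relaxed rule
```

Single-exon transcripts have no junctions and are therefore never counted as successful in this sketch.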


Data Sources


Source: We use data from the Ensembl Genome Browser (Ensembl website). The website provides high-quality genomic data for various species.

GTF File: The GTF (Gene Transfer Format) file contains gene annotations, including exon positions for each transcript. We use the Homo_sapiens.GRCh38.112.gtf.gz file.

FASTA File: The cDNA FASTA file contains the full mRNA sequence for each transcript. We use the Homo_sapiens.GRCh38.cdna.all.fa.gz file.


Both files are compressed and stored in '.gz' format to reduce download sizes. These files are essential for parsing genomic data, with the GTF file providing gene structure annotations and the FASTA file giving nucleotide sequences of the transcripts.
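Because both downloads are gzip-compressed, they can be read without unpacking them first. The following is a minimal sketch of one way to do this with Python's standard 'gzip' module; the helper name 'open_maybe_gzip' is hypothetical and is not part of the project code.

```python
import gzip

def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    if str(path).endswith(".gz"):
        # "rt" decompresses on the fly and yields text lines instead of bytes
        return gzip.open(path, "rt")
    return open(path, "r")
```

A handle returned by this helper can be iterated line by line like a normal file, and Biopython's 'SeqIO.parse()' accepts an open text handle in place of a file name.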


Code interpretation


GTF File Parsing (parse_gtf_file):


Purpose: Extract exon start and end positions for each transcript from the GTF file, creating a map of exon positions.


Original Code:


    import pandas as pd

    def parse_gtf_file(gtf_file):
        """
        Parse a GTF file to extract exon information.

        Args:
            gtf_file (str): Path to the GTF file (uncompressed text).

        Returns:
            DataFrame: A DataFrame containing the gene ID, transcript ID,
            and exon start and end positions.
        """
        gtf_data = []
        with open(gtf_file, 'r') as file:
            for line in file:
                # Skip header/comment lines
                if line.startswith('#'):
                    continue
                fields = line.strip().split('\t')
                # Keep only exon features (third column)
                if fields[2] == 'exon':
                    # Parse the attribute column: key "value"; key "value"; ...
                    attributes = {}
                    for item in fields[8].split(';'):
                        key_value = item.strip().split(' ')
                        if len(key_value) == 2:
                            key, value = key_value
                            attributes[key] = value.strip('"')

                    gene_id = attributes.get('gene_id')
                    transcript_id = attributes.get('transcript_id')
                    exon_start = int(fields[3])
                    exon_end = int(fields[4])
                    gtf_data.append([gene_id, transcript_id, exon_start, exon_end])

        gtf_df = pd.DataFrame(gtf_data, columns=['gene_id', 'transcript_id', 'exon_start', 'exon_end'])
        return gtf_df

Explanation: This function iterates through the GTF file, extracting only exon entries for each gene and transcript, and storing them in a DataFrame. The DataFrame contains columns for gene ID, transcript ID, and exon start/end positions, which are used later to find splice junctions.
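To make the attribute handling concrete, here is a small standalone sketch of how the ninth GTF column can be turned into a dictionary. The helper name 'parse_gtf_attributes' and the IDs in the example string are illustrative assumptions, not taken from the project code or from a real annotation record. Unlike the loop in 'parse_gtf_file', this sketch splits each item only at the first space, so a value that itself contains a space would be kept rather than skipped.

```python
def parse_gtf_attributes(attribute_field):
    """Turn the ninth GTF column into a dict (hypothetical helper)."""
    attributes = {}
    for item in attribute_field.split(';'):
        # Split at the first space only: 'key "value with spaces"'
        key_value = item.strip().split(' ', 1)
        if len(key_value) == 2:
            key, value = key_value
            # Repeated keys (for example several "tag" entries) overwrite each other
            attributes[key] = value.strip('"')
    return attributes

# Attribute string in Ensembl GTF style; the IDs are made up for illustration
example = 'gene_id "ENSG00000000001"; transcript_id "ENST00000000001"; exon_number "1";'
attrs = parse_gtf_attributes(example)
print(attrs["gene_id"], attrs["transcript_id"], attrs["exon_number"])
# ENSG00000000001 ENST00000000001 1
```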


FASTA File Parsing (Transcript Sequence Dictionary):


Purpose: Create a dictionary mapping transcript IDs to their full mRNA sequences.


Original Code:


    from Bio import SeqIO

    # Read the FASTA file and create a dictionary, while removing any potential version numbers.
    transcript_sequences = {}
    for record in SeqIO.parse(fasta_file, "fasta"):
        transcript_id = record.id.split('.')[0]  # Remove possible version numbers
        transcript_sequences[transcript_id] = str(record.seq)

Explanation: This snippet reads the FASTA file using Biopython’s 'SeqIO.parse()', building a dictionary where each key is a transcript ID and the value is the corresponding cDNA sequence. It also removes any version numbers from transcript IDs to ensure consistency with the GTF file.


Finding Splice Junctions (Cumulative Exon Lengths):


Purpose: Identify the positions of exon-exon junctions in the cDNA sequence.



Explanation: This calculates the cumulative lengths of exons to simulate the positions of exon junctions in a concatenated mRNA sequence. The 'cumsum()' method creates an array of cumulative exon lengths, which allows the identification of where one exon ends and the next one begins.
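The cumulative-length idea can be sketched as follows. This is an illustration under stated assumptions rather than the project's own code: it assumes the exons are given as 1-based inclusive (start, end) pairs, as in GTF, already ordered 5' to 3' along the transcript, and it uses 'itertools.accumulate' in place of pandas 'cumsum()' so that it runs with the standard library only. The function name 'junction_positions' and the coordinates are hypothetical.

```python
from itertools import accumulate

def junction_positions(exons):
    """Return exon-exon junction offsets within the spliced transcript.

    exons: list of (start, end) genomic coordinates, 1-based and inclusive,
    ordered 5'->3' along the transcript (assumed).
    """
    # GTF coordinates are inclusive, so the length is end - start + 1
    lengths = [end - start + 1 for start, end in exons]
    cumulative = list(accumulate(lengths))  # same values pandas cumsum() would give
    # The last cumulative value is the transcript end, not a junction
    return cumulative[:-1]

exons = [(100, 199), (300, 349), (500, 579)]  # toy exons of length 100, 50 and 80
print(junction_positions(exons))  # [100, 150]
```

Each returned offset is the index in the concatenated sequence at which the next exon begins, which is the position used as the junction location in the following step.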


Extracting and Checking Splice Junction Sequences:


Purpose: For each splice junction, extract a window of bases surrounding the junction and check for the "ACC" sequence.


Original Code:


    # Check whether "ACC" exists within 35 bases upstream and downstream of each splice site
    transcript_success = False
    for i in range(len(exon_cumulative_lengths) - 1):
        # Clamp the window to the bounds of the transcript sequence
        start = max(0, exon_cumulative_lengths[i] - 35)
        end = min(len(sequence), exon_cumulative_lengths[i] + 35)

        splice_site_seq = sequence[start:end]

        if 'ACC' in splice_site_seq:
            transcript_success = True
            break  # One matching junction is enough

    # If any transcript fails, the gene is considered unsuccessful
    if not transcript_success:
        gene_success[gene_id] = False

Explanation: For each exon junction, a window of 35 bases on either side is extracted, using the cumulative exon length as the junction position. The code then checks for the presence of "ACC" in this window. The 'max()' and 'min()' calls keep the window within the bounds of the sequence. The second criterion, which treats a gene as successful when at least 50% of its transcripts are successful, follows the same idea, so its code is not shown separately.
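The window check can also be written as a small standalone function, which makes it easy to try on a toy sequence. The function name 'has_motif_near_junction' and its parameters 'flank' and 'motif' are hypothetical; the defaults mirror the 35-base window and the "ACC" motif described above.

```python
def has_motif_near_junction(sequence, junctions, flank=35, motif="ACC"):
    """Return True if the motif occurs within `flank` bases of any junction."""
    for junction in junctions:
        # Clamp the window so it never runs past either end of the sequence
        start = max(0, junction - flank)
        end = min(len(sequence), junction + flank)
        if motif in sequence[start:end]:
            return True
    return False

# Toy 200-nt sequence with a single "ACC" at positions 100-102
seq = "G" * 100 + "ACC" + "G" * 97
print(has_motif_near_junction(seq, [90]))  # True: window 55..125 covers the motif
print(has_motif_near_junction(seq, [40]))  # False: window 5..75 contains only G
```

Note that a motif only partly inside the window is not counted, because the substring test requires all three bases to lie within the extracted slice.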


Success Criteria for Genes and Transcripts:


Purpose: For each transcript, ensure that at least one junction contains "ACC". A gene is considered successful only if **all** its transcripts meet this criterion.


Original Code:


    # Iterate Through Each Gene
    for gene_id, group in gtf_df.groupby('gene_id'):
        gene_transcripts = group['transcript_id'].unique()
        gene_success[gene_id] = True  # Initialize to success

        # For Each Transcript Within the Gene
        for transcript_id in gene_transcripts:
            total_transcripts_count += 1  # Increment counter

            if transcript_id not in transcript_sequences:
                gene_success[gene_id] = False

Explanation: The outer loop iterates over each gene, and the inner loop iterates over each transcript within the gene. If any transcript has no junction window containing "ACC", or has no sequence in the FASTA dictionary, the entire gene is marked as unsuccessful.


Final Success Rate Calculation:


Purpose: Calculate the proportion of genes that were successful.


Original Code:


    # Calculate the Proportion of Successful Genes
    success_rate = sum(gene_success.values()) / len(gene_success)

    print(f"Total number of transcripts processed: {total_transcripts_count}")
    return success_rate

Explanation: The success rate is computed by dividing the number of successful genes by the total number of genes processed.
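Both gene-level criteria can be expressed with a single threshold, as in the sketch below. This is an assumed formulation, not the project's code: the function 'gene_success_rate', its 'min_fraction' parameter, and the toy gene names are hypothetical. A threshold of 1.0 corresponds to the strict criterion (all transcripts successful) and 0.5 to the relaxed criterion (at least 50% successful).

```python
def gene_success_rate(transcript_results, min_fraction=1.0):
    """Fraction of genes whose share of successful transcripts reaches min_fraction.

    transcript_results: dict mapping gene ID -> non-empty list of booleans,
    one per transcript (True if the transcript passed the "ACC" check).
    """
    gene_success = {}
    for gene_id, flags in transcript_results.items():
        fraction = sum(flags) / len(flags)  # share of successful transcripts
        gene_success[gene_id] = fraction >= min_fraction
    return sum(gene_success.values()) / len(gene_success)

# Toy results for four genes
results = {
    "geneA": [True, True],
    "geneB": [True, False],
    "geneC": [False, False, True],
    "geneD": [False],
}
print(gene_success_rate(results, min_fraction=1.0))  # 0.25 (only geneA passes)
print(gene_success_rate(results, min_fraction=0.5))  # 0.5 (geneA and geneB pass)
```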
