Model

ogRNA design assistant

ogRNA设计助手

Background

背景

This project is based on an RNA recognition and editing system utilizing deaminases, aiming to develop suitable in situ RNA sensing and regulation tools. By leveraging ADAR enzymes, we can artificially design RNA sequences that complement endogenous cellular RNAs, thereby forming hybrid double-stranded RNA (dsRNA) regions. However, a key challenge is identifying appropriate regions within the target RNA that can form stable dsRNA structures with ADAR enzymes, thereby enhancing RNA editing efficiency. To address this challenge, we plan to develop and apply RNA sensor design assistance models to guide end users in designing efficient dsRNA structures and subsequently formulate RNA sensing and regulation strategies. Through a combination of experimental validation and computational simulations, we aim to improve the specificity and efficiency of RNA editing, providing a solid theoretical foundation for the development of in situ RNA sensing and regulation tools.

To achieve this goal, we need to employ RNA structure and function prediction techniques. In the field of molecular biology, predicting RNA structure and function has long been a hotspot and challenge for researchers. The diversity and dynamic nature of RNA molecules pose significant challenges for functional prediction. Although traditional experimental methods, such as nuclease protection assays and X-ray crystallography, can provide precise structural information, these methods are time-consuming, costly, and generally limited to known RNA structures, often proving inadequate for newly discovered or designed RNAs.

With the advancement of computational biology and machine learning technologies, the development of RNA structure and function prediction models has become increasingly important. These models can utilize known RNA sequence and structural data to algorithmically predict the three-dimensional structure and function of previously uncharacterized RNA. For instance, deep learning-based models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have demonstrated strong capabilities in RNA structure prediction, capable of identifying conserved domains within RNA and predicting interactions with other molecules, such as proteins. This provides robust support for establishing predictive models tailored to the design of double-stranded RNA.

In summary, RNA structure and function prediction provides critical support for our research, aiding in the identification of appropriate regions within the target RNA, thereby enhancing ADAR enzyme-mediated RNA editing efficiency. By utilizing these tools, we can better construct in situ RNA sensing and regulation models suitable for systems such as *Saccharomyces cerevisiae*. Through a combination of computational simulations and experimental validation, these models will provide a solid theoretical basis for optimizing RNA editing and regulation strategies, opening new possibilities for applications in synthetic biology and biotechnology.

本课题基于脱氨酶的RNA识别和编辑系统,旨在开发合适的RNA原位传感和调控工具。通过利用ADAR酶,我们能够人工设计RNA,使其与细胞内源的RNA靶向互补,从而形成杂交双链RNA区域。然而,关键挑战在于如何识别目标RNA中的合适区域,以便与ADAR酶形成稳定的双链RNA结构,进而提高RNA编辑效率。为解决这一问题,我们计划开发并应用RNA传感器设计辅助模型,以指导终端用户设计高效的双链RNA结构,进而设计RNA传感和调控策略。结合实验验证与计算模拟,我们希望提升RNA编辑的特异性和效率,为RNA原位传感和调控工具的开发提供坚实的理论基础。

为实现这一目标,我们需要采用RNA的结构和功能预测手段。在分子生物学领域,RNA的结构和功能预测一直是研究的热点与难点。RNA分子的多样性和动态性使得其功能预测面临巨大挑战。尽管核酸酶保护实验和X射线晶体学等传统实验方法能够提供精确的结构信息,但这些方法耗时且成本高昂,且通常局限于已知结构的RNA分子,对新发现或设计的RNA的适用性往往不足。

随着计算生物学和机器学习技术的发展,RNA结构与功能预测模型的开发愈发重要。这些模型可以利用已知的RNA序列和结构数据,通过算法预测未知RNA的三维结构及其功能。例如,基于深度学习的模型(如卷积神经网络CNN和循环神经网络RNN)在RNA结构预测中展示了强大的能力,既能够识别RNA中的保守结构域,又能预测其与蛋白质等分子的相互作用。这为我们建立针对双链RNA设计的预测模型提供了良好的支持。

综上所述,RNA结构与功能预测为我们研究提供了关键支持,帮助识别目标RNA的合适区域,从而提升ADAR酶介导的RNA编辑效率。借助这些工具,我们能够更好地构建适用于酿酒酵母等系统的RNA原位传感和调控模型。通过结合计算模拟与实验验证,这些模型将为RNA编辑和调控策略的优化提供坚实的理论基础,并为合成生物学和生物技术的应用带来新的可能性。

Technical Analysis

技术分析

Existing RNA Prediction Tools

现有的RNA检测工具

In the realm of RNA secondary structure prediction, various computational methods have been proposed, primarily including energy-based methods, co-evolutionary methods, and deep learning-based methods. Energy-based prediction methods rely on thermodynamic principles to predict the most stable secondary structures in RNA molecules by minimizing free energy. Tools such as Mfold, RNAstructure, and MC-Fold utilize free energy minimization algorithms to calculate the most probable secondary structures. The advantage of these methods lies in their solid physical foundations, enabling them to provide high stability in structure predictions.

Co-evolutionary prediction methods leverage covariation information between homologous RNA sequences to predict secondary structures. These methods assume that regions conserved during evolution often hold structural significance, and they infer RNA secondary structures by analyzing co-evolutionary features among homologous sequences. For instance, tools like Dynalign II, R-scape, and CaCoFold use covariation information and can effectively enhance prediction accuracy, particularly when base pairing is preserved across homologs.

Deep learning-based prediction methods have made significant strides in recent years. These methods train models using a large amount of known RNA sequence and structural data to efficiently predict unknown RNA secondary structures. For example, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can automatically learn sequence features of RNA and predict their secondary structures without explicit encoding rules. These methods are especially suited for handling large-scale RNA data and demonstrate stronger generalization capabilities compared to traditional approaches.

For RNA tertiary structure prediction, many computational tools have been developed, such as iFold, SimRNA, and FARNA. These tools attempt to predict three-dimensional structures by simulating the folding processes of RNA molecules. However, despite their excellent performance in predicting the structures of small RNA molecules, accurately predicting the complex topologies of larger RNA molecules remains a significant challenge. In particular, when dealing with long-chain RNA or molecules with complex secondary structures, the computational burden of existing algorithms increases substantially, leading to decreased prediction accuracy.

Deep learning techniques have also achieved important breakthroughs in RNA tertiary structure prediction. Inspired by the success of protein structure prediction tools like AlphaFold2 and RoseTTAFold, new deep learning tools such as DeepFoldRNA, RoseTTAFoldNA, and RhoFold have been developed. These methods significantly improve RNA tertiary structure prediction by integrating RNA sequences, multiple sequence alignments (MSA), and known structural information. Compared to traditional methods, deep learning tools can better handle sequence variations and automatically identify important structural domains within sequences, thereby enhancing prediction accuracy.

Moreover, integrating experimentally determined RNA structural data into computational models can further enhance prediction accuracy. For example, the RNAstructure software can convert experimental probe data (such as SHAPE reactivity scores) into "pseudo-energy terms," incorporating them into energy or statistical models to improve prediction performance. This integration of experimental data with computational models offers an effective pathway to enhance prediction quality, especially in the face of complex experimental conditions or a lack of high-quality sequence data.
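The pseudo-energy conversion mentioned above can be sketched as follows. This is a minimal illustration, assuming the widely cited log-linear form, pseudo-energy = m · ln(reactivity + 1) + b, applied per nucleotide. The function name `shape_pseudo_energy`, the default slope and intercept (2.6 and -0.8 kcal/mol), and the convention that negative reactivities mean "no data" are assumptions for illustration, not settings taken from this project.

```python
import math

def shape_pseudo_energy(reactivity, m=2.6, b=-0.8):
    """Return a per-nucleotide pseudo-free-energy term in kcal/mol.

    Assumed form: m * ln(reactivity + 1) + b. The defaults for m and b
    are illustrative; tools expose them as tunable parameters.
    """
    if reactivity < 0:
        # Assumption: negative values flag nucleotides without data,
        # so they contribute no pseudo-energy.
        return 0.0
    return m * math.log(reactivity + 1.0) + b

# Unreactive nucleotides (likely paired) get a stabilizing bonus,
# highly reactive ones (likely unpaired) get a penalty.
for r in (0.0, 0.3, 1.5):
    print(r, round(shape_pseudo_energy(r), 3))
```

With these defaults the term changes sign at a reactivity of roughly 0.36, so low-reactivity positions are rewarded for pairing and high-reactivity positions are penalized.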

In summary, as RNA structure detection technologies advance and computational methods develop rapidly, our understanding of RNA structure and function will deepen. This progress will not only promote basic life sciences research but also provide new avenues for RNA-based drug design. For instance, utilizing these advanced prediction tools can facilitate the design of RNA molecules with specific functions or the development of small-molecule drugs targeting RNA structures, thereby creating more opportunities for biomedical research and applications.

在RNA二级结构预测方面,已有多种计算方法被提出,主要包括基于能量的预测方法、基于共进化的预测方法以及基于深度学习的预测方法。基于能量的预测方法依赖于热力学原理,通过最小化自由能来预测RNA分子中最稳定的二级结构。例如,Mfold、RNAstructure和MC-Fold等工具,利用自由能最小化算法来计算最可能的二级结构。这类方法的优势在于其物理基础扎实,能够提供稳定性较高的结构预测。

基于共进化的预测方法则利用同源RNA序列之间的共变异信息来预测RNA的二级结构。这些方法假设在进化过程中高度保守的RNA区域往往具有结构上的重要性,并通过分析同源序列之间的共进化特征,推测RNA的二级结构。例如,Dynalign II、R-scape和CaCoFold等工具基于共变异信息,特别是在RNA的共价键保持稳定的情况下,能够有效提高预测精度。

基于深度学习的预测方法近年来取得了显著进展。这类方法通过利用大量已知的RNA序列和结构数据进行模型训练,从而实现对未知RNA二级结构的高效预测。例如,利用卷积神经网络(CNN)和循环神经网络(RNN)的模型可以自动学习RNA的序列特征,并在无须明确编码规则的情况下预测其二级结构。这些方法尤其适合处理大规模RNA数据,并显示出比传统方法更强的泛化能力。

在RNA三级结构预测方面,许多计算工具已被开发,如iFold、SimRNA和FARNA等。它们通过模拟RNA分子的折叠过程,试图预测其三维结构。然而,尽管这些工具在小型RNA分子的结构预测上表现出色,但对大型RNA分子的复杂拓扑结构进行准确预测仍是一个巨大挑战。特别是在涉及长链RNA或具有复杂二级结构的分子时,现有算法的计算负担大幅增加,导致预测结果的准确性下降。

深度学习技术近年来也在RNA三级结构预测中取得了重要突破。受到AlphaFold3和RoseTTAFold等蛋白质结构预测工具成功经验的启发,新的深度学习工具也相继开发出来,如DeepFoldRNA、RoseTTAFoldNA和RhoFold。这些方法通过结合RNA序列、多序列比对(MSA)以及已知的结构信息,显著提高了RNA三级结构的预测能力。与传统方法相比,深度学习工具可以更好地处理序列变异,并自动识别序列中的重要结构域,从而提高预测精度。

此外,将实验确定的RNA结构数据整合到计算模型中,能够进一步提高预测的准确性。例如,RNAstructure软件能够将实验探针数据(如SHAPE反应性分数)转换为“伪能量项”,并将其引入能量或统计模型中,以改进预测性能。这种将实验数据与计算模型结合的方式,提供了一条提升预测质量的有效途径,尤其是在面对实验条件复杂或缺乏高质量序列数据时。

总之,随着RNA结构探测技术的进步与计算方法的快速发展,我们对RNA结构与功能的理解将不断深入。这不仅将促进基础生命科学研究,还将为基于RNA的药物设计提供新的思路。例如,利用这些先进的预测工具,可以设计具有特定功能的RNA分子,或开发针对RNA结构的小分子药物,从而为生物医学研究和应用带来更多机遇。

Existing RNA Sensor Design Support Models

现有的RNA传感器设计辅助模型

Previous research offers guidance on identifying suitable regions within the target RNA. Based on how ADAR enzymes act, a UAG stop codon is introduced into the artificial RNA so that it forms a single-nucleotide mismatch with a CCA triplet on the endogenous RNA, thereby recruiting ADAR enzymes and initiating their editing activity. In the code implementations of previously proposed models, this principle is translated into a series of automated steps:

1. Identification of Target RNA Sequences: By searching for specific patterns (such as the CCA sequence) within the target RNA sequence, potential editing sites are determined.

2. Generation of Guide Sequences: Guide sequences capable of forming complementary double strands with the target RNA are designed around these sites. These sequences include MS2 binding sequences and other necessary RNA structural domains.

3. Optimization of Editing Sites: Mutation strategies are employed to optimize guide sequences, avoiding non-specific editing and enhancing editing efficiency.

4. Validation of Guide Sequences: Guide sequences are checked to ensure they meet design requirements, such as avoiding undesired homopolymer runs, thereby ensuring specificity and functionality.

5. Output and Application: The designed guide sequences are outputted for experimental validation and application in the laboratory, achieving precise editing of the target RNA.
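The first, second, and fourth steps above can be sketched in a few lines of Python. This is a simplified illustration, not the code of the earlier models: the function names, the flank length, and the homopolymer threshold are hypothetical, and MS2 hairpins, reading-frame checks, and removal of extra stop codons are deliberately left out.

```python
import re

# RNA complement table; characters outside ACGU pass through unchanged.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def to_rna(seq):
    """Upper-case a sequence and convert DNA notation (T) to RNA (U)."""
    return seq.upper().replace("T", "U")

def find_cca_sites(target):
    """Return 0-based start positions of every CCA triplet (overlaps allowed)."""
    return [m.start() for m in re.finditer(r"(?=CCA)", to_rna(target))]

def design_guide(target, site, flank=24):
    """Build a guide (5'->3') complementary to the window around one CCA site.

    The base opposite the middle C of CCA is set to A, which turns the
    perfectly complementary UGG into the UAG stop codon.
    """
    rna = to_rna(target)
    start = max(0, site - flank)
    end = min(len(rna), site + 3 + flank)
    guide = list(rna[start:end].translate(COMPLEMENT)[::-1])
    guide[end - 2 - site] = "A"  # guide index that faces target position site + 1
    return "".join(guide)

def has_homopolymer(seq, max_run=4):
    """True if any base is repeated more than max_run times in a row."""
    return re.search(r"(.)\1{%d,}" % max_run, seq) is not None

target = "AUGGCCAUUGC"
for site in find_cca_sites(target):
    guide = design_guide(target, site, flank=3)
    print(site, guide, has_homopolymer(guide))
```

For the toy target above, the single CCA site at position 4 yields the guide `CAAUAGCCA`, whose central UAG sits opposite the target CCA.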

对于如何在目标RNA中识别合适的区域这一问题,前人已有一定的研究。根据ADAR酶的原理可知,我们需要在人工RNA上引入UAG终止密码子,使其与内源RNA上的CCA位点形成单碱基错配,从而招募ADAR酶并启动其编辑活性。因此,在前人提出的模型的代码实现中,这一原理被转化为一系列自动化的步骤,包括:

1. 目标RNA序列的识别:通过搜索目标RNA序列中的特定模式(如CCA序列),确定可能的编辑位点。

2. 引导序列的生成:围绕这些位点,设计能够与目标RNA形成互补双链的引导序列,这些序列包含了MS2结合序列和其他必要的RNA结构域。

3. 编辑位点的优化:通过突变策略,优化引导序列,以避免非特异性编辑和提高编辑效率。

4. 引导序列的验证:检查引导序列是否满足设计要求,如避免形成非期望的同源多聚体区域,确保引导序列的特异性和功能性。

5. 输出和应用:将设计好的引导序列输出,供实验人员在实验室中进行验证和应用,以实现对目标RNA的精确编辑。

The previously proposed models automate the design of RNA guide sequences, pinpointing target RNA sequences and introducing specific base mismatches to activate the editing function of ADAR enzymes. This strategy not only enhances the precision and efficiency of RNA editing but also accelerates research through an automated design workflow. Additionally, the flexibility and customizability of these models allow them to adapt to various experimental conditions and target RNA sequences, and the fact that their output can be validated directly in experiments provides a solid foundation for applying RNA editing technologies.

However, the model also has limitations. It primarily focuses on the complementarity at the sequence level and may not sufficiently consider the impacts of RNA secondary and tertiary structures on editing efficiency. Moreover, the model has yet to incorporate the complexities of RNA-protein interactions, which could significantly affect the delivery and editing effectiveness of guide sequences. Interactions between RNA chains have also been overlooked, which may influence the formation and stability of double-stranded RNA. Additionally, the model's design does not adequately simulate the effects of intracellular environments on the RNA editing process, such as the stability and accessibility of RNA molecules and the activity of editing enzymes. Lastly, the performance of the model may be limited by the diversity and quality of the datasets used for training and validation.

前人提出的模型通过自动化流程设计RNA引导序列,能够精确地定位目标RNA序列并引入特定的碱基错配,从而激活ADAR酶的编辑功能。这一策略不仅提高了RNA编辑的精确性和效率,而且通过自动化的设计流程,大大加快了研究进程。此外,模型的灵活性和可定制性使其能够适应不同的实验条件和目标RNA序列,而直接的实验验证潜力为RNA编辑技术的应用提供了坚实的基础。

然而,该模型也存在一些不足之处。它主要关注序列层面的互补性,而可能没有充分考虑RNA的二级和三级结构对编辑效率的影响。此外,模型尚未整合RNA与蛋白质相互作用的复杂性,这些相互作用可能会显著影响引导序列的递送和编辑效果。RNA链间的相互作用也未被考虑,这可能会影响双链RNA的形成和稳定性。而且,模型的设计没有充分模拟细胞内环境对RNA编辑过程的影响,如RNA分子的稳定性、可访问性和编辑酶的活性。最后,模型的性能可能受限于用于训练和验证的数据集的多样性和质量。

Proposed Solutions

解决方案

To overcome the limitations of existing RNA sensor design support models and enhance their efficiency and specificity in practical applications, we plan to undertake a series of innovative improvements to the current guide RNA models. We will systematically optimize various aspects of the process, ensuring that each step contributes to overall editing efficiency and accuracy.

First, we will integrate advanced RNA secondary structure prediction tools to optimize the design of guide sequences. The secondary structure of RNA is crucial for its function and interactions, particularly during the binding of guide RNA to target RNA. If the guide RNA cannot form a stable double-stranded structure with the target RNA, editing efficiency will be significantly compromised. Therefore, by utilizing RNA secondary structure prediction tools, we can design guide sequences with higher affinity and stability, as well as assess easily editable regions by predicting the secondary structure of the target RNA. Furthermore, non-specific binding is a common issue in RNA editing, as there is a plethora of non-target RNA in cells. If the guide RNA non-specifically binds to these non-target RNAs, unexpected editing events may occur, leading to off-target effects. Through secondary structure prediction, we can effectively reduce such non-specific binding, thus enhancing the specificity of editing and ensuring that guide RNA preferentially binds to target RNA.

Secondly, we will delve into RNA-protein interactions, particularly the binding between ADAR enzymes and target double-stranded RNA. By integrating experimental data and computational methods, we will predict and optimize the delivery and editing effects of guide sequences. We aim to enhance the interaction between ADAR enzymes and guide sequences, as this will directly improve RNA editing efficiency and specificity. Molecular dynamics simulations will be utilized to predict the dynamic interactions between ADAR enzymes and the designed RNA guide sequences. These simulations will help us understand the dynamic changes during the binding process and predict the binding stability of different guide sequence variants. Through these simulations, we can optimize the design of guide sequences at the molecular level to achieve optimal binding with ADAR enzymes.

Thirdly, we recognize that in the intracellular environment, the hybridized double-stranded RNA is often unstable, and its degradation rate can significantly affect the effectiveness of RNA editing. The degradation of double-stranded RNA is mediated by endogenous enzymes (such as nucleases) within cells. Therefore, if the double-stranded structure formed by guide RNA and target RNA cannot remain stable for a sufficiently long time, the efficiency of RNA editing will be greatly reduced.

Finally, we will adopt a multi-factorial balancing strategy to comprehensively consider the interrelationships among four key factors through experimental data fitting and modeling, aiming to identify the optimal guide RNA. These four factors include: (1) the interaction between guide RNA and ADAR enzymes; (2) the binding efficiency of guide RNA to target RNA; (3) non-specific binding of guide RNA to non-target RNA; and (4) the degradation rate of hybrid double-stranded RNA. We will assign different weights to each factor and use experimental data to fit and optimize these weights, ensuring that each step achieves optimal balance. Through this approach, we aim to identify a theoretically optimal guide RNA sequence that can efficiently bind ADAR enzymes and stably form double strands with target RNA while minimizing non-specific binding and rapid degradation of double-stranded RNA. The basic idea is illustrated in the following figure.

为了克服现有的RNA传感器设计辅助模型的局限性,并提高其在实际应用中的效率和特异性,我们计划对现有的引导RNA模型进行一系列创新改进。我们将从多个方面入手,系统化地优化各个环节,确保每个步骤都能为整体的编辑效率和精度做出贡献。

首先,我们将整合先进的RNA二级结构预测工具,以优化引导序列的设计。RNA的二级结构对其功能和相互作用至关重要,特别是在引导RNA与目标RNA的结合过程中。如果引导RNA与目标RNA无法形成稳定的双链结构,编辑效率将大打折扣。因此,利用RNA二级结构预测工具,不仅可以帮助我们设计出更具亲和力和稳定性的引导序列,还能通过预测目标RNA的二级结构来评估其易于被编辑的区域。此外,非特异性结合是RNA编辑中一个常见的问题,细胞中存在大量的非目标RNA,如果引导RNA与这些非目标RNA发生非特异性结合,可能会引发意外的编辑事件,导致脱靶效应。通过二级结构预测,我们可以有效减少这种非特异性结合,从而提高编辑的特异性,确保引导RNA优先与目标RNA结合。

其次,我们将深入研究RNA-蛋白质相互作用,特别是ADAR酶与目标双链RNA之间的结合。通过实验数据和计算方法,我们将整合这些信息来预测和优化引导序列的递送和编辑效果。我们希望增强ADAR酶与引导序列之间的相互作用,因为这将直接提高RNA编辑的效率和特异性。利用分子动力学模拟来预测ADAR酶与设计好的RNA引导序列之间的动态相互作用。这种模拟将帮助我们理解结合过程中的动态变化,并预测不同引导序列变体的结合稳定性。通过这些模拟,我们可以在分子水平上优化引导序列的设计,以实现与ADAR酶的最佳结合。

第三,我们认识到在细胞内环境中,杂交形成的双链RNA往往是不稳定的,其降解速率会显著影响RNA编辑的有效性。双链RNA的降解是由细胞内的内源性酶类(如核酸酶)介导的,因此,引导RNA与目标RNA的双链结构如果不能稳定存在足够长的时间,RNA编辑的效率将会大大降低。

最后,我们将采用一种多因素平衡的策略,通过实验数据拟合和建模,综合考虑四个关键因素的相互关系,找到最优的引导RNA。这四个因素包括:(1)引导RNA与ADAR酶的相互作用;(2)引导RNA与目标RNA的结合效率;(3)引导RNA与非目标RNA的非特异性结合;(4)杂交双链RNA的降解速率。我们将为每个因素赋予不同的比重,并利用实验数据对这些权重进行拟合和优化,确保每个环节都能达到最佳平衡。通过这种方法,我们希望找到一种理论上最优的引导RNA序列,它不仅能够高效结合ADAR酶并与目标RNA稳定形成双链,还能最大程度避免非特异性结合和双链RNA的快速降解。基本思路如下图所示。

Figure 1: Integration Diagram of RNA Sensor Design Assistance Model

图1 RNA传感器设计辅助模型整合思路图

We organize the above ideas into the equation for ADAR-mediated RNA editing in cells, as shown in Figure 2.

将以上思路整理为细胞内ADAR介导的RNA编辑方程,如下图所示。

Figure 2: Equation for ADAR-Mediated RNA Editing in Cells

图2 细胞内ADAR介导的RNA编辑方程

In this equation, $mRNA$ represents the target RNA within the cell, $Sensor RNA$ denotes the ogRNA, $Endo-RNA$ refers to the endogenous non-target RNA, $dsRNA1$ indicates the hybrid double-strand formed between the target RNA and ogRNA, $dsRNA3$ represents the hybrid RNA double-strand after editing by ADAR, and $Sensor RNA'$ signifies the ogRNA obtained from the dehybridization of the edited hybrid RNA double-strand. After editing, the stop codon UAG in the ogRNA is converted to UIG, which the ribosome typically reads as the sense codon UGG, thereby allowing translation of the downstream sequence and producing the fluorescent signal $signal$. We consider $signal$ to represent the overall activity of the RNA sensor.

在该方程中,$mRNA$ 代表细胞内目标RNA,$Sensor RNA$ 代表ogRNA,$Endo-RNA$ 代表细胞内源非目标RNA,$dsRNA1$ 表示目标RNA与ogRNA形成的杂交双链,$dsRNA3$ 代表经过ADAR编辑后的杂交RNA双链,$Sensor RNA'$ 代表编辑后杂交RNA双链解聚得到的ogRNA。经过编辑后,ogRNA中的终止密码子UAG被编辑为UIG,通常被核糖体识别为非终止密码子UGG,进而启动下游转录本的翻译,产生荧光信号$signal$。经细胞降解后,我们认为降解后的$signal$可以代表整个RNA传感器的活力。

In the equation, $\Delta G_1$ denotes the Gibbs free energy constant of the interaction between $mRNA$ and $Sensor RNA$, $\Delta G_2$ represents the Gibbs free energy constant of the interaction between $Endo-RNA$ and $Sensor RNA$, $\Delta G_3$ signifies the Gibbs free energy constant of the interaction between ADAR and the hybrid RNA double-strand, and $\Delta G_4$ indicates the Gibbs free energy constant of the interaction between ADAR and the edited hybrid RNA double-strand.

方程中,$\Delta G_1$ 表示$mRNA$与$Sensor RNA$相互作用的吉布斯自由能常数,$\Delta G_2$ 表示$Endo-RNA$与$Sensor RNA$相互作用的吉布斯自由能常数,$\Delta G_3$ 表示ADAR与杂交RNA双链的相互作用吉布斯自由能常数,$\Delta G_4$ 表示ADAR与经过ADAR编辑后的杂交RNA双链的相互作用吉布斯自由能常数。

Through the ADAR-mediated RNA editing equation, we can correlate intracellular substance concentrations with Gibbs free energy constants, as expressed in the following equations:

通过细胞内ADAR介导的RNA编辑方程,我们能够将细胞内物质浓度及吉布斯自由能常数相互关联,具体方程如下所示:

$$ \begin{align*}
\frac{d(dsRNA1)}{dt} & = k_1 \cdot mRNA \cdot sensor\_RNA - k_{1}^{-1} \cdot dsRNA1 - \frac{V_{\text{max}} \cdot [dsRNA1]}{K_m + [dsRNA1]} \cdot K_3, \\
\frac{d(dsRNA2)}{dt} & = k_2 \cdot endo\_RNA \cdot sensor\_RNA - k_{2}^{-1} \cdot dsRNA2, \\
\frac{d(dsRNA3)}{dt} & = \frac{V_{\text{max}} \cdot [dsRNA1]}{K_m + [dsRNA1]} \cdot K_3, \\
\frac{d(mRNA)}{dt} & = -k_1 \cdot mRNA \cdot sensor\_RNA + k_{1}^{-1} \cdot dsRNA1 + k_4 \cdot dsRNA3 - k_{4}^{-1} \cdot mRNA \cdot sensor\_RNA', \\
\frac{d(sensor\_RNA)}{dt} & = -k_1 \cdot mRNA \cdot sensor\_RNA + k_{1}^{-1} \cdot dsRNA1 - k_2 \cdot sensor\_RNA \cdot endo\_RNA + k_{2}^{-1} \cdot dsRNA2, \\
\frac{d(sensor\_RNA')}{dt} & = k_4 \cdot dsRNA3 - k_{4}^{-1} \cdot mRNA \cdot sensor\_RNA' - k_5 \cdot sensor\_RNA', \\
\frac{d(endo\_RNA)}{dt} & = -k_2 \cdot sensor\_RNA \cdot endo\_RNA + k_{2}^{-1} \cdot dsRNA2, \\
\frac{d(n)}{dt} & = k_5 \cdot sensor\_RNA' - k_6 \cdot n, \\
\frac{d(f)}{dt} & = k_6 \cdot n, \\
K_1 & = e^{-\frac{\Delta G_1}{RT}}, \\
K_2 & = e^{-\frac{\Delta G_2}{RT}}, \\
K_3 & = e^{-\frac{\Delta G_3}{RT}}, \\
K_4 & = e^{-\frac{\Delta G_4}{RT}}, \\
k_{1}^{-1} & = \frac{k_1}{K_1}, \\
k_{2}^{-1} & = \frac{k_2}{K_2}, \\
k_{4}^{-1} & = \frac{k_4}{K_4}
\end{align*} $$

In these equations, $V_{\text{max}}, K_m, k_1, k_2, k_4, k_5, k_6$ are the parameters to be fitted. $V_{\text{max}}$ and $K_m$ derive from the Michaelis-Menten equation: $V_{\text{max}}$ is the maximum reaction rate, reached when the substrate concentration is sufficiently high, and $K_m$ is the substrate concentration at which the reaction rate reaches half of its maximum value, reflecting the enzyme's affinity for the substrate. $k_1, k_2, k_4, k_5, k_6$ are the rate constants of the corresponding reactions.
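The relations between free energies, equilibrium constants, and reverse rate constants can be checked numerically with a short sketch. It assumes $\Delta G$ values in kcal/mol and a temperature of 310.15 K; both are illustrative choices, and the function names are hypothetical rather than taken from the project code.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def equilibrium_constant(delta_g, temperature=310.15):
    """K = exp(-dG / (R*T)); delta_g in kcal/mol (assumed unit)."""
    return math.exp(-delta_g / (R_KCAL * temperature))

def reverse_rate(k_forward, delta_g, temperature=310.15):
    """Reverse rate constant k^{-1} = k / K, as in the equations above."""
    return k_forward / equilibrium_constant(delta_g, temperature)

def michaelis_menten(substrate, v_max, k_m):
    """Michaelis-Menten rate: v_max * [S] / (k_m + [S])."""
    return v_max * substrate / (k_m + substrate)

# A more negative dG gives a larger K and therefore slower dissociation.
for dg in (-5.0, -10.0, -20.0):
    print(dg, equilibrium_constant(dg), reverse_rate(1.0, dg))
```

Because $K$ depends exponentially on $\Delta G$, modest differences in predicted free energy translate into large differences in duplex lifetime, which is why the $\Delta G$ inputs matter so much for the fit.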

在该方程中,$V_{max}, K_m, k_1, k_2, k_4, k_5, k_6$ 为需拟合的参数。$V_{max}$ 与 $K_m$ 源于米氏方程:$V_{max}$ 是当底物浓度充分高时的最大反应速率;$K_m$ 是反应速率达到最大值一半时的底物浓度,反映了酶对底物的亲和力。$k_1, k_2, k_4, k_5, k_6$ 是每个反应相应的速率常数。

A Python program fits this system of differential equations to the experimental data. The objective function is the sum of squared errors between the predicted overall activity of the RNA sensor and the experimental data (denoted `f_values`). The objective is minimized with bound-constrained optimizers, namely the TNC (Truncated Newton Conjugate-Gradient) algorithm and the L-BFGS-B algorithm, to estimate the model parameters. L-BFGS-B is a commonly used quasi-Newton method particularly suited to large-scale optimization problems with bound constraints. Its main advantage is memory efficiency: it does not store the complete Hessian matrix but approximates second-order information from the history of recent iterations. This accelerates convergence and makes it suitable for multi-parameter models built on stiff differential equations, as in this case. The differential equations are solved using solve_ivp with the BDF (Backward Differentiation Formula) method, which is appropriate for stiff problems. During optimization, the program computes the model's predicted values for each parameter set and compares them with the experimental data. Fit quality is assessed with the mean squared error (MSE), the average squared discrepancy between predicted and experimental values, and the coefficient of determination (R²), which measures the goodness of fit of the model to the experimental data.
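The fitting loop described above can be sketched on a deliberately reduced toy problem. The example below is not the project's program: it models only the reversible binding step (mRNA + sensor RNA ⇌ dsRNA1), fits a single rate constant to synthetic data generated by the same model, and uses illustrative values for `K1`, the initial concentrations, and the bounds. It does use the same ingredients named in the text: `solve_ivp` with `method="BDF"`, a sum-of-squares objective, `minimize` with `method="L-BFGS-B"`, and MSE and R² as fit metrics.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

K1 = 5.0              # equilibrium constant, held fixed (illustrative value)
Y0 = [1.0, 1.0, 0.0]  # initial mRNA, sensor RNA, dsRNA1 (arbitrary units)

def rhs(t, y, k1):
    """Toy subsystem: reversible duplex formation only."""
    m, s, d = y
    v = k1 * m * s - (k1 / K1) * d  # net duplex formation rate
    return [-v, -v, v]

def simulate(k1, t_obs):
    """Integrate with the stiff BDF solver and return dsRNA1 at t_obs."""
    sol = solve_ivp(rhs, (0.0, t_obs[-1]), Y0, args=(k1,), method="BDF",
                    t_eval=t_obs, rtol=1e-9, atol=1e-12)
    return sol.y[2]

def objective(params, t_obs, observed):
    """Sum of squared errors between model prediction and data."""
    return float(np.sum((simulate(params[0], t_obs) - observed) ** 2))

t_obs = np.linspace(0.0, 2.0, 15)
observed = simulate(2.0, t_obs)  # synthetic "data" generated with k1 = 2.0

fit = minimize(objective, x0=[0.5], args=(t_obs, observed),
               method="L-BFGS-B", bounds=[(1e-3, 10.0)],
               options={"eps": 1e-6})  # finite-difference step for the gradient

predicted = simulate(fit.x[0], t_obs)
mse = float(np.mean((predicted - observed) ** 2))
ss_res = float(np.sum((observed - predicted) ** 2))
ss_tot = float(np.sum((observed - observed.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
print(fit.x[0], mse, r2)
```

Tight solver tolerances are used here because the optimizer estimates gradients by finite differences, so integration noise in the objective would otherwise distort the search direction. Switching to `method="TNC"` in the `minimize` call would exercise the other optimizer mentioned in the text with the same bounds.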

使用python程序对此微分方程进行拟合,拟合的目标函数的设计用于评估模型预测结果与实验数据之间的差距。目标函数计算了模型预测的整个RNA传感器的活力与实验数据(f_values)之间的误差,并将其平方和作为优化目标。采用了TNC(Truncated Newton Conjugate-Gradient)算法来最小化目标函数,从而优化模型参数。使用了 L-BFGS-B 算法来对微分方程模型进行参数优化。L-BFGS-B 是一种常用的准牛顿法,特别适合于具有边界约束的大规模优化问题。其主要优点在于对内存的高效利用,因为它不需要存储完整的Hessian矩阵,而是利用历史信息来近似二阶导数。这样能够在优化过程中加速收敛,适合处理如本案例中含多个参数的刚性微分方程模型。微分方程的求解通过solve_ivp来实现,选用了BDF方法(Backward Differentiation Formula),这是一种适合刚性问题的积分方法。在优化过程中,针对每一组参数,程序会计算模型的预测值,并与实验数据进行比较。误差通过均方误差(MSE)和决定系数(R²)进行评估,MSE表示模型预测值与实验数据之间的平均差距,而R²用于度量模型对实验数据的拟合优度。

Data Source

数据来源

Endogenous Target mRNA Sequences

内源靶向mRNA序列

The mRNA sequences utilized in this study were obtained from the NCBI database (National Center for Biotechnology Information (nih.gov)). The specific sequences are as follows:

使用的mRNA序列均来源于NCBI数据库(National Center for Biotechnology Information (nih.gov)),具体序列如下。

IL6

    ATTCTGCCCTCGAGCCCACCGGGAACGAAAGAGAAGCTCTATCTCCCCTCCAGGAGCCCAGCTATGAACTCCTTCTCCACAAGCGCCTTCGGTC
    CAGTTGCCTTCTCCCTGGGGCTGCTCCTGGTGTTGCCTGCTGCCTTCCCTGCCCCAGTACCCCCAGGAGAAGATTCCAAAGATGTAGCCGCCCC
    ACACAGACAGCCACTCACCTCTTCAGAACGAATTGACAAACAAATTCGGTACATCCTCGACGGCATCTCAGCCCTGAGAAAGGAGACATGTAAC
    AAGAGTAACATGTGTGAAAGCAGCAAAGAGGCACTGGCAGAAAACAACCTGAACCTTCCAAAGATGGCTGAAAAAGATGGATGCTTCCAATCTG
    GATTCAATGAGGAGACTTGCCTGGTGAAAATCATCACTGGTCTTTTGGAGTTTGAGGTATACCTAGAGTACCTCCAGAACAGATTTGAGAGTAG
    TGAGGAACAAGCCAGAGCTGTGCAGATGAGTACAAAAGTCCTGATCCAGTTCCTGCAGAAAAAGGCAAAGAATCTAGATGCAATAACCACCCCT
    GACCCAACCACAAATGCCAGCCTGCTGACGAAGCTGCAGGCACAGAACCAGTGGCTGCAGGACATGACAACTCATCTCATTCTGCGCAGCTTTA
    AGGAGTTCCTGCAGTCCAGCCTGAGGGCTCTTCGGCAAATGTAGCATGGGCACCTCAGATTGTTGTTGTTAATGGGCATTCCTTCTTCTGGTCA
    GAAACCTGTCCACTGGGCACAGAACTTATGTTGTTCTCTATGGAGAACTAAAAGTATGAGCGTTAGGACACTATTTTAATTATTTTTAATTTAT
    AATATTTAAATATGTGAAGCTGAGTTAATTTATGTAAGTCATATTTATATTTTTAAGAAGTACCACTTGAAACATTTTATGTATTAGTTTTGAA
    ATAATAATGGAAAGTGGCTATGCAGTTTGAATATCCTTTGTTTCAGAGCCAGATCATTTCTTGGAAAGTGTAGGCTTACCTCAAATAAATGGCT
    AACTTATACATATTTTTAAAGAAATATTTATATTGTATTTATATAATGTATAAATGGTTTTTATACCAATAAATGGCATTTTAAAAAATTCA
        

EGFP

    ATGTCTAGAGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGT
    GTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCG
    TGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTAC
    GTCCAGGAGGTACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAG
    CTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAA
    GCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCA
    TCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTC
    CTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA
        

NPY

    ACCCCATCCGCTGGCTCTCACCCCTCGGAGACGCTCGCCCGACAGCATAGTACTTGCCGCCCAGCCACGCCCGCGCGCCAGCCACCATGCTAGGT
    AACAAGCGACTGGGGCTGTCCGGACTGACCCTCGCCCTGTCCCTGCTCGTGTGCCTGGGTGCGCTGGCCGAGGCGTACCCCTCCAAGCCGGACAA
    CCCGGGCGAGGACGCACCAGCGGAGGACATGGCCAGATACTACTCGGCGCTGCGACACTACATCAACCTCATCACCAGGCAGAGATATGGAAAAC
    GATCCAGCCCAGAGACACTGATTTCAGACCTCTTGATGAGAGAAAGCACAGAAAATGTTCCCAGAACTCGGCTTGAAGACCCTGCAATGTGGTGA
    TGGGAAATGAGACTTGCTCTCTGGCCTTTTCCTATTTTCAGCCCATATTTCATCGTGTAAAACGAGAATCCACCCATCCTACCAATGCATGCAGC
    CACTGTGCTGAATTCTGCAATGTTTTCCTTTGTCATCATTGTATATATGTGTGTTTAAATAAAGTATCATGCATTCAAAA
        

Wet Lab Data

湿实验数据

The data used in this experiment were sourced from the 2023 publication by the Jiang Kaiyi team in Nature Biotechnology and their associated datasets, as shown in the figure below:

本实验所用数据均来自Jiang Kaiyi团队在Nature Biotechnology上发表的2023年文章及其持有的相关数据,如下图所示:

Figure 3. RADARS data targeting IL6, EGFP, and NPY

图3 针对IL6、EGFP和NPY靶向RADARS数据

We extend our gratitude to the Jiang Kaiyi team for providing the RADARS fold-activation data targeting IL6, EGFP, and NPY (utilizing exogenous ADAR1p150). For each transcript, 12 (NPY) to 14 (IL6 and EGFP) oligonucleotide guide RNAs (ogRNAs) were designed to target different CCA sites. Each point in the figure represents the average of three technical replicates for a single sensor, while the horizontal solid line denotes the average across all ogRNAs for that transcript. These 40 data points (12 + 14 + 14) were used as the f_values mentioned earlier.

特此感谢Jiang Kaiyi团队提供的针对IL6、EGFP和NPY靶向的RADARS折叠激活数据(使用外源性ADAR1p150)。针对每个转录本,设计了12(NPY)至14(IL6和EGFP)种ogRNA,靶向不同的CCA位点。图中每个点表示单个传感器的三个技术重复的平均值,而水平实线则表示该转录本所有ogRNA的平均值。这40组数据即为上文所述f_values。

RNA Secondary and Tertiary Structure Prediction Data

RNA二级结构与三级结构预测数据

There are various methods for predicting RNA secondary structures, each with its own advantages and limitations in different application contexts. However, our research presents a unique requirement: predicting the secondary structure of hybrid RNA. The secondary structure prediction of hybrid RNA is relatively complex, as it needs to consider the interactions between two RNA strands while accurately reflecting their internal structures.

RNA二级结构的预测方法多种多样,各方法在不同应用场景中各具优势与局限。然而,在我们的研究中存在一个特殊需求,即需要预测杂交链RNA的二级结构。杂交链RNA的二级结构预测相对复杂,既需考虑两条RNA链之间的相互作用,也需准确反映各自的内部结构。

To meet this requirement, we selected ViennaRNA-2.5.0, an advanced RNA secondary structure prediction tool. ViennaRNA is capable of efficiently predicting the secondary structure of a single RNA strand and can also handle the complex structures formed by hybrid RNA composed of two or more RNA strands. This capability is crucial for us, as the binding efficiency of ogRNA to the target RNA and the stability of its secondary structure directly impact the success of RNA editing. Through ViennaRNA, we can accurately predict the secondary structure of hybrid RNA, identify potential hotspot regions, and optimize guide sequence design to enhance editing specificity and efficiency. Additionally, ViennaRNA provides efficient energy computation capabilities, offering detailed information on the changes in free energy during the formation of hybrid RNA, further aiding our understanding of its stability and feasibility.

为满足这一需求,我们选择了ViennaRNA-2.5.0这一先进的RNA二级结构预测工具。ViennaRNA不仅能高效预测单条RNA链的二级结构,还能处理由两条或多条RNA链形成的杂交链RNA复合结构。这对我们来说至关重要,因为ogRNA与目标RNA的结合效率及其二级结构的稳定性直接影响RNA编辑的成败。通过ViennaRNA,我们能够精确预测杂交链RNA的二级结构,识别可能的热点区域,从而优化引导序列设计,提高编辑的特异性和效率。此外,ViennaRNA具备高效的能量计算功能,能够提供杂交链RNA形成过程中自由能变化的详细信息,进一步帮助我们理解其稳定性与可行性。

In terms of predicting RNA tertiary structures, we employed both 3dRNA and trRosettaRNA software to improve prediction accuracy and diversity. 3dRNA uses a fragment assembly-based modeling approach, starting from the RNA secondary structure (predicted by ViennaRNA-2.5.0) to construct tertiary structures of different fragments, which are then optimized through energy minimization. This method's advantage lies in its ability to quickly generate multiple candidate structures and effectively construct large-scale RNA molecules. By inputting the secondary structure information of the RNA, we generate multiple potential tertiary structures and output PDB files for further analysis.

在RNA三级结构预测方面,我们同时使用3dRNA和trRosettaRNA两种软件,以提高预测的准确性和多样性。3dRNA采用基于片段组装的建模方法,从RNA的二级结构(由ViennaRNA-2.5.0预测得到)入手,构建不同片段的三级结构,并通过能量最小化进行优化。该方法的优势在于能够快速生成多个候选结构,并对大规模RNA分子进行有效构建。我们通过输入RNA的二级结构信息,生成多个可能的三级结构,并输出PDB文件供进一步分析。

trRosettaRNA, on the other hand, relies on deep learning techniques to predict distances and torsional angles between base pairs, thereby providing a more precise prediction of RNA's global conformation. By combining these two methods, we can compare the PDB files generated by different software to assess structural accuracy and select the model that best matches experimental results or exhibits the lowest energy. Furthermore, additional energy optimization or experimental validation can be employed to confirm the final RNA tertiary structure.

trRosettaRNA则依托深度学习技术,通过训练模型预测碱基对之间的距离和扭转角度,以更精确地预测RNA的全局构象。通过结合这两种方法,我们能够比较不同软件生成的PDB文件,以评估结构的准确性,并选出最符合实验结果或能量最低的模型。此外,进一步的能量优化或实验验证可用于确认最终的RNA三级结构。

Gibbs Free Energy Data

吉布斯自由能数据

- $\Delta G_1$, $\Delta G_4$

- $\Delta G_1$、 $\Delta G_4$

As previously mentioned, $\Delta G_1$ and $\Delta G_4$ represent the Gibbs free energy constants for the interaction between mRNA and sensor RNA, and for the interaction between ADAR and the hybrid RNA duplex edited by ADAR, respectively. We compared several RNA interaction prediction tools (results shown below) and ultimately selected IntaRNA for this study.

如上文所述,方程中的 $\Delta G_1$ 与 $\Delta G_4$ 分别表示 mRNA 与 Sensor RNA 之间的相互作用,以及 ADAR 与经过 ADAR 编辑后的杂交 RNA 双链之间的相互作用的吉布斯自由能常数。为了评估不同 RNA 互作的软件效果,我们比较了多种RNA 互作软件,结果如下所示,最终选择IntaRNA 作为本次课题 RNA 互作预测工具。

Figure 4. Comparison of Various RNA Interaction Software

图4 多种RNA 互作软件的比较

IntaRNA is an RNA interaction prediction tool based on pairing energy calculations, designed to identify binding sites between RNA molecules. Its core principle involves predicting the thermodynamic stability of the binding by calculating the energy scores of RNA strands, including structural features such as base pairings, internal loops, and bulges. IntaRNA employs a dynamic programming algorithm to efficiently search all possible pairing combinations to identify the optimal binding modes. This software can handle interactions between single-stranded RNA as well as assess interactions between double-stranded RNA, providing robust support for the study of functional relationships between RNAs.

IntaRNA 是一种基于配对能量计算的 RNA 互作预测工具,旨在识别 RNA 分子之间的结合位点。其核心原理是通过计算 RNA 链的能量得分,包括碱基对之间的配对、内环和外突等结构特征,来预测其结合的热力学稳定性。IntaRNA 采用动态规划算法来高效地搜索所有可能的配对方式,从而找到最优的结合模式。该软件不仅能够处理单链 RNA 的互作,还能评估双链 RNA 之间的相互作用,为研究 RNA 之间的功能关系提供了有力支持。

- $\Delta G_2$

- $\Delta G_2$

$\Delta G_2$ indicates the Gibbs free energy constant for the interaction between Endo-RNA and Sensor RNA. To assess this interaction, we chose CopraRNA as our tool. CopraRNA is designed to predict interactions between a single RNA molecule and all endogenous RNAs within a cell. Its methodology integrates RNA sequence and structural information to identify potential binding sites and predict interactions between different RNA molecules. Specifically, CopraRNA first analyzes the sequence features of the target RNA and compares them with all endogenous RNAs in the cell. The tool utilizes the conservation of the sequence and structural similarity to calculate potential binding patterns and evaluate the changes in binding free energy between each pair of RNAs.

$\Delta G_2$ 表示$Endo-RNA$与$Sensor RNA$相互作用的吉布斯自由能常数,为了评估这一相互作用,我们选择了 CopraRNA 作为工具。CopraRNA 是一种用于预测单个 RNA 分子与细胞内所有内源 RNA 之间相互作用的工具。其工作原理是通过整合 RNA 序列和结构信息,识别可能的结合位点,从而预测不同 RNA 分子之间的相互作用。具体而言,CopraRNA 首先会分析目标 RNA 的序列特征,并与细胞内的所有内源 RNA 进行比对。该工具利用序列的保守性和结构相似性,计算潜在的结合模式,并评估每对 RNA 之间的结合自由能变化。

- $\Delta G_3$

- $\Delta G_3$

After obtaining the RNA tertiary structure, we further investigated the interaction between hybrid RNA and ADAR enzymes using HDOCK software for molecular docking. HDOCK is a powerful molecular docking tool suitable for docking analyses of various molecular systems, including protein-protein, protein-nucleic acid, and nucleic acid-nucleic acid complexes. Its docking mechanism is based on global search methods, quickly identifying binding regions through Fourier transformation and employing specific scoring algorithms to evaluate molecular binding affinities. The multifunctionality and flexibility of HDOCK make it one of the preferred tools for docking studies, especially in RNA-protein interaction research, where its precise predictions can reveal complex molecular interaction networks.

在获得RNA的三级结构后,我们进一步研究了杂交链RNA与ADAR酶的相互作用,采用HDOCK软件进行分子对接。HDOCK是一款功能强大的分子对接工具,适用于多种分子体系的对接分析,包括蛋白质-蛋白质、蛋白质-核酸以及核酸-核酸等复杂体系。其对接机制基于全局搜索,通过傅里叶变换快速识别结合区域,并采用特定的打分算法评估分子之间的结合亲和力。HDOCK的多功能性和灵活性使其成为对接研究中的首选工具之一,尤其在RNA-蛋白质相互作用的研究中,其提供的精确预测能够揭示复杂的分子相互作用网络。

The docking simulations conducted with HDOCK not only provide preliminary binding modes but also allow for further energy optimization and structural adjustments, revealing possible conformational changes of RNA molecules during the binding process. This information is crucial for gaining a deeper understanding of the interaction mechanisms between RNA and ADAR enzymes, laying a solid foundation for subsequent experimental validation and molecular dynamics simulations.

HDOCK的对接模拟不仅能够提供初步的结合模式,还可以通过进一步的能量优化和结构调整,揭示RNA分子在结合过程中可能的构象变化。这些信息对于深入理解RNA与ADAR酶的相互作用机制具有重要的指导意义,并为后续的实验验证和分子动力学模拟奠定了坚实的基础。

Evaluate the capability of monitoring splice variants

剪接异构体监控能力衡量

Overview of RNA Splicing Isoforms

剪接异构体介绍

RNA splicing is a crucial process in eukaryotic gene expression, during which introns (non-coding regions) are removed from pre-mRNA, and exons (coding regions) are joined to produce mature mRNA. The process can generate multiple mRNA variants, known as RNA splicing isoforms, from a single gene through alternative splicing. This creates protein diversity and allows one gene to encode different proteins. RNA splicing isoforms play a significant role in gene regulation and are implicated in many diseases, including cancer and neurodegenerative disorders.

RNA剪接是真核生物基因表达中的一个关键过程,在此过程中,内含子(非编码区域)从前体mRNA中被移除,外显子(编码区域)被连接起来,从而生成成熟的mRNA。通过选择性剪接,该过程可以从一个基因生成多种mRNA变体,称为RNA剪接异构体。这种机制创造了蛋白质多样性,并使得一个基因能够编码不同的蛋白质。RNA剪接异构体在基因调控中发挥重要作用,并与多种疾病相关,包括癌症和神经退行性疾病。

Our Project's Logic

项目逻辑

In our project, we analyze whether the nucleotide sequence "ACC" occurs near the splice junctions of each isoform. For each gene, we examine all of its transcripts and check a defined base range around every splice junction. If "ACC" appears within this range for at least one splice junction in a transcript, that transcript is considered successful. Under the strict criterion, a gene is considered "successful" only if every one of its transcripts is successful. Because we do not necessarily care about all transcripts of a gene, we also use a relaxed criterion: a gene is considered successful when 50% of its transcripts are successful.

在我们的项目中,我们旨在分析特定的RNA剪接事件是否在各种异构体的剪接位点生成核苷酸序列“ACC”。对于每个基因,我们检查它的所有转录本,在剪接位点周围的定义碱基范围内进行检查。如果该序列在任何转录本的剪接位点范围内出现,则该转录本被认为是成功的。我们的逻辑是检查基因中的每个转录本是否都被认为成功。如果是这样,我们将该基因视为“成功”。或者,我们有第二种判断方法,因为我们不一定关心基因的所有转录本,所以当50%的转录本成功时,我们也将该基因视为成功。
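The two gene-level criteria can be expressed as one function with a threshold. This is a minimal sketch: the function name and `min_fraction` parameter are hypothetical, and treating "50%" as "at least 50%" and an empty transcript list as a failure are assumptions rather than behavior confirmed by the project code.

```python
def gene_is_successful(transcript_flags, min_fraction=1.0):
    """Decide gene-level success from per-transcript results.

    transcript_flags: list of booleans, one per transcript of the gene.
    min_fraction: 1.0 gives the strict rule (every transcript must succeed);
                  0.5 gives the relaxed rule (assumed to mean "at least half").
    """
    if not transcript_flags:
        return False  # assumption: a gene with no usable transcripts fails
    fraction = sum(transcript_flags) / len(transcript_flags)
    return fraction >= min_fraction


flags = [True, True, False, True]          # three of four transcripts succeed
print(gene_is_successful(flags))           # strict rule  -> False
print(gene_is_successful(flags, 0.5))      # relaxed rule -> True
```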

Data Sources

数据来源

Source: We use data from the Ensembl Genome Browser (Ensembl website), which provides high-quality genomic data for a wide range of species.

数据库:我们使用Ensembl Genome Browser (Ensembl website) 中的数据。该网站提供各类物种的高质量基因组数据。

GTF File: The GTF (Gene Transfer Format) file contains gene annotations, including exon positions for each transcript. We use the Homo_sapiens.GRCh38.112.gtf.gz file.

GTF文件:GTF (Gene Transfer Format) 文件包含各类基因注释,包括每个转录本的外显子位置。我们使用Homo_sapiens.GRCh38.112.gtf.gz文件。

FASTA File: The cDNA FASTA file contains the full cDNA sequence of each transcript. We use the Homo_sapiens.GRCh38.cdna.all.fa.gz file.

FASTA文件:cDNA FASTA文件包含每个转录本的完整cDNA序列。我们使用Homo_sapiens.GRCh38.cdna.all.fa.gz文件。

Both files are compressed and stored in '.gz' format to reduce download sizes. These files are essential for parsing genomic data, with the GTF file providing gene structure annotations and the FASTA file giving nucleotide sequences of the transcripts.

这两个文件都经过压缩,并以.gz格式存储以减少下载大小。这些文件对于解析基因组数据至关重要,其中GTF文件提供基因结构注释,FASTA文件提供转录本的核苷酸序列。
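Because both files are distributed as `.gz` archives, they must either be decompressed first or opened through `gzip`. The helper below is a small sketch of the second option; `open_text` is a hypothetical name and is not part of the project's code.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith('.gz'):
        # 'rt' makes gzip return decoded text lines instead of bytes
        return gzip.open(path, 'rt', encoding='utf-8')
    return open(path, 'r', encoding='utf-8')
```

With such a helper, a loop like `for line in open_text(gtf_path):` works the same way on the compressed download and on a decompressed copy.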

Code interpretation

代码解释

GTF File Parsing (parse_gtf_file):

GTF文件解析:

Purpose: Extract exon start and end positions for each transcript from the GTF file, creating a map of exon positions.

目标:从GTF文件中提取每个转录本的外显子起始和结束位置,创建外显子位置图。

Original Code:

源代码:


          
    import pandas as pd

    def parse_gtf_file(gtf_file):
        """
        Parse a GTF file to extract exon information.

        Parameters:
        gtf_file (str): Path to the decompressed GTF file
        Returns:
        DataFrame: gene ID, transcript ID, and exon start and end positions (one row per exon).
        """
        gtf_data = []
        with open(gtf_file, 'r') as file:
            for line in file:
                if line.startswith('#'):
                    continue  # skip header and comment lines
                fields = line.rstrip('\n').split('\t')
                if len(fields) < 9 or fields[2] != 'exon':
                    continue  # skip malformed lines and non-exon features
                # Column 9 holds 'key "value";' pairs
                attributes = {}
                for item in fields[8].split(';'):
                    key_value = item.strip().split(' ', 1)
                    if len(key_value) == 2:
                        key, value = key_value
                        attributes[key] = value.strip('"')

                gene_id = attributes.get('gene_id')
                transcript_id = attributes.get('transcript_id')
                exon_start = int(fields[3])  # 1-based, inclusive
                exon_end = int(fields[4])    # 1-based, inclusive
                gtf_data.append([gene_id, transcript_id, exon_start, exon_end])

        gtf_df = pd.DataFrame(gtf_data, columns=['gene_id', 'transcript_id', 'exon_start', 'exon_end'])
        return gtf_df
          
        

Explanation: This function iterates through the GTF file, extracting only exon entries for each gene and transcript, and storing them in a DataFrame. The DataFrame contains columns for gene ID, transcript ID, and exon start/end positions, which are used later to find splice junctions.

说明:该函数遍历GTF文件,仅提取每个基因和转录本的外显子条目,并将其存储在DataFrame中。DataFrame包含基因ID、转录本ID和外显子起始/结束位置的列,这些列稍后用于查找剪接连接。
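The attribute column is the part of a GTF line that is easiest to get wrong, so a standalone sketch of that step may help. It uses a regular expression instead of splitting on spaces, which keeps quoted values that themselves contain spaces. The pattern, the function name, and the identifiers in the toy line are hypothetical placeholders; repeated keys (such as multiple `tag` entries) would overwrite one another in this simple version.

```python
import re

# Matches: key "value"   (as found in column 9 of a GTF line)
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attribute_field):
    """Return a dict of the quoted key/value pairs in a GTF attribute column."""
    return dict(ATTRIBUTE_PATTERN.findall(attribute_field))


# Toy attribute column with made-up identifiers
line = 'gene_id "ENSG00000000001"; transcript_id "ENST00000000001"; gene_name "TOY GENE";'
attrs = parse_attributes(line)
print(attrs['transcript_id'], '|', attrs['gene_name'])
```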

FASTA File Parsing (Transcript Sequence Dictionary):

FASTA文件解析:

Purpose: Create a dictionary mapping transcript IDs to their full mRNA sequences.

目标:创建一个字典,将转录本id映射到其完整的mRNA序列。

Original Code:

源代码:


          
    from Bio import SeqIO

    # Read the FASTA file into a dictionary keyed by transcript ID.
    # fasta_file is the path to the decompressed cDNA FASTA file.
    transcript_sequences = {}
    for record in SeqIO.parse(fasta_file, "fasta"):
        transcript_id = record.id.split('.')[0]  # drop the version suffix, e.g. ".1"
        transcript_sequences[transcript_id] = str(record.seq)
          
        

Explanation: This snippet reads the FASTA file using Biopython’s 'SeqIO.parse()', building a dictionary where each key is a transcript ID and the value is the corresponding cDNA sequence. It also removes any version numbers from transcript IDs to ensure consistency with the GTF file.

说明:这段代码使用Biopython的SeqIO.parse()读取FASTA文件,构建一个字典,其中每个键是一个转录本ID,值是相应的cDNA序列。它还从转录本ID中删除版本号,以确保与GTF文件的一致性。
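For readers without Biopython, the same dictionary can be built with the standard library alone. The sketch below is an illustrative equivalent, not the project's code: `read_fasta` is a hypothetical name, the toy records use made-up identifiers, and the header handling assumes the transcript ID is the first whitespace-separated token after `>`.

```python
def read_fasta(lines):
    """Build {transcript_id: sequence} from an iterable of FASTA lines."""
    sequences, current_id, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if current_id is not None:
                sequences[current_id] = ''.join(chunks)
            # First header token, minus the ".N" version suffix
            current_id = line[1:].split()[0].split('.')[0]
            chunks = []
        else:
            chunks.append(line)
    if current_id is not None:
        sequences[current_id] = ''.join(chunks)  # flush the last record
    return sequences


toy = [
    ">ENST00000000001.2 cdna toy-record",
    "ACGTAC",
    "CACC",
    ">ENST00000000002.1 cdna toy-record",
    "GGGT",
]
print(read_fasta(toy))
```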

Finding Splice Junctions (Cumulative Exon Lengths):

剪接位点寻找(累积外显子长度):

Purpose: Identify the positions of exon-exon junctions in the cDNA sequence.

目标:确定cDNA序列中外显子-外显子连接的位置。

Original Code:

源代码:


          
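    # Cumulative exon lengths give the exon-exon junction positions in the cDNA.
    # Reconstructed from the explanation below (assumption): exons are kept in GTF
    # order, and GTF coordinates are 1-based and inclusive, hence the "+ 1".
    transcript_exons = group[group['transcript_id'] == transcript_id]
    exon_lengths = transcript_exons['exon_end'] - transcript_exons['exon_start'] + 1
    exon_cumulative_lengths = exon_lengths.cumsum().values
    sequence = transcript_sequences[transcript_id]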
          
        

Explanation: This calculates the cumulative lengths of exons to simulate the positions of exon junctions in a concatenated mRNA sequence. The 'cumsum()' method creates an array of cumulative exon lengths, which allows the identification of where one exon ends and the next one begins.

说明:这计算外显子的累积长度来模拟外显子连接在连接的mRNA序列中的位置。cumsum()方法创建一个累积外显子长度的数组,它允许识别一个外显子结束和下一个外显子开始的位置
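A small worked example shows how cumulative exon lengths translate into junction positions. It uses `itertools.accumulate` from the standard library in place of pandas' `cumsum()`; the function name and the toy coordinates are hypothetical, and the exons are assumed to be listed in transcript order.

```python
from itertools import accumulate


def junction_positions(exons):
    """Return the 0-based cDNA offsets at which a new exon begins.

    exons: list of (start, end) genomic coordinates, 1-based and inclusive
           as in GTF, already in transcript order.
    """
    lengths = [end - start + 1 for start, end in exons]
    cumulative = list(accumulate(lengths))
    return cumulative[:-1]  # the last value is the transcript length, not a junction


toy_exons = [(100, 109), (200, 204), (300, 319)]  # exon lengths 10, 5, 20
print(junction_positions(toy_exons))  # [10, 15]
```

Three exons give two junctions: the second exon starts at cDNA offset 10 and the third at offset 15.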

Extracting and Checking Splice Junction Sequences:

剪接位点序列的提取与检测:

Purpose: For each splice junction, extract a window of bases surrounding the junction and check for the "ACC" sequence.

目标:对于每个剪接位点,提取位点周围的碱基窗口,检查“ACC”序列。

Original Code:

源代码:


          
    # Check whether "ACC" occurs within 35 bases upstream or downstream of any splice junction
    transcript_success = False
    # The last cumulative length is the transcript end, not a junction, so it is skipped
    for i in range(len(exon_cumulative_lengths) - 1):

        # Clamp the window to the bounds of the cDNA sequence
        start = max(0, exon_cumulative_lengths[i] - 35)
        end = min(len(sequence), exon_cumulative_lengths[i] + 35)

        splice_site_seq = sequence[start:end]

        if 'ACC' in splice_site_seq:
            transcript_success = True
            break  # one matching junction is enough for this transcript

    # If any transcript fails, the gene is considered unsuccessful
    if not transcript_success:
        gene_success[gene_id] = False
        break
          
        

Explanation: For each exon junction, a window of 35 bases on either side is extracted, using the cumulative exon length as the junction position. The code then checks whether "ACC" occurs in this window. The 'max()' and 'min()' calls keep the window within the bounds of the sequence. The second method, in which a gene counts as successful when 50% of its transcripts succeed, follows the same idea, so it is not detailed here.

说明:对于每个剪接位点,使用累积外显子长度作为连接位置提取特定碱基数量的窗口。然后,该函数检查该窗口中是否存在“ACC”。max()和min()函数确保窗口不会超出边界。在第二种判断方法中,我们认为50%的转录本成功为总体成功,代码思路相似,不再赘述。
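The window check can be isolated into a function, which makes the flank size and the motif explicit parameters. This is a sketch under assumptions: the function name and its parameters are hypothetical, and the defaults simply mirror the 35-base window and the "ACC" motif used in the snippet above.

```python
def transcript_has_motif(sequence, junctions, motif='ACC', flank=35):
    """Return True if any junction window of +/- flank bases contains the motif."""
    for pos in junctions:
        start = max(0, pos - flank)              # clamp at the 5' end
        end = min(len(sequence), pos + flank)    # clamp at the 3' end
        if motif in sequence[start:end]:
            return True
    return False


seq = 'G' * 50 + 'ACC' + 'G' * 50   # the motif occupies offsets 50-52
print(transcript_has_motif(seq, [60], flank=10))  # window 50..69 -> True
print(transcript_has_motif(seq, [70], flank=10))  # window 60..79 -> False
```

Note that a motif is only detected when it lies entirely inside the window; one that straddles the window edge is not counted.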

Success Criteria for Genes and Transcripts:

基因和转录本的成功标准:

Purpose: For each transcript, ensure that at least one junction contains "ACC". A gene is considered successful only if **all** its transcripts meet this criterion.

目标:对于每个转录本,确保至少一个连接包含“ACC”。只有当一个基因的所有转录本都符合这个标准时,它才被认为是成功的。

Original Code:

源代码:


          
    gene_success = {}             # gene_id -> True/False
    total_transcripts_count = 0   # number of transcripts visited

    # Iterate through each gene
    for gene_id, group in gtf_df.groupby('gene_id'):
        gene_transcripts = group['transcript_id'].unique()
        gene_success[gene_id] = True  # Initialize to success

        # For each transcript within the gene
        for transcript_id in gene_transcripts:
            total_transcripts_count += 1  # Increment counter

            if transcript_id not in transcript_sequences:
                # No cDNA sequence available: the gene is counted as unsuccessful
                gene_success[gene_id] = False
                break
          
        

Explanation: The outer loop iterates over each gene, and the inner loop iterates over each transcript within the gene. If any transcript has no sequence in the FASTA dictionary, or none of its junction windows contains "ACC", the entire gene is marked as unsuccessful.

说明:外层循环遍历每个基因,内层循环遍历基因内的每个转录本。如果任何转录本在FASTA字典中没有对应序列,或其所有剪接位点窗口中均不含“ACC”,则整个基因被标记为不成功。

Final Success Rate Calculation:

最终成功率计算:

Purpose: Calculate the proportion of genes that were successful.

目标:计算成功基因的比例。

Original Code:

源代码:


          
    # Calculate the Proportion of Successful Genes
    success_rate = sum(gene_success.values()) / len(gene_success)

    print(f"Total number of transcripts processed: {total_transcripts_count}")
    return success_rate
          
        

Explanation: The success rate is computed by dividing the number of successful genes by the total number of genes processed.

说明:成功率是通过将成功基因的数量除以处理的基因总数来计算的。