Software | Ionis-Paris

SafeRNA

Introduction

Safety is one of the most important issue to be adressed innovation-wise. From the very start, CAP'siRNA expressed the idea of specific viral targeting through the design and engineering of silencing RNA precursors (siRNA precursors). Long 21 to 25 nucleotides, siRNAs can only express their activity when 100% stringence is found with the target site [1]. This specificity allows extreme efficiency in neutralizing desired mRNA [1, 2], but also raises the question of off-targetting. The chances of a non-targetted mRNA falling victim of a siRNA by sheer coincidence are slim but not null given the number of living organisms on Earth. A key point to safely spray siRNA precursors in the environment is therefore to ensure that no living organism aside from the one/those targetted can suffer from it. In short, we need to ensure that our RNA interference-based pesticide does not repeat the mistakes of the past-used pesticides (neonicotinoids, glyphosate, ...). SafeRNA was built for the purpose of ensuring the safety for humans to work with our product, but also the safety of the soil microbiota too often forgotten, as well as pollinators, insects, etc.

Flowchart of the SafeRNA program

Principle of SafeRNA

SafeRNA is a bio-informatic pipeline designed to be able to :

Collect coding sequences from desired taxa
Simulate targeting of siRNA toward the collected taxa
Analyze results and provide insight on how to improve the siRNA sequences

SafeRNA has a dedicated tool for each of those tasks, all three written in python, with the only necessary library being Biopython. It is however required to download the datasets.exe file from NCBI to be able to use the sequence collector tool [3]. Each of those programs is designed to solve a specific problematic.

siRNA sequences may change

The siRNA sequences produced will never be the exact same as the length at which they are cut by the RISC complex varies between 21 to 25 nucleotides. The program `main.py` uses an algorithm to predict all possible 21 nucleotides long sequences that can be obtained from the DICER enzyme activity, wherever it may cut the siRNA double stranded precursor.

Collecting relevant taxa omic informations

From the litterature, it is relatively accessible to find the microbial and fungi taxa involved in the soil microbiome positive activity toward cultures. However, collecting omic data for all of them can prove tedious and impractical with a larger number of species. The program `get_data.py` leverages the datasets.exe executable file from NCBI command line tools to massively download large lists of coding DNA sequences from a given list of taxa, and make them ready for analysis.

Analyze siRNA safety toward the microbiome

siRNA can only work if their complementarity is perfect [1]. Thus, to check whether a siRNA sequence may be off-target, all possible siRNA sequences, obtained from `main.py` are aligned against the database downloaded previously with `get_data.py`. If the alignment is successful, meaning if the 21-nucleotides sequence is found in one of the genome's coding sequence in the database, then this sequence is deemed a "hit". A report then gives insight on the hits found in the database such as their number, the genome file location and a direct link to the species/strain assembly. A detailed user guide can be found in the README on the iGEM software tool repository for details on the practical use of SafeRNA.

Methods

The genomic data collection is performed using the command line tool from NCBI, datasets.exe, and based on publications studying the microbiota populations found in Beta vulgaris cultures [5]. The list of bacteria and fungi taxa is gathered from the litterature, and only the corresponding genomic sequences for the entire taxa are downloaded from NCBI. We limit the downloading of sequences to after 2010, in order to avoid downloading too many poor quality DNA or too many incomplete genomic sequences in our database.
The taxa were gathered based on the relative abundance found in the study of Peter Kusstatscher et al. in 2019, describing the microbiome found in Beta vulgaris cultures. All the coding sequences found in each taxa and sequenced between 2010 and 2024 were downloaded to build a database.
The siRNA sequences are obtained through a naive algorithm gathering all possible 21-nucleotides sequences from a precursor using a sliding window of 21 nucleotides along the precursor RNA sequence. All siRNAs are then assigned an ID, and finally written in a file in fasta format ready to be used.

Figure 1: Scheme of the sliding window program generating all possible siRNAs sequences from the target sequence

A BLASTn (Basic Local Alignment Search Tool for nucleotides) alignment is then conducted between the genomes and the siRNAs files using the `short-word` parameter. Local BLAST normally discard short sequences, and this parameter allows them to be taken into account. A BLAST alignment uses the BLAST algorithm to perform an alignment between sequences and find similarities or complementarity between the two of them. This means that both alignments shown in Figure 2 are considered "hits", given the nature of the double stranded RNA we are producing through synthetic biology.

Figure 2: Scheme of alignment counted as hits in BLAST

Results

The database gathered more than 50,000 genomes coding sequences from 17 different taxa of bacteria and fungi, resulting in a 200Gb database.

Figure 3: Taxa repartition of the 50,000 genomic coding sequences retrieved among the 17 taxa

There is an uneven repartition of sequences, especially regarding fungi taxa (Tetracladium, Plectosphaerella, Vishniacozyma, Tausonia and Mrakia, do not have any available genomic coding sequences). This shows the difference in the number of sequencing analysis done between bacteria and fungi, aside from metagenomics, which are often dedicated to taxa identification and abundance measurement.

The way cell mechanisms' processes the precursor of siRNAs with the DICER enzyme is random, and as such, where the production of iRNA starts and finish is not fixed. Thus the window from which siRNAs are derived is not fixed and may change slightly. Consequently, many different sets of 21 nucleotides-long siRNAs can be derived from a single sequence. The general formula for the number of 21-nucleotides long siRNAs that can be obtained in a sequence is the following:

$$n_{siRNA} = L_{target} - 21$$

We target RNA-dependant RNA-polymerase (RdRp) and the p21 sequences from the BYV (Beet Yellows Virus), respectively 1302 and 408 nucleotides long. Therefore the number of different siRNAs sequence possible to obtain from those sequence is respectively 1281 and 387 for our targets.

Runtime of aligning 50,000 assembly on all the siRNAs sequences took 36 hours. 8 hits were found in the entire database from 8 different bacterial strains. Each time only one siRNA amongst all those possible for both the RdRp and the p21 sequences was involved. This means the probability of an off-target siRNA to be produced is $P = \frac{1}{n_{siRNA}}$, and thus, extremely low. In this precise case, this means the probability of these 8 siRNAs to be generated is $P = \frac{8}{1282} = 0.006$, thus about 0.6% of chances. And in addition to this very low probability, the rogue siRNA would have to encounter exactly the strain of bacteria it targets! To further control this risk, we conducted the following analysis.

To complete the result analysis, we analyzed the strains found as potential off-targets and their geographical localization (as a geographical filter is not included at the moment in the search and download program). Consequently, it appeared that none of the bacteria species and strain found as off-target can be found in Europe : all 8 of them are found in Asia (mainly Korea, Japan and China). This information, combined with the low probability of the off-target siRNA sequence being generated, gives further reassurance of how specific inteference RNA is toward the BYV.

In case hits were found, depending on their number and localisation, they could be simply taken out of the targetted sequence (making it shorter). Another solution would be to shift the targetted window, changing the gene target sequence, to avoid the creation of those off-target siRNAs. In addition to this mass analysis, the Human and the Sugar Beet (Beta vulgaris) genomes were also tested against our siRNAs, and no off-targets were found. This verifies that we do not risk to harm the very culture we sought to save, but also that we can safely manipulate our own siRNAs... Meaning, the farmers would finally manipulate a pesticide causing no harm to them!

Conclusion

While this database analysis allows a good primary overview of whether collateral targets are being threatened by our siRNAs, it is far from comprehensive. Ideally, important shotgun transcriptomics would be needed for each field where the product is to be sprayed in the near future, to adapt to the microbiota present in specific geographic area. The supply chain could be adapted from these results, and consequently this program should be strengthened for future applications.

Aphidisperse

Introduction

The main vector for the Beet Yellows Virus (BYV) is the aphid Myzus persicae. CAP'siRNA offers a targetted treatment to this virus, with maximized efficiency during the early stages of plant contamination. The best case scenario would therefore be to treat the beets a few days prior aphids invading the field, preventing them from transmitting the virus initially. The problematic here is therefore to be able to predict the time of arrival of the aphids in beet plantations. Migration modelling for M. persicae is not a novel problem, and models were published in the litterature in 2004 by Aiming Qi et al. [1] and 2023 by M. Lucquet et al. [2]. While being the most efficient models developped to this day, they do not include visualization for farmers to use and base their treatments on. It is to tackle this issue that we developped Aphidisperse, a Python framework using an agent-based model combined with a lattice visualization (i.e. grid visualization) to provide insight on possible migration of Myzus persicae for the upcoming year.

Flowchart of the Aphidisperse program

Review of Myzus persicae life-cycle and migration variables.

Myzus persicae is a resistant, highly adaptable r-strategist pest (i.e. short life cycle/expectancy, fast growth rate) [3]. It is the privilegied host of more than 100 different phytoviruses, making it a dangerous vector for plant diseases in more than 30 major cultures [4, 5]. As an aphid, its complex life cycle alternates sexual (just before winter) and asexual reproduction (mostly during summer), but also oviparous (during winter) and viviparous cycles (during summer) [4]. Myzus persicae migration is mostly realized by winged individuals, obtained exclusively from sexual reproduction. This reproduction mode occurs massively before winter, to increase the odds of finding a suitable host for winter. However it also occurs during summer at peak reproduction rate, when the aphid's population density on a plant becomes too important. When this happens, alate specimens are produced to migrate and create populations on other plants, and the cycle starts again. Therefore one of the most important migration variable for a population of aphid is population density in a given location [4]. Another important element to factor in is that M. persicae only hibernate in hosts of the genus Prunus spp., which are therefore the starting point of the summer migration [4].

One of the key factor to model M. persicae dispersion is to determine the most probable date of the first migration post-wintering. The development temperature threshold in litterature is 4.3°C in average. M. persicae developments needs at least 16 added days degrees to develop completely [3]. Below it, there is no development of the aphids [4]. Added day degrees are counted as follow. If the threshold for development to occur is 4.3°C and the average temperature in a day is 7°C, then the added day degree is $7-4.3 = 2.7$. Aphids need a total of 16°C to complete their development, so after this day they would have 14.3 degree missing.

However for take-off to happen, the temperature has to be at least between 14 to 16 °C [3, 6]. Therefore, aphid migration requires that the average temperature does not drop too often below the average of 4.3°C (i.e. not too many freezing events) and that, at some point during the year, the average daily temperature reaches 14 to 16 degrees.

Aphid actual migration highly depends on the climate conditions. Distance and duration depends mostly on windspeed and consistency : airstreams as weak as 2 m/s are strong enough to have them start a migration flight [3]. A migration lasts for 30 minutes to 3 hours on average, and occur only once in a lifetime [7, 8, 9]. Therefore an estimate of the distance has been theorized by Parry, H.R. et al. [3] with $distance = windspeed * duration$. Thus, windspeed data is crucial in deriving a function giving the mean distance traveled by M. persicae from a given point. Windspeed changes with altitude, and aphids have an average flight height of 12.2m, although specimens have been found up to 1km high [3, 10].

Many other variables impact aphid's flight: rainfall, plant's concentration, terrain topography, predators, chemicals... [13]. In short, aphid's flight is a complex multivariate behaviour, requiring important amounts of data to obtain reliable results. Finally, the most important factor for migration to happen as said before is the local population density, and therefore, modelling the population growth dynamic of M. persicae is a key element in modelling the global migration.

Population dynamics: How can we model Myzus persicae population increase?

In population dynamics, a population's number of individuals can be modeled using either the intrinsic rate of increase usually called $r$. It describes the theoretical maximum rate of growth of a population per individual and per day [5]. Thus, the following differential equation can be used to describe a population dynamics: $$\frac{dN}{dt} = rN$$ With:

$N$ the population size
$r$ the intrinsic rate of increase
$\frac{dN}{dt}$ the population's growth rate

However, for easier computation and understanding, we will use the finite rate of increase of a population to model Myzus persicae populations, in order to express it as a rate of increase per day $\lambda$. We can therefore obtain the expression of a population $N$ at a time $t$: $$N_t = \lambda^t N_0$$ With $N_0$ the starting population size and $N_t$ the population size at a time $t$. Hereafter is a summary of the parameters determined and used in litterature to model the population dynamics of M. persicae.

Parameter/Articles	Hong F. et al. [12]	[12]	[12]	[12]	[12]	Williams CT [13]	Özgökçe MS et al. [14]
r (d^-1)	0.3831	0.3422	0.3616	0.3595	0.3595	0.1189	0.2748 (std = 0.03)
λ (d^-1)	1.4669	1.408	1.4356	1.4327	1.3781	1.1262	1.31606 (std = 0.04)
Culture temperature	23°	23°	23°	23°	23°	11°	25°
Country	China	China	China	China	China	United Kingdom	Brazil
Plant	Vicia faba	Capsicum annuum	Raphanus sativus	Nicotiana tabacum	Brassica campestris	Beta vulgaris	Capsicum annuum

The table and the litterature on which it is based[12, 13, 14] allow to strongly advocate to consider temperature as the most important factor for aphids population growth rate. Given that the climate in France is more similar to that of the UK respect to the one found in China and Brasil, we will consider a weighted average of the $\lambda$ values as follows: $$\lambda_{France} = \frac{\sum \lambda_{China} + \lambda_{UK}*k + \lambda_{Brasil}}{n_{\lambda}+k}$$
$$= 1.16576\ d^{-1}$$ With $k = n_{lambda} = 7$ here to emulate an equal importance between the value from the UK sample and the other 6 samples.

Model hypothesis

The goal of Aphidisperse it to build a framework modelling M. persicae flight behaviour. However, time constraints and data availability severely restrain the possibility of what can be implemented with enough thoroughness to be built upon by future iGEM teams, or ourselves if we were to improve the model later on. As a consequence, for the model version presented here we make multiple assumptions and omissions to propose a baseline model on which can be built upon easily (and typically, adapt it without problems to other species).

We assume the migration of non-alate aphids to be negligible
We assume the behaviour of alate aphids to be passive and the equiprobability of direction's chosen for migration from a population.
Due to insufficient and unreliable data available publicly and unresponse from private source, windspeed and luminosity variation will not be taken into account. The distance will be a stochastic estimation based on the litterature values.
Variable such as plant's concentration, terrain topography, predators and chemical use will be neglected, due to the lack of available data and/or their debatable quantitative effect on aphid's flight in litterature.
We assume aphid's growth rate follows strictly the function derived previously.
We assume negligible predation/effect of pesticides on aphids population growth.
We assume that only temperature determines the threshold for migration starting date.

While those assumptions make the model presented here unapplicable in real life, the program is built so that once the variable quantitative effect is determined it can easily be implemented. Therefore, once the lacking data has been gathered and given time to process, gaps in the model could be filled. It is also important to note that there are no "verification data", summarizing the date of arrival of aphids in France at given location based on empirical observation for a few years, that would allow verification of the model correctness, or a Machine Learning based approach.

Model parameters

The model combines a lattice with an agent-based model. The lattice's units are rectangles of 30x50 kilometers (km). This parameter can be changed if needed, however squares this big were chosen to accomodate the inequal distribution of the 3511 meteo stations from which the temperature data was collected. Hereafter is a display of all stations, in France (Data obtained from Meteo France). We consider the data from Corsica irrelevant as it is unlikely for aphids to cross the Mediterranean sea and migrate in the mainland.

Figure 1: Map of all the meteorological stations in France between 2000 and 2022 with lattice

Despite the number of stations and the size of the squares, the stations' uneven distribution led to gaps in data of several years without information on temperature in some cases. In order to correct those data, we apply a step of region growing on those squares. Let $u = \left\{p_{x_1y_1}, ..., p_{x_ny_n}\right\}$ the subset of n pixel with the coordinates (x, y) where wheather data is insufficient. Let $V(p_{xy}) = \left\{v_{i_1j_1}, ..., v_{i_nj_n}\right\}$ a subset of size m, containing all the valid neighbours $v_{ij}$ of a pixel $p_{xy}$. Finally, let $\sigma_v$ the standard deviation of the temperature between a pixel and their neighbours. We have therefore: $$\forall p_{xy} \in u, T(p_{xy}) = \overline{V(p_{xy})} + \sigma_v$$ To determine the value of $\sigma_v$ we look at the graph of the temperature difference ($\Delta T$) distribution of the neighbours toward the center pixel:

Figure 2a: $\Delta T$ distribution of neighbours relatively to their central pixel

Figure 2b: Scheme of a pixel P with missing data around its neighbour with working data

The distribution ressembles a normal distribution, thus the value $\sigma_v$ follows a normal distribution of parameters $\mu$ and $\sigma_{\sigma}$, with $\mu = \overline{V-T}$ the average temperature difference between a central pixel and its neighbour : $$T(p_{xy}) = \overline{V(p_{xy})} + \biggl[\sigma_v \thicksim \mathcal{N}(\mu,\ \sigma_{V-T}^2)\biggr]$$

Every square on the map has been attributed an average daily temperature, based on meteorological data or approximated thanks to their neighbours, from 2000 to 2022, totalling 8400 values for each of the 477 squares.
Now, we would like to determine the date at which the aphids will be able to take off. The temperature threshold is 4.3 degrees, but to minimize false positive we take 5 degrees as the threshold for development. The onset for aphids take off is the day where the derived accumulated day degrees are over 16 degrees ([Σ(daily average temperatures – 5 °C)]) AND the average temperature is over 16 degrees. From 2000 to 2022, we determine the date of flight onset this way.

The idea now would be to derive a function to be able to estimate the approximative onset date of aphid migrations. We use data from 2000 to 2022 from which was derived the most plausible onset date. To do so we apply regression based approaches as shown in Figure 3. Precision is calculated based on Root Mean Standard Error (RMSE).

Figure 3: Comparison of standard linear regression with hand-made implemented sinusoidal regression for predicting ability of the onset date for 2023 based on previous years.

Figure 3a: Linear regression estimate of flight onset. RMSE = 24.4 days

Figure 3b: Sinusoidal regression estimate of flight onset. RMSE = 19.6 days

We implemented a Grid Search parameter optimization in order to find the best possible parameters for the Sinusoidal regression to predict the flight onset date. This program is run prior to the model, and the parameters for the sinus function are stored for each square, and will be used afterward to determine the onset date for a given year.

Figure 4: RMSE difference between a linear regression and sinusoidal regression for the prediction of flight onset date. In blue is the RMSE of standard linear regression, and in red is the RMSE of sinusoidal regression

Aphidisperse being an agent based model: an agent here will be an aphid population of relative density. The density of this population (= the agent) increases over time based on the equation $N_t = \lambda^t N_0$ with $\lambda = 1.16576\ d^{-1}$. Once density reaches a fixed threshold of population density, winged aphids are produced and will migrate, just like it happens in reality. As mentionned before, we assume equiprobability of direction, so the offspring will migrate in the 8 possible cardinal directions (North, North-East, North-West, South, South-West, South-East, West, East). The lattice each day displays the total density of populations in it as shown in Figure 5a and b.

Figure 5: Example of the agent state and square density update process in the aphidisperse program

Figure 5a: Example of aphid population starting state on an empty lattice

Figure 5b: Example of aphid population reaching maximum density and migrating in other squares at state n+1

Finally, as mentionned previously, aphids spend most of the winter hidden in trees from the genus Prunus. Two genus are commonly found and cultivated in France: Prunus persicae and Prunus nigra. Therefore we derived the initial density of aphids populations based on the density of Prunus culture in France in each square [15].

Results

From the density of Prunus culture we obtain the following distribution of aphid's hivernating densities during winter.

Figure 6: Starting state for aphid's density based on Prunus culture density in France

It appears clearly that the aphid's population is concentrated in the South of France during winter, which is also where the climate is generally warmer. It allows them after hivernating to quickly reach the developmental state where they can migrate and disperse throughout France. Since Prunus nigra is cultivated in the West and East of France, it offers shelter for M. persicae during winter that is also very close to sugar beet fields. It will be interesting to see if there is a difference in impact between aphids coming from the South and those coming from the East or West. West originating aphids appear closer, but will have a flight onset delayed compared to the South specimens.

Figure 7: Comparison between the predicted aphid flight onset date by aphidisperse for 2023, compared to the actual date given

Figure 7a: Heatmap of earlyness of the onset date of Myzus persicae predicted for 2023

Figure 7b: Heatmap of the actual flight onset date distribtution of 2023

Figure 7c: Normalized heatmap for predicted onset of 2023

Figure 7d: Normalized heatmap for actual flight onset of 2023

Differences can be seen between Figure 7a and Figure 7b. The onset date predicted in Figure 7b seems slightly overestimated (by ~20 days) respect to the actual onset date that took place in 2023. If we take a look at the normalized values of the colorbars, clusters appear in relatively identical places. However, the prediction displays erratic dates prediction, with squares predicted to hit their threshold 50 days earlier than their neighbours. In short, the prediction model for the onset temperature is heterogeneous in its prediction, while the actual temperature is more homogeneous on the entire territory as displayed in Figure 7d.

Conclusion

This model gives an important idea of where aphids transmitting the virus are originating in France, highlighting the importance of the starting parameters. While not accurately depicting the full reality for the moment, the starting parameters are derived from published work and/or government information and reflect their reality. Now that the framework is established, working on improving the model by gathering extensive amount of data from variables that we were not able to take into account. Future work could allow the production of an accurate and comprehensive visualization of Myzus persicae migration dynamics. Considering the major impact of this pest on culture, achieving such model would be of significant help to farmers even beyond sugar beet exploitation.

PrediRNA

Introduction

RNA interference only works when the small RNA sequence is perfectly stringent toward its target. If even one base amongst the 21 is not paired correctly, the RISC protein complex will not recognize the binding, and the virus genome will not be degraded. Consequently, the two parameters on which depends the long-lasting of our solution for farmers are the virus' sequences we aim, and the rate of mutation of the virus, which are known to be very high. To maintain oversight on those parameters, we bio-informatically estimated the proportion of siRNA effectively targetting the Beet Yellows Virus (BYV) over the years, and developped an algorithm able to predict the most probable sequences in the upcoming years.

Flowchart of the PrediRNA program

Principle of PrediRNA

As target sequences of the virus (and thus, template for the siRNA's production), we chose to use the RNA-dependent RNA-polymerase (RdRp) and the p21 sequence. Their role is respectively to translate the proteins of the virus, and to inhibit the interfering RNA defence system of the plant. We chose those sequences for both practical and tactical reasons. Our idea is to make our solution efficient in both anticipation and remediation of a BYV invasion. Thus, if the virus transmission could be anticipated (for example, using the Aphidisperse program), our engineered interfering RNA precursors will have been sprayed on sugar beets. Thereof, when the virus hits, the siRNAs will already have been generated by the plant's RISC and DICER complexes. In particular, the sequences targetting the RdRp will have a boulevard to neutralize the virus. However, if the virus is already in the plant, the p21 is active. Consequently, we will more rely on the p21 targetting sequences, as they should neutralize the p21 production, allowing the plant to defend itself against the virus. One sequence is thus targetted to prevent the invasion, while the other is thought as a potential curation in case of late administration. To achieve this, having reliable siRNAs targetting properly given sequences is crucial. Hence why we need to be able to predict RdRp and p21's sequences change and efficency over time.

The train of thought of the program is as follow. First, we need to get insight on the BYV patterns of mutations for the RdRp and p21 sequences to determine which model is most adapted (exhaustive, probabilistic, stochastic, ...). To achieve this, a Multiple Sequence Alignment (MSA) is performed in seaview.exe [1] using the Muscle algorithm (v3.8.31). Then a Position Specific Scoring Matrix (PSSM) is built to estimate the probability of each coding element appearing at a given position, from which is derived the Shanon Entropy for each position. It measures the quantity of conserved information over time. The higher the score, the more conserved the nucleotide, and reciprocally. This allows to estimate the conservation of nucleotides, but also of amino-acid and codons, and thus determine the kind of mutations occuring most in the virus.

Once this step is realised, from the results can be derived the best approach. If the amino-acids are not specifically conserved, a program based on Hidden Markov Model (HMM) is a good approach as was done in the litterature previously. However, if amino-acids are very conserved, this paves the way for an exhaustive approach, and thus to calculating all possible combination of mutations for one or two cycles depending on computational power and the number of mutations. Finally, to adjust the temporality of the model, it is required to determine the mean evolution molecular rate, which corresponds to the rate of nucleotide's substitution in a sequence. This was done using three software. BEAUti (v10.5.0) for the data preprocessing and parameters selection, BEAST (v10.5.0) for the calculations and Tracer (v1.7.2) for the results analysis [2]. Based on those results, we could determine the rate of evolution of the BYV, and model the occurrence of mutations using a Poisson Law of probability.

Finally, based on the empirical data gathered from the public genome databases, we modelled the quantity and proportion of siRNA that are efficient as time passes by, and the virus sequence changes. In addition, based on this rate of evolution, we tried to infer the worst case scenario of substitutions occuring in the virus genome, and compared it with the efficieny rate derived from the empirical data obtained from NCBI. Finally, we can try to model the effectiveness of the siRNA over time if we were to engineer them from the last known sequence, based on the mean evolution rate and the mutation's probability determined with Poisson's laws./p>

Methods

1. PSSM calculation

Let a MSA of size $N$ the number of sequences of length $L$, composed of elements from an alphabet of size $A$. The formula to calculate the position specific score of alphabet letter $a\in A$, and column $i\in I$ in a PSSM with a pseudo-count of 1 is : $$\omega_i(a) = \frac{n_i(a) + 1}{N + A}$$

With $n_i(a)$ the number of occurences of the letter $a$ at a position $i$ in the MSA. $\omega_i(a)$ is the weighted probability of having the letter $a$ in position $i$. The +1 is so that there are no null-probabilities.

2. Shanon Entropy

The Shanon Entropy measures the conservation of a column of position $i$ in the MSA. It is obtained from the following formula :$$s_i = log_2(A) + \sum_{a\in A} \omega_i(a) + log_2\omega_i(a) \\ \iff S = \left\{s_1, s_2, ..., s_L\right\}$$ From this formula we obtain a vector $S$ of the entropies $\forall i\in I$.

3. Poisson's law of probability

Let $r$ the rate of mutation events over $n$ years. The parameter $\lambda$ of the Poisson law is therefore derived as $\lambda = r*n$. The probability of $k$ mutations occuring in $n$ years, for $X$ a random variable following a Poisson law is therefore obtained with:$$P(X = k) = \frac{\lambda^{k}}{k!}*e^{-\lambda}$$ From which we derive the formula for the probabilities of $X = [X_1, X_2, ..., X_n]$, the vector of random variables $X_i\in X$, each following a Poisson law of parameter $\lambda_i$. Thus the probability for $n$ random variables to each have over or equal to $k$ mutations is written : $$P(X_1^n) = \prod_{i=1}^{n} P(X_i \geq k)$$

4. Linear loss of efficiency of siRNA in the worst case scenario

Assuming the rate of mutation $r$ of the Beet Yellows Virus remains constant over time and under the application of $\theta_0$ siRNAs. If mutations occur in the spots neutralizing as many siRNAs as possible, spaced by exactly 21 nucleotides and not on the edges of the sequence (which is the worst case scenario), the linear function describing the number of siRNAs still targeting the virus is : $$\theta_t = -21rx+ 21r\sigma_0x_0$$ Where $x$ is the year for which we want the number, and $x_0$ the year in which the analysis starts.

The 14 sequences of RdRp and 12 of p21 have been retrieved from the NCBI database.

Results

1. Mutation profile of the RdRp sequence and p21 of BYV

To determine the type of mutation profile followed by the sequence of RdRp and p21, we applied the formula for the Shanon entropy on the MSA we built for both target sequences. In the MSA of the RdRp 12 sequences from 1993 to 2021 were compared. The date range was the same for the p21, but only 10 sequences were available. From the application of the formula, we derived heatmaps of the substiutions of amino-acids, codons and nucleotides visible in Figure 1, from which we can assume a propensity of substution type depending on the sequence studied.

Figure 1: Summary of two heatmaps for nucleotide, codon and amico-acid conservation in a MSA, and a complete heatmap of the BYV nucleotides. Values are calculated using Shanon's Entropy mentionned in the Methods. The scales are logarithmic and depends on the number of letters in the sequence's alphabet. They are not the same for each table. The maximum value indicates a full conservation (i.e. no substitution in the MSA).

Figure 1a: Conservation heatmap of the RNA-dependent RNA-polymerase

Figure 1b: Conservation heatmap of the p21-iRNA silencing protein

Figure 1c: Conservation heatmap of the BYV's nucleotides in the entire genome

In Figure 1a, the protein heatmap of the entropy for the RdRp shows that amino-acids' conservation is almost absolute. Between 1993 and 2021, there were no substitutions of amino-acids, with four exceptions which were not conserved up to nowadays. This observation is confirmed when we look at the turn-over of codons in the sequence. A similar pattern to the codon can be seen in nucleotides substitution heatmap as well. In 28 years, there have been 38 synonym substitutions, and among them, only 15 were conserved in at least more than one sequence. This is reasonable proof to make the assumption that the BYV needs this specific sequence to be invariable to an extreme extent, resulting in close to no amino acids substitutions over time in the virus.

In Figure 1b, we can see that the pattern of the entropy heatmap is completely different in general. The top heatmap shows that amino-acids substitutions are more common for the p21 sequences. This indicates that the p21 thus tolerates non-synonym amino-acid switching, which is not the case for the RdRp. In addition, the frequency of synonym substiutions in the p21 is greater than that of the RdRp as shown in the codon heatmap. In 30 years there have been 18 amino-acids substitutions, and 55 synonym codon substitutions, which account for almost twice more substitutions than the RdRp while being more than three times shorter in size (1302 nucleotides versus 415).

Comparing our results with the global entropy heatmap of the BYV genome, we observe the region where the RdRp is localized to be very conserved, while the p21 is nearby a very unstable area of the genome. These results highlight the necessity of developping different prediction methods depending on the sequence substitution behaviour. Typically, the RdRp having very few sequence changes, a naive exhaustive model can be envisioned, since the number of substitutions combinations over time will stay reasonable for a few years. On the other hand, the p21 being more versatile than RdRp a probabilistic model, based on Markov Chain is more adapted, as being exhaustive in this case would immediately lead to too many possible sequences.

Time being of the essence, and the many models predicting virus sequence evolution having been published, we focused on developping the exhaustive model coupled with an efficiency estimator. The estimator was built to accept sequences regardless of the model they originate, so we will be able to estimate the efficiency of our interfering RNA over time for it as well.

To develop the exhaustive model, we developped first an algorithm to generate the seed sequences. From either the consensus or the most recent sequence (this parameter can be chosen by the user), all the possible sequences with one codon substitution are generated. We then proceeded to determine the probability for each seed to appear in a given time frame. Since we want to model the probability of an event to occur in a set time, given a certain rate, the Poisson law of probability could be adapted. We thus now proceed to determine the average substitution rate per site and per year for the RdRp, the p21 and the entire BYV using the BEAST software suite.

In addition, we analyzed the disposition of nucleotides substitutions positions in the codons throughout the MSA. The substitution may happen in the codon position 1, 2 or 3 (CP1, 2, 3), with variable impacts on the virus. In Figure 2a we see that substitutions occur almost exclusively around codon position 3 for the RdRp on average. A similar pattern can be seen for the p21, although less obviously. Finally for the virus in its globality, no pattern appears clearly : all three positions are almost equally favored. The third position in the codons is known for being favoured for synonym substitutions. Consequently, given the prevalence of this type of substitutions in RdRp this confirms the substitution pattern followed by this sequence. The p21 follows a similar pattern but to a lesser extent, while the virus' genome in general does not follow a pattern at all. The result from this modelling analysis gives further credit to the assumption made from the results in Figure 1.

Figure 2: Average estimated rate of substitutions depending on nucleotide's positions in codonds for RdRp, p21 and the BYV's genome

Figure 2a: RdRp

Figure 2b: p21

Figure 2c: Entire BYV genome

Rates	RdRp	p21	Full BYV
Sequence size (bp)	1302	414	15300
Rate r (substitutions.site^-1.year^-1)	6.456×10^-4	7.2×10^-3	1.414×10^-3
Sequence rate r_seq (substitutions.year^-1)	0.84	2.987	22.3

As displayed in the table, the rate of substitutions is more than twice faster in the virus in general than in the RdRp sequence. In addition, the p21 rate of substitution appears much higher than the average one of the BYV. This results shows that while the choice of RdRp as a targetted sequence is an excellent one, the p21 will be much harder to target efficiently as its sequence changes very quickly.

2. Sequence prediction : example with the RdRp

Our objective is now to find a probability law to model the sequence substitutions in function of time using a Poisson Law and Shanon's entropy. We thus need to differenciate between conserved (c) and non-conserved (nc) substitutions as only conserved substitutions are of interest for us here. Even if we only consider synonym mutations, since the subject of the study is a virus, the long-term RNA conformation of it's genome is important. Some codons might therefore find themselves unsuited respect to others in regard of the structure preservation of the virus RNA genome.

Therefore we have : $$P_c(substitution) = \frac{n_c}{n_c + n_{nc}}\quad$$
$$P_{nc}(substitution) = \frac{n_{nc}}{n_c + n_{nc}}$$

With $n_c$ and $n_{nc}$ as each codon's count. Codons are considered conserved if and only if they are present in more than one sequence of the BYV RdRp. The latest sequencing is not taken into account as there are no sequence to compare it to, and so we cannot be sure of the codon's conservation state. As a result, we find 15 codons positions whose synonym substitutions were conserved among the 38 found in the MSA, and thus the 15 significant positions where substitutions are most likely to happen. Consequently, we can determine the probability of a substitutions occuring on a position where it will be conserved :

$n_c$ = 15
$n_{nc}$ = 23

$$P_c(substitution) = \frac{n_c}{n_c + n_{nc}} = 0.4 \quad $$
$$P_{nc}(substitution)= \frac{n_{nc}}{n_c + n_{nc}} = 0.6$$

We determined previously the rate of yearly substitutions for the entire RdRp $\lambda = 0.84 sub/year$. We can thus derive the following rates for $\lambda_c, \lambda_{nc}$, respectively the rate of yearly appearance of conserved and non conserved substitutions in the RdRp : And given that there are 15 positions where conserved substitutions are likely to occur, we thus determine $\lambda_c = \frac{0.336}{15} = 0.0224 sub.year^{-1}$ for each 15 sites.

Therefore the random variable $X$ describing the probability of a number $k$ of substitutions to occur in a year follows a Poisson probability law of parameter $\lambda_c = 0.0224$, which can be adjusted depending on the desired time span of the analysis: $$P(sub) = P(X = k) = \frac{\lambda_c^k}{k!}*e^{-\lambda_c} \$$
$$leftrightarrow P(X = k)= \frac{0.0224^k}{k!}*e^{-0.0224}$$

It is however important to note that not all conserved substitutions have the same probability of appearing. Some are much more conserved than others, and consequently have a higher probability of appearing, or being replaced by another codon encoding the same amino-acid. We therefore introduce the weighted value of Shannon's Entropy $s_c \in S_{conserved}$ found for each conserved codon at a position i. The entropy coefficient are determined respect the Shannon's score corresponding to a fully conserved position. Thus the number of substitutions occuring at one of the non-conserved codon site follows the Poisson probability law of parameter $\lambda_{cs} = 0.0224$ as follows: $$\forall s_c \in S_{conserved},\ $$
$$ P(X = k) = \frac{(s_i*\lambda_{cs})^k}{k!}*e^{-\lambda_{cs}*s_i}$$

Since calculating the probability of a conserved substitution given the Shannon's entropy of a site, and multiple sites have a the same entropy, we need to multiply the probability by the $c$ number of sites that were found having this entropy :$$\forall s \in S_{conserved},\ $$
$$ P(X = k) = \biggl[\frac{(s*\lambda_{cs})^k}{k!}*e^{-\lambda_{cs}*s}\biggr] * c(s)$$

We can therefore have an idea of the probability law followed by substitutions at specific positions, but would that be possible as well for multiple position? Intuitively we could think that the probability of having both substitutions i and j (that we consider being conserved here) in 1 year would be : $$P(X = k, Y = k) $$
$$= \frac{(s_i*\lambda)^k}{k!}*e^{-\lambda*s_i} * \frac{(s_j*\lambda)^k}{k!}*e^{-\lambda*s_j}$$

Thus we could deduce the more general expression, with : $$P(X = k, ..., Y = k) = \prod_{i=1}^{n} \frac{(s_i*\lambda)^k}{k!}*e^{-\lambda*s_i}$$

From the seed generated previously, we are able to merge sequences with different substitutions together. Consequently, we produce for each merging cycle all the fused sequences, with for each a unique set of substitutions on conserved positions. For each of these obtained sequence, we are thus able to associate a given probability from the probability law given previously. Overall, there is as expected, a gap in the proability of substitutions happening depending on the entropy of the codon position. For the positions with the most frequent substitutions, we find that likely for 1 to 4 substitutions to occur in 15 years. For the double substitutions, the probability is much lower, according to the formula given previously, and the distribution is visually similar.

Figure 3: Probability plot depending on the number of mutations for 5 and 15 years of time, for one or at least two substitutions. Each curves represents the substitution probability for a given value of Entropy. In red is the non-conserved substitution probability for comparison.

Figure 3a: Probabilities for substitutions of a given Entropy to occur following a Poisson Law in 5 years

Figure 3b: Probabilities for substitutions of a given Entropy to occur following a Poisson Law in 15 years

Figure 3c: Probabilities for at least two substitutions with different entropies to occur in 5 years

Figure 3d: Probabilities for at least two substitutions with different entropies to occur in 15 years

The progam takes as input the MSA, the number of substitutions cycles wanted and the number of years in the future we are interested in. The output is a fasta file containing the predicted sequences for each cycle, and a csv file for each year and each cycle containing detailed informations about the predicted sequence. It is not possible to match the number of cycles with the number of years to foresee, as the program is too computationally expensive. For example for the RdRp, if we were to try and predict all possible sequences in 30 years, the number of possibility woud be $\approx 15^3 \approx 3375$, and it is obvious that this many sequences cannot be produced industrially. Thus it would be beneficial to re-engineer interfering RNA when the sequences currently used are below a given threshold of efficiency (i.e. when there is over a given proportion of siRNA that cannot target the BYV because of substitutions). To do so, we first need to assess the evolution of the efficiency of CAP'siRNA's products over time. For this reason we developped another tool, meant to do exactly this.

This tool leverages the current knowledge on BYV's sequences to emulate the evolution of the intefering RNA efficiency if we were to engineer them based on the oldest sequence in the MSA. For example with our MSA, the sequences ranges from 1993 to 2021, thus we do as if our RNA's were generated from the 1993 sequences, and we study the RNA pool's efficiency over time until 2021's sequences. In addition, we display the worst case scenario, calculated from the formula given previously in the Methods, assuming a constant substitution rate over time : $$\theta_t(x) = -21rx+ 21r\sigma_0x_0$$

Figure 4: Number and proportion of siRNA efficient from 1993 to 2021. In blue are the efficiency predictions based on regressions from empirical data of the BYV sequences. In red is the worst case scenario estimates, see Methods for the formula.

Figure 4a: RdRp estimation of efficiency through time

Figure 4b: p21 estimation of efficiency through time

As expected, depending on the rate of substitutions in the protein sequence, the proportion of efficient siRNAs declines more rapidly. Following this simulation, sequences targetting the RdRp should remain over 80% efficient for at least 10 years, even in the worst case scenario and at its current rate of substitution. Onto the second result, following the empirical data we gathered, the p21 sequence's efficiency will be halved after 20 years, and in the worst case scenario following this simulation, efficiency will be halved after three years in the worst case scenario. This goes to show that the ability to predict the potential sequences of the BYV in the future is crucial for our product to have a long lasting effect. Additionnally, a likely evolutionary response of the BYV toward the selection made by siRNAs could be an increase in the rate of substitutions, making even more important this prediction ability.

Conclusion

RdRp and p21 hold key functions in both virus generation and neutralization of the plant defense pathway based on interference, making them prime targets for CAP'siRNA. However, while the RdRp sequence length and invariability makes its prediction easy computationnally, it is more challenging for the p21. There is only a limited number of sequences for the BYV, and no real continuity in the sequencing analysis realized. A such in its current form; prediRNA is inherently biased by it. Additional and regular studies are required to derive a robust and accurate model able to predict the BYV future sequences, and eventually consider extending it to other viruses. Finally, the current model only features exhaustive sequences predictions, and is thus not adapted to the p21 as the substitution rate is much higher. Updates of the software and future developpment will thus focus on developping a model fitting high rates of substitutions.

References

References SafeRNA

[1] Gavrilov K, Saltzman WM. Therapeutic siRNA: principles, challenges, and strategies. Yale J Biol Med. 2012 Jun;85(2):187-200. Epub 2012 Jun 25. PMID: 22737048; PMCID: PMC3375670.
[2] Hu B, Zhong L, Weng Y, Peng L, Huang Y, Zhao Y, Liang XJ. Therapeutic siRNA: state of the art. Signal Transduct Target Ther. 2020 Jun 19;5(1):101. doi: 10.1038/s41392-020-0207-x. PMID: 32561705; PMCID: PMC7305320.
[3] Assembly [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2012 – [cited 2024 August 09]. Available from: https://www.ncbi.nlm.nih.gov/assembly/
[4] Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. 2009. BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
[5] Kusstatscher P, Zachow C, Harms K, Maier J, Eigner H, Berg G, et al. Microbiome-driven identification of microbial indicators for postharvest diseases of sugar beets. Microbiome. 2019 Dec;7(1):1–12.

References Aphidisperse

[1] Qi, A., Dewar, A. M., & Harrington, R. (2004). Decision making in controlling virus yellows of sugar beet in the UK. Pest Management Science, 60(7), 727–732. https://doi.org/10.1002/ps.871
[2] Luquet, M., Poggi, S., Buchard, C., Plantegenest, M., & Tricault, Y. (2023). Predicting the seasonal flight activity of Myzus persicae, the main aphid vector of Virus Yellows in sugar beet. Pest Management Science, 79(11), 4508–4520. https://doi.org/10.1002/ps.7653
[3] Parry, H. R. (2013a). Cereal aphid movement: General principles and simulation modelling. Movement Ecology, 1(1), 14. https://doi.org/10.1186/2051-3933-1-14
[4] Emden, H. F. van, Eastop, V. F., Hughes, R. D., & Way, M. J. (1969). The Ecology of Myzus persicae. Annual Review of Entomology, 14(Volume 14, 1969), 197–270. https://doi.org/10.1146/annurev.en.14.010169.001213
[5] Kennedy, J. 5., Day, M. F., Eastop,V. F. A Conspectus of Aphids as of Myzus persicae (Sulzer). Ann.Vectors of Plan' Viruses. (Commonwealth Inst. Entomol., London,1 14 pp., 1962)
[6] Bell, J.R., Alderson, L., Izera, D., Kruger, T., Parker, S., Pickup, J., Shortall, C.R., Taylor, M.S., Verrier, P. and Harrington, R. (2015), Long-term phenological trends, species accumulation rates, aphid traits and climate: five decades of change in migrating aphids. J Anim Ecol, 84: 21-34. https://doi.org/10.1111/1365-2656.12282
[7] Loxdale, H. D., Hardie, J., Halbert, S., Foottit, R., Kidd, N. a. C., & Carter, C. I. (1993). The Relative Importance of Short- and Long-Range Movement of Flying Aphids. Biological Reviews, 68(2), 291–311. https://doi.org/10.1111/j.1469-185X.1993.tb00998.x
[8] Nottingham, S. F., Hardie, J., & Tatchell, G. M. (1991). Flight behaviour of the bird cherry aphid, Rhopalosiphum padi. Physiological Entomology, 16(2), 223–229. https://doi.org/10.1111/j.1365-3032.1991.tb00559.x
[9] Nottingham, S. F., & Hardie, J. (1989). Migratory and targeted flight in seasonal forms of the black bean aphid, Aphis fabae. Physiological Entomology, 14(4), 451–458. https://doi.org/10.1111/j.1365-3032.1989.tb01114.x
[10] Taylor, L.R. (1974). Insect migration, flight periodicity and the boundary layer. Journal of Animal Ecology, 43, 225-238.
[11] Reynolds, A. M., & Reynolds, D. R. (2008). Aphid aerial density profiles are consistent with turbulent advection amplifying flight behaviours: Abandoning the epithet ‘passive’. Proceedings of the Royal Society B: Biological Sciences, 276(1654), 137–143. https://doi.org/10.1098/rspb.2008.0880
[12] Hong, F., Han, H.-L., Pu, P., Wei, D., Wang, J., & Liu, Y. (2019). Effects of Five Host Plant Species on the Life History and Population Growth Parameters of Myzus persicae (Hemiptera: Aphididae). Journal of Insect Science, 19(5), 15. https://doi.org/10.1093/jisesa/iez094
[13] Williams, C. T. (1995). Effects of plant age, leaf age and virus yellows infection on the population dynamics of Myzus persicae (Homoptera: Aphididae) on sugarbeet in field plots. Bulletin of Entomological Research, 85(4), 557–567. https://doi.org/10.1017/S000748530003306X
[14] Özgökçe, M. S., Chi, H., Atlıhan, R., & Kara, H. (2018). Demography and population projection of Myzus persicae (Sulz.) (Hemiptera: Aphididae) on five pepper (Capsicum annuum L.) cultivars. Phytoparasitica, 46(2), 153–167. https://doi.org/10.1007/s12600-018-0651-0
[15] Les chiffres-clés de la filière Fruits & Légumes frais et transformés en 2019. 2019. [accessed on September 6th]

References PrediRNA

[1] Gouy M., Guindon S. & Gascuel O. (2010) SeaView version 4 : a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Molecular Biology and Evolution 27(2):221-224.
[2] Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution. vey016. DOI:10.1093/ve/vey016