Software
Overview
The goal of the software pipeline is to facilitate the storage and retrieval of information in DNA. This is achieved through the encoding and decoding of information. Encoding is the conversion of information from format A to format B; decoding is the reverse process, converting format B back to format A. In the case of DNA storage, format A is bits and format B is DNA nucleotides (bases).
Figure 1: Bits to Bases Conversion
While the encoding and decoding of information occurs in many settings, such as networking, computer and data centre storage, and encryption, these processes have not traditionally dealt with the building blocks of life: nucleotides. The properties of DNA differ from those of, for instance, the solid state drives (SSDs) used in the settings above. DNA-based storage systems therefore suffer from different failure modes, making encoding, decoding, and error correction a challenging and still unsolved problem for using DNA as an efficient storage medium.
Motivation
What are Bits?
A bit, either 0 or 1, is the most basic unit of information a classical computer can store and interpret. Bits are represented physically as electrical or magnetic signals. For instance, on solid state drives, the state of a transistor, “charged” or “uncharged”, maps to 0 and 1 respectively. While traditional data centres store information in the form of bits, DNA has more than two unique basic units. Thus, given a user’s file, which contains many bits, we must convert it to a collection of appropriately sized nucleotide sequences for synthesis in the wet lab.
What are Nucleotides?
Nucleotides, or bases, are the four basic building blocks of DNA: adenine, guanine, thymine, and cytosine. Adenine and thymine are similar in structure, as are guanine and cytosine. There are many different ways to map bits to bases, several of which are discussed below. Additionally, stable DNA sequences tend to satisfy certain constraints, such as roughly 50% GC content, which means some bit sequences may encode to unstable DNA sequences.
Is There a Catch…edit errors?
Another step hidden in both encoding and decoding is error detection and correction. As with all physical systems, failures can occur, and the best way to deal with them is to set up measures to recover from them. Edit errors, including mutations, insertions, and deletions of bases, can occur during the synthesis, storage, and sequencing of DNA. Each stage varies in the rate and type of edit errors: during synthesis, deletion errors are more common, while during sequencing, mutations are more likely.
Figure 2: Errors in Synthesis and Sequencing
Is There Another Catch…semi-specific synthesis?
TdT has been regarded as a way to synthesize strands of DNA without a template, and has the potential to synthesize longer strands than chemical synthesis. However, native TdT adds bases semi-specifically, meaning the type of base added can be controlled, but not the number of those bases. While there has been research on how to overcome this hurdle, the methodologies have been patented. In simple terms, no self-transitions, or repetitions of bases, are allowed in the encoded DNA sequence.
Figure 3: Specific vs. Semi-specific Synthesis
Installation
We provide executables for Windows, Linux, and macOS, and features have been tested on all three operating systems to ensure the user experience remains the same regardless of platform.
Description and Theory
There are a few key differences between DNA and traditional data storage media. Most notably, statistical methods are used when recovering DNA sequences, as opposed to the formal methods used in traditional information theory, for the reasons below.
Rate and Type of Edit Errors
Solid state drives (SSDs) are used in data centres to store information, and have an uncorrectable bit error rate (UBER) of around 10⁻¹¹ to 10⁻⁹ [1]: that is, roughly 1 bit in every 10¹¹ of a data stream is altered [2]. Additionally, most errors in SSDs or HDDs are mutations or erasures, and rarely insertions or deletions.
In contrast, and specifically for the semi-specific synthesis of DNA using TdT, edit error rates are much higher, and are usually deletions. Church et al. [3] reported that for strands encoding 48 bases, “9.5% of strands contained one or more mismatches, 10.7% contained one or more insertions, and 66.1% contained one or more missing nucleotides.”
Another team using TdT reported the following errors [4]:
- Single base deletions: 25.8%
- Single base insertions: 13.4%
- Mismatches: 8.9%
Furthermore, a previous iGEM team, Aachen 2021, observed even higher deletion rates, losing over 50% of nucleotides for strands of a desired length of 25 nucleotides [5]. Therefore, after consultations with Tony Liu, Prof. Anne Condon, Boyan, and Jordan, we concluded that a statistical approach relying on the distribution of synthesized DNA strands was needed for error correction, as opposed to a formal approach relying on mathematical proofs.
During sequencing, mutation errors may occur, but they are much less significant than the deletion errors expected during synthesis. In fact, the NGS error rate is around 10⁻³, far smaller and not our biggest concern [6].
Length of Encoded Sequence
Based on experimental data from Aachen 2021 [5], it appears that the shorter the synthesized strand, the lower the rate of deleted bases. Therefore, our software breaks a bit sequence into blocks. Breaking a long bit sequence into blocks has the added benefit of allowing synthesis to be parallelized, a key point brought up during brainstorming by Rodrigo Vallejos.
Diversity of Decoded Strands
As briefly touched upon with Tony Liu, and highlighted by Prof. Anne Condon when we met to discuss Church et al. [3], we will be dealing with a population of DNA sequences during decoding, as opposed to one or two copies. When reading data back from an SSD, there are typically only one or two copies of that data, so formal methods, as opposed to statistical methods, must be applied to correct errors.
In the case of decoding DNA strands, because there is a population of strands with a large variance in length (due to the many expected deletions), no formal methods yet exist to recover deleted information without being computationally expensive or requiring substantial redundancy. This is another reason why statistical methods are used, either alone or in combination with formal methods, for DNA storage.
Semi-Specific Synthesis
Finally, we must use encoding mappings that avoid self-transitions. Additionally, not all transitions are equal: as noted by Church et al. [3], certain transitions yield higher rates of deletion errors than others. Our software uses this information to encode strands that avoid these error-prone transitions.
Encoding
Before we encode data, we begin collecting metadata. Metadata gives us the information we need to decode a sequence back into the original file.
Compression
By encoding a heavily compressed file, we effectively increase the amount of information stored in DNA for a given number of nucleotide bases, and we can add error-correcting bits while keeping the overall DNA strand short. Our team uses lossless compression, a compression process that does not result in any data loss. We support several compression tools, since each has a unique compression ratio: the ratio between output and input file sizes, which we express as bits per byte (bpb, output bits per input byte).
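For concreteness, here is a minimal helper (the function name is ours) computing this metric:

```python
def compression_ratio_bpb(input_size: int, output_size: int) -> float:
    """Compression ratio in bits per byte: output bits per input byte."""
    return 8 * output_size / input_size

# Example: a 1000-byte file compressed to 250 bytes yields 2.0 bpb.
assert compression_ratio_bpb(1000, 250) == 2.0
```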
LZ4
LZ4 is a lossless data compression algorithm optimized for very fast compression and decompression speeds at reasonable compression ratios. Due to its widespread use and popularity, it offers easily accessible compression APIs; these interfaces provide up to twelve compression levels, where each level trades slower speed for an improved compression ratio. A high-compression derivative, named LZ4_HC, is also provided, and is the default compression method used in our encoding pipeline.
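As a minimal sketch of how a pipeline might call LZ4 from Python (via the python-lz4 bindings; the wrapper names are ours):

```python
import lz4.frame

def compress_block(data: bytes, high_compression: bool = True) -> bytes:
    """Compress a byte block with LZ4, optionally at the maximum (HC-style) level."""
    level = lz4.frame.COMPRESSIONLEVEL_MAX if high_compression else 0
    return lz4.frame.compress(data, compression_level=level)

def decompress_block(blob: bytes) -> bytes:
    """Losslessly recover the original bytes."""
    return lz4.frame.decompress(blob)
```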
Tokenization and the ts_zip utility
ts_zip is a small utility that enables text compression through LLM-style tokenization. A given text file is broken down into tokens, and the token “values” are written to a binary file for storage. The binary can then be subjected to the reverse process, where each token is converted back to its string representation. As long as the same model is used for compression and decompression, this process is lossless. The process is similar to dictionary compression, except that the LLM is used as a static dictionary for all input files, with the token values serving as identifiers. Thus, the model used for compression must be carefully selected, optimizing for model size, compression speed, and compression ratio; a lower compression ratio (greater efficiency) and a shorter compression time are ideal. Four models - falcon_40B, rwkv_7B, mistral_7B, and gptneox_20B - were evaluated for their relative performance. The benchmark results and technical specifications are shown below, followed by a rough sketch of the tokenize-then-store idea.
Figure 4: Time Comparison for Compression Strategies
Figure 5: Compression Ratio for Compression Strategies
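The tokenize-then-store idea described above can be sketched as follows. This is not ts_zip itself (which ships its own models and packs tokens far more compactly); the open-source tiktoken tokenizer stands in for the model, and 4 bytes per token is deliberately naive:

```python
import struct

import tiktoken  # stand-in tokenizer; ts_zip uses its own LLM models

def tokenize_compress(text: str) -> bytes:
    """Encode text to token ids and pack them as unsigned 32-bit integers."""
    ids = tiktoken.get_encoding("gpt2").encode(text)
    return struct.pack(f"<{len(ids)}I", *ids)

def tokenize_decompress(blob: bytes) -> str:
    """Unpack token ids and map them back to text. Lossless as long as
    the same tokenizer (model) is used in both directions."""
    ids = struct.unpack(f"<{len(blob) // 4}I", blob)
    return tiktoken.get_encoding("gpt2").decode(list(ids))
```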
Breaking into Blocks
Due to the limited length of synthesized sequences, “blocking” files into smaller sub-sections - which are encoded separately - is necessary. Thus, a blocking and re-building algorithm was developed, harnessing fixed-length overlap sequences to determine block order during reconstruction. Specifically, upon creation, the end of each block matches the first n bits of the next block (where n is a parameter of the algorithm); this makes n the only data that needs to be included in the metadata, since no explicit ordering information is required to reconstruct the blocks. Reconstruction is thus performed through a depth-first search, where each block represents a node and each set of matching overlaps represents an edge; a sketch of the scheme follows.
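A minimal sketch, assuming all n-bit overlaps are unique and the bit length divides evenly into blocks (a real pipeline would pad):

```python
def split_into_blocks(bits: str, block_len: int, n: int) -> list[str]:
    """Break a bit string into blocks of block_len bits, where the last
    n bits of each block reappear as the first n bits of the next."""
    step = block_len - n
    return [bits[i:i + block_len] for i in range(0, len(bits) - n, step)]

def rebuild(blocks: list[str], n: int) -> str:
    """Re-order shuffled blocks by chaining their matching n-bit overlaps."""
    by_prefix = {b[:n]: b for b in blocks}
    suffixes = {b[-n:] for b in blocks}
    # The first block is the one whose prefix is no other block's suffix.
    current = next(b for b in blocks if b[:n] not in suffixes)
    out = current
    while current[-n:] in by_prefix:
        current = by_prefix[current[-n:]]
        out += current[n:]
    return out
```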
Mapping
The maximum amount of information that can be encoded into DNA is 2 bits/base (four bases give log₂ 4 = 2 bits). We have chosen to implement both a rotation-based cipher and the Church encoding [7].
Because TdT adds bases only semi-specifically - it keeps adding the supplied nucleotide until the reaction is stopped - we can select which nucleotide TdT synthesizes, but not how many (there are ways to get around this, but they take more time to implement). Thus, using either the rotation encoding or the Church encoding, we avoid self-transitions, as TdT synthesis requires.
Church Encoding
The Church encoding maps 0 to A or T, and 1 to G or C, choosing between the two candidate bases based on the most recently encoded base so that no base is ever repeated. The information rate of this encoding scheme is 1 bit/nucleotide. A sketch follows the figure.
Figure 6: Church Encoding
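A minimal sketch of the scheme (which of the two candidate bases is preferred when both are legal is our own choice):

```python
def church_encode(bits: str) -> str:
    """Map 0 -> A/T and 1 -> G/C, picking whichever base avoids a repeat."""
    out = []
    for bit in bits:
        options = "AT" if bit == "0" else "GC"
        # take the first option unless it would repeat the previous base
        base = options[1] if out and out[-1] == options[0] else options[0]
        out.append(base)
    return "".join(out)

def church_decode(seq: str) -> str:
    """A/T reads back as 0, G/C as 1 (1 bit per nucleotide)."""
    return "".join("0" if base in "AT" else "1" for base in seq)
```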
Rotation Encoding
Transitions between bases encode the digits 0, 1, and 2. The cipher therefore encodes information in base 3, whereas computers usually represent information in base 2, so we first convert base-2 information to base 3. We then select an arbitrary start base and follow the arrows to encode information. Church et al. state an information rate of 1.58 bits/nucleotide, but also mention that by avoiding CA and CG transitions, a better recovery rate may be achieved, at the cost of a lower information rate of 1.27 bits/nucleotide. Our software pipeline tries both schemes, to see whether there is a significant change in the rate of deleted bases. A sketch follows the figures.
Figure 7: Rotation Encoding
Figure 8: Rotation Encoding, Alternative Visual Format
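A minimal sketch of the rotation idea. The trit-to-transition table here (the three bases other than the current one, in alphabetical order) is one possible choice and may differ from the figures:

```python
BASES = "ACGT"

def to_trits(bits: str) -> list[int]:
    """Convert a base-2 string to base-3 digits. Leading zeros are lost;
    a real pipeline would record the original bit length in the metadata."""
    value = int(bits, 2)
    trits = []
    while value:
        value, digit = divmod(value, 3)
        trits.append(digit)
    return trits[::-1] or [0]

def rotation_encode(bits: str, start: str = "A") -> str:
    """Emit one base per trit: from the current base, the three *other*
    bases are numbered 0-2, so no base ever repeats (no self-transitions)."""
    seq = [start]
    for trit in to_trits(bits):
        candidates = [b for b in BASES if b != seq[-1]]
        seq.append(candidates[trit])
    return "".join(seq)
```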
Addition of Scaffolding
Due to the high rate of deletion errors, aligning short sequences would be extremely difficult without prior information. Thus, as Church et al. also did, we add known scaffolding bases to help with alignment and to allow us to estimate the number of deleted or erased bases.
Collected Metadata
At this point, the DNA strands are ready for synthesis. The metadata we have collected includes: the original file name, the file type, the length of each bit block, the compression type, the encoding scheme, and a checksum for every bit block to verify that the strand is correctly recovered. A sketch of such a record is shown below.
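Field names here are illustrative, not the pipeline's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Metadata:
    """Information collected during encoding, needed again for decoding."""
    file_name: str
    file_type: str
    block_length: int            # length of each bit block
    compression: str             # e.g. "lz4_hc" or "ts_zip"
    encoding_scheme: str         # e.g. "church" or "rotation"
    block_checksums: list[int]   # one checksum per bit block
```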
Decoding
Before decoding occurs, the DNA strands need to be sequenced. Due to the high variance among synthesized strands, Sanger sequencing is most likely not an option, as highlighted by Tony Liu, who suggested Nanopore. Since we do not have access to Nanopore, we would rely on Next-Generation Sequencing (NGS). For this project, we did not end up being able to sequence any strands for decoding.
Removal of Primer
We must remove the nucleotides representing the primer so that decoding runs only on the portion of the strand containing information bases. Using the primer we have stored, we run fuzzy string matching algorithms [8]. Fuzzy (approximate) matching is required because the sequenced primer may not exactly match the stored primer sequence (again due to edit errors during synthesis and sequencing).
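A minimal sketch using the third-party regex module, whose fuzzy-matching syntax allows a bounded number of edit errors (the primer sequence and wrapper name are ours):

```python
import regex  # third-party module with approximate-matching support

PRIMER = "ACGTACGTAC"  # hypothetical stored primer sequence

def strip_primer(read: str, max_edits: int = 2) -> str:
    """Remove the primer, tolerating up to max_edits edit errors, from the
    start of a sequenced read; return the read unchanged if no match."""
    pattern = f"({PRIMER}){{e<={max_edits}}}"
    m = regex.match(pattern, read, flags=regex.BESTMATCH)
    return read[m.end():] if m else read
```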
Sequence collapse to single nucleotides
Given that we perform semi-specific synthesis, we now collapse homonucleotide runs to single nucleotides: every run of a repeated base added by TdT is reduced to one base. The length of each run may be stored as useful side information for error correction, as sketched below.
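A minimal sketch of the collapse step:

```python
from itertools import groupby

def collapse_runs(seq: str) -> tuple[str, list[int]]:
    """Collapse homonucleotide runs to single bases, keeping run lengths
    as side information for error correction."""
    runs = [(base, len(list(group))) for base, group in groupby(seq)]
    collapsed = "".join(base for base, _ in runs)
    lengths = [count for _, count in runs]
    return collapsed, lengths

# Example: "AAGGGT" -> ("AGT", [2, 3, 1])
```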
Error Correction
Figure 9: Error Correction with Scaffolding
Alignment to Scaffolding
Because strands are short and may contain a high number of deletions, the scaffold added earlier helps us align strands. As suggested by Jordan, every strand shorter than the original encoded strand length is aligned to the scaffolding, which lets us infer where bases were deleted rather than trying to resolve deletion errors blindly. If a strand cannot be aligned to the scaffold, it is discarded.
Alignment of Strands to Other Strands
Alignment to the scaffold, together with the homonucleotide run lengths, gives the probability that a sequenced base truly belongs at a given index, and is used to resolve base conflicts. Statistically, a strand’s deleted bases should be recoverable from other strands that did not lose the same base; however, this is not always the case and cannot be relied upon. Thus, Church et al. [3] developed a Markov model to predict which base was synthesized when a deleted base cannot be recovered during alignment. While the theoretical model is described, the implemented model is only available upon request, and thus does not comply with iGEM’s open science policy.
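Since that model is unavailable, here is a hedged stand-in showing the general shape of a first-order Markov predictor trained on observed reads (not Church et al.'s actual model):

```python
from collections import Counter, defaultdict

def most_likely_next_base(strands: list[str]) -> dict[str, str]:
    """For each base, record the base that most often follows it in the
    observed reads; use it to guess an unrecoverable deleted base."""
    counts: defaultdict[str, Counter] = defaultdict(Counter)
    for strand in strands:
        for prev, nxt in zip(strand, strand[1:]):
            counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}
```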
Mapping Back to Bits
The DNA is mapped back to bits, based on the encoding scheme recorded in the metadata.
All Information Recovered
If we are able to recover the mutated bases, the checksum generated earlier is used to check for errors: if the checksum passes, recovery is successful; otherwise, the decoding process signals failure for that bit block.
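A minimal sketch of the verification step, assuming a CRC-32 checksum (the checksum type is our assumption):

```python
import zlib

def verify_block(bits: str, expected_crc: int) -> bool:
    """Check a recovered bit block against its stored checksum."""
    data = int(bits, 2).to_bytes((len(bits) + 7) // 8, "big")
    return zlib.crc32(data) == expected_crc
```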
Bit Blocks to Bit Sequence
Based on the overlaps between blocks, a graph is formed and the blocks are appended, via a greedy search, to reconstruct the original bit sequence.
Primer Generation
We have also designed a simple primer generation tool for our wet lab to use when designing primers for solid-phase synthesis. Given a set of requirements, it generates primers the wet lab can use for synthesizing ssDNA with TdT.
How are we generating primers?
Primers are generated using a genetic algorithm: parent primer sequences are mutated and recombined to produce children, which are then checked against a set of constraints. If the constraints are satisfied, the child primers can be used; otherwise, they become the new parents. This cycle can continue for as many iterations as desired. The fitness function is determined by the constraints, each having a weight or “acceptable” range. These constraints were collected from our wet lab team and validated using online references. A sketch follows.
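A minimal sketch of such a loop; the fitness terms below (GC content, homopolymer runs) are stand-ins for the wet lab's actual constraint set:

```python
import random

BASES = "ACGT"

def fitness(primer: str) -> float:
    """Toy fitness: penalize deviation from 50% GC and repeated bases."""
    gc = sum(b in "GC" for b in primer) / len(primer)
    repeats = sum(primer[i] == primer[i - 1] for i in range(1, len(primer)))
    return -abs(gc - 0.5) - 0.1 * repeats

def mutate(primer: str, rate: float = 0.1) -> str:
    """Point-mutate each position with probability rate."""
    return "".join(random.choice(BASES) if random.random() < rate else b
                   for b in primer)

def evolve(length: int = 20, pop_size: int = 50, generations: int = 100) -> str:
    """Keep the fitter half as parents; refill the population with children."""
    population = ["".join(random.choices(BASES, k=length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(random.choice(parents)) for _ in parents]
        population = parents + children
    return max(population, key=fitness)
```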
User Manual
Below are screenshots of the graphical user interface for our encoding and decoding software. We designed a GUI that mirrors the user experience of uploading files to a data storage service such as Google Drive.
Future Plans
For future iterations of our software pipeline, we hope to adopt the DNA Storage Alliance guidelines for collecting and storing metadata for DNA, develop a more robust primer generation tool and improve the user experience of our software. Learn more about our software here.
References
1. Farjallah, Emna, Jean-Marc Armani, Valentin Gherman, and Luigi Dilillo. “Improvement of the tolerated raw bit error rate in NAND flash-based SSDs with selective refresh.” Microelectronics Reliability, vol. 96, 2019, pp. 37-45. https://doi.org/10.1016/j.microrel.2019.01.014
2. “Bit Error Rate.” Wikipedia, 22 Jan. 2021, en.wikipedia.org/wiki/Bit_error_rate.
3. Lee, H.H., Kalhor, R., Goela, N., et al. “Terminator-free template-independent enzymatic DNA synthesis for digital information storage.” Nature Communications 10, 2383 (2019). https://doi.org/10.1038/s41467-019-10258-1
4. Lee, H., Wiegand, D.J., Griswold, K., et al. “Photon-directed multiplexed enzymatic DNA synthesis for molecular digital data storage.” Nature Communications 11, 5246 (2020). https://doi.org/10.1038/s41467-020-18681-5
5. “Team:Aachen.” iGEM 2021, 2021.igem.org/Team:Aachen. Accessed 28 Sept. 2024.
6. Ma, X., Shao, Y., Tian, L., et al. “Analysis of error profiles in deep next-generation sequencing data.” Genome Biology 20, 50 (2019). https://doi.org/10.1186/s13059-019-1659-6
7. Bornholt, James, et al. “A DNA-Based Archival Storage System.” Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’16), 2016. https://doi.org/10.1145/2872362.2872397
8. Wikipedia Contributors. “Approximate String Matching.” Wikipedia, Wikimedia Foundation, 23 Aug. 2024, en.wikipedia.org/wiki/Approximate_string_matching. Accessed 28 Sept. 2024.