Improving DNA-Based Data Storage with Synthetic Biology

The Issue

Worldwide demand to process and store data has been increasing exponentially for the past few decades, with the 21st century marking the beginning of the zettabyte (10²¹ bytes, or ZB) era [1]. One ZB, equivalent to a trillion gigabytes, is large enough to store 3×10¹⁴ (300 trillion) songs. The global “datasphere” is projected to reach 163 ZB by the end of 2025, 10 times its size in 2016 [2]. Current data storage models rely heavily on ‘the cloud’, that is, on large data storage centres. Contrary to what the name ‘cloud’ suggests, data storage centres are large buildings that can be “hundreds of thousands of square meters in size” [3].

MIT-Efficient-Flash_0.jpg

(Image source: MIT News [4])

While these data centres can hold large amounts of data, studies project that this model will be unable to keep up with the exponential increase in worldwide data storage demand, posing a significant challenge to a critical global need [1].

Apart from the technical challenges in maximizing data storage efficiency, building more data centres is not a sustainable option to meet future data storage demands due to their large carbon footprints. Studies report the following:

A little bit about data centres....svg

The technical limitations and significant environmental impact of traditional data centres motivated our team to develop an innovative, synthetic biology-based solution to this problem.

Our Journey

Having identified the significant unmet demand for efficient storage of large amounts of data worldwide, our team began searching for ways to address this issue through synthetic biology. After a thorough literature review and reaching out to several integrated human practices (iHP) contacts in both industry and academia, our team was inspired by the efficiency of nature’s own data storage medium: deoxyribonucleic acid, or DNA.

Why DNA?

Billions of years of evolution have resulted in all lifeforms storing their biological information in strands of DNA. All forms of life, including humans, trees, fish, and even bacteria, encode the information needed to sustain life in a polymer built from four nucleotides, distinguished by their bases: guanine (G), cytosine (C), adenine (A), and thymine (T). All of this information is stored at the cellular level, making DNA a very compact data storage medium; for example, each human cell contains approximately 3 billion base pairs of nucleotides in its 6 μm-wide nucleus [8].
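
As a rough illustration of this compactness, the arithmetic below estimates how much digital information a sequence of that length could hold in principle. It is a back-of-the-envelope sketch, not a figure from the cited sources: it assumes each position can freely take any of the four bases, which gives an upper bound of 2 bits per base and ignores the overhead a practical encoding would add. The function names are illustrative.

```python
import math

BASES = "ACGT"  # the four DNA bases


def bits_per_base(alphabet: str = BASES) -> float:
    """Upper bound on information per position: log2 of the alphabet size."""
    return math.log2(len(alphabet))


def capacity_megabytes(n_bases: int) -> float:
    """Theoretical capacity of n_bases positions, in megabytes (10^6 bytes)."""
    return n_bases * bits_per_base() / 8 / 1e6


if __name__ == "__main__":
    print(bits_per_base())                    # 2.0
    print(capacity_megabytes(3_000_000_000))  # 750.0
```

Under these assumptions, a 3-billion-position sequence corresponds to at most about 750 MB, packed into a volume only a few micrometres across.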

Previous research has benchmarked nature’s data storage process, and studies report that DNA-based data storage offers several key advantages over traditional data storage media such as solid-state drives (SSDs) and hard drives, including:

[Figure: key advantages of DNA-based data storage over traditional storage media]

Current DNA-Based Data Storage Platforms and Their Limitations

Owing to the benefits above, research is ongoing into using DNA as a new data storage medium to meet increasing worldwide data storage demands [11]. However, such platforms rely on chemical phosphoramidite DNA synthesis, which builds DNA strands using reagents in stoichiometric quantities: each reagent molecule can be used only once to add a nucleotide. As such, this process generates “hazardous organic chemical waste” and requires toxic, non-biocompatible chemicals and solvents. In addition, phosphoramidite synthesis cannot reliably produce DNA strands longer than about 200 nucleotides [12, 13]. For this reason, there is a need to generate DNA through safer, more robust catalytic strategies.

Enzymatic DNA synthesis offers advantages over traditional chemical DNA synthesis platforms: it does not generate hazardous waste and can synthesize longer DNA strands [13]. It uses enzymes capable of adding a user-defined sequence of nucleotides to a DNA strand [13]. One of the most widely used enzymes is terminal deoxynucleotidyl transferase (TdT), a DNA polymerase capable of adding nucleotides to the 3’ end of single-stranded DNA in a template-independent manner [13]. Because enzymes are catalysts, they need not be present in stoichiometric quantities: a small amount of enzyme can keep adding nucleotides to the primer. Thus, enzymatic DNA synthesis shows potential as a greener, safer, more cost-efficient method of generating DNA sequences.

Previous iGEM teams, such as Aachen 2021, have investigated the use of TdT for data storage. However, our team identified areas in which TdT-based enzymatic DNA synthesis platforms could be further improved for this purpose.

This is our vision with nuCloud.

nuCloud: The New Nucleotide-Based Data Storage Cloud

nuCloud is a TdT-based solid-phase DNA synthesis platform that allows users to synthesize data-encoding DNA. It consists of 2 main project components that together span our wet lab and dry lab efforts:

  1. DNA synthesis: A solid-phase DNA synthesis platform that uses TdT to synthesize a user-defined sequence of data-encoding nucleotides.
  2. Data encoding and decoding pipeline: A complete software pipeline capable of converting users’ data between binary files and nucleotide sequences.

nuCloud offers 3 key advantages over previous iGEM projects that addressed this issue:

  1. Greater DNA synthesis efficiency of TdT,
  2. Greater reagent usage efficiency,
  3. Better preservation of user data through built-in error correction algorithms.

Wet Lab: DNA Synthesis

One key feature of nuCloud’s DNA synthesis platform is that it uses a thermostable TdT variant (ThTdT), capable of withstanding higher reaction temperatures [14]. This allows nuCloud to address a key limitation of wild-type TdT (WT TdT): a significant drop in synthesis efficiency upon the formation of secondary DNA structures [15]. As the likelihood of DNA secondary structures forming decreases with increasing temperature, we designed nuCloud to use ThTdT at a higher reaction temperature for greater DNA synthesis efficiency [16].

WT vs Thermostable TdT.svg

Another key feature of nuCloud is its use of solid-phase synthesis (SPS). Conventional TdT reaction protocols use liquid-phase synthesis (LPS), in which a soluble primer is placed in a liquid pool of reagents containing TdT, nucleotides, and other necessary cofactors [17]. TdT then adds nucleotides to the primer in solution. While LPS is capable of DNA elongation, it requires excess reagents, which makes the system’s efficiency suboptimal [18]. SPS, on the other hand, synthesizes DNA on primers immobilized on a solid surface (a glass slide), with the reagents flowing over the surface [18]. This makes nuCloud efficient in reagent usage compared with liquid-phase platforms. In addition, SPS allows a rapid, high-throughput approach to incorporating a specific sequence of nucleotides. In LPS, the product of a single DNA elongation step must be purified before a nucleotide with a different base can be added; in SPS, we can purify the product simply by washing the glass surface and then flowing in the nucleotides corresponding to the next base in the growing chain.

Software: Data Encoding/Decoding Pipeline

Our team wanted to design a complete system not only capable of synthesizing data-encoding DNA, but capable of converting data between its binary and DNA formats as well. This motivated our dry lab subteam to implement a complete software pipeline to convert conventional binary data files into a nucleotide sequence and convert it back to binary data after reading the information through DNA sequencing.

Our pipeline was built to be compatible with ThTdT’s reaction mechanism, in which a random number of nucleotides is added to the 3’ end of single-stranded DNA for as long as reaction conditions remain favorable [19]. This limited our control over the elongation reactions and made them semi-specific: we could control the type of nucleotide incorporated during each synthesis reaction, but not the exact number of nucleotides added.
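
The consequence of semi-specific elongation can be pictured with a small simulation. This is a toy model under stated assumptions, not a description of ThTdT kinetics: each reaction is assumed to add between 1 and `max_run` copies of the chosen base (a uniform choice made purely for illustration), and `simulate_synthesis`, `collapse_runs`, `max_run`, and `seed` are hypothetical names.

```python
import random
from itertools import groupby


def simulate_synthesis(base_order: str, max_run: int = 5, seed: int = 0) -> str:
    """Toy model: each reaction adds 1..max_run copies of the chosen base."""
    rng = random.Random(seed)
    return "".join(base * rng.randint(1, max_run) for base in base_order)


def collapse_runs(strand: str) -> str:
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(strand))


if __name__ == "__main__":
    intended = "ACGTAC"  # the base chosen for each reaction, in order
    strand = simulate_synthesis(intended)
    print(strand)                 # run lengths vary with the seed
    print(collapse_runs(strand))  # ACGTAC
```

In this model the order of base types survives even though the run lengths do not, provided consecutive reactions use different bases. A reaction that adds no nucleotide at all is not modelled here.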

To mitigate this uncertainty, our information encoding pipeline employed a rotation-based encoding system in which ‘trits’ (the ternary equivalent of ‘bits’) are encoded in the transition between 2 consecutive nucleotides [20]. For instance, an A followed by a T (AT) encodes a 2, whereas an A followed by a G (AG) encodes a 1. Because information is carried only by transitions, the number of identical nucleotides added in each reaction does not change the encoded value.
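
A minimal sketch of such a rotation-based code is shown below. Only the two transitions given above (AT encodes 2, AG encodes 1) come from this page; the rest of the mapping is an assumption that extends them cyclically over the order A, C, G, T, and the fixed starting nucleotide `start="A"` is likewise assumed. The function names are illustrative and not taken from the nuCloud software.

```python
from itertools import groupby

BASES = "ACGT"  # assumed cyclic order used to define the rotation


def encode_trits(trits, start="A"):
    """Map each trit (0, 1, 2) to one of the three bases that differ from the previous base."""
    strand = [start]
    for trit in trits:
        if trit not in (0, 1, 2):
            raise ValueError(f"not a trit: {trit!r}")
        prev_index = BASES.index(strand[-1])
        strand.append(BASES[(prev_index + trit + 1) % 4])
    return "".join(strand)


def decode_trits(strand):
    """Collapse runs of identical bases, then read one trit per transition."""
    collapsed = [base for base, _ in groupby(strand)]
    return [
        (BASES.index(nxt) - BASES.index(prev) - 1) % 4
        for prev, nxt in zip(collapsed, collapsed[1:])
    ]


if __name__ == "__main__":
    print(encode_trits([2]))         # AT
    print(encode_trits([1]))         # AG
    print(decode_trits("AAATTTGG"))  # [2, 2]
```

Because the next base always differs from the previous one, the encoded strand never asks for the same base in two consecutive reactions, and the decoder can discard run lengths without losing information.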

To allow users to retrieve their data from ‘DNA’ files, our team implemented a decoding pipeline that converts information stored in nucleotide sequences back into binary files. While previous teams, such as Aachen 2021, also implemented decoding pipelines, these lacked error correction features to protect users’ data against potential errors, including TdT failing to add nucleotides during certain synthesis reactions and DNA sequencing errors.

After discussions with our iHP contacts in industry, our team realized how important data integrity is for our platform’s downstream implementation. This motivated us to design built-in error correction algorithms that increase the chances of recovering ‘lost’ user information.
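
This page does not specify which error correction algorithms nuCloud uses. Purely to illustrate the general idea of recovering data from redundancy, the sketch below applies a simple repetition code: the same trit sequence is read several times and each position is decided by majority vote. The technique and the `majority_vote` name are assumptions chosen for illustration, not the team's method.

```python
from collections import Counter


def majority_vote(reads):
    """Return the most common trit at each position across equal-length reads."""
    if len({len(read) for read in reads}) != 1:
        raise ValueError("all reads must have the same length")
    return [Counter(column).most_common(1)[0][0] for column in zip(*reads)]


if __name__ == "__main__":
    reads = [
        [0, 1, 2, 2],
        [0, 1, 0, 2],  # one substituted trit in the third position
        [0, 1, 2, 2],
    ]
    print(majority_vote(reads))  # [0, 1, 2, 2]
```

A vote of this kind only repairs substitutions in reads of equal length. A missed nucleotide addition shifts every later position, so handling that error type would need a scheme that tolerates deletions.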

Hardware: Affordable Modular Bioreactor

To support nuCloud’s potential large-scale adoption for biomanufacturing, our team decided to design and construct an affordable, modular bioreactor that could accelerate the growth of ThTdT-expressing Escherichia coli (E. coli), our chosen chassis. After discussions with iHP contacts in industry, who reminded us of the importance of making technology accessible to large stakeholder groups, we decided that our hardware must be not only functional, but also sustainable and accessible to those without access to advanced tools.

Our bioreactors were constructed from cheap, readily available items, with custom-built control circuits that are much cheaper than their commercial equivalents. Multiple iterations of the bioreactor were built throughout the competition cycle to add finer control features and automation, based on user feedback from both our wet lab members and local stakeholders.

hardware-bioreactor.png

Hardware: Automated Microfluidic Chip System

One of the key advantages of nuCloud is that it uses SPS, making the system compatible with automation and, therefore, with potential large-scale biomanufacturing [21]. To demonstrate its capacity for automation, we reached out to several iHP contacts for advice on implementing an automated microfluidic chip system for our DNA synthesis reactions. Based on their feedback, we iterated on the design of our microfluidic chips for the wet lab to use in their experiments.

hardware-fluid-sim.gif

As reaction automation with microfluidic chips requires pumps to control the syringes, we also constructed low-cost microfluidic pumps to automate and scale out our DNA elongation reactions. Combined with the chips, these allowed us to demonstrate the potential for nuCloud’s eventual implementation as a biomanufacturing process.

hardware-pump.png

Integrated Human Practices: Shaping nuCloud

At the heart of every iGEM project is its integrated human practices (iHP). Throughout the entire iGEM season, from problem identification and brainstorming to the iterative design of nuCloud, our team constantly reached out to professionals in both industry and academia for their expert feedback. This ensured that nuCloud was shaped in a way that accurately reflects the needs and interests of various stakeholders. We also explored various use cases based on the data storage demands of our local and global community, being mindful of the ethics, data ownership, and regulatory policies pertaining to each stakeholder.

References

  1. Ionescu, A. M. (2017). Energy efficient computing and sensing in the Zettabyte era: From silicon to the cloud. 2017 IEEE International Electron Devices Meeting (IEDM), 1.2.1-1.2.8. https://doi.org/10.1109/IEDM.2017.8268307

  2. Reinsel, D., Gantz, J., & Rydning, J. (2017). Data Age 2025: The Evolution of Data to Life-Critical. Don’t Focus on Big Data; Focus on the Data That’s Big. International Data Corporation. https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf

  3. Mytton, D. (2021). Data centre water consumption. Npj Clean Water, 4(1), 11. https://doi.org/10.1038/s41545-021-00101-w

  4. Matheson, R. (2019, April 3). Advance boosts efficiency of flash storage in data centers. MIT News | Massachusetts Institute of Technology. https://news.mit.edu/2019/solid-state-drives-flash-storage-data-center-efficiency-0403

  5. Zhang, Q., & Yang, S. (2021). Evaluating the sustainability of big data centers using the analytic network process and fuzzy TOPSIS. Environmental Science and Pollution Research, 28(14), 17913–17927. https://doi.org/10.1007/s11356-020-11443-2

  6. Siddik, M. A. B., Shehabi, A., & Marston, L. (2021). The environmental footprint of data centers in the United States. Environmental Research Letters, 16(6), 064017. https://doi.org/10.1088/1748-9326/abfba1

  7. Baek, D., Joe, S.-Y., Shin, H., et al. (2024). Recent Progress in High-Throughput Enzymatic DNA Synthesis for Data Storage. BioChip Journal. https://doi.org/10.1007/s13206-024-00146-2

  8. Alberts, B., Johnson, A., Lewis, J., et al. (2002). Chromosomal DNA and Its Packaging in the Chromatin Fiber. In Molecular Biology of the Cell (4th ed.). Garland Science. https://www.ncbi.nlm.nih.gov/books/NBK26834/

  9. Ceze, L., Nivala, J., & Strauss, K. (2019). Molecular digital data storage using DNA. Nature Reviews Genetics, 20(8), 456–466. https://doi.org/10.1038/s41576-019-0125-3

  10. Coudy, D., Colotte, M., Luis, A., Tuffet, S., & Bonnet, J. (2021). Long term conservation of DNA at ambient temperature. Implications for DNA data storage. PLOS ONE, 16(11), e0259868. https://doi.org/10.1371/journal.pone.0259868

  11. Gervasio, J. H. D. B., Da Costa Oliveira, H., Da Costa Martins, A. G., Pesquero, J. B., Verona, B. M., & Cerize, N. N. P. (2024). How close are we to storing data in DNA? Trends in Biotechnology, 42(2), 156–167. https://doi.org/10.1016/j.tibtech.2023.08.001

  12. Simmons, B. L., McDonald, N. D., & Robinett, N. G. (2023). Assessment of enzymatically synthesized DNA for gene assembly. Frontiers in Bioengineering and Biotechnology, 11, 1208784. https://doi.org/10.3389/fbioe.2023.1208784

  13. Yoo, E., Choe, D., Shin, J., Cho, S., & Cho, B.-K. (2021). Mini review: Enzyme-based DNA synthesis and selective retrieval for data storage. Computational and Structural Biotechnology Journal, 19, 2468–2476. https://doi.org/10.1016/j.csbj.2021.04.057

  14. Chua, J. P. S., Go, M. K., Osothprarop, T., Mcdonald, S., Karabadzhak, A. G., Yew, W. S., Peisajovich, S., & Nirantar, S. (2020). Evolving a Thermostable Terminal Deoxynucleotidyl Transferase. ACS Synthetic Biology, 9(7), 1725–1735. https://doi.org/10.1021/acssynbio.0c00078

  15. Hoose, A., Vellacott, R., Storch, M., Freemont, P. S., & Ryadnov, M. G. (2023). DNA synthesis technologies to close the gene writing gap. Nature Reviews Chemistry, 7(3), 144–161. https://doi.org/10.1038/s41570-022-00456-9

  16. Liang, X., Kuhn, H., & Frank-Kamenetskii, M. D. (2006). Monitoring Single-Stranded DNA Secondary Structure Formation by Determining the Topological State of DNA Catenanes. Biophysical Journal, 90(8), 2877–2889. https://doi.org/10.1529/biophysj.105.074104

  17. Molina, A. G., & Sanghvi, Y. S. (2019). Liquid‐Phase Oligonucleotide Synthesis: Past, Present, and Future Predictions. Current Protocols in Nucleic Acid Chemistry, 77(1), e82. https://doi.org/10.1002/cpnc.82

  18. Ferrazzano, L., Corbisiero, D., Tolomelli, A., & Cabri, W. (2023). From green innovations in oligopeptide to oligonucleotide sustainable synthesis: Differences and synergies in TIDES chemistry. Green Chemistry, 25(4), 1217–1236. https://doi.org/10.1039/D2GC04547H

  19. Motea, E. A., & Berdis, A. J. (2010). Terminal deoxynucleotidyl transferase: The story of a misguided DNA polymerase. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1804(5), 1151–1166. https://doi.org/10.1016/j.bbapap.2009.06.030

  20. Bornholt, J., Lopez, R., Carmean, D. M., Ceze, L., Seelig, G., & Strauss, K. (2016). A DNA-Based Archival Storage System. Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, 637–649. https://doi.org/10.1145/2872362.2872397

  21. Ma, Y., Zhang, Z., Jia, B., & Yuan, Y. (2024). Automated high-throughput DNA synthesis and assembly. Heliyon, 10(6), e26967. https://doi.org/10.1016/j.heliyon.2024.e26967