DNA Data Storage
The amount of data that needs to be stored over the coming decade will rise into the exabyte range. Furthermore, the rapid growth of AI requires exponentially more training data, and correspondingly more storage to hold it.
Data storage solutions are changing with OS virtualization, cloud integration, containers, and the scale-out architectures that support them. This may make us long for the days when we could walk into a datacenter and touch our storage. Nowadays, cloud gateways are integrated into enterprise storage arrays, so developers can spin up hundreds of terabytes for software testing. As a result, it is harder than ever to know who is using storage, why they are using it, or whether a given data storage technology is cost-effective.
So, what’s the solution? Eventually it may appear as a cross-vendor storage tracking and analysis application that leverages machine learning to help administrators optimize the total storage infrastructure for performance and cost. Such applications would understand the cost, performance, availability, and reliability of the various storage options and weigh them accordingly, including emerging technologies such as:
• Helium Drives
• Shingled Magnetic Recording (SMR)
• DNA
• Large memory servers – NVRAM
• Rack scale design
• 5D Optical storage
Is it hard to believe that biological molecules can be part of a data storage technology? Strange as it sounds, DNA is a storage technology of the future. The molecules that store biological information can also be used to store other kinds of digital data. In 2012, Harvard researchers encoded digital information in DNA, including a 53,400-word book in HTML, eleven JPEG images, and one JavaScript program.
With DNA, you can store roughly 2.2 petabytes per gram, an incredible storage density. This means a DNA hard drive about the size of a teaspoon could hold all the world’s data. Beyond the space savings, DNA is also ideal for long-term storage.
However, the read/write time for DNA is high, and the technology is still too expensive for everyday use. According to New Scientist, in one recent study the cost to encode 83 kilobytes was £1,000 (about $1,500 U.S. dollars). It sounds like a sci-fi story, but scientists have encoded information into artificial DNA and inserted it into bacteria. DNA could one day be the ultimate eternal drive.
By 2025, the amount of data produced per day is expected to reach 463 exabytes. One exabyte is 1 million terabytes, or 1 billion gigabytes—that's a lot of data! Storing data safely ensures that the knowledge you gather over the years is retained and maintained. Proper storage is vital for individuals, businesses, and other organizations alike.
There are many ways to store data, from old-fashioned paper records to modern digital methods. But why are there so
many different methods instead of one universal favorite?
Storage efficiency
Storage efficiency is the ability to store and manage data that consumes the least amount of space with little to no impact
on performance, resulting in a lower total operational cost. Efficiency addresses the real-world demands of managing costs,
reducing complexity, and limiting risk. The Storage Networking Industry Association (SNIA) defines storage efficiency in the
SNIA Dictionary as follows:
The efficiency of an empty enterprise level system is commonly in the 40–70% range, depending on what combination
of RAID, mirroring and other data protection technologies are deployed, and may be even lower for highly redundant
remotely mirrored systems. As data is stored on the system, technologies such as deduplication and compression may
store data at a greater than 1-to-1 data size-to-space consumed ratio, and efficiency rises, often to over 100% for primary
data, and thousands of percent for backup data.
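To make those percentages concrete, here is a minimal sketch of the efficiency calculation; the helper name and the sample figures are illustrative assumptions, not taken from SNIA.

```python
def efficiency_pct(stored_or_usable_bytes: float, raw_capacity_bytes: float) -> float:
    """Efficiency as a percentage of the raw capacity consumed (illustrative helper)."""
    return 100.0 * stored_or_usable_bytes / raw_capacity_bytes

# An empty RAID-6 (6+2) system: only 6 of every 8 drives are usable for data,
# so ~75% before file-system overhead and spares pull it into the 40-70% range.
print(efficiency_pct(6, 8))          # 75.0

# With deduplication and compression, 10 TB of logical data may occupy only
# 4 TB of physical space, so efficiency rises well over 100%.
print(efficiency_pct(10e12, 4e12))   # 250.0
```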
Another reason it is essential to store data properly is that doing so keeps it organized and ensures that information can be found when needed. If data is not stored the right way, it can be difficult to locate specific information, leading to frustration and wasted time.
Finally, storing data properly helps ensure the security of the information. If data is not stored securely, there is a risk that unauthorized users could access it.
To avoid these issues, you must choose a storage solution that works for you in the long term. Different methods provide different levels of redundancy, and some are better suited to specific purposes and datasets. A good solution should meet the following criteria:
1. First, the solution should be able to scale to accommodate increasing amounts of data.
2. Second, the solution should be able to handle different types of data, such as structured, unstructured, and semi-
structured data.
3. Third, the solution should be able to protect data from corruption or loss.
4. Fourth, the solution should offer high levels of security to protect data from unauthorized access.
5. Finally, the solution should be easy to use and manage so that any user, including those who are not tech-savvy,
can use it without a problem.
Conclusion
There is no clear consensus on what type of data storage will be the most widely used in the future. This is because there
are various types of storage, each with its advantages and disadvantages.
Keep in mind that new technologies are constantly being developed, which could change the landscape of data storage in
the years to come. What will be the next big thing is anyone's guess, but what's for sure is that plenty of change is on the
horizon for the world of data storage.
Over the decades, storage technology has evolved and improved. It has moved from floppy disks and CDs to hard drives and solid-state drives. But we still have a problem: the amount of storage available and being produced can't keep up with the data we keep producing.
So, would DNA storage solve the problem? Can data be stored in DNA?
DNA storage is a relatively new technology that holds promise for the future of data storage. The method involves encoding digital information into strands of DNA, which can then be stored for long periods without degradation.
DNA storage has very high capacity potential, with one gram of DNA holding up to 1 billion terabytes (TB) of data. Additionally, DNA is extremely small and can be stored very space-efficiently.
DNA data storage is the process of using DNA molecules as a storage medium. Unlike the optical and magnetic storage technologies in use today, DNA data is not stored as binary digits (1s and 0s). Instead, it is encoded into DNA nucleotide bases (A, C, G, T) and stored; the strands are converted back to binary digits when the data is needed.
Right now, over 11 trillion gigabytes of data exist, with at least 2.5 million gigabytes more added each day. The data storage
media available in the world cannot keep up with this massive increase. DNA storage is one solution to this problem of
storage.
DNA stands for deoxyribonucleic acid. It is a complex organic molecule that carries the genetic information of a living thing.
It is found in all humans and stores information like skin color, eye color, height, and other physical and biological traits.
A DNA helix is built from pairs of four unique bases: adenine (A), guanine (G), cytosine (C), and thymine (T). These bases attach to the DNA strand in pairs, called base pairs; the two base pairings are adenine-thymine and guanine-cytosine.
Data is stored in binary digits (1s and 0s) in traditional computing. In DNA data storage, the four nucleotide bases (A, C, G, T) store and encode data. Information is stored in permutations of three nucleotide bases, called codons.
DNA storage comprises three processes: coding the data, synthesizing and storing it, and decoding it. Binary codes holding the information are translated into DNA codes, or codons, using an algorithm. The synthesized strands are then deposited in a container in a cool, regulated environment. The DNA carrying the information can be frozen in solution, stored as droplets, or stored on silicon chips.
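As a rough software sketch of the coding and decoding steps described above, the snippet below maps two bits to each nucleotide (00→A, 01→C, 10→G, 11→T) and groups the result into three-base codons; this simple mapping is an assumption made here for illustration, and real systems use far more elaborate, error-tolerant codes.

```python
# Minimal sketch of the coding/decoding steps, assuming a simple 2-bits-per-base
# mapping (00->A, 01->C, 10->G, 11->T); real encoders use error-tolerant codes.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    bases = "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))
    # Group the bases into three-letter codons purely for readability.
    return " ".join(bases[i:i + 3] for i in range(0, len(bases), 3))

def decode(strand: str) -> bytes:
    bits = "".join(BASE_TO_BITS[b] for b in strand.replace(" ", ""))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

strand = encode(b"DNA")
print(strand)                  # CAC ACA TGC AAC  (12 bases for 3 bytes)
assert decode(strand) == b"DNA"
```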
Scientists are working on making the reading of DNA storage faster and less expensive. As of now, data stored in DNA must
be taken to the lab to be decoded into error-free binary information, and it takes a long time.
DNA data storage is the preferred solution for the storage shortage problem because it can store large amounts of data in
very little space. One gram of DNA can store 215 petabytes of data. A petabyte is 1,024 terabytes. So one gram of DNA can
store approximately 220,160 terabytes.
Compare that with current technology: a one-terabyte hard disk drive weighs approximately 400 grams. So, to store the
equivalent amount of data one gram of DNA keeps, you need more than 88 million grams of hard drives.
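A quick back-of-envelope check of those figures (all numbers taken from the text above):

```python
# Back-of-envelope check of the figures quoted above.
petabytes_per_gram = 215
terabytes_per_petabyte = 1024
terabytes_per_gram = petabytes_per_gram * terabytes_per_petabyte
print(terabytes_per_gram)          # 220160 TB stored in one gram of DNA

grams_per_terabyte_hdd = 400       # ~400 g for a 1 TB hard disk drive
print(terabytes_per_gram * grams_per_terabyte_hdd)   # 88,064,000 g of hard drives
```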
With this information, researchers say that all the data in the world right now can fit into a shoebox using DNA data storage.
Using DNA as a storage medium comes with many benefits over conventional digital storage: high data storage capacity, a considerably longer lifespan than other forms of storage, compactness, low susceptibility to technical and electrical failures, and replicability.
Storage Density
The main advantage of DNA storage over other storage mediums is storage density. Even if you store your data remotely in the cloud or on a NAS, it still lives on large servers in data centers. These data centers can be as large as football stadiums and cost billions of dollars to build and maintain. It is not the same with DNA data storage.
DNA data storage allows you to store massive amounts of data in a very compact space, reducing the problems of space, maintenance expenses, and shortage of storage equipment.
Durability
The digital storage equipment available today is far from durable. It is all prone to decay and degradation. Digital decay is the gradual decomposition of data stored on a computer, and it affects millions of people every year.
DNA has a half-life of about 500 years. When stored in an optimal, regulated environment, data stored in DNA can remain readable for hundreds of years.
Replicability
Because data degrades, data in data centers must periodically be copied and transferred onto other hardware to preserve the information stored, a process that is frequently cumbersome. Data stored in DNA can easily be replicated. One method scientists have tested is to insert the DNA carrying stored information into a bacterium. The bacterium then reproduces on its own, producing generations of bacteria that carry the same information as the original DNA without errors or loss.
So, is DNA data storage the answer to today's storage problems? Quite frankly, yes. It ticks all the boxes, and it is already in use by companies that want to preserve extensive archives of information that do not need to be accessed regularly.
Unfortunately, it will be quite some time before DNA storage is a commonplace and affordable storage option available to
the public. In the meantime, we must carefully pick out the best storage format for long-term data storage.
On Earth right now, there are about 10 trillion gigabytes of digital data, and every day, humans produce emails, photos,
tweets, and other digital files that add up to another 2.5 million gigabytes of data. Much of this data is stored in enormous
facilities known as exabyte data centers (an exabyte is 1 billion gigabytes), which can be the size of several football fields
and cost around $1 billion to build and maintain.
Many scientists believe that an alternative solution lies in the molecule that contains our genetic information: DNA, which
evolved to store massive quantities of information at very high density. A coffee mug full of DNA could theoretically store
all of the world’s data, says Mark Bathe, an MIT professor of biological engineering.
“We need new solutions for storing these massive amounts of data that the world is accumulating, especially the archival
data,” says Bathe, who is also an associate member of the Broad Institute of MIT and Harvard. “DNA is a thousandfold
denser than even flash memory, and another property that’s interesting is that once you make the DNA polymer, it doesn’t
consume any energy. You can write the DNA and then store it forever.”
Scientists have already demonstrated that they can encode images and pages of text as DNA. However, an easy way to pick
out the desired file from a mixture of many pieces of DNA will also be needed. Bathe and his colleagues have now
demonstrated one way to do that, by encapsulating each data file into a 6-micrometer particle of silica, which is labeled
with short DNA sequences that reveal the contents.
Using this approach, the researchers demonstrated that they could accurately pull out individual images stored as DNA
sequences from a set of 20 images. Given the number of possible labels that could be used, this approach could scale up
to 10²⁰ files.
Bathe is the senior author of the study, which appears today in Nature Materials. The lead authors of the paper are MIT
senior postdoc James Banal, former MIT research associate Tyson Shepherd, and MIT graduate student Joseph Berleant.
Stable storage
Digital storage systems encode text, photos, or any other kind of information as a series of 0s and 1s. This same information
can be encoded in DNA using the four nucleotides that make up the genetic code: A, T, G, and C. For example, G and C
could be used to represent 0 while A and T represent 1.
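A toy illustration of that particular mapping (G or C for a 0, A or T for a 1) might look like the sketch below; choosing randomly between the two bases for each bit is an assumption made here for illustration, since real schemes pick bases to avoid long repeats and skewed GC content.

```python
import random

# Toy version of the mapping described above: G or C encodes 0, A or T encodes 1.
def bits_to_dna(bits: str) -> str:
    return "".join(random.choice("GC") if b == "0" else random.choice("AT") for b in bits)

def dna_to_bits(strand: str) -> str:
    return "".join("0" if base in "GC" else "1" for base in strand)

bits = "01101000"              # one example byte
strand = bits_to_dna(bits)
print(strand)                  # e.g. "CAAGTGCC" -- one of many valid encodings
assert dna_to_bits(strand) == bits
```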
DNA has several other features that make it desirable as a storage medium: It is extremely stable, and it is fairly easy (but
expensive) to synthesize and sequence. Also, because of its high density — each nucleotide, equivalent to up to two bits,
is about 1 cubic nanometer — an exabyte of data stored as DNA could fit in the palm of your hand.
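A rough calculation with the figures just quoted (two bits and roughly one cubic nanometer per nucleotide) shows why an exabyte fits in a palm:

```python
# Rough volume of one exabyte stored as DNA, using the figures quoted above.
bits = 8 * 1e18                 # one exabyte expressed in bits
nucleotides = bits / 2          # at two bits per nucleotide
volume_nm3 = nucleotides * 1.0  # ~1 cubic nanometer per nucleotide
volume_mm3 = volume_nm3 / 1e18  # 1 mm^3 = 1e18 nm^3
print(volume_mm3)               # ~4.0 cubic millimeters -- easily palm-sized
```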
One obstacle to this kind of data storage is the cost of synthesizing such large amounts of DNA. Currently it would cost $1
trillion to write one petabyte of data (1 million gigabytes). To become competitive with magnetic tape, which is often used
to store archival data, Bathe estimates that the cost of DNA synthesis would need to drop by about six orders of magnitude.
Bathe says he anticipates that will happen within a decade or two, similar to how the cost of storing information on flash
drives has dropped dramatically over the past couple of decades.
Aside from the cost, the other major bottleneck in using DNA to store data is the difficulty in picking out the file you want
from all the others.
“Assuming that the technologies for writing DNA get to a point where it’s cost-effective to write an exabyte or zettabyte of
data in DNA, then what? You're going to have a pile of DNA, which is a gazillion files, images or movies and other stuff, and
you need to find the one picture or movie you’re looking for,” Bathe says. “It’s like trying to find a needle in a haystack.”
Currently, DNA files are conventionally retrieved using PCR (polymerase chain reaction). Each DNA data file includes a
sequence that binds to a particular PCR primer. To pull out a specific file, that primer is added to the sample to find and
amplify the desired sequence. However, one drawback to this approach is that there can be crosstalk between the primer
and off-target DNA sequences, leading unwanted files to be pulled out. Also, the PCR retrieval process requires enzymes
and ends up consuming most of the DNA that was in the pool.
“You’re kind of burning the haystack to find the needle, because all the other DNA is not getting amplified and you’re
basically throwing it away,” Bathe says.
File retrieval
As an alternative approach, the MIT team developed a new retrieval technique that involves encapsulating each DNA file
into a small silica particle. Each capsule is labeled with single-stranded DNA “barcodes” that correspond to the contents of
the file. To demonstrate this approach in a cost-effective manner, the researchers encoded 20 different images into pieces
of DNA about 3,000 nucleotides long, which is equivalent to about 100 bytes. (They also showed that the capsules could
fit DNA files up to a gigabyte in size.)
Each file was labeled with barcodes corresponding to labels such as “cat” or “airplane.” When the researchers want to pull
out a specific image, they remove a sample of the DNA and add primers that correspond to the labels they’re looking for
— for example, “cat,” “orange,” and “wild” for an image of a tiger, or “cat,” “orange,” and “domestic” for a housecat.
The primers are labeled with fluorescent or magnetic particles, making it easy to pull out and identify any matches from
the sample. This allows the desired file to be removed while leaving the rest of the DNA intact to be put back into storage.
Their retrieval process allows Boolean logic statements such as “president AND 18th century” to generate George Washington as a result, similar to what a Google image search retrieves.
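A minimal software simulation of that retrieval idea is sketched below; the file names and labels are illustrative, and in the real system the labels are single-stranded DNA barcodes on the silica capsules, with the AND query performed chemically by fluorescently or magnetically tagged primers.

```python
# Software simulation of barcode-based retrieval with Boolean AND queries.
ARCHIVE = {
    "tiger.jpg":    {"cat", "orange", "wild"},
    "housecat.jpg": {"cat", "orange", "domestic"},
    "airplane.jpg": {"airplane", "metal", "sky"},
}

def retrieve(query_labels: set) -> list:
    """Return every file whose label set contains all of the queried labels."""
    return [name for name, labels in ARCHIVE.items() if query_labels <= labels]

print(retrieve({"cat", "orange", "wild"}))      # ['tiger.jpg']
print(retrieve({"cat", "orange", "domestic"}))  # ['housecat.jpg']
print(retrieve({"cat"}))                        # ['tiger.jpg', 'housecat.jpg']
```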
“At the current state of our proof-of-concept, we’re at the 1 kilobyte per second search rate. Our file system’s search rate
is determined by the data size per capsule, which is currently limited by the prohibitive cost to write even 100 megabytes
worth of data on DNA, and the number of sorters we can use in parallel. If DNA synthesis becomes cheap enough, we
would be able to maximize the data size we can store per file with our approach,” Banal says.
George Church, a professor of genetics at Harvard Medical School, describes the technique as “a giant leap for knowledge
management and search tech.”
“The rapid progress in writing, copying, reading, and low-energy archival data storage in DNA form has left poorly explored
opportunities for precise retrieval of data files from huge (10²¹ byte, zetta-scale) databases,” says Church, who was not
involved in the study. “The new study spectacularly addresses this using a completely independent outer layer of DNA and
leveraging different properties of DNA (hybridization rather than sequencing), and moreover, using existing instruments
and chemistries.”
Bathe envisions that this kind of DNA encapsulation could be useful for storing “cold” data, that is, data that is kept in an
archive and not accessed very often. His lab is spinning out a startup, Cache DNA, that is now developing technology for
long-term storage of DNA, both for DNA data storage in the long-term, and clinical and other preexisting DNA samples in
the near-term.
“While it may be a while before DNA is viable as a data storage medium, there already exists a pressing need today for
low-cost, massive storage solutions for preexisting DNA and RNA samples from Covid-19 testing, human genomic
sequencing, and other areas of genomics,” Bathe says.
Integrated information storage technology for writing large amounts of digital information in DNA using an enzyme-driven,
sustainable, low-cost approach.
The genetic material DNA has garnered considerable interest as a medium for digital information storage because its
density and durability are superior to those of existing silicon-based storage media. For example, DNA is at least 1000-fold
more dense than the most compact solid-state hard drive and at least 300-fold more durable than the most stable magnetic
tapes. In addition, DNA’s four-letter nucleotide code offers a suitable coding environment that can be leveraged like the
binary digital code used by computers and other electronic devices to represent any letter, digit, or other character.
Despite these advantages, DNA has not yet become a widespread information storage medium because the cost of
chemically synthesizing DNA is still prohibitively high at $3,500 per 1 megabyte of information. To help overcome this
limitation, research at the Wyss Institute spearheaded by Henry Hung-Yi Lee, Ph.D., in a collaborative project led by Core
Faculty member George Church, Ph.D., and Founding Director Donald Ingber, M.D., Ph.D., has developed new, enzyme-
based approaches that can write DNA more simply and quickly than traditional chemical techniques. These approaches could also produce much longer strands of DNA while being less toxic to the environment. Importantly, this approach is projected to reduce the cost of DNA synthesis by many orders of magnitude in the future.
In a world flooded with data, figuring out where and how to store it efficiently and inexpensively becomes a larger problem
every day. One of the most exotic solutions might turn out to be one of the best: archiving information in DNA molecules.
The prevailing long-term cold-storage method, which dates from the 1950s, writes data to pizza-sized reels of magnetic
tape. By comparison, DNA storage is potentially less expensive, more energy-efficient and longer lasting. Studies show that
DNA properly encapsulated with a salt remains stable for decades at room temperature and should last much longer in
the controlled environs of a data center. DNA doesn’t require maintenance, and files stored in DNA are easily copied for
negligible cost.
Even better, DNA can archive a staggering amount of information in an almost inconceivably small volume. Consider this:
humanity will generate an estimated 33 zettabytes of data by 2025—that’s 3.3 followed by 22 zeroes. DNA storage can
squeeze all that information into a ping-pong ball, with room to spare. The 74 million million bytes of information in the
Library of Congress could be crammed into a DNA archive the size of a poppy seed—6,000 times over. Split the seed in half,
and you could store all of Facebook’s data.
Science fiction? Hardly. DNA storage technology exists today, but to make it viable, researchers have to clear a few daunting
technological hurdles around integrating different technologies. As part of a major collaboration to do that work, our team
at Los Alamos National Laboratory has developed a key enabling technology for molecular storage. Our software, the
Adaptive DNA Storage Codex (ADS Codex), translates data files from the binary language of zeroes and ones that computers
understand into the four-letter code biology understands.
ADS Codex is a key part of the Intelligence Advanced Research Projects Activity (IARPA) Molecular Information Storage
(MIST) program. MIST seeks to bring cheaper, bigger, longer-lasting storage to big-data operations in government and the
private sector, with a short-term goal of writing one terabyte—a trillion bytes—and reading 10 terabytes within 24 hours
at a cost of $1,000.
When most people think of DNA, they think of life, not computers. But DNA is itself a four-letter code for passing along
information about an organism. DNA molecules are made from four types of bases, or nucleotides, each identified by a
letter: adenine (A), thymine (T), guanine (G) and cytosine (C). They are the basis of all DNA code, providing the instruction
manual for building every living thing on earth.
A fairly well-understood technology, DNA synthesis has been widely used in medicine, pharmaceuticals and biofuel
development, to name just a few applications. The technique organizes the bases into various arrangements indicated by
specific sequences of A, C, G and T. These bases wrap in a twisted chain around each other—the familiar double helix—to
form the molecule. The arrangement of these letters into sequences creates a code that tells an organism how to form.
The complete set of DNA molecules makes up the genome—the blueprint of your body. By synthesizing DNA molecules—
making them from scratch—researchers have found they can specify, or write, long strings of the letters A, C, G and T and
then read those sequences back. The process is analogous to how a computer stores binary information. From there, it
was a short conceptual step to encoding a binary computer file into a molecule.
The method has been proven to work, but reading and writing the DNA-encoded files currently takes a long time.
Appending a single base to DNA takes about one second. Writing an archive file at this rate could take decades, but researchers are developing faster methods, including massively parallel operations that write to many molecules at once.
ADS Codex tells exactly how to translate the zeros and ones into sequences of four-letter combinations of A, C, G and T.
The Codex also handles the decoding back into binary. DNA can be synthesized by several methods, and ADS Codex can
accommodate them all.
Unfortunately, compared to traditional digital systems, the error rates while writing to molecular storage with DNA
synthesis are very high. These errors arise from a different source than they do in the digital world, making them trickier
to correct. On a digital hard disk, binary errors occur when a zero flips to a one, or vice versa. With DNA, the problems
come from insertion and deletion errors. For instance, you might be writing A-C-G-T, but sometimes you try to write A, and
nothing appears, so the sequence of letters shifts to the left, or it types AAA.
Normal error correction codes don’t work well with that kind of problem, so ADS Codex adds error detection codes that
validate the data. When the software converts the data back to binary, it tests to see that the codes match. If they don’t,
it removes or adds bases—letters—until the verification succeeds.
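The toy sketch below illustrates the general validate-and-repair idea for a single dropped base; it is not the ADS Codex algorithm, and the two-bits-per-base mapping and hash-based checksum are assumptions made purely for illustration.

```python
import hashlib
from typing import Optional

# Toy illustration of validate-and-repair decoding for one dropped base.
# NOT the ADS Codex algorithm; the mapping and checksum are illustrative.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}
CHECK_LEN = 8  # checksum length in bases

def checksum(payload: str) -> str:
    digest = hashlib.sha256(payload.encode()).digest()
    return "".join("ACGT"[b % 4] for b in digest[:CHECK_LEN])

def encode(bits: str) -> str:
    payload = "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))
    return payload + checksum(payload)

def decode(strand: str, expected_len: int) -> Optional[str]:
    """If one base was dropped, try every re-insertion until the checksum matches."""
    candidates = [strand]
    if len(strand) == expected_len - 1:
        candidates = [strand[:i] + base + strand[i:]
                      for i in range(len(strand) + 1) for base in "ACGT"]
    for cand in candidates:
        payload, check = cand[:-CHECK_LEN], cand[-CHECK_LEN:]
        if len(cand) == expected_len and checksum(payload) == check:
            return "".join(BASE_TO_BITS[b] for b in payload)
    return None  # unrecoverable with this toy scheme

original = encode("0110100001101001")         # the bits of ASCII "hi"
damaged = original[:5] + original[6:]         # simulate one dropped base
print(decode(damaged, len(original)))         # 0110100001101001
```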
SMART SCALE-UP
We have completed version 1.0 of ADS Codex, and late this year we plan to use it to evaluate the storage and retrieval
systems developed by the other MIST teams. The work fits well with Los Alamos’ history of pioneering new developments
in computing as part of our national security mission. Since the 1940s, as an outcome of those computing advancements,
we have amassed some of the oldest and largest stores of digital-only data. It still has tremendous value. Because we keep
data forever, we’ve been at the tip of the spear for a long time when it comes to finding a cold-storage solution, but we’re
not alone.
All the world’s data—all your digital photos and tweets; all the records of the global financial sector; all those satellite
images of cropland, troop movements and glacial melting; all the simulations underlying so much of modern science; and
so much more—have to go somewhere. The “cloud” isn’t a cloud at all. It is digital data centers in huge warehouses
consuming vast amounts of electricity to store (and keep cool) trillions of millions of bytes. Costing billions of dollars to
build, power and run, these data centers may struggle to remain viable as the need for data storage continues to grow
exponentially.
DNA shows great promise for sating the world’s voracious appetite for data storage. The technology requires new tools
and new ways of applying familiar ones. But don’t be surprised if one day the world’s most valuable archives find a new
home in a poppy-seed-sized collection of molecules.