LETTER doi:10.1038/nature11875

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA
Nick Goldman¹, Paul Bertone¹, Siyuan Chen², Christophe Dessimoz¹, Emily M. LeProust², Botond Sipos¹ & Ewan Birney¹

Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage¹ because of its capacity for high-density information encoding, longevity under easily achieved conditions²⁻⁴ and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information⁵⁻⁷ or were not amenable to scaling-up⁸, and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival⁹. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information¹⁰ of 5.2 × 10⁶ bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.

Although techniques for manipulating, storing and copying large amounts of existing DNA have been established for many years¹¹⁻¹³, one of the main challenges for practical DNA-based information storage is the difficulty of synthesizing long sequences of DNA de novo to an exactly specified design. As in the approach of ref. 9, we represent the information being stored as a hypothetical long DNA molecule and encode this in vitro using shorter DNA fragments. This offers the benefits that isolated DNA fragments are easily manipulated in vitro¹¹,¹³, and that the routine recovery of intact fragments from samples that are tens of thousands of years old¹⁴,¹⁵ indicates that well-prepared synthetic DNA should have an exceptionally long lifespan in low-maintenance environments³,⁴. In contrast, approaches using living vectors⁶⁻⁸ are not as reliable, scalable or cost-efficient owing to disadvantages such as constraints on the genomic elements and locations that can be manipulated without affecting viability, the fact that mutation will cause the fidelity of stored and decoded information to reduce over time, and possibly the requirement for storage conditions to be carefully regulated. Existing schemes used for DNA computing in principle permit large-scale memory¹,¹⁶, but data encoding in DNA computing is inextricably linked to the specific application or algorithm¹⁷ and no practical storage schemes have been realized.

As a proof of concept for practical DNA-based storage, we selected and encoded a range of common computer file formats to emphasize the ability to store arbitrary digital information. The five files comprised all 154 of Shakespeare’s sonnets (ASCII text), a classic scientific paper¹⁸ (PDF format), a medium-resolution colour photograph of the European Bioinformatics Institute (JPEG 2000 format), a 26-s excerpt from Martin Luther King’s 1963 ‘I have a dream’ speech (MP3 format) and a Huffman code¹⁰ used in this study to convert bytes to base-3 digits (ASCII text), giving a total of 757,051 bytes or a Shannon information¹⁰ of 5.2 × 10⁶ bits (see Supplementary Information and Supplementary Table 1 for full details).

The bytes comprising each file were represented as single DNA sequences with no homopolymers (runs of ≥2 identical bases, which are associated with higher error rates in existing high-throughput sequencing technologies¹⁹ and led to errors in a recent DNA-storage experiment⁹). Each DNA sequence was split into overlapping segments, generating fourfold redundancy, and alternate segments were converted to their reverse complement (see Fig. 1 and Supplementary Information). These measures reduce the probability of systematic failure for any particular string, which could lead to uncorrectable errors and data loss. Each segment was then augmented with indexing information that permitted determination of the file from which it originated and its location within that file, and simple parity-check error-detection¹⁰. In all, the five files were represented by a total of 153,335 strings of DNA, each comprising 117 nucleotides (nt). The perfectly uniform fragment lengths and absence of homopolymers make it obvious that the synthesized DNA does not have a natural (biological) origin, and so imply the presence of deliberate design and encoded information².

We synthesized oligonucleotides (oligos) corresponding to our designed DNA strings using an updated version of Agilent Technologies’ OLS (oligo library synthesis) process²⁰, creating ~1.2 × 10⁷ copies of each DNA string. Errors occur only rarely (~1 error per 500 bases) and independently in the different copies of each string, again enhancing our method’s error tolerance. We shipped the synthesized DNA in lyophilized form that is expected to have excellent long-term preservation characteristics³,⁴, at ambient temperature and without specialized packaging, from the USA to Germany via the UK. After resuspension, amplification and purification, we sequenced a sample of the resulting library products at the EMBL Genomics Core Facility in paired-end mode on the Illumina HiSeq 2000. We transferred the remainder of the library to multiple aliquots and re-lyophilized these for long-term storage.

Our base calling using AYB²¹ yielded 79.6 × 10⁶ read-pairs of 104 bases in length, from which we reconstructed full-length (117-nt) DNA strings in silico. Strings with uncertainties due to synthesis or sequencing errors were discarded and the remainder decoded using the reverse of the encoding procedure, with the error-detection bases and properties of the coding scheme allowing us to discard further strings containing errors. Although many discarded strings will have contained information that could have been recovered with more sophisticated decoding, the high level of redundancy and sequencing coverage rendered this unnecessary in our experiment. Full-length DNA sequences representing the original encoded files were then reconstructed in silico. The decoding process used no additional information derived from knowledge of the experimental design. Full details of the encoding, sequencing and decoding processes are given in Supplementary Information.

Four of the five resulting DNA sequences could be fully decoded without intervention. The fifth however contained two gaps, each a run
¹European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK. ²Agilent Technologies, Genomics–LSSU, 5301 Stevens Creek Boulevard, Santa Clara, California 95051, USA.

7 FEBRUARY 2013 | VOL 494 | NATURE | 77
©2013 Macmillan Publishers Limited. All rights reserved

[Figure 1 panels: a, Binary/text file; b, Base-3-encoded; c, DNA-encoded; d, DNA fragments. Labels in panel d: 25 bp; alternate fragments have file information reverse complemented; DNA-encoded indexing information added.]

Figure 1 | Digital information encoding in DNA. Digital information (a, in blue), here binary digits holding the ASCII codes for part of Shakespeare’s sonnet 18, was converted to base-3 (b, red) using a Huffman code that replaces each byte with five or six base-3 digits (trits). This in turn was converted in silico to our DNA code (c, green) by replacement of each trit with one of the three nucleotides different from the previous one used, ensuring no homopolymers were generated. This formed the basis for a large number of overlapping segments of length 100 bases with overlap of 75 bases, creating fourfold redundancy (d, green and, with alternate segments reverse complemented for added data security, violet). Indexing DNA codes were added (yellow), also encoded as non-repeating DNA nucleotides. See Supplementary Information for further details.
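
The two steps described in the caption, trit-to-nucleotide conversion and overlapping segmentation, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' software: the function names, the alphabetical ordering of the three candidate nucleotides and the initial "previous" base of A are hypothetical choices, and the Huffman byte-to-trit step, the indexing bases and the parity check are omitted.

```python
# Illustrative sketch only. The candidate ordering and the starting base "A"
# are assumptions, not the table published by the authors.
NUCLEOTIDES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trits_to_dna(trits, previous="A"):
    """Replace each trit (0, 1 or 2) with one of the three bases that
    differ from the previously emitted base, so no base ever repeats."""
    out = []
    for trit in trits:
        candidates = [n for n in NUCLEOTIDES if n != previous]
        previous = candidates[trit]
        out.append(previous)
    return "".join(out)


def dna_to_trits(dna, previous="A"):
    """Invert trits_to_dna; raises ValueError if a base repeats."""
    trits = []
    for base in dna:
        candidates = [n for n in NUCLEOTIDES if n != previous]
        trits.append(candidates.index(base))
        previous = base
    return trits


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def overlapping_segments(dna, length=100, step=25):
    """Cut windows of `length` bases every `step` bases (75-base overlap
    with the defaults) and reverse complement every second window."""
    segments = []
    for i, start in enumerate(range(0, len(dna) - length + 1, step)):
        window = dna[start:start + length]
        segments.append(reverse_complement(window) if i % 2 else window)
    return segments


if __name__ == "__main__":
    trits = [i % 3 for i in range(200)]
    dna = trits_to_dna(trits)
    print(dna[:20], len(overlapping_segments(dna)))
```

With 100-base windows advanced by 25 bases, every interior base falls in 100/25 = 4 windows, which is the fourfold redundancy the caption describes. A 25-base stretch is therefore lost only if four consecutive windows all fail to be recovered.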

of 25 bases, for which no segment was detected corresponding to the original DNA. Each of these gaps was caused by the failure to sequence any oligo representing any of four consecutive overlapping segments. Inspection of the neighbouring regions of the reconstructed sequence permitted us to hypothesize what the missing nucleotides should have been (see Supplementary Information) and we manually inserted those 50 bases accordingly. This sequence could also then be decoded. Inspection confirmed that our original computer files had been reconstructed with 100% accuracy.

An important issue for long-term digital archiving is how DNA-based storage scales to larger applications. The number of bases of synthesized DNA needed to encode information grows linearly with the amount of information to be stored, but we must also consider the indexing information required to reconstruct full-length files from short fragments. As indexing information grows only as the logarithm of the number of fragments to be indexed, the total amount of synthesized DNA required grows sub-linearly. Increasingly large parts of each fragment are needed for indexing however and, although it is reasonable to expect synthesis of longer strings to be possible in future, we modelled the behaviour of our scheme under the conservative constraint of a constant 114 nt available for both data and indexing information (see Supplementary Information). As the total amount of information increases, the encoding efficiency decreases only slowly (Fig. 2a). In our experiment (megabyte scale) the encoding scheme is 88% efficient; Fig. 2a indicates that efficiency remains >70% for data storage on petabyte (PB, 10¹⁵ bytes) scales and >65% on exabyte (EB, 10¹⁸ bytes) scales, and that DNA-based storage remains feasible on scales many orders of magnitude greater than current global data volumes²². Figure 2a also shows that costs (per unit information stored) rise only slowly as data volumes increase over many orders of magnitude. Efficiency and costs scale even more favourably if we consider the synthesized fragment lengths available using the latest technology (Supplementary Fig. 5).

As the amount of information stored increases, decoding requires more strings to be sequenced. A fixed decoding expenditure per byte of encoded information would mean that each base is read fewer times and so is more likely to suffer decoding error. But extension of our scaling analysis to model the influence of reduced sequencing coverage on the per-decoded-base error rate (see Supplementary Information) revealed that error rates increase only very slowly as the amount of information encoded increases to a global data scale and beyond (Supplementary Table 4). This also suggests that our mean sequencing coverage of 1,308 times was considerably in excess of that needed for reliable decoding. We confirmed this by subsampling from the 79.6 × 10⁶ read-pairs to simulate experiments with lower coverage. Figure 2b indicates that reducing the coverage by a factor of 10 (or even more) would have led to unaltered decoding characteristics, which further illustrates the robustness of our DNA-storage method.

DNA-based storage might already be economically viable for long-horizon archives with a low expectation of extensive access, such as government and historical records²³,²⁴. An example in a scientific context is CERN’s CASTOR system²⁵, which stores a total of 80 PB of Large Hadron Collider data and grows at 15 PB yr⁻¹. Only 10% is maintained on disk, and CASTOR migrates regularly between magnetic tape formats. Archives of older data are needed for potential future verification of events, but access rates decrease considerably 2–3 years after collection. Further examples are found in astronomy, medicine and interplanetary exploration²⁶. With negligible computational costs and optimized use of the technologies we employed, we estimate current costs to be $12,400 MB⁻¹ for information storage in DNA and $220 MB⁻¹ for information decoding. Modelling relative long-term costs of archiving using DNA-based storage or magnetic tape shows that the key parameters are the ratio of the one-time cost of synthesizing the DNA to the recurrent fixed cost of transferring data between tape technologies or media, which we estimate to be 125–500 currently, and the frequency of tape transition events (Supplementary Information and Supplementary Fig. 7). We find that with current technology and our encoding scheme, DNA-based storage may be cost-effective for archives of several megabytes with a ~600–5,000-yr horizon (Fig. 2c). One order of magnitude reduction in synthesis costs reduces this to ~50–500 yr; with two orders

[Figure 2, three panels. a: x axis, Information to be encoded (B), with data scales 1 MB, 1 GB, 1 TB, 1 PB, 1 EB and 3 ZB (estimated global data amount) marked; left y axis, Efficiency (%); right y axis, Cost (10⁴ $ MB⁻¹); curves for current costs and projected costs (100× cheaper); maximum information for 117 nt strings marked. b: x axis, Reads used (%); y axis, Base error (%); curves labelled Theoretical (Monte Carlo), watsoncrick.pdf and Four other files. c: x axis, Years; y axis, Relative cost of DNA-storage writing versus tape transfer fixed cost; break-even curves for tape rewritten every 2.5, 5, 7.5, 10, 12.5 and 15 yr; regions labelled Tape less expensive (assuming tape rewritten every 5 yr) and DNA storage less expensive (assuming tape rewritten every 10 yr); bands labelled Current costs for 1 MB of data, DNA synthesis 10× cheaper and DNA synthesis 100× cheaper.]

Figure 2 | Scaling properties and robustness of DNA-based storage. a, Encoding efficiency and costs change as the amount of stored information increases. The x axis (logarithmic scale) represents the total amount of information to be encoded. Common data scales are indicated, including the three zettabyte (3 ZB, 3 × 10²¹ bytes) global data estimate, shown red. The black line (y-axis scale to left) indicates encoding efficiency, measured as the proportion of synthesized bases available for data encoding. The blue curves (y-axis scale to right) indicate the corresponding effect on encoding costs, both at current synthesis cost levels (solid line) and in the case of a two-order-of-magnitude reduction (dashed line). b, Per-recovered-base error rate (y axis) as a function of sequencing coverage, represented by the percentage of the original 79.6 × 10⁶ read-pairs sampled (x axis; logarithmic scale). The blue curve represents the four files recovered without human intervention: the error is zero when ≥2% of the original reads are used. The grey curve is obtained by Monte Carlo simulation from our theoretical error rate model. The orange curve represents the file (watsoncrick.pdf) that required manual correction: the minimum possible error rate is 0.0036%. The boxed area is shown magnified in the inset. c, Timescales for which DNA-based storage is cost-effective. The blue curve indicates the relationship between break-even time beyond which DNA storage is less expensive than magnetic tape (x axis) and relative cost of DNA-storage synthesis and tape transfer fixed costs (y axis), assuming the tape archive has to be read and rewritten every 5 yr. The orange curve corresponds to tape transfers every 10 yr; broken curves correspond to other transfer periods as indicated. In the green-shaded region, DNA storage is cost-effective when transfers occur more frequently than every 10 yr; in the yellow-shaded region, DNA storage is cost-effective when transfers occur every 5–10 yr; in the red-shaded region tape is less expensive when transfers occur less frequently than every 5 yr. Grey-shaded ranges of relative costs of DNA synthesis to tape transfer are 125–500 (current costs for 1 MB of data), 12.5–50 (achieved if DNA synthesis costs are reduced by one order of magnitude) and 1.25–5 (costs reduced by two orders of magnitude). Note the logarithmic scales on both axes. See Supplementary Information for further details.
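
The break-even relationship in panel c can be approximated with a deliberately simplified model, sketched here in Python. The assumption (ours, not taken from the Supplementary Information) is that tape incurs one fixed-cost transfer per transfer period and DNA incurs only its one-time synthesis cost, so cumulative tape cost after t years is (t / period) × transfer cost and break-even occurs at t = cost ratio × period. The function and variable names are hypothetical; the cost ratios and the 5-yr and 10-yr periods are those quoted in the caption.

```python
# Simplified linear model (an assumption for illustration; the authors' full
# model is described in their Supplementary Information and may differ).
def break_even_years(cost_ratio, transfer_period_years):
    """Years after which accumulated tape-transfer costs equal the
    one-time DNA synthesis cost.

    cost_ratio: DNA synthesis cost divided by the fixed cost of one tape transfer.
    transfer_period_years: interval between tape transfers.
    """
    return cost_ratio * transfer_period_years


# Ranges of relative cost quoted in the Figure 2 caption.
COST_RATIO_RANGES = {
    "current": (125, 500),
    "synthesis 10x cheaper": (12.5, 50),
    "synthesis 100x cheaper": (1.25, 5),
}


def horizon(scenario, fast_period=5, slow_period=10):
    """Shortest and longest break-even time for a scenario: the low ratio
    with frequent transfers and the high ratio with infrequent transfers."""
    low, high = COST_RATIO_RANGES[scenario]
    return (break_even_years(low, fast_period),
            break_even_years(high, slow_period))


if __name__ == "__main__":
    for name in COST_RATIO_RANGES:
        print(name, horizon(name))
```

Under these assumptions the three scenarios give 625–5,000 yr, 62.5–500 yr and 6.25–50 yr, in line with the rounded horizons of roughly 600–5,000 yr, 50–500 yr and less than 50 yr stated in the main text. The model ignores running storage costs and any discounting of future spending.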
of magnitude reduction, as can be expected in less than a decade if current trends continue (ref. 13, and http://www.synthesis.cc/2011/06/new-cost-curves.html), DNA-based storage becomes practical for archives with a horizon of less than 50 yr. The speed of DNA-storage writing and reading are not competitive with current technology, but both processes can be accelerated through parallelization (Supplementary Information).

The DNA-based storage medium has different properties from traditional tape- or disk-based storage. As DNA is the basis of life on Earth, methods for manipulating, storing and reading it will remain the subject of continual technological innovation. As with any storage system, a large-scale DNA archive would need stable DNA management²⁷ and physical indexing of depositions. But whereas current digital schemes for archiving require active and continuing maintenance and regular transferring between storage media, the DNA-based storage medium requires no active maintenance other than a cold, dry and dark environment³,⁴ (such as the Global Crop Diversity Trust’s Svalbard Global Seed Vault, which has no permanent on-site staff²⁸) yet remains viable for thousands of years even by conservative estimates. We achieved an information storage density of ~2.2 PB g⁻¹ (Supplementary Information). Our sequencing protocol consumed just 10% of the library produced from the synthesized DNA (Supplementary Table 2), already leaving enough for multiple equivalent copies. Existing technologies for copying DNA are highly efficient¹¹,¹³, meaning that DNA is an excellent medium for the creation of copies of any archive for transportation, sharing or security. Overall, DNA-based storage has potential as a practical solution to the digital archiving problem and may become a cost-effective solution for rarely accessed archives.

Received 15 May; accepted 12 December 2012.
Published online 23 January 2013.

1. Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583–585 (1995).
2. Cox, J. P. L. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001).
3. Anchordoquy, T. J. & Molina, M. C. Preservation of DNA. Cell Preserv. Technol. 5, 180–188 (2007).
4. Bonnet, J. et al. Chain and conformation stability of solid-state DNA: implications for room temperature storage. Nucleic Acids Res. 38, 1531–1546 (2010).
5. Clelland, C. T., Risca, V. & Bancroft, C. Hiding messages in DNA microdots. Nature 399, 533–534 (1999).
6. Kac, E. Genesis (1999); available at http://www.ekac.org/geninfo.html (accessed 10 May 2012).


7. Ailenberg, M. & Rotstein, O. D. An improved Huffman coding method for archiving text, images, and music characters in DNA. Biotechniques 47, 747–754 (2009).
8. Gibson, D. G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56 (2010).
9. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012).
10. MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms (Cambridge Univ. Press, 2003).
11. Erlich, H. A., Gelfand, D. & Sninsky, J. J. Recent advances in the polymerase chain reaction. Science 252, 1643–1651 (1991).
12. Monaco, A. P. & Larin, Z. YACs, BACs, PACs and MACs: artificial chromosomes as research tools. Trends Biotechnol. 12, 280–286 (1994).
13. Carr, P. A. & Church, G. M. Genome engineering. Nature Biotechnol. 27, 1151–1162 (2009).
14. Willerslev, E. et al. Ancient biomolecules from deep ice cores reveal a forested southern Greenland. Science 317, 111–114 (2007).
15. Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010).
16. Kari, L. & Mahalingam, K. in Algorithms and Theory of Computation Handbook Vol. 2, 2nd edn (eds Atallah, M. J. & Blanton, M.) 31-1–31-24 (Chapman & Hall, 2009).
17. Păun, G., Rozenberg, G. & Salomaa, A. DNA Computing: New Computing Paradigms (Springer, 1998).
18. Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids. Nature 171, 737–738 (1953).
19. Niedringhaus, T. P., Milanova, D., Kerby, M. B., Snyder, M. P. & Barron, A. E. Landscape of next-generation sequencing technologies. Anal. Chem. 83, 4327–4341 (2011).
20. LeProust, E. M. et al. Synthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540 (2010).
21. Massingham, T. & Goldman, N. All Your Base: a fast and accurate probabilistic approach to base calling. Genome Biol. 13, R13 (2012).
22. Gantz, J. & Reinsel, D. Extracting Value from Chaos (IDC, 2011).
23. Brand, S. The Clock of the Long Now (Basic Books, 1999).
24. Digital archiving. History flushed. Economist 403, 56–57 (28 April 2012); available at http://www.economist.com/node/21553410 (2012).
25. Bessone, N., Cancio, G., Murray, S. & Taurelli, G. Increasing the efficiency of tape-based storage backends. J. Phys. Conf. Ser. 219, 062038 (2010).
26. Baker, M. et al. in Proc. 1st ACM SIGOPS/EuroSys European Conf. on Computer Systems (eds Berbers, Y. & Zwaenepoel, W.) 221–234 (ACM, 2006).
27. Yuille, M. et al. The UK DNA banking network: a ‘‘fair access’’ biobank. Cell Tissue Bank. 11, 241–251 (2010).
28. Global Crop Diversity Trust. Svalbard Global Seed Vault. (2012); available at http://www.croptrust.org/main/content/svalbard-global-seed-vault (accessed 10 May 2012).

Supplementary Information is available in the online version of the paper.

Acknowledgements At the University of Cambridge: D. MacKay and G. Mitchison for advice on codes for run-length-limited channels. At CERN: B. Jones for discussions on data archival. At EBI: A. Löytynoja for custom multiple sequence alignment software, H. Marsden for computing base calls and for detecting an error in the original parity-check encoding, T. Massingham for computing base calls and advice on code theory and K. Gori, D. Henk, R. Loos, S. Parks and R. Schwarz for assistance with revisions to the manuscript. In the Genomics Core Facility at EMBL Heidelberg: V. Benes for advice on Next-Generation Sequencing protocols, D. Pavlinić for sequencing and J. Blake for data handling. C.D. is supported by a fellowship from the Swiss National Science Foundation (grant 136461). B.S. is supported by an EMBL Interdisciplinary Postdoctoral Fellowship under Marie Curie Actions (COFUND).

Author Contributions N.G. and E.B. conceived and planned the project and devised the information-encoding methods. P.B. advised on oligo design and Next-Generation Sequencing protocols, prepared the DNA library and managed the sequencing process. S.C. and E.M.L. provided custom oligonucleotides. N.G. wrote the software for encoding and decoding information into/from DNA and analysed the data. N.G., E.B., C.D. and B.S. modelled the scaling properties of DNA storage. N.G. wrote the paper with discussions and contributions from all other authors. N.G. and C.D. produced the figures.

Author Information Data are available at http://www.ebi.ac.uk/goldman-srv/DNA-storage and in the Sequence Read Archive (SRA) with accession number ERP002040. Reprints and permissions information is available at www.nature.com/reprints. The authors declare competing financial interests: details are available in the online version of the paper. Readers are welcome to comment on the online version of the paper. Correspondence and requests for materials should be addressed to N.G. ([email protected]).
