
Next Generation Big Data Storage

1) DNA has the potential to store vast amounts of data due to its dense information storage capabilities. Billions of bytes of data can be encoded and stored in a very small amount of DNA. 2) Researchers have developed methods to encode digital data by assigning DNA nucleotide bases (adenine, cytosine, guanine, thymine) to represent the 1s and 0s of binary data. This allows for efficient storage of large datasets in DNA. 3) DNA's stability and durability make it well-suited for long-term archival storage. Properly stored DNA can retain encoded information for thousands of years, offering a major advantage over traditional digital storage media.

Uploaded by

Sahana Suresh

NEXT GENERATION BIG DATA STORAGE

Rajesh B.
Dept of Information Science and Engineering, Bapuji Institute of Engineering and Technology, Davangere.
[email protected]

Guide: Sheik Imran
Dept of Information Science and Engineering, Bapuji Institute of Engineering and Technology, Davangere.
Abstract- The rate at which information is produced today is higher than the rate at which it can be stored, and it amounts to a few zettabytes. Efficient storage of this Big data is the current issue. This paper presents one efficient way of storing Big data in DNA. DNA is an organic material that uniquely identifies a living organism; it is non-redundant and tiny. Exploiting this feature, we store Big data by turning the 0’s and 1’s of digital data into sequences of the four main building blocks of DNA, namely As, Gs, Cs and Ts. The paper also presents an efficient method for DNA computing among the various methods available.

Keywords- Storage Based DNA, Coding theory, Encoding, DNA cloud.

I. INTRODUCTION

Our information network is threatened by the unavoidable problem of preserving the data that is stored and retrieved. The current annual data creation rate is 16.3 ZB. The research group International Data Corporation (IDC) recently made the striking prediction that the world will be creating 163 zettabytes of data a year by 2025; a zettabyte is one trillion gigabytes. Data was once stored on DVDs and CDs and has since shifted to portable hard drives and USB flash drives. Yet none of these techniques can cope with the rapid growth of data. Moreover, silicon and other non-biodegradable materials can pollute the environment, and they are limited resources that will one day be exhausted. By 2015, projected data demand was expected to rise to 8000 exabytes, while the maximum storage density of these devices is 1 terabyte per square inch. For archival purposes, libraries, corporations and file sharing systems favor shifting to newer technologies, as current storage technologies cannot handle this volume efficiently.

II. RELATED WORK

Deoxyribonucleic acid (DNA) is a double-stranded helix of nucleotides found in the cells of every living organism. Each nucleotide consists of a five-carbon sugar, a phosphate group and one of four nitrogenous bases: adenine, cytosine, guanine and thymine. DNA stores an immense amount of information, and DNA computing approaches have been used to tackle a number of combinatorial problems. A few grams of DNA could possibly store all the data in the world, since 1 gram of DNA has about 10^21 DNA bases. DNA can also be kept in dry, cold and dark conditions. As a solution to the storage problem, there are many reasons to use DNA, among them its ubiquity and its very small size.

In their initial work, Lipton, Adleman [1] and several other researchers suggested that DNA-based approaches can be used to solve problems such as the SAT problem and the traveling salesman problem. In 1994, Adleman indicated that DNA could be used both to store information and to carry out computations, with each input parameter represented using the four bases [1]. In 1995, Lipton described how all possible assignments of values for a SAT problem could be represented as a graph and then translated into DNA, as in the traveling salesman example. Researchers from NEC also showed results from using molecular computing to solve an NP-complete problem. Adleman and his team worked with a graph of six vertices in which every node was connected to the others individually. They then employed a method called "thermal cycling", which generated all possible paths between the nodes. In 2001, Adleman and other researchers solved the largest problem handled by a DNA computer up to that point, a 20-variable 3-SAT problem [3]. Searches over about 1 million possibilities were performed by the team of researchers, using an encoding scheme comparable to the one Lipton proposed in 1995 for solving the SAT problem.

In 1999, Clelland, Risca and Bancroft developed the idea of storing data by encoding it in DNA strands. To encrypt the information they used a DNA nucleotide polymer, PCR and a key. The DNA sequence then needs to be replicated about a hundred times in order to have a strong background in which to store the data.
Figure 1. Identified research work and coding approaches (Clelland et al. 1999).

This amplification process is called PCR (polymerase chain reaction). The technology used to encode the data into a dot the size of a period and to recover it from the DNA strand is called microdot technology. By concealing the message in DNA against a complex background, privacy and security are ensured. Gel electrophoresis analysis was performed as part of hiding the information in DNA. As a result, it became clear how information can be stored using DNA. The research into DNA storage suggests that it can allow far more privacy and security than silicon devices. Figure 1 identifies the coding approach used to store data in the study of Clelland, Risca and Bancroft.

According to Bancroft, iDNA is a similar idea for storing and encrypting data [3]. The same mechanism is used, but iDNA uses polyprimer key sequences to access the information in the DNA. iDNA uses both the DNA carrying the information and the encoded data. Under this encryption scheme, the data is decoded by an ordered sequence analysis. The microdot data storage mechanism was the first experiment of its kind based on DNA, and it paved the way for the future of DNA computing. Its drawback, on the other hand, is that it does not provide privacy and security from inside and outside the DNA. Figure 2 shows the structure of the DNA molecules used for storing and reading information.

Figure 2. Structure of DNA molecules used for data storage (Bancroft 2011).

In 2003, research on storing data in DNA and retrieving it was published by Wong et al. [6]. To ensure the growth of the DNA and the longevity of the information, they needed a vector to contain the data-carrying DNA, because the information can be lost if the DNA strands break at the ends. In this way the encoded strands and synthesized gene sequences can be stored and protected from harsh conditions. The team used Escherichia coli and Deinococcus radiodurans to increase the speed of reproduction. As in a digital arrangement, they used oligonucleotide sequences as 1’s and 0’s to build an ASCII scheme of the kind that represents text in silicon devices. Out of the 10 billion sequences that could be found, the safe sequences, which cannot harm the safety of the encoding, had to be chosen carefully. After this step, the team used a restriction enzyme to create strands twenty bases long that play two roles: one is to be inserted with the required encoded data, and the other is to take the reproduced double strands into a recombinant plasmid. PCR is then used whenever the data is required. Stop codons are used to keep and protect the message. In this way, 7 chemically synthesized DNA fragments carrying 57-99 base pairs of foreign encoded information were placed in bacteria. Since DNA is dense, a suitable host can store a large amount of encoded data. Figure 3 shows the encoding schemes.

This analysis therefore opened the way to using DNA to store information. The study was also important in showing the need for strategies that protect the desired information from environmental extremes and radiation, which are harmful to the delicate DNA. Work in the period 2003-2009 concentrated on techniques for encoding DNA. Digital information continued to be used, and by 2009 this data had begun to be encoded in a three-base representation. Over those six years, other encoding schemes could also be used. Between DNA bases and different languages, there appeared a need for a universal coding scheme that can represent the known usable information and can be extended to fit new formats of information. The economical use of nucleotide bases per character is one of the main requirements for a good DNA code. A base-to-character ratio of about three was shown mathematically, and as a result many researchers preferred, and still use, the Huffman coding scheme [8].

Craig Venter’s project in 2010 encoded 7920 bits in the genome sequence of a Mycoplasma bacterium [5]. This was one of the largest such projects and encoded the most data in DNA up to that date; however, digital data storage was not yet the aim of the project. It was also the first time a synthetic cell was produced, which made it a significant achievement.

In 2012, Church, Gao and Kosuri published a foundational paper that reported the use of 11 JPG images and an HTML-coded draft of a 53,000-word book in their experiments [4]. To demonstrate that DNA can store this amount of information, they combined these files into a single bit stream. They assigned adenine or cytosine to 0 and thymine or guanine to 1, which implies that the binary string ‘10011001’ could be encoded as the DNA sequence ‘TAAGGCCT’. For this example, reading the sequence in one direction (5’-3’) gives the same result as reading it in the other (3’-5’). The 0’s and 1’s were encoded onto oligonucleotides, each holding a 96-bit data block. Because the bit stream is very long, it was split into chunks of 96 bits each, which were encoded onto 54,898 oligonucleotides of 159 nt. The approach was workable because errors arising in synthesis and sequencing could be corrected by comparison with correct copies. Finally, they used state-of-the-art next-generation technologies for synthesis and sequencing, whose cost is lower by a factor of about 10^5 compared with first-generation encoding. Figure 4 and Figure 5 show the basic principles of the system and a comparison of the work with some existing technologies.

Figure 4. The whole process (Church 2012).

Figure 5. Example for encoding.

Another breakthrough for DNA information storage came in 2013, when Goldman and his group used DNA as storage for data and recovered the information successfully. There are many options for archiving data, yet all have their drawbacks. For example, hard drives are expensive and need an electrical supply, and magnetic tape degrades over time. Storing one petabyte of data in one gram of DNA could in future be a better solution for the immense amount of information coming from digital media. In the past we wrote data to hard drives and then to flash drives, which were created to handle human data storage efficiently. As the data becomes too large to handle, the need arises for a system to archive it and retrieve it.

Figure 6. Functionality of DNA Cloud (Shah and Gupta 2014, p4).
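The one-bit-per-base mapping attributed above to Church et al. (A or C for 0, G or T for 1) can be illustrated with a minimal Python sketch. The function names `decode_bits` and `encode_bits` are hypothetical, and the rule used to choose between the two candidate bases (take the one that differs from the previous base) is an assumption made for illustration only; the text does not say how the choice was made.

```python
# Sketch of a one-bit-per-base code: A or C stands for 0, G or T stands for 1.
BIT_OF_BASE = {"A": "0", "C": "0", "G": "1", "T": "1"}
BASES_OF_BIT = {"0": "AC", "1": "GT"}


def decode_bits(dna):
    """Map every base back to the bit it stands for."""
    return "".join(BIT_OF_BASE[base] for base in dna)


def encode_bits(bits):
    """Map every bit to one of its two bases.

    Assumption for illustration: of the two candidate bases, pick the one
    that differs from the previously emitted base, so no base repeats.
    """
    out = []
    for bit in bits:
        first, second = BASES_OF_BIT[bit]
        out.append(second if out and out[-1] == first else first)
    return "".join(out)


if __name__ == "__main__":
    # The example sequence from the text decodes to the stated bit string.
    print(decode_bits("TAAGGCCT"))
    # Any encoding produced here decodes back to the same bits.
    print(decode_bits(encode_bits("10011001")))
```

Because each bit has two valid bases, many different DNA sequences decode to the same bit string; ‘TAAGGCCT’ from the text is one of them.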
There are numerous approaches to storing information in DNA, and one tool developed to facilitate this storage is DNA Cloud [4]. DNA Cloud can take any sort of data (image, video or text) and store it in DNA. The software involves three steps for storing information in DNA: encoding the data, decoding the data and, finally, a storage estimator, as shown in Figure 6.

There are many encoding techniques that store data in DNA sequences by using DNA codes [2]. The Huffman code is one of the efficient source coding techniques [9]; it is also unique and decodable. Moreover, a DNA matrix has been used to represent metadata, followed by conversion of the DNA matrix into a Quick Response (QR) representation that offers a broad scope of practical usage [12]. DNA Cloud implements a Huffman-like encoding [8]. The encoding module accepts an input file of any format, for example text, jpg or mp3. The encoded DNA sequence is divided into fixed-length chunks, and the chunks overlap so as to provide fourfold redundancy for error correction. The original file is converted into a base-3 Huffman code of code length 5 (symbols 0, 1 and 2), which is then converted to DNA code by replacing each of these three symbols with one of the three nucleotides that differ from the previous one. This module saves the encoded file as a ".dna" file. If any chunks or bases of the code are erased, they can be recovered by reading the overlapping code sequences.

To recover the information stored in DNA, it has to be decoded. Decoding converts the base-3 Huffman DNA code back into the original data. The DNA sequences are the input to this module, and the original data is stored as the output. The sequence output of this module can be saved as a ".dnac" file.

The storage estimator has two main parts. In the first, the user picks a file from the system. Values for this file, such as the file size in bytes and the size of the DNA strings, can then be estimated, which helps the user decide how much memory on the system can be used to encode and store the information in DNA. The next part covers cost and biochemical properties: the user selects a ".dnac" file from the system and the estimator processes this file.

Figure 7. Flowchart of the proposed method.

III. POSSIBLE APPROACH TO THE DNA COMPUTING FOR BIG DATA

A. Process

B. Encoding/Decoding algorithm

1) Encoding algorithm:
1) Split the whole data into overlapping segments of 100 nucleotides, each offset by 50 nucleotides from the previous one.
2) Form pairs of segments starting from the 1st segment.
3) Index each pair from 0 to 107; after 107, start from 0 again.
4) Reverse complement the 2nd segment in each pair.
5) The index is 4 nucleotides long. It is encoded as a sequence of the nucleotides A, C, G, T such that no 2 consecutive nucleotides are the same, which gives 4 x 3 x 3 x 3 = 108 possible indexes. Example: 0=ACAC, 1=ACAG, 2=ACAT.
6) Prepend A and append C to the 1st segment of the pair.
7) Prepend T and append G to the 2nd segment of the pair.
8) Each segment is now synthesized as an actual DNA strand of length 106 nucleotides (100 data nucleotides, 4 index nucleotides and 2 flag nucleotides).
If the code of a character has length 1, one more nucleotide is added to the code of the character to avoid repetition of nucleotides.

2) Decoding algorithm:
1. The decoding process is simply the reverse of the encoding process.
2. The 1st nucleotide of a DNA strand tells whether the strand is the 1st or 2nd segment of the pair, whether the data is reverse complemented or not, and the directionality of the strand.
3. If 1st nucleotide is A then:
a. Remove 1st nucleotide.
b. Next 4 nucleotides will tell us about segment number.
c. Next 100 nucleotides will be data.
d. The last nucleotide can be used for confirmation of the type of segment.
4. If 1st nucleotide is C then:
a. Reverse whole segment.
b. Remove 1st nucleotide.
c. Next 4 nucleotides will tell us about segment number.
d. Next 100 nucleotides will be data.
e. The last nucleotide can be used for confirmation of the type of segment.
5. If 1st nucleotide is G then:
a. Reverse whole segment.
b. Remove 1st nucleotide.
c. Next 4 nucleotides will tell us about segment number.
d. Reverse complement next 100 nucleotides.
e. These 100 nucleotides will now be data.
f. The last nucleotide can be used for confirmation of the type of segment.
6. If 1st nucleotide is T then:
a. Remove 1st nucleotide.
b. Next 4 nucleotides will tell us about segment number.
c. Reverse complement next 100 nucleotides.
d. These 100 nucleotides will now be data.
e. The last nucleotide can be used for confirmation of the type of segment.
7. If a TTTT sequence is found, it denotes the end of the file. The new character starts from the next nucleotide.
8. Using the same Huffman tree, the data can now be converted back into the original characters.

It is possible to generate a different Huffman tree for each file or a single Huffman tree for the whole data. This compresses the data, and decoding cannot be done unless one has the original tree. As specific orientation nucleotides are used in the strands, it is possible to address twice as many segments with the same number of indexes. The user can read a strand from either direction.

IV. CONCLUSION

Big data that is not accessed frequently incurs costs in electricity, hardware and maintenance space; DNA offers a way to store it that is compact, requires no power and is long lasting. Looking beyond 0’s and 1’s led us to encode data in DNA, owing to its versatile properties. The base-3 Huffman code is an efficient way to encode and decode the digital data, providing easy recovery when part of the DNA code is lost and high security for the stored data.

ACKNOWLEDGEMENT

We wholeheartedly express our gratitude to our respected principal Dr. , Director Prof. Y Vrushabhendrappa, and to the respected head of the department, Dr. Poornima B, for providing a congenial and healthy atmosphere for the study, competent guidance, endless support and valuable suggestions. Finally, we thank SJMIT, Ranebennur for providing the opportunity to exhibit our skills.

REFERENCES

[1] Leonard M. Adleman. Molecular computation of solutions to combinatorial problems. Science, 266(5187):1021–1024, 1994.

[2] Masanori Arita. Writing information into DNA. In Aspects of Molecular Computing, 2004.

[3] Carter Bancroft, Timothy Bowler, Brian Bloom and Catherine Taylor Clelland. Long-term storage of information in DNA.

[4] George M. Church, Yuan Gao and Sriram Kosuri. Next-generation digital information storage in DNA. Science, 337(6102):1628–1628, 2012.

[5] Daniel G. Gibson, John I. Glass, Carole Lartigue, Vladimir N. Noskov, Ray-Yuan Chuang, Mikkel A. Algire, Gwynedd A. Benders et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science.

[6] Jack Parker. Computing with DNA. EMBO Reports, 2003.

[7] Shrivastava Siddant Badlani and Roshan. Data Storage in DNA. International Journal of Electrical Energy.

[8] David A. Huffman. A method for the construction of minimum-redundancy codes.
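The segment encoding and decoding steps listed in Section III.B can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the input is taken to be a string of bases whose length is 100 plus a multiple of 50 (padding is not described in the paper), the index is placed directly after the first flag nucleotide as the decoding steps imply, the Huffman stage and the TTTT end-of-file marker are left out, and all function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
SEGMENT_LEN, OFFSET = 100, 50

# All 4-base indexes with no two equal neighbours, in A < C < G < T order.
# There are 4 * 3 * 3 * 3 = 108 of them: 0 -> ACAC, 1 -> ACAG, 2 -> ACAT, ...
INDEXES = ["".join(p) for p in product(BASES, repeat=4)
           if all(a != b for a, b in zip(p, p[1:]))]


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def encode(data):
    """Turn a base string into flagged, indexed strands of 106 nucleotides."""
    if len(data) < SEGMENT_LEN or (len(data) - SEGMENT_LEN) % OFFSET:
        raise ValueError("assumed length: 100 + a multiple of 50 bases")
    # Encoding step 1: overlapping 100-nt windows, each 50 nt after the last.
    starts = range(0, len(data) - SEGMENT_LEN + 1, OFFSET)
    strands = []
    for i, start in enumerate(starts):
        seg = data[start:start + SEGMENT_LEN]
        index = INDEXES[(i // 2) % len(INDEXES)]  # steps 2, 3, 5: one index per pair
        if i % 2 == 0:
            strands.append("A" + index + seg + "C")  # step 6
        else:
            strands.append("T" + index + reverse_complement(seg) + "G")  # steps 4, 7
    return strands


def decode_strand(strand):
    """Return (pair index, position in pair, data) for a strand read either way."""
    if strand[0] in "CG":  # decoding steps 4 and 5: strand was read back to front
        strand = strand[::-1]
    first, index, body, last = strand[0], strand[1:5], strand[5:-1], strand[-1]
    if first == "A" and last == "C":  # 1st segment of the pair
        return INDEXES.index(index), 0, body
    if first == "T" and last == "G":  # 2nd segment: undo the reverse complement
        return INDEXES.index(index), 1, reverse_complement(body)
    raise ValueError("flag nucleotides do not match")


def reassemble(strands):
    """Rebuild the base string; assumes fewer than 108 pairs (no index reuse)."""
    parts = sorted(decode_strand(s) for s in strands)
    data = parts[0][2]
    for _, _, seg in parts[1:]:
        data += seg[SEGMENT_LEN - OFFSET:]  # skip the 50 nt shared with the previous segment
    return data


if __name__ == "__main__":
    sample = "".join(BASES[(i * i + i // 3) % 4] for i in range(250))
    strands = encode(sample)
    print(len(strands), len(strands[0]), reassemble(strands) == sample)
```

Since every nucleotide except those at the two ends of the data falls inside two overlapping segments, a lost strand in the middle can in principle be rebuilt from its neighbours; the sketch does not attempt that recovery.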
