DNA Cryptography
Abstract—The importance of data and its transmission rate is increasing as the world moves towards online services every day. Thus, providing data security is of utmost importance. This paper proposes a secure data encryption and hiding method based on DNA cryptography and steganography. Our approach uses DNA for the encryption and data hiding processes because of its high capacity and simplicity in securing various kinds of data. The proposed method has two phases. In the first phase, it encrypts the data using DNA bases along with Huffman coding. In the second phase, it hides the encrypted data in a DNA sequence using a substitution algorithm. The proposed method is blind and preserves biological functionality. The results show a decent cracking probability with comparatively better capacity. The proposed method eliminates most of the limitations identified in the related works, and the hybrid technique can provide a double layer of security for sensitive data.
Index Terms—DNA Cryptography, DNA Steganography, Hybrid Encryption, Huffman Coding.
I. INTRODUCTION
In this new era of information technology, the security and confidentiality of information are becoming crucial. The need for confidential information transmission (such as online transactions) is increasing. Therefore, we need a strong encryption model, and researchers aim to find a more robust system of data encryption. Moreover, some information needs to be transferred by hiding the encrypted data in some medium, such as images, audio, or video, to avoid intruders’ attention. Therefore, the hybridization of encryption and data hiding is getting more research attention.
Researchers are concentrating on Deoxyribonucleic Acid (DNA) to develop a more robust encryption model because of its advantages, such as ultra-high storage density, ultra-low energy consumption, and the potential of ultra-large-scale parallel computing to realize the cryptographic functions of information encryption and authentication [1]. Furthermore, a DNA sequence comprises only four symbols, which can be used to encrypt any data. The DNA sequence is also interesting for data hiding. There are around 163 million DNA sequences available in the public database. Hence, using a DNA sequence as a medium significantly lowers the system cracking probability and makes the system robust.
Finding an algorithm that encrypts and hides the data without attracting intruders’ attention is challenging, because no extra information may be sent (blind technique) and the scheme should still have a decent cracking probability.
Most previous works focused on developing only the encryption method [2], and others focused on developing only the data hiding method [3]–[8], thereby providing only a single layer of protection. The hybrid methods proposed in [2], [4], [9] combine DNA cryptography and steganography but rely on the Playfair cipher method, which decreases capacity. The works in [3], [9], [10] have expanded the reference DNA size, which will get the attention of intruders, and the works in [5], [6], [9], [10] have not preserved the biological functionality of the DNA.
The works using the Playfair cipher method generate ambiguous bits and transfer them in the reference sequence because the bits act like a key for deciphering the text. This reduces the capacity of the method. Our aim is to develop a method that improves the data hiding capacity while remaining blind, meaning that no other information needs to be sent along with the reference sequence, and while preserving the biological functionality of the DNA. Thus, our method addresses the limitations of the previous works.
The objective of this work is to propose a robust method of data encryption using DNA sequences so that data can be transmitted securely without getting the attention of the intruder. The contributions of this work are: (i) proposing a hybrid method for data encryption based on DNA cryptography and steganography, (ii) performing a security analysis of the proposed hybrid model, and (iii) comparing the proposed model with existing works to analyze its efficiency.
In our proposed method, we use the DNA cryptography concept and Huffman coding for data encryption, and we use DNA as a medium to hide the encrypted data with the substitution method. Results show that our method has a decent system cracking probability and capacity. Moreover, the payload is zero for our method. The comparison shows that our method overcomes the limitations of the previous methods. Our proposed approach will help secure data transmission, especially in the banking, e-commerce, authentication, and server-client secure communication sectors.
The rest of the paper is organized as follows. In Section II, we explain some of the terminology used in this paper and briefly discuss some existing works with their advantages and disadvantages. In Section III, the proposed approach is explained along with its strengths and limitations. Section IV presents the security analysis of the proposed method. In Section V, we have compared our method with
In [6], the biological functionality of the DNA sequence is not preserved. Again, [2], [4], [9] used the Playfair cipher method for encryption, which generates ambiguity bits that must be passed to the receiver for decryption. The ambiguity bits often decrease the capacity of the algorithm.
F. Novelty of Our Work
The novelty of our approach is that it provides double-layer security by incorporating encryption and steganography techniques, and the technique is blind. It does not expand the DNA size so that it does not get the intruder’s attention
[Figure: flowchart of the encryption and decryption pipeline. Encryption: Convert Text to ASCII Binary; Convert ASCII to DNA Format Using 2-bit Binary Encoding; Huffman Coding Scheme (Encryption Key); Encrypt Data; Apply LSBase Steganography Method to a Real DNA Sequence from the Publicly Available NCBI Database. Decryption: Apply LSBase Reverse Steganography Method; Decrypt Data; Convert DNA Format to ASCII Using 2-bit Binary Encoding; Convert ASCII Binary to Text.]
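As a rough illustration of the stages named in the flowchart (text to ASCII binary, 2-bit DNA encoding, Huffman coding, LS-base hiding, and the reverse steps on the receiver side), the following Python sketch may help. It is not the authors' implementation. The 2-bit mapping (A=00, T=01, G=10, C=11) is inferred from the "hello" example given with Fig. 2. The function names, the alphabetical tie-break between equally frequent bases, and the choice to use exactly three carrier bases per cipher bit (so that the receiver knows where the cipher text ends, as in the 120 bp example) are assumptions made for illustration.

```python
from collections import Counter

# Assumption: 2-bit mapping inferred from the paper's "hello" -> TGGA TGTT ... example.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}
# Fixed code words, handed out from the least to the most frequent base.
CODE_WORDS = ("000", "001", "01", "1")
KEY_SLOTS = (0, 1, 3, 4)  # 1st, 2nd, 4th and 5th base (0-based indices)


def encrypt(text):
    """Phase I: ASCII -> 2-bit DNA form -> frequency-ranked variable-length code."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    dna = "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))
    freq = Counter(dna)
    # Least frequent first; ties broken alphabetically (an assumption).
    order = "".join(sorted("ACGT", key=lambda base: (freq[base], base)))
    code = dict(zip(order, CODE_WORDS))
    return "".join(code[base] for base in dna), order


def hide(cipher, order, reference):
    """Phase II: write one cipher bit into every third base, then the key."""
    needed = 3 * len(cipher)
    if not cipher or len(reference) < needed:
        raise ValueError("empty cipher text or reference sequence too short")
    seq = list(reference[:needed])  # assumption: carrier cut to 3 bases per bit
    for i, bit in enumerate(cipher):
        pos = 3 * i + 2  # positions 3, 6, 9, ... in 1-based numbering
        if seq[pos] in "AG":  # a purine is replaced by a purine
            seq[pos] = "G" if bit == "1" else "A"
        else:  # pyrimidine: C encodes 1, U encodes 0
            seq[pos] = "C" if bit == "1" else "U"
    for pos, base in zip(KEY_SLOTS, order):
        seq[pos] = base  # sorted base order acts as the key
    return "".join(seq)


def extract(stego):
    """Receiver side: read the key and the cipher bits, then undo both codings."""
    order = "".join(stego[pos] for pos in KEY_SLOTS)
    decode = dict(zip(CODE_WORDS, order))
    bits = "".join("1" if stego[i] in "CG" else "0"
                   for i in range(2, len(stego), 3))
    dna, word = [], ""
    for bit in bits:  # the code words are prefix-free, so greedy matching works
        word += bit
        if word in decode:
            dna.append(decode[word])
            word = ""
    binary = "".join(BASE_TO_BITS[base] for base in dna)
    data = bytes(int(binary[i:i + 8], 2) for i in range(0, len(binary), 8))
    return data.decode("ascii")
```

With these definitions, `encrypt("hello")` yields the 40-bit cipher text and the sorted order A - C - G - T used in the worked example, and `extract(hide(cipher, order, reference))` recovers the plain text for any A/C/G/T reference of at least 120 bases.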
[Fig. 2. Flowchart explaining the data encryption process with an example. The plain text P = "hello" is converted to its ASCII binary PBIN = 01101000 01100101 01101100 01101100 01101111, then to its DNA form M = TGGA TGTT TGCA TGCA TGCC (digital DNA based coding), and finally M is converted with the Huffman codes A = 000, C = 001, G = 01, T = 1.]
[Fig. 3. Huffman code generation for DNA bases based on frequency, as described in the Huffman coding algorithm. Tree: the root (20) splits into 13 (edge 0) and T = 7 (edge 1); 13 splits into 7 (edge 0) and G = 6 (edge 1); 7 splits into A = 3 (edge 0) and C = 4 (edge 1).]
frequencies are A=3, T=7, G=6, and C=4. Hence, the sorted order of the bases is A - C - G - T. From the sorted order, we construct a tree as in Fig. 3. We get a variable length code for A, T, G, and C, which is shown in Table II. Next, we convert M to MBIN according to the Table II values. Therefore, we get MBIN = Cipher Text = 10101000 10111 101001000 101001000 101001001.
TABLE II
VARIABLE LENGTH CODE FOR DNA BASES.
DNA Base    Huffman Code
A           000
C           001
G           01
T           1
B. Phase II: Data Hiding
In this phase of our hybrid algorithm, we hide our cipher text, which is the encrypted string of our plain text, in an actual DNA sequence. There are millions of natural DNA sequences available in the public database. We can get our DNA sequence from the NCBI (National Center for Biotechnology Information) database. Then, we hide the cipher text in that actual DNA sequence using the 3:1 LS (Least Significant) base method, which we have modified to increase capacity and security. The process is straightforward. First, from the left, we select each base of the actual DNA sequence whose position is a multiple of 3, i.e., 3, 6, 9, 12, 15, and so on. We then substitute each selected base with another base according to the binary value of our cipher text, read from left to right. Because 1 of every 3 bases in the actual DNA sequence carries cipher text, and that base is the least significant of the 3, the method is called the 3:1 LS base method. If the base is a purine (A or G), we substitute it with A to encode 0 and G to encode 1 from the cipher text. If the base is a pyrimidine (T or C), we substitute it with C to encode 1 and U to encode 0 from the cipher text. We encode this way until the end of the cipher text or of the actual DNA sequence is reached. We hide our key (the variable length code from the Huffman coding) in the first 5 bases of the actual DNA sequence, leaving the third base for cipher text encoding.
This is possible because, in our Huffman coding scheme, the set of code words is fixed and only their assignment changes with the bases’ frequencies. That is, the variable length codes are always 000, 001, 01, and 1; which base receives which code depends only on frequency. The least frequent base is encoded with 000, the next with 001, the next with 01, and the most frequent one with 1. Thus, if we send only the sorted order of the bases, the code can be recovered. We substitute the 1st, 2nd, 4th, and 5th bases with the list of bases sorted by frequency. This also gives us a security benefit, which we discuss in the security analysis section. In this way, we get our cipher text, together with the key, hidden in the actual DNA sequence. The process is explained using a flowchart in Fig. 4.
[Fig. 4. Flowchart explaining the data hiding process with an example. The DNA sequence D from NCBI and the cipher text MBIN pass through the 3:1 substitution rules, the leading bases are substituted by the sorted bases from the Huffman code, and the result is the final cipher text DC.]
From the previous example, the length of the cipher text MBIN is 40 bits. To hide it in a DNA sequence, the sequence needs to be at least 120 bp long. Let the DNA sequence be D=AATTCCAAAGAAACAGACTCTACAGCCAGCGAAGGCATGGATTTGCTGGCTGGGGCAAACAGGCAAAGAGAGAGCAAGCCTTCTTCTTCCATATCCTTTATATAGACTGCCAACTAAAGG. We check the 3rd base, which is T, and the 1st bit of our cipher text is 1. Thus, we substitute it with C. Then, we go to the 6th base, which is C. The 2nd bit of the cipher text is 0. Thus, we substitute it with U, and so on. After all the cipher text has been hidden, we hide the key. The sorted list of the bases from the previous example was A - C - G - T. Hence, we substitute the 1st base of the DNA sequence with A, the 2nd with C, the 4th with G, and the 5th with T. The final DNA sequence is DC=AACTCUAAGGAAACGGAUTCUACAGCCAGUGAGGGUATGGACTTGCTAGCUGGAGCAAAUAGACAAAGAGCCTUACCGCACTCCTCCCGTACCCCTTATAUAGACTGCCAACCUAAAAGG
C. Data Extraction - Receiver Side
At the receiver end, the received message looks just like an actual DNA sequence, but it contains a hidden encrypted message and a key to decrypt it. We reverse the procedure to recover the actual message from the received message. First, we go through every base whose position is a multiple of 3 and check it. If it is A or U, the cipher bit was 0. If it is C or G, the cipher bit was 1. In that way, we first extract the cipher text. Next, we scan the cipher text and match the code words 000, 001, 01, and 1. Then, according to the Huffman coding scheme, we get the DNA-encoded message back. We learn the Huffman representation of the bases from the first 5 bases. Next, we convert the DNA-encoded message to binary using the 2-bit binary encoding; that is, we replace each base of the DNA sequence with its digital code. In this way, we get the ASCII representation of our actual message, which we then convert to characters. That is the actual message we wanted to send securely. The whole process is shown in Fig. 1.
D. Strength of Our Approach
Our approach has several strong points, described below:
1. The algorithm ensures three layers of protection against intruders.
a. Conversion of plain text to a DNA sequence.
b. Encoding it again with variable length coding.
c. Hiding it in an actual DNA sequence.
2. The way we hide the key (the Huffman code) in the DNA sequence makes the fake DNA sequence unique, so it is difficult for the intruder to find the actual DNA sequence in the database.
3. On the receiver side, the decryption process is simpler and takes less effort, which benefits this model in a server-client network where the client-side machines are less powerful.
4. Though our model introduces data redundancy, it decreases the cracking probability.
E. Limitations of Our Approach
Data redundancy is the main drawback of our approach. As we use the 3:1 LS base method, we need a DNA sequence 3 times longer than the cipher text we want to hide. Still, the number of processing steps for hiding is bounded by the length of the cipher text.
F. Cost Benefit Analysis
Though our model introduces redundancy, it makes data transmission highly secure. Today’s machines have high processing power, and internet connections are fast. Hence, data redundancy is not the primary problem. From the security analysis below, we will see that the system cracking probability is very low; thus, the method can be used to secure the transmission of highly sensitive data.
IV. SECURITY ANALYSIS
An intruder needs to know three vital pieces of information to recover the message from the encrypted message we sent: the DNA reference, the encoding rule, and the least significant base substitution permutation. These parameters are analyzed as follows.
A. DNA Reference Sequence
To decode the information, the intruder needs to guess the correct reference DNA so that the changes in it can be analyzed to decode the message. This is the toughest step for our model, as there are around 163 million DNA sequences in the public database. Moreover, the first 6 bases of the sequence might be fully changed in our model. Therefore, the intruder needs to analyze the remaining n-6 bases of a DNA sequence of length n to find the most closely related sequence. The probability of making a correct guess of the DNA reference is:
$$P(DNA_{Ref}) = \frac{1}{1.63 \times 10^{8} \times (n - 6)} \tag{1}$$
B. Binary Encoding Rule
Let us assume that the intruder knows the number of symbols used in the encoding process; as it is a DNA sequence, the number of symbols is 4. The Huffman codes for the four symbols are 000, 001, 01, and 1, and each of the four bases can get any of these codes. The 2-bit binary encoding for DNA bases also creates 4 codes, 00, 01, 10, and 11, and each of the bases can have any of these codes. Thus, the probability of guessing the right codes, P(BER), is:
$$P(BER) = \frac{1}{4! \times 4!} \tag{2}$$
C. The Least Significant Base Substitution Rule
The LS base method is applied by substituting a pyrimidine base with 'U' to encode the secret bit '0' or with 'C' to encode '1'. However, it can also encode '0' with C and '1' with U, and the same holds for the purine bases. Briefly, the '0' secret bit can be encoded by substituting the pyrimidine base with 'U' or 'C'. If 'U' is selected, then 'C' will be used to substitute the pyrimidine base to encode '1'. So the number
TABLE III
COMPARISON BETWEEN RELATED WORKS.
Comparison Criteria | P1: Enhanced Double Layer Security using RSA over DNA based Data Encryption [10] | P2: DNA Base Data Encryption and Hiding using Playfair and Insertion Techniques [9] | P3: Proposed Steganography Approach using DNA Properties [6] | P4: A New Data Hiding Scheme Based on DNA Sequence [5] | P5: The Proposed Method
Secret Text Type | Any Type of Data | Any Type of Data | Any Type of Data | Binary Data | Any Type of Data
Binary Coding Rule | 2-Bit Binary Coding Rule | 2-Bit Binary Coding Rule | 2-Bit Binary Coding Rule | Binary Coding Rule Independent | 2-Bit Binary Coding Rule
Encryption Type | Symmetric | Asymmetric | Not Applicable | Not Applicable | Symmetric
Encryption Algorithm | Encrypting secret data by mapping it to DNA and amino acids | 5*5 Playfair cipher based on DNA and amino acids | No Encryption | No Encryption | DNA Based Huffman Coding Encryption
Data Hiding Algorithm | Insertion | Insertion | Complementary rules based hiding method, which is the rule that specifies the strand of DNA directly opposite a specified sequence | Substitution method using repeated nucleotides to hide the secret message bits | Substitution method using the least significant base of each codon in the DNA reference sequence
Blind/Not Blind | Not Blind | Blind | Not Blind | Not Blind | Blind
System Cracking Probability | $P(S) = 1/(1.63 \times 10^{8} \times (n-1) \times 24 \times 2^{m-1} \times 2^{s-1})$ | $P(S) = 1/(1.63 \times 10^{8} \times (n-1) \times 24 \times 2^{m-1} \times 2^{s-1})$ | $P(S) = 1/(1.63 \times 10^{8} \times (n-1) \times 24 \times 24)$ | $P(S) = 1/(1.63 \times 10^{8} \times (n-1) \times 24 \times 6)$ | $P(S) = 1/(1.63 \times 10^{8} \times (n-6) \times 4! \times 4! \times 4)$
Security Level | Double Layer | Double Layer | Single Layer | Single Layer | Double Layer
Modification Rate | High | High | Moderate | High | Low
Biological Functionality | Does not preserve | Does not preserve | Does not preserve | Does not preserve | Preserves
Capacity | High | High | Moderate | Moderate | Moderate
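To give a feel for the magnitude of the P5 entry in the System Cracking Probability row, the short Python sketch below multiplies the three factors of the proposed method: the reference guess 1/(1.63 × 10^8 × (n − 6)), the encoding-rule guess 1/(4! × 4!), and the substitution guess 1/4. The function name and the use of exact fractions are illustrative choices rather than part of the paper, and 163,000,000 is the paper's approximate count of public sequences.

```python
from fractions import Fraction
from math import factorial

PUBLIC_SEQUENCES = 163_000_000  # "around 163 million" sequences, per the paper


def system_cracking_probability(n):
    """P(S) for a reference of n bases, as the product of the three guesses."""
    if n <= 6:
        raise ValueError("n must exceed the 6 leading bases excluded from matching")
    p_reference = Fraction(1, PUBLIC_SEQUENCES * (n - 6))   # guess the reference
    p_encoding = Fraction(1, factorial(4) * factorial(4))   # guess both code maps
    p_substitution = Fraction(1, 4)                         # guess the LS-base rule
    return p_reference * p_encoding * p_substitution


if __name__ == "__main__":
    # The 120 bp carrier used in the "hello" example.
    print(float(system_cracking_probability(120)))
```

For the 120 bp example this evaluates to 1/42,812,928,000,000, or roughly 2.3 × 10^-14; longer reference sequences lower the value further.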
of possibilities is 2 × 1 = 2, and the same holds for the purine bases. Thus, the probability of making a successful guess for the substituted nucleotides, P(N) = 1/2 × 1/2, is:
$$P(N) = \frac{1}{4} \tag{3}$$
Using the proposed method, the probability of an attacker making a correct guess, i.e., the system cracking probability P(S), is:
$$P(S) = \frac{1}{1.63 \times 10^{8} \times (n - 6) \times 4! \times 4! \times 4} \tag{4}$$
V. COMPARATIVE STUDY
In this section, we compare our proposed model with some recent DNA-based steganography algorithms; the result is shown in Table III. For the comparison, we have chosen some crucial parameters [13], [14], as shown in Table III. The first parameter is the secret text type. It shows whether an algorithm can hide all data formats comprising letters, symbols, or numbers. All the algorithms mentioned, excluding P4, support all types of data. P4 supports only binary data. The second parameter is the type of binary coding rule used in the conversion from the binary format of the message to DNA. All methods in the table use the 2-bit binary coding rule except P4, which is independent of the binary coding rule. The third parameter shows the type of encryption used in every algorithm in Table III. Our proposed method uses symmetric key encryption. The fourth parameter shows whether the method encrypts the secret data before hiding it. P1 encrypts the data by converting it to DNA and then to amino acid form. P2 encrypts the secret data using a Playfair cipher based on DNA and amino acids. P3 and P4 hide the data in its original format without encryption, which increases the cracking probability and decreases processing overhead. Our proposed method combines DNA encoding with Huffman coding scheme-based encryption, providing extra protection against intruders.
The fifth parameter shows which data hiding algorithm is used. P1 and P2 use the insertion method to hide the secret message in the DNA sequence, increasing the DNA sequence’s length. P3 hides the secret message using complementary rules. P4 and P5 hide the secret message by substituting DNA nucleotides based on the cipher text bits.
The sixth parameter shows whether the message can be retrieved without needing extra information other than the reference DNA sequence during data extraction. P2 and the proposed scheme P5 are blind algorithms. The seventh parameter is the cracking probability of each algorithm in the table. The eighth parameter shows the security level offered by each algorithm. Our proposed method provides a double layer of security as it encrypts the data before hiding it.
The ninth parameter shows the modification rate. P1, P2, and P4 have high modification rates. The modification rate for P3 is moderate. Our proposed model has a low modification rate, as it only modifies the reference sequence for the length of the cipher text. The tenth parameter is the preservation of biological functionality, which is also crucial to avoid intruders’ attention. Only our method preserves the biological functionality of the reference DNA. This is because
we substitute purine bases with purine bases and pyrimidine bases with pyrimidine bases during the steganographic process. The eleventh parameter shows the capacity: only P1 and P2 have a high capacity, while our method gives a moderate capacity. Although we use the 3:1 LS base method, our method utilizes the maximum capacity that this method can offer.
After considering all the aspects, we found that our method has a decent cracking probability, though it is not the best. P1 and P2 show the best cracking probability. However, they use the insertion method and hence increase the length of the fake DNA sequence, which may catch the eye of the intruder. Also, P1 is not a blind method. Our method gives a double layer of security, making it better than P3 and P4. Again, our method is the only one that preserves the biological functionality of the reference DNA sequence, and it has a low modification rate. We can conclude that our proposed algorithm is decently strong compared to the other algorithms presented here.
VI. EXPERIMENTAL RESULT
In this section, we show the performance of the proposed algorithm based on some predefined parameters that are used in the literature to evaluate the performance of an encryption algorithm. The proposed algorithm was tested on a personal computer with an Intel(R) Core(TM) i5-8300H CPU @ 2.30 GHz and 8 GB RAM. The implementation was carried out in Jupyter Notebook version 6.1.4. We experimented on a message kept in a file of size 5 kilobytes. The message contains letters, symbols, and numbers.
B. Performance Metrics
We have used some parameters that are commonly used in evaluating the system’s performance [2]–[11]. The first parameter is 'Capacity'. Capacity refers to the total length of the modified DNA sequence after hiding the encrypted message and key in it. The second parameter is 'Payload'. Payload refers to the remaining length of the new DNA sequence after extracting the data from it. The third parameter is 'bpn'. BPN stands for bits per nucleotide, which is the number of bits hidden per nucleotide. It is the ratio of the total length of the message and key bits to the capacity in bits. The last two parameters show the encryption and decryption times in seconds.
C. Summary of Findings
[Fig. 5. The effect of the length of the reference DNA sequence on encryption and decryption time. The length of the DNA sequences increases from left to right. The encryption and decryption times increase as the length increases, and encryption time is greater than decryption time.]
VII. CONCLUSION
In this paper, we have proposed a novel cryptographic technique combining DNA cryptography and steganography. The technique encrypts the data in its first stage and then hides the encrypted message in an actual DNA sequence. The encryption method uses DNA bases to encrypt the message,
TABLE V
EXPERIMENTAL RESULTS.
Locus Capacity (bits) Payload bpn = (M+K)/C Encryption Time (Sec) Decryption Time (Sec)
AC166252 49965 0 3.6 0.049 0.038
AC168901 63822 0 2.8 0.063 0.048
AC168907 64746 0 2.8 0.063 0.048
AC153526 66709 0 2.7 0.065 0.050
AC168897 66738 0 2.7 0.065 0.050
AC167221 68284 0 2.6 0.067 0.052
AC168874 68833 0 2.6 0.068 0.053
AC168908 72680 0 2.5 0.071 0.055
followed by a variable length code generation and assignment for each DNA base using Huffman coding. The proposed method is blind as it does not need to send the actual reference DNA sequence with the fake one. Also, it does not expand the actual DNA sequence, and it keeps its biological functionality. From our security analysis and comparison with a number of promising methods from the literature, we found that our proposed method gives a decent level of security that is very hard to break without full knowledge of the steps involved in the encryption. The proposed method can be modified in our future work to increase its data hiding capabilities and security.
REFERENCES
[13] G. Hamed, M. Marey, S. A. El-Sayed, and M. F. Tolba, “Hybrid technique for steganography-based on DNA with n-bits binary coding rule,” in 7th International Conference of Soft Computing and Pattern Recognition, Fukuoka, Japan, November 2015.
[14] K. S. Sajisha and S. Mathew, “An encryption based on DNA cryptography and steganography,” in International conference of Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, April 2017, pp. 162–167.