A Hybrid Encryption Technique based on DNA Cryptography and Steganography

Shahriar Hassan¹, Md. Asif Muztaba¹, Md. Shohrab Hossain¹ and Husnu S. Narman²
¹Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Bangladesh
²Department of Computer Sciences and Electrical Engineering, Marshall University, Huntington, WV, USA
Email: [email protected], [email protected], [email protected], [email protected]

Abstract—The importance of data and its transmission rate are increasing as the world is moving towards online services every day. Thus, providing data security is becoming of utmost importance. This paper proposes a secure data encryption and hiding method based on DNA cryptography and steganography. Our approach uses DNA for encryption and data hiding processes due to its high capacity and simplicity in securing various kinds of data. Our proposed method has two phases. In the first phase, it encrypts the data using DNA bases along with Huffman coding. In the second phase, it hides the encrypted data into a DNA sequence using a substitution algorithm. Our proposed method is blind and preserves biological functionality. The result shows a decent cracking probability with comparatively better capacity. Our proposed method has eliminated most limitations identified in the related works. Our proposed hybrid technique can provide a double layer of security to sensitive data.

Index Terms—DNA Cryptography, DNA Steganography, Hybrid Encryption, Huffman Coding.

I. INTRODUCTION

In this new era of information technology, the security and confidentiality of information are becoming crucial. The need for confidential information transmission is increasing (such as online transactions). Therefore, we need a strong encryption model. For this purpose, researchers aim to find out a more robust system of data encryption. Moreover, some information needs to be transferred by hiding the encrypted data in some medium, such as images, audio, video, etc., to avoid intruders' attention because of security concerns. Therefore, the hybridization of encryption and data hiding is getting more research attention.

Researchers are concentrating on Deoxyribonucleic Acid (DNA) to develop a more robust encryption model because of its advantages such as ultra-high storage density, ultra-low energy consumption, and the potential of ultra-large-scale parallel computing to realize the cryptographic functions of information encryption and authentication [1]. Furthermore, the DNA sequence is only comprised of four symbols, which can be used to encrypt any data. The DNA sequence is interesting for data hiding. There are around 163 Million DNA sequences available in the public database. Hence, using DNA sequence as a medium significantly lowers system cracking probability and makes the system robust.

Finding a specific algorithm to encrypt and hide the data in such a way that it does not get intruders' attention is challenging; because no extra information is sent (blind technique), and it should have a decent cracking probability.

Most previous works focused on developing only the encryption method [2], and others focused on developing only the data hiding method [3]–[8], thereby providing only a single layer of protection. The proposed hybrid methods [2], [4], [9] used DNA cryptography and steganography, which use the Playfair cipher method and decrease capacity. The works in [3], [9], [10] have expanded the reference DNA size, which will get the attention of intruders, and the works in [5], [6], [9], [10] have not preserved the biological functionality of the DNA.

The works using the Playfair cipher method generate ambiguous bits and transfer them in the reference sequence because the bits are like a key to deciphering the text. Hence that affects the capacity of the method. Our aim is to develop a method that improves the data hiding capacity while it is blind, which means no other information along with the reference sequence needs to be sent by preserving the biological functionality of DNA. Thus, our method addresses the limitations of the previous works.

Our objective of this work is to propose a robust method of data encryption by using DNA sequences so that data can be transmitted securely without getting the attention of the intruder. The contributions of this work are: (i) proposing a hybrid method for data encryption based on DNA cryptography and steganography, (ii) performing security analysis of our proposed hybrid model, and (iii) comparing the proposed model with existing works to analyze the efficiency.

In our proposed method, we used the DNA cryptography concept and Huffman coding for data encryption; and used DNA as a medium to hide encrypted data with the substitution method. Results show that our method has a decent system cracking probability and capacity. Moreover, the payload is zero for our method. The comparison result shows that our method has overcome the limitations of the previous methods. Our proposed approach will help secure data transmission, especially in banking, e-commerce, authentication, and server-client secure communication sector.

The rest of the paper is organized as follows. In Section II, we have explained some of the terminologies used in this paper and briefly discussed some existing works with their advantages and disadvantages. In Section III, the proposed approach is explained along with its strength and limitations. Section IV presents the security analysis of the proposed method. In section V, we have compared our method with

978-1-6654-6316-4/22/$31.00 ©2022 IEEE


some related works and discussed the outcome. Section VI presents the implementation details and results. Finally, we conclude the paper in Section VII.

II. BACKGROUND AND LITERATURE REVIEW

A. Terminology

Security in data communication is required when message transfer between sender and receiver is needed to be confidential. Cryptography is the process of achieving confidentiality in message transfer. Cryptography can be thought of as a process of secret writing in order to protect data or messages from various intruders' attacks. Secret writing is achieved through the process of transforming a message called plaintext into cipher text using a cryptographic algorithm. Security is concerned with protecting messages or data while transmitting over networks. DNA stands for Deoxyribonucleic Acid. It contains biological information about every living being. A DNA sequence is the sequence of Nucleotides. Nucleotides are Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). DNA cryptography refers to converting plain text into a sequence of nucleotides based on some specified rules. DNA can be used to hide data. Hiding data in some medium, like image, audio, video, etc., is known as steganography. Hiding data in a DNA sequence is known as DNA steganography.

B. Related Works based on DNA Steganography

Shiu et al. [3] proposed three methods of hiding messages based on DNA and considered them the main methods. The first method is the insertion method which inserts a message bit in random places of a DNA reference sequence. It obviously expands the real sequence. The second method is based on complementary rules, which detect the longest complementary pair in a DNA sequence. They insert message bits before them, which also increases DNA size. The third one is based on the substitution method, where some DNA nucleotides are substituted based on secret message bits. Guo et al. [4] proposed a substitution data hiding technique using motif finding in DNA sequence. Repeated nucleotides in a DNA sequence are known as motif. They [4] find those motifs and substitute them with other nucleotides based on message bits. Yunus et al. [5] also proposed a substitution method based on motif finding in a DNA sequence. This method does not expand the DNA length but is not blind. Also, high modification may be required if the number of the motif is high in a DNA sequence. Hamed et al. [6] also proposed a complementary rule-based steganography method. The complementary rule is the rule that specifies the strand of DNA directly opposite to a specified sequence. It does not expand the length of the sequence and is blind. However, it does not preserve the biological function of the DNA. Mousa et al. [7] proposed a hiding technique that preserves the biological functionality of the DNA sequence using the reverse mapping method. The method is based on the substitution method and does not expand the length of the sequence. Vijaykumar et al. [8] proposed a DNA steganography model for image encryption. This method first converts the 3*3 matrix of pixels of the cover image to nucleotides. Then, nucleotides are converted to binary. Then, the XOR operation is performed between the message and the cover image.

C. Related Works based on DNA Cryptography

Namdev et al. [2] proposed a method that does a significant modification of the old approach of using DNA and Amino Acids based approach with Playfair Cipher by using the same approach with a different encryption algorithm, i.e., a Foursquare cipher to the core of the ciphering process. In this study, a binary form of data, such as plaintext messages or images, is transformed into sequences of DNA nucleotides. Subsequently, these nucleotides pass through a Foursquare encryption process based on amino acid structure. The fundamental idea behind this encryption process is to enforce other conventional cryptographic algorithms that proved to be broken and to open the door for applying the DNA and Amino Acids concepts to more conventional cryptographic algorithms to enhance their security features.

D. Hybrid Methods based on DNA Cryptography and Steganography

Mitras et al. [10] proposed a hybrid method based on the RSA algorithm and DNA encryption. They mapped the message bits into a DNA sequence and Amino acid. Then they used the insertion method to hide the encrypted data into an actual DNA sequence. The method is not blind. Taur et al. [9] proposed a hybrid method that uses a Playfair cipher based on DNA and Amino acid followed by data hiding using the insertion method. They used the 5*5 Playfair cipher method. The method is blind and has a low cracking probability. Yadav et al. [11] proposed a hybrid technique that uses images to hide a message and DNA to encrypt a message. They first convert the message into DNA and from DNA to cipher text. Then, they take a cover image and manipulate the pixel values according to cipher text. The algorithm used for hiding in the image is a well-known algorithm named KIMLA.

E. Gap analysis

A few works [3], [9], [10] hide data based on the insertion method and expand the DNA sequence's size. Hence it might draw the attention of the intruder to the transmission. A few other works [4], [5], [7] used substitution methods for data hiding, and there was no expansion of DNA sequence. However, they did not use any encryption method. Others [3], [8] did not use any encryption before steganography. Again [2] did not use steganography. Thus, they provide a single layer of protection.

In some works [8], [11], an image has been used for data hiding. Image resolution is changed when it is used as a cover image, and it may get the attention of the intruder. Also, a long message cannot be hidden in images of low resolutions. On the other hand, DNA sequences can hide the long message, and also cracking probability of DNA steganography is very low since there are 163 million DNA sequences available in the public database [12].

In [6] biological functionality of the DNA sequence is not preserved. Again [2], [4], [9] used the Playfair cipher method for encryption, generating ambiguity and ambiguity bits that must be passed to the receiver for decryption. The ambiguous bits often decrease the capacity of the algorithm.

F. Novelty of Our Work

The novelty of our approach is that our approach provides double-layer security incorporating encryption and steganography techniques, and the technique is blind. It does not expand the DNA size so that it does not get the intruder's attention and preserves the sequence's biological functionality. It also decreases cracking probability and increases hiding capacity compared to the methods that use the Playfair cipher for data encryption. In short, it eliminates all the disadvantages mentioned above in the related works and incorporates advantages into it.

[Figure: flowchart. Sender End (Encryption): Convert Text to ASCII Binary; Convert ASCII to DNA Format Using 2 bit Binary encoding; Encrypt Data (Huffman Coding Scheme, Encryption Key); Apply LSBase Steganography Method (Real DNA Sequence from the Publicly Available NCBI Database); Cipher Text. Receiver End (Decryption): Cipher Text; Apply LSBase Reverse Steganography Method; Decrypt Data; Convert DNA Format to ASCII Using 2 bit Binary encoding; Convert ASCII Binary to Text.]
Fig. 1. Flowchart of the proposed method: encryption process is shown on the left side and decryption process on the right side.
III. PROPOSED APPROACH

Our proposed method has two phases:
• In the first phase (data encryption phase), we encode the plain text message into ASCII binary and then encode it with only DNA bases using 2-bit binary encoding. Then we apply Huffman coding scheme to further encode the encoded message with a variable length code for the bases.
• In the second phase (data hiding phase), we hide the encoded message into an actual DNA sequence using the 3:1 LS Base method. Here we are using a modified 3:1 LS Base method to hide both data and key, making it quite impossible to break.

Thus, our first contribution is to encode the plain text message with DNA Cryptography and Huffman Coding scheme. Moreover, our second contribution is innovatively hiding the encoded message and key into actual DNA sequences. The whole process is shown in Fig. 1 and described in the following subsections.

A. Phase I: Data Encryption

The data encryption process starts with converting the plain text message P containing letters, numbers, and special characters into ASCII binary. Then we take each two binary digits from left to right and convert two bits into one DNA base according to the 2-bit binary encoding rule. Table I shows the digital DNA base coding. In this way, we convert the plain text P into DNA bases M. Next, we calculate the frequencies of the DNA bases in the encoded message. Moreover, based on the frequency, we apply the Huffman coding rule to get a variable length code for each base. After that, we convert the M into MBIN using that variable length code. The algorithms of this encryption method and the Huffman Coding scheme are given below:

Algorithm: Encryption Procedure
Step 1: Convert the Plain text message P into ASCII Binary PBIN.
Step 2: Convert the PBIN into DNA Sequence M using 2-bit binary encoding.
Step 3: Derive the variable length Huffman code for each DNA Base, i.e., A, T, G, C.
Step 4: Convert the M into binary cipher text MBIN using variable length code from the Huffman scheme.

Algorithm: Huffman Coding
Step 1: Obtain the frequency of each DNA base (A, T, C, G) from the DNA Encoded String M.
Step 2: Sort the bases in ascending order based on frequencies.
Step 3: Take two minimum frequencies and add them.
Step 4: Make the resultant frequency as root and the minimum frequencies as their left and right child.
Step 5: Repeat step 3-4 until a single tree is constructed.
Step 6: Starting from the root, label the left child with 0 and the right child with 1.
Step 7: Obtain binary code for A, T, G, C.

TABLE I
DIGITAL DNA BASE CODING

DNA Base    Binary Code
A           00
T           01
G           10
C           11

The process is explained using a flowchart in Fig. 2. Assume our text message is: "hello", which we want to send to our receiver securely. Hence, we want to encrypt it first with the above-mentioned way. Thus, we have P=hello. The ASCII binary of P, PBIN=01101000 01100101 01101100 01101100 01101111

We convert PBIN into M by substituting every two bits with its corresponding DNA base. Thus, we get M= TGGA TGTT TGCA TGCA TGCC. Next, we try to get the variable length code for A, T, G, and C with Huffman coding. Here the

frequencies are A=3, T=7, G=6, and C=4. Hence, the sorted order of base is A - C - G - T. From the sorted order, we construct a tree like the one in Fig. 3. We get variable length code for A, T, G, and C, which is shown in Table II. Next, we convert M to MBIN according to Table II values. Therefore, we get MBIN=Cipher Text= 10101000 10111 101001000 101001000 101001001.

[Figure: flowchart. Plain Text P (hello); ASCII Conversion; ASCII Binary of P, PBIN (01101000 01100101 01101100 01101100 01101111); DNA Conversion (Based on Digital DNA Based Coding); DNA form of P, M (TGGA TGTT TGCA TGCA TGCC); Huffman Coding with Huffman code for DNA Bases (A 000, C 001, G 01, T 1); Conversion of DNA with Huffman Coding; Cipher Text, MBIN (10101000 10111 101001000 101001000 101001001).]
Fig. 2. Flowchart explaining the data encryption process with an example.

[Figure: Huffman tree. Root 20 with children 13 (edge 0) and T = 7 (edge 1); node 13 with children 7 (edge 0) and G = 6 (edge 1); node 7 with children A = 3 (edge 0) and C = 4 (edge 1).]
Fig. 3. Huffman Code generation for DNA bases based on frequency as described in Huffman Coding Algorithm.

TABLE II
VARIABLE LENGTH CODE FOR DNA BASES

DNA Base    Huffman Code
A           000
C           001
G           01
T           1

[Figure: flowchart. Plain Text P (hello); Encryption; Cipher Text MBIN; DNA Sequence from NCBI, D; 3:1 Substitution Rules (Base, MBIN Value, Substitute Value: A/G, 0, A; A/G, 1, G; T/C, 0, U; T/C, 1, T); Substitute 1st 3 & 5th Base by sorted Bases from Huffman Code; Data Hiding; Final Cipher Text, DC.]
Fig. 4. Flowchart explaining the data hiding process with an example.
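The encryption procedure and the worked "hello" example above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function names (`text_to_dna`, `code_table`, `encrypt`) are hypothetical, ties between equal base frequencies are broken in A, T, G, C order (the paper does not specify a tie rule), and the codewords 000, 001, 01, 1 are assigned directly by ascending frequency, which reproduces the tree of Fig. 3 for this example instead of building a general Huffman tree.

```python
from collections import Counter

# Table I: 2-bit binary code -> DNA base
BITS_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
# Codewords from the least frequent to the most frequent base (as in Table II)
CODEWORDS = ("000", "001", "01", "1")


def text_to_dna(plain: str) -> str:
    """Steps 1-2: ASCII text -> 8-bit binary string -> DNA bases."""
    bits = "".join(format(byte, "08b") for byte in plain.encode("ascii"))
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def code_table(dna: str):
    """Step 3: sort bases by ascending frequency and assign the codewords.

    Assumption: ties keep the order A, T, G, C (sorted() is stable).
    """
    counts = Counter(dna)
    order = sorted("ATGC", key=lambda base: counts[base])
    return dict(zip(order, CODEWORDS)), order


def encrypt(plain: str):
    """Step 4: return the cipher bit string MBIN and the sorted base order (the key)."""
    dna = text_to_dna(plain)
    table, order = code_table(dna)
    return "".join(table[base] for base in dna), order


if __name__ == "__main__":
    mbin, key = encrypt("hello")
    print(mbin, key)
```

For P = "hello" this sketch yields the 40-bit MBIN of the example and the key order A, C, G, T.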
B. Phase II: Data Hiding

In this phase of our hybrid algorithm, we hide our cipher text, which is the encrypted string of our plain text, into an actual DNA sequence. There are millions of natural DNA sequences available in the public database. We can get our DNA sequence from NCBI (National Center for Biotechnology Information) database. Then, we hide the cipher text into that actual DNA sequence using the 3:1 LS (Least Significant) base method. However, we have modified the method to increase capacity and security. The process is straightforward. First, from the left, we select each base from the actual DNA sequence placed into positions of multiple of 3, i.e., 3, 6, 9, 12, 15, and so on. Then, we substitute them with another base based on the binary value of our cipher text from left to right. As 1 of every 3 bases in the actual DNA sequence carries cipher text, and that base is the least significant among the 3, it is called the 3:1 LS base method. If the base is a Purine base (A or G), then we substitute that with A to encode 0 and G to encode 1 from the cipher text. If the base is Pyrimidine base (T or C), then we substitute that with C to encode 1 and U to encode 0 from cipher text. We encode this way until the length of the cipher text or the actual DNA sequence is reached. We hide our key (the variable length code from Huffman code) into the first 5 bases of the actual DNA sequence leaving the third base for cipher text encoding.

We get this opportunity because, in the case of the Huffman coding scheme, the variable length code is actually fixed, but it changes with the bases' frequencies. That means the variable length codes can only be 000, 001, 01, and 1 every time. However, which base represents which one depends on the frequency only. The least frequent base is encoded with 000, then 001, then 01, and the most frequent one with 1. Thus, if we send only the sorted order of the bases, the code can be obtained. We substitute the 1st, 2nd, 4th and 5th base with the sorted list of the bases based on the frequency. It also gives us another benefit in security which we will discuss in the security analysis section. Therefore, in this way, we get our cipher text with the key hidden in our actual DNA sequence. The process is explained using a flowchart in Fig. 4.

From the previous example, we saw that the length of the cipher text MBIN is 40 bits. To hide it in a DNA sequence, the length of that sequence needs to be at least 120bp. Let the DNA Sequence be
D=AATTCCAAAGAAACAGACTCTACAGCCAGCGAAGG

CATGGATTTGCTGGCTGGGGCAAACAGGCAAAGAGAGAGCAAGCCTTCTTCTTCCATATCCTTTATATAGACTGCCAACTAAAGG. We check the 3rd base, which is T, and the 1st bit of our cipher text is 1. Thus, we substitute it with C. Then, we go to 6th base, which is C. The 2nd bit of cipher text is 0. Thus, we substitute it with U. And that way, it goes on. After all the cipher text gets hidden, we hide the key. The sorted list of the bases from the previous example was A - C - G - T. Hence, we substitute 1st base of the DNA sequence with A, 2nd one with C, 4th one with G and 5th one with T. The final DNA Sequence is
DC=AACTCUAAGGAAACGGAUTCUACAGCCAGUGAGGGUATGGACTTGCTAGCUGGAGCAAAUAGACAAAGAGCCTUACCGCACTCCTCCCGTACCCCTTATAUAGACTGCCAACCUAAAAGG

C. Data Extraction - Receiver Side

At the receiver end, the received message is just like an actual DNA sequence which contains a hidden encrypted message and a key to decrypt it. We need to do the opposite procedure to get back the actual message from the received message. First, we go through every base at a position that is a multiple of 3 and check it. If that is A or U, then the cipher bit was 0. If that is C or G, the cipher bit was 1. In that way, we first extract the cipher text. Now from the cipher text, we match which of the codes, i.e., 000, 001, 01, 1, is represented in each place. Then, according to the Huffman coding scheme, we get the DNA encrypted message back. We get to know the Huffman representation of the bases from the first 5 bases. Next, we convert the DNA encrypted message to binary using 2-bit binary encoding. That means we check each base of the DNA sequence and represent the digital code of that base. In this way, we get the ASCII representation of our actual message. Now just convert that to the character. That is the actual message we wanted to send securely. The whole process is shown in Fig. 1.

D. Strength of Our Approach

Our approach has several strong points described below:
1. This algorithm ensures three layers of protection against intruders.
a. Conversion of plain text to a DNA sequence.
b. Encoding it again with variable length coding.
c. Hiding it into an actual DNA sequence.
2. The process of hiding the key or Huffman code into a DNA sequence makes the fake DNA sequence unique and makes it difficult for the intruder to find the actual DNA sequence in the database.
3. On the receiver side, the decryption process is simpler and takes less effort, which benefits this model in a server-client network, as client-side machines are less powerful and have less work to do in this model.
4. Though our model introduces data redundancy, it decreases cracking probability.

E. Limitations of Our Approach

Data redundancy is the main drawback of our approach. As we use the 3:1 LS base method, we need to take DNA of length 3 times longer than the cipher we got to hide. Still, the processing steps for hiding remain within the length of the cipher.

F. Cost Benefit Analysis

Though our model introduces redundancy, it makes the data sending highly secure. The machines today contain high processing power, and the internet connections are high speed. Hence, data redundancy is not the primary problem. From the security analysis below, we will see that the system cracking probability is very low; thus, it can be used to secure the transmission of highly sensitive data.

IV. SECURITY ANALYSIS

An intruder needs to know vital information to get the message back from the encrypted message we sent. They are: DNA reference, Encoding rule, and LSB substituted permutation. Analysis of the parameters is as follows.

A. DNA Reference Sequence

To decode the information, the intruder needs to guess the correct reference DNA so that he can analyze the changes in it to decode the message. This process is the toughest for our model as there are around 163 million DNA sequences in the public database. Again the first 6 bases of the sequence might be fully changed in our model. Therefore, the intruder needs to analyze the rest n-6 bases of a DNA sequence of length n to find the most related sequence. Therefore, the probability of making a correct guess of DNA reference is:

\[ P(DNA_{Ref}) = \frac{1}{1.63 \times 10^{8} \times (n - 6)} \tag{1} \]

B. Binary Encoding Rule

Let us assume that the intruder knows the number of symbols used in the encoding process as it is a DNA sequence, so the number of symbols is 4. The Huffman code for the four symbols can be 000, 001, 01, and 1. Each of the four bases can get any of that code. The 2-bit binary encoding for DNA bases also creates 4 codes 00, 01, 10, and 11. Each of the bases can have any of these codes. Thus, the probability of guessing the right code each time P(BER) is:

\[ P(BER) = \frac{1}{4! \times 4!} \tag{2} \]

C. The Least Significant Base Substitution Rule

LS Base method is applied by substituting pyrimidine base by 'U' to encode the secret bit '0' or 'C' to encode '1'. However, it can also encode '0' by C and '1' by U, and the same for the Purine base. Briefly, the '0' secret bit can be encoded by substituting the Pyrimidine base with 'U' or 'C'. If it is selected to be substituted by 'U', then 'C' will be used to substitute the Pyrimidine base to encode '1'. So the number

TABLE III
COMPARISON BETWEEN RELATED WORKS

Methods compared:
P1: Enhanced Double Layer Security using RSA over DNA based Data Encryption [10]
P2: DNA Base Data Encryption and Hiding using Playfair and Insertion Techniques [9]
P3: Proposed Steganography Approach using DNA Properties [6]
P4: A New Data Hiding Scheme Based on DNA Sequence [5]
P5: The Proposed Method

Comparison Criteria | P1 | P2 | P3 | P4 | P5
Secret Text Type | Any Type of Data | Any Type of Data | Any Type of Data | Binary Data | Any Type of Data
Binary Coding Rule | 2-Bit Binary Coding Rule | 2-Bit Binary Coding Rule | 2-Bit Binary Coding Rule | Binary Coding Rule Independent | 2-Bit Binary Coding Rule
Encryption Type | Symmetric | Asymmetric | Not Applicable | Not Applicable | Symmetric
Encryption Algorithm | Encrypting secret data by mapping it to DNA and amino acids | 5*5 Playfair cipher based on DNA and amino acids | No Encryption | No Encryption | DNA Based Huffman Coding Encryption
Data Hiding Algorithm | Insertion | Insertion | Complementary rules based hiding method, which is the rule that specifies the strand of DNA directly opposite a specified sequence | Substitution method using repeated nucleotides to hide the secret message bits | Substitution method using the least significant base of each codon in the DNA reference sequence
Blind/Not Blind | Not Blind | Blind | Not Blind | Not Blind | Blind
System Cracking Probability | P(S) = 1/(1.63 × 10^8 × (n − 1) × 24 × 2^(m−1) × 2^(s−1)) | P(S) = 1/(1.63 × 10^8 × (n − 1) × 24 × 2^(m−1) × 2^(s−1)) | P(S) = 1/(1.63 × 10^8 × (n − 1) × 24 × 24) | P(S) = 1/(1.63 × 10^8 × (n − 1) × 24 × 6) | P(S) = 1/(1.63 × 10^8 × (n − 6) × 4! × 4! × 4)
Security Level | Double Layer | Double Layer | Single Layer | Single Layer | Double Layer
Modification Rate | High | High | Moderate | High | Low
Biological Functionality | Does not preserve | Does not preserve | Does not preserve | Does not preserve | Preserves
Capacity | High | High | Moderate | Moderate | Moderate
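The cracking-probability entries of Table III that depend only on the reference length n can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, the value n = 120 is an assumption taken from the 120bp reference used in the hiding example, and P1 and P2 are omitted because their formulas also depend on m and s.

```python
from math import factorial

# Number of DNA sequences in the public database, as stated in the paper
N_PUBLIC_SEQUENCES = 1.63e8


def p_proposed(n: int) -> float:
    """P(S) for the proposed method (P5): 1 / (1.63e8 * (n - 6) * 4! * 4! * 4)."""
    return 1.0 / (N_PUBLIC_SEQUENCES * (n - 6) * factorial(4) * factorial(4) * 4)


def p_p3(n: int) -> float:
    """P(S) for P3 as listed in Table III: 1 / (1.63e8 * (n - 1) * 24 * 24)."""
    return 1.0 / (N_PUBLIC_SEQUENCES * (n - 1) * 24 * 24)


def p_p4(n: int) -> float:
    """P(S) for P4 as listed in Table III: 1 / (1.63e8 * (n - 1) * 24 * 6)."""
    return 1.0 / (N_PUBLIC_SEQUENCES * (n - 1) * 24 * 6)


if __name__ == "__main__":
    n = 120  # assumed reference length in bases
    for name, fn in (("P3", p_p3), ("P4", p_p4), ("P5", p_proposed)):
        print(f"{name}: P(S) = {fn(n):.3e}")
```

For n = 120 the denominator of the P5 entry is 1.63 × 10^8 × 114 × 2304, so the proposed method's P(S) is smaller than the values obtained for P3 and P4 at the same n.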

of possibilities is 2*1 guesses, and the same will be done for the Purine base. Thus, the probability of making a successful guess for the substituted nucleotides N is:

\[ P(N) = \frac{1}{4} \tag{3} \]

Using the proposed method, the probability of an attacker making a correct guess or the system cracking probability P(S) is:

\[ P(S) = \frac{1}{1.63 \times 10^{8} \times (n - 6) \times 4! \times 4! \times 4} \tag{4} \]

V. COMPARATIVE STUDY

In this section, we have compared our proposed model with some of the recent DNA-based steganography algorithms, and the result is shown in Table III. For the comparison, we have chosen some crucial parameters [13], [14] as shown in Table III. The first parameter of our consideration is the secret text type. That shows us if an algorithm hides all data formats comprising letters, symbols, or numbers. We can see that all the algorithms mentioned, excluding P4, support all types of data. P4 supports only binary data. The second parameter is the type of binary coding rule used in the conversion from the binary format of the message to DNA. All methods in the table use the 2-bit binary coding rule. The third parameter shows the type of encryption used in every algorithm that we mention in Table III. Our proposed method uses symmetric key encryption. The fourth parameter shows if the method encrypts the secret data before hiding it or not. P1 encrypts the data by converting it to DNA then amino acids form. P2 encrypts the secret data using DNA and amino acids Playfair cipher. P3 and P4 hide the original format of the data without encryption; hence it increases the cracking probability and decreases processing overhead. Our proposed method uses Huffman coding scheme-based encryption followed by DNA encryption, providing extra protection against intruders.

The fifth parameter shows which data hiding algorithm is used. P1 and P2 use the insertion method to hide the secret message in the DNA sequence, increasing the DNA sequence's length. P3 hides the secret message using complementary rules. P4 and P5 hide the secret message by substituting DNA nucleotides based on the cipher text bits.

The sixth parameter shows us if the message can be retrieved without needing extra information other than the reference DNA sequence during data extraction. P2 and the proposed scheme P5 are blind algorithms. The seventh parameter is the cracking probability of each algorithm in the table. The eighth parameter shows the security level offered by each algorithm. Our proposed method provides a double layer of security as it encrypts the data before hiding it.

The ninth parameter shows us the modification rate. P1, P2, and P4 have high modification rates. The modification rate for P3 is moderate. Our proposed model has a low modification rate, as it only modifies the reference sequence for the length of the cipher text. The tenth parameter is the preservation of Biological functionality. It is also crucial to avoid intruders' attention. We can see that only our method preserves the biological functionality of reference DNA. This is because

we substitute Purine bases with Purine bases and pyrimidine
bases with pyrimidine bases at the time of the steganographic
process. The eleventh parameter shows the capacity, and we
can see that only P1 and P2 have a high capacity, while our
method also gives moderate capacity. Although we consider
the method 3:1 LS base method, our method utilizes the
maximum capacity that can be given in this method.
After considering all the aspects, we found that our method has
a decent cracking probability, though it is not the best. P1 and
P2 show the best cracking probability. However, they use the
insertion method and hence increase the length of the fake DNA
sequence, which may catch the eye of an intruder. Also, P1 is
not a blind method. Our method gives a double layer of security,
making it better than P3 and P4. Again, our method is the only
one that preserves the biological functionality of the reference
DNA sequence while having a low modification rate. We can
conclude that our proposed algorithm is decently strong
compared to the other algorithms presented here.

Fig. 5. The effect of the length of the reference DNA sequence on
encryption and decryption time. From left to right, the length of the
DNA sequences increases. It shows that the encryption and decryption
time increase as the length increases. Also, encryption time is more
than decryption time.
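
The trend described in the Fig. 5 caption can be checked against the values reported in Tables IV and V. The sketch below is only a consistency check on those published numbers, not part of the authors' implementation; the variable names are illustrative.

```python
# (locus, nucleotides from Table IV, encryption time in s, decryption time in s
#  from Table V)
rows = [
    ("AC166252", 149_884, 0.049, 0.038),
    ("AC168901", 191_456, 0.063, 0.048),
    ("AC168907", 194_226, 0.063, 0.048),
    ("AC153526", 200_117, 0.065, 0.050),
    ("AC168897", 200_203, 0.065, 0.050),
    ("AC167221", 204_481, 0.067, 0.052),
    ("AC168874", 206_488, 0.068, 0.053),
    ("AC168908", 218_028, 0.071, 0.055),
]

# Encryption time per nucleotide; near-constant values indicate that the
# time grows in proportion to the sequence length.
enc_per_nt = [enc / n for _, n, enc, _ in rows]
spread = max(enc_per_nt) / min(enc_per_nt)

# Encryption is reported as slower than decryption for every sequence.
enc_slower = all(enc > dec for _, _, enc, dec in rows)

for (locus, _, _, _), ratio in zip(rows, enc_per_nt):
    print(f"{locus}: {ratio * 1e9:.0f} ns per nucleotide")
print(f"max/min ratio: {spread:.3f}, encryption slower: {enc_slower}")
```

On these figures the per-nucleotide encryption time varies by under 2 percent across the eight sequences, which is consistent with a roughly linear relation, keeping in mind that the reported times are rounded to three decimals.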
VI. EXPERIMENTAL RESULT

In this section, we show the performance of the proposed
algorithm based on some of the predefined parameters that are
used in the literature to evaluate the performance of an
encryption algorithm. The proposed algorithm was tested on a
personal computer with an Intel(R) Core(TM) i5-8300H CPU @
2.30 GHz and 8 GB RAM. The implementation was carried out in
Jupyter Notebook version 6.1.4. We experimented on a message
kept in a file of size 5 kilobytes. The message contains
letters, symbols, and numbers.

A. Used Dataset

The eight real DNA sequences in Table IV were used; they are
publicly available from the NCBI database [10]. In Table IV,
the left-most column shows the locus of the DNA sequence, and
the middle column shows the number of nucleotides in it. The
right-most column shows the species definition for the locus.

TABLE IV
SPECIFICATION OF EIGHT REAL DNA SEQUENCES USED IN OUR EXPERIMENT.

Locus      Number of Nucleotides (bp)   Species Definition
AC166252   149,884                      Mus musculus 6 BAC RP23-100G10
AC168901   191,456                      Bos taurus clone CH240-1851
AC168907   194,226                      Bos taurus clone CH240-19517
AC153526   200,117                      Mus musculus 10 BAC RP23-383C2
AC168897   200,203                      Bos taurus clone CH240-190B15
AC167221   204,481                      Mus musculus 10 BAC RP23-3P24
AC168874   206,488                      Bos taurus clone CH240-209N9
AC168908   218,028                      Bos taurus clone CH240-195K23

B. Performance Metrics

We have used some parameters that are commonly used in
evaluating the system’s performance [2-11]. The first parameter
is ‘Capacity’. Capacity refers to the total length of the
modified DNA sequence after hiding the encrypted message and
key into it. The second parameter is the ‘Payload’. Payload
refers to the remaining length of the new DNA sequence after
extracting the data from it. The third parameter is the ‘bpn’.
BPN stands for bits per nucleotide, which is the number of
bits hidden per nucleotide. It is the ratio of the total length
of the message and key bits to the capacity in bits. The last
two parameters show the encryption and decryption time in
seconds.

C. Summary of Findings

Table V displays the experimental results in terms of the
capacity, payload, and bpn parameters to evaluate the system’s
performance. In the proposed algorithm, the capacity includes
hiding the secret message and Huffman code (key) in the
sequence. Payload is zero, meaning that the length of the
fake DNA reference sequence is not expanded after hiding
the message bits within it, which avoids drawing attention to
it. This is achieved by hiding the secret data by substituting
the nucleotides. Furthermore, bpn is within [2.5, 3.6], and
the proposed scheme has a sufficient embedding capacity
distributed on both the message and Huffman code (key),
increasing the total number of nucleotides required for hiding
the message bits only. Finally, the execution time to encrypt
and hide 5 KB of data is calculated. Fig. 5 illustrates the
relation found in Table V: the capacity and the execution time
are affected by the length of the DNA sequence used, i.e., the
execution time is directly proportional to the DNA sequence’s
length. As the DNA sequence’s length increases, its hiding
capacity increases and, consequently, so does the execution
time, and vice versa, as shown in Fig. 5.

TABLE V
EXPERIMENTAL RESULTS.

Locus      Capacity (bits)   Payload   bpn = (M+K)/C   Encryption Time (sec)   Decryption Time (sec)
AC166252   49965             0         3.6             0.049                   0.038
AC168901   63822             0         2.8             0.063                   0.048
AC168907   64746             0         2.8             0.063                   0.048
AC153526   66709             0         2.7             0.065                   0.050
AC168897   66738             0         2.7             0.065                   0.050
AC167221   68284             0         2.6             0.067                   0.052
AC168874   68833             0         2.6             0.068                   0.053
AC168908   72680             0         2.5             0.071                   0.055

VII. CONCLUSION

In this paper, we have proposed a novel cryptographic
technique combining DNA cryptography and steganography.
The technique encrypts the data in its first stage and then hides
the encrypted message into an actual DNA sequence. The
encryption method uses DNA bases to encrypt the message,
followed by variable-length code generation and assignment
for each DNA base using Huffman coding. The proposed
method is blind, as it does not need to send the actual reference
DNA sequence with the fake one. Also, it does not expand the
actual DNA sequence while keeping its biological functionality.
From our security analysis and comparison with a number
of promising methods from the literature, we found that
our proposed method gives a decent level of security which is
quite impossible to break without having full knowledge of the
steps involved in the particular encryption. The proposed method
can be modified in our future work to increase its data hiding
capabilities and security.

REFERENCES

[1] Y. Niu, K. Zhao, X. Zhang, and G. Cui, “Review on DNA cryptography,”
in Bio-inspired Computing: Theories and Applications. Singapore:
Springer Singapore, 2020, pp. 134–148.
[2] S. Namdev and V. Gupta, “A DNA and amino-acids based implemen-
tation of four-square cipher,” Journal of Engineering Research and
Applications, vol. 6, pp. 90–96, January 2016.
[3] H. Shiu, K. Ng, J. Fang, R. Lee, and C. Huang, “Data hiding methods
based upon DNA sequences,” Journal of Information Sciences: an
International Journal, vol. 180, pp. 2196–2208, June 2010.
[4] C. Guo, C. Change, and Z. Wang, “A new data hiding scheme based
on DNA sequence,” International Journal of Innovative Computing,
Information and Control, vol. 8, pp. 139–149, January 2014.
[5] Y. A. Yunus, S. Ab-Rahman, and J. Ibrahim, “Steganography: A review
of information security research and development in Muslim world,”
American Journal of Engineering Research, vol. 11, pp. 122–128, 2013.
[6] H. Ghada, M. Mohammed, E. S., and T. Fahmy, DNA Based Steganog-
raphy: Survey and Analysis for Parameters Optimization. Springer
International Publishing, 2016, pp. 47–89.
[7] H. Mousa, K. Moustafa, W. Abdel-Wahed, and M. Hadhoud, “Data hid-
ing based on contrast mapping using DNA medium,” The International
Arab Journal of Information Technology, vol. 8, pp. 147–154, April
2011.
[8] P. Vijayakumar, V. Vijayalakshmi, and R. Rajashree, “Increased level
of security using DNA steganography,” Int. J. Advanced Intelligence
Paradigms, vol. 10, pp. 74–82, January 2018.
[9] H. L. J. Taur, H. Lin and C. Tao, “Data hiding in DNA sequences
based on table lookup substitution,” Journal of Innovative Computing,
Information and Control, vol. 8, pp. 6585–6598, October 2012.
[10] B. A. Mitras and A. K. Abo, “Proposed steganography approach using
DNA properties,” International Journal of Information Technology and
Business Management, vol. 14, pp. 96–102, June 2013.
[11] V. Yadav and I. Gupta, “A hybrid approach to metamorphic cryptog-
raphy using kimla and DNA concept,” Int. J. Computational Systems
Engineering, vol. 5, pp. 218–229, January 2019.
[12] R. E. Vinodhini, P. Malathi, and T. G. Kumar, “A survey on DNA and
image steganography,” in 4th International Conference on Advanced
Computing and Communication Systems (ICACCS), Coimbatore, India,
6-7 Jan, 2017, pp. 1–7.
[13] G. Hamed, M. Marey, S. A. El-Sayed, and M. F. Tolba, “Hybrid
technique for steganography-based on DNA with n-bits binary coding
rule,” in 7th International Conference of Soft Computing and Pattern
Recognition, Fukuoka, Japan, November 2015.
[14] K. S. Sajisha and S. Mathew, “An encryption based on DNA cryptog-
raphy and steganography,” in International conference of Electronics,
Communication and Aerospace Technology (ICECA), Coimbatore, India,
April 2017, pp. 162–167.
