Security Analysis of DNA Based Steganography Techniques
Received: 2 October 2019 / Accepted: 20 December 2019 / Published online: 9 January 2020
© Springer Nature Switzerland AG 2020
Abstract
This study investigates the most recent data hiding techniques based on DNA steganography, including the highly improved DNA-based steganography technique, the data hiding using double DNA sequences method, and the enhanced DNA-based steganography technique. The strengths and weaknesses of these techniques are discussed. Additionally, the security of these techniques is analyzed based on several security parameters that measure the quality of DNA steganography, including, but not limited to, cracking probability, blindness, modification rate, expansion rate, and layers of security. The goal of the comparison between the investigated techniques is to highlight the advantages and disadvantages of the existing data hiding algorithms and to motivate future research in this field. Moreover, the paper evaluates the discussed techniques based on capacity, payload, and bits per nucleotide (bpn). The results show that the enhanced DNA-based steganography technique hides 2 bpn, whereas the highly improved method hides 1.46 bpn on average, which is higher than what the data hiding using double DNA sequences method can hide. The paper also presents suggestions for how each technique can be optimized to achieve a higher security level for hiding data within DNA sequences.
* Asia Othman Aljahdali; Omnia Abdullah Al‑Harbi; Walaa Essa Alahmadi | College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia.
Review Paper SN Applied Sciences (2020) 2:172 | https://doi.org/10.1007/s42452-019-1930-1
three different techniques recently presented for hiding data in DNA sequences and to investigate their security based on different factors. The paper suggests ways of enhancing and improving each technique with respect to the security requirements. This paper is divided into sections: Sect. 2 defines a DNA sequence and its elements, while Sect. 3 discusses in detail three different techniques that are used to hide information within DNA sequences. Sect. 4 analyzes each technique, clarifies their strengths and weaknesses, and compares them. The last section discusses each technique and proposes ideas for further improvements.

2 Elements of DNA

In biology, deoxyribonucleic acid (DNA) is a huge molecule that exists within the cells of all living organisms, containing the genetic information that allows the functioning, reproduction, and evolution of these organisms [13]. DNA is built from many small subunits called nucleotides, each carrying one of four bases: adenine (A), thymine (T), guanine (G), and cytosine (C) [14, 15]. The two strands of the DNA molecule are held together by bonds between the bases; adenine binds to thymine, and cytosine binds to guanine [1, 14, 16]. Every three neighboring nucleotides make up a codon, so there are $4^3 = 64$ different possible codon combinations. In living organisms, the arrangement of these combinations determines the structure and function of the resultant protein [17]. DNA encoding techniques are binary coding schemes for the purpose of DNA computation. The most popular binary mapping of digital coding is given in Table 1.

Table 1 DNA digital coding [11]
DNA nucleotide  Decimal  Binary
A               0        00
C               1        01
G               2        10
T               3        11

3 DNA data hiding techniques

Over the years, different algorithms have been proposed for hiding sensitive data within DNA sequences. In this section, we investigate and discuss the strengths and weaknesses of recently proposed DNA-based data hiding techniques. The analysis aims to help future research in designing more reliable and secure data hiding techniques.

3.1 Highly improved DNA-based steganography techniques

Malathi et al. [18] modify the insertion algorithm to decrease the cracking probability of the fake DNA sequence. The algorithm uses two different keys. The first key (K1) is a number in the range of 0 to 255, which is XORed with the last character in the message (M); the result is XORed with the character preceding the last one in M, and so on. Accordingly, the first key is used to encrypt the message. The second key (K2) is randomly generated and is used to divide the DNA sequence into same-length segments. The resulting cipher characters are inserted as binary bits, one by one, at the beginning of each segment. Then, the binary sequence is converted into DNA bases using Table 1. The second key is preferably a small number so that the DNA sequence has a minimum length while hiding the secret message.

The encryption process. The proposed algorithm [18] follows several steps to encrypt and hide messages inside a DNA sequence. The encryption process steps are as follows:

1. Split M into characters, M = m1, m2, m3, …, mn, and convert each character into its 8-bit binary equivalent based on the ASCII standard.
2. Randomly generate a number between 0 and 255 to form K1, and convert the key into an 8-bit binary sequence.
3. XOR the last character in M with K1.
4. XOR the result with the character preceding the last one in M; repeat the XORing until all the characters are converted, and store the result in A.
5. Convert the binary sequence A into a protein sequence.
6. Select a sample DNA sequence S randomly and convert it into a binary bit sequence using Table 1.
7. Generate a random number K2, preferably a small one, and divide the DNA sequence S into segments of length K2.
8. Add the first binary value of A at the beginning of the first DNA binary segment, insert the second binary value of A at the beginning of the second binary segment, and so on.
9. Concatenate all the binary sequences, and convert the result into a fake DNA sequence using Table 1.

An illustrative example is given in Fig. 1 showing the encryption processes.
3.2 Data hiding using double DNA sequences techniques

Ibrahim, Abdalkader, and Moussa [19] proposed an algorithm that uses a double DNA sequences technique. The main idea is to pick a random pair of DNA sequences (S, Ś) from the DNA database. The proposed algorithm consists of two phases. In the first phase, the secret message P is encoded into the DNA sequence DP, in which each letter is replaced by three nucleotides. The first selected DNA sequence S is used for the encryption of DP. In the second phase, the other DNA sequence Ś is used to hide the encrypted secret data. The encryption and decryption processes are explained below.

The encryption process. Two inputs are used in the encryption process: the secret message P and the DNA sequence pair (S, Ś). The encryption steps are as follows:

1. Encode the secret message P into the DNA sequence DP using Algorithm 1, which generates a total of 64 DNA codons. NUM_FORMAT is a combination of three digits, and DNA(NUM_FORMAT) converts the resulting number into a DNA codon (e.g., 122 converts into CGG).
2. Generate the encrypted message DṔ by performing the bitwise XOR of the secret message DP and the reference sequence S, and then delete the extra unused nucleotides (DṔ = DP XOR S).
3. Append the ID of S at the beginning of the resulting DṔ to get IDsDṔ.
4. Read the second DNA sequence Ś and mark the second repeated characters.
5. Replace the second repeated characters with the encrypted message characters DṔ.
6. Hide the characters of the encrypted message DṔ in Ś using the replacement rules in Table 2. We refer to nucleotides in DṔ by {A(DṔ), C(DṔ), G(DṔ), T(DṔ)}. Table 2 is used to hide DṔ in Ś by replacing the second repeated letter in Ś with one of the four letters {A, C, G, T} according to the encrypted message.
7. Append the ID of Ś to the beginning of the result to obtain S̈ = (IDś (IDsDṔ)′), and send it to the receiver [19].

An illustrative example is given in Fig. 3 showing the encryption processes.

Table 2 Replacement rules
A A A C A G A T
C C C A C T C G
G G G T G A G C
T T T G T C T A
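The first phase of this technique (codon encoding followed by a nucleotide-wise XOR with S) can be sketched as below. This is only an illustration under assumptions: the assignment of characters to codons (`ALPHABET` order), the choice of a space and a period as the two punctuation marks, and the function names are hypothetical; the hiding phase based on repeated characters and Table 2 is not modeled.

```python
from itertools import product
import string

BASES = "ACGT"  # digit -> base following Table 1 (A=0, C=1, G=2, T=3)

# Enumerate the 64 three-digit combinations, as in Algorithm 1.
CODONS = ["".join(BASES[d] for d in digits)
          for digits in product(range(4), repeat=3)]

# Hypothetical 64-symbol alphabet: 52 letters, 10 digits, 2 punctuation marks.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + " ."
CHAR_TO_CODON = dict(zip(ALPHABET, CODONS))
CODON_TO_CHAR = {codon: ch for ch, codon in CHAR_TO_CODON.items()}


def encode(plaintext: str) -> str:
    """Replace every character of P by its three-nucleotide codon (DP)."""
    return "".join(CHAR_TO_CODON[ch] for ch in plaintext)


def decode(dp: str) -> str:
    """Map each group of three nucleotides back to a character."""
    return "".join(CODON_TO_CHAR[dp[i:i + 3]] for i in range(0, len(dp), 3))


def xor_dna(sequence: str, reference: str) -> str:
    """XOR two DNA strings base by base using the 2-bit codes of Table 1.

    Nucleotides of the reference beyond the message length are ignored.
    """
    if len(reference) < len(sequence):
        raise ValueError("reference sequence is shorter than the message")
    return "".join(BASES[BASES.index(a) ^ BASES.index(b)]
                   for a, b in zip(sequence, reference))


reference = "GATTACA" * 5               # hypothetical reference sequence S
dp = encode("Hide me.")
encrypted = xor_dna(dp, reference)      # DP XOR S
restored = decode(xor_dna(encrypted, reference))
```

Because XOR is its own inverse, applying `xor_dna` with the same reference a second time restores DP, which is what the receiver relies on.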
Algorithm 1: Generating DNA codons
for i = 0 to 3 do
  for j = 0 to 3 do
    for k = 0 to 3 do
      NUM_FORMAT = ijk
      codon = DNA(NUM_FORMAT)
    end for
  end for
end for

The decryption process. The input of the decryption process is a fake DNA sequence S̈ = (IDś (IDsDṔ)′) with a secret hidden message. The decryption process is as follows:

1. Extract the first bases, which represent the ID of the sequence Ś used by the sender to hide the data.
2. Find the second repeated nucleotide in Ś.
3. Extract the DṔ sequence from S̈ using the inverse replacement rules in Table 2.
4. Extract the first bases from DṔ, which represent the ID of S.
5. Decrypt DṔ using the associativity and self-inverse property of XOR: DṔ XOR S = (DP XOR S) XOR S = DP XOR (S XOR S) = DP.
6. Decode DP to letters, with each group of three nucleotides representing a letter from the English alphabet.
7. Get the plaintext P [19].

An illustrative example is given in Fig. 4 showing the decryption processes.

Probability of cracking. The number of candidate reference DNA sequences is about 163 million; thus, the probability of predicting the reference DNA sequence S is $\frac{1}{1.63 \times 10^{8}}$. The probability of guessing the second selection Ś, the reference DNA sequence used to hide the secret message, is likewise $\frac{1}{1.63 \times 10^{8}}$. There are $24^{4}$ possible situations for hiding based on Table 2 in the hiding process; thus, the probability of an attacker making a successful guess is $\frac{1}{24^{4}}$. The total number of permutations between the letters of the English alphabet (capital and small letters), the ten digits, the two punctuation marks, and the 64 codons that are generated from Algorithm 1 is $P_{64}^{64}$ [19]. Thus, the probability of inferring the secret message is:

$$\left(\frac{1}{1.63 \times 10^{8}}\right)^{2} \times \frac{1}{24^{4}} \times \frac{1}{P_{64}^{64}} \qquad (2)$$

3.3 Enhanced DNA-based steganography technique with a higher hiding capacity

Marwan, Shawish, and Nagaty [20] introduced this approach to simplify the current techniques and obtain a higher hiding capacity. This technique follows two phases. The first phase is the encryption phase, which uses a modified version of the 5 × 5 Playfair cipher grid called the 4 × 4 Playfair cipher grid. The result of this phase is an encrypted message. The second phase is the hiding phase, which is a substitution process used for hiding the encrypted message. The result of this phase is a fake DNA sequence. The encryption and decryption processes are described below.

The encryption process. There are four inputs for the encryption process: a message, a key, the initial values of the 4 × 4 binary grid, and the initial values of the 4 × 4 DNA grid. The 4 × 4 binary grid and DNA grid must be shared between the sender and the receiver before the encryption and decryption processes. The encryption process steps are as follows:

1. Generate 16 random English letters to create the 4 × 4 Playfair cipher grid using the given key input as a seed value; an example of a Playfair cipher grid is given in Table 3.
2. Shuffle the initial values of the 4 × 4 binary grid and the 4 × 4 DNA grid using the key; examples of a shuffled 4 × 4 binary grid and DNA grid are given in Tables 4 and 5, respectively.
3. Convert the input message into a binary sequence (B).
4. Split B into 4-bit values, find each value in the 4 × 4 binary grid, map its position to the corresponding position in the 4 × 4 cipher grid, and fetch the English letter. The result of this step is a sequence of English letters (E).
5. Apply the Playfair cipher technique to the sequence of English letters (E) to get the encrypted text (C).
6. For each letter of (C), map its position in the 4 × 4 Playfair cipher grid to its corresponding position in the 4 × 4 DNA grid and get the values. The outcome of this step is a DNA sequence (Enc).
7. Pick a reference DNA sequence from the DNA database.
8. Hide the encrypted DNA sequence in the chosen reference DNA sequence using the substitution process [20].

An illustrative example is given in Fig. 5 showing the encryption processes.

The decryption process. There are two inputs for the decryption process: the encrypted DNA sequence and the key. The receiver receives these inputs through a secure channel. The initial values of the 4 × 4 binary grid and DNA grid should be shared before the encryption and decryption processes.

1. Use the reverse of the substitution process to extract the hidden encrypted DNA sequence.
2. Shuffle the initial values of the 4 × 4 binary grid and 4 × 4 DNA grid using the key.
3. Generate 16 random English letters to create the 4 × 4 Playfair cipher grid using the received key as a seed value.
4. For each group of two letters of the DNA sequence (Enc), map its position in the 4 × 4 DNA grid to its corresponding position in the 4 × 4 cipher grid and get the values. The outcome of this step is the encrypted text (C).
5. Apply the inverse of the Playfair cipher technique to the encrypted text (C) to get a sequence of English letters (E).
6. For each English letter in (E), map its position in the 4 × 4 cipher grid to its corresponding position in the 4 × 4 binary grid and get the values. The outcome of this step is a binary sequence (B).
7. Convert the binary sequence (B) into the original message [20].

An illustrative example is given in Fig. 6 showing the decryption processes.

Probability of cracking. In this technique, the attacker needs 4 types of information to decrypt a message: the binary representation, the reference DNA, the complementary rule, and the ciphering technique. Since we have 4 DNA bases, the number of possible binary schemes is 4!; thus, the probability of getting the binary scheme b is $\frac{1}{4!}$. The probability of guessing the reference DNA r is $\frac{1}{1.6 \times 10^{8}}$. The probability of the complementary rule c is $\frac{1}{16}$. Thus, probability of cracking the k is

Table 3 Example of 16 randomly generated English letters
H  C  M  U
D  G  Z  B
I  A  X  J
Q  V  W  F

Table 4 4 × 4 Shuffled binary grid
0110  0001  1000  0000
1001  0101  1010  1110
0100  1111  1011  1101

Table 5 4 × 4 Shuffled DNA grid
GT  CG  CA  TG
TA  TC  GG  AA
AT  TT  CT  AG
GC  GA  CC  AC
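The grid-mapping and Playfair steps of the enhanced technique can be sketched with the example grids of Tables 3 and 5. This is a hedged illustration, not the implementation from [20]: the binary grid used here is a hypothetical unshuffled row-major grid, the key-driven shuffling and the final substitution into a reference sequence are omitted, and a pair of identical letters is handled with the same-row rule, which is an assumption the source does not specify.

```python
CIPHER_GRID = ["HCMU", "DGZB", "IAXJ", "QVWF"]            # Table 3
DNA_GRID = [["GT", "CG", "CA", "TG"],                     # Table 5
            ["TA", "TC", "GG", "AA"],
            ["AT", "TT", "CT", "AG"],
            ["GC", "GA", "CC", "AC"]]
# Hypothetical binary grid: 0000..1111 in row-major order (not Table 4).
BINARY_GRID = [[f"{4 * r + c:04b}" for c in range(4)] for r in range(4)]


def positions(grid):
    """Map each cell value to its (row, column) position."""
    return {grid[r][c]: (r, c) for r in range(4) for c in range(4)}


BIN_POS, LET_POS, DNA_POS = map(positions, (BINARY_GRID, CIPHER_GRID, DNA_GRID))


def playfair_pair(a: str, b: str, step: int) -> str:
    """Classic Playfair rules on a 4 x 4 grid; step=+1 encrypts, -1 decrypts."""
    (ra, ca), (rb, cb) = LET_POS[a], LET_POS[b]
    if ra == rb:  # same row (assumption: identical letters also land here)
        return CIPHER_GRID[ra][(ca + step) % 4] + CIPHER_GRID[rb][(cb + step) % 4]
    if ca == cb:  # same column
        return CIPHER_GRID[(ra + step) % 4][ca] + CIPHER_GRID[(rb + step) % 4][cb]
    return CIPHER_GRID[ra][cb] + CIPHER_GRID[rb][ca]  # rectangle: swap columns


def playfair(text: str, step: int) -> str:
    return "".join(playfair_pair(text[i], text[i + 1], step)
                   for i in range(0, len(text), 2))


def encrypt(message: bytes) -> str:
    """Message -> bits (B) -> letters (E) -> Playfair text (C) -> DNA (Enc)."""
    bits = "".join(f"{byte:08b}" for byte in message)
    nibbles = (bits[i:i + 4] for i in range(0, len(bits), 4))
    letters = "".join(CIPHER_GRID[r][c] for r, c in (BIN_POS[n] for n in nibbles))
    cipher = playfair(letters, +1)
    return "".join(DNA_GRID[r][c] for r, c in (LET_POS[ch] for ch in cipher))


def decrypt(enc: str) -> bytes:
    """Reverse the three mappings to recover the original bytes."""
    pairs = (enc[i:i + 2] for i in range(0, len(enc), 2))
    cipher = "".join(CIPHER_GRID[r][c] for r, c in (DNA_POS[p] for p in pairs))
    letters = playfair(cipher, -1)
    bits = "".join(BINARY_GRID[r][c] for r, c in (LET_POS[ch] for ch in letters))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


enc = encrypt(b"DNA")
```

Each byte becomes two grid letters and therefore four nucleotides of Enc, which matches the 2 bits per nucleotide reported for this technique.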
security. This technique also possesses very high data hiding capacity and preserves the reference DNA's original function of producing proteins. The output of the encryption process is a fake DNA sequence with a modification rate of approximately 28.4%, which is low [19]. The expansion rate is equal to zero, which means that after embedding the secret data, the length of the reference DNA sequence is not expanded. In fact, a low modification rate and a zero expansion rate ensure security and result in a better quality of the fake DNA sequence. Furthermore, this technique is a blindness technique, meaning that there is no need to send the original DNA to the receiver, so the security degree is maximized. Finally, the probability of cracking is low. On the other hand, this technique has some weaknesses. The replacement rules should be sent to the receiver, and plain text must contain only capital letters, small letters, 0, … , 9, a period, and a dot; it cannot contain other punctuation marks. Also, the algorithm does not use any type of key.

Enhanced DNA-based steganography technique with a higher hiding capacity. The security of this technique is based on several elements. First, the secret data is encrypted before being embedded into the DNA sequence. Moreover, the encryption and hiding processes of secret data are done by using the Playfair and substitution methods. Accordingly, the Playfair method provides a higher hiding capacity and stronger security, besides being a fast and simple method. The substitution method preserves the length of the DNA sequence, so the payload is always zero. Furthermore, this technique uses a secret key, which grants a higher security level to the data hiding system. Finally, preserving the reference DNA's original function of producing proteins is a considerable asset of this technique. On the other hand, the technique is not a blindness technique. The sender and receiver must share some data before the encryption and decryption processes. The blindness feature maximizes the security level by reducing as much as possible the data that must be transferred to the receiver.

A comparison between the investigated techniques is given to highlight the advantages and disadvantages of the existing data hiding algorithms and to provide motivation for future research in this field. Table 6 presents the strengths and weaknesses of each previously explained technique.

Table 6 Strengths and weaknesses of the investigated techniques
Parameter | Highly improved DNA-based steganography [18] | Data hiding using double DNA sequences [19] | Enhanced DNA-based steganography [20]
Cracking probability | Very low cracking probability | Low cracking probability | Low cracking probability
Security layer | Double layer | Double layer | Double layer
Blindness | Does not support blindness | Supports blindness | Does not support blindness
Modification rate | Low | Low | Low
Payload | Not equal to zero | Always equal to zero | Always equal to zero
Expansion rate | Other DNA length | Same DNA length | Same DNA length
Encrypting the secret data | Yes (XOR) | Yes | Yes
Preserving DNA functionality | Changing DNA functionality | Preserving DNA functionality | Preserving DNA functionality
Using keys | Uses two keys | Does not use a key | Uses a key
High capacity | Yes | Yes | Yes
Easy to apply | Easy to implement | Not easy to implement | Easy to implement
Number of used DNA sequences | One | Two | One

5 Experimental results

The techniques were tested in [18–20] using eight real DNA sequences from the NCBI database [Ref: https://www.ncbi.nlm.nih.gov/]. The experiment's goal was to evaluate the discussed techniques based on some parameters, including capacity, payload, and bpn. As mentioned before, the capacity refers to the total length of the extended reference sequence after the secret message is hidden within it, which can be calculated by $|S| + \frac{|M|}{2}$ [18]. The payload is the remaining length of the new sequence after extracting out the reference DNA sequence, and can be calculated by $\frac{|M|}{2}$ [18]. The bpn is the number of bits hidden per nucleotide, which can be calculated by $\mathrm{bpn} = \frac{|M|}{C}$ [19], where |M| is the length of the secret message, C is the capacity, and |S| is the length of the reference DNA sequence.

We show and compare the experimental results of the three techniques. Table 7 shows the performance of the data hiding using double DNA sequences and the highly improved DNA-based steganography techniques for hiding a 20000-byte secret message in the DNA sequence regarding the capacity, payload, and bits per nucleotide (bpn).
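The capacity, payload, and bpn definitions quoted in this section can be turned into a short worked calculation. The sketch below assumes that |M| is measured in bits and |S| in nucleotides, and it uses a hypothetical reference length of 100,000 nucleotides rather than any of the sequences reported in [18–20]; the function names are illustrative.

```python
def payload(message_bits: int) -> int:
    """Extra nucleotides added by the message: |M| / 2 (two bits per base)."""
    return message_bits // 2


def capacity(reference_len: int, message_bits: int) -> int:
    """Length of the extended sequence: |S| + |M| / 2."""
    return reference_len + payload(message_bits)


def bits_per_nucleotide(message_bits: int, cap: int) -> float:
    """bpn = |M| / C."""
    return message_bits / cap


message_bits = 20_000 * 8      # the 20000-byte message used in the experiments
reference_len = 100_000        # hypothetical |S|, not a value from the source
cap = capacity(reference_len, message_bits)
bpn = bits_per_nucleotide(message_bits, cap)
```

With these assumed inputs the payload is 80,000 nucleotides, the capacity is 180,000 nucleotides, and the bpn is about 0.89. Under these formulas the bpn approaches 2 only as the reference sequence becomes short relative to the message.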
Table 7 The capacity, payload, and bpn for each technique
Sequence | No. of nucleotides | Highly improved DNA based steganography techniques: Capacity, Payload, Bpn | Data hiding using double DNA sequences techniques: Capacity, Payload, Bpn
The enhanced DNA-based steganography technique hides 2 bits per nucleotide; for example, a reference DNA sequence of 149,884 bp can hide a message of up to 36.56 Kb [20]. The highly improved method hides 1.46 bpn on average, which is higher than the 0.574 bpn that the data hiding using double DNA sequences method hides on average.

The data hiding using double DNA sequences method and the enhanced DNA-based steganography method preserve the length of the original DNA sequence (the payload is always 0), as summarized in Table 6, whereas the highly improved method increases the length of the reference DNA sequence.

6 Discussion

this technique, an asymmetric encryption scheme could be used: the message would first be encrypted using any such scheme, and the encryption process of the data hiding using double DNA sequences technique would then be applied. This will increase the security degree and eliminate one of its vulnerabilities.

As mentioned before, the enhanced DNA-based steganography technique [20] has a weakness. The initial values of the 4 × 4 binary grid and DNA grid must be shared before the encryption and decryption processes, which means that this technique is not a blindness technique. To overcome this weakness, the shared data must be minimized. Furthermore, using a public key rather than a secret key would improve the algorithm's security.
the most essential element in data hiding. Furthermore, the enhanced DNA-based steganography technique does not possess the blindness property. The aim of the comparison in this study is to help in designing efficient and secure DNA data hiding techniques; thus, the paper suggests ways of enhancing and improving the investigated techniques with respect to the security requirements.

Compliance with ethical standards

Conflicts of interest The authors declare that they have no conflict of interest.

References

1. Information Resources Management Association (2018) Cyber security and threats: concepts, methodologies, tools, and applications. IGI Global
2. Provos N, Honeyman P (2003) Hide and seek: an introduction to steganography. IEEE Secur Priv 1(3):32–44
3. Krishnan RB, Thandra PK, Sai Baba M (2017) An overview of text steganography. In: 2017 4th international conference on signal processing, communication and networking (ICSCN). IEEE
4. Sokół B, Yarmolik VN (2005) Cryptography and steganography: teaching experience. Enhanced methods in computer security, biometric and artificial intelligence systems. Springer, Boston, pp 83–92
5. Siper A, Farley R, Lombardo C (2005) The rise of steganography. In: Proceedings of student/faculty research day, CSIS, Pace University
6. Selvaraj D (2017) Development of a secure communication system based on steganography for mobile devices. p 3
7. Vinodhini RE, Malathi P, Gireesh Kumar T (2017) A survey on DNA and image steganography. In: 2017 4th international conference on advanced computing and communication systems (ICACCS). IEEE
8. Kahn D (1996) The history of steganography, international workshop on information hiding. Springer, Berlin
9. Petitcolas FAP, Anderson RJ, Kuhn MG (1999) Information hiding: a survey. Proc IEEE 87(7):1062–1078
10. Malathi P, Gireeshkumar T (2016) Relating the embedding efficiency of LSB steganography techniques in spatial and transform domains. Procedia Comput Sci 93:878–885
11. Clelland CT, Risca V, Bancroft C (1999) Hiding messages in DNA microdots. Nature 399:533–534
12. Sharma A (2016) Security and information hiding based on DNA Steganography. A Monthly J Comput Sci Inf Technol 5(3):827–832
13. Ginu A, Jeenu J, Vishnu P, Jerin D (2017) DNA based cryptography and steganography. GRD J Glob Res Dev J Eng 2:249–253
14. Kiss G (2018) How to teach the history of cryptography and steganography. Educaţia Plus 20(2):13–23
15. Abbasy MR et al (2012) DNA base data hiding algorithm. Int J New Comput Archit Appl 2(1):183–192
16. Khalifa A (2013) LSBase: a key encapsulation scheme to improve hybrid crypto-systems using DNA steganography. In: 2013 8th international conference on computer engineering & systems (ICCES). IEEE
17. Petsko GA, Ringe D (2004) Protein structure and function. New Science Press, Beijing
18. Malathi P, Manoaj M, Manoj R, Vaikunth R, Vinodhini R (2017) Highly improved DNA based steganography. Procedia Comput Sci 115:651–659
19. Ibrahim FE, Abdalkader HM, Moussa MI. Enhancing the security of data hiding using double DNA sequences. In: Industry Academia collaboration conference (IAC)
20. Marwan S, Shawish A, Nagaty K (2015) An enhanced DNA-based steganography technique with a higher hiding capacity. Bioinformatics 1:150–157
21. Sajisha KS (2017) An encryption based on DNA cryptography and steganography. In: International conference on electronics, communication and aerospace technology (ICECA)
22. Jain S, Bhatnagar V (2014) Analogy of various DNA based security algorithms using cryptography and steganography. In: 2014 international conference on issues and challenges in intelligent computing techniques (ICICT). IEEE
23. Hamed G et al (2016) Comparative study for various DNA based steganography techniques with the essential conclusions about the future research. In: 2016 11th international conference on computer engineering & systems (ICCES). IEEE
24. Hamed G et al (2015) Hybrid technique for steganography-based on DNA with n-bits binary coding rule. In: 2015 7th international conference of soft computing and pattern recognition (SoCPaR). IEEE
25. Dilovan Z, Habibollah H, Subhi Z (2017) Security issues in DNA based on data hiding: a review. Int J Appl Eng Res ISSN

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.