
A HUFFMAN-BASED TEXT ENCRYPTION ALGORITHM

Ruy Luiz Milidiú, Departamento de Informática, Pontifícia Universidade Católica (PUC-Rio), [email protected]
Claudio Gomes de Mello, Departamento de Engenharia de Sistemas (DE/9), Instituto Militar de Engenharia (IME), [email protected]
José Rodrigues Fernandes, Faculdade de Informática, Universidade Católica de Petrópolis (UCP), [email protected]

Abstract. Huffman coding achieves compression by assigning shorter codes to the most used symbols and longer codes to the rare ones. This paper describes two cipher procedures added to Huffman coding to provide encryption in addition to its compression feature. We use multiple substitution to disguise symbols with fake codes and a stream cipher to encrypt these codes. The proposed scheme is simple and fast, and it generates encrypted documents with enough confusion and diffusion.

1 Introduction

To maintain secure communication between remote points, it is necessary to guarantee the integrity and confidentiality of both incoming and outgoing information. The communication cost is related to the volume of exchanged information; hence, compression is essential. Besides that, one must also guarantee that eavesdroppers are not able to decipher messages in transit. To protect data against statistical analysis, Shannon [Shan49] suggested that the language redundancy should be reduced before encryption. We use the well-known Huffman codes to achieve this. Huffman codes have the optimal average number of bits per character among prefix-free codes. Moreover, Rivest et al. [Rive96] tried to cryptanalyse a file that had been Huffman coded (but not encrypted) and found it surprisingly difficult.

First, let us introduce some concepts. Let P be the set of possible plaintexts and S = {s1, s2, ..., sn} the plaintext alphabet, so that a plaintext X in P is a sequence X = x1 x2 ... with each xi in S. Let n be the number of symbols in S. If pi is the probability that si appears in the plaintext X, the entropy of X is defined by

H(X) = - SUM_i pi . log2 pi,

the average number of bits needed to represent each symbol si in S. Moreover, we say that a code with average length H(X) has zero redundancy, that is, it uses the exact number of bits necessary to represent S. The encoding produced by Huffman's algorithm is prefix-free and satisfies [Stin95]:

H(X) <= l(Huffman) < H(X) + 1,

where l is the weighted average codeword length.

Here, we propose two cryptosystems based on Huffman coding. Sometimes an alphabet provides multiple substitutions for a letter: a symbol xi of a plaintext X, instead of always being replaced by a codeword c, is replaced by any codeword of a set (c1, c2, ...). The alternates used in such a multiple substitution are called homophones [Simm91]. Günther et al. [Gunt88] introduced a coding technique using homophones whose encoding tables generate a stream of symbols that all have the same frequency. Then, Massey et al. [Mass89] proposed a scheme, based on Günther's homophonic substitution, that generates homophones by decomposing the probability of each symbol into a sum of negative powers of 2, generating new symbols.

Our first cipher is a multiple substitution procedure. We substitute each Huffman coded symbol by a string of fake codes followed by the symbol itself. It is a steganographic technique: the symbol is disguised by mixing it with fake ones. Each symbol carries an identification bit (ID bit) that marks it as real (bit 0) or fake (bit 1). Our second procedure is a stream cipher: it encrypts the ID bit of each symbol by XORing it (exclusive-or) with a given secret key. The result is an encrypted Huffman coding and decoding that can be used in communication or in gigabyte-sized document collections as proposed in [Moff94].

In section 2, Huffman coding, decoding and their properties are introduced. In section 3, we describe the multiple substitution and the stream cipher used to modify Huffman codes to add the encryption feature. In section 4, we report some experiments using our implementation to compress and encrypt documents. Finally, in section 5, we conclude our work.
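For concreteness, the entropy of a plaintext can be estimated directly from symbol counts. The following minimal C++ sketch (an illustration of the definition above, not part of the paper's code) computes H(X) for the 25-symbol example used in section 2:

```cpp
// Minimal sketch (not from the paper): estimate H(X) in bits per symbol
// from observed symbol frequencies, assuming monogram parsing.
#include <cmath>
#include <cstdio>
#include <map>

double entropy(const std::map<char, long>& freq, long total) {
    double h = 0.0;
    for (const auto& [sym, count] : freq) {
        double p = static_cast<double>(count) / total;  // p_i
        h -= p * std::log2(p);                          // - SUM_i p_i log2 p_i
    }
    return h;
}

int main() {
    // Frequencies of the section 2 example: s1=12, s2=7, s3=3, s4=3 (25 symbols).
    std::map<char, long> freq = {{'1', 12}, {'2', 7}, {'3', 3}, {'4', 3}};
    std::printf("H(X) = %.3f bits/symbol\n", entropy(freq, 25));
}
```

It prints H(X) = 1.757 bits per symbol, while the Huffman code of figure 2 spends 44/25 = 1.76 bits per symbol, consistent with H(X) <= l(Huffman) < H(X) + 1.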


2 Huffman Codes

Suppose that a plaintext is written over a set of n different symbols S = {s1, ..., sn}, n > 1, and that the frequency of each symbol in the plaintext is known. To compress the plaintext, a code for each symbol is required, restricted here to prefix-free codes. Prefix-free means that no codeword is the first part (prefix) of another codeword, so during decoding it is easy to recognize the end of a codeword without reading ahead. Such a code corresponds to a prefix-free binary tree in which each symbol is located at one of the tree's leaves: a walk through the tree down to a leaf spells out the codeword of the symbol at that leaf. Huffman codes were introduced by David Huffman [Huff52] in 1952. This coding scheme compresses texts by assigning shorter codes to the most used symbols and longer codes to the rare ones. Now, let us illustrate the approach of Huffman's algorithm. Suppose the following plaintext with 25 symbols:

s2 s3 s2 s1 s2 s1 s4 s3 s1 s2 s1 s3 s1 s4 s1 s4 s1 s2 s1 s2 s1 s1 s1 s2 s1

The frequencies of each symbol are counted and a prefix-free tree, with branches labeled 0 (left child) or 1 (right child), is created for these symbols, as shown in figure 1.

[Figure 1 - prefix-free tree: a binary tree whose leaves are s1, s2, s3 and s4]

Then, a walk in the tree through the leaves generates the codetable of figure 2:

Symbol   Frequency   Codeword
s1       12          0
s2       7           11
s3       3           100
s4       3           101

Figure 2 - Codetable

The plaintext is encoded with 44 bits as follows:


11100110110101100011010001010101011011000110

In a standard text coding we would have 8 bits per symbol, and hence 200 bits for the above plaintext. With the codetable of figure 2 we achieve only 44 bits. This is due to the fact that the most frequent symbols in the plaintext have the shortest codewords. Huffman decoding is almost immediate: with the Huffman codes assigned to each symbol, we parse the bits of the encoded text while walking through the Huffman tree. Guided by the branch labels that match each bit, we traverse the coding tree until reaching a leaf, where we find the encoded symbol.

A Huffman tree is an optimal tree, but there are several other optimal trees. Some of them are easily obtained by exchanging the places of symbols si at the same level of the tree. This can be used to hide code information. Wayner [Wayn88] proposed two methods for assigning a key to a tree. First, suppose we have a Huffman tree with N leaves. It is well known that a strict binary tree with N leaves has:
1. N-1 internal nodes;
2. a depth H that satisfies ceil(log N) <= H <= N-1.


Therefore, Wayner proposed that we can obtain a set of optimal Huffman trees by XORing the labels of the N-1 internal branches of the tree with a control key of size N-1. Alternatively, we can assign one bit of a key to each level of the tree; the key can be quite short in this case, O(log N) bits. Milidiú et al. [Mili97] show that one can efficiently implement prefix-free codes with length restriction, obtaining very effective codes with little loss of compression. These observations lead to cipher procedures.
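To make the decoding walk concrete, the following minimal C++ sketch (our illustration; the tree is hard-coded rather than built by Huffman's algorithm) decodes the 44-bit string above with the codetable of figure 2:

```cpp
// Minimal sketch (not the authors' implementation): decode a bit string by
// walking a prefix-free tree; branch 0 = left child, branch 1 = right child.
#include <iostream>
#include <string>

struct Node {
    char symbol = 0;                      // non-zero only at leaves
    Node* child[2] = {nullptr, nullptr};  // child[0] = branch 0, child[1] = branch 1
};

int main() {
    // Hard-coded tree of figures 1 and 2: s1 = 0, s2 = 11, s3 = 100, s4 = 101.
    Node s1{'1'}, s2{'2'}, s3{'3'}, s4{'4'}, a, b, root;
    root.child[0] = &s1; root.child[1] = &a;  // "0"  -> s1
    a.child[0] = &b;     a.child[1] = &s2;    // "11" -> s2
    b.child[0] = &s3;    b.child[1] = &s4;    // "100" -> s3, "101" -> s4

    const std::string bits = "11100110110101100011010001010101011011000110";
    const Node* cur = &root;
    for (char bit : bits) {
        cur = cur->child[bit - '0'];          // follow the labeled branch
        if (cur->symbol) {                    // reached a leaf: emit and restart
            std::cout << 's' << cur->symbol << ' ';
            cur = &root;
        }
    }
    std::cout << '\n';  // prints the original 25-symbol plaintext
}
```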

3 Encrypted Huffman

Here, we propose two cryptosystems that add encryption properties to Huffman codes. A cryptosystem is a five-tuple (P, C, K, E, D) where the following conditions are satisfied:
1. P is a finite set of possible plaintexts;
2. C is a finite set of possible ciphertexts;
3. K, the keyspace, is a finite set of possible keys;
4. for each k in K, there is an encryption rule ek in E and a corresponding decryption rule dk in D, where ek: P => C and dk: C => P are functions such that dk(ek(X)) = X for every plaintext X in P.

3.1 Multiple Substitution

Our first procedure is a multiple substitution cipher. We insert null symbols into the ciphertext. A null symbol means nothing; it is included only to prevent easy decoding of the text by unauthorized people. This null symbol, which we call the fake symbol ν, is inserted in the ciphertext with multiple codes, also called homophones. We use a set V = {ν1, ν2, ..., νm} of fake symbols generated to disguise the output of the effective symbols. Next, we describe the method. Our multiple substitution is a cryptosystem (P, C, K, L, F, E, D) where we additionally define:
1. L as a finite set of possible fake codes representing the fake symbol ν; so we have L a subset of C and S+ = S U {ν}, where S+ is the alphabet of symbols plus the null symbol;
2. F = (f1, f2, ...) as the fake code generator; for i >= 1, fi: P => P^(i-1) x L.

a. Codebook construction

The fake codes are in fact homophonic codes of ν. These homophones can be generated according to several alternatives, such as:

Alternative 1: after collecting the plaintext alphabet S = {s1, s2, ..., sn} with its frequency values, create m other symbols, defined as fake symbols νj with j in [1,m], that are homophones of ν. Generate a random frequency for each νj and then construct a single Huffman tree with the S and {ν1, ν2, ..., νm} symbols together;

Alternative 2: create m symbols νj, j in [1,m], with frequency values equal to those of the first m symbols of S in the Huffman tree. Then construct a second, fake Huffman tree;

Alternative 3: after constructing the Huffman tree with the S symbols, let the codes of the first m symbols also represent the m fake codes.

In both alternatives 2 and 3, we need an extra identification bit (ID bit) to indicate whether a symbol code is real (bit 0) or fake (bit 1). We call τ the fake tree generation rate. It is a parameter that controls the expansion rate, say 30% for example, of fake symbols in the coding tree. With this rate we can configure the number of distinct fake symbols created as m = τ . n, as illustrated by figure 3 for alternative 2.

[Figure 3 - the fake tree construction rate τ generates a new fake Huffman tree]

We in fact use alternative 3, since it saves the memory space and construction time of a Huffman tree with more symbols (alternative 1) or of a second tree (alternative 2). So, τ is the part of the symmetric key that defines the number of distinct fake symbols used, as shown in figure 3; a minimal construction sketch is given below.
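As an illustration of alternative 3, the sketch below (our own; the Entry type, the fakeCodes name and the ceiling rounding of m = τ·n are assumptions, not the authors' code) reuses the first m codewords of the ordinary codetable as fake codes:

```cpp
// Minimal sketch (our illustration): alternative 3 codebook construction.
// The first m = tau * n codewords of the ordinary Huffman codetable are
// reused as the m fake codes; no extra tree is built.
#include <cmath>
#include <string>
#include <vector>

struct Entry { char symbol; std::string codeword; };

// 'table' is assumed sorted by descending frequency, as Huffman coding yields.
std::vector<std::string> fakeCodes(const std::vector<Entry>& table, double tau) {
    std::size_t m = static_cast<std::size_t>(std::ceil(tau * table.size()));
    std::vector<std::string> fakes;
    for (std::size_t i = 0; i < m && i < table.size(); ++i)
        fakes.push_back(table[i].codeword);  // real codeword reused as fake code
    return fakes;
}
```

With the figure 2 codetable and τ = 0.30, this yields m = 2, so the codes of s1 and s2 would double as fake codes; only the ID bit distinguishes a real occurrence from a fake one.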

b. Coding

With the m fake codes defined, we use a function to generate a string of fake codes for each symbol xi of X = x1 x2 .... Let Li be the string of codes generated for xi. Li consists of mi homophones of ν, that is, fake codes νj with j in [1,m], followed by the symbol xi itself: Li = (ν1, ν2, ..., νmi, xi), as shown in figure 4. The νj codes are randomly chosen from the m fake codes; sequential selection could be used if desired.

[Figure 4 - generation of homophones of ν followed by the effective symbol: fi maps xi to Li = (ν1, ν2, ..., νmi, xi)]

Finally, let ρ be the fake symbol generation rate, that is, the probability of generating a fake code at each round before outputting the effective code. The coding procedure is then the following pseudocode:

1. Choose a random number p, p in [0,1];
2. If p < ρ then e(xi) = νj and return to 1;
3. Else e(xi) = xi;
4. i is increased by 1; if i <= n then return to 1;
5. End.

The parameter ρ sets the number of fake symbols generated between real symbol outputs; it is used to balance diffusion against text expansion. The number of codes emitted per symbol, mi fake codes plus the effective symbol, follows a geometric distribution given by the expression:

P[mi + 1] = ρ^((mi+1)-1) . (1 - ρ) = ρ^mi . (1 - ρ)

The decoding procedure is very simple, as shown below (a runnable sketch of the coding and decoding procedures is given at the end of this subsection):

1. If the codeword read carries a real ID bit and is the code of some xi in S, then dk(xi) = xi;
2. Else the codeword is some νj, dk(νj) = ν, and the null symbol is skipped.

By generating fake symbols we insert new characters into the encoded text in order to flatten the overall distribution of the symbols of a given language.

c. Diffusion

A well-known cipher attack is statistical analysis. In any language some characters are used more than others. Hence, an attack can be mounted by counting the frequencies of each symbol in a ciphertext and trying to match them to the characters of the language whose distribution frequencies agree with those found in the ciphertext. With fake code generation we achieve some diffusion in the distribution of code frequencies, so counting code frequencies can lead to wrong assignments of symbols. Obviously, a great disadvantage of this scheme is text expansion. But it has great benefits too: diffusion of the frequency distribution and the loss of correlation between symbols of the language. Both features make statistical analysis more difficult.

d. Text Expansion

To estimate the text expansion of multiple substitution we observe that:
(i) the average number of output codes generated per symbol is the mean of a geometric distribution, that is, 1/(1-ρ);
(ii) one additional bit per code is needed due to the ID bit; since H(X) <= l(Huffman) < H(X) + 1, this leads to at most H(X) + 2 bits per code.

The text expansion can thus be estimated as the average number of codes generated per symbol times the average number of bits per code. So, the average number of bits B per character is:

B <= [1/(1-ρ)] . (H(X) + 2)

For example, using ρ = 0.30 and assuming H(X) = 4.19 for monogram parsing and an average HL = 1.25 for the entropy of the English language, we get 1/(1-ρ) = 1/(1-0.30) = 1.43, that is, 43% of text expansion due to fake codes, and

B1 <= 1.43 . (4.19 + 2) = 8.85
BL <= 1.43 . (1.25 + 2) = 4.65

So, compared to the standard ASCII representation (8 bits per character), we should still achieve compression with our encrypted Huffman coding when word parsing is used.
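A minimal end-to-end sketch of the multiple substitution procedure follows (our illustration, not the authors' implementation; in particular, placing the ID bit in front of each codeword, the kCode/kFakes names and the fixed RNG seed are assumptions):

```cpp
// Minimal sketch (our illustration) of section 3.1: each emitted unit is an
// ID bit (0 = real, 1 = fake) followed by a codeword from figure 2; fake
// codes reuse the s1/s2 codewords (alternative 3 with m = 2).
#include <iostream>
#include <map>
#include <random>
#include <string>
#include <vector>

const std::map<char, std::string> kCode = {
    {'1', "0"}, {'2', "11"}, {'3', "100"}, {'4', "101"}};
const std::vector<std::string> kFakes = {"0", "11"};  // m = 2 fake codes

std::string encode(const std::string& plain, double rho, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> pick(0, kFakes.size() - 1);
    std::string out;
    for (char sym : plain) {
        while (coin(rng) < rho)              // geometric number of fakes
            out += '1' + kFakes[pick(rng)];  // ID bit 1 + a fake code
        out += '0' + kCode.at(sym);          // ID bit 0 + the real code
    }
    return out;
}

std::string decode(const std::string& bits) {
    std::string out;
    std::size_t i = 0;
    while (i < bits.size()) {
        bool fake = (bits[i++] == '1');      // read the ID bit first
        std::string code(1, bits[i++]);      // then parse one prefix-free code
        while (code != "0" && code != "11" && code != "100" && code != "101")
            code += bits[i++];
        if (fake) continue;                  // null symbol: skip it
        for (const auto& [sym, cw] : kCode)
            if (cw == code) out += sym;      // map codeword back to symbol
    }
    return out;
}

int main() {
    std::mt19937 rng(42);                    // fixed seed for reproducibility
    std::string plain = "2321214312131414121211121";  // the section 2 example
    std::string cipher = encode(plain, 0.30, rng);
    std::cout << "ciphertext bits: " << cipher.size()  // 44 + 25 ID bits + fakes
              << (decode(cipher) == plain ? ", round-trip ok\n" : ", mismatch\n");
}
```

Decoding first reads the ID bit, then parses one prefix-free codeword, and skips the unit whenever the ID bit marks it as fake.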

3.2 Stream Cipher

Suppose that someone has the encoded text, the Huffman codetable and the number of fake codes defined by τ. Then decoding is immediate. Therefore, a secret key is necessary to add confusion to the process. The secret key we use is fully scalable: it can have any length we desire, 48 bits, 128 bits, 256 bits, etc. So, we introduce a second procedure, a stream cipher. A stream cipher is a cryptosystem (P, C, K, L, F, E, D) where we additionally define:
1. L as a finite set called the keystream alphabet;
2. F = (f1, f2, ...) as the keystream generator; for i >= 1, fi: K x P^(i-1) => L.

a. Keystream

Let k in K and X = x1 x2 .... The stream cipher procedure is defined as follows:
(i) Z = z1 z2 ... is the keystream. A function fi generates zi from k, zi = fi(k, x1, x2, ..., xi-1), and zi is used to cipher xi such that yi = e_zi(xi);
(ii) in general a new key bit zi is produced for each incoming xi, generated from the past z1, y1, z2, y2, ..., zi-1, yi-1. Our method consists of a single constant function such that zi = f(k, i).

If |k| < |X|, that is, if the key is shorter than the plaintext, we have two alternatives to generate the keystream:

Alternative 1: cyclic keystream generation - when the key k ends we return to its beginning, and so on. The key bit is then defined by zi = f(k, i) = k_(i mod λ), where λ = |k|;

Alternative 2: a random keystream generator function f generates bit-streams using the key k as a seed.

We assume alternative 1, as illustrated in figure 5. This way, we maintain Huffman's coding synchronism properties. Alternative 2 is not used since it creates a dependency on the past.

[Figure 5 - XOR between the i-th bit of the cyclic keystream k (e.g. 0010011010) and the ID bit of each codeword]

b. Coding

We could define the coding and decoding procedures as XOR (exclusive-or) operations between the bit zi and all the bits of the codeword, which is equivalent to exchanging the places of symbols in the Huffman tree:

yi = ek(xi) = (zi xor xi,1, zi xor xi,2, ..., zi xor xi,|xi|)
xi = dk(yi) = (zi xor yi,1, zi xor yi,2, ..., zi xor yi,|yi|)

However, we in fact use a simpler procedure, defined by an XOR between the bit zi and only the first bit of the codeword. This is equivalent to disguising only the ID bit:

yi = ek(xi) = (zi xor h1, xi,2, ..., xi,|xi|)
xi = dk(yi) = (zi xor h'1, yi,2, ..., yi,|yi|)

where h1 and h'1 are the ID bits of xi and yi, respectively. A minimal sketch of this procedure is given below.
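The following sketch is our illustration of the ID-bit encryption (passing the unit boundaries in explicitly is a simplification; a real decoder discovers them while parsing the prefix-free codes):

```cpp
// Minimal sketch (our illustration): stream-cipher step of section 3.2.
// The ID bit of the i-th unit (ID bit + codeword) is XORed with the key bit
// z_i = k_(i mod lambda); the same call encrypts and decrypts (XOR involution).
#include <cstddef>
#include <string>
#include <vector>

std::string xorIdBits(std::string stream,                     // '0'/'1' bits
                      const std::vector<std::size_t>& starts, // unit offsets
                      const std::string& key) {               // e.g. "0010011010"
    for (std::size_t i = 0; i < starts.size(); ++i) {
        char z = key[i % key.size()];        // cyclic keystream, alternative 1
        // XOR of two '0'/'1' characters: equal -> '0', different -> '1'.
        stream[starts[i]] = (stream[starts[i]] == z) ? '0' : '1';
    }
    return stream;
}
```

Because XOR is an involution, the same function both encrypts and decrypts, and since codeword boundaries do not depend on the value of the ID bit, the cipher preserves Huffman's synchronism properties.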

c. Confusion

With this XOR operation we add secrecy to our method. It is simple and has a low processing cost. The secret key can have any length, and any kind of trial-and-error attack would be so time consuming as to make the analysis unfeasible. In this scheme we in fact have a composite key that contains the fake code rate ρ and the keystream seed k, as shown in figure 6.

[Figure 6 - the composed key: the rate ρ together with the keystream seed k]

With this secret key included in the process, we have an encryption system suitable for insecure communication channels.

d. Text Expansion

The stream cipher does not cause any additional text expansion.

4 Experiments

In table 1 we list some results obtained with a C++ implementation of our encrypted Huffman coding and decoding. We use the Brazilian Constitution and the Gutenberg Project collection, with ρ = 0.30 and monogram parsing; all sizes are expressed in bytes. We first measure the storage space required to compress the document collection using standard Huffman coding. Then we use the encrypted Huffman coding and measure the difference in space required by the encryption features, that is, the extra space needed. Observe that the text expansion is very close to its expected value. Moreover, the additional time needed to introduce encryption into the compression process is about 5%.

                              Brazilian Constitution   Gutenberg Project
Plaintext size                7,004,160                39,059,456
Encoded text size (Huffman)   4,078,717                22,813,331
Ciphertext size               8,309,822                52,539,412
Expansion over plaintext      19%                      34%
Expansion over encoded text   104%                     130%

Table 1 - Extra-space results

5 Conclusions

Usually, one uses serial procedures to first compress and then encrypt a file. In this work we proposed simple modifications of Huffman codes that add encryption to their compression feature. The result is a fast scheme with low computational power consumption that provides enough confusion and diffusion in the ciphertext, as follows from both its theoretical and practical properties. The diffusion and confusion are also controlled by parameters: the factor ρ controls the rate of generating fake codes, so with ρ we can set the ciphertext expansion due to null symbol insertion, which is directly associated with the security of the file, while the size of the key k is associated with the security of the confusion feature.

References

[Gunt88] Günther, C.G. 1988. A Universal Algorithm for Homophonic Coding. In Advances in Cryptology - Eurocrypt '88, LNCS, vol. 330.
[Huff52] Huffman, D. 1952. A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE, 1098-1101.
[Mass89] Massey, J.L., Kuhn, Y.J.B., Jendal, H.N. 1989. An Information-Theoretic Treatment of Homophonic Substitution. In Advances in Cryptology - Eurocrypt '89, LNCS, vol. 434.
[Mili97] Milidiú, R.L., Laber, E.S. 1997. Improved Bounds on the Inefficiency of Length-Restricted Prefix Codes.
[Moff94] Moffat, A., Witten, I.H., Bell, T.C. 1994. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold.
[Rive96] Rivest, R.L., Mohtashemi, M., Gillman, D.W. 1996. On Breaking a Huffman Code. IEEE Transactions on Information Theory, vol. 42, no. 3.
[Shan49] Shannon, C. 1949. Communication Theory of Secrecy Systems. Bell System Technical Journal, vol. 28, no. 4, pp. 656-715.
[Simm91] Simmons, G. 1991. Contemporary Cryptology: The Science of Information Integrity. IEEE Press.
[Stin95] Stinson, D.R. 1995. Cryptography: Theory and Practice. CRC Press.
[Wayn88] Wayner, P. 1988. A Redundancy Reducing Cipher. Cryptologia, 107-112.
