Modification of Adaptive Huffman Coding for Use in Encoding Large Alphabets
ITM Web of Conferences 15, 01004 (2017). DOI: 10.1051/itmconf/20171501004
CMES'17
Abstract. The paper presents a modification of the Adaptive Huffman Coding method, a lossless data compression technique used in data transmission. The modification concerns the process of adding a new character to the coding tree: the author proposes to introduce two special nodes instead of the single NYT (not yet transmitted) node of the classic method. One of the nodes is responsible for indicating the place in the tree where a new node is attached. The other node is used for sending the signal indicating the appearance of a character which is not yet present in the tree. The modified method was compared with existing coding methods in terms of overall data compression ratio and performance. The proposed method may be used for large alphabets, i.e. for encoding whole words instead of separate characters, when new elements are added to the tree comparatively frequently.
* Corresponding author: [email protected]
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
- near-accurate symbol probabilities, hence the better compression ratio of arithmetic coding, combined with the fast encoding and decoding of Huffman coding [6];
- the Facebook Zstandard algorithm is based on the LZ77 dictionary coder and tANS, an effective entropy coding based on the Huffman method [7];
- the Brotli coding algorithm, which is used in most modern Internet browsers, such as Chrome and Opera; similarly to Zstandard, it is based on the combination of LZ77 and modified Huffman coding [8].

So it may be clearly seen that, despite its long history, the Huffman encoding algorithm still presents great interest for application.

But the compression ratio does not show the actual space saving, since besides the encoded message the table with the code-character pairs has to be transmitted. For that reason, another index should be introduced. The sent-to-original-bits ratio (SOBR) is described by the formula below:

$SOBR = \frac{n_{sent}}{n_{orig}}$ (2)

where $n_{sent}$ is the total number of sent bits (the encoded message together with the code table) and $n_{orig}$ is the number of bits in the original message.

The only way to make SOBR equal to DCR is to use one coding tree for all messages, but such a tree will not be optimal, since the character probabilities may differ between messages.
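For illustration, the two indices can be computed as in the minimal Python sketch below. It assumes that DCR is the plain ratio of encoded bits to original bits and that the code table is transmitted with each message; the function names and toy numbers are ours, not the paper's.

def dcr(encoded_bits, original_bits):
    # data compression ratio: size of the encoded message / original size
    return encoded_bits / original_bits

def sobr(encoded_bits, table_bits, original_bits):
    # formula (2): all transmitted bits (message plus code table) / original size
    return (encoded_bits + table_bits) / original_bits

# Toy numbers: a 1000-bit message encoded into 550 bits with a 120-bit table.
print(dcr(550, 1000))        # 0.55
print(sobr(550, 120, 1000))  # 0.67 -- never below DCR when a per-message table is sent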
To prove the feasibility of the method, the notion of information entropy should be mentioned. This term has several definitions:
- the measure of unpredictability of a state;
- the expected (mean) value of the information contained in a message.
The value of entropy is calculated by the following formulas.

Entropy of the i-th character in a message:

$I_i = -\log_m p_i$ (4)

Average entropy of a message:

$H = \sum_{i=1}^{n} -p_i \log_m p_i = \sum_{i=1}^{n} p_i I_i$ (5)

where: $p_i$ is the probability of the i-th character in the message; $n$ is the number of unique characters in the message; $m$ is the logarithm base, usually taken equal to 2, as computers use the binary system.

In practice it may be stated that the average entropy of a message is equal to the least possible average code length for the characters contained in the message [1].
It is well known that, among discrete distributions with an equal number of states, the uniform distribution has the maximum value of entropy [12]. Every state of the uniform distribution has the same probability, equal to 1/n, where n is the number of states, which yields the entropy:

$H_{uniform} = \log_2 n$ (6)
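Formulas (4)-(6) can be verified directly. The short Python sketch below computes the per-character information and the average entropy of the paper's test phrase, as well as the maximal entropy of formula (6); the function names are ours.

import math
from collections import Counter

def character_information(message, m=2):
    # I_i = -log_m(p_i), formula (4)
    total = len(message)
    return {ch: -math.log(cnt / total, m)
            for ch, cnt in Counter(message).items()}

def average_entropy(message, m=2):
    # H = sum_i -p_i * log_m(p_i) = sum_i p_i * I_i, formula (5)
    total = len(message)
    return sum(-(cnt / total) * math.log(cnt / total, m)
               for cnt in Counter(message).values())

phrase = "A friend in need is a friend indeed"
print(round(average_entropy(phrase), 3))  # lower bound on bits per character
print(math.log2(16384))                   # formula (6): uniform over 16384 states = 14.0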
To estimate the maximum entropy, the maximum number of encoded words should be defined. It has been decided to take this number equal to 16384, which yields a maximal entropy of 14 bits. Such a margin is taken because the practical number of stored words may be higher than 7000, owing to capitalized words, abbreviations, misspellings and user-defined words. To roughly estimate the compression ratio, the average length of an English word is used; it is approximately equal to 4 letters. If ASCII coding is taken into consideration, the average length in bits equals 32. Based on these values, the average compression ratio equals 14/32 ≈ 0.43. This estimate is fairly promising, as the average compression ratio for separate-character AHC is about 0.55 [4].
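The estimate can be reproduced directly with the paper's own figures:

import math

max_words = 16384                      # assumed dictionary capacity
bits_per_word = math.log2(max_words)   # maximal entropy: 14 bits per word
avg_word_bits = 4 * 8                  # ~4 letters per English word, 8-bit ASCII
print(bits_per_word / avg_word_bits)   # 0.4375, i.e. the ~0.43 ratio above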
As stated in the previous chapter, the sent-to-original-bits ratio tends to be significantly higher than the data compression ratio for complete-word AHC. This effect is especially noticeable during the phase of initially building the coding tree, when new words arrive especially frequently. There are two factors that affect SOBR in this case: sending the complete ASCII codes of the new-coming words and sending the NYT bit sequence. Precoded dictionaries may be used to decrease the influence of the first factor; this possibility is not considered in the current paper. However, the second factor, i.e. sending the NYT bit sequence, will be optimized within the frame of this research. As described in chapter 2, the NYT node is used both for indicating the place in the tree where the new-coming word is attached and for sending the signal meaning that the ASCII code of the new-coming word is about to be sent. This fact provides the opportunity for optimising the algorithm.
4 Modification

4.1 Introduction of NCW node

The NYT node should only act as the pointer for the new-coming word. Its weight should still be 0 and its key number should have the least value in the tree. A new node, NCW, should be introduced to the tree. Its initial weight should be equal to 0. This node should be used for sending the "new-coming word" signal. The introduced NCW node should be treated as a usual leaf node, i.e. after sending the bit sequence corresponding to the NCW node, the procedure described in lines 8-17 of Listing 1 should be carried out for the NCW node. Listing 3 presents the modified algorithm.

Listing 3. Modified algorithm.

Transmitter:
Transmit(character ch)
1 if ch is found in the tree
2   send the code of ch bit by bit;
3 else
4   send the code of NCWNode bit by bit;
5   updateTree(NCWNode);
6   send the ASCII code of ch;
7 end if
8 updateTree(ch);

Receiver (receiving is performed in the stream, i.e. in an infinite loop):
Receive()
1 while (true)
2   receive one bit from the transmitter;
3   add the received bit to bitSequence;
4   if bitSequence leads to a leaf node
5     obtain the character ch stored in the node;
6     add ch to the decoded message;
7     updateTree(ch);
8     set bitSequence empty;
9   end if
10  if bitSequence leads to NCWNode
11    updateTree(NCWNode);
12    receive 8 bits;
13    convert the received 8 bits to character ch;
14    add ch to the decoded message;
15    updateTree(ch);
16    set bitSequence empty;
17  end if
18 end while

Figure 3 presents the difference between the modified and non-modified methods. The phrase "A friend in need is a friend indeed" was used for the test. As may be noticed, the weight of the NCW node is equal to the number of unique leaf nodes.
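A runnable Python sketch of Listing 3 is given below. For brevity it rebuilds the whole Huffman code book from the current symbol weights after every update, instead of maintaining the adaptive tree in place as the real method does, so the resulting codes may differ from those of the actual tree; the NCW handling, however, follows Listing 3 step by step, and all names are ours.

import heapq
from itertools import count

NCW = "<NCW>"  # special leaf signalling a new-coming character

def build_codes(weights):
    # Stand-in for the incremental tree update: rebuild the Huffman code
    # book from scratch. Sorting plus a tie-break counter makes the result
    # deterministic, which keeps transmitter and receiver in sync.
    tick = count()
    heap = [(w, next(tick), sym) for sym, w in sorted(weights.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tick), (a, b)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

def transmit(text):
    weights = {NCW: 0}
    bits = []
    for ch in text:
        codes = build_codes(weights)
        if ch in weights:                    # lines 1-2: known character
            bits.append(codes[ch])
            weights[ch] += 1                 # updateTree(ch)
        else:                                # lines 3-6: new-coming character
            bits.append(codes[NCW])          # send the NCW code
            weights[NCW] += 1                # updateTree(NCWNode)
            bits.append(format(ord(ch), "08b"))  # send the raw ASCII code
            weights[ch] = 1                  # updateTree(ch) adds the new leaf
    return "".join(bits)

def receive(bits):
    weights = {NCW: 0}
    decoded, seq, i = [], "", 0
    inverse = {v: k for k, v in build_codes(weights).items()}
    while i < len(bits):
        seq += bits[i]; i += 1
        if seq in inverse:                   # codes are prefix-free, exact hit is safe
            sym = inverse[seq]
            if sym == NCW:                   # lines 10-16: new-coming character
                weights[NCW] += 1
                ch = chr(int(bits[i:i + 8], 2)); i += 8
                weights[ch] = 1
            else:                            # lines 4-8: known character
                ch = sym
                weights[ch] += 1
            decoded.append(ch)
            inverse = {v: k for k, v in build_codes(weights).items()}
            seq = ""
    return "".join(decoded)

phrase = "A friend in need is a friend indeed"
assert receive(transmit(phrase)) == phrase

Rebuilding the code book for every symbol is quadratic and serves clarity only; the point of the sketch is the protocol itself: a known character is sent as its current code, while a new one is announced by the NCW code followed by its raw 8-bit ASCII code.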