Lec 3 Huffman Coding
Unit 2 - Part 3
Today we will consider some examples of lossless source coding algorithms to conclude the discussion on the idea of binary representation of information signals.
1 Huffman Codes
1.1 Binary Huffman Codes
Consider the source S with symbols s1, s2, · · · , sq and symbol probabilities P1, P2, · · · , Pq. Let the symbols be ordered so that P1 ≥ P2 ≥ · · · ≥ Pq. By regarding the last two symbols of S as combined
into one symbol, we obtain a new source from S containing only q − 1 symbols. We refer to this new
source as a reduction of S . The symbols of this reduction of S may be reordered, and again we may
combine the two least probable symbols to form a reduction of this reduction of S . By proceeding in
this manner, we construct a sequence of sources, each containing one fewer symbol than the previous
one, until we arrive at a source with only two symbols.
Construction of a sequence of reduced sources, as illustrated in Figure 1, is the first step in the construction of a compact instantaneous code for the original source S.
The second step is the recognition of a compact binary instantaneous code for the last reduced source. The final step is to assign a compact instantaneous code to the source immediately preceding the reduced source and work backward along the sequence of sources until we arrive at a compact instantaneous code for the original source.
Figure 2 illustrates an example of the construction of a compact code. First we formed a sequence
of reduced sources from the original source S . Then we assigned the words 0 and 1 to the last source
in the sequence (S4 in this case). Finally, we worked our way back from S4 to S through the reduced
sources.
It may sometimes be unnecessary to form a sequence of reductions of the original source all the way to a source with only two symbols. This is so since we need only form reductions until we find the first reduction for which we have a compact code. Once we have a compact code for any reduction of a source, we may start working backward from this compact code. This point is illustrated in Figure 3.
Figure 4: A quaternary compact code
Huffman codes are optimal in the sense that no other instantaneous (prefix-free) code achieves a shorter expected length, provided the symbols in the messages are drawn independently from a fixed, known probability distribution. The optimal (Huffman) code obtained is also a compact code.
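The whole procedure maps naturally onto a priority queue: repeatedly merging the two least probable entries is exactly one source reduction, and prefixing 0 and 1 onto the merged groups is the backward assignment of codewords. Below is a minimal Python sketch under that view; the function name huffman_code and the four-symbol source are ours, for illustration only.

import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a source given as a
    symbol -> probability mapping. Returns symbol -> codeword."""
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    # The tie-breaker keeps heapq from ever comparing the dicts.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Combine the two least probable entries: one source reduction.
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        # Working backward, every symbol in a merged group gets one more
        # leading bit: 0 for one group, 1 for the other.
        merged = {s: "0" + w for s, w in group0.items()}
        merged.update({s: "1" + w for s, w in group1.items()})
        heapq.heappush(heap, (p0 + p1, tie, merged))
        tie += 1
    return heap[0][2]

# Made-up four-symbol source, in the text's notation s1..sq:
probs = {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.1}
code = huffman_code(probs)
print(code)  # with this tie-breaking: {'s1': '0', 's2': '10', 's3': '111', 's4': '110'}
print(sum(probs[s] * len(w) for s, w in code.items()))  # expected length 1.9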
2 Dictionary Coding
The class of universal coding techniques includes different versions of dictionary coding algorithms. To
understand the source coding problem better, let us consider the problem of digital representation of
the text of a book written in, say, English. There are several possible approaches:
2.1 Fixed Dictionary
One approach is to use a predefined dictionary for the encoding. One can analyze a few books and estimate the probabilities of the different letters of the alphabet. Then, treat each letter as a symbol and apply Huffman coding to compress the document of interest. This may work quite well for some English texts. However, the algorithm will not work as well if it is used on, for example, a list of names, program code, or even an image, which have completely different character distributions.
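As a rough sketch of this approach (reusing the huffman_code function from the sketch in Section 1.1), one could estimate letter probabilities from a training text and build a fixed code from them. Note that any symbol absent from the training text gets no codeword at all, which is one more way a fixed dictionary can fail.

from collections import Counter

def fixed_dictionary_code(training_text):
    """Estimate symbol probabilities from a sample text and build a
    fixed Huffman code from them (hypothetical helper for illustration)."""
    counts = Counter(training_text)
    total = sum(counts.values())
    probs = {ch: n / total for ch, n in counts.items()}
    return huffman_code(probs)  # sketch from Section 1.1

code = fixed_dictionary_code("the quick brown fox jumps over the lazy dog")
# The resulting code is only compact for sources with this character
# distribution; applied to program code or image data it can even
# expand the input, and characters never seen here cannot be encoded.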
3 The Lempel-Ziv-Welch (LZW) Algorithm
3.1 Encoding
3.1.1 Algorithm
Step 1. Initialize the dictionary to contain one entry for each byte. Initialize the encoded string with the first byte of the input stream.
Step 2. Read the next byte from the input stream.
Step 3. If there are no more bytes, output the code for the encoded string and stop.
Step 4. If concatenating the byte to the encoded string produces a string that is in the dictionary: set the encoded string to this longer string and go to step 2.
Step 5. Otherwise: output the code for the encoded string, add the encoded string plus the new byte to the dictionary, set the encoded string to the new byte, and go to step 2.
3.1.2 Pseudo-code
initialize TABLE[0 to 255] = code for individual bytes
STRING = get input symbol
while there are still input symbols:
    SYMBOL = get input symbol
    if STRING + SYMBOL is in TABLE:
        STRING = STRING + SYMBOL
    else:
        output the code for STRING
        add STRING + SYMBOL to TABLE
        STRING = SYMBOL
output the code for STRING
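A minimal runnable Python version of this encoder, assuming the input is a byte string and the output is a list of integer codes (0-255 are the single bytes, 256 and up are learned dictionary entries):

def lzw_encode(data):
    """LZW encoding following the pseudo-code above."""
    if not data:
        return []
    table = {bytes([i]): i for i in range(256)}  # one entry per byte
    next_code = 256
    string = data[:1]                  # start with the first input byte
    codes = []
    for b in data[1:]:
        candidate = string + bytes([b])
        if candidate in table:
            string = candidate         # keep extending the current match
        else:
            codes.append(table[string])    # emit code for the longest match
            table[candidate] = next_code   # learn the new string
            next_code += 1
            string = bytes([b])
    codes.append(table[string])        # flush the final string
    return codes

print(lzw_encode(b"abcabcabcabcabcabc"))
# [97, 98, 99, 256, 258, 257, 259, 262, 257] -- matches Table 1 below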
3.2 Decoding
3.2.1 Algorithm
Step 1. Initialize the dictionary to contain one entry for each byte.
Step 2. Read the first code word from the input stream and write out the byte it encodes.
Step 3. Read the next code word from the input stream; if there are no more code words, stop.
Step 4. If the code word is in the dictionary, the new string is its dictionary entry. Otherwise (the code word is not yet in the dictionary), the new string is the previous string with its own first character appended.
Step 5. Write out the new string.
Step 6. Concatenate the first character of the new string to the string produced by the previous code word and add the resulting string to the dictionary; the new string now becomes the previous string.
Step 7. Go to step 3.
3.2.2 Pseudo-code
initialize TABLE[0 to 255] = code for individual bytes
CODE = read next code from encoder
STRING = TABLE[CODE]
output STRING
while there are still codes to receive:
    CODE = read next code from encoder
    if TABLE[CODE] is not defined:    // needed because sometimes the
        ENTRY = STRING + STRING[0]    // decoder may not yet have the entry!
    else:
        ENTRY = TABLE[CODE]
    output ENTRY
    add STRING + ENTRY[0] to TABLE
    STRING = ENTRY
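A matching Python sketch of the decoder; the exception branch is exactly the case the comment in the pseudo-code warns about:

def lzw_decode(codes):
    """LZW decoding following the pseudo-code above; inverse of lzw_encode."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    string = table[codes[0]]           # first code is always a single byte
    out = [string]
    for code in codes[1:]:
        if code not in table:
            # Exception: the encoder used a code it created on the very
            # same step, so the entry must be STRING + STRING[0].
            entry = string + string[:1]
        else:
            entry = table[code]
        out.append(entry)
        table[next_code] = string + entry[:1]  # mirror the encoder's table
        next_code += 1
        string = entry
    return b"".join(out)

print(lzw_decode([97, 98, 99, 256, 258, 257, 259, 262, 257]))
# b'abcabcabcabcabcabc'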
Table 1: Encoder Dictionary
S      Message Byte   Look up   Result      Code Output   New Dictionary Entry
-      a              -         -           -             -
a      b              ab        not found   97 (a)        Table[256]=ab
b      c              bc        not found   98 (b)        Table[257]=bc
c      a              ca        not found   99 (c)        Table[258]=ca
a      b              ab        found       -             -
ab     c              abc       not found   256           Table[259]=abc
c      a              ca        found       -             -
ca     b              cab       not found   258           Table[260]=cab
b      c              bc        found       -             -
bc     a              bca       not found   257           Table[261]=bca
a      b              ab        found       -             -
ab     c              abc       found       -             -
abc    a              abca      not found   259           Table[262]=abca
a      b              ab        found       -             -
ab     c              abc       found       -             -
abc    a              abca      found       -             -
abca   b              abcab     not found   262           Table[263]=abcab
b      c              bc        found       -             -
bc     -end-          -         -           257           -
NOTE: The decoding process demonstrates the exception and how it is handled. (New encoded string = old encoded string + the first character of the old encoded string = abc + a = abca).
Uncompressed size = 18 ∗ 8 = 144 bits.
Compressed size = 9 ∗ 9 = 81 bits.
Data Compression Ratio = Uncompressed Size / Compressed Size = 144 / 81 ≈ 1.78
Space Savings = 1 − Compressed Size / Uncompressed Size = 1 − 81 / 144 = 0.4375
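These figures can be checked in a couple of lines; the 9-bit code width is presumably used because the largest code emitted (262) no longer fits in 8 bits:

uncompressed_bits = 18 * 8   # 18 message bytes at 8 bits each = 144
compressed_bits = 9 * 9      # 9 output codes at 9 bits each = 81 (262 < 2**9)
print(uncompressed_bits / compressed_bits)      # 1.777... ~ 1.78
print(1 - compressed_bits / uncompressed_bits)  # 0.4375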
4 Run-Length Encoding
Imagine you are given the string of bits
0000000 111 0000 11 00000
with runs of lengths 7, 3, 4, 2, and 5. Rather than create codewords, we could simply store this string of bits as the run-length sequence (7, 3, 4, 2, 5), together with the value of the first bit. This strategy is used in fax-machine transmission, and also in JPEG.
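A minimal Python sketch of the idea (the function name run_lengths is ours; a real format must also fix a convention for recording the first bit's value):

def run_lengths(bits):
    """Run-length encode a bit string: return the value of the first
    bit together with the lengths of the successive runs."""
    if not bits:
        return None, []
    runs = [1]
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            runs[-1] += 1      # still inside the current run
        else:
            runs.append(1)     # a new run begins
    return bits[0], runs

print(run_lengths("000000011100001100000"))
# ('0', [7, 3, 4, 2, 5])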
Figure 6: Discrete sources require only the inner layer above, whereas the outer two layers are used for
analog sequences.