Huffman Coding
Motivation
Character                a    b    c    d    e    f
Frequency (in 1000s)     45   13   12   16   9    5
Fixed-length codeword    000  001  010  011  100  101

Total: 100,000 characters × 3 bits = 300K bits
Can we do better?
YES!
• Use variable-length codes instead.
• Give frequent characters short codewords, and
infrequent characters long codewords.
Character                  a    b    c    d    e     f
Frequency (in 1000s)       45   13   12   16   9     5
Fixed-length codeword      000  001  010  011  100   101
Variable-length codeword   0    101  100  111  1101  1100
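The saving from the variable-length code in the table above can be checked directly. A small sketch (the codeword lengths are taken from the table; the 224K total is computed, not stated on the slide):

```python
# Compare total encoded length under the two codes from the table.
# Frequencies are in thousands of characters.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
fixed_len = {c: 3 for c in freq}                              # every codeword is 3 bits
var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}    # lengths of 0, 101, 100, ...

fixed_bits = sum(freq[c] * fixed_len[c] for c in freq)
var_bits = sum(freq[c] * var_len[c] for c in freq)
print(fixed_bits, "K bits")  # 300 K bits
print(var_bits, "K bits")    # 224 K bits
```

So the variable-length code saves about 25% on this input.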
No ambiguity! No codeword is a prefix of any other (a prefix code), so every encoded string can be decoded in exactly one way.
[Figure: two code trees. Left: the tree for the fixed-length code, with root 100 and internal nodes 86, 14, 58, 28 over the leaves a:45, b:13, c:12, d:16, e:9, f:5. Right: the optimal tree, with a:45 at depth 1 under the root (100), and the remaining leaves under a subtree of weight 55: c:12 and b:13 under 25; f:5 and e:9 under 14, which joins d:16 under 30.]
Greedy Choice?
• At each step, the two nodes of smallest weight are chosen and merged.
The steps of Huffman's algorithm
[Figure: the queue starts as f:5, e:9, c:12, b:13, d:16, a:45. The merges, in order: f:5 + e:9 → 14; c:12 + b:13 → 25; 14 + d:16 → 30; 25 + 30 → 55; 55 + a:45 → 100 (the root).]
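The merge steps above can be sketched with a min-heap; this is a minimal implementation, with a tie-breaking counter added so heap entries never compare on the (unorderable) node payloads:

```python
import heapq
from itertools import count

def huffman_tree(freq):
    """Build a Huffman tree greedily: repeatedly merge the two
    smallest-weight nodes. Leaves are symbol strings; internal
    nodes are (left, right) tuples."""
    tick = count()  # unique tie-breaker for equal weights
    heap = [(w, next(tick), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # smallest weight
        w2, _, right = heapq.heappop(heap)   # second smallest
        heapq.heappush(heap, (w1 + w2, next(tick), (left, right)))
    return heap[0][2]  # the root's payload

tree = huffman_tree({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5})
```

On this input the first merge is f:5 + e:9, then c:12 + b:13, matching the figure; a ends up at depth 1 and f, e at depth 4.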
Running Time Analysis
Using a binary min-heap for the nodes, each of the n − 1 merge steps costs O(log n), so Huffman's algorithm runs in O(n log n) time for an alphabet of n characters.
Huffman Decoding
Code table (leaf weights in parentheses):
  I (7) = 00   E (8) = 01   C (5) = 10   H (2) = 110   A (3) = 111

Input:  ACE
Output: (111)(10)(01) = 1111001

[Figure: the code tree. A:3 and H:2 join into a node of weight 5, which joins C:5 into 10; E:8 and I:7 join into 15; the root 25 joins 10 and 15.]
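Encoding with a prefix code is pure concatenation. A minimal sketch using the code table from the slide:

```python
# The code table from the slide.
code = {"E": "01", "I": "00", "C": "10", "A": "111", "H": "110"}

def encode(message, code):
    """Concatenate codewords; a prefix code needs no separators."""
    return "".join(code[ch] for ch in message)

print(encode("ACE", code))  # 1111001
```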
Decoding 1111001 step by step: start at the root and follow one edge per bit; on reaching a leaf, output its character and return to the root. Bits 111 lead to the leaf A, bits 10 to C, and bits 01 to E, so 1111001 decodes to ACE.
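The decoding walk can be mirrored in code. Rather than storing the tree, this sketch matches codeword prefixes directly, which is equivalent for a prefix code:

```python
def decode(bits, code):
    """Scan the bit string left to right; because the code is
    prefix-free, the first codeword that matches is the right one."""
    inverse = {cw: sym for sym, cw in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:          # a complete codeword: emit, restart
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

code = {"E": "01", "I": "00", "C": "10", "A": "111", "H": "110"}
print(decode("1111001", code))  # ACE
```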
Correctness Proof
Induction on the number of codewords.
Greedy Choice Proof
Let x and y be the two characters of lowest frequency, and let X be any optimal tree. Swap x with a deepest leaf a, and then y with its sibling b, to obtain a tree X' in which x and y are sibling leaves of maximum depth. Since f(x) ≤ f(a) and f(y) ≤ f(b), neither swap increases the cost:
  C(X') <= C(X)
Let T be the greedy tree (in which x and y are already siblings), and let T'' and X'' be the trees obtained from T and X' by replacing the parent of x and y with a single leaf of weight f(x) + f(y). Then:
  C(T) = C(T'') + f(x) + f(y)
       <= C(X'') + f(x) + f(y)   (T'' is optimal for the reduced alphabet, by induction)
       = C(X')
       <= C(X)
so T is optimal.
Problems (with two-pass static Huffman coding):
• Delay (network communication, file compression applications).
• Extra disk accesses slow down the algorithm.
• We need one-pass methods, in which letters are encoded “on the fly”.
Dynamic Huffman codes
Algorithm FGK
• The next letter of the message is encoded on the basis of a Huffman tree for the previous letters.
• Encoding length is at most 2S + t, where S is the encoding length by a static Huffman code, and t is the number of letters in the original message.
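To make the one-pass idea concrete, here is a naive sketch that is NOT FGK: it rebuilds a static Huffman code from scratch for the counts seen so far before each letter, whereas FGK updates the tree incrementally in time proportional to the codeword length. First occurrences (which in a real coder need an escape mechanism) are skipped to keep the sketch short:

```python
import heapq
from itertools import count

def build_codes(freq):
    """Static Huffman codes for the given counts (helper)."""
    tick = count()
    heap = [(w, next(tick), {s: ""}) for s, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}       # left subtree
        merged.update({s: "1" + c for s, c in b.items()}) # right subtree
        heapq.heappush(heap, (w1 + w2, next(tick), merged))
    return heap[0][2]

def adaptive_encode(message):
    """One-pass coding: encode each letter with a Huffman code for the
    letters already seen, then update the counts. Correct but slow --
    FGK gets the same adaptivity without the per-letter rebuild."""
    counts, out = {}, []
    for ch in message:
        if ch in counts and len(counts) > 1:
            out.append(build_codes(counts)[ch])
        # first occurrences would need an escape code; omitted here
        counts[ch] = counts.get(ch, 0) + 1
    return out
```

The decoder can stay synchronized because it maintains the same counts and therefore rebuilds the same code before each letter.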
[Figure: the current Huffman tree for the message so far, with leaves a, b, c, d, e and their weights.]
Difficulty
How to modify this tree quickly in order to get a Huffman tree for M_{t+1}?
[Figure: the next letter is a_{t+1} = “b”. Simply incrementing the weights on the path from b's leaf to the root (shown with an X) need not yield a Huffman tree, so the tree is first restructured.]
First phase: walking from a_{t+1}'s leaf toward the root, each node is swapped, where necessary, with the highest-numbered node of the same weight, so that the sibling property still holds after the weights are incremented.
[Figure: the tree after the first-phase swaps; weights are unchanged, but the nodes on b's root path are now the highest-numbered nodes of their weight classes.]
Second phase
• We turn this tree into the desired Huffman tree for M_{t+1} by incrementing the weights of a_{t+1}'s leaf and its ancestors by 1.
[Figure: the second phase. Each weight on the path from b's leaf to the root is incremented by 1; the result is a Huffman tree for M_{t+1}.]
Conclusions
• Huffman savings are between 20% and 90%.
• Dynamic Huffman coding is optimal and efficient.
• Optimal data compression achievable by a character code can always be achieved with a prefix code.
• Better compression is possible (depends on the data), using other approaches (e.g., a pattern dictionary).
Thank you for your attention!