11 Huffman Coding
11 Huffman Coding
compression technique)
Encoding messages
Encode a message composed of a string of
characters
Codes used by computer systems
ASCII
• uses 8 bits per character
• can encode 256 characters
Unicode
• 16 bits per character
• can encode 65536 characters
• includes all characters encoded by ASCII
ASCII and Unicode are fixed-length codes
all characters represented by same number of bits
Problems
Suppose that we want to encode a message
constructed from the symbols A, B, C, D, and E
using a fixed-length code
How many bits are required to encode each
symbol?
at least 3 bits are required
2 bits are not enough (can only encode four
symbols)
How many bits are required to encode the
message DEAACAAAAABA?
there are twelve symbols, each requires 3 bits
12*3 = 36 bits are required
Drawbacks of fixed-length codes
Wasted space
Unicode uses twice as much space as ASCII
• inefficient for plain-text messages containing
only ASCII characters
Same number of bits used to represent all characters
‘a’ and ‘e’ occur more frequently than ‘q’ and ‘z’
A = 00 0010110111001111111111
B = 01
C = 10 ACDBADDDDD
D = 11
Prefix property
A code has the prefix property if no character code
is the prefix (start of the code) for another character
Example:
Symbol Code
P 000
01001101100010
Q 11
R 01 RSTQPT
S 001
T 10
000 is not a prefix of 11, 01, 001, or 10
11 is not a prefix of 000, 01, 001, or 10 …
Code without prefix property
The following code does not have prefix property
Symbol Code
P 0
Q 1
R 01
S 10
T 11
The pattern 1110 can be decoded as QQQP, QTP,
QQS, or TS
Problem
Design a variable-length prefix-free code such that
the message DEAACAAAAABA can be encoded
using 22 bits
Possible solution:
A occurs eight times while B, C, D, and E each
occur once
represent A with a one bit code, say 0
• remaining codes cannot start with 0
represent B with the two bit code 10
• remaining codes cannot start with 0 or 10
represent C with 110
represent D with 1110
represent E with 11110
Encoded message
DEAACAAAAABA
Symbol Code
A 0
B 10
C 110
D 1110
E 11110
1110111100011000000100 22 bits
Another possible code
DEAACAAAAABA
Symbol Code
A 0
B 100
C 101
D 1101
E 1111
1101111100101000001000 22 bits
Better code
DEAACAAAAABA
Symbol Code
A 0
B 100
C 101
D 110
E 111
11011100101000001000 20 bits
What code to use?
Question: Is there a variable-length code that makes
the most efficient use of space?
Answer: Yes!
Huffman coding tree
Binary tree
each leaf contains symbol (character)
label edge from node to left child with 0
label edge from node to right child with 1
Code for any symbol obtained by following path from
root to the leaf containing symbol
Code has prefix property
leaf node cannot appear on path to another leaf
note: fixed-length codes are represented by a
complete Huffman tree and clearly have the prefix
property
Building a Huffman tree
Find frequencies of each symbol occurring in
message
Begin with a forest of single node trees
each contain symbol and its frequency
Do recursively
select two trees with smallest frequency at the root
produce a new binary tree with the selected trees
as children and store the sum of their frequencies
in the root
Recursion ends when there is one tree
this is the Huffman coding tree
Example
Build the Huffman coding tree for the message
This is his message
Character frequencies
A G M T E H _ I S
1 1 1 1 2 2 3 3 5
1 1 1 1 2 2 3 3 5
A G M T E H _ I S
Step 1
1 1 1 1 2 2 3 3 5
A G M T E H _ I S
Step 2
2 2
1 1 1 1 2 2 3 3 5
A G M T E H _ I S
Step 3
2 2 4
1 1 1 1 2 2 3 3 5
A G M T E H _ I S
Step 4
2 2 4
1 1 1 1 2 2 3 3 5
A G M T E H _ I S
Step 5
2 2 4 6
1 1 1 1 2 2 3 3 5
A G M T E H _ I S
Step 6
4 4
2 2 2 2 6
E H
1 1 1 1 3 3 5
A G M T _ I S
Step 7
8 11
4 4 5 6
S
2 2 2 2 3 3
E H _ I
1 1 1 1
A G M T
Step 8
19
8 11
4 4 5 6
S
2 2 2 2 3 3
E H _ I
1 1 1 1
A G M T
Label edges
19
0 1
8 11
0 1
0 1
4 4 5 6
0 1 0 1 S 0 1
2 2 2 2 3 3
0 1 0 1 E H _ I
1 1 1 1
A G M T
Huffman code & encoded message
This is his message
S 10
E 010
H 011
_ 110
I 111
A 0000
G 0001
M 0010
T 0011
00110111011110010111100011101111000010010111100000001010