Huffman Codes
Huffman Codes
Huffman Codes
Huffman Codes
• Huffman codes compress data very effectively: savings of 20% to 90%
are typical, depending on the characteristics of the data being
compressed.
• We consider the data to be a sequence of characters.
• Huffman’s greedy algorithm uses a table giving how often each
character occurs (i.e., its frequency) to build up an optimal way of
representing each character as a binary string.
Huffman Codes
• Suppose we have a 100,000 character data file that we wish to store
compactly.
• We observe that the characters in the file occur with the frequencies
given by Figure 16.3.
• That is, only 6 different characters appear, and the character a occurs
45,000 times.
Fixed-length code
• in which each character is represented by a unique binary string,
which we call a codeword.
• If we use a fixed-length code, we need 3 bits to represent 6
characters:
• a = 000, b = 001, . . . , f = 101.
• This method requires 300,000 bits to code the entire file.
• Can we do better?
Variable-length code
• A variable-length code can do considerably better than a fixed-length
code, by giving frequent characters short codewords and infrequent
characters long codewords.
• Figure 16.3 shows such a code;
• here the 1-bit string 0 represents a, and the 4-bit string 1100 represents f.
This code requires
For example, a code with code words {9, 55} has the prefix
property; a code consisting of {9, 5, 59, 55} does not, because "5"
is a prefix of "59" and also of "55".
Prefix codes
• Using prefix codes, a message can be transmitted as a sequence of
concatenated code words, without any out-of-band markers or (alternatively)
special markers between words to frame the words in the message.
• The recipient can decode the message unambiguously, by repeatedly finding
and removing sequences that form valid code words.
• This is not generally possible with codes that lack the prefix property, for
example {0, 1, 10, 11}: a receiver reading a "1" at the start of a code word
would not know whether that was the complete code word "1", or merely
the prefix of the code word "10" or "11"; so the string "10" could be
interpreted either as a single codeword or as the concatenation of the words
"1" then "0".
How to compute prefix codes?
Constructing a Tree
Algorithm
Analysis