Huffman Coding

Huffman coding is an algorithm that creates a variable-length prefix code to encode messages. It builds a binary tree based on symbol frequencies, with more frequent symbols nearer the root. Each symbol is assigned a code consisting of the path from root to its leaf node. This results in shorter codes for more frequent symbols, allowing the entire message to be encoded using the fewest possible bits compared to any other prefix code.


Applications of Trees

Encoding messages

 Encode a message composed of a string of characters


 Codes used by computer systems
 ASCII
• uses 8 bits per character
• can encode 256 characters
 Unicode (in its original 16-bit form)
• 16 bits per character
• can encode 65,536 characters
• includes all characters encoded by ASCII
 ASCII and Unicode are fixed-length codes
 all characters represented by same number of bits
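For instance, a fixed-length (8-bit ASCII) encoder in Python just maps every character to the 8-bit binary form of its ASCII value; the short sketch below is an illustration, not part of the original slides:

```python
# Fixed-length encoding: every character becomes exactly 8 bits (its ASCII value).
text = "BAD"
bits = "".join(format(ord(ch), "08b") for ch in text)
print(bits, len(bits))   # 010000100100000101000100 24  (3 characters * 8 bits)
```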
Problems

 Suppose that we want to encode a message constructed from the symbols A, B, C, D, and E using a fixed-length code
 How many bits are required to encode each
symbol?
 at least 3 bits are required
 2 bits are not enough (can only encode four
symbols)
 How many bits are required to encode the
message DEAACAAAAABA?
 there are twelve symbols, each requires 3 bits
 12*3 = 36 bits are required
Drawbacks of fixed-length codes

 Wasted space
 Unicode uses twice as much space as ASCII
• inefficient for plain-text messages containing only ASCII characters
 Same number of bits used to represent all characters
 ‘a’ and ‘e’ occur more frequently than ‘q’ and ‘z’

 Potential solution: use variable-length codes
  variable number of bits to represent characters when frequency of occurrence is known
  short codes for characters that occur frequently
Advantages of variable-length codes

 The advantage of variable-length codes over fixed-length codes is that short codes can be given to characters that occur frequently
  on average, the length of the encoded message is less than with fixed-length encoding
 Potential problem: how do we know where one character ends and another begins?
• not a problem if the number of bits is fixed!

Symbol Code
A 00
B 01
C 10
D 11

0010110111001111111111 decodes unambiguously as ACDBDADDDDD
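With a fixed-length code the decoder needs no markers between characters: it simply cuts the bit string into equal-width chunks. A minimal sketch using the 2-bit table above:

```python
# Decoding a fixed-length code: split the bit string every `width` bits.
CODE = {"A": "00", "B": "01", "C": "10", "D": "11"}
DECODE = {bits: sym for sym, bits in CODE.items()}
width = 2

encoded = "0010110111001111111111"
decoded = "".join(DECODE[encoded[i:i + width]] for i in range(0, len(encoded), width))
print(decoded)
```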
Prefix property

 A code has the prefix property if no character code is the prefix (start of
the code) for another character
 Example:

Symbol Code
P 000
Q 11
R 01
S 001
T 10

01001101100010 decodes as RSTQPT
 000 is not a prefix of 11, 01, 001, or 10
 11 is not a prefix of 000, 01, 001, or 10 …
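Because of the prefix property, a single left-to-right pass can decode the message: accumulate bits until they match a codeword, output that symbol, and start over. A minimal sketch with the table above:

```python
# Greedy decoding of a prefix code: the first codeword match is always correct.
CODE = {"P": "000", "Q": "11", "R": "01", "S": "001", "T": "10"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(encoded: str) -> str:
    symbols, buffer = [], ""
    for bit in encoded:
        buffer += bit
        if buffer in DECODE:        # prefix property: no codeword is the start of another
            symbols.append(DECODE[buffer])
            buffer = ""
    return "".join(symbols)

print(decode("01001101100010"))     # RSTQPT
```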
Code without prefix property

 The following code does not have the prefix property

Symbol Code
P 0
Q 1
R 01
S 10
T 11

 The pattern 1110 can be decoded as QQQP, QTP, QQS, or TS
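The ambiguity can be shown by enumerating every way the bit string splits into codewords of this non-prefix code (a small backtracking sketch; names are illustrative):

```python
# Enumerate all decodings of a bit string under a code without the prefix property.
CODE = {"P": "0", "Q": "1", "R": "01", "S": "10", "T": "11"}

def decodings(bits: str) -> list[str]:
    if not bits:
        return [""]
    results = []
    for sym, code in CODE.items():
        if bits.startswith(code):
            results += [sym + rest for rest in decodings(bits[len(code):])]
    return results

print(sorted(decodings("1110")))    # ['QQQP', 'QQS', 'QTP', 'TQP', 'TS']
```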


Problem
 Design a variable-length prefix-free code such that the message
DEAACAAAAABA can be encoded using 22 bits
 Possible solution:
 A occurs eight times while B, C, D, and E each occur once
 represent A with a one-bit code, say 0
• remaining codes cannot start with 0
 represent B with the two-bit code 10
• remaining codes cannot start with 0 or 10
 represent C with 110
 represent D with 1110
 represent E with 11110
Encoded message

DEAACAAAAABA

Symbol Code
A 0
B 10
C 110
D 1110
E 11110

1110111100011000000100 22 bits
Another possible code

DEAACAAAAABA

Symbol Code
A 0
B 100
C 101
D 1101
E 1111

1101111100101000001000 22 bits
Better code

DEAACAAAAABA

Symbol Code
A 0
B 100
C 101
D 110
E 111

11011100101000001000 20 bits
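A quick sketch that re-encodes the message under each of the three tables above and confirms the bit counts (22, 22, and 20):

```python
# Compare the three prefix codes on the same message.
MESSAGE = "DEAACAAAAABA"
TABLES = {
    "first":  {"A": "0", "B": "10",  "C": "110", "D": "1110", "E": "11110"},
    "second": {"A": "0", "B": "100", "C": "101", "D": "1101", "E": "1111"},
    "better": {"A": "0", "B": "100", "C": "101", "D": "110",  "E": "111"},
}

for name, table in TABLES.items():
    bits = "".join(table[ch] for ch in MESSAGE)
    print(f"{name:6s} {bits} ({len(bits)} bits)")
```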
What code to use?

 Question: Is there a variable-length code that makes the most efficient use of space?

 Answer: Yes!
Huffman coding tree

 Binary tree
 each leaf contains symbol (character)
 label edge from node to left child with 0
 label edge from node to right child with 1
 Code for any symbol obtained by following path from root to the leaf
containing symbol
 Code has prefix property
 leaf node cannot appear on path to another leaf
 note: fixed-length codes are represented by a complete Huffman tree
and clearly have the prefix property
Building a Huffman tree

 Find the frequency of each symbol occurring in the message
 Begin with a forest of single-node trees
  each contains a symbol and its frequency
 Repeat
  select the two trees with the smallest frequencies at their roots
  produce a new binary tree with the selected trees as children and store the sum of their frequencies in the root
 Stop when only one tree remains
  this is the Huffman coding tree
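A minimal Python sketch of this procedure, using a min-heap as the forest (function and variable names are mine, not from the slides; code lengths will match the tree built in the next slides, though the exact 0/1 labels depend on how ties are broken):

```python
import heapq
from collections import Counter

def huffman_codes(message: str) -> dict[str, str]:
    """Build a Huffman tree for `message` and return a symbol -> bit-string table."""
    freq = Counter(message)
    # Forest of single-node trees: (frequency, tie-breaker, tree).
    # A leaf is just a symbol; an internal node is a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # two trees with the smallest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    _, _, root = heap[0]

    codes: dict[str, str] = {}
    def walk(node, path):                     # 0 = left edge, 1 = right edge
        if isinstance(node, str):
            codes[node] = path or "0"         # degenerate one-symbol message
            return
        walk(node[0], path + "0")
        walk(node[1], path + "1")
    walk(root, "")
    return codes

print(huffman_codes("This is his message"))
```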
Example

 Build the Huffman coding tree for the message "This is his message"
 Character frequencies

A G M T E H _ I S

1 1 1 1 2 2 3 3 5

 Begin with forest of single trees

1 1 1 1 2 2 3 3 5
A G M T E H _ I S
Steps 1-8 (tree-building figures)

 Repeatedly merge the two trees with the smallest frequencies:
• A (1) + G (1) → 2 and M (1) + T (1) → 2
• 2 + 2 → 4 and E (2) + H (2) → 4
• _ (3) + I (3) → 6
• 4 + 4 → 8 and S (5) + 6 → 11
• 11 + 8 → 19 (the complete Huffman coding tree)
Label edges

 Label each edge to a left child 0 and each edge to a right child 1
 The code for each symbol is the path from the root (19) to its leaf:

Symbol Code
S 01
E 110
H 111
_ 000
I 001
A 1000
G 1001
M 1010
T 1011
Huffman code & encoded message
This is his message

S 01
E 110
H 111
_ 000
I 001
A 1000
G 1001
M 1010
T 1011

10111110010100000101000111001010001010110010110001001110
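A short check of the table above: the assertion verifies that no codeword is a prefix of another, and encoding the 19-character message (with '_' standing for the space) comes out to 56 bits:

```python
# Verify the prefix property and the encoded length for the table above.
CODE = {"S": "01", "E": "110", "H": "111", "_": "000", "I": "001",
        "A": "1000", "G": "1001", "M": "1010", "T": "1011"}

words = list(CODE.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)

message = "THIS_IS_HIS_MESSAGE"
encoded = "".join(CODE[ch] for ch in message)
print(len(encoded), encoded)   # 56 bits
```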
Huffman Coding

• an algorithm that takes as input the frequencies (which are the probabilities of occurrence) of symbols in a string and produces as output a prefix code that encodes the string using the fewest possible bits, among all possible binary prefix codes for these symbols.
Huffman Coding
• This algorithm, known as Huffman coding, was developed by David
Huffman in a term paper he wrote in 1951 while a graduate student
at MIT.
• (Note that this algorithm assumes that we already know how many
times each symbol occurs in the string, so we can compute the
frequency of each symbol by dividing the number of times this symbol
occurs by the length of the string.)
Huffman Coding
• Huffman coding is a fundamental algorithm in data compression, the
subject devoted to reducing the number of bits required to represent
information.
• Huffman coding is extensively used to compress bit strings
representing text and it also plays an important role in compressing
audio and image files.
Example2:

 Use Huffman coding to encode the following symbols with the frequencies listed: A: 0.08, B: 0.10, C: 0.12, D: 0.15, E: 0.20, F: 0.35. What is the average number of bits used to encode a character?
 Solution: STEP1

0.08 0.10 0.12 0.15 0.20 0.35

A B C D E F
Example2:
 Solution: STEPS 2-6 (tree-building figures)
 repeatedly merge the two trees with the smallest frequencies:
• B (0.10) + A (0.08) → 0.18
• D (0.15) + C (0.12) → 0.27
• 0.18 + E (0.20) → 0.38
• 0.27 + F (0.35) → 0.62
• 0.38 + 0.62 → 1.00 (the root of the Huffman tree)
Example2:
 Solution: STEPS 7-8 (edge-labelling figures)
 label each left edge 0 and each right edge 1; a symbol's code is the path from the root to its leaf
Example2:
 Solution: resulting Huffman code

Symbol Code
A 111
B 110
C 011
D 010
E 10
F 00
Example2:

 Use Huffman coding to encode the following symbols with the frequencies listed: A: 0.08, B: 0.10, C: 0.12, D: 0.15, E: 0.20, F: 0.35. What is the average number of bits used to encode a character?
 Solution:
Symbol Code Contribution to average
A 111 3 × 0.08 = 0.24
B 110 3 × 0.10 = 0.30
C 011 3 × 0.12 = 0.36
D 010 3 × 0.15 = 0.45
E 10 2 × 0.20 = 0.40
F 00 2 × 0.35 = 0.70
Average number of bits used to encode a character: 0.24 + 0.30 + 0.36 + 0.45 + 0.40 + 0.70 = 2.45
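The same weighted sum, computed in a couple of lines (values transcribed from the table above):

```python
# Average code length = sum over symbols of (codeword length * frequency).
lengths = {"A": 3, "B": 3, "C": 3, "D": 3, "E": 2, "F": 2}
freqs   = {"A": 0.08, "B": 0.10, "C": 0.12, "D": 0.15, "E": 0.20, "F": 0.35}

average = sum(lengths[s] * freqs[s] for s in freqs)
print(round(average, 2))   # 2.45 bits per character, versus 3 for a fixed-length code
```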
Try this!
 Construct a Huffman code for the letters of the English alphabet where
the frequencies of letters in typical English text are as shown in this
table.
Thank you
for learning discrete math
with me!
