GROUP ID: 18
Date: 9th November, 2017
UNIVERSITY OF SINDH, JAMSHORO
Title: HUFFMAN CODING ALGORITHM
Presented To:
Miss Syeda Hira Fatima Naqvi.
Presented By:
Sadaf Rasheed (2K15-CSE-72)
Department of Computer Science
INTRODUCTION
Huffman codes are an effective technique of lossless data compression, which means no
information is lost.
Huffman coding achieves compression by reducing the amount of redundancy in the
coding of symbols.
Huffman coding is a method for compressing standard text documents.
It makes use of a binary tree to develop codes of varying lengths for the letters used in the
original message.
The algorithm was introduced by David Huffman in 1952 as part of a course assignment at
MIT.
HUFFMAN CODING SCHEME 1
CONTD:
Huffman codes can be used to compress information:
Like ZIP (whose DEFLATE method combines LZ77 with Huffman coding)
JPEGs use Huffman coding as part of their compression process
The basic idea is that instead of storing each character in a file as an 8-bit
ASCII value, we store the more frequently occurring characters using fewer bits
and the less frequently occurring characters using more bits
On average this should decrease the file size
EXAMPLE
Consider a file of 100,000 characters from a–f, with these frequencies:
o a = 45,000
o b = 13,000
o c = 12,000
o d = 16,000
o e = 9,000
o f = 5,000
CONTD
(FIXED-LENGTH CODE)
Typically each character in a file is stored as a single byte (8 bits)
If we know we only have six characters, we can use a 3-bit code for the characters instead:
a = 000, b = 001, c = 010, d = 011, e = 100, f = 101
This is called a fixed-length code (if every word in the code has the same length, the code is called a
fixed-length code, or a block code)
With this scheme, we can encode the whole file with 300,000 bits
(45,000*3 + 13,000*3 + 12,000*3 + 16,000*3 + 9,000*3 + 5,000*3)
We can do better
Better compression
More flexibility
Variable-length codes (codes in which code words may have different lengths) can perform
significantly better
Frequent characters are given short code words, while infrequent characters get longer code words
o Consider this scheme:
a = 0; b = 101; c = 100; d = 111; e = 1101; f = 1100
How many bits are now required to encode our file?
45,000*1 + 13,000*3 + 12,000*3 + 16,000*3 + 9,000*4 + 5,000*4 = 224,000 bits
This is in fact an optimal character code for this file
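These two totals can be checked with a short Python sketch, using the frequencies and code words taken from the example above:

```python
# Frequencies from the example file of 100,000 characters (a-f).
freq = {"a": 45_000, "b": 13_000, "c": 12_000, "d": 16_000, "e": 9_000, "f": 5_000}

# Fixed-length scheme: every character costs 3 bits.
fixed_bits = sum(f * 3 for f in freq.values())

# Variable-length scheme from the example above.
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
var_bits = sum(f * len(code[ch]) for ch, f in freq.items())

print(fixed_bits, var_bits)  # 300000 224000
```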
PROBLEMS:
Suppose that we want to encode a message constructed from the symbols A, B, C,
D, and E using a fixed-length code
How many bits are required to encode each symbol?
at least 3 bits are required
2 bits are not enough (can only encode four symbols)
How many bits are required to encode the message DEAACAAAAABA?
there are twelve symbols, each requires 3 bits
12*3 = 36 bits are required
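The arithmetic above can be sketched in Python: 2 bits distinguish only 2^2 = 4 symbols, so 3 bits are the minimum for five.

```python
import math

symbols = ["A", "B", "C", "D", "E"]
# Smallest fixed width that can distinguish all five symbols.
bits_per_symbol = math.ceil(math.log2(len(symbols)))

message = "DEAACAAAAABA"
total_bits = len(message) * bits_per_symbol

print(bits_per_symbol, total_bits)  # 3 36
```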
DRAWBACKS OF FIXED-LENGTH CODES:
Wasted space
Unicode (e.g. UTF-16) uses twice as much space as ASCII
inefficient for plain-text messages containing only ASCII characters
Same number of bits used to represent all characters
a and e occur more frequently than q and z
Potential solution: use variable-length codes
variable number of bits to represent characters when frequency of occurrence is known
short codes for characters that occur frequently
ADVANTAGES OF VARIABLE-LENGTH CODES:
The advantage of variable-length codes over fixed-length codes is that short codes can be given
to characters that occur frequently:
on average, the length of the encoded message is less than with fixed-length encoding
Potential problem: how do we know where one character ends and another begins?
(This is not a problem when the number of bits per character is fixed.)
PREFIX PROPERTY:
Prefix codes
Huffman codes are constructed in such a way that they can be unambiguously
translated back to the original data, yet still form an optimal character code
Huffman codes are prefix codes: a code has the prefix property if no character's
code word is a prefix (the start) of another character's code word
EXAMPLE (PREFIX):
000 is not a prefix of 11, 01, 001, or 10
11 is not a prefix of 000, 01, 001, or 10
CODE WITHOUT PREFIX PROPERTY:
The following code does not have the prefix property (the code words are
reconstructed here from the decodings below): P = 0, Q = 1, S = 10, T = 11.
Q's code word, 1, is a prefix of both S's and T's.
The pattern 1110 can be decoded as QQQP, QTP, QQS, or TS
CONTD:
A prefix code is a type of code system (typically a variable-length code) distinguished
by its possession of the "prefix property", which requires that there is no code word in
the system that is a prefix (initial segment) of any other code word in the system.
A prefix code is a uniquely decodable code: a receiver can identify each word
without requiring a special marker between words.
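The prefix property can be tested mechanically. A minimal sketch (the function name is my own):

```python
def has_prefix_property(codewords):
    """True if no code word is a prefix of another."""
    words = sorted(codewords)
    # In sorted order, any code word that is a prefix of another
    # sorts immediately before some word it prefixes.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

print(has_prefix_property(["000", "11", "01", "001", "10"]))  # True
print(has_prefix_property(["0", "1", "10", "11"]))            # False: 1 prefixes 10 and 11
```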
CONTD:
Suppose we have two binary code words a and b, where a is k bits long,
b is n bits long, and k < n. If the first k bits of b are identical to a, then a is
called a prefix of b. The last n − k bits of b are called the dangling suffix.
For example, if
a = 010 and b = 01011,
then a is a prefix of b and the dangling suffix is 11.
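As a sketch of these definitions (the helper name is my own):

```python
def dangling_suffix(a, b):
    """If code word a is a proper prefix of b, return the dangling suffix."""
    if len(a) < len(b) and b.startswith(a):
        return b[len(a):]
    return None  # a is not a prefix of b

print(dangling_suffix("010", "01011"))  # 11
```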
PURPOSE OF HUFFMAN CODING:
Proposed by Dr. David A. Huffman in 1952 in the paper
"A Method for the Construction of Minimum-Redundancy Codes"
Applicable to many forms of data transmission
Our example: text files
THE BASIC ALGORITHM:
Huffman coding is a form of statistical coding
Not all characters occur with the same frequency!
Yet all characters are allocated the same amount of space
1 char = 1 byte, be it e or x
THE BASIC ALGORITHM:
Any savings in tailoring codes to frequency of character?
Code word lengths are no longer fixed like ASCII.
Code word lengths vary and will be shorter for the more
frequently used characters.
THE (REAL) BASIC ALGORITHM:
1. Scan text to be compressed and tally occurrence of all characters.
2. Sort or prioritize characters based on number of occurrences in text.
3. Build Huffman code tree based on prioritized list.
4. Perform a traversal of tree to determine all code words.
5. Scan text again and create new file using the Huffman codes.
ALGORITHM:
HUFFMAN(C)
n <- |C|
Q <- C
for i <- 1 to n - 1
    do allocate a new node z
       left[z] <- x <- EXTRACT-MIN(Q)
       right[z] <- y <- EXTRACT-MIN(Q)
       f[z] <- f[x] + f[y]
       INSERT(Q, z)
return EXTRACT-MIN(Q)
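The pseudocode above can be sketched in Python, using the standard library's heapq module as the priority queue Q (function and variable names are my own):

```python
import heapq
import itertools

def huffman(freq):
    """Build a Huffman tree from {symbol: frequency} and derive the codes.

    Mirrors the pseudocode: n-1 rounds of two EXTRACT-MINs and one INSERT.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares trees
    # Heap entries: (frequency, tie, tree); a leaf is just the symbol itself.
    q = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(q)
    for _ in range(len(freq) - 1):
        fx, _, x = heapq.heappop(q)                      # left[z]  <- EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(q)                      # right[z] <- EXTRACT-MIN(Q)
        heapq.heappush(q, (fx + fy, next(tie), (x, y)))  # f[z] <- f[x] + f[y]
    root = q[0][2]                                       # return EXTRACT-MIN(Q)

    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):     # internal node: recurse on children
            walk(node[0], path + "0")   # going left is a 0
            walk(node[1], path + "1")   # going right is a 1
        else:                           # leaf: the code word is complete
            codes[node] = path or "0"   # lone-symbol alphabet edge case
    walk(root, "")
    return root, codes
```

With the a–f frequencies from the earlier example, the resulting code costs sum(freq * length) = 224,000 bits, matching the optimal code shown before.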
ANALYSIS :
Time Complexity
The time complexity of the Huffman algorithm is O(n log n): there are O(n)
iterations, and each requires O(log n) time to extract the two minimum-weight
nodes from the priority queue.
BUILDING A TREE
SCAN THE ORIGINAL TEXT
Consider the following short text:
Eerie eyes seen near lake.
Count up the occurrences of all characters in the text
BUILDING A TREE
SCAN THE ORIGINAL TEXT
Eerie eyes seen near lake.
Q. What characters are present?
E e r i space y s n a l k .
BUILDING A TREE
SCAN THE ORIGINAL TEXT
Eerie eyes seen near lake.
What is the frequency of each character in the text?
BUILDING A TREE
PRIORITIZE CHARACTERS
Create binary tree nodes with character and frequency
of each character
Place nodes in a priority queue
The lower the occurrence, the higher the priority in the queue
BUILDING A TREE
The queue after inserting all nodes
E i y l k . r s n a sp e
1 1 1 1 1 1 2 2 2 2 4 8
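The frequencies behind this queue can be reproduced by counting the example text, e.g. with Python's collections.Counter:

```python
from collections import Counter

text = "Eerie eyes seen near lake."
freq = Counter(text)

# Rarest characters first: these get the highest priority in the queue.
for ch, f in sorted(freq.items(), key=lambda kv: kv[1]):
    print(repr(ch), f)
```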
BUILDING A TREE
While priority queue contains two or more nodes
Create new node
Dequeue node and make it left subtree
Dequeue next node and make it right subtree
Frequency of new node equals sum of frequency of left and right
children
Enqueue new node back into queue
BUILDING A TREE
[Slides 26–39 step through the merging process in diagrams: at each step the two
lowest-frequency nodes are dequeued, joined under a new parent whose frequency is
their sum, and the parent is enqueued. The diagrams are not reproduced in this
text version.]
BUILDING A TREE
Q. What is happening to the characters with a low
number of occurrences?
(They are merged early, so they sink deeper into the tree and end up with
longer code words.)
[Slides 41–47: the remaining merging steps; diagrams not reproduced.]
BUILDING A TREE
After enqueueing this node there is only one node left in the priority queue.
BUILDING A TREE
Dequeue the single node left in the queue.
This tree contains the new code words for each character.
The frequency of the root node should equal the number of characters in the
text: "Eerie eyes seen near lake." has 26 characters.
ENCODING THE FILE
TRAVERSE TREE FOR CODES
Perform a traversal of the tree to obtain the new code words.
Going left is a 0; going right is a 1.
A code word is only completed when a leaf node is reached.
[Diagram: the final Huffman tree, root frequency 26.]
ENCODING THE FILE
TRAVERSE TREE FOR CODES
Char   Code
E      0000
i      0001
y      0010
l      0011
k      0100
.      0101
space  011
e      10
r      1100
s      1101
n      1110
a      1111
ENCODING THE FILE
Rescan the text and encode the file using the new code words:
Eerie eyes seen near lake.
0000 10 1100 0001 10 011 10 0010 10 1101 011 1101 10 10 1110 011 1110 10 1111 1100 011 0011 1111 0100 10 0101
(spaces are shown between code words only for readability)
Q. Why is there no need for a separator character?
ENCODING THE FILE
RESULTS
Have we made things any better?
84 bits to encode the text.
ASCII would take 8 * 26 = 208 bits.
If a modified fixed-length code used 4 bits per character,
4 * 26 = 104 bits would be needed; the savings are not as great.
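As a sketch, encoding the text with the code table from the tree confirms the bit count, and greedy decoding shows why no separator is needed: the prefix property means a code word is recognized the moment its last bit arrives.

```python
# Code table read off the final Huffman tree.
code = {
    "E": "0000", "i": "0001", "y": "0010", "l": "0011", "k": "0100",
    ".": "0101", " ": "011",  "e": "10",   "r": "1100", "s": "1101",
    "n": "1110", "a": "1111",
}

text = "Eerie eyes seen near lake."
encoded = "".join(code[ch] for ch in text)
print(len(encoded))  # 84 bits, versus 8 * 26 = 208 bits of ASCII

# Greedy decoding: keep reading bits until the buffer is a code word.
decode = {v: k for k, v in code.items()}
out, buf = [], ""
for bit in encoded:
    buf += bit
    if buf in decode:
        out.append(decode[buf])
        buf = ""
print("".join(out) == text)  # True
```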
APPLICATIONS OF HUFFMAN CODING:
Used in various file formats, such as:
ZIP (the DEFLATE method combines LZ77 with Huffman coding)
JPEG
MPEG (including MP3 audio, which uses Huffman tables)
Also used in steganography, for compressing JPEG carrier files.
CONCLUSION:
Like many other useful algorithms, the Huffman algorithm is needed to compress
data so that it can be transmitted properly over the internet and other
transmission channels.
The Huffman algorithm works on binary trees.