
Huffman Coding

Motivation

To compress or not to compress, that is the question!

• reducing the space required to store files on disk or tape
• reducing the time needed to transmit large files

Image source: plus.maths.org/issue23/features/data/data.jpg


Example:
• A file with 100K characters

Character                 a    b    c    d    e    f
Frequency (in 1000s)      45   13   12   16   9    5
Fixed-length codeword     000  001  010  011  100  101

Space = (45*3 + 13*3 + 12*3 + 16*3 + 9*3 + 5*3) * 1000
      = 300K bits

Can we do better?? YES!!
• Use variable-length codes instead.
• Give frequent characters short codewords, and infrequent characters
  long codewords.

Character                 a    b    c    d    e     f
Frequency (in 1000s)      45   13   12   16   9     5
Fixed-length codeword     000  001  010  011  100   101
Variable-length codeword  0    101  100  111  1101  1100

Space = (45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000
      = 224K bits  (Savings = 25%)


PREFIX-FREE CODE:

• No codeword is also a prefix of some other codeword.

No ambiguity!!

Variable-length codewords: 0, 101, 100, 111, 1101, 1100
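Prefix-freeness is easy to test mechanically. Here is a minimal Python
sketch (an illustration, not part of the original slides): after sorting,
a codeword can only be a prefix of its immediate successor, so checking
adjacent pairs suffices. The first example uses the codewords above.

def is_prefix_free(codewords):
    # A codeword that is a prefix of another sorts immediately before
    # the block of its extensions, so adjacent pairs are enough.
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

assert is_prefix_free(["0", "101", "100", "111", "1101", "1100"])
assert not is_prefix_free(["0", "01"])   # "0" is a prefix of "01"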
Representation:

A Huffman code is represented as a binary tree:

• each edge represents either a
  • 0, "go to the left child", or
  • 1, "go to the right child"
• each leaf corresponds to a particular character

The tree for the example code:

                (100)
               0/    \1
             a:45   (55)
                   0/   \1
                 (25)    (30)
                0/  \1  0/   \1
              c:12 b:13 (14) d:16
                       0/  \1
                      f:5  e:9

• Cost of the tree:
  B(T) = Σ f(c) dT(c), summed over all characters c in the alphabet C,
  where f(c) is the frequency of c and dT(c) is the depth of c's leaf in T.
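For the example tree above, the cost works out to
B(T) = 45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4 = 224 (in thousands of bits),
matching the 224K bits computed earlier.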
Optimal Code

[Figure: two code trees for the same frequencies. Left: the tree of the
fixed-length code, with all six leaves a:45, b:13, c:12, d:16, e:9, f:5
at depth 3; it is not a full binary tree. Right: the optimal
variable-length tree built above, with root weight 100.]

An optimal code is always represented by:
• a full binary tree (every internal node has two children)
• one leaf for each letter of the alphabet
Constructing a Huffman code
• Build the tree T in a bottom-up manner.
• Begin with a set of |C| leaves.
• Perform |C| - 1 "merging" operations to build the tree upward.

Greedy choice?
• The two nodes of smallest weight are chosen and merged at each step
  (see the sketch after this list).
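As a concrete illustration, here is a minimal Python sketch (not part of
the original slides) in which the priority queue is a binary min-heap and
a tree node is either a symbol or a (left, right) pair:

import heapq

def build_huffman_tree(freq):
    # Heap entries are (weight, tiebreak, node); the unique tiebreak
    # keeps tuple comparison away from the nodes themselves.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Greedy choice: extract the two nodes of smallest weight ...
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # ... and merge them under a fresh internal node.
        heapq.heappush(heap, (w1 + w2, count, (left, right)))
        count += 1
    return heap[0][2]

freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
tree = build_huffman_tree(freq)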
The steps of Huffman's algorithm

[Figure: the forest after each merge, for the example frequencies.]
1. Start:                      f:5, e:9, c:12, b:13, d:16, a:45
2. Merge f:5, e:9 into 14:     c:12, b:13, 14, d:16, a:45
3. Merge c:12, b:13 into 25:   14, d:16, 25, a:45
4. Merge 14, d:16 into 30:     25, 30, a:45
5. Merge 25, 30 into 55:       a:45, 55
6. Merge a:45, 55 into 100:    the root; the tree is complete.
Running Time Analysis

Q is implemented as a binary min-heap.

The merge operation is executed exactly n - 1 times, and each heap
operation requires time O(log n), for a total of O(n log n).
Huffman Coding Example

E = 01
I = 00
C = 10
A = 111
H = 110

Input:  ACE
Output: (111)(10)(01) = 1111001
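In code (a sketch over the same (left, right)/symbol tree representation
as above; assign_codes is an illustrative helper, not the slides'
notation), the code table is produced by one tree walk and encoding is a
table lookup per letter:

def assign_codes(node, prefix=""):
    # Walk the tree: '0' on the edge to the left child, '1' to the right.
    if isinstance(node, str):            # a leaf holds one symbol
        return {node: prefix or "0"}     # "0" covers a 1-letter alphabet
    left, right = node
    table = assign_codes(left, prefix + "0")
    table.update(assign_codes(right, prefix + "1"))
    return table

codes = {'A': '111', 'C': '10', 'E': '01', 'H': '110', 'I': '00'}
assert "".join(codes[ch] for ch in "ACE") == "1111001"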


Decoding
1. Read the compressed file & the binary tree.
2. Use the binary tree to decode the file:
   follow the path from the root to a leaf for each codeword
   (a sketch follows below).

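A minimal decoder over the same nested-tuple representation (the
hard-coded tree below spells out the example's codewords and is an
illustration only; note it uses '0' = left, matching the construction
sketch rather than the slides' drawing):

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        # Follow the path: '0' goes left, '1' goes right.
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):        # reached a leaf: emit its symbol
            out.append(node)
            node = root                  # restart at the root
    return "".join(out)

# E = 01, I = 00, C = 10, A = 111, H = 110 (with '0' = left here):
root = (("I", "E"), ("C", ("H", "A")))
assert decode("1111001", root) == "ACE"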
Huffman Code Algorithm Overview

[Figure: the example code tree, drawn with edge label 1 toward one child
and 0 toward the other. Leaves A:3 and H:2 hang from an internal node of
weight 5; that node and C:5 hang from a node of weight 10; E:8 and I:7
hang from a node of weight 15; the root has weight 25. This yields
A = 111, H = 110, C = 10, E = 01, I = 00.]

Huffman Decoding, steps 1-7, for the input 1111001:
• bits 1, 1, 1: root → weight-10 node → weight-5 node → leaf A   (output: A)
• bits 1, 0:    root → weight-10 node → leaf C                   (output: AC)
• bits 0, 1:    root → weight-15 node → leaf E                   (output: ACE)
Induction on the number of codewords

• The Huffman algorithm finds an optimal code for n = 1.
• Suppose that the Huffman algorithm finds an optimal code for every
  alphabet of size n; now consider an alphabet of size n + 1 . . .

Correctness proof
Greedy Choice Proof

[Figure: three trees. In T, the two deepest sibling leaves are a and b,
while the minimum-frequency leaves x and y sit elsewhere. T' is T with
x and a exchanged; T'' is T' with y and b exchanged, so that x and y
become deepest siblings.]

Assume that f[a] ≤ f[b] and f[x] ≤ f[y].

Since f[x] and f[y] are the two lowest frequencies,
f[x] ≤ f[a] and f[y] ≤ f[b],
so the two exchanges cannot increase the cost of the tree.


• T – tree constructed by Huffman
• X – any code tree
• Show: C(T) ≤ C(X)

• T' and X' – trees from the greedy choice (x and y made deepest siblings)
  C(T') = C(T)
  C(X') ≤ C(X)

• T'' and X'' – trees with the minimum-cost leaves x and y removed;
  their common parent becomes a leaf of weight f[x] + f[y]

Finish the induction proof

C(X'') = C(X') - f[x] - f[y]
C(T'') = C(T') - f[x] - f[y]
C(T'') ≤ C(X'')   (induction hypothesis: T'' is optimal for n leaves)

C(T) = C(T')
     = C(T'') + f[x] + f[y]
     ≤ C(X'') + f[x] + f[y]
     = C(X')
     ≤ C(X)
What is our next challenge, and how do we tackle it?

Two passes over the data:
• one pass to collect frequency counts of the letters
• a second pass to encode and transmit the letters, based on the static
  tree structure

Problems:
• delay (network communication, file compression applications)
• extra disk accesses slow down the algorithm

We need one-pass methods, in which letters are encoded "on the fly".
Dynamic Huffman codes

Algorithm FGK
• The next letter of the message is encoded on the basis of a Huffman
  tree for the previous letters.
• Encoding length is at most 2S + t, where S is the encoding length
  under a static Huffman code and t is the number of letters in the
  original message.

Sender and receiver
• start with the same initial tree
• use the same algorithm to modify the tree after each letter is
  processed, and thus always have equivalent copies of it
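To make this synchronization concrete, here is a deliberately naive
one-pass coder (a sketch reusing build_huffman_tree and assign_codes from
the earlier sketches; it is not Algorithm FGK): it rebuilds the whole tree
after every letter, whereas FGK achieves the same effect by modifying the
tree in place.

def adaptive_encode(message, alphabet):
    freq = {s: 1 for s in alphabet}   # sender and receiver start identically
    bits = []
    for ch in message:
        codes = assign_codes(build_huffman_tree(freq))
        bits.append(codes[ch])
        freq[ch] += 1                 # the same update happens on both ends
    return "".join(bits)

print(adaptive_encode("abracadabra", "abcdr"))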
Sibling Property:

A binary tree with p leaves of nonnegative weight is a Huffman tree iff
• the p leaves have nonnegative weights w1, . . . , wp, and the weight of
  each internal node is the sum of the weights of its children; and
• the 2p - 1 nodes can be numbered in nondecreasing order by weight, so
  that nodes 2j - 1 and 2j are siblings, for 1 ≤ j ≤ p - 1, and their
  common parent node is higher in the numbering.
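The property can be checked directly. In the sketch below, the array
layout (index = node number, index 0 unused, root's parent = 0) is an
assumed representation for illustration, not the slides' notation:

def has_sibling_property(weights, parent):
    n = len(weights) - 1                       # number of nodes, 2p - 1
    # Weights must be nondecreasing in node number.
    if any(weights[j] > weights[j + 1] for j in range(1, n)):
        return False
    # Each internal node's weight must equal the sum of its children's.
    sums = [0] * (n + 1)
    for j in range(1, n):                      # every non-root node
        sums[parent[j]] += weights[j]
    internal = {parent[j] for j in range(1, n)}
    if any(sums[u] != weights[u] for u in internal):
        return False
    # Nodes 2j-1 and 2j must be siblings with a higher-numbered parent.
    for j in range(1, (n + 1) // 2):
        if parent[2*j - 1] != parent[2*j] or parent[2*j] <= 2*j:
            return False
    return True

# p = 2 leaves with weights 1 and 2; the root (node 3) has weight 3.
assert has_sibling_property([0, 1, 2, 3], [0, 3, 3, 0])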
[Figure: a Huffman tree for t = 32 processed letters, with leaves
a, b, c, d, e, f. Each node carries its weight and its number in the
sibling-property ordering: weights are nondecreasing in node number,
nodes 2j - 1 and 2j are siblings, and the root (weight 32) carries the
highest number, 11.]
Difficulty

Suppose that Mt = ai1 ai2 . . . ait has already been processed.

The next letter, ai(t+1), is encoded and decoded using the Huffman tree
for Mt.

How can this tree be modified quickly in order to get a Huffman tree
for Mt+1?

E.g., assume t = 32 and ai(t+1) = "b":

[Figure: the tree before and after naively adding 1 to the weights on
the path from b's leaf to the root (the root's weight grows from 32 to
33). An X marks a node where the sibling property now fails, so the
result is no longer a Huffman tree.]
First phase

• Begin with the leaf of ai(t+1) as the current node.
• Repeatedly interchange the contents of the current node, including the
  subtree rooted there, with those of the highest-numbered node of the
  same weight, and make the parent of the latter node the new current
  node (see the sketch after this list).
• Halt when the root is reached.
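The key primitive here is locating the highest-numbered node of a given
weight (the "leader" of that weight block). Below is a sketch using the
same 1-indexed weights array as in the previous snippet; the linear scan
is for clarity, and real implementations keep per-block pointers to find
the leader in O(1):

def block_leader(weights, v):
    # weights[j] is the weight of node j (index 0 unused). By the
    # sibling property, weights are nondecreasing in node number, so
    # nodes of equal weight form a contiguous block; scan to its end.
    j = v
    while j + 1 < len(weights) and weights[j + 1] == weights[v]:
        j += 1
    return j

# Nodes 1..5 have weights 1, 1, 2, 2, 4:
assert block_leader([0, 1, 1, 2, 2, 4], 1) == 2
assert block_leader([0, 1, 1, 2, 2, 4], 3) == 4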

E.g., assume t = 32 and ai(t+1) = "b":

[Figure: four snapshots of the tree during the first phase. Starting at
b's leaf, each current node is exchanged, subtree and all, with the
highest-numbered node of equal weight, and the walk then moves to that
node's parent, ending at the root of weight 32. Node numbers stay fixed
to positions; only node contents move.]
Second phase
• We turn this tree into the desired Huffman tree for Mt+1 by
  incrementing the weights of ai(t+1)'s leaf and of all its ancestors
  by 1.

[Figure: the tree before and after the increments; the root's weight
grows from 32 to 33, and this time the sibling property is preserved.]
Conclusions

• Huffman savings are between 20% and 90%.
• Dynamic Huffman coding is optimal and efficient.
• The optimal data compression achievable by a character code can
  always be achieved with a prefix code.
• Better compression is possible, depending on the data, using other
  approaches (e.g., a pattern dictionary).

Thank you for your attention!
