Data Compression Lecture01

The document discusses data compression concepts including lossless and lossy compression. Lossless compression aims to compress data without any loss, while lossy compression allows for some loss of data in exchange for higher compression ratios. Information theory provides an understanding of the limits of data compression and how to compress data efficiently based on the probabilities of symbols.

Data Compression


Introduction to Data Compression


Entropy
Variable Length Codes

Computer Vision & Biometrics Lab, Indian Institute of Information Technology, Allahabad

Basic Data Compression Concepts


original x → Encoder → compressed y → Decoder → decompressed x̂

• Lossless compression: x = x̂
– Also called entropy coding, reversible coding.
• Lossy compression: x ≈ x̂
– Also called irreversible coding.
• Compression ratio = |x| / |y|
– |x| is the number of bits in x.


Why Compress
• Conserve storage space
• Reduce time for transmission
– Faster to encode, send, then decode than to send
the original
• Progressive transmission
– Some compression techniques allow us to send
the most important bits first so we can get a low
resolution version of some data before getting the
high fidelity version
• Reduce computation
– Use less data to achieve an approximate answer


Braille
• System to read text by feeling raised dots on
paper (or on electronic displays). Invented in the
1820s by Louis Braille, a blind Frenchman.

[Braille cells: the letters a, b, c, z; single-cell contractions for common words (and, the, with, mother) and letter groups (th, ch, gh).]


Braille Example
Clear text:
Call me Ishmael. Some years ago -- never mind how
long precisely -- having little or no money in my purse,
and nothing particular to interest me on shore, I thought
I would sail about a little and see the watery part of the
world. (238 characters)
Grade 2 Braille in ASCII:
,call me ,i%mael4 ,``s ye>s ago -- n``e m9d h[ l;g
precisely -- hav+ ll or no m``oy 9 my purse1 & no?+
``picul> 6 9t]e/ me on %ore1 ,i ?``| ,i wd sail
ab a ll & see ! wat]y ``p ( ! _w4 (203 characters)

Compression ratio = 238/203 = 1.17


Lossless Compression
• Data is not lost - the original is really needed.
– text compression
– compression of computer binary files
• Compression ratio typically no better than 4:1 for
lossless compression on many kinds of files.
• Statistical Techniques
– Huffman coding
– Arithmetic coding
– Golomb coding
• Dictionary techniques
– LZW, LZ77
– Sequitur
– Burrows-Wheeler Method
• Standards - Morse code, Braille, Unix compress, gzip,
zip, bzip, GIF, JBIG, Lossless JPEG

Lossy Compression
• Data is lost, but not too much.
– audio
– video
– still images, medical images, photographs
• Compression ratios of 10:1 often yield quite
high fidelity results.
• Major techniques include
– Vector Quantization
– Wavelets
– Block transforms
– Standards - JPEG, JPEG2000, MPEG 2, H.264


Why is Data Compression Possible


• Most data from nature has redundancy
– There is more data than the actual information
contained in the data.
– Squeezing out the excess data amounts to
compression.
– However, unsqueezing is necessary to be able to
figure out what the data means.
• Information theory is needed to understand
the limits of compression and give clues on
how to compress well.


What is Information
• Analog data
– Also called continuous data
– Represented by real numbers (or complex
numbers)
• Digital data
– Finite set of symbols {a1, a2, ... , am}
– All data represented as sequences (strings) in the
symbol set.
– Example: {a,b,c,d,r} abracadabra
– Digital data can be an approximation to analog
data


Symbols
• Roman alphabet plus punctuation
• ASCII - 128 symbols (256 in extended ASCII)
• Binary - {0,1}
– 0 and 1 are called bits
– All digital information can be represented
efficiently in binary
– {a,b,c,d} fixed length representation
symbol a b c d
binary 00 01 10 11

– 2 bits per symbol


Exercise - How Many Bits Per Symbol?

• Suppose we have n symbols. How many bits
(as a function of n) are needed to represent
a symbol in binary?
– First try n a power of 2.


Discussion: Non-Powers of Two


• Can we do better than a fixed length
representation for non-powers of two?


Information Theory
• Developed by Shannon in the 1940’s and 50’s
• Attempts to explain the limits of communication
using probability theory.
• Example: Suppose English text is being sent
– It is much more likely to receive an “e” than a “z”.
– In some sense “z” has more information than “e”.


First-order Information
• Suppose we are given symbols {a1, a2, ... , am}.
• P(ai) = probability of symbol ai occurring in the
absence of any other information.
P(a1) + P(a2) + ... + P(am) = 1
• inf(ai) = log2(1/P(ai)) is the information of ai, in bits.

[Plot: y = -log2(x) for x in (0,1]; the information content falls as the probability rises, so rare symbols carry more information.]


Example
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
– inf(a) = log2(8) = 3
– inf(b) = log2(4) = 2
– inf(c) = log2(8/5) = .678
• Receiving an “a” has more information than
receiving a “b” or “c”.


First Order Entropy


• The first order entropy is defined for a probability
distribution over symbols {a1, a2, ... , am}.
H = Σi=1..m P(ai) log2(1/P(ai))
• H is the average number of bits required to code up a
symbol, given all we know is the probability distribution
of the symbols.
• H is the Shannon lower bound on the average number of
bits to code a symbol in this “source model”.
• Stronger models of entropy include context.
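
A minimal Python sketch of these two definitions (the function names are mine, not from the lecture), using the {a, b, c} distribution from the following slides:

    from math import log2

    def inf(p):
        # information content, in bits, of a symbol with probability p
        return log2(1 / p)

    def entropy(probs):
        # first-order entropy H of a probability distribution
        return sum(p * inf(p) for p in probs if p > 0)

    print(inf(1/8), inf(1/4), inf(5/8))   # 3.0, 2.0, ~0.678
    print(entropy([1/8, 1/4, 5/8]))       # ~1.3 bits/symbol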


Entropy Examples
• {a, b, c} with a 1/8, b 1/4, c 5/8.
– H = 1/8 x 3 + 1/4 x 2 + 5/8 x .678 = 1.3 bits/symbol

• {a, b, c} with a 1/3, b 1/3, c 1/3. (worst case)
– H = 3 x (1/3) x log2(3) = 1.6 bits/symbol

• Note that a standard code takes 2 bits per symbol:

symbol       a   b   c
binary code  00  01  10


An Extreme Case
• {a, b, c} with a 1, b 0, c 0
– H=?


Entropy Curve
• Suppose we have two symbols with probabilities
x and 1-x, respectively.
[Plot: entropy -(x log x + (1-x) log(1-x)) versus the probability x of the first symbol; the curve rises from 0 at x = 0 to its maximum of 1 bit at x = 0.5 and falls back to 0 at x = 1.]


A Simple Prefix Code


• {a, b, c} with a 1/8, b 1/4, c 5/8.
• A prefix code is defined by a binary tree
• Prefix code property
– no output is a prefix of another
input  output
a      00
b      01
c      1

(binary tree: the root's 1-branch is the leaf c; its 0-branch leads to an internal node whose 0-child is a and whose 1-child is b)

ccabccbccc
1 1 00 01 1 1 01 1 1 1


Binary Tree Terminology


[Diagram: a binary tree, with the root at the top, internal nodes below it, and leaves at the bottom.]

1. Each node, except the root, has a unique parent.


2. Each internal node has exactly two children.


Decoding a Prefix Code


repeat
    start at root of tree
    repeat
        if read bit = 1 then go right
        else go left
    until node is a leaf
    report leaf
until end of the code

(tree: a = 00, b = 01, c = 1)

11000111100


Decoding 11000111100 step by step:

1 1 00 01 1 1 1 00
c c a  b  c c c a

Output: ccabccca
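
A small Python sketch of this decoding loop, representing the tree as nested pairs (my own representation, not from the slides):

    def decode_prefix(bits, tree):
        # tree: nested pairs (left, right); leaves are symbols
        out, node = [], tree
        for bit in bits:
            node = node[1] if bit == "1" else node[0]   # 1 = go right, 0 = go left
            if not isinstance(node, tuple):             # reached a leaf
                out.append(node)                        # report leaf
                node = tree                             # restart at the root
        return "".join(out)

    tree = (("a", "b"), "c")                    # a = 00, b = 01, c = 1
    print(decode_prefix("11000111100", tree))   # -> ccabccca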

Exercise Encode/Decode

(tree: a = 0, b = 100, c = 101, d = 11)

• Player 1: Encode a symbol string


• Player 2: Decode the string
• Check for equality


How Good is the Code

(tree: a = 00 with probability 1/8, b = 01 with probability 1/4, c = 1 with probability 5/8)

bit rate = (1/8) x 2 + (1/4) x 2 + (5/8) x 1 = 11/8 = 1.375 bps


Entropy = 1.3 bps
Standard code = 2 bps

(bps = bits per symbol)


Design a Prefix Code 1


• abracadabra
• Design a prefix code for the 5 symbols
{a,b,r,c,d} which compresses this string the
most.


Design a Prefix Code 2


• Suppose we have n symbols each with
probability 1/n. Design a prefix code with
minimum average bit rate.
• Consider n = 2,3,4,5,6 first.


Huffman Coding
• Huffman (1951)
• Uses frequencies of symbols in a string to build a
variable rate prefix code.
– Each symbol is mapped to a binary string.
– More frequent symbols have shorter codes.
– No code is a prefix of another.
• Example:

  a  0
  b  100
  c  101
  d  11

  (tree: the root's 0-branch is a; the 1-branch leads to a node whose 1-child is d and whose 0-child splits into b (0) and c (1))


Variable Rate Code Example


• Example: a 0, b 100, c 101, d 11
• Coding:
– aabddcaa = 16 bits with a fixed 2-bit code
– 0 0 100 11 11 101 0 0 = 14 bits
• Prefix code ensures unique decodability.
– 00100111110100

– aabddcaa


Cost of a Huffman Tree


• Let p1, p2, ... , pm be the probabilities for the
symbols a1, a2, ... ,am, respectively.
• Define the cost of the Huffman tree T to be
C(T) = Σi=1..m pi ri
where ri is the length of the path from the root
to ai.
• C(T) is the expected length of the code of a
symbol coded by the tree T. C(T) is the bit
rate of the code.
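
As a quick Python sketch, the cost is just a probability-weighted sum of leaf depths; the numbers below are taken from the example on the next slide:

    probs  = {'a': 1/2, 'b': 1/8, 'c': 1/8, 'd': 1/4}
    depths = {'a': 1, 'b': 3, 'c': 3, 'd': 2}   # path length from root to each leaf
    cost = sum(probs[s] * depths[s] for s in probs)
    print(cost)                                 # 1.75 bits per symbol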


Example of Cost
• Example: a 1/2, b 1/8, c 1/8, d 1/4
(tree T: a = 0, b = 100, c = 101, d = 11)

C(T) = 1 x 1/2 + 3 x 1/8 + 3 x 1/8 + 2 x 1/4 = 1.75


Huffman Tree
• Input: Probabilities p1, p2, ... , pm for symbols
a1, a2, ... ,am, respectively.
• Output: A tree that minimizes the average
number of bits (bit rate) to code a symbol.
That is, minimizes
HC(T) = Σi=1..m pi ri   (the bit rate)
where ri is the length of the path from the root
to ai. This is the Huffman tree or Huffman
code


Optimality Principle 1
• In a Huffman tree a lowest probability symbol
has maximum distance from the root.
– If not, exchanging a lowest probability symbol with
one at maximum distance will lower the cost.

(In T, the lowest-probability symbol p sits at depth k and symbol q sits at maximum depth h, with p < q and k < h; T' exchanges p and q.)

C(T’) = C(T) + hp - hq + kq - kp = C(T) - (h-k)(q-p) < C(T)


Optimality Principle 2
• The second lowest probability is a sibling of
the smallest in some Huffman tree.
– If not, we can move it there without raising the cost.

(In T, p has the smallest probability and its sibling at maximum depth h is r; the second-smallest symbol q sits at depth k, with q ≤ r and k ≤ h. T' exchanges q and r.)

C(T') = C(T) + hq - hr + kr - kq = C(T) - (h-k)(r-q) ≤ C(T)


Optimality Principle 3
• Assuming we have a Huffman tree T whose two
lowest probability symbols are siblings at
maximum depth, they can be replaced by a new
symbol whose probability is the sum of their
probabilities.
– The resulting tree is optimal for the new symbol set.
(In T, the sibling leaves p and q sit at maximum depth h; in T' they are replaced by a single leaf of probability q+p at depth h-1.)

C(T') = C(T) + (h-1)(p+q) - hp - hq = C(T) - (p+q)


Optimality Principle 3 (cont’)


• If T’ were not optimal then we could find a
lower cost tree T’’. This will lead to a lower
cost tree T’’’ for the original alphabet.

(T'' is a lower-cost tree for the reduced alphabet; T''' restores the two leaves p and q below the q+p leaf of T''.)
C(T’’’) = C(T’’) + p + q < C(T’) + p + q = C(T) which is a contradiction


Recursive Huffman Tree Algorithm


1. If there is just one symbol, a tree with one
node is optimal. Otherwise
2. Find the two lowest probability symbols with
probabilities p and q respectively.
3. Replace these with a new symbol with
probability p + q.
4. Solve the problem recursively for new symbols.
5. Replace the leaf with the new symbol with an
internal node with two children with the old symbols.


Iterative Huffman Tree Algorithm


form a node for each symbol ai with weight pi;
insert the nodes in a min priority queue ordered by probability;
while the priority queue has more than one element do
min1 := delete-min;
min2 := delete-min;
create a new node n;
n.weight := min1.weight + min2.weight;
n.left := min1;
n.right := min2;
insert(n)
return the last node in the priority queue.
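
A runnable Python sketch of this algorithm using the standard heapq module (the tuple representation and the tie-breaking counter are my choices; ties can produce a different but equally optimal tree):

    import heapq
    from itertools import count

    def build_huffman(probs):
        # probs: {symbol: probability}; returns {symbol: code string}
        tiebreak = count()                       # keeps heap entries comparable
        heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, left = heapq.heappop(heap)    # min1 := delete-min
            p2, _, right = heapq.heappop(heap)   # min2 := delete-min
            heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):          # internal node: 0 = left, 1 = right
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix or "0"      # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    probs = {'a': .4, 'b': .1, 'c': .3, 'd': .1, 'e': .1}
    codes = build_huffman(probs)
    rate = sum(p * len(codes[s]) for s, p in probs.items())
    print(rate)   # 2.1 bits per symbol, as computed on the following slides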


Example of Huffman Tree Algorithm (1)


• P(a) =.4, P(b)=.1, P(c)=.3, P(d)=.1, P(e)=.1

Start: a (.4), b (.1), c (.3), d (.1), e (.1)

Merge the two smallest, b and e, into a node (b,e) of weight .2:
a (.4), (b,e) (.2), c (.3), d (.1)


Example of Huffman Tree Algorithm (2)


a (.4), (b,e) (.2), c (.3), d (.1)

Merge the two smallest, d and (b,e), into a node of weight .3:
a (.4), c (.3), (d,(b,e)) (.3)


Example of Huffman Tree Algorithm (3)

a (.4), c (.3), (d,(b,e)) (.3)

Merge c and (d,(b,e)) into a node of weight .6:
a (.4), (c,(d,(b,e))) (.6)


Example of Huffman Tree Algorithm (4)

a (.4), (c,(d,(b,e))) (.6)

Merge the last two nodes into the root (weight 1.0). In the final tree,
a is at depth 1, c at depth 2, d at depth 3, and b and e at depth 4.


Huffman Code

a  0
b  1110
c  10
d  110
e  1111

average number of bits per symbol is
.4 x 1 + .1 x 4 + .3 x 2 + .1 x 3 + .1 x 4 = 2.1


Optimal Huffman Code vs. Entropy


• P(a) =.4, P(b)=.1, P(c)=.3, P(d)=.1, P(e)=.1
Entropy

H = -(.4 x log2(.4) + .1 x log2(.1) + .3 x log2(.3)


+ .1 x log2(.1) + .1 x log2(.1))
= 2.05 bits per symbol

Huffman Code

HC = .4 x 1 + .1 x 4 + .3 x 2 + .1 x 3 + .1 x 4
= 2.1 bits per symbol
pretty good!


In Class Exercise
• P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16,
P(e) = 1/16
• Compute the Optimal Huffman tree and its
average bit rate.
• Compute the Entropy
• Compare
• Hint: For the tree change probabilities to be
integers: a:8, b:4, c:2, d:1, e:1. Normalize at
the end.


Quality of the Huffman Code


• The Huffman code is within one bit of the entropy
lower bound.
H ≤ HC ≤ H + 1
• Huffman code does not work well with a two symbol
alphabet.
– Example: P(0) = 1/100, P(1) = 99/100
– HC = 1 bit/symbol (each symbol gets a 1-bit code)
– H = -((1/100)*log2(1/100) + (99/100)log2(99/100))
= .08 bits/symbol


Powers of Two
• If all the probabilities are powers of two then
HC = H
• Proof by induction on the number of symbols.
Let p1 ≤ p2 ≤ ... ≤ pn be the probabilities, adding up
to 1.
If n = 1 then HC = H (both are zero).
If n > 1 then p1 = p2 = 2^-k for some k, otherwise the
sum cannot add up to 1.
Combine the first two symbols into a new symbol of
probability 2^-k + 2^-k = 2^(-k+1).


Powers of Two (Cont.)


By the induction hypothesis,

HC(p1+p2, p3, ..., pn) = H(p1+p2, p3, ..., pn)

H(p1+p2, p3, ..., pn)
  = -(p1+p2) log2(p1+p2) - Σi=3..n pi log2(pi)
  = -2^(-k+1) log2(2^(-k+1)) - Σi=3..n pi log2(pi)
  = -2^(-k+1) (log2(2^-k) + 1) - Σi=3..n pi log2(pi)
  = -2^-k log2(2^-k) - 2^-k log2(2^-k) - Σi=3..n pi log2(pi) - 2^-k - 2^-k
  = -Σi=1..n pi log2(pi) - (p1+p2)
  = H(p1, p2, ..., pn) - (p1+p2)


Powers of Two (Cont.)


By the previous page,

HC(p1+p2, p3, ..., pn) = H(p1, p2, ..., pn) - (p1+p2)

By the properties of Huffman trees (principle 3),

HC(p1, p2, ..., pn) = HC(p1+p2, p3, ..., pn) + (p1+p2)

Hence,

HC(p1, p2, ..., pn) = H(p1, p2, ..., pn)


Extending the Alphabet


• Assuming independence P(ab) = P(a)P(b), so
we can lump symbols together.
• Example: P(0) = 1/100, P(1) = 99/100
– P(00) = 1/10000, P(01) = P(10) = 99/10000,
P(11) = 9801/10000.

Huffman code for pairs: 11 → 0, 10 → 10, 01 → 110, 00 → 111

HC = 1.03 bits/symbol (2-bit symbols)
   = .515 bits/bit

Still not that close to H = .08 bits/bit
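
A sketch of this computation, reusing the build_huffman function from the Huffman slide's sketch:

    pairs = {'00': 0.0001, '01': 0.0099, '10': 0.0099, '11': 0.9801}
    codes = build_huffman(pairs)
    rate = sum(p * len(codes[s]) for s, p in pairs.items())
    print(rate)        # ~1.03 bits per pair
    print(rate / 2)    # ~0.515 bits per bit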


Quality of Extended Alphabet


• Suppose we extend the alphabet to symbols
of length k then

H ≤ HC ≤ H + 1/k
• Pros and Cons of Extending the alphabet
+ Better compression
- 2^k symbols
- padding needed to make the length of the input
divisible by k


Huffman Codes with Context


• Suppose we add a one symbol context. That is, in
compressing a string x1x2...xn we want to take into
account x(k-1) when encoding xk.
– New model, so entropy based on just independent
probabilities of the symbols doesn’t hold. The new entropy
model (2nd order entropy) has, for each symbol, a probability
for each other symbol following it.
– Example: {a,b,c}

            next
            a    b    c
prev   a   .4   .2   .4
       b   .1   .9   0
       c   .1   .1   .8


Multiple Codes
Conditional probabilities and a code for the first symbol:

            next
            a    b    c         Code for first symbol:
prev   a   .4   .2   .4         a  00
       b   .1   .9   0          b  01
       c   .1   .1   .8         c  10

One Huffman code per context, built from the row of the previous symbol:

after a:  a = 1,  b = 00, c = 01
after b:  b = 0,  a = 1          (c never follows b)
after c:  c = 0,  a = 10, b = 11

abbacc
00 00 0 1 01 0


Average Bit Rate for Code


• The stationary symbol probabilities satisfy
P(a) = .4 P(a) + .1 P(b) + .1 P(c)
P(b) = .2 P(a) + .9 P(b) + .1 P(c)
1 = P(a) + P(b) + P(c)
• 0 = -.6 P(a) + .1 P(b) + .1 P(c)
0 = .2 P(a) - .1 P(b) + .1 P(c)
1 = P(a) + P(b) + P(c)
• P(a) = 1/7, P(b) = 4/7, P(c) = 2/7
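
These stationary probabilities can be checked with a small linear solve, e.g. with numpy (a sketch):

    import numpy as np

    A = np.array([[-0.6,  0.1, 0.1],    # 0 = -.6 P(a) + .1 P(b) + .1 P(c)
                  [ 0.2, -0.1, 0.1],    # 0 =  .2 P(a) - .1 P(b) + .1 P(c)
                  [ 1.0,  1.0, 1.0]])   # 1 = P(a) + P(b) + P(c)
    b = np.array([0.0, 0.0, 1.0])
    Pa, Pb, Pc = np.linalg.solve(A, b)  # -> 1/7, 4/7, 2/7
    print(Pa * 1.6 + Pb * 1.0 + Pc * 1.2)   # ABR = 8/7, as on the next slide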


Average Bit Rate for Code


Stationary probabilities: P(a) = 1/7, P(b) = 4/7, P(c) = 2/7.
Average code lengths per context (from the codes above):
1.6 bits after a, 1 bit after b, 1.2 bits after c.

ABR = 1/7 (.6 x 2 + .4) + 4/7 (1) + 2/7 (.2 x 2 + .8)
    = 8/7 = 1.14 bps


Complexity of Huffman Code Design


• Time to design Huffman Code is O(n log n)
where n is the number of symbols.
– Each step consists of a constant number of priority
queue operations (2 deletemin’s and 1 insert)


Approaches to Huffman Codes


1. Frequencies computed for each input
– Must transmit the Huffman code or
frequencies as well as the compressed input
– Requires two passes
2. Fixed Huffman tree designed from training data
– Do not have to transmit the Huffman tree
because it is known to the decoder.
– H.263 video coder
3. Adaptive Huffman code
– One pass
– Huffman tree changes as frequencies change


Run-Length Coding
• Lots of 0’s and not too many 1’s.
– Fax of letters
– Graphics
• Simple run-length code
– Input
00000010000000001000000000010001001.....
– Symbols
6 9 10 3 2 ...
– Code the bits as a sequence of integers
– Problem: How long should the integers be?


Golomb Code of Order m


Variable Length Code for Integers
• Let n = qm + r where 0 ≤ r < m.
– Divide m into n to get the quotient q and
remainder r.
• Code for n has two parts:
1. q is coded in unary
2. r is coded as a fixed prefix code
Example: m = 5. Fixed prefix code for r:

r     0   1   2   3    4
code  00  01  10  110  111


Example
• n = qm + r is represented by q ones, then a zero, then r̂:

  11...1 0 r̂    (q ones)

– where r̂ is the fixed prefix code for r
2 6 9 10 27
010 1001 10111 11000 11111010
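
A Python sketch of the encoder; the remainder code below is the standard truncated-binary construction, which reproduces the slide's code for r when m = 5:

    from math import ceil, log2

    def golomb_encode(n, m):
        # q in unary (q ones then a zero), r in truncated binary
        q, r = divmod(n, m)
        b = ceil(log2(m))
        cutoff = 2**b - m                       # this many remainders get b-1 bits
        if r < cutoff:
            tail = format(r, "0%db" % (b - 1))
        else:
            tail = format(r + cutoff, "0%db" % b)
        return "1" * q + "0" + tail

    for n in (2, 6, 9, 10, 27):
        print(n, golomb_encode(n, 5))   # 010, 1001, 10111, 11000, 11111010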


Alternative Explanation
Golomb Code of order 5
input   output
00000   1
00001   0111
0001    0110
001     010
01      001
1       000

Variable length to variable length code.


Run Length Example: m = 5


Input: 00000010000000001000000000010001001.....

Parsed into variable-length input blocks with their output codes:

00000 → 1
01    → 001
00000 → 1
00001 → 0111

In this example we coded 17 bits in only 9 bits.


Choosing m
• Suppose that 0 has the probability p and 1
has probability 1-p.
• The probability of the run 0^n 1 is p^n (1-p). The Golomb
code of order

  m = ⌈-1 / log2(p)⌉

is optimal.
• Example: p = 127/128.

  m = ⌈-1 / log2(127/128)⌉ = 89
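
In code (a two-line sketch):

    from math import ceil, log2

    p = 127 / 128
    m = ceil(-1 / log2(p))   # -> 89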


Average Bit Rate for Golomb Code

Average Bit Rate = (average output code length) / (average input code length)

• m = 4 as an example, with p as the probability of 0:

input   output   weight
0000    1        p^4
0001    011      p^3(1-p)
001     010      p^2(1-p)
01      001      p(1-p)
1       000      1-p

        p^4 + 3p^3(1-p) + 3p^2(1-p) + 3p(1-p) + 3(1-p)
ABR = -------------------------------------------------
        4p^4 + 4p^3(1-p) + 3p^2(1-p) + 2p(1-p) + (1-p)
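
A sketch evaluating this ratio for a given p:

    def golomb4_abr(p):
        # the weights are the probabilities of the five input blocks above
        out_len = 1*p**4 + 3*p**3*(1-p) + 3*p**2*(1-p) + 3*p*(1-p) + 3*(1-p)
        in_len  = 4*p**4 + 4*p**3*(1-p) + 3*p**2*(1-p) + 2*p*(1-p) + 1*(1-p)
        return out_len / in_len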


Comparison of GC with Entropy

[Plot: the relative redundancy (GC − entropy) / entropy as a function of the order of the Golomb code.]


Notes on Golomb codes


• Useful for binary compression when one symbol is
much more likely than another.
– binary images
– fax documents
– bit planes for wavelet image compression
• Need a parameter (the order)
– training
– adaptively learn the right parameter
• Variable-to-variable length code
• Last symbol needs to be a 1
– coder always adds a 1
– decoder always removes a 1


Tunstall Codes
• Variable-to-fixed length code
• Example
input  output
a      000
b      001
ca     010
cb     011
cca    100
ccb    101
ccc    110

Example: a b cca cb ccc ... → 000 001 100 011 110 ...


Tunstall code Properties


1. No input code is a prefix of another to
assure unique encodability.
2. Minimize the number of bits per symbol.


Prefix Code Property

a    000
b    001
ca   010
cb   011
cca  100
ccb  101
ccc  110

(parse tree: the root and the internal nodes for "c" and "cc" each have children a, b, c)

Unused output code is 111.


Use for unused code


• Consider the string “cc”: if it occurs at the end
of the data, it does not have a code.
• Send the unused code and some fixed code
for the cc.
• Generally, if there are k internal nodes in the
prefix tree then there is a need for k-1 fixed
codes.


Designing a Tunstall Code


• Suppose there are m initial symbols.
• Choose a target output length n where 2^n > m.
1. Form a tree with a root and m children with
edges labeled with the symbols.
2. If the number of leaves is > 2^n – m then halt.*
3. Find the leaf with highest probability and
expand it to have m children.** Go to 2.

* In the next step we will add m-1 more leaves.


** The probability is the product of the probabilities
of the symbols on the root to leaf path.
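
A Python sketch of this design procedure (assigning the n-bit codes to leaves in sorted order is my own convention; it happens to reproduce the example on the next slides):

    def tunstall(probs, n):
        # probs: {symbol: probability}; n: output code length in bits
        m = len(probs)
        leaves = dict(probs)                      # the root's m children
        while len(leaves) <= 2**n - m:            # halting rule from step 2
            best = max(leaves, key=leaves.get)    # highest-probability leaf
            p = leaves.pop(best)
            for sym, ps in probs.items():         # expand it into m children
                leaves[best + sym] = p * ps
        return {s: format(i, "0%db" % n) for i, s in enumerate(sorted(leaves))}

    code = tunstall({'a': .7, 'b': .2, 'c': .1}, 3)
    # {'aaa': '000', 'aab': '001', 'aac': '010', 'ab': '011',
    #  'ac': '100', 'b': '101', 'c': '110'}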


Example
• P(a) = .7, P(b) = .2, P(c) = .1
• n=3
Step 1: leaves a (.7), b (.2), c (.1)


Example
• P(a) = .7, P(b) = .2, P(c) = .1
• n=3
Step 2: expand a, the highest-probability leaf:
leaves aa (.49), ab (.14), ac (.07), b (.2), c (.1)


Example
• P(a) = .7, P(b) = .2, P(c) = .1
• n = 3

Step 3: expand aa (.49):
leaves aaa (.343), aab (.098), aac (.049), ab (.14), ac (.07), b (.2), c (.1)

Now 7 leaves > 2^3 − 3 = 5, so halt and assign 3-bit codes:

aaa  000
aab  001
aac  010
ab   011
ac   100
b    101
c    110


Bit Rate of Tunstall


• The length of the output code divided by the
average length of the input code.
• Let pi be the probability of, and ri the length of,
input code i (1 ≤ i ≤ s), and let n be the length
of the output code.

  Average bit rate = n / (Σi=1..s pi ri)


Example
input  prob   output
aaa    .343   000
aab    .098   001
aac    .049   010
ab     .14    011
ac     .07    100
b      .2     101
c      .1     110

ABR = 3/[3 (.343 + .098 + .049) + 2 (.14 + .07) + .2 + .1]


= 1.37 bits per symbol
Entropy = 1.16 bits per symbol


Notes on Tunstall Codes


• Variable-to-fixed length code
• Error resilient
– A flipped bit will introduce just one error in the
output
– Huffman is not error resilient. A single bit flip can
destroy the code.
