Data Compression Lecture01
Data Compression
Computer Vision & Biometrics Lab, Indian Institute of Information Technology, Allahabad
x → Encoder → y → Decoder → x̂
• Lossless compression: x = x̂
– Also called entropy coding, reversible coding.
• Lossy compression: x ≠ x̂
– Also called irreversible coding.
• Compression ratio = |x| / |y|
– |x| is the number of bits in x.
Why Compress
• Conserve storage space
• Reduce time for transmission
– Faster to encode, send, then decode than to send
the original
• Progressive transmission
– Some compression techniques allow us to send
the most important bits first so we can get a low
resolution version of some data before getting the
high fidelity version
• Reduce computation
– Use less data to achieve an approximate answer
Braille
• System to read text by feeling raised dots on
paper (or on electronic displays). Invented in
1820s by Louis Braille, a French blind man.
[Figure: Braille cells for the letters a, b, c, z and the contractions th, ch, gh]
Braille Example
Clear text:
Call me Ishmael. Some years ago -- never mind how
long precisely -- having little or no money in my purse,
and nothing particular to interest me on shore, I thought
I would sail about a little and see the watery part of the
world. (238 characters)
Grade 2 Braille in ASCII:
,call me ,i%mael4 ,"s ye>s ago -- n"e m9d h[ l;g
precisely -- hav+ ll or no m"oy 9 my purse1 & no?+
"picul> 6 9t]e/ me on %ore1 ,i ?"| ,i wd sail
ab a ll & see ! wat]y "p ( ! _w4 (203 characters)
Lossless Compression
• Data is not lost - the original is really needed.
– text compression
– compression of computer binary files
• Compression ratio typically no better than 4:1 for
lossless compression on many kinds of files.
• Statistical Techniques
– Huffman coding
– Arithmetic coding
– Golomb coding
• Dictionary techniques
– LZW, LZ77
– Sequitur
– Burrows-Wheeler Method
• Standards - Morse code, Braille, Unix compress, gzip,
zip, bzip, GIF, JBIG, Lossless JPEG
Lossy Compression
• Data is lost, but not too much.
– audio
– video
– still images, medical images, photographs
• Compression ratios of 10:1 often yield quite
high fidelity results.
• Major techniques include
– Vector Quantization
– Wavelets
– Block transforms
• Standards - JPEG, JPEG2000, MPEG-2, H.264
What is Information
• Analog data
– Also called continuous data
– Represented by real numbers (or complex
numbers)
• Digital data
– Finite set of symbols {a1, a2, ... , am}
– All data represented as sequences (strings) in the
symbol set.
– Example: {a,b,c,d,r} abracadabra
– Digital data can be an approximation to analog
data
Symbols
• Roman alphabet plus punctuation
• ASCII - 128 symbols (256 with 8-bit extensions)
• Binary - {0,1}
– 0 and 1 are called bits
– All digital information can be represented
efficiently in binary
– {a,b,c,d} fixed length representation
symbol a b c d
binary 00 01 10 11
Information Theory
• Developed by Shannon in the 1940’s and 50’s
• Attempts to explain the limits of communication
using probability theory.
• Example: Suppose English text is being sent
– It is much more likely to receive an “e” than a “z”.
– In some sense “z” has more information than “e”.
First-order Information
• Suppose we are given symbols {a1, a2, ... , am}.
• P(ai) = probability of symbol ai occurring in the
absence of any other information.
P(a1) + P(a2) + ... + P(am) = 1
• inf(ai) = log2(1/P(ai)) is the information of ai in bits.
[Figure: plot of y = -log(x) for x between 0 and 1, with y ranging from 0 to 7]
Example
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
– inf(a) = log2(8) = 3
– inf(b) = log2(4) = 2
– inf(c) = log2(8/5) = .678
• Receiving an “a” has more information than
receiving a “b” or “c”.
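The information values in this example can be checked with a few lines of Python. This is only a sketch: the function name `information` is chosen here for illustration and does not come from the slides.

```python
import math

def information(p: float) -> float:
    """Self-information, in bits, of a symbol with probability p (0 < p <= 1)."""
    return math.log2(1.0 / p)

# Probabilities from the example: P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
probs = {"a": 1 / 8, "b": 1 / 4, "c": 5 / 8}
for symbol, p in probs.items():
    print(symbol, round(information(p), 3))
```

The rarer the symbol, the larger the value, which is the sense in which an "a" carries more information than a "b" or a "c".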
Entropy Examples
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8.
– H = P(a) inf(a) + P(b) inf(b) + P(c) inf(c) = 1/8 * 3 + 1/4 * 2 + 5/8 * .678 = 1.3 bits/symbol
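The entropy computed above is the probability-weighted average of the per-symbol information. A minimal sketch follows; the name `entropy` and the convention that zero-probability symbols contribute nothing to the sum are assumptions made for this illustration.

```python
import math

def entropy(probs):
    """First-order entropy in bits/symbol.

    Symbols with probability 0 are skipped (assumed convention: they add nothing).
    """
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# The example distribution: 1/8, 1/4, 5/8
print(round(entropy([1 / 8, 1 / 4, 5 / 8]), 2))
```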
An Extreme Case
• {a, b, c} with P(a) = 1, P(b) = 0, P(c) = 0
– H=?
Entropy Curve
• Suppose we have two symbols with probabilities
x and 1-x, respectively.
[Figure: entropy (vertical axis, 0 to 1.2) plotted against x (horizontal axis, 0 to 1); maximum entropy at x = .5]
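The shape of the two-symbol entropy curve can be reproduced numerically. The sketch below assumes the usual convention that the entropy is 0 at x = 0 and x = 1; the function name `binary_entropy` is illustrative.

```python
import math

def binary_entropy(x: float) -> float:
    """Entropy in bits of two symbols with probabilities x and 1 - x."""
    if x in (0.0, 1.0):
        return 0.0  # assumed convention at the end points
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

# Sample the curve at a few points; it is symmetric about x = 0.5
for x in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(x, round(binary_entropy(x), 3))
```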
ccabccbccc
1 1 00 01 1 1 01 1 1 1
[Figure: binary code tree with its internal nodes and leaves labelled]
Decoding 11000111100 with the code tree c = 1, a = 00, b = 01:
[Figure: from the root, edge 1 leads to leaf c and edge 0 to an internal node whose edges 0 and 1 lead to leaves a and b]
1 1 00 01 1 1 1 00
c c a b c c c a
Output: ccabccca
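The decoding walk through the tree can be sketched in Python. The code table c = 1, a = 00, b = 01 is read off the tree in the slides; the function names are chosen here for illustration. Matching accumulated bits against the codewords is equivalent to walking the tree from the root and restarting at each leaf, and it only works because no codeword is a prefix of another.

```python
CODE = {"c": "1", "a": "00", "b": "01"}

def encode(text, code=CODE):
    """Concatenate the codeword of each symbol."""
    return "".join(code[s] for s in text)

def decode(bits, code=CODE):
    """Accumulate bits until they form a codeword, emit its symbol, and start over."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)

print(encode("ccabccbccc"))
print(decode("11000111100"))
```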
Exercise Encode/Decode
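[Figure: code tree. From the root, 0 leads to leaf a and 1 to an internal node; from that node, 1 leads to leaf d and 0 to a node with leaves b (0) and c (1).]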
[Figure: code tree with c = 1 (probability 5/8), a = 00 (probability 1/8) and b = 01 (probability 1/4)]
Huffman Coding
• Huffman (1951)
• Uses frequencies of symbols in a string to build a
variable rate prefix code.
– Each symbol is mapped to a binary string.
– More frequent symbols have shorter codes.
– No code is a prefix of another.
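• Example:
a 0
b 100
c 101
d 11
[Figure: code tree. From the root, 0 leads to leaf a and 1 to an internal node; from that node, 1 leads to leaf d and 0 to a node with leaves b (0) and c (1).]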
– aabddcaa
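As an illustration, the string above can be encoded with the example prefix code (a = 0, b = 100, c = 101, d = 11) and compared with the 2-bit fixed-length code for {a, b, c, d}. Pairing this string with that particular code is an assumption made here; the slide text around the string did not survive.

```python
HUFFMAN = {"a": "0", "b": "100", "c": "101", "d": "11"}   # variable-length example code
FIXED = {"a": "00", "b": "01", "c": "10", "d": "11"}      # 2 bits per symbol

def encode(text, code):
    """Concatenate the codeword of each symbol."""
    return "".join(code[s] for s in text)

message = "aabddcaa"
variable_bits = encode(message, HUFFMAN)
fixed_bits = encode(message, FIXED)
print(variable_bits, len(variable_bits))
print(fixed_bits, len(fixed_bits))
```

The variable-length code wins here because the frequent symbol a gets the 1-bit codeword.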
Example of Cost
• Example: a 1/2, b 1/8, c 1/8, d 1/4
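[Figure: tree T. From the root, 0 leads to leaf a and 1 to an internal node; from that node, 1 leads to leaf d and 0 to a node with leaves b (0) and c (1).]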
Huffman Tree
• Input: Probabilities p1, p2, ... , pm for symbols
a1, a2, ... , am, respectively.
• Output: A tree that minimizes the average
number of bits (bit rate) to code a symbol.
That is, it minimizes
$HC(T) = \sum_{i=1}^{m} p_i r_i$ (bit rate)
where $r_i$ is the length of the path from the root
to $a_i$. This is the Huffman tree or Huffman
code.
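The bit rate of a given code tree is a probability-weighted sum of codeword lengths. A small sketch, using the probabilities a 1/2, b 1/8, c 1/8, d 1/4 and the code a = 0, b = 100, c = 101, d = 11 from the earlier example; the function name `average_bit_rate` is illustrative.

```python
def average_bit_rate(probs, code):
    """Sum of p_i * r_i, where r_i is the codeword length (depth of the leaf)."""
    return sum(p * len(code[s]) for s, p in probs.items())

probs = {"a": 1 / 2, "b": 1 / 8, "c": 1 / 8, "d": 1 / 4}
code = {"a": "0", "b": "100", "c": "101", "d": "11"}
print(average_bit_rate(probs, code))  # 1.75 bits per symbol
```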
Optimality Principle 1
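• In a Huffman tree a lowest probability symbol
has maximum distance from the root.
– If not, exchanging a lowest probability symbol with
one at maximum distance will lower the cost.
[Figure: in T the smallest probability p is at depth k and q is at depth h, with p < q and k < h; T' is T with p and q exchanged]
C(T') = C(T) + hp + kq - kp - hq = C(T) - (h-k)(q-p) < C(T)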
Optimality Principle 2
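• The second lowest probability is a sibling of
the smallest in some Huffman tree.
– If not, we can move it there without raising the cost.
[Figure: in T the smallest probability p is at depth h with sibling r, and the 2nd smallest q is at depth k, with q < r and k < h; T' is T with q and r exchanged, so that q and p are siblings]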
Optimality Principle 3
• Assuming we have a Huffman tree T whose two
lowest probability symbols are siblings at
maximum depth, they can be replaced by a new
symbol whose probability is the sum of their
probabilities.
– The resulting tree is optimal for the new symbol set.
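[Figure: in T the smallest probability p and the 2nd smallest q are siblings at depth h; in T' they are replaced by a single leaf of probability q+p at depth h-1]
C(T') = C(T) + (h-1)(p+q) - hp - hq = C(T) - (p+q)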
[Figure: trees T', T'' and T''', involving the combined leaf q+p and the sibling leaves q and p]
Probabilities: a .4, b .1, c .3, d .1, e .1
• Combine b and e (.1 + .1 = .2): a .4, {b, e} .2, c .3, d .1
• Combine d and {b, e} (.1 + .2 = .3): a .4, c .3, {d, {b, e}} .3
• Combine c and {d, {b, e}} (.3 + .3 = .6): a .4, {c, {d, {b, e}}} .6
• Combine a and {c, {d, {b, e}}} (.4 + .6 = 1): the tree is complete
Huffman Code
HC = .4 x 1 + .1 x 4 + .3 x 2 + .1 x 3 + .1 x 4
= 2.1 bits per symbol
pretty good!
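The construction can be sketched with a priority queue: repeatedly remove the two least probable nodes and merge them. This is one possible implementation, not necessarily the one used in the lecture; `huffman_code` and the tie-breaking counter are choices made here. Ties may be broken differently from the slides, so individual codewords can differ, but the average bit rate is the same.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return {symbol: codeword}, built by merging the two least probable nodes."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        # Prepend one bit: 0 for the first subtree, 1 for the second
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": .4, "b": .1, "c": .3, "d": .1, "e": .1}
code = huffman_code(probs)
rate = sum(p * len(code[s]) for s, p in probs.items())
print(code, round(rate, 2))
```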
In Class Exercise
• P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/16,
P(e) = 1/16
• Compute the Optimal Huffman tree and its
average bit rate.
• Compute the Entropy
• Compare
• Hint: For the tree change probabilities to be
integers: a:8, b:4, c:2, d:1, e:1. Normalize at
the end.
Powers of Two
• If all the probabilities are powers of two then
HC = H.
• Proof by induction on the number of symbols.
Let p1 ≤ p2 ≤ ... ≤ pn be the probabilities that add up
to 1.
If n = 1 then HC = H (both are zero).
If n > 1 then p1 = p2 = 2^(-k) for some k, otherwise the
sum cannot add up to 1.
Combine the first two symbols into a new symbol of
probability 2^(-k) + 2^(-k) = 2^(-k+1).
$H \le HC \le H + 1/k$
• Pros and Cons of Extending the alphabet
+ Better compression
- 2^k symbols
- padding needed to make the length of the input
divisible by k
Multiple Codes
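Probability of the next symbol given the previous symbol:
prev \ next   a    b    c
a            .4   .2   .4
b            .1   .9   0
c            .1   .1   .8
Code for first symbol: a 00, b 01, c 10
[Figure: three Huffman trees, one for each previous symbol a, b and c, built from the corresponding row of the table]
Example: abbacc is coded as 00 00 0 1 01 0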
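Coding with one code per previous symbol can be sketched as a table lookup keyed on the previous symbol. The per-context codewords below are inferred from the worked example abbacc; the entries marked "assumed" are not determined by that example and are guesses made only so that the sketch is complete.

```python
FIRST = {"a": "00", "b": "01", "c": "10"}  # code for the first symbol

# One code per previous symbol.
BY_PREVIOUS = {
    "a": {"a": "1", "b": "00", "c": "01"},  # a after a: assumed
    "b": {"b": "0", "a": "1"},              # c never follows b (probability 0)
    "c": {"c": "0", "a": "10", "b": "11"},  # a and b after c: assumed
}

def encode(text):
    """Encode a non-empty string, choosing the code by the previous symbol."""
    words = [FIRST[text[0]]]
    for prev, cur in zip(text, text[1:]):
        words.append(BY_PREVIOUS[prev][cur])
    return " ".join(words)

print(encode("abbacc"))  # 00 00 0 1 01 0
```

The decoder can follow along because it always knows the previous symbol before it reads the next codeword.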
Run-Length Coding
• Lots of 0’s and not too many 1’s.
– Fax of letters
– Graphics
• Simple run-length code
– Input
00000010000000001000000000010001001.....
– Symbols
6 9 10 3 2 ...
– Code the bits as a sequence of integers
– Problem: How long should the integers be?
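The conversion from bits to run lengths can be sketched as follows. The function names are illustrative, and the sketch assumes every run of 0's is closed by a 1, as in the input shown above.

```python
def to_runs(bits):
    """Lengths of the runs of 0s that precede each 1."""
    runs, current = [], 0
    for bit in bits:
        if bit == "0":
            current += 1
        elif bit == "1":
            runs.append(current)
            current = 0
        else:
            raise ValueError(f"not a bit: {bit!r}")
    if current:
        raise ValueError("input must end with a 1")  # assumption of this sketch
    return runs

def from_runs(runs):
    """Inverse of to_runs: n zeros followed by a 1 for each run length n."""
    return "".join("0" * n + "1" for n in runs)

bits = from_runs([6, 9, 10, 3, 2])
print(to_runs(bits))
```

Each integer still has to be written in binary somehow, which is the problem raised in the last bullet.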
Example
• n = qm + r is represented by:
$\underbrace{11 \cdots 1}_{q}\, 0\, \hat{r}$
– where $\hat{r}$ is the fixed prefix code for r
• Example (m = 5):
n      2    6     9      10     27
code   010  1001  10111  11000  11111010
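A sketch of the encoder follows. It assumes that the "fixed prefix code" for the remainder is the truncated binary code (the shortest remainders get one bit fewer); that assumption reproduces the m = 5 example above. The function names are illustrative.

```python
def remainder_code(r, m):
    """Truncated binary code for a remainder 0 <= r < m (assumed form of the fixed prefix code)."""
    if m == 1:
        return ""                      # only one possible remainder, nothing to send
    b = (m - 1).bit_length()           # ceil(log2(m))
    cutoff = (1 << b) - m              # this many remainders get b - 1 bits
    if r < cutoff:
        return format(r, f"0{b - 1}b")
    return format(r + cutoff, f"0{b}b")

def golomb(n, m):
    """q ones, a terminating 0, then the code for the remainder, where n = q*m + r."""
    q, r = divmod(n, m)
    return "1" * q + "0" + remainder_code(r, m)

for n in (2, 6, 9, 10, 27):
    print(n, golomb(n, 5))
```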
Alternative Explanation
Golomb Code of order 5
input    output
00000    1
1        000
01       001
001      010
0001     0110
00001    0111
Choosing m
• Suppose that 0 has the probability p and 1
has probability 1-p.
• The probability of $0^n 1$ is $p^n (1-p)$. The Golomb
code of order
$m = \lceil -1 / \log_2 p \rceil$
is optimal.
• Example: p = 127/128.
$m = \lceil -1 / \log_2 (127/128) \rceil = 89$
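The choice of order can be checked numerically. This sketch assumes the formula is the ceiling of -1 / log2(p), which matches the value 89 in the example; the function name `golomb_order` is illustrative.

```python
import math

def golomb_order(p: float) -> int:
    """Order m = ceil(-1 / log2(p)) for a source where 0 has probability p (0 < p < 1)."""
    return math.ceil(-1.0 / math.log2(p))

print(golomb_order(127 / 128))  # 89
```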
[Figure: (GC - entropy) / entropy plotted against the order]
Tunstall Codes
• Variable-to-fixed length code
• Example
input output
a 000
b 001
ca 010
cb 011
cca 100
ccb 101
ccc 110
Parsing a b cca cb ccc ... gives 000 001 100 011 110 ...
[Figure: parse tree for the code above. The root has children a (000), b (001) and c; c has children ca (010), cb (011) and cc; cc has children cca (100), ccb (101) and ccc (110).]
Unused output code is 111.
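Encoding with a Tunstall code amounts to parsing the input into table phrases and emitting one fixed-length code per phrase. The sketch below uses the table from the example (in which cca maps to 100); the function name is illustrative. It relies on the phrases being the leaves of a complete tree, so the first match is the only possible one.

```python
TABLE = {"a": "000", "b": "001", "ca": "010", "cb": "011",
         "cca": "100", "ccb": "101", "ccc": "110"}

def tunstall_encode(text, table=TABLE):
    """Extend the current phrase until it is a table entry, then emit its code."""
    out, phrase = [], ""
    for symbol in text:
        phrase += symbol
        if phrase in table:
            out.append(table[phrase])
            phrase = ""
    if phrase:
        raise ValueError(f"leftover input {phrase!r} is not a complete phrase")
    return " ".join(out)

print(tunstall_encode("abccacbccc"))  # 000 001 100 011 110
```

A leftover such as a trailing "c" or "cc" has no code of its own, which is one use for the unused output code.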
Example
• P(a) = .7, P(b) = .2, P(c) = .1
• n = 3
• Start: leaves a (.7), b (.2), c (.1)
• Expand a: leaves aa (.49), ab (.14), ac (.07), b (.2), c (.1)
• Expand aa: leaves aaa (.343), aab (.098), aac (.049), ab (.14), ac (.07), b (.2), c (.1)
• Resulting code: aaa 000, aab 001, aac 010, ab 011, ac 100, b 101, c 110
$\sum_{i=1} p_i r_i$
Example
input  probability  output
aaa    .343         000
aab    .098         001
aac    .049         010
ab     .14          011
ac     .07          100
b      .2           101
c      .1           110
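The table in this example can be reproduced by growing the parse tree one leaf at a time. The sketch below assumes the usual Tunstall rule: expand the most probable leaf for as long as the leaves still fit in n bits. The function name, the alphabetical assignment of output codes and the requirement that the alphabet itself fits in n bits are assumptions of this illustration.

```python
def tunstall_table(probs, n):
    """Return {phrase: (n-bit code, probability)} for a memoryless source."""
    leaves = dict(probs)                      # phrase -> probability
    m = len(probs)
    while len(leaves) + (m - 1) <= 2 ** n:    # an expansion replaces 1 leaf with m leaves
        best = max(leaves, key=leaves.get)    # most probable leaf
        p = leaves.pop(best)
        for symbol, q in probs.items():
            leaves[best + symbol] = p * q
    phrases = sorted(leaves)                  # alphabetical order, as in the example
    return {ph: (format(i, f"0{n}b"), leaves[ph]) for i, ph in enumerate(phrases)}

table = tunstall_table({"a": .7, "b": .2, "c": .1}, 3)
for phrase, (bits, p) in table.items():
    print(phrase, round(p, 3), bits)
```

With three symbols and n = 3 the loop stops at seven leaves, because one more expansion would need nine codes.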