Lecture 1
Basic Data Compression
Concepts
[Block diagram: original data x → Encoder → compressed data y → Decoder → decompressed data x̂]
• Lossless compression: x̂ = x
– Also called entropy coding, reversible coding.
• Lossy compression: x̂ ≠ x
– Also called irreversible coding.
• Compression ratio = |x| / |y|
– |x| is the number of bits in x.
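A minimal sketch of these concepts in Python, using the standard-library zlib codec (my choice of example, not from the lecture): the round trip is lossless, and the compression ratio falls out of the byte counts.

    import zlib

    x = b"abracadabra " * 100     # original data (arbitrary sample text)
    y = zlib.compress(x)          # compressed representation
    x_hat = zlib.decompress(y)    # decompressed data

    assert x == x_hat             # lossless: x is recovered exactly
    print("|x| =", 8 * len(x), "bits, |y| =", 8 * len(y), "bits")
    print("compression ratio =", round(len(x) / len(y), 2))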
Why Compress
• Conserve storage space
• Reduce time for transmission
– Faster to encode, send, then decode than to send the
original
• Progressive transmission
– Some compression techniques allow us to send the
most important bits first, so we can get a low-resolution
version of some data before getting the high-fidelity
version
• Reduce computation
– Use less data to achieve an approximate answer
Measures of Performance
• Compression measures
– Compression ratio = (bits in original image) / (bits in compressed image)
[Figure: Braille cells for the letters a, b, c, …, z]
Braille Example
Clear text:
Call me Ishmael. Some years ago -- never mind how long
precisely -- having little or no money in my purse,
and nothing particular to interest me on shore, I thought I
would sail about a little and see the watery part of the
world. (238 characters)
Grade 2 Braille in ASCII:
,call me ,i%mael4 ,``s ye>s ago -- n``e m9d h[ l;g
precisely -- hav+ ll or no m``oy 9 my purse1 & no?+
``picul> 6 9t]e/ me on %ore1 ,i ?``| ,i wd sail ab
a ll & see ! wat]y ``p ( ! _w4 (203 characters)
Why is Data Compression Possible
• Most data from nature has redundancy
– There is more data than the actual information
contained in the data.
– Squeezing out the excess data amounts to
compression.
– However, unsqueezing is necessary to be able to
figure out what the data means.
• Information theory is needed to understand
the limits of compression and give clues on
how to compress well.
What is Information
• Analog data
– Also called continuous data
– Represented by real numbers (or complex
numbers)
• Digital data
– Finite set of symbols {a1, a2, ... , am}
– All data represented as sequences (strings) in the
symbol set.
– Example: {a,b,c,d,r} abracadabra
– Digital data can be an approximation to analog
data
Symbols
• Roman alphabet plus punctuation
• ASCII - 128 symbols (256 with 8-bit extensions)
• Binary - {0,1}
– 0 and 1 are called bits
– All digital information can be represented
efficiently in binary
– {a,b,c,d} fixed-length representation:
    symbol:  a   b   c   d
    binary:  00  01  10  11
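As a concrete illustration of the table (my own sketch, not the lecture's), a tiny encoder for this fixed-length code:

    # fixed-length 2-bit code for the alphabet {a, b, c, d}
    code = {"a": "00", "b": "01", "c": "10", "d": "11"}

    def encode(s):
        # every symbol costs exactly 2 bits
        return "".join(code[ch] for ch in s)

    print(encode("badcab"))   # 010011100001 (6 symbols -> 12 bits)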
Discussion: Non-Powers of Two
• Can we do better than a fixed-length
representation for non-powers of two? (One
possible approach is sketched below.)
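One possible answer, offered here as my own sketch rather than the lecture's: pack several symbols into one block, so the cost per symbol can be fractional. For an alphabet of 5 symbols, a fixed-length code needs ceil(log2 5) = 3 bits per symbol, but coding 3 symbols at a time needs only ceil(log2 5^3) = 7 bits, about 2.33 bits per symbol.

    import math

    m = 5                                # alphabet size, not a power of two
    print(math.ceil(math.log2(m)))       # 3 bits/symbol, coded one at a time
    k = 3                                # symbols per block
    bits = math.ceil(math.log2(m ** k))  # ceil(log2 125) = 7 bits per block
    print(bits / k)                      # ~2.33 bits/symbol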
Information Theory
• Developed by Shannon in the 1940s and 50s
• Attempts to explain the limits of communication
using probability theory.
• Example: Suppose English text is being sent
– It is much more likely to receive an “e” than a “z”.
– In some sense “z” has more information than “e”.
First-Order Information
• Suppose we are given symbols {a1, a2, ... , am}.
• P(ai) = probability of symbol ai occurring in the
absence of any other information.
P(a1) + P(a2) + ... + P(am) = 1
• inf(ai) = log2(1/P(ai)) is the information of ai,
measured in bits.
[Plot: y = -log(x) for 0 < x < 1; the information of a symbol grows without bound as its probability approaches 0]
Example
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8
– inf(a) = log2(8) = 3
– inf(b) = log2(4) = 2
– inf(c) = log2(8/5) ≈ .678
• Receiving an “a” has more information than
receiving a “b” or “c”.
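A quick check of these values against the definition inf(ai) = log2(1/P(ai)):

    import math

    P = {"a": 1/8, "b": 1/4, "c": 5/8}
    for s, p in P.items():
        # information of the symbol, in bits
        print(s, round(math.log2(1 / p), 3))
    # prints: a 3.0, b 2.0, c 0.678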
First-Order Entropy
• The first-order entropy is defined for a probability
distribution over symbols {a1, a2, ... , am}:

    H = ∑_{i=1..m} P(ai) · log2(1/P(ai))
• H is the average number of bits required to code up a
symbol, given all we know is the probability distribution of
the symbols.
• H is the Shannon lower bound on the average number of bits
to code a symbol in this “source model”.
• Stronger models of entropy include context.
Entropy Examples
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8.
– H = (1/8)·3 + (1/4)·2 + (5/8)·(.678) ≈ 1.3 bits/symbol
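The same arithmetic, computed directly from the entropy formula (just a verification of the slide's numbers):

    import math

    P = [1/8, 1/4, 5/8]
    H = sum(p * math.log2(1 / p) for p in P)
    print(round(H, 3))   # 1.299, i.e. about 1.3 bits/symbol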
Entropy Curve
• Suppose we have two symbols with probabilities x
and 1-x, respectively.
[Plot: entropy (bits) vs. probability x for 0 ≤ x ≤ 1; maximum entropy of 1 bit at x = 0.5]
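Specializing the entropy formula to two symbols gives H(x) = x·log2(1/x) + (1-x)·log2(1/(1-x)). A small sketch evaluating it at a few points confirms the shape of the curve:

    import math

    def h2(x):
        # binary entropy in bits; 0 at the endpoints by convention
        if x in (0.0, 1.0):
            return 0.0
        return x * math.log2(1 / x) + (1 - x) * math.log2(1 / (1 - x))

    for x in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(x, round(h2(x), 3))   # peaks at x = 0.5 with h2(0.5) = 1.0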
A Simple Prefix Code
• {a, b, c} with P(a) = 1/8, P(b) = 1/4, P(c) = 5/8.
• A prefix code is defined by a binary tree.
• Prefix code property:
– no codeword is a prefix of another
Binary tree (0 = left branch, 1 = right branch):

         *
      0 / \ 1
       *   c
    0 / \ 1
     a   b

Code: a → 00, b → 01, c → 1

Example: ccabccbccc encodes as 1 1 00 01 1 1 01 1 1 1
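A minimal encoder for this particular code (a sketch of mine; the tree above fixes the codewords):

    code = {"a": "00", "b": "01", "c": "1"}

    def encode(s):
        # concatenate the codeword of each symbol
        return "".join(code[ch] for ch in s)

    print(encode("ccabccbccc"))   # 1100011101111 (13 bits for 10 symbols)

Note that the average cost under the given probabilities is (1/8)·2 + (1/4)·2 + (5/8)·1 = 1.375 bits/symbol, close to the 1.3-bit entropy bound from the previous slides.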
Binary Tree Terminology
[Figure: a binary tree with its root, an internal node, and a leaf labeled]
Decoding a Prefix Code

Using the tree above (a = 00, b = 01, c = 1):

    repeat
        start at root of tree
        repeat
            read bit
            if bit = 1 then go right else go left
        until node is a leaf
        report the leaf's symbol
    until end of the code

Example input: 11000111100
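The pseudocode translates directly into a runnable sketch. Representing the tree as nested pairs (0-child, 1-child) with leaves as symbols is my encoding choice, not the lecture's:

    # tree for a = 00, b = 01, c = 1:
    # root's 0-child holds a (on 0) and b (on 1); root's 1-child is leaf c
    tree = (("a", "b"), "c")

    def decode(bits, tree):
        out, node = [], tree
        for bit in bits:
            node = node[int(bit)]        # 0 = go left, 1 = go right
            if isinstance(node, str):    # reached a leaf
                out.append(node)         # report its symbol
                node = tree              # restart at the root
        return "".join(out)

    print(decode("11000111100", tree))   # -> ccabccca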
Decoding a Prefix Code (step by step)

With the same tree, the input 11000111100 decodes as:

    1  → c
    1  → c
    00 → a
    01 → b
    1  → c
    1  → c
    1  → c
    00 → a

Output: ccabccca