Data Compression Basics: Discrete Source
Discrete source
  Information = uncertainty
  Quantification of uncertainty
  Source entropy
Variable length codes
  Motivation
  Prefix condition
  Huffman coding algorithm
Information
What do we mean by information?
  “A numerical measure of the uncertainty of an experimental outcome” – Webster Dictionary
How to quantitatively measure and represent information?
  Shannon proposes a probabilistic approach
Let us first look at how we assess the amount of information in our daily lives using common sense
Information = Uncertainty
Zero information
  Pittsburgh Steelers won Super Bowl XL (past news, no uncertainty)
  Afridi plays for Pakistan (celebrity fact, no uncertainty)
Little information
  It will be very cold in Lahore tomorrow (not much uncertainty since this is winter time)
  It is going to rain in Malaysia next week (not much uncertainty since it rains nine months a year in South East Asia)
Large information
  An earthquake is going to hit Indonesia in July 2006 (are you sure? an unlikely event)
  Someone has shown P=NP (Wow! Really? Who did it?)
Shannon’s Picture on Communication (1948)
[Block diagram: source → source encoder → channel encoder → channel → channel decoder → source decoder → destination; the channel encoder, channel, and channel decoder together form the “super-channel”.]
Examples of source:
Human speeches, photos, text messages, computer programs …
Examples of channel:
storage media, telephone lines, wireless transmission …
Source-Channel Separation Principle
The source coding (compression) stage and the channel coding (error protection) stage can be designed independently without loss of optimality, which is why data compression can be studied on its own.
Discrete Source
A discrete source is characterized by a discrete random variable X
Examples
  Coin flipping: P(X=H) = P(X=T) = 1/2
  Dice tossing: P(X=k) = 1/6, k = 1, …, 6
  Playing-card drawing: P(X=S) = P(X=H) = P(X=D) = P(X=C) = 1/4
What is the redundancy with a discrete source?
Two Extreme Cases
[Diagram: tossing a fair coin → source encoder → channel → source decoder]
Self-information
I(p) = −log₂ p
  p = 1: I(p) = 0, the event must happen (no uncertainty)
  p → 0: I(p) → ∞, the event is unlikely to happen (infinite amount of uncertainty)
Intuitively, I(p) measures the amount of uncertainty associated with event x
Weighted Self-information

  p      I(p)   I_w(p) = p·I(p)
  0      ∞      0
  1/2    1      1/2
  1      0      0
Maximum of Weighted Self-information*
I_w(p) reaches its maximum value 1/(e ln 2) at p = 1/e.
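As a quick sanity check of this claim (a standard calculus step, not spelled out on the slide), differentiating I_w(p) = −p log₂ p and setting the derivative to zero:

```latex
\frac{d}{dp} I_w(p)
  = \frac{d}{dp}\left(-\frac{p \ln p}{\ln 2}\right)
  = -\frac{\ln p + 1}{\ln 2}
  = 0
  \;\Rightarrow\; p = e^{-1} = \frac{1}{e},
\qquad
I_w\!\left(\frac{1}{e}\right)
  = -\frac{1}{e}\log_2\frac{1}{e}
  = \frac{\log_2 e}{e}
  = \frac{1}{e \ln 2}
  \approx 0.531 \text{ bits}.
```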
Quantification of Uncertainty of a Discrete Source
Source entropy:
  H(X) = −Σ_{i=1}^{N} p_i log₂ p_i   (bits/sample, or bps)
The probabilities p_i serve as weighting coefficients on the self-information terms −log₂ p_i.
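As a small illustration of this formula, here is a sketch in Python (the helper name entropy is mine, not from the slides), applied to the discrete sources introduced earlier:

```python
import math

def entropy(probs):
    """Source entropy H(X) = -sum_i p_i * log2(p_i), in bits per sample.
    Zero-probability outcomes contribute nothing (0 * log 0 is taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # coin flipping: 1.0 bit/sample
print(entropy([1/6] * 6))      # dice tossing:  log2(6) ≈ 2.585 bits/sample
print(entropy([0.25] * 4))     # card suits:    2.0 bits/sample
```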
Source Entropy Examples
Entropy of Binary Bernoulli Source
For a binary source with P(X=1) = p and P(X=0) = 1−p, the entropy is H(p) = −p log₂ p − (1−p) log₂(1−p); it peaks at 1 bit when p = 1/2 and falls to 0 as p → 0 or p → 1.
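A minimal sketch of this function in Python (the name binary_entropy is mine), showing the peak at p = 1/2:

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p) for a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0   # a deterministic source carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"p = {p:<5} H(p) = {binary_entropy(p):.3f} bits")
# Output is symmetric about p = 0.5, where H reaches its maximum of 1 bit.
```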
Source Entropy Examples
Example 2: (4-way random walk)
  prob(x = S) = 1/2,  prob(x = N) = 1/4
  prob(x = E) = prob(x = W) = 1/8
  H(X) = −(1/2 log₂ 1/2 + 1/4 log₂ 1/4 + 1/8 log₂ 1/8 + 1/8 log₂ 1/8) = 1.75 bps
Source Entropy Examples (Con’t)
  p = prob(x = red) = 1/2,  1 − p = prob(x = blue) = 1/2
Consider the event that the first red appears on the k-th pick:
  Prob(event) = Prob(blue in the first k−1 picks) · Prob(red in the k-th pick)
              = (1/2)^(k−1) · (1/2) = (1/2)^k
Source Entropy Calculation
If we consider all possible events, the sum of their probabilities will be one.
  Check: Σ_{k=1}^{∞} (1/2)^k = 1
Then we can define a discrete random variable X with P(x = k) = (1/2)^k
Entropy:
  H(X) = −Σ_{k=1}^{∞} p_k log₂ p_k = Σ_{k=1}^{∞} k · (1/2)^k = 2 bps
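The two infinite sums above can be checked numerically; a short sketch in Python (truncating the series at a large K, an assumption that is harmless because the tail is negligible):

```python
import math

K = 200  # truncation point; terms beyond this are vanishingly small
probs = [0.5 ** k for k in range(1, K + 1)]

total = sum(probs)                          # should be ~1 (normalization check)
H = -sum(p * math.log2(p) for p in probs)   # should be ~2 bits/sample
print(f"sum of probabilities ≈ {total:.12f}")
print(f"H(X) ≈ {H:.12f} bits/sample")
```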
Properties of Source Entropy
Nonnegative and concave
Achieves the maximum when the source observes a uniform distribution (i.e., P(x=k) = 1/N, k = 1, …, N)
Goes to zero (minimum) as the source becomes more and more skewed (i.e., P(x=k) → 1, P(x≠k) → 0)
What is the use of H(X)?
Shannon’s source coding theorem: for a memoryless discrete source, H(X) is the minimum average number of bits per symbol required to represent the source without loss.
Notes:
1. Memoryless means that the events are independently generated (e.g., the outcomes of flipping a coin N times are independent events)
2. Source redundancy can then be understood as the difference between the raw data rate and the source entropy
Code Redundancy*
  r = l̄ − H(X) ≥ 0
  (practical performance vs. theoretical bound)
Average code length:
  l̄ = Σ_{i=1}^{N} p_i l_i, where l_i is the length of the codeword assigned to the i-th symbol
Source entropy:
  H(X) = Σ_{i=1}^{N} p_i log₂(1/p_i)
Note: if we represent each symbol by q bits (fixed-length codes), then the redundancy is simply q − H(X) bps
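A short sketch of these quantities in Python (function names are mine), using the 4-way random walk source from the earlier example:

```python
import math

def avg_code_length(probs, lengths):
    """Average code length l_bar = sum_i p_i * l_i (bits per symbol)."""
    return sum(p * l for p, l in zip(probs, lengths))

def redundancy(probs, lengths):
    """Code redundancy r = l_bar - H(X) >= 0."""
    H = -sum(p * math.log2(p) for p in probs if p > 0)
    return avg_code_length(probs, lengths) - H

probs = [0.5, 0.25, 0.125, 0.125]           # S, N, E, W
print(redundancy(probs, [2, 2, 2, 2]))      # 2-bit fixed-length code: r = 0.25 bps
print(redundancy(probs, [1, 2, 3, 3]))      # variable-length code:    r = 0.0  bps
```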
How to achieve source entropy?
[Diagram: source X with distribution P(X) → entropy coding → binary bit stream]
Data Compression Basics
Discrete source
  Information = uncertainty
  Quantification of uncertainty
  Source entropy
Variable length codes
  Motivation
  Prefix condition
  Huffman coding algorithm
Variable Length Codes (VLC)
Recall: self-information I(p) = −log₂ p
Motivation: more probable symbols carry less self-information and should therefore be assigned shorter codewords, while rare symbols get longer ones.
4-way Random Walk Example

  symbol k   p_k     fixed-length codeword   variable-length codeword
  S          0.5     00                      0
  N          0.25    01                      10
  E          0.125   10                      110
  W          0.125   11                      111
Toy Example (Con’t)
• source entropy:
  H(X) = −Σ_{k=1}^{4} p_k log₂ p_k
       = 0.5×1 + 0.25×2 + 0.125×3 + 0.125×3
       = 1.75 bits/symbol
• average code length:
  l̄ = N_b / N_s (bps), where N_b is the total number of bits and N_s is the total number of symbols
• fixed-length code: l̄ = 2 bps > H(X);  variable-length code: l̄ = 1.75 bps = H(X)
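The ratio N_b/N_s can also be measured empirically. Here is a sketch in Python (the long simulated sequence and random seed are my own test setup) that encodes a random symbol stream with both codes from the 4-way random walk table:

```python
import random

probs  = {"S": 0.5, "N": 0.25, "E": 0.125, "W": 0.125}
fixed  = {"S": "00", "N": "01", "E": "10",  "W": "11"}
varlen = {"S": "0",  "N": "10", "E": "110", "W": "111"}

random.seed(0)
symbols = random.choices(list(probs), weights=list(probs.values()), k=100_000)

for name, code in (("fixed-length", fixed), ("variable-length", varlen)):
    n_bits = sum(len(code[s]) for s in symbols)        # N_b
    print(f"{name}: {n_bits / len(symbols):.4f} bits/symbol")
# Expected: ~2.00 for the fixed-length code, ~1.75 for the variable-length code.
```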
Problems with VLC
When codewords have fixed lengths, the boundaries between codewords are always identifiable. For codewords with variable lengths, the boundaries can become ambiguous.

  symbol   VLC
  S        0
  N        1
  E        10
  W        11

  encode:  SSNWSE…  →  0 0 1 11 0 10 …
  decode:  0 0 11 1 0 10 …  →  SSWNSE…   (one possible parsing)
           0 0 1 11 0 10 …  →  SSNWSE…   (another possible parsing)
Uniquely Decodable Codes
To avoid ambiguity in decoding, we need to enforce certain conditions on a VLC to make it uniquely decodable.
Since ambiguity arises when one codeword becomes the prefix of another, it is natural to consider the prefix condition.
Example: p, pr, pre, pref, prefi are all prefixes of “prefix”
Prefix condition
No codeword is allowed to be the prefix of any other codeword.
Binary Codeword Tree
[Binary tree rooted at “root”: level 1 contains the nodes 1 and 0 (2 codewords); level 2 contains 11, 10, 01, 00 (2² codewords); …; level k contains 2^k codewords.]
Prefix Condition Examples

  symbol x   codeword 1   codeword 2
  S          0            0
  N          1            10
  E          10           110
  W          11           111

[Codeword trees: in codeword 1 the node 1 is an ancestor of 10 and 11, so the prefix condition is violated; in codeword 2 the codewords 0, 10, 110, 111 are all leaves, so the prefix condition is satisfied.]
How to satisfy prefix condition?
Basic rule: If a node is used as a codeword, then none of its descendants can be used as codewords.
Example: picking 0, 10, 110, 111 from the binary codeword tree follows this rule (once 0 is chosen, all nodes below it are excluded; once 10 is chosen, all nodes below it are excluded; and so on).
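A small sketch of a prefix-condition check in Python (the function name is_prefix_free is mine); it flags the two example codes from the previous slides accordingly:

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of any other codeword."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

code1 = ["0", "1", "10", "11"]      # "1" is a prefix of "10" and "11"
code2 = ["0", "10", "110", "111"]   # every codeword is a leaf of the tree
print(is_prefix_free(code1))        # False
print(is_prefix_free(code2))        # True
```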
Property of Prefix Codes
Kraft’s inequality: Σ_{i=1}^{N} 2^(−l_i) ≤ 1
Check with the two example codes:
  codeword 1 (lengths 1, 1, 2, 2): 2^(−1) + 2^(−1) + 2^(−2) + 2^(−2) = 1.5 > 1 (violated, so it cannot be a prefix code)
  codeword 2 (lengths 1, 2, 3, 3): 2^(−1) + 2^(−2) + 2^(−3) + 2^(−3) = 1 ≤ 1 (satisfied)
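The same check as a one-function sketch in Python (the name kraft_sum is mine):

```python
def kraft_sum(lengths):
    """Kraft sum: sum_i 2^(-l_i); a prefix code must have kraft_sum <= 1."""
    return sum(2.0 ** -l for l in lengths)

print(kraft_sum([1, 1, 2, 2]))   # codeword 1: 1.5 -> violates Kraft's inequality
print(kraft_sum([1, 2, 3, 3]))   # codeword 2: 1.0 -> satisfies it
```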
Two Goals of VLC design
• satisfy the prefix condition (so the code is uniquely decodable)
• achieve optimal code length (i.e., minimal redundancy)
For an event x with probability p(x), the optimal code length is ⌈−log₂ p(x)⌉, where ⌈x⌉ denotes the smallest integer no smaller than x (e.g., ⌈3.4⌉ = 4)
code redundancy: r = l̄ − H(X) ≥ 0
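A quick sketch of the ⌈−log₂ p(x)⌉ rule in Python, applied to the 4-way random walk probabilities; the resulting lengths match the variable-length code used earlier:

```python
import math

probs = {"S": 0.5, "N": 0.25, "E": 0.125, "W": 0.125}
for sym, p in probs.items():
    optimal_len = math.ceil(-math.log2(p))   # ceil(-log2 p(x)) bits
    print(f"{sym}: p = {p:<6} optimal length = {optimal_len} bits")
# S -> 1, N -> 2, E -> 3, W -> 3, matching the codewords 0, 10, 110, 111.
```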
Golomb Codes for Geometric Distribution
Optimal VLC for the geometric source: P(X=k) = (1/2)^k, k = 1, 2, …

  k   codeword
  1   0
  2   10
  3   110
  4   1110
  5   11110
  6   111110
  7   1111110
  8   11111110
  …   ……

The codeword for k is k−1 ones followed by a single 0.
[Figure: the corresponding binary codeword tree, with each codeword one level deeper than the previous one.]
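A minimal sketch of this code in Python (encoder/decoder names are mine): the encoder emits k−1 ones and a terminating 0, and the decoder counts ones up to each 0:

```python
def encode(k):
    """Codeword for k = 1, 2, ...: (k - 1) ones followed by a single 0."""
    return "1" * (k - 1) + "0"

def decode(bits):
    """Split a bit string back into symbols k (possible because the code is prefix-free)."""
    out, run = [], 0
    for b in bits:
        if b == "1":
            run += 1
        else:                  # a '0' terminates the current codeword
            out.append(run + 1)
            run = 0
    return out

stream = "".join(encode(k) for k in [1, 1, 3, 2, 5])
print(stream)            # "001101011110"
print(decode(stream))    # [1, 1, 3, 2, 5]
```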