The Source Coding Theorem
Mário S. Alvim
([email protected])
Information Theory
DCC-UFMG
(2020/01)
We will also see how the measures above are connected with the compression
of sources of information.
Convention. From now on, log x stands for log2 x, unless otherwise stated.
We also adopt the convention that 0 · log2(1/0) = 0, justified by the limit lim_{p→0⁺} p·log2(1/p) = 0.
Example 1 (Continued)
H(X) ≥ 0.
E[f(x)] ≥ f(E[x]) (Jensen's inequality, for a convex function f).
Proof.
We start by proving the following auxiliary result for the case in which p(x, y) = p(x)p(y):
h(x, y) = log (1/p(x, y))            (by def. of h(·))
        = log (1/(p(x)p(y)))         (since p(x, y) = p(x)p(y))
        = log ((1/p(x)) · (1/p(y)))
        = log (1/p(x)) + log (1/p(y))
        = h(x) + h(y).               (by def. of h(·))
Proof. (Continued)
By the definition of joint entropy and the auxiliary result above, when X and Y are independent we have

H(X, Y) = Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·(h(x) + h(y))
        = Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(x) + Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(y).

Note that the first term in this sum can be written as

Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(x) = Σ_{x∈AX} p(x)h(x) · Σ_{y∈AY} p(y)    (moving out constants)
                                = Σ_{x∈AX} p(x)h(x) · 1                (since Σ_{y∈AY} p(y) = 1)
                                = H(X),                                (by def. of H(·))
and the second term can be written as

Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(y) = Σ_{y∈AY} p(y)h(y) · Σ_{x∈AX} p(x)    (moving out constants)
                                = Σ_{y∈AY} p(y)h(y) · 1                (since Σ_{x∈AX} p(x) = 1)
                                = H(Y).                                (by def. of H(·))
Proof. (Continued)
Now we can substitute these two results back into the expansion of H(X, Y) above to obtain H(X, Y) = H(X) + H(Y).
The proof of the converse, i.e., that if H(X , Y ) = H(X ) + H(Y ) then X and
Y are independent, is similar and is left as an exercise.
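As a quick numerical sanity check of the result we just proved, the Python sketch below computes H(X, Y) for the product distribution of two made-up independent ensembles and compares it with H(X) + H(Y); the distributions pX and pY are arbitrary choices for illustration.

```python
import math

def H(p):
    """Entropy (in bits) of a probability distribution given as a list."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

# Two arbitrary (illustrative) independent ensembles.
pX = [0.5, 0.25, 0.25]
pY = [0.9, 0.1]

# Joint distribution under independence: p(x, y) = p(x) * p(y).
pXY = [px * py for px in pX for py in pY]

print(H(pXY))          # H(X, Y) ≈ 1.969
print(H(pX) + H(pY))   # H(X) + H(Y) ≈ 1.969, matching H(X, Y)
```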
Our goal in this class is to convince you of the following three claims:
1. The Shannon information content h(x = a_i) = log2(1/p(x = a_i)) is a sensible measure of the information content of the outcome x = a_i.
2. The entropy H(X) = Σ_{x∈AX} p(x)·log2(1/p(x)) is a sensible measure of the expected information content of an ensemble X.
3. The Source Coding Theorem: N outcomes from a source X can be compressed into roughly N · H(X) bits.
a) The less probable an outcome is, the more informative is its happening.
Or, the more “surprising” an outcome is, the more informative it is.
The function h(x) captures this intuition, since it grows as the probability of x diminishes (as illustrated in the numerical sketch after item (b) below).
b) If an outcome x happens with certainty (i.e., p(x) = 1), the occurrence of this
outcome conveys no information:
h(x) = log2(1/p(x)) = log2(1/1) = 0.
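The following small Python sketch illustrates both intuitions at once: h(x) = 0 when p(x) = 1, and h(x) grows as p(x) shrinks (the probability values in the loop are arbitrary illustrative choices).

```python
import math

def h(p):
    """Shannon information content (in bits) of an outcome with probability p."""
    return math.log2(1 / p)

for p in [1.0, 0.5, 0.25, 0.01]:
    print(f"p = {p:<4}  h = {h(p):.2f} bits")
# p = 1.0 gives h = 0.00: a certain outcome conveys no information;
# as p shrinks, h grows, matching intuition (a) above.
```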
a) (The weighing problem.) You are given 12 balls, all equal in weight except
for one that is either heavier or lighter.
You are also given a two-pan balance to use.
In each use of the balance you may put any number of the 12 balls on the left pan, and the same number on the right pan, and push a button to initiate the weighing; there are three possible outcomes: the left pan is heavier, the right pan is heavier, or the two pans balance.
Your task is to design a strategy to determine which is the odd ball and
whether it is heavier or lighter than the others in as few uses of the balance as
possible.
(i) The world may be in many different states, and you are uncertain about which
is the real one.
(ii) You have measurements (questions) that you can make (ask) to probe in what
state the world is.
(iii) Each measurement (question) produces an observation (answer) that allows you to rule out some states of the world as not possible.
(iv) At each time a subset of possible states is ruled out, you gain some
information about the real state of the world.
The information you have increases because your uncertainty about the real
state of the world decreases.
(v) The most efficient way of finding the actual state is to make the outcomes of every measurement (question) as close as possible to equally probable.
If your measurement (question) allows for n different outcomes (types of
answers), it is best to use them so to always split the set of still possible states
of the world into n sets of probability 1/n each.
(vi) The Shannon information content (in base 3) of the set of balls is
H(X) = Σ_{i=1}^{24} (1/24)·log3(1/(1/24)) = log3 24 ≈ 2.89,
which is just about the minimal number of measurements (3) needed in a best
strategy.
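The base-3 computation above amounts to a one-liner; here is a minimal Python check (the variable names are just for illustration):

```python
import math

n_states = 24                    # 12 balls, each possibly heavier or lighter
print(math.log(n_states, 3))     # ≈ 2.89 "trits": at least 3 weighings are needed
```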
b) (The guessing game.) Do you know the “20 questions” game? (If you don’t,
Google it, it’s a fun game to while away the time with friends during a long
trip.)
In a dumber version of the game, a friend thinks of a number between 0 and
15 and you have to guess which number was selected using yes/no questions.
What is the smallest number of questions needed to be guaranteed to identify
an integer between 0 and 15?
H(X) = Σ_{i=0}^{15} (1/16)·log2(1/(1/16)) = log2 16 = 4.
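The bound H(X) = 4 bits is achieved by the familiar halving strategy: each yes/no question splits the remaining candidates into two equally probable halves. A minimal Python sketch of that strategy is given below (the function guess and the question "is the number greater than m?" are illustrative choices, not the only optimal strategy).

```python
def guess(secret, lo=0, hi=15):
    """Identify an integer in [lo, hi] by asking yes/no questions of the form
    'is the number greater than m?'; each answer halves the candidate set."""
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if secret > mid:   # answer "yes"
            lo = mid + 1
        else:              # answer "no"
            hi = mid
    return lo, questions

print(guess(13))   # (13, 4): exactly 4 questions, matching H(X) = log2 16 = 4 bits
```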
(i) If we get a hit y on the k-th shot, the sequence of answers we get is a string x = n^(k−1) y, i.e., one of
y, ny, nny, . . ., nnnn. . .nny = n^62 y, nnnn. . .nnny = n^63 y.
Note that the use of symbols {y, n} or {0, 1} makes little difference: each
binary string uniquely identifies a result of the game.
Note also that we have binary strings of many different sizes.
(iii) Every outcome n^(k−1) y, no matter what value k assumes from 1 to 64, conveys the same amount of information: h(n^(k−1) y) = 6 bits.
6 bits is exactly the number of bits necessary to uniquely identify a square in a
set of 64!
(iv) The information contained in a binary string representing an object is not
necessarily the number of bits in the string: it is related to the object it
identifies within a set!
In our submarine example, all strings from y (which is 1 bit long) to n^63 y (which is 64 bits long) carry the same amount of information: 6 bits each.
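A small Python check of claims (iii) and (iv): assuming the submarine is hidden uniformly in one of the 64 squares and that each miss rules out one square, the probability of the whole outcome n^(k−1) y is 1/64 for every k, so its information content is always 6 bits (the function name h_sequence is just for illustration).

```python
import math

def h_sequence(k, squares=64):
    """Information content (in bits) of getting k-1 misses followed by a hit,
    when the submarine is uniformly hidden in one of `squares` squares."""
    p = 1.0
    remaining = squares
    for _ in range(k - 1):           # k-1 misses, each eliminating one square
        p *= (remaining - 1) / remaining
        remaining -= 1
    p *= 1 / remaining               # the hit on shot k
    return math.log2(1 / p)

for k in [1, 32, 64]:
    print(k, h_sequence(k))          # ≈ 6 bits for every k
```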
Our third claim is that N outcomes from a source X can be compressed into
roughly N · H(X ) bits.
In other words, our claim is that the number of bits necessary to compress N outcomes of a source grows linearly with N, with the entropy of the source as the proportionality constant.
This claim implies an intimate connection between data compression and the
measure of information content of the source.
Before giving support for our third claim, let us understand better what we
mean by “data compression”.
Consider a source that produces a sequence of outcomes X1, X2, X3, . . ., each drawn from an ensemble X. The raw bit content of X is
H0(X) = log2 |AX|.
H0 is a measure of information of X :
it is a bound on the number of bits necessary to encode elements of X as
codewords.
Exercise. Can a compressor map every possible file to a strictly shorter codeword, without ever mapping two files to the same codeword?
Solution. No! Just use the pigeonhole principle to verify that. (This exercise is part of your homework assignment for this lecture.)
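The counting behind the pigeonhole answer can be made explicit: there are strictly fewer binary strings of length less than N than there are N-bit files, so some two files must collide. A minimal sketch (the file length N = 8 is an arbitrary illustrative choice):

```python
# Counting argument behind the pigeonhole answer.
N = 8
files = 2 ** N                                     # number of N-bit files
shorter_strings = sum(2 ** k for k in range(N))    # strings of length 0 .. N-1
print(files, shorter_strings)                      # 256 > 255: two files must share a codeword
```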
1. A lossy compressor maps all files to shorter codewords, but that means that
sometimes two or more files will necessarily be mapped to the same codeword.
The decompressor will be, in this case, unsure of how to decompress
ambiguous codewords, leading to a failure.
Calling δ the probability that the source file is one of the confusable files, a
lossy compressor has probability δ of failure.
If δ is small, the compressor is acceptable, but with some loss of information
(i.e., not all codewords are guaranteed to be decompressed correctly).
In the remainder of this lecture we will cover a simple lossy compressor, and
in future lectures we will cover lossless compressors.
Example 3 Let
AX = {a, b, c, d, e, f, g, h}, and
PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64}.
The raw bit content of this ensemble is
H0 = log2 |AX | = log2 8 = 3 bits,
so to represent any symbol of the ensemble we need, in principle, codewords
of 3 bits each.
But notice that a small set contains almost all the probability:
p(x ∈ {a, b, c, d}) = 15/16.
Lossy data compression
The quantity Hδ(X) = log2 |Sδ|, where Sδ is the smallest subset of AX whose total probability is at least 1 − δ, measures how many bits we need per outcome once we accept a probability δ of failure.
For the ensemble of Example 3, with
AX = {a, b, c, d, e, f, g, h}, and
PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64},
we have the values sketched below:
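A small Python sketch that computes Hδ(X) for this ensemble by greedily keeping the most probable outcomes until their total probability reaches 1 − δ (the function name H_delta and the chosen values of δ are just for illustration):

```python
import math

# The ensemble of Example 3.
probs = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]

def H_delta(probs, delta):
    """log2 of the size of the smallest subset S_delta whose total probability
    is at least 1 - delta (greedily keep the most probable outcomes)."""
    total, size = 0.0, 0
    for p in sorted(probs, reverse=True):
        if total >= 1 - delta:
            break
        total += p
        size += 1
    return math.log2(size)

for delta in [0, 1/64, 1/16]:
    print(delta, H_delta(probs, delta))
# delta = 0    -> log2 8 = 3 bits (the raw bit content H0)
# delta = 1/64 -> log2 7 ≈ 2.81 bits
# delta = 1/16 -> log2 4 = 2 bits (only {a, b, c, d} is kept)
```

Allowing even a tiny failure probability (δ = 1/16) drops the required number of bits from 3 to 2, since the four most probable symbols {a, b, c, d} already account for 15/16 of the probability.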
Data compression of groups of symbols
If X^N denotes the ensemble of N independent, identically distributed outcomes of X, then H(X^N) = N · H(X).
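As a sanity check of this identity, the sketch below builds the product distribution of N i.i.d. copies of a made-up binary source and compares the entropy of the block with N times the entropy of a single symbol (the distribution p and the block length N are arbitrary illustrative choices):

```python
import math
from itertools import product

def H(p):
    """Entropy (in bits) of a probability distribution given as a list."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

p = [0.9, 0.1]                                              # an arbitrary binary source
N = 4
pN = [math.prod(block) for block in product(p, repeat=N)]   # i.i.d. blocks of N symbols
print(H(pN), N * H(p))                                      # both ≈ 1.876: H(X^N) = N * H(X)
```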
Example 5 (Continued)
It seems that as N grows, (1/N)·Hδ(X^N) flattens out (i.e., becomes constant), independently of the value of the error δ tolerated.
What is the fixed value that (1/N)·Hδ(X^N) tends to as N grows?
Shannon’s source coding theorem tells us what it is...
The Source Coding Theorem
3. the maximum achievable compression will use approximately H(X) bits per symbol if the number N of symbols being encoded is large enough.
To see why Shannon’s coding theorem is true we will use the concept of a
typical string generated by the source.
A typical string x_typ of N symbols generated by the source has probability P(x_typ) ≈ 2^(−N·H(X)).
Using the observations above, we define the typical set to be the set of typical strings, up to a tolerance β ≥ 0:
T_Nβ = { x ∈ AX^N : | (1/N)·log2(1/P(x)) − H(X) | < β }.
1. The typical set T_Nβ contains almost all the probability: p(x ∈ T_Nβ) ≈ 1.
2. The typical set T_Nβ contains roughly 2^(N·H(X)) elements: |T_Nβ| ≈ 2^(N·H(X)).
That is a consequence of the fact that the typical set has probability almost 1, and the probability of a typical element is 2^(−N·H(X)), so the set must have about 1/2^(−N·H(X)) = 2^(N·H(X)) elements.
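To see the claim P(x_typ) ≈ 2^(−N·H(X)) numerically, the sketch below draws one long string from a made-up Bernoulli source and checks that −(1/N)·log2 P(x) is close to H(X) (the bias p1 = 0.1 and the length N = 10 000 are arbitrary illustrative choices):

```python
import math
import random

p1 = 0.1                                   # arbitrary Bernoulli source: P(x = 1)
H = p1 * math.log2(1 / p1) + (1 - p1) * math.log2(1 / (1 - p1))   # ≈ 0.469 bits/symbol

N = 10_000
x = [1 if random.random() < p1 else 0 for _ in range(N)]   # one string from the source
k = sum(x)                                                 # number of 1s in the string
log2_P = k * math.log2(p1) + (N - k) * math.log2(1 - p1)   # log2 P(x)
print(-log2_P / N, H)      # the two values are close: P(x) ≈ 2^(-N * H(X))
```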
We know that p(x ∈ Sδ) ≥ 1 − δ and that p(x ∈ T_Nβ) ≈ 1. Hence we can conclude that the two sets must have a large intersection.
At the beginning of this lecture we set the goal to convince you of three
claims:
1. The Shannon information content h(x = a_i) = log2(1/p(x = a_i)) is a sensible measure of the information content of the outcome x = a_i.
2. The entropy H(X) = Σ_{x∈AX} p(x)·log2(1/p(x)) is a sensible measure of the expected information content of an ensemble X.
3. The Source Coding Theorem: N outcomes from a source X can be
compressed into roughly N · H(X ) bits.