The Source Coding Theorem
Mário S. Alvim
([email protected])
Information Theory
DCC-UFMG
(2020/01)
We will also see how the measures above are connected with the compression
of sources of information.
Convention. From now on, log x stands for log2 x, unless otherwise stated.
We also adopt the convention that 0 · log2(1/0) = 0, justified by the limit lim_{p→0⁺} p·log2(1/p) = 0.
Example 1 (Continued)
H(X) ≥ 0.
E[f(x)] ≥ f(E[x]) (Jensen's inequality, for a convex function f).
Proof.
We start by proving the following auxiliary result for the case in which p(x, y) = p(x)p(y):
h(x, y) = log (1/p(x, y))            (by def. of h(·))
        = log (1/(p(x)p(y)))         (since p(x, y) = p(x)p(y))
        = log ((1/p(x)) · (1/p(y)))
        = log (1/p(x)) + log (1/p(y))
        = h(x) + h(y).               (by def. of h(·))
Proof. (Continued)
By the definition of joint entropy and the auxiliary result above, when X and Y are independent we have

H(X, Y) = Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·(h(x) + h(y))
        = Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(x) + Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(y).

Note that the first term in this sum can be written as

Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(x) = Σ_{x∈AX} p(x)h(x) · Σ_{y∈AY} p(y)    (moving out constants)
                                = Σ_{x∈AX} p(x)h(x) · 1                (since Σ_{y∈AY} p(y) = 1)
                                = H(X),                                (by def. of H(·))
and the second term can be written as

Σ_{x∈AX} Σ_{y∈AY} p(x)p(y)·h(y) = Σ_{y∈AY} p(y)h(y) · Σ_{x∈AX} p(x)    (moving out constants)
                                = Σ_{y∈AY} p(y)h(y) · 1                (since Σ_{x∈AX} p(x) = 1)
                                = H(Y).                                (by def. of H(·))
Proof. (Continued)
Now we can substitute these two results back into the expansion of H(X, Y) above to obtain H(X, Y) = H(X) + H(Y).
The proof of the converse, i.e., that if H(X , Y ) = H(X ) + H(Y ) then X and
Y are independent, is similar and is left as an exercise.
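As a quick numerical sanity check of the result we just proved, the Python sketch below computes H(X, Y) for the product distribution of two made-up independent ensembles and compares it with H(X) + H(Y); the distributions pX and pY are arbitrary choices for illustration.

```python
import math

def H(p):
    """Entropy (in bits) of a probability distribution given as a list."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

# Two arbitrary (illustrative) independent ensembles.
pX = [0.5, 0.25, 0.25]
pY = [0.9, 0.1]

# Joint distribution under independence: p(x, y) = p(x) * p(y).
pXY = [px * py for px in pX for py in pY]

print(H(pXY))          # H(X, Y) ≈ 1.969
print(H(pX) + H(pY))   # H(X) + H(Y) ≈ 1.969, matching H(X, Y)
```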
Our goal in this class is to convince you of the following three claims:
1. The Shannon information content h(x = a_i) = log2(1/p(x = a_i)) is a sensible measure of the information content of the outcome x = a_i.
2. The entropy H(X) = Σ_{x∈AX} p(x)·log2(1/p(x)) is a sensible measure of the expected information content of an ensemble X.
3. The Source Coding Theorem: N outcomes from a source X can be compressed into roughly N · H(X) bits.
a) The less probable an outcome is, the more informative is its happening.
Or, the more “surprising” an outcome is, the more informative it is.
The function h(x) captures this intuition, since it grows as the probability of x diminishes (as illustrated in the numerical sketch after item (b) below).
b) If an outcome x happens with certainty (i.e., p(x) = 1), the occurrence of this
outcome conveys no information:
h(x) = log2(1/p(x)) = log2(1/1) = 0.
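The following small Python sketch illustrates both intuitions at once: h(x) = 0 when p(x) = 1, and h(x) grows as p(x) shrinks (the probability values in the loop are arbitrary illustrative choices).

```python
import math

def h(p):
    """Shannon information content (in bits) of an outcome with probability p."""
    return math.log2(1 / p)

for p in [1.0, 0.5, 0.25, 0.01]:
    print(f"p = {p:<4}  h = {h(p):.2f} bits")
# p = 1.0 gives h = 0.00: a certain outcome conveys no information;
# as p shrinks, h grows, matching intuition (a) above.
```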
a) (The weighing problem.) You are given 12 balls, all equal in weight except
for one that is either heavier or lighter.
You are also given a two-pan balance to use.
In each use of the balance you may put any number of the 12 balls on the left pan, and the same number on the right pan, and push a button to initiate the weighing; there are three possible outcomes: the left pan is heavier, the right pan is heavier, or the two pans balance.
Your task is to design a strategy to determine which is the odd ball and
whether it is heavier or lighter than the others in as few uses of the balance as
possible.
(i) The world may be in many different states, and you are uncertain about which
is the real one.
(ii) You have measurements (questions) that you can make (ask) to probe in what
state the world is.
(iii) Each measurement (question) produces an observation (answer) that allows you to rule out some states of the world as not possible.
(iv) At each time a subset of possible states is ruled out, you gain some
information about the real state of the world.
The information you have increases because your uncertainty about the real
state of the world decreases.
(v) The most efficient way of finding the actual state is to make the outcomes of every measurement (question) as close as possible to equally probable.
If your measurement (question) allows for n different outcomes (types of
answers), it is best to use them so to always split the set of still possible states
of the world into n sets of probability 1/n each.
(vi) The Shannon information content (in base 3) of the set of balls is
H(X) = Σ_{i=1}^{24} (1/24)·log3(1/(1/24)) = log3 24 ≈ 2.89,
which is just about the minimal number of measurements (3) needed in a best
strategy.
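The base-3 computation above amounts to a one-liner; here is a minimal Python check (the variable names are just for illustration):

```python
import math

n_states = 24                    # 12 balls, each possibly heavier or lighter
print(math.log(n_states, 3))     # ≈ 2.89 "trits": at least 3 weighings are needed
```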
b) (The guessing game.) Do you know the “20 questions” game? (If you don’t,
Google it, it’s a fun game to while away the time with friends during a long
trip.)
In a dumber version of the game, a friend thinks of a number between 0 and
15 and you have to guess which number was selected using yes/no questions.
What is the smallest number of questions needed to be guaranteed to identify
an integer between 0 and 15?
H(X) = Σ_{i=0}^{15} (1/16)·log2(1/(1/16)) = log2 16 = 4.
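The bound H(X) = 4 bits is achieved by the familiar halving strategy: each yes/no question splits the remaining candidates into two equally probable halves. A minimal Python sketch of that strategy is given below (the function guess and the question "is the number greater than m?" are illustrative choices, not the only optimal strategy).

```python
def guess(secret, lo=0, hi=15):
    """Identify an integer in [lo, hi] by asking yes/no questions of the form
    'is the number greater than m?'; each answer halves the candidate set."""
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if secret > mid:   # answer "yes"
            lo = mid + 1
        else:              # answer "no"
            hi = mid
    return lo, questions

print(guess(13))   # (13, 4): exactly 4 questions, matching H(X) = log2 16 = 4 bits
```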
(i) If we get a hit y on the k-th shot, the sequence of answers we get is a string x = n^(k−1) y, i.e., one of
y, ny, nny, . . ., nnnn. . .nny = n^62 y, nnnn. . .nnny = n^63 y.
Note that the use of symbols {y, n} or {0, 1} makes little difference: each
binary string uniquely identifies a result of the game.
Note also that we have binary strings of many different sizes.
(iii) Every outcome n^(k−1) y, no matter what value k assumes from 1 to 64, conveys the same amount of information: h(n^(k−1) y) = 6 bits.
6 bits is exactly the number of bits necessary to uniquely identify a square in a
set of 64!
(iv) The information contained in a binary string representing an object is not
necessarily the number of bits in the string: it is related to the object it
identifies within a set!
In our submarine example, all strings from y (which is 1 bit long) to n^63 y (which is 64 bits long) carry the same amount of information: 6 bits each.
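A small Python check of claims (iii) and (iv): assuming the submarine is hidden uniformly in one of the 64 squares and that each miss rules out one square, the probability of the whole outcome n^(k−1) y is 1/64 for every k, so its information content is always 6 bits (the function name h_sequence is just for illustration).

```python
import math

def h_sequence(k, squares=64):
    """Information content (in bits) of getting k-1 misses followed by a hit,
    when the submarine is uniformly hidden in one of `squares` squares."""
    p = 1.0
    remaining = squares
    for _ in range(k - 1):           # k-1 misses, each eliminating one square
        p *= (remaining - 1) / remaining
        remaining -= 1
    p *= 1 / remaining               # the hit on shot k
    return math.log2(1 / p)

for k in [1, 32, 64]:
    print(k, h_sequence(k))          # ≈ 6 bits for every k
```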
Our third claim is that N outcomes from a source X can be compressed into
roughly N · H(X ) bits.
In other words, our claim is that the number of bits necessary to compress N outcomes of a source grows linearly with N, with the entropy of the source as the proportionality constant.
This claim implies an intimate connection between data compression and the
measure of information content of the source.
Before giving support for our third claim, let us understand better what we
mean by “data compression”.
Consider a source that produces a sequence of outcomes X1, X2, X3, . . ., each drawn from an ensemble X. The raw bit content of X is
H0(X) = log2 |AX|.
H0 is a measure of information of X :
it is a bound on the number of bits necessary to encode elements of X as
codewords.
Exercise. Can a compressor map every possible file to a strictly shorter codeword, without ever mapping two files to the same codeword?
Solution. No! Just use the pigeonhole principle to verify that. (This exercise is part of your homework assignment for this lecture.)
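The counting behind the pigeonhole answer can be made explicit: there are strictly fewer binary strings of length less than N than there are N-bit files, so some two files must collide. A minimal sketch (the file length N = 8 is an arbitrary illustrative choice):

```python
# Counting argument behind the pigeonhole answer.
N = 8
files = 2 ** N                                     # number of N-bit files
shorter_strings = sum(2 ** k for k in range(N))    # strings of length 0 .. N-1
print(files, shorter_strings)                      # 256 > 255: two files must share a codeword
```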
1. A lossy compressor maps all files to shorter codewords, but that means that
sometimes two or more files will necessarily be mapped to the same codeword.
The decompressor will be, in this case, unsure of how to decompress
ambiguous codewords, leading to a failure.
Calling δ the probability that the source file is one of the confusable files, a
lossy compressor has probability δ of failure.
If δ is small, the compressor is acceptable, but with some loss of information
(i.e., not all codewords are guaranteed to be decompressed correctly).
In the remainder of this lecture we will cover a simple lossy compressor, and
in future lectures we will cover lossless compressors.
Example 3 Let
AX = {a, b, c, d, e, f, g, h}, and
PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64}.
The raw bit content of this ensemble is
H0 = log2 |AX | = log2 8 = 3 bits,
so to represent any symbol of the ensemble we need, in principle, codewords
of 3 bits each.
But notice that a small set contains almost all the probability:
p(x ∈ {a, b, c, d}) = 15/16.
Lossy data compression
The quantity Hδ(X) = log2 |Sδ|, where Sδ is the smallest subset of AX whose total probability is at least 1 − δ, measures how many bits we need per outcome once we accept a probability δ of failure.
For the ensemble of Example 3, with
AX = {a, b, c, d, e, f, g, h}, and
PX = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64},
we have the values sketched below:
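A small Python sketch that computes Hδ(X) for this ensemble by greedily keeping the most probable outcomes until their total probability reaches 1 − δ (the function name H_delta and the chosen values of δ are just for illustration):

```python
import math

# The ensemble of Example 3.
probs = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]

def H_delta(probs, delta):
    """log2 of the size of the smallest subset S_delta whose total probability
    is at least 1 - delta (greedily keep the most probable outcomes)."""
    total, size = 0.0, 0
    for p in sorted(probs, reverse=True):
        if total >= 1 - delta:
            break
        total += p
        size += 1
    return math.log2(size)

for delta in [0, 1/64, 1/16]:
    print(delta, H_delta(probs, delta))
# delta = 0    -> log2 8 = 3 bits (the raw bit content H0)
# delta = 1/64 -> log2 7 ≈ 2.81 bits
# delta = 1/16 -> log2 4 = 2 bits (only {a, b, c, d} is kept)
```

Allowing even a tiny failure probability (δ = 1/16) drops the required number of bits from 3 to 2, since the four most probable symbols {a, b, c, d} already account for 15/16 of the probability.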
Data compression of groups of symbols
If X^N denotes the ensemble of N independent, identically distributed outcomes of X, then H(X^N) = N · H(X).
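As a sanity check of this identity, the sketch below builds the product distribution of N i.i.d. copies of a made-up binary source and compares the entropy of the block with N times the entropy of a single symbol (the distribution p and the block length N are arbitrary illustrative choices):

```python
import math
from itertools import product

def H(p):
    """Entropy (in bits) of a probability distribution given as a list."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

p = [0.9, 0.1]                                              # an arbitrary binary source
N = 4
pN = [math.prod(block) for block in product(p, repeat=N)]   # i.i.d. blocks of N symbols
print(H(pN), N * H(p))                                      # both ≈ 1.876: H(X^N) = N * H(X)
```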
Example 5 (Continued)
It seems that as N grows, (1/N)·Hδ(X^N) flattens out (i.e., becomes constant), independently of the value of the error δ tolerated.
What is the fixed value that (1/N)·Hδ(X^N) tends to as N grows?
Shannon’s source coding theorem tells us what it is...
The Source Coding Theorem
3. the maximum achievable compression will use approximately H(X) bits per symbol if the number N of symbols being encoded is large enough.
To see why Shannon’s coding theorem is true we will use the concept of a
typical string generated by the source.
A typical string x_typ of N symbols generated by the source has probability P(x_typ) ≈ 2^(−N·H(X)).
Using the observations above, we define the typical set to be the set of typical strings, up to a tolerance β ≥ 0:
T_Nβ = { x ∈ AX^N : | (1/N)·log2(1/P(x)) − H(X) | < β }.
1. The typical set T_Nβ contains almost all the probability: p(x ∈ T_Nβ) ≈ 1.
2. The typical set T_Nβ contains roughly 2^(N·H(X)) elements: |T_Nβ| ≈ 2^(N·H(X)).
That is a consequence of the fact that the typical set has probability almost 1, and the probability of a typical element is 2^(−N·H(X)), so the set must have about 1/2^(−N·H(X)) = 2^(N·H(X)) elements.
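To see the claim P(x_typ) ≈ 2^(−N·H(X)) numerically, the sketch below draws one long string from a made-up Bernoulli source and checks that −(1/N)·log2 P(x) is close to H(X) (the bias p1 = 0.1 and the length N = 10 000 are arbitrary illustrative choices):

```python
import math
import random

p1 = 0.1                                   # arbitrary Bernoulli source: P(x = 1)
H = p1 * math.log2(1 / p1) + (1 - p1) * math.log2(1 / (1 - p1))   # ≈ 0.469 bits/symbol

N = 10_000
x = [1 if random.random() < p1 else 0 for _ in range(N)]   # one string from the source
k = sum(x)                                                 # number of 1s in the string
log2_P = k * math.log2(p1) + (N - k) * math.log2(1 - p1)   # log2 P(x)
print(-log2_P / N, H)      # the two values are close: P(x) ≈ 2^(-N * H(X))
```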
We know that p(x ∈ Sδ) ≥ 1 − δ and that p(x ∈ T_Nβ) ≈ 1. Hence we can conclude that the two sets must have a large intersection.
At the beginning of this lecture we set the goal to convince you of three
claims:
1. The Shannon information content h(x = a_i) = log2(1/p(x = a_i)) is a sensible measure of the information content of the outcome x = a_i.
2. The entropy H(X) = Σ_{x∈AX} p(x)·log2(1/p(x)) is a sensible measure of the expected information content of an ensemble X.
3. The Source Coding Theorem: N outcomes from a source X can be
compressed into roughly N · H(X ) bits.