

Information Theory
Degree in Data Science and Engineering
Lesson 3: Information of data sources

Jordi Quer, Josep Vidal

Mathematics Department, Signal Theory and Communications Department


{jordi.quer, josep.vidal}@upc.edu

2019/20 - Q1


Sources of data and information

Most natural signals or symbols generated by a source convey information in a very dilute form: a large amount of data contains only a small amount of information. There are two main reasons for this:

Data that are close to each other tend to have similar values (e.g. pixels
in an image, pixels in consecutive images in a video sequence, temporal
samples in an audio signal), or are related to each other (e.g. letters in
English, video recordings of the same scene at closely spaced cameras,
samples of stereo audio recordings).
Not all values generated by our source of data are equally frequent. We already know that the less frequent ones carry more information.


Examples of redundant sources

Neighbouring pixels of an image.


Examples of redundant sources

Samples obtained from the digitisation of an audio signal.

Predictability of letters in English increases with the context:

”Oh my God, the volcano is eru...”


Examples of redundant sources

The frequency of letters in English is far from uniform in normal text (values have been estimated from The Frequently Asked Questions Manual for Linux).


Purpose of the chapter

Can we find short and yet reversible descriptions of a sequence of random observations $x_1 x_2 \ldots x_n$?

How short can this description be? How much can we compress data?

This will be highly relevant for the purposes of storage and communication of sequences of symbols.


Alphabets

Values generated by a source of random data $X$ belong to an alphabet, which is a finite set $\mathcal{X} = \{a_1, a_2, \ldots, a_q\}$ of $|\mathcal{X}| = q$ elements $a_i$, called letters or symbols. For example:

Many natural languages are written using variants of the ISO basic Latin alphabet of 26 letters:
a, b, c, d, e, f, g, h, i, j, k, l, m,
n, o, p, q, r, s, t, u, v, w, x, y, z.

The Greek alphabet of 24 letters:
α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ,
ν, ξ, ο, π, ρ, σ, τ, υ, φ, χ, ψ, ω.

The DNA alphabet of 4 letters $\mathcal{X} = \{A, C, T, G\}$, representing the four nucleotides adenine, cytosine, thymine, guanine, used in genetics to write the genome.


Alphabets

The Braille alphabet of 64 letters:

A digital image is written in the alphabet of pixels, whose letters are d-bit
numbers, with d the image color bit depth.
A digital sound is written in the alphabet of wave samples, whose letters
are d-bit numbers, with d the audio bit depth.
Of course, the most important to us is the binary alphabet X = {0, 1}
consisting of symbols 0 and 1.


Blocks and strings

A sequence of letters of $\mathcal{X}$ is called a word, block, string, chain, text, message, etc., depending on the context.
The name word is mostly used for sequences of short fixed length, or for the sequences belonging to a certain particular set (a code).

$\mathcal{X}^n$ is the set of the $q^n$ words (or blocks) of $n$ letters:

$$\mathcal{X}^n = \{x^n = a_1 a_2 \ldots a_n : a_i \in \mathcal{X}\}.$$

$\mathcal{X}^*$ is the infinite set of strings of arbitrary length:

$$\mathcal{X}^* = \{x^* = a_1 a_2 \ldots a_n : a_i \in \mathcal{X},\ n \ge 0\} = \bigcup_{n \ge 0} \mathcal{X}^n.$$

We denote by $\ell(x^n) = n$ the length of the string (number of letters). The empty string $\epsilon$, with $\ell(\epsilon) = 0$, is considered an element of $\mathcal{X}^*$.
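To make the notation concrete, here is a minimal Python sketch (not part of the original slides) that enumerates $\mathcal{X}^n$ for a toy alphabet and checks that $|\mathcal{X}^n| = q^n$:

```python
from itertools import product

# Toy illustration: enumerate X^n for a small alphabet and check |X^n| = q^n.
X = ['a', 'b', 'c']                 # alphabet with q = 3 letters
n = 2
Xn = [''.join(w) for w in product(X, repeat=n)]
print(Xn)                           # ['aa', 'ab', 'ac', 'ba', ..., 'cc']
print(len(Xn) == len(X) ** n)       # True: there are q^n = 9 words of length n
```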


Codes

A source code $C_n$ is a mapping that re-labels an $n$-length sequence of symbols belonging to an alphabet into a codeword of symbols belonging, possibly, to another alphabet. The value of $n$ is chosen when designing the code:

$$C_n : \mathcal{X}^n \to \mathcal{B}^*, \qquad c = C_n(x^n)$$

If the codeword length $\ell(c)$ is the same for all $x^n_i$, we have a fixed-length code. Otherwise, the code is said to be a variable-length code.


Codes

Some definitions for a code...

Extension of the code: the concatenation of codewords.

Codebook: the set of codewords corresponding to the set of source words,
$$\mathcal{C}_n = \{c : c = C_n(x^n),\ x^n \in \mathcal{X}^n\}$$

Non-singularity: no two different source words get mapped to the same codeword,
$$c = C_n(x_1^n) = C_n(x_2^n) \implies x_1^n = x_2^n$$
that is, $C_n(x^n)$ is an injective mapping. This ensures that the encoding is reversible and that zero-error decoding is possible. All codes will be injective.

Rate of the code: the ratio of the encoded sequence length to the source sequence length.


Efficiency

Coding is more efficient if words of $n > 1$ symbols are encoded. Let us take a fair 6-sided die, and call $X$ the random variable associated with the outcome of a throw.

$\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$ and $H(X) = \log 6 = 2.58$ bits/outcome.

If we want to encode a sequence of outcomes using a binary alphabet $\{0, 1\}$, $C_1 = \{000, 001, 010, 011, 100, 101, 110, 111\}$ are the 8 codewords needed (we recognise that two of them will not be used). The rate is 3 binary digits/outcome.
Efficiency of the code: $\eta = \frac{H(X)}{3} = 0.86$ bits/binary digit.

We can do better if we encode words of 4 outcomes in each codeword. There are now $|\mathcal{X}^4| = 6^4 = 1296$ possible source words, for which we need codewords of 11 binary digits ($2^{11} = 2048 \ge 1296$). The rate is $11/4 = 2.75$ binary digits/outcome.
Efficiency of the code: $\eta = \frac{H(X)}{2.75} = 0.938$ bits/binary digit.

These are fixed-length codes.
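The numbers above can be reproduced with a short script; the following sketch (an illustration, not course code) also shows how the rate approaches $\log 6$ as $n$ grows:

```python
import math

# Fixed-length coding of n die outcomes with ceil(n * log2 6) binary digits.
q = 6                                    # |X| = 6 equiprobable outcomes
H = math.log2(q)                         # H(X) = log 6 ≈ 2.585 bits/outcome

for n in (1, 4, 10, 100):
    bits = math.ceil(n * math.log2(q))   # codeword length in binary digits
    rate = bits / n                      # binary digits per outcome
    eta = H / rate                       # efficiency of the code
    print(f"n={n:3d}  rate={rate:.3f}  efficiency={eta:.3f}")
# n=1 gives rate 3 and efficiency 0.86; n=4 gives rate 2.75 and efficiency 0.94.
```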


Efficiency
For this code, the number of binary digits per outcome, $\frac{1}{n}\lceil n \log |\mathcal{X}| \rceil$, is asymptotically decreasing with $n$ down to $\log |\mathcal{X}|$.

In general, the efficiency of the code can also be defined for $q$-ary codewords:
$$\eta = \frac{H_q(X)}{q\text{-ary symbols per outcome}}$$
where $H_q(X) = -\sum_{i=1}^{q} p(x_i) \log_q p(x_i)$.

Teaser...

A sequence of letters is the observation of a sequence of random variables (that is, a stochastic process) governed by a probability distribution. The probability distribution provides all the necessary information about the achievable bounds for data compression.

In this lesson, we prove that it is possible to encode an $n$-length sequence of random symbols with an efficiency that approaches 1 with high accuracy, when $n$ is large enough.

We will assume first that $X_1 X_2 \ldots X_n$ are independent and identically distributed (i.i.d.) random variables, and will drop the assumption at the end.


Stochastic process
A stochastic process is an infinite sequence X = X1 X2 X3 . . . of random
variables, each taking values in the same set X . A finite set of n random
variables will be denoted by X n = X1 X2 . . . Xn .
Examples of sources generating stochastic processes:
a language model for English, Catalan, etc. with an alphabet
X = {a, b, . . . , z, -, !, ?, (, ), . . . , ;, :} that includes letters, space and
punctuation characters,
a sequence of n dice throws,
the n samples of a sampled recording of a particular phoneme pronounced
by many speakers,
such that a given length-$n$ sequence $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ is associated with a certain probability:

$$p(x_1, x_2, \ldots, x_n) = \Pr(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$$

satisfying the usual probability rules (joint, marginal, conditional). Of course, the random variables $X_i$ may not be mutually independent.


Stationary stochastic process


A stochastic process is said to be stationary if the joint distribution is invariant
to shifts in the index

Pr(X1 = x1 , X2 = x2 , . . . , Xn = xn )
= Pr(X1+l = x1 , X2+l = x2 , . . . , Xn+l = xn )

for every shift l and for all values xi ∈ X .


As a consequence, the statistical quantities that can be computed on each $X_i$ do not depend on the index:

$$E[g(X_1)] = E[g(X_2)] = \cdots = E[g(X_n)]$$

where $g$ is a function that applies to the value taken by a random variable $X_i$, and hence its output is itself a random variable.
For example, take $g(x) = x^2$. If the process is stationary, the second-order moment does not depend on the index and we can write

$$E[X_i^2] = E[X^2] \quad \forall i$$


Stationary stochastic process

A process is called i.i.d. (independent and identically distributed) if the random variables are mutually independent and share the same distribution:

$$\Pr(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} \Pr(X_i = x_i)$$

Or, in a simpler notation:

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i) \quad \text{with } x_i \in \mathcal{X}$$

If the random variables are associated with observations in time, then an i.i.d. process has no memory of past or future.


Ergodic processes
Imagine the following experiment: observe N times the output of a stochastic
process consisting of n random variables X1 X2 . . . Xn . Each observation of the
n values taken by these random variables is called a realization of the process.

Stochastic processes can be defined at will. For example, we could randomly


take N English books and observe the sequence of the first n letters. Or take
just one book, open at random N pages and observe the sequence of the first
n letters.

Ergodic processes
Assume the process is stationary. We define the process as ergodic if statistical quantities can be evaluated from temporal averages done on any single realization:

$$E[g(X)] = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} g(x_k)$$

In other words, if the process is ergodic, we do not need all possible realizations to infer statistical information about the process.
In general, ergodic processes have memory when the random variables $X_1, X_2, \ldots, X_n$ are not independent. A memory process can be generated from an i.i.d. process by passing it through a finite state machine (FSM), where the FSM generates outputs depending on the current and past values of the input. Only ergodic processes will be considered in the sequel.


A finite state machine producing English text


Assume the values adopted by X are a set of English words. This FSM can
generate a number of distinct ergodic output sequences if the inputs are
random binary values that select outputs of the states:

Some possible output sequences:


- THE COMMUNIST PARTY INVESTIGATED THE CONGRESS.
- THE CONGRESS INVESTIGATED THE COMMUNIST PARTY AND FOUND
EVIDENCE OF THE CONGRESS DESTROYED THE COMMUNIST PARTY.
- THE CONGRESS PURGED THE CONGRESS.
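Since the FSM diagram is not reproduced here, the following Python sketch uses a hypothetical set of states and transitions, chosen only so that random choices can produce sentences like the examples above; it is not the machine from the original figure.

```python
import random

# Hypothetical FSM: each state offers a list of (output word, next state) pairs,
# and a random input selects which branch to follow.
transitions = {
    "START":    [("THE", "SUBJECT")],
    "SUBJECT":  [("COMMUNIST PARTY", "VERB"), ("CONGRESS", "VERB")],
    "VERB":     [("INVESTIGATED", "OBJECT"), ("PURGED", "OBJECT"),
                 ("DESTROYED", "OBJECT"), ("FOUND EVIDENCE OF", "START")],
    "OBJECT":   [("THE", "SUBJECT2")],
    "SUBJECT2": [("COMMUNIST PARTY", "END"), ("CONGRESS", "END")],
}

def generate_sentence(rng):
    state, words = "START", []
    while state != "END":
        word, state = rng.choice(transitions[state])
        words.append(word)
    return " ".join(words) + "."

rng = random.Random(0)
for _ in range(3):
    print(generate_sentence(rng))
```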

A taxonomy of processes


Markov process

A Markov process is an ergodic stochastic process in which the past has no influence on the future, given the present. In general, in an order-$k$ Markov process, each variable depends only on the previous $k$:

Pr(Xn+k = xn+k |Xn+k−1 = xn+k−1 , . . . , X1 = x1 )


= Pr(Xn+k = xn+k |Xn+k−1 = xn+k−1 , . . . , Xn = xn ).

The simplest example of a discrete-time Markov process is an order 1 Markov


process:

Pr(Xn+1 = xn+1 |Xn = xn , Xn−1 = xn−1 , . . . , X1 = x1 )


= Pr(Xn+1 = xn+1 |Xn = xn ).

If the transition probabilities do not depend on n, the Markov process is called


invariant or homogeneous:

Pr(Xn+1 = x|Xn = y) = Pr(Xm+1 = x|Xm = y) ∀n, m


Markov process

For an order-1 Markov process, the joint probability is given by

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2|x_1)\,p(x_3|x_2) \ldots p(x_n|x_{n-1})$$

The probability distribution of the states at time $n+1$ is given by

$$p(x_{n+1}) = \sum_{x_n \in \mathcal{X}} p(x_n)\,p(x_{n+1}|x_n)$$

which can be written in matrix form as

$$\mathbf{p}(n+1) = \mathbf{p}(n)\mathbf{P}, \qquad n \ge 0$$

where
$[\mathbf{P}]_{i,j} = \Pr(X_{n+1} = x_j | X_n = x_i)$ are the elements of the transition matrix $\mathbf{P}$,
$[\mathbf{p}(n)]_j = \Pr(X_n = x_j)$ are the elements of the row vector $\mathbf{p}(n)$ containing the probabilities of all $|\mathcal{X}|$ states at time $n$.


Markov process

A Markov process is said to be invariant if the matrix $\mathbf{P}$ does not depend on $n$.

A Markov process is stationary if $\mathbf{p}(n)$ does not depend on $n$, so $\mathbf{p}(0)$ must be a left eigenvector of $\mathbf{P}$ with eigenvalue 1.

If all elements of $\mathbf{P}$ are positive, the Perron-Frobenius theorem applies to conclude that:
1. The largest left-eigenvalue of $\mathbf{P}$ is simple and its value is 1.
2. The entries of the associated eigenvector are real and positive.
3. All other left-eigenvalues are smaller in modulus (they may be complex).
In this case, it turns out that

$$\lim_{n \to \infty} \mathbf{p}(n) = \lim_{n \to \infty} \mathbf{p}(0)\mathbf{P}^n = \mathbf{p}$$

and hence $\mathbf{p}$ is an eigenvector of $\mathbf{P}$ associated with the unit left-eigenvalue.

For non-negative $\mathbf{P}$ other results apply (not considered here).


Exercise
A two-state Markov process, where $\mathcal{X} = \{\text{State 1}, \text{State 2}\}$, can be represented graphically as a two-node transition diagram, where $[\mathbf{P}]_{i,j} = \Pr(X_{n+1} = \text{State } j | X_n = \text{State } i)$. The transition matrix is given by

$$\mathbf{P} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$$

What are the asymptotic stationary probabilities $\mathbf{p}$ as $n \to \infty$, regardless of the value of $\mathbf{p}(0)$?
What is $\mathbf{p}(0)$ for the Markov process to be stationary? That is, such that $\mathbf{p}(n)$ does not depend on $n$.
What are the values of $\alpha$ and $\beta$ for an i.i.d. Markov process?
A numerical check is sketched below.
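A minimal sketch (not a proof), assuming example values of $\alpha$ and $\beta$ chosen only for illustration:

```python
import numpy as np

alpha, beta = 0.2, 0.5                      # example transition probabilities
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Asymptotic distribution: iterate p(n+1) = p(n) P from an arbitrary p(0).
p = np.array([1.0, 0.0])
for _ in range(1000):
    p = p @ P
print(p)                                    # converges to [beta, alpha] / (alpha + beta)
print(np.array([beta, alpha]) / (alpha + beta))

# This p is a left eigenvector of P with eigenvalue 1, so choosing p(0) = p makes
# the process stationary. If alpha = 1 - beta, both rows of P coincide and the
# process is i.i.d.
```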

Some preliminaries...

We will exploit some properties of stochastic processes to achieve efficient coding. A few well-known theorems are needed.

Lemma (Markov's inequality)
For any non-negative random variable $X$ and any $\alpha > 0$,
$$\Pr(X \ge \alpha) \le \frac{E[X]}{\alpha}$$

Proof. Define the indicator function $I(x, \alpha) = \begin{cases} 1 & x \ge \alpha \\ 0 & x < \alpha \end{cases}$
Since $X$ is non-negative, $I(X, \alpha) \le X/\alpha$, and therefore
$$\Pr(X \ge \alpha) = E[I(X, \alpha)] \le E[X/\alpha] = \frac{E[X]}{\alpha} \qquad \square$$

This inequality can be used to bound the tails of a distribution.



Some preliminaries...

Lemma (Chebyshev's inequality)
For any random variable $X$ with variance $\sigma_x^2$ and any $\beta > 0$,
$$\Pr(|X - E[X]| \ge \beta) \le \frac{\sigma_x^2}{\beta^2}$$

Proof. Consider
$$\Pr(|X - E[X]| \ge \beta) = \Pr\left(|X - E[X]|^2 \ge \beta^2\right) \le \frac{\sigma_x^2}{\beta^2}$$
where Markov's inequality has been applied in the last step. $\square$


Some preliminaries...

The average of $n$ i.i.d. random variables can be made arbitrarily close to their mean, with high probability, by increasing $n$.

Theorem (The weak law of large numbers)
Consider a sequence of i.i.d. random variables $X_i$, and let $\hat{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then
$$\Pr\left(\left|\hat{X}_n - E[\hat{X}_n]\right| \ge \epsilon\right) \le \frac{\sigma^2}{n\epsilon^2}$$
where $\sigma^2$ is the variance of $X_i$.

Proof. Apply Chebyshev's inequality to $\hat{X}_n$, whose variance is $\sigma^2/n$. $\square$

It is said that $\hat{X}_n$ converges in probability to $E[\hat{X}_n] = E[X_i]$:
$$\hat{X}_n \to E[X_i]$$
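A quick simulation (an illustration, not part of the slides) shows this convergence in probability: the empirical frequency of the event $|\hat{X}_n - E[X]| \ge \epsilon$ shrinks as $n$ grows.

```python
import numpy as np

# Sample means of i.i.d. Bernoulli(p) variables concentrate around E[X] = p.
rng = np.random.default_rng(0)
p, eps = 0.25, 0.05
for n in (10, 100, 1000, 10000):
    samples = rng.random((2000, n)) < p          # 2000 realizations of X_1 ... X_n
    means = samples.mean(axis=1)                 # hat{X}_n for each realization
    print(n, np.mean(np.abs(means - p) >= eps))  # empirical Pr(|hat{X}_n - p| >= eps)
```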


The asymptotic equipartition property (AEP)


The key theorem is...

Theorem
If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with distribution $p(X)$, then
$$-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) \to E[-\log p(X)] = H(X)$$
that is,
$$\lim_{n \to \infty} \Pr\left(\left|-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) - H(X)\right| \ge \epsilon\right) = 0$$

Proof. Apply the weak law of large numbers to the random variable $-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n)$, whose mean, calculated using the fact that the $X_i$ are i.i.d., is
$$E\left[-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n)\right] = E\left[-\frac{1}{n}\sum_{i=1}^{n}\log p(X_i)\right] = -\frac{1}{n}\sum_{i=1}^{n} E[\log p(X_i)]$$
$$= -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{q} p(x_{i,j})\log p(x_{i,j}) = \frac{1}{n}\sum_{i=1}^{n} H(X_i) = H(X) \qquad \square$$


Properties of typical sequences

Typical sequences are those whose probability satisfies $-\log p(x^n) \approx nH(X)$ to within a small value $n\epsilon$.
The set of typical sequences will be called $A_\epsilon^{(n)}$ and it is included in $\mathcal{X}^n$. It is the set of observations of $X^n = X_1 X_2 \ldots X_n$ such that

$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$

Therefore, all typical sequences have nearly the same probability.

Two relevant properties of the set are proved next.


Properties of typical sequences

What is the probability mass of the typical sequences?

Theorem (3.1)
For a sufficiently large value of $n$, $\Pr\left(A_\epsilon^{(n)}\right) > 1 - \epsilon$.

Proof. Apply the weak law of large numbers. $\square$

That is, the set contains most of the probability. How many typical sequences are there?

Theorem (3.2)
For any value of $n$, $\left|A_\epsilon^{(n)}\right| \le 2^{n(H(X)+\epsilon)}$.
For a sufficiently large value of $n$, $\left|A_\epsilon^{(n)}\right| \ge (1-\epsilon)\,2^{n(H(X)-\epsilon)}$.


Properties of typical sequences

Proof. For the upper bound,
$$1 = \sum_{x^n \in \mathcal{X}^n} p(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = 2^{-n(H(X)+\epsilon)} \left|A_\epsilon^{(n)}\right|$$

For the lower bound, take Theorem 3.1, so that
$$1 - \epsilon < \Pr\left(A_\epsilon^{(n)}\right) \le \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} = 2^{-n(H(X)-\epsilon)} \left|A_\epsilon^{(n)}\right| \qquad \square$$

The set may be small; its size depends on the entropy of $X$.


Example: the AEP in a Bernoulli process

Take a sequence of i.i.d. observations of an unfair coin. We'll check the properties of $A_\epsilon^{(n)}$. The sequence forms a stationary Bernoulli process, with distribution $\Pr(X = 1) = p$, $\Pr(X = 0) = 1 - p$.

The number of ones in a specific sequence $x^n$ is denoted by $k(x^n)$.
The probability of a specific sequence with $k$ ones is $\Pr(k) = p^k (1-p)^{n-k}$.
The number of sequences of length $n$ with $k$ ones is $N(k) = \binom{n}{k}$.
The probability of generating a sequence with $k$ ones is $\Pr(n, k) = \binom{n}{k} p^k (1-p)^{n-k}$.
$E[k] = pn$.
$\mathrm{std}(k) = \sqrt{E[(k - E[k])^2]} = \sqrt{np(1-p)}$. The standard deviation becomes small with respect to the mean as $n$ increases! These quantities are computed numerically in the sketch below.
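A minimal sketch (illustration only) computing the quantities listed above for $p = 1/4$ and $n = 25$:

```python
from math import comb, sqrt

p, n = 0.25, 25
for k in range(0, n + 1, 5):
    Nk = comb(n, k)                      # number of length-n sequences with k ones
    pr_seq = p**k * (1 - p)**(n - k)     # probability of one such sequence
    pr_nk = Nk * pr_seq                  # probability of observing k ones
    print(f"k={k:2d}  N(k)={Nk:7d}  Pr(seq)={pr_seq:.2e}  Pr(n,k)={pr_nk:.4f}")

print("E[k] =", p * n, "  std(k) =", sqrt(n * p * (1 - p)))
```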


Example: the AEP in a Bernoulli process

Take $p = 1/4$, $n = 25$.

(Two plots as a function of $k$: the number of different sequences of length $n$ with $k$ ones, and $\Pr(n, k)$, the probability of generating a sequence with $k$ ones.)

Example: the AEP in a Bernoulli process

Take $p = 1/4$ for increasing $n$.

(Two plots of $\Pr(n, k)$, the probability of generating a sequence with $k$ ones, as a function of $k$, for $n = 250$ and $n = 2500$.)

Example: the AEP in a Bernoulli process

From the plots it seems clear that the number of ones in a typical sequence is $k \approx pn$. Let us check it by evaluating the AEP:
$$\frac{1}{n}\log p(x^n) = \frac{1}{n}\log\left(p^k (1-p)^{n-k}\right) = \frac{1}{n}\left(k\log p + (n-k)\log(1-p)\right) \approx p\log p + (1-p)\log(1-p) = -H(X)$$

Case 1. For $n = 2500$, if $p = 1/4$, $H(X) = 0.8113$.
The number of typical sequences is $\approx 2^{nH(X)} = 2^{2029}$.

Case 2. For $n = 2500$, if $p = 1/100$, $H(X) = 0.08$.
The number of typical sequences is $\approx 2^{nH(X)} = 2^{200}$.
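An empirical check of the AEP for this source (a sketch, assuming nothing beyond the Bernoulli model above): for sampled sequences, $-\frac{1}{n}\log p(x^n)$ approaches $H(X)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.25
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)       # H(X) ≈ 0.8113 bits

for n in (25, 250, 2500, 25000):
    x = rng.random(n) < p                             # one realization of X^n
    k = int(x.sum())                                  # number of ones
    logp = k * np.log2(p) + (n - k) * np.log2(1 - p)  # log p(x^n)
    print(n, round(-logp / n, 4), "vs H(X) =", round(H, 4))
```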


Data compression

As a consequence of the AEP, it is possible to find short descriptions of any realization $x^n = x_1 x_2 \ldots x_n$ of the random process $X^n = X_1 X_2 \ldots X_n$.

Theorem (Source coding theorem)
Let $X^n$ be a sequence of i.i.d. random variables, and let $\epsilon > 0$. There exists a code that maps observed sequences $x^n$ of $n$ symbols into binary strings of length $\ell(C_n(x^n))$ such that the mapping is one-to-one and the average length satisfies
$$E\left[\frac{1}{n}\ell(C_n(X^n))\right] \le H(X) + \epsilon$$
for $n$ sufficiently large.


Data compression

Proof.
1. Let us divide the set $\mathcal{X}^n$ of all possible sequences into two sets: the typical set $A_\epsilon^{(n)}$ and its complement $\left(A_\epsilon^{(n)}\right)^c$.
2. We order the elements in $A_\epsilon^{(n)}$ and represent each possible sequence by giving an index to it. Since
$$\left|A_\epsilon^{(n)}\right| \le 2^{n(H+\epsilon)}$$
we need no more than $n(H+\epsilon) + 1$ bits.
3. Let us prefix all these sequences by a 0, so as to distinguish the typical set from its complement.
4. We order the elements in $\left(A_\epsilon^{(n)}\right)^c$ and use an index of $n\log|\mathcal{X}| + 1$ bits, plus a 1 for prefix.

We can now evaluate the average length of the coded message if $n$ is large enough:


Data compression

Proof (cont.).
$$E[\ell(C_n(X^n))] = \sum_{x^n \in \mathcal{X}^n} p(x^n)\ell(C_n(x^n))$$
$$= \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\ell(C_n(x^n)) + \sum_{x^n \in \left(A_\epsilon^{(n)}\right)^c} p(x^n)\ell(C_n(x^n))$$
$$\le \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\left(n(H+\epsilon)+2\right) + \sum_{x^n \in \left(A_\epsilon^{(n)}\right)^c} p(x^n)\left(n\log|\mathcal{X}|+2\right)$$
$$= \Pr\left(A_\epsilon^{(n)}\right)\left(n(H+\epsilon)+2\right) + \Pr\left(\left(A_\epsilon^{(n)}\right)^c\right)\left(n\log|\mathcal{X}|+2\right)$$
$$\le n(H+\epsilon) + 2 + \epsilon\left(n\log|\mathcal{X}| + 2\right) = n(H+\epsilon')$$
where $\epsilon' = \epsilon + \epsilon\log|\mathcal{X}| + \frac{2}{n}(1+\epsilon)$ can be made small by an appropriate choice of $\epsilon$ and $n$. $\square$
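The coding scheme used in the proof can be made concrete for a small source. The following sketch is a toy version only, assuming a Bernoulli source small enough to enumerate $A_\epsilon^{(n)}$ exhaustively; practical compressors never list the typical set explicitly.

```python
import numpy as np
from itertools import product

# Toy typical-set code: index typical sequences with a '0' prefix, and send all
# other sequences verbatim with a '1' prefix.
p, n, eps = 0.25, 12, 0.1
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def logprob(x):                              # log2 p(x^n) for a 0/1 tuple x
    k = sum(x)
    return k * np.log2(p) + (n - k) * np.log2(1 - p)

typical = [x for x in product((0, 1), repeat=n)
           if abs(-logprob(x) / n - H) <= eps]
index = {x: i for i, x in enumerate(typical)}
idx_bits = int(np.ceil(np.log2(len(typical))))   # at most n(H + eps) + 1 bits

def encode(x):
    if x in index:
        return '0' + format(index[x], f'0{idx_bits}b')
    return '1' + ''.join(map(str, x))            # n raw bits for atypical sequences

x = tuple(int(b) for b in np.random.default_rng(2).random(n) < p)
print(len(typical), "typical sequences out of", 2**n)
print(encode(x))
```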


Data compression

In short, the theorem implies that:

Typical sequences are a tiny proportion of all possible sequences (their number depends on $H(X)$);
Typical sequences occur with a collective probability of about one;
Each typical sequence occurs with about the same probability.


The high probability set

The set $A_\epsilon^{(n)}$ contains most of the probability, but is it the smallest such set?
Let $B_\delta^{(n)} \subset \mathcal{X}^n$ be the smallest set with $\Pr\left(B_\delta^{(n)}\right) \ge 1 - \delta$.

Theorem (3.3)
Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with distribution $p(X)$. For $\delta < \frac{1}{2}$ and $\delta' > 0$, if $\Pr\left(B_\delta^{(n)}\right) \ge 1 - \delta$, then
$$\frac{1}{n}\log\left|B_\delta^{(n)}\right| > H(X) - \delta'$$
for $n$ large enough.

Thus $\left|B_\delta^{(n)}\right| > 2^{n(H(X)-\delta')}$, so the high probability set and the typical set are about the same size, if $\delta = \epsilon$.


The high probability set


Proof. Start with a comparative analysis of $A_\epsilon^{(n)}$ and $B_\delta^{(n)}$:
$$\Pr\left(A_\epsilon^{(n)} \cap B_\delta^{(n)}\right) = \Pr\left(A_\epsilon^{(n)}\right) + \Pr\left(B_\delta^{(n)}\right) - \Pr\left(A_\epsilon^{(n)} \cup B_\delta^{(n)}\right) \ge 1 - \epsilon + 1 - \delta - 1 = 1 - \epsilon - \delta$$

The probability of the intersection of the two sets is very large. Then
$$1 - \epsilon - \delta \le \Pr\left(A_\epsilon^{(n)} \cap B_\delta^{(n)}\right) = \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} \Pr(x^n) \le \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} 2^{-n(H(X)-\epsilon)} = \left|A_\epsilon^{(n)} \cap B_\delta^{(n)}\right| 2^{-n(H(X)-\epsilon)} \le \left|B_\delta^{(n)}\right| 2^{-n(H(X)-\epsilon)}$$

$$\frac{1}{n}\log\left|B_\delta^{(n)}\right| > \frac{1}{n}\log(1 - \epsilon - \delta) + H(X) - \epsilon = H(X) - \delta' \qquad \square$$


The high probability set


Although $A_\epsilon^{(n)}$ and $B_\delta^{(n)}$ have nearly the same size, they are not the same set. It suffices to show that the most likely sequences (the first elements of the $\delta$-sufficient set) are not contained in the $\epsilon$-typical set.


The high probability set

Consider a Bernoulli process with $\Pr(X=1) > \Pr(X=0)$: the most likely sequence is the one having all '1', but it is not present in $A_\epsilon^{(n)}$, because $A_\epsilon^{(n)}$ contains only the sequences whose number of '1' is close to $np$. Those are also in $B_\delta^{(n)}$, since the intersection is large.

How to build the high probability set?

It is simple: start from the highest probability sequence(s) and progressively add sequences of decreasing probability, as sketched below. This set contains the maximum concentration of probability mass.

Why do we study $A_\epsilon^{(n)}$ instead of the high probability set?

For compression purposes $B_\delta^{(n)}$ would be more suitable (the set has fewer elements), but we do not know its cardinality. Additionally, with $A_\epsilon^{(n)}$ we can use the fact that all sequences have nearly the same probability (which is conveniently used in the proof).
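A sketch of this construction for a small Bernoulli source (illustration only; the comparison with $2^{nH(X)}$ is only rough at such small $n$, since the theorem is asymptotic):

```python
import numpy as np
from itertools import product

# Build B_delta: sort all sequences by decreasing probability and keep the most
# likely ones until their cumulative probability reaches 1 - delta.
p, n, delta = 0.25, 12, 0.05
probs = sorted((p**sum(x) * (1 - p)**(n - sum(x)) for x in product((0, 1), repeat=n)),
               reverse=True)
cum, size = 0.0, 0
for pr in probs:
    cum += pr
    size += 1
    if cum >= 1 - delta:
        break

H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print(size, "sequences in B_delta;  2^{nH(X)} ≈", round(2 ** (n * H)))
```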


Entropy rate of ergodic sources

The source coding theorem states that $nH(X)$ bits suffice to describe $n$ i.i.d. random variables. What if the variables $X_1, X_2, \ldots, X_n, X_{n+1}, \ldots$ have some statistical dependence?
In this case the entropy rate of the stochastic process is defined as
$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
i.e., the per-symbol entropy of the $n$ random variables.
We can also define
$$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n | X_{n-1}, X_{n-2}, \ldots, X_1)$$
as the entropy of the last random variable given the past.

Both quantities are equivalent for stationary processes (the proof can be found in T. Cover et al., Elements of Information Theory, chapter 4).


Entropy rate of a Markov chain

The entropy rate of a stationary Markov process can be written as
$$H(\mathcal{X}) = \lim_{n \to \infty} H(X_n|X_{n-1}, X_{n-2}, \ldots, X_1) = \lim_{n \to \infty} H(X_n|X_{n-1}) = H(X_2|X_1)$$
$$= \sum_{x_1 \in \mathcal{X}} p(x_1) H(X_2|X_1 = x_1) = -\sum_{i,j=1}^{N_{\text{states}}} p(i)\,[\mathbf{P}]_{i,j}\log[\mathbf{P}]_{i,j}$$

Exercise. For a two-state Markov chain, prove that
$$H(\mathcal{X}) = \frac{\beta}{\alpha+\beta}H(\alpha) + \frac{\alpha}{\alpha+\beta}H(\beta)$$
where $H(\alpha) = \alpha\log\frac{1}{\alpha} + (1-\alpha)\log\frac{1}{1-\alpha}$.

Check that $H(\mathcal{X}) = H(X)$ for an i.i.d. Markov process. A numerical check is sketched below.
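A minimal numerical check of the exercise (not a proof), assuming example values of $\alpha$ and $\beta$ chosen only for illustration:

```python
import numpy as np

def h(x):                                    # binary entropy function H(x)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

alpha, beta = 0.2, 0.5                       # example transition probabilities
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)    # stationary distribution

rate_general = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
                    for i in range(2) for j in range(2))
rate_closed = beta / (alpha + beta) * h(alpha) + alpha / (alpha + beta) * h(beta)
print(rate_general, rate_closed)             # both ≈ 0.80 bits/symbol here
```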


Example: the correlation of English


Let us study the dependence between blocks of consecutive letters as we increase the number $m$ of letters in a block. We start by evaluating the probability distribution of pairs of letters...

(Figure: probability distribution of the 27×27 bigrams in the English-language document The Frequently Asked Questions Manual for Linux, taken from D. MacKay, Information Theory, Inference, and Learning Algorithms.)

Example: the correlation of English

From the previous figure, some pairs of letters are quite predictable given the first letter. Let us increase the block size...

It looks like the ability to predict each letter from the previous ones increases; consequently, the entropy decreases with $m$, and the prediction of each letter depends less and less on the letters of other blocks: blocks seem to become increasingly independent as $m$ grows.


Example: the correlation of English


For an increasing block size $m$, these are empirical values of the information per letter, computed on concatenated long texts (the Bible, Shakespeare's works, Moby Dick, etc.) of altogether $7 \times 10^7$ characters:
$$H(X_1) = \sum_{i=1}^{27} p(x_i)\log\frac{1}{p(x_i)} = 4.08 \text{ bits/letter}$$
$$\frac{1}{2}H(X_1, X_2) = \frac{1}{2}\sum_{i,j=1}^{27} p(x_i, x_j)\log\frac{1}{p(x_i, x_j)} = 3.32 \text{ bits/letter}$$
$$\frac{1}{3}H(X_1, X_2, X_3) = \frac{1}{3}\sum_{i,j,k=1}^{27} p(x_i, x_j, x_k)\log\frac{1}{p(x_i, x_j, x_k)} = 2.73 \text{ bits/letter}$$
$$H(\mathcal{X}) = \lim_{m \to \infty}\frac{1}{m}H(X_1, X_2, \ldots, X_m) = 1.19 \text{ bits/letter}$$

See over for a graphical display (empirical values taken from Table I in T. Schürmann, P. Grassberger, "Entropy estimation of symbol sequences", Chaos: An Interdisciplinary Journal of Nonlinear Science, 6(3):414-427; $H(\mathcal{X})$ is extrapolated from the set of empirical entropies provided there).
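The following sketch shows one rough way to produce such estimates from a text file; 'corpus.txt' is a placeholder name, the preprocessing is a simplification (26 letters plus space), and short texts give strongly biased estimates for $m > 2$ (the figures above come from roughly $7 \times 10^7$ characters).

```python
import math
from collections import Counter

def per_letter_entropy(text, m):
    """Empirical (1/m) H(X_1, ..., X_m) in bits/letter, from m-gram frequencies."""
    grams = Counter(text[i:i + m] for i in range(len(text) - m + 1))
    total = sum(grams.values())
    H_m = -sum(c / total * math.log2(c / total) for c in grams.values())
    return H_m / m

text = open('corpus.txt').read().lower()                     # placeholder file name
text = ' '.join(''.join(ch if ch.isalpha() else ' ' for ch in text).split())
for m in (1, 2, 3):
    print(m, per_letter_entropy(text, m))
```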

Example: the correlation of English


Let $B_t^m = X_{mt} X_{mt+1} \ldots X_{m(t+1)-1}$ be the $t$-th block of $m$ consecutive English letters.

As $m$ increases, there is less uncertainty per letter in the possible values of the block. Once the block size has grown to $m \approx 12$ (the correlation length), the identity of every letter in the block $B_t^m$ depends only on the letters of that block, and only weakly on those of blocks $B_{t-1}^m$ and $B_{t+1}^m$. Let us justify it.

Using the chain rule, the joint entropy per letter of blocks $B_t^m$ and $B_{t+1}^m$ is
$$\frac{1}{2m}H(B_t^m, B_{t+1}^m) = \frac{1}{2m}H(B_t^m) + \frac{1}{2m}H(B_{t+1}^m | B_t^m)$$


Example: the correlation of English

From the empirical observation in the plot above, if the block size is $m \ge 12$ the entropy per letter does not change:
$$\frac{1}{2m}H(B_t^m, B_{t+1}^m) \simeq \frac{1}{m}H(B_t^m) \quad \forall m \ge 12$$
Using both equations, this is achieved if
$$H(B_{t+1}^m | B_t^m) \simeq H(B_t^m) = H(B_{t+1}^m)$$
where stationarity of English has been applied in the last equality. Hence $B_{t+1}^m$ and $B_t^m$ are nearly independent.

Therefore, we can trivially apply the source coding theorem to compress the source down to a number of bits per symbol equal to the entropy rate, in this way: assign each $B_t^m$ a value in $\{1, \ldots, |\mathcal{X}|^m\}$, and encode blocks of $n$ of those values (that is, $t = 1, \ldots, n$), with $n$ very large.


The SMB Theorem

The SMB theorem formally extends the source coding theorem to ergodic sources:

Theorem (Shannon-McMillan-Breiman theorem)
For arbitrary $\epsilon > 0$,
$$\lim_{n \to \infty}\Pr\left(\left|-\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) - H(\mathcal{X})\right| \ge \epsilon\right) = 0$$
where $H(\mathcal{X})$ is the entropy rate.

This allows defining the minimum rate of a code for a correlated source. A way to design the code is to resort to the source coding theorem for i.i.d. sources, applied to blocks of $n$ words, each word of size equal to the correlation length.

Proof. It goes beyond the scope of the course, and can be found in T. Cover et al., Elements of Information Theory, chapter 16.


Way through...

The applications of the AEP and the concept of typicality reach beyond
data compression and will be found later in the course.
We have developed a constructive proof of the source coding theorem,
but notice that it only applies to very large sequences.
Chapter 4 introduces practical codes of finite length that achieve an
average length equal to the entropy bound.
