

Information Theory
Degree in Data Science and Engineering
Lesson 3: Information of data sources

Jordi Quer, Josep Vidal

Mathematics Department, Signal Theory and Communications Department


{jordi.quer, josep.vidal}@upc.edu

2019/20 - Q1


Sources of data and information

Most natural signals or symbols generated by a source convey information in a very dilute form: a large amount of data contains only a small amount of information. There are two main reasons for this:

Data that are close to each other tend to have similar values (e.g. pixels
in an image, pixels in consecutive images in a video sequence, temporal
samples in an audio signal), or are related to each other (e.g. letters in
English, video recordings of the same scene at closely spaced cameras,
samples of stereo audio recordings).
Not all values generated by our source of data are equally frequent. We already know that the less frequent ones carry more information.


Examples of redundant sources

Neighbouring pixels of an image.


Examples of redundant sources

Samples obtained from the digitisation of an audio signal.

Predictability of letters in English increases with the context:

”Oh my God, the volcano is eru...”


Examples of redundant sources

The frequency of letters in English is far from uniform in normal text (values have been estimated from The Frequently Asked Questions Manual for Linux).


Purpose of the chapter

Can we find short and yet reversible descriptions of a sequence of random observations $x_1 x_2 \ldots x_n$?

How short can this description be? How much can we compress data?

This will be highly relevant for the purposes of storage and communication of sequences of symbols.


Alphabets

Values generated by a source of random data $X$ belong to an alphabet, which is a finite set $\mathcal{X} = \{a_1, a_2, \ldots, a_q\}$ of $|\mathcal{X}| = q$ elements $a_i$, called letters or symbols. For example:

Many natural languages are written using variants of the ISO basic Latin alphabet of 26 letters:
a, b, c, d, e, f, g, h, i, j, k, l, m,
n, o, p, q, r, s, t, u, v, w, x, y, z.

The Greek alphabet of 24 letters:
α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ,
ν, ξ, ο, π, ρ, σ, τ, υ, φ, χ, ψ, ω.

The DNA alphabet of 4 letters $\mathcal{X} = \{A, C, T, G\}$, representing the four nucleotides adenine, cytosine, thymine, guanine, used in genetics to write the genome.


Alphabets

The Braille alphabet of 64 letters:

A digital image is written in the alphabet of pixels, whose letters are d-bit
numbers, with d the image color bit depth.
A digital sound is written in the alphabet of wave samples, whose letters
are d-bit numbers, with d the audio bit depth.
Of course, the most important to us is the binary alphabet X = {0, 1}
consisting of symbols 0 and 1.


Blocks and strings

A sequence of letters of $\mathcal{X}$ is called a word, block, string, chain, text, message, etc., depending on the context.
The name word is mostly used for sequences of short fixed length, or for the sequences belonging to a certain particular set (a code).

$\mathcal{X}^n$ is the set of the $q^n$ words (or blocks) of $n$ letters:

$$\mathcal{X}^n = \{x^n = a_1 a_2 \ldots a_n : a_i \in \mathcal{X}\}.$$

$\mathcal{X}^*$ is the infinite set of strings of arbitrary length:

$$\mathcal{X}^* = \{x^* = a_1 a_2 \ldots a_n : a_i \in \mathcal{X},\ n \ge 0\} = \bigcup_{n \ge 0} \mathcal{X}^n.$$

We denote by $\ell(x^n) = n$ the length of the string (number of letters). The empty string $\epsilon$, with $\ell(\epsilon) = 0$, is considered an element of $\mathcal{X}^*$.
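To make the notation concrete, here is a minimal Python sketch (not part of the original slides) that enumerates $\mathcal{X}^n$ for a toy alphabet and checks that $|\mathcal{X}^n| = q^n$:

```python
from itertools import product

# Toy illustration: enumerate X^n for a small alphabet and check |X^n| = q^n.
X = ['a', 'b', 'c']                 # alphabet with q = 3 letters
n = 2
Xn = [''.join(w) for w in product(X, repeat=n)]
print(Xn)                           # ['aa', 'ab', 'ac', 'ba', ..., 'cc']
print(len(Xn) == len(X) ** n)       # True: there are q^n = 9 words of length n
```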


Codes

A source code $C_n$ is a mapping that re-labels an $n$-length sequence of symbols belonging to an alphabet into a codeword of symbols belonging, possibly, to another alphabet. The value of $n$ is chosen when designing the code:

$$C_n : \mathcal{X}^n \to \mathcal{B}^*, \qquad c = C_n(x^n)$$

If the codeword length $\ell(c)$ is the same for all $x^n_i$, we have a fixed-length code. Otherwise, the code is said to be a variable-length code.


Codes

Some definitions for a code...

Extension of the code: the concatenation of codewords.

Codebook: the set of codewords corresponding to the set of source words,
$$\mathcal{C}_n = \{c : c = C_n(x^n),\ x^n \in \mathcal{X}^n\}$$

Non-singularity: no two different source words get mapped to the same codeword,
$$c = C_n(x_1^n) = C_n(x_2^n) \implies x_1^n = x_2^n$$
that is, $C_n(x^n)$ is an injective mapping. This ensures that the encoding is reversible and that zero-error decoding is possible. All codes will be injective.

Rate of the code: the ratio of the encoded sequence length to the source sequence length.


Efficiency

Coding is more efficient if words of $n > 1$ symbols are encoded. Let us take a fair 6-sided die, and call $X$ the random variable associated with the outcome of a throw.

$\mathcal{X} = \{1, 2, 3, 4, 5, 6\}$ and $H(X) = \log 6 = 2.58$ bits/outcome.

If we want to encode a sequence of outcomes using a binary alphabet $\{0, 1\}$, $C_1 = \{000, 001, 010, 011, 100, 101, 110, 111\}$ are the 8 codewords needed (we recognise that two of them will not be used). The rate is 3 binary digits/outcome.
Efficiency of the code: $\eta = \frac{H(X)}{3} = 0.86$ bits/binary digit.

We can do better if we encode words of 4 outcomes in each codeword. There are now $|\mathcal{X}^4| = 6^4 = 1296$ possible source words, for which we need codewords of 11 binary digits ($2^{11} = 2048 \ge 1296$). The rate is $11/4 = 2.75$ binary digits/outcome.
Efficiency of the code: $\eta = \frac{H(X)}{2.75} = 0.938$ bits/binary digit.

These are fixed-length codes.
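The numbers above can be reproduced with a short script; the following sketch (an illustration, not course code) also shows how the rate approaches $\log 6$ as $n$ grows:

```python
import math

# Fixed-length coding of n die outcomes with ceil(n * log2 6) binary digits.
q = 6                                    # |X| = 6 equiprobable outcomes
H = math.log2(q)                         # H(X) = log 6 ≈ 2.585 bits/outcome

for n in (1, 4, 10, 100):
    bits = math.ceil(n * math.log2(q))   # codeword length in binary digits
    rate = bits / n                      # binary digits per outcome
    eta = H / rate                       # efficiency of the code
    print(f"n={n:3d}  rate={rate:.3f}  efficiency={eta:.3f}")
# n=1 gives rate 3 and efficiency 0.86; n=4 gives rate 2.75 and efficiency 0.94.
```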


Efficiency
For this code, the number of binary digits per outcome, $\frac{1}{n}\lceil n \log |\mathcal{X}| \rceil$, is asymptotically decreasing with $n$ down to $\log |\mathcal{X}|$.

In general, the efficiency of the code can also be defined for $q$-ary codewords:
$$\eta = \frac{H_q(X)}{q\text{-ary symbols per outcome}}$$
where $H_q(X) = -\sum_{i=1}^{q} p(x_i) \log_q p(x_i)$.

Teaser...

A sequence of letters is the observation of a sequence of random variables (that is, a stochastic process) governed by a probability distribution. The probability distribution provides all the necessary information about the achievable bounds for data compression.

In this lesson, we prove that it is possible to encode an $n$-length sequence of random symbols with an efficiency that approaches 1 with high accuracy, when $n$ is large enough.

We will assume first that $X_1 X_2 \ldots X_n$ are independent and identically distributed (i.i.d.) random variables, and will drop the assumption at the end.


Stochastic process
A stochastic process is an infinite sequence X = X1 X2 X3 . . . of random
variables, each taking values in the same set X . A finite set of n random
variables will be denoted by X n = X1 X2 . . . Xn .
Examples of sources generating stochastic processes:
a language model for English, Catalan, etc. with an alphabet
X = {a, b, . . . , z, -, !, ?, (, ), . . . , ;, :} that includes letters, space and
punctuation characters,
a sequence of n dice throws,
the n samples of a sampled recording of a particular phoneme pronounced
by many speakers,
such that a given length-$n$ sequence $x_1 x_2 \ldots x_n \in \mathcal{X}^n$ is associated with a certain probability:

$$p(x_1, x_2, \ldots, x_n) = \Pr(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$$

satisfying the usual probability rules (joint, marginal, conditional). Of course, the random variables $X_i$ may not be mutually independent.


Stationary stochastic process


A stochastic process is said to be stationary if the joint distribution is invariant
to shifts in the index

Pr(X1 = x1 , X2 = x2 , . . . , Xn = xn )
= Pr(X1+l = x1 , X2+l = x2 , . . . , Xn+l = xn )

for every shift l and for all values xi ∈ X .


As a consequence, the statistical quantities that can be computed on each $X_i$ do not depend on the index:

$$E[g(X_1)] = E[g(X_2)] = \cdots = E[g(X_n)]$$

where $g$ is a function that applies to the value taken by a random variable $X_i$, and hence its output is itself a random variable.
For example, take $g(x) = x^2$. If the process is stationary, the second-order moment does not depend on the index and we can write

$$E[X_i^2] = E[X^2] \quad \forall i$$


Stationary stochastic process

A process is called i.i.d. (independent and identically distributed) if the random variables are mutually independent and share the same distribution:

$$\Pr(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} \Pr(X_i = x_i)$$

Or, in a simpler notation:

$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i) \quad \text{with } x_i \in \mathcal{X}$$

If the random variables are associated with observations in time, then an i.i.d. process has no memory of past or future.


Ergodic processes
Imagine the following experiment: observe N times the output of a stochastic
process consisting of n random variables X1 X2 . . . Xn . Each observation of the
n values taken by these random variables is called a realization of the process.

Stochastic processes can be defined at will. For example, we could randomly


take N English books and observe the sequence of the first n letters. Or take
just one book, open at random N pages and observe the sequence of the first
n letters.

Ergodic processes
Assume the process is stationary. We define the process as ergodic if statistical quantities can be evaluated from temporal averages done on any single realization:

$$E[g(X)] = \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} g(x_k)$$

In other words, if the process is ergodic, we do not need all possible realizations to infer statistical information about the process.
In general, ergodic processes have memory when the random variables $X_1, X_2, \ldots, X_n$ are not independent. A memory process can be generated from an i.i.d. process by passing it through a finite state machine (FSM), where the FSM generates outputs depending on the current and past values of the input. Only ergodic processes will be considered in the sequel.


A finite state machine producing English text


Assume the values adopted by X are a set of English words. This FSM can
generate a number of distinct ergodic output sequences if the inputs are
random binary values that select outputs of the states:

Some possible output sequences:


- THE COMMUNIST PARTY INVESTIGATED THE CONGRESS.
- THE CONGRESS INVESTIGATED THE COMMUNIST PARTY AND FOUND
EVIDENCE OF THE CONGRESS DESTROYED THE COMMUNIST PARTY.
- THE CONGRESS PURGED THE CONGRESS.
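Since the FSM diagram is not reproduced here, the following Python sketch uses a hypothetical set of states and transitions, chosen only so that random choices can produce sentences like the examples above; it is not the machine from the original figure.

```python
import random

# Hypothetical FSM: each state offers a list of (output word, next state) pairs,
# and a random input selects which branch to follow.
transitions = {
    "START":    [("THE", "SUBJECT")],
    "SUBJECT":  [("COMMUNIST PARTY", "VERB"), ("CONGRESS", "VERB")],
    "VERB":     [("INVESTIGATED", "OBJECT"), ("PURGED", "OBJECT"),
                 ("DESTROYED", "OBJECT"), ("FOUND EVIDENCE OF", "START")],
    "OBJECT":   [("THE", "SUBJECT2")],
    "SUBJECT2": [("COMMUNIST PARTY", "END"), ("CONGRESS", "END")],
}

def generate_sentence(rng):
    state, words = "START", []
    while state != "END":
        word, state = rng.choice(transitions[state])
        words.append(word)
    return " ".join(words) + "."

rng = random.Random(0)
for _ in range(3):
    print(generate_sentence(rng))
```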

A taxonomy of processes


Markov process

A Markov process is an ergodic stochastic process in which the past has no influence on the future, given the present. In general, in an order-$k$ Markov process, each variable depends only on the previous $k$:

Pr(Xn+k = xn+k |Xn+k−1 = xn+k−1 , . . . , X1 = x1 )


= Pr(Xn+k = xn+k |Xn+k−1 = xn+k−1 , . . . , Xn = xn ).

The simplest example of a discrete-time Markov process is an order 1 Markov


process:

Pr(Xn+1 = xn+1 |Xn = xn , Xn−1 = xn−1 , . . . , X1 = x1 )


= Pr(Xn+1 = xn+1 |Xn = xn ).

If the transition probabilities do not depend on n, the Markov process is called


invariant or homogeneous:

Pr(Xn+1 = x|Xn = y) = Pr(Xm+1 = x|Xm = y) ∀n, m


Markov process

For an order-1 Markov process, the joint probability is given by

$$p(x_1, x_2, \ldots, x_n) = p(x_1)\,p(x_2|x_1)\,p(x_3|x_2) \ldots p(x_n|x_{n-1})$$

The probability distribution of the states at time $n+1$ is given by

$$p(x_{n+1}) = \sum_{x_n \in \mathcal{X}} p(x_n)\,p(x_{n+1}|x_n)$$

which can be written in matrix form as

$$\mathbf{p}(n+1) = \mathbf{p}(n)\mathbf{P}, \qquad n \ge 0$$

where
$[\mathbf{P}]_{i,j} = \Pr(X_{n+1} = x_j | X_n = x_i)$ are the elements of the transition matrix $\mathbf{P}$,
$[\mathbf{p}(n)]_j = \Pr(X_n = x_j)$ are the elements of the row vector $\mathbf{p}(n)$ containing the probabilities of all $|\mathcal{X}|$ states at time $n$.


Markov process

A Markov process is said to be invariant if the matrix $\mathbf{P}$ does not depend on $n$.

A Markov process is stationary if $\mathbf{p}(n)$ does not depend on $n$, so $\mathbf{p}(0)$ must be a left eigenvector of $\mathbf{P}$ with eigenvalue 1.

If all elements of $\mathbf{P}$ are positive, the Perron-Frobenius theorem applies to conclude that:
1. The largest left-eigenvalue of $\mathbf{P}$ is simple and its value is 1.
2. The entries of the associated eigenvector are real and positive.
3. All other left-eigenvalues are smaller in modulus (they may be complex).
In this case, it turns out that

$$\lim_{n \to \infty} \mathbf{p}(n) = \lim_{n \to \infty} \mathbf{p}(0)\mathbf{P}^n = \mathbf{p}$$

and hence $\mathbf{p}$ is an eigenvector of $\mathbf{P}$ associated with the unit left-eigenvalue.

For non-negative $\mathbf{P}$ other results apply (not considered here).


Exercise
A two-state Markov process, where $\mathcal{X} = \{\text{State 1}, \text{State 2}\}$, can be represented graphically as a two-node transition diagram, where $[\mathbf{P}]_{i,j} = \Pr(X_{n+1} = \text{State } j | X_n = \text{State } i)$. The transition matrix is given by

$$\mathbf{P} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}$$

What are the asymptotic stationary probabilities $\mathbf{p}$ as $n \to \infty$, regardless of the value of $\mathbf{p}(0)$?
What is $\mathbf{p}(0)$ for the Markov process to be stationary? That is, such that $\mathbf{p}(n)$ does not depend on $n$.
What are the values of $\alpha$ and $\beta$ for an i.i.d. Markov process?
A numerical check is sketched below.
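A minimal sketch (not a proof), assuming example values of $\alpha$ and $\beta$ chosen only for illustration:

```python
import numpy as np

alpha, beta = 0.2, 0.5                      # example transition probabilities
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Asymptotic distribution: iterate p(n+1) = p(n) P from an arbitrary p(0).
p = np.array([1.0, 0.0])
for _ in range(1000):
    p = p @ P
print(p)                                    # converges to [beta, alpha] / (alpha + beta)
print(np.array([beta, alpha]) / (alpha + beta))

# This p is a left eigenvector of P with eigenvalue 1, so choosing p(0) = p makes
# the process stationary. If alpha = 1 - beta, both rows of P coincide and the
# process is i.i.d.
```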

Some preliminaries...

We will exploit some properties of stochastic processes to achieve efficient coding. A few well-known theorems are needed.

Lemma (Markov's inequality)
For any non-negative random variable $X$ and any $\alpha > 0$,
$$\Pr(X \ge \alpha) \le \frac{E[X]}{\alpha}$$

Proof. Define the indicator function $I(x, \alpha) = \begin{cases} 1 & x \ge \alpha \\ 0 & x < \alpha \end{cases}$
Since $X$ is non-negative, $I(X, \alpha) \le X/\alpha$, and therefore
$$\Pr(X \ge \alpha) = E[I(X, \alpha)] \le E[X/\alpha] = \frac{E[X]}{\alpha} \qquad \square$$

This inequality can be used to bound the tails of a distribution.



Some preliminaries...

Lemma (Chebyshev's inequality)
For any random variable $X$ with variance $\sigma_x^2$ and any $\beta > 0$,
$$\Pr(|X - E[X]| \ge \beta) \le \frac{\sigma_x^2}{\beta^2}$$

Proof. Consider
$$\Pr(|X - E[X]| \ge \beta) = \Pr\left(|X - E[X]|^2 \ge \beta^2\right) \le \frac{\sigma_x^2}{\beta^2}$$
where Markov's inequality has been applied in the last step. $\square$


Some preliminaries...

The average of $n$ i.i.d. random variables can be made arbitrarily close to their mean, with high probability, by increasing $n$.

Theorem (The weak law of large numbers)
Consider a sequence of i.i.d. random variables $X_i$, and let $\hat{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then
$$\Pr\left(\left|\hat{X}_n - E[\hat{X}_n]\right| \ge \epsilon\right) \le \frac{\sigma^2}{n\epsilon^2}$$
where $\sigma^2$ is the variance of $X_i$.

Proof. Apply Chebyshev's inequality to $\hat{X}_n$, whose variance is $\sigma^2/n$. $\square$

It is said that $\hat{X}_n$ converges in probability to $E[\hat{X}_n] = E[X_i]$:
$$\hat{X}_n \to E[X_i]$$
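A quick simulation (an illustration, not part of the slides) shows this convergence in probability: the empirical frequency of the event $|\hat{X}_n - E[X]| \ge \epsilon$ shrinks as $n$ grows.

```python
import numpy as np

# Sample means of i.i.d. Bernoulli(p) variables concentrate around E[X] = p.
rng = np.random.default_rng(0)
p, eps = 0.25, 0.05
for n in (10, 100, 1000, 10000):
    samples = rng.random((2000, n)) < p          # 2000 realizations of X_1 ... X_n
    means = samples.mean(axis=1)                 # hat{X}_n for each realization
    print(n, np.mean(np.abs(means - p) >= eps))  # empirical Pr(|hat{X}_n - p| >= eps)
```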


The asymptotic equipartition property (AEP)


The key theorem is...

Theorem
If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with distribution $p(X)$, then
$$-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) \to E[-\log p(X)] = H(X)$$
that is,
$$\lim_{n \to \infty} \Pr\left(\left|-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) - H(X)\right| \ge \epsilon\right) = 0$$

Proof. Apply the weak law of large numbers to the random variable $-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n)$, whose mean, calculated using the fact that the $X_i$ are i.i.d., is
$$E\left[-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n)\right] = E\left[-\frac{1}{n}\sum_{i=1}^{n}\log p(X_i)\right] = -\frac{1}{n}\sum_{i=1}^{n} E[\log p(X_i)]$$
$$= -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{q} p(x_{i,j})\log p(x_{i,j}) = \frac{1}{n}\sum_{i=1}^{n} H(X_i) = H(X) \qquad \square$$


Properties of typical sequences

Typical sequences are those whose probability satisfies $-\log p(x^n) \approx nH(X)$ to within a small value $n\epsilon$.
The set of typical sequences will be called $A_\epsilon^{(n)}$ and it is included in $\mathcal{X}^n$. It is the set of observations of $X^n = X_1 X_2 \ldots X_n$ such that

$$2^{-n(H(X)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\epsilon)}.$$

Therefore, all typical sequences have nearly the same probability.

Two relevant properties of the set are proved next.


Properties of typical sequences

What is the probability mass of the typical sequences?

Theorem (3.1)
For a sufficiently large value of $n$, $\Pr\left(A_\epsilon^{(n)}\right) > 1 - \epsilon$.

Proof. Apply the weak law of large numbers. $\square$

That is, the set contains most of the probability. How many typical sequences are there?

Theorem (3.2)
For any value of $n$, $\left|A_\epsilon^{(n)}\right| \le 2^{n(H(X)+\epsilon)}$.
For a sufficiently large value of $n$, $\left|A_\epsilon^{(n)}\right| \ge (1-\epsilon)\,2^{n(H(X)-\epsilon)}$.


Properties of typical sequences

Proof. For the upper bound,
$$1 = \sum_{x^n \in \mathcal{X}^n} p(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) \ge \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = 2^{-n(H(X)+\epsilon)} \left|A_\epsilon^{(n)}\right|$$

For the lower bound, take Theorem 3.1, so that
$$1 - \epsilon < \Pr\left(A_\epsilon^{(n)}\right) \le \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} = 2^{-n(H(X)-\epsilon)} \left|A_\epsilon^{(n)}\right| \qquad \square$$

The set may be small; its size depends on the entropy of $X$.


Example: the AEP in a Bernoulli process

Take a sequence of i.i.d. observations of an unfair coin. We'll check the properties of $A_\epsilon^{(n)}$. The sequence forms a stationary Bernoulli process, with distribution $\Pr(X = 1) = p$, $\Pr(X = 0) = 1 - p$.

The number of ones in a specific sequence $x^n$ is denoted by $k(x^n)$.
The probability of a specific sequence with $k$ ones is $\Pr(k) = p^k (1-p)^{n-k}$.
The number of sequences of length $n$ with $k$ ones is $N(k) = \binom{n}{k}$.
The probability of generating a sequence with $k$ ones is $\Pr(n, k) = \binom{n}{k} p^k (1-p)^{n-k}$.
$E[k] = pn$.
$\mathrm{std}(k) = \sqrt{E[(k - E[k])^2]} = \sqrt{np(1-p)}$. The standard deviation becomes small with respect to the mean as $n$ increases! These quantities are computed numerically in the sketch below.
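A minimal sketch (illustration only) computing the quantities listed above for $p = 1/4$ and $n = 25$:

```python
from math import comb, sqrt

p, n = 0.25, 25
for k in range(0, n + 1, 5):
    Nk = comb(n, k)                      # number of length-n sequences with k ones
    pr_seq = p**k * (1 - p)**(n - k)     # probability of one such sequence
    pr_nk = Nk * pr_seq                  # probability of observing k ones
    print(f"k={k:2d}  N(k)={Nk:7d}  Pr(seq)={pr_seq:.2e}  Pr(n,k)={pr_nk:.4f}")

print("E[k] =", p * n, "  std(k) =", sqrt(n * p * (1 - p)))
```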


Example: the AEP in a Bernoulli process

Take $p = 1/4$, $n = 25$.

(Two plots as a function of $k$: the number of different sequences of length $n$ with $k$ ones, and $\Pr(n, k)$, the probability of generating a sequence with $k$ ones.)

Example: the AEP in a Bernoulli process

Take $p = 1/4$ for increasing $n$.

(Two plots of $\Pr(n, k)$, the probability of generating a sequence with $k$ ones, as a function of $k$, for $n = 250$ and $n = 2500$.)

Example: the AEP in a Bernoulli process

From the plots it seems clear that the number of ones in a typical sequence is $k \approx pn$. Let us check it by evaluating the AEP:
$$\frac{1}{n}\log p(x^n) = \frac{1}{n}\log\left(p^k (1-p)^{n-k}\right) = \frac{1}{n}\left(k\log p + (n-k)\log(1-p)\right) \approx p\log p + (1-p)\log(1-p) = -H(X)$$

Case 1. For $n = 2500$, if $p = 1/4$, $H(X) = 0.8113$.
The number of typical sequences is $\approx 2^{nH(X)} = 2^{2029}$.

Case 2. For $n = 2500$, if $p = 1/100$, $H(X) = 0.08$.
The number of typical sequences is $\approx 2^{nH(X)} = 2^{200}$.
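An empirical check of the AEP for this source (a sketch, assuming nothing beyond the Bernoulli model above): for sampled sequences, $-\frac{1}{n}\log p(x^n)$ approaches $H(X)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.25
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)       # H(X) ≈ 0.8113 bits

for n in (25, 250, 2500, 25000):
    x = rng.random(n) < p                             # one realization of X^n
    k = int(x.sum())                                  # number of ones
    logp = k * np.log2(p) + (n - k) * np.log2(1 - p)  # log p(x^n)
    print(n, round(-logp / n, 4), "vs H(X) =", round(H, 4))
```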


Data compression

As a consequence of the AEP, it is possible to find short descriptions of any realization $x^n = x_1 x_2 \ldots x_n$ of the random process $X^n = X_1 X_2 \ldots X_n$.

Theorem (Source coding theorem)
Let $X^n$ be a sequence of i.i.d. random variables, and let $\epsilon > 0$. There exists a code that maps observed sequences $x^n$ of $n$ symbols into binary strings of length $\ell(C_n(x^n))$ such that the mapping is one-to-one and the average length satisfies
$$E\left[\frac{1}{n}\ell(C_n(X^n))\right] \le H(X) + \epsilon$$
for $n$ sufficiently large.


Data compression

Proof.
1. Let us divide the set $\mathcal{X}^n$ of all possible sequences into two sets: the typical set $A_\epsilon^{(n)}$ and its complement $\left(A_\epsilon^{(n)}\right)^c$.
2. We order the elements in $A_\epsilon^{(n)}$ and represent each possible sequence by giving an index to it. Since
$$\left|A_\epsilon^{(n)}\right| \le 2^{n(H+\epsilon)}$$
we need no more than $n(H+\epsilon) + 1$ bits.
3. Let us prefix all these sequences by a 0, so as to distinguish the typical set from its complement.
4. We order the elements in $\left(A_\epsilon^{(n)}\right)^c$ and use an index of $n\log|\mathcal{X}| + 1$ bits, plus a 1 for prefix.

We can now evaluate the average length of the coded message if $n$ is large enough:


Data compression

Proof (cont.).
$$E[\ell(C_n(X^n))] = \sum_{x^n \in \mathcal{X}^n} p(x^n)\ell(C_n(x^n))$$
$$= \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\ell(C_n(x^n)) + \sum_{x^n \in \left(A_\epsilon^{(n)}\right)^c} p(x^n)\ell(C_n(x^n))$$
$$\le \sum_{x^n \in A_\epsilon^{(n)}} p(x^n)\left(n(H+\epsilon)+2\right) + \sum_{x^n \in \left(A_\epsilon^{(n)}\right)^c} p(x^n)\left(n\log|\mathcal{X}|+2\right)$$
$$= \Pr\left(A_\epsilon^{(n)}\right)\left(n(H+\epsilon)+2\right) + \Pr\left(\left(A_\epsilon^{(n)}\right)^c\right)\left(n\log|\mathcal{X}|+2\right)$$
$$\le n(H+\epsilon) + 2 + \epsilon\left(n\log|\mathcal{X}| + 2\right) = n(H+\epsilon')$$
where $\epsilon' = \epsilon + \epsilon\log|\mathcal{X}| + \frac{2}{n}(1+\epsilon)$ can be made small by an appropriate choice of $\epsilon$ and $n$. $\square$
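The coding scheme used in the proof can be made concrete for a small source. The following sketch is a toy version only, assuming a Bernoulli source small enough to enumerate $A_\epsilon^{(n)}$ exhaustively; practical compressors never list the typical set explicitly.

```python
import numpy as np
from itertools import product

# Toy typical-set code: index typical sequences with a '0' prefix, and send all
# other sequences verbatim with a '1' prefix.
p, n, eps = 0.25, 12, 0.1
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def logprob(x):                              # log2 p(x^n) for a 0/1 tuple x
    k = sum(x)
    return k * np.log2(p) + (n - k) * np.log2(1 - p)

typical = [x for x in product((0, 1), repeat=n)
           if abs(-logprob(x) / n - H) <= eps]
index = {x: i for i, x in enumerate(typical)}
idx_bits = int(np.ceil(np.log2(len(typical))))   # at most n(H + eps) + 1 bits

def encode(x):
    if x in index:
        return '0' + format(index[x], f'0{idx_bits}b')
    return '1' + ''.join(map(str, x))            # n raw bits for atypical sequences

x = tuple(int(b) for b in np.random.default_rng(2).random(n) < p)
print(len(typical), "typical sequences out of", 2**n)
print(encode(x))
```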


Data compression

In short, the theorem implies that:

Typical sequences are a tiny proportion of all possible sequences (their number depends on $H(X)$);
Typical sequences occur with a collective probability of about one;
Each typical sequence occurs with about the same probability.


The high probability set

The set $A_\epsilon^{(n)}$ contains most of the probability, but is it the smallest such set?
Let $B_\delta^{(n)} \subset \mathcal{X}^n$ be the smallest set with $\Pr\left(B_\delta^{(n)}\right) \ge 1 - \delta$.

Theorem (3.3)
Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with distribution $p(X)$. For $\delta < \frac{1}{2}$ and $\delta' > 0$, if $\Pr\left(B_\delta^{(n)}\right) \ge 1 - \delta$, then
$$\frac{1}{n}\log\left|B_\delta^{(n)}\right| > H(X) - \delta'$$
for $n$ large enough.

Thus $\left|B_\delta^{(n)}\right| > 2^{n(H(X)-\delta')}$, so the high probability set and the typical set are about the same size, if $\delta = \epsilon$.


The high probability set


Proof. Start with a comparative analysis of $A_\epsilon^{(n)}$ and $B_\delta^{(n)}$:
$$\Pr\left(A_\epsilon^{(n)} \cap B_\delta^{(n)}\right) = \Pr\left(A_\epsilon^{(n)}\right) + \Pr\left(B_\delta^{(n)}\right) - \Pr\left(A_\epsilon^{(n)} \cup B_\delta^{(n)}\right) \ge 1 - \epsilon + 1 - \delta - 1 = 1 - \epsilon - \delta$$

The probability of the intersection of the two sets is very large. Then
$$1 - \epsilon - \delta \le \Pr\left(A_\epsilon^{(n)} \cap B_\delta^{(n)}\right) = \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} \Pr(x^n) \le \sum_{x^n \in A_\epsilon^{(n)} \cap B_\delta^{(n)}} 2^{-n(H(X)-\epsilon)} = \left|A_\epsilon^{(n)} \cap B_\delta^{(n)}\right| 2^{-n(H(X)-\epsilon)} \le \left|B_\delta^{(n)}\right| 2^{-n(H(X)-\epsilon)}$$

$$\frac{1}{n}\log\left|B_\delta^{(n)}\right| > \frac{1}{n}\log(1 - \epsilon - \delta) + H(X) - \epsilon = H(X) - \delta' \qquad \square$$


The high probability set


Although $A_\epsilon^{(n)}$ and $B_\delta^{(n)}$ have nearly the same size, they are not the same set. It suffices to show that the most likely sequences (the first elements of the $\delta$-sufficient set) are not contained in the $\epsilon$-typical set.


The high probability set

Consider a Bernoulli process with $\Pr(X=1) > \Pr(X=0)$: the most likely sequence is the one having all '1', but it is not present in $A_\epsilon^{(n)}$, because $A_\epsilon^{(n)}$ contains only the sequences whose number of '1' is close to $np$. Those are also in $B_\delta^{(n)}$, since the intersection is large.

How to build the high probability set?

It is simple: start from the highest probability sequence(s) and progressively add sequences of decreasing probability, as sketched below. This set contains the maximum concentration of probability mass.

Why do we study $A_\epsilon^{(n)}$ instead of the high probability set?

For compression purposes $B_\delta^{(n)}$ would be more suitable (the set has fewer elements), but we do not know its cardinality. Additionally, with $A_\epsilon^{(n)}$ we can use the fact that all sequences have nearly the same probability (which is conveniently used in the proof).
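A sketch of this construction for a small Bernoulli source (illustration only; the comparison with $2^{nH(X)}$ is only rough at such small $n$, since the theorem is asymptotic):

```python
import numpy as np
from itertools import product

# Build B_delta: sort all sequences by decreasing probability and keep the most
# likely ones until their cumulative probability reaches 1 - delta.
p, n, delta = 0.25, 12, 0.05
probs = sorted((p**sum(x) * (1 - p)**(n - sum(x)) for x in product((0, 1), repeat=n)),
               reverse=True)
cum, size = 0.0, 0
for pr in probs:
    cum += pr
    size += 1
    if cum >= 1 - delta:
        break

H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print(size, "sequences in B_delta;  2^{nH(X)} ≈", round(2 ** (n * H)))
```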


Entropy rate of ergodic sources

The source coding theorem states that $nH(X)$ bits suffice to describe $n$ i.i.d. random variables. What if the variables $X_1, X_2, \ldots, X_n, X_{n+1}, \ldots$ have some statistical dependence?
In this case the entropy rate of the stochastic process is defined as
$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n)$$
i.e., the per-symbol entropy of the $n$ random variables.
We can also define
$$H'(\mathcal{X}) = \lim_{n \to \infty} H(X_n | X_{n-1}, X_{n-2}, \ldots, X_1)$$
as the entropy of the last random variable given the past.

Both quantities are equivalent for stationary processes (the proof can be found in T. Cover et al., Elements of Information Theory, chapter 4).


Entropy rate of a Markov chain

The entropy rate of a stationary Markov process can be written as
$$H(\mathcal{X}) = \lim_{n \to \infty} H(X_n|X_{n-1}, X_{n-2}, \ldots, X_1) = \lim_{n \to \infty} H(X_n|X_{n-1}) = H(X_2|X_1)$$
$$= \sum_{x_1 \in \mathcal{X}} p(x_1) H(X_2|X_1 = x_1) = -\sum_{i,j=1}^{N_{\text{states}}} p(i)\,[\mathbf{P}]_{i,j}\log[\mathbf{P}]_{i,j}$$

Exercise. For a two-state Markov chain, prove that
$$H(\mathcal{X}) = \frac{\beta}{\alpha+\beta}H(\alpha) + \frac{\alpha}{\alpha+\beta}H(\beta)$$
where $H(\alpha) = \alpha\log\frac{1}{\alpha} + (1-\alpha)\log\frac{1}{1-\alpha}$.

Check that $H(\mathcal{X}) = H(X)$ for an i.i.d. Markov process. A numerical check is sketched below.
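A minimal numerical check of the exercise (not a proof), assuming example values of $\alpha$ and $\beta$ chosen only for illustration:

```python
import numpy as np

def h(x):                                    # binary entropy function H(x)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

alpha, beta = 0.2, 0.5                       # example transition probabilities
P = np.array([[1 - alpha, alpha], [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)    # stationary distribution

rate_general = -sum(pi[i] * P[i, j] * np.log2(P[i, j])
                    for i in range(2) for j in range(2))
rate_closed = beta / (alpha + beta) * h(alpha) + alpha / (alpha + beta) * h(beta)
print(rate_general, rate_closed)             # both ≈ 0.80 bits/symbol here
```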


Example: the correlation of English


Let us study the dependence between blocks of consecutive letters as we increase the number $m$ of letters in a block. We start by evaluating the probability distribution of pairs of letters...

(Figure: probability distribution of the 27×27 bigrams in the English-language document The Frequently Asked Questions Manual for Linux, taken from D. MacKay, Information Theory, Inference, and Learning Algorithms.)

Example: the correlation of English

From the previous figure, some pairs of letters are quite predictable given the first letter. Let us increase the block size...

It looks like the ability to predict each letter from the previous ones increases; consequently, the entropy decreases with $m$, and the prediction of each letter depends less and less on the letters of other blocks: blocks seem to become increasingly independent as $m$ grows.


Example: the correlation of English


For an increasing block size $m$, these are empirical values of the information per letter, computed on concatenated long texts (the Bible, Shakespeare's works, Moby Dick, etc.) of altogether $7 \times 10^7$ characters:
$$H(X_1) = \sum_{i=1}^{27} p(x_i)\log\frac{1}{p(x_i)} = 4.08 \text{ bits/letter}$$
$$\frac{1}{2}H(X_1, X_2) = \frac{1}{2}\sum_{i,j=1}^{27} p(x_i, x_j)\log\frac{1}{p(x_i, x_j)} = 3.32 \text{ bits/letter}$$
$$\frac{1}{3}H(X_1, X_2, X_3) = \frac{1}{3}\sum_{i,j,k=1}^{27} p(x_i, x_j, x_k)\log\frac{1}{p(x_i, x_j, x_k)} = 2.73 \text{ bits/letter}$$
$$H(\mathcal{X}) = \lim_{m \to \infty}\frac{1}{m}H(X_1, X_2, \ldots, X_m) = 1.19 \text{ bits/letter}$$

See over for a graphical display (empirical values taken from Table I in T. Schürmann, P. Grassberger, "Entropy estimation of symbol sequences", Chaos: An Interdisciplinary Journal of Nonlinear Science, 6(3):414-427; $H(\mathcal{X})$ is extrapolated from the set of empirical entropies provided there).
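The following sketch shows one rough way to produce such estimates from a text file; 'corpus.txt' is a placeholder name, the preprocessing is a simplification (26 letters plus space), and short texts give strongly biased estimates for $m > 2$ (the figures above come from roughly $7 \times 10^7$ characters).

```python
import math
from collections import Counter

def per_letter_entropy(text, m):
    """Empirical (1/m) H(X_1, ..., X_m) in bits/letter, from m-gram frequencies."""
    grams = Counter(text[i:i + m] for i in range(len(text) - m + 1))
    total = sum(grams.values())
    H_m = -sum(c / total * math.log2(c / total) for c in grams.values())
    return H_m / m

text = open('corpus.txt').read().lower()                     # placeholder file name
text = ' '.join(''.join(ch if ch.isalpha() else ' ' for ch in text).split())
for m in (1, 2, 3):
    print(m, per_letter_entropy(text, m))
```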

Example: the correlation of English


Let $B_t^m = X_{mt} X_{mt+1} \ldots X_{m(t+1)-1}$ be the $t$-th block of $m$ consecutive English letters.

As $m$ increases, there is less uncertainty per letter in the possible values of the block. Once the block size has grown to $m \approx 12$ (the correlation length), the identity of every letter in the block $B_t^m$ depends only on the letters of that block, and only weakly on those of blocks $B_{t-1}^m$ and $B_{t+1}^m$. Let us justify it.

Using the chain rule, the joint entropy per letter of blocks $B_t^m$ and $B_{t+1}^m$ is
$$\frac{1}{2m}H(B_t^m, B_{t+1}^m) = \frac{1}{2m}H(B_t^m) + \frac{1}{2m}H(B_{t+1}^m | B_t^m)$$


Example: the correlation of English

From the empirical observation in the plot above, if the block size is $m \ge 12$ the entropy per letter does not change:
$$\frac{1}{2m}H(B_t^m, B_{t+1}^m) \simeq \frac{1}{m}H(B_t^m) \quad \forall m \ge 12$$
Using both equations, this is achieved if
$$H(B_{t+1}^m | B_t^m) \simeq H(B_t^m) = H(B_{t+1}^m)$$
where stationarity of English has been applied in the last equality. Hence $B_{t+1}^m$ and $B_t^m$ are nearly independent.

Therefore, we can trivially apply the source coding theorem to compress the source down to a number of bits per symbol equal to the entropy rate, in this way: assign each $B_t^m$ a value in $\{1, \ldots, |\mathcal{X}|^m\}$, and encode blocks of $n$ of those values (that is, $t = 1, \ldots, n$), with $n$ very large.


The SMB Theorem

The SMB theorem formally extends the source coding theorem to ergodic sources:

Theorem (Shannon-McMillan-Breiman theorem)
For arbitrary $\epsilon > 0$,
$$\lim_{n \to \infty}\Pr\left(\left|-\frac{1}{n}\log p(x_1, x_2, \ldots, x_n) - H(\mathcal{X})\right| \ge \epsilon\right) = 0$$
where $H(\mathcal{X})$ is the entropy rate.

This allows defining the minimum rate of a code for a correlated source. A way to design the code is to resort to the source coding theorem for i.i.d. sources, applied to blocks of $n$ words, each word of size equal to the correlation length.

Proof. It goes beyond the scope of the course, and can be found in T. Cover et al., Elements of Information Theory, chapter 16.


Way through...

The applications of the AEP and the concept of typicality reach beyond
data compression and will be found later in the course.
We have developed a constructive proof of the source coding theorem,
but notice that it only applies to very large sequences.
Chapter 4 introduces practical codes of finite length that achieve an
average length equal to the entropy bound.
