Information Theory
Degree in Data Science and Engineering
Lesson 3: Information of data sources
2019/20 - Q1
Data that are close to each other tend to have similar values (e.g. pixels
in an image, pixels in consecutive images in a video sequence, temporal
samples in an audio signal), or are related to each other (e.g. letters in
English, video recordings of the same scene at closely spaced cameras,
samples of stereo audio recordings).
Not all values generated by our data source are equally frequent. We already know
that the less frequent ones carry more information.
How short can this description be? How much can we compress data?
This will be highly relevant for the purposes of storage and communication of
sequences of symbols.
Alphabets
Many natural languages are written using variants of the ISO basic Latin
alphabet of 26 letters:
a, b, c, d, e, f, g, h, i, j, k, l, m,
n, o, p, q, r, s, t, u, v, w, x, y, z.
The Greek alphabet of 24 letters:
α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, μ,
ν, ξ, ο, π, ρ, σ, τ, υ, φ, χ, ψ, ω.
Alphabets
A digital image is written in the alphabet of pixels, whose letters are d-bit
numbers, with d the image color bit depth.
A digital sound is written in the alphabet of wave samples, whose letters
are d-bit numbers, with d the audio bit depth.
Of course, the most important one for us is the binary alphabet X = {0, 1},
consisting of the symbols 0 and 1.
The set of all strings of n letters from the alphabet X is
X^n = {x^n = a_1 a_2 ... a_n : a_i ∈ X}.
We denote by ℓ(x^n) = n the length of the string (number of letters). The empty
string, with ℓ() = 0, is considered an element of X^*, the set of all finite strings.
Codes
A block code of length n is a mapping
C_n : X^n → B^*
that assigns to each source word x^n ∈ X^n a codeword c = C_n(x^n) over the code alphabet B.
Efficiency
Coding is more efficient if words of n > 1 symbols are encoded. Let us take a
fair 6-sided die, and call X the random variable associated with the outcome
of a throw.
X = {1, 2, 3, 4, 5, 6} and H(X) = log 6 = 2.58 bits/outcome.
If we want to encode a sequence of outcomes using a binary alphabet
{0, 1}, C_1 = {000, 001, 010, 011, 100, 101, 110, 111} provides the 8 codewords
needed (two of them will not be used). The rate is 3 binary digits/outcome.
Efficiency of the code: η = H(X)/3 = 0.86 bits/binary digit
We can do better if we encode words of 4 outcomes in each codeword.
There are now |X^4| = 6^4 = 1296 possible source words, for which we need
codewords of 11 binary digits (2^10 < 1296 ≤ 2^11). The rate is
11/4 = 2.75 binary digits/outcome.
Efficiency of the code: η = H(X)/2.75 = 0.938 bits/binary digit
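A quick numerical check of these figures (a minimal Python sketch; the function name and the block lengths tried are just illustrative choices):

```python
import math

def dice_code_rate(n, faces=6):
    """Binary digits per outcome when blocks of n outcomes of a fair die
    are encoded with fixed-length binary codewords."""
    codeword_bits = math.ceil(n * math.log2(faces))  # enough bits to index faces**n source words
    return codeword_bits / n

H = math.log2(6)  # entropy of a fair die, about 2.585 bits/outcome
for n in (1, 4, 10, 100):
    rate = dice_code_rate(n)
    print(f"n={n:3d}: rate = {rate:.3f} binary digits/outcome, efficiency = {H/rate:.3f}")
```

The efficiency approaches 1 as the block length grows, which is the behaviour discussed next.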
Efficiency
For this code, the number of binary digits per outcome, (1/n) ⌈n log |X|⌉, decreases
asymptotically with n down to log |X|.
In general, the efficiency of the code can also be defined for q-ary codewords:
η = H_q(X) / (q-ary symbols per outcome)
where H_q(X) = −Σ_{x∈X} p(x) log_q p(x).
Stochastic process
A stochastic process is an infinite sequence X = X1 X2 X3 . . . of random
variables, each taking values in the same set X . A finite set of n random
variables will be denoted by X n = X1 X2 . . . Xn .
Examples of sources generating stochastic processes:
a language model for English, Catalan, etc. with an alphabet
X = {a, b, . . . , z, -, !, ?, (, ), . . . , ;, :} that includes letters, space and
punctuation characters,
a sequence of n dice throws,
the n samples of a sampled recording of a particular phoneme pronounced
by many speakers,
Each length-n sequence x_1 x_2 ... x_n ∈ X^n is associated with a certain probability:
p(x_1, x_2, ..., x_n) = Pr(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
A process is stationary if its joint distribution is invariant under time shifts:
Pr(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = Pr(X_{1+l} = x_1, X_{2+l} = x_2, ..., X_{n+l} = x_n)
for every n and every shift l. In particular, all moments are time invariant, e.g. E[X_i^2] = E[X^2] ∀i.
Or, in simpler notation:
p(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} p(x_i), with x_i ∈ X
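For instance, for three tosses of a fair coin with X = {0, 1}, p(0, 1, 1) = p(0) p(1) p(1) = (1/2)^3 = 1/8.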
Ergodic processes
Imagine the following experiment: observe N times the output of a stochastic
process consisting of n random variables X1 X2 . . . Xn . Each observation of the
n values taken by these random variables is called a realization of the process.
Ergodic processes
Assume the process is stationary. We define the process as ergodic if statistical
magnitudes can be evaluated from temporal averages done on any single
realization:
E[g(X)] = lim_{n→∞} (1/n) Σ_{k=1}^{n} g(x_k)
Otherwise said, if the process is ergodic, we do not need all possible realizations
to infer statistical information of the process.
In general, ergodic processes have memory when the random variables
X_1, X_2, ..., X_n are not independent. A memory process can be generated
from an i.i.d. process by passing it through a finite-state machine (FSM) that
generates outputs depending on the current and past values of its input.
Only ergodic processes will be considered in the sequel.
A taxonomy of processes [figure]
Markov process
The state probabilities of a Markov process evolve according to
p(n + 1) = p(n) P,   n ≥ 0
where
[P]_{i,j} = Pr(X_{n+1} = x_j | X_n = x_i) are the elements of the transition matrix P,
[p(n)]_j = Pr(X_n = x_j) are the elements of the row vector p(n) containing the
probabilities of all |X| states at time n.
Exercise
A two-state Markov process, where X = {State 1, State 2}, can be represented
graphically by a state diagram with transition probabilities α (State 1 → State 2)
and β (State 2 → State 1), where [P]_{i,j} = Pr(X_{n+1} = State j | X_n = State i).
The transition matrix is
P = [ 1−α    α  ]
    [  β    1−β ]
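As a small numerical sketch (the values of α and β below are arbitrary, chosen only for illustration), iterating p(n+1) = p(n)P converges to the stationary distribution (β/(α+β), α/(α+β)):

```python
import numpy as np

alpha, beta = 0.3, 0.1              # illustrative transition probabilities
P = np.array([[1 - alpha, alpha],   # transition matrix of the two-state chain
              [beta, 1 - beta]])

p = np.array([1.0, 0.0])            # start in State 1
for _ in range(100):                # iterate p(n+1) = p(n) P
    p = p @ P

print("p(100)                        :", p)
print("(beta, alpha) / (alpha + beta):", np.array([beta, alpha]) / (alpha + beta))
```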
Some preliminaries...
Markov's inequality: for a non-negative random variable X and any α > 0,
Pr(X ≥ α) = E[I(X ≥ α)] ≤ E[X/α] = E[X]/α
This inequality can be used to bound the tails of a distribution.
Some preliminaries...
Chebyshev's inequality: for a random variable X with variance σ_X²,
Pr(|X − E[X]| ≥ β) ≤ σ_X² / β²
Proof. Consider
Pr(|X − E[X]| ≥ β) = Pr(|X − E[X]|² ≥ β²) ≤ E[|X − E[X]|²] / β² = σ_X² / β²
where Markov's inequality has been applied in the last inequality.
Some preliminaries...
Weak law of large numbers: the average of n i.i.d. random variables can be made
arbitrarily close to their mean by increasing n,
X̄_n = (1/n) Σ_{i=1}^{n} X_i → E[X_i]   in probability.
Theorem
If X_1, X_2, ..., X_n are i.i.d. random variables with distribution p(X), then
−(1/n) log p(X_1, X_2, ..., X_n) → E[−(1/n) log p(X_1, X_2, ..., X_n)] = H(X)   in probability,
that is, lim_{n→∞} Pr( |−(1/n) log p(X_1, X_2, ..., X_n) − H(X)| ≥ ε ) = 0.
Proof. Apply the weak law of large numbers to the random variable
−(1/n) log p(X_1, X_2, ..., X_n), whose mean, calculated using that the X_i are i.i.d., is
E[−(1/n) log p(X_1, X_2, ..., X_n)] = E[−(1/n) Σ_{i=1}^{n} log p(X_i)] = −(1/n) Σ_{i=1}^{n} E[log p(X_i)]
= −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{q} p(x_j) log p(x_j) = (1/n) Σ_{i=1}^{n} H(X_i) = H(X)
with q = |X|.
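A quick numerical illustration of the theorem for an i.i.d. Bernoulli source (a sketch; the parameter p and the sequence lengths are arbitrary illustration choices):

```python
import numpy as np

p = 0.25                                            # illustrative Bernoulli parameter
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)      # entropy, about 0.811 bits

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10000):
    x = rng.random(n) < p                           # one realization of X_1 ... X_n
    k = x.sum()                                     # number of ones
    log_p = k * np.log2(p) + (n - k) * np.log2(1 - p)   # log2 p(x^n) for an i.i.d. sequence
    print(f"n={n:6d}: -1/n log p(x^n) = {-log_p / n:.4f}   (H(X) = {H:.4f})")
```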
Typical sequences are those whose −(1/n) log probability approaches H(X) to within
a small value ε.
The set of typical sequences will be called A_ε^(n), and it is included in X^n. It is
the set of observations of X^n = X_1 X_2 ... X_n such that
2^{−n(H(X)+ε)} ≤ p(x_1, x_2, ..., x_n) ≤ 2^{−n(H(X)−ε)},
equivalently, |−(1/n) log p(x_1, ..., x_n) − H(X)| ≤ ε.
Theorem (3.1)
For a sufficiently large value of n, Pr(A_ε^(n)) > 1 − ε.
That is, the set contains most of the probability. How many typical sequences
are there?
Theorem (3.2)
For any value of n, |A_ε^(n)| ≤ 2^{n(H(X)+ε)}.
For a sufficiently large value of n, |A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)}.
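To make the bounds concrete, the sketch below counts the typical sequences of a Bernoulli source by grouping sequences according to their number of ones (all such sequences have the same probability); p, ε and n are arbitrary illustration values:

```python
from math import comb, log2

p, eps, n = 0.25, 0.1, 20                    # illustrative parameters
H = -p * log2(p) - (1 - p) * log2(1 - p)     # about 0.811 bits

count, prob = 0, 0.0
for k in range(n + 1):                       # sequences with k ones share the same probability
    logp = k * log2(p) + (n - k) * log2(1 - p)   # log2 p(x^n) for such a sequence
    if abs(-logp / n - H) <= eps:            # typicality condition
        count += comb(n, k)
        prob += comb(n, k) * 2.0 ** logp

print(f"|A| = {count},  upper bound 2^(n(H+eps)) = {2 ** (n * (H + eps)):.0f}")
print(f"Pr(A) = {prob:.3f}   (n = {n} is still too small for Pr(A) > 1 - eps)")
```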
Proof of the first bound of Theorem 3.2:
1 ≥ Pr(A_ε^(n)) = Σ_{x^n ∈ A_ε^(n)} p(x^n) ≥ Σ_{x^n ∈ A_ε^(n)} 2^{−n(H(X)+ε)} = 2^{−n(H(X)+ε)} |A_ε^(n)|
hence |A_ε^(n)| ≤ 2^{n(H(X)+ε)}.
Take p = 1/4, n = 25. [Plots as a function of the number of ones k.]
Take p = 1/4 for increasing n. [Plots as a function of the number of ones k.]
From the plots it seems clear that the number of ones in a typical sequence is
k ≈ pn. Let us check it by evaluating the AEP:
(1/n) log p(x^n) = (1/n) log( p^k (1 − p)^{n−k} )
= (1/n) ( k log p + (n − k) log(1 − p) )
≈ p log p + (1 − p) log(1 − p) = −H(X)
Data compression
Data compression
Proof.
1. Divide all possible sequences X^n into two sets: the typical set A_ε^(n) and
its complement (A_ε^(n))^c.
2. Order the elements in A_ε^(n) and represent each sequence by its index. Since
|A_ε^(n)| ≤ 2^{n(H+ε)}, at most n(H+ε) + 1 binary digits are needed (the extra digit
accounts for rounding up). Likewise, the sequences in the complement can be indexed
with at most n log|X| + 1 binary digits.
3. Prefix every codeword with a flag bit indicating which set the sequence belongs to,
so the lengths become n(H+ε) + 2 and n log|X| + 2 binary digits, respectively.
We can now evaluate the average length of the coded message if n is large enough:
Data compression
Proof (cont.).
E[ℓ(C_n(X^n))] = Σ_{x^n ∈ X^n} p(x^n) ℓ(C_n(x^n))
= Σ_{x^n ∈ A_ε^(n)} p(x^n) ℓ(C_n(x^n)) + Σ_{x^n ∉ A_ε^(n)} p(x^n) ℓ(C_n(x^n))
≤ Σ_{x^n ∈ A_ε^(n)} p(x^n) (n(H+ε) + 2) + Σ_{x^n ∉ A_ε^(n)} p(x^n) (n log|X| + 2)
= Pr(A_ε^(n)) (n(H+ε) + 2) + Pr((A_ε^(n))^c) (n log|X| + 2)
≤ n(H+ε) + 2 + εn log|X| + 2 = n(H + ε')
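The sketch below evaluates E[ℓ(C_n(X^n))]/n exactly for a binary i.i.d. source coded with the two-set scheme described above (typical sequences get ⌈n(H+ε)⌉ + 1 binary digits including the flag bit, the rest get n + 1); the parameters are arbitrary illustration values:

```python
from math import comb, ceil, log2

p, eps = 0.25, 0.1                           # illustrative parameters
H = -p * log2(p) - (1 - p) * log2(1 - p)     # about 0.811 bits

for n in (25, 100, 400):
    typ_len = ceil(n * (H + eps)) + 1        # index of a typical sequence + flag bit
    atyp_len = n + 1                         # index of a non-typical sequence (|X| = 2) + flag bit
    avg = 0.0
    for k in range(n + 1):                   # group sequences by their number of ones k
        logp = k * log2(p) + (n - k) * log2(1 - p)   # log2 prob. of one such sequence
        prob_k = comb(n, k) * 2.0 ** logp            # total prob. of sequences with k ones
        typical = abs(-logp / n - H) <= eps
        avg += prob_k * (typ_len if typical else atyp_len)
    print(f"n={n:4d}: E[l]/n = {avg / n:.3f} bits/symbol   (H + eps = {H + eps:.3f})")
```

As n grows, the average length per symbol approaches H + ε, in agreement with the proof.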
Data compression
The set A_ε^(n) contains most of the probability but, is it the smallest such set?
Let B_δ^(n) ⊂ X^n be the smallest set with Pr(B_δ^(n)) ≥ 1 − δ.
Theorem (3.3)
Let X_1, X_2, ..., X_n be i.i.d. random variables with distribution p(X). For δ < 1/2
and δ' > 0, if Pr(B_δ^(n)) ≥ 1 − δ, then
(1/n) log |B_δ^(n)| > H(X) − δ'
for n large enough.
Thus |B_δ^(n)| > 2^{n(H(X)−δ')}, so the high probability set and the typical set are
about the same size, if δ = ε.
Sketch of the proof: the intersection A_ε^(n) ∩ B_δ^(n) has probability at least 1 − ε − δ,
and every typical sequence has probability at most 2^{−n(H(X)−ε)}, so
1 − ε − δ ≤ Pr(A_ε^(n) ∩ B_δ^(n)) ≤ |B_δ^(n)| · 2^{−n(H(X)−ε)}
and therefore
(1/n) log |B_δ^(n)| > (1/n) log(1 − ε − δ) + H(X) − ε = H(X) − δ'.
Consider a Bernoulli process with Pr(X = 1) > Pr(X = 0): the most likely sequence
is the all-'1' sequence, but it is not in A_ε^(n), because the typical set contains the
sequences whose number of '1's is close to np < n. Those sequences are also in B_δ^(n),
since the intersection of the two sets is large.
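A quick numerical check (p, n and ε below are arbitrary illustration values with Pr(X = 1) = p > 1/2): the single most probable sequence, all ones, has a per-symbol log-probability far from H(X), while a sequence with k ≈ np ones is typical:

```python
import math

p, n, eps = 0.9, 100, 0.05                           # illustrative values, Pr(X = 1) = p > 1/2
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # about 0.469 bits

# The single most probable sequence: all ones.
print(f"all ones : -1/n log p = {-math.log2(p):.3f}   (H(X) = {H:.3f})  -> not typical")

# A sequence with k = np ones.
k = round(p * n)
logp = k * math.log2(p) + (n - k) * math.log2(1 - p)
print(f"k = np   : -1/n log p = {-logp / n:.3f}   -> within eps of H(X), hence typical")
```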
The source coding theorem states that nH(X) bits suffice to describe n i.i.d.
random variables. What if variables X1 , X2 , ..., Xn , Xn+1 , ... have some
statistical dependence?
In this case the entropy rate of a stochastic process is defined as
H(X) = lim_{n→∞} (1/n) H(X_1, X_2, ..., X_n),
the per-symbol entropy of the n random variables.
We can also define the conditional entropy rate
H'(X) = lim_{n→∞} H(X_n | X_{n−1}, ..., X_1),
which for stationary processes coincides with H(X). For a stationary Markov process
this reduces to
H(X) = H(X_2|X_1) = Σ_{x_1 ∈ X} p(x_1) H(X_2|X_1 = x_1) = −Σ_{i,j=1}^{N_states} p(i) [P]_{i,j} log [P]_{i,j}
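For the two-state chain of the earlier exercise (again with arbitrary illustrative values of α and β), the entropy rate can be computed directly from the stationary distribution and the transition matrix:

```python
import numpy as np

alpha, beta = 0.3, 0.1                           # same illustrative two-state chain as before
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)    # stationary distribution

# H(X) = -sum_i pi_i sum_j P_ij log2 P_ij
H_rate = -sum(pi[i] * P[i, j] * np.log2(P[i, j]) for i in range(2) for j in range(2))
print(f"entropy rate = {H_rate:.3f} bits/symbol (below the 1 bit of a memoryless fair binary source)")
```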
From the previous figure, some pairs of letters are quite predictable given the
first letter. Let us increase the block size...
It looks like the ability to predict each letter from the previous ones increases;
consequently, the entropy decreases with m, and the prediction of each letter
depends less on the letters of other blocks → blocks seem to be increasingly
independent with m.
(1/2) H(X_1, X_2) = (1/2) Σ_{i,j=1}^{27} p(x_i, x_j) log(1/p(x_i, x_j)) = 3.32 bits/letter
(1/3) H(X_1, X_2, X_3) = (1/3) Σ_{i,j,k=1}^{27} p(x_i, x_j, x_k) log(1/p(x_i, x_j, x_k)) = 2.73 bits/letter
H(X) = lim_{m→∞} (1/m) H(X_1, X_2, ..., X_m) = 1.19 bits/letter
See over for a graphical display (empirical values taken from Table I in
T. Schürmann, P. Grassberger, "Entropy estimation of symbol sequences",
Chaos: An Interdisciplinary Journal of Nonlinear Science, 6(3):414-427;
H(X) is extrapolated from the set of empirical entropies provided there).
Using the chain rule, the joint entropy per letter of blocks B_t^m and B_{t+1}^m is
(1/2m) H(B_t^m, B_{t+1}^m) = (1/2m) H(B_t^m) + (1/2m) H(B_{t+1}^m | B_t^m)
From the empirical observation in the plot above, if the block size is m ≥ 12
the entropy per letter does not change:
(1/2m) H(B_t^m, B_{t+1}^m) ≈ (1/m) H(B_t^m)   ∀m ≥ 12
Using both equations, this is achieved if
H(B_{t+1}^m | B_t^m) ≈ H(B_t^m) = H(B_{t+1}^m)
where stationarity of English has been applied in the last equality. Hence B_{t+1}^m
and B_t^m are nearly independent.
Therefore, we can trivially apply the source coding theorem to compress the
source to a number of bits per symbol equal to the entropy rate in this way:
assign each Btm a value in {1, . . . , |X |m }, and encode blocks of n of those
values (that is t = 1, . . . , n), with n very large.
The SMB (Shannon-McMillan-Breiman) theorem formally extends the source coding theorem
to ergodic sources:
−(1/n) log p(X_1, X_2, ..., X_n) → H(X) with probability 1,
where H(X) is the entropy rate. This allows defining the minimum rate of a
code for a correlated source. A way to design the code is to resort to the
source coding theorem for i.i.d. sources applied to blocks of n words, each word
of size equal to the correlation length.
Proof. It goes beyond the scope of the course; it can be found in
T. Cover and J. Thomas, Elements of Information Theory, chapter 16.
Way through...
The applications of the AEP and the concept of typicality reach beyond
data compression and will be found later in the course.
We have developed a constructive proof of the source coding theorem,
but notice that it only applies to very large sequences.
Chapter 4 introduces practical codes of finite length whose average length
approaches the entropy bound.