Math7224 Notes
Information Theory Notes
Billy Leung
May 2024
Introduction
This course delves into one of the many topics in Probability Theory. In Semester 1 of the 2023-24 academic year, the topic was Information Theory. Information theory is the mathematical study of the quantification, storage, and communication of information. A fundamental problem in this field is how to communicate reliably over unreliable channels. These notes also discuss applications of information theory to statistical inference.
Recommended reading: Elements of Information Theory by Cover & Thomas,
2nd Edition
Contents
1 Basic Concepts
2 Fano’s Inequality
3 Capacity of a Channel
1 Basic Concepts
The most important concept in information theory is entropy, which measures
the level of uncertainty involved in the value of a random variable or the outcome
of a random process.
Definition 1.1. The entropy of a discrete random variable X is defined by H(X) := − ∑_{x∈𝒳} p(x) log p(x), where 𝒳 is the sample space of X.
Remark 1.1. 1. We define 0 log 0 to be 0.
2. Clearly from the definition, H(X) ≥ 0.
Example 1.1. If X = 1 with probability p and X = 0 with probability 1 − p, then H(X) = −p log p − (1 − p) log(1 − p).
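As a quick numerical illustration (a Python sketch of mine, not part of the original notes), the binary entropy of Example 1.1 can be evaluated directly; logarithms are base 2 throughout.

from math import log2

def binary_entropy(p):
    """H(p) = -p log p - (1 - p) log(1 - p), with 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit: a fair coin is maximally uncertain
print(binary_entropy(0.1))  # roughly 0.47 bits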
Definition 1.2. The joint entropy H(X, Y) of random variables X, Y is defined by H(X, Y) = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x, y).
Definition 1.3. The conditional entropy H(Y|X) is defined by
H(Y|X) = ∑_{x∈𝒳} p(x) H(Y|X = x) = − ∑_{x∈𝒳} p(x) ∑_{y∈𝒴} p(y|x) log p(y|x) = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(y|x).
The mutual information between X and Y is defined by I(X; Y) := ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log (p(x, y)/(p(x)p(y))). Intuitively, I(X; Y) indicates how much X and Y are correlated.
Proposition 1.2. (1) I(X; Y) = H(X) − H(X|Y).
(2) I(X; Y) = H(X) + H(Y) − H(X, Y).
(3) I(X; Y) = I(Y; X).
(4) I(X; Y) = H(X) if Y = X.
(5) (Conditioning reduces entropy) I(X; Y) ≥ 0, with equality iff X and Y are independent.
We prove (1) only; the others are left as exercises:
I(X; Y) = ∑_{x,y} p(x, y) log (p(x|y)/p(x))
= − ∑_{x,y} p(x, y) log p(x) + ∑_{x,y} p(x, y) log p(x|y)
= H(X) − H(X|Y)
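To make Proposition 1.2 concrete, here is a small numerical check (my own sketch; the joint pmf below is hypothetical). It verifies identities (2) and (5) for a joint distribution on {0, 1} × {0, 1}.

from math import log2

# A hypothetical joint pmf p(x, y) on {0, 1} x {0, 1}
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

px = {x: sum(q for (a, _), q in p.items() if a == x) for x in (0, 1)}
py = {y: sum(q for (_, b), q in p.items() if b == y) for y in (0, 1)}

I = sum(q * log2(q / (px[x] * py[y])) for (x, y), q in p.items() if q > 0)

print(abs(I - (H(px) + H(py) - H(p))) < 1e-12)  # (2): I(X;Y) = H(X) + H(Y) - H(X,Y)
print(I >= 0)                                   # (5): I(X;Y) is non-negative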
Theorem 1.1 (Information inequality). For probability mass functions p and q on 𝒳, the relative entropy D(p||q) := ∑_{x∈𝒳} p(x) log (p(x)/q(x)) satisfies D(p||q) ≥ 0, with equality iff p(x) = q(x) for all x ∈ 𝒳.
Proof.
−D(p||q) = ∑_x p(x) log (q(x)/p(x))
≤ log ∑_x p(x) (q(x)/p(x))   (Jensen's inequality; the logarithm is a concave function)
= log ∑_x q(x)
= log 1
= 0
Note that equality holds iff q(x)/p(x) is constant, which implies p(x) = q(x) for all x ∈ 𝒳.
Corollary 1.3. H(X) ≤ log |𝒳|, with equality iff X has a uniform distribution over 𝒳.
Proof. Consider D(p||u), where u(x) = 1/|𝒳| for x ∈ 𝒳; then D(p||u) = log |𝒳| − H(X) ≥ 0.
Theorem 1.2 (Log-sum inequality). For a_1, ..., a_n, b_1, ..., b_n ≥ 0,
∑_{i=1}^n a_i log (a_i/b_i) ≥ (∑_{i=1}^n a_i) log (∑_{i=1}^n a_i / ∑_{i=1}^n b_i).
Proof. Assume a_j, b_j > 0. Since f(x) = x log x is strictly convex, by Jensen's inequality, for t_1 + t_2 + ··· + t_n = 1 with t_i ≥ 0 we have ∑_i t_i f(x_i) ≥ f(∑_i t_i x_i). Setting x_i = a_i/b_i and t_i = b_i / ∑_j b_j, we get the desired result.
2 Fano’s Inequality
Consider the following problem: Suppose we want to estimate random variable
X ∼ p(x) but we only observe Y , which is related to X by p(y|x). How should
we estimate X using X̂ = g(Y ) to minimize the probability of error Pe = P (X̂ ̸=
X)?
Theorem 2.1 (Fano’s inequality). H(P_e) + P_e log(|𝒳| − 1) ≥ H(X|Y).
Proof. Define an error random variable E = 1 if X̂ ≠ X and E = 0 otherwise. Now expand H(E, X|Y) in two ways. On one hand, H(E, X|Y) = H(X|Y) + H(E|X, Y) = H(X|Y), since E is determined by X and X̂ = g(Y). On the other hand, H(E, X|Y) = H(E|Y) + H(X|E, Y) ≤ H(P_e) + P_e log(|𝒳| − 1), because H(E|Y) ≤ H(E) = H(P_e), H(X|E = 0, Y) = 0, and given E = 1 the variable X takes at most |𝒳| − 1 values. Combining the two expansions gives the result.
Definition 2.1. The typical set A_ϵ^{(n)} with respect to p(x) is the set of sequences (x_1, x_2, ..., x_n) ∈ 𝒳^n such that 2^{−n(H(X)+ϵ)} ≤ p(x_1, x_2, ..., x_n) ≤ 2^{−n(H(X)−ϵ)}.
Theorem 2.3. (1) P(A_ϵ^{(n)}) := P((X_1, ..., X_n) ∈ A_ϵ^{(n)}) > 1 − ϵ for n sufficiently large.
(2) |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)}.
(3) |A_ϵ^{(n)}| ≥ (1 − ϵ) 2^{n(H(X)−ϵ)} for n sufficiently large.
Proof. For (1), note that P(A_ϵ^{(n)}) = P(|−(1/n) log p(X_1, X_2, ..., X_n) − H(X)| < ϵ) → 1 as n → ∞ by the AEP Theorem (for an i.i.d. source, −(1/n) log p(X_1, ..., X_n) → H(X) in probability). So P(A_ϵ^{(n)}) > 1 − ϵ for sufficiently large n.
For (2), 1 = ∑_{x^n ∈ 𝒳^n} p(x^n) ≥ ∑_{x^n ∈ A_ϵ^{(n)}} p(x^n) ≥ ∑_{x^n ∈ A_ϵ^{(n)}} 2^{−n(H(X)+ϵ)} = 2^{−n(H(X)+ϵ)} |A_ϵ^{(n)}|, which implies |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)}.
For (3), we know that for sufficiently large n,
1 − ϵ < P(A_ϵ^{(n)})   (by (1))
≤ ∑_{x^n ∈ A_ϵ^{(n)}} 2^{−n(H(X)−ϵ)}
= 2^{−n(H(X)−ϵ)} |A_ϵ^{(n)}|,
which implies |A_ϵ^{(n)}| ≥ (1 − ϵ) 2^{n(H(X)−ϵ)}.
Remark 2.1. Typical sequences are ”VIP” sequences: very few sequences are
typical, but typical sequences occupy almost the entire probability space.
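The "VIP" picture can be seen numerically for a small i.i.d. source (a sketch of mine; p, n and ϵ are arbitrary choices). The typical set is a small fraction of all 2^n sequences, yet it already carries most of the probability, and that share tends to 1 as n grows.

from math import log2
from itertools import product

p, n, eps = 0.2, 16, 0.3
H = -p * log2(p) - (1 - p) * log2(1 - p)

typical_count, typical_prob = 0, 0.0
for seq in product([0, 1], repeat=n):
    k = sum(seq)
    prob = p ** k * (1 - p) ** (n - k)
    if abs(-log2(prob) / n - H) < eps:        # membership in the typical set
        typical_count += 1
        typical_prob += prob

print(typical_count, 2 ** n)                  # only a small fraction of sequences are typical
print(typical_prob)                           # ...but they already carry most of the probability
print(typical_count <= 2 ** (n * (H + eps)))  # bound (2) of Theorem 2.3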
Now, we move on to discuss a noiseless source coding problem.
Theorem 2.4 (Kraft’s inequality). For any instantaneous code C over an alphabet of size D, the codeword lengths l_1, l_2, ..., l_m must satisfy the inequality ∑_{i=1}^m D^{−l_i} ≤ 1.
Conversely, given a set of l1 , l2 , · · · , lm that satisfy this inequality, there exists
an instantaneous code with these codeword lengths.
Proof. Let l_max = max{l_1, ..., l_m}. Consider the D-ary tree of depth l_max. Any codeword in C can be represented by a path in the tree starting from the root. If C is instantaneous, no such path is a prefix (subpath) of another. A codeword path of length l_i has D^{l_max − l_i} descendants at level l_max, and the descendant sets of distinct codeword paths are disjoint. So ∑_{i=1}^m D^{l_max − l_i} ≤ D^{l_max}, which implies ∑_{i=1}^m D^{−l_i} ≤ 1.
[Figure: a binary code tree illustrating the proof.]
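The converse direction of Kraft's inequality can be made constructive: sort the lengths and assign codewords greedily down the D-ary tree. The sketch below (mine; the lengths are hypothetical) checks the inequality and builds such a code for D = 2.

def kraft_sum(lengths, D=2):
    return sum(D ** (-l) for l in lengths)

def canonical_code(lengths, D=2):
    """Assign D-ary codewords to lengths that satisfy Kraft's inequality."""
    assert kraft_sum(lengths, D) <= 1 + 1e-12
    code, next_val, prev_len = [], 0, 0
    for l in sorted(lengths):
        next_val *= D ** (l - prev_len)      # move the counter down to depth l
        digits, v = [], next_val
        for _ in range(l):                   # write next_val in base D with l digits
            digits.append(str(v % D))
            v //= D
        code.append("".join(reversed(digits)))
        next_val += 1
        prev_len = l
    return code

lengths = [1, 2, 3, 3]                       # hypothetical lengths with Kraft sum 1
print(kraft_sum(lengths))                    # 1.0
print(canonical_code(lengths))               # ['0', '10', '110', '111']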
For any instantaneous D-ary code with expected length L = ∑_i p_i l_i, we have L ≥ H_D(X):
L − H_D(X) = ∑_i p_i l_i − ∑_i p_i log_D (1/p_i)
= − ∑_i p_i log_D D^{−l_i} + ∑_i p_i log_D p_i
= ∑_i p_i log_D (p_i/r_i) + log_D (1/∑_j D^{−l_j}),   where r_i = D^{−l_i} / ∑_j D^{−l_j}
≥ 0   (by the information inequality and Kraft's inequality).
Recall that the optimal (relaxed) solution is l_i* = log_D (1/p_i). Rounding l_i* up to l_i = ⌈log_D (1/p_i)⌉ gives lengths that satisfy Kraft's inequality:
∑_i D^{−⌈log_D (1/p_i)⌉} ≤ ∑_i D^{−log_D (1/p_i)} = ∑_i p_i = 1.
It is not hard to observe that log_D (1/p_i) ≤ l_i < log_D (1/p_i) + 1, which implies H_D(X) ≤ L < H_D(X) + 1.
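A quick check of the Shannon code lengths l_i = ⌈log_D(1/p_i)⌉ for D = 2 (my sketch; the distribution is hypothetical): they satisfy Kraft's inequality, and the expected length lands in [H(X), H(X) + 1).

from math import log2, ceil

p = [0.45, 0.25, 0.15, 0.10, 0.05]           # a hypothetical source distribution
lengths = [ceil(log2(1 / pi)) for pi in p]   # Shannon code lengths
H = -sum(pi * log2(pi) for pi in p)
L = sum(pi * li for pi, li in zip(p, lengths))

print(lengths)                               # [2, 2, 3, 4, 5]
print(sum(2 ** (-l) for l in lengths) <= 1)  # Kraft's inequality holds
print(H <= L < H + 1)                        # True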
We now apply Shannon-Fano coding to stationary processes.
Let X = X_{−∞}^{∞} = ··· X_{−1} X_0 X_1 X_2 ···. Coding blocks of length n, we have H(X_1^n) ≤ E[l(X_1^n)] < H(X_1^n) + 1, so
H(X_1^n)/n ≤ E[l(X_1^n)]/n < (H(X_1^n) + 1)/n.
Note that, by the chain rule and stationarity,
H(X_1^n)/n = (H(X_1) + H(X_2|X_1) + ··· + H(X_n|X_1^{n−1}))/n = (H(X_0) + H(X_0|X_{−1}) + ··· + H(X_0|X_{−n+1}^{−1}))/n.
By the fact that conditioning reduces entropy, we have H(X_0|X_{−n+1}^{−1}) ≤ H(X_0|X_{−n+2}^{−1}) ≤ ··· ≤ H(X_0|X_{−1}) ≤ H(X_0), and since H(X_0|X_{−n+1}^{−1}) ≥ 0, we can conclude that lim_{n→∞} H(X_0|X_{−n+1}^{−1}) exists, which implies that lim_{n→∞} H(X_1^n)/n exists (as the Cesàro average of a convergent sequence). Moreover, lim_{n→∞} H(X_1^n)/n = lim_{n→∞} H(X_0|X_{−n+1}^{−1}). This limit is called the entropy rate of X, denoted by H(X). When X is i.i.d., H(X) = H(X_1). When X is a one-step stationary Markov process, H(X) = H(X_2|X_1).
Theorem 2.5. For a stationary process, the expected per-symbol length of Shannon-Fano coding on blocks approaches its entropy rate arbitrarily closely.
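To illustrate the Markov case H(X) = H(X_2|X_1), here is a small sketch of mine for a two-state stationary chain with a hypothetical transition matrix.

from math import log2

# Hypothetical transition matrix P[i][j] = Pr(X_{t+1} = j | X_t = i)
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution of a two-state chain [[1-a, a], [b, 1-b]] is (b, a)/(a+b)
a, b = P[0][1], P[1][0]
mu = [b / (a + b), a / (a + b)]

def row_entropy(row):
    return -sum(q * log2(q) for q in row if q > 0)

rate = sum(m * row_entropy(row) for m, row in zip(mu, P))
print(mu)    # [0.8, 0.2]
print(rate)  # entropy rate H(X2 | X1) in bits per symbol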
Theorem 2.6 (McMillan). The codeword lengths of any uniquely decodable D-ary code must satisfy ∑_i D^{−l_i} ≤ 1.
Proof. For any fixed k,
(∑_{x∈𝒳} D^{−l(x)})^k = ∑_{x_1∈𝒳} ∑_{x_2∈𝒳} ··· ∑_{x_k∈𝒳} D^{−l(x_1)} ··· D^{−l(x_k)}
= ∑_{(x_1, x_2, ..., x_k) ∈ 𝒳^k} D^{−l(x_1) − ··· − l(x_k)}
= ∑_{x_1^k ∈ 𝒳^k} D^{−l(x_1^k)}
= ∑_{m=1}^{k l_max} a(m) D^{−m},
where l(x_1^k) = l(x_1) + ··· + l(x_k) and a(m) is the number of source strings x_1^k whose concatenated codeword has length m. By unique decodability, a(m) ≤ D^m, so (∑_{x∈𝒳} D^{−l(x)})^k ≤ ∑_{m=1}^{k l_max} D^m D^{−m} = k l_max. Hence ∑_{x∈𝒳} D^{−l(x)} ≤ (k l_max)^{1/k}, and since lim_{k→∞} (k l_max)^{1/k} = 1, we are done.
The theorem says that although there are uniquely decodable codes that
are not instantaneous, the set of achievable codeword lengths is the same for
uniquely decodable codes and instantaneous codes.
Huffman Coding: yields an optimal instantaneous code for a given distribution.
Example 2.3. [Figure: a worked Huffman coding example; for instance, a symbol of probability 0.15 receives codeword 001.]
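Since the worked example above survives only in fragments, here is a compact Huffman construction (a sketch of mine; the symbol probabilities are hypothetical). It repeatedly merges the two least probable nodes, which is exactly the Huffman procedure.

import heapq

def huffman(probs):
    """probs: dict symbol -> probability; returns dict symbol -> binary codeword."""
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}  # hypothetical
code = huffman(probs)
print(code)
print(sum(probs[s] * len(w) for s, w in code.items()))          # expected length L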
3 Capacity of a Channel
[Figure: a communication system with a discrete memoryless channel.]
Definition 3.1. A discrete channel is a system consisting of an input alphabet 𝒳, an output alphabet 𝒴, and a transition probability matrix {p(y|x)}, where p(y|x) expresses the probability of observing the output y given that input x is sent. The channel is said to be memoryless if the probability distribution of the output depends only on the input at that time and is conditionally independent of previous or future inputs and outputs.
Definition 3.2. The channel capacity of a discrete memoryless channel is defined as C := max_{p(x)} I(X; Y).
Example 3.1. For a binary symmetric channel with 𝒳 = 𝒴 = {0, 1}, p(0|0) = p(1|1) = 1 − p and p(0|1) = p(1|0) = p, the mutual information satisfies I(X; Y) = H(Y) − H(Y|X) = H(Y) − H(p) ≤ 1 − H(p), with equality iff p(0) = p(1) = 1/2. In other words, C = 1 − H(p), which is achieved by p(0) = p(1) = 1/2.
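The value C = 1 − H(p) can be confirmed numerically (my own sketch) by maximizing I(X; Y) over input distributions on a grid; the crossover probability below is arbitrary.

from math import log2

def h(t):
    return 0.0 if t in (0.0, 1.0) else -t * log2(t) - (1 - t) * log2(1 - t)

def bsc_mutual_info(q, p):
    """I(X;Y) when Pr(X=1) = q and the BSC has crossover probability p."""
    r = q * (1 - p) + (1 - q) * p        # Pr(Y = 1)
    return h(r) - h(p)                   # H(Y) - H(Y|X)

p = 0.11                                 # hypothetical crossover probability
best = max(bsc_mutual_info(q / 1000, p) for q in range(1001))
print(best)        # maximum mutual information over the grid
print(1 - h(p))    # 1 - H(p); the two agree (about 0.50 bits here)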
Definition 3.3. An (M, n)-code consists of the following:
1. An index set {1, 2, ..., M}.
2. An encoding function f : {1, 2, ..., M} → 𝒳^n, yielding codewords x^n(1), x^n(2), ..., x^n(M).
3. A decoding function g : 𝒴^n → {1, 2, ..., M}.
Definition 3.4. Define the probability of error λ_i = P(g(Y^n) ≠ i | X^n = x^n(i)) for 1 ≤ i ≤ M. Next define the maximal error probability λ^{(n)} := max_{i∈{1,2,...,M}} λ_i. Finally, the average probability of error of an (M, n)-code is defined as P_e^{(n)} = (1/M) ∑_{i=1}^M λ_i.
Proof. For (1) and (2), see Theorem 3.3.
For (3), we have, for independent X̃^n and Ỹ^n drawn according to p(x^n) and p(y^n),
P((X̃^n, Ỹ^n) ∈ A_ϵ^{(n)}) = ∑_{(x^n, y^n) ∈ A_ϵ^{(n)}} p(x^n) p(y^n) ≤ 2^{n(H(X,Y)+ϵ)} · 2^{−n(H(X)−ϵ)} · 2^{−n(H(Y)−ϵ)} = 2^{−n(I(X;Y)−3ϵ)}.
Consequently, for a randomly generated codebook with joint-typicality decoding, the error probability for the first message satisfies
λ_1 ≤ ϵ + (2^{nR} − 1) 2^{−n(I(X;Y)−3ϵ)}
≤ ϵ + 2^{−n(I(X;Y)−3ϵ−R)}
≤ 2ϵ   if R < I(X; Y) − 3ϵ and n is large enough.
Similarly, we can obtain λ_i ≤ 2ϵ for 2 ≤ i ≤ 2^{nR} for n large enough and R < I(X; Y) − 3ϵ, and so, averaged over the random codebooks B, P_e^{(n)} = 2^{−nR}(λ_1 + λ_2 + ··· + λ_{2^{nR}}) < 2ϵ. Thus, there is some realization B* of B such that P_e^{(n)} = 2^{−nR}(λ_1 + λ_2 + ··· + λ_{2^{nR}}) < 2ϵ. By renumbering, if necessary, assume λ_1 ≤ λ_2 ≤ ··· ≤ λ_{2^{nR}}. Therefore, max{λ_1, λ_2, ..., λ_{2^{nR−1}}} = λ_{2^{nR−1}} < 4ϵ (otherwise the worse half of the codewords alone would push the average above 2ϵ). Keeping only x^n(1), x^n(2), ..., x^n(2^{nR−1}), we obtain a (2^{n(R−1/n)}, n) code with maximal error probability less than 4ϵ, if R < I(X; Y). Choosing p(x) to be the capacity-achieving distribution, there exists a (2^{n(R−1/n)}, n) code with λ^{(n)} → 0 as n → ∞ if R < C.
(n)
Conversely, for any sequence of (2nR , n) codes with λ(n) → 0, we have Pe →
0, so if W is uniformly distributed over {1, 2, · · · , 2nR },
nR
2
X
P (Ŵ ̸= W ) = P (Ŵ ̸= i|W = i)P (W = i)
i=1
1
= (λ1 + λ2 + · · · + λ2nR )
2nR
= Pe(n) → 0
Hence,
nR = H(W )
= H(W |Y n ) + I(W ; Y n )
≤ H(X|Y n ) + I(X n (W ); Y n ) (applying data processing inequality for W → X n (W ) → Y n )
≤ 1 + Pe(n) nR + I(X n (W ); Y n ) (Fano’s inequality)
Note that
12
Remark 3.1. C_FB ≥ C, since any code that does not use feedback is in particular a feedback code.
Theorem 3.3. C_FB = C for a DMC with feedback.
Proof. To prove C_FB ≤ C, for any achievable rate R and W uniformly distributed over {1, 2, ..., 2^{nR}} with P_e^{(n)} → 0 as n → ∞, we will show that R ≤ C. Note that
nR = H(W)
= H(W|Y^n) + I(W; Y^n)
≤ 1 + P_e^{(n)} nR + I(W; Y^n)   (Fano's inequality)
Example 3.4. An irreducible finite-state stationary Markov chain is ergodic.
Example 3.5. A reducible finite-state stationary Markov chain is not ergodic.
Theorem 3.4 (Shannon-McMillan-Breiman Theorem). For a finite-state stationary ergodic process X = X_{−∞}^{∞}, −(1/n) log p(X_1, X_2, ..., X_n) → H(X) with probability 1.
Remark 3.3. (1) The typical set A_ϵ^{(n)} is defined similarly for a stationary ergodic process.
(2) The AEP and the joint AEP hold for A_ϵ^{(n)}.
Theorem 3.5 (Separation Theorem). There exists a source-channel code with P_e^{(n)} → 0 as n → ∞ if H(V) < C. Conversely, for any stationary process V, if H(V) > C, the probability of error P_e^{(n)} is bounded away from 0, and hence it is impossible to send the process over the channel with an arbitrarily low probability of error.
Proof. By the AEP, there exists a typical set A_ϵ^{(n)} of size not greater than 2^{n(H(V)+ϵ)} which occupies almost the entire probability space. We will encode only the sequences in the typical set A_ϵ^{(n)}; all other sequences result in an error, contributing at most ϵ to the probability of error for large n. By the Channel Coding Theorem, as long as H(V) + ϵ < C, the receiver can reconstruct V^n ∈ A_ϵ^{(n)} with an arbitrarily low probability of error. So P_e^{(n)} = P(V̂^n ≠ V^n) ≤ P(V^n ∉ A_ϵ^{(n)}) + P(g(Y^n) ≠ V^n | V^n ∈ A_ϵ^{(n)}) ≤ ϵ + ϵ = 2ϵ for sufficiently large n. In other words, we can reconstruct the source with an arbitrarily small probability of error if H(V) < C.
Conversely, suppose P_e^{(n)} → 0 as n → ∞. Then
H(V) ≤ H(V_1, V_2, ..., V_n)/n
= (1/n) H(V^n | V̂^n) + (1/n) I(V^n; V̂^n)
≤ (1/n)(1 + P_e^{(n)} log |𝒱|^n) + (1/n) I(V^n; V̂^n)   (by Fano's inequality)
≤ (1/n)(1 + P_e^{(n)} log |𝒱|^n) + (1/n) I(X^n; Y^n)   (applying the data processing lemma to the Markov chain V^n → X^n → Y^n → V̂^n)
≤ 1/n + P_e^{(n)} log |𝒱| + C.
Letting n → ∞ and using P_e^{(n)} → 0, we get H(V) ≤ C.
Definition 4.1. A distortion function (or measure) is a mapping d : 𝒳 × 𝒳̂ → R_+.
A distortion measure is said to be bounded if d_max := max_{x∈𝒳, x̂∈𝒳̂} d(x, x̂) < ∞.
The distortion between sequences x^n and x̂^n is defined by d(x^n, x̂^n) = (1/n) ∑_{i=1}^n d(x_i, x̂_i).
In the following, we will explore the trade-off between rate and distortion.
Definition 4.4. The rate distortion function, R(D), is the infimum of all rates
R such that (R, D) is achievable.
Theorem 4.1 (Rate Distortion Theorem). The rate distortion function for an i.i.d. source X ∼ p(x) and bounded distortion function d(x, x̂) can be computed as
R(D) = min_{p(x̂|x): ∑_{x,x̂} p(x)p(x̂|x)d(x,x̂) ≤ D} I(X; X̂).
As an example, consider a Bernoulli(p) source (with p ≤ 1/2) under Hamming distortion. Now compute the case 0 ≤ D ≤ p. Note that
I(X; X̂) = H(X) − H(X|X̂) = H(p) − H(X ⊕ X̂ | X̂) ≥ H(p) − H(X ⊕ X̂) ≥ H(p) − H(D).
To prove the last inequality, note that H(·) is increasing on [0, 1/2] and hence on [0, D], and P(X ⊕ X̂ = 1) = ∑_{x,x̂} p(x)p(x̂|x) d(x, x̂) ≤ D. Now, choosing the conditional distribution of X̂ appropriately (a binary test channel meeting the distortion constraint with equality), we achieve the lower bound H(p) − H(D), so R(D) = H(p) − H(D) in this range.
If D ≥ p, let X̂ = 0 with probability 1; we have ∑_{x,x̂} p(x)p(x̂|x) d(x, x̂) = p ≤ D and I(X; X̂) = H(X) − H(X|X̂) = 0, so R(D) = 0.
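The formula R(D) = H(p) − H(D) can be sanity-checked numerically (a sketch of mine): brute-force over binary test channels p(x̂|x) that meet the distortion constraint and minimize I(X; X̂).

from math import log2

def h(t):
    return 0.0 if t <= 0 or t >= 1 else -t * log2(t) - (1 - t) * log2(1 - t)

def min_mutual_info(p, D, steps=200):
    """Grid-search min I(X; Xhat) over p(xhat|x) with expected Hamming distortion <= D."""
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            a, b = i / steps, j / steps        # a = p(xhat=1 | x=0), b = p(xhat=0 | x=1)
            if (1 - p) * a + p * b > D:        # distortion constraint violated
                continue
            q1 = (1 - p) * a + p * (1 - b)     # Pr(Xhat = 1)
            I = h(q1) - ((1 - p) * h(a) + p * h(b))   # H(Xhat) - H(Xhat|X)
            best = min(best, I)
    return best

p, D = 0.3, 0.1
print(min_mutual_info(p, D))   # close to H(p) - H(D)
print(h(p) - h(D))             # R(D) = H(p) - H(D) for 0 <= D <= p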
Definition 4.5. Let (X, X̂) ∼ p(x, x̂), and let d(x, x̂) be a distortion measure on 𝒳 × 𝒳̂. Then the distortion typical set A_{d,ϵ}^{(n)} is defined as the set
{(x^n, x̂^n) : |−(1/n) log p(x^n) − H(X)| < ϵ, |−(1/n) log p(x̂^n) − H(X̂)| < ϵ, |−(1/n) log p(x^n, x̂^n) − H(X, X̂)| < ϵ, |d(x^n, x̂^n) − E[d(X, X̂)]| < ϵ}.
Lemma 4.1. For all (x^n, x̂^n) ∈ A_{d,ϵ}^{(n)}, p(x̂^n) ≥ p(x̂^n|x^n) 2^{−n(I(X;X̂)+3ϵ)}.
Proof.
p(x̂^n|x^n) = p(x̂^n) · p(x^n, x̂^n) / (p(x^n) p(x̂^n))
≤ p(x̂^n) · 2^{−n(H(X,X̂)−ϵ)} / (2^{−n(H(X)+ϵ)} 2^{−n(H(X̂)+ϵ)})
= p(x̂^n) 2^{n(I(X;X̂)+3ϵ)}.
Decoding: decode by X̂^n(w). Under this scheme, the expected distortion D̄ is computed over both the source and the random choice of codebooks C, as D̄ = E_{X^n, C}[d(X^n, X̂^n)]. Next, we bound D̄:
D̄ = ∑_C P(C) ∑_{x^n} p(x^n) d(x^n, x̂^n(w))
= ∑_C P(C) ( ∑_{x^n ∈ J(C)} p(x^n) d(x^n, x̂^n(w)) + ∑_{x^n ∉ J(C)} p(x^n) d(x^n, x̂^n(1)) )   (where J(C) is the set of x^n having at least one codeword x̂^n that is distortion jointly typical with it)
≤ D + ϵ + ∑_C P(C) ∑_{x^n ∉ J(C)} p(x^n) d(x^n, x̂^n(1))
Noticing that d(·, ·) is bounded, we will show that P_e := ∑_C P(C) ∑_{x^n ∉ J(C)} p(x^n) is arbitrarily small as n → ∞, and hence so is ∑_C P(C) ∑_{x^n ∉ J(C)} p(x^n) d(x^n, x̂^n(1)) ≤ d_max P_e.
P_e = ∑_{x^n} p(x^n) ∑_{C: x^n ∉ J(C)} P(C)
= ∑_{x^n} p(x^n) ∏_{i=1}^{2^{nR}} P((x^n, X̂^n(i)) ∉ A_{d,ϵ}^{(n)})
= ∑_{x^n} p(x^n) ∏_{i=1}^{2^{nR}} ( ∑_{x̂^n: (x^n, x̂^n) ∉ A_{d,ϵ}^{(n)}} p(x̂^n) )
= ∑_{x^n} p(x^n) ( ∑_{x̂^n: (x^n, x̂^n) ∉ A_{d,ϵ}^{(n)}} p(x̂^n) )^{2^{nR}}
= ∑_{x^n} p(x^n) (1 − ∑_{x̂^n} p(x̂^n) K(x^n, x̂^n))^{2^{nR}}   (where K(x^n, x̂^n) = 1 if (x^n, x̂^n) ∈ A_{d,ϵ}^{(n)} and 0 otherwise)
≤ ∑_{x^n} p(x^n) (1 − 2^{−n(I(X;X̂)+3ϵ)} ∑_{x̂^n} p(x̂^n|x^n) K(x^n, x̂^n))^{2^{nR}}   (∗)
≤ ∑_{x^n} p(x^n) (1 − ∑_{x̂^n} p(x̂^n|x^n) K(x^n, x̂^n) + e^{−2^{nR} · 2^{−n(I(X;X̂)+3ϵ)}})   (∗∗)
= 1 − ∑_{x^n, x̂^n} p(x^n, x̂^n) K(x^n, x̂^n) + e^{−2^{n(R−I(X;X̂)−3ϵ)}}
= P((X^n, X̂^n) ∉ A_{d,ϵ}^{(n)}) + e^{−2^{n(R−I(X;X̂)−3ϵ)}}
≤ ϵ + e^{−2^{n(R−I(X;X̂)−3ϵ)}}
Note that (∗) follows from p(x̂^n) ≥ p(x̂^n|x^n) 2^{−n(I(X;X̂)+3ϵ)} (Lemma 4.1) and (∗∗) follows from (1 − xy)^n ≤ 1 − x + e^{−ny} for 0 ≤ x, y ≤ 1 and n ≥ 1.
Therefore, if R > I(X; X̂) + 3ϵ = R(D) + 3ϵ, then D̄ ≤ D plus a term that is arbitrarily small if we choose ϵ and n appropriately. So there is at least one code C* with rate R (which is > R(D)) and average distortion D̄ ≤ D + (a.s.n., an arbitrarily small number). To get rid of the a.s.n., we show that R(D) is a continuous function of D, so that R > R(D) implies R > R(D − ϵ) for small enough ϵ, and we can replace D by D − ϵ in the above argument.
Proposition 4.1. R(D) is a non-increasing convex function of D.
Proof. For any 0 < λ < 1, we need to prove that R(D_λ) = R(λD_1 + (1−λ)D_2) ≤ λR(D_1) + (1−λ)R(D_2). Let P_1 and P_2 be conditional distributions p(x̂|x) achieving R(D_1) and R(D_2) respectively, and consider P_λ = λP_1 + (1−λ)P_2, which meets the distortion constraint D_λ. By the fact that I(X; X̂) is a convex function of p(x̂|x) for given p(x), we have R(D_λ) ≤ I_{P_λ}(X; X̂) ≤ λ I_{P_1}(X; X̂) + (1−λ) I_{P_2}(X; X̂) = λR(D_1) + (1−λ)R(D_2).
To prove the converse part of Theorem 4.1, consider any code with encoder f_n : 𝒳^n → {1, 2, ..., 2^{nR}} and reproduction X̂^n (a function of f_n(X^n)) such that E[d(X^n, X̂^n)] ≤ D. We have
nR ≥ H(f_n(X^n))
≥ H(f_n(X^n)) − H(f_n(X^n)|X^n)
= I(X^n; f_n(X^n))
≥ I(X^n; X̂^n)   (by the data-processing inequality)
= H(X^n) − H(X^n|X̂^n)
= ∑_{i=1}^n H(X_i) − H(X^n|X̂^n)   (since the X_i are i.i.d.)
= ∑_{i=1}^n H(X_i) − ∑_{i=1}^n H(X_i | X̂^n, X_{i−1}, X_{i−2}, ..., X_1)
≥ ∑_{i=1}^n H(X_i) − ∑_{i=1}^n H(X_i|X̂_i)   (conditioning reduces entropy)
= ∑_{i=1}^n I(X_i; X̂_i)
≥ ∑_{i=1}^n R(E[d(X_i, X̂_i)])   (by the definition of R(D))
= n ( (1/n) ∑_{i=1}^n R(E[d(X_i, X̂_i)]) )
≥ n R( (1/n) ∑_{i=1}^n E[d(X_i, X̂_i)] )   (by convexity of R(D) and Jensen's inequality)
= n R(E[d(X^n, X̂^n)])
≥ n R(D)   (since E[d(X^n, X̂^n)] ≤ D and R(D) is non-increasing)
Remark 4.1. When D = 0 and d(x, x̂) is the Hamming distortion (d(x, x̂) = 0 if x = x̂ and 1 if x ≠ x̂), the constraint forces X̂ = X, so R(0) = min I(X; X̂) = I(X; X) = H(X).
Recall that the type P_{x^n} of a sequence x^n is its empirical distribution, the type class T(P) is the set of sequences of type P, and 𝒫_n denotes the set of types with denominator n.
Example 5.2. Let 𝒳 = {1, 2, 3} and x^n = 11321. Then P_{x^n}(1) = 3/5, P_{x^n}(2) = P_{x^n}(3) = 1/5, and T(P_{x^n}) = {11123, 11132, 11213, ..., 32111}. And |T(P_{x^n})| = (5 choose 3, 1, 1) = 5!/(3! 1! 1!) = 20.
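The counts in Example 5.2 are easy to verify directly (my own sketch): enumerate 𝒳^5 and keep the sequences with the same type as 11321.

from itertools import product
from collections import Counter

alphabet = (1, 2, 3)
x = (1, 1, 3, 2, 1)
type_x = Counter(x)                      # empirical counts of the sequence

type_class = [s for s in product(alphabet, repeat=len(x)) if Counter(s) == type_x]
print(len(type_class))                   # 20 = 5!/(3! 1! 1!)
print(type_class[:3])                    # [(1, 1, 1, 2, 3), (1, 1, 1, 3, 2), (1, 1, 2, 1, 3)]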
Theorem 5.1. The number of types satisfies |𝒫_n| ≤ (n + 1)^{|𝒳|}.
Proof. Trivial.
Theorem 5.2. If X_1, X_2, ..., X_n are drawn i.i.d. according to Q(x), then Q^n(x^n) = 2^{−n(H(P_{x^n}) + D(P_{x^n}||Q))}.
Proof. Q^n(x^n) = ∏_{a∈𝒳} Q(a)^{n P_{x^n}(a)} = 2^{n ∑_a P_{x^n}(a) log Q(a)} = 2^{−n(H(P_{x^n}) + D(P_{x^n}||Q))}.
Theorem 5.3. For any type P ∈ 𝒫_n, (1/(n+1)^{|𝒳|}) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}.
Proof. For the upper bound, using Theorem 5.2 with Q = P (so that P^n(x^n) = 2^{−nH(P)} for x^n ∈ T(P)),
1 ≥ P^n(T(P))
= ∑_{x^n ∈ T(P)} P^n(x^n)
= ∑_{x^n ∈ T(P)} 2^{−nH(P)}
= |T(P)| 2^{−nH(P)}.
For the lower bound, we first note that under P^n, the type class T(P) is at least as probable as any other type class: P^n(T(P)) ≥ P^n(T(P̂)) for every P̂ ∈ 𝒫_n (consider the ratio P^n(T(P))/P^n(T(P̂)) and check that it is at least 1). So we have
1 = ∑_{Q∈𝒫_n} P^n(T(Q))
≤ ∑_{Q∈𝒫_n} P^n(T(P))
= |𝒫_n| P^n(T(P))
≤ (n+1)^{|𝒳|} ∑_{x^n∈T(P)} P^n(x^n)
= (n+1)^{|𝒳|} ∑_{x^n∈T(P)} 2^{−nH(P)}
= (n+1)^{|𝒳|} |T(P)| 2^{−nH(P)}.
This implies (1/(n+1)^{|𝒳|}) 2^{nH(P)} ≤ |T(P)|.
Theorem 5.4. For any P ∈ 𝒫_n and any distribution Q, (1/(n+1)^{|𝒳|}) 2^{−nD(P||Q)} ≤ Q^n(T(P)) ≤ 2^{−nD(P||Q)}.
Proof.
Q^n(T(P)) = ∑_{x^n∈T(P)} Q^n(x^n)
= ∑_{x^n∈T(P)} 2^{−n(D(P||Q)+H(P))}
= |T(P)| 2^{−n(D(P||Q)+H(P))}.
By the previous theorem, (1/(n+1)^{|𝒳|}) 2^{nH(P)} ≤ |T(P)| ≤ 2^{nH(P)}, and we get the desired result.
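Theorems 5.2 and 5.4 can be checked numerically for a small case (my sketch; Q and the type are hypothetical): every sequence of type P has probability exactly 2^{−n(H(P)+D(P||Q))}, and Q^n(T(P)) falls between the two stated bounds.

from math import log2
from itertools import product
from collections import Counter

alphabet = ("a", "b")
Q = {"a": 0.7, "b": 0.3}                 # hypothetical i.i.d. source
n = 8
target = Counter({"a": 6, "b": 2})       # the type P with P(a) = 6/8, P(b) = 2/8
P = {s: target[s] / n for s in alphabet}

H_P = -sum(v * log2(v) for v in P.values() if v > 0)
D_PQ = sum(v * log2(v / Q[s]) for s, v in P.items() if v > 0)

size_T = sum(1 for seq in product(alphabet, repeat=n) if Counter(seq) == target)
prob_T = size_T * 2 ** (-n * (H_P + D_PQ))   # each sequence of type P has the same probability

lower = 2 ** (-n * D_PQ) / (n + 1) ** len(alphabet)
upper = 2 ** (-n * D_PQ)
print(size_T)                            # 28 = C(8, 2)
print(lower <= prob_T <= upper)          # True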
Theorem 5.5. Let X_1, X_2, ..., X_n be i.i.d. ∼ P(x). Then P(D(P_{X^n}||P) > ϵ) ≤ 2^{−n(ϵ − |𝒳| log(n+1)/n)}; moreover, D(P_{X^n}||P) → 0 with probability 1.
Proof.
P(D(P_{X^n}||P) > ϵ) = ∑_{x^n: D(P_{x^n}||P) > ϵ} P^n(x^n)
= ∑_{P̂ ∈ 𝒫_n: D(P̂||P) > ϵ} P^n(T(P̂))
≤ ∑_{P̂ ∈ 𝒫_n: D(P̂||P) > ϵ} 2^{−nD(P̂||P)}
≤ ∑_{P̂ ∈ 𝒫_n: D(P̂||P) > ϵ} 2^{−nϵ}
≤ 2^{−nϵ} (n+1)^{|𝒳|}
= 2^{−n(ϵ − |𝒳| log(n+1)/n)}.
Summing over n, it is not hard to see that for any ϵ > 0, ∑_{n=1}^{∞} P(D(P_{X^n}||P) > ϵ) < ∞. Then by the Borel-Cantelli Lemma, P(lim sup_{n→∞} D(P_{X^n}||P) > ϵ) = 0. In other words, D(P_{X^n}||P) → 0 with probability 1.
For a universal source code of rate R > H(Q), set R_n = R − |𝒳| log(n+1)/n and let A = {x^n ∈ 𝒳^n : H(P_{x^n}) ≤ R_n}. Then
|A| = ∑_{P ∈ 𝒫_n: H(P) ≤ R_n} |T(P)| ≤ (n + 1)^{|𝒳|} 2^{nR_n} = 2^{nR}.
We can ignore those sequences outside A (since they have probability close to 0) and only encode those in A, for which a source code with rate R is enough:
P_e^{(n)} = 1 − Q^n(A)
= ∑_{P ∈ 𝒫_n: H(P) > R_n} Q^n(T(P)).
Since H(Q) < R and H(P) > R_n, for large enough n we have H(P) > H(Q). So we deduce that for large n, min_{P: H(P) > R_n} D(P||Q) > 0; since Q^n(T(P)) ≤ 2^{−nD(P||Q)}, this implies that P_e^{(n)} → 0 as n → ∞.
Theorem 5.7 (Sanov’s Theorem). Let X_1, X_2, ..., X_n be i.i.d. ∼ Q(x). Let E be a set of probability distributions. Let Q^n(E) := Q^n({x^n : P_{x^n} ∈ E ∩ 𝒫_n}), or equivalently, Q^n(E) = ∑_{x^n: P_{x^n} ∈ E ∩ 𝒫_n} Q^n(x^n) = ∑_{P ∈ E ∩ 𝒫_n} Q^n(T(P)). Then Q^n(E) ≤ (n+1)^{|𝒳|} 2^{−nD(P*||Q)}, where P* = argmin_{P∈E} D(P||Q) is the distribution in E that is closest to Q in relative entropy. If, in addition, the set E is the closure of its interior, then (1/n) log Q^n(E) → −D(P*||Q).
Proof.
Q^n(E) = ∑_{P ∈ E ∩ 𝒫_n} Q^n(T(P))
≤ ∑_{P ∈ E ∩ 𝒫_n} 2^{−nD(P||Q)}
≤ ∑_{P ∈ E ∩ 𝒫_n} 2^{−nD(P*||Q)}
≤ (n + 1)^{|𝒳|} 2^{−nD(P*||Q)}.
For the lower bound, since E is the closure of its interior, for n sufficiently large we can find types p_n ∈ E ∩ 𝒫_n with D(p_n||Q) → D(P*||Q), and
Q^n(E) ≥ Q^n(T(p_n))
≥ (1/(n+1)^{|𝒳|}) 2^{−nD(p_n||Q)},
which implies that
lim inf_{n→∞} (1/n) log Q^n(E) ≥ lim inf_{n→∞} (−|𝒳| log(n+1)/n − D(p_n||Q)) = lim_{n→∞} (−D(p_n||Q)) = −D(P*||Q).
On the other hand, one can use Q^n(E) ≤ (n+1)^{|𝒳|} 2^{−nD(P*||Q)} to prove that lim sup_{n→∞} (1/n) log Q^n(E) ≤ −D(P*||Q), which implies lim_{n→∞} (1/n) log Q^n(E) = −D(P*||Q).
Example 5.3 (Large deviation). Find an upper bound on P((1/n) ∑_{i=1}^n g_j(X_i) ≥ α_j, j = 1, 2, ..., k) for an i.i.d. random sequence X_1, X_2, ..., X_n ∼ Q(x).
Let E = {P : ∑_a P(a) g_j(a) ≥ α_j, j = 1, 2, ..., k}. Then
P((1/n) ∑_{i=1}^n g_j(X_i) ≥ α_j, j = 1, ..., k) = ∑_{x^n: (1/n)∑_{i=1}^n g_j(x_i) ≥ α_j, j=1,...,k} Q^n(x^n)
= ∑_{x^n: ∑_a P_{x^n}(a) g_j(a) ≥ α_j, j=1,...,k} Q^n(x^n)
= ∑_{x^n: P_{x^n} ∈ E ∩ 𝒫_n} Q^n(x^n)
= Q^n(E).
Using Lagrange multipliers, we can find P*(x) = Q(x) e^{∑_i λ_i g_i(x)} / ∑_{a∈𝒳} Q(a) e^{∑_i λ_i g_i(a)}, where the λ_i can be computed by solving the system of constraints ∑_a P*(a) g_j(a) = α_j. We then have an exponentially decreasing upper bound: P((1/n) ∑_{i=1}^n g_j(X_i) ≥ α_j, j = 1, 2, ..., k) ≤ (n+1)^{|𝒳|} 2^{−nD(P*||Q)}.
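For a single constraint (k = 1), the multiplier λ in the expression for P* can be found by bisection, since the tilted mean is increasing in λ. The sketch below (mine; Q, g and α are hypothetical choices) computes P* and the resulting exponent D(P*||Q).

from math import exp, log2

Q = {x: 1 / 6 for x in range(1, 7)}      # a fair die (hypothetical Q)
g = {x: x for x in Q}                    # g(x) = x
alpha = 4.5                              # constraint: empirical mean >= 4.5

def tilt(lam):
    w = {x: Q[x] * exp(lam * g[x]) for x in Q}
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def tilted_mean(lam):
    P = tilt(lam)
    return sum(P[x] * g[x] for x in P)

lo, hi = 0.0, 10.0                       # tilted_mean is increasing in lam
for _ in range(100):
    mid = (lo + hi) / 2
    if tilted_mean(mid) < alpha:
        lo = mid
    else:
        hi = mid

P_star = tilt(hi)
D = sum(P_star[x] * log2(P_star[x] / Q[x]) for x in P_star)
print(P_star)  # exponentially tilted distribution with mean ~4.5
print(D)       # P((1/n) sum g(X_i) >= 4.5) decays roughly like 2^(-n D)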
Theorem 5.8. For a closed convex set E of distributions and a distribution Q ∉ E, let P* ∈ E be the distribution that achieves the minimum distance to Q, i.e. D(P*||Q) = min_{P∈E} D(P||Q). Then D(P||Q) ≥ D(P||P*) + D(P*||Q) for all P ∈ E.
Proof. For any P ∈ E, define P_λ = λP + (1−λ)P*. Since E is convex, P_λ ∈ E for 0 ≤ λ ≤ 1. Computing the derivative of D(P_λ||Q) with respect to λ, we have
dD(P_λ||Q)/dλ = ∑_x ( (P(x) − P*(x)) log (P_λ(x)/Q(x)) + (P(x) − P*(x)) )
= ∑_x (P(x) − P*(x)) log (P_λ(x)/Q(x)),
since ∑_x (P(x) − P*(x)) = 0.
Since P* is the closest to Q in relative entropy, dD(P_λ||Q)/dλ |_{λ=0} ≥ 0. Therefore
D(P||Q) − D(P||P*) − D(P*||Q) = ∑_x P(x) log (P(x)/Q(x)) − ∑_x P(x) log (P(x)/P*(x)) − ∑_x P*(x) log (P*(x)/Q(x))
= ∑_x P(x) log (P*(x)/Q(x)) − ∑_x P*(x) log (P*(x)/Q(x))
= ∑_x (P(x) − P*(x)) log (P*(x)/Q(x))
= dD(P_λ||Q)/dλ |_{λ=0}
≥ 0.
Definition 5.3. The L1-distance between any two discrete distributions P_1 and P_2 is defined as ||P_1 − P_2||_1 = ∑_{a∈𝒳} |P_1(a) − P_2(a)|.
Let A be the set on which P_1(x) > P_2(x). Then
||P_1 − P_2||_1 = ∑_{x∈𝒳} |P_1(x) − P_2(x)|
= ∑_{x∈A} (P_1(x) − P_2(x)) + ∑_{x∈A^c} (P_2(x) − P_1(x))
= P_1(A) − P_2(A) + (1 − P_2(A)) − (1 − P_1(A))
= 2(P_1(A) − P_2(A)).
Therefore, max_{B⊂𝒳} (P_1(B) − P_2(B)) = P_1(A) − P_2(A) = ||P_1 − P_2||_1 / 2.
Lemma 5.1 (Pinsker's inequality). D(P_1||P_2) ≥ ||P_1 − P_2||_1^2 / (2 ln 2).
Proof. One can show the binary case,
p log (p/q) + (1 − p) log ((1 − p)/(1 − q)) ≥ (2/ln 2)(p − q)^2,
by showing that for each q ∈ (0, 1), the function g_q : (0, 1) → R, g_q(p) = p log (p/q) + (1 − p) log ((1 − p)/(1 − q)) − (2/ln 2)(p − q)^2, is minimized when p = q. This can be done by computing g_q'(p) = (1/ln 2)(ln [p(1 − q)/(q(1 − p))] − 4(p − q)), which vanishes at p = q, and g_q''(p) = (1/ln 2)(1/p + 1/(1 − p) − 4), which is ≥ 0 for all p ∈ (0, 1) (since p(1 − p) ≤ 1/4); hence g_q is convex with minimum value g_q(q) = 0.
For the general case, for two distributions P_1 and P_2, let A = {x ∈ 𝒳 : P_1(x) > P_2(x)}. Then
D(P_1||P_2) = ∑_{x∈𝒳} P_1(x) log (P_1(x)/P_2(x))
= ∑_{x∈A} P_1(x) log (P_1(x)/P_2(x)) + ∑_{x∈A^c} P_1(x) log (P_1(x)/P_2(x))
≥ P_1(A) log (P_1(A)/P_2(A)) + (1 − P_1(A)) log ((1 − P_1(A))/(1 − P_2(A)))   (log-sum inequality)
≥ (2/ln 2)(P_1(A) − P_2(A))^2   (Pinsker's inequality for the binary case)
= (1/(2 ln 2)) ||P_1 − P_2||_1^2.
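A quick numerical check of Pinsker's inequality on random pairs of distributions (my own sketch):

import random
from math import log, log2

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

random.seed(0)
ok = True
for _ in range(1000):
    P1 = normalize([random.random() + 1e-9 for _ in range(4)])
    P2 = normalize([random.random() + 1e-9 for _ in range(4)])
    D = sum(p * log2(p / q) for p, q in zip(P1, P2))
    L1 = sum(abs(p - q) for p, q in zip(P1, P2))
    ok = ok and D >= L1 ** 2 / (2 * log(2)) - 1e-12
print(ok)  # True: D(P1||P2) >= ||P1 - P2||_1^2 / (2 ln 2) on every trial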
Theorem 5.9 (The Conditional Limit Theorem). Let E be a closed convex set of distributions and let Q be a distribution not in E. Let X_1, X_2, ..., X_n, ... be i.i.d. ∼ Q(x) and let P* achieve min_{P∈E} D(P||Q). Then Pr(X_1 = a | P_{X^n} ∈ E) → P*(a) in probability as n → ∞.
Proof. Define S_t := {P : D(P||Q) < t} and d* = D(P*||Q) = min_{P∈E} D(P||Q). Now, for any δ > 0, define A = S_{d*+2δ} ∩ E and B = E − A. Now,
Q^n(B) = ∑_{P ∈ E∩𝒫_n: D(P||Q) ≥ d*+2δ} Q^n(T(P))
≤ ∑_{P ∈ E∩𝒫_n: D(P||Q) ≥ d*+2δ} 2^{−nD(P||Q)}
≤ (n+1)^{|𝒳|} 2^{−n(d*+2δ)},
and
Q^n(A) ≥ Q^n(S_{d*+δ} ∩ E)
= ∑_{P ∈ E∩𝒫_n: D(P||Q) < d*+δ} Q^n(T(P))
≥ ∑_{P ∈ E∩𝒫_n: D(P||Q) < d*+δ} (1/(n+1)^{|𝒳|}) 2^{−nD(P||Q)}
≥ (1/(n+1)^{|𝒳|}) 2^{−n(d*+δ)}   (for sufficiently large n, such a P exists).
So for n large enough,
Pr(P_{X^n} ∈ B | P_{X^n} ∈ E) = Q^n(B ∩ E) / Q^n(E)
≤ Q^n(B) / Q^n(A)
≤ [(n+1)^{|𝒳|} 2^{−n(d*+2δ)}] / [(n+1)^{−|𝒳|} 2^{−n(d*+δ)}]
= (n+1)^{2|𝒳|} 2^{−nδ}
→ 0.   (#)
Theorem 5.10 (Neyman-Pearson Lemma). Let X_1, X_2, ..., X_n be i.i.d. ∼ Q(x). Consider the decision problem corresponding to the hypotheses H_1 : Q = P_1 vs. H_2 : Q = P_2. For T ≥ 0, define the acceptance region for H_1 as A_n(T) = {x^n : P_1(x_1, ..., x_n)/P_2(x_1, ..., x_n) > T}. Let α* = P_1^n(A_n^c(T)), β* = P_2^n(A_n(T)), and let B_n be any other acceptance region with associated probabilities of error α and β. If α ≤ α*, then β ≥ β*.
Proof. Let Φ_{A_n} and Φ_{B_n} be the indicator functions of the decision regions A_n = A_n(T) and B_n, respectively. Then for any (x_1, ..., x_n) ∈ 𝒳^n, (Φ_{A_n}(x^n) − Φ_{B_n}(x^n))(P_1(x^n) − T P_2(x^n)) ≥ 0. It then follows that
0 ≤ ∑_{x^n} ( Φ_{A_n}(x^n) P_1(x^n) − T Φ_{A_n}(x^n) P_2(x^n) − Φ_{B_n}(x^n) P_1(x^n) + T Φ_{B_n}(x^n) P_2(x^n) )
= ∑_{A_n} (P_1 − T P_2) − ∑_{B_n} (P_1 − T P_2)
= (1 − α*) − T β* − (1 − α) + T β
= T(β − β*) − (α* − α).
Since α ≤ α*, we get T(β − β*) ≥ α* − α ≥ 0, and hence β ≥ β* (for T > 0).
Theorem 5.11 (Chernoff-Stein Lemma). Let X_1, X_2, ..., X_n be i.i.d. ∼ Q. Consider the hypothesis test between two alternatives H_1 : Q = P_1 vs. H_2 : Q = P_2, where D(P_1||P_2) < ∞. Let A_n ⊂ 𝒳^n be an acceptance region for H_1, and let the probabilities of error be α_n = P_1^n(A_n^c), β_n = P_2^n(A_n). For 0 < ϵ < 1/2, define β_n^ϵ = min_{A_n ⊂ 𝒳^n, α_n < ϵ} β_n. Then lim_{n→∞} (1/n) log β_n^ϵ = −D(P_1||P_2).
Rough idea of the proof: first construct a sequence of acceptance regions A_n ⊂ 𝒳^n such that α_n < ϵ and β_n decays at the same rate as 2^{−nD(P_1||P_2)}; then show that no other sequence of tests has an asymptotically better exponent.
Proof. For some small δ, define
A_n = {x^n ∈ 𝒳^n : 2^{n(D(P_1||P_2)−δ)} ≤ P_1(x^n)/P_2(x^n) ≤ 2^{n(D(P_1||P_2)+δ)}}.
By the law of large numbers, (1/n) ∑_{i=1}^n log (P_1(X_i)/P_2(X_i)) → E_{P_1}[log (P_1(X)/P_2(X))] = D(P_1||P_2) in probability (with respect to P_1). Therefore, we get P_1^n(A_n) → 1 as n → ∞. So for large n, α_n = P_1^n(A_n^c) = 1 − P_1^n(A_n) < ϵ.
We have
β_n = ∑_{x^n ∈ A_n} P_2(x^n)
≤ ∑_{x^n ∈ A_n} P_1(x^n) 2^{−n(D(P_1||P_2)−δ)}
= 2^{−n(D(P_1||P_2)−δ)} ∑_{x^n ∈ A_n} P_1(x^n)
≤ 2^{−n(D(P_1||P_2)−δ)},
and similarly β_n ≥ 2^{−n(D(P_1||P_2)+δ)} P_1^n(A_n). Hence we get (1/n) log β_n ≥ −D(P_1||P_2) − δ + log(1 − α_n)/n, which, together with the upper bound and the fact that δ is arbitrary, implies that lim_{n→∞} (1/n) log β_n = −D(P_1||P_2).
Lemma 5.2. The number of phrases c(n) in the Lempel-Ziv parsing of a binary sequence X_1, X_2, ..., X_n satisfies c(n) ≤ n / ((1 − ϵ_n) log n), where ϵ_n → 0 as n → ∞.
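For concreteness (not part of the notes), here is a minimal incremental (LZ78-style) parsing of a binary string into distinct phrases, each being the shortest prefix not seen before; c(n) is the number of phrases.

def lz_parse(s):
    """Incremental parsing of s into phrases, each new phrase being the shortest unseen prefix."""
    phrases, seen = [], set()
    i = 0
    while i < len(s):
        j = i + 1
        while j <= len(s) and s[i:j] in seen:
            j += 1
        phrases.append(s[i:j])   # the final phrase may repeat an earlier one if the string ends early
        seen.add(s[i:j])
        i = j
    return phrases

s = "1011010100010"
phrases = lz_parse(s)
print(phrases)                   # ['1', '0', '11', '01', '010', '00', '10']
print(len(phrases))              # c(n) = 7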
Lemma 5.3. Let Z be a positive integer-valued random variable with mean µ. Then the entropy H(Z) is upper bounded by (µ + 1) log(µ + 1) − µ log µ.
Proof. See Theorem 12.1.1 of the textbook, where it is shown that H(Z) is maximized when Z is a geometric random variable.
Let X_{−∞}^{∞} be a stationary ergodic process with p.m.f. p. For a fixed integer k, define the k-th order Markov approximation to p as
Q_k(x_{−(k−1)}, ..., x_0, x_1, ..., x_n) = p(x_{−(k−1)}^0) ∏_{j=1}^n p(x_j | x_{j−k}^{j−1}).
Since {p(X_n | X_{n−k}^{n−1})}_n is an ergodic process, we have
−(1/n) log Q_k(X_1, X_2, ..., X_n | X_{−(k−1)}^0) = −(1/n) ∑_{j=1}^n log p(X_j | X_{j−k}^{j−1})
→ E[−log p(X_j | X_{j−k}^{j−1})]   (as n → ∞)
= H(X_j | X_{j−k}^{j−1})
→ H(X)   (as k → ∞),
where H(X) is the entropy rate of X_{−∞}^{∞}.
Suppose that X_{−(k−1)}^0 = x_{−(k−1)}^0, and suppose that x_1^n is parsed into c(n) distinct phrases y_1, y_2, ..., y_{c(n)}, where y_i = x_{q_i}^{q_{i+1}−1}. For each i = 1, 2, ..., c(n), define s_i = x_{q_i−k}^{q_i−1}. Note that s_1 = x_{−(k−1)}^0. In other words, s_i is the k bits of x preceding y_i. Let c_{ls} be the number of phrases y_i with length l and preceding state s_i = s. Then ∑_{l,s} c_{ls} = c(n) and ∑_{l,s} l c_{ls} = n.
Lemma 5.4. For the Lempel-Ziv parsing of the sequence x_1 x_2 ··· x_n, we have log Q_k(x_1, x_2, ..., x_n | s_1) ≤ −∑_{l,s} c_{ls} log c_{ls}.
Proof. Note that
Q_k(x_1, x_2, ..., x_n | s_1) = Q_k(y_1, y_2, ..., y_{c(n)} | s_1) = ∏_{i=1}^{c(n)} p(y_i | s_i).
It then follows that
log Q_k(x_1, x_2, ..., x_n | s_1) = ∑_{i=1}^{c(n)} log p(y_i | s_i)
= ∑_{l,s} ∑_{i: |y_i|=l, s_i=s} log p(y_i | s_i)
= ∑_{l,s} c_{ls} ∑_{i: |y_i|=l, s_i=s} (1/c_{ls}) log p(y_i | s_i)
≤ ∑_{l,s} c_{ls} log ( ∑_{i: |y_i|=l, s_i=s} (1/c_{ls}) p(y_i | s_i) )   (Jensen's inequality; log is concave)
≤ ∑_{l,s} c_{ls} log (1/c_{ls}),
where the last step uses the fact that the phrases y_i are distinct, so ∑_{i: |y_i|=l, s_i=s} p(y_i | s_i) ≤ 1.
Theorem 5.12. Let {X_n} be a stationary ergodic process with entropy rate H(X), and let c(n) be the number of phrases in the Lempel-Ziv parsing of a sample of length n from this process. Then lim sup_{n→∞} c(n) log c(n) / n ≤ H(X).
Proof. By Lemma 5.4, we have
log Q_k(x_1, x_2, ..., x_n | s_1) ≤ −∑_{l,s} c_{ls} log c_{ls}
= −∑_{l,s} c_{ls} log ( (c_{ls}/c(n)) c(n) )
= −c(n) log c(n) − c(n) ∑_{l,s} (c_{ls}/c(n)) log (c_{ls}/c(n)).
Note that ∑_{l,s} c_{ls}/c(n) = 1 and ∑_{l,s} l c_{ls}/c(n) = n/c(n).
Define random variables U, V such that Pr(U = l, V = s) = c_{ls}/c(n). Then E[U] = n/c(n) and
(c(n)/n) log c(n) − (c(n)/n) H(U, V) ≤ −(1/n) log Q_k(x_1, x_2, ..., x_n | s_1).
Now, H(U, V) ≤ H(U) + H(V), where H(V) ≤ k and, by Lemma 5.3, H(U) ≤ (E[U] + 1) log(E[U] + 1) − E[U] log E[U] with E[U] = n/c(n); combining this with the bound c(n) ≤ n/((1 − ϵ_n) log n) of Lemma 5.2, one checks that (c(n)/n) H(U, V) → 0 as n → ∞. Moreover, −(1/n) log Q_k(X_1, ..., X_n | s_1) → H(X_0|X_{−k}^{−1}) with probability 1. Therefore,
lim sup_{n→∞} c(n) log c(n) / n ≤ lim_{k→∞} H(X_0|X_{−k}^{−1}) = H(X).
Theorem 5.13. Let X = X_{−∞}^{∞} be a stationary ergodic process, and let l(x_1, ..., x_n) be the Lempel-Ziv codeword length associated with x_1, ..., x_n. Then
lim sup_{n→∞} (1/n) l(x_1, ..., x_n) ≤ H(X).