
MATH7224 Topics in Advanced Probability

Theory Notes
Billy Leung
May 2024

Introduction
This course will delve into one out of many topics in Probability Theory. In
Semester 1 of the 2023-24 academic year, it was about Information Theory. In-
formation theory is the mathematical study of the quantification, storage, and
communication of information. A fundamental problem in this field is how to
communicate reliably over unreliable channels. These notes also discuss applications of information theory in statistical inference.
Recommended reading: Elements of Information Theory by Cover & Thomas,
2nd Edition

Contents

1 Basic Concepts
2 Fano’s Inequality
3 Capacity of a Channel
4 Rate Distortion Theory
5 Information Theory and Statistics

1 Basic Concepts
The most important concept in information theory is entropy, which measures
the level of uncertainty involved in the value of a random variable or the outcome
of a random process.
Definition 1.1. The entropy of a discrete random variable $X$ is defined by $H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x)$, where $\mathcal{X}$ is the sample space of $X$.

Remark 1.1. 1. We define $0 \log 0$ to be $0$.
2. Clearly from the definition, H(X) ≥ 0.
Example 1.1. If
$$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p, \end{cases}$$
then $H(X) = -p \log p - (1-p) \log(1-p)$.
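As a quick numerical illustration (ours, not part of the notes), the following minimal Python sketch evaluates this binary entropy function; the name binary_entropy is our own choice.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli(p) random variable, with 0 log 0 := 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0, the maximum
print(binary_entropy(0.11))  # roughly 0.5
```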
Definition 1.2. The joint entropy $H(X, Y)$ of random variables $X, Y$ is defined by $H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$.

Definition 1.3. The conditional entropy $H(Y|X)$ is defined by
\begin{align*}
H(Y|X) &= \sum_{x \in \mathcal{X}} p(x) H(Y|X = x) = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log p(y|x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x)
\end{align*}

Remark 1.2. (1) H(X|X) = 0.


(2) H(Y |X) = H(Y ) if X and Y are independent.
(3) H(Y |X) indicates how random Y is given X.
Proposition 1.1 (Chain rule). $H(X, Y) = H(X) + H(Y|X)$.

Proof.
\begin{align*}
H(X, Y) &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) p(y|x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y|x) \\
&= H(X) + H(Y|X)
\end{align*}

Definition 1.4. The relative entropy or Kullback–Leibler divergence between two p.m.f.'s (probability mass functions) $p$ and $q$ is defined by $D(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$.

It is not hard to observe that $D(p\|q) \neq D(q\|p)$ in general.
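A minimal Python sketch of ours (not from the notes) evaluating $D(p\|q)$ for two finite p.m.f.'s and illustrating the asymmetry; kl_divergence is a hypothetical helper name.

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits for two p.m.f.'s given as lists over the same alphabet."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:                      # 0 log 0 is taken to be 0
            total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # about 0.737
print(kl_divergence(q, p))  # about 0.531, so D(p||q) != D(q||p)
```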


Definition 1.5. The mutual information $I(X; Y)$ between two random variables $X$ and $Y$ is defined by
\begin{align*}
I(X; Y) &= D(p(x, y) \| p(x) p(y)) \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
\end{align*}

Intuitively, $I(X; Y)$ is an indicator of how strongly $X$ and $Y$ are dependent.
Proposition 1.2. (1) $I(X; Y) = H(X) - H(X|Y)$
(2) $I(X; Y) = H(X) + H(Y) - H(X, Y)$
(3) $I(X; Y) = I(Y; X)$
(4) $I(X; Y) = H(X)$ if $Y = X$
(5) $I(X; Y) \geq 0$, with equality iff $X$ and $Y$ are independent.
We prove (1) only; the others are left as exercises:
\begin{align*}
I(X; Y) &= \sum_{x, y} p(x, y) \log \frac{p(x|y)}{p(x)} \\
&= -\sum_{x, y} p(x, y) \log p(x) + \sum_{x, y} p(x, y) \log p(x|y) \\
&= H(X) - H(X|Y)
\end{align*}
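The identity in (2) is easy to check numerically. Below is a small Python sketch of our own (not part of the notes) that computes $H(X)$, $H(Y)$, $H(X,Y)$ and $I(X;Y)$ from a joint p.m.f. given as a 2-D array.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a p.m.f. given as a NumPy array (zeros are ignored)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# joint p.m.f. p(x, y) on a 2x2 alphabet (rows = x, columns = y)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi = entropy(px) + entropy(py) - entropy(pxy.flatten())
print(mi)  # I(X;Y) = H(X) + H(Y) - H(X,Y), about 0.278 bits here
```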

Corollary 1.1 (Conditioning reduces entropy). H(X|Y ) ≤ H(X) with equality


iff X and Y are independent.
Definition 1.6. The conditional mutual information of $X$ and $Y$ given $Z$ is defined as
\begin{align*}
I(X; Y | Z) &= H(X|Z) - H(X|Y, Z) \\
&= \sum_{x, y, z} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z) p(y|z)}
\end{align*}

Proposition 1.3 (General Chain Rule). (1) $H(X_1, X_2, \cdots, X_n) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2) + \cdots + H(X_n|X_1, X_2, \cdots, X_{n-1})$
(2) $I(X_1, X_2, \cdots, X_n; Y) = I(X_1; Y) + I(X_2; Y|X_1) + \cdots + I(X_n; Y|X_1, X_2, \cdots, X_{n-1})$

Corollary 1.2. $H(X_1, X_2, \cdots, X_n) \leq \sum_{i=1}^n H(X_i)$ with equality iff the $X_i$'s are independent.

Proof of corollary: Use Corollary 1.1 and Proposition 1.3 (1).

Theorem 1.1 (Information Inequality). $D(p\|q) \geq 0$. Equality holds iff $p(x) = q(x)$ for all $x \in \mathcal{X}$.

Proof.
\begin{align*}
-D(p\|q) &= \sum_x p(x) \log \frac{q(x)}{p(x)} \\
&\leq \log \sum_x p(x) \frac{q(x)}{p(x)} \quad \text{(Jensen's Inequality; the logarithm is concave)} \\
&= \log \sum_x q(x) \\
&= \log 1 \\
&= 0
\end{align*}
Note that equality holds iff $\frac{q(x)}{p(x)}$ is a constant, which implies $p(x) = q(x)$ for all $x \in \mathcal{X}$.

Corollary 1.3. $H(X) \leq \log |\mathcal{X}|$ with equality iff $X$ has a uniform distribution over $\mathcal{X}$.

Proof: Consider $D(p\|u)$ where $u(x) = \frac{1}{|\mathcal{X}|}$ for $x \in \mathcal{X}$.
Theorem 1.2 (Log-sum inequality). For $a_1, \cdots, a_n, b_1, \cdots, b_n \geq 0$,
$$\sum_{i=1}^n a_i \log \frac{a_i}{b_i} \geq \left( \sum_{i=1}^n a_i \right) \log \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}.$$

Proof. Assume $a_j, b_j > 0$. Since $f(x) = x \log x$ is strictly convex, we have (by Jensen's inequality) for $t_1 + t_2 + \cdots + t_n = 1$, $t_i \geq 0$, that $\sum_i t_i f(x_i) \geq f(\sum_i t_i x_i)$.
Set $x_i = \frac{a_i}{b_i}$ and $t_i = \frac{b_i}{\sum_j b_j}$; we get the desired result.

Remark 1.3. The log-sum inequality implies the information inequality.


In the following, we write X → Y → Z if X, Y, Z form a Markov chain, i.e.
p(z|y, x) = p(z|y) or equivalently p(x, z|y) = p(x|y)p(z|y).
Theorem 1.3 (Data-processing inequality). If X → Y → Z, then I(X; Y ) ≥
I(X; Z).
Proof. If X → Y → Z, then I(X; Z|Y ) = 0, so
I(X; Y ) = I(X; Y ) + I(X; Z|Y )
= I(X; Y, Z)
= I(X; Z) + I(X; Y |Z)
≥ I(X; Z)

Corollary 1.4. (1) I(X; Y ) ≥ I(X; g(Y )) where g is a deterministic function.


(2) If X → Y → Z, I(X; Y ) ≥ I(X; Y |Z).

2 Fano’s Inequality
Consider the following problem: Suppose we want to estimate random variable
X ∼ p(x) but we only observe Y , which is related to X by p(y|x). How should
we estimate X using X̂ = g(Y ) to minimize the probability of error Pe = P (X̂ ̸=
X)?
Theorem 2.1 (Fano’s inequality). H(Pe ) + Pe log(|X | − 1) ≥ H(X|Y ).
Proof. Define an error random variable
$$E = \begin{cases} 1 & \hat{X} \neq X \\ 0 & \text{otherwise.} \end{cases}$$
Now,
\begin{align*}
H(X|Y) &= H(X|Y) + H(E|X, Y) \quad \text{(since } E \text{ is a function of } (X, Y) \text{, } H(E|X,Y) = 0\text{)} \\
&= H(E, X|Y) \\
&= H(E|Y) + H(X|E, Y) \\
&\leq H(E) + H(X|E, Y) \\
&= H(P_e) + H(X|E, Y) \\
&= H(P_e) + P(E = 0) H(X|Y, E = 0) + P(E = 1) H(X|Y, E = 1) \\
&\leq H(P_e) + (1 - P_e) \times 0 + P_e \log(|\mathcal{X}| - 1)
\end{align*}
where the last step uses the fact that, given $E = 1$ and $Y$, $X$ can take at most $|\mathcal{X}| - 1$ values. Done.
Observe that if $P_e = 0$ (or is small), then $H(X|Y) = 0$ (or is small as well).
Theorem 2.2 (The Asymptotic Equipartition (AEP) Theorem). If $X_1, X_2, \cdots$ are i.i.d. $\sim p(x)$, then $-\frac{1}{n} \log p(X_1, X_2, \cdots, X_n) \to H(X)$ in probability, where $X \sim p(x)$.

Proof.
\begin{align*}
-\frac{1}{n} \log p(X_1, X_2, \cdots, X_n) &= -\frac{1}{n} \log p(X_1) p(X_2) \cdots p(X_n) \\
&= -\frac{1}{n} (\log p(X_1) + \log p(X_2) + \cdots + \log p(X_n)) \\
&\to -E[\log p(X)] \quad \text{(by the weak law of large numbers)} \\
&= H(X)
\end{align*}

Definition 2.1. The typical set $A_\epsilon^{(n)}$ with respect to $p(x)$ is the set of sequences $(x_1, x_2, \cdots, x_n) \in \mathcal{X}^n$ such that $2^{-n(H(X)+\epsilon)} \leq p(x_1, x_2, \cdots, x_n) \leq 2^{-n(H(X)-\epsilon)}$.

Theorem 2.3. (1) $P(A_\epsilon^{(n)}) := P((x_1, \cdots, x_n) \in A_\epsilon^{(n)}) > 1 - \epsilon$ for $n$ sufficiently large.
(2) $|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}$
(3) $|A_\epsilon^{(n)}| \geq (1 - \epsilon) 2^{n(H(X)-\epsilon)}$ for $n$ sufficiently large.

Proof. For (1), note that $P(A_\epsilon^{(n)}) = P(|-\frac{1}{n} \log p(x_1, x_2, \cdots, x_n) - H(X)| < \epsilon) \to 1$ as $n \to \infty$ by the AEP Theorem. So $P(A_\epsilon^{(n)}) > 1 - \epsilon$ for sufficiently large $n$.
For (2), $1 = \sum_{x^n \in \mathcal{X}^n} p(x^n) \geq \sum_{x^n \in A_\epsilon^{(n)}} p(x^n) \geq \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)+\epsilon)} = 2^{-n(H(X)+\epsilon)} |A_\epsilon^{(n)}|$, which implies $|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}$.
For (3), we know that for sufficiently large $n$,
\begin{align*}
1 - \epsilon &< P(A_\epsilon^{(n)}) \quad \text{(by (1))} \\
&\leq \sum_{x^n \in A_\epsilon^{(n)}} 2^{-n(H(X)-\epsilon)} \\
&= 2^{-n(H(X)-\epsilon)} |A_\epsilon^{(n)}|
\end{align*}
$\Rightarrow |A_\epsilon^{(n)}| \geq (1 - \epsilon) 2^{n(H(X)-\epsilon)}$.

Remark 2.1. Typical sequences are "VIP" sequences: very few sequences are typical, but the typical sequences occupy almost the entire probability space.
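As an illustration (our own sketch, not from the notes), the following Python snippet draws i.i.d. samples from a small p.m.f. and checks that the empirical value of $-\frac{1}{n}\log_2 p(X_1,\cdots,X_n)$ concentrates near $H(X)$, as the AEP predicts; all names are assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.25])          # p.m.f. of X on {0, 1, 2}
H = float(-(p * np.log2(p)).sum())        # H(X) = 1.5 bits

for n in (10, 100, 10_000):
    x = rng.choice(len(p), size=n, p=p)   # i.i.d. sample X_1, ..., X_n
    empirical = -np.log2(p[x]).sum() / n  # -(1/n) log2 p(X_1, ..., X_n)
    print(n, round(empirical, 3), "vs H(X) =", H)
```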
Now, we move on to discuss a noiseless source coding problem.

Definition 2.2. A source code $C$ for a random variable $X$ is a mapping from $\mathcal{X}$ to $\mathcal{D}^*$, the set of finite-length strings over an alphabet $\mathcal{D}$. Let $C(x)$ denote the codeword corresponding to $x$, let $l(x)$ denote the length of $C(x)$, and let $L(C)$ denote the expected length, i.e. $L(C) = \sum_{x \in \mathcal{X}} p(x) l(x)$.

Example 2.1. Let $X$ be a random variable with $\mathcal{X} = \{1, 2, 3, 4\}$ and $P(X = 1) = \frac{1}{2}$, $P(X = 2) = \frac{1}{4}$, $P(X = 3) = \frac{1}{8}$, $P(X = 4) = \frac{1}{8}$, and encode $X$ by $C(1) = 00$, $C(2) = 01$, $C(3) = 10$, $C(4) = 11$. The expected length $L(C)$ of this code is 2. But if we take $C(1) = 0$, $C(2) = 10$, $C(3) = 110$, $C(4) = 111$, then $L(C) = \frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{8} \times 3 = 1.75$, which is shorter.

Definition 2.3. A code $C$ is said to be non-singular if the mapping $C$ is injective. Moreover, a code $C$ is said to be uniquely decodable if there is a one-to-one correspondence between encoded strings and input sequences. Finally, a code $C$ is said to be instantaneous if no codeword is a prefix of any other codeword.

Example 2.2.
  X | Singular | Non-singular | Uniquely decodable | Instantaneous
  1 |    0     |      0       |        10          |      0
  2 |    0     |     010      |        00          |      10
  3 |    0     |      01      |        11          |      110
  4 |    0     |      10      |       110          |      111

Theorem 2.4 (Kraft's inequality). For any instantaneous code $C$ over an alphabet $\mathcal{D}$ of size $D$, the codeword lengths $l_1, l_2, \cdots, l_m$ must satisfy the inequality
$$\sum_{i=1}^m D^{-l_i} \leq 1.$$
Conversely, given a set of lengths $l_1, l_2, \cdots, l_m$ that satisfy this inequality, there exists an instantaneous code with these codeword lengths.

Proof. Let $l_{\max} = \max\{l_1, \cdots, l_m\}$. Consider the $D$-ary tree of depth $l_{\max}$, where each edge between consecutive levels has length 1. Any codeword in $C$ can be represented by a solid path in the tree, and if $C$ is instantaneous, no path is a subpath of another path. A solid subpath of length $l_i$ has $D^{l_{\max} - l_i}$ descendants at level $l_{\max}$, and these descendant sets are disjoint. So $\sum_{i=1}^m D^{l_{\max} - l_i} \leq D^{l_{\max}}$, which implies $\sum_{i=1}^m D^{-l_i} \leq 1$.

[Figure: a binary tree of depth 2 illustrating codeword paths.]

Conversely, given lengths $l_1, \cdots, l_m$, we can pack paths into the tree by "going up whenever possible".
By Kraft's inequality, the problem of finding an optimal instantaneous code is equivalent to solving the following optimization problem:
$$\text{minimize } L = \sum_{i=1}^m p_i l_i \quad \text{subject to } \sum_i D^{-l_i} \leq 1, \; l_i \text{'s are integers.}$$
Ignoring the integer constraints, we can solve the optimization problem and get $l_i^* = -\log_D p_i$ and $L^* = \sum p_i l_i^* = -\sum p_i \log_D p_i = H_D(X)$. The proof is as follows:
\begin{align*}
L - H_D(X) &= \sum_i p_i l_i - \sum_i p_i \log_D \frac{1}{p_i} \\
&= -\sum_i p_i \log_D D^{-l_i} + \sum_i p_i \log_D p_i \\
&= \sum_i p_i \log_D \frac{p_i}{r_i} + \log_D \frac{1}{\sum_j D^{-l_j}} \quad \text{where } r_i = \frac{D^{-l_i}}{\sum_j D^{-l_j}} \\
&\geq 0 \quad \text{(by the information inequality and Kraft's inequality)}
\end{align*}

Now, we move on to talk about Shannon-Fano coding. It is nothing special but a code that achieves an expected length $L$ within one bit of the lower bound, that is, $H(X) \leq L \leq H(X) + 1$.

Recall that the optimal (non-integer) solution is $l_i^* = \log_D \frac{1}{p_i}$. Rounding up gives $l_i = \lceil \log_D \frac{1}{p_i} \rceil$, which satisfies Kraft's inequality:
$$\sum_i D^{-\lceil \log_D \frac{1}{p_i} \rceil} \leq \sum_i D^{-\log_D \frac{1}{p_i}} = \sum_i p_i = 1.$$
It is not hard to observe that $\log_D \frac{1}{p_i} \leq l_i < \log_D \frac{1}{p_i} + 1$, which implies $H_D(X) \leq L \leq H_D(X) + 1$.
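A small Python sketch of ours (not part of the notes) computing the Shannon-Fano lengths $l_i = \lceil \log_2 \frac{1}{p_i} \rceil$ for a binary code alphabet and checking the Kraft sum and the bound $H(X) \leq L \leq H(X) + 1$.

```python
import math

p = [0.5, 0.25, 0.125, 0.125]                         # source p.m.f.
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]  # l_i = ceil(log2(1/p_i))

kraft_sum = sum(2 ** -l for l in lengths)
L = sum(pi * l for pi, l in zip(p, lengths))          # expected length
H = -sum(pi * math.log2(pi) for pi in p)              # entropy in bits

print(lengths)            # [1, 2, 3, 3]
print(kraft_sum <= 1)     # True (Kraft's inequality holds)
print(H <= L <= H + 1)    # True: H(X) <= L <= H(X) + 1
```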
We now apply Shannon-Fano coding to stationary processes.

Let $X = X_{-\infty}^{\infty} = \cdots X_{-1} X_0 X_1 X_2 \cdots$ be a stationary process. Coding blocks of length $n$, we have $H(X_1^n) \leq E[l(X_1^n)] < H(X_1^n) + 1$, so
$$\frac{H(X_1^n)}{n} \leq \frac{E[l(X_1^n)]}{n} < \frac{H(X_1^n) + 1}{n}.$$
Note that, by the chain rule and stationarity,
$$\frac{H(X_1^n)}{n} = \frac{H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_1^{n-1})}{n} = \frac{H(X_0) + H(X_0|X_{-1}) + \cdots + H(X_0|X_{-n+1}^{-1})}{n}.$$
By the fact that conditioning reduces entropy, we have $H(X_0|X_{-n+1}^{-1}) \leq H(X_0|X_{-n+2}^{-1}) \leq \cdots \leq H(X_0|X_{-1}) \leq H(X_0)$, and since $H(X_0|X_{-n+1}^{-1}) \geq 0$, we can conclude that $\lim_{n \to \infty} H(X_0|X_{-n+1}^{-1})$ exists, which (by the Cesàro mean theorem) implies that $\lim_{n \to \infty} \frac{H(X_1^n)}{n}$ exists. Moreover, $\lim_{n \to \infty} \frac{H(X_1^n)}{n} = \lim_{n \to \infty} H(X_0|X_{-n+1}^{-1})$. This limit is called the entropy rate of $X$, denoted by $H(\mathcal{X})$. When $X$ is i.i.d., $H(\mathcal{X}) = H(X_1)$. When $X$ is a one-step stationary Markov process, $H(\mathcal{X}) = H(X_2|X_1)$.

Theorem 2.5. For a stationary process, the per-symbol expected length of Shannon-Fano coding on blocks of length $n$ approaches the entropy rate as $n \to \infty$.

Theorem 2.6 (McMillan). The codeword lengths of any uniquely decodable $D$-ary code must satisfy $\sum_i D^{-l_i} \leq 1$.

Proof. For any fixed $k$,
\begin{align*}
\left( \sum_{x \in \mathcal{X}} D^{-l(x)} \right)^k &= \sum_{x_1 \in \mathcal{X}} \sum_{x_2 \in \mathcal{X}} \cdots \sum_{x_k \in \mathcal{X}} D^{-l(x_1)} \cdots D^{-l(x_k)} \\
&= \sum_{(x_1, x_2, \cdots, x_k) \in \mathcal{X}^k} D^{-l(x_1) - \cdots - l(x_k)} \\
&= \sum_{x_1^k \in \mathcal{X}^k} D^{-l(x_1^k)} \\
&= \sum_{m=1}^{k l_{\max}} a(m) D^{-m},
\end{align*}
where $l_{\max} = \max_x l(x)$ is the maximum codeword length, $l(x_1^k) = l(x_1) + \cdots + l(x_k)$ is the length of the concatenated codeword, and $a(m)$ is the number of source sequences $x_1^k$ mapping into codewords of length $m$.
By unique decodability, $a(m) \leq D^m$, so $\left( \sum_{x \in \mathcal{X}} D^{-l(x)} \right)^k \leq \sum_{m=1}^{k l_{\max}} D^m D^{-m} = k l_{\max}$, and hence $\sum_{x \in \mathcal{X}} D^{-l(x)} \leq (k l_{\max})^{\frac{1}{k}}$. Since $\lim_{k \to \infty} (k l_{\max})^{\frac{1}{k}} = 1$, we are done.

The theorem says that although there are uniquely decodable codes that are not instantaneous, the set of achievable codeword lengths is the same for uniquely decodable codes and instantaneous codes.

Huffman Coding yields an optimal instantaneous code for a given distribution.

Example 2.3. (Huffman construction for a source with probabilities 0.25, 0.25, 0.2, 0.15, 0.15.) At each stage the two smallest probabilities are merged; reading the merges backwards assigns the codewords shown in parentheses.

  0.25 (01)    0.3 (00)    0.45 (1)    0.55 (0)    1
  0.25 (10)    0.25 (01)   0.3 (00)    0.45 (1)
  0.2 (11)     0.25 (10)   0.25 (01)
  0.15 (000)   0.2 (11)
  0.15 (001)
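For reference, here is a short Python sketch (our own illustration, not code from the notes) of the Huffman construction using a heap; it reproduces a set of optimal codeword lengths for the probabilities in Example 2.3.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return a dict {symbol: codeword} for a binary Huffman code."""
    tiebreak = count()  # avoids comparing dicts when probabilities are equal
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least likely subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15})
print(code)  # codeword lengths 2, 2, 2, 3, 3 as in Example 2.3 (labels may differ)
```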

Assume $X$ has probability masses $p_1 \geq p_2 \geq \cdots \geq p_m$. We have the following proposition.

Proposition 2.1. For any distribution, there exists an optimal instantaneous code, called a canonical code, such that
(1) The codeword lengths are ordered inversely with the probabilities, i.e. if $p_j < p_k$, then $l_j \geq l_k$.
(2) The two longest codewords have the same length.
(3) Two of the longest codewords differ only in the last bit and correspond to the two least likely symbols.

3 Capacity of a Channel
Communication system with Discrete Memoryless Channel
Definition 3.1. A discrete channel is a system consisting of an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, and a transition probability matrix $\{p(y|x)\}$, where $p(y|x)$ is the probability of observing the output $y$ given that input $x$ is sent. The channel is said to be memoryless if the probability distribution of the output depends only on the input at that time and is conditionally independent of previous or future inputs and outputs.

Definition 3.2. The channel capacity of a discrete memoryless channel is defined as $C := \max_{p(x)} I(X; Y)$.
Example 3.1. For a binary symmetric channel with $\mathcal{X} = \mathcal{Y} = \{0, 1\}$, $p(0|0) = p(1|1) = 1 - p$, $p(0|1) = p(1|0) = p$, the mutual information satisfies
\begin{align*}
I(X; Y) &= H(Y) - H(Y|X) \\
&= H(Y) - \sum_x p(x) H(Y|X = x) \\
&= H(Y) - \sum_x p(x) H(p) \\
&= H(Y) - H(p) \\
&\leq 1 - H(p)
\end{align*}
with equality iff $p(0) = p(1) = \frac{1}{2}$. In other words, $C = 1 - H(p)$, which is achieved by $p(0) = p(1) = \frac{1}{2}$.
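A quick numerical confirmation (our own sketch, not from the notes): sweep the input distribution of a BSC and verify that $I(X;Y)$ is maximized at the uniform input, giving $C = 1 - H(p)$. All names here are our own choices.

```python
import numpy as np

def mutual_information(px, channel):
    """I(X;Y) in bits for input p.m.f. px and channel matrix p(y|x)."""
    pxy = px[:, None] * channel                  # joint p(x, y)
    py = pxy.sum(axis=0)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px[:, None] * py[None, :])[mask])).sum())

p = 0.1                                          # crossover probability
bsc = np.array([[1 - p, p], [p, 1 - p]])         # rows: input x, columns: output y
best = max(mutual_information(np.array([a, 1 - a]), bsc)
           for a in np.linspace(0.01, 0.99, 99))
print(best)                                      # about 0.531 = 1 - H(0.1), attained at a = 0.5
```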
Definition 3.3. An $(M, n)$-code consists of the following:
1. An index set $\{1, 2, \cdots, M\}$
2. An encoding function $f: \{1, 2, \cdots, M\} \to \mathcal{X}^n$ yielding codewords $x^n(1), x^n(2), \cdots, x^n(M)$
3. A decoding function $g: \mathcal{Y}^n \to \{1, 2, \cdots, M\}$.

Definition 3.4. Define the probability of error $\lambda_i = P(g(Y^n) \neq i \mid X^n = x^n(i))$ for $1 \leq i \leq M$. Next, define the maximal error probability $\lambda^{(n)} := \max_{i \in \{1, 2, \cdots, M\}} \lambda_i$. Finally, the average probability of error of an $(M, n)$-code is defined as $P_e^{(n)} = \frac{1}{M} \sum_{i=1}^M \lambda_i$.

Definition 3.5. The rate $R$ of an $(M, n)$-code is $R = \frac{\log M}{n}$ (bits per transmission). A rate $R$ is said to be achievable if there exists a sequence of $(2^{nR}, n)$ codes such that $\lambda^{(n)}$ tends to 0 as $n \to \infty$.

Theorem 3.1 (Channel Coding Theorem). All rates below the capacity $C$ are achievable. Conversely, any sequence of $(2^{nR}, n)$-codes with $\lambda^{(n)} \to 0$ must have rate $R \leq C$.
Definition 3.6. The set $A_\epsilon^{(n)}$ of jointly typical sequences $\{(x^n, y^n)\}$ with respect to the distribution $p(x^n, y^n)$ is defined as
\begin{align*}
A_\epsilon^{(n)} = \{(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n :\ & |-\tfrac{1}{n} \log p(x^n) - H(X)| < \epsilon, \\
& |-\tfrac{1}{n} \log p(y^n) - H(Y)| < \epsilon, \\
& |-\tfrac{1}{n} \log p(x^n, y^n) - H(X, Y)| < \epsilon\}
\end{align*}

Theorem 3.2 (Joint AEP Theorem). Let $(X^n, Y^n) \sim p(x^n, y^n)$. Then
(1) $P((X^n, Y^n) \in A_\epsilon^{(n)}) \to 1$ as $n \to \infty$.
(2) $|A_\epsilon^{(n)}| \leq 2^{n(H(X,Y)+\epsilon)}$, and for $n$ large enough, $|A_\epsilon^{(n)}| \geq (1 - \epsilon) 2^{n(H(X,Y)-\epsilon)}$.
(3) If $(\tilde{X}^n, \tilde{Y}^n) \sim p(x^n) p(y^n)$, i.e. $\tilde{X}^n$ and $\tilde{Y}^n$ are independently generated according to $p(x^n)$ and $p(y^n)$ respectively, then
$$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) \leq 2^{-n(I(X;Y)-3\epsilon)},$$
and for $n$ large enough,
$$P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) \geq (1 - \epsilon) 2^{-n(I(X;Y)+3\epsilon)}.$$

Proof. For (1) and (2), the argument is the same as in Theorem 2.3.
For (3), we have
\begin{align*}
P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) &= \sum_{(x^n, y^n) \in A_\epsilon^{(n)}} p(x^n) p(y^n) \\
&\leq 2^{n(H(X,Y)+\epsilon)} \times 2^{-n(H(X)-\epsilon)} \times 2^{-n(H(Y)-\epsilon)} \\
&= 2^{-n(I(X;Y)-3\epsilon)}
\end{align*}
For sufficiently large $n$,
\begin{align*}
P((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^{(n)}) &= \sum_{(x^n, y^n) \in A_\epsilon^{(n)}} p(x^n) p(y^n) \\
&\geq (1 - \epsilon) 2^{n(H(X,Y)-\epsilon)} \times 2^{-n(H(X)+\epsilon)} \times 2^{-n(H(Y)+\epsilon)} \\
&= (1 - \epsilon) 2^{-n(I(X;Y)+3\epsilon)}
\end{align*}

We now prove the Channel Coding Theorem. To prove achievability, we use the probabilistic method: we will show that randomly generated $(2^{nR}, n)$ codes are "good" ($\lambda^{(n)} \to 0$) on average, and thus a "good" code exists.
Step 1: Fix $p(x)$. Randomly generate a $(2^{nR}, n)$ code $\mathcal{B}$ according to $p(x)$.
Step 2: Reveal the codebook to both the sender and the receiver. Both of them know $p(x)$ and $p(y|x)$.
Step 3: For any message $w \in \{1, 2, \cdots, 2^{nR}\}$, send the $w$-th codeword $X^n(w)$ over the channel.
Step 4: After the receiver receives $Y^n$, the receiver declares that the index $\hat{w}$ was sent if $(X^n(\hat{w}), Y^n)$ is jointly typical and there is no other message index $w'$ with $(X^n(w'), Y^n) \in A_\epsilon^{(n)}$. If no such $\hat{w}$ exists, the receiver sets $\hat{w} = 0$.
Step 5: There is a decoding error if $\hat{w} \neq w$. Let $E$ be the event of a decoding error and $E_i$ be the event $(X^n(i), Y^n) \in A_\epsilon^{(n)}$. Then (averaging over the random codebook)
\begin{align*}
\lambda_1 &= P(E \mid W = 1) \\
&= P(E_1^C \cup E_2 \cup \cdots \cup E_{2^{nR}} \mid W = 1) \\
&\leq P(E_1^C \mid W = 1) + P(E_2 \mid W = 1) + \cdots + P(E_{2^{nR}} \mid W = 1)
\end{align*}
By the Joint AEP Theorem, $P(E_1^C) \leq \epsilon$ for $n$ large enough. Also, given $W = 1$ and $i \geq 2$, $X^n(i)$ and $Y^n$ are independent, so $P(E_i) = P((X^n(i), Y^n) \in A_\epsilon^{(n)} \mid W = 1) \leq 2^{-n(I(X;Y)-3\epsilon)}$. So
\begin{align*}
\lambda_1 &\leq \epsilon + \sum_{i=2}^{2^{nR}} 2^{-n(I(X;Y)-3\epsilon)} \\
&= \epsilon + (2^{nR} - 1) 2^{-n(I(X;Y)-3\epsilon)} \\
&\leq \epsilon + 2^{-n(I(X;Y)-3\epsilon-R)} \\
&\leq 2\epsilon \quad \text{if } R < I(X; Y) - 3\epsilon \text{ and } n \text{ is large enough.}
\end{align*}
Similarly, we can obtain $\lambda_i \leq 2\epsilon$ for $2 \leq i \leq 2^{nR}$, for $n$ large enough and $R < I(X;Y) - 3\epsilon$, and so $P_e^{(n)} = \frac{1}{2^{nR}} (\lambda_1 + \lambda_2 + \cdots + \lambda_{2^{nR}}) < 2\epsilon$ on average over codebooks. Thus, there is some realization $\mathcal{B}^*$ of $\mathcal{B}$ such that $P_e^{(n)} = \frac{1}{2^{nR}} (\lambda_1 + \lambda_2 + \cdots + \lambda_{2^{nR}}) < 2\epsilon$. By renumbering, if necessary, assume $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_{2^{nR}}$. Then $\max\{\lambda_1, \lambda_2, \cdots, \lambda_{2^{nR-1}}\} = \lambda_{2^{nR-1}} < 4\epsilon$ (otherwise the average would exceed $2\epsilon$). Keeping only the codewords $x^n(1), x^n(2), \cdots, x^n(2^{nR-1})$, we obtain a $(2^{n(R - \frac{1}{n})}, n)$ code with maximal error probability less than $4\epsilon$, if $R < I(X;Y)$. Choosing $p(x)$ to be the capacity-achieving distribution, there exists a $(2^{n(R - \frac{1}{n})}, n)$ code with $\lambda^{(n)} \to 0$ as $n \to \infty$ if $R < C$.
Conversely, for any sequence of $(2^{nR}, n)$ codes with $\lambda^{(n)} \to 0$, we have $P_e^{(n)} \to 0$, since if $W$ is uniformly distributed over $\{1, 2, \cdots, 2^{nR}\}$,
\begin{align*}
P(\hat{W} \neq W) &= \sum_{i=1}^{2^{nR}} P(\hat{W} \neq i \mid W = i) P(W = i) \\
&= \frac{1}{2^{nR}} (\lambda_1 + \lambda_2 + \cdots + \lambda_{2^{nR}}) \\
&= P_e^{(n)} \to 0
\end{align*}
Hence,
\begin{align*}
nR &= H(W) \\
&= H(W|Y^n) + I(W; Y^n) \\
&\leq H(W|Y^n) + I(X^n(W); Y^n) \quad \text{(data processing inequality for } W \to X^n(W) \to Y^n\text{)} \\
&\leq 1 + P_e^{(n)} nR + I(X^n(W); Y^n) \quad \text{(Fano's inequality)}
\end{align*}
Note that
\begin{align*}
I(X^n(W); Y^n) &= H(Y^n) - H(Y^n \mid X^n(W)) \\
&= H(Y^n) - \sum_{i=1}^n H(Y_i \mid Y_{i-1}, Y_{i-2}, \cdots, Y_1, X^n(W)) \\
&= H(Y^n) - \sum_{i=1}^n H(Y_i \mid X_i(W)) \quad \text{(for a DMC)} \\
&\leq \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i(W)) \\
&= \sum_{i=1}^n I(X_i(W); Y_i) \\
&\leq nC
\end{align*}
So we have $nR \leq 1 + P_e^{(n)} nR + nC$ and thus $R \leq (1 - P_e^{(n)})^{-1} (\frac{1}{n} + C)$; letting $n \to \infty$, we have $R \leq C$.
Definition 3.7. The feedback capacity, $C_{FB}$, is the supremum of all rates achievable by feedback codes.

Remark 3.1. $C_{FB} \geq C$.

Theorem 3.3. $C_{FB} = C$ for a DMC with feedback.
Proof. To prove $C_{FB} \leq C$: for any achievable rate $R$ and $W$ uniformly distributed over $\{1, 2, \cdots, 2^{nR}\}$ with $P_e^{(n)} \to 0$ as $n \to \infty$, we will show that $R \leq C$. Note that
\begin{align*}
nR &= H(W) \\
&= H(W|Y^n) + I(W; Y^n) \\
&\leq 1 + P_e^{(n)} nR + I(W; Y^n) \quad \text{(Fano's inequality)}
\end{align*}
and note that
\begin{align*}
I(W; Y^n) &= H(Y^n) - H(Y^n \mid W) \\
&= H(Y^n) - \sum_{i=1}^n H(Y_i \mid Y_{i-1}, Y_{i-2}, \cdots, Y_1, W) \\
&= H(Y^n) - \sum_{i=1}^n H(Y_i \mid Y_{i-1}, Y_{i-2}, \cdots, Y_1, W, X_i) \quad \text{(since } X_i \text{ is a function of } W \text{ and } Y_1^{i-1}\text{)} \\
&= H(Y^n) - \sum_{i=1}^n H(Y_i \mid X_i) \quad \text{(for a DMC)} \\
&\leq \sum_{i=1}^n H(Y_i) - \sum_{i=1}^n H(Y_i \mid X_i) \\
&= \sum_{i=1}^n I(X_i; Y_i) \\
&\leq nC
\end{align*}
So we have $nR \leq 1 + P_e^{(n)} nR + nC$, and letting $n \to \infty$ we have $R \leq C$.

Remark 3.2. Feedback does not help for a DMC, but it can help for channels with memory.

Definition 3.8. A stationary process $X_{-\infty}^{\infty}$ is said to be ergodic if "time average" = "space average", i.e. for any function $f$, $\frac{f(X_1) + f(X_2) + \cdots + f(X_n)}{n} \to E[f(X_1)]$ with probability 1.

Example 3.2. An i.i.d. process is ergodic. Assume $X_{-\infty}^{\infty}$ is i.i.d.; then by the strong law of large numbers, $\frac{1}{n}(f(X_1) + f(X_2) + \cdots + f(X_n)) \to E[f(X_1)]$ with probability 1.

Example 3.3. Let
$$X_1 = \begin{cases} 0 & \text{with probability } 0.3 \\ 1 & \text{with probability } 0.7 \end{cases}$$
and $X_i = X_1$ for all $i \geq 2$. Then $\frac{1}{n}(X_1 + X_2 + \cdots + X_n) = X_1 \neq 0.7 = E[X_1]$ with positive probability. So $X_{-\infty}^{\infty}$ is not ergodic.

Example 3.4. An irreducible finite-state stationary Markov chain is ergodic.
Example 3.5. A reducible finite-state stationary Markov chain is not ergodic.
Theorem 3.4 (Shannon-McMillan-Breiman Theorem). For a finite-state stationary ergodic process $X = X_{-\infty}^{\infty}$, $-\frac{1}{n} \log p(X_1, X_2, \cdots, X_n) \to H(\mathcal{X})$ with probability 1.

Remark 3.3. (1) The typical set $A_\epsilon^{(n)}$ is defined similarly for a stationary ergodic process.
(2) The AEP and joint AEP hold for $A_\epsilon^{(n)}$.
Theorem 3.5 (Separation Theorem). There exists a source-channel code with $P_e^{(n)} \to 0$ as $n \to \infty$ if $H(\mathcal{V}) < C$. Conversely, for any stationary process $V$, if $H(\mathcal{V}) > C$, the probability of error $P_e^{(n)}$ is bounded away from 0, and hence it is impossible to send the process over the channel with an arbitrarily low probability of error.

Proof. By the AEP, there exists a typical set $A_\epsilon^{(n)}$ of size not greater than $2^{n(H(\mathcal{V})+\epsilon)}$ which occupies almost the entire probability space. We will encode only the sequences in the typical set $A_\epsilon^{(n)}$; all other sequences will result in an error, contributing at most $\epsilon$ to the probability of error for large $n$. By the Channel Coding Theorem, as long as $H(\mathcal{V}) + \epsilon < C$, the receiver can reconstruct $V^n \in A_\epsilon^{(n)}$ with an arbitrarily low probability of error. So $P_e^{(n)} = P(\hat{V}^n \neq V^n) \leq P(V^n \notin A_\epsilon^{(n)}) + P(g(Y^n) \neq V^n \mid V^n \in A_\epsilon^{(n)}) \leq \epsilon + \epsilon = 2\epsilon$ for sufficiently large $n$. In other words, we can reconstruct the source with an arbitrarily small probability of error if $H(\mathcal{V}) < C$.
Conversely, suppose $P_e^{(n)} \to 0$ as $n \to \infty$. Then
\begin{align*}
H(\mathcal{V}) &\leq \frac{H(V_1, V_2, \cdots, V_n)}{n} \\
&= \frac{1}{n} H(V^n \mid \hat{V}^n) + \frac{1}{n} I(V^n; \hat{V}^n) \\
&\leq \frac{1}{n} (1 + P_e^{(n)} \log |\mathcal{V}|^n) + \frac{1}{n} I(V^n; \hat{V}^n) \quad \text{(by Fano's inequality)} \\
&\leq \frac{1}{n} (1 + P_e^{(n)} \log |\mathcal{V}|^n) + \frac{1}{n} I(X^n; Y^n) \quad \text{(data processing inequality for the Markov chain } V^n \to X^n \to Y^n \to \hat{V}^n\text{)} \\
&\leq \frac{1}{n} + P_e^{(n)} \log |\mathcal{V}| + C
\end{align*}
Letting $n \to \infty$ and using $P_e^{(n)} \to 0$, we obtain $H(\mathcal{V}) \leq C$.

4 Rate Distortion Theory


A source $X$ of size $2^n$ is to be represented by $\hat{X}$ of size $2^m$, $m \leq n$; necessarily there is a trade-off between $\frac{m}{n}$ and the distortion between $X$ and $\hat{X}$.
Definition 4.1. A distortion function (or measure) is a mapping $d: \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}^+$.
A distortion measure is said to be bounded if $d_{\max} := \max_{x \in \mathcal{X}, \hat{x} \in \hat{\mathcal{X}}} d(x, \hat{x}) < \infty$.
The distortion between sequences $x^n$ and $\hat{x}^n$ is defined by $d(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^n d(x_i, \hat{x}_i)$.

Definition 4.2. A $(2^{nR}, n)$-rate distortion code consists of an encoding function $f_n: \mathcal{X}^n \to \{1, 2, \cdots, 2^{nR}\}$ and a decoding function $g_n: \{1, 2, \cdots, 2^{nR}\} \to \hat{\mathcal{X}}^n$. The rate of an $(M, n)$-rate distortion code is defined as $R = \frac{\log M}{n}$. The distortion associated with the $(2^{nR}, n)$ code is defined as
\begin{align*}
D &= E[d(X^n, g_n(f_n(X^n)))] \\
&= \sum_{x^n} p(x^n) d(x^n, g_n(f_n(x^n)))
\end{align*}

In the following, we will explore the trade-off between rate and distortion.

Definition 4.3. A rate distortion pair $(R, D)$ is said to be achievable if there exists a sequence of $(2^{nR}, n)$ rate distortion codes $(f_n, g_n)$ with
$$\lim_{n \to \infty} \sum_{x^n} p(x^n) d(x^n, g_n(f_n(x^n))) \leq D.$$

Definition 4.4. The rate distortion function, $R(D)$, is the infimum of all rates $R$ such that $(R, D)$ is achievable.

Theorem 4.1 (Rate Distortion Theorem). The rate distortion function for an i.i.d. source $X \sim p(x)$ and bounded distortion function $d(x, \hat{x})$ can be computed as
$$R(D) = \min_{p(\hat{x}|x): \sum_{(x,\hat{x})} p(x) p(\hat{x}|x) d(x, \hat{x}) \leq D} I(X; \hat{X}).$$

Do you recall something similar? Yes, the definition of capacity: $C = \max_{p(x)} I(X; Y)$.
Example 4.1 (Hamming distortion).
$$d(x, \hat{x}) = \begin{cases} 0 & \text{if } x = \hat{x} \\ 1 & \text{if } x \neq \hat{x} \end{cases}$$

Example 4.2. The rate distortion function for a Bernoulli($p$) source with Hamming distortion is given by
$$R(D) = \begin{cases} H(p) - H(D) & 0 \leq D \leq \min\{p, 1-p\} \\ 0 & D > \min\{p, 1-p\} \end{cases}$$

Proof. W.L.O.G., assume $p < \frac{1}{2}$. We need to compute $R(D) = \min_{p(\hat{x}|x): \sum_{x,\hat{x}} p(x) p(\hat{x}|x) d(x,\hat{x}) \leq D} I(X; \hat{X})$. Let $\oplus$ denote modulo-2 addition. First consider the case $0 \leq D \leq p$. Note that
\begin{align*}
I(X; \hat{X}) &= H(X) - H(X|\hat{X}) \\
&= H(p) - H(X \oplus \hat{X} \mid \hat{X}) \\
&\geq H(p) - H(X \oplus \hat{X}) \\
&= H(p) - H(P(X \oplus \hat{X} = 1)) \\
&\geq H(p) - H(D)
\end{align*}
For the last inequality, note that $H(\cdot)$ is increasing on $[0, \frac{1}{2}]$ and hence on $[0, D]$, and $P(X \oplus \hat{X} = 1) = \sum p(x) p(\hat{x}|x) d(x, \hat{x}) \leq D$. Choosing the joint distribution of $(X, \hat{X})$ appropriately (the standard choice is a binary symmetric "test channel" from $\hat{X}$ to $X$ with crossover probability $D$), the lower bound $H(p) - H(D)$ is achieved.
If $D \geq p$, let $\hat{X} = 0$ with probability 1; then $\sum_{x, \hat{x}} p(x) p(\hat{x}|x) d(x, \hat{x}) = p \leq D$ and $I(X; \hat{X}) = H(X) - H(X|\hat{X}) = 0$.
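A small Python sketch (ours, not part of the notes) evaluating $R(D) = H(p) - H(D)$ for a Bernoulli source with Hamming distortion, just to see the rate-distortion trade-off numerically.

```python
import math

def binary_entropy(q):
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def rate_distortion_bernoulli(p, D):
    """R(D) for a Bernoulli(p) source with Hamming distortion."""
    if D >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

p = 0.3
for D in (0.0, 0.05, 0.1, 0.2, 0.3):
    print(D, round(rate_distortion_bernoulli(p, D), 3))
# R(D) decreases from H(0.3) ~ 0.881 bits down to 0 at D = 0.3
```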

Definition 4.5. Let $(X, \hat{X}) \sim p(x, \hat{x})$, and let $d(x, \hat{x})$ be a distortion measure on $\mathcal{X} \times \hat{\mathcal{X}}$. Then the distortion typical set $A_{d,\epsilon}^{(n)}$ is defined as the set
\begin{align*}
\{(x^n, \hat{x}^n) :\ & |-\tfrac{1}{n} \log p(x^n) - H(X)| < \epsilon,\ |-\tfrac{1}{n} \log p(\hat{x}^n) - H(\hat{X})| < \epsilon, \\
& |-\tfrac{1}{n} \log p(x^n, \hat{x}^n) - H(X, \hat{X})| < \epsilon,\ |d(x^n, \hat{x}^n) - E[d(X, \hat{X})]| < \epsilon\}.
\end{align*}

Lemma 4.1. For all $(x^n, \hat{x}^n) \in A_{d,\epsilon}^{(n)}$, $p(\hat{x}^n) \geq p(\hat{x}^n \mid x^n) 2^{-n(I(X;\hat{X})+3\epsilon)}$.

Proof.
\begin{align*}
p(\hat{x}^n \mid x^n) &= p(\hat{x}^n) \frac{p(x^n, \hat{x}^n)}{p(x^n) p(\hat{x}^n)} \\
&\leq p(\hat{x}^n) \frac{2^{-n(H(X,\hat{X})-\epsilon)}}{2^{-n(H(X)+\epsilon)} 2^{-n(H(\hat{X})+\epsilon)}} \\
&= p(\hat{x}^n) 2^{n(I(X;\hat{X})+3\epsilon)}
\end{align*}

We now prove the Rate Distortion Theorem.

Achievability: For any $D$ and $R > R(D)$, we will show that $(R, D)$ is achievable by proving the existence of a sequence of $(2^{nR}, n)$ codes with asymptotic distortion $\leq D$.
Fix $p(\hat{x}|x)$, where $p(\hat{x}|x)$ satisfies the constraint $\sum_{x,\hat{x}} p(x) p(\hat{x}|x) d(x, \hat{x}) \leq D$ and achieves $R(D)$. Calculate $p(\hat{x}) = \sum_x p(x) p(\hat{x}|x)$.
Code generation: Generate a $(2^{nR}, n)$ code $\mathcal{C}$ according to $p(\hat{x})$.
Encoding: Encode $X^n$ by $w$ if there exists $w$ such that $(X^n, \hat{X}^n(w)) \in A_{d,\epsilon}^{(n)}$. If there is more than one such $w$, choose the least. If there is no such $w$, set $w = 1$.
Decoding: Decode by $\hat{X}^n(w)$. Under such a scheme, the expected distortion $\bar{D}$, taken over the random choice of codebook $\mathcal{C}$ and the source, is $\bar{D} = E_{X^n, \mathcal{C}}[d(X^n, \hat{X}^n)]$. Next, we show that $\bar{D}$ is essentially at most $D$:
\begin{align*}
\bar{D} &= \sum_{\mathcal{C}} P(\mathcal{C}) \sum_{x^n} p(x^n) d(x^n, \hat{x}^n(w)) \\
&= \sum_{\mathcal{C}} P(\mathcal{C}) \left( \sum_{x^n \in J(\mathcal{C})} p(x^n) d(x^n, \hat{x}^n(w)) + \sum_{x^n \notin J(\mathcal{C})} p(x^n) d(x^n, \hat{x}^n(1)) \right) \\
&\leq D + \epsilon + d_{\max} \sum_{\mathcal{C}} P(\mathcal{C}) \sum_{x^n \notin J(\mathcal{C})} p(x^n)
\end{align*}
where $J(\mathcal{C})$ is the set of $x^n$ having at least one $\hat{x}^n$ in the codebook that is distortion jointly typical with it. Since $d(\cdot, \cdot)$ is bounded, it suffices to show that $P_e := \sum_{\mathcal{C}} P(\mathcal{C}) \sum_{x^n \notin J(\mathcal{C})} p(x^n)$ is arbitrarily small as $n \to \infty$.
\begin{align*}
P_e &= \sum_{x^n} p(x^n) \sum_{\mathcal{C}: x^n \notin J(\mathcal{C})} P(\mathcal{C}) \\
&= \sum_{x^n} p(x^n) \prod_{i=1}^{2^{nR}} P((x^n, \hat{X}^n(i)) \notin A_{d,\epsilon}^{(n)}) \\
&= \sum_{x^n} p(x^n) \prod_{i=1}^{2^{nR}} \Bigg( \sum_{\hat{x}^n: (x^n, \hat{x}^n) \notin A_{d,\epsilon}^{(n)}} p(\hat{x}^n) \Bigg) \\
&= \sum_{x^n} p(x^n) \Bigg( 1 - \sum_{\hat{x}^n} p(\hat{x}^n) K(x^n, \hat{x}^n) \Bigg)^{2^{nR}} \quad \text{where } K(x^n, \hat{x}^n) = 1 \text{ if } (x^n, \hat{x}^n) \in A_{d,\epsilon}^{(n)} \text{ and } 0 \text{ otherwise} \\
&\leq \sum_{x^n} p(x^n) \Bigg( 1 - 2^{-n(I(X;\hat{X})+3\epsilon)} \sum_{\hat{x}^n} p(\hat{x}^n \mid x^n) K(x^n, \hat{x}^n) \Bigg)^{2^{nR}} \quad (*) \\
&\leq \sum_{x^n} p(x^n) \Bigg( 1 - \sum_{\hat{x}^n} p(\hat{x}^n \mid x^n) K(x^n, \hat{x}^n) + e^{-2^{-n(I(X;\hat{X})+3\epsilon)} \times 2^{nR}} \Bigg) \quad (**) \\
&= 1 - \sum_{x^n, \hat{x}^n} p(x^n, \hat{x}^n) K(x^n, \hat{x}^n) + e^{-2^{n(R - I(X;\hat{X}) - 3\epsilon)}} \\
&= P((X^n, \hat{X}^n) \notin A_{d,\epsilon}^{(n)}) + e^{-2^{n(R - I(X;\hat{X}) - 3\epsilon)}} \\
&\leq \epsilon + e^{-2^{n(R - I(X;\hat{X}) - 3\epsilon)}}
\end{align*}
Note that $(*)$ follows from $p(\hat{x}^n) \geq p(\hat{x}^n \mid x^n) 2^{-n(I(X;\hat{X})+3\epsilon)}$ (Lemma 4.1) and $(**)$ follows from $(1 - xy)^n \leq 1 - x + e^{-ny}$ for $0 \leq x, y \leq 1$ and $n \geq 1$.
Therefore, if $R > I(X; \hat{X}) + 3\epsilon = R(D) + 3\epsilon$, then $\bar{D} \leq D + \delta$, where $\delta$ is arbitrarily small if we choose $\epsilon$ and $n$ appropriately. So there is at least one code $\mathcal{C}^*$ with rate $R$ (which is $> R(D)$) and average distortion $\bar{D} \leq D + \delta$. To get rid of the $\delta$, we note below that $R(D)$ is a continuous function of $D$, so $R > R(D) \Rightarrow R > R(D - \epsilon)$ for small $\epsilon$, and we can replace $D$ by $D - \epsilon$ in the above argument.
Proposition 4.1. $R(D)$ is a non-increasing convex function of $D$.

Proof. For any $0 < \lambda < 1$, we need to prove $R(D_\lambda) = R(\lambda D_1 + (1 - \lambda) D_2) \leq \lambda R(D_1) + (1 - \lambda) R(D_2)$. Let $P_1 = p_1(\hat{x}|x)$ and $P_2 = p_2(\hat{x}|x)$ be conditional distributions achieving $R(D_1)$ and $R(D_2)$ respectively, and consider $P_\lambda = \lambda P_1 + (1 - \lambda) P_2$. By the fact that $I(X; \hat{X})$ is a convex function of $p(\hat{x}|x)$ for a given $p(x)$, we have
\begin{align*}
I_{P_\lambda}(X; \hat{X}) &\leq \lambda I_{P_1}(X; \hat{X}) + (1 - \lambda) I_{P_2}(X; \hat{X}) \\
&= \lambda R(D_1) + (1 - \lambda) R(D_2)
\end{align*}
Noting that $\sum_{x,\hat{x}} p(x) p_\lambda(\hat{x}|x) d(x, \hat{x}) \leq D_\lambda$, we have $R(D_\lambda) \leq I_{P_\lambda}(X; \hat{X}) \leq \lambda R(D_1) + (1 - \lambda) R(D_2)$.

Corollary 4.1. $R(D)$ is continuous in $D$.


We end this section by proving that for any source $X \sim p(x)$ with distortion measure $d(x, \hat{x})$, any rate distortion code with distortion $D$ must have rate $R \geq R(D)$. The code acts as
$$X^n \xrightarrow{f_n} \{1, 2, \cdots, 2^{nR}\} \xrightarrow{g_n} \hat{X}^n.$$
We have
\begin{align*}
nR &\geq H(f_n(X^n)) \\
&\geq H(f_n(X^n)) - H(f_n(X^n) \mid X^n) \\
&= I(X^n; f_n(X^n)) \\
&\geq I(X^n; \hat{X}^n) \quad \text{(by the data-processing inequality)} \\
&= H(X^n) - H(X^n \mid \hat{X}^n) \\
&= \sum_{i=1}^n H(X_i) - H(X^n \mid \hat{X}^n) \quad \text{(the source is i.i.d.)} \\
&= \sum_{i=1}^n H(X_i) - \sum_{i=1}^n H(X_i \mid \hat{X}^n, X_{i-1}, X_{i-2}, \cdots, X_1) \\
&\geq \sum_{i=1}^n H(X_i) - \sum_{i=1}^n H(X_i \mid \hat{X}_i) \\
&= \sum_{i=1}^n I(X_i; \hat{X}_i) \\
&\geq \sum_{i=1}^n R(E[d(X_i, \hat{X}_i)]) \\
&= n \left( \frac{1}{n} \sum_{i=1}^n R(E[d(X_i, \hat{X}_i)]) \right) \\
&\geq n R\!\left( \frac{1}{n} \sum_{i=1}^n E[d(X_i, \hat{X}_i)] \right) \quad \text{(by convexity of } R(D) \text{ and Jensen's Inequality)} \\
&= n R(E[d(X^n, \hat{X}^n)]) \\
&\geq n R(D)
\end{align*}

Remark 4.1. When $D = 0$ and $d(x, \hat{x}) = \begin{cases} 0 & \text{if } x = \hat{x} \\ 1 & \text{if } x \neq \hat{x}, \end{cases}$ we have $R(0) = \min I(X; \hat{X}) = I(X; X) = H(X)$.

5 Information Theory and Statistics


Definition 5.1. The type $P_{x^n}$ of a sequence $x_1, \cdots, x_n$ over alphabet $\mathcal{X}$ is defined as $P_{x^n}(a) = \frac{N(a|x^n)}{n}$, where $N(a|x^n)$ is the number of times $a$ occurs in $x^n$. $\mathcal{P}_n$ will denote the set of all types with denominator $n$.

Example 5.1. Let $\mathcal{X} = \{0, 1\}$. Then $\mathcal{P}_n = \{(0, 1), (\frac{1}{n}, \frac{n-1}{n}), \cdots, (1, 0)\}$.

Definition 5.2. For any $P \in \mathcal{P}_n$, the type class of $P$, denoted by $T(P)$, is defined as $T(P) = \{x^n \in \mathcal{X}^n : P_{x^n} = P\}$.

Example 5.2. Let $\mathcal{X} = \{1, 2, 3\}$ and $x^n = 11321$. Then $P_{x^n}(1) = \frac{3}{5}$, $P_{x^n}(2) = P_{x^n}(3) = \frac{1}{5}$, and $T(P_{x^n}) = \{11123, 11132, 11213, \cdots, 32111\}$. And $|T(P_{x^n})| = \binom{5}{3, 1, 1} = \frac{5!}{3! \, 1! \, 1!} = 20$.
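A tiny Python sketch of our own that computes the type of a sequence and the size of its type class, matching Example 5.2.

```python
from collections import Counter
from math import factorial

def type_of(seq):
    """Empirical distribution (type) of a sequence as a dict {symbol: fraction}."""
    n = len(seq)
    return {a: c / n for a, c in Counter(seq).items()}

def type_class_size(seq):
    """Number of sequences with the same type (a multinomial coefficient)."""
    size = factorial(len(seq))
    for c in Counter(seq).values():
        size //= factorial(c)
    return size

x = "11321"
print(type_of(x))          # {'1': 0.6, '3': 0.2, '2': 0.2}
print(type_class_size(x))  # 20
```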

Theorem 5.1. $|\mathcal{P}_n| \leq (n + 1)^{|\mathcal{X}|}$.

Proof. Trivial.

Theorem 5.2. If $X_1, X_2, \cdots, X_n$ are drawn i.i.d. according to $Q(x)$, then $Q^n(x^n) = 2^{-n(H(P_{x^n}) + D(P_{x^n} \| Q))}$.

Proof.
\begin{align*}
Q^n(x^n) &= \prod_{i=1}^n Q(x_i) \\
&= \prod_{a \in \mathcal{X}} Q(a)^{N(a|x^n)} \\
&= \prod_{a \in \mathcal{X}} Q(a)^{n P_{x^n}(a)} \\
&= \prod_{a \in \mathcal{X}} 2^{n P_{x^n}(a) \log Q(a)} \\
&= 2^{n \sum_{a \in \mathcal{X}} (P_{x^n}(a) \log Q(a) - P_{x^n}(a) \log P_{x^n}(a) + P_{x^n}(a) \log P_{x^n}(a))} \\
&= 2^{n \sum_{a \in \mathcal{X}} \left( -P_{x^n}(a) \log \frac{P_{x^n}(a)}{Q(a)} + P_{x^n}(a) \log P_{x^n}(a) \right)} \\
&= 2^{-n(H(P_{x^n}) + D(P_{x^n} \| Q))}
\end{align*}

Corollary 5.1. If $x^n$ is in the type class of $Q$, then $Q^n(x^n) = 2^{-nH(Q)}$.

Theorem 5.3. For any $P \in \mathcal{P}_n$, $\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{nH(P)} \leq |T(P)| \leq 2^{nH(P)}$.

Proof. Firstly, we have
\begin{align*}
1 &\geq P^n(T(P)) \\
&= \sum_{x^n \in T(P)} P^n(x^n) \\
&= \sum_{x^n \in T(P)} 2^{-nH(P)} \\
&= |T(P)| 2^{-nH(P)}
\end{align*}
So we get $|T(P)| \leq 2^{nH(P)}$. For the other part, we first prove that $P^n(T(P)) \geq P^n(T(\hat{P}))$ for any $\hat{P} \in \mathcal{P}_n$. Consider the ratio:
\begin{align*}
\frac{P^n(T(P))}{P^n(T(\hat{P}))} &= \frac{|T(P)| \prod_{a \in \mathcal{X}} P(a)^{nP(a)}}{|T(\hat{P})| \prod_{a \in \mathcal{X}} P(a)^{n\hat{P}(a)}} \\
&= \prod_{a \in \mathcal{X}} \frac{(n\hat{P}(a))!}{(nP(a))!} P(a)^{n(P(a) - \hat{P}(a))} \\
&\geq \prod_{a \in \mathcal{X}} (nP(a))^{n(\hat{P}(a) - P(a))} P(a)^{n(P(a) - \hat{P}(a))} \quad \text{(using } \tfrac{m!}{k!} \geq k^{m-k}\text{)} \\
&= \prod_{a \in \mathcal{X}} n^{n(\hat{P}(a) - P(a))} \\
&= n^{n(\sum_a \hat{P}(a) - \sum_a P(a))} = 1
\end{align*}
So we have
\begin{align*}
1 &= \sum_{Q \in \mathcal{P}_n} P^n(T(Q)) \\
&\leq \sum_{Q \in \mathcal{P}_n} P^n(T(P)) \\
&= |\mathcal{P}_n| P^n(T(P)) \\
&\leq (n + 1)^{|\mathcal{X}|} \sum_{x^n \in T(P)} P^n(x^n) \\
&= (n + 1)^{|\mathcal{X}|} \sum_{x^n \in T(P)} 2^{-nH(P)} \\
&= (n + 1)^{|\mathcal{X}|} |T(P)| 2^{-nH(P)}
\end{align*}
This implies $\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{nH(P)} \leq |T(P)|$.
Theorem 5.4. For any $P \in \mathcal{P}_n$ and any distribution $Q$, $\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{-nD(P\|Q)} \leq Q^n(T(P)) \leq 2^{-nD(P\|Q)}$.

Proof.
\begin{align*}
Q^n(T(P)) &= \sum_{x^n \in T(P)} Q^n(x^n) \\
&= \sum_{x^n \in T(P)} 2^{-n(D(P\|Q) + H(P))} \\
&= |T(P)| 2^{-n(D(P\|Q) + H(P))}
\end{align*}
By the previous theorem, $\frac{1}{(n+1)^{|\mathcal{X}|}} 2^{nH(P)} \leq |T(P)| \leq 2^{nH(P)}$, so we get the desired result.

Theorem 5.5. Let $X_1, X_2, \cdots, X_n$ be i.i.d. $\sim P(x)$. Then
$$P(D(P_{x^n} \| P) > \epsilon) \leq 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)};$$
moreover, $D(P_{x^n} \| P) \to 0$ with probability 1.

Proof.
\begin{align*}
P(D(P_{x^n} \| P) > \epsilon) &= \sum_{x^n: D(P_{x^n} \| P) > \epsilon} P^n(x^n) \\
&= \sum_{\hat{P}: D(\hat{P} \| P) > \epsilon} P^n(T(\hat{P})) \\
&\leq \sum_{\hat{P}: D(\hat{P} \| P) > \epsilon} 2^{-nD(\hat{P} \| P)} \\
&\leq 2^{-n\epsilon} \sum_{\hat{P}: D(\hat{P} \| P) > \epsilon} 1 \\
&\leq 2^{-n\epsilon} (n + 1)^{|\mathcal{X}|} \\
&= 2^{-n\left(\epsilon - |\mathcal{X}| \frac{\log(n+1)}{n}\right)}
\end{align*}
Summing up, it is not hard to see that for any $\epsilon > 0$, $\sum_{n=1}^\infty P(D(P_{x^n} \| P) > \epsilon) < \infty$. Then by the Borel-Cantelli Lemma, $P(\limsup_{n \to \infty} D(P_{x^n} \| P) > \epsilon) = 0$. In other words, $D(P_{x^n} \| P) \to 0$ with probability 1.

Universal Source Coding: For a source $X$ with unknown distribution $p(x)$ and entropy $H(X)$, there exists a source code with rate $R > H(X)$, and the source code can be chosen independently of the source distribution.

Notation: For a source $X_1, X_2, \cdots, X_n$ with an unknown distribution $Q$, a universal source code with rate $R$ consists of an encoder $f_n: \mathcal{X}^n \to \{1, 2, \cdots, 2^{nR}\}$ and a decoder $\phi_n: \{1, 2, \cdots, 2^{nR}\} \to \mathcal{X}^n$. The probability of error is $P_e^{(n)} = Q^n(\phi_n(f_n(X^n)) \neq X^n)$.

Theorem 5.6. There exists a sequence of $(2^{nR}, n)$ universal source codes such that $P_e^{(n)} \to 0$ for every source $Q$ with $H(Q) < R$.

Proof. Let $R_n = R - |\mathcal{X}| \frac{\log(n+1)}{n}$. Consider $A := \{x^n \in \mathcal{X}^n : H(P_{x^n}) \leq R_n\}$. Then
\begin{align*}
|A| &= \sum_{P \in \mathcal{P}_n: H(P) \leq R_n} |T(P)| \\
&\leq \sum_{P \in \mathcal{P}_n: H(P) \leq R_n} 2^{nH(P)} \\
&\leq \sum_{P \in \mathcal{P}_n: H(P) \leq R_n} 2^{nR_n} \\
&\leq (n + 1)^{|\mathcal{X}|} 2^{nR_n} \\
&= 2^{nR}
\end{align*}
We can ignore the sequences outside $A$ (since they have probability close to 0) and only encode those in $A$, for which a source code with rate $R$ is enough:
\begin{align*}
P_e^{(n)} &= 1 - Q^n(A) \\
&= \sum_{P: H(P) > R_n} Q^n(T(P)) \\
&\leq (n + 1)^{|\mathcal{X}|} 2^{-n \min_{P: H(P) > R_n} D(P \| Q)}
\end{align*}
Since $H(Q) < R$ and $H(P) > R_n$, for large enough $n$ we have $H(P) > H(Q)$. So we deduce that, for large $n$, $\min_{P: H(P) > R_n} D(P \| Q) > 0$ (bounded away from 0), which implies that $P_e^{(n)} \to 0$ as $n \to \infty$.
Theorem 5.7 (Sanov's Theorem). Let $X_1, X_2, \cdots, X_n$ be i.i.d. $\sim Q(x)$. Let $E$ be a set of probability distributions. Let $Q^n(E) := Q^n(\{x^n : P_{x^n} \in E \cap \mathcal{P}_n\})$, or equivalently $Q^n(E) = \sum_{x^n: P_{x^n} \in E \cap \mathcal{P}_n} Q^n(x^n)$, or equivalently $Q^n(E) = \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P))$. Then
$$Q^n(E) \leq (n + 1)^{|\mathcal{X}|} 2^{-nD(P^* \| Q)},$$
where $P^* = \arg\min_{P \in E} D(P \| Q)$ is the distribution in $E$ that is closest to $Q$ in relative entropy. If, in addition, the set $E$ is the closure of its interior, then $\frac{1}{n} \log Q^n(E) \to -D(P^* \| Q)$.

Proof.
\begin{align*}
Q^n(E) &= \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P)) \\
&\leq \sum_{P \in E \cap \mathcal{P}_n} 2^{-nD(P \| Q)} \\
&\leq \sum_{P \in E \cap \mathcal{P}_n} 2^{-nD(P^* \| Q)} \\
&\leq (n + 1)^{|\mathcal{X}|} 2^{-nD(P^* \| Q)}
\end{align*}
If $E$ is the closure of its interior, we can find a sequence of types $P_n \in E \cap \mathcal{P}_n$ such that $D(P_n \| Q) \to D(P^* \| Q)$.
Now,
\begin{align*}
Q^n(E) &= \sum_{P \in E \cap \mathcal{P}_n} Q^n(T(P)) \\
&\geq Q^n(T(P_n)) \\
&\geq \frac{1}{(n + 1)^{|\mathcal{X}|}} 2^{-nD(P_n \| Q)}
\end{align*}
which implies that
\begin{align*}
\liminf_{n \to \infty} \frac{1}{n} \log Q^n(E) &\geq \liminf_{n \to \infty} \left( -\frac{|\mathcal{X}| \log(n+1)}{n} - D(P_n \| Q) \right) \\
&= \lim_{n \to \infty} (-D(P_n \| Q)) \\
&= -D(P^* \| Q)
\end{align*}
On the other hand, one can use $Q^n(E) \leq (n + 1)^{|\mathcal{X}|} 2^{-nD(P^* \| Q)}$ to prove that $\limsup_{n \to \infty} \frac{1}{n} \log Q^n(E) \leq -D(P^* \| Q)$, which implies $\lim_{n \to \infty} \frac{1}{n} \log Q^n(E) = -D(P^* \| Q)$.
Example 5.3 (Large deviation). Find an upper bound on $P(\frac{1}{n} \sum_{i=1}^n g_j(X_i) \geq \alpha_j,\ j = 1, 2, \cdots, k)$ for an i.i.d. random sequence $X_1, X_2, \cdots, X_n \sim Q(x)$.
Let $E = \{P : \sum_a P(a) g_j(a) \geq \alpha_j,\ j = 1, 2, \cdots, k\}$. Then
\begin{align*}
P\left( \frac{1}{n} \sum_{i=1}^n g_j(X_i) \geq \alpha_j,\ j = 1, 2, \cdots, k \right) &= \sum_{x^n: \frac{1}{n} \sum_{i=1}^n g_j(x_i) \geq \alpha_j,\ j = 1, 2, \cdots, k} Q^n(x^n) \\
&= \sum_{x^n: \sum_a P_{x^n}(a) g_j(a) \geq \alpha_j,\ j = 1, 2, \cdots, k} Q^n(x^n) \\
&= \sum_{x^n: P_{x^n} \in E \cap \mathcal{P}_n} Q^n(x^n) \\
&= Q^n(E)
\end{align*}
Using Lagrange multipliers, we can find
$$P^*(x) = \frac{Q(x) e^{\sum_i \lambda_i g_i(x)}}{\sum_{a \in \mathcal{X}} Q(a) e^{\sum_i \lambda_i g_i(a)}},$$
where the $\lambda_i$ are chosen so that $P^*$ satisfies the constraints. We then have an exponentially decreasing upper bound: $P(\frac{1}{n} \sum_{i=1}^n g_j(X_i) \geq \alpha_j,\ j = 1, 2, \cdots, k) \leq (n + 1)^{|\mathcal{X}|} 2^{-nD(P^* \| Q)}$.
Theorem 5.8. For a closed convex set $E$ of distributions and a distribution $Q \notin E$, let $P^* \in E$ be the distribution that achieves the minimum distance to $Q$, i.e. $D(P^* \| Q) = \min_{P \in E} D(P \| Q)$. Then $D(P \| Q) \geq D(P \| P^*) + D(P^* \| Q)$ for all $P \in E$.

Proof. For any $P \in E$, define $P_\lambda = \lambda P + (1 - \lambda) P^*$. Since $E$ is convex, $P_\lambda \in E$ for $0 \leq \lambda \leq 1$. Computing the derivative of $D(P_\lambda \| Q)$ with respect to $\lambda$, we have
\begin{align*}
\frac{d D(P_\lambda \| Q)}{d\lambda} &= \sum_x \left( (P(x) - P^*(x)) \log \frac{P_\lambda(x)}{Q(x)} + (P(x) - P^*(x)) \right) \\
&= \sum_x (P(x) - P^*(x)) \log \frac{P_\lambda(x)}{Q(x)}
\end{align*}
Since $P^*$ is the closest element of $E$ to $Q$ in relative entropy, $\frac{d D(P_\lambda \| Q)}{d\lambda}\big|_{\lambda=0} \geq 0$. Therefore,
\begin{align*}
D(P \| Q) - D(P \| P^*) - D(P^* \| Q) &= \sum_x P(x) \log \frac{P(x)}{Q(x)} - \sum_x P(x) \log \frac{P(x)}{P^*(x)} - \sum_x P^*(x) \log \frac{P^*(x)}{Q(x)} \\
&= \sum_x P(x) \log \frac{P^*(x)}{Q(x)} - \sum_x P^*(x) \log \frac{P^*(x)}{Q(x)} \\
&= \sum_x (P(x) - P^*(x)) \log \frac{P^*(x)}{Q(x)} \\
&= \frac{d D(P_\lambda \| Q)}{d\lambda}\bigg|_{\lambda=0} \\
&\geq 0
\end{align*}

Definition 5.3. The $L_1$-distance between any two discrete distributions $P_1$ and $P_2$ is defined as $\|P_1 - P_2\|_1 = \sum_{a \in \mathcal{X}} |P_1(a) - P_2(a)|$.
Let $A$ be the set on which $P_1(x) > P_2(x)$. Then
\begin{align*}
\|P_1 - P_2\|_1 &= \sum_{x \in \mathcal{X}} |P_1(x) - P_2(x)| \\
&= \sum_{x \in A} (P_1(x) - P_2(x)) + \sum_{x \in A^c} (P_2(x) - P_1(x)) \\
&= P_1(A) - P_2(A) + (1 - P_2(A)) - (1 - P_1(A)) \\
&= 2(P_1(A) - P_2(A))
\end{align*}
Therefore, $\max_{B \subset \mathcal{X}} (P_1(B) - P_2(B)) = P_1(A) - P_2(A) = \frac{\|P_1 - P_2\|_1}{2}$.

Lemma 5.1 (Pinsker's inequality). $D(P_1 \| P_2) \geq \frac{\|P_1 - P_2\|_1^2}{2 \ln 2}$.

Proof. One can show the binary case
$$p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} \geq \frac{2}{\ln 2} (p - q)^2$$
by showing that for each $q \in (0, 1)$, the function $g_q: (0, 1) \to \mathbb{R}$,
$$g_q(p) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} - \frac{2}{\ln 2} (p - q)^2,$$
is minimized when $p = q$. This can be done by computing $g_q'(p) = \frac{1}{\ln 2} \left( \ln \frac{p(1-q)}{q(1-p)} - 4(p - q) \right)$, which vanishes at $p = q$, and $g_q''(p) = \frac{1}{\ln 2} \left( \frac{1}{p} + \frac{1}{1-p} - 4 \right) \geq 0$ for all $p \in (0, 1)$, so $g_q$ is convex with minimum $g_q(q) = 0$.
For the general case, for two distributions $P_1$ and $P_2$, let $A = \{x \in \mathcal{X} : P_1(x) > P_2(x)\}$. Then
\begin{align*}
D(P_1 \| P_2) &= \sum_{x \in \mathcal{X}} P_1(x) \log \frac{P_1(x)}{P_2(x)} \\
&= \sum_{x \in A} P_1(x) \log \frac{P_1(x)}{P_2(x)} + \sum_{x \in A^c} P_1(x) \log \frac{P_1(x)}{P_2(x)} \\
&\geq P_1(A) \log \frac{P_1(A)}{P_2(A)} + (1 - P_1(A)) \log \frac{1 - P_1(A)}{1 - P_2(A)} \quad \text{(log-sum inequality)} \\
&\geq \frac{2}{\ln 2} (P_1(A) - P_2(A))^2 \quad \text{(binary case of Pinsker's inequality)} \\
&= \frac{1}{2 \ln 2} \|P_1 - P_2\|_1^2
\end{align*}

Theorem 5.9 (The Conditional Limit Theorem). Let $E$ be a closed convex set of distributions and let $Q$ be a distribution not in $E$. Let $X_1, X_2, \cdots, X_n, \cdots$ be i.i.d. $\sim Q(x)$ and let $P^*$ achieve $\min_{P \in E} D(P \| Q)$. Then $\Pr(X_1 = a \mid P_{X^n} \in E) \to P^*(a)$ in probability as $n \to \infty$.

Proof. Define $S_t := \{P : D(P \| Q) \leq t\}$ and $d^* = D(P^* \| Q) = \min_{P \in E} D(P \| Q)$. Now, for any $\delta > 0$, define $A = S_{d^* + 2\delta} \cap E$ and $B = E - A$. Then
\begin{align*}
Q^n(B) &= \sum_{P \in E \cap \mathcal{P}_n: D(P \| Q) > d^* + 2\delta} Q^n(T(P)) \\
&\leq \sum_{P \in E \cap \mathcal{P}_n: D(P \| Q) > d^* + 2\delta} 2^{-nD(P \| Q)} \\
&\leq (n + 1)^{|\mathcal{X}|} 2^{-n(d^* + 2\delta)}
\end{align*}
and
\begin{align*}
Q^n(A) &\geq Q^n(S_{d^* + \delta} \cap E) \\
&= \sum_{P \in E \cap \mathcal{P}_n: D(P \| Q) \leq d^* + \delta} Q^n(T(P)) \\
&\geq \sum_{P \in E \cap \mathcal{P}_n: D(P \| Q) \leq d^* + \delta} \frac{1}{(n + 1)^{|\mathcal{X}|}} 2^{-nD(P \| Q)} \\
&\geq \frac{1}{(n + 1)^{|\mathcal{X}|}} 2^{-n(d^* + \delta)} \quad \text{(for sufficiently large } n \text{, such a } P \text{ exists)}
\end{align*}
So for $n$ large enough,
\begin{align*}
\Pr(P_{X^n} \in B \mid P_{X^n} \in E) &= \frac{Q^n(B \cap E)}{Q^n(E)} \\
&\leq \frac{Q^n(B)}{Q^n(A)} \\
&\leq \frac{(n + 1)^{|\mathcal{X}|} 2^{-n(d^* + 2\delta)}}{(n + 1)^{-|\mathcal{X}|} 2^{-n(d^* + \delta)}} \\
&= (n + 1)^{2|\mathcal{X}|} 2^{-n\delta} \to 0 \quad (\#)
\end{align*}
So $\Pr(P_{X^n} \in A \mid P_{X^n} \in E) \to 1$ as $n \to \infty$. Note that $P_{X^n} \in A$ implies that $D(P_{X^n} \| Q) \leq d^* + 2\delta$.
By the Pythagorean Theorem (Theorem 5.8), $D(P_{X^n} \| P^*) + D(P^* \| Q) \leq D(P_{X^n} \| Q)$, which implies that $D(P_{X^n} \| P^*) \leq 2\delta$. This, together with $(\#)$, implies $\Pr(D(P_{X^n} \| P^*) \leq 2\delta \mid P_{X^n} \in E) \to 1$ as $n \to \infty$, which, by Pinsker's inequality, implies that for any $\epsilon > 0$ and any $a \in \mathcal{X}$, $\Pr(|P_{X^n}(a) - P^*(a)| > \epsilon \mid P_{X^n} \in E) \to 0$ as $n \to \infty$.
It then follows that $E[P_{X^n}(a) \mid P_{X^n} \in E] \to P^*(a)$ as $n \to \infty$. Then we have $E\left[ \frac{I(X_1) + \cdots + I(X_n)}{n} \,\Big|\, P_{X^n} \in E \right] \to P^*(a)$ as $n \to \infty$, where
$$I(X) = \begin{cases} 1 & \text{if } X = a \\ 0 & \text{if } X \neq a. \end{cases}$$
Noting that
\begin{align*}
E\left[ \frac{I(X_1) + \cdots + I(X_n)}{n} \,\Big|\, P_{X^n} \in E \right] &= E[I(X_1) \mid P_{X^n} \in E] \\
&= \Pr(X_1 = a \mid P_{X^n} \in E),
\end{align*}
we are done.


Example 5.4. If $X_i \sim Q$ i.i.d., then
\begin{align*}
\Pr\left( X_1 = a \,\Big|\, \frac{1}{n} \sum_{i=1}^n X_i^2 \geq \alpha \right) &= \Pr\left( X_1 = a \,\Big|\, \sum_a P_{X^n}(a) a^2 \geq \alpha \right) \\
&\to P^*(a)
\end{align*}
where $P^*$ minimizes $D(P \| Q)$ over all $P$ satisfying $\sum_a P(a) a^2 \geq \alpha$.

Hypothesis testing: Let X1 , X2 , · · · , Xn be i.i.d. ∼ Q(x), and we observe


x1 , x2 , · · · , xn . Consider two hypotheses:
(1) H1 : Q = P1
(2) H2 : Q = P2
Find a decision function g(x1 , · · · , xn ) s.t.
H1 is accepted if g(x1 , · · · , xn ) = 1 (this is equivalent to xn ∈ A);
H2 is accepted if g(x1 , · · · , xn ) = 2 (this is equivalent to xn ∈ Ac )
Define two probabilities of error:
α = Pr(xn ∈ Ac |H1 true) = P1n (Ac )
β = Pr(xn ∈ A|H2 true) = P2n (A)

Theorem 5.10 (Neyman-Pearson Lemma). Let $X_1, X_2, \cdots, X_n$ be i.i.d. $\sim Q(x)$. Consider the decision problem corresponding to the hypotheses $H_1: Q = P_1$ vs $H_2: Q = P_2$. For $T \geq 0$, define the acceptance region for $H_1$:
$$A_n(T) = \left\{ x^n : \frac{P_1(x_1, \cdots, x_n)}{P_2(x_1, \cdots, x_n)} > T \right\}.$$
Let $\alpha^* = P_1^n(A_n^c(T))$, $\beta^* = P_2^n(A_n(T))$, and let $B_n$ be any other acceptance region with associated probabilities of error $\alpha$ and $\beta$. If $\alpha \leq \alpha^*$, then $\beta \geq \beta^*$.

Proof. Let $\Phi_{A_n}$ and $\Phi_{B_n}$ be the indicator functions of the decision regions $A_n = A_n(T)$ and $B_n$, respectively. Then for any $(x_1, \cdots, x_n) \in \mathcal{X}^n$, $(\Phi_{A_n}(x^n) - \Phi_{B_n}(x^n))(P_1(x^n) - T P_2(x^n)) \geq 0$. It then follows that
\begin{align*}
0 &\leq \sum_{x^n} \left( \Phi_{A_n}(x^n) P_1(x^n) - T \Phi_{A_n}(x^n) P_2(x^n) - \Phi_{B_n}(x^n) P_1(x^n) + T \Phi_{B_n}(x^n) P_2(x^n) \right) \\
&= \sum_{A_n} (P_1 - T P_2) - \sum_{B_n} (P_1 - T P_2) \\
&= (1 - \alpha^*) - T \beta^* - (1 - \alpha) + T \beta \\
&= T(\beta - \beta^*) - (\alpha^* - \alpha)
\end{align*}
This implies the lemma.


Remark 5.1. The Neyman-Pearson Lemma implies that the likelihood ratio test is optimal.

Connection with Information Theory: We rewrite the log-likelihood ratio as
\begin{align*}
L(x_1, \cdots, x_n) &= \log \frac{P_1(x_1, \cdots, x_n)}{P_2(x_1, \cdots, x_n)} \\
&= \sum_{i=1}^n \log \frac{P_1(x_i)}{P_2(x_i)} \\
&= \sum_{a \in \mathcal{X}} n P_{X^n}(a) \log \frac{P_1(a)}{P_2(a)} \\
&= \sum_{a \in \mathcal{X}} n P_{X^n}(a) \log \frac{P_{X^n}(a)}{P_2(a)} - \sum_{a \in \mathcal{X}} n P_{X^n}(a) \log \frac{P_{X^n}(a)}{P_1(a)} \\
&= n D(P_{X^n} \| P_2) - n D(P_{X^n} \| P_1)
\end{align*}
Hence, $\frac{P_1(x_1, \cdots, x_n)}{P_2(x_1, \cdots, x_n)} > T \iff D(P_{X^n} \| P_2) - D(P_{X^n} \| P_1) > \frac{1}{n} \log T$.
Let $A_n$ denote the acceptance region for $H_1$, so $A_n = \{x^n : D(P_{X^n} \| P_2) - D(P_{X^n} \| P_1) > \frac{1}{n} \log T\}$. Then $\alpha_n = P_1^n(A_n^c)$ and $\beta_n = P_2^n(A_n)$. Now by Sanov's Theorem, $\alpha_n$ decays at the same exponential rate as $2^{-nD(P_1^* \| P_1)}$ and $\beta_n$ decays at the same exponential rate as $2^{-nD(P_2^* \| P_2)}$, where $P_1^*$ is the closest element of $A_n^c$ to the distribution $P_1$ while $P_2^*$ is the closest element of $A_n$ to the distribution $P_2$.
Using Lagrange multipliers, we can solve
$$P_1^*(x) = P_2^*(x) = \frac{P_1^\lambda(x) P_2^{1-\lambda}(x)}{\sum_{a \in \mathcal{X}} P_1^\lambda(a) P_2^{1-\lambda}(a)} := P_\lambda(x),$$
where $\lambda$ is chosen such that $D(P_\lambda \| P_1) - D(P_\lambda \| P_2) = \frac{\log T}{n}$.
Theorem 5.11 (Chernoff-Stein Lemma). Let $X_1, X_2, \cdots, X_n$ be i.i.d. $\sim Q$. Consider the hypothesis test between two alternatives, $H_1: Q = P_1$ vs $H_2: Q = P_2$, where $D(P_1 \| P_2) < \infty$. Let $A_n \subset \mathcal{X}^n$ be an acceptance region for $H_1$, and let the probabilities of error be $\alpha_n = P_1^n(A_n^c)$, $\beta_n = P_2^n(A_n)$. For $0 < \epsilon < \frac{1}{2}$, define $\beta_n^\epsilon = \min_{A_n \subset \mathcal{X}^n, \alpha_n < \epsilon} \beta_n$. Then $\lim_{n \to \infty} \frac{1}{n} \log \beta_n^\epsilon = -D(P_1 \| P_2)$.

Rough idea of the proof: First construct a sequence of acceptance regions $A_n \subset \mathcal{X}^n$ such that $\alpha_n < \epsilon$ and $\beta_n$ decays at the same rate as $2^{-nD(P_1 \| P_2)}$. Then show that no other sequence of tests has an asymptotically better exponent.

Proof. For some small $\delta$, define
$$A_n = \left\{ x^n \in \mathcal{X}^n : 2^{n(D(P_1 \| P_2) - \delta)} \leq \frac{P_1(x^n)}{P_2(x^n)} \leq 2^{n(D(P_1 \| P_2) + \delta)} \right\}.$$
By the law of large numbers, $\frac{1}{n} \sum_{i=1}^n \log \frac{P_1(X_i)}{P_2(X_i)} \to E_{P_1}\!\left[\log \frac{P_1(X)}{P_2(X)}\right] = D(P_1 \| P_2)$ in probability (with respect to $P_1$). Therefore, $P_1^n(A_n) \to 1$ as $n \to \infty$. So for large $n$, $\alpha_n = P_1^n(A_n^c) = 1 - P_1^n(A_n) < \epsilon$.
We have
\begin{align*}
\beta_n &= \sum_{x^n \in A_n} P_2(x^n) \\
&\leq \sum_{x^n \in A_n} P_1(x^n) 2^{-n(D(P_1 \| P_2) - \delta)} \\
&= 2^{-n(D(P_1 \| P_2) - \delta)} \sum_{x^n \in A_n} P_1(x^n) \\
&= 2^{-n(D(P_1 \| P_2) - \delta)} P_1^n(A_n) \\
&= 2^{-n(D(P_1 \| P_2) - \delta)} (1 - \alpha_n)
\end{align*}
Similarly, we can show that $\beta_n = P_2^n(A_n) \geq 2^{-n(D(P_1 \| P_2) + \delta)} (1 - \alpha_n)$.
Hence we get $\frac{1}{n} \log \beta_n \leq -D(P_1 \| P_2) + \delta + \frac{\log(1 - \alpha_n)}{n}$ and $\frac{1}{n} \log \beta_n \geq -D(P_1 \| P_2) - \delta + \frac{\log(1 - \alpha_n)}{n}$, which, since $\delta$ is arbitrary, implies that $\lim_{n \to \infty} \frac{1}{n} \log \beta_n = -D(P_1 \| P_2)$.

Lempel-Ziv coding (LZ78): a practical universal data compression scheme.

The algorithm of the compression scheme is as follows. First, the source sequence is sequentially parsed into phrases, each being the shortest string that has not appeared so far. Second, let $c(n)$ be the number of phrases after parsing the input $n$-sequence. We need $\log c(n)$ bits to describe the location of the prefix of each phrase and 1 bit to describe its last bit. For instance, 1011010100010 is parsed into 1, 0, 11, 01, 010, 00, 10 and is described by (000,1), (000,0), (001,1), (010,1), (100,0), (010,0), (001,0).
The decoding is obvious.
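A short Python sketch of our own illustrating the LZ78 parsing and the (prefix index, last bit) description; it reproduces the parse of the example string above. The function name lz78_parse is our own choice.

```python
def lz78_parse(s):
    """Return the LZ78 phrases and their (prefix index, last symbol) descriptions."""
    phrases = {}            # phrase -> index (1-based); index 0 means the empty prefix
    output = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in phrases:          # shortest string not seen so far
            phrases[current] = len(phrases) + 1
            output.append((phrases.get(current[:-1], 0), symbol))
            current = ""
    # (a leftover partial phrase at the end of s is ignored in this sketch)
    return list(phrases), output

phrases, desc = lz78_parse("1011010100010")
print(phrases)  # ['1', '0', '11', '01', '010', '00', '10']
print(desc)     # [(0,'1'), (0,'0'), (1,'1'), (2,'1'), (4,'0'), (2,'0'), (1,'0')]
```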

Lemma 5.2. The number of phrases $c(n)$ in the Lempel-Ziv parsing of a binary sequence $X_1, X_2, \cdots, X_n$ satisfies $c(n) \leq \frac{n}{(1 - \epsilon_n) \log n}$, where $\epsilon_n \to 0$ as $n \to \infty$.

The proof of this lemma consists of straightforward bounding arguments.

Lemma 5.3. Let $Z$ be a positive integer-valued random variable with mean $\mu$. Then the entropy $H(Z)$ is upper bounded by $(\mu + 1) \log(\mu + 1) - \mu \log \mu$.

Proof. See Theorem 12.1.1 of the textbook, where it is shown that $H(Z)$ is maximized when $Z$ is a geometric random variable.

Let $X_{-\infty}^{\infty}$ be a stationary ergodic process with p.m.f. $p$. For a fixed integer $k$, define the $k$-th order Markov approximation to $p$ as
$$Q_k(x_{-(k-1)}, \cdots, x_0, x_1, \cdots, x_n) := p(x_{-(k-1)}^0) \prod_{j=1}^n p(x_j \mid x_{j-k}^{j-1}).$$
Since $\{p(X_j \mid X_{j-k}^{j-1})\}_j$ is an ergodic process, we have
\begin{align*}
-\frac{1}{n} \log Q_k(X_1, X_2, \cdots, X_n \mid X_{-(k-1)}^0) &= -\frac{1}{n} \sum_{j=1}^n \log p(X_j \mid X_{j-k}^{j-1}) \\
&\xrightarrow{n \to \infty} E[-\log p(X_j \mid X_{j-k}^{j-1})] \\
&= H(X_j \mid X_{j-k}^{j-1}) \\
&\xrightarrow{k \to \infty} H(\mathcal{X})
\end{align*}
where $H(\mathcal{X})$ is the entropy rate of $X_{-\infty}^{\infty}$.
Suppose that $X_{-(k-1)}^0 = x_{-(k-1)}^0$, and suppose that $x_1^n$ is parsed into $c(n)$ distinct phrases $y_1, y_2, \cdots, y_{c(n)}$, where $y_i = x_{q_i}^{q_{i+1}-1}$. For each $i = 1, 2, \cdots, c(n)$, define $s_i = x_{q_i - k}^{q_i - 1}$. Note that $s_1 = x_{-(k-1)}^0$. In other words, $s_i$ is the $k$ bits of $x$ preceding $y_i$. Let $c_{ls}$ be the number of phrases $y_i$ with length $l$ and preceding state $s_i = s$. Then $\sum_{l,s} c_{ls} = c(n)$ and $\sum_{l,s} l c_{ls} = n$.
Lemma 5.4. For the Lempel-Ziv parsing of the sequence $x_1 x_2 \cdots x_n$, we have
$$\log Q_k(x_1, x_2, \cdots, x_n \mid s_1) \leq -\sum_{l,s} c_{ls} \log c_{ls}.$$

Proof. Note that
\begin{align*}
Q_k(x_1, x_2, \cdots, x_n \mid s_1) &= Q_k(y_1, y_2, \cdots, y_{c(n)} \mid s_1) \\
&= \prod_{i=1}^{c(n)} p(y_i \mid s_i)
\end{align*}
It then follows that
\begin{align*}
\log Q_k(x_1, x_2, \cdots, x_n \mid s_1) &= \sum_{i=1}^{c(n)} \log p(y_i \mid s_i) \\
&= \sum_{l,s} \sum_{i: |y_i| = l, s_i = s} \log p(y_i \mid s_i) \\
&= \sum_{l,s} c_{ls} \sum_{i: |y_i| = l, s_i = s} \frac{1}{c_{ls}} \log p(y_i \mid s_i) \\
&\leq \sum_{l,s} c_{ls} \log \Bigg( \sum_{i: |y_i| = l, s_i = s} \frac{1}{c_{ls}} p(y_i \mid s_i) \Bigg) \quad \text{(Jensen's inequality)} \\
&\leq \sum_{l,s} c_{ls} \log \frac{1}{c_{ls}}
\end{align*}
where the last step uses the fact that the $y_i$ are distinct, so $\sum_{i: |y_i| = l, s_i = s} p(y_i \mid s_i) \leq 1$.
Theorem 5.12. Let $\{X_n\}$ be a stationary ergodic process with entropy rate $H(\mathcal{X})$, and let $c(n)$ be the number of phrases in the Lempel-Ziv parsing of a sample of length $n$ from this process. Then
$$\limsup_{n \to \infty} \frac{c(n) \log c(n)}{n} \leq H(\mathcal{X}).$$

Proof. By Lemma 5.4, we have
\begin{align*}
\log Q_k(x_1, x_2, \cdots, x_n \mid s_1) &\leq -\sum_{l,s} c_{ls} \log c_{ls} \\
&= -\sum_{l,s} c_{ls} \log \frac{c_{ls}}{c(n)} c(n) \\
&= -c(n) \log c(n) - c(n) \sum_{l,s} \frac{c_{ls}}{c(n)} \log \frac{c_{ls}}{c(n)}
\end{align*}
Note that $\sum_{l,s} \frac{c_{ls}}{c(n)} = 1$ and $\sum_{l,s} \frac{l c_{ls}}{c(n)} = \frac{n}{c(n)}$.
Define random variables $U$, $V$ such that $\Pr(U = l, V = s) = \frac{c_{ls}}{c(n)}$. Then $E[U] = \frac{n}{c(n)}$ and
$$\frac{c(n)}{n} \log c(n) - \frac{c(n)}{n} H(U, V) \leq -\frac{1}{n} \log Q_k(x_1, x_2, \cdots, x_n \mid s_1).$$
Now,
\begin{align*}
H(U, V) &\leq H(U) + H(V) \\
&\leq H(U) + \log |\mathcal{X}|^k \\
&\leq H(U) + k \\
&\leq (E[U] + 1) \log(E[U] + 1) - E[U] \log E[U] + k \\
&= \left( \frac{n}{c(n)} + 1 \right) \log \left( \frac{n}{c(n)} + 1 \right) - \frac{n}{c(n)} \log \frac{n}{c(n)} + k
\end{align*}
So
$$\frac{c(n)}{n} H(U, V) \leq \frac{c(n)}{n} \log \frac{n}{c(n)} + \left( 1 + \frac{c(n)}{n} \right) \log \left( 1 + \frac{c(n)}{n} \right) + \frac{c(n)}{n} k \to 0 \quad \text{as } n \to \infty,$$
where we have used $c(n) \leq \frac{n}{\log n}(1 + o(1))$. So we have
$$\frac{c(n) \log c(n)}{n} \leq -\frac{1}{n} \log Q_k(x_1, x_2, \cdots, x_n \mid s_1) + \epsilon_k(n)$$
where $\epsilon_k(n) \to 0$ as $n \to \infty$. So
$$\limsup_{n \to \infty} \frac{c(n) \log c(n)}{n} \leq \limsup_{n \to \infty} \left( -\frac{1}{n} \log Q_k(X_1, \cdots, X_n \mid X_{-(k-1)}^0) \right) = H(X_0 \mid X_{-k}^{-1}).$$
Therefore,
$$\limsup_{n \to \infty} \frac{c(n) \log c(n)}{n} \leq \lim_{k \to \infty} H(X_0 \mid X_{-k}^{-1}) = H(\mathcal{X}).$$

Theorem 5.13. Let $X = X_{-\infty}^{\infty}$ be a stationary ergodic process, and let $l(x_1, \cdots, x_n)$ be the Lempel-Ziv codeword length associated with $x_1, \cdots, x_n$. Then
$$\limsup_{n \to \infty} \frac{1}{n} l(x_1, \cdots, x_n) \leq H(\mathcal{X}),$$
where $H(\mathcal{X})$ is the entropy rate of $X$.

Proof. Since $l(x_1, \cdots, x_n) \leq c(n)(\log c(n) + 1)$, we have
$$\limsup_{n \to \infty} \frac{1}{n} l(x_1, \cdots, x_n) \leq \limsup_{n \to \infty} \left( \frac{c(n) \log c(n)}{n} + \frac{c(n)}{n} \right) \leq H(\mathcal{X}).$$
