Shannon's Theorems: Math and Science Summer Program 2020
Shannon’s theorems
Contents
Notation
1 Shannon's entropy and source coding
2 Universal source coding
3 Communication channel
3.1 Conditional entropy
3.2 Mutual information
3.3 Information channel and product form
4 Shannon's second theorem
Notation
From now on, by a set we mean a discrete set (i.e. a set that is either finite or countably infinite), unless otherwise mentioned. The collection 2^X of all subsets of a given (discrete) set X forms a σ-algebra.
A probability measure on X is simply a function p : X → [0, 1] satisfying ∑_{x∈X} p(x) = 1. If A is any subset of X, we write p(A) to indicate the sum ∑_{x∈A} p(x) (and hence a non-negative function p on X is a probability measure iff p(X) = 1).
Whenever p is a probability measure on X and X is a random variable whose value is in X, we
write X ∼ p if the distribution PX of X coincides with p, i.e. if
∀x ∈ X, P(X = x) = p(x).
1 Shannon’s entropy and source coding
Remark. 1. The base D of the logarithm is usually omitted when it is understood. One often takes D = 2, which yields the binary entropy H_2.
2. We implicitly used the convention 0 · log 0 := 0 in Definition 1.1, obtained by continuously extending the function x ↦ x log x at x = 0.
Example 1.2. Let X = {0, 1}, p ∈ [0, 1] and X ∼ Ber(p), that is, P(X = 1) = p and P(X = 0) = 1 − p. The entropy of X,
H(X) = −p log p − (1 − p) log(1 − p),
is denoted h(p).
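The following small Python sketch (not part of the original notes; the function name binary_entropy is ours) evaluates h(p), using the convention 0 · log 0 = 0 from the Remark above.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Ber(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0          # the convention 0 * log 0 = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0, the maximum
print(binary_entropy(0.11))  # roughly 0.5
```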
Proposition 1.3. Let X be a random variable with value in X. Its entropy H (X ) enjoys the
following properties.
1. H(X) ≥ 0 and equality holds iff X is (almost surely) deterministic.
2. Let f : X → Y be a deterministic function. One has H(X) ≥ H(f(X)) and equality holds iff f is injective.
Proof. Exercise.
Theorem 1.4 (Gibbs' inequality). If p and q are two probability measures on X, there holds
H(p) ≤ − ∑_{x∈X} p(x) log q(x). (1.1)
The right hand side of (1.1) is called the cross entropy between p and q, denoted H (p; q).
Proof. Exercise.
Corollary 1.5. When X = {1, . . . , n}, the uniform distribution U(n) on X maximizes the
entropy.
Corollary 1.6. Let X = N∗ and let X be a random variable taking value in N∗. Let µ ≥ 1 be given. If EX = µ, the entropy of X is maximized when X has the geometric distribution Geom(1/µ).
The inequality (1.1) also implies the following
Proposition 1.7. Let X = (X_1, . . . , X_n) be a random vector with value in X = X_1 × · · · × X_n. One has the inequality
H(X) ≤ H(X_1) + · · · + H(X_n). (1.2)
Moreover the equality holds iff X_1, . . . , X_n are independent.
Proof. Exercise.
Exercises
Exercise 1.1 (Entropy of a homogeneous Markov chain). Let (X_n)_{n∈N∗} be random variables with value in X such that the following conditions hold.
1. For each n ∈ N∗, X_{n+1} and (X_1, . . . , X_{n−1}) are independent given X_n.
2. There exists a right stochastic matrix P = [p_{x,y}]_{x,y∈X} such that P(X_{n+1} = y | X_n = x) = p_{x,y} for all n ∈ N∗ and all x, y ∈ X.
In this section, let |X| = D ∈ N∗ \ {1} and let p be a probability measure on X. Let X = (X_1, . . . , X_n) be a vector of i.i.d. random variables with value in X and common distribution p, so that X takes value in X^n.
Definition 1.8. Let ε > 0. A realization x = (x_1, . . . , x_n) ∈ X^n of X is called ε-typical if
| −(1/n) ∑_{i=1}^n log_D p(x_i) − H_D(p) | ≤ ε.
We denote by A_ε^{(n)} the subset of ε-typical vectors in X^n. Asymptotically, typical vectors concentrate probability, that is,
lim_{n→∞} P(X ∈ A_ε^{(n)}) = 1. (1.3)
Hence
1 ≥ P(X ∈ A_ε^{(n)}) = ∑_{x∈A_ε^{(n)}} P(X = x) ≥ |A_ε^{(n)}| · D^{−n(H_D(p)+ε)},
so that |A_ε^{(n)}| ≤ D^{n(H_D(p)+ε)}.
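A hedged numerical illustration of (1.3), not part of the original notes: for i.i.d. symbols drawn from a fixed p, the quantity −(1/n) ∑_i log_D p(x_i) concentrates around H_D(p), so the fraction of ε-typical samples approaches 1. All helper names below are ours.

```python
import math, random

def H(p, D=2):
    """Entropy of the pmf p (a dict symbol -> probability), base D."""
    return -sum(q * math.log(q, D) for q in p.values() if q > 0)

def is_typical(x, p, eps, D=2):
    """Is the word x eps-typical for the source p (Definition 1.8)?"""
    val = -sum(math.log(p[s], D) for s in x) / len(x)
    return abs(val - H(p, D)) <= eps

p = {"a": 0.7, "b": 0.2, "c": 0.1}   # a toy source on X = {a, b, c}
eps, trials = 0.05, 2000
for n in (10, 100, 1000):
    hits = sum(
        is_typical(random.choices(list(p), weights=p.values(), k=n), p, eps)
        for _ in range(trials)
    )
    print(n, hits / trials)   # the fraction grows towards 1 as n increases
```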
The next proposition shows that there are no 'smaller' sets that (asymptotically) concentrate probability.
Proposition 1.10. Let R > 0 and suppose that for each n ∈ N∗ there exists a subset B^{(n)} ⊆ X^n of cardinality at most D^{nR} such that
lim sup_{n→∞} P(X ∈ B^{(n)}) = 1.
Then R ≥ H_D(p).
Proof. Let ε > 0. By similar arguments to those in the proof of Proposition 1.9, if x = (x_1, . . . , x_n) is ε-typical, we have P(X = x) ≤ D^{−n(H_D(p)−ε)}. Consequently,
P(X ∈ A_ε^{(n)} ∩ B^{(n)}) = ∑_{x∈A_ε^{(n)}∩B^{(n)}} P(X = x) ≤ |B^{(n)}| · D^{−n(H_D(p)−ε)} ≤ D^{−n(H_D(p)−R−ε)}.
Let n_1 < n_2 < · · · be a strictly increasing sequence of positive integers such that lim_{k→∞} P(X ∈ B^{(n_k)}) = 1. Then P(X ∈ A_ε^{(n_k)} ∩ B^{(n_k)}) → 1 as k → ∞, and it follows that the term D^{−n_k(H_D(p)−R−ε)} remains bounded away from 0, which implies H_D(p) − R − ε ≤ 0. Letting ε ↓ 0 yields the result.
Our goal is to encode the source message X (consisting of n i.i.d. symbols) using a (hopefully smaller) number of symbols than n. The question is: at what compression rate can one correctly decode the encoded message (at least with high probability)?
Definition 1.11. Let Y be a finite set, which we will call an alphabet. By an encoding/decoding scheme, we mean the following data for each n ∈ N∗:
1. a positive integer n_0 and an encoding function c_n : X^n → Y^{n_0},
2. a decoding function ĉ_n : Y^{n_0} → X^n.
The compression rate of the given encoding/decoding scheme is R_n := n_0/n. Let Y := c_n(X) be the encoded message and X̂ := ĉ_n(Y) be the decoded message. We will now concern ourselves with the error probability of the scheme,
P_e^{(n)} := P(X̂ ≠ X),
given the compression rate. For simplicity, let us take Y = X. The general case is left as an exercise.
Theorem 1.12 (Shannon's first theorem / Noiseless coding theorem).
1. For any R > H_D(p), there exist encoding functions c_n : X^n → X^{⌈nR⌉} and decoding functions ĉ_n : X^{⌈nR⌉} → X^n such that lim_{n→∞} P_e^{(n)} = 0.
2. For any R < H_D(p) and any encoding/decoding scheme c_n : X^n → X^{⌊nR⌋}, ĉ_n : X^{⌊nR⌋} → X^n, we have lim inf_{n→∞} P_e^{(n)} > 0.
Proof. 1. Let ε ∈ (0, R − H_D(p)). By Proposition 1.9, for each n ∈ N∗, we can find an injection f_n : A_ε^{(n)} → X^{⌈nR⌉} and an element x∗ ∈ X^{⌈nR⌉} \ f_n(A_ε^{(n)}). Consider the following encoding/decoding scheme:
∀x ∈ X^n, c_n(x) := f_n(x) if x ∈ A_ε^{(n)}, and c_n(x) := x∗ otherwise;
∀y ∈ X^{⌈nR⌉}, ĉ_n(y) := f_n^{−1}(y) if y ∈ f_n(A_ε^{(n)}), and ĉ_n(y) is arbitrary otherwise.
It is clear that ĉ_n(c_n(x)) = x for all x ∈ A_ε^{(n)}. Consequently, P_e^{(n)} ≤ P(X ∉ A_ε^{(n)}), which converges to 0 as n tends to ∞ by (1.3).
2. We define for each n ∈ N∗,
B^{(n)} := {x ∈ X^n | ĉ_n(c_n(x)) = x}.
Since c_n is injective on B^{(n)}, this set has cardinality at most D^{⌊nR⌋} ≤ D^{nR}, and P(X ∈ B^{(n)}) = 1 − P_e^{(n)}. If we had lim inf_{n→∞} P_e^{(n)} = 0, then lim sup_{n→∞} P(X ∈ B^{(n)}) = 1, and Proposition 1.10 would give R ≥ H_D(p), contradicting R < H_D(p). Hence lim inf_{n→∞} P_e^{(n)} > 0.
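To make the typical-set scheme from part 1 of the proof concrete, here is a hedged Python sketch for a small binary source (so D = 2): it enumerates A_ε^{(n)} by brute force, encodes a typical word by its index (standing for its ⌈nR⌉-bit binary expansion, i.e. the injection f_n) and sends a reserved codeword otherwise. All names and the brute-force enumeration are ours; the code only illustrates the construction and is not meant to be efficient.

```python
import itertools, math

def typical_set(alphabet, p, n, eps, D):
    """All eps-typical words of length n for the source p (Definition 1.8)."""
    Hp = -sum(p[s] * math.log(p[s], D) for s in alphabet)
    A = []
    for x in itertools.product(alphabet, repeat=n):
        val = -sum(math.log(p[s], D) for s in x) / n
        if abs(val - Hp) <= eps:
            A.append(x)
    return A

alphabet, p, D = ("a", "b"), {"a": 0.8, "b": 0.2}, 2
n, R, eps = 8, 0.95, 0.15          # R > H_2(p) ~= 0.722
A = typical_set(alphabet, p, n, eps, D)
m = math.ceil(n * R)               # codewords use ceil(nR) D-ary symbols
assert len(A) < D ** m             # room for an extra reserved codeword

encode = {x: i for i, x in enumerate(A)}   # the injection f_n
decode = {i: x for x, i in encode.items()}
reserved = len(A)                          # plays the role of x*

def c_n(x):                        # encoder from part 1 of the proof
    return encode.get(tuple(x), reserved)

def c_hat_n(i):                    # decoder; arbitrary outside f_n(A)
    return decode.get(i, A[0])

x = ("a",) * 6 + ("b",) * 2
print(c_hat_n(c_n(x)) == x)        # True: typical words are recovered exactly
```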
Exercises
Exercise 1.2. State and prove Theorem 1.12 in the general case (where Y is not necessarily
equal to X).
2 Universal source coding
This chapter deals with an extension of Theorem 1.12 to the case where the distribution of the source symbols is unknown. In other words, we will show that for any coding rate R > 0, there is an asymptotically error-free encoding/decoding scheme that is universal for all probability distributions of i.i.d. source symbols whose entropy is smaller than R.
Let X be a set of cardinality D ∈ N∗ \ {1}.
The base of the logarithm in Definition 2.2 can be any real number greater than 1. The conventions used here are 0 · log(0/y) = 0 for y ≥ 0 and x · log(x/0) = +∞ for x > 0.
By inequality (1.1), the Kullback–Leibler divergence D(p; q) := ∑_{x∈X} p(x) log(p(x)/q(x)) of Definition 2.2 satisfies D(p; q) ≥ 0, and equality holds iff p = q (be warned, the Kullback–Leibler divergence is not symmetric!).
Lemma 2.3. Let p be a probability measure on X. We have
1. for any vector x ∈ X^n, p^{⊗n}(x) = 2^{−n(H_2(p_x)+D_2(p_x;p))},
2. for any empirical distribution q ∈ P_n, p^{⊗n}(T_n(q)) ≤ 2^{−nD_2(q;p)}.
Proof. Exercise. For the inequality, one can make use of (2.1).
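A hedged numerical check of item 1 of Lemma 2.3, assuming the usual definitions of the empirical distribution p_x of a word x and of the divergence D_2 (from the omitted part of this chapter); the helper names below are ours.

```python
import math
from collections import Counter

def H2(q):
    """Entropy of the pmf q in bits."""
    return -sum(v * math.log2(v) for v in q.values() if v > 0)

def D2(q, p):
    """Kullback-Leibler divergence in bits (assumes supp q is contained in supp p)."""
    return sum(v * math.log2(v / p[s]) for s, v in q.items() if v > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
x = "aaaaaabbcc"                                   # a word of length n = 10
n = len(x)
p_x = {s: c / n for s, c in Counter(x).items()}    # empirical distribution of x

lhs = math.prod(p[s] for s in x)                   # p^{(x) n}(x), the product probability
rhs = 2 ** (-n * (H2(p_x) + D2(p_x, p)))
print(lhs, rhs)        # the two numbers agree up to rounding error
```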
Exercises
Exercise 2.4. Show that for a random vector (X, Y) with finite entropies,
H(X) + H(Y) − H(X, Y) = D(P_{(X,Y)}; P_X ⊗ P_Y).
We are now ready to state and prove an extension of Theorem 1.12. For simplicity, let us take
the binary encoding alphabet {0, 1}.
Theorem 2.4 (Universal source coding). Let R > 0. There exist functions c_n : X^n → {0, 1}^{⌈nR⌉}, ĉ_n : {0, 1}^{⌈nR⌉} → X^n such that for any probability measure p on X satisfying H_2(p) < R and any sequence X_1, X_2, . . . of i.i.d. random variables with value in X and distribution p,
lim_{n→∞} P_e^{(n)} = 0.
Proof. The proof of Theorem 1.12 can be mimicked, provided that we find subsets A^{(n)} ⊆ X^n (depending only on R) of cardinality strictly less than 2^{nR} such that for any probability measure p on X with entropy less than R, we have
lim_{n→∞} p^{⊗n}(A^{(n)}) = 1. (2.2)
To do this, take
A^{(n)} := {a ∈ X^n | H_2(p_a) ≤ R_n},
where
R_n := R − D log_2(n + 2)/n.
Then
|A^{(n)}| ≤ ∑_{q∈P_n, H_2(q)≤R_n} 2^{nH_2(q)}   (by (2.1))
≤ |P_n| · 2^{nR_n}
≤ (n + 1)^D · 2^{nR}/(n + 2)^D
< 2^{nR}.
It suffices to show that (2.2) holds for any probability measure p on X such that H_2(p) < R. Take ε ∈ (0, R − H_2(p)). Let P denote the set of all probability measures on X and let q∗ minimize D(q; p) over {q ∈ P | H_2(q) ≥ H_2(p) + ε} (Exercise: argue that q∗ exists and that D(q∗; p) > 0). For large n, one has R_n ≥ H_2(p) + ε, thus D(q; p) ≥ D(q∗; p) for all q ∈ P with H_2(q) ≥ R_n. Observe that
p^{⊗n}(X^n \ A^{(n)}) = ∑_{q∈P_n, H_2(q)>R_n} p^{⊗n}(T_n(q))
≤ ∑_{q∈P_n, H_2(q)>R_n} 2^{−nD_2(q;p)}   (by Lemma 2.3)
≤ |P_n| · 2^{−nD_2(q∗;p)} ≤ (n + 1)^D · 2^{−nD_2(q∗;p)}.
The last term converges to 0 as n → ∞ since D(q∗; p) > 0. This finishes the proof.
Exercises
Exercise 2.5 (Bernoulli source-symbols). For a vector x ∈ {0, 1}^n, let K(x) denote its number of 1's. Let B(x) be the lexicographical rank of x among all vectors of {0, 1}^n with exactly K(x) 1's. We then have a bijection
c_n′ : {0, 1}^n → {(k, b) | k ∈ {0, . . . , n}, b ∈ {1, . . . , (n choose k)}}, c_n′(x) := (K(x), B(x)).
Consider the encoding function c_n on {0, 1}^n defined as follows: c_n(x) is the concatenation of the binary representations of K(x) and B(x). We shall study the asymptotic length of c_n(x).
(see the definition of h_2(p) in Example 1.2). Moreover, show that if 12np(1 − p) ≥ 9,
(n choose np) · 2^{−nh_2(p)} ≥ 1/√(8np(1 − p)).
2. Let x_1, x_2, . . . ∈ {0, 1} be such that (1/n)K(x_1, . . . , x_n) → p ∈ (0, 1) as n → ∞. Let |·| denote the length of a binary sequence. Show that
lim_{n→∞} (1/n)|c_n(x_1, . . . , x_n)| = h_2(p).
See Section 14.1.1 (A First Example) in Pierre Brémaud, Discrete Probability Models and Methods, Springer, 2017, for universal source coding of binary sequences.
3 Communication channel
In this chapter, X, Y and Z are always finite sets. The base of the logarithm is always 2 unless otherwise specified. To simplify notation, we use the convention P(A|B) = 0 when P(B) = 0, where A and B are two events.
2.
H(X | Y = y) ≥ 0 (3.2)
for all y ∈ Y. Moreover, H(X | Y) ≥ 0 and equality holds iff X = f(Y), where f : Y → X is a deterministic function.
3. (Chain rule)
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y). (3.3)
4. If H(Y) < ∞, then
H(X | Y) ≤ H(X) (3.4)
and equality holds iff X ⊥ Y.
H(Y | f(X)) ≥ H(Y | X) (3.5)
and
∀i ∈ {1, . . . , n}, H(X_i | Y) ≤ H(X_1, . . . , X_n | Y). (3.9)
where
H(X | Y, Z = z) := − ∑_{x∈X, y∈Y} P(X = x, Y = y | Z = z) log P(X = x | Y = y, Z = z).
2. (Chain rule)
H(X, Y | Z) = H(X | Z) + H(Y | X, Z) = H(Y | Z) + H(X | Y, Z). (3.11)
3. If H(Y | Z) < ∞, then
H(X | Y, Z) ≤ H(X | Y). (3.12)
Exercises
Exercise 3.1. Show the sequential conditional chain rules: If (X 1 , . . . , X n , Y ) is a random vector
with value in X1 × · · · × Xn × Y , we have
H(X_1, . . . , X_n) = ∑_{i=1}^n H(X_i | X_1, . . . , X_{i−1}) (3.13)
and
H(X_1, . . . , X_n | Y) = ∑_{i=1}^n H(X_i | X_1, . . . , X_{i−1}, Y). (3.14)
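As a sanity check (not a proof), identity (3.13) can be verified numerically on a small joint distribution. The following hedged Python sketch, with names of our choosing, compares H(X_1, X_2, X_3) with the sum of conditional entropies computed directly from the joint pmf.

```python
import itertools, math

# A toy joint pmf for (X1, X2, X3) on {0, 1}^3, normalized to sum to 1.
weights = {xs: 1 + xs[0] + 2 * xs[1] * xs[2]
           for xs in itertools.product((0, 1), repeat=3)}
Z = sum(weights.values())
pmf = {xs: w / Z for xs, w in weights.items()}

def cond_H(i):
    """H(X_{i+1} | X_1, ..., X_i), computed directly from the joint pmf."""
    total = 0.0
    for xs, pr in pmf.items():
        p_past = sum(q for ys, q in pmf.items() if ys[:i] == xs[:i])
        p_cur = sum(q for ys, q in pmf.items() if ys[:i + 1] == xs[:i + 1])
        total += -pr * math.log2(p_cur / p_past)
    return total

lhs = -sum(pr * math.log2(pr) for pr in pmf.values())   # H(X1, X2, X3)
rhs = sum(cond_H(i) for i in range(3))                  # sum of H(Xi | past)
print(lhs, rhs)   # the two values coincide, as (3.13) predicts
```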
Definition 3.4. For a random vector (X, Y) with value in X × Y, we define the mutual information between X and Y as
I(X; Y) := H(X) + H(Y) − H(X, Y).
Exercise 2.4 says that I(X; Y) = D(P_{(X,Y)}; P_X ⊗ P_Y). The chain rule (3.3) says that
I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = I(Y; X).
We state without proof some basic properties of mutual information.
Proposition 3.5. Let (X , Y ) be a random vector with value in X × Y. Suppose that H (X ) and
H (Y ) are finite. Then we have the following results.
1.
I(X; X) = H(X). (3.15)
2.
I(X; Y) ≥ 0, (3.16)
with equality iff X ⊥ Y.
3.
I(X; Y) ≤ I(Y; Y) = H(Y), (3.17)
with equality iff Y is a deterministic function of X.
4. For any function f : X → Y,
I(f(X); Y) ≤ I(X; Y), (3.18)
with equality iff Y ⊥ X given f(X).
5. (Data processing inequality) If Z is a random variable with value in Z and finite entropy, then
I(X, Y; Z) ≥ I(Y; Z), (3.19)
with equality iff Z ⊥ X given Y.
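The following hedged Python sketch illustrates Definition 3.4 and properties (3.15)-(3.16) on toy joint distributions; mutual_information and the example pmfs are our own choices.

```python
import math

def H(pmf):
    """Entropy (in bits) of a pmf given as a dict value -> probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint pmf {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return H(px) + H(py) - H(joint)

# A toy joint distribution on {0, 1} x {0, 1}: dependent, so I(X; Y) > 0.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.15, (1, 1): 0.35}
print(mutual_information(joint))              # non-negative, property (3.16)

# I(X; X) = H(X): put all mass on the diagonal, property (3.15).
px = {0: 0.55, 1: 0.45}
diag = {(x, x): p for x, p in px.items()}
print(mutual_information(diag), H(px))        # the two values agree
```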
Exercises
Exercise 3.2. Let (X_1, . . . , X_n) be a Markov chain with finite entropies and let 1 ≤ k ≤ ℓ ≤ n. Show that
I(X_1; X_n) ≤ I(X_k; X_ℓ). (3.20)
Exercise 3.3. If X 1 , X 2 , Y1 , Y2 are random variables with finite entropy such that (X 1 , X 2 ) and
(Y1 , Y2 ) are independent, then
I (X 1 , Y1 ; X 2 , Y2 ) = I (X 1 ; X 2 ) + I (Y1 ; Y2 ). (3.21)
Exercise 3.4. Define the conditional version of mutual information, state and prove its basic
properties.
Exercise 3.5. Show Kolmogorov's formula: if X, Y and Z have finite entropy, then
I (X ; Y , Z ) = I (X ; Z ) + I (X ; Y |Z ). (3.22)
Exercise 3.6 (yet another chain rule). 1. Let X, Y, Z and U be random variables with finite entropy and with ranges X, Y, Z and U respectively. We have
I (X ; Y , Z |U ) = I (X ; Z |U ) + I (X ; Y |Z , U ). (3.23)
Having studied the concepts of mutual information, we are now ready to define an information
channel and its capacity.
Definition 3.6. By an information channel with input alphabet X and output alphabet Y, we mean a family of non-negative functions
κ_n : Y^n × X^n → [0, 1], (y, x) ↦ κ_n(y | x),
where n ∈ N∗ (called the probability kernels of the channel), meaning that for each x ∈ X^n, the function κ_n(· | x) is a probability measure on Y^n.
For n ∈ N∗, the channel takes a random vector X = (X_1, . . . , X_n) with value in X^n as input (the emitted word) and produces a random vector Y = (Y_1, . . . , Y_n) with value in Y^n (the received word) in such a way that
P(Y_1 = y_1, . . . , Y_n = y_n | X_1 = x_1, . . . , X_n = x_n) = κ_n(y_1, . . . , y_n | x_1, . . . , x_n)
for all (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n.
Proof. Exercise.
Proposition 3.9. If a channel is memoryless and without feedback, then for any n ≥ 2, any distribution of X, any j ∈ {2, . . . , n} and any (x_1, . . . , x_j, y_1, . . . , y_j) ∈ X^j × Y^j,
P(Y_1 = y_1, . . . , Y_j = y_j | X_1 = x_1, . . . , X_j = x_j) = ∏_{k=1}^j P(Y_k = y_k | X_k = x_k).
Proof. Exercise.
Proposition 3.10. If for any n ≥ 2, any distribution of X and any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n, we have
P(Y_1 = y_1, . . . , Y_n = y_n | X_1 = x_1, . . . , X_n = x_n) = ∏_{k=1}^n P(Y_k = y_k | X_k = x_k), (3.26)
Proof. Exercise.
Proposition 3.11. For a given channel, the following are equivalent.
1. The channel is memoryless and without feedback.
2. (3.26) holds for any n ≥ 2, any distribution of X and any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n.
3. The channel is memoryless and without anticipation.
Proof. Exercise.
The matrix [κ(y|x)]_{x∈X, y∈Y} is called the transition matrix of the channel.
Proposition 3.13. A channel is memoryless, without feedback and time-invariant with transition matrix [κ(y|x)]_{x∈X, y∈Y} iff for any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n,
κ_n(y_1, . . . , y_n | x_1, . . . , x_n) = ∏_{i=1}^n κ(y_i | x_i).
Proof. Exercise.
For this reason, a channel that is memoryless, without feedback and time-invariant is said to have (homogeneous) product form.
Let X be a generic input symbol and Y the output. Then the distribution of (X , Y ) is a function
of PX and the probability kernel κ.
Definition 3.14. The capacity of a channel from X to Y in product form with transition matrix κ is the following supremum, taken over all probability measures P_X on X:
C_κ := sup_{P_X} I(X; Y).
Example 3.15 (Binary symmetric channel). Let X = Y = {0, 1} and consider the channel in
product form from X to Y such that any bit is transmitted incorrectly with probability p ∈ [0, 1].
In other words, the transition matrix is
κ = [ 1 − p      p
        p      1 − p ].
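A hedged numerical sketch for this example (not part of the original notes): since I(X; Y) is a concave function of P_X (Exercise 3.8), a crude sweep over input laws for the binary symmetric channel locates the supremum; it is attained at the uniform input and equals 1 − h(p), the classical BSC capacity. The code and its names are ours.

```python
import math

def h(p):
    """Binary entropy h(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(px, kappa):
    """I(X; Y) for an input pmf px over {0, 1} and transition matrix kappa[x][y]."""
    py = [sum(px[x] * kappa[x][y] for x in range(2)) for y in range(2)]
    i = 0.0
    for x in range(2):
        for y in range(2):
            pxy = px[x] * kappa[x][y]
            if pxy > 0 and py[y] > 0:
                i += pxy * math.log2(pxy / (px[x] * py[y]))
    return i

p = 0.1                                      # crossover probability
kappa = [[1 - p, p], [p, 1 - p]]             # the BSC transition matrix
best = max(mutual_information([q, 1 - q], kappa)
           for q in (k / 1000 for k in range(1001)))   # crude sweep over inputs
print(best, 1 - h(p))                        # both values are close to C = 1 - h(p)
```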
Exercises
Exercise 3.8. Show that for a channel in product form, I (X ; Y ) is a continuous and concave
function of PX . Moreover, concavity is strict iff the map PX 7→ PY is injective. Deduce that the
supremum I (X ; Y ) is achieved (i.e. it is a maximum). When is the maximizer unique?
Exercise 3.9. Show that for fixed PX , I (X ; Y ) is a convex function of the probability kernel.
Exercise 3.10 (Binary erasure channel). We take the input alphabet X = {0, 1} and the output alphabet Y = {0, 1, e}, where e stands for an erasure. Consider the channel with transition matrix
κ = [ 1 − p     0      p
        0     1 − p    p ].
Let (X 1 , X 2 ) be a random input of the channel and (Y1 , Y2 ) the corresponding output.
1. Show that for (x_1, x_2, y_1, y_2) ∈ X^2 × Y^2,
P(Y1 = y1 |X 1 = x 1 , X 2 = x 2 , Y2 = y2 ) = κ 1 (y1 |x 1 ).
Y1 := X 1 ⊕ Z 1 , Y2 := X 2 .
Calculate the capacity of this channel. Show that the capacity-achieving distribution of output
is unique, but that of input is not.
Exercise 3.14 (Symmetric channel). A channel from X to Y is said to be symmetric if the rows of its transition matrix are permutations of each other, and so are the columns.
1. Let q be the probability measure on Y defined by the first row of the transition matrix. Show that the capacity of the channel is
C = log |Y| − H(q),
and that it is achieved by the uniform input distribution.
2. Show that the above result holds for weakly symmetric channels (i.e. channels where the rows of the transition matrix are permutations of each other, and the columns have equal sums).
3. Let L > 2 be an integer and X = Y = {0, 1, . . . , L − 1}. Consider the channel where the
output Y is related to the input X by Y := Z ⊕ X (mod L) where Z is a random variable
with value in {0, 1, . . . , L − 1}, independent of X . Find the capacity of this channel.
Exercise 3.15 (Asymmetric erasure channel). Find the capacity of the channel with transition
matrix
[ 2/3 − α      α      1/3
     α      2/3 − α   1/3 ].
4 Shannon’s second theorem
In this chapter, we use the same notation as in the previous one. All information channels have product form. The base of logarithms is always 2.
We additionally use finite sets M_n = {1, . . . , M_n} (n ∈ N∗), whose elements we call messages.
An encoding/decoding scheme using a channel with kernel κ consists of
1. encoding functions c n : Mn → X n for n ∈ N∗ (i.e. the message set depends on n),
2. decoding functions ĉ n : Y n → Mn for n ∈ N∗ .
We use ω to denote a generic message in M_n. The n sent symbols are those of the vector x := c_n(ω). The n received symbols (which are random because of the noisiness of the channel) are those of Y (the distribution of Y is κ^{⊗n}(· | x)). The message estimate is Ŵ := ĉ_n(Y).
The error probability on a message ω ∈ Mn is
P_{e|ω}^{(n)} := P(Ŵ ≠ ω | X = c_n(ω)) = ∑_{y∈Y^n, ĉ_n(y)≠ω} κ^{⊗n}(y | x).
It is clear that λ_n ≥ P_e^{(n)}, where λ_n := max_{ω∈M_n} P_{e|ω}^{(n)} denotes the maximal error probability and P_e^{(n)} := M_n^{−1} ∑_{ω∈M_n} P_{e|ω}^{(n)} the average error probability. We shall proceed to study the relationship between the error probability and the transmission rate
R_n := log M_n / n.
Proof. Take a distribution P_X on X such that I(X; Y) = C_κ (see Exercise 3.8 for the existence of such a P_X). Let ε := (C_κ − R)/5 and M_n′ := 2^{nR+1} for n ∈ N∗.
Step 1. Control the average error.
The idea is to use random codes. Generate a random matrix C_n = [X_i(ω)]_{ω∈M_n′, 1≤i≤n} with i.i.d. entries, each having distribution P_X (the ω in the bracket is merely an index). We call this matrix a (random) codebook. For a realization of C_n, we define encoding functions c_n and decoding functions ĉ_n as follows.
∀ω ∈ M_n′, c_n(ω) := (X_1(ω), . . . , X_n(ω)).
∀(y_1, . . . , y_n) ∈ Y^n, ĉ_n(y_1, . . . , y_n) := ω̂ if there exists a unique ω̂ ∈ M_n′ such that (X_1(ω̂), . . . , X_n(ω̂), y_1, . . . , y_n) ∈ A_ε^{(n)}, and ĉ_n(y_1, . . . , y_n) := 1 otherwise.
For ω ∈ M_n′, let (Y_1(ω), . . . , Y_n(ω)) be the (random) channel output related to the (deterministic) input (X_1(ω), . . . , X_n(ω)). Its distribution is given by
∀(y_1, . . . , y_n) ∈ Y^n, P((Y_1(ω), . . . , Y_n(ω)) = (y_1, . . . , y_n) | (X_1(ω), . . . , X_n(ω))) = ∏_{i=1}^n κ(y_i | X_i(ω)).
It is clear that
{ĉ_n(Y_1(1), . . . , Y_n(1)) ≠ 1} ⊆ (E_1^{(n)})^c ∪ ⋃_{ω∈M_n′\{1}} E_ω^{(n)}.
Hence
E[P_e^{(n)}] = P(ĉ_n(Y_1(1), . . . , Y_n(1)) ≠ 1) ≤ 1 − P(E_1^{(n)}) + ∑_{ω∈M_n′\{1}} P(E_ω^{(n)}). (4.1)
The first term on the right-hand side of (4.1) tends to 0 by (1.3). Moreover, since the rows of C_n are independent, by Proposition 4.2,
∑_{ω∈M_n′\{1}} P(E_ω^{(n)}) ≤ M_n′ · 2^{−n(C_κ−3ε)},
which converges to 0 as n → ∞ because C_κ − R − 3ε = 2ε > 0.
By Fatou's lemma, there is a realization of C_n such that the associated encoding/decoding scheme has average error probability P_e^{(n)} tending to 0.
Step 2. From average to maximal error.
This part and the part concerning the calculation of transmission rate are left as an exercise.
(Hint: Observe that if the average of n numbers is δ, then at least ⌊n/2⌋ of them are at most 2δ.)
Remark. The proof of Theorem 4.3 is probabilistic. No one has ever constructed the promised code.
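To get a feeling for the random-coding argument, here is a hedged Monte Carlo sketch over a binary symmetric channel. It replaces the joint-typicality decoder of the proof by minimum-Hamming-distance decoding (maximum likelihood for the BSC with p < 1/2); the parameters, names and the experiment itself are ours and purely illustrative.

```python
import random

def simulate(n, R, p, trials=500):
    """Empirical average error of a uniformly random binary codebook over a
    BSC(p), decoded by minimum Hamming distance."""
    M = max(2, int(2 ** (n * R)))                       # number of messages
    book = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
    errors = 0
    for _ in range(trials):
        w = random.randrange(M)                         # message sent
        y = [b ^ (random.random() < p) for b in book[w]]  # BSC output
        w_hat = min(range(M),
                    key=lambda m: sum(a != b for a, b in zip(book[m], y)))
        errors += (w_hat != w)
    return errors / trials

p = 0.1                       # BSC crossover; capacity is 1 - h(0.1) ~= 0.53
for n in (10, 20, 40):
    # With R < C the empirical error rate tends to decrease as n grows.
    print(n, simulate(n, R=0.2, p=p))
```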
In this section, we show that transmission rates that are larger than the channel capacity can
never be achieved. We start with the following
inequality.
Theorem 4.4 (Fano's inequality). Let (X, Y, X̂) be a random vector with value in X × Y × X such that X̂ ⊥ X given Y. Let P_e := P(X̂ ≠ X). Then
By (3.10), we have
H(X | E, X̂) = H(X | X̂, E = 1) P_e + H(X | X̂, E = 0)(1 − P_e)
≤ P_e H(X) + (1 − P_e) H(X̂ | X̂, E = 0)   (by (3.4))
≤ P_e log |X| + (1 − P_e) H(X̂ | X̂)   (by Corollary 1.5 and (3.4))
= P_e log |X|   (by (3.7)).
We have
I(W; Y_1, . . . , Y_n) ≤ I(X_1, . . . , X_n; Y_1, . . . , Y_n)   (by (3.20))
= H(Y_1, . . . , Y_n) − H(Y_1, . . . , Y_n | X_1, . . . , X_n)
≤ ∑_{i=1}^n H(Y_i) − ∑_{i=1}^n H(Y_i | X_1, . . . , X_n, Y_1, . . . , Y_{i−1})   (by (1.2) and (3.14))
= ∑_{i=1}^n H(Y_i) − ∑_{i=1}^n H(Y_i | X_i)   (since the channel is in product form)
= ∑_{i=1}^n I(X_i; Y_i)
≤ nC_κ.
Combining this with (4.6) (note that R_n = log M_n / n), we get the following inequality, from which the result follows easily:
(1 − P_e^{(n)}) R_n ≤ C_κ + 1/n.
Theorem 4.5 claims that if the transmission rate is larger than the channel capacity, then the error probability is bounded away from 0. In this section, we show a stronger converse to Theorem 4.3: the error probability converges to 1 (i.e. the communication is completely unreliable).
Consider a channel with probability kernel κ from X to Y. For an input distribution P_X and (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n, define
I(x_1, . . . , x_n; y_1, . . . , y_n) := log( κ^{⊗n}(y_1, . . . , y_n | x_1, . . . , x_n) / P_Y^{⊗n}(y_1, . . . , y_n) ) = ∑_{i=1}^n I(x_i; y_i).
Below, I(x_1, . . . , x_n; Y_1, . . . , Y_n) denotes this quantity evaluated at the (random) output (Y_1, . . . , Y_n) of the channel for the input (x_1, . . . , x_n).
Lemma 4.6. Let PX be an input distribution such that I (X ; Y ) = Cκ . Then for all n ∈ N∗ and
(x_1, . . . , x_n) ∈ X^n, we have E[I(x_1, . . . , x_n; Y_1, . . . , Y_n)] ≤ nC_κ.
Proof. It suffices to show the lemma for n = 1. Indeed, since the channel is in product form,
I(x_1, . . . , x_n; Y_1, . . . , Y_n) = ∑_{i=1}^n I(x_i; Y_i).
For x ∈ X, we have
E[I(x; Y)] = ∑_{y∈Y} κ(y | x) log( κ(y | x) / P(Y = y) ).
For t ∈ [0, 1], let X̃_t ∼ (1 − t)P_X + tδ_x, where δ_x denotes the distribution degenerate at x. One can check that the right derivative at t = 0 of t ↦ I(X̃_t; Y) equals E[I(x; Y)] − C_κ. Since P_X maximizes the concave function P_X ↦ I(X; Y) (Exercise 3.8), this derivative is non-positive, and it follows that E[I(x; Y)] ≤ C_κ.
Theorem 4.7 (Wolfowitz). Let M = {1, . . . , M} be a finite set of messages. Fix n ∈ N∗. Let c : M → X^n be an encoding function and ĉ : Y^n → M be a decoding function. If R := log M / n > C_κ, then
P_e ≥ 1 − 4A/(n(R − C_κ)²) − 2^{−n(R−C_κ)/2}
for some positive constant A depending only on κ, not on n or M.
Proof. Let ε := (R − C_κ)/2 and define, for ω ∈ M,
B_ω := {y ∈ Y^n | I(c(ω); y) ≥ n(C_κ + ε)}.
Then for y ∈ B_ω^c, one has κ^{⊗n}(y | c(ω)) ≤ P_Y^{⊗n}(y) · 2^{n(C_κ+ε)}. It follows that, writing Y_ω := ĉ^{−1}({ω}) for the decoding region of ω,
∑_{ω∈M} ∑_{y∈Y_ω∩B_ω^c} κ^{⊗n}(y | c(ω)) ≤ 2^{n(C_κ+ε)} ∑_{ω∈M} ∑_{y∈Y_ω} P_Y^{⊗n}(y) = 2^{n(C_κ+ε)} = 2^{n(R+C_κ)/2}. (4.8)
Thus, for every ω ∈ M,
∑_{y∈Y_ω∩B_ω} κ^{⊗n}(y | c(ω)) ≤ ∑_{y∈B_ω} κ^{⊗n}(y | c(ω)) = P(I(c(ω); Y_1, . . . , Y_n) ≥ n(C_κ + ε)).
By Lemma 4.6, E[I(c(ω); Y_1, . . . , Y_n)] ≤ nC_κ, and since the channel is in product form, the variance of I(c(ω); Y_1, . . . , Y_n) is at most nA for a constant A depending only on κ. Chebyshev's inequality therefore gives
∑_{y∈Y_ω∩B_ω} κ^{⊗n}(y | c(ω)) ≤ nA/(nε)² = 4A/(n(R − C_κ)²).
Hence
P_e = 1 − (1/M) ∑_{ω∈M} ∑_{y∈Y_ω} κ^{⊗n}(y | c(ω))
= 1 − (1/M) ∑_{ω∈M} ∑_{y∈Y_ω∩B_ω^c} κ^{⊗n}(y | c(ω)) − (1/M) ∑_{ω∈M} ∑_{y∈Y_ω∩B_ω} κ^{⊗n}(y | c(ω))
≥ 1 − 2^{n(R+C_κ)/2}/M − 4A/(n(R − C_κ)²)   (by (4.8) and the previous bound)
≥ 1 − 4A/(n(R − C_κ)²) − 2^{−n(R−C_κ)/2}.
The last inequality comes from the fact that log M / n = R, i.e. M = 2^{nR}.
Corollary 4.8. For a channel with probability kernel κ from X to Y and an encoding/decoding scheme c_n : M_n → X^n, ĉ_n : Y^n → M_n, if lim inf_{n→∞} R_n > C_κ, then lim_{n→∞} P_e^{(n)} = 1.
Exercises
Exercise 4.1 (Noisy typewriter). Consider a channel with input and output alphabet X = Y =
{a, b, . . . , z}. For a given input letter, the output may be equal to the input letter or to the next
one, both with probability 1/2. What is the information capacity of this channel?
∀n ∈ N∗, Y_n := X_n ⊕ Z_n,
where Z_n ∼ Ber(q) and (Z_n)_{n∈N∗} is independent of (X_n)_{n∈N∗} (but the Z_n's are not necessarily independent of each other).
1. Show that H(X_1, . . . , X_n | Y_1, . . . , Y_n) = H(Z_1, . . . , Z_n).
2. The capacity of this channel is C := lim inf_{n→∞} (1/n) sup_{P_{(X_1,...,X_n)}} I(X_1, . . . , X_n; Y_1, . . . , Y_n). Compare the information capacities of the BSC with and without memory.
Exercise 4.4. Consider a time-varying (non-homogeneous) product-form channel with probability kernels κ_n (n ∈ N∗) from X to Y. Show that
sup_{P_{(X_1,...,X_n)}} I(X_1, . . . , X_n; Y_1, . . . , Y_n) = ∑_{i=1}^n sup_{P_{X_i}} I(X_i; Y_i).
Exercise 4.5 (Fano's inequality is sharp). Let p ∈ (0, 1). By considering a random variable X with range X = {1, . . . , m} such that P(X = 1) = 1 − p and P(X = k) = p/(m − 1) for 2 ≤ k ≤ m, and taking Y to be a singleton, show that Fano's inequality (4.2) is sharp.
Exercise 4.6 (Feedback does not increase capacity). Show that Theorem 4.5 remains true if we consider encoding functions c_n of the form
c_n(ω) = (c_{n,1}(ω), c_{n,2}(ω, Y_1), . . . , c_{n,n}(ω, Y_1, . . . , Y_{n−1})),
where
c_{n,i} : M_n × Y^{i−1} → X, i = 1, . . . , n.