
Math and Science Summer Program 2020

Shannon’s theorems

Lecture on Applied Mathematics

Nguyen Manh Linh


École Normale Supérieure, France
Contents

Notation

1 Shannon's entropy and source coding
1.1 Shannon's entropy and Gibbs' inequality
1.2 Typical sequences and Shannon's first theorem

2 Universal source coding
2.1 Empirical distributions and Kullback-Leibler divergence
2.2 Universal source coding with errors

3 Communication channel
3.1 Conditional entropy
3.2 Mutual information
3.3 Information channel and product form

4 Shannon's second theorem
4.1 Jointly typical sequences and Shannon's second theorem
4.2 Weak converse to Theorem 4.3
4.3 Strong converse to Theorem 4.3

Notation

From now on, by a set we mean a discrete set (i.e. a set that is either finite or countably infinite), unless otherwise mentioned. The collection 2^X of all subsets of a given (discrete) set X forms a σ-algebra.

A probability measure on X is simply a function p : X → [0, 1] satisfying ∑_{x∈X} p(x) = 1. If A is any subset of X, we write p(A) for the sum ∑_{x∈A} p(x) (hence a non-negative function p on X is a probability measure iff p(X) = 1).
Whenever p is a probability measure on X and X is a random variable whose value is in X, we
write X ∼ p if the distribution PX of X coincides with p, i.e. if

∀x ∈ X, P(X = x) = p(x).

When X = X1 × · · · × Xn and pi is a probability measure on Xi (i = 1, . . . , n), the tensor product


of p1 , . . . , pn is the probability measure on X defined by

∀(x 1 , . . . , x n ) ∈ X, (p1 ⊗ · · · ⊗ pn )(x 1 , . . . , x n ) := p1 (x 1 ) · · · pn (x n ).

Random variables X 1 , . . . , X n with values in respectively X1 , . . . , Xn satisfying X i ∼ pi (i =


1, . . . , n) are called independent if the random vector X = (X 1 , . . . , X n ) has joint distribution
p1 ⊗ · · · ⊗ pn . We use the notation X ⊥ Y to indicate that two random variables X and Y are
independent.

1 Shannon's entropy and source coding

1.1 Shannon’s entropy and Gibbs’ inequality

Let X be a set and p a probability measure on X. Let D > 1 be a real number.


Definition 1.1. 1. The entropy of p is defined to be

H_D(p) := − ∑_{x∈X} p(x) log_D p(x).

2. Let X be a random variable with value in X whose distribution is p. The entropy of X is

H_D(X) := H_D(p) = −E[log_D p(X)].

Remark. 1. The base of the logarithm D is usually omitted when understood. One often takes D = 2, which yields the binary entropy H_2.
2. We implicitly used the convention 0 · log 0 := 0 in Definition 1.1, obtained by continuously extending the function x ↦ x log x at x = 0.
Example 1.2. Let X = {0, 1}, p ∈ [0, 1] and X ∼ Ber(p), that is,

P(X = 0) = p = 1 − P(X = 1).

The entropy of X ,
H (X ) = −p log p − (1 − p) log(1 − p),
is denoted h(p).
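As a quick numerical aside (not part of the original argument), the entropy of Definition 1.1 and the binary entropy h(p) of Example 1.2 are easy to compute; the short Python sketch below assumes a distribution given as a list of probabilities.

import math

def entropy(p, base=2.0):
    # Shannon entropy H_D(p) of a discrete distribution given as a list of probabilities.
    # Convention 0 * log 0 = 0: symbols with zero probability are skipped.
    return -sum(px * math.log(px, base) for px in p if px > 0)

def h(p):
    # Binary entropy h(p) = H(Ber(p)) from Example 1.2.
    return entropy([p, 1.0 - p])

print(h(0.5))   # 1.0: a fair bit carries one full bit of uncertainty
print(h(0.11))  # about 0.5
print(h(0.0))   # 0.0: a deterministic bit carries no uncertainty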
Proposition 1.3. Let X be a random variable with value in X. Its entropy H(X) enjoys the following properties.
1. H(X) ≥ 0, and equality holds iff X is (almost surely) deterministic.
2. Let f : X → Y be a deterministic function. One has H(X) ≥ H(f(X)), and equality holds iff f is injective on the support of X.

Proof. Exercise. □

Theorem 1.4 (Gibbs' inequality). If p and q are two probability measures on X, then

H(p) ≤ − ∑_{x∈X} p(x) log q(x).    (1.1)

Equality in (1.1) holds iff p = q.


The right hand side of (1.1) is called the cross entropy between p and q, denoted H(p; q).

Proof. Exercise. □

Corollary 1.5. When X = {1, . . . , n}, the uniform distribution U(n) on X maximizes the entropy.

Corollary 1.6. Let X = N* and let X be a random variable taking values in N*. Let µ > 1 be given. If EX = µ, the entropy of X is maximized when X has the geometric distribution Geom(1/µ).

The inequality (1.1) also implies the following.

Proposition 1.7. Let X = (X_1, . . . , X_n) be a random vector with value in X = X_1 × ··· × X_n. One has the inequality

H(X) ≤ H(X_1) + ··· + H(X_n).    (1.2)

Moreover, equality holds iff X_1, . . . , X_n are independent.

Proof. Exercise. □

Exercises

Exercise 1.1 (Entropy of a homogeneous Markov chain). Let (X_n)_{n∈N*} be random variables with value in X such that the following conditions hold.
1. For each n ∈ N*, X_{n+1} and (X_1, . . . , X_{n−1}) are independent given X_n.
2. There exists a right stochastic matrix P = [p_{x,y}]_{x,y∈X} such that

∀n ∈ N*, ∀x, y ∈ X,  P(X_{n+1} = y | X_n = x) = p_{x,y}.

3. The distribution (row) vector π := [P(X_1 = x)]_{x∈X} of X_1 is stationary, that is, πP = π.

Show that for each n ∈ N*,

H(X_1, . . . , X_n) = H(X_1) − (n − 1) E[log p_{X_1,X_2}].

1.2 Typical sequences and Shannon’s first theorem

In this section, let |X| = D ∈ N* \ {1} and let p be a probability measure on X. Let X = (X_1, . . . , X_n), where X_1, . . . , X_n are i.i.d. random variables with values in X and common distribution p.

Definition 1.8. Let ε > 0. A realization x = (x_1, . . . , x_n) ∈ X^n of X is called ε-typical if

| −(1/n) ∑_{i=1}^n log_D p(x_i) − H_D(p) | ≤ ε.


We denote by A_ε^{(n)} the subset of ε-typical vectors in X^n. Asymptotically, typical vectors concentrate the probability, that is,

lim_{n→∞} P(X ∈ A_ε^{(n)}) = 1,    (1.3)

which follows from the weak Law of Large Numbers.

The good news is that the size of A_ε^{(n)} can be controlled. The following result assures that A_ε^{(n)} is relatively small compared to X^n (one may wish to recall that Corollary 1.5 implies H_D(p) ≤ H_D(U(D)) = 1).

Proposition 1.9. |A_ε^{(n)}| ≤ D^{n(H_D(p)+ε)}.

Proof. If x = (x_1, . . . , x_n) is ε-typical then, by definition,

∑_{i=1}^n log_D p(x_i) ≥ −n(H_D(p) + ε).

Therefore, x ∈ A_ε^{(n)} implies P(X = x) = p^{⊗n}(x) ≥ D^{−n(H_D(p)+ε)} (by independence). Hence

1 ≥ P(X ∈ A_ε^{(n)}) = ∑_{x ∈ A_ε^{(n)}} P(X = x) ≥ |A_ε^{(n)}| · D^{−n(H_D(p)+ε)}

and the conclusion follows. □
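As an illustration of (1.3) and Proposition 1.9 (a sketch only, not part of the notes; the distribution p below is arbitrary), one can sample i.i.d. sequences and count how often they fall in A_ε^{(n)}:

import math, random

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # an arbitrary source distribution; here D = |X| = 3
D = len(p)
H = -sum(q * math.log(q, D) for q in p.values())   # H_D(p)

def is_typical(x, eps):
    # x is a sequence of symbols; test |-(1/n) sum_i log_D p(x_i) - H_D(p)| <= eps
    s = -sum(math.log(p[xi], D) for xi in x) / len(x)
    return abs(s - H) <= eps

n, eps, trials = 200, 0.05, 5000
symbols, weights = zip(*p.items())
hits = sum(is_typical(random.choices(symbols, weights, k=n), eps) for _ in range(trials))
print("empirical P(X in A_eps^(n)):", hits / trials)            # close to 1 for large n
print("bound of Proposition 1.9:", D ** (n * (H + eps)), ">= |A_eps^(n)|")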

The next proposition shows that there are no 'smaller' sets that (asymptotically) concentrate probability.

Proposition 1.10. Let R > 0 be such that for each n ∈ N*, there exists a subset B^{(n)} ⊆ X^n of cardinality at most D^{nR}. If

lim sup_{n→∞} P(X ∈ B^{(n)}) = 1,

then necessarily R ≥ H_D(p).

Proof. Let ε > 0. By arguments similar to those in the proof of Proposition 1.9, if x = (x_1, . . . , x_n) is ε-typical, then P(X = x) ≤ D^{−n(H_D(p)−ε)}. Consequently,

P(X ∈ A_ε^{(n)} ∩ B^{(n)}) = ∑_{x ∈ A_ε^{(n)} ∩ B^{(n)}} P(X = x) ≤ |B^{(n)}| · D^{−n(H_D(p)−ε)} ≤ D^{−n(H_D(p)−R−ε)}.

Let n_1 < n_2 < ··· be a strictly increasing sequence of positive integers such that lim_{k→∞} P(X ∈ B^{(n_k)}) = 1. Then P(X ∈ A_ε^{(n_k)} ∩ B^{(n_k)}) → 1 as k → ∞, so the term D^{−n_k(H_D(p)−R−ε)} remains bounded away from 0, which implies H_D(p) − R − ε ≤ 0. Letting ε ↓ 0 yields the result. □

Our goal is to encode the source message X (consisting of n i.i.d. symbols) using, hopefully, fewer than n symbols. The question is: at which compression rate can one still correctly decode the encoded message (at least with high probability)?


Definition 1.11. Let Y be a finite set, which we will call an alphabet. By an encoding/decoding scheme, we mean the following data for each n ∈ N*:
1. a positive integer n_0 and an encoding function c_n : X^n → Y^{n_0},
2. a decoding function ĉ_n : Y^{n_0} → X^n.

The compression rate of the given encoding/decoding scheme is R_n := n_0 / n. Let Y := c_n(X) be the encoded message and X̂ := ĉ_n(Y) be the decoded message. We will concern ourselves with the error probability of the scheme,

P_e^{(n)} := P(X̂ ≠ X),

given the compression rate. For simplicity, let us take Y = X. The general case is left as an exercise.

Theorem 1.12 (Shannon's first theorem / Noiseless coding theorem).
1. For any R > H_D(p), there exist encoding functions c_n : X^n → X^{⌈nR⌉} and decoding functions ĉ_n : X^{⌈nR⌉} → X^n such that lim_{n→∞} P_e^{(n)} = 0.
2. For any R < H_D(p) and any encoding/decoding scheme c_n : X^n → X^{⌊nR⌋}, ĉ_n : X^{⌊nR⌋} → X^n, we have lim inf_{n→∞} P_e^{(n)} > 0.

Proof. 1. Let ε ∈ (0, R − H_D(p)). By Proposition 1.9, for each n ∈ N* we can find an injection

f_n : A_ε^{(n)} → X^{⌈nR⌉}

and an element x* ∈ X^{⌈nR⌉} \ f_n(A_ε^{(n)}). Consider the following encoding/decoding scheme:

for x ∈ X^n,  c_n(x) := f_n(x) if x ∈ A_ε^{(n)}, and c_n(x) := x* otherwise;
for y ∈ X^{⌈nR⌉},  ĉ_n(y) := f_n^{−1}(y) if y ∈ f_n(A_ε^{(n)}), and ĉ_n(y) := arbitrary otherwise.

It is clear that ĉ_n(c_n(x)) = x for all x ∈ A_ε^{(n)}. Consequently, P_e^{(n)} ≤ P(X ∉ A_ε^{(n)}), which converges to 0 as n tends to ∞ by (1.3).

2. Define, for each n ∈ N*,

B^{(n)} := {x ∈ X^n | ĉ_n(c_n(x)) = x}.

In particular, the restriction of c_n to B^{(n)} is injective into X^{⌊nR⌋}, which implies |B^{(n)}| ≤ D^{nR}. The result follows from Proposition 1.10. □

Exercises

Exercise 1.2. State and prove Theorem 1.12 in the general case (where Y is not necessarily
equal to X).

2 Universal source coding

This chapter deals with an extension of Theorem 1.12 when the distribution of the source
alphabet is unknown. In other words, we will show that for any coding rate R > 0, there is
an asymptotically error-free encoding/decoding scheme that is universal for all probability
distributions of i.i.d. source symbols whose entropy is smaller than R.
Let X be a set of cardinality D ∈ N∗ \ {1}.

2.1 Empirical distributions and Kullback-Leibler divergence

Definition 2.1. Let a = (a_1, . . . , a_n) ∈ X^n. The associated empirical distribution p_a is the probability measure on X defined by

∀x ∈ X,  p_a(x) := (1/n) ∑_{i=1}^n 1_{a_i = x}.

We denote by P_n the set of empirical distributions on X associated to all vectors of length n. For p ∈ P_n, we define

T_n(p) := {a ∈ X^n | p_a = p}.

The following inequality holds:

(n + 1)^{−D} · 2^{nH_2(p)} ≤ |T_n(p)| ≤ 2^{nH_2(p)}.    (2.1)
Definition 2.2. The Kullback-Leibler divergence between two probability measures p and q is

D(p; q) := ∑_{x∈X} p(x) log ( p(x) / q(x) ).

The base of the logarithm in Definition 2.2 can be any real number greater than 1. The conventions used here are 0 · log(0/y) = 0 for y ≥ 0 and x · log(x/0) = +∞ for x > 0.
By inequality (1.1), D(p; q) ≥ 0, and equality holds iff p = q (be warned, the Kullback-Leibler divergence is not symmetric!).

Lemma 2.3. Let p be a probability measure on X. We have
1. for any vector x ∈ X^n, p^{⊗n}(x) = 2^{−n(H_2(p_x) + D_2(p_x; p))},
2. for any empirical distribution q ∈ P_n, p^{⊗n}(T_n(q)) ≤ 2^{−nD_2(q; p)}.

Proof. Exercise. For the inequality, one can make use of (2.1). □
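The identity in part 1 of Lemma 2.3 can be checked numerically; the sketch below (base-2 logarithms, an arbitrary small example) computes the empirical distribution p_x and compares p^{⊗n}(x) with 2^{−n(H_2(p_x) + D_2(p_x; p))}.

import math
from collections import Counter

def empirical(x):
    # Empirical distribution p_x of a tuple x (Definition 2.1).
    n = len(x)
    return {a: c / n for a, c in Counter(x).items()}

def H2(p):
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def D2(p, q):
    # Kullback-Leibler divergence D(p; q) in bits (Definition 2.2); assumes p << q.
    return sum(px * math.log2(px / q[a]) for a, px in p.items() if px > 0)

p = {'a': 0.6, 'b': 0.3, 'c': 0.1}
x = ('a', 'a', 'b', 'a', 'c', 'b', 'a', 'a')
px = empirical(x)
lhs = math.prod(p[a] for a in x)                       # p^{tensor n}(x)
rhs = 2 ** (-len(x) * (H2(px) + D2(px, p)))            # Lemma 2.3, part 1
print(lhs, rhs)                                        # the two numbers agree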


Exercises

Exercise 2.1. Let p be a probability measure on X. Let X_1, X_2, . . . be an i.i.d. sequence of random variables with values in X such that X_1 ∼ p. Show that, almost surely, p_{(X_1,...,X_n)} → p as n → ∞.
Exercise 2.2. Prove inequality (2.1). (Hint: for k_1, . . . , k_D ≥ 0 such that k_1 + ··· + k_D = n, expand n^n = (k_1 + ··· + k_D)^n and majorize every summand by the largest one.)
Exercise 2.3. Let p and q be two probability measures on X. Show that D(p; q) < +∞ iff p ≪ q (meaning q(x) = 0 implies p(x) = 0).
Exercise 2.4. Let (X , Y ) be a random vector with value in X 2 . Show that

H (X ) + H (Y ) − H (X , Y ) = D(P(X,Y ) ; PX ⊗ PY ).

2.2 Universal source coding with errors

We are now ready to state and prove an extension of Theorem 1.12. For simplicity, let us take
the binary encoding alphabet {0, 1}.
Theorem 2.4 (Universal source coding). Let R > 0. There exist functions c_n : X^n → {0, 1}^{⌈nR⌉}, ĉ_n : {0, 1}^{⌈nR⌉} → X^n such that for any probability measure p on X satisfying H_2(p) < R and any sequence X_1, X_2, . . . of i.i.d. random variables with values in X and distribution p,

lim_{n→∞} P_e^{(n)} = 0,

where P_e^{(n)} = P(ĉ_n(c_n(X_1, . . . , X_n)) ≠ (X_1, . . . , X_n)).

Proof. The proof of Theorem 1.12 can be mimicked, provided that we find subsets A^{(n)} ⊆ X^n (depending only on R) of cardinality strictly less than 2^{nR} such that for any probability measure p on X with entropy less than R, we have

lim_{n→∞} p^{⊗n}(A^{(n)}) = 1.    (2.2)

To do this, take

A^{(n)} := {a ∈ X^n | H_2(p_a) ≤ R_n},

where

R_n := R − D log_2(n + 2) / n.


We study the size of A^{(n)}. One has

|A^{(n)}| = ∑_{q ∈ P_n, H_2(q) ≤ R_n} |T_n(q)|
         ≤ ∑_{q ∈ P_n, H_2(q) ≤ R_n} 2^{nH_2(q)}    (by (2.1))
         ≤ |P_n| · 2^{nR_n}
         ≤ (n + 1)^D · 2^{nR} / (n + 2)^D
         < 2^{nR}.

It suffices to show that (2.2) holds for any probability measure p on X such that H_2(p) < R. Take ε ∈ (0, R − H_2(p)). Let P denote the set of all probability measures on X and

q* := arg min { D(q; p) | q ∈ P, H_2(q) ≥ H_2(p) + ε }

(Exercise: argue that q* exists and that D(q*; p) > 0). For large n, one has R_n ≥ H_2(p) + ε, thus D(q; p) ≥ D(q*; p) for all q ∈ P with H_2(q) ≥ R_n. Observe that

p^{⊗n}(X^n \ A^{(n)}) = ∑_{q ∈ P_n, H_2(q) > R_n} p^{⊗n}(T_n(q))
                      ≤ ∑_{q ∈ P_n, H_2(q) > R_n} 2^{−nD_2(q; p)}    (by Lemma 2.3)
                      ≤ (n + 1)^D · 2^{−nD_2(q*; p)}    (for large n).

The last term converges to 0 as n → ∞ since D(q*; p) > 0. This finishes the proof. □
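To see the universality at work numerically, one can estimate p^{⊗n}(A^{(n)}) by Monte Carlo for a source the scheme knows nothing about; this is only a sketch, with an arbitrary Bernoulli source and the threshold R_n from the proof.

import math, random
from collections import Counter

def H2_empirical(a):
    n = len(a)
    return -sum((c / n) * math.log2(c / n) for c in Counter(a).values())

D, R = 2, 0.8                 # binary source alphabet, target rate R
p = 0.1                       # unknown to the scheme: Ber(0.1), with h_2(0.1) ~ 0.47 < R

for n in (100, 1000, 5000):
    Rn = R - D * math.log2(n + 2) / n                  # threshold defining A^(n)
    trials = 1000
    hits = sum(H2_empirical([random.random() < p for _ in range(n)]) <= Rn
               for _ in range(trials))
    print(n, round(Rn, 3), hits / trials)              # p^{tensor n}(A^{(n)}) approaches 1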

Exercises

Exercise 2.5 (Bernoulli source-symbols). For a vector x ∈ {0, 1}^n, let K(x) denote its number of 1's. Let B(x) be the lexicographical rank of x among all vectors of {0, 1}^n with exactly K(x) 1's. We then have a bijection

c_n^0 : {0, 1}^n → { (k, b) | k ∈ {0, . . . , n}, b ∈ {1, . . . , C(n, k)} },  c_n^0(x) := (K(x), B(x)),

where C(n, k) denotes the binomial coefficient. Consider the encoding function c_n on {0, 1}^n defined as follows: c_n(x) is the concatenation of the binary representations of K(x) and B(x). We shall work out the asymptotic length of c_n(x).

1. Let n ∈ N* and p ∈ (0, 1) be such that np is an integer. Show that

C(n, np) · 2^{−n h_2(p)} ≤ 1 / √(π n p(1 − p))

(see the definition of h_2(p) in Example 1.2). Moreover, show that if 12np(1 − p) ≥ 9, then

C(n, np) · 2^{−n h_2(p)} ≥ 1 / √(8 n p(1 − p)).

(Hint: one can use Stirling's approximation

√(2πn) (n/e)^n e^{1/(12n+1)} ≤ n! ≤ √(2πn) (n/e)^n e^{1/(12n)}.)

2. Let x_1, x_2, . . . ∈ {0, 1} be such that (1/n) K(x_1, . . . , x_n) → p ∈ (0, 1) as n → ∞, and let |·| denote the length of a binary sequence. Show that

lim_{n→∞} (1/n) |c_n(x_1, . . . , x_n)| = h_2(p).

See Section 14.1.1 (A First Example) in Pierre Brémaud, Discrete Probability Models and Methods.
Springer, 2017 for universal source coding for binary sequences.

3 Communication channel

In this chapter, X, Y and Z are always finite sets. The base of logarithm is always 2 unless
specified. To simplify notations, we use the convention P(A|B) = 0 when P(B) = 0, where A
and B are two events.

3.1 Conditional entropy

Definition 3.1. 1. Let (X, Y) be an X × Y-valued random vector. For y ∈ Y, the conditional entropy of X given Y = y is defined to be

H(X | Y = y) := − ∑_{x∈X} P(X = x | Y = y) log P(X = x | Y = y).

2. The conditional entropy of X given Y is

H(X | Y) := − ∑_{x∈X, y∈Y} P(X = x, Y = y) log P(X = x | Y = y).
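For concreteness (an illustrative sketch with a made-up joint table, not taken from the notes), H(X|Y) can be computed directly from Definition 3.1:

import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # P(X = x, Y = y)

def cond_entropy_X_given_Y(joint):
    # H(X|Y) = - sum_{x,y} P(x, y) log2 P(x|y), with P(x|y) = P(x, y) / P(y).
    pY = {}
    for (x, y), pr in joint.items():
        pY[y] = pY.get(y, 0.0) + pr
    return -sum(pr * math.log2(pr / pY[y]) for (x, y), pr in joint.items() if pr > 0)

print(cond_entropy_X_given_Y(joint))   # about 0.722 bits for this table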

The following properties are straightforward.


Proposition 3.2. Let (X , Y ) be a random vector with value in X × Y. We have the following
results.
1. (Bayes’ rule) Õ
H (X |Y ) = P(Y = y)H (X |Y = y) (3.1)
y ∈Y

2.
H (X |Y = y) > 0 (3.2)
for all y ∈ Y. Moreover, H (X |Y ) > 0 and equality holds iff Y = f (X ) where f : X → Y
is a deterministic function.
3. (Chain rule)
H (X , Y ) = H (X ) + H (Y |X ) = H (Y ) + H (X |Y ). (3.3)

4. If H (Y ) < ∞, then
H (X |Y ) 6 H (X ) (3.4)
and equality holds iff X ⊥ Y .

13
3 Communication channel

5. For any function f : X → Z,

H (Y | f (X )) > H (Y |X ) (3.5)

with equality iff Y ⊥ X given f (X ).


6. Let (X 1 , . . . , X n , Y ) be a random vector with value in X1 × · · · Xn × Y. We have
n
Õ
H (X 1 , . . . , X n |Y ) 6 H (X i |Y ) (3.6)
i=1

with equality when X 1 , . . . , X n are independence given Y . Inversely, if H (X 1 , . . . , X n |Y ) 6


Ín
i=1 H (X i |Y ) < +∞, then X 1 , . . . , X n are independence given Y .

7. For any function f : X → Y,


H (f (X )|X ) = 0 (3.7)

8. Let (X_1, . . . , X_n, Y) be a random vector with value in X_1 × ··· × X_n × Y. Then

∀i ∈ {1, . . . , n},  H(X_i) ≤ H(X_1, . . . , X_n)    (3.8)

and

∀i ∈ {1, . . . , n},  H(X_i | Y) ≤ H(X_1, . . . , X_n | Y).    (3.9)

9. H (X , Y ) is finite iff H (X ) and H (Y ) are. Similarly, H (X , Y |Z ) is finite iff H (X |Z ) and


H (Y |Z ) are (where Z is any Z-valued random variable).
The properties (3.1), (3.3) and (3.4) can also be improved.
Proposition 3.3. Let (X, Y, Z) be a random vector with value in X × Y × Z. We have the following results.
1. (Bayes' rule)

H(X | Y, Z) = ∑_{z∈Z} P(Z = z) H(X | Y, Z = z),    (3.10)

where

H(X | Y, Z = z) := − ∑_{x∈X, y∈Y} P(X = x, Y = y | Z = z) log P(X = x | Y = y, Z = z).

2. (Chain rule)

H(X, Y | Z) = H(X | Z) + H(Y | X, Z) = H(Y | Z) + H(X | Y, Z).    (3.11)

3. If H(Y | Z) < ∞, then

H(X | Y, Z) ≤ H(X | Y).    (3.12)


Exercises

Exercise 3.1. Show the sequential conditional chain rules: if (X_1, . . . , X_n, Y) is a random vector with value in X_1 × ··· × X_n × Y, we have

H(X_1, . . . , X_n) = ∑_{i=1}^n H(X_i | X_1, . . . , X_{i−1})    (3.13)

and

H(X_1, . . . , X_n | Y) = ∑_{i=1}^n H(X_i | X_1, . . . , X_{i−1}, Y).    (3.14)

3.2 Mutual information

Definition 3.4. For a random vector (X, Y) with value in X × Y, we define the mutual information between X and Y by

I(X; Y) := H(X) + H(Y) − H(X, Y).

Exercise 2.4 says that I(X; Y) = D(P_{(X,Y)}; P_X ⊗ P_Y). The chain rule (3.3) gives

I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = I(Y; X).
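Continuing the small numerical sketch above (same hypothetical joint table), both expressions for the mutual information give the same number:

import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

pX, pY = {}, {}
for (x, y), pr in joint.items():
    pX[x] = pX.get(x, 0.0) + pr
    pY[y] = pY.get(y, 0.0) + pr

I_entropies = H(pX) + H(pY) - H(joint)                    # Definition 3.4
I_divergence = sum(pr * math.log2(pr / (pX[x] * pY[y]))   # D(P_{(X,Y)}; P_X x P_Y), Exercise 2.4
                   for (x, y), pr in joint.items() if pr > 0)
print(I_entropies, I_divergence)                          # both about 0.278 bits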
We state without proof some basic properties of mutual information.
Proposition 3.5. Let (X, Y) be a random vector with value in X × Y. Suppose that H(X) and H(Y) are finite. Then we have the following results.
1.

I(X; X) = H(X).    (3.15)

2.

I(X; Y) ≥ 0    (3.16)

with equality iff X ⊥ Y.

3.

I(X; Y) ≤ I(Y; Y) = H(Y)    (3.17)

with equality iff Y is a deterministic function of X.

4. For any function f : X → Y,

I(f(X); Y) ≤ I(X; Y)    (3.18)

with equality iff Y ⊥ X given f(X).

5. (Data processing inequality) If Z is a random variable with value in Z and finite entropy, then

I(X, Y; Z) ≥ I(Y; Z)    (3.19)

with equality iff Z ⊥ X given Y. (In particular, if Z ⊥ X given Y, then I(X; Z) ≤ I(X, Y; Z) = I(Y; Z).)


Exercises

Exercise 3.2. If X_1, . . . , X_n is a Markov chain, show that for 1 ≤ k ≤ ℓ ≤ n,

I(X_1; X_n) ≤ I(X_k; X_ℓ).    (3.20)

Exercise 3.3. If X 1 , X 2 , Y1 , Y2 are random variables with finite entropy such that (X 1 , X 2 ) and
(Y1 , Y2 ) are independent, then

I (X 1 , Y1 ; X 2 , Y2 ) = I (X 1 ; X 2 ) + I (Y1 ; Y2 ). (3.21)

Exercise 3.4. Define the conditional version of mutual information, state and prove its basic
properties.
Exercise 3.5. Show Kolmogorov's formula: if X, Y and Z have finite entropy, then

I(X; Y, Z) = I(X; Z) + I(X; Y | Z).    (3.22)

Exercise 3.6 (yet another chain rule). 1. Let X, Y, Z and U be random variables with ranges X, Y, Z and U respectively, all with finite entropy. We have

I(X; Y, Z | U) = I(X; Z | U) + I(X; Y | Z, U).    (3.23)

2. Let X, (Y_1, . . . , Y_n) and U be random variables with ranges X, Y^n and U respectively. If H(X, Y | U) < +∞, show that

I(X; Y_1, . . . , Y_n | U) = ∑_{i=1}^n I(X; Y_i | Y_1, . . . , Y_{i−1}, U) = ∑_{j=1}^n I(X; Y_j | Y_{j+1}, . . . , Y_n, U).    (3.24)

Exercise 3.7 (Csiszár sum identity). Let X = (X_1, . . . , X_n), Y = (Y_1, . . . , Y_n) and U be random variables with ranges X^n, Y^n and U respectively. If H(X, Y | U) < +∞, show that

∑_{i=1}^n I(Y_i; X_{i+1}, . . . , X_n | Y_1, . . . , Y_{i−1}, U) = ∑_{i=1}^n I(X_i; Y_1, . . . , Y_{i−1} | X_{i+1}, . . . , X_n, U).    (3.25)

Having studied the concepts of mutual information, we are now ready to define an information
channel and its capacity.

3.3 Information channel and product form

Definition 3.6. By an information channel with input alphabet X and output alphabet Y, we mean a family of non-negative functions

κ_n(·|·) : X^n × Y^n → [0, +∞),  (x, y) ↦ κ_n(y|x),

indexed by n ∈ N*, called the probability kernels of the channel: for each x ∈ X^n, the function κ_n(·|x) is a probability measure on Y^n.


For n ∈ N∗ , the channel takes a random vector X = (X 1 , . . . , X n ) with value in X n as input (the
emitted word) and produces a random vector Y = (Y1 , . . . , Yn ) with value in Y n (the received
word) in such a way that

∀(x, y) ∈ X n × Y n , P(Y = y|X = x) = κn (y|x).

Definition 3.7. With the same notation, a channel is said to be


1. memoryless if for any n > 2, any distribution of the emitted word X and any j ∈ {2, . . . , n},
Yj ⊥ (X 1 , . . . , X j−1 , Y1 , . . . , Yj−1 ) given X j ,
2. without feedback if for any n > 2, any distribution of the emitted word X and any
j ∈ {2, . . . , n}, X j ⊥ (Y1 , . . . , Yj−1 ) given (X 1 , . . . , X j−1 ),
3. without anticipation if for any n > 2, any distribution of the emitted word X and any
j ∈ {1, . . . , n − 1}, Yj ⊥ (X j+1 , . . . , X n ) given (X 1 , . . . , X j ).
Proposition 3.8. A channel without anticipation is without feedback.

Proof. Exercise. □
Proposition 3.9. If a channel is memoryless and without feedback, then for any n ≥ 2, any distribution of X, any j ∈ {2, . . . , n} and any (x_1, . . . , x_j, y_1, . . . , y_j) ∈ X^j × Y^j,

P(Y_1 = y_1, . . . , Y_j = y_j | X_1 = x_1, . . . , X_j = x_j) = ∏_{k=1}^j P(Y_k = y_k | X_k = x_k).

Proof. Exercise. □
Proposition 3.10. If for any n ≥ 2, any distribution of X and any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n, we have

P(Y_1 = y_1, . . . , Y_n = y_n | X_1 = x_1, . . . , X_n = x_n) = ∏_{k=1}^n P(Y_k = y_k | X_k = x_k),    (3.26)

then the channel is memoryless, without feedback and without anticipation.

Proof. Exercise. □
Proposition 3.11. For a given channel, the following are equivalent.
1. The channel is memoryless and without feedback.
2. (3.26) holds for any n ≥ 2, any distribution of X and any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n.
3. The channel is memoryless and without anticipation.

Proof. Exercise. □


Definition 3.12. A memoryless and without feedback channel is said to be time-invariant if


there is a probability kernel κ(·|·) from X to Y such that for any i ∈ N∗ and any (x, y) ∈ X × Y,

P(Yi = y|X i = x) = κ(y|x).

The matrix [κ(y|x)]x ∈X,y ∈Y is called the transition matrix of the channel.
Proposition 3.13. A channel is memoryless, without feedback and time-invariant with transition matrix [κ(y|x)]_{x∈X, y∈Y} iff for any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n,

κ_n(y_1, . . . , y_n | x_1, . . . , x_n) = ∏_{i=1}^n κ(y_i | x_i).

Proof. Exercise. □

For this reason, a memoryless, without feedback and time-invariant channel is said to have
(homogeneous) product form.
Let X be a generic input symbol and Y the output. Then the distribution of (X , Y ) is a function
of PX and the probability kernel κ.
Definition 3.14. The capacity of a channel from X to Y in product form with transition matrix κ is the following supremum, taken over all probability measures P_X on X:

C_κ := sup_{P_X} I(X; Y).

Example 3.15 (Binary symmetric channel). Let X = Y = {0, 1} and consider the channel in product form from X to Y such that any bit is transmitted incorrectly with probability p ∈ [0, 1]. In other words, the transition matrix is

κ = ( 1−p    p  )
    (  p    1−p ).

The algebraic representation of this channel is Y = X ⊕ Z, where Z ∼ Ber(p) is independent of X (the notation ⊕ means addition modulo 2).
Computation shows that the capacity of this channel (denoted BSC(p)) is 1 − h(p) (Exercise). One can take the values p = 0, 1/2, 1 and interpret the corresponding results.
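The value 1 − h(p) can also be checked numerically (a sketch only, not the intended proof of the exercise): maximize I(X; Y) over Bernoulli inputs on a grid and compare.

import math

def h(t):
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def I_bsc(q, p):
    # I(X;Y) for BSC(p) with P(X = 1) = q: H(Y) - H(Y|X) = h(q(1-p) + (1-q)p) - h(p).
    return h(q * (1 - p) + (1 - q) * p) - h(p)

p = 0.11
capacity_numeric = max(I_bsc(k / 1000, p) for k in range(1001))
print(capacity_numeric, 1 - h(p))   # both about 0.5; the maximum is attained at q = 1/2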

Exercises

Exercise 3.8. Show that for a channel in product form, I(X; Y) is a continuous and concave function of P_X. Moreover, the concavity is strict iff the map P_X ↦ P_Y is injective. Deduce that the supremum defining C_κ is achieved (i.e. it is a maximum). When is the maximizer unique?
Exercise 3.9. Show that for fixed PX , I (X ; Y ) is a convex function of the probability kernel.


Exercise 3.10 (Binary erasure channel). We take the input alphabet X = {0, 1} and the output alphabet Y = {0, 1, e}, where e stands for an erasure. Consider the channel with transition matrix

κ = ( 1−p    0     p )
    (  0    1−p    p ).

Show that the capacity of this channel (denoted BEC(p)) is 1 − p.


Exercise 3.11. What is the capacity of the channel where we place n binary symmetric channels
BSC(p) in series?
Exercise 3.12 (Parallel independent channels). For i = 1, 2, let κi be a probability kernel of a
channel from Xi to Yi . We place them in parallel by considering the compound channel from
X1 × X2 to Y1 × Y2 and the probability kernel

∀(x 1 , x 2 , y1 , y2 ) ∈ X1 × X2 × Y1 × Y2 , κ(y1 , y2 |x 1 , x 2 ) := κ 1 (y1 |x 1 )κ 2 (y2 |x 2 ).

Let (X 1 , X 2 ) be a random input of the channel and (Y1 , Y2 ) the corresponding output.
1. Show that for (x 1 , x 2 , y1 , y2 ) ∈ X1 × X2 × Y1 × Y2 ,

P(Y1 = y1 |X 1 = x 1 , X 2 = x 2 , Y2 = y2 ) = κ 1 (y1 |x 1 ).

2. Prove that H (Y1 , Y2 |X 1 , X 2 ) = H (Y1 |X 1 ) + H (Y2 |X 2 ).


3. Show that if X 1 and X 2 are independent, so are Y1 and Y2 .
4. Deduce that Cκ = Cκ1 + Cκ2 .
Exercise 3.13. Let X = Y = {0, 1}^2 and let Z_1 ∼ Ber(1/2). Consider the channel from X to Y whose output (Y_1, Y_2) is related to the input (X_1, X_2) by

Y_1 := X_1 ⊕ Z_1,  Y_2 := X_2.

Calculate the capacity of this channel. Show that the capacity-achieving distribution of the output is unique, but that of the input is not.
Exercise 3.14 (Symmetric channel). A channel from X to Y is said to be symmetric if the rows of its transition matrix are permutations of each other, and so are the columns.
1. Let q be the probability distribution defined by the first row of the transition matrix. Show that the capacity of the channel is

C = log |Y| − H(q),

which is achieved by the uniform input distribution.
2. Show that the above result holds for weakly symmetric channels (i.e. channels whose transition matrix has rows that are permutations of each other and columns with equal sums).
3. Let L ≥ 2 be an integer and X = Y = {0, 1, . . . , L − 1}. Consider the channel where the output Y is related to the input X by Y := Z ⊕ X (mod L), where Z is a random variable with values in {0, 1, . . . , L − 1}, independent of X. Find the capacity of this channel.


Exercise 3.15 (Asymmetric erasure channel). Find the capacity of the channel with transition matrix

( 2/3 − α     α      1/3 )
(   α      2/3 − α   1/3 ).

4 Shannon's second theorem

In this chapter, we use the same notation as in the previous one. All information channels have product form. The base of logarithms is always 2.
An additional piece of notation: finite sets M_n = {1, . . . , M_n} (n ∈ N*), whose elements we call messages.
An encoding/decoding scheme using a channel with kernel κ consists of
1. encoding functions c_n : M_n → X^n for n ∈ N* (i.e. the message set depends on n),
2. decoding functions ĉ_n : Y^n → M_n for n ∈ N*.

We use ω to denote a generic message in M_n. The n sent symbols are those of the vector x := c_n(ω). The n received symbols (which are random because of the noisiness of the channel) are those of Y; the distribution of Y is κ^{⊗n}(·|x). The message estimate is Ŵ := ĉ_n(Y).
The error probability on a message ω ∈ M_n is

P_{e|ω}^{(n)} := P(Ŵ ≠ ω | X = c_n(ω)) = ∑_{y ∈ Y^n, ĉ_n(y) ≠ ω} κ^{⊗n}(y|x).

The average error probability is

P_e^{(n)} := (1/M_n) ∑_{ω ∈ M_n} P_{e|ω}^{(n)}.

The maximal error probability is

λ_n := max_{ω ∈ M_n} P_{e|ω}^{(n)}.

It is clear that λ_n ≥ P_e^{(n)}. We shall study the relationship between the error probability and the transmission rate

R_n := (log M_n) / n.

4.1 Jointly typical sequences and Shannon’s second theorem

Let (X_1, Y_1), . . . , (X_n, Y_n) be a random sample of (X, Y), an X × Y-valued random vector. One may wish to recall Definition 1.8 of typical vectors. For ε > 0, let A_{ε,X}^{(n)} (resp. A_{ε,Y}^{(n)}, resp. A_{ε,(X,Y)}^{(n)}) denote the subset of ε-typical vectors of X^n (resp. Y^n, resp. X^n × Y^n).


Definition 4.1. Elements of

A_ε^{(n)} := ( A_{ε,X}^{(n)} × A_{ε,Y}^{(n)} ) ∩ A_{ε,(X,Y)}^{(n)}

are called ε-jointly typical vectors in X^n × Y^n.

Equation (1.3) implies that P((X_1, . . . , X_n, Y_1, . . . , Y_n) ∈ A_ε^{(n)}) → 1 as n → ∞.

Proposition 4.2. If X_i is independent of Y_i for i = 1, . . . , n, we have

P((X_1, . . . , X_n, Y_1, . . . , Y_n) ∈ A_ε^{(n)}) ≤ 2^{−n(I(X;Y)−3ε)}.

Proof. Exercise. One can follow the proof of Proposition 1.9. □


Theorem 4.3 (Shannon's second theorem / Noisy-channel coding theorem). For a channel with probability kernel κ from X to Y, any transmission rate R < C_κ is achievable, i.e., there exists a channel encoding/decoding scheme c_n : M_n → X^n, ĉ_n : Y^n → M_n (for some message sets M_n) with transmission rate R_n ≥ R such that the maximal error probability λ_n tends to 0.

Proof. Take a distribution P_X on X such that I(X; Y) = C_κ (see Exercise 3.8 for the existence of such a P_X). Let ε := (C_κ − R)/5 and M_n′ := ⌈2^{nR+1}⌉ for n ∈ N*.

Step 1. Control the average error.
The idea is to use random codes. Generate a random matrix C_n = [X_i(ω)]_{ω ∈ M_n′, 1 ≤ i ≤ n} with i.i.d. entries, each having distribution P_X (the ω in the bracket is merely an index). We call this matrix a (random) codebook. For a realization of C_n, we define encoding functions c_n and decoding functions ĉ_n as follows:

∀ω ∈ M_n′,  c_n(ω) := (X_1(ω), . . . , X_n(ω));

for (y_1, . . . , y_n) ∈ Y^n, ĉ_n(y_1, . . . , y_n) := ω̂ if there is a unique ω̂ ∈ M_n′ such that (X_1(ω̂), . . . , X_n(ω̂), y_1, . . . , y_n) ∈ A_ε^{(n)}, and ĉ_n(y_1, . . . , y_n) := 1 otherwise.

For ω ∈ M_n′, let (Y_1(ω), . . . , Y_n(ω)) be the (random) channel output corresponding to the input (X_1(ω), . . . , X_n(ω)). Its conditional distribution is given by

∀(y_1, . . . , y_n) ∈ Y^n,  P((Y_1(ω), . . . , Y_n(ω)) = (y_1, . . . , y_n) | (X_1(ω), . . . , X_n(ω))) = ∏_{i=1}^n κ(y_i | X_i(ω)).

The expected average error probability (with respect to C_n) is

E[P_e^{(n)}] = E[ (1/M_n′) ∑_{ω ∈ M_n′} P(ĉ_n(Y_1(ω), . . . , Y_n(ω)) ≠ ω | C_n) ]
            = (1/M_n′) ∑_{ω ∈ M_n′} P(ĉ_n(Y_1(ω), . . . , Y_n(ω)) ≠ ω)
            = P(ĉ_n(Y_1(1), . . . , Y_n(1)) ≠ 1)    (by symmetry of the rows of C_n).


For ω ∈ M_n′, define the event

E_ω^{(n)} := { (X_1(ω), . . . , X_n(ω), Y_1(1), . . . , Y_n(1)) ∈ A_ε^{(n)} }.

It is clear that

{ ĉ_n(Y_1(1), . . . , Y_n(1)) ≠ 1 } ⊆ (E_1^{(n)})^c ∪ ⋃_{ω ∈ M_n′ \ {1}} E_ω^{(n)}.

Hence

E[P_e^{(n)}] = P(ĉ_n(Y_1(1), . . . , Y_n(1)) ≠ 1) ≤ ( 1 − P(E_1^{(n)}) ) + ∑_{ω ∈ M_n′ \ {1}} P(E_ω^{(n)}).    (4.1)

The first term on the right hand side of (4.1) tends to 0 by (1.3). Moreover, since the rows of C_n are independent, by Proposition 4.2,

∑_{ω ∈ M_n′ \ {1}} P(E_ω^{(n)}) ≤ M_n′ · 2^{−n(C_κ − 3ε)}
                               ≤ 2^{nR+2} · 2^{−n(C_κ − 3ε)}
                               ≤ 2^{−n(C_κ − R − 4ε)}    (for large n)
                               = 2^{−nε}.

The last term converges to 0 as n → ∞. Thus

lim_{n→∞} E[P_e^{(n)}] = 0.

In particular, for each n there is a realization of C_n whose average error probability P_e^{(n)} is at most E[P_e^{(n)}]; the associated (deterministic) encoding/decoding scheme therefore has P_e^{(n)} → 0.

Step 2. From average to maximal error.
This part and the part concerning the calculation of the transmission rate are left as an exercise. (Hint: observe that if the average of n numbers is δ, then at least ⌊n/2⌋ of them are at most 2δ.) □
Remark. The proof of Theorem 4.3 is probabilistic. No one has ever constructed the promised code.
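The random-coding idea is nevertheless easy to imitate experimentally. The sketch below (illustrative only) draws a random codebook for a BSC(p) and decodes by minimum Hamming distance instead of joint typicality; rates below capacity give few errors at moderate block lengths, rates above capacity do not.

import random

def simulate(n, R, p, trials=100):
    # Random code of rate roughly R over BSC(p), minimum-distance decoding; returns the error rate.
    M = max(2, int(2 ** (n * R)))                                             # number of messages
    errors = 0
    for _ in range(trials):
        book = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]   # random codebook
        w = random.randrange(M)                                               # message sent
        y = [b ^ (random.random() < p) for b in book[w]]                      # BSC(p) output
        dist = [sum(b != yi for b, yi in zip(cw, y)) for cw in book]          # Hamming distances
        errors += dist.index(min(dist)) != w
    return errors / trials

p = 0.11                          # capacity 1 - h(0.11) is about 0.5
print(simulate(16, 0.25, p))      # rate well below capacity: small error rate
print(simulate(16, 0.75, p))      # rate above capacity: decoding fails most of the time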

4.2 Weak converse to Theorem 4.3

In this section, we show that transmission rates larger than the channel capacity can never be achieved. We start with the following inequality.

Theorem 4.4 (Fano's inequality). Let (X, Y, X̂) be a random vector with value in X × Y × X such that X̂ ⊥ X given Y. Let P_e := P(X̂ ≠ X). Then

h(P_e) + P_e log |X| ≥ H(X | Y).    (4.2)


Proof. Let E := 1_{X̂ ≠ X}. By (3.11), we know that

H(X | X̂) + H(E | X, X̂) = H(E | X̂) + H(X | E, X̂).    (4.3)

If X and X̂ are given, E is deterministic, meaning

H(E | X, X̂) = 0.    (4.4)

Since E ∼ Ber(P_e), it follows from (3.4) that

H(E | X̂) ≤ H(E) = h(P_e).    (4.5)

By (3.10), we have

H(X | E, X̂) = H(X | X̂, E = 1) P_e + H(X | X̂, E = 0) (1 − P_e)
            ≤ P_e H(X) + (1 − P_e) H(X̂ | X̂, E = 0)    (by (3.4), since X = X̂ on {E = 0})
            ≤ P_e log |X| + (1 − P_e) H(X̂ | X̂)    (by Corollary 1.5 and (3.4))
            = P_e log |X|    (by (3.7)).

Plugging this, together with (4.4) and (4.5), into (4.3) yields

H(X | X̂) ≤ h(P_e) + P_e log |X|.

Finally, inequality (3.20) shows that I(X; X̂) ≤ I(X; Y), or equivalently

H(X | X̂) ≥ H(X | Y).

The result follows. □
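Fano's inequality is easy to verify on any concrete example where X̂ is a function of Y (so that X̂ ⊥ X given Y holds automatically); the joint table below is made up purely for illustration.

import math

def h(t):
    return 0.0 if t in (0.0, 1.0) else -t * math.log2(t) - (1 - t) * math.log2(1 - t)

joint = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.10, (1, 1): 0.40}    # P(X = x, Y = y)
pY = {y: sum(pr for (x2, y2), pr in joint.items() if y2 == y) for y in (0, 1)}
g = {y: max((0, 1), key=lambda x: joint[(x, y)]) for y in (0, 1)}   # X_hat = g(Y), the MAP guess

Pe = sum(pr for (x, y), pr in joint.items() if g[y] != x)           # P(X_hat != X)
H_X_given_Y = -sum(pr * math.log2(pr / pY[y]) for (x, y), pr in joint.items())
lhs = h(Pe) + Pe * math.log2(2)                                     # left side of (4.2), |X| = 2
print(lhs, ">=", H_X_given_Y, lhs >= H_X_given_Y)                   # inequality (4.2) holds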


Theorem 4.5. For a channel with probability kernel κ from X to Y and an encoding/decoding scheme c_n : M_n → X^n, ĉ_n : Y^n → M_n, we have the following results.
1. If lim_{n→∞} P_e^{(n)} = 0, then lim sup_{n→∞} R_n ≤ C_κ.
2. If lim inf_{n→∞} R_n > C_κ, then lim inf_{n→∞} P_e^{(n)} > 0.

Proof. Let W be a uniform random variable on M_n. Let (X_1, . . . , X_n) := c_n(W) be the input and Ŵ := ĉ_n(Y_1, . . . , Y_n) the estimator based on the channel output. Then W, (X_1, . . . , X_n), (Y_1, . . . , Y_n), Ŵ form a Markov chain. The error probability is

P_e^{(n)} := (1/M_n) ∑_{ω ∈ M_n} P(ĉ_n(Y_1, . . . , Y_n) ≠ ω | (X_1, . . . , X_n) = c_n(ω)) = P(Ŵ ≠ W).


It follows from Theorem 4.4 that

h(P_e^{(n)}) + P_e^{(n)} log M_n ≥ H(W | Y_1, . . . , Y_n).

Recall that H(W) = H(U(M_n)) = log M_n and that h(P_e^{(n)}) ≤ H(U(2)) = 1 (Corollary 1.5); hence

(1 − P_e^{(n)}) log M_n ≤ H(W) − H(W | Y_1, . . . , Y_n) + h(P_e^{(n)}) ≤ I(W; Y_1, . . . , Y_n) + 1.    (4.6)

We have

I(W; Y_1, . . . , Y_n) ≤ I(X_1, . . . , X_n; Y_1, . . . , Y_n)    (by (3.20))
                      = H(Y_1, . . . , Y_n) − H(Y_1, . . . , Y_n | X_1, . . . , X_n)
                      ≤ ∑_{i=1}^n H(Y_i) − ∑_{i=1}^n H(Y_i | X_1, . . . , X_n, Y_1, . . . , Y_{i−1})    (by (1.2) and (3.14))
                      = ∑_{i=1}^n H(Y_i) − ∑_{i=1}^n H(Y_i | X_i)    (since the channel is in product form)
                      = ∑_{i=1}^n I(X_i; Y_i)
                      ≤ nC_κ.

Combining this with (4.6) (note that R_n = (log M_n)/n), we get the following inequality, from which the result follows easily:

(1 − P_e^{(n)}) R_n ≤ C_κ + 1/n. □
4.3 Strong converse to Theorem 4.3

Theorem 4.5 claims that if the transmission rate is larger than the channel capacity, then the error probability is bounded away from 0. In this section, we show a stronger converse to Theorem 4.3: this error probability converges to 1 (i.e. the communication is completely unreliable).
Consider a channel with probability kernel κ from X to Y. For an input distribution P_X and (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n, define

I(x_1, . . . , x_n; y_1, . . . , y_n) := log [ κ^{⊗n}(y_1, . . . , y_n | x_1, . . . , x_n) / P_Y^{⊗n}(y_1, . . . , y_n) ] = ∑_{i=1}^n I(x_i; y_i).

One also defines the random variable

I(x_1, . . . , x_n; Y_1, . . . , Y_n),

where the vector (Y_1, . . . , Y_n) has distribution κ^{⊗n}(·| x_1, . . . , x_n).


Lemma 4.6. Let P_X be an input distribution such that I(X; Y) = C_κ. Then for all n ∈ N* and (x_1, . . . , x_n) ∈ X^n, we have E[I(x_1, . . . , x_n; Y_1, . . . , Y_n)] ≤ nC_κ.

Proof. It suffices to show the lemma for n = 1. Indeed, since the channel is in product form,

I(x_1, . . . , x_n; Y_1, . . . , Y_n) = ∑_{i=1}^n I(x_i; Y_i).

For x ∈ X, we have

E[I(x; Y)] = ∑_{y∈Y} κ(y|x) log ( κ(y|x) / P(Y = y) ) = D(κ(·|x); P_Y),

where P_Y is the output distribution induced by P_X. One can check, using the optimality of P_X in the concave maximization problem of Exercise 3.8, that D(κ(·|x); P_Y) ≤ C_κ for every x ∈ X. It follows that E[I(x; Y)] ≤ C_κ. □
Theorem 4.7 (Wolfowitz). Let M = {1, . . . , M} be a finite set of messages. Fix n ∈ N*. Let c : M → X^n be an encoding function and ĉ : Y^n → M a decoding function. If R := (log M)/n > C_κ, then

P_e ≥ 1 − 4A / (n(R − C_κ)²) − 2^{−n(R−C_κ)/2}

for some positive constant A depending only on κ, not on n or M.

Proof. Let P_X be an input distribution such that I(X; Y) = C_κ. For ω ∈ M, let

Y_ω := ĉ^{−1}(ω) = {y ∈ Y^n | ĉ(y) = ω}.

We seek to majorize the probability of correctly decoding a message,

1 − P_e = (1/M) ∑_{ω∈M} ∑_{y ∈ Y_ω} κ^{⊗n}(y | c(ω)).    (4.7)

Let ε := (R − C_κ)/2 and define, for ω ∈ M,

B_ω := {y ∈ Y^n | I(c(ω); y) > n(C_κ + ε)}.

Then for y ∈ B_ω^c, one has κ^{⊗n}(y | c(ω)) ≤ P_Y^{⊗n}(y) · 2^{n(C_κ+ε)}. It follows that

∑_{ω∈M} ∑_{y ∈ Y_ω ∩ B_ω^c} κ^{⊗n}(y | c(ω)) ≤ ∑_{ω∈M} ∑_{y ∈ Y_ω ∩ B_ω^c} P_Y^{⊗n}(y) · 2^{n(C_κ+ε)}
                                             ≤ 2^{n(C_κ+ε)} ∑_{ω∈M} ∑_{y ∈ Y_ω} P_Y^{⊗n}(y)
                                             = 2^{n(C_κ+ε)}
                                             = 2^{n(R+C_κ)/2}.    (4.8)

On the other hand, for ω ∈ M, write c(ω) = (x_1, . . . , x_n). Then, by Lemma 4.6,

E[I(c(ω); Y_1, . . . , Y_n)] ≤ nC_κ.

Thus

∑_{y ∈ Y_ω ∩ B_ω} κ^{⊗n}(y | c(ω)) ≤ ∑_{y ∈ B_ω} κ^{⊗n}(y | c(ω))
                                   = P(I(c(ω); Y_1, . . . , Y_n) > n(C_κ + ε))
                                   ≤ Var[I(c(ω); Y_1, . . . , Y_n)] / (n²ε²)    (Chebyshev's inequality)
                                   = (1/(n²ε²)) ∑_{i=1}^n Var[I(x_i; Y_i)]    (the channel is in product form)
                                   ≤ A / (nε²)
                                   = 4A / (n(R − C_κ)²),    (4.9)

where A := max_{x∈X} Var[I(x; Y)]. Plugging (4.8) and (4.9) into (4.7) yields

P_e = 1 − (1/M) ∑_{ω∈M} ∑_{y ∈ Y_ω} κ^{⊗n}(y | c(ω))
    = 1 − (1/M) ∑_{ω∈M} ∑_{y ∈ Y_ω ∩ B_ω^c} κ^{⊗n}(y | c(ω)) − (1/M) ∑_{ω∈M} ∑_{y ∈ Y_ω ∩ B_ω} κ^{⊗n}(y | c(ω))
    ≥ 1 − 2^{n(R+C_κ)/2} / M − 4A / (n(R − C_κ)²)
    = 1 − 2^{−n(R−C_κ)/2} − 4A / (n(R − C_κ)²),

the last equality coming from the fact that M = 2^{nR}. □
n
Corollary 4.8. For a channel with probability kernel κ from X to Y and an encoding/decoding scheme c_n : M_n → X^n, ĉ_n : Y^n → M_n, if lim inf_{n→∞} R_n > C_κ, then lim_{n→∞} P_e^{(n)} = 1.

Exercises

Exercise 4.1 (Noisy typewriter). Consider a channel with input and output alphabet X = Y =
{a, b, . . . , z}. For a given input letter, the output may be equal to the input letter or to the next
one, both with probability 1/2. What is the information capacity of this channel?


Exercise 4.2 (Cryptosystem). A cryptosystem consists of
1. finite sets X, Y and K,
2. an encoding function e : X × K → Y and a decoding function d : Y × K → X,
3. a random vector (X, Y, K) with values in X × Y × K (X is called the plaintext, Y the ciphertext and K the key) such that Y = e(X, K) and X = d(Y, K).
Show and interpret the following results.
1. H(X | Y) ≤ H(K | Y).
2. The cryptosystem is said to have perfect secrecy if I(X; Y) = 0. Show that perfect secrecy implies H(X) ≤ H(K).
3. The secure information of the cryptosystem is I(X; Y | K). It is full if I(X; Y | K) = H(X). Show that full secure information requires X to be independent of K.
Exercise 4.3 (Binary symmetric channel with memory). Let q ∈ [0, 1]. Consider a channel (not necessarily in product form) with X = Y = {0, 1} where the output is related to the input by

∀n ∈ N*,  Y_n := X_n ⊕ Z_n,

where Z_n ∼ Ber(q) and (Z_n)_{n∈N*} is independent of (X_n)_{n∈N*} (but the Z_n's are not necessarily independent of each other).
1. Show that H(X_1, . . . , X_n | Y_1, . . . , Y_n) = H(Z_1, . . . , Z_n).
2. The capacity of this channel is C := lim inf_{n→∞} (1/n) sup_{P_{(X_1,...,X_n)}} I(X_1, . . . , X_n; Y_1, . . . , Y_n). Compare the information capacities of the BSC channel with and without memory.
Exercise 4.4. Consider a time-varying (non-homogeneous) product-form channel with probability kernels κ_n (n ∈ N*) from X to Y. Show that

sup_{P_{(X_1,...,X_n)}} I(X_1, . . . , X_n; Y_1, . . . , Y_n) = ∑_{i=1}^n sup_{P_{X_i}} I(X_i; Y_i).

Exercise 4.5 (Fano's inequality is sharp). Let p ∈ (0, 1). By considering a random variable X with range X = {1, . . . , m} such that P(X = 1) = 1 − p and P(X = k) = p/(m − 1) for 2 ≤ k ≤ m, and taking Y to be a singleton, show that Fano's inequality (4.2) is sharp.
Exercise 4.6 (Feedback does not increase capacity). Show that Theorem 4.5 remains true if we consider encoding functions c_n of the form

c_n(ω) = (c_{n,1}(ω), c_{n,2}(ω), . . . , c_{n,n}(ω)),

where c_{n,i} : M_n × Y^{i−1} → X for i = 1, . . . , n.
