Shannon's Theorems: Math and Science Summer Program 2020
Shannon’s theorems
Contents
Notation
1 Shannon's entropy and source coding
2 Universal source coding
3 Communication channel
3.1 Conditional entropy
3.2 Mutual information
3.3 Information channel and product form
4 Shannon's second theorem
Notation
From now on, by a set we mean a discrete set (i.e. a set that is either finite or countably infinite), unless otherwise mentioned. The collection 2^X of all subsets of a given (discrete) set X forms a σ-algebra.
A probability measure on X is simply a function p : X → [0, 1] satisfying ∑_{x∈X} p(x) = 1. If A is any subset of X, we write p(A) to indicate the sum ∑_{x∈A} p(x) (and hence a non-negative function p on X is a probability measure iff p(X) = 1).
Whenever p is a probability measure on X and X is a random variable whose value is in X, we
write X ∼ p if the distribution PX of X coincides with p, i.e. if
∀x ∈ X, P(X = x) = p(x).
1 Shannon’s entropy and source coding
Remark. 1. The base D of the logarithm is usually omitted when it is understood. One often takes D = 2, which yields the binary entropy H_2.
2. We implicitly used the convention 0 · log 0 := 0 in Definition 1.1, obtained by continuously extending the function x ↦ x log x at x = 0.
Example 1.2. Let X = {0, 1}, p ∈ [0, 1] and X ∼ Ber(p), that is, P(X = 1) = p and P(X = 0) = 1 − p. The entropy of X,
H(X) = −p log p − (1 − p) log(1 − p),
is denoted h(p).
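The following small Python sketch (not part of the original notes; the function name binary_entropy is ours) evaluates h(p), using the convention 0 · log 0 = 0 from the Remark above.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Ber(p) random variable."""
    if p in (0.0, 1.0):
        return 0.0          # the convention 0 * log 0 = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0, the maximum
print(binary_entropy(0.11))  # roughly 0.5
```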
Proposition 1.3. Let X be a random variable with value in X. Its entropy H (X ) enjoys the
following properties.
1. H(X) ≥ 0 and equality holds iff X is (almost surely) deterministic.
2. Let f : X → Y be a deterministic function. One has H(X) ≥ H(f(X)) and equality holds iff f is injective.
Proof. Exercise.
Theorem 1.4 (Gibbs' inequality). If p and q are two probability measures on X, there holds
H(p) ≤ − ∑_{x∈X} p(x) log q(x). (1.1)
The right hand side of (1.1) is called the cross entropy between p and q, denoted H (p; q).
Proof. Exercise.
Corollary 1.5. When X = {1, . . . , n}, the uniform distribution U(n) on X maximizes the
entropy.
Corollary 1.6. Let X = N∗ and let X be a random variable taking value in N∗. Let µ ≥ 1 be given. If EX = µ, the entropy of X is maximized when X has the geometric distribution Geom(1/µ).
The inequality (1.1) also implies the following
Proposition 1.7. Let X = (X_1, . . . , X_n) be a random vector with value in X = X_1 × · · · × X_n. One has the inequality
H(X) ≤ H(X_1) + · · · + H(X_n). (1.2)
Moreover the equality holds iff X_1, . . . , X_n are independent.
Proof. Exercise.
Exercises
Exercise 1.1 (Entropy of a homogeneous Markov chain). Let (X_n)_{n∈N∗} be random variables with value in X such that the following conditions hold.
1. For each n ∈ N∗, X_{n+1} and (X_1, . . . , X_{n−1}) are independent given X_n.
2. There exists a right stochastic matrix P = [p_{x,y}]_{x,y∈X} such that P(X_{n+1} = y | X_n = x) = p_{x,y} for all n ∈ N∗ and all x, y ∈ X.
In this section, let |X| = D ∈ N∗ \ {1} and let p be a probability measure on X. Let X = (X_1, . . . , X_n) be a vector of i.i.d. random variables with value in X and common distribution p, so that X takes value in X^n.
Definition 1.8. Let ε > 0. A realization x = (x_1, . . . , x_n) ∈ X^n of X is called ε-typical if
| −(1/n) ∑_{i=1}^n log_D p(x_i) − H_D(p) | ≤ ε.
We denote by A_ε^{(n)} the subset of ε-typical vectors in X^n. Asymptotically, typical vectors concentrate probability, that is,
lim_{n→∞} P(X ∈ A_ε^{(n)}) = 1. (1.3)
Hence
1 ≥ P(X ∈ A_ε^{(n)}) = ∑_{x∈A_ε^{(n)}} P(X = x) ≥ |A_ε^{(n)}| · D^{−n(H_D(p)+ε)},
so that |A_ε^{(n)}| ≤ D^{n(H_D(p)+ε)}.
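A hedged numerical illustration of (1.3), not part of the original notes: for i.i.d. symbols drawn from a fixed p, the quantity −(1/n) ∑_i log_D p(x_i) concentrates around H_D(p), so the fraction of ε-typical samples approaches 1. All helper names below are ours.

```python
import math, random

def H(p, D=2):
    """Entropy of the pmf p (a dict symbol -> probability), base D."""
    return -sum(q * math.log(q, D) for q in p.values() if q > 0)

def is_typical(x, p, eps, D=2):
    """Is the word x eps-typical for the source p (Definition 1.8)?"""
    val = -sum(math.log(p[s], D) for s in x) / len(x)
    return abs(val - H(p, D)) <= eps

p = {"a": 0.7, "b": 0.2, "c": 0.1}   # a toy source on X = {a, b, c}
eps, trials = 0.05, 2000
for n in (10, 100, 1000):
    hits = sum(
        is_typical(random.choices(list(p), weights=p.values(), k=n), p, eps)
        for _ in range(trials)
    )
    print(n, hits / trials)   # the fraction grows towards 1 as n increases
```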
The next proposition shows that there are no 'smaller' sets that (asymptotically) concentrate probability.
Proposition 1.10. Let R > 0 and suppose that for each n ∈ N∗ there exists a subset B^{(n)} ⊆ X^n of cardinality at most D^{nR} such that
lim sup_{n→∞} P(X ∈ B^{(n)}) = 1.
Then R ≥ H_D(p).
Proof. Let ε > 0. By similar arguments to those in the proof of Proposition 1.9, if x = (x_1, . . . , x_n) is ε-typical, we have P(X = x) ≤ D^{−n(H_D(p)−ε)}. Consequently,
P(X ∈ A_ε^{(n)} ∩ B^{(n)}) = ∑_{x∈A_ε^{(n)}∩B^{(n)}} P(X = x) ≤ |B^{(n)}| · D^{−n(H_D(p)−ε)} ≤ D^{−n(H_D(p)−R−ε)}.
Let n_1 < n_2 < · · · be a strictly increasing sequence of positive integers such that lim_{k→∞} P(X ∈ B^{(n_k)}) = 1. Then P(X ∈ A_ε^{(n_k)} ∩ B^{(n_k)}) → 1 as k → ∞, and it follows that the term D^{−n_k(H_D(p)−R−ε)} remains bounded away from 0, which implies H_D(p) − R − ε ≤ 0. Letting ε ↓ 0 yields the result.
Our goal is to encode the source message X (consisting of n i.i.d. symbols) using a (hopefully smaller) number of symbols than n. The question is: at what compression rate can one correctly decode the encoded message (at least with high probability)?
Definition 1.11. Let Y be a finite set, which we will call an alphabet. By an encoding/decoding scheme, we mean the following data for each n ∈ N∗:
1. a positive integer n_0 and an encoding function c_n : X^n → Y^{n_0},
2. a decoding function ĉ_n : Y^{n_0} → X^n.
The compression rate of the given encoding/decoding scheme is R_n := n_0/n. Let Y := c_n(X) be the encoded message and X̂ := ĉ_n(Y) be the decoded message. We will now concern ourselves with the error probability of the scheme,
P_e^{(n)} := P(X̂ ≠ X),
given the compression rate. For simplicity, let us take Y = X. The general case is left as an exercise.
Theorem 1.12 (Shannon's first theorem / Noiseless coding theorem).
1. For any R > H_D(p), there exist encoding functions c_n : X^n → X^{⌈nR⌉} and decoding functions ĉ_n : X^{⌈nR⌉} → X^n such that lim_{n→∞} P_e^{(n)} = 0.
2. For any R < H_D(p) and any encoding/decoding scheme c_n : X^n → X^{⌊nR⌋}, ĉ_n : X^{⌊nR⌋} → X^n, we have lim inf_{n→∞} P_e^{(n)} > 0.
Proof. 1. Let ε ∈ (0, R − H_D(p)). By Proposition 1.9, for each n ∈ N∗, we can find an injection f_n : A_ε^{(n)} → X^{⌈nR⌉} and an element x∗ ∈ X^{⌈nR⌉} \ f_n(A_ε^{(n)}). Consider the following encoding/decoding scheme:
∀x ∈ X^n, c_n(x) := f_n(x) if x ∈ A_ε^{(n)}, and c_n(x) := x∗ otherwise;
∀y ∈ X^{⌈nR⌉}, ĉ_n(y) := f_n^{−1}(y) if y ∈ f_n(A_ε^{(n)}), and ĉ_n(y) is arbitrary otherwise.
It is clear that ĉ_n(c_n(x)) = x for all x ∈ A_ε^{(n)}. Consequently, P_e^{(n)} ≤ P(X ∉ A_ε^{(n)}), which converges to 0 as n tends to ∞ by (1.3).
2. We define for each n ∈ N∗,
B^{(n)} := {x ∈ X^n | ĉ_n(c_n(x)) = x}.
Since c_n is injective on B^{(n)}, this set has cardinality at most D^{⌊nR⌋} ≤ D^{nR}, and P(X ∈ B^{(n)}) = 1 − P_e^{(n)}. If we had lim inf_{n→∞} P_e^{(n)} = 0, then lim sup_{n→∞} P(X ∈ B^{(n)}) = 1, and Proposition 1.10 would give R ≥ H_D(p), contradicting R < H_D(p). Hence lim inf_{n→∞} P_e^{(n)} > 0.
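To make the typical-set scheme from part 1 of the proof concrete, here is a hedged Python sketch for a small binary source (so D = 2): it enumerates A_ε^{(n)} by brute force, encodes a typical word by its index (standing for its ⌈nR⌉-bit binary expansion, i.e. the injection f_n) and sends a reserved codeword otherwise. All names and the brute-force enumeration are ours; the code only illustrates the construction and is not meant to be efficient.

```python
import itertools, math

def typical_set(alphabet, p, n, eps, D):
    """All eps-typical words of length n for the source p (Definition 1.8)."""
    Hp = -sum(p[s] * math.log(p[s], D) for s in alphabet)
    A = []
    for x in itertools.product(alphabet, repeat=n):
        val = -sum(math.log(p[s], D) for s in x) / n
        if abs(val - Hp) <= eps:
            A.append(x)
    return A

alphabet, p, D = ("a", "b"), {"a": 0.8, "b": 0.2}, 2
n, R, eps = 8, 0.95, 0.15          # R > H_2(p) ~= 0.722
A = typical_set(alphabet, p, n, eps, D)
m = math.ceil(n * R)               # codewords use ceil(nR) D-ary symbols
assert len(A) < D ** m             # room for an extra reserved codeword

encode = {x: i for i, x in enumerate(A)}   # the injection f_n
decode = {i: x for x, i in encode.items()}
reserved = len(A)                          # plays the role of x*

def c_n(x):                        # encoder from part 1 of the proof
    return encode.get(tuple(x), reserved)

def c_hat_n(i):                    # decoder; arbitrary outside f_n(A)
    return decode.get(i, A[0])

x = ("a",) * 6 + ("b",) * 2
print(c_hat_n(c_n(x)) == x)        # True: typical words are recovered exactly
```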
Exercises
Exercise 1.2. State and prove Theorem 1.12 in the general case (where Y is not necessarily
equal to X).
2 Universal source coding
This chapter deals with an extension of Theorem 1.12 to the case where the distribution of the source symbols is unknown. In other words, we will show that for any coding rate R > 0, there is an asymptotically error-free encoding/decoding scheme that is universal for all probability distributions of i.i.d. source symbols whose entropy is smaller than R.
Let X be a set of cardinality D ∈ N∗ \ {1}.
The base of the logarithm in Definition 2.2 can be any real number greater than 1. The conventions used here are 0 · log(0/y) = 0 for y ≥ 0 and x · log(x/0) = +∞ for x > 0.
By inequality (1.1), the Kullback–Leibler divergence D(p; q) := ∑_{x∈X} p(x) log(p(x)/q(x)) of Definition 2.2 satisfies D(p; q) ≥ 0, and equality holds iff p = q (be warned, the Kullback–Leibler divergence is not symmetric!).
Lemma 2.3. Let p be a probability measure on X. We have
1. for any vector x ∈ X^n, p^{⊗n}(x) = 2^{−n(H_2(p_x)+D_2(p_x;p))},
2. for any empirical distribution q ∈ P_n, p^{⊗n}(T_n(q)) ≤ 2^{−nD_2(q;p)}.
Proof. Exercise. For the inequality, one can make use of (2.1).
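A hedged numerical check of item 1 of Lemma 2.3, assuming the usual definitions of the empirical distribution p_x of a word x and of the divergence D_2 (from the omitted part of this chapter); the helper names below are ours.

```python
import math
from collections import Counter

def H2(q):
    """Entropy of the pmf q in bits."""
    return -sum(v * math.log2(v) for v in q.values() if v > 0)

def D2(q, p):
    """Kullback-Leibler divergence in bits (assumes supp q is contained in supp p)."""
    return sum(v * math.log2(v / p[s]) for s, v in q.items() if v > 0)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
x = "aaaaaabbcc"                                   # a word of length n = 10
n = len(x)
p_x = {s: c / n for s, c in Counter(x).items()}    # empirical distribution of x

lhs = math.prod(p[s] for s in x)                   # p^{(x) n}(x), the product probability
rhs = 2 ** (-n * (H2(p_x) + D2(p_x, p)))
print(lhs, rhs)        # the two numbers agree up to rounding error
```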
Exercises
Exercise 2.4. Show that for a random vector (X, Y) with finite entropies,
H(X) + H(Y) − H(X, Y) = D(P_{(X,Y)}; P_X ⊗ P_Y).
We are now ready to state and prove an extension of Theorem 1.12. For simplicity, let us take
the binary encoding alphabet {0, 1}.
Theorem 2.4 (Universal source coding). Let R > 0. There exist functions c_n : X^n → {0, 1}^{⌈nR⌉}, ĉ_n : {0, 1}^{⌈nR⌉} → X^n such that for any probability measure p on X satisfying H_2(p) < R and any sequence X_1, X_2, . . . of i.i.d. random variables with value in X and distribution p,
lim_{n→∞} P_e^{(n)} = 0.
Proof. The proof of Theorem 1.12 can be mimicked, provided that we find subsets A^{(n)} ⊆ X^n (depending only on R) of cardinality strictly less than 2^{nR} such that for any probability measure p on X with entropy less than R, we have
lim_{n→∞} p^{⊗n}(A^{(n)}) = 1. (2.2)
To do this, take
A^{(n)} := {a ∈ X^n | H_2(p_a) ≤ R_n},
where
R_n := R − D log_2(n + 2)/n.
Then
|A^{(n)}| ≤ ∑_{q∈P_n, H_2(q)≤R_n} 2^{nH_2(q)}   (by (2.1))
≤ |P_n| · 2^{nR_n}
≤ (n + 1)^D · 2^{nR}/(n + 2)^D
< 2^{nR}.
It suffices to show that (2.2) holds for any probability measure p on X such that H_2(p) < R. Take ε ∈ (0, R − H_2(p)). Let P denote the set of all probability measures on X and let q∗ minimize D(q; p) over {q ∈ P | H_2(q) ≥ H_2(p) + ε} (Exercise: argue that q∗ exists and that D(q∗; p) > 0). For large n, one has R_n ≥ H_2(p) + ε, thus D(q; p) ≥ D(q∗; p) for all q ∈ P with H_2(q) ≥ R_n. Observe that
p^{⊗n}(X^n \ A^{(n)}) = ∑_{q∈P_n, H_2(q)>R_n} p^{⊗n}(T_n(q))
≤ ∑_{q∈P_n, H_2(q)>R_n} 2^{−nD_2(q;p)}   (by Lemma 2.3)
≤ |P_n| · 2^{−nD_2(q∗;p)} ≤ (n + 1)^D · 2^{−nD_2(q∗;p)}.
The last term converges to 0 as n → ∞ since D(q∗; p) > 0. This finishes the proof.
Exercises
Exercise 2.5 (Bernoulli source-symbols). For a vector x ∈ {0, 1}^n, let K(x) denote its number of 1's. Let B(x) be the lexicographical rank of x among all vectors of {0, 1}^n with exactly K(x) 1's. We then have a bijection
c_n′ : {0, 1}^n → {(k, b) | k ∈ {0, . . . , n}, b ∈ {1, . . . , (n choose k)}}, c_n′(x) := (K(x), B(x)).
Consider the encoding function c_n on {0, 1}^n defined as follows: c_n(x) is the concatenation of the binary representations of K(x) and B(x). We shall study the asymptotic length of c_n(x).
(see the definition of h_2(p) in Example 1.2). Moreover, show that if 12np(1 − p) ≥ 9,
(n choose np) · 2^{−nh_2(p)} ≥ 1/√(8np(1 − p)).
2. Let x_1, x_2, . . . ∈ {0, 1} be such that (1/n)K(x_1, . . . , x_n) → p ∈ (0, 1) as n → ∞. Let |·| denote the length of a binary sequence. Show that
lim_{n→∞} (1/n)|c_n(x_1, . . . , x_n)| = h_2(p).
See Section 14.1.1 (A First Example) in Pierre Brémaud, Discrete Probability Models and Methods, Springer, 2017, for universal source coding of binary sequences.
3 Communication channel
In this chapter, X, Y and Z are always finite sets. The base of the logarithm is always 2 unless otherwise specified. To simplify notation, we use the convention P(A|B) = 0 when P(B) = 0, where A and B are two events.
2.
H(X | Y = y) ≥ 0 (3.2)
for all y ∈ Y. Moreover, H(X | Y) ≥ 0 and equality holds iff X = f(Y), where f : Y → X is a deterministic function.
3. (Chain rule)
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y). (3.3)
4. If H(Y) < ∞, then
H(X | Y) ≤ H(X) (3.4)
and equality holds iff X ⊥ Y.
H(Y | f(X)) ≥ H(Y | X) (3.5)
and
∀i ∈ {1, . . . , n}, H(X_i | Y) ≤ H(X_1, . . . , X_n | Y). (3.9)
where
H(X | Y, Z = z) := − ∑_{x∈X, y∈Y} P(X = x, Y = y | Z = z) log P(X = x | Y = y, Z = z).
2. (Chain rule)
H(X, Y | Z) = H(X | Z) + H(Y | X, Z) = H(Y | Z) + H(X | Y, Z). (3.11)
3. If H(Y | Z) < ∞, then
H(X | Y, Z) ≤ H(X | Y). (3.12)
Exercises
Exercise 3.1. Show the sequential conditional chain rules: If (X 1 , . . . , X n , Y ) is a random vector
with value in X1 × · · · × Xn × Y , we have
H(X_1, . . . , X_n) = ∑_{i=1}^n H(X_i | X_1, . . . , X_{i−1}) (3.13)
and
H(X_1, . . . , X_n | Y) = ∑_{i=1}^n H(X_i | X_1, . . . , X_{i−1}, Y). (3.14)
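As a sanity check (not a proof), identity (3.13) can be verified numerically on a small joint distribution. The following hedged Python sketch, with names of our choosing, compares H(X_1, X_2, X_3) with the sum of conditional entropies computed directly from the joint pmf.

```python
import itertools, math

# A toy joint pmf for (X1, X2, X3) on {0, 1}^3, normalized to sum to 1.
weights = {xs: 1 + xs[0] + 2 * xs[1] * xs[2]
           for xs in itertools.product((0, 1), repeat=3)}
Z = sum(weights.values())
pmf = {xs: w / Z for xs, w in weights.items()}

def cond_H(i):
    """H(X_{i+1} | X_1, ..., X_i), computed directly from the joint pmf."""
    total = 0.0
    for xs, pr in pmf.items():
        p_past = sum(q for ys, q in pmf.items() if ys[:i] == xs[:i])
        p_cur = sum(q for ys, q in pmf.items() if ys[:i + 1] == xs[:i + 1])
        total += -pr * math.log2(p_cur / p_past)
    return total

lhs = -sum(pr * math.log2(pr) for pr in pmf.values())   # H(X1, X2, X3)
rhs = sum(cond_H(i) for i in range(3))                  # sum of H(Xi | past)
print(lhs, rhs)   # the two values coincide, as (3.13) predicts
```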
Definition 3.4. For a random vector (X, Y) with value in X × Y, we define the mutual information between X and Y as
I(X; Y) := H(X) + H(Y) − H(X, Y).
Exercise 2.4 says that I(X; Y) = D(P_{(X,Y)}; P_X ⊗ P_Y). The chain rule (3.3) says that
I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = I(Y; X).
We state without proof some basic properties of mutual information.
Proposition 3.5. Let (X , Y ) be a random vector with value in X × Y. Suppose that H (X ) and
H (Y ) are finite. Then we have the following results.
1.
I(X; X) = H(X). (3.15)
2.
I(X; Y) ≥ 0, (3.16)
with equality iff X ⊥ Y.
3.
I(X; Y) ≤ I(Y; Y) = H(Y), (3.17)
with equality iff Y is a deterministic function of X.
4. For any function f : X → Y,
I(f(X); Y) ≤ I(X; Y), (3.18)
with equality iff Y ⊥ X given f(X).
5. (Data processing inequality) If Z is a random variable with value in Z and finite entropy, then
I(X, Y; Z) ≥ I(Y; Z), (3.19)
with equality iff Z ⊥ X given Y.
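The following hedged Python sketch illustrates Definition 3.4 and properties (3.15)-(3.16) on toy joint distributions; mutual_information and the example pmfs are our own choices.

```python
import math

def H(pmf):
    """Entropy (in bits) of a pmf given as a dict value -> probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint pmf {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return H(px) + H(py) - H(joint)

# A toy joint distribution on {0, 1} x {0, 1}: dependent, so I(X; Y) > 0.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.15, (1, 1): 0.35}
print(mutual_information(joint))              # non-negative, property (3.16)

# I(X; X) = H(X): put all mass on the diagonal, property (3.15).
px = {0: 0.55, 1: 0.45}
diag = {(x, x): p for x, p in px.items()}
print(mutual_information(diag), H(px))        # the two values agree
```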
Exercises
Exercise 3.2. Let (X_1, . . . , X_n) be a Markov chain with finite entropies and let 1 ≤ k ≤ ℓ ≤ n. Show that
I(X_1; X_n) ≤ I(X_k; X_ℓ). (3.20)
Exercise 3.3. If X 1 , X 2 , Y1 , Y2 are random variables with finite entropy such that (X 1 , X 2 ) and
(Y1 , Y2 ) are independent, then
I (X 1 , Y1 ; X 2 , Y2 ) = I (X 1 ; X 2 ) + I (Y1 ; Y2 ). (3.21)
Exercise 3.4. Define the conditional version of mutual information, state and prove its basic
properties.
Exercise 3.5. Show Kolmogorov's formula: if X, Y and Z have finite entropy, then
I (X ; Y , Z ) = I (X ; Z ) + I (X ; Y |Z ). (3.22)
Exercise 3.6 (yet another chain rule). 1. Let X, Y, Z and U be random variables with finite entropy and with ranges X, Y, Z and U respectively. We have
I (X ; Y , Z |U ) = I (X ; Z |U ) + I (X ; Y |Z , U ). (3.23)
Having studied the concepts of mutual information, we are now ready to define an information
channel and its capacity.
Definition 3.6. By an information channel with input alphabet X and output alphabet Y, we mean a family of non-negative functions
κ_n : Y^n × X^n → [0, 1], (y, x) ↦ κ_n(y | x),
where n ∈ N∗ (called the probability kernels of the channel), meaning that for each x ∈ X^n, the function κ_n(· | x) is a probability measure on Y^n.
For n ∈ N∗, the channel takes a random vector X = (X_1, . . . , X_n) with value in X^n as input (the emitted word) and produces a random vector Y = (Y_1, . . . , Y_n) with value in Y^n (the received word) in such a way that
P(Y_1 = y_1, . . . , Y_n = y_n | X_1 = x_1, . . . , X_n = x_n) = κ_n(y_1, . . . , y_n | x_1, . . . , x_n)
for all (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n.
Proof. Exercise.
Proposition 3.9. If a channel is memoryless and without feedback, then for any n ≥ 2, any distribution of X, any j ∈ {2, . . . , n} and any (x_1, . . . , x_j, y_1, . . . , y_j) ∈ X^j × Y^j,
P(Y_1 = y_1, . . . , Y_j = y_j | X_1 = x_1, . . . , X_j = x_j) = ∏_{k=1}^j P(Y_k = y_k | X_k = x_k).
Proof. Exercise.
Proposition 3.10. If for any n ≥ 2, any distribution of X and any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n, we have
P(Y_1 = y_1, . . . , Y_n = y_n | X_1 = x_1, . . . , X_n = x_n) = ∏_{k=1}^n P(Y_k = y_k | X_k = x_k), (3.26)
Proof. Exercise.
Proposition 3.11. For a given channel, the following are equivalent.
1. The channel is memoryless and without feedback.
2. (3.26) holds for any n ≥ 2, any distribution of X and any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n.
3. The channel is memoryless and without anticipation.
Proof. Exercise.
The matrix [κ(y|x)]_{x∈X, y∈Y} is called the transition matrix of the channel.
Proposition 3.13. A channel is memoryless, without feedback and time-invariant with transition matrix [κ(y|x)]_{x∈X, y∈Y} iff for any (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n,
κ_n(y_1, . . . , y_n | x_1, . . . , x_n) = ∏_{i=1}^n κ(y_i | x_i).
Proof. Exercise.
For this reason, a channel that is memoryless, without feedback and time-invariant is said to have (homogeneous) product form.
Let X be a generic input symbol and Y the output. Then the distribution of (X , Y ) is a function
of PX and the probability kernel κ.
Definition 3.14. The capacity of a channel from X to Y in product form with transition matrix κ is the following supremum, taken over all probability measures P_X on X:
C_κ := sup_{P_X} I(X; Y).
Example 3.15 (Binary symmetric channel). Let X = Y = {0, 1} and consider the channel in
product form from X to Y such that any bit is transmitted incorrectly with probability p ∈ [0, 1].
In other words, the transition matrix is
κ = [ 1 − p      p
        p      1 − p ].
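A hedged numerical sketch for this example (not part of the original notes): since I(X; Y) is a concave function of P_X (Exercise 3.8), a crude sweep over input laws for the binary symmetric channel locates the supremum; it is attained at the uniform input and equals 1 − h(p), the classical BSC capacity. The code and its names are ours.

```python
import math

def h(p):
    """Binary entropy h(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(px, kappa):
    """I(X; Y) for an input pmf px over {0, 1} and transition matrix kappa[x][y]."""
    py = [sum(px[x] * kappa[x][y] for x in range(2)) for y in range(2)]
    i = 0.0
    for x in range(2):
        for y in range(2):
            pxy = px[x] * kappa[x][y]
            if pxy > 0 and py[y] > 0:
                i += pxy * math.log2(pxy / (px[x] * py[y]))
    return i

p = 0.1                                      # crossover probability
kappa = [[1 - p, p], [p, 1 - p]]             # the BSC transition matrix
best = max(mutual_information([q, 1 - q], kappa)
           for q in (k / 1000 for k in range(1001)))   # crude sweep over inputs
print(best, 1 - h(p))                        # both values are close to C = 1 - h(p)
```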
Exercises
Exercise 3.8. Show that for a channel in product form, I (X ; Y ) is a continuous and concave
function of PX . Moreover, concavity is strict iff the map PX 7→ PY is injective. Deduce that the
supremum I (X ; Y ) is achieved (i.e. it is a maximum). When is the maximizer unique?
Exercise 3.9. Show that for fixed PX , I (X ; Y ) is a convex function of the probability kernel.
Exercise 3.10 (Binary erasure channel). We take the input alphabet X = {0, 1} and the output alphabet Y = {0, 1, e}, where e stands for an erasure. Consider the channel with transition matrix
κ = [ 1 − p     0      p
        0     1 − p    p ].
Let (X 1 , X 2 ) be a random input of the channel and (Y1 , Y2 ) the corresponding output.
1. Show that for (x_1, x_2, y_1, y_2) ∈ X^2 × Y^2,
P(Y1 = y1 |X 1 = x 1 , X 2 = x 2 , Y2 = y2 ) = κ 1 (y1 |x 1 ).
Y1 := X 1 ⊕ Z 1 , Y2 := X 2 .
Calculate the capacity of this channel. Show that the capacity-achieving distribution of output
is unique, but that of input is not.
Exercise 3.14 (Symmetric channel). A channel from X to Y is said to be symmetric if the rows of its transition matrix are permutations of each other, and so are the columns.
1. Let q be the probability measure on Y defined by the first row of the transition matrix. Show that the capacity of the channel is
C = log |Y| − H(q),
and that it is achieved by the uniform input distribution.
2. Show that the above result holds for weakly symmetric channels (i.e. channels where the rows of the transition matrix are permutations of each other, and the columns have equal sums).
3. Let L > 2 be an integer and X = Y = {0, 1, . . . , L − 1}. Consider the channel where the
output Y is related to the input X by Y := Z ⊕ X (mod L) where Z is a random variable
with value in {0, 1, . . . , L − 1}, independent of X . Find the capacity of this channel.
Exercise 3.15 (Asymmetric erasure channel). Find the capacity of the channel with transition
matrix
[ 2/3 − α      α      1/3
     α      2/3 − α   1/3 ].
4 Shannon’s second theorem
In this chapter, we use the same notation as in the previous one. All information channels have product form. The base of logarithms is always 2.
We additionally use finite sets M_n = {1, . . . , M_n} (n ∈ N∗), whose elements we call messages.
An encoding/decoding scheme using a channel with kernel κ consists of
1. encoding functions c n : Mn → X n for n ∈ N∗ (i.e. the message set depends on n),
2. decoding functions ĉ n : Y n → Mn for n ∈ N∗ .
We use ω to denote a generic message in M_n. The n sent symbols are those of the vector x := c_n(ω). The n received symbols (which are random because of the noisiness of the channel) are those of Y (the distribution of Y is κ^{⊗n}(· | x)). The message estimate is Ŵ := ĉ_n(Y).
The error probability on a message ω ∈ Mn is
P_{e|ω}^{(n)} := P(Ŵ ≠ ω | X = c_n(ω)) = ∑_{y∈Y^n, ĉ_n(y)≠ω} κ^{⊗n}(y | x).
It is clear that λ_n ≥ P_e^{(n)}, where λ_n := max_{ω∈M_n} P_{e|ω}^{(n)} denotes the maximal error probability and P_e^{(n)} := M_n^{−1} ∑_{ω∈M_n} P_{e|ω}^{(n)} the average error probability. We shall proceed to study the relationship between the error probability and the transmission rate
R_n := log M_n / n.
Proof. Take a distribution P_X on X such that I(X; Y) = C_κ (see Exercise 3.8 for the existence of such a P_X). Let ε := (C_κ − R)/5 and M_n′ := 2^{nR+1} for n ∈ N∗.
Step 1. Control the average error.
The idea is to use random codes. Generate a random matrix C_n = [X_i(ω)]_{ω∈M_n′, 1≤i≤n} with i.i.d. entries, each having distribution P_X (the ω in the bracket is merely an index). We call this matrix a (random) codebook. For a realization of C_n, we define encoding functions c_n and decoding functions ĉ_n as follows.
∀ω ∈ M_n′, c_n(ω) := (X_1(ω), . . . , X_n(ω)).
∀(y_1, . . . , y_n) ∈ Y^n, ĉ_n(y_1, . . . , y_n) := ω̂ if there exists a unique ω̂ ∈ M_n′ such that (X_1(ω̂), . . . , X_n(ω̂), y_1, . . . , y_n) ∈ A_ε^{(n)}, and ĉ_n(y_1, . . . , y_n) := 1 otherwise.
For ω ∈ M_n′, let (Y_1(ω), . . . , Y_n(ω)) be the (random) channel output related to the (deterministic) input (X_1(ω), . . . , X_n(ω)). Its distribution is given by
∀(y_1, . . . , y_n) ∈ Y^n, P((Y_1(ω), . . . , Y_n(ω)) = (y_1, . . . , y_n) | (X_1(ω), . . . , X_n(ω))) = ∏_{i=1}^n κ(y_i | X_i(ω)).
It is clear that
{ĉ_n(Y_1(1), . . . , Y_n(1)) ≠ 1} ⊆ (E_1^{(n)})^c ∪ ⋃_{ω∈M_n′\{1}} E_ω^{(n)}.
Hence
E[P_e^{(n)}] = P(ĉ_n(Y_1(1), . . . , Y_n(1)) ≠ 1) ≤ 1 − P(E_1^{(n)}) + ∑_{ω∈M_n′\{1}} P(E_ω^{(n)}). (4.1)
The first term on the right-hand side of (4.1) tends to 0 by (1.3). Moreover, since the rows of C_n are independent, by Proposition 4.2,
∑_{ω∈M_n′\{1}} P(E_ω^{(n)}) ≤ M_n′ · 2^{−n(C_κ−3ε)},
which converges to 0 as n → ∞ because C_κ − R − 3ε = 2ε > 0.
By Fatou's lemma, there is a realization of C_n such that the associated encoding/decoding scheme has average error probability P_e^{(n)} tending to 0.
Step 2. From average to maximal error.
This part and the part concerning the calculation of transmission rate are left as an exercise.
(Hint: Observe that if the average of n numbers is δ, then at least ⌊n/2⌋ of them are at most 2δ.)
Remark. The proof of Theorem 4.3 is probabilistic. No one has ever constructed the promised code.
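To get a feeling for the random-coding argument, here is a hedged Monte Carlo sketch over a binary symmetric channel. It replaces the joint-typicality decoder of the proof by minimum-Hamming-distance decoding (maximum likelihood for the BSC with p < 1/2); the parameters, names and the experiment itself are ours and purely illustrative.

```python
import random

def simulate(n, R, p, trials=500):
    """Empirical average error of a uniformly random binary codebook over a
    BSC(p), decoded by minimum Hamming distance."""
    M = max(2, int(2 ** (n * R)))                       # number of messages
    book = [[random.randint(0, 1) for _ in range(n)] for _ in range(M)]
    errors = 0
    for _ in range(trials):
        w = random.randrange(M)                         # message sent
        y = [b ^ (random.random() < p) for b in book[w]]  # BSC output
        w_hat = min(range(M),
                    key=lambda m: sum(a != b for a, b in zip(book[m], y)))
        errors += (w_hat != w)
    return errors / trials

p = 0.1                       # BSC crossover; capacity is 1 - h(0.1) ~= 0.53
for n in (10, 20, 40):
    # With R < C the empirical error rate tends to decrease as n grows.
    print(n, simulate(n, R=0.2, p=p))
```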
In this section, we show that transmission rates that are larger than the channel capacity can
never be achieved. We start with the following
inequality.
Theorem 4.4 (Fano's inequality). Let (X, Y, X̂) be a random vector with value in X × Y × X such that X̂ ⊥ X given Y. Let P_e := P(X̂ ≠ X). Then
By (3.10), we have
H(X | E, X̂) = H(X | X̂, E = 1) P_e + H(X | X̂, E = 0)(1 − P_e)
≤ P_e H(X) + (1 − P_e) H(X̂ | X̂, E = 0)   (by (3.4))
≤ P_e log |X| + (1 − P_e) H(X̂ | X̂)   (by Corollary 1.5 and (3.4))
= P_e log |X|   (by (3.7)).
We have
I(W; Y_1, . . . , Y_n) ≤ I(X_1, . . . , X_n; Y_1, . . . , Y_n)   (by (3.20))
= H(Y_1, . . . , Y_n) − H(Y_1, . . . , Y_n | X_1, . . . , X_n)
≤ ∑_{i=1}^n H(Y_i) − ∑_{i=1}^n H(Y_i | X_1, . . . , X_n, Y_1, . . . , Y_{i−1})   (by (1.2) and (3.14))
= ∑_{i=1}^n H(Y_i) − ∑_{i=1}^n H(Y_i | X_i)   (since the channel is in product form)
= ∑_{i=1}^n I(X_i; Y_i)
≤ nC_κ.
Combining this with (4.6) (note that R_n = log M_n / n), we get the following inequality, from which the result follows easily:
(1 − P_e^{(n)}) R_n ≤ C_κ + 1/n.
Theorem 4.5 claims that if the transmission rate is larger than the channel capacity, then the error probability is bounded away from 0. In this section, we show a stronger converse to Theorem 4.3: the error probability converges to 1 (i.e. the communication is completely unreliable).
Consider a channel with probability kernel κ from X to Y. For an input distribution P_X and (x_1, . . . , x_n, y_1, . . . , y_n) ∈ X^n × Y^n, define
I(x_1, . . . , x_n; y_1, . . . , y_n) := log( κ^{⊗n}(y_1, . . . , y_n | x_1, . . . , x_n) / P_Y^{⊗n}(y_1, . . . , y_n) ) = ∑_{i=1}^n I(x_i; y_i).
Below, I(x_1, . . . , x_n; Y_1, . . . , Y_n) denotes this quantity evaluated at the (random) output (Y_1, . . . , Y_n) of the channel for the input (x_1, . . . , x_n).
Lemma 4.6. Let PX be an input distribution such that I (X ; Y ) = Cκ . Then for all n ∈ N∗ and
(x_1, . . . , x_n) ∈ X^n, we have E[I(x_1, . . . , x_n; Y_1, . . . , Y_n)] ≤ nC_κ.
Proof. It suffices to show the lemma for n = 1. Indeed, since the channel is in product form,
I(x_1, . . . , x_n; Y_1, . . . , Y_n) = ∑_{i=1}^n I(x_i; Y_i).
For x ∈ X, we have
E[I(x; Y)] = ∑_{y∈Y} κ(y | x) log( κ(y | x) / P(Y = y) ).
For t ∈ [0, 1], let X̃_t ∼ (1 − t)P_X + tδ_x, where δ_x denotes the distribution degenerate at x. One can check that the right derivative at t = 0 of t ↦ I(X̃_t; Y) equals E[I(x; Y)] − C_κ. Since P_X maximizes the concave function P_X ↦ I(X; Y) (Exercise 3.8), this derivative is non-positive, and it follows that E[I(x; Y)] ≤ C_κ.
Theorem 4.7 (Wolfowitz). Let M = {1, . . . , M} be a finite set of messages. Fix n ∈ N∗. Let c : M → X^n be an encoding function and ĉ : Y^n → M be a decoding function. If R := log M / n > C_κ, then
P_e ≥ 1 − 4A/(n(R − C_κ)²) − 2^{−n(R−C_κ)/2}
for some positive constant A depending only on κ, not on n or M.
Proof. Let ε := (R − C_κ)/2 and define, for ω ∈ M,
B_ω := {y ∈ Y^n | I(c(ω); y) ≥ n(C_κ + ε)}.
Then for y ∈ B_ω^c, one has κ^{⊗n}(y | c(ω)) ≤ P_Y^{⊗n}(y) · 2^{n(C_κ+ε)}. It follows that, writing Y_ω := ĉ^{−1}({ω}) for the decoding region of ω,
∑_{ω∈M} ∑_{y∈Y_ω∩B_ω^c} κ^{⊗n}(y | c(ω)) ≤ 2^{n(C_κ+ε)} ∑_{ω∈M} ∑_{y∈Y_ω} P_Y^{⊗n}(y) = 2^{n(C_κ+ε)} = 2^{n(R+C_κ)/2}. (4.8)
Thus, for every ω ∈ M,
∑_{y∈Y_ω∩B_ω} κ^{⊗n}(y | c(ω)) ≤ ∑_{y∈B_ω} κ^{⊗n}(y | c(ω)) = P(I(c(ω); Y_1, . . . , Y_n) ≥ n(C_κ + ε)).
By Lemma 4.6, E[I(c(ω); Y_1, . . . , Y_n)] ≤ nC_κ, and since the channel is in product form, the variance of I(c(ω); Y_1, . . . , Y_n) is at most nA for a constant A depending only on κ. Chebyshev's inequality therefore gives
∑_{y∈Y_ω∩B_ω} κ^{⊗n}(y | c(ω)) ≤ nA/(nε)² = 4A/(n(R − C_κ)²).
Hence
P_e = 1 − (1/M) ∑_{ω∈M} ∑_{y∈Y_ω} κ^{⊗n}(y | c(ω))
= 1 − (1/M) ∑_{ω∈M} ∑_{y∈Y_ω∩B_ω^c} κ^{⊗n}(y | c(ω)) − (1/M) ∑_{ω∈M} ∑_{y∈Y_ω∩B_ω} κ^{⊗n}(y | c(ω))
≥ 1 − 2^{n(R+C_κ)/2}/M − 4A/(n(R − C_κ)²)   (by (4.8) and the previous bound)
≥ 1 − 4A/(n(R − C_κ)²) − 2^{−n(R−C_κ)/2}.
The last inequality comes from the fact that log M / n = R, i.e. M = 2^{nR}.
Corollary 4.8. For a channel with probability kernel κ from X to Y and an encoding/decoding scheme c_n : M_n → X^n, ĉ_n : Y^n → M_n, if lim inf_{n→∞} R_n > C_κ, then lim_{n→∞} P_e^{(n)} = 1.
Exercises
Exercise 4.1 (Noisy typewriter). Consider a channel with input and output alphabet X = Y =
{a, b, . . . , z}. For a given input letter, the output may be equal to the input letter or to the next
one, both with probability 1/2. What is the information capacity of this channel?
∀n ∈ N∗, Y_n := X_n ⊕ Z_n,
where Z_n ∼ Ber(q) and (Z_n)_{n∈N∗} is independent of (X_n)_{n∈N∗} (but the Z_n's are not necessarily independent of each other).
1. Show that H(X_1, . . . , X_n | Y_1, . . . , Y_n) = H(Z_1, . . . , Z_n).
2. The capacity of this channel is C := lim inf_{n→∞} (1/n) sup_{P_{(X_1,...,X_n)}} I(X_1, . . . , X_n; Y_1, . . . , Y_n). Compare the information capacities of the BSC with and without memory.
Exercise 4.4. Consider a time-varying (non-homogeneous) product-form channel with probability kernels κ_n (n ∈ N∗) from X to Y. Show that
sup_{P_{(X_1,...,X_n)}} I(X_1, . . . , X_n; Y_1, . . . , Y_n) = ∑_{i=1}^n sup_{P_{X_i}} I(X_i; Y_i).
Exercise 4.5 (Fano's inequality is sharp). Let p ∈ (0, 1). By considering a random variable X with range X = {1, . . . , m} such that P(X = 1) = 1 − p and P(X = k) = p/(m − 1) for 2 ≤ k ≤ m, and taking Y to be a singleton, show that Fano's inequality (4.2) is sharp.
Exercise 4.6 (Feedback does not increase capacity). Show that Theorem 4.5 remains true if we consider encoding functions c_n of the form
c_n(ω) = (c_{n,1}(ω), c_{n,2}(ω, Y_1), . . . , c_{n,n}(ω, Y_1, . . . , Y_{n−1})),
where
c_{n,i} : M_n × Y^{i−1} → X, i = 1, . . . , n.