
Information Theory

Lecture 1: Entropy, Relative Entropy, and Mutual Information

Basak Guler

1 / 136
Background
• Information theory studies the fundamental limits of how to represent,
compress, and transfer information.

• The field was founded by Claude Shannon in 1948 through his


landmark paper:

A Mathematical Theory of Communication


(Available online)
Suggested reading: pages 1 and 2 of the paper.

• Who is Claude Shannon?


• A recent article:
https://www.quantamagazine.org/how-claude-shannons-information-theory-invented-the-future-20201222/
• A recent movie commissioned by the IEEE Information Theory Society:
The Bit Player (2018)
https://www.imdb.com/title/tt5015534/

2 / 136
Introduction
Information theory answers two fundamental questions:

1. What is the ultimate limit of data compression?


• If a source generates information in digital form, how many bits do we
need to compress it?
• Information theory states that the amount of information a source
contains can be measured by the extent to which it can be compressed.
• If a source contains more information, it requires more bits to compress.
• This depends on the statistical model of the source.

2. What is the ultimate rate of data transmission?


• Communication channels are noisy, which corrupts the transmitted
signals.
• Despite the random noise introduced by the channel, it is possible to
reliably convey what is transmitted to the receiver by introducing
redundancy in our transmitted data.
• The maximum rate at which reliable communication is possible is called
the channel capacity, which is dependent upon the statistical model of
the channel.
3 / 136
Remark
• Information theory identifies the “limits” (what can and cannot be
achieved theoretically) and “suggests” how to achieve them. These
theoretical schemes, however, may not be practical. The existence of
these limits inspires engineers to build practical algorithms to try to
approach/achieve these limits.

4 / 136
Remark
• Although Information Theory originated from “dealing” with
communications, its principles and impact go well beyond the field
of communications.

• Over the years the “theory of information” and information-theoretic


concepts (such as entropy, relative entropy/KL-divergence, mutual
information) have been instrumental in many fields, including
computer science (machine learning, security & privacy), statistics,
economics, etc.

5 / 136
Notation
• We will assume that a discrete random variable (r.v.) X has an
alphabet X , meaning that X takes values from the set X , with a
probability mass function (PMF):

PX (x) = P[X = x] (1)

for x ∈ X .
• With some abuse of notation, we will use p(x) to denote the PMF of
X, instead of P_X(x). That is, we will define p(x) ≜ P_X(x).

6 / 136
Remark
• Information theory relies on a set of mathematical tools.

• In particular, there are a few key definitions that facilitate the main
results.

• First, we will need to learn those.

• The most important notions are Entropy (H) and Mutual Information (I).

• Let’s start!

7 / 136
Entropy
• Definition 1. Entropy of a discrete random variable X is:

H(X) = Σ_{x∈X} p(x) log( 1 / p(x) )    (2)
     = − Σ_{x∈X} p(x) log p(x)         (3)

• Entropy measures the uncertainty about the r.v.


• Entropy is a function of the PMF of the r.v.
• The log is base 2, and the resulting unit is called a bit - this is the
basic unit of information in communications and computing.
Less often, we use base e; the unit of information is then called a "nat".

8 / 136
Entropy
• Recall that for a function g(X ) of X ,
E[g(X)] = Σ_{x∈X} p(x) g(x)    (4)

If we define the function g(X) such that

g(x) = log( 1 / p(x) )   ∀x ∈ X,    (5)

then the entropy is simply:

H(X) = E[g(X)] = E[ log( 1 / p(X) ) ].    (6)

9 / 136
Example 1
• Consider a r.v. X that takes values from X = {1, 2, . . . , 8} with equal
probability.

Calculate the entropy of X .

• Solution. The entropy of X is:

H(X) = − Σ_{x=1}^{8} (1/8) log(1/8)    (7)
     = 8 × (1/8) × log 8               (8)
     = 3 bits                          (9)

10 / 136
Example 2
• Consider a r.v. X that takes values from X = {1, 2, . . . , 8}. Suppose
the PMF of X is (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

Calculate the entropy of X .

• Solution. The entropy of X is:

H(X) = − Σ_{x=1}^{8} p(x) log p(x)                                                    (10)
     = Σ_{x=1}^{8} p(x) log( 1 / p(x) )                                               (11)
     = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/16) log 16 + 4 × (1/64) log 64    (12)
     = 1/2 + 1/2 + 3/8 + 2/8 + 3/8                                                    (13)
     = 2 bits                                                                         (14)

11 / 136
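The calculation in Examples 1 and 2 is easy to reproduce numerically. Below is a minimal Python sketch (the helper name `entropy` is ours, not from the text) that implements Definition 1 with the convention 0 log 0 = 0:

```python
# Entropy in bits of a PMF given as a list of probabilities.
from math import log2

def entropy(pmf):
    """Return H(X) = -sum p(x) log2 p(x), skipping zero-probability symbols."""
    return -sum(p * log2(p) for p in pmf if p > 0)

# Example 1: uniform PMF over 8 symbols -> 3 bits
print(entropy([1/8] * 8))                            # 3.0
# Example 2: PMF (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) -> 2 bits
print(entropy([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))   # 2.0
```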
Observations
• Entropy is non-negative.

• Entropy of the uniformly-distributed (which we will call equiprobable)


r.v. is higher.

H(1/8, . . . , 1/8) > H(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)    (15)

Thought exercise: Why?

We will prove these observations next.

12 / 136
Properties of Entropy
• Lemma 1. Entropy is always non-negative, i.e., H(X ) ≥ 0.

• Proof. The result follows from the axioms of probability, in
particular, from the fact that 0 ≤ p(x) ≤ 1 for all x ∈ X.
• for p(x) ∈ (0, 1), we have log p(x) < 0.

Then, −p(x) log p(x) > 0.

• for p(x) = 1, we have log p(x) = 0.

Then, −p(x) log p(x) = 0.

• for p(x) = 0, log 0 is undefined.

For such cases, we use the convention

0 log 0 = 0 (16)

which follows by taking the limit lim_{p(x)→0} p(x) log p(x) = 0.

• Therefore, H(X) = − Σ_{x∈X} p(x) log p(x) ≥ 0 for all X.

13 / 136
Properties of Entropy
• Lemma 2. Let Ha (X ) denote the entropy of X with the logarithm taken
with respect to base a, i.e.,
H_a(X) = Σ_{x∈X} p(x) log_a( 1 / p(x) )    (17)

Similarly, let H_b(X) denote the entropy with respect to base b. Then,

H_b(X) = (log_b a) H_a(X).    (18)

14 / 136
Properties of Entropy
• Proof. Note that,

log_a p(x) = log_b p(x) / log_b a    (19)

or equivalently, log_b p(x) = (log_b a) log_a p(x). Then,

H_b(X) = − Σ_{x∈X} p(x) log_b p(x)                  (20)
       = − Σ_{x∈X} p(x) (log_b a) log_a p(x)        (21)
       = (log_b a) ( − Σ_{x∈X} p(x) log_a p(x) )    (22)
       = (log_b a) H_a(X)                           (23)

• Therefore, entropy base can be changed from one to another with a


constant multiplier.

15 / 136
Example 3- Binary Entropy Function
• Consider a binary r.v. X :

X = 1 with probability p
    0 with probability 1 − p     (24)

Find the entropy of X .

• Solution. The entropy of X :

H(X) = −(p log p + (1 − p) log(1 − p)) ≜ H(p)    (25)

is called the binary entropy function.

16 / 136
Example 3- Binary Entropy Function
• Properties of the binary entropy function:

H(p) = −(p log p + (1 − p) log(1 − p)) (26)

• H(p) = H(1 − p) by definition, so it is symmetric around 1/2.


• H(0) = H(1) = 0 (this means there is no uncertainty in either case)
To calculate H(0) and H(1), we use limx→0 x log x = 0.
• H(p) is maximized when p = 1/2, resulting in 1 bit of entropy.
• H(p) is a concave function of p (More on this later).
17 / 136
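A small Python sketch of the binary entropy function (the name `binary_entropy` is ours) that checks these properties numerically:

```python
from math import log2

def binary_entropy(p):
    if p in (0.0, 1.0):          # convention: 0 log 0 = 0, so H(0) = H(1) = 0
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(binary_entropy(0.5))                         # 1.0 bit (the maximum)
print(binary_entropy(0.1), binary_entropy(0.9))    # equal: H(p) = H(1 - p)
```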
Example 4-Best Strategy to Guess the Value of a R.V.
• Let X be a random variable with the following PMF:

X = 1 with probability 1/2
    2 with probability 1/4
    3 with probability 1/8
    4 with probability 1/8
• E.g., X is the outcome of a biased 4-sided die roll.


• Suppose that we do not know the “true value (realization)” of X .
• We can make guesses in the form of ”Is X = 2?”. After each question,
we get an answer yes/no. If we are wrong, we make another guess.
• What is the best strategy for guessing the outcome?

18 / 136
Example 4-Best Strategy to Guess the Value of a R.V.
• Answer. Intuitively, it is better to start by guessing the most likely
outcome. Then, we are more likely to be correct.
This strategy would look like:

Question 1 Is X = 1? (half the time we will be right!)

YES NO

Question 2 Done (in 1 question) Is X = 2? (next “most probable”)


(X = 1)

YES NO

Question 3 Done (in 2 questions) Is X = 3?


(X = 2)
YES NO

Done (in 3 questions) Done (in 3 questions)


(X = 3) (X = 4)

19 / 136
Example 4-Best Strategy to Guess the Value of a R.V.
• Answer. Intuitively, it is better to start by guessing the most likely
outcome. Then, we are more likely to be correct.
• Then the expected number of questions is:

1 × P[X = 1] + 2 × P[X = 2] + 3 × P[X = 3 or 4]
  = 1 × 1/2 + 2 × 1/4 + 3 × (1/8 + 1/8)    (27)
  = 7/4                                    (28)

• If we calculate the entropy of X, we will also find that H(X) = 7/4!
• This is not a coincidence!
• It turns out that there is a strong relation between the best strategy to
guess the outcome of a random variable and its entropy.
• In future lectures, we will see why this is the case.

20 / 136
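A quick numerical check of Example 4 (the variable names below are ours): the expected number of questions under the most-likely-first strategy matches H(X) = 7/4 for this PMF:

```python
from math import log2

pmf = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}
# Number of questions asked if the true value is 1, 2, 3, or 4 (see the tree above):
questions = {1: 1, 2: 2, 3: 3, 4: 3}

expected_questions = sum(pmf[x] * questions[x] for x in pmf)
entropy = -sum(p * log2(p) for p in pmf.values())
print(expected_questions, entropy)   # 1.75 1.75
```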
Recap
• Entropy is a measure of uncertainty, randomness, amount of
self-information.
• Less entropy means
• less randomness, less self-information
• more compression, less average number of bits needed to represent the
outcomes
• In the future chapters, we will study these concepts in detail.
• So far we have covered Sections 1 and 2.1 from the book Elements of
Information Theory, Cover-Thomas.
• Next, we will cover Sections 2.2. and 2.3.

21 / 136
Joint Entropy
• We can extend the notion of entropy to a pair of random variables.

• Definition 2. The joint entropy H(X , Y ) of a pair of discrete random


variables (X , Y ) with a joint distribution p(x, y) is:
H(X, Y) = − Σ_{x∈X, y∈Y} p(x, y) log p(x, y)     (29)
        = Σ_{x∈X, y∈Y} p(x, y) log( 1 / p(x, y) )    (30)
        = E[ log( 1 / p(X, Y) ) ]                    (31)
• same as H(X ) except X is now a random vector with two elements
• extends to n > 2 dimensional random vectors (X1 , . . . , Xn ):
H(X1, . . . , Xn) = − Σ_{x1∈X1,...,xn∈Xn} p(x1, . . . , xn) log p(x1, . . . , xn)    (32)
                  = E[ log( 1 / p(X1, . . . , Xn) ) ]                                (33)
22 / 136
Conditional Entropy
• Conditional entropy H(Y |X ) quantifies the amount of uncertainty
remaining in Y when we know X .
• Definition 3. The conditional entropy H(Y |X ) is defined as:
H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)                            (34)
       = Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log( 1 / p(y|x) )      (35)
       = Σ_{x∈X} Σ_{y∈Y} p(y|x) p(x) log( 1 / p(y|x) )      (36)
       = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( 1 / p(y|x) )          (37)
       = E[ log( 1 / p(Y|X) ) ]                             (38)

• The expectation is over (X, Y), i.e., E[log(1/p(Y|X))] = E_{X,Y}[log(1/p(Y|X))]

23 / 136
Chain Rule of Entropy
• Theorem 1. The chain rule of entropy:

H(X , Y ) = H(X ) + H(Y |X ) (39)

24 / 136
Chain Rule of Entropy - Proof
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)                                             (40)
        = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(y|x) p(x) )                                      (41)
        = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)           (42)
        = − Σ_{x∈X} ( Σ_{y∈Y} p(x, y) ) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)       (43)
        = − Σ_{x∈X} p(x) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)                      (44)
        = H(X) + H(Y|X)                                                                     (45)

where (43) is from the definition of a marginal PMF, p(x) = Σ_{y∈Y} p(x, y).

25 / 136
Chain Rule of Entropy - Alternative Proof
• Note: The proof can also be carried out by noting

log( 1 / p(x, y) ) = log( 1 / p(x) ) + log( 1 / p(y|x) )    (46)

and taking the expectation of both sides:

E_{X,Y}[ log( 1 / p(X, Y) ) ] = E_{X,Y}[ log( 1 / p(X) ) + log( 1 / p(Y|X) ) ]            (47)
                              = E_{X,Y}[ log( 1 / p(X) ) ] + E_{X,Y}[ log( 1 / p(Y|X) ) ]  (48)
                              = E_X[ log( 1 / p(X) ) ] + E_{X,Y}[ log( 1 / p(Y|X) ) ]      (49)
                              = H(X) + H(Y|X)                                              (50)

26 / 136
Chain Rule of Entropy
• Also, we have (by symmetry):

H(X , Y ) = H(Y ) + H(X |Y ) (51)

• Avg. uncertainty about (X , Y )


= Avg. uncertainty about X + Avg. uncertainty about Y given X
= Avg. uncertainty about Y + Avg. uncertainty about X given Y

H(X , Y ) = H(X ) + H(Y |X ) = H(Y ) + H(X |Y ) = H(Y , X ) (52)

27 / 136
Chain Rule for Many Random Variables
• Theorem 2. The chain rule for n random variables:
H(X1, . . . , Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1)    (53)

28 / 136
Chain Rule for Many Random Variables
• Proof. From the chain rule for conditional probabilities:

p(x1 , . . . , xn ) = p(x1 )p(x2 |x1 )p(x3 |x2 , x1 ) . . . p(xn |xn−1 , . . . , x1 ) (54)

then

H(X1, . . . , Xn)
 = − Σ_{x1∈X1,...,xn∈Xn} p(x1, . . . , xn) log p(x1, . . . , xn)                                     (55)
 = − Σ_{x1,...,xn} p(x1, . . . , xn) log( p(x1) · · · p(xn|xn−1, . . . , x1) )                       (56)
 = − Σ_{x1,...,xn} p(x1, . . . , xn) ( log p(x1) + . . . + log p(xn|xn−1, . . . , x1) )
 = − Σ_{x1,...,xn} p(x1, . . . , xn) log p(x1) − . . . − Σ_{x1,...,xn} p(x1, . . . , xn) log p(xn|xn−1, . . . , x1)
 = H(X1) + H(X2|X1) + . . . + H(Xn|Xn−1, . . . , X1)                                                 (57)

29 / 136
Chain Rule for Many Random Variables
• Simpler proof. From the chain rule for conditional probabilities:

p(x1 , . . . , xn ) = p(x1 )p(x2 |x1 )p(x3 |x2 , x1 ) . . . p(xn |xn−1 , . . . , x1 ) (58)

so

log( 1 / p(x1, . . . , xn) ) = log( 1 / p(x1) ) + log( 1 / p(x2|x1) ) + . . . + log( 1 / p(xn|xn−1, . . . , x1) )    (59)

Take the expectation of both sides with respect to (X1, . . . , Xn):

E_{X1,...,Xn}[ log( 1 / p(X1, . . . , Xn) ) ]
 = E_{X1,...,Xn}[ log( 1 / p(X1) ) + . . . + log( 1 / p(Xn|Xn−1, . . . , X1) ) ]                 (60)
 = E_{X1,...,Xn}[ log( 1 / p(X1) ) ] + . . . + E_{X1,...,Xn}[ log( 1 / p(Xn|Xn−1, . . . , X1) ) ] (61)
 = H(X1) + H(X2|X1) + . . . + H(Xn|Xn−1, . . . , X1)                                              (62)

30 / 136
Corollary
• Corollary 1. For three random variables X , Y , Z :

H(X , Y |Z ) = H(X |Z ) + H(Y |X , Z ) (63)

31 / 136
Corollary
• Proof. From the chain rule, p(x, y|z) = p(x|z)p(y|x, z). Then,

H(X, Y|Z)
 = Σ_{z∈Z} p(z) H(X, Y|Z = z)
 = − Σ_{z∈Z} p(z) Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(x, y|z)        (with p(x, y|z) = p(x|z)p(y|x, z))
 = − Σ_{z∈Z} p(z) ( Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(x|z) + Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(y|x, z) )
 = − Σ_{z∈Z} p(z) ( Σ_{x∈X} p(x|z) log p(x|z) + Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(y|x, z) )
 = − Σ_{z∈Z} Σ_{x∈X} p(x, z) log p(x|z) − Σ_{x∈X} Σ_{y∈Y} Σ_{z∈Z} p(x, y, z) log p(y|x, z)
 = H(X|Z) + H(Y|X, Z)

Therefore, H(X, Y|Z) = H(X|Z) + H(Y|X, Z).

32 / 136
Example 5
• Consider a pair of r.v.s X , Y with a joint PMF p(x, y ):

X \ Y |  0  |  1
  0   | 1/2 | 1/4
  1   |  0  | 1/4     (64)

• Find H(X ),H(Y ), H(X |Y ), H(Y |X ), H(X , Y ).


• Is H(X |Y ) = H(Y |X )?

33 / 136
Example 5 - Solution
1) Find H(X ).
• From the definition of marginal probabilities:
p(x) = P[X = x] = Σ_{y∈Y} P[X = x, Y = y] = Σ_{y∈Y} p(x, y)

• Then, the marginal PMF of X, p(x), can be found as:

P[X = 0] = 3/4,  P[X = 1] = 1/4   ⇒   p(x) = (3/4, 1/4)

• Then, the entropy of X is:

H(X) = −(3/4) log(3/4) − (1/4) log(1/4) = 0.8113

34 / 136
Example 5 - Solution
2) Find H(Y ).
• The marginal PMF of Y, p(y), is:

P[Y = 0] = 1/2,  P[Y = 1] = 1/2   ⇒   p(y) = (1/2, 1/2)

• Then, the entropy of Y is:

H(Y) = −(1/2) log(1/2) − (1/2) log(1/2) = 1

35 / 136
Example 5 - Solution
3) Find the conditional entropy H(X |Y ).
• For this, we first need to find conditional PMF p(x|y ).
• From the definition of conditional probabilities:

p(x|y) = P[X = x|Y = y] = P[X = x, Y = y] / P[Y = y] = p(x, y) / p(y)

• Then,

P[X = 0|Y = 0] = P[X = 0, Y = 0] / P[Y = 0] = (1/2)/(1/2) = 1

P[X = 1|Y = 0] = 1 − P[X = 0|Y = 0] = 0

• Similarly,

P[X = 0|Y = 1] = P[X = 0, Y = 1] / P[Y = 1] = (1/4)/(1/2) = 1/2

P[X = 1|Y = 1] = 1 − 1/2 = 1/2

36 / 136
Example 5 - Solution
• Then,

H(X|Y) = Σ_{y∈{0,1}} p(y) H(X|Y = y)

• Note that:

H(X|Y = 0) = −1 log 1 − 0 log 0 = 0

• whereas

H(X|Y = 1) = −(1/2) log(1/2) − (1/2) log(1/2) = 1

• Then,

H(X|Y) = Σ_{y∈{0,1}} p(y) H(X|Y = y) = (1/2) × 0 + (1/2) × 1 = 1/2

37 / 136
Example 5 - Solution
4) Find the conditional entropy H(Y |X ).
• For this, we first need to find conditional PMF p(y |x).
• From the definition of conditional probabilities:

p(y|x) = P[Y = y|X = x] = P[Y = y, X = x] / P[X = x] = p(y, x) / p(x)

• Then,

P[Y = 0|X = 0] = P[Y = 0, X = 0] / P[X = 0] = (1/2)/(3/4) = 2/3

P[Y = 1|X = 0] = 1 − P[Y = 0|X = 0] = 1/3

• Similarly,

P[Y = 0|X = 1] = P[Y = 0, X = 1] / P[X = 1] = 0/(1/4) = 0

P[Y = 1|X = 1] = 1 − 0 = 1

38 / 136
Example 5 - Solution
• Then,

H(Y|X) = Σ_{x∈{0,1}} p(x) H(Y|X = x)

where

H(Y|X = 0) = −(2/3) log(2/3) − (1/3) log(1/3) = 0.9183

and

H(Y|X = 1) = −0 log 0 − 1 log 1 = 0

Then,

H(Y|X) = Σ_{x∈{0,1}} p(x) H(Y|X = x) = (3/4) × 0.9183 + (1/4) × 0 = 0.6887

39 / 136
Example 5 - Solution
• 5) Find the joint entropy H(X , Y ).

H(X, Y) = −(1/2) log(1/2) − 0 log 0 − (1/4) log(1/4) − (1/4) log(1/4) = 1.5

• Remark. Note that H(X|Y) ≠ H(Y|X)

• But

H(X , Y ) = H(X , Y )
H(X ) + H(Y |X ) = H(Y ) + H(X |Y )
H(X ) − H(X |Y ) = H(Y ) − H(Y |X )
0.8113 − 1/2 = 1 − 0.6887
0.3113 = 0.3113

• This will be important soon.

40 / 136
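The whole of Example 5 can be re-derived from the joint PMF with a few lines of Python; this sketch (helper names are ours) uses the chain rule H(X, Y) = H(Y) + H(X|Y) to obtain the conditional entropies:

```python
from math import log2

joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}   # p(x, y)

def H(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

px = [sum(p for (x, y), p in joint.items() if x == xv) for xv in (0, 1)]
py = [sum(p for (x, y), p in joint.items() if y == yv) for yv in (0, 1)]

HX, HY = H(px), H(py)
HXY = H(list(joint.values()))
H_X_given_Y = HXY - HY          # chain rule: H(X, Y) = H(Y) + H(X|Y)
H_Y_given_X = HXY - HX
print(HX, HY, HXY)                           # 0.8113, 1.0, 1.5
print(H_X_given_Y, H_Y_given_X)              # 0.5, 0.6887
print(HX - H_X_given_Y, HY - H_Y_given_X)    # both 0.3113
```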
Relative Entropy - KL-distance
• Definition 4. The relative entropy or Kullback-Leibler (KL) distance
between two PMFs p(x) and q(x) (that are defined on the same
alphabet) is:
D(p||q) = Σ_{x∈X} p(x) log( p(x) / q(x) )
        = E[ log( p(X) / q(X) ) ]
• The relative entropy is a measure of distance between two
distributions (although it is actually not a true distance measure
because it is not symmetric and does not satisfy the triangle
inequality!). However, we will see that it is always ≥ 0 and = 0 if and
only if p = q.
• If there is any symbol x ∈ X for which p(x) > 0 and q(x) = 0, then
D(p||q) = ∞.
• This notion is also called KL divergence or information divergence.

41 / 136
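A minimal Python sketch of Definition 4 (the name `kl_divergence` is ours), using the conventions 0 log(0/q) = 0 and p log(p/0) = ∞; it also illustrates that D(p||q) ≠ D(q||p) in general:

```python
from math import log2, inf

def kl_divergence(p, q):
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # convention: 0 log(0/q) = 0
        if qi == 0:
            return inf        # p(x) > 0 but q(x) = 0
        d += pi * log2(pi / qi)
    return d

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q), kl_divergence(q, p))   # not equal: D is not symmetric
print(kl_divergence(p, p))                        # 0.0
```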
Mutual Information
• Definition 5. Consider two random variables X and Y with a joint
PMF p(x, y) and marginal PMFs p(x) and p(y ).

The mutual information I(X ; Y ) is the relative entropy between the


joint distribution and the product distribution p(x)p(y ):

I(X; Y) = D( p(x, y) || p(x)p(y) )
        = Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )

• Also note that I(X; Y) = E[ log( p(X, Y) / (p(X)p(Y)) ) ]

• Later we will be able to generalize this definition to continuous or


mixed random variables.

42 / 136
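A short sketch of Definition 5 applied to the joint PMF of Example 5 (helper names are ours); the result matches H(X) − H(X|Y) = 0.3113 computed earlier:

```python
from math import log2

joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}
px = {xv: sum(p for (x, y), p in joint.items() if x == xv) for xv in (0, 1)}
py = {yv: sum(p for (x, y), p in joint.items() if y == yv) for yv in (0, 1)}

# I(X; Y) = D(p(x, y) || p(x)p(y)), skipping zero-probability pairs
I = sum(p * log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0)
print(I)   # 0.3113...
```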
Mutual Information

• Theorem 3. Relationship between entropy and mutual information:

I(X ; Y ) = H(X ) − H(X |Y )

Also, from symmetry of p(x, y) = p(x)p(y |x) = p(y)p(x|y ),

I(X ; Y ) = H(Y ) − H(Y |X )

• Mutual information measures how much information one random


variable carries about another.

• Equally, mutual information measures the amount of uncertainty


reduced in one random variable by knowing another random variable.

43 / 136
Relationship Between Entropy and Mutual Information
• Proof.
I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x|y) / p(x) )
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( 1 / p(x) ) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log( 1 / p(x|y) )
        = H(X) − H(X|Y)

where the first sum equals H(X) (since Σ_y p(x, y) = p(x)) and the second equals H(X|Y).

44 / 136
Relationship Between Entropy and Mutual Information
• Alternative Proof (by using expectations).

I(X; Y) = E_{X,Y}[ log( p(X, Y) / (p(X)p(Y)) ) ]
        = E_{X,Y}[ log( p(X|Y) / p(X) ) ]
        = E_{X,Y}[ log( 1 / p(X) ) − log( 1 / p(X|Y) ) ]
        = E_X[ log( 1 / p(X) ) ] − E_{X,Y}[ log( 1 / p(X|Y) ) ]
        = H(X) − H(X|Y)

45 / 136
How to Interpret Mutual Information?
• We have seen that,

H(X) − H(X|Y) = I(X; Y)
  (I)     (II)     (III)

• I: Average uncertainty about X (before observing Y).

• II: Average uncertainty about X AFTER observing Y.

• III: Average reduction in uncertainty of X after observing Y (average
information about X that is supplied by Y).

I(X ; Y ) = H(X ) − H(X |Y ) = H(Y ) − H(Y |X )

• That is, X tells as much information about Y , as Y does about X .

I(X ; Y ) = I(Y ; X )

46 / 136
Observations
• We can also write:
I(X; Y) = E[ log( p(X, Y) / (p(X)p(Y)) ) ]
        = E[ log( 1 / p(X) ) ] + E[ log( 1 / p(Y) ) ] − E[ log( 1 / p(X, Y) ) ]
        = H(X) + H(Y) − H(X, Y)

• How about I(X ; X )?

I(X; X) = H(X) − H(X|X) = H(X)   (since H(X|X) = 0)

49 / 136
Observations
• The following diagram shows these relationships:

H(X) H(Y )

H(X|Y ) I(X; Y ) H(Y |X)

H(X, Y )

50 / 136
Conditional Mutual Information
• Definition 6. Conditional mutual information:

I(X; Y|Z) = Σ_{x∈X, y∈Y, z∈Z} p(x, y, z) log( p(x, y|z) / (p(x|z)p(y|z)) )
          = E_{X,Y,Z}[ log( p(X, Y|Z) / (p(X|Z)p(Y|Z)) ) ]
          = H(X|Z) − H(X|Y, Z)

51 / 136
Chain Rule of Mutual Information
• Recall the chain rule of entropy:

H(X1, . . . , Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1)

• Mutual information also has a chain rule!
• Theorem 4. Chain rule of mutual information:

I(X1, . . . , Xn; Y) = Σ_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1)

• Proof.

I(X1, . . . , Xn; Y) = H(X1, . . . , Xn) − H(X1, . . . , Xn | Y)
                     = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1) − Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1, Y)
                     = Σ_{i=1}^{n} ( H(Xi | Xi−1, . . . , X1) − H(Xi | Xi−1, . . . , X1, Y) )
                     = Σ_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1)

52 / 136
Conditional Relative Entropy
• Definition 7. For two joint PMFs p(x, y ) and q(x, y), the conditional
relative entropy is defined as:
D( p(y|x) || q(y|x) ) = Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log( p(y|x) / q(y|x) )
                      = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(y|x) / q(y|x) )
                      = E_{X,Y}[ log( p(Y|X) / q(Y|X) ) ]

53 / 136
Chain Rule of Relative Entropy
• Relative entropy also has a chain rule:

D(p(x, y )||q(x, y )) = D(p(x)||q(x)) + D(p(y|x)||q(y |x))

• Proof.
D( p(x, y) || q(x, y) ) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / q(x, y) )
                        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( ( p(y|x) p(x) ) / ( q(y|x) q(x) ) )
                        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(y|x) / q(y|x) ) + Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x) / q(x) )
                        = D( p(y|x) || q(y|x) ) + D( p(x) || q(x) )

54 / 136
Convex Functions
• We will now briefly review the basic definitions of convexity and
present one of the most widely used inequalities in information theory.

55 / 136
Convex Functions
• Definition 8 (Convex function). A function f (x) is convex over an
interval (a, b) if for every x1 , x2 ∈ (a, b) and 0 ≤ λ ≤ 1:

f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 )

• The function is strictly convex if for the above, the equality holds only
when λ = 0 or λ = 1.

[Figure: a convex function f(x); the chord value λf(x1) + (1 − λ)f(x2) between (x1, f(x1)) and (x2, f(x2)) lies above the function value f(λx1 + (1 − λ)x2).]
56 / 136
Example 6
• Example 6. f(x) = x² where x ∈ R

[Plot of f(x) = x² for x ∈ [−10, 10].]

• is a convex function

58 / 136
Example 7
• Example 7. f (x) = − log x where x > 0

[Plot of f(x) = − log x for x ∈ (0, 10].]

• is a convex function

60 / 136
Example 8
• Example 8. f(x) = e^x where x ∈ R

[Plot of f(x) = e^x for x ∈ [−5, 5].]

• is a convex function

62 / 136
Example 9
• Example 9. f (x) = ax + b where x ∈ R


• is a convex function (plot is drawn for a = 1, b = −1)

64 / 136
Example 10
• Example 10. f (x) = x log x where x ≥ 0

[Plot of f(x) = x log x for x ∈ [0, 5].]

• is a convex function

65 / 136
Concave Functions
• Definition 8 (Concave function). A function f (·) is concave over
(a, b) if −f (x) is convex, i.e., for every x1 , x2 ∈ (a, b) and 0 ≤ λ ≤ 1:

f (λx1 + (1 − λ)x2 ) ≥ λf (x1 ) + (1 − λ)f (x2 )

• The function is strictly concave if for the above, the equality holds only
when λ = 0 or λ = 1.

[Figure: a concave function f(x); the chord value λf(x1) + (1 − λ)f(x2) lies below the function value f(λx1 + (1 − λ)x2).]
66 / 136
Example 11

• Example 11. f(x) = √x where x ≥ 0

[Plot of f(x) = √x for x ∈ [0, 5].]

• is a concave function

68 / 136
Example 12
• Example 12. f (x) = log x where x > 0

[Plot of f(x) = log x for x ∈ (0, 5].]

• is a concave function

70 / 136
How do we know if a function is convex (or concave)?
• If f (x) is twice differentiable,

f''(x) ≥ 0 → convex
f''(x) ≤ 0 → concave

• If you would like to learn more on convex functions, read Chapter 3 of:

Convex Optimization, Boyd-Vandenberghe (available online)

https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

71 / 136
Example 13
• Example 13. Is f (x) = x 2 for x ∈ R convex?

f'(x) = 2x    (65)
f''(x) = 2 > 0 → convex    (66)

72 / 136
Example 14
• Example 14. Is f (x) = log x for x > 0 convex?

f'(x) = 1 / (x ln 2)    (67)
f''(x) = −1 / (x² ln 2) < 0 → concave    (68)

73 / 136
Recap
• So far we have covered sections 2.3, 2.4, 2.5 from the textbook. Next,
we will cover 2.6, 2.7, 2.8.

74 / 136
Important Properties of Convex Functions
• Theorem 5. Let p1, . . . , pn ≥ 0 such that Σ_{i=1}^{n} pi = 1. If f(x) is
convex, then for any x1, . . . , xn,

f( Σ_{i=1}^{n} pi xi ) ≤ Σ_{i=1}^{n} pi f(xi)

75 / 136
Important Properties of Convex Functions
• Proof. Can be proved by induction.

Step 1. For n = 2, this is true by the definition of convexity.

76 / 136
Important Properties of Convex Functions
• Step 2. Assume that the claim holds for n − 1. Then, for n:

Σ_{i=1}^{n} pi f(xi) = pn f(xn) + (1 − pn) Σ_{i=1}^{n−1} ( pi / (1 − pn) ) f(xi)

Now, set qi = pi / (1 − pn) for i = 1, . . . , n − 1. Note that qi ≥ 0 and
Σ_{i=1}^{n−1} qi = 1. Since we assumed that the hypothesis is true for n − 1,

Σ_{i=1}^{n} pi f(xi) = pn f(xn) + (1 − pn) Σ_{i=1}^{n−1} qi f(xi)
                     ≥ pn f(xn) + (1 − pn) f( Σ_{i=1}^{n−1} qi xi )            (define x̄ ≜ Σ_{i=1}^{n−1} qi xi)
                     ≥ f( pn xn + (1 − pn) x̄ )                                 (hypothesis true for n = 2)
                     = f( pn xn + (1 − pn) Σ_{i=1}^{n−1} ( pi / (1 − pn) ) xi )   (substitute back x̄)
                     = f( Σ_{i=1}^{n} pi xi )
77 / 136
Jensen’s Inequality
• We will now state an important inequality.

• Theorem 6 (Jensen's inequality). If f is convex and X is a random


variable, we have:

E[f (X )] ≥ f (E[X ])

• Moreover, if f is strictly convex, equality implies that X = E[X ] with


probability 1, i.e., X is constant.

• Proof. If X is a discrete random variable, the proof is the same as the


proof of Theorem 5, by letting the pi , i = 1 . . . , n denote the PMF of X .

• This proof can be extended to continuous random variables also.

78 / 136
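A small numerical illustration of Jensen's inequality for the convex function f(x) = x² (the PMF below is an arbitrary choice for the demo):

```python
xs  = [-2.0, 0.0, 1.0, 3.0]
pmf = [0.1, 0.4, 0.3, 0.2]

EX  = sum(p * x for p, x in zip(pmf, xs))        # E[X]
EfX = sum(p * x**2 for p, x in zip(pmf, xs))     # E[f(X)] with f(x) = x^2
print(EfX, EX**2, EfX >= EX**2)                  # E[f(X)] >= f(E[X]) -> True
```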
Jensen’s Inequality
• Corollary 2. If f is concave,

E[f (X )] ≤ f (E[X ])
• Next, we will use Jensen’s inequality to prove some important
properties of the measures we have defined so far.

79 / 136
KL-distance is Non-negative
• Theorem 7 (Information Inequality). For two probability mass
functions (PMFs) p(x) and q(x) over an alphabet x ∈ X , we have:
D(p||q) ≥ 0
where equality holds if and only if p(x) = q(x) for all x.
• Proof. Define a set A = {x : p(x) > 0}. Then,
−D(p||q) = − Σ_{x∈A} p(x) log( p(x) / q(x) )
         = Σ_{x∈A} p(x) ( − log( p(x) / q(x) ) )
         = Σ_{x∈A} p(x) log( q(x) / p(x) )
         = E[ log( q(X) / p(X) ) ]     (expectation is taken over p(x) > 0)

80 / 136
KL-distance is Non-negative
• Proof continued. Recall that the function f(y) = log y is concave.
Therefore, log(q(x)/p(x)) is concave in q(x)/p(x). Then,

E[ log( q(X) / p(X) ) ] ≤ log E[ q(X) / p(X) ]     (Jensen's inequality - Corollary 2)    (69)
                        = log( Σ_{x∈A} p(x) ( q(x) / p(x) ) )                             (70)
                        = log( Σ_{x∈A} q(x) )                                             (71)
                        ≤ log( Σ_{x∈X} q(x) )      (log y is strictly increasing in y)    (72)
                        = log 1                    (probability of the entire sample space is 1)
                        = 0

• Therefore, D(p||q) ≥ 0
81 / 136
When is D(p||q) = 0?
• Note that f (y ) = log y is a strictly concave function of y. Then, from
Jensen’s Inequality, equality occurs, i.e.,

E[f (Y )] = f (E[Y ])

if and only if Y is a constant.


• In other words, (69) becomes an equality if and only if q(x)/p(x) = c for
some constant c for all x ∈ A. Then, (70) can be written as,

log( Σ_{x∈A} p(x) ( q(x) / p(x) ) ) = log( Σ_{x∈A} p(x) c )
                                    = log( c Σ_{x∈A} p(x) )
                                    = log( c Σ_{x∈X} p(x) )   (since p(x) = 0 ∀ x ∉ A)
                                    = log c

82 / 136
When is D(p||q) = 0?
• Finally, (72) becomes an equality if and only if

Σ_{x∈A} q(x) = c = 1

• Also from this result, we find that q(x) = 0 for all x ∉ A (second axiom
of probability, i.e., probability of the whole sample space is 1).

• Therefore, D(p||q) = 0 if and only if p(x) = q(x) for all x ∈ X.

83 / 136
Mutual Information is Non-negative
• Corollary 3. Mutual information is non-negative:

I(X ; Y ) ≥ 0

• Proof Follows from:

I(X ; Y ) = D(p(x, y)||p(x)p(y)) (by definition)

≥0 (KL-distance is non-negative, i.e., Theorem 7) (73)

and (73) becomes an equality if and only if p(x, y ) = p(x)p(y ), i.e.,


when X and Y are independent.

84 / 136
Corollaries
• Corollary 4. Conditional KL-distance is non-negative

D(p(y|x)||q(y |x)) ≥ 0

• Corollary 5. Conditional mutual information is non-negative

I(X ; Y |Z ) ≥ 0

85 / 136
Upper Bound on Entropy
• Theorem 8. For any random variable X defined over an alphabet X ,

H(X ) ≤ log |X |

where |X | represents the number of elements in the range of X and is


called the cardinality of X .
• This means that, for any random variable X , its entropy is no greater
than that of a uniform random variable defined over the same set of
elements X .
• Using this result, we can bound the entropy of any random variable as:

0 ≤ H(X ) ≤ log |X | (74)

• Let’s now prove the theorem.

86 / 136
Upper Bound on Entropy
• Proof. We will use Theorem 7. Specifically, let p(x) denote the PMF
of the random variable X and let u(x) = 1/|X| be the PMF of a uniform
random variable.
• Then,

D(p||u) = Σ_{x∈X} p(x) log( p(x) / u(x) )                                    (75)
        = Σ_{x∈X} p(x) log( 1 / u(x) ) − Σ_{x∈X} p(x) log( 1 / p(x) )        (76)
        = Σ_{x∈X} p(x) log |X| − Σ_{x∈X} p(x) log( 1 / p(x) )                (77)
        = log |X| − H(X)                                                     (78)
        ≥ 0                    (from Theorem 7)                              (79)

• Therefore,

H(X) ≤ log |X|    (80)
87 / 136
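A quick numerical check of Theorem 8 on randomly drawn PMFs (a sketch; names are ours): H(X) never exceeds log |X|, and the uniform PMF attains the bound:

```python
import random
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

n = 8
for _ in range(5):
    w = [random.random() for _ in range(n)]
    pmf = [wi / sum(w) for wi in w]            # normalize to a valid PMF
    print(entropy(pmf) <= log2(n) + 1e-12)     # True (small tolerance for rounding)
print(entropy([1/n] * n), log2(n))             # the uniform PMF achieves the bound: 3.0 3.0
```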
Uniform Distribution Maximizes Entropy
• Corollary 6. The uniform random variable has the largest entropy.
• Proof. Let X be a uniform random variable over a set of elements X.
Denote the PMF of X by u(x) = 1/|X| for all x ∈ X. Then, the entropy of
X is given by:

H(X) = − Σ_{x∈X} u(x) log u(x)               (81)
     = − Σ_{x∈X} (1/|X|) log(1/|X|)          (82)
     = Σ_{x∈X} (1/|X|) log |X|               (83)
     = ( Σ_{x∈X} 1/|X| ) log |X|             (84)
     = log |X|                               (85)

88 / 136
Conditioning Reduces Entropy
• Theorem 9. Conditioning can not increase entropy:

H(X |Y ) ≤ H(X )

• Proof. Follows from the non-negativity of mutual information:

I(X ; Y ) = H(X ) − H(X |Y ) ≥ 0

• Very important: This theorem implies that the conditional entropy


H(X |Y ) is less than or equal to the entropy H(X ). It does not say that
H(X |Y = y ) for any specific y is necessarily smaller than H(X )! Note
that H(X |Y ) is defined as the average of H(X |Y = y ) over all
realizations Y = y :
H(X|Y) = Σ_{y∈Y} p(y) H(X|Y = y)

The left-hand side can never be larger than H(X), but some of the individual terms H(X|Y = y) can be larger than H(X).

89 / 136
Example 15
• Let's consider a pair of r.v.s X, Y with a joint PMF p(x, y):

X \ Y |  0  |  1
  0   | 1/3 | 1/3
  1   |  0  | 1/3     (87)

• PMF of X:

P[X = 0] = 2/3,  P[X = 1] = 1/3   ⇒   p(x) = (2/3, 1/3)

• Then, the entropy of X is:

H(X) = −(2/3) log(2/3) − (1/3) log(1/3) = 0.918 bits

90 / 136
Example 15
• PMF of Y:

P[Y = 0] = 1/3,  P[Y = 1] = 2/3   ⇒   p(y) = (1/3, 2/3)

• Conditional PMF p(x|y):

P[X = 0|Y = 0] = P[X = 0, Y = 0] / P[Y = 0] = 1
P[X = 1|Y = 0] = 0
P[X = 0|Y = 1] = (1/3) / (2/3) = 1/2
P[X = 1|Y = 1] = 1/2

• Then,

p(x|y) | y = 0 | y = 1
 x = 0 |   1   |  1/2
 x = 1 |   0   |  1/2     (88)

91 / 136
Example 15
• Then,

H(X|Y) = Σ_{y∈Y} p(y) H(X|Y = y)
       = P[Y = 0] H(X|Y = 0) + P[Y = 1] H(X|Y = 1)

• Note that:

H(X|Y = 0) = 0 bits < H(X)

• On the other hand,

H(X|Y = 1) = 1 bit > H(X)!

• But, the average:

H(X|Y) = (1/3) × 0 + (2/3) × 1 = 2/3 = 0.667 bits < H(X)
as expected.
92 / 136
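A sketch reproducing Example 15 numerically (helper names are ours): one conditional term exceeds H(X), but the average H(X|Y) does not:

```python
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

HX = entropy([2/3, 1/3])                          # 0.918 bits
H_given_y0 = entropy([1.0, 0.0])                  # 0 bits
H_given_y1 = entropy([1/2, 1/2])                  # 1 bit  > H(X)
H_X_given_Y = (1/3) * H_given_y0 + (2/3) * H_given_y1
print(HX, H_given_y1, H_X_given_Y)                # 0.918  1.0  0.667
```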
Independence Bound on Entropy
• Theorem 10. For any set of n random variables X1, . . . , Xn, their joint
entropy can be upper bounded by the sum of the individual entropies:

H(X1, X2, . . . , Xn) ≤ Σ_{i=1}^{n} H(Xi)

• Proof. We will use the chain rule of entropy and Theorem 9.
Specifically, recall from the chain rule of entropy that:

H(X1, X2, . . . , Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1)

From Theorem 9, for each of these terms we have
H(Xi | Xi−1, . . . , X1) ≤ H(Xi), since conditioning cannot increase
entropy. Therefore,

Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1) ≤ Σ_{i=1}^{n} H(Xi)

with equality if and only if the Xi are all independent from each other.
93 / 136
Recap
• So far, we have seen Jensen’s inequality and used it to prove some
important results and observations.

• Next, we will see another useful inequality, called the Log-sum


inequality and use it to establish more results that are central to
information theory.

• We will be covering chapters 2.7, 2.8, and 2.10.

94 / 136
LOG-SUM inequality
• Theorem 11 (LOG-SUM Inequality). For non-negative numbers
a1, . . . , an and b1, . . . , bn, the following holds:

Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{i=1}^{n} ai / Σ_{i=1}^{n} bi )

95 / 136
LOG-SUM inequality
• Proof. We will use the convention 0 log 0 = 0, a log(a/0) = ∞ (for a > 0),
and 0 log(0/0) = 0. Then, without loss of generality, we can assume
ai, bi > 0 for all i.
• Define,

p(xi) = ai / Σ_{j=1}^{n} aj

and

q(xi) = bi / Σ_{j=1}^{n} bj

• Since p(xi), q(xi) ≥ 0, and

Σ_{i=1}^{n} p(xi) = Σ_{i=1}^{n} q(xi) = 1

p and q are valid PMFs.

96 / 136
LOG-SUM inequality
• Next, consider the KL-distance between p and q, and recall that the
KL-distance is non-negative, D(p||q) ≥ 0. In other words,

D(p||q) = Σ_{i=1}^{n} p(xi) log( p(xi) / q(xi) ) ≥ 0

By substituting p(xi) and q(xi),

⇒ Σ_{i=1}^{n} ( ai / Σ_{j=1}^{n} aj ) log( ( ai / Σ_{j=1}^{n} aj ) / ( bi / Σ_{j=1}^{n} bj ) ) ≥ 0

⇒ Σ_{i=1}^{n} ( ai / Σ_{j=1}^{n} aj ) ( log( ai / bi ) − log( Σ_{j=1}^{n} aj / Σ_{j=1}^{n} bj ) ) ≥ 0

⇒ Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{j=1}^{n} aj / Σ_{j=1}^{n} bj )

97 / 136
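A numerical spot-check of the log-sum inequality on random positive numbers (a sketch, not a proof):

```python
import random
from math import log2

n = 6
a = [random.uniform(0.1, 5.0) for _ in range(n)]
b = [random.uniform(0.1, 5.0) for _ in range(n)]

lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * log2(sum(a) / sum(b))
print(lhs >= rhs - 1e-12)   # True
```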
Relative entropy is convex
• Theorem 12. The relative entropy (KL-distance) is convex in the pair
of distributions (p, q). That is,

D(λp1 + (1 − λ)p2 ||λq1 + (1 − λ)q2 ) ≤ λD(p1 ||q1 ) + (1 − λ)D(p2 ||q2 )

• Proof. We first note that the definition of convexity from Definition 8


extends to vector variables. That is, let f (z) be a scalar function of a
group of variables represented by a vector z.
• Then, the function f (z) is said to be convex in z if:

f (λz1 + (1 − λ)z2 ) ≤ λf (z1 ) + (1 − λ)f (z2 )

• In particular, if z = [x y]T , this means that f is convex in (x, y) if

f (λx1 + (1 − λ)x2 , λy1 + (1 − λ)y2 ) ≤ λf (x1 , y1 ) + (1 − λ)f (x2 , y2 )

98 / 136
Relative entropy is convex
• Now, the inequality from Theorem 12 says that:

Σ_{x∈X} ( λp1(x) + (1 − λ)p2(x) ) log( ( λp1(x) + (1 − λ)p2(x) ) / ( λq1(x) + (1 − λ)q2(x) ) )
   ≤ λ Σ_{x∈X} p1(x) log( p1(x) / q1(x) ) + (1 − λ) Σ_{x∈X} p2(x) log( p2(x) / q2(x) )    (89)

• Let's have a look at this for one term with a specific x on both sides.
Can we prove the following inequality?

( λp1(x) + (1 − λ)p2(x) ) log( ( λp1(x) + (1 − λ)p2(x) ) / ( λq1(x) + (1 − λ)q2(x) ) )
   ≤ λp1(x) log( p1(x) / q1(x) ) + (1 − λ)p2(x) log( p2(x) / q2(x) )    (90)

99 / 136
Relative entropy is convex
• The log-sum inequality says that:

Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{i=1}^{n} ai / Σ_{i=1}^{n} bi )

• By setting n = 2, we have:

a1 log( a1 / b1 ) + a2 log( a2 / b2 ) ≥ (a1 + a2) log( (a1 + a2) / (b1 + b2) )    (91)

• Let

a1 = λp1(x),   b1 = λq1(x)
a2 = (1 − λ)p2(x),   b2 = (1 − λ)q2(x)

• Then, (91) becomes,

λp1(x) log( λp1(x) / (λq1(x)) ) + (1 − λ)p2(x) log( (1 − λ)p2(x) / ((1 − λ)q2(x)) )
   ≥ ( λp1(x) + (1 − λ)p2(x) ) log( ( λp1(x) + (1 − λ)p2(x) ) / ( λq1(x) + (1 − λ)q2(x) ) )

which means we proved (90).
• Now, if we take the sum over all x ∈ X of both sides, we have (89),
which proves Theorem 12.
100 / 136
Relative entropy is convex
• Corollary 7. Relative entropy is convex in p for any fixed q.
• Proof. Choose q1 = q2 = q in Theorem 12:

D(λp1 + (1 − λ)p2 ||q) ≤ λD(p1 ||q) + (1 − λ)D(p2 ||q)

101 / 136
Entropy is concave
• Theorem 13. (Concavity of entropy) Let H(p) denote the entropy of
X , with p representing the PMF of X (that is, if X = {x1 , . . . , xn },
p = (p(x1 ), p(x2 ), . . . , p(xn ))). Then, H(p) is a concave function of p.
• Proof. We will use Corollary 7.
• Let q(x) = 1/|X| for all x ∈ X, i.e., q(x) = u(x) (recall that u(x) is the PMF of the
discrete uniform distribution). Then,

D(p||u) = Σ_{x∈X} p(x) log( p(x) / (1/|X|) )
        = log |X| − H(X)

• Therefore,

H(X ) = H(p) = log |X | − D(p||u)

• By Corollary, 7 D(p||u) is convex in p.


• By Definition 8 (if f (x) is convex in x, then −f (x) is concave in x) we
have H(p) is concave in p.
102 / 136
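A numerical illustration of Theorem 13 (a sketch with arbitrarily chosen PMFs): the entropy of a mixture λp1 + (1 − λ)p2 is at least the mixture of the entropies:

```python
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
lam = 0.4
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
print(entropy(mix), lam * entropy(p1) + (1 - lam) * entropy(p2))
# the first value (entropy of the mixture) is the larger one
```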
Mutual Information is concave in p(x) for fixed p(y |x)
• Theorem 14. Mutual Information I(X ; Y ) is a concave function of p(x)
for fixed p(y |x).
• Proof. Recall that:

p(x, y) = p(y |x)p(x) = p(x|y )p(y )


I(X; Y) = H(Y) − H(Y|X) = H(Y) − Σ_{x∈X} p(x) H(Y|X = x)
             (A)                        (B)

• (A):
• Note that for fixed p(y|x), p(y) is a linear function of p(x) (because
p(y) = Σ_x p(y|x)p(x)).
• We know, by Theorem 13, that H(Y) is concave in p(y).
• Fact 1. Let y be a linear function of x. Then, a function f is concave in
x if and only if f is concave in y.
• Thus, H(Y) for fixed p(y|x) is concave in p(x).

103 / 136
Mutual Information is concave in p(x) for fixed p(y |x)
• (B):

−H(Y|X) = − Σ_x H(Y|X = x) p(x)

where each H(Y|X = x) is fixed since p(y|x) is fixed, so −H(Y|X)
is a linear function of p(x). Recall that a linear function is both
concave and convex.
• Fact 2. Function f is concave if:

f (x) = g1 (x) + g2 (x)


and g1 and g2 are both concave.
• ⇒ I(X ; Y ) is concave in p(x) for fixed p(y |x).
• Remark: This result will be very important for channel capacity.
• Thought exercise: How would you prove Fact 1 and Fact 2?

104 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• Theorem 14’. Mutual Information I(X ; Y ) is a convex function of
p(y|x) for fixed p(x).
• Proof. For a fixed p(x), let's define a function f(·) of p(y|x):

f(p(y|x)) = I(X; Y)                                        (92)
          = Σ_{x,y} p(x, y) log( p(x, y) / (p(x)p(y)) )    (93)
          = D( p(x)p(y|x) || p(x)p(y) )                    (94)

where p(y) ≜ Σ_x p(y|x)p(x).

105 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• In order to show that f(p(y|x)) = I(X; Y) is convex in p(y|x), we need
to show that (by definition of convexity):

f( λp1(y|x) + (1 − λ)p2(y|x) ) ≤ λ f( p1(y|x) ) + (1 − λ) f( p2(y|x) )    (95)

which, by using (94), can be re-written as:

D( p(x)pλ(y|x) || p(x)pλ(y) )
   ≤ λ D( p(x)p1(y|x) || p(x)p1(y) ) + (1 − λ) D( p(x)p2(y|x) || p(x)p2(y) )    (96)

where p1(y) ≜ Σ_x p1(y|x)p(x) and p2(y) ≜ Σ_x p2(y|x)p(x), and

pλ(y|x) ≜ λp1(y|x) + (1 − λ)p2(y|x)

pλ(y) ≜ Σ_x p(x)pλ(y|x) = Σ_x p(x)( λp1(y|x) + (1 − λ)p2(y|x) )
      = λp1(y) + (1 − λ)p2(y)
106 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• Now, let p1(x, y) ≜ p(x)p1(y|x) and p2(x, y) ≜ p(x)p2(y|x).
• Then, let q1(x, y) ≜ p(x)p1(y) and q2(x, y) ≜ p(x)p2(y)
and note that:

p(x)pλ(y|x) = λp(x)p1(y|x) + (1 − λ)p(x)p2(y|x)
            = λp1(x, y) + (1 − λ)p2(x, y)    (97)

and

p(x)pλ(y) = p(x) Σ_{x'} p(x')( λp1(y|x') + (1 − λ)p2(y|x') )
          = p(x) ( λp1(y) + (1 − λ)p2(y) )
          = λq1(x, y) + (1 − λ)q2(x, y)    (98)

By substituting (97) and (98) in (96), we find that

D( λp1(x, y) + (1 − λ)p2(x, y) || λq1(x, y) + (1 − λ)q2(x, y) )
   ≤ λ D( p1(x, y) || q1(x, y) ) + (1 − λ) D( p2(x, y) || q2(x, y) )    (99)

which holds since we know that D(p||q) is convex in (p, q); therefore
(95) holds and the proof is completed.
107 / 136
Entropy of a function of a random variable
• Let g(X ) be some known function of X . Then,

H(g(X )) ≤ H(X )

with equality if and only if the function is 1-to-1.


• In other words, entropy of a function of X is less than the entropy of X .

108 / 136
Entropy of a function of a random variable
• Proof. We know that,

I(X ; Y ) = H(X ) − H(X |Y ) = H(Y ) − H(Y |X ) (100)

Let Y = g(X ), then (100) becomes:

H(X ) − H(X |g(X )) = H(g(X )) − H(g(X )|X ) (101)

which means

H(g(X )) = H(X ) − H(X |g(X )) + H(g(X )|X ) (102)

109 / 136
Entropy of a function of a random variable
• Proof. We know that,

I(X ; Y ) = H(X ) − H(X |Y ) = H(Y ) − H(Y |X ) (100)

Let Y = g(X ), then (100) becomes:

H(X ) − H(X |g(X )) = H(g(X )) − H(g(X )|X ) (101)

which means

H(g(X )) = H(X ) − H(X |g(X )) + H(g(X )|X ) (102)

• Now, note that H(g(X )|X ) = 0, because by knowing X we can


immediately find g(X ). In other words, there’s no uncertainty in g(X )
after knowing X ! Then, (102) becomes,

H(g(X )) = H(X ) − H(X |g(X )) (103)

• Finally, note that H(X |g(X )) ≥ 0 since entropy cannot be negative, so

H(g(X )) ≤ H(X ) (104)


109 / 136
Example 16
• Example 16. (Entropy of a sum) Let X and Y be two random variables
taking values in x1 , . . . , xn and y1 , . . . , yn , respectively. Let Z = X + Y .
(a) Show that H(Z |X ) = H(Y |X ) and if X and Y are independent,
then H(Z ) ≥ H(X ) and H(Z ) ≥ H(Y ).
• Solution:
H(Z|X) = Σ_x p(x) H(Z|X = x)
       = Σ_x p(x) Σ_z P[Z = z|X = x] log( 1 / P[Z = z|X = x] )
       = Σ_x p(x) Σ_z P[Y = z − X|X = x] log( 1 / P[Y = z − X|X = x] )
       = Σ_x p(x) Σ_z P[Y = z − x|X = x] log( 1 / P[Y = z − x|X = x] )
       = Σ_x p(x) Σ_y P[Y = y|X = x] log( 1 / P[Y = y|X = x] )
       = Σ_x p(x) H(Y|X = x)
       = H(Y|X)    (105)
110 / 136
Example 16
• If X and Y are independent,

H(Z) ≥ H(Z|X) = H(Y|X) = H(Y)
        (A)      (B)       (C)

• (A): By Theorem 9 (Conditioning reduces entropy)


(B): By equation (105)
(C): Since X and Y are independent
• We can follow the same steps (for X and Y independent) to show
H(Z ) ≥ H(Z |Y ) = H(X |Y ) = H(X )
• Since H(Z ) ≥ H(X ) and H(Z ) ≥ H(Y ) for independent X and Y and
Z = X + Y , we conclude that adding two independent random
variables increases entropy.

111 / 136
Example 16
• (b) Give an example of (necessarily dependent) random variables in
which H(X ) > H(Z ) and H(Y ) > H(Z ) where Z = X + Y .
• Solution.
Let

X = 1 w.p. (with probability) 1/2
    0 w.p. 1/2

and

Y = −X
Z = X + Y

• Then,

H(X ) = H(Y ) = 1 bit


H(Z ) = 0 bits (Why?)

112 / 136
Example 16
• (c) Under what conditions does

H(Z ) = H(X ) + H(Y )

hold?

113 / 136
Example 16
• Solution. Recall that for any function g(X ) of X , we showed that
H(g(X )) ≤ H(X ), with equality if and only if the function is 1-to-1.
• Let Z be a function of (X , Y ). Then,

H(Z ) ≤ H(X , Y ) = H(X ) + H(Y |X ) ≤ H(X ) + H(Y ) (106)

where the equality holds if and only if (iff) Z is a 1-to-1 function of


(X , Y ) and X and Y are independent.
• Therefore, H(X + Y ) = H(X ) + H(Y ) iff X and Y are independent and
Z is a 1-to-1 function of (X , Y ). That is, the value of Z should uniquely
determine the value of (X , Y ).
• For example, suppose X takes values in {0, 1}, and Y takes values in
{−5, 4}. Then,
• if Z = −5 we know for sure that X = 0, Y = −5 (no other assignment
satisfies Z = −5!),
• if Z = 4 we know that X = 0, Y = 4,
• if Z = −4 we know that X = 1, Y = −5,
• if Z = 5 we know that X = 1, Y = 4.
114 / 136
Recap
• So far, we have seen some important properties of entropy, relative
entropy, and mutual information.

• Next, we will learn about one of the most important concepts in


information theory, called the data processing inequality.

• We will first define/review “Markov Chains”. To read more on Markov


chains, see the book of Papoulis (Probability, Random variables, and
Stochastic Processes), Chapter 15.

115 / 136
Markov Chains
• Let Xn be a discrete-time, discrete-state random process.
• Xn is called a Markov chain iff

P[Xn = xn |Xn−1 = xn−1 , . . . , X1 = x1 ] = P[Xn = xn |Xn−1 = xn−1 ]

• This is called the Markov property: the most recent past summarizes all history.
• The future of the process, given the present, is independent of its past.
• Typically, we represent the states and the transition probabilities as:
[State-transition diagram: states i, j, k , l, with directed edges labeled by transition probabilities such as P[Xn = l | Xn−1 = j] on the edge from j to l]

116 / 136
Markov Chains
• Definition 10. Three random variables X , Y , Z form a Markov chain
in that order, shown as X → Y → Z if the following equivalent
conditions are satisfied:
1. p(z|y, x) = p(z|y )
2. p(x, y , z) = p(z|y )p(y |x)p(x), since

       p(x, y , z) = p(z|y , x)p(y |x)p(x) = p(z|y )p(y |x)p(x),

   where p(z|y , x) = p(z|y ) if the chain is Markov.
3. p(x, z|y ) = p(z|y )p(x|y ), since

       p(x, z|y ) = p(z|y , x)p(x|y ) = p(z|y )p(x|y ),

   where p(z|y , x) = p(z|y ) if the chain is Markov.
• This means that X and Z are independent when conditioned on Y (recall the definition of conditional independence); a numerical check of this factorization is sketched below.
• Remark: If X → Y → Z , then Z → Y → X (prove this as an exercise).
• Remark: If Z = f (Y ), then we always have X → Y → Z .
117 / 136
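The sketch below (my own addition; the alphabet sizes and probabilities are arbitrary placeholders) builds a joint PMF of the form p(x)p(y|x)p(z|y) and verifies condition 3, p(x, z|y ) = p(x|y )p(z|y ), numerically.

```python
import itertools
import random

random.seed(0)

def random_pmf(n):
    """An arbitrary PMF on {0, ..., n-1} (placeholder numbers)."""
    w = [random.random() for _ in range(n)]
    return [v / sum(w) for v in w]

nx, ny, nz = 2, 3, 2
p_x = random_pmf(nx)
p_y_given_x = [random_pmf(ny) for _ in range(nx)]
p_z_given_y = [random_pmf(nz) for _ in range(ny)]

# Condition 2 of Definition 10: p(x, y, z) = p(x) p(y|x) p(z|y)
p = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
     for x, y, z in itertools.product(range(nx), range(ny), range(nz))}

# Check condition 3: p(x, z | y) = p(x | y) p(z | y) for every (x, y, z)
for y in range(ny):
    p_y = sum(p[(x, y, z)] for x in range(nx) for z in range(nz))
    for x, z in itertools.product(range(nx), range(nz)):
        p_xz_y = p[(x, y, z)] / p_y
        p_x_y = sum(p[(x, y, zz)] for zz in range(nz)) / p_y
        p_z_y = sum(p[(xx, y, z)] for xx in range(nx)) / p_y
        assert abs(p_xz_y - p_x_y * p_z_y) < 1e-12

print("p(x, z|y) = p(x|y) p(z|y) holds for this chain")
```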
Data Processing Inequality (DPI)
• Theorem 15. If X , Y , Z form a Markov chain X → Y → Z , then,

I(X ; Y ) ≥ I(X ; Z )

• This means that no kind of processing can ever increase mutual information (a numerical illustration is sketched after the proof).

118 / 136
Data Processing Inequality (DPI)
• Proof. Recall the Chain rule of mutual information:
I(X1 , . . . , Xn ; Y ) = Σ_{i=1}^{n} I(Xi ; Y |Xi−1 , . . . , X1 )

• So, we have,
I(X ; Y , Z ) = I(X ; Y |Z ) + I(X ; Z )
and similarly,
I(X ; Y , Z ) = I(X ; Z |Y ) + I(X ; Y )
so that:
I(X ; Y |Z ) + I(X ; Z ) = I(X ; Z |Y ) + I(X ; Y ) (107)
• If X → Y → Z , then, conditioned on Y , random variables X and Z are
independent (i.e., p(x, z|y ) = p(x|y )p(z|y)). Therefore the mutual
information between X and Z conditioned on Y is 0, i.e.,

I(X ; Z |Y ) = 0

119 / 136
Data Processing Inequality (DPI)
• Then, (107) becomes,

I(X ; Y |Z ) + I(X ; Z ) = I(X ; Y ) (108)

• Finally, note that I(X ; Y |Z ) ≥ 0 because mutual information cannot be


negative. Then, we have from (108) that,

I(X ; Z ) ≤ I(X ; Y ) (109)

which completes the proof.

120 / 136
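The following sketch (my own addition; the alphabet sizes and transition probabilities are arbitrary placeholders) builds a chain X → Y → Z and confirms numerically that I(X ; Z ) never exceeds I(X ; Y ).

```python
import itertools
import math
import random

random.seed(1)

def random_pmf(n):
    """An arbitrary PMF on {0, ..., n-1} (placeholder numbers)."""
    w = [random.random() for _ in range(n)]
    return [v / sum(w) for v in w]

def mutual_information(joint):
    """I(A;B) in bits from a joint PMF given as a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

# Markov chain X -> Y -> Z: p(x, y, z) = p(x) p(y|x) p(z|y)
nx, ny, nz = 3, 3, 2
p_x = random_pmf(nx)
p_y_given_x = [random_pmf(ny) for _ in range(nx)]
p_z_given_y = [random_pmf(nz) for _ in range(ny)]

p_xy, p_xz = {}, {}
for x, y, z in itertools.product(range(nx), range(ny), range(nz)):
    p = p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
    p_xy[(x, y)] = p_xy.get((x, y), 0) + p
    p_xz[(x, z)] = p_xz.get((x, z), 0) + p

print(mutual_information(p_xy))   # I(X;Y)
print(mutual_information(p_xz))   # I(X;Z) <= I(X;Y), as the DPI guarantees
```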
Data Processing Inequality (DPI)
• Corollary 8. If X → Y → Z forms a Markov chain,

I(X ; Y |Z ) ≤ I(X ; Y ) (110)

• Proof follows from the same argument above.


• The “dependence” between X and Y decreases by observing a
“downstream” random variable Z .
• This does not necessarily hold if X , Y , Z don’t form a Markov chain.

121 / 136
Data Processing Inequality (DPI)
• Corollary 9. If Z = g(Y )

I(X ; Y ) ≥ I(X ; g(Y )) (111)

• Proof. Recall that X → Y → g(Y ) always forms a Markov chain, since g(Y ) is a function of Y .

• The DPI says that we can never get more information (about a random variable) by further processing (that random variable)!

122 / 136
Example 17
• Problem. Show that if H(X |Y ) = 0, then there exists a function g(Y )
such that X = g(Y ). In other words, X is a function of Y .

123 / 136
Example 17
• Solution. Let’s start by the definition of H(X |Y ):
X
H(X |Y ) = p(y)H(X |Y = y )
y ∈Y
X X 1
= p(y) P[X = x|Y = y] log
P[X = x|Y = y ]
y ∈Y x∈X
| {z }
≥0

• To have this equal to zero, whenever p(y ) > 0 we should have H(X |Y = y ) = 0. To have H(X |Y = y ) = 0, we must have:

    P[X = x|Y = y ] = 1 if x = x0 , and 0 otherwise,    (112)

  for some x0 ∈ X .

124 / 136
Example 17
• We can re-write (112) as:

P[X = x|Y = y ] = δ(x − x0 )

for some x0 , which means for each y with p(y ) > 0, there is only one
possible value of x, hence x = g(y ).
• For p(y ) = 0, we can assign g(y ) to an arbitrary value in X (such y never occurs, so the choice does not matter).

• Then, for each value of Y , X takes only one value with probability 1, i.e., X = g(Y ). (A small numerical check of this equivalence is sketched below.)

125 / 136
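A small numerical check of this equivalence (my own addition; the joint PMFs below are arbitrary examples): when X is a function of Y the conditional entropy comes out as 0, and otherwise it is strictly positive.

```python
import math

def conditional_entropy(joint):
    """H(X|Y) in bits from a joint PMF given as a dict {(x, y): probability}."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0) + p
    return sum(p * math.log2(p_y[y] / p) for (x, y), p in joint.items() if p > 0)

# Case 1: X = g(Y) with g(y) = y mod 2, Y uniform on {0, 1, 2, 3}
deterministic = {(y % 2, y): 0.25 for y in range(4)}
print(conditional_entropy(deterministic))   # 0.0 -- X is a function of Y

# Case 2: X is only a noisy copy of Y, not a function of Y
noisy = {(0, 0): 0.4, (1, 0): 0.1, (1, 1): 0.4, (0, 1): 0.1}
print(conditional_entropy(noisy))           # about 0.72 bits > 0
```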
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, if H(X |Y ) = 0 if and only if X is a function of Y .

126 / 136
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, if H(X |Y ) = 0 if and only if X is a function of Y .
• In other words, we can estimate the value of X from the observations
Y with zero error probability if and only if H(X |Y ) = 0.

126 / 136
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, if H(X |Y ) = 0 if and only if X is a function of Y .
• In other words, we can estimate the value of X from the observations
Y with zero error probability if and only if H(X |Y ) = 0.
• We will now see an important inequality, called Fano’s inequality,
which extends this argument to arbitrary X and Y .

126 / 136
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, H(X |Y ) = 0 if and only if X is a function of Y .
• In other words, we can estimate the value of X from the observations
Y with zero error probability if and only if H(X |Y ) = 0.
• We will now see an important inequality, called Fano’s inequality,
which extends this argument to arbitrary X and Y .
• It says that if we are to estimate a random variable X from the observations of Y with small probability of error, then H(X |Y ) must be small.

126 / 136
Fano’s Inequality
• Suppose there are two random variables X and Y with joint PMF
p(x, y ).

127 / 136
Fano’s Inequality
• Suppose there are two random variables X and Y with joint PMF
p(x, y ).
• We observe Y and want to guess the value of X .

127 / 136
Fano’s Inequality
• Suppose there are two random variables X and Y with joint PMF
p(x, y ).
• We observe Y and want to guess the value of X .

• Let the “guess” be X̂ = g(Y ) and define the probability of making a wrong guess:

    Pe = P(X̂ ≠ X )

127 / 136
Fano’s Inequality
• Theorem 13. (Fano’s Inequality) For any estimator X̂ = g(Y ) (this
implies X → Y → X̂ ),

H(Pe ) + Pe log |X | ≥ H(X |Y ) (113)

which can be weakened to (by using the fact that for the binary
entropy function H(Pe ) ≤ 1),

1 + Pe log |X | ≥ H(X |Y ) (114)


H(X |Y ) − 1
Pe ≥ (115)
log |X |

128 / 136
Fano’s Inequality
• Theorem 13. (Fano’s Inequality) For any estimator X̂ = g(Y ) (this
implies X → Y → X̂ ),

H(Pe ) + Pe log |X | ≥ H(X |Y ) (113)

which can be weakened to (by using the fact that for the binary
entropy function H(Pe ) ≤ 1),

1 + Pe log |X | ≥ H(X |Y ) (114)


H(X |Y ) − 1
Pe ≥ (115)
log |X |
• If Pe = 0, then Fano’s inequality says that H(X |Y ) = 0, which is in line
with our intuition.

128 / 136
Fano’s Inequality
• Theorem 13. (Fano’s Inequality) For any estimator X̂ = g(Y ) (this
implies X → Y → X̂ ),

H(Pe ) + Pe log |X | ≥ H(X |Y ) (113)

which can be weakened to (by using the fact that for the binary
entropy function H(Pe ) ≤ 1),

1 + Pe log |X | ≥ H(X |Y ) (114)


Pe ≥ (H(X |Y ) − 1) / log |X |    (115)
• If Pe = 0, then Fano’s inequality says that H(X |Y ) = 0, which is in line
with our intuition.
• The main use of Fano's inequality will be in the converse of the channel capacity theorem. (A small numerical check of (113) is sketched after the proof.)

128 / 136
Fano’s Inequality
• Proof. Define a random variable:
(
1 if X̂ 6= X Wrong decision
E= (116)
0 if X̂ = X Correct decision

129 / 136
Fano’s Inequality
• Proof. Define a random variable:
(
1 if X̂ 6= X Wrong decision
E= (116)
0 if X̂ = X Correct decision

• Use chain rule of entropy in two ways:

H(E, X |Y ) = H(X |Y ) + H(E|X , Y ) = H(E|Y ) + H(X |E, Y )


| {z }
0

therefore,

H(X |Y ) = H(E|Y ) +H(X |E, Y )


| {z }
≤H[E]

129 / 136
Fano’s Inequality
• Proof. Define a random variable:
    E = 1   if X̂ ≠ X    (wrong decision)
    E = 0   if X̂ = X    (correct decision)                (116)

• Use the chain rule of entropy in two ways:

    H(E, X |Y ) = H(X |Y ) + H(E|X , Y ) = H(E|Y ) + H(X |E, Y )

  Since E is determined by X and X̂ = g(Y ), we have H(E|X , Y ) = 0; therefore,

    H(X |Y ) = H(E|Y ) + H(X |E, Y ) ≤ H(E) + H(X |E, Y )

  where we used the fact that conditioning reduces entropy, H(E|Y ) ≤ H(E).
• By defining H(Pe ) ≜ H(E) (the binary entropy of Pe , since P[E = 1] = Pe ), we have

    H(X |Y ) ≤ H(Pe ) + H(X |E, Y )

129 / 136
Fano’s Inequality
• But

    H(X |E, Y ) = Σ_{e=0}^{1} P[E = e] H(X |Y , E = e)
                = P[E = 0] H(X |Y , E = 0) + P[E = 1] H(X |Y , E = 1)
                = (1 − Pe ) · 0 + Pe · H(X |Y , E = 1)     (A)

• (A): Given E = 1, we know Y and that X̂ = g(Y ) ≠ X , so X takes one of the remaining values (anything except the one we predicted), i.e., one of |X | − 1 values. This conditional entropy is therefore upper bounded by log(|X | − 1), so that:

H(Pe ) + Pe log(|X | − 1) ≥ H(X |Y )

We can relax this by noting that H(Pe ) ≤ 1 and log(|X | − 1) ≤ log |X |,

H(X |Y ) ≤ 1 + Pe log |X |

which leads to the lower bound on Pe .


130 / 136
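The sketch below (my own addition; the joint PMF is an arbitrary example, and the estimator is the MAP rule, i.e., guess the most likely x for each observed y) checks inequality (113) numerically. Fano's inequality must hold for any estimator, so the MAP choice here is just for concreteness.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# An arbitrary joint PMF p(x, y) with X in {0, 1, 2} and Y in {0, 1}
p = {(0, 0): 0.30, (1, 0): 0.10, (2, 0): 0.05,
     (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.35}
alphabet_X = {0, 1, 2}

p_y = {}
for (x, y), pr in p.items():
    p_y[y] = p_y.get(y, 0) + pr

# H(X|Y)
H_X_given_Y = sum(pr * math.log2(p_y[y] / pr) for (x, y), pr in p.items() if pr > 0)

# MAP estimator and its error probability Pe = P(Xhat != X)
guess = {y: max(alphabet_X, key=lambda x: p[(x, y)]) for y in p_y}
Pe = sum(pr for (x, y), pr in p.items() if guess[y] != x)

fano_bound = binary_entropy(Pe) + Pe * math.log2(len(alphabet_X))
print(H_X_given_Y, Pe, fano_bound)   # H(X|Y) ~ 1.23 <= 1.49 ~ H(Pe) + Pe log|X|
```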
Another Useful Inequality
• Here is another inequality that relates probability of error and entropy.
• Theorem 18. Let X and X ′ be two independent identically distributed (i.i.d.) random variables with entropy H(X ). Then,

    P(X = X ′ ) ≥ 2^{−H(X )}    (117)

• Proof. First, note that:

    P(X = X ′ ) = Σ_x p(x) P[X ′ = x|X = x]
                = Σ_x p(x) P[X ′ = x]      (X and X ′ are independent)
                = Σ_x p(x) P[X = x]        (X and X ′ are identically distributed)
                = Σ_x p(x)^2

131 / 136
Another Useful Inequality
• Then,

    2^{−H(X )} = 2^{ Σ_x p(x) log p(x) }
               ≤ Σ_x p(x) 2^{log p(x)}      (from Jensen's inequality)    (118)
               = Σ_x p(x)^2
               = P(X = X ′ )

  where in (118) we used the convexity of the function f (Y ) = 2^Y . Then, by letting Y = log p(X ), we have

    f (E[Y ]) ≤ E[f (Y )]  ⇒  2^{E[Y ]} ≤ E[2^Y ]  ⇒  2^{E[log p(X )]} ≤ E[2^{log p(X )}]

  (A quick numerical check of (117) is sketched below.)

132 / 136
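A quick numerical check of (117) (my own addition; the PMF below is an arbitrary placeholder):

```python
import math
import random

random.seed(2)

# An arbitrary PMF on a 5-letter alphabet (placeholder numbers)
w = [random.random() for _ in range(5)]
p = [v / sum(w) for v in w]

H = sum(q * math.log2(1 / q) for q in p)        # H(X) in bits
collision = sum(q * q for q in p)               # P(X = X') for i.i.d. X, X'

print(collision, 2 ** (-H))   # the collision probability is never below 2^{-H(X)}
```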
Summary
• We have now finished the “toolbox” lectures, i.e., Chapter 2 from the
book.
• Next, we will start Chapter 3, the “Asymptotic Equipartition Property”.

133 / 136
