
Information Theory

Lecture 1: Entropy, Relative Entropy, and Mutual Information

Basak Guler

1 / 136
Background
• Information theory studies the fundamental limits of how to represent,
compress, and transfer information.

• The field was founded by Claude Shannon in 1948 through his


landmark paper:

A Mathematical Theory of Communication


(Available online)
Suggested reading: pages 1 and 2 of the paper.

• Who is Claude Shannon?


• A recent article:
https://www.quantamagazine.org/how-claude-shannons-information-theory-invented-the-future-20201222/
• A recent movie commissioned by the IEEE Information Theory Society:
The Bit Player (2018)
https://www.imdb.com/title/tt5015534/

2 / 136
Introduction
Information theory answers two fundamental questions:

1. What is the ultimate limit of data compression?


• If a source generates information in digital form, how many bits do we
need to compress it?
• Information theory states that the amount of information a source
contains can be measured by the extent to which it can be compressed.
• If a source contains more information, it requires more bits to compress.
• This depends on the statistical model of the source.

2. What is the ultimate rate of data transmission?


• Communication channels are noisy, which corrupts the transmitted
signals.
• Despite the random noise introduced by the channel, it is possible to
reliably convey what is transmitted to the receiver by introducing
redundancy in our transmitted data.
• The maximum rate at which reliable communication is possible is called
the channel capacity, which is dependent upon the statistical model of
the channel.
3 / 136
Remark
• Information theory identifies the “limits” (what can and cannot be
achieved theoretically) and “suggests” how to achieve them. These
theoretical schemes, however, may not be practical. The existence of
these limits inspires engineers to build practical algorithms to try to
approach/achieve these limits.

4 / 136
Remark
• Although Information Theory originated from “dealing” with
communications, its principles and impact go well beyond the field
of communications.

• Over the years the “theory of information” and information-theoretic


concepts (such as entropy, relative entropy/KL-divergence, mutual
information) have been instrumental in many fields, including
computer science (machine learning, security & privacy), statistics,
economics, etc.

5 / 136
Notation
• We will assume that a discrete random variable (r.v.) X has an
alphabet X , meaning that X takes values from the set X , with a
probability mass function (PMF):

PX (x) = P[X = x] (1)

for x ∈ X .
• With some abuse of notation, we will use p(x) to denote the PMF of
X, instead of P_X(x). That is, we will define p(x) ≜ P_X(x).

6 / 136
Remark
• Information theory relies on a set of mathematical tools.

• In particular, there are a few key definitions that facilitate the main
results.

• First, we will need to learn those.

• The most important notions are Entropy (H) and Mutual Information (I).

• Let’s start!

7 / 136
Entropy
• Definition 1. Entropy of a discrete random variable X is:

H(X) = Σ_{x∈X} p(x) log( 1 / p(x) )    (2)
     = − Σ_{x∈X} p(x) log p(x)         (3)

• Entropy measures the uncertainty about the r.v.


• Entropy is a function of the PMF of the r.v.
• The log is base 2, and the resulting unit is called a bit - this is the
basic unit of information in communications and computing.
Less often, we use base e; the unit of information is then called a "nat".

8 / 136
Entropy
• Recall that for a function g(X ) of X ,
E[g(X)] = Σ_{x∈X} p(x) g(x)    (4)

If we define the function g(X) such that

g(x) = log( 1 / p(x) )   ∀x ∈ X,    (5)

then the entropy is simply:

H(X) = E[g(X)] = E[ log( 1 / p(X) ) ].    (6)

9 / 136
Example 1
• Consider a r.v. X that takes values from X = {1, 2, . . . , 8} with equal
probability.

Calculate the entropy of X .

• Solution. The entropy of X is:

H(X) = − Σ_{x=1}^{8} (1/8) log(1/8)    (7)
     = 8 × (1/8) × log 8               (8)
     = 3 bits                          (9)

10 / 136
Example 2
• Consider a r.v. X that takes values from X = {1, 2, . . . , 8}. Suppose
the PMF of X is (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

Calculate the entropy of X .

• Solution. The entropy of X is:

H(X) = − Σ_{x=1}^{8} p(x) log p(x)                                                    (10)
     = Σ_{x=1}^{8} p(x) log( 1 / p(x) )                                               (11)
     = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + (1/16) log 16 + 4 × (1/64) log 64    (12)
     = 1/2 + 1/2 + 3/8 + 2/8 + 3/8                                                    (13)
     = 2 bits                                                                         (14)

11 / 136
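The calculation in Examples 1 and 2 is easy to reproduce numerically. Below is a minimal Python sketch (the helper name `entropy` is ours, not from the text) that implements Definition 1 with the convention 0 log 0 = 0:

```python
# Entropy in bits of a PMF given as a list of probabilities.
from math import log2

def entropy(pmf):
    """Return H(X) = -sum p(x) log2 p(x), skipping zero-probability symbols."""
    return -sum(p * log2(p) for p in pmf if p > 0)

# Example 1: uniform PMF over 8 symbols -> 3 bits
print(entropy([1/8] * 8))                            # 3.0
# Example 2: PMF (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) -> 2 bits
print(entropy([1/2, 1/4, 1/8, 1/16] + [1/64] * 4))   # 2.0
```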
Observations
• Entropy is non-negative.

• Entropy of the uniformly-distributed (which we will call equiprobable)


r.v. is higher.

H(1/8, . . . , 1/8) > H(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)    (15)

Thought exercise: Why?

We will prove these observations next.

12 / 136
Properties of Entropy
• Lemma 1. Entropy is always non-negative, i.e., H(X ) ≥ 0.

• Proof. The result follows from the axioms of probability, in
particular, from the fact that 0 ≤ p(x) ≤ 1 for all x ∈ X.
• for p(x) ∈ (0, 1), we have log p(x) < 0.

Then, −p(x) log p(x) > 0.

• for p(x) = 1, we have log p(x) = 0.

Then, −p(x) log p(x) = 0.

• for p(x) = 0, log 0 is undefined.

For such cases, we use the convention

0 log 0 = 0 (16)

which follows by taking the limit lim_{p(x)→0} p(x) log p(x) = 0.

• Therefore, H(X) = − Σ_{x∈X} p(x) log p(x) ≥ 0 for all X.

13 / 136
Properties of Entropy
• Lemma 2. Let Ha (X ) denote the entropy of X with the logarithm taken
with respect to base a, i.e.,
H_a(X) = Σ_{x∈X} p(x) log_a( 1 / p(x) )    (17)

Similarly, let H_b(X) denote the entropy with respect to base b. Then,

H_b(X) = (log_b a) H_a(X).    (18)

14 / 136
Properties of Entropy
• Proof. Note that,

log_a p(x) = log_b p(x) / log_b a    (19)

or equivalently, log_b p(x) = (log_b a) log_a p(x). Then,

H_b(X) = − Σ_{x∈X} p(x) log_b p(x)                  (20)
       = − Σ_{x∈X} p(x) (log_b a) log_a p(x)        (21)
       = (log_b a) ( − Σ_{x∈X} p(x) log_a p(x) )    (22)
       = (log_b a) H_a(X)                           (23)

• Therefore, entropy base can be changed from one to another with a


constant multiplier.

15 / 136
Example 3- Binary Entropy Function
• Consider a binary r.v. X :

X = 1 with probability p
    0 with probability 1 − p     (24)

Find the entropy of X .

• Solution. The entropy of X :

H(X) = −(p log p + (1 − p) log(1 − p)) ≜ H(p)    (25)

is called the binary entropy function.

16 / 136
Example 3- Binary Entropy Function
• Properties of the binary entropy function:

H(p) = −(p log p + (1 − p) log(1 − p)) (26)

• H(p) = H(1 − p) by definition, so it is symmetric around 1/2.


• H(0) = H(1) = 0 (this means there is no uncertainty in either case)
To calculate H(0) and H(1), we use limx→0 x log x = 0.
• H(p) is maximized when p = 1/2, resulting in 1 bit of entropy.
• H(p) is a concave function of p (More on this later).
17 / 136
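A small Python sketch of the binary entropy function (the name `binary_entropy` is ours) that checks these properties numerically:

```python
from math import log2

def binary_entropy(p):
    if p in (0.0, 1.0):          # convention: 0 log 0 = 0, so H(0) = H(1) = 0
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

print(binary_entropy(0.5))                         # 1.0 bit (the maximum)
print(binary_entropy(0.1), binary_entropy(0.9))    # equal: H(p) = H(1 - p)
```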
Example 4-Best Strategy to Guess the Value of a R.V.
• Let X be a random variable with the following PMF:

X = 1 with probability 1/2
    2 with probability 1/4
    3 with probability 1/8
    4 with probability 1/8
• E.g., X is the outcome of a biased 4-sided die roll.


• Suppose that we do not know the “true value (realization)” of X .
• We can make guesses in the form of ”Is X = 2?”. After each question,
we get an answer yes/no. If we are wrong, we make another guess.
• What is the best strategy for guessing the outcome?

18 / 136
Example 4-Best Strategy to Guess the Value of a R.V.
• Answer. Intuitively, it is better to start by guessing the most likely
outcome. Then, we are more likely to be correct.
This strategy would look like:

Question 1 Is X = 1? (half the time we will be right!)

YES NO

Question 2 Done (in 1 question) Is X = 2? (next “most probable”)


(X = 1)

YES NO

Question 3 Done (in 2 questions) Is X = 3?


(X = 2)
YES NO

Done (in 3 questions) Done (in 3 questions)


(X = 3) (X = 4)

19 / 136
Example 4-Best Strategy to Guess the Value of a R.V.
• Answer. Intuitively, it is better to start by guessing the most likely
outcome. Then, we are more likely to be correct.
• Then the expected number of questions is:

1 × P[X = 1] + 2 × P[X = 2] + 3 × P[X = 3 or 4]
  = 1 × 1/2 + 2 × 1/4 + 3 × (1/8 + 1/8)    (27)
  = 7/4                                    (28)

• If we calculate the entropy of X, we will also find that H(X) = 7/4!
• This is not a coincidence!
• It turns out that there is a strong relation between the best strategy to
guess the outcome of a random variable and its entropy.
• In future lectures, we will see why this is the case.

20 / 136
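A quick numerical check of Example 4 (the variable names below are ours): the expected number of questions under the most-likely-first strategy matches H(X) = 7/4 for this PMF:

```python
from math import log2

pmf = {1: 1/2, 2: 1/4, 3: 1/8, 4: 1/8}
# Number of questions asked if the true value is 1, 2, 3, or 4 (see the tree above):
questions = {1: 1, 2: 2, 3: 3, 4: 3}

expected_questions = sum(pmf[x] * questions[x] for x in pmf)
entropy = -sum(p * log2(p) for p in pmf.values())
print(expected_questions, entropy)   # 1.75 1.75
```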
Recap
• Entropy is a measure of uncertainty, randomness, amount of
self-information.
• Less entropy means
• less randomness, less self-information
• more compression, less average number of bits needed to represent the
outcomes
• In the future chapters, we will study these concepts in detail.
• So far we have covered Sections 1 and 2.1 from the book Elements of
Information Theory, Cover-Thomas.
• Next, we will cover Sections 2.2. and 2.3.

21 / 136
Joint Entropy
• We can extend the notion of entropy to a pair of random variables.

• Definition 2. The joint entropy H(X , Y ) of a pair of discrete random


variables (X , Y ) with a joint distribution p(x, y) is:
H(X, Y) = − Σ_{x∈X, y∈Y} p(x, y) log p(x, y)     (29)
        = Σ_{x∈X, y∈Y} p(x, y) log( 1 / p(x, y) )    (30)
        = E[ log( 1 / p(X, Y) ) ]                    (31)
• same as H(X ) except X is now a random vector with two elements
• extends to n > 2 dimensional random vectors (X1 , . . . , Xn ):
H(X1, . . . , Xn) = − Σ_{x1∈X1,...,xn∈Xn} p(x1, . . . , xn) log p(x1, . . . , xn)    (32)
                  = E[ log( 1 / p(X1, . . . , Xn) ) ]                                (33)
22 / 136
Conditional Entropy
• Conditional entropy H(Y |X ) quantifies the amount of uncertainty
remaining in Y when we know X .
• Definition 3. The conditional entropy H(Y |X ) is defined as:
H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x)                            (34)
       = Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log( 1 / p(y|x) )      (35)
       = Σ_{x∈X} Σ_{y∈Y} p(y|x) p(x) log( 1 / p(y|x) )      (36)
       = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( 1 / p(y|x) )          (37)
       = E[ log( 1 / p(Y|X) ) ]                             (38)

• The expectation is over (X, Y), i.e., E[log(1/p(Y|X))] = E_{X,Y}[log(1/p(Y|X))]

23 / 136
Chain Rule of Entropy
• Theorem 1. The chain rule of entropy:

H(X , Y ) = H(X ) + H(Y |X ) (39)

24 / 136
Chain Rule of Entropy - Proof
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)                                             (40)
        = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(y|x) p(x) )                                      (41)
        = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)           (42)
        = − Σ_{x∈X} ( Σ_{y∈Y} p(x, y) ) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)       (43)
        = − Σ_{x∈X} p(x) log p(x) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)                      (44)
        = H(X) + H(Y|X)                                                                     (45)

where (43) is from the definition of a marginal PMF, p(x) = Σ_{y∈Y} p(x, y).

25 / 136
Chain Rule of Entropy - Alternative Proof
• Note: The proof can also be carried out by noting

log( 1 / p(x, y) ) = log( 1 / p(x) ) + log( 1 / p(y|x) )    (46)

and taking the expectation of both sides:

E_{X,Y}[ log( 1 / p(X, Y) ) ] = E_{X,Y}[ log( 1 / p(X) ) + log( 1 / p(Y|X) ) ]            (47)
                              = E_{X,Y}[ log( 1 / p(X) ) ] + E_{X,Y}[ log( 1 / p(Y|X) ) ]  (48)
                              = E_X[ log( 1 / p(X) ) ] + E_{X,Y}[ log( 1 / p(Y|X) ) ]      (49)
                              = H(X) + H(Y|X)                                              (50)

26 / 136
Chain Rule of Entropy
• Also, we have (by symmetry):

H(X , Y ) = H(Y ) + H(X |Y ) (51)

• Avg. uncertainty about (X , Y )


= Avg. uncertainty about X + Avg. uncertainty about Y given X
= Avg. uncertainty about Y + Avg. uncertainty about X given Y

H(X , Y ) = H(X ) + H(Y |X ) = H(Y ) + H(X |Y ) = H(Y , X ) (52)

27 / 136
Chain Rule for Many Random Variables
• Theorem 2. The chain rule for n random variables:
H(X1, . . . , Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1)    (53)

28 / 136
Chain Rule for Many Random Variables
• Proof. From the chain rule for conditional probabilities:

p(x1 , . . . , xn ) = p(x1 )p(x2 |x1 )p(x3 |x2 , x1 ) . . . p(xn |xn−1 , . . . , x1 ) (54)

then

H(X1, . . . , Xn)
 = − Σ_{x1∈X1,...,xn∈Xn} p(x1, . . . , xn) log p(x1, . . . , xn)                                     (55)
 = − Σ_{x1,...,xn} p(x1, . . . , xn) log( p(x1) · · · p(xn|xn−1, . . . , x1) )                       (56)
 = − Σ_{x1,...,xn} p(x1, . . . , xn) ( log p(x1) + . . . + log p(xn|xn−1, . . . , x1) )
 = − Σ_{x1,...,xn} p(x1, . . . , xn) log p(x1) − . . . − Σ_{x1,...,xn} p(x1, . . . , xn) log p(xn|xn−1, . . . , x1)
 = H(X1) + H(X2|X1) + . . . + H(Xn|Xn−1, . . . , X1)                                                 (57)

29 / 136
Chain Rule for Many Random Variables
• Simpler proof. From the chain rule for conditional probabilities:

p(x1 , . . . , xn ) = p(x1 )p(x2 |x1 )p(x3 |x2 , x1 ) . . . p(xn |xn−1 , . . . , x1 ) (58)

so

log( 1 / p(x1, . . . , xn) ) = log( 1 / p(x1) ) + log( 1 / p(x2|x1) ) + . . . + log( 1 / p(xn|xn−1, . . . , x1) )    (59)

Take the expectation of both sides with respect to (X1, . . . , Xn):

E_{X1,...,Xn}[ log( 1 / p(X1, . . . , Xn) ) ]
 = E_{X1,...,Xn}[ log( 1 / p(X1) ) + . . . + log( 1 / p(Xn|Xn−1, . . . , X1) ) ]                 (60)
 = E_{X1,...,Xn}[ log( 1 / p(X1) ) ] + . . . + E_{X1,...,Xn}[ log( 1 / p(Xn|Xn−1, . . . , X1) ) ] (61)
 = H(X1) + H(X2|X1) + . . . + H(Xn|Xn−1, . . . , X1)                                              (62)

30 / 136
Corollary
• Corollary 1. For three random variables X , Y , Z :

H(X , Y |Z ) = H(X |Z ) + H(Y |X , Z ) (63)

31 / 136
Corollary
• Proof. From the chain rule, p(x, y|z) = p(x|z)p(y|x, z). Then,

H(X, Y|Z)
 = Σ_{z∈Z} p(z) H(X, Y|Z = z)
 = − Σ_{z∈Z} p(z) Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(x, y|z)        (with p(x, y|z) = p(x|z)p(y|x, z))
 = − Σ_{z∈Z} p(z) ( Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(x|z) + Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(y|x, z) )
 = − Σ_{z∈Z} p(z) ( Σ_{x∈X} p(x|z) log p(x|z) + Σ_{x∈X} Σ_{y∈Y} p(x, y|z) log p(y|x, z) )
 = − Σ_{z∈Z} Σ_{x∈X} p(x, z) log p(x|z) − Σ_{x∈X} Σ_{y∈Y} Σ_{z∈Z} p(x, y, z) log p(y|x, z)
 = H(X|Z) + H(Y|X, Z)

Therefore, H(X, Y|Z) = H(X|Z) + H(Y|X, Z).

32 / 136
Example 5
• Consider a pair of r.v.s X , Y with a joint PMF p(x, y ):

X \ Y |  0  |  1
  0   | 1/2 | 1/4
  1   |  0  | 1/4     (64)

• Find H(X ),H(Y ), H(X |Y ), H(Y |X ), H(X , Y ).


• Is H(X |Y ) = H(Y |X )?

33 / 136
Example 5 - Solution
1) Find H(X ).
• From the definition of marginal probabilities:
p(x) = P[X = x] = Σ_{y∈Y} P[X = x, Y = y] = Σ_{y∈Y} p(x, y)

• Then, the marginal PMF of X, p(x), can be found as:

P[X = 0] = 3/4,  P[X = 1] = 1/4   ⇒   p(x) = (3/4, 1/4)

• Then, the entropy of X is:

H(X) = −(3/4) log(3/4) − (1/4) log(1/4) = 0.8113

34 / 136
Example 5 - Solution
2) Find H(Y ).
• The marginal PMF of Y, p(y), is:

P[Y = 0] = 1/2,  P[Y = 1] = 1/2   ⇒   p(y) = (1/2, 1/2)

• Then, the entropy of Y is:

H(Y) = −(1/2) log(1/2) − (1/2) log(1/2) = 1

35 / 136
Example 5 - Solution
3) Find the conditional entropy H(X |Y ).
• For this, we first need to find conditional PMF p(x|y ).
• From the definition of conditional probabilities:

p(x|y) = P[X = x|Y = y] = P[X = x, Y = y] / P[Y = y] = p(x, y) / p(y)

• Then,

P[X = 0|Y = 0] = P[X = 0, Y = 0] / P[Y = 0] = (1/2)/(1/2) = 1

P[X = 1|Y = 0] = 1 − P[X = 0|Y = 0] = 0

• Similarly,

P[X = 0|Y = 1] = P[X = 0, Y = 1] / P[Y = 1] = (1/4)/(1/2) = 1/2

P[X = 1|Y = 1] = 1 − 1/2 = 1/2

36 / 136
Example 5 - Solution
• Then,

H(X|Y) = Σ_{y∈{0,1}} p(y) H(X|Y = y)

• Note that:

H(X|Y = 0) = −1 log 1 − 0 log 0 = 0

• whereas

H(X|Y = 1) = −(1/2) log(1/2) − (1/2) log(1/2) = 1

• Then,

H(X|Y) = Σ_{y∈{0,1}} p(y) H(X|Y = y) = (1/2) × 0 + (1/2) × 1 = 1/2

37 / 136
Example 5 - Solution
4) Find the conditional entropy H(Y |X ).
• For this, we first need to find conditional PMF p(y |x).
• From the definition of conditional probabilities:

p(y|x) = P[Y = y|X = x] = P[Y = y, X = x] / P[X = x] = p(y, x) / p(x)

• Then,

P[Y = 0|X = 0] = P[Y = 0, X = 0] / P[X = 0] = (1/2)/(3/4) = 2/3

P[Y = 1|X = 0] = 1 − P[Y = 0|X = 0] = 1/3

• Similarly,

P[Y = 0|X = 1] = P[Y = 0, X = 1] / P[X = 1] = 0/(1/4) = 0

P[Y = 1|X = 1] = 1 − 0 = 1

38 / 136
Example 5 - Solution
• Then,

H(Y|X) = Σ_{x∈{0,1}} p(x) H(Y|X = x)

where

H(Y|X = 0) = −(2/3) log(2/3) − (1/3) log(1/3) = 0.9183

and

H(Y|X = 1) = −0 log 0 − 1 log 1 = 0

Then,

H(Y|X) = Σ_{x∈{0,1}} p(x) H(Y|X = x) = (3/4) × 0.9183 + (1/4) × 0 = 0.6887

39 / 136
Example 5 - Solution
• 5) Find the joint entropy H(X , Y ).

H(X, Y) = −(1/2) log(1/2) − 0 log 0 − (1/4) log(1/4) − (1/4) log(1/4) = 1.5

• Remark. Note that H(X|Y) ≠ H(Y|X)

• But

H(X , Y ) = H(X , Y )
H(X ) + H(Y |X ) = H(Y ) + H(X |Y )
H(X ) − H(X |Y ) = H(Y ) − H(Y |X )
0.8113 − 1/2 = 1 − 0.6887
0.3113 = 0.3113

• This will be important soon.

40 / 136
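The whole of Example 5 can be re-derived from the joint PMF with a few lines of Python; this sketch (helper names are ours) uses the chain rule H(X, Y) = H(Y) + H(X|Y) to obtain the conditional entropies:

```python
from math import log2

joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}   # p(x, y)

def H(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

px = [sum(p for (x, y), p in joint.items() if x == xv) for xv in (0, 1)]
py = [sum(p for (x, y), p in joint.items() if y == yv) for yv in (0, 1)]

HX, HY = H(px), H(py)
HXY = H(list(joint.values()))
H_X_given_Y = HXY - HY          # chain rule: H(X, Y) = H(Y) + H(X|Y)
H_Y_given_X = HXY - HX
print(HX, HY, HXY)                           # 0.8113, 1.0, 1.5
print(H_X_given_Y, H_Y_given_X)              # 0.5, 0.6887
print(HX - H_X_given_Y, HY - H_Y_given_X)    # both 0.3113
```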
Relative Entropy - KL-distance
• Definition 4. The relative entropy or Kullback-Leibler (KL) distance
between two PMFs p(x) and q(x) (that are defined on the same
alphabet) is:
D(p||q) = Σ_{x∈X} p(x) log( p(x) / q(x) )
        = E[ log( p(X) / q(X) ) ]
• The relative entropy is a measure of distance between two
distributions (although it is actually not a true distance measure
because it is not symmetric and does not satisfy the triangle
inequality!). However, we will see that it is always ≥ 0 and = 0 if and
only if p = q.
• If there is any symbol x ∈ X for which p(x) > 0 and q(x) = 0, then
D(p||q) = ∞.
• This notion is also called KL divergence or information divergence.

41 / 136
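A minimal Python sketch of Definition 4 (the name `kl_divergence` is ours), using the conventions 0 log(0/q) = 0 and p log(p/0) = ∞; it also illustrates that D(p||q) ≠ D(q||p) in general:

```python
from math import log2, inf

def kl_divergence(p, q):
    d = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # convention: 0 log(0/q) = 0
        if qi == 0:
            return inf        # p(x) > 0 but q(x) = 0
        d += pi * log2(pi / qi)
    return d

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q), kl_divergence(q, p))   # not equal: D is not symmetric
print(kl_divergence(p, p))                        # 0.0
```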
Mutual Information
• Definition 5. Consider two random variables X and Y with a joint
PMF p(x, y) and marginal PMFs p(x) and p(y ).

The mutual information I(X ; Y ) is the relative entropy between the


joint distribution and the product distribution p(x)p(y ):

I(X; Y) = D( p(x, y) || p(x)p(y) )
        = Σ_{x∈X, y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )

• Also note that I(X; Y) = E[ log( p(X, Y) / (p(X)p(Y)) ) ]

• Later we will be able to generalize this definition to continuous or


mixed random variables.

42 / 136
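A short sketch of Definition 5 applied to the joint PMF of Example 5 (helper names are ours); the result matches H(X) − H(X|Y) = 0.3113 computed earlier:

```python
from math import log2

joint = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 0.0, (1, 1): 1/4}
px = {xv: sum(p for (x, y), p in joint.items() if x == xv) for xv in (0, 1)}
py = {yv: sum(p for (x, y), p in joint.items() if y == yv) for yv in (0, 1)}

# I(X; Y) = D(p(x, y) || p(x)p(y)), skipping zero-probability pairs
I = sum(p * log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0)
print(I)   # 0.3113...
```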
Mutual Information

• Theorem 3. Relationship between entropy and mutual information:

I(X ; Y ) = H(X ) − H(X |Y )

Also, from symmetry of p(x, y) = p(x)p(y |x) = p(y)p(x|y ),

I(X ; Y ) = H(Y ) − H(Y |X )

• Mutual information measures how much information one random


variable carries about another.

• Equally, mutual information measures the amount of uncertainty


reduced in one random variable by knowing another random variable.

43 / 136
Relationship Between Entropy and Mutual Information
• Proof.
I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x|y) / p(x) )
        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( 1 / p(x) ) − Σ_{x∈X} Σ_{y∈Y} p(x, y) log( 1 / p(x|y) )
        = H(X) − H(X|Y)

where the first sum equals H(X) (since Σ_y p(x, y) = p(x)) and the second equals H(X|Y).

44 / 136
Relationship Between Entropy and Mutual Information
• Alternative Proof (by using expectations).

I(X; Y) = E_{X,Y}[ log( p(X, Y) / (p(X)p(Y)) ) ]
        = E_{X,Y}[ log( p(X|Y) / p(X) ) ]
        = E_{X,Y}[ log( 1 / p(X) ) − log( 1 / p(X|Y) ) ]
        = E_X[ log( 1 / p(X) ) ] − E_{X,Y}[ log( 1 / p(X|Y) ) ]
        = H(X) − H(X|Y)

45 / 136
How to Interpret Mutual Information?
• We have seen that,

H(X) − H(X|Y) = I(X; Y)
  (I)     (II)     (III)

• I: Average uncertainty about X (before observing Y).

• II: Average uncertainty about X AFTER observing Y.

• III: Average reduction in uncertainty of X after observing Y (average
information about X that is supplied by Y).

I(X ; Y ) = H(X ) − H(X |Y ) = H(Y ) − H(Y |X )

• That is, X tells as much information about Y , as Y does about X .

I(X ; Y ) = I(Y ; X )

46 / 136
Observations
• We can also write:
I(X; Y) = E[ log( p(X, Y) / (p(X)p(Y)) ) ]
        = E[ log( 1 / p(X) ) ] + E[ log( 1 / p(Y) ) ] − E[ log( 1 / p(X, Y) ) ]
        = H(X) + H(Y) − H(X, Y)

• How about I(X ; X )?

I(X; X) = H(X) − H(X|X) = H(X)   (since H(X|X) = 0)

49 / 136
Observations
• The following diagram shows these relationships:

H(X) H(Y )

H(X|Y ) I(X; Y ) H(Y |X)

H(X, Y )

50 / 136
Conditional Mutual Information
• Definition 6. Conditional mutual information:

I(X; Y|Z) = Σ_{x∈X, y∈Y, z∈Z} p(x, y, z) log( p(x, y|z) / (p(x|z)p(y|z)) )
          = E_{X,Y,Z}[ log( p(X, Y|Z) / (p(X|Z)p(Y|Z)) ) ]
          = H(X|Z) − H(X|Y, Z)

51 / 136
Chain Rule of Mutual Information
• Recall the chain rule of entropy:

H(X1, . . . , Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1)

• Mutual information also has a chain rule!
• Theorem 4. Chain rule of mutual information:

I(X1, . . . , Xn; Y) = Σ_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1)

• Proof.

I(X1, . . . , Xn; Y) = H(X1, . . . , Xn) − H(X1, . . . , Xn | Y)
                     = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1) − Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1, Y)
                     = Σ_{i=1}^{n} ( H(Xi | Xi−1, . . . , X1) − H(Xi | Xi−1, . . . , X1, Y) )
                     = Σ_{i=1}^{n} I(Xi; Y | Xi−1, . . . , X1)

52 / 136
Conditional Relative Entropy
• Definition 7. For two joint PMFs p(x, y ) and q(x, y), the conditional
relative entropy is defined as:
D( p(y|x) || q(y|x) ) = Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log( p(y|x) / q(y|x) )
                      = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(y|x) / q(y|x) )
                      = E_{X,Y}[ log( p(Y|X) / q(Y|X) ) ]

53 / 136
Chain Rule of Relative Entropy
• Relative entropy also has a chain rule:

D(p(x, y )||q(x, y )) = D(p(x)||q(x)) + D(p(y|x)||q(y |x))

• Proof.
D( p(x, y) || q(x, y) ) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / q(x, y) )
                        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( ( p(y|x) p(x) ) / ( q(y|x) q(x) ) )
                        = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(y|x) / q(y|x) ) + Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x) / q(x) )
                        = D( p(y|x) || q(y|x) ) + D( p(x) || q(x) )

54 / 136
Convex Functions
• We will now briefly review the basic definitions of convexity and
present one of the most widely used inequalities in information theory.

55 / 136
Convex Functions
• Definition 8 (Convex function). A function f (x) is convex over an
interval (a, b) if for every x1 , x2 ∈ (a, b) and 0 ≤ λ ≤ 1:

f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 )

• The function is strictly convex if for the above, the equality holds only
when λ = 0 or λ = 1.

[Figure: a convex function f(x); the chord value λf(x1) + (1 − λ)f(x2) between (x1, f(x1)) and (x2, f(x2)) lies above the function value f(λx1 + (1 − λ)x2).]
56 / 136
Example 6
• Example 6. f(x) = x² where x ∈ R

[Plot of f(x) = x² for x ∈ [−10, 10].]

• is a convex function

58 / 136
Example 7
• Example 7. f (x) = − log x where x > 0

[Plot of f(x) = − log x for x ∈ (0, 10].]

• is a convex function

60 / 136
Example 8
• Example 8. f(x) = e^x where x ∈ R

[Plot of f(x) = e^x for x ∈ [−5, 5].]

• is a convex function

62 / 136
Example 9
• Example 9. f (x) = ax + b where x ∈ R


• is a convex function (plot is drawn for a = 1, b = −1)

64 / 136
Example 10
• Example 10. f (x) = x log x where x ≥ 0

[Plot of f(x) = x log x for x ∈ [0, 5].]

• is a convex function

65 / 136
Concave Functions
• Definition 8 (Concave function). A function f (·) is concave over
(a, b) if −f (x) is convex, i.e., for every x1 , x2 ∈ (a, b) and 0 ≤ λ ≤ 1:

f (λx1 + (1 − λ)x2 ) ≥ λf (x1 ) + (1 − λ)f (x2 )

• The function is strictly concave if for the above, the equality holds only
when λ = 0 or λ = 1.

[Figure: a concave function f(x); the chord value λf(x1) + (1 − λ)f(x2) lies below the function value f(λx1 + (1 − λ)x2).]
66 / 136
Example 11

• Example 11. f(x) = √x where x ≥ 0

[Plot of f(x) = √x for x ∈ [0, 5].]

• is a concave function

68 / 136
Example 12
• Example 12. f (x) = log x where x > 0

[Plot of f(x) = log x for x ∈ (0, 5].]

• is a concave function

70 / 136
How do we know if a function is convex (or concave)?
• If f (x) is twice differentiable,

f''(x) ≥ 0 → convex
f''(x) ≤ 0 → concave

• If you would like to learn more on convex functions, read Chapter 3 of:

Convex Optimization, Boyd-Vandenberghe (available online)

https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

71 / 136
Example 13
• Example 13. Is f (x) = x 2 for x ∈ R convex?

f'(x) = 2x    (65)
f''(x) = 2 > 0 → convex    (66)

72 / 136
Example 14
• Example 14. Is f (x) = log x for x > 0 convex?

f'(x) = 1 / (x ln 2)    (67)
f''(x) = −1 / (x² ln 2) < 0 → concave    (68)

73 / 136
Recap
• So far we have covered sections 2.3, 2.4, 2.5 from the textbook. Next,
we will cover 2.6, 2.7, 2.8.

74 / 136
Important Properties of Convex Functions
• Theorem 5. Let p1, . . . , pn ≥ 0 such that Σ_{i=1}^{n} pi = 1. If f(x) is
convex, then for any x1, . . . , xn,

f( Σ_{i=1}^{n} pi xi ) ≤ Σ_{i=1}^{n} pi f(xi)

75 / 136
Important Properties of Convex Functions
• Proof. Can be proved by induction.

Step 1. For n = 2, this is true by the definition of convexity.

76 / 136
Important Properties of Convex Functions
• Step 2. Assume that the claim holds for n − 1. Then, for n:

Σ_{i=1}^{n} pi f(xi) = pn f(xn) + (1 − pn) Σ_{i=1}^{n−1} ( pi / (1 − pn) ) f(xi)

Now, set qi = pi / (1 − pn) for i = 1, . . . , n − 1. Note that qi ≥ 0 and
Σ_{i=1}^{n−1} qi = 1. Since we assumed that the hypothesis is true for n − 1,

Σ_{i=1}^{n} pi f(xi) = pn f(xn) + (1 − pn) Σ_{i=1}^{n−1} qi f(xi)
                     ≥ pn f(xn) + (1 − pn) f( Σ_{i=1}^{n−1} qi xi )            (define x̄ ≜ Σ_{i=1}^{n−1} qi xi)
                     ≥ f( pn xn + (1 − pn) x̄ )                                 (hypothesis true for n = 2)
                     = f( pn xn + (1 − pn) Σ_{i=1}^{n−1} ( pi / (1 − pn) ) xi )   (substitute back x̄)
                     = f( Σ_{i=1}^{n} pi xi )
77 / 136
Jensen’s Inequality
• We will now state an important inequality.

• Theorem 6 (Jensen's inequality). If f is convex and X is a random


variable, we have:

E[f (X )] ≥ f (E[X ])

• Moreover, if f is strictly convex, equality implies that X = E[X ] with


probability 1, i.e., X is constant.

• Proof. If X is a discrete random variable, the proof is the same as the


proof of Theorem 5, by letting the pi , i = 1 . . . , n denote the PMF of X .

• This proof can be extended to continuous random variables also.

78 / 136
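A small numerical illustration of Jensen's inequality for the convex function f(x) = x² (the PMF below is an arbitrary choice for the demo):

```python
xs  = [-2.0, 0.0, 1.0, 3.0]
pmf = [0.1, 0.4, 0.3, 0.2]

EX  = sum(p * x for p, x in zip(pmf, xs))        # E[X]
EfX = sum(p * x**2 for p, x in zip(pmf, xs))     # E[f(X)] with f(x) = x^2
print(EfX, EX**2, EfX >= EX**2)                  # E[f(X)] >= f(E[X]) -> True
```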
Jensen’s Inequality
• Corollary 2. If f is concave,

E[f (X )] ≤ f (E[X ])
• Next, we will use Jensen’s inequality to prove some important
properties of the measures we have defined so far.

79 / 136
KL-distance is Non-negative
• Theorem 7 (Information Inequality). For two probability mass
functions (PMFs) p(x) and q(x) over an alphabet x ∈ X , we have:
D(p||q) ≥ 0
where equality holds if and only if p(x) = q(x) for all x.
• Proof. Define a set A = {x : p(x) > 0}. Then,
−D(p||q) = − Σ_{x∈A} p(x) log( p(x) / q(x) )
         = Σ_{x∈A} p(x) ( − log( p(x) / q(x) ) )
         = Σ_{x∈A} p(x) log( q(x) / p(x) )
         = E[ log( q(X) / p(X) ) ]     (expectation is taken over p(x) > 0)

80 / 136
KL-distance is Non-negative
• Proof continued. Recall that the function f(y) = log y is concave.
Therefore, log(q(x)/p(x)) is concave in q(x)/p(x). Then,

E[ log( q(X) / p(X) ) ] ≤ log E[ q(X) / p(X) ]     (Jensen's inequality - Corollary 2)    (69)
                        = log( Σ_{x∈A} p(x) ( q(x) / p(x) ) )                             (70)
                        = log( Σ_{x∈A} q(x) )                                             (71)
                        ≤ log( Σ_{x∈X} q(x) )      (log y is strictly increasing in y)    (72)
                        = log 1                    (probability of the entire sample space is 1)
                        = 0

• Therefore, D(p||q) ≥ 0
81 / 136
When is D(p||q) = 0?
• Note that f (y ) = log y is a strictly concave function of y. Then, from
Jensen’s Inequality, equality occurs, i.e.,

E[f (Y )] = f (E[Y ])

if and only if Y is a constant.


• In other words, (69) becomes an equality if and only if q(x)/p(x) = c for
some constant c for all x ∈ A. Then, (70) can be written as,

log( Σ_{x∈A} p(x) ( q(x) / p(x) ) ) = log( Σ_{x∈A} p(x) c )
                                    = log( c Σ_{x∈A} p(x) )
                                    = log( c Σ_{x∈X} p(x) )   (since p(x) = 0 ∀ x ∉ A)
                                    = log c

82 / 136
When is D(p||q) = 0?
• Finally, (72) becomes an equality if and only if

Σ_{x∈A} q(x) = c = 1

• Also from this result, we find that q(x) = 0 for all x ∉ A (second axiom
of probability, i.e., probability of the whole sample space is 1).

• Therefore, D(p||q) = 0 if and only if p(x) = q(x) for all x ∈ X.

83 / 136
Mutual Information is Non-negative
• Corollary 3. Mutual information is non-negative:

I(X ; Y ) ≥ 0

• Proof Follows from:

I(X ; Y ) = D(p(x, y)||p(x)p(y)) (by definition)

≥0 (KL-distance is non-negative, i.e., Theorem 7) (73)

and (73) becomes an equality if and only if p(x, y ) = p(x)p(y ), i.e.,


when X and Y are independent.

84 / 136
Corollaries
• Corollary 4. Conditional KL-distance is non-negative

D(p(y|x)||q(y |x)) ≥ 0

• Corollary 5. Conditional mutual information is non-negative

I(X ; Y |Z ) ≥ 0

85 / 136
Upper Bound on Entropy
• Theorem 8. For any random variable X defined over an alphabet X ,

H(X ) ≤ log |X |

where |X | represents the number of elements in the range of X and is


called the cardinality of X .
• This means that, for any random variable X , its entropy is no greater
than that of a uniform random variable defined over the same set of
elements X .
• Using this result, we can bound the entropy of any random variable as:

0 ≤ H(X ) ≤ log |X | (74)

• Let’s now prove the theorem.

86 / 136
Upper Bound on Entropy
• Proof. We will use Theorem 7. Specifically, let p(x) denote the PMF
of the random variable X and let u(x) = 1/|X| be the PMF of a uniform
random variable.
• Then,

D(p||u) = Σ_{x∈X} p(x) log( p(x) / u(x) )                                    (75)
        = Σ_{x∈X} p(x) log( 1 / u(x) ) − Σ_{x∈X} p(x) log( 1 / p(x) )        (76)
        = Σ_{x∈X} p(x) log |X| − Σ_{x∈X} p(x) log( 1 / p(x) )                (77)
        = log |X| − H(X)                                                     (78)
        ≥ 0                    (from Theorem 7)                              (79)

• Therefore,

H(X) ≤ log |X|    (80)
87 / 136
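A quick numerical check of Theorem 8 on randomly drawn PMFs (a sketch; names are ours): H(X) never exceeds log |X|, and the uniform PMF attains the bound:

```python
import random
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

n = 8
for _ in range(5):
    w = [random.random() for _ in range(n)]
    pmf = [wi / sum(w) for wi in w]            # normalize to a valid PMF
    print(entropy(pmf) <= log2(n) + 1e-12)     # True (small tolerance for rounding)
print(entropy([1/n] * n), log2(n))             # the uniform PMF achieves the bound: 3.0 3.0
```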
Uniform Distribution Maximizes Entropy
• Corollary 6. The uniform random variable has the largest entropy.
• Proof. Let X be a uniform random variable over a set of elements X.
Denote the PMF of X by u(x) = 1/|X| for all x ∈ X. Then, the entropy of
X is given by:

H(X) = − Σ_{x∈X} u(x) log u(x)               (81)
     = − Σ_{x∈X} (1/|X|) log(1/|X|)          (82)
     = Σ_{x∈X} (1/|X|) log |X|               (83)
     = ( Σ_{x∈X} 1/|X| ) log |X|             (84)
     = log |X|                               (85)

88 / 136
Conditioning Reduces Entropy
• Theorem 9. Conditioning can not increase entropy:

H(X |Y ) ≤ H(X )

• Proof. Follows from the non-negativity of mutual information:

I(X ; Y ) = H(X ) − H(X |Y ) ≥ 0

• Very important: This theorem implies that the conditional entropy


H(X |Y ) is less than or equal to the entropy H(X ). It does not say that
H(X |Y = y ) for any specific y is necessarily smaller than H(X )! Note
that H(X |Y ) is defined as the average of H(X |Y = y ) over all
realizations Y = y :
H(X|Y) = Σ_{y∈Y} p(y) H(X|Y = y)

The left-hand side can never be larger than H(X), but some of the individual terms H(X|Y = y) can be larger than H(X).

89 / 136
Example 15
• Let's consider a pair of r.v.s X, Y with a joint PMF p(x, y):

X \ Y |  0  |  1
  0   | 1/3 | 1/3
  1   |  0  | 1/3     (87)

• PMF of X:

P[X = 0] = 2/3,  P[X = 1] = 1/3   ⇒   p(x) = (2/3, 1/3)

• Then, the entropy of X is:

H(X) = −(2/3) log(2/3) − (1/3) log(1/3) = 0.918 bits

90 / 136
Example 15
• PMF of Y:

P[Y = 0] = 1/3,  P[Y = 1] = 2/3   ⇒   p(y) = (1/3, 2/3)

• Conditional PMF p(x|y):

P[X = 0|Y = 0] = P[X = 0, Y = 0] / P[Y = 0] = 1
P[X = 1|Y = 0] = 0
P[X = 0|Y = 1] = (1/3) / (2/3) = 1/2
P[X = 1|Y = 1] = 1/2

• Then,

p(x|y) | y = 0 | y = 1
 x = 0 |   1   |  1/2
 x = 1 |   0   |  1/2     (88)

91 / 136
Example 15
• Then,

H(X|Y) = Σ_{y∈Y} p(y) H(X|Y = y)
       = P[Y = 0] H(X|Y = 0) + P[Y = 1] H(X|Y = 1)

• Note that:

H(X|Y = 0) = 0 bits < H(X)

• On the other hand,

H(X|Y = 1) = 1 bit > H(X)!

• But, the average:

H(X|Y) = (1/3) × 0 + (2/3) × 1 = 2/3 = 0.667 bits < H(X)
as expected.
92 / 136
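A sketch reproducing Example 15 numerically (helper names are ours): one conditional term exceeds H(X), but the average H(X|Y) does not:

```python
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

HX = entropy([2/3, 1/3])                          # 0.918 bits
H_given_y0 = entropy([1.0, 0.0])                  # 0 bits
H_given_y1 = entropy([1/2, 1/2])                  # 1 bit  > H(X)
H_X_given_Y = (1/3) * H_given_y0 + (2/3) * H_given_y1
print(HX, H_given_y1, H_X_given_Y)                # 0.918  1.0  0.667
```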
Independence Bound on Entropy
• Theorem 10. For any set of n random variables X1, . . . , Xn, their joint
entropy can be upper bounded by the sum of the individual entropies:

H(X1, X2, . . . , Xn) ≤ Σ_{i=1}^{n} H(Xi)

• Proof. We will use the chain rule of entropy and Theorem 9.
Specifically, recall from the chain rule of entropy that:

H(X1, X2, . . . , Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1)

From Theorem 9, for each of these terms we have
H(Xi | Xi−1, . . . , X1) ≤ H(Xi), since conditioning cannot increase
entropy. Therefore,

Σ_{i=1}^{n} H(Xi | Xi−1, . . . , X1) ≤ Σ_{i=1}^{n} H(Xi)

with equality if and only if the Xi are all independent from each other.
93 / 136
Recap
• So far, we have seen Jensen’s inequality and used it to prove some
important results and observations.

• Next, we will see another useful inequality, called the Log-sum


inequality and use it to establish more results that are central to
information theory.

• We will be covering chapters 2.7, 2.8, and 2.10.

94 / 136
LOG-SUM inequality
• Theorem 11 (LOG-SUM Inequality). For non-negative numbers
a1, . . . , an and b1, . . . , bn, the following holds:

Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{i=1}^{n} ai / Σ_{i=1}^{n} bi )

95 / 136
LOG-SUM inequality
• Proof. We will use the convention 0 log 0 = 0, a log(a/0) = ∞ (for a > 0),
and 0 log(0/0) = 0. Then, without loss of generality, we can assume
ai, bi > 0 for all i.
• Define,

p(xi) = ai / Σ_{j=1}^{n} aj

and

q(xi) = bi / Σ_{j=1}^{n} bj

• Since p(xi), q(xi) ≥ 0, and

Σ_{i=1}^{n} p(xi) = Σ_{i=1}^{n} q(xi) = 1

p and q are valid PMFs.

96 / 136
LOG-SUM inequality
• Next, consider the KL-distance between p and q, and recall that the
KL-distance is non-negative, D(p||q) ≥ 0. In other words,

D(p||q) = Σ_{i=1}^{n} p(xi) log( p(xi) / q(xi) ) ≥ 0

By substituting p(xi) and q(xi),

⇒ Σ_{i=1}^{n} ( ai / Σ_{j=1}^{n} aj ) log( ( ai / Σ_{j=1}^{n} aj ) / ( bi / Σ_{j=1}^{n} bj ) ) ≥ 0

⇒ Σ_{i=1}^{n} ( ai / Σ_{j=1}^{n} aj ) ( log( ai / bi ) − log( Σ_{j=1}^{n} aj / Σ_{j=1}^{n} bj ) ) ≥ 0

⇒ Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{j=1}^{n} aj / Σ_{j=1}^{n} bj )

97 / 136
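A numerical spot-check of the log-sum inequality on random positive numbers (a sketch, not a proof):

```python
import random
from math import log2

n = 6
a = [random.uniform(0.1, 5.0) for _ in range(n)]
b = [random.uniform(0.1, 5.0) for _ in range(n)]

lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * log2(sum(a) / sum(b))
print(lhs >= rhs - 1e-12)   # True
```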
Relative entropy is convex
• Theorem 12. The relative entropy (KL-distance) is convex in the pair
of distributions (p, q). That is,

D(λp1 + (1 − λ)p2 ||λq1 + (1 − λ)q2 ) ≤ λD(p1 ||q1 ) + (1 − λ)D(p2 ||q2 )

• Proof. We first note that the definition of convexity from Definition 8


extends to vector variables. That is, let f (z) be a scalar function of a
group of variables represented by a vector z.
• Then, the function f (z) is said to be convex in z if:

f (λz1 + (1 − λ)z2 ) ≤ λf (z1 ) + (1 − λ)f (z2 )

• In particular, if z = [x y]T , this means that f is convex in (x, y) if

f (λx1 + (1 − λ)x2 , λy1 + (1 − λ)y2 ) ≤ λf (x1 , y1 ) + (1 − λ)f (x2 , y2 )

98 / 136
Relative entropy is convex
• Now, the inequality from Theorem 12 says that:

Σ_{x∈X} ( λp1(x) + (1 − λ)p2(x) ) log( ( λp1(x) + (1 − λ)p2(x) ) / ( λq1(x) + (1 − λ)q2(x) ) )
   ≤ λ Σ_{x∈X} p1(x) log( p1(x) / q1(x) ) + (1 − λ) Σ_{x∈X} p2(x) log( p2(x) / q2(x) )    (89)

• Let's have a look at this for one term with a specific x on both sides.
Can we prove the following inequality?

( λp1(x) + (1 − λ)p2(x) ) log( ( λp1(x) + (1 − λ)p2(x) ) / ( λq1(x) + (1 − λ)q2(x) ) )
   ≤ λp1(x) log( p1(x) / q1(x) ) + (1 − λ)p2(x) log( p2(x) / q2(x) )    (90)

99 / 136
Relative entropy is convex
• The log-sum inequality says that:

Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{i=1}^{n} ai / Σ_{i=1}^{n} bi )

• By setting n = 2, we have:

a1 log( a1 / b1 ) + a2 log( a2 / b2 ) ≥ (a1 + a2) log( (a1 + a2) / (b1 + b2) )    (91)

• Let

a1 = λp1(x),   b1 = λq1(x)
a2 = (1 − λ)p2(x),   b2 = (1 − λ)q2(x)

• Then, (91) becomes,

λp1(x) log( λp1(x) / (λq1(x)) ) + (1 − λ)p2(x) log( (1 − λ)p2(x) / ((1 − λ)q2(x)) )
   ≥ ( λp1(x) + (1 − λ)p2(x) ) log( ( λp1(x) + (1 − λ)p2(x) ) / ( λq1(x) + (1 − λ)q2(x) ) )

which means we proved (90).
• Now, if we take the sum over all x ∈ X of both sides, we have (89),
which proves Theorem 12.
100 / 136
Relative entropy is convex
• Corollary 7. Relative entropy is convex in p for any fixed q.
• Proof. Choose q1 = q2 = q in Theorem 12:

D(λp1 + (1 − λ)p2 ||q) ≤ λD(p1 ||q) + (1 − λ)D(p2 ||q)

101 / 136
Entropy is concave
• Theorem 13. (Concavity of entropy) Let H(p) denote the entropy of
X , with p representing the PMF of X (that is, if X = {x1 , . . . , xn },
p = (p(x1 ), p(x2 ), . . . , p(xn ))). Then, H(p) is a concave function of p.
• Proof. We will use Corollary 7.
• Let q(x) = 1/|X| for all x ∈ X, i.e., q(x) = u(x) (recall that u(x) is the PMF of the
discrete uniform distribution). Then,

D(p||u) = Σ_{x∈X} p(x) log( p(x) / (1/|X|) )
        = log |X| − H(X)

• Therefore,

H(X ) = H(p) = log |X | − D(p||u)

• By Corollary, 7 D(p||u) is convex in p.


• By Definition 8 (if f (x) is convex in x, then −f (x) is concave in x) we
have H(p) is concave in p.
102 / 136
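A numerical illustration of Theorem 13 (a sketch with arbitrarily chosen PMFs): the entropy of a mixture λp1 + (1 − λ)p2 is at least the mixture of the entropies:

```python
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.3, 0.6]
lam = 0.4
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]
print(entropy(mix), lam * entropy(p1) + (1 - lam) * entropy(p2))
# the first value (entropy of the mixture) is the larger one
```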
Mutual Information is concave in p(x) for fixed p(y |x)
• Theorem 14. Mutual Information I(X ; Y ) is a concave function of p(x)
for fixed p(y |x).
• Proof. Recall that:

p(x, y) = p(y |x)p(x) = p(x|y )p(y )


I(X; Y) = H(Y) − H(Y|X) = H(Y) − Σ_{x∈X} p(x) H(Y|X = x)
             (A)                        (B)

• (A):
• Note that for fixed p(y|x), p(y) is a linear function of p(x) (because
p(y) = Σ_x p(y|x)p(x)).
• We know, by Theorem 13, that H(Y) is concave in p(y).
• Fact 1. Let y be a linear function of x. Then, a function f is concave in
x if and only if f is concave in y.
• Thus, H(Y) for fixed p(y|x) is concave in p(x).

103 / 136
Mutual Information is concave in p(x) for fixed p(y |x)
• (B):

−H(Y|X) = − Σ_x H(Y|X = x) p(x)

where each H(Y|X = x) is fixed since p(y|x) is fixed, so −H(Y|X)
is a linear function of p(x). Recall that a linear function is both
concave and convex.
• Fact 2. Function f is concave if:

f (x) = g1 (x) + g2 (x)


and g1 and g2 are both concave.
• ⇒ I(X ; Y ) is concave in p(x) for fixed p(y |x).
• Remark: This result will be very important for channel capacity.
• Thought exercise: How would you prove Fact 1 and Fact 2?

104 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• Theorem 14’. Mutual Information I(X ; Y ) is a convex function of
p(y|x) for fixed p(x).
• Proof. For a fixed p(x), let's define a function f(·) of p(y|x):

f(p(y|x)) = I(X; Y)                                        (92)
          = Σ_{x,y} p(x, y) log( p(x, y) / (p(x)p(y)) )    (93)
          = D( p(x)p(y|x) || p(x)p(y) )                    (94)

where p(y) ≜ Σ_x p(y|x)p(x).

105 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• In order to show that f(p(y|x)) = I(X; Y) is convex in p(y|x), we need
to show that (by definition of convexity):

f( λp1(y|x) + (1 − λ)p2(y|x) ) ≤ λ f( p1(y|x) ) + (1 − λ) f( p2(y|x) )    (95)

which, by using (94), can be re-written as:

D( p(x)pλ(y|x) || p(x)pλ(y) )
   ≤ λ D( p(x)p1(y|x) || p(x)p1(y) ) + (1 − λ) D( p(x)p2(y|x) || p(x)p2(y) )    (96)

where p1(y) ≜ Σ_x p1(y|x)p(x) and p2(y) ≜ Σ_x p2(y|x)p(x), and

pλ(y|x) ≜ λp1(y|x) + (1 − λ)p2(y|x)

pλ(y) ≜ Σ_x p(x)pλ(y|x) = Σ_x p(x)( λp1(y|x) + (1 − λ)p2(y|x) )
      = λp1(y) + (1 − λ)p2(y)
106 / 136
Mutual Information is convex in p(y |x) for fixed p(x)
• Now, let p1(x, y) ≜ p(x)p1(y|x) and p2(x, y) ≜ p(x)p2(y|x).
• Then, let q1(x, y) ≜ p(x)p1(y) and q2(x, y) ≜ p(x)p2(y)
and note that:

p(x)pλ(y|x) = λp(x)p1(y|x) + (1 − λ)p(x)p2(y|x)
            = λp1(x, y) + (1 − λ)p2(x, y)    (97)

and

p(x)pλ(y) = p(x) Σ_{x'} p(x')( λp1(y|x') + (1 − λ)p2(y|x') )
          = p(x) ( λp1(y) + (1 − λ)p2(y) )
          = λq1(x, y) + (1 − λ)q2(x, y)    (98)

By substituting (97) and (98) in (96), we find that

D( λp1(x, y) + (1 − λ)p2(x, y) || λq1(x, y) + (1 − λ)q2(x, y) )
   ≤ λ D( p1(x, y) || q1(x, y) ) + (1 − λ) D( p2(x, y) || q2(x, y) )    (99)

which holds since we know that D(p||q) is convex in (p, q); therefore
(95) holds and the proof is completed.
107 / 136
Entropy of a function of a random variable
• Let g(X ) be some known function of X . Then,

H(g(X )) ≤ H(X )

with equality if and only if the function is 1-to-1.


• In other words, entropy of a function of X is less than the entropy of X .

108 / 136
Entropy of a function of a random variable
• Proof. We know that,

I(X ; Y ) = H(X ) − H(X |Y ) = H(Y ) − H(Y |X ) (100)

Let Y = g(X ), then (100) becomes:

H(X ) − H(X |g(X )) = H(g(X )) − H(g(X )|X ) (101)

which means

H(g(X )) = H(X ) − H(X |g(X )) + H(g(X )|X ) (102)

109 / 136
Entropy of a function of a random variable
• Proof. We know that,

I(X ; Y ) = H(X ) − H(X |Y ) = H(Y ) − H(Y |X ) (100)

Let Y = g(X ), then (100) becomes:

H(X ) − H(X |g(X )) = H(g(X )) − H(g(X )|X ) (101)

which means

H(g(X )) = H(X ) − H(X |g(X )) + H(g(X )|X ) (102)

• Now, note that H(g(X )|X ) = 0, because by knowing X we can


immediately find g(X ). In other words, there’s no uncertainty in g(X )
after knowing X ! Then, (102) becomes,

H(g(X )) = H(X ) − H(X |g(X )) (103)

• Finally, note that H(X |g(X )) ≥ 0 since entropy cannot be negative, so

H(g(X )) ≤ H(X ) (104)


109 / 136
Example 16
• Example 16. (Entropy of a sum) Let X and Y be two random variables
taking values in x1 , . . . , xn and y1 , . . . , yn , respectively. Let Z = X + Y .
(a) Show that H(Z |X ) = H(Y |X ) and if X and Y are independent,
then H(Z ) ≥ H(X ) and H(Z ) ≥ H(Y ).
• Solution:
H(Z|X) = Σ_x p(x) H(Z|X = x)
       = Σ_x p(x) Σ_z P[Z = z|X = x] log( 1 / P[Z = z|X = x] )
       = Σ_x p(x) Σ_z P[Y = z − X|X = x] log( 1 / P[Y = z − X|X = x] )
       = Σ_x p(x) Σ_z P[Y = z − x|X = x] log( 1 / P[Y = z − x|X = x] )
       = Σ_x p(x) Σ_y P[Y = y|X = x] log( 1 / P[Y = y|X = x] )
       = Σ_x p(x) H(Y|X = x)
       = H(Y|X)    (105)
110 / 136
Example 16
• If X and Y are independent,

H(Z) ≥ H(Z|X) = H(Y|X) = H(Y)
        (A)      (B)       (C)

• (A): By Theorem 9 (Conditioning reduces entropy)


(B): By equation (105)
(C): Since X and Y are independent
• We can follow the same steps (for X and Y independent) to show
H(Z ) ≥ H(Z |Y ) = H(X |Y ) = H(X )
• Since H(Z ) ≥ H(X ) and H(Z ) ≥ H(Y ) for independent X and Y and
Z = X + Y , we conclude that adding two independent random
variables increases entropy.

111 / 136
Example 16
• (b) Give an example of (necessarily dependent) random variables in
which H(X ) > H(Z ) and H(Y ) > H(Z ) where Z = X + Y .
• Solution.
Let

X = 1 w.p. (with probability) 1/2
    0 w.p. 1/2

and

Y = −X
Z = X + Y

• Then,

H(X ) = H(Y ) = 1 bit


H(Z ) = 0 bits (Why?)

112 / 136
Example 16
• (c) Under what conditions does

H(Z ) = H(X ) + H(Y )

hold?

113 / 136
Example 16
• Solution. Recall that for any function g(X ) of X , we showed that
H(g(X )) ≤ H(X ), with equality if and only if the function is 1-to-1.
• Let Z be a function of (X , Y ). Then,

H(Z ) ≤ H(X , Y ) = H(X ) + H(Y |X ) ≤ H(X ) + H(Y ) (106)

where the equality holds if and only if (iff) Z is a 1-to-1 function of


(X , Y ) and X and Y are independent.
• Therefore, H(X + Y ) = H(X ) + H(Y ) iff X and Y are independent and
Z is a 1-to-1 function of (X , Y ). That is, the value of Z should uniquely
determine the value of (X , Y ).
• For example, suppose X takes values in {0, 1}, and Y takes values in
{−5, 4}. Then,
• if Z = −5 we know for sure that X = 0, Y = −5 (no other assignment
satisfies Z = −5!),
• if Z = 4 we know that X = 0, Y = 4,
• if Z = −4 we know that X = 1, Y = −5,
• if Z = 5 we know that X = 1, Y = 4.
114 / 136
Recap
• So far, we have seen some important properties of entropy, relative
entropy, and mutual information.

• Next, we will learn about one of the most important concepts in


information theory, called the data processing inequality.

• We will first define/review “Markov Chains”. To read more on Markov


chains, see the book of Papoulis (Probability, Random variables, and
Stochastic Processes), Chapter 15.

115 / 136
Markov Chains
• Let Xn be a discrete-time, discrete-state random process.
• Xn is called a Markov chain iff

P[Xn = xn |Xn−1 = xn−1 , . . . , X1 = x1 ] = P[Xn = xn |Xn−1 = xn−1 ]

• This is called the Markov property: the most recent past summarizes all history.
• The future of the process, given the present, is independent of its past.
• Typically, we represent the states and the transition probabilities as:
[State-transition diagram: states i, j, k , l, with directed edges labeled by transition probabilities such as P[Xn = l | Xn−1 = j] on the edge from j to l]

116 / 136
Markov Chains
• Definition 10. Three random variables X , Y , Z form a Markov chain
in that order, shown as X → Y → Z if the following equivalent
conditions are satisfied:
1. p(z|y, x) = p(z|y )
2. p(x, y , z) = p(z|y )p(y |x)p(x), since

       p(x, y , z) = p(z|y , x)p(y |x)p(x) = p(z|y )p(y |x)p(x),

   where p(z|y , x) = p(z|y ) if the chain is Markov.
3. p(x, z|y ) = p(z|y )p(x|y ), since

       p(x, z|y ) = p(z|y , x)p(x|y ) = p(z|y )p(x|y ),

   where p(z|y , x) = p(z|y ) if the chain is Markov.
• This means that X and Z are independent when conditioned on Y (recall the definition of conditional independence); a numerical check of this factorization is sketched below.
• Remark: If X → Y → Z , then Z → Y → X (prove this as an exercise).
• Remark: If Z = f (Y ), then we always have X → Y → Z .
117 / 136
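The sketch below (my own addition; the alphabet sizes and probabilities are arbitrary placeholders) builds a joint PMF of the form p(x)p(y|x)p(z|y) and verifies condition 3, p(x, z|y ) = p(x|y )p(z|y ), numerically.

```python
import itertools
import random

random.seed(0)

def random_pmf(n):
    """An arbitrary PMF on {0, ..., n-1} (placeholder numbers)."""
    w = [random.random() for _ in range(n)]
    return [v / sum(w) for v in w]

nx, ny, nz = 2, 3, 2
p_x = random_pmf(nx)
p_y_given_x = [random_pmf(ny) for _ in range(nx)]
p_z_given_y = [random_pmf(nz) for _ in range(ny)]

# Condition 2 of Definition 10: p(x, y, z) = p(x) p(y|x) p(z|y)
p = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
     for x, y, z in itertools.product(range(nx), range(ny), range(nz))}

# Check condition 3: p(x, z | y) = p(x | y) p(z | y) for every (x, y, z)
for y in range(ny):
    p_y = sum(p[(x, y, z)] for x in range(nx) for z in range(nz))
    for x, z in itertools.product(range(nx), range(nz)):
        p_xz_y = p[(x, y, z)] / p_y
        p_x_y = sum(p[(x, y, zz)] for zz in range(nz)) / p_y
        p_z_y = sum(p[(xx, y, z)] for xx in range(nx)) / p_y
        assert abs(p_xz_y - p_x_y * p_z_y) < 1e-12

print("p(x, z|y) = p(x|y) p(z|y) holds for this chain")
```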
Data Processing Inequality (DPI)
• Theorem 15. If X , Y , Z form a Markov chain X → Y → Z , then,

I(X ; Y ) ≥ I(X ; Z )

• This means that no kind of processing can ever increase mutual information (a numerical illustration is sketched after the proof).

118 / 136
Data Processing Inequality (DPI)
• Proof. Recall the Chain rule of mutual information:
I(X1 , . . . , Xn ; Y ) = Σ_{i=1}^{n} I(Xi ; Y |Xi−1 , . . . , X1 )

• So, we have,
I(X ; Y , Z ) = I(X ; Y |Z ) + I(X ; Z )
and similarly,
I(X ; Y , Z ) = I(X ; Z |Y ) + I(X ; Y )
so that:
I(X ; Y |Z ) + I(X ; Z ) = I(X ; Z |Y ) + I(X ; Y ) (107)
• If X → Y → Z , then, conditioned on Y , random variables X and Z are
independent (i.e., p(x, z|y ) = p(x|y )p(z|y)). Therefore the mutual
information between X and Z conditioned on Y is 0, i.e.,

I(X ; Z |Y ) = 0

119 / 136
Data Processing Inequality (DPI)
• Then, (107) becomes,

I(X ; Y |Z ) + I(X ; Z ) = I(X ; Y ) (108)

• Finally, note that I(X ; Y |Z ) ≥ 0 because mutual information cannot be


negative. Then, we have from (108) that,

I(X ; Z ) ≤ I(X ; Y ) (109)

which completes the proof.

120 / 136
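The following sketch (my own addition; the alphabet sizes and transition probabilities are arbitrary placeholders) builds a chain X → Y → Z and confirms numerically that I(X ; Z ) never exceeds I(X ; Y ).

```python
import itertools
import math
import random

random.seed(1)

def random_pmf(n):
    """An arbitrary PMF on {0, ..., n-1} (placeholder numbers)."""
    w = [random.random() for _ in range(n)]
    return [v / sum(w) for v in w]

def mutual_information(joint):
    """I(A;B) in bits from a joint PMF given as a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

# Markov chain X -> Y -> Z: p(x, y, z) = p(x) p(y|x) p(z|y)
nx, ny, nz = 3, 3, 2
p_x = random_pmf(nx)
p_y_given_x = [random_pmf(ny) for _ in range(nx)]
p_z_given_y = [random_pmf(nz) for _ in range(ny)]

p_xy, p_xz = {}, {}
for x, y, z in itertools.product(range(nx), range(ny), range(nz)):
    p = p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
    p_xy[(x, y)] = p_xy.get((x, y), 0) + p
    p_xz[(x, z)] = p_xz.get((x, z), 0) + p

print(mutual_information(p_xy))   # I(X;Y)
print(mutual_information(p_xz))   # I(X;Z) <= I(X;Y), as the DPI guarantees
```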
Data Processing Inequality (DPI)
• Corollary 8. If X → Y → Z forms a Markov chain,

I(X ; Y |Z ) ≤ I(X ; Y ) (110)

• Proof follows from the same argument above.


• The “dependence” between X and Y decreases by observing a
“downstream” random variable Z .
• This does not necessarily hold if X , Y , Z don’t form a Markov chain.

121 / 136
Data Processing Inequality (DPI)
• Corollary 9. If Z = g(Y )

I(X ; Y ) ≥ I(X ; g(Y )) (111)

• Proof. Recall that X → Y → g(Y ) always forms a Markov chain, since g(Y ) is a function of Y .

• The DPI says that we can never get more information (about a random variable) by further processing (that random variable)!

122 / 136
Example 17
• Problem. Show that if H(X |Y ) = 0, then there exists a function g(Y )
such that X = g(Y ). In other words, X is a function of Y .

123 / 136
Example 17
• Solution. Let’s start by the definition of H(X |Y ):
X
H(X |Y ) = p(y)H(X |Y = y )
y ∈Y
X X 1
= p(y) P[X = x|Y = y] log
P[X = x|Y = y ]
y ∈Y x∈X
| {z }
≥0

• To have this equal to zero, whenever p(y ) > 0 we should have H(X |Y = y ) = 0. To have H(X |Y = y ) = 0, we must have:

    P[X = x|Y = y ] = 1 if x = x0 , and 0 otherwise,    (112)

  for some x0 ∈ X .

124 / 136
Example 17
• We can re-write (112) as:

P[X = x|Y = y ] = δ(x − x0 )

for some x0 , which means for each y with p(y ) > 0, there is only one
possible value of x, hence x = g(y ).
• For p(y ) = 0, we can assign g(y ) to an arbitrary value in X (such y never occurs, so the choice does not matter).

• Then, for each value of Y , X takes only one value with probability 1, i.e., X = g(Y ). (A small numerical check of this equivalence is sketched below.)

125 / 136
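A small numerical check of this equivalence (my own addition; the joint PMFs below are arbitrary examples): when X is a function of Y the conditional entropy comes out as 0, and otherwise it is strictly positive.

```python
import math

def conditional_entropy(joint):
    """H(X|Y) in bits from a joint PMF given as a dict {(x, y): probability}."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0) + p
    return sum(p * math.log2(p_y[y] / p) for (x, y), p in joint.items() if p > 0)

# Case 1: X = g(Y) with g(y) = y mod 2, Y uniform on {0, 1, 2, 3}
deterministic = {(y % 2, y): 0.25 for y in range(4)}
print(conditional_entropy(deterministic))   # 0.0 -- X is a function of Y

# Case 2: X is only a noisy copy of Y, not a function of Y
noisy = {(0, 0): 0.4, (1, 0): 0.1, (1, 1): 0.4, (0, 1): 0.1}
print(conditional_entropy(noisy))           # about 0.72 bits > 0
```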
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, if H(X |Y ) = 0 if and only if X is a function of Y .

126 / 136
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, if H(X |Y ) = 0 if and only if X is a function of Y .
• In other words, we can estimate the value of X from the observations
Y with zero error probability if and only if H(X |Y ) = 0.

126 / 136
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, if H(X |Y ) = 0 if and only if X is a function of Y .
• In other words, we can estimate the value of X from the observations
Y with zero error probability if and only if H(X |Y ) = 0.
• We will now see an important inequality, called Fano’s inequality,
which extends this argument to arbitrary X and Y .

126 / 136
Fano’s Inequality
• From Example 17, we know that if H(X |Y ) = 0, then X = g(Y ) for
some function g(·).
• We also know that if X = g(Y ) for some g(·), then H(X |Y ) = 0.
• Therefore, H(X |Y ) = 0 if and only if X is a function of Y .
• In other words, we can estimate the value of X from the observations
Y with zero error probability if and only if H(X |Y ) = 0.
• We will now see an important inequality, called Fano’s inequality,
which extends this argument to arbitrary X and Y .
• It says that if we are to estimate a random variable X from the observations of Y with small probability of error, then H(X |Y ) must be small.

126 / 136
Fano’s Inequality
• Suppose there are two random variables X and Y with joint PMF
p(x, y ).

127 / 136
Fano’s Inequality
• Suppose there are two random variables X and Y with joint PMF
p(x, y ).
• We observe Y and want to guess the value of X .

127 / 136
Fano’s Inequality
• Suppose there are two random variables X and Y with joint PMF
p(x, y ).
• We observe Y and want to guess the value of X .

• Let the “guess” be X̂ = g(Y ) and define the probability of making a wrong guess:

    Pe = P(X̂ ≠ X )

127 / 136
Fano’s Inequality
• Theorem 13. (Fano’s Inequality) For any estimator X̂ = g(Y ) (this
implies X → Y → X̂ ),

H(Pe ) + Pe log |X | ≥ H(X |Y ) (113)

which can be weakened to (by using the fact that for the binary
entropy function H(Pe ) ≤ 1),

1 + Pe log |X | ≥ H(X |Y ) (114)


H(X |Y ) − 1
Pe ≥ (115)
log |X |

128 / 136
Fano’s Inequality
• Theorem 13. (Fano’s Inequality) For any estimator X̂ = g(Y ) (this
implies X → Y → X̂ ),

H(Pe ) + Pe log |X | ≥ H(X |Y ) (113)

which can be weakened to (by using the fact that for the binary
entropy function H(Pe ) ≤ 1),

1 + Pe log |X | ≥ H(X |Y ) (114)


H(X |Y ) − 1
Pe ≥ (115)
log |X |
• If Pe = 0, then Fano’s inequality says that H(X |Y ) = 0, which is in line
with our intuition.

128 / 136
Fano’s Inequality
• Theorem 13. (Fano’s Inequality) For any estimator X̂ = g(Y ) (this
implies X → Y → X̂ ),

H(Pe ) + Pe log |X | ≥ H(X |Y ) (113)

which can be weakened to (by using the fact that for the binary
entropy function H(Pe ) ≤ 1),

1 + Pe log |X | ≥ H(X |Y ) (114)


Pe ≥ (H(X |Y ) − 1) / log |X |    (115)
• If Pe = 0, then Fano’s inequality says that H(X |Y ) = 0, which is in line
with our intuition.
• The main use of Fano's inequality will be in the converse of the channel capacity theorem. (A small numerical check of (113) is sketched after the proof.)

128 / 136
Fano’s Inequality
• Proof. Define a random variable:
(
1 if X̂ 6= X Wrong decision
E= (116)
0 if X̂ = X Correct decision

129 / 136
Fano’s Inequality
• Proof. Define a random variable:
(
1 if X̂ 6= X Wrong decision
E= (116)
0 if X̂ = X Correct decision

• Use chain rule of entropy in two ways:

H(E, X |Y ) = H(X |Y ) + H(E|X , Y ) = H(E|Y ) + H(X |E, Y )


| {z }
0

therefore,

H(X |Y ) = H(E|Y ) +H(X |E, Y )


| {z }
≤H[E]

129 / 136
Fano’s Inequality
• Proof. Define a random variable:
    E = 1   if X̂ ≠ X    (wrong decision)
    E = 0   if X̂ = X    (correct decision)                (116)

• Use the chain rule of entropy in two ways:

    H(E, X |Y ) = H(X |Y ) + H(E|X , Y ) = H(E|Y ) + H(X |E, Y )

  Since E is determined by X and X̂ = g(Y ), we have H(E|X , Y ) = 0; therefore,

    H(X |Y ) = H(E|Y ) + H(X |E, Y ) ≤ H(E) + H(X |E, Y )

  where we used the fact that conditioning reduces entropy, H(E|Y ) ≤ H(E).
• By defining H(Pe ) ≜ H(E) (the binary entropy of Pe , since P[E = 1] = Pe ), we have

    H(X |Y ) ≤ H(Pe ) + H(X |E, Y )

129 / 136
Fano’s Inequality
• But

    H(X |E, Y ) = Σ_{e=0}^{1} P[E = e] H(X |Y , E = e)
                = P[E = 0] H(X |Y , E = 0) + P[E = 1] H(X |Y , E = 1)
                = (1 − Pe ) · 0 + Pe · H(X |Y , E = 1)     (A)

• (A): Given E = 1, we know Y and that X̂ = g(Y ) ≠ X , so X takes one of the remaining values (anything except the one we predicted), i.e., one of |X | − 1 values. This conditional entropy is therefore upper bounded by log(|X | − 1), so that:

H(Pe ) + Pe log(|X | − 1) ≥ H(X |Y )

We can relax this by noting that H(Pe ) ≤ 1 and log(|X | − 1) ≤ log |X |,

H(X |Y ) ≤ 1 + Pe log |X |

which leads to the lower bound on Pe .


130 / 136
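The sketch below (my own addition; the joint PMF is an arbitrary example, and the estimator is the MAP rule, i.e., guess the most likely x for each observed y) checks inequality (113) numerically. Fano's inequality must hold for any estimator, so the MAP choice here is just for concreteness.

```python
import math

def binary_entropy(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# An arbitrary joint PMF p(x, y) with X in {0, 1, 2} and Y in {0, 1}
p = {(0, 0): 0.30, (1, 0): 0.10, (2, 0): 0.05,
     (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.35}
alphabet_X = {0, 1, 2}

p_y = {}
for (x, y), pr in p.items():
    p_y[y] = p_y.get(y, 0) + pr

# H(X|Y)
H_X_given_Y = sum(pr * math.log2(p_y[y] / pr) for (x, y), pr in p.items() if pr > 0)

# MAP estimator and its error probability Pe = P(Xhat != X)
guess = {y: max(alphabet_X, key=lambda x: p[(x, y)]) for y in p_y}
Pe = sum(pr for (x, y), pr in p.items() if guess[y] != x)

fano_bound = binary_entropy(Pe) + Pe * math.log2(len(alphabet_X))
print(H_X_given_Y, Pe, fano_bound)   # H(X|Y) ~ 1.23 <= 1.49 ~ H(Pe) + Pe log|X|
```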
Another Useful Inequality
• Here is another inequality that relates probability of error and entropy.
• Theorem 18. Let X and X ′ be two independent identically distributed (i.i.d.) random variables with entropy H(X ). Then,

    P(X = X ′ ) ≥ 2^{−H(X )}    (117)

• Proof. First, note that:

    P(X = X ′ ) = Σ_x p(x) P[X ′ = x|X = x]
                = Σ_x p(x) P[X ′ = x]      (X and X ′ are independent)
                = Σ_x p(x) P[X = x]        (X and X ′ are identically distributed)
                = Σ_x p(x)^2

131 / 136
Another Useful Inequality
• Then,

    2^{−H(X )} = 2^{ Σ_x p(x) log p(x) }
               ≤ Σ_x p(x) 2^{log p(x)}      (from Jensen's inequality)    (118)
               = Σ_x p(x)^2
               = P(X = X ′ )

  where in (118) we used the convexity of the function f (Y ) = 2^Y . Then, by letting Y = log p(X ), we have

    f (E[Y ]) ≤ E[f (Y )]  ⇒  2^{E[Y ]} ≤ E[2^Y ]  ⇒  2^{E[log p(X )]} ≤ E[2^{log p(X )}]

  (A quick numerical check of (117) is sketched below.)

132 / 136
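A quick numerical check of (117) (my own addition; the PMF below is an arbitrary placeholder):

```python
import math
import random

random.seed(2)

# An arbitrary PMF on a 5-letter alphabet (placeholder numbers)
w = [random.random() for _ in range(5)]
p = [v / sum(w) for v in w]

H = sum(q * math.log2(1 / q) for q in p)        # H(X) in bits
collision = sum(q * q for q in p)               # P(X = X') for i.i.d. X, X'

print(collision, 2 ** (-H))   # the collision probability is never below 2^{-H(X)}
```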
Summary
• We have now finished the “toolbox” lectures, i.e., Chapter 2 from the
book.
• Next, we will start Chapter 3, the “Asymptotic Equipartition Property”.

133 / 136
