
Chapter 2

Entropy, Relative Entropy and Mutual Information

1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number
of flips required.

(a) Find the entropy H(X) in bits. The following expressions may be useful:

        Σ_{n=0}^∞ r^n = 1/(1 − r) ,        Σ_{n=0}^∞ n r^n = r/(1 − r)^2 .

(b) A random variable X is drawn according to this distribution. Find an “efficient”
    sequence of yes-no questions of the form, “Is X contained in the set S ?” Compare
    H(X) to the expected number of questions required to determine X .

Solution:
(a) The number X of tosses till the first head appears has the geometric distribution
with parameter p = 1/2 , where P (X = n) = p q^{n−1} , n ∈ {1, 2, . . .} . Hence the
entropy of X is

    H(X) = − Σ_{n=1}^∞ p q^{n−1} log(p q^{n−1})
         = − [ Σ_{n=0}^∞ p q^n log p + Σ_{n=0}^∞ n p q^n log q ]
         = −p log p/(1 − q) − p q log q/p^2
         = (−p log p − q log q)/p
         = H(p)/p bits.
If p = 1/2 , then H(X) = 2 bits.

(b) Intuitively, it seems clear that the best questions are those that have equally likely
chances of receiving a yes or a no answer. Consequently, one possible guess is
that the most “efficient” series of questions is: Is X = 1 ? If not, is X = 2 ?
If not, is X = 3 ? . . . with a resulting expected number of questions equal to
Σ_{n=1}^∞ n(1/2)^n = 2. This should reinforce the intuition that H(X) is a mea-

sure of the uncertainty of X . Indeed in this case, the entropy is exactly the
same as the average number of questions needed to define X , and in general
E(# of questions) ≥ H(X) . This problem has an interpretation as a source cod-
ing problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then
the set of questions in the above procedure can be written as a collection of (X, Y )
pairs: (1, 1), (2, 01), (3, 001), etc. In fact, this intuitively derived code is the
optimal (Huffman) code minimizing the expected number of questions.
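As a quick numerical sanity check (added here, not part of the original solution), the following Python sketch computes H(X) directly from the geometric pmf and the expected number of questions under the “Is X = 1? If not, is X = 2? . . .” strategy; for p = 1/2 both come out to 2 bits, matching H(p)/p.

    import math

    def geometric_entropy_and_questions(p, tol=1e-15):
        # P(X = n) = p * (1 - p)**(n - 1), n = 1, 2, ...
        q = 1.0 - p
        H = 0.0        # entropy in bits
        EQ = 0.0       # expected number of questions "Is X = n?"
        n = 1
        while True:
            pn = p * q ** (n - 1)
            if pn < tol:
                break
            H -= pn * math.log2(pn)
            EQ += n * pn
            n += 1
        return H, EQ

    H, EQ = geometric_entropy_and_questions(0.5)
    print(H, EQ)   # both are (numerically) 2.0, i.e. H(p)/p = 2 bits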

2. Entropy of functions. Let X be a random variable taking on a finite number of


values. What is the (general) inequality relationship of H(X) and H(Y ) if

(a) Y = 2X ?
(b) Y = cos X ?

Solution: Let y = g(x) . Then

    p(y) = Σ_{x: y=g(x)} p(x).

Consider any set of x ’s that map onto a single y . For this set

    Σ_{x: y=g(x)} p(x) log p(x) ≤ Σ_{x: y=g(x)} p(x) log p(y) = p(y) log p(y),

since log is a monotone increasing function and p(x) ≤ Σ_{x: y=g(x)} p(x) = p(y) . Ex-
tending this argument to the entire range of X (and Y ), we obtain

    H(X) = − Σ_x p(x) log p(x)
         = − Σ_y Σ_{x: y=g(x)} p(x) log p(x)
         ≥ − Σ_y p(y) log p(y)
         = H(Y ),

with equality iff g is one-to-one with probability one.

(a) Y = 2X is one-to-one and hence the entropy, which is just a function of the
probabilities (and not the values of a random variable) does not change, i.e.,
H(X) = H(Y ) .
(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that
H(X) ≥ H(Y ) , with equality if cosine is one-to-one on the range of X .
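For a concrete illustration (an added sketch; the three-point distribution on {0, π, 2π} is an arbitrary choice), the Python snippet below compares H(X), H(2X) and H(cos X): the one-to-one map 2X preserves the entropy, while cos merges two outcomes and strictly reduces it.

    from collections import defaultdict
    from math import log2, cos, pi

    def entropy(pmf):
        return -sum(p * log2(p) for p in pmf.values() if p > 0)

    def pushforward(pmf, g):
        # p(y) = sum of p(x) over all x with g(x) = y
        out = defaultdict(float)
        for x, p in pmf.items():
            out[g(x)] += p
        return dict(out)

    p_X = {0.0: 0.25, pi: 0.25, 2 * pi: 0.5}                  # toy distribution
    print(entropy(p_X))                                        # H(X)     = 1.5 bits
    print(entropy(pushforward(p_X, lambda x: 2 * x)))          # H(2X)    = 1.5 bits
    print(entropy(pushforward(p_X, lambda x: round(cos(x), 9))))  # H(cos X) = 0.811 bits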

3. Minimum entropy. What is the minimum value of H(p 1 , ..., pn ) = H(p) as p ranges
over the set of n -dimensional probability vectors? Find all p ’s which achieve this
minimum.
Solution: We wish to find all probability vectors p = (p 1 , p2 , . . . , pn ) which minimize
!
H(p) = − pi log pi .
i

Now −pi log pi ≥ 0 , with equality iff pi = 0 or 1 . Hence the only possible probability
vectors which minimize H(p) are those with pi = 1 for some i and pj = 0, j ≠ i .
There are n such vectors, i.e., (1, 0, . . . , 0) , (0, 1, 0, . . . , 0) , . . . , (0, . . . , 0, 1) , and the
minimum value of H(p) is 0.
4. Entropy of functions of a random variable. Let X be a discrete random variable.
Show that the entropy of a function of X is less than or equal to the entropy of X by
justifying the following steps:
(a)
H(X, g(X)) = H(X) + H(g(X) | X) (2.1)
(b)
= H(X); (2.2)
(c)
H(X, g(X)) = H(g(X)) + H(X | g(X)) (2.3)
(d)
≥ H(g(X)). (2.4)
Thus H(g(X)) ≤ H(X).
Solution: Entropy of functions of a random variable.
(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence
    H(g(X)|X) = Σ_x p(x)H(g(X)|X = x) = Σ_x 0 = 0 .
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) ≥ 0 , with equality iff X is a function of g(X) , i.e., g(.) is one-to-one.
Hence H(X, g(X)) ≥ H(g(X)) .
Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)) .
5. Zero conditional entropy. Show that if H(Y |X) = 0 , then Y is a function of X ,
i.e., for all x with p(x) > 0 , there is only one possible value of y with p(x, y) > 0 .
Solution: Zero Conditional Entropy. Assume that there exists an x , say x 0 and two
different values of y , say y1 and y2 such that p(x0 , y1 ) > 0 and p(x0 , y2 ) > 0 . Then
p(x0 ) ≥ p(x0 , y1 ) + p(x0 , y2 ) > 0 , and p(y1 |x0 ) and p(y2 |x0 ) are not equal to 0 or 1.
Thus
    H(Y |X) = − Σ_x p(x) Σ_y p(y|x) log p(y|x)                                    (2.5)
            ≥ p(x0 )(−p(y1 |x0 ) log p(y1 |x0 ) − p(y2 |x0 ) log p(y2 |x0 ))      (2.6)
            > 0,                                                                  (2.7)

since −t log t ≥ 0 for 0 ≤ t ≤ 1 , and is strictly positive for t not equal to 0 or 1.


Therefore the conditional entropy H(Y |X) is 0 if and only if Y is a function of X .

6. Conditional mutual information vs. unconditional mutual information. Give


examples of joint random variables X , Y and Z such that

(a) I(X; Y | Z) < I(X; Y ) ,


(b) I(X; Y | Z) > I(X; Y ) .

Solution: Conditional mutual information vs. unconditional mutual information.

(a) The last corollary to Theorem 2.8.1 in the text states that if X → Y → Z , that
    is, if p(x, y | z) = p(x | z)p(y | z) , then I(X; Y ) ≥ I(X; Y | Z) . Equality holds if
    and only if I(X; Z) = 0 , i.e., X and Z are independent.
A simple example of random variables satisfying the inequality conditions above
is, X is a fair binary random variable and Y = X and Z = Y . In this case,

I(X; Y ) = H(X) − H(X | Y ) = H(X) = 1

and,
I(X; Y | Z) = H(X | Z) − H(X | Y, Z) = 0.
So that I(X; Y ) > I(X; Y | Z) .
(b) This example is also given in the text. Let X, Y be independent fair binary
random variables and let Z = X + Y . In this case we have that,

I(X; Y ) = 0

and,
I(X; Y | Z) = H(X | Z) = 1/2.
So I(X; Y ) < I(X; Y | Z) . Note that in this case X, Y, Z do not form a Markov chain.

7. Coin weighing. Suppose one has n coins, among which there may or may not be one
counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than
the other coins. The coins are to be weighed by a balance.

(a) Find an upper bound on the number of coins n so that k weighings will find the
counterfeit coin (if any) and correctly declare it to be heavier or lighter.
(b) (Difficult) What is the coin weighing strategy for k = 3 weighings and 12 coins?

Solution: Coin weighing.

(a) For n coins, there are 2n + 1 possible situations or “states”.


• One of the n coins is heavier.
• One of the n coins is lighter.
• They are all of equal weight.

Each weighing has three possible outcomes - equal, left pan heavier or right pan
heavier. Hence with k weighings, there are 3 k possible outcomes and hence we
can distinguish between at most 3k different “states”. Hence 2n + 1 ≤ 3k or
n ≤ (3k − 1)/2 .
Looking at it from an information theoretic viewpoint, each weighing gives at most
log2 3 bits of information. There are 2n + 1 possible “states”, with a maximum
entropy of log2 (2n + 1) bits. Hence in this situation, one would require at least
log2 (2n + 1)/ log2 3 weighings to extract enough information for determination of
the odd coin, which gives the same result as above.
(b) There are many solutions to this problem. We will give one which is based on the
ternary number system.
We may express the numbers {−12, −11, . . . , −1, 0, 1, . . . , 12} in a ternary number
system with alphabet {−1, 0, 1} . For example, the number 8 is (−1, 0, 1), where
−1 × 3^0 + 0 × 3^1 + 1 × 3^2 = 8 . We form the matrix with the representation of the
positive numbers as its columns.
           1   2   3   4   5   6   7   8   9  10  11  12
    3^0    1  -1   0   1  -1   0   1  -1   0   1  -1   0     Σ1 = 0
    3^1    0   1   1   1  -1  -1  -1   0   0   0   1   1     Σ2 = 2
    3^2    0   0   0   0   1   1   1   1   1   1   1   1     Σ3 = 8
Note that the row sums are not all zero. We can negate some columns to make
the row sums zero. For example, negating columns 7,9,11 and 12, we obtain
           1   2   3   4   5   6   7   8   9  10  11  12
    3^0    1  -1   0   1  -1   0  -1  -1   0   1   1   0     Σ1 = 0
    3^1    0   1   1   1  -1  -1   1   0   0   0  -1  -1     Σ2 = 0
    3^2    0   0   0   0   1   1  -1   1  -1   1  -1  -1     Σ3 = 0
Now place the coins on the balance according to the following rule: For weighing
#i , place coin n
• On left pan, if ni = −1 .
• Aside, if ni = 0 .
• On right pan, if ni = 1 .
The outcome of the three weighings will find the odd coin if any and tell whether
it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if
the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings
give the ternary expansion of the index of the odd coin. If the expansion is the
same as the expansion in the matrix, it indicates that the coin is heavier. If
the expansion is of the opposite sign, the coin is lighter. For example, (0,−1,−1)
indicates (0)3^0 + (−1)3^1 + (−1)3^2 = −12 , hence coin #12 is heavy, (1,0,−1) indicates
#8 is light, (0,0,0) indicates no odd coin.
Why does this scheme work? It is a single error correcting Hamming code for the
ternary alphabet (discussed in Section 8.11 in the book). Here are some details.
First note a few properties of the matrix above that was used for the scheme.
All the columns are distinct and no two columns add to (0,0,0). Also if any coin

is heavier, it will produce the sequence of weighings that matches its column in
the matrix. If it is lighter, it produces the negative of its column as a sequence
of weighings. Combining all these facts, we can see that any single odd coin will
produce a unique sequence of weighings, and that the coin can be determined from
the sequence.
One of the questions that many of you had was whether the bound derived in part (a)
is actually achievable. For example, can one distinguish 13 coins in 3 weighings?
No, not with a scheme like the one above. Yes, under the assumptions under
which the bound was derived. The bound did not prohibit the division of coins
into halves, nor did it disallow the existence of another coin known to be
normal. Under both these conditions, it is possible to find the odd coin among 13 coins
in 3 weighings. You could try modifying the above scheme to these cases.
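The following Python sketch (an added illustration, not from the original text) hard-codes the columns of the sign-adjusted matrix above and exhaustively checks that each of the 24 heavy/light possibilities, plus the all-normal case, yields a distinct triple of weighing outcomes that decodes correctly.

    from itertools import product

    # Columns of the sign-adjusted matrix: ternary digits (3^0, 3^1, 3^2) for coins 1..12.
    DIGITS = {
        1: (1, 0, 0),   2: (-1, 1, 0),   3: (0, 1, 0),    4: (1, 1, 0),
        5: (-1, -1, 1), 6: (0, -1, 1),   7: (-1, 1, -1),  8: (-1, 0, 1),
        9: (0, 0, -1), 10: (1, 0, 1),   11: (1, -1, -1), 12: (0, -1, -1),
    }

    def weigh(i, weights):
        # Weighing i: coins with digit -1 on the left pan, digit +1 on the right pan.
        left = sum(weights[c] for c, d in DIGITS.items() if d[i] == -1)
        right = sum(weights[c] for c, d in DIGITS.items() if d[i] == 1)
        return 0 if left == right else (-1 if left > right else 1)

    def decode(outcome):
        if outcome == (0, 0, 0):
            return None                      # no counterfeit coin
        for c, d in DIGITS.items():
            if outcome == d:
                return (c, 'heavy')
            if outcome == tuple(-x for x in d):
                return (c, 'light')

    # Exhaustive check: every heavy/light coin and the all-normal case decode correctly.
    for coin, kind in list(product(range(1, 13), ('heavy', 'light'))) + [(None, None)]:
        weights = {c: 1.0 for c in DIGITS}
        if coin is not None:
            weights[coin] = 1.1 if kind == 'heavy' else 0.9
        outcome = tuple(weigh(i, weights) for i in range(3))
        assert decode(outcome) == ((coin, kind) if coin is not None else None)
    print("all 25 cases decoded correctly")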

8. Drawing with and without replacement. An urn contains r red, w white, and
b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with
replacement or without replacement? Set it up and show why. (There is both a hard
way and a relatively simple way to do this.)
Solution: Drawing with and without replacement. Intuitively, it is clear that if the
balls are drawn with replacement, the number of possible choices for the i -th ball is
larger, and therefore the conditional entropy is larger. But computing the conditional
distributions is slightly involved. It is easier to compute the unconditional entropy.

• With replacement. In this case the conditional distribution of each draw is the
same for every draw. Thus

        Xi = red   with prob. r/(r + w + b)
             white with prob. w/(r + w + b)                                      (2.8)
             black with prob. b/(r + w + b)

    and therefore

        H(Xi |Xi−1 , . . . , X1 ) = H(Xi )                                       (2.9)
            = log(r + w + b) − (r/(r + w + b)) log r − (w/(r + w + b)) log w
              − (b/(r + w + b)) log b.                                           (2.10)

• Without replacement. The unconditional probability of the i -th ball being red is
still r/(r + w + b) , etc. Thus the unconditional entropy H(X i ) is still the same as
with replacement. The conditional entropy H(X i |Xi−1 , . . . , X1 ) is less than the
unconditional entropy, and therefore the entropy of drawing without replacement
is lower.
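A small numerical check (added here; the counts r = 2, w = 3, b = 5 are arbitrary) confirms the conclusion that the conditional entropy of the second draw is strictly smaller without replacement.

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    r, w, b = 2, 3, 5                      # illustrative counts (not from the text)
    n = r + w + b
    p1 = {'r': r / n, 'w': w / n, 'b': b / n}

    # With replacement: H(X2 | X1) = H(X1).
    H_with = H(p1.values())

    # Without replacement: condition on the colour of the first ball.
    counts = {'r': r, 'w': w, 'b': b}
    H_without = 0.0
    for c1, p in p1.items():
        rem = dict(counts)
        rem[c1] -= 1
        H_without += p * H([v / (n - 1) for v in rem.values()])

    print(H_with, H_without)               # about 1.486 (with) vs 1.466 (without)
    assert H_without < H_with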

9. A metric. A function ρ(x, y) is a metric if for all x, y ,

• ρ(x, y) ≥ 0
• ρ(x, y) = ρ(y, x)

• ρ(x, y) = 0 if and only if x = y


• ρ(x, y) + ρ(y, z) ≥ ρ(x, z) .

(a) Show that ρ(X, Y ) = H(X|Y ) + H(Y |X) satisfies the first, second and fourth
properties above. If we say that X = Y if there is a one-to-one function mapping
from X to Y , then the third property is also satisfied, and ρ(X, Y ) is a metric.
(b) Verify that ρ(X, Y ) can also be expressed as
ρ(X, Y ) = H(X) + H(Y ) − 2I(X; Y ) (2.11)
= H(X, Y ) − I(X; Y ) (2.12)
= 2H(X, Y ) − H(X) − H(Y ). (2.13)
Solution: A metric
(a) Let
ρ(X, Y ) = H(X|Y ) + H(Y |X). (2.14)
Then
• Since conditional entropy is always ≥ 0 , ρ(X, Y ) ≥ 0 .
• The symmetry of the definition implies that ρ(X, Y ) = ρ(Y, X) .
• By problem 2.6, it follows that H(Y |X) is 0 iff Y is a function of X and
H(X|Y ) is 0 iff X is a function of Y . Thus ρ(X, Y ) is 0 iff X and Y
are functions of each other - and therefore are equivalent up to a reversible
transformation.
• Consider three random variables X , Y and Z . Then
H(X|Y ) + H(Y |Z) ≥ H(X|Y, Z) + H(Y |Z) (2.15)
= H(X, Y |Z) (2.16)
= H(X|Z) + H(Y |X, Z) (2.17)
≥ H(X|Z), (2.18)
from which it follows that
ρ(X, Y ) + ρ(Y, Z) ≥ ρ(X, Z). (2.19)
Note that the inequality is strict unless X → Y → Z forms a Markov Chain
and Y is a function of X and Z .
(b) Since H(X|Y ) = H(X) − I(X; Y ) , the first equation follows. The second relation
follows from the first equation and the fact that H(X, Y ) = H(X) + H(Y ) −
I(X; Y ) . The third follows on substituting I(X; Y ) = H(X) + H(Y ) − H(X, Y ) .
10. Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn
according to probability mass functions p 1 (·) and p2 (·) over the respective alphabets
X1 = {1, 2, . . . , m} and X2 = {m + 1, . . . , n}. Let
        X = X1 , with probability α,
            X2 , with probability 1 − α.

(a) Find H(X) in terms of H(X1 ) and H(X2 ) and α.


(b) Maximize over α to show that 2H(X) ≤ 2H(X1 ) + 2H(X2 ) and interpret using the
notion that 2H(X) is the effective alphabet size.
Solution: Entropy. We can do this problem by writing down the definition of entropy
and expanding the various terms. Instead, we will use the algebra of entropies for a
simpler proof.
Since X1 and X2 have disjoint support sets, we can write
        X = X1 with probability α
            X2 with probability 1 − α

Define a function of X ,

        θ = f (X) = 1 when X = X1
                    2 when X = X2
Then as in problem 1, we have
H(X) = H(X, f (X)) = H(θ) + H(X|θ)
= H(θ) + p(θ = 1)H(X|θ = 1) + p(θ = 2)H(X|θ = 2)
= H(α) + αH(X1 ) + (1 − α)H(X2 )
where H(α) = −α log α − (1 − α) log(1 − α) .
11. A measure of correlation. Let X1 and X2 be identically distributed, but not
necessarily independent. Let
    ρ = 1 − H(X2 | X1 )/H(X1 ) .

(a) Show ρ = I(X1 ; X2 )/H(X1 ) .
(b) Show 0 ≤ ρ ≤ 1.
(c) When is ρ = 0 ?
(d) When is ρ = 1 ?
Solution: A measure of correlation. X1 and X2 are identically distributed and
    ρ = 1 − H(X2 |X1 )/H(X1 ) .
(a)
        ρ = (H(X1 ) − H(X2 |X1 ))/H(X1 )
          = (H(X2 ) − H(X2 |X1 ))/H(X1 )      (since H(X1 ) = H(X2 ))
          = I(X1 ; X2 )/H(X1 ) .

(b) Since 0 ≤ H(X2 |X1 ) ≤ H(X2 ) = H(X1 ) , we have

        0 ≤ H(X2 |X1 )/H(X1 ) ≤ 1
        0 ≤ ρ ≤ 1 .
(c) ρ = 0 iff I(X1 ; X2 ) = 0 iff X1 and X2 are independent.
(d) ρ = 1 iff H(X2 |X1 ) = 0 iff X2 is a function of X1 . By symmetry, X1 is a
function of X2 , i.e., X1 and X2 have a one-to-one relationship.

12. Example of joint entropy. Let p(x, y) be given by

              Y
      X       0      1
      0      1/3    1/3
      1       0     1/3

Find

(a) H(X), H(Y ).


(b) H(X | Y ), H(Y | X).
(c) H(X, Y ).
(d) H(Y ) − H(Y | X).
(e) I(X; Y ) .
(f) Draw a Venn diagram for the quantities in (a) through (e).

Solution: Example of joint entropy

(a) H(X) = (2/3) log(3/2) + (1/3) log 3 = 0.918 bits = H(Y ) .
(b) H(X|Y ) = (1/3) H(X|Y = 0) + (2/3) H(X|Y = 1) = 0.667 bits = H(Y |X) .
(c) H(X, Y ) = 3 × (1/3) log 3 = 1.585 bits.
(d) H(Y ) − H(Y |X) = 0.251 bits.
(e) I(X; Y ) = H(Y ) − H(Y |X) = 0.251 bits.
(f) See Figure 2.1.
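These values are easy to verify mechanically; the Python sketch below (an added check) recomputes (a)–(e) from the joint table.

    from math import log2

    p = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 0.0, (1, 1): 1/3}   # joint pmf from the table

    def H(probs):
        return -sum(q * log2(q) for q in probs if q > 0)

    pX = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}
    pY = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}

    H_XY = H(p.values())
    H_X, H_Y = H(pX.values()), H(pY.values())
    H_X_given_Y = H_XY - H_Y            # chain rule
    H_Y_given_X = H_XY - H_X
    I_XY = H_X + H_Y - H_XY

    print(H_X, H_Y)                     # 0.918, 0.918
    print(H_X_given_Y, H_Y_given_X)     # 0.667, 0.667
    print(H_XY, I_XY)                   # 1.585, 0.251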

13. Inequality. Show ln x ≥ 1 − 1/x for x > 0.

    Solution: Inequality. Using the Remainder form of the Taylor expansion of ln(x)
    about x = 1 , we have for some c between 1 and x

        ln(x) = ln(1) + (1/t)|_{t=1} (x − 1) + (−1/t^2)|_{t=c} (x − 1)^2/2 ≤ x − 1

Figure 2.1: Venn diagram to illustrate the relationships of entropy and relative entropy
[Figure: two overlapping circles, H(X) and H(Y), with the regions labeled H(X|Y), I(X;Y) and H(Y|X).]

since the second term is always negative. Hence letting y = 1/x , we obtain

    − ln y ≤ 1/y − 1

or

    ln y ≥ 1 − 1/y

with equality iff y = 1 .

14. Entropy of a sum. Let X and Y be random variables that take on values x 1 , x2 , . . . , xr
and y1 , y2 , . . . , ys , respectively. Let Z = X + Y.

(a) Show that H(Z|X) = H(Y |X). Argue that if X, Y are independent, then H(Y ) ≤
H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables
adds uncertainty.
(b) Give an example of (necessarily dependent) random variables in which H(X) >
H(Z) and H(Y ) > H(Z).
(c) Under what conditions does H(Z) = H(X) + H(Y ) ?

Solution: Entropy of a sum.

(a) Z = X + Y . Hence p(Z = z|X = x) = p(Y = z − x|X = x) .


    H(Z|X) = Σ_x p(x) H(Z|X = x)
           = − Σ_x p(x) Σ_z p(Z = z|X = x) log p(Z = z|X = x)
           = − Σ_x p(x) Σ_y p(Y = z − x|X = x) log p(Y = z − x|X = x)
           = Σ_x p(x) H(Y |X = x)
           = H(Y |X).

If X and Y are independent, then H(Y |X) = H(Y ) . Since I(X; Z) ≥ 0 ,


we have H(Z) ≥ H(Z|X) = H(Y |X) = H(Y ) . Similarly we can show that
H(Z) ≥ H(X) .
(b) Consider the following joint distribution for X and Y . Let

        X = −Y = 1 with probability 1/2
                 0 with probability 1/2

Then H(X) = H(Y ) = 1 , but Z = 0 with prob. 1 and hence H(Z) = 0 .


(c) We have
H(Z) ≤ H(X, Y ) ≤ H(X) + H(Y )

because Z is a function of (X, Y ) and H(X, Y ) = H(X) + H(Y |X) ≤ H(X) +


H(Y ) . We have equality iff (X, Y ) is a function of Z and H(Y ) = H(Y |X) , i.e.,
X and Y are independent.

15. Data processing. Let X1 → X2 → X3 → · · · → Xn form a Markov chain in this


order; i.e., let
p(x1 , x2 , . . . , xn ) = p(x1 )p(x2 |x1 ) · · · p(xn |xn−1 ).

Reduce I(X1 ; X2 , . . . , Xn ) to its simplest form.


Solution: Data Processing. By the chain rule for mutual information,

I(X1 ; X2 , . . . , Xn ) = I(X1 ; X2 ) + I(X1 ; X3 |X2 ) + · · · + I(X1 ; Xn |X2 , . . . , Xn−1 ). (2.20)

By the Markov property, the past and the future are conditionally independent given
the present and hence all terms except the first are zero. Therefore

I(X1 ; X2 , . . . , Xn ) = I(X1 ; X2 ). (2.21)

16. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks
down to k < n states, and then fans back to m > k states. Thus X 1 → X2 → X3 ,
i.e., p(x1 , x2 , x3 ) = p(x1 )p(x2 |x1 )p(x3 |x2 ) , for all x1 ∈ {1, 2, . . . , n} , x2 ∈ {1, 2, . . . , k} ,
x3 ∈ {1, 2, . . . , m} .

(a) Show that the dependence of X1 and X3 is limited by the bottleneck by proving
that I(X1 ; X3 ) ≤ log k.
(b) Evaluate I(X1 ; X3 ) for k = 1 , and conclude that no dependence can survive such
a bottleneck.

Solution:
Bottleneck.

(a) From the data processing inequality, and the fact that entropy is maximum for a
uniform distribution, we get

I(X1 ; X3 ) ≤ I(X1 ; X2 )
= H(X2 ) − H(X2 | X1 )
≤ H(X2 )
≤ log k.

Thus, the dependence between X1 and X3 is limited by the size of the bottleneck.
That is I(X1 ; X3 ) ≤ log k .
(b) For k = 1 , I(X1 ; X3 ) ≤ log 1 = 0 and since I(X1 ; X3 ) ≥ 0 , I(X1 ; X3 ) = 0 .
Thus, for k = 1 , X1 and X3 are independent.

17. Pure randomness and bent coins. Let X 1 , X2 , . . . , Xn denote the outcomes of
independent flips of a bent coin. Thus Pr {X i = 1} = p, Pr {Xi = 0} = 1 − p ,
where p is unknown. We wish to obtain a sequence Z 1 , Z2 , . . . , ZK of fair coin flips
from X1 , X2 , . . . , Xn . Toward this end let f : X^n → {0, 1}^* (where {0, 1}^* =
{Λ, 0, 1, 00, 01, . . .} is the set of all finite length binary sequences) be a mapping
f (X1 , X2 , . . . , Xn ) = (Z1 , Z2 , . . . , ZK ) , where Zi ∼ Bernoulli(1/2) , and K may depend
on (X1 , . . . , Xn ) . In order that the sequence Z1 , Z2 , . . . appear to be fair coin flips, the
map f from bent coin flips to fair flips must have the property that all 2^k sequences
(Z1 , Z2 , . . . , Zk ) of a given length k have equal probability (possibly 0), for k = 1, 2, . . . .
For example, for n = 2 , the map f (01) = 0 , f (10) = 1 , f (00) = f (11) = Λ (the null
string), has the property that Pr{Z1 = 1|K = 1} = Pr{Z1 = 0|K = 1} = 1/2 .
Give reasons for the following inequalities:

(a)
nH(p) = H(X1 , . . . , Xn )
(b)
≥ H(Z1 , Z2 , . . . , ZK , K)
(c)
= H(K) + H(Z1 , . . . , ZK |K)
(d)
= H(K) + E(K)
(e)
≥ EK.

Thus no more than nH(p) fair coin tosses can be derived from (X 1 , . . . , Xn ) , on the
average. Exhibit a good map f on sequences of length 4.
Solution: Pure randomness and bent coins.

(a)
nH(p) = H(X1 , . . . , Xn )
(b)
≥ H(Z1 , Z2 , . . . , ZK )

(c)
= H(Z1 , Z2 , . . . , ZK , K)
(d)
= H(K) + H(Z1 , . . . , ZK |K)
(e)
= H(K) + E(K)
(f )
≥ EK .

(a) Since X1 , X2 , . . . , Xn are i.i.d. with probability of Xi = 1 being p , the entropy


H(X1 , X2 , . . . , Xn ) is nH(p) .
(b) Z1 , . . . , ZK is a function of X1 , X2 , . . . , Xn , and since the entropy of a function of a
random variable is less than the entropy of the random variable, H(Z 1 , . . . , ZK ) ≤
H(X1 , X2 , . . . , Xn ) .
(c) K is a function of Z1 , Z2 , . . . , ZK , so its conditional entropy given Z1 , Z2 , . . . , ZK
is 0. Hence H(Z1 , Z2 , . . . , ZK , K) = H(Z1 , . . . , ZK ) + H(K|Z1 , Z2 , . . . , ZK ) =
H(Z1 , Z2 , . . . , ZK ).
(d) Follows from the chain rule for entropy.
(e) By assumption, Z1 , Z2 , . . . , ZK are pure random bits (given K ), with entropy 1
bit per symbol. Hence
        H(Z1 , Z2 , . . . , ZK |K) = Σ_k p(K = k) H(Z1 , Z2 , . . . , Zk |K = k)     (2.22)
                                   = Σ_k p(k) k                                      (2.23)
                                   = EK.                                             (2.24)

(f) Follows from the non-negativity of discrete entropy.


(g) Since we do not know p , the only way to generate pure random bits is to use
the fact that all sequences with the same number of ones are equally likely. For
example, the sequences 0001,0010,0100 and 1000 are equally likely and can be used
to generate 2 pure random bits. An example of a mapping to generate random
bits is

        0000 → Λ
        0001 → 00   0010 → 01   0100 → 10   1000 → 11
        0011 → 00   0110 → 01   1100 → 10   1001 → 11
        1010 → 0    0101 → 1                                            (2.25)
        1110 → 11   1101 → 10   1011 → 01   0111 → 00
        1111 → Λ
The resulting expected number of bits is

        EK = 4pq^3 × 2 + 4p^2 q^2 × 2 + 2p^2 q^2 × 1 + 4p^3 q × 2       (2.26)
           = 8pq^3 + 10p^2 q^2 + 8p^3 q.                                (2.27)

For example, for p ≈ 1/2 , the expected number of pure random bits is close to 1.625.
This is substantially less than the 4 pure random bits that could be generated if
p were exactly 1/2 .
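As an added check of (2.26)–(2.27), the Python sketch below evaluates the expected output length of the length-4 mapping for any p and also verifies that, conditioned on the output length, all output strings are equally likely.

    from itertools import product

    MAP = {
        '0001': '00', '0010': '01', '0100': '10', '1000': '11',
        '0011': '00', '0110': '01', '1100': '10', '1001': '11',
        '1010': '0',  '0101': '1',
        '1110': '11', '1101': '10', '1011': '01', '0111': '00',
        '0000': '',   '1111': '',
    }

    def check(p):
        q = 1.0 - p
        prob = {s: p ** s.count('1') * q ** s.count('0') for s in MAP}
        EK = sum(prob[s] * len(MAP[s]) for s in MAP)
        formula = 8 * p * q**3 + 10 * p**2 * q**2 + 8 * p**3 * q    # (2.27)
        assert abs(EK - formula) < 1e-12
        # all output strings of a given length are equally likely
        for k in (1, 2):
            totals = [sum(prob[s] for s in MAP if MAP[s] == z)
                      for z in (''.join(b) for b in product('01', repeat=k))]
            assert max(totals) - min(totals) < 1e-12
        return EK

    print(check(0.5))   # 1.625
    print(check(0.3))   # works for any (unknown) p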
We will now analyze the efficiency of this scheme of generating random bits for long
sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm
that we will use is the obvious extension of the above method of generating pure
bits using the fact that all sequences with the same number of ones are equally
likely.
Consider all sequences with k ones. There are (n choose k) such sequences, which are
all equally likely. If (n choose k) were a power of 2, then we could generate log (n choose k) pure
random bits from such a set. However, in the general case, (n choose k) is not a power of
2 and the best we can do is to divide the set of (n choose k) elements into subsets of sizes
which are powers of 2. The largest set would have a size 2^⌊log (n choose k)⌋ and could be
used to generate ⌊log (n choose k)⌋ random bits. We could divide the remaining elements
into the largest set which is a power of 2, etc. The worst case would occur when
(n choose k) = 2^{l+1} − 1 , in which case the subsets would be of sizes 2^l , 2^{l−1} , 2^{l−2} , . . . , 1 .
Instead of analyzing the scheme exactly, we will just find a lower bound on the number
of random bits generated from a set of size (n choose k) . Let l = ⌊log (n choose k)⌋ . Then at least
half of the elements belong to a set of size 2^l and would generate l random bits,
at least a quarter belong to a set of size 2^{l−1} and generate l − 1 random bits, etc. On
the average, the number of bits generated is

    E[K | k 1’s in sequence] ≥ (1/2) l + (1/4)(l − 1) + · · · + (1/2^l)·1                    (2.28)
                             = l − (1/4)(1 + 2/2 + 3/4 + 4/8 + · · · + (l−1)/2^{l−2})        (2.29)
                             ≥ l − 1,                                                        (2.30)

since the infinite series sums to 1.

Hence the fact that (n choose k) is not a power of 2 will cost at most 1 bit on the average
in the number of random bits that are produced.
Hence, the expected number of pure random bits produced by this algorithm is

    EK ≥ Σ_{k=0}^n (n choose k) p^k q^{n−k} ( ⌊log (n choose k)⌋ − 1 )                       (2.31)
       ≥ Σ_{k=0}^n (n choose k) p^k q^{n−k} ( log (n choose k) − 2 )                         (2.32)
       = Σ_{k=0}^n (n choose k) p^k q^{n−k} log (n choose k) − 2                             (2.33)
       ≥ Σ_{n(p−ε)≤k≤n(p+ε)} (n choose k) p^k q^{n−k} log (n choose k) − 2.                  (2.34)

Now for sufficiently large n , the probability that the number of 1’s in the sequence
is close to np is near 1 (by the weak law of large numbers). For such sequences,
k/n is close to p and hence there exists a δ such that

    (n choose k) ≥ 2^{n(H(k/n)−δ)} ≥ 2^{n(H(p)−2δ)}                                          (2.35)

using Stirling’s approximation for the binomial coefficients and the continuity of
the entropy function. If we assume that n is large enough so that the probability
that n(p − ε) ≤ k ≤ n(p + ε) is greater than 1 − ε , then we see that EK ≥
(1 − ε)n(H(p) − 2δ) − 2 , which is very good since nH(p) is an upper bound on the
number of pure random bits that can be produced from the bent coin sequence.

18. World Series. The World Series is a seven-game series that terminates as soon as
either team wins four games. Let X be the random variable that represents the outcome
of a World Series between teams A and B; possible values of X are AAAA, BABABAB,
and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7.
Assuming that A and B are equally matched and that the games are independent,
calculate H(X) , H(Y ) , H(Y |X) , and H(X|Y ) .
Solution:
World Series. Two teams play until one of them has won 4 games.
There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)^4 .
There are 8 = 2 (4 choose 3) World Series with 5 games. Each happens with probability (1/2)^5 .
There are 20 = 2 (5 choose 3) World Series with 6 games. Each happens with probability (1/2)^6 .
There are 40 = 2 (6 choose 3) World Series with 7 games. Each happens with probability (1/2)^7 .

The probability of a 4 game series ( Y = 4 ) is 2(1/2)^4 = 1/8 .
The probability of a 5 game series ( Y = 5 ) is 8(1/2)^5 = 1/4 .
The probability of a 6 game series ( Y = 6 ) is 20(1/2)^6 = 5/16 .
The probability of a 7 game series ( Y = 7 ) is 40(1/2)^7 = 5/16 .

    H(X) = Σ p(x) log(1/p(x))
         = 2(1/16) log 16 + 8(1/32) log 32 + 20(1/64) log 64 + 40(1/128) log 128
         = 5.8125

    H(Y ) = Σ p(y) log(1/p(y))
          = (1/8) log 8 + (1/4) log 4 + (5/16) log(16/5) + (5/16) log(16/5)
          = 1.924

Y is a deterministic function of X, so if you know X there is no randomness in Y. Or,


H(Y |X) = 0 .
Since H(X) + H(Y |X) = H(X, Y ) = H(Y ) + H(X|Y ) , it is easy to determine
H(X|Y ) = H(X) + H(Y |X) − H(Y ) = 3.889
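The counting above can be confirmed by brute force; the following Python sketch (added) enumerates all 70 possible series and recomputes H(X), H(Y) and H(X|Y).

    from math import log2
    from itertools import product

    # Enumerate all best-of-seven outcomes with independent fair games.
    outcomes = {}                          # series string -> probability
    for games in product('AB', repeat=7):
        a = b = 0
        series = ''
        for g in games:
            series += g
            a += g == 'A'
            b += g == 'B'
            if a == 4 or b == 4:
                break
        outcomes[series] = outcomes.get(series, 0.0) + 0.5 ** 7
    # A series of length L is reached by 2^(7-L) continuations, so each distinct
    # series ends up with probability (1/2)^L, as in the solution above.

    def H(ps):
        return -sum(p * log2(p) for p in ps if p > 0)

    H_X = H(outcomes.values())
    pY = {}
    for s, p in outcomes.items():
        pY[len(s)] = pY.get(len(s), 0.0) + p
    H_Y = H(pY.values())
    H_X_given_Y = H_X - H_Y                # since Y is a function of X
    print(len(outcomes), H_X, H_Y, H_X_given_Y)   # 70 series, 5.8125, 1.924, 3.889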

19. Infinite entropy. This problem shows that the entropy of a discrete random variable
    can be infinite. Let A = Σ_{n=2}^∞ (n log^2 n)^{−1} . (It is easy to show that A is finite by
    bounding the infinite sum by the integral of (x log^2 x)^{−1} .) Show that the integer-
    valued random variable X defined by Pr(X = n) = (An log^2 n)^{−1} for n = 2, 3, . . . ,
    has H(X) = +∞ .
Solution: Infinite entropy. By definition, p_n = Pr(X = n) = 1/(An log^2 n) for n ≥ 2 .
Therefore

    H(X) = − Σ_{n=2}^∞ p(n) log p(n)
         = − Σ_{n=2}^∞ (1/(An log^2 n)) log(1/(An log^2 n))
         = Σ_{n=2}^∞ log(An log^2 n)/(An log^2 n)
         = Σ_{n=2}^∞ (log A + log n + 2 log log n)/(An log^2 n)
         = log A + Σ_{n=2}^∞ 1/(An log n) + Σ_{n=2}^∞ (2 log log n)/(An log^2 n).

The first term is finite. For base 2 logarithms, all the elements in the sum in the last
term are nonnegative. (For any other base, the terms of the last sum eventually all
become positive.) So all we have to do is bound the middle sum, which we do by
comparing with an integral.
    Σ_{n=2}^∞ 1/(An log n) > ∫_2^∞ 1/(Ax log x) dx = K ln ln x |_2^∞ = +∞ .

We conclude that H(X) = +∞ .
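A numerical illustration (added here; the truncation points are arbitrary) shows the behaviour concretely: the normalizing constant converges, while the entropy of the truncated, renormalized distribution keeps growing with N (roughly like log log N).

    from math import log2

    # Truncate the support at N: A_N converges, but the entropy of the truncated
    # distribution keeps increasing, illustrating H(X) = +infinity.
    for N in (10**3, 10**4, 10**5, 10**6):
        A = sum(1.0 / (n * log2(n) ** 2) for n in range(2, N))
        H = 0.0
        for n in range(2, N):
            p = 1.0 / (A * n * log2(n) ** 2)
            H -= p * log2(p)
        print(N, round(A, 4), round(H, 4))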

20. Run length coding. Let X1 , X2 , . . . , Xn be (possibly dependent) binary random


variables. Suppose one calculates the run lengths R = (R 1 , R2 , . . .) of this sequence
(in order as they occur). For example, the sequence X = 0001100100 yields run
lengths R = (3, 2, 2, 1, 2) . Compare H(X1 , X2 , . . . , Xn ) , H(R) and H(Xn , R) . Show
all equalities and inequalities, and bound all the differences.
Solution: Run length coding. Since the run lengths are a function of X 1 , X2 , . . . , Xn ,
H(R) ≤ H(X) . Any Xi together with the run lengths determine the entire sequence

X1 , X2 , . . . , Xn . Hence

H(X1 , X2 , . . . , Xn ) = H(Xi , R) (2.36)


= H(R) + H(Xi |R) (2.37)
≤ H(R) + H(Xi ) (2.38)
≤ H(R) + 1. (2.39)

21. Markov’s inequality for probabilities. Let p(x) be a probability mass function.
Prove, for all d ≥ 0 ,
    Pr{p(X) ≤ d} log(1/d) ≤ H(X).                                    (2.40)

Solution: Markov inequality applied to entropy.

    P (p(X) < d) log(1/d) = Σ_{x: p(x)<d} p(x) log(1/d)              (2.41)
                          ≤ Σ_{x: p(x)<d} p(x) log(1/p(x))           (2.42)
                          ≤ Σ_x p(x) log(1/p(x))                     (2.43)
                          = H(X)                                     (2.44)

22. Logical order of ideas. Ideas have been developed in order of need, and then gener-
alized if necessary. Reorder the following ideas, strongest first, implications following:

(a) Chain rule for I(X1 , . . . , Xn ; Y ) , chain rule for D(p(x1 , . . . , xn )||q(x1 , x2 , . . . , xn )) ,
and chain rule for H(X1 , X2 , . . . , Xn ) .
(b) D(f ||g) ≥ 0 , Jensen’s inequality, I(X; Y ) ≥ 0 .

Solution: Logical ordering of ideas.

(a) The following orderings are subjective. Since I(X; Y ) = D(p(x, y)||p(x)p(y)) is a
special case of relative entropy, it is possible to derive the chain rule for I from
the chain rule for D .
Since H(X) = I(X; X) , it is possible to derive the chain rule for H from the
chain rule for I .
It is also possible to derive the chain rule for I from the chain rule for H as was
done in the notes.
(b) In class, Jensen’s inequality was used to prove the non-negativity of D . The
inequality I(X; Y ) ≥ 0 followed as a special case of the non-negativity of D .

23. Conditional mutual information. Consider a sequence of n binary random vari-


ables X1 , X2 , . . . , Xn . Each sequence with an even number of 1’s has probability
2^{−(n−1)} and each sequence with an odd number of 1’s has probability 0. Find the
mutual informations

I(X1 ; X2 ), I(X2 ; X3 |X1 ), . . . , I(Xn−1 ; Xn |X1 , . . . , Xn−2 ).

Solution: Conditional mutual information.


Consider a sequence of n binary random variables X1 , X2 , . . . , Xn . Each sequence of
length n with an even number of 1’s is equally likely and has probability 2^{−(n−1)} .
Any n − 1 or fewer of these are independent. Thus, for k ≤ n − 1 ,

I(Xk−1 ; Xk |X1 , X2 , . . . , Xk−2 ) = 0.

However, given X1 , X2 , . . . , Xn−2 , we know that once we know either Xn−1 or Xn we


know the other.

I(Xn−1 ; Xn |X1 , X2 , . . . , Xn−2 ) = H(Xn |X1 , X2 , . . . , Xn−2 ) − H(Xn |X1 , X2 , . . . , Xn−1 )


= 1 − 0 = 1 bit.

24. Average entropy. Let H(p) = −p log 2 p − (1 − p) log2 (1 − p) be the binary entropy
function.

(a) Evaluate H(1/4) using the fact that log 2 3 ≈ 1.584 . Hint: You may wish to
consider an experiment with four equally likely outcomes, one of which is more
interesting than the others.
(b) Calculate the average entropy H(p) when the probability p is chosen uniformly
in the range 0 ≤ p ≤ 1 .
(c) (Optional) Calculate the average entropy H(p 1 , p2 , p3 ) where (p1 , p2 , p3 ) is a uni-
formly distributed probability vector. Generalize to dimension n .

Solution: Average Entropy.

(a) We can generate two bits of information by picking one of four equally likely
alternatives. This selection can be made in two steps. First we decide whether the
first outcome occurs. Since this has probability 1/4 , the information generated
is H(1/4) . If not the first outcome, then we select one of the three remaining
outcomes; with probability 3/4 , this produces log 2 3 bits of information. Thus

H(1/4) + (3/4) log 2 3 = 2

and so H(1/4) = 2 − (3/4) log 2 3 = 2 − (.75)(1.585) = 0.811 bits.



(b) If p is chosen uniformly in the range 0 ≤ p ≤ 1 , then the average entropy (in
nats) is
        − ∫_0^1 [ p ln p + (1 − p) ln(1 − p) ] dp = −2 ∫_0^1 x ln x dx = −2 [ (x^2/2) ln x − x^2/4 ]_0^1 = 1/2 .

    Therefore the average entropy is (1/2) log2 e = 1/(2 ln 2) = .721 bits.


(c) Choosing a uniformly distributed probability vector (p 1 , p2 , p3 ) is equivalent to
choosing a point (p1 , p2 ) uniformly from the triangle 0 ≤ p 1 ≤ 1 , p1 ≤ p2 ≤ 1 .
The probability density function has the constant value 2 because the area of the
triangle is 1/2. So the average entropy H(p 1 , p2 , p3 ) is
2 12 1
−2 p1 ln p1 + p2 ln p2 + (1 − p1 − p2 ) ln(1 − p1 − p2 )dp2 dp1 .
0 p1

After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.
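A Monte Carlo check of (b) and (c) (added; the seed and sample size are arbitrary) reproduces 0.721 and 1.202 bits to within sampling error.

    import random
    from math import log2

    def H(ps):
        return -sum(p * log2(p) for p in ps if p > 0)

    random.seed(0)                      # arbitrary seed
    N = 200_000

    # (b) p uniform on [0, 1]
    avg2 = sum(H((p, 1 - p)) for p in (random.random() for _ in range(N))) / N

    # (c) (p1, p2, p3) uniform on the simplex: two sorted uniform cut points
    avg3 = 0.0
    for _ in range(N):
        u, v = sorted((random.random(), random.random()))
        avg3 += H((u, v - u, 1 - v))
    avg3 /= N

    print(round(avg2, 3), round(avg3, 3))   # close to 0.721 and 1.202 bits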
25. Venn diagrams. There isn’t really a notion of mutual information common to three
random variables. Here is one attempt at a definition: Using Venn diagrams, we can
see that the mutual information common to three random variables X , Y and Z can
be defined by
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) .
This quantity is symmetric in X , Y and Z , despite the preceding asymmetric defi-
nition. Unfortunately, I(X; Y ; Z) is not necessarily nonnegative. Find X , Y and Z
such that I(X; Y ; Z) < 0 , and prove the following two identities:
(a) I(X; Y ; Z) = H(X, Y, Z) − H(X) − H(Y ) − H(Z) + I(X; Y ) + I(Y ; Z) + I(Z; X)
(b) I(X; Y ; Z) = H(X, Y, Z)− H(X, Y )− H(Y, Z)− H(Z, X)+ H(X)+ H(Y )+ H(Z)
The first identity can be understood using the Venn diagram analogy for entropy and
mutual information. The second identity follows easily from the first.
Solution: Venn Diagrams. To show the first identity,
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) by definition
= I(X; Y ) − (I(X; Y, Z) − I(X; Z)) by chain rule
= I(X; Y ) + I(X; Z) − I(X; Y, Z)
= I(X; Y ) + I(X; Z) − (H(X) + H(Y, Z) − H(X, Y, Z))
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − H(Y, Z)
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − (H(Y ) + H(Z) − I(Y ; Z))
= I(X; Y ) + I(X; Z) + I(Y ; Z) + H(X, Y, Z) − H(X) − H(Y ) − H(Z).
To show the second identity, simply substitute for I(X; Y ) , I(X; Z) , and I(Y ; Z)
using equations like
I(X; Y ) = H(X) + H(Y ) − H(X, Y ) .
These two identities show that I(X; Y ; Z) is a symmetric (but not necessarily nonneg-
ative) function of three random variables.

26. Another proof of non-negativity of relative entropy. In view of the fundamental


nature of the result D(p||q) ≥ 0 , we will give another proof.

(a) Show that ln x ≤ x − 1 for 0 < x < ∞ .


(b) Justify the following steps:

    −D(p||q) = Σ_x p(x) ln(q(x)/p(x))                    (2.45)
             ≤ Σ_x p(x) (q(x)/p(x) − 1)                  (2.46)
             ≤ 0                                         (2.47)

(c) What are the conditions for equality?

Solution: Another proof of non-negativity of relative entropy. In view of the funda-


mental nature of the result D(p||q) ≥ 0 , we will give another proof.

(a) Show that ln x ≤ x − 1 for 0 < x < ∞ .


There are many ways to prove this. The easiest is using calculus. Let

f (x) = x − 1 − ln x (2.48)

for 0 < x < ∞ . Then f ′(x) = 1 − 1/x and f ′′(x) = 1/x^2 > 0 , and therefore f (x)
is strictly convex. Therefore a local minimum of the function is also a global
minimum. The function has a local minimum at the point where f ' (x) = 0 , i.e.,
when x = 1 . Therefore f (x) ≥ f (1) , i.e.,

x − 1 − ln x ≥ 1 − 1 − ln 1 = 0 (2.49)

which gives us the desired inequality. Equality occurs only if x = 1 .


(b) We let A be the set of x such that p(x) > 0 .

    −D_e(p||q) = Σ_{x∈A} p(x) ln(q(x)/p(x))              (2.50)
               ≤ Σ_{x∈A} p(x) (q(x)/p(x) − 1)            (2.51)
               = Σ_{x∈A} q(x) − Σ_{x∈A} p(x)             (2.52)
               ≤ 0                                       (2.53)

The first step follows from the definition of D , the second step follows from the
inequality ln t ≤ t − 1 , the third step from expanding the sum, and the last step
from the fact that q(A) ≤ 1 and p(A) = 1 .

(c) What are the conditions for equality?


We have equality in the inequality ln t ≤ t − 1 if and only if t = 1 . Therefore we
have equality in step 2 of the chain iff q(x)/p(x) = 1 for all x ∈ A . This implies
that p(x) = q(x) for all x , and we have equality in the last step as well. Thus
the condition for equality is that p(x) = q(x) for all x .

27. Grouping rule for entropy: Let p = (p1 , p2 , . . . , pm ) be a probability distribution
    on m elements, i.e., pi ≥ 0 and Σ_{i=1}^m pi = 1 . Define a new distribution q on m − 1
    elements as q1 = p1 , q2 = p2 , . . . , qm−2 = pm−2 , and qm−1 = pm−1 + pm , i.e., the
    distribution q is the same as p on {1, 2, . . . , m − 2} , and the probability of the last
    element in q is the sum of the last two probabilities of p . Show that

        H(p) = H(q) + (pm−1 + pm ) H( pm−1/(pm−1 + pm) , pm/(pm−1 + pm) ).      (2.54)

Solution:

    H(p) = − Σ_{i=1}^m pi log pi                                                        (2.55)
         = − Σ_{i=1}^{m−2} pi log pi − pm−1 log pm−1 − pm log pm                        (2.56)
         = − Σ_{i=1}^{m−2} pi log pi − pm−1 log( pm−1/(pm−1 + pm) )                     (2.57)
             − pm log( pm/(pm−1 + pm) ) − (pm−1 + pm ) log(pm−1 + pm )                  (2.58)
         = H(q) − pm−1 log( pm−1/(pm−1 + pm) ) − pm log( pm/(pm−1 + pm) )               (2.59)
         = H(q) − (pm−1 + pm ) [ (pm−1/(pm−1 + pm)) log( pm−1/(pm−1 + pm) )
             + (pm/(pm−1 + pm)) log( pm/(pm−1 + pm) ) ]                                 (2.60)
         = H(q) + (pm−1 + pm ) H2( pm−1/(pm−1 + pm) , pm/(pm−1 + pm) ),                 (2.61)

where H2 (a, b) = −a log a − b log b .

28. Mixing increases entropy. Show that the entropy of the probability distribution,
    (p1 , . . . , pi , . . . , pj , . . . , pm ) , is less than the entropy of the distribution
    (p1 , . . . , (pi + pj)/2 , . . . , (pi + pj)/2 , . . . , pm ) . Show that in general any transfer of probability that
    makes the distribution more uniform increases the entropy.
Solution:
Mixing increases entropy.
This problem depends on the convexity of the log function. Let

    P1 = (p1 , . . . , pi , . . . , pj , . . . , pm )
    P2 = (p1 , . . . , (pi + pj)/2 , . . . , (pj + pi)/2 , . . . , pm )

Then, by the log sum inequality,

    H(P2 ) − H(P1 ) = −2 ((pi + pj)/2) log((pi + pj)/2) + pi log pi + pj log pj
                    = −(pi + pj ) log((pi + pj)/2) + pi log pi + pj log pj
                    ≥ 0.

Thus,
H(P2 ) ≥ H(P1 ).

29. Inequalities. Let X , Y and Z be joint random variables. Prove the following
inequalities and find conditions for equality.

(a) H(X, Y |Z) ≥ H(X|Z) .


(b) I(X, Y ; Z) ≥ I(X; Z) .
(c) H(X, Y, Z) − H(X, Y ) ≤ H(X, Z) − H(X) .
(d) I(X; Z|Y ) ≥ I(Z; Y |X) − I(Z; Y ) + I(X; Z) .

Solution: Inequalities.

(a) Using the chain rule for conditional entropy,

H(X, Y |Z) = H(X|Z) + H(Y |X, Z) ≥ H(X|Z),

with equality iff H(Y |X, Z) = 0 , that is, when Y is a function of X and Z .
(b) Using the chain rule for mutual information,

I(X, Y ; Z) = I(X; Z) + I(Y ; Z|X) ≥ I(X; Z),

with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
pendent given X .
(c) Using first the chain rule for entropy and then the definition of conditional mutual
information,

H(X, Y, Z) − H(X, Y ) = H(Z|X, Y ) = H(Z|X) − I(Y ; Z|X)


≤ H(Z|X) = H(X, Z) − H(X) ,

with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
pendent given X .
(d) Using the chain rule for mutual information,

I(X; Z|Y ) + I(Z; Y ) = I(X, Y ; Z) = I(Z; Y |X) + I(X; Z) ,

and therefore
I(X; Z|Y ) = I(Z; Y |X) − I(Z; Y ) + I(X; Z) .
We see that this inequality is actually an equality in all cases.

30. Maximum entropy. Find the probability mass function p(x) that maximizes the
entropy H(X) of a non-negative integer-valued random variable X subject to the
constraint

    EX = Σ_{n=0}^∞ n p(n) = A

for a fixed value A > 0 . Evaluate this maximum H(X) .


Solution: Maximum entropy
Recall that,

    − Σ_{i=0}^∞ pi log pi ≤ − Σ_{i=0}^∞ pi log qi .

Let qi = α(β)^i . Then we have that,

    − Σ_{i=0}^∞ pi log pi ≤ − Σ_{i=0}^∞ pi log qi
                          = − ( log(α) Σ_{i=0}^∞ pi + log(β) Σ_{i=0}^∞ i pi )
                          = − log α − A log β

Notice that the final right hand side expression is independent of {pi } , and that the
inequality,

    − Σ_{i=0}^∞ pi log pi ≤ − log α − A log β

holds for all α, β such that,

    Σ_{i=0}^∞ α β^i = 1 = α · 1/(1 − β) .

The constraint on the expected value also requires that,

    Σ_{i=0}^∞ i α β^i = A = α · β/(1 − β)^2 .

Combining the two constraints we have,

    α β/(1 − β)^2 = ( α/(1 − β) ) · ( β/(1 − β) )
                  = β/(1 − β)
                  = A,

which implies that,

    β = A/(A + 1)
    α = 1/(A + 1) .

So the entropy maximizing distribution is,

    pi = (1/(A + 1)) (A/(A + 1))^i .

Plugging these values into the expression for the maximum entropy,

    − log α − A log β = (A + 1) log(A + 1) − A log A.

The general form of the distribution,

    pi = α β^i

can be obtained either by guessing or by Lagrange multipliers where,

    F (pi , λ1 , λ2 ) = − Σ_{i=0}^∞ pi log pi + λ1 ( Σ_{i=0}^∞ pi − 1) + λ2 ( Σ_{i=0}^∞ i pi − A)

is the function whose gradient we set to 0.


To complete the argument with Lagrange multipliers, it is necessary to show that the
local maximum is the global maximum. One possible argument is based on the fact
that −H(p) is convex, so it has only one local minimum and no local maxima, and therefore
the Lagrange multiplier method actually gives the global maximum for H(p) .
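To make the result concrete (an added check; A = 3 is an arbitrary choice), the sketch below verifies that pi = (1/(A+1))(A/(A+1))^i has mean A and entropy (A+1) log(A+1) − A log A, and that another mean-A distribution (Poisson) has strictly smaller entropy.

    from math import log2, exp

    A = 3.0                                   # illustrative mean; any A > 0 works
    alpha, beta = 1 / (A + 1), A / (A + 1)
    N = 2000                                  # truncation point; the tails are negligible

    p = [alpha * beta ** i for i in range(N)]
    mean = sum(i * pi for i, pi in enumerate(p))
    H_p = -sum(pi * log2(pi) for pi in p if pi > 0)
    closed_form = (A + 1) * log2(A + 1) - A * log2(A)
    print(mean, H_p, closed_form)             # mean = 3.0; both entropies = 3.245 bits

    # Another distribution with the same mean, e.g. Poisson(A), has smaller entropy.
    q, qi = [], exp(-A)
    for i in range(N):
        q.append(qi)
        qi *= A / (i + 1)
    H_q = -sum(v * log2(v) for v in q if v > 0)
    print(H_q)                                # strictly smaller than H_p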
31. Conditional entropy. Under what conditions does H(X | g(Y )) = H(X | Y ) ?
Solution: (Conditional Entropy). If H(X|g(Y )) = H(X|Y ) , then H(X)−H(X|g(Y )) =
H(X) − H(X|Y ) , i.e., I(X; g(Y )) = I(X; Y ) . This is the condition for equality in
the data processing inequality. From the derivation of the inequality, we have equal-
ity iff X → g(Y ) → Y forms a Markov chain. Hence H(X|g(Y )) = H(X|Y ) iff
X → g(Y ) → Y . This condition includes many special cases, such as g being one-
to-one, and X and Y being independent. However, these two special cases do not
exhaust all the possibilities.
32. Fano. We are given the following joint distribution on (X, Y )
              Y
      X       a       b       c
      1      1/6     1/12    1/12
      2      1/12    1/6     1/12
      3      1/12    1/12    1/6

Let X̂(Y ) be an estimator for X (based on Y ) and let Pe = Pr{X̂(Y ) ≠ X} .



(a) Find the minimum probability of error estimator X̂(Y ) and the associated Pe .
(b) Evaluate Fano’s inequality for this problem and compare.

Solution:

(a) From inspection we see that

        X̂(y) = 1  if y = a
                2  if y = b
                3  if y = c

    Hence the associated Pe is the sum of P (1, b), P (1, c), P (2, a), P (2, c), P (3, a)
    and P (3, b). Therefore, Pe = 1/2.
(b) From Fano’s inequality we know

        Pe ≥ (H(X|Y ) − 1)/log |X| .
Here,

    H(X|Y ) = H(X|Y = a) Pr{y = a} + H(X|Y = b) Pr{y = b} + H(X|Y = c) Pr{y = c}
            = H(1/2, 1/4, 1/4) Pr{y = a} + H(1/2, 1/4, 1/4) Pr{y = b} + H(1/2, 1/4, 1/4) Pr{y = c}
            = H(1/2, 1/4, 1/4) (Pr{y = a} + Pr{y = b} + Pr{y = c})
            = H(1/2, 1/4, 1/4)
            = 1.5 bits.

    Hence

        Pe ≥ (1.5 − 1)/log 3 = .316.

    Hence our estimator X̂(Y ) is not very close to Fano’s bound in this form. If
    X̂ ∈ X , as it does here, we can use the stronger form of Fano’s inequality to get

        Pe ≥ (H(X|Y ) − 1)/log(|X| − 1) ,

    and

        Pe ≥ (1.5 − 1)/log 2 = 1/2 .
Therefore our estimator X̂(Y ) is actually quite good.
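The numbers above are easy to recompute; the sketch below (added) evaluates H(X|Y), the error probability of the MAP estimator, and both forms of the Fano bound.

    from math import log2

    p = {(1, 'a'): 1/6,  (1, 'b'): 1/12, (1, 'c'): 1/12,
         (2, 'a'): 1/12, (2, 'b'): 1/6,  (2, 'c'): 1/12,
         (3, 'a'): 1/12, (3, 'b'): 1/12, (3, 'c'): 1/6}

    def H(ps):
        return -sum(q * log2(q) for q in ps if q > 0)

    pY = {y: sum(v for (x, yy), v in p.items() if yy == y) for y in 'abc'}
    H_X_given_Y = sum(pY[y] * H([p[(x, y)] / pY[y] for x in (1, 2, 3)]) for y in 'abc')

    # MAP estimator: pick the most likely x for each y; Pe is its error probability.
    Pe = 1.0 - sum(max(p[(x, y)] for x in (1, 2, 3)) for y in 'abc')

    weak = (H_X_given_Y - 1) / log2(3)        # Pe >= (H(X|Y) - 1)/log|X|
    strong = (H_X_given_Y - 1) / log2(3 - 1)  # Pe >= (H(X|Y) - 1)/log(|X| - 1)
    print(H_X_given_Y, Pe, weak, strong)      # 1.5, 0.5, 0.315..., 0.5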

33. Fano’s inequality. Let Pr(X = i) = p i , i = 1, 2, . . . , m and let p1 ≥ p2 ≥ p3 ≥


· · · ≥ pm . The minimal probability of error predictor of X is X̂ = 1 , with resulting
probability of error Pe = 1 − p1 . Maximize H(p) subject to the constraint 1 − p 1 = Pe

to find a bound on Pe in terms of H . This is Fano’s inequality in the absence of


conditioning.
Solution: (Fano’s Inequality.) The minimal probability of error predictor when there
is no information is X̂ = 1 , the most probable value of X . The probability of error in
this case is Pe = 1 − p1 . Hence if we fix Pe , we fix p1 . We maximize the entropy of X
for a given Pe to obtain an upper bound on the entropy for a given P e . The entropy,
    H(p) = −p1 log p1 − Σ_{i=2}^m pi log pi                                      (2.62)
         = −p1 log p1 − Pe Σ_{i=2}^m (pi /Pe ) log(pi /Pe ) − Pe log Pe          (2.63)
         = H(Pe ) + Pe H( p2 /Pe , p3 /Pe , . . . , pm /Pe )                     (2.64)
         ≤ H(Pe ) + Pe log(m − 1),                                               (2.65)

since the maximum of H( p2 /Pe , p3 /Pe , . . . , pm /Pe ) is attained by a uniform distribution. Hence
any X that can be predicted with a probability of error P e must satisfy

H(X) ≤ H(Pe ) + Pe log(m − 1), (2.66)

which is the unconditional form of Fano’s inequality. We can weaken this inequality to
obtain an explicit lower bound for Pe ,
    Pe ≥ (H(X) − 1)/log(m − 1) .                                                 (2.67)

34. Entropy of initial conditions. Prove that H(X 0 |Xn ) is non-decreasing with n for
any Markov chain.
Solution: Entropy of initial conditions. For a Markov chain, by the data processing
theorem, we have
I(X0 ; Xn−1 ) ≥ I(X0 ; Xn ). (2.68)
Therefore
H(X0 ) − H(X0 |Xn−1 ) ≥ H(X0 ) − H(X0 |Xn ) (2.69)
or H(X0 |Xn ) increases with n .

35. Relative entropy is not symmetric: Let the random variable X have three possible
outcomes {a, b, c} . Consider two distributions on this random variable
Symbol p(x) q(x)
a 1/2 1/3
b 1/4 1/3
c 1/4 1/3
Calculate H(p) , H(q) , D(p||q) and D(q||p) . Verify that in this case D(p||q) ≠
D(q||p) .

Solution:

    H(p) = (1/2) log 2 + (1/4) log 4 + (1/4) log 4 = 1.5 bits.                               (2.70)
    H(q) = (1/3) log 3 + (1/3) log 3 + (1/3) log 3 = log 3 = 1.58496 bits.                   (2.71)
    D(p||q) = (1/2) log(3/2) + (1/4) log(3/4) + (1/4) log(3/4)
            = log 3 − 1.5 = 1.58496 − 1.5 = 0.08496                                          (2.72)
    D(q||p) = (1/3) log(2/3) + (1/3) log(4/3) + (1/3) log(4/3)
            = 5/3 − log 3 = 1.66666 − 1.58496 = 0.08170                                      (2.73)
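The arithmetic can be checked directly with a short added Python sketch:

    from math import log2

    p = {'a': 1/2, 'b': 1/4, 'c': 1/4}
    q = {'a': 1/3, 'b': 1/3, 'c': 1/3}

    def H(d):
        return -sum(v * log2(v) for v in d.values() if v > 0)

    def D(d1, d2):
        return sum(v * log2(v / d2[k]) for k, v in d1.items() if v > 0)

    print(H(p), H(q))          # 1.5, 1.585
    print(D(p, q), D(q, p))    # 0.08496, 0.08170  -> relative entropy is not symmetric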
36. Symmetric relative entropy: Though, as the previous example shows, D(p||q) ≠
D(q||p) in general, there could be distributions for which equality holds. Give an
example of two distributions p and q on a binary alphabet such that D(p||q) = D(q||p)
(other than the trivial case p = q ).
Solution:
A simple case for D((p, 1 − p)||(q, 1 − q)) = D((q, 1 − q)||(p, 1 − p)) , i.e., for

    p log(p/q) + (1 − p) log((1 − p)/(1 − q)) = q log(q/p) + (1 − q) log((1 − q)/(1 − p))    (2.74)

is when q = 1 − p .

37. Relative entropy: Let X, Y, Z be three random variables with a joint probability
mass function p(x, y, z) . The relative entropy between the joint distribution and the
product of the marginals is
    D(p(x, y, z)||p(x)p(y)p(z)) = E [ log( p(x, y, z)/(p(x)p(y)p(z)) ) ]                     (2.75)
Expand this in terms of entropies. When is this quantity zero?
Solution:
    D(p(x, y, z)||p(x)p(y)p(z)) = E [ log( p(x, y, z)/(p(x)p(y)p(z)) ) ]                     (2.76)
        = E[log p(x, y, z)] − E[log p(x)] − E[log p(y)] − E[log p(z)]                        (2.77)
        = −H(X, Y, Z) + H(X) + H(Y ) + H(Z)                                                  (2.78)

We have D(p(x, y, z)||p(x)p(y)p(z)) = 0 if and only if p(x, y, z) = p(x)p(y)p(z) for all
(x, y, z) , i.e., if X , Y and Z are independent.

38. The value of a question Let X ∼ p(x) , x = 1, 2, . . . , m . We are given a set


S ⊆ {1, 2, . . . , m} . We ask whether X ∈ S and receive the answer
        Y = 1, if X ∈ S
            0, if X ∉ S.

Suppose Pr{X ∈ S} = α . Find the decrease in uncertainty H(X) − H(X|Y ) .



Apparently any set S with a given α is as good as any other.


Solution: The value of a question.

H(X) − H(X|Y ) = I(X; Y )


= H(Y ) − H(Y |X)
= H(α) − H(Y |X)
= H(α)

since H(Y |X) = 0 .

39. Entropy and pairwise independence.


Let X, Y, Z be three binary Bernoulli(1/2) random variables that are pairwise indepen-
dent, that is, I(X; Y ) = I(X; Z) = I(Y ; Z) = 0 .

(a) Under this constraint, what is the minimum value for H(X, Y, Z) ?
(b) Give an example achieving this minimum.

Solution:

(a)

H(X, Y, Z) = H(X, Y ) + H(Z|X, Y ) (2.79)


≥ H(X, Y ) (2.80)
= 2. (2.81)

    So the minimum value for H(X, Y, Z) is at least 2. To show that it is actually
    equal to 2, we show in part (b) that this bound is attainable.
(b) Let X and Y be iid Bernoulli(1/2) and let Z = X ⊕ Y , where ⊕ denotes addition
    mod 2 (xor).

40. Discrete entropies


Let X and Y be two independent integer-valued random variables. Let X be uniformly
distributed over {1, 2, . . . , 8} , and let Pr{Y = k} = 2^{−k} , k = 1, 2, 3, . . .

(a) Find H(X)


(b) Find H(Y )
(c) Find H(X + Y, X − Y ) .

Solution:

(a) For a uniform distribution, H(X) = log m = log 8 = 3 .


(b) For a geometric distribution, H(Y ) = Σ_k k 2^{−k} = 2 . (See solution to problem 2.1.)

(c) Since (X, Y ) → (X +Y, X −Y ) is a one to one transformation, H(X +Y, X −Y ) =


H(X, Y ) = H(X) + H(Y ) = 3 + 2 = 5 .

41. Random questions


One wishes to identify a random object X ∼ p(x) . A question Q ∼ r(q) is asked
at random according to r(q) . This results in a deterministic answer A = A(x, q) ∈
{a1 , a2 , . . .} . Suppose X and Q are independent. Then I(X; Q, A) is the uncertainty
in X removed by the question-answer (Q, A) .

(a) Show I(X; Q, A) = H(A|Q) . Interpret.


(b) Now suppose that two i.i.d. questions Q 1 , Q2 , ∼ r(q) are asked, eliciting answers
A1 and A2 . Show that two questions are less valuable than twice a single question
in the sense that I(X; Q1 , A1 , Q2 , A2 ) ≤ 2I(X; Q1 , A1 ) .

Solution: Random questions.

(a)

I(X; Q, A) = H(Q, A) − H(Q, A|X)


= H(Q) + H(A|Q) − H(Q|X) − H(A|Q, X)
= H(Q) + H(A|Q) − H(Q)
= H(A|Q)

The interpretation is as follows. The uncertainty removed in X by (Q, A) is the


same as the uncertainty in the answer given the question.
(b) Using the result from part a and the fact that questions are independent, we can
easily obtain the desired relationship.
(a)
I(X; Q1 , A1 , Q2 , A2 ) = I(X; Q1 ) + I(X; A1 |Q1 ) + I(X; Q2 |A1 , Q1 ) + I(X; A2 |A1 , Q1 , Q2 )
(b)
= I(X; A1 |Q1 ) + H(Q2 |A1 , Q1 ) − H(Q2 |X, A1 , Q1 ) + I(X; A2 |A1 , Q1 , Q2 )
(c)
= I(X; A1 |Q1 ) + I(X; A2 |A1 , Q1 , Q2 )
= I(X; A1 |Q1 ) + H(A2 |A1 , Q1 , Q2 ) − H(A2 |X, A1 , Q1 , Q2 )
(d)
= I(X; A1 |Q1 ) + H(A2 |A1 , Q1 , Q2 )
(e)
≤ I(X; A1 |Q1 ) + H(A2 |Q2 )
(f )
= 2I(X; A1 |Q1 )

(a) Chain rule.


(b) X and Q1 are independent.

(c) Q2 is independent of X , Q1 , and A1 .


(d) A2 is completely determined given Q2 and X .
(e) Conditioning decreases entropy.
(f) Result from part a.

42. Inequalities. Which of the following inequalities are generally ≥, =, ≤ ? Label each
with ≥, =, or ≤ .

(a) H(5X) vs. H(X)


(b) I(g(X); Y ) vs. I(X; Y )
(c) H(X0 |X−1 ) vs. H(X0 |X−1 , X1 )
(d) H(X1 , X2 , . . . , Xn ) vs. H(c(X1 , X2 , . . . , Xn )) , where c(x1 , x2 , . . . , xn ) is the Huff-
man codeword assigned to (x1 , x2 , . . . , xn ) .
(e) H(X, Y )/(H(X) + H(Y )) vs. 1

Solution:
(a) X → 5X is a one to one mapping, and hence H(X) = H(5X) .
(b) By the data processing inequality, I(g(X); Y ) ≤ I(X; Y ) .
(c) Because conditioning reduces entropy, H(X0 |X−1 ) ≥ H(X0 |X−1 , X1 ) .
(d) The Huffman code is a one-to-one (uniquely decodable) mapping of (x1 , x2 , . . . , xn ) , so H(X1 , X2 , . . . , Xn ) = H(c(X1 , X2 , . . . , Xn )) .
(e) H(X, Y ) ≤ H(X) + H(Y ) , so H(X, Y )/(H(X) + H(Y )) ≤ 1 .
43. Mutual information of heads and tails.
(a) Consider a fair coin flip. What is the mutual information between the top side
and the bottom side of the coin?
(b) A 6-sided fair die is rolled. What is the mutual information between the top side
and the front face (the side most facing you)?
Solution:
Mutual information of heads and tails.
To prove (a) observe that
I(T ; B) = H(B) − H(B|T )
= log 2 = 1
since B ∼ Ber(1/2) , and B = f (T ) . Here B, T stand for Bottom and Top respectively.
To prove (b) note that having observed a side of the cube facing us F , there are four
possibilities for the top T , which are equally probable. Thus,
I(T ; F ) = H(T ) − H(T |F )
= log 6 − log 4
= log 3 − 1
since T has uniform distribution on {1, 2, . . . , 6} .

44. Pure randomness


We wish to use a 3-sided coin to generate a fair coin toss. Let the coin X have
probability mass function

    X = A, with probability pA
        B, with probability pB
        C, with probability pC
where pA , pB , pC are unknown.

(a) How would you use 2 independent flips X1 , X2 to generate (if possible) a Bernoulli(1/2)
    random variable Z ?
(b) What is the resulting maximum expected number of fair bits generated?

Solution:

(a) The trick here is to notice that for any two letters Y and Z produced by two
    independent tosses of our bent three-sided coin, Y Z has the same probability as
    ZY . So we can produce B ∼ Bernoulli(1/2) coin flips by letting B = 0 when we
    get AB , BC or AC , and B = 1 when we get BA , CB or CA (if we get AA ,
    BB or CC we don’t assign a value to B .)
(b) The expected number of bits generated by the above scheme is as follows. We get
one bit, except when the two flips of the 3-sided coin produce the same symbol.
So the expected number of fair bits generated is

        0 · [P (AA) + P (BB) + P (CC)] + 1 · [1 − P (AA) − P (BB) − P (CC)],    (2.82)

    or, 1 − p_A^2 − p_B^2 − p_C^2 .                                             (2.83)
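A simulation (added; the probabilities pA, pB, pC and the seed are arbitrary, since they are unknown in the problem) illustrates both claims: the extracted bits are fair, and a pair of flips yields a bit with probability 1 − pA² − pB² − pC².

    import random

    random.seed(1)
    pA, pB, pC = 0.5, 0.3, 0.2      # unknown in the problem; fixed arbitrarily for the demo

    def flip():
        return random.choices('ABC', weights=(pA, pB, pC))[0]

    n_pairs = 200_000
    bits = []
    for _ in range(n_pairs):
        x, y = flip(), flip()
        if x != y:                  # ties (AA, BB, CC) produce no bit
            bits.append(0 if (x, y) in {('A', 'B'), ('B', 'C'), ('A', 'C')} else 1)

    print(sum(bits) / len(bits))    # close to 0.5 regardless of (pA, pB, pC)
    print(len(bits) / n_pairs)      # close to 1 - pA**2 - pB**2 - pC**2  (eq. 2.83)
    print(1 - pA**2 - pB**2 - pC**2)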

45. Finite entropy. Show that for a discrete random variable X ∈ {1, 2, . . .} , if E log X <
∞ , then H(X) < ∞ .
Solution: Let the distribution on the integers be p1 , p2 , . . . . Then H(p) = − Σ pi log pi
and E log X = Σ pi log i = c < ∞ .
We will now find the maximum entropy distribution subject to the constraint on the
expected logarithm. Using Lagrange multipliers or the results of Chapter 12, we have
the following functional to optimize
    J(p) = − Σ pi log pi − λ1 Σ pi − λ2 Σ pi log i                               (2.84)

Differentiating with respect to pi and setting to zero, we find that the pi that maximize
the entropy are pi = a i^λ , where a = 1/(Σ i^λ ) and λ is chosen to meet the expected log
constraint, i.e.,

    Σ i^λ log i = c Σ i^λ                                                        (2.85)

Using this value of pi , we can see that the entropy is finite.

46. Axiomatic definition of entropy. If we assume certain axioms for our measure of
information, then we will be forced to use a logarithmic measure like entropy. Shannon
used this to justify his initial definition of entropy. In this book, we will rely more on
the other properties of entropy rather than its axiomatic derivation to justify its use.
The following problem is considerably more difficult than the other problems in this
section.
If a sequence of symmetric functions Hm (p1 , p2 , . . . , pm ) satisfies the following proper-
ties,

• Normalization: H2(1/2, 1/2) = 1 ,
• Continuity: H2 (p, 1 − p) is a continuous function of p ,
• Grouping: Hm (p1 , p2 , . . . , pm ) = Hm−1 (p1 + p2 , p3 , . . . , pm ) + (p1 + p2 ) H2( p1/(p1 + p2) , p2/(p1 + p2) ) ,

prove that Hm must be of the form

    Hm (p1 , p2 , . . . , pm ) = − Σ_{i=1}^m pi log pi ,   m = 2, 3, . . . .      (2.86)

There are various other axiomatic formulations which also result in the same definition
of entropy. See, for example, the book by Csiszár and Körner[4].
Solution: Axiomatic definition of entropy. This is a long solution, so we will first
outline what we plan to do. First we will extend the grouping axiom by induction and
prove that

Hm (p1 , p2 , . . . , pm ) = Hm−k (p1 + p2 + · · · + pk , pk+1 , . . . , pm )


* +
p1 pk
+(p1 + p2 + · · · + pk )Hk ,..., (. 2.87)
p1 + p 2 + · · · + p k p1 + p 2 + · · · + p k
Let f (m) be the entropy of a uniform distribution on m symbols, i.e.,
* +
1 1 1
f (m) = Hm , ,..., . (2.88)
m m m
We will then show that for any two integers r and s , that f (rs) = f (r) + f (s) .
We use this to show that f (m) = log m . We then show for rational p = r/s , that
H2 (p, 1 − p) = −p log p − (1 − p) log(1 − p) . By continuity, we will extend it to irrational
p and finally by induction and grouping, we will extend the result to H m for m ≥ 2 .
To begin, we extend the grouping axiom. For convenience in notation, we will let

    S_k = ∑_{i=1}^k p_i   (2.89)

and we will denote H_2(q, 1 − q) as h(q). Then we can write the grouping axiom as

    H_m(p_1, ..., p_m) = H_{m−1}(S_2, p_3, ..., p_m) + S_2 h(p_2/S_2).   (2.90)

Applying the grouping axiom again, we have


    H_m(p_1, ..., p_m) = H_{m−1}(S_2, p_3, ..., p_m) + S_2 h(p_2/S_2)   (2.91)
                       = H_{m−2}(S_3, p_4, ..., p_m) + S_3 h(p_3/S_3) + S_2 h(p_2/S_2)   (2.92)
                         ...   (2.93)
                       = H_{m−(k−1)}(S_k, p_{k+1}, ..., p_m) + ∑_{i=2}^k S_i h(p_i/S_i).   (2.94)

Now, we apply the same grouping axiom repeatedly to H_k(p_1/S_k, ..., p_k/S_k), to obtain

    H_k(p_1/S_k, ..., p_k/S_k) = H_2(S_{k−1}/S_k, p_k/S_k) + ∑_{i=2}^{k−1} (S_i/S_k) h( (p_i/S_k)/(S_i/S_k) )   (2.95)
                               = (1/S_k) ∑_{i=2}^k S_i h(p_i/S_i).   (2.96)

From (2.94) and (2.96), it follows that

    H_m(p_1, ..., p_m) = H_{m−k+1}(S_k, p_{k+1}, ..., p_m) + S_k H_k( p_1/S_k, ..., p_k/S_k ),   (2.97)

which is the extended grouping axiom.
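As a quick sanity check that the Shannon entropy itself satisfies the extended grouping identity (2.97) (a sketch; the example distribution and the choice k = 3 are arbitrary):

    import math

    def H(p):
        # Shannon entropy (in bits) of a probability vector p
        return -sum(x * math.log2(x) for x in p if x > 0)

    p = [0.1, 0.2, 0.15, 0.25, 0.3]      # illustrative distribution, m = 5
    k = 3
    Sk = sum(p[:k])

    lhs = H(p)
    rhs = H([Sk] + p[k:]) + Sk * H([x / Sk for x in p[:k]])
    print(lhs, rhs)                      # the two values agree

Of course, this only illustrates that the entropy satisfies the axiom; the point of the problem is the converse.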
Now we need to use an axiom that is not explicitly stated in the text, namely that the
function Hm is symmetric with respect to its arguments. Using this, we can combine
any set of arguments of Hm using the extended grouping axiom.
Let f(m) denote H_m(1/m, 1/m, ..., 1/m).
Consider

    f(mn) = H_{mn}(1/(mn), 1/(mn), ..., 1/(mn)).   (2.98)

By repeatedly applying the extended grouping axiom (each step groups n of the 1/(mn) atoms, whose total probability is 1/m), we have

    f(mn) = H_{mn}(1/(mn), ..., 1/(mn))   (2.99)
          = H_{mn−n+1}(1/m, 1/(mn), ..., 1/(mn)) + (1/m) H_n(1/n, ..., 1/n)   (2.100)
          = H_{mn−2n+2}(1/m, 1/m, 1/(mn), ..., 1/(mn)) + (2/m) H_n(1/n, ..., 1/n)   (2.101)
            ...   (2.102)
          = H_m(1/m, ..., 1/m) + H_n(1/n, ..., 1/n)   (2.103)
          = f(m) + f(n).   (2.104)

We can immediately use this to conclude that f(m^k) = k f(m).


Now, we will argue that H2 (1, 0) = h(1) = 0 . We do this by expanding H 3 (p1 , p2 , 0)
( p1 + p2 = 1 ) in two different ways using the grouping axiom

H3 (p1 , p2 , 0) = H2 (p1 , p2 ) + p2 H2 (1, 0) (2.105)


= H2 (1, 0) + (p1 + p2 )H2 (p1 , p2 ) (2.106)

Thus p_2 H_2(1, 0) = H_2(1, 0) for all p_2, and therefore H_2(1, 0) = 0.


We will also need to show that f(m + 1) − f(m) → 0 as m → ∞. To prove this, we use the extended grouping axiom and write

    f(m + 1) = H_{m+1}(1/(m+1), ..., 1/(m+1))   (2.107)
             = h(1/(m+1)) + (m/(m+1)) H_m(1/m, ..., 1/m)   (2.108)
             = h(1/(m+1)) + (m/(m+1)) f(m)   (2.109)

and therefore

    f(m + 1) − (m/(m+1)) f(m) = h(1/(m+1)).   (2.110)

Thus lim [ f(m + 1) − (m/(m+1)) f(m) ] = lim h(1/(m+1)). But by the continuity of H_2, the limit on the right is h(0) = H_2(0, 1) = 0. Thus lim h(1/(m+1)) = 0.
Let us define

    a_{n+1} = f(n + 1) − f(n)   (2.111)

and

    b_n = h(1/n).   (2.112)

Then, using (2.110) and the telescoping sum f(n) = ∑_{i=2}^n a_i (with the natural convention f(1) = 0),

    a_{n+1} = −(1/(n+1)) f(n) + b_{n+1}   (2.113)
            = −(1/(n+1)) ∑_{i=2}^n a_i + b_{n+1}   (2.114)

and therefore

    (n + 1) b_{n+1} = (n + 1) a_{n+1} + ∑_{i=2}^n a_i.   (2.115)

Therefore summing over n, we have

    ∑_{n=2}^N n b_n = ∑_{n=2}^N (n a_n + a_{n−1} + · · · + a_2) = N ∑_{n=2}^N a_n.   (2.116)
Dividing both sides by ∑_{n=1}^N n = N(N + 1)/2, we obtain

    (2/(N + 1)) ∑_{n=2}^N a_n = ( ∑_{n=2}^N n b_n ) / ( ∑_{n=1}^N n ).   (2.117)

Now by the continuity of H_2 and the definition of b_n, it follows that b_n → 0 as n → ∞. Since the right hand side is essentially an average of the b_n 's, it also goes to 0 (this can be proved more precisely using ε's and δ's). Thus the left hand side goes to 0. We can then see that

    a_{N+1} = b_{N+1} − (1/(N + 1)) ∑_{n=2}^N a_n   (2.118)

also goes to 0 as N → ∞. Thus

    f(n + 1) − f(n) → 0 as n → ∞.   (2.119)

We will now prove the following lemma.

Lemma 2.0.1 Let the function f(m) satisfy the following assumptions:

• f(mn) = f(m) + f(n) for all integers m, n ;
• lim_{n→∞} (f(n + 1) − f(n)) = 0 ;
• f(2) = 1 ;

then the function f(m) = log_2 m.

Proof of the lemma: Let P be an arbitrary prime number and let

    g(n) = f(n) − ( f(P)/log_2 P ) log_2 n.   (2.120)

Then g(n) satisfies the first assumption of the lemma. Also g(P) = 0.

Also if we let

    α_n = g(n + 1) − g(n) = f(n + 1) − f(n) + ( f(P)/log_2 P ) log_2 ( n/(n+1) ),   (2.121)

then the second assumption in the lemma implies that lim α_n = 0.


For an integer n, define

    n^{(1)} = ⌊ n/P ⌋.   (2.122)

Then it follows that n^{(1)} ≤ n/P, and

    n = n^{(1)} P + l   (2.123)

where 0 ≤ l < P. From the fact that g(P) = 0, it follows that g(P n^{(1)}) = g(n^{(1)}), and

    g(n) = g(n^{(1)}) + g(n) − g(P n^{(1)}) = g(n^{(1)}) + ∑_{i=P n^{(1)}}^{n−1} α_i.   (2.124)

Just as we have defined n^{(1)} from n, we can define n^{(2)} from n^{(1)}. Continuing this process (with n^{(0)} = n), we can then write

    g(n) = g(n^{(k)}) + ∑_{j=1}^{k} ( ∑_{i=P n^{(j)}}^{n^{(j−1)}−1} α_i ).   (2.125)

Since n^{(k)} ≤ n/P^k, after

    k = ⌊ log n / log P ⌋ + 1   (2.126)

terms, we have n^{(k)} = 0, and g(0) = 0 (this follows directly from the additive property of g). Thus we can write

    g(n) = ∑_{i=1}^{t_n} α_i,   (2.127)

the sum of t_n terms, where

    t_n ≤ P ( ⌊ log n / log P ⌋ + 1 ).   (2.128)
Since α_n → 0 and g(n) is a sum of at most t_n = O(log n) of the α_i 's, it follows that g(n)/log_2 n → 0. Thus it follows that

    lim_{n→∞} f(n)/log_2 n = f(P)/log_2 P.   (2.129)

Since P was arbitrary, it follows that f(P)/log_2 P = c for every prime number P. Applying the third axiom in the lemma, it follows that the constant is 1, and f(P) = log_2 P.
For composite numbers N = P_1 P_2 · · · P_l, we can apply the first property of f and the prime number factorization of N to show that

    f(N) = ∑ f(P_i) = ∑ log_2 P_i = log_2 N.   (2.130)

Thus the lemma is proved.


The lemma can be simplified considerably if, instead of the second assumption, we replace it by the assumption that f(n) is monotone in n. We will now argue that the only function f(m) such that f(mn) = f(m) + f(n) for all integers m, n is of the form f(m) = log_a m for some base a.
Let c = f(2). Now f(4) = f(2 × 2) = f(2) + f(2) = 2c. Similarly, it is easy to see that f(2^k) = kc = c log_2 2^k. We will extend this to integers that are not powers of 2.

For any integer m, let r > 0 be another integer and let 2^k ≤ m^r < 2^{k+1}. Then by the monotonicity assumption on f, we have

    kc ≤ r f(m) < (k + 1)c   (2.131)

or

    (k/r) c ≤ f(m) < ((k + 1)/r) c.   (2.132)

Now by the monotonicity of log, we have

    k/r ≤ log_2 m < (k + 1)/r.   (2.133)

Combining these two equations, we obtain

    | f(m)/c − log_2 m | < 1/r.   (2.134)

Since r was arbitrary, we must have

    f(m) = c log_2 m,   (2.135)

and we can identify c = 1 from the last assumption of the lemma.
Now we are almost done. We have shown that for any uniform distribution on m outcomes, f(m) = H_m(1/m, ..., 1/m) = log_2 m.
We will now show that

H2 (p, 1 − p) = −p log p − (1 − p) log(1 − p). (2.136)

To begin, let p be a rational number, r/s, say. Consider the extended grouping axiom for H_s : grouping the last s − r atoms (total probability (s − r)/s) and then the remaining r atoms (total probability r/s),

    f(s) = H_s(1/s, ..., 1/s)
         = H_{r+1}(1/s, ..., 1/s, (s − r)/s) + ((s − r)/s) f(s − r)   (2.137)
         = H_2( r/s, (s − r)/s ) + (r/s) f(r) + ((s − r)/s) f(s − r).   (2.138)

Substituting f(s) = log_2 s, etc., we obtain

    H_2( r/s, (s − r)/s ) = −(r/s) log_2 (r/s) − (1 − r/s) log_2 (1 − r/s).   (2.139)
Thus (2.136) is true for rational p . By the continuity assumption, (2.136) is also true
at irrational p .
To complete the proof, we have to extend the definition from H_2 to H_m, i.e., we have to show that

    H_m(p_1, ..., p_m) = −∑ p_i log p_i   (2.140)

for all m. This is a straightforward induction. We have just shown that this is true for m = 2. Now assume that it is true for m = n − 1. By the grouping axiom,

    H_n(p_1, ..., p_n) = H_{n−1}(p_1 + p_2, p_3, ..., p_n)   (2.141)
                         + (p_1 + p_2) H_2( p_1/(p_1 + p_2), p_2/(p_1 + p_2) )   (2.142)
                       = −(p_1 + p_2) log(p_1 + p_2) − ∑_{i=3}^n p_i log p_i   (2.143)
                         − p_1 log( p_1/(p_1 + p_2) ) − p_2 log( p_2/(p_1 + p_2) )   (2.144)
                       = −∑_{i=1}^n p_i log p_i.   (2.145)

Thus the statement is true for m = n, and by induction, it is true for all m. Thus we have finally proved that the only symmetric function that satisfies the axioms is

    H_m(p_1, ..., p_m) = −∑_{i=1}^m p_i log p_i.   (2.146)

The proof above is due to Rényi [11].


47. The entropy of a missorted file.
A deck of n cards in order 1, 2, ..., n is provided. One card is removed at random and then replaced at a random position in the deck. What is the entropy of the resulting deck?
Solution: The entropy of a missorted file.

The heart of this problem is simply carefully counting the possible outcome states. There are n ways to choose which card gets mis-sorted, and, once the card is chosen, there are again n ways to choose where the card is replaced in the deck. Each of these shuffling actions has probability 1/n². Unfortunately, not all of these n² actions result in a unique mis-sorted file, so we need to carefully count the number of distinguishable outcome states. The resulting deck falls into exactly one of the following three cases.

• The selected card ends up at its original location, so the deck is unchanged.

• The selected card ends up exactly one location away from its original location (an adjacent swap).

• The selected card ends up at least two locations away from its original location.

To compute the entropy of the resulting deck, we need to know the probability of each
case.
Case 1 (resulting deck is the same as the original): There are n ways to achieve this
outcome state, one for each of the n cards in the deck. Thus, the probability associated
with case 1 is n/n² = 1/n.

Case 2 (adjacent pair swapping): There are n − 1 adjacent pairs, each of which will have a probability of 2/n², since for each pair there are two ways to achieve the swap: either by selecting the left-hand card and moving it one to the right, or by selecting the right-hand card and moving it one to the left.
Case 3 (typical situation): None of the remaining actions “collapses”; they all result in unique outcome states, each with probability 1/n². Of the n² possible shuffling actions, n² − n − 2(n − 1) of them result in this third case (we’ve simply subtracted the case 1 and case 2 actions above).
The entropy of the resulting deck can be computed as follows (logarithms are base 2):

    H(X) = (1/n) log(n) + (n − 1) (2/n²) log(n²/2) + (n² − 3n + 2) (1/n²) log(n²)
         = ((2n − 1)/n) log(n) − 2(n − 1)/n².
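A brute-force check of this formula (a sketch in Python; the removal/re-insertion convention coded below is one natural reading of the problem statement):

    import math
    from collections import Counter

    def missorted_decks(n):
        deck = tuple(range(1, n + 1))
        outcomes = Counter()
        for i in range(n):                 # remove the card at position i ...
            rest = deck[:i] + deck[i + 1:]
            for j in range(n):             # ... and re-insert it at position j
                outcomes[rest[:j] + (deck[i],) + rest[j:]] += 1
        return outcomes                    # counts over the n^2 equally likely actions

    n = 6
    counts = missorted_decks(n)
    total = n * n
    H = -sum((c / total) * math.log2(c / total) for c in counts.values())
    closed_form = ((2 * n - 1) / n) * math.log2(n) - 2 * (n - 1) / (n * n)
    print(H, closed_form)                  # both are about 4.46 bits for n = 6

The enumeration also confirms the case counts: the identity deck arises from n actions, each adjacent swap from 2, and every other outcome from exactly 1.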

48. Sequence length.


How much information does the length of a sequence give about the content of a sequence? Suppose we consider a Bernoulli(1/2) process {X_i}.
Stop the process when the first 1 appears. Let N designate this stopping time. Thus X^N is an element of the set of all finite length binary sequences {0, 1}* = {0, 1, 00, 01, 10, 11, 000, ...}.

(a) Find I(N; X^N).

(b) Find H(X^N | N).

(c) Find H(X^N).

Let’s now consider a different stopping time. For this part, again assume X_i ~ Bernoulli(1/2), but stop at time N = 6 with probability 1/3 and at time N = 12 with probability 2/3. Let this stopping time be independent of the sequence X_1 X_2 ... X_12.

(d) Find I(N; X^N).

(e) Find H(X^N | N).

(f) Find H(X^N).

Solution:

(a)

    I(X^N; N) = H(N) − H(N | X^N)
              = H(N) − 0
              = E(N)
              = 2,

where H(N | X^N) = 0 because X^N determines N, and H(N) = E(N) because P(N = n) = 2^{−n}, so −log_2 P(N = n) = n and hence H(N) = E[ −log_2 P(N) ] = E(N) = 2 bits.

(b) Since given N we know that X_i = 0 for all i < N and X_N = 1,

    H(X^N | N) = 0.

(c)

    H(X^N) = I(X^N; N) + H(X^N | N)
           = I(X^N; N) + 0
           = 2.

(d)

    I(X^N; N) = H(N) − H(N | X^N)
              = H(N) − 0
              = H_B(1/3),

where H_B denotes the binary entropy function and, as before, X^N determines N.

(e)

    H(X^N | N) = (1/3) H(X^6 | N = 6) + (2/3) H(X^12 | N = 12)
               = (1/3) H(X^6) + (2/3) H(X^12)
               = (1/3) · 6 + (2/3) · 12
               = 10,

where the second equality uses the independence of N and the sequence.

(f)

    H(X^N) = I(X^N; N) + H(X^N | N)
           = I(X^N; N) + 10
           = H_B(1/3) + 10 ≈ 10.918 bits.
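The answers to parts (d)–(f) can be checked directly from the distribution of X^N (a short computational sketch; no simulation is needed since the distribution is small enough to enumerate):

    import math

    # X^N is uniform over {0,1}^6 with probability 1/3 and over {0,1}^12 with
    # probability 2/3; the two supports are disjoint since the lengths differ.
    probs = [(1/3) * 2**-6] * 2**6 + [(2/3) * 2**-12] * 2**12

    H_XN = -sum(q * math.log2(q) for q in probs)
    H_XN_given_N = (1/3) * 6 + (2/3) * 12
    I = H_XN - H_XN_given_N                      # = I(X^N; N)

    h = lambda q: -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    print(H_XN, h(1/3) + 10)                     # part (f): both about 10.918
    print(I, h(1/3))                             # part (d): both about 0.918

Both comparisons agree to floating-point precision.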
Chapter 3

The Asymptotic Equipartition Property

1. Markov’s inequality and Chebyshev’s inequality.

(a) (Markov’s inequality.) For any non-negative random variable X and any t > 0, show that

    Pr{X ≥ t} ≤ EX/t.   (3.1)

Exhibit a random variable that achieves this inequality with equality.
(b) (Chebyshev’s inequality.) Let Y be a random variable with mean µ and variance σ². By letting X = (Y − µ)², show that for any ε > 0,

    Pr{ |Y − µ| > ε } ≤ σ²/ε².   (3.2)
(c) (The weak law of large numbers.) Let Z_1, Z_2, ..., Z_n be a sequence of i.i.d. random variables with mean µ and variance σ². Let Z̄_n = (1/n) ∑_{i=1}^n Z_i be the sample mean. Show that

    Pr{ |Z̄_n − µ| > ε } ≤ σ²/(n ε²).   (3.3)

Thus Pr{ |Z̄_n − µ| > ε } → 0 as n → ∞. This is known as the weak law of large numbers.

Solution: Markov’s inequality and Chebyshev’s inequality.

(a) If X has distribution F(x),

    EX = ∫_0^∞ x dF
       = ∫_0^δ x dF + ∫_δ^∞ x dF
       ≥ ∫_δ^∞ x dF
       ≥ ∫_δ^∞ δ dF
       = δ Pr{X ≥ δ},

and rearranging gives Pr{X ≥ δ} ≤ EX/δ, which is Markov’s inequality (with t = δ).
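As a numerical illustration of parts (a) and (c) (a sketch; the Exponential(1) and Uniform(0,1) choices, the sample sizes, and the thresholds are arbitrary assumptions made for the demonstration):

    import random
    import statistics

    random.seed(0)

    # (a) Markov: Pr{X >= t} <= EX/t for non-negative X; here X ~ Exponential(1).
    samples = [random.expovariate(1.0) for _ in range(100_000)]
    t = 3.0
    print(sum(x >= t for x in samples) / len(samples), "<=", statistics.mean(samples) / t)

    # (c) Weak law: Pr{|Zbar_n - mu| > eps} <= sigma^2/(n eps^2); here Z_i ~ Uniform(0,1),
    # so mu = 1/2 and sigma^2 = 1/12.
    n, eps, trials = 200, 0.05, 2000
    mu, var = 0.5, 1 / 12
    exceed = 0
    for _ in range(trials):
        zbar = sum(random.random() for _ in range(n)) / n
        exceed += abs(zbar - mu) > eps
    print(exceed / trials, "<=", var / (n * eps * eps))

In both cases the empirical frequency on the left falls below the bound on the right.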
