Ict Solution
1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number
of flips required.
(a) Find the entropy H(X) in bits. The following expressions may be useful:
\sum_{n=0}^{\infty} r^n = \frac{1}{1-r}, \qquad \sum_{n=0}^{\infty} n r^n = \frac{r}{(1-r)^2}.
Solution:
(a) The number X of tosses till the first head appears has the geometric distribution
with parameter p = 1/2, where P(X = n) = p q^{n-1}, n ∈ {1, 2, . . .}. Hence the
entropy of X is
H(X) = -\sum_{n=1}^{\infty} p q^{n-1} \log (p q^{n-1})
     = -\left[ \sum_{n=0}^{\infty} p q^n \log p + \sum_{n=0}^{\infty} n p q^n \log q \right]
     = \frac{-p \log p}{1-q} - \frac{p q \log q}{p^2}
     = \frac{-p \log p - q \log q}{p}
     = H(p)/p \text{ bits.}
If p = 1/2 , then H(X) = 2 bits.
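As a quick numerical sanity check (a sketch, not part of the original solution), one can sum the series for p = 1/2 directly and recover the 2 bits:

```python
from math import log2

# Sum the series for H(X) directly for p = 1/2 and compare with H(p)/p = 2 bits.
p = 0.5
H = 0.0
for n in range(1, 200):              # the tail beyond n = 200 is negligible
    pn = p * (1 - p) ** (n - 1)      # P(X = n) for the geometric distribution
    H -= pn * log2(pn)
print(H)                             # ≈ 2.0
```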
(b) Intuitively, it seems clear that the best questions are those that have equally likely
chances of receiving a yes or a no answer. Consequently, one possible guess is
that the most “efficient” series of questions is: Is X = 1 ? If not, is X = 2 ?
If not, is X = 3 ? . . . with a resulting expected number of questions equal to
\sum_{n=1}^{\infty} n (1/2)^n = 2. This should reinforce the intuition that H(X) is a mea-
sure of the uncertainty of X . Indeed in this case, the entropy is exactly the
same as the average number of questions needed to define X , and in general
E(# of questions) ≥ H(X) . This problem has an interpretation as a source cod-
ing problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then
the set of questions in the above procedure can be written as a collection of (X, Y )
pairs: (1, 1), (2, 01), (3, 001), etc. In fact, this intuitively derived code is the
optimal (Huffman) code minimizing the expected number of questions.
2. Entropy of functions. Let Y = g(X). What is the relationship between the entropies H(X) and H(Y) if
(a) Y = 2^X ?
(b) Y = cos X ?
Solution:
Consider any set of x's that map onto a single y. For this set

\sum_{x:\, y = g(x)} p(x) \log p(x) \le \sum_{x:\, y = g(x)} p(x) \log p(y) = p(y) \log p(y),

since log is a monotone increasing function and p(x) \le \sum_{x:\, y = g(x)} p(x) = p(y). Extending this argument to the entire range of X (and Y), we obtain

H(X) = -\sum_x p(x) \log p(x)
     = -\sum_y \sum_{x:\, y = g(x)} p(x) \log p(x)
     \ge -\sum_y p(y) \log p(y)
     = H(Y).
(a) Y = 2^X is one-to-one, and hence the entropy, which is just a function of the
probabilities (and not of the values taken by the random variable), does not change, i.e.,
H(X) = H(Y).
(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that
H(X) ≥ H(Y ) , with equality if cosine is one-to-one on the range of X .
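A small numerical sketch of case (b), using an illustrative distribution for X (the particular support is an assumption, not from the text), shows the collapse H(Y) ≤ H(X) when cos is not one-to-one on the range of X:

```python
import math

# Compare H(X) with H(cos X) when cos(.) is not one-to-one on the support of X.
pX = {0.0: 0.25, math.pi: 0.25, 2 * math.pi: 0.25, 3 * math.pi: 0.25}

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

pY = {}
for x, p in pX.items():
    y = round(math.cos(x), 9)         # cos collapses 0 with 2*pi, and pi with 3*pi
    pY[y] = pY.get(y, 0) + p

print(H(pX), H(pY))                   # 2.0 and 1.0: H(Y) <= H(X), as claimed
```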
3. Minimum entropy. What is the minimum value of H(p 1 , ..., pn ) = H(p) as p ranges
over the set of n -dimensional probability vectors? Find all p ’s which achieve this
minimum.
Solution: We wish to find all probability vectors p = (p 1 , p2 , . . . , pn ) which minimize
H(p) = -\sum_i p_i \log p_i.
Now −p_i log p_i ≥ 0, with equality iff p_i = 0 or 1. Hence the only possible probability
vectors which minimize H(p) are those with p_i = 1 for some i and p_j = 0 for j ≠ i.
There are n such vectors, i.e., (1, 0, . . . , 0) , (0, 1, 0, . . . , 0) , . . . , (0, . . . , 0, 1) , and the
minimum value of H(p) is 0.
4. Entropy of functions of a random variable. Let X be a discrete random variable.
Show that the entropy of a function of X is less than or equal to the entropy of X by
justifying the following steps:
(a)
H(X, g(X)) = H(X) + H(g(X) | X) (2.1)
(b)
= H(X); (2.2)
(c)
H(X, g(X)) = H(g(X)) + H(X | g(X)) (2.3)
(d)
≥ H(g(X)). (2.4)
Thus H(g(X)) ≤ H(X).
Solution: Entropy of functions of a random variable.
(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence
H(g(X)|X) = \sum_x p(x) H(g(X)|X = x) = \sum_x 0 = 0.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) ≥ 0 , with equality iff X is a function of g(X) , i.e., g(.) is one-to-one.
Hence H(X, g(X)) ≥ H(g(X)) .
Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)) .
5. Zero conditional entropy. Show that if H(Y |X) = 0 , then Y is a function of X ,
i.e., for all x with p(x) > 0 , there is only one possible value of y with p(x, y) > 0 .
Solution: Zero Conditional Entropy. Assume that there exists an x , say x 0 and two
different values of y , say y1 and y2 such that p(x0 , y1 ) > 0 and p(x0 , y2 ) > 0 . Then
p(x0 ) ≥ p(x0 , y1 ) + p(x0 , y2 ) > 0 , and p(y1 |x0 ) and p(y2 |x0 ) are not equal to 0 or 1.
Thus
H(Y|X) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)   (2.5)
 \ge p(x_0) \bigl( -p(y_1|x_0) \log p(y_1|x_0) - p(y_2|x_0) \log p(y_2|x_0) \bigr)   (2.6)
 > 0,   (2.7)

since −t log t > 0 for 0 < t < 1. This contradicts the assumption that H(Y|X) = 0, so for every x with p(x) > 0 there is only one y with p(x, y) > 0, i.e., Y is a function of X.
6. Conditional mutual information vs. unconditional mutual information. Give examples of joint random variables X, Y and Z such that (a) I(X; Y | Z) < I(X; Y), and (b) I(X; Y | Z) > I(X; Y).
Solution:
(a) The last corollary to Theorem 2.8.1 in the text states that if X → Y → Z, that
is, if p(x, z | y) = p(x | y) p(z | y), then I(X; Y) ≥ I(X; Y | Z). Equality holds if
and only if I(X; Z) = 0, i.e., if X and Z are independent.
A simple example of random variables satisfying the inequality conditions above
is: X a fair binary random variable, Y = X, and Z = Y. In this case,
I(X; Y) = H(X) − H(X | Y) = 1 bit,
and
I(X; Y | Z) = H(X | Z) − H(X | Y, Z) = 0.
So I(X; Y) > I(X; Y | Z).
(b) This example is also given in the text. Let X, Y be independent fair binary
random variables and let Z = X + Y . In this case we have that,
I(X; Y ) = 0
and,
I(X; Y | Z) = H(X | Z) = 1/2.
So I(X; Y) < I(X; Y | Z). Note that in this case X, Y, Z do not form a Markov chain.
7. Coin weighing. Suppose one has n coins, among which there may or may not be one
counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than
the other coins. The coins are to be weighed by a balance.
(a) Find an upper bound on the number of coins n so that k weighings will find the
counterfeit coin (if any) and correctly declare it to be heavier or lighter.
(b) (Difficult) What is the coin-weighing strategy for k = 3 weighings and 12 coins?
Solution:
(a) Each weighing has three possible outcomes: equal, left pan heavier, or right pan
heavier. Hence with k weighings there are 3^k possible outcomes, and we can
distinguish between at most 3^k different "states". There are 2n + 1 possible states
(each of the n coins may be heavy or light, or there may be no counterfeit coin),
so we need 2n + 1 ≤ 3^k, or n ≤ (3^k − 1)/2.
Looking at it from an information theoretic viewpoint, each weighing gives at most
log2 3 bits of information. There are 2n + 1 possible “states”, with a maximum
entropy of log2 (2n + 1) bits. Hence in this situation, one would require at least
log2 (2n + 1)/ log 2 3 weighings to extract enough information for determination of
the odd coin, which gives the same result as above.
(b) There are many solutions to this problem. We will give one which is based on the
ternary number system.
We may express the numbers {−12, −11, . . . , −1, 0, 1, . . . , 12} in a ternary number
system with alphabet {−1, 0, 1} . For example, the number 8 is (-1,0,1) where
−1 × 3^0 + 0 × 3^1 + 1 × 3^2 = 8. We form the matrix with the representation of the
positive numbers as its columns.
        1   2   3   4   5   6   7   8   9  10  11  12
3^0     1  -1   0   1  -1   0   1  -1   0   1  -1   0    Σ1 = 0
3^1     0   1   1   1  -1  -1  -1   0   0   0   1   1    Σ2 = 2
3^2     0   0   0   0   1   1   1   1   1   1   1   1    Σ3 = 8
Note that the row sums are not all zero. We can negate some columns to make
the row sums zero. For example, negating columns 7,9,11 and 12, we obtain
        1   2   3   4   5   6   7   8   9  10  11  12
3^0     1  -1   0   1  -1   0  -1  -1   0   1   1   0    Σ1 = 0
3^1     0   1   1   1  -1  -1   1   0   0   0  -1  -1    Σ2 = 0
3^2     0   0   0   0   1   1  -1   1  -1   1  -1  -1    Σ3 = 0
Now place the coins on the balance according to the following rule: for weighing
#i, place coin n (whose column in the matrix above has entries n_1, n_2, n_3)
• on the left pan, if n_i = −1;
• aside, if n_i = 0;
• on the right pan, if n_i = 1.
The outcome of the three weighings will find the odd coin if any and tell whether
it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if
the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings
give the ternary expansion of the index of the odd coin. If the expansion is the
same as the expansion in the matrix, it indicates that the coin is heavier. If
the expansion is of the opposite sign, the coin is lighter. For example, (0,-1,-1)
indicates (0)3^0 + (−1)3^1 + (−1)3^2 = −12, hence coin #12 is heavy, (1,0,-1) indicates
#8 is light, (0,0,0) indicates no odd coin.
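The scheme can be checked exhaustively. The following sketch (not part of the original solution) encodes the sign-adjusted matrix above as a dictionary, simulates the three weighings for each of the 25 possible states, and decodes the outcome:

```python
# digits[c] is column c of the sign-adjusted matrix: coin c's pan assignment
# in each of the three weighings (-1 = left pan, 0 = aside, +1 = right pan).
digits = {
    1: (1, 0, 0),   2: (-1, 1, 0),   3: (0, 1, 0),    4: (1, 1, 0),
    5: (-1, -1, 1), 6: (0, -1, 1),   7: (-1, 1, -1),  8: (-1, 0, 1),
    9: (0, 0, -1),  10: (1, 0, 1),   11: (1, -1, -1), 12: (0, -1, -1),
}

def weigh(odd_coin, heavy):
    """Outcome of the three weighings: -1 left pan heavier, +1 right pan heavier, 0 balanced."""
    if odd_coin is None:                      # no counterfeit coin: every weighing balances
        return (0, 0, 0)
    sign = 1 if heavy else -1                 # a light coin reverses every outcome
    return tuple(sign * d for d in digits[odd_coin])

def decode(outcome):
    """Match the outcome vector against the columns (and their negatives)."""
    for c, col in digits.items():
        if outcome == col:
            return (c, "heavy")
        if outcome == tuple(-d for d in col):
            return (c, "light")
    return None                               # (0,0,0): no odd coin

# Exhaustive check of all 25 possible states.
assert decode(weigh(None, True)) is None
for c in digits:
    assert decode(weigh(c, True)) == (c, "heavy")
    assert decode(weigh(c, False)) == (c, "light")
print("all 25 states identified correctly")
```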
Why does this scheme work? It is a single error correcting Hamming code for the
ternary alphabet (discussed in Section 8.11 in the book). Here are some details.
First note a few properties of the matrix above that was used for the scheme.
All the columns are distinct and no two columns add to (0,0,0). Also if any coin
is heavier, it will produce the sequence of weighings that matches its column in
the matrix. If it is lighter, it produces the negative of its column as a sequence
of weighings. Combining all these facts, we can see that any single odd coin will
produce a unique sequence of weighings, and that the coin can be determined from
the sequence.
One of the questions that many of you had was whether the bound derived in part (a)
is actually achievable. For example, can one distinguish 13 coins in 3 weighings?
No, not with a scheme like the one above. Yes, under the assumptions under
which the bound was derived. The bound did not prohibit the division of coins
into halves, neither did it disallow the existence of another coin known to be
normal. Under both these conditions, it is possible to find the odd coin of 13 coins
in 3 weighings. You could try modifying the above scheme to these cases.
8. Drawing with and without replacement. An urn contains r red, w white, and
b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with
replacement or without replacement? Set it up and show why. (There is both a hard
way and a relatively simple way to do this.)
Solution: Drawing with and without replacement. Intuitively, it is clear that if the
balls are drawn with replacement, the number of possible choices for the i -th ball is
larger, and therefore the conditional entropy is larger. But computing the conditional
distributions is slightly involved. It is easier to compute the unconditional entropy.
• With replacement. In this case the conditional distribution of each draw is the
same for every draw. Thus

X_i = \begin{cases} \text{red}, & \text{with prob. } \frac{r}{r+w+b} \\ \text{white}, & \text{with prob. } \frac{w}{r+w+b} \\ \text{black}, & \text{with prob. } \frac{b}{r+w+b} \end{cases}   (2.8)

and therefore H(X_i | X_{i-1}, \ldots, X_1) = H(X_i) = H\!\left(\frac{r}{r+w+b}, \frac{w}{r+w+b}, \frac{b}{r+w+b}\right) for every draw.
• Without replacement. The unconditional probability of the i -th ball being red is
still r/(r + w + b) , etc. Thus the unconditional entropy H(X i ) is still the same as
with replacement. The conditional entropy H(X i |Xi−1 , . . . , X1 ) is less than the
unconditional entropy, and therefore the entropy of drawing without replacement
is lower.
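A brute-force sketch (illustrative urn sizes, chosen here and not from the problem) confirms the conclusion for k = 2 draws by enumerating equally likely ordered draws:

```python
from itertools import product
from math import log2

def draw_entropy(r, w, b, k, replace):
    """Joint entropy H(X1,...,Xk) of the colour sequence of k draws from the urn."""
    colours = ["red"] * r + ["white"] * w + ["black"] * b
    n = len(colours)
    if replace:
        index_seqs = product(range(n), repeat=k)
    else:
        index_seqs = (s for s in product(range(n), repeat=k) if len(set(s)) == k)
    counts, total = {}, 0
    for seq in index_seqs:
        key = tuple(colours[i] for i in seq)     # ordered colour sequence of the draw
        counts[key] = counts.get(key, 0) + 1
        total += 1
    # every ordered index sequence is equally likely, so probabilities are counts/total
    return -sum((c / total) * log2(c / total) for c in counts.values())

# illustrative urn: r, w, b = 2, 1, 1 and k = 2 draws
print(draw_entropy(2, 1, 1, 2, replace=True))    # 3.0 bits (with replacement)
print(draw_entropy(2, 1, 1, 2, replace=False))   # ≈ 2.75 bits (without replacement: smaller)
```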
9. A metric. A function ρ(x, y) is a metric if for all x, y:
• ρ(x, y) ≥ 0
• ρ(x, y) = ρ(y, x)
• ρ(x, y) = 0 if and only if x = y
• ρ(x, y) + ρ(y, z) ≥ ρ(x, z).
(a) Show that ρ(X, Y ) = H(X|Y ) + H(Y |X) satisfies the first, second and fourth
properties above. If we say that X = Y if there is a one-to-one function mapping
from X to Y , then the third property is also satisfied, and ρ(X, Y ) is a metric.
(b) Verify that ρ(X, Y ) can also be expressed as
ρ(X, Y ) = H(X) + H(Y ) − 2I(X; Y ) (2.11)
= H(X, Y ) − I(X; Y ) (2.12)
= 2H(X, Y ) − H(X) − H(Y ). (2.13)
Solution: A metric
(a) Let
ρ(X, Y ) = H(X|Y ) + H(Y |X). (2.14)
Then
• Since conditional entropy is always ≥ 0 , ρ(X, Y ) ≥ 0 .
• The symmetry of the definition implies that ρ(X, Y ) = ρ(Y, X) .
• By problem 2.6, it follows that H(Y |X) is 0 iff Y is a function of X and
H(X|Y ) is 0 iff X is a function of Y . Thus ρ(X, Y ) is 0 iff X and Y
are functions of each other - and therefore are equivalent up to a reversible
transformation.
• Consider three random variables X , Y and Z . Then
H(X|Y ) + H(Y |Z) ≥ H(X|Y, Z) + H(Y |Z) (2.15)
= H(X, Y |Z) (2.16)
= H(X|Z) + H(Y |X, Z) (2.17)
≥ H(X|Z), (2.18)
from which it follows that
ρ(X, Y ) + ρ(Y, Z) ≥ ρ(X, Z). (2.19)
Note that the inequality is strict unless X → Y → Z forms a Markov Chain
and Y is a function of X and Z .
(b) Since H(X|Y ) = H(X) − I(X; Y ) , the first equation follows. The second relation
follows from the first equation and the fact that H(X, Y ) = H(X) + H(Y ) −
I(X; Y ) . The third follows on substituting I(X; Y ) = H(X) + H(Y ) − H(X, Y ) .
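A small numerical check of part (b) (a sketch; the joint pmf below is an arbitrary illustrative choice):

```python
from math import log2

p = {(0, 0): 1/3, (0, 1): 1/3, (1, 1): 1/3}   # hypothetical joint distribution p(x, y)

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

px, py = {}, {}
for (x, y), q in p.items():
    px[x] = px.get(x, 0) + q
    py[y] = py.get(y, 0) + q

Hxy, Hx, Hy = H(p), H(px), H(py)
I = Hx + Hy - Hxy                      # I(X;Y)
rho = (Hxy - Hy) + (Hxy - Hx)          # H(X|Y) + H(Y|X)
assert abs(rho - (Hx + Hy - 2 * I)) < 1e-12
assert abs(rho - (Hxy - I)) < 1e-12
assert abs(rho - (2 * Hxy - Hx - Hy)) < 1e-12
print(rho)
```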
10. Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn
according to probability mass functions p 1 (·) and p2 (·) over the respective alphabets
X1 = {1, 2, . . . , m} and X2 = {m + 1, . . . , n}. Let
X = \begin{cases} X_1, & \text{with probability } \alpha, \\ X_2, & \text{with probability } 1 - \alpha. \end{cases}
12. Example of joint entropy. Let p(x, y) be given by

          Y = 0    Y = 1
  X = 0    1/3      1/3
  X = 1     0       1/3

Find:
(a) H(X) = (2/3) log(3/2) + (1/3) log 3 = 0.918 bits = H(Y).
(b) H(X|Y) = (1/3) H(X|Y = 0) + (2/3) H(X|Y = 1) = 0.667 bits = H(Y|X).
(c) H(X, Y) = 3 × (1/3) log 3 = 1.585 bits.
(d) H(Y) − H(Y|X) = 0.251 bits.
(e) I(X; Y) = H(Y) − H(Y|X) = 0.251 bits.
(f) See Figure 2.1.
Figure 2.1: Venn diagram illustrating the relationships among H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y).
since the second term is always negative. Hence, letting y = 1/x, we obtain

-\ln y \le \frac{1}{y} - 1,

or

\ln y \ge 1 - \frac{1}{y},

with equality iff y = 1.
14. Entropy of a sum. Let X and Y be random variables that take on values x 1 , x2 , . . . , xr
and y1 , y2 , . . . , ys , respectively. Let Z = X + Y.
(a) Show that H(Z|X) = H(Y |X). Argue that if X, Y are independent, then H(Y ) ≤
H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables
adds uncertainty.
(b) Give an example of (necessarily dependent) random variables in which H(X) >
H(Z) and H(Y ) > H(Z).
(c) Under what conditions does H(Z) = H(X) + H(Y ) ?
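The solution to this problem is not reproduced in this excerpt, but a small sketch illustrates part (b) with one possible choice of dependent X and Y (this particular pair is an illustrative assumption, not taken from the text):

```python
from math import log2

def H(pmf):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Dependent pair: X uniform on {-1, 1} and Y = -X, so Z = X + Y = 0 with probability 1.
joint = {(-1, 1): 0.5, (1, -1): 0.5}          # p(x, y)
pX, pY, pZ = {}, {}, {}
for (x, y), p in joint.items():
    pX[x] = pX.get(x, 0) + p
    pY[y] = pY.get(y, 0) + p
    pZ[x + y] = pZ.get(x + y, 0) + p

print(H(pX), H(pY), H(pZ))   # 1.0 1.0 0.0: H(X) > H(Z) and H(Y) > H(Z)
```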
By the Markov property, the past and the future are conditionally independent given
the present and hence all terms except the first are zero. Therefore
16. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks
down to k < n states, and then fans back to m > k states. Thus X 1 → X2 → X3 ,
i.e., p(x1 , x2 , x3 ) = p(x1 )p(x2 |x1 )p(x3 |x2 ) , for all x1 ∈ {1, 2, . . . , n} , x2 ∈ {1, 2, . . . , k} ,
x3 ∈ {1, 2, . . . , m} .
(a) Show that the dependence of X1 and X3 is limited by the bottleneck by proving
that I(X1 ; X3 ) ≤ log k.
(b) Evaluate I(X1 ; X3 ) for k = 1 , and conclude that no dependence can survive such
a bottleneck.
Solution:
Bottleneck.
(a) From the data processing inequality, and the fact that entropy is maximum for a
uniform distribution, we get
I(X1 ; X3 ) ≤ I(X1 ; X2 )
= H(X2 ) − H(X2 | X1 )
≤ H(X2 )
≤ log k.
Thus, the dependence between X1 and X3 is limited by the size of the bottleneck.
That is I(X1 ; X3 ) ≤ log k .
(b) For k = 1, I(X1; X3) ≤ log 1 = 0 and, since I(X1; X3) ≥ 0, we have I(X1; X3) = 0.
Thus, for k = 1 , X1 and X3 are independent.
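A numerical sketch of part (a) (randomly generated chain; the alphabet sizes n = 4, k = 2, m = 5 are illustrative assumptions):

```python
import numpy as np

# Check I(X1;X3) <= log k for a Markov chain X1 -> X2 -> X3 through a k-state bottleneck.
rng = np.random.default_rng(0)
n, k, m = 4, 2, 5
p1 = rng.dirichlet(np.ones(n))            # p(x1)
P12 = rng.dirichlet(np.ones(k), size=n)   # p(x2|x1), one row per x1
P23 = rng.dirichlet(np.ones(m), size=k)   # p(x3|x2)

p13 = np.zeros((n, m))                    # p(x1, x3) = sum_x2 p(x1) p(x2|x1) p(x3|x2)
for x1 in range(n):
    for x2 in range(k):
        p13[x1] += p1[x1] * P12[x1, x2] * P23[x2]

def mutual_information(pxy):
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

print(mutual_information(p13), "<=", np.log2(k))
```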
17. Pure randomness and bent coins. Let X 1 , X2 , . . . , Xn denote the outcomes of
independent flips of a bent coin. Thus Pr {X i = 1} = p, Pr {Xi = 0} = 1 − p ,
where p is unknown. We wish to obtain a sequence Z 1 , Z2 , . . . , ZK of fair coin flips
from X1 , X2 , . . . , Xn . Toward this end let f : X n → {0, 1}∗ , (where {0, 1}∗ =
{Λ, 0, 1, 00, 01, . . .} is the set of all finite length binary sequences), be a mapping
f(X_1, X_2, \ldots, X_n) = (Z_1, Z_2, \ldots, Z_K), where Z_i ∼ Bernoulli(1/2), and K may depend
on (X_1, \ldots, X_n). In order that the sequence Z_1, Z_2, \ldots appear to be fair coin flips, the
map f from bent coin flips to fair flips must have the property that all 2^k sequences
(Z_1, Z_2, \ldots, Z_k) of a given length k have equal probability (possibly 0), for k = 1, 2, \ldots.
For example, for n = 2, the map f(01) = 0, f(10) = 1, f(00) = f(11) = Λ (the null
string) has the property that Pr{Z_1 = 1 | K = 1} = Pr{Z_1 = 0 | K = 1} = 1/2.
Give reasons for the following inequalities:
(a)
nH(p) = H(X1 , . . . , Xn )
(b)
≥ H(Z1 , Z2 , . . . , ZK , K)
(c)
= H(K) + H(Z1 , . . . , ZK |K)
(d)
= H(K) + E(K)
(e)
≥ EK.
Thus no more than nH(p) fair coin tosses can be derived from (X 1 , . . . , Xn ) , on the
average. Exhibit a good map f on sequences of length 4.
Solution: Pure randomness and bent coins.
(a) nH(p) = H(X_1, \ldots, X_n), since the X_i are i.i.d. with entropy H(p) each.
(b) ≥ H(Z_1, Z_2, \ldots, Z_K), since Z_1, \ldots, Z_K is a function of X_1, \ldots, X_n, and the entropy of a function of a random variable is at most the entropy of the random variable.
(c) = H(Z_1, Z_2, \ldots, Z_K, K), since K is itself a function of (Z_1, \ldots, Z_K) (it is the length of the sequence), so including it adds no entropy.
(d) = H(K) + H(Z_1, \ldots, Z_K | K), by the chain rule for entropy.
(e) = H(K) + E(K), since given K = k the bits Z_1, \ldots, Z_k are k independent fair coin flips, so H(Z_1, \ldots, Z_K | K) = \sum_k p(k)\, k = EK.
(f) ≥ EK, since H(K) ≥ 0.
A good map f on sequences of length 4 is:

0000 → Λ
0001 → 00    0010 → 01    0100 → 10    1000 → 11
0011 → 00    0110 → 01    1100 → 10    1001 → 11        (2.25)
1010 → 0     0101 → 1
1110 → 11    1101 → 10    1011 → 01    0111 → 00
1111 → Λ
The resulting expected number of bits is

EK = 2(4pq^3) + 2(4p^2q^2) + 1(2p^2q^2) + 2(4p^3q) = 8pq^3 + 10p^2q^2 + 8p^3q.

For example, for p ≈ 1/2, the expected number of pure random bits is close to 26/16 = 1.625.
This is substantially less than the 4 pure random bits that could be generated if
p were exactly 1/2.
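The fairness property of this length-4 map can be verified exactly for any bias p. The sketch below (not part of the original solution) tabulates the output distribution and checks that all strings of a given length are equally likely:

```python
from itertools import product

f = {
    "0000": "", "1111": "",
    "0001": "00", "0010": "01", "0100": "10", "1000": "11",
    "0011": "00", "0110": "01", "1100": "10", "1001": "11",
    "1010": "0",  "0101": "1",
    "1110": "11", "1101": "10", "1011": "01", "0111": "00",
}

def output_dist(p):
    """Probability of each output string when X1..X4 are i.i.d. Bernoulli(p)."""
    dist = {}
    for x in product("01", repeat=4):
        s = "".join(x)
        prob = 1.0
        for c in s:
            prob *= p if c == "1" else 1 - p
        z = f[s]
        dist[z] = dist.get(z, 0) + prob
    return dist

d = output_dist(0.3)                      # any bias p works
assert abs(d["0"] - d["1"]) < 1e-12       # 1-bit outputs equally likely
two_bit = [d[z] for z in ("00", "01", "10", "11")]
assert max(two_bit) - min(two_bit) < 1e-12
print({k: round(v, 4) for k, v in d.items()})
```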
We will now analyze the efficiency of this scheme of generating random bits for long
sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm
that we will use is the obvious extension of the above method of generating pure
bits using the fact that all sequences with the same number of ones are equally
likely.
Consider all sequences with k ones. There are \binom{n}{k} such sequences, which are
all equally likely. If \binom{n}{k} were a power of 2, then we could generate \log \binom{n}{k} pure
random bits from such a set. However, in the general case, \binom{n}{k} is not a power of
2, and the best we can do is to divide the set of \binom{n}{k} elements into subsets of sizes
which are powers of 2. The largest subset would have size 2^{\lfloor \log \binom{n}{k} \rfloor} and could be
used to generate \lfloor \log \binom{n}{k} \rfloor random bits. We could divide the remaining elements
into the largest set which is a power of 2, etc. The worst case would occur when
\binom{n}{k} = 2^{l+1} - 1, in which case the subsets would be of sizes 2^l, 2^{l-1}, 2^{l-2}, \ldots, 1.
Instead of analyzing the scheme exactly, we will just find a lower bound on the number
of random bits generated from a set of size \binom{n}{k}. Let l = \lfloor \log \binom{n}{k} \rfloor. Then at least
half of the elements belong to a set of size 2^l and would generate l random bits, at
least a quarter belong to a set of size 2^{l-1} and generate l - 1 random bits, etc. On
the average, the number of bits generated is

E[K \mid k \text{ 1's in sequence}] \ge \frac{1}{2} l + \frac{1}{4}(l-1) + \cdots + \frac{1}{2^l} \cdot 1   (2.28)
 = l - \frac{1}{4}\left(1 + \frac{2}{2} + \frac{3}{4} + \frac{4}{8} + \cdots + \frac{l-1}{2^{l-2}}\right)   (2.29)
 \ge l - 1,   (2.30)
Now for sufficiently large n, the probability that the number of 1's in the sequence
is close to np is near 1 (by the weak law of large numbers). For such sequences,
k/n is close to p and hence there exists a δ such that

\binom{n}{k} \ge 2^{n(H(k/n) - \delta)} \ge 2^{n(H(p) - 2\delta)}   (2.35)

using Stirling's approximation for the binomial coefficients and the continuity of
the entropy function. If we assume that n is large enough so that the probability
that n(p − ε) ≤ k ≤ n(p + ε) is greater than 1 − ε, then we see that EK ≥
(1 − ε) n(H(p) − 2δ) − 2, which is very good since nH(p) is an upper bound on the
number of pure random bits that can be produced from the bent coin sequence.
18. World Series. The World Series is a seven-game series that terminates as soon as
either team wins four games. Let X be the random variable that represents the outcome
of a World Series between teams A and B; possible values of X are AAAA, BABABAB,
and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7.
Assuming that A and B are equally matched and that the games are independent,
calculate H(X) , H(Y ) , H(Y |X) , and H(X|Y ) .
Solution:
World Series. Two teams play until one of them has won 4 games.
There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)^4.
There are 8 = 2\binom{4}{3} World Series with 5 games. Each happens with probability (1/2)^5.
There are 20 = 2\binom{5}{3} World Series with 6 games. Each happens with probability (1/2)^6.
There are 40 = 2\binom{6}{3} World Series with 7 games. Each happens with probability (1/2)^7.
H(X) = \sum_x p(x) \log \frac{1}{p(x)}
     = 2(1/16) \log 16 + 8(1/32) \log 32 + 20(1/64) \log 64 + 40(1/128) \log 128
     = 5.8125 \text{ bits.}

H(Y) = \sum_y p(y) \log \frac{1}{p(y)}
     = (1/8) \log 8 + (1/4) \log 4 + (5/16) \log(16/5) + (5/16) \log(16/5)
     = 1.924 \text{ bits.}

Since Y is a deterministic function of X, H(Y|X) = 0, and therefore
H(X|Y) = H(X) + H(Y|X) − H(Y) = 5.8125 − 1.924 = 3.889 bits.
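These values can be confirmed by enumerating all series outcomes; the following sketch (not from the text) does so directly:

```python
from itertools import product
from math import log2

outcomes = {}                    # series string (e.g. "AAAA") -> probability
for seq in product("AB", repeat=7):
    a = b = 0
    series = []
    for g in seq:
        series.append(g)
        a += g == "A"
        b += g == "B"
        if a == 4 or b == 4:
            break
    s = "".join(series)
    outcomes[s] = outcomes.get(s, 0) + (1 / 2) ** 7   # each 7-game pattern has prob 2^-7

def H(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

lengths = {}
for s, p in outcomes.items():
    lengths[len(s)] = lengths.get(len(s), 0) + p

print(H(outcomes))   # H(X) = 5.8125
print(H(lengths))    # H(Y) ≈ 1.924
```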
19. Infinite entropy. This problem shows that the entropy of a discrete random variable
can be infinite. Let A = \sum_{n=2}^{\infty} (n \log^2 n)^{-1}. (It is easy to show that A is finite by
bounding the infinite sum by the integral of (x \log^2 x)^{-1}.) Show that the integer-
valued random variable X defined by Pr(X = n) = (A n \log^2 n)^{-1} for n = 2, 3, \ldots,
has H(X) = +∞.
Solution: Infinite entropy. By definition, p_n = Pr(X = n) = 1/(A n \log^2 n) for n ≥ 2.
Therefore

H(X) = -\sum_{n=2}^{\infty} p(n) \log p(n)
     = -\sum_{n=2}^{\infty} \frac{1}{A n \log^2 n} \log\!\left(\frac{1}{A n \log^2 n}\right)
     = \sum_{n=2}^{\infty} \frac{\log(A n \log^2 n)}{A n \log^2 n}
     = \sum_{n=2}^{\infty} \frac{\log A + \log n + 2 \log\log n}{A n \log^2 n}
     = \log A + \sum_{n=2}^{\infty} \frac{1}{A n \log n} + \sum_{n=2}^{\infty} \frac{2 \log\log n}{A n \log^2 n}.
The first term is finite. For base 2 logarithms, all the elements in the sum in the last
term are nonnegative. (For any other base, the terms of the last sum eventually all
become positive.) So all we have to do is bound the middle sum, which we do by
comparing with an integral.
\sum_{n=2}^{\infty} \frac{1}{A n \log n} > \int_2^{\infty} \frac{1}{A x \log x}\, dx = K \ln\ln x \Big|_2^{\infty} = +\infty.
X1 , X2 , . . . , Xn . Hence
21. Markov’s inequality for probabilities. Let p(x) be a probability mass function.
Prove, for all d ≥ 0 ,
Pr\{p(X) \le d\} \log\frac{1}{d} \le H(X).   (2.40)
Solution:

Pr\{p(X) \le d\} \log\frac{1}{d} = \sum_{x:\, p(x) \le d} p(x) \log\frac{1}{d}   (2.41)
 \le \sum_{x:\, p(x) \le d} p(x) \log\frac{1}{p(x)}   (2.42)
 \le \sum_x p(x) \log\frac{1}{p(x)}   (2.43)
 = H(X)   (2.44)
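A quick numerical check of inequality (2.40) on an illustrative pmf (the distribution and the values of d below are assumptions, chosen only for the demonstration):

```python
from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = -sum(q * log2(q) for q in p.values())

for d in (0.05, 0.125, 0.3, 0.6):
    lhs = sum(q for q in p.values() if q <= d) * log2(1 / d)
    assert lhs <= H + 1e-12
    print(f"d={d}: {lhs:.3f} <= H(X) = {H:.3f}")
```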
22. Logical order of ideas. Ideas have been developed in order of need, and then gener-
alized if necessary. Reorder the following ideas, strongest first, implications following:
(a) Chain rule for I(X1 , . . . , Xn ; Y ) , chain rule for D(p(x1 , . . . , xn )||q(x1 , x2 , . . . , xn )) ,
and chain rule for H(X1 , X2 , . . . , Xn ) .
(b) D(f ||g) ≥ 0 , Jensen’s inequality, I(X; Y ) ≥ 0 .
Solution:
(a) The following orderings are subjective. Since I(X; Y ) = D(p(x, y)||p(x)p(y)) is a
special case of relative entropy, it is possible to derive the chain rule for I from
the chain rule for D .
Since H(X) = I(X; X) , it is possible to derive the chain rule for H from the
chain rule for I .
It is also possible to derive the chain rule for I from the chain rule for H as was
done in the notes.
(b) In class, Jensen’s inequality was used to prove the non-negativity of D . The
inequality I(X; Y ) ≥ 0 followed as a special case of the non-negativity of D .
24. Average entropy. Let H(p) = −p log 2 p − (1 − p) log2 (1 − p) be the binary entropy
function.
(a) Evaluate H(1/4) using the fact that log 2 3 ≈ 1.584 . Hint: You may wish to
consider an experiment with four equally likely outcomes, one of which is more
interesting than the others.
(b) Calculate the average entropy H(p) when the probability p is chosen uniformly
in the range 0 ≤ p ≤ 1 .
(c) (Optional) Calculate the average entropy H(p 1 , p2 , p3 ) where (p1 , p2 , p3 ) is a uni-
formly distributed probability vector. Generalize to dimension n .
Solution:
(a) We can generate two bits of information by picking one of four equally likely
alternatives. This selection can be made in two steps. First we decide whether the
first outcome occurs. Since this has probability 1/4, the information generated
is H(1/4). If not the first outcome, then we select one of the three remaining
outcomes; with probability 3/4, this produces log2 3 bits of information. Thus
2 = H(1/4) + (3/4) log2 3, so H(1/4) = 2 − (3/4) log2 3 ≈ 2 − 1.189 = 0.811 bits.
(b) If p is chosen uniformly in the range 0 ≤ p ≤ 1, then the average entropy (in
nats) is

\int_0^1 -\bigl[p \ln p + (1-p) \ln(1-p)\bigr]\, dp = -2 \int_0^1 x \ln x\, dx = -2\left[\frac{x^2}{2}\ln x - \frac{x^2}{4}\right]_0^1 = \frac{1}{2},

that is, 1/(2 ln 2) ≈ 0.721 bits.
(c) After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.
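A Monte Carlo sketch (not part of the original solution) reproduces both averages; the simplex sampling via uniform spacings is one standard choice:

```python
import random
from math import log, log2

def H(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

random.seed(1)
N = 200_000

# (b): p uniform on [0, 1]
avg2 = sum(H((p, 1 - p)) for p in (random.random() for _ in range(N))) / N

# (c): (p1, p2, p3) uniform on the simplex, via the spacings of two uniform points
def uniform_simplex3():
    u, v = sorted((random.random(), random.random()))
    return (u, v - u, 1 - v)

avg3 = sum(H(uniform_simplex3()) for _ in range(N)) / N
print(avg2, 1 / (2 * log(2)))    # both ≈ 0.721 bits
print(avg3, 5 / (6 * log(2)))    # both ≈ 1.202 bits
```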
25. Venn diagrams. There isn't really a notion of mutual information common to three
random variables. Here is one attempt at a definition: Using Venn diagrams, we can
see that the mutual information common to three random variables X , Y and Z can
be defined by
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) .
This quantity is symmetric in X , Y and Z , despite the preceding asymmetric defi-
nition. Unfortunately, I(X; Y ; Z) is not necessarily nonnegative. Find X , Y and Z
such that I(X; Y ; Z) < 0 , and prove the following two identities:
(a) I(X; Y ; Z) = H(X, Y, Z) − H(X) − H(Y ) − H(Z) + I(X; Y ) + I(Y ; Z) + I(Z; X)
(b) I(X; Y ; Z) = H(X, Y, Z)− H(X, Y )− H(Y, Z)− H(Z, X)+ H(X)+ H(Y )+ H(Z)
The first identity can be understood using the Venn diagram analogy for entropy and
mutual information. The second identity follows easily from the first.
Solution: Venn Diagrams. To show the first identity,
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) by definition
= I(X; Y ) − (I(X; Y, Z) − I(X; Z)) by chain rule
= I(X; Y ) + I(X; Z) − I(X; Y, Z)
= I(X; Y ) + I(X; Z) − (H(X) + H(Y, Z) − H(X, Y, Z))
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − H(Y, Z)
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − (H(Y ) + H(Z) − I(Y ; Z))
= I(X; Y ) + I(X; Z) + I(Y ; Z) + H(X, Y, Z) − H(X) − H(Y ) − H(Z).
To show the second identity, simply substitute for I(X; Y ) , I(X; Z) , and I(Y ; Z)
using equations like
I(X; Y ) = H(X) + H(Y ) − H(X, Y ) .
These two identities show that I(X; Y ; Z) is a symmetric (but not necessarily nonneg-
ative) function of three random variables.
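A concrete sketch of the negativity claim uses the standard example of X and Y independent fair bits with Z = X ⊕ Y (this particular choice is an assumption for illustration; the text above does not specify one):

```python
from itertools import product
from math import log2

joint = {}
for x, y in product((0, 1), repeat=2):
    joint[(x, y, x ^ y)] = 0.25           # Z = X XOR Y

def H(margin):
    """Entropy of the marginal over the given coordinate indices."""
    pm = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in margin)
        pm[key] = pm.get(key, 0) + p
    return -sum(p * log2(p) for p in pm.values() if p > 0)

I_xy   = H((0,)) + H((1,)) - H((0, 1))                      # I(X;Y) = 0
I_xy_z = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))     # I(X;Y|Z) = 1
print(I_xy - I_xy_z)                                        # -1: I(X;Y;Z) < 0
```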
-D(p||q) = \sum_x p(x) \ln\frac{q(x)}{p(x)}   (2.45)
 \le \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right)   (2.46)
 \le 0   (2.47)
f(x) = x - 1 - \ln x   (2.48)

for 0 < x < ∞. Then f'(x) = 1 − 1/x and f''(x) = 1/x^2 > 0, and therefore f(x)
is strictly convex. Therefore a local minimum of the function is also a global
minimum. The function has a local minimum at the point where f'(x) = 0, i.e.,
when x = 1. Therefore f(x) ≥ f(1), i.e.,

x - 1 - \ln x \ge 1 - 1 - \ln 1 = 0   (2.49)
-D_e(p||q) = \sum_{x \in A} p(x) \ln\frac{q(x)}{p(x)}   (2.50)
 \le \sum_{x \in A} p(x)\left(\frac{q(x)}{p(x)} - 1\right)   (2.51)
 = \sum_{x \in A} q(x) - \sum_{x \in A} p(x)   (2.52)
 \le 0   (2.53)

The first step follows from the definition of D, the second step follows from the
inequality ln t ≤ t − 1, the third step from expanding the sum, and the last step
from the fact that q(A) ≤ 1 and p(A) = 1.
Solution:
H(p) = -\sum_{i=1}^{m} p_i \log p_i   (2.55)
 = -\sum_{i=1}^{m-2} p_i \log p_i - p_{m-1} \log p_{m-1} - p_m \log p_m   (2.56)
 = -\sum_{i=1}^{m-2} p_i \log p_i - p_{m-1} \log \frac{p_{m-1}}{p_{m-1}+p_m} - p_m \log \frac{p_m}{p_{m-1}+p_m}   (2.57)
   \quad - (p_{m-1}+p_m) \log(p_{m-1}+p_m)   (2.58)
 = H(q) - p_{m-1} \log \frac{p_{m-1}}{p_{m-1}+p_m} - p_m \log \frac{p_m}{p_{m-1}+p_m}   (2.59)
 = H(q) - (p_{m-1}+p_m) \left( \frac{p_{m-1}}{p_{m-1}+p_m} \log \frac{p_{m-1}}{p_{m-1}+p_m} + \frac{p_m}{p_{m-1}+p_m} \log \frac{p_m}{p_{m-1}+p_m} \right)   (2.60)
 = H(q) + (p_{m-1}+p_m) H_2\!\left( \frac{p_{m-1}}{p_{m-1}+p_m}, \frac{p_m}{p_{m-1}+p_m} \right),   (2.61)

where H_2(a, b) = -a \log a - b \log b.
28. Mixing increases entropy. Show that the entropy of the probability distribution,
(p1 , . . . , pi , . . . , pj , . . . , pm ) , is less than the entropy of the distribution
(p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_i + p_j}{2}, \ldots, p_m). Show that in general any transfer of probability that
makes the distribution more uniform increases the entropy.
Solution:
Mixing increases entropy.
This problem depends on the convexity of the log function. Let
P_1 = (p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m)
P_2 = (p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_j + p_i}{2}, \ldots, p_m)
Thus,
H(P2 ) ≥ H(P1 ).
29. Inequalities. Let X , Y and Z be joint random variables. Prove the following
inequalities and find conditions for equality.
Solution: Inequalities.
with equality iff H(Y |X, Z) = 0 , that is, when Y is a function of X and Z .
(b) Using the chain rule for mutual information,
with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
pendent given X .
(c) Using first the chain rule for entropy and then the definition of conditional mutual
information,
with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
pendent given X .
(d) Using the chain rule for mutual information,
and therefore
I(X; Z|Y ) = I(Z; Y |X) − I(Z; Y ) + I(X; Z) .
We see that this inequality is actually an equality in all cases.
30. Maximum entropy. Find the probability mass function p(x) that maximizes the
entropy H(X) of a non-negative integer-valued random variable X subject to the
constraint
EX = \sum_{n=0}^{\infty} n\, p(n) = A.
Notice that the final right hand side expression is independent of {p i } , and that the
inequality,
-\sum_{i=0}^{\infty} p_i \log p_i \le -\log \alpha - A \log \beta
(a) Find the minimum probability of error estimator X̂(Y ) and the associated Pe .
(b) Evaluate Fano’s inequality for this problem and compare.
Solution:
Hence the associated Pe is the sum of P (1, b), P (1, c), P (2, a), P (2, c), P (3, a)
and P (3, b). Therefore, Pe = 1/2.
(b) From Fano's inequality we know

P_e \ge \frac{H(X|Y) - 1}{\log |X|}.

Here, H(X|Y) = 1.5 bits. Hence

P_e \ge \frac{1.5 - 1}{\log 3} = 0.316.

Hence our estimator X̂(Y) is not very close to Fano's bound in this form. If
X̂ ∈ X, as it does here, we can use the stronger form of Fano's inequality to get

P_e \ge \frac{H(X|Y) - 1}{\log(|X| - 1)},

and

P_e \ge \frac{1.5 - 1}{\log 2} = \frac{1}{2}.
Therefore our estimator X̂(Y ) is actually quite good.
which is the unconditional form of Fano’s inequality. We can weaken this inequality to
obtain an explicit lower bound for Pe ,
P_e \ge \frac{H(X) - 1}{\log(m - 1)}.   (2.67)
34. Entropy of initial conditions. Prove that H(X 0 |Xn ) is non-decreasing with n for
any Markov chain.
Solution: Entropy of initial conditions. For a Markov chain, by the data processing
theorem, we have
I(X0 ; Xn−1 ) ≥ I(X0 ; Xn ). (2.68)
Therefore
H(X0 ) − H(X0 |Xn−1 ) ≥ H(X0 ) − H(X0 |Xn ) (2.69)
or H(X0|Xn) ≥ H(X0|Xn−1); that is, H(X0|Xn) is non-decreasing with n.
35. Relative entropy is not symmetric: Let the random variable X have three possible
outcomes {a, b, c} . Consider two distributions on this random variable
Symbol p(x) q(x)
a 1/2 1/3
b 1/4 1/3
c 1/4 1/3
Calculate H(p), H(q), D(p||q) and D(q||p). Verify that in this case D(p||q) ≠ D(q||p).
Solution:
H(p) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{4}\log 4 = 1.5 \text{ bits.}   (2.70)

H(q) = \frac{1}{3}\log 3 + \frac{1}{3}\log 3 + \frac{1}{3}\log 3 = \log 3 = 1.58496 \text{ bits.}   (2.71)

D(p||q) = \frac{1}{2}\log\frac{3}{2} + \frac{1}{4}\log\frac{3}{4} + \frac{1}{4}\log\frac{3}{4} = \log 3 - 1.5 = 1.58496 - 1.5 = 0.08496 \text{ bits.}   (2.72)

D(q||p) = \frac{1}{3}\log\frac{2}{3} + \frac{1}{3}\log\frac{4}{3} + \frac{1}{3}\log\frac{4}{3} = \frac{5}{3} - \log 3 = 1.66666 - 1.58496 = 0.08170 \text{ bits.}   (2.73)
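A two-line numerical check of (2.72) and (2.73) (a sketch, using the same two distributions):

```python
from math import log2

p = {"a": 1/2, "b": 1/4, "c": 1/4}
q = {"a": 1/3, "b": 1/3, "c": 1/3}

def D(p, q):
    return sum(p[x] * log2(p[x] / q[x]) for x in p)

print(D(p, q), D(q, p))   # ≈ 0.08496 and 0.08170 bits: not equal
```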
36. Symmetric relative entropy: Though, as the previous example shows, D(p||q) ≠
D(q||p) in general, there could be distributions for which equality holds. Give an
example of two distributions p and q on a binary alphabet such that D(p||q) = D(q||p)
(other than the trivial case p = q ).
Solution:
A simple case for D((p, 1−p)||(q, 1−q)) = D((q, 1−q)||(p, 1−p)), i.e., for

p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q} = q \log\frac{q}{p} + (1-q) \log\frac{1-q}{1-p},   (2.74)

is when q = 1 − p.
37. Relative entropy: Let X, Y, Z be three random variables with a joint probability
mass function p(x, y, z) . The relative entropy between the joint distribution and the
product of the marginals is
D(p(x, y, z)||p(x)p(y)p(z)) = E\!\left[\log \frac{p(x, y, z)}{p(x)p(y)p(z)}\right]   (2.75)
Expand this in terms of entropies. When is this quantity zero?
Solution:
D(p(x, y, z)||p(x)p(y)p(z)) = E\!\left[\log \frac{p(x, y, z)}{p(x)p(y)p(z)}\right]   (2.76)
 = E[\log p(x, y, z)] - E[\log p(x)] - E[\log p(y)] - E[\log p(z)]   (2.77)
 = -H(X, Y, Z) + H(X) + H(Y) + H(Z)   (2.78)

This quantity is zero if and only if p(x, y, z) = p(x)p(y)p(z) for all (x, y, z), i.e., if and only if X, Y and Z are mutually independent.
(a) Under this constraint, what is the minimum value for H(X, Y, Z) ?
(b) Give an example achieving this minimum.
Solution:
(a)
Solution:
(a)
42. Inequalities. Which of the following inequalities are generally ≥, =, ≤ ? Label each
with ≥, =, or ≤ .
Solution:
(a) X → 5X is a one-to-one mapping, and hence H(X) = H(5X).
(b) By data processing inequality, I(g(X); Y ) ≤ I(X; Y ) .
(c) Because conditioning reduces entropy, H(X 0 |X−1 ) ≥ H(X0 |X−1 , X1 ) .
(d) H(X, Y ) ≤ H(X) + H(Y ) , so H(X, Y )/(H(X) + H(Y )) ≤ 1 .
43. Mutual information of heads and tails.
(a) Consider a fair coin flip. What is the mutual information between the top side
and the bottom side of the coin?
(b) A 6-sided fair die is rolled. What is the mutual information between the top side
and the front face (the side most facing you)?
Solution:
Mutual information of heads and tails.
To prove (a) observe that
I(T ; B) = H(B) − H(B|T )
= log 2 = 1
since B ∼ Ber(1/2) , and B = f (T ) . Here B, T stand for Bottom and Top respectively.
To prove (b) note that having observed a side of the cube facing us F , there are four
possibilities for the top T , which are equally probable. Thus,
I(T ; F ) = H(T ) − H(T |F )
= log 6 − log 4
= log 3 − 1
since T has uniform distribution on {1, 2, . . . , 6} .
(a) How would you use 2 independent flips X1, X2 to generate (if possible) a Bernoulli(1/2)
random variable Z?
(b) What is the resulting maximum expected number of fair bits generated?
Solution:
(a) The trick here is to notice that for any two letters Y and Z produced by two
independent tosses of our bent three-sided coin, Y Z has the same probability as
Z Y. So we can produce B ∼ Bernoulli(1/2) coin flips by letting B = 0 when we
get AB, BC or AC, and B = 1 when we get BA, CB or CA (if we get AA,
BB or CC we don't assign a value to B).
(b) The expected number of bits generated by the above scheme is as follows. We get
one bit, except when the two flips of the 3-sided coin produce the same symbol.
So the expected number of fair bits generated per pair of flips is the probability
that the two flips differ, i.e., 1 − \sum_i p_i^2, where the p_i are the three face probabilities.
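A simulation sketch of the scheme in (a); the face probabilities below are illustrative assumptions, since the actual values are not given in this excerpt:

```python
import random
from collections import Counter

random.seed(0)
pA, pB, pC = 0.5, 0.3, 0.2               # assumed face probabilities (illustrative only)
faces, weights = "ABC", (pA, pB, pC)

bits = []
for _ in range(200_000):
    y, z = random.choices(faces, weights=weights, k=2)   # two independent flips
    if y != z:
        # one fair bit per unequal pair: (y, z) and (z, y) are equally likely
        bits.append(0 if (y, z) in {("A", "B"), ("B", "C"), ("A", "C")} else 1)

counts = Counter(bits)
print(counts[0] / len(bits), counts[1] / len(bits))       # both ≈ 0.5
print(len(bits) / 200_000, 1 - (pA**2 + pB**2 + pC**2))   # bits per pair ≈ 1 - sum p_i^2
```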
45. Finite entropy. Show that for a discrete random variable X ∈ {1, 2, . . .} , if E log X <
∞ , then H(X) < ∞ .
Solution: Let the distribution on the integers be p_1, p_2, \ldots. Then H(p) = -\sum_i p_i \log p_i
and E \log X = \sum_i p_i \log i = c < ∞.
We will now find the maximum entropy distribution subject to the constraint on the
expected logarithm. Using Lagrange multipliers or the results of Chapter 12, we have
the following functional to optimize
J(p) = -\sum_i p_i \log p_i - \lambda_1 \sum_i p_i - \lambda_2 \sum_i p_i \log i   (2.84)
Differentiating with respect to p_i and setting to zero, we find that the p_i that maximizes
the entropy is of the form p_i = a\, i^{\lambda}, where a = 1/(\sum_i i^{\lambda}) and λ is chosen to meet
the expected-log constraint, i.e.,

\sum_i i^{\lambda} \log i = c \sum_i i^{\lambda}   (2.85)
Using this value of pi , we can see that the entropy is finite.
46. Axiomatic definition of entropy. If we assume certain axioms for our measure of
information, then we will be forced to use a logarithmic measure like entropy. Shannon
used this to justify his initial definition of entropy. In this book, we will rely more on
the other properties of entropy rather than its axiomatic derivation to justify its use.
The following problem is considerably more difficult than the other problems in this
section.
If a sequence of symmetric functions H m (p1 , p2 , . . . , pm ) satisfies the following proper-
ties,
• Normalization: H_2\!\left(\frac{1}{2}, \frac{1}{2}\right) = 1,
• Continuity: H_2(p, 1 − p) is a continuous function of p,
• Grouping: H_m(p_1, p_2, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2) H_2\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right),

then H_m must be of the form H_m(p_1, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log_2 p_i, for m = 2, 3, \ldots.
There are various other axiomatic formulations which also result in the same definition
of entropy. See, for example, the book by Csiszár and Körner[4].
Solution: Axiomatic definition of entropy. This is a long solution, so we will first
outline what we plan to do. First we will extend the grouping axiom by induction and
prove that

H_m(p_1, \ldots, p_m) = H_{m-k+1}(S_k, p_{k+1}, \ldots, p_m) + S_k H_k\!\left(\frac{p_1}{S_k}, \ldots, \frac{p_k}{S_k}\right), \quad \text{where } S_k = \sum_{i=1}^{k} p_i,

and we will denote H_2(q, 1 − q) as h(q). Then we can write the grouping axiom as
H_m(p_1, \ldots, p_m) = H_{m-1}(S_2, p_3, \ldots, p_m) + S_2\, h\!\left(\frac{p_2}{S_2}\right).   (2.90)
Now we apply the same grouping axiom repeatedly to H_k(p_1/S_k, \ldots, p_k/S_k) to obtain

H_k\!\left(\frac{p_1}{S_k}, \ldots, \frac{p_k}{S_k}\right) = H_2\!\left(\frac{S_{k-1}}{S_k}, \frac{p_k}{S_k}\right) + \sum_{i=2}^{k-1} \frac{S_i}{S_k}\, h\!\left(\frac{p_i/S_k}{S_i/S_k}\right)   (2.95)
 = \frac{1}{S_k} \sum_{i=2}^{k} S_i\, h\!\left(\frac{p_i}{S_i}\right).   (2.96)
f(m+1) = H_{m+1}\!\left(\frac{1}{m+1}, \ldots, \frac{1}{m+1}\right)   (2.107)
 = h\!\left(\frac{1}{m+1}\right) + \frac{m}{m+1} H_m\!\left(\frac{1}{m}, \ldots, \frac{1}{m}\right)   (2.108)
 = h\!\left(\frac{1}{m+1}\right) + \frac{m}{m+1} f(m),   (2.109)

and therefore

f(m+1) - \frac{m}{m+1} f(m) = h\!\left(\frac{1}{m+1}\right).   (2.110)

Thus \lim \left[ f(m+1) - \frac{m}{m+1} f(m) \right] = \lim h\!\left(\frac{1}{m+1}\right). But by the continuity of H_2, it follows
that the limit on the right is h(0) = 0. Thus \lim h\!\left(\frac{1}{m+1}\right) = 0.
Let us define

a_{n+1} = f(n+1) - f(n)   (2.111)

and

b_n = h\!\left(\frac{1}{n}\right).   (2.112)

Then

a_{n+1} = -\frac{1}{n+1} f(n) + b_{n+1}   (2.113)
 = -\frac{1}{n+1} \sum_{i=2}^{n} a_i + b_{n+1}   (2.114)

and therefore

(n+1) b_{n+1} = (n+1) a_{n+1} + \sum_{i=2}^{n} a_i.   (2.115)

\sum_{n=2}^{N} n b_n = \sum_{n=2}^{N} (n a_n + a_{n-1} + \cdots + a_2) = N \sum_{n=2}^{N} a_n.   (2.116)
Dividing both sides by \sum_{n=1}^{N} n = N(N+1)/2, we obtain

\frac{2}{N+1} \sum_{n=2}^{N} a_n = \frac{\sum_{n=2}^{N} n b_n}{\sum_{n=2}^{N} n}.   (2.117)
Lemma 2.0.1 Let the function f (m) satisfy the following assumptions:
g(n) = f(n) - \frac{f(P) \log_2 n}{\log_2 P}.   (2.120)

Then g(n) satisfies the first assumption of the lemma. Also g(P) = 0.
Also, if we let

\alpha_n = g(n+1) - g(n) = f(n+1) - f(n) + \frac{f(P)}{\log_2 P} \log_2 \frac{n}{n+1},   (2.121)
n = n^{(1)} P + l   (2.123)
where 0 ≤ l < P. From the fact that g(P) = 0, it follows that g(P n^{(1)}) = g(n^{(1)}),
and

g(n) = g(n^{(1)}) + g(n) - g(P n^{(1)}) = g(n^{(1)}) + \sum_{i = P n^{(1)}}^{n-1} \alpha_i.   (2.124)

Just as we have defined n^{(1)} from n, we can define n^{(2)} from n^{(1)}. Continuing this
process, we can then write

g(n) = g(n^{(k)}) + \sum_{j=1}^{k} \sum_{i = P n^{(j)}}^{n^{(j-1)} - 1} \alpha_i.   (2.125)
Since P was arbitrary, it follows that f (P )/ log 2 P = c for every prime number P .
Applying the third axiom in the lemma, it follows that the constant is 1, and f (P ) =
log2 P .
For composite numbers N = P1 P2 . . . Pl , we can apply the first property of f and the
prime number factorization of N to show that
f(N) = \sum_i f(P_i) = \sum_i \log_2 P_i = \log_2 N.   (2.130)
For any integer m, let r > 0 be another integer and let 2^k ≤ m^r < 2^{k+1}. Then by
the monotonicity assumption on f, we have

c\,\frac{k}{r} \le f(m) < c\,\frac{k+1}{r}.   (2.132)

Now by the monotonicity of log, we have

\frac{k}{r} \le \log_2 m < \frac{k+1}{r}.   (2.133)

Combining these two equations, we obtain

\left| \frac{f(m)}{c} - \log_2 m \right| < \frac{1}{r}.   (2.134)

Since r was arbitrary, we must have

f(m) = c \log_2 m,   (2.135)

and we can identify c = 1 from the last assumption of the lemma.
Now we are almost done. We have shown that for any uniform distribution on m
outcomes, f (m) = Hm (1/m, . . . , 1/m) = log 2 m .
We will now show that

H_2(p, 1-p) = -p \log_2 p - (1-p) \log_2 (1-p).   (2.136)
To begin, let p be a rational number, r/s, say. Consider the extended grouping axiom
for H_s:

f(s) = H_s\!\left(\frac{1}{s}, \ldots, \frac{1}{s}\right) = H\!\left(\underbrace{\frac{1}{s}, \ldots, \frac{1}{s}}_{r}, \frac{s-r}{s}\right) + \frac{s-r}{s} f(s-r)   (2.137)
 = H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) + \frac{r}{s} f(r) + \frac{s-r}{s} f(s-r).   (2.138)

Substituting f(s) = \log_2 s, etc., we obtain

H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) = -\frac{r}{s} \log_2 \frac{r}{s} - \left(1 - \frac{r}{s}\right) \log_2\!\left(1 - \frac{r}{s}\right).   (2.139)
Thus (2.136) is true for rational p . By the continuity assumption, (2.136) is also true
at irrational p .
To complete the proof, we have to extend the definition from H_2 to H_m, i.e., we have
to show that

H_m(p_1, \ldots, p_m) = -\sum_i p_i \log p_i   (2.140)
for all m . This is a straightforward induction. We have just shown that this is true for
m = 2 . Now assume that it is true for m = n − 1 . By the grouping axiom,
Thus the statement is true for m = n , and by induction, it is true for all m . Thus we
have finally proved that the only symmetric function that satisfies the axioms is
H_m(p_1, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log p_i.   (2.146)
The heart of this problem is simply carefully counting the possible outcome states.
There are n ways to choose which card gets mis-sorted, and, once the card is chosen,
there are again n ways to choose where the card is replaced in the deck. Each of these
shuffling actions has probability 1/n^2. Unfortunately, not all of these n^2 actions result
in a unique mis-sorted file. So we need to carefully count the number of distinguishable
outcome states. The resulting deck can only take on one of the following three cases.
To compute the entropy of the resulting deck, we need to know the probability of each
case.
Case 1 (resulting deck is the same as the original): There are n ways to achieve this
outcome state, one for each of the n cards in the deck. Thus, the probability associated
with case 1 is n/n^2 = 1/n.
Case 2 (adjacent pair swapping): There are n − 1 adjacent pairs, each of which will
have a probability of 2/n^2, since for each pair, there are two ways to achieve the swap,
either by selecting the left-hand card and moving it one to the right, or by selecting the
right-hand card and moving it one to the left.
Case 3 (typical situation): None of the remaining actions "collapses". They all result
in unique outcome states, each with probability 1/n^2. Of the n^2 possible shuffling
actions, n^2 − n − 2(n − 1) of them result in this third case (we've simply subtracted
the case 1 and case 2 situations above).
The entropy of the resulting deck can be computed as follows:

H(X) = \frac{1}{n}\log n + (n-1)\,\frac{2}{n^2}\log\frac{n^2}{2} + (n^2 - 3n + 2)\,\frac{1}{n^2}\log n^2
     = \frac{2n-1}{n}\log n - \frac{2(n-1)}{n^2}.
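The formula can be verified by brute force for small n. The sketch below (not part of the original solution) enumerates all n^2 equally likely (card, position) actions and computes the entropy of the resulting deck:

```python
from math import log2

def deck_entropy(n):
    counts = {}
    deck = tuple(range(n))
    for card in range(n):
        for pos in range(n):
            rest = deck[:card] + deck[card + 1:]           # remove the chosen card
            new = rest[:pos] + (deck[card],) + rest[pos:]  # reinsert it at position pos
            counts[new] = counts.get(new, 0) + 1
    total = n * n
    return -sum((c / total) * log2(c / total) for c in counts.values())

n = 6
formula = (2 * n - 1) / n * log2(n) - 2 * (n - 1) / n**2
print(deck_entropy(n), formula)    # the two values agree
```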
Let’s now consider a different stopping time. For this part, again assume X i ∼ Bernoulli (1/2)
but stop at time N = 6 , with probability 1/3 and stop at time N = 12 with probability
2/3. Let this stopping time be independent of the sequence X 1 X2 . . . X12 .
Solution:
(a)
where (a) comes from the fact that the entropy of a geometric random variable is
just the mean.
H(X N |N ) = 0.
(c)
(d)
(e)
H(X^N | N) = \frac{1}{3} H(X^6 | N = 6) + \frac{2}{3} H(X^{12} | N = 12)
 = \frac{1}{3} H(X^6) + \frac{2}{3} H(X^{12})
 = \frac{1}{3}\cdot 6 + \frac{2}{3}\cdot 12
 = 10.
(f)
(a) (Markov’s inequality.) For any non-negative random variable X and any t > 0 ,
show that
Pr\{X \ge t\} \le \frac{EX}{t}.   (3.1)
Exhibit a random variable that achieves this inequality with equality.
(b) (Chebyshev's inequality.) Let Y be a random variable with mean µ and variance
σ^2. By letting X = (Y − µ)^2, show that for any ε > 0,

Pr\{|Y - \mu| > \epsilon\} \le \frac{\sigma^2}{\epsilon^2}.   (3.2)
(c) (The weak law of large numbers.) Let Z_1, Z_2, \ldots, Z_n be a sequence of i.i.d. random
variables with mean µ and variance σ^2. Let \bar{Z}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i be the sample mean.
Show that

Pr\left\{ \left| \bar{Z}_n - \mu \right| > \epsilon \right\} \le \frac{\sigma^2}{n \epsilon^2}.   (3.3)

Thus Pr\{|\bar{Z}_n - \mu| > \epsilon\} → 0 as n → ∞. This is known as the weak law of large
numbers.
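A short simulation sketch (Bernoulli(1/2) samples; the sample sizes and ε are illustrative choices) shows the empirical exceedance probability falling with n and staying below the Chebyshev bound σ²/(nε²):

```python
import random

random.seed(0)
mu, var, eps, trials = 0.5, 0.25, 0.05, 2000

for n in (10, 100, 1000):
    exceed = 0
    for _ in range(trials):
        zbar = sum(random.random() < 0.5 for _ in range(n)) / n   # sample mean
        exceed += abs(zbar - mu) > eps
    print(n, exceed / trials, "<=", var / (n * eps * eps))
```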