Ict Solution
1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number
of flips required.
(a) Find the entropy H(X) in bits. The following expressions may be useful:
\sum_{n=0}^{\infty} r^n = \frac{1}{1-r}, \qquad \sum_{n=0}^{\infty} n r^n = \frac{r}{(1-r)^2}.
Solution:
(a) The number X of tosses till the first head appears has the geometric distribution
with parameter p = 1/2, where P(X = n) = p q^{n-1}, n ∈ {1, 2, . . .}. Hence the
entropy of X is
H(X) = -\sum_{n=1}^{\infty} p q^{n-1} \log (p q^{n-1})
     = -\left[ \sum_{n=0}^{\infty} p q^n \log p + \sum_{n=0}^{\infty} n p q^n \log q \right]
     = \frac{-p \log p}{1-q} - \frac{p q \log q}{p^2}
     = \frac{-p \log p - q \log q}{p}
     = H(p)/p \text{ bits.}
If p = 1/2 , then H(X) = 2 bits.
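As a quick numerical sanity check (a sketch, not part of the original solution), one can sum the series for p = 1/2 directly and recover the 2 bits:

```python
from math import log2

# Sum the series for H(X) directly for p = 1/2 and compare with H(p)/p = 2 bits.
p = 0.5
H = 0.0
for n in range(1, 200):              # the tail beyond n = 200 is negligible
    pn = p * (1 - p) ** (n - 1)      # P(X = n) for the geometric distribution
    H -= pn * log2(pn)
print(H)                             # ≈ 2.0
```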
(b) Intuitively, it seems clear that the best questions are those that have equally likely
chances of receiving a yes or a no answer. Consequently, one possible guess is
that the most “efficient” series of questions is: Is X = 1 ? If not, is X = 2 ?
If not, is X = 3 ? . . . with a resulting expected number of questions equal to
\sum_{n=1}^{\infty} n (1/2)^n = 2. This should reinforce the intuition that H(X) is a mea-
sure of the uncertainty of X . Indeed in this case, the entropy is exactly the
same as the average number of questions needed to define X , and in general
E(# of questions) ≥ H(X) . This problem has an interpretation as a source cod-
ing problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then
the set of questions in the above procedure can be written as a collection of (X, Y )
pairs: (1, 1), (2, 01), (3, 001), etc. In fact, this intuitively derived code is the
optimal (Huffman) code minimizing the expected number of questions.
2. Entropy of functions. Let Y = g(X). What is the relationship between the entropies H(X) and H(Y) if
(a) Y = 2^X ?
(b) Y = cos X ?
Solution:
Consider any set of x's that map onto a single y. For this set

\sum_{x:\, y = g(x)} p(x) \log p(x) \le \sum_{x:\, y = g(x)} p(x) \log p(y) = p(y) \log p(y),

since log is a monotone increasing function and p(x) \le \sum_{x:\, y = g(x)} p(x) = p(y). Extending this argument to the entire range of X (and Y), we obtain

H(X) = -\sum_x p(x) \log p(x)
     = -\sum_y \sum_{x:\, y = g(x)} p(x) \log p(x)
     \ge -\sum_y p(y) \log p(y)
     = H(Y).
(a) Y = 2^X is one-to-one, and hence the entropy, which is just a function of the
probabilities (and not of the values taken by the random variable), does not change, i.e.,
H(X) = H(Y).
(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that
H(X) ≥ H(Y ) , with equality if cosine is one-to-one on the range of X .
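A small numerical sketch of case (b), using an illustrative distribution for X (the particular support is an assumption, not from the text), shows the collapse H(Y) ≤ H(X) when cos is not one-to-one on the range of X:

```python
import math

# Compare H(X) with H(cos X) when cos(.) is not one-to-one on the support of X.
pX = {0.0: 0.25, math.pi: 0.25, 2 * math.pi: 0.25, 3 * math.pi: 0.25}

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

pY = {}
for x, p in pX.items():
    y = round(math.cos(x), 9)         # cos collapses 0 with 2*pi, and pi with 3*pi
    pY[y] = pY.get(y, 0) + p

print(H(pX), H(pY))                   # 2.0 and 1.0: H(Y) <= H(X), as claimed
```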
3. Minimum entropy. What is the minimum value of H(p 1 , ..., pn ) = H(p) as p ranges
over the set of n -dimensional probability vectors? Find all p ’s which achieve this
minimum.
Solution: We wish to find all probability vectors p = (p 1 , p2 , . . . , pn ) which minimize
H(p) = -\sum_i p_i \log p_i.
Now −p_i log p_i ≥ 0, with equality iff p_i = 0 or 1. Hence the only possible probability
vectors which minimize H(p) are those with p_i = 1 for some i and p_j = 0 for j ≠ i.
There are n such vectors, i.e., (1, 0, . . . , 0) , (0, 1, 0, . . . , 0) , . . . , (0, . . . , 0, 1) , and the
minimum value of H(p) is 0.
4. Entropy of functions of a random variable. Let X be a discrete random variable.
Show that the entropy of a function of X is less than or equal to the entropy of X by
justifying the following steps:
(a)
H(X, g(X)) = H(X) + H(g(X) | X) (2.1)
(b)
= H(X); (2.2)
(c)
H(X, g(X)) = H(g(X)) + H(X | g(X)) (2.3)
(d)
≥ H(g(X)). (2.4)
Thus H(g(X)) ≤ H(X).
Solution: Entropy of functions of a random variable.
(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence
H(g(X)|X) = \sum_x p(x) H(g(X)|X = x) = \sum_x 0 = 0.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) ≥ 0 , with equality iff X is a function of g(X) , i.e., g(.) is one-to-one.
Hence H(X, g(X)) ≥ H(g(X)) .
Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)) .
5. Zero conditional entropy. Show that if H(Y |X) = 0 , then Y is a function of X ,
i.e., for all x with p(x) > 0 , there is only one possible value of y with p(x, y) > 0 .
Solution: Zero Conditional Entropy. Assume that there exists an x , say x 0 and two
different values of y , say y1 and y2 such that p(x0 , y1 ) > 0 and p(x0 , y2 ) > 0 . Then
p(x0 ) ≥ p(x0 , y1 ) + p(x0 , y2 ) > 0 , and p(y1 |x0 ) and p(y2 |x0 ) are not equal to 0 or 1.
Thus
H(Y|X) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)   (2.5)
 \ge p(x_0) \bigl( -p(y_1|x_0) \log p(y_1|x_0) - p(y_2|x_0) \log p(y_2|x_0) \bigr)   (2.6)
 > 0,   (2.7)

since −t log t > 0 for 0 < t < 1. This contradicts the assumption that H(Y|X) = 0, so for every x with p(x) > 0 there is only one y with p(x, y) > 0, i.e., Y is a function of X.
6. Conditional mutual information vs. unconditional mutual information. Give examples of joint random variables X, Y and Z such that (a) I(X; Y | Z) < I(X; Y), and (b) I(X; Y | Z) > I(X; Y).
Solution:
(a) The last corollary to Theorem 2.8.1 in the text states that if X → Y → Z, that
is, if p(x, z | y) = p(x | y) p(z | y), then I(X; Y) ≥ I(X; Y | Z). Equality holds if
and only if I(X; Z) = 0, i.e., if X and Z are independent.
A simple example of random variables satisfying the inequality conditions above
is: X a fair binary random variable, Y = X, and Z = Y. In this case,
I(X; Y) = H(X) − H(X | Y) = 1 bit,
and
I(X; Y | Z) = H(X | Z) − H(X | Y, Z) = 0.
So I(X; Y) > I(X; Y | Z).
(b) This example is also given in the text. Let X, Y be independent fair binary
random variables and let Z = X + Y . In this case we have that,
I(X; Y ) = 0
and,
I(X; Y | Z) = H(X | Z) = 1/2.
So I(X; Y) < I(X; Y | Z). Note that in this case X, Y, Z do not form a Markov chain.
7. Coin weighing. Suppose one has n coins, among which there may or may not be one
counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than
the other coins. The coins are to be weighed by a balance.
(a) Find an upper bound on the number of coins n so that k weighings will find the
counterfeit coin (if any) and correctly declare it to be heavier or lighter.
(b) (Difficult) What is the coin-weighing strategy for k = 3 weighings and 12 coins?
Solution:
(a) Each weighing has three possible outcomes: equal, left pan heavier, or right pan
heavier. Hence with k weighings there are 3^k possible outcomes, and we can
distinguish between at most 3^k different "states". There are 2n + 1 possible states
(each of the n coins may be heavy or light, or there may be no counterfeit coin),
so we need 2n + 1 ≤ 3^k, or n ≤ (3^k − 1)/2.
Looking at it from an information theoretic viewpoint, each weighing gives at most
log2 3 bits of information. There are 2n + 1 possible “states”, with a maximum
entropy of log2 (2n + 1) bits. Hence in this situation, one would require at least
log2 (2n + 1)/ log 2 3 weighings to extract enough information for determination of
the odd coin, which gives the same result as above.
(b) There are many solutions to this problem. We will give one which is based on the
ternary number system.
We may express the numbers {−12, −11, . . . , −1, 0, 1, . . . , 12} in a ternary number
system with alphabet {−1, 0, 1} . For example, the number 8 is (-1,0,1) where
−1 × 3^0 + 0 × 3^1 + 1 × 3^2 = 8. We form the matrix with the representation of the
positive numbers as its columns.
        1   2   3   4   5   6   7   8   9  10  11  12
3^0     1  -1   0   1  -1   0   1  -1   0   1  -1   0    Σ1 = 0
3^1     0   1   1   1  -1  -1  -1   0   0   0   1   1    Σ2 = 2
3^2     0   0   0   0   1   1   1   1   1   1   1   1    Σ3 = 8
Note that the row sums are not all zero. We can negate some columns to make
the row sums zero. For example, negating columns 7,9,11 and 12, we obtain
        1   2   3   4   5   6   7   8   9  10  11  12
3^0     1  -1   0   1  -1   0  -1  -1   0   1   1   0    Σ1 = 0
3^1     0   1   1   1  -1  -1   1   0   0   0  -1  -1    Σ2 = 0
3^2     0   0   0   0   1   1  -1   1  -1   1  -1  -1    Σ3 = 0
Now place the coins on the balance according to the following rule: for weighing
#i, place coin n (whose column in the matrix above has entries n_1, n_2, n_3)
• on the left pan, if n_i = −1;
• aside, if n_i = 0;
• on the right pan, if n_i = 1.
The outcome of the three weighings will find the odd coin if any and tell whether
it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if
the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings
give the ternary expansion of the index of the odd coin. If the expansion is the
same as the expansion in the matrix, it indicates that the coin is heavier. If
the expansion is of the opposite sign, the coin is lighter. For example, (0,-1,-1)
indicates (0)3^0 + (−1)3^1 + (−1)3^2 = −12, hence coin #12 is heavy, (1,0,-1) indicates
#8 is light, (0,0,0) indicates no odd coin.
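The scheme can be checked exhaustively. The following sketch (not part of the original solution) encodes the sign-adjusted matrix above as a dictionary, simulates the three weighings for each of the 25 possible states, and decodes the outcome:

```python
# digits[c] is column c of the sign-adjusted matrix: coin c's pan assignment
# in each of the three weighings (-1 = left pan, 0 = aside, +1 = right pan).
digits = {
    1: (1, 0, 0),   2: (-1, 1, 0),   3: (0, 1, 0),    4: (1, 1, 0),
    5: (-1, -1, 1), 6: (0, -1, 1),   7: (-1, 1, -1),  8: (-1, 0, 1),
    9: (0, 0, -1),  10: (1, 0, 1),   11: (1, -1, -1), 12: (0, -1, -1),
}

def weigh(odd_coin, heavy):
    """Outcome of the three weighings: -1 left pan heavier, +1 right pan heavier, 0 balanced."""
    if odd_coin is None:                      # no counterfeit coin: every weighing balances
        return (0, 0, 0)
    sign = 1 if heavy else -1                 # a light coin reverses every outcome
    return tuple(sign * d for d in digits[odd_coin])

def decode(outcome):
    """Match the outcome vector against the columns (and their negatives)."""
    for c, col in digits.items():
        if outcome == col:
            return (c, "heavy")
        if outcome == tuple(-d for d in col):
            return (c, "light")
    return None                               # (0,0,0): no odd coin

# Exhaustive check of all 25 possible states.
assert decode(weigh(None, True)) is None
for c in digits:
    assert decode(weigh(c, True)) == (c, "heavy")
    assert decode(weigh(c, False)) == (c, "light")
print("all 25 states identified correctly")
```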
Why does this scheme work? It is a single error correcting Hamming code for the
ternary alphabet (discussed in Section 8.11 in the book). Here are some details.
First note a few properties of the matrix above that was used for the scheme.
All the columns are distinct and no two columns add to (0,0,0). Also if any coin
is heavier, it will produce the sequence of weighings that matches its column in
the matrix. If it is lighter, it produces the negative of its column as a sequence
of weighings. Combining all these facts, we can see that any single odd coin will
produce a unique sequence of weighings, and that the coin can be determined from
the sequence.
One of the questions that many of you had was whether the bound derived in part (a)
is actually achievable. For example, can one distinguish 13 coins in 3 weighings?
No, not with a scheme like the one above. Yes, under the assumptions under
which the bound was derived. The bound did not prohibit the division of coins
into halves, neither did it disallow the existence of another coin known to be
normal. Under both these conditions, it is possible to find the odd coin of 13 coins
in 3 weighings. You could try modifying the above scheme to these cases.
8. Drawing with and without replacement. An urn contains r red, w white, and
b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with
replacement or without replacement? Set it up and show why. (There is both a hard
way and a relatively simple way to do this.)
Solution: Drawing with and without replacement. Intuitively, it is clear that if the
balls are drawn with replacement, the number of possible choices for the i -th ball is
larger, and therefore the conditional entropy is larger. But computing the conditional
distributions is slightly involved. It is easier to compute the unconditional entropy.
• With replacement. In this case the conditional distribution of each draw is the
same for every draw. Thus

X_i = \begin{cases} \text{red}, & \text{with prob. } \frac{r}{r+w+b} \\ \text{white}, & \text{with prob. } \frac{w}{r+w+b} \\ \text{black}, & \text{with prob. } \frac{b}{r+w+b} \end{cases}   (2.8)

and therefore H(X_i | X_{i-1}, \ldots, X_1) = H(X_i) = H\!\left(\frac{r}{r+w+b}, \frac{w}{r+w+b}, \frac{b}{r+w+b}\right) for every draw.
• Without replacement. The unconditional probability of the i -th ball being red is
still r/(r + w + b) , etc. Thus the unconditional entropy H(X i ) is still the same as
with replacement. The conditional entropy H(X i |Xi−1 , . . . , X1 ) is less than the
unconditional entropy, and therefore the entropy of drawing without replacement
is lower.
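A brute-force sketch (illustrative urn sizes, chosen here and not from the problem) confirms the conclusion for k = 2 draws by enumerating equally likely ordered draws:

```python
from itertools import product
from math import log2

def draw_entropy(r, w, b, k, replace):
    """Joint entropy H(X1,...,Xk) of the colour sequence of k draws from the urn."""
    colours = ["red"] * r + ["white"] * w + ["black"] * b
    n = len(colours)
    if replace:
        index_seqs = product(range(n), repeat=k)
    else:
        index_seqs = (s for s in product(range(n), repeat=k) if len(set(s)) == k)
    counts, total = {}, 0
    for seq in index_seqs:
        key = tuple(colours[i] for i in seq)     # ordered colour sequence of the draw
        counts[key] = counts.get(key, 0) + 1
        total += 1
    # every ordered index sequence is equally likely, so probabilities are counts/total
    return -sum((c / total) * log2(c / total) for c in counts.values())

# illustrative urn: r, w, b = 2, 1, 1 and k = 2 draws
print(draw_entropy(2, 1, 1, 2, replace=True))    # 3.0 bits (with replacement)
print(draw_entropy(2, 1, 1, 2, replace=False))   # ≈ 2.75 bits (without replacement: smaller)
```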
9. A metric. A function ρ(x, y) is a metric if for all x, y:
• ρ(x, y) ≥ 0
• ρ(x, y) = ρ(y, x)
• ρ(x, y) = 0 if and only if x = y
• ρ(x, y) + ρ(y, z) ≥ ρ(x, z).
(a) Show that ρ(X, Y ) = H(X|Y ) + H(Y |X) satisfies the first, second and fourth
properties above. If we say that X = Y if there is a one-to-one function mapping
from X to Y , then the third property is also satisfied, and ρ(X, Y ) is a metric.
(b) Verify that ρ(X, Y ) can also be expressed as
ρ(X, Y ) = H(X) + H(Y ) − 2I(X; Y ) (2.11)
= H(X, Y ) − I(X; Y ) (2.12)
= 2H(X, Y ) − H(X) − H(Y ). (2.13)
Solution: A metric
(a) Let
ρ(X, Y ) = H(X|Y ) + H(Y |X). (2.14)
Then
• Since conditional entropy is always ≥ 0 , ρ(X, Y ) ≥ 0 .
• The symmetry of the definition implies that ρ(X, Y ) = ρ(Y, X) .
• By problem 2.6, it follows that H(Y |X) is 0 iff Y is a function of X and
H(X|Y ) is 0 iff X is a function of Y . Thus ρ(X, Y ) is 0 iff X and Y
are functions of each other - and therefore are equivalent up to a reversible
transformation.
• Consider three random variables X , Y and Z . Then
H(X|Y ) + H(Y |Z) ≥ H(X|Y, Z) + H(Y |Z) (2.15)
= H(X, Y |Z) (2.16)
= H(X|Z) + H(Y |X, Z) (2.17)
≥ H(X|Z), (2.18)
from which it follows that
ρ(X, Y ) + ρ(Y, Z) ≥ ρ(X, Z). (2.19)
Note that the inequality is strict unless X → Y → Z forms a Markov Chain
and Y is a function of X and Z .
(b) Since H(X|Y ) = H(X) − I(X; Y ) , the first equation follows. The second relation
follows from the first equation and the fact that H(X, Y ) = H(X) + H(Y ) −
I(X; Y ) . The third follows on substituting I(X; Y ) = H(X) + H(Y ) − H(X, Y ) .
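A small numerical check of part (b) (a sketch; the joint pmf below is an arbitrary illustrative choice):

```python
from math import log2

p = {(0, 0): 1/3, (0, 1): 1/3, (1, 1): 1/3}   # hypothetical joint distribution p(x, y)

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

px, py = {}, {}
for (x, y), q in p.items():
    px[x] = px.get(x, 0) + q
    py[y] = py.get(y, 0) + q

Hxy, Hx, Hy = H(p), H(px), H(py)
I = Hx + Hy - Hxy                      # I(X;Y)
rho = (Hxy - Hy) + (Hxy - Hx)          # H(X|Y) + H(Y|X)
assert abs(rho - (Hx + Hy - 2 * I)) < 1e-12
assert abs(rho - (Hxy - I)) < 1e-12
assert abs(rho - (2 * Hxy - Hx - Hy)) < 1e-12
print(rho)
```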
10. Entropy of a disjoint mixture. Let X1 and X2 be discrete random variables drawn
according to probability mass functions p 1 (·) and p2 (·) over the respective alphabets
X1 = {1, 2, . . . , m} and X2 = {m + 1, . . . , n}. Let
X = \begin{cases} X_1, & \text{with probability } \alpha, \\ X_2, & \text{with probability } 1 - \alpha. \end{cases}
12. Example of joint entropy. Let p(x, y) be given by

          Y = 0    Y = 1
  X = 0    1/3      1/3
  X = 1     0       1/3

Find:
(a) H(X) = (2/3) log(3/2) + (1/3) log 3 = 0.918 bits = H(Y).
(b) H(X|Y) = (1/3) H(X|Y = 0) + (2/3) H(X|Y = 1) = 0.667 bits = H(Y|X).
(c) H(X, Y) = 3 × (1/3) log 3 = 1.585 bits.
(d) H(Y) − H(Y|X) = 0.251 bits.
(e) I(X; Y) = H(Y) − H(Y|X) = 0.251 bits.
(f) See Figure 2.1.
Figure 2.1: Venn diagram illustrating the relationships among H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y).
since the second term is always negative. Hence, letting y = 1/x, we obtain

-\ln y \le \frac{1}{y} - 1,

or

\ln y \ge 1 - \frac{1}{y},

with equality iff y = 1.
14. Entropy of a sum. Let X and Y be random variables that take on values x 1 , x2 , . . . , xr
and y1 , y2 , . . . , ys , respectively. Let Z = X + Y.
(a) Show that H(Z|X) = H(Y |X). Argue that if X, Y are independent, then H(Y ) ≤
H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables
adds uncertainty.
(b) Give an example of (necessarily dependent) random variables in which H(X) >
H(Z) and H(Y ) > H(Z).
(c) Under what conditions does H(Z) = H(X) + H(Y ) ?
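The solution to this problem is not reproduced in this excerpt, but a small sketch illustrates part (b) with one possible choice of dependent X and Y (this particular pair is an illustrative assumption, not taken from the text):

```python
from math import log2

def H(pmf):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# Dependent pair: X uniform on {-1, 1} and Y = -X, so Z = X + Y = 0 with probability 1.
joint = {(-1, 1): 0.5, (1, -1): 0.5}          # p(x, y)
pX, pY, pZ = {}, {}, {}
for (x, y), p in joint.items():
    pX[x] = pX.get(x, 0) + p
    pY[y] = pY.get(y, 0) + p
    pZ[x + y] = pZ.get(x + y, 0) + p

print(H(pX), H(pY), H(pZ))   # 1.0 1.0 0.0: H(X) > H(Z) and H(Y) > H(Z)
```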
By the Markov property, the past and the future are conditionally independent given
the present and hence all terms except the first are zero. Therefore
16. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks
down to k < n states, and then fans back to m > k states. Thus X 1 → X2 → X3 ,
i.e., p(x1 , x2 , x3 ) = p(x1 )p(x2 |x1 )p(x3 |x2 ) , for all x1 ∈ {1, 2, . . . , n} , x2 ∈ {1, 2, . . . , k} ,
x3 ∈ {1, 2, . . . , m} .
(a) Show that the dependence of X1 and X3 is limited by the bottleneck by proving
that I(X1 ; X3 ) ≤ log k.
(b) Evaluate I(X1 ; X3 ) for k = 1 , and conclude that no dependence can survive such
a bottleneck.
Solution:
Bottleneck.
(a) From the data processing inequality, and the fact that entropy is maximum for a
uniform distribution, we get
I(X1 ; X3 ) ≤ I(X1 ; X2 )
= H(X2 ) − H(X2 | X1 )
≤ H(X2 )
≤ log k.
Thus, the dependence between X1 and X3 is limited by the size of the bottleneck.
That is I(X1 ; X3 ) ≤ log k .
(b) For k = 1, I(X1; X3) ≤ log 1 = 0 and, since I(X1; X3) ≥ 0, we have I(X1; X3) = 0.
Thus, for k = 1 , X1 and X3 are independent.
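A numerical sketch of part (a) (randomly generated chain; the alphabet sizes n = 4, k = 2, m = 5 are illustrative assumptions):

```python
import numpy as np

# Check I(X1;X3) <= log k for a Markov chain X1 -> X2 -> X3 through a k-state bottleneck.
rng = np.random.default_rng(0)
n, k, m = 4, 2, 5
p1 = rng.dirichlet(np.ones(n))            # p(x1)
P12 = rng.dirichlet(np.ones(k), size=n)   # p(x2|x1), one row per x1
P23 = rng.dirichlet(np.ones(m), size=k)   # p(x3|x2)

p13 = np.zeros((n, m))                    # p(x1, x3) = sum_x2 p(x1) p(x2|x1) p(x3|x2)
for x1 in range(n):
    for x2 in range(k):
        p13[x1] += p1[x1] * P12[x1, x2] * P23[x2]

def mutual_information(pxy):
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

print(mutual_information(p13), "<=", np.log2(k))
```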
17. Pure randomness and bent coins. Let X 1 , X2 , . . . , Xn denote the outcomes of
independent flips of a bent coin. Thus Pr {X i = 1} = p, Pr {Xi = 0} = 1 − p ,
where p is unknown. We wish to obtain a sequence Z 1 , Z2 , . . . , ZK of fair coin flips
from X1 , X2 , . . . , Xn . Toward this end let f : X n → {0, 1}∗ , (where {0, 1}∗ =
{Λ, 0, 1, 00, 01, . . .} is the set of all finite length binary sequences), be a mapping
f(X_1, X_2, \ldots, X_n) = (Z_1, Z_2, \ldots, Z_K), where Z_i ∼ Bernoulli(1/2), and K may depend
on (X_1, \ldots, X_n). In order that the sequence Z_1, Z_2, \ldots appear to be fair coin flips, the
map f from bent coin flips to fair flips must have the property that all 2^k sequences
(Z_1, Z_2, \ldots, Z_k) of a given length k have equal probability (possibly 0), for k = 1, 2, \ldots.
For example, for n = 2, the map f(01) = 0, f(10) = 1, f(00) = f(11) = Λ (the null
string) has the property that Pr{Z_1 = 1 | K = 1} = Pr{Z_1 = 0 | K = 1} = 1/2.
Give reasons for the following inequalities:
(a)
nH(p) = H(X1 , . . . , Xn )
(b)
≥ H(Z1 , Z2 , . . . , ZK , K)
(c)
= H(K) + H(Z1 , . . . , ZK |K)
(d)
= H(K) + E(K)
(e)
≥ EK.
Thus no more than nH(p) fair coin tosses can be derived from (X 1 , . . . , Xn ) , on the
average. Exhibit a good map f on sequences of length 4.
Solution: Pure randomness and bent coins.
(a) nH(p) = H(X_1, \ldots, X_n), since the X_i are i.i.d. with entropy H(p) each.
(b) ≥ H(Z_1, Z_2, \ldots, Z_K), since Z_1, \ldots, Z_K is a function of X_1, \ldots, X_n, and the entropy of a function of a random variable is at most the entropy of the random variable.
(c) = H(Z_1, Z_2, \ldots, Z_K, K), since K is itself a function of (Z_1, \ldots, Z_K) (it is the length of the sequence), so including it adds no entropy.
(d) = H(K) + H(Z_1, \ldots, Z_K | K), by the chain rule for entropy.
(e) = H(K) + E(K), since given K = k the bits Z_1, \ldots, Z_k are k independent fair coin flips, so H(Z_1, \ldots, Z_K | K) = \sum_k p(k)\, k = EK.
(f) ≥ EK, since H(K) ≥ 0.
A good map f on sequences of length 4 is:

0000 → Λ
0001 → 00    0010 → 01    0100 → 10    1000 → 11
0011 → 00    0110 → 01    1100 → 10    1001 → 11        (2.25)
1010 → 0     0101 → 1
1110 → 11    1101 → 10    1011 → 01    0111 → 00
1111 → Λ
The resulting expected number of bits is

EK = 2(4pq^3) + 2(4p^2q^2) + 1(2p^2q^2) + 2(4p^3q) = 8pq^3 + 10p^2q^2 + 8p^3q.

For example, for p ≈ 1/2, the expected number of pure random bits is close to 26/16 = 1.625.
This is substantially less than the 4 pure random bits that could be generated if
p were exactly 1/2.
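The fairness property of this length-4 map can be verified exactly for any bias p. The sketch below (not part of the original solution) tabulates the output distribution and checks that all strings of a given length are equally likely:

```python
from itertools import product

f = {
    "0000": "", "1111": "",
    "0001": "00", "0010": "01", "0100": "10", "1000": "11",
    "0011": "00", "0110": "01", "1100": "10", "1001": "11",
    "1010": "0",  "0101": "1",
    "1110": "11", "1101": "10", "1011": "01", "0111": "00",
}

def output_dist(p):
    """Probability of each output string when X1..X4 are i.i.d. Bernoulli(p)."""
    dist = {}
    for x in product("01", repeat=4):
        s = "".join(x)
        prob = 1.0
        for c in s:
            prob *= p if c == "1" else 1 - p
        z = f[s]
        dist[z] = dist.get(z, 0) + prob
    return dist

d = output_dist(0.3)                      # any bias p works
assert abs(d["0"] - d["1"]) < 1e-12       # 1-bit outputs equally likely
two_bit = [d[z] for z in ("00", "01", "10", "11")]
assert max(two_bit) - min(two_bit) < 1e-12
print({k: round(v, 4) for k, v in d.items()})
```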
We will now analyze the efficiency of this scheme of generating random bits for long
sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm
that we will use is the obvious extension of the above method of generating pure
bits using the fact that all sequences with the same number of ones are equally
likely.
Consider all sequences with k ones. There are \binom{n}{k} such sequences, which are
all equally likely. If \binom{n}{k} were a power of 2, then we could generate \log \binom{n}{k} pure
random bits from such a set. However, in the general case, \binom{n}{k} is not a power of
2, and the best we can do is to divide the set of \binom{n}{k} elements into subsets of sizes
which are powers of 2. The largest subset would have size 2^{\lfloor \log \binom{n}{k} \rfloor} and could be
used to generate \lfloor \log \binom{n}{k} \rfloor random bits. We could divide the remaining elements
into the largest set which is a power of 2, etc. The worst case would occur when
\binom{n}{k} = 2^{l+1} - 1, in which case the subsets would be of sizes 2^l, 2^{l-1}, 2^{l-2}, \ldots, 1.
Instead of analyzing the scheme exactly, we will just find a lower bound on the number
of random bits generated from a set of size \binom{n}{k}. Let l = \lfloor \log \binom{n}{k} \rfloor. Then at least
half of the elements belong to a set of size 2^l and would generate l random bits, at
least a quarter belong to a set of size 2^{l-1} and generate l - 1 random bits, etc. On
the average, the number of bits generated is

E[K \mid k \text{ 1's in sequence}] \ge \frac{1}{2} l + \frac{1}{4}(l-1) + \cdots + \frac{1}{2^l} \cdot 1   (2.28)
 = l - \frac{1}{4}\left(1 + \frac{2}{2} + \frac{3}{4} + \frac{4}{8} + \cdots + \frac{l-1}{2^{l-2}}\right)   (2.29)
 \ge l - 1,   (2.30)
Now for sufficiently large n, the probability that the number of 1's in the sequence
is close to np is near 1 (by the weak law of large numbers). For such sequences,
k/n is close to p and hence there exists a δ such that

\binom{n}{k} \ge 2^{n(H(k/n) - \delta)} \ge 2^{n(H(p) - 2\delta)}   (2.35)

using Stirling's approximation for the binomial coefficients and the continuity of
the entropy function. If we assume that n is large enough so that the probability
that n(p − ε) ≤ k ≤ n(p + ε) is greater than 1 − ε, then we see that EK ≥
(1 − ε) n(H(p) − 2δ) − 2, which is very good since nH(p) is an upper bound on the
number of pure random bits that can be produced from the bent coin sequence.
18. World Series. The World Series is a seven-game series that terminates as soon as
either team wins four games. Let X be the random variable that represents the outcome
of a World Series between teams A and B; possible values of X are AAAA, BABABAB,
and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7.
Assuming that A and B are equally matched and that the games are independent,
calculate H(X) , H(Y ) , H(Y |X) , and H(X|Y ) .
Solution:
World Series. Two teams play until one of them has won 4 games.
There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)^4.
There are 8 = 2\binom{4}{3} World Series with 5 games. Each happens with probability (1/2)^5.
There are 20 = 2\binom{5}{3} World Series with 6 games. Each happens with probability (1/2)^6.
There are 40 = 2\binom{6}{3} World Series with 7 games. Each happens with probability (1/2)^7.
H(X) = \sum_x p(x) \log \frac{1}{p(x)}
     = 2(1/16) \log 16 + 8(1/32) \log 32 + 20(1/64) \log 64 + 40(1/128) \log 128
     = 5.8125 \text{ bits.}

H(Y) = \sum_y p(y) \log \frac{1}{p(y)}
     = (1/8) \log 8 + (1/4) \log 4 + (5/16) \log(16/5) + (5/16) \log(16/5)
     = 1.924 \text{ bits.}

Since Y is a deterministic function of X, H(Y|X) = 0, and therefore
H(X|Y) = H(X) + H(Y|X) − H(Y) = 5.8125 − 1.924 = 3.889 bits.
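These values can be confirmed by enumerating all series outcomes; the following sketch (not from the text) does so directly:

```python
from itertools import product
from math import log2

outcomes = {}                    # series string (e.g. "AAAA") -> probability
for seq in product("AB", repeat=7):
    a = b = 0
    series = []
    for g in seq:
        series.append(g)
        a += g == "A"
        b += g == "B"
        if a == 4 or b == 4:
            break
    s = "".join(series)
    outcomes[s] = outcomes.get(s, 0) + (1 / 2) ** 7   # each 7-game pattern has prob 2^-7

def H(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

lengths = {}
for s, p in outcomes.items():
    lengths[len(s)] = lengths.get(len(s), 0) + p

print(H(outcomes))   # H(X) = 5.8125
print(H(lengths))    # H(Y) ≈ 1.924
```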
19. Infinite entropy. This problem shows that the entropy of a discrete random variable
can be infinite. Let A = \sum_{n=2}^{\infty} (n \log^2 n)^{-1}. (It is easy to show that A is finite by
bounding the infinite sum by the integral of (x \log^2 x)^{-1}.) Show that the integer-
valued random variable X defined by Pr(X = n) = (A n \log^2 n)^{-1} for n = 2, 3, \ldots,
has H(X) = +∞.
Solution: Infinite entropy. By definition, p_n = Pr(X = n) = 1/(A n \log^2 n) for n ≥ 2.
Therefore

H(X) = -\sum_{n=2}^{\infty} p(n) \log p(n)
     = -\sum_{n=2}^{\infty} \frac{1}{A n \log^2 n} \log\!\left(\frac{1}{A n \log^2 n}\right)
     = \sum_{n=2}^{\infty} \frac{\log(A n \log^2 n)}{A n \log^2 n}
     = \sum_{n=2}^{\infty} \frac{\log A + \log n + 2 \log\log n}{A n \log^2 n}
     = \log A + \sum_{n=2}^{\infty} \frac{1}{A n \log n} + \sum_{n=2}^{\infty} \frac{2 \log\log n}{A n \log^2 n}.
The first term is finite. For base 2 logarithms, all the elements in the sum in the last
term are nonnegative. (For any other base, the terms of the last sum eventually all
become positive.) So all we have to do is bound the middle sum, which we do by
comparing with an integral.
\sum_{n=2}^{\infty} \frac{1}{A n \log n} > \int_2^{\infty} \frac{1}{A x \log x}\, dx = K \ln\ln x \Big|_2^{\infty} = +\infty.
X1 , X2 , . . . , Xn . Hence
21. Markov’s inequality for probabilities. Let p(x) be a probability mass function.
Prove, for all d ≥ 0 ,
Pr\{p(X) \le d\} \log\frac{1}{d} \le H(X).   (2.40)
Solution:

Pr\{p(X) \le d\} \log\frac{1}{d} = \sum_{x:\, p(x) \le d} p(x) \log\frac{1}{d}   (2.41)
 \le \sum_{x:\, p(x) \le d} p(x) \log\frac{1}{p(x)}   (2.42)
 \le \sum_x p(x) \log\frac{1}{p(x)}   (2.43)
 = H(X)   (2.44)
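A quick numerical check of inequality (2.40) on an illustrative pmf (the distribution and the values of d below are assumptions, chosen only for the demonstration):

```python
from math import log2

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
H = -sum(q * log2(q) for q in p.values())

for d in (0.05, 0.125, 0.3, 0.6):
    lhs = sum(q for q in p.values() if q <= d) * log2(1 / d)
    assert lhs <= H + 1e-12
    print(f"d={d}: {lhs:.3f} <= H(X) = {H:.3f}")
```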
22. Logical order of ideas. Ideas have been developed in order of need, and then gener-
alized if necessary. Reorder the following ideas, strongest first, implications following:
(a) Chain rule for I(X1 , . . . , Xn ; Y ) , chain rule for D(p(x1 , . . . , xn )||q(x1 , x2 , . . . , xn )) ,
and chain rule for H(X1 , X2 , . . . , Xn ) .
(b) D(f ||g) ≥ 0 , Jensen’s inequality, I(X; Y ) ≥ 0 .
Solution:
(a) The following orderings are subjective. Since I(X; Y ) = D(p(x, y)||p(x)p(y)) is a
special case of relative entropy, it is possible to derive the chain rule for I from
the chain rule for D .
Since H(X) = I(X; X) , it is possible to derive the chain rule for H from the
chain rule for I .
It is also possible to derive the chain rule for I from the chain rule for H as was
done in the notes.
(b) In class, Jensen’s inequality was used to prove the non-negativity of D . The
inequality I(X; Y ) ≥ 0 followed as a special case of the non-negativity of D .
24. Average entropy. Let H(p) = −p log 2 p − (1 − p) log2 (1 − p) be the binary entropy
function.
(a) Evaluate H(1/4) using the fact that log 2 3 ≈ 1.584 . Hint: You may wish to
consider an experiment with four equally likely outcomes, one of which is more
interesting than the others.
(b) Calculate the average entropy H(p) when the probability p is chosen uniformly
in the range 0 ≤ p ≤ 1 .
(c) (Optional) Calculate the average entropy H(p 1 , p2 , p3 ) where (p1 , p2 , p3 ) is a uni-
formly distributed probability vector. Generalize to dimension n .
Solution:
(a) We can generate two bits of information by picking one of four equally likely
alternatives. This selection can be made in two steps. First we decide whether the
first outcome occurs. Since this has probability 1/4, the information generated
is H(1/4). If not the first outcome, then we select one of the three remaining
outcomes; with probability 3/4, this produces log2 3 bits of information. Thus
2 = H(1/4) + (3/4) log2 3, so H(1/4) = 2 − (3/4) log2 3 ≈ 2 − 1.189 = 0.811 bits.
(b) If p is chosen uniformly in the range 0 ≤ p ≤ 1, then the average entropy (in
nats) is

\int_0^1 -\bigl[p \ln p + (1-p) \ln(1-p)\bigr]\, dp = -2 \int_0^1 x \ln x\, dx = -2\left[\frac{x^2}{2}\ln x - \frac{x^2}{4}\right]_0^1 = \frac{1}{2},

that is, 1/(2 ln 2) ≈ 0.721 bits.
(c) After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.
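A Monte Carlo sketch (not part of the original solution) reproduces both averages; the simplex sampling via uniform spacings is one standard choice:

```python
import random
from math import log, log2

def H(ps):
    return -sum(p * log2(p) for p in ps if p > 0)

random.seed(1)
N = 200_000

# (b): p uniform on [0, 1]
avg2 = sum(H((p, 1 - p)) for p in (random.random() for _ in range(N))) / N

# (c): (p1, p2, p3) uniform on the simplex, via the spacings of two uniform points
def uniform_simplex3():
    u, v = sorted((random.random(), random.random()))
    return (u, v - u, 1 - v)

avg3 = sum(H(uniform_simplex3()) for _ in range(N)) / N
print(avg2, 1 / (2 * log(2)))    # both ≈ 0.721 bits
print(avg3, 5 / (6 * log(2)))    # both ≈ 1.202 bits
```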
25. Venn diagrams. There isn't really a notion of mutual information common to three
random variables. Here is one attempt at a definition: Using Venn diagrams, we can
see that the mutual information common to three random variables X , Y and Z can
be defined by
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) .
This quantity is symmetric in X , Y and Z , despite the preceding asymmetric defi-
nition. Unfortunately, I(X; Y ; Z) is not necessarily nonnegative. Find X , Y and Z
such that I(X; Y ; Z) < 0 , and prove the following two identities:
(a) I(X; Y ; Z) = H(X, Y, Z) − H(X) − H(Y ) − H(Z) + I(X; Y ) + I(Y ; Z) + I(Z; X)
(b) I(X; Y ; Z) = H(X, Y, Z)− H(X, Y )− H(Y, Z)− H(Z, X)+ H(X)+ H(Y )+ H(Z)
The first identity can be understood using the Venn diagram analogy for entropy and
mutual information. The second identity follows easily from the first.
Solution: Venn Diagrams. To show the first identity,
I(X; Y ; Z) = I(X; Y ) − I(X; Y |Z) by definition
= I(X; Y ) − (I(X; Y, Z) − I(X; Z)) by chain rule
= I(X; Y ) + I(X; Z) − I(X; Y, Z)
= I(X; Y ) + I(X; Z) − (H(X) + H(Y, Z) − H(X, Y, Z))
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − H(Y, Z)
= I(X; Y ) + I(X; Z) − H(X) + H(X, Y, Z) − (H(Y ) + H(Z) − I(Y ; Z))
= I(X; Y ) + I(X; Z) + I(Y ; Z) + H(X, Y, Z) − H(X) − H(Y ) − H(Z).
To show the second identity, simply substitute for I(X; Y ) , I(X; Z) , and I(Y ; Z)
using equations like
I(X; Y ) = H(X) + H(Y ) − H(X, Y ) .
These two identities show that I(X; Y ; Z) is a symmetric (but not necessarily nonneg-
ative) function of three random variables.
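A concrete sketch of the negativity claim uses the standard example of X and Y independent fair bits with Z = X ⊕ Y (this particular choice is an assumption for illustration; the text above does not specify one):

```python
from itertools import product
from math import log2

joint = {}
for x, y in product((0, 1), repeat=2):
    joint[(x, y, x ^ y)] = 0.25           # Z = X XOR Y

def H(margin):
    """Entropy of the marginal over the given coordinate indices."""
    pm = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in margin)
        pm[key] = pm.get(key, 0) + p
    return -sum(p * log2(p) for p in pm.values() if p > 0)

I_xy   = H((0,)) + H((1,)) - H((0, 1))                      # I(X;Y) = 0
I_xy_z = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))     # I(X;Y|Z) = 1
print(I_xy - I_xy_z)                                        # -1: I(X;Y;Z) < 0
```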
-D(p||q) = \sum_x p(x) \ln\frac{q(x)}{p(x)}   (2.45)
 \le \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right)   (2.46)
 \le 0   (2.47)
f(x) = x - 1 - \ln x   (2.48)

for 0 < x < ∞. Then f'(x) = 1 − 1/x and f''(x) = 1/x^2 > 0, and therefore f(x)
is strictly convex. Therefore a local minimum of the function is also a global
minimum. The function has a local minimum at the point where f'(x) = 0, i.e.,
when x = 1. Therefore f(x) ≥ f(1), i.e.,

x - 1 - \ln x \ge 1 - 1 - \ln 1 = 0   (2.49)
-D_e(p||q) = \sum_{x \in A} p(x) \ln\frac{q(x)}{p(x)}   (2.50)
 \le \sum_{x \in A} p(x)\left(\frac{q(x)}{p(x)} - 1\right)   (2.51)
 = \sum_{x \in A} q(x) - \sum_{x \in A} p(x)   (2.52)
 \le 0   (2.53)

The first step follows from the definition of D, the second step follows from the
inequality ln t ≤ t − 1, the third step from expanding the sum, and the last step
from the fact that q(A) ≤ 1 and p(A) = 1.
Solution:
H(p) = -\sum_{i=1}^{m} p_i \log p_i   (2.55)
 = -\sum_{i=1}^{m-2} p_i \log p_i - p_{m-1} \log p_{m-1} - p_m \log p_m   (2.56)
 = -\sum_{i=1}^{m-2} p_i \log p_i - p_{m-1} \log \frac{p_{m-1}}{p_{m-1}+p_m} - p_m \log \frac{p_m}{p_{m-1}+p_m}   (2.57)
   \quad - (p_{m-1}+p_m) \log(p_{m-1}+p_m)   (2.58)
 = H(q) - p_{m-1} \log \frac{p_{m-1}}{p_{m-1}+p_m} - p_m \log \frac{p_m}{p_{m-1}+p_m}   (2.59)
 = H(q) - (p_{m-1}+p_m) \left( \frac{p_{m-1}}{p_{m-1}+p_m} \log \frac{p_{m-1}}{p_{m-1}+p_m} + \frac{p_m}{p_{m-1}+p_m} \log \frac{p_m}{p_{m-1}+p_m} \right)   (2.60)
 = H(q) + (p_{m-1}+p_m) H_2\!\left( \frac{p_{m-1}}{p_{m-1}+p_m}, \frac{p_m}{p_{m-1}+p_m} \right),   (2.61)

where H_2(a, b) = -a \log a - b \log b.
28. Mixing increases entropy. Show that the entropy of the probability distribution,
(p1 , . . . , pi , . . . , pj , . . . , pm ) , is less than the entropy of the distribution
(p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_i + p_j}{2}, \ldots, p_m). Show that in general any transfer of probability that
makes the distribution more uniform increases the entropy.
Solution:
Mixing increases entropy.
This problem depends on the convexity of the log function. Let
P_1 = (p_1, \ldots, p_i, \ldots, p_j, \ldots, p_m)
P_2 = (p_1, \ldots, \frac{p_i + p_j}{2}, \ldots, \frac{p_j + p_i}{2}, \ldots, p_m)
Thus,
H(P2 ) ≥ H(P1 ).
29. Inequalities. Let X , Y and Z be joint random variables. Prove the following
inequalities and find conditions for equality.
Solution: Inequalities.
with equality iff H(Y |X, Z) = 0 , that is, when Y is a function of X and Z .
(b) Using the chain rule for mutual information,
with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
pendent given X .
(c) Using first the chain rule for entropy and then the definition of conditional mutual
information,
with equality iff I(Y ; Z|X) = 0 , that is, when Y and Z are conditionally inde-
pendent given X .
(d) Using the chain rule for mutual information,
and therefore
I(X; Z|Y ) = I(Z; Y |X) − I(Z; Y ) + I(X; Z) .
We see that this inequality is actually an equality in all cases.
30. Maximum entropy. Find the probability mass function p(x) that maximizes the
entropy H(X) of a non-negative integer-valued random variable X subject to the
constraint
EX = \sum_{n=0}^{\infty} n\, p(n) = A.
Notice that the final right hand side expression is independent of {p i } , and that the
inequality,
-\sum_{i=0}^{\infty} p_i \log p_i \le -\log \alpha - A \log \beta
(a) Find the minimum probability of error estimator X̂(Y ) and the associated Pe .
(b) Evaluate Fano’s inequality for this problem and compare.
Solution:
Hence the associated Pe is the sum of P (1, b), P (1, c), P (2, a), P (2, c), P (3, a)
and P (3, b). Therefore, Pe = 1/2.
(b) From Fano's inequality we know

P_e \ge \frac{H(X|Y) - 1}{\log |X|}.

Here, H(X|Y) = 1.5 bits. Hence

P_e \ge \frac{1.5 - 1}{\log 3} = 0.316.

Hence our estimator X̂(Y) is not very close to Fano's bound in this form. If
X̂ ∈ X, as it does here, we can use the stronger form of Fano's inequality to get

P_e \ge \frac{H(X|Y) - 1}{\log(|X| - 1)},

and

P_e \ge \frac{1.5 - 1}{\log 2} = \frac{1}{2}.
Therefore our estimator X̂(Y ) is actually quite good.
which is the unconditional form of Fano’s inequality. We can weaken this inequality to
obtain an explicit lower bound for Pe ,
P_e \ge \frac{H(X) - 1}{\log(m - 1)}.   (2.67)
34. Entropy of initial conditions. Prove that H(X 0 |Xn ) is non-decreasing with n for
any Markov chain.
Solution: Entropy of initial conditions. For a Markov chain, by the data processing
theorem, we have
I(X0 ; Xn−1 ) ≥ I(X0 ; Xn ). (2.68)
Therefore
H(X0 ) − H(X0 |Xn−1 ) ≥ H(X0 ) − H(X0 |Xn ) (2.69)
or H(X0|Xn) ≥ H(X0|Xn−1); that is, H(X0|Xn) is non-decreasing with n.
35. Relative entropy is not symmetric: Let the random variable X have three possible
outcomes {a, b, c} . Consider two distributions on this random variable
Symbol p(x) q(x)
a 1/2 1/3
b 1/4 1/3
c 1/4 1/3
Calculate H(p), H(q), D(p||q) and D(q||p). Verify that in this case D(p||q) ≠ D(q||p).
Solution:
H(p) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{4}\log 4 = 1.5 \text{ bits.}   (2.70)

H(q) = \frac{1}{3}\log 3 + \frac{1}{3}\log 3 + \frac{1}{3}\log 3 = \log 3 = 1.58496 \text{ bits.}   (2.71)

D(p||q) = \frac{1}{2}\log\frac{3}{2} + \frac{1}{4}\log\frac{3}{4} + \frac{1}{4}\log\frac{3}{4} = \log 3 - 1.5 = 1.58496 - 1.5 = 0.08496 \text{ bits.}   (2.72)

D(q||p) = \frac{1}{3}\log\frac{2}{3} + \frac{1}{3}\log\frac{4}{3} + \frac{1}{3}\log\frac{4}{3} = \frac{5}{3} - \log 3 = 1.66666 - 1.58496 = 0.08170 \text{ bits.}   (2.73)
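A two-line numerical check of (2.72) and (2.73) (a sketch, using the same two distributions):

```python
from math import log2

p = {"a": 1/2, "b": 1/4, "c": 1/4}
q = {"a": 1/3, "b": 1/3, "c": 1/3}

def D(p, q):
    return sum(p[x] * log2(p[x] / q[x]) for x in p)

print(D(p, q), D(q, p))   # ≈ 0.08496 and 0.08170 bits: not equal
```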
36. Symmetric relative entropy: Though, as the previous example shows, D(p||q) ≠
D(q||p) in general, there could be distributions for which equality holds. Give an
example of two distributions p and q on a binary alphabet such that D(p||q) = D(q||p)
(other than the trivial case p = q ).
Solution:
A simple case for D((p, 1−p)||(q, 1−q)) = D((q, 1−q)||(p, 1−p)), i.e., for

p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q} = q \log\frac{q}{p} + (1-q) \log\frac{1-q}{1-p},   (2.74)

is when q = 1 − p.
37. Relative entropy: Let X, Y, Z be three random variables with a joint probability
mass function p(x, y, z) . The relative entropy between the joint distribution and the
product of the marginals is
D(p(x, y, z)||p(x)p(y)p(z)) = E\!\left[\log \frac{p(x, y, z)}{p(x)p(y)p(z)}\right]   (2.75)
Expand this in terms of entropies. When is this quantity zero?
Solution:
D(p(x, y, z)||p(x)p(y)p(z)) = E\!\left[\log \frac{p(x, y, z)}{p(x)p(y)p(z)}\right]   (2.76)
 = E[\log p(x, y, z)] - E[\log p(x)] - E[\log p(y)] - E[\log p(z)]   (2.77)
 = -H(X, Y, Z) + H(X) + H(Y) + H(Z)   (2.78)

This quantity is zero if and only if p(x, y, z) = p(x)p(y)p(z) for all (x, y, z), i.e., if and only if X, Y and Z are mutually independent.
(a) Under this constraint, what is the minimum value for H(X, Y, Z) ?
(b) Give an example achieving this minimum.
Solution:
(a)
Solution:
(a)
42. Inequalities. Which of the following inequalities are generally ≥, =, ≤ ? Label each
with ≥, =, or ≤ .
Solution:
(a) X → 5X is a one-to-one mapping, and hence H(X) = H(5X).
(b) By data processing inequality, I(g(X); Y ) ≤ I(X; Y ) .
(c) Because conditioning reduces entropy, H(X 0 |X−1 ) ≥ H(X0 |X−1 , X1 ) .
(d) H(X, Y ) ≤ H(X) + H(Y ) , so H(X, Y )/(H(X) + H(Y )) ≤ 1 .
43. Mutual information of heads and tails.
(a) Consider a fair coin flip. What is the mutual information between the top side
and the bottom side of the coin?
(b) A 6-sided fair die is rolled. What is the mutual information between the top side
and the front face (the side most facing you)?
Solution:
Mutual information of heads and tails.
To prove (a) observe that
I(T ; B) = H(B) − H(B|T )
= log 2 = 1
since B ∼ Ber(1/2) , and B = f (T ) . Here B, T stand for Bottom and Top respectively.
To prove (b) note that having observed a side of the cube facing us F , there are four
possibilities for the top T , which are equally probable. Thus,
I(T ; F ) = H(T ) − H(T |F )
= log 6 − log 4
= log 3 − 1
since T has uniform distribution on {1, 2, . . . , 6} .
(a) How would you use 2 independent flips X1, X2 to generate (if possible) a Bernoulli(1/2)
random variable Z?
(b) What is the resulting maximum expected number of fair bits generated?
Solution:
(a) The trick here is to notice that for any two letters Y and Z produced by two
independent tosses of our bent three-sided coin, Y Z has the same probability as
Z Y. So we can produce B ∼ Bernoulli(1/2) coin flips by letting B = 0 when we
get AB, BC or AC, and B = 1 when we get BA, CB or CA (if we get AA,
BB or CC we don't assign a value to B).
(b) The expected number of bits generated by the above scheme is as follows. We get
one bit, except when the two flips of the 3-sided coin produce the same symbol.
So the expected number of fair bits generated per pair of flips is the probability
that the two flips differ, i.e., 1 − \sum_i p_i^2, where the p_i are the three face probabilities.
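A simulation sketch of the scheme in (a); the face probabilities below are illustrative assumptions, since the actual values are not given in this excerpt:

```python
import random
from collections import Counter

random.seed(0)
pA, pB, pC = 0.5, 0.3, 0.2               # assumed face probabilities (illustrative only)
faces, weights = "ABC", (pA, pB, pC)

bits = []
for _ in range(200_000):
    y, z = random.choices(faces, weights=weights, k=2)   # two independent flips
    if y != z:
        # one fair bit per unequal pair: (y, z) and (z, y) are equally likely
        bits.append(0 if (y, z) in {("A", "B"), ("B", "C"), ("A", "C")} else 1)

counts = Counter(bits)
print(counts[0] / len(bits), counts[1] / len(bits))       # both ≈ 0.5
print(len(bits) / 200_000, 1 - (pA**2 + pB**2 + pC**2))   # bits per pair ≈ 1 - sum p_i^2
```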
45. Finite entropy. Show that for a discrete random variable X ∈ {1, 2, . . .} , if E log X <
∞ , then H(X) < ∞ .
Solution: Let the distribution on the integers be p_1, p_2, \ldots. Then H(p) = -\sum_i p_i \log p_i
and E \log X = \sum_i p_i \log i = c < ∞.
We will now find the maximum entropy distribution subject to the constraint on the
expected logarithm. Using Lagrange multipliers or the results of Chapter 12, we have
the following functional to optimize
J(p) = -\sum_i p_i \log p_i - \lambda_1 \sum_i p_i - \lambda_2 \sum_i p_i \log i   (2.84)
Differentiating with respect to p_i and setting to zero, we find that the p_i that maximizes
the entropy is of the form p_i = a\, i^{\lambda}, where a = 1/(\sum_i i^{\lambda}) and λ is chosen to meet
the expected-log constraint, i.e.,

\sum_i i^{\lambda} \log i = c \sum_i i^{\lambda}   (2.85)
Using this value of pi , we can see that the entropy is finite.
46. Axiomatic definition of entropy. If we assume certain axioms for our measure of
information, then we will be forced to use a logarithmic measure like entropy. Shannon
used this to justify his initial definition of entropy. In this book, we will rely more on
the other properties of entropy rather than its axiomatic derivation to justify its use.
The following problem is considerably more difficult than the other problems in this
section.
If a sequence of symmetric functions H m (p1 , p2 , . . . , pm ) satisfies the following proper-
ties,
• Normalization: H_2\!\left(\frac{1}{2}, \frac{1}{2}\right) = 1,
• Continuity: H_2(p, 1 − p) is a continuous function of p,
• Grouping: H_m(p_1, p_2, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2) H_2\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right),

then H_m must be of the form H_m(p_1, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log_2 p_i, for m = 2, 3, \ldots.
There are various other axiomatic formulations which also result in the same definition
of entropy. See, for example, the book by Csiszár and Körner[4].
Solution: Axiomatic definition of entropy. This is a long solution, so we will first
outline what we plan to do. First we will extend the grouping axiom by induction and
prove that

H_m(p_1, \ldots, p_m) = H_{m-k+1}(S_k, p_{k+1}, \ldots, p_m) + S_k H_k\!\left(\frac{p_1}{S_k}, \ldots, \frac{p_k}{S_k}\right), \quad \text{where } S_k = \sum_{i=1}^{k} p_i,

and we will denote H_2(q, 1 − q) as h(q). Then we can write the grouping axiom as
H_m(p_1, \ldots, p_m) = H_{m-1}(S_2, p_3, \ldots, p_m) + S_2\, h\!\left(\frac{p_2}{S_2}\right).   (2.90)
Now we apply the same grouping axiom repeatedly to H_k(p_1/S_k, \ldots, p_k/S_k) to obtain

H_k\!\left(\frac{p_1}{S_k}, \ldots, \frac{p_k}{S_k}\right) = H_2\!\left(\frac{S_{k-1}}{S_k}, \frac{p_k}{S_k}\right) + \sum_{i=2}^{k-1} \frac{S_i}{S_k}\, h\!\left(\frac{p_i/S_k}{S_i/S_k}\right)   (2.95)
 = \frac{1}{S_k} \sum_{i=2}^{k} S_i\, h\!\left(\frac{p_i}{S_i}\right).   (2.96)
f(m+1) = H_{m+1}\!\left(\frac{1}{m+1}, \ldots, \frac{1}{m+1}\right)   (2.107)
 = h\!\left(\frac{1}{m+1}\right) + \frac{m}{m+1} H_m\!\left(\frac{1}{m}, \ldots, \frac{1}{m}\right)   (2.108)
 = h\!\left(\frac{1}{m+1}\right) + \frac{m}{m+1} f(m),   (2.109)

and therefore

f(m+1) - \frac{m}{m+1} f(m) = h\!\left(\frac{1}{m+1}\right).   (2.110)

Thus \lim \left[ f(m+1) - \frac{m}{m+1} f(m) \right] = \lim h\!\left(\frac{1}{m+1}\right). But by the continuity of H_2, it follows
that the limit on the right is h(0) = 0. Thus \lim h\!\left(\frac{1}{m+1}\right) = 0.
Let us define

a_{n+1} = f(n+1) - f(n)   (2.111)

and

b_n = h\!\left(\frac{1}{n}\right).   (2.112)

Then

a_{n+1} = -\frac{1}{n+1} f(n) + b_{n+1}   (2.113)
 = -\frac{1}{n+1} \sum_{i=2}^{n} a_i + b_{n+1}   (2.114)

and therefore

(n+1) b_{n+1} = (n+1) a_{n+1} + \sum_{i=2}^{n} a_i.   (2.115)

\sum_{n=2}^{N} n b_n = \sum_{n=2}^{N} (n a_n + a_{n-1} + \cdots + a_2) = N \sum_{n=2}^{N} a_n.   (2.116)
Dividing both sides by \sum_{n=1}^{N} n = N(N+1)/2, we obtain

\frac{2}{N+1} \sum_{n=2}^{N} a_n = \frac{\sum_{n=2}^{N} n b_n}{\sum_{n=2}^{N} n}.   (2.117)
Lemma 2.0.1 Let the function f (m) satisfy the following assumptions:
g(n) = f(n) - \frac{f(P) \log_2 n}{\log_2 P}.   (2.120)

Then g(n) satisfies the first assumption of the lemma. Also g(P) = 0.
Also, if we let

\alpha_n = g(n+1) - g(n) = f(n+1) - f(n) + \frac{f(P)}{\log_2 P} \log_2 \frac{n}{n+1},   (2.121)
n = n^{(1)} P + l   (2.123)
where 0 ≤ l < P. From the fact that g(P) = 0, it follows that g(P n^{(1)}) = g(n^{(1)}),
and

g(n) = g(n^{(1)}) + g(n) - g(P n^{(1)}) = g(n^{(1)}) + \sum_{i = P n^{(1)}}^{n-1} \alpha_i.   (2.124)

Just as we have defined n^{(1)} from n, we can define n^{(2)} from n^{(1)}. Continuing this
process, we can then write

g(n) = g(n^{(k)}) + \sum_{j=1}^{k} \sum_{i = P n^{(j)}}^{n^{(j-1)} - 1} \alpha_i.   (2.125)
Since P was arbitrary, it follows that f (P )/ log 2 P = c for every prime number P .
Applying the third axiom in the lemma, it follows that the constant is 1, and f (P ) =
log2 P .
For composite numbers N = P1 P2 . . . Pl , we can apply the first property of f and the
prime number factorization of N to show that
f(N) = \sum_i f(P_i) = \sum_i \log_2 P_i = \log_2 N.   (2.130)
For any integer m, let r > 0 be another integer and let 2^k ≤ m^r < 2^{k+1}. Then by
the monotonicity assumption on f, we have

c\,\frac{k}{r} \le f(m) < c\,\frac{k+1}{r}.   (2.132)

Now by the monotonicity of log, we have

\frac{k}{r} \le \log_2 m < \frac{k+1}{r}.   (2.133)

Combining these two equations, we obtain

\left| \frac{f(m)}{c} - \log_2 m \right| < \frac{1}{r}.   (2.134)

Since r was arbitrary, we must have

f(m) = c \log_2 m,   (2.135)

and we can identify c = 1 from the last assumption of the lemma.
Now we are almost done. We have shown that for any uniform distribution on m
outcomes, f (m) = Hm (1/m, . . . , 1/m) = log 2 m .
We will now show that

H_2(p, 1-p) = -p \log_2 p - (1-p) \log_2 (1-p).   (2.136)
To begin, let p be a rational number, r/s, say. Consider the extended grouping axiom
for H_s:

f(s) = H_s\!\left(\frac{1}{s}, \ldots, \frac{1}{s}\right) = H\!\left(\underbrace{\frac{1}{s}, \ldots, \frac{1}{s}}_{r}, \frac{s-r}{s}\right) + \frac{s-r}{s} f(s-r)   (2.137)
 = H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) + \frac{r}{s} f(r) + \frac{s-r}{s} f(s-r).   (2.138)

Substituting f(s) = \log_2 s, etc., we obtain

H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) = -\frac{r}{s} \log_2 \frac{r}{s} - \left(1 - \frac{r}{s}\right) \log_2\!\left(1 - \frac{r}{s}\right).   (2.139)
Thus (2.136) is true for rational p . By the continuity assumption, (2.136) is also true
at irrational p .
To complete the proof, we have to extend the definition from H_2 to H_m, i.e., we have
to show that

H_m(p_1, \ldots, p_m) = -\sum_i p_i \log p_i   (2.140)
for all m . This is a straightforward induction. We have just shown that this is true for
m = 2 . Now assume that it is true for m = n − 1 . By the grouping axiom,
Thus the statement is true for m = n , and by induction, it is true for all m . Thus we
have finally proved that the only symmetric function that satisfies the axioms is
H_m(p_1, \ldots, p_m) = -\sum_{i=1}^{m} p_i \log p_i.   (2.146)
The heart of this problem is simply carefully counting the possible outcome states.
There are n ways to choose which card gets mis-sorted, and, once the card is chosen,
there are again n ways to choose where the card is replaced in the deck. Each of these
shuffling actions has probability 1/n^2. Unfortunately, not all of these n^2 actions result
in a unique mis-sorted file. So we need to carefully count the number of distinguishable
outcome states. The resulting deck can only take on one of the following three cases.
To compute the entropy of the resulting deck, we need to know the probability of each
case.
Case 1 (resulting deck is the same as the original): There are n ways to achieve this
outcome state, one for each of the n cards in the deck. Thus, the probability associated
with case 1 is n/n^2 = 1/n.
Case 2 (adjacent pair swapping): There are n − 1 adjacent pairs, each of which will
have a probability of 2/n^2, since for each pair, there are two ways to achieve the swap,
either by selecting the left-hand card and moving it one to the right, or by selecting the
right-hand card and moving it one to the left.
Case 3 (typical situation): None of the remaining actions "collapses". They all result
in unique outcome states, each with probability 1/n^2. Of the n^2 possible shuffling
actions, n^2 − n − 2(n − 1) of them result in this third case (we've simply subtracted
the case 1 and case 2 situations above).
The entropy of the resulting deck can be computed as follows:

H(X) = \frac{1}{n}\log n + (n-1)\,\frac{2}{n^2}\log\frac{n^2}{2} + (n^2 - 3n + 2)\,\frac{1}{n^2}\log n^2
     = \frac{2n-1}{n}\log n - \frac{2(n-1)}{n^2}.
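The formula can be verified by brute force for small n. The sketch below (not part of the original solution) enumerates all n^2 equally likely (card, position) actions and computes the entropy of the resulting deck:

```python
from math import log2

def deck_entropy(n):
    counts = {}
    deck = tuple(range(n))
    for card in range(n):
        for pos in range(n):
            rest = deck[:card] + deck[card + 1:]           # remove the chosen card
            new = rest[:pos] + (deck[card],) + rest[pos:]  # reinsert it at position pos
            counts[new] = counts.get(new, 0) + 1
    total = n * n
    return -sum((c / total) * log2(c / total) for c in counts.values())

n = 6
formula = (2 * n - 1) / n * log2(n) - 2 * (n - 1) / n**2
print(deck_entropy(n), formula)    # the two values agree
```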
Let’s now consider a different stopping time. For this part, again assume X i ∼ Bernoulli (1/2)
but stop at time N = 6 , with probability 1/3 and stop at time N = 12 with probability
2/3. Let this stopping time be independent of the sequence X 1 X2 . . . X12 .
Solution:
(a)
where (a) comes from the fact that the entropy of a geometric random variable is
just the mean.
H(X N |N ) = 0.
(c)
(d)
(e)
H(X^N | N) = \frac{1}{3} H(X^6 | N = 6) + \frac{2}{3} H(X^{12} | N = 12)
 = \frac{1}{3} H(X^6) + \frac{2}{3} H(X^{12})
 = \frac{1}{3}\cdot 6 + \frac{2}{3}\cdot 12
 = 10.
(f)
(a) (Markov’s inequality.) For any non-negative random variable X and any t > 0 ,
show that
Pr\{X \ge t\} \le \frac{EX}{t}.   (3.1)
Exhibit a random variable that achieves this inequality with equality.
(b) (Chebyshev's inequality.) Let Y be a random variable with mean µ and variance
σ^2. By letting X = (Y − µ)^2, show that for any ε > 0,

Pr\{|Y - \mu| > \epsilon\} \le \frac{\sigma^2}{\epsilon^2}.   (3.2)
(c) (The weak law of large numbers.) Let Z_1, Z_2, \ldots, Z_n be a sequence of i.i.d. random
variables with mean µ and variance σ^2. Let \bar{Z}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i be the sample mean.
Show that

Pr\left\{ \left| \bar{Z}_n - \mu \right| > \epsilon \right\} \le \frac{\sigma^2}{n \epsilon^2}.   (3.3)

Thus Pr\{|\bar{Z}_n - \mu| > \epsilon\} → 0 as n → ∞. This is known as the weak law of large
numbers.
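A short simulation sketch (Bernoulli(1/2) samples; the sample sizes and ε are illustrative choices) shows the empirical exceedance probability falling with n and staying below the Chebyshev bound σ²/(nε²):

```python
import random

random.seed(0)
mu, var, eps, trials = 0.5, 0.25, 0.05, 2000

for n in (10, 100, 1000):
    exceed = 0
    for _ in range(trials):
        zbar = sum(random.random() < 0.5 for _ in range(n)) / n   # sample mean
        exceed += abs(zbar - mu) > eps
    print(n, exceed / trials, "<=", var / (n * eps * eps))
```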