
Information Theory

Problem Sheet 1
(Most questions are from Cover & Thomas, the corresponding question numbers (as in 1st ed.) are given in brackets at the start of the question)

Notation: x, x, X are scalar, vector and matrix random variables respectively.


1. [2.1] A fair coin is flipped until the first head occurs. Let x denote the number of flips required. The following expressions may be useful:
   ∑_{n=1}^∞ r^n = r/(1 − r),   ∑_{n=1}^∞ n r^n = r/(1 − r)²
   (a) Find the entropy H(x) in bits.
   (b) A random variable x is drawn according to this distribution. Find an "efficient" sequence of yes-no questions of the form "Is x contained in the set S?". Compare H(x) to the expected number of questions required to determine x.

2. [~2.2] x is a random variable taking integer values. What can you say about the relationship between H(x) and H(y) if
   (a) y = x²
   (b) y = x³

3. [2.3] If p is an n-dimensional probability vector, what are the maximum and minimum values of H(p)? Find all vectors p for which H(p) achieves its maximum or minimum value.

4. We write H(p) (with a scalar p) to denote the entropy of the Bernoulli random variable with probability mass vector p = [1 − p  p]. Prove the following properties of this function:
   (a) H′(p) = log(1 − p) − log p
   (b) H″(p) = −log e / (p(1 − p))
   (c) H(p) ≥ 2 min(p, 1 − p)
   (d) H(p) ≥ 1 − 4(p − ½)²
   (e) H(p) ≤ 1 − (2 log e)(p − ½)²

5. [2.5] Let x be a discrete random variable and g(x) a deterministic function of it. Show that H(g(x)) ≤ H(x) by justifying the following steps (the letters label the steps):
   H(x, g(x)) =(a) H(x) + H(g(x) | x) =(b) H(x)
   H(x, g(x)) =(c) H(g(x)) + H(x | g(x)) ≥(d) H(g(x))

6. [2.6] Show that if H(y | x) = 0, then y is a function of x, that is, for all x with p(x) > 0, there is only one possible value of y with p(x, y) > 0.



7. [~2.7] xi is a sequence of i.i.d. Bernoulli random variables with p(xi = 1) = p, where p is unknown. We want to find a function f that converts n samples of x into a smaller number, K, of i.i.d. Bernoulli random variables, zi, with p(zi = 1) = ½. Thus z1:K = f(x1:n), where K can depend on the values xi.
   (a) Show that the following mapping for n = 4 satisfies the requirements and find the expected value of K, E(K):
       0000,1111→ignore; 1010→0; 0101→1; 0001,0011,0111→00;
       0010,0110,1110→01; 0100,1100,1101→10; 1000,1001,1011→11
   (b) Justify the steps in the following bound on E(K) (the letters label the steps):
       nH(p) =(a) H(x1:n) ≥(b) H(z1:K, K) =(c) H(K) + H(z1:K | K) =(d) H(K) + E K ≥(e) E K

8. [2.10] Give examples of joint random variables x, y and z such that:
   (a) I(x;y | z) < I(x;y)
   (b) I(x;y | z) > I(x;y)

9. [2.12] We can define the "mutual information" between three variables as
       I(x;y;z) = I(x;y) − I(x;y | z)
   (a) Prove that
       I(x;y;z) = H(x,y,z) − H(x,y) − H(y,z) − H(z,x) + H(x) + H(y) + H(z)
   (b) Give an example where I(x;y;z) is negative. This lack of positivity means that it does not have the intuitive properties of an "information" measure, which is why I put "mutual information" in quotes above.

10. [2.17] Show that logₑ(x) ≥ 1 − 1/x for x > 0.

11. [~2.16] x and y are correlated binary random variables with p(x=y=0) = 0 and all other joint probabilities equal to 1/3. Calculate H(x), H(y), H(x|y), H(y|x), H(x,y) and I(x;y).

12. [~2.22] If x→y→z form a Markov chain, and the alphabet of y has size |Y| = k, show that I(x;z) ≤ log k. What does this tell you if k = 1?

13. [2.29] Prove the following and find the conditions for equality:
   (a) H(x,y | z) ≥ H(x | z)
   (b) I(x,y;z) ≥ I(x;z)
   (c) H(x,y,z) − H(x,y) ≤ H(x,z) − H(x)
   (d) I(x;z | y) ≥ I(z;y | x) − I(z;y) + I(x;z)



Information Theory
Solution Sheet 1

1. (a) x = n means that Tail occurs for the first n − 1 flips, while the last flip is Head. Thus x has distribution P(x = n) = 2^−(n−1) × 2^−1 = 2^−n, and
       H(x) = ∑_{n=1}^∞ 2^−n log₂ 2^n = ∑_{n=1}^∞ n 2^−n = (1/2) / (1 − 1/2)² = 2 bits.
   (b) Ask whether x = 1, 2, 3, … in turn, i.e., ask the following questions:
       Is x = 1?
       If not, is x = 2?
       If not, is x = 3?
       …
       The expected number of questions is ∑_{n=1}^∞ n 2^−n = 2, which equals H(x).
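A quick numerical sketch of these two sums (my addition, not part of the original solution; the truncation point N is arbitrary):

import math

# Geometric distribution P(x = n) = 2**(-n), truncated at N terms.
N = 60
probs = [2.0 ** (-n) for n in range(1, N + 1)]

# Entropy in bits: H(x) = sum_n P(n) * log2(1/P(n)) = sum_n n * 2^(-n).
H = sum(p * math.log2(1.0 / p) for p in probs)

# The questioning scheme needs n questions when x = n.
E_questions = sum(n * 2.0 ** (-n) for n in range(1, N + 1))

print(H, E_questions)   # both are 2.0 to within truncation error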
2. H(x,y) = H(x) + H(y|x) = H(y) + H(x|y), but H(y|x) = 0 since y is a function of x, so H(y) = H(x) − H(x|y) ≤ H(x), with equality iff H(x|y) = 0, which holds only if x is a function of y, i.e. if y is a one-to-one function of x for every value of x with p(x) > 0. Hence
   (a) H(y) ≤ H(x), possibly with strict inequality, because squaring need not be one-to-one (for example, 1² = (−1)²);
   (b) H(y) = H(x), since cubing is one-to-one on the integers.

3. The maximum is log n, achieved iff all elements of p are equal. The minimum is 0, achieved iff exactly one element of p is non-zero; there are n possible choices of which element this is.

4. (a) and (b) are straightforward calculus: easiest to convert the logs to base e first. For the others, assume ½ < p < 1 for convenience (the other half follows by symmetry). Since H″(p) < 0, H(p) is concave and so lies above the straight line 2 − 2p that appears in (c).
   For (d) we consider D(p) = H(p) − 1 + 4(p − ½)². D″(p) = 0 is a quadratic in p and has only two solutions, p = ½ ± √((2 − log e)/8) = 0.5 ± 0.26. Therefore D′(p) increases from 0 at p = 0.5 to reach a maximum at p = 0.76 and decreases thereafter. This implies that D′(p) = 0 has only one solution for p > ½ and therefore that D(p) has a single maximum. Since D(½) = D(1) = 0 we must have D(p) > 0 for ½ < p < 1.
   For (e): at p = ½ the bound has the same value and first two derivatives as H(p). For ½ < p < 1 its second derivative is greater than H″(p), and so the bound follows.
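As a numerical cross-check of parts (c)–(e) (my own sketch, not part of the sheet), one can evaluate H(p) on a grid and confirm the three inequalities:

import math

def H(p):
    # Binary entropy in bits.
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

log2e = math.log2(math.e)
for i in range(1, 1000):
    p = i / 1000.0
    assert H(p) >= 2 * min(p, 1 - p) - 1e-12               # bound (c)
    assert H(p) >= 1 - 4 * (p - 0.5) ** 2 - 1e-12           # bound (d)
    assert H(p) <= 1 - 2 * log2e * (p - 0.5) ** 2 + 1e-12   # bound (e)
print("all three bounds hold on the grid")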
5. (a) chain rule; (b) g(x) given x has only one possible value and hence zero entropy; (c) chain rule; (d) entropy is non-negative. We have equality at (d) iff g(x) is a one-to-one function for every x with p(x) > 0.

6. H(y|x) = ∑_x p(x) H(y | x = x).
   All terms are non-negative, so the sum is zero only if all terms are zero. For any given term this is true either if p(x) = 0 or if H(y | x = x) is zero. The second case arises only if y given x = x has only one possible value, i.e. y is a function of x. The first case is why we needed the qualification about p(x) > 0 in answers 2 and 5 above.

7. (a) The probability of any given value of x1:4 depends only on the number of 1's and 0's. We create four subsets with equal probabilities to generate a pair of bits and two other subsets to generate one bit only. The expected number of bits generated is
       E K = 8p(1 − p)³ + 10p²(1 − p)² + 8p³(1 − p).
   (b) (a) i.i.d. entropies add; (b) functions reduce entropy; (c) chain rule; (d) given K, the zi are i.i.d. with entropy 1 bit each, so H(z1:K | K) = E K; (e) entropy is non-negative.
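A small enumeration (my own sketch; the choice p = 3/10 is arbitrary) confirms both the closed form for E K and that the first output bit is unbiased:

from fractions import Fraction

# The mapping from problem 7(a); keys are input blocks, values are output bit strings.
mapping = {
    "0000": "", "1111": "", "1010": "0", "0101": "1",
    "0001": "00", "0011": "00", "0111": "00",
    "0010": "01", "0110": "01", "1110": "01",
    "0100": "10", "1100": "10", "1101": "10",
    "1000": "11", "1001": "11", "1011": "11",
}

p = Fraction(3, 10)   # exact arithmetic so the comparison below is exact

def prob(block):
    ones = block.count("1")
    return p ** ones * (1 - p) ** (4 - ones)

EK = sum(prob(b) * len(mapping[b]) for b in mapping)
closed_form = 8 * p * (1 - p) ** 3 + 10 * p ** 2 * (1 - p) ** 2 + 8 * p ** 3 * (1 - p)
assert EK == closed_form

# The first output bit should be 0 and 1 with equal probability.
p_first_0 = sum(prob(b) for b in mapping if mapping[b][:1] == "0")
p_first_1 = sum(prob(b) for b in mapping if mapping[b][:1] == "1")
assert p_first_0 == p_first_1
print("E[K] =", EK)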
8. (a) This is true for any Markov chain x→y→z. One possibility is x = y = z, all fair Bernoulli variables, for which I(x;y|z) = 0 < 1 = I(x;y).
   (b) An example of this was given in lectures. A slightly different example: let x and y be independent fair binary variables and z = xy (their product). Then I(x;y) = 0, but knowing z entangles x and y, so I(x;y|z) > 0.
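The following sketch (not from the sheet; the marginal-entropy helper H and variable names are mine) computes both quantities for this example:

import math
from itertools import product

# Joint pmf over (x, y, z) with x, y independent fair bits and z = x*y.
pxyz = {(x, y, x * y): 0.25 for x, y in product((0, 1), repeat=2)}

def H(joint, idx):
    # Entropy (bits) of the marginal over the coordinates listed in idx.
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

# I(x;y) = H(x) + H(y) - H(x,y)
I_xy = H(pxyz, (0,)) + H(pxyz, (1,)) - H(pxyz, (0, 1))
# I(x;y|z) = H(x,z) + H(y,z) - H(z) - H(x,y,z)
I_xy_given_z = H(pxyz, (0, 2)) + H(pxyz, (1, 2)) - H(pxyz, (2,)) - H(pxyz, (0, 1, 2))

print(I_xy, I_xy_given_z)   # 0.0 and a strictly positive value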
9. (a) I(x;y;z) = {H(x) − H(x|y)} − {H(x|z) − H(x|y,z)}
               = H(x) − {H(x,y) − H(y)} − {H(x,z) − H(z)} + {H(x,y,z) − H(y,z)}
               = H(x,y,z) − H(x,y) − H(y,z) − H(z,x) + H(x) + H(y) + H(z).
   (b) Use the example from 8(b) above: there I(x;y) = 0 while I(x;y|z) > 0, so I(x;y;z) = I(x;y) − I(x;y|z) < 0.
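To double-check the algebra, a sketch (mine, not the sheet's) that verifies the seven-term expansion against the definition for a randomly generated joint pmf:

import math
import random
from itertools import product

random.seed(0)

# A random joint pmf over three binary variables (coordinates x, y, z).
outcomes = list(product((0, 1), repeat=3))
w = [random.random() for _ in outcomes]
pxyz = {o: wi / sum(w) for o, wi in zip(outcomes, w)}

def H(idx):
    # Entropy (bits) of the marginal of pxyz over coordinates idx.
    marg = {}
    for o, p in pxyz.items():
        k = tuple(o[i] for i in idx)
        marg[k] = marg.get(k, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

I_xy = H((0,)) + H((1,)) - H((0, 1))
I_xy_given_z = H((0, 2)) + H((1, 2)) - H((2,)) - H((0, 1, 2))
lhs = I_xy - I_xy_given_z                        # definition of I(x;y;z)
rhs = (H((0, 1, 2)) - H((0, 1)) - H((1, 2)) - H((2, 0))
       + H((0,)) + H((1,)) + H((2,)))            # seven-term expansion
assert abs(lhs - rhs) < 1e-9
print("identity verified:", lhs)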



10. Define f(x) = ln(x) + 1/x − 1. This is continuous and differentiable in (0,∞). Differentiate twice to show that the only extremum occurs at x = 1 and that it is a minimum. Hence f(x) ≥ f(1) = 0, i.e. ln(x) ≥ 1 − 1/x.
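A quick numerical spot check (my addition) over a logarithmic grid of points in (0, ∞):

import math

for k in range(-60, 61):
    x = 10.0 ** (k / 10.0)
    assert math.log(x) >= 1 - 1 / x - 1e-12
print("inequality holds at all sampled points")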

11. H(x)=H(y)=0.918; H(x|y)=H(y|x)=0.667; H(x,y)=1.58; I(x;y)=0.252.
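These values can be reproduced directly from the joint distribution; the helper and variable names below are mine:

import math

# Joint pmf from problem 11: p(x=0,y=0)=0, the other three cells are 1/3 each.
pxy = {(0, 1): 1/3, (1, 0): 1/3, (1, 1): 1/3}

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = {0: 1/3, 1: 2/3}   # marginal of x
py = {0: 1/3, 1: 2/3}   # marginal of y

H_x = entropy(px.values())            # = H(1/3) ≈ 0.918
H_y = entropy(py.values())
H_xy = entropy(pxy.values())          # = log2(3) ≈ 1.585
H_x_given_y = H_xy - H_y              # ≈ 0.667
H_y_given_x = H_xy - H_x
I_xy = H_x + H_y - H_xy               # ≈ 0.252
print(H_x, H_x_given_y, H_xy, I_xy)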

12. The data processing inequality says that I(x;z)≤I(x;y)=H(y)–H(y|x) ≤H(y) ≤ log k
where the last inequality is the uniform bound on entropy. If k=1 then log k = 0 and so
x and z must be independent.

13. (a) H(x,y|z)=H(x|z)+H(y|x,z)≥H(x|z) with equality if y is a function of x and z.


(b) I(x,y;z)=I(x;z)+I(y;z|x) ≥ I(x;z) with equality if y and z are conditionally
independent given x.
(c) H(x,y,z)–H(x,y)=H(z|x,y)=H(z|x)-I(y;z|x) ≤ H(z|x)=H(x,z)-H(x) with equality if y
and z are conditionally independent given x.
(d) I(x,y;z)=I(y;z)+I(x;z|y)= I(x;z)+I(y;z|x). Rearrange this to give the inequality
which is in fact always an equality (trick question).
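A numerical sketch (mine; it uses a randomly generated joint pmf over three binary variables and a marginal-entropy helper H) that checks (a)–(d), including the fact that (d) holds with equality:

import math
import random
from itertools import product

random.seed(1)
outcomes = list(product((0, 1), repeat=3))   # coordinates: (x, y, z)
w = [random.random() for _ in outcomes]
pxyz = {o: wi / sum(w) for o, wi in zip(outcomes, w)}

def H(idx):
    # Entropy (bits) of the marginal of pxyz over coordinates idx.
    marg = {}
    for o, p in pxyz.items():
        k = tuple(o[i] for i in idx)
        marg[k] = marg.get(k, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

X, Y, Z = 0, 1, 2
# (a) H(x,y|z) >= H(x|z)
assert H((X, Y, Z)) - H((Z,)) >= H((X, Z)) - H((Z,)) - 1e-9
# (b) I(x,y;z) >= I(x;z)
I_xy_z = H((X, Y)) + H((Z,)) - H((X, Y, Z))
I_x_z = H((X,)) + H((Z,)) - H((X, Z))
assert I_xy_z >= I_x_z - 1e-9
# (c) H(x,y,z) - H(x,y) <= H(x,z) - H(x)
assert H((X, Y, Z)) - H((X, Y)) <= H((X, Z)) - H((X,)) + 1e-9
# (d) I(x;z|y) >= I(z;y|x) - I(z;y) + I(x;z), in fact always with equality
I_x_z_given_y = H((X, Y)) + H((Y, Z)) - H((Y,)) - H((X, Y, Z))
I_z_y_given_x = H((X, Z)) + H((X, Y)) - H((X,)) - H((X, Y, Z))
I_z_y = H((Z,)) + H((Y,)) - H((Y, Z))
assert abs(I_x_z_given_y - (I_z_y_given_x - I_z_y + I_x_z)) < 1e-9
print("all four relations verified")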

