
Application of Information Theory, Lecture 2

Joint & Conditional Entropy, Mutual Information


Handout Mode

Iftach Haitner

Tel Aviv University.

Nov 4, 2014



Part I

Joint and Conditional Entropy



Joint entropy

▸ Recall that the entropy of an rv X over X is defined by

  H(X) = − ∑_{x∈X} PX(x) log PX(x)

▸ Shorter notation: for X ∼ p, let H(X) = − ∑_x p(x) log p(x)
  (where the summation is over the domain of X).
▸ The joint entropy of (jointly distributed) rvs X and Y with (X, Y) ∼ p is

  H(X, Y) = − ∑_{x,y} p(x, y) log p(x, y)

  This is simply the entropy of the rv Z = (X, Y).
▸ Example: consider the joint distribution

            Y = 0   Y = 1
    X = 0    1/4     1/4
    X = 1    1/2      0

  H(X, Y) = − (1/4) log(1/4) − (1/4) log(1/4) − (1/2) log(1/2)
          = (1/2) ⋅ 2 + (1/2) ⋅ 1 = 3/2
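
A quick numerical check of this example (not part of the original slides): a minimal Python sketch, where joint and joint_entropy are illustrative names.

```python
from math import log2

# p(x, y) for the example: rows X = 0, 1 and columns Y = 0, 1.
joint = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 1/2, (1, 1): 0.0}

def joint_entropy(p):
    """H(X, Y) = -sum_{x,y} p(x, y) log p(x, y), with the convention 0 log 0 = 0."""
    return -sum(q * log2(q) for q in p.values() if q > 0)

print(joint_entropy(joint))  # 1.5
```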



Joint entropy, cont.

▸ The joint entropy of (X1, . . . , Xn) ∼ p is

  H(X1, . . . , Xn) = − ∑_{x1,...,xn} p(x1, . . . , xn) log p(x1, . . . , xn)



Conditional entropy

▸ Let (X , Y ) ∼ p.
▸ For x ∈ Supp(X ), the random variable Y ∣X = x is well defined.
▸ The entropy of Y conditioned on X is defined by

  H(Y∣X) := E_{x←X} H(Y∣X = x) = E_X H(Y∣X)

▸ Measures the uncertainty in Y given X .


▸ Let pX & pY∣X be the marginal & conditional distributions induced by p.

  H(Y∣X) = ∑_{x∈X} pX(x) ⋅ H(Y∣X = x)
         = − ∑_{x∈X} pX(x) ∑_{y∈Y} pY∣X(y∣x) log pY∣X(y∣x)
         = − ∑_{x∈X, y∈Y} p(x, y) log pY∣X(y∣x)
         = − E_{(X,Y)} log pY∣X(Y∣X)
         = − E_{Z = pY∣X(Y∣X)} log Z



Conditional entropy, cont.

▸ Example: consider again the joint distribution

            Y = 0   Y = 1
    X = 0    1/4     1/4
    X = 1    1/2      0

  What are H(Y∣X) and H(X∣Y)?

  H(Y∣X) = E_{x←X} H(Y∣X = x)
         = (1/2) H(Y∣X = 0) + (1/2) H(Y∣X = 1)
         = (1/2) H(1/2, 1/2) + (1/2) H(1, 0) = 1/2.

  H(X∣Y) = E_{y←Y} H(X∣Y = y)
         = (3/4) H(X∣Y = 0) + (1/4) H(X∣Y = 1)
         = (3/4) H(1/3, 2/3) + (1/4) H(1, 0) ≈ 0.6887 ≠ H(Y∣X).
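
Another quick check (not from the slides): a minimal Python sketch computing both conditional entropies of this table; conditional_entropy is an illustrative helper.

```python
from collections import defaultdict
from math import log2

joint = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 1/2, (1, 1): 0.0}

def conditional_entropy(p, given_x=True):
    """H(Y|X) if given_x, else H(X|Y): -sum p(x,y) log p(y|x) (resp. log p(x|y))."""
    marg = defaultdict(float)
    for (x, y), q in p.items():
        marg[x if given_x else y] += q
    return -sum(q * log2(q / marg[x if given_x else y])
                for (x, y), q in p.items() if q > 0)

print(conditional_entropy(joint, given_x=True))   # H(Y|X) = 0.5
print(conditional_entropy(joint, given_x=False))  # H(X|Y) ~ 0.6887
```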
Conditional entropy, cont.

  H(X∣Y, Z) = E_{(y,z)←(Y,Z)} H(X∣Y = y, Z = z)
            = E_{y←Y} E_{z←Z∣Y=y} H(X∣Y = y, Z = z)
            = E_{y←Y} E_{z←Z∣Y=y} H((X∣Y = y)∣Z = z)
            = E_{y←Y} H(Xy∣Zy)

  for (Xy, Zy) = (X, Z)∣Y = y



Relating joint entropy to conditional entropy

▸ What is the relation between H(X), H(Y), H(X, Y) and H(Y∣X)?


▸ Intuitively, 0 ≤ H(Y∣X) ≤ H(Y)
  Non-negativity is immediate. We prove the upper bound later.
▸ H(Y∣X) = H(Y) iff X and Y are independent.
▸ In our example, H(Y) = H(3/4, 1/4) > 1/2 = H(Y∣X)
▸ Note that H(Y ∣X = x) might be larger than H(Y ) for some x ∈ Supp(X ).
▸ Chain rule (proved next). H(X , Y ) = H(X ) + H(Y ∣X )
▸ Intuitively, uncertainty in (X , Y ) is the uncertainty in X plus the
uncertainty in Y given X .
▸ H(Y∣X) = H(X, Y) − H(X) can serve as an alternative definition of H(Y∣X).



Chain rule (for the entropy function)

Claim 1
For rvs X , Y , it holds that H(X , Y ) = H(X ) + H(Y ∣X ).

▸ The proof follows immediately from the grouping axiom:


  Arrange the joint probabilities Pi,j = Pr[X = i, Y = j] in an n × n table
  (rows indexed by X, columns by Y), and let qi = ∑_{j=1}^n Pi,j. The grouping axiom gives

  H(P1,1, . . . , Pn,n) = H(q1, . . . , qn) + ∑_i qi ⋅ H(Pi,1/qi, . . . , Pi,n/qi)
                        = H(X) + H(Y∣X).
▸ Another proof. Let (X, Y) ∼ p.
▸ p(x, y) = pX(x) ⋅ pY∣X(y∣x)
  ⟹ log p(x, y) = log pX(x) + log pY∣X(y∣x)
  ⟹ E log p(X, Y) = E log pX(X) + E log pY∣X(Y∣X)
  ⟹ H(X, Y) = H(X) + H(Y∣X).
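
A minimal sketch (not from the slides) that checks the chain rule H(X,Y) = H(X) + H(Y∣X) numerically on a random joint distribution; all names are illustrative.

```python
import random
from collections import defaultdict
from math import log2, isclose

random.seed(0)
support = [(x, y) for x in range(2) for y in range(3)]
w = [random.random() for _ in support]
joint = {xy: wi / sum(w) for xy, wi in zip(support, w)}  # random p(x, y)

def entropy(probs):
    return -sum(q * log2(q) for q in probs if q > 0)

p_x = defaultdict(float)
for (x, y), q in joint.items():
    p_x[x] += q  # marginal p_X

# H(Y|X) = -sum p(x, y) log p(y|x)
h_y_given_x = -sum(q * log2(q / p_x[x]) for (x, y), q in joint.items() if q > 0)

assert isclose(entropy(joint.values()), entropy(p_x.values()) + h_y_given_x)
```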



H(Y ∣X ) ≤ H(Y )
Jensen's inequality: for any concave function f, values t1, . . . , tk and
λ1, . . . , λk ∈ [0, 1] with ∑_i λi = 1, it holds that ∑_i λi f(ti) ≤ f(∑_i λi ti).
Let (X, Y) ∼ p.

  H(Y∣X) = − ∑_{x,y} p(x, y) log pY∣X(y∣x)
         = ∑_{x,y} p(x, y) log [pX(x) / p(x, y)]
         = ∑_{x,y} pY(y) ⋅ [p(x, y)/pY(y)] ⋅ log [pX(x)/p(x, y)]
         = ∑_y pY(y) ∑_x [p(x, y)/pY(y)] ⋅ log [pX(x)/p(x, y)]
         ≤ ∑_y pY(y) log ∑_x [p(x, y)/pY(y)] ⋅ [pX(x)/p(x, y)]      (Jensen, with f = log)
         = ∑_y pY(y) log [1/pY(y)] = H(Y).



H(Y ∣X ) ≤ H(Y ) cont.

▸ Assume X and Y are independent (i.e., p(x, y) = pX(x) ⋅ pY(y) for all x, y)
  ⟹ pY∣X = pY
  ⟹ H(Y∣X) = H(Y)



Other inequalities

▸ H(X ), H(Y ) ≤ H(X , Y ) ≤ H(X ) + H(Y ).


Follows from H(X , Y ) = H(X ) + H(Y ∣X ).
▸ Left inequality: H(Y∣X) is non-negative.
▸ Right inequality: H(Y∣X) ≤ H(Y).
▸ H(X , Y ∣Z ) = H(X ∣Z ) + H(Y ∣X , Z ) (by chain rule)
▸ H(X ∣Y , Z ) ≤ H(X ∣Y )
Proof:

  H(X∣Y, Z) = E_{Y,Z} H(X∣Y, Z)
            = E_Y E_{Z∣Y} H(X∣Y, Z)
            = E_Y E_{Z∣Y} H((X∣Y)∣Z)
            ≤ E_Y E_{Z∣Y} H(X∣Y)
            = E_Y H(X∣Y)
            = H(X∣Y).
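
A minimal sketch (not from the slides) checking H(X∣Y,Z) ≤ H(X∣Y) on a random distribution over three binary variables; cond_entropy_of_x is an illustrative helper.

```python
import random
from collections import defaultdict
from math import log2

random.seed(1)
support = [(x, y, z) for x in range(2) for y in range(2) for z in range(2)]
w = [random.random() for _ in support]
joint = {o: wi / sum(w) for o, wi in zip(support, w)}  # random p(x, y, z)

def cond_entropy_of_x(p, cond):
    """H(X | coordinates listed in cond), where X is coordinate 0 of each outcome."""
    p_xc, p_c = defaultdict(float), defaultdict(float)
    for o, q in p.items():
        c = tuple(o[i] for i in cond)
        p_xc[(o[0], c)] += q
        p_c[c] += q
    return -sum(q * log2(q / p_c[c]) for (x, c), q in p_xc.items() if q > 0)

assert cond_entropy_of_x(joint, (1, 2)) <= cond_entropy_of_x(joint, (1,)) + 1e-12
```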



Chain rule (for the entropy function), general case

Claim 2
For rvs X1 , . . . , Xk , it holds that
H(X1, . . . , Xk) = H(X1) + H(X2∣X1) + . . . + H(Xk∣X1, . . . , Xk−1).

Proof: ?
▸ Extremely useful property!
▸ Analogously to the two variables case, it also holds that:
▸ H(Xi ) ≤ H(X1 , . . . , Xk ) ≤ ∑i H(Xi )
▸ H(X1, . . . , Xk∣Y) ≤ ∑_i H(Xi∣Y)



Examples

▸ (From last class) Let X1, . . . , Xn be Boolean iid with Xi ∼ (1/3, 2/3).
  Compute H(X1, . . . , Xn)
▸ As above, but under the condition that ⊕i Xi = 0 ?
▸ Via chain rule?
▸ Via mapping?



Applications

▸ Let X1, . . . , Xn be Boolean iid with Xi ∼ (p, 1 − p), and let X = (X1, . . . , Xn).
  Let f be such that Pr[f(X) = z] = Pr[f(X) = z′] for every k ∈ N and z, z′ ∈ {0, 1}^k.
  Let K = ∣f(X)∣.
  Prove that E K ≤ n ⋅ h(p).

  n ⋅ h(p) = H(X1, . . . , Xn)
           ≥ H(f(X), K)
           = H(K) + H(f(X) ∣ K)
           = H(K) + E K        (given K = k, f(X) is uniform over {0, 1}^k)
           ≥ E K

▸ Interpretation
▸ Positive results



Applications cont.

▸ How many comparisons does it take to sort n elements?
  Let A be a sorting algorithm for n elements that makes t comparisons.
  What can we say about t?
▸ Let X be a uniform random permutation of [n] and let Y1 , . . . , Yt be the
answers A gets when sorting X .
▸ X is determined by Y1 , . . . , Yt .
Namely, X = f (Y1 , . . . , Yt ) for some function f .
▸ H(X) = log n!

  H(X) = H(f(Y1, . . . , Yt))
       ≤ H(Y1, . . . , Yt)
       ≤ ∑_i H(Yi)
       ≤ t        (each Yi is a single bit)

  ⟹ t ≥ log n! = Θ(n log n)
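
For illustration only (not from the slides), a tiny Python sketch comparing log2(n!) with n·log2(n) for a few values of n.

```python
from math import lgamma, log, log2

def log2_factorial(n):
    return lgamma(n + 1) / log(2)  # lgamma(n + 1) = ln(n!)

for n in (8, 64, 1024):
    print(n, round(log2_factorial(n), 1), round(n * log2(n), 1))
```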



Concavity of entropy function
Let p = (p1 , . . . , pn ) and q = (q1 , . . . , qn ) be two distributions, and for λ ∈ [0, 1]
consider the distribution τλ = λp + (1 − λ)q.
(i.e., τλ = (λp1 + (1 − λ)q1, . . . , λpn + (1 − λ)qn)).

Claim 3
H(τλ ) ≥ λH(p) + (1 − λ)H(q)

Proof:
▸ Let Y over {0, 1} be 1 w.p. λ.
▸ Let X be distributed according to p if Y = 1 and according to q otherwise.
▸ H(τλ) = H(X) ≥ H(X ∣ Y) = λH(p) + (1 − λ)H(q)
We are now certain that we drew the graph of the (two-dimensional) entropy
function right...
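
A minimal sketch (not from the slides) checking Claim 3 on random p, q and λ; helper names are illustrative.

```python
import random
from math import log2

random.seed(2)

def entropy(p):
    return -sum(x * log2(x) for x in p if x > 0)

def random_dist(n):
    w = [random.random() for _ in range(n)]
    return [x / sum(w) for x in w]

p, q = random_dist(5), random_dist(5)
lam = random.random()
mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]

assert entropy(mix) >= lam * entropy(p) + (1 - lam) * entropy(q) - 1e-12
```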



Part II

Mutual Information



Mutual information
▸ I(X; Y) — the "information" that X gives on Y

  I(X; Y) := H(Y) − H(Y∣X)
           = H(Y) − (H(X, Y) − H(X))
           = H(X) + H(Y) − H(X, Y)
           = I(Y; X).
▸ The mutual information that X gives about Y equals the mutual
information that Y gives about X .
▸ I(X ; X ) = H(X )
▸ I(X; f(X)) = H(f(X)) (which can be smaller than H(X) when f is non-injective)
▸ I(X ; Y , Z ) ≥ I(X ; Y ), I(X ; Z ) (since H(X ∣ Y , Z ) ≤ H(X ∣ Y ), H(X ∣ Z ))
▸ I(X ; Y ∣Z ) ∶= H(Y ∣Z ) − H(Y ∣X , Z )
▸ I(X ; Y ∣Z ) = I(Y ; X ∣Z ) (since I(X ′ ; Y ′ ) = I(Y ′ ; X ′ ))



Numerical example

▸ Example: consider again the joint distribution

            Y = 0   Y = 1
    X = 0    1/4     1/4
    X = 1    1/2      0

  I(X; Y) = H(X) − H(X∣Y)
          = 1 − (3/4) ⋅ h(1/3)
          = I(Y; X)
          = H(Y) − H(Y∣X)
          = h(1/4) − (1/2) ⋅ h(1/2)
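
A minimal sketch (not from the slides) that recomputes I(X;Y) for this table and checks it against both closed forms above; helper names are illustrative.

```python
from collections import defaultdict
from math import log2, isclose

joint = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 1/2, (1, 1): 0.0}

def entropy(probs):
    return -sum(q * log2(q) for q in probs if q > 0)

def h(p):  # binary entropy
    return entropy([p, 1 - p])

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), q in joint.items():
    p_x[x] += q
    p_y[y] += q

mi = entropy(p_x.values()) + entropy(p_y.values()) - entropy(joint.values())

assert isclose(mi, 1 - (3/4) * h(1/3))
assert isclose(mi, h(1/4) - (1/2) * h(1/2))
print(mi)  # ~0.3113
```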



Chain rule for mutual information

Claim 4 (Chain rule for mutual information)


For rvs X1 , . . . , Xk , Y , it holds that
I(X1, . . . , Xk; Y) = I(X1; Y) + I(X2; Y∣X1) + . . . + I(Xk; Y∣X1, . . . , Xk−1).

Proof: ? HW



Examples

▸ Let X1, . . . , Xn be iid with Xi ∼ (p, 1 − p), under the condition that ⊕i Xi = 0.
  Compute I(X1, . . . , Xn−1; Xn).
  By the chain rule,

  I(X1, . . . , Xn−1; Xn)
  = I(X1; Xn) + I(X2; Xn∣X1) + . . . + I(Xn−1; Xn∣X1, . . . , Xn−2)
  = 0 + 0 + . . . + 1 = 1.

▸ Let T and F be the top and front sides, respectively, of a fair 6-sided die.
  Compute I(T; F).

  I(T; F) = H(T) − H(T∣F)
          = log 6 − log 4
          = log 3 − 1.
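
A minimal sketch (not from the slides) that verifies I(T;F) = log 3 − 1 by enumerating the 24 equally likely (top, front) pairs, assuming opposite faces sum to 7; all names are illustrative.

```python
from collections import defaultdict
from math import log2, isclose

# front can be any face other than the top and its opposite (opposites sum to 7)
pairs = [(t, f) for t in range(1, 7) for f in range(1, 7) if f != t and f != 7 - t]
joint = {tf: 1 / len(pairs) for tf in pairs}  # 24 pairs, uniform

def entropy(probs):
    return -sum(q * log2(q) for q in probs if q > 0)

p_t, p_f = defaultdict(float), defaultdict(float)
for (t, f), q in joint.items():
    p_t[t] += q
    p_f[f] += q

mi = entropy(p_t.values()) + entropy(p_f.values()) - entropy(joint.values())
assert isclose(mi, log2(3) - 1)  # = log 6 - log 4
```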



Part III

Data processing



Data processing inequality

Definition 5 (Markov Chain)


Rvs (X , Y , Z ) ∼ p form a Markov chain, denoted X → Y → Z , if
p(x, y , z) = pX (x) ⋅ pY ∣X (y ∣x) ⋅ pZ ∣Y (z∣y ), for all x, y , z.

Example: random walk on graph.


Claim 6
If X → Y → Z , then I(X ; Y ) ≥ I(X ; Z ).

▸ By Chain rule, I(X ; Y , Z ) = I(X ; Z ) + I(X ; Y ∣Z ) = I(X ; Y ) + I(X ; Z ∣Y ).


▸ I(X; Z∣Y) = 0:
▸ pZ∣Y=y = pZ∣Y=y,X=x for any x, y, hence

  I(X; Z∣Y) = H(Z∣Y) − H(Z∣Y, X)
            = E_Y H(pZ∣Y=y) − E_{Y,X} H(pZ∣Y=y,X=x)
            = E_Y H(pZ∣Y=y) − E_Y H(pZ∣Y=y) = 0.

▸ Since I(X ; Y ∣Z ) ≥ 0, we conclude I(X ; Y ) ≥ I(X ; Z ).
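
A minimal sketch (not from the slides): sample a random Markov chain X → Y → Z and check I(X;Y) ≥ I(X;Z) numerically; all names are illustrative.

```python
import random
from collections import defaultdict
from math import log2

random.seed(3)

def random_dist(n):
    w = [random.random() for _ in range(n)]
    return [x / sum(w) for x in w]

n = 3
p_x = random_dist(n)
p_y_given_x = [random_dist(n) for _ in range(n)]
p_z_given_y = [random_dist(n) for _ in range(n)]

# p(x, y, z) = p_X(x) p_{Y|X}(y|x) p_{Z|Y}(z|y)
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x in range(n) for y in range(n) for z in range(n)}

def mutual_information(p_pair):
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), q in p_pair.items():
        pa[a] += q
        pb[b] += q
    return sum(q * log2(q / (pa[a] * pb[b])) for (a, b), q in p_pair.items() if q > 0)

p_xy, p_xz = defaultdict(float), defaultdict(float)
for (x, y, z), q in joint.items():
    p_xy[(x, y)] += q
    p_xz[(x, z)] += q

assert mutual_information(p_xy) >= mutual_information(p_xz) - 1e-12
```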


Fano’s Inequality

▸ How well can we guess X from Y ?


▸ We could guess with no error if H(X∣Y) = 0. What if H(X∣Y) is small?
Theorem 7 (Fano’s inequality)
For any rvs X and Y , and any (even random) g, it holds that

h(Pe ) + Pe log ∣X ∣ ≥ H(X ∣X̂ ) ≥ H(X ∣Y )

for X̂ = g(Y ) and Pe = Pr [X̂ ≠ X ].

▸ Note that Pe = 0 implies that H(X∣Y) = 0
▸ The inequality can be weakened to 1 + Pe log ∣X∣ ≥ H(X∣Y),
▸ Alternatively, to Pe ≥ (H(X∣Y) − 1) / log ∣X∣
▸ Intuition for the ∝ 1/log ∣X∣ factor
▸ We call X̂ an estimator for X (from Y).
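
A minimal sketch (not from the slides) checking Fano's inequality for the MAP estimator g(y) = argmax_x p(x∣y) on a random joint distribution; all names are illustrative.

```python
import random
from collections import defaultdict
from math import log2

random.seed(4)
nx, ny = 4, 3
w = [random.random() for _ in range(nx * ny)]
joint = {(x, y): w[x * ny + y] / sum(w) for x in range(nx) for y in range(ny)}

p_y = defaultdict(float)
for (x, y), q in joint.items():
    p_y[y] += q

# H(X|Y)
h_x_given_y = -sum(q * log2(q / p_y[y]) for (x, y), q in joint.items() if q > 0)

# MAP estimator g(y) = argmax_x p(x, y) and its error probability Pe
g = {y: max(range(nx), key=lambda x: joint[(x, y)]) for y in range(ny)}
pe = sum(q for (x, y), q in joint.items() if g[y] != x)

def h(p):  # binary entropy
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

assert h(pe) + pe * log2(nx) >= h_x_given_y - 1e-12
```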



Proving Fano’s inequality
Let X and Y be rvs, let X̂ = g(Y ) and Pe = Pr [X̂ ≠ X ].
▸ Let E be the indicator of an error: E = 1 if X̂ ≠ X, and E = 0 if X̂ = X.

  H(E, X∣X̂) = H(X∣X̂) + H(E∣X, X̂)        [H(E∣X, X̂) = 0]
             = H(E∣X̂) + H(X∣E, X̂)        [H(E∣X̂) ≤ H(E) = h(Pe);  H(X∣E, X̂) ≤ Pe log ∣X∣ (?)]

▸ It follows that h(Pe ) + Pe log ∣X ∣ ≥ H(X ∣X̂ )


▸ Since X → Y → X̂ , it holds that I(X ; Y ) ≥ I(X ; X̂ )
  ⟹ H(X∣X̂) ≥ H(X∣Y)

