
Entropy and Ergodic Theory

Lecture 4: Conditional entropy and mutual information

1 Conditional entropy
Let (Ω, F , P) be a probability space, let X be a RV taking values in some finite
set A. In this lecture we use the following notation:

• pX ∈ Prob(A) is the distribution of X:

pX (a) := P(X = a) for a ∈ A;

• for any other event U ∈ F with P(U ) > 0, we write pX|U ∈ Prob(A) for
the conditional distribution given U :

pX|U (a) := P(X = a | U );

• if Y is a RV taking values in a finite set B, then

pX,Y (a, b) := P(X = a, Y = b) (a ∈ A, b ∈ B)

is the joint distribution, and

pX|Y (a|b) := P(X = a | Y = b) (a ∈ A, b ∈ B)

is the conditional distribution (which is technically defined only when pY (b) >
0, but this causes no problems in the sequel).

We may think of (X, Y ) as a RV taking values in A × B. When doing so, we write
H(X, Y) rather than H((X, Y)).
Our next result gives a way to express H(X, Y ) in terms of the marginal and
conditional distributions of X and Y .

Proposition 1.1 (The basic chain rule for Shannon entropy). For any two RVs X
and Y we have
\[
H(X, Y) - H(X) = -\sum_{a \in A} \sum_{b \in B} p_{X,Y}(a, b) \log_2 p_{Y|X}(b \mid a)
               = \sum_{a \in A} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big). \tag{1}
\]

Technically, on the last line of (1), the expression H(pY |{X=a} ) is defined only
if pX (a) > 0. If pX (a) = 0 then we simply interpret that whole term of the sum
as 0.
Proof. For each (a, b) ∈ A × B we can write

\[
-p_{X,Y}(a, b) \log_2 p_{X,Y}(a, b) = -p_{X,Y}(a, b) \log_2\big[p_X(a)\, p_{Y|X}(b \mid a)\big]
= -p_{X,Y}(a, b) \big[\log_2 p_X(a) + \log_2 p_{Y|X}(b \mid a)\big].
\]

Summing over (a, b), the left-hand side gives H(X, Y ). On the right-hand side,
the sum of the first term becomes
\[
-\sum_{a,b} p_{X,Y}(a, b) \log_2 p_X(a) = -\sum_{a}\Big(\sum_{b} p_{X,Y}(a, b)\Big) \log_2 p_X(a)
= -\sum_{a} p_X(a) \log_2 p_X(a) = H(X),
\]

since pX is the marginal of pX,Y . Subtracting this from both sides, we are left with
the first line of (1). To obtain the second line of (1), we now re-write the first line
as
\[
-\sum_{a \in A} \sum_{b \in B} p_{X,Y}(a, b) \log_2 p_{Y|X}(b \mid a)
= -\sum_{a \in A} \sum_{b \in B} p_X(a)\, p_{Y|X}(b \mid a) \log_2 p_{Y|X}(b \mid a)
= \sum_{a \in A} p_X(a) \Big[-\sum_{b \in B} p_{Y|X}(b \mid a) \log_2 p_{Y|X}(b \mid a)\Big].
\]
The bracketed sum is precisely H(pY|{X=a}), which gives the second line of (1).

The result of this calculation plays an important role in information theory, so
it has its own name.

Definition 1.2. Given X and Y above, the conditional entropy of Y given X is
\[
H(Y \mid X) := H(X, Y) - H(X) = \sum_{a \in A} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big).
\]

If U ∈ F has positive probability, then we may also write

H(Y | U ) := H(pY |U ).

Using this notation, we may write
\[
H(Y \mid X) = \sum_{a \in A} p_X(a) \cdot H(Y \mid X = a).
\]
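As a quick sanity check on Proposition 1.1 and Definition 1.2, here is a minimal numerical sketch in Python, assuming a small, arbitrarily chosen joint pmf: it computes H(X, Y), H(X) and the weighted-average form of H(Y | X), and confirms that they satisfy the chain rule.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution p_{X,Y} on a 2 x 3 alphabet A x B (rows = values of X).
p_XY = np.array([[0.20, 0.10, 0.10],
                 [0.05, 0.25, 0.30]])

p_X = p_XY.sum(axis=1)                       # marginal distribution of X
H_XY, H_X = H(p_XY), H(p_X)                  # H(X, Y) and H(X)

# H(Y | X) as the p_X-weighted average of the entropies H(p_{Y|{X=a}}).
H_Y_given_X = sum(p_X[a] * H(p_XY[a] / p_X[a])
                  for a in range(len(p_X)) if p_X[a] > 0)

# Chain rule (Proposition 1.1 / Definition 1.2): H(X, Y) = H(X) + H(Y | X).
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
print(H_X, H_Y_given_X, H_XY)
```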

Many important consequences of Proposition 1.1 start by combining it with the
fact, from Lecture 1, that H is concave.

Lemma 1.3. We always have

H(Y | X) ≤ H(Y )

with equality if and only if X and Y are independent.

Proof. According to the law of total probability, we may write pY as a convex
combination of the conditional distributions pY|{X=a}:
\[
p_Y = \sum_{a} p_X(a) \cdot p_{Y|\{X=a\}}.
\]

Since the function H is strictly concave, it follows by Jensen's inequality that
\[
H(Y \mid X) = \sum_{a} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big) \le H(p_Y) = H(Y),
\]
with equality if and only if
\[
p_{Y|\{X=a\}} = p_Y \quad \text{whenever } p_X(a) > 0.
\]

This last requirement is equivalent to the independence of X and Y .
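Lemma 1.3 can also be checked numerically. The sketch below uses two hypothetical joint pmfs: a dependent one, for which H(Y | X) is strictly smaller than H(Y), and a product one, for which the two coincide.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def cond_entropy(p_XY):
    """H(Y | X) as the p_X-weighted average of H(p_{Y|{X=a}})."""
    p_X = p_XY.sum(axis=1)
    return sum(p_X[a] * H(p_XY[a] / p_X[a]) for a in range(len(p_X)) if p_X[a] > 0)

# A dependent joint pmf: conditioning on X strictly reduces the entropy of Y.
p_dep = np.array([[0.40, 0.10],
                  [0.10, 0.40]])
p_Y = p_dep.sum(axis=0)
print(cond_entropy(p_dep), "<", H(p_Y))      # roughly 0.722 < 1.0

# A product joint pmf (X and Y independent): equality H(Y | X) = H(Y).
p_ind = np.outer(np.array([0.3, 0.7]), p_Y)
print(cond_entropy(p_ind), "=", H(p_Y))      # both equal 1.0 here
```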


Proposition 1.1 and Lemma 1.3 immediately give the following corollary.

Corollary 1.4 (Subadditivity). With X, Y as above we have

\[
H(X, Y) \le H(X) + H(Y), \tag{2}
\]

with equality if and only if X and Y are independent.


Remark. The inequality (2) of Corollary 1.4 has a nice interpretation in terms of typical
sequences. Since pX and pY are the marginals of p = pX,Y , for any δ > 0 we have

\[
T_{n,\delta}(p) \subseteq T_{n,\delta}(p_X) \times T_{n,\delta}(p_Y) \subseteq A^n \times B^n.
\]

By our results on counting approximately typical strings, this implies that

\[
2^{H(p)n - \Delta(\delta)n - o(n)} \le 2^{(H(p_X)n + \Delta(\delta)n) + (H(p_Y)n + \Delta(\delta)n) + o(n)}.
\]

Letting n −→ ∞ and then δ ↓ 0, this implies the desired subadditivity.


With a little more work (based on the law of large numbers) one can also show
equality when X and Y are independent by this ‘counting’ approach. But I do
not know a proof of the converse — that equality implies independence — which
does not rely on some analysis of the particular function H.
Since conditional entropy is just a weighted average of Shannon entropies of
conditional distributions, the results above easily extend to situations that involve
further conditioning. Consider now three discrete RVs X, Y and Z. The next result
follows from Proposition 1.1: one simply applies that result to the conditional
distributions pX,Y |{Z=c} and then averages against pZ (c).
Proposition 1.5 (Basic chain rule under extra conditioning). We have

H(X, Y | Z) = H(X | Z) + H(Y | X, Z).

Corollary 1.6 (Full chain rule). For any discrete RVs X1, . . . , Xn and Y we have
\[
H(X_1, X_2, \dots, X_n \mid Y) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \dots, X_1, Y).
\]

Proof. Induction on n using Proposition 1.5.


The next result is the extension of Lemma 1.3 to the setting with extra conditioning;
once again it follows by simply applying that lemma to the relevant
conditional distributions.

Lemma 1.7 (Monotonicity and subadditivity under conditioning). Any three discrete
RVs X, Y, Z satisfy
\[
H(Y \mid X, Z) \le H(Y \mid Z) \quad\text{and}\quad H(X, Y \mid Z) \le H(X \mid Z) + H(Y \mid Z)
\]
(they are equivalent by Proposition 1.5). Equality holds in either case if and only
if X and Y are conditionally independent over Z.
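Proposition 1.5 and Lemma 1.7 can likewise be verified numerically. The following Python sketch draws a hypothetical three-variable joint pmf at random and checks the conditional chain rule and the monotonicity of conditional entropy.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint pmf p_{X,Y,Z} on a 2 x 2 x 2 alphabet (axes: X, Y, Z).
p_XYZ = np.random.default_rng(0).random((2, 2, 2))
p_XYZ /= p_XYZ.sum()

p_XZ = p_XYZ.sum(axis=1)     # marginal of (X, Z)
p_YZ = p_XYZ.sum(axis=0)     # marginal of (Y, Z)
p_Z = p_XZ.sum(axis=0)       # marginal of Z

# Conditional entropies via the identity H(U | V) = H(U, V) - H(V).
H_X_given_Z = H(p_XZ) - H(p_Z)
H_Y_given_XZ = H(p_XYZ) - H(p_XZ)
H_Y_given_Z = H(p_YZ) - H(p_Z)

# H(X, Y | Z) computed independently as the p_Z-weighted average of H(p_{X,Y|{Z=c}}).
H_XY_given_Z = sum(p_Z[c] * H(p_XYZ[:, :, c] / p_Z[c]) for c in range(2) if p_Z[c] > 0)

# Proposition 1.5: H(X, Y | Z) = H(X | Z) + H(Y | X, Z).
assert abs(H_XY_given_Z - (H_X_given_Z + H_Y_given_XZ)) < 1e-12
# Lemma 1.7: H(Y | X, Z) <= H(Y | Z).
assert H_Y_given_XZ <= H_Y_given_Z + 1e-12
```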

2 Data-processing inequalities
Let X and Y be as above. We say that X determines Y according to P if there
is a map f : A −→ B such that P(Y = f (X)) = 1. We often drop the phrase
‘according to P’ if P is clear from the context. (This notion extends naturally to
non-discrete RVs, for which the targets A and B are arbitrary measurable spaces.
In that case one insists that f be measurable.)
Lemma 2.1. The following are equivalent:
(a) X determines Y ;
(b) H(Y | X) = 0;
(c) H(X, Y ) = H(X).
Proof. The key here is that X determines Y if and only if the conditional distribution
pY|{X=a} is a delta mass whenever pX(a) > 0. If this is so, then this delta
mass may be written as δf(a) for some function f : A −→ B, which then satisfies
\[
P(Y = f(X)) = \sum_{a \in A} p_X(a) \cdot P(Y = f(a) \mid X = a) = 1.
\]

To prove that (a) and (b) are equivalent we combine this fact with the property of
entropy that
pY |{X=a} is a delta mass ⇐⇒ H(pY |{X=a} ) = 0.
Finally, (b) and (c) are equivalent simply by the definition of H(Y | X).
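Lemma 2.1 is easy to see on a concrete example. In the sketch below, the map f and the distribution of X are hypothetical choices: X is uniform on {0, 1, 2, 3} and Y = f(X) = X mod 2, so X determines Y and both H(Y | X) = 0 and H(X, Y) = H(X) hold.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

A, B = 4, 2
f = lambda a: a % 2                     # hypothetical map f : A -> B
p_X = np.full(A, 1 / A)                 # X uniform on {0, 1, 2, 3}

# Joint pmf of (X, Y) with Y = f(X): all mass sits on the pairs (a, f(a)).
p_XY = np.zeros((A, B))
for a in range(A):
    p_XY[a, f(a)] = p_X[a]

H_X, H_XY = H(p_X), H(p_XY)
# Lemma 2.1: X determines Y  <=>  H(Y | X) = 0  <=>  H(X, Y) = H(X).
assert abs(H_XY - H_X) < 1e-12          # (c), hence (b): H(Y | X) = H(X, Y) - H(X) = 0
print(H_X, H_XY, H_XY - H_X)
```
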
Corollary 2.2. Let X1 , X2 and Y be discrete RVs, and suppose that X1 determines
X2 . Then
H(X2 | Y ) ≤ H(X1 | Y ) (so, in particular, H(X2 ) ≤ H(X1 ))
and
H(Y | X2 ) ≥ H(Y | X1 ).

Proof. If X2 = f (X1 ) holds P-almost surely, then it also holds almost surely
according to P( · | Y = b) for pY -almost every b. So X1 determines X2 according
to P( · | Y = b) for pY -almost every b. Now apply part (c) of the previous lemma
for each such b to obtain

H(X1 | Y = b) = H(X1 , X2 | Y = b).

Averaging this against pY (b), it becomes

H(X1 | Y ) = H(X1 , X2 | Y ).

But by the conditional chain rule, this right-hand side equals

H(X2 | Y ) + H(X1 | X2 , Y ).

This proves the first inequality, because the second term here must be at least 0.
Similarly, if X2 = f (X1 ) holds P-almost surely, then

pX1 ,X2 (a1 , a2 ) > 0 if and only if pX1 (a1 ) > 0 and a2 = f (a1 ),

and in that case we have

pY |X1 ,X2 ( · | a1 , a2 ) = pY |X1 ( · | a1 ).

Therefore
H(Y | X1 ) = H(Y | X1 , X2 ),
which is less than or equal to H(Y | X2 ) by Lemma 1.7.

3 Mutual information
The conditional entropy H(Y | X) quantifies the additional entropy that Y brings to
the pair (X, Y) given that X is known. It is also useful to have a name for the gap
in the inequality of Lemma 1.3. The mutual information between X and Y is
the quantity
I(X ; Y ) := H(Y ) − H(Y | X).
By the basic chain rule, this is also equal to

H(Y ) + H(X) − H(X, Y ).

In particular, I(X ; Y ) is symmetric in X and Y .
Conditional entropy and mutual information fit together into the formula

H(Y ) = I(Y ; X) + H(Y | X).

This has a nice intuitive meaning: we have decomposed the total ‘uncertainty’ in
Y into an amount which is shared with X, given by I(Y ; X), and the amount
which is independent of X, given by H(Y | X). As usual, one must not take this
picture too seriously.
Using the definition of entropy and formula (1), mutual information can be
written in terms of distributions:
\[
I(X ; Y) = \sum_{a,b} p_{X,Y}(a, b) \log_2 \frac{p_{X,Y}(a, b)}{p_X(a)\, p_Y(b)}. \tag{3}
\]
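As a numerical cross-check, the sketch below evaluates I(X ; Y) both as H(X) + H(Y) − H(X, Y) and via formula (3) for a hypothetical joint pmf, and also confirms that a product distribution carries zero mutual information.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint pmf of (X, Y) on a 2 x 2 alphabet.
p_XY = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# Route 1: I(X ; Y) = H(X) + H(Y) - H(X, Y).
I_1 = H(p_X) + H(p_Y) - H(p_XY)

# Route 2: formula (3), summing over pairs with p_{X,Y}(a, b) > 0.
I_2 = sum(p_XY[a, b] * np.log2(p_XY[a, b] / (p_X[a] * p_Y[b]))
          for a in range(2) for b in range(2) if p_XY[a, b] > 0)

assert abs(I_1 - I_2) < 1e-12

# For the product of the marginals, the mutual information is zero (independence).
p_prod = np.outer(p_X, p_Y)
assert abs(H(p_X) + H(p_Y) - H(p_prod)) < 1e-12
print(I_1, I_2)
```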

As in the case of entropy, we may generalize by conditioning everything on a
third discrete RV Z: the resulting conditional mutual information is
\[
I(X ; Y \mid Z) := H(Y \mid Z) - H(Y \mid X, Z)
= H(Y \mid Z) + H(X \mid Z) - H(X, Y \mid Z)
= I(Y ; X \mid Z).
\]

As with conditional entropy, we may also describe conditional mutual information
by replacing pX,Y in (3) with the conditional distributions pX,Y|{Z=c}, and then
averaging against pZ(c).
Many of the results of the previous section are easily re-written in terms of I.
This can be helpful since I brings intuitions of its own to entropy calculations.

Proposition 3.1. Mutual information has the following properties:

1. (Bounds) Any discrete RVs X and Y satisfy 0 ≤ I(X ; Y) ≤ min{H(X), H(Y)}
(from Lemma 1.3 and the symmetry of I);

2. (Chain rule) If X1, . . . , Xn and Y are discrete RVs, then
\[
I(X_1, \dots, X_n ; Y) = \sum_{i=1}^{n} I(X_i ; Y \mid X_{i-1}, \dots, X_1)
\]

(from re-writing Corollary 1.6 using the definition of I in terms of H);

3. (Independence) We have I(X ; Y) = 0 if and only if X and Y are independent
(from Corollary 1.4 and the second formula for I above).

4. (Data processing) If X1 determines X2 , then

I(X1 ; Y ) ≥ I(X2 ; Y )

(from Corollary 2.2 and the definition of I).
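The data-processing property in particular can be checked numerically. In this sketch, the joint pmf of (X1, Y) and the coarsening map f are hypothetical; X2 = f(X1), so X1 determines X2, and the inequality I(X1 ; Y) ≥ I(X2 ; Y) is verified.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def I(p_UV):
    """Mutual information I(U ; V) = H(U) + H(V) - H(U, V) for a joint pmf array."""
    return H(p_UV.sum(axis=1)) + H(p_UV.sum(axis=0)) - H(p_UV)

# Hypothetical joint pmf of (X1, Y) on a 4 x 2 alphabet.
p_X1Y = np.array([[0.10, 0.15],
                  [0.20, 0.05],
                  [0.05, 0.20],
                  [0.15, 0.10]])

# X2 = f(X1) with f(a) = a // 2 merges values of X1, so X1 determines X2.
f = lambda a: a // 2
p_X2Y = np.zeros((2, 2))
for a in range(4):
    p_X2Y[f(a)] += p_X1Y[a]

# Data processing: I(X1 ; Y) >= I(X2 ; Y).
assert I(p_X1Y) >= I(p_X2Y) - 1e-12
print(I(p_X1Y), I(p_X2Y))
```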

4 Shannon’s list of properties for entropy


Shannon’s approach to his entropy was to simply write down the formula for
H(X), declare it as a quantity which measures the ‘uncertainty’ in X, and then
derive a few of its properties which match that intuition as justification. The properties
in his list were a selection of those that we have proved so far. Up to slight
re-ordering and changes of notation, here is Shannon’s list:
1. If X = Y a.s. (even if their domains are formally different) then H(X) =
H(Y ).

2. If A is fixed, then H(X) is a continuous function of the distribution pX .

3. H(X) = 0 if and only if X is almost surely constant (that is, pX is a delta
mass): “the uncertainty is zero only if the outcome of X is certain”.

4. If |A| = n and X has the uniform distribution, then H(X) is an increasing
function of n: “if all outcomes are equally likely, then more possible
outcomes implies more uncertainty”.

5. If A is fixed, then H(X) is maximized uniquely when pX is the uniform
distribution on A: “on a given alphabet, the uniform distribution has the
greatest uncertainty”.

6. If X and Y are RVs taking values in A and B, then
\[
H(X, Y) = H(X) + \sum_{a \in A} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big) = H(X) + H(Y \mid X):
\]
“the uncertainty in the composite RV (X, Y) may be obtained sequentially
as the uncertainty in X plus the expected uncertainty in Y conditioned on
the value taken by X”.

7. We have H(X, Y ) ≤ H(X) + H(Y ), with equality if and only if they are
independent: “if pX and pY are fixed, then the greatest uncertainty obtains
when X and Y are independent”.

Shannon recorded these properties as evidence that his entropy notion is intuitive
and valuable, together with the following (see [Sha48, Theorem 2]).

Theorem 4.1. Any function of discrete RVs which satisfies properties 1–7 above is
a positive multiple of H.

Some authors use this as the main justification for introducing H. This is called
the ‘axiomatic approach’ to information theory.

5 Notes and remarks


Basic sources for this lecture: [CT06, Chapter 2], [Sha48], [Wel88, Chapter 1]
Further reading:

• Other selections of axioms than Shannon’s have also been proposed and
studied, and various relatives of Theorem 4.1 have been found. See, for
instance, [AD75]. Beware, however, that I do not know of many real connections
from this line of research back to other parts of mathematics.

References
[AD75] J. Aczél and Z. Daróczy. On measures of information and their characterizations. Mathematics in Science and Engineering, Vol. 115. Academic Press [Harcourt Brace Jovanovich, Publishers], New York-London, 1975.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.

[Wel88] Dominic Welsh. Codes and cryptography. Oxford Science Publications. The Clarendon Press, Oxford University Press, New York, 1988.

Tim Austin

Email: [email protected]
URL: math.ucla.edu/˜tim

