Entropy 4
1 Conditional entropy
Let (Ω, F , P) be a probability space, let X be a RV taking values in some finite
set A. In this lecture we use the following notation:
• for any other event U ∈ F with P(U) > 0, we write pX|U ∈ Prob(A) for
the conditional distribution given U:

    pX|U(a) := P({X = a} ∩ U)/P(U)    for a ∈ A;

• for a second RV Y taking values in a finite set B, we write pX,Y ∈ Prob(A × B)
for the joint distribution of X and Y, and pX, pY for its marginals; then

    pX|Y(a|b) := pX,Y(a, b)/pY(b) = pX|{Y=b}(a)

is the conditional distribution (which is technically defined only when pY(b) >
0, but this causes no problems in the sequel), and pY|X(b|a) := pX,Y(a, b)/pX(a)
is defined analogously whenever pX(a) > 0.
We may think of (X, Y) as a RV taking values in A × B. When doing so, we write
H(X, Y) rather than H((X, Y)).
Our next result gives a way to express H(X, Y ) in terms of the marginal and
conditional distributions of X and Y .
Proposition 1.1 (The basic chain rule for Shannon entropy). For any two RVs X
and Y we have
    H(X, Y) − H(X) = − ∑_{a∈A} ∑_{b∈B} pX,Y(a, b) log2 pY|X(b|a)
                   = ∑_{a∈A} pX(a) · H(pY|{X=a}).                    (1)
Technically, on the last line of (1), the expression H(pY |{X=a} ) is defined only
if pX (a) > 0. If pX (a) = 0 then we simply interpret that whole term of the sum
as 0.
Proof. For each (a, b) ∈ A × B we can write
    −pX,Y(a, b) log2 pX,Y(a, b) = −pX,Y(a, b) log2 [pX(a) pY|X(b|a)]
                                = −pX,Y(a, b) [log2 pX(a) + log2 pY|X(b|a)].
Summing over (a, b), the left-hand side gives H(X, Y ). On the right-hand side,
the sum of the first term becomes
    − ∑_{a,b} pX,Y(a, b) log2 pX(a) = − ∑_{a∈A} (∑_{b∈B} pX,Y(a, b)) log2 pX(a)
                                    = − ∑_{a∈A} pX(a) log2 pX(a) = H(X),
since pX is the marginal of pX,Y . Subtracting this from both sides, we are left with
the first line of (1). To obtain the second line of (1), we now re-write the first line
as
    − ∑_{a∈A} ∑_{b∈B} pX,Y(a, b) log2 pY|X(b|a)
        = − ∑_{a∈A} ∑_{b∈B} pX(a) pY|X(b|a) log2 pY|X(b|a)
        = ∑_{a∈A} pX(a) [ − ∑_{b∈B} pY|X(b|a) log2 pY|X(b|a) ],

and the quantity in square brackets is exactly H(pY|{X=a}), giving the second
line of (1).
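As a quick numerical illustration of Proposition 1.1, one can compare the two sides of (1) in a few lines of Python; the joint distribution p_XY below is just an arbitrary example of my choosing, and H is the usual base-2 Shannon entropy.

    from math import log2

    def H(dist):
        # Shannon entropy (base 2) of a distribution given as {outcome: probability}.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    # An arbitrary example joint distribution pX,Y on A x B with A = {0, 1}, B = {0, 1, 2}.
    p_XY = {(0, 0): 0.20, (0, 1): 0.10, (0, 2): 0.10,
            (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30}

    A = sorted({a for (a, _) in p_XY})
    B = sorted({b for (_, b) in p_XY})
    p_X = {a: sum(p_XY[(a, b)] for b in B) for a in A}                  # marginal of X
    p_Y_given = {a: {b: p_XY[(a, b)] / p_X[a] for b in B}               # pY|{X=a}
                 for a in A if p_X[a] > 0}

    lhs = H(p_XY) - H(p_X)                                   # H(X, Y) - H(X)
    rhs = sum(p_X[a] * H(p_Y_given[a]) for a in p_Y_given)   # sum_a pX(a) H(pY|{X=a})
    print(lhs, rhs)                                          # equal up to rounding error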
Definition 1.2. Given X and Y above, the conditional entropy of Y given X is

    H(Y | X) := H(X, Y) − H(X) = ∑_{a∈A} pX(a) · H(pY|{X=a}).

Similarly, given an event U ∈ F with P(U) > 0, the conditional entropy of Y
given U is

    H(Y | U) := H(pY|U).
Lemma 1.3. With X and Y as above, we have

    H(Y | X) ≤ H(Y).
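For instance, the short Python check below (again with an arbitrary example joint distribution of my choosing) illustrates the inequality H(Y | X) ≤ H(Y).

    from math import log2

    def H(dist):
        # Shannon entropy (base 2) of a distribution given as {outcome: probability}.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    # Arbitrary joint distribution of (X, Y) on {0, 1} x {0, 1}.
    p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
    p_X = {a: sum(v for (x, _), v in p_XY.items() if x == a) for a in (0, 1)}
    p_Y = {b: sum(v for (_, y), v in p_XY.items() if y == b) for b in (0, 1)}

    H_Y_given_X = H(p_XY) - H(p_X)          # H(Y | X), by Definition 1.2
    print(H_Y_given_X, H(p_Y))              # the first number is the smaller one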
Corollary 1.4 (Subadditivity). With X, Y as above we have

    H(X, Y) ≤ H(X) + H(Y),

with equality if and only if X and Y are independent.
Corollary 1.6 (Full chain rule). For any discrete RVs X1, . . . , Xn and Y we have

    H(X1, X2, . . . , Xn | Y) = ∑_{i=1}^{n} H(Xi | Xi−1, . . . , X1, Y).
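The n = 2 case already conveys the idea; the sketch below (my own illustration, using a randomly generated joint distribution of (X1, X2, Y)) checks that H(X1, X2 | Y) = H(X1 | Y) + H(X2 | X1, Y).

    import itertools, random
    from math import log2

    def H(dist):
        # Shannon entropy (base 2) of {outcome: probability}.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    def marginal(joint, keep):
        # Marginal distribution of the coordinates listed in `keep`.
        out = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in keep)
            out[key] = out.get(key, 0.0) + p
        return out

    def cond_H(joint, target, given):
        # H(target coordinates | given coordinates) = H(both) - H(given).
        return H(marginal(joint, given + target)) - H(marginal(joint, given))

    random.seed(0)
    outcomes = list(itertools.product(range(2), range(3), range(2)))   # (x1, x2, y)
    weights = [random.random() for _ in outcomes]
    total = sum(weights)
    p = {o: w / total for o, w in zip(outcomes, weights)}

    lhs = cond_H(p, target=[0, 1], given=[2])                               # H(X1, X2 | Y)
    rhs = cond_H(p, target=[0], given=[2]) + cond_H(p, target=[1], given=[0, 2])
    print(abs(lhs - rhs) < 1e-12)                                           # True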
Lemma 1.7 (Monotonicity and subadditivity under conditioning). Any three dis-
crete RVs X, Y, Z satisfy
H(Y | X, Z) ≤ H(Y | Z) and H(X, Y | Z) ≤ H(X | Z) + H(Y | Z)
(they are equivalent by Proposition 1.5). Equality holds in either case if and only
if X and Y are conditionally independent over Z.
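The first inequality can again be checked numerically; the sketch below uses an arbitrary randomly generated joint distribution of (X, Y, Z), chosen only for illustration, and compares H(Y | X, Z) with H(Y | Z).

    import itertools, random
    from math import log2

    def H(dist):
        # Shannon entropy (base 2) of {outcome: probability}.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    def marginal(joint, keep):
        # Marginal distribution of the coordinates listed in `keep`.
        out = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in keep)
            out[key] = out.get(key, 0.0) + p
        return out

    random.seed(1)
    outcomes = list(itertools.product(range(2), range(2), range(2)))   # (x, y, z)
    weights = [random.random() for _ in outcomes]
    total = sum(weights)
    p = {o: w / total for o, w in zip(outcomes, weights)}

    H_Y_given_XZ = H(marginal(p, [0, 1, 2])) - H(marginal(p, [0, 2]))  # H(Y | X, Z)
    H_Y_given_Z = H(marginal(p, [1, 2])) - H(marginal(p, [2]))         # H(Y | Z)
    print(H_Y_given_XZ <= H_Y_given_Z + 1e-12)                         # True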
2 Data-processing inequalities
Let X and Y be as above. We say that X determines Y according to P if there
is a map f : A −→ B such that P(Y = f (X)) = 1. We often drop the phrase
‘according to P’ if P is clear from the context. (This notion extends naturally to
non-discrete RVs, for which the targets A and B are arbitrary measurable spaces.
In that case one insists that f be measurable.)
Lemma 2.1. The following are equivalent:
(a) X determines Y ;
(b) H(Y | X) = 0;
(c) H(X, Y ) = H(X).
Proof. The key here is that X determines Y if and only if the conditional distri-
bution pY | {X=a} is a delta mass whenever pX (a) > 0. If this is so, then this delta
mass may be written as δf (a) for some function f : A −→ B, which then satisfies
    P(Y = f(X)) = ∑_{a∈A} pX(a) · P(Y = f(a) | X = a) = 1.
To prove that (a) and (b) are equivalent we combine this fact with the property of
entropy that
pY |{X=a} is a delta mass ⇐⇒ H(pY |{X=a} ) = 0.
Finally, (b) and (c) are equivalent simply by the definition of H(Y | X).
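As a concrete illustration (the map f and the law of X below are arbitrary choices, not taken from the lecture): if Y := f(X) for a deterministic map f, then H(Y | X) = H(X, Y) − H(X) comes out as 0, in line with Lemma 2.1.

    from math import log2

    def H(dist):
        # Shannon entropy (base 2) of {outcome: probability}.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    def f(a):            # X determines Y via Y := f(X)
        return a % 2

    p_X = {0: 0.2, 1: 0.5, 2: 0.3}                     # arbitrary law of X
    p_XY = {(a, f(a)): pa for a, pa in p_X.items()}    # joint law of (X, Y)
    print(H(p_XY) - H(p_X))                            # 0.0, i.e. H(Y | X) = 0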
Corollary 2.2. Let X1 , X2 and Y be discrete RVs, and suppose that X1 determines
X2 . Then
H(X2 | Y ) ≤ H(X1 | Y ) (so, in particular, H(X2 ) ≤ H(X1 ))
and
H(Y | X2 ) ≥ H(Y | X1 ).
Proof. If X2 = f (X1 ) holds P-almost surely, then it also holds almost surely
according to P( · | Y = b) for pY -almost every b. So X1 determines X2 according
to P( · | Y = b) for pY -almost every b. Now apply part (c) of the previous lemma
for each such b to obtain
    H(X1 | Y) = H(X1, X2 | Y) = H(X2 | Y) + H(X1 | X2, Y),

where the second equality is the chain rule (Corollary 1.6).
This proves the first inequality, because the second term here must be at least 0.
Similarly, if X2 = f (X1 ) holds P-almost surely, then
pX1,X2(a1, a2) > 0 if and only if pX1(a1) > 0 and a2 = f(a1), so conditioning on
the event {X1 = a1, X2 = a2} is the same as conditioning on {X1 = a1} whenever
these events have positive probability. Therefore
H(Y | X1 ) = H(Y | X1 , X2 ),
which is less than or equal to H(Y | X2 ) by Lemma 1.7.
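Both inequalities of Corollary 2.2 can be seen numerically in a small example; in the sketch below (my own construction) X2 := f(X1) for an arbitrary map f, and Y is coupled to X1 by a randomly generated joint distribution.

    import itertools, random
    from math import log2

    def H(dist):
        # Shannon entropy (base 2) of {outcome: probability}.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    def marginal(joint, keep):
        # Marginal distribution of the coordinates listed in `keep`.
        out = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in keep)
            out[key] = out.get(key, 0.0) + p
        return out

    def f(x1):                     # X2 := f(X1), so X1 determines X2
        return x1 // 2

    random.seed(2)
    pairs = list(itertools.product(range(4), range(3)))                # (x1, y)
    weights = [random.random() for _ in pairs]
    total = sum(weights)
    p = {(x1, f(x1), y): w / total for (x1, y), w in zip(pairs, weights)}   # (x1, x2, y)

    H_X2_given_Y = H(marginal(p, [1, 2])) - H(marginal(p, [2]))
    H_X1_given_Y = H(marginal(p, [0, 2])) - H(marginal(p, [2]))
    H_Y_given_X2 = H(marginal(p, [1, 2])) - H(marginal(p, [1]))
    H_Y_given_X1 = H(marginal(p, [0, 2])) - H(marginal(p, [0]))
    print(H_X2_given_Y <= H_X1_given_Y + 1e-12)    # True
    print(H_Y_given_X2 >= H_Y_given_X1 - 1e-12)    # True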
3 Mutual information
The conditional entropy H(Y | X) quantifies the additional entropy that Y brings to
the pair (X, Y ) given that X is known. It is also useful to have a name for the gap
in the inequality of Lemma 1.3. The mutual information between X and Y is
the quantity
I(X ; Y ) := H(Y ) − H(Y | X).
By the basic chain rule, this is also equal to

    I(X ; Y) = H(X) + H(Y) − H(X, Y).
In particular, I(X ; Y ) is symmetric in X and Y .
Conditional entropy and mutual information fit together into the formula

    H(Y) = I(X ; Y) + H(Y | X).
This has a nice intuitive meaning: we have decomposed the total ‘uncertainty’ in
Y into an amount which is shared with X, given by I(Y ; X), and the amount
which is independent of X, given by H(Y | X). As usual, one must not take this
picture too seriously.
Using the definition of entropy and formula (1), mutual information can be
written in terms of distributions:
    I(X ; Y) = ∑_{a,b} pX,Y(a, b) log2 [ pX,Y(a, b) / (pX(a) pY(b)) ].        (3)
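All three expressions for I(X ; Y), namely H(Y) − H(Y | X), the symmetric form H(X) + H(Y) − H(X, Y), and formula (3), can be compared directly; the Python sketch below does this for an arbitrary example joint distribution of my choosing.

    from math import log2

    def H(dist):
        # Shannon entropy (base 2) of {outcome: probability}.
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    # Arbitrary example joint distribution of (X, Y).
    p_XY = {('a', 0): 0.25, ('a', 1): 0.15, ('b', 0): 0.10, ('b', 1): 0.50}
    p_X, p_Y = {}, {}
    for (a, b), v in p_XY.items():
        p_X[a] = p_X.get(a, 0.0) + v
        p_Y[b] = p_Y.get(b, 0.0) + v

    I1 = H(p_Y) - (H(p_XY) - H(p_X))              # H(Y) - H(Y | X)
    I2 = H(p_X) + H(p_Y) - H(p_XY)                # H(X) + H(Y) - H(X, Y)
    I3 = sum(v * log2(v / (p_X[a] * p_Y[b]))      # formula (3)
             for (a, b), v in p_XY.items() if v > 0)
    print(I1, I2, I3)                             # all three agree up to rounding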
3. (Independence) We have I(X ; Y ) = 0 if and only if X and Y are indepen-
dent (from Corollary 1.4 and the second formula for I above).
    I(X1 ; Y) ≥ I(X2 ; Y)    whenever X1 determines X2 (by Corollary 2.2).
7. We have H(X, Y ) ≤ H(X) + H(Y ), with equality if and only if they are
independent: “if pX and pY are fixed, then the greatest uncertainty obtains
when X and Y are independent”.
Shannon recorded these properties as evidence that his entropy notion is intu-
itive and valuable, together with the following (see [Sha48, Theorem 2]).
Theorem 4.1. Any function of discrete RVs which satisfies properties 1–7 above is
a positive multiple of H.
Some authors use this as the main justification for introducing H. This is called
the ‘axiomatic approach’ to information theory.
• Selections of axioms other than Shannon's have also been proposed and
studied, and various relatives of Theorem 4.1 have been found. See, for
instance, [AD75]. Beware, however, that I do not know of many real con-
nections from this line of research back to other parts of mathematics.
References
[AD75] J. Aczél and Z. Daróczy. On measures of information and their characterizations. Mathematics in Science and Engineering, Vol. 115. Academic Press, New York-London, 1975.

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.
Tim Austin
Email: [email protected]
URL: math.ucla.edu/~tim