
Entropy and Ergodic Theory

Lecture 4: Conditional entropy and mutual information

1 Conditional entropy
Let (Ω, F , P) be a probability space, let X be a RV taking values in some finite
set A. In this lecture we use the following notation:

• pX ∈ Prob(A) is the distribution of X:

pX (a) := P(X = a) for a ∈ A;

• for any other event U ∈ F with P(U ) > 0, we write pX|U ∈ Prob(A) for
the conditional distribution given U :

pX|U (a) := P(X = a | U );

• if Y is a RV taking values in a finite set B, then

pX,Y (a, b) := P(X = a, Y = b) (a ∈ A, b ∈ B)

is the joint distribution, and

pX|Y (a|b) := P(X = a | Y = b) (a ∈ A, b ∈ B)

is the conditional distribution (which is technically defined only when pY (b) >
0, but this causes no problems in the sequel).

We may think of (X, Y ) as a RV taking values in A × B. When doing so, we write
H(X, Y) rather than H((X, Y)).
Our next result gives a way to express H(X, Y ) in terms of the marginal and
conditional distributions of X and Y .

Proposition 1.1 (The basic chain rule for Shannon entropy). For any two RVs X
and Y we have
\[
H(X, Y) - H(X) = -\sum_{a \in A} \sum_{b \in B} p_{X,Y}(a, b) \log_2 p_{Y|X}(b \mid a)
               = \sum_{a \in A} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big). \tag{1}
\]

Technically, on the last line of (1), the expression H(pY |{X=a} ) is defined only
if pX (a) > 0. If pX (a) = 0 then we simply interpret that whole term of the sum
as 0.
Proof. For each (a, b) ∈ A × B we can write

\[
-p_{X,Y}(a, b) \log_2 p_{X,Y}(a, b) = -p_{X,Y}(a, b) \log_2\big[p_X(a)\, p_{Y|X}(b \mid a)\big]
= -p_{X,Y}(a, b) \big[\log_2 p_X(a) + \log_2 p_{Y|X}(b \mid a)\big].
\]

Summing over (a, b), the left-hand side gives H(X, Y ). On the right-hand side,
the sum of the first term becomes
\[
-\sum_{a,b} p_{X,Y}(a, b) \log_2 p_X(a) = -\sum_{a}\Big(\sum_{b} p_{X,Y}(a, b)\Big) \log_2 p_X(a)
= -\sum_{a} p_X(a) \log_2 p_X(a) = H(X),
\]

since pX is the marginal of pX,Y . Subtracting this from both sides, we are left with
the first line of (1). To obtain the second line of (1), we now re-write the first line
as
\[
-\sum_{a \in A} \sum_{b \in B} p_{X,Y}(a, b) \log_2 p_{Y|X}(b \mid a)
= -\sum_{a \in A} \sum_{b \in B} p_X(a)\, p_{Y|X}(b \mid a) \log_2 p_{Y|X}(b \mid a)
= \sum_{a \in A} p_X(a) \Big[-\sum_{b \in B} p_{Y|X}(b \mid a) \log_2 p_{Y|X}(b \mid a)\Big].
\]
The bracketed sum is precisely H(pY|{X=a}), which gives the second line of (1).

The result of this calculation plays an important role in information theory, so
it has its own name.

Definition 1.2. Given X and Y above, the conditional entropy of Y given X is
\[
H(Y \mid X) := H(X, Y) - H(X) = \sum_{a \in A} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big).
\]

If U ∈ F has positive probability, then we may also write

H(Y | U ) := H(pY |U ).

Using this notation, we may write
\[
H(Y \mid X) = \sum_{a \in A} p_X(a) \cdot H(Y \mid X = a).
\]
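As a quick sanity check on Proposition 1.1 and Definition 1.2, here is a minimal numerical sketch in Python, assuming a small, arbitrarily chosen joint pmf: it computes H(X, Y), H(X) and the weighted-average form of H(Y | X), and confirms that they satisfy the chain rule.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution p_{X,Y} on a 2 x 3 alphabet A x B (rows = values of X).
p_XY = np.array([[0.20, 0.10, 0.10],
                 [0.05, 0.25, 0.30]])

p_X = p_XY.sum(axis=1)                       # marginal distribution of X
H_XY, H_X = H(p_XY), H(p_X)                  # H(X, Y) and H(X)

# H(Y | X) as the p_X-weighted average of the entropies H(p_{Y|{X=a}}).
H_Y_given_X = sum(p_X[a] * H(p_XY[a] / p_X[a])
                  for a in range(len(p_X)) if p_X[a] > 0)

# Chain rule (Proposition 1.1 / Definition 1.2): H(X, Y) = H(X) + H(Y | X).
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-12
print(H_X, H_Y_given_X, H_XY)
```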

Many important consequences of Proposition 1.1 start by combining it with the
fact, from Lecture 1, that H is concave.

Lemma 1.3. We always have

H(Y | X) ≤ H(Y )

with equality if and only if X and Y are independent.

Proof. According to the law of total probability, we may write pY as a convex
combination of the conditional distributions pY|{X=a}:
\[
p_Y = \sum_{a} p_X(a) \cdot p_{Y|\{X=a\}}.
\]

Since the function H is strictly concave, it follows by Jensen's inequality that
\[
H(Y \mid X) = \sum_{a} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big) \le H(p_Y) = H(Y),
\]
with equality if and only if
\[
p_{Y|\{X=a\}} = p_Y \quad \text{whenever } p_X(a) > 0.
\]

This last requirement is equivalent to the independence of X and Y .
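Lemma 1.3 can also be checked numerically. The sketch below uses two hypothetical joint pmfs: a dependent one, for which H(Y | X) is strictly smaller than H(Y), and a product one, for which the two coincide.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def cond_entropy(p_XY):
    """H(Y | X) as the p_X-weighted average of H(p_{Y|{X=a}})."""
    p_X = p_XY.sum(axis=1)
    return sum(p_X[a] * H(p_XY[a] / p_X[a]) for a in range(len(p_X)) if p_X[a] > 0)

# A dependent joint pmf: conditioning on X strictly reduces the entropy of Y.
p_dep = np.array([[0.40, 0.10],
                  [0.10, 0.40]])
p_Y = p_dep.sum(axis=0)
print(cond_entropy(p_dep), "<", H(p_Y))      # roughly 0.722 < 1.0

# A product joint pmf (X and Y independent): equality H(Y | X) = H(Y).
p_ind = np.outer(np.array([0.3, 0.7]), p_Y)
print(cond_entropy(p_ind), "=", H(p_Y))      # both equal 1.0 here
```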


Proposition 1.1 and Lemma 1.3 immediately give the following corollary.

Corollary 1.4 (Subadditivity). With X, Y as above we have

\[
H(X, Y) \le H(X) + H(Y), \tag{2}
\]

with equality if and only if X and Y are independent.


Remark. The inequality (2) of Corollary 1.4 has a nice interpretation in terms of typical
sequences. Since pX and pY are the marginals of p = pX,Y , for any δ > 0 we have

\[
T_{n,\delta}(p) \subseteq T_{n,\delta}(p_X) \times T_{n,\delta}(p_Y) \subseteq A^n \times B^n.
\]

By our results on counting approximately typical strings, this implies that

\[
2^{H(p)n - \Delta(\delta)n - o(n)} \le 2^{(H(p_X)n + \Delta(\delta)n) + (H(p_Y)n + \Delta(\delta)n) + o(n)}.
\]

Letting n −→ ∞ and then δ ↓ 0, this implies the desired subadditivity.


With a little more work (based on the law of large numbers) one can also show
equality when X and Y are independent by this ‘counting’ approach. But I do
not know a proof of the converse — that equality implies independence — which
does not rely on some analysis of the particular function H.
Since conditional entropy is just a weighted average of Shannon entropies of
conditional distributions, the results above easily extend to situations that involve
further conditioning. Consider now three discrete RVs X, Y and Z. The next result
follows from Proposition 1.1: one simply applies that result to the conditional
distributions pX,Y |{Z=c} and then averages against pZ (c).
Proposition 1.5 (Basic chain rule under extra conditioning). We have

H(X, Y | Z) = H(X | Z) + H(Y | X, Z).

Corollary 1.6 (Full chain rule). For any discrete RVs X1, . . . , Xn and Y we have
\[
H(X_1, X_2, \dots, X_n \mid Y) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \dots, X_1, Y).
\]

Proof. Induction on n using Proposition 1.5.


The next result is the extension of Lemma 1.3 to the setting with extra conditioning;
once again it follows by simply applying that lemma to the relevant
conditional distributions.

Lemma 1.7 (Monotonicity and subadditivity under conditioning). Any three discrete
RVs X, Y, Z satisfy
\[
H(Y \mid X, Z) \le H(Y \mid Z) \quad\text{and}\quad H(X, Y \mid Z) \le H(X \mid Z) + H(Y \mid Z)
\]
(they are equivalent by Proposition 1.5). Equality holds in either case if and only
if X and Y are conditionally independent over Z.
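Proposition 1.5 and Lemma 1.7 can likewise be verified numerically. The following Python sketch draws a hypothetical three-variable joint pmf at random and checks the conditional chain rule and the monotonicity of conditional entropy.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint pmf p_{X,Y,Z} on a 2 x 2 x 2 alphabet (axes: X, Y, Z).
p_XYZ = np.random.default_rng(0).random((2, 2, 2))
p_XYZ /= p_XYZ.sum()

p_XZ = p_XYZ.sum(axis=1)     # marginal of (X, Z)
p_YZ = p_XYZ.sum(axis=0)     # marginal of (Y, Z)
p_Z = p_XZ.sum(axis=0)       # marginal of Z

# Conditional entropies via the identity H(U | V) = H(U, V) - H(V).
H_X_given_Z = H(p_XZ) - H(p_Z)
H_Y_given_XZ = H(p_XYZ) - H(p_XZ)
H_Y_given_Z = H(p_YZ) - H(p_Z)

# H(X, Y | Z) computed independently as the p_Z-weighted average of H(p_{X,Y|{Z=c}}).
H_XY_given_Z = sum(p_Z[c] * H(p_XYZ[:, :, c] / p_Z[c]) for c in range(2) if p_Z[c] > 0)

# Proposition 1.5: H(X, Y | Z) = H(X | Z) + H(Y | X, Z).
assert abs(H_XY_given_Z - (H_X_given_Z + H_Y_given_XZ)) < 1e-12
# Lemma 1.7: H(Y | X, Z) <= H(Y | Z).
assert H_Y_given_XZ <= H_Y_given_Z + 1e-12
```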

2 Data-processing inequalities
Let X and Y be as above. We say that X determines Y according to P if there
is a map f : A −→ B such that P(Y = f (X)) = 1. We often drop the phrase
‘according to P’ if P is clear from the context. (This notion extends naturally to
non-discrete RVs, for which the targets A and B are arbitrary measurable spaces.
In that case one insists that f be measurable.)
Lemma 2.1. The following are equivalent:
(a) X determines Y ;
(b) H(Y | X) = 0;
(c) H(X, Y ) = H(X).
Proof. The key here is that X determines Y if and only if the conditional distribution
pY|{X=a} is a delta mass whenever pX(a) > 0. If this is so, then this delta
mass may be written as δf(a) for some function f : A −→ B, which then satisfies
\[
P(Y = f(X)) = \sum_{a \in A} p_X(a) \cdot P(Y = f(a) \mid X = a) = 1.
\]

To prove that (a) and (b) are equivalent we combine this fact with the property of
entropy that
pY |{X=a} is a delta mass ⇐⇒ H(pY |{X=a} ) = 0.
Finally, (b) and (c) are equivalent simply by the definition of H(Y | X).
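Lemma 2.1 is easy to see on a concrete example. In the sketch below, the map f and the distribution of X are hypothetical choices: X is uniform on {0, 1, 2, 3} and Y = f(X) = X mod 2, so X determines Y and both H(Y | X) = 0 and H(X, Y) = H(X) hold.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

A, B = 4, 2
f = lambda a: a % 2                     # hypothetical map f : A -> B
p_X = np.full(A, 1 / A)                 # X uniform on {0, 1, 2, 3}

# Joint pmf of (X, Y) with Y = f(X): all mass sits on the pairs (a, f(a)).
p_XY = np.zeros((A, B))
for a in range(A):
    p_XY[a, f(a)] = p_X[a]

H_X, H_XY = H(p_X), H(p_XY)
# Lemma 2.1: X determines Y  <=>  H(Y | X) = 0  <=>  H(X, Y) = H(X).
assert abs(H_XY - H_X) < 1e-12          # (c), hence (b): H(Y | X) = H(X, Y) - H(X) = 0
print(H_X, H_XY, H_XY - H_X)
```
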
Corollary 2.2. Let X1 , X2 and Y be discrete RVs, and suppose that X1 determines
X2 . Then
H(X2 | Y ) ≤ H(X1 | Y ) (so, in particular, H(X2 ) ≤ H(X1 ))
and
H(Y | X2 ) ≥ H(Y | X1 ).

Proof. If X2 = f (X1 ) holds P-almost surely, then it also holds almost surely
according to P( · | Y = b) for pY -almost every b. So X1 determines X2 according
to P( · | Y = b) for pY -almost every b. Now apply part (c) of the previous lemma
for each such b to obtain

H(X1 | Y = b) = H(X1 , X2 | Y = b).

Averaging this against pY (b), it becomes

H(X1 | Y ) = H(X1 , X2 | Y ).

But by the conditional chain rule, this right-hand side equals

H(X2 | Y ) + H(X1 | X2 , Y ).

This proves the first inequality, because the second term here must be at least 0.
Similarly, if X2 = f (X1 ) holds P-almost surely, then

pX1 ,X2 (a1 , a2 ) > 0 if and only if pX1 (a1 ) > 0 and a2 = f (a1 ),

and in that case we have

pY |X1 ,X2 ( · | a1 , a2 ) = pY |X1 ( · | a1 ).

Therefore
H(Y | X1 ) = H(Y | X1 , X2 ),
which is less than or equal to H(Y | X2 ) by Lemma 1.7.

3 Mutual information
The conditional entropy H(Y | X) quantifies the additional entropy that Y brings to
the pair (X, Y) given that X is known. It is also useful to have a name for the gap
in the inequality of Lemma 1.3. The mutual information between X and Y is
the quantity
I(X ; Y ) := H(Y ) − H(Y | X).
By the basic chain rule, this is also equal to

H(Y ) + H(X) − H(X, Y ).

In particular, I(X ; Y ) is symmetric in X and Y .
Conditional entropy and mutual information fit together into the formula

H(Y ) = I(Y ; X) + H(Y | X).

This has a nice intuitive meaning: we have decomposed the total ‘uncertainty’ in
Y into an amount which is shared with X, given by I(Y ; X), and the amount
which is independent of X, given by H(Y | X). As usual, one must not take this
picture too seriously.
Using the definition of entropy and formula (1), mutual information can be
written in terms of distributions:
\[
I(X ; Y) = \sum_{a,b} p_{X,Y}(a, b) \log_2 \frac{p_{X,Y}(a, b)}{p_X(a)\, p_Y(b)}. \tag{3}
\]
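As a numerical cross-check, the sketch below evaluates I(X ; Y) both as H(X) + H(Y) − H(X, Y) and via formula (3) for a hypothetical joint pmf, and also confirms that a product distribution carries zero mutual information.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint pmf of (X, Y) on a 2 x 2 alphabet.
p_XY = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

# Route 1: I(X ; Y) = H(X) + H(Y) - H(X, Y).
I_1 = H(p_X) + H(p_Y) - H(p_XY)

# Route 2: formula (3), summing over pairs with p_{X,Y}(a, b) > 0.
I_2 = sum(p_XY[a, b] * np.log2(p_XY[a, b] / (p_X[a] * p_Y[b]))
          for a in range(2) for b in range(2) if p_XY[a, b] > 0)

assert abs(I_1 - I_2) < 1e-12

# For the product of the marginals, the mutual information is zero (independence).
p_prod = np.outer(p_X, p_Y)
assert abs(H(p_X) + H(p_Y) - H(p_prod)) < 1e-12
print(I_1, I_2)
```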

As in the case of entropy, we may generalize by conditioning everything on a
third discrete RV Z: the resulting conditional mutual information is
\[
I(X ; Y \mid Z) := H(Y \mid Z) - H(Y \mid X, Z)
= H(Y \mid Z) + H(X \mid Z) - H(X, Y \mid Z)
= I(Y ; X \mid Z).
\]

As with conditional entropy, we may also describe conditional mutual information
by replacing pX,Y in (3) with the conditional distributions pX,Y|{Z=c}, and then
averaging against pZ(c).
Many of the results of the previous section are easily re-written in terms of I.
This can be helpful since I brings intuitions of its own to entropy calculations.

Proposition 3.1. Mutual information has the following properties:

1. (Bounds) Any discrete RVs X and Y satisfy 0 ≤ I(X ; Y) ≤ min{H(X), H(Y)}
(from Lemma 1.3 and the symmetry of I);

2. (Chain rule) If X1, . . . , Xn and Y are discrete RVs, then
\[
I(X_1, \dots, X_n ; Y) = \sum_{i=1}^{n} I(X_i ; Y \mid X_{i-1}, \dots, X_1)
\]

(from re-writing Corollary 1.6 using the definition of I in terms of H);

3. (Independence) We have I(X ; Y) = 0 if and only if X and Y are independent
(from Corollary 1.4 and the second formula for I above).

4. (Data processing) If X1 determines X2 , then

I(X1 ; Y ) ≥ I(X2 ; Y )

(from Corollary 2.2 and the definition of I).
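The data-processing property in particular can be checked numerically. In this sketch, the joint pmf of (X1, Y) and the coarsening map f are hypothetical; X2 = f(X1), so X1 determines X2, and the inequality I(X1 ; Y) ≥ I(X2 ; Y) is verified.

```python
import numpy as np

def H(p):
    """Shannon entropy (base 2) of a probability array; zero entries are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def I(p_UV):
    """Mutual information I(U ; V) = H(U) + H(V) - H(U, V) for a joint pmf array."""
    return H(p_UV.sum(axis=1)) + H(p_UV.sum(axis=0)) - H(p_UV)

# Hypothetical joint pmf of (X1, Y) on a 4 x 2 alphabet.
p_X1Y = np.array([[0.10, 0.15],
                  [0.20, 0.05],
                  [0.05, 0.20],
                  [0.15, 0.10]])

# X2 = f(X1) with f(a) = a // 2 merges values of X1, so X1 determines X2.
f = lambda a: a // 2
p_X2Y = np.zeros((2, 2))
for a in range(4):
    p_X2Y[f(a)] += p_X1Y[a]

# Data processing: I(X1 ; Y) >= I(X2 ; Y).
assert I(p_X1Y) >= I(p_X2Y) - 1e-12
print(I(p_X1Y), I(p_X2Y))
```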

4 Shannon’s list of properties for entropy


Shannon’s approach to his entropy was to simply write down the formula for
H(X), declare it as a quantity which measures the ‘uncertainty’ in X, and then
derive a few of its properties which match that intuition as justification. The properties
in his list were a selection of those that we have proved so far. Up to slight
re-ordering and changes of notation, here is Shannon’s list:
1. If X = Y a.s. (even if their domains are formally different) then H(X) =
H(Y ).

2. If A is fixed, then H(X) is a continuous function of the distribution pX .

3. H(X) = 0 if and only if X is almost surely constant (that is, pX is a delta
mass): “the uncertainty is zero only if the outcome of X is certain”.

4. If |A| = n and X has the uniform distribution, then H(X) is an increasing
function of n: “if all outcomes are equally likely, then more possible
outcomes implies more uncertainty”.

5. If A is fixed, then H(X) is maximized uniquely when pX is the uniform
distribution on A: “on a given alphabet, the uniform distribution has the
greatest uncertainty”.

6. If X and Y are RVs taking values in A and B, then
\[
H(X, Y) = H(X) + \sum_{a \in A} p_X(a) \cdot H\big(p_{Y|\{X=a\}}\big) = H(X) + H(Y \mid X):
\]
“the uncertainty in the composite RV (X, Y) may be obtained sequentially
as the uncertainty in X plus the expected uncertainty in Y conditioned on
the value taken by X”.

7. We have H(X, Y ) ≤ H(X) + H(Y ), with equality if and only if they are
independent: “if pX and pY are fixed, then the greatest uncertainty obtains
when X and Y are independent”.

Shannon recorded these properties as evidence that his entropy notion is intuitive
and valuable, together with the following (see [Sha48, Theorem 2]).

Theorem 4.1. Any function of discrete RVs which satisfies properties 1–7 above is
a positive multiple of H.

Some authors use this as the main justification for introducing H. This is called
the ‘axiomatic approach’ to information theory.

5 Notes and remarks


Basic sources for this lecture: [CT06, Chapter 2], [Sha48], [Wel88, Chapter 1]
Further reading:

• Other selections of axioms than Shannon’s have also been proposed and
studied, and various relatives of Theorem 4.1 have been found. See, for
instance, [AD75]. Beware, however, that I do not know of many real connections
from this line of research back to other parts of mathematics.

References
[AD75] J. Aczél and Z. Daróczy. On measures of information and their characterizations. Mathematics in Science and Engineering, Vol. 115. Academic Press [Harcourt Brace Jovanovich, Publishers], New York-London, 1975.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 623–656, 1948.

[Wel88] Dominic Welsh. Codes and cryptography. Oxford Science Publications. The Clarendon Press, Oxford University Press, New York, 1988.

Tim Austin

Email: [email protected]
URL: math.ucla.edu/˜tim

