Lecture 3: Entropy, Relative Entropy, and Mutual Information
In this lecture, we will introduce certain key measures of information that play crucial roles in theoretical
and operational characterizations throughout the course. These include the entropy, the mutual information,
and the relative entropy. We will also establish some key properties of these information measures.
1 Notation
A quick summary of the notation
1. Discrete Random Variable: U
2. Alphabet: U = {u1, u2, ..., uM} (an alphabet of size M)
3. Specific Value: u, u1, etc.
For discrete random variables, we will write (interchangeably) P(U = u), P_U(u), or, most often, just p(u).
Similarly, for a pair of random variables X, Y we write P(X = x | Y = y), P_{X|Y}(x | y), or p(x | y).
2 Entropy
Definition 1. “Surprise” Function:
S(u) ≜ log(1/p(u))    (1)
A lower probability of u translates to a greater “surprise” when it occurs.
Note that throughout these notes, log means log2 by default (rather than the natural log ln, as is typical in
some other contexts), unless otherwise indicated.
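For example, if U is the outcome of a fair coin flip, then S(u) = log 2 = 1 bit for either outcome; an outcome of probability 1/8 carries S(u) = log 8 = 3 bits of surprise, while a certain outcome (p(u) = 1) carries S(u) = 0 bits, no surprise at all.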
Definition 2. Entropy: Let U be a discrete random variable taking values in the alphabet U. The entropy of U
is given by:
H(U) ≜ E[S(U)] = E[log(1/p(U))] = E[−log p(U)] = −∑_u p(u) log p(u)    (2)
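As a quick sanity check, here is a minimal Python sketch of this computation (the helper name entropy is just for illustration), using the convention that terms with p(u) = 0 contribute 0:

import math

def entropy(pmf):
    # H(U) = -sum_u p(u) * log2 p(u); terms with p(u) = 0 contribute 0.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))    # biased coin: about 0.47 bits
print(entropy([1.0, 0.0]))    # deterministic: 0.0 bits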
Proof:
If Q is a convex function, then its graph {(u, Q(u)) : u ∈ R} can be seen as the upper envelope of the set
of affine functions that lie below it. Written more concretely,
Q(u) = sup_{L∈L} L(u),
where
L = {L : L(u) = au + b ≤ Q(u) for all −∞ < u < ∞}.
Thus, by linearity of expectation, for any random variable X and every L ∈ L we have L(E[X]) = E[L(X)] ≤ E[Q(X)];
taking the supremum over L ∈ L gives Jensen's inequality, Q(E[X]) ≤ E[Q(X)].
2. H(U) ≥ 0, with equality iff U is deterministic.
Proof:
H(U) = E[log(1/p(U))] ≥ 0, because log(1/p(U)) ≥ 0    (13)
The equality occurs iff log(1/p(u)) = 0 with probability 1, so U must be deterministic.
3. For a PMF q define
Hq(U) ≜ E[log(1/q(U))] = ∑_{u∈U} p(u) log(1/q(u)).    (14)
Then:
H(U) ≤ Hq(U),    (15)
with equality iff q = p.
Proof:
H(U) − Hq(U) = E[log(1/p(U))] − E[log(1/q(U))]    (16)
= E[log(q(U)/p(U))]    (17)
≤ log E[q(U)/p(U)]  (by Jensen's inequality, since log is concave)    (18)
= log ∑_{u∈U} p(u) (q(u)/p(u))    (19)
= log ∑_{u∈U} q(u)    (20)
= log 1    (21)
= 0    (22)
Thus,
H(U) − Hq(U) ≤ 0.
Equality only holds when q(U)/p(U) is deterministic, which occurs when q = p (the distributions are identical).
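As a numerical illustration of property 3, here is a minimal Python sketch (the helper name cross_entropy is just for illustration) comparing H(U) with Hq(U) for one choice of p and a mismatched q:

import math

def cross_entropy(p, q):
    # Hq(U) = sum_u p(u) * log2(1/q(u)); taking q = p recovers H(U).
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(cross_entropy(p, p))   # H(U)  ~ 1.49 bits
print(cross_entropy(p, q))   # Hq(U) ~ 1.81 bits, >= H(U)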
Note that by property 3, the relative entropy D(p ‖ q) = ∑_{u∈U} p(u) log(p(u)/q(u)) = Hq(U) − H(U) is always
greater than or equal to 0, with equality iff q = p. For now, relative entropy can be thought of as a measure
of discrepancy between two probability distributions. We will soon see that it is central to information theory.
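Here is a minimal Python sketch of the relative entropy itself (the function name kl and the choice of p and q are just for illustration), showing both its nonnegativity and the asymmetry noted later:

import math

def kl(p, q):
    # D(p || q) = sum_u p(u) * log2(p(u)/q(u)); terms with p(u) = 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(kl(p, q))   # ~ 0.32 bits, >= 0
print(kl(q, p))   # ~ 0.28 bits: D(p || q) != D(q || p) in general
print(kl(p, p))   # 0.0: equality iff q = p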
5.
Definition 5. Conditional Entropy of X given Y:
H(X | Y) ≜ E[log(1/p(X | Y))]    (30)
= ∑_{x,y} p(x, y) log(1/p(x | y))    (31)
= ∑_y p(y) [∑_x p(x | y) log(1/p(x | y))]    (32)
= ∑_y p(y) H(X | Y = y).    (33)
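As an illustration of Definition 5, here is a minimal Python sketch (the joint PMF and all names are chosen purely for illustration) evaluating H(X | Y) as the p(y)-weighted average of H(X | Y = y):

import math

# An illustrative joint PMF p(x, y) on {0, 1} x {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(y).
p_y = {}
for (x, y), p in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + p

# H(X | Y) = sum_y p(y) * H(X | Y = y).
H_X_given_Y = 0.0
for y, py in p_y.items():
    cond = [p / py for (xx, yy), p in p_xy.items() if yy == y]   # p(x | y)
    H_X_given_Y += py * -sum(c * math.log2(c) for c in cond if c > 0)

print(H_X_given_Y)   # ~ 0.72 bits (compare with H(X) = 1 bit for this PMF)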
D(P_{X,Y} ‖ P_X × P_Y) ≥ 0 because relative entropy can never be negative. Equality holds iff P_{X,Y} ≡ P_X × P_Y,
i.e., X and Y are independent.
6. Chain Rule:
H(X, Y) ≜ E[log(1/P(X, Y))]    (40)
= E[log(1/(P(Y) P(X | Y)))]    (41)
= H(Y) + H(X | Y)    (42)
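The chain rule can also be checked numerically; the sketch below (reusing the illustrative joint PMF from above) compares H(X, Y) with H(Y) + H(X | Y), where H(X | Y) is computed directly from Definition 5:

import math

def H(probs):
    # Entropy in bits of a collection of probabilities.
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# H(X | Y) = sum_y p(y) * H(X | Y = y), with p(x | y) = p(x, y) / p(y).
H_X_given_Y = sum(py * H(p / py for (x, yy), p in p_xy.items() if yy == y)
                  for y, py in p_y.items())

print(H(p_xy.values()))                 # H(X, Y)        ~ 1.72 bits
print(H(p_y.values()) + H_X_given_Y)    # H(Y) + H(X|Y)  ~ 1.72 bits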
We now define the mutual information between random variables X and Y distributed according to
the joint PMF P (x, y):
I(X; Y) ≜ D(P_{X,Y} ‖ P_X × P_Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y).
Any of these equivalent expressions may be found in the literature. The mutual information measures how much
knowing one of the variables reduces the uncertainty in the other.
Note: while relative entropy is not symmetric, mutual information is.
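Here is a minimal Python sketch (again with a purely illustrative joint PMF) computing the mutual information both as H(X) − H(X | Y) and as H(Y) − H(Y | X); the two agree, consistent with the symmetry:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, y), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

H_XY = H(p_xy.values())
# Using the chain rule: H(X | Y) = H(X, Y) - H(Y) and H(Y | X) = H(X, Y) - H(X).
I_xy = H(p_x.values()) - (H_XY - H(p_y.values()))   # H(X) - H(X | Y)
I_yx = H(p_y.values()) - (H_XY - H(p_x.values()))   # H(Y) - H(Y | X)
print(I_xy, I_yx)   # both ~ 0.28 bits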
3 Exercises
1. “Data processing decreases entropy” (note that this statement only applies to deterministic functions)
Y = f (X) ⇒ H(Y ) ≤ H(X) with equality when f is one-to-one.
Note: Proof is part of homework 1.
2. “Data processing on side information increases entropy”
Y = f (X) ⇒ H(Z|X) ≤ H(Z|Y )
True more generally:
whenever Y − X − Z (Markov Relation), i.e., p(Z|X, Y ) = p(Z|X), then H(Z|X) ≤ H(Z|Y )
Note: Proof is part of homework 1.
3.
Definition 7. Conditional mutual information