Lecture 3: Entropy, Relative Entropy, and Mutual Information
In this lecture, we will introduce certain key measures of information that play crucial roles in theoretical
and operational characterizations throughout the course. These include the entropy, the mutual information,
and the relative entropy. We will also establish some key properties of these information measures.
1 Notation
A quick summary of the notation
1. Discrete Random Variable: U
2. Alphabet: U = {u1, u2, ..., uM} (an alphabet of size M)
3. Specific Value: u, u1, etc.
For discrete random variables, we will write (interchangeably) P(U = u), P_U(u), or, most often, just p(u).
Similarly, for a pair of random variables X, Y we write P(X = x | Y = y), P_{X|Y}(x | y), or p(x | y).
2 Entropy
Definition 1. “Surprise” Function:
S(u) ≜ log(1/p(u))    (1)
A lower probability of u translates to a greater “surprise” when it occurs.
Note that throughout these notes, log means log2 by default (rather than the natural log ln, as is typical in
some other contexts), unless otherwise indicated.
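For example, if U is the outcome of a fair coin flip, then S(u) = log 2 = 1 bit for either outcome; an outcome of probability 1/8 carries S(u) = log 8 = 3 bits of surprise, while a certain outcome (p(u) = 1) carries S(u) = 0 bits, no surprise at all.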
Definition 2. Entropy: Let U be a discrete random variable taking values in the alphabet U. The entropy of U
is given by:
H(U) ≜ E[S(U)] = E[log(1/p(U))] = E[−log p(U)] = −∑_u p(u) log p(u)    (2)
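As a quick sanity check, here is a minimal Python sketch of this computation (the helper name entropy is just for illustration), using the convention that terms with p(u) = 0 contribute 0:

import math

def entropy(pmf):
    # H(U) = -sum_u p(u) * log2 p(u); terms with p(u) = 0 contribute 0.
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))    # biased coin: about 0.47 bits
print(entropy([1.0, 0.0]))    # deterministic: 0.0 bits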
Proof:
If Q is a convex function, then its graph {(u, Q(u)) : u ∈ R} can be seen as the upper envelope of the set
of affine functions that lie below it. Written more concretely,
Q(u) = sup_{L∈L} L(u),
where
L = {L : L(u) = au + b ≤ Q(u) for all −∞ < u < ∞}.
Thus, by linearity of expectation, for any random variable X and every L ∈ L we have L(E[X]) = E[L(X)] ≤ E[Q(X)];
taking the supremum over L ∈ L gives Jensen's inequality, Q(E[X]) ≤ E[Q(X)].
2. H(U) ≥ 0, with equality iff U is deterministic.
Proof:
H(U) = E[log(1/p(U))] ≥ 0, because log(1/p(U)) ≥ 0    (13)
The equality occurs iff log(1/p(u)) = 0 with probability 1, so U must be deterministic.
3. For a PMF q define
Hq(U) ≜ E[log(1/q(U))] = ∑_{u∈U} p(u) log(1/q(u)).    (14)
Then:
H(U) ≤ Hq(U),    (15)
with equality iff q = p.
Proof:
H(U) − Hq(U) = E[log(1/p(U))] − E[log(1/q(U))]    (16)
= E[log(q(U)/p(U))]    (17)
≤ log E[q(U)/p(U)]  (by Jensen's inequality, since log is concave)    (18)
= log ∑_{u∈U} p(u) (q(u)/p(u))    (19)
= log ∑_{u∈U} q(u)    (20)
= log 1    (21)
= 0    (22)
Thus,
H(U) − Hq(U) ≤ 0.
Equality only holds when q(U)/p(U) is deterministic, which occurs when q = p (the distributions are identical).
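As a numerical illustration of property 3, here is a minimal Python sketch (the helper name cross_entropy is just for illustration) comparing H(U) with Hq(U) for one choice of p and a mismatched q:

import math

def cross_entropy(p, q):
    # Hq(U) = sum_u p(u) * log2(1/q(u)); taking q = p recovers H(U).
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(cross_entropy(p, p))   # H(U)  ~ 1.49 bits
print(cross_entropy(p, q))   # Hq(U) ~ 1.81 bits, >= H(U)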
Note that by property 3, the relative entropy D(p ‖ q) = ∑_{u∈U} p(u) log(p(u)/q(u)) = Hq(U) − H(U) is always
greater than or equal to 0, with equality iff q = p. For now, relative entropy can be thought of as a measure
of discrepancy between two probability distributions. We will soon see that it is central to information theory.
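Here is a minimal Python sketch of the relative entropy itself (the function name kl and the choice of p and q are just for illustration), showing both its nonnegativity and the asymmetry noted later:

import math

def kl(p, q):
    # D(p || q) = sum_u p(u) * log2(p(u)/q(u)); terms with p(u) = 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(kl(p, q))   # ~ 0.32 bits, >= 0
print(kl(q, p))   # ~ 0.28 bits: D(p || q) != D(q || p) in general
print(kl(p, p))   # 0.0: equality iff q = p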
5.
Definition 5. Conditional Entropy of X given Y:
H(X | Y) ≜ E[log(1/p(X | Y))]    (30)
= ∑_{x,y} p(x, y) log(1/p(x | y))    (31)
= ∑_y p(y) [∑_x p(x | y) log(1/p(x | y))]    (32)
= ∑_y p(y) H(X | Y = y).    (33)
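As an illustration of Definition 5, here is a minimal Python sketch (the joint PMF and all names are chosen purely for illustration) evaluating H(X | Y) as the p(y)-weighted average of H(X | Y = y):

import math

# An illustrative joint PMF p(x, y) on {0, 1} x {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal p(y).
p_y = {}
for (x, y), p in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + p

# H(X | Y) = sum_y p(y) * H(X | Y = y).
H_X_given_Y = 0.0
for y, py in p_y.items():
    cond = [p / py for (xx, yy), p in p_xy.items() if yy == y]   # p(x | y)
    H_X_given_Y += py * -sum(c * math.log2(c) for c in cond if c > 0)

print(H_X_given_Y)   # ~ 0.72 bits (compare with H(X) = 1 bit for this PMF)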
D(P_{X,Y} ‖ P_X × P_Y) ≥ 0 because relative entropy can never be negative. Equality holds iff P_{X,Y} ≡ P_X × P_Y,
i.e., X and Y are independent.
6. Chain Rule:
H(X, Y) ≜ E[log(1/P(X, Y))]    (40)
= E[log(1/(P(Y) P(X | Y)))]    (41)
= H(Y) + H(X | Y)    (42)
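The chain rule can also be checked numerically; the sketch below (reusing the illustrative joint PMF from above) compares H(X, Y) with H(Y) + H(X | Y), where H(X | Y) is computed directly from Definition 5:

import math

def H(probs):
    # Entropy in bits of a collection of probabilities.
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

# H(X | Y) = sum_y p(y) * H(X | Y = y), with p(x | y) = p(x, y) / p(y).
H_X_given_Y = sum(py * H(p / py for (x, yy), p in p_xy.items() if yy == y)
                  for y, py in p_y.items())

print(H(p_xy.values()))                 # H(X, Y)        ~ 1.72 bits
print(H(p_y.values()) + H_X_given_Y)    # H(Y) + H(X|Y)  ~ 1.72 bits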
We now define the mutual information between random variables X and Y distributed according to
the joint PMF P (x, y):
I(X; Y) ≜ D(P_{X,Y} ‖ P_X × P_Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y).
Any of these equivalent expressions may be found in the literature. The mutual information measures how much
knowing one of the variables reduces the uncertainty in the other.
Note: while relative entropy is not symmetric, mutual information is.
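Here is a minimal Python sketch (again with a purely illustrative joint PMF) computing the mutual information both as H(X) − H(X | Y) and as H(Y) − H(Y | X); the two agree, consistent with the symmetry:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xx, y), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (x, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

H_XY = H(p_xy.values())
# Using the chain rule: H(X | Y) = H(X, Y) - H(Y) and H(Y | X) = H(X, Y) - H(X).
I_xy = H(p_x.values()) - (H_XY - H(p_y.values()))   # H(X) - H(X | Y)
I_yx = H(p_y.values()) - (H_XY - H(p_x.values()))   # H(Y) - H(Y | X)
print(I_xy, I_yx)   # both ~ 0.28 bits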
3 Exercises
1. “Data processing decreases entropy” (note that this statement only applies to deterministic functions)
Y = f (X) ⇒ H(Y ) ≤ H(X) with equality when f is one-to-one.
Note: Proof is part of homework 1.
2. “Data processing on side information increases entropy”
Y = f (X) ⇒ H(Z|X) ≤ H(Z|Y )
True more generally:
whenever Y − X − Z (Markov Relation), i.e., p(Z|X, Y ) = p(Z|X), then H(Z|X) ≤ H(Z|Y )
Note: Proof is part of homework 1.
3.
Definition 7. Conditional mutual information