
Divergence, Entropy, Information

An Opinionated Introduction to Information Theory

Phil Chodrow
MIT Operations Research Center
[email protected]
https://philchodrow.github.io/
arXiv:1708.07459v1 [cs.IT] 24 Aug 2017

August 25, 2017

Information theory is a mathematical theory of learning with deep connections with topics as diverse as artificial intelligence, statistical physics,
and biological evolution. Many primers on the topic paint a broad picture
with relatively little mathematical sophistication, while many others develop
specific application areas in detail. In contrast, these informal notes aim to
outline some elements of the information-theoretic “way of thinking,” by cut-
ting a rapid and interesting path through some of the theory’s foundational
concepts and theorems. They are aimed at practicing systems scientists who
are interested in exploring potential connections between information theory
and their own fields. The main mathematical prerequisite for the notes is
comfort with elementary probability, including sample spaces, conditioning,
and expectations.
We take the Kullback-Leibler divergence as our foundational concept, and
then proceed to develop the entropy and mutual information. We discuss
some of the main foundational results, including the Chernoff bounds as a
characterization of the divergence; Gibbs’ Theorem; and the Data Processing
Inequality. A recurring theme is that the definitions of information theory
support natural theorems that sound “obvious” when translated into English.
More pithily, “information theory makes common sense precise.” Since the
focus of the notes is not primarily on technical details, proofs are provided
only where the relevant techniques are illustrative of broader themes. Oth-
erwise, proofs and intriguing tangents are referenced in liberally-sprinkled
footnotes. The notes close with a highly nonexhaustive list of references to
resources and other perspectives on the field.

Contents
1 Why Information Theory?

2 Why Not Start with Entropy?

3 Introducing the Divergence

4 Entropy

5 Conditional Entropy

6 Information Three Ways

7 Why Information Shrinks

8 Some Further Reading

1 Why Information Theory?
Briefly, information theory is a mathematical theory of learning with rich
connections to physics, statistics, and biology. Information-theoretic meth-
ods quantify complexity and predictability in systems, and make precise
how observing one feature of a system assists in predicting other features.
Information-theoretic thinking helps to structure algorithms; describe natu-
ral processes; and draw surprising connections between seemingly disparate
fields.
Formally, information theory is a subfield of probability, the mathemat-
ical study of uncertainty and randomness. What is distinctive about in-
formation theory is its emphasis on properties of probability distributions
that are independent of how those distributions are represented. Because
of this representation-independence, information-theoretic quantities often
have claim to be the most “fundamental” properties of a system or problem,
governing its complexity and learnability.
In its original formulation (Shannon, 1948), information theory is a theory
of communication: specifically, the transmission of a signal of some given
complexity over an unreliable channel, such as a telephone line corrupted by
a certain amount of white noise. In these notes, we will emphasize a slightly
different role for information theory, as a theory of learning more generally.
This emphasis is consistent with the original formulation, since the com-
munication problem may be viewed as the problem of the message receiver
learning the intent of the sender based on the potentially corrupted trans-
mission. However, the emphasis on learning allows us to more easily glimpse
some of the rich connections of information theory to other disciplines. Spe-
cial consideration in these notes will be given to statistical motivations for
information-theoretic concepts. Theoretical statistics is the design of mathe-
matical methods for learning from data; information-theoretic considerations
determine when such learning is possible and to what degree. We will close
with a connection to physics; some connections to biology are cited in the
references.

2 Why Not Start with Entropy?


Entropy is easily the information-theoretic concept with the widest popular
currency, and many expositions of that theory take entropy as their starting
point. We, however, will choose a different point of departure for these notes,
and derive entropy along the way. Our point of choice is the Kullback-Leibler
(KL) divergence between two distributions, also called in some contexts the
relative entropy, relative information, or free energy.1 Why start with the

1
For the remainder of these notes I’ll stick with “divergence” – though there are many other interesting
objects called “divergences” in mathematics, we won’t be discussing any of them here, so no confusion should arise.

divergence? Well, there’s a simple reason – while we’ll focus on discrete
random variables here, we’d like to develop a theory that, wherever possible,
applies to continuous random variables as well. The divergence is well-defined
for both discrete random variables and continuous ones; that is, if p and q
are two distributions satisfying certain regularity properties, then d(p, q) is
a uniquely determined, nonnegative real number whether p and q are dis-
crete or continuous. In contrast, the natural definition of entropy (so called
differential entropy) for continuous random variables has two bad behaviors.
First, it can be negative, which is undesirable for a measure of uncertainty.
Second, and arguably worse, the differential entropy is not even uniquely de-
fined. There are multiple ways to describe the same continuous distribution
– for example, the following three distributions are the same:
1. “The Gaussian distribution with mean 0 and variance 1.”
2. “The Gaussian distribution with mean 0 and standard deviation 1.”
3. “The Gaussian distribution with mean 0 and second moment equal to
1.”
Technically, the act of switching from one of these descriptions to another
can be viewed as a smooth change of coordinates2 in the space of distribution
parameters. For example, we move from the first description to the second
by changing coordinates from (µ, σ²) to (µ, σ), which we can do by applying
the function f(x, y) = (x, √y). Regrettably, the differential entropy is not
invariant under such coordinate changes – change the way you describe the
distribution, and the differential entropy changes as well. This is undesirable
– the foundations of our theory should be independent of the contingencies
of how we describe the distributions under study. The divergence passes this
test in both discrete and continuous cases; the differential entropy does not.3
Since we can define the entropy in terms of the divergence in the discrete
case, we’ll start with the divergence and derive the entropy along the way.

3 Introducing the Divergence


It is often said that the divergence d(p, q) between distributions p and q
measures how “surprised” you are if you think the state of the world is q
but then measure it to be p. However, this idea of surprise isn’t typically
explained or made precise. To motivate the KL divergence, we’ll start from a
somewhat unusual beginning – the Chernoff bounds – that makes exact the
role that the divergence plays in governing how surprised you ought to be.
2
Technically, the idea of a smooth coordinate change is captured by diffeomorphisms: smooth invertible
functions on coordinate space whose inverses are also smooth.
3
It is possible to define alternative notions of entropy that attempt to skirt these issues; however, they have their own difficulties.
https://en.wikipedia.org/wiki/Limiting_density_of_discrete_points

Let’s begin with a simple running example. You are drawing from an
(infinite) deck of standard playing cards, with four card suits {♠, ♣, ♦, ♥}
and thirteen card values {1, . . . , 13}. We’ll view the sets of possible val-
ues as alphabets: X = {♠, ♣, ♦, ♥} is the alphabet of possible suits, and
Y = {1, . . . , 13} the alphabet of possible values. We’ll let X and Y be the
corresponding random variables, so for each realization, X ∈ X and Y ∈ Y.
Suppose that I have a prior belief that the distribution of suits in the
deck is uniform. My belief about the suits can be summarized by a vector
q = (1/4, 1/4, 1/4, 1/4). It’s extremely convenient to view q as a single point in the
probability simplex P X of all valid probability distributions over X .
Definition 1 (Probability Simplex). For any finite alphabet X with |X| = m,
the probability simplex P X is the set

P X ≜ { q ∈ R^m | ∑_i q_i = 1, q_i ≥ 0 ∀i }.    (1)

Remark. It’s helpful to remember that P X is an (m − 1)-dimensional space;
the “missing” dimension is due to the constraint ∑_i q_i = 1. When m = 3,
P X is an equilateral triangle; when m = 4 a tetrahedron, and so on.
If q is your belief, you would naturally expect that, if you drew enough
cards, the observed distribution of suits would be “close” to q, and that if
you could draw infinitely many cards, the distribution would indeed converge
to q. Let’s make this precise: define p̂n to be the distribution of suits you
observe after pulling n cards. It’s important to remember that p̂ is a random
vector, which changes in each realization. But it would be reasonable to
expect that p̂ → q as n → ∞, and indeed this is true almost surely (with
probability 1) according to the Strong Law of Large Numbers, if q is in fact
the true distribution of cards in the deck.
But what happens if you keep drawing cards and the observed distribution
p̂n is much different than your belief q? Then you would justifiably be sur-
prised, and your “level of surprise” can be quantified by the probability of
observing p̂n if the true distribution were q, which I’ll denote P(p̂n ; q). We’d
expect P(p̂n ; q) to become small when n grows large. Indeed, there is a quite
strong result here – P(p̂n ; q) decays exponentially in n, with a very special
exponent.
Definition 2 (Kullback-Leibler Divergence). For p, q ∈ P X , the Kullback-
Leibler (KL) divergence of q from p is

d(p, q) ≜ ∑_{x∈X} p(x) log [p(x)/q(x)],    (2)

where we are using the conventions that log ∞ = ∞, log 0 = −∞, 0/0 = 0,
and 0 × ∞ = 0.

Theorem 1 (Chernoff Bounds). Suppose that the card suits are truly dis-
tributed according to p ≠ q. Then,

e^{−n d(p,q)} / (n + 1)^m ≤ P(p̂n ; q) ≤ e^{−n d(p,q)}.    (3)
So, the probability of observing p̂n when you thought the distribution was
q decays exponentially, with the exponent given by the divergence of your
belief q from the true distribution p. Another way to say this: ignoring
non-exponential factors,

−(1/n) log P(p̂n ; q) ≅ d(p, q),    (4)

that is, d(p, q) is your average surprise per card drawn, where surprise is measured as minus the log-probability of what you observed.
The Chernoff bounds thus provide firm mathematical content to the idea
that the divergence measures surprise.
Just to make this concrete, suppose that I start with the belief that the
deck is uniform over suits; that is, my belief is q = (1/4, 1/4, 1/4, 1/4) over the
alphabet {♠, ♣, ♦, ♥}. Unbeknownst to me, you have removed all the black
cards from the deck, which therefore has true distribution p = (0, 0, 1/2, 1/2).


I draw 100 cards and record the suits. How surprised am I by the suit
distribution I observe? The divergence between my belief and the true deck
distribution is d(p, q) ≈ 0.69, so Theorem 1 states that the dominating factor
in the probability of my observing an empirical distribution close to p in 100
draws is e^{−0.69×100} ≈ 10^{−30}. I am quite surprised indeed!
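If you want to check this arithmetic, here is a minimal Python sketch; the helper kl and the variable names are conveniences introduced here, using natural logarithms and the convention 0 log 0 = 0.

    import numpy as np

    def kl(p, q):
        """Kullback-Leibler divergence d(p, q), skipping terms with p(x) = 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    q = np.array([0.25, 0.25, 0.25, 0.25])   # my belief: uniform over suits
    p = np.array([0.0, 0.0, 0.5, 0.5])       # true deck: black cards removed

    d = kl(p, q)                             # = log 2, approximately 0.693
    n = 100
    print(d, np.exp(-n * d))                 # upper Chernoff bound, roughly 10^-30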
We’ll now state one of the most important properties of the divergence.
Theorem 2 (Gibbs’ Inequality). For all p, q ∈ P X , it holds that d(p, q) ≥ 0,
with equality iff p = q.
In words, you can never have “negative surprise,” and you are only unsur-
prised if what you observed is exactly what you expected.
Proof. There are lots of ways to prove Gibbs’ Inequality; here’s one with
Lagrange multipliers. Fix p ∈ P X . We’d like to show that the problem
min d(p, q) (5)
q∈P X

has value 0 and that this value is achieved at the unique point q = p. We
need two gradients: the gradient of d(p, q) with respect to q and the gradient
of the implicit constraint g(q) ≜ ∑_{x∈X} q(x) = 1. The former is
" #
X p(x)
∇q d(p, q) = ∇q p(x) log (6)
q(x)
x∈X
" #
X
= −∇q p(x) log q(x) (7)
x∈X
= −p ⊘ q, (8)

where p ⊘ q is the elementwise quotient of vectors, and where we recall the
convention 0/0 = 0. On the other hand,

∇q g(q) = 1, (9)

the vector whose entries are all unity. The method of Lagrange multipliers
states that we should seek λ ∈ R such that

∇q d(p, q) = λ∇q g(q),    (10)

or
− p ⊘ q = λ1, (11)
from which it’s easy to see that the only solution is q = p and λ = −1.
It’s a quick check that the corresponding solution value is d(p, p) = 0, which
completes the proof.

Remark. Theorem 2 is the primary sense in which d behaves “like a distance”


on the simplex P. On the other hand, d is unlike a distance in that it is not
symmetric and does not satisfy the triangle inequality.4
Let’s close out this section by noting one of the many connections be-
tween the divergence and classical statistics. Maximum likelihood estimation
is a foundational method of modern statistical practice; tools from linear
regression to neural networks may be viewed as likelihood maximizers. The
divergence allows a particularly elegant formulation of maximum likelihood
estimation: likelihood maximization is the same as divergence min-
imization. Let θ be some statistical parameter, which may be multidimen-
sional; for example, in the context of normal distributions, we may have
θ = (µ, σ 2 ); in the context of regression, θ may be the regression coefficients
β. Let pθX be the probability distribution over X with parameters θ. Let
{x1 , . . . , xn } be a sequence of i.i.d. observations of X. Maximum likelihood
estimation encourages us to find the parameter θ such that

θ* = argmax_θ ∏_{i=1}^n p^θ_X(x_i),    (12)

i.e. the parameter value that makes the data most probable. To express this
in terms of the divergence, we need just one more piece of notation: let p̂X be
the empirical distribution of observations of X. Then, it’s a slightly involved
algebraic exercise to show that the maximum likelihood estimation problem
can also be written

θ* = argmin_θ d(p̂X, p^θ_X).    (13)
4
In fact, d is related to a “proper” distance metric on P X , which is usually called the Fisher In-
formation Metric and is the fundamental object of study in the field of information geometry
(Amari and Nagaoka, 2007).

This is rather nice – maximum likelihood estimation consists in making the
parameterized distribution pθX as close as possible to the observed data dis-
tribution p̂X in the sense of the divergence.5
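Here is a minimal Python sketch of this equivalence for a toy two-symbol (black/red) model with a single parameter θ = P(red); the data, the grid search, and the helper kl are illustrative choices of mine, not part of the theory. The log-likelihood maximizer and the divergence minimizer coincide.

    import numpy as np

    def kl(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # observations: 0 = black card, 1 = red card
    x = np.array([1, 1, 0, 1, 1, 1, 0, 1])
    p_hat = np.array([np.mean(x == 0), np.mean(x == 1)])   # empirical distribution

    thetas = np.linspace(0.01, 0.99, 99)                   # candidate values of P(red)
    loglik = [np.sum(np.log(np.where(x == 1, t, 1 - t))) for t in thetas]
    divs = [kl(p_hat, np.array([1 - t, t])) for t in thetas]

    # the two optimizers agree (here both are 0.75, the empirical frequency of red)
    print(thetas[np.argmax(loglik)], thetas[np.argmin(divs)])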

4 Entropy
After having put it off for a while, let’s define the Shannon entropy. If
we think about the divergence as a (metaphorical) distance, the entropy
measures how close a distribution is to the uniform.

Definition 3 (Shannon Entropy). The Shannon entropy of p ∈ P X is


H(p) ≜ −∑_{x∈X} p(x) log p(x).    (14)

When convenient, we will also use the notation H(X) to refer to the entropy
of a random variable X distributed according to p.

Theorem 3. The Shannon entropy is related to the divergence according to


the formula
H(p) = log m − d(p, u) , (15)
where m = |X | is the size of the alphabet X and where u is the uniform
distribution on X .6

Remark. This formula makes it easy to remember the entropy of the uniform
distribution – it’s just log m, where m is the number of possible choices. If
we are playing a game in which I draw a card from the infinite deck, the suit
of the card is uniform, and the entropy of the suit distribution is therefore
H(X) = log 4 = 2 log 2.
Remark. In words, d(p, u) is your surprise if you thought the suit distribution
was uniform and then found it was in fact p. If you are relatively unsurprised,
then p is very close to uniform. Indeed, Gibbs’ inequality (Theorem 2) im-
mediately implies that H(p) assumes its largest value of log m exactly when
p = u.
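A quick numerical sketch of Theorem 3; the particular distribution p below is arbitrary, chosen only for illustration.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    def kl(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = np.array([0.1, 0.2, 0.3, 0.4])
    u = np.full(4, 0.25)                       # uniform distribution, m = 4
    print(entropy(p), np.log(4) - kl(p, u))    # the two numbers agree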

5
This result is another hint at the beautiful geometry of the divergence: the operation of minimizing a
distance-like measure is often called “projection.” Maximum likelihood estimation thus consists in a
kind of statistical projection.
6
Here is as good a place as any to note that for discrete random variables, the divergence can also be
defined in terms of the entropy. Technically, d is the Bregman divergence induced by the Shannon
entropy, and can be characterized by the equation

d(p, q) = −[H(p) − H(q) − ⟨∇q H, p − q⟩].    (16)

Intuitively, d(p, q) is minus the approximation loss associated with estimating the difference in entropy
between p and q using a first-order Taylor expansion centered at q. This somewhat artificial-seeming
construction turns out to lead in some very interesting directions in statistics and machine learning.

Remark. Theorem 3 provides one useful insight into why the Shannon entropy
does not generalize naturally to continuous distributions – equation
(15) involves the uniform distribution on X , but there can be no analogous
formula for continuous random variables on R because there is no uniform
distribution on R.

A Bayesian Interpretation of Entropy


The construction of the entropy in terms of the divergence is fairly natural –
we use the divergence to measure how close p is to the uniform distribution,
flip the sign so that high entropy distributions are more uniform, and add
a constant term to make the entropy nonnegative. This formulation of the
entropy turns out to have another interesting characterization in the context
of Bayesian prediction. In Bayesian prediction, I will pull a single card from
the deck. Before I do, I ask you to provide a distribution p over the alphabet
{♠, ♣, ♦, ♥} representing your prediction about the suit of the card I will pull.
As examples, you can choose p = (1, 0, 0, 0) if you are certain that the suit
will be ♠, or p = (1/4, 1/4, 1/4, 1/4) to express maximal ignorance. After you guess,
I pull the card, obtaining a sample x ∈ X , and reward you based on the
quality of your prediction relative to the outcome x. I do this based on a
loss function f (p, x); after your guess, your loss is f (p, x) dollars, say.
assume that my aim is to encourage you to (a) report your true beliefs about
the deck and (b) reward you based only on what happened (i.e. not on what
could have happened), then there is only one appropriate loss function f ,
which turns out to be closely related to the entropy. More formally,
Definition 4. A loss function f is proper if, for any alphabet Y and random
variable Y on Y,

pX|Y=y = argmin_{p ∈ P X} E[f (p, X) | Y = y].    (17)

Remark. In this definition, it’s useful to think of Y as some kind of “side


information” or “additional data.” For example, Y could be my telling you
that the card I pulled is a red card, which could influence your predictive
distribution. When f is proper, you have an incentive to factor that into
your predictive distribution. While it may feel that “of course” you should
factor this in, not all loss functions encourage you to do so. For example, if
f is constant, then you have no incentive to use Y at all, since each guess
is just as good as any other. A proper loss function guarantees that you
can maximize your payout (minimize your loss) by completely accounting
for all available data when forming your prediction, which should therefore
be pX|Y =y . Thus, a proper loss function ensures that the Bayesian prediction
game is “honest”.
Definition 5. A loss function f is local if f (p, x) = ψ(p, p(x)) for some
function ψ.

Remark. The function f is local iff f can be written as a function only of
your prediction p and how much probabilistic weight you put on the event that
actually occurred – not events that “could have happened” but didn’t. Thus,
a local loss function ensures that the Bayesian prediction game is “fair.”
Somewhat amazingly, the log loss function given by f (p, x) = − log p(x) is
the only loss function that is both proper and local (both honest and fair),
up to an affine transformation.
Theorem 4 (Uniqueness of the Log-Loss). Let f be a local and proper loss
function. Then, f (p, x) = A log p(x) + B for some constants A < 0 and
B ∈ R.
Without loss of generality, we’ll take A = −1 and B = 0. The entropy in
this context occurs as the expected log-loss when you know the distribution
of suits in the deck. If you know, say, that the proportions in the deck are
p = (1/4, 1/4, 1/4, 1/4) and need to formulate your predictive distribution, Theorem
4 implies that your best guess is just p, since you have no additional side
information. Then....
Definition 6 (Entropy, Bayesian Characterization). The (Shannon) en-
tropy of p is your minimal expected loss when playing the Bayesian prediction
game in which the true distribution of suits is p.
Remark. To see that this definition is consistent with the one we saw before,
we can simply compute the expectation:

E[f (p, X)] = E[− log p(X)]    (18)
            = −∑_{x∈X} p(x) log p(x),    (19)

which matches Definition 3. The second equality follows from the fact that,
if you are playing optimally, p is both the true distribution of X and your
best predictive distribution.
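As a sketch of the prediction game under the log-loss (the alternative prediction r below is an arbitrary choice for illustration): the expected loss E[− log r(X)] is smallest when the prediction r equals the true distribution p, and its minimum value is H(p).

    import numpy as np

    p = np.array([0.25, 0.25, 0.25, 0.25])     # true suit distribution

    def expected_log_loss(r, p):
        """E[-log r(X)] when X is distributed according to p."""
        return float(-np.sum(p * np.log(r)))

    best = expected_log_loss(p, p)             # equals H(p) = 2 log 2
    r = np.array([0.7, 0.1, 0.1, 0.1])         # some other prediction
    print(best, expected_log_loss(r, p))       # the second number is strictly larger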

5 Conditional Entropy
The true magic of probability theory is conditional probabilities, which for-
malize the idea of learning: P(A|B) represents my best belief about A given
what I know about B. While the Shannon entropy itself is quite interesting,
information theory really starts becoming a useful framework for thinking
probabilistically when we formulate the conditional entropy, which encodes
the idea of learning as a process of uncertainty reduction.
In this section and the next, we’ll need to keep track of multiple random
variables and distributions. To fix notation, we’ll let pX ∈ P X be the distri-
bution of a discrete random variable X on alphabet X , pY ∈ P Y the distribu-
tion of a discrete random variable Y on alphabet Y, and pX,Y ∈ P X ×Y their

joint distribution on alphabet X × Y. Additionally, we’ll denote the product
distribution of marginals as pX ⊗ pY ∈ P X ×Y ; that is, (pX ⊗ pY )(x, y) =
pX (x)pY (y).

Definition 7 (Conditional Entropy). The conditional entropy of X given


Y is

H(X|Y ) ≜ −∑_{(x,y)∈X×Y} pX,Y (x, y) log pX|Y (x|y).    (20)

Remark. It might seem as though H(X|Y ) ought to be defined as


H̃(X|Y ) = −∑_{(x,y)∈X×Y} pX|Y (x|y) log pX|Y (x|y),    (21)

which looks more symmetrical. However, a quick think makes clear that this
definition isn’t appropriate, because it doesn’t include any information about
the distribution of Y . If Y is concentrated around some very informative (or
uninformative) values, then H̃ won’t notice that some values of Y are more
valuable than others.
In the framework of our Bayesian interpretation of the entropy above, the
conditional entropy is your expected reward in the guessing game assuming
you receive some additional side information. For example, consider playing
the suit-guessing game in the infinite deck of cards. Recall that the suit
distribution is uniform, with entropy H(X) = H(u) = 2 log 2. Suppose now
that you get side information – when I draw the card from the deck, before
I ask you to guess the suit, I tell you the color (black or red). Since for each
color there are just two possible suits, the entropy decreases. Formally, if
X is the suit and Y the color, it’s easy to compute that H(X|Y ) = log 2 –
knowing the color reduces your uncertainty by half.
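Here is a minimal sketch of that computation directly from Definition 7; the joint table below is the natural joint distribution of color and suit in the uniform deck.

    import numpy as np

    # rows: colors Y in {black, red}; columns: suits X in {spade, club, diamond, heart}
    p_xy = np.array([[0.25, 0.25, 0.0, 0.0],    # black cards are spades or clubs
                     [0.0, 0.0, 0.25, 0.25]])   # red cards are diamonds or hearts

    p_y = p_xy.sum(axis=1)                      # marginal distribution of the color
    p_x_given_y = p_xy / p_y[:, None]           # conditional distributions pX|Y

    mask = p_xy > 0
    H_X_given_Y = float(-np.sum(p_xy[mask] * np.log(p_x_given_y[mask])))
    print(H_X_given_Y, np.log(2))               # both equal log 2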
The conditional entropy is somewhat more difficult to express in terms of
the divergence, but it does have a useful relationship with the (unconditional)
entropy.

Theorem 5. The conditional entropy is related to the unconditional entropy


as
H(X|Y ) = H(X, Y ) − H(Y ), (22)
where H(X, Y ) is the entropy of the distribution pX,Y .

Remark. This theorem is easy to remember, because it looks like what you
get by recalling the definition of the conditional probability and taking logs:

pX,Y (x, y)
pX|Y (x|y) = . (23)
pY (y)

Indeed, take logs and compute the expectations over X and Y to prove the
theorem directly. Another way to remember this theorem is to just say it

out: the uncertainty you have about X given that you’ve been told Y is
equal to the uncertainty you had about both X and Y , less the uncertainty
that was resolved when you learned Y .
From this theorem, it’s a quick use of Gibbs’ inequality to show:
Theorem 6 (Side Information Reduces Uncertainty).
H(X|Y ) ≤ H(X). (24)
That is, knowing Y can never make you more uncertain about X, only
less. This makes sense – after all, if Y is not actually informative about X,
you can just ignore it.
Theorem 6 implies that H(X) − H(X|Y ) ≥ 0. This difference quantifies
how much Y reduces uncertainty about X; if H(X) − H(X|Y ) = 0, for
example, then H(X|Y ) = H(X) and it is natural to say that Y “carries no
information” about X. We encode the idea of information as uncertainty
reduction in the next section.

6 Information Three Ways


Thus far, we’ve seen two concepts – divergence and entropy – that play fun-
damental roles in information theory. But neither of them exactly resembles
an idea of “information,” so how does the theory earn its name? Our brief
note at the end of the last section suggests that we think about informa-
tion as a relationship between two variables X and Y , in which knowing
Y decreases our uncertainty (entropy) about X. As it turns out, the idea
of information that falls out of this motivation is a remarkably
useful one, and can be formulated in many interesting and different ways.
Let’s start by naming this difference:
Definition 8 (Mutual Information). The mutual information I(X, Y ) in
Y about X is
I(X, Y ) ≜ H(X) − H(X|Y ).    (25)
The mutual information is just the uncertainty reduction associated with
knowing Y . In the context of the Bayesian guessing game, I(X, Y ) is the
“value” of being told the suit color, compared to having to play the game
without that information. From our calculations above, in the suit-guessing
game, I(X, Y ) = H(X) − H(X|Y ) = 2 log 2 − log 2 = log 2.
Let’s now express mutual information in two other ways. Remarkably,
these follow directly via simple algebra, but each identity gives a new way to
think about the meaning of the mutual information.
Theorem 7. The mutual information may also be written as:
I(X, Y ) = d(pX,Y , pX ⊗ pY ) (26)
= EY [d(pX|Y , pX )] (27)

Let’s start by unpacking equation (26). I(X, Y ) is the divergence between
the actual joint distribution pX,Y and the product distribution pX ⊗ pY . Im-
portantly, the latter is what the distribution of X and Y would be, were they
independent random variables with the same marginals. This plus Gibbs’ in-
equality implies:

Corollary 1. Random variables X and Y are independent if and only if


I(X, Y ) = 0.

So, I(X, Y ) is something like a super-charged correlation coefficient – it


measures the degree of statistical correlation between X and Y , but it is
stronger than the correlation coefficient in two ways. First, I(X, Y ) detects
all kinds of statistical relationships, not just linear ones. Second, while the
correlation coefficient can vanish for dependent variables, this never happens
for the mutual information – zero mutual information implies independence,
period. As a quick illustration, it’s not hard to see that if X is the suit
and Z is the numerical value of the card pulled, then I(X, Z) = 0. Intu-
itively, if we were playing the suit-guessing game and I offered to tell you the
card’s face-value, you would be rightly annoyed – that’s an unhelpful (“un-
informative”) offer, because the face-values and suit colors are independent.7
Equation (26) has another useful consequence. Since that formulation is
symmetric in X and Y ,

Corollary 2. The mutual information is symmetric:

I(X, Y ) = I(Y, X) . (28)

Finally, as we noted briefly at the end of the previous section, the following
is a direct consequence of Equation (26) and Gibbs’ inequality.

Corollary 3. The mutual information is nonnegative:

I(X, Y ) ≥ 0. (29)

Now let’s unpack equation (27). One way to read this is as quantifying
the danger of ignoring available information: d(pX|Y =y , pX ) is how surprised
you would be if you ignored the information Y and instead kept using pX
as your belief. If I told you that the deck contained only red  cards, but you
chose to ignore this and continue guessing u = 14 , 14 , 14 , 14 as your guess,
you would be surprised to keep seeing red cards turn up draw after draw.
Formulation (27) expresses the mutual information as the expected surprise
7
So, why don’t we just dispose of correlation coefficients and use I(X, Y ) instead? Well, correlation
coefficients can be estimated from data relatively simply and are fairly robust to error. In contrast,
I(X, Y ) requires that we have reasonably good estimates of the joint distribution pX,Y , which is not
usually available. Furthermore, it can be hard to distinguish I(X, Y ) = 10^{−6} from I(X, Y ) = 0, and
statistical tests of significance that would solve this problem are much more complex than those for
correlation coefficients.

you would experience by ignoring your available side information Y , with
the expectation taken over all the possible values the side information could
assume. While this formulation may seem much more opaque than (26), it
turns out to be remarkably useful when thinking geometrically, as it expresses
the mutual information as the average “distance” between the marginal pX
and the conditionals pX|Y . Pursuing this thought turns out to express the
mutual information as something like the “moment of inertia” for the joint
distribution pX,Y .
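As a closing sketch for this section, all three expressions for the mutual information – Definition 8, equation (26), and equation (27) – can be checked numerically on the suit/color example; the helpers entropy and kl below are conveniences defined here, not standard library functions.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    def kl(p, q):
        p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # joint distribution of (color Y, suit X) in the uniform deck; rows index Y
    p_xy = np.array([[0.25, 0.25, 0.0, 0.0],
                     [0.0, 0.0, 0.25, 0.25]])
    p_y, p_x = p_xy.sum(axis=1), p_xy.sum(axis=0)

    # Definition 8: H(X) - H(X|Y), using Theorem 5 for the conditional entropy
    i1 = entropy(p_x) - (entropy(p_xy) - entropy(p_y))

    # Equation (26): divergence of the joint from the product of marginals
    i2 = kl(p_xy, np.outer(p_y, p_x))

    # Equation (27): expected divergence of the conditionals from the marginal
    i3 = sum(p_y[y] * kl(p_xy[y] / p_y[y], p_x) for y in range(2))

    print(i1, i2, i3, np.log(2))                # all three equal log 2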

7 Why Information Shrinks


The famous 2nd Law of Thermodynamics states that, in a closed system,
entropy increases. The physicists’ concept of entropy is closely related to but
slightly different from the information theorist’s concept, and we therefore
won’t make a direct attack on the 2nd Law in these notes. However, there
is a close analog of the 2nd Law that gives much of the flavor and can be
formulated in information theoretic terms. Whereas the 2nd Law states
that entropy grows, the Data Processing Inequality states that information
shrinks.

Theorem 8 (Data Processing Inequality). Let X and Y be random variables,


and let Z = g(Y ), where g is some function g : Y → Z. Then,

I(X, Z) ≤ I(X, Y ). (30)

This is not the most general possible form of the Data Processing Inequal-
ity, but it has the right flavor. The meaning of this theorem is both “obvious”
and striking in its generality.8 Intuitively, if you are using Y to predict X,
then any processing you do to Y can only reduce your predictive power. Data
processing can enable tractable computations; reduce the impact of noise in
your observations; and improve your visualizations. The one thing it can’t do
is create information out of thin air. No amount of processing is a substitute
for having enough of the data you really want.
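Here is a minimal numerical sketch of the inequality, on a toy joint distribution of my own construction: coarsen Y into Z = g(Y) by merging symbols and compare the two mutual informations.

    import numpy as np

    def mutual_information(p_xy):
        """I(X, Y) = d(p_XY, p_X ⊗ p_Y) for a joint distribution given as a matrix."""
        p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
        prod = np.outer(p_x, p_y)
        mask = p_xy > 0
        return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / prod[mask])))

    rng = np.random.default_rng(0)
    p_xy = rng.random((3, 4))
    p_xy /= p_xy.sum()                          # random joint distribution, |X| = 3, |Y| = 4

    # Z = g(Y) merges the four Y-symbols into two: {0, 1} -> 0 and {2, 3} -> 1
    p_xz = np.stack([p_xy[:, :2].sum(axis=1), p_xy[:, 2:].sum(axis=1)], axis=1)

    print(mutual_information(p_xz), mutual_information(p_xy))   # I(X,Z) <= I(X,Y)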
We’ll pursue the proof of the Data Processing Inequality, as the steps are
quite enlightening. First, we need the conditional mutual information:

Definition 9 (Conditional Mutual Information). The conditional mutual


information of X and Y given Z is

I(X, Y |Z) = ∑_{z∈Z} pZ (z) d(pX,Y |Z=z , pX|Z=z ⊗ pY |Z=z ).    (31)

8
My first thought when seeing this was: “g can be ANY function? Really?” I then spent half an hour
fruitlessly attempting to produce a counter-example.

The divergence in the summand is naturally written I(X, Y |Z = z), in
which case we have I(X, Y |Z) = ∑_{z∈Z} pZ (z) I(X, Y |Z = z), which has the
form of an expectation of mutual informations conditioned on specific values
of Z. The conditional mutual information is naturally understood as the
value of knowing Y for the prediction of X, given that you also already
know Z. Somewhat surprisingly, both of the cases I(X, Y |Z) > I(X, Y ) and
I(X, Y |Z) < I(X, Y ) may hold; that is, knowing Z can either increase or
decrease the value of knowing Y in the context of predicting X.

Theorem 9 (Chain Rule of Mutual Information). We have

I(X, (Y, Z)) = I(X, Z) + I(X, Y |Z). (32)

Remark. The notation I(X, (Y, Z)) refers to the (regular) mutual information
between X and the random variable (Y, Z), which we can regard as a single
random variable on the alphabet Y × Z.

Proof. We can compute directly, dividing up sums and remembering relations


like pX,Y,Z (x, y, z) = pX,Y |Z (x, y|z)pZ (z). Omitting some of the more tedious
algebra,

I(X, (Y, Z)) = d(pX,Y,Z , pX ⊗ pY,Z )

             = ∑_{(x,z)∈X×Z} pX,Z (x, z) log [ pX,Z (x, z) / (pX (x) pZ (z)) ]
               + ∑_{(x,y,z)∈X×Y×Z} pZ (z) pX,Y |Z (x, y|z) log [ pX,Y |Z (x, y|z) / (pY |Z (y|z) pX|Z (x|z)) ]

             = I(X, Z) + I(X, Y |Z),

as was to be shown.

As always, the Chain Rule has a nice interpretation if you think about
estimating X by first learning Z, and then Y . At the end of this process,
you know both Y and Z, and therefore have information I(X, (Y, Z)). This
total information splits into two pieces: the information you gained when
you learned Z, and the information you gained when you learned Y after
already knowing Z.
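A quick numerical sketch of the chain rule on a small random joint distribution, constructed here purely for illustration:

    import numpy as np

    def kl(p, q):
        p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    rng = np.random.default_rng(1)
    p = rng.random((2, 3, 2))
    p /= p.sum()                                   # joint distribution pX,Y,Z with axes (x, y, z)

    p_x, p_z = p.sum(axis=(1, 2)), p.sum(axis=(0, 1))
    p_xz, p_yz = p.sum(axis=1), p.sum(axis=0)

    # left-hand side: I(X, (Y, Z)) = d(p_XYZ, p_X ⊗ p_YZ)
    i_x_yz = kl(p, p_x[:, None, None] * p_yz[None, :, :])

    # right-hand side, first term: I(X, Z) = d(p_XZ, p_X ⊗ p_Z)
    i_xz = kl(p_xz, np.outer(p_x, p_z))

    # right-hand side, second term: I(X, Y | Z) as in Definition 9
    i_xy_given_z = 0.0
    for z in range(p.shape[2]):
        p_xy_z = p[:, :, z] / p_z[z]               # conditional joint of (X, Y) given Z = z
        px_z, py_z = p_xy_z.sum(axis=1), p_xy_z.sum(axis=0)
        i_xy_given_z += p_z[z] * kl(p_xy_z, np.outer(px_z, py_z))

    print(i_x_yz, i_xz + i_xy_given_z)             # the two sides of equation (32) agree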
We are now ready to prove the Data Processing Inequality.9

Proof. Since Z = g(Y ), that is, Z is a function of Y alone, we have that


Z ⊥ X|Y , that is, given Y , Z and X are independent.10 So, I(X, Z|Y ) = 0.

9
Proof borrowed from http://www.cs.cmu.edu/~aarti/Class/10704/lec2-dataprocess.pdf
10
In fact, Z ⊥ X|Y is often taken as the hypothesis of the Data Processing inequality rather than
Z = g(Y ), as it is somewhat weaker and sufficient to prove the result.

On the other hand, using the chain rule in two ways,

I(X, (Y, Z)) = I(X, Z) + I(X, Y |Z)


= I(X, Y ) + I(X, Z|Y ).

Since I(X, Z|Y ) = 0 by our argument above, we obtain I(X, Y ) = I(X, Z) +


I(X, Y |Z). Since I(X, Y |Z) is nonnegative by Gibbs’ inequality, we conclude
that I(X, Z) ≤ I(X, Y ), as was to be shown.

The Data Processing Inequality states that, in the absence of additional


information sources, processing leaves you with less information than you
started. The 2nd Law of Thermodynamics states that, in the absence of
additional energy sources, the system dynamics leave you with less order
than you started. These formulations suggest a natural parallel between the
concepts of information and order, and therefore a natural parallel between
the two theorems. We’ll close out this note with an extremely simplistic-yet-
suggestive way to think about this.
Let X0 and Y0 each be random variables reflecting the possible locations
and momenta of two particles at time t = 0. We’ll assume (a) that the
particles don’t interact, but that (b) the experimenter has placed the two
particles very close to each other with similar momenta. Thus, the initial
configuration of the system is highly ordered, reflected by I(X0 , Y0 ) > 0. If
we knew Y0 , then we’d also significantly reduce our uncertainty about X0 .
How does this system evolve over time? We’re assuming no interactions, so
each of the particles evolves separately according to some dynamics, which
we can write X1 = gx (X0 ) and Y1 = gy (Y0 ). Using the data processing
inequality twice, we have

I(X1 , Y1 ) ≤ I(X0 , Y1 ) ≤ I(X0 , Y0 ). (33)

Thus, the dynamics tend to reduce information. Of course, we can complicate


this picture in various ways, by considering particle interactions or external
potentials, either of which require a more sophisticated analysis. The full
2nd Law, which is beyond the scope of these notes, is most appropriate for
considering these cases.

8 Some Further Reading


This introduction has been far from exhaustive, and I heartily encourage
those interested to explore these topics in more detail. The below is a short
list of some of the resources I have found most intriguing and useful, in
addition to those cited in the introduction.

Information Theory “in General”
1. Shannon’s original work (Shannon, 1948) – in the words of one of my
professors, “the most important masters’ thesis of the 20th century.”
2. Shannon’s entertaining information-theoretic study of written English
(Shannon, 1951).
3. The text of Cover and Thomas (1991) is the standard modern overview
of the field for both theorists and practitioners.
4. Colah’s blog post “Visual Information Theory” at http://colah.github.io/posts/2015-09-Vi
is both entertaining and extremely helpful for getting basic intuition
around the relationship between entropy and communication.

Information Theory, Statistics, and Machine Learning


1. An excellent and entertaining introduction to these topics is the already-
mentioned MacKay (2003).
2. Those who want to further explore will likely enjoy Csiszar and Shields
(2004), but I would suggest doing this one after MacKay.
3. Readers interested in pursuing the Bayesian development of entropy
much more deeply may enjoy Bernardo and Smith (2008), which pro-
vides an extremely thorough development of decision theory with a
strong information-theoretic perspective.
4. The notes for the course “Information Processing and Learning” at
Carnegie-Mellon’s famous Machine Learning department are excellent
and accessible; find them at http://www.cs.cmu.edu/~aarti/Class/10704/lecs.html

Information Theory, Physics, and Biology


1. Marc Harper has a number of very fun papers in which he views biologi-
cal evolutionary dynamics as learning processes through the framework
of information theory; a few are (Harper, 2009, 2010).
2. John Baez and his student Blake Pollard wrote a very nice and easy-
reading review article of the role of information concepts in biological
and chemical systems (Baez and Pollard, 2016).
3. More generally, John Baez’s blog is a treasure-trove of interesting vi-
gnettes and insights on the role that information plays in the physical
and biological worlds: https://johncarlosbaez.wordpress.com/category/information-and
For a more thoroughly worked-out connection between information dis-
sipation and the Second Law of Thermodynamics, see this one: https://johncarlosbaez.wordp

References
Amari, S.-I. and Nagaoka, H. (2007). Methods of Information Geometry. American Mathematical Society.

Baez, J. and Pollard, B. (2016). Relative entropy in biological systems. Entropy, 18(2):46.

Bernardo, J. M. and Smith, A. F. (2008). Bayesian Theory. John Wiley and Sons, New York.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. John Wiley and Sons, New York.

Csiszar, I. and Shields, P. C. (2004). Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528.

Harper, M. (2009). Information geometry and evolutionary game theory. arXiv:0911.1383, pages 1–13.

Harper, M. (2010). The replicator equation as an inference dynamic. arXiv:0911.1763, pages 1–10.

MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 4th edition.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27:379–423.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64.
