Divergence, Entropy, Information
Phil Chodrow
MIT Operations Research Center
[email protected]
https://philchodrow.github.io/
Contents
1 Why Information Theory?
4 Entropy
5 Conditional Entropy
1 Why Information Theory?
Briefly, information theory is a mathematical theory of learning with rich
connections to physics, statistics, and biology. Information-theoretic meth-
ods quantify complexity and predictability in systems, and make precise
how observing one feature of a system assists in predicting other features.
Information-theoretic thinking helps to structure algorithms; describe natu-
ral processes; and draw surprising connections between seemingly disparate
fields.
Formally, information theory is a subfield of probability, the mathemat-
ical study of uncertainty and randomness. What is distinctive about in-
formation theory is its emphasis on properties of probability distributions
that are independent of how those distributions are represented. Because
of this representation-independence, information-theoretic quantities often
have claim to be the most “fundamental” properties of a system or problem,
governing its complexity and learnability.
In its original formulation (Shannon, 1948), information theory is a theory
of communication: specifically, the transmission of a signal of some given
complexity over an unreliable channel, such as a telephone line corrupted by
a certain amount of white noise. In these notes, we will emphasize a slightly
different role for information theory, as a theory of learning more generally.
This emphasis is consistent with the original formulation, since the com-
munication problem may be viewed as the problem of the message receiver
learning the intent of the sender based on the potentially corrupted trans-
mission. However, the emphasis on learning allows us to more easily glimpse
some of the rich connections of information theory to other disciplines. Spe-
cial consideration in these notes will be given to statistical motivations for
information-theoretic concepts. Theoretical statistics is the design of mathe-
matical methods for learning from data; information-theoretic considerations
determine when such learning is possible and to what degree. We will close
with a connection to physics; some connections to biology are cited in the
references.
[1] For the remainder of these notes I'll stick with “divergence” – though there are many other interesting objects called “divergences” in mathematics, we won't be discussing any of them here, so no confusion should arise.
So why do we begin with the divergence rather than the entropy? Well, there's a simple reason – while we'll focus on discrete
random variables here, we’d like to develop a theory that, wherever possible,
applies to continuous random variables as well. The divergence is well-defined
for both discrete random variables and continuous ones; that is, if p and q
are two distributions satisfying certain regularity properties, then d(p, q) is
a uniquely determined, nonnegative real number whether p and q are dis-
crete or continuous. In contrast, the natural definition of entropy (the so-called
differential entropy) for continuous random variables has two bad behaviors.
First, it can be negative, which is undesirable for a measure of uncertainty.
Second, and arguably worse, the differential entropy is not even uniquely de-
fined. There are multiple ways to describe the same continuous distribution
– for example, the following three distributions are the same:
1. “The Gaussian distribution with mean 0 and variance 1.”
2. “The Gaussian distribution with mean 0 and standard deviation 1.”
3. “The Gaussian distribution with mean 0 and second moment equal to
1.”
Technically, the act of switching from one of these descriptions to another
can be viewed as a smooth change of coordinates in the space of distribution
parameters. For example, we move from the first description to the second
by changing coordinates from (µ, σ²) to (µ, σ), which we can do by applying
the function f(x, y) = (x, √y). Regrettably, the differential entropy is not
invariant under such coordinate changes – change the way you describe the
distribution, and the differential entropy changes as well. This is undesirable
– the foundations of our theory should be independent of the contingencies
of how we describe the distributions under study. The divergence passes this
test in both discrete and continuous cases; the differential entropy does not.
Since we can define the entropy in terms of the divergence in the discrete
case, we’ll start with the divergence and derive the entropy along the way.
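To make these two pathologies concrete, here is a small numerical sketch (not from the notes themselves; the uniform distributions, the grid-based integration, and all names are illustrative choices). It checks that the differential entropy of Uniform[0, 1/2] is negative and changes when we rescale the variable – one way of changing how a distribution is represented – while the divergence between two distributions survives the same rescaling unchanged.

    import numpy as np

    # Grid for crude numerical integration on [0, 2].
    x = np.linspace(0.0, 2.0, 200001)
    dx = x[1] - x[0]

    def uniform_pdf(x, a, b):
        """Density of the uniform distribution on [a, b]."""
        return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

    def differential_entropy(p):
        """h(p) = -∫ p log p dx, integrating only where p > 0."""
        mask = p > 0
        return float(-np.sum(p[mask] * np.log(p[mask])) * dx)

    def divergence(p, q):
        """d(p, q) = ∫ p log(p/q) dx, integrating only where p > 0."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx)

    # Original representation: the distributions Uniform[0, 1/2] and Uniform[0, 1].
    p, q = uniform_pdf(x, 0.0, 0.5), uniform_pdf(x, 0.0, 1.0)

    # Rescaled representation: after the change of variables y = 2x, the same two
    # distributions become Uniform[0, 1] and Uniform[0, 2].
    p2, q2 = uniform_pdf(x, 0.0, 1.0), uniform_pdf(x, 0.0, 2.0)

    print(differential_entropy(p), differential_entropy(p2))  # ≈ -log 2 and 0: negative, and not invariant
    print(divergence(p, q), divergence(p2, q2))               # both ≈ log 2: the divergence is invariant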
Let’s begin with a simple running example. You are drawing from an
(infinite) deck of standard playing cards, with four card suits {♠, ♣, ♦, ♥}
and thirteen card values {1, . . . , 13}. We’ll view the sets of possible val-
ues as alphabets: X = {♠, ♣, ♦, ♥} is the alphabet of possible suits, and
Y = {1, . . . , 13} the alphabet of possible values. We’ll let X and Y be the
corresponding random variables, so for each realization, X ∈ X and Y ∈ Y.
Suppose that I have a prior belief that the distribution of suits in the
deck is uniform. My belief about the suits can be summarized by a vector
q = (1/4, 1/4, 1/4, 1/4). It's extremely convenient to view q as a single point in the
probability simplex P^X of all valid probability distributions over X .
Definition 1 (Probability Simplex). For any finite alphabet X with |X| = m,
the probability simplex P^X is the set

    P^X ≜ { q ∈ R^m : ∑_i q_i = 1, q_i ≥ 0 for all i }.    (1)
Definition 2 (Divergence). For p, q ∈ P^X, the divergence (also called the
Kullback-Leibler divergence or relative entropy) of q from p is

    d(p, q) ≜ ∑_{x∈X} p(x) log ( p(x) / q(x) ),    (2)

where we are using the conventions that log ∞ = ∞, log 0 = −∞, 0/0 = 0,
and 0 × ∞ = 0.
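As a quick illustration (a minimal sketch; the function name and the looping style are mine, not the notes'), the divergence and its conventions translate directly into code:

    import numpy as np

    def divergence(p, q):
        """d(p, q) = sum_x p(x) log(p(x) / q(x)) for distributions given as sequences."""
        total = 0.0
        for pi, qi in zip(p, q):
            if pi == 0:
                continue             # convention: 0 log(0/q) = 0, whatever q is
            if qi == 0:
                return float("inf")  # p puts mass where q puts none: infinite surprise
            total += pi * np.log(pi / qi)
        return total

    q = [0.25, 0.25, 0.25, 0.25]   # uniform belief over the four suits
    p = [0.0, 0.0, 0.5, 0.5]       # a deck with only red cards
    print(divergence(p, q))        # ≈ 0.693 = log 2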
Theorem 1 (Chernoff Bounds). Suppose that the card suits are truly dis-
tributed according to p ≠ q, and let p̂_n denote the empirical distribution of
the suits observed in n draws. Then,

    e^{−n d(p,q)} / (n + 1)^m ≤ P(p̂_n ; q) ≤ e^{−n d(p,q)},    (3)

where m = |X| and P(p̂_n ; q) is the probability of observing p̂_n under the
belief q.
So, the probability of observing p̂n when you thought the distribution was
q decays exponentially, with the exponent given by the divergence of your
belief q from the true distribution p. Another way to say this: ignoring
non-exponential factors,

    −(1/n) log P(p̂_n ; q) ≅ d(p, q),    (4)

that is, d(p, q) is your average surprise per card drawn.
The Chernoff bounds thus provide firm mathematical content to the idea
that the divergence measures surprise.
Just to make this concrete, suppose that I start with the belief that the
deck is uniform over suits; that is, my belief is q = (1/4, 1/4, 1/4, 1/4) over the
alphabet {♠, ♣, ♦, ♥}. Unbeknownst to me, you have removed all the black
cards from the deck, which therefore has true distribution p = (0, 0, 1/2, 1/2).
I draw 100 cards and record the suits. How surprised am I by the suit
distribution I observe? The divergence between my belief and the true deck
distribution is d(p, q) ≈ 0.69, so Theorem 1 states that the dominating factor
in the probability of my observing an empirical distribution close to p in 100
draws is e^{−0.69×100} ≈ 10^{−30}. I am quite surprised indeed!
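These numbers are easy to reproduce (a small sketch; all names are mine):

    import numpy as np

    p = np.array([0.0, 0.0, 0.5, 0.5])       # true distribution: red cards only
    q = np.array([0.25, 0.25, 0.25, 0.25])   # my uniform belief
    n, m = 100, 4                            # number of draws, alphabet size

    mask = p > 0
    d = float(np.sum(p[mask] * np.log(p[mask] / q[mask])))   # = log 2 ≈ 0.693

    upper = np.exp(-n * d)            # ≈ 7.9e-31: the dominating exponential factor
    lower = upper / (n + 1) ** m      # ≈ 7.6e-39: the lower Chernoff bound
    print(d, lower, upper)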
We’ll now state one of the most important properties of the divergence.
Theorem 2 (Gibbs’ Inequality). For all p, q ∈ P^X, it holds that d(p, q) ≥ 0,
with equality iff p = q.
In words, you can never have “negative surprise,” and you are only unsur-
prised if what you observed is exactly what you expected.
Proof. There are lots of ways to prove Gibbs’ Inequality; here’s one with
Lagrange multipliers. Fix p ∈ P^X. We'd like to show that the problem

    min_{q ∈ P^X} d(p, q)    (5)
has value 0 and that this value is achieved at the unique point q = p. We
need two gradients: the gradient of d(p, q) with respect to q and the gradient
of the implicit constraint g(q) ≜ ∑_{x∈X} q(x) = 1. The former is

    ∇_q d(p, q) = ∇_q [ ∑_{x∈X} p(x) log ( p(x) / q(x) ) ]    (6)
                = −∇_q [ ∑_{x∈X} p(x) log q(x) ]              (7)
                = −p ⊘ q,                                     (8)
where p ⊘ q is the elementwise quotient of vectors, and where we recall the
convention 0/0 = 0. On the other hand,
    ∇_q g(q) = 1,    (9)

the vector whose entries are all unity. The method of Lagrange multipliers
states that we should seek λ ∈ R such that

    ∇_q d(p, q) = λ ∇_q g(q),    (10)

or

    −p ⊘ q = λ 1,    (11)
from which it's easy to see that the only solution is q = p and λ = −1:
rearranging gives p(x) = −λ q(x) for each x, and summing over x forces λ = −1.
It’s a quick check that the corresponding solution value is d(p, p) = 0, which
completes the proof.
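Gibbs’ inequality is also easy to probe numerically as a sanity check (a sketch; the Dirichlet sampler is simply a convenient way to draw random points from the simplex):

    import numpy as np

    rng = np.random.default_rng(0)

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = rng.dirichlet(np.ones(4))                          # a fixed "true" distribution
    candidates = rng.dirichlet(np.ones(4), size=100_000)   # random beliefs q on the simplex

    values = np.array([divergence(p, q) for q in candidates])
    print(values.min() >= 0.0)   # True: the divergence is never negative
    print(divergence(p, p))      # 0.0: and it vanishes exactly when q = p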
Consider next the problem of maximum likelihood estimation: given observations
of X assumed to come from a member p^X_θ of a parameterized family of
distributions, we choose as our estimate θ* the maximizer of the likelihood,
i.e. the parameter value that makes the data most probable. To express this
in terms of the divergence, we need just one more piece of notation: let p̂X be
the empirical distribution of observations of X. Then, it’s a slightly involved
algebraic exercise to show that the maximum likelihood estimation problem
can also be written
    θ* = argmin_θ d(p̂_X, p^X_θ).    (13)
[4] In fact, d is related to a “proper” distance metric on P^X, which is usually called the Fisher Information Metric and is the fundamental object of study in the field of information geometry (Amari and Nagaoka, 2007).
This is rather nice – maximum likelihood estimation consists in making the
parameterized distribution p^X_θ as close as possible to the observed data dis-
tribution p̂_X in the sense of the divergence.[5]
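A small numerical sketch of this equivalence, using a hypothetical one-parameter family over the suits (θ is the total probability of drawing a black card; the family, the imaginary counts, and the grid search are all illustrative choices, not part of the notes):

    import numpy as np

    def p_theta(theta):
        """Hypothetical family: black suits share probability theta, red suits share 1 - theta."""
        return np.array([theta / 2, theta / 2, (1 - theta) / 2, (1 - theta) / 2])

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    counts = np.array([10, 14, 38, 38])   # imaginary tallies of the suits in 100 draws
    p_hat = counts / counts.sum()         # empirical distribution p̂_X

    thetas = np.linspace(0.001, 0.999, 999)
    log_likelihood = np.array([np.sum(counts * np.log(p_theta(t))) for t in thetas])
    divergences = np.array([divergence(p_hat, p_theta(t)) for t in thetas])

    # The maximizer of the likelihood and the minimizer of d(p̂_X, p^X_θ) coincide.
    print(thetas[np.argmax(log_likelihood)], thetas[np.argmin(divergences)])   # both ≈ 0.24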
4 Entropy
After having put it off for a while, let’s define the Shannon entropy. If
we think about the divergence as a (metaphorical) distance, the entropy
measures how close a distribution is to the uniform.
Definition 3 (Entropy). The (Shannon) entropy of a distribution p ∈ P^X is

    H(p) ≜ −∑_{x∈X} p(x) log p(x).    (14)

When convenient, we will also use the notation H(X) to refer to the entropy
of a random variable X distributed according to p.
Remark. This formula makes it easy to remember the entropy of the uniform
distribution – it’s just log m, where m is the number of possible choices. If
we are playing a game in which I draw a card from the infinite deck, the suit
of the card is uniform, and the entropy of the suit distribution is therefore
H(X) = log 4 = 2 log 2.
Theorem 3. Let u ∈ P^X denote the uniform distribution on X, and let m = |X|.
Then, for any p ∈ P^X,

    H(p) = log m − d(p, u).    (15)

Remark. In words, d(p, u) is your surprise if you thought the suit distribution
was uniform and then found it was in fact p. If you are relatively unsurprised,
then p is very close to uniform. Indeed, Gibbs’ inequality (Theorem 2) im-
mediately implies that H(p) assumes its largest value of log m exactly when
p = u.
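A quick numerical check of this relationship, using the running card example (a sketch; the function names are mine):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    def divergence(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    u = np.full(4, 0.25)                 # uniform distribution over the four suits
    p = np.array([0.0, 0.0, 0.5, 0.5])   # the red-only deck

    print(entropy(u), np.log(4))                      # both 2 log 2 ≈ 1.386
    print(entropy(p), np.log(4) - divergence(p, u))   # both log 2 ≈ 0.693: H(p) = log m - d(p, u)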
[5] This result is another hint at the beautiful geometry of the divergence: the operation of minimizing a distance-like measure is often called “projection.” Maximum likelihood estimation thus consists in a kind of statistical projection.
[6] Here is as good a place as any to note that for discrete random variables, the divergence can also be defined in terms of the entropy. Technically, d is the Bregman divergence induced by the Shannon entropy, and can be characterized by the equation

    d(p, q) = −[ H(p) − H(q) − ⟨∇H(q), p − q⟩ ].

Intuitively, d(p, q) is minus the approximation loss associated with estimating the difference in entropy between p and q using a first-order Taylor expansion centered at q. This somewhat artificial-seeming construction turns out to lead in some very interesting directions in statistics and machine learning.
Remark. Theorem 3 provides one useful insight into why the Shannon entropy
does not generalize naturally to continuous distributions: equation (15)
involves the uniform distribution on X, but there can be no analogous
formula for continuous random variables on R because there is no uniform
distribution on R.
Remark. The function f is local iff f can be written as a function only of
my prediction p and how much probabilistic weight I put on the event that
actually occurred – not events that “could have happened” but didn't. Thus,
a local loss function ensures that the Bayesian prediction game is “fair.”
Somewhat amazingly, the log loss function given by f(p, x) = − log p(x) is
the only loss function that is both proper and local (both honest and fair),
up to an affine transformation.
Theorem 4 (Uniqueness of the Log-Loss). Let f be a local and proper loss
function. Then, f(p, x) = A log p(x) + B for some constants A < 0 and
B ∈ R.
Without loss of generality, we'll take A = −1 and B = 0. The entropy in
this context occurs as the expected log-loss when you know the distribution
of suits in the deck. If you know, say, that the proportions in the deck are
p = (1/4, 1/4, 1/4, 1/4) and need to formulate your predictive distribution, Theorem
4 implies that your best guess is just p, since you have no additional side
information. Then...
Definition 6 (Entropy, Bayesian Characterization). The (Shannon) en-
tropy of p is your minimal expected loss when playing the Bayesian prediction
game in which the true distribution of suits is p.
Remark. To see that this definition is consistent with the one we saw before,
we can simply compute the expected loss:

    E[−log p(X)] = −∑_{x∈X} p(x) log p(x) = H(p),

which matches Definition 3. Here p appears inside the logarithm because, if
you are playing optimally, p is both the true distribution of X and your best
predictive distribution.
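To see the prediction game play out numerically (a sketch, not the notes' own construction; random candidate predictions stand in for all possible strategies):

    import numpy as np

    rng = np.random.default_rng(1)

    p = np.array([0.0, 0.0, 0.5, 0.5])   # true suit distribution: red cards only

    def expected_log_loss(q, p):
        """E_{X~p}[-log q(X)]: average penalty for predicting with q when X ~ p."""
        mask = p > 0
        return float(-np.sum(p[mask] * np.log(q[mask])))

    loss_at_truth = expected_log_loss(p, p)                 # = H(p) = log 2 ≈ 0.693
    candidates = rng.dirichlet(np.ones(4), size=100_000)    # other predictive distributions
    losses = np.array([expected_log_loss(q, p) for q in candidates])

    print(loss_at_truth, losses.min())   # no candidate beats predicting the truth itself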
5 Conditional Entropy
The true magic of probability theory lies in conditional probabilities, which
formalize the idea of learning: P(A|B) represents my best belief about A given
what I know about B. While the Shannon entropy itself is quite interesting,
information theory really starts becoming a useful framework for thinking
probabilistically when we formulate the conditional entropy, which encodes
the idea of learning as a process of uncertainty reduction.
In this section and the next, we’ll need to keep track of multiple random
variables and distributions. To fix notation, we'll let pX ∈ P^X be the distri-
bution of a discrete random variable X on alphabet X , pY ∈ P^Y the distribu-
tion of a discrete random variable Y on alphabet Y, and pX,Y ∈ P^{X×Y} their
joint distribution on alphabet X × Y. Additionally, we’ll denote the product
distribution of marginals as pX ⊗ pY ∈ P X ×Y ; that is, (pX ⊗ pY )(x, y) =
pX (x)pY (y).
Definition (Conditional Entropy). The conditional entropy of X given Y is

    H(X|Y) ≜ ∑_{y∈Y} p_Y(y) H(p_{X|Y=y}),

the entropy of the conditional distribution of X, averaged over the values of
Y according to how likely each is to occur. One might be tempted to instead
average the conditional entropies H(X|Y = y) without the weights p_Y(y),
which looks more symmetrical. However, a quick think makes clear that this
definition isn't appropriate, because it doesn't include any information about
the distribution of Y. If Y is concentrated around some very informative (or
uninformative) values, then the unweighted average won't notice that some
values of Y are more valuable than others.
In the framework of our Bayesian interpretation of the entropy above, the
conditional entropy is your expected loss in the guessing game assuming
you receive some additional side information. For example, consider playing
the suit-guessing game in the infinite deck of cards. Recall that the suit
distribution is uniform, with entropy H(X) = H(u) = 2 log 2. Suppose now
that you get side information – when I draw the card from the deck, before
I ask you to guess the suit, I tell you the color (black or red). Since for each
color there are just two possible suits, the entropy decreases. Formally, if
X is the suit and Y the color, it’s easy to compute that H(X|Y ) = log 2 –
knowing the color reduces your uncertainty by half.
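The computation behind H(X|Y) = log 2 can be spelled out explicitly (a sketch, assuming the definition of H(X|Y) as a weighted average of conditional entropies; all names are mine):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    suits = ["spade", "club", "diamond", "heart"]
    color = {"spade": "black", "club": "black", "diamond": "red", "heart": "red"}

    # Joint distribution p_{X,Y}: suits are uniform, and the color is determined by the suit.
    p_joint = {(s, color[s]): 0.25 for s in suits}

    # Marginal distribution of the color Y.
    p_Y = {}
    for (s, c), prob in p_joint.items():
        p_Y[c] = p_Y.get(c, 0.0) + prob

    # H(X|Y) = sum_y p_Y(y) H(X | Y = y).
    H_X_given_Y = 0.0
    for y, py in p_Y.items():
        conditional = np.array([prob / py for (s, c), prob in p_joint.items() if c == y])
        H_X_given_Y += py * entropy(conditional)

    print(H_X_given_Y, np.log(2))   # both ≈ 0.693: the color resolves half of the 2 log 2 of uncertainty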
The conditional entropy is somewhat more difficult to express in terms of
the divergence, but it does have a useful relationship with the (unconditional)
entropy.
Theorem 5 (Chain Rule for Entropy). H(X|Y) = H(X, Y) − H(Y).

Remark. This theorem is easy to remember, because it looks like what you
get by recalling the definition of the conditional probability and taking logs:
    p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y).    (23)
Indeed, take logs and compute the expectations over X and Y to prove the
theorem directly. Another way to remember this theorem is to just say it
out: the uncertainty you have about X given that you’ve been told Y is
equal to the uncertainty you had about both X and Y , less the uncertainty
that was resolved when you learned Y .
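The identity is also easy to confirm numerically for an arbitrary joint distribution (a sketch; the random joint distribution is just a convenient test case):

    import numpy as np

    rng = np.random.default_rng(2)

    def entropy(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    # A random joint distribution over a 4 x 2 alphabet (think suits x colors).
    p_XY = rng.dirichlet(np.ones(8)).reshape(4, 2)
    p_Y = p_XY.sum(axis=0)

    # H(X|Y) from its definition as a weighted average of conditional entropies.
    H_X_given_Y = sum(p_Y[j] * entropy(p_XY[:, j] / p_Y[j]) for j in range(2))

    print(H_X_given_Y, entropy(p_XY.ravel()) - entropy(p_Y))   # the two agree: H(X|Y) = H(X, Y) - H(Y)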
From this theorem, it’s a quick use of Gibbs’ inequality to show:
Theorem 6 (Side Information Reduces Uncertainty).
H(X|Y ) ≤ H(X). (24)
That is, knowing Y can never make you more uncertain about X, only
less. This makes sense – after all, if Y is not actually informative about X,
you can just ignore it.
Theorem 6 implies that H(X) − H(X|Y ) ≥ 0. This difference quantifies
how much Y reduces uncertainty about X; if H(X) − H(X|Y ) = 0, for
example, then H(X|Y ) = H(X) and it is natural to say that Y “carries no
information” about X. We encode the idea of information as uncertainty
reduction in the next section.
6 Mutual Information

The mutual information I(X, Y) measures the uncertainty reduction H(X) − H(X|Y)
discussed at the end of the previous section. It can be written in two equivalent
ways:

    I(X, Y) = d(pX,Y , pX ⊗ pY)    (26)
            = ∑_{y∈Y} pY(y) d(pX|Y=y , pX).    (27)

Let's start by unpacking equation (26). I(X, Y) is the divergence between
the actual joint distribution pX,Y and the product distribution pX ⊗ pY. Im-
portantly, the latter is what the distribution of X and Y would be, were they
independent random variables with the same marginals. This plus Gibbs' in-
equality implies that I(X, Y) = 0 exactly when X and Y are independent.
Finally, as we noted briefly at the end of the previous section, the following
is a direct consequence of Equation (26) and Gibbs’ inequality.
I(X, Y ) ≥ 0. (29)
Now let’s unpack equation (27). One way to read this is as quantifying
the danger of ignoring available information: d(pX|Y =y , pX ) is how surprised
you would be if you ignored the information Y and instead kept using pX
as your belief. If I told you that the deck contained only red cards, but you
chose to ignore this and continued to use u = (1/4, 1/4, 1/4, 1/4) as your predictive
distribution, you would be surprised to keep seeing red cards turn up draw after draw.
Formulation (27) expresses the mutual information as the expected surprise
[7] So, why don't we just dispose of correlation coefficients and use I(X, Y) instead? Well, correlation coefficients can be estimated from data relatively simply and are fairly robust to error. In contrast, I(X, Y) requires that we have reasonably good estimates of the joint distribution pX,Y, which are not usually available. Furthermore, it can be hard to distinguish I(X, Y) = 10^{−6} from I(X, Y) = 0, and statistical tests of significance that would solve this problem are much more complex than those for correlation coefficients.
you would experience by ignoring your available side information Y , with
the expectation taken over all the possible values the side information could
assume. While this formulation may seem much more opaque than (26), it
turns out to be remarkably useful when thinking geometrically, as it expresses
the mutual information as the average “distance” between the marginal pX
and the conditionals pX|Y . Pursuing this thought turns out to express the
mutual information as something like the “moment of inertia” for the joint
distribution pX,Y .
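Both formulations are straightforward to compute, and they agree (a sketch with a random joint distribution; all names are mine):

    import numpy as np

    rng = np.random.default_rng(3)

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # A random joint distribution for (X, Y) on a 4 x 3 alphabet.
    p_XY = rng.dirichlet(np.ones(12)).reshape(4, 3)
    p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

    # Formulation (26): divergence of the joint from the product of its marginals.
    I_joint = divergence(p_XY.ravel(), np.outer(p_X, p_Y).ravel())

    # Formulation (27): expected divergence of the conditionals p_{X|Y=y} from the marginal p_X.
    I_conditional = sum(p_Y[j] * divergence(p_XY[:, j] / p_Y[j], p_X) for j in range(3))

    print(I_joint, I_conditional)   # the two numbers agree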
Theorem (Data Processing Inequality). Suppose that Z = g(Y) for some
function g. Then I(X, Z) ≤ I(X, Y).

This is not the most general possible form of the Data Processing Inequal-
ity, but it has the right flavor. The meaning of this theorem is both “obvious”
and striking in its generality.[8] Intuitively, if you are using Y to predict X,
then any processing you do to Y can only reduce your predictive power. Data
processing can enable tractable computations; reduce the impact of noise in
your observations; and improve your visualizations. The one thing it can’t do
is create information out of thin air. No amount of processing is a substitute
for having enough of the data you really want.
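A quick numerical illustration (a sketch; the joint distribution and the lumping function g are arbitrary choices): coarsening Y into Z = g(Y) can only shrink the mutual information with X.

    import numpy as np

    rng = np.random.default_rng(4)

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def mutual_information(p_xy):
        p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
        return divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())

    # A random joint distribution for (X, Y) on a 4 x 6 alphabet.
    p_XY = rng.dirichlet(np.ones(24)).reshape(4, 6)

    # Z = g(Y) lumps the six symbols of Y into three coarse bins (pairs of adjacent symbols).
    p_XZ = np.stack([p_XY[:, 2 * j] + p_XY[:, 2 * j + 1] for j in range(3)], axis=1)

    print(mutual_information(p_XZ), mutual_information(p_XY))   # the first never exceeds the second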
We'll pursue the proof of the Data Processing Inequality, as the steps are
quite enlightening. First, we need the conditional mutual information:

    I(X, Y |Z) ≜ ∑_{z∈Z} pZ(z) d(pX,Y |Z=z , pX|Z=z ⊗ pY |Z=z).
[8] My first thought when seeing this was: “g can be ANY function? Really?” I then spent half an hour fruitlessly attempting to produce a counter-example.
The divergence in the summand is naturally written I(X, Y |Z = z), in
which case we have I(X, Y |Z) = ∑_{z∈Z} pZ(z) I(X, Y |Z = z), which has the
form of an expectation of mutual informations conditioned on specific values
of Z. The conditional mutual information is naturally understood as the
value of knowing Y for the prediction of X, given that you also already
know Z. Somewhat surprisingly, both of the cases I(X, Y |Z) > I(X, Y ) and
I(X, Y |Z) < I(X, Y ) may hold; that is, knowing Z can either increase or
decrease the value of knowing Y in the context of predicting X.
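A classic example of the first case, I(X, Y |Z) > I(X, Y) (not from the notes): let X and Y be independent fair bits and let Z = X XOR Y. Then X and Y share no information on their own, but once Z is known, either one of them determines the other.

    import numpy as np

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def mutual_information(p_xy):
        p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
        return divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())

    # p[x, y, z] puts mass 1/4 on each (x, y) pair, with z forced to equal x XOR y.
    p = np.zeros((2, 2, 2))
    for x in range(2):
        for y in range(2):
            p[x, y, x ^ y] = 0.25

    I_XY = mutual_information(p.sum(axis=2))   # = 0: X and Y are independent

    p_Z = p.sum(axis=(0, 1))
    I_XY_given_Z = sum(p_Z[z] * mutual_information(p[:, :, z] / p_Z[z]) for z in range(2))

    print(I_XY, I_XY_given_Z, np.log(2))       # 0, log 2, log 2: conditioning on Z creates dependence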
Theorem (Chain Rule for Mutual Information). I(X, (Y, Z)) = I(X, Z) + I(X, Y |Z).

Remark. The notation I(X, (Y, Z)) refers to the (regular) mutual information
between X and the random variable (Y, Z), which we can regard as a single
random variable on the alphabet Y × Z.
As always, the Chain Rule has a nice interpretation if you think about
estimating X by first learning Z, and then Y . At the end of this process,
you know both Y and Z, and therefore have information I(X, (Y, Z)). This
total information splits into two pieces: the information you gained when
you learned Z, and the information you gained when you learned Y after
already knowing Z.
We are now ready to prove the Data Processing Inequality.[9]
[9] Proof borrowed from http://www.cs.cmu.edu/~aarti/Class/10704/lec2-dataprocess.pdf
[10] In fact, Z ⊥ X|Y is often taken as the hypothesis of the Data Processing Inequality rather than Z = g(Y), as it is somewhat weaker and sufficient to prove the result.
Using the chain rule in two ways,

    I(X, (Y, Z)) = I(X, Y) + I(X, Z|Y) = I(X, Z) + I(X, Y |Z).

Since Z = g(Y), knowing Y already determines Z, so I(X, Z|Y) = 0. Rearranging,

    I(X, Z) = I(X, Y) − I(X, Y |Z) ≤ I(X, Y),

since I(X, Y |Z) ≥ 0, which completes the proof.
Information Theory “in General”
1. Shannon's original work (Shannon, 1948) – in the words of one of my
professors, “the most important master's thesis of the 20th century.”
2. Shannon’s entertaining information-theoretic study of written English
(Shannon, 1951).
3. The text of Cover and Thomas (1991) is the standard modern overview
of the field for both theorists and practitioners.
4. Colah’s blog post “Visual Information Theory” at http://colah.github.io/posts/2015-09-Vi
is both entertaining and extremely helpful for getting basic intuition
around the relationship between entropy and communication.
References
Amari, S.-I. and Nagaoka, H. (2007). Methods of Information Geometry.
American Mathematical Society.