Divergence, Entropy, Information
Phil Chodrow
MIT Operations Research Center
[email protected]
https://philchodrow.github.io/
Contents
1 Why Information Theory?
4 Entropy
5 Conditional Entropy
1 Why Information Theory?
Briefly, information theory is a mathematical theory of learning with rich
connections to physics, statistics, and biology. Information-theoretic meth-
ods quantify complexity and predictability in systems, and make precise
how observing one feature of a system assists in predicting other features.
Information-theoretic thinking helps to structure algorithms; describe natu-
ral processes; and draw surprising connections between seemingly disparate
fields.
Formally, information theory is a subfield of probability, the mathemat-
ical study of uncertainty and randomness. What is distinctive about in-
formation theory is its emphasis on properties of probability distributions
that are independent of how those distributions are represented. Because
of this representation-independence, information-theoretic quantities often
have claim to be the most “fundamental” properties of a system or problem,
governing its complexity and learnability.
In its original formulation (Shannon, 1948), information theory is a theory
of communication: specifically, the transmission of a signal of some given
complexity over an unreliable channel, such as a telephone line corrupted by
a certain amount of white noise. In these notes, we will emphasize a slightly
different role for information theory, as a theory of learning more generally.
This emphasis is consistent with the original formulation, since the com-
munication problem may be viewed as the problem of the message receiver
learning the intent of the sender based on the potentially corrupted trans-
mission. However, the emphasis on learning allows us to more easily glimpse
some of the rich connections of information theory to other disciplines. Spe-
cial consideration in these notes will be given to statistical motivations for
information-theoretic concepts. Theoretical statistics is the design of mathe-
matical methods for learning from data; information-theoretic considerations
determine when such learning is possible and to what degree. We will close
with a connection to physics; some connections to biology are cited in the
references.
[1] For the remainder of these notes I'll stick with “divergence” – though there are many other interesting objects called “divergences” in mathematics, we won't be discussing any of them here, so no confusion should arise.
So why do we begin with the divergence rather than the entropy? Well, there's a simple reason – while we'll focus on discrete
random variables here, we’d like to develop a theory that, wherever possible,
applies to continuous random variables as well. The divergence is well-defined
for both discrete random variables and continuous ones; that is, if p and q
are two distributions satisfying certain regularity properties, then d(p, q) is
a uniquely determined, nonnegative real number whether p and q are dis-
crete or continuous. In contrast, the natural definition of entropy (the so-called
differential entropy) for continuous random variables has two bad behaviors.
First, it can be negative, which is undesirable for a measure of uncertainty.
Second, and arguably worse, the differential entropy is not even uniquely de-
fined. There are multiple ways to describe the same continuous distribution
– for example, the following three distributions are the same:
1. “The Gaussian distribution with mean 0 and variance 1.”
2. “The Gaussian distribution with mean 0 and standard deviation 1.”
3. “The Gaussian distribution with mean 0 and second moment equal to
1.”
Technically, the act of switching from one of these descriptions to another
can be viewed as a smooth change of coordinates in the space of distribution
parameters. For example, we move from the first description to the second
by changing coordinates from (µ, σ²) to (µ, σ), which we can do by applying
the function f(x, y) = (x, √y). Regrettably, the differential entropy is not
invariant under such coordinate changes – change the way you describe the
distribution, and the differential entropy changes as well. This is undesirable
– the foundations of our theory should be independent of the contingencies
of how we describe the distributions under study. The divergence passes this
test in both discrete and continuous cases; the differential entropy does not.
Since we can define the entropy in terms of the divergence in the discrete
case, we’ll start with the divergence and derive the entropy along the way.
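To make these two pathologies concrete, here is a small numerical sketch (not from the notes themselves; the uniform distributions, the grid-based integration, and all names are illustrative choices). It checks that the differential entropy of Uniform[0, 1/2] is negative and changes when we rescale the variable – one way of changing how a distribution is represented – while the divergence between two distributions survives the same rescaling unchanged.

    import numpy as np

    # Grid for crude numerical integration on [0, 2].
    x = np.linspace(0.0, 2.0, 200001)
    dx = x[1] - x[0]

    def uniform_pdf(x, a, b):
        """Density of the uniform distribution on [a, b]."""
        return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

    def differential_entropy(p):
        """h(p) = -∫ p log p dx, integrating only where p > 0."""
        mask = p > 0
        return float(-np.sum(p[mask] * np.log(p[mask])) * dx)

    def divergence(p, q):
        """d(p, q) = ∫ p log(p/q) dx, integrating only where p > 0."""
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx)

    # Original representation: the distributions Uniform[0, 1/2] and Uniform[0, 1].
    p, q = uniform_pdf(x, 0.0, 0.5), uniform_pdf(x, 0.0, 1.0)

    # Rescaled representation: after the change of variables y = 2x, the same two
    # distributions become Uniform[0, 1] and Uniform[0, 2].
    p2, q2 = uniform_pdf(x, 0.0, 1.0), uniform_pdf(x, 0.0, 2.0)

    print(differential_entropy(p), differential_entropy(p2))  # ≈ -log 2 and 0: negative, and not invariant
    print(divergence(p, q), divergence(p2, q2))               # both ≈ log 2: the divergence is invariant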
Let’s begin with a simple running example. You are drawing from an
(infinite) deck of standard playing cards, with four card suits {♠, ♣, ♦, ♥}
and thirteen card values {1, . . . , 13}. We’ll view the sets of possible val-
ues as alphabets: X = {♠, ♣, ♦, ♥} is the alphabet of possible suits, and
Y = {1, . . . , 13} the alphabet of possible values. We’ll let X and Y be the
corresponding random variables, so for each realization, X ∈ X and Y ∈ Y.
Suppose that I have a prior belief that the distribution of suits in the
deck is uniform. My belief about the suits can be summarized by a vector
q = (1/4, 1/4, 1/4, 1/4). It's extremely convenient to view q as a single point in the
probability simplex P^X of all valid probability distributions over X .
Definition 1 (Probability Simplex). For any finite alphabet X with |X| = m,
the probability simplex P^X is the set

    P^X ≜ { q ∈ R^m : ∑_i q_i = 1, q_i ≥ 0 for all i }.    (1)
Definition 2 (Divergence). For p, q ∈ P^X, the divergence (also called the
Kullback-Leibler divergence or relative entropy) of q from p is

    d(p, q) ≜ ∑_{x∈X} p(x) log ( p(x) / q(x) ),    (2)

where we are using the conventions that log ∞ = ∞, log 0 = −∞, 0/0 = 0,
and 0 × ∞ = 0.
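As a quick illustration (a minimal sketch; the function name and the looping style are mine, not the notes'), the divergence and its conventions translate directly into code:

    import numpy as np

    def divergence(p, q):
        """d(p, q) = sum_x p(x) log(p(x) / q(x)) for distributions given as sequences."""
        total = 0.0
        for pi, qi in zip(p, q):
            if pi == 0:
                continue             # convention: 0 log(0/q) = 0, whatever q is
            if qi == 0:
                return float("inf")  # p puts mass where q puts none: infinite surprise
            total += pi * np.log(pi / qi)
        return total

    q = [0.25, 0.25, 0.25, 0.25]   # uniform belief over the four suits
    p = [0.0, 0.0, 0.5, 0.5]       # a deck with only red cards
    print(divergence(p, q))        # ≈ 0.693 = log 2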
Theorem 1 (Chernoff Bounds). Suppose that the card suits are truly dis-
tributed according to p ≠ q, and let p̂_n denote the empirical distribution of
the suits observed in n draws. Then,

    e^{−n d(p,q)} / (n + 1)^m ≤ P(p̂_n ; q) ≤ e^{−n d(p,q)},    (3)

where m = |X| and P(p̂_n ; q) is the probability of observing p̂_n under the
belief q.
So, the probability of observing p̂n when you thought the distribution was
q decays exponentially, with the exponent given by the divergence of your
belief q from the true distribution p. Another way to say this: ignoring
non-exponential factors,

    −(1/n) log P(p̂_n ; q) ≅ d(p, q),    (4)

that is, d(p, q) is your average surprise per card drawn.
The Chernoff bounds thus provide firm mathematical content to the idea
that the divergence measures surprise.
Just to make this concrete, suppose that I start with the belief that the
deck is uniform over suits; that is, my belief is q = (1/4, 1/4, 1/4, 1/4) over the
alphabet {♠, ♣, ♦, ♥}. Unbeknownst to me, you have removed all the black
cards from the deck, which therefore has true distribution p = (0, 0, 1/2, 1/2).
I draw 100 cards and record the suits. How surprised am I by the suit
distribution I observe? The divergence between my belief and the true deck
distribution is d(p, q) ≈ 0.69, so Theorem 1 states that the dominating factor
in the probability of my observing an empirical distribution close to p in 100
draws is e^{−0.69×100} ≈ 10^{−30}. I am quite surprised indeed!
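These numbers are easy to reproduce (a small sketch; all names are mine):

    import numpy as np

    p = np.array([0.0, 0.0, 0.5, 0.5])       # true distribution: red cards only
    q = np.array([0.25, 0.25, 0.25, 0.25])   # my uniform belief
    n, m = 100, 4                            # number of draws, alphabet size

    mask = p > 0
    d = float(np.sum(p[mask] * np.log(p[mask] / q[mask])))   # = log 2 ≈ 0.693

    upper = np.exp(-n * d)            # ≈ 7.9e-31: the dominating exponential factor
    lower = upper / (n + 1) ** m      # ≈ 7.6e-39: the lower Chernoff bound
    print(d, lower, upper)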
We’ll now state one of the most important properties of the divergence.
Theorem 2 (Gibbs’ Inequality). For all p, q ∈ P^X, it holds that d(p, q) ≥ 0,
with equality iff p = q.
In words, you can never have “negative surprise,” and you are only unsur-
prised if what you observed is exactly what you expected.
Proof. There are lots of ways to prove Gibbs’ Inequality; here’s one with
Lagrange multipliers. Fix p ∈ P^X. We'd like to show that the problem

    min_{q ∈ P^X} d(p, q)    (5)
has value 0 and that this value is achieved at the unique point q = p. We
need two gradients: the gradient of d(p, q) with respect to q and the gradient
of the implicit constraint g(q) ≜ ∑_{x∈X} q(x) = 1. The former is

    ∇_q d(p, q) = ∇_q [ ∑_{x∈X} p(x) log ( p(x) / q(x) ) ]    (6)
                = −∇_q [ ∑_{x∈X} p(x) log q(x) ]              (7)
                = −p ⊘ q,                                     (8)
where p ⊘ q is the elementwise quotient of vectors, and where we recall the
convention 0/0 = 0. On the other hand,
    ∇_q g(q) = 1,    (9)

the vector whose entries are all unity. The method of Lagrange multipliers
states that we should seek λ ∈ R such that

    ∇_q d(p, q) = λ ∇_q g(q),    (10)

or

    −p ⊘ q = λ 1,    (11)
from which it's easy to see that the only solution is q = p and λ = −1:
rearranging gives p(x) = −λ q(x) for each x, and summing over x forces λ = −1.
It’s a quick check that the corresponding solution value is d(p, p) = 0, which
completes the proof.
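Gibbs’ inequality is also easy to probe numerically as a sanity check (a sketch; the Dirichlet sampler is simply a convenient way to draw random points from the simplex):

    import numpy as np

    rng = np.random.default_rng(0)

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = rng.dirichlet(np.ones(4))                          # a fixed "true" distribution
    candidates = rng.dirichlet(np.ones(4), size=100_000)   # random beliefs q on the simplex

    values = np.array([divergence(p, q) for q in candidates])
    print(values.min() >= 0.0)   # True: the divergence is never negative
    print(divergence(p, p))      # 0.0: and it vanishes exactly when q = p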
Consider next the problem of maximum likelihood estimation: given observations
of X assumed to come from a member p^X_θ of a parameterized family of
distributions, we choose as our estimate θ* the maximizer of the likelihood,
i.e. the parameter value that makes the data most probable. To express this
in terms of the divergence, we need just one more piece of notation: let p̂X be
the empirical distribution of observations of X. Then, it’s a slightly involved
algebraic exercise to show that the maximum likelihood estimation problem
can also be written
    θ* = argmin_θ d(p̂_X, p^X_θ).    (13)
[4] In fact, d is related to a “proper” distance metric on P^X, which is usually called the Fisher Information Metric and is the fundamental object of study in the field of information geometry (Amari and Nagaoka, 2007).
This is rather nice – maximum likelihood estimation consists in making the
parameterized distribution p^X_θ as close as possible to the observed data dis-
tribution p̂_X in the sense of the divergence.[5]
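A small numerical sketch of this equivalence, using a hypothetical one-parameter family over the suits (θ is the total probability of drawing a black card; the family, the imaginary counts, and the grid search are all illustrative choices, not part of the notes):

    import numpy as np

    def p_theta(theta):
        """Hypothetical family: black suits share probability theta, red suits share 1 - theta."""
        return np.array([theta / 2, theta / 2, (1 - theta) / 2, (1 - theta) / 2])

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    counts = np.array([10, 14, 38, 38])   # imaginary tallies of the suits in 100 draws
    p_hat = counts / counts.sum()         # empirical distribution p̂_X

    thetas = np.linspace(0.001, 0.999, 999)
    log_likelihood = np.array([np.sum(counts * np.log(p_theta(t))) for t in thetas])
    divergences = np.array([divergence(p_hat, p_theta(t)) for t in thetas])

    # The maximizer of the likelihood and the minimizer of d(p̂_X, p^X_θ) coincide.
    print(thetas[np.argmax(log_likelihood)], thetas[np.argmin(divergences)])   # both ≈ 0.24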
4 Entropy
After having put it off for a while, let’s define the Shannon entropy. If
we think about the divergence as a (metaphorical) distance, the entropy
measures how close a distribution is to the uniform.
Definition 3 (Entropy). The (Shannon) entropy of a distribution p ∈ P^X is

    H(p) ≜ −∑_{x∈X} p(x) log p(x).    (14)

When convenient, we will also use the notation H(X) to refer to the entropy
of a random variable X distributed according to p.
Remark. This formula makes it easy to remember the entropy of the uniform
distribution – it’s just log m, where m is the number of possible choices. If
we are playing a game in which I draw a card from the infinite deck, the suit
of the card is uniform, and the entropy of the suit distribution is therefore
H(X) = log 4 = 2 log 2.
Theorem 3. Let u ∈ P^X denote the uniform distribution on X, and let m = |X|.
Then, for any p ∈ P^X,

    H(p) = log m − d(p, u).    (15)

Remark. In words, d(p, u) is your surprise if you thought the suit distribution
was uniform and then found it was in fact p. If you are relatively unsurprised,
then p is very close to uniform. Indeed, Gibbs’ inequality (Theorem 2) im-
mediately implies that H(p) assumes its largest value of log m exactly when
p = u.
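A quick numerical check of this relationship, using the running card example (a sketch; the function names are mine):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    def divergence(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    u = np.full(4, 0.25)                 # uniform distribution over the four suits
    p = np.array([0.0, 0.0, 0.5, 0.5])   # the red-only deck

    print(entropy(u), np.log(4))                      # both 2 log 2 ≈ 1.386
    print(entropy(p), np.log(4) - divergence(p, u))   # both log 2 ≈ 0.693: H(p) = log m - d(p, u)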
[5] This result is another hint at the beautiful geometry of the divergence: the operation of minimizing a distance-like measure is often called “projection.” Maximum likelihood estimation thus consists in a kind of statistical projection.
[6] Here is as good a place as any to note that for discrete random variables, the divergence can also be defined in terms of the entropy. Technically, d is the Bregman divergence induced by the Shannon entropy, and can be characterized by the equation

    d(p, q) = −[ H(p) − H(q) − ⟨∇H(q), p − q⟩ ].

Intuitively, d(p, q) is minus the approximation loss associated with estimating the difference in entropy between p and q using a first-order Taylor expansion centered at q. This somewhat artificial-seeming construction turns out to lead in some very interesting directions in statistics and machine learning.
Remark. Theorem 3 provides one useful insight into why the Shannon entropy
does not generalize naturally to continuous distributions: equation (15)
involves the uniform distribution on X, but there can be no analogous
formula for continuous random variables on R because there is no uniform
distribution on R.
Remark. The function f is local iff f can be written as a function only of
my prediction p and how much probabilistic weight I put on the event that
actually occurred – not events that “could have happened” but didn't. Thus,
a local loss function ensures that the Bayesian prediction game is “fair.”
Somewhat amazingly, the log loss function given by f(p, x) = − log p(x) is
the only loss function that is both proper and local (both honest and fair),
up to an affine transformation.
Theorem 4 (Uniqueness of the Log-Loss). Let f be a local and proper loss
function. Then, f(p, x) = A log p(x) + B for some constants A < 0 and
B ∈ R.
Without loss of generality, we'll take A = −1 and B = 0. The entropy in
this context occurs as the expected log-loss when you know the distribution
of suits in the deck. If you know, say, that the proportions in the deck are
p = (1/4, 1/4, 1/4, 1/4) and need to formulate your predictive distribution, Theorem
4 implies that your best guess is just p, since you have no additional side
information. Then...
Definition 6 (Entropy, Bayesian Characterization). The (Shannon) en-
tropy of p is your minimal expected loss when playing the Bayesian prediction
game in which the true distribution of suits is p.
Remark. To see that this definition is consistent with the one we saw before,
we can simply compute the expected loss:

    E[−log p(X)] = −∑_{x∈X} p(x) log p(x) = H(p),

which matches Definition 3. Here p appears inside the logarithm because, if
you are playing optimally, p is both the true distribution of X and your best
predictive distribution.
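To see the prediction game play out numerically (a sketch, not the notes' own construction; random candidate predictions stand in for all possible strategies):

    import numpy as np

    rng = np.random.default_rng(1)

    p = np.array([0.0, 0.0, 0.5, 0.5])   # true suit distribution: red cards only

    def expected_log_loss(q, p):
        """E_{X~p}[-log q(X)]: average penalty for predicting with q when X ~ p."""
        mask = p > 0
        return float(-np.sum(p[mask] * np.log(q[mask])))

    loss_at_truth = expected_log_loss(p, p)                 # = H(p) = log 2 ≈ 0.693
    candidates = rng.dirichlet(np.ones(4), size=100_000)    # other predictive distributions
    losses = np.array([expected_log_loss(q, p) for q in candidates])

    print(loss_at_truth, losses.min())   # no candidate beats predicting the truth itself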
5 Conditional Entropy
The true magic of probability theory lies in conditional probabilities, which
formalize the idea of learning: P(A|B) represents my best belief about A given
what I know about B. While the Shannon entropy itself is quite interesting,
information theory really starts becoming a useful framework for thinking
probabilistically when we formulate the conditional entropy, which encodes
the idea of learning as a process of uncertainty reduction.
In this section and the next, we’ll need to keep track of multiple random
variables and distributions. To fix notation, we'll let pX ∈ P^X be the distri-
bution of a discrete random variable X on alphabet X , pY ∈ P^Y the distribu-
tion of a discrete random variable Y on alphabet Y, and pX,Y ∈ P^{X×Y} their
joint distribution on alphabet X × Y. Additionally, we’ll denote the product
distribution of marginals as pX ⊗ pY ∈ P X ×Y ; that is, (pX ⊗ pY )(x, y) =
pX (x)pY (y).
Definition (Conditional Entropy). The conditional entropy of X given Y is

    H(X|Y) ≜ ∑_{y∈Y} p_Y(y) H(p_{X|Y=y}),

the entropy of the conditional distribution of X, averaged over the values of
Y according to how likely each is to occur. One might be tempted to instead
average the conditional entropies H(X|Y = y) without the weights p_Y(y),
which looks more symmetrical. However, a quick think makes clear that this
definition isn't appropriate, because it doesn't include any information about
the distribution of Y. If Y is concentrated around some very informative (or
uninformative) values, then the unweighted average won't notice that some
values of Y are more valuable than others.
In the framework of our Bayesian interpretation of the entropy above, the
conditional entropy is your expected loss in the guessing game assuming
you receive some additional side information. For example, consider playing
the suit-guessing game in the infinite deck of cards. Recall that the suit
distribution is uniform, with entropy H(X) = H(u) = 2 log 2. Suppose now
that you get side information – when I draw the card from the deck, before
I ask you to guess the suit, I tell you the color (black or red). Since for each
color there are just two possible suits, the entropy decreases. Formally, if
X is the suit and Y the color, it’s easy to compute that H(X|Y ) = log 2 –
knowing the color reduces your uncertainty by half.
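The computation behind H(X|Y) = log 2 can be spelled out explicitly (a sketch, assuming the definition of H(X|Y) as a weighted average of conditional entropies; all names are mine):

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    suits = ["spade", "club", "diamond", "heart"]
    color = {"spade": "black", "club": "black", "diamond": "red", "heart": "red"}

    # Joint distribution p_{X,Y}: suits are uniform, and the color is determined by the suit.
    p_joint = {(s, color[s]): 0.25 for s in suits}

    # Marginal distribution of the color Y.
    p_Y = {}
    for (s, c), prob in p_joint.items():
        p_Y[c] = p_Y.get(c, 0.0) + prob

    # H(X|Y) = sum_y p_Y(y) H(X | Y = y).
    H_X_given_Y = 0.0
    for y, py in p_Y.items():
        conditional = np.array([prob / py for (s, c), prob in p_joint.items() if c == y])
        H_X_given_Y += py * entropy(conditional)

    print(H_X_given_Y, np.log(2))   # both ≈ 0.693: the color resolves half of the 2 log 2 of uncertainty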
The conditional entropy is somewhat more difficult to express in terms of
the divergence, but it does have a useful relationship with the (unconditional)
entropy.
Theorem 5 (Chain Rule for Entropy). H(X|Y) = H(X, Y) − H(Y).

Remark. This theorem is easy to remember, because it looks like what you
get by recalling the definition of the conditional probability and taking logs:
    p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y).    (23)
Indeed, take logs and compute the expectations over X and Y to prove the
theorem directly. Another way to remember this theorem is to just say it
out: the uncertainty you have about X given that you’ve been told Y is
equal to the uncertainty you had about both X and Y , less the uncertainty
that was resolved when you learned Y .
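The identity is also easy to confirm numerically for an arbitrary joint distribution (a sketch; the random joint distribution is just a convenient test case):

    import numpy as np

    rng = np.random.default_rng(2)

    def entropy(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    # A random joint distribution over a 4 x 2 alphabet (think suits x colors).
    p_XY = rng.dirichlet(np.ones(8)).reshape(4, 2)
    p_Y = p_XY.sum(axis=0)

    # H(X|Y) from its definition as a weighted average of conditional entropies.
    H_X_given_Y = sum(p_Y[j] * entropy(p_XY[:, j] / p_Y[j]) for j in range(2))

    print(H_X_given_Y, entropy(p_XY.ravel()) - entropy(p_Y))   # the two agree: H(X|Y) = H(X, Y) - H(Y)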
From this theorem, it’s a quick use of Gibbs’ inequality to show:
Theorem 6 (Side Information Reduces Uncertainty).
H(X|Y ) ≤ H(X). (24)
That is, knowing Y can never make you more uncertain about X, only
less. This makes sense – after all, if Y is not actually informative about X,
you can just ignore it.
Theorem 6 implies that H(X) − H(X|Y ) ≥ 0. This difference quantifies
how much Y reduces uncertainty about X; if H(X) − H(X|Y ) = 0, for
example, then H(X|Y ) = H(X) and it is natural to say that Y “carries no
information” about X. We encode the idea of information as uncertainty
reduction in the next section.
6 Mutual Information

The mutual information I(X, Y) measures the uncertainty reduction H(X) − H(X|Y)
discussed at the end of the previous section. It can be written in two equivalent
ways:

    I(X, Y) = d(pX,Y , pX ⊗ pY)    (26)
            = ∑_{y∈Y} pY(y) d(pX|Y=y , pX).    (27)

Let's start by unpacking equation (26). I(X, Y) is the divergence between
the actual joint distribution pX,Y and the product distribution pX ⊗ pY. Im-
portantly, the latter is what the distribution of X and Y would be, were they
independent random variables with the same marginals. This plus Gibbs' in-
equality implies that I(X, Y) = 0 exactly when X and Y are independent.
Finally, as we noted briefly at the end of the previous section, the following
is a direct consequence of Equation (26) and Gibbs’ inequality.
I(X, Y ) ≥ 0. (29)
Now let’s unpack equation (27). One way to read this is as quantifying
the danger of ignoring available information: d(pX|Y =y , pX ) is how surprised
you would be if you ignored the information Y and instead kept using pX
as your belief. If I told you that the deck contained only red cards, but you
chose to ignore this and continued to use u = (1/4, 1/4, 1/4, 1/4) as your predictive
distribution, you would be surprised to keep seeing red cards turn up draw after draw.
Formulation (27) expresses the mutual information as the expected surprise
[7] So, why don't we just dispose of correlation coefficients and use I(X, Y) instead? Well, correlation coefficients can be estimated from data relatively simply and are fairly robust to error. In contrast, I(X, Y) requires that we have reasonably good estimates of the joint distribution pX,Y, which are not usually available. Furthermore, it can be hard to distinguish I(X, Y) = 10^{−6} from I(X, Y) = 0, and statistical tests of significance that would solve this problem are much more complex than those for correlation coefficients.
you would experience by ignoring your available side information Y , with
the expectation taken over all the possible values the side information could
assume. While this formulation may seem much more opaque than (26), it
turns out to be remarkably useful when thinking geometrically, as it expresses
the mutual information as the average “distance” between the marginal pX
and the conditionals pX|Y . Pursuing this thought turns out to express the
mutual information as something like the “moment of inertia” for the joint
distribution pX,Y .
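Both formulations are straightforward to compute, and they agree (a sketch with a random joint distribution; all names are mine):

    import numpy as np

    rng = np.random.default_rng(3)

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # A random joint distribution for (X, Y) on a 4 x 3 alphabet.
    p_XY = rng.dirichlet(np.ones(12)).reshape(4, 3)
    p_X, p_Y = p_XY.sum(axis=1), p_XY.sum(axis=0)

    # Formulation (26): divergence of the joint from the product of its marginals.
    I_joint = divergence(p_XY.ravel(), np.outer(p_X, p_Y).ravel())

    # Formulation (27): expected divergence of the conditionals p_{X|Y=y} from the marginal p_X.
    I_conditional = sum(p_Y[j] * divergence(p_XY[:, j] / p_Y[j], p_X) for j in range(3))

    print(I_joint, I_conditional)   # the two numbers agree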
Theorem (Data Processing Inequality). Suppose that Z = g(Y) for some
function g. Then I(X, Z) ≤ I(X, Y).

This is not the most general possible form of the Data Processing Inequal-
ity, but it has the right flavor. The meaning of this theorem is both “obvious”
and striking in its generality.[8] Intuitively, if you are using Y to predict X,
then any processing you do to Y can only reduce your predictive power. Data
processing can enable tractable computations; reduce the impact of noise in
your observations; and improve your visualizations. The one thing it can’t do
is create information out of thin air. No amount of processing is a substitute
for having enough of the data you really want.
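A quick numerical illustration (a sketch; the joint distribution and the lumping function g are arbitrary choices): coarsening Y into Z = g(Y) can only shrink the mutual information with X.

    import numpy as np

    rng = np.random.default_rng(4)

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def mutual_information(p_xy):
        p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
        return divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())

    # A random joint distribution for (X, Y) on a 4 x 6 alphabet.
    p_XY = rng.dirichlet(np.ones(24)).reshape(4, 6)

    # Z = g(Y) lumps the six symbols of Y into three coarse bins (pairs of adjacent symbols).
    p_XZ = np.stack([p_XY[:, 2 * j] + p_XY[:, 2 * j + 1] for j in range(3)], axis=1)

    print(mutual_information(p_XZ), mutual_information(p_XY))   # the first never exceeds the second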
We'll pursue the proof of the Data Processing Inequality, as the steps are
quite enlightening. First, we need the conditional mutual information:

    I(X, Y |Z) ≜ ∑_{z∈Z} pZ(z) d(pX,Y |Z=z , pX|Z=z ⊗ pY |Z=z).
[8] My first thought when seeing this was: “g can be ANY function? Really?” I then spent half an hour fruitlessly attempting to produce a counter-example.
The divergence in the summand is naturally written I(X, Y |Z = z), in
which case we have I(X, Y |Z) = ∑_{z∈Z} pZ(z) I(X, Y |Z = z), which has the
form of an expectation of mutual informations conditioned on specific values
of Z. The conditional mutual information is naturally understood as the
value of knowing Y for the prediction of X, given that you also already
know Z. Somewhat surprisingly, both of the cases I(X, Y |Z) > I(X, Y ) and
I(X, Y |Z) < I(X, Y ) may hold; that is, knowing Z can either increase or
decrease the value of knowing Y in the context of predicting X.
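A classic example of the first case, I(X, Y |Z) > I(X, Y) (not from the notes): let X and Y be independent fair bits and let Z = X XOR Y. Then X and Y share no information on their own, but once Z is known, either one of them determines the other.

    import numpy as np

    def divergence(p, q):
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def mutual_information(p_xy):
        p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
        return divergence(p_xy.ravel(), np.outer(p_x, p_y).ravel())

    # p[x, y, z] puts mass 1/4 on each (x, y) pair, with z forced to equal x XOR y.
    p = np.zeros((2, 2, 2))
    for x in range(2):
        for y in range(2):
            p[x, y, x ^ y] = 0.25

    I_XY = mutual_information(p.sum(axis=2))   # = 0: X and Y are independent

    p_Z = p.sum(axis=(0, 1))
    I_XY_given_Z = sum(p_Z[z] * mutual_information(p[:, :, z] / p_Z[z]) for z in range(2))

    print(I_XY, I_XY_given_Z, np.log(2))       # 0, log 2, log 2: conditioning on Z creates dependence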
Theorem (Chain Rule for Mutual Information). I(X, (Y, Z)) = I(X, Z) + I(X, Y |Z).

Remark. The notation I(X, (Y, Z)) refers to the (regular) mutual information
between X and the random variable (Y, Z), which we can regard as a single
random variable on the alphabet Y × Z.
As always, the Chain Rule has a nice interpretation if you think about
estimating X by first learning Z, and then Y . At the end of this process,
you know both Y and Z, and therefore have information I(X, (Y, Z)). This
total information splits into two pieces: the information you gained when
you learned Z, and the information you gained when you learned Y after
already knowing Z.
We are now ready to prove the Data Processing Inequality.[9]
[9] Proof borrowed from http://www.cs.cmu.edu/~aarti/Class/10704/lec2-dataprocess.pdf
[10] In fact, Z ⊥ X|Y is often taken as the hypothesis of the Data Processing Inequality rather than Z = g(Y), as it is somewhat weaker and sufficient to prove the result.
Using the chain rule in two ways,

    I(X, (Y, Z)) = I(X, Y) + I(X, Z|Y) = I(X, Z) + I(X, Y |Z).

Since Z = g(Y), knowing Y already determines Z, so I(X, Z|Y) = 0. Rearranging,

    I(X, Z) = I(X, Y) − I(X, Y |Z) ≤ I(X, Y),

since I(X, Y |Z) ≥ 0, which completes the proof.
Information Theory “in General”
1. Shannon's original work (Shannon, 1948) – in the words of one of my
professors, “the most important master's thesis of the 20th century.”
2. Shannon’s entertaining information-theoretic study of written English
(Shannon, 1951).
3. The text of Cover and Thomas (1991) is the standard modern overview
of the field for both theorists and practitioners.
4. Colah’s blog post “Visual Information Theory” at http://colah.github.io/posts/2015-09-Vi
is both entertaining and extremely helpful for getting basic intuition
around the relationship between entropy and communication.
References
Amari, S.-I. and Nagaoka, H. (2007). Methods of Information Geometry.
American Mathematical Society.