
Principles of Data Analysis


Prasenjit Saha

Published by Cappella Archive (ISBN 1-902918-11-8).


The text can be downloaded free from www.physik.uzh.ch/~psaha/pda/ in A4, letter, or paperback
size. This web page also has some brief reviews.
The paperback edition (which is much nicer but not much more expensive than laser printing) is
available directly from the publisher cappella-archive.com.
To everyone who has shared with me, in dif-
ferent times and places, the fun of the ideas
herein.
You know who you are!
Contents

Preface 1
1. Probability Rules! 3
2. Binomial and Poisson Distributions 15
3. Gaussian Distributions 22
4. Monte-Carlo Basics 31
5. Least Squares 35
6. Distribution Function Fitting 45
7. Entropy 50
8. Entropy and Thermodynamics 58
Appendix: Miscellaneous Formulas 70
Hints and Answers 74
Index 80
Preface
People who know me are probably asking: What is a theorist doing writing a book about
data analysis?
Well, it started as a course for final-year maths and physics undergraduates, which
I taught in 1996. Since then, some kind colleagues and friends have told me they found
the course notes useful, and suggested I tidy them up into a short book. So here it is. It is
also an affordable book: it is available free via the web, and if you care for a nicely-bound
copy, Cappella Archive will sell you one at production cost.
What this book hopes to convey are ways of thinking (= principles) about data
analysis problems, and how a small number of ideas are enough for a large number of
applications. The material is organized into eight chapters:
1) Basic probability theory, what it is with the Bayesians versus the Frequentists, and a
bit about why quantum mechanics is weird (Bell’s theorem).
2) Binomial and Poisson distributions, and some toy problems introducing the key ideas
of parameter fitting and model comparison.
3) The central limit theorem, and why it makes Gaussians ubiquitous, from counting
statistics to share prices.
4) An interlude on Monte-Carlo algorithms.
5) Least squares, and related things like the χ2 test and error propagation. [Including
the old problem of fitting a straight line amid errors in both x and y.]
6) Distribution function fitting and comparison, and why the Kolmogorov-Smirnov test
and variants of it with even longer names are not really arcane. [Sample problem:
invent your own KS-like statistic.]
7) Entropy in information theory and in image reconstruction.
8) Thermodynamics and statistical physics reinterpreted as data analysis problems.
As you see, we are talking about data analysis in its broadest, most general, sense.
Mixed in with the main text (but set in smaller type) are many examples, problems,
digressions, and asides. For problems, I’ve indicated levels of difficulty ([1]: trivial to
[4]: seriously hard) but here individual experiences can be very different, so don’t take
these ratings too seriously. But the problems really are at the heart of the book—data
analysis is nothing if it isn't about solving problems. In fact, when I started preparing the
original course, I first gathered a set of problems and then worked out a syllabus around
them. Digressions are long derivations of some key results, while asides are useless but
cute points, and both can be skipped without losing continuity.
The mathematical level is moderately advanced. For example, the text assumes the
reader is used to matrix notation and Lagrange multipliers, but Fourier transforms are
briefly explained in the Appendix. A few of the problems involve writing short programs,
in any programming language.


References are in footnotes, but Probability theory: The Logic of Science by E.T. Jaynes
(in press at Cambridge University Press [see also bayes.wustl.edu]) and Data Analysis:
A Bayesian Tutorial by D.S. Sivia (Oxford University Press 1996) have been so influential
that I must cite them here. My book is not a substitute for either of these, more of a
supplement: if Sivia makes the basics crystal-clear and Jaynes shows how powerful the
underlying ideas are, I have tried to illustrate the enormous range of applications and put
them in context.
Although this book is small, the number of people who have distinctly contributed to
it is not. James Binney first introduced me to the ideas of E.T. Jaynes, and he and Brian Buck
explained to me why they are important. Vincent Macaulay has since continued what
they had begun, and indeed been for fifteen years my guide to the literature, sympathetic
reviewer, voice of reason. If Vincent had written this book it would have been much better.
I remain grateful to Dayal Wickramasinghe for giving me the opportunity to design and
teach a course on a random interesting topic,1 and to the students for going through with it,
especially Alexander Austin, Geoffrey Brent, and Paul Cutting, who not only demolished
any problems I could concoct, but often found better solutions than I had. Several people
posed interesting problems to me, my response to which is included in this book; Ken
Freeman and Scott Tremaine posed the two hardest ones. Eric Grunwald, Abigail Kirk,
Onuttom Narayan, Inga Schmoldt, Firoza Sutaria, Andrew Usher, and Alan Whiting all
suggested improvements on early versions of the manuscript. Rebecca Thorn illustrated
the front cover and Nigel Dowrick wrote the back cover. David Byram-Wigfield provided
much-appreciated advice on typography and made the bound book happen. Naturally,
none of these people is responsible for errors that remain.
I hope you will find this book entertaining and useful, but I have to include a health
warning: this book is written by a physicist. I don’t mean the fact that most of the examples
come from physics and astronomy: I simply picked the examples I could explain best,
and many of them have straightforward translations into other areas. The health warning
has to do with physical scientists’ notions of interestingness. If you have hung out with
physicists you can probably spot a physicist even from the way they talk about the stock
market. So if you are a statistician or a social scientist and find this book completely
wrongheaded and abhorrent, my apologies, it’s a cultural thing.

December 2002

1 At the Australian National University in lovely Canberra, by Lake Burley Griffin amongst the kangaroos,
go there if you can.
1. Probability Rules!
What do you think of when someone says ‘Data’?
I often imagine a stream of tiny, slightly luminous numbers flowing from a telescope
to a spinning disc, sometimes a frenetic man reading a heap of filled questionnaires while
batting down any that try to get blown away. Some people see a troupe of chimpanzees
observing and nimbly analyzing the responses of their experimental humans. Some only
see a pale android with stupendous computing powers.
From such mental images we might abstract the idea that data are information not
yet in the form we want it, and therefore needing non-trivial processing. Moreover, the
information is incomplete, through errors or lack of some measurements, so probable
reconstructions of the incomplete parts are desired. Schematically, we might view data
analysis as
{Incomplete information}  −−(probability theory)−−→  {Inferences}.
This book is an elaboration of the scheme above, interpreted as broadly as possible while
still being quantitative, and as we’ll see, both data and inferences may take quite surprising
forms.
Concrete situations involving data analysis, of which we will discuss many in this
book, tend to fall cleanly into one of four groups of problems.
(1) First and most straightforward are the situations where we want to measure some-
thing. The measurement process may be very indirect, and involve much theoretical
calculation. For example, imagine measuring the height of a mountain-top with a
barometer; this would require not only reading off the air pressure, but theoretical
knowledge of how air pressure behaves with height and how much uncertainty is in-
troduced by weather-dependent fluctuations. But the relation between the numbers
being observed and the numbers being inferred is assumed to be well understood.
We will call such situations ‘parameter-fitting’ problems.
(2) The second group of problems involves deciding between two or more possibilities.
An archetype of such problems is “Will it rain tomorrow?” We will take up similar
(though much easier) questions as ‘model comparison problems’.
(3) The third kind of situation is when we have a measurement, or perhaps we can see a
pattern, and we want to know whether to be confident that what we see is really there,
or to dismiss it as a mirage produced by noise. We will call these ‘goodness-of-fit
problems’, and they will typically arise together with parameter-fitting problems.
(4) Finally, we will come across some curious situations where the data are few indeed,
perhaps even a single number, but we do have some additional theoretical knowledge,
and we still want to make what inferences we can.
Through probability theory we can pose, and systematically work out, problems
of all these four kinds in a fairly unified way. So we will spend the rest of this chapter


developing the necessary probability theory from first principles and then some formalism
for applying it to data analysis.
The formal elements of probability theory are quite simple. We write prob(A) for the
probability associated with some state A, prob(Ā) is the probability of not-A, prob(AB) or
prob(A, B) is the probability of A and B, and prob(A | B) is the so-called conditional proba-
bility of A given B.1 Probabilities are always numbers in [0, 1] and obey two fundamental
rules:

prob(A) + prob(Ā) = 1     (sum rule),
prob(AB) = prob(A | B) prob(B)     (product rule).     (1.1)

Note that the product rule is not a simple multiplication, because A and B may depend
on each other. (Cats with green eyes or brown eyes are common, but cats with one green
eye and one brown eye are rare.)

EXAMPLE [Pick a number. . . ] Any natural number, and I will do likewise. What is the probability
that the two numbers are coprime?
Say the two numbers are m, n. We want the probability that m, n do not have a common factor
that equals 2, 3, or any other prime. For conciseness let us write Dp to mean the state "p divides
both m and n". Accordingly, D̄p means "p divides at most one of m, n" (not "p divides neither m
nor n"). We have

prob(D̄p) = 1 − p⁻².     (1.2)

The probability that m and n are coprime is

prob(m, n coprime) = prob(D̄2, D̄3, D̄5, . . .)     (1.3)

and using the product rule gives us

prob(m, n coprime) = prob(D̄2) prob(D̄3 | D̄2) prob(D̄5 | D̄2, D̄3) · · · .     (1.4)

This scary expression fortunately simplifies, because the prob(D̄p) are in fact independent: 1/3 of
natural numbers are multiples of 3, and if you remove all multiples of 2, that's still the case. Hence
we have

prob(m, n coprime) = prob(D̄2) prob(D̄3) prob(D̄5) · · ·
                   = (1 − 2⁻²)(1 − 3⁻²)(1 − 5⁻²) · · ·     (1.5)

The infinite product in the last line equals (see page 72 for more on this formula) 6/π². ⊓
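A quick numerical check is easy: the short Python sketch below draws random pairs of integers (the upper limit of 10**6 and the number of trials are arbitrary choices) and compares the observed coprime fraction with 6/π².

    import math
    import random

    def coprime_fraction(trials=100_000, upper=10**6):
        # Draw pairs (m, n) uniformly from 1..upper and count the coprime ones.
        hits = sum(math.gcd(random.randint(1, upper),
                            random.randint(1, upper)) == 1
                   for _ in range(trials))
        return hits / trials

    print(coprime_fraction())    # typically about 0.608
    print(6 / math.pi**2)        # 0.6079...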

PROBLEM 1.1: Achilles and the Tortoise take turns rolling a die (Achilles first) until one of them
rolls a six. What is the probability that the Tortoise rolls a six?
Try to get the answer in two different ways: (i) via an infinite series with one term per turn,
and (ii) by invoking a symmetry. [1]

1In this book all probabilities are really conditional, even if the conditions are tacit. So when we write
prob(A), what we really mean is prob(A|{tacit}).

From the rules (1.1), two corollaries follow.


The first corollary is called the marginalization rule. Consider a set of possibilities
Mk that are “exhaustive” and “mutually exclusive”, which is to say one and only one of
the Mk must hold. For such Mk
Σ_k prob(Mk | A) = 1,     (1.6)

for any A, and using the product rule and (1.6) gives the marginalization rule

Σ_k prob(Mk, A) = prob(A).     (1.7)

We will use the marginalization rule many times in this book; usually Mk will represent
the possible values of some parameter.
The second corollary follows on writing prob(AB) in two different ways and rear-
ranging:
prob(B | A) = prob(A | B) prob(B) / prob(A).     (1.8)
It is called Bayes’ theorem,1 and we will also use it many times in this book. For the
moment, note that prob(A | B) may be very different from prob(B | A). (The probability of
rain given clouds in the sky is not equal to the probability of clouds in the sky given rain.)

EXAMPLE [A classic Bayesian puzzle] This puzzle appears in books (and television game shows)
in different guises.
You are in a room with three closed doors. One of the doors leads to good stuff, while the other
two are bad. You have to pick a door and go to your fate. You pick one, but just as you are about
to open it, someone else in the room opens another door and shows you that it is bad. You are now
offered the choice of switching to the remaining door. Should you?
Let’s say prob(a) is the probability that the door you picked is good, and similarly prob(b),
prob(c) for the other doors. And let’s say prob(B) is the probability that the second door is bad and
gets opened. We want
prob(a | B) = prob(B | a) prob(a) / prob(B)     (1.9)

where

prob(B) = prob(B | a) prob(a) + prob(B | b) prob(b) + prob(B | c) prob(c).     (1.10)

We set

prob(a) = prob(b) = prob(c) = 1/3     (1.11)

since we have no other information. We also have prob(B|b) = 0 because we have defined the states
b and B as mutually exclusive. That leaves prob(B|a) and prob(B|c).

1 Equation (1.8) isn’t much like what Bayes actually wrote, but Bayes’ theorem is what it’s called.

Now a subtlety arises. If the other person opened a door at random then prob(B | a) = prob(B | c),
which gives prob(a | B) = 1/2. But when this problem is posed as a puzzle, it is always clear that
the other person knows which doors are bad, and is opening a bad door you have not chosen just
to tease you. This gives prob(B | a) = 1/2 (they could have opened B or C) and prob(B | c) = 1 (they
had to open B because c is good). Hence prob(a | B) = 1/3, and it is favourable to switch. ⊓
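The subtlety about how the door gets opened is easy to check by simulation. The Python sketch below plays the game many times under both assumptions: a host who knowingly opens a bad unchosen door, and a host who opens an unchosen door at random (runs where the good door is accidentally revealed are discarded).

    import random

    def play(host_knows):
        good = random.randrange(3)
        pick = random.randrange(3)
        others = [d for d in range(3) if d != pick]
        if host_knows:
            opened = random.choice([d for d in others if d != good])
        else:
            opened = random.choice(others)
            if opened == good:
                return None                  # host accidentally revealed the prize
        switch = next(d for d in others if d != opened)
        return pick == good, switch == good

    for host_knows in (True, False):
        runs = [r for r in (play(host_knows) for _ in range(100_000)) if r]
        stay = sum(s for s, _ in runs) / len(runs)
        swap = sum(w for _, w in runs) / len(runs)
        print(host_knows, round(stay, 3), round(swap, 3))
    # knowing host: stay ~ 1/3, switch ~ 2/3;  random host: both ~ 1/2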

In taking the sum and product rules (1.1) as fundamental, we have left unspecified what
a probability actually is. In orthodox statistics, a probability is always a fraction of cases

prob(A) = (Number of cases where A holds) / (Number of possible cases)     (1.12)

or a limit of such fractions, where the possible cases are all ‘equally likely’ because of some
symmetry. The sum and product rules readily follow. But for most of the applications in
this book a definition like (1.12)—the so-called “Frequentist” definition—is too restrictive.
More generally, we can think of probabilities as a formalization of our intuitive notions
of plausibility or degree-of-belief, which may or may not be frequency ratios, but which
always obey sum and product rules. This interpretation, together with principles for
assigning values to those probabilities which are not frequency ratios, dates back to Bayes
and Laplace. In more recent times it was developed by Jeffreys,1 and it is good enough
for this book.
Still, the latter definition leaves us wondering whether there might be other formal-
izations of our intuition that don’t always follow the sum and product rules. To address
this question we need an axiomatic development of probability that will derive the sum
and product rules instead of postulating them, and then we can compare the axioms with
our intuition. There are several such axiomatic developments: Cox2 relates his formu-
lation to our intuition for plausible reasoning, de Finetti3 relates his development to our
intuition for risks and gambling, while Kolmogorov’s axioms relate to measure theory.
But all three allow probabilities that are not frequency ratios but which nevertheless obey
the sum and product rules.
Whichever general definition one follows,4 the important consequence for data anal-
ysis is that non-Frequentist probabilities open up a large area of applications involving
Bayes’ theorem. Hence the generic name “Bayesian” for probability theory beyond the
Frequentist regime.

1 H.S. Jeffreys, Theory of Probability (Oxford University Press 1961—first edition 1937).
2 R.T. Cox, The Algebra of Probable Inference (Johns Hopkins Press 1961).
3 B. de Finetti, Theory of Probability (Wiley 1974). Translated by A. Machì & A. Smith from Teoria della
Probabilità (Giulio Einaudi editore s.p.a. 1970).
4 Mathematicians tend to follow Kolmogorov, Bayesian statisticians seem to prefer de Finetti, while physical
scientists are most influenced by Cox.

DIGRESSION [Probability as state of knowledge] Cox’s development of probability as a measure of


our state of knowledge is particularly interesting from the data-analysis point of view. Here is a
sketch of it, without proofs of two results now called Cox’s theorems.1
Imagine a sort of proto-probability p(A) which measures how confident we are (on the basis of
available information) that A is true; p(A) = 0 corresponds to certainty that A is false and p(A) = 1
to certainty that A is true. As yet we have no other quantitative rules; we want to make up some
reasonable rules for combining proto-probabilities.
If we have values for the proto-probabilities of A and of B-given-A, we desire a formula that
will tell us the proto-probability of A-and-B. Let us denote the desired formula by the function F,
that is to say
p(AB) = F(p(B | A), p(A)). (1.13)

We require F to be such that if there is more than one way of computing a proto-probability they
should give the same result. So if we have p(A), p(B | A), and p(C | AB), we could combine the
first two to get p(AB) and then combine with the third to get p(ABC), or we could combine the last
two to get p(BC | A) and then combine with the first to get p(ABC), and both ways should give the
same answer. In other words we require

F(F(x, y), z) = F(x, F(y, z)). (1.14)

If we further require F(x, y) to be non-decreasing with respect to both arguments, the general
solution for F [though we will not prove it here] satisfies

w(F(x, y)) = w(x)w(y). (1.15)

where w is an arbitrary monotonic function satisfying w(0) = 0 and w(1) = 1. Let us define

q(A) = w(p(A)). (1.16)

Now, q(A) has all the properties of a proto-probability, so we are free to redefine proto-probability
as q(A) rather than p(A). If we do this, (1.15) gives

q(AB) = q(A)q(B | A) (1.17)

which is the product rule.



If we have a value for q(A) we also desire a formula that will tell us q(Ā). Let S be that formula,
i.e.,

q(Ā) = S(q(A)).     (1.18)

Then S(S(x)) must be x. The general solution for S [though again we will not prove it here] is

S(x) = (1 − x^m)^(1/m)     (1.19)

1The following is actually based on Chapter 2 of Jaynes; Cox’s own notation is different. Jaynes compares
with Kolmogorov and de Finetti in his Appendix A.

where m is a positive constant. Hence we have

q^m(A) + q^m(Ā) = 1,     (1.20)

which is to say q^m(A) satisfies the sum rule. But from (1.17), q^m(A) also satisfies the product rule.
So we change definition again, and take q^m(A) as the proto-probability.
The argument sketched here shows that, without loss of generality, we can make proto-
probabilities satisfy the sum and product rules. At which point we may rename them as simply
probabilities. ⊓

We will now develop some formalism, based on the probability rules, for the four topics
or problem-types mentioned at the start of this chapter. The rest of this book will be about
developing applications of the formalism. (Though occasionally we will digress, and
sometimes some aspect will develop into a topic by itself.) The details of the applications
will sometimes be quite complicated, but the basic ideas are all simple.
First we consider parameter fitting. Say we have a model M with some parameters
ω, and we want to fit it to some data D. We assume we know enough about the model
that, given a parameter value we can calculate the probability of any data set. That is to
say, we know prob(D | ω, M); it is known as the “likelihood”. Using Bayes’ theorem we
can write
prob(ω | D, M) = prob(D | ω, M) prob(ω | M) / prob(D | M).     (1.21)
The left hand side, or the probability distribution of parameters given the data, is what
we are after. On the right, we have the likelihood and then the strange term prob(ω | M),
which is the probability distribution of the parameters without considering any data. The
denominator, by the marginalization rule, is
prob(D | M) = Σ_ω′ prob(D | ω′, M) prob(ω′ | M),     (1.22)

and clearly just normalizes the right hand side. The formula (1.21) can now be interpreted
as relating the probability distributions of ω before and after taking the data, via the
likelihood. Hence the usual names “prior” for prob(ω | M) and “posterior” for prob(ω |
D, M).1 In applications, although we will always have the full posterior probability
distribution, we will rarely display it in full. Usually it is enough to give the posterior’s
peak or mean or median (for parameter estimates) and its spread (for uncertainties); and
since none of these depend on the normalization of the posterior, we can usually discard
the denominator in (1.21) and leave the formula as a proportionality.
In equation (1.22), and elsewhere in this chapter, we are assuming that the param-
eters ω take discrete values; but this is just to simplify notation. In practice both data

1 The presence of two different probability distributions for ω, with different conditions, reminds us that all
probabilities in this book are conditional.

and parameters can be continuous rather than discrete, and if necessary we can replace
probabilities by probability densities and sums by integrals.
Probability theory tells how to combine probabilities in various ways, but it does
not tell us how to assign prior probabilities. For that, we may use symmetry arguments,
physical arguments, and occasionally just try to express intuition as numbers. If inferences
are dominated by data, the prior will not make much difference anyway. So for the
moment, let us keep in mind the simplest case: say ω can take only a finite number of
values and we don’t have any advance knowledge preferring any value, hence we assign
equal prior probability to all allowed values (as we have already done in equation 1.11 in
the three-doors puzzle); this is called the principle of indifference.1
If ω has a continuous domain, assigning priors gets a little tricky. We can assign a
constant prior probability density for prob(ω), but then changing variable from ω to (say)
ln ω will make the prior non-constant. We have to decide what continuous parameter-
variable is most appropriate or natural. In this book we will only distinguish between
two cases. A ‘location parameter’ just sets a location, which may be positive or negative
and largeness or smallness of the value has no particular significance; location parameters
usually take a constant prior density. In contrast, a ‘scale parameter’ sets a scale, has fixed
sign, and the largeness or smallness of the value does matter. Now, the logarithm of a scale
parameter is a location parameter. Thus a scale parameter ω takes prior density ∝ 1/ω.
This is known as a Jeffreys prior. (Of course, this argument is only a justification, not a
derivation. There are several possible derivations, and we will come to one later, at the
end of the next chapter.) Discrete parameters may also take Jeffreys priors; for example,
if ω is allowed to be any integer from 1 to ∞.
Often, in addition to the parameters ω that we are interested in, there are parameters
µ whose values we don’t care about. A simple example is an uninteresting normalization
present in many problems; we will come across more subtle examples too. The thing to
do with so-called nuisance parameters is to marginalize them out:
prob(D | ω, M) = Σ_µ prob(D | µ, ω, M) prob(µ | ω, M).     (1.23)

Note that in the marginalization rule (1.7) the thing being marginalized is always to the
left of the condition in a conditional probability. If what we want to marginalize is to the
right, as is µ in prob(D | µ, ω, M) in (1.23) we need to move it to the left by multiplying by
a prior—in this case prob(µ | ω, M)—and invoking the product rule.
We now move on to model comparison. Say we have two models M1 and M2 , with
parameters ω1 and ω2 respectively, for the same data. To find out which model the data
favour we first compute
prob(D | M1) = Σ_ω1 prob(D | ω1, M1) prob(ω1 | M1)     (1.24)

1 No, really. The name comes from Keynes, though the idea is much older.

and similarly prob(D | M2), and then using Bayes' theorem we have

prob(M1 | D) / prob(M2 | D) = [Σ_ω1 prob(D | ω1, M1) prob(ω1 | M1)] / [Σ_ω2 prob(D | ω2, M2) prob(ω2 | M2)] × prob(M1) / prob(M2)     (1.25)
since prob(D) cancels. Here prob(M1 ) and prob(M2 ) are priors on the models, and the
ratio of posteriors on the left hand side is called the “evidence”, “odds ratio”, or the
“Bookmakers’ odds” for M1 versus M2 .
Note that model comparison makes inferences about models with all parameters
marginalized out. In two respects it contrasts parameter fitting. First, when parameter
fitting we can usually leave the prior and likelihood unnormalized, but when comparing
models all probabilities need to be normalized. Second, when parameter fitting we are
disappointed if the likelihood prob(D | ω, M) is a very broad distribution because it makes
the parameters very uncertain, but in model comparison a broad likelihood distribution
actually favours a model because it contributes more in the sum (1.24).
The third topic is goodness of fit. Model comparison tells us the best available model
for a given data set, and parameter fitting tells us the best parameter values. But they
do not guarantee that any of the available models is actually correct. (Imagine fitting
data from an odd polynomial to different even polynomials. We would get a best-fit, but
it would be meaningless.) So having found the best model and parameters, unless we
know for other reasons that one of the models is correct, we still have to pose a question
like “could these data plausibly have come from that model?” Testing the goodness of
fit provides a way of addressing this question, though not a very elegant way. To do this
we must choose a statistic or function (say ψ) which measures the goodness of fit, better
fits giving lower ψ. A good example is the reciprocal of the likelihood: 1/ prob(D | ω, M).
Then we compute the probability that a random data set (from the fixed model and
parameter values being tested) would fit less well than the actual data:

prob(ψ > ψD ), (1.26)

ψD being what the actual data give. The probability (1.26) is called the p-value of the data
for the statistic ψ. If the p-value is very small the fit is clearly anomalously bad and the
model must be rejected.
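As a concrete sketch of this recipe, suppose the model being tested is a Poisson distribution with a given mean, and take ψ to be minus the log-likelihood (a monotonic stand-in for 1/likelihood). The counts and the fitted mean below are invented purely for illustration.

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(0)

    def psi(counts, m):
        # minus log-likelihood: larger means a worse fit
        return -poisson.logpmf(counts, m).sum()

    data = np.array([3, 7, 2, 9, 4, 8, 1, 6])      # made-up counts
    m = data.mean()                                 # model mean, held fixed for the test

    psi_data = psi(data, m)
    psi_fake = np.array([psi(rng.poisson(m, size=data.size), m)
                         for _ in range(10_000)])
    print(np.mean(psi_fake > psi_data))             # the p-value of equation (1.26)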
A goodness-of-fit test needs a good choice of statistic to work well. If we choose
a ψ that is not very sensitive to deviations of model from data, the test will also be
insensitive. But there is no definite method for choosing a statistic. The choice is an ad
hoc element, which makes goodness of fit a less elegant topic than parameter fitting and
model comparison.
There is another important contrast between parameter fitting and model comparison
on the one hand and goodness-of-fit testing on the other hand. The first two hold the data

fixed and consider probabilities over varying models and parameters, whereas the third
holds the model and parameters fixed and considers probabilities over varying data sets.
Hence the first two need priors for models and parameters, whereas the third does not.
Now, priors are a very Bayesian idea, outside the Frequentist regime. Thus goodness-of-fit
testing is part of Frequentist theory, whereas parameter fitting and model comparison (as
developed here) are not expressible in Frequentist terms.
There is parameter estimation in Frequentist theory of course, but it works differently;
it is based on “estimators”. An estimator is a formula ̟(D, M) that takes a model and a
data set and generates an approximation to the parameters. The innards of ̟(D, M) are
up to the insight and ingenuity of its inventor. An example of an estimator is “ω such that
prob(D | ω, M) is maximized”; it is called the maximum likelihood estimator. A desirable
property of estimators is being “unbiased”, which means that the estimator averaged over
possible data sets equals the actual parameter value, i.e.,

[Σ_D ̟(D, M) prob(D | ω, M)] / [Σ_D prob(D | ω, M)] = ω.     (1.27)

Not all estimators in common use are unbiased; for example, maximum likelihood estimators are in
general biased. Meanwhile, Frequentist model comparison sometimes compares the max-
imized values of the likelihoods for different models; sometimes it involves a goodness-
of-fit statistic. Thus, Frequentist parameter fitting and model comparison both rely much
more on ad hoc choices and much less on general principles than the Bayesian methods.
This book will refer only occasionally to estimators.
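A standard illustration of bias: the maximum-likelihood estimator of a Gaussian variance divides by N rather than N − 1, so averaged over data sets it comes out low by a factor (N − 1)/N. A few lines of Python (with an arbitrary true variance and sample size) confirm this.

    import numpy as np

    rng = np.random.default_rng(1)
    true_var, N = 4.0, 5                      # arbitrary choices for the demonstration

    ml_estimates = []
    for _ in range(50_000):
        x = rng.normal(0.0, np.sqrt(true_var), size=N)
        ml_estimates.append(np.mean((x - x.mean())**2))   # ML estimate divides by N

    print(np.mean(ml_estimates))              # close to (N-1)/N * true_var = 3.2
    print((N - 1) / N * true_var)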
The fourth and last topic is maximum entropy. In our discussion of the first three
types of problem, assigning priors played a fairly minor role. We treated priors as just
a way of making some tacit assumptions explicit, on the understanding that with good
data any sensible prior would lead to the same inferences. This approach is fine for many
problems, including most we will discuss in this book. For some problems, however,
assigning probabilities is not a peripheral concern, it is the main thing. Such problems
involve data that are incomplete in some severe way, though they may be highly accurate.
Here is an artificial, but representative, example: suppose we have a biased six-sided die,
and data tell us the average number of dots is not 3.5 but 4.5; with no further data can
we make any predictions for the probability of 1 dot? Probability theory by itself does
not tell us how to proceed; we need to add something. That something is the principle of
maximum entropy.
The concept of entropy comes originally from physics but it has a more fundamental
meaning in information theory as a measure of the uncertainty in a probability distribu-
tion. Now, a probability distribution p1 , . . . , pN certainly implies uncertainty and lack of
knowledge, but it is not obvious that the uncertainty can be usefully quantified by a single
number. However, a fundamental theorem in information theory (Shannon’s theorem)
shows that if we suppose that there is a measure S(p1 , . . . , pN ) of the uncertainty and

moreover S satisfies certain desirable conditions, then S must be

S = − Σ_{i=1}^{N} p_i log p_i.     (1.28)

Equation (1.28) is the information-theoretic entropy. We will discuss its derivation and its
relation to physics in detail in chapters 7 and 8, but for now we can think of S (with log2 )
as the number of bits needed to change a probability distribution to a certainty.
The principle of maximum entropy is that when probabilities need to be assigned,
they are assigned so as to maximize the entropy so far as data allow. Thus, in the case of
the biased-die problem, we would maximize S(p1, . . . , p6) subject to Σ_k k pk = 4.5. It is
not hard to see that if there are no constraints (apart from Σ_i pi = 1 of course) maximum
entropy reduces to indifference.
The above information-theory arguments and the limiting case of indifference all
motivate the maximum-entropy principle, but they do not require it. I should emphasize
that maximum entropy is something new being added to the methodology to fill a need.
It is the idea most associated with Jaynes, though the physicists Boltzmann and Gibbs
anticipated aspects of it.
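For the biased-die case the constrained maximization is easy to do numerically. The sketch below (assuming scipy is available) maximizes S(p1, . . . , p6) subject to Σ pk = 1 and Σ k pk = 4.5; the solution favours the high faces, and its first entry is the sought probability of 1 dot.

    import numpy as np
    from scipy.optimize import minimize

    faces = np.arange(1, 7)

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)            # guard against log(0)
        return np.sum(p * np.log(p))          # minimizing this maximizes S

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: (faces * p).sum() - 4.5},
    ]
    res = minimize(neg_entropy, x0=np.full(6, 1/6),
                   bounds=[(0.0, 1.0)] * 6, constraints=constraints)
    print(res.x)     # p1 is the first entry; the pk rise smoothly towards the 6 face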

PROBLEM 1.2: A Bayesian arriving in an unfamiliar town sees two taxis, and their numbers are
65 and 1729. Assuming taxis in this town are numbered sequentially from 1 to N, what is the
probability distribution for N? [2]

For many years Bayesian methods were controversial and promoted by only a small
minority of enthusiasts. Look up Bayes’ Theorem in a good old-fashioned probability
and statistics book, and you may find disapproval1 or horror2 at the idea of using Bayes’
theorem for anything other than some artificial examples. On the other hand, look up
“Frequentist” in a Bayesian manifesto3 and you will find example upon example of
problems where Frequentist methods fail. But post 1990 or so, with Bayesian methods
better known and the contentious arguments mostly already written, authors tend to be
more relaxed. Thus a mainly Frequentist source will endorse some Bayesian ideas,4 and
a Bayesian book will recommend some Frequentist methods.5 Jaynes once expressed the

1 W. Feller, An introduction to probability theory and its applications (Wiley 1971).
2 J.V. Uspensky, Introduction to mathematical probability (McGraw-Hill 1937).
3 For example, T.J. Loredo, From Laplace to SN 1987A: Bayesian inference in astrophysics (1990, available in
bayes.wustl.edu) and above all Jaynes.
4 The data analysis parts of Numerical Recipes by W.H. Press, S.A. Teukolsky, W.T. Vetterling, & B.P. Flannery
(Cambridge University Press 1992) are a good example.
5 J.M. Bernardo & A.F.M. Smith, Bayesian Theory (Wiley 1994) and A. Gelman, J.B. Carlin, H.S. Stern, & D.B.
Rubin, Bayesian Data Analysis (Chapman & Hall 1995) are two recent examples; Jeffreys (uncharacteristic of its
generation in this respect as in others) is another.

hope that in time one should no more need the label of “Bayesian” to use Bayes’ theorem
than one needs to be a “Fourierist” to use Fourier transforms. Writing in 2002, perhaps
that time has arrived.

PROBLEM 1.3: This problem is a long digression on yet another aspect of probability. However,
the actual calculation you have to do is very simple, and when you have done it you can feel justly
proud of having deduced Bell’s theorem, a profound statement about the physical world.
At the sub-molecular level, nature becomes probabilistic in an important and even disturbing
way. The theory of quantum mechanics (which predicts experimental results in this regime with
incredible accuracy) has probability as a fundamental part. But even some of the founders of
quantum mechanics, notably Einstein, were very uncomfortable about its probabilistic character,
and felt that quantum mechanics must be a simplification of an underlying deterministic reality.
Then in 1964 J.S. Bell concocted a thought experiment to demonstrate that if quantum me-
chanics is correct, then an underlying deterministic theory won’t work. Real experiments based
on Bell, starting with those by A. Aspect and coworkers (c. 1980), show exactly what quantum
mechanics predicts, though some problems of interpretation remain. Bell’s thought experiment can
be appreciated without a knowledge of quantum mechanics, and in this problem we will follow a
version due to Mermin.1
There are three pieces of apparatus, a source in the middle and two detectors to the left and
right. Each detector has a switch with three settings (1, 2, and 3 say) and two lights (red and green).
The source has a button on it; when this button is pressed the source sends out two particles, one
towards each detector. Each particle is intercepted by the detector it was sent towards, and in
response the detector flashes one of its lights.
The experiment consists of repeatedly pressing the source button for random settings of the
detector switches and recording which light flashed on which detector and what the switch settings
were. Thus if at some button-press, the left-hand detector is set to ‘1’ and flashes green while the
right-hand detector is set to ‘3’ and flashes red, we record 1G3R. The observations look something
like 1G3R 1G1G 3G2G 3G1R 3R2G 3R3R 1R3R 1R3G 1R2G. . .
An important point is that, apart from the sending of particles from source to detector, there
is no communication between the three pieces of apparatus. In particular, the source never knows
what the detector settings are going to be (the settings can change while a particle is in transit).
Examining the observations after many runs, we find two things.2
(i) When both detectors happen to have the same switch settings they always flash the same
colour, with no preference for red or green. Thus, we see 1R1R and 1G1G equally often, but never
1R1G or 1G1R.

1 N.D. Mermin, Boojums all the way through (Cambridge University Press 1990).
2 The particles are spin-half particles (e.g., electrons), and the source ensures that pairs emitted together have
the same spin. The three settings on the detectors measure the spin component along one of three directions
at 120° to each other: green light flashes for spin component +1/2, red light for −1/2. When the detectors have
different switch settings, quantum mechanics predicts the colour coincidence rate to be cos²(½ × 120°) = 1/4.
This is what is observed.

(ii) Considering all the data (regardless of switch settings) the colours coincide, on average, half
the time.
Neither of (i) or (ii) is remarkable in itself, but taken together they are fatal for a deterministic
explanation. To see this, we consider what a deterministic explanation implies. To determine which
colour flashes, each particle would have to carry, in effect, an instruction set of the type "green light
if the setting is 1, red light otherwise" (say GRR). Since the source doesn't know what the detectors'
switch settings are going to be, both particles from any button-press must have the same instruction
set. Different button-presses can give different instruction sets. In particular, the above data could
have resulted from GRR GRR GGG RRG RGR GGR RGR RGG RGG. . .
Your job is to show that the average coincidence rate from instruction sets will be 5/9 or more,
not 1/2. [4]
2. Binomial and Poisson Distributions

The binomial distribution is basically the number of heads in repeated tosses of a (possibly
biased) coin. The Poisson distribution is a limiting case. We go into these two in some
detail, because they come up in many applications, and because they are a nice context to
illustrate the key ideas of parameter fitting and model comparison.
If some event (e.g., heads) has probability p, the probability that it happens n times
in N independent trials is

prob(n | N) = ^N C_n p^n q^(N−n),   q = 1 − p.     (2.1)

The n occurrences could be combined with the N − n non-occurrences in ^N C_n ways,
hence the combinatorial factor. The sum Σ_n prob(n | N) is just the binomial expansion of
(p + q)^N, so the distribution is correctly normalized, and hence the name.1
A generalization which comes up occasionally is the multinomial distribution. Here
we have K possible outcomes with probabilities pk. The probability that the k-th outcome
will happen nk times in N independent trials is

prob(n1, . . . , nK) = N! Π_{k=1}^{K} [pk^(nk) / nk!],   Σ_{k=1}^{K} pk = 1,   Σ_{k=1}^{K} nk = N.     (2.2)

For the combinatorial factor, take the total number of possible permutations (N!) and
reduce by the number of permutations among individual outcomes (nk!). The sum of
prob(n1, . . . , nK) over the nk is (Σ_k pk)^N so again the distribution is normalized.
Another variant is the negative binomial (also called Pascal) distribution. Here,
instead of fixing the number of trials N, we keep trying till we get a pre-specified number
n of events. The probability that we need N trials for n events is the probability of getting
n − 1 events in N − 1 trials and then an event at the N-th trial, so

prob(N | n) = ^(N−1) C_(n−1) p^n q^(N−n).     (2.3)

Now Σ_{N=n}^{∞} ^(N−1) C_(n−1) q^(N−n) is just the binomial expansion for (1 − q)^(−n), so (2.3) is
normalized.

ASIDE (A probably useless but certainly cute point.) If we let both prob(n) and prob(N) take Jeffreys
priors then prob(n | N) and prob(N | n) derive from each other via Bayes’ theorem. ⊓

1 Another term is ‘Bernoulli trials’ after Jakob Bernoulli, one of the pioneers of probability theory and author
of Ars Conjectandi (published 1713).


Figure 2.1: Posterior probability distribution after different numbers of simulated coin tosses; solid
curve for the uninformative prior, dashed curve for the informative prior. For neatness, the curves
are not normalized, just scaled to unit maximum.

EXAMPLE [A biased coin] This is a toy problem to illustrate how parameter fitting and model
comparison work.
Imagine a coin which is possibly biased, i.e., p ≡ prob(heads) possibly ≠ 1/2. We proceed to toss
it 4096 times to (a) examine if the coin is biased, and (b) find out what p is if it is biased. [This is an
imaginary coin, only my computer thinks it exists.]
We will consider two models: (i) M1, where p is a parameter to be fitted; and (ii) M2, where
p = 1/2 and there are no parameters. If there are n heads after N tosses, the likelihoods from the two
models are

prob(D | p, M1) = ^N C_n p^n (1 − p)^(N−n),
prob(D | M2) = ^N C_n 2^(−N).     (2.4)

What to do for the prior for p in model M1? To illustrate the possibilities, we try two priors: a
flat prior and a prior broadly peaked around p = 1/2. These would commonly be called uninformative
and informative, the latter being based on a premise that makers of biased coins probably wouldn't
try to pass off anything too obviously biased. We take

prob(p | M1) = [(2K + 1)! / (K!)²] p^K (1 − p)^K     (2.5)

with K = 0 (uninformative) and K = 10 (informative). The factorials serve to normalize—see
equation (M.3) on page 70.
The posterior for M1 is then

prob(p | D, M1) ∝ ^N C_n p^(n+K) (1 − p)^(N−n+K)     (2.6)

and is plotted in Figure 2.1 after different numbers of tosses. After 0 tosses, the posterior is of course
the same as the prior. We see that with enough data the posteriors with both priors tend to the same
curve.
Our fit for p really consists of the posterior distribution itself (for a given prior). For brevity,
we just might calculate the mean and standard deviation of the posterior and not bother to plot
the curve. But there is nothing special about the mean and standard deviation, one might prefer
to quote instead the median, along with the 5th and 95th percentile values as a “90% confidence
interval”.1
To compare M1 and M2 we need to marginalize out the parameter dependence in M1, that is,
we need to multiply the likelihood prob(D | p, M1) from (2.4) with the normalized prior prob(p | M1)
from (2.5) and integrate over p. For M2 there are no parameters to marginalize. We get for the
marginalized likelihood ratio:

prob(D | M1) / prob(D | M2) = [(2K + 1)! (n + K)! (N − n + K)! / ((K!)² (N + 2K + 1)!)] 2^N.     (2.7)

In general, if p = 1/2, the marginalized likelihood ratio will start at unity and slowly fall; if p ≠ 1/2,
it will start at unity and hover or fall a little for a while, but will eventually rise exponentially. The
Bookmakers' odds ratio is the ratio of posterior probabilities of the two models:

prob(M1 | D) / prob(M2 | D) = [prob(D | M1) / prob(D | M2)] × [prob(M1) / prob(M2)].     (2.8)

Here prob(M1) and prob(M2) are priors on the models. Figure 2.2 shows the odds with model priors
set equal. One might prefer, say, prob(M1) = 10⁻⁶ prob(M2) on the grounds that most coins are

1 Bayesians tend to prefer “credible region”, to avoid possible confusion with the Frequentist meaning of
“confidence interval”, which has to do with the distribution of an estimator.

Figure 2.2: Bookmakers’ odds for the coin being biased, as the tosses progress. Again solid curve
for the uninformative prior and dashed curve for the informative prior, but they almost coincide.

unbiased, but as Figure 2.2 shows, even that would become pretty irrelevant when we are confronted
with enough data.
In more serious examples, we will have to do more numerical work, but the general idea will
remain much the same. Some things to take away from this example are the following.
(i) For parameter fitting, there’s no need to normalize probabilities; but for model comparison
normalization is essential—and remember that priors have to be normalized too.
(ii) Information on parameters comes from the moment we start to take data, but takes time to
stabilize.
(iii) Model comparison takes more data than parameter fitting—there’s a larger space of possibilities
to consider!
(iv) Priors are important when we have little data, but with lots of data they become pretty
irrelevant. ⊓
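The computations behind Figures 2.1 and 2.2 take only a few lines. The Python sketch below simulates a coin (the true p = 0.6 is an arbitrary choice), evaluates the posterior (2.6) on a grid of p for the flat prior, and computes the log of the marginalized likelihood ratio (2.7) for either prior via log-factorials.

    import numpy as np
    from scipy.special import gammaln

    rng = np.random.default_rng(2)
    p_true, N_total = 0.6, 4096                 # invented 'true' bias of the fake coin
    tosses = rng.random(N_total) < p_true

    def log_ratio(n, N, K):
        # log of equation (2.7): marginalized likelihood ratio for M1 versus M2
        return (gammaln(2*K + 2) - 2*gammaln(K + 1)
                + gammaln(n + K + 1) + gammaln(N - n + K + 1)
                - gammaln(N + 2*K + 2) + N*np.log(2.0))

    p = np.linspace(0.001, 0.999, 999)
    for N in (16, 256, 4096):
        n = int(tosses[:N].sum())
        log_post = n*np.log(p) + (N - n)*np.log(1 - p)   # flat-prior posterior (K = 0)
        print(N, n, round(p[np.argmax(log_post)], 3),
              round(log_ratio(n, N, K=0), 2), round(log_ratio(n, N, K=10), 2))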

The Poisson distribution is the limit of a binomial distribution as p → 0 but N → ∞ such


that Np remains finite. Writing m for Np, we get for large N:
prob(n | m) = [N! / (n!(N − n)!)] × (m/N)^n × (1 − m/N)^(N−n)
            ≃ (N^n / n!) × (m/N)^n × (1 − m/N)^N     (2.9)

and hence (using the formula M.1 on page 70 for e)

prob(n | m) → e^(−m) m^n / n!.     (2.10)
The Poisson distribution is clearly normalized since Σ_n prob(n | m) = 1. The number of
trials N is replaced by a waiting time—m is proportional to that waiting time. A common
example in astronomy is the number of photons received from some source for given
exposure time. In this example one bothers with (2.10) only if the source is faint enough
that m is quite small—otherwise it goes over to another limiting form we’ll come to in the
next chapter. We’ll call m the mean (and justify that name shortly).
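A quick numerical check of this limit (with m and n chosen arbitrarily): the binomial probability with p = m/N approaches the Poisson probability as N grows.

    from scipy.stats import binom, poisson

    m, n = 3.0, 5                 # arbitrary mean and count
    for N in (10, 100, 10_000):
        print(N, binom.pmf(n, N, m / N), poisson.pmf(n, m))
    # the binomial values converge to the Poisson value as N increases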

PROBLEM 2.1: Show that, for a Poisson process, the probability distribution of the waiting time τ
needed for n + 1 events is
prob(τ | n + 1, m) = [m^(n+1) τ^n / n!] e^(−mτ),
given that m is the mean for τ = 1. [2]

EXAMPLE [Two Poisson processes] In this example we work out Bookmakers’ odds for whether
two Poisson processes have different means, given data on numbers of events. One process has
produced m1 , . . . , mK events over K different periods, and the second has produced n1 , . . . , nL
events over L different periods. All the waiting periods (for both processes) are equal.
We take two models: in M1, both processes have mean a; in M2, the means are a and b.
Writing M = Σ_{k=1}^{K} mk and N = Σ_{l=1}^{L} nl, the likelihoods are (using 2.10)

prob(D | a, b, M2) ∝ e^(−aK) e^(−bL) a^M b^N,
prob(D | a, M1) ∝ e^(−a(K+L)) a^(M+N).     (2.11)

(I’ve suppressed a product of mk ! and nl ! factors in the likelihoods, since they would always cancel
in this problem.) Notice how the likelihoods have simplified to depend on the data effectively only
through M and N.
Now we have to marginalize the likelihoods over the means a and b. So we need a normalized
prior for these. Since a and b set scales they should each take a Jeffreys prior. But the simple 1/a
Jeffreys prior is not normalized and thus cannot be used for model comparison, though it is fine for
parameter fitting. We need to modify the prior in some way, so that we can normalize it.
The way out is not very elegant, but it works. We know that a is not going to be orders of
magnitude different from the mk (and similarly for b), and just by looking at the data, we can name
some cmin and cmax , between which we are very confident a and b must lie. We then let

prob(a) = (aΛ)^(−1),   cmin ≤ a ≤ cmax,   where Λ = ln(cmax/cmin),     (2.12)

and zero outside [cmin, cmax], and similarly for prob(b). These priors are normalized. Of course
they depend on our (arbitrary) choice of cmin and cmax, but only logarithmically, and since data-
dependent terms tend to be factorial or exponential, we put up with this arbitrariness.
An alternative (sometimes called the Γ-prior) is

prob(a) = (Γ(ε))^(−1) δ^ε a^(ε−1) e^(−δa)     (2.13)



with ε and δ being small numbers to be chosen by us. The Gamma-function formula (M.2) from
page 70 shows this prior is normalized. It is similar to (2.12) except that the cutoffs are nice and
smooth. The values of ε and δ determine in which regions the cutoffs appear, and we would want
the cutoffs to be around cmin and cmax.
Let us say we adopt the prior (2.12). We should then really calculate
prob(D | M1) = ∫_{cmin}^{cmax} prob(D | a, M1) (aΛ)^(−1) da     (2.14)

and so on. But the integrands will become very small outside [cmin, cmax] and we can comfortably
use the approximations

prob(D | M1) ≃ ∫_0^∞ prob(D | a, M1) (aΛ)^(−1) da     (2.15)

and so on. Using the Gamma-function formula (M.2) again we get

prob(D | M2) / prob(D | M1) = (1/Λ) (K + L)^(M+N) (M − 1)! (N − 1)! / (K^M L^N (M + N − 1)!).     (2.16)
When M/K is not too different from N/L, these odds are close to 1/Λ, and decrease slowly as M
and N increase. But if M/K and N/L are very different, the odds favouring two different means
become spectacularly large. ⊓
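To get a feel for the numbers, here is a short Python evaluation of (2.16); the event counts and the cutoffs cmin, cmax (and hence Λ) are invented for illustration.

    import numpy as np
    from scipy.special import gammaln

    m_counts = [12, 15, 9, 14]          # first process: K = 4 periods
    n_counts = [22, 19, 25]             # second process: L = 3 periods
    K, L = len(m_counts), len(n_counts)
    M, N = sum(m_counts), sum(n_counts)
    Lam = np.log(100.0 / 1.0)           # Λ for an assumed cmin = 1, cmax = 100

    log_odds = (-np.log(Lam) + (M + N)*np.log(K + L)
                + gammaln(M) + gammaln(N)             # ln (M-1)! and ln (N-1)!
                - M*np.log(K) - N*np.log(L) - gammaln(M + N))
    print(np.exp(log_odds))             # odds favouring two different means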

PROBLEM 2.2: The following suggestion, applied to the previous example, would avoid having the
odds depend on the cutoff. For M1 use the prior (2.12) on a; for M2, let a = uc and b = (1 − u)c,
and then use a flat prior in [0, 1] for u and the prior (2.12) for c.
The odds can be worked out without a computer only for K = L. For this case show that the
odds for different means are

2^(M+N) M! N! / (M + N + 1)!.
This is similar to what (2.16) would give for K = L, but of course not identical, and the difference
conveys some feeling for how much the results depend on the choice of prior. [3]

One thing we will often have to do later in this book is take expectation values. The
expectation value of a function f of a probabilistic number x is
⟨f(x)⟩ ≡ Σ_i f(xi) prob(xi),     (2.17)

or an integral if appropriate. We can think of an expectation value as an average over the
probability distribution, weighted by the probability; or we can think of it as marginalizing
out the x dependence of f(x). Another notation is E(f(x)). One minor caution: if there
are several variables involved, we must not confuse expectation values over different
variables.
If we have some prob(x), then ⟨x^n⟩ is called the n-th moment of x. The first two
moments are particularly important: ⟨x⟩ is called the mean, ⟨x²⟩ − ⟨x⟩² is called the
variance, and the square root of the variance is called the standard deviation or dispersion.

PROBLEM 2.3: Show that a binomial distribution has a mean of Np and a variance of Npq, and
that a Poisson distribution has both mean and variance equal to m. [2]

PROBLEM 2.4: Two proofreaders independently read a book. The first finds n1 typos, the second
finds n2 typos, including n12 typos found by both. What is the expected number of typos not found
by either? [1]

DIGRESSION [Jeffreys Prior, again] The idea of expectation values gives us a nice way to derive the
Jeffreys prior, which we motivated on page 9 but have not yet properly derived.
Imagine that you and a sibling receive gift tokens in outwardly identical envelopes. The gift
tokens have values x and x/N, where N is known but x is unknown. You open your envelope and
find you have (in some units) 1. What is your expectation for your sibling’s value?
After opening your envelope, you know that x must be either 1 or N, leaving your sibling with
either 1/N or N. (For definiteness we suppose N > 1.) Let us first calculate the probability that you
have the larger gift token. To do this, we use the notation prob(1) for the probability of x = 1, and
prob(1∗ ) for the probability of the “data” (i.e., of you opening your envelope and finding a value of
1) and so on. The posterior probability that your gift token is larger is

prob(1 | 1∗) = prob(1∗ | 1) prob(1) / [prob(1∗ | 1) prob(1) + prob(1∗ | N) prob(N)].     (2.18)

Assuming the envelopes are assigned at random, prob(1∗ | 1) = prob(1∗ | N), and hence

prob(1 | 1∗) = prob(1) / [prob(1) + prob(N)].     (2.19)

From this we compute your expectation for your sibling’s token as

[(1/N) prob(1) + N prob(N)] / [prob(1) + prob(N)].     (2.20)

Now (2.20) had better be 1. Otherwise you would be concluding that your sibling was better or
worse off, regardless of data (which, however well it may reflect the human condition, is bad data
analysis). Equating (2.20) to 1 gives

prob(N) = (1/N) prob(1), (2.21)

which is the Jeffreys prior.


Curiously, in order to make the expectation value (2.20) equal 1, the probability (2.19) has to be
> 1/2. In other words, with the Jeffreys prior you are in one sense neutral as to whether your sibling
is better off, while in another sense concluding that you yourself are probably better off. ⊓

3. Gaussian Distributions
A Gaussian (or normal) distribution is given by
prob(x) = [1/(√(2π) σ)] exp[−(x − m)²/(2σ²)].     (3.1)
As we can easily verify (using the Gaussian integrals M.9 and M.10 from page 71) prob(x)
is normalized and has mean m and dispersion σ. A Gaussian’s tails fall off quickly; 68%
of the probability lies within m ± σ, 95% lies within m ± 2σ, and 99.7% lies within m ± 3σ.
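These percentages are quick to confirm with the Gaussian cumulative distribution function (here via scipy):

    from scipy.stats import norm

    for k in (1, 2, 3):
        print(k, norm.cdf(k) - norm.cdf(-k))   # 0.683, 0.954, 0.997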

Gaussians are a generic limiting form for the probability of a sum where the terms in
the sum are probabilistic. The detailed distributions of the terms tend to get washed out
in the sum, leaving Gaussians.
To see why this detail-washing-out happens, we note first that when we have sev-
eral independent trials from some prob(x), the probability distribution of the sum is a
convolution. Thus
prob(sum of 2 trials = x) = ∫ prob(y) prob(x − y) dy,     (3.2)

which is just a consequence of the marginalization rule, and so on for several trials. In
other words, the probability of a sum is the convolution of the probabilities. Convolutions
are conveniently manipulated using Fourier transforms; if we define
cf(k) = ∫_{−∞}^{∞} e^(ikx) prob(x) dx     (3.3)

and use the convolution theorem (M.16) from page 72 we have


prob(sum of N trials = x) = (1/2π) ∫_{−∞}^{∞} e^(−ikx) cf^N(k) dk.     (3.4)

The Fourier transform cf(k) of a probability distribution function prob(x) is called the
characteristic function, or moment generating function. The latter name comes because

cf(k) = ⟨e^(ikx)⟩ = 1 + ik⟨x⟩ − (k²/2!)⟨x²⟩ + · · ·     (3.5)

and the n-th moment of x (if it exists) is i^(−n) times the n-th derivative of cf(0).1 Since
prob(x) is normalized, cf(k) always exists, but derivatives need not. A counterexample is
the Lorentzian (or Cauchy) distribution, for which

prob(x) = 1 / [π(1 + x²)],
cf(k) = e^(−|k|).     (3.6)

1 In statistics the moment generating function is usually defined as ⟨exp(kx)⟩ rather than ⟨exp(ikx)⟩ as here.


Here prob(x) doesn't have a mean or higher moments, and correspondingly, cf(k) is not
smooth at k = 0 and doesn't generate any moments.
Suppose, though, that we have some prob(x) for which ⟨x⟩ and ⟨x²⟩ do exist. We can
assume without losing generality that the mean is 0 and the variance is 1. Then

cf(k) = 1 − ½k² + k²θ(k),     (3.7)

where the remainder term θ(k) → 0 as k → 0. For large-enough N

cf^N(k) ≃ e^(−Nk²/2)     (3.8)

and using the Fourier transform identity (M.17) from page 72 we get

prob(sum of N trials = x) ≃ [1/√(2πN)] e^(−x²/(2N)).     (3.9)

In other words, for N (many) independent trials, the probability distribution of the sum
tends to a Gaussian with mean and variance scaled up from the original by N; equivalently,
the mean of N trials tends to become Gaussian-distributed with the original mean and
with the variance scaled down by N. This result is known as the central limit theorem.
From the central limit theorem it follows immediately that for large N, a binomial
distribution tends to a Gaussian with mean Np and variance Npq, while a Poisson dis-
tribution tends to a Gaussian with equal mean and variance. But the small contributions
involved must themselves have finite means and dispersions, otherwise the theorem
doesn’t apply; in particular, a sum over trials from a Lorentzian doesn’t turn into a Gaus-
sian.
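Both halves of this statement are easy to see numerically. In the Python sketch below, sums of N uniform variates have the variance the central limit theorem predicts, while averages of N Lorentzian (Cauchy) variates are just as spread out as a single draw, no matter how large N is; the values of N and the number of trials are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    N, trials = 100, 20_000                     # arbitrary sizes

    uniform_sums = rng.uniform(-1, 1, size=(trials, N)).sum(axis=1)
    print(uniform_sums.var(), N / 3)            # variance of the sum is N * 1/3

    cauchy_means = rng.standard_cauchy(size=(trials, N)).mean(axis=1)
    single_draws = rng.standard_cauchy(size=trials)
    print(np.median(np.abs(cauchy_means)),      # ~1: averaging has not narrowed it
          np.median(np.abs(single_draws)))      # ~1 as well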

PROBLEM 3.1: For a probability distribution constant in [−1, 1] and zero elsewhere, calculate the
characteristic function and verify that it gives the correct variance. [1]

EXAMPLE [The central limit theorem on ice] This is a rather artificial example, but I thought it
would be interesting to see the central limit theorem do some rather surprising things.
Imagine a large frictionless ice rink, with many hockey pucks whizzing about on it. Each puck
has negligible size and mass m. The surface mass density of the puck distribution is Σ, and the
pucks have no preferred location. The pucks have speed distribution f(v) but the velocities have no
preferred direction.
We place on the ice, at rest, a disc of mass M and radius a (much larger than the pucks). We
then wait for some time τ, during which many pucks collide with the disc, pushing it around. We
want to calculate the probability distribution for how much the disc moves in that time.
In the following we will assume that most pucks are always moving much faster than the disc,
and also that most pucks move a distance ≫ a in time τ.
The central limit theorem becomes applicable because there are many collisions with the disc,
each contributing to the disc’s net displacement. Thus the probability distribution for the disc’s

position will be a Gaussian and the mean will clearly be zero. It is enough to calculate the variance,
or the mean square displacement (call it ∆2 ) in one direction from all collisions.
Consider a puck with speed v. It will hit the disc some time during the interval τ if (i) its initial
distance r from the disc is less than vτ, and (ii) it is aimed in the right direction. Since most pucks
will come from distances ≫ a, from distance r a fraction a/(πr) of pucks will be aimed at the disc.
If the puck hits the disc at a point θ on the disc circumference (relative to the x axis) and is travelling
in a direction ψ relative to the normal at that point, the change in the disc’s x-velocity will be

δv_x = −(2mv/M) cos θ cos ψ. (3.10)

If the puck was initially at distance r, the collision will happen at time r/v. The induced disc
displacement at time τ will be

δx = −(2m/M)(vτ − r) cos θ cos ψ. (3.11)

Averaging over θ and ψ we get


⟨δx²⟩ = (m²/M²)(vτ − r)². (3.12)

Now we integrate hδx2 i over the puck distribution. Remembering the puck density and the
aiming factor, we have
∆² = (2aΣ/m) ∫ f(v) dv ∫_0^{vτ} ⟨δx²⟩ dr (3.13)

or
∆² = (2Σam/(3M²)) τ³ ⟨v³⟩. (3.14)

The probability distribution for the disc is

prob(x, y) = (2π)^{−1} ∆^{−2} e^{−(x²+y²)/(2∆²)}, (3.15)

or, in terms of the distance:


prob(r) = ∆^{−2} r e^{−r²/(2∆²)}. (3.16)

We had to assume that the third moment of f(v) exists, but the other details of f(v) got washed
out. ⊓

PROBLEM 3.2: A drunk has been leaning against a lamppost in a large city square, and decides to
take a walk. Our friend then walks very carefully, taking equal steps, but turns by a random angle
after each step.1 How far do they get after N steps? [2]

1 For a truly charming discussion of this and many many other problems, see G. Gamow, One two
three. . . infinity (Dover Publications 1988—first edition 1947).

EXAMPLE [Random walking share prices] Here is an application from finance.


Suppose that at time t remaining till some fiducial date (we will use a decreasing time variable
in this example) we own some stock with share price s(t). The share price may rise or fall, and we
want to reduce our exposure to possible future loss by trading away some possible future profit. In
order to do this we sell a ‘call option’ on part of our stock. A call option on a share is the option to
buy that share at t = 0 , not at s(0) but at a predetermined ‘strike price’ s0 . At t = 0 the option is
worth

f(s, 0) = s − s0 if s > s0, and 0 otherwise, (3.17)
because if s > s0 at t = 0 the holder of the option will buy the stock at s0 and make a gain of s − s0 ,
but if s < s0 at t = 0 then the option is worth nothing. The problem is to find the current value
f(s, t) of the option.
Suppose now that we sell such an option, not on all the stock we own but on a fraction
(∂f/∂s)−1 of it. Our net investment per share is then
s − (∂f/∂s)^{−1} f (3.18)

and it has the nice property that changing s by ∆s and f by ∆f (at fixed t) produces no net change
in the investment. Accordingly, the combination (3.18) of stock owned and option sold is called a
‘neutral hedge equity’. With time, however, the hedge equity is expected to grow at the standard
risk-free interest rate r, i.e.,
∆s − (∂f/∂s)^{−1} ∆f = [s − (∂f/∂s)^{−1} f] r (−∆t). (3.19)

(Here we have −∆t because t is decreasing.) We Taylor-expand ∆f as

∆f = f(s + ∆s, t + ∆t) − f(s, t) = (∂f/∂s)∆s + ½(∂²f/∂s²)(∆s)² + (∂f/∂t)∆t (3.20)
plus higher-order terms which we will neglect. The fractional share price is assumed to random
walk, thus
(∆s)2 = −σ2 s2 ∆t, (3.21)

and the constant σ is called the ‘volatility’. Substituting for (∆s)2 in the Taylor expansion, and then
inserting ∆f in equation (3.19) and rearranging, we get

∂f/∂t = ½σ²s²(∂²f/∂s²) + rs(∂f/∂s) − rf. (3.22)
This is a partial differential equation for f(s, t), subject to the boundary conditions (3.17). The
solution (derived in the following digression) is called the Black-Scholes formula:

f(s, t) = s N(d1 ) − e−rt s0 N(d2 ). (3.23)

Here
N(x) ≡ (1/√(2π)) ∫_{−x}^{∞} e^{−t²/2} dt (3.24)

is the cumulative of a Gaussian (or alternatively, the error function from page 28 written differently),
and the arguments d1 , d2 are

d1 = [ln(s/s0) + rt]/(σ√t) + ½σ√t,   d2 = [ln(s/s0) + rt]/(σ√t) − ½σ√t. (3.25)

The first term in the Black-Scholes formula (3.23) is interpreted as the expected benefit from acquiring
a stock outright, the second term is interpreted as the present value of paying the strike price on the
expiration day.
Black-Scholes is the starting point for much work in current financial models.1 ⊓
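The formula (3.23)–(3.25) is straightforward to evaluate numerically. Here is a short Python sketch (my own illustration; the particular values of s, s0, r, σ and t are invented), using the fact that the cumulative Gaussian N(x) of (3.24) can be written with the error function from page 28.

```python
# Evaluating the Black-Scholes formula (3.23)-(3.25); illustrative numbers only.
import math

def cum_gauss(x):
    # N(x) of equation (3.24): the integral of a unit Gaussian from -x to infinity,
    # written in terms of the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def call_value(s, s0, r, sigma, t):
    """Value f(s,t) of a call option with strike s0 and time t remaining."""
    if t <= 0:
        return max(s - s0, 0.0)          # boundary condition (3.17)
    d1 = (math.log(s / s0) + r * t) / (sigma * math.sqrt(t)) + 0.5 * sigma * math.sqrt(t)
    d2 = (math.log(s / s0) + r * t) / (sigma * math.sqrt(t)) - 0.5 * sigma * math.sqrt(t)
    return s * cum_gauss(d1) - math.exp(-r * t) * s0 * cum_gauss(d2)

# Example: share at 100, strike 105, 5% interest, 20% volatility, 1 year to go.
print("option value: %.2f" % call_value(100.0, 105.0, 0.05, 0.20, 1.0))
```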

DIGRESSION [Green’s function for Black-Scholes] The Black-Scholes differential equation (3.22)
is equivalent to a diffusion equation, a beast long familiar in physics and solvable by standard
techniques.
To simplify to a diffusion equation, we make a few changes of variable. Writing

f = e−rt g, x = ln(s/s0 ), (3.26)

that is, factoring out the interest-rate growth and putting the share price on a log scale, the differential
equation (3.22) becomes
∂g/∂t = ½σ²(∂²g/∂x²) + (r − ½σ²)(∂g/∂x). (3.27)
Now changing to moving coordinates

z = x + (r − ½σ²)t,   h(z, t) = g(x, t), (3.28)

we get
∂h/∂t = ½σ²(∂²h/∂z²). (3.29)
This is a diffusion equation, and its solution gives the effect of the boundary condition diffusing
back in time.
We can solve equation (3.29) using Fourier transforms and their properties (see page 72).
Writing H(k, t) for the Fourier transform of h(z, t) we have

dH/dt = −½k²σ²H, (3.30)

which immediately integrates to


H = e^{−½k²σ²t + const}.

If the integration constant is 0, we can inverse transform to get

ĥ(z, t) = (1/(σ√(2πt))) e^{−½z²/(σ²t)}. (3.31)

1 See e.g., www.nobel.se/economics/laureates/1997/press.html



Now ĥ(z, t) is not the general solution, but a particular solution for the boundary condition
ĥ(z, 0) = δ(z). (This is clear from the δ function limit, equation M.18 on page 72.) Such particular
solutions are known in mathematical physics as Green’s functions. When this Green’s function is
convolved with the given boundary condition, it gives us the appropriate solution:
h(z, t) = (1/(σ√(2πt))) ∫_0^{∞} e^{−½(z′−z)²/(σ²t)} h(z′, 0) dz′, (3.32)
h(z′, 0) = s0 (e^{z′} − 1).

To simplify the integral we change variable from z′ to v = (z′ − z)/(σ√t), which gives
h(z, t) = (s0/√(2π)) ∫_{−d2}^{∞} e^{−½v² + z + σ√t v} dv − (s0/√(2π)) ∫_{−d2}^{∞} e^{−½v²} dv. (3.33)

Now changing variable in the first integral to u = v − σ√t gives
h(z, t) = (s0/√(2π)) e^{z+½σ²t} ∫_{−d1}^{∞} e^{−½u²} du − (s0/√(2π)) ∫_{−d2}^{∞} e^{−½v²} dv
        = s0 e^{z+½σ²t} N(d1) − s0 N(d2). (3.34)
Finally, putting back the variables s and f in place of z and h (via equations 3.26 and 3.28) gives the
Black-Scholes formula (3.23). ⊓

The distribution of experimental random errors is typically Gaussian, and is usually


(safely) taken to be so without bothering about the detailed contributions that make it
up. Posterior distributions in different problems may also often be well-approximated by
Gaussians. People giving parameter estimates often quote 1σ or 2σ or 3σ uncertainties on
the results. This can mean either of two things:
(i) The posterior probability distribution has been approximated by a Gaussian, and the
stated σ is the dispersion; or
(ii) The posterior is not a Gaussian, but 68% (or 95% or 99.7%, as applicable) of the area
under the posterior is within the stated range—in this case the σ notation is just a
shorthand, and Gaussians don’t really come into it.
Option (i) is a good idea if we can get a simple, even if approximate, answer. Option
(ii) is safer, because it avoids a Gaussian approximation, but will almost certainly need a
computer. We will find both options useful in later examples.
When we do want to approximate a posterior—or indeed any function f(x)—by a
Gaussian G(m, σ, x), there are two standard ways of doing so. The easy way is to simply
set m and σ2 to the mean and variance of f(x). Alternatively, one can require f(x) to
have the same peak (i.e., m) as G(m, σ, x), and require ln f(x) to have the same second
derivative (i.e., −σ^{−2}) as ln G(m, σ, x) at m. In other words, we can set
f′(m) = 0,   σ^{−2} = −f″(m)/f(m). (3.35)
We will use the formula (3.35) whenever possible, because the approximation is not
affected by the tails of f(x). But if (3.35) is intractable (because we cannot solve f′ (m) = 0
easily) we will fall back on the easier first method.

EXAMPLE [Estimating m and σ from a sample] Say we are given some numbers x1 , . . . , xN which
(we happen to know) come from a Gaussian distribution, and we want to use these data to estimate
m and σ.
We have for the likelihood (suppressing factors of 2π)
prob(D | m, σ) ∝ σ^{−N} exp[−½σ^{−2} Σ_i (x_i − m)²]. (3.36)

If we write
m̄ = (1/N) Σ_i x_i,   σ̄² = (1/N) Σ_i (x_i − m̄)², (3.37)

(the sample mean and variance), and rearrange the likelihood a bit, we get

prob(D | m, σ) ∝ σ^{−N} exp{−(N/(2σ²))[(m − m̄)² + σ̄²]}. (3.38)

If we give m a flat prior then prob(m | D, σ) becomes a Gaussian with mean m̄ and dispersion
σ/√N, so m = m̄ ± σ/√N. (We are using 1σ errors, as is the default.) The uncertainty depends on
what we will estimate for σ, but since that’s bound to be pretty close to σ̄, we can safely say

m = m̄ ± σ̄/√N. (3.39)

Estimating σ is a bit more complicated. First we marginalize out m, and then put a 1/σ prior
on σ, getting the posterior
prob(σ | D) ∝ σ^{−N} e^{−Nσ̄²/(2σ²)}. (3.40)

This is not a Gaussian in σ, but we can approximate it by one. Using approximation formula (3.35)
we get
σ = σ̄ ± σ̄/√(2N). (3.41)

This answer is simple, but it isn’t entirely satisfactory because it may admit negative values. So one
may prefer something more elaborate if that is a worry.

Had we used a flat prior for σ, (3.41) would have been replaced by σ = √(N/(N−1)) σ̄ ± √(N/(N−1)) σ̄/√(2N). If, on the other hand, we had the 1/σ prior but knew m independently, we would have got σ = √(N/(N+1)) σ̄ ± √(N/(N+1)) σ̄/√(2N). ⊓
would have got σ = N/(N + 1) σ ± 2 N/(N + 1) σ. 2 ⊓

One sometimes comes across the “error function” erf(x), which is essentially the proba-
bility of a number drawn from a Gaussian being less than some value. The definition
is
erf(x) = (2/√π) ∫_0^x e^{−t²} dt, (3.42)
so
prob(y < x | Gaussian) = ½ [1 + erf((x − m)/(√2 σ))]. (3.43)
There’s also erfc(x) ≡ 1 − erf(x).

PROBLEM 3.3: Digital cameras use a technology, originally developed for astronomical imaging,
called a CCD (charge coupled device). The guts of a CCD is a matrix of pixels (picture elements,
say 1024 × 1024 of them). During an exposure, each pixel collects a charge proportional to the
number of photons reaching it. After the exposure, the collected charges are all measured, hence
counting the number of photons. When working with a CCD one really is counting photons, and
has to worry about counting statistics. The counts follow Poisson statistics, of course, but people
generally use the Gaussian approximation. So if a pixel records N photons, it is taken to be drawn
from a Gaussian distribution with m = N and σ = √N.

Figure 3.1: Counts along one row of an imaginary CCD.

The major source of instrumental error in CCDs comes because each pixel needs to have some
initial charge before it can operate. In effect, there are some photons of internal origin. These
(fictitious) internal photons can easily be calibrated for and their number subtracted out. But the
internal photons are themselves subject to Poisson/Gaussian noise, which we need to know about.
This is called readout noise.
To estimate the readout noise, one takes an exposure without opening the shutter, and estimates
a Gaussian fit to the counts. Figure 3.1 is a plot with simulated counts for 1024 pixels along one row
of a fictitious CCD. There’s a constant term (‘bias’) added to all the counts, so it’s only the variation
in the counts which matters.
I showed the plot to the astronomer who first taught me the above, and asked him what he
would estimate the readout noise to be, from this information. He looked at the plot for maybe 15
seconds, and said “I’d say it was between 6 and 7”. How do you think he got that? [2]

PROBLEM 3.4: As well as being excellent photon detectors, CCDs are unfortunately also excellent
detectors of cosmic rays. So any raw CCD image will have a few clusters of pixels with huge photon
counts.

One way of getting around the problem of cosmic rays is to divide an exposure into a few
shorter exposures and then take the pixel-by-pixel medians of the shorter exposures. (The median
is very unlikely to come from an exposure during which the pixel in question was hit by a cosmic
ray.)
Say we have five exposures, and consider one of the (vast majority of) pixels that were never
hit by cosmic rays. The five counts for our pixel would represent five trials from the same Gaussian
distribution, but the median of the five would not have a Gaussian distribution. What distribution
would it have? [3]

PROBLEM 3.5: Alan M. Turing, one of the patron saints of computer science, wrote a philosophical
article in 1950 on Computing machinery and Intelligence. In it he introduces what he calls the imitation
game (now usually called the Turing test). In it, a human ‘interrogator’ is talking by teletype to
two ‘witnesses’, one human and one a computer. The interrogator’s job is to figure out which is
which. The human witness’s job is to help the interrogator, the computer’s job is to confuse the
interrogator (i.e., pretend to be the human). Turing proposes the ability (if achieved) to fool most
human interrogators as an operational definition of intelligence.
Most of the article consists of Turing first considering, and then wittily rebutting, nine different
objections to his proposal. The last of these is that humans are allegedly capable of ESP, but
computers are not; Turing’s discussion includes the following passage.
Let us play the imitation game, using as witnesses a man who is good as a telepathic
receiver, and a digital computer. The interrogator can ask such questions as ‘What suit
does the card in my right hand belong to?’ The man by telepathy or clairvoyance gives
the right answer 130 times out of 400 cards. The machine can only guess at random, and
perhaps get 104 right, so the interrogator makes the right identification.
Nobody seems to know if Turing was serious in this passage, or just having the philosophers
on. Certainly, ‘130 times out of 400 cards’ sounds less than impressive. What do you think? [2]
4. Monte-Carlo Basics

With the material we have covered so far, we can write down likelihoods and posteriors
in many kinds of examples. Mostly though, the probability distributions will depend on
several parameters, and in some complicated way. So we need numerical methods to find
peaks and widths of peaks, and to marginalize over some parameters.
In this chapter we will develop methods for generating a sample of ω from a probabil-
ity distribution, say prob(ω | D). Of course, we can make this sample as large as we please,
given enough computer time. If ω happens to be one-dimensional, we can then easily
estimate the mean and variance, or percentile values, whatever. If ω is multi-dimensional,
we consider one dimension at a time (after generating the multi-dimensional sample); this
amounts to marginalizing over the other dimensions. In this way, we can estimate each
parameter (with uncertainties) with all the other parameters marginalized out.
We need first to know a little about random number generators. For our purpose ran-
dom number generators are functions on a computer that return a number with uniform
probability in [0, 1]. They are quite sophisticated things, designed to make the output
appear as truly probabilistic as possible. Actually, they are spewing out a pre-determined
sequence of numbers, which is why some people say “pseudo-random” numbers. A
state-of-knowledge view would be that if we don’t know the sequence and can’t, from the
output, figure out anything about what is coming next, then it’s random. Knuth’s book1
is the ultimate reference, but here are a couple of things to remember.
(a) The sequence a random number generator produces is determined by an initial ‘seed’;
usually, if you run the program again, you get the same sequence. If you want a
different sequence, just skip a few numbers at the start of the program.
(b) A random sample of a uniform distribution is not completely uniform but has struc-
ture on all scales. As the sample gets larger, the structure gradually fades. If you
take the output of a good random number generator and numerically take the power
spectrum (mod-square of the Fourier transform), it will be flat.
If we have some f(x), defined for x in [0, 1] and |f(x)| ≤ 1, it’s easy to concoct
random numbers distributed according to f(x). We first call the uniform random number
generator, say it returns r; we use r a fraction f(r) of the time; otherwise we try again.
This generalizes easily to any bounded f defined in a finite multi-dimensional space. It is
called the rejection method. It always works (though if f has an integrable singularity or
the region is infinite, one will need an extra transformation). But it can be very inefficient,
especially in several dimensions.
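A minimal Python sketch of the rejection method (the bounded f(x) below is an arbitrary example of my own, not anything from the text):

```python
# Rejection method: sample x in [0,1] with probability density proportional to f(x),
# where 0 <= f(x) <= 1. Illustrative only.
import random

def f(x):
    return x * x          # any function with values in [0, 1] will do

def rejection_sample():
    while True:
        r = random.random()          # candidate point, uniform in [0, 1]
        if random.random() < f(r):   # keep it a fraction f(r) of the time
            return r

sample = [rejection_sample() for _ in range(10000)]
print("sample mean %.3f (exact 3/4 for f(x) = x^2)" % (sum(sample) / len(sample)))
```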
A more efficient way, if tractable, is to go via the cumulative distribution function. In

1D.E. Knuth, The Art of Computer Programming, vol. 2 (Addison-Wesley 1979), which includes the gem
“Random numbers should never be generated by a program written at random. Some theory should be used.”


one dimension, consider
F(x) = ∫_{−∞}^{x} f(x′) dx′, (4.1)

and for a uniform random number r, output F−1 (r); it will be distributed according to f(x).
This can be generalized to higher dimensions, provided we can compute F−1 (r) easily.
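For instance, for the exponential distribution f(x) = e^{−x} on [0, ∞) the cumulative distribution inverts in closed form, F^{−1}(r) = −ln(1 − r). A brief Python sketch (the exponential is my own choice of illustration):

```python
# Inverse-cumulant method for f(x) = exp(-x) on [0, infinity):
# F(x) = 1 - exp(-x), so F^{-1}(r) = -ln(1 - r).
import random
import math

def exponential_sample():
    r = random.random()
    return -math.log(1.0 - r)

sample = [exponential_sample() for _ in range(10000)]
print("sample mean %.3f (exact 1)" % (sum(sample) / len(sample)))
```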
A variant of the inverse-cumulant method above is to divide the domain of x into
K segments, each contributing equally to the cumulative distribution function, and then
generate K random numbers at once, one in each segment. To make it sound more
impressive, this procedure is called ‘Latin hypercube sampling’.

PROBLEM 4.1: Consider the distribution function


f(r) ∝ 1/[r²(1 + r²)]

where r is the radial coordinate in three dimensional space. How can we use the inverse-cumulant
method to generate random space points (x, y, z) distributed according to f(r)? Suggest a method
that will work for
g(r) ∝ sin²(r)/[r²(1 + r²)].

(A description of the algorithms will do—no formal code-style notation required.) [2]

The inverse-cumulant and rejection methods won’t do, though, for most of our appli-
cations. The distribution functions will generally be far too complicated for inverse-
cumulants, and if they are sharply peaked in several dimensions, rejection is too ineffi-
cient. For such problems there is a powerful set of iterative schemes, sometimes called
‘Markov chain Monte-Carlo algorithms’. The idea is that the function to be sampled, say
f(x), is approached through a sequence of approximations fn (x). These fn (x) are distribu-
tion functions (in multi-dimensions), not samples. Each fn (x) is related to the next iterate
fn+1 (x) by some “transition probabilities” p(x → x′ ), thus:
f_{n+1}(x) = Σ_{x′} f_n(x′) p(x′ → x),   where   Σ_x p(x′ → x) = 1. (4.2)

The transition probabilities (normalized as above) don’t depend on n, hence the sequence
f1 , f2 , . . . is a Markov chain. The key to making fn (x) converge to f(x) is to choose the
right transition probabilities, and they can be anything that satisfies

f(x) p(x → x′ ) = f(x′ ) p(x′ → x), (4.3)

known as detailed balance.



DIGRESSION [Convergence of the iterations] If fn (x) has already converged to f(x), further iterates
will stay there, because (using 4.3 and 4.2)
f_{n+1}(x) = Σ_{x′} f_n(x′) p(x′ → x)
           = Σ_{x′} f_n(x) p(x → x′) = f_n(x). (4.4)

To see why fn (x) will approach f(x), we take


|f(x) − f_{n+1}(x)| = |f(x) − Σ_{x′} f_n(x′) p(x′ → x)|
                    = |Σ_{x′} [f(x′) − f_n(x′)] p(x′ → x)| (4.5)
                    ≤ Σ_{x′} |f(x′) − f_n(x′)| p(x′ → x),

where the last step uses the triangle inequality. Summing now over x, we get
Σ_x |f(x) − f_{n+1}(x)| ≤ Σ_{x′} |f(x′) − f_n(x′)|, (4.6)

implying convergence. ⊓

We can use the convergence property of fn (x) to generate a Markov chain x1 , x2 , . . . that
eventually samples f(x), as follows. At any xn , we pick a trial x′n , and set

x_{n+1} = x′_n with probability p(x_n → x′_n), and x_{n+1} = x_n otherwise, (4.7)

where the transition probability p(xn → x′n ) satisfies the detailed balance condition (4.3).
In other words, we transit to x′n with probability p(xn → x′n ), otherwise we stay at xn .
Then x1 (which we choose at random) may be considered a sample of f1 (x), x2 a sample
of f2 (x), and so on. For large-enough N, the chain xN , xN+1 , . . . will be a sample of f(x).
The easiest choice for the transition probability is

p(x → x′ ) = min [f(x′ )/f(x), 1], (4.8)

and the resulting algorithm is called the Metropolis algorithm. There are many other
algorithms which improve on it in various ways,1 but plain Metropolis is still the most
popular.
There are some technical issues to think about when implementing the Metropolis
algorithm. The first is how to pick the new trial x′. One way is to take a step: x′ = x + r δx, where

1 The original context of these algorithms was condensed-matter physics, and the most demanding appli-
cations are still in that area—see for example, J.J. Binney, N.J. Dowrick, A. Fisher, & M.E.J. Newman, The Theory
of Critical Phenomena (Oxford University Press 1992).

δx is a stepsize and r a random number in [−1, 1]. Ultimate convergence won’t depend
on the choice of δx, but efficiency will. The second issue is how long to iterate for. Now
Metropolis operates by accepting all trial steps that would increase f and accepting some
but not all trial steps that would decrease f. So there will be an initial period (sometimes
called ‘burn-in’) when the algorithm seeks out the maximal region of f(x), and a later
‘equilibrium’ period when it wanders around in the maximal region, with occasional brief
forays into lower regions. Our sample should come from the equilibrium period, but
how do we recognize that it has begun? There are more sophisticated things one can do,
but one simple way is to look for the disappearance of an increasing trend in, say, 100
iterations.
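Here is a bare-bones Metropolis sampler in Python for a one-dimensional f(x). The target function, the step size and the chain length are arbitrary illustrative choices; a serious application would also monitor the burn-in period as just described.

```python
# Bare-bones Metropolis algorithm, equations (4.7)-(4.8), in one dimension.
import random
import math

def f(x):
    # an arbitrary (unnormalized) two-bump target, for illustration only
    return math.exp(-0.5 * x * x) + 0.5 * math.exp(-0.5 * (x - 4.0) ** 2)

def metropolis(f, x0, stepsize, nsteps):
    chain = [x0]
    x = x0
    for _ in range(nsteps):
        x_trial = x + random.uniform(-1.0, 1.0) * stepsize
        # accept with probability min[f(x_trial)/f(x), 1] -- equation (4.8)
        if random.random() < min(f(x_trial) / f(x), 1.0):
            x = x_trial
        chain.append(x)
    return chain

chain = metropolis(f, 0.0, 1.0, 50000)
equilibrium = chain[5000:]              # crudely discard a burn-in stretch
print("mean of sampled x: %.3f" % (sum(equilibrium) / len(equilibrium)))
```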

PROBLEM 4.2: Consider our friend


f(r) ∝ 1/[r²(1 + r²)]
again, and this time write a Metropolis program to generate a sample of (x, y, z) distributed accord-
ing to f(r). Plot a histogram of the x values in [−5, 5], and on it overplot the exact marginalized
distribution, which is
∫ f(r) dy dz = (1/2π) ln(1 + 1/x²).
The size of the Metropolis steps, the size of the sample, and the scaling for the histogram are all up
to you, but you should choose them to show that the histogram is a believable approximation to the
exact distribution. [4]
5. Least Squares

The most bread-and-butter thing in data analysis is when we have a signal F(ω, t) de-
pending on some parameters ω and an independent variable t, and some Gaussian noise
n(σ(t)), and we measure the sum at certain discrete points. In other words, the data are

di = F(ω, ti ) + n(σ(ti )), i = 1, . . . , N. (5.1)

The form of F(ω, t) is known, and the noise n(σ(t)) has a Gaussian distribution with zero
mean and known time-dependent dispersion. The problem is to infer the values of the ω.
This (for reasons that will be obvious shortly) is called least-squares. The key requirements
are that the noise must be Gaussian and additive on the signal—not a universal situation,
but a very common one, and the reason for spending a lot of effort on this problem in its
many forms.1
By assumption and equation (5.1), di − F(ω, ti ) will have a Gaussian distribution,
hence the likelihood is
prob(D | ω) = (2π)^{−N/2} [Π_i σ^{−1}(t_i)] exp[−Σ_i Q_i²],
Q_i² = (d_i − F(ω, t_i))² / (2σ²(t_i)). (5.2)

By rescaling
d_i ← [σ/σ(t_i)] d_i,   F_i(ω) ← [σ/σ(t_i)] F(ω, t_i) (5.3)
we can write
prob(D | ω, σ) = (2π)^{−N/2} σ^{−N} exp[−Q²/(2σ²)],
Q² = Σ_i (d_i − F_i(ω))². (5.4)

In (5.4) we left the overall noise scale σ as a parameter; sometimes it is known, sometimes
it must be inferred along with the ω.
Typically, some of the parameters will enter F linearly and some nonlinearly. For
example if F is a tenth degree polynomial, we have F_i(c_n) = Σ_{n=0}^{10} c_n t_i^n and all the
parameters cn are linear. Or Fi might be something like c0 + c1 cos(αti ) + c2 cos(2αti ),
with one non-linear and three linear parameters. Linear parameters are much easier to
deal with, so let us change our notation slightly to use a1 , . . . , aL for the linear param-
eters (amplitudes) and ω for all the nonlinear parameters. Thus we replace Fi (ω) by

1 This chapter is a mixture of well-known and comparatively little-known results. The parameter fitting and
model comparison parts are based mostly on G.L. Bretthorst, Bayesian Spectrum Analysis and Parameter Estimation
(Springer-Verlag 1988).

Σ_{l=1}^{L} a_l f_{li}(ω), and the likelihood (5.4) by
prob(D | a_l, ω, σ) = (2π)^{−N/2} σ^{−N} exp[−Q²/(2σ²)],
Q² = Σ_{i=1}^{N} (d_i − Σ_{l=1}^{L} a_l f_{li}(ω))². (5.5)

Typically, the sum over l (the amplitudes) will have just a few terms but the sum over i
(the data) will have many terms.

EXAMPLE [Fitting a straight line] Let us work through this simple example before dealing with the
general case. We have some data y1 , . . . , yN measured at xi with Gaussian errors σi , and we want
to fit these to a straight line y = mx + c. There are no errors in x.
First, as usual, we rescale each yi and the corresponding xi to change all the σi to σ. The
likelihood is then
prob(D | m, c) ∝ σ^{−N} exp[−Q²/(2σ²)],
Q² = Σ_{i=1}^{N} (y_i − m x_i − c)². (5.6)
Let us put uniform priors on the parameters m and c. Then the posterior probability prob(m, c | D)
is proportional to the likelihood.
The posterior, as well as being Gaussian in the data, is also a two-dimensional Gaussian in
m, c; the reason, as we can verify from (5.6), is that m, c are linear parameters (amplitudes). To find
the values (m̄, c̄, say) that maximize the posterior we equate the relevant partial derivatives of Q²
to zero. Doing this, we have
[ Σ_i x_i²   Σ_i x_i ] [ m̄ ]   [ Σ_i x_i y_i ]
[ Σ_i x_i    N       ] [ c̄ ] = [ Σ_i y_i     ] (5.7)

a matrix equation whose form


C−1 ·a = P (5.8)
we will meet again. Solving the matrix equation for (m̄, c̄), we get
m̄ = [N Σ_i x_i y_i − Σ_i x_i Σ_i y_i] / [N Σ_i x_i² − (Σ_i x_i)²],
c̄ = [Σ_i x_i² Σ_i y_i − Σ_i x_i y_i Σ_i x_i] / [N Σ_i x_i² − (Σ_i x_i)²]. (5.9)

Expanded around its minimum, Q2 is just a constant plus its quadratic part, hence the posterior is

prob(m, c | D) ∝ σ^{−N} exp[−Q²/(2σ²)],
Q² = const + (m − m̄)² Σ_i x_i² + (c − c̄)² N (5.10)
     + 2(m − m̄)(c − c̄) Σ_i x_i.

To find the uncertainty in either m or c, we marginalize out the other from the posterior. The
marginal posteriors turn out to be (using formula M.11 on page 71) Gaussian as well

prob(m | D) ∝ exp[−(m − m̄)²/(2σ_m²)],   σ_m² = σ²N/∆,
prob(c | D) ∝ exp[−(c − c̄)²/(2σ_c²)],   σ_c² = σ² Σ_i x_i²/∆, (5.11)
∆ = N Σ_i x_i² − (Σ_i x_i)².

Though these expressions look messy when written out in full, all one really needs to remember is
the matrix equation (5.7/5.8). The least-squares values (5.9) are the solution of the matrix equation,
and the expressions for σ2m and σ2c in (5.11) are the diagonal elements of σ2 C. ⊓
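The matrix equation (5.7) and the error bars (5.11) translate directly into a few lines of code. A small Python sketch with invented data and equal (unit) error bars:

```python
# Straight-line least squares via the normal equations (5.7)-(5.11).
import random
import math

xs = [0.1 * i for i in range(30)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 1.0) for x in xs]   # "true" m=2, c=1
N, sigma = len(xs), 1.0

Sx  = sum(xs)
Sxx = sum(x * x for x in xs)
Sy  = sum(ys)
Sxy = sum(x * y for x, y in zip(xs, ys))

Delta = N * Sxx - Sx * Sx                       # determinant, as in (5.11)
m_fit = (N * Sxy - Sx * Sy) / Delta             # equation (5.9)
c_fit = (Sxx * Sy - Sxy * Sx) / Delta
sig_m = sigma * math.sqrt(N / Delta)            # equation (5.11)
sig_c = sigma * math.sqrt(Sxx / Delta)

print("m = %.3f +/- %.3f" % (m_fit, sig_m))
print("c = %.3f +/- %.3f" % (c_fit, sig_c))
```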

PROBLEM 5.1: A popular way of estimating errors in problems where it’s difficult to do the proba-
bility theory is called bootstrap. This problem is to test bootstrap in a simple context, by comparing
its results with what we have calculated from probability theory.
In the following table y is linear in x but with some Gaussian noise.

x y x y x y x y
0.023 0.977 0.034 0.928 0.047 0.951 0.059 1.206
0.070 1.094 0.080 1.002 0.082 0.769 0.087 0.979
0.099 1.043 0.129 0.686 0.176 0.638 0.187 0.808
0.190 0.728 0.221 0.760 0.233 0.770 0.246 0.869
0.312 0.631 0.397 0.575 0.399 0.735 0.404 0.571
0.461 0.679 0.487 0.415 0.530 0.299 0.565 0.410
0.574 0.509 0.669 0.291 0.706 0.167 0.850 -0.067
0.853 0.023 0.867 0.128 0.924 0.332 1.000 0.045

Fit for the coefficients in y = mx + c. Let’s say m̄ and c̄ are the estimated values. Then invent a new
data set, by sampling the given data N times (32 times in this case) with replacement, and fit again
for m and c—say you get m̂ and ĉ. Repeat a number of times, and calculate the matrix

[ ⟨(m̂ − m̄)(m̂ − m̄)⟩   ⟨(m̂ − m̄)(ĉ − c̄)⟩ ]
[ ⟨(ĉ − c̄)(m̂ − m̄)⟩   ⟨(ĉ − c̄)(ĉ − c̄)⟩ ]

(where the averages are over the generated data sets). This is the procedure for bootstrap, and the
matrix is taken as an estimate of the matrix σ2 C. [3]

We continue with the general case. Let us rewrite (5.5) in a more convenient notation. If
we define
d² = Σ_{i=1}^{N} d_i²,
P_l = Σ_{i=1}^{N} d_i f_{li}(ω),   C⁻¹_{kl} = Σ_{i=1}^{N} f_{ki}(ω) f_{li}(ω), (5.12)

then the likelihood (5.5), after expanding and rearranging, becomes

prob(D | a_l, ω, σ) = (2π)^{−N/2} σ^{−N} exp[−Q²/(2σ²)],
Q² = d² + Σ_{k,l} a_k a_l C⁻¹_{kl} − 2 Σ_l a_l P_l. (5.13)

Equation (5.13) is just asking to be written in matrix notation, so let us introduce

a ← al , P ← Pl , C ← Ckl , (5.14)

using which (5.13) becomes

prob(D | a, ω, σ) = (2π)^{−N/2} σ^{−N} exp[−Q²/(2σ²)],
Q² = d² + aᵀ·C⁻¹·a − 2Pᵀ·a. (5.15)

Let us pause a moment to remark on C, a, P, since we will meet them many times
below. C is an L × L matrix depending on the nonlinear parameters but not the data;
σ2 C is called the covariance matrix. P is a column vector, and a sort of inner product of
data and model. And a is the vector of L amplitudes to be inferred. The likelihood is
completely specified by C, P, d2 , and σ2 , and we will not need to refer explicitly to the
data any more.
Completing the squares inside Q2 , and using the fact that C is symmetric, we can
rewrite (5.15) as

prob(D | a, ω, σ) = (2π)^{−N/2} σ^{−N} exp[−Q²/(2σ²)],
Q² = d² − Pᵀ·C·P + (a − C·P)ᵀ·C⁻¹·(a − C·P). (5.16)

We see that the likelihood is an L-dimensional Gaussian in the amplitudes a. Let us put a
flat prior on a, so the posterior is also Gaussian in a. We can then easily write down the
mean and (using the formula M.14 from page 71) dispersion:
⟨a⟩ = C·P,   ⟨(a − ⟨a⟩)(a − ⟨a⟩)ᵀ⟩ = σ²C. (5.17)

In other words, a has an L-dimensional Gaussian distribution with mean C·P and disper-
sion σ2 C. The diagonal elements of the covariance matrix σ2 C give the dispersions in a
and the off-diagonal elements indicate how correlated the components of a are. (Recall
that C doesn’t depend on the data.)
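In matrix notation the amplitude estimates are only a few lines of numpy. The following sketch (the quadratic basis functions and the fake data are my own illustrative choices) builds C⁻¹ and P as in (5.12) and then evaluates (5.17).

```python
# Linear least squares in the C, P notation of (5.12) and (5.17), using numpy.
import numpy as np

t = np.linspace(0.0, 1.0, 50)
sigma = 0.1
d = 1.0 - 0.5 * t + 0.3 * t**2 + sigma * np.random.randn(t.size)   # fake data

# Basis functions f_{li}(omega): here a quadratic polynomial, so L = 3.
F = np.vstack([np.ones_like(t), t, t**2])     # shape (L, N)

Cinv = F @ F.T                   # C^{-1}_{kl} = sum_i f_ki f_li   (5.12)
P    = F @ d                     # P_l = sum_i d_i f_li
C    = np.linalg.inv(Cinv)

a_mean = C @ P                   # <a> = C.P                       (5.17)
a_cov  = sigma**2 * C            # covariance matrix of the amplitudes

for l in range(len(a_mean)):
    print("a[%d] = %8.4f +/- %.4f" % (l, a_mean[l], np.sqrt(a_cov[l, l])))
```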

If we change from a to some linearly transformed amplitudes b, the covariance of b


will be related to the covariance of a via the transformation matrix:
⟨(b − ⟨b⟩)(b − ⟨b⟩)ᵀ⟩ = (∂b/∂a) ⟨(a − ⟨a⟩)(a − ⟨a⟩)ᵀ⟩ (∂b/∂a)ᵀ. (5.18)

Having extracted as much as we can about the amplitudes, we now get them out of
the way by marginalizing. Using (M.13) from page 71 gives us

prob(D | ω, σ) = (2π)^{(L−N)/2} σ^{L−N} ×
|det C|^{1/2} exp[−(d² − Pᵀ·C·P)/(2σ²)]. (5.19)

Now we have to deal with the nonlinear parameters. In general (5.19) will have to be
investigated numerically, by Monte-Carlo. But if Pᵀ·C·P has a sharp maximum, at ω̄ say,
then it is reasonable to linearize about ω̄:
F_i(ω) ≃ F_i(ω̄) + (ω_k − ω̄_k) [∂F_i(ω)/∂ω_k]_{ω̄}. (5.20)

In this approximation (ω_k − ω̄_k) behave like more amplitudes. So we can give uncertain-
ties on ω by treating D_{kl}, where
D⁻¹_{kl} = Σ_{i=1}^{N} [∂F_i/∂ω_k]_{ω̄} [∂F_i/∂ω_l]_{ω̄}, (5.21)

as a covariance matrix (also called a Fisher matrix in this context). If we transform from
ω to some other parameters µ, and µ̄ corresponds to ω̄, the covariance matrix in terms of
µ is given analogously to (5.18):

⟨(µ_k − µ̄_k)(µ_l − µ̄_l)⟩ = Σ_{pq} (∂µ_k/∂ω_p)(∂µ_l/∂ω_q) ⟨(ω_p − ω̄_p)(ω_q − ω̄_q)⟩. (5.22)

This is called the formula for propagation of errors, and is much used, especially in its
diagonal form
⟨(µ_k − µ̄_k)²⟩ = Σ_p (∂µ_k/∂ω_p)² ⟨(ω_p − ω̄_p)²⟩, (5.23)

but we should note that it is valid only in the linearized approximation. A simple corollary
is that if there are N independent least-squares estimates of some quantity, all with equal
error-bars, then the error-bar of the combined estimate will be down by √N.
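The corollary is easy to verify. The sketch below (purely illustrative numbers) applies (5.23) to the mean of N independent estimates, for which ∂µ/∂ω_p = 1/N and hence ⟨(µ − µ̄)²⟩ = σ²/N, and checks the result against a direct simulation.

```python
# Propagation of errors (5.23) for the mean of N independent estimates,
# checked against a Monte-Carlo simulation. Numbers are illustrative.
import random
import math

N, sigma, truth = 16, 0.5, 10.0

# (5.23): mu = (1/N) sum_p omega_p, d(mu)/d(omega_p) = 1/N,
# so var(mu) = N * (1/N)^2 * sigma^2 = sigma^2 / N.
predicted = sigma / math.sqrt(N)

trials = 20000
means = [sum(random.gauss(truth, sigma) for _ in range(N)) / N for _ in range(trials)]
mbar = sum(means) / trials
scatter = math.sqrt(sum((m - mbar) ** 2 for m in means) / trials)

print("predicted error of the mean %.4f" % predicted)
print("simulated error of the mean %.4f" % scatter)
```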

PROBLEM 5.2: Suppose we have several data points d1 , d2 , . . . , dK at the same value of the inde-
pendent variable t, amongst other data points at other t. In least squares, should one include the K
data points individually, or combine them into a single point with a smaller error bar? [2]

EXAMPLE [Error propagation on square roots] In practice, people use the error propagation formula
(5.23) quite freely in nonlinear situations. For example, if ω has been estimated as ω̄ ± ∆ω, then if
µ is defined as √ω, one gets from (5.23):
µ = √ω̄ ± ∆ω/(2√ω̄).
The key assumption, of course, is that the data are good enough that ∆ω ≪ ω. ⊓

PROBLEM 5.3: The simple error propagation formula isn’t always good enough. If we are going to

take √ω, then we really have a prior that requires ω ≥ 0. Let us set the prior to zero for ω < 0
and a constant otherwise. The likelihood could be Gaussian in ω (with mean ω̄ and dispersion ∆ω
say). Quite possibly ω̄ < 0, but that does not prevent one from deriving a perfectly good posterior
for µ = √ω. In fact, one can derive a Gaussian approximation to the posterior for µ, and this has
mean µ̄ and dispersion ∆µ given by
µ̄² = ½(ω̄ + √(ω̄² + 2∆ω²)),   ∆µ² = ∆ω²/(2√(ω̄² + 2∆ω²)).
2 ω2 + 2∆ω2
Derive this approximation. [3]

One doesn’t always have an estimate for the noise dispersions σi . What one then does is
(i) assume the σi equal some σ and (ii) take σ as a scale parameter and marginalize it out.
Multiplying (5.19) by a 1/σ prior and integrating using (M.5) we get
prob(D | ω) = ∫ prob(D | ω, σ) σ⁻¹ dσ
            = π^{(L−N)/2} |det C|^{1/2} × (5.24)
              Γ(½(N − L)) (d² − Pᵀ·C·P)^{(L−N)/2}.
This is usually called a Student-t distribution.1 We can get an estimate for σ2 too, by
taking an expectation
⟨σ²⟩ = [∫ σ² prob(D | ω, σ) σ⁻¹ dσ] / [∫ prob(D | ω, σ) σ⁻¹ dσ] (5.25)
gives (adapting the calculation of 5.24)
⟨σ²⟩ = (d² − Pᵀ·C·P)/(N − L − 2). (5.26)

1The real name of the person who derived something of this form in a somewhat different context, and
published it under the name of ‘Student’, is known to have been W.S. Gosset, but nobody calls it the Gosset
distribution.

PROBLEM 5.4: In most of this chapter we assume that there are no errors in an independent variable,
i.e., in the ti in equation (5.1). This problem is about the simplest case when there are errors in the
independent variable.
Suppose, we have some data (xi , yi ), where both the xi and the yi have errors. For simplicity
we’ll assume the noise dispersions in xi and yi are all unity. (This can always be arranged by
rescaling.) We want to fit a straight line y = mx + c.
The likelihood function is no longer (5.6), and your job is to work out what it is. To do this,
first consider the likelihood for one datum, prob(x1 , y1 | m, c). This is going to depend on the noise
in x and y, or the distance of (x1 , y1 ) from the corresponding noiseless point on the line. But we
don’t know what that noiseless point is, we just know it lies somewhere on y = mx + c. [4]

So much for fitting a model by least-squares. What if we have several models, and want to
find which one the data favour? For example, we might be fitting polynomials of different
orders, and need to find which order is favoured. We need the global likelihoods; that
is, we have to marginalize out the parameters, but with normalized priors. For very
nonlinear parameter dependences we have to do that numerically, but for linear and
linearizable parameters the marginalization can be done exactly.
For parameter fitting we took a flat prior on a. Now we modify that to a Gaussian
prior with a very large dispersion; it’s still nearly flat for a values having non-negligible
posterior, so our parameter estimates aren’t affected. We take

prob(a | δ) = (2π)^{−L/2} δ^{−L} |det C|^{−1/2} exp[−aᵀ·C⁻¹·a/(2δ²)], (5.27)

where δ is a new parameter assumed ≫ σ. Then the exponent in prob(D | a, ω, σ)×prob(a |


δ) has the same form as (5.15), but with the substitution

C−1 ← C−1 (1 + σ2 /δ2 ). (5.28)

Marginalizing out the amplitudes we get, in place of (5.19):

prob(D | ω, σ, δ) = (2π)^{−N/2} δ^{−L} σ^{L−N} (1 + σ²/δ²)^{−L/2} ×
exp[−(d² − Pᵀ·C·P)/(2σ²) − Pᵀ·C·P/(2δ²)]. (5.29)

Since we assumed δ ≫ σ, we can discard the factor of (1 + σ2 /δ2 )−L/2 . Now we have
to marginalize out δ; we do our usual trick (cf. equation 2.12 on page 19) of assigning a
Jeffreys prior with cutoffs:

prob(δ) = δ⁻¹/ln(δ_max/δ_min),   or (say)   δ⁻¹/Λ_δ. (5.30)

We then marginalize out δ by integrating—the limits of integration should properly be


δmin to δmax , but we further assume that we can approximate the limits as 0 to ∞ with
negligible error. This gives

prob(D | ω, σ) = Λ_δ⁻¹ (2π)^{−N/2} 2^{L/2} σ^{L−N} Γ(½L) ×
(Pᵀ·C·P)^{−L/2} exp[−½σ⁻²(d² − Pᵀ·C·P)]. (5.31)

Marginalizing σ out in similar fashion gives

prob(D | ω) = Λ−1 −1 −N/2


δ Λσ π ×
−L/2 2 (L−N)/2 (5.32)
Γ 12 L Γ 12 (N − L) PT ·C·P d − PT ·C·P
 
.

For model comparison, one would take the ratios of the expressions in (5.31) or (5.32) for
two different models; provided both models had at least one amplitude, the Λ factors
would cancel.

PROBLEM 5.5: In the following table the y-values are a polynomial in x plus Gaussian noise.

x y x y x y x y
0.023 0.977 0.034 0.949 0.047 0.943 0.059 0.948
0.070 0.924 0.080 0.914 0.082 0.920 0.087 0.919
0.099 0.916 0.129 0.886 0.176 0.842 0.187 0.840
0.190 0.849 0.221 0.827 0.233 0.787 0.246 0.789
0.312 0.757 0.397 0.697 0.399 0.696 0.404 0.713
0.461 0.657 0.487 0.643 0.530 0.609 0.565 0.565
0.574 0.549 0.669 0.484 0.706 0.433 0.850 0.251
0.853 0.245 0.867 0.238 0.924 0.151 1.000 -0.016

What is the degree of the polynomial? [4]

Finally, we have to test the goodness of fit, that is, we have to ask whether the data at
hand could plausibly have come from the model we are fitting. After all, if none of the
models we are fitting is actually correct, neither the parameter estimates nor the error bars
on them mean very much.
We need a statistic to measure goodness of fit. A natural choice (though still an ad
hoc choice—see the discussion on page 10) is the logarithm of the likelihood. Going back
to equation (5.1) and assuming the σi are known, we define

N
X 2
2 [di − Fi (ω)]
χ = , (5.33)
σ2i
i=1

and from (5.2) the likelihood is


prob(D | ω, M) ∝ exp[−½χ²]. (5.34)
To use χ2 as the goodness-of-fit statistic, we need its p-value: the probability, given ω in
some model, that a random data set will fit less well than the actual data set (see page 10),
or
1 − prob(χ2 < χ2D | ω), (5.35)
where χ2D is what the actual data set gives. If prob(χ2 < χ2D | ω) is very close to 1, it means
the data are a very bad fit. For example, if prob(χ2 < χ2D | ω) > 0.99, the model is said to
be rejected with 99% confidence.
The χ2 test is the most-used goodness-of-fit test, and it helps that prob(χ2 < χ2D | ω)
is fairly easy to calculate, at least approximately. To start with, note that the likelihood
is a density in N-dimensional data space. The peak is at χ = 0, and outside of that the
density falls as e^{−½χ²}. Thus, if we consider χ as a radial coordinate in N-dimensions,
the density depends only on radius. We can calculate that density as a function directly
of radius, or prob(χ² | ω). Since the volume element in N dimensions is ∝ χ^{N−1} dχ, and
∝ (χ²)^{N/2−1} dχ², we have
prob(χ² | ω) ∝ (χ²)^{N/2−1} exp[−½χ²]. (5.36)
Note: by convention, prob(χ2 | ω) is the probability density with respect to χ2 , not χ.
Actually, (5.36) applies only if the parameters are fixed—if we fit L amplitudes (includ-
ing linearizable nonlinear parameters) then we effectively remove L of the N dimensions
over which the data can vary independently. In that case
prob(χ² | ω) ∝ (χ²)^{ν/2−1} exp[−½χ²],   ν = N − L. (5.37)
Here ν is called the number of “degrees of freedom” and χ2 /ν is sometimes called “reduced
χ2 ”. There is no analogous prescription for adjusting for nonlinear parameters that cannot
be linearized.
We can derive a reasonably good Gaussian approximation for the χ2 distribution
(5.37). Using the approximation formula (3.35) from page 27 we get
prob(χ² | ω) ≃ (1/√(4πµ)) exp[−(χ² − µ)²/(4µ)] (5.38)
where
µ = N − L − 2. (5.39)

The content of (5.38) is that χ² should be close to µ; if it is more than 5√µ away something is
fishy; it is more useful to know this fact than the value of prob(χ2 < χ2D | ω). Nevertheless,
the latter can be computed, by writing
prob(χ² < χ²_D | ω) = [∫_0^{χ²_D/2} x^{µ/2−1} e^{−x} dx] / [∫_0^{∞} x^{µ/2−1} e^{−x} dx] (5.40)

and observing that the right hand side is the definition of an incomplete Γ function. Thus

prob(χ² < χ²_D | ω) = Γ(½µ, ½χ²_D). (5.41)
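Numerically, (5.40)–(5.41) amounts to one call to a library incomplete-gamma routine. For example (a sketch; the values of µ and χ²_D are invented), scipy's regularized incomplete gamma function gives

```python
# The chi-square goodness-of-fit probability (5.40)-(5.41) via the
# regularized incomplete gamma function. mu and chi2_D are illustrative.
from scipy.special import gammainc

mu = 25.0          # mu = N - L - 2, as in (5.39)
chi2_D = 41.3      # chi-square of the actual data set

prob_less = gammainc(mu / 2.0, chi2_D / 2.0)   # prob(chi^2 < chi^2_D | omega)
p_value = 1.0 - prob_less                      # as in (5.35)

print("prob(chi^2 < chi2_D) = %.4f" % prob_less)
print("p-value              = %.4f" % p_value)
```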

One can only apply the χ2 test if the σi are known. If σ is estimated by equation
(5.26), that just sets χ2 at its expectation value. Interestingly, µ in (5.39) is the same as the
denominator in (5.26). To test for acceptability of a fit for the case of unknown σ, we would
have to consider the set of residuals di − Fi (ω), and ask whether these could plausibly
have been drawn from a Gaussian distribution with m = 0 and σ2 given by (5.26). That’s
part of the material of the next chapter.

PROBLEM 5.6: What is the analogue of χ2 if the data are counts ni from a Poisson distribution with
mean m? (The answer is known as the Cash statistic.) [1]

PROBLEM 5.7: [Tremaine’s paradox] For a given model, we have from (5.34):

prob(D | ω) ∝ exp[−½χ²(D, ω)].
If ω takes a flat prior, then using Bayes’ theorem prob(ω | D) is also proportional to the right hand
side. On the other hand, consider
" 2 #
χ2 (D, ω) − µ
prob(χ2 | ω, D) ∝ exp − .

Again, if ω takes a flat prior, Bayes’ theorem gives prob(ω | D, χ2 ) proportional to the right hand
side. Or does it?? We seem to have two very different expressions for the posterior! What is going
on here? [4]
6. Distribution Function Fitting
In the previous chapter we considered deterministic models with noise added. In this
chapter we’ll consider some situations where the model itself is probabilistic.
Suppose we sample a distribution function prob(x) with a non-uniform but known
sampling function prob(S | x). Here x need not be one-dimensional; for example, prob(x)
might represent the distribution of a certain type of object in the sky and prob(S | x)
the detection efficiency in different parts of the sky. Anyway, the data consist of the set
x1 , . . . , xN actually sampled. We can calculate prob(x | S) using Bayes’ theorem:

prob(x | S) = prob(S | x) prob(x) / ∫ prob(S | x′) prob(x′) dx′ (6.1)

and hence the likelihood is


prob(x1, . . . , xN | S) = Π_{i=1}^{N} prob(S | x_i) prob(x_i) / [∫ prob(S | x) prob(x) dx]^N. (6.2)

If we have a model prob(x | ω, M) for the distribution function, we can use (6.2) for
parameter fitting and model comparison as usual, but these will almost always have to
be done numerically.

PROBLEM 6.1: Derive prob(x | S) in (6.1), but without using Bayes’ theorem. Consider a sequence of
events: prob(x) is the probability that the next event will be at x, while prob(x | S) is the probability
that the next event to be sampled is at x. Then do a calculation sort of like in Problem 1.1. [3]

EXAMPLE [The lighthouse] A lighthouse is out at sea off a straight stretch of coast. As its works
rotate, the lighthouse emits collimated flashes at random intervals, and hence in random directions
θ. There are detectors along the shore which record the position x where each flash reaches the
shore, but not the direction it came from. So the data are a set of positions x1 , . . . , xN along the
shore. The problem is to infer the position of the lighthouse.
Say the lighthouse is at distance a along the shore and b out to sea. If a flash has direction
θ, and this is towards the shore, then it will reach the shore at x = a + b tan θ. We have for the
likelihood:
prob(x | a, b) = (1/2π) dθ/dx = (1/2π) b/[(a − x)² + b²],
prob(D | a, b) ∝ Π_{i=1}^{N} b/[(a − x_i)² + b²]. (6.3)

This problem is often used by Bayesians as a warning against blindly using the sample mean.
Symmetry might suggest that the mean of the xi estimates a; in fact, since the likelihood is a
Lorentzian, it doesn’t have a mean. On the other hand, the posterior prob(a, b | D) is perfectly well
behaved, and constrains a and b more and more tightly as more data are recorded. ⊓
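The two-parameter posterior is easily explored on a grid. A small Python sketch (the flash positions are simulated from an assumed true position, purely for illustration):

```python
# Posterior for the lighthouse position (a, b) on a grid, using the likelihood (6.3)
# with flat priors. The "observed" flashes are simulated for illustration.
import math
import random

a_true, b_true = 0.3, 0.6
data = [a_true + b_true * math.tan(random.uniform(-0.5 * math.pi, 0.5 * math.pi))
        for _ in range(100)]

def log_post(a, b):
    return sum(math.log(b / ((a - x) ** 2 + b ** 2)) for x in data)

best = None
for i in range(101):
    for j in range(1, 101):
        a, b = -1.0 + 0.02 * i, 0.01 * j      # grid over a in [-1,1], b in (0,1]
        lp = log_post(a, b)
        if best is None or lp > best[0]:
            best = (lp, a, b)

print("posterior peak near a = %.2f, b = %.2f (true %.2f, %.2f)"
      % (best[1], best[2], a_true, b_true))
```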


PROBLEM 6.2: Now consider two lighthouses, of relative intensities w and 1 − w, at (a1 , b1 ) and
(a2 , b2 ) respectively. Take a1 = 0.5, b1 = 0.2, a2 = 0, b2 = 0.9, and concoct some fictitious data as
follows. Generate a set of x values,

80% according to x = a1 + b1 tan θ,


20% according to x = a2 + b2 tan θ,

(i.e., w = 0.8) with θ random and uniformly distributed in [−π/2, π/2]. Keep the first 200 xi that
lie in [−1, 1] (the detectors extend along only part of the shoreline).
Now use the Metropolis algorithm to recover all five parameters, with 90% confidence intervals.
Use a flat prior in [−1, 1] for a1 and a2 and a flat prior in [0, 1] for w, b1 , and b2 . (Don’t forget the
denominator in the likelihood.) [4]

EXAMPLE [Where the model is noisy too] Sometimes we have a situation where we can’t calculate a
model distribution function, but we can generate a model sample. If the model sample size could be
made arbitrarily large, that would be as good as a distribution function, but practical considerations
(e.g., computer power) may limit this. So we have a finite sample of some underlying distribution
function, and we want to compare it with data, i.e., calculate likelihoods and posteriors. If the
procedure for generating the model sample has some adjustable parameters, we can do parameter
fitting in the usual way.
Since the model and data samples will not be at coincident points, we need to be able to get at
the probability of the data sample points from nearby model points. Basically, we need to smooth
the model somehow. The crudest way of smoothing is to bin the model and data; in effect assuming
that the underlying probability distribution is constant over a bin. Suppose we do this. The bin
size must be small enough for the constancy over bins approximation to be reasonable, but large
enough for most data sample points to share their bin with some model points.
Let us say we have a reasonable binning, of B bins. Say the underlying probability of a (model
or data sample) point being in the i-th bin is wi . Also say there are M model points and S data
sample points in all, mi and si respectively in the i-th bin. The probability distribution for the bin
occupancies will be a multinomial distribution (see equation 2.2 on page 15):

prob(s_i, m_i | w_i) = M! S! Π_{i=1}^{B} w_i^{m_i+s_i}/(m_i! s_i!). (6.4)

Since we don’t know the w_i, except that Σ_{i=1}^{B} w_i = 1, let us marginalize them out with a flat prior.
Using the identity
∫ Π_{i=1}^{B} w_i^{n_i} dw_i  δ(Σ_{j=1}^{B} w_j − 1) = [Π_{i=1}^{B} n_i!] / (N + B − 1)!   (here N = Σ_i n_i) (6.5)

we get
prob(s_i, m_i) = [M! S! (B − 1)!/(M + S + B − 1)!] Π_{i=1}^{B} (m_i + s_i)!/(m_i! s_i!). (6.6)

The (B − 1)! factor comes from normalizing the prior on the wi .


In similar fashion, we can calculate prob(mi ), and combining that with prob(si , mi ) in (6.6) we
get
prob(s_i | m_i) = [S! (M + B − 1)!/(M + S + B − 1)!] Π_{i=1}^{B} (m_i + s_i)!/(m_i! s_i!). (6.7)

If M can be made arbitrarily large, we can make the bins so small that each bin contains at most one
data sample point. Then (6.7) simplifies to

prob(s_i | m_i) ∝ Π_j (1 + m_j), (6.8)

where the product is only over boxes where sj = 1. Assuming mj ≫ 1, this is just the formula for
the continuous case. In other words, (6.7) has the correct large-mi limit. ⊓

PROBLEM 6.3: In the preceding example, we took M and S as fixed, and got a multinomial dis-
tribution for prob(si , mi | wi ). An alternative method would be to let M and S be the expectation
values for the totals when mi and si are drawn from the bin probabilities by a Poisson process.
What would this procedure give for prob(si , mi )? [3]

Apart from fitting model distribution functions to samples, and comparing different
models for the same data, we need to worry about goodness of fit. We may have to
judge whether a given data set could plausibly have come from a particular distribution
function, or whether two discrete samples could plausibly have come from the same
probability distribution. In fact, it is enough to consider the second problem, since the
first is a limiting case of it.
We have to invent a statistic that tries to measure the mismatch between the two
samples in question; then we calculate the distribution of the statistic for random samples
drawn from the same probability distribution and check if the actual value is anomalous.
We could use the likelihood itself (or a function of it) as the statistic, as we did with the χ2
test. But if the data are one-dimensional, there is an easier way for which we don’t have
to know what the underlying probability distribution is.
The trick is to base the statistic on the cumulative distribution function. Suppose that
from some probability distribution prob(x) there are two samples: u1 , u2 . . . of size Nu ,
and v1 , v2 , . . . of size Nv . We consider the difference of cumulative fractions

s(x) = frac ({ui } < x) − frac ({vi } < x).

Figure 6.1 illustrates. Changing the variable x in prob(x) will not change s at any of the
sample points; in particular, x can be chosen to make prob(x) flat. Thus, any statistic

Figure 6.1: The difference of the cumulative distributions of two samples u1 , . . . , u10 and
v1 , . . . , v20 . The sizes of the two samples should be evident from the staircase. The x axis can
be stretched or shrunk without changing any statistic evaluated from the heights of the steps.

depending only on s(x) at the sample points will have its distribution independent of
prob(x).
There are many statistics defined from the s(x) staircase. The best known is the
Kolmogorov-Smirnov statistic, which is simply the maximum height or depth at any
point on the staircase. Another statistic (named after Cramér, von Mises, and Smirnov in
various combinations and permutations) is the sum of the squares of the vertical midpoints of
the steps. The distribution of these particular statistics can be calculated by pure thought,
but we won’t go into that. For applications, it’s more important to understand how to
calculate the distributions by simulating, and that is the subject of the next problem.

PROBLEM 6.4: The important thing to notice about the s(x) staircase is that it is just a one-
dimensional random walk constrained to return to the origin. This suggests a way of simulating
the distribution of statistics based on s(x).
1) Generate two samples of random numbers in [0, 1], u1 , u2 . . . of size Nu , and v1 , v2 , . . . of size
Nv .
2) Sort the samples {ui } and {vi }.
3) Go up along the sorted arrays of u and v; whenever the next lowest x comes from the u (v)
sample, increase (decrease) s(x) by 1/Nu (1/Nv ). This generates the staircase.
4) Calculate the statistic from the staircase.
Iterating the process above gives the distribution of the statistic. Your job is to write a program
to do this. Plot up the p-value distribution prob(κ′ > κ) for the Kolmogorov-Smirnov statistic κ for
Distribution Function Fitting 49

Nu = 10, Nv = 20. Overplot the asymptotic pure-thought formula


prob(κ′ > κ) = f(µκ),   f(λ) = 2 Σ_{k=1}^{∞} (−1)^{k−1} exp[−2k²λ²],
µ ≃ √N̂ + 0.12 + 0.11/√N̂,   N̂ = N_u N_v/(N_u + N_v).

Then invent your own statistic and plot the p-value distribution for it. [4]
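As a starting point, here is a Python sketch of steps 1–3 for a single pair of samples, giving the Kolmogorov-Smirnov statistic from the staircase; iterating it to build up the distribution, and the plotting, are left as stated in the problem.

```python
# One pass of the staircase construction: the Kolmogorov-Smirnov statistic
# for two samples of uniform random numbers (steps 1-3 of the problem).
import random

Nu, Nv = 10, 20
u = sorted(random.random() for _ in range(Nu))
v = sorted(random.random() for _ in range(Nv))

s, kappa = 0.0, 0.0
i = j = 0
while i < Nu or j < Nv:
    # take the next lowest x from whichever sorted sample it comes from
    if j >= Nv or (i < Nu and u[i] < v[j]):
        s += 1.0 / Nu
        i += 1
    else:
        s -= 1.0 / Nv
        j += 1
    kappa = max(kappa, abs(s))     # maximum height or depth of the staircase

print("KS statistic for this pair of samples: %.3f" % kappa)
```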

An important modification of the above is when you want to compare a sample against
a given probability distribution function. For instance, after doing a least-squares fit you
may want to test whether the residuals are consistent with a Gaussian of mean 0 and the
estimated dispersion. The modification is easy: you just let u be the sample and v the
distribution function with Nv → ∞. (For statistics involving vertical midpoints, to be well-
defined you’ll have to use only the rising steps.) In some cases the distribution function
in question might already have been fitted to the data by adjusting some parameters. We
encountered this problem in the χ2 test too, and in the linearizable case we solved it by
reducing the number of degrees of freedom (equation 5.37 on page 43) in the distribution
of χ2 . But for Kolmogorov-Smirnov and related tests, I know of no way of allowing for
the effect of fitted parameters.
Another possible generalization is to several different samples: just choose a statistic
which is calculated for pairs of samples and then combined in some sensible way.
Bear in mind, though, that cumulative statistics work only in one-dimension. There is
a “two-dimensional Kolmogorov-Smirnov test”, but it assumes that the two-dimensional
distribution function involved is a product of two one-dimensional distribution functions.
In general, if you are dealing with a sample in two or more dimensions then life is much
more difficult.
7. Entropy

So far in this book, prior probabilities have not been very interesting, because we have
not had interesting prior information to put into them. In the last two chapters, however,
the interest will shift to assigning probability distributions so as to incorporate prior
constraints. We will derive informative priors, ready to be multiplied by data likelihoods
in the usual way. Sometimes these priors will be so informative that they start to behave
almost like posteriors; this can happen when the data, though incomplete, are so accurate
that one might as well combine them with prior information in one probability distribution
and dispense with the likelihood.
The key player in this chapter is the entropy function

N
X
S=− pi log pi (7.1)
i=1

associated with a probability distribution p1 , . . . , pN . We have seen entropy briefly already


(page 12), but now we digress into some elementary information theory to see where it
comes from.

DIGRESSION [Shannon’s Theorem] We suppose there is a real function S(p1 , . . . , pN ) that measures
the uncertainty in a probability distribution, and ask what functional form S might have. We impose
three requirements.
(i) S is a continuous function of the pi .
(ii) If all the pi happen to be equal, then S increases with N. In other words,

s(N) ≡ S(1/N, . . . , 1/N) (7.2)

is a monotonically increasing function of N. Qualitatively, this means that if there are more
possibilities, we are more uncertain.
(iii) S is additive, in the following sense. Suppose we start with two possibilities with probabilities
p1 and 1 − p1 . Then we decide to split the second possibility into two, with probabilities p2
and p3 = 1 − p1 − p2 . We require that S(p1 , p2 , p3 ) should be the uncertainty from p1 , 1 − p1
plus the uncertainty associated with splitting up the second possibility. In other words
 
S(p1, p2, p3) = S(p1, 1 − p1) + (1 − p1) S(p2/(1 − p1), p3/(1 − p1)). (7.3)

More generally, consider M mutually exclusive propositions with probabilities q1 , . . . , qM .


Instead of giving these probabilities directly we do the following. We group the first k together
and call the group probability p1 , and write p2 for the next group of the (k + 1)st to (k + l)th
and so on for N groups. Then we give the conditional probabilities for propositions within


each group, which will be q1 /p1 , . . . , qk /p1 , qk+1 /p2 , . . . , qk+l /p2 and so on. The additivity
requirement is then
 
S(q1, . . . , qM) = S(p1, . . . , pN) + p1 S(q1/p1, . . . , qk/p1)
                 + p2 S(qk+1/p2, . . . , qk+l/p2) + · · · (7.4)

The continuity requirement means that we only need to fix the form of S for rational values of
the pi . So let us consider (7.4) for
p_i = n_i/M,   q_i = 1/M, (7.5)
where M and the ni are all integers. Then (7.4) becomes

s(M) = S(p1, . . . , pN) + Σ_{i=1}^{N} p_i s(n_i), (7.6)

and this is true for general rational pi . Now choose ni = M/N, assuming N divides M. This gives

s(M) = s(N) + s(M/N), (7.7)

and among other things it immediately tells us that s(1) = 0. Had M and N been real variables
rather than integers, we could differentiate (7.7) with respect to N, set N = 1 and integrate with the
initial condition s(1) = 0 to obtain the unique solution s(N) ∝ ln N. As things stand, s(N) ∝ ln N
clearly is a solution, but we still have to prove uniqueness.
This is where the monotonicity requirement comes in. First, we note that (7.7) implies

    s(n^k) = k\, s(n)        (7.8)

for any integer k, n. Now let k, l be any integers ≥ 2. Then for sufficiently large n we can find an
integer m such that
    \frac{m}{n} \le \frac{\ln k}{\ln l} < \frac{m+1}{n}, \quad \text{or} \quad l^m \le k^n < l^{m+1}.        (7.9)
Since s is monotonic, we have from (7.8) and (7.9) that

    m\, s(l) \le n\, s(k) \le (m+1)\, s(l), \quad \text{or} \quad \frac{m}{n} \le \frac{s(k)}{s(l)} \le \frac{m+1}{n}.        (7.10)

Comparing (7.10) and (7.9) we have



    \left| \frac{\ln k}{\ln l} - \frac{s(k)}{s(l)} \right| \le \frac{1}{n}.        (7.11)

Since n could be arbitrarily large, (7.11) forces s(N) ∝ ln N. We can just write s(N) = log N
leaving the base arbitrary. Substituting in (7.6) and using (7.5), we are led uniquely to (7.1) as a
measure of uncertainty. ⊓
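
The grouping requirement (7.3) is easy to check numerically for the form (7.1). A minimal sketch in Python; the three-outcome distribution is an arbitrary illustrative choice.

    import numpy as np

    def entropy(p):
        """Shannon entropy -sum p ln p, ignoring zero entries."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    # An arbitrary three-outcome distribution.
    p1, p2, p3 = 0.5, 0.3, 0.2

    lhs = entropy([p1, p2, p3])
    rhs = entropy([p1, 1 - p1]) + (1 - p1) * entropy([p2 / (1 - p1), p3 / (1 - p1)])
    print(lhs, rhs)    # the two numbers agree
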


ASIDE It is easy enough to generalize the entropy expression (7.1) to continuous probability distri-
butions. We have

    S(p(x)) = -\int p(x) \log\!\left[\frac{p(x)}{w(x)}\right] dx.        (7.12)
Here the weight w(x) is needed to keep S invariant under changes of the variable x. But in practice
the continuous expression (7.12) does not see much use. In problems where a continuous probability
distribution is needed, people usually discretize initially and then pass to the continuous limit late
in the calculation when entropy no longer appears explicitly. ⊓

According to the maximum entropy principle, entropy’s purpose in life is to get maximized
(by variation of the pi ) subject to data constraints. In general, the data constraints could
have any form, and the maximization may or may not be computationally feasible. But if
the data happen to consist entirely of expectation values over the probability distribution
then the maximization is both easy and potentially very useful. Let us take up this case.
Consider a variable ε which can take discrete values εi , each having probability pi .
The number of possible values may be finite or infinite, but \sum_i p_i = 1. There are two
known functions x(ε) and y(ε), and we have measured their expectation values

    X = \sum_i p_i\, x(ε_i), \qquad Y = \sum_i p_i\, y(ε_i).        (7.13)

[There could be one, three, or more such functions, but let’s consider two.] We want to
assign the probabilities pi on the basis of our knowledge of the functional forms of x and
y and their measured expectation values. We therefore maximize the entropy subject to
the constraints (7.13) and also \sum_i p_i = 1. The equation for the maximum is

    \frac{\partial}{\partial p_i}\left[ \sum_i p_i \ln p_i + \lambda \sum_i p_i
        + x \sum_i p_i\, x(ε_i) + y \sum_i p_i\, y(ε_i) \right] = 0,        (7.14)

where λ, x, y are Lagrange multipliers to be set so as to satisfy the constraints. (We


might as well replace log in the entropy with ln, because the base of the log just amounts
to a multiplicative factor in the Lagrange multipliers.) Eliminating λ to normalize the
probabilities we get
    p_i = \frac{\exp[-x\, x(ε_i) - y\, y(ε_i)]}{Z(x, y)},
    \qquad Z = \sum_i \exp[-x\, x(ε_i) - y\, y(ε_i)].        (7.15)

Here and later x and y are understood to be functions of the measured values X and Y.
Z is traditionally called the partition function1 and it is very important because lots of
things can be calculated from it.

1
The symbol Z comes from the partition function’s German name Zustandssumme, or sum-over-states,
which is exactly what it is.
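
In practice the Lagrange multipliers x and y have to be found numerically from the measured X and Y. The following is a minimal sketch (Python with scipy) of one way to do this, by adjusting x and y until the expectation values implied by (7.15) match the data; the discrete values ε_i, the functions x(ε) and y(ε), and the measured values are all made-up choices, included only to show the mechanics.

    import numpy as np
    from scipy.optimize import fsolve

    # Made-up setup: ε takes the values 0..9 and we constrain <ε> and <ε²>.
    eps = np.arange(10, dtype=float)
    x_of_eps, y_of_eps = eps, eps**2
    X_meas, Y_meas = 3.0, 12.0          # measured expectation values (illustrative)

    def moment_mismatch(lams):
        """Return <x(ε)> - X and <y(ε)> - Y for trial multipliers (x, y)."""
        x, y = lams
        w = np.exp(-x * x_of_eps - y * y_of_eps)
        p = w / w.sum()                  # equation (7.15)
        return [p @ x_of_eps - X_meas, p @ y_of_eps - Y_meas]

    x, y = fsolve(moment_mismatch, [0.0, 0.0])
    w = np.exp(-x * x_of_eps - y * y_of_eps)
    p = w / w.sum()
    print(x, y)
    print(p @ x_of_eps, p @ y_of_eps)    # reproduces X_meas, Y_meas
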

DIGRESSION [Global entropy maximum] At this point we should make sure that the solution (7.15)
really is the global maximum of the entropy, and not just any stationary point. To do this let us
backtrack a bit and consider the pi as not yet fixed. Now consider a set of non-negative numbers
q_i satisfying \sum_i q_i = 1; these are formally like probabilities, but are not necessarily equal to the p_i.
Now ln x ≤ (x − 1) for any finite non-negative x, with equality if and only if x = 1; this implies
   
    \sum_i p_i \ln\!\left(\frac{q_i}{p_i}\right) \le \sum_i p_i \left(\frac{q_i}{p_i} - 1\right) = 0,        (7.16)

and hence

    S(p_1, p_2, \ldots) \le -\sum_i p_i \ln q_i,        (7.17)

with equality if and only if all the qi = pi . If we now choose

    q_i = \frac{\exp[-x\, x(ε_i) - y\, y(ε_i)]}{Z(x, y)},        (7.18)

with Z defined as in (7.15), then (7.17) becomes

S(p1 , p2 , . . .) ≤ ln Z(x, y) + xX + yY. (7.19)

But the assignment (7.15) for the pi gives precisely this maximum value, so (7.15) must be the
maximum-entropy solution. ⊓

When calculating the partition function it is very important to be aware of any degen-
eracies. A degeneracy is when there are distinct values ε_i ≠ ε_j giving x(ε_i) = x(ε_j) and
y(εi ) = y(εj ). If we disregard degeneracies, we may wrongly sum over distinct values of
x and y rather than distinct values of ε.

EXAMPLE [Gaussian and Poisson distributions revisited] Suppose ε is a continuous variable. Then
the partition function is

    Z = \int w(ε) \exp[-x\, x(ε) - y\, y(ε)]\, dε,        (7.20)

where w(ε), which expresses the degeneracy, is called the density of states. Now suppose that the
data we have are X = ⟨ε⟩ and Y = ⟨ε²⟩. Then

    Z = \int w(ε) \exp[-xε - yε^2]\, dε,        (7.21)

and hence
    prob(ε) ∝ w(ε) \exp[-xε - yε^2],        (7.22)

with x and y taking appropriate values so as to give the measured hεi and hε2 i. The interpretation
is that when all we are given about a probability distribution are its mean and variance, our most
conservative inference is that it is a Gaussian.

Now consider another situation. We are given that the expectation value for the number of
events in a certain time interval is m. What can we say about the probability distribution of the
number of events? To work this out, divide the given time interval into N subintervals, each so
small that only 0 or 1 event can occur in it. The degeneracy for n events in the full interval is the
number of ways we can distribute n events among the N subintervals, or
N Nn
Cn ≃ for N ≫ n. (7.23)
n!
We then have

    Z(x) = \sum_{n=0}^{\infty} \frac{N^n}{n!}\, e^{-nx} = \exp\!\left(N e^{-x}\right).        (7.24)
We solve for x (which recall is the Lagrange multiplier) by equating the mean implied by Z(x) to m,
getting
    x = -\ln\frac{m}{N}, \qquad Z = e^m,        (7.25)
whence
    prob(n \mid ⟨n⟩ = m) = \frac{N^n}{n!}\, \frac{e^{-nx}}{Z(x)} = e^{-m} \frac{m^n}{n!}.        (7.26)
In other words, for an event-counting probability distribution where we know only the mean, our
most conservative inference is that it is Poisson. ⊓
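
The Poisson conclusion can also be checked by brute force: maximize the entropy relative to the degeneracy factor, that is -Σ_n p_n ln(p_n/w_n) with w_n ∝ 1/n! (the constant N^n drops out once the mean is fixed), subject to Σ_n p_n = 1 and Σ_n n p_n = m, and compare with e^{-m} m^n/n!. A minimal sketch in Python with scipy; the cutoff K and the value of m are illustrative choices.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln
    from scipy.stats import poisson

    m, K = 3.0, 30                         # illustrative mean and cutoff
    n = np.arange(K + 1)
    log_w = -gammaln(n + 1)                # degeneracy weight w_n proportional to 1/n!

    def neg_rel_entropy(p):
        p = np.clip(p, 1e-12, None)        # avoid log(0)
        return np.sum(p * (np.log(p) - log_w))

    constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                   {"type": "eq", "fun": lambda p: p @ n - m}]
    p0 = np.full(K + 1, 1.0 / (K + 1))
    res = minimize(neg_rel_entropy, p0, bounds=[(0, 1)] * (K + 1),
                   constraints=constraints, method="SLSQP",
                   options={"maxiter": 500})

    print(np.max(np.abs(res.x - poisson.pmf(n, m))))   # small: the maximum is Poisson
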

PROBLEM 7.1: Using equation (7.15) we can write


    ⟨x(ε)⟩ = -\frac{\partial \ln Z}{\partial x}.

The left hand side is the mean X. Give

    \frac{\partial^2 \ln Z}{\partial x^2}

some kind of interpretation. [3]

Let us leave partition functions now, before they completely take over this chapter, and
return to least-squares, but with a new complication.
A general problem in image reconstruction is to infer an image on many picture
elements (pixels) fj from data di taken with a blurred and noisy camera:
    d_i = \sum_j R_{ij} f_j + n_i(σ).        (7.27)
Here Rij is a blurring function, indicating that a fraction of the light that properly belongs
to the j-th pixel in practice gets detected on the i-th pixel. [I am stating the problem
in optical terms, but really Rij could be any linear operator.] There is noise on top of
the blurring. This looks like a straightforward least-squares problem with likelihood
(cf. equation 5.5 on page 36)
    prob(D \mid f_1, f_2, \ldots) ∝ \exp\!\left[ -\tfrac{1}{2} σ^{-2} \sum_i \Big( \sum_j R_{ij} f_j - d_i \Big)^2 \right].        (7.28)
The complication is that the number of data points is not much greater than the number of
pixels we want to reconstruct; it may even be less. So it is important to think about what
else we know. For instance, if the fj represent an image, the values must be non-negative.
What else should we put in the prior?

DIGRESSION [The monkey argument] One line of reasoning that leads to a workable prior is what’s
nowadays called the monkey argument. Imagine a team of monkeys randomly chucking peanuts
into jars. When the peanuts land in the jars they get squashed and turned into peanut butter. There
are N peanuts and M jars. When the monkeys have finished, what is the probability distribution
for the amount of peanut butter in each jar? To work this out, say the i-th jar gets ni peanuts. The
distribution of the ni follows a multinomial distribution

    prob(n_i \mid N, M) = M^{-N} \frac{N!}{\prod_{i=1}^{M} n_i!}.        (7.29)

If the number of peanuts is so much larger than the number of jars that ni ≫ 1 then we can use
Stirling’s approximation (formula M.7 on page 70), and

    \ln prob(n_i \mid N, M) = -N \ln M - \sum_{i=1}^{M} n_i \ln(n_i/N).        (7.30)

If we write fi for ni /N (the fraction of the total peanut butter) then (7.30) becomes
    \ln prob(f_1, f_2, \ldots \mid N, M) = -N \ln M - N \sum_i f_i \ln f_i.        (7.31)

We can take this story as a parable for an image processing problem: the jars correspond to pixels
and the peanut butter to brightness at each pixel. It suggests the prior
    prob(f_1, f_2, \ldots) = \exp\!\left( -α \sum_i f_i \ln f_i \right).        (7.32)

Here α is an unknown parameter; it arose from our discretizing into peanuts. In applications α will
have to be marginalized away or fixed at the value that maximizes the posterior. ⊓

It is tempting to interpret images as probability distributions, and identify (7.31) with the
entropy. But the connection, if there is one, is not so simple, because (7.31) is a measure of
prior probability while (7.1) is a measure of the uncertainty associated with a probability
distribution. This issue remains controversial, and until it is resolved, it seems prudent
to refer to (7.31) as a ‘configurational’ entropy, not to be identified too closely with
information-theoretic entropy.
Another reason not to identify fj with a probability distribution is that it would leave
us at a loss to interpret the following,1 which is a prior for when fj is allowed to be
negative.

DIGRESSION [Macaulay and Buck’s prior] We now consider a different form of the monkey ar-
gument. This time, we have the monkeys tossing both positive and negative peanuts, and what
interests us is the difference between positive and negative kinds of peanut butter in each jar. Also,

1 V.A. Macaulay & B. Buck, Nucl Phys A 591, 85–103 (1995).



instead of having a fixed number of peanuts and hence a multinomial distribution among the jars,
we suppose that the distribution of peanuts in any jar follows a Poisson distribution with mean
µ+ for positive peanuts and µ− for negative peanuts. (The other case could be worked out with a
Poisson distribution as well.) The probability for the i-th jar to have ni+ positive peanuts and ni−
negative peanuts is
    prob(n_{i+}, n_{i-}) = e^{-(µ_+ + µ_-)} \frac{(µ_+)^{n_{i+}} (µ_-)^{n_{i-}}}{n_{i+}!\, n_{i-}!}.        (7.33)
Let us consider the term from one jar and suppress the subscript i. Writing n = ½(n_+ + n_-) and
q = ½(n_+ - n_-) we have

    prob(q, n) = e^{-(µ_+ + µ_-)} \left(\frac{µ_+}{µ_-}\right)^{q} (µ_+ µ_-)^{n} \frac{1}{(n+q)!\,(n-q)!}.        (7.34)

Writing β² = (µ_+ µ_-) and γ = (µ_+ / µ_-) changes (7.34) to

    prob(q, n) = e^{-β(\sqrt{γ} + 1/\sqrt{γ})} \frac{γ^{q}\, β^{2n}}{(n+q)!\,(n-q)!}.        (7.35)

Since we are really interested in q, we marginalize n away by summing over it. The series is
essentially a modified Bessel function:


    prob(q) = \sum_{n=|q|}^{\infty} prob(q, n) = e^{-β(\sqrt{γ} + 1/\sqrt{γ})}\, γ^{q}\, I_{2q}(2β).        (7.36)

Using the asymptotic large-q form of I2q and ignoring factors varying as ln q, we have
    \log prob(q) = \sqrt{(2q)^2 + (2β)^2} - β(\sqrt{γ} + 1/\sqrt{γ}) + q \ln γ - 2q \sinh^{-1}\frac{q}{β},        (7.37)

where the base of the logarithm is an arbitrary constant. Changing variables from the integer q
to the real number f = 2qǫ (where ǫ is sort of the amount of butter per peanut), and also writing
w = 2βǫ, we have

    \log prob(f) = \sqrt{f^2 + w^2} - \tfrac{1}{2} w(\sqrt{γ} + 1/\sqrt{γ}) + \tfrac{1}{2} f \ln γ - f \sinh^{-1}\frac{f}{w}.        (7.38)
If we further write \bar{f} = w \sinh\!\left(\tfrac{1}{2} \ln γ\right) = \tfrac{1}{2} w(\sqrt{γ} - 1/\sqrt{γ}) then the prior has the nice form

    \log prob(f) = (f^2 + w^2)^{1/2} - (\bar{f}^2 + w^2)^{1/2}
                 + f \left[ \sinh^{-1}(\bar{f}/w) - \sinh^{-1}(f/w) \right],        (7.39)



for one pixel, the full log prob being a sum of such terms. Here \bar{f}, w, and the proportionality constant
are parameters which have to be determined or marginalized away. ⊓


We now have two priors,


    \log prob(f_1, f_2, \ldots) = -\sum_i f_i \ln f_i \quad \text{or}
    \quad \sum_i \left[ (f_i^2 + w^2)^{1/2} - (\bar{f}^2 + w^2)^{1/2}
        + f_i \left( \sinh^{-1}(\bar{f}/w) - \sinh^{-1}(f_i/w) \right) \right],        (7.40)

the first if fi must be positive, the second if fi can be negative. In either case the posterior
is
    prob(f_1, f_2, \ldots \mid D) ∝ \exp\!\left[ α \ln prob(f_1, f_2, \ldots) - \tfrac{1}{2} χ^2 \right],
    \qquad χ^2 = σ^{-2} \sum_i \Big( \sum_j R_{ij} f_j - d_i \Big)^2.        (7.41)

Despite their formidable appearance, these equations are quite practical, even with mil-
lions of pixels. It helps numerical work tremendously that both forms of the prior, as well
as χ2 , are convex functions of the fi . The standard technique to locate the maximum of
the posterior proceeds iteratively in two stages. In the first stage, one just tries to reach a
place with the correct value of χ2 ; that value is the number of pixels minus the number of
non-negligible singular values of Rij (cf. equation 5.38 on page 43). The second stage of
iterations holds χ2 fixed while increasing prob(f1 , f2 , . . .) as far as possible. This amounts
to treating α as a Lagrange multiplier. Finally, one can present uncertainties in various
ways. Applications are numerous and varied.1

1See B. Buck & V.A. Macaulay eds., Maximum entropy in action (Oxford University Press 1991) for some
examples.
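
To make the procedure concrete, here is a minimal sketch (Python with scipy) of posterior maximization with the positive-f_j prior of (7.40), on a tiny made-up one-dimensional 'image'. For simplicity it maximizes α × (configurational entropy) - ½χ² at a single hand-picked α instead of implementing the two-stage χ²-targeting iteration described above; the blurring matrix, noise level, and α are all illustrative choices.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    # Tiny 1-d toy problem: true image, Gaussian blur matrix R, noisy data d.
    M = 30
    x = np.arange(M)
    f_true = (np.exp(-0.5 * ((x - 10) / 1.5) ** 2)
              + 0.6 * np.exp(-0.5 * ((x - 20) / 1.0) ** 2))
    R = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 2.0) ** 2)
    R /= R.sum(axis=1, keepdims=True)
    sigma = 0.01
    d = R @ f_true + rng.normal(0.0, sigma, M)

    alpha = 0.05        # hand-picked here; properly it is fixed by the chi^2 condition

    def neg_log_posterior(f):
        f = np.clip(f, 1e-10, None)                    # keep f positive
        entropy = -np.sum(f * np.log(f))               # configurational entropy (7.31)
        chi2 = np.sum((R @ f - d) ** 2) / sigma**2     # as in (7.41)
        return -(alpha * entropy - 0.5 * chi2)

    res = minimize(neg_log_posterior, np.full(M, f_true.mean()),
                   bounds=[(1e-10, None)] * M, method="L-BFGS-B")
    f_hat = res.x
    print("chi^2 of the reconstruction:", np.sum((R @ f_hat - d) ** 2) / sigma**2)
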
8. Entropy and Thermodynamics
To conclude this book we’ll consider entropy in its historically original context, and
see that the information theoretic ideas from the previous chapter lead to a beautiful
reinterpretation of the branch of physics known as thermodynamics.1 This chapter will
try to explain all the necessary physical ideas as it goes along, so it is not essential to have
seen thermodynamics before; but if you have, it will be easier to follow.
Thermodynamics is about measuring a few macroscopic properties (such as inter-
nal energy, volume) of systems that are microscopically very complex—a gram of water
has > 3 × 10^{22} molecules—and predicting other macroscopic properties (such as tem-
perature, pressure). Given some macroscopic data we assign probabilities p1 , p2 , . . . to
different microstates using the principle of maximum entropy, and then use the assigned
probabilities to predict other macroscopic quantities. (Microstate refers to the detailed
state—including position and energy of every molecule—of the system. Naturally, we
never deal with microstates explicitly.)
In general, maximizing entropy subject to arbitrary data constraints is a hopelessly
difficult calculation even for quite small systems, never mind systems with 10^{22} molecules.
What saves thermodynamics is that the macroscopic measurables are either expectation
values, that is, of the form
    X = \sum_i p_i\, x(ε_i), \qquad Y = \sum_i p_i\, y(ε_i)        (8.1)

which we considered in the previous chapter, or somehow related to expectation values.


For example, if x(εi ) is the energy of the microstate labelled by εi , then the expectation
value X will be the internal energy. Any measurable that is extensive (meaning that it
doubles when you clone the system) such as internal energy, volume, particle number,
can play the role of X, Y. [There may be any number of X, Y variables in a problem, two
is just an example.] Non-extensive measurables like temperature and pressure cannot be
X, Y variables but will turn out to be related to them in an interesting way.
The key to maximizing the entropy subject to X, Y is the partition function
    Z(x, y) = \sum_i \exp[-x\, x(ε_i) - y\, y(ε_i)].        (8.2)

The maximum-entropy probability values are given by

    p_i = \frac{\exp[-x\, x(ε_i) - y\, y(ε_i)]}{Z(x, y)}        (8.3)

1 In standard physics usage, this chapter is about thermodynamics and statistical mechanics. However, I
will use the less common terms ‘classical thermodynamics’ and ‘statistical thermodynamics’, which are a little
more descriptive.


(cf. equation 7.15 on page 52). The x, y variables are Lagrange multipliers associated with
the entropy maximization (originally appearing in equation 7.14) and hence ultimately
functions of the measured X, Y. Later on, the x, y will turn out to represent non-extensive
thermodynamic variables, but for now they are just arguments of Z.
Let us evaluate the entropy for the probability distribution (8.3), and then write S(X, Y)
for the maximum value.1 We get

S(X, Y) = ln Z(x, y) + xX + yY, (8.4)

and from (8.2) and (8.3) we have


   
    X = -\left(\frac{\partial \ln Z}{\partial x}\right)_{y}, \qquad Y = -\left(\frac{\partial \ln Z}{\partial y}\right)_{x}.        (8.5)

Differentiating (8.4) and inserting (8.5) gives us two complementary relations


   
    x = \left(\frac{\partial S}{\partial X}\right)_{Y}, \qquad y = \left(\frac{\partial S}{\partial Y}\right)_{X}.        (8.6)

Equations (8.4) to (8.6) express a so-called Legendre transformation. The functions S(X, Y)
and ln Z(x, y) are Legendre transforms of each other. The variables x and y are conjugate
to X and Y respectively in the context of Legendre transforms; correspondingly, −X and
−Y are conjugate to x and y. What this means is that although we started by assuming X
and Y as measured (independent variables) and x and y as functions of them, we equally
can take x and y as measured and use the same relations to infer X and Y. In fact, we are
free to choose any two out of X, Y, x, y, S, Z as the independent variables.
The partition function (8.2) and the Legendre transform relations (8.4) to (8.6) encap-
sulate the formal structure of statistical thermodynamics. Statistical thermodynamics is
a mixed micro- and macroscopic theory: the data are all macroscopic, but to compute
the partition function we need to have some microscopic knowledge; for example, if X
is the internal energy then x(εi ) in equations (8.1) and (8.2) will involve the microscopic
energy levels and Z will involve a sum over possible energy levels. Once the partition
function is in hand, the Legendre transform relations express the macroscopic thermo-
dynamic variables in terms of microscopic quantities, and all sorts of useful results can
be derived. All predictions are of course probabilistic, since they refer to the maximum
of S(p1 , p2 , . . .); but given the extremely large number of microstates (the so-called ther-
modynamic limit) the maximum of S(p1 , p2 , . . .) is usually extremely precise, and the
macroscopic uncertainties are usually negligible.

1 We are making a subtle but important semantic shift here. So far we used ‘entropy’ to mean S(p1 , p2 , . . .),
a function of the probability distribution and hence of our state of knowledge. From now on ‘entropy’ will
mean S(X, Y) which is the maximum value of S(p1 , p2 , . . .), and depends only on the measured X, Y. It is,
unfortunately, conventional to use the same name and symbol for both of these distinct concepts. Maybe
authors secretly enjoy the endless confusion it causes.
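
The Legendre-transform relations are easy to verify numerically for a toy system. The sketch below (Python) takes 50 made-up microstates, builds Z(x, y) from (8.2), obtains X and Y from (8.5) by finite differences, and checks that S from (8.4) agrees with -Σ p_i ln p_i computed directly from the probabilities (8.3). All the numbers are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    x_eps = np.sort(rng.uniform(0.0, 5.0, 50))      # made-up x(ε_i)
    y_eps = rng.uniform(0.0, 2.0, 50)               # made-up y(ε_i)

    def lnZ(x, y):
        return np.log(np.sum(np.exp(-x * x_eps - y * y_eps)))   # equation (8.2)

    x, y, h = 0.7, 0.3, 1e-5             # chosen conjugate variables and step

    # X and Y from equation (8.5), by central differences.
    X = -(lnZ(x + h, y) - lnZ(x - h, y)) / (2 * h)
    Y = -(lnZ(x, y + h) - lnZ(x, y - h)) / (2 * h)

    # Entropy two ways: the Legendre transform (8.4), and directly from (8.3).
    S_legendre = lnZ(x, y) + x * X + y * Y
    p = np.exp(-x * x_eps - y * y_eps)
    p /= p.sum()
    S_direct = -np.sum(p * np.log(p))
    print(S_legendre, S_direct)          # agree to finite-difference accuracy
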

The macroscopic subset of statistical thermodynamics is classical thermodynamics.


In classical thermodynamics the partition function is never calculated explicitly, hence
macroscopic variables can be related to each other but not to anything microscopic. For-
mally, classical thermodynamics consists just of the Legendre transform relations (8.4) to
(8.6) and their consequences. A major role is played by differential formulas of the type

dS = x dX + y dY, (8.7)

(which follows from equations 8.4 and 8.6). In fact, all classical thermodynamics formulas
are basically identities between partial derivatives. But because the partial derivatives are
physically relevant, classical thermodynamics is a powerful physical theory. And because
it requires no microscopic knowledge it is the most robust branch of physics, the only
branch to remain fundamentally unchanged through the 20th century.
But enough about formal structure. Let us see now how the formalism actually
works, taking examples first from classical thermodynamics and then from statistical
thermodynamics.
Suppose X and Y are the internal energy E and the volume V of a system. The system
is in equilibrium, i.e., there is no tendency for E, V to change at the moment. Then the
Legendre transform relations (8.4) and (8.6) give

    S(E, V) = \frac{1}{τ} E + \frac{p}{τ} V + \ln Z(τ, p/τ),
    \qquad \text{where} \quad \frac{1}{τ} \equiv \left(\frac{\partial S}{\partial E}\right)_{V}, \quad \frac{p}{τ} \equiv \left(\frac{\partial S}{\partial V}\right)_{E}.        (8.8)

Despite the hopeful choice of symbols τ, p, we have as yet no physical interpretation for
these, nor any way of assigning a numerical value to S on a macroscopic basis. We do,
however, have the important principle that the total entropy of a system does not change
during a reversible process, though entropy may be transferred between different parts of
a system. A reversible change cannot change the net information we have on the system’s
internal state, so S does not change. During a reversible change, a system effectively
moves through a sequence of equilibrium states. Applying the principle

reversible ⇒ constant total S

to
τ dS = dE + p dV, (8.9)

(which is just 8.7 for the case we are considering) we can interpret τ as temperature, p as
pressure, τ dS as heat input, and define scales for measuring temperature and entropy.
The following digression explains the physical arguments leading to all this.

DIGRESSION [Heat is work and work is heat] To get macroscopic interpretations for τ, p and S, we
do some thought experiments. We imagine a gas inside a cylinder with a piston, with external heat
sources and sinks, and subject the gas to various reversible processes. This imaginary apparatus is
just to help our intuition; the arguments and conclusions are not restricted to gases in cylinders. We
reason as follows.
(1) That p must be pressure is easy to see. If we insulate the gas against heat we have dS = 0, and
if we then compress it dE = −p dV ; since dE is the differential internal energy, from energy
conservation p dV must be the differential work done by the gas, so p must be pressure.
(2) Next, we can infer that τ dS is differential heat input. If we supply (or remove) external
heat, the input heat must be the internal energy change plus the work done by the gas. The
interpretation of τ dS follows from (8.9). Under reversibility, the total entropy of gas and heat
source/sink is still constant, but entropy moves with the heat. We may now identify equation
(8.9) as a relation between heat and work; it is one form of the first law of thermodynamics.
(3) Interpreting τ as temperature is more complicated, because unlike pressure, temperature is not
already defined outside thermodynamics. First we show that τ is a parameter relating to heat
transfer. To do this, let us rewrite the Legendre transform (8.8) slightly: instead of taking E, V
as independent variables let us take S, V as independent and write E(S, V ). This gives us

    E(S, V) = τS - pV + G(τ, p),
    \qquad \text{where} \quad τ = \left(\frac{\partial E}{\partial S}\right)_{V}, \quad p = -\left(\frac{\partial E}{\partial V}\right)_{S},        (8.10)

and we have introduced G(τ, p) = −τ ln Z. The variables τ, p in (8.10) are really the same as
in (8.8): τ clearly so, and p because of the partial-derivative identity

    \left(\frac{\partial E}{\partial S}\right)_{V} \left(\frac{\partial S}{\partial V}\right)_{E} \left(\frac{\partial V}{\partial E}\right)_{S} = -1.        (8.11)

Equation (8.10) is a Legendre transform where −p is a conjugate variable to V , and τ is conjugate


to S. If two gases are separated by an insulated piston, then p dV work will be done by the gas
at higher p on the gas at lower p. Analogously, if two gases are kept at fixed volume then τ dS
heat will flow from the one at higher τ to the one at lower τ. This shows that τ is some sort of
temperature. But it still does not let us assign numbers to τ and S separately, only to τ dS.
(4) To make further progress, we take the gas through a cycle of processes

    S_low, τ_hot   →   S_high, τ_hot
        ↑                    ↓
    S_low, τ_cold  ←   S_high, τ_cold

known as a Carnot cycle. The horizontal arrows indicate changes of S at constant τ, as heat is
transferred into or out of the system. (To do this, the heat source or sink must offer a slight
temperature difference, but for the sake of reversibility the difference is assumed infinitesimal.)
The vertical arrows indicate changes of τ at constant S. All four arrows involve changes in
V . Say we start at the upper left. Along the rightward arrow, some heat (say Qin ) is input;
meanwhile the gas expands, doing work against the cylinder’s piston. During the downward

arrow, the gas is insulated and allowed to expand further, thus doing more work. Along the
leftward arrow the gas outputs some heat (say Qout ); meanwhile the gas has to be compressed,
so work is done on the gas through the piston. Finally during the upward arrow, the gas is
insulated and compressed further, thus doing more work on it, until it returns to its initial
state. (We cannot check directly that S has gone back to Slow , but since only two variables are
independent here, it is enough to check that p, V are at their initial values.) We now have

    Q_in = τ_hot (S_high - S_low), \qquad Q_out = τ_cold (S_high - S_low), \qquad Q_in > Q_out.        (8.12)

Thus, a Carnot cycle takes heat Qin from a source, gives Qout to a sink, and converts the
difference into net p dV work.
(5) We see from (8.12) that in a Carnot cycle

    \frac{Q_in}{Q_out} = \frac{τ_hot}{τ_cold}        (8.13)

regardless of the details. We can use this ratio of heats to define a scale for τ. In fact, the Kelvin
temperature scale is defined in this way: a Carnot cycle between 100 K and 200 K will have
Qin = 2Qout , and so on.1 The definition of 1 K is a matter of convention: it amounts to a
multiplicative constant in 1/τ and S but has no physical significance.
The preceding arguments show that it is possible to measure entropy macroscopically, through
integrals of the type

    ∆S = \int \frac{dE + p\, dV}{τ}        (8.14)
up to an additive constant. But the additive constant requires microscopic information and is not
measurable within classical thermodynamics. ⊓

So far we have used three kinds of thermodynamic variables: first S itself, then the exten-
sive measurables (X, Y variables), and third the non-extensive conjugates (x, y variables).
We can create a fourth kind by taking Legendre transforms; these are generically known
as thermodynamic potentials, and are rather non-intuitive.
We have, of course, already seen Z(x, y) as a Legendre transform of S(X, Y). We
can also define a sort of hybrid variable Z′ (x, Y), where the Legendre transform has been
applied with respect to X but not Y:

S(X, Y) = ln Z′ (x, Y) + xX. (8.15)

Analogous to (8.5) we have in this case

    X = -\left(\frac{\partial \ln Z'}{\partial x}\right)_{Y}, \qquad y = \left(\frac{\partial \ln Z'}{\partial Y}\right)_{x}.        (8.16)

1
The number 1−τcold /τhot is called the efficiency of a Carnot cycle, since it is the fraction of heat converted
to work. See any book on thermodynamics for why this is interesting and important.

From (8.15) and (8.16) we have

d[ln Z′ ] = −X dx + y dY. (8.17)

Z′ (x, Y) is a kind of partition function, but different from Z(x, y). In the sum (8.2) for
Z(x, y), x and y were parameters (originally Lagrange multipliers) used to enforce the
required values of the extensive variables X and Y respectively. In Z′(x, Y), y does not
appear, instead the extensive variable Y appears directly as a parameter, i.e., the sum over
states εi is restricted to fixed Y:
    Z'(x, Y) = \sum_{i,\ \text{fixed } Y} \exp[-x\, x(ε_i)].        (8.18)

We can go further and define yet another partition function via

S(X, Y) = ln Z′′ (X, Y), (8.19)

for which the explicit form is the weird-looking


    Z''(X, Y) = \sum_{i,\ \text{fixed } X, Y} 1.        (8.20)

Here X, Y are constrained directly in the sum, and no Lagrange multipliers are required.
This case amounts to working out all the microstates consistent with given X, Y, and
assigning equal probability to all those microstates, since in this case maximum-entropy
reduces to indifference.
When we need to compute a partition function, we are free to start with any of the
partition-function sums (8.2), or (8.18), or (8.20), and derive the others via the Legen-
dre transform relations (8.4), (8.15), or (8.19). In different situations, different partition
functions may be easiest to compute.
We can invent further thermodynamic potentials by swapping dependent and in-
dependent variables and then taking Legendre transforms. A good example is G(τ, p),
introduced in (8.10) as a Legendre transform of E(S, V):

G(τ, p) = E − τS + pV. (8.21)

G is called the Gibbs free energy, and equals −τ ln Z. Another example is the enthalpy
H(S, p): to derive it we start with E(S, V) again and replace V by its conjugate variable
while leaving S as it is, thus:

E(S, V) = −pV + H(S, p), (8.22)

and p = −(∂E/∂V)S we have already identified (before equation 8.10) as the pressure.

The zoo of thermodynamic variables we have begun to explore admits some elegant
identities. Going back to our original Legendre transform and differentiating (8.5) and
(8.6) we get

    \left(\frac{\partial X}{\partial y}\right)_{x} = \left(\frac{\partial Y}{\partial x}\right)_{y}, \qquad
    \left(\frac{\partial y}{\partial X}\right)_{Y} = \left(\frac{\partial x}{\partial Y}\right)_{X}.        (8.23)
[If there were more than two X, Y variables we would take them in pairs.] Such identities
are known as Maxwell relations, and are very useful for relating thermodynamic variables
in often surprising ways.
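
Relations like (8.23) can be checked numerically for any smooth ln Z(x, y). A minimal sketch in Python, using a made-up partition function over a handful of microstates and central finite differences; all values are illustrative.

    import numpy as np

    # A made-up partition function over four microstates.
    x_eps = np.array([0.0, 1.0, 2.0, 3.5])
    y_eps = np.array([0.2, 1.3, 0.7, 2.1])

    def lnZ(x, y):
        return np.log(np.sum(np.exp(-x * x_eps - y * y_eps)))

    def X(x, y, h=1e-5):                 # X = -d lnZ/dx, as in (8.5)
        return -(lnZ(x + h, y) - lnZ(x - h, y)) / (2 * h)

    def Y(x, y, h=1e-5):                 # Y = -d lnZ/dy
        return -(lnZ(x, y + h) - lnZ(x, y - h)) / (2 * h)

    x0, y0, h = 0.8, 0.4, 1e-4
    dX_dy = (X(x0, y0 + h) - X(x0, y0 - h)) / (2 * h)
    dY_dx = (Y(x0 + h, y0) - Y(x0 - h, y0)) / (2 * h)
    print(dX_dy, dY_dx)                  # equal, up to finite-difference error
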

PROBLEM 8.1: Applying the generic Maxwell relation formulas (8.23) to equation (8.21) we get:

    \left(\frac{\partial S}{\partial p}\right)_{τ} = -\left(\frac{\partial V}{\partial τ}\right)_{p}, \qquad
    \left(\frac{\partial τ}{\partial V}\right)_{S} = -\left(\frac{\partial p}{\partial S}\right)_{V}.        (8.24)

You can derive two more identities as follows. Introduce a new thermodynamic potential F(τ, V ) by
a suitable Legendre transform. Relate F to the enthalpy H from (8.22). Then derive the additional
Maxwell relations
    \left(\frac{\partial S}{\partial V}\right)_{τ} = \left(\frac{\partial p}{\partial τ}\right)_{V}, \qquad
    \left(\frac{\partial τ}{\partial p}\right)_{S} = \left(\frac{\partial V}{\partial S}\right)_{p}.
F(τ, V ) is known as the Helmholtz free energy. [3]

So far we have considered only equilibria and reversible processes, the latter being inter-
preted as sequences of equilibria. All these involve no net loss of information, and the
total entropy remains constant. In non-equilibrium situations and during irreversible pro-
cesses the theory is not valid, but if an irreversible process begins and ends at equilibrium
states we can still relate the beginning and end states to each other.
What we mean by an irreversible process in thermodynamics is that some information
about the past state of a system is lost. Thus the total entropy increases, and in place of
the equality (8.7) we have
dS ≥ x dX + y dY. (8.25)

Equality holds for reversible changes. The most important form of (8.25) is

τ dS ≥ dE + p dV, (8.26)

which replaces (8.9); it is a statement of the second law of thermodynamics.


We can calculate the entropy change in an irreversible process if we can connect the
beginning and end states by an imaginary reversible process. Typically, this reversible
process will introduce an imaginary external agent that makes the system do some extra
work and pays the system an equivalent amount of heat (and hence entropy); the irre-
versible process neither does the extra work nor receives the heat, it generates the entropy
by itself.

EXAMPLES [Some irreversible processes] A generic kind of irreversible process is an equalization


process: two systems in different equilibria are brought together and allowed to settle into a common
equilibrium.
A simple example is temperature equalization: two systems at τ1 and τ2 < τ1 are brought
together and allowed to exchange heat till their temperatures equalize. This would normally happen
in a non-equilibrium, irreversible manner. But we can imagine an outside agent mediating the heat
transfer reversibly. The agent takes an infinitesimal amount of heat δQ from the system at τ1 , and
goes through a Carnot cycle to transfer heat to the system at τ2 . The heat transferred through the
Carnot cycle will be (τ2 /τ1 )δQ, while (1 − τ2 /τ1 )δQ is converted to work, with no net change of
entropy. Now suppose the agent turns the work into heat and transfers it (still reversibly) to the
system at τ2 . The system gains entropy
    dS = \left(\frac{1}{τ_2} - \frac{1}{τ_1}\right) δQ > 0        (8.27)
from the agent, but no net energy. For the irreversible process there is no agent, but the entropy
increase is the same.
Another example is pressure equalization. Consider two gases at the same τ but different
p1 , p2 . They are separated by a piston which is free to move, and hence the pressures get equalized.
The system as a whole is insulated. Now imagine an agent that reversibly compresses the lower-
pressure (say p2 ) gas and lets the p1 gas expand, both by infinitesimal volume dV . The net work
done on the agent is (p1 − p2 ) dV . The agent turns this work into heat and distributes it between
the gases so as to keep their temperatures equal. The system gains
    dS = \frac{p_1 - p_2}{τ}\, dV        (8.28)
from the agent, and no net energy. Again, for the irreversible process there is no agent, the system
increases its own entropy.
A variant of the pressure-equalization example is gas expanding into a vacuum. Here (8.28)
simplifies to
    dS = \frac{p}{τ}\, dV > 0.        (8.29)
Another variant of pressure equalization is mutual diffusion: two gases initially at the same
temperature and pressure are allowed to mix. Each gas expands to the full volume, and though
expansion is not into a vacuum, once again there is no p dV work and no heat input. The entropy
change is again given by (8.29), but one must add the contributions from both gases.
But what if the mutually diffusing gases are the same? Then there is no macroscopic change,
but the above argument still gives us an entropy increase! This is the famous Gibbs paradox, and
the source of it is a tacit assumption that the mutually diffusing gases are distinguishable. One
can work around the Gibbs paradox by taking indistinguishability of gas molecules into account
in different ad-hoc ways1 but ultimately the paradox goes away only when we have a microscopic
theory for indistinguishable particles, or quantum statistics, on which more later. ⊓

1 Gibbs’s original resolution is in the closing pages of J.W. Gibbs Elementary Principles of Statistical Mechanics
(originally 1902, reprinted by Ox Bow Press 1981). An ingenious resolution using (almost) entirely macroscopic
arguments is given in Section 23 of Yu.B. Rumer & M.Sh. Ryvkin Thermodynamics, Statistical Physics, and Kinetics
(Mir Publishers 1980).

PROBLEM 8.2: Consider two cylinders with pistons, each filled with a gas and the two connected
by a valve. The pistons maintain different constant pressures p1 , p2 as gas is forced through the
valve. (The volumes of gas V1 , V2 change of course.) The system is thermally insulated. Show that
the enthalpy H of the gas remains constant as it is forced through the valve. Hence show that the
entropy change is given by
V
dS = − dp
τ
integrated over an imaginary reversible process.
To integrate the entropy change we would need to know V/τ as a function of p at constant H;
that depends on the gas’s intermolecular forces and will not concern us here.
Constant-enthalpy expansion is very important for a different reason. It turns out that
(∂τ/∂p)H depends sensitively on intermolecular forces, and the expansion may be accompanied by
heating or cooling, depending on the gas. With a cunning choice of gas, constant-enthalpy expansion
(known as the Joule-Thomson or Joule-Kelvin effect) does most of our domestic refrigeration. [2]

The entropy-increase statement (8.25) implies that as an irreversible process approaches


equilibrium at known values of X, Y, the entropy increases to a maximum. Rather than
X, Y, however, we may prefer to take x, Y or x, y as independently known. In particular,
many kinds of process (especially in chemistry) take place at given τ, p or τ, V, but not
given E, V. So it is useful to Legendre-transform the inequality (8.25) to appropriate
variables. If x, Y are fixed we take the differential d[ln Z′ (x, Y)] from (8.17) and insert the
inequality (8.25), which gives us

d[ln Z′ (x, Y)] ≥ −X dx + y dY. (8.30)

If x, y are fixed we instead compute the differential d[ln Z(x, y)] and insert the inequality
(8.25), and that gives us
d[ln Z(x, y)] ≥ −X dx − Y dy. (8.31)

The inequalities (8.30) and (8.31) mean that as a system approaches equilibrium and
S(X, Y) increases to a maximum, a thermodynamic potential also moves to an extremum.
Conventionally, thermodynamic potentials are defined as Legendre transforms of E, rather
than of S as here; that gives them an extra factor of −τ and makes them decrease to a
minimum during an irreversible process.

PROBLEM 8.3: Derive thermodynamic potentials that are minimized by equilibria at (i) fixed τ, V ,
and (ii) fixed τ, p. [2]

We now move on from classical thermodynamics, and incorporate some microscopic


considerations. In this book we will only manage the minimal theory of the so-called
ideal gas, but with quantum mechanics included. ‘Gas’ is meant in a very general sense,
it can refer to the gas of conduction electrons in a metal, or to liquid helium.

For the particles in such a gas there are a number (possibly infinity) of energy levels,
say level 0 with energy ε0, level 1 with energy ε1, and so on. A microstate of the gas
has n0 particles in level 0, n1 particles in level 1, and so on. The measured thermodynamic
parameters are the number of particles N = \sum_i n_i and the total energy E = \sum_i n_i ε_i. The
partition function is a sum over all possible sets {ni }. Writing −µ/τ and 1/τ for the
Lagrange multipliers corresponding to N and E respectively, we have
    Z = \sum_{\{n_i\}} \exp\!\left[ \sum_i n_i (µ - ε_i)/τ \right].        (8.32)

We can rearrange the double sum as


    \prod_i \sum_{n_i} \exp[n_i (µ - ε_i)/τ],        (8.33)
in which case the {ni } are generated automatically by the product.
Quantum mechanics has two important consequences that we have to take into
account when evaluating Z.
(i) Identical particles are indistinguishable, so permuting particles doesn’t count as a
separate state. We have already used this fact in writing (8.32)—for distinguishable
particles there would have been some factorials.
(ii) Quantum particles come in two varieties: Bose-Einstein particles (or bosons) for
which ni may be any non-negative integer; and Fermi-Dirac particles (or fermions)
for which ni can only be 0 or 1. Electrons are fermions, while Helium atoms are
bosons.
Working out the sum with and without the restriction on ni , we have
    Z = \prod_i \left( 1 ± e^{(µ - ε_i)/τ} \right)^{±1},        (8.34)
where the upper sign refers to the Fermi-Dirac case and the lower sign to the Bose-Einstein
case. Evaluating the expectation values using (8.5) we have
    E = -\left(\frac{\partial \ln Z}{\partial [1/τ]}\right)_{µ/τ} = \sum_i \frac{ε_i}{e^{(ε_i - µ)/τ} ± 1},
    \qquad
    N = \left(\frac{\partial \ln Z}{\partial [µ/τ]}\right)_{τ} = \sum_i \frac{1}{e^{(ε_i - µ)/τ} ± 1},        (8.35)
and consequently for the occupancy of the i-th energy level we have

    n_i = \frac{1}{e^{(ε_i - µ)/τ} ± 1}.        (8.36)

If the energy levels εi are continuous, or spaced closely enough to be approximable


as continuous, we can write continuous forms of (8.36) and (8.35):
    n(ε) = \frac{ρ(ε)}{e^{(ε - µ)/τ} ± 1}, \qquad E = \int ε\, n(ε)\, dε, \qquad N = \int n(ε)\, dε,        (8.37)
where ρ(ε) is the density of states.
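
In calculations one usually knows N and τ and must first solve the second of equations (8.35) for µ. A minimal sketch (Python with scipy) for a made-up set of discrete levels; the level energies, temperature, and particle number are illustrative only.

    import numpy as np
    from scipy.optimize import brentq

    levels = np.linspace(0.0, 10.0, 200)   # made-up single-particle energies ε_i
    tau, N = 1.0, 50.0                     # illustrative temperature and particle number
    sign = +1                              # +1 for Fermi-Dirac, -1 for Bose-Einstein

    def occupancies(mu):
        return 1.0 / (np.exp((levels - mu) / tau) + sign)   # equation (8.36)

    # Adjust µ so that the occupancies sum to N (second of equations 8.35).
    # (In the Bose-Einstein case µ must be bracketed below the lowest level.)
    mu = brentq(lambda m: occupancies(m).sum() - N, -50.0, 50.0)
    n_i = occupancies(mu)
    E = np.sum(levels * n_i)
    print(mu, n_i.sum(), E)
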

EXAMPLE [The classical ideal gas] In gases of our everyday experience, usually ε ≫ µ, in which
case the partition function (8.34) simplifies to

    \ln Z \simeq e^{µ/τ} \sum_i e^{-ε_i/τ}        (8.38)

and the distinction between Bose-Einstein and Fermi-Dirac disappears; the common limiting form
is called the classical or Maxwell-Boltzmann gas. If the gas is monoatomic, the density of states is
expressible in terms of the momentum p as

    ρ(ε)\, dε = \frac{4π}{h^3} p^2\, dp\, dV, \qquad ε = \frac{p^2}{2m},        (8.39)

where h is a physical constant (Planck’s constant) and m is the particle mass. Approximating the
partition function (8.38) by an integral and using (M.4) from page 70 gives us

    \ln Z = e^{µ/τ} \frac{τ^{3/2} V}{Γ}, \qquad Γ = \left(\frac{h^2}{2πm}\right)^{3/2}.        (8.40)

Using (8.5) we have


 
    N = \left(\frac{\partial \ln Z}{\partial [µ/τ]}\right)_{τ} = \ln Z, \qquad \frac{µ}{τ} = \ln\!\left(\frac{NΓ}{V τ^{3/2}}\right),
    \qquad
    E = -\left(\frac{\partial \ln Z}{\partial [1/τ]}\right)_{µ/τ} = \tfrac{3}{2} τ N,        (8.41)

Substituting from the above into the Legendre transform (8.4) we have

    S = \ln Z + \frac{E}{τ} - \frac{µ}{τ} N = \left[ \tfrac{5}{2} + \ln\!\left(\frac{τ^{3/2} V}{NΓ}\right) \right] N.        (8.42)

And here we recover the well-known result that for a monoatomic ideal gas τV^{2/3} is constant in
constant-S processes.
We may now invoke the relation (8.8) for p/τ from classical thermodynamics, which gives us

    \frac{p}{τ} = \left(\frac{\partial S}{\partial V}\right)_{E,N} = \frac{N}{V}        (8.43)

(since τ = \tfrac{2}{3} E/N and thus τ is constant in the partial derivative) and we have finally

pV = Nτ, (8.44)

which is the equation of state for a classical gas. ⊓
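
The classical-gas results are easy to check numerically: take the entropy formula (8.42), differentiate it by finite differences with respect to E and V, and compare with 1/τ from (8.8) and with N/V from (8.43). A minimal sketch in Python; the values of E, V, N are arbitrary and units with h = m = 1 are assumed.

    import numpy as np

    Gamma = (1.0 / (2 * np.pi)) ** 1.5     # Γ = (h²/2πm)^{3/2} with h = m = 1

    def S(E, V, N):
        tau = (2.0 / 3.0) * E / N          # from (8.41): E = (3/2) N τ
        return (2.5 + np.log(tau**1.5 * V / (N * Gamma))) * N   # equation (8.42)

    E, V, N = 30.0, 10.0, 20.0             # illustrative values
    tau = (2.0 / 3.0) * E / N
    step = 1e-6

    dS_dE = (S(E + step, V, N) - S(E - step, V, N)) / (2 * step)
    dS_dV = (S(E, V + step, N) - S(E, V - step, N)) / (2 * step)

    print(dS_dE, 1.0 / tau)                # (∂S/∂E)_V = 1/τ, as in (8.8)
    print(dS_dV, N / V)                    # (∂S/∂V)_E = p/τ = N/V, i.e. pV = Nτ
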




PROBLEM 8.4: Photons in a cavity form a Bose-Einstein gas, with the difference that photons can
be freely absorbed and emitted by any matter in the cavity, or its walls, so there is no constraint on
N. For photons, the energy level is given by

ε = hν,

where ν is the frequency. The density of states has the form

    ρ(ε)\, dε = \frac{8π}{c^3} ν^2\, dν\, dV

where c is the speed of light. Show that the energy per unit volume in photons with frequencies
between ν and ν + dν is
    \frac{8πh}{c^3} \frac{ν^3\, dν}{e^{hν/τ} - 1}.
This is known, for historical reasons, as the blackbody radiation formula.
A good example of a photon gas is the radiation left over from the Big Bang. The expansion
of the universe has reduced temperature and frequencies a lot, and today the typical ν is in the
microwave range and the temperature is 2.73 K. [2]
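
For a feel for the numbers, the blackbody formula is easy to evaluate. The sketch below (Python, using scipy's physical constants) computes the spectral energy density at 2.73 K and locates the peak frequency on a grid; the grid itself is an arbitrary choice.

    import numpy as np
    from scipy.constants import h, c, k

    T = 2.73                               # kelvin; τ = kT in energy units
    nu = np.linspace(1e9, 1e12, 200000)    # frequency grid in Hz (illustrative)

    # Energy per unit volume per unit frequency, from the formula above.
    u_nu = (8 * np.pi * h / c**3) * nu**3 / np.expm1(h * nu / (k * T))

    nu_peak = nu[np.argmax(u_nu)]
    print(nu_peak / 1e9, "GHz")            # about 160 GHz, i.e. in the microwave range
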
Appendix: Miscellaneous Formulas
This Appendix collects a number of formulas that are used in the main text, and are
often useful in other places too. Most of them will not be derived here, but any book on
mathematical methods for physical sciences 1 will derive most or all of them.

1. First, a reminder of the definition of e:


    e \equiv \lim_{N \to \infty} \left(1 + \frac{1}{N}\right)^{N}.        (M.1)

2. Integrals for marginalization often involve Gamma and Beta functions which are de-
fined as

    Γ(n + 1) \equiv \int_0^{\infty} x^n e^{-x}\, dx = n!        (M.2)

and

    B(m, n) \equiv \int_0^1 x^m (1 - x)^n\, dx = \frac{m!\, n!}{(m + n + 1)!}.        (M.3)
Note that m, n in these two formulas need not be integers. Useful non-integer cases are
    Γ\!\left(\tfrac{1}{2}\right) = \sqrt{π}, \qquad Γ\!\left(\tfrac{3}{2}\right) = \tfrac{1}{2}\sqrt{π}.        (M.4)

Changing variables in (M.2) gives another useful integral:


    \int_0^{\infty} σ^{-n} \exp\!\left(-\frac{K}{2σ^2}\right) \frac{dσ}{σ} = \tfrac{1}{2}\, Γ\!\left(\frac{n}{2}\right) \left(\frac{K}{2}\right)^{-n/2}.        (M.5)

A useful approximation for the Gamma function is Stirling’s formula (actually the leading
term in an asymptotic series)
    n! \simeq \sqrt{2π}\, e^{-n} n^{n + \frac{1}{2}},        (M.6)
and its further truncation
ln(n!) ≃ n ln n − n. (M.7)

3. Also useful for marginalization is the most beautiful of all elementary integrals
    \int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{π}        (M.8)

1 My favourites are G.B. Arfken & H.J. Weber, Mathematical Methods for Physicists (Academic Press 1995—first
edition 1966) for detail and P. Dennery & A. Krzywicki, Mathematics for Physicists (Dover Publications 1996—first
edition 1967) for conciseness.


and a number of integrals derived from it. These are sometimes called Gaussian integrals.

From (M.8) we have

    \int_{-\infty}^{\infty} e^{-\frac{1}{2} α x^2}\, dx = \sqrt{2π}\, α^{-\frac{1}{2}}.        (M.9)

Invoking Leibnitz’s rule for differentiating under the integral sign, we differentiate (M.9)
with respect to α, and divide the result by (M.9) itself, to get
    \frac{\int_{-\infty}^{\infty} x^2 e^{-\frac{1}{2} α x^2}\, dx}{\int_{-\infty}^{\infty} e^{-\frac{1}{2} α x^2}\, dx} = α^{-1}        (M.10)

Going to two dimensions, we have


    \int_{-\infty}^{\infty} e^{-\frac{1}{2}(α x^2 + β y^2 - 2γ x y)}\, dx\, dy
        = \left(\frac{2π}{β}\right)^{\frac{1}{2}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(αβ - γ^2) x^2 / β}\, dx
        = \frac{2π}{\sqrt{αβ - γ^2}}        (M.11)

and

    \frac{\int_{-\infty}^{\infty} x^2 \exp\!\left[-\frac{1}{2}(α x^2 + β y^2 - 2γ x y)\right] dx\, dy}
         {\int_{-\infty}^{\infty} \exp\!\left[-\frac{1}{2}(α x^2 + β y^2 - 2γ x y)\right] dx\, dy} = \frac{β}{αβ - γ^2}.        (M.12)
Formulas (M.11) and (M.12) look awful, but their content is simple: (M.11) shows that
marginalizing out one of the variables in a 2D Gaussian leaves a 1D Gaussian, and (M.12)
tells us the dispersion of the latter.
The multi-dimensional generalizations of (M.9) and (M.10) can be written concisely
in matrix notation. Let x denote a column vector with L components each ranging from
−∞ to ∞ and H denoting a constant L × L matrix. Then -\tfrac{1}{2} x^T H x generalizes -\tfrac{1}{2} α x^2 to L
dimensions. We then have the formulas1

    \int \exp\!\left(-\tfrac{1}{2} x^T H x\right) dx_1 \ldots dx_L = \frac{(2π)^{L/2}}{\sqrt{|\det H|}}        (M.13)

and

    \frac{\int x\, x^T \exp\!\left(-\tfrac{1}{2} x^T H x\right) dx_1 \ldots dx_L}
         {\int \exp\!\left(-\tfrac{1}{2} x^T H x\right) dx_1 \ldots dx_L} = H^{-1}.        (M.14)
We see that H−1 behaves like σ2 in the Gaussian. Moreover, marginalizing some of the
variables in a multi-dimensional Gaussian amounts to discarding the corresponding rows
and columns of H−1 .
1 See the Appendix of Sivia for a derivation.
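
The statement about marginalization is easy to check by Monte-Carlo: draw samples from a multi-dimensional Gaussian with some H, ignore one of the coordinates, and compare the sample covariance of the rest with the corresponding rows and columns of H^{-1}. A minimal sketch in Python; the 3 × 3 matrix is an arbitrary positive-definite choice.

    import numpy as np

    rng = np.random.default_rng(3)

    # An arbitrary positive-definite H.
    A = rng.normal(size=(3, 3))
    H = A @ A.T + 3 * np.eye(3)
    cov = np.linalg.inv(H)                  # H^{-1} plays the role of σ²

    samples = rng.multivariate_normal(np.zeros(3), cov, size=200000)

    # Marginalize out the third variable by simply ignoring it.
    kept = samples[:, :2]
    print(np.cov(kept, rowvar=False))       # close to ...
    print(cov[:2, :2])                      # ... the corresponding block of H^{-1}
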

4. Now we have a very brief summary of Fourier transforms. A function f(x) and its
Fourier transform, say F(k), are related by
    F(k) \equiv \int_{-\infty}^{\infty} e^{ikx} f(x)\, dx,
    \qquad
    f(x) = \frac{1}{2π} \int_{-\infty}^{\infty} e^{-ikx} F(k)\, dk.        (M.15)

Naturally, f(x) needs to fall off fast enough at large |x| for the Fourier transform to exist.
Fourier transforms have many useful properties, four of which are used in chapter 3.
(i) The Fourier transform of f′(x) is (−ik)F(k) (easy to verify).
(ii) The Fourier transform of the convolution of two functions is the product of the Fourier
transforms (the convolution theorem):
    \int_{-\infty}^{\infty} f(y)\, g(x - y)\, dy = \frac{1}{2π} \int_{-\infty}^{\infty} e^{-ikx} F(k) G(k)\, dk.        (M.16)

It is not hard to derive this, at least if we are allowed to freely change the order of
integration.
(iii) The Fourier transform of a Gaussian is another Gaussian:
    f(x) = \frac{1}{\sqrt{2π}\, σ} e^{-\frac{1}{2} x^2/σ^2} \quad \Leftrightarrow \quad F(k) = e^{-\frac{1}{2} k^2 σ^2}.        (M.17)

The functional form stays the same, only σ is replaced by 1/σ.


(iv) Although the Fourier transform of a constant is not defined, it can be given a meaning
in the limit. Consider the limit of (M.17) as σ → 0: f(x) gets arbitrarily sharp and
narrow but the area under it remains 1. We have

f(x) → δ(x), F(k) → 1, (M.18)

where δ(x) (called a δ function or Dirac δ) is a curious beast with the property that
for any well-behaved h(x)
    \int δ(x - a)\, h(x)\, dx = h(a)        (M.19)

if the integral goes over x = a, and 0 otherwise.
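
The convolution theorem carries over to discrete Fourier transforms, which is how it is used in most numerical work. The sketch below (Python with numpy) checks that a circular convolution computed directly agrees with one computed by transforming, multiplying, and transforming back; the sequences are random and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 256
    f = rng.normal(size=N)
    g = rng.normal(size=N)

    # Discrete (circular) convolution, done directly ...
    direct = np.array([sum(f[k] * g[(m - k) % N] for k in range(N))
                       for m in range(N)])

    # ... and via the convolution theorem: transform, multiply, transform back.
    via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

    print(np.max(np.abs(direct - via_fft)))   # tiny: the two agree
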

5. Finally, a note on the infinite product (1.5) on page 4, which has little to do with anything
else in this book, but is too pretty to pass by. Consider
    \prod_{p\ \text{prime}} \left(1 - p^{-z}\right)^{-1}        (M.20)

where z > 1. We can binomial-expand each term in the product to get


    \prod_{p\ \text{prime}} \left(1 + p^{-z} + p^{-2z} + p^{-3z} + \cdots\right).        (M.21)

Expanding out this product, and using the fact that every natural number n has a unique
prime factorization, gives

    \sum_{n=1}^{\infty} \frac{1}{n^z}.        (M.22)

The sum (M.22) is the Riemann Zeta function, originally introduced for real z by Euler
and later analytically continued to complex z by Riemann. This function has a spooky
tendency to appear in any problem to do with primes. For z = 2 it equals1 π²/6.

1 See, for example, equation (6.90) in R.L. Graham, D.E. Knuth, & O. Patashnik, Concrete Mathematics
(Addison-Wesley 1990).
Hints and Answers

ANSWER 1.1: Adding probabilities for 1st, 2nd,. . . turn


    prob(Tortoise rolls 6) = \tfrac{1}{6}\left[ \tfrac{5}{6} + \left(\tfrac{5}{6}\right)^3 + \cdots \right] = \tfrac{5}{11}.

Alternatively

    prob(Tortoise rolls 6) = \tfrac{5}{6}\, prob(Achilles rolls 6) = \tfrac{5}{11}.

ANSWER 1.2:
    prob(65 | N) = prob(1729 | N) = N^{-1}, provided N ≥ 1729

and the prior on N is N^{-1}. Hence

    prob(N | 65, 1729) ∝ N^{-3}, N ≥ 1729.

ANSWER 1.3: The instruction sets GGG and RRR give coincidence no matter what the switch settings
are. Each of the remaining instruction sets (RRG RGR RGG GRR GRG GGR) gives non-coincidence for
only 4 out of the 9 possible switch combinations.

ANSWER 2.1: It is the probability of having n events in the time interval [0, τ], which is
e^{-mτ} m^n τ^n/n!, and then one more event in the time interval [τ, τ + dτ]. As the Gamma func-
tion formula (M.2) shows, the result is normalized.

ANSWER 2.2: Modify the derivation of (2.16), put K = L, and then use the Beta function (M.3) for
the integral over u.

ANSWER 2.3: Straightforward manipulations of the sums.

ANSWER 2.4: (n1 − n12 )(n2 − n12 )/n12 .

ANSWER 3.1: cf(k) = sin(k)/k.

ANSWER 3.2: Like equation (3.16), with ∆² = ½ in units of a step length.

ANSWER 3.3: Since 0.3% of a Gaussian lies outside 3σ, the difference between the extremes in 1024
trials will be about 6σ. In fact in this example m = 137.5 and σ = 7.

ANSWER 3.4: A Gaussian, times four factors of the type (3.43), times a combinatorial factor {}^5C_2 × 3.


ANSWER 3.5: In the Gaussian approximation 130 is a 3σ result.

ANSWER 4.1: In polar coordinates

x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ

the volume element


    dx\, dy\, dz = r^2\, dr\, d(\cos θ)\, dφ

so choose r according to r² f(r) and choose φ and cos θ uniformly.
For g(r), start with f(r) and then use rejection for the extra factor.

ANSWER 4.2: See Figure A.1.

Figure A.1: This uses 40000 points. The integrable singularity at x = 0 is harmless.

ANSWER 5.1: We get


    y = 1.015 - 1.047x

The bootstrap matrix is

    \begin{pmatrix} 0.0017 & -0.0030 \\ -0.0030 & 0.0072 \end{pmatrix}

versus σ²C (using equation 5.26 on page 40 for σ²) of

    \begin{pmatrix} 0.0014 & -0.0022 \\ -0.0022 & 0.0060 \end{pmatrix}


ANSWER 5.2: Replacing the K points with the mean and replacing σ with σ/√K, changes Q² only
by a constant.

ANSWER 5.3: prob(µ | D) ∝ (dω/dµ) prob(ω | D). Then use the Gaussian approximation formula
(3.35).

ANSWER 5.4: Marginalize along the line, which gives

    prob(x_1, y_1 | m, c) ∝ (1 + m^2)^{-\frac{1}{2}} \exp\!\left[ -\frac{Q^2}{2(1 + m^2)} \right],
    \qquad Q^2 = (y_1 - mx_1 - c)^2,

which is similar to (5.6) but with √(1 + m²) in place of σ. The argument of the exponent is the
perpendicular distance from (x1 , y1 ) to the line.

ANSWER 5.5: The polynomial is


    1 - x + x^2 - x^3

and Figure A.2 shows data and fits. The factor

    Γ\!\left(\tfrac{1}{2}L\right)\, Γ\!\left(\tfrac{1}{2}(N - L)\right) \left(P^T\!\cdot C\cdot P\right)^{-L/2} \left(d^2 - P^T\!\cdot C\cdot P\right)^{(L-N)/2}

from equation (5.32) evaluates to

    10^{28.5}, 10^{36.3}, 10^{43.3}, 10^{41.0}, 10^{39.0}, 10^{37.3}

for polynomials of degree 1 to 6.

ANSWER 5.6: Use −2 ln(likelihood), or


    2 \sum_i \left( \ln(n_i!) + m - n_i \ln m \right)

which reduces to χ² for large m.

ANSWER 5.7: Beware the denominator in Bayes’ theorem!


In our first invocation of Bayes’ theorem, the denominator prob(D) is a well-defined constant.
In our second invocation, the denominator is prob(χ2 ). But since χ2 depends on ω, marginalizing
out ω to get prob(χ2 ) has no meaning.

ANSWER 6.1: There could be any number of unsampled events before the next sampled event, so


    prob(x | S) = \sum_{n=0}^{\infty} \left[ prob(\bar{S}) \right]^{n} prob(S, x).

Figure A.2: Fitted polynomials of degrees 1 to 6.

ANSWER 6.2: Figure A.3 shows the data my program generated. (The binning is just for the figure,
the analysis does not bin the data.) From the figure we would expect that a_1, b_1 would be easy to
infer, a_2, b_2 much harder. My Metropolis program gets the following 90% confidence intervals.

    w = 0.88^{+0.08}_{-0.16}
    a_1 = 0.48^{+0.03}_{-0.04}        b_1 = 0.21^{+0.06}_{-0.06}
    a_2 = -0.29^{+0.35}_{-0.31}       b_2 = 0.42^{+0.22}_{-0.09}

ANSWER 6.3: In equation (6.6), replace M! by e^{-M} M^M and S! by e^{-S} S^S.

ANSWER 6.4: See Figures A.4 and A.5.

ANSWER 7.1: The variance ⟨x²(ε)⟩ - ⟨x(ε)⟩².



Figure A.3: Histogram of detections in the two-lighthouse problem.

Figure A.4: Plot of p-values for the KS statistic, for Nu = 10, Nv = 20. The histogram is from
simulations, the curve is the asymptotic formula.

Figure A.5: Plot of p-values for the statistic κ = \sum_i x_i x_{i+1} (suggested by A. Austin) also for
Nu = 10, Nv = 20.

ANSWER 8.1:
F(τ, V ) = E(S, V ) − τS
H(S, p) = E(S, V ) + pV
H(S, p) = τS + Vp + F(τ, V )

ANSWER 8.2: Since p1, p2 are constant, work done by the gas is p2 V2 − p1 V1, and hence H = E + pV
is constant. Using
dH = τ dS + V dp

gives dS.

ANSWER 8.3: F(τ, V ) and G(τ, p).

ANSWER 8.4: Substitute in (8.37), with µ = 0 because there is no particle-number constraint.


Index

Bayesian 6 detailed balance 32


Bayes’ theorem 5, 12, 15, 44, 45 diffusion equation 26–27
Bell’s theorem 13–14 dispersion, see standard deviation
Bernoulli distribution, see binomial distri- entropy
bution and continuous probability distributions
Beta function 70 52
configurational 55–57
binomial distribution 15
information theoretic 12, 50
blackbody radiation 69 principle of maximum entropy 11–12,
Black-Scholes formula 25–27 52–53, 58
thermodynamic 59
Bookmakers’ odds 10, 17
error bars
bootstrap 37
possible meanings of 27
Bose-Einstein distribution 67
error function 28
Carnot cycle 61–62
errors, propagation of 39–40
Cash statistic 44
estimators, see under Frequentist theory
cats, eye colours of 4
evidence, see Bookmakers’ odds
Cauchy distribution, see Lorentzian
expectation values 20
CCDs 29 Fermi-Dirac distribution 67
central limit theorem Fisher matrix 39
inapplicability of 23, 45
Flanders and Swann 61
proof of 23
Fourier transform 13, 22, 31, 72
characteristic function 22
Frequentist 6
chi-square test 42–44
degrees of freedom 43 Frequentist theory 11
reduced χ2 43 estimators 11
maximum likelihood 11
conditional probability, see under probabil-
unbiased estimators 11
ity theory
Gamma function 70
confidence interval 17
Gaussian distribution 22
convolution 22, 72
and maximum entropy 53
covariance matrix 38 integrals over 71
Cox’s theorems 7 tails of 22
Cramers, von Mises, Smirnov statistic, see Gibbs paradox 65
Kolmogorov-Smirnov statistic and goodness of fit
variants general concept 10
credible region, see confidence interval Green’s function 27
degeneracy, see under partition function human condition 21


ideal gas 66 odds ratio, see Bookmakers’ odds


classical 68 parameter fitting
indifference, principle of 9 general concept 8–9
insufficient reason, see indifference parameters
Jeffreys prior, see under prior location 9
Kolmogorov-Smirnov statistic and variants scale 9
47–49 partition function 52–54, 58, 62–63
one-dimensional nature of 49 degeneracy 53
least-squares density of states 53
linear parameters 38 Pascal distribution, see negative binomial
model comparison 41–42 distribution
nonlinear parameters 39
Poisson distribution 18
Legendre transform 59 , 62–63 and maximum entropy 53
lighthouse problems 45–46 posterior 8
likelihood 8
principle of maximum entropy, see under
Lorentzian distribution 22 entropy
marginalization rule, see under probability prior 8
theory assignment 9, 11
Markov chain Monte-Carlo 32–33 Gamma 19
maximum likelihood, see under Frequentist informative and uninformative 17, 50
theory Jeffreys 9, 15, 21
Jeffreys normalized 19 , 41
Maxwell-Boltzmann distribution 68
Macaulay and Buck 55
Maxwell relations 64 vs posterior 50
mean 20
probability theory
Metropolis algorithm 33–34 Bayes theorem 5
model comparison conditional probability 4
general concept 9–10 marginalization rule 5
moment generating function, see character- product rule 4
istic function sum rule 4

moments 20 product rule, see under probability theory


monkey and peanuts 55 random numbers 31
multinomial distribution 15, 46, 55 inverse cumulant method 32
Latin hypercube sampling 32
negative binomial distribution 15
rejection method 31
noise
random walks 23–26, 48
estimating 40
Gaussian 35, 54 Riemann Zeta function 73
normal distribution, see Gaussian distribu- Shannon’s theorem 11
tion proof of 50–51
nuisance parameters 9 standard deviation 20

statistical mechanics, see under thermody- enthalpy 63, 66


namics first law 61
statistic Gibbs free energy 63
choice of 10, 42, 47 Helmholtz free energy 64
interpretation of S, τ, p 61–62
straight line fitting
irreversible processes 64–66
hard case 41
potentials 62–63
simple case 36
reversible processes 60–62
Student-t distribution 40 second law 64
sum rule, see under probability theory statistical (aka statistical mechanics) 59
thermodynamic limit 59 Tremaine’s paradox 44

thermodynamics Turing test 30


classical 60 variance 20
