Probability, Entropy,
and the Foundations of Physics
ARIEL CATICHA
Preface
2 Probability
2.1 The design of probability theory
2.1.1 Rational beliefs?
2.1.2 Quantifying rational belief
2.2 The sum rule
2.2.1 The associativity constraint
2.2.2 The general solution and its regraduation
2.2.3 The general sum rule
2.2.4 Cox's proof
2.3 The product rule
2.3.1 From four arguments down to two
2.3.2 The distributivity constraint
2.4 Some remarks on the sum and product rules
2.4.1 On meaning, ignorance and randomness
2.4.2 Independent and mutually exclusive events
2.4.3 Marginalization
2.5 How about "quantum" probabilities?
2.6 The expected value
2.7 The binomial distribution
2.8 Probability vs. frequency: the law of large numbers
2.9 The Gaussian distribution
2.9.1 The de Moivre-Laplace theorem
2.9.2 The Central Limit Theorem
2.10 Updating probabilities: Bayes' rule
References
Preface
Science consists in using information about the world for the purpose of predicting, explaining, understanding, and/or controlling phenomena of interest. The basic difficulty is that the available information is usually insufficient to attain any of those goals with certainty. A central concern in these lectures will be the problem of inductive inference, that is, the problem of reasoning under conditions of incomplete information.
Our goal is twofold. First, to develop the main tools for inference — probability and entropy — and to demonstrate their use. And second, to demonstrate their importance for physics. More specifically our goal is to clarify the conceptual foundations of physics by deriving the fundamental laws of statistical mechanics and of quantum mechanics as examples of inductive inference. Perhaps all physics can be derived in this way.
The level of these lectures is somewhat uneven. Some topics are fairly advanced — the subject of recent research — while some other topics are very elementary. I can give two related reasons for including both in the same book. The first is pedagogical: these are lectures — the easy stuff has to be taught too. More importantly, the standard education of physicists includes a very inadequate study of probability and even of entropy. The result is a widespread misconception that these "elementary" subjects are trivial and unproblematic — that the real problems of theoretical and experimental physics lie elsewhere.
As for the second reason, it is inconceivable that the interpretations of probability and of entropy would turn out to bear no relation to our understanding of physics. Indeed, if the only notion of probability at our disposal is that of a frequency in a large number of trials one might be led to think that the ensembles of statistical mechanics must be real, and to regard their absence as an urgent problem demanding an immediate solution — perhaps an ergodic solution. One might also be led to think that analogous ensembles are needed in quantum theory, perhaps in the form of parallel worlds. Similarly, if the only available notion of entropy is derived from thermodynamics, one might end up thinking that entropy is some physical quantity that can be measured in the lab, and that it has little or no relevance beyond statistical mechanics. Thus, it is worthwhile to revisit the "elementary" basics because usually the basics are not elementary at all, and even more importantly, because they are so fundamental.
Acknowledgements: Most specially I am indebted to N. Caticha and C. R. Rodríguez, whose views on these matters have over the years profoundly influenced my own, but I have also learned much from discussions with many colleagues and friends: D. Bartolomeo, C. Cafaro, N. Carrara, S. DiFranzo, V. Dose, K. Earle, A. Giffin, A. Golan, S. Ipek, D. T. Johnson, K. Knuth, O. Lunin, S. Nawaz, P. Pessoa, R. Preuss, M. Reginatto, J. Skilling, J. Stern, C.-Y. Tseng, K. Vanslette, and A. Yousefi. I would also like to thank all the students who over the years have taken my course on Information Physics; their questions and doubts have very often pointed the way to clarifying my own questions and doubts.
Albany, ...
Chapter 1
Inductive Inference and Physics
1.1 Probability
The question of the meaning and interpretation of the concept of probability has long been controversial. It is clear that the interpretations offered by various schools are at least partially successful or else they would already have been discarded long ago. But the different interpretations are not equivalent. They lead people to ask different questions and to pursue their research in different directions. Some questions may become essential and urgent under one interpretation while totally irrelevant under another. And perhaps even more important: under different interpretations equations can be used differently and this can lead to different predictions.
are about but there are many shades of interpretation. As we shall later argue a more useful definition of probability is the degree to which an ideally rational agent ought to believe in the truth of a proposition. Other interpretations include, for example, the degree of personal belief as contrasted to a degree of rational belief or reasonable expectation, the degree of plausibility of a proposition, the degree of credibility, and also the degree of implication (the degree to which b implies a is the conditional probability of a given b).
by Bayes' theorem – a theorem which was first written by Laplace. This approach enjoys several advantages. One is that the difficulties associated with attempting to pinpoint the precise meaning of the word 'random' can be avoided. Bayesian probabilities allow us to reason in a consistent and rational manner about quantities, such as parameters, that rather than being random might be merely unknown. Also Bayesian probabilities are not restricted to repeatable events and therefore they allow us to reason about singular, unique events. Thus, in going from the frequentist to the Bayesian interpretations the domain of applicability and therefore the usefulness of the concept of probability is considerably enlarged.
The crucial aspect of Bayesian probabilities is that different agents may have different degrees of belief in the truth of the very same proposition, a fact that is described by referring to Bayesian probabilities as being subjective. This term is somewhat misleading. At one end of the spectrum we find the so-called subjective Bayesian or personalistic view (see, e.g., [Savage 1972; Howson Urbach 1993; Jeffrey 2004]), and at the other end there is the objective Bayesian view (see e.g. [Jeffreys 1939; Cox 1946; Jaynes 1985, 2003; Lucas 1970]). For an excellent elementary introduction with a philosophical perspective see [Hacking 2001]. According to the subjective view, two reasonable individuals faced with the same evidence, the same information, can legitimately differ in their confidence in the truth of a proposition and may therefore assign different degrees of personal belief. Subjective Bayesians accept that individuals can change their beliefs, merely on the basis of introspection, reasoning, or even revelation.
At the other end of the Bayesian spectrum, the objective Bayesian view considers the theory of probability as an extension of logic. It is then said that a probability measures a degree of implication; it is the degree to which one proposition implies another. It is assumed that the objective Bayesian has thought so long and so hard about how probabilities are assigned that no further reasoning will induce a revision of beliefs except when confronted with new information. In an ideal situation two different individuals will, on the basis of the same information, assign the same probabilities.
2 The approach is decidedly pragmatic: the purpose of thinking is to acquire beliefs; the
we shall see in Chapter 11 in non-relativistic quantum mechanics particles and atoms will be described as ontic. However, in relativistic quantum field theory it is the fields that are ontic and those excited states (probabilistic distributions of fields) that are called particles are epistemic. Our position reflects a pragmatic realism that is close to the internal realism advocated by H. Putnam [Putnam 1979, 1981, 1987].
4 The term 'epistemic' is not appropriate for emotions or values. However, such ontologically subjective entities will not enter our discussions and there is no pressing need to accommodate them in our terminology.
5 Here again, our position bears some resemblance to that of H. Putnam who has forcefully argued for the rejection of the fact/value dichotomy [Putnam 1991, 2003].
through the introduction of relative entropy as the tool for updating. The theory of probability would be severely handicapped – indeed it would be quite useless – without a companion theory for updating probabilities.
The framework for inference will be constructed by a process of eliminative induction.6 The objective is to design the appropriate tools, which in our case means designing the theory of probability and entropy. The different ways in which probabilities and entropies are defined and handled will lead to different inference schemes and one can imagine a vast variety of possibilities. To select one we must first have a clear idea of the function that those tools are supposed to perform, that is, we must specify design criteria or design specifications that the desired inference framework must obey. Then, in the eliminative part of the process one proceeds to systematically rule out all those inference schemes that fail to perform as desired.
There is no implication that an inference framework designed in this way is in any way "true", or that it succeeds because it achieves some special intimate agreement with reality. Instead, the claim is pragmatic: the method succeeds to the extent that the inference framework works as designed and its performance will be deemed satisfactory as long as it leads to scientific models that are empirically adequate. Whatever design criteria are chosen, they are meant to be only provisional — just like everything else in science, there is no reason to consider them immune from further change and improvement.
The pros and cons of eliminative induction have been the subject of considerable philosophical research (e.g. [Earman 1992; Hawthorne 1993; Godfrey-Smith 2003]). On the negative side, eliminative induction, like any other form of induction, is not guaranteed to work. On the positive side, eliminative induction adds an interesting twist to Popper's scientific methodology. According to Popper scientific theories can never be proved right, they can only be proved false; a theory is corroborated only to the extent that all attempts at falsifying it have failed. Eliminative induction is fully compatible with Popper's notions but the point of view is just the opposite. Instead of focusing on failure to falsify one focuses on success: it is the successful falsification of all rival theories that corroborates the surviving one. The advantage is that one acquires a more explicit understanding of why competing theories are eliminated.
In chapter 2 we address the problem of the design and construction of probability theory as a tool for inference. In other words, we show that degrees of rational belief, those measures of plausibility that we require to do inference, should be manipulated and calculated according to the ordinary rules of the calculus of probabilities.
The problem of designing a theory for updating probabilities is addressed mostly in chapter 6 and then completed in chapter 8. We discuss the central
6 Eliminative induction is a method to select one alternative from within a set of possible ones. For example, to select the right answer to a question one considers a set of possible candidate answers and proceeds to systematically eliminate those that are found wrong or unacceptable in one way or another. The answer that survives after all others have been ruled out is the best choice. There is, of course, no guarantee that the last standing alternative is the correct answer – the only certainty is that all other answers were definitely wrong.
Chapter 2
Probability
Our goal is to establish the theory of probability as the general theory for reasoning on the basis of incomplete information. This requires us to tackle two different problems. The first problem is to figure out how to achieve a quantitative description of a state of partial knowledge. Once this is settled we address the second problem of how to update from one state of knowledge to another when new information becomes available.
Throughout we will assume that the subject matter – the set of propositions the truth of which we want to assess – has been clearly specified. This question of what it is that we are actually talking about is much less trivial than it might appear at first sight.1 Nevertheless, it will not be discussed further.
The first problem, that of describing or characterizing a state of partial knowledge, requires that we quantify the degree to which we believe each proposition in the set is true. The most basic feature of these beliefs is that they form an interconnected web that must be internally consistent. The idea is that in general the strengths of one's beliefs in some propositions are constrained by one's beliefs in other propositions; beliefs are not independent of each other. For example, the belief in the truth of a certain statement a is strongly constrained by the belief in the truth of its negation, not-a: the more I believe in one, the less I believe in the other.
In this chapter we will also address a special case of the second problem — that of updating from one consistent web of beliefs to another when new information in the form of data becomes available. The basic updating strategy reflects the conviction that what we learned in the past is valuable, that the web of beliefs should only be revised to the extent required by the data. We will see that this principle of minimal updating leads to the uniquely natural rule that is widely known as Bayes' rule.2 As an illustration of the enormous power of
1 Consider the example of quantum mechanics: Are we talking about particles, or about experimental setups, or both? Is it the position of the particles or the position of the detectors? Are we talking about position variables, or about momenta, or both? Or neither?
2 The presentation in this chapter includes material published in [Caticha Giffin 2006,
It is not easy to identify criteria of rationality that are sufficiently general and precise. Perhaps we can settle for the more manageable goal of avoiding irrationality in those glaring cases where it is easily recognizable. And this is the approach we take: rather than offering a precise criterion of rationality we design a framework with the more modest goal of avoiding some forms of irrationality that are perhaps sufficiently obvious to command general agreement. The basic requirement is that if a conclusion can be reached by arguments that follow two different paths then the two arguments must agree. Otherwise our framework is not performing the function for which it is being designed, namely, to provide guidance as to what we are supposed to believe. Thus, the web of rational beliefs must avoid inconsistencies. As we shall see this requirement turns out to be extremely restrictive.
Finally,
Figure 2.1: The universe of discourse — the set of all propositions — for a three-sided die with faces labelled a, b, and c forms an ordered lattice. The assignment of degrees of belief to each proposition, $a \to [a]$, leads to a "web of belief". The web of belief is highly constrained because it must reflect the structure of the underlying lattice.
Since the goal is to design a quantitative theory, we require that these relations be represented by some functions F and G,
and
$$[ab|c] = G([a|c],[b|c],[a|bc],[b|ac])\,. \qquad (2.7)$$
Note the qualitative nature of this assumption: what is being asserted is the existence of some unspecified functions F and G and not their specific functional forms. The same F and G are meant to apply to all propositions; what is being designed is a single inductive scheme of universal applicability. Note further that the arguments of F and G include all four possible degrees of belief in a and b in the context of c and not any potentially questionable subset.
The functions F and G provide a representation of the Boolean operations or and and. The requirement that F and G reflect the appropriate associative
and distributive properties of the Boolean and and or turns out to be extremely constraining. Indeed, we will show that there is only one representation — all allowed representations are equivalent to each other — and that this unique representation is equivalent to probability theory.
In section 2.2 the associativity of or is shown to lead to a constraint that requires the function F to be equivalent to the sum rule for probabilities. In section 2.3 we focus on the distributive property of and over or and the corresponding constraint leads to the product rule for probabilities.3
Our method will be design by eliminative induction: now that we have identified a sufficiently broad class of theories — quantitative theories of universal applicability, with degrees of belief represented by real numbers and the operations of conjunction and disjunction represented by functions — we can start weeding the unacceptable ones out.
$$[\tilde a|b] = f([a|b])\,. \qquad (2.8)$$
This statement expresses the intuition that the more one believes in $a|b$, the less one believes in $\tilde a|b$.
A second Cox axiom is that the degree of belief of "a and b given c," written as $[ab|c]$, must depend on $[a|c]$ and $[b|ac]$. This is also very reasonable. When asked to check whether "a and b" is true, we first look at a; if a turns out to be false the conjunction is false and we need not bother with b; therefore $[ab|c]$ must depend on $[a|c]$. If a turns out to be true we need to take a further look at b; therefore $[ab|c]$ must also depend on $[b|ac]$. However, one could object that $[ab|c]$ could in principle depend on all four quantities $[a|c]$, $[b|c]$, $[a|bc]$ and $[b|ac]$. This objection, which we address below, has a long history. It was partially addressed in [Tribus 1969; Smith Erickson 1990; Garrett 1996] and finally resolved in [Caticha 2009b].
3 Our subject is degrees of rational belief but the algebraic approach followed here [Caticha 2009] can be pursued in its own right irrespective of any interpretation. It was used in [Caticha 1998] to derive the manipulation rules for complex numbers interpreted as quantum mechanical amplitudes; in [Knuth 2003] in the mathematical problem of assigning real numbers (valuations) on general distributive lattices; and in [Goyal et al 2010] to justify the use of complex numbers for quantum amplitudes.
Since this must hold for arbitrary choices of the propositions a, b, c, and d, we conclude that in order to be of universal applicability the function F must satisfy (2.15) for arbitrary values of the real numbers $(x,y,z)$. Therefore the function F must be associative.
Remark: The requirement of universality is crucial. Indeed, in a universe of discourse with a discrete and finite set of propositions it is conceivable that
the triples $(x,y,z)$ in (2.15) do not form a dense set and therefore one cannot conclude that the function F must be associative for arbitrary values of x, y, and z. For each specific finite universe of discourse one could design a tailor-made, single-purpose model of inference that could be consistent, i.e. it would satisfy (2.15), without being equivalent to probability theory. However, we are concerned with designing a theory of inference of universal applicability, a single scheme applicable to all universes of discourse whether discrete and finite or otherwise. And the scheme is meant to be used by all rational agents irrespective of their state of belief — which need not be discrete. Thus, a framework designed for broad applicability requires that the values of x form a dense set.4
instead of the old. The original and the regraduated scales are equivalent because by virtue of being invertible the function is monotonic and therefore preserves the ranking of propositions. See figure 2.2. However, the regraduated scale is much more convenient because, instead of the complicated rule (2.16), the or operation for mutually exclusive propositions a and b is now represented by a much simpler rule, the sum rule
$$(a\lor b|d) = (a|d) + (b|d)\,. \qquad (2.19)$$
4 The possibility of alternative probability models was raised in [Halpern 1999]. That these models are ruled out by universality was argued in [Van Horn 2003] and [Caticha 2009].
Figure 2.2: The degrees of belief [a] can be regraduated, $[a]\to(a)$, to another scale that is equivalent — it preserves transitivity of degrees of belief and the associativity of or. The regraduated scale is more convenient in that it provides a more convenient representation of or — a simple sum rule.
Thus, the new numbers are neither more nor less correct than the old, they are just considerably more convenient.
Perhaps one can make the logic of regraduation a little bit clearer by considering the somewhat analogous situation of introducing the quantity temperature as a measure of degree of "hotness". Clearly any acceptable measure of "hotness" must reflect its transitivity — if a is hotter than b and b is hotter than c then a is hotter than c — which explains why temperatures are represented by real numbers. But the temperature scales can be quite arbitrary. While many temperature scales may serve equally well the purpose of ordering systems according to their hotness, there is one choice — the absolute or Kelvin scale — that turns out to be considerably more convenient because it simplifies the mathematical formalism. Switching from an arbitrary temperature scale to the Kelvin scale is one instance of a convenient regraduation. (The details of temperature regraduation are given in chapter 3.)
In the old scale, before regraduation, we had set the range of degrees of belief from one extreme of total disbelief, $[\tilde a|a] = v_F$, to the other extreme of total certainty, $[a|a] = v_T$. The regraduated value $\xi_F$ corresponding to $v_F$ is easy to find. Setting $d = \tilde a\tilde b$ in eq.(2.19) gives
$$(a\lor b|\tilde a\tilde b) = (a|\tilde a\tilde b) + (b|\tilde a\tilde b) \;\Longrightarrow\; \xi_F = 2\xi_F\,, \qquad (2.20)$$
and therefore
$$\xi_F = 0\,. \qquad (2.21)$$
At the opposite end, the regraduated value $\xi_T$ corresponding to $v_T$ remains undetermined but if we set $b = \tilde a$ eq.(2.19) leads to the following normalization condition
$$\xi_T = (a\lor\tilde a|d) = (a|d) + (\tilde a|d)\,. \qquad (2.22)$$
$$a\lor b = (ab)\lor(a\tilde b)\lor(\tilde a b) = a\lor(\tilde a b)\,. \qquad (2.23)$$
Since the two propositions on the right are mutually exclusive the sum rule (2.19) applies,
$$(a\lor b|d) = (a|d) + (\tilde a b|d) + \bigl[(ab|d) - (ab|d)\bigr] \qquad (2.24)$$
$$\phantom{(a\lor b|d)} = (a|d) + (ab\lor\tilde a b|d) - (ab|d)\,, \qquad (2.25)$$
Let
$$r \stackrel{\rm def}{=} F(x,y) \quad\text{and}\quad s \stackrel{\rm def}{=} F(y,z)\,, \qquad (2.28)$$
and let partial derivatives be denoted by subscripts,
$$F_1(x,y) \stackrel{\rm def}{=} \frac{\partial F(x,y)}{\partial x} > 0 \quad\text{and}\quad F_2(x,y) \stackrel{\rm def}{=} \frac{\partial F(x,y)}{\partial y} > 0 \qquad (2.29)$$
($F_1$ denotes a derivative with respect to the first argument). Then eq.(2.15) and its derivatives with respect to x and y are
$$F(r,z) = F(x,s)\,, \qquad (2.30)$$
$$F_1(r,z)\,F_1(x,y) = F_1(x,s)\,, \qquad (2.31)$$
and
$$F_1(r,z)\,F_2(x,y) = F_2(x,s)\,F_1(y,z)\,. \qquad (2.32)$$
Dividing (2.32) by (2.31) gives
$$K(x,y) = K(x,s)\,F_1(y,z)\,, \qquad (2.33)$$
where
$$K(x,y) = \frac{F_2(x,y)}{F_1(x,y)}\,. \qquad (2.34)$$
Multiplying (2.33) by $K(y,z) = F_2(y,z)/F_1(y,z)$ gives
$$K(x,y)\,K(y,z) = K(x,s)\,F_2(y,z)\,. \qquad (2.35)$$
Differentiating the right hand side of eq.(2.35) with respect to y and comparing with the derivative of eq.(2.33) with respect to z, we have
$$\frac{\partial}{\partial y}\bigl(K(x,s)F_2(y,z)\bigr) = \frac{\partial}{\partial z}\bigl(K(x,s)F_1(y,z)\bigr) = \frac{\partial}{\partial z}K(x,y) = 0\,. \qquad (2.36)$$
Therefore, the derivative of the left hand side of eq.(2.35) with respect to y is
$$\frac{\partial}{\partial y}\bigl(K(x,y)K(y,z)\bigr) = 0\,, \qquad (2.37)$$
or,
$$\frac{1}{K(x,y)}\frac{\partial K(x,y)}{\partial y} = -\frac{1}{K(y,z)}\frac{\partial K(y,z)}{\partial y}\,. \qquad (2.38)$$
Since the left hand side is independent of z while the right hand side is independent of x it must be that they depend only on y,
$$\frac{1}{K(x,y)}\frac{\partial K(x,y)}{\partial y} \stackrel{\rm def}{=} h(y)\,. \qquad (2.39)$$
Integrate using the fact that $K > 0$ because both $F_1$ and $F_2$ are positive, to get
$$K(x,y) = K(x,0)\,\exp\int_0^y h(y')\,dy'\,. \qquad (2.40)$$
Similarly,
$$K(y,z) = K(0,z)\,\exp\left(-\int_0^y h(y')\,dy'\right)\,, \qquad (2.41)$$
so that
$$K(x,y) = \frac{K(x,0)}{H(y)} = K(0,y)\,H(x)\,, \qquad (2.42)$$
where
$$H(x) \stackrel{\rm def}{=} \exp\left(-\int_0^x h(x')\,dx'\right) \ge 0\,. \qquad (2.43)$$
Therefore,
$$\frac{K(x,0)}{H(x)} = K(0,y)\,H(y) \stackrel{\rm def}{=} \alpha\,, \qquad (2.44)$$
where $\alpha = K(0,0)$ is a constant and (2.40) becomes
$$K(x,y) = \alpha\,\frac{H(x)}{H(y)}\,. \qquad (2.45)$$
$$F_1(y,z) = \frac{H(s)}{H(y)} \quad\text{and}\quad F_2(y,z) = \frac{H(s)}{H(z)}\,. \qquad (2.46)$$
Since $s = F(y,z)$ we have $ds = F_1(y,z)\,dy + F_2(y,z)\,dz$, so that
$$\frac{ds}{H(s)} = \frac{dy}{H(y)} + \frac{dz}{H(z)}\,. \qquad (2.48)$$
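A worked-equation sketch may help to see where eq.(2.48) leads; this is only a sketch of the step that presumably follows, under the assumption that the constant $\alpha$ of eq.(2.44) has been absorbed (or shown to equal one) in the intervening steps. Define a regraduation function $\phi$ by
$$d\phi(u) \stackrel{\rm def}{=} \frac{du}{H(u)}\,;$$
then eq.(2.48) reads
$$d\phi(s) = d\phi(y) + d\phi(z) \;\Longrightarrow\; \phi\bigl(F(y,z)\bigr) = \phi(y) + \phi(z) + \text{const}\,,$$
so that in the regraduated scale $(a|d) \stackrel{\rm def}{=} \phi([a|d])$, with the constant fixed by the extreme values discussed around eqs.(2.20)-(2.22), the or of mutually exclusive propositions is represented by simple addition — the sum rule (2.19).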
The space of functions of four arguments is very large so we first narrow it down to just two. Then, we require that the representation of and be compatible with the representation of or that we have just obtained. This amounts to imposing a consistency constraint that follows from the distributive properties of the Boolean and and or. A final trivial regraduation yields the product rule of probability theory.
$$G^{(5)}\bigl[G^{(5)}(x,y,z),u,G^{(5)}(v,w,s)\bigr] = G^{(5)}\bigl[x,G^{(5)}(y,u,w),s\bigr]\,. \qquad (2.60)$$
where $G^{(5)}_1$ and $G^{(5)}_3$ denote derivatives with respect to the first and third arguments respectively. Therefore, either
$$G^{(5)}_3(x,y,z) = 0 \quad\text{or}\quad G^{(5)}_1\bigl[G^{(5)}(x,y,z),u,G^{(5)}(v,w,s)\bigr] = 0\,. \qquad (2.62)$$
The first possibility says that $G^{(5)}$ is independent of its third argument which means that it is of the type $G^{(1)}$ that has already been ruled out. The second possibility says that $G^{(5)}$ is independent of its first argument which means that it is already included among the type $G^{(3)}$.
Conclusion:
The possible functions G that are viable candidates for a general theory of inductive inference are equivalent to type $G^{(3)}$,
$$a\,(b\lor c) = ab\lor ac\,, \qquad (2.64)$$
which we rewrite as
$$\frac{\partial^2 G(u,v+w)}{\partial v\,\partial w} = 0\,, \qquad (2.67)$$
and let $v+w = z$, to get
$$\frac{\partial^2 G(u,z)}{\partial z^2} = 0\,, \qquad (2.68)$$
which shows that G is linear in its second argument,
or,
$$u = A(u)\,\xi_T \;\Longrightarrow\; A(u) = \frac{u}{\xi_T}\,. \qquad (2.71)$$
Therefore
$$G(u,v) = \frac{u\,v}{\xi_T} \quad\text{or}\quad \frac{(ab|d)}{\xi_T} = \frac{(a|d)}{\xi_T}\,\frac{(b|ad)}{\xi_T}\,. \qquad (2.72)$$
Conclusion:
2.4.3 Marginalization
Once we decide that it is legitimate to quantify degrees of belief by real numbers p, the problem becomes how do we assign these numbers. The sum and product rules show how we should assign probabilities to some statements once probabilities have been assigned to others. Here is an important example of how this works.
We want to assign a probability to a particular statement b. Let $a_1, a_2, \ldots, a_n$ be mutually exclusive and exhaustive statements and suppose that the probabilities of the conjunctions $ba_j$ are known. We want to calculate $p(b)$ given the joint probabilities $p(ba_j)$. The solution is straightforward: sum $p(ba_j)$ over all $a_j$s, use the product rule, and eq.(2.81) to get
$$\sum_j p(ba_j) = p(b)\sum_j p(a_j|b) = p(b)\,. \qquad (2.83)$$
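A minimal numerical sketch of eq.(2.83) may be useful (the joint probabilities below are invented for illustration and are not from the text):

# Marginalization, eq.(2.83): p(b) = sum_j p(b, a_j).
joint = {            # p(b, a_j) for three mutually exclusive, exhaustive a_j (illustrative)
    "a1": 0.10,
    "a2": 0.25,
    "a3": 0.15,
}

p_b = sum(joint.values())                      # p(b) = sum_j p(b, a_j)
print(p_b)                                     # -> 0.5

# Consistency check: the conditionals p(a_j|b) = p(b, a_j)/p(b) sum to one,
# so p(b) * sum_j p(a_j|b) reproduces p(b).
conditionals = {aj: pj / p_b for aj, pj in joint.items()}
assert abs(sum(conditionals.values()) - 1.0) < 1e-12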
Figure 2.3: In the double slit experiment particles are generated at s, pass
through a screen with two slits A and B, and are detected at the detector
screen. The observed interference pattern is evidence of wave-like behavior.
necessary for us to show that quantum effects are not a counterexample to the universality of probability theory.5
The argument below is also valuable in other ways. First, it provides an example of the systematic use of the sum and product rules. Second, it underscores the importance of remembering that probabilities are always conditional on something and that it is often useful to be very explicit about what those conditioning statements might be. Finally, we will learn something important about quantum mechanics.
The paradigmatic example for interference effects is the double slit experiment. It was first discussed in 1807 by Thomas Young who sought a demonstration of the wave nature of light that would be as clear and definitive as the interference effects of water waves. It has also been used to demonstrate the peculiarly counter-intuitive behavior of quantum particles which seem to propagate as waves but are only detected as particles — the so-called wave-particle duality.
The quantum version of the double slit problem can be briefly stated as follows. The experimental setup is illustrated in Figure 2.3. A single particle is emitted at a source s, it goes through a screen where two slits a and b have been cut, and the particle is later detected farther downstream at some location d.
The standard treatment goes something like this: according to the rules of quantum mechanics the amplitude for detecting the particle at d when both slits are open is the sum of the amplitude $\psi_a$ for reaching d through slit a and the amplitude $\psi_b$ for reaching d through slit b,
5 The flaw in the argument that quantum theory is incompatible with the standard rules for manipulating probabilities was pointed out long ago by B. O. Koopman in a paper that went largely unnoticed [Koopman 1955]. See also [Ballentine 1986].
$$\psi_{ab} = \psi_a + \psi_b\,, \qquad (2.85)$$
so that the probability of detection at d when both slits are open is
$$p_{ab}(d) \propto |\psi_{ab}|^2 = |\psi_a|^2 + |\psi_b|^2 + (\psi_a^*\psi_b + \psi_a\psi_b^*)\,. \qquad (2.86)$$
The first term on the right, $|\psi_a|^2 \propto p_a(d)$, reflects the probability of detection when only slit a is open and, similarly, $|\psi_b|^2 \propto p_b(d)$ is the probability when only b is open. The presence of the interference terms $\psi_a^*\psi_b + \psi_a\psi_b^*$ is taken as evidence that in quantum mechanics
$$p_{ab}(d) \neq p_a(d) + p_b(d)\,. \qquad (2.87)$$
So far so good.
One might go further and vaguely interpret (2.87) as "the probability of paths a or b is not the sum of the probability of path a plus the probability of path b". And here the trouble starts because it is then almost natural to reach the troubling conclusions that "quantum mechanics violates the sum rule", that "quantum mechanics lies outside the reach of classical probability theory", and that "quantum mechanics requires quantum probabilities." As we shall see, all these conclusions are unwarranted but in the attempt to "explain" them extreme ideas have been proposed. For example, according to the standard and still dominant explanation — the Copenhagen interpretation — quantum particles are not characterized as having definite positions or trajectories. It is only at the moment of detection that a particle acquires a definite position. Thus the Copenhagen interpretation evades the impasse about the alleged violation of the sum rule by claiming that it makes no sense to even raise the question of whether the particle went through one slit, through the other slit, through both slits or through neither.
The notion that physics is an example of inference and that probability theory is the universal framework for reasoning with incomplete information leads us along a different track. To construct our model of quantum mechanics we must first establish the subject matter:
The model: We shall assume that a "point particle" is a system characterized by its position, that the position of a particle has definite values at all times, and that the particle moves along trajectories that are continuous. Since the positions and the trajectories are in general unknown we are justified in invoking the use of probabilities.
Our goal is to show that these physical assumptions about the nature of particles can be combined with the rules of probability theory in a way that is compatible with the predictions of quantum mechanics. It is useful to introduce a notation that is explicit. We deal with the following propositions:
s = "the particle was emitted at the source s",
a = "slit a is open", b = "slit b is open",
$\alpha$ = "the particle went through slit a", $\beta$ = "the particle went through slit b",
d = "the particle is detected at d".
Since particles with definite positions cannot go through both a and b the last term vanishes, $p(\alpha\beta d|sab) = 0$. Therefore,
Using the product rule the left hand side of (2.89) can be written as
so that
$$p(d|sab) = p(\alpha d|sab) + p(\beta d|sab)\,. \qquad (2.92)$$
In order to compare this result with quantum mechanics, we rewrite eq.(2.87) in the new more explicit notation,
$$p(d|sab) \neq p(d|sa\tilde b) + p(d|s\tilde a b)\,. \qquad (2.93)$$
We can now see that probability theory, eq.(2.92), and quantum mechanics, eq.(2.93), are not in contradiction; they differ because they refer to the probabilities of different statements.
It is important to appreciate what we have shown and also what we have not shown. What we have just shown is that eqs.(2.87) or (2.93) are not in conflict with the sum rule of probability theory. What we have not (yet) shown is that the rules of quantum mechanics such as eqs.(2.85) and (2.86) can be derived as an example of inference; that is a much lengthier matter that will be tackled later in Chapter 11.
We pursue this matter further to find how one might be easily misled into a paradox. Use the product rule to rewrite eq.(2.92) as
$$p(d|sab) = p(\alpha|sab)\,p(d|sab\alpha) + p(\beta|sab)\,p(d|sab\beta)\,,$$
and consider the first term on the right. To a classically trained mind (or perhaps a classically brainwashed mind) it would appear reasonable to believe that the passage of the particle through slit a is completely unaffected by whether the distant slit b is open or not. We are therefore tempted to make the substitutions
$$p(\alpha|sab) \stackrel{?}{=} p(\alpha|sa\tilde b) \quad\text{and}\quad p(d|sab\alpha) \stackrel{?}{=} p(d|sa\tilde b\alpha)\,, \qquad \text{(C1)}$$
which, together with the corresponding substitutions for the second term, lead to
$$p(d|sab) \stackrel{?}{=} p(\alpha d|sa\tilde b) + p(\beta d|s\tilde a b)\,. \qquad \text{(C3)}$$
This equation does, indeed, contradict quantum mechanics, eq.(2.93). What is particularly insidious about the "classical" eqs.(C1-C3) is that, beyond being intuitive, there are situations in which these substitutions are actually correct — but not always.
We might ask, what is wrong with (C1-C3)? How could it possibly be otherwise? Well, it is otherwise. Equations (C1-C3) represent an assumption that happens to be wrong. The assumption does not reflect a wrong probability theory; it reflects wrong physics. It represents a piece of physical information that does not apply to quantum particles. Quantum mechanics looks so strange to classically trained minds because opening a slit at some distant location can have important effects even when the particle does not go through it. Quantum mechanics is indeed strange but this is not because it violates probability theory; it is strange because it is not local.
We conclude that there is no need to construct a theory of "quantum" probabilities. Conversely, there is no need to refer to probabilities as being "classical". There is only one kind of probability and quantum mechanics does not refute the claim that probability theory is of universal applicability.
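A small numerical sketch may help to see how eqs.(2.86)-(2.87) and the sum rule coexist; the amplitudes below are invented illustrative values, not from the text.

# Amplitudes add and interfere; the probabilities still obey the sum rule,
# because p(d|sab) and p(d|sa~b), p(d|s~ab) condition on different statements.
import cmath

psi_a = 0.6 * cmath.exp(1j * 0.0)     # illustrative amplitude through slit a
psi_b = 0.6 * cmath.exp(1j * 2.0)     # illustrative amplitude through slit b, different phase

p_a  = abs(psi_a) ** 2                # (proportional to) detection prob., only a open
p_b  = abs(psi_b) ** 2                # (proportional to) detection prob., only b open
p_ab = abs(psi_a + psi_b) ** 2        # both slits open: amplitudes add first

print(p_ab, p_a + p_b)                # p_ab != p_a + p_b : interference, eq.(2.87)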
seems reasonable that those values $x_i$ that have larger $p_i$ should have a dominant contribution to the estimate of x. We therefore make the following reasonable choice: The expected value of the quantity x is denoted by $\langle x\rangle$ and is given by
$$\langle x\rangle \stackrel{\rm def}{=} \sum_i p_i x_i\,. \qquad (2.95)$$
The term 'expected' value is not always an appropriate one because it can happen that $\langle x\rangle$ is not one of the values $x_i$ that is actually allowed; in such cases the "expected" value $\langle x\rangle$ is not a value we would expect. For example, the expected value of a die toss is $(1+\cdots+6)/6 = 3.5$ which is not an allowed result.
Using the average $\langle x\rangle$ as an estimate of x may be reasonable, but it is also somewhat arbitrary. Alternative estimates are possible; one could, for example, have chosen the value for which the probability is maximum — this is called the 'mode'. This raises two questions.
The first question is whether $\langle x\rangle$ is a good estimate. If the probability distribution is sharply peaked all the values of x that have appreciable probabilities are close to each other and to $\langle x\rangle$. Then $\langle x\rangle$ is a good estimate. But if the distribution is broad the actual value of x may deviate from $\langle x\rangle$ considerably. To describe quantitatively how large this deviation might be we need to describe how broad the probability distribution is.
A convenient measure of the width of the distribution is the root mean square (rms) deviation defined by
$$\Delta x \stackrel{\rm def}{=} \bigl\langle(x-\langle x\rangle)^2\bigr\rangle^{1/2}\,. \qquad (2.96)$$
The quantity $\Delta x$ is also called the standard deviation; its square $(\Delta x)^2$ is called the variance. The term 'variance' may suggest variability or spread but there is no implication that x is necessarily fluctuating or that its values are spread; $\Delta x$ merely refers to our incomplete knowledge about x.6
If $\Delta x \ll \langle x\rangle$ then x will not deviate much from $\langle x\rangle$ and we expect $\langle x\rangle$ to be a good estimate.
The definition of $\Delta x$ is somewhat arbitrary. It is dictated both by common sense and by convenience. Alternatively we could have chosen to define the width of the distribution as $\langle|x-\langle x\rangle|\rangle$ or $\langle(x-\langle x\rangle)^4\rangle^{1/4}$ but these definitions are less convenient for calculations.
6 The interpretation of probability matters. Among the many infinities that afflict quantum field theories the variance of fields and of the corresponding energies at a point are badly divergent quantities. If these variances reflect actual physical fluctuations one should also expect those fluctuations of the spacetime geometry that are sometimes described as a spacetime foam. On the other hand, if one adopts a view of probability as a tool for inference then the situation changes significantly. One can argue that the information codified into quantum field theories is sufficient to provide successful estimates of some quantities — which accounts for the tremendous success of these theories — but is completely inadequate for the estimations of other quantities. Thus divergent variances may be more descriptive of our complete ignorance rather than of large physical fluctuations.
Now that we have a way of deciding whether $\langle x\rangle$ is a good estimate for x we may raise a second question: Is there such a thing as the "best" estimate for x? Consider an alternative estimate $x'$. The alternative $x'$ is "good" if the deviations from it are small, i.e., $\langle(x-x')^2\rangle$ is small. The condition for the "best" $x'$ is that its variance be a minimum,
$$\left.\frac{d}{dx'}\bigl\langle(x-x')^2\bigr\rangle\right|_{x'_{\rm best}} = 0\,, \qquad (2.97)$$
which implies $x'_{\rm best} = \langle x\rangle$. Conclusion: $\langle x\rangle$ is the best estimate for x when by "best" we mean the estimate with the smallest variance. But other choices are possible; for example, had we actually decided to minimize the width $\langle|x-x'|\rangle$ the best estimate would have been the median, $x'_{\rm best} = x_m$, a value such that $\mathrm{Prob}(x<x_m) = \mathrm{Prob}(x>x_m) = 1/2$.
We conclude this section by mentioning two important identities that will be repeatedly used in what follows. The first is that the average deviation from the mean vanishes,
$$\bigl\langle x - \langle x\rangle\bigr\rangle = 0\,, \qquad (2.98)$$
because deviations from the mean are just as likely to be positive as negative. The second useful identity is
$$\bigl\langle(x-\langle x\rangle)^2\bigr\rangle = \langle x^2\rangle - \langle x\rangle^2\,. \qquad (2.99)$$
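As a quick numerical illustration (a minimal sketch, not part of the text), the following Python fragment evaluates $\langle x\rangle$ and $\Delta x$ for a fair die and checks the identity (2.99):

# Fair die: outcomes 1..6, each with probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
p = [1.0 / 6.0] * 6

mean = sum(pi * xi for pi, xi in zip(p, outcomes))              # <x>, eq.(2.95)
var  = sum(pi * (xi - mean) ** 2 for pi, xi in zip(p, outcomes))
rms  = var ** 0.5                                               # Delta x, eq.(2.96)

# Identity (2.99): <(x - <x>)^2> = <x^2> - <x>^2
second_moment = sum(pi * xi ** 2 for pi, xi in zip(p, outcomes))
assert abs(var - (second_moment - mean ** 2)) < 1e-12

print(mean, rms)    # -> 3.5  1.7078...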
$$P(m|N,\theta) = \binom{N}{m}\,\theta^m(1-\theta)^{N-m}\,. \qquad (2.101)$$
This is called the binomial distribution. The range of applicability of this distribution is enormous. Whenever trials are identical (same probability $\theta$ in every trial) and independent (i.e., the outcome of one trial has no influence on the outcome of another, or alternatively, knowing the outcome of one trial provides us with no information about the possible outcomes of another) the distribution is binomial.
Next we briefly review some properties of the binomial distribution. The parameter $\theta$ plays two separate roles. On one hand $\theta$ is a parameter that labels the distributions $P(m|N,\theta)$; on the other hand, we have $P(1|1,\theta) = \theta$ so that the parameter $\theta$ also happens to be the probability of the outcome in question in a single trial.
Using the binomial theorem (hence the name of the distribution) one can show these probabilities are correctly normalized:
$$\sum_{m=0}^N P(m|N,\theta) = \sum_{m=0}^N\binom{N}{m}\theta^m(1-\theta)^{N-m} = \bigl(\theta + (1-\theta)\bigr)^N = 1\,. \qquad (2.102)$$
This sum over m is complicated. The following elegant trick is useful. Consider the sum
$$S(\theta,\phi) = \sum_{m=0}^N m\binom{N}{m}\theta^m\phi^{N-m}\,,$$
This is the best estimate, but how good is it? To find the answer we need to calculate the variance
$$(\Delta m)^2 = \bigl\langle(m - \langle m\rangle)^2\bigr\rangle = \langle m^2\rangle - \langle m\rangle^2\,.$$
To find $\langle m^2\rangle$,
$$\langle m^2\rangle = \sum_{m=0}^N m^2 P(m|N,\theta) = \sum_{m=0}^N m^2\binom{N}{m}\theta^m(1-\theta)^{N-m}\,,$$
Therefore,
$$\langle m^2\rangle = (N\theta)^2 + N\theta(1-\theta)\,, \qquad (2.104)$$
so that $(\Delta m)^2 = \langle m^2\rangle - \langle m\rangle^2 = N\theta(1-\theta)$.
Now we can address the question of how good an estimate $\langle m\rangle$ is. Notice that $\Delta m$ grows with N. This might seem to suggest that our estimate of m gets worse for large N but this is not quite true because $\langle m\rangle$ also grows with N. The ratio
$$\frac{\Delta m}{\langle m\rangle} = \left(\frac{1-\theta}{N\theta}\right)^{1/2} \propto \frac{1}{N^{1/2}}\,, \qquad (2.106)$$
shows that while both the estimate $\langle m\rangle$ and its uncertainty $\Delta m$ grow with N, the relative uncertainty decreases.
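A brief numerical check (a sketch with an arbitrary choice of N and $\theta$, not from the text) of the mean $N\theta$, the variance $N\theta(1-\theta)$ of eq.(2.104), and the relative uncertainty of eq.(2.106):

# Binomial distribution: mean, variance and relative uncertainty.
from math import comb, sqrt

N, theta = 100, 0.3                          # illustrative values
P = [comb(N, m) * theta**m * (1 - theta)**(N - m) for m in range(N + 1)]

mean = sum(m * P[m] for m in range(N + 1))                 # <m> = N*theta
var  = sum(m * m * P[m] for m in range(N + 1)) - mean**2   # (Delta m)^2 = N*theta*(1-theta)

print(mean, N * theta)                       # 30.0  30.0
print(var, N * theta * (1 - theta))          # 21.0  21.0
print(sqrt(var) / mean)                      # relative uncertainty ~ 1/sqrt(N)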
where $\varepsilon$ is an arbitrary constant. Replacing $(x-\langle x\rangle)^2$ by its least value $\varepsilon^2$ gives
$$(\Delta x)^2 \ge \varepsilon^2\int_{|x-\langle x\rangle|\ge\varepsilon} p(x)\,dx = \varepsilon^2\,P(|x-\langle x\rangle|\ge\varepsilon)\,;$$
applied to the relative frequency $f = m/N$ of the binomial distribution, for which
$$\langle f\rangle = \theta \quad\text{and}\quad (\Delta f)^2 = \frac{\theta(1-\theta)}{N}\,,$$
this gives
$$P(|f-\theta|\ge\varepsilon\,|\,N) \le \frac{\theta(1-\theta)}{N\varepsilon^2}\,. \qquad (2.110)$$
Thus, probabilities and frequencies are related to each other but they are not the same thing. Since $\langle f\rangle = \theta$, one might have been tempted to define the probability in terms of the expected frequency $\langle f\rangle$ but this does not work. The problem is that the notion of expected value presupposes that the concept of probability has already been defined. Defining probability in terms of expected values would be circular.7
We can express this important point in yet a different way: We cannot define probability as a limiting frequency $\lim_{N\to\infty} f$ because there exists no frequency function $f = m(N)/N$ of which to take a limit; the limit makes no sense.
The law of large numbers is easily generalized beyond the binomial distribution. Consider the average
$$\bar x = \frac{1}{N}\sum_{r=1}^N x_r\,, \qquad (2.113)$$
Furthermore, since the $x_r$ are independent, their variances are additive. For example,
$$\mathrm{var}(x_1+x_2) = \mathrm{var}(x_1) + \mathrm{var}(x_2)\,. \qquad (2.115)$$
(Prove it.) Therefore, denoting the common variance of the $x_r$ by $\sigma^2$,
$$\mathrm{var}(\bar x) = \mathrm{var}\!\left(\sum_{r=1}^N\frac{x_r}{N}\right) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}\,. \qquad (2.116)$$
Tchebyshev's inequality now gives, with $\mu = \langle x_r\rangle$ denoting the common mean,
$$P(|\bar x - \mu|\ge\varepsilon\,|\,N) \le \frac{\sigma^2}{N\varepsilon^2}\,, \qquad (2.117)$$
so that for any $\varepsilon > 0$
$$\lim_{N\to\infty}P(|\bar x-\mu|\ge\varepsilon\,|\,N) = 0 \quad\text{or}\quad \lim_{N\to\infty}P(|\bar x-\mu|<\varepsilon\,|\,N) = 1\,, \qquad (2.118)$$
or
$$\bar x \to \mu \quad\text{in probability.} \qquad (2.119)$$
Again, what vanishes for large N is not the difference $\bar x - \mu$ itself, but rather the probability that $|\bar x - \mu|$ is larger than any given small amount.
7 Expected values can be introduced independently of probability (see [Jeffrey 2004]) but
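A simulation sketch of the law of large numbers, eqs.(2.117)-(2.119), may be helpful; the distribution, sample sizes and tolerance below are arbitrary illustrative choices, not from the text.

# The probability that |x_bar - mu| >= eps shrinks with N, as bounded by eq.(2.117).
import random

random.seed(0)
mu, eps, trials = 0.5, 0.05, 2000            # x_r uniform on [0,1]: mu = 1/2, sigma^2 = 1/12

def sample_average(N):
    return sum(random.random() for _ in range(N)) / N

for N in (10, 100, 1000):
    misses = sum(abs(sample_average(N) - mu) >= eps for _ in range(trials))
    chebyshev_bound = (1.0 / 12.0) / (N * eps ** 2)
    print(N, misses / trials, chebyshev_bound)   # empirical probability vs the bound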
The intuition behind this idea is that the errors of the individual measurements will probably be positive just as often as they are negative so that in the sum the errors will tend to cancel out. Thus one expects that the error of $\bar x$ will be smaller than that of any of the individual $x_r$.
This intuition can be put on a firmer ground as follows. Let us assume that the measurements are performed under identical conditions and are independent of each other. We also assume that although there is some unknown error the experiments are unbiased — that is, they are at least expected to yield the right answer. This is expressed by
$$\langle x_r\rangle = x \quad\text{and}\quad \Delta x_r = \sigma\,. \qquad (2.121)$$
The sample average $\bar x$ is also afflicted by some unknown error so that strictly $\bar x$ is not the same as x but its expected value is. Indeed,
$$\langle\bar x\rangle = \frac{1}{N}\sum_{r=1}^N\langle x_r\rangle = x\,. \qquad (2.122)$$
is very sharply peaked at $\langle m\rangle = N\theta$. This suggests that to find a good approximation for P we need to pay special attention to a very small range of m. One might be tempted to follow the usual approach and directly expand in a Taylor series but a problem becomes immediately apparent: if a small change in m produces a small change in P then we only need to keep the first few terms, but in our case P is a very sharp function. To reproduce this kind of behavior we need a huge number of terms in the series expansion which is impractical. Having diagnosed the problem one can easily find a cure: instead of finding a Taylor expansion for the rapidly varying P, one finds an expansion for log P which varies much more smoothly.
Let us therefore expand log P about its maximum at $m_0$, the location of which is at this point still unknown. The first few terms are
$$\log P = \log P|_{m_0} + \left.\frac{d\log P}{dm}\right|_{m_0}(m-m_0) + \frac{1}{2}\left.\frac{d^2\log P}{dm^2}\right|_{m_0}(m-m_0)^2 + \ldots\,,$$
where, from eq.(2.101),
$$\log P(m|N,\theta) = \log N! - \log m! - \log(N-m)! + m\log\theta + (N-m)\log(1-\theta)\,.$$
What is a derivative with respect to an integer? For large m the function log m! varies so slowly (relative to the huge value of log m! itself) that we may consider m to be a continuous variable. This leads to a very useful approximation — called the Stirling approximation — for the logarithm of a large factorial,
$$\log m! = \sum_{n=1}^m \log n \approx \int_1^{m+1}\log x\,dx = (x\log x - x)\Big|_1^{m+1} \approx m\log m - m\,.$$
A somewhat better expression which includes the next term in the Stirling expansion is
$$\log m! \approx m\log m - m + \frac{1}{2}\log 2\pi m + \ldots \qquad (2.125)$$
Notice that the third term is much smaller than the first two: the first two terms are of order m while the last is of order log m. For $m = 10^{23}$, log m is only 55.3.
The derivatives in the Taylor expansion are
$$\frac{d\log P}{dm} = -\log m + \log(N-m) + \log\theta - \log(1-\theta) = \log\frac{(N-m)\,\theta}{m\,(1-\theta)}\,,$$
and
$$\frac{d^2\log P}{dm^2} = -\frac{1}{m} - \frac{1}{N-m} = -\frac{N}{m(N-m)}\,.$$
To find the value $m_0$ where P is maximum set $d\log P/dm = 0$. This gives $m_0 = N\theta = \langle m\rangle$, and substituting into the second derivative of log P we get
$$\left.\frac{d^2\log P}{dm^2}\right|_{\langle m\rangle} = -\frac{1}{N\theta(1-\theta)} = -\frac{1}{(\Delta m)^2}\,.$$
Therefore
$$\log P = \log P(\langle m\rangle) - \frac{(m-\langle m\rangle)^2}{2(\Delta m)^2} + \ldots$$
or
$$P(m) = P(\langle m\rangle)\exp\left[-\frac{(m-\langle m\rangle)^2}{2(\Delta m)^2}\right]\,.$$
The remaining unknown constant $P(\langle m\rangle)$ can be evaluated by requiring that the distribution P(m) be properly normalized, that is
$$1 = \sum_{m=0}^N P(m) \approx \int_0^N P(x)\,dx \approx \int_{-\infty}^{\infty} P(x)\,dx\,.$$
Using
$$\int_{-\infty}^{\infty} e^{-\alpha x^2}\,dx = \sqrt{\frac{\pi}{\alpha}}\,,$$
we get
$$P(\langle m\rangle) = \frac{1}{\sqrt{2\pi(\Delta m)^2}}\,.$$
Thus, the expression for the Gaussian distribution with mean $\langle m\rangle$ and rms deviation $\Delta m$ is
$$P(m) = \frac{1}{\sqrt{2\pi(\Delta m)^2}}\exp\left[-\frac{(m-\langle m\rangle)^2}{2(\Delta m)^2}\right]\,. \qquad (2.126)$$
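A short numerical comparison of the binomial distribution with its Gaussian approximation, eq.(2.126), can make the de Moivre-Laplace result concrete (N and $\theta$ below are illustrative choices, not from the text):

# Binomial P(m|N,theta) versus the Gaussian approximation of eq.(2.126).
from math import comb, exp, pi, sqrt

N, theta = 50, 0.3
mean = N * theta
var  = N * theta * (1 - theta)

def binom(m):
    return comb(N, m) * theta**m * (1 - theta)**(N - m)

def gauss(m):
    return exp(-(m - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

for m in (5, 10, 15, 20, 25):
    print(m, round(binom(m), 5), round(gauss(m), 5))   # the two agree near the peak m ~ 15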
where $\langle x\rangle = N\theta\varepsilon$ and $(\Delta x)^2 = N\theta(1-\theta)\varepsilon^2$. Thus, the Gaussian distribution arises whenever we have a quantity that is the result of adding a large number of small independent contributions. The derivation above assumes that the microscopic contributions are discrete (either 0 or $\varepsilon$), and identically distributed but, as shown in the next section, both of these conditions can be relaxed.
Note that now we no longer assume that the variables $x_r$ are identically distributed nor that the distributions $p_r(x_r)$ are binomial, but we still assume independence.
The probability density for $X_N$ is given by the integral
$$P_N(X) = \int dx_1\ldots dx_N\;p_1(x_1)\cdots p_N(x_N)\;\delta\!\left(X - \sum_{r=1}^N x_r\right)\,. \qquad (2.131)$$
(The expression on the right is the expected value of an indicator function. The derivation of (2.131) is left as an exercise.)
diverge in cases that are physically interesting such as when the variables $x_r$ are identically distributed ($\mu_r$ and $\sigma_r$ are independent of r). To resolve this difficulty instead of the variable $X_N$ we will consider a different suitably shifted and normalized variable,
$$Y = \frac{X - m_N}{s_N}\,, \qquad (2.134)$$
where
$$m_N \stackrel{\rm def}{=} \sum_{r=1}^N\mu_r \quad\text{and}\quad s_N^2 \stackrel{\rm def}{=} \sum_{r=1}^N\sigma_r^2\,. \qquad (2.135)$$
$$\lim_{N\to\infty}\frac{1}{s_N^3}\sum_{r=1}^N\bigl\langle|x_r-\mu_r|^3\bigr\rangle = 0\,, \qquad (2.137)$$
then
$$\lim_{N\to\infty}P_N(Y) = P(Y) = \frac{1}{\sqrt{2\pi}}\,e^{-Y^2/2}\,, \qquad (2.138)$$
which is Gaussian with zero mean and unit variance, $\langle Y\rangle = 0$ and $\Delta Y = 1$.
Proof:
Consider the Fourier transform,
$$F_N(k) = \int_{-\infty}^{+\infty}dY\,P_N(Y)\,e^{ikY} = \int dx_1\ldots dx_N\;p_1(x_1)\cdots p_N(x_N)\,\exp\!\left[\frac{ik}{s_N}\sum_{r=1}^N(x_r-\mu_r)\right]\,. \qquad (2.139)$$
The Fourier transform f(k) of a distribution p(x) has many interesting and useful properties. For example,
$$f(k) = \int dx\,p(x)\,e^{ikx} = \bigl\langle e^{ikx}\bigr\rangle\,, \qquad (2.140)$$
so that, expanding the exponential,
$$f(k) = \sum_{n=0}^\infty\frac{(ik)^n}{n!}\langle x^n\rangle\,. \qquad (2.141)$$
In words, the coefficients of the Taylor expansion of f(k) give all the moments of p(x). The Fourier transform f(k) is called the moment generating function and also the characteristic function of the distribution p(x).
Going back to the calculation of $P_N(Y)$, eq.(2.136), its Fourier transform, eq.(2.139), is
$$F_N(k) = \prod_{r=1}^N f_r(k)\,, \qquad (2.142)$$
where
$$f_r(k) = \int dx_r\,p_r(x_r)\,\exp\!\left[\frac{ik}{s_N}(x_r-\mu_r)\right]\,.$$
Since $s_N$ diverges as $N\to\infty$ we can expand
$$f_r(k) = 1 + i\frac{k}{s_N}\bigl\langle x_r-\mu_r\bigr\rangle - \frac{k^2}{2s_N^2}\bigl\langle(x_r-\mu_r)^2\bigr\rangle + R_r\!\left(\frac{k}{s_N}\right) = 1 - \frac{k^2\sigma_r^2}{2s_N^2} + R_r\!\left(\frac{k}{s_N}\right)\,, \qquad (2.143)$$
$$\left|R_r\!\left(\frac{k}{s_N}\right)\right| \le C\,\left|\frac{k}{s_N}\right|^3\bigl\langle|x_r-\mu_r|^3\bigr\rangle \qquad (2.144)$$
for some constant C. Therefore, for sufficiently large N, using $\log(1+x)\approx x$,
$$\log F_N(k) = \sum_{r=1}^N\log f_r(k) \qquad (2.145)$$
$$\approx \sum_{r=1}^N\left[-\frac{k^2\sigma_r^2}{2s_N^2} + R_r\!\left(\frac{k}{s_N}\right)\right] = -\frac{k^2}{2} + \sum_{r=1}^N R_r\!\left(\frac{k}{s_N}\right)\,. \qquad (2.146)$$
By the Lyapunov condition (2.137),
$$\left|\sum_{r=1}^N R_r\!\left(\frac{k}{s_N}\right)\right| \le C\,\frac{|k|^3}{s_N^3}\sum_{r=1}^N\bigl\langle|x_r-\mu_r|^3\bigr\rangle \to 0\,, \qquad (2.147)$$
so that
$$\log F_N(k)\to -\frac{k^2}{2} \quad\text{or}\quad F_N(k)\to F(k) = e^{-k^2/2}\,. \qquad (2.148)$$
Taking the inverse Fourier transform leads to eq.(2.138) and concludes the proof.
It is easy to check that the Lyapunov condition is satisfied when the $x_r$ variables are identically distributed. Indeed, if all $\mu_r = \mu$, $\sigma_r = \sigma$ and $\langle|x_r-\mu|^3\rangle = \rho^3$, then
$$s_N^2 = \sum_{r=1}^N\sigma_r^2 = N\sigma^2 \quad\text{and}\quad \lim_{N\to\infty}\frac{1}{s_N^3}\sum_{r=1}^N\bigl\langle|x_r-\mu_r|^3\bigr\rangle = \lim_{N\to\infty}\frac{N\rho^3}{N^{3/2}\sigma^3} = 0\,. \qquad (2.149)$$
We can now return to our original goal of calculating the probability distribution of $X_N$. For any given but sufficiently large N we have
$$P_N(X)\,dX = P_N(Y)\,dY \approx \frac{1}{\sqrt{2\pi}}\,e^{-Y^2/2}\,dY\,. \qquad (2.150)$$
From eq.(2.134) we have
$$dY = \frac{dX}{s_N}\,, \qquad (2.151)$$
therefore
$$P_N(X) \approx \frac{1}{\sqrt{2\pi s_N^2}}\exp\left[-\frac{(X-m_N)^2}{2s_N^2}\right]\,. \qquad (2.152)$$
And when the $x_r$ variables are identically distributed we get
$$P_N(X) \approx \frac{1}{\sqrt{2\pi N\sigma^2}}\exp\left[-\frac{(X-N\mu)^2}{2N\sigma^2}\right]\,. \qquad (2.153)$$
To conclude we comment on the significance of the central limit theorem. We have shown that almost independently of the form of the distributions $p_r(x_r)$ the distribution of the sum X is Gaussian centered at $\sum_r\mu_r$ with standard deviation $\bigl(\sum_r\sigma_r^2\bigr)^{1/2}$. Not only do the $p_r(x_r)$ not need to be binomial, they do not even have to be equal to each other. This helps to explain the widespread applicability of Gaussian distributions: they apply to almost any 'macro-variables' (such as X) that result from adding a large number of independent 'micro-variables' (such as $x_r$).
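A simulation sketch of this statement (the micro-distributions below are arbitrary and not identically distributed, chosen only for illustration) compares the empirical distribution of the sum with the Gaussian of eq.(2.152):

# Central limit theorem: sum of independent, non-identically distributed variables.
import random
from math import pi, sqrt

random.seed(1)
N, samples = 200, 20000

# Each x_r is uniform on [0, w_r] with a different width w_r:
widths = [0.5 + 2.0 * (r % 5) / 5.0 for r in range(N)]
m_N  = sum(w / 2.0 for w in widths)                 # sum of means mu_r
s_N2 = sum(w * w / 12.0 for w in widths)            # sum of variances sigma_r^2

sums = [sum(random.uniform(0.0, w) for w in widths) for _ in range(samples)]

# Fraction of samples in a small window around m_N versus the Gaussian prediction:
half = 0.5
frac = sum(abs(X - m_N) < half for X in sums) / samples
pred = 2 * half / sqrt(2 * pi * s_N2)               # P_N(m_N) times the window width
print(frac, pred)                                   # the two numbers nearly coincide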
But there are restrictions. Although Gaussian distributions are very common, there are exceptions. The derivation shows that the Lyapunov condition played a critical role. Earlier we mentioned that the success of Gaussian distributions is due to the fact that they codify the information that happens to be
priors by q and posteriors by p. Later, however, we will often revert to the common practice of referring to all probabilities as p. Hopefully no confusion should arise as the correct meaning should be clear from the context.
experiment, add some uncertainty in the form of Gaussian noise and we have a very reasonable estimate of the conditional distribution $q(x|\theta)$. The distribution $q(x|\theta)$ is called the sampling distribution and also (but less appropriately) the likelihood function. We will assume it is known. We should emphasize that the crucial information about how x is related to $\theta$ is contained in the functional form of the distribution $q(x|\theta)$ — say, whether it is a Gaussian or a Cauchy distribution — and not in the actual values of x and $\theta$ which are, at this point, still unknown.
Thus, to describe the web of prior beliefs we must know the prior $q(\theta)$ and also the sampling distribution $q(x|\theta)$. This means that we must know the full joint distribution,
$$q(\theta,x) = q(\theta)\,q(x|\theta)\,. \qquad (2.154)$$
This is important. We must be clear about what we are talking about: the relevant universe of discourse is neither the space $\Theta$ of possible $\theta$s nor the space X of possible data x. It is rather the product space $\Theta\times X$ and the probability distributions that concern us are the joint distributions $q(\theta,x)$.
Next we collect data: the observed value turns out to be $x'$. Our goal is to use this information to update to a web of posterior beliefs represented by a new joint distribution $p(\theta,x)$. How shall we choose $p(\theta,x)$? Since the new data tells us that the value of x is now known to be $x'$ the new web of beliefs is constrained to satisfy
$$p(x) = \int d\theta\,p(\theta,x) = \delta(x-x')\,. \qquad (2.155)$$
(For simplicity we have here assumed that x is a continuous variable; had x been discrete the Dirac $\delta$s would be replaced by Kronecker $\delta$s.) This is all we know and it is not sufficient to determine $p(\theta,x)$. Apart from the general requirement that the new web of beliefs must be internally consistent there is nothing in any of our previous considerations that induces us to prefer one consistent web over another. A new principle is needed and this is where the prior information comes in.
of how we ought to choose our beliefs – incorporates an ethical component. Pursued to its
This seems so reasonable and natural that an explicit statement may appear superfluous. The important point, however, is that it is not logically necessary. We could update in many other ways that preserve both internal consistency and consistency with the new information.
As we saw above the new data, eq.(2.155), does not fully determine the joint distribution
$$p(\theta,x) = p(x)\,p(\theta|x) = \delta(x-x')\,p(\theta|x)\,. \qquad (2.156)$$
All distributions of the form
$$p(\theta,x) = \delta(x-x')\,p(\theta|x')\,,$$
where $p(\theta|x')$ is quite arbitrary, are compatible with the newly acquired data. We still need to assign $p(\theta|x')$. It is at this point that we invoke the PMU. We stipulate that, having updated q(x) to $p(x) = \delta(x-x')$, no further revision is needed and we set
$$p(\theta|x') = q(\theta|x')\,. \qquad \text{(PMU)}$$
Therefore, the web of posterior beliefs is described by
$$p(\theta,x) = \delta(x-x')\,q(\theta|x)\,,$$
and the corresponding marginal is obtained by integrating over x,
$$p(\theta) = \int dx\,p(\theta,x)\,,$$
to get
$$p(\theta) = q(\theta|x')\,. \qquad (2.160)$$
In words, the posterior probability equals the prior conditional probability of $\theta$ given $x'$. This result, which we will call Bayes' rule, is extremely reasonable: we maintain those beliefs about $\theta$ that are consistent with the data values $x'$ that turned out to be true. Beliefs based on values of x that were not observed are discarded because they are now known to be false. 'Maintain' and 'discard' are the key words: the former reflects the PMU in action, the latter is the updating.
Using the product rule,
$$q(\theta|x') = q(\theta)\,\frac{q(x'|\theta)}{q(x')}\,, \qquad (2.161)$$
the posterior (2.160) can be written as
$$p(\theta) = q(\theta)\,\frac{q(x'|\theta)}{q(x')}\,. \qquad (2.162)$$
The prior $q(\theta)$ is thus multiplied by the likelihood factor $q(x'|\theta)$ in such a way as to enhance our preference for values of $\theta$ that make the observed data more likely, less surprising.
The factor in the denominator $q(x')$, which is often called the 'evidence', is the prior probability of the data. It is given by
$$q(x') = \int q(\theta)\,q(x'|\theta)\,d\theta\,, \qquad (2.163)$$
and plays the role of a normalization constant for the posterior distribution $p(\theta)$. It does not help to discriminate one value of $\theta$ from another because it affects all values of $\theta$ equally. As we shall see later in this chapter the evidence turns out to be important in problems of model selection (see eq. 2.234).
Remark: Bayes' rule is often written in the form
$$q(\theta|x') = q(\theta)\,\frac{q(x'|\theta)}{q(x')}\,, \qquad (2.164)$$
and called Bayes' theorem.11 This formula is very simple; but perhaps it is too simple. It is true for any value of $x'$ whether observed or not. Eq.(2.164) is just a restatement of the product rule, eq.(2.161), and therefore it is a simple consequence of the internal consistency of the prior web of beliefs. No posteriors are involved: the left hand side is not a posterior but rather a prior probability – the prior conditional on $x'$. To put it differently, in an actual update, $q(\theta)\to p(\theta)$, both probabilities refer to the same proposition $\theta$. In (2.164), $q(\theta)\to q(\theta|x')$ cannot be an update because it refers to the probabilities of two different propositions, $\theta$ and $\theta|x'$. Of course these subtleties have not stood in the way of the many extremely successful applications of Bayes' theorem. But by confusing priors with posteriors the formula (2.164) has contributed to obscure the fact that an additional principle – the PMU – was needed for updating. And this has stood in the way of a deeper understanding of the connection between the Bayesian and entropic methods of inference.
by Bayes. The person who first explicitly stated the theorem and, more importantly, who first realized its deep significance was Laplace.
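A minimal numerical sketch of an update by Bayes' rule, eq.(2.162), may be useful; the prior, the Gaussian sampling distribution and the datum below are invented for illustration and are not from the text.

# Bayes' rule for a discrete parameter theta: p(theta) proportional to q(theta) q(x'|theta).
from math import exp, pi, sqrt

thetas = [0.0, 1.0, 2.0, 3.0]               # candidate values of theta (illustrative)
prior  = {t: 0.25 for t in thetas}          # flat prior q(theta)
sigma  = 1.0                                # width of the Gaussian sampling distribution

def likelihood(x, t):                       # q(x|theta), a Gaussian centered at theta
    return exp(-(x - t) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

x_obs = 1.8                                 # the observed datum x'
evidence = sum(prior[t] * likelihood(x_obs, t) for t in thetas)   # q(x'), eq.(2.163)
posterior = {t: prior[t] * likelihood(x_obs, t) / evidence for t in thetas}

print(posterior)    # the probability mass concentrates on theta = 2, closest to the datum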
$$q(c|y) = \frac{q(c)\,A}{q(y)}\,.$$
One also needs to know q(y), the probability of the test being positive irrespective of whether the person has cancer or not. To obtain q(y) use
$$q(\tilde c|y) = \frac{q(\tilde c)\,q(y|\tilde c)}{q(y)} = \frac{(1-q(c))(1-A)}{q(y)}\,,$$
and the fact that $q(c|y) + q(\tilde c|y) = 1$, which gives
$$q(c|y) = \frac{q(c)\,A}{q(c)\,A + (1-q(c))(1-A)}\,. \qquad (2.165)$$
For an accuracy A = 0.99 and an incidence q(c) = 0.01 we get q(c|y) = 50% which is not nearly as bad as one might have originally feared. Should one dismiss the information provided by the test as misleading? No. Note that the probability of having cancer prior to the test was 1% and on learning the test result this was raised all the way up to 50%. Note also that when the disease is really rare, $q(c)\to 0$, we still get $q(c|y)\to 0$ even when the test is quite accurate. This means that for rare diseases most positive tests turn out to be false positives.
We conclude that both the prior and the data contain important information;
neither should be neglected.
Remark: The previous discussion illustrates a mistake that is common in verbal discussions: if h denotes a hypothesis and e is some evidence, it is quite obvious that we should not confuse q(e|h) with q(h|e). However, when expressed verbally the distinction is not nearly as obvious. For example, in a criminal trial jurors might be told that if the defendant was guilty (the hypothesis) the probability of some observed evidence would be large, and the jurors might easily be misled into concluding that given the evidence the probability is high that the defendant is guilty. Lawyers call this the prosecutor's fallacy.
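A two-line numerical check of eq.(2.165) (a sketch; the second, rarer incidence is an added illustrative value):

# Posterior probability of cancer given a positive test, eq.(2.165).
def posterior_cancer(prior_c, accuracy):
    # q(c|y) = q(c) A / [ q(c) A + (1 - q(c)) (1 - A) ]
    num = prior_c * accuracy
    return num / (num + (1.0 - prior_c) * (1.0 - accuracy))

print(posterior_cancer(0.01, 0.99))     # -> 0.5, as quoted in the text
print(posterior_cancer(0.001, 0.99))    # rarer disease: most positives are false positives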
about its meaning. So here is a preview of things to come: What is information? We will
continue to use the term with its usual colloquial meaning, namely, roughly, information is
what you get when your question receives a satisfactory answer. But we will also need a more
precise and technical de…nition. Later we shall elaborate on the idea that information is a
constraint on our beliefs, or better, on what our beliefs ought to be if only we were ideally
rational.
2.10 Updating probabilities: Bayes’rule 53
For simplicity we deal with just two identical experiments. The prior web of
beliefs is described by the joint distribution,
The second experiment yields the data x2 = x02 and requires a second application
of Bayes’rule. The posterior p1 ( ) in eq.(2.171) now plays the role of the prior
and the new posterior distribution for is
q(x02 j )
p12 ( ) = p1 ( jx02 ) = p1 ( ) ; (2.172)
p1 (x02 )
therefore
p12 ( ) / q( )q(x01 j )q(x02 j ) : (2.173)
We have explicitly followed the update from q( ) to p1 ( ) to p12 ( ). The same
result is obtained if the data from both experiments were processed simultane-
ously,
p12 ( ) = q( jx01 ; x02 ) / q( )q(x01 ; x02 j ) : (2.174)
From the symmetry of eq.(2.173) it is clear that the same posterior p12 ( ) is
obtained irrespective of the order that the data x01 and x02 are processed. The
commutivity of Bayesian updating follows from the special circumstance that the
information conveyed by one experiment does not revise or render obsolete the
information conveyed by the other experiment. As we generalize our methods
of inference for processing other kinds of information that do interfere with each
other (and therefore one may render the other obsolete) we should not expect,
much less demand, that commutivity will continue to hold.
The problem of choosing the …rst prior in the inference chain is a di¢ cult
one. We will tackle it in several di¤erent ways. Later in this chapter, as we
introduce some elementary notions of data analysis, we will address it in the
standard way: just make a “reasonable” guess — whatever that might mean.
When tackling familiar problems where we have experience and intuition this
seems to work well. But when the problems are truly new and we have neither
experience nor intuition then the guessing can be risky and we would like to
develop more systematic ways to proceed. Indeed it can be shown that certain
types of prior information (for example, symmetries and/or other constraints)
can be objectively translated into a prior once we have developed the appropriate
tools — entropy and geometry. (See e.g. [Jaynes 1968][Caticha Preuss 2004]
and references therein.)
Our more immediate goal here is, …rst, to remark on the dangerous conse-
quences of extreme degrees of belief, and then to prove our previous intuitive
assertion that the accumulation of data will swamp the original prior and render
it irrelevant.
q(xj B)
p( jB) = q( jB) =1: (2.175)
q(xjB)
Q
N
q(xj ) = q(xr j ) ; (2.176)
r=1
q( ) q( ) QN
p( ) = q(xj ) = q(xr j ) : (2.177)
q(x) q(x) r=1
p( 1 ) q( 1 )
= R(x) ; (2.178)
p( 2 ) q( 2 )
def Q
N
def q(xr j 1 )
R(x) = Rr (xr ) and Rr (xr ) = : (2.179)
r=1 q(xr j 2 )
1 1 PN
log R(x) = log Rr (xr ) : (2.182)
N N r=1
56 Probability
1
lim Pr log R(x) K( 1 ; 2) "j 1 =1 (2.183)
N !1 N
1
K( 1 ; 2) = log R(x)j 1
N
P
= q(xr j 1 ) log Rr (xr ) : (2.184)
xr
In other words,
given 1, eN (K ")
R(x) eN (K+") in probability. (2.185)
P q(xr j 1 )
K( 1 ; 2) = + q(xr j 1 ) log ; (2.186)
xr q(xr j 2)
p(Ej )
p( jE) = p( ) ; (2.187)
p(E)
1 3 Other names include relative information, directed divergence, and Kullback-Leibler dis-
tance.
1 4 From here on we revert to the usual notation p for probabilities. Whether p refers to a
prior or a posterior will, as is usual in this …eld, have to be inferred from the context.
2.11 Hypothesis testing and con…rmation 57
and one can say that the hypothesis is partially con…rmed or corroborated by
the evidence E when p( jE) > p( ).
Sometimes one wishes to compare two hypothesis, 1 and 2 , and the com-
parison is conveniently done using the ratio
p( 1 jE) p( 1 ) p(Ej 1 )
= : (2.188)
p( 2 jE) p( 2 ) p(Ej 2 )
The relevant quantity is the “likelihood ratio” or “Bayes factor”
def p(Ej 1 )
R( 1 ; 2) = : (2.189)
p(Ej 2 )
When R( 1 : 2 ) > 1 one says that the evidence E provides support in favor of
1 against 2 .
The question of the testing or con…rmation of a hypothesis is so central to
the scienti…c method that it pays to explore it. First we introduce the concept of
weight of evidence, a variant of the Bayes factor, that has been found particularly
useful in such discussions. Then, to explore some of the subtleties and potential
pitfalls we discuss the paradox associated with the name of Hempel.
Weight of evidence
A useful variant of the Bayes factor is its logarithm,
def p(Ej 1 )
wE ( 1 ; 2) = log : (2.190)
p(Ej 2 )
This is called the weight of evidence for 1 against 2 [Good 1950].15 A useful
special case is when the second hypothesis 2 is the negation of the …rst. Then
def p(Ej )
wE ( ) = log ; (2.191)
p(Ej ~)
is called the weight of evidence in favor of the hypothesis provided by the
evidence E. The change to a logarithmic scale is convenient because it confers
useful additive properties upon the weight of evidence — which justi…es calling
it a ‘weight’. Consider, for example, the odds in favor of given by the ratio
def p( )
Odds( ) = : (2.192)
p( ~)
The posterior and prior odds are related by
p( jE) p( ) p(Ej )
= ; (2.193)
p( ~jE) p( ~) p(Ej ~)
1 5 According to [Good 1983] the concept was known to H. Je¤reys and A. Turing around
1940-41 and C. S. Peirce had proposed the name weight of evidence for a similar concept
already in 1878.
58 Probability
Hempel’s paradox
Here is the paradox: “A case of a hypothesis supports the hypothesis. Now, the
hypothesis that all crows are black is logically equivalent to the contrapositive
that all non-black things are non-crows, and this is supported by the observation
of a white shoe.” [Hempel 1967]
The premise that “a case of a hypothesis supports the hypothesis” seems
reasonable enough. After all, how else but by observing black crows can one ever
expect to con…rm that “all crows are black”? But to assert that the observation
of a white shoe con…rms that all crows are black seems a bit too much. If so
then the very same white shoe would equally well con…rm the hypotheses that
all crows are green, or that all swans are black. We have a paradox.
Let us consider the starting premise that the observation of a black crow
supports the hypothesis = “All crows are black” more carefully. Suppose we
observe a crow (C) and it turns out to be black (B). The evidence is E = BjC,
and the corresponding weight of evidence is positive,
p(BjC ) 1
wBjC ( ) = log = log 0; (2.196)
~
p(BjC ) p(BjC ~)
as expected. It is this result that justi…es our intuition that “a case of a hy-
pothesis supports the hypothesis”; the question is whether there are limitations.
[Good 1983]
The reference to the possibility of white shoes points to an uncertainty about
whether the observed object will turn out to be a crow or something else. Using
eq.(2.195) the relevant weight of evidence concerns the joint probability of B
and C,
wBC ( ) = wC ( ) + wBjC ( ) ; (2.197)
which, as we show below, is also positive. Indeed, using Bayes’theorem,
!
p(Cj ) p(C)p( jC) p( ~)
wC ( ) = log = log : (2.198)
p(Cj ~) p( ) p(C)p( ~jC)
2.11 Hypothesis testing and con…rmation 59
Now, in the absence of any background information about crows the observation
that a certain object turns out to be a crow tells us nothing about its color and
therefore p( jC) = p( ) and p( ~jC) = p( ~). Therefore wC ( ) = 0. Recalling
eq.(2.196) leads to
wBC ( ) 0 : (2.199)
A similar conclusion holds if the evidence consists in the observation of a white
shoe. Does a non-black non-crow support all crows are black? In this case
because
~B
p(Cj ~ ) 1
wCj
~ B~ ( ) = log = log 0 (2.200)
~B
p(Cj ~ ~) ~B
p(Cj ~ ~)
and
!
~ )
p(Bj ~
p(B)p( ~
jB) p( ~)
wB~ ( ) = log = log =0; (2.201)
~ ~)
p(Bj p( ) ~ ~jB)
p(B)p( ~
wBC ( 1 ) = wC ( 1 ) + wBjC ( 1 ) :
p(BjC 1 ) 1 3
wBjC ( 1 ) = log = log 3
10 (2.202)
p(BjC 2 ) 1 10
4 3
while p(Cj 1 ) = 10 and p(Cj 2 ) = 10 so that
p(Cj 1 ) 1
wC ( 1 ) = log = log 10 2:303 : (2.203)
p(Cj 2 )
we are completely ignorant about the relation between two variables the prudent
way to proceed is, of course, to try to …nd out whether a relevant connection
exists and what it might be. But this is not always possible and in these cases the
default assumption should be that they are a priori independent. Or shouldn’t
it?
The justi…cation of the assumption of independence a priori is purely prag-
matic. Indeed the universe contains an in…nitely large number of other variables
about which we know absolutely nothing and that could in principle a¤ect our
inferences. Seeking information about all those other variables is clearly out of
the question: waiting to make an inference until after all possible information
has been collected amounts to being paralyzed into making no inferences at
all. On the positive side, however, the assumption that the vast majority of
those in…nitely many other variables are completely irrelevant actually works
— perhaps not all the time but at least most of the time. Induction is risky.
There is one …nal loose end that we must revisit: our arguments above indi-
cate that, in the absence of any other background information, the observation
of a white shoe not only supports the hypothesis that “all crows are black”, but
it also supports the hypothesis that “all swans are black”. Two questions arise:
is this reasoning correct? and, if so, why is it so disturbing? The answer to the
…rst question is that it is indeed correct. The answer to the second question is
that con…rming the hypothesis “all swans are black” is disturbing because we
do have background information about the color of swans which we failed to
include in the analysis. Had we not known anything about swans there would
have been no reason to feel any discomfort at all. This is just one more exam-
ple of the fact that inductive arguments are not infallible; a positive weight of
evidence provides mere support and not absolute certainty.
1 (x )2
p(xj ; ) = p exp 2
: (2.204)
2 2 2
p( ; ) QN
p( ; jx) = p(xi j ; ) : (2.205)
p (x) i=1
Almost equally obvious (at least to those who are comfortable with the Bayesian
interpretation of probabilities) is
2. The information about the parameters that is codi…ed into the prior dis-
tribution p( ).
Where and how this prior information was obtained is not relevant at this point;
it could have resulted from previous experiments, or from other background
knowledge about the problem. The only relevant part is whatever ended up
being distilled into p( ).
The last piece of information is not always explicitly recognized; it is
3. The information that is codi…ed into the functional form of the ‘sampling’
distribution p(xj ).
1 PN 1 PN
2
x= xi and s2 = (xi x) ; (2.207)
N i=1 N i=1
62 Probability
eq.(2.206) becomes
" #
2
p( ; ) 1 ( x) + s2
p( ; jx) = exp : (2.208)
p (x) (2 2 )N=2 2 2 =N
It is interesting that the data appears here only in the particular combination
given in eq.(2.207) – di¤erent sets of data characterized by the same x and
s2 lead to the same inference about and . (As discussed earlier the factor
p (x) is not relevant here since it can be absorbed into the normalization of the
posterior p( ; jx).)
Eq. (2.208) incorporates the information described in items 1 and 3 above.
The prior distribution, item 2, remains to be speci…ed. Let us start by consid-
ering the simple case where the value of is actually known. Then p( ; ) =
p( ) ( 0 ) and the goal is to estimate . Bayes’theorem is now written as
" #
2
p( ) 1 PN (x
i )
p( jx) = exp (2.209)
p (x) (2 2 )N=2 i=1 2 02
0
" #
2
p( ) 1 ( x) + s2
= exp
p (x) (2 2 )N=2 2 02 =N
0
" #
2
( x)
/ p( ) exp 2 : (2.210)
2 0 =N
Suppose further that we know nothing about ; it could have any value. This
state of extreme ignorance is represented by a very broad distribution that we
take as essentially uniform within some large range; is just as likely to have one
value as another. For p( ) const the posterior distribution is Gaussian, with
mean given by the sample average x, and variance 02 =N: The best estimate
for the value of is the sample average and the uncertainty is the standard
deviation. This is usually expressed in the form
=x p0 : (2.211)
N
Note that the estimate of from N measurements has a much smaller error
than the estimate from just one measurement; the individual measurements are
plagued with errors but they tend to cancel out in the sample average — in
agreement with the previous result in eq.(2.123).
In the case of very little prior information — the uniform prior — we have
recovered the same results as in the standard non-Bayesian data analysis ap-
proach. But there are two important di¤erences: First, a frequentist approach
can yield an estimator but it cannot yield a probability distribution for a pa-
rameter that is not random but merely unknown. Second, the non-Bayesian
approach has no mechanism to handle additional prior information and can
only proceed by ignoring it. On the other hand, the Bayesian approach has
yielded a full probability distribution, eq.(2.210), and it can easily take prior
2.12 Examples from data analysis 63
information into account. For example, if, on the basis of other physical con-
siderations, we happen to know that has to be positive, then we just assign
p( ) = 0 for < 0 and we calculate the estimate of from the truncated
Gaussian in eq.(2.210).
A slightly more complicated case arises when the value of is not known.
Let us assume again that our ignorance of both and is quite extreme and
choose a uniform prior,
C for >0
p( ; ) / (2.212)
0 otherwise.
Another popular choice is a prior that is uniform in and in log . When
there is a considerable amount of data the two choices lead to practically the
same conclusions but we see that there is an important question here: what do
we mean by the word ‘uniform’? Uniform in terms of which variable? , or
2
, or log ? Later, in chapter 7, we shall have much more to say about this
misleadingly innocuous question.
To estimate we return to eq.(2.206) or (2.208). For the purpose of esti-
mating the variable is an uninteresting nuisance which, as we saw in section
2.5.4, can be eliminated through marginalization,
R1
p( jx) = d p( ; jx) (2.213)
0
" #
2
R1 1 ( x) + s2
/ d N
exp : (2.214)
0 2 2 =N
h i=x: (2.217)
2 1=2
The Gaussian integral over is 2 =N / and therefore
1 N s2
p( jX) / N 1
exp : (2.222)
2 2
As an estimate for we can use the value where the distribution is maximized,
r
N
max = s2 ; (2.223)
N 1
2
which agrees with our previous estimate of ( ) ,
2
max s2
= : (2.224)
N N 1
An error bar for itself can be obtained using the previous trick (provided N
is large enough) of taking a second derivative of log p: The result is
estimate the parameters and the complication is that the measured values of
y are a- icted by experimental errors,
For simplicity we assume that the probability of the error "i is Gaussian with
mean h"i i = 0 and that the variances "2i = 2 are known and the same for all
data pairs. We also assume that there are no errors a¤ecting the xs. A more
realistic account might have to reconsider these assumptions.
The sampling distribution is
Q
N
p(yj ) = p(yi j ) ; (2.227)
i=1
where
1 (yi f (xi ))2
p(yi j ) = p exp : (2.228)
2 2 2 2
Bayes’theorem gives,
P
N (y
i f (xi ))2
p( jy) / p( ) exp : (2.229)
i=1 2 2
P
N (y
i a bxi )2
p(a; bjy) / exp 2
: (2.231)
i=1 2
A good estimate of a and b is the value that maximizes the posterior distribution,
which we recognize as the Bayesian equivalent of the method of least squares.
However, the Bayesian analysis can already take us beyond the scope of the least
squares method because from p(a; bjy) we can also estimate the uncertainties
a and b.
quadratic line? It is not obvious. Having more parameters means that we will
be able to achieve a closer …t to the data, which is good, but we might also be
…tting the noise, which is bad. The same problem arises when the data shows
peaks and we want to estimate their location, their width, and their number.
Could there be an additional peak hiding in the noise? Are we just …tting the
noise, or does the data really support one additional peak?
We say these are problems of model selection. To appreciate how important
they can be consider replacing the modestly unassuming word ‘model’ by the
more impressive sounding word ‘theory’. Given two competing theories, which
one does the data support best? What is at stake is nothing less than the
foundation of experimental science.16
On the basis of data x we want to select one model among several competing
candidates labeled by m = 1; 2; : : : Suppose model m is de…ned in terms of some
parameters m = f m1 ; m2 ; : : :g and their relation to the data x is contained in
the sampling distribution p(xjm; m ). The extent to which the data supports
model m, i.e., the probability of model m given the data, is given by Bayes’
theorem,
p(m)
p(mjx) = p(xjm) ; (2.233)
p(x)
where p(m) is the prior for the model.
The factor p(xjm) is the prior probability for the data given the model and
plays the role of a likelihood function. This is precisely the quantity which, back
in eq.(2.163), we had called the ‘evidence’,
Z Z
p(xjm) = d m p(x; m jm) = d m p( m jm) p(xjm; m) : (2.234)
Thus, the problem of model selection is solved, at least in principle, once the
priors p(m) and p( m jm) are assigned. Of course, the practical problem of
calculating the multi-dimensional integrals can be quite formidable.
No further progress is possible without making speci…c choices for the various
functions in eq.(2.235) but we can o¤er some qualitative comments. When
comparing two models, m1 and m2 , it is fairly common to argue that a priori
we have no reason to prefer one over the other and therefore we assign the
same prior probability p(m1 ) = p(m2 ). (Of course this is not always justi…ed.
Particularly in the case of theories that claim to be fundamental people usually
1 6 For useful references on this topic see [Balasubramanian 1996, 1997], [Rodriguez 2005].
2.12 Examples from data analysis 67
have very strong prior prejudices favoring one theory against the other. Be that
as it may, let us proceed.)
Suppose the prior p( m jm) represents a uniform distribution over the para-
meter space. Since
Z
1
d m p( m jm) = 1 then p( m jm) ; (2.236)
Vm
where Vm is the ‘volume’of the parameter space. Suppose further that p(xjm; m )
has a single peak of height Lm ax spread out over a region of ‘volume’ m . The
value m where p(xjm; m ) attains its maximum can be used as an estimate
for m and the ‘volume’ m is then interpreted as an uncertainty. Then the
integral of p(xjm; m ) can be approximated by the product Lm ax m . Thus,
in a very rough and qualitative way the probability for the model given the data
is
Lm ax m
p(mjx) / : (2.237)
Vm
We can now interpret eq.(2.237) as follows. Our preference for a model will be
dictated by how well the model …ts the data; this is measured by [p(xjm; m )]m ax =
Lm ax . The volume of the region of uncertainty m also contributes: if more
values of the parameters are consistent with the data, then there are more ways
the model agrees with the data, and the model is favored. Finally, the larger the
volume of possible parameter values Vm the more the model is penalized. Since
a larger volume Vm means a more complex model the 1=Vm factor penalizes
complexity. The preference for simpler models is said to implement Occam’s
razor. This is a reference to the principle, stated by William of Occam, a 13th
century Franciscan monk, that one should not seek a more complicated expla-
nation when a simpler one will do. Such an interpretation is satisfying but
ultimately it is quite unnecessary. Occam’s principle does not need not be put
in by hand: Bayes’theorem takes care of it automatically in eq.(2.235)!
^(x), called the ‘statistic’or the ‘estimator’, that relates the two: the estimate
for is ^(x). The problem is to estimate the unknown when what is known
is the sampling distribution p(xj ) and the data x. The solution proposed
by Fisher was to select as estimator ^(x) that value of that maximizes the
probability of the observed data x. Since p(xj ) is a function of the variable x
where appears as a …xed parameter, Fisher introduced a function of , which
he called the likelihood function, where the observed data x appear as …xed
parameters,
def
L ( jx) = p(xj ) : (2.238)
Thus, the estimator ^(x) is the value of that maximizes the likelihood function
and, accordingly, this method of parameter estimation is called the method of
‘maximum likelihood’.
The likelihood function L( jx) is not the probability of ; it is not normalized
in any way; and it makes no sense to use it to compute an average or a variance
of . Nevertheless, the same intuition that leads one to propose maximization
of the likelihood to estimate also suggests using the width of the likelihood
function to estimate an error bar. Fisher’s somewhat ad hoc proposal turned
out to be extremely useful and it dominated the …eld of statistics throughout the
20th century. Its success is readily explained within the Bayesian framework.
The Bayesian approach agrees with the method of maximum likelihood in
the common case where the prior is uniform,
This is why the Bayesian discussion in this section has reproduced so many of
the standard results of the ‘orthodox’theory. But then the Bayesian approach
has many other advantages. In addition to greater conceptual clarity, unlike the
likelihood function, the Bayesian posterior is a true probability distribution that
allows estimation not just of but of all its moments. And, most important,
there is no limitation to uniform priors. If there is additional prior information
that is relevant to a problem the prior distribution provides a mechanism to
take it into account.
Chapter 3
An important problem that occupied the minds of many scientists in the 18th
century was to …gure out how to construct a perpetual motion machine. They
all failed. Ever since a rudimentary understanding of the laws of thermodynam-
ics was achieved in the 19th century no competent scientist would waste time
considering perpetual motion. Other scientists tried to demonstrate the impos-
sibility of perpetual motion from the established principles of mechanics. They
failed too: there exist no derivations of the Second Law from purely mechanical
principles. It took a long time, and for many the subject remains controver-
sial, but it has gradually become clear that the reason hinges on the fact that
entropy is not a physical quantity to be derived from mechanics; it is a tool
for inference, a tool for reasoning in situations of incomplete information. It is
quite impossible that such a non-mechanical quantity could have emerged from
a combination of purely mechanical notions. If anything it should be the other
way around.
In this chapter we trace some of the early developments leading to the notion
of entropy. Much of this chapter (including the title) is inspired by a beautiful
article by E. T. Jaynes [Jaynes 1988]. I have also borrowed from historical
papers by Klein [1970, 1973] and U¢ nk [2004].
not a fact that he could prove. Indeed, he could not have had a proof: thermo-
dynamics had not been invented yet. His conviction derived instead from the
long list of previous attempts (including those by his own father Lazare Carnot)
that had ended in failure. Carnot’s brilliant idea was to proceed anyway and
assume what he knew was true but could not prove as the postulate from which
he would draw all sorts of useful conclusions about engines.1
At the time Carnot did his work the nature of heat as a form of energy trans-
fer had not yet been understood. He adopted the model that was fashionable
at the time – the caloric model – according to which heat is a substance that
could be transferred but neither created nor destroyed. For Carnot an engine
would use heat to produce work in much the same way that falling water can
turn a waterwheel and produce work: the caloric would “fall” from a higher
temperature to a lower temperature thereby making the engine turn. What was
being transformed into work was not the caloric itself but the energy acquired
in the fall.
According to the caloric model the amount of heat extracted from the high
temperature source should be the same as the amount of heat discarded into the
low temperature sink. Later measurements showed that this was not true, but
Carnot was lucky. Although the model was seriously wrong, it did have a great
virtue: it suggested that the generation of work in a heat engine should include
not just the high temperature source from which heat is extracted (the boiler)
but also a low temperature sink (the condenser) into which heat is discarded.
Later, when heat was correctly interpreted as a form of energy transfer it was
understood that in order to operate continuously for any signi…cant length of
time an engine would have to repeat the same cycle over and over again, always
returning to same initial state. This could only be achieved if the excess heat
generated in each cycle were discarded into a low temperature reservoir.
Carnot’s caloric-waterwheel model was fortunate in yet another respect— he
was not just lucky, he was very lucky— a waterwheel engine can be operated in
reverse and used as a pump. This led him to consider a reversible heat engine
in which work would be used to draw heat from a cold source and ‘pump it up’
to deliver heat to the hot reservoir. The analysis of such reversible heat engines
led Carnot to the important conclusion
Carnot’s Principle: “No heat engine E operating between two temperatures
can be more e¢ cient than a reversible engine ER that operates between the same
temperatures.”
The proof of Carnot’s principle is quite straightforward but because he used
the caloric model it was not correct— the necessary revisions were supplied later
by Clausius in 1850. As a side remark, it is interesting that Carnot’s notebooks,
1 In his attempt to understand the undetectability of the ether Einstein faced a similar
problem: he knew that it was hopeless to seek an understanding of the constancy of the speed
of light on the basis of the primitive physics of the atomic structure of solid rulers that was
available at the time. Inspired by Carnot he deliberately followed the same strategy – to give
up and declare victory – and postulated the constancy of the speed of light as the unproven
but known truth which would serve as the foundation from which other conclusions could be
derived.
3.1 Carnot: reversible engines 71
which were made public by his family in about 1870 long after his death, indi-
cate that soon after 1824 Carnot came to reject the caloric model and that he
achieved the modern understanding of heat as a form of energy transfer. This
work— which preceded Joule’s experiments by about …fteen years— was not pub-
lished and therefore had no in‡uence on the development of thermodynamics
[Wilson 1981].
The following is Clausius’proof. Figure (3.1a) shows a heat engine E that
draws heat q1 from a source at high temperature t1 , delivers heat q2 to a sink at
low temperature t2 , and generates work w = q1 q2 . Next consider an engine
ES that is more e¢ cient than a reversible one, ER . In …gure (3.1b) we show the
super-e¢ cient engine ES coupled to the reversible ER . Then for the same heat
q1 drawn from the hot source the super-e¢ cient engine ES would deliver more
work than ER , wS > wR . One could split the work wS generated by ES into two
parts wR and wS wR . The …rst part wR could be used to drive ER in reverse
and pump heat q1 back up to the hot source, which is thus left unchanged. The
remaining work wS wR could then be used for any other purposes. The net
result is to extract heat q2R q2S > 0 from the cold reservoir and convert it
to work without any need for the hight temperature reservoir, that is, without
any need for fuel. The conclusion is that the existence of a super-e¢ cient heat
engine would allow the construction of a perpetual motion engine. Therefore
the assumption that the latter do not exist implies Carnot’s principle that heat
engines cannot be more e¢ cient than reversible ones.
The statement that perpetual motion is not possible is true but it is also
incomplete in one important way. It blurs the distinction between perpetual
motion engines of the …rst kind which operate by violating energy conservation
and perpetual motion engines of the second kind which do not violate energy
conservation. Carnot’s conclusion deserves to be singled out as a new principle
because it is speci…c to the second kind of machine.
Other important conclusions obtained by Carnot include
(1) that all reversible engines operating between the same temperatures are
equally e¢ cient;
(2) that their e¢ ciency is a function of the temperatures only,
def w
e = = e(t1 ; t2 ) ; (3.1)
q1
and is therefore independent of all other details of how the engine is constructed
and operated;
(3) that the e¢ ciency increases with the temperature di¤erence [see eq.(3.5)
below]; and …nally
(4) that the most e¢ cient heat engine cycle, now called the Carnot cycle, is one
in which all heat is absorbed at the high t1 and all heat is discharged at the low
t2 . (Thus, the Carnot cycle is de…ned by two isotherms and two adiabats.)
(The proofs of these statements are left as an exercise for the reader.)
The next important step, the determination of the universal function e(t1 ; t2 ),
was accomplished by Kelvin.
72 Entropy I: The Evolution of Carnot’s Principle
wb q2 wa
ec = ea + = ea + eb 1 ; (3.4)
q2 q1 q1
3.2 Kelvin: temperature 73
or
ec = ea + eb (1 ea ) ; (3.5)
which is a functional equation for e = e (t1 ; t2 ). Before we proceed to …nd the
solution we note that since 0 e 1 it follows that ec ea . Similarly, writing
ec = eb + ea (1 eb ) ; (3.6)
@
xa (t1 ; t2 ) = g(t2 ) : (3.9)
@t2
Integrating gives x(t1 ; t2 ) = F (t1 ) + G(t2 ) where the two functions F and G
are at this point unknown. The boundary condition e (t; t) = 0 or equivalently
x(t; t) = 0 implies that we deal with merely one unknown function: G(t) =
F (t). Therefore
f (t2 )
x(t1 ; t2 ) = F (t1 ) F (t2 ) or e (t1 ; t2 ) = 1 ; (3.10)
f (t1 )
The scale factor C re‡ects the still remaining freedom to choose the units. In the
absolute scale the e¢ ciency for the ideal reversible heat engine is very simple,
T2
e (t1 ; t2 ) = 1 : (3.12)
T1
In short, what Kelvin proposed was to use an ideal reversible engine as a ther-
mometer with its e¢ ciency playing the role of the thermometric variable.
Carnot’s principle that any heat engine E 0 must be less e¢ cient than the
reversible one, e0 e, is rewritten as
w q2 T2
e0 = =1 e=1 ; (3.13)
q1 q1 T1
or,
q1 q2
0: (3.14)
T1 T2
It is convenient to rede…ne heat so that inputs are positive heat, Q1 = q1 , while
outputs are negative heat, Q2 = q2 . Then,
Q1 Q2
+ 0; (3.15)
T1 T2
where the equality holds when and only when the engine is reversible.
The generalization to an engine or any system that undergoes a cyclic process
in which heat is exchanged with more than two reservoirs is straightforward. If
heat Qi is absorbed from the reservoir at temperature Ti we obtain the Kelvin
form (1854) of Carnot’s principle,
X Qi
0: (3.16)
i
Ti
It may be worth emphasizing that the Ti are the temperatures of the reservoirs.
In an irreversible process the system will not in general be in thermal equilibrium
and it may not be possible to assign a temperature to it.
The next non-trivial step, taken by Clausius, was to use eq.(3.16) to intro-
duce the concept of entropy.
trivial – before he came up with a new compact statement of the second law
that allowed substantial further progress [Cropper 1986].
Clausius rewrote Kelvin’s eq.(3.16) for a cycle where the system absorbs in-
…nitesimal (positive or negative) amounts of heat dQ from a continuous sequence
of reservoirs, I
dQ
0; (3.17)
T
where T is the temperature of each reservoir. The equality is attained for a
reversible process in which the system is slowly taken through a continuous
sequence of equilibrium states. In such a process T is both the temperature of
the system and of the reservoirs. The equality implies that the integral from
any state A to any other state B is independent of the path taken,
I Z B Z B
dQ dQ dQ
=0) = ; (3.18)
T A;R1AB T A;R2AB T
where R1AB and R2AB denote any two reversible paths linking the states A and
B. Clausius realized that eq.(3.18) implies the existence of a function of the
thermodynamic state. This function, which he called entropy, is de…ned up to
an additive constant by
Z B
dQ
SB = SA + : (3.19)
A;RAB T
This …rst notion of entropy we will call the Clausius entropy or the thermo-
dynamic entropy. Note that the Clausius entropy is de…ned only for states of
thermal equilibrium which severely limits its range of applicability.
Eq.(3.19) seems like a mere reformulation of eqs.( 3.16) and (3.17) but it
represents a major advance because it allowed thermodynamics to reach beyond
the study of cyclic processes. Consider a possibly irreversible process in which
a system is taken from an initial state A to a …nal state B, and suppose the
system is returned to the initial state along a reversible path. Then, the more
general eq.(3.17) gives
Z B Z A
dQ dQ
+ 0: (3.20)
A;irrev T B;RAB T
Thus the second law can be stated in terms of the total entropy S total = S res +S
as
total total
S…nal Sinitial ; (3.22)
76 Entropy I: The Evolution of Carnot’s Principle
or
2 q 3
p vx2 + vy2 + vz2
log 4 5 = log p(vx ) + log p(vy ) + log p(vz ) ; (3.28)
p(0) p(0) p(0) p(0)
(3.30)
Therefore
G0 (vx ) G0 (vy )
= = 2 ; (3.31)
vx vy
where 2 is a constant. Integrating gives
p(vx )
log = G(vx ) = vx2 + const ; (3.32)
p(0)
so that
3=2
P (~v ) = f (v) = exp vx2 + vy2 + vz2 ; (3.33)
when two gases of the same molecular species are mixed? Is this an irreversible
process?
For Gibbs there was no paradox, much less one that would require some
esoteric new (quantum) physics for its resolution. For him it was quite clear
that thermodynamics was not concerned with microscopic details but rather
with the changes from one macrostate to another. He correctly explained that
the mixing of two gases of the same molecular species does not lead to a di¤erent
macrostate. Indeed: by “thermodynamic” state
“...we do not mean a state in which each particle shall occupy more or less
exactly the same position as at some previous epoch, but only a state which
shall be indistinguishable from the previous one in its sensible properties.
It is to states of systems thus incompletely de…ned that the problems of
thermodynamics relate.” [Gibbs 1875-78]
confusion, he was not always explicit about which one he was using, sometimes
mixing them within the same paper, and even within the same equation. At
…rst, just like Maxwell, he de…ned the probability of a molecule having a velocity
~v within a small cell d3 v as the fraction of particles with velocities within the cell.
But then he also de…ned the probability as being proportional to the amount of
time that the molecule spent within that particular cell. Both de…nitions have
a clear origin in mechanics.
By 1868 he had managed to generalize Maxwell’s work in several directions.
He extended the theorem of equipartition of energy for point particles to complex
molecules. And he also generalized the Maxwell distribution to particles in the
presence of an external potential U (~x) which is the Boltzmann distribution. The
latter, in modern notation, is
1 p2
P (~x; p~)d3 xd3 p / exp + U (~x) d3 xd3 p : (3.34)
kT 2m
The argument was that in equilibrium the distribution should be stationary,
that it should not change as a result of collisions among particles. The colli-
sion argument was indeed successful but it gave the distribution for individual
molecules; it did not keep track of the correlations among molecules that would
arise from the collisions themselves.
A better treatment of the interactions among molecules was needed and
was found soon enough, also in 1868. The idea that naturally suggests itself
is to replace the potential U (~x) due to a single external force by a potential
U (~x1 : : : ~xN ) that includes all intermolecular forces. The change is enormous:
it led Boltzmann to consider the probability of the microstate of the system
as a whole rather than the probabilities of the microstates of individual mole-
cules. Thus, the universe of discourse shifted from the one-particle phase space
with volume element d3 xd3 p to the N -particle phase space with volume ele-
ment d3N xd3N p and Boltzmann was led to the microcanonical distribution in
which the N -particle microstates are uniformly distributed over a hypersurface
of constant energy — a subspace of 6N 1 dimensions.
The question of probability was, once again, brought to the foreground. A
notion of probability as the fraction of molecules in d3 v was no longer usable but
Boltzmann could still identify the probability of the system being in some region
of the N -particle phase space (rather than the one-particle space of molecular
velocities) with the relative amount of time that the system would spend in the
region. This obviously mechanical concept of probability is sometimes called
the “time” ensemble.
Perhaps inadvertently, at least at …rst, Boltzmann also introduced another
de…nition, according to which the probability that the state of the system is
within a certain region of phase space at a given instant in time is propor-
tional to the volume of the region. This is almost natural: just like the faces
of a symmetric die are assigned equal probabilities, Boltzmann assigned equal
probabilities to equal volumes in phase space.
At …rst Boltzmann did not think it was necessary to comment on whether the
two de…nitions of probability are equivalent or not, but eventually he realized
3.6 Boltzmann: entropy and probability 81
that their assumed equivalence should be explicitly stated. Later this came
to be known as the “ergodic” hypothesis, namely, that over a long time the
trajectory of the system would cover the whole region of phase space consistent
with the given value of the energy (and thus erg-odic). Throughout this period
Boltzmann’s various notions of probability were all still conceived as mechanical
properties of the gas.
In 1871 Boltzmann achieved a signi…cant success in establishing a connection
between thermodynamic entropy and microscopic concepts such as the probabil-
ity distribution in the N -particle phase space. In modern notation the argument
runs as follows. The energy of N interacting particles is given by
N
X p2i
H= + U (x1 ; : : : ; xN ; V ) ; (3.35)
i
2m
where dzN = d3N xd3N p is the volume element in the N -particle phase space,
and PN is the N -particle distribution function,
Z
exp ( H)
PN = where Z = dzN e H ; (3.37)
Z
and = 1=kT , so that,
3
E= N kT + hU i : (3.38)
2
The connection to the thermodynamic entropy, eq.(3.19), requires a clear
idea of the nature of heat and how it di¤ers from work. One needs to express
heat in purely microscopic terms, and this is quite subtle because at the mole-
cular level there is no distinction between a motion that is supposedly of a
“thermal” type and other types of motion such as plain displacements or rota-
tions. The distribution function turns out to be the crucial ingredient. In any
in…nitesimal transformation the change in the internal energy separates into two
contributions, Z Z
E = dzN H PN + dzN PN H : (3.39)
An aside: The discrete version of this idea might be familiar from elementary
quantum mechanics. If the probability that a quantum system is in a microstate
i with energy eigenvalue "i is pi , then the internal energy is
X
E = hHi = pi " i : (3.41)
i
The second term is the energy transfer that results from changing the energy
levels while keeping the probabilities pi …xed. This can be achieved, for example,
by changing an external electric …eld, or by changing the volume of the box in
which the particles are contained. The corresponding change in energy is called
work. Then, the …rst term, which refers to an energy change which involves
only a change in the probability of occupation pi and not in the energy levels is
called heat. Thus, in quantum mechanics, a slow process in which the quantum
system makes no transitions ( pi = 0) while the energy levels are moved around
( "i 6= 0) is called an adiabatic process (i.e., no heat is transferred).
Getting back to Boltzmann, substituting E from eq.(3.38) into (3.40), one
gets
3
Q = N k T + hU i h U i : (3.43)
2
This is not a complete di¤erential, but dividing by the temperature yields (after
some algebra)
Z
Q 3 hU i
= N k log T + + k log d3N x e U + const : (3.44)
T 2 T
If the identi…cation of Q with heat is correct then this strongly suggests that
the expression in brackets should be identi…ed with the Clausius entropy S.
Further rewriting leads to
E
S= + k log Z + const ; (3.45)
T
which is recognized as the correct modern expression. Indeed, the free energy,
F = E T S, is such that Z = e F=kT .
Boltzmann’s path towards understanding the second law was guided by one
notion from which he never wavered: matter is an aggregate of molecules. Apart
from this the story of his progress is the story of the increasingly more important
role played by probabilistic notions, and ultimately, it is the story of the evolu-
tion of his understanding of the notion of probability itself. By 1877 Boltzmann
achieves his …nal goal and explains entropy purely in terms of probability –me-
chanical notions were by now reduced to the bare minimum consistent with the
subject matter: we are, after all, talking about collections of molecules with po-
sitions and momenta and their total energy is conserved. His …nal achievement
3.6 Boltzmann: entropy and probability 83
W N!
P (w1 ; : : : ; wm ) = where W = : (3.46)
mN w1 ! : : : wm !
Xm
wn wn
log W = N log ; (3.49)
n=1
N N
or
m
X
log W = N fn log fn ; (3.50)
n=1
detail, the distribution that maximizes log W subject to the constraints (3.47)
is
wn
fn = / e "n ; (3.51)
N
where is a Lagrange multiplier determined by the total energy. When applied
to a gas, the possible states of a molecule are cells in the one-particle phase
space. Therefore
Z
log W = N dz1 f (x; p) log f (x; p) ; (3.52)
where dz1 = d3 xd3 p and the most probable distribution (3.51) is the same equi-
librium distribution found earlier by Maxwell and generalized by Boltzmann.
The derivation of the Boltzmann distribution (3.51) from a purely proba-
bilistic argument is a major accomplishment. However, although minimized, the
role of dynamics it is not completely eliminated. The Hamiltonian enters the
discussion in two places. One is quite explicit: there is a conserved energy the
value of which is imposed as a constraint. The second is much more subtle; we
saw above that the probability of a macrostate is proportional to the multiplic-
ity W provided the microstates are assigned equal probabilities, or equivalently,
equal volumes in phase space are assigned equal a priori weights. As always,
equal probabilities must at ultimately be justi…ed in terms of some form of un-
derlying symmetry. As we shall later see in chapter 5, the required symmetry
follows from Liouville’s theorem –under a Hamiltonian time evolution a region
in phase space moves around and its shape is distorted but its volume remains
conserved: Hamiltonian time evolution preserves volumes in phase space. The
nearly universal applicability of the ‘equal a priori postulate’can be traced to
the fact that the only requirement that the dynamics be Hamiltonian but the
functional form of the Hamiltonian is not important.
It is very surprising that although Boltzmann calculated the maximized value
log W for an ideal gas and knew that it agreed with the thermodynamical en-
tropy except for a scale factor, he never wrote the famous equation that bears
his name
S = k log W : (3.53)
This equation, as well as Boltzmann’s constant k, were both …rst introduced by
Planck.
There is, however, a serious problem with eq.(3.52): it involves the distri-
bution function f (x; p) in the one-particle phase space and therefore it cannot
take correlations into account. Indeed, Boltzmann used his eq.(3.52) in the one
case where it actually works, for ideal gases of non-interacting particles. The
expression that applies to systems of interacting particles is3
Z
log WG = dzN fN log fN ; (3.54)
3 For the moment we disregard the question of the distinguishability of the molecules. The
so-called Gibbs paradox and the extra factor of 1=N ! will be discussed in detail in section 5.11.
3.7 Some remarks 85
“The states of the bodies which we handle are certainly not known to us
exactly. What we know about a body can generally be described most
accurately and most simply by saying that it is one taken at random from
a great number (ensemble) of bodies which are completely described.”[my
italics, Gibbs 1902, p.163]
It is clear that for Gibbs probabilities represent a state of knowledge, that the
ensemble is a purely imaginary construction, just a tool for handling incom-
plete information. On the other hand, it is also clear that Gibbs still thinks of
probabilities in terms of frequencies. If the only available notion of probability
requires an ensemble and real ensembles are nowhere to be found then either
one gives up on probabilistic arguments altogether or one invents an imaginary
ensemble. Gibbs opted for the second alternative.
This brings our story of entropy up to about 1900. In the next chapter we
start a more deliberate and systematic study of the connection between entropy
and information.
not do; it trivializes the enormous achievements of the 19th century thinkers
and it misrepresents the actual nature of research. Scienti…c research is not a
neat tidy business.
I mentioned that this chapter was inspired by a beautiful article by E. T.
Jaynes with the same title [Jaynes 1988]. I think Jaynes’article has great peda-
gogical value but I disagree with him on how well Gibbs understood the logical
status of thermodynamics and statistical mechanics as examples of inferential
and probabilistic thinking. My own assessment runs in quite the opposite di-
rection: the reason why the conceptual foundations of thermodynamics and
statistical mechanics have been so controversial throughout the 20th century
is precisely because neither Gibbs nor Boltzmann, nor anyone else at the time,
were particularly clear on the interpretation of probability. I think that we could
hardly expect them to have done much better; they did not bene…t from the
writings of Keynes (1921), Ramsey (1931), de Finetti (1937), Je¤reys (1939),
Cox (1946), Shannon (1948), Brillouin (1952), Polya (1954) and, of course,
Jaynes himself (1957). Indeed, whatever clarity Jaynes attributes to Gibbs, is
not Gibbs’; it is the hard-won clarity that Jaynes attained through his own
e¤orts and after absorbing much of the best the 20th century had to o¤er.
The decades following Gibbs (1902) were extremely fruitful for statistical
mechanics but they centered in the systematic development of calculational
methods and their application to a bewildering range of systems and phenom-
ena, including the extension to the quantum domain by Bose, Einstein, Fermi,
Dirac, and von Neumann. With the possible exception of Szilard there were
no signi…cant conceptual advances concerning the connection of entropy and
information until the work of Shannon, Brillouin, and Jaynes around 1950. In
this book we will approach statistical mechanics from the point of view of in-
formation and entropic inference. For an entry point to the extensive literature
on alternative approaches based, for example, on Boltzmann’s equation, the er-
godic hypothesis, etc., see e.g. [Ehrenfest 2012] [ter Haar 1955] [Wehrl 1978]
[Mackey 1989][Lebowitz 1993, 1999] and [U¢ nk 2001, 2003, 2006].
Chapter 4
What is information? Our central goal is to gain insight into the nature of
information, how one manipulates it, and the implications such insights have for
physics. In chapter 2 we provided a …rst partial answer. We might not yet know
precisely what information is but sometimes we can recognize it. For example,
it is clear that experimental data contains information, that the correct way to
process it involves Bayes’ rule, and that this is very relevant to the empirical
aspect of all science, namely, to data analysis. Bayes’rule is the machinery that
processes the information contained in data to update from a prior to a posterior
probability distribution. This suggests a possible generalization: “information”
is whatever induces a rational agent to update from one state of belief to another.
This is a notion that will be explored in detail later.
In this chapter we pursue a di¤erent point of view that has turned out to
be extremely fruitful. We saw that the natural way to deal with uncertainty,
that is, with lack of information, is to introduce the notion of degrees of belief,
and that these measures of plausibility should be manipulated and calculated
using the ordinary rules of the calculus of probabilities. This achievement is a
considerable step forward but it is not su¢ cient.
What the rules of probability theory allow us to do is to assign probabilities
to some “complex”propositions on the basis of the probabilities of some other,
perhaps more “elementary”, propositions. The problem is that in order to get
the machine running one must …rst assign probabilities to those elementary
propositions. How does one do this?
The solution is to introduce a new inference tool designed speci…cally for
assigning those elementary probabilities. The new tool is Shannon’s measure of
an “amount of information” and the associated method of reasoning is Jaynes’
Method of Maximum Entropy, or MaxEnt. [Shannon 1948, Brillouin 1952,
Jaynes 1957b, 1983, 2003]
88 Entropy II: Measuring Information
probablities could be sharply peaked around a completely wrong value and the actual amount
of missing information could be substantial.
4.1 Shannon’s information measure 89
tion were to be obtained not all at once but in installments. The consistency
requirement is that the particular manner in which we obtain this information
should not matter. This idea can be expressed as follows.
Figure 4.1: The n states are divided into N groups to formulate the grouping
axiom.
Let pijg denote the probability that the system is in state i conditional on its
being in group g. For i 2 g we have
pi
pi = pig = Pg pijg so that pijg = : (4.2)
Pg
Suppose we were to obtain the missing information in two steps, the …rst of
which would allow us to single out one of the groups g while the second would
allow us to decide which is the actual i within the selected group g. The amount
of information required in the …rst step is SG = S[P ] where P = fPg g with
g = 1 : : : N . Now suppose we did get this information, and as a result we found,
for example, that the system was in group g 0 . Then for the second step, to
single out the state i within the group g 0 , the amount of additional information
needed would be Sg0 = S[p jg0 ]. But at the beginning of this process we do
not yet know which of the gs is the correct one. Then the expected P amount of
missing information to take us from the gs to the actual i is g Pg Sg . The
consistency requirement is that it should not matter whether we get the total
90 Entropy II: Measuring Information
any two integers s and t both larger than 1. The ratio of their logarithms can be
approximated arbitrarily closely by a rational number, i.e., we can …nd integers
and (with arbitrarily large) such that
log s +1 +1
< or t s <t : (4.10)
log t
But F is monotonic increasing, therefore
+1
F (t ) F (s ) < F (t ); (4.11)
Therefore,
X X
SG [P ] = F (n) Pg F (mg ) = Pg [F (n) F (mg )] : (4.16)
g g
Comments
Notice that for discrete probability distributions we have pi 1 and log pi 0.
Therefore S 0 for k > 0. As long as we interpret S as the amount of
uncertainty or of missing information it cannot be negative. We can also check
that in cases where there is no uncertainty we get S = 0: if any state has
probability one, all the other states have probability zero and every term in S
vanishes.
The fact that entropy depends on the available information implies that there
is no such thing as the entropy of a system. The same system may have many
di¤erent entropies. Indeed, two di¤erent agents may reasonably assign di¤erent
probability distributions p and p0 so that S[p] 6= S[p0 ]. But the non-uniqueness of
entropy goes even further: the same agent may legitimately assign two entropies
to the same system. This possibility is already shown in the Grouping Axiom
which makes explicit reference to two entropies S[p] and SG [P ] referring to two
di¤erent descriptions of the same system — a …ne-grained and a coarse-grained
description. Colloquially, however, one does refer to the entropy of a system; in
such cases the relevant information available about the system should be obvious
from the context. For example, in thermodynamics by the entropy one means
the particular entropy obtained when the only information available is speci…ed
by the known values of those few variables that specify the thermodynamic
macrostate.
The choice of the constant k is purely a matter of convention. In thermo-
dynamics the choice is Boltzmann’s constant kB = 1:38 10 16 erg/K which
re‡ects the historical choice of the Kelvin as the unit of temperature. A more
convenient choice is k = 1 which makes temperature have energy units and
entropy dimensionless. In communication theory and computer science, the
conventional choice is k = 1= loge 2 1:4427, so that
N
X
S[p] = pi log2 pi : (4.19)
i=1
The base of the logarithm is 2, and the entropy is said to measure information
in units called ‘bits’.
Next we turn to the question of interpretation. Earlier we mentioned that
from the Grouping Axiom it seems more appropriate to interpret S as a measure
of the expected rather than the actual amount of missing information. If one
adopts this interpretation, the actual amount of information that we gain when
we …nd that i is the true alternative would be log 1=pi . But this is not always
satisfactory because it clashes with the intuition that in general large messages
will carry large amounts of information while short messages will carry small
amounts. Indeed, consider a variable that takes just two values, 0 and 1, with
probabilities p and 1 p respectively. For very small p, log 1=p would be very
large, while the information that communicates the true alternative is physi-
cally conveyed by a very short one bit message, namely “0”. This shows that
interpreting log 1=p as an actual amount of information is not quite right. It
4.1 Shannon’s information measure 93
one of Renyi-Tsallis entropies. Which, among all those alternatives, should one
choose? This is a problem to which we will return in chapter 6.
Figure 4.2: Showing the concavity of the entropy S(p) S for the case of two
states.
turns out to be useful. Despite the positive sign K is sometimes read as the
‘entropy of p relative to q,’and often called “relative entropy”. It is easy to see
that in the special case when qi is a uniform distribution then K is essentially
equivalent to the Shannon entropy – they di¤er by a constant. Indeed, for
qi = 1=n, eq.(4.23) becomes
X
K[p; 1=n] = pi (log pi + log n) = log n S[p] : (4.24)
i
The relative entropy is also known by many other names including informa-
tion divergence, information for discrimination, and Kullback-Leibler divergence
[Kullback 1959]. The expression (4.23) has an old history. It was already used
by Gibbs in his Elementary Principles of Statistical Mechanics [Gibbs 1902] and
by Turing as the expected weight of evidence, eq.(2.190) [Good 1983].
It is common to interpret K[p; q] as the amount of information that is gained
(thus the positive sign) when one thought the distribution that applies to a
certain process is q and one learns that the distribution is actually p. Indeed, if
the distribution q is the uniform distribution and re‡ects the minimum amount
of information we can interpret K[p; 1=n] as the amount of information in p.
As we saw in section (2.11) the weight of evidence factor in favor of hypoth-
esis 1 against 2 provided by data x is
def p(xj 1 )
w( 1 : 2) = log : (4.25)
p(xj 2 )
This quantity can be interpreted as the information gained from the observation
of the data x: Indeed, this is precisely the way [Kullback 1959] de…nes the notion
of information: the log-likelihood ratio is the “information” in the data x for
discrimination in favor of 1 against 2 . Accordingly, the relative entropy,
Z
p(xj 1 )
dx p(xj 1 ) log = K( 1 ; 2 ) ; (4.26)
p(xj 2 )
96 Entropy II: Measuring Information
with equality if and only if pi = qi for all i. The proof uses the concavity of the
logarithm,
log x x 1: (4.28)
(The graph of the curve y = log x lies under the straight line y = x 1.)
Therefore
qi qi
log 1; (4.29)
pi pi
which implies
X qi X
pi log (qi pi ) = 0 : (4.30)
i
pi i
which establishes the range of the entropy between the two extremes of complete
certainty (pi = ij for some value j) and complete uncertainty (the uniform
distribution) for a variable that takes n discrete values.
4.3 Su¢ ciency* 97
The virtue of this notation is its compactness but one must keep in mind the
same symbol x is used to denote both a variable x and its values xi . To be more
explicit,
X X
px log px = px (xi ) log px (xi ) : (4.33)
x i
The uncertainty or lack of information about two (or more) variables x and
y is expressed by the joint distribution pxy and the corresponding joint entropy
is X
Sxy = pxy log pxy : (4.34)
xy
When the variables x and y are independent, pxy = px py , the joint entropy
is additive X
Sxy = px py log(px py ) = Sx + Sy ; (4.35)
xy
that is, the joint entropy of independent variables is the sum of the entropies
of each variable. This additivity property also holds for the other measure of
uncertainty we had introduced earlier, namely, the variance,
= Sxy + Sx + Sy : (4.37)
Sxy Sx + Sy ; (4.38)
with the equality holding when the two variables x and y are independent.
The inequality (4.38) is referred to as subadditivity. Its interpretation is clear:
entropy increases when information about correlations among subsystems is
discarded.
In words: the entropy of two variables is the entropy of one plus the conditional
entropy of the other. Also, since Sy is positive we see that conditioning reduces
entropy,
Sxy Sxjy : (4.42)
A related entropy-like quantity is the so-called “mutual information” of x
and y, denoted Mxy , which “measures” how much information x and y have in
common, or alternatively, how much information is lost when the correlations
between x and y are discarded. This is given by the relative entropy between
4.6 Continuous distributions 99
the joint distribution pxy and the product distribution px py that discards all
information contained in the correlations. Using eq.(4.37),
def
X pxy
Mxy = K[pxy ; px py ] = pxy log (4.43)
xy
p x py
= Sx + Sy Sxy 0;
We approach the continuous case as a limit from the discrete case. Consider
a continuous distribution p(x) de…ned on an interval for xa x xb . Divide the
interval into equal intervals x = (xb xa ) =N . For large N the distribution
p(x) can be approximated by a discrete distribution
pn = p(xn ) x ; (4.47)
and as N ! 1 we get
Z xb
p(x)
SN ! log N dx p(x) log (4.49)
xa 1= (xb xa )
which diverges. The divergence is what one would naturally expect: it takes a
…nite amount of information to identify one discrete alternative within a …nite
set, but it takes an in…nite amount to single out one point in a continuum.
The di¤erence SN log N has a well de…ned limit and we might be tempted to
consider Z xb
p(x)
dx p(x) log (4.50)
xa 1= (xb xa )
as a candidate for the continuous entropy, until we realize that, except for
an additive constant, it coincides with the unacceptable expression (4.45) and
should be discarded for precisely the same reason: it is not invariant under
changes of variables. Had we …rst changed variables to y = y(x) and then
discretized into N equal y intervals we would have obtained a di¤erent limit
Z yb
p0 (y)
dy p0 (y) log : (4.51)
ya 1= (yb ya )
The problem is that the limiting procedure depends on the particular choice of
discretization; the limit depends on which particular set of intervals x or y
we have arbitrarily decided to call equal. Another way to express the same idea
is to note that the denominator 1= (xb xa ) in (4.50) represents a probability
density that is constant in the variable x, but not in y. Similarly, the density
1= (yb ya ) in (4.51) is constant in y, but not in x.
Having identi…ed the origin of the problem we can now suggest a solution.
On the basis of our prior knowledge about the particular problem at hand we
must identify a privileged set of coordinates that will de…ne what we mean
by equal intervals or by equal volumes. Equivalently, we must identify one
preferred probability distribution (x) we are willing to de…ne as uniform —
where by “uniform” we mean a distribution that assigns equal probabilities to
equal volumes. Then, and only then, it makes sense to propose the following
de…nition Z xb
def p(x)
S[p; ] = dx p(x) log : (4.52)
xa (x)
It is easy to check that this is invariant,
Z xb Z yb
p(x) p0 (y)
dx p(x) log = dy p0 (y) log 0 : (4.53)
xa (x) ya (y)
should the measurement be carried out? How do we remain within the bounds
of a budget? The goal is to choose the best possible experiment given a set
of practical constraints. The idea is to compare the amounts of information
available before and after the experiment. The di¤erence is the amount of
information provided by the experiment and this is the quantity that one seeks to
maximize subject to the appropriate constraints. The basic idea was proposed
in [Lindley 1956]; a more modern application is [Loredo 2003].
The problem can be idealized as follows. We want to make inferences about
a variable . Let q( ) be the prior. We want to select the optimal experiment
from within a family of experiments labeled by ". The label " can be discrete
or continuous, one parameter or many, and each experiment " is speci…ed by its
likelihood function q" (x" j ).
The amount of information before the experiment is performed is given by
Z
q( )
Kb = K[q; ] = d q( ) log ; (4.57)
( )
where ( ) de…nes what we mean by the uniform distribution in the space of s.
If experiment " were to be performed and data x" were obtained the amount of
information after the experiment would be
Z
q" ( jx" )
Ka (x" ) = K[q" ; ] = d q" ( jx" ) log : (4.58)
( )
But all decisions must be made before the data x" is available; the expected
amount of information to be obtained from experiment " is
Z Z
q" ( jx" )
hKa i = dx" q" (x" ) d q" ( jx" ) log ; (4.59)
( )
where q" (x" ) is the prior probability that data x" is observed in experiment ",
Z Z
q" (x" ) = d q" (x" ; ) = d q( )q" (x" j ) : (4.60)
Communication theory studies the problem of how a message that was se-
lected at some point of origin can be reproduced at some later destination point.
The complete communication system includes an information source that gen-
erates a message composed of, say, words in English, or pixels on a picture.
A transmitter translates the message into an appropriate signal. For example,
sound pressure is encoded into an electrical current, or letters into a sequence of
zeros and ones. The signal is such that it can be transmitted over a communica-
tion channel, which could be electrical signals propagating in coaxial cables or
radio waves through the atmosphere. Finally, a receiver reconstructs the signal
back into a message to be interpreted by an agent at the destination point.
From the point of view of the engineer designing the communication system
the challenge is that there is some limited information about the set of potential
messages to be sent but it is not known which speci…c messages will be selected
for transmission. The typical sort of questions one wishes to address concern
the minimal physical requirements needed to communicate the messages that
could potentially be generated by a particular information source. One wants to
characterize the sources, measure the capacity of the communication channels,
and learn how to control the degrading e¤ects of noise. And after all this,
it is somewhat ironic but nevertheless true that such “information theory” is
completely unconcerned with whether any “information”is being communicated
at all. Shannon’s great insight was that, as far as the engineer is concerned,
whether the messages convey some meaning or not is completely irrelevant.
To illustrate the basic ideas consider the problem of data compression. A
useful idealized model of an information source is a sequence of random variables
x1 ; x2 ; : : : which take values from a …nite alphabet of symbols. We will assume
that the variables are independent and identically distributed. (Eliminating
these limitations is both possible and important.) Suppose that we deal with a
binary source in which the variables xi , which are usually called ‘bits’, take the
values zero or one with probabilities p or 1 p respectively. Shannon’s idea was
to classify the possible sequences x1 ; : : : ; xN into typical and atypical according
to whether they have high or low probability. The expected number of zeros
and ones is N p and N (1 p) respectively. For large N the probability of any
one of these typical sequences is approximately
P (x1 ; : : : ; xN ) pN p (1 p)N (1 p)
; (4.64)
so that
where S(p) is the two-state entropy, eq.(4.20), the maximum value of which is
Smax = log 2. Therefore, the probability of typical sequences is roughly
N S(p)
P (x1 ; : : : ; xN ) e : (4.66)
Since the total probability of typical sequences is less than one, we see that
their number has to be less than about eN S(p) which for large N is considerably
4.8 Communication Theory 105
less than the total number of possible sequences, 2N = eN log 2 . This fact is
very signi…cant. Transmitting an arbitrary sequence irrespective of whether it
is typical or not requires a long message of N bits, but we do not have to waste
resources in order to transmit all sequences. We only need to worry about the
far fewer typical sequences because the atypical sequences are too rare. The
number of typical sequences is about
and therefore we only need about N S(p)=Smax bits to identify each one of them.
Thus, it must be possible to compress the original long but typical message into
a much shorter one. The compression might imply some small probability of
error because the actual message might conceivably turn out to be atypical but
one can, if desired, avoid any such errors by using one additional bit to ‡ag
the sequence that follows as typical and short or as atypical and long. Actual
schemes for implementing the data compression are discussed in [Cover Thomas
91].
Next we state these intuitive notions in a mathematically precise way.
1 1PN
log P (x1 ; : : : ; xN ) = log p(xi ) ; (4.69)
N N i
1
lim Prob log P (x1 ; : : : ; xN ) + hlog p(x)i " = 1; (4.70)
N !1 N
where
hlog p(x)i = S[p] : (4.71)
This concludes the proof.
We can elaborate on the AEP idea further. The typical sequences are those
for which eq.(4.66) or (4.68) is satis…ed. To be precise let us de…ne the typical
set AN;" as the set of sequences with probability P (x1 ; : : : ; xN ) such that
N [S(p)+"] N [S(p) "]
e P (x1 ; : : : ; xN ) e : (4.72)
In words: the typical set has probability approaching certainty; typical se-
quences are nearly equally probable (thus the ‘equipartition’); and there are
about eN S(p) of them. To summarize:
The possible sequences are equally likely (well... at least most of them).
Proof: Eq.(4.70) states that for …xed ", for any given there is an N such
that for all N > N , we have
1
Prob log P (x1 ; : : : ; xN ) + S[p] " 1 : (4.73)
N
Thus, the probability that the sequence (x1 ; : : : ; xN ) is "-typical tends to one,
and therefore so must Prob[AN;" ]. Setting = " yields part (1). To prove (2)
write
P
1 Prob[AN;" ] = P (x1 ; : : : ; xN )
(x1 ;:::;xN )2AN;"
P N [S(p)+"] N [S(p)+"]
e =e jAN;" j : (4.74)
(x1 ;:::;xN )2AN;"
Jaynes [Jaynes 1957b, 1957c] and particularly [Jaynes 1963]. Other relevant papers are
reprinted in [Jaynes 1983] and collected online at http://bayes.wustl.edu.
108 Entropy II: Measuring Information
In words: within the family fpg of all distributions that satisfy the constraints
(4.80) the distribution that achieves the maximum entropy is the canonical
distribution p given in eq.(4.82).
Having found the maximum entropy distribution we can now develop the
MaxEnt formalism along lines that closely parallel statistical mechanics. Each
distribution within the family of distributions of the form (4.82) can be thought
of as a point in a continuous space — the “statistical manifold” of canonical
distributions. Each speci…c choice of expected values (F 1 ; F 2 ; : : :) determines a
unique point within the space, and therefore the F k play the role of coordinates.
To each point (F 1 ; F 2 ; : : :) we can also associate a number, the value of the
maximized entropy. Therefore, Sm ax (F 1 ; F 2 ; : : :) = Sm ax (F ) is a scalar …eld on
the statistical manifold.
In thermodynamics it is conventional to drop the su¢ x ‘max’and to refer to
S(F ) as the entropy of the system. This language can be misleading. We should
constantly remind ourselves that S(F ) is just one out of many possible entropies
that one could associate to the same physical system: S(F ) is that particular
entropy that measures the amount of information that is missing for an agent
whose knowledge consists of the numerical values of the F s and nothing else.
The quantity
0 = log Z( 1 ; 2 ; : : :) = log Z( ) (4.90)
110 Entropy II: Measuring Information
is sometimes called the “free energy” because it is closely related to the ther-
modynamic free energy (Z = e F ). The quantities S(F ) and log Z( ) are
Legendre transforms of each other,
k
S(F ) = log Z( ) + kF : (4.91)
They contain the same information and therefore just as the F s are obtained
from log Z( ) from eq.(4.84), the s can be obtained from S(F ),
@S(F )
= k : (4.92)
@F k
The proof is straightforward: write
@S(F ) @ log Z( ) @ j @ j j
k
= k
+ F + k ; (4.93)
@F @ j @F @F k
and use eq.(4.84). Equation (4.92) shows that the multipliers k are the com-
ponents of the gradient of the entropy S(F ) on the manifold of canonical distri-
butions. Thus, the change in entropy when the constraints are changed by F k
while the functions f k held …xed is
S= k Fk : (4.94)
P @fik @f k
fk = i pi v= v: (4.96)
@v @v
def @f k
Wk = fk = v: (4.97)
@v
F k = W k + Qk (4.99)
S = log Z( ) + ( k F k )
1P k k k
k fi k
= k fi + k fi e + kF + k Fk
Z i
= k fk fk ; (4.100)
S= k Qk : (4.101)
It is easy to see that this is equivalent to eq.(4.92) where the partial derivatives
are derivatives at constant v. Thus the entropy remains constant in in…nitesimal
“adiabatic”processes — those with Qk = 0. From the point of view of informa-
tion theory [see eq.(4.98)] this result is a triviality: the amount of information
in a distribution cannot change when the probabilities do not change,
pi = 0 ) Qk = 0 ) S = 0 : (4.102)
This justi…cation has stirred a considerable controversy that goes beyond the
issue we discussed earlier of whether the Shannon entropy is the correct way
to measure information. Some of the objections that have been raised are the
following:
(O1) The observed spectrum of black body radiation is whatever it is, inde-
pendently of whatever information happens to be available to us.
(O2) In most realistic situations the expected value of the energy is not a
quantity we happen to know. How, then, can we justify using it as a
constraint?
(O3) Even when the expected values of some quantities happen to be known,
there is no guarantee that the resulting inferences will be any good at all.
(A) The ideal case: We know that hf i = F and we know that it captures all
the information that happens to be relevant to the problem at hand.
We have called case A the ideal situation because it re‡ects a situation in which
the information that is necessary to reliably answer the questions that interest
4.11 On constraints and relevant information 113
(B) The important case: We know that hf i captures all the information that
happens to be relevant for the problem at hand but its actual numerical
value F is not known.
(C) The predictive case: There is nothing special about the function f
except that we happen to know its expected value, hf i = F . In particular,
we do not know whether information about hf i is complete or whether it
is at all relevant to the problem at hand.
not to use it; they allow us to explore its limitations; and what is perhaps most
important is that they provide powerful hints for further development. Here I
collect a few remarks about avoiding such pitfalls — a topic to which we shall
later return (see section 8.3).
with an unknown multiplier that can be estimated from the data D using
Bayesian methods. If the n experiments are independent Bayes rule gives,
fi
p( ) Qn e
p( jD) = ; (4.105)
p(D) i=1 Z
p( ) P
n
log p( jD) = log (log Z + fi )
p(D) i=1
p( )
= log n(log Z + f ) ; (4.106)
p(D)
@ log Z 1 @ log p( )
+f = ; (4.108)
@ n @
or, using (4.84),
1 @ log p( )
hf i = f : (4.109)
n @
As n ! 1 we see that the optimal is such that hf i ! f . This is to be
expected: for large n the data overwhelms the prior p( ) and f tends to hf i (in
probability). But the result eq.(4.109) also shows that when n is not so large
then the prior can make a non-negligible contribution: in general one should
not assume that hf i f .
Let us emphasize that this analysis holds only when the selection of a privi-
leged function f (x) can be justi…ed by additional knowledge about the physical
nature of the problem. In the absence of such information we are back to the
previous example — just data — and we have no reason to prefer the distribu-
tion e fj over any other canonical distribution e gj for any arbitrary function
g(x).5
5 Our conclusion di¤ers from that reached in [Jaynes 1978, pp. 72-75] which did not include
Statistical Mechanics
What makes phase space so convenient for the formulation of mechanics is that
Hamilton’s equations are …rst order in time. This means that through any
given point z(t0 ), which can be thought as the initial condition, there is just
one trajectory z(t) and therefore trajectories can never intersect each other.
In a ‡uid the actual positions and momenta of the molecules are unknown
and thus the macrostate of the ‡uid is described by a probability density in phase
space, f (z; t). When the system evolves continuously according to Hamilton’s
equations there is no information loss and the probability ‡ow satis…es a local
conservation equation,
@
f (z; t) = r J (z; t) ; (5.3)
@t
where J is the probability current,
J (z; t) = f (z; t)z_ (5.4)
also [Balian 1991, 1992, 1999].) For an entry point to the extensive literature on alternative
approaches based, for example, on the ergodic hypothesis see e.g. [Ehrenfest 1912] [Khinchin
1949] [ter Haar 1955] [Wehrl 1978] [Mackey 1989] [Lebowitz 1993, 1999] and [U¢ nk 2001,
2003, 2006]. For a discussion of why ergodic arguments are irrelevant to statistical mechanics
see [Earman Redei 1996].
5.1 Liouville’s theorem 121
with
d~xi d~
pi @H @H
z_ = ::: ; ::: = ::: ; ::: (5.5)
dt dt @~
pi @~xi
Since
N
X @ @H @ @H
r z_ = =0; (5.6)
i=1
@~xi @~
pi @~
pi @~xi
N
X
@f @f @H @f @H def
= z_ r f = = fH; f g : (5.7)
@t i=1
@~xi @~
pi @~
pi @~xi
d @
f (z(t); t) = f (z; t) + z_ r f : (5.8)
dt @t
Next consider a small volume element z(t) the boundaries of which are car-
ried along by the ‡uid ‡ow. Since trajectories cannot cross each other (because
Hamilton’s equations are …rst order in time) they cannot cross the boundary of
the evolving volume z(t) and therefore the total probability within z(t) is
conserved,
d d
Prob[ z(t)] = [ z(t)f (z(t); t)] = 0 : (5.11)
dt dt
But f (z(t); t) itself is constant, eq.(5.9), therefore
d
z(t) = 0 ; (5.12)
dt
which means that the shape of a region of phase space may get deformed by
time evolution but its volume remains invariant. This result is usually known
as Liouville’s theorem.
122 Statistical Mechanics
The …rst input from Hamiltonian dynamics is that information is not lost and
therefore we must require that S(t) be constant,
d
S(t) = 0 : (5.15)
dt
Therefore,
Z
d @f (z; t) f (z; t) @f (z; t)
S(t) = dz log + : (5.16)
dt @t (z) @t
A second input from Hamiltonian dynamics is that probabilities are not merely
conserved; they are locally conserved. This is expressed by eqs.(5.3) and (5.4).
The …rst term of eq.(5.16) can be rewritten,
Z
d f
S(t) = dz r (f z_ ) log ; (5.18)
dt
This condition must hold for any arbitrary choice of f (z; t), therefore
because this would be in contradiction with the Second Law.3 This is true but
it should not an objection; it is further evidence that there is no such thing as the
unique entropy of a system. Di¤erent entropies attach to di¤erent descriptions
of the system and, as we shall see in Section 5.7, equations (5.24) and (5.25)
will turn out to be crucial elements in the derivation of the Second Law.
Remark: In section 4.1 we pointed out that the interpretation of entropy S[f; ]
as a measure of information has its shortcomings. This could potentially un-
dermine our whole program of deriving statistical mechanics as an example of
entropic inference. Fortunately, as we shall see later in chapter 6 the framework
of entropic inference can be considerably strengthened by removing any reference
to questionable information measures. In this approach entropy S[f; ] requires
no interpretation; it is a tool designed for updating from a prior to a posterior
f distribution. More explicitly the entropy S[f; ] is introduced to rank can-
didate distributions f according to some criterion of “preference” relative to a
prior in accordance to certain “reasonable” design speci…cations. Recasting
statistical mechanics into this entropic inference framework is straightforward.
For example, the requirement that Hamiltonian time evolution does not a¤ect
the ranking of distributions — that is, if f1 (z; t) is preferred over f2 (z; t) at time
t then the corresponding f1 (z; t0 ) is preferred over f2 (z; t0 ) at any other time t0
— is expressed through eq.(5.15) so the proof of the Equal a Priori Theorem
proceeds exactly as above.
only consider problems where the number of particles is held …xed. Processes
where particles are exchanged as in the equilibrium between a liquid and its
vapor, or where particles are created and destroyed as in chemical reactions,
constitute an important but straightforward extension of the theory.
It thus appears that it is su¢ cient to impose that f be some function of the
energy. According to the formalism developed in section 4.10 and the remarks in
4.11 this is easily accomplished: the constraints codifying the information that
could be relevant to problems of thermal equilibrium should be the expected
values of functions (") of the energy. For example, h (")i could include various
moments, h"i, h"2 i,. . . or perhaps more complicated functions. The remaining
question is which functions (") and how many of them.
To answer this question we look at thermal equilibrium from the point of
view leading to what is known as the microcanonical formalism. Let us enlarge
our description to include the system of interest A and its environment, that
is, the thermal bath B with which it is in equilibrium. The advantage of this
broader view is that the composite system C = A + B can be assumed to be
isolated and we know that its energy "c is some …xed constant. This is highly
relevant information: when the value of "c is known, not only do we know
h"c i = "c but we know the expected values h ("c )i = ("c ) for absolutely all
functions ("c ): In other words, in this case we have succeeded in identifying
the relevant information and we are …nally ready to assign probabilities using
the MaxEnt method. (When the value of "c is not known we are in that state
of “intermediate” knowledge described as case (B) in section 4.11.)
Now we are ready to deploy the MaxEnt method. The argument depends
crucially on using a measure (z) that is constant in the phase-space variables.
Maximize the entropy,
Z
S[f ] = dz f (z) log f (z) ; (5.26)
of the composite system C subject to normalization and the …xed energy con-
straint,
f (z) = 0 if "(z) 6= "C : (5.27)
To simplify the discussion it is convenient to divide phase space into discrete
cells a of equal a priori probability. By the theorem of section 5.2 these cells are
of equal phase-space volume z. Then we can use the discrete entropy,
X
S= pc log pc where pc = f (zc ) z : (5.28)
c
For system A let the (discretized) microstate za have energy "a . For the thermal
bath B a much less detailed description is su¢ cient. Let the number of bath
microstates with energy "b be B ("b ). We assume that the microstates c of
the composite system, C = A + B, are labelled by specifying the state of A
and the state of B, c = (a; b). This condition looks innocent but this may be
deceptive; it implies A and B are not quantum mechanically entangled. Our
relevant information also includes the fact that A and B interact very weakly,
126 Statistical Mechanics
that is, any interaction potential Vab depending on both microstates a and b can
be neglected. The interaction must be weak but cannot be strictly zero, just
barely enough to attain equilibrium. It is this condition of weak interaction that
justi…es us in talking about a system A separate from the bath B. Under these
conditions the total energy "c constrains the allowed microstates of C = A + B
to the subset that satis…es
" a + "b = " c : (5.29)
The total number of such microstates is
X
C B
("c ) = ("c "a ) : (5.30)
a
and we conclude that the distribution that codi…es the relevant information
about equilibrium is
1
pa = exp( "a ) ; (5.36)
Z
5.3 The constraints for thermal equilibrium 127
which has the canonical form of eq.(4.82). (Being independent of a the factor
B
("c )= C ("c ) has been absorbed into the normalization Z.)
Remark: It may be surprising that strictly speaking a system such as A does
not have a temperature. The temperature T is not an ontic property of the
system but an epistemic property that characterizes the probability distribution
(5.36). Indeed, although we often revert to language to the e¤ect that the system
is in a macrostate with temperature T , we should note that in actual fact the
system is in a particular microstate and not in a probability distribution. The
latter refers to our state of knowledge and not to the ontic state of the system.
Our goal in this section was to identify the relevant variables. We are now in a
position to give the answer: the relevant information about thermal equilibrium
can be summarized by the expected value of the energy h"i because someone
who just knows h"i and is maximally ignorant about everything else is led to
assign probabilities according to eq.(4.82) which coincides with (5.36).
But our analysis has also disclosed an important limitation. Eq.(5.32) shows
that in general the distribution for a system in equilibrium with a bath depends
in a complicated way on the properties of the bath. The information in h"i is
adequate only when (a) the system and the bath interact weakly enough that
the energy of the composite system C can be neatly partitioned into the energies
of A and of B, eq.(5.29), and (b) the bath is so much larger than the system
that its e¤ects can be represented by a single parameter, the temperature T .
Conversely, if these conditions are not met, then more information is needed.
When the system-bath interactions are not su¢ ciently weak eq.(5.29) will not be
valid and additional information concerning the correlations between A and B
will be required. On the other hand if the system-bath interactions are too weak
then within the time scales of interest the system A will reach only a partial
thermal equilibrium with those few degrees of freedom in its very immediate
vicinity. The system A is e¤ectively surrounded by a thermal bath of …nite size
and the information contained in the single parameter or the expected value
h"i will not su¢ ce. This situation will be brie‡y addressed in section 5.5.
1 "a
pa = e (5.38)
Z
where the Lagrange multiplier is determined from
@ log Z X
"a
=E and Z( ; V ) = e : (5.39)
@ a
@SG @ log Z @ @
=k +k E+k =k ; (5.41)
@E V @ @E @E
where eq.(5.39) has been used to cancel the …rst two terms.
The connection between the statistical formalism and thermodynamics hinges
on a suitable identi…cation of internal energy, work and heat. The …rst step is
the crucial one: we adopt Boltzmann’s assumption, eq.(3.38), and identify the
5.4 The canonical formalism 129
Since "a = "a (V ) the …rst term h "i on the right can be physically induced by
pushing or pulling on a piston to change the volume,
X @"a @"
h "i = pa V = V : (5.43)
a
@V @V
h "i = W = P V ; (5.44)
therefore,
Q
SG = k Q or SG = ; (5.48)
T
where we introduced the suggestive notation
1 1
k = or = : (5.49)
T kT
Integrating eq.(5.48) from an initial state A to a …nal state B gives
Z B
dQ
SG (B) SG (A) = (5.50)
A T
130 Statistical Mechanics
where the integral is along a reversible path — the states along the path are
equilibrium states — and where the temperature T is de…ned by
@SC def 1 Q
= so that SC = : (5.52)
@E V T T
Comparing eqs.(5.50) and (5.51) we see that the maximized Gibbs entropy SG
and the Clausius SC di¤er only by an additive constant. Adjusting the constant
so that SG matches the Clausius entropy SC for one equilibrium state they will
match for all equilibrium states. We can therefore conclude that
F (T; V ) = E TS ; (5.55)
so that
F = S T P V : (5.56)
For processes at constant T we have F = W which justi…es the name ‘free’
energy –the amount of energy that is free to be converted to useful work when
the system is not isolated but in contact with a bath at temperature T . Eq.(5.40)
then leads to
F = kT log Z(T; V ) or Z = e F : (5.57)
Several useful thermodynamic relations can be easily obtained from eqs.(5.53),
(5.54), and (5.56). For example, the identities
@F @F
= S and = P; (5.58)
@T V @V T
B B 1 2
log ("c "a ) = log ("c ) "a " ::: ; (5.59)
2 a
leading to corrections to the Boltzmann distribution,
1 1 2
pa = exp( "a " : : :) : (5.60)
Z 2 a
An alternative path is to provide a more detailed model of the bath [Plastino
and Plastino 1994]. As before, we consider a system A that is weakly coupled to
132 Statistical Mechanics
a heat bath B that has a …nite size. The microstates of A and B are labelled a
and b and have energies "a and "b respectively. The composite system C = A+B
can be assumed to be isolated and have a constant energy "c = "a + "b (or more
precisely C has energy in some arbitrarily narrow interval about "c ). To model
the bath B we assume that the number of microstates of B with energy less
than " is W (") = C" , where the exponent is some constant that depends on
the size of the bath. Such a model can be quite realistic. For example, when
the bath consists of N harmonic oscillators we have = N , and when the bath
is an ideal gas of N molecules we have = 3N=2.
Then the number of microstates of B in a narrow energy range " is
B 1
(") = W (" + ") W (") = C" "; (5.61)
1 1
= and = : (5.66)
"c "2c
We will not pursue the subject any further except to comment that distri-
butions such as (5.63) have been proposed by C. Tsallis on the basis of a very
di¤erent logic [Tsallis 1988, 2011].
Non-extensive thermodynamics
The idea proposed by Tsallis is to generalize the Boltzmann-Gibbs canonical
formalism by adopting a di¤erent “non-extensive entropy”,
P
1 i pi
T (p1 ; : : : ; pn ) = ;
1
5.6 The thermodynamic limit 133
As ! 0 we get
1 P 1+
T1+ = (1 i pi )
1 P P
= [1 i pi (1 + log pi )] = i pi log pi : (5.68)
The distribution that maximizes the Tsallis entropy subject to the usual
normalization and energy constraints,
P P
i pi = 1 and i " i pi = E ;
is
1
pi = [1 "i ]1=( 1)
; (5.69)
Z
where Z is a normalization constant and the constant is a ratio of Lagrange
multipliers. This distribution is precisely of the form (5.63) with = 1="c and
=1+( 1) 1 .
Our conclusion is that Tsallis distributions make perfect sense within the
canonical Gibbs-Jaynes approach to statistical mechanics. However, in order
to justify them, it is not necessary to introduce an alternative thermodynamics
through new ad hoc entropies; it is merely necessary to recognize that some-
times a partial thermal equilibrium is reached with heat baths that are not
extremely large. What distinguishes the canonical Boltzmann-Gibbs distribu-
tions from (5.63) or (5.69) is the relevant information on the basis of which we
draw inferences and not the inference method. An added advantage is that the
free and undetermined parameter can, within the standard MaxEnt formalism
advocated here, be calculated in terms of the size of the bath.
the few macroscopic variables over which we have some control. Most other
questions are deemed not “interesting” and thus they are never asked. For ex-
ample, suppose we are given a gas in equilibrium within a cubic box, and the
question is where will we …nd a particular molecule. The answer is that the
expected position of the molecule is at the center of the box but with a very
large standard deviation — the particle can be anywhere in the box. Such an
answer is not very impressive. On the other hand, if we ask for the energy of
the gas at temperature T , or how it changes as the volume is changed by V ,
then the answers are truly impressive.
Consider a system in thermal equilibrium in a macrostate described by a
canonical distribution f (z) assigned on the basis of constraints on the values of
certain macrovariables X. For simplicity we will assume X is a single variable,
the energy, X = E = h"i. The generalization to more than one variable is not
di¢ cult. The microstates z can be divided into typical and atypical microstates.
The typical microstates are those contained within a region R de…ned by im-
posing upper and lower bounds on f (z).
In this section we shall explore a few properties of the typical region. We
will show that the probability of the typical region turns out to be “high”, that
is, Prob[R ] = 1 where is a small positive number. We will also show
that the thermodynamic entropy SC and the “phase”volume W of the typical
region are related through Boltzmann’s equation,
SC k log W ; (5.70)
where R
W = Vol(R ) = R
dz : (5.71)
The surprising feature is that SC turns out to be essentially independent of .
The following theorems which are adaptations of the Asymptotic Equipartition
Property [Shannon 1948, Shannon Weaver 1949] state this result in a mathe-
matically precise way. (See also [Jaynes 1965] and section 4.8.)
The Asymptotic Equipartition Theorem: Let f (z) be the canonical dis-
tribution and S = SG =k = SC =k the corresponding entropy,
"(z)
e
f (z) = and S = E + log Z : (5.72)
Z
If limN !1 "=N = 0, that is, the energy ‡uctuations " ( is the standard
deviation) may increase with N but they do so less rapidly than N , then, as
N ! 1,
1 S
log f (z) ! in probability, (5.73)
N N
Since S=N is independent of z the theorem roughly states that the probabil-
ities of the accessible microstates are “essentially” equal. The microstates z for
which ( log f (z))=N di¤ers substantially from S=N have either too low prob-
ability — they are deemed “inaccessible” — or they might individually have a
5.6 The thermodynamic limit 135
high probability but are too few to contribute signi…cantly. The term ‘essen-
tially’is tricky because f (z) may di¤er from e S by a huge multiplicative factor
— perhaps several billion — but log f (z) will still di¤er from S by an amount
that is unimportant because it grows less rapidly than N .
Remark: The left hand side of (5.73) is a quantity associated to a microstate z
while the right side contains the entropy S. This may mislead us into thinking
that the entropy S is some ontological property associated to the individual
microstate z rather than a property of the macrostate. But this is not so:
the entropy S is a property of a whole probability distribution f (z) and not
of the individual zs. Any given microstate z0 can lie within the support of
several di¤erent distributions possibly describing di¤erent physical situations
and having di¤erent entropies. The mere act of …nding that the system is in
state z0 at time t0 is not su¢ cient to allow us to …gure out whether the system
is best described by a macrostate of equilibrium as in (5.73) or whether it was
undergoing some dynamical process that just happened to pass through z0 at
t0 .
Next we prove the theorem. Apply the Tchebyshev inequality, eq.(2.109),
2
x
P (jx hxij ) , (5.74)
to the variable
1
log f (z) :
x= (5.75)
N
Its expected value is the entropy per particle,
1
hxi = hlog f i
N
S 1
= = ( E + log Z) : (5.76)
N N
To calculate the variance,
1 h 2
i
( x)2 = (log f )2 hlog f i ; (5.77)
N2
use
D E D E
2 2
(log f ) = ( " + log Z)
2 2
= "2 + 2 h"i log Z + (log Z) ; (5.78)
so that
2 2
2 "
( x)2 = "2 h"i = : (5.79)
N2 N
Collecting these results gives
2 2
1 S "
Prob log f (z) : (5.80)
N N N
136 Statistical Mechanics
For systems such that the relative energy ‡uctuation "=N tends to 0 as N ! 1
the limit on the right is zero,
1 S
lim Prob log f (z) =0; (5.81)
N !1 N N
which concludes the proof.
Remark: Note that the theorem applies only to those systems with interparticle
interactions such that the energy ‡uctuations " are su¢ ciently well behaved.
For example, it is not uncommon that "=E / N 1=2 and that the energy is
an extensive quantity, E=N ! const. Then
" "E 1
= / 1=2 ! 0 : (5.82)
N E N N
Typically this happens when the spatial correlations among particles fall suf-
…ciently fast with distance — distant particles are uncorrelated. Under these
conditions both energy and entropy are extensive quantities.
The following theorem elaborates on these ideas further. To be precise let
us de…ne the typical region R as the set of microstates with probability f (z)
such that
e S N f (z) e S+N ; (5.83)
or, using eq.(5.72),
1 E N 1 E+N
e f (z) e : (5.84)
Z Z
This last expression shows that the typical microstates have energy within a
narrow range
"(z) E N kT : (5.85)
Remark: Even though states z with energies lower than typical can individually
be more probable than the typical states it turns out (see below) that they are
too few and their volume is negligible compared to W .
Theorem of typical microstates: For N su¢ ciently large
In words:
The typical region has probability close to one; typical microstates are
almost equally probable; the phase volume they occupy is about eS , that is,
S = log W .
5.6 The thermodynamic limit 137
For large N the entropy is a measure of the logarithm of the phase volume of
typical states,
S = log W N ; (5.86)
where log W = N O(1) while 1. The results above are not very sensitive
to the value of . A broad range of values 1=N 1 are allowed. This means
that can be “microscopically large”(e.g., 10 6 ; 10 12 10 23 ) provided it
6 12
remains “macroscopically small”(e.g., 10 ; 10 1). Incidentally, note
that it is the (maximized) Gibbs entropy that satis…es the Boltzmann formula
SG = SC = k log W (where the irrelevant subscript has been dropped).
Proof: Eq.(5.81) states that for …xed , for any given there is an N such
that for all N > N , we have
1 S
Prob log f (z) 1 : (5.87)
N N
Thus, the probability that a microstate z drawn from the distribution f (z) is
-typical tends to one, and therefore so must Prob[R ]. Setting = yields
part (1). This also shows that the total probability of the set of states with
S+N 1 E+N N
f (z) > e = e or "(z) < E (5.88)
Z
is negligible — states that individually are more probable than typical occupy
a negligible volume. To prove (2) write
R
1 Prob[R ] = R dz f (z)
R
e S N R dz = e S N W : (5.89)
Similarly, to prove (3) use (1),
R
1 < Prob[R ] = R dz f (z)
R
e S+N R dz = e S+N W ; (5.90)
Finally, from (2) and (3),
(1 )eS N
W eS+N ; (5.91)
which is the same as
log(1 ) log W S
+ ; (5.92)
N N
and proves (4).
Remark: The theorems above can be generalized to situations involving several
macrovariables X k in addition to the energy. In this case, the expected value
of log f (z) is
h log f i = S = k X k + log Z ; (5.93)
and its variance is
2
( log f ) = k m XkXm X k hX m i : (5.94)
138 Statistical Mechanics
The energy of the universe is constant. The entropy of the universe tends
to a maximum.
The Second Law was amended into a stronger form by Gibbs (1878):
In this and the following two sections we derive and comment on the Second
Law following the argument in [Jaynes 1963, 1965]. Jaynes’derivation is decep-
tively simple: the mathematics is trivial.6 But it is conceptually subtle so it
may be useful to recall some of our previous results. The entropy mentioned in
the Second Law is the thermodynamic entropy of Clausius SC , which is de…ned
only for equilibrium states.
Consider a system at time t in a state of equilibrium de…ned by certain
thermodynamic variables X(t). As we saw in section 5.4 the macrostate of
equilibrium is described by the canonical probability distribution f can (z; t) ob-
tained by maximizing the Gibbs entropy SG subject to the constraints X(t)
= hx(t)i where the quantities x = x(z) are functions of the microstate such as
energy, density, etc. The thermodynamic entropy SC is then given by
can
SC (t) = SG (t) : (5.95)
tunately, Gibbs’ treatment of the conceptual foundations left much to be desired and was
promptly criticized in an extremely in‡uential review by Paul and Tatyana Ehrenfest [Ehren-
fest 1912]. Jaynes’ decisive contribution was to place the subject on a completely di¤erent
foundation based on improved conceptual understandings of probability, entropy, and infor-
mation.
5.7 The Second Law of Thermodynamics 139
Since the Gibbs entropy remains constant it is sometimes argued that this con-
tradicts the Second Law but note that the time-evolved SG (t0 ) is not the ther-
modynamic entropy because the new f (t0 ) is not necessarily of the canonical
form, eq.(4.82).
From the new distribution f (t0 ) we can, however, compute the new expected
values X(t0 ) = hx(t0 )i that apply to the state of equilibrium at t0 . Of all dis-
tributions agreeing with the same new values X(t0 ) the canonical distribution
f can (t0 ) is that which has maximum Gibbs entropy, SG can 0
(t ). Therefore
M axEnt
f (t0 ) ! f can (t0 ) (5.98)
implies
SG (t0 ) can 0
SG (t ) : (5.99)
can 0
But SG (t ) coincides with the thermodynamic entropy of the new equilibrium
state,
can 0
SG (t ) = SC (t0 ) : (5.100)
Collecting all these results, eqs.(5.95)-(5.100), we conclude that the thermody-
namic entropy has increased,
This is the Second Law. The equality applies when the time evolution is quasi-
static so that the distribution remains canonical at all intermediate instants
through the process.
To summarize, the chain of steps is
can
SC (t) = SG (t) = SG (t0 ) can 0
SG (t ) = SC (t0 ) : (5.102)
(1) (2) (3) (4)
Steps (1) and (4) hinge on identifying the maximized Gibbs entropy with the
thermodynamic entropy — which is justi…ed provided we have correctly identi-
…ed the relevant macrovariables X for the particular problem at hand. Step (2)
follows from the constancy of the Gibbs entropy under Hamiltonian evolution
140 Statistical Mechanics
a consequence of the Second Law. If anything, it is the other way around: the
existence of a …nal state of equilibrium is a pre-condition for the Second Law.
The extension of entropic methods of inference beyond situations of equilib-
rium is, of course, highly desirable. I will o¤er two comments on this matter.
The …rst is that there is at least one theory of extreme non-equilibrium that
is highly successful and very well known. It is called quantum mechanics —
the ultimate framework for a probabilistic time-dependent non-equilibrium dy-
namics. The derivation of quantum mechanics as an example of an “entropic
dynamics” will be tackled in Chapter 11. The second comment is that term
‘non-equilibrium’ is too broad and too vague to be useful. In order to make
progress it is important to be very speci…c about which type of non-equilibrium
process one is trying to describe.8
theory of elephants; but how could one ever come up with a theory of non-elephants?”
142 Statistical Mechanics
Figure 5.1: Entropy increases towards the future: The microstate of a system
in equilibrium with macrovariables X(t) at the initial time t lies somewhere
within R(t). A constraint is removed and the system spontaneously evolves to
a new equilibrium at t0 > t in the region R(t0 ) characterized by values X(t0 )
and with the same volume as R(t). The maximum entropy region that describes
equilibrium with the same values X(t0 ) irrespective of the prior history is R0 (t0 ).
The experiment is reproducible because all states within the larger region R0 (t0 )
are characterized by the same X(t0 ).
the microstates in R(t) will also evolve to be within R0 (t0 ) which means that
W (t) = W (t0 ) W 0 (t0 ). Conversely, if it happened that W (t) > W 0 (t0 ) we
would sometimes observe that an initial microstate within R(t) would evolve
into a …nal microstate lying outside R0 (t0 ), that is, sometimes we would observe
that X(t) would not evolve to X(t0 ). Such an experiment would de…nitely not
be reproducible.
A new element has been introduced into the discussion of the Second Law:
reproducibility [Jaynes 1965]. Thus, we can express the Second Law in the
somewhat tautological form:
We can address this question from a di¤erent angle: How do we know that
the chosen constraints X are the relevant macrovariables that provide an ade-
quate thermodynamic description? In fact, what do we mean by an adequate
description? Let us rephrase these questions di¤erently: Could there exist ad-
ditional physical constraints Y that signi…cantly restrict the microstates com-
patible with the initial macrostate and which therefore provide an even better
description? The answer is that to the extent that we are only interested in the
5.9 On reversibility, irreversibility, and the arrow of time 143
Figure 5.2: The reproducibility arrow of time leads to the Second law: we can
guarantee that the system will reproducibly evolve to R0 (t0 ) by controlling the
initial microstate to be in region Ra (t).
initial equilibrium state at time t < t0 did it come from? The answer is that
once an equilibrium has been reached at time t0 then, by the very de…nition
of equilibrium, the spontaneous evolution into the future t00 > t0 will maintain
the same equilibrium state. And vice versa: to the extent that the system
has evolved spontaneously — that is, to the extent that there are no external
interventions — the time reversibility of the Hamiltonian dynamics leads us to
that if the system is in equilibrium at t00 , then it must have been in equilibrium
at any all other previous time t0 . In other words, if the equilibrium at time
t0 is de…ned by variables X(t0 ), then the spontaneous evolution both into the
future t00 > t0 and from the past t < t0 , lead to the same equilibrium state,
X(t) = X(t0 ) = X(t00 ).
But, of course, our interest lies precisely in the e¤ect of external interven-
tions. A reproducible experiment that starts in equilibrium and ends in the
equilibrium state of region R0 (t0 ) de…ned by X(t0 ) is shown in Figure 5.2. We
can guarantee that the initial microstate will end somewhere in R0 (t0 ) by con-
trolling the initial equilibrium macrostate to have values X(t) that de…ne a
region such as Ra (t). Then we have an external intervention: a constraint is
removed and the system is allowed to evolve spontaneously. The initial region
Ra (t) will necessarily have a volume very very much smaller than R0 (t). In-
deed, the region Ra (t) would be highly atypical within R0 (t). The entropy of
the initial region Ra (t) is lower than that of the …nal region R0 (t) which leads
to the Second Law once again.
146 Statistical Mechanics
Figure 5.2 also shows that there are many other initial equilibrium macrostates
such as Rb (t) de…ned by values Xb (t) that lead to the same …nal equilibrium
macrostate. For example, if the system is a gas in a box of given volume and
the …nal state is equilibrium, the gas might have initially been in equilibrium
con…ned by a partition to the left half of the box, or the upper half, or the right
third, or any of many other such constrained states.
Finally, we note that introducing the notion of reproducibility is something
that goes beyond the laws of mechanics. Reproducibility refers to our capability
to control the initial microstate by deliberately manipulating the values Xa (t)
in order to reproduce the later values X(t0 ). The relation is that of a cause
Xa (t) leading to an e¤ect X(t0 ). To the extent that causes are supposed to
precede their e¤ects we conclude that the reproducibility arrow of time is the
causal arrow of time.
Our goal has been to derive the Second Law which requires an arrow of time
but, at this point, the origin of the latter remains unexplained. In chapter 11
we will revisit this problem within the context of a dynamics conceived as an
application of entropic inference. There we will …nd that the time associated
to such an “entropic dynamics” is intrinsically endowed with the directionality
required for the Second Law.
Overview
It may be useful to collect the main arguments of the previous three sections
in a more condensed form along with a short preview of things to come later in
chapter 11. To understand what is meant by the Second Law — roughly that
entropy increases as time increases — one must specify what entropy we are
talking about, and we must specify an arrow of time so that it is clear what we
mean by ‘time increases’.
The entropy in the Second Law of Thermodynamics is the thermodynamic
entropy of Clausius, eq.(5.51), which is only de…ned for equilibrium states. One
can invent all sorts of other entropies, which might increase or not. One might,
for example, de…ne an entropy associated to a microstate. The increase of such
an entropy could be due to some form of coarse graining, or could be induced by
unknown external perturbations. Or we could have the entropy of a probability
distribution that increases in a process of di¤usion. Or the entropy of the
distribution j j2 as it might arise in quantum mechanics, which increases just
as often as it decreases. Or any of many other possibilities. But none of these
refer to the Second Law, nor do they violate it.
Then there is the question of the arrow of time: either it is de…ned by the
Second Law or it is not. If it is, then the Second law is a tautology. While this
is a logical possibility one can o¤er a pragmatic objection: an arrow of time
linked to the entropy of equilibrium states is too limited to be useful — thermal
equilibrium is too rare, too local, too accidental a phenomenon in our universe.
It is more fruitful to pursue the consequences of an arrow of time that originates
through some other mechanism.
Entropic dynamics (ED) o¤ers a plausible mechanism in the spirit of infer-
5.10 Avoiding pitfalls –II: is this a 2nd law? 147
ence and information (see chapter 11). In the ED framework, time turns out
to be intrinsically endowed with directionality; ultimately this is what sets the
direction of causality. The arrow of such an “entropic time” is linked to an en-
tropy but not to thermodynamics; it is linked to the dynamics of probabilities.
The causal arrow of time is the arrow of entropic time.
Thus, the validity of the Second Law rests on three elements: (1) the ther-
modynamic entropy given by the maximized Gibbs entropy; (2) the existence of
an arrow of time; and (3) the existence of a time-reversible dynamical law that
involves no loss of information. The inference/information approach to physics
contributes to explain all three of these elements.
@t f = @ (f v ) ; (5.108)
k
k x (z) + log fs (z) = s = const (5.110)
where
Z Z
k k
X [f ] = dz f (z)x (z) and S[f ] = dz f (z) log f (z) : (5.112)
= k Xk S (5.113)
where the X k represent the amount of X k withdrawn from the reservoirs and
supplied to the system. The corresponding change in the thermodynamic en-
tropy of the reservoirs is given by eq.(4.94),
Sres = k Xk ; (5.114)
so that
= (Sres + S) : (5.115)
Discussion We have just shown that starting from any arbitrary initial dis-
tribution f0 at time t0 , the solutions ft of (5.107) evolve irreversibly towards a
…nal state of equilibrium fs . Furthermore, there is a potential that monoton-
ically decreases towards its minimum value s and that can (sometimes) be
interpreted as the monotonic increase of the total entropy of system plus reser-
voirs.
What are we to make of all this? We appear to have a derivation of the
Second Law with the added advantage of a dynamical equation that describes
not only the approach to equilibrium but also the evolution of distributions
arbitrarily far from equilibrium. If anything, this is too good to be true, where
is the mistake?
The mistake is not to be found in the mathematics which is rather straight-
forward. The theorems are indeed true. The problem lies in the physics. To see
this suppose we deal with a single system in thermal contact with a heat bath at
temperature T . What is highly suspicious is that the process of thermalization
described by eq.(5.107) appears to be of universal validity. It depends only on
the bath temperature and is independent of all sorts of details about the ther-
mal contact. This is blatantly wrong physics: surely the thermal conductivity
of the walls that separate the system from the bath — whether the conductivity
is high or low, whether it is uniform or not — must be relevant.
Here is another problem with the physics: the parameter that describes the
evolution of ft has been called t and this might mislead us to think that t has
something to do with time. But this need not be so. In order for t to deserve
being called time — and for the Fokker-Planck equation to qualify as a true
dynamical equation, even if only an e¤ective or phenomenological one — one
must establish how the evolution parameter is related to properly calibrated
clocks. As it is, eq.(5.107) is not a dynamical equation. It is just a cleverly
constructed equation for those curves in the space of distributions that have the
peculiar property of admitting canonical distributions as stationary states.
lead to a new microstate this approach ignores the case of classical, and even
non-identical particles. For example, nanoparticles in a colloidal suspension or
macromolecules in solution are both classical and non-identical. Several authors
(e.g., [Grad 1961, 1967][Jaynes 1992]) have recognized that quantum theory has
no bearing on the matter; indeed, as remarked in section 3.5, this was already
clear to Gibbs.
Our purpose here is to discuss the Gibbs paradox from the point of view of
information theory. The discussion follows [Tseng Caticha 2001]. Our conclu-
sion will be that the paradox is resolved once it is realized that there is no such
thing as the entropy of a system, that there are many entropies. The choice
of entropy is a choice between a description that treats particles as being dis-
tinguishable and a description that treats them as indistinguishable; which of
these alternatives is more convenient depends on the resolution of the particular
experiment being performed.
The “grouping” property of entropy, eq.(4.3),
P
S[p] = SG [P ] + g Pg Sg [p jg ]
SH = E + log ZH ; (5.120)
measures the amount of information required to specify the microstate when all
we know is the value E.
5.11 Entropies, descriptions and the Gibbs paradox 151
Identical particles
Before we compute and interpret the probability distribution over mesostates
and its corresponding entropy we must be more speci…c about which mesostates
we are talking about. Consider a system of N classical particles that are exactly
identical. The interesting question is whether these identical particles are also
“distinguishable”. By this we mean the following: we look at two particles now
and we label them. We look at the particles later. Somebody might have
switched them. Can we tell which particle is which? The answer is: it depends.
Whether we can distinguish identical particles or not depends on whether we
were able and willing to follow their trajectories.
A slightly di¤erent version of the same question concerns an N -particle sys-
tem in a certain state. Some particles are permuted. Does this give us a di¤erent
state? As discussed earlier the answer to this question requires a careful speci-
…cation of what we mean by a state.
Since by a microstate we mean a point in the N -particle phase space, then
a permutation does indeed lead to a new microstate. On the other hand, our
concern with particle exchanges suggests that it is useful to introduce the notion
of a mesostate de…ned as the group of those N ! microstates that are obtained
by particle permutations. With this de…nition it is clear that a permutation of
the identical particles does not lead to a new mesostate.
Now we can return to discussing the connection between the thermodynamic
macrostate description and the description in terms of mesostates using, as
before, the method of Maximum Entropy. Since the particles are (sufficiently)
identical, all those N! microstates i within the same mesostate g have the same
energy, which we will denote by E_g (i.e., E_i = E_g for all i ∈ g). To the
macrostate of energy E = hEi we associate the canonical distribution,
    P_g = \frac{e^{-\beta E_g}}{Z_L} ,    (5.121)

where

    Z_L = \sum_g e^{-\beta E_g}   \quad\text{and}\quad   -\frac{\partial \log Z_L}{\partial \beta} = E .    (5.122)
The corresponding entropy, eq.(5.40), (setting k = 1)

    S_L = \beta E + \log Z_L ,    (5.123)

measures the amount of information required to specify the mesostate when all
we know is E.
Two different entropies S_H and S_L have been assigned to the same macrostate
E; they measure the different amounts of additional information required to
specify the state of the system to a high resolution (the microstate) or to a low
resolution (the mesostate).
The relation between Z_H and Z_L is obtained from

    Z_H = \sum_i e^{-\beta E_i} = N! \sum_g e^{-\beta E_g} = N!\, Z_L
    \quad\text{or}\quad
    Z_L = \frac{Z_H}{N!} .    (5.124)
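As a rough numerical illustration of eq.(5.124), the following sketch (a toy model whose parameters are chosen only for illustration, not anything taken from the text) enumerates the microstates and mesostates of a small discrete system and checks that Z_H = N! Z_L, so that S_H and S_L differ by log N!.

```python
import itertools
import math

import numpy as np

# Toy check of eq. (5.124): Z_H = N! Z_L.  Assumed model: N classical particles,
# each in one of M distinct single-particle states with energies eps[k], and no
# two particles sharing a state, so every mesostate contains exactly N! microstates.
N, M, beta = 3, 6, 1.0
eps = np.linspace(0.0, 1.0, M)

# High resolution: microstates are ordered assignments of particles to states.
Z_H = sum(math.exp(-beta * sum(eps[k] for k in perm))
          for perm in itertools.permutations(range(M), N))

# Low resolution: mesostates are unordered assignments.
Z_L = sum(math.exp(-beta * sum(eps[k] for k in combo))
          for combo in itertools.combinations(range(M), N))

print(Z_H / Z_L, math.factorial(N))                       # ratio is N! = 6
print(math.log(Z_H / Z_L), math.log(math.factorial(N)))   # S_H - S_L = log N!
```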
Non-identical particles
We saw that classical identical particles can be treated, depending on the res-
olution of the experiment, as being distinguishable or indistinguishable. Here
we go further and point out that even non-identical particles can be treated as
indistinguishable. Our goal is to state explicitly in precisely what sense it is up
to the observer to decide whether particles are distinguishable or not.
We defined a mesostate as a subset of N! microstates that are obtained as
permutations of each other. With this de…nition it is clear that a permutation
of particles does not lead to a new mesostate even if the exchanged particles
are not identical. This is an important extension because, unlike quantum
particles, classical particles cannot be expected to be exactly identical down to
every minute detail. In fact in many cases the particles can be grossly different –
examples might be colloidal suspensions or solutions of organic macromolecules.
A high resolution device, for example an electron microscope, would reveal that
no two colloidal particles or two macromolecules are exactly alike. And yet,
for the purpose of modelling most of our macroscopic observations it is not
necessary to take account of the myriad ways in which two particles can differ.
Consider a system of N particles. We can perform rather crude macroscopic
experiments the results of which can be summarized with a simple phenomeno-
logical thermodynamics where N is one of the relevant variables that define the
macrostate. Our goal is to construct a statistical foundation that will explain
this macroscopic model, reduce it, so to speak, to “first principles.” The par-
ticles might ultimately be non-identical, but the crude phenomenology is not
sensitive to their di¤erences and can be explained by postulating mesostates g
and microstates i with energies E_i = E_g, for all i ∈ g, as if the particles were
identical. As in the previous section this statistical model gives
    Z_L = \frac{Z_H}{N!}   \quad\text{with}\quad   Z_H = \sum_i e^{-\beta E_i} ,    (5.129)

and

    S_T = S_L = S_H - \log N! .    (5.130)

If, on the other hand, the experiment is sensitive enough to distinguish two
species a and b of particles (with N_a + N_b = N), the corresponding partition
functions are

    \hat{Z}_L = \frac{\hat{Z}_H}{N_a!\, N_b!}   \quad\text{with}\quad   \hat{Z}_H = \sum_{\hat{\imath}} e^{-\beta E_{\hat{\imath}}} .    (5.131)
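A similarly minimal numerical sketch of the comparison behind eqs.(5.130)-(5.131): under the assumption that the two descriptions share the same microstates and energies, the corresponding mesostate entropies differ by the familiar mixing term log(N!/(N_a! N_b!)). The numbers N_a and N_b below are arbitrary.

```python
import math

# Assumed toy counts: a mixture of N = N_a + N_b particles.  Mesostates contain
# N! microstates if the species are not distinguished, but only N_a! N_b!
# microstates if they are.  The entropies then differ by log( N! / (N_a! N_b!) ).
Na, Nb = 4, 6
N = Na + Nb
delta_S = (math.log(math.factorial(N))
           - math.log(math.factorial(Na))
           - math.log(math.factorial(Nb)))
print(delta_S)                        # the "entropy of mixing" term
print(math.log(math.comb(N, Na)))     # same number, log C(N, N_a)
```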
Conclusion
The Gibbs paradox in its various forms arises from the widespread misconception
that entropy is a real physical quantity and that one is justified in talking about
the entropy of the system. The thermodynamic entropy is not a property of the
system. Entropy is a property of our description of the system; it is a property of
the macrostate. More explicitly, it is a function of the macroscopic variables used
to define the macrostate. To different macrostates reflecting different choices of
variables there correspond different entropies for the very same system.
But this is not the complete story: entropy is not just a function of the
macrostate. Entropies reflect a relation between two descriptions of the same
system: one description is the macrostate, the other is the set of microstates,
or the set of mesostates, as the case might be. Then, having specified the
macrostate, an entropy can be interpreted as the amount of additional infor-
mation required to specify the microstate or mesostate. We have found the
not allow us to take into account the information contained in generic prior
distributions.
Thus, Bayes’ rule allows information contained in arbitrary priors and in
data, but not in arbitrary constraints,1 while on the other hand, MaxEnt can
handle arbitrary constraints but not arbitrary priors. In this chapter we bring
those two methods together: by generalizing the PMU we show how the MaxEnt
method can be extended beyond its original scope, as a rule to assign proba-
bilities, to a full-fledged method for inductive inference, that is, a method for
updating from arbitrary priors given information in the form of arbitrary con-
straints. It should not be too surprising that the extended Maximum Entropy
method — which we will henceforth abbreviate as ME, and also refer to as
‘entropic inference’ or ‘entropic updating’ — includes both MaxEnt and Bayes’
rule as special cases.
Historically the ME method is a direct descendant of MaxEnt. As we saw in
chapter 4 in the MaxEnt framework entropy is interpreted through the Shannon
axioms as a measure of the amount of information that is missing in a probability
distribution. We discussed some limitations of this approach. The Shannon
axioms refer to probabilities of discrete variables; for continuous variables the
entropy is not defined. But a more serious objection was raised: even if we grant
that the Shannon axioms do lead to a reasonable expression for the entropy, to
what extent do we believe the axioms themselves? Shannon’s third axiom, the
grouping property, is indeed sort of reasonable, but is it necessary? Is entropy
the only consistent measure of uncertainty or of information? What is wrong
with, say, the standard deviation? Indeed, there exist examples in which the
Shannon entropy does not seem to reflect one’s intuitive notion of information
[Uffink 1995]. One could introduce other entropies justified by different choices
of axioms (see, for example, [Renyi 1961] and [Tsallis 1988]). Which one should
we adopt? If different systems are to be handled using different Renyi entropies,
how do we handle composite systems?
From our point of view the real limitation is that neither Shannon nor Jaynes
were concerned with the problem of updating. Shannon was analyzing the capac-
ity of communication channels and characterizing the diversity of the messages
that could potentially be generated by a source (section 4.8). His entropy makes
no reference to prior distributions. On the other hand, as we already mentioned,
Jaynes conceived MaxEnt as a method to assign probabilities on the basis of
constraint information and a fixed underlying measure, not an arbitrary prior.
He never meant to update from one probability distribution to another.
Considerations such as these motivated several attempts to develop ME di-
rectly as a method for updating probabilities without invoking questionable
measures of uncertainty [Shore and Johnson 1980; Skilling 1988-1990; Csiszar
1991, 2008; Caticha 2003, 2014a]. The important contribution by Shore and
Johnson was the realization that one could axiomatize the updating method
itself rather than the information measure. Their axioms are justified on the
1 Bayes’ rule can handle constraints when they are expressed in the form of data that can
be plugged into a likelihood function but not all constraints are of this kind.
Caticha Giffin 2006, Caticha 2007, 2014a, Vanslette 2017] and in earlier versions of these
lectures [Caticha 2008, 2012c].
in the form of data into a constraint that can be processed using ME, is that
it is particularly clear. It throws light on Bayes’ rule and demonstrates its
complete compatibility with ME updating. Thus, within the ME framework
maximum entropy and Bayesian methods are unified into a single consistent
theory of inference. One advantage of this insight is that it allows a number
of generalizations of Bayes’ rule (see section 2.10.2). Another is that it has
implications for physics: it provides an important missing piece for the old
puzzles of quantum mechanics concerning the so-called collapse of the wave
function and the measurement problem (see Chapter 11).
There is yet another function that the ME method must perform in order to
fully qualify as a method of inductive inference. Once we have decided that the
distribution of maximum entropy is to be preferred over all others the following
question arises immediately: the maximum of the entropy functional is never
infinitely sharp; are we really confident that distributions that lie very close
to the maximum are totally ruled out? We must find a quantitative way to
assess the extent to which distributions with lower entropy are ruled out. This
topic, which completes the formulation of the ME method, will be addressed in
chapter 8.
1982, 2003] and references therein. For a critical appraisal see [Norton 2011, 2013].
say that we have received information when among the vast variety of messages
that could have been generated by a distant source, we discover which particular
message was actually sent. It is thus that the message “carries” information.
The analogy with physics is immediate: the set of all possible states of a physical
system can be likened to the set of all possible messages, and the actual state of
the system corresponds to the message that was actually sent. Thus, the system
“conveys” a message: the system “carries” information about its own state.
Sometimes the message might be difficult to read, but it is there nonetheless.
This language — information is physical — useful as it has turned out to be,
does not, however, exhaust the meaning of the word ‘information’.
Here we will follow a different path. We seek an epistemic notion of infor-
mation that is somewhat closer to the everyday colloquial use of the term —
roughly, information is what I get when my question has been answered. Indeed,
a fully Bayesian information theory requires an explicit account of the relation
between information and the beliefs of ideally rational agents. Furthermore,
implicit in the recognition that most of our beliefs are held on the basis of in-
complete information is the idea that our beliefs would be better if only we had
more information. Thus a theory of probability demands a theory for updating
probabilities.
The desire and need to update our assessment of what beliefs we ought to
hold is driven by the conviction that not all beliefs, not all probability assign-
ments, are equally good. The concern with ‘good’ and ‘better’ bears on the issue
of whether probabilities are subjective, objective, or somewhere in between. We
argued earlier (in Chapter 1) that what makes one probability assignment bet-
ter than another is that the adoption of better beliefs has real consequences:
they provide a better guidance about how to cope with the world, and in this
pragmatic sense, they provide a better guide to the “truth”. Thus, objectivity is
desirable; objectivity is the goal. Probabilities are useful to the extent that they
incorporate some degree of epistemic objectivity.4 What we seek are updating
mechanisms that allow us to process information and incorporate its objective
features into our beliefs. Bayes’ rule behaves precisely in this way. We saw in
section 2.10.3 that as more and more data are taken into account the original
(possibly subjective) prior becomes less and less relevant, and all rational agents
become more and more convinced of the same truth. This is crucial: were it
not this way Bayesian reasoning would not be deemed acceptable.
To set the stage for the discussion below consider some examples. Suppose
a new piece of information is acquired. This could take a variety of forms. The
typical example in data analysis would be something like: The prior probability
of a certain proposition might have been q and after analyzing some data we
feel rationally justified in asserting that a better assignment would be p. More
explicitly, propositions such as “the value of the variable X lies between x − ε
and x + ε” might initially have had probabilities that were broadly spread over
the range of x and after a measurement is performed the new data might induce
4 We recall from Section 1.1.3 that probabilities are ontologically subjective but epistemi-
cally they can span the range from being fully subjective to fully objective.
Information is defined by its effects: (a) it induces us to update from prior beliefs
to posterior beliefs, and (b) it restricts our options as to what we are honestly
and rationally allowed to believe. This, I propose, is the defining characteristic
of information.
One significant aspect of this notion is that for a rational agent, the identifi-
cation of what constitutes information — as opposed to mere noise — already
involves a judgement, an evaluation; it is a matter of facts and also a matter of
values. Furthermore, once a certain proposition has been identified as informa-
tion, the revision of beliefs acquires a moral component; it is no longer optional:
it becomes a moral imperative.
Another aspect is that the notion that information is directly related to
changing our minds does not involve any talk about amounts of information.
Nevertheless it allows precise quantitative calculations. Indeed, constraints on
the acceptable posteriors are precisely the kind of information the method of
maximum entropy is designed to handle.
Figure 6.1: (a) In mechanics force is defined as that which affects motion. (b)
Inference is dynamics too: information is defined as that which affects rational
beliefs.
Constraints can take a wide variety of forms including, in addition to the ex-
amples mentioned above, anything capable of affecting beliefs. For example,
in Bayesian inference the likelihood function constitutes information because it
contributes to constrain our posterior beliefs. And constraints need not be just
in the form of expected values; they can specify the functional form of a dis-
tribution or be imposed through various geometrical relations. (See Chapters 8
and 11.)
Concerning the act of updating it may be worthwhile to point out an analogy
with dynamics — the study of change. In Newtonian dynamics the state of
motion of a system is described in terms of momentum — the “quantity” of
motion — while the change from one state to another is explained in terms of
an applied force or impulse. Similarly, in Bayesian inference a state of belief
is described in terms of probabilities — a “degree” of belief — and the change
from one state to another is due to information (see Fig.6.1). Just as a force is
that which induces a change from one state of motion to another, so information
is that which induces a change from one state of belief to another. Updating is a
form of dynamics. In Chapter 11 we will reverse the logic and derive dynamical
laws of physics as examples of entropic updating of probabilities — an entropic
dynamics.
What about prejudices and superstitions? What about divine revelations?
Do they constitute information? Perhaps they lie outside our restriction to
beliefs of ideally rational agents, but to the extent that their effects are indistin-
guishable from those of other sorts of information, namely, they affect beliefs,
they should qualify as information too. Whether the sources of such information
are reliable or not is quite another matter. False information is information too.
In fact, even ideally rational agents can be affected by false information because
the evaluation that assures them that the data was competently collected or
that the message originated from a reliable source involves an act of judgement
that is not completely infallible. Strictly, all those judgements, which constitute
the first step of the inference process, are themselves the end result of other
inference processes that are not immune from uncertainty.
What about limitations in our computational power? Such practical limi-
tations are unavoidable and they do influence our inferences, so should they be
considered information? No. Limited computational resources may a¤ect the
numerical approximation to the value of, say, an integral, but they do not affect
the actual value of the integral. Similarly, limited computational resources may
affect the approximate imperfect reasoning of real humans and real computers
but they do not affect the reasoning of those ideal rational agents that are the
subject of our present concerns.
Universality
The goal is to design a method for induction, for reasoning when not much
is known. In order for the method to perform its function — to be useful —
we require that it be of universal applicability. Consider the alternative: we
could design methods that are problem-specific, and employ different induc-
tion methods for different problems. Such a framework, unfortunately, would
fail us precisely when we need it most, namely, in those situations where the
information available is so incomplete that we do not know which method to
employ.
We can argue this point somewhat di¤erently. It is quite conceivable that
different situations could require different problem-specific induction methods.
What we want to design here is a general-purpose method that captures what
all the other problem-speci…c methods have in common.
Parsimony
To specify the updating we adopt a very conservative criterion that recognizes
the value of information: what has been laboriously learned in the past is valu-
able and should not be disregarded unless rendered obsolete by new information.
The only aspects of one’s beliefs that should be updated are those for which new
evidence has been supplied. Thus we adopt a Principle of Minimal Updating
(PMU): beliefs should be updated only to the extent required by the new information.
This version of the principle generalizes the earlier version presented in section
2.10.2 which was restricted to information in the form of data.
The special case of updating in the absence of new information deserves
a comment. The PMU states that when there is no new information ideally
rational agents should not change their minds.5 In fact, it is difficult to imagine
any notion of rationality that would allow the possibility of changing one’s mind
for no apparent reason. This is important and it is worthwhile to consider it
from a different angle. Degrees of belief, probabilities, are said to be subjective:
two different agents might not share the same beliefs and could conceivably
assign probabilities differently. But subjectivity does not mean arbitrariness.
It is not a blank check allowing rational agents to change their minds for no
5 Our concern here is with ideally rational agents who have fully processed all information
acquired in the past. Our subject is not the psychology of actual humans who often change
their minds by processes that are not fully conscious.
Independence
The next general requirement turns out to be crucially important: without it
the very possibility of scientific theories would be compromised. The point
is that every scienti…c model, whatever the topic, if it is to be useful at all,
must assume that all relevant variables have been taken into account and that
whatever was left out — the rest of the universe — should not matter. To put
it another way: in order to do science we must be able to understand parts of
the universe without having to understand the universe as a whole. Granted,
it is not necessary that the understanding be complete and exact; it must be
merely adequate for our purposes.
The assumption, then, is that it is possible to focus our attention on a suit-
ably chosen system of interest and neglect the rest of the universe because they
are “sufficiently independent.” Thus, in any form of science the notion of sta-
tistical independence must play a central and privileged role. This idea — that
some things can be neglected, that not everything matters — is implemented
by imposing a criterion that tells us how to handle independent systems. The
requirement is quite natural: Whenever two systems are a priori believed to
be independent and we receive information about one it should not matter if
the other is included in the analysis or not. This amounts to requiring that
independence be preserved unless information about correlations is explicitly
introduced.
Again we emphasize: none of these criteria are imposed by Nature. They
are desirable for pragmatic reasons; they are imposed by design.
Which distribution among all those that are in principle acceptable — they all
satisfy the constraints — should we select?
Our goal is to design a method that allows a systematic search for the pre-
ferred posterior distribution. The central idea, first proposed in [Skilling 1988],6
is disarmingly simple: to select the posterior first rank all candidate distribu-
tions in increasing order of preference and then pick the distribution that ranks
the highest. Irrespective of what it is that makes one distribution “preferable”
over another (we will get to that soon enough) it is clear that any such ranking
must be transitive: if distribution p1 is preferred over distribution p2 , and p2 is
preferred over p3 , then p1 is preferred over p3 . Transitive rankings are imple-
mented by assigning to each p a real number S[p], which is called the entropy of
p, in such a way that if p1 is preferred over p2 , then S[p1 ] > S[p2 ]. The selected
distribution (one or possibly many, for there may be several equally preferred
distributions) is that which maximizes the entropy functional.
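A minimal sketch of this ranking idea in code (the candidate distributions and the placeholder functional S below are assumptions made only for illustration; the actual functional form of the entropy is designed later in this chapter):

```python
import numpy as np

# Among candidates that all satisfy the constraints (here, say, p_3 = 0.5),
# assign each a real number S[p] and select the one that ranks highest.
candidates = [np.array([0.2, 0.3, 0.5]),
              np.array([0.1, 0.4, 0.5]),
              np.array([0.3, 0.2, 0.5])]

def S(p):
    # Placeholder ranking functional; any transitive ranking works the same way.
    return -np.sum(p * np.log(p))

best = max(candidates, key=S)     # transitive ranking -> pick the top
print(best)
```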
The importance of this strategy of ranking distributions cannot be overesti-
mated: it implies that the updating method will take the form of a variational
principle — the method of Maximum Entropy (ME) — and that the latter will
involve a certain functional — the entropy — that maps distributions to real
numbers and that is designed to be maximized. These features are not imposed
by Nature; they are all imposed by design. They are dictated by the func-
tion that the ME method is supposed to perform. (Thus, it makes no sense to
seek a generalization in which entropy is a complex number or a vector; such a
generalized entropy would just not perform the desired function.)
Next we specify the ranking scheme, that is, we choose a specific functional
form for the entropy S[p]. Note that the purpose of the method is to update
from priors to posteriors so the ranking scheme must depend on the particular
prior q and therefore the entropy S must be a functional of both p and q.
The entropy S[p; q] describes a ranking of the distributions p relative to the
given prior q. S[p; q] is the entropy of p relative to q, and accordingly S[p; q]
is commonly called relative entropy. This is appropriate and sometimes we
will follow this practice. However, since all entropies are relative, even when
relative to a uniform distribution, the qualifier ‘relative’ is redundant and can
be dropped. This is somewhat analogous to the situation with energy: it is
implicitly understood that all energies are relative to some reference frame or
some origin of potential energy but there is no need to constantly refer to a
‘relative energy’— it is just not done.
The functional S[p; q] is designed by a process of elimination — one might call
it a process of eliminative induction. First we state the desired design criteria;
this is the crucial step that defines what makes one distribution preferable over
another. Candidate functionals that fail to satisfy the criteria are discarded
— hence the qualifier ‘eliminative’. As we shall see the criteria adopted below
are sufficiently constraining that there is a single entropy functional S[p; q] that
survives the process of elimination.
6 [Skilling 1988] deals with the more general problem of ranking positive additive distributions.
This approach has a number of virtues. First, to the extent that the design
criteria are universally desirable, the single surviving entropy functional will
be of universal applicability too. Second, the reason why alternative entropy
candidates are eliminated is quite explicit — at least one of the design criteria
is violated. Thus, the justification behind the single surviving entropy is not that
it leads to demonstrably correct inferences, but rather, that all other candidates
demonstrably fail to perform as desired.
symmetries of the lattice of propositions see [Knuth 2005, 2006; Knuth Skilling 2012].
conditional posterior is
    P(i|D) = q(i|D) .    (6.1)
(We adopt the following notation: priors are denoted by q, candidate posteriors
by lower case p, and the selected posterior by upper case P . We shall write
either p(i) or pi .)
We emphasize: the point is not that we make the unwarranted assumption
that keeping q(i|D) unchanged is guaranteed to lead to correct inferences. It
need not; induction is risky. The point is, rather, that in the absence of any
evidence to the contrary there is no reason to change our minds and the prior
information takes priority.
The consequence of DC1 is that non-overlapping domains of i contribute
additively to the entropy,
    S(p; q) = \sum_i F(p_i, q_i) ,    (6.2)
consists of updating q(i) → P(i) = δ_{ii'} to agree with the new information and
invoking the PMU so that P(θ|i') = q(θ|i') remains unchanged. Therefore the
marginal posterior is P(θ) = \sum_i P(i) P(θ|i) = q(θ|i'),
which is Bayes’ rule (see sections 2.10.2 and 6.6 below). Thus, entropic inference
is designed to include Bayesian inference as a special case. Note however that
imposing DC1 is not identical to imposing Bayesian conditionalization: DC1 is
not restricted to information in the form of absolute certainties such as p(D) = 1.
Comment 4: If the label i is turned into a continuous variable x the criterion
DC1 requires that information that refers to points infinitely close but just
outside the domain D will have no influence on probabilities conditional on D.
This may seem surprising as it may lead to updated probability distributions
that are discontinuous. Is this a problem? No.
In certain situations (common in e.g. physics) we might have explicit reasons
to believe that conditions of continuity or differentiability should be imposed
and this information might be given to us in a variety of ways. The crucial
point, however — and this is a point that we keep and will keep reiterating — is
that unless such information is in fact explicitly given we should not assume it.
If the new information leads to discontinuities, so be it. The inference process
should not be expected to discover and replicate information with which it was
not supplied.
Subsystem independence
DC2 When two systems are a priori believed to be independent and we receive
independent information about one then it should not matter if the other
is included in the analysis or not.
    dx = \gamma(x')\, dx'   \quad\text{where}\quad   \gamma(x') = \frac{\partial x}{\partial x'} .    (6.9)
9 The insight that coordinate invariance could be derived as a consequence of the require-
Since p(x) is a density, the transformed function p'(x') is such that p(x)dx =
p'(x')dx' is invariant. Therefore

    p(x) = \frac{p'(x')}{\gamma(x')}   \quad\text{and}\quad   q(x) = \frac{q'(x')}{\gamma(x')} ,    (6.10)
This extends the method of maximum entropy beyond its original purpose
as a rule to assign probabilities from a given underlying measure (MaxEnt) to a
method for updating probabilities from any arbitrary prior (ME). Furthermore,
the logic behind the updating procedure does not rely on any particular meaning
assigned to the entropy, either in terms of information, or heat, or disorder.
Entropy is merely a tool for inductive inference. No interpretation for S[p; q]
is given and none is needed. We do not need to know what entropy means; we
only need to know how to use it.
Comment: In chapter 8 we will refine the method further. There we will
address the question of assessing the extent to which distributions that are
close to the entropy maximum ought to be ruled out or should be included in
the analysis. Their contribution — which accounts for fluctuation phenomena —
turns out to be particularly significant in situations where the entropy maximum
is not particularly sharp.
The derivation above has singled out a unique S[p; q] to be used in inductive
inference. Other “entropies” (such as, the one-parameter family of entropies
proposed in [Renyi 1961, Aczel Daróczy 1975, Amari 1985, Tsallis 1988], see
Section 6.5.3 below) might turn out to be useful for other purposes — perhaps
as measures of some kinds of information, or measures of discrimination or
DC1 states that the constraint on D̃ does not have an influence on the condi-
tional probabilities p_{i|D}. It may however influence the probabilities p_i within D
through an overall multiplicative factor. To deal with this complication consider
then a special case where the overall probabilities of D and D̃ are constrained
too,

    \sum_{i \in D} p_i = P_D   \quad\text{and}\quad   \sum_{j \in \tilde{D}} p_j = P_{\tilde{D}} ,    (6.14)
leading to

    \frac{\partial S}{\partial p_i} = \mu   \quad\text{for } i \in D ,    (6.15)

    \frac{\partial S}{\partial p_j} = \tilde{\mu} + \lambda a_j   \quad\text{for } j \in \tilde{D} .    (6.16)
Eqs.(6.13)-(6.16) are n + 3 equations we must solve for the p_i's and the three
Lagrange multipliers. Since S = S(p_1 \ldots p_n; q_1 \ldots q_n) its derivative

    \frac{\partial S}{\partial p_i} = f_i(p_1 \ldots p_n; q_1 \ldots q_n)    (6.17)
could in principle also depend on all 2n variables. But this violates the locality
criterion because any arbitrary change in a_j within D̃ would influence the p_i's
within D. The only way that probabilities conditioned on D can be shielded
from arbitrary changes in the constraints pertaining to D̃ is that for any i ∈ D
the function f_i depends only on p_j's with j ∈ D. Furthermore, this must hold
not just for one particular partition of X into domains D and D̃, it must hold
for all conceivable partitions including the partition into atomic propositions.
Therefore f_i can depend only on p_i,

    \frac{\partial S}{\partial p_i} = f_i(p_i; q_1 \ldots q_n) .    (6.18)
But the power of the locality criterion is not exhausted yet. The information
to be incorporated into the posterior can enter not just through constraints but
also through the prior. Suppose that the local information about domain D̃ is
altered by changing the prior within D̃. Let q_j → q_j + δq_j for j ∈ D̃. Then
(6.18) becomes

    \frac{\partial S}{\partial p_i} = f_i(p_i; q_1 \ldots q_j + \delta q_j \ldots q_n) ,    (6.19)

which shows that p_i with i ∈ D will be influenced by information about D̃ unless
f_i with i ∈ D is independent of all the q_j's for j ∈ D̃. Again, this must hold for
all possible partitions into D and D̃, and therefore,

    \frac{\partial S}{\partial p_i} = f_i(p_i; q_i)   \quad\text{for all } i \in X .    (6.20)
The choice of the functions f_i(p_i; q_i) can be restricted further. If we were to
maximize S[p; q] subject to constraints

    \sum_i p_i = 1   \quad\text{and}\quad   \sum_i a_i p_i = A    (6.21)

we get

    \frac{\partial S}{\partial p_i} = f_i(p_i; q_i) = \alpha + \lambda a_i   \quad\text{for all } i \in X ,    (6.22)

where α and λ are Lagrange multipliers. Solving for p_i gives the posterior,

    P_i = g_i(q_i; \alpha; \lambda; a_i)    (6.23)

for some functions g_i. As stated in Section 6.2.3 we do not assume that the
labels i themselves carry any particular significance. This means, in particular,
that for any proposition labelled i we want the selected posterior P_i to depend
only on the prior q_i and on the constraints – that is, on α, λ, and a_i. We do not
want to have different updating rules for different propositions: two different
propositions i and i' with the same priors q_i = q_{i'} and the same constraints
a_i = a_{i'} should be updated to the same posteriors, P_i = P_{i'}. In other words the
functions g_i and f_i must be independent of i. Therefore

    \frac{\partial S}{\partial p_i} = f(p_i; q_i)   \quad\text{for all } i \in X .    (6.24)
for some still undetermined function F. The constant has no effect on the
maximization and can be dropped.
The corresponding expression for a continuous variable x is obtained replac-
ing i by x, and the sum over i by an integral over x leading to eq.(6.2),
    S[p; q] = \int dx\, F(p(x); q(x)) .    (6.26)
Comment: One might wonder whether in taking the continuum limit there
might be room for introducing first and higher derivatives of p and q so that
the function F might include more arguments,

    F \overset{?}{=} F\!\left(p, q, \frac{dp}{dx}, \frac{dq}{dx}, \ldots\right) .    (6.27)
The answer is no! As discussed in the previous section one must not allow the
inference method to introduce assumptions about continuity or di¤erentiability
unless such conditions are explicitly introduced as information. In the absence
of any information to the contrary the prior information takes precedence; if this
leads to discontinuities we must accept them. On the other hand, we may find
ourselves in situations where our intuition insists that the discontinuities should
just not be there. The right way to handle such situations (see section 4.12) is
to recognize the existence of additional constraints concerning continuity that
must be explicitly taken into account.
Case (a) — Suppose nothing is said about subsystem 2 and the information
about subsystem 1 is extremely constraining: for subsystem 1 we maximize
S1 [p1 ; q1 ] subject to the constraint that p1 (i1 ) is P1 (i1 ), the selected posterior
being, naturally, p1 (i1 ) = P1 (i1 ). For subsystem 2 we maximize S2 [p2 ; q2 ] sub-
ject only to normalization so there is no update, P2 (i2 ) = q2 (i2 ).
When the systems are treated jointly, however, the inference is not nearly
as trivial. We want to maximize the entropy of the joint system,
    S[p; q] = \sum_{i_1, i_2} F\big(p(i_1, i_2);\, q_1(i_1)\, q_2(i_2)\big) ,    (6.28)
subject to normalization,

    \sum_{i_1, i_2} p(i_1, i_2) = 1 ,    (6.29)

and to the constraint that the marginal for subsystem 1 be the updated P_1(i_1),

    \sum_{i_2} p(i_1, i_2) = P_1(i_1) .    (6.30)

Notice that this is not just one constraint: we have one constraint for each value
of i_1, and each constraint must be supplied with its own Lagrange multiplier,
λ_1(i_1). Then,

    \delta\Big[ S - \sum_{i_1} \lambda_1(i_1) \Big( \sum_{i_2} p(i_1, i_2) - P_1(i_1) \Big) - \alpha \Big( \sum_{i_1, i_2} p(i_1, i_2) - 1 \Big) \Big] = 0 .    (6.31)
The independent variations δp(i_1, i_2) yield

    \frac{\partial S}{\partial p(i_1, i_2)} = \lambda_1(i_1) + \alpha ,    (6.32)

where

    \frac{\partial S}{\partial p(i_1, i_2)} = \frac{\partial}{\partial p} F(p; q_1 q_2) = f(p; q_1 q_2) .    (6.33)

Next we impose that the selected posterior is the product P_1(i_1) q_2(i_2). The
function f must be such that

    f(P_1 q_2;\, q_1 q_2) = \alpha + \lambda_1 .    (6.34)
Since the RHS is independent of the argument i2 , the f function on the LHS
must be such that the i2 -dependence cancels out and this cancellation must
occur for all values of i2 and all choices of the prior q2 . Therefore we impose
that for any value of x the function f(p; q) must satisfy

    f\!\left(\frac{p}{q}, 1\right) = f(p; q)   \quad\text{or}\quad   \frac{\partial F}{\partial p} = f(p; q) = \Phi\!\left(\frac{p}{q}\right) .    (6.36)

Thus, the function f(p; q) has been constrained to a function \Phi(p/q) of a single
argument.
Case (b) — Next we consider a situation in which both subsystems are up-
dated and the information is assumed to be extremely constraining: when the
subsystems are treated separately q1 (i1 ) is updated to P1 (i1 ) and q2 (i2 ) is up-
dated to P2 (i2 ). When the systems are treated jointly we require that the joint
prior for the combined system q1 (i1 )q2 (i2 ) be updated to P1 (i1 )P2 (i2 ).
First we treat the subsystems separately. Maximize the entropy of subsystem
1,

    S[p_1; q_1] = \sum_{i_1} F\big(p_1(i_1); q_1(i_1)\big)   \quad\text{subject to}\quad   p_1(i_1) = P_1(i_1) .    (6.37)

To each constraint — one constraint for each value of i_1 — we must supply one
Lagrange multiplier, λ_1(i_1). Then,

    \delta\Big[ S - \sum_{i_1} \lambda_1(i_1) \big( p_1(i_1) - P_1(i_1) \big) \Big] = 0 .    (6.38)
Using eq.(6.36),

    \frac{\partial S}{\partial p_1} = \frac{\partial}{\partial p_1} F(p_1; q_1) = \Phi\!\left(\frac{p_1}{q_1}\right) ,    (6.39)

and, imposing that the selected posterior be P_1(i_1), we find that the function
\Phi must obey

    \Phi\!\left(\frac{P_1(i_1)}{q_1(i_1)}\right) = \lambda_1(i_1) .    (6.40)

Similarly, for system 2 we find

    \Phi\!\left(\frac{P_2(i_2)}{q_2(i_2)}\right) = \lambda_2(i_2) .    (6.41)
Next we treat the two subsystems jointly. Maximize the entropy of the joint
system,

    S[p; q] = \sum_{i_1, i_2} F\big(p(i_1, i_2);\, q_1(i_1)\, q_2(i_2)\big) ,    (6.42)

and we impose that the selected posterior be the product P_1(i_1) P_2(i_2). There-
fore, the function \Phi must be such that

    \Phi\!\left(\frac{P_1 P_2}{q_1 q_2}\right) = \mu_1 + \mu_2 .    (6.47)
Exponentiating,

    \Psi\!\left(\frac{P_1 P_2}{q_1 q_2}\right) = e^{\mu_1} e^{\mu_2}   \quad\text{where}\quad   \Psi = \exp \Phi ,    (6.48)

we can rewrite this as

    \Psi\!\left(\frac{P_1 P_2}{q_1 q_2}\right) e^{-\mu_2(i_2)} = e^{\mu_1(i_1)} .    (6.49)
This shows that for any value of i_1, the dependences in the LHS on i_2 through
P_2/q_2 and μ_2 must cancel each other out. In particular, if for some subset of
i_2's the subsystem 2 is updated so that P_2 = q_2, which amounts to no update at
all, the i_2 dependence on the left is eliminated but the i_1 dependence remains
unaffected,
    \Psi\!\left(\frac{P_1}{q_1}\right) e^{-\mu'_2} = e^{\mu_1(i_1)} ,    (6.50)

where μ'_2 is some constant independent of i_2. A similar argument with {1 ↔ 2}
yields

    \Psi\!\left(\frac{P_2}{q_2}\right) e^{-\mu'_1} = e^{\mu_2(i_2)} ,    (6.51)

where μ'_1 is a constant. Taking the exponential of (6.40) and (6.41) leads to

    \Psi\!\left(\frac{P_1}{q_1}\right) = e^{\lambda_1(i_1)}   \quad\text{and}\quad   \Psi\!\left(\frac{P_2}{q_2}\right) = e^{\lambda_2(i_2)} ,    (6.52)

so that combining (6.48) with (6.50) and (6.51) gives

    \tilde\Psi\!\left(\frac{P_1 P_2}{q_1 q_2}\right) = \tilde\Psi\!\left(\frac{P_1}{q_1}\right) \tilde\Psi\!\left(\frac{P_2}{q_2}\right) ,    (6.53)

where a constant factor e^{-(\mu'_1 + \mu'_2)} has been absorbed into a new function \tilde\Psi. The
general solution of this functional equation is a power,

    \tilde\Psi(x) = x^a ,    (6.54)

so that

    \Phi(x) = a \log x + b ,    (6.55)

where a and b are constants. Finally, we can integrate (6.36),

    \frac{\partial F}{\partial p} = \Phi\!\left(\frac{p}{q}\right) = a \log\frac{p}{q} + b ,    (6.56)
to get

    F(p; q) = a\, p \log\frac{p}{q} + b' p + c ,    (6.57)

and

    S[p; q] = \sum_i \left( a\, p_i \log\frac{p_i}{q_i} + b' p_i + c \right) .    (6.58)
The additive constant c may be dropped: it contributes a term that does not de-
pend on the probabilities and has no effect on the ranking scheme. Furthermore,
since S[p; q] will be maximized subject to constraints that include normalization,
which is implemented by adding a term \alpha \sum_i p_i, the b' constant can always be
absorbed into the undetermined multiplier \alpha. Thus, the b' term has no effect on
the selected distribution and can be dropped too. Finally, a is just an overall
multiplicative constant; it does not affect the overall ranking except in the
trivial sense that inverting the sign of a will transform the maximization prob-
lem into a minimization problem or vice versa. We can therefore set a = −1 so
that maximum S corresponds to maximum preference, which gives us eq.(6.12),
the relative entropy S[p; q] = -\sum_i p_i \log(p_i/q_i),
and concludes our derivation.
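The following sketch illustrates how the resulting entropy is used in practice: maximizing S[p; q] subject to a single expectation constraint yields an exponential-family posterior, and the multiplier is fixed by the constraint. The prior q, the function a and the value A below are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.optimize import brentq

# Minimal sketch of ME updating with S[p; q] = -sum_i p_i log(p_i / q_i),
# subject to normalization and <a> = A.  The maximizer has the form
# P_i = q_i exp(lam * a_i) / Z, with lam chosen to satisfy the constraint.
q = np.array([0.4, 0.3, 0.2, 0.1])
a = np.array([0.0, 1.0, 2.0, 3.0])
A = 1.8

def posterior(lam):
    w = q * np.exp(lam * a)
    return w / w.sum()

lam = brentq(lambda l: posterior(l) @ a - A, -50.0, 50.0)
P = posterior(lam)
print(P, P @ a)       # the posterior and its expected value <a> = A
```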
Consider a large number N of independent draws from a distribution q =
(q_1 \ldots q_m). The probability of observing the counts (n_1 \ldots n_m), that is, the
frequencies f_i = n_i/N, is the multinomial distribution

    Q_N(f|q) = \frac{N!}{n_1! \cdots n_m!}\, q_1^{n_1} \cdots q_m^{n_m}   \quad\text{with}\quad   \sum_{i=1}^m n_i = N .    (6.59)
When the n_i are sufficiently large we can use Stirling’s approximation,

    \log n! = n \log n - n + \log\sqrt{2\pi n} + O(1/n) .    (6.60)

10 This is the “frequency” with which one observes the microstate i in the large sample N.
Then

    \log Q_N(f|q) \approx N \log N - N + \log\sqrt{2\pi N}
        - \sum_i \big( n_i \log n_i - n_i + \log\sqrt{2\pi n_i} \big) + \sum_i n_i \log q_i

    = -N \sum_i \frac{n_i}{N} \log\frac{n_i/N}{q_i} - \sum_i \log\sqrt{\frac{n_i}{N}} - (m-1) \log\sqrt{2\pi N}

    = N S[f; q] - \sum_i \log\sqrt{f_i} - (m-1) \log\sqrt{2\pi N} ,    (6.61)
The answer is given by (6.64): for large N, maximizing the probability Q_N(f|q)
subject to the constraint \bar{a} = A is equivalent to maximizing the entropy S[f; q]
subject to \bar{a} = A. In the limit of large N the frequencies f_i converge (in
probability) to the desired posterior P_i while the sample average \bar{a} = \sum_i a_i f_i
converges (also in probability) to the expected value \langle a \rangle = A.
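A quick numerical check of this asymptotic behavior (the distributions q and f below are assumptions): the scaled log-probability (1/N) log Q_N(f|q) approaches the relative entropy S[f; q] as N grows.

```python
import numpy as np
from scipy.stats import multinomial

# For large N the multinomial probability of frequencies f satisfies
# (1/N) log Q_N(f|q) -> S[f; q] = -sum_i f_i log(f_i / q_i).
q = np.array([0.5, 0.3, 0.2])
f = np.array([0.4, 0.35, 0.25])
S = -np.sum(f * np.log(f / q))

for N in (100, 1000, 10000):
    n = np.round(N * f).astype(int)
    n[-1] = N - n[:-1].sum()            # force the counts to add up to N
    print(N, multinomial.logpmf(n, N, q) / N, S)
```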
[Csiszar 1984] and [Grendar 2001] have argued that the asymptotic argument
above provides by itself a valid justification for the ME method of updating. An
agent whose prior is q receives the information \langle a \rangle = A which can be reasonably
interpreted as a sample average \bar{a} = A over a large ensemble of N trials. The
agent’s beliefs are updated so that the posterior P coincides with the most
What if the prior q(x) vanishes for some values of x? S[p; q] can be infinitely
negative when q(x) vanishes within some region D. In other words, the ME
method confers an overwhelming preference on those distributions p(x) that
vanish whenever q(x) does. One must emphasize that this is as it should be; it
is not a problem. As we saw in section 2.10.4 a similar situation also arises in the
context of Bayes’ theorem where a vanishing prior represents a tremendously
serious commitment because no amount of data to the contrary would allow us to
revise it. In both ME and Bayes updating we should recognize the implications
of assigning a vanishing prior. Assigning a very low but non-zero prior represents
a safer and less prejudiced representation of one’s beliefs.
For more on the choice of priors see the review [Kass Wasserman 1996]; in
particular for entropic priors see [Rodriguez 1990-2003, Caticha Preuss 2004]
A related entropy,

    S'_\eta[p] = \frac{1}{\eta}\left( 1 - \int dx\, p^{\eta+1} \right) ,    (6.68)

has been proposed in [Tsallis 1988] and forms the foundation of a so-called non-
extensive statistical mechanics (see section 5.5). Clearly these two entropies
are equivalent in that they generate equivalent variational problems – maxi-
mizing S_\eta is equivalent to maximizing S'_\eta. To conclude our brief remarks on
the entropies S_\eta we point out that quite apart from the difficulty of achieving
consistency with the law of large numbers, some of the probability distributions
obtained by maximizing S_\eta may also be derived through a more standard use of
MaxEnt or ME as advocated in these lectures (section 5.5).
(1) the prior information codified into a prior distribution q(θ); (2) the data
x ∈ X (obtained in one or many experiments); and (3) the known relation
between θ and x given by the model as defined by the sampling distribution
or likelihood, q(x|θ). The updating consists of replacing the prior probability
distribution q(θ) by a posterior distribution P(θ) that applies after the data has
been processed.
The crucial element that will allow Bayes’ rule to be smoothly incorporated
into the ME scheme is the realization that before the data is available not only
do we not know θ, we do not know x either. Thus, the relevant space for inference
is not Θ but the product space Θ × X and the relevant joint prior is
q(x, θ) = q(θ) q(x|θ). Let us emphasize two points: first, the likelihood function
is an integral part of the prior distribution; and second, the information about
how x is related to θ is contained in the functional form of the distribution q(x|θ)
— for example, whether it is a Gaussian or a Cauchy distribution or something
else – and not in the numerical values of the arguments x and θ which are, at
this point, still unknown.
Next data is collected and the observed values turn out to be x'. We must
update to a posterior that lies within the family of distributions p(x, θ) that
reflect the fact that x is now known to be x',

    p(x) = \int d\theta\, p(\theta, x) = \delta(x - x') .    (6.69)

This data information constrains but is not sufficient to determine the joint
distribution

    p(x, \theta) = p(x)\, p(\theta|x) = \delta(x - x')\, p(\theta|x') .    (6.70)

Any choice of p(θ|x') is in principle possible. So far the formulation of the prob-
lem parallels section 2.10 exactly. We are, after all, solving the same problem.
Next we apply the ME method and show that we get the same answer.
According to the ME method the selected joint posterior P(x, θ) is that
which maximizes the entropy,

    S[p; q] = -\int dx\, d\theta\, p(x, \theta) \log\frac{p(x, \theta)}{q(x, \theta)} ,    (6.71)

subject to the appropriate constraints. Note that the information in the data,
eq.(6.69), represents an infinite number of constraints on the family p(x, θ):
for each value of x there is one constraint and one Lagrange multiplier λ(x).
Maximizing S, (6.71), subject to (6.69) and normalization,

    \delta\left[ S + \alpha\left( \int dx\, d\theta\, p(x, \theta) - 1 \right) + \int dx\, \lambda(x) \left( \int d\theta\, p(x, \theta) - \delta(x - x') \right) \right] = 0 ,    (6.72)

leads to the joint posterior

    P(x, \theta) = q(x, \theta)\, \frac{\delta(x - x')}{q(x)} = \delta(x - x')\, q(\theta|x) ,    (6.75)

and the corresponding marginal posterior,

    P(\theta) = \int dx\, P(x, \theta) = q(\theta|x') = q(\theta)\, \frac{q(x'|\theta)}{q(x')} ,    (6.76)
which is recognized as Bayes’ rule. Thus, Bayes’ rule is derivable from and
therefore consistent with the ME method.
To summarize: the prior q(x, θ) = q(x) q(θ|x) is updated to the posterior
P(x, θ) = P(x) P(θ|x) where P(x) = δ(x − x') is fixed by the observed data
while P(θ|x') = q(θ|x') remains unchanged. Note that in accordance with the
philosophy that drives the ME method one only updates those aspects of one’s
beliefs for which corrective new evidence has been supplied.
I conclude with a few simple examples that show how ME allows general-
izations of Bayes’ rule. The general background for these generalized Bayes
problems is the familiar one: We want to make inferences about some variables
θ on the basis of information about other variables x and of a relation between
them.
that are easily handled by ME. Maximizing S, (6.71), subject to (6.77) and
normalization, leads to

    P(x, \theta) = P_D(x)\, q(\theta|x) .    (6.78)

The corresponding marginal posterior is

    P(\theta) = \int dx\, P_D(x)\, q(\theta|x) = q(\theta) \int dx\, P_D(x)\, \frac{q(x|\theta)}{q(x)} .    (6.79)

If instead the information about x takes the form of an expected value constraint
on some function f(x), an analogous calculation leads to

    P(\theta) = q(\theta) \int dx\, q(x|\theta)\, \frac{e^{\lambda f(x)}}{Z} .    (6.84)
These two examples (6.79) and (6.84) are sufficiently intuitive that one could
have written them down directly without deploying the full machinery of the ME
method, but they do serve to illustrate the essential compatibility of Bayesian
and Maximum Entropy methods. Next we consider a slightly less trivial exam-
ple.
The Lagrange multipliers λ(x) are determined from the data constraint, (6.69),

    \frac{e^{\lambda(x)}}{z} = \frac{\delta(x - x')}{Z\, q(x')}   \quad\text{where}\quad   Z(\beta, x') = \int d\theta\, e^{\beta f(\theta)}\, q(\theta|x') ,    (6.88)

so that the joint posterior becomes

    P(x, \theta) = \delta(x - x')\, q(\theta|x')\, \frac{e^{\beta f(\theta)}}{Z} .    (6.89)

The remaining Lagrange multiplier β is determined by imposing that the pos-
terior P(x, θ) satisfy the constraint (6.85). This yields an implicit equation for
β,

    \frac{\partial \log Z}{\partial \beta} = F .    (6.90)

Note that since Z = Z(β, x') the multiplier β will depend on the observed data
x'. Finally, the new marginal distribution for θ is

    P(\theta) = q(\theta|x')\, \frac{e^{\beta f(\theta)}}{Z} = q(\theta)\, \frac{q(x'|\theta)}{q(x')}\, \frac{e^{\beta f(\theta)}}{Z} .    (6.91)

For β = 0 (no moment constraint) we recover Bayes’ rule. For β ≠ 0 Bayes’ rule
is modified by a “canonical” exponential factor yielding an effective likelihood
function.
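The sketch below implements this modified rule on a discrete grid: the posterior is proportional to prior times likelihood times the canonical factor, with the multiplier chosen so that the moment constraint is satisfied. The grid, prior, likelihood, constrained function f and value F are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import brentq

# Posterior proportional to  q(theta) q(x'|theta) exp(beta f(theta)),
# with beta fixed by the requirement <f(theta)> = F.
theta = np.linspace(-3.0, 3.0, 601)
prior = np.exp(-theta**2 / 2)                        # q(theta), unnormalized
like  = np.exp(-(1.2 - theta)**2 / (2 * 0.5**2))     # q(x'|theta) at the observed x'
f, F  = theta**2, 0.8

def posterior(beta):
    w = prior * like * np.exp(beta * f)
    return w / w.sum()

beta = brentq(lambda b: posterior(b) @ f - F, -20.0, 2.0)
P = posterior(beta)
print(beta, P @ f)     # the moment constraint is satisfied

# For beta = 0 the canonical factor drops out and P is the ordinary Bayes posterior.
```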
then,

    P(x, \theta) = P_D(x)\, \frac{q(\theta)\, e^{\lambda(x) f(x, \theta)}}{\int d\theta'\, q(\theta')\, e^{\lambda(x) f(x, \theta')}} ,    (6.96)

so that

    P(\theta|x) = \frac{P(x, \theta)}{P(x)} = \frac{q(\theta)\, e^{\lambda(x) f(x, \theta)}}{Z(x)}   \quad\text{with}\quad   Z(x) = \int d\theta'\, q(\theta')\, e^{\lambda(x) f(x, \theta')} .    (6.97)

The multiplier λ(x) is determined from (6.92),

    \frac{1}{Z(x)} \frac{\partial Z(x)}{\partial \lambda(x)} = F(x) ,    (6.98)

and the new marginal distribution for θ is

    P(\theta) = \int dx\, P_D(x)\, P(\theta|x) = q(\theta) \int dx\, P_D(x)\, \frac{e^{\lambda(x) f(x, \theta)}}{Z(x)} .    (6.99)

In the limit when the data are sharply determined, P_D(x) = δ(x − x'), the
posterior takes the form of Bayes’ theorem,

    P(\theta) = q(\theta)\, \frac{e^{\lambda(x') f(x', \theta)}}{Z(x')} ,    (6.100)

where, up to a normalization factor, e^{\lambda(x') f(x', \theta)} plays the role of the likelihood
and the normalization constant Z plays the role of the evidence.
In conclusion, these examples demonstrate that the method of maximum en-
tropy can fully reproduce the results obtained by the standard Bayesian methods
and allows us to extend them to situations that lie beyond their reach such as
when the likelihood function is not known.
together? The answer depends on the problem at hand. (Here we follow [Giffin
Caticha 2007].)
We refer to constraints as commuting when it makes no difference whether
they are handled simultaneously or sequentially. The most common example is
that of Bayesian updating on the basis of data collected in several independent
experiments. In this case the order in which the observed data x' = {x'_1, x'_2, ...}
is processed does not matter for the purpose of inferring θ. (See section 2.10.3.)
The proof that ME is completely compatible with Bayes’ rule implies that data
constraints implemented through δ functions, as in (6.69), commute. It is useful
to see how this comes about.
When an experiment is repeated it is common to refer to the value of x in
the first experiment and the value of x in the second experiment. This is a dan-
gerous practice because it obscures the fact that we are actually talking about
two separate variables. We do not deal with a single x but with a composite
x = (x_1, x_2) and the relevant space is X_1 × X_2. After the first experiment
yields the value x'_1, represented by the constraint C_1: P(x_1) = δ(x_1 − x'_1), we can
perform a second experiment that yields x'_2 and is represented by a second con-
straint C_2: P(x_2) = δ(x_2 − x'_2). These constraints C_1 and C_2 commute because
they refer to different variables x_1 and x_2. An experiment, once performed and
its outcome observed, cannot be un-performed; its result cannot be un-observed
by a second experiment. Thus, imposing the second constraint does not imply
a revision of the first.
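A small numerical illustration of commuting data constraints (the toy coin model below is an assumption): processing the two observations sequentially, in either order, or jointly gives the same posterior.

```python
import numpy as np

# A coin with unknown bias theta; two independent flips x'_1 and x'_2.
theta = np.array([0.2, 0.5, 0.8])
prior = np.array([1/3, 1/3, 1/3])

def bayes(p, x):
    """One Bayes update for a single flip x in {0, 1}."""
    like = np.where(x == 1, theta, 1 - theta)
    w = p * like
    return w / w.sum()

x1, x2 = 1, 0
seq_12 = bayes(bayes(prior, x1), x2)     # x'_1 first, then x'_2
seq_21 = bayes(bayes(prior, x2), x1)     # x'_2 first, then x'_1
joint  = prior * theta**x1 * (1 - theta)**(1 - x1) * theta**x2 * (1 - theta)**(1 - x2)
joint /= joint.sum()                     # both at once
print(seq_12, seq_21, joint)             # all three coincide
```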
In general constraints need not commute and when this is the case the order
in which they are processed is critical. For example, suppose the prior is q and
we receive information in the form of a constraint, C1 . To update we maximize
the entropy S[p; q] subject to C1 leading to the posterior P1 as shown in Figure
6.2. Next we receive a second piece of information described by the constraint
C2 . At this point we can proceed in two very different ways:
(a) Sequential updating. Having processed C1 , we use P1 as the current prior
and maximize S[p; P1 ] subject to the new constraint C2 . This leads us to the
posterior Pa .
(b) Simultaneous updating. Use the original prior q and maximize S[p; q]
subject to both constraints C1 and C2 simultaneously. This leads to the poste-
rior Pb .
At first sight it might appear that there exists a third possibility of simulta-
neous updating: (c) use P1 as the current prior and maximize S[p; P1 ] subject to
both constraints C1 and C2 simultaneously. Fortunately, and this is a valuable
check for the consistency of the ME method, it is easy to show that case (c)
is equivalent to case (b). Whether we update from q or from P1 the selected
posterior is Pb .
To decide which path (a) or (b) is appropriate we must be clear about how the
ME method handles constraints. The ME machinery interprets a constraint such
as C1 in a very mechanical way: all distributions satisfying C1 are in principle
allowed and all distributions violating C1 are ruled out.
Updating to a posterior P1 consists precisely in revising those aspects of the
prior q that disagree with the new constraint C1 . However, there is nothing final
Figure 6.2: Illustrating the difference between processing two constraints C1 and
C2 sequentially (q → P1 → Pa) and simultaneously (q → Pb or q → P1 → Pb).
about the distribution P1 . It is just the best we can do in our current state of
knowledge and we fully expect that future information may require us to revise
it further. Indeed, when new information C2 is received we must reconsider
whether the original C1 remains valid or not. Are all distributions satisfying
the new C2 really allowed, even those that violate C1 ? If this is the case then
the new C2 takes over and we update from P1 to Pa . The constraint C1 may
still retain some lingering effect on the posterior Pa through P1 , but in general
C1 has now become obsolete.
Alternatively, we may decide that the old constraint C1 retains its validity.
The new C2 is not meant to revise C1 but to provide an additional refinement
of the family of allowed posteriors. If this is the case, then the constraint that
correctly reflects the new information is not C2 but the more restrictive C1 ∧ C2 .
The two constraints should be processed simultaneously to arrive at the correct
posterior Pb .
To summarize: sequential updating is appropriate when old constraints be-
come obsolete and are superseded by new information; simultaneous updating is
appropriate when old constraints remain valid. The two cases refer to different
states of information and therefore we expect that they will result in different
inferences. These comments are meant to underscore the importance of under-
standing what information is and how it is processed by the ME method; failure
to do so will lead to errors that do not reflect a shortcoming of the ME method
but rather a misapplication of it.
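The sketch below contrasts the two paths numerically. The prior, the constraint functions and the values A and B are assumptions; the maxent helper implements the exponential-family solution of the ME variational problem.

```python
import numpy as np
from scipy.optimize import fsolve

# Constraints are expectation values, C1: <a> = A and C2: <b> = B.
q = np.array([0.25, 0.25, 0.25, 0.25])
a = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 0.0, 1.0, 0.0])
A, B = 2.0, 0.6

def maxent(prior, funcs, values):
    """Maximize S[p; prior] subject to <funcs[k]> = values[k]."""
    funcs, values = np.atleast_2d(funcs), np.atleast_1d(values)
    def post(lams):
        w = prior * np.exp(lams @ funcs)
        return w / w.sum()
    lams = fsolve(lambda l: post(l) @ funcs.T - values, np.zeros(len(values)))
    return post(lams)

P1 = maxent(q, a, A)                # process C1 alone
Pa = maxent(P1, b, B)               # (a) sequential: C2 starting from P1
Pb = maxent(q, [a, b], [A, B])      # (b) simultaneous: C1 and C2 from q
print(Pa)
print(Pb)                           # in general Pa and Pb differ
```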
6.8 Conclusion
Any Bayesian account of the notion of information cannot ignore the fact that
Bayesians are concerned with the beliefs of rational agents. The relation be-
tween information and beliefs must be clearly spelled out. The definition we
have proposed — that information is that which constrains rational beliefs and
therefore forces the agent to change its mind — is convenient for two reasons.
First, it makes the information/belief relation very explicit, and second, the definition is
ideally suited for quantitative manipulation using the ME method.
Dealing with uncertainty requires that one solve two problems. First, one
must represent a state of partial knowledge as a consistent web of interconnected
beliefs. The instrument to do it is probability. Second, when new information
becomes available the beliefs must be updated. The instrument for this is rela-
tive entropy. It is the only candidate for an updating method that is of universal
applicability; that recognizes the value of prior information; and that recognizes
the privileged role played by the notion of independence in science. The re-
sulting general method — the ME method — can handle arbitrary priors and
arbitrary constraints; and it includes MaxEnt and Bayes’ rule as special cases.
The design of the ME method is essentially complete. However, the fact
that ME operates by ranking distributions according to preference immediately
raises questions about why should distributions that lie very close to the entropy
maximum be totally ruled out; and if not ruled out completely, to what extent
should they contribute to the inference. Do they make any difference? To what
extent can we even distinguish similar distributions? The discussion of these
matters in the next two chapters will significantly extend the utility of the ME
method as a framework for inference.
Chapter 7
Information Geometry
    d\ell^2 = g_{ab}\, d\theta^a d\theta^b .    (7.1)
The singular importance of the metric tensor g_{ab} derives from a theorem due
to N. Čencov that states that the metric g_{ab} on the manifold of probability
distributions is essentially unique: up to an overall scale factor there is only
one metric that takes into account the fact that these are not distances between
simple structureless dots but between probability distributions [Cencov 1981].
    p(\{m_i\}|\theta) = \frac{N!}{m_1!\, m_2! \cdots m_n!}\, (\theta^1)^{m_1} (\theta^2)^{m_2} \cdots (\theta^n)^{m_n} ,    (7.2)

where \theta = (\theta^1, \theta^2, \ldots, \theta^n),

    N = \sum_{i=1}^n m_i   \quad\text{and}\quad   \sum_{i=1}^n \theta^i = 1 .    (7.3)
    p(x|\mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{a=1}^n (x^a - \mu^a)^2 \right) ,    (7.4)
common and very convenient notational convention in differential geometry. We adopt the
standard Einstein convention of summing over repeated indices whenever one appears as a
superscript and the other as a subscript, that is, g_{ab} f^{ab} = \sum_{ab} g_{ab} f^{ab}. Furthermore, we
shall follow standard practice and indistinctly refer to both the metric tensor g_{ab} and the
quadratic form (7.1) as the ‘metric’.
    p(i|F) = \frac{1}{Z}\, e^{-\lambda_k f_i^k} ,    (7.5)

are derived by maximizing the Shannon entropy S[p] subject to constraints on
the expected values of n functions f_i^k = f^k(x_i) labeled by superscripts k =
1, 2, \ldots, n,

    \langle f^k \rangle = \sum_i p_i f_i^k = F^k .    (7.6)
Vectors as displacements
Perhaps the most primitive notion of a vector is associated to a displacement
in space and is visualized as an arrow; other vectors such as velocities and
accelerations are de…ned in terms of such displacements and from these one can
elaborate further to de…ne forces, force …elds and so on.
This notion of vector as a displacement is useful in flat spaces but it does not
work in a curved space — a bent arrow is not useful. The appropriate gener-
alization follows from the observation that smoothly curved spaces are “locally
flat” — by which one means that within a sufficiently small region deviations
from flatness can be neglected. Therefore one can define the infinitesimal dis-
placement from a point x = (x^1 \ldots x^n) by a vector,
In other words, the vector has different representations in different frames but
the vector itself is independent of the choice of coordinates.
The coordinate independence can be made more explicit by introducing the
notion of a basis. A coordinate frame singles out n special vectors \{\vec{e}_a\} defined
so that the b component of \vec{e}_a is

    e^b_a = \delta^b_a .    (7.12)

More explicitly,

    \vec{e}_1 = (1, 0, \ldots, 0), \quad \vec{e}_2 = (0, 1, 0, \ldots, 0), \quad \ldots, \quad \vec{e}_n = (0, 0, \ldots, 1) .    (7.13)
Any vector \vec{v} can be expressed in terms of the basis vectors,

    \vec{v} = v^a \vec{e}_a .    (7.14)

The basis vectors in the primed frame \{\vec{e}_{a'}\} are defined in the same way,

    e^{b'}_{a'} = \delta^{b'}_{a'} ,    (7.15)

so that using eq.(7.11) we have

    \vec{v} = v^{a'} \vec{e}_{a'} = X^{a'}_b v^b \vec{e}_{a'} = v^b \vec{e}_b ,    (7.16)

where

    \vec{e}_b = X^{a'}_b \vec{e}_{a'}   \quad\text{or, equivalently,}\quad   \vec{e}_{a'} = X^b_{a'} \vec{e}_b .    (7.17)
Eq.(7.16) shows that while the components v^a and the basis vectors \vec{e}_a both
depend on the frame, the vector \vec{v} itself is invariant, and eq.(7.17) shows that
the invariance follows from the fact that components and basis vectors transform
according to inverse matrices. Explicitly, using the chain rule,

    X^{a'}_b X^b_{c'} = \frac{\partial x^{a'}}{\partial x^b} \frac{\partial x^b}{\partial x^{c'}} = \frac{\partial x^{a'}}{\partial x^{c'}} = \delta^{a'}_{c'} .    (7.18)
Remark: Eq.(7.16) is the main reason we care about vectors: they are ob-
jects that are independent of the accidents of the particular choice of coordinate
system. Therefore they are good candidates to represent quantities that carry
physical meaning. Conversely, since we can always change coordinates, it is
commonly thought that discussions that avoid coordinates and employ methods
of analysis that are coordinate-free are somehow deeper or more fundamental.
The introduction of coordinates is often regarded as a blemish that is barely
tolerated because it often facilitates computation. However, when we come to
statistical manifolds the situation is di¤erent. In this case coordinates can be
parameters in probability distributions (such as, for example, temperatures or
chemical potentials) that carry a statistical and physical meaning and therefore
have a significance that goes far beyond the mere geometrical role of labelling
points. Thus, there is much to be gained from recognizing, on one hand, the
geometrical freedom to assign coordinates and, on the other hand, that often
enough special coordinates are singled out because they carry physically rel-
evant information. The case can therefore be made that when dealing with
statistical manifolds coordinate-dependent methods can in many respects be
fundamentally superior.
198 Information Geometry
d @
= va a : (7.21)
d @x
Note further that the partial derivatives @=@xa transform exactly as the basis
vectors, eq.(7.17)
0
@ @xa @ 0 @
= = Xaa ; (7.22)
@x a @xa @xa0 @xa0
so that there is a 1 : 1 correspondence between the directional derivative d=d
and the vector ~v that is tangent to the curve x( ). In fact, we can use one to
represent the other,
d @
~v and ~ea : (7.23)
d @xa
and, since mathematical objects are de…ned purely through their formal rules of
manipulation, it is common practice to ignore the distinction between the two
concepts and set
d @
~v = and ~ea = : (7.24)
d @xa
The partial derivative is indeed appropriate because the “vector” @=@xa is the
derivative along those curves xb ( ) parametrized by the parameter = xa that
are de…ned by keeping the other coordinates constant, xb ( ) = const, for b 6= a.
The vector ~ea has components
@xb
eba = = b
a : (7.25)
@xa
From a physical perspective, however, beyond the rules for formal manipu-
lation mathematical objects are also assigned a meaning, an interpretation, and
it is not clear that the two concepts, the derivative d=d and the tangent vector
~v , should be considered as physically identical. Nevertheless, we can still take
advantage of the isomorphism to calculate using one picture while providing
interpretations using the other.
7.3 Distance and volume in curved spaces 199
Remark: The fact that at any given point one can always change to normal
coordinates such that Pythagoras’ theorem is locally valid is what de…nes a
Riemannian manifold. However, it is important to realize that while one can
do this at any single arbitrary point of our choice, in general one cannot …nd a
coordinate frame in which eq.(7.27) is simultaneously valid at all points within
an extended region.
Changing back to the original frame
^ ^
d`2 = a
^
b dx dx
^^
a
b
= a
^
b Xa
^^
a Xbb dxa dxb : (7.28)
Example: These ideas are also useful in ‡at spaces when dealing with non-
Cartesian coordinates. The distance element of three-dimensional ‡at space in
spherical coordinates (r; ; ) is
d`2 = dr2 + r2 d 2
+ r2 sin2 d 2
; (7.39)
and the corresponding metric tensor is
0 1
1 0 0
(gab ) = @0 r2 0 A : (7.40)
0 0 r2 sin2
The volume element is the familiar expression
dV = g 1=2 drd d = r2 sin drd d : (7.41)
Important example: A uniform distribution over such a curved manifold is
one which assigns equal probabilities to equal volumes. Therefore,
p(x)dn x / g 1=2 (x)dn x : (7.42)
7.4 Derivations of the information metric 201
The expected value of the relative di¤erence, h i, might at …rst sight seem a
good candidate to measure distinguishability, but it does not work because it
vanishes identically,
Z Z
@ log p(xj ) a a @
h i = dx p(xj ) d =d dx p(xj ) = 0: (7.44)
@ a @ a
R
(Depending on the problem the symbol dx will be used to represent either
discrete sums or integrals over one or more dimensions.) However, the variance
does not vanish,
Z
2 2 @ log p(xj ) @ log p(xj ) a b
d` = h i = dx p(xj ) d d : (7.45)
@ a @ b
d`2 = gab d a d b
: (7.47)
Up to now no notion of distance has been introduced. Normally one says that
the reason it is di¢ cult to distinguish two points in, say, the three dimensional
space we seem to inhabit, is that they happen to be too close together. It is very
tempting to invert this intuition and assert that the two points and +d must
be very close together because they are di¢ cult to distinguish. Furthermore,
note that being a variance, d`2 = h 2 i, the quantity d`2 is positive and vanishes
only when the d a vanish. Thus it is natural to interpret gab as the metric tensor
of a Riemannian space. This is the information metric. The realization by C.
R. Rao that gab is a metric in the space of probability distributions [Rao 1945]
gave rise to the subject of information geometry [Amari 1985], namely, the
application of geometrical methods to problems in inference and in information
theory.
Remark: The derivation of (7.46) involved a Taylor expansion, eq.(7.43), to
…rst order in d . One might wonder if keeping higher orders in d would lead
to a “better” metric. The answer is no. What is being done here is de…ning
the metric tensor, not …nding an approximation to it. To illustrate this point
consider the following analogy. The trajectory of a moving particle is given by
xa = xa (t). The position at time t + t is given by the Taylor expansion
1
xa (t + t) = xa (t) + v a (t) t + aa (t) t2 + : : : (7.48)
2
The velocity of the particle is de…ned by the term linear in t. Despite the
Taylor expansion the de…nition of velocity is exact, no approximation is in-
volved; the higher order terms do not constitute an improvement to the notion
of velocity.
Other useful expressions for the information metric are
Z
@p1=2 (xj ) @p1=2 (xj )
gab = 4 dx
@ a @ b
Z 2 1=2
@ p (xj )
= 4 dx p1=2 (xj ) ; (7.49)
@ a@ b
and Z
@ 2 log p(xj ) @ 2 log p
gab = dx p(xj ) = h i: (7.50)
@ a@ b @ a@ b
The coordinates are quite arbitrary; one can freely relabel the points in the
manifold. It is then easy to check that gab are the components of a tensor and
7.4 Derivations of the information metric 203
leads to
a0
a @ a a0 @ @ @
d = d and = (7.52)
@ a0 @ a @ a @ a0
Distances between more distant distributions are merely angles de…ned on the
surface of the unit sphere Sm 1 . To express d`2 in terms of the original coordi-
nates pi substitute
1 dpi
d i= (7.56)
2 p1=2 (i)
to get
1 X (dpi )2 1 ij
d`2 = = gij dpi dpj with gij = : (7.57)
4 i p(i) 4 p(i)
Remark: In (7.56) and in gij above we have gone back to the original notation
p(i) rather than pi to emphasize that the repeated index i is not being summed
over.
Except for an overall constant (7.57) is the same information metric (7.47) we
de…ned earlier. Indeed, consider an n-dimensional subspace (n m 1) of the
simplex Sm 1 de…ned by i = i ( 1 ; : : : ; n ). The parameters a , i = 1 : : : n, can
204 Information Geometry
@ i a@ j
d`2 = ij d
i
d j
d = d
ij
b
@ a @ b
1 X i @ log pi @ log pi a b
= p d d ; (7.58)
4 i @ a @ b
which (except for the factor 1/4) we recognize as the discrete version of (7.46)
and (7.47).
This interesting result does not constitute a “derivation.”There is a priori no
reason why the square root coordinates i should be singled out as special and
attributed a Euclidean metric. But perhaps it helps to lift the veil of mystery
that might otherwise surround the strange expression (7.46).
where d`r is the length of the segment in the radial direction (i.e., normal to the
sphere) and d`t is the length tangent to the sphere. To calculate d`r consider
two neighboring spheres of radii r and r + dr. Di¤erentiating (7.59) gives,
Xm i d i
dr = : (7.61)
i=1 r
If the sphere were embedded in ‡at space dr would itself be the radial distance
between the two spheres. In our curved space it is not; the best we can do is to
claim the actual radial distance is proportional to dr. Therefore
a(r2 ) P i i 2
d`2r = i d (7.62)
r2
where by spherical symmetry the (positive) function a(r2 ) depends only on r2
so that it is constant over the surface of the sphere. To calculate the tangential
length d`t we note that the actual geometry of the sphere is independent of
the space in which it is embedded. If it were embedded in ‡at space then the
tangential length would be
P
(tang. length)2 = i (d i )2 dr2 (7.63)
In our curved space, the actual tangential distance can involve an overall scale
factor, !
2
2 2 P i 2 1P i i
d`t = b(r ) i (d ) d : (7.64)
r2 i
By spherical symmetry the (positive) scale factor b(r2 ) depends only on r. Sub-
stituting into (7.60), we see that the metric of a generic spherically symmetric
space involves two arbitrary positive functions of r2 ,
1 P i i 2 P
d`2 = 2 a(r2 ) b(r2 ) i d + b(r2 ) i (d i )2 : (7.65)
r
Expressed in terms of the original pi coordinates the metric of a spherically
symmetric space takes the form
P i 2 P 1
d`2 = A(jpj) i dp + B(jpj) i i (dpi )2 ; (7.66)
2p
where
1 1
A(jpj) = [a(jpj) b(jpj)] and B(jpj) = b(jpj) (7.67)
4jpj 2
and
def P
jpj = i pi = r2 : (7.68)
Setting P P
i i
jpj = ip = 1 and i dp =0; (7.69)
gives the metric induced on the simplex Sm 1 ,
P 1
d`2 = B(1) i i (dpi )2 : (7.70)
p
Up to an overall scale this result agrees with previous expressions for the infor-
mation metric.
206 Information Geometry
1
number of distinguishable (in N trials)
Statistical length = ` = lim p
N !1 distributions that …t along the curve
N=2
p (7.78)
Since the number of pdistinguishable points grows as N it is convenient to
introduce
p a factor 1= N so that there is a …nite limit as N ! 1. The factor
2 is purely conventional. p
Remark: It is not actually necessary to include the 2=N factor; this leads to
a notion of statistical length `N de…ned on the space of N -trial multinomials.
(See section 7.6.)
More explicitly, let the curve p = p( ) be parametrized by . The separation
between two neighboring distributions that can barely be resolved in N trials
is
1=2
N P 1 dpi 2 2 N P 1 dpi 2
( ) 1 or ( ) : (7.79)
2 i pi d 2 i pi d
The number of distinguishable distributions within the interval d is d = and
the corresponding statistical length, eq.(7.78), is
1=2 1=2
P 1 dpi 2 P (dpi )2
d` = ( ) d = (7.80)
i pi d i pi
Thus, the width of the ‡uctuations is the unit used to de…ne a local measure
of “distance”. To the extent that ‡uctuations are intrinsic to statistical prob-
lems the geometry they induce is unavoidably hard-wired into the statistical
manifolds. The statistical P or distinguishability length di¤ers from a possible
Euclidean distance d`2E = (dpi )2 because the ‡uctuations are not uniform
over the space Sm 1 which a¤ects our ability to resolve neighboring points.
Equation (7.81) agrees the previous de…nitions of the information metric.
Consider the n-dimensional subspace (n m 1) of the simplex Sm 1 de…ned
by pi = pi ( 1 ; : : : ; n ). The distance between two neighboring distributions in
this subspace, p(ij ) and p(ij + d ), is
Pm ( p )2 P
m 1 @p
i i a @pj
d`2 = = a
d d b = gab d a d b
(7.82)
i=1 pi i;j=1 pi @ @ b
208 Information Geometry
where
P
m @ log pi @ log pi
gab = pi ; (7.83)
i=1 @ a @ b
which is the discrete version of (7.46).
Markov embeddings
Consider a discrete variable i = 1; : : : ; n and let the probability of any particular
i be Pr(i) = pi . In practice the limitation to discrete variables is not very
serious because we can choose an n large enough to approximate a continuous
distribution to any desirable degree. However, it is possible to imagine situations
where the continuum limit is tricky — here we avoid such situations.
The set of numbers p = (p1 ; : : : pn ) can be used as coordinates to de…ne
a point on a statistical manifold. In this particular P case the manifold is the
(n 1)-dimensional simplex Sn 1 = fp = (p1 ; : : : pn ) : pi = 1g. The argument
is considerably simpli…ed by considering instead the n-dimensional space of non-
normalized distributions. This is the positive “octant” Rn+ = fp = (p1 ; : : : pn ) :
pi > 0g. The boundary is explicitly avoided so that Rn+ is an open set.
Next we introduce the notion of Markov mappings. The set of values of i
can be grouped or partitioned into M disjoint subsets with 2 M n. Let
A = 1 : : : M label the subsets, then the probability of the Ath subset is
X
Pr(A) = P A = pi : (7.88)
i2A
pi = q A
i
PA : (7.91)
i
This is a sum over A but since qA = 0 unless i 2 A only one term in the sum is
non-vanishing and the map is clearly invertible. These Q maps, called Markov
+
mappings, de…ne an embedding of RM into Rn+ . Markov mappings preserve
normalization, X X X
pi = i
qA PA = PA : (7.92)
i i A
One Markov map Q running in the opposite direction R2+ ! R3+ could be
1 2
Q(P 1 ; P 2 ) = (P 1 ; P 2 ; P 2 ) = (p1 ; p2 ; p3 ) : (7.94)
3 3
i
This particular map is de…ned by setting all qA = 0 except q11 = 1, q32 = 1=3,
3
and q2 = 2=3.
Example: We can use binomial distributions to analyze the act of tossing a
coin (the outcomes are either heads or tails) or, equally well, the act of throwing
a die (provided we only care about outcomes that are either even or odd). This
amounts to embedding the space of coin distributions (which are binomials,
+
RM with M = 2) as a subspace of the space of die distributions (which are
multinomials, Rn+ with n = 6).
To minimize confusion between the two spaces we will use lower case symbols
to refer to the original larger space Rn+ and upper case symbols to refer to the
+
coarse grained embedded space RM .
Having introduced the notion of Markov embeddings we can now state the
i
basic idea behind Campbell’s argument. For a …xed choice of fqA g, that is for
a …xed Markov map Q, the distribution P and its image p = Q(P ) represent
exactly the same information. In other words, whether we talk about heads/tails
outcomes in coins or about even/odd outcomes in dice, binomials are binomials.
Therefore the map Q is invertible. The Markov image Q(SM 1 ) of the simplex
SM 1 in Sn 1 is statistically “identical” to SM 1 ,
Q(SM 1) = SM 1 ; (7.95)
in the sense that it is just as easy or just as di¢ cult to distinguish the two
distributions P and P + dP as it is to distinguish their images p = Q(P ) and
p + dp = Q(P + dP ). Whatever geometrical relations are assigned to distribu-
tions in SM 1 , exactly the same geometrical relations should be assigned to the
corresponding distributions in Q(SM 1 ). Thus Markov mappings are not just
embeddings, they are congruent embeddings; distances between distributions in
+
RM should match the distances between the corresponding images in Rn+ .
Our goal is to …nd the Riemannian metrics that are invariant under Markov
mappings. It is easy to see why imposing such invariance is extremely restrictive:
+
The fact that distances computed in RM must agree with distances computed in
+
subspaces of Rn introduces a constraint on the allowed metric tensors; but we
+
can always embed RM in spaces of larger and larger dimension which imposes
a larger and larger number of constraints. It could very well have happened
that no Riemannian metric managed to survive such restrictive conditions; it is
quite remarkable that some do and it is even more remarkable that (up to an
uninteresting scale factor which amounts to a choice of the unit of distance) the
surviving Riemannian metric is unique.
The invariance of the metric is conveniently expressed as an invariance of
+
the inner product: inner products among vectors in RM should coincide with
the inner products among the corresponding images in Rn+ : Let vectors tangent
7.5 Uniqueness of the information metric 211
+
to RM be denoted by
~ = V A @ = V AE
V ~A ; (7.96)
@P A
~ A g is a coordinate basis. The inner product of two such vectors is
where fE
~ ;U
~ )M = g (M )
A B
(V AB V U (7.97)
@
~v = v i = v i~ei ; (7.99)
@pi
and the inner product of two such vectors is
(n)
(~v ; ~u)n = gij v i uj (7.100)
where
(n) def
gij = (~ei ; ~ej )n : (7.101)
~ tangent to Rm
The images of vectors V +
under Q are obtained from eq.(7.91)
@ @pi @ i @ ~ A = qA
i
Q A
= A i
= qA or Q E ~ei ; (7.102)
@P @P @p @pi
which leads to
~ = ~v
Q V with v i = qA
i
VA : (7.103)
Therefore, the invariance or isometry we want to impose is expressed as
~ ;U
(V ~ )M = (Q V
~ ;Q U
~ )n = (~v ; ~u)n : (7.104)
The Theorem
+
Let ( ; )M be the inner product on RM for any M 2 f2; 3; : : :g. The theorem
states that the metric is invariant under Markov embeddings if and only if
(M ) AB
gAB = (eA ; eB )M = (jP j) + jP j (jP j) ; (7.105)
PA
def P
where jP j = A P A , and and are smooth (C 1 ) functions with > 0 and
+ > 0. The proof is given in the next section.
An important by-product of this theorem is that (7.105) has turned out to be
the metric of a generic spherically symmetric space, eq.(7.66). In other words,
As we shall see in Chapters 10 and 11 this fact will turn out to be important in
the derivation of quantum mechanics.
+
The metric above refers to the positive cone RM but ultimately we are
interested in the metric induced on the simplex SM 1 de…ned by jP j = 1. In
order to …nd the induced metric we …rst show that vectors that are tangent to
the simplex SM 1 are such that
def
X
jV j = VA =0 : (7.106)
A
+
Indeed, consider the derivative of any function f = f (jP j) de…ned on RM along
the direction de…ned by V~,
@f df @jP j df
0=VA A
=VA A
= jV j ; (7.107)
@P djP j @P djP j
Therefore the choice of the function (jP j) is irrelevant and the corresponding
metric on SM 1 is determined up to a multiplicative constant (1) =
AB
gAB = ; (7.110)
PA
which is the information metric that was heuristically suggested earlier, eqs.(7.57)
and (7.58). Indeed, transforming to a generic coordinate frame P A = P A ( 1 ; : : : ; M
)
yields
d`2 = gAB P A P B = gab d a d b (7.111)
with
X @ log P A @ log P A
gab = PA : (7.112)
A @ a @ b
The Proof
The strategy is to consider special cases of Markov embeddings to determine
what kind of constraints they impose on the metric. First we consider the
7.5 Uniqueness of the information metric 213
+
consequences of invariance under the family of Markov maps Q0 that embed RM
0
into itself. In this case n = M and the action of Q is to permute coordinates.
A simple example in which just two coordinates are permuted is
(p1 ; : : : pa ; : : : pb ; : : : pM ) = Q0 (P 1 ; : : : P M )
= (P 1 ; : : : P b ; : : : P a ; : : : P m ) (7.113)
i
The required qA are
a b b a i i
qA = A ; qA = A and qA = A for A 6= a; b ; (7.114)
~ A = q i ~ei , gives
so that eq.(7.102), Q0 E A
~ a = ~eb ;
Q0 E ~ b = ~ea
Q0 E and ~ A = ~eA
Q0 E for A 6= a; b : (7.115)
The invariance
~ A; E
(E ~ B )M = (Q0 E
~ A ; Q0 E
~ B )M (7.116)
yields,
(M ) (M ) (M ) (M )
gaA (P ) = gbA (p) and gbA (P ) = gaA (p) for A 6= a; b (7.117)
(M ) (M ) (M ) (M )
gaa (P ) = gbb (p) and gbb (P ) = gaa (p) (7.118)
(M ) (M )
gAB (P ) = gAB (p) for A; B 6= a; b :
+
These conditions are useful for points along the line through the center of RM ,
1 2 M
P = P = : : : = P . Let Pc = (c=M; : : : ; c=M ) with c = jPc j; we have
pc = Q0 (Pc ) = Pc . Imposing eqs.(7.117) and (7.118) for all choices of the pairs
(a; b) implies
(M )
gAA (Pc ) = FM (c)
(M )
gAB (Pc ) = GM (c) for A 6= B ; (7.119)
~ A = qA 1
Q00 E i
~ei = ~ek(A 1)+1 + : : : + ~ekA (7.122)
k
214 Information Geometry
~ A; E
(E ~ B )M = (Q00 E
~ A ; Q00 E
~ B )kM (7.123)
yields,
kA
X
(M ) 1 (kM )
gAB (P ) = gij (p) : (7.124)
k2
i; j=k(A 1)+1
Along the center lines, Pc = (c=M; : : : ; c=M ) and pc = (c=kM; : : : ; c=kM ), equa-
tions (7.119) and (7.124) give
1 k 1
FM (c) = FkM (c) + GkM (c) (7.125)
k k
and
GM (c) = GkM (c) : (7.126)
But this holds for all values of M and k, therefore GM (c) = (c) where is a
function independent of M . Furthermore, eq.(7.125) can be rewritten as
1 1
[FM (c) (c)] = [FkM (c) (c)] = (c) ; (7.127)
M kM
where (c) is a function independent of the integer M . Indeed, for any two
integers M1 and M2 we have
1 1 1
[FM1 (c) (c)] = [FM1 M2 (c) (c)] = [FM2 (c) (c)] :
M1 M1 M2 M2
(7.128)
Therefore,
FM (c) = (c) + M (c) ; (7.129)
and for points along the center line,
(M )
gAB (Pc ) = (c) + M (c) AB : (7.130)
So far the invariance under the special Markov embeddings Q0 and Q00 has
+
allowed us to …nd the metric of RM for arbitrary M but only along the center
(M ) +
line P = Pc for any c > 0. To …nd the metric gAB (P ) at any arbitrary P 2 RM
+
000
we show that it is possible to cleverly choose the embedding Q : RM ! Rn+
so that the image of P can be brought arbitrarily close to the center line of
Rn+ , Q000 (P ) pc , where the metric is known. Indeed, consider the embeddings
+
Q000 : RM ! Rn+ de…ned by
P1 P1 P2 P2 PM PM
Q000 (P 1 ; : : : P M ) = ( ;::: ; ;::: ;:::; ;::: ): (7.131)
k k k k k k
| 1 {z 1} | 2 {z 2} | M {z M}
k1 tim es k2 tim es km tim es
7.5 Uniqueness of the information metric 215
~ A = qA 1
Q000 E i
~ei = ~ek1 +:::+kA 1 +1
+ : : : + ~ek1 +:::+kA : (7.135)
kA
Using eq.(7.130) the invariance
~ A; E
(E ~ B )M = (Q000 E
~ A ; Q000 E
~ B )n (7.136)
yields, for A = B,
k1 +:::+k
X A
(M ) 1 (n)
gAA (P ) = 2 gij (pc )
(kA ) i; j=k1 +:::+kA 1 +1
k1 +:::+k
X A
1
= 2 [ (c) + n (c) ij ]
(kA ) i; j=k1 +:::+kA 1 +1
1 h i
2
= 2 (kA ) (c) + kA n (c)
(kA )
n c (c)
= (c) + (c) = (c) + ; (7.137)
kA PA
where we used eq.(7.133), P A = ckA =n. Similarly, for A 6= B,
k1 +:::+k
X A k1 +:::+k
X B
(M ) 1 (n)
gAB (P ) = gij (pc ) (7.138)
kA kB
i=k1 +:::+kA 1 +1 j=k1 +:::+kB 1 +1
1
= kA kB (c) = (c) : (7.139)
kA kB
216 Information Geometry
Therefore,
(M ) ~ A; E
~ B iM = (c) + c (c) AB
gAB = hE ; (7.140)
PA
with c = jP j. This almost concludes the proof.
The sign conditions on and follow from the positive-de…niteness of inner
products. Using eq.(7.108),
X VA 2
~ ;V
(V ~ ) = jV j2 + jP j ; (7.141)
PA
A
~ ;V
we see that for vectors with jV j = 0, (V ~ ) 0 implies that > 0, while for
A A
vectors with V = KP , where K is any constant we have
~ ;V
(V ~ ) = K 2 jP j2 ( + ) > 0 ) + >0: (7.142)
~ ;V
Conversely, we show that if these sign conditions are satis…ed then (V ~) 0
for all vectors. Using Cauchy’s inequality,
2
P 2 P 2 P
xi yi kxi yi k ; (7.143)
i i i
Therefore,
X VA 2
~ ;V
(V ~ ) = jV j2 + jP j jV j2 ( + ) 0; (7.145)
PA
A
i
Furthermore, since qA = 0 unless i 2 A,
i i
P qA P qAi
qB AB
= AB = A : (7.148)
i pi i P
A P
~ A; Q E
~ B )n = (jP j) + jP j (jP j) AB
(Q E = (eA ; eB )M (7.149)
PA
which concludes the proof.
N! n1 nm
PN (nj ) = 1 ::: m ; (7.150)
n1 ! : : : n m !
where
P
m P
m
n = (n1 : : : nm ) with ni = N and i = 1; (7.151)
i=1 i=1
The result is
ni nm nj nm
gij = ( )( ) ; (7.153)
i m j m
P1
m N N
d`2 = ( ij + )d i d j : (7.155)
i;j=1 i m
Using
P
m P
m
i = 1 =) d i =0: (7.156)
i=1 i=1
218 Information Geometry
Therefore,
P
m N
d`2 = gij d i d j with gij = ij : (7.158)
i;j=1 i
Canonical distributions
Let z denote the microstates of a system (e.g., points in phase space) and let
m(z) be the underlying measure (e.g., a uniform density on phase space). The
space of macrostates is a statistical manifold: each macrostate is a canonical
distribution (see sections 4.10 and 5.4) obtained by maximizing entropy S[p; m]
subject to n constraints hf a i = F a for a = 1 : : : n, plus normalization,
Z
1 a
a f (z)
a
p(zjF ) = m(z)e where Z( ) = dz m(z)e a f (z) : (7.162)
Z( )
@F a @ 1 @Z 1 @Z @Z 1 @2Z
= ( )= 2 (7.164)
@ b @ b Z@ a Z @ a@ b Z @ a@ b
= F a F b hf a f b i : (7.165)
Therefore
def @F a
C ab = h(f a F a )(f b F b )i = : (7.166)
@ b
Next, using the chain rule
c @ a @ a @F b
a = = ; (7.167)
@ c @F b @ c
we see that the matrix
@ a
Cab = (7.168)
@F b
is the inverse of the covariance matrix,
Cab C bc = c
a ;
@ 2 S(F )
Cab = : (7.169)
@F a @F b
The information metric is
Z
@ log p(zjF ) @ log p(zjF )
gab = dz p(zjF )
@F a @F b
Z
@ c @ d @ log p @ log p
= a b
dz p : (7.170)
@F @F @ c @ d
@ log p(zjF )
= Fc f c (z) (7.171)
@ c
therefore,
gab = Cca Cdb C cd =) gab = Cab ; (7.172)
so that the metric tensor gab is the inverse of the covariance matrix C ab which,
by eq.(7.169), is the Hessian of the entropy.
Instead of F a we could use the Lagrange multipliers a themselves as coor-
dinates. Then the information metric is the covariance matrix,
Z
ab @ log p(zj ) @ log p(zj )
g = dz p(zj ) = C ab : (7.173)
@ a @ b
220 Information Geometry
The uniform distribution over the space of macrostates assigns equal prob-
abilities to equal volumes,
Gaussian distributions
Gaussian distributions are a special case of canonical distributions — they max-
imize entropy subject to constraints on mean values and correlations. Consider
Gaussian distributions in D dimensions,
c1=2 1
p(xj ; C) = exp Cij (xi i
)(xj j
) ; (7.176)
(2 )D=2 2
where 1 i D, Cij is the inverse of the correlation matrix, and c = det Cij .
The mean values i are D parameters, while the symmetric Cij matrix is an
additional 21 D(D+1) parameters. Thus the dimension of the statistical manifold
is D + 12 D(D + 1).
Calculating the information distance between p(xj ; C) and p(xj + d ; C +
dC) is a matter of keeping track of all the indices involved. Skipping all details,
the result is
d`2 = gij d i d j
+ gkij dCij d k
+ g ij kl dCij dCkl ; (7.177)
where
1 ik jl
gij = Cij ; gkij = 0 ; and g ij kl = (C C + C il C jk ) ; (7.178)
4
where C ik is the correlation matrix, that is, C ik Ckj = i
j. Therefore,
1
d`2 = Cij d i d j
+ C ik C jl dCij dCkl : (7.179)
2
To conclude we consider a few interesting special cases. For Gaussians that
di¤er only in their means the information distance between p(xj ; C) and p(xj +
d ; C) is obtained setting dCij = 0, that is,
d`2 = Cij d i d j
; (7.180)
1 1
p(xj ; ) = 2 )D=2
exp 2 ij
(xi i
)(xj j
) : (7.181)
(2 2
The covariance matrix and its inverse are both diagonal and proportional to the
unit matrix,
1
Cij = 2 ij
; C ij = 2 ij
; and c= 2D
: (7.182)
Using
1 2 ij
dCij = d 2 ij
= 3
d (7.183)
1 1 4 ik jl 2 ij 2 kl
d`2 = 2 ij
d i
d j
+ 3
d 3
d (7.184)
2
which, using
ik jl k j k
ij kl = j k = k =D ; (7.185)
simpli…es to
ij 2D
d`2 = 2
d id j
+ 2
(d )2 : (7.186)
1 2
d`2 = 2
(d )2 + 2
(d )2 : (7.187)
There is one last issue that must be addressed before one can claim that the de-
sign of the inference method of maximum entropy (ME) is more or less complete.
The goal is to rank probability distributions in order to select a distribution that,
according to some desirability criteria, is preferred over all others. The ranking
tool is entropy; higher entropy represents higher preference. But there is noth-
ing in our previous arguments to tell us by how much. Suppose the maximum
of the entropy function is not particularly sharp, are we really con…dent that
distributions with entropy close to the maximum are totally ruled out? We want
a quantitative measure of the extent to which distributions with lower entropy
are ruled out. Or, to phrase this question di¤erently: We can rank probability
distributions p relative to a prior q according to the relative entropy S[p; q] but
any monotonic function of the relative entropy will accomplish the same goal.
Does twice the entropy represent twice the preference or four times as much?
Can we quantify ‘preference’? The discussion below follows [Caticha 2000].
where Z
p( )
S[p; g 1=2 ] = d p( ) log (8.2)
g 1=2 ( )
8.2 The ME method 225
and Z
p(xj )
S( ) = dx p(xj ) log : (8.3)
q(x)
The notation shows that S[p; g 1=2 ] is a functional of p( ) while S( )R is a function
of . Maximizing (8.1) with respect to variations p( ) such that d p( ) = 1,
yields Z
p( )
0= d log 1=2 + S( ) + log p( ) ; (8.4)
g ( )
where the required Lagrange multiplier has been written as 1 log . Therefore
the probability that the value of should lie within the small volume g 1=2 ( )dn
is Z
n 1 S( ) 1=2 n
P ( )d = e g ( )d with = dn g 1=2 ( ) eS( ) : (8.5)
Equation (8.5) is the result we seek. It tells us that, as expected, the preferred
value of is the value 0 that maximizes the entropy S( ), eq.(8.3), because
this maximizes the scalar probability density exp S( ). But it also tells us the
degree to which values of away from the maximum are ruled out.
Remark: The density exp S( ) is a scalar function and the presence of the
Jacobian factor g 1=2 ( ) makes eq.(8.5) manifestly invariant under changes of
the coordinates in the space .
totally ruled out, a better update is obtained marginalizing the joint posterior
PJ (x; ) = P ( )p(xj ) over ,
Z Z
n eS( )
P (x) = d P ( )p(xj ) = dn g 1=2 ( ) p(xj ) : (8.6)
In situations where the entropy maximum at 0 is very sharp we recover the old
result,
P (x) p(xj 0 ) : (8.7)
When the entropy maximum is not very sharp eq.(8.6) is the more honest up-
date.
The discussion in section 8.1 is itself an application of the same old ME
method discussed in section 6.2.4, not on the original space X , but on the
enlarged product space X . Thus, adopting the improved posterior (8.6)
does not re‡ect a renunciation of the old ME method — only a re…nement. To
the summary description of the ME method above we can add the single line:
The ME method can be deployed to assess its own limitations and to take
the appropriate corrective measures.
Figure 8.2: The MaxEnt solution for the constraint hii = r for di¤erent values
of r leads to the dotted line. If r is unknown averaging over r should lead to
the distribution at the point marked by .
2 The title for this section is borrowed from Rodriguez’s paper on the two-envelope paradox
[Rodriguez 1988]. Other papers of his on the general subject of ignorance and geometry (see
the bibliography) are highly recommended for the wealth of insights they contain.
230 Entropy IV: Entropic Inference
where p(r) re‡ects our uncertainty about r. It may, for example, make sense to
pick a uniform distribution over r but the precise choice is not important for our
purposes. The point is that since the MaxEnt dotted curve is concave the point
necessarily lies below C so that 2 < 1=3. And we have a paradox: we started
admitting complete ignorance and through a process that claims to express full
ignorance at every step we reach the conclusion that the die is biased against
i = 2. Where is the mistake?
The …rst clue is symmetry: We started with a situation that treats the
outcomes i = 1; 2; 3 symmetrically and end up with a distribution that is biased
against i = 2. The symmetry must have been broken somewhere and it is clear
that this happened at the moment the constraint on hii = r was imposed —
this is shown as vertical lines on the simplex. Had we chosen to express our
ignorance not in terms of the unknown value of hii = r but in terms of some
other function hf (i)i = s then we could have easily broken the symmetry in
some other direction. For example, let f (i) be a cyclic permutation of i,
then repeating the analysis above would lead us to conclude that 1 < 1=3,
which represents a die biased against i = 1. Thus, the question becomes: What
leads to choose a constraint on hii rather than a constraint on hf i when we are
equally ignorant about both?
The discussion in section 4.11 is relevant here. There we identi…ed four
epistemically di¤erent situations:
(A) The ideal case: We know that hf i = F and we know that it captures all
the information that happens to be relevant to the problem at hand.
(B) The important case: We know that hf i captures all the information that
happens to be relevant to the problem at hand but its actual numerical
value F is not known.
(C) The predictive case: There is nothing special about the function f
except that we happen to know its expected value, hf i = F . In particular,
we do not know whether information about hf i is complete or whether it
is at all relevant to the problem at hand.
(D) The extreme ignorance case: We know neither that hf i captures rele-
vant information nor its numerical value F . There is nothing that singles
out one function f over any other.
The paradox with the three-sided die arises because two epistemically di¤erent
situations, case B and case D have been confused. On one hand, the unknown
8.3 Avoiding pitfalls –III 231
e "i X
"i
p(ij ) = where Z( ) = e : (8.14)
Z( ) i
Since the inverse temperature = (E) is itself unknown we must average over
, Z
pt (i) = d p( )p(ij ) : (8.15)
To the extent that both distributions re‡ect complete ignorance we must have
p( ) = ( ) or =0: (8.16)
the precise value = 0; we have concluded that the system is in…nitely hot —
ignorance is hell.
The paradox is dissolved once we realize that, just as with the die problem,
we have confused two epistemically di¤erent situations — types D and B above:
Knowing nothing about a system is not the same as knowing that it is in thermal
equilibrium at a temperature that happens not be unknown.
It may be worthwhile to rephrase this important point in di¤erent words. If
I is the space of microstates and is some unknown arbitrary quantity in some
space B the rules of probability theory allow us to write
Z
p(i) = d p(i; ) where p(i; ) = p( )p(ij ) : (8.17)
Topics in Statistical
Mechanics*
1 kf
k
(z)
R kf
k
(z)
p(zjF ) = m(z) e with Z( ) = dz m(z) e ; (9.1)
Z( )
1
P (F )dF = eST (F ) g 1=2 (F )dF; (9.5)
@ 2 ST (F_ ; F ) @2 h i
gij = = S(F_ ; F ) + S 0 (FT F_ ; FT F) (9.8)
@ F_ i @ F_ j @ F_ i @ F_ j
where the dot indicates that the derivatives act on the …rst argument. The …rst
term on the right is
Z
@ 2 S(F_ ; F ) @2 p(zjF_ ) m(z)
= dz p(zjF_ ) log
@ F_ i @ F_ j @ F_ i @ F_ j m(z) p(zjF )
2 Z 2
@ S(F ) @ p(zjF ) p(zjF )
= + dz log : (9.9)
@F i @F j @F i @F j m(z)
To calculate the integral on the right use eq.(9.1) written in the form
p(zjF ) k
log = log Z( ) kf (z) ; (9.10)
m(z)
so that the integral vanishes,
Z Z
@2 @2
log Z( ) i j dz p(zjF ) k dz p(zjF )f k (z) = 0 : (9.11)
@F @F @F i @F j
Similarly,
@2 @ 2 S 0 (FT F )
S 0 (FT F_ ; FT F) = (9.12)
@ F_ i @ F_ j @F i @F j
Z
@ 2 p(z 0 jFT F ) p(z 0 jFT F )
+ dz 0 log
@F i @F j m0 (z 0 )
@ 2 S(F )
gij = : (9.13)
@F i @F j
We conclude that the probability that the value of F ‡uctuates into a small
volume g 1=2 (F )dF becomes
1 k
p(F )dF = eS(F ) 0k F
g 1=2 (F )dF : (9.14)
Jacobian can be considered constant, and one obtains the usual results [Landau
1977], namely, that the probability distribution for the ‡uctuations is given by
the exponential of a Legendre transform of the entropy.
The remaining di¢ culties are purely computational and of the kind that
can in general be tackled systematically using the method of steepest descent
to evaluate the appropriate generating function. Since we are not interested
in variables referring to the bath we can integrate Eq.(9.4) over z 0 , and use
the distribution P (z; F ) = p(F )p(zjF ) to compute various moments. As an
example, the correlation between i = i h i i and f j = f j f j or
j j j
F =F F is
@ h ii
i fj = i Fj = +( 0i h i i) F0j Fj : (9.15)
@ 0j
When the di¤erences 0i h i i or F0j F j are negligible one obtains the usual
expression,
j
i fj i : (9.16)
A Prelude to Dynamics:
Kinematics
In this and the following chapters our main concern will be to deploy the con-
cepts of probability, entropy, and information geometry to formulate quantum
mechanics (QM) as a dynamical model that describes the evolution of probabil-
ities in time. The fact that the dynamical variables are probability distributions
turns out to be highly signi…cant because all changes of probabilities — includ-
ing their time evolution — must be compatible with the basic principles for
updating probabilities. In other words, the kinds of dynamics we seek are those
driven by the maximization of an entropy subject to constraints that carry the
information that is relevant to the particular system at hand.
The goal of entropic dynamics is to generate a trajectory in a space of
probability distributions. As we saw in Chapter 7, these spaces are statistical
manifolds and have an intrinsic metric structure given by the information metric.
Furthermore, our interest in trajectories naturally leads us to consider both their
tangent vectors and their dual covectors because it is these objects that will be
used to represent velocities and momenta respectively. It turns out that, just as
the statistical manifold has a natural metric structure, the statistical manifold
plus all the spaces of tangent covectors is itself a manifold — the cotangent
bundle — that can be endowed with a natural structure called symplectic.1
Our goal will be to formulate an entropic dynamics that naturally re‡ects these
metric and symplectic structures.
But not every curve is a trajectory and not every parameter that labels
points along a curve is time. In order to develop a true dynamics we will have
to construct a concept of time — a problem that will be addressed in the next
chapter. This chapter is devoted to kinematics. As a prelude to a true dynamics
1 The term symplectic was invented by Weyl in 1946. The old name for the family of
groups of transformations that preserve certain antisymmetric bilinear forms had been complex
groups. Since the term was already in use for complex variables, Weyl thought this was
unnecessarily confusing. So he invented a new term by literally translating ‘complex’from its
Latin roots com-plexus, which means “together-braided,” to its Greek roots - " o&.
238 A Prelude to Dynamics: Kinematics
we shall develop some of the tools needed to study families of curves that are
closely associated with the symplectic and metric structures.2
To simplify the discussion in this chapter we shall consider the special case
of a statistical manifold of …nite dimension. Back in Chapter 7 we studied
the information geometry of the manifold associated with a parametric family
of probability distributions. The uncertain variable x can be either discrete
or continuous and the distributions (x) = (xj ) are labeled by parameters
i
(i = 1 : : : n) which will be used as coordinates on the manifold.3 First,
to introduce the main ideas, we shall consider the simpler example in which
the i are generic parameters of no particular signi…cance. Then, we shall
address the example of a simplex — a statistical manifold for which the uncertain
variables are discrete, x = i = 1 : : : n, and the probabilities themselves are used
as coordinates, i = (i). The result is a formalism in which the linearity of the
evolution equations, the emergence of a complex structure, Hilbert spaces, and
a Born rule, are derived rather than postulated.
expands on previous work on the geometric and symplectic structure of quantum mechanics
[Kibble 1979; Heslot 1985; Anandan and Aharonov 1990; Cirelli et al. 1990; Abe 1992;
Hughston 1995; Ashekar and Schilling 1998; de Gosson, Hiley 2011; Elze 2012; Reginatto and
Hall 2011, 2012].
3 We will continue to adopt the standard notation of using upper indices to label coordinates
and components of vectors (e.g. i and V ~ = V i~ei ) and lower indices to denote components
of covectors (e.g. @F=@ i = @i F ). We also adopt the Einstein summation convention: a sum
over an index is understood whenever it appears repeated as an upper and a lower index.
10.1 Gradients and covectors 239
so that
d
=V : (10.4)
d
The vectors
@
= @i
ei = (10.5)
@ i
constitute the “coordinate” basis — a basis of vectors that is adapted to the
coordinate grid in the sense that the vectors fei g are tangent to the grid lines.
More explicitly, the vector ei is tangent to the coordinate curve de…ned by
holding constant all j s with j 6= i, and using i as the parameter along the
curve.
The second way to think about the directional derivative df =d is to write
def d
rf [V ] = (@i f )V i =
f (10.6)
d
and interpret df =d as the scalar that results from the action of the linear
functional rf on the vector V . Indeed, using linearity the action of rf on the
vector V is
rf [V ] = V i rf [ei ] so that rf [ei ] = @i f : (10.7)
j
When the function f is one of the coordinates, f ( ) = , we obtain
j
@ j
r j [ei ] = i
= i : (10.8)
@
Furthermore, using the chain rule
@f
rf ( ) = r i = @i f r i ; (10.9)
@ i
we see that f@i f g are the components of the covector rf , and that fr i g
constitute a covector basis which is dual or reciprocal to the vector basis fei g.
The transformation of vector components and of basis vectors under a change
of coordinates (see eqs.(7.11) and (7.17)) is such that the vectors V = V i ei are
invariant. Generic covectors ! = !i r i can also be de…ned as invariant objects
whose components !i transform as @i f . Using the chain rule the transformation
0
to primed coordinates, i ! i , is
j j
@f @ @f @
i 0 = i0
or @i0 f = i0
@j f : (10.10)
@ @ @ j @
Using
0
i0 @ i @ j
r = r j and ! i0 = i0
!j (10.11)
@ j @
we can check that ! is indeed invariant,
i i0
! = !i r = ! i0 r : (10.12)
Alternatively, we can de…ne generic covectors as linear functionals of vectors.
Indeed, using linearity and (10.8) the action of ! on the vector V ,
![V ] = !i r i [V ] = !i V j r i [ej ] = !i V i ; (10.13)
is invariant.
240 A Prelude to Dynamics: Kinematics
if = ( 0) then V ( ) = ( 0 + )= : (10.14)
We shall assume that the map V is su¢ ciently smooth and invertible.
To de…ne the Lie derivative of a scalar function f ( ) along (the congruence
de…ned by) the …eld V ( ) we …rst introduce the notion of Lie-dragging. Given
the function f de…ne a new function f called the pull-back of f under the action
of the map V ,
f ( ) = f( ) : (10.15)
Then the Lie derivative of f along V is de…ned by
def 1
$V f = lim [f ( ) f ( )] : (10.16)
!0
The important point here is that both functions f and f are evaluated at the
same point . The idea is that when Lie-dragging is applied to a vector …eld,
its Lie derivative would involve subtracting vectors located at the same tangent
space. As ! 0,
i
i d
f ( ) = f( ) = f( + )
d
i
@f d df
= f( ) + = f( ) + (10.17)
@ i d d
10.2 Lie derivatives 241
so that
df
$V f = = V [f ] : (10.18)
d
This result is not particularly surprising: the Lie derivative of f along V is
just the derivative of f along V . It gets more interesting when we apply the
same idea to the Lie derivative of a vector …eld U along the congruence de…ned
by V . We note, in particular, that the Lie derivative of a scalar function is a
scalar function too or, in other words, the Lie derivative is a scalar di¤erential
operator and, therefore, its action on a vector U yields a vector, $V U , that can
itself act on functions, $V U [f ], to yield other scalar functions.
The de…nition of the Lie derivative can be extended to vectors and tensors
by imposing the natural additional requirement that the Lie derivative be a
derivative, that is, it must obey a Leibniz rule. For example, the Lie derivative
$V U of a vector U is de…ned so that for any scalar function f the Lie derivatives
satisfy
def
$V U [f ] = $V U [f ] + U [$V f ] ; (10.19)
while the derivative $V T of a generic tensor T acting on a collection a; b; ::: of
vectors or covectors satis…es
def
$V [T (a; b; :::)] = [$V T ] (a; b; :::) + T ($V a; b; :::) + T (a; $V b; :::) + ::: (10.20)
Next we compute the Lie derivatives of vectors, covectors, and tensors in terms
of their components.
$V U [f ] = V U [f ] : (10.22)
$V U [f ] = $V U [f ] + U V [f ] : (10.23)
Therefore,
def
$V U = V U U V = [V ; U ] : (10.24)
where we introduced the Lie bracket notation on the right. Since the Lie bracket
is antisymmetric, so is the Lie derivative,
$V U = $U V : (10.25)
242 A Prelude to Dynamics: Kinematics
To evaluate the Lie derivative in terms of components one calculates the deriv-
atives in (10.24),
V U [f ] = V i @i U j @j f = V i @i U j @j f + V i U j @i @j f ; (10.26)
i j i j i j
U V [f ] = U @i V @j f = U @i V @j f + U V @i @j f : (10.27)
The result is
$V U = [V ; U ] = V i @i U j U i @i V j @j ; (10.28)
($V U )j = [V ; U ]j = V i @i U j U i @i V j : (10.29)
Side remark: Equation (10.29) shows an important di¤erence between the Lie
derivative $V U and the covariant derivative rV U . The latter depends on V ( )
only at the point . Indeed, if f ( ) is some scalar function, then rf V U = f rV U .
In contrast, $V U also depends on the derivatives of V ( ) at .
Next we consider a generic vector …eld U and use the fact that !(U ) = !i U i is
a scalar function. The Lie derivative of the right hand side is
Since $V obeys the Leibniz rule, the Lie derivative of the left hand side can also
be written as
and, using the Leibniz rule on the left hand side, we get
Setting (10.36) equal to (10.37) and using (10.29) leads to the desired expression,
Notice that in the derivation above we have not used the symmetry or any other
properties of G beyond the fact that G( ; ) is a tensor so that G(A; B) is a scalar
function. This means that the expression (10.38) gives the Lie derivative of any
covariant tensor with components Tij ,
a particular model. One may consider models where the positions of particles are assumed
ontic. In other models one might assume that it is the …eld variables that are ontic, while the
particles are quantum excitations of the …elds. It is even possible to conceive of hybrid models
in which some ontic variables represent the particles we call “matter”(e.g., the fermions) while
some other ontic variables represent the gauge …elds we call “forces” (e.g., the electromagnetic
…eld).
244 A Prelude to Dynamics: Kinematics
Given any manifold such as P we can construct two other manifolds that
turn out to be useful. One of these manifolds is denoted by T P and is called
the tangent bundle. The idea is the following. Consider all the curves passing
through a point = ( 1 : : : n ). The set of all the vectors that are tangent to
those curves is a vector space called the tangent space at and is denoted T P .
The space T P composed of P plus all its tangent spaces T P turns out to be a
manifold of a special type generically called a …ber bundle; P is called the base
manifold and T P is called the …ber at the point . Thus, T P is appropriately
called the tangent bundle.
We can also consider the space of all covectors at a point . Such a space is
denoted T P and is called the cotangent space at . The second special man-
ifold we can construct is the …ber bundle composed of P plus all its cotangent
spaces T P . This …ber bundle is denoted by T P and is called the cotangent
bundle.
The reason we care about vectors and covectors is that these are the objects
that will eventually be used to represent velocities and momenta. Indeed, if P
is the e-con…guration space, its associated cotangent bundle T P, which we will
call the e-phase space, will play a central role. But that is for later; for now all
we need is that the tangent and cotangent bundles are geometric objects that
are always available to us independently of any physical considerations.
d d i @ d i @
V = = + ; (10.41)
d d @ i d @ i
where @=@ i and @=@ i are the basis vectors and the index i = 1 : : : n is summed
over. The directional derivative of a function F (X) along the curve X( ) is
i
dF @F d @F d i ~ [V ] ;
= i + = rF (10.42)
d @ d @ i d
where i is a composite index. The …rst index (chosen from the beginning
of the Greek alphabet) takes two values, = 1; 2. It is used to keep track of
whether i is an upper i index ( = 1) or a lower i index ( = 2).5 Then
eqs.(10.41) and (10.43) are written as
i
d i @ i dX d i =d
V = =V ; with V = = ; (10.45)
d @X i d d i =d
and
@F ~
~ =
rFrX i : (10.46)
@X i
The repeated indices indicate a double summation over and i. The action of
~ [ ] on a vector V is de…ned by the action of the basis
the linear functional rF
~
covectors rX i
on the basis vectors, @=@X j = @ j ,
i
~ i @X i
rX [@ j ] = j
= j : (10.47)
@X
Using linearity we …nd
~ [V ] = @F V
rF i
=
dF
: (10.48)
@X i d
called the symplectic form, and the manifold acquires a certain ‡oppy structure
that is somewhat less rigid than that provided by a metric. This structure
is described as the symplectic geometry of the manifold [Arnold 1997][Souriau
1997][Schutz 1980]. As was the case for the metric tensor, the symplectic form
also induces a map from vectors to covectors and the group of transformations
that preserve it is particularly important. It is called the symplectic group
which in Hamiltonian mechanics has long been known as the group of canonical
transformations.
Once local coordinates ( i ; i ) on T P have been established there is a nat-
ural choice of symplectic form
~ i[ ]
[; ]= r ~ i[ ]
r ~ i[ ]
r ~ i[ ] :
r (10.49)
(V ; U ) = V 1i U 2i V 2i U 1i = i; j V
i
U j
; (10.51)
0 1
i; j = ij : (10.52)
1 0
The form is non-degenerate, that is, for every vector V there exists some
vector U such that [V ; U ] 6= 0.
An aside: In the language of exterior calculus the symplectic form can be
derived by …rst introducing the Poincare 1-form
!= ~
id
i
; (10.53)
~ is the exterior derivative on T P and the corresponding symplectic
where d
2-form is
= d!~ = d~ i^d
~ i: (10.54)
By construction is locally exact ( = d!) ~ and closed (d ~ = 0).
Remark: A generic 2n-dimensional symplectic manifold is a manifold with a
di¤erential two-form !( ; ) that is closed (d!~ = 0) and non-degenerate (that is,
if the vector …eld V is nowhere vanishing, then the one-form !(V ; ) is nowhere
vanishing too). The Darboux theorem [Guillemin Sternberg 1984] states that
one can choose coordinates (q i ; pi ) so that at any point the two-form ! can be
written as
! = dq ~ i ^ dp
~ i: (10.55)
In general this can only be done locally; there is no choice of coordinates that
will accomplish this diagonalization globally. The point of this remark is to
emphasize that our construction of in (10.49) follows a very di¤erent logic: we
10.4 Hamiltonian ‡ows 247
k; i V
k
= @ i V~ or ~ V~ ( ) :
(V ; ) = r (10.59)
In the opposite direction we can easily check that (10.59) implies $V = 0.
Indeed,
($V ) i; j = @ i (@ j V~ ) @ j (@ i V~ ) = 0 : (10.60)
Using (10.52), eq.(10.59) is more explicitly written as
d i~ d i~ i @ V~ ~ i @ V~ ~
r i r = r + r i ; (10.61)
d d @ i @ i
or
d i
@ V~ d i @ V~
= and = ; (10.62)
d @ i d @ i
which we recognize as Hamilton’s equations for a Hamiltonian function V~ . This
justi…es calling V the Hamiltonian vector …eld associated to the Hamiltonian
function V~ . This is how Hamiltonians enter physics — as a way to generate
vector …elds that preserve .
From (10.51), the action of the symplectic form on two Hamiltonian vector
…elds V = d=d and U = d=d generated respectively by V~ and U ~ is
d id i d id i
(V ; U ) = ; (10.63)
d d d d
248 A Prelude to Dynamics: Kinematics
@ V~ @ U
~ @ V~ @ U
~ def
(V ; U ) = i
= fV~ ; U
~g ; (10.64)
@ @ i @ i@ i
where, on the right hand side, we have introduced the Poisson bracket notation.
It is easy to check that the derivative of an arbitrary function F (X) along the
congruence de…ned by the vector …eld V = d=d , which is given by (10.42) or
(10.48),
dF @F dX i @F d i @F d i
= i
= i
+ ; (10.65)
d @X d @ d @ i d
can be expressed in terms of Poisson brackets,
dF
= fF; V~ g : (10.66)
d
These results are summarized as follows:
(1) The ‡ows that preserve the symplectic structure, $V = 0, are generated by
Hamiltonian vector …elds V associated to Hamiltonian functions V~ , eq.(10.62),
i
dX
V i
= = fX i ; V~ g : (10.67)
d
(2) The action of on two Hamiltonian vector …elds is the Poisson bracket of
the associated Hamiltonian functions,
(V ; U ) = i; j V
i
U j
= fV~ ; U
~g : (10.68)
d i ~
@H d i
~
@H
= and = ; (10.69)
d @ i d @ i
and the evolution of any function f (X) given by the Hamiltonian vector H(X)
is
df ~ @
@H ~ @
@H
~
= H(f ) = ff; Hg with H= : (10.70)
d @ i@ i @ i@ i
`2 = gij i j
; (10.71)
where
X @ log (xj ) @ log (xj )
gij ( ) = (xj ) : (10.72)
x @ i @ j
Since the only available tensor is gij the length element of T P,
`~2 = G i; j X i
X j
= G1i;1j i j
+2G1i;2j i
j +G2i;2j i j ; (10.73)
250 A Prelude to Dynamics: Kinematics
`~2 = gij i j
+ gij i
j + g ij i j ; (10.74)
with new constants 0 and 0 and we can either keep or drop as convenience
or convention dictates. In contrast, the values of 0 and 0 are signi…cant; they
are not a matter of convention. For future convenience we shall write 0 = 1=h2
in terms of a new constant h, and choose the irrelevant as = h.6 This allows
us to absorb h into gij and write
X @ log (xj ) @ log (xj )
gij ( ) = h (xj ) : (10.76)
x @ i @ j
To …x the value of 0 we impose an additional requirement that is motivated
by its eventual relevance to physics. Consider a curve [ ( ); ( )] on T P and
its ‡ow-reversed curve — or -reversed curve — is given by
0 0
( )! ( )= ( ) and ( )! ( )= ( ): (10.77)
When projected to P the ‡ow-reversed curve coincides with the original curve,
but it is now traversed in the opposite direction. We shall require that the
~ j remains invariant under ‡ow-reversal. Since under ‡ow-reversal
speed jd`=d
the mixed terms in (10.74) change sign, it follows that invariance implies
that 0 = 0.
The net result is that the line element, which has been designed to be fully
determined by information geometry, takes a particularly simple form,
`~2 = gij i j
+ g ij i j : (10.78)
that are conjugate to the probability densities of the positions of particles. There we shall
rewrite 0 as 0 = 1=~2 , and the (irrelevant) constant as = ~. In conventional units the
value of ~ is …xed by experiment but one can always choose units so that ~ = 1.
10.5 The information geometry of e-phase space 251
Then eq.(10.79) is
1
1 0 g
J= G = : (10.81)
g 0
We can immediately check that
i k i
JJ = 1 or J kJ j = j ; (10.82)
which shows that J is a square root of the negative unit matrix. This fact is
expressed by saying that J endows T P with a complex structure.
Furthermore, we can check that the action of J on any two vectors U and
V is an isometry, that is
(J U )T G(J V ) = U T J T GJ V (10.84)
(the superscript T stands for transpose). But, from (10.80) and (10.81), we have
J T GJ = G : (10.85)
Therefore
(J U )T G(J V ) = U T GV ; (10.86)
which is (10.83).
To summarize, in addition to the symplectic and metric G structures the
cotangent bundle T P is also endowed with a complex structure J. Such highly
structured spaces are generically known as Kähler manifolds. Here we deal
with a curved Kähler manifold that is special in that it inherits its metric from
information geometry.
252 A Prelude to Dynamics: Kinematics
def P
n
i ~ def
j j = and N = 1 j j : (10.88)
i=1
The Hamiltonians H~ that are relevant to quantum mechanics are such that the
initial condition
N~ =0 (10.89)
is preserved by the ‡ow. However, as we shall see in the next chapter, the actual
quantum Hamiltonians will also preserve the constraint N ~ = const even when
the constant does not vanish.7 Therefore, we have
~ i
~ = fN
@ N ~ = 0 or P @ H = P d
~ ; Hg =0: (10.90)
i i
@ i d
7 As we shall see in the next chapter the quantum evolution of probabilities (x) takes
the form of a local conservation equation, eq.(11.49). This means that the Hamiltonian will
~ = const whether the constant vanishes or not.
preserve the constraint N
10.6 Quantum kinematics: symplectic and metric structures 253
Since the probabilities i must remain positive we shall further require that
d i =d 0 at the border of the simplex where i = 0.
In addition to the ‡ow generated by H~ we can also consider the ‡ow gener-
~
ated by N and parametrized by . From eq.(10.62) the corresponding Hamil-
tonian vector …eld N is given by
i
i @ i dX ~g ;
N =N i
with N = = fX i ; N (10.91)
@X d
or, more explicitly,
i P @
d d i
N 1i = =0; N 2i = =1; or N = : (10.92)
d d i @ i
~ is conserved
A Global Gauge Symmetry — We can also see that if N
~
along H, then H is conserved along N ,
dH~
~ N
= fH; ~g = 0 ; (10.94)
d
which implies that the conserved quantity N ~ is the generator of a symmetry
transformation.
The phase space of interest is the 2(n 1)-dimensional T S but the de-
scription is simpli…ed by using the n unnormalized coordinates of the larger
embedding space T S + . The introduction of one super‡uous coordinate forces
us to also introduce one super‡uous momentum. We eliminate the extra co-
ordinate by imposing the constraint N ~ = 0. We eliminate the extra momentum
by declaring it unphysical: the shifted point ( 0 ; 0 ) = ( ; + ) is declared to
be equivalent to ( ; ), which we describe by saying that ( ; ) and ( ; + ) lie
on the same “ray”. This equivalence is described as a global “gauge”symmetry
which, as we shall later see, is the reason why quantum mechanical states are
represented by rays rather than vectors in a Hilbert space.
B
`2 = gij i j
with gij = A ni nj + ij ; (10.95)
2 i
254 A Prelude to Dynamics: Kinematics
where n is a covector with components ni = 1 for all i =P1 : : : n,8 and A = A(j j)
i
and B = B(j j) are smooth scalar functions of j j = . These expressions
can be simpli…ed by a suitable change of coordinates.
The important term here is ‘suitable’. The point is that coordinates are often
chosen because they receive a particularly useful interpretation; the coordinate
might be a temperature, an angle or, in our case, a probability. In such cases the
advantages of the freedom to change coordinates might be severely outweighed
by the loss of a clear physical interpretation. Thus, we seek a change of coor-
dinates i ! 0i that will preserve the interpretation of as unnormalized or
relative probabilities. Such transformations are of the form,
i i
= ( 0 ) = (j 0 j) 0i
; (10.96)
1
0
`2 = gij 0i 0j
with 0
gij = A0 ni nj + B ij ; (10.99)
2 i
2 B _2 0
A0 = A ( _ j 0 j + ) + j j+B_ : (10.100)
2
We can now take advantage of the freedom to choose : set (j j) = ~=B(j j)
where ~ is a constant.9 Dropping the primes, the length element is
~
`2 = gij i j
with gij = A(j j) ni nj + ij ; (10.101)
2 i
or,
P ~
`2 = A(j j) j j2 + i
( i 2
) : (10.102)
i 2
8 The only reason to introduce the peculiar covector n is to maintain the Einstein convention
9 The constant ~ plays the same role as h in (10.76). Of course, ~ will eventually be
identi…ed with Planck’s constant divided by 2 , and one could choose units so that ~ = 1.
10.6 Quantum kinematics: symplectic and metric structures 255
2 i 2A
g ij = ij
+C i j
where C(j j) = : (10.103)
~ ~Aj j + ~2 =2
We are now ready to write down the metric for T S + . We follow the same
argument that led to eq.(10.78) and impose invariance under ‡ow reversal. Since
the only tensors at our disposal are gij and g ij , the length element of T S + must
be of the form,
`~2 = G i; j X i
X j
= gij i j
+ g ij i j : (10.104)
Therefore, substituting (10.95) and (10.103), `~2 can be more explicitly written
as
2 2
P
n P
n P
n ~ 2 i
`~2 = A i +C i i + 2
i + 2
i ; (10.105)
i=1 i=1 i=1 2 i ~
or
P
n ~ 2 i
`~2 = Aj j2 + Cj j2 h i2 + 2
i + 2
i : (10.106)
i=1 2 i ~
From (10.104), writing the indices as a 2 2 matrix, the metric tensors are
1
g 0 1 g 0
G= 1 ; G = : (10.107)
0 g 0 g
As before, the tensor G and its inverse G 1 can be used to lower and raise
indices. Using G 1 to raise the …rst index of the symplectic form i; j as we
did in eq.(10.79), we see that eqs.(10.80) and (10.81),
1
0 1 1 0 g
= and J= G = ; (10.108)
1 0 g 0
`~2 ( ) = gij i j
+ g ij ( i + ni )( j + nj ) ; (10.109)
256 A Prelude to Dynamics: Kinematics
(1904) and E. Study (1905) in their studies of shortest paths on complex projective spaces.
The latter include the projective Hilbert spaces used in quantum mechanics.
1 1 Later we shall explicitly show that choosing A 6= 0 has no e¤ect on the Hamiltonian ‡ows
and the tensor J, eq.(10.108), which de…nes the complex structure, becomes
2 i i
i i; k 0 ~ j
J j = G k; j or [J i j ] = ~ i : (10.115)
2 i j 0
1=2 i j =~ 1=2 i j =~
j = j e and i~ j = i~ j e ; (10.116)
where the index = 1; 2 takes two values (with ; ; : : : chosen from the middle
of the Greek alphabet).
Since changing the phase j ! j + 2 ~ in (10.116) yields the same point
we see that the cotangent space T S is a ‡at n-dimensional “hypercube”
(its edges have coordinate length 2 ~) with the opposite faces identi…ed, some-
thing like periodic boundary conditions.12 Thus, the new T S is still locally
isomorphic to the old Rn , which makes it a legitimate choice of cotangent space.
Remark: The choice of cotangent space is central to the derivation of quantum
mechanics and some additional justi…cation might be desirable. Here we take
the easy way out and argue that identifying the relevant e-phase space in which
the quantum dynamics is played out — see chapter 11 — represents signi…cant
progress even when its physical origin remains unexplained. Nevertheless, a
more illuminating justi…cation has in fact been proposed by Selman Ipek (see
section 4.5 in [Ipek 2021]) in the context of the relativistic dynamics of …elds,
a subject that lies outside the scope of the non-relativistic physics discussed in
this book.
1 2 Strictly, T S is a parallelepiped; from (10.113) we see that the lengths of its edges are
i i 1=2 i =~ i i
i = 1=2
ei =~ +i e = +i i (10.118)
2 ~ 2 i ~
i
so that
i i i i i i
= +i and = i : (10.119)
i 2 i ~ i 2 i ~
Adding and subtracting these equations we …nd
~
j = j j + j j and j = j j j j : (10.120)
2i j
0 1
[ jk ] = jk ; (10.123)
1 0
and the metric tensor and its inverse take a particularly simple form,
0 1 0 1
[Gjk ] = i jk and [Gjk ] = i jk
: (10.125)
1 0 1 0
1 3 The canonical transformation is generated by the function
!
2
i~ P k
F( ; ) = k 1 + log
2 k k
according to
@F @F
= k ; = i~ k :
@ k @ k
10.7 Quantum kinematics: Hamilton-Killing ‡ows 259
Note, in particular, that the choice A(j j) = 0 for the embedding space has led
to a metric tensor that is independent of the coordinates which corroborates
that T S + is indeed ‡at.
Finally, using G j; k to raise the …rst index of k; l gives the components
of the tensor J
j def j; k i 0 j
J l = G k; l or [J j l ] = l : (10.126)
0 i
($H G) j; k = H l@ lG j; k +G l; k @ j H
l
+G j; l @ k H
l
=0: (10.128)
More explicitly,
2 3
@H 2k @H 2j @H 1k @H 2j
@ j + @ k ; @ j + @i~ k 5
[($H G)jk ] = 4
i @H 2k =0: (10.130)
@H 1j @H 1k @H 1j
@i~ j + @ k ; @i~ j + @i~ k
j
If we further require that H be a Hamiltonian ‡ow, $H = 0, then H satis…es
Hamilton’s equations,
@H~ @H~
H 1j = and H 2j = ; (10.131)
@i~ j @ j
so that
@2H~ @2H~
= 0 and =0: (10.133)
@ j@ k @ j@ k
260 A Prelude to Dynamics: Kinematics
Therefore, in order to generate a ‡ow that preserves both G and , the function
~ ; ) must be linear in and linear in
H( ,
~ ; P
n
^ P
n
^ +M
^j
H( )= j Hjk k + j Lj j + const ; (10.134)
j;k=1 j=1
^ jk , L
where the kernels H ^ j , and M
^ j are independent of and , and the additive
constant can be dropped because it has no e¤ect on the ‡ow. Imposing that the
‡ow preserves the normalization constraint N ~ = const, eq.(10.90), implies that
H~ must be invariant under the phase shift ! ei . Therefore, L ^j = M ^ j = 0,
and we conclude that
~ ; P
n
^
H( )= j Hjk k : (10.135)
j;k=1
d j @H~ 1 P n
= H 1j = = ^ jk k ;
H (10.136)
d @i~ j i~ k=1
di~ j @H~ Pn
= H 2j = = ^
k Hkj : (10.137)
d @ j k=1
Taking the complex conjugate of (10.136) and comparing with (10.137), shows
^ ij is Hermitian, and that the Hamiltonian function H
that the kernel H ~ is real,
^ jk = H
H ^ kj and ~ ;
H( ~ ;
) = H( ): (10.138)
and the latter is recognized as the Schrödinger equation. Beyond being Her-
^ jk remains undetermined. These are the
mitian, the actual form of the kernel H
main results of this chapter.
If (1) and (2) are normalized the superposition (3) will not in general be
normalized except for appropriately chosen constants. More importantly, the
gauge-transformed states
0(1) (1) i 0(2) (2) i
= e 1
and = e 2
(10.141)
(1) (2)
are supposed to be “physically” equivalent to the original and but in
general the superposition
0(3) 0(1) 0(2)
= c1 + c2 (10.142)
where
Cj j2 2A
A (j j) = A with C(j j) = : (10.144)
4 A~j j + ~2 =2
Similarly, the other two matrix elements, 21 and 22, are given by
1
($H G)2i;1j = ($H G)1j;2i and ($H G)2i;2j = ($H G)1i;1j ; (10.148)
~2
which shows that they provide no additional information about the form of H ~
beyond that already provided by eqs.(10.146) and (10.147).
It is easy to verify that the family of bilinear Hamiltonians, eq.(10.135),
provides the desired solution. Indeed, we can easily check that a bilinear H~
implies that the quantities
@2H~ @ @H~
; 1 k ; (10.149)
@ i@ j @ k @ j
and their complex conjugates all vanish. Then, both eqs.(10.146) and (10.147)
are satis…ed identically.
To see that there are no other solutions we argue as follows. Consider
2 ~
(10.146) as a system of linear equations in the unknown 2nd derivatives, @ @i @H j .
@H~ @H~
The coordinates i and i , the …rst derivatives @ i, @ i , and the mixed deriv-
@2H~
atives @ i@ j are independent quantities that de…ne the constant coe¢ cients in
the linear system. The number of unknowns is n(n + 1)=2 and, since
($H G)1j;1i = ($H G)1i;1j ; (10.150)
we see that the number of equations matches the number of unknowns. Thus,
since the determinant of the system does not vanish, except possibly for special
values of the s, we conclude that the solution
@2H~
=0 (10.151)
@ i@ j
is unique. The observation that the remaining eqs.(10.147) are also satis…ed too
concludes the proof.
In conclusion: whether the embedding space T S + is ‡at (A = 0) or not
(A 6= 0) the HK ‡ows are described by the linear Schrödinger equation (10.139).
10.8 Hilbert space 263
The inner product — The choice of an inner product for the points is
now “natural”in the sense that the necessary ingredients are already available.
The Hamilton-Killing ‡ows followed from imposing that the symplectic form
, eq.(10.123), and the ‡at space tensor G, eq.(10.125), be preserved. In order
that the inner product also be preserved it is natural to choose an inner product
de…ned in terms of those two tensors. We adopt the familiar Dirac notation to
represent the states as vectors j i. The inner product h j i is de…ned in
terms of the tensors G and ,
i j
h j i = a (G i; j +b i; j ) ; (10.152)
1 4 We use the term Hilbert space loosely to describe any complex vector space with a Her-
mitian inner product. In this chapter we deal with complex vector spaces of …nite dimen-
sionality. The term Hilbert space is more commonly applied to in…nite dimensional vector
spaces of square-integrable functions that can be spanned by a countable basis. In in…nite
dimensions all sorts of questions arise concerning the the convergence of sums with an in…nite
numbers of terms and various other limiting procedures. It can be rigourously shown that
the conclusions we draw here for …nite dimensions also hold in the in…nite dimensional case
dimensions.
264 A Prelude to Dynamics: Kinematics
k P
n
h j i=a j ; i~ j [Gjk +b jk ] = a~ (1 ib) j j + (1 + ib) j j :
i~ k j=1
(10.153)
We shall adopt the standard de…nitions and conventions. Requiring that h j i =
h j i implies that a = a and b = ijbj. Furthermore, the standard convention
that the inner product h j i be anti-linear in its …rst factor and linear in the
second leads us to choose b = +1. Finally, we adopt the standard normalization
and set a = 1=2~. The result is the familiar expression for the positive de…nite
inner product,
def 1 j k P
n
h j i = (G j; k +i j; k ) = j j : (10.154)
2~ j=1
P
n
j i= jji j where j = hjj i ; (10.156)
j=1
where, to be explicit about the interpretation, we emphasize that j and jji are
completely di¤erent objects: j represents an ontic state while jji represents an
epistemic state — j is one of the faces of the quantum die, and jji represents
the state of certainty that the actual face is j, that is (j) = 1. In this “j”
representation, the vectors fjjig form a basis that is orthogonal and complete,
P
n
hkjji = jk and jjihjj = ^1 : (10.157)
j=1
i 0 j i i
[(J )j ] = = ; (10.158)
0 i i~ j i(i~ i )
which shows that J plays the role of multiplication by i, that is, when acting
on a point the action of J is represented by an operator J,^
J J ^
!i is j i ! Jj i = ij i : (10.159)
10.8 Hilbert space 265
~ ; ) with kernel H
The bilinear Hamilton functions H( ^ ij in eq.(11.146) can
^ and its matrix elements,
now be written in terms of a Hermitian operator H
~ ;
H( ^ i and
) = h jHj ^ jk = hjjHjki
H ^ : (10.161)
^H ( )j (0)i where
j ( )i = U ^H ( ) = exp( iH
U ^ =~) : (10.163)
Commutators — ~[ ;
The Poisson bracket of two Hamiltonian functions U ]
and V~ [ ; ],
!
P
n ~
U V~ ~
U V~
~ ; V~ g =
fU ;
j=1 j i~ j i~ j j
~ ; V~ g =
fU ^ ; V^ ]j i :
i~h j[U (10.164)
Thus the Poisson bracket is the expectation of the commutator. This identity is
much sharper than Dirac’s pioneering discovery that the quantum commutator
of two quantum variables is merely analogous to the Poisson bracket of the
corresponding classical variables.
266 A Prelude to Dynamics: Kinematics
Law without Law: “The only thing harder to understand than a law of sta-
tistical origin would be a law that is not of statistical origin, for then there
would be no way for it — or its progenitor principles — to come into
being.”
Two tests: “No test of these views looks like being someday doable, nor more
interesting and more instructive, than a derivation of the structure of
quantum theory... No prediction lends itself to a more critical test than
this, that every law of physics, pushed to the extreme, will be found statis-
tical and approximate, not mathematically perfect and precise.”
J. A. Wheeler 1
“... but it is important to note that the whole content of the theory depends
critically on just what we mean by ‘probability’.”
E. T Jaynes 2
material that evolved gradually in a series of previous publications [Caticha 2009a, 2010a,
2010b], [Caticha et al 2014], and [Caticha 2012c, 2014b, 2015a, 2017a, 2017b].
4 At present this possibility appears unlikely, but we should not underestimate the cleverness
of future scholars.
5 Non-relativistic quantum mechanics can, of course, be derived from an underlying rela-
tivistic quantum …eld theory, but no ontic dynamics is assumed to underwrite the latter (see
[Ipek et al 2014, 2018, 2020]).
11.1 Mechanics without mechanism 269
one hand there is the simplicity that arises from not having to keep track of
irrelevant details that are eventually washed out when taking the averages and,
on the other hand, it allows the intriguing possibility that there is no ontic
sub-quantum dynamics at all.
Quantum mechanics involves probabilities and, therefore, it is a theory of
inference. But this has not always been clear. The center of the controversy
has been the interpretation of the quantum state — the wave function. Does
it represent the actual real state of the system — its ontic state — or does it
represent a state of knowledge about the system — an epistemic state? The
problem has been succinctly stated by Jaynes: “Our present QM formalism
is a peculiar mixture describing in part realities in Nature, in part incomplete
human information about Nature — all scrambled up by Heisenberg and Bohr
into an omelette that nobody has seen how to unscramble.” [Jaynes 1990]6
The ontic interpretations have been fairly common. At the very beginning,
Schrödinger’s original waves were meant to be real material waves — although
the need to formulate the theory in con…guration space immediately made that
interpretation quite problematic.
Then the Copenhagen interpretation — an umbrella designation for the not
always overlapping views of Bohr, Heisenberg, Pauli, and Born [Stapp 1972;
Jammer 1966, 1974] — took over and became the orthodoxy (see, however,
[Howard 2004]). On the question of quantum reality and the epistemic vs. ontic
nature of the quantum state it is deliberately vague. As crystallized in the
standard textbooks, including the classics by [Dirac 1948], [von Neumann 1955]
and [Landau Lifshitz 1977], it regards the quantum state as an objective and
complete speci…cation of the properties of the system but only after they become
actualized through the act of measurement. According to Bohr the connection
between the wave function and the world is indirect. The wave function does
not represent the world itself, but is a mere tool to compute probabilities for
the outcomes of measurements that we are forced to describe using a classical
language that is ultimately inadequate [Bohr 1933, 1958, 1963]. Heisenberg’s
position is somewhat more ambiguous. While he fully agrees with Bohr on the
inadequacy of our classical language to describe a quantum reality, his take
on the wave function is that it describes something more ontic, an objective
tendency or potentiality for events to occur. And then there is also Einstein’s
ensemble or statistical interpretation which is more explicitly epistemic. In his
words, “the -function is to be understood as the description not of a single
system but of an ensemble of systems” [Einstein 1949b, p. 671]. Whether
he meant a virtual ensemble in the sense of Gibbs is not so clear. (See also
[Ballentine 1970, Fine 1996].)
6 An important point to be emphasized here is that the distinction ontic/epistemic is not the
Bohr, Heisenberg, Einstein and other founders of quantum theory were all
keenly aware of the epistemological and pragmatic elements at the foundation of
quantum mechanics (see e.g., [Stapp 1972] on Bohr, and [Fine 1996] on Einstein)
but, unfortunately, they wrote at a time when the language and the tools of a
quantitative epistemology — the Bayesian and entropic methods that are the
subject of this book — had not yet been su¢ ciently developed.
The conceptual problems that plagued the orthodox interpretation moti-
vated the creation of ontic alternatives such as the de Broglie-Bohm pilot wave
theory [Bohm Hiley 1993, Holland 1993], Everett’s many worlds interpretation
[Everett 1957, Zeh 2016], and the spontaneous collapse theories [Ghirardi et al
1986, Bassi et al 2013]. In these theories the wave function is ontic, it represents
a real state of a¤airs. On the other side, the epistemic interpretations have had
a growing number of advocates including, for example, [Ballentine 1970, 1998;
Caves et al 2007; Harrigan Spekkens 2010; Friedrich 2011; Leifer 2014].7 The
end result is that the conceptual struggles with quantum theory have engen-
dered a literature that is too vast to even consider reviewing here. Excellent
sources for the earlier work are found in [Jammer 1966, 1974; Wheeler Zurek
1983]; for more recent work see, e.g. [Schlosshauer 2004; Jaeger 2009; Leifer
2014].
Faced with all this controversy, Jaynes also understood where one might
start looking for a solution: “We suggest that the proper tool for incorporating
human information into science is simply probability theory — not the currently
taught ‘random variable’kind, but the original ‘logical inference’kind of James
Bernoulli and Laplace” which he proceeds to explain “is often called Bayesian
inference” and is “supplemented by the notion of information entropy”.
The Entropic Dynamics (ED) developed below achieves ontological clarity
by sharply separating the ontic elements from the epistemic elements — posi-
tions of particles (or distributions of …elds) on one side and probabilities and
their conjugate momenta on the other. In this regard ED is in agreement with
Einstein’s view that “... on one supposition we should in my opinion hold ab-
solutely fast: The real factual situation of the system S2 is independent of what
is done with system S1 which is spatially separated from the former.”[Einstein
1949a, p.85] (See also [Howard 1985].) ED is also in broad agreement with Bell’s
views on the desirability of formulating physics in terms of local “beables”8 (See
Bell’s papers reproduced in [Bell 2004].)
ED is an epistemic dynamics of probabilities and not an ontic dynamics of
particles (or …elds). Of course, if probabilities at one instant are large in one
place and at a later time they are large in some other place one infers that
the particles must have moved — but nothing in ED assumes the existence of
something that has pushed the particles around. ED is a mechanics without
7 For criticism of the epistemic view see e.g. [Zeh 2002; Ferrero et al 2004; Marchildon
2004].
8 In contast to mere observ ables, the be ables are supposed to represent something that is
ontic. In the Bohmian approach and in ED particle positions (and …elds) are local beables.
According to the Bohmian and the many worlds interpretations the wave function is a nonlocal
beable.
11.1 Mechanics without mechanism 271
Rovelli 1996; Caticha 1998, 2006; Zeilinger 1999; Brukner Zeilinger 2002; Fuchs 2002; Spekkens
2007; Goyal et al 2010; Hardy 2001, 2011, Chiribella et al 2011, D’Ariano 2017].
1 1 Here we are concerned with non-relativistic quantum mechanics. When the ED framework
is applied to relativistic quantum …eld theory it is the …elds that are the only ontic variables
[Ipek Caticha 2014, Ipek et al. 2018, 2020].
1 2 See also [Guerra 1981, Guerra Morato 1983] and references therein.
272 Entropic Dynamics: Time and Quantum Theory
and it is these values that we wish to infer. Since the positions are unknown
the main target of our attention will be the probability distribution (x).
The previous paragraph may seem straightforward common sense but the
assumptions it involves are not at all innocent. First, note that ED already sug-
gests a new perspective on the old question of determinism vs. indeterminism.
If we understand quantum mechanics as a generalization of a deterministic clas-
sical mechanics then it is natural to seek the cause of indeterminism. But within
an inference framework that is designed to deal with insu¢ cient information one
must accept uncertainty, probabilities, and indeterminism as the expected and
inevitable norm that requires no explanation. It is the determinism of classical
mechanics that demands an explanation. Indeed, as we shall see in section ??
while most quantities are a- icted by uncertainties there are situations where for
some very specially chosen variables one can, despite the lack of information,
achieve complete predictability. This accounts for the emergence of classical
determinism from an entropic dynamics that is intrinsically indeterministic.
Second, we note that the assumption that the particles have de…nite posi-
tions represents a major departure from the standard interpretation of quantum
mechanics according to which de…nite values can be attained but only as the
result of a measurement. In contrast, positions in ED play the very special role
of de…ning the ontic state of the system. Let us be very explicit: in the ED de-
scription of the double slit experiment, we might not know which slit the particle
goes through, but the particle de…nitely goes through one slit or the other.13
Indeed, as far as positions are concerned, ED agrees with Einstein’s view that
spatially separated objects have de…nite separate ontic states [Einstein 1949a,
p.85].
And third, we emphasize once again that (x) represents probabilities that
are to be manipulated according to exactly the same rules described in previous
chapters — the entropic and Bayesian methods. Just as there is no such thing
as a quantum arithmetic, there is no quantum probability theory either.
relative to a prior Q(x0 jx), and subject to the appropriate constraints speci…ed
below. (For notational simplicity in multidimensional integrals such as (11.1) we
will write dx0 instead of d3N x0 .) It is through the choice of prior and constraints
that the relevant pieces of information that de…ne the dynamics are introduced.
where the Lagrange multipliers n are constants that are independent of x but
may depend on the index n in order to describe non-identical particles. The n s
will be eventually be taken to in…nity in order to enforce the fact that the steps
are meant to be in…nitesimally short. In Cartesian coordinates the uniform
measure is a numerical constant that can be absorbed into the normalization
and therefore has no e¤ect on Q.14 The result is a product of Gaussians,
1X
Q(x0 jx) / exp n ab xan xbn ; (11.5)
2 n
1 4 Indeed, as 0
n ! 1 the prior Q becomes independent of any choice of (x ) provided the
latter is su¢ ciently smooth.
11.3 The entropic dynamics of short steps 275
which describes the a priori lack of correlations among the particles. Next
we specify the constraints that specify the information that is speci…c to each
individual short step.
XN
@'
h '(x)i = a
h xan i = 0
(x) ; (11.6)
n=1
@xn
where 0 (x) is some small but for now unspeci…ed function. This information
is already su¢ cient to construct an interesting entropic dynamics which turns
out to be a kind of di¤usion where the expected “drift”, h xn i, is determined
by the “potential” '.15
The physical origin of the potential '(x) is at this point unknown so how can
one justify its introduction? First, we note that identifying the relevant con-
straints, such as (11.6), represents signi…cant progress even when their physical
origin remains unexplained. This situation has historical precedents. For exam-
ple, in Newton’s theory of gravity or in the theory of elasticity, the speci…cation
of the forces turned out to be very useful even though their microscopic origin
had not yet been fully understood. Indeed, as we shall show the assumption
of a constraint involving a con…guration space function '(x) is instrumental to
explain quantum phenomena such as entanglement, interference, and tunneling.
A second more formal justi…cation, is motivated by the geometrical discussion
in chapter 10. We seek a dynamics in which evolution takes the form of curves
or trajectories on the statistical manifold of probabilities f g. Then, if the prob-
abilities (x) are treated as generalized coordinates, it is only natural to expect
that quantities '(x) must at some point be introduced to play the role of their
conjugate momenta. What might be surprising is that the single function '(x)
will play three roles that might appear to be totally unrelated to each other:
…rst, as a constraint in an entropic inference; second, as the momenta conjugate
1 5 However, to construct that particular dynamics that describes quantum systems we must
further require that '=~ be a multi-valued function with the topological properties of an angle
— '(x) and '(x) + 2 ~ represent the same “angle” (see section 11.8.4 below).
276 Entropic Dynamics: Time and Quantum Theory
These N constraints involve a single vector …eld Aa (~x) that lives in the 3-
dimensional physical space (~x 2 X). This ensures that all particles couple to
one single electromagnetic …eld. The strength of the coupling is given by the
values of the 00n . These are small quantities that could be speci…ed directly but,
as is often the case in entropic inference, it is much more convenient to specify
them indirectly in terms of the corresponding Lagrange multipliers.
1 1X a b
P (x0 jx) = exp n ab xan xn xbn xn (11.9)
Z 2 n
given by
0
a
xn = h xan i = ab
[@nb ' n Ab (~
xn )] ; (11.11)
n
1
h wna i = 0 and h wna wnb 0 i = ab
nn0 : (11.12)
n
1 6 The distribution (11.8) is not merely a local maximum or a stationary point. It yields
the absolute maximum of the relative entropy S[P; Q] subject to the constraints. The proof
follows the standard argument originally due to Gibbs (see section 4.10).
11.4 Entropic time 277
The directionality of the motion and the correlations among the particles are
introduced by a systematic drift in a direction determined by @na ' and Aa ,
while the position ‡uctuations remain isotropic and uncorrelated. As n ! 1,
the trajectory is expected to be continuous. As we shall see below, whether the
trajectory is di¤erentiable or not depends on the particular choices of n and
0
. Eqs. (11.11) and (11.12) also show that the e¤ect of 0 is to enhance or
suppress the magnitude of the drift relative to the ‡uctuations.
where (~x) is a function in 3d-space and the multipliers n will later be related
to the electric charges qn by n = qn =c.
shall return to the question of whether and how this epistemic notion of time is
related to the presumably more “physical” time that is measured by clocks.
This equation follows directly from the laws of probability and, therefore, it is
true independently of any physical assumptions which means that it is not very
useful as it stands. To make it useful, something else must be added.
There is something peculiar about con…guration spaces. For example, when
we represent the position of a particle as a point with coordinates (x1 ; x2 ; x3 ) it
is implicitly understood that the values of the three coordinates x1 , x2 , and x3
hold simultaneously — no surprises here. Things get a bit more interesting when
we describe a system of N particles by a single point x = (~x1 ; ~x2 ; : : : ~xN ) in 3N -
dimensional con…guration space. The point x is meant to represent the state at
one instant, that is, it is also implicitly assumed that all the 3N coordinate values
are simultaneous. What is peculiar about con…guration spaces is that they
implicitly introduce a notion of simultaneity. Furthermore, when we express
uncertainty about the values of an x by means of a probability distribution
P (x) it is also implicitly understood that the di¤erent possible values of x all
refer to the same instant. The system could be at this point here or it could
be at that point there, we might not know which, but whichever it is, the two
possibilities refer to positions at one and the same instant.17 And similarly,
when we consider the transition probability from x to x0 , given by P (x0 jx), it
is implicitly assumed that x refers to one instant, and the possible x0 s all refer
to another instant. Thus, in ED, a probability distribution over con…guration
space provides a criterion of simultaneity.
We can now return to eq.(11.15): if P (xk 1 ) happens to be the probability
of di¤erent values of x at an “initial”instant of entropic time t, and P (xk jxk 1 )
is the transition probability from xk 1 at one instant to xk at another instant,
then we can interpret P (xk ) as the probability of values of xk at a “later”
instant of entropic time t0 = t + t. Accordingly, we write P (xk 1 ) = t (x)
and P (xk ) = t0 (x0 ) so that
0
R 0
t0 (x ) = dx P (x jx) t (x) : (11.16)
1 7 We could of course consider the joint probability P (~ x2 (t2 )) of particle 1 being at
x1 (t1 ); ~
x1 at time t1 and particle 2 being at ~
~ x2 at time t2 , but the set of points f~ x2 (t2 )g is
x1 (t1 ); ~
not at all what one would call a con…guration space.
11.4 Entropic time 279
The temporal asymmetry is due to the fact that the distribution P (x0 jx), eq.(11.8),
is a Gaussian derived using the maximum entropy method, while the time-
1 8 In this respect, entropic time bears some resemblance with the relational notion of time
advocated by J. Barbour in the context of classical physics (see e.g. [Barbour 1994]).
280 Entropic Dynamics: Time and Quantum Theory
11.4.3 Duration
We have argued that the concept of time is intimately connected to the associ-
ated dynamics but at this point neither the transition probability P (x0 jx) that
speci…es the dynamics nor the corresponding entropic time have been fully de-
…ned yet. It remains to specify how the interval t between successive instants
is encoded into the multipliers n and 0 .
The basic criterion for this choice is convenience: duration is de…ned so that
motion looks simple. The description of motion is simplest when it re‡ects the
symmetry of translations in space and time. We therefore choose 0 and n
to be constants independent of x and t. The resulting entropic time resembles
Newtonian time in that it ‡ows “equably everywhere and everywhen.”
The particular choice of duration t in terms of the multipliers n and 0
can be motivated as follows. In Newtonian mechanics time is de…ned to simplify
the dynamics. The prototype of a classical clock is a free particle that moves
equal distances in equal times so that there is a well de…ned (constant) velocity.
In ED time is also de…ned to simplify the dynamics, but now it is the dynamics
of probabilities as prescribed by the transition probability. We de…ne duration
so that for short steps the system’s expected displacement h xi increases by
equal amounts in equal intervals t so there is a well de…ned drift velocity.
Referring to eq.(11.11) this is achieved by setting the ratio 0 = n proportional
to t and thus, the transition probability provides us with a clock. For future
convenience the proportionality constants will be expressed in terms of some
particle-speci…c constants mn ,
0
1
= t: (11.19)
n mn
At this point the constants mn receive no interpretation beyond the fact that
their dependence on the particle label n recognizes that the particles need not
11.5 Brownian sub-quantum motion and the evolution equation 281
0 1 mn
= so that n = : (11.20)
t
h xA i
bA (x) = = mAB @B '(x) AB (x) ; (11.24)
t
and A(x) is the electromagnetic vector expressed as a …eld in con…guration
space. Its components are
which shows that the constant controls the strength of the ‡uctuations. Note
that for very short steps, as t ! 0, the ‡uctuations become dominant: the
drift is h xA i O( t) while wA O( t1=2 ) which is characteristic of Brown-
ian paths. Thus, with the choice (11.20), the trajectory is continuous but not
di¤erentiable: a particle has a de…nite position but its velocity, the tangent
to the trajectory, is completely unde…ned. To state this more explicitly, since
wA O( t1=2 ) but h wA i = 0, we see that the limit t ! 0 and the
expectation h i do not commute,
xA xA
lim 6= lim (11.27)
t!0 t t!0 t
because
wA wA
lim = 0 while lim =1: (11.28)
t!0 t t!0 t
where the times t and t + t have been written down explicitly. As so often in
physics it is more convenient to rewrite the evolution equation above in di¤er-
ential form. One might be tempted to Taylor expand in t and x = x0 x,
but this is not possible because for small t the distribution P (x0 ; t + tjx; t),
eq. (11.21), is very sharply peaked at x0 = x. To handle such singular behav-
ior one follows an indirect procedure that is well known from di¤usion theory
[Chandrasekhar 1943]: multiply by a smooth test function f (x0 ) and integrate
over x0 ,
Z Z Z
0 0 0
dx t+ t (x )f (x ) = dx dx0 P (x0 ; t + tjx; t)f (x0 ) t (x) : (11.33)
The test function f (x0 ) is assumed su¢ ciently smooth precisely so that it can
be expanded about x. The important point here is that for Brownian paths
284 Entropic Dynamics: Time and Quantum Theory
eq.(11.26) implies that the terms ( x)2 contribute to O( t). Then, dropping
all terms of order higher than t, the integral in the brackets is
Z
@f 1 @2f
[ ] = dx0 P (x0 ; t + tjx; t) f (x) + A xA + xA xB + : : :
@x 2 @xA @xB
@f 1 @2f
= f (x) + bA (x) t A + t mAB A B + : : : (11.34)
@x 2 @x @x
where we used eq.(11.23) and (11.26),
Z
1
lim dx0 P (x0 ; t + tjx; t) xA = bA (x) ;
t!0+ t
Z
1
lim + dx0 P (x0 ; t + tjx; t) xA xB = mAB : (11.35)
t!0 t
Dropping the primes on the left hand side of (11.33), substituting (11.34) into
the right, and dividing by t, gives
Z Z
1 @f 1 @2f
dx [ t+ t (x) t (x)] f (x) = dx bA (x) A + mAB A B t (x) :
t @x 2 @x @x
(11.36)
Next integrate by parts on the right and let t ! 0. The result is
Z Z
@ A 1 AB @2 t
dx @t t (x)f (x) = dx (b t ) + m f (x) : (11.37)
@xA 2 @xA @xB
Since the test function f (x) is arbitrary, we conclude that
1
@t t = @A (bA t ) + mAB @A @B t : (11.38)
2
Thus, the di¤erential equation for the evolution of t (x) takes the form of a
Fokker-Planck equation. To proceed with our analysis we shall rewrite (11.38)
in several di¤erent forms.
@t = @A v A (11.40)
v A = bA + u A (11.41)
11.5 Brownian sub-quantum motion and the evolution equation 285
that will be called the phase. (Eventually will be identi…ed as the phase of
the wave function.) Eqs.(11.43) and (11.45) show, once again, that the action
of the constant is to control the relative strength of di¤usion and drift.
Next we shall rewrite the continuity equation (11.40) in yet another equiva-
lent but very suggestive form involving functional derivatives.
def P @f
df = dqi : (11.46)
i @qi
2 0 The de…nition of osmotic velocity adopted in [Nelson 1966] and other authors di¤ers from
ours by a sign. Nelson takes the osmotic velocity to be the velocity imparted by the external
force that is needed to balance the osmotic force (due to concentration gradients) in order to
attain equilibrium. Here the osmotic velocity is the velocity associated to the actual di¤usion
current, eq. (11.43).
286 Entropic Dynamics: Time and Quantum Theory
The partial derivative @f =@qi is de…ned as the coe¢ cient of the term linear in
dqi .
Similarly, if F [ ] = F (: : : x : : : x0 : : :) is a function of in…nitely many vari-
ables x labeled by a continuous index x, then a small change in the function
(x) ! (x)+ (x) will induce a small change F of the functional F [ ]. To
…rst order in we have
def R F
F = dx (x) (11.47)
(x)
where the functional derivative F= (x) is de…ned as the coe¢ cient of the term
linear in (x).
The virtue of this approach is that it allows us to manipulate and calculate
functional derivatives by just following the familiar rules of calculus such as
Taylor expansions, integration by parts, etc. For example, if the functional F [ ]
just returns the value of (x) at the point y, that is F [ ] = y , then
R F (y)
F = (y) = dx (x) implies = (x y) : (11.48)
(x) (x)
A(x). This is an interesting ED in its own right but it is not QM. Indeed, a
quantum dynamics consists in the coupled evolution of two dynamical …elds:
the density t (x) and the phase of the wave function. This second …eld can
be naturally introduced into ED by allowing the phase …eld t (x) in (11.49) to
become dynamical which amounts to an ED in which the constraint (11.6) is
itself continuously updated at each instant in time. To complete the construction
of ED we must identify the appropriate updating criterion (e.g., along the lines of
Chapter 10) to formulate an ED in which the phase …eld t guides the evolution
of t , and in return, the evolving t reacts back and induces the evolution of t .
Remark: One might suspect that the Hamiltonian H ~ in (11.51) will eventually
lead us to the concept of energy and this will indeed turn out to be the case.
But there is something peculiar about H: ~ the variables that de…ne H ~ are prob-
abilities and the phase …elds which are both epistemic quantities. It therefore
follows that in the ED approach the energy is also an epistemic concept. In ED
only positions are ontic; energy is not. Surprising as this may sound, it is not
an impediment to formulating laws of physics that are empirically successful.
Remark: We note that once the evolution equation is written in Hamiltonian
form in terms of the phase the constant disappears from the formalism. This
means that changes in arise from the combined e¤ect of drift and di¤usion
and it is no longer possible to attribute any particular e¤ect to one or the other.
The fact that it is possible to enhance or suppress the ‡uctuations relative
to the drift to achieve the same overall evolution shows that there is a whole
family of ED models that di¤er at the “microscopic” or sub-quantum level.
Nevertheless, as we shall see, all members of this family lead to the same
“emergent” Schrödinger equation at the “macroscopic” or quantum level. The
model in which ‡uctuations are (almost) totally suppressed is of particular inter-
est: the system evolves along the smooth lines of probability ‡ow. This suggests
that ED includes the Bohmian or causal form of quantum mechanics as a special
limiting case. (For more on this see section 11.6).
where the expectation is conditional on the later position x = x(t). Shifting the
time by t, bA can be equivalently written as
xA (t + t) xA (t) x(t+ t)
A 0
b (x ) = lim +
t!0 t
1 R
= lim dx P (xjx0 ) xA ; (11.54)
t!0+ t
with the same de…nition of xA as in eq.(11.52).
The two drift velocities, towards the future bA and from the past bA , do
not coincide. The connection between them was derived by Nelson in [Nelson
1966, 1985] and independently by Jaynes [Jaynes 1989]. It turns out to be a
straightforward consequence of Bayes’theorem, eq.(11.18). To derive it expand
0
t0 (x ) about x in (11.18) to get
t (x)
P (xjx0 ) = 1 @B log t0 (x) xB + : : : P (x0 jx) : (11.55)
t0 (x)
h xA xB ix = h wA wB ix + O( t3=2 )
= mAB t + O( t3=2 ) ; (11.58)
to get
Z
t (x) xA
lim + dx [ f (x) mAB f (x)@B log t0 (x) + mAB @B f (x) + : : :] :
t!0 t0 (x) t
(11.59)
Next take the limit t ! 0+ and note that the third term vanishes (just inte-
grate by parts). The result is
Z Z
dx bA (x)f (x) = dx bA (x) mAB @B log t (x) f (x) : (11.60)
1 A 1 A
vA = b + bA and uA = b bA : (11.62)
2 2
0 1 mn
= 0
so that n = ; (11.63)
t2 0 t3
1 1 xA xB
P (x0 jx) = exp 0
mAB b0A (x) b0B (x) ;
Z 2 t t t
(11.64)
where we used (11.11) to de…ne the drift velocity,
h xA i
b0A (x) = = mAB @B '0 (x) AB (x) ; (11.65)
t
h wA i = 0 and h wA wB i = 0
mAB t3 ; (11.66)
or
xA xB
b0A b0B = 0
mAB t: (11.67)
t t
It is noteworthy that h xA i O( t) and wA O( t3=2 ). This means that
as t ! 0 the dynamics is dominated by the drift and the ‡uctuations become
negligible. Indeed, since w0A O( t3=2 ) eq.(11.23) shows that the limit
xA
lim = b0A (11.68)
t!0 t
290 Entropic Dynamics: Time and Quantum Theory
is well de…ned. In words: the actual velocities of the particles coincide with
the expected or drift velocities. From eq.(11.65) we see that these velocities
are continuous functions. Since, as we shall later see, these smooth trajectories
coincide with the trajectories postulated in Bohmian mechanics, we shall call
them Bohmian trajectories to distinguish them from the Brownian trajectories
discussed in section 11.5.2.
in di¤erential form. Since for small t the transition probability P (x0 ; t+ tjx; t)
is very sharply peaked at x0 = x we proceed as in section 11.5.2. We multiply
by a smooth test function f (x0 ) and integrate over x0 ,
Z Z Z
0 0 0
dx t+ t (x )f (x ) = dx dx0 P (x0 ; t + tjx; t)f (x0 ) t (x) : (11.70)
The test function f (x0 ) is assumed su¢ ciently smooth precisely so that it can
be expanded about x. Then, dropping all terms of order higher than t, as
t ! 0 the integral in the brackets is
Z
@f
[ ] = dx0 P (x0 ; t + tjx; t) f (x) + A (x0A xA ) + :::
@x
@f
= f (x) + b0A (x) t A + : : : (11.71)
@x
where we used eq.(11.23). Dropping the primes on the left hand side of (11.70),
substituting (11.71) into the right, and dividing by t, gives
Z Z
1 @f
dx [ t+ t (x) t (x)] f (x) = dx b0A (x) A t (x) : (11.72)
t @x
Next integrate by parts on the right and let t ! 0. Since the test function
f (x) is arbitrary, we conclude
which is the desired evolution equation for t (x) written in di¤erential form.
This is a continuity equation where the current velocity is equal to the drift
velocity, v A = b0A .
Thus, whether we deal with Brownian ( 0 = const) or Bohmian ( 0 / 1= t2 )
trajectories we …nd the same continuity equation
provided the corresponding drift potentials 'Bohm ian and 'Brownian are chosen
such that they lead to the same phase …eld,
1=2
= 'Bohm ian = 'Brownian log : (11.75)
It also follows that whether we deal with Bohmian or Brownian paths, the
evolution of probabilities can be expressed in the same Hamiltonian form given
in eqs.(11.49) and (11.51).
0 1 mn
= 00 1
and n = 00
; (11.76)
t t
where and 00 are positive constants. We will not pursue this topic further
except to note that for < 2 the sub-quantum motion is dominated by ‡uc-
tuations and the trajectories are non-di¤erentiable, while for > 2 the drift
dominates and velocities are well de…ned.
ble 1979; Heslot 1985; Anandan and Aharonov 1990; Cirelli et al. 1990; Abe 1992; Hughston
1995; Ashekar and Schilling 1998; de Gosson, Hiley 2011; Elze 2012; Reginatto and Hall 2011,
2012]; [Caticha 2019, 2021b].
292 Entropic Dynamics: Time and Quantum Theory
~ x and r
and the action of the basis covectors r ~ x on the vector V is de…ned
by
x
~ x [V ] = d
r and r~ x [V ] = d x ; (11.82)
d d
that is,
~ x x ~ x0 ~ x ~
r [ x0
]= x0 ; r x[ ]= x ; and r [ ]=r x[ x0
]=0:
x0 x0
(11.83)
The fact that the space S is constrained to normalized probabilities means
that the coordinates x are not independent. This technical di¢ culty is handled
by embedding the 1-dimensional manifold S in a (1+ 1)-dimensional manifold
~ is a covector
S + where the coordinates x are unconstrained. Thus, strictly, rF
+ ~ + ~ x ~
on T S , that is, rF 2 T (T S )X and r and r x are the corresponding
basis covectors.
Instead of keeping separate track of the x and x coordinates it is more
convenient to combine them into a single index. A point X = ( ; ) will then
be labelled by its coordinates
x
X = (X 1x ; X 2x ) = ( x
; x) (11.84)
11.7 The epistemic phase space 293
dF ~ [V ] = F V x and rF ~ = F ~
= rF x
rX x ; (11.86)
d X X x
where the repeated upper and lower indices indicate a summation over and
an integration over x. Once we have introduced the composite indices x to
label tensor components there is no further need to draw a distinction between
x
and x — these are coordinates and not the components of a vector. From
now on we shall write (x) = x = x switching from one notation to another as
convenience dictates. On the other hand, for quantities such as x or d x =d
that are the components of vectors it is appropriate to keep x as an upper index.
or
d x V~ d x V~
= and = ; (11.96)
d x d x
The ED that preserves the symplectic structure and reproduces the con-
~ ; ] in
tinuity equation (11.49) is generated by the Hamiltonian functional H[
(11.51),
H~ ~
H
@t x = ; @t x = ; (11.101)
x x
The dynamics, however, is not yet fully determined because the integration
constant F [ ] in (11.51) remains to be speci…ed.
x0 ~
`2 = gxx0 x
with gxx0 = A(j j) nx nx0 + xx0 ; (11.109)
2 x
where n is a special covector which, in coordinates, has components nx = 1,
so that Z Z
2
~
`2 = A(j j) dx x + dx ( x )2 ; (11.110)
2 x
and the freedom in the function A(j j) re‡ects the ‡exibility in the choice
of spherically symmetric embedding. The corresponding inverse tensor, see
eq.(10.103), is
0 2 x 2A
g xx = xx0 +C x x0 where C(j j) = : (11.111)
~ ~Aj j + ~2 =2
The metric structure for T S + is obtained following the same argument that
led to eqs.(10.78) and (10.104). The simplest geometry that is invariant under
‡ow reversal and is determined by the information geometry of S + , which is
0
fully described by the tensor gxx0 and its inverse g xx , is given by the length
element
x0 x0 0
`~2 = G x; x0 X x
X = gxx0 x
+ g xx x x0 : (11.112)
should be quali…ed with some …ne print to the e¤ect that “we adopt the standard of mathe-
matical rigor typical of theoretical physics.” Ultimately the argument is justi…ed by the fact
that it leads to useful models that are empirically successful. For relevant references see [Cirelli
et al 1990][Pistone Sempi 1995] and also [Jaynes 2003, appendix B].
2 4 The quantities x are the components of a vector so in (11.109) it makes sense to keep
1
Using G to raise the …rst index of the symplectic form x; x0 ,
0 1
[ xx0 ] = xx0 ; (11.116)
1 0
as in eq.(10.79),
x; x00 x
G x00 ; x0 = J x0 ; (11.117)
we …nd
0
0 g xx
[J x x0 ] = : (11.118)
gxx0 0
And just as in the discrete case the square of the J tensor is minus the identity,
x x00 x
J x00 J x0 = x0 = xx0 or JJ = 1; (11.119)
x0 0
`~2 ( ) = gxx0 x
+ g xx ( x + )( x0 + ); (11.120)
Then the metric that measures the distance between neighboring rays on T S
is obtained by substituting m in back into (11.120), and setting j j = 1 and
j j = 0. The result is
Z
~ 2 x
s~2 = dx ( x 2
) + ( x h i)2 : (11.123)
2 x ~
and
2
0
x
xx0 0
[Gxx ] = ~
~ : (11.126)
0 2 x
xx0
x = x x + x x ;
~
x = ( x x x x) : (11.130)
2i x
300 Entropic Dynamics: Time and Quantum Theory
0 1
[ xx0 ] = xx0 ; (11.132)
1 0
dH~
~ N
= fH; ~g = 0 : (11.137)
d
Therefore N ~ is the generator of a global “gauge”symmetry and the Hamiltonian
~
H is invariant under the transformation x (0) ! x ( ). The interpretation is
that as we embed the e-phase space T S into the larger space T S + we introduce
two additional degrees of freedom. We eliminate one by imposing the constraint
N~ = 0; we eliminate the other by declaring that two states x (0) and x ( )
that lie on the same ray (or gauge orbit) are equivalent in the sense that they
represent the same epistemic state.
In coordinates the metric on T S + , eqs.(11.124) and (11.125), becomes
Z Z
x0
` = 2i dx x i~ x = dxdx0 G x; x0
2 x
; (11.138)
11.9 Hamilton-Killing ‡ows 301
x; x0
where in matrix form the metric tensor G x; x0 and its inverse G are
0 1 0 0 1
[Gxx0 ] = i xx0 and [Gxx ] = i xx0 : (11.139)
1 0 1 0
x; x0
Finally, using G to raise the …rst index of x0 ; x00 gives the components
of the tensor J
x def x; x0 i 0
J x00 = G x0 ; x00 or [J x x00 ] = xx00 : (11.140)
0 i
or 2 0 0 3
H 2x H 2x H 1x H 2x
+ ; + i~ x0
[($Q G)xx0 ] = i4 x
0
x0 x
0 5=0: (11.143)
H 2x H 1x H 1x H 1x
i~ x + ; i~ x + i~ x0
x0
x
If we further require that H be a Hamiltonian ‡ow, $H = 0, then we
substitute
H~ ~
H
H 1x = and H 2x = (11.144)
i~ x x
~ Z
d x H 1 ^ xx0 x0 ;
= H 1x = = dx0 H (11.147)
dt i~ x i~
~ Z
di~ x 2x H ^ xx0 :
=H = = dx0 x0 H (11.148)
dt x
Taking the complex conjugate of (11.147) and comparing with (11.148), shows
^ xx0 is Hermitian,
that the kernel H
^ 0 = Hx0 x ;
H (11.149)
xx
~ are real,
and we can check that the corresponding Hamiltonian functionals H
~ ;
H[ ~ ;
] = H[ ]:
@f
f (x) ! f" (x) = f (x ") or " f (x) = f" (x) f (x) = "a : (11.150)
@xa
The change of a functional F [ ; ] is
Z
F F
" F [ ; ] = dx " x+ " x = fF; P~a "a g (11.151)
x x
where Z Z
X @ @ x
P~a = dx x = dx x (11.152)
n @xa
n
a
@Xcm
a
is interpreted as the expectation of the total linear momentum, and Xcm are
the coordinates of the center of mass,
a 1 P P
Xcm = mn xan where M= mn : (11.153)
M n n
Substituting = 1=2 i =~
e into (11.161) and using V^x0 x = V^xx0 leads to
Z
F0 2 1=2
= 1=2
x dx0 x0 Im V^xx0 e i( x x0 )=~ =0: (11.163)
x ~
304 Entropic Dynamics: Time and Quantum Theory
This equation must be satis…ed for all choices of x0 . Therefore, it follows that
Furthermore, this last equation must in turn hold for all choices of x and x0 .
Therefore, the kernel V^xx0 must be local in x,
H~
@t x =f x ; Hg
~ = ; (11.167)
i~ x
which is the Schrödinger equation,
~2 AB
i~@t = m DA DB + V : (11.168)
2
In more standard notation it reads
X ~2 ab @ i @ i
i~@t = a n Aa (xn ) n Ab (xn ) +V :
n 2mn @xn ~ @xbn ~
(11.169)
At this point we can …nally provide the physical interpretation of the various
constants introduced along the way. Since the Schrödinger equation (11.169) is
the tool we use to analyze experimental data we can identify ~ with Planck’s
constant, mn will be interpreted as the particles’masses, and the n are related
to the particles’electric charges qn (in Gaussian units) by
qn
n = : (11.170)
c
For completeness we write the Hamiltonian in the ( ; ) variables,
Z X ab
~ @ qn @ qn
H[ ; ] = d3N x Aa (xn ) Ab (xn )
n 2mn @xan c @xbn c
X ~2 ab @ @
+ + V (x1 : : : xn ) : (11.171)
n 8mn 2 @xa b
n @xn
The Hamilton equations for and are the continuity equation (11.49),
~
H X @ ab
@ qn
@t = = Ab (xn ) ; (11.172)
n @xan mn @xbn c
11.11 Entropic time, physical time, and time reversal 305
~
H X ab
@ qn @ qn
@t = = Aa (xn ) Ab (xn )
n 2mn @xan c @xbn c
X ~2 ab
@ 2 1=2
+ 1=2 @xa @xb
V (x1 : : : xn ) : (11.173)
n 2mn n n
The action — Now that we have Hamilton’s equations (11.101) one can
invert the usual procedure and construct an action principle from which they
can be derived. De…ne the di¤erential
Z Z " ! ! #
~
H H~
A = dt dx @t x x @t x + x (11.174)
x x
but both solutions fqt ; pt g and fqtT ; pTt g describe evolution forward in time.
An alternative statement of time reversibility is the following: if there is
one trajectory of the system that takes it from state fq0 ; p0 g at time t0 to state
fq1 ; p1 g at the later time t1 , then there is another possible trajectory that takes
the system from state fq1 ; p1 g at time t0 to state fq0 ; p0 g at the later time
t1 . The merit of this re-statement is that it makes clear that nothing needs to
travel back in time. Indeed, rather than time reversal the symmetry might be
more appropriately described as momentum or motion or ‡ow reversal.
Since ED is a Hamiltonian dynamics one can expect that similar consid-
erations will apply to QM and indeed they do. It is straightforward to check
that given one solution f t (x); t (x)g that evolves forward in time, we can con-
struct another solution f Tt (x); Tt (x)g that is also evolving forward in time.
The reversed solution is
T T
t (x) = t (x) and t (x) = t (x) : (11.177)
The proof that this is a symmetry is straightforward; just take the complex
conjugate of (11.169), and let t ! t.
Remark: In section 10.5.1 we saw that the metric structure of the e-phase
space was designed so that potential time-reversal violations would be induced
at the dynamical level of the Hamiltonian and not at the kinematical level of
11.12 Hilbert space 307
where, in this “position” representation, the vectors fjxig form a basis that is
orthogonal and complete,
Z
hxjx0 i = xx0 and dx jxihxj = ^1 : (11.183)
i 0 x i x
[(J )x ] = = ; (11.184)
0 i i~ x i(i~ x)
308 Entropic Dynamics: Time and Quantum Theory
which shows that J plays the role of multiplication by i, that is, when acting
on a point the action of J is represented by an operator J,^
J J
^ i = ij i :
!i is j i ! Jj (11.185)
~ ;
Q[ ^ i and
] = h jQj ^ xx0 = hxjQjx
Q ^ 0i : (11.186)
d ^ i or i~ d j i = Qj
^ i:
i~ hxj i = hxjQj (11.187)
d d
These ‡ows are described by unitary transformations
Commutators — ~[ ;
The Poisson bracket of two Hamiltonian functionals U ]
and V~ [ ; ],
Z !
~
U V~ ~
U V~
~ ; V~ g =
fU dx ;
x i~ x i~ x x
~ ; V~ g = 1 ^ ; V^ ]j i :
fU h j[U (11.189)
i~
Thus the Poisson bracket is the expectation of the commutator.
11.13 Summary
We conclude with a summary of the main ideas.
3 = 1 1 + 2 2 ; (12.1)
where 1 and 2 are arbitrary complex numbers. Mathematical linearity refers
to the fact that solutions can be expressed as sums of solutions and there is no
implication that any of these solutions will necessarily describe physical situa-
tions.2 Physical linearity on the other hand — the superposition principle —
1 The presentation follows closely the work presented in [Caticha 2019]
[Johnson Caticha 2011; Nawaz Caticha 2011]. More details can be found in [Johnson 2011;
Nawaz 2012].
2 The di¤usion equation provides an illustration. Fourier series were originally invented
to describe the di¤usion of heat: a physical distribution of temperature, which can only
312 Topics in Quantum Theory*
refers to the fact that the superposition of physical solutions is also a physical
solution. The point to be emphasized is that the superposition principle is a
physical hypothesis of wide applicability that need not, however, be universally
true.
changes into
0 2 2 2 i( 2)
j 3j =j 1j 1 +j 2j 2 + 2 Re[ 1 2e
1
1 2] ; (12.5)
take positive values, is expressed as a sum of sines and cosines which cannot individually
represent physical distributions. Despite the unphysical nature of the individual sine and
cosine components the Fourier expansion is nevertheless very useful.
3 Our discussion parallels [Schrödinger 1938]. Schrödinger invoked time reversal invariance
which was a very legitimate move back in 1938 but today it is preferable to develop an
argument which does not invoke symmetries that are already known to be violated. The answer
proposed in [Pauli 1939] is worthy of note. (See also [Merzbacher 1962].) Pauli proposed that
admissible wave functions must form a basis for representations of the transformation group
that happens to be pertinent to the problem at hand. In particular, Pauli’s argument serves to
discard double-valued wave functions for describing the orbital angular momentum of scalar
particles. The question of single-valuedness was later revived in [Takabayashi 1952, 1983] in
the context of the hydrodynamical interpretation of QM, and later rephrased by in [Wallstrom
1989, 1994] as an objection to Nelson’s stochastic mechanics: are these theories equivalent to
QM or do they merely reproduce a subset of its solutions? Wallstrom’s objection to Nelson’s
stochastic mechanics is that it leads to phases and wave functions that are either both multi-
valued or both single-valued. Both alternatives are unsatisfactory because on one hand QM
requires single-valued wave functions, while on the other hand single-valued phases exclude
states that are physically relevant (e.g., states with non-zero angular momentum).
12.1 Linearity and the superposition principle 313
so that in general
0 2 2
j 3j 6= j 3j ; (12.6)
2
which precludes the interpretation of j 3 j as a probability. That is, even when
the epistemic states 1 and 2 describe actual physical situations, their super-
positions need not.
The problem does not arise when
ei( 1 2)
=1: (12.7)
If we were to group the wave functions into classes each characterized by its own
then we could have a limited version of the superposition principle that ap-
plies within each class. We conclude that beyond the linearity of the Schrödinger
equation we have a superselection rule that restricts the validity of the super-
position principle to wave functions that belong to the same -class.
To …nd the allowed values of we argue as follows. It is natural to assume
that if f ; g (at some given time t0 ) is a physical state then the state with
reversed momenta f ; g (at the same time t0 ) is an equally reasonable physical
state. Basically, the idea is that if particles can be prepared to move in one
direction, then they can also be prepared to move in the opposite direction. In
terms of wave functions the statement is that if t0 is a physically allowed initial
state, then so is t0 .4 Next we consider a generic superposition
3 = 1 + 2 : (12.8)
e2i = 1 or ei = 1: (12.10)
Thus, we are restricted to two discrete possibilities 1. Since the wave func-
tions are assumed su¢ ciently well behaved (continuous, di¤erentiable, etc.) we
conclude that they must be either single-valued, ei = 1, or double-valued,
ei = 1.
Thus, the superposition principle appears to be valid in a su¢ ciently large
number of cases to be a useful rule of thumb but it is restricted to either single-
valued or double-valued wave functions. The argument above does not exclude
4 We make no symmetry assumptions such as parity or time reversibility. It need not be
the case that there is any symmetry that relates the time evolution of t0 to that of t0 .
314 Topics in Quantum Theory*
the possibility that a multi-valued wave function might describe an actual phys-
ical situation. What the argument implies is that the superposition principle
would not extend to such states.
Entropic Dynamics of
Fermions*
Chapter 15
Entropic Dynamics of
Bosons*
Entropy V: Quantum
Entropy*
Epilogue: Towards a
Pragmatic Realism*
[Barbour 1994c] J. B. Barbour, “The emergence of time and its arrow from
timelessness”in Physical Origins of Time Asymmetry, eds. J. Halliwell et
al, (Cambridge U. Press, Cambridge 1994).
[Bohr 1934] N. Bohr, Atomic Theory and the Description of Nature (1934,
reprinted by Ox Bow Press, Woodbridge Connecticut, 1987).
[Bohr 1958] N. Bohr, Essays 1933-1957 on Atomic Physics and Human Knowl-
edge (1958, reprinted by Ox Bow Press, Woodbridge Connecticut, 1987).
[Bohr 1963] N. Bohr, Essays 1958-1962 on Atomic Physics and Human Knowl-
edge (1963, reprinted by Ox Bow Press, Woodbridge Connecticut, 1987).
[Caticha 1998c] A. Caticha, “Insufficient reason and entropy in quantum the-
ory”, Found. Phys. 30, 227 (2000); arXiv.org/abs/quant-ph/9810074.
[Caticha Golan 2014] A. Caticha and A. Golan, “An Entropic framework for
Modeling Economies”, Physica A 408, 149 (2014).
[Cox 1961] R.T. Cox, The Algebra of Probable Inference (Johns Hopkins, Bal-
timore 1961).
[Csiszar 1991] I. Csiszár, “Why least squares and maximum entropy: an ax-
iomatic approach to inference for linear inverse problems”, Ann. Stat. 19,
2032 (1991).
[Doran Lasenby 2003] C. Doran and A. Lasenby, Geometric Algebra for Physi-
cists (Cambridge U.P., Cambridge UK, 2003).
[Earman Redei 1996] J. Earman and M. Rédei, “Why ergodic theory does
not explain the success of equilibrium statistical mechanics”, Brit. J. Phil.
Sci. 47, 63-78 (1996).
[Ellis 1985] B. Ellis, “What science aims to do” in Images of Science ed. by P.
Churchland and C. Hooker (U. of Chicago Press, Chicago 1985); reprinted
in [Papineau 1996].
[Fine 1996] A. Fine, The Shaky Game – Einstein, Realism and the Quantum
Theory (University of Chicago Press, Chicago 1996).
[Garrett 1996] A. Garrett, “Belief and Desire”, Maximum Entropy and Bayesian
Methods ed. by G. R. Heidbreder (Kluwer, Dordrecht 1996).
[Grad 1961] H. Grad, “The Many Faces of Entropy”, Comm. Pure and Appl.
Math. 14, 323 (1961).
[Gregory 2005] P. C. Gregory, Bayesian Logical Data Analysis for the Phys-
ical Sciences (Cambridge UP, 2005).
[Grendar 2003] M. Grendar, Jr. and M. Grendar, “Maximum Probability
and Maximum Entropy Methods: Bayesian interpretation”, Bayesian In-
ference and Maximum Entropy Methods in Science and Engineering, ed.
by G. Erickson and Y. Zhai, AIP Conf. Proc. 707, p. 490 (2004)
(arXiv.org/abs/physics/0308005).
[Greven et al 2003] A. Greven, G. Keller, and G. Warnecke (eds.), Entropy
(Princeton U. Press, Princeton 2003).
[Groessing 2008] G. Groessing, “The vacuum fluctuation theorem: Exact
Schrödinger equation via nonequilibrium thermodynamics”, Phys. Lett.
A 372, 4556 (2008).
[Groessing 2009] G. Groessing, “On the thermodynamic origin of the quan-
tum potential”, Physica A 388, 811 (2009).
[Guerra 1981] F. Guerra, “Structural aspects of stochastic mechanics and sto-
chastic field theory”, Phys. Rep. 77, 263 (1981).
[Guerra Morato 1983] F. Guerra and L. Morato, “Quantization of dynami-
cal systems and stochastic control theory”, Phys. Rev. D27, 1774 (1983).
[Guillemin Sternberg 1984] V. Guillemin and S. Sternberg, Symplectic tech-
niques in physics (Cambridge U. Press, Cambridge 1984).
[Hacking 2001] I. Hacking, An Introduction to Probability and Inductive Logic
(Cambridge U. Press, Cambridge 2001).
[Hall Reginatto 2002a] M. J. W. Hall and M. Reginatto, “Schrödinger equa-
tion from an exact uncertainty principle”, J. Phys. A 35, 3289 (2002).
[Hall Reginatto 2002b] M. J. W. Hall and M. Reginatto, “Quantum mechan-
ics from a Heisenberg-type equality”, Fortschr. Phys. 50, 646 (2002).
[Hall Reginatto 2016] M. J. W. Hall and M. Reginatto, Ensembles on Con-
figuration Space (Springer, Switzerland, 2016).
[Halpern 1999] J. Y. Halpern, “A Counterexample to Theorems of Cox and
Fine”, Journal of Artificial Intelligence Research 10, 67 (1999).
[Hardy 2001] L. Hardy, “Quantum Theory From Five Reasonable Axioms”
(arXiv.org/abs/quant-ph/0101012).
[Hardy 2011] L. Hardy, “Reformulating and Reconstructing Quantum The-
ory” (arXiv:1104.2066).
[Harrigan Spekkens 2010] N. Harrigan and R. Spekkens, “Einstein, Incom-
pleteness, and the Epistemic View of Quantum States”, Found. Phys.
40, 125 (2010).
[Ipek Caticha 2020] S. Ipek and A. Caticha, “The Entropic Dynamics of Quan-
tum Scalar Fields coupled to Gravity,” Symmetry 12, 1324 (2020); arXiv:
2006.05036.
[Ipek 2021] S. Ipek, The Entropic Dynamics of Relativistic Quantum Fields in
Curved Spacetime, Ph.D. Thesis, University at Albany, State University
of New York, 2021; arXiv:2105.07042 [gr-qc].
[Jaeger 2009] G. Jaeger, Entanglement, Information, and the Interpretation
of Quantum Mechanics (Springer, Berlin 2009).
[James 1897] W. James, The Will to Believe (1897, reprinted by Dover, New
York 1956).
[James 1907] W. James, Pragmatism (1907, reprinted by Dover, 1995).
[James 1911] W. James, The Meaning of Truth (1911, reprinted by Prometheus,
1997).
[Jammer 1966] M. Jammer, The Conceptual Development of Quantum Me-
chanics (McGraw-Hill, New York 1966).
[Jammer 1974] M. Jammer, The Philosophy of Quantum Mechanics – The
Interpretations of Quantum Mechanics in Historical Perspective (Wiley,
New York 1974).
[Jaynes 1957a] E. T. Jaynes, “How does the Brain do Plausible Reasoning”,
Stanford Univ. Microwave Lab. report 421 (1957); also published in
Maximum Entropy and Bayesian Methods in Science and Engineering,
G. J. Erickson and C. R. Smith (eds.) (Kluwer, Dordrecht 1988) and at
http://bayes.wustl.edu.
[Jaynes 1957b] E. T. Jaynes, “Information Theory and Statistical Mechan-
ics”, Phys. Rev. 106, 620 (1957).
[Jaynes 1957c] E. T. Jaynes, “Information Theory and Statistical Mechanics.
II”, Phys. Rev. 108, 171 (1957).
[Jaynes 1963] E. T. Jaynes, “Information Theory and Statistical Mechanics,”
in Statistical Physics, Brandeis Lectures in Theoretical Physics, K. Ford
(ed.), Vol. 3, p.181 (Benjamin, New York, 1963).
[Jaynes 1965] E. T. Jaynes, “Gibbs vs. Boltzmann Entropies”, Am. J. Phys.
33, 391 (1965).
[Jaynes 1968] E. T. Jaynes, “Prior Probabilities”, IEEE Trans. on Systems
Science and Cybernetics SSC-4, 227 (1968) and at http://bayes.wustl.edu.
[Jaynes 1979] E. T. Jaynes, “Where do we stand on maximum entropy?” in The
Maximum Entropy Formalism ed. by R. D. Levine and M. Tribus (MIT
Press 1979); reprinted in [Jaynes 1983] and at http://bayes.wustl.edu.
[Mehra 1998] J. Mehra, “Josiah Willard Gibbs and the Foundations of Sta-
tistical Mechanics”, Found. Phys. 28, 1785 (1998).
[Nelson 1986] E. Nelson, “Field theory and the future of stochastic mechan-
ics”, in Stochastic Processes in Classical and Quantum Systems, ed. by
S. Albeverio et al., Lecture Notes in Physics 262 (Springer, Berlin 1986).
[Newton 1693] Isaac Newton’s third letter to Bentley, February 25, 1693 in
Isaac Newton’s papers and letters on Natural Philosophy and related doc-
uments, ed. by I. B. Cohen (Cambridge, 1958), p. 302.
[Pauli 1939] W. Pauli, Helv. Phys. Acta 12, 147 (1939) and W. Pauli, Gen-
eral Principles of Quantum Mechanics section 6 (Springer-Verlag, Berlin
1980).
[de la Peña Cetto 2014] L. de la Peña and A.M. Cetto, The Emerging
Quantum: The Physics Behind Quantum Mechanics (Springer, 2014).
[Putnam 1987] H. Putnam, The Many Faces of Realism (Open Court, LaSalle,
Illinois 1987).
[Rao 1945] C. R. Rao, “Information and the accuracy attainable in the es-
timation of statistical parameters”, Bull. Calcutta Math. Soc. 37, 81
(1945).
[Reginatto Hall 2011] M. Reginatto and M.J.W. Hall, “Quantum theory from
the geometry of evolving probabilities,” AIP Conf. Proc. 1443, 96 (2012);
arXiv:1108.5601.
[Renyi 1961] A. Renyi, “On measures of entropy and information”, Proc. 4th
Berkeley Symposium on Mathematical Statistics and Probability, Vol 1, p.
547 (U. of California Press, Berkeley 1961).
[Riesz 1958] M. Riesz, Clifford Numbers and Spinors (The Institute for Fluid
Dynamics and Applied Mathematics, Lecture Series No.38, U. of Mary-
land, 1958).
[Smolin 1986a] L. Smolin, “On the nature of quantum fluctuations and their
relation to gravitation and the principle of inertia”, Class. Quantum Grav.
3, 347 (1986).
[Takabayasi 1983] T. Takabayasi, “Vortex, Spin and Triad for Quantum Me-
chanics of Spinning Particle,” Prog. Theor. Phys. 70, 1 (1983).
[Tseng Caticha 2001] C.-Y. Tseng and A. Caticha, “Yet another resolution
of the Gibbs paradox: an information theory approach”, Bayesian In-
ference and Maximum Entropy Methods in Science and Engineering, ed.
by R. L. Fry, AIP Conf. Proc. 617, 331 (2002) (arXiv.org/abs/cond-
mat/0109324).
[von Mises 1957] R. von Mises, Probability, Statistics and Truth (Dover, 1957).
[Uffink 1996] J. Uffink, “The Constraint Rule of the Maximum Entropy Prin-
ciple”, Studies in History and Philosophy of Modern Physics 27, 47 (1996).
[Uffink 2001] J. Uffink, “Bluff Your Way in the Second Law of Thermody-
namics”, Studies in History and Philosophy of Modern Physics 32(3), 305
(2001).