
ENTROPIC PHYSICS

Probability, Entropy,
and the Foundations of Physics

ARIEL CATICHA

Draft last modified on 07/26/2022. Sections marked * require revision.
Contents

Preface

1 Inductive Inference and Physics
1.1 Probability
1.1.1 The frequency interpretation
1.1.2 The Bayesian interpretations
1.1.3 Subjective or objective? Epistemic or ontic?
1.2 Designing a framework for inductive inference
1.3 Entropic Physics

2 Probability
2.1 The design of probability theory
2.1.1 Rational beliefs?
2.1.2 Quantifying rational belief
2.2 The sum rule
2.2.1 The associativity constraint
2.2.2 The general solution and its regraduation
2.2.3 The general sum rule
2.2.4 Cox's proof
2.3 The product rule
2.3.1 From four arguments down to two
2.3.2 The distributivity constraint
2.4 Some remarks on the sum and product rules
2.4.1 On meaning, ignorance and randomness
2.4.2 Independent and mutually exclusive events
2.4.3 Marginalization
2.5 How about "quantum" probabilities?
2.6 The expected value
2.7 The binomial distribution
2.8 Probability vs. frequency: the law of large numbers
2.9 The Gaussian distribution
2.9.1 The de Moivre-Laplace theorem
2.9.2 The Central Limit Theorem
2.10 Updating probabilities: Bayes' rule
2.10.1 Formulating the problem
2.10.2 Minimal updating: Bayes' rule
2.10.3 Multiple experiments, sequential updating
2.10.4 Remarks on priors*
2.11 Hypothesis testing and confirmation
2.12 Examples from data analysis
2.12.1 Parameter estimation
2.12.2 Curve fitting
2.12.3 Model selection
2.12.4 Maximum Likelihood

3 Entropy I: The Evolution of Carnot's Principle
3.1 Carnot: reversible engines
3.2 Kelvin: temperature
3.3 Clausius: entropy
3.4 Maxwell: probability
3.5 Gibbs: beyond heat
3.6 Boltzmann: entropy and probability
3.7 Some remarks

4 Entropy II: Measuring Information
4.1 Shannon's information measure
4.2 Relative entropy
4.3 Sufficiency*
4.4 Joint entropy, additivity, and subadditivity
4.5 Conditional entropy and mutual information
4.6 Continuous distributions
4.7 Experimental design
4.8 Communication Theory
4.9 Assigning probabilities: MaxEnt
4.10 Canonical distributions
4.11 On constraints and relevant information
4.12 Avoiding pitfalls – I
4.12.1 MaxEnt cannot fix flawed information
4.12.2 MaxEnt cannot supply missing information
4.12.3 Sample averages are not expected values

5 Statistical Mechanics
5.1 Liouville's theorem
5.2 Derivation of Equal a Priori Probabilities
5.3 The constraints for thermal equilibrium
5.4 The canonical formalism
5.5 Equilibrium with a heat bath of finite size
5.6 The thermodynamic limit
5.7 The Second Law of Thermodynamics
5.8 Interpretation of the Second Law: Reproducibility
5.9 On reversibility, irreversibility, and the arrow of time
5.10 Avoiding pitfalls – II: is this a 2nd law?
5.11 Entropies, descriptions and the Gibbs paradox

6 Entropy III: Updating Probabilities
6.1 What is information?
6.2 The design of entropic inference
6.2.1 General criteria
6.2.2 Entropy as a tool for updating probabilities
6.2.3 Specific design criteria
6.2.4 The ME method
6.3 The proofs
6.4 Consistency with the law of large numbers
6.5 Random remarks
6.5.1 On priors
6.5.2 Informative and non-informative priors*
6.5.3 Comments on other axiomatizations
6.6 Bayes' rule as a special case of ME
6.7 Commuting and non-commuting constraints
6.8 Conclusion

7 Information Geometry
7.1 Examples of statistical manifolds
7.2 Vectors in curved spaces
7.3 Distance and volume in curved spaces
7.4 Derivations of the information metric
7.4.1 From distinguishability
7.4.2 From embedding in a Euclidean space
7.4.3 From embedding in a spherically symmetric space
7.4.4 From asymptotic inference
7.4.5 From relative entropy
7.5 Uniqueness of the information metric
7.6 The metric for some common distributions
7.7 Dimensionless distance?

8 Entropy IV: Entropic Inference
8.1 Deviations from maximum entropy
8.2 The ME method
8.3 Avoiding pitfalls – III
8.3.1 The three-sided die
8.3.2 Understanding ignorance

9 Topics in Statistical Mechanics*
9.1 An application to fluctuations
9.2 Variational approximation methods – I*
9.2.1 Mean field Theory*
9.2.2 Classical density functional theory*

10 A Prelude to Dynamics: Kinematics
10.1 Gradients and covectors
10.2 Lie derivatives
10.2.1 Lie derivative of vectors
10.2.2 Lie derivative of covectors
10.2.3 Lie derivative of the metric
10.3 The cotangent bundle
10.3.1 Vectors, covectors, etc.
10.4 Hamiltonian flows
10.4.1 The symplectic form
10.4.2 Hamilton's equations and Poisson brackets
10.5 The information geometry of e-phase space
10.5.1 The metric of e-phase space T P
10.5.2 A complex structure for T P
10.6 Quantum kinematics: symplectic and metric structures
10.6.1 The normalization constraint
10.6.2 The embedding space T S +
10.6.3 The metric induced on the e-phase space T S
10.6.4 Refining the choice of cotangent space
10.7 Quantum kinematics: Hamilton-Killing flows
10.8 Hilbert space
10.9 Assessment: is this all there is to Quantum Mechanics?

11 Entropic Dynamics: Time and Quantum Theory
11.1 Mechanics without mechanism
11.2 The ontic microstates
11.3 The entropic dynamics of short steps
11.3.1 The prior
11.3.2 The phase constraint
11.3.3 The gauge constraints
11.3.4 The transition probability
11.3.5 Invariance under gauge transformations
11.4 Entropic time
11.4.1 Time as an ordered sequence of instants
11.4.2 The arrow of entropic time
11.4.3 Duration
11.5 Brownian sub-quantum motion and the evolution equation
11.5.1 The information metric of configuration space
11.5.2 The evolution equation in differential form
11.5.3 The current and osmotic velocities
11.5.4 A quick review of functional derivatives
11.5.5 The evolution equation in Hamiltonian form
11.5.6 The future and past drift velocities
11.6 An alternative: Bohmian sub-quantum motion
11.6.1 The evolution equation in differential form
11.7 The epistemic phase space
11.7.1 The symplectic form in ED
11.7.2 Hamiltonian flows
11.7.3 The normalization constraint
11.8 The information geometry of e-phase space
11.8.1 The embedding space T S +
11.8.2 The metric induced on the e-phase space T S
11.8.3 A simpler embedding
11.8.4 Refining the choice of cotangent space
11.9 Hamilton-Killing flows
11.10 The e-Hamiltonian
11.11 Entropic time, physical time, and time reversal
11.12 Hilbert space
11.13 Summary

12 Topics in Quantum Theory*
12.1 Linearity and the superposition principle
12.1.1 The single-valuedness of
12.1.2 Charge quantization
12.2 Momentum in Entropic Dynamics*
12.2.1 Uncertainty relations
12.3 The classical limit*
12.4 Elementary applications*
12.4.1 The free particle*
12.4.2 The double-slit experiment*
12.4.3 Tunneling*
12.4.4 Entangled particles*

13 The quantum measurement problem*
13.1 Measuring position: amplification*
13.2 "Measuring" other observables and the Born rule*
13.3 Evading no-go theorems*
13.4 Weak measurements*
13.5 The qubit*
13.6 Contextuality*
13.7 Delayed choice experiments*

14 Entropic Dynamics of Fermions*

15 Entropic Dynamics of Spin*
15.1 Geometric algebra*
15.1.1 Multivectors and the geometric product*
15.1.2 Spinors*
15.2 Spin and the Pauli equation*
15.3 Entangled spins*

16 Entropic Dynamics of Bosons*
16.1 Boson fields*
16.2 Boson particles*

17 Entropy V: Quantum Entropy*
17.1 Density matrices*
17.2 Ranking density matrices*
17.3 The quantum maximum entropy method*
17.4 Decoherence*
17.5 Variational approximation methods – II*
17.5.1 The variational method*
17.5.2 Quantum density functional formalism*

18 Epilogue: Towards a Pragmatic Realism*
18.1 Background: the tensions within realism*
18.2 Pragmatic realism*

References
Preface*

Science consists in using information about the world for the purpose of pre-
dicting, explaining, understanding, and/or controlling phenomena of interest.
The basic difficulty is that the available information is usually insufficient to
attain any of those goals with certainty. A central concern in these lectures will
be the problem of inductive inference, that is, the problem of reasoning under
conditions of incomplete information.
Our goal is twofold. First, to develop the main tools for inference — proba-
bility and entropy — and to demonstrate their use. And second, to demonstrate
their importance for physics. More specifically our goal is to clarify the con-
ceptual foundations of physics by deriving the fundamental laws of statistical
mechanics and of quantum mechanics as examples of inductive inference. Per-
haps all physics can be derived in this way.
The level of these lectures is somewhat uneven. Some topics are fairly ad-
vanced — the subject of recent research — while some other topics are very
elementary. I can give two related reasons for including both in the same book.
The first is pedagogical: these are lectures — the easy stuff has to be taught
too. More importantly, the standard education of physicists includes a very
inadequate study of probability and even of entropy. The result is a widespread
misconception that these “elementary” subjects are trivial and unproblematic
— that the real problems of theoretical and experimental physics lie elsewhere.
As for the second reason, it is inconceivable that the interpretations of prob-
ability and of entropy would turn out to bear no relation to our understanding
of physics. Indeed, if the only notion of probability at our disposal is that of
a frequency in a large number of trials one might be led to think that the en-
sembles of statistical mechanics must be real, and to regard their absence as an
urgent problem demanding an immediate solution — perhaps an ergodic solu-
tion. One might also be led to think that analogous ensembles are needed in
quantum theory perhaps in the form of parallel worlds. Similarly, if the only
available notion of entropy is derived from thermodynamics, one might end up
thinking that entropy is some physical quantity that can be measured in the lab,
and that it has little or no relevance beyond statistical mechanics. Thus, it is
worthwhile to revisit the "elementary" basics because usually the basics are not
elementary at all, and even more importantly, because they are so fundamental.
Acknowledgements: Most specially I am indebted to N. Caticha and C. R.
Rodríguez, whose views on these matters have over the years profoundly in-
fluenced my own, but I have also learned much from discussions with many
colleagues and friends: D. Bartolomeo, C. Cafaro, N. Carrara, S. DiFranzo, V.
Dose, K. Earle, A. Giffin, A. Golan, S. Ipek, D. T. Johnson, K. Knuth, O. Lunin,
S. Nawaz, P. Pessoa, R. Preuss, M. Reginatto, J. Skilling, J. Stern, C.-Y. Tseng,
K. Vanslette, and A. Yousefi. I would also like to thank all the students who
over the years have taken my course on Information Physics; their questions
and doubts have very often pointed the way to clarifying my own questions and
doubts.

Albany, ...
Chapter 1

Inductive Inference and Physics

The process of drawing conclusions from available information is called inference.


When the available information is sufficient to make unequivocal assessments of
truth we speak of making deductions — on the basis of a certain piece of infor-
mation we deduce that a certain proposition is true. The method of reasoning
leading to deductive inferences is called logic. Situations where the available
information is insufficient to reach such certainty lie outside the realm of logic.
In these cases we speak of doing inductive inference, and the methods deployed
are those of probability theory and entropic inference.

1.1 Probability
The question of the meaning and interpretation of the concept of probability
has long been controversial. It is clear that the interpretations offered by var-
ious schools are at least partially successful or else they would already have
been discarded long ago. But the different interpretations are not equivalent.
They lead people to ask different questions and to pursue their research in dif-
ferent directions. Some questions may become essential and urgent under one
interpretation while totally irrelevant under another. And perhaps even more
important: under different interpretations equations can be used differently and
this can lead to different predictions.

1.1.1 The frequency interpretation


Historically the frequentist interpretation has been the most popular: the prob-
ability of a random event is given by the relative number of occurrences of the
event in a sufficiently large number of identical and independent trials. The
appeal of this interpretation is that it seems to provide an empirical method to
estimate probabilities by counting over the ensemble of trials. The magnitude
of a probability is obtained solely from the observation of many repeated trials,
a process that is thought to be independent of any feature or characteristic of
the observers. Probabilities interpreted in this way have been called objective.
This view has dominated the fields of statistics and physics for most of the 19th
and 20th centuries (see, e.g., [von Mises 1957]).
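As a concrete illustration of this counting prescription, one can simulate a sequence of independent trials and watch the relative frequency settle down. In the sketch below the biased "coin", its underlying probability of 0.3, and the trial counts are arbitrary choices made only for illustration:

```python
import random

# Estimate a probability as a relative frequency over repeated trials.
# The underlying probability 0.3 and the trial counts are arbitrary choices.
def relative_frequency(p_true, n_trials, seed=0):
    rng = random.Random(seed)
    successes = sum(rng.random() < p_true for _ in range(n_trials))
    return successes / n_trials

for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency(0.3, n))
# The estimates drift toward 0.3 as n grows, but any finite ensemble leaves
# open the questions raised next: how large must n be, and what exactly
# counts as a sufficiently "identical and independent" trial?
```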
One disadvantage of the frequentist approach has to do with matters of rigor:
what precisely does one mean by 'random'? If the trials are sufficiently identical,
shouldn’t one always obtain the same outcome? Also, if the interpretation is to
be validated on the basis of its operational, empirical value, how large should
the number of trials be? Unfortunately, the answers to these questions are
neither easy nor free from controversy. By the time the tentative answers have
reached a moderately acceptable level of sophistication the intuitive appeal of
this interpretation has long been lost. In the end, it seems the frequentist
interpretation is most useful when left a bit vague.
A more serious objection is the following. In the frequentist approach the
notion of an ensemble of trials is central. In cases where there is a natural
ensemble (tossing a coin, or a die, spins in a lattice, etc.) the frequency inter-
pretation seems natural enough. But even then there is always the question of
choosing the relevant ensemble: doesn't this choice reflect somebody's decisions,
interests, and purposes? How objective is this choice of ensemble? Indeed, for
many other problems the construction of an ensemble is at best highly artifi-
cial. For example, consider the probability of there being life on Mars. Are we
to imagine an ensemble of Mars planets and solar systems? In these cases the
ensemble would be purely hypothetical. It offers no possibility of an empirical
determination of a relative frequency and this defeats the original goal of pro-
viding an objective operational interpretation of probabilities as frequencies. In
yet other problems there is no ensemble at all: consider the probability that the
nth digit of the number π be 7. Are we to imagine alternative universes with
different values for the number π?
It is clear that there exist a number of interesting problems where one sus-
pects the notion of probability could be quite useful but which nevertheless lie
outside the domain of the frequentist approach.

1.1.2 The Bayesian interpretations


According to the Bayesian interpretations, which can be traced back to Bernoulli
and Laplace, but have only achieved popularity in the last few decades, a prob-
ability reflects the degree of belief of an agent in the truth of a proposition.1
These probabilities are said to be Bayesian because of the central role played
1 'Degree of belief' is only a quick and dirty way to describe what Bayesian probabilities
are about but there are many shades of interpretation. As we shall later argue a more useful
definition of probability is the degree to which an ideally rational agent ought to believe in
the truth of a proposition. Other interpretations include, for example, the degree of personal
belief as contrasted to a degree of rational belief or reasonable expectation, the degree of
plausibility of a proposition, the degree of credibility, and also the degree of implication (the
degree to which b implies a is the conditional probability of a given b).

by Bayes' theorem – a theorem which was first written by Laplace. This ap-
proach enjoys several advantages. One is that the difficulties associated with
attempting to pinpoint the precise meaning of the word 'random' can be avoided.
Bayesian probabilities allow us to reason in a consistent and rational manner
about quantities, such as parameters, that rather than being random might
be merely unknown. Also Bayesian probabilities are not restricted to repeat-
able events and therefore they allow us to reason about singular, unique events.
Thus, in going from the frequentist to the Bayesian interpretations the domain
of applicability and therefore the usefulness of the concept of probability is
considerably enlarged.
The crucial aspect of Bayesian probabilities is that different agents may
have different degrees of belief in the truth of the very same proposition, a
fact that is described by referring to Bayesian probabilities as being subjective.
This term is somewhat misleading. At one end of the spectrum we find the so-
called subjective Bayesian or personalistic view (see, e.g., [Savage 1972; Howson
Urbach 1993; Jeffrey 2004]), and at the other end there is the objective Bayesian
view (see e.g. [Jeffreys 1939; Cox, 1946; Jaynes 1985, 2003; Lucas 1970]).
For an excellent elementary introduction with a philosophical perspective see
[Hacking 2001]. According to the subjective view, two reasonable individuals
faced with the same evidence, the same information, can legitimately differ in
their confidence in the truth of a proposition and may therefore assign different
degrees of personal belief. Subjective Bayesians accept that individuals can
change their beliefs, merely on the basis of introspection, reasoning, or even
revelation.
At the other end of the Bayesian spectrum, the objective Bayesian view
considers the theory of probability as an extension of logic. It is then said
that a probability measures a degree of implication, it is the degree to which
one proposition implies another. It is assumed that the objective Bayesian
has thought so long and so hard about how probabilities are assigned that no
further reasoning will induce a revision of beliefs except when confronted with
new information. In an ideal situation two different individuals will, on the basis
of the same information, assign the same probabilities.

1.1.3 Subjective or objective? Epistemic or ontic?


Whether Bayesian probabilities are subjective or objective is still a matter of
controversy. The confusion arises in large part because there are two different
senses in which the subjective/objective distinction can be drawn. One is onto-
logical (related to existence, to being) and the other epistemological (related to
knowledge and opinion).
In the ontological sense entities such as pains and emotions are said to be sub-
jective because they exist only as experienced by some agent. Other entities,
such as atoms and chairs, are called objective because they presumably exist
out there in the real world quite independently of any agent. On the other hand
there is the epistemological sense. A proposition that states a fact is said to
be (epistemically) objective in that its truth can in principle be established in-
dependently of anyone's attitudes or thoughts while a proposition that reflects
a value judgment is said to be (epistemically) subjective. Unfortunately the
boundaries between these distinctions are not nearly as sharply defined as one
might wish.
Bayesian probabilities are ontologically subjective because they are tools for
reasoning. They exist only in the minds of those agents who use them. This is
in stark contrast to those interpretations in which probabilities are conceived as
something physically real. Examples of the latter include Popper’s interpreta-
tion of probability as a physical propensity, and Heisenberg’s probability as an
objective potentiality, and then there is also probability as the objective chance
for an event to actually happen. Such ontologically objective "probabilities" are
not tools for reasoning or for inference. Unlike other physical entities – such
as particles and fields – postulating their existence has not led to successful
theoretical models.
In the epistemological sense however the characterization of probabilities
as either subjective or objective is not nearly so clear. Our position is that
probabilities can lie anywhere in between. Probabilities will always retain a
subjective element because translating information into probabilities involves
judgments and different people will often judge differently. On the other hand,
it is a presupposition of thought itself that some beliefs are better than oth-
ers — otherwise why go through the trouble of thinking? And they can be
“objectively” better in that they provide better guidance about how to cope
with the world.2 The adoption of better beliefs has real consequences. Not all
probability assignments are equally useful and it is plausible that what makes
some assignments better than others is that they correspond to some objective
feature of the world. One might even say that what makes them better is that
they provide a better guide to the “truth” — whatever that might be.
We shall find that while the epistemic subjectivity can only be eliminated
in idealized situations, the rules for processing information, that is, the rules
for updating probabilities, are considerably more objective. This means that as
new information is obtained it can be objectively incorporated into the updated
probabilities. Ultimately, it is the conviction that the updated or posterior
probabilities are somehow objectively better than the original prior probabil-
ities that provides the justi…cation for going through the trouble of gathering
information to update our beliefs.
To summarize: Referring to probabilities as subjective or objective can lead
to confusion because it is not clear whether these terms are deployed in an
ontological or an epistemological sense. As we shall see in the next chapter
probabilities will be designed as tools for handling uncertainty, for dealing with
incomplete information. Such probabilities will turn out to be ontologically
subjective but epistemically they can lie anywhere from fully objective to fully
subjective.

2 The approach is decidedly pragmatic: the purpose of thinking is to acquire beliefs; the
purpose of beliefs is to guide our actions.



A suitable terminology A central concern is to maintain a clear distinction
between ontologically objective and ontologically subjective entities. Ontologi-
cally objective entities will be succinctly described as being real or ontic. Such
things exist in the sense that, at least within our theories, they constitute the
furniture of the world; examples include familiar macroscopic objects such as
tables and chairs.3 The only ontologically subjective entities that will concern
us here are probabilities and those other concepts closely associated with them
such as entropies or quantum wave functions. Such entities will be referred
to as epistemic.4 Thus, we shall avoid the terms objective/subjective in the
ontological sense and replace them by the terms ontic/epistemic.
On the other hand, in their epistemological sense the terms objective/subjective
are useful and must be retained. There is much to be gained by rejecting a sharp
objective/subjective dichotomy and replacing it with a continuous spectrum of
intermediate possibilities.5 Probabilities are epistemically hybrid; they incorpo-
rate both subjective and objective elements. We process information to suppress
the former and enhance the latter because this is what leads to probabilities that
are useful in practice.

1.2 Designing a framework for inductive inference
A common hope in science, mathematics and philosophy has been to find a
secure foundation for knowledge. So far the search has not been successful and
everything indicates that such indubitable foundation is nowhere to be found.
Accordingly, we adopt a pragmatic attitude: there are ideas about which we can
have greater or lesser con…dence, and from these we can infer the plausibility of
others; but there is nothing about which we can have full certainty and complete
knowledge.
Inductive inference in its Bayesian/entropic form is a framework designed
for the purpose of coping with the world in a rational way in situations where
the information available is incomplete. The framework must solve two related
problems. First, it must allow for convenient representations of states of partial
belief — this is handled through the introduction of probabilities. Second, it
must allow us to update from one state of belief to another in the fortunate
circumstance that some new information becomes available — this is handled
3 The ontic status of microscopic things depends on the particular model or theory. As
we shall see in Chapter 11 in non-relativistic quantum mechanics particles and atoms will
be described as ontic. However, in relativistic quantum field theory it is the fields that are
ontic and those excited states (probabilistic distributions of fields) that are called particles
are epistemic. Our position reflects a pragmatic realism that is close to the internal realism
advocated by H. Putnam [Putnam 1979, 1981, 1987].
4 The term 'epistemic' is not appropriate for emotions or values. However, such ontologically
subjective entities will not enter our discussions and there is no pressing need to accommodate
them in our terminology.
5 Here again, our position bears some resemblance to that of H. Putnam who has forcefully
argued for the rejection of the fact/value dichotomy [Putnam 1991, 2003].

through the introduction of relative entropy as the tool for updating. The theory
of probability would be severely handicapped – indeed it would be quite useless –
without a companion theory for updating probabilities.
The framework for inference will be constructed by a process of eliminative
induction.6 The objective is to design the appropriate tools, which in our case,
means designing the theory of probability and entropy. The different ways in
which probabilities and entropies are defined and handled will lead to different
inference schemes and one can imagine a vast variety of possibilities. To select
one we must first have a clear idea of the function that those tools are supposed
to perform, that is, we must specify design criteria or design specifications that
the desired inference framework must obey. Then, in the eliminative part of the
process one proceeds to systematically rule out all those inference schemes that
fail to perform as desired.
There is no implication that an inference framework designed in this way is
in any way “true”, or that it succeeds because it achieves some special intimate
agreement with reality. Instead, the claim is pragmatic: the method succeeds to
the extent that the inference framework works as designed and its performance
will be deemed satisfactory as long as it leads to scientific models that are
empirically adequate. Whatever design criteria are chosen, they are meant to
be only provisional — just like everything else in science, there is no reason to
consider them immune from further change and improvement.
The pros and cons of eliminative induction have been the subject of con-
siderable philosophical research (e.g. [Earman 1992; Hawthorne 1993; Godfrey-
Smith 2003]). On the negative side, eliminative induction, like any other form
of induction, is not guaranteed to work. On the positive side, eliminative in-
duction adds an interesting twist to Popper's scientific methodology. According
to Popper scientific theories can never be proved right, they can only be proved
false; a theory is corroborated only to the extent that all attempts at falsifying
it have failed. Eliminative induction is fully compatible with Popper's notions
but the point of view is just the opposite. Instead of focusing on failure to
falsify one focuses on success: it is the successful falsification of all rival theories
that corroborates the surviving one. The advantage is that one acquires a more
explicit understanding of why competing theories are eliminated.
In chapter 2 we address the problem of the design and construction of prob-
ability theory as a tool for inference. In other words, we show that degrees of
rational belief, those measures of plausibility that we require to do inference,
should be manipulated and calculated according to the ordinary rules of the
calculus of probabilities.
The problem of designing a theory for updating probabilities is addressed
mostly in chapter 6 and then completed in chapter 8. We discuss the central
6 Eliminative induction is a method to select one alternative from within a set of possible
ones. For example, to select the right answer to a question one considers a set of possible
candidate answers and proceeds to systematically eliminate those that are found wrong or
unacceptable in one way or another. The answer that survives after all others have been ruled
out is the best choice. There is, of course, no guarantee that the last standing alternative is
the correct answer – the only certainty is that all other answers were definitely wrong.

question "What is information?" and show that there is a unique method to
update from an old set of beliefs codified in a prior probability distribution into
a new set of beliefs described by a new, posterior distribution when the informa-
tion available is in the form of a constraint on the family of acceptable posteriors.
In this approach the tool for updating is entropy. A central achievement is the
complete unification of Bayesian and entropic methods.

1.3 Entropic Physics


Once the framework of entropic inference has been constructed we deploy it to
clarify the conceptual foundations of physics.
Prior to the work of E.T. Jaynes it was suspected that there was a connec-
tion between thermodynamics and information theory. But the connection took
the form of an analogy between the two fields: Shannon's information theory
was designed to be useful in engineering7 while thermodynamics was meant to
be "true" by virtue of reflecting "laws of nature". The gap was enormous and to
this day many still think that the analogy is purely accidental. With the work of
Jaynes, however, it became clear that the connection is not an accident: the cru-
cial link is that both situations involve reasoning with incomplete information.
This development was significant for many subjects — engineering, statistics,
computation — but for physics the impact of such a change in perspective is
absolutely enormous: thermodynamics and statistical mechanics provided the
first example of a fundamental theory that, instead of being a direct image of
nature, should be interpreted as a scheme for inference about nature.
Our goal in chapter 5 is to provide an explicit discussion of statistical me-
chanics as an example of entropic inference; the chapter is devoted to discussing
and clarifying the foundations of thermodynamics and statistical mechanics.
The development is carried largely within the context of Jaynes’ MaxEnt for-
malism and we show how several central topics such as the equal probability
postulate, the second law of thermodynamics, irreversibility, reproducibility,
and the Gibbs paradox can be considerably clarified when viewed from the in-
formation/inference perspective.
The insight derived from recognizing that one physical theory is an example
of inference leads to the obvious question: is statistical mechanics the only
one or are there other examples? The answer is yes. Starting in chapter 10
we explore new territory devoted to deriving quantum theory as an example
of entropic inference. The challenge is that the theory involves dynamics and
time in a fundamental way. It is significant that the full framework of entropic
inference derived in chapters 6, 7, and 8 is needed here — the old entropic
methods developed by Shannon and Jaynes are no longer sufficient.
The payoff is considerable. The mathematical framework of quantum me-
chanics is derived and the entropic approach offers new insights into many topics
that are central to quantum theory: the interpretation of the wave function, the
7 Even as late as 1961 Shannon expressed doubts that information theory would ever find
application in fields other than communication theory. [Tribus 1978]



wave-particle duality, the quantum measurement problem, the introduction and
interpretation of observables other than position, including momentum, the cor-
responding uncertainty relations, and most important, it leads to a theory of
entropic time. The overall conclusion is that the laws of quantum mechanics are
not laws of nature; they are rules for processing information about nature.
Chapter 2

Probability

Our goal is to establish the theory of probability as the general theory for
reasoning on the basis of incomplete information. This requires us to tackle
two different problems. The first problem is to figure out how to achieve a
quantitative description of a state of partial knowledge. Once this is settled we
address the second problem of how to update from one state of knowledge to
another when new information becomes available.
Throughout we will assume that the subject matter – the set of propositions
the truth of which we want to assess – has been clearly specified. This question
of what it is that we are actually talking about is much less trivial than it might
appear at first sight.1 Nevertheless, it will not be discussed further.
The first problem, that of describing or characterizing a state of partial
knowledge, requires that we quantify the degree to which we believe each propo-
sition in the set is true. The most basic feature of these beliefs is that they form
an interconnected web that must be internally consistent. The idea is that in
general the strengths of one’s beliefs in some propositions are constrained by
one’s beliefs in other propositions; beliefs are not independent of each other. For
example, the belief in the truth of a certain statement a is strongly constrained
by the belief in the truth of its negation, not-a: the more I believe in one, the
less I believe in the other.
In this chapter we will also address a special case of the second problem
— that of updating from one consistent web of beliefs to another when new
information in the form of data becomes available. The basic updating strategy
reflects the conviction that what we learned in the past is valuable, that the web
of beliefs should only be revised to the extent required by the data. We will see
that this principle of minimal updating leads to the uniquely natural rule that
is widely known as Bayes’ rule.2 As an illustration of the enormous power of
1 Consider the example of quantum mechanics: Are we talking about particles, or about
experimental setups, or both? Is it the position of the particles or the position of the detectors?
Are we talking about position variables, or about momenta, or both? Or neither?
2 The presentation in this chapter includes material published in [Caticha Giffin 2006,
Caticha 2007, Caticha 2009, Caticha 2014a].



Bayes' rule we will briefly explore its application to data analysis. As we shall
see in later chapters the minimal updating principle is not restricted to data
but can also be deployed to process more general kinds of information. This
will require the design of a more sophisticated updating tool – relative entropy.
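As a preview of the updating rule derived in section 2.10, the sketch below carries out a single Bayes update on a toy problem; the three hypotheses, their prior probabilities, and the likelihoods are invented purely for illustration:

```python
# A single application of Bayes' rule: posterior(h) is proportional to
# prior(h) * p(data | h). The hypotheses, priors, and likelihoods below
# are invented for illustration only.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.70}          # p(data | h)

evidence = sum(prior[h] * likelihood[h] for h in prior)    # p(data)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}

print(posterior)
# The data shift belief toward h3 while revising the web of beliefs only as
# much as required: hypotheses are re-weighted by how well they fit the data.
```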

2.1 The design of probability theory


Science requires a framework for inference on the basis of incomplete informa-
tion. We will show that the quantitative measures of plausibility or degrees of
belief that are the tools for reasoning should be manipulated and calculated
using the ordinary rules of the calculus of probabilities — and therefore proba-
bilities can be interpreted as degrees of belief.
The procedure we follow differs in one remarkable way from the traditional
way of setting up physical theories. Normally one starts with a mathematical
formalism, and then one proceeds to try to figure out what the formalism might
possibly mean; one tries to append an interpretation to it. This turns out to be
a very difficult problem; historically it has affected not only statistical physics —
what is the meaning of probabilities and of entropy — but also quantum theory
— what is the meaning of probabilities, wave functions and amplitudes. Here we
proceed in the opposite order, we first decide what we are talking about, degrees
of belief or degrees of plausibility (we use the two expressions interchangeably)
and then we design rules to manipulate them; we design the formalism; we
construct it to suit our purposes. The advantage of this pragmatic approach is
that the issue of meaning, of interpretation, is settled from the start.

2.1.1 Rational beliefs?


Before we proceed further it may be important to emphasize that the degrees
of belief discussed here are those held by an idealized rational agent that would
not be subject to the practical limitations under which we humans operate.
Different individuals may hold different beliefs and it is certainly important to
figure out what those beliefs might be — perhaps by observing their gambling
behavior — but this is not our present concern. Our objective is neither to
assess nor to describe the subjective beliefs of any particular individual — this
important task is best left to psychology and to cognitive science. Instead we
deal with the altogether different but very common problem that arises when
we are confused and we want some guidance about what we are supposed to
believe. Our concern here is not so much with beliefs as they actually are, but
rather, with beliefs as they ought to be — at least as they ought to be to deserve
to be called rational. We are concerned with an idealized standard of rationality
that we humans ought to strive for at least when discussing scientific matters.
The concept of rationality is notoriously difficult to pin down. We adopt a
pragmatic approach: 'rational' is a compliment we pay to a particular mode of
argument that appears to lead to reliable conclusions. One thing we can safely
assert is that rational beliefs are constrained beliefs. The essence of rationality
lies precisely in the existence of some constraints — not everything goes. We
need to identify some normative criteria of rationality and the difficulty is to
find criteria that are sufficiently general to include all instances of rationally
justified belief. Here is our first criterion of rationality:

The inference framework must be based on assumptions that have wide
appeal and universal applicability.

Whatever guidelines we pick they must be of general applicability — otherwise


they fail when most needed, namely, when not much is known about a problem.
Di¤erent rational agents can reason about di¤erent topics, or about the same
subject but on the basis of di¤erent information, and therefore they could hold
di¤erent beliefs, but they must agree to follow the same rules. What we seek here
are not the speci…c rules of inference that will apply to this or that particular
instance; what we seek is to identify some few features that all instances of
rational inference might have in common.
The second criterion is that

The inference framework must not be self-refuting.

It is not easy to identify criteria of rationality that are sufficiently general and
precise. Perhaps we can settle for the more manageable goal of avoiding ir-
rationality in those glaring cases where it is easily recognizable. And this is
the approach we take: rather than offering a precise criterion of rationality we
design a framework with the more modest goal of avoiding some forms of irra-
tionality that are perhaps sufficiently obvious to command general agreement.
The basic requirement is that if a conclusion can be reached by arguments that
follow two different paths then the two arguments must agree. Otherwise our
framework is not performing the function for which it is being designed, namely,
to provide guidance as to what we are supposed to believe. Thus, the web of
rational beliefs must avoid inconsistencies. As we shall see this requirement
turns out to be extremely restrictive.
Finally,

The inference framework must be useful in practice — it must allow quan-
titative analysis.

Otherwise, why bother?


Whatever specific design criteria are chosen, one thing must be clear: they
are justified on purely pragmatic grounds and therefore they are meant to be
only provisional. The notion of rationality itself is not immune to change and
improvement. Given some criteria of rationality we proceed to construct models
of the world, or better, models that will help us deal with the world — predict,
control, and explain the facts. The process of improving these models — better
models are those that lead to more accurate predictions, more accurate control,
and more lucid and encompassing explanations of more facts, not just the old
facts but also of new and hopefully even unexpected facts — may eventually
suggest improvements to the rationality criteria themselves. Better rationality
leads to better models which leads to better rationality and so on. The method
of science is not independent from the contents of science.

2.1.2 Quantifying rational belief


In order to be useful we require an inference framework that allows quantitative
reasoning. The first obvious question concerns the type of quantity that will
represent the intensity of beliefs. Discrete categorical variables are not adequate
for a theory of general applicability; we need a much more refined scheme.
Do we believe proposition a more or less than proposition b? Are we even
justified in comparing propositions a and b? The problem with propositions
is not that they cannot be compared but rather that the comparison can be
carried out in too many different ways. We can classify propositions according
to the degree we believe they are true, their plausibility; or according to the
degree that we desire them to be true, their utility; or according to the degree
that they happen to bear on a particular issue at hand, their relevance. We
can even compare propositions with respect to the minimal number of bits that
are required to state them, their description length. The detailed nature of
our relations to propositions is too complex to be captured by a single real
number. What we claim is that a single real number is sufficient to measure
one specific feature, the sheer intensity of rational belief. This should not be
too controversial because it amounts to a tautology: an “intensity” is precisely
the type of quantity that admits no more qualifications than that of being more
intense or less intense; it is captured by a single real number.
However, some preconception about our subject is unavoidable; we need
some rough notion that a belief is not the same thing as a desire. But how
can we know that we have captured pure belief and not belief contaminated
with some hidden desire or something else? Strictly we can’t. We hope that
our mathematical description captures a sufficiently purified notion of rational
belief, and we can claim success only to the extent that the formalism proves to
be useful.
The inference framework will capture two intuitions about rational beliefs.
First, we take it to be a defining feature of the intensity of rational beliefs that
if a is more believable than b, and b more than c, then a is more believable
than c. Such transitive rankings can be implemented using real numbers and
therefore we are again led to claim that

Degrees of rational belief (or, as we shall later call them, probabilities)
are represented by real numbers.

Before we proceed further we need to establish some notation. The following
choice is standard.

Notation – Boolean Algebra


For every proposition a there exists its negation not-a, which will be denoted ã.
If a is true, then ã is false and vice versa.
Given any two propositions a and b we say they have the same truth value
and write a = b, when a is true if and only if b is true. The conjunction of two
propositions "a and b" is denoted ab or a ∧ b. The conjunction is true if and
only if both a and b are true. The disjunction of two propositions "a or b" is
denoted by a ∨ b or (less often) by a + b. The disjunction is true when either a
or b or both are true; it is false when both a and b are false.
With these symbols one defines an algebra of logic – a Boolean algebra.
Important properties of and and or include,
$$aa = a \; ; \qquad a \vee a = a \; ; \tag{2.1}$$
commutativity,
$$ab = ba \; ; \qquad a \vee b = b \vee a \; ; \tag{2.2}$$
associativity,
$$a(bc) = (ab)c \; ; \qquad a \vee (b \vee c) = (a \vee b) \vee c \; ; \tag{2.3}$$
and distributivity,
$$a(b \vee c) = (ab) \vee (ac) \; ; \tag{2.4}$$
$$a \vee (bc) = (a \vee b)(a \vee c) \; . \tag{2.5}$$
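These identities are easy to check concretely if one models propositions as subsets of the set of possible outcomes, with and as intersection and or as union. The following sketch (the set-theoretic model is only an illustration, not part of the formal development) verifies (2.1)-(2.5) exhaustively for the three-sided die discussed below:

```python
from itertools import chain, combinations

# Model propositions about one toss of a three-sided die as subsets of the
# outcome set {a, b, c}: "and" is set intersection, "or" is set union.
# This set-theoretic model is only an illustrative check of (2.1)-(2.5).
outcomes = ("a", "b", "c")
props = [frozenset(s) for s in
         chain.from_iterable(combinations(outcomes, r) for r in range(4))]

for a in props:
    assert (a & a) == a and (a | a) == a                      # (2.1)
    for b in props:
        assert (a & b) == (b & a) and (a | b) == (b | a)      # (2.2)
        for c in props:
            assert (a & (b & c)) == ((a & b) & c)             # (2.3)
            assert (a | (b | c)) == ((a | b) | c)
            assert (a & (b | c)) == ((a & b) | (a & c))       # (2.4)
            assert (a | (b & c)) == ((a | b) & (a | c))       # (2.5)
print("identities (2.1)-(2.5) hold in the subset model")
```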
Typically we want to quantify the degree of belief in a in the context of some
background information expressed in terms of some other proposition b. Such
propositions will be written as a|b and read "a given b", or "a assuming b". In
cases such as b = cc̃ where b is guaranteed to be false, the conditional proposition
a|b is meaningless. To simplify the notation we will write (a ∨ b)|c = a ∨ b|c and
(ab)|c = ab|c.
In any inference problem the set of all propositions that can be constructed
using or and and — the “universe of discourse” — forms an ordered lattice
structure. Figure 2.1 shows the universe of discourse for a single toss of a three-
sided die. The possible outcomes of the toss — the atomic propositions —
correspond to each one of the three faces of the die a, b, and c. Using or we
can build the lattice up until we reach the top proposition which is necessarily true.
Using and we build the lattice down; the bottom proposition is necessarily false.
The real number that represents the degree of belief in a|b will initially
be denoted [a|b] and eventually in its more standard form p(a|b) and all its
variations. Assigning a degree of belief to each proposition in the universe of
discourse — or to each node in the lattice of propositions — produces what one
might call a “web of belief”.
Degrees of rational belief will range from the extreme value v_F that rep-
resents complete certainty that the proposition is false (for example, for any
a, [ã|a] = v_F), to the opposite extreme v_T that represents certainty that the
proposition is true (for example, for any a, [a|a] = v_T). The transitivity of the
ranking scheme implies that there is a single value v_F and a single v_T.

Figure 2.1: The universe of discourse — the set of all propositions — for a
three-sided die with faces labelled a, b, and c forms an ordered lattice. The
assignment of degrees of belief to each proposition a → [a] leads to a "web
of belief". The web of belief is highly constrained because it must reflect the
structure of the underlying lattice.

The representation of OR and AND

The inference framework is designed to include a second intuition concerning
rational beliefs:

In order to be rational our beliefs in a ∨ b and ab must be somehow related to our separate beliefs in a and b.

Since the goal is to design a quantitative theory, we require that these relations
be represented by some functions F and G,

[a ∨ b|c] = F([a|c], [b|c], [a|bc], [b|ac])   (2.6)

and

[ab|c] = G([a|c], [b|c], [a|bc], [b|ac]) .   (2.7)

Note the qualitative nature of this assumption: what is being asserted is the
existence of some unspecified functions F and G and not their specific functional
forms. The same F and G are meant to apply to all propositions; what is being
designed is a single inductive scheme of universal applicability. Note further
that the arguments of F and G include all four possible degrees of belief in a
and b in the context of c and not any potentially questionable subset.
The functions F and G provide a representation of the Boolean operations
or and and. The requirement that F and G reflect the appropriate associative
and distributive properties of the Boolean and and or turns out to be extremely
constraining. Indeed, we will show that there is only one representation — all
allowed representations are equivalent to each other — and that this unique
representation is equivalent to probability theory.
In section 2.2 the associativity of or is shown to lead to a constraint that
requires the function F to be equivalent to the sum rule for probabilities. In
section 2.3 we focus on the distributive property of and over or and the corre-
sponding constraint leads to the product rule for probabilities.³
Our method will be design by eliminative induction: now that we have iden-
tified a sufficiently broad class of theories — quantitative theories of universal
applicability, with degrees of belief represented by real numbers and the oper-
ations of conjunction and disjunction represented by functions — we can start
weeding the unacceptable ones out.

An aside on the Cox axioms

The development of probability theory in the following sections follows a path
clearly inspired by [Cox 1946]. A brief comment may be appropriate.
Cox derived the sum and product rules by focusing on the properties of
conjunction and negation. He assumed as one of his axioms that the degree of
belief in a proposition a conditioned on b being true, which we write as [a|b], is
related to the degree of belief corresponding to its negation, [ã|b], through some
definite but initially unspecified function f,

[ã|b] = f([a|b]) .   (2.8)

This statement expresses the intuition that the more one believes in a|b, the less
one believes in ã|b.
A second Cox axiom is that the degree of belief of “a and b given c,” written
as [ab|c], must depend on [a|c] and [b|ac],

[ab|c] = g([a|c], [b|ac]) .   (2.9)

This is also very reasonable. When asked to check whether “a and b” is true,
we first look at a; if a turns out to be false the conjunction is false and we need
not bother with b; therefore [ab|c] must depend on [a|c]. If a turns out to be
true we need to take a further look at b; therefore [ab|c] must also depend on
[b|ac]. However, one could object that [ab|c] could in principle depend on all
four quantities [a|c], [b|c], [a|bc] and [b|ac]. This objection, which we address
below, has a long history. It was partially addressed in [Tribus 1969; Smith
Erickson 1990; Garrett 1996] and finally resolved in [Caticha 2009b].
Cox's important contribution was to realize that consistency constraints de-
rived from the associativity property of and and from the compatibility of and
with negation were sufficient to demonstrate that degrees of belief should be
manipulated according to the laws of probability theory. We shall not pursue
this line of development here. See [Cox 1946; Jaynes 1957a, 2003].

³ Our subject is degrees of rational belief but the algebraic approach followed here [Caticha
2009] can be pursued in its own right irrespective of any interpretation. It was used in
[Caticha 1998] to derive the manipulation rules for complex numbers interpreted as quantum
mechanical amplitudes; in [Knuth 2003] in the mathematical problem of assigning real numbers
(valuations) on general distributive lattices; and in [Goyal et al 2010] to justify the use of
complex numbers for quantum amplitudes.

2.2 The sum rule

Our first goal is to determine the function F that represents or. The space of
functions of four arguments is very large. To narrow down the field we initially
restrict ourselves to propositions a and b that are mutually exclusive in the
context of d. Thus,

[a ∨ b|d] = F([a|d], [b|d], v_F, v_F) ,   (2.10)

which effectively restricts F to a function of only two arguments,

[a ∨ b|d] = F([a|d], [b|d]) .   (2.11)

2.2.1 The associativity constraint

As a minimum requirement of rationality we demand that the assignment of
degrees of belief be consistent: if a degree of belief can be computed in two
different ways the two ways must agree. How else could we claim to be rational?
All functions F that fail to satisfy this constraint must be discarded.
Consider any three mutually exclusive statements a, b, and c in the context
of a fourth d. The consistency constraint that follows from the associativity of
the Boolean or,

(a ∨ b) ∨ c = a ∨ (b ∨ c) ,   (2.12)

is remarkably constraining. It essentially determines the function F. Start from

[a ∨ b ∨ c|d] = F([a ∨ b|d], [c|d]) = F([a|d], [b ∨ c|d]) .   (2.13)

Using F again for [a ∨ b|d] and also for [b ∨ c|d], we get

F{F([a|d], [b|d]), [c|d]} = F{[a|d], F([b|d], [c|d])} .   (2.14)

If we call [a|d] = x, [b|d] = y, and [c|d] = z, then

F(F(x, y), z) = F(x, F(y, z)) .   (2.15)

Since this must hold for arbitrary choices of the propositions a, b, c, and d,
we conclude that in order to be of universal applicability the function F must
satisfy (2.15) for arbitrary values of the real numbers (x, y, z). Therefore the
function F must be associative.
Remark: The requirement of universality is crucial. Indeed, in a universe of
discourse with a discrete and finite set of propositions it is conceivable that
the triples (x, y, z) in (2.15) do not form a dense set and therefore one cannot
conclude that the function F must be associative for arbitrary values of x, y,
and z. For each specific finite universe of discourse one could design a tailor-
made, single-purpose model of inference that could be consistent, i.e. it would
satisfy (2.15), without being equivalent to probability theory. However, we are
concerned with designing a theory of inference of universal applicability, a single
scheme applicable to all universes of discourse whether discrete and finite or
otherwise. And the scheme is meant to be used by all rational agents irrespective
of their state of belief — which need not be discrete. Thus, a framework designed
for broad applicability requires that the values of x form a dense set.⁴

2.2.2 The general solution and its regraduation

Equation (2.15) is a functional equation for F. It is easy to see that there exist
an infinite number of solutions. Indeed, by direct substitution one can check
that eq.(2.15) is satisfied by any function of the form

F(x, y) = φ⁻¹( φ(x) + φ(y) + β ) ,   (2.16)

where φ is an arbitrary invertible function and β is an arbitrary constant. What
is not so easy to show is that this is also the general solution, that is, given φ one
can calculate F and, conversely, given any associative F one can calculate the
corresponding φ. A proof of this result – first given by Cox – is given in section
2.2.4 [Cox 1946; Jaynes 1957a; Aczel 1966].
The significance of eq.(2.16) becomes apparent once it is rewritten as

φ(F(x, y)) = φ(x) + φ(y) + β   or   φ([a ∨ b|d]) = φ([a|d]) + φ([b|d]) + β .   (2.17)

This last form is central to Cox's approach to probability theory. Note that there
was nothing particularly special about the original representation of degrees of
plausibility by the real numbers [a|d], [b|d], ... Their only purpose was to provide
us with a ranking, an ordering of propositions according to how plausible they
are. Since the function φ(x) is monotonic, the same ordering can be achieved
using a new set of positive numbers,

ξ(a|d) ≝ φ([a|d]) + β ,   ξ(b|d) ≝ φ([b|d]) + β ,   ...   (2.18)

instead of the old. The original and the regraduated scales are equivalent be-
cause by virtue of being invertible the function φ is monotonic and therefore
preserves the ranking of propositions. See Figure 2.2. However, the regraduated
scale is much more convenient because, instead of the complicated rule (2.16),
the or operation is now represented by a much simpler rule: for mutually
exclusive propositions a and b we have the sum rule

ξ(a ∨ b|d) = ξ(a|d) + ξ(b|d) .   (2.19)

⁴ The possibility of alternative probability models was raised in [Halpern 1999]. That these
models are ruled out by universality was argued in [Van Horn 2003] and [Caticha 2009].

Figure 2.2: The degrees of belief [a] can be regraduated, [a] → ξ(a), to another
scale that is equivalent — it preserves transitivity of degrees of belief and the
associativity of or. The regraduated scale is more convenient in that it provides
a simpler representation of or — a simple sum rule.

Thus, the new numbers ξ are neither more nor less correct than the old, they
are just considerably more convenient.
Perhaps one can make the logic of regraduation a little bit clearer by consid-
ering the somewhat analogous situation of introducing the quantity temperature
as a measure of degree of “hotness”. Clearly any acceptable measure of “hot-
ness” must reflect its transitivity — if a is hotter than b and b is hotter than
c then a is hotter than c — which explains why temperatures are represented
by real numbers. But the temperature scales can be quite arbitrary. While
many temperature scales may serve equally well the purpose of ordering sys-
tems according to their hotness, there is one choice — the absolute or Kelvin
scale — that turns out to be considerably more convenient because it simplifies
the mathematical formalism. Switching from an arbitrary temperature scale to
the Kelvin scale is one instance of a convenient regraduation. (The details of
temperature regraduation are given in chapter 3.)
In the old scale, before regraduation, we had set the range of degrees of belief
from one extreme of total disbelief, [ã|a] = v_F, to the other extreme of total
certainty, [a|a] = v_T. The regraduated value ξ_F = φ(v_F) + β is easy to find.
Setting d = ãb̃ in eq.(2.19) gives

ξ(a ∨ b|ãb̃) = ξ(a|ãb̃) + ξ(b|ãb̃)   ⟹   ξ_F = 2ξ_F ,   (2.20)

and therefore

ξ_F = 0 .   (2.21)

At the opposite end, the regraduated ξ_T = φ(v_T) + β remains undetermined but
if we set b = ã eq.(2.19) leads to the following normalization condition

ξ_T = ξ(a ∨ ã|d) = ξ(a|d) + ξ(ã|d) .   (2.22)
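
As a concrete (and purely illustrative) aside, one can check eq.(2.16) and the effect of regraduation numerically. The sketch below picks one particular invertible function and constant — here φ(x) = log x and β = 0.3, an arbitrary choice made only for the example — and verifies that the resulting F is associative and that the regraduated values simply add.

```python
import numpy as np

# Hypothetical regraduating function and constant, chosen only for illustration.
phi = np.log          # an arbitrary invertible (monotonic) function
phi_inv = np.exp      # its inverse
beta = 0.3            # an arbitrary constant

def F(x, y):
    # General solution of the associativity equation, eq.(2.16)
    return phi_inv(phi(x) + phi(y) + beta)

def xi(plausibility):
    # Regraduation, eq.(2.18): xi = phi([a|d]) + beta
    return phi(plausibility) + beta

x, y, z = 0.2, 0.7, 1.3    # arbitrary "old-scale" plausibilities (positive, since phi = log)

# Associativity, eq.(2.15): F(F(x,y),z) = F(x,F(y,z))
print(np.isclose(F(F(x, y), z), F(x, F(y, z))))    # True

# Regraduated sum rule, eq.(2.19): xi(F(x,y)) = xi(x) + xi(y)
print(np.isclose(xi(F(x, y)), xi(x) + xi(y)))      # True
```

Any other invertible φ and constant β would serve equally well; the point of the exercise is only that the complicated rule F becomes simple addition on the regraduated scale.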

2.2.3 The general sum rule

The restriction to mutually exclusive propositions in the sum rule eq.(2.19) can
be easily lifted. Any proposition a can be written as the disjunction of two
mutually exclusive ones, a = (ab) ∨ (ab̃), and similarly b = (ab) ∨ (ãb). Therefore
for any two arbitrary propositions a and b we have

a ∨ b = (ab) ∨ (ab̃) ∨ (ãb) = a ∨ (ãb) .   (2.23)

Since the two propositions on the right are mutually exclusive the sum rule
(2.19) applies,

ξ(a ∨ b|d) = ξ(a|d) + ξ(ãb|d) + [ξ(ab|d) − ξ(ab|d)]   (2.24)
           = ξ(a|d) + ξ(ab ∨ ãb|d) − ξ(ab|d) ,   (2.25)

which leads to the general sum rule,

ξ(a ∨ b|d) = ξ(a|d) + ξ(b|d) − ξ(ab|d) .   (2.26)

2.2.4 Cox's proof

Understanding the proof that eq.(2.16) is the general solution of the associativity
constraint, eq.(2.15), is not necessary for understanding other topics in this
book. This section may be skipped on a first reading. The proof given below,
due to Cox [Cox 1946], takes advantage of the fact that our interest is not just to
find the most general mathematical solution but rather that we want the most
general solution where the function F is to be used for the purpose of inference.
This allows us to impose additional constraints on F.
The general strategy in solving equations such as (2.15) is to take partial
derivatives to transform the functional equation into a differential equation and
then to proceed to solve the latter. Fortunately we can assume that the allowed
functions F are continuous and twice differentiable. Indeed, since inference
is just quantified common sense, had the function F turned out to be non-
differentiable serious doubt would be cast on the legitimacy of the whole scheme.
Furthermore, common sense also requires that F(x, y) be monotonic increasing
in both its arguments. Consider a change in the first argument x = [a|d] while
holding the second y = [b|d] fixed. A strengthening of one's belief in a|d must be
reflected in a corresponding strengthening in one's belief in a ∨ b|d. Therefore
F(x, y) must be monotonic increasing in its first argument. An analogous line
of reasoning shows that F(x, y) must be monotonic increasing in the second
argument as well. Therefore,

∂F(x, y)/∂x ≥ 0   and   ∂F(x, y)/∂y ≥ 0 .   (2.27)

Let

r ≝ F(x, y)   and   s ≝ F(y, z) ,   (2.28)

and let partial derivatives be denoted by subscripts,

F₁(x, y) ≝ ∂F(x, y)/∂x ≥ 0   and   F₂(x, y) ≝ ∂F(x, y)/∂y ≥ 0   (2.29)

(F₁ denotes a derivative with respect to the first argument). Then eq.(2.15) and
its derivatives with respect to x and y are

F(r, z) = F(x, s) ,   (2.30)

F₁(r, z) F₁(x, y) = F₁(x, s) ,   (2.31)

and

F₁(r, z) F₂(x, y) = F₂(x, s) F₁(y, z) .   (2.32)

Eliminating F₁(r, z) from these last two equations we get

K(x, y) = K(x, s) F₁(y, z) ,   (2.33)

where

K(x, y) = F₂(x, y)/F₁(x, y) .   (2.34)

Multiplying eq.(2.33) by K(y, z) and using (2.34) we get

K(x, y) K(y, z) = K(x, s) F₂(y, z) .   (2.35)

Differentiating the right hand side of eq.(2.35) with respect to y and comparing
with the derivative of eq.(2.33) with respect to z, we have

∂/∂y (K(x, s) F₂(y, z)) = ∂/∂z (K(x, s) F₁(y, z)) = ∂/∂z (K(x, y)) = 0 .   (2.36)

Therefore, the derivative of the left hand side of eq.(2.35) with respect to y is

∂/∂y (K(x, y) K(y, z)) = 0 ,   (2.37)

or,

(1/K(x, y)) ∂K(x, y)/∂y = −(1/K(y, z)) ∂K(y, z)/∂y .   (2.38)

Since the left hand side is independent of z while the right hand side is inde-
pendent of x it must be that they depend only on y,

(1/K(x, y)) ∂K(x, y)/∂y ≝ −h(y) .   (2.39)

Integrate using the fact that K ≥ 0, because both F₁ and F₂ are positive, to get

K(x, y) = K(x, 0) exp[ −∫_0^y h(y′) dy′ ] .   (2.40)

Similarly,

K(y, z) = K(0, z) exp[ +∫_0^y h(y′) dy′ ] ,   (2.41)

so that

K(x, y) = K(x, 0)/H(y) = K(0, y) H(x) ,   (2.42)

where

H(x) ≝ exp ∫_0^x h(x′) dx′ ≥ 0 .   (2.43)

Therefore,

K(x, 0)/H(x) = K(0, y) H(y) ≝ α ,   (2.44)

where α = K(0, 0) is a constant and (2.40) becomes

K(x, y) = α H(x)/H(y) .   (2.45)

On substituting back into eqs.(2.33) and (2.35) we get

F₁(y, z) = H(s)/H(y)   and   F₂(y, z) = α H(s)/H(z) .   (2.46)

Next, use s = F(y, z), so that

ds = F₁(y, z) dy + F₂(y, z) dz .   (2.47)

Substituting (2.46) we get

ds/H(s) = dy/H(y) + α dz/H(z) .   (2.48)

This is easily integrated. Let

φ(x) = ∫_0^x dx′/H(x′) ,   (2.49)

so that dx/H(x) = dφ(x). Then

φ(s) = φ(F(y, z)) = φ(y) + α φ(z) + β ,   (2.50)

where β is an arbitrary integration constant. Therefore

F(y, z) = φ⁻¹( φ(y) + α φ(z) + β ) .   (2.51)

Substituting back into eq.(2.15) leads to α = 1. (The second possibility α = 0
is discarded because it leads to F(y, z) = y which is not useful for inference.)
This completes the proof that eq.(2.16) is the general solution of eq.(2.15):
Given any F(x, y) that satisfies eq.(2.15) one can construct the corresponding
φ(x) using eqs.(2.34), (2.39), (2.43), and (2.49). Finally, since H(x) ≥ 0, eq.
(2.43), the regraduating function φ(x), eq.(2.49), is a monotonic function of its
variable x.

2.3 The product rule

Next we consider the function G in eq.(2.7) that represents and. Once the orig-
inal plausibilities are regraduated according to eq.(2.18), the new function
G for the plausibility of a conjunction reads

ξ(ab|c) = G[ξ(a|c), ξ(b|c), ξ(a|bc), ξ(b|ac)] .   (2.52)

The space of functions of four arguments is very large so we first narrow it down
to just two. Then, we require that the representation of and be compatible with
the representation of or that we have just obtained. This amounts to imposing
a consistency constraint that follows from the distributive properties of the
Boolean and and or. A final trivial regraduation yields the product rule of
probability theory.

2.3.1 From four arguments down to two

We will separately consider special cases where the function G depends on only
two arguments, then three, and finally all four arguments. Using commutativity,
ab = ba, the number of possibilities can be reduced to seven:

ξ(ab|c) = G⁽¹⁾[ξ(a|c), ξ(b|c)]   (2.53)
ξ(ab|c) = G⁽²⁾[ξ(a|c), ξ(a|bc)]   (2.54)
ξ(ab|c) = G⁽³⁾[ξ(a|c), ξ(b|ac)]   (2.55)
ξ(ab|c) = G⁽⁴⁾[ξ(a|bc), ξ(b|ac)]   (2.56)
ξ(ab|c) = G⁽⁵⁾[ξ(a|c), ξ(b|c), ξ(a|bc)]   (2.57)
ξ(ab|c) = G⁽⁶⁾[ξ(a|c), ξ(a|bc), ξ(b|ac)]   (2.58)
ξ(ab|c) = G⁽⁷⁾[ξ(a|c), ξ(b|c), ξ(a|bc), ξ(b|ac)]   (2.59)

We want a function G that is of general applicability. This means that the
arguments of G⁽¹⁾ ... G⁽⁷⁾ can be varied independently. Our goal is to go down
the list and eliminate those possibilities that are clearly unsatisfactory.

First some notation: complete certainty is denoted ξ_T, while complete disbe-
lief is ξ_F = 0, eq.(2.21). Derivatives are denoted with a subscript: the derivative
of G⁽³⁾(x, y) with respect to its second argument y is G₂⁽³⁾(x, y).

Type 1: ξ(ab|c) = G⁽¹⁾[ξ(a|c), ξ(b|c)]

The function G⁽¹⁾ is unsatisfactory because it does not take possible correla-
tions between a and b into account. For example, when a and b are mutually
exclusive — say, b = ãd, for some arbitrary d — we have ξ(ab|c) = ξ_F but
there are no constraints on either ξ(a|c) = x or ξ(b|c) = y. Thus, in order that
G⁽¹⁾(x, y) = ξ_F for arbitrary choices of x and y, G⁽¹⁾ must be a constant which
is unacceptable.

Type 2: ξ(ab|c) = G⁽²⁾[ξ(a|c), ξ(a|bc)]

This function is unsatisfactory because it overlooks the plausibility of b|c. For
example, let a = d ∨ d̃ be necessarily true; then ab = b and ξ(b|c) = G⁽²⁾[ξ_T, ξ_T], which is
clearly unsatisfactory since the right hand side is a constant while b on the left
hand side is quite arbitrary.

Type 3: ξ(ab|c) = G⁽³⁾[ξ(a|c), ξ(b|ac)]

As we shall see this function turns out to be satisfactory.

Type 4: ξ(ab|c) = G⁽⁴⁾[ξ(a|bc), ξ(b|ac)]

This function strongly violates common sense: when a = b we have ξ(a|c) =
G⁽⁴⁾(ξ_T, ξ_T), so that ξ(a|c) takes the same constant value irrespective of what
a might be [Smith Erickson 1990].

Type 5: ξ(ab|c) = G⁽⁵⁾[ξ(a|c), ξ(b|c), ξ(a|bc)]

This function turns out to be equivalent either to G⁽¹⁾ or to G⁽³⁾ and can
therefore be ignored. The proof follows from associativity, (ab)c|d = a(bc)|d,
which leads to the constraint

G⁽⁵⁾[ G⁽⁵⁾[ξ(a|d), ξ(b|d), ξ(a|bd)], ξ(c|d), G⁽⁵⁾[ξ(a|cd), ξ(b|cd), ξ(a|bcd)] ]
= G⁽⁵⁾[ ξ(a|d), G⁽⁵⁾[ξ(b|d), ξ(c|d), ξ(b|cd)], ξ(a|bcd) ]

and, with the appropriate identifications,

G⁽⁵⁾[G⁽⁵⁾(x, y, z), u, G⁽⁵⁾(v, w, s)] = G⁽⁵⁾[x, G⁽⁵⁾(y, u, w), s] .   (2.60)

Since the variables x, y, ..., s can be varied independently of each other we can
take a partial derivative with respect to z,

G₁⁽⁵⁾[G⁽⁵⁾(x, y, z), u, G⁽⁵⁾(v, w, s)] G₃⁽⁵⁾(x, y, z) = 0 ,   (2.61)

where G₁⁽⁵⁾ and G₃⁽⁵⁾ denote derivatives with respect to the first and third argu-
ments respectively. Therefore, either

G₃⁽⁵⁾(x, y, z) = 0   or   G₁⁽⁵⁾[G⁽⁵⁾(x, y, z), u, G⁽⁵⁾(v, w, s)] = 0 .   (2.62)

The first possibility says that G⁽⁵⁾ is independent of its third argument which
means that it is of the type G⁽¹⁾ that has already been ruled out. The second
possibility says that G⁽⁵⁾ is independent of its first argument which means that
it is already included among the type G⁽³⁾.

Type 6: ξ(ab|c) = G⁽⁶⁾[ξ(a|c), ξ(a|bc), ξ(b|ac)]

This function turns out to be equivalent either to G⁽³⁾ or to G⁽⁴⁾ and can
therefore be ignored. The proof — which we omit because it is analogous to the
proof above for type 5 — also follows from associativity, (ab)c|d = a(bc)|d.

Type 7: ξ(ab|c) = G⁽⁷⁾[ξ(a|c), ξ(b|c), ξ(a|bc), ξ(b|ac)]

This function turns out to be equivalent either to G⁽⁵⁾ or to G⁽⁶⁾ and can there-
fore be ignored. Again the proof, which uses associativity, (ab)c|d = a(bc)|d, is
omitted because it is analogous to type 5.

Conclusion:
The possible functions G that are viable candidates for a general theory of
inductive inference are equivalent to type G⁽³⁾,

ξ(ab|c) = G[ξ(a|c), ξ(b|ac)] .   (2.63)
2.3.2 The distributivity constraint

The and function G will be determined by requiring that it be compatible with
the regraduated or function F, which is just a sum. Consider three statements
a, b, and c, where the last two are mutually exclusive, in the context of a fourth,
d. Distributivity of and over or,

a(b ∨ c) = ab ∨ ac ,   (2.64)

implies that ξ(a(b ∨ c)|d) can be computed in two ways,

ξ(a(b ∨ c)|d) = ξ((ab) ∨ (ac)|d) .   (2.65)

Using eqs.(2.19) and (2.63) this leads to

G[ξ(a|d), ξ(b|ad) + ξ(c|ad)] = G[ξ(a|d), ξ(b|ad)] + G[ξ(a|d), ξ(c|ad)] ,

which we rewrite as

G(u, v + w) = G(u, v) + G(u, w) ,   (2.66)

where ξ(a|d) = u, ξ(b|ad) = v, and ξ(c|ad) = w.
To solve the functional equation (2.66) we first transform it into a differential
equation. Differentiate with respect to v and w,

∂²G(u, v + w)/∂v∂w = 0 ,   (2.67)

and let v + w = z, to get

∂²G(u, z)/∂z² = 0 ,   (2.68)

which shows that G is linear in its second argument,

G(u, v) = A(u)v + B(u) .   (2.69)

Substituting back into eq.(2.66) gives B(u) = 0. To determine the function
A(u) we note that ad|d = a|d and therefore,

ξ(a|d) = ξ(ad|d) = G[ξ(a|d), ξ(d|ad)] = G[ξ(a|d), ξ_T] ,   (2.70)

or,

u = A(u) ξ_T   ⟹   A(u) = u/ξ_T .   (2.71)

Therefore

G(u, v) = uv/ξ_T   or   ξ(ab|d)/ξ_T = (ξ(a|d)/ξ_T)(ξ(b|ad)/ξ_T) .   (2.72)

The constant ξ_T is easily regraduated away: just normalize ξ to p = ξ/ξ_T. The
corresponding regraduation of the sum rule, eq.(2.26), is equally trivial. The
degrees of belief ξ range from total disbelief ξ_F = 0 to total certainty ξ_T. The
corresponding regraduated values are p_F = 0 and p_T = 1.

The main result:

In the regraduated scale the and operation is represented by a simple product
rule,

p(ab|d) = p(a|d) p(b|ad) ,   (2.73)

and the or operation is represented by the sum rule,

p(a ∨ b|d) = p(a|d) + p(b|d) − p(ab|d) .   (2.74)

Degrees of belief p measured in this particularly convenient regraduated scale
will be called “probabilities”. The degrees of belief p range from total disbelief
p_F = 0 to total certainty p_T = 1.
Conclusion:

A state of partial knowledge — a web of interconnected rational beliefs —
is mathematically represented by quantities that are to be manipulated ac-
cording to the rules of probability theory.

Degrees of rational belief are probabilities.

Other representations — that is, other regraduations — of and and or are
possible. They would be equivalent in that they lead to the same inferences but
they would also be considerably less convenient; the choice is made on purely
pragmatic grounds.
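
As a small numerical aside (not part of the derivation), the two rules can be checked on any concrete assignment of beliefs. The sketch below builds an arbitrary joint distribution over two propositions a and b (the numbers are invented); the conditional p(b|a) is computed from the joint, so the product rule holds by construction, and the informative check is the general sum rule (2.74).

```python
# A toy joint distribution over two binary propositions a and b (numbers are arbitrary).
# p[(a_true, b_true)] = p(ab), etc.; the background proposition d is left implicit.
p = {(True, True): 0.10, (True, False): 0.25,
     (False, True): 0.30, (False, False): 0.35}

p_a = sum(v for (a, b), v in p.items() if a)            # p(a)
p_b = sum(v for (a, b), v in p.items() if b)            # p(b)
p_ab = p[(True, True)]                                  # p(ab)
p_b_given_a = p_ab / p_a                                # p(b|a), read off from the joint
p_a_or_b = sum(v for (a, b), v in p.items() if a or b)  # p(a v b)

# Product rule, eq.(2.73): p(ab) = p(a) p(b|a)
assert abs(p_ab - p_a * p_b_given_a) < 1e-12

# General sum rule, eq.(2.74): p(a v b) = p(a) + p(b) - p(ab)
assert abs(p_a_or_b - (p_a + p_b - p_ab)) < 1e-12
print("product and sum rules check out")
```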

2.4 Some remarks on the sum and product rules

2.4.1 On meaning, ignorance and randomness

The product and sum rules can be used as the starting point for a theory of
probability: Quite independently of what probabilities could possibly mean,
we can develop a formalism of real numbers (measures) that are manipulated
according to eqs.(2.73) and (2.74). This is the approach taken by Kolmogorov.
The advantage is mathematical clarity and rigor. The disadvantage, of course,
is that in actual applications the issue of meaning, of interpretation, turns out
to be important because it affects how and why probabilities are used. It affects
how one sets up the equations and it even affects our perception of what counts
as a solution.
The advantage of the approach due to Cox is that the issue of meaning is
clarified from the start: the theory was designed to apply to degrees of belief.
Consistency requires that these numbers be manipulated according to the rules
of probability theory. This is all we need. There is no reference to measures of
sets or large ensembles of trials or even to random variables. This is remark-
able: it means that we can apply the powerful methods of probability theory
to thinking and reasoning about problems where nothing random is going on,
and to single events for which the notion of an ensemble is either absurd or at
best highly contrived and artificial. Thus, probability theory is the method for
consistent reasoning in situations where the information available might be in-
sufficient to reach certainty: probability is the tool for dealing with uncertainty
and ignorance.
This interpretation is not in conflict with the common view that probabilities
are associated with randomness. It may, of course, happen that there is an un-
known influence that affects the system in unpredictable ways and that there is
a good reason why this influence remains unknown, namely, it is so complicated
that the information necessary to characterize it cannot be supplied. Such an
influence we call ‘random’. Thus, being random is just one among many possible
reasons why a quantity might be uncertain or unknown.

2.4.2 Independent and mutually exclusive events

In special cases the sum and product rules can be rewritten in various useful
ways. Two statements or events a and b are said to be independent if the
probability of one is not altered by information about the truth of the other.
More specifically, event a is independent of b (given c) if

p(a|bc) = p(a|c) .   (2.75)

For independent events the product rule simplifies to

p(ab|c) = p(a|c) p(b|c)   or   p(ab) = p(a) p(b) .   (2.76)

The symmetry of these expressions implies that p(b|ac) = p(b|c) as well: if a is
independent of b, then b is independent of a.
Two statements or events a₁ and a₂ are mutually exclusive given b if they
cannot be true simultaneously, i.e., p(a₁a₂|b) = 0. Notice that neither p(a₁|b)
nor p(a₂|b) need vanish. For mutually exclusive events the sum rule simplifies
to

p(a₁ ∨ a₂|b) = p(a₁|b) + p(a₂|b) .   (2.77)

The generalization to many statements a₁, a₂, ..., aₙ (mutually exclusive given b)
is immediate,

p(a₁ ∨ a₂ ∨ ··· ∨ aₙ|b) = Σ_{i=1}^{n} p(aᵢ|b) .   (2.78)

If at least one of the statements a₁, a₂, ..., aₙ must be true, i.e., they cover all
possibilities, they are said to be exhaustive. Then their disjunction is necessarily
true, a₁ ∨ a₂ ∨ ··· ∨ aₙ = T, so that for any b,

p(T|b) = p(a₁ ∨ a₂ ∨ ··· ∨ aₙ|b) = 1 .   (2.79)

If, in addition to being exhaustive, the statements a₁, a₂, ..., aₙ are also mutu-
ally exclusive then

p(T) = Σ_{i=1}^{n} p(aᵢ) = 1 .   (2.80)

A useful generalization involving the probabilities p(aᵢ|b) conditional on any
arbitrary proposition b is

Σ_{i=1}^{n} p(aᵢ|b) = 1 .   (2.81)

The proof is straightforward:

p(b) = p(bT) = Σ_{i=1}^{n} p(baᵢ) = p(b) Σ_{i=1}^{n} p(aᵢ|b) .   (2.82)

2.4.3 Marginalization

Once we decide that it is legitimate to quantify degrees of belief by real numbers
p the problem becomes how do we assign these numbers. The sum and product
rules show how we should assign probabilities to some statements once proba-
bilities have been assigned to others. Here is an important example of how this
works.
We want to assign a probability to a particular statement b. Let a₁, a₂, ..., aₙ
be mutually exclusive and exhaustive statements and suppose that the proba-
bilities of the conjunctions baⱼ are known. We want to calculate p(b) given the
joint probabilities p(baⱼ). The solution is straightforward: sum p(baⱼ) over all
aⱼ's, use the product rule and eq.(2.81), to get

Σⱼ p(baⱼ) = p(b) Σⱼ p(aⱼ|b) = p(b) .   (2.83)

This procedure, called marginalization, is quite useful when we want to eliminate
uninteresting variables a so we can concentrate on those variables b that really
matter to us. The distribution p(b) is referred to as the marginal of the joint
distribution p(ab).
Here is a second example. Suppose that we happen to know the conditional
probabilities p(b|a). When a is known we can make good inferences about b,
but what can we tell about b when we are uncertain about the actual value of
a? Then we proceed as follows. Use of the sum and product rules gives

p(b) = Σⱼ p(baⱼ) = Σⱼ p(b|aⱼ) p(aⱼ) .   (2.84)

This is quite reasonable: the probability of b is the probability we would assign
if the value of a were precisely known, averaged over all a's. The assignment p(b)
clearly depends on how uncertain we are about the value of a. In the extreme
case when we are totally certain that a takes the particular value aₖ we have
p(aⱼ) = δⱼₖ and we recover p(b) = p(b|aₖ) as expected.
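
A minimal numerical sketch of eq.(2.84) (purely illustrative; the numbers are invented) shows the same point: given a distribution p(aⱼ) over three mutually exclusive and exhaustive alternatives and the conditionals p(b|aⱼ), marginalization gives p(b), and in the limit of certainty about a it reduces to the corresponding conditional.

```python
# Hypothetical numbers, chosen only to illustrate eq.(2.84).
p_a = [0.5, 0.3, 0.2]          # p(a_j): mutually exclusive and exhaustive, sums to 1
p_b_given_a = [0.9, 0.4, 0.1]  # p(b|a_j)

# Marginalization, eq.(2.84): p(b) = sum_j p(b|a_j) p(a_j)
p_b = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))
print(p_b)              # 0.9*0.5 + 0.4*0.3 + 0.1*0.2 = 0.59

# The certainty limit: if p(a_j) = delta_{jk}, then p(b) = p(b|a_k)
p_a_certain = [0.0, 1.0, 0.0]
p_b_certain = sum(pb * pa for pb, pa in zip(p_b_given_a, p_a_certain))
print(p_b_certain)      # 0.4 = p(b|a_2), as expected
```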

2.5 How about “quantum” probabilities?

Despite the enormous effort spent in understanding the peculiar behavior of
quantum particles it is widely believed that our understanding of quantum me-
chanics has been and remains unsatisfactory. From the very beginning it was
recognized that some deeply cherished principles would have to be abandoned.
Foremost among these proposals is the notion that quantum effects are evidence
that reasoning in the quantum domain lies beyond the reach of classical prob-
ability theory, that a new theory of “quantum” probabilities or perhaps even a
quantum logic is required.
If this proposal turns out to be true then probability theory cannot, as we
have claimed in previous sections, be of universal applicability. It is therefore
necessary for us to show that quantum effects are not a counterexample to the
universality of probability theory.⁵

Figure 2.3: In the double slit experiment particles are generated at s, pass
through a screen with two slits A and B, and are detected at the detector
screen. The observed interference pattern is evidence of wave-like behavior.

The argument below is also valuable in other ways. First, it provides an
example of the systematic use of the sum and product rules. Second, it under-
scores the importance of remembering that probabilities are always conditional
on something and that it is often useful to be very explicit about what those
conditioning statements might be. Finally, we will learn something important
about quantum mechanics.
The paradigmatic example for interference effects is the double slit experi-
ment. It was first discussed in 1807 by Thomas Young who sought a demon-
stration of the wave nature of light that would be as clear and definitive as the
interference effects of water waves. It has also been used to demonstrate the
peculiarly counter-intuitive behavior of quantum particles which seem to prop-
agate as waves but are only detected as particles — the so-called wave-particle
duality.
The quantum version of the double slit problem can be briefly stated as
follows. The experimental setup is illustrated in Figure 2.3. A single particle is
emitted at a source s, it goes through a screen where two slits a and b have been
cut, and the particle is later detected farther downstream at some location d.
The standard treatment goes something like this: according to the rules of
quantum mechanics the probability that the particle is detected at d is propor-
tional to the square of the magnitude of a complex number, the “amplitude” ψ.
The quantum rules further stipulate that the amplitude for the experimental
setup in which both slits a and b are open is given by the sum of the amplitude
for the setup with slit a open and b closed plus the amplitude for the setup
with slit a closed and b open. This is written as

ψ_ab = ψ_a + ψ_b ,   (2.85)

and the probability of detection at d is

p_ab(d) ∝ |ψ_ab|² = |ψ_a + ψ_b|²
        ∝ |ψ_a|² + |ψ_b|² + ψ_a* ψ_b + ψ_a ψ_b* .   (2.86)

The first term on the right, |ψ_a|² ∝ p_a(d), reflects the probability of detection
when only slit a is open and, similarly, |ψ_b|² ∝ p_b(d) is the probability when
only b is open. The presence of the interference terms ψ_a* ψ_b + ψ_a ψ_b* is taken as
evidence that in quantum mechanics

p_ab(d) ≠ p_a(d) + p_b(d) .   (2.87)

So far so good.

⁵ The flaw in the argument that quantum theory is incompatible with the standard rules
for manipulating probabilities was pointed out long ago by B. O. Koopman in a paper that
went largely unnoticed [Koopman 1955]. See also [Ballentine 1986].
One might go further and vaguely interpret (2.87) as “the probability of
paths a or b is not the sum of the probability of path a plus the probability of
path b”. And here the trouble starts because it is then almost natural to reach
the troubling conclusions that “quantum mechanics violates the sum rule”, that
“quantum mechanics lies outside the reach of classical probability theory”, and
that “quantum mechanics requires quantum probabilities.” As we shall see, all
these conclusions are unwarranted but in the attempt to “explain” them extreme
ideas have been proposed. For example, according to the standard and still
dominant explanation — the Copenhagen interpretation — quantum particles
are not characterized as having definite positions or trajectories. It is only at
the moment of detection that a particle acquires a definite position. Thus the
Copenhagen interpretation evades the impasse about the alleged violation of
the sum rule by claiming that it makes no sense to even raise the question of
whether the particle went through one slit, through the other slit, through both
slits or through neither.
The notion that physics is an example of inference and that probability
theory is the universal framework for reasoning with incomplete information
leads us along a different track. To construct our model of quantum mechanics
we must first establish the subject matter:
The model: We shall assume that a “point particle” is a system characterized
by its position, that the position of a particle has definite values at all times,
and that the particle moves along trajectories that are continuous. Since the
positions and the trajectories are in general unknown we are justified in invoking
the use of probabilities.

Our goal is to show that these physical assumptions about the nature of
particles can be combined with the rules of probability theory in a way that is
compatible with the predictions of quantum mechanics. It is useful to introduce
a notation that is explicit. We deal with the following propositions:

s = “particle is generated at the source s”
a = “slit a is open”   and   ã = “slit a is closed”
b = “slit b is open”   and   b̃ = “slit b is closed”
α = “particle goes through slit a”
β = “particle goes through slit b”
d = “particle is detected at d”

In this model we have, for example, p(α|ã) = 0. The probability of interest is
p(d|sab). It refers to a situation in which the particle is generated at s, both
slits are open, and we are uncertain about whether it is detected at d. The rules
of probability theory give

p[(α ∨ β)d|sab] = p(αd|sab) + p(βd|sab) − p(αβd|sab) .   (2.88)

Since particles with definite positions cannot go through both a and b the last
term vanishes, p(αβd|sab) = 0. Therefore,

p[(α ∨ β)d|sab] = p(αd|sab) + p(βd|sab) .   (2.89)

Using the product rule the left hand side of (2.89) can be written as

p[(α ∨ β)d|sab] = p(d|sab) p(α ∨ β|sabd) .   (2.90)

Furthermore, since we assume the trajectories to be continuous, a particle that
leaves s and reaches d must with certainty have passed either through a or
through b, therefore p(α ∨ β|sabd) = 1. Therefore,

p[(α ∨ β)d|sab] = p(d|sab) ,   (2.91)

so that

p(d|sab) = p(αd|sab) + p(βd|sab) .   (2.92)

In order to compare this result with quantum mechanics, we rewrite eq.(2.87)
in the new more explicit notation,

p(d|sab) ≠ p(αd|sab̃) + p(βd|sãb) .   (2.93)

We can now see that probability theory, eq.(2.92), and quantum mechanics,
eq.(2.93), are not in contradiction; they differ because they refer to the proba-
bilities of different statements.
It is important to appreciate what we have shown and also what we have not
shown. What we have just shown is that eqs.(2.87) or (2.93) are not in conflict
with the sum rule of probability theory. What we have not (yet) shown is that
the rules of quantum mechanics such as eqs.(2.85) and (2.86) can be derived as
an example of inference; that is a much lengthier matter that will be tackled
later in Chapter 11.
We pursue this matter further to find how one might be easily misled into a
paradox. Use the product rule to rewrite eq.(2.92) as

p(d|sab) = p(α|sab) p(d|sabα) + p(β|sab) p(d|sabβ) ,   (2.94)

and consider the first term on the right. To a classically trained mind (or perhaps
a classically brainwashed mind) it would appear reasonable to believe that the
passage of the particle through slit a is completely unaffected by whether the
distant slit b is open or not. We are therefore tempted to make the substitutions

p(α|sab) ≟ p(α|sab̃)   and   p(d|sabα) ≟ p(d|sab̃α) .   (C1)

Then, making similar substitutions in the second term in (2.94), we get

p(d|sab) ≟ p(α|sab̃) p(d|sab̃α) + p(β|sãb) p(d|sãbβ) ,   (C2)

or

p(d|sab) ≟ p(αd|sab̃) + p(βd|sãb) .   (C3)

This equation does, indeed, contradict quantum mechanics, eq.(2.93). What is
particularly insidious about the “classical” eqs.(C1-C3) is that, beyond being
intuitive, there are situations in which these substitutions are actually correct
— but not always.
We might ask, what is wrong with (C1-C3)? How could it possibly be
otherwise? Well, it is otherwise. Equations (C1-C3) represent an assumption
that happens to be wrong. The assumption does not reflect a wrong probability
theory; it reflects wrong physics. It represents a piece of physical information
that does not apply to quantum particles. Quantum mechanics looks so strange
to classically trained minds because opening a slit at some distant location can
have important effects even when the particle does not go through it. Quantum
mechanics is indeed strange but this is not because it violates probability theory;
it is strange because it is not local.
We conclude that there is no need to construct a theory of “quantum” proba-
bilities. Conversely, there is no need to refer to probabilities as being “classical”.
There is only one kind of probability and quantum mechanics does not refute
the claim that probability theory is of universal applicability.
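
To make the distinction concrete, here is a small numerical sketch with made-up amplitudes. It contrasts the quantum assignment (2.85)-(2.86), in which the two amplitudes are added before squaring, with the “classical” assignment (C3), in which the two single-slit probabilities are added. Equation (2.92) itself is untouched by this comparison, since it involves p(αd|sab) and p(βd|sab), probabilities conditioned on both slits being open.

```python
import numpy as np

# Made-up single-slit amplitudes at one detector position d (for illustration only).
psi_a = 0.3 * np.exp(1j * 0.0)   # amplitude with only slit a open
psi_b = 0.3 * np.exp(1j * 2.0)   # amplitude with only slit b open (relative phase 2 rad)

p_a = abs(psi_a) ** 2            # detection probability, slit a open and b closed
p_b = abs(psi_b) ** 2            # detection probability, slit b open and a closed

# Quantum rule, eqs.(2.85)-(2.86): add amplitudes, then square.
p_ab_quantum = abs(psi_a + psi_b) ** 2

# "Classical" assumption (C3): add the single-slit probabilities.
p_ab_classical = p_a + p_b

print(p_ab_quantum, p_ab_classical)
# The difference is exactly the interference term psi_a* psi_b + psi_a psi_b*:
print(np.isclose(p_ab_quantum - p_ab_classical,
                 2 * np.real(np.conj(psi_a) * psi_b)))   # True
```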

2.6 The expected value

Suppose we know that a quantity x can take values xᵢ with probabilities pᵢ.
Sometimes we need an estimate for the quantity x. What should we choose? It
seems reasonable that those values xᵢ that have larger pᵢ should have a dominant
contribution to the estimate of x. We therefore make the following reasonable
choice: The expected value of the quantity x is denoted by ⟨x⟩ and is given by

⟨x⟩ ≝ Σᵢ pᵢ xᵢ .   (2.95)

The term ‘expected’ value is not always an appropriate one because it can
happen that ⟨x⟩ is not one of the values xᵢ that is actually allowed; in such
cases the “expected” value ⟨x⟩ is not a value we would expect. For example,
the expected value of a die toss is (1 + ··· + 6)/6 = 3.5 which is not an allowed
result.
Using the average ⟨x⟩ as an estimate of x may be reasonable, but it is also
somewhat arbitrary. Alternative estimates are possible; one could, for example,
have chosen the value for which the probability is maximum — this is called the
‘mode’. This raises two questions.
The first question is whether ⟨x⟩ is a good estimate. If the probability distri-
bution is sharply peaked all the values of x that have appreciable probabilities
are close to each other and to ⟨x⟩. Then ⟨x⟩ is a good estimate. But if the
distribution is broad the actual value of x may deviate from ⟨x⟩ considerably.
To describe quantitatively how large this deviation might be we need to describe
how broad the probability distribution is.
A convenient measure of the width of the distribution is the root mean square
(rms) deviation defined by

Δx ≝ ⟨(x − ⟨x⟩)²⟩^{1/2} .   (2.96)

The quantity Δx is also called the standard deviation; its square (Δx)² is called
the variance. The term ‘variance’ may suggest variability or spread but there
is no implication that x is necessarily fluctuating or that its values are spread;
Δx merely refers to our incomplete knowledge about x.⁶
If Δx ≪ ⟨x⟩ then x will not deviate much from ⟨x⟩ and we expect ⟨x⟩ to be
a good estimate.
The definition of Δx is somewhat arbitrary. It is dictated both by common
sense and by convenience. Alternatively we could have chosen to define the
width of the distribution as ⟨|x − ⟨x⟩|⟩ or ⟨(x − ⟨x⟩)⁴⟩^{1/4} but these definitions
are less convenient for calculations.

⁶ The interpretation of probability matters. Among the many infinities that afflict quan-
tum field theories the variances of fields and of the corresponding energies at a point are badly
divergent quantities. If these variances reflect actual physical fluctuations one should also
expect those fluctuations of the spacetime geometry that are sometimes described as a space-
time foam. On the other hand, if one adopts a view of probability as a tool for inference then
the situation changes significantly. One can argue that the information codified into quantum
field theories is sufficient to provide successful estimates of some quantities — which accounts
for the tremendous success of these theories — but is completely inadequate for the estima-
tion of other quantities. Thus divergent variances may be more descriptive of our complete
ignorance rather than of large physical fluctuations.
Now that we have a way of deciding whether ⟨x⟩ is a good estimate for x
we may raise a second question: Is there such a thing as the “best” estimate
for x? Consider an alternative estimate x′. The alternative x′ is “good” if the
deviations from it are small, i.e., ⟨(x − x′)²⟩ is small. The condition for the
“best” x′ is that its variance be a minimum,

d/dx′ ⟨(x − x′)²⟩ |_{x′ = x′_best} = 0 ,   (2.97)

which implies x′_best = ⟨x⟩. Conclusion: ⟨x⟩ is the best estimate for x when by
“best” we mean the estimate with the smallest variance. But other choices are
possible; for example, had we actually decided to minimize the width ⟨|x − x′|⟩
the best estimate would have been the median, x′_best = x_m, a value such that
Prob(x < x_m) = Prob(x > x_m) = 1/2.
We conclude this section by mentioning two important identities that will
be repeatedly used in what follows. The first is that the average deviation from
the mean vanishes,

⟨x − ⟨x⟩⟩ = 0 ,   (2.98)

because deviations from the mean are just as likely to be positive as negative.
The second useful identity is

⟨(x − ⟨x⟩)²⟩ = ⟨x²⟩ − ⟨x⟩² .   (2.99)

The proofs are trivial — just use the definition (2.95).
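
As a quick worked example of eqs.(2.95)-(2.99), the following sketch computes ⟨x⟩, the variance, and the rms deviation for a fair six-sided die (the fairness is of course an assumption made only for the illustration).

```python
# Fair six-sided die: values and probabilities.
values = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6

mean = sum(p * x for p, x in zip(probs, values))              # eq.(2.95): <x> = 3.5
mean_sq = sum(p * x**2 for p, x in zip(probs, values))        # <x^2>
variance = sum(p * (x - mean)**2 for p, x in zip(probs, values))
rms = variance ** 0.5                                         # eq.(2.96)

print(mean)                                          # 3.5 (not itself an allowed outcome)
print(abs(variance - (mean_sq - mean**2)) < 1e-12)   # identity (2.99) holds
print(rms)                                           # about 1.71
```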

2.7 The binomial distribution

Suppose the probability of a certain event α is θ. The probability of α not
happening is 1 − θ. Using the theorems discussed earlier we can obtain the
probability that α happens m times in N independent trials. The probability
that α happens in the first m trials and not-α, or α̃, happens in the subsequent
N − m trials is, using the product rule for independent events, θ^m (1 − θ)^{N−m}.
But this is only one particular ordering of the m α's and the (N − m) α̃'s. There
are

N!/(m!(N − m)!) ≡ C(N, m)   (2.100)

such orderings. Therefore, using the sum rule for the disjunction (or) of mutu-
ally exclusive events, the probability of m α's in N independent trials irrespective
of the particular order of α's and α̃'s is

P(m|N, θ) = C(N, m) θ^m (1 − θ)^{N−m} .   (2.101)

This is called the binomial distribution. The range of applicability of this distri-
bution is enormous. Whenever trials are identical (same probability θ in every
trial) and independent (i.e., the outcome of one trial has no influence on the
outcome of another, or alternatively, knowing the outcome of one trial provides
us with no information about the possible outcomes of another) the distribution
is binomial.
Next we briefly review some properties of the binomial distribution. The
parameter θ plays two separate roles. On one hand θ is a parameter that labels
the distributions P(m|N, θ); on the other hand, we have P(1|1, θ) = θ so that
the parameter θ also happens to be the probability of α in a single trial.
Using the binomial theorem (hence the name of the distribution) one can
show these probabilities are correctly normalized:

Σ_{m=0}^{N} P(m|N, θ) = Σ_{m=0}^{N} C(N, m) θ^m (1 − θ)^{N−m} = (θ + (1 − θ))^N = 1 .   (2.102)

The expected number of α's is

⟨m⟩ = Σ_{m=0}^{N} m P(m|N, θ) = Σ_{m=0}^{N} m C(N, m) θ^m (1 − θ)^{N−m} .

This sum over m is complicated. The following elegant trick is useful. Consider
the sum

S(θ, λ) = Σ_{m=0}^{N} m C(N, m) θ^m λ^{N−m} ,

where θ and λ are independent variables. After we calculate S we will replace
λ by 1 − θ to obtain the desired result, ⟨m⟩ = S(θ, 1 − θ). The calculation of S
is easy once we realize that m θ^m = θ ∂θ^m/∂θ. Then, using the binomial theorem,

S(θ, λ) = θ ∂/∂θ Σ_{m=0}^{N} C(N, m) θ^m λ^{N−m} = θ ∂/∂θ (θ + λ)^N = Nθ (θ + λ)^{N−1} .

Replacing λ by 1 − θ we obtain our best estimate for the expected number of α's,

⟨m⟩ = Nθ .   (2.103)

This is the best estimate, but how good is it? To find the answer we need to
calculate the variance

(Δm)² = ⟨(m − ⟨m⟩)²⟩ = ⟨m²⟩ − ⟨m⟩² .

To find ⟨m²⟩,

⟨m²⟩ = Σ_{m=0}^{N} m² P(m|N, θ) = Σ_{m=0}^{N} m² C(N, m) θ^m (1 − θ)^{N−m} ,
we can use the same trick we used before to get ⟨m⟩:

S′(θ, λ) = Σ_{m=0}^{N} m² C(N, m) θ^m λ^{N−m} = θ ∂/∂θ ( θ ∂/∂θ (θ + λ)^N ) .

Therefore,

⟨m²⟩ = (Nθ)² + Nθ(1 − θ) ,   (2.104)

and the final result for the rms deviation Δm is

Δm = √(Nθ(1 − θ)) .   (2.105)

Now we can address the question of how good an estimate ⟨m⟩ is. Notice that
Δm grows with N. This might seem to suggest that our estimate of m gets
worse for large N but this is not quite true because ⟨m⟩ also grows with N. The
ratio

Δm/⟨m⟩ = √( (1 − θ)/(θN) ) ∝ 1/N^{1/2} ,   (2.106)

shows that while both the estimate ⟨m⟩ and its uncertainty Δm grow with N,
the relative uncertainty decreases.
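
The closed forms (2.103) and (2.105) are easy to check directly. The short sketch below (illustrative only; N and θ are arbitrary) evaluates ⟨m⟩ and Δm from the binomial distribution (2.101) and compares them with Nθ and √(Nθ(1 − θ)).

```python
from math import comb, sqrt, isclose

N, theta = 50, 0.3    # arbitrary illustrative values

P = [comb(N, m) * theta**m * (1 - theta)**(N - m) for m in range(N + 1)]

norm = sum(P)                                   # eq.(2.102): should equal 1
mean = sum(m * P[m] for m in range(N + 1))      # <m>
var = sum((m - mean)**2 * P[m] for m in range(N + 1))

print(isclose(norm, 1.0))                                    # True
print(isclose(mean, N * theta))                              # eq.(2.103)
print(isclose(sqrt(var), sqrt(N * theta * (1 - theta))))     # eq.(2.105)
```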

2.8 Probability vs. frequency: the law of large numbers

It is important to note that the “frequency” f = m/N of α's obtained in one
N-trial sequence is not equal to θ. For one given fixed value of θ, the observed
frequency f can take any one of the allowed values 0/N, 1/N, 2/N, ..., N/N.
What is equal to θ is not the frequency itself but its expected value. Indeed,
using eq.(2.103), we have

⟨f⟩ = ⟨m/N⟩ = θ .   (2.107)

Is this a good estimate of f? To find out use eq.(2.105) to get

Δf = Δm/N = √( θ(1 − θ)/N ) ∝ 1/N^{1/2} .   (2.108)

Therefore, for large N the distribution of frequencies is quite narrow and the
probability that the observed frequency of α's differs from θ tends to zero as
N → ∞.
The same ideas are more precisely conveyed by a theorem due to Bernoulli
known as the law of large numbers. A simple proof of the theorem involves
an inequality due to Tchebyshev which we derive next. Let ρ(x) dx be the
probability that a variable X lies in the range between x and x + dx,

P(x < X < x + dx) = ρ(x) dx .

The variance of X satisfies

(Δx)² = ∫ (x − ⟨x⟩)² ρ(x) dx ≥ ∫_{|x−⟨x⟩| ≥ ε} (x − ⟨x⟩)² ρ(x) dx ,

where ε is an arbitrary constant. Replacing (x − ⟨x⟩)² by its least value ε² gives

(Δx)² ≥ ε² ∫_{|x−⟨x⟩| ≥ ε} ρ(x) dx = ε² P(|x − ⟨x⟩| ≥ ε) ,

which is Tchebyshev's inequality,

P(|x − ⟨x⟩| ≥ ε) ≤ (Δx/ε)² .   (2.109)

Next we prove Bernoulli's theorem. Consider first a special case. Let θ be
the probability of outcome α in a single experiment, P(α|N = 1) = θ. In a
sequence of N independent repetitions of the experiment the probability of m
outcomes α is binomial. Substituting

⟨f⟩ = θ   and   (Δf)² = θ(1 − θ)/N

into Tchebyshev's inequality we get Bernoulli's theorem,

P(|f − θ| ≥ ε | N) ≤ θ(1 − θ)/(N ε²) .   (2.110)

Therefore, the probability that the observed frequency f is appreciably different
from θ tends to zero as N → ∞. Or equivalently: for any small ε, the probability
that the observed frequency f = m/N lies in the interval between θ − ε and
θ + ε tends to unity as N → ∞,

lim_{N→∞} P(|f − θ| ≤ ε | N) = 1 .   (2.111)

In the mathematical/statistical literature this result is commonly stated in the
form

f tends to θ in probability.   (2.112)

The qualifying words ‘in probability’ are crucial: we are not saying that the
observed f tends to θ for large N. What vanishes for large N is not the difference
f − θ itself, but rather the probability that |f − θ| is larger than any fixed amount
ε.
Thus, probabilities and frequencies are related to each other but they are
not the same thing. Since ⟨f⟩ = θ, one might have been tempted to define the
probability θ in terms of the expected frequency ⟨f⟩ but this does not work. The
problem is that the notion of expected value presupposes that the concept of
probability has already been defined. Defining probability in terms of expected
values would be circular.⁷
We can express this important point in yet a different way: We cannot define
probability as a limiting frequency lim_{N→∞} f because there exists no frequency
function to take a limit of: the observed ratio f = m/N is not a function of N,
and the limit makes no sense.
The law of large numbers is easily generalized beyond the binomial distri-
bution. Consider the average

x̄ = (1/N) Σ_{r=1}^{N} x_r ,   (2.113)

where x₁, ..., x_N are N independent variables with the same mean ⟨x_r⟩ = μ
and variance var(x_r) = (Δx_r)² = σ². (In the previous discussion leading to
eq.(2.110) each variable x_r is either 1 or 0 according to whether outcome α
happens or not in the rth repetition of the experiment E.)
To apply Tchebyshev's inequality, eq.(2.109), we need the mean and the
variance of x̄. Clearly,

⟨x̄⟩ = (1/N) Σ_{r=1}^{N} ⟨x_r⟩ = (1/N) Nμ = μ .   (2.114)

Furthermore, since the x_r are independent, their variances are additive. For
example,

var(x₁ + x₂) = var(x₁) + var(x₂) .   (2.115)

(Prove it.) Therefore,

var(x̄) = var( Σ_{r=1}^{N} x_r/N ) = N σ²/N² = σ²/N .   (2.116)

Tchebyshev's inequality now gives,

P(|x̄ − μ| ≥ ε | N) ≤ σ²/(N ε²) ,   (2.117)

so that for any ε > 0

lim_{N→∞} P(|x̄ − μ| ≥ ε | N) = 0   or   lim_{N→∞} P(|x̄ − μ| ≤ ε | N) = 1 ,   (2.118)

or

x̄ → μ  in probability.   (2.119)

Again, what vanishes for large N is not the difference x̄ − μ itself, but rather
the probability that |x̄ − μ| is larger than any given small amount.

⁷ Expected values can be introduced independently of probability (see [Jeffrey 2004]) but
this does not help make probabilities equal to frequencies either.

Example: the simplest form of data analysis

We want to estimate a certain quantity x so we proceed to measure it. The
problem is that the result of the measurement x₁ is afflicted by an error that
is essentially unknown. We could just set x ≈ x₁ but we can do much better.
The procedure is to perform the measurement several times to collect data
(x₁, x₂, ..., x_N). Then instead of using the result of any single measurement x_r as
an estimator for x one uses the sample average,

x̄ = (1/N) Σ_{r=1}^{N} x_r .   (2.120)

The intuition behind this idea is that the errors of the individual measurements
will probably be positive just as often as they are negative so that in the sum
the errors will tend to cancel out. Thus one expects that the error of x̄ will be
smaller than that of any of the individual x_r.
This intuition can be put on a firmer ground as follows. Let us assume that
the measurements are performed under identical conditions and are independent
of each other. We also assume that although there is some unknown error the
experiments are unbiased — that is, they are at least expected to yield the right
answer. This is expressed by

⟨x_r⟩ = x   and   Δx_r = σ .   (2.121)

The sample average x̄ is also afflicted by some unknown error so that strictly x̄
is not the same as x but its expected value is. Indeed,

⟨x̄⟩ = (1/N) Σ_{r=1}^{N} ⟨x_r⟩ = x .   (2.122)

Since the measurements are independent the variances are additive,

(Δx̄)² = Σ_{r=1}^{N} (Δx_r/N)² = N (σ/N)² = σ²/N   or   Δx̄ = σ/N^{1/2} .   (2.123)

We conclude that estimating x ≈ x̄ is much better than just setting x ≈ x₁. And
the estimator x̄ becomes better and better as N → ∞. Indeed, Tchebyshev's
inequality gives

P(|x̄ − x| ≥ ε | N) ≤ σ²/(N ε²)   or   lim_{N→∞} P(|x̄ − x| ≤ ε | N) = 1 ,   (2.124)

so that as the number of measurements increases x̄ → x (in probability).
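
The 1/√N improvement of eq.(2.123) is easy to see in a simulation (again purely illustrative; the "true" value and the noise level are invented): repeat the whole N-measurement experiment many times and look at the spread of the resulting sample averages.

```python
import numpy as np

rng = np.random.default_rng(1)
x_true, sigma = 10.0, 2.0        # hypothetical true value and single-measurement error
repeats = 5_000                  # how many times we repeat the whole N-measurement run

for N in (1, 10, 100, 1000):
    data = x_true + sigma * rng.standard_normal((repeats, N))
    xbar = data.mean(axis=1)                  # sample average of each run, eq.(2.120)
    print(N, xbar.std(), sigma / np.sqrt(N))  # observed spread vs sigma/N^(1/2), eq.(2.123)
```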

2.9 The Gaussian distribution

The Gaussian distribution is quite remarkable. It appears in an enormously wide
variety of problems such as the distribution of errors affecting experimental data,
the distribution of velocities of molecules in gases and liquids, the distribution
of fluctuations of thermodynamical quantities, in diffusion phenomena, and, as
we shall later see, even at the very foundations of quantum mechanics. One
suspects that a deeply fundamental reason must exist for its wide applicability.
The Central Limit Theorem discussed below provides an explanation.

2.9.1 The de Moivre-Laplace theorem

The Gaussian distribution turns out to be a special case of the binomial distri-
bution. It applies to situations when the number N of trials and the expected
number of α's, ⟨m⟩ = Nθ, are both very large (i.e., N large, θ not too small).
To find an analytical expression for the Gaussian distribution we note that
when N is large the binomial distribution,

P(m|N, θ) = N!/(m!(N − m)!) θ^m (1 − θ)^{N−m} ,

is very sharply peaked at ⟨m⟩ = Nθ. This suggests that to find a good approx-
imation for P we need to pay special attention to a very small range of m. One
might be tempted to follow the usual approach and directly expand in a Taylor
series but a problem becomes immediately apparent: if a small change in m
produces a small change in P then we only need to keep the first few terms,
but in our case P is a very sharp function. To reproduce this kind of behavior
we need a huge number of terms in the series expansion which is impractical.
Having diagnosed the problem one can easily find a cure: instead of finding a
Taylor expansion for the rapidly varying P, one finds an expansion for log P
which varies much more smoothly.
Let us therefore expand log P about its maximum at m₀, the location of
which is at this point still unknown. The first few terms are

log P = log P|_{m₀} + (d log P/dm)|_{m₀} (m − m₀) + (1/2) (d² log P/dm²)|_{m₀} (m − m₀)² + ... ,

where

log P = log N! − log m! − log(N − m)! + m log θ + (N − m) log(1 − θ) .

What is a derivative with respect to an integer? For large m the function log m!
varies so slowly (relative to the huge value of log m! itself) that we may consider
m to be a continuous variable. This leads to a very useful approximation —
called the Stirling approximation — for the logarithm of a large factorial,

log m! = Σ_{n=1}^{m} log n ≈ ∫_1^{m+1} log x dx = (x log x − x)|_1^{m+1} ≈ m log m − m .

A somewhat better expression which includes the next term in the Stirling ex-
pansion is
log m! ≈ m log m − m + (1/2) log 2πm + ...   (2.125)

Notice that the third term is much smaller than the first two: the first two
terms are of order m while the last is of order log m. For m = 10²³, log m is
only 55.3.
The derivatives in the Taylor expansion are

d log P/dm = −log m + log(N − m) + log θ − log(1 − θ) = log [ (N − m)θ / (m(1 − θ)) ] ,

and

d² log P/dm² = −1/m − 1/(N − m) = −N/(m(N − m)) .

To find the value m₀ where P is maximum set d log P/dm = 0. This gives
m₀ = Nθ = ⟨m⟩, and substituting into the second derivative of log P we get

d² log P/dm² |_{⟨m⟩} = −1/(Nθ(1 − θ)) = −1/(Δm)² .

Therefore

log P = log P(⟨m⟩) − (m − ⟨m⟩)²/(2(Δm)²) + ...

or

P(m) = P(⟨m⟩) exp[ −(m − ⟨m⟩)²/(2(Δm)²) ] .

The remaining unknown constant P(⟨m⟩) can be evaluated by requiring that
the distribution P(m) be properly normalized, that is

1 = Σ_{m=0}^{N} P(m) ≈ ∫_0^N P(x) dx ≈ ∫_{−∞}^{+∞} P(x) dx .

Using

∫_{−∞}^{+∞} e^{−αx²} dx = (π/α)^{1/2} ,

we get

P(⟨m⟩) = 1/√(2π(Δm)²) .

Thus, the expression for the Gaussian distribution with mean ⟨m⟩ and rms
deviation Δm is

P(m) = 1/√(2π(Δm)²) exp[ −(m − ⟨m⟩)²/(2(Δm)²) ] .   (2.126)

It can be rewritten as a probability for the frequency f = m=N using hmi = N


2
and ( m) = N (1 ). The probability that f lies in the small range df =
1=N is " #
2
1 (f )
p(f )df = p exp 2 df ; (2.127)
2 N 2 2 N
2
where N = (1 )=N .
To appreciate the signi…cance of the theorem consider a macroscopic
PN variable
x built up by adding a large number of small contributions, x = n=1 n , where
the n are statistically independent. We assume that each n takes the value "
with probability , and the value 0 with probability 1 . Then the probability
that x takes the value m" is given by the binomial distribution P (mjN; ). For
large N the probability that x lies in the small range m" dx=2 is
" #
2
1 (x hxi)
p(x)dx = q exp 2 dx ; (2.128)
2 ( x)
2 2 ( x)

2
where hxi = N " and ( x) = N (1 )"2 . Thus, the Gaussian distribution
arises whenever we have a quantity that is the result of adding a large number
of small independent contributions. The derivation above assumes that the
microscopic contributions are discrete (either 0 or "), and identically distributed
but, as shown in the next section, both of these conditions can be relaxed.
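
A short numerical check of eq.(2.126) (for illustration; N and θ are arbitrary) compares the exact binomial probabilities with the Gaussian approximation near the peak.

```python
from math import comb, exp, pi, sqrt

N, theta = 1000, 0.3                        # arbitrary illustrative values
mean = N * theta                            # <m> = N theta
var = N * theta * (1 - theta)               # (Delta m)^2

def binom(m):
    # Exact binomial probability, eq.(2.101)
    return comb(N, m) * theta**m * (1 - theta)**(N - m)

def gauss(m):
    # de Moivre-Laplace approximation, eq.(2.126)
    return exp(-(m - mean)**2 / (2 * var)) / sqrt(2 * pi * var)

for m in (280, 300, 320):
    print(m, binom(m), gauss(m))   # the two columns agree closely near the peak
```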

2.9.2 The Central Limit Theorem


The result of the previous section can be strengthened considerably. Consider the sum
\[
X_N = \sum_{r=1}^{N} x_r \tag{2.129}
\]
of N independent variables x_1, …, x_N. Our goal is to calculate the probability distribution of X_N for large N. Let p_r(x_r) be the probability distribution for the rth variable with
\[
\langle x_r\rangle = \mu_r \quad\text{and}\quad (\Delta x_r)^2 = \sigma_r^2\;. \tag{2.130}
\]

Note that now we no longer assume that the variables x_r are identically distributed nor that the distributions p_r(x_r) are binomial, but we still assume independence.

The probability density for X_N is given by the integral
\[
P_N(X) = \int dx_1\dots dx_N\, p_1(x_1)\dots p_N(x_N)\,\delta\!\left(X - \sum_{r=1}^{N} x_r\right). \tag{2.131}
\]
(The expression on the right is the expected value of an indicator function. The derivation of (2.131) is left as an exercise.)

A minor annoyance is that as N → ∞ the limits such as
\[
\lim_{N\to\infty}\langle X\rangle_N = \lim_{N\to\infty}\sum_{r=1}^{N}\mu_r\;, \tag{2.132}
\]
\[
\lim_{N\to\infty}\left\langle (X-\langle X\rangle)^2\right\rangle_N = \lim_{N\to\infty}\sum_{r=1}^{N}\sigma_r^2\;, \tag{2.133}
\]
diverge in cases that are physically interesting, such as when the variables x_r are identically distributed (μ_r and σ_r are independent of r). To resolve this difficulty, instead of the variable X_N we will consider a different, suitably shifted and normalized variable,
\[
Y = \frac{X - m_N}{s_N}\;, \tag{2.134}
\]
where
\[
m_N \stackrel{\text{def}}{=} \sum_{r=1}^{N}\mu_r \quad\text{and}\quad s_N^2 \stackrel{\text{def}}{=} \sum_{r=1}^{N}\sigma_r^2\;. \tag{2.135}
\]

The probability distribution of Y_N is given by
\[
P_N(Y) = \int dx_1\dots dx_N\, p_1(x_1)\dots p_N(x_N)\,\delta\!\left(Y - \frac{1}{s_N}\sum_{r=1}^{N}(x_r-\mu_r)\right), \tag{2.136}
\]
and its limit as N → ∞ is given by the following theorem.


The Central Limit Theorem:
If the independent variables xr , r = 1 : : : N , with means r and variances
2
r satisfy the Lyapunov condition,

1 XD E
N
3
lim jxr rj =0; (2.137)
N !1 s3
N r=1

then
1 Y 2 =2
lim PN (Y ) = P (Y ) = p e ; (2.138)
N !1 2
which is Gaussian with zero mean and unit variance, hY i = 0 and Y = 1.
Proof:
Consider the Fourier transform,
\[
F_N(k) = \int_{-\infty}^{+\infty} dY\, P_N(Y)\,e^{ikY} = \int dx_1\dots dx_N\, p_1(x_1)\dots p_N(x_N)\exp\left[\frac{ik}{s_N}\sum_{r=1}^{N}(x_r-\mu_r)\right],
\]
which can be rearranged into a product of the individual Fourier transforms,
\[
F_N(k) = \prod_{r=1}^{N}\int dx_r\, p_r(x_r)\exp\Big[i\frac{k}{s_N}(x_r-\mu_r)\Big]\;. \tag{2.139}
\]

The Fourier transform f(k) of a distribution p(x) has many interesting and useful properties. For example,
\[
f(k) = \int dx\, p(x)\,e^{ikx} = \left\langle e^{ikx}\right\rangle\;, \tag{2.140}
\]
while the series expansion of the exponential gives
\[
f(k) = \left\langle\sum_{\ell=0}^{\infty}\frac{(ikx)^\ell}{\ell!}\right\rangle = \sum_{\ell=0}^{\infty}\frac{(ik)^\ell}{\ell!}\left\langle x^\ell\right\rangle\;. \tag{2.141}
\]
In words, the coefficients of the Taylor expansion of f(k) give all the moments of p(x). The Fourier transform f(k) is called the moment generating function and also the characteristic function of the distribution p(x).
Going back to the calculation of P_N(Y), eq.(2.136), its Fourier transform, eq.(2.139), is
\[
F_N(k) = \prod_{r=1}^{N} f_r(k)\;, \tag{2.142}
\]
where
\[
f_r(k) = \int dx_r\, p_r(x_r)\exp\Big[\frac{ik}{s_N}(x_r-\mu_r)\Big]\;.
\]
Since s_N diverges as N → ∞ we can expand
\[
f_r(k) = 1 + i\frac{k}{s_N}\langle x_r-\mu_r\rangle - \frac{k^2}{2s_N^2}\left\langle(x_r-\mu_r)^2\right\rangle + R_r\Big(\frac{k}{s_N}\Big)
       = 1 - \frac{k^2\sigma_r^2}{2s_N^2} + R_r\Big(\frac{k}{s_N}\Big)\;, \tag{2.143}
\]
with a remainder R_r(k/s_N) bounded by
\[
\Big|R_r\Big(\frac{k}{s_N}\Big)\Big| \le C\,\Big|\frac{k}{s_N}\Big|^3\left\langle|x_r-\mu_r|^3\right\rangle \tag{2.144}
\]
for some constant C.

for some constant C. Therefore, for su¢ ciently large N , using log(1 + x) x,
N
X
log Fn (k) = log fr (k) (2.145)
r=1
N
X N
k 2 r2 k k2 X k
= + Rr ( ) = + Rr ( ) (2.146)
r=1
2s2N sN 2 r=1
sN
2.9 The Gaussian distribution 45

By the Lyapunov condition, eq.(2.137), we have
\[
\Big|\sum_{r=1}^{N} R_r\Big(\frac{k}{s_N}\Big)\Big| \le C\,\frac{|k|^3}{s_N^3}\sum_{r=1}^{N}\left\langle|x_r-\mu_r|^3\right\rangle \to 0 \tag{2.147}
\]
as N → ∞, uniformly in every finite interval k′ < k < k″. Therefore, as N → ∞,
\[
\log F_N(k) \to -\frac{k^2}{2} \quad\text{or}\quad F_N(k) \to F(k) = e^{-k^2/2}\;. \tag{2.148}
\]
Taking the inverse Fourier transform leads to eq.(2.138) and concludes the proof.

It is easy to check that the Lyapunov condition is satisfied when the x_r variables are identically distributed. Indeed, if all μ_r = μ, σ_r = σ and ⟨|x_r − μ|³⟩ = γ, then
\[
s_N^2 = \sum_{r=1}^{N}\sigma_r^2 = N\sigma^2 \quad\text{and}\quad
\lim_{N\to\infty}\frac{1}{s_N^3}\sum_{r=1}^{N}\left\langle|x_r-\mu_r|^3\right\rangle = \lim_{N\to\infty}\frac{N\gamma}{N^{3/2}\sigma^3} = 0\;. \tag{2.149}
\]
We can now return to our original goal of calculating the probability distribution of X_N. For any given but sufficiently large N we have
\[
P_N(X)\,dX = P_N(Y)\,dY \approx \frac{1}{\sqrt{2\pi}}\,e^{-Y^2/2}\,dY\;. \tag{2.150}
\]
From eq.(2.134) we have
\[
dY = \frac{dX}{s_N}\;, \tag{2.151}
\]
therefore
\[
P_N(X) \approx \frac{1}{\sqrt{2\pi s_N^2}}\exp\left[-\frac{(X-m_N)^2}{2s_N^2}\right]. \tag{2.152}
\]
And when the x_r variables are identically distributed we get
\[
P_N(X) \approx \frac{1}{\sqrt{2\pi N\sigma^2}}\exp\left[-\frac{(X-N\mu)^2}{2N\sigma^2}\right]. \tag{2.153}
\]
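A brief Monte Carlo sketch can make the theorem concrete. The Python code below is an illustration added here (the choice of uniform and exponential micro-variables with r-dependent widths and scales is arbitrary); it builds the standardized variable Y of eq.(2.134) from non-identical, independent contributions and checks that it behaves like the standard Gaussian of eq.(2.138):

    import numpy as np

    rng = np.random.default_rng(0)
    N, trials = 100, 20_000

    # non-identical micro-variables: uniform on [0, w_r] plus exponential with scale s_r
    w = 1.0 + 0.5 * np.arange(N) / N
    s = 0.5 + np.arange(N) / N

    X = (rng.uniform(0.0, w, size=(trials, N))
         + rng.exponential(s, size=(trials, N))).sum(axis=1)

    m_N = np.sum(w / 2 + s)                     # sum of the means mu_r
    s_N = np.sqrt(np.sum(w**2 / 12 + s**2))     # sqrt of the sum of the variances sigma_r^2
    Y = (X - m_N) / s_N                         # the shifted, normalized variable of eq.(2.134)

    print("mean of Y  :", Y.mean())             # close to 0
    print("std of Y   :", Y.std())              # close to 1
    print("P(|Y| < 1) :", np.mean(np.abs(Y) < 1))   # Gaussian value is about 0.683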
To conclude we comment on the significance of the central limit theorem. We have shown that, almost independently of the form of the distributions p_r(x_r), the distribution of the sum X is Gaussian centered at Σ_r μ_r with standard deviation (Σ_r σ_r²)^{1/2}. Not only need the p_r(x_r) not be binomial, they do not even have to be equal to each other. This helps to explain the widespread applicability of Gaussian distributions: they apply to almost any 'macro-variables' (such as X) that result from adding a large number of independent 'micro-variables' (such as x_r).

But there are restrictions. Although Gaussian distributions are very common, there are exceptions. The derivation shows that the Lyapunov condition played a critical role. Earlier we mentioned that the success of Gaussian distributions is due to the fact that they codify the information that happens to be relevant to the particular phenomenon under consideration. Now we see what that relevant information might be: it is contained in the first two moments, the mean and the variance — Gaussian distributions apply to processes where the third and higher moments are not relevant information.

Later we shall approach this same problem from the point of view of the method of maximum entropy and there we will show that, indeed, the Gaussian distribution can also be derived as the distribution that codifies information about the mean and the variance while remaining maximally ignorant about everything else.

2.10 Updating probabilities: Bayes' rule

Now that we have solved the problem of how to represent a state of partial knowledge as a consistent web of interconnected beliefs we can start to address the problem of updating from one consistent web of beliefs to another when new information becomes available. We will only consider those special situations where the information to be processed is in the form of data. The question of what else, beyond data, could possibly qualify as information will be addressed in later chapters.8

Specifically, the problem is to update our beliefs about a quantity θ (either a single parameter or many) on the basis of data x (either a single number or several) and of a known relation between θ and x. The updating consists of replacing the prior probability distribution q(θ), which represents our beliefs before the data is processed, by a posterior distribution p(θ) that applies after the data has been processed.9

2.10.1 Formulating the problem


We must first describe the state of our knowledge before the data has been collected or, if the data has already been collected, before we have taken it into account. At this stage of the game not only do we not know θ, we do not know x either. As mentioned above, in order to infer θ from x we must also know how these two quantities are related to each other. Without this information one cannot proceed further. Fortunately we usually know enough about the physics of an experiment that if θ were known we would have a fairly good idea of what values of x to expect. For example, given a value for the charge of the electron, we can calculate the velocity x of an oil drop in Millikan's experiment, add some uncertainty in the form of Gaussian noise, and we have a very reasonable estimate of the conditional distribution q(x|θ). The distribution q(x|θ) is called the sampling distribution and also (but less appropriately) the likelihood function. We will assume it is known. We should emphasize that the crucial information about how x is related to θ is contained in the functional form of the distribution q(x|θ) — say, whether it is a Gaussian or a Cauchy distribution — and not in the actual values of x and θ which are, at this point, still unknown.

Thus, to describe the web of prior beliefs we must know the prior q(θ) and also the sampling distribution q(x|θ). This means that we must know the full joint distribution,
\[
q(\theta, x) = q(\theta)\,q(x|\theta)\;. \tag{2.154}
\]
This is important. We must be clear about what we are talking about: the relevant universe of discourse is neither the space Θ of possible θs nor the space X of possible data x. It is rather the product space Θ × X, and the probability distributions that concern us are the joint distributions q(θ, x).

8 The discussion below may seem unnecessarily contrived to readers who are familiar with Bayesian inference. But our goal is not to merely introduce Bayes' theorem as a tool for applications in data analysis. Our goal is to present Bayesian inference in a way that provides a first stepping stone towards a more general inference framework which will eventually result in a complete unification of Bayesian and entropic methods [Caticha Giffin 2006, Caticha 2007, Caticha 2014a].

9 On notation: it is important to distinguish priors from posteriors. Here we will denote priors by q and posteriors by p. Later, however, we will often revert to the common practice of referring to all probabilities as p. Hopefully no confusion should arise as the correct meaning should be clear from the context.
Next we collect data: the observed value turns out to be x′. Our goal is to use this information to update to a web of posterior beliefs represented by a new joint distribution p(θ, x). How shall we choose p(θ, x)? Since the new data tells us that the value of x is now known to be x′ the new web of beliefs is constrained to satisfy
\[
p(x) = \int d\theta\, p(\theta, x) = \delta(x - x')\;. \tag{2.155}
\]
(For simplicity we have here assumed that x is a continuous variable; had x been discrete the Dirac δs would be replaced by Kronecker δs.) This is all we know and it is not sufficient to determine p(θ, x). Apart from the general requirement that the new web of beliefs must be internally consistent there is nothing in any of our previous considerations that induces us to prefer one consistent web over another. A new principle is needed and this is where the prior information comes in.

2.10.2 Minimal updating: Bayes' rule

The basic updating principle that we adopt below reflects the conviction that what we have learned in the past, the prior knowledge, is a valuable resource that should not be squandered. Prior beliefs should be revised only to the extent that the new information has rendered them obsolete and the updated web of beliefs should coincide with the old one as much as possible. We propose to adopt the following principle of parsimony,

Principle of Minimal Updating (PMU): The web of beliefs ought to be revised only to the minimal extent required by the new data.10

10 The 'ought' in the PMU indicates that the design of the inference framework – the decision of how we ought to choose our beliefs – incorporates an ethical component. Pursued to its ultimate conclusion it suggests that the foundations of science are closely tied to ethics. This is an important topic that deserves further exploration.
This seems so reasonable and natural that an explicit statement may appear superfluous. The important point, however, is that it is not logically necessary. We could update in many other ways that preserve both internal consistency and consistency with the new information.

As we saw above the new data, eq.(2.155), does not fully determine the joint distribution
\[
p(\theta, x) = p(x)\,p(\theta|x) = \delta(x - x')\,p(\theta|x)\;. \tag{2.156}
\]
All distributions of the form
\[
p(\theta, x) = \delta(x - x')\,p(\theta|x')\;, \tag{2.157}
\]
where p(θ|x′) is quite arbitrary, are compatible with the newly acquired data. We still need to assign p(θ|x′). It is at this point that we invoke the PMU. We stipulate that, having updated q(x) to p(x) = δ(x − x′), no further revision is needed and we set
\[
p(\theta|x') = q(\theta|x')\;. \tag{PMU}
\]
Therefore, the web of posterior beliefs is described by
\[
p(\theta, x) = \delta(x - x')\,q(\theta|x')\;. \tag{2.158}
\]
To obtain the posterior probability for θ marginalize over x,
\[
p(\theta) = \int dx\, p(\theta, x) = \int dx\, \delta(x - x')\,q(\theta|x')\;, \tag{2.159}
\]
to get
\[
p(\theta) = q(\theta|x')\;. \tag{2.160}
\]
In words, the posterior probability equals the prior conditional probability of θ given x′. This result, which we will call Bayes' rule, is extremely reasonable: we maintain those beliefs about θ that are consistent with the data values x′ that turned out to be true. Beliefs based on values of x that were not observed are discarded because they are now known to be false. 'Maintain' and 'discard' are the key words: the former reflects the PMU in action, the latter is the updating.

Using the product rule
\[
q(\theta, x') = q(\theta)\,q(x'|\theta) = q(x')\,q(\theta|x')\;, \tag{2.161}
\]
Bayes' rule can be written as
\[
p(\theta) = q(\theta)\,\frac{q(x'|\theta)}{q(x')}\;. \tag{2.162}
\]

The interpretation of Bayes' rule is straightforward: according to eq.(2.162) the posterior distribution p(θ) gives preference to those values of θ that were previously preferred as described by the prior q(θ), but this is now modulated by the likelihood factor q(x′|θ) in such a way as to enhance our preference for values of θ that make the observed data more likely, less surprising.
The factor in the denominator, q(x′), which is often called the 'evidence', is the prior probability of the data. It is given by
\[
q(x') = \int d\theta\, q(\theta)\,q(x'|\theta)\;, \tag{2.163}
\]
and plays the role of a normalization constant for the posterior distribution p(θ). It does not help to discriminate one value of θ from another because it affects all values of θ equally. As we shall see later in this chapter the evidence turns out to be important in problems of model selection (see eq. 2.234).
Remark: Bayes' rule is often written in the form
\[
q(\theta|x') = q(\theta)\,\frac{q(x'|\theta)}{q(x')}\;, \tag{2.164}
\]
and called Bayes' theorem.11 This formula is very simple; but perhaps it is too simple. It is true for any value of x′ whether observed or not. Eq.(2.164) is just a restatement of the product rule, eq.(2.161), and therefore it is a simple consequence of the internal consistency of the prior web of beliefs. No posteriors are involved: the left hand side is not a posterior but rather a prior probability – the prior conditional on x′. To put it differently, in an actual update, q(θ) → p(θ), both probabilities refer to the same proposition θ. In (2.164), q(θ) → q(θ|x′) cannot be an update because it refers to the probabilities of two different propositions, θ and θ|x′. Of course these subtleties have not stood in the way of the many extremely successful applications of Bayes' theorem. But by confusing priors with posteriors the formula (2.164) has contributed to obscure the fact that an additional principle – the PMU – was needed for updating. And this has stood in the way of a deeper understanding of the connection between the Bayesian and entropic methods of inference.

11 Neither the rule, eq.(2.162), nor the theorem, eq.(2.164), were ever actually written down by Bayes. The person who first explicitly stated the theorem and, more importantly, who first realized its deep significance was Laplace.

Example: Is there life on Mars?

Suppose we are interested in whether there is life on Mars or not. How is the probability that there is life on Mars altered by new data indicating the presence of water on Mars? Let θ = 'There is life on Mars'. The prior information includes the fact I = 'All known life forms require water'. The new data is that x′ = 'There is water on Mars'. Let us look at Bayes' rule. We can't say much about q(x′|I) but whatever its value it is definitely less than 1. On the other hand q(x′|θI) ≈ 1. Therefore the factor multiplying the prior is larger than 1. Our belief in the truth of θ is strengthened by the new data x′. This is just common sense, but notice that this kind of probabilistic reasoning cannot be carried out if one adheres to a strictly frequentist interpretation — there is no set of trials. The name 'Bayesian probabilities' given to 'degrees of belief' originates in the fact that it is only under this type of interpretation that the full power of Bayes' rule can be exploited. Everybody can prove Bayes' theorem; only Bayesians can reap the advantages of Bayes' rule.

Example: Testing positive for a rare disease

Suppose you are tested for a disease, say cancer, and the test turns out to be positive. Suppose further that the test is said to be 99% accurate. Should you panic? It may be wise to proceed with caution.

One should start by explaining that '99% accurate' means that when the test is applied to people known to have cancer the result is positive 99% of the time, and when applied to people known to be healthy, the result is negative 99% of the time. We express this accuracy as q(y|c) = A = 0.99 and q(n|c̃) = A = 0.99 (y and n stand for 'positive' and 'negative', c and c̃ stand for 'cancer' or 'no cancer'). There is a 1% probability of false positives, q(y|c̃) = 1 − A, and a 1% probability of false negatives, q(n|c) = 1 − A.

On the other hand, what we really want to know is p(c) = q(c|y), the probability of having cancer given that you tested positive. This is not the same as the probability of testing positive given that you have cancer, q(y|c); the two probabilities are not the same thing! So there might be some hope. The connection between what we want, q(c|y), and what we know, q(y|c), is given by Bayes' theorem,
\[
q(c|y) = \frac{q(c)\,q(y|c)}{q(y)}\;.
\]
An important virtue of Bayes' rule is that it doesn't just tell you how to process information; it also tells you what information you should seek. In this case one should find q(c), the probability of having cancer irrespective of being tested positive or negative. Suppose you inquire and find that the incidence of cancer in the general population is 1%; this justifies setting q(c) = 0.01. Thus,
\[
q(c|y) = \frac{q(c)\,A}{q(y)}\;.
\]
One also needs to know q(y), the probability of the test being positive irrespective of whether the person has cancer or not. To obtain q(y) use
\[
q(\tilde c|y) = \frac{q(\tilde c)\,q(y|\tilde c)}{q(y)} = \frac{(1-q(c))\,(1-A)}{q(y)}\;,
\]
and q(c|y) + q(c̃|y) = 1, which leads to our final answer
\[
q(c|y) = \frac{q(c)\,A}{q(c)\,A + (1-q(c))(1-A)}\;. \tag{2.165}
\]

For an accuracy A = 0.99 and an incidence q(c) = 0.01 we get q(c|y) = 50%, which is not nearly as bad as one might have originally feared. Should one dismiss the information provided by the test as misleading? No. Note that the probability of having cancer prior to the test was 1% and on learning the test result this was raised all the way up to 50%. Note also that when the disease is really rare, q(c) → 0, we still get q(c|y) → 0 even when the test is quite accurate. This means that for rare diseases most positive tests turn out to be false positives.

We conclude that both the prior and the data contain important information; neither should be neglected.
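The arithmetic of eq.(2.165) is easy to check numerically. The short sketch below is an illustration added here (the additional values of A and q(c) beyond those of the example are made up for comparison):

    def p_cancer_given_positive(A, qc):
        # eq.(2.165): posterior probability of cancer given a positive test
        return qc * A / (qc * A + (1 - qc) * (1 - A))

    for A, qc in [(0.99, 0.01), (0.99, 0.001), (0.999, 0.01)]:
        print(f"A = {A}, q(c) = {qc}:  q(c|y) = {p_cancer_given_positive(A, qc):.3f}")

With A = 0.99 and q(c) = 0.01 the answer is 0.500 as stated; with a rarer incidence q(c) = 0.001 the same test yields only about 0.09, illustrating how strongly the conclusion depends on the prior.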
Remark: The previous discussion illustrates a mistake that is common in verbal discussions: if h denotes a hypothesis and e is some evidence, it is quite obvious that we should not confuse q(e|h) with q(h|e). However, when expressed verbally the distinction is not nearly as obvious. For example, in a criminal trial jurors might be told that if the defendant was guilty (the hypothesis) the probability of some observed evidence would be large, and the jurors might easily be misled into concluding that given the evidence the probability is high that the defendant is guilty. Lawyers call this the prosecutor's fallacy.

Example: Uncertain data, nuisance variables and Jeffrey's rule

As before we want to update from a prior joint distribution q(θ, x) = q(x)q(θ|x) to a posterior joint distribution p(θ, x) = p(x)p(θ|x) when information becomes available. When the information is data x′ that precisely fixes the value of x, we impose that p(x) = δ(x − x′). The remaining unknown p(θ|x) is determined by invoking the PMU: no further updating is needed. This fixes the new p(θ|x′) to be the old q(θ|x′) and yields Bayes' rule.

It may happen, however, that there is a measurement error and the data x′ that was actually observed does not constrain the value of x completely. To be explicit let us assume that the remaining uncertainty in x is well understood: the observation x′ constrains our beliefs about x to a distribution P_x′(x) that happens to be known. P_x′(x) could, for example, be a Gaussian distribution centered at x′, with some known standard deviation σ.

This information is incorporated into the posterior distribution, p(θ, x) = p(x)p(θ|x), by imposing that p(x) = P_x′(x). The remaining conditional distribution is, as before, determined by invoking the PMU,
\[
p(\theta|x) = q(\theta|x)\;, \tag{2.166}
\]
and therefore, the joint posterior is
\[
p(\theta, x) = P_{x'}(x)\,q(\theta|x)\;. \tag{2.167}
\]
Marginalizing over the uncertain x yields the new posterior for θ,
\[
p(\theta) = \int dx\, P_{x'}(x)\,q(\theta|x)\;. \tag{2.168}
\]
This generalization of Bayes' rule is sometimes called Jeffrey's conditionalization rule [Jeffrey 2004].
Incidentally, this is an example of updating that shows that it is not always the case that information comes purely in the form of data x′. In the derivation above there clearly is some information in the observed value x′ and also some information in the particular functional form of the distribution P_x′(x), whether it is a Gaussian or some other distribution.

The common element in our previous derivation of Bayes' rule and in the present derivation of Jeffrey's rule is that in both cases the information being processed is conveyed as a constraint on the allowed posterior marginal distributions p(x).12 Later, in chapter 5, we shall see how the updating rules can be generalized still further to apply to even more general constraints.

12 The concept of information is central to our discussions but so far we have been vague about its meaning. So here is a preview of things to come: What is information? We will continue to use the term with its usual colloquial meaning, namely, roughly, information is what you get when your question receives a satisfactory answer. But we will also need a more precise and technical definition. Later we shall elaborate on the idea that information is a constraint on our beliefs, or better, on what our beliefs ought to be if only we were ideally rational.

There is an alternative way to interpret (or derive) Jeffrey's rule. Just as with Bayesian updating the goal is to make an inference about θ on the basis of observed data x′ and a known relation between θ and x′. The difference in this case is that the relation between θ and x′ is expressed indirectly in terms of some other auxiliary variables y. We are given the relation between θ and the y variables and also the relation between y and the data x′. These relations are expressed by q(y|θ) and q(y|x′) = P_x′(y). Although we have no particular interest in these intermediate y variables they must nevertheless be included in the analysis. Since they add an additional layer of complication, they are often called nuisance variables. As before, the posterior p(θ) is given by Bayes' rule, eq.(2.160),
\[
p(\theta) = q(\theta|x') = \frac{q(\theta, x')}{q(x')} = \frac{1}{q(x')}\int dy\, q(\theta, x', y)
          = \int dy\, q(\theta|x', y)\,q(y|x')\;. \tag{2.169}
\]
Assuming that q(θ|x′, y) = q(θ|y), that is, that conditional on y, θ is independent of x, and that q(y|x′) = P_x′(y), we recover Jeffrey's rule, eq.(2.168).
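A minimal numerical sketch of Jeffrey's rule, eq.(2.168), is given below. It is an illustration added here, not part of the original text; the Gaussian prior, the Gaussian sampling distribution, and the value x′ = 1.5 with spread 0.5 are all hypothetical choices made only for the demonstration:

    import numpy as np

    theta = np.linspace(-5, 5, 401)      # grid for the parameter theta
    x = np.linspace(-8, 8, 401)          # grid for the data variable x
    dth, dx = theta[1] - theta[0], x[1] - x[0]

    def gauss(u, mu, sig):
        return np.exp(-(u - mu)**2 / (2 * sig**2)) / np.sqrt(2 * np.pi * sig**2)

    q_theta = gauss(theta, 0.0, 2.0)                           # prior q(theta)
    q_x_given_theta = gauss(x[None, :], theta[:, None], 1.0)   # sampling distribution q(x|theta)

    # prior conditional q(theta|x), normalized as a density in theta for each x
    joint = q_theta[:, None] * q_x_given_theta
    q_theta_given_x = joint / (joint.sum(axis=0, keepdims=True) * dth)

    # uncertain datum: x is constrained to P_x'(x), a Gaussian centered at x' = 1.5
    P_xprime = gauss(x, 1.5, 0.5)

    # Jeffrey's rule, eq.(2.168): p(theta) = integral dx P_x'(x) q(theta|x)
    p_theta = (q_theta_given_x * P_xprime[None, :]).sum(axis=1) * dx

    print("normalization :", p_theta.sum() * dth)            # should be close to 1
    print("posterior mean:", (theta * p_theta).sum() * dth)

Setting the spread of P_x′(x) to zero recovers ordinary Bayesian updating on the datum x′.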

2.10.3 Multiple experiments, sequential updating

The problem here is to update our beliefs about θ on the basis of data x₁, x₂, … obtained in a sequence of experiments. The relations between θ and the variables x_i are given through known sampling distributions. We will assume that the experiments are independent but they need not be identical. When the experiments are not independent it is more appropriate to refer to them as a single more complex experiment the outcome of which is a collection of numbers {x₁, …, x_n}.

For simplicity we deal with just two identical experiments. The prior web of beliefs is described by the joint distribution,
\[
q(x_1, x_2, \theta) = q(\theta)\,q(x_1|\theta)\,q(x_2|\theta) = q(x_1)\,q(\theta|x_1)\,q(x_2|\theta)\;, \tag{2.170}
\]
where we have used independence, q(x₂|θ, x₁) = q(x₂|θ).

The first experiment yields the data x₁ = x′₁. Bayes' rule gives the updated distribution for θ as
\[
p_1(\theta) = q(\theta|x'_1) = q(\theta)\,\frac{q(x'_1|\theta)}{q(x'_1)}\;. \tag{2.171}
\]
The second experiment yields the data x₂ = x′₂ and requires a second application of Bayes' rule. The posterior p₁(θ) in eq.(2.171) now plays the role of the prior and the new posterior distribution for θ is
\[
p_{12}(\theta) = p_1(\theta|x'_2) = p_1(\theta)\,\frac{q(x'_2|\theta)}{p_1(x'_2)}\;, \tag{2.172}
\]
therefore
\[
p_{12}(\theta) \propto q(\theta)\,q(x'_1|\theta)\,q(x'_2|\theta)\;. \tag{2.173}
\]
We have explicitly followed the update from q(θ) to p₁(θ) to p₁₂(θ). The same result is obtained if the data from both experiments were processed simultaneously,
\[
p_{12}(\theta) = q(\theta|x'_1, x'_2) \propto q(\theta)\,q(x'_1, x'_2|\theta)\;. \tag{2.174}
\]
From the symmetry of eq.(2.173) it is clear that the same posterior p₁₂(θ) is obtained irrespective of the order in which the data x′₁ and x′₂ are processed. The commutativity of Bayesian updating follows from the special circumstance that the information conveyed by one experiment does not revise or render obsolete the information conveyed by the other experiment. As we generalize our methods of inference for processing other kinds of information that do interfere with each other (and therefore one may render the other obsolete) we should not expect, much less demand, that commutativity will continue to hold.
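The commutativity just described is easy to verify numerically. The sketch below is an illustration added here (a hypothetical coin of unknown bias θ, a flat prior on a grid, and the two "experiments" are two observed tosses); it processes the data sequentially and simultaneously and checks that the posteriors coincide:

    import numpy as np

    theta = np.linspace(0.001, 0.999, 999)   # grid of coin biases (hypothetical parameter)
    prior = np.ones_like(theta)
    prior /= prior.sum()

    def update(p, outcome):
        # one application of Bayes' rule on the grid; outcome 1 = heads, 0 = tails
        like = theta if outcome == 1 else 1.0 - theta
        post = p * like
        return post / post.sum()

    # sequential updating: the first experiment gives heads, the second gives tails
    p12 = update(update(prior, 1), 0)

    # simultaneous updating with the joint likelihood theta*(1-theta)
    post_joint = prior * theta * (1.0 - theta)
    post_joint /= post_joint.sum()

    print("max difference:", np.abs(p12 - post_joint).max())   # numerically zero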

2.10.4 Remarks on priors*

Let us return to the question of the extent to which probabilities incorporate subjective and objective elements. We have seen that Bayes' rule allows us to update from prior to posterior distributions. The posterior distributions incorporate the presumably objective information contained in the data plus whatever earlier beliefs had been codified into the prior. To the extent that the Bayes updating rule is itself unique one can claim that the posterior is "more objective" than the prior. As we update more and more we should expect that our probabilities will reflect more and more the input data and less and less the original subjective prior distribution. In other words, some subjectivity is unavoidable at the beginning of an inference chain, but it can be gradually suppressed as more and more information is processed.

The problem of choosing the first prior in the inference chain is a difficult one. We will tackle it in several different ways. Later in this chapter, as we introduce some elementary notions of data analysis, we will address it in the standard way: just make a "reasonable" guess — whatever that might mean. When tackling familiar problems where we have experience and intuition this seems to work well. But when the problems are truly new and we have neither experience nor intuition then the guessing can be risky and we would like to develop more systematic ways to proceed. Indeed it can be shown that certain types of prior information (for example, symmetries and/or other constraints) can be objectively translated into a prior once we have developed the appropriate tools — entropy and geometry. (See e.g. [Jaynes 1968][Caticha Preuss 2004] and references therein.)

Our more immediate goal here is, first, to remark on the dangerous consequences of extreme degrees of belief, and then to prove our previous intuitive assertion that the accumulation of data will swamp the original prior and render it irrelevant.

Dangerous extremes: the prejudiced mind

The consistency of Bayes' rule can be checked for the extreme cases of certainty and impossibility: Let B describe any background information. If q(θ|B) = 1, then assuming θB is no different from assuming B alone; they are epistemically equivalent. Therefore q(x|θB) = q(x|B) and Bayes' rule gives
\[
p(\theta|B) = q(\theta|B)\,\frac{q(x|\theta B)}{q(x|B)} = 1\;. \tag{2.175}
\]
A similar argument can be carried through in the case of impossibility: if q(θ|B) = 0, then p(θ|B) = 0. The conclusion is that if we are absolutely certain about the truth of θ, acquiring data x will have absolutely no effect on our opinions; the new data is worthless.

This should serve as a warning about the dangers of erroneously assigning a probability of 1 or of 0: since no amount of data could sway us from our prior beliefs we may decide we did not need to collect the data in the first place. If you are absolutely sure that Jupiter has no moons, you may either decide that it is not necessary to look through the telescope, or, if you do look and you see some little bright spots, you will probably decide the spots are mere optical illusions. Extreme degrees of belief are dangerous: truly prejudiced minds do not, and indeed cannot, question their own beliefs.

Lots of data overwhelms the prior

As more and more data are accumulated according to the sequential updating described earlier one would expect that the continuous inflow of information will eventually render irrelevant whatever prior information we might have had at the start. We will now show that this is indeed the case: unless we have assigned a pathological prior, after a large number of experiments the posterior becomes essentially independent of the prior.

Consider N independent repetitions of a certain experiment that yield the data x = {x₁ … x_N}. (For simplicity we omit all primes on the observed data.) The corresponding likelihood is
\[
q(x|\theta) = \prod_{r=1}^{N} q(x_r|\theta)\;, \tag{2.176}
\]
and the posterior distribution p(θ) is
\[
p(\theta) = \frac{q(\theta)}{q(x)}\,q(x|\theta) = \frac{q(\theta)}{q(x)}\prod_{r=1}^{N} q(x_r|\theta)\;. \tag{2.177}
\]
To investigate the extent to which the data x supports a particular value θ₁ rather than any other value θ₂ it is convenient to study the ratio
\[
\frac{p(\theta_1)}{p(\theta_2)} = \frac{q(\theta_1)}{q(\theta_2)}\,R(x)\;, \tag{2.178}
\]
where we introduced the likelihood ratios
\[
R(x) \stackrel{\text{def}}{=} \prod_{r=1}^{N} R_r(x_r) \quad\text{and}\quad R_r(x_r) \stackrel{\text{def}}{=} \frac{q(x_r|\theta_1)}{q(x_r|\theta_2)}\;. \tag{2.179}
\]

We will prove the following theorem:
\[
\text{given }\theta_1,\quad R(x) \to \infty \quad\text{in probability.} \tag{2.180}
\]
Equivalently, this is expressed as
\[
\lim_{N\to\infty} \Pr\big(R(x) > \Lambda\,\big|\,\theta_1\big) = 1 \tag{2.181}
\]
for any arbitrarily large positive number Λ.

The significance of the theorem is that, barring two trivial exceptions, the accumulation of data will drive a rational agent to become more and more convinced of the truth — in this case the truth is θ₁ — and this happens irrespective of the prior q(θ). The first exception occurs when the prior q(θ₁) vanishes, which reflects a mind that is so deeply prejudiced that it is incapable of learning despite overwhelming evidence to the contrary. Such an agent can hardly be called rational. The second exception occurs when q(x_r|θ₁) = q(x_r|θ₂) for all x_r. This represents an experiment that is so poorly designed that it offers no possibility of distinguishing between θ₁ and θ₂.
The proof of the theorem is an application of the law of large numbers. Consider the quantity
\[
\frac{1}{N}\log R(x) = \frac{1}{N}\sum_{r=1}^{N}\log R_r(x_r)\;. \tag{2.182}
\]
Since the variables log R_r(x_r) are independent, eq.(2.118) gives
\[
\lim_{N\to\infty}\Pr\left(\Big|\frac{1}{N}\log R(x) - K(\theta_1,\theta_2)\Big| \le \varepsilon \,\Big|\, \theta_1\right) = 1\;, \tag{2.183}
\]
where ε is any small positive number and
\[
K(\theta_1,\theta_2) = \Big\langle \frac{1}{N}\log R(x)\Big\rangle_{\theta_1} = \sum_{x_r} q(x_r|\theta_1)\log R_r(x_r)\;. \tag{2.184}
\]
In other words,
\[
\text{given }\theta_1,\quad e^{N(K-\varepsilon)} \le R(x) \le e^{N(K+\varepsilon)} \quad\text{in probability.} \tag{2.185}
\]
In Chapter 4 we will prove that K(θ₁, θ₂) ≥ 0, which is called the Gibbs inequality. The equality holds if and only if the two distributions q(x_r|θ₁) and q(x_r|θ₂) are identical, which is precisely the second of the two trivial exceptions we explicitly avoid. Thus K(θ₁, θ₂) > 0, and this concludes the proof.

We see here the first appearance of a quantity,
\[
K(\theta_1,\theta_2) = +\sum_{x_r} q(x_r|\theta_1)\log\frac{q(x_r|\theta_1)}{q(x_r|\theta_2)}\;, \tag{2.186}
\]
that will prove to be central in later discussions. When multiplied by −1, the quantity K(θ₁, θ₂) is called the relative entropy — the entropy of q(x_r|θ₁) relative to q(x_r|θ₂).13 It can be interpreted as a measure of the extent to which the distribution q(x_r|θ₁) can be distinguished from q(x_r|θ₂).

13 Other names include relative information, directed divergence, and Kullback-Leibler distance.
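The exponential growth of R(x) with N is easy to see in a small simulation. The sketch below is an illustration added here (the Bernoulli sampling distributions with biases θ₁ = 0.6 and θ₂ = 0.5 are hypothetical); data drawn with θ₁ makes (1/N) log R(x) converge to K(θ₁, θ₂), as in eq.(2.185), so that the likelihood ratio eventually swamps any non-pathological prior:

    import numpy as np

    rng = np.random.default_rng(1)
    theta1, theta2 = 0.6, 0.5          # hypothetical coin biases

    # relative entropy K(theta_1, theta_2) for a single Bernoulli trial, eq.(2.186)
    K = (theta1 * np.log(theta1 / theta2)
         + (1 - theta1) * np.log((1 - theta1) / (1 - theta2)))

    for N in [100, 1000, 10_000, 100_000]:
        x = rng.random(N) < theta1     # data generated with theta_1
        log_R = np.sum(np.where(x, np.log(theta1 / theta2),
                                   np.log((1 - theta1) / (1 - theta2))))
        print(f"N = {N:>6}:  (1/N) log R = {log_R / N:.5f}   (K = {K:.5f})")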

The marginalization Paradox***


Non-informative priors***
The Stein shrinking phenomenon***

2.11 Hypothesis testing and confirmation

The basic goal of statistical inference is to update our opinions about the truth of a particular theory or hypothesis θ on the basis of evidence provided by data E. The update proceeds according to Bayes' rule,14
\[
p(\theta|E) = p(\theta)\,\frac{p(E|\theta)}{p(E)}\;, \tag{2.187}
\]
and one can say that the hypothesis θ is partially confirmed or corroborated by the evidence E when p(θ|E) > p(θ).

14 From here on we revert to the usual notation p for probabilities. Whether p refers to a prior or a posterior will, as is usual in this field, have to be inferred from the context.

Sometimes one wishes to compare two hypotheses, θ₁ and θ₂, and the comparison is conveniently done using the ratio
\[
\frac{p(\theta_1|E)}{p(\theta_2|E)} = \frac{p(\theta_1)}{p(\theta_2)}\,\frac{p(E|\theta_1)}{p(E|\theta_2)}\;. \tag{2.188}
\]
The relevant quantity is the "likelihood ratio" or "Bayes factor"
\[
R(\theta_1,\theta_2) \stackrel{\text{def}}{=} \frac{p(E|\theta_1)}{p(E|\theta_2)}\;. \tag{2.189}
\]
When R(θ₁, θ₂) > 1 one says that the evidence E provides support in favor of θ₁ against θ₂.

The question of the testing or confirmation of a hypothesis is so central to the scientific method that it pays to explore it. First we introduce the concept of weight of evidence, a variant of the Bayes factor, that has been found particularly useful in such discussions. Then, to explore some of the subtleties and potential pitfalls, we discuss the paradox associated with the name of Hempel.

Weight of evidence

A useful variant of the Bayes factor is its logarithm,
\[
w_E(\theta_1,\theta_2) \stackrel{\text{def}}{=} \log\frac{p(E|\theta_1)}{p(E|\theta_2)}\;. \tag{2.190}
\]
This is called the weight of evidence for θ₁ against θ₂ [Good 1950].15 A useful special case is when the second hypothesis θ₂ is the negation of the first. Then
\[
w_E(\theta) \stackrel{\text{def}}{=} \log\frac{p(E|\theta)}{p(E|\tilde\theta)} \tag{2.191}
\]
is called the weight of evidence in favor of the hypothesis θ provided by the evidence E. The change to a logarithmic scale is convenient because it confers useful additive properties upon the weight of evidence — which justifies calling it a 'weight'. Consider, for example, the odds in favor of θ given by the ratio
\[
\mathrm{Odds}(\theta) \stackrel{\text{def}}{=} \frac{p(\theta)}{p(\tilde\theta)}\;. \tag{2.192}
\]
The posterior and prior odds are related by
\[
\frac{p(\theta|E)}{p(\tilde\theta|E)} = \frac{p(\theta)}{p(\tilde\theta)}\,\frac{p(E|\theta)}{p(E|\tilde\theta)}\;, \tag{2.193}
\]
and taking logarithms we have
\[
\log \mathrm{Odds}(\theta|E) = \log \mathrm{Odds}(\theta) + w_E(\theta)\;. \tag{2.194}
\]
The weight of evidence can be positive and provide a partial confirmation of the hypothesis by increasing its odds, or it can be negative and provide a partial refutation. Furthermore, when we deal with two pieces of evidence and E consists of E₁ and E₂, we have
\[
\log\frac{p(E_1E_2|\theta)}{p(E_1E_2|\tilde\theta)} = \log\frac{p(E_1|\theta)}{p(E_1|\tilde\theta)} + \log\frac{p(E_2|E_1\theta)}{p(E_2|E_1\tilde\theta)}\;,
\]
so that
\[
w_{E_1E_2}(\theta) = w_{E_1}(\theta) + w_{E_2|E_1}(\theta)\;. \tag{2.195}
\]

15 According to [Good 1983] the concept was known to H. Jeffreys and A. Turing around 1940-41 and C. S. Peirce had proposed the name weight of evidence for a similar concept already in 1878.

Hempel's paradox

Here is the paradox: "A case of a hypothesis supports the hypothesis. Now, the hypothesis that all crows are black is logically equivalent to the contrapositive that all non-black things are non-crows, and this is supported by the observation of a white shoe." [Hempel 1967]

The premise that "a case of a hypothesis supports the hypothesis" seems reasonable enough. After all, how else but by observing black crows can one ever expect to confirm that "all crows are black"? But to assert that the observation of a white shoe confirms that all crows are black seems a bit too much. If so then the very same white shoe would equally well confirm the hypotheses that all crows are green, or that all swans are black. We have a paradox.

Let us consider the starting premise that the observation of a black crow supports the hypothesis θ = "All crows are black" more carefully. Suppose we observe a crow (C) and it turns out to be black (B). The evidence is E = B|C, and the corresponding weight of evidence is positive,
\[
w_{B|C}(\theta) = \log\frac{p(B|C\theta)}{p(B|C\tilde\theta)} = \log\frac{1}{p(B|C\tilde\theta)} \ge 0\;, \tag{2.196}
\]
as expected. It is this result that justifies our intuition that "a case of a hypothesis supports the hypothesis"; the question is whether there are limitations. [Good 1983]

The reference to the possibility of white shoes points to an uncertainty about whether the observed object will turn out to be a crow or something else. Using eq.(2.195) the relevant weight of evidence concerns the joint probability of B and C,
\[
w_{BC}(\theta) = w_C(\theta) + w_{B|C}(\theta)\;, \tag{2.197}
\]
which, as we show below, is also positive. Indeed, using Bayes' theorem,
\[
w_C(\theta) = \log\frac{p(C|\theta)}{p(C|\tilde\theta)} = \log\left(\frac{p(C)\,p(\theta|C)}{p(\theta)}\,\frac{p(\tilde\theta)}{p(C)\,p(\tilde\theta|C)}\right). \tag{2.198}
\]

Now, in the absence of any background information about crows the observation that a certain object turns out to be a crow tells us nothing about its color and therefore p(θ|C) = p(θ) and p(θ̃|C) = p(θ̃). Therefore w_C(θ) = 0. Recalling eq.(2.196) leads to
\[
w_{BC}(\theta) \ge 0\;. \tag{2.199}
\]
A similar conclusion holds if the evidence consists in the observation of a white shoe. Does a non-black non-crow support "all crows are black"? In this case
\[
w_{\tilde B\tilde C}(\theta) = w_{\tilde B}(\theta) + w_{\tilde C|\tilde B}(\theta) \ge 0
\]
because
\[
w_{\tilde C|\tilde B}(\theta) = \log\frac{p(\tilde C|\tilde B\theta)}{p(\tilde C|\tilde B\tilde\theta)} = \log\frac{1}{p(\tilde C|\tilde B\tilde\theta)} \ge 0 \tag{2.200}
\]
and
\[
w_{\tilde B}(\theta) = \log\frac{p(\tilde B|\theta)}{p(\tilde B|\tilde\theta)} = \log\left(\frac{p(\tilde B)\,p(\theta|\tilde B)}{p(\theta)}\,\frac{p(\tilde\theta)}{p(\tilde B)\,p(\tilde\theta|\tilde B)}\right) = 0\;, \tag{2.201}
\]
because, just as before, in the absence of any background information about crows the observation of some non-black object tells us nothing about crows, so that p(θ|B̃) = p(θ) and p(θ̃|B̃) = p(θ̃).
But we could have additional background information that establishes a connection between θ and C. One possible scenario is the following: There are two worlds. In one world, denoted θ₁, there are a million birds of which one hundred are crows and all of them are black; in the other world, denoted θ₂, there also are a million birds among which there is one white crow and 999 black crows. We pick a bird at random and it turns out to be a black crow. Which world is it, θ₁ or θ₂ = θ̃₁? The weight of evidence is
\[
w_{BC}(\theta_1) = w_C(\theta_1) + w_{B|C}(\theta_1)\;.
\]
The relevant probabilities are p(B|Cθ₁) = 1 and p(B|Cθ₂) = 0.999. Therefore
\[
w_{B|C}(\theta_1) = \log\frac{p(B|C\theta_1)}{p(B|C\theta_2)} = \log\frac{1}{1-10^{-3}} \approx 10^{-3}\;, \tag{2.202}
\]
while p(C|θ₁) = 10⁻⁴ and p(C|θ₂) = 10⁻³ so that
\[
w_C(\theta_1) = \log\frac{p(C|\theta_1)}{p(C|\theta_2)} = \log\frac{1}{10} \approx -2.303\;. \tag{2.203}
\]
Therefore w_{BC}(θ₁) ≈ −2.302 < 0. In this scenario the observation of a black crow is evidence for the opposite conclusion, that not all crows are black.
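The numbers in this two-worlds example are easy to verify; the short sketch below (added here only as a check, using the probabilities stated above) simply recomputes the weights of evidence:

    import math

    pBC1, pBC2 = 1.0, 0.999      # p(B|C theta_1), p(B|C theta_2)
    pC1, pC2 = 1e-4, 1e-3        # p(C|theta_1),  p(C|theta_2)

    w_B_given_C = math.log(pBC1 / pBC2)   # ~ +0.001, eq.(2.202)
    w_C = math.log(pC1 / pC2)             # = -log 10 ~ -2.303, eq.(2.203)

    print("w_B|C =", w_B_given_C)
    print("w_C   =", w_C)
    print("w_BC  =", w_C + w_B_given_C)   # ~ -2.302 < 0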
We conclude that just like any other form of induction the principle that "a case of a hypothesis supports the hypothesis" involves considerable risk. Whether it is justified or not depends to a large extent on the nature of the available background information. When confronted with a situation in which we are completely ignorant about the relation between two variables the prudent way to proceed is, of course, to try to find out whether a relevant connection exists and what it might be. But this is not always possible and in these cases the default assumption should be that they are a priori independent. Or shouldn't it?

The justification of the assumption of independence a priori is purely pragmatic. Indeed the universe contains an infinitely large number of other variables about which we know absolutely nothing and that could in principle affect our inferences. Seeking information about all those other variables is clearly out of the question: waiting to make an inference until after all possible information has been collected amounts to being paralyzed into making no inferences at all. On the positive side, however, the assumption that the vast majority of those infinitely many other variables are completely irrelevant actually works — perhaps not all the time but at least most of the time. Induction is risky.

There is one final loose end that we must revisit: our arguments above indicate that, in the absence of any other background information, the observation of a white shoe not only supports the hypothesis that "all crows are black", but it also supports the hypothesis that "all swans are black". Two questions arise: is this reasoning correct? and, if so, why is it so disturbing? The answer to the first question is that it is indeed correct. The answer to the second question is that confirming the hypothesis "all swans are black" is disturbing because we do have background information about the color of swans which we failed to include in the analysis. Had we not known anything about swans there would have been no reason to feel any discomfort at all. This is just one more example of the fact that inductive arguments are not infallible; a positive weight of evidence provides mere support and not absolute certainty.

2.12 Examples from data analysis


To illustrate the use of Bayes’ theorem as a tool to process information when
the information is in the form of data we consider some elementary examples
from the field of data analysis. (For more detailed treatments that are friendly
to physicists see e.g. [Bretthorst 1988, Sivia Skilling 2006, Gregory 2005].)

2.12.1 Parameter estimation

Suppose the probability for the quantity x depends on certain parameters θ, p = p(x|θ). Although most of the discussion here can be carried out for an arbitrary function p it is best to be specific and focus on the important case of a Gaussian distribution,
\[
p(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]. \tag{2.204}
\]
The objective is to estimate the parameters θ = (μ, σ) on the basis of a set of data x = (x₁, …, x_N). We assume the measurements are statistically independent of each other and use Bayes' theorem to get
\[
p(\mu,\sigma|x) = \frac{p(\mu,\sigma)}{p(x)}\prod_{i=1}^{N} p(x_i|\mu,\sigma)\;. \tag{2.205}
\]
Independence is important in practice because it leads to considerable practical simplifications, but it is not essential: instead of N independent measurements each providing a single datum we would have a single complex experiment that provides N non-independent data.
Looking at eq.(2.205) we see that a more precise formulation of the same problem is the following. We want to estimate certain parameters θ, in our case μ and σ, from repeated measurements of the quantity x on the basis of several pieces of information. The most obvious is

1. The information contained in the actual values of the collected data x.

Almost equally obvious (at least to those who are comfortable with the Bayesian interpretation of probabilities) is

2. The information about the parameters θ that is codified into the prior distribution p(θ).

Where and how this prior information was obtained is not relevant at this point; it could have resulted from previous experiments, or from other background knowledge about the problem. The only relevant part is whatever ended up being distilled into p(θ).

The last piece of information is not always explicitly recognized; it is

3. The information that is codified into the functional form of the 'sampling' distribution p(x|θ).

If we are to estimate parameters θ on the basis of measurements of a quantity x it is clear that we must know how θ and x are related to each other. Notice that item 3 refers to the functional form – whether the distribution is Gaussian as opposed to Poisson or binomial or something else – and not to the actual values of the data x which is what is taken into account in item 1. The nature of the relation in p(x|θ) is in general statistical but it could also be completely deterministic. For example, when x is a known function of θ, say x = f(θ), we have p(x|θ) = δ[x − f(θ)]. In this latter case there is no need for Bayes' rule.
Returning to the Gaussian case, let us rewrite eq.(2.205) as
\[
p(\mu,\sigma|x) = \frac{p(\mu,\sigma)}{p(x)}\,\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2}\right]. \tag{2.206}
\]
Introducing the sample average x̄ and sample variance s²,
\[
\bar x = \frac{1}{N}\sum_{i=1}^{N} x_i \quad\text{and}\quad s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar x)^2\;, \tag{2.207}
\]
eq.(2.206) becomes
\[
p(\mu,\sigma|x) = \frac{p(\mu,\sigma)}{p(x)}\,\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\left[-\frac{(\mu-\bar x)^2 + s^2}{2\sigma^2/N}\right]. \tag{2.208}
\]
It is interesting that the data appears here only in the particular combination given in eq.(2.207) – different sets of data characterized by the same x̄ and s² lead to the same inference about μ and σ. (As discussed earlier the factor p(x) is not relevant here since it can be absorbed into the normalization of the posterior p(μ, σ|x).)
Eq.(2.208) incorporates the information described in items 1 and 3 above. The prior distribution, item 2, remains to be specified. Let us start by considering the simple case where the value of σ is actually known. Then p(μ, σ) = p(μ)δ(σ − σ₀) and the goal is to estimate μ. Bayes' theorem is now written as
\[
p(\mu|x) = \frac{p(\mu)}{p(x)}\,\frac{1}{(2\pi\sigma_0^2)^{N/2}}\exp\left[-\sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma_0^2}\right] \tag{2.209}
\]
\[
= \frac{p(\mu)}{p(x)}\,\frac{1}{(2\pi\sigma_0^2)^{N/2}}\exp\left[-\frac{(\mu-\bar x)^2 + s^2}{2\sigma_0^2/N}\right]
\propto p(\mu)\exp\left[-\frac{(\mu-\bar x)^2}{2\sigma_0^2/N}\right]. \tag{2.210}
\]
Suppose further that we know nothing about μ; it could have any value. This state of extreme ignorance is represented by a very broad distribution that we take as essentially uniform within some large range; μ is just as likely to have one value as another. For p(μ) ≈ const the posterior distribution is Gaussian, with mean given by the sample average x̄, and variance σ₀²/N. The best estimate for the value of μ is the sample average and the uncertainty is the standard deviation. This is usually expressed in the form
\[
\mu = \bar x \pm \frac{\sigma_0}{\sqrt N}\;. \tag{2.211}
\]
Note that the estimate of μ from N measurements has a much smaller error than the estimate from just one measurement; the individual measurements are plagued with errors but they tend to cancel out in the sample average — in agreement with the previous result in eq.(2.123).
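A short numerical sketch can confirm eq.(2.211). The code below is an illustration added here (the data are simulated with made-up values μ = 3.7, σ₀ = 0.5, N = 100, and the prior is taken flat as assumed above); it evaluates the posterior of eq.(2.210) on a grid and compares its width with σ₀/√N:

    import numpy as np

    rng = np.random.default_rng(2)
    mu_true, sigma0, N = 3.7, 0.5, 100
    x = rng.normal(mu_true, sigma0, N)          # simulated measurements

    mu_grid = np.linspace(3.0, 4.4, 2001)
    dmu = mu_grid[1] - mu_grid[0]
    log_post = -N * (mu_grid - x.mean())**2 / (2 * sigma0**2)   # flat prior, eq.(2.210)
    post = np.exp(log_post - log_post.max())
    post /= post.sum() * dmu

    mean = np.sum(mu_grid * post) * dmu
    std = np.sqrt(np.sum((mu_grid - mean)**2 * post) * dmu)
    print("sample average              :", x.mean())
    print("posterior mean and std      :", mean, std)
    print("expected error sigma0/sqrt(N):", sigma0 / np.sqrt(N))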
In the case of very little prior information — the uniform prior — we have recovered the same results as in the standard non-Bayesian data analysis approach. But there are two important differences: First, a frequentist approach can yield an estimator but it cannot yield a probability distribution for a parameter that is not random but merely unknown. Second, the non-Bayesian approach has no mechanism to handle additional prior information and can only proceed by ignoring it. On the other hand, the Bayesian approach has yielded a full probability distribution, eq.(2.210), and it can easily take prior information into account. For example, if, on the basis of other physical considerations, we happen to know that μ has to be positive, then we just assign p(μ) = 0 for μ < 0 and we calculate the estimate of μ from the truncated Gaussian in eq.(2.210).

A slightly more complicated case arises when the value of σ is not known. Let us assume again that our ignorance of both μ and σ is quite extreme and choose a uniform prior,
\[
p(\mu,\sigma) \propto
\begin{cases}
C & \text{for } \sigma > 0\\
0 & \text{otherwise.}
\end{cases} \tag{2.212}
\]
Another popular choice is a prior that is uniform in μ and in log σ. When there is a considerable amount of data the two choices lead to practically the same conclusions but we see that there is an important question here: what do we mean by the word 'uniform'? Uniform in terms of which variable? σ, or σ², or log σ? Later, in chapter 7, we shall have much more to say about this misleadingly innocuous question.
To estimate μ we return to eq.(2.206) or (2.208). For the purpose of estimating μ the variable σ is an uninteresting nuisance which, as we saw in section 2.5.4, can be eliminated through marginalization,
\[
p(\mu|x) = \int_0^\infty d\sigma\, p(\mu,\sigma|x) \tag{2.213}
\]
\[
\propto \int_0^\infty d\sigma\, \frac{1}{\sigma^N}\exp\left[-\frac{(\mu-\bar x)^2 + s^2}{2\sigma^2/N}\right]. \tag{2.214}
\]
Change variables to t = 1/σ; then
\[
p(\mu|x) \propto \int_0^\infty dt\, t^{N-2}\exp\left[-\frac{N t^2}{2}\Big((\mu-\bar x)^2 + s^2\Big)\right]. \tag{2.215}
\]
Repeated integrations by parts lead to
\[
p(\mu|x) \propto \left[N\Big((\mu-\bar x)^2 + s^2\Big)\right]^{-\frac{N-1}{2}}\;, \tag{2.216}
\]
which is called the Student-t distribution. Since the distribution is symmetric the estimate for μ is easy to get,
\[
\langle\mu\rangle = \bar x\;. \tag{2.217}
\]

The posterior p(μ|x) is a Lorentzian-like function raised to some power. As the number of data grows, say N ≳ 10, the tails of the distribution are suppressed and p(μ|x) approaches a Gaussian. To obtain an error bar in the estimate μ = x̄ we can estimate the variance of μ using the following trick. Note that for the Gaussian in eq.(2.204),
\[
\frac{d^2}{dx^2}\log p(x|\mu,\sigma)\Big|_{x_{\max}} = -\frac{1}{\sigma^2}\;. \tag{2.218}
\]
Therefore, to the extent that eq.(2.216) approximates a Gaussian, we can write
\[
(\Delta\mu)^2 \approx \left[-\frac{d^2}{d\mu^2}\log p(\mu|x)\Big|_{\max}\right]^{-1} = \frac{s^2}{N-1}\;. \tag{2.219}
\]
(This explains the famous factor of N − 1. As we can see it is not a particularly fundamental result; it follows from approximations that are meaningful only for large N.)
We can also estimate σ directly from the data. This requires that we marginalize over μ,
\[
p(\sigma|x) = \int_{-\infty}^{\infty} d\mu\, p(\mu,\sigma|x) \tag{2.220}
\]
\[
\propto \frac{1}{\sigma^N}\exp\left[-\frac{N s^2}{2\sigma^2}\right]\int_{-\infty}^{\infty} d\mu\,\exp\left[-\frac{(\mu-\bar x)^2}{2\sigma^2/N}\right]. \tag{2.221}
\]
The Gaussian integral over μ is (2πσ²/N)^{1/2} ∝ σ and therefore
\[
p(\sigma|x) \propto \frac{1}{\sigma^{N-1}}\exp\left[-\frac{N s^2}{2\sigma^2}\right]. \tag{2.222}
\]
As an estimate for σ we can use the value where the distribution is maximized,
\[
\sigma_{\max} = \sqrt{\frac{N}{N-1}\,s^2}\;, \tag{2.223}
\]
which agrees with our previous estimate of (Δμ)²,
\[
\frac{\sigma_{\max}^2}{N} = \frac{s^2}{N-1}\;. \tag{2.224}
\]
An error bar for σ itself can be obtained using the previous trick (provided N is large enough) of taking a second derivative of log p. The result is
\[
\sigma = \sigma_{\max} \pm \frac{\sigma_{\max}}{\sqrt{2(N-1)}}\;. \tag{2.225}
\]
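The estimate (2.223) can be checked on a grid. The sketch below is an illustration added here (simulated data with a made-up true σ = 2 and the flat prior assumed above); it locates the maximum of eq.(2.222) numerically and compares it with the analytic formula:

    import numpy as np

    rng = np.random.default_rng(3)
    N = 20
    x = rng.normal(0.0, 2.0, N)
    s2 = np.mean((x - x.mean())**2)                 # sample variance, eq.(2.207)

    sigma = np.linspace(0.5, 5.0, 5000)
    log_post = -(N - 1) * np.log(sigma) - N * s2 / (2 * sigma**2)   # eq.(2.222)

    sigma_max_numeric = sigma[np.argmax(log_post)]
    sigma_max_formula = np.sqrt(N / (N - 1) * s2)                   # eq.(2.223)
    print("numerical maximum :", sigma_max_numeric)
    print("eq.(2.223)        :", sigma_max_formula)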

2.12.2 Curve fitting

The problem of fitting a curve to a set of data points is a problem of parameter estimation. There are no new issues of principle to be resolved. In practice, however, it can be considerably more complicated than the simple cases discussed in the previous paragraphs.

The problem is as follows. The observed data is in the form of pairs (x_i, y_i) with i = 1, …, N and we believe that the true ys are related to the xs through a function y = f(x) which depends on several parameters θ. The goal is to estimate the parameters θ and the complication is that the measured values of y are afflicted by experimental errors,
\[
y_i = f(x_i) + \varepsilon_i\;. \tag{2.226}
\]
For simplicity we assume that the probability of the error ε_i is Gaussian with mean ⟨ε_i⟩ = 0 and that the variances ⟨ε_i²⟩ = σ² are known and the same for all data pairs. We also assume that there are no errors affecting the xs. A more realistic account might have to reconsider these assumptions.

The sampling distribution is
\[
p(y|\theta) = \prod_{i=1}^{N} p(y_i|\theta)\;, \tag{2.227}
\]
where
\[
p(y_i|\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y_i-f(x_i))^2}{2\sigma^2}\right]. \tag{2.228}
\]
Bayes' theorem gives
\[
p(\theta|y) \propto p(\theta)\exp\left[-\sum_{i=1}^{N}\frac{(y_i-f(x_i))^2}{2\sigma^2}\right]. \tag{2.229}
\]
As an example, suppose that we are trying to fit a straight line through the data points,
\[
f(x) = a + bx\;, \tag{2.230}
\]
and suppose further that being ignorant about the values of θ = (a, b) we choose p(θ) = p(a, b) ≈ const; then
\[
p(a,b|y) \propto \exp\left[-\sum_{i=1}^{N}\frac{(y_i-a-bx_i)^2}{2\sigma^2}\right]. \tag{2.231}
\]
A good estimate of a and b is the value that maximizes the posterior distribution, which we recognize as the Bayesian equivalent of the method of least squares. However, the Bayesian analysis can already take us beyond the scope of the least squares method because from p(a, b|y) we can also estimate the uncertainties Δa and Δb.
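A minimal sketch of the straight-line fit is given below; it is an illustration added here (the data are simulated with made-up values a = 1, b = 2, σ = 0.3, and the prior is flat as assumed above). Maximizing eq.(2.231) is an ordinary linear least-squares problem, and because the posterior is Gaussian in (a, b) its covariance also supplies the error bars Δa and Δb:

    import numpy as np

    rng = np.random.default_rng(4)
    a_true, b_true, sigma = 1.0, 2.0, 0.3
    x = np.linspace(0, 1, 30)
    y = a_true + b_true * x + rng.normal(0, sigma, x.size)

    # maximizing eq.(2.231) is linear least squares in (a, b)
    A = np.vstack([np.ones_like(x), x]).T
    a_hat, b_hat = np.linalg.lstsq(A, y, rcond=None)[0]

    # the Gaussian posterior for (a, b) has covariance sigma^2 (A^T A)^{-1}
    cov = sigma**2 * np.linalg.inv(A.T @ A)
    print("a =", a_hat, "+/-", np.sqrt(cov[0, 0]))
    print("b =", b_hat, "+/-", np.sqrt(cov[1, 1]))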

2.12.3 Model selection

Suppose we are trying to fit a curve y = f(x) through data points (x_i, y_i), i = 1, …, N. How do we choose the function f? To be specific let f be a polynomial of order n,
\[
f(x) = \theta_0 + \theta_1 x + \dots + \theta_n x^n\;; \tag{2.232}
\]
the techniques of the previous section allow us to estimate the parameters θ₀, …, θ_n but how do we decide the order n? Should we fit a straight or a quadratic line? It is not obvious. Having more parameters means that we will be able to achieve a closer fit to the data, which is good, but we might also be fitting the noise, which is bad. The same problem arises when the data shows peaks and we want to estimate their location, their width, and their number. Could there be an additional peak hiding in the noise? Are we just fitting the noise, or does the data really support one additional peak?

We say these are problems of model selection. To appreciate how important they can be consider replacing the modestly unassuming word 'model' by the more impressive sounding word 'theory'. Given two competing theories, which one does the data support best? What is at stake is nothing less than the foundation of experimental science.16

16 For useful references on this topic see [Balasubramanian 1996, 1997], [Rodriguez 2005].

On the basis of data x we want to select one model among several competing candidates labeled by m = 1, 2, … Suppose model m is defined in terms of some parameters θ_m = {θ_{m1}, θ_{m2}, …} and their relation to the data x is contained in the sampling distribution p(x|m, θ_m). The extent to which the data supports model m, i.e., the probability of model m given the data, is given by Bayes' theorem,
\[
p(m|x) = \frac{p(m)}{p(x)}\,p(x|m)\;, \tag{2.233}
\]
where p(m) is the prior for the model.

The factor p(x|m) is the prior probability for the data given the model and plays the role of a likelihood function. This is precisely the quantity which, back in eq.(2.163), we had called the 'evidence',
\[
p(x|m) = \int d\theta_m\, p(x,\theta_m|m) = \int d\theta_m\, p(\theta_m|m)\,p(x|m,\theta_m)\;. \tag{2.234}
\]
Thus we see that while the evidence is of no significance in the problem of estimating parameters within a given model, it turns out to be the central quantity when choosing among different models.

Substituting back into (2.233) gives
\[
p(m|x) \propto p(m)\int d\theta_m\, p(\theta_m|m)\,p(x|m,\theta_m)\;. \tag{2.235}
\]
Thus, the problem of model selection is solved, at least in principle, once the priors p(m) and p(θ_m|m) are assigned. Of course, the practical problem of calculating the multi-dimensional integrals can be quite formidable.
No further progress is possible without making specific choices for the various functions in eq.(2.235) but we can offer some qualitative comments. When comparing two models, m₁ and m₂, it is fairly common to argue that a priori we have no reason to prefer one over the other and therefore we assign the same prior probability p(m₁) = p(m₂). (Of course this is not always justified. Particularly in the case of theories that claim to be fundamental people usually have very strong prior prejudices favoring one theory against the other. Be that as it may, let us proceed.)

Suppose the prior p(θ_m|m) represents a uniform distribution over the parameter space. Since
\[
\int d\theta_m\, p(\theta_m|m) = 1\;, \quad\text{then}\quad p(\theta_m|m) \approx \frac{1}{V_m}\;, \tag{2.236}
\]
where V_m is the 'volume' of the parameter space. Suppose further that p(x|m, θ_m) has a single peak of height L_max spread out over a region of 'volume' δθ_m. The value of θ_m where p(x|m, θ_m) attains its maximum can be used as an estimate for θ_m and the 'volume' δθ_m is then interpreted as an uncertainty. Then the integral of p(x|m, θ_m) can be approximated by the product L_max δθ_m. Thus, in a very rough and qualitative way the probability for the model given the data is
\[
p(m|x) \propto \frac{L_{\max}\,\delta\theta_m}{V_m}\;. \tag{2.237}
\]
We can now interpret eq.(2.237) as follows. Our preference for a model will be dictated by how well the model fits the data; this is measured by [p(x|m, θ_m)]_max = L_max. The volume of the region of uncertainty δθ_m also contributes: if more values of the parameters are consistent with the data, then there are more ways the model agrees with the data, and the model is favored. Finally, the larger the volume of possible parameter values V_m the more the model is penalized. Since a larger volume V_m means a more complex model, the 1/V_m factor penalizes complexity. The preference for simpler models is said to implement Occam's razor. This is a reference to the principle, stated by William of Occam, a 13th century Franciscan monk, that one should not seek a more complicated explanation when a simpler one will do. Such an interpretation is satisfying but ultimately it is quite unnecessary. Occam's principle need not be put in by hand: Bayes' theorem takes care of it automatically in eq.(2.235)!
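A rough numerical sketch of eqs.(2.235)-(2.237) is given below; it is an illustration added here, not part of the original text. The data are simulated from a straight line, σ is taken as known, and the priors are uniform over a hypothetical box of width 10 per parameter; for this linear-Gaussian setup the parameter integral can be done with a Gaussian (Laplace-type) approximation that mirrors eq.(2.237), provided the box contains essentially all of the likelihood:

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(0, 1, 20)
    sigma = 0.2
    y = 1.0 + 2.0 * x + rng.normal(0, sigma, x.size)     # data generated by a line

    def log_evidence(order, prior_width=10.0):
        # design matrix for a polynomial of the given order
        A = np.vstack([x**k for k in range(order + 1)]).T
        theta_hat = np.linalg.lstsq(A, y, rcond=None)[0]      # best-fit parameters
        resid = y - A @ theta_hat
        log_Lmax = (-np.sum(resid**2) / (2 * sigma**2)
                    - x.size * np.log(np.sqrt(2 * np.pi) * sigma))
        # Gaussian integral over the parameters:
        # evidence ~ L_max * (2 pi)^{k/2} sqrt(det C) / V_m,  with C = sigma^2 (A^T A)^{-1},
        # which is the L_max * (delta theta) / V_m structure of eq.(2.237)
        C = sigma**2 * np.linalg.inv(A.T @ A)
        k = order + 1
        log_occam = (0.5 * k * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(C))
                     - k * np.log(prior_width))
        return log_Lmax + log_occam

    print("log evidence, straight line:", log_evidence(1))
    print("log evidence, quadratic    :", log_evidence(2))

On data of this kind one typically finds the straight line favored: the quadratic fits the noise slightly better, but its extra Occam factor more than cancels the gain.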

2.12.4 Maximum Likelihood


If one adopts the frequency interpretation of probabilities then most uses of
Bayes’theorem are not allowed. The reason is simple: from a frequentist per-
spective it makes sense to assign a probability distribution p(xj ) to the data
x = fxi g because the x are random variables but it is absolutely meaningless
to talk about probabilities for the parameters because they have no frequency
distributions; they are not random, they are merely unknown. This means that
many problems in science lie beyond the reach of a frequentist probability the-
ory.
To overcome this di¢ culty a new subject was invented: statistics. Within
the Bayesian approach the two subjects, statistics and probability theory, are
uni…ed into the single …eld of inductive inference. In the frequentist approach in
order to infer an unknown quantity on the basis of measurements of another
quantity, the data x, one postulates the existence of some function of the data,
68 Probability

θ̂(x), called the 'statistic' or the 'estimator', that relates the two: the estimate
for θ is θ̂(x). The problem is to estimate the unknown θ when what is known
is the sampling distribution p(x|θ) and the data x. The solution proposed
by Fisher was to select as estimator θ̂(x) that value of θ that maximizes the
probability of the observed data x. Since p(x|θ) is a function of the variable x
in which θ appears as a fixed parameter, Fisher introduced a function of θ, which
he called the likelihood function, in which the observed data x appear as fixed
parameters,
\[
L(\theta|x) \overset{\text{def}}{=} p(x|\theta)\;.
\qquad (2.238)
\]
Thus, the estimator θ̂(x) is the value of θ that maximizes the likelihood function
and, accordingly, this method of parameter estimation is called the method of
'maximum likelihood'.
The likelihood function L(θ|x) is not the probability of θ; it is not normalized
in any way; and it makes no sense to use it to compute an average or a variance
of θ. Nevertheless, the same intuition that leads one to propose maximization
of the likelihood to estimate θ also suggests using the width of the likelihood
function to estimate an error bar. Fisher's somewhat ad hoc proposal turned
out to be extremely useful and it dominated the field of statistics throughout the
20th century. Its success is readily explained within the Bayesian framework.

The Bayesian approach agrees with the method of maximum likelihood in
the common case where the prior is uniform,
\[
p(\theta) = \text{const}
\quad\Longrightarrow\quad
p(\theta|x) \propto p(\theta)\,p(x|\theta) \propto p(x|\theta)\;.
\qquad (2.239)
\]
This is why the Bayesian discussion in this section has reproduced so many of
the standard results of the 'orthodox' theory. But the Bayesian approach
has many other advantages. In addition to greater conceptual clarity, unlike the
likelihood function, the Bayesian posterior is a true probability distribution that
allows estimation not just of θ but of all its moments. And, most important,
there is no limitation to uniform priors. If there is additional prior information
that is relevant to a problem, the prior distribution provides a mechanism to
take it into account.
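A small Python sketch (with invented Gaussian data and a known unit variance; this is an illustration, not a prescription from the text) shows the agreement expressed by eq.(2.239): the maximum likelihood estimate coincides with the peak of the posterior obtained from a uniform prior, and the posterior width reproduces the usual error bar.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # hypothetical data, known sigma = 1

thetas = np.linspace(0.0, 4.0, 4001)
loglike = np.array([-0.5 * np.sum((x - t) ** 2) for t in thetas])

# Maximum likelihood: the theta that maximizes L(theta|x) = p(x|theta)
theta_mle = thetas[np.argmax(loglike)]

# Bayesian posterior with a uniform prior: p(theta|x) proportional to p(x|theta)
post = np.exp(loglike - loglike.max())
post /= np.trapz(post, thetas)
theta_mean = np.trapz(thetas * post, thetas)
theta_std = np.sqrt(np.trapz((thetas - theta_mean) ** 2 * post, thetas))

print(f"MLE estimate         : {theta_mle:.3f}")
print(f"Posterior mean +/- sd: {theta_mean:.3f} +/- {theta_std:.3f}")
print(f"Analytic values      : {x.mean():.3f} +/- {1/np.sqrt(len(x)):.3f}")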
Chapter 3

Entropy I: The Evolution of Carnot's Principle
An important problem that occupied the minds of many scientists in the 18th
century was to figure out how to construct a perpetual motion machine. They
all failed. Ever since a rudimentary understanding of the laws of thermodynam-
ics was achieved in the 19th century no competent scientist would waste time
considering perpetual motion. Other scientists tried to demonstrate the impos-
sibility of perpetual motion from the established principles of mechanics. They
failed too: there exist no derivations of the Second Law from purely mechanical
principles. It took a long time, and for many the subject remains controver-
sial, but it has gradually become clear that the reason hinges on the fact that
entropy is not a physical quantity to be derived from mechanics; it is a tool
for inference, a tool for reasoning in situations of incomplete information. It is
quite impossible that such a non-mechanical quantity could have emerged from
a combination of purely mechanical notions. If anything it should be the other
way around.
In this chapter we trace some of the early developments leading to the notion
of entropy. Much of this chapter (including the title) is inspired by a beautiful
article by E. T. Jaynes [Jaynes 1988]. I have also borrowed from historical
papers by Klein [1970, 1973] and Uffink [2004].

3.1 Carnot: reversible engines


Sadi Carnot was interested in improving the efficiency of steam engines, that is,
of maximizing the amount of useful work that can be extracted from an engine
per unit of burnt fuel. His work, published in 1824, was concerned with whether
the efficiency could be improved by either changing the working substance to
something other than steam or by changing the operating temperatures and
pressures.
Carnot was convinced that perpetual motion was impossible but this was
not a fact that he could prove. Indeed, he could not have had a proof: thermo-
dynamics had not been invented yet. His conviction derived instead from the
long list of previous attempts (including those by his own father Lazare Carnot)
that had ended in failure. Carnot’s brilliant idea was to proceed anyway and
assume what he knew was true but could not prove as the postulate from which
he would draw all sorts of useful conclusions about engines.1
At the time Carnot did his work the nature of heat as a form of energy trans-
fer had not yet been understood. He adopted the model that was fashionable
at the time – the caloric model – according to which heat is a substance that
could be transferred but neither created nor destroyed. For Carnot an engine
would use heat to produce work in much the same way that falling water can
turn a waterwheel and produce work: the caloric would “fall” from a higher
temperature to a lower temperature thereby making the engine turn. What was
being transformed into work was not the caloric itself but the energy acquired
in the fall.
According to the caloric model the amount of heat extracted from the high
temperature source should be the same as the amount of heat discarded into the
low temperature sink. Later measurements showed that this was not true, but
Carnot was lucky. Although the model was seriously wrong, it did have a great
virtue: it suggested that the generation of work in a heat engine should include
not just the high temperature source from which heat is extracted (the boiler)
but also a low temperature sink (the condenser) into which heat is discarded.
Later, when heat was correctly interpreted as a form of energy transfer it was
understood that in order to operate continuously for any significant length of
time an engine would have to repeat the same cycle over and over again, always
returning to the same initial state. This could only be achieved if the excess heat
generated in each cycle were discarded into a low temperature reservoir.
Carnot’s caloric-waterwheel model was fortunate in yet another respect— he
was not just lucky, he was very lucky— a waterwheel engine can be operated in
reverse and used as a pump. This led him to consider a reversible heat engine
in which work would be used to draw heat from a cold source and ‘pump it up’
to deliver heat to the hot reservoir. The analysis of such reversible heat engines
led Carnot to the important conclusion
Carnot’s Principle: “No heat engine E operating between two temperatures
can be more efficient than a reversible engine ER that operates between the same
temperatures.”
The proof of Carnot’s principle is quite straightforward but because he used
the caloric model it was not correct— the necessary revisions were supplied later
by Clausius in 1850. As a side remark, it is interesting that Carnot’s notebooks,
which were made public by his family in about 1870 long after his death, indicate
that soon after 1824 Carnot came to reject the caloric model and that he
achieved the modern understanding of heat as a form of energy transfer. This
work— which preceded Joule's experiments by about fifteen years— was not published
and therefore had no influence on the development of thermodynamics
[Wilson 1981].

1 In his attempt to understand the undetectability of the ether Einstein faced a similar
problem: he knew that it was hopeless to seek an understanding of the constancy of the speed
of light on the basis of the primitive physics of the atomic structure of solid rulers that was
available at the time. Inspired by Carnot he deliberately followed the same strategy – to give
up and declare victory – and postulated the constancy of the speed of light as the unproven
but known truth which would serve as the foundation from which other conclusions could be
derived.
The following is Clausius' proof. Figure (3.1a) shows a heat engine E that
draws heat q1 from a source at high temperature t1, delivers heat q2 to a sink at
low temperature t2, and generates work w = q1 − q2. Next consider an engine
ES that is more efficient than a reversible one, ER. In figure (3.1b) we show the
super-efficient engine ES coupled to the reversible ER. Then for the same heat
q1 drawn from the hot source the super-efficient engine ES would deliver more
work than ER, wS > wR. One could split the work wS generated by ES into two
parts wR and wS − wR. The first part wR could be used to drive ER in reverse
and pump heat q1 back up to the hot source, which is thus left unchanged. The
remaining work wS − wR could then be used for any other purposes. The net
result is to extract heat q2R − q2S > 0 from the cold reservoir and convert it
to work without any need for the high temperature reservoir, that is, without
any need for fuel. The conclusion is that the existence of a super-efficient heat
engine would allow the construction of a perpetual motion engine. Therefore
the assumption that the latter do not exist implies Carnot's principle that heat
engines cannot be more efficient than reversible ones.
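The bookkeeping in this argument is easy to check with a few invented numbers; the short Python sketch below couples a hypothetical super-efficient engine to a reversed reversible engine and verifies that the net effect is heat drawn from the cold reservoir alone and converted entirely into work.

# Hypothetical numbers: both engines draw q1 = 100 units from the hot reservoir.
q1 = 100.0
e_R, e_S = 0.40, 0.50            # reversible and (assumed) super-efficient efficiencies

w_R, w_S = e_R * q1, e_S * q1    # work outputs
q2_R, q2_S = q1 - w_R, q1 - w_S  # heat dumped into the cold reservoir

# Run ER in reverse, driven by the part w_R of ES's output: it pumps q1 back up.
net_work = w_S - w_R             # work left over for other purposes
net_heat_from_cold = q2_R - q2_S # heat extracted from the cold reservoir
net_heat_from_hot = q1 - q1      # the hot reservoir is left unchanged

print(net_work, net_heat_from_cold, net_heat_from_hot)   # 10.0 10.0 0.0
# All the work comes from the cold reservoir alone: a perpetual motion
# machine of the second kind, so e_S > e_R must be impossible.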
The statement that perpetual motion is not possible is true but it is also
incomplete in one important way. It blurs the distinction between perpetual
motion engines of the first kind which operate by violating energy conservation
and perpetual motion engines of the second kind which do not violate energy
conservation. Carnot's conclusion deserves to be singled out as a new principle
because it is specific to the second kind of machine.
Other important conclusions obtained by Carnot include
(1) that all reversible engines operating between the same temperatures are
equally efficient;
(2) that their efficiency is a function of the temperatures only,
\[
e \overset{\text{def}}{=} \frac{w}{q_1} = e(t_1,t_2)\;,
\qquad (3.1)
\]
and is therefore independent of all other details of how the engine is constructed
and operated;
(3) that the efficiency increases with the temperature difference [see eq.(3.5)
below]; and finally
(4) that the most efficient heat engine cycle, now called the Carnot cycle, is one
in which all heat is absorbed at the high t1 and all heat is discharged at the low
t2. (Thus, the Carnot cycle is defined by two isotherms and two adiabats.)
(The proofs of these statements are left as an exercise for the reader.)
The next important step, the determination of the universal function e(t1 ; t2 ),
was accomplished by Kelvin.
Figure 3.1: (a) An engine E operates between heat reservoirs at temperatures
t1 and t2. (b) A perpetual motion machine can be built by coupling a super-efficient
engine ES to a reversible engine ER.

3.2 Kelvin: temperature


After Joule’s experiments in the 1840’s on the conversion of work into heat the
caloric model had to be abandoned. Heat was finally recognized as a form of
energy transfer and the additional relation w = q1 − q2 was the ingredient that,
in the hands of Kelvin and Clausius, allowed Carnot's principle to be developed
into the next stage.

Suppose two reversible engines Ea and Eb are linked in series to form a single
more complex reversible engine Ec. Ea operates between temperatures t1 and
t2, and Eb between t2 and t3. Ea draws heat q1 and discharges q2, while Eb uses
q2 as input and discharges q3. The efficiencies of the three engines are
\[
e_a = e(t_1,t_2) = \frac{w_a}{q_1}\;,
\qquad
e_b = e(t_2,t_3) = \frac{w_b}{q_2}\;,
\qquad (3.2)
\]
and
\[
e_c = e(t_1,t_3) = \frac{w_a+w_b}{q_1}\;.
\qquad (3.3)
\]
They are related by
\[
e_c = e_a + \frac{w_b}{q_2}\,\frac{q_2}{q_1} = e_a + e_b\left(1 - \frac{w_a}{q_1}\right),
\qquad (3.4)
\]
or
\[
e_c = e_a + e_b\,(1 - e_a)\;,
\qquad (3.5)
\]
which is a functional equation for e = e(t1, t2). Before we proceed to find the
solution we note that since 0 ≤ e ≤ 1 it follows that ec ≥ ea. Similarly, writing
\[
e_c = e_b + e_a\,(1 - e_b)\;,
\qquad (3.6)
\]
implies ec ≥ eb. Therefore the efficiency e(t1, t2) can be increased either by
increasing the higher temperature or by lowering the lower temperature.

To find the solution of eq.(3.5) change variables to xa = log(1 − ea), or
ea = 1 − e^{xa},
\[
x_c(t_1,t_3) = x_a(t_1,t_2) + x_b(t_2,t_3)\;,
\qquad (3.7)
\]
and then differentiate with respect to t2 to get
\[
\frac{\partial}{\partial t_2}\,x_a(t_1,t_2) = -\frac{\partial}{\partial t_2}\,x_b(t_2,t_3)\;.
\qquad (3.8)
\]
The left hand side is independent of t3 while the second is independent of t1,
therefore ∂xa/∂t2 must be some function g of t2 only,
\[
\frac{\partial}{\partial t_2}\,x_a(t_1,t_2) = g(t_2)\;.
\qquad (3.9)
\]
Integrating gives x(t1, t2) = F(t1) + G(t2) where the two functions F and G
are at this point unknown. The boundary condition e(t, t) = 0, or equivalently
x(t, t) = 0, implies that we deal with merely one unknown function: G(t) = −F(t).
Therefore
\[
x(t_1,t_2) = F(t_1) - F(t_2)
\quad\text{or}\quad
e(t_1,t_2) = 1 - \frac{f(t_2)}{f(t_1)}\;,
\qquad (3.10)
\]
where f = e^{-F}. Since e(t1, t2) increases with t1 and decreases with t2 the
function f(t) must be monotonically increasing.
Kelvin recognized that there is nothing fundamental about the original tem-
perature scale t. It may depend, for example, on the particular materials em-
ployed to construct the thermometer. He realized that the freedom in eq.(3.10)
in the choice of the function f corresponds to the freedom of changing tem-
perature scales by using different thermometric materials. The only feature
common to all thermometers that claim to rank systems according to their ‘de-
gree of hotness’ is that they must agree that if A is hotter than B, and B is
hotter than C, then A is hotter than C. One can therefore regraduate any old
inconvenient t scale by a monotonic function to obtain a new scale T chosen
for the purely pragmatic reason that it leads to a more elegant formulation
of the theory. Inspection of eq.(3.10) immediately suggests that the optimal
choice of regraduating function, which leads to Kelvin's definition of absolute
temperature, is
T = Cf (t) : (3.11)
The scale factor C reflects the still remaining freedom to choose the units. In the
absolute scale the efficiency for the ideal reversible heat engine is very simple,
\[
e(t_1,t_2) = 1 - \frac{T_2}{T_1}\;.
\qquad (3.12)
\]
In short, what Kelvin proposed was to use an ideal reversible engine as a thermometer
with its efficiency playing the role of the thermometric variable.
Carnot's principle that any heat engine E′ must be less efficient than the
reversible one, e′ ≤ e, is rewritten as
\[
e' = \frac{w}{q_1} = 1 - \frac{q_2}{q_1} \;\le\; e = 1 - \frac{T_2}{T_1}\;,
\qquad (3.13)
\]
or,
\[
\frac{q_1}{T_1} - \frac{q_2}{T_2} \le 0\;.
\qquad (3.14)
\]
It is convenient to redefine heat so that inputs are positive heat, Q1 = q1, while
outputs are negative heat, Q2 = −q2. Then,
\[
\frac{Q_1}{T_1} + \frac{Q_2}{T_2} \le 0\;,
\qquad (3.15)
\]
where the equality holds when and only when the engine is reversible.

The generalization to an engine or any system that undergoes a cyclic process
in which heat is exchanged with more than two reservoirs is straightforward. If
heat Qi is absorbed from the reservoir at temperature Ti we obtain the Kelvin
form (1854) of Carnot's principle,
\[
\sum_i \frac{Q_i}{T_i} \le 0\;.
\qquad (3.16)
\]

It may be worth emphasizing that the Ti are the temperatures of the reservoirs.
In an irreversible process the system will not in general be in thermal equilibrium
and it may not be possible to assign a temperature to it.
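A quick numerical sanity check (a sketch with arbitrary reservoir temperatures) confirms that efficiencies of the Kelvin form e = 1 − T2/T1 in eq.(3.12) satisfy the composition law of eqs.(3.5) and (3.6) when two reversible engines are linked in series.

def e(T_hot, T_cold):
    # efficiency of a reversible engine between two absolute temperatures, eq.(3.12)
    return 1.0 - T_cold / T_hot

T1, T2, T3 = 600.0, 450.0, 300.0   # hypothetical reservoir temperatures (kelvin)

e_a, e_b, e_c = e(T1, T2), e(T2, T3), e(T1, T3)

# eq.(3.5): e_c = e_a + e_b (1 - e_a); also symmetric in a <-> b, eq.(3.6)
assert abs(e_c - (e_a + e_b * (1 - e_a))) < 1e-12
assert abs(e_c - (e_b + e_a * (1 - e_b))) < 1e-12
print(e_a, e_b, e_c)   # 0.25, 0.333..., 0.5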
The next non-trivial step, taken by Clausius, was to use eq.(3.16) to intro-
duce the concept of entropy.

3.3 Clausius: entropy


By about 1850 both Kelvin and Clausius had realized that two laws (energy
conservation and Carnot’s principle) were necessary as a foundation for ther-
modynamics. The somewhat awkward expressions for the second law that they
had adopted at the time were reminiscent of Carnot’s; they stated the impossi-
bility of heat engines whose sole effect would be to transform heat from a single
source into work, or of refrigerators that could pump heat from a cold to a hot
reservoir without the input of external work. It took Clausius until 1865 – this
is some fifteen years later, which indicates that the breakthrough was not at all
trivial – before he came up with a new compact statement of the second law
that allowed substantial further progress [Cropper 1986].

Clausius rewrote Kelvin's eq.(3.16) for a cycle where the system absorbs infinitesimal
(positive or negative) amounts of heat dQ from a continuous sequence
of reservoirs,
\[
\oint \frac{dQ}{T} \le 0\;,
\qquad (3.17)
\]
where T is the temperature of each reservoir. The equality is attained for a
reversible process in which the system is slowly taken through a continuous
sequence of equilibrium states. In such a process T is both the temperature of
the system and of the reservoirs. The equality implies that the integral from
any state A to any other state B is independent of the path taken,
\[
\oint \frac{dQ}{T} = 0
\;\Longrightarrow\;
\int_{A,\,R_{1AB}}^{B} \frac{dQ}{T} = \int_{A,\,R_{2AB}}^{B} \frac{dQ}{T}\;,
\qquad (3.18)
\]
where R1AB and R2AB denote any two reversible paths linking the states A and
B. Clausius realized that eq.(3.18) implies the existence of a function of the
thermodynamic state. This function, which he called entropy, is defined up to
an additive constant by
\[
S_B = S_A + \int_{A,\,R_{AB}}^{B} \frac{dQ}{T}\;.
\qquad (3.19)
\]

This first notion of entropy we will call the Clausius entropy or the thermodynamic
entropy. Note that the Clausius entropy is defined only for states of
thermal equilibrium which severely limits its range of applicability.
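The path independence expressed in eq.(3.18) can be illustrated numerically for the simplest system for which the Clausius entropy is defined, a monatomic ideal gas in equilibrium; the Python sketch below (one mole, hypothetical initial and final states) integrates dQ/T along two different reversible paths and obtains the same S_B − S_A.

import numpy as np

n_R = 8.314          # one mole of a monatomic ideal gas: n*R in J/K (assumption)
TA, VA = 300.0, 1.0  # hypothetical initial state (K, arbitrary volume units)
TB, VB = 600.0, 3.0  # hypothetical final state

def dS_isochoric(T1, T2):
    # dQ = Cv dT with Cv = (3/2) nR, so integrate (3/2) nR dT / T
    T = np.linspace(T1, T2, 100001)
    return np.trapz(1.5 * n_R / T, T)

def dS_isothermal(T, V1, V2):
    # dE = 0, so dQ = p dV = nRT dV / V, and dQ/T = nR dV / V
    V = np.linspace(V1, V2, 100001)
    return np.trapz(n_R / V, V)

path1 = dS_isochoric(TA, TB) + dS_isothermal(TB, VA, VB)   # heat first, then expand
path2 = dS_isothermal(TA, VA, VB) + dS_isochoric(TA, TB)   # expand first, then heat
print(path1, path2)   # the two reversible paths give the same S_B - S_A, eq.(3.18)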
Eq.(3.19) seems like a mere reformulation of eqs.(3.16) and (3.17) but it
represents a major advance because it allowed thermodynamics to reach beyond
the study of cyclic processes. Consider a possibly irreversible process in which
a system is taken from an initial state A to a final state B, and suppose the
system is returned to the initial state along a reversible path. Then, the more
general eq.(3.17) gives
\[
\int_{A,\,\mathrm{irrev}}^{B} \frac{dQ}{T} + \int_{B,\,R_{AB}}^{A} \frac{dQ}{T} \le 0\;.
\qquad (3.20)
\]

From eq.(3.19) the second integral is SA − SB. Since dQ is the amount
of heat released by the reservoirs at temperature T, the first integral
represents minus the change in the entropy of the reservoirs, which in this case
represent the rest of the universe,
\[
(S_A^{\mathrm{res}} - S_B^{\mathrm{res}}) + (S_A - S_B) \le 0
\quad\text{or}\quad
S_B^{\mathrm{res}} + S_B \ge S_A^{\mathrm{res}} + S_A\;.
\qquad (3.21)
\]
Thus the second law can be stated in terms of the total entropy S^total = S^res + S
as
\[
S^{\mathrm{total}}_{\mathrm{final}} \ge S^{\mathrm{total}}_{\mathrm{initial}}\;,
\qquad (3.22)
\]
which led Clausius to summarize the laws of thermodynamics as “The energy of


the universe is constant. The entropy of the universe tends to a maximum.” This
represents great progress: all restrictions to cyclic processes have disappeared.
But a word of caution and restraint is however necessary. Glib pronouncements
such as “the energy of the universe is constant” might have been the culmination
of insight by 19th century standards but today we know better. It is not that
the energy of the universe is increasing or decreasing, it is rather that the very
notion of a total energy in a curved expanding universe is not something that
can be unambiguously defined. And similarly, “the entropy of the universe tends
to a maximum,” is a catchy phrase that captures our imagination but once one
realizes that the thermodynamic entropy applies only to systems in thermal
equilibrium one wonders what it could possibly mean for an expanding universe
that is clearly not in equilibrium.
Clausius was also responsible for initiating another independent line of re-
search in this subject. His paper “On the kind of motion we call heat” (1857)
was the first (failed!) attempt to deduce the second law from purely mechanical
principles applied to molecules. His results referred to averages taken over all
molecules, for example the kinetic energy per molecule, and involved theorems
in mechanics such as the virial theorem. For him the increase of entropy was
meant to be an absolute law and not just a matter of overwhelming probability.

3.4 Maxwell: probability


We owe to Maxwell the introduction of probabilistic notions into fundamental
physics (1860). Before him probabilities had been used by Laplace and by
Gauss as a tool in the analysis of experimental data. Maxwell realized the
practical impossibility of keeping track of the exact motion of all the molecules
in a gas and pursued a less detailed description in terms of the distribution of
velocities. (Perhaps he was inspired by his earlier study of the rings of Saturn
which required reasoning about particles undergoing very complex trajectories.)
Maxwell interpreted his distribution function as the number of molecules
with velocities in a certain range, and also as the probability P (~v )d3 v that a
molecule has a velocity ~v in a certain range d3 v. It would take a long time to
achieve a clearer understanding of the meaning of the term ‘probability’. In any
case, Maxwell concluded that “velocities are distributed among the particles
according to the same law as the errors are distributed in the theory of the
‘method of least squares’,” and on the basis of this distribution he obtained a
number of significant results on the transport properties of gases.
Over the years he proposed several derivations of his velocity distribution
function. His derivation in 1860 is particularly elegant because it relies on symmetry.
Maxwell's first assumption is a symmetry requirement, the distribution
should only depend on the actual magnitude |\vec v| = v of the velocity and not on
its direction,
\[
P(\vec v)\,d^3v = f(v)\,d^3v = f\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\,\right) d^3v\;.
\qquad (3.23)
\]
A second assumption is that velocities along orthogonal directions should be
independent,
\[
f(v)\,d^3v = p(v_x)\,p(v_y)\,p(v_z)\,d^3v\;.
\qquad (3.24)
\]
Therefore
\[
f\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\,\right) = p(v_x)\,p(v_y)\,p(v_z)\;.
\qquad (3.25)
\]
Setting vy = vz = 0 we get
\[
f(v_x) = p(v_x)\,p(0)\,p(0)\;,
\qquad (3.26)
\]
so that we obtain a functional equation for p,
\[
p\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\,\right) p(0)\,p(0) = p(v_x)\,p(v_y)\,p(v_z)\;,
\qquad (3.27)
\]
or
\[
\log\!\left[\frac{p\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\,\right)}{p(0)}\right]
= \log\frac{p(v_x)}{p(0)} + \log\frac{p(v_y)}{p(0)} + \log\frac{p(v_z)}{p(0)}\;,
\qquad (3.28)
\]
or, introducing the function G = log[p/p(0)],
\[
G\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\,\right) = G(v_x) + G(v_y) + G(v_z)\;.
\qquad (3.29)
\]
The solution is straightforward. Differentiate with respect to vx and to vy to
get
\[
\frac{G'\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\,\right)}{\sqrt{v_x^2+v_y^2+v_z^2}}\,v_x = G'(v_x)
\quad\text{and}\quad
\frac{G'\!\left(\sqrt{v_x^2+v_y^2+v_z^2}\,\right)}{\sqrt{v_x^2+v_y^2+v_z^2}}\,v_y = G'(v_y)\;.
\qquad (3.30)
\]
Therefore
\[
\frac{G'(v_x)}{v_x} = \frac{G'(v_y)}{v_y} = -2\alpha\;,
\qquad (3.31)
\]
where 2α is a constant. Integrating gives
\[
\log\frac{p(v_x)}{p(0)} = G(v_x) = -\alpha v_x^2 + \text{const}\;,
\qquad (3.32)
\]
so that
\[
P(\vec v) = f(v) = \left(\frac{\alpha}{\pi}\right)^{3/2}
\exp\!\left[-\alpha\left(v_x^2+v_y^2+v_z^2\right)\right]\;,
\qquad (3.33)
\]
the same distribution as “errors in the method of least squares”.
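It is easy to verify numerically that the Gaussian form does solve Maxwell's functional equations; the sketch below (with an arbitrary positive constant α) checks eqs.(3.27) and (3.29) at a few randomly chosen velocities.

import numpy as np

alpha = 0.7                      # arbitrary positive constant (assumption)
G = lambda v: -alpha * v**2      # the solution G(v) = -alpha v^2 of eq.(3.31)

rng = np.random.default_rng(3)
for _ in range(5):
    vx, vy, vz = rng.uniform(-3, 3, size=3)
    speed = np.sqrt(vx**2 + vy**2 + vz**2)
    # eq.(3.29): G of the magnitude equals the sum of the component G's
    assert np.isclose(G(speed), G(vx) + G(vy) + G(vz))

# Equivalently, p(v) proportional to exp(-alpha v^2) makes the joint distribution
# depend only on the speed, which is Maxwell's symmetry requirement, eq.(3.23).
p = lambda v: np.exp(-alpha * v**2)
vx, vy, vz = 0.4, -1.2, 2.0
lhs = p(np.sqrt(vx**2 + vy**2 + vz**2)) * p(0) * p(0)
rhs = p(vx) * p(vy) * p(vz)
print(lhs, rhs)                  # eq.(3.27) holds for the Gaussian solution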


Maxwell’s distribution applies whether the molecule is part of a gas, a liquid,
or even a solid and, with the benefit of hindsight, the reason is quite easy to
see. The probability that a molecule have velocity \vec v and position \vec x is given
by the Boltzmann distribution ∝ exp(−H/kT). For a large variety of situations
the Hamiltonian for one molecule is of the form H = mv²/2 + V(\vec x) where


the potential V (~x) includes the interactions, whether they be weak or strong,
with all the other molecules. If the potential V (~x) is independent of ~v , then
the distribution for ~v and ~x factorizes. Velocity and position are statistically
independent, and the velocity distribution is Maxwell’s.
Maxwell was the first to realize that the second law is not an absolute law
(this was expressed in his popular textbook “Theory of Heat” in 1871), that it
“has only statistical certainty” and indeed, that in fluctuation phenomena “the
second law is continually being violated”. Such phenomena are not rare: just
look out the window and you can see that the sky is blue – a consequence of
the scattering of light by density fluctuations in the atmosphere.
Maxwell introduced the notion of probability into physics, but what did he
actually mean by the word ‘probability’? He used his distribution function as a
velocity distribution, the number of molecules with velocities in a certain range,
which betrays a frequentist interpretation. These probabilities are ultimately
mechanical properties of the gas. But he also used his distribution to represent
the lack of information we have about the precise microstate of the gas. This
latter interpretation is particularly evident in a letter he wrote in 1867 where
he argues that the second law could be violated by “a finite being who knows
the paths and velocities of all molecules by simple inspection but can do no
work except open or close a hole.” Such a “demon” could allow fast molecules
to pass through a hole from a vessel containing hot gas into a vessel containing
cold gas, and could allow slow molecules to pass in the opposite direction. The net
effect would be the transfer of heat from a low to a high temperature, a violation
information. [Klein 1970]

3.5 Gibbs: beyond heat


Gibbs generalized the second law in two directions: to open systems (allowing
transfer of particles as well as heat) and to inhomogeneous systems (as, for
example, in the equilibrium between a gas and a liquid phase). With the intro-
duction of the concept of the chemical potential, a quantity that regulates the
transfer of particles in much the same way that temperature regulates the trans-
fer of heat, he could apply the methods of thermodynamics to phase transitions,
mixtures and solutions, chemical reactions, and much else. His paper “On the
Equilibrium of Heterogeneous Systems” [Gibbs 1875-78] is formulated as the
purest form of thermodynamics –a phenomenological theory of extremely wide
applicability because its foundations do not rest on particular models about the
structure and dynamics of the microscopic constituents.
And yet, Gibbs was keenly aware of the signi…cance of the underlying mole-
cular constitution – he was familiar with Maxwell’s writings and in particular
with his “Theory of Heat”. His discussion of the process of mixing gases led
him to analyze the paradox that bears his name. The entropy of two different
gases increases when the gases are mixed; but does the entropy also increase
when two gases of the same molecular species are mixed? Is this an irreversible
process?
For Gibbs there was no paradox, much less one that would require some
esoteric new (quantum) physics for its resolution. For him it was quite clear
that thermodynamics was not concerned with microscopic details but rather
with the changes from one macrostate to another. He correctly explained that
the mixing of two gases of the same molecular species does not lead to a different
macrostate. Indeed: by “thermodynamic” state

“...we do not mean a state in which each particle shall occupy more or less
exactly the same position as at some previous epoch, but only a state which
shall be indistinguishable from the previous one in its sensible properties.
It is to states of systems thus incompletely defined that the problems of
thermodynamics relate.” [Gibbs 1875-78]

Thus, there is no entropy increase because there is no change of thermodynamic


state.2 Gibbs' resolution of the non-paradox hinges on distinguishing two kinds
of reversibility. One is the microscopic or mechanical reversibility in which the
velocities of each individual particle are reversed and the system retraces the
sequence of microstates. The other is macroscopic or Carnot reversibility in
which the system retraces the sequence of macrostates.

2 For a discussion of the Gibbs paradox from the more modern perspective of information
theory see section 5.11.
Gibbs understood, as had Maxwell before him, that the explanation of the
second law cannot rest on purely mechanical arguments. Since the second law
applies to “incompletely defined” descriptions any explanation must also involve
probabilistic concepts that are foreign to mechanics. This led him to conclude
that “... the impossibility of an uncompensated decrease of entropy seems to
be reduced to improbability,” a sentence that Boltzmann adopted as the motto
for the second volume of his “Lectures on the Theory of Gases.”
Remarkably neither Maxwell nor Gibbs established a connection between
probability and entropy. Gibbs was very successful at showing what one can
accomplish by maximizing entropy but he did not address the issue of what
entropy is or what it means. The crucial steps in this direction were taken by
Boltzmann.
But Gibbs' contributions did not end here. The ensemble theory introduced
in his “Principles of Statistical Mechanics” in 1902 (it was Gibbs who coined the
term ‘statistical mechanics’) represents a practical and conceptual step beyond
Boltzmann's understanding of entropy.

3.6 Boltzmann: entropy and probability


It was Boltzmann who found the connection between entropy and probability,
but his path was long and tortuous [Klein 1973, Uffink 2004]. Over the years
he adopted several different interpretations of probability and, to add to the
confusion, he was not always explicit about which one he was using, sometimes
mixing them within the same paper, and even within the same equation. At
first, just like Maxwell, he defined the probability of a molecule having a velocity
\vec v within a small cell d³v as the fraction of particles with velocities within the cell.
But then he also defined the probability as being proportional to the amount of
time that the molecule spent within that particular cell. Both definitions have
a clear origin in mechanics.
By 1868 he had managed to generalize Maxwell’s work in several directions.
He extended the theorem of equipartition of energy for point particles to complex
molecules. And he also generalized the Maxwell distribution to particles in the
presence of an external potential U(\vec x), which is the Boltzmann distribution. The
latter, in modern notation, is
\[
P(\vec x,\vec p)\,d^3x\,d^3p \;\propto\;
\exp\!\left[-\frac{1}{kT}\left(\frac{p^2}{2m} + U(\vec x)\right)\right] d^3x\,d^3p\;.
\qquad (3.34)
\]
The argument was that in equilibrium the distribution should be stationary,
that it should not change as a result of collisions among particles. The colli-
sion argument was indeed successful but it gave the distribution for individual
molecules; it did not keep track of the correlations among molecules that would
arise from the collisions themselves.
A better treatment of the interactions among molecules was needed and
was found soon enough, also in 1868. The idea that naturally suggests itself
is to replace the potential U (~x) due to a single external force by a potential
U (~x1 : : : ~xN ) that includes all intermolecular forces. The change is enormous:
it led Boltzmann to consider the probability of the microstate of the system
as a whole rather than the probabilities of the microstates of individual mole-
cules. Thus, the universe of discourse shifted from the one-particle phase space
with volume element d3 xd3 p to the N -particle phase space with volume ele-
ment d3N xd3N p and Boltzmann was led to the microcanonical distribution in
which the N -particle microstates are uniformly distributed over a hypersurface
of constant energy — a subspace of 6N − 1 dimensions.
The question of probability was, once again, brought to the foreground. A
notion of probability as the fraction of molecules in d3 v was no longer usable but
Boltzmann could still identify the probability of the system being in some region
of the N -particle phase space (rather than the one-particle space of molecular
velocities) with the relative amount of time that the system would spend in the
region. This obviously mechanical concept of probability is sometimes called
the “time” ensemble.
Perhaps inadvertently, at least at first, Boltzmann also introduced another
definition, according to which the probability that the state of the system is
within a certain region of phase space at a given instant in time is propor-
tional to the volume of the region. This is almost natural: just like the faces
of a symmetric die are assigned equal probabilities, Boltzmann assigned equal
probabilities to equal volumes in phase space.
At first Boltzmann did not think it was necessary to comment on whether the
two definitions of probability are equivalent or not, but eventually he realized
that their assumed equivalence should be explicitly stated. Later this came
to be known as the “ergodic” hypothesis, namely, that over a long time the
trajectory of the system would cover the whole region of phase space consistent
with the given value of the energy (and thus erg-odic). Throughout this period
Boltzmann’s various notions of probability were all still conceived as mechanical
properties of the gas.
In 1871 Boltzmann achieved a significant success in establishing a connection
between thermodynamic entropy and microscopic concepts such as the probability
distribution in the N-particle phase space. In modern notation the argument
runs as follows. The energy of N interacting particles is given by
\[
H = \sum_{i=1}^{N} \frac{p_i^2}{2m} + U(x_1,\dots,x_N; V)\;,
\qquad (3.35)
\]

where V stands for additional parameters that can be externally controlled


such as, for example, the volume of the gas. The first non-trivial decision was
to propose a quantity defined in purely microscopic (but not purely mechanical)
terms that would correspond to the macroscopic internal energy. He opted for
the “expectation”
\[
E = \langle H\rangle = \int dz_N\, P_N\, H\;,
\qquad (3.36)
\]
where dz_N = d^{3N}x\,d^{3N}p is the volume element in the N-particle phase space,
and P_N is the N-particle distribution function,
\[
P_N = \frac{\exp(-\beta H)}{Z}
\quad\text{where}\quad
Z = \int dz_N\, e^{-\beta H}\;,
\qquad (3.37)
\]
and β = 1/kT, so that,
\[
E = \frac{3}{2}NkT + \langle U\rangle\;.
\qquad (3.38)
\]
The connection to the thermodynamic entropy, eq.(3.19), requires a clear
idea of the nature of heat and how it differs from work. One needs to express
heat in purely microscopic terms, and this is quite subtle because at the mole-
cular level there is no distinction between a motion that is supposedly of a
“thermal” type and other types of motion such as plain displacements or rota-
tions. The distribution function turns out to be the crucial ingredient. In any
infinitesimal transformation the change in the internal energy separates into two
contributions,
\[
\delta E = \int dz_N\, H\,\delta P_N + \int dz_N\, P_N\,\delta H\;.
\qquad (3.39)
\]
The second integral, which can be written as ⟨δH⟩ = ⟨δU⟩, arises purely from
changes in the potential function U that are induced by manipulating parameters
such as volume. Such a change in the potential is precisely what one means
by mechanical work δW, therefore, since δE = δQ + δW, the first integral must
represent the transferred heat δQ,
\[
\delta Q = \delta E - \langle\delta U\rangle\;.
\qquad (3.40)
\]
An aside: The discrete version of this idea might be familiar from elementary
quantum mechanics. If the probability that a quantum system is in a microstate
i with energy eigenvalue εi is pi, then the internal energy is
\[
E = \langle H\rangle = \sum_i p_i\,\varepsilon_i\;.
\qquad (3.41)
\]
The energy transferred in an infinitesimal process is
\[
\delta E = \delta\langle H\rangle = \sum_i \varepsilon_i\,\delta p_i + \sum_i p_i\,\delta\varepsilon_i\;.
\qquad (3.42)
\]
The second term is the energy transfer that results from changing the energy
levels while keeping the probabilities pi fixed. This can be achieved, for example,
by changing an external electric field, or by changing the volume of the box in
which the particles are contained. The corresponding change in energy is called
work. Then, the first term, which refers to an energy change which involves
only a change in the probability of occupation pi and not in the energy levels, is
called heat. Thus, in quantum mechanics, a slow process in which the quantum
system makes no transitions (δpi = 0) while the energy levels are moved around
(δεi ≠ 0) is called an adiabatic process (i.e., no heat is transferred).
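The split in eq.(3.42) can be illustrated with a toy two-level system (all numbers below are hypothetical): perturbing both the occupation probabilities and the energy levels, the change in ⟨H⟩ separates, to first order, into a 'heat' piece Σ_i ε_i δp_i and a 'work' piece Σ_i p_i δε_i.

import numpy as np

# Toy two-level system (all numbers hypothetical)
eps = np.array([0.0, 1.0])        # energy levels
p = np.array([0.7, 0.3])          # occupation probabilities

d_eps = np.array([0.00, 0.02])    # small change of the levels  -> "work"
d_p = np.array([-0.01, 0.01])     # small change of occupations -> "heat" (sums to 0)

E_before = np.dot(p, eps)
E_after = np.dot(p + d_p, eps + d_eps)

heat = np.dot(eps, d_p)           # sum_i eps_i * dp_i
work = np.dot(p, d_eps)           # sum_i p_i * deps_i

print(E_after - E_before, heat + work)   # equal to first order in the deltas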
Getting back to Boltzmann, substituting δE from eq.(3.38) into (3.40), one
gets
\[
\delta Q = \frac{3}{2}Nk\,\delta T + \delta\langle U\rangle - \langle\delta U\rangle\;.
\qquad (3.43)
\]
This is not a complete differential, but dividing by the temperature yields (after
some algebra)
\[
\frac{\delta Q}{T} = \delta\left[\frac{3}{2}Nk\log T + \frac{\langle U\rangle}{T}
+ k\log\!\int d^{3N}x\; e^{-\beta U} + \text{const}\right].
\qquad (3.44)
\]
If the identification of δQ with heat is correct then this strongly suggests that
the expression in brackets should be identified with the Clausius entropy S.
Further rewriting leads to
\[
S = \frac{E}{T} + k\log Z + \text{const}\;,
\qquad (3.45)
\]
which is recognized as the correct modern expression. Indeed, the free energy,
F = E − TS, is such that Z = e^{-F/kT}.
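Equation (3.45) is easy to check for any discrete canonical distribution; the sketch below (a hypothetical three-level system with k = 1 and the additive constant set to zero) compares E/T + log Z with the form −Σ_i p_i log p_i that will reappear in the next chapter.

import numpy as np

eps = np.array([0.0, 0.5, 2.0])   # hypothetical energy levels, k = 1
T = 1.3
beta = 1.0 / T

Z = np.sum(np.exp(-beta * eps))        # partition function
p = np.exp(-beta * eps) / Z            # canonical probabilities
E = np.dot(p, eps)                     # internal energy <H>

S_thermo = E / T + np.log(Z)           # eq.(3.45) with k = 1, const = 0
S_gibbs = -np.dot(p, np.log(p))        # -sum_i p_i log p_i

print(S_thermo, S_gibbs)               # the two expressions agree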
Boltzmann’s path towards understanding the second law was guided by one
notion from which he never wavered: matter is an aggregate of molecules. Apart
from this the story of his progress is the story of the increasingly more important
role played by probabilistic notions, and ultimately, it is the story of the evolu-
tion of his understanding of the notion of probability itself. By 1877 Boltzmann
achieves his final goal and explains entropy purely in terms of probability – mechanical
notions were by now reduced to the bare minimum consistent with the
subject matter: we are, after all, talking about collections of molecules with positions
and momenta and their total energy is conserved. His final achievement
hinges on the introduction of yet another way of thinking about probabilities


involving the notion of the multiplicity of the macrostate.
He considered an idealized system consisting of N particles whose single-
particle phase space is divided into m cells labelled n = 1; :::; m. The number of
particles in the nth cell is denoted wn , and the distribution function is given by
the set of numbers w1 ; : : : ; wm . In Boltzmann’s previous work the determination
of the distribution function had been based on figuring out its time evolution
from the mechanics of collisions. Here he used a purely combinatorial argument.
A completely specified state, what we call a microstate, is defined by specifying
the cell of each individual molecule. A macrostate is specified less completely
by the distribution function, w1, ..., wm.

The probability of the macrostate is given by the probability of a microstate
1/m^N multiplied by the number W of microstates compatible with a given
macrostate, which is called the “multiplicity”,
\[
P(w_1,\dots,w_m) = \frac{W}{m^N}
\quad\text{where}\quad
W = \frac{N!}{w_1!\cdots w_m!}\;.
\qquad (3.46)
\]

Boltzmann proposed that the probability of the macrostate was proportional
to its multiplicity, to the number of ways in which it could be achieved, which
assumes any microstate is as likely as any other – the ‘equal a priori probability
postulate’. Thus, the most probable macrostate is that which maximizes P or
equivalently W subject to the constraints of a fixed total number of particles N
and a fixed total energy E,
\[
\sum_{n=1}^{m} w_n = N
\quad\text{and}\quad
\sum_{n=1}^{m} w_n\,\varepsilon_n = E\;,
\qquad (3.47)
\]
where εn is the energy of a particle in the nth cell.


When the occupation numbers wn are large enough that one can use Stirling's
approximation for the factorials, we have
\[
\log W = N\log N - N - \sum_{n=1}^{m}\left(w_n\log w_n - w_n\right)\;,
\qquad (3.48)
\]
which, using N = Σn wn, can be written as
\[
\log W = -N\sum_{n=1}^{m} \frac{w_n}{N}\log\frac{w_n}{N}\;,
\qquad (3.49)
\]
or
\[
\log W = -N\sum_{n=1}^{m} f_n\log f_n\;,
\qquad (3.50)
\]
where fn = wn/N is the fraction of molecules in the nth cell, or alternatively,
the “probability” that a molecule is in its nth state. As we shall later derive in
detail, the distribution that maximizes log W subject to the constraints (3.47)
is
\[
f_n = \frac{w_n}{N} \propto e^{-\beta\varepsilon_n}\;,
\qquad (3.51)
\]
where β is a Lagrange multiplier determined by the total energy. When applied
to a gas, the possible states of a molecule are cells in the one-particle phase
space. Therefore
\[
\log W = -N\int dz_1\, f(x,p)\log f(x,p)\;,
\qquad (3.52)
\]
where dz1 = d³x d³p and the most probable distribution (3.51) is the same equilibrium
distribution found earlier by Maxwell and generalized by Boltzmann.
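The combinatorics behind eqs.(3.46)-(3.51) can be checked on a small example; the sketch below (hypothetical cell energies and N = 1000 particles) compares the exact multinomial log W with the Stirling estimate −N Σ f_n log f_n, and verifies that a move of particles between cells that preserves both N and E lowers the multiplicity of the exponential occupation.

import numpy as np
from math import lgamma

eps = np.array([0.0, 1.0, 2.0, 3.0])     # hypothetical cell energies
N = 1000                                  # number of particles

def log_multiplicity(w):
    # exact log W = log( N! / (w_1! ... w_m!) ), eq.(3.46)
    return lgamma(sum(w) + 1) - sum(lgamma(wi + 1) for wi in w)

# A Boltzmann-like occupation and the Stirling estimate, eqs.(3.48)-(3.51)
beta = 0.8                                # arbitrary Lagrange multiplier
f = np.exp(-beta * eps); f /= f.sum()
w = np.rint(N * f).astype(int)
w[0] += N - w.sum()                       # round while keeping sum(w) = N

stirling = -N * np.sum((w / N) * np.log(w / N))
print(log_multiplicity(w), stirling)      # close for large N

# Moving particles between cells at fixed N and fixed E lowers W:
w2 = w.copy(); w2[0] += 1; w2[1] -= 2; w2[2] += 1   # keeps N and E unchanged
print(log_multiplicity(w2) < log_multiplicity(w))    # True: eq.(3.51) is the peak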
The derivation of the Boltzmann distribution (3.51) from a purely probabilistic
argument is a major accomplishment. However, although minimized, the
role of dynamics is not completely eliminated. The Hamiltonian enters the
discussion in two places. One is quite explicit: there is a conserved energy the
value of which is imposed as a constraint. The second is much more subtle; we
saw above that the probability of a macrostate is proportional to the multiplicity
W provided the microstates are assigned equal probabilities, or equivalently,
equal volumes in phase space are assigned equal a priori weights. As always,
equal probabilities must ultimately be justified in terms of some form of underlying
symmetry. As we shall later see in chapter 5, the required symmetry
follows from Liouville's theorem – under a Hamiltonian time evolution a region
in phase space moves around and its shape is distorted but its volume remains
conserved: Hamiltonian time evolution preserves volumes in phase space. The
nearly universal applicability of the ‘equal a priori postulate’ can be traced to
the fact that the only requirement is that the dynamics be Hamiltonian; the
functional form of the Hamiltonian is not important.
It is very surprising that although Boltzmann calculated the maximized value
log W for an ideal gas and knew that it agreed with the thermodynamical en-
tropy except for a scale factor, he never wrote the famous equation that bears
his name
S = k log W : (3.53)
This equation, as well as Boltzmann's constant k, were both first introduced by
Planck.

There is, however, a serious problem with eq.(3.52): it involves the distribution
function f(x, p) in the one-particle phase space and therefore it cannot
take correlations into account. Indeed, Boltzmann used his eq.(3.52) in the one
case where it actually works, for ideal gases of non-interacting particles. The
expression that applies to systems of interacting particles is3
\[
\log W_G = -\int dz_N\, f_N\log f_N\;,
\qquad (3.54)
\]

3 For the moment we disregard the question of the distinguishability of the molecules. The
so-called Gibbs paradox and the extra factor of 1/N! will be discussed in detail in section 5.11.
where fN = fN (x1 ; p1 ; : : : ; xN ; pN ) is the probability distribution in the N -


particle phase space. This equation is usually associated with the name of Gibbs
who, in his “Principles of Statistical Mechanics” (1902), developed Boltzmann's
combinatorial arguments into a very powerful theory of ensembles. The con-
ceptual gap between eq.(3.52) and (3.54) is enormous; it goes well beyond the
issue of intermolecular interactions. The probability in Eq.(3.52) is the single-
particle distribution, it can be interpreted as a “mechanical” property, namely,
the relative number of molecules in each cell and then the entropy Eq.(3.52) can
be interpreted as a mechanical property of the system. In contrast, eq.(3.54)
involves the N -particle distribution which is not a property of any single in-
dividual system but at best a property of an ensemble of similarly prepared
replicas of the system. Gibbs was not very explicit about his interpretation of
probability. He wrote

“The states of the bodies which we handle are certainly not known to us
exactly. What we know about a body can generally be described most
accurately and most simply by saying that it is one taken at random from
a great number (ensemble) of bodies which are completely described.”[my
italics, Gibbs 1902, p.163]

It is clear that for Gibbs probabilities represent a state of knowledge, that the
ensemble is a purely imaginary construction, just a tool for handling incom-
plete information. On the other hand, it is also clear that Gibbs still thinks of
probabilities in terms of frequencies. If the only available notion of probability
requires an ensemble and real ensembles are nowhere to be found then either
one gives up on probabilistic arguments altogether or one invents an imaginary
ensemble. Gibbs opted for the second alternative.
This brings our story of entropy up to about 1900. In the next chapter we
start a more deliberate and systematic study of the connection between entropy
and information.

3.7 Some remarks


I end with a disclaimer: this chapter has historical overtones but it is a story,
not a history. I have not mentioned many developments that are central to 20th
century physics— for example, the Boltzmann equation, or the ergodic hypoth-
esis, or all applications of statistical mechanics to the macroscopic properties of
matter, phase transitions, transport properties, and so on and on. These topics
represent paths that diverge from the central theme of this book, namely that
the laws of physics can be understood as rules for handling information and
uncertainty. The goal in this chapter was to discuss the origins of thermody-
namics and statistical mechanics in order to provide some background for the
first historical examples of such an entropic physics. At first I tried to write a
‘history as it should have happened’. I wanted to trace the development of the
concept of entropy from its origins with Carnot in a manner that reflects the
logical rather than the actual evolution. But I found that this approach would
not do; it trivializes the enormous achievements of the 19th century thinkers
and it misrepresents the actual nature of research. Scientific research is not a
neat tidy business.
I mentioned that this chapter was inspired by a beautiful article by E. T.
Jaynes with the same title [Jaynes 1988]. I think Jaynes' article has great peda-
gogical value but I disagree with him on how well Gibbs understood the logical
status of thermodynamics and statistical mechanics as examples of inferential
and probabilistic thinking. My own assessment runs in quite the opposite di-
rection: the reason why the conceptual foundations of thermodynamics and
statistical mechanics have been so controversial throughout the 20th century
is precisely because neither Gibbs nor Boltzmann, nor anyone else at the time,
were particularly clear on the interpretation of probability. I think that we could
hardly expect them to have done much better; they did not benefit from the
writings of Keynes (1921), Ramsey (1931), de Finetti (1937), Jeffreys (1939),
Cox (1946), Shannon (1948), Brillouin (1952), Polya (1954) and, of course,
Jaynes himself (1957). Indeed, whatever clarity Jaynes attributes to Gibbs, is
not Gibbs’; it is the hard-won clarity that Jaynes attained through his own
efforts and after absorbing much of the best the 20th century had to offer.
The decades following Gibbs (1902) were extremely fruitful for statistical
mechanics but they centered on the systematic development of calculational
methods and their application to a bewildering range of systems and phenom-
ena, including the extension to the quantum domain by Bose, Einstein, Fermi,
Dirac, and von Neumann. With the possible exception of Szilard there were
no significant conceptual advances concerning the connection of entropy and
information until the work of Shannon, Brillouin, and Jaynes around 1950. In
this book we will approach statistical mechanics from the point of view of in-
formation and entropic inference. For an entry point to the extensive literature
on alternative approaches based, for example, on Boltzmann’s equation, the er-
godic hypothesis, etc., see e.g. [Ehrenfest 2012] [ter Haar 1955] [Wehrl 1978]
[Mackey 1989] [Lebowitz 1993, 1999] and [Uffink 2001, 2003, 2006].
Chapter 4

Entropy II: Measuring Information
What is information? Our central goal is to gain insight into the nature of
information, how one manipulates it, and the implications such insights have for
physics. In chapter 2 we provided a first partial answer. We might not yet know
precisely what information is but sometimes we can recognize it. For example,
it is clear that experimental data contains information, that the correct way to
process it involves Bayes’ rule, and that this is very relevant to the empirical
aspect of all science, namely, to data analysis. Bayes' rule is the machinery that
processes the information contained in data to update from a prior to a posterior
probability distribution. This suggests a possible generalization: “information”
is whatever induces a rational agent to update from one state of belief to another.
This is a notion that will be explored in detail later.
In this chapter we pursue a di¤erent point of view that has turned out to
be extremely fruitful. We saw that the natural way to deal with uncertainty,
that is, with lack of information, is to introduce the notion of degrees of belief,
and that these measures of plausibility should be manipulated and calculated
using the ordinary rules of the calculus of probabilities. This achievement is a
considerable step forward but it is not sufficient.
What the rules of probability theory allow us to do is to assign probabilities
to some “complex” propositions on the basis of the probabilities of some other,
perhaps more “elementary”, propositions. The problem is that in order to get
the machine running one must first assign probabilities to those elementary
propositions. How does one do this?
The solution is to introduce a new inference tool designed specifically for
assigning those elementary probabilities. The new tool is Shannon’s measure of
an “amount of information” and the associated method of reasoning is Jaynes’
Method of Maximum Entropy, or MaxEnt. [Shannon 1948, Brillouin 1952,
Jaynes 1957b, 1983, 2003]
4.1 Shannon’s information measure


Consider a set of mutually exclusive and exhaustive alternatives i, for example,
the possible values of a variable, or the possible states of a system. The state
of the system is unknown. Suppose that on the basis of some incomplete infor-
mation we have somehow assigned probabilities pi. In order to figure out which
is the actual state within the set {i} we need more information. The question
we address here is how much more information is needed. Note that we are not
asking which particular piece of information is missing; we are merely asking the
quantity of information that is missing. It seems reasonable that the amount of
missing information in a sharply peaked distribution is smaller than the amount
missing in a broad distribution, but how much smaller?1 Is it possible to quan-
tify the notion of amount of information? Can one find a function S[p] of the
probabilities that tends to be large for broad distributions and small for narrow
ones?
Consider a discrete set of n mutually exclusive and exhaustive discrete states
i, each with probability pi. The restriction to probabilities defined over a discrete
space of alternatives is an important limitation. According to Shannon, the
measure S of the amount of information that is missing when all we know
is the distribution pi must satisfy three axioms. It is quite remarkable that
these three conditions are sufficiently constraining to determine the quantity S
uniquely. The first two axioms are deceptively simple.
Axiom 1. S is a real continuous function of the probabilities pi , S[p] =
S (p1 ; : : : pn ).
Remark: It is explicitly assumed that S[p] depends only on the pi and on noth-
ing else. What we seek here is an absolute measure of the amount of missing
information in p. If the objective were to update from a prior q to a posterior
distribution p – a problem that will be later tackled in chapter 6 – then we would
require a functional S[p; q] depending on both q and p. Such S[p; q] would at
best be a relative measure: the information missing in p relative to the reference
distribution q.
Axiom 2. If all the pi's are equal, pi = 1/n, then S = S(1/n, ..., 1/n) is
some function F(n) that is an increasing function of n.
Remark: This means that it takes less information to pinpoint one alternative
among a few than one alternative among many. It also means that knowing the
number n of available states is already a valuable piece of information. Notice
that the uniform distribution pi = 1/n is singled out to play a very special role.
Indeed, although no reference distribution has been explicitly mentioned, the
uniform distribution will, in effect, provide the standard of complete ignorance.
The third axiom is a consistency requirement and is somewhat less intuitive.
The entropy S[p] is meant to measure the amount of additional information be-
yond the incomplete information already codified in the pi that will be needed
to pinpoint the actual state of the system. Imagine that this missing information
were to be obtained not all at once but in installments. The consistency
requirement is that the particular manner in which we obtain this information
should not matter. This idea can be expressed as follows.

1 If probabilities are subjective then this intuition is itself questionable. The subjective
probabilities could be sharply peaked around a completely wrong value and the actual amount
of missing information could be substantial.

Figure 4.1: The n states are divided into N groups to formulate the grouping
axiom.

Imagine the n states are divided into N groups labeled by g = 1 ... N as
shown in Fig 4.1. The probability that the system is found in group g is
\[
P_g = \sum_{i\in g} p_i\;.
\qquad (4.1)
\]
Let p_{i|g} denote the probability that the system is in state i conditional on its
being in group g. For i ∈ g we have
\[
p_i = p_{ig} = P_g\, p_{i|g}
\quad\text{so that}\quad
p_{i|g} = \frac{p_i}{P_g}\;.
\qquad (4.2)
\]

Suppose we were to obtain the missing information in two steps, the first of
which would allow us to single out one of the groups g while the second would
allow us to decide which is the actual i within the selected group g. The amount
of information required in the first step is S_G = S[P] where P = {P_g} with
g = 1 ... N. Now suppose we did get this information, and as a result we found,
for example, that the system was in group g'. Then for the second step, to
single out the state i within the group g', the amount of additional information
needed would be S_{g'} = S[p_{·|g'}]. But at the beginning of this process we do
not yet know which of the g's is the correct one. Then the expected amount of
missing information to take us from the g's to the actual i is Σ_g P_g S_g. The
consistency requirement is that it should not matter whether we get the total
missing information in one step, which completely determines i, or in two steps,
the first of which has low resolution and only determines one of the groups, say
g', while the second step provides the fine tuning that determines i within g'.
This gives us our third axiom:
Axiom 3. For all possible groupings g = 1 ... N of the states i = 1 ... n we
must have
\[
S[p] = S_G[P] + \sum_g P_g\, S_g[p_{\,\cdot\,|g}]\;.
\qquad (4.3)
\]
This is called the “grouping” property.


Remark: Given axiom 3 it might seem more appropriate to interpret S as a
measure of the expected rather than the actual amount of missing information,
but if S = ⟨...⟩ is the expected value of something, it is not clear, at this point,
what that something and its interpretation would be. We will return to this
below.

The solution to Shannon's constraints is obtained in two steps. The power
of Shannon's axioms arises from their universality; they are meant to hold for
all choices of n and N, for all probability distributions, and for all possible
groupings. First assume that all states i are equally likely, pi = 1/n. Also
assume that the N groups g all have the same number of states, m = n/N, so
that Pg = 1/N and p_{i|g} = pi/Pg = 1/m. Then by axiom 2,
\[
S[p_i] = S(1/n, \dots, 1/n) = F(n)\;,
\qquad (4.4)
\]
\[
S_G[P_g] = S(1/N, \dots, 1/N) = F(N)\;,
\qquad (4.5)
\]
and
\[
S_g[p_{i|g}] = S(1/m, \dots, 1/m) = F(m)\;.
\qquad (4.6)
\]
Then, axiom 3 gives
\[
F(mN) = F(N) + F(m)\;.
\qquad (4.7)
\]
This should be true for all integers N and m. It is easy to see that one solution
of this equation is
\[
F(m) = k\log m\;,
\qquad (4.8)
\]
where k is any positive constant (just substitute), but it is also easy to see
that eq.(4.7) has infinitely many other solutions. To single out (4.8) as the
unique solution we must further impose the additional requirement that F(m)
be monotonic increasing in m (axiom 2).
The uniqueness proof that we give below is due to [Shannon Weaver 1949]
(see also [Jaynes 2003]). Its details might not be of interest to most readers and
may be skipped. First we show that (4.8) is not the only solution of eq.(4.7).
Indeed, since any integer m can be uniquely decomposed as a product of prime
numbers, m = Π_r q_r^{α_r}, where the α_r are integers and the q_r are prime numbers,
using eq.(4.7) we have
\[
F(m) = \sum_r \alpha_r\, F(q_r)\;,
\qquad (4.9)
\]
which means that eq.(4.7) can be satisfied by arbitrarily specifying F(q_r) on the
primes and then defining F(m) for any other integer through eq.(4.9). Consider
any two integers s and t both larger than 1. The ratio of their logarithms can be
approximated arbitrarily closely by a rational number, i.e., we can find integers
α and β (with β arbitrarily large) such that
\[
\frac{\alpha}{\beta} \le \frac{\log s}{\log t} < \frac{\alpha+1}{\beta}
\quad\text{or}\quad
t^{\alpha} \le s^{\beta} < t^{\alpha+1}\;.
\qquad (4.10)
\]
But F is monotonic increasing, therefore
\[
F(t^{\alpha}) \le F(s^{\beta}) < F(t^{\alpha+1})\;,
\qquad (4.11)
\]
and using eq.(4.7),
\[
\alpha F(t) \le \beta F(s) < (\alpha+1)F(t)
\quad\text{or}\quad
\frac{\alpha}{\beta} \le \frac{F(s)}{F(t)} < \frac{\alpha+1}{\beta}\;.
\qquad (4.12)
\]
Which means that the ratio F(s)/F(t) can be approximated by the same rational
number α/β. Indeed, comparing eqs.(4.10) and (4.12) we get
\[
\left|\frac{F(s)}{F(t)} - \frac{\log s}{\log t}\right| \le \frac{1}{\beta}\;,
\qquad (4.13)
\]
or,
\[
\left|\frac{F(s)}{\log s} - \frac{F(t)}{\log t}\right| \le \frac{F(t)}{\beta\log s}\;.
\qquad (4.14)
\]
We can make the right hand side arbitrarily small by choosing β sufficiently
large, therefore F(s)/log s must be a constant, which proves (4.8) is the unique
solution.
In the second step of our derivation we will still assume that all the i's are equally
likely, so that pi = 1/n and S[p] = F(n). But now we assume the groups g have
different sizes, mg, with Pg = mg/n and p_{i|g} = 1/mg. Then axiom 3 becomes
\[
F(n) = S_G[P] + \sum_g P_g\, F(m_g)\;.
\qquad (4.15)
\]
Therefore,
\[
S_G[P] = F(n) - \sum_g P_g\, F(m_g) = \sum_g P_g\,\bigl[F(n) - F(m_g)\bigr]\;.
\qquad (4.16)
\]
Substituting our previous expression for F we get
\[
S_G[P] = \sum_g P_g\, k\log\frac{n}{m_g} = -k\sum_g P_g\log P_g\;.
\qquad (4.17)
\]
Therefore Shannon's quantitative measure of the amount of missing information,
the entropy of the probability distribution p1, ..., pn, is
\[
S[p] = -k\sum_{i=1}^{n} p_i\log p_i\;.
\qquad (4.18)
\]

Comments

Notice that for discrete probability distributions we have p_i \le 1 and \log p_i \le 0.
Therefore S \ge 0 for k > 0. As long as we interpret S as the amount of
uncertainty or of missing information it cannot be negative. We can also check
that in cases where there is no uncertainty we get S = 0: if any state has
probability one, all the other states have probability zero and every term in S
vanishes.
The fact that entropy depends on the available information implies that there
is no such thing as the entropy of a system. The same system may have many
different entropies. Indeed, two different agents may reasonably assign different
probability distributions p and p' so that S[p] \neq S[p']. But the non-uniqueness of
entropy goes even further: the same agent may legitimately assign two entropies
to the same system. This possibility is already shown in the Grouping Axiom
which makes explicit reference to two entropies S[p] and S_G[P] referring to two
different descriptions of the same system — a fine-grained and a coarse-grained
description. Colloquially, however, one does refer to the entropy of a system; in
such cases the relevant information available about the system should be obvious
from the context. For example, in thermodynamics by the entropy one means
the particular entropy obtained when the only information available is specified
by the known values of those few variables that specify the thermodynamic
macrostate.
The choice of the constant k is purely a matter of convention. In thermo-
dynamics the choice is Boltzmann's constant k_B = 1.38 \times 10^{-16} erg/K which
reflects the historical choice of the Kelvin as the unit of temperature. A more
convenient choice is k = 1 which makes temperature have energy units and
entropy dimensionless. In communication theory and computer science, the
conventional choice is k = 1/\log_e 2 \approx 1.4427, so that

    S[p] = -\sum_{i=1}^{N} p_i \log_2 p_i .    (4.19)

The base of the logarithm is 2, and the entropy is said to measure information
in units called 'bits'.
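As a quick aside (not part of the original text), here is a minimal Python sketch of these conventions, using an arbitrarily chosen illustrative distribution:

import numpy as np

def shannon_entropy(p, k=1.0):
    # S[p] = -k sum_i p_i log p_i, eq.(4.18); terms with p_i = 0 contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -k * np.sum(p * np.log(p))

p = [0.5, 0.25, 0.125, 0.125]                  # illustrative distribution
print(shannon_entropy(p))                      # k = 1: entropy in nats
print(shannon_entropy(p, k=1/np.log(2)))       # k = 1/log_e 2: entropy in bits (1.75 here)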
Next we turn to the question of interpretation. Earlier we mentioned that
from the Grouping Axiom it seems more appropriate to interpret S as a measure
of the expected rather than the actual amount of missing information. If one
adopts this interpretation, the actual amount of information that we gain when
we find that i is the true alternative would be \log 1/p_i. But this is not always
satisfactory because it clashes with the intuition that in general large messages
will carry large amounts of information while short messages will carry small
amounts. Indeed, consider a variable that takes just two values, 0 and 1, with
probabilities p and 1-p respectively. For very small p, \log 1/p would be very
large, while the information that communicates the true alternative is physi-
cally conveyed by a very short one bit message, namely "0". This shows that
interpreting \log 1/p as an actual amount of information is not quite right. It
may perhaps be better to interpret \log 1/p as a measure of how unexpected or
how surprising the piece of information is. Some authors do just this and call
\log 1/p_i the "surprise" of i, but then the direct interpretation of S as an amount
of expected "information" is lost.
The standard practice consists of defining the technical term 'information'
as whatever is measured by (4.18). There is nothing wrong with this — defini-
tions are not true or false, they are just more or less useful. Suppose we
interpret S[p] as the 'lack of information' or the 'uncertainty' implicit in p —
here the term 'uncertainty' is used as synonymous to 'lack of information' so
that more information implies less uncertainty. Unfortunately, as the following
example shows, this does not always work either. I normally keep my keys in
my pocket. My state of knowledge about the location of my keys is represented
by a probability distribution that is sharply peaked at my pocket and reflects
a small uncertainty. But suppose I check and I find that my pocket is empty.
Then my keys could be virtually anywhere. My new state of knowledge is rep-
resented by a very broad distribution that reflects a high uncertainty. We have
here a situation where the acquisition of information has increased the entropy
rather than decreased it. (This question is further discussed in section 4.7.)
The point of these remarks is not to suggest that there is something wrong
with the mathematical derivation — there is not; eq.(4.18) does follow from
the axioms. The point rather is to suggest caution when interpreting S. In
fact, at this point the notion of information itself is too imprecise, too vague.
Any attempt to define its amount will always be open to the objection that
it is not clear what it is that is being measured. Is entropy the only way to
measure uncertainty? Doesn't the variance also measure uncertainty? The
remarks above constitute a warning that this technical meaning of information
as whatever is measured by S does not coincide with the more colloquial meaning
of information as something that induces us to change our minds.
In their later writings both Shannon and Jaynes agreed that one should not
place too much significance on the axiomatic derivation of eq.(4.18), that its
use can be fully justified a posteriori by its formal properties, for example, by
the various inequalities it satisfies. This position can, however, be criticized
on the grounds that it is the axioms that confer meaning to the entropy; the
disagreement is not about the actual equations, but about what they mean and,
ultimately, about how they should be used.

On one hand, interpreting entropy as an amount of missing information is
a convenient and intuitive shortcut — much like interpreting probability as a
frequency, or interpreting temperature as a measure of expected kinetic energy.
It is usually safe as long as one is aware of its limitations. On the other hand,
the problem of interpretation can be quite serious because as long as the notion
of information is kept imprecise and vague it is possible to introduce other
measures of information. Indeed, such measures have been introduced by Renyi
and by Tsallis, creating a whole industry of alternative theories [Renyi 1961,
Tsallis 1988]. If the ultimate goal is to design a framework for inference this
situation is very unsatisfactory: whenever one can reach a conclusion using
Shannon's entropy, one can equally well reach different conclusions using any
one of the Renyi-Tsallis entropies. Which, among all those alternatives, should
one choose? This is a problem to which we will return in chapter 6.

The two-state case

To gain intuition about S[p] consider the case of a variable that can take two
values. The paradigmatic example is a biased coin — for example, a bent coin
— for which the outcome 'heads' is assigned probability p and 'tails' probability
1-p. The corresponding entropy, shown in figure 4.2, is

    S(p) = -p \log p - (1-p) \log(1-p) ,    (4.20)

where we chose k = 1. It is easy to check that S \ge 0 and that the maximum
uncertainty, attained for p = 1/2, is S_max = \log 2.

Figure 4.2: Showing the concavity of the entropy S(p) for the case of two states.

An important set of properties follows from the concavity of the entropy,
which itself follows from the concavity of the logarithm.

Suppose we are told the biased coin was drawn from a box that contains a
fraction q of coins for which the probability of heads is p_1 and the remaining
fraction 1-q are coins for which the probability of heads is p_2. The probability
of heads for a random coin is given by the mixture

    p = q p_1 + (1-q) p_2 .    (4.21)

The concavity of entropy implies that the entropies corresponding to p_1, to p_2,
and to p satisfy the inequality

    S(p) \ge q S(p_1) + (1-q) S(p_2) = \bar{S} ,    (4.22)

with equality in the extreme cases where p_1 = p_2, or q = 0, or q = 1. The
interpretation is that if all we know is p then the amount of missing information
is given by S(p). If, in addition to knowing p, we are further told that p is the
result of mixing p_1 and p_2 with probabilities q and 1-q, then we actually know
more and this leads to the inequality \bar{S} \le S(p).
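A quick numerical check of the concavity inequality (4.22), with arbitrarily chosen values of p_1, p_2 and q (a sketch of mine, not from the text):

import numpy as np

def S(p):
    # two-state entropy, eq.(4.20), with k = 1
    return -p*np.log(p) - (1-p)*np.log(1-p)

p1, p2, q = 0.9, 0.3, 0.4
p = q*p1 + (1-q)*p2                  # the mixture, eq.(4.21)
S_bar = q*S(p1) + (1-q)*S(p2)        # average of the component entropies
print(S(p), S_bar, S(p) >= S_bar)    # concavity: S(p) >= S_bar, eq.(4.22)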

4.2 Relative entropy


The following entropy-like quantity, which we earlier met in eq.(2.186),

    K[p, q] = +\sum_i p_i \log \frac{p_i}{q_i} ,    (4.23)

turns out to be useful. Despite the positive sign K is sometimes read as the
'entropy of p relative to q,' and often called "relative entropy". It is easy to see
that in the special case when q_i is a uniform distribution then K is essentially
equivalent to the Shannon entropy — they differ by a constant. Indeed, for
q_i = 1/n, eq.(4.23) becomes

    K[p, 1/n] = \sum_i p_i (\log p_i + \log n) = \log n - S[p] .    (4.24)

The relative entropy is also known by many other names including informa-
tion divergence, information for discrimination, and Kullback-Leibler divergence
[Kullback 1959]. The expression (4.23) has an old history. It was already used
by Gibbs in his Elementary Principles of Statistical Mechanics [Gibbs 1902] and
by Turing as the expected weight of evidence, eq.(2.190) [Good 1983].

It is common to interpret K[p, q] as the amount of information that is gained
(thus the positive sign) when one thought the distribution that applies to a
certain process is q and one learns that the distribution is actually p. Indeed, if
the distribution q is the uniform distribution and reflects the minimum amount
of information we can interpret K[p, 1/n] as the amount of information in p.

As we saw in section (2.11) the weight of evidence factor in favor of hypoth-
esis \theta_1 against \theta_2 provided by data x is

    w(\theta_1 : \theta_2) \overset{\text{def}}{=} \log \frac{p(x|\theta_1)}{p(x|\theta_2)} .    (4.25)

This quantity can be interpreted as the information gained from the observation
of the data x. Indeed, this is precisely the way [Kullback 1959] defines the notion
of information: the log-likelihood ratio is the "information" in the data x for
discrimination in favor of \theta_1 against \theta_2. Accordingly, the relative entropy,

    \int dx\, p(x|\theta_1) \log \frac{p(x|\theta_1)}{p(x|\theta_2)} = K(\theta_1, \theta_2) ,    (4.26)

is interpreted as the expected amount of information per observation drawn from
p(x|\theta_1) in favor of \theta_1 against \theta_2. Any such interpretations can be heuristically
useful but they ultimately suffer from the same conceptual difficulties mentioned
earlier concerning the Shannon entropy. Later, in chapter 6, we shall see that
these interpretational difficulties can be avoided and that the relative entropy
turns out to be the fundamental quantity for inference — indeed, more funda-
mental, more general, and therefore, more useful than the Shannon entropy
itself. (We will also redefine it with a negative sign, S[p, q] \overset{\text{def}}{=} -K[p, q], so that
it includes thermodynamic entropy as a special case.) In this chapter we just
derive some properties and consider some applications.

The Gibbs inequality – An important property of the relative entropy is
the Gibbs inequality,

    K[p, q] \ge 0 ,    (4.27)

with equality if and only if p_i = q_i for all i. The proof uses the concavity of the
logarithm,

    \log x \le x - 1 .    (4.28)

(The graph of the curve y = \log x lies under the straight line y = x - 1.)
Therefore

    \log \frac{q_i}{p_i} \le \frac{q_i}{p_i} - 1 ,    (4.29)

which implies

    \sum_i p_i \log \frac{q_i}{p_i} \le \sum_i (q_i - p_i) = 0 .    (4.30)
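As a minimal numerical sketch (not from the text), one can check the Gibbs inequality on randomly generated distributions:

import numpy as np

def K(p, q):
    # relative entropy K[p,q] = sum_i p_i log(p_i/q_i), eq.(4.23); assumes q_i > 0
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
n = 5
for _ in range(1000):
    p = rng.uniform(0.1, 1.0, n); p /= p.sum()    # random strictly positive distributions
    q = rng.uniform(0.1, 1.0, n); q /= q.sum()
    assert K(p, q) >= 0.0                         # Gibbs inequality, eq.(4.27)
print(K(p, p))                                    # equality if and only if p = q: prints 0.0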

The Gibbs inequality provides some justification to the common interpreta-
tion of K[p, q] as a measure of the "distance" between the distributions p and
q. Although suggestive, this language is not correct because K[p, q] \neq K[q, p]
while a true distance D is required to be symmetric, D[p, q] = D[q, p]. However,
as we shall later see, if the two distributions are sufficiently close the relative
entropy K[p + \delta p, p] turns out to be symmetric and satisfies all the requirements
of a metric. Indeed, up to a constant factor, it is the only natural Riemannian
metric on the manifold of probability distributions. It is variously known as the
Fisher metric, the Fisher-Rao metric and more commonly as the information
metric.

The two inequalities S[p] \ge 0 and K[p, q] \ge 0 together with eq.(4.24) imply

    0 \le S[p] \le \log n ,    (4.31)

which establishes the range of the entropy between the two extremes of complete
certainty (p_i = \delta_{ij} for some value j) and complete uncertainty (the uniform
distribution) for a variable that takes n discrete values.

4.3 Sufficiency*

4.4 Joint entropy, additivity, and subadditivity


The entropy S[px ] re‡ects the uncertainty or lack of information about the
variable x when our knowledge about it is codi…ed in the probability distribution
px . It is convenient to refer to S[px ] directly as the “entropy of the variable x”
and write X
def
Sx = S[px ] = px log px : (4.32)
x

The virtue of this notation is its compactness but one must keep in mind the
same symbol x is used to denote both a variable x and its values xi . To be more
explicit,
X X
px log px = px (xi ) log px (xi ) : (4.33)
x i

The uncertainty or lack of information about two (or more) variables x and
y is expressed by the joint distribution p_{xy} and the corresponding joint entropy
is

    S_{xy} = -\sum_{xy} p_{xy} \log p_{xy} .    (4.34)

When the variables x and y are independent, p_{xy} = p_x p_y, the joint entropy
is additive,

    S_{xy} = -\sum_{xy} p_x p_y \log(p_x p_y) = S_x + S_y ,    (4.35)

that is, the joint entropy of independent variables is the sum of the entropies
of each variable. This additivity property also holds for the other measure of
uncertainty we had introduced earlier, namely, the variance,

    var(x + y) = var(x) + var(y) .    (4.36)

In thermodynamics additivity leads to extensivity: the entropy of an ex-
tended system is the sum of the entropies of its parts provided these parts are
independent. The thermodynamic entropy can be extensive only when the in-
teractions between various subsystems are sufficiently weak that correlations
between them can be neglected. Typically non-extensivity arises from correla-
tions induced by short range surface effects (e.g., surface tension, wetting, cap-
illarity) or by long-range Coulomb or gravitational forces (e.g., plasmas, black
holes, etc.). Incidentally, the realization that extensivity is not a particularly
fundamental property immediately suggests that it should not be given the very
privileged role of a postulate in the formulation of thermodynamics.

When the two variables x and y are not independent the equality (4.35)
can be generalized into an inequality. Consider the joint distribution p_{xy} =
p_x p_{y|x} = p_y p_{x|y}. The entropy K of p_{xy} relative to the product distribution p_x p_y
that would represent uncorrelated variables is given by

    K[p_{xy}, p_x p_y] = \sum_{xy} p_{xy} \log \frac{p_{xy}}{p_x p_y}
                       = -S_{xy} - \sum_{xy} p_{xy} \log p_x - \sum_{xy} p_{xy} \log p_y
                       = -S_{xy} + S_x + S_y .    (4.37)

Therefore, the Gibbs inequality, K \ge 0, leads to

    S_{xy} \le S_x + S_y ,    (4.38)

with the equality holding when the two variables x and y are independent.
The inequality (4.38) is referred to as subadditivity. Its interpretation is clear:
entropy increases when information about correlations among subsystems is
discarded.

4.5 Conditional entropy and mutual information


Consider again two variables x and y. We want to measure the amount of
information about one variable x when we have some limited information about
some other variable y. This quantity, called the conditional entropy, and denoted
S_{x|y}, is obtained by calculating the entropy of x as if the precise value of y were
known and then taking the expectation over the possible values of y,

    S_{x|y} = \sum_y p_y S[p_{x|y}] = -\sum_y p_y \sum_x p_{x|y} \log p_{x|y} = -\sum_{xy} p_{xy} \log p_{x|y} ,    (4.39)

where p_{xy} is the joint distribution of x and y.

The conditional entropy is related to the entropy of x and to the joint entropy
by the following "chain rule." Use the product rule for the joint distribution,

    \log p_{xy} = \log p_y + \log p_{x|y} ,    (4.40)

and take the expectation over x and y to get

    S_{xy} = S_y + S_{x|y} .    (4.41)

In words: the entropy of two variables is the entropy of one plus the conditional
entropy of the other. Also, since S_y is positive we see that conditioning reduces
entropy,

    S_{x|y} \le S_{xy} .    (4.42)

A related entropy-like quantity is the so-called "mutual information" of x
and y, denoted M_{xy}, which "measures" how much information x and y have in
common, or alternatively, how much information is lost when the correlations
between x and y are discarded. This is given by the relative entropy between
the joint distribution p_{xy} and the product distribution p_x p_y that discards all
information contained in the correlations. Using eq.(4.37),

    M_{xy} \overset{\text{def}}{=} K[p_{xy}, p_x p_y] = \sum_{xy} p_{xy} \log \frac{p_{xy}}{p_x p_y} = S_x + S_y - S_{xy} \ge 0 .    (4.43)

Note that M_{xy} is symmetrical in x and y. Using eq.(4.41) the mutual information
is related to the conditional entropies by

    M_{xy} = S_x - S_{x|y} = S_y - S_{y|x} .    (4.44)

An important application of mutual information to the problem of experimental


design is given below in section 4.7.
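To make these identities concrete, here is a small numerical sketch (my own construction, with an arbitrary joint distribution) that checks the chain rule (4.41), the identity (4.44) and subadditivity (4.38):

import numpy as np

def H(p):
    # Shannon entropy of an array of probabilities (k = 1)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# an arbitrary correlated joint distribution; rows label x, columns label y
p_xy = np.array([[0.20, 0.10, 0.05],
                 [0.05, 0.25, 0.35]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

S_xy, S_x, S_y = H(p_xy.ravel()), H(p_x), H(p_y)

# conditional entropy from its definition, eq.(4.39)
p_x_given_y = p_xy / p_y                       # p(x|y), column by column
S_x_given_y = -np.sum(p_xy * np.log(p_x_given_y))

M_xy = S_x + S_y - S_xy                        # mutual information, eq.(4.43)

print(np.isclose(S_xy, S_y + S_x_given_y))     # chain rule, eq.(4.41)
print(np.isclose(M_xy, S_x - S_x_given_y))     # eq.(4.44)
print(S_xy <= S_x + S_y)                       # subadditivity, eq.(4.38)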

4.6 Continuous distributions


Shannon’s derivation of the expression for entropy, eq.(4.18), applies to probabil-
ity distributions of discrete variables. The generalization to continuous variables
is not straightforward.
The discussion will be carried out for a one-dimensional continuous variable;
the generalization to more dimensions is trivial. The starting point is to note
that the expression Z
dx p(x) log p(x) (4.45)

is unsatisfactory. A change of variables x ! y = y(x) changes the probabil-


ity density p(x) to p0 (y) but the actual probabilities do not change, p(x)dx =
p0 (y)dy. The problem is that the transformation x ! y does not represent a
loss or gain of information so the entropy should not change either. However,
one can check that (4.45) is not invariant,
Z Z
dy
dx p(x) log p(x) = dy p0 (y) log p0 (y)
dx
Z
6= dy p0 (y) log p0 (y) : (4.46)

We approach the continuous case as a limit from the discrete case. Consider
a continuous distribution p(x) de…ned on an interval for xa x xb . Divide the
interval into equal intervals x = (xb xa ) =N . For large N the distribution
p(x) can be approximated by a discrete distribution

pn = p(xn ) x ; (4.47)

where xn = xa + n x and n is an integer. The discrete entropy is


N
X
SN = x p(xn ) log [p(xn ) x] ; (4.48)
n=1
and as N \to \infty we get

    S_N \to \log N - \int_{x_a}^{x_b} dx\, p(x) \log \frac{p(x)}{1/(x_b - x_a)} ,    (4.49)

which diverges. The divergence is what one would naturally expect: it takes a
finite amount of information to identify one discrete alternative within a finite
set, but it takes an infinite amount to single out one point in a continuum.
The difference S_N - \log N has a well defined limit and we might be tempted to
consider

    -\int_{x_a}^{x_b} dx\, p(x) \log \frac{p(x)}{1/(x_b - x_a)}    (4.50)

as a candidate for the continuous entropy, until we realize that, except for
an additive constant, it coincides with the unacceptable expression (4.45) and
should be discarded for precisely the same reason: it is not invariant under
changes of variables. Had we first changed variables to y = y(x) and then
discretized into N equal \Delta y intervals we would have obtained a different limit

    -\int_{y_a}^{y_b} dy\, p'(y) \log \frac{p'(y)}{1/(y_b - y_a)} .    (4.51)

The problem is that the limiting procedure depends on the particular choice of
discretization; the limit depends on which particular set of intervals \Delta x or \Delta y
we have arbitrarily decided to call equal. Another way to express the same idea
is to note that the denominator 1/(x_b - x_a) in (4.50) represents a probability
density that is constant in the variable x, but not in y. Similarly, the density
1/(y_b - y_a) in (4.51) is constant in y, but not in x.
Having identified the origin of the problem we can now suggest a solution.
On the basis of our prior knowledge about the particular problem at hand we
must identify a privileged set of coordinates that will define what we mean
by equal intervals or by equal volumes. Equivalently, we must identify one
preferred probability distribution \mu(x) we are willing to define as uniform —
where by "uniform" we mean a distribution that assigns equal probabilities to
equal volumes. Then, and only then, it makes sense to propose the following
definition,

    S[p, \mu] \overset{\text{def}}{=} -\int_{x_a}^{x_b} dx\, p(x) \log \frac{p(x)}{\mu(x)} .    (4.52)

It is easy to check that this is invariant,

    \int_{x_a}^{x_b} dx\, p(x) \log \frac{p(x)}{\mu(x)} = \int_{y_a}^{y_b} dy\, p'(y) \log \frac{p'(y)}{\mu'(y)} .    (4.53)

The following examples illustrate possible choices of the uniform \mu(x):

1. When the variable x refers to position in "physical" Euclidean space, we
can feel fairly comfortable about what we mean by equal volumes: ex-
press x in Cartesian coordinates, that is, replace x by the triple (x, y, z)
and choose \mu(x, y, z) = constant. If we were to translate into spherical
coordinates the corresponding \mu'(r, \theta, \phi) would no longer be constant, but
we would still call it uniform because it assigns equal probabilities to equal
volumes.
2. In a curved D-dimensional space with a known metric tensor g_{ij}, i.e.,
the distance between neighboring points with coordinates x^i and x^i +
dx^i is given by d\ell^2 = g_{ij} dx^i dx^j, and the volume elements are given by
(\det g)^{1/2} d^D x. (See the discussion in section 7.3.) The uniform distribu-
tion is that which assigns equal probabilities to equal volumes,

    \mu(x)\, d^D x \propto (\det g)^{1/2} d^D x .    (4.54)

Therefore we choose \mu(x) \propto (\det g)^{1/2}.

3. In classical statistical mechanics the Hamiltonian evolution in phase space
is, according to Liouville's theorem, such that phase space volumes are
conserved. This leads to a natural definition of equal volumes. The corre-
sponding choice of a \mu that is uniform in phase space is called the postulate
of "equal a priori probabilities." (See the discussion in section 5.2.)

Notice that the expression in eq.(4.52) is (up to the sign) a relative entropy,
S[p, \mu] = -K[p, \mu]. Strictly, there is no Shannon entropy in the continuum —
not only do we have to subtract an infinite constant and spoil its (already shaky)
interpretation as an information measure, but we have to appeal to prior
knowledge and introduce the measure \mu. Relative entropy is the more fundamental
quantity — a theme that will be fully developed in chapter 6. Indeed there is no
difficulty in obtaining the continuum limit from the discrete version of relative
entropy. We can check that

    K_N = \sum_{n=0}^{N} p_n \log \frac{p_n}{q_n} = \sum_{n=0}^{N} \Delta x\, p(x_n) \log \frac{p(x_n) \Delta x}{q(x_n) \Delta x}    (4.55)

has a well defined limit,

    K[p, q] = \int_{x_a}^{x_b} dx\, p(x) \log \frac{p(x)}{q(x)} ,    (4.56)

which is manifestly invariant under coordinate transformations.
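A small numerical sketch (mine, with an arbitrarily chosen density on (0,1) and the monotonic map y = x^2) illustrating that the relative entropy (4.56) is unchanged by a change of variables:

import numpy as np

def trapezoid(f, x):
    # simple trapezoidal rule on a (possibly non-uniform) grid
    return np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x))

x = np.linspace(1e-6, 1 - 1e-6, 200001)
p_x = 6 * x * (1 - x)                    # an illustrative density on (0,1)
q_x = np.ones_like(x)                    # the uniform density on (0,1)
K_x = trapezoid(p_x * np.log(p_x / q_x), x)

y, dydx = x**2, 2*x                      # change of variables y = x^2
p_y, q_y = p_x / dydx, q_x / dydx        # densities transform with the Jacobian
K_y = trapezoid(p_y * np.log(p_y / q_y), y)

print(K_x, K_y)                          # the two values agree up to quadrature error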

4.7 Experimental design


A very useful and elegant application of the notion of mutual information is
to the problem of experimental design. The usual problem of Bayesian data
analysis is to make the best possible inferences about a certain variable on
the basis of data obtained from a given experiment. The problem we address
now concerns the decisions that must be made before the data is collected:
Where should the detectors be placed? How many should there be? When
102 Entropy II: Measuring Information

should the measurement be carried out? How do we remain within the bounds
of a budget? The goal is to choose the best possible experiment given a set
of practical constraints. The idea is to compare the amounts of information
available before and after the experiment. The di¤erence is the amount of
information provided by the experiment and this is the quantity that one seeks to
maximize subject to the appropriate constraints. The basic idea was proposed
in [Lindley 1956]; a more modern application is [Loredo 2003].
The problem can be idealized as follows. We want to make inferences about
a variable . Let q( ) be the prior. We want to select the optimal experiment
from within a family of experiments labeled by ". The label " can be discrete
or continuous, one parameter or many, and each experiment " is speci…ed by its
likelihood function q" (x" j ).
The amount of information before the experiment is performed is given by
Z
q( )
Kb = K[q; ] = d q( ) log ; (4.57)
( )
where ( ) de…nes what we mean by the uniform distribution in the space of s.
If experiment " were to be performed and data x" were obtained the amount of
information after the experiment would be
Z
q" ( jx" )
Ka (x" ) = K[q" ; ] = d q" ( jx" ) log : (4.58)
( )
But all decisions must be made before the data x" is available; the expected
amount of information to be obtained from experiment " is
Z Z
q" ( jx" )
hKa i = dx" q" (x" ) d q" ( jx" ) log ; (4.59)
( )
where q" (x" ) is the prior probability that data x" is observed in experiment ",
Z Z
q" (x" ) = d q" (x" ; ) = d q( )q" (x" j ) : (4.60)

Using Bayes theorem \langle K_a \rangle can be written as

    \langle K_a \rangle = \int dx_\varepsilon d\theta\, q_\varepsilon(x_\varepsilon, \theta) \log \frac{q_\varepsilon(x_\varepsilon, \theta)}{q_\varepsilon(x_\varepsilon) q(\theta)} + K_b .    (4.61)

Therefore, the expected information gained in experiment \varepsilon, which is \langle K_a \rangle - K_b,
turns out to be

    M(\varepsilon) = \int dx_\varepsilon d\theta\, q_\varepsilon(x_\varepsilon, \theta) \log \frac{q_\varepsilon(x_\varepsilon, \theta)}{q_\varepsilon(x_\varepsilon) q(\theta)} ,    (4.62)

which we recognize as the mutual information of the data from experiment \varepsilon
and the variable \theta to be inferred, eq.(4.43). Clearly the best experiment is that
which maximizes M(\varepsilon) subject to whatever conditions (e.g., limited resources,
etc.) apply to the situation at hand.

Incidentally, the mutual information, eq.(4.43), satisfies the Gibbs inequality
M(\varepsilon) \ge 0. Therefore, unless the data x_\varepsilon and the variable \theta are statistically
independent (which represents a totally useless experiment because information
about one variable tells us absolutely nothing about the other) all experiments
are to some extent informative, at least on the average. The qualification 'on
the average' is important: individual samples of data can lead to a negative
information gain. Indeed, as we saw in the keys/pocket example discussed in
section 4.1 a datum that turns out to be surprising can actually increase the
uncertainty in \theta.

An interesting example is that of exploration experiments in which the goal
is to find something [Loredo 2003]. The general background for this kind of
problem is that observations have been made in the past leading to our current
prior q(\theta) and the problem is to decide where or when shall we make the next
observation. The simplifying assumption is that we choose among experiments
\varepsilon that differ only in that they are performed at different locations; in particular,
the inevitable uncertainties introduced by noise are independent of \varepsilon; they are
the same for all locations [Sebastiani Wynn 2000]. The goal is to identify the
optimal location for the next observation. An example in astronomy could be
as follows: the variable \theta represents the location of a planet in the field of view
of a telescope; the data x_\varepsilon represents light intensity; and \varepsilon represents the time
of observation and the orientation of the telescope.
The mutual information M(\varepsilon) can be written in terms of conditional entropy
as in eq.(4.44). Explicitly,

    M(\varepsilon) = \int dx_\varepsilon d\theta\, q_\varepsilon(x_\varepsilon, \theta) \log \frac{q_\varepsilon(x_\varepsilon|\theta)}{q_\varepsilon(x_\varepsilon)}
             = \int dx_\varepsilon d\theta\, q_\varepsilon(x_\varepsilon, \theta) \left[ \log \frac{q_\varepsilon(x_\varepsilon|\theta)}{\mu(x_\varepsilon)} - \log \frac{q_\varepsilon(x_\varepsilon)}{\mu(x_\varepsilon)} \right]
             = \int d\theta\, q(\theta) \int dx_\varepsilon\, q_\varepsilon(x_\varepsilon|\theta) \log \frac{q_\varepsilon(x_\varepsilon|\theta)}{\mu(x_\varepsilon)} - \int dx_\varepsilon\, q_\varepsilon(x_\varepsilon) \log \frac{q_\varepsilon(x_\varepsilon)}{\mu(x_\varepsilon)} ,

where \mu(x_\varepsilon) defines what we mean by the uniform distribution in the space of
x_\varepsilon's. The assumption for these location experiments is that the noise is the same
for all \varepsilon, that is, the entropy of the likelihood function q_\varepsilon(x|\theta) is independent of
\varepsilon. Therefore maximizing

    M(\varepsilon) = \text{const} - \int dx_\varepsilon\, q_\varepsilon(x_\varepsilon) \log \frac{q_\varepsilon(x_\varepsilon)}{\mu(x_\varepsilon)} = \text{const} + S_x(\varepsilon)    (4.63)

amounts to choosing the \varepsilon that maximizes the entropy of the data to be collected:
we expect to learn the most by collecting data where we know the least.
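As a toy illustration of this design criterion (my own construction, not from the text): a discrete parameter theta with a uniform prior and two hypothetical experiments, each specified by a likelihood matrix; the preferred experiment is the one with the larger mutual information M.

import numpy as np

def mutual_information(prior, likelihood):
    # M = sum_{theta,x} q(theta) q(x|theta) log[ q(x|theta)/q(x) ], cf. eq.(4.62)
    p_x = prior @ likelihood                 # q(x) = sum_theta q(theta) q(x|theta)
    M = 0.0
    for i, p_th in enumerate(prior):
        for j, p_x_th in enumerate(likelihood[i]):
            if p_th > 0 and p_x_th > 0:
                M += p_th * p_x_th * np.log(p_x_th / p_x[j])
    return M

prior = np.ones(3) / 3                       # uniform prior over theta = 0, 1, 2

like_A = np.array([[0.8, 0.1, 0.1],          # experiment A: a sharp measurement
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
like_B = np.array([[0.9, 0.1, 0.0],          # experiment B: only separates theta = 0 from the rest
                   [0.1, 0.45, 0.45],
                   [0.1, 0.45, 0.45]])

for name, like in [("A", like_A), ("B", like_B)]:
    print(name, mutual_information(prior, like))   # choose the experiment with the larger value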

4.8 Communication Theory


Here we give the briefest introduction to some basic notions of communication
theory as originally developed by Shannon [Shannon 1948, Shannon Weaver
1949]. For a more comprehensive treatment see [Cover Thomas 1991].

Communication theory studies the problem of how a message that was se-
lected at some point of origin can be reproduced at some later destination point.
The complete communication system includes an information source that gen-
erates a message composed of, say, words in English, or pixels on a picture.
A transmitter translates the message into an appropriate signal. For example,
sound pressure is encoded into an electrical current, or letters into a sequence of
zeros and ones. The signal is such that it can be transmitted over a communica-
tion channel, which could be electrical signals propagating in coaxial cables or
radio waves through the atmosphere. Finally, a receiver reconstructs the signal
back into a message to be interpreted by an agent at the destination point.
From the point of view of the engineer designing the communication system
the challenge is that there is some limited information about the set of potential
messages to be sent but it is not known which specific messages will be selected
for transmission. The typical sort of questions one wishes to address concern
the minimal physical requirements needed to communicate the messages that
could potentially be generated by a particular information source. One wants to
characterize the sources, measure the capacity of the communication channels,
and learn how to control the degrading effects of noise. And after all this,
it is somewhat ironic but nevertheless true that such "information theory" is
completely unconcerned with whether any "information" is being communicated
at all. Shannon's great insight was that, as far as the engineer is concerned,
whether the messages convey some meaning or not is completely irrelevant.
To illustrate the basic ideas consider the problem of data compression. A
useful idealized model of an information source is a sequence of random variables
x_1, x_2, \ldots which take values from a finite alphabet of symbols. We will assume
that the variables are independent and identically distributed. (Eliminating
these limitations is both possible and important.) Suppose that we deal with a
binary source in which the variables x_i, which are usually called 'bits', take the
values zero or one with probabilities p or 1-p respectively. Shannon's idea was
to classify the possible sequences x_1, \ldots, x_N into typical and atypical according
to whether they have high or low probability. The expected number of zeros
and ones is Np and N(1-p) respectively. For large N the probability of any
one of these typical sequences is approximately

    P(x_1, \ldots, x_N) \approx p^{Np} (1-p)^{N(1-p)} ,    (4.64)

so that

    \log P(x_1, \ldots, x_N) \approx N \left[ p \log p + (1-p) \log(1-p) \right] = -N S(p) ,    (4.65)

where S(p) is the two-state entropy, eq.(4.20), the maximum value of which is
S_max = \log 2. Therefore, the probability of typical sequences is roughly

    P(x_1, \ldots, x_N) \approx e^{-N S(p)} .    (4.66)

Since the total probability of typical sequences is less than one, we see that
their number has to be less than about e^{N S(p)}, which for large N is considerably
less than the total number of possible sequences, 2^N = e^{N \log 2}. This fact is
very significant. Transmitting an arbitrary sequence irrespective of whether it
is typical or not requires a long message of N bits, but we do not have to waste
resources in order to transmit all sequences. We only need to worry about the
far fewer typical sequences because the atypical sequences are too rare. The
number of typical sequences is about

    e^{N S(p)} = 2^{N S(p)/\log 2} = 2^{N S(p)/S_{max}} ,    (4.67)

and therefore we only need about N S(p)/S_max bits to identify each one of them.
Thus, it must be possible to compress the original long but typical message into
a much shorter one. The compression might imply some small probability of
error because the actual message might conceivably turn out to be atypical but
one can, if desired, avoid any such errors by using one additional bit to flag
the sequence that follows as typical and short or as atypical and long. Actual
schemes for implementing the data compression are discussed in [Cover Thomas
1991].
Next we state these intuitive notions in a mathematically precise way.

Theorem: The Asymptotic Equipartition Property (AEP)

If x_1, \ldots, x_N are independent variables with the same probability distribution
p(x), then

    -\frac{1}{N} \log P(x_1, \ldots, x_N) \to S[p]   in probability.    (4.68)

Proof: If the variables x_i are independent, so are functions of them such as the
logarithms of their probabilities, \log p(x_i),

    -\frac{1}{N} \log P(x_1, \ldots, x_N) = -\frac{1}{N} \sum_i^N \log p(x_i) ,    (4.69)

and the law of large numbers (see section 2.8) gives

    \lim_{N\to\infty} \text{Prob}\left[ \left| -\frac{1}{N} \log P(x_1, \ldots, x_N) + \langle \log p(x) \rangle \right| \le \varepsilon \right] = 1 ,    (4.70)

where

    \langle \log p(x) \rangle = -S[p] .    (4.71)

This concludes the proof.
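A quick simulation (with arbitrary illustrative values of p and N, not from the text) shows this concentration directly:

import numpy as np

rng = np.random.default_rng(1)
p, N = 0.3, 100_000                          # illustrative bias and sequence length
S = -p*np.log(p) - (1-p)*np.log(1-p)         # two-state entropy S(p), eq.(4.20)

x = rng.random(N) < p                        # one sample sequence of N independent bits
logp = np.where(x, np.log(p), np.log(1-p))
print(-logp.mean(), S)                       # -(1/N) log P(x_1,...,x_N) is close to S(p)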
We can elaborate on the AEP idea further. The typical sequences are those
for which eq.(4.66) or (4.68) is satisfied. To be precise let us define the typical
set A_{N,\varepsilon} as the set of sequences with probability P(x_1, \ldots, x_N) such that

    e^{-N[S(p)+\varepsilon]} \le P(x_1, \ldots, x_N) \le e^{-N[S(p)-\varepsilon]} .    (4.72)

Theorem of typical sequences:

(1) For N sufficiently large, Prob[A_{N,\varepsilon}] > 1 - \varepsilon.

(2) |A_{N,\varepsilon}| \le e^{N[S(p)+\varepsilon]}, where |A_{N,\varepsilon}| is the number of sequences in A_{N,\varepsilon}.

(3) For N sufficiently large, |A_{N,\varepsilon}| \ge (1-\varepsilon)\, e^{N[S(p)-\varepsilon]}.

In words: the typical set has probability approaching certainty; typical se-
quences are nearly equally probable (thus the 'equipartition'); and there are
about e^{N S(p)} of them. To summarize:

The possible sequences are equally likely (well... at least most of them).

Proof: Eq.(4.70) states that for fixed \varepsilon, for any given \delta there is an N_\delta such
that for all N > N_\delta, we have

    \text{Prob}\left[ \left| \frac{1}{N} \log P(x_1, \ldots, x_N) + S[p] \right| \le \varepsilon \right] \ge 1 - \delta .    (4.73)

Thus, the probability that the sequence (x_1, \ldots, x_N) is \varepsilon-typical tends to one,
and therefore so must Prob[A_{N,\varepsilon}]. Setting \delta = \varepsilon yields part (1). To prove (2)
write

    1 \ge \text{Prob}[A_{N,\varepsilon}] = \sum_{(x_1,\ldots,x_N) \in A_{N,\varepsilon}} P(x_1, \ldots, x_N)
      \ge \sum_{(x_1,\ldots,x_N) \in A_{N,\varepsilon}} e^{-N[S(p)+\varepsilon]} = e^{-N[S(p)+\varepsilon]} |A_{N,\varepsilon}| .    (4.74)

Finally, from part (1),

    1 - \varepsilon < \text{Prob}[A_{N,\varepsilon}] = \sum_{(x_1,\ldots,x_N) \in A_{N,\varepsilon}} P(x_1, \ldots, x_N)
      \le \sum_{(x_1,\ldots,x_N) \in A_{N,\varepsilon}} e^{-N[S(p)-\varepsilon]} = e^{-N[S(p)-\varepsilon]} |A_{N,\varepsilon}| ,    (4.75)

which proves (3).
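These statements are easy to check by brute force for moderate N, since all sequences with the same number of ones share the same probability. A sketch of mine, with illustrative values of p, N and the tolerance eps:

import math
import numpy as np

p, N, eps = 0.3, 1000, 0.05                  # illustrative values
S = -p*math.log(p) - (1-p)*math.log(1-p)

prob, log_size = 0.0, -math.inf
for k in range(N + 1):
    # every sequence with k ones has -(1/N) log P = -(k/N) log p - (1-k/N) log(1-p)
    rate = -(k/N)*math.log(p) - (1 - k/N)*math.log(1-p)
    if abs(rate - S) <= eps:                 # sequences with k ones are eps-typical, eq.(4.72)
        log_binom = math.lgamma(N+1) - math.lgamma(k+1) - math.lgamma(N-k+1)
        log_size = np.logaddexp(log_size, log_binom)
        prob += math.exp(log_binom + k*math.log(p) + (N-k)*math.log(1-p))

print(prob)                   # close to 1: the typical set carries nearly all the probability
print(log_size / N, S)        # (1/N) log |A_{N,eps}| lies within eps of S(p), as the theorem asserts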


We can now quantify the extent to which messages generated by an infor-
mation source of entropy S[p] can be compressed. A scheme that produces
compressed sequences that are longer than N S(p)/S_max bits is capable of dis-
tinguishing among all the typical sequences. The compressed sequences can be
reliably decompressed into the original message. Conversely, schemes that yield
compressed sequences of fewer than N S(p)/S_max bits cannot describe all typi-
cal sequences and are not reliable. This result is known as Shannon's noiseless
channel coding theorem.

4.9 Assigning probabilities: MaxEnt


Probabilities are introduced to cope with uncertainty due to missing informa-
tion. The notion that entropy S[p] can be interpreted as a quantitative measure
of the amount of missing information has one remarkable consequence: it pro-
vides us with a method to assign probabilities. The idea is simple: It is just
as important to seek truth as to avoid error. Wishful thinking is not allowed:
we ought to assign probabilities that do not reflect more knowledge than we
actually have. More explicitly:

Among all possible probability distributions we ought to adopt the distri-
bution that represents what we do in fact know while honestly reflecting
ignorance about all else that we do not know.

The mathematical implementation of this idea involves entropy:

Since least information is expressed as maximum entropy, the preferred
distribution is that which maximizes entropy subject to whatever constraints
are imposed by the available information.

This method of reasoning is called the Method of Maximum Entropy and is
often abbreviated as MaxEnt.2 Ultimately, the method of maximum entropy
expresses an ethical principle of intellectual honesty that demands that one
should not assume information one does not have. This justification of the
MaxEnt method is compelling but it relies on interpreting entropy as a measure
of missing information and therein lies its weakness: are we sure that entropy is
the unique measure of information or of uncertainty? This flaw will be addressed
and resolved later in chapter 6.
As a first example of MaxEnt in action consider a variable x about which
absolutely nothing is known except that it can take n discrete values x_i with
i = 1 \ldots n. The distribution that represents the state of maximum ignorance is
that which maximizes the entropy S = -\sum_i p_i \log p_i subject to the single
constraint that the probabilities be normalized, \sum_i p_i = 1. Introducing a
Lagrange multiplier \lambda to handle the constraint, the variation p_i \to p_i + \delta p_i gives

    0 = \delta \left[ S[p] - \lambda \left( \sum_i p_i - 1 \right) \right] = -\sum_i (\log p_i + 1 + \lambda)\, \delta p_i ,    (4.76)

so that independent variations \delta p_i lead to

    \log p_i + 1 + \lambda = 0   or   p_i = e^{-(1+\lambda)} ,    (4.77)

which agrees with the intuition that maximum uncertainty is described by a
uniform distribution. The multiplier \lambda is determined from the normalization
constraint. The result is

    p_i = \frac{1}{n} .    (4.78)

We can check that the corresponding entropy,

    S_{max} = -\sum_i \frac{1}{n} \log \frac{1}{n} = \log n ,    (4.79)

is the maximum value allowed by eq.(4.31).

2 The presentation in these sections (4.9) and (4.10) follows the pioneering work of E.T.
Jaynes [Jaynes 1957b, 1957c] and particularly [Jaynes 1963]. Other relevant papers are
reprinted in [Jaynes 1983] and collected online at http://bayes.wustl.edu.

Remark: The distribution of maximum ignorance turns out to be uniform
and coincides with what we would have obtained using Laplace's Principle of
Insufficient Reason. It is sometimes asserted that MaxEnt provides a proof of
Laplace's principle but such a claim is questionable because from the very be-
ginning the Shannon axioms give a privileged status to the uniform distribution.
It would be more appropriate to say that the method of maximum entropy has
been designed so as to reproduce Laplace's principle.

4.10 Canonical distributions


Next we address a problem in which more information is available. The addi-
tional information is effectively a constraint that defines the family of acceptable
distributions. Although the constraints can take any form whatsoever, in this
section we develop the MaxEnt formalism for the special case of constraints that
are linear in the probabilities. The most important applications are to situa-
tions of thermodynamic equilibrium where the relevant information is given in
terms of the expected values of those few macroscopic variables such as energy,
volume, and number of particles, over which one has some experimental control.
(In the next chapter we revisit this problem in detail.)

The goal is to select the distribution of maximum entropy from within the
family of all distributions for which the expectations of some functions f^k(x) la-
beled by superscripts k = 1, 2, \ldots have known numerical values F^k. To simplify
the notation we assume that the variables x are discrete, x = x_i for i = 1 \ldots n,
and we set f^k(x_i) = f_i^k.

To maximize S[p] subject to the constraints

    \langle f^k \rangle = \sum_i p_i f_i^k = F^k   with   k = 1, 2, \ldots ,    (4.80)

and the normalization, \sum_i p_i = 1, introduce Lagrange multipliers \alpha and \lambda_k,

    0 = \delta \left[ S[p] - \alpha \sum_i p_i - \lambda_k \sum_i p_i f_i^k \right]
      = -\sum_i \left( \log p_i + 1 + \alpha + \lambda_k f_i^k \right) \delta p_i ,    (4.81)

where we adopt the Einstein summation convention that repeated upper and
lower indices are summed over. Independent variations \delta p_i lead to the so-called
'canonical' distribution,

    p_i = \exp\left( -\lambda_0 - \lambda_k f_i^k \right) ,    (4.82)

where we have set 1 + \alpha = \lambda_0. The normalization constraint determines \lambda_0,

    e^{\lambda_0} = \sum_i \exp\left( -\lambda_k f_i^k \right) \overset{\text{def}}{=} Z(\lambda_1, \lambda_2, \ldots) ,    (4.83)

where we have introduced the so-called "partition" function Z(\lambda). The remain-
ing multipliers \lambda_k are determined by eqs.(4.80): substituting eqs.(4.82) and
(4.83) into eqs.(4.80) gives

    \frac{\partial \log Z}{\partial \lambda_k} = -F^k .    (4.84)

This set of equations can in principle be inverted to give \lambda_k = \lambda_k(F); in practice
this is not usually necessary. Substituting eq.(4.82) into S[p] = -\sum_i p_i \log p_i
yields the value of the maximized entropy,

    S_{max} = \sum_i p_i \left( \lambda_0 + \lambda_k f_i^k \right) = \lambda_0 + \lambda_k F^k .    (4.85)

Equations (4.82)-(4.84) are a generalized form of the "canonical" distributions
first discovered by Maxwell, Boltzmann and Gibbs.
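For a single constraint \langle f \rangle = F this construction is easy to carry out numerically. Here is a minimal sketch (with illustrative values of the f_i and of F, not from the text) that finds the multiplier by bisection on eq.(4.84):

import numpy as np

f = np.arange(6, dtype=float)            # illustrative values f_i
F = 2.0                                  # required expected value <f> = F

def canonical_p(lam):
    # canonical distribution p_i proportional to exp(-lam f_i), eq.(4.82), computed stably
    a = -lam * f
    w = np.exp(a - a.max())
    return w / w.sum()

def mean_f(lam):
    return np.sum(canonical_p(lam) * f)

lo, hi = -50.0, 50.0                     # mean_f decreases monotonically with lam
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_f(mid) > F else (lo, mid)
lam = 0.5 * (lo + hi)

print(lam, mean_f(lam))                  # the constraint <f> = F is satisfied

Scanning F over the interval between the smallest and largest f_i traces out the one-parameter family lam(F) whose inversion is mentioned above.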
Strictly, the calculation above only shows that the entropy is stationary,
\delta S = 0. To complete the argument we must show that (4.85) is indeed the
absolute maximum rather than just a local extremum or a stationary point.
Consider any other distribution p'_i that satisfies the same constraints (4.80).
The basic Gibbs inequality for the entropy of p' relative to the canonical p reads

    K(p', p) = \sum_i p'_i \log \frac{p'_i}{p_i} \ge 0 ,    (4.86)

or

    S[p'] \le -\sum_i p'_i \log p_i .    (4.87)

Substituting eq.(4.82) and using the fact that p'_i satisfies the same constraints
(4.80) gives

    S[p'] \le \sum_i p'_i \left( \lambda_0 + \lambda_k f_i^k \right) = \lambda_0 + \lambda_k F^k .    (4.88)

Therefore, recalling (4.85), we have

    S[p'] \le S[p] = S_{max} .    (4.89)

In words: within the family {p} of all distributions that satisfy the constraints
(4.80) the distribution that achieves the maximum entropy is the canonical
distribution p given in eq.(4.82).
Having found the maximum entropy distribution we can now develop the
MaxEnt formalism along lines that closely parallel statistical mechanics. Each
distribution within the family of distributions of the form (4.82) can be thought
of as a point in a continuous space — the "statistical manifold" of canonical
distributions. Each specific choice of expected values (F^1, F^2, \ldots) determines a
unique point within the space, and therefore the F^k play the role of coordinates.
To each point (F^1, F^2, \ldots) we can also associate a number, the value of the
maximized entropy. Therefore, S_{max}(F^1, F^2, \ldots) = S_{max}(F) is a scalar field on
the statistical manifold.

In thermodynamics it is conventional to drop the suffix 'max' and to refer to
S(F) as the entropy of the system. This language can be misleading. We should
constantly remind ourselves that S(F) is just one out of many possible entropies
that one could associate to the same physical system: S(F) is that particular
entropy that measures the amount of information that is missing for an agent
whose knowledge consists of the numerical values of the F's and nothing else.

The quantity

    \lambda_0 = \log Z(\lambda_1, \lambda_2, \ldots) = \log Z(\lambda)    (4.90)

is sometimes called the "free energy" because it is closely related to the ther-
modynamic free energy (Z = e^{-\beta F}). The quantities S(F) and \log Z(\lambda) are
Legendre transforms of each other,

    S(F) = \log Z(\lambda) + \lambda_k F^k .    (4.91)

They contain the same information and therefore just as the F's are obtained
from \log Z(\lambda) through eq.(4.84), the \lambda's can be obtained from S(F),

    \frac{\partial S(F)}{\partial F^k} = \lambda_k .    (4.92)

The proof is straightforward: write

    \frac{\partial S(F)}{\partial F^k} = \frac{\partial \log Z(\lambda)}{\partial \lambda_j} \frac{\partial \lambda_j}{\partial F^k} + \frac{\partial \lambda_j}{\partial F^k} F^j + \lambda_k ,    (4.93)

and use eq.(4.84). Equation (4.92) shows that the multipliers \lambda_k are the com-
ponents of the gradient of the entropy S(F) on the manifold of canonical distri-
butions. Thus, the change in entropy when the constraints are changed by \delta F^k
while the functions f^k are held fixed is

    \delta S = \lambda_k\, \delta F^k .    (4.94)
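Continuing the single-constraint sketch above (same illustrative f_i, my own construction), eq.(4.92) can be verified numerically: the finite-difference slope of S(F) reproduces the multiplier.

import numpy as np

f = np.arange(6, dtype=float)                          # same illustrative f_i as above

def solve_lam(F, lo=-50.0, hi=50.0):
    # bisection for the lam that enforces <f> = F under p_i proportional to exp(-lam f_i)
    def mean_f(lam):
        a = -lam * f
        w = np.exp(a - a.max())
        return np.sum(w * f) / w.sum()
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean_f(mid) > F else (lo, mid)
    return 0.5 * (lo + hi)

def S_of_F(F):
    # maximized entropy S(F) = log Z(lam) + lam F, eq.(4.91)
    lam = solve_lam(F)
    a = -lam * f
    logZ = a.max() + np.log(np.sum(np.exp(a - a.max())))
    return logZ + lam * F

F, dF = 2.0, 1e-4
slope = (S_of_F(F + dF) - S_of_F(F - dF)) / (2 * dF)   # numerical dS/dF
print(slope, solve_lam(F))                             # the slope equals lam, eq.(4.92)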

A useful extension of the formalism is the following. Processes are common
where the functions f^k can themselves be manipulated by controlling one or
more "external" parameters v, f_i^k = f^k(x_i, v). For example if a particular
f^k refers to the energy of the system, then the parameter v could represent the
volume of the system or perhaps an externally applied magnetic field. A general
change in the expected value F^k can be induced by changes in both f^k and \lambda_k,

    \delta F^k = \delta \langle f^k \rangle = \sum_i \left( p_i\, \delta f_i^k + f_i^k\, \delta p_i \right) .    (4.95)

The first term on the right is

    \langle \delta f^k \rangle = \sum_i p_i \frac{\partial f_i^k}{\partial v} \delta v = \left\langle \frac{\partial f^k}{\partial v} \right\rangle \delta v .    (4.96)

When F^k represents the internal energy then \langle \delta f^k \rangle is a small energy trans-
fer that can be controlled through an external parameter v. This suggests
that \langle \delta f^k \rangle represents a kind of "generalized work," \delta W^k, and the expectations
\langle \partial f^k / \partial v \rangle are analogues of pressure or susceptibility,

    \delta W^k \overset{\text{def}}{=} \langle \delta f^k \rangle = \left\langle \frac{\partial f^k}{\partial v} \right\rangle \delta v .    (4.97)

The second term in eq.(4.95),

    \delta Q^k \overset{\text{def}}{=} \sum_i f_i^k\, \delta p_i = \delta \langle f^k \rangle - \langle \delta f^k \rangle ,    (4.98)

is a kind of "generalized heat," and

    \delta F^k = \delta W^k + \delta Q^k    (4.99)

is a "generalized first law." However, there is no implication that the quantities
f^k are conserved (e.g., energy is a conserved quantity but magnetization is not).
The corresponding change in the entropy is obtained from eq.(4.91),

    \delta S = \delta \log Z(\lambda) + \delta(\lambda_k F^k)
             = -\frac{1}{Z} \sum_i \left( \delta\lambda_k f_i^k + \lambda_k \delta f_i^k \right) e^{-\lambda_k f_i^k} + \delta\lambda_k F^k + \lambda_k \delta F^k
             = \lambda_k \left( \delta \langle f^k \rangle - \langle \delta f^k \rangle \right) ,    (4.100)

which, using eq.(4.98), gives

    \delta S = \lambda_k\, \delta Q^k .    (4.101)

It is easy to see that this is equivalent to eq.(4.92) where the partial derivatives
are derivatives at constant v. Thus the entropy remains constant in infinitesimal
"adiabatic" processes — those with \delta Q^k = 0. From the point of view of informa-
tion theory [see eq.(4.98)] this result is a triviality: the amount of information
in a distribution cannot change when the probabilities do not change,

    \delta p_i = 0 \;\Rightarrow\; \delta Q^k = 0 \;\Rightarrow\; \delta S = 0 .    (4.102)

4.11 On constraints and relevant information


MaxEnt is designed as a method to handle information in the form of constraints
(while Bayes handles information in the form of data). The broader question
"What is information?" shall be addressed in more detail in section 6.1 (see
also [Caticha 2007, 2014a]). The MaxEnt method is not at all restricted to
constraints in the form of expected values (several examples will be given in
later chapters) but this is a fairly common situation. To fix ideas consider a
MaxEnt problem in which we maximize S[p] subject to a constraint \langle f \rangle = F to
get a distribution p(i|\lambda) \propto e^{-\lambda f_i}. For example, the probability distribution that
describes the state of thermodynamic equilibrium is obtained maximizing S[p]
subject to a constraint on the expected energy \langle \varepsilon \rangle = E to yield the Boltzmann
distribution p(i|\beta) \propto e^{-\beta \varepsilon_i}, where \beta = 1/T is the inverse temperature (see
section 5.4). The questions we address here are: How do we decide which is the
right function f to choose? How do we decide its numerical value F? When can
we expect the inferences to be reliable?3

3 This material follows the presentation in [Caticha 2012a].

When using the MaxEnt method to obtain, say, the canonical Boltzmann
distribution it has been common to adopt the following language:

We seek the probability distribution that codifies the information we ac-
tually have (e.g., the expected energy) and is maximally unbiased (i.e.
maximally ignorant or maximum entropy) about all the other information
we do not possess.

This justification has stirred a considerable controversy that goes beyond the
issue we discussed earlier of whether the Shannon entropy is the correct way
to measure information. Some of the objections that have been raised are the
following:

(O1) The observed spectrum of black body radiation is whatever it is, inde-
pendently of whatever information happens to be available to us.
(O2) In most realistic situations the expected value of the energy is not a
quantity we happen to know. How, then, can we justify using it as a
constraint?
(O3) Even when the expected values of some quantities happen to be known,
there is no guarantee that the resulting inferences will be any good at all.

These objections deserve our consideration. They offer us an opportunity to
attain a deeper understanding of entropic inference.
The issue raised by O1 strikes at the very heart of what physical theories
are supposed to be and what purpose they are meant to serve. For the sake
of argument let us grant that there is such a thing as an external reality (why
not?), that actual phenomena out there are what they are independently of our
thoughts about them. Then the issue raised by O1 is whether the purpose of
our theories is to provide models that faithfully mirror this external reality or
whether the connection to reality is considerably more indirect and the models
are merely pragmatic tools for manipulating information about reality for the
purposes of prediction, control, explanation, etc.
In the former case, O1 is a legitimate objection because if theories mirror
reality then the information available to us should play no role. In the latter case,
however, the purpose of the theory is not to mirror reality. It is still true that
external realities remain independent of whatever information we might have,
but the theory is just a means to produce models and make predictions. The
objection O1 originates in a failure to recognize that in this latter conception
of 'theory' the success of our models and predictions — including the successful
modeling and prediction of the black body spectrum — depends critically on
possessing the right information, the relevant information.
To address objections O2 and O3 it is useful to distinguish four epistemically
different types of constraints:

(A) The ideal case: We know that \langle f \rangle = F and we know that it captures all
the information that happens to be relevant to the problem at hand.

We have called case A the ideal situation because it reflects a situation in which
the information that is necessary to reliably answer the questions that interest
us is available. The requirements of both relevance and completeness are crucial.
Note that a particular piece of evidence can be relevant and complete for some
questions but not for others. For example, the expected energy \langle \varepsilon \rangle = E is
highly informative for the question "Will system 1 be in thermal equilibrium
with another system 2?" or alternatively, "What is the temperature of system
1?" But the same expected energy is much less informative for the vast majority
of other possible questions such as, for example, "Where can we expect to find
molecule #237 in this sample of ideal gas?"
Our goal here has been merely to describe the ideal epistemic situation one
would like to achieve. We have not addressed the important question of how to
assess whether a particular piece of evidence is relevant and complete for any
specific issue at hand.

(B) The important case: We know that \langle f \rangle captures all the information that
happens to be relevant for the problem at hand but its actual numerical
value F is not known.

This is the most common situation in physics. The answer to objection O2
starts from the observation that whether the value of the expected energy E
is known or not, it is nevertheless still true that maximizing entropy subject
to the energy constraint \langle \varepsilon \rangle = E leads to the indisputably correct family of
thermal equilibrium distributions (including, for example, the observed black-
body spectral distribution). The justification behind imposing a constraint on
the expected energy cannot be that the quantity E happens to be known —
because of the brute fact that it is never actually known — but rather that it
is the quantity that should be known. Even when the actual numerical value is
unknown, the epistemic situation described in case B is one in which we recog-
nize the expected energy \langle \varepsilon \rangle as highly relevant information without which no
successful predictions are possible. (In the next chapter we revisit this impor-
tant question and provide the justification why it is the expected energy — and
not some other conserved quantity such as \langle \varepsilon^2 \rangle — that is relevant to thermal
equilibrium.)

Type B information is processed by allowing MaxEnt to proceed with the
unknown numerical value of \langle \varepsilon \rangle = E handled as a free parameter. This leads
us to a family of distributions p(i|\beta) \propto e^{-\beta \varepsilon_i} containing the multiplier \beta as a
free parameter. The actual value of the parameter is at this point unknown.
To determine it one needs additional information. The standard approach is to
infer \beta either by a direct measurement using a thermometer, or infer it indirectly
by Bayesian analysis from other empirical data.

(C) The predictive case: There is nothing special about the function f
except that we happen to know its expected value, \langle f \rangle = F. In particular,
we do not know whether information about \langle f \rangle is complete or whether it
is at all relevant to the problem at hand.

However, we do know something and this information, although limited, has
some predictive value because it serves to constrain our attention to the subset
of probability distributions that agree with it. Maximizing entropy subject to
such a constraint will yield the best possible predictions but there is absolutely
no guarantee that the predictions will be any good. Thus we see that, properly
understood, objection O3 is not a flaw of the MaxEnt method; it is a legitimate
warning that reasoning with incomplete information is a risky business.4

4 It is this case C that Jaynes adopted for his foundations of entropic inference leading to
what he called predictive statistical mechanics. See [Jaynes 1963, 1986].
(D) The extreme ignorance case: We know neither that \langle f \rangle captures rele-
vant information nor its numerical value F.

This is an epistemic situation that reflects complete ignorance. Case D applies to
any arbitrary function f; it applies equally to all functions f. Since no specific
f is singled out one should just maximize S[p] subject to the normalization
constraint. The result is as expected: extreme ignorance is described by a
uniform distribution.

What distinguishes case C from D is that in C the value of F is actually
known. This brute fact singles out a specific f and justifies using \langle f \rangle = F as a
constraint. What distinguishes D from B is that in B there is actual knowledge
that singles out a specific f as being relevant. This justifies using \langle f \rangle = F as a
constraint. (How it comes to be that a particular f is singled out as relevant is
an important question to be tackled on a case by case basis — a specific example
is discussed in the next chapter.)

To summarize: between one extreme of ignorance (case D, we know neither
which variables are relevant nor their expected values), and the other extreme
of useful knowledge (case A, we know which variables are relevant and we also
know their expected values), there are intermediate states of knowledge (cases
B and C) — and these constitute the rule rather than the exception. Case B
is the more common and important situation in which the relevant variables
have been correctly identified even though their actual expected values remain
unknown. The situation described as case C is less common because information
about expected values is not usually available. (What is usually available is
information in the form of sample averages which is not in general quite the
same thing — see the next section.)

Achieving the intermediate state of knowledge described as case B is the
difficult problem presented by O2. Historically progress has been achieved in
individual cases mostly by intuition and guesswork, that is, trial and error.
Perhaps the seeds for a more systematic "theory of relevance" can already be
seen in the statistical theories of model selection.

4.12 Avoiding pitfalls –I


The method of maximum entropy has been successful in many applications, but
there are cases where it has failed or led to paradoxes and contradictions. Are
these symptoms of irreparable flaws? No. What they are is valuable opportuni-
ties for learning. They teach us how to use the method and warn us about how
not to use it; they allow us to explore its limitations; and what is perhaps most
important is that they provide powerful hints for further development. Here I
collect a few remarks about avoiding such pitfalls — a topic to which we shall
later return (see section 8.3).

4.12.1 MaxEnt cannot …x ‡awed information


An important point is that the issue of how a piece of information was obtained
in the …rst place should not be confused with the issue of how that piece of
information is to be processed. These are two separate issues.
The …rst issue is concerned with the prior judgements that are involved
in assessing whether a particular piece of data or constraint or proposition is
deemed worthy of acceptance as “information”, that is, whether it is “true”
or at least su¢ ciently reliable to provide the basis for the assignment of other
probabilities. The particular process of how a particular piece of information was
obtained — whether the data is uncertain, whether the messenger is reliable —
can serve to qualify and modify the information being processed. Once this …rst
step has been completed and a su¢ ciently reliable information has been accepted
then, and only then, one proceeds to tackle the second step of processing the
newly available information.
MaxEnt only claims to address the second issue: once a constraint has been
accepted as information, MaxEnt answers the question “What precise rule does
one follow to assign probabilities?” Had the “information” turned out to be
“false” our inferences about the world could be wildly misleading, but it is not
the MaxEnt method that should be blamed for this failure. MaxEnt cannot fix
flawed information, nor should we expect it to do so.

4.12.2 MaxEnt cannot supply missing information


It is not uncommon that we may find ourselves in situations where our intuition
insists that our MaxEnt inferences are not right — and this also applies to all
other forms of inference, whether based on Bayes’ rule or on entropy and its
generalizations (see chapter 6). The right way to proceed is to ask: why do we
feel that something is wrong?
The answer is that we must have some expectations which we have not
yet fully recognized that cause our intuitions to clash with the inferences from
MaxEnt. Further analysis will inevitably indicate that either those expectations
were misguided — we have here an opportunity to educate our intuition and
learn. Or, alternatively, the analysis might vindicate the earlier expectations.
This tells us that we had additional prior information that happened to be
relevant but, not having recognized it, we neglected to incorporate it into the
MaxEnt analysis. Either way, the right way to handle such situations is not to
blame the method: first blame the user.

4.12.3 Sample averages are not expected values


Here is an example of a common temptation. A lucid analysis of the issues
involved is given in [Uffink 1996]. Once we accept that certain constraints
might refer to the expected values of certain variables, how do we decide their
numerical magnitudes? The numerical values of expectations are seldom known
and it is tempting to replace expected values by sample averages because it
is the latter that are directly available from experiment. But the two are not
the same: Sample averages are experimental data; expected values are not.
Expected values are epistemic; sample values are not.
For very large samples such a replacement can be justified by the law of large
numbers — there is a high probability that sample averages will approximate
the expected values. However, for small samples using one as an approximation
for the other can lead to incorrect inferences. It is important to realize that these
incorrect inferences do not represent an intrinsic flaw of the MaxEnt method;
they are examples of using the MaxEnt method to process incorrect information.
Example – just data:
Here is a variation on the same theme. Suppose data D = (x_1, x_2, ..., x_n) has
been collected. We might be tempted to maximize S[p] subject to a constraint
⟨x⟩ = C_1 where C_1 is unknown and then try to estimate C_1 from the data. We
might, for example, try

    C_1 \approx \frac{1}{n}\sum_i x_i . \qquad (4.103)

The difficulty arises when we realize that if we know the data (x_1, ...) then we
also know their squares (x_1^2, ...) and their cubes and also any arbitrary function
of them (f(x_1), ...). Which of these should we use for an expected value con-
straint? Or should we use all of them? The answer is that the MaxEnt method
was not designed to tackle the kind of problem where the only information is
raw data D = (x_1, x_2, ..., x_n). It is not that MaxEnt gives a wrong answer; it
gives no answer at all because there is no constraint to impose; the MaxEnt
engine cannot even get started. Later, in chapter 6, we shall return to the prob-
lem of processing information in the form of data using entropic methods. The
answer, unsurprisingly, will establish the deep connection between Bayesian and
entropic inference.
Example – case B plus data:
One can imagine a different problem in order to see how MaxEnt could
get some traction. Suppose, for example, that in addition to the data D =
(x_1, x_2, ..., x_n) collected in n independent experiments we have additional infor-
mation that singles out a specific function f(x) as being “relevant.” Here we deal
with an epistemic situation that was described as type B in the previous section:
the expectation ⟨f⟩ captures relevant information. We proceed to maximize en-
tropy imposing the constraint ⟨f⟩ = F with F treated as a free parameter. If
the variable x can take k discrete values labeled by α, we let f(x_α) = f_α and
the result is a canonical distribution

    p(x_\alpha|\lambda) = \frac{e^{-\lambda f_\alpha}}{Z} \quad\text{where}\quad Z = \sum_{\alpha=1}^{k} e^{-\lambda f_\alpha} , \qquad (4.104)

with an unknown multiplier λ that can be estimated from the data D using
Bayesian methods. If the n experiments are independent, Bayes' rule gives

    p(\lambda|D) = \frac{p(\lambda)}{p(D)} \prod_{i=1}^{n} \frac{e^{-\lambda f_i}}{Z} , \qquad (4.105)

where p(λ) is the prior. It is convenient to consider the logarithm of the poste-
rior,

    \log p(\lambda|D) = \log\frac{p(\lambda)}{p(D)} - \sum_{i=1}^{n}(\log Z + \lambda f_i)
    = \log\frac{p(\lambda)}{p(D)} - n(\log Z + \lambda\bar f) , \qquad (4.106)

where \bar f is the sample average,

    \bar f = \frac{1}{n}\sum_{i=1}^{n} f_i . \qquad (4.107)

The value of λ that maximizes the posterior p(λ|D) is such that

    \frac{\partial\log Z}{\partial\lambda} + \bar f = \frac{1}{n}\frac{\partial\log p(\lambda)}{\partial\lambda} , \qquad (4.108)

or, using (4.84),

    \langle f\rangle = \bar f - \frac{1}{n}\frac{\partial\log p(\lambda)}{\partial\lambda} . \qquad (4.109)

As n → ∞ we see that the optimal λ is such that ⟨f⟩ → \bar f. This is to be
expected: for large n the data overwhelm the prior p(λ) and \bar f tends to ⟨f⟩ (in
probability). But the result eq.(4.109) also shows that when n is not so large
the prior can make a non-negligible contribution: in general one should
not assume that ⟨f⟩ ≈ \bar f.
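The following is a minimal numerical sketch of this "case B plus data" analysis, not part of the text. The model is hypothetical: x takes k = 4 values, the privileged function is assumed to be f(x) = x^2, and an arbitrary Gaussian prior p(λ) is used to show how the prior shifts the MAP multiplier so that ⟨f⟩ and the sample average differ for small n, as in eq.(4.109).

import numpy as np

x_vals = np.array([0.0, 1.0, 2.0, 3.0])
f_vals = x_vals**2                       # the privileged function f(x), assumed here
rng = np.random.default_rng(0)

# Simulate n = 20 data points from a canonical distribution, eq.(4.104)
lam_true = 0.7
p_true = np.exp(-lam_true * f_vals); p_true /= p_true.sum()
data = rng.choice(x_vals, size=20, p=p_true)
f_bar = np.mean(data**2)                 # sample average of f, eq.(4.107)

def neg_log_posterior(lam):
    logZ = np.log(np.exp(-lam * f_vals).sum())
    log_prior = -0.5 * lam**2            # Gaussian prior, an arbitrary illustrative choice
    return -(log_prior - len(data) * (logZ + lam * f_bar))   # negative of eq.(4.106)

# Maximize the posterior over a grid of lambda values
lams = np.linspace(0.01, 5.0, 5000)
lam_map = lams[np.argmin([neg_log_posterior(l) for l in lams])]
p_map = np.exp(-lam_map * f_vals); p_map /= p_map.sum()
print(lam_map, (p_map * f_vals).sum(), f_bar)   # <f> at the MAP value vs the sample average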
Let us emphasize that this analysis holds only when the selection of a privi-
leged function f(x) can be justified by additional knowledge about the physical
nature of the problem. In the absence of such information we are back to the
previous example — just data — and we have no reason to prefer the distribu-
tion e^{-\lambda f_j} over any other canonical distribution e^{-\mu g_j} for any arbitrary function
g(x).5

5 Our conclusion differs from that reached in [Jaynes 1978, pp. 72-75] which did not include
the effect of the prior p(λ).


Chapter 5

Statistical Mechanics

“There is no description or model that does not also reflect an interest or a


purpose.”
Anonymous1
“... but it is important to note that the whole content of the theory depends
critically on just what we mean by ‘probability’.”
E. T. Jaynes [1957c]
Among the various theories that make up what we call physics, statistical
mechanics and thermodynamics hold a very special place because they provided
the first example [Jaynes 1957b, 1957c, 1963, 1965] of a fundamental theory that
could be interpreted as a procedure for processing relevant information. Our
goal in this chapter is to provide an explicit discussion of statistical mechanics
as an example of entropic inference.
The challenge in constructing the models that we call theoretical physics lies
in identifying the subject matter (the microstates) and the information (the con-
straints, the macrostates) that happens to be relevant to the problem at hand.
First we consider the microstates and provide some necessary background on
the dynamical evolution of probability distributions — Liouville’s theorem —
and use it to derive the so-called “postulate” of Equal a Priori Probabilities.
Next, we show that for situations of thermal equilibrium the relevant infor-
mation is encapsulated into a constraint on the expected value of the energy.
Depending on the specific problem one can also include additional constraints
on other conserved quantities such as number of particles or volume. Once the
foundation has been established we can proceed to explore some consequences.
We show how several central topics such as the second law of thermodynam-
ics, irreversibility, reproducibility, and the Gibbs paradox can be considerably
clarified when viewed from the information/inference perspective.2
1 I read this somewhere. Who wrote it? Probably H. Putnam or perhaps W. James. If

they did not then I will claim it as my own.


2 We approach statistical mechanics from the point of view of entropic inference. (See
also [Balian 1991, 1992, 1999].) For an entry point to the extensive literature on alternative
approaches based, for example, on the ergodic hypothesis see e.g. [Ehrenfest 1912] [Khinchin
1949] [ter Haar 1955] [Wehrl 1978] [Mackey 1989] [Lebowitz 1993, 1999] and [Uffink 2001,
2003, 2006]. For a discussion of why ergodic arguments are irrelevant to statistical mechanics
see [Earman Redei 1996].

5.1 Liouville’s theorem


Perhaps the most relevant, and therefore, most important piece of information
that has to be incorporated into any inference about physical systems is that
their time evolution is constrained by equations of motion. Whether these equa-
tions — those of Newton, Maxwell, Yang and Mills, or Einstein — can them-
selves be derived as examples of inference are questions which will not concern
us at this point. (Later, starting in chapter 11, we will revisit this question and
show that quantum mechanics and its Newtonian limit are themselves derivable
as theories of inference.)
To be specific, in this chapter we will limit ourselves to discussing classical
systems such as fluids. In this case there is an additional crucial piece of relevant
information: these systems are composed of molecules. For simplicity we will
assume that the molecules have no internal structure, that they are described
by their positions and momenta, and that they behave according to classical
mechanics.
The import of these remarks is that the proper description of the microstate
of a fluid of N particles in a volume V is in terms of a point in the N-particle
phase space, z = (\vec x_1, \vec p_1, \ldots, \vec x_N, \vec p_N), with coordinates z^\alpha, \alpha = 1 \ldots 6N. The
time evolution is given by Hamilton’s equations,

    \frac{d\vec x_i}{dt} = \frac{\partial H}{\partial\vec p_i} \quad\text{and}\quad \frac{d\vec p_i}{dt} = -\frac{\partial H}{\partial\vec x_i} , \qquad (5.1)

where H is the Hamiltonian,

    H = \sum_{i=1}^{N} \frac{p_i^2}{2m} + U(\vec x_1, \ldots, \vec x_N; V) . \qquad (5.2)

What makes phase space so convenient for the formulation of mechanics is that
Hamilton’s equations are first order in time. This means that through any
given point z(t_0), which can be thought of as the initial condition, there is just
one trajectory z(t) and therefore trajectories can never intersect each other.
In a fluid the actual positions and momenta of the molecules are unknown
and thus the macrostate of the fluid is described by a probability density in phase
space, f(z, t). When the system evolves continuously according to Hamilton’s
equations there is no information loss and the probability flow satisfies a local
conservation equation,

    \frac{\partial}{\partial t} f(z,t) = -\nabla\cdot J(z,t) , \qquad (5.3)

where J is the probability current,

    J(z,t) = f(z,t)\,\dot z , \qquad (5.4)

with

    \dot z = \left(\ldots\frac{d\vec x_i}{dt}\ldots ;\ \ldots\frac{d\vec p_i}{dt}\ldots\right)
           = \left(\ldots\frac{\partial H}{\partial\vec p_i}\ldots ;\ \ldots -\frac{\partial H}{\partial\vec x_i}\ldots\right) . \qquad (5.5)

Since

    \nabla\cdot\dot z = \sum_{i=1}^{N}\left(\frac{\partial}{\partial\vec x_i}\cdot\frac{\partial H}{\partial\vec p_i} - \frac{\partial}{\partial\vec p_i}\cdot\frac{\partial H}{\partial\vec x_i}\right) = 0 , \qquad (5.6)

evaluating the divergence in eq.(5.3) gives

    \frac{\partial f}{\partial t} = -\dot z\cdot\nabla f
    = -\sum_{i=1}^{N}\left(\frac{\partial f}{\partial\vec x_i}\cdot\frac{\partial H}{\partial\vec p_i} - \frac{\partial f}{\partial\vec p_i}\cdot\frac{\partial H}{\partial\vec x_i}\right)
    \stackrel{\text{def}}{=} \{H, f\} , \qquad (5.7)

where {H, f} is the Poisson bracket. This is called the Liouville equation.


Two important corollaries are the following. Instead of focusing on the
change in f(z,t) at a fixed point z as in eq.(5.7) we can study the change in
f(z(t),t) at a point z(t) as it is being carried along by the flow. This defines
the so-called “convective” time derivative,

    \frac{d}{dt} f(z(t),t) = \frac{\partial}{\partial t} f(z,t) + \dot z\cdot\nabla f . \qquad (5.8)

Using (5.7) we see that

    \frac{d}{dt} f(z(t),t) = 0 , \qquad (5.9)

which means that f is constant along a flow line. Explicitly,

    f(z(t),t) = f(z(t_0),t_0) . \qquad (5.10)

Next consider a small volume element Δz(t) the boundaries of which are car-
ried along by the fluid flow. Since trajectories cannot cross each other (because
Hamilton’s equations are first order in time) they cannot cross the boundary of
the evolving volume Δz(t) and therefore the total probability within Δz(t) is
conserved,

    \frac{d}{dt}\,\text{Prob}[\Delta z(t)] = \frac{d}{dt}\left[\Delta z(t)\, f(z(t),t)\right] = 0 . \qquad (5.11)

But f(z(t),t) itself is constant, eq.(5.9), therefore

    \frac{d}{dt}\,\Delta z(t) = 0 , \qquad (5.12)

which means that the shape of a region of phase space may get deformed by
time evolution but its volume remains invariant. This result is usually known
as Liouville’s theorem.
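A minimal numerical illustration of Liouville’s theorem, eq.(5.12), may be useful here; it is not part of the text. The assumed model is a single pendulum with H = p^2/2 − cos x, integrated with the (symplectic, hence volume-preserving) leapfrog scheme; the determinant of the numerically estimated Jacobian of the flow map stays very close to 1, showing that a small phase-space cell is deformed but not compressed.

import numpy as np

def force(x):
    return -np.sin(x)              # -dU/dx for U(x) = -cos(x)

def leapfrog(x, p, dt, nsteps):
    # kick-drift-kick leapfrog integration of Hamilton's equations (5.1)
    for _ in range(nsteps):
        p = p + 0.5 * dt * force(x)
        x = x + dt * p
        p = p + 0.5 * dt * force(x)
    return x, p

# Estimate the Jacobian of the time-t flow map around (x0, p0) by finite differences
x0, p0, eps, dt, nsteps = 0.7, 0.3, 1e-6, 0.01, 1000
J = np.zeros((2, 2))
for j, (dx, dp) in enumerate([(eps, 0.0), (0.0, eps)]):
    xp, pp = leapfrog(x0 + dx, p0 + dp, dt, nsteps)
    xm, pm = leapfrog(x0 - dx, p0 - dp, dt, nsteps)
    J[:, j] = [(xp - xm) / (2 * eps), (pp - pm) / (2 * eps)]

print("det J =", np.linalg.det(J))  # very close to 1: the cell's volume is preserved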

5.2 Derivation of Equal a Priori Probabilities


Earlier, in section 4.6, we pointed out that a proper definition of entropy in a
continuum, eq.(4.52), requires that one specify a privileged background measure
μ(z),

    S[f,\mu] = -\int dz\, f(z)\log\frac{f(z)}{\mu(z)} , \qquad (5.13)

where dz = d^{3N}x\, d^{3N}p in Cartesian coordinates. The choice of μ(z) is important:
it determines what we mean by a uniform or maximally ignorant distribution.
It is customary to set μ(z) equal to a constant which we might as well choose
to be μ(z) = 1. This amounts to postulating that equal volumes of phase space
are assigned the same a priori probabilities. Ever since the introduction of
Boltzmann’s ergodic hypothesis there have been many failed attempts to derive
it from purely dynamical considerations. It is easy to imagine alternatives that
could appear to be just as plausible. One could, for example, divide phase
space into slices of constant energy and assign equal probabilities to equal energy
intervals. (At one point Boltzmann himself tried this. Needless to say, the idea
did not succeed and he abandoned it.) In this section we want to derive μ(z)
by proving the following theorem.

Theorem on Equal a Priori Probabilities: Since a deterministic Hamil-
tonian dynamics involves no loss of information, if the entropy S[f, μ] is
to be interpreted as the measure of the amount of information, then μ(z) must
be a uniform measure over phase space.

Proof: The main non-dynamical hypothesis is that entropy measures informa-
tion. The information entropy of the time-evolved distribution f(z,t) is

    S(t) = -\int dz\, f(z,t)\log\frac{f(z,t)}{\mu(z)} . \qquad (5.14)

The first input from Hamiltonian dynamics is that information is not lost and
therefore we must require that S(t) be constant,

    \frac{d}{dt}S(t) = 0 . \qquad (5.15)

Therefore,

    \frac{d}{dt}S(t) = -\int dz \left[\frac{\partial f(z,t)}{\partial t}\log\frac{f(z,t)}{\mu(z)} + \frac{\partial f(z,t)}{\partial t}\right] . \qquad (5.16)

The second term vanishes,

    \int dz\, \frac{\partial f(z,t)}{\partial t} = \frac{d}{dt}\int dz\, f(z,t) = 0 . \qquad (5.17)

A second input from Hamiltonian dynamics is that probabilities are not merely
conserved; they are locally conserved. This is expressed by eqs.(5.3) and (5.4).
The first term of eq.(5.16) can be rewritten,

    \frac{d}{dt}S(t) = \int dz\, \nabla\cdot(f\dot z)\log\frac{f}{\mu} , \qquad (5.18)

so that integration by parts (the surface term vanishes) gives

    \frac{d}{dt}S(t) = -\int dz\, f\dot z\cdot\nabla\log\frac{f}{\mu}
    = -\int dz\left[\dot z\cdot\nabla f - f\dot z\cdot\nabla\log\mu\right] . \qquad (5.19)

Hamiltonian dynamics enters here once again: the first term vanishes by Liou-
ville’s equation (5.7),

    \int dz\, \dot z\cdot\nabla f = -\int dz\, \frac{\partial f}{\partial t} = 0 , \qquad (5.20)

and therefore, imposing (5.15), leads to

    \frac{d}{dt}S(t) = \int dz\, f\dot z\cdot\nabla\log\mu = 0 . \qquad (5.21)

This condition must hold for any arbitrary choice of f(z,t), therefore

    \dot z\cdot\nabla\log\mu(z) = 0 . \qquad (5.22)

Furthermore, we have considerable freedom about the particular Hamiltonian
operating on the system. We could choose to change the volume in any arbi-
trarily prescribed way by pushing on a piston, or we could
choose to vary an external magnetic field. Either way we can change H(t) and
therefore \dot z at will. The time derivative dS/dt must still vanish irrespective of
the particular choice of the vector \dot z. We conclude that

    \nabla\log\mu(z) = 0 \quad\text{or}\quad \mu(z) = \text{const} . \qquad (5.23)

To summarize: the requirement that information is not lost in Hamiltonian
dynamics implies that the measure of information must be a constant of the
motion,

    \frac{d}{dt}S(t) = 0 , \qquad (5.24)

and this singles out the Gibbs entropy,

    S(t) = -\int dz\, f(z,t)\log f(z,t) , \qquad (5.25)

(in 6N-dimensional phase space) as the correct information entropy.


It is sometimes objected that (5.24) implies that the Gibbs entropy (5.25)
cannot be identified with the thermodynamic entropy of Clausius, eq.(3.19),
because this would be in contradiction with the Second Law.3 This is true but
it should not be an objection; it is further evidence that there is no such thing as the
unique entropy of a system. Different entropies attach to different descriptions
of the system and, as we shall see in Section 5.7, equations (5.24) and (5.25)
will turn out to be crucial elements in the derivation of the Second Law.
Remark: In section 4.1 we pointed out that the interpretation of entropy S[f, μ]
as a measure of information has its shortcomings. This could potentially un-
dermine our whole program of deriving statistical mechanics as an example of
entropic inference. Fortunately, as we shall see later in chapter 6, the framework
of entropic inference can be considerably strengthened by removing any reference
to questionable information measures. In this approach the entropy S[f, μ] requires
no interpretation; it is a tool designed for updating from a prior μ to a posterior
distribution f. More explicitly, the entropy S[f, μ] is introduced to rank can-
didate distributions f according to some criterion of “preference” relative to a
prior μ in accordance with certain “reasonable” design specifications. Recasting
statistical mechanics into this entropic inference framework is straightforward.
For example, the requirement that Hamiltonian time evolution does not affect
the ranking of distributions — that is, if f_1(z,t) is preferred over f_2(z,t) at time
t then the corresponding f_1(z,t') is preferred over f_2(z,t') at any other time t'
— is expressed through eq.(5.15) so the proof of the Equal a Priori Theorem
proceeds exactly as above.

5.3 The constraints for thermal equilibrium


Thermodynamics is mostly concerned with situations of thermal equilibrium.
What is the relevant information needed to make inferences in these special
cases?4 A problem here is that the notion of relevance is relative — a particular
piece of information might be relevant for one specific question and irrelevant
for another. So in addition to the explicit assumption of equilibrium we will
also need to make a somewhat more vague assumption that our general interest
is in those questions that are the typical concern of thermodynamics, namely,
questions involving equilibrium macrostates and the processes that take us from
one to another.
The first condition we must impose on f(z,t) to describe equilibrium is that
it be independent of time. Thus we require that {f, H} = 0 and f must be a
function of conserved quantities such as energy, momentum, angular momentum,
or number of particles. But we do not want f to be merely stationary, as, say, for
a rotating fluid; we want it to be truly static. We want f to be invariant under
time reversal. For these problems it turns out that it is not necessary to impose
that the total momentum and total angular momentum vanish; these constraints
will turn out to be satisfied automatically. (The symmetry of most situations is
such that the same probability will be assigned to molecules moving to the left
as to those moving to the right.) To simplify the situation even more we will
only consider problems where the number of particles is held fixed. Processes
where particles are exchanged as in the equilibrium between a liquid and its
vapor, or where particles are created and destroyed as in chemical reactions,
constitute an important but straightforward extension of the theory.

3 See e.g. [Mackey 1989].
4 The presentation here follows [Caticha 2008]. See also [Lee and Presse 2012].
constitute an important but straightforward extension of the theory.
It thus appears that it is su¢ cient to impose that f be some function of the
energy. According to the formalism developed in section 4.10 and the remarks in
4.11 this is easily accomplished: the constraints codifying the information that
could be relevant to problems of thermal equilibrium should be the expected
values of functions (") of the energy. For example, h (")i could include various
moments, h"i, h"2 i,. . . or perhaps more complicated functions. The remaining
question is which functions (") and how many of them.
To answer this question we look at thermal equilibrium from the point of
view leading to what is known as the microcanonical formalism. Let us enlarge
our description to include the system of interest A and its environment, that
is, the thermal bath B with which it is in equilibrium. The advantage of this
broader view is that the composite system C = A + B can be assumed to be
isolated and we know that its energy ε_c is some fixed constant. This is highly
relevant information: when the value of ε_c is known, not only do we know
⟨ε_c⟩ = ε_c but we know the expected values ⟨φ(ε_c)⟩ = φ(ε_c) for absolutely all
functions φ(ε_c). In other words, in this case we have succeeded in identifying
the relevant information and we are finally ready to assign probabilities using
the MaxEnt method. (When the value of ε_c is not known we are in that state
of “intermediate” knowledge described as case (B) in section 4.11.)
Now we are ready to deploy the MaxEnt method. The argument depends
crucially on using a measure μ(z) that is constant in the phase-space variables.
Maximize the entropy,

    S[f] = -\int dz\, f(z)\log f(z) , \qquad (5.26)

of the composite system C subject to normalization and the fixed energy con-
straint,

    f(z) = 0 \quad\text{if}\quad \varepsilon(z) \neq \varepsilon_c . \qquad (5.27)

To simplify the discussion it is convenient to divide phase space into discrete
cells of equal a priori probability. By the theorem of section 5.2 these cells are
of equal phase-space volume Δz. Then we can use the discrete entropy,

    S = -\sum_c p_c \log p_c \quad\text{where}\quad p_c = f(z_c)\,\Delta z . \qquad (5.28)

For system A let the (discretized) microstate z_a have energy ε_a. For the thermal
bath B a much less detailed description is sufficient. Let the number of bath
microstates with energy ε_b be Ω^B(ε_b). We assume that the microstates c of
the composite system, C = A + B, are labelled by specifying the state of A
and the state of B, c = (a, b). This condition looks innocent but this may be
deceptive; it implies A and B are not quantum mechanically entangled. Our
relevant information also includes the fact that A and B interact very weakly,
that is, any interaction potential V_ab depending on both microstates a and b can
be neglected. The interaction must be weak but cannot be strictly zero, just
barely enough to attain equilibrium. It is this condition of weak interaction that
justifies us in talking about a system A separate from the bath B. Under these
conditions the total energy ε_c constrains the allowed microstates of C = A + B
to the subset that satisfies

    \varepsilon_a + \varepsilon_b = \varepsilon_c . \qquad (5.29)

The total number of such microstates is

    \Omega^C(\varepsilon_c) = \sum_a \Omega^B(\varepsilon_c - \varepsilon_a) . \qquad (5.30)

At this point we are in a situation where we know absolutely nothing beyond
the fact that the composite system C can be in any one of its Ω^C(ε_c) allowed
microstates. This is precisely the problem tackled in section 4.9. Thus, the
distribution of maximum entropy gives p_c = 0 when ε_b ≠ ε_c − ε_a, and is uniform,
eq.(4.78), over the allowed microstates — those with ε_b = ε_c − ε_a. The probability
of any allowed microstate of C is 1/Ω^C(ε_c), and the corresponding entropy is
S^C = log Ω^C(ε_c). More importantly, the probability that system A is in the
particular microstate a with energy ε_a when it is in thermal equilibrium with
the bath B is

    p_a = \sum_b p_{ab} = \sum_{\{b\,|\,\varepsilon_b = \varepsilon_c - \varepsilon_a\}} \frac{1}{\Omega^C(\varepsilon_c)} , \qquad (5.31)

where the sum is over all states b with energy ε_b = ε_c − ε_a, and since 1/Ω^C(ε_c)
is just a constant,

    p_a = \frac{\Omega^B(\varepsilon_c - \varepsilon_a)}{\Omega^C(\varepsilon_c)} . \qquad (5.32)
This is the result we sought; now we need to interpret it. There is one final
piece of relevant information we can use: the thermal bath B is usually much
larger than system A, ε_c ≫ ε_a, which suggests a Taylor expansion. Since Ω^B(ε_b)
varies very rapidly with ε_b it is convenient to rewrite p_a as

    p_a \propto \exp\left[\log\Omega^B(\varepsilon_c - \varepsilon_a)\right] , \qquad (5.33)

and then Taylor expand the logarithm,

    \log\Omega^B(\varepsilon_c - \varepsilon_a) = \log\Omega^B(\varepsilon_c) - \beta\varepsilon_a + \ldots , \qquad (5.34)

where the inverse temperature β = 1/kT of the bath has been introduced ac-
cording to the standard thermodynamic definition,

    \beta \stackrel{\text{def}}{=} \left.\frac{\partial\log\Omega^B}{\partial\varepsilon_b}\right|_{\varepsilon_c} , \qquad (5.35)

and we conclude that the distribution that codifies the relevant information
about equilibrium is

    p_a = \frac{1}{Z}\exp(-\beta\varepsilon_a) , \qquad (5.36)

which has the canonical form of eq.(4.82). (Being independent of a, the factor
Ω^B(ε_c)/Ω^C(ε_c) has been absorbed into the normalization Z.)
Remark: It may be surprising that strictly speaking a system such as A does
not have a temperature. The temperature T is not an ontic property of the
system but an epistemic property that characterizes the probability distribution
(5.36). Indeed, although we often revert to language to the effect that the system
is in a macrostate with temperature T , we should note that in actual fact the
system is in a particular microstate and not in a probability distribution. The
latter refers to our state of knowledge and not to the ontic state of the system.
Our goal in this section was to identify the relevant variables. We are now in a
position to give the answer: the relevant information about thermal equilibrium
can be summarized by the expected value of the energy ⟨ε⟩ because someone
who just knows ⟨ε⟩ and is maximally ignorant about everything else is led to
assign probabilities according to eq.(4.82) which coincides with (5.36).
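A small numerical check of eq.(5.32) against the canonical form (5.36) may make this concrete; it is not part of the text. The assumed model is hypothetical: system A is one harmonic oscillator carrying a quanta of energy, the bath B consists of M oscillators, and the composite system holds n_c quanta in total, so the bath multiplicities can be counted exactly.

from math import comb, log, exp

M, n_c = 50, 200                      # bath size and total number of quanta (assumed)

def omega_B(n):                       # number of ways to distribute n quanta among M oscillators
    return comb(n + M - 1, M - 1)

# Exact probabilities from eq.(5.32): p_a proportional to Omega_B(n_c - a)
w = [omega_B(n_c - a) for a in range(n_c + 1)]
Z_exact = sum(w)
p_exact = [x / Z_exact for x in w]

# Inverse temperature from eq.(5.35), estimated by a finite difference at n_c
beta = log(omega_B(n_c)) - log(omega_B(n_c - 1))

# Canonical approximation, eq.(5.36)
b = [exp(-beta * a) for a in range(n_c + 1)]
Z_can = sum(b)
p_can = [x / Z_can for x in b]

for a in range(5):                    # the low-lying states agree closely for a large bath
    print(a, round(p_exact[a], 5), round(p_can[a], 5))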
But our analysis has also disclosed an important limitation. Eq.(5.32) shows
that in general the distribution for a system in equilibrium with a bath depends
in a complicated way on the properties of the bath. The information in ⟨ε⟩ is
adequate only when (a) the system and the bath interact weakly enough that
the energy of the composite system C can be neatly partitioned into the energies
of A and of B, eq.(5.29), and (b) the bath is so much larger than the system
that its effects can be represented by a single parameter, the temperature T.
Conversely, if these conditions are not met, then more information is needed.
When the system-bath interactions are not sufficiently weak eq.(5.29) will not be
valid and additional information concerning the correlations between A and B
will be required. On the other hand if the system-bath interactions are too weak
then within the time scales of interest the system A will reach only a partial
thermal equilibrium with those few degrees of freedom in its very immediate
vicinity. The system A is effectively surrounded by a thermal bath of finite size
and the information contained in the single parameter β or the expected value
⟨ε⟩ will not suffice. This situation will be briefly addressed in section 5.5.

So what’s the big deal?


We have identified all the ingredients required to derive (see next section) the
canonical formalism of statistical mechanics as an example of entropic inference.
We saw that the identification of ⟨ε⟩ as relevant information relied on the micro-
canonical formalism in an essential way. Does this mean that the information
theory approach was ultimately unnecessary? That MaxEnt adds nothing to
our understanding of statistical mechanics? Absolutely not.
Alternative derivations of statistical mechanics all rely on invoking the right
cocktail of ad hoc hypotheses such as an ergodic assumption or a postulate for
equal a priori probabilities. This is not too bad; all theories, MaxEnt included,
require assumptions. Where MaxEnt can claim an unprecedented success is
that the assumptions it does invoke are not as ad hoc. They are precisely the
type of assumptions one would naturally expect of any theory of inference — a
speci…cation of the subject matter (the microstates), their underlying measure,

plus an identi…cation of the relevant constraints. Indeed, the central assumption


in the previous microcanonical argument — the assignment of equal probabil-
ities to cells of equal volume in phase space — is justi…ed by the information
approach which replaces the ad hoc equal a priori postulate by an equal a priori
theorem. Furthermore, as we shall see below, the recognition of the informa-
tional character of entropy leads to an unprecedented conceptual clarification
of the foundations of statistical mechanics — including the second law. But ul-
timately the justification of any formal system must be pragmatic: does the
entropic model successfully predict, explain and unify? As we shall see in the
next sections the answer is an unqualified yes.

5.4 The canonical formalism


We consider a system in thermal equilibrium [Jaynes 1957b, 1957c, 1963]. The
energy of the (conveniently discretized) microstate z_a is ε_a = ε_a(V) where V
represents a parameter over which we have experimental control. For example,
in fluids V is the volume of the system. We assume further that the expected
value of the energy is known, ⟨ε⟩ = E.
Maximizing the (discretized) information entropy,

    S[p] = -\sum_a p_a \log p_a \quad\text{where}\quad p_a = f(z_a)\,\Delta z , \qquad (5.37)

subject to constraints on normalization and energy ⟨ε⟩ = E yields, eq.(4.82),

    p_a = \frac{1}{Z} e^{-\beta\varepsilon_a} , \qquad (5.38)

where the Lagrange multiplier β is determined from

    -\frac{\partial\log Z}{\partial\beta} = E \quad\text{and}\quad Z(\beta, V) = \sum_a e^{-\beta\varepsilon_a} . \qquad (5.39)

The maximized value of the Gibbs entropy is, eq.(4.85),

    S_G(E, V) = kS(E, V) = k\log Z + k\beta E , \qquad (5.40)

where we reintroduced the constant k. Differentiating with respect to E we
obtain the analogue of eq.(4.92),

    \left(\frac{\partial S_G}{\partial E}\right)_V
    = k\frac{\partial\log Z}{\partial\beta}\frac{\partial\beta}{\partial E}
    + kE\frac{\partial\beta}{\partial E} + k\beta = k\beta , \qquad (5.41)

where eq.(5.39) has been used to cancel the first two terms.
The connection between the statistical formalism and thermodynamics hinges
on a suitable identification of internal energy, work and heat. The first step is
the crucial one: we adopt Boltzmann’s assumption, eq.(3.38), and identify the

expected energy ⟨ε⟩ with the thermodynamical internal energy E: ⟨ε⟩ = E.
Next we consider a small change in the internal energy,

    \delta E = \delta\sum_a p_a\varepsilon_a = \sum_a p_a\,\delta\varepsilon_a + \sum_a \varepsilon_a\,\delta p_a . \qquad (5.42)

Since ε_a = ε_a(V) the first term ⟨δε⟩ on the right can be physically induced by
pushing or pulling on a piston to change the volume,

    \langle\delta\varepsilon\rangle = \sum_a p_a\frac{\partial\varepsilon_a}{\partial V}\delta V
    = \left\langle\frac{\partial\varepsilon}{\partial V}\right\rangle\delta V . \qquad (5.43)

Thus, it is reasonable to identify ⟨δε⟩ with mechanical work,

    \langle\delta\varepsilon\rangle = \delta W = -P\,\delta V , \qquad (5.44)

where P is the pressure,

    P = -\left\langle\frac{\partial\varepsilon}{\partial V}\right\rangle . \qquad (5.45)
Remark: This is an interesting expression in its own right. Notice that this
definition of pressure makes no reference to particles colliding with the walls of
a container; it is much more general. It applies to particles and to radiation
both in the classical and quantum regimes. It also applies to the zero-point
fluctuations of quantum fields where the resulting negative pressure is known as
the Casimir effect.
Having identified the work δW, the second term in eq.(5.42) must therefore
represent heat,

    \delta Q = \delta E - \delta W = \delta\langle\varepsilon\rangle - \langle\delta\varepsilon\rangle . \qquad (5.46)

The corresponding change in entropy is obtained from eq.(5.40),

    \frac{1}{k}\delta S_G = \delta\log Z + \delta(\beta E)
    = -\frac{1}{Z}\sum_a e^{-\beta\varepsilon_a}(\varepsilon_a\,\delta\beta + \beta\,\delta\varepsilon_a)
    + E\,\delta\beta + \beta\,\delta E
    = \beta(\delta E - \langle\delta\varepsilon\rangle) , \qquad (5.47)

therefore,

    \delta S_G = k\beta\,\delta Q \quad\text{or}\quad \delta S_G = \frac{\delta Q}{T} , \qquad (5.48)

where we introduced the suggestive notation

    k\beta = \frac{1}{T} \quad\text{or}\quad \beta = \frac{1}{kT} . \qquad (5.49)

Integrating eq.(5.48) from an initial state A to a final state B gives

    S_G(B) - S_G(A) = \int_A^B \frac{\delta Q}{T} , \qquad (5.50)

where every intermediate state along the path from A to B is a maximum
entropy state.
We are now ready to complete the correspondence between this canonical
formalism and thermodynamics. The thermodynamic entropy introduced by
Clausius, S_C, is defined only for equilibrium states,

    S_C(B) - S_C(A) = \int_A^B \frac{\delta Q}{T} , \qquad (5.51)

where the integral is along a reversible path — the states along the path are
equilibrium states — and where the temperature T is defined by

    \left(\frac{\partial S_C}{\partial E}\right)_V \stackrel{\text{def}}{=} \frac{1}{T}
    \quad\text{so that}\quad \delta S_C = \frac{\delta Q}{T} . \qquad (5.52)

Comparing eqs.(5.50) and (5.51) we see that the maximized Gibbs entropy S_G
and the Clausius entropy S_C differ only by an additive constant. Adjusting the constant
so that S_G matches the Clausius entropy S_C for one equilibrium state, they will
match for all equilibrium states. We can therefore conclude that

A macrostate of thermal equilibrium is described by a maximum entropy
distribution. The maximized Gibbs entropy, S_G(E, V), corresponds to
the thermodynamic entropy S_C originally introduced by Clausius. The
Lagrange multiplier β corresponds to the inverse temperature.

Thus, the framework of entropic inference provides a natural explanation for


thermodynamic quantities such as temperature and entropy in terms of those
theoretical concepts — Lagrange multipliers, information entropies — that must
inevitably appear in all theories of inference.
Remark: It might not be a bad idea to stop for a moment and let this marvelous
notion sink in: temperature, that which we associate with hot things being hot
and cold things being cold is, in the end, nothing but a Lagrange multiplier. It
turns out that in some common cases temperature also happens to be a measure
of the mean kinetic energy per molecule. This latter conception is useful but it is
too limited; it fails to capture the full significance of the concept of temperature.
For example, it does not apply to relativistic particles or to photons or to black
holes.
Substituting (5.44) and (5.48) into eq.(5.46) yields the fundamental ther-
modynamic identity,

    \delta E = T\,\delta S - P\,\delta V , \qquad (5.53)

where we dropped the G and C subscripts. Incidentally, this identity shows that
the “natural” variables for energy are S and V, that is, E = E(S, V). Similarly,
writing

    \delta S = \frac{1}{T}\delta E + \frac{P}{T}\delta V \qquad (5.54)

confirms that S = S(E, V).

Equation (5.53) is useful either for processes at constant V so that δE = δQ,
or for processes at constant S for which δE = δW. But except for these latter
adiabatic processes (δQ = 0) the entropy is not a quantity that can be directly
controlled in the laboratory. For processes that occur at constant temperature
it is more convenient to introduce a new quantity, called the free energy, that is
a function of T and V. The free energy is given by a Legendre transform,

    F(T, V) = E - TS , \qquad (5.55)

so that

    \delta F = -S\,\delta T - P\,\delta V . \qquad (5.56)

For processes at constant T we have δF = δW which justifies the name ‘free’
energy — the amount of energy that is free to be converted to useful work when
the system is not isolated but in contact with a bath at temperature T. Eq.(5.40)
then leads to

    F = -kT\log Z(T, V) \quad\text{or}\quad Z = e^{-\beta F} . \qquad (5.57)

Several useful thermodynamic relations can be easily obtained from eqs.(5.53),
(5.54), and (5.56). For example, the identities

    \left(\frac{\partial F}{\partial T}\right)_V = -S
    \quad\text{and}\quad
    \left(\frac{\partial F}{\partial V}\right)_T = -P , \qquad (5.58)

can be read directly from eq.(5.56).
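The canonical formalism of this section is easily exercised numerically. The following sketch is not part of the text; the toy system is an assumed single harmonic oscillator with levels ε_a = a (in units with k = 1 and level spacing 1), truncated at a large cutoff. It checks eqs.(5.39), (5.40), (5.55) and (5.57).

import numpy as np

eps = np.arange(0, 2000, dtype=float)      # discretized spectrum eps_a (assumed toy model)

def canonical(beta):
    w = np.exp(-beta * eps)
    Z = w.sum()
    p = w / Z
    E = (p * eps).sum()                    # E = <eps> = -d(log Z)/d(beta), eq.(5.39)
    S = np.log(Z) + beta * E               # S_G/k = log Z + beta*E, eq.(5.40)
    F = -np.log(Z) / beta                  # F = -kT log Z, eq.(5.57)
    return Z, E, S, F

beta = 0.5
Z, E, S, F = canonical(beta)

# Check E = -d(log Z)/d(beta) by finite differences, and F = E - TS, eq.(5.55)
db = 1e-6
E_fd = -(np.log(canonical(beta + db)[0]) - np.log(canonical(beta - db)[0])) / (2 * db)
print(E, E_fd)               # the two estimates of the internal energy agree
print(F, E - S / beta)       # F = E - TS with T = 1/beta (k = 1)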

5.5 Equilibrium with a heat bath of finite size


In section 5.3 we saw that the canonical Boltzmann-Gibbs distribution applies
to situations where the system is in thermal equilibrium with an environment
that is much larger than itself. But this latter condition can be violated. For
example, when we deal with very fast phenomena or in situations where the
system-environment interactions are very weak then, over the time scales of
interest, the system will reach a partial equilibrium with only those few degrees
of freedom in its immediate vicinity. In such cases the effective environment has
a finite size and the information contained in the single parameter β will not
suffice. Below we offer some brief remarks on this topic.
One might, for example, account for finite bath size effects by keeping addi-
tional terms in the expansion (5.34),

    \log\Omega^B(\varepsilon_c - \varepsilon_a) = \log\Omega^B(\varepsilon_c)
    - \beta\varepsilon_a - \tfrac{1}{2}\gamma\varepsilon_a^2 - \ldots , \qquad (5.59)

leading to corrections to the Boltzmann distribution,

    p_a = \frac{1}{Z}\exp\left(-\beta\varepsilon_a - \tfrac{1}{2}\gamma\varepsilon_a^2 - \ldots\right) . \qquad (5.60)
An alternative path is to provide a more detailed model of the bath [Plastino
and Plastino 1994]. As before, we consider a system A that is weakly coupled to
a heat bath B that has a finite size. The microstates of A and B are labelled a
and b and have energies ε_a and ε_b respectively. The composite system C = A + B
can be assumed to be isolated and have a constant energy ε_c = ε_a + ε_b (or more
precisely C has energy in some arbitrarily narrow interval about ε_c). To model
the bath B we assume that the number of microstates of B with energy less
than ε is W(ε) = Cε^ν, where the exponent ν is some constant that depends on
the size of the bath. Such a model can be quite realistic. For example, when
the bath consists of N harmonic oscillators we have ν = N, and when the bath
is an ideal gas of N molecules we have ν = 3N/2.
Then the number of microstates of B in a narrow energy range Δε is

    \Omega^B(\varepsilon) = W(\varepsilon + \Delta\varepsilon) - W(\varepsilon) = \nu C\varepsilon^{\nu-1}\Delta\varepsilon , \qquad (5.61)

and the probability that A is in a particular microstate a of energy ε_a is given
by eq.(5.32),

    p_a \propto \Omega^B(\varepsilon_c - \varepsilon_a) \propto \left(1 - \frac{\varepsilon_a}{\varepsilon_c}\right)^{\nu-1} , \qquad (5.62)

so that

    p_a = \frac{1}{Z}\left(1 - \frac{\varepsilon_a}{\varepsilon_c}\right)^{\nu-1}
    \quad\text{with}\quad
    Z = \sum_a \left(1 - \frac{\varepsilon_a}{\varepsilon_c}\right)^{\nu-1} . \qquad (5.63)
When the bath is sufficiently large, ε_a/ε_c → 0 and ν → ∞, one recovers the
Boltzmann distribution with appropriate corrections as in eq.(5.60). Indeed,
using

    \log(1 + x) = x - \tfrac{1}{2}x^2 + \ldots \qquad (5.64)

we expand

    \left(1 - \frac{\varepsilon_a}{\varepsilon_c}\right)^{\nu-1}
    = \exp\left[(\nu - 1)\left(-\frac{\varepsilon_a}{\varepsilon_c}
    - \frac{1}{2}\left(\frac{\varepsilon_a}{\varepsilon_c}\right)^2 + \ldots\right)\right] , \qquad (5.65)

to get eq.(5.60) with

    \beta = \frac{\nu - 1}{\varepsilon_c}
    \quad\text{and}\quad
    \gamma = \frac{\nu - 1}{\varepsilon_c^2} . \qquad (5.66)

We will not pursue the subject any further except to comment that distri-
butions such as (5.63) have been proposed by C. Tsallis on the basis of a very
different logic [Tsallis 1988, 2011].

Non-extensive thermodynamics
The idea proposed by Tsallis is to generalize the Boltzmann-Gibbs canonical
formalism by adopting a di¤erent “non-extensive entropy”,
P
1 i pi
T (p1 ; : : : ; pn ) = ;
1

that depends on a parameter η [Tsallis 1988, 2011].5 Equivalent versions of
such “entropies” have been proposed as alternative measures of information by
several other authors; see, for example, [Renyi 1961], [Aczel 1975], and [Amari
1985].
One important feature is that the standard Shannon entropy is recovered in
the limit η → 1. Indeed, use

    p_i^{\eta-1} = e^{(\eta-1)\log p_i} = 1 + (\eta - 1)\log p_i + \ldots \qquad (5.67)

As η → 1 we get

    T_\eta = \frac{1}{\eta-1}\left(1 - \sum_i p_i^{\eta}\right)
    = \frac{1}{\eta-1}\left[1 - \sum_i p_i\left(1 + (\eta-1)\log p_i\right)\right]
    = -\sum_i p_i\log p_i . \qquad (5.68)

The distribution that maximizes the Tsallis entropy subject to the usual
normalization and energy constraints,

    \sum_i p_i = 1 \quad\text{and}\quad \sum_i \varepsilon_i p_i = E ,

is

    p_i = \frac{1}{Z}\left[1 - \bar\lambda\varepsilon_i\right]^{1/(\eta-1)} , \qquad (5.69)

where Z is a normalization constant and the constant \bar\lambda is a ratio of Lagrange
multipliers. This distribution is precisely of the form (5.63) with \bar\lambda = 1/\varepsilon_c and
\eta = 1 + (\nu - 1)^{-1}.
Our conclusion is that Tsallis distributions make perfect sense within the
canonical Gibbs-Jaynes approach to statistical mechanics. However, in order
to justify them, it is not necessary to introduce an alternative thermodynamics
through new ad hoc entropies; it is merely necessary to recognize that some-
times a partial thermal equilibrium is reached with heat baths that are not
extremely large. What distinguishes the canonical Boltzmann-Gibbs distribu-
tions from (5.63) or (5.69) is the relevant information on the basis of which we
draw inferences and not the inference method. An added advantage is that the
free and undetermined parameter η can, within the standard MaxEnt formalism
advocated here, be calculated in terms of the size of the bath.
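Two of the claims above are easy to verify numerically; the sketch below is not part of the text and uses arbitrary illustrative numbers. It checks that the Tsallis entropy approaches the Shannon entropy as η → 1, and that the finite-bath distribution (5.63) coincides with the Tsallis form (5.69) under the identifications just stated.

import numpy as np

rng = np.random.default_rng(0)
p = rng.random(6); p /= p.sum()                 # an arbitrary normalized distribution

def tsallis(p, eta):
    return (1.0 - np.sum(p**eta)) / (eta - 1.0)

shannon = -np.sum(p * np.log(p))
for eta in [2.0, 1.5, 1.1, 1.01, 1.001]:
    print(eta, tsallis(p, eta), shannon)        # the Tsallis value approaches the Shannon value

# Finite-bath distribution (5.63) vs the Tsallis form (5.69)
nu, eps_c = 10.0, 100.0                         # assumed bath exponent and total energy
eps = np.linspace(0.0, 20.0, 5)                 # a few system energies eps_a
w_bath = (1.0 - eps / eps_c)**(nu - 1.0)        # eq.(5.63), unnormalized
eta = 1.0 + 1.0 / (nu - 1.0)                    # identification below eq.(5.69)
lam_bar = 1.0 / eps_c
w_tsallis = (1.0 - lam_bar * eps)**(1.0 / (eta - 1.0))
print(np.allclose(w_bath, w_tsallis))           # True: the two forms are identical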

5.6 The thermodynamic limit


If the Second Law “has only statistical certainty” (Maxwell, 1871) and any
violation “seems to be reduced to improbability” [Gibbs 1878] how can ther-
modynamic predictions attain so much certainty? Part of the answer hinges
on restricting the kind of questions we are willing to ask to those concerning
the few macroscopic variables over which we have some control. Most other
questions are deemed not “interesting” and thus they are never asked. For ex-
ample, suppose we are given a gas in equilibrium within a cubic box, and the
question is where will we find a particular molecule. The answer is that the
expected position of the molecule is at the center of the box but with a very
large standard deviation — the particle can be anywhere in the box. Such an
answer is not very impressive. On the other hand, if we ask for the energy of
the gas at temperature T, or how it changes as the volume is changed by δV,
then the answers are truly impressive.

5 Criticism of Tsallis’ non-extensive entropy formalism is given in [La Cour and Schieve
2000] [Nauenberg 2003] [Presse et al 2013].
Consider a system in thermal equilibrium in a macrostate described by a
canonical distribution f(z) assigned on the basis of constraints on the values of
certain macrovariables X. For simplicity we will assume X is a single variable,
the energy, X = E = ⟨ε⟩. The generalization to more than one variable is not
difficult. The microstates z can be divided into typical and atypical microstates.
The typical microstates are those contained within a region R_ε defined by im-
posing upper and lower bounds on f(z).
In this section we shall explore a few properties of the typical region. We
will show that the probability of the typical region turns out to be “high”, that
is, Prob[R_ε] = 1 − ε where ε is a small positive number. We will also show
that the thermodynamic entropy S_C and the “phase” volume W_ε of the typical
region are related through Boltzmann’s equation,

    S_C \approx k\log W_\epsilon , \qquad (5.70)

where

    W_\epsilon = \text{Vol}(R_\epsilon) = \int_{R_\epsilon} dz . \qquad (5.71)

The surprising feature is that S_C turns out to be essentially independent of ε.
The following theorems, which are adaptations of the Asymptotic Equipartition
Property [Shannon 1948, Shannon Weaver 1949], state this result in a mathe-
matically precise way. (See also [Jaynes 1965] and section 4.8.)
The Asymptotic Equipartition Theorem: Let f(z) be the canonical dis-
tribution and S = S_G/k = S_C/k the corresponding entropy,

    f(z) = \frac{e^{-\beta\varepsilon(z)}}{Z} \quad\text{and}\quad S = \beta E + \log Z . \qquad (5.72)

If lim_{N→∞} Δε/N = 0, that is, the energy fluctuations Δε (Δε is the standard
deviation) may increase with N but they do so less rapidly than N, then, as
N → ∞,

    -\frac{1}{N}\log f(z) \to \frac{S}{N} \quad\text{in probability.} \qquad (5.73)
Since S/N is independent of z the theorem roughly states that the probabil-
ities of the accessible microstates are “essentially” equal. The microstates z for
which (−log f(z))/N differs substantially from S/N have either too low prob-
ability — they are deemed “inaccessible” — or they might individually have a
high probability but are too few to contribute significantly. The term ‘essen-
tially’ is tricky because f(z) may differ from e^{-S} by a huge multiplicative factor
— perhaps several billion — but log f(z) will still differ from −S by an amount
that is unimportant because it grows less rapidly than N.
Remark: The left hand side of (5.73) is a quantity associated to a microstate z
while the right side contains the entropy S. This may mislead us into thinking
that the entropy S is some ontological property associated to the individual
microstate z rather than a property of the macrostate. But this is not so:
the entropy S is a property of a whole probability distribution f(z) and not
of the individual z's. Any given microstate z_0 can lie within the support of
several different distributions possibly describing different physical situations
and having different entropies. The mere act of finding that the system is in
state z_0 at time t_0 is not sufficient to allow us to figure out whether the system
is best described by a macrostate of equilibrium as in (5.73) or whether it was
undergoing some dynamical process that just happened to pass through z0 at
t0 .
Next we prove the theorem. Apply the Tchebyshev inequality, eq.(2.109),

    P(|x - \langle x\rangle| \geq \epsilon) \leq \left(\frac{\Delta x}{\epsilon}\right)^2 , \qquad (5.74)

to the variable

    x = -\frac{1}{N}\log f(z) . \qquad (5.75)

Its expected value is the entropy per particle,

    \langle x\rangle = -\frac{1}{N}\langle\log f\rangle
    = \frac{S}{N} = \frac{1}{N}(\beta E + \log Z) . \qquad (5.76)

To calculate the variance,

    (\Delta x)^2 = \frac{1}{N^2}\left[\langle(\log f)^2\rangle - \langle\log f\rangle^2\right] , \qquad (5.77)

use

    \langle(\log f)^2\rangle = \langle(\beta\varepsilon + \log Z)^2\rangle
    = \beta^2\langle\varepsilon^2\rangle + 2\beta\langle\varepsilon\rangle\log Z + (\log Z)^2 , \qquad (5.78)

so that

    (\Delta x)^2 = \frac{\beta^2}{N^2}\left[\langle\varepsilon^2\rangle - \langle\varepsilon\rangle^2\right]
    = \left(\frac{\beta\,\Delta\varepsilon}{N}\right)^2 . \qquad (5.79)

Collecting these results gives

    \text{Prob}\left[\left|-\frac{1}{N}\log f(z) - \frac{S}{N}\right| \geq \epsilon\right]
    \leq \left(\frac{\beta\,\Delta\varepsilon}{N\epsilon}\right)^2 . \qquad (5.80)

For systems such that the relative energy ‡uctuation "=N tends to 0 as N ! 1
the limit on the right is zero,

1 S
lim Prob log f (z) =0; (5.81)
N !1 N N
which concludes the proof.
Remark: Note that the theorem applies only to those systems with interparticle
interactions such that the energy ‡uctuations " are su¢ ciently well behaved.
For example, it is not uncommon that "=E / N 1=2 and that the energy is
an extensive quantity, E=N ! const. Then
" "E 1
= / 1=2 ! 0 : (5.82)
N E N N
Typically this happens when the spatial correlations among particles fall suf-
…ciently fast with distance — distant particles are uncorrelated. Under these
conditions both energy and entropy are extensive quantities.
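The concentration expressed by eq.(5.73) is easy to see numerically. The sketch below is not part of the text; the assumed system is N independent oscillators with levels 0, 1, 2, ... in the canonical distribution, for which −(1/N) log f(z) can be sampled directly and its spread shrinks like 1/√N.

import numpy as np

beta, nmax, nsamples = 1.0, 200, 400
levels = np.arange(nmax)
w = np.exp(-beta * levels)
Z1 = w.sum()
p1 = w / Z1
S1 = beta * (p1 * levels).sum() + np.log(Z1)      # entropy per oscillator, cf. eq.(5.72)

rng = np.random.default_rng(1)
for N in [10, 100, 1000, 10000]:
    # sample microstates z = (n_1,...,n_N) and evaluate x = -(1/N) log f(z)
    ns = rng.choice(levels, size=(nsamples, N), p=p1)
    x = (beta * ns + np.log(Z1)).mean(axis=1)     # -(1/N) log f = (1/N) sum_i (beta*n_i + log Z1)
    print(N, S1, x.mean(), x.std())               # the spread about S/N shrinks as N grows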
The following theorem elaborates on these ideas further. To be precise let
us define the typical region R_ε as the set of microstates with probability f(z)
such that

    e^{-S-N\epsilon} \leq f(z) \leq e^{-S+N\epsilon} , \qquad (5.83)

or, using eq.(5.72),

    \frac{1}{Z}e^{-\beta E-N\epsilon} \leq f(z) \leq \frac{1}{Z}e^{-\beta E+N\epsilon} . \qquad (5.84)

This last expression shows that the typical microstates have energy within a
narrow range,

    \varepsilon(z) = E \mp N\epsilon kT . \qquad (5.85)

Remark: Even though states z with energies lower than typical can individually
be more probable than the typical states, it turns out (see below) that they are
too few and their volume is negligible compared to W_ε.
Theorem of typical microstates: For N sufficiently large,

(1) Prob[R_\epsilon] > 1 - \epsilon ;
(2) Vol(R_\epsilon) = W_\epsilon \leq e^{S+N\epsilon} ;
(3) W_\epsilon \geq (1 - \epsilon)\,e^{S-N\epsilon} ;
(4) \lim_{N\to\infty}(\log W_\epsilon - S)/N = 0 .

In words:

The typical region has probability close to one; typical microstates are
almost equally probable; the phase volume they occupy is about e^S, that is,
S \approx \log W_\epsilon.

For large N the entropy is a measure of the logarithm of the phase volume of
typical states,

    S = \log W_\epsilon \mp N\epsilon , \qquad (5.86)

where log W_ε = N·O(1) while ε ≪ 1. The results above are not very sensitive
to the value of ε. A broad range of values 1/N ≪ ε ≪ 1 is allowed. This means
that Nε can be “microscopically large” (e.g., Nε ∼ 10^6 or 10^{12} ≪ 10^{23}) provided ε
remains “macroscopically small” (e.g., ε ∼ 10^{-6} or 10^{-12} ≪ 1). Incidentally, note
that it is the (maximized) Gibbs entropy that satisfies the Boltzmann formula
S_G = S_C = k log W (where the irrelevant subscript ε has been dropped).
Proof: Eq.(5.81) states that for fixed ε, for any given δ there is an N_δ such
that for all N > N_δ, we have

    \text{Prob}\left[\left|-\frac{1}{N}\log f(z) - \frac{S}{N}\right| \leq \epsilon\right] \geq 1 - \delta . \qquad (5.87)

Thus, the probability that a microstate z drawn from the distribution f(z) is
ε-typical tends to one, and therefore so must Prob[R_ε]. Setting δ = ε yields
part (1). This also shows that the total probability of the set of states with

    f(z) > e^{-S+N\epsilon} = \frac{1}{Z}e^{-\beta E+N\epsilon}
    \quad\text{or}\quad
    \varepsilon(z) < E - N\epsilon kT \qquad (5.88)

is negligible — states that individually are more probable than typical occupy
a negligible volume. To prove (2) write

    1 \geq \text{Prob}[R_\epsilon] = \int_{R_\epsilon} dz\, f(z)
    \geq e^{-S-N\epsilon}\int_{R_\epsilon} dz = e^{-S-N\epsilon}\,W_\epsilon . \qquad (5.89)

Similarly, to prove (3) use (1),

    1 - \epsilon < \text{Prob}[R_\epsilon] = \int_{R_\epsilon} dz\, f(z)
    \leq e^{-S+N\epsilon}\int_{R_\epsilon} dz = e^{-S+N\epsilon}\,W_\epsilon . \qquad (5.90)

Finally, from (2) and (3),

    (1 - \epsilon)\,e^{S-N\epsilon} \leq W_\epsilon \leq e^{S+N\epsilon} , \qquad (5.91)

which is the same as

    \frac{\log(1-\epsilon)}{N} - \epsilon \leq \frac{\log W_\epsilon - S}{N} \leq \epsilon , \qquad (5.92)

and proves (4).
Remark: The theorems above can be generalized to situations involving several
macrovariables X^k in addition to the energy. In this case, the expected value
of −log f(z) is

    \langle-\log f\rangle = S = \sum_k \lambda_k X^k + \log Z , \qquad (5.93)

and its variance is

    (\Delta\log f)^2 = \sum_{km} \lambda_k\lambda_m\left[\langle X^k X^m\rangle - \langle X^k\rangle\langle X^m\rangle\right] . \qquad (5.94)

5.7 The Second Law of Thermodynamics


We saw that in 1865 Clausius summarized the two laws of thermodynamics into

The energy of the universe is constant. The entropy of the universe tends
to a maximum.

However, since it makes no sense to assign a thermodynamic entropy to a uni-


verse that is not in thermal equilibrium we should at the very least be a bit more
explicit. First recall some de…nitions. A process in which every intermediate
state is a state of equilibrium — which is achieved if the process is very slow
or quasi-static — is said to be reversible. If no heat is exchanged the process is
said to be adiabatic. Then the Second Law can be stated as follows:

In an adiabatic irreversible process that starts and ends in equilibrium


the total entropy increases; if the process is adiabatic and reversible the
total entropy remains constant.

The Second Law was amended into a stronger form by Gibbs (1878):

In an adiabatic irreversible process not only does the entropy tend to


increase, but it does increase to the maximum value allowed by the con-
straints imposed on the system.

In this and the following two sections we derive and comment on the Second
Law following the argument in [Jaynes 1963, 1965]. Jaynes’ derivation is decep-
tively simple: the mathematics is trivial.6 But it is conceptually subtle so it
may be useful to recall some of our previous results. The entropy mentioned in
the Second Law is the thermodynamic entropy of Clausius, S_C, which is defined
only for equilibrium states.
Consider a system at time t in a state of equilibrium defined by certain
thermodynamic variables X(t). As we saw in section 5.4 the macrostate of
equilibrium is described by the canonical probability distribution f^{can}(z, t) ob-
tained by maximizing the Gibbs entropy S_G subject to the constraints X(t)
= ⟨x(t)⟩ where the quantities x = x(z) are functions of the microstate such as
energy, density, etc. The thermodynamic entropy S_C is then given by

    S_C(t) = S_G^{can}(t) . \qquad (5.95)

The system, which is assumed to be thermally insulated from its environment,
is allowed (or forced) to evolve according to a certain Hamiltonian, H(t). The
evolution need not be slow. It could, for example, be the free expansion of a gas
into vacuum, or it could be given by the time-dependent Hamiltonian that de-
scribes some externally prescribed influence, say, a moving piston or an imposed
field. Since no heat was exchanged with the environment the process is adia-
batic but not necessarily reversible. We further assume that a new equilibrium
is eventually reached at some later time t'. This is a non-trivial condition to
which we will briefly return below. Under these circumstances the initial canon-
ical distribution f^{can}(t), e.g. eq.(4.82) or (5.38), evolves according to Liouville’s
equation, eq.(5.7),

    f^{can}(t) \stackrel{H(t)}{\longrightarrow} f(t') , \qquad (5.96)

and, according to eq.(5.24), the corresponding Gibbs entropy remains constant,

    S_G^{can}(t) = S_G(t') . \qquad (5.97)

6 The mathematical arguments can be traced to the work of Gibbs [Gibbs 1902]. Unfor-
tunately, Gibbs’ treatment of the conceptual foundations left much to be desired and was
promptly criticized in an extremely influential review by Paul and Tatyana Ehrenfest [Ehren-
fest 1912]. Jaynes’ decisive contribution was to place the subject on a completely different
foundation based on improved conceptual understandings of probability, entropy, and infor-
mation.

Since the Gibbs entropy remains constant it is sometimes argued that this con-
tradicts the Second Law, but note that the time-evolved S_G(t') is not the ther-
modynamic entropy because the new f(t') is not necessarily of the canonical
form, eq.(4.82).
From the new distribution f(t') we can, however, compute the new expected
values X(t') = ⟨x(t')⟩ that apply to the state of equilibrium at t'. Of all dis-
tributions agreeing with the same new values X(t') the canonical distribution
f^{can}(t') is that which has maximum Gibbs entropy, S_G^{can}(t'). Therefore

    f(t') \stackrel{\text{MaxEnt}}{\longrightarrow} f^{can}(t') \qquad (5.98)

implies

    S_G(t') \leq S_G^{can}(t') . \qquad (5.99)

But S_G^{can}(t') coincides with the thermodynamic entropy of the new equilibrium
state,

    S_G^{can}(t') = S_C(t') . \qquad (5.100)

Collecting all these results, eqs.(5.95)-(5.100), we conclude that the thermody-
namic entropy has increased,

    S_C(t) \leq S_C(t') . \qquad (5.101)

This is the Second Law. The equality applies when the time evolution is quasi-
static so that the distribution remains canonical at all intermediate instants
through the process.
To summarize, the chain of steps is

    S_C(t) \underset{(1)}{=} S_G^{can}(t) \underset{(2)}{=} S_G(t') \underset{(3)}{\leq} S_G^{can}(t') \underset{(4)}{=} S_C(t') . \qquad (5.102)

Steps (1) and (4) hinge on identifying the maximized Gibbs entropy with the
thermodynamic entropy — which is justified provided we have correctly identi-
fied the relevant macrovariables X for the particular problem at hand. Step (2)
follows from the constancy of the Gibbs entropy under Hamiltonian evolution
— since this is a mathematical theorem this is the least controversial step. Of
course, if we did not have complete knowledge about the exact Hamiltonian
H(t) acting on the system an inequality would have been introduced already at
this point — such would be an entropy increase over and above the Second Law.
The crucial inequality, however, is introduced in step (3) where information is
discarded. The distribution f(t') contains information about the macrovariables
X(t') at the final time t', but since the Hamiltonian is known, it also contains
information about the whole previous history of f back to the initial time t and
including the initial values X(t). In contrast, a description in terms of the dis-
tribution f^{can}(t') contains information about the macrovariables X(t') at time
t' and nothing else. In a truly thermodynamic description all memory of the
history of the system is lost.
The Second Law refers to thermodynamic entropies only. These entropies
measure the amount of information available to someone with only macroscopic
means to observe and manipulate the system. The evolution implied by Hamil-
tonian dynamics leads to distributions, such as f(t'), that include information
beyond what is allowed in a purely thermodynamic description. It is the act
of discarding such extra information that lies at the foundation of the Second
Law. The irreversibility implicit in the Second Law arises from the restriction
to thermodynamic descriptions.7
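A standard worked example, not taken from the text, makes the chain (5.102) concrete; it assumes the ideal-gas result Z(β, V) ∝ V^N. Consider the free expansion of an ideal gas of N molecules from volume V to 2V after a partition is removed. The energy, and hence β, is unchanged, so eq.(5.40) gives

    \Delta S_C = S_C(t') - S_C(t) = k\left[\log Z(\beta, 2V) - \log Z(\beta, V)\right] = Nk\log 2 > 0 ,

even though no heat was exchanged. Since steps (1), (2) and (4) are equalities, the entire increase arises in step (3), where the detailed information carried by f(t') about the prior history is discarded in favor of the canonical description based on the new macrovariables alone.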
The fact that the Second Law refers to the thermodynamic entropies of
initial and final states of equilibrium implies that many important processes lie
outside its purview. One example is that of a gas that forever expands into
vacuum and never reaches equilibrium. Another example, this time borrowed
from cosmology, is that of an expanding universe. These are not states to which
one can assign a thermodynamic entropy and therefore the Second Law does
not apply. It is not that the Second Law is in any way violated; it is rather that
in these matters it remains silent.
Thus, the Second Law of Thermodynamics is not a Law of “Nature.” The
Second Law is a law but its connection to Nature is indirect. It is a law within
the very useful but also very limited class of models restricted to thermodynamic
descriptions of thermal equilibrium.
It is important to emphasize what has just been proved: in an irreversible
adiabatic process from an initial to a final state of equilibrium the thermody-
namic entropy increases — this is the Second Law. Many questions have been
left unanswered; some we will briefly address in the next two sections. Other
questions we will not address: we have assumed that the system tends towards
and finally reaches an equilibrium; how do we know that this happens? What
are the relaxation times, transport coefficients, etc.? There are all sorts of as-
pects of non-equilibrium irreversible processes that remain to be explained but
this does not detract from what Jaynes’ explanation did in fact accomplish,
namely, it explained the Second Law, no more and, most emphatically, no less.
Remark: It is sometimes stated that it is the Second Law that drives a system
towards equilibrium. This is not correct. The approach to equilibrium is not
a consequence of the Second Law. If anything, it is the other way around: the
existence of a final state of equilibrium is a pre-condition for the Second Law.
The extension of entropic methods of inference beyond situations of equilib-
rium is, of course, highly desirable. I will offer two comments on this matter.
The first is that there is at least one theory of extreme non-equilibrium that
is highly successful and very well known. It is called quantum mechanics —
the ultimate framework for a probabilistic time-dependent non-equilibrium dy-
namics. The derivation of quantum mechanics as an example of an “entropic
dynamics” will be tackled in Chapter 11. The second comment is that the term
‘non-equilibrium’ is too broad and too vague to be useful. In order to make
progress it is important to be very specific about which type of non-equilibrium
process one is trying to describe.8

7 On the topic of descriptions see [Grad 1961, 1967, Balian Veneroni 1987, Balian 1999].

5.8 Interpretation of the Second Law: Reproducibility
First a summary of the previous sections: We saw that for macroscopic systems fluctuations of the variables X(t) are negligible — all microstates within the typical region R(t) are characterized by essentially the same values of X(t), eq.(5.84). We also saw that given X(t) the typical region has probability one — it includes essentially all possible initial microstates compatible with the values X(t). Having been prepared in equilibrium at time t the system is then subjected to an adiabatic process and it eventually attains a new equilibrium at time t′. The Hamiltonian evolution deforms the initial region R(t) into a new region R(t′) with exactly the same original volume, W(t) = W(t′); the macrovariables evolve from their initial values X(t) to new values X(t′). Now suppose that for the new equilibrium we adopt a thermodynamic description: the preparation history is forgotten, and all we know are the new values X(t′). The new typical region R′(t′), which includes all microstates compatible with the information X(t′), has volume W′(t′) > W(t) and entropy S_C(t′) > S_C(t) — this is the Second Law.
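In condensed form, since every microstate in R(t′) is (essentially) compatible with the values X(t′) so that R(t′) ⊆ R′(t′), and writing W = e^{S_C/k} as in the next paragraph, the argument reads

    S_C(t) = k\log W(t) = k\log W(t') \le k\log W'(t') = S_C(t') .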
The volume W(t) = e^{S_C(t)/k} of the typical region R(t) can be interpreted in two ways. On one hand it is a measure of our ignorance as to the true microstate when all we know are the macrovariables X(t). On the other hand, the volume W(t) is also a measure of the extent to which we can control the actual microstate of the system when the X(t) are the only variables we can experimentally manipulate.
After these preliminaries we come to the crux of the argument: With the limited experimental means at our disposal we can guarantee that the initial microstate will be somewhere within R(t) and therefore that in due course of time it will evolve to be within R(t′). (See Figure 5.1.) In order for the process X(t) → X(t′) to be experimentally reproducible it must be that all the microstates in R(t) will also evolve to be within R′(t′), which means that W(t) = W(t′) ≤ W′(t′). Conversely, if it happened that W(t) > W′(t′) we would sometimes observe that an initial microstate within R(t) would evolve into a final microstate lying outside R′(t′), that is, sometimes we would observe that X(t) would not evolve to X(t′). Such an experiment would definitely not be reproducible.

Figure 5.1: Entropy increases towards the future: The microstate of a system in equilibrium with macrovariables X(t) at the initial time t lies somewhere within R(t). A constraint is removed and the system spontaneously evolves to a new equilibrium at t′ > t in the region R(t′) characterized by values X(t′) and with the same volume as R(t). The maximum entropy region that describes equilibrium with the same values X(t′) irrespective of the prior history is R′(t′). The experiment is reproducible because all states within the larger region R′(t′) are characterized by the same X(t′).
A new element has been introduced into the discussion of the Second Law:
reproducibility [Jaynes 1965]. Thus, we can express the Second Law in the
somewhat tautological form:

In a reproducible adiabatic process from one state of equilibrium to another the thermodynamic entropy cannot decrease.

We can address this question from a different angle: How do we know that the chosen constraints X are the relevant macrovariables that provide an adequate thermodynamic description? In fact, what do we mean by an adequate description? Let us rephrase these questions differently: Could there exist additional physical constraints Y that significantly restrict the microstates compatible with the initial macrostate and which would therefore provide an even better description? The answer is that, to the extent that we are only interested in the X variables, it is unlikely that the inclusion of additional Y variables in the description will lead to improved predictions. The reason is that since the process X(t) → X(t′) is already reproducible when no particular care has been taken to control the values of Y, it is unlikely that the additional information provided by Y would be relevant. Thus keeping track of the Y variables will not yield a better description. Reproducibility is the pragmatic criterion whereby we can decide whether a particular thermodynamic description is adequate for our purposes or not.

5.9 On reversibility, irreversibility, and the arrow of time
A considerable source of confusion on the question of reversibility originates in the fact that the same word ‘reversible’ is used with several different meanings [Uffink 2001]:
(a) Mechanical or microscopic reversibility refers to the possibility of reversing the velocities of every particle. Such reversals would allow a completely isolated system not just to retrace its steps from the final macrostate to the initial macrostate but it would also allow it to retrace its detailed microstate trajectory as well.
(b) Carnot or macroscopic reversibility refers to the possibility of retracing the history of macrostates of a system in the opposite direction. The required amount of control over the system can be achieved by forcing the system along a prescribed path of intermediate macroscopic equilibrium states that are infinitesimally close to each other. Such a reversible process is appropriately called quasi-static. There is no implication that the trajectories of the individual particles will be retraced.
(c) Thermodynamic reversibility refers to the possibility of starting from a final macrostate and completely recovering the initial macrostate without any other external changes. There is no need to retrace the intermediate macrostates in reverse order. In fact, rather than ‘reversibility’ it may be more descriptive to refer to ‘recoverability’. Typically a state is irrecoverable when there is friction, decay, or corruption of some kind.
Notice that when one talks about the “irreversibility” of the Second Law and about the “reversibility” of mechanics there is no inconsistency or contradiction: the former refers to equilibrium macrostates, the latter refers to microstates. The word ‘reversibility’ is being used with two entirely different meanings.
Classical thermodynamics assumes that isolated systems approach and eventually attain a state of equilibrium. By its very definition the state of equilibrium is such that, once attained, it will not spontaneously change in the future. On the other hand, it is understood that the system might have evolved from a non-equilibrium situation in the relatively recent past. Thus, classical thermodynamics introduces a time asymmetry: it treats the past and the future differently.
The situation with statistical mechanics, however, is different. Once equilibrium has been attained, fluctuations still happen. In fact, if we are willing to wait long enough we can be certain that large fluctuations will necessarily happen in the future just as they might have happened in the past. In principle the situation is symmetric. The interesting asymmetry arises when we realize that for an improbable state — a large fluctuation — to happen spontaneously in the future we may have to wait an extremely long time, while we are perfectly willing to entertain the possibility that a similarly improbable state — a non-equilibrium state — might have been observed in the very recent past. This can seem paradoxical because the formalisms of mechanics and of statistical mechanics do not introduce any time asymmetry. The solution to puzzles of this kind hinges on realizing that if the system was in a highly improbable state in the recent past, then it is most likely that the state did not arise spontaneously but was brought about by some external intervention. The system might, for example, have been deliberately prepared in some unusual state by applying appropriate constraints which were subsequently removed — this is not uncommon; we do it all the time. Thus, the time asymmetry is not introduced by the laws of mechanics. It is introduced through the asymmetry between our information about external interventions in the past versus our information about spontaneous processes in the future.
We can pursue this matter further. It is not unusual to hear that the arrow of time is defined by the Second Law. The claim is that since the laws of dynamics are invariant under time reversal,9 in order to distinguish the past from the future we need a criterion from outside mechanics, and the Second Law is proposed as a potential candidate. One objection to this proposal is that the thermodynamic entropy applies only to equilibrium states, which is too limited as it excludes most irreversible non-equilibrium processes of interest. Another is that if the future is defined as the direction in which entropy increases, then it is impossible for entropy not to increase with time. In other words, either the Second Law is a tautology or one needs to seek elsewhere for an arrow of time. Which raises the question: does Jaynes’ derivation lead to a tautological Second Law, or was an arrow of time effectively introduced somewhere else? The answer is that a non-thermodynamic arrow was indeed introduced, so that the Second Law is not tautological.
Footnote 9: The violations of time reversal symmetry that are found in particle physics are not relevant to the issues discussed here.
One might wonder whether the arrow was introduced by the mere action of asking a question that was itself already asymmetrical. Well, yes: the system starts in an initial equilibrium state, a constraint is removed, and we asked to which final equilibrium the system spontaneously evolves. The conclusion was that entropy increased into the future. As said earlier, the asymmetry consists of an external intervention in the past — the removal of a constraint — and spontaneous evolution into the future. A more symmetrical question would reflect spontaneous evolution both into the future and from the past. Suppose we ask the reverse question: given the final equilibrium state at time t′, which initial equilibrium state at time t < t′ did it come from? The answer is that once an equilibrium has been reached at time t′ then, by the very definition of equilibrium, the spontaneous evolution into the future t′′ > t′ will maintain the same equilibrium state. And vice versa: to the extent that the system has evolved spontaneously — that is, to the extent that there are no external interventions — the time reversibility of the Hamiltonian dynamics leads us to conclude that if the system is in equilibrium at t′′, then it must have been in equilibrium at all previous times t′. In other words, if the equilibrium at time t′ is defined by variables X(t′), then the spontaneous evolution both into the future t′′ > t′ and from the past t < t′ leads to the same equilibrium state, X(t) = X(t′) = X(t′′).

Figure 5.2: The reproducibility arrow of time leads to the Second Law: we can guarantee that the system will reproducibly evolve to R′(t′) by controlling the initial microstate to be in region R_a(t).
But, of course, our interest lies precisely in the effect of external interventions. A reproducible experiment that starts in equilibrium and ends in the equilibrium state of region R′(t′) defined by X(t′) is shown in Figure 5.2. We can guarantee that the initial microstate will end somewhere in R′(t′) by controlling the initial equilibrium macrostate to have values X_a(t) that define a region such as R_a(t). Then we have an external intervention: a constraint is removed and the system is allowed to evolve spontaneously. The initial region R_a(t) will necessarily have a volume very much smaller than that of R′(t′). Indeed, the region R_a(t) would be highly atypical within R′(t′). The entropy of the initial region R_a(t) is lower than that of the final region R′(t′), which leads to the Second Law once again.
Figure 5.2 also shows that there are many other initial equilibrium macrostates, such as R_b(t) defined by values X_b(t), that lead to the same final equilibrium macrostate. For example, if the system is a gas in a box of given volume and the final state is equilibrium, the gas might have initially been in equilibrium confined by a partition to the left half of the box, or the upper half, or the right third, or any of many other such constrained states.
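As a minimal numerical illustration of this example, the sketch below computes the entropy increase for an ideal gas that expands freely from each of these initial confinements, using the standard ideal-gas result ΔS = N k ln(V_final/V_initial) at fixed energy and particle number; the mole of particles and the confinement fractions are our own choices, not values from the text.

    # Entropy increase for free expansion of an ideal gas from a constrained
    # initial macrostate to the full box: Delta S = N k ln(V_final / V_initial).
    import math

    k_B = 1.380649e-23        # J/K
    N = 6.022e23              # one mole of particles (our choice)

    for fraction, label in [(1/2, "left half"), (1/2, "upper half"), (1/3, "right third")]:
        delta_S = N * k_B * math.log(1.0 / fraction)
        print(f"initially confined to the {label}: Delta S = {delta_S:.1f} J/K")

    # Very different initial constrained macrostates lead to the same final
    # equilibrium macrostate; each has a lower initial entropy.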
Finally, we note that introducing the notion of reproducibility is something that goes beyond the laws of mechanics. Reproducibility refers to our capability to control the initial microstate by deliberately manipulating the values X_a(t) in order to reproduce the later values X(t′). The relation is that of a cause X_a(t) leading to an effect X(t′). To the extent that causes are supposed to precede their effects we conclude that the reproducibility arrow of time is the causal arrow of time.
Our goal has been to derive the Second Law, which requires an arrow of time but, at this point, the origin of the latter remains unexplained. In chapter 11 we will revisit this problem within the context of a dynamics conceived as an application of entropic inference. There we will find that the time associated with such an “entropic dynamics” is intrinsically endowed with the directionality required for the Second Law.

Overview
It may be useful to collect the main arguments of the previous three sections in a more condensed form along with a short preview of things to come later in chapter 11. To understand what is meant by the Second Law — roughly, that entropy increases as time increases — one must specify which entropy one is talking about, and one must specify an arrow of time so that it is clear what is meant by ‘time increases’.
The entropy in the Second Law of Thermodynamics is the thermodynamic entropy of Clausius, eq.(5.51), which is only defined for equilibrium states. One can invent all sorts of other entropies, which might increase or not. One might, for example, define an entropy associated with a microstate. The increase of such an entropy could be due to some form of coarse graining, or could be induced by unknown external perturbations. Or we could have the entropy of a probability distribution that increases in a process of diffusion. Or the entropy of the distribution |ψ|² as it might arise in quantum mechanics, which increases just as often as it decreases. Or any of many other possibilities. But none of these refer to the Second Law, nor do they violate it.
Then there is the question of the arrow of time: either it is defined by the Second Law or it is not. If it is, then the Second Law is a tautology. While this is a logical possibility one can offer a pragmatic objection: an arrow of time linked to the entropy of equilibrium states is too limited to be useful — thermal equilibrium is too rare, too local, too accidental a phenomenon in our universe. It is more fruitful to pursue the consequences of an arrow of time that originates through some other mechanism.
Entropic dynamics (ED) offers a plausible mechanism in the spirit of inference and information (see chapter 11). In the ED framework, time turns out to be intrinsically endowed with directionality; ultimately this is what sets the direction of causality. The arrow of such an “entropic time” is linked to an entropy but not to thermodynamics; it is linked to the dynamics of probabilities. The causal arrow of time is the arrow of entropic time.
Thus, the validity of the Second Law rests on three elements: (1) the thermodynamic entropy given by the maximized Gibbs entropy; (2) the existence of an arrow of time; and (3) the existence of a time-reversible dynamical law that involves no loss of information. The inference/information approach to physics contributes to explaining all three of these elements.

5.10 Avoiding pitfalls – II: is this a 2nd law?


Canonical distributions have many interesting properties that can be extremely
useful but can also be extremely misleading. Here is an example.
We adopt the notation of Section 5.1 and consider the canonical distribution

    f_s(z) = \frac{1}{Z_s}\, e^{-\lambda_k x^k(z)} .   (5.103)

The λ_k are Lagrange multipliers chosen to enforce the expected value constraints

    \langle x^k \rangle = \int dz\, f_s(z)\, x^k(z) = X_s^k ,   (5.104)

where the x^k(z) are some functions of the microstate z = {z^α} (α = 1 … 6N) and the partition function is

    Z_s = \int dz\, e^{-\lambda_k x^k(z)} = e^{-\phi_s} ,   (5.105)

where the potential φ_s is a Massieu function — a Legendre transform of the entropy that is somewhat analogous to a free energy. Indeed, the entropy of f_s is

    S[f_s] = \lambda_k X_s^k - \phi_s .   (5.106)
Here is a theorem:
Theorem: The distribution f_s(z), eq.(5.103), is a stationary solution of the Fokker-Planck equation

    \partial_t f = \partial^\alpha [\, f\, \partial_\alpha (\lambda_k x^k)\,] + \partial^\alpha \partial_\alpha f ,   (5.107)

where \partial_t = \partial/\partial t, \partial_\alpha = \partial/\partial z^\alpha, and indices are raised with \delta^{\alpha\beta}, \partial^\alpha = \delta^{\alpha\beta}\partial_\beta.
Proof: Rewrite eq.(5.107) as a continuity equation,

    \partial_t f = -\partial_\alpha (f v^\alpha) ,   (5.108)

according to which the probability flows with a velocity v^\alpha given by

    v^\alpha(z) = -\partial^\alpha [\, \lambda_k x^k(z) + \log f(z)\,] .   (5.109)

From (5.103) and (5.105), for f = f_s we have

    \lambda_k x^k(z) + \log f_s(z) = \phi_s = \text{const} ,   (5.110)

and the corresponding current velocity, v^\alpha = -\partial^\alpha \phi_s, vanishes.


Definition: In analogy to (5.106), for any arbitrary distribution f(z) we can define a potential

    \phi[f] = \lambda_k X^k[f] - S[f] ,   (5.111)

where

    X^k[f] = \int dz\, f(z)\, x^k(z) \quad\text{and}\quad S[f] = -\int dz\, f(z) \log f(z) .   (5.112)

When the X^k are conserved quantities the potential φ[f] can be interpreted as follows. Assume that the system is in contact with reservoirs described by the Lagrange multipliers λ_k. From (5.111) a small change is given by

    \delta\phi = \lambda_k\, \delta X^k - \delta S ,   (5.113)

where the δX^k represent the amounts of X^k withdrawn from the reservoirs and supplied to the system. The corresponding change in the thermodynamic entropy of the reservoirs is given by eq.(4.94),

    \delta S_{\rm res} = -\lambda_k\, \delta X^k ,   (5.114)

so that

    \delta\phi = -(\delta S_{\rm res} + \delta S) .   (5.115)

Therefore -\delta\phi represents the increase of the entropy of the combined system plus reservoirs.
Here is another theorem:
Theorem: If f_t(z) is a solution of the Fokker-Planck equation (5.107), then the potential φ[f_t] is a decreasing function of t.
Proof: From (5.111),

    \frac{d\phi}{dt} = \int dz\, \left( \lambda_k x^k + \log f + 1 \right) \partial_t f .   (5.116)

Next, use (5.108), (5.109), and integrate by parts,

    \frac{d\phi}{dt} = \int dz\, f\, v^\alpha\, \partial_\alpha\!\left( \lambda_k x^k + \log f \right) = -\int dz\, f\, v_\alpha v^\alpha \le 0 ,   (5.117)

which concludes the proof.
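The two theorems are easy to check numerically. The sketch below does so for the simplest case, a single constraint x(z) = z²/2 in one dimension with λ = 1, so that f_s is a Gaussian; it integrates eq.(5.107) in its continuity form and monitors the potential φ[f] of eq.(5.111). The grid, time step, initial distribution, and all names are our own choices, not part of the text.

    # Numerical check: evolve the 1-d Fokker-Planck equation (5.107) for
    # x(z) = z^2/2, lambda = 1, and watch phi[f] decrease towards phi_s.
    import numpy as np

    lam = 1.0
    z = np.linspace(-6.0, 6.0, 241)
    dz = z[1] - z[0]
    dt = 0.2 * dz**2                          # small enough for this explicit scheme

    f = np.exp(-(z - 2.0)**2 / (2 * 0.5**2))  # arbitrary initial distribution
    f /= f.sum() * dz

    def phi(f):                               # phi[f] = lam*X[f] - S[f], eq.(5.111)
        X = (f * z**2 / 2).sum() * dz
        S = -(f * np.log(np.clip(f, 1e-300, None))).sum() * dz
        return lam * X - S

    drift = lam * z                           # = d(lam * x)/dz for x(z) = z^2/2
    phis = []
    for step in range(20001):
        if step % 2000 == 0:
            phis.append(round(phi(f), 4))
        # flux J = f*drift + df/dz at the cell faces; no flux through the ends
        J = 0.5 * (f[1:] + f[:-1]) * 0.5 * (drift[1:] + drift[:-1]) + (f[1:] - f[:-1]) / dz
        f[1:-1] += dt * (J[1:] - J[:-1]) / dz

    print(phis)                               # monotonically decreasing, as proved above
    f_s = np.exp(-lam * z**2 / 2)
    f_s /= f_s.sum() * dz
    print(round(phi(f_s), 4))                 # phi_s, the stationary (minimum) value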


Discussion: We have just shown that starting from any arbitrary initial distribution f_0 at time t_0, the solutions f_t of (5.107) evolve irreversibly towards a final state of equilibrium f_s. Furthermore, there is a potential φ that monotonically decreases towards its minimum value φ_s and that can (sometimes) be interpreted in terms of the monotonic increase of the total entropy of system plus reservoirs.
What are we to make of all this? We appear to have a derivation of the Second Law with the added advantage of a dynamical equation that describes not only the approach to equilibrium but also the evolution of distributions arbitrarily far from equilibrium. If anything, this is too good to be true. Where is the mistake?
The mistake is not to be found in the mathematics, which is rather straightforward; the theorems are indeed true. The problem lies in the physics. To see this, suppose we deal with a single system in thermal contact with a heat bath at temperature T. What is highly suspicious is that the process of thermalization described by eq.(5.107) appears to be of universal validity. It depends only on the bath temperature and is independent of all sorts of details about the thermal contact. This is blatantly wrong physics: surely the thermal conductivity of the walls that separate the system from the bath — whether the conductivity is high or low, whether it is uniform or not — must be relevant.
Here is another problem with the physics: the parameter that describes the evolution of f_t has been called t and this might mislead us into thinking that t has something to do with time. But this need not be so. In order for t to deserve being called time — and for the Fokker-Planck equation to qualify as a true dynamical equation, even if only an effective or phenomenological one — one must establish how the evolution parameter is related to properly calibrated clocks. As it is, eq.(5.107) is not a dynamical equation. It is just a cleverly constructed equation for those curves in the space of distributions that have the peculiar property of admitting canonical distributions as stationary states.

5.11 Entropies, descriptions and the Gibbs paradox
Under the generic title of “Gibbs Paradox” one usually considers a number of related questions in both phenomenological thermodynamics and in statistical mechanics: (1) The entropy change when two distinct gases are mixed happens to be independent of the nature of the gases. Is this in conflict with the idea that in the limit as the two gases become identical the entropy change should vanish? (2) Should the thermodynamic entropy of Clausius be an extensive quantity or not? (3) Should two microstates that differ only in the exchange of identical particles be counted as two microstates or just one?
The conventional wisdom asserts that the resolution of the paradox rests on quantum mechanics, but this analysis is unsatisfactory; at best it is incomplete. While it is true that the exchange of identical quantum particles does not lead to a new microstate, this approach ignores the case of classical, and even non-identical, particles. For example, nanoparticles in a colloidal suspension or macromolecules in solution are both classical and non-identical. Several authors (e.g., [Grad 1961, 1967; Jaynes 1992]) have recognized that quantum theory has no bearing on the matter; indeed, as remarked in section 3.5, this was already clear to Gibbs.
Our purpose here is to discuss the Gibbs paradox from the point of view of information theory. The discussion follows [Tseng Caticha 2001]. Our conclusion will be that the paradox is resolved once it is realized that there is no such thing as the entropy of a system; there are many entropies. The choice of entropy is a choice between a description that treats particles as being distinguishable and a description that treats them as indistinguishable; which of these alternatives is more convenient depends on the resolution of the particular experiment being performed.
The “grouping” property of entropy, eq.(4.3),

    S[p] = S_G[P] + \sum_g P_g\, S_g[p_{i|g}] ,

plays an important role in our discussion. It establishes a relation between several different descriptions and refers to three different entropies. One can describe the system with high resolution as being in a microstate i (with probability p_i), or alternatively, with lower resolution as being in one of the groups g (with probability P_g). Since the description in terms of the groups g is less detailed we might refer to them as ‘mesostates’. A thermodynamic description, on the other hand, corresponds to an even lower resolution that merely specifies the equilibrium macrostate. For simplicity, we will define the macrostate with a single variable, the energy. Including additional variables is easy and does not modify the gist of the argument.
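Since the grouping property carries much of the weight in what follows, here is a minimal numerical check with a made-up distribution over six microstates collected into two groups; the numbers are chosen only for illustration.

    # Check of the grouping property S[p] = S_G[P] + sum_g P_g S_g[p_{i|g}]
    import numpy as np

    p = np.array([0.10, 0.15, 0.05, 0.30, 0.25, 0.15])   # p_i over six microstates
    groups = [[0, 1, 2], [3, 4, 5]]                       # two mesostates g

    def H(q):                                             # Shannon entropy
        q = np.asarray(q)
        return -np.sum(q * np.log(q))

    P = np.array([p[g].sum() for g in groups])            # P_g
    S_within = sum(Pg * H(p[g] / Pg) for Pg, g in zip(P, groups))
    print(H(p), H(P) + S_within)                          # the two sides agree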
The standard connection between the thermodynamic description in terms of macrostates and the description in terms of microstates is established in section 5.4. If the energy of microstate a is ε_a, to the macrostate of energy E = ⟨ε⟩ we associate the canonical distribution (5.38),

    p_a = \frac{e^{-\beta \varepsilon_a}}{Z_H} ,   (5.118)

where the partition function Z_H and the Lagrange multiplier β are determined from eqs.(5.39),

    Z_H = \sum_a e^{-\beta \varepsilon_a} \quad\text{and}\quad -\frac{\partial \log Z_H}{\partial \beta} = E .   (5.119)

The corresponding entropy, eq.(5.40), is (setting k = 1)

    S_H = \beta E + \log Z_H ;   (5.120)

it measures the amount of information required to specify the microstate when all we know is the value E.
Identical particles
Before we compute and interpret the probability distribution over mesostates and its corresponding entropy we must be more specific about which mesostates we are talking about. Consider a system of N classical particles that are exactly identical. The interesting question is whether these identical particles are also “distinguishable”. By this we mean the following: we look at two particles now and we label them. We look at the particles later. Somebody might have switched them. Can we tell which particle is which? The answer is: it depends. Whether we can distinguish identical particles or not depends on whether we were able and willing to follow their trajectories.
A slightly different version of the same question concerns an N-particle system in a certain state. Some particles are permuted. Does this give us a different state? As discussed earlier, the answer to this question requires a careful specification of what we mean by a state.
Since by a microstate we mean a point in the N-particle phase space, a permutation does indeed lead to a new microstate. On the other hand, our concern with particle exchanges suggests that it is useful to introduce the notion of a mesostate, defined as the group of those N! microstates that are obtained by particle permutations. With this definition it is clear that a permutation of the identical particles does not lead to a new mesostate.
Now we can return to discussing the connection between the thermodynamic macrostate description and the description in terms of mesostates using, as before, the method of Maximum Entropy. Since the particles are (sufficiently) identical, all those N! microstates i within the same mesostate g have the same energy, which we will denote by E_g (i.e., E_i = E_g for all i ∈ g). To the macrostate of energy E = ⟨E⟩ we associate the canonical distribution,

    P_g = \frac{e^{-\beta E_g}}{Z_L} ,   (5.121)

where

    Z_L = \sum_g e^{-\beta E_g} \quad\text{and}\quad -\frac{\partial \log Z_L}{\partial \beta} = E .   (5.122)

The corresponding entropy, eq.(5.40), is (setting k = 1)

    S_L = \beta E + \log Z_L ;   (5.123)

it measures the amount of information required to specify the mesostate when all we know is E.
Two different entropies S_H and S_L have been assigned to the same macrostate E; they measure the different amounts of additional information required to specify the state of the system to a high resolution (the microstate) or to a low resolution (the mesostate).
The relation between Z_H and Z_L is obtained from

    Z_H = \sum_i e^{-\beta E_i} = N! \sum_g e^{-\beta E_g} = N!\, Z_L \quad\text{or}\quad Z_L = \frac{Z_H}{N!} .   (5.124)
The relation between S_H and S_L is obtained from the “grouping” property, eq.(4.3), with S = S_H, S_G = S_L, and p_{i|g} = 1/N!. The result is

    S_L = S_H - \log N! .   (5.125)

Incidentally, note that

    S_H = -\sum_a p_a \log p_a = -\sum_g P_g \log\frac{P_g}{N!} .   (5.126)

Equations (5.124) and (5.125) both exhibit the Gibbs N! “corrections.” Our analysis shows (1) that the justification of the N! factor is not to be found in quantum mechanics, and (2) that the N! does not correct anything. The N! is not a fudge factor that fixes a wrong (possibly non-extensive) entropy S_H into a correct (possibly extensive) entropy S_L. Both entropies S_H and S_L are correct. They differ because they measure different things: one measures the information to specify the microstate, the other measures the information to specify the mesostate.
An important goal of statistical mechanics is to provide a justification, an explanation of thermodynamics. Thus, we still need to ask which of the two statistical entropies, S_H or S_L, should be identified with the thermodynamic entropy of Clausius, S_T. Inspection of eqs.(5.124) and (5.125) shows that, as long as one is not concerned with experiments that involve changes in the number of particles, the same thermodynamics will follow whether we set S_H = S_T or S_L = S_T.
But, of course, experiments involving changes in N are very important (for example, in the equilibrium between different phases, or in chemical reactions). Since in the usual thermodynamic experiments we only care that some number of particles has been exchanged, and we do not care which were the actual particles exchanged, we expect that the correct identification is S_L = S_T. Indeed, the quantity that regulates the equilibrium under exchanges of particles is the chemical potential, defined by

    \mu = -kT \left( \frac{\partial S_T}{\partial N} \right)_{E,V,\ldots} .   (5.127)

The two identifications S_H = S_T or S_L = S_T lead to two different chemical potentials, related by

    \mu_L = \mu_H + kT \log N .   (5.128)

It is easy to verify that, under the usual circumstances where surface effects can be neglected relative to the bulk, μ_L has the correct functional dependence on N: it is intensive and can be identified with the thermodynamic μ. On the other hand, μ_H is not an intensive quantity and cannot therefore be identified with μ.
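These claims are easy to check for the classical ideal gas. The sketch below uses the explicit expression S_H = N log V + (3N/2) log(4πE/3N) + 3N/2 (in units with m = h = k = 1), which follows from the standard ideal-gas partition function and is our own illustration rather than a formula quoted in the text; it verifies that S_L is extensive while S_H is not, that μ_L is intensive while μ_H is not, and that the two chemical potentials differ by kT log N (Stirling), consistent with eq.(5.128).

    # Ideal-gas check of extensivity and of the chemical potentials (our own sketch)
    import math

    def S_H(N, V, E):
        return N * math.log(V) + 1.5 * N * math.log(4 * math.pi * E / (3 * N)) + 1.5 * N

    def S_L(N, V, E):
        return S_H(N, V, E) - (N * math.log(N) - N)   # Stirling: log N! ~ N log N - N

    def mu_over_kT(S, N, V, E, dN=1e-4):
        # eq.(5.127): mu = -kT (dS/dN)_{E,V}; central difference, returns mu/kT
        return -(S(N + dN, V, E) - S(N - dN, V, E)) / (2 * dN)

    N, V, E = 1e4, 1e4, 1e4
    print(S_H(2*N, 2*V, 2*E) - 2 * S_H(N, V, E))    # nonzero: S_H is not extensive
    print(S_L(2*N, 2*V, 2*E) - 2 * S_L(N, V, E))    # ~0:      S_L is extensive
    print(mu_over_kT(S_H, N, V, E), mu_over_kT(S_H, 2*N, 2*V, 2*E))   # mu_H not intensive
    print(mu_over_kT(S_L, N, V, E), mu_over_kT(S_L, 2*N, 2*V, 2*E))   # mu_L is intensive
    print(mu_over_kT(S_L, N, V, E) - mu_over_kT(S_H, N, V, E), math.log(N))  # = log N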

Non-identical particles
We saw that classical identical particles can be treated, depending on the resolution of the experiment, as being distinguishable or indistinguishable. Here we go further and point out that even non-identical particles can be treated as indistinguishable. Our goal is to state explicitly in precisely what sense it is up to the observer to decide whether particles are distinguishable or not.
We defined a mesostate as a subset of N! microstates that are obtained as permutations of each other. With this definition it is clear that a permutation of particles does not lead to a new mesostate even if the exchanged particles are not identical. This is an important extension because, unlike quantum particles, classical particles cannot be expected to be exactly identical down to every minute detail. In fact in many cases the particles can be grossly different – examples might be colloidal suspensions or solutions of organic macromolecules. A high resolution device, for example an electron microscope, would reveal that no two colloidal particles or two macromolecules are exactly alike. And yet, for the purpose of modelling most of our macroscopic observations it is not necessary to take account of the myriad ways in which two particles can differ.
Consider a system of N particles. We can perform rather crude macroscopic experiments the results of which can be summarized with a simple phenomenological thermodynamics where N is one of the relevant variables that define the macrostate. Our goal is to construct a statistical foundation that will explain this macroscopic model, reduce it, so to speak, to “first principles.” The particles might ultimately be non-identical, but the crude phenomenology is not sensitive to their differences and can be explained by postulating mesostates g and microstates i with energies E_i = E_g, for all i ∈ g, as if the particles were identical. As in the previous section this statistical model gives

    Z_L = \frac{Z_H}{N!} \quad\text{with}\quad Z_H = \sum_i e^{-\beta E_i} ,   (5.129)

and the connection to the thermodynamics is established by postulating

    S_T = S_L = S_H - \log N! .   (5.130)

Next we consider what happens when more sophisticated experiments are performed. The examples traditionally offered in discussions of this sort refer to the new experiments that could be made possible by the discovery of membranes that are permeable to some of the N particles but not to the others. Other, perhaps historically more realistic examples are afforded by the availability of new experimental data, for example, more precise measurements of a heat capacity as a function of temperature, or perhaps measurements in a range of temperatures that had previously been inaccessible.
Suppose the new phenomenology can be modelled by postulating the existence of two kinds of particles. (Experiments that are even more sophisticated might allow us to detect three or more kinds, perhaps even a continuum of different particles.) What we previously thought were N identical particles we will now think of as being N_a particles of type a and N_b particles of type b. The new description is in terms of macrostates defined by N_a and N_b as the relevant variables.
To construct a statistical explanation of the new phenomenology from ‘first principles’ we need to revise our notion of mesostate. Each new mesostate will be a group of microstates which will include all those microstates obtained by permuting the a particles among themselves, and by permuting the b particles among themselves, but will not include those microstates obtained by permuting a particles with b particles. The new mesostates, which we will label ĝ and to which we will assign energy E_ĝ, will be composed of N_a!N_b! microstates î, each with a well defined energy E_î = E_ĝ, for all î ∈ ĝ. The new statistical model gives

    \hat{Z}_L = \frac{\hat{Z}_H}{N_a! N_b!} \quad\text{with}\quad \hat{Z}_H = \sum_{\hat{\imath}} e^{-\beta E_{\hat{\imath}}} ,   (5.131)

and the connection to the new phenomenology is established by postulating

    \hat{S}_T = \hat{S}_L = \hat{S}_H - \log (N_a! N_b!) .   (5.132)

In discussions of this topic it is not unusual to find comments to the effect that in the limit as particles a and b become identical one expects that the entropy of the system with two kinds of particles tends to the entropy of a system with just one kind of particle. The fact that this expectation is not met is one manifestation of the Gibbs paradox.
From the information theory point of view the paradox does not arise because there is no such thing as the entropy of the system; there are several entropies. It is true that as a → b we will have Ẑ_H → Z_H and accordingly Ŝ_H → S_H, but there is no reason to expect a similar relation between Ŝ_L and S_L because these two entropies refer to mesostates ĝ and g that remain different even as a and b become identical. In this limit the mesostates ĝ, which are useful for descriptions that treat particles a and b as indistinguishable among themselves but distinguishable from each other, lose their usefulness.
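To make the limit a → b quantitative: using Ŝ_L = Ŝ_H − log(N_a!N_b!) and S_L = S_H − log N!, and Ŝ_H → S_H in the limit, the residual difference Ŝ_L − S_L tends to log N! − log N_a! − log N_b!, which is the familiar entropy of mixing and is manifestly independent of the nature of the particles. A quick numerical check (the particle numbers are our own illustration):

    # Residual gap between the two low-resolution entropies in the limit a -> b
    from math import lgamma, log

    def log_fact(n):                      # log n! = lgamma(n+1)
        return lgamma(n + 1)

    N_a, N_b = 3_000, 7_000
    N = N_a + N_b
    residual = log_fact(N) - log_fact(N_a) - log_fact(N_b)
    x_a, x_b = N_a / N, N_b / N
    mixing = -N * (x_a * log(x_a) + x_b * log(x_b))
    print(residual, mixing)               # nearly equal; the small gap is the
                                          # sub-extensive term neglected by Stirling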

Conclusion
The Gibbs paradox in its various forms arises from the widespread misconception that entropy is a real physical quantity and that one is justified in talking about the entropy of the system. The thermodynamic entropy is not a property of the system. Entropy is a property of our description of the system; it is a property of the macrostate. More explicitly, it is a function of the macroscopic variables used to define the macrostate. To different macrostates reflecting different choices of variables there correspond different entropies for the very same system.
But this is not the complete story: entropy is not just a function of the macrostate. Entropies reflect a relation between two descriptions of the same system: one description is the macrostate, the other is the set of microstates, or the set of mesostates, as the case might be. Then, having specified the macrostate, an entropy can be interpreted as the amount of additional information required to specify the microstate or mesostate. We have found the ‘grouping’ property very valuable precisely because it emphasizes the dependence of entropy on the choice of micro- or mesostates.
Chapter 6

Entropy III: Updating Probabilities

Inductive inference is a framework for reasoning with incomplete information, for coping with uncertainty. The framework must include a means to represent a state of partial knowledge — this is handled through the introduction of probabilities — and it must allow us to change from one state of partial knowledge to another when new information becomes available. Indeed, any inductive method that recognizes that a situation of incomplete information is in some way unfortunate — by which we mean that it constitutes a problem in need of a solution — would be severely deficient if it failed to address the question of how to proceed in those fortunate circumstances when new information becomes available. The theory of probability, if it is to be useful at all, demands a method for updating probabilities.
The challenge is to develop updating methods that are systematic, objective, and practical. In Chapter 2 we saw that Bayes’ rule is the natural way to update when the information consists of data and a likelihood function. We also saw that Bayes’ rule could not be derived just from the requirements of consistency implicit in the sum and product rules of probability theory. An additional principle of parsimony — the Principle of Minimal Updating (PMU) — was necessary: Whatever was learned in the past is valuable and should not be disregarded; beliefs ought to be revised but only to the minimal extent required by the new data. A few interesting questions were just barely hinted at: How do we update when the information is not in the form of data? If the information is not data, what else could it possibly be? Indeed what, after all, is ‘information’?
Then in Chapter 4 we saw that the method of maximum entropy, MaxEnt, allowed one to deal with information in the form of constraints on the allowed probability distributions. So here we have a partial answer to one of our questions: in addition to data, information can also take the form of constraints. However, MaxEnt was not designed as a method for updating; it is a method for assigning probabilities on the basis of the constraint information, but it does not allow us to take into account the information contained in generic prior distributions.
Thus, Bayes’ rule allows information contained in arbitrary priors and in data, but not in arbitrary constraints,1 while on the other hand, MaxEnt can handle arbitrary constraints but not arbitrary priors. In this chapter we bring those two methods together: by generalizing the PMU we show how the MaxEnt method can be extended beyond its original scope, as a rule to assign probabilities, to a full-fledged method for inductive inference, that is, a method for updating from arbitrary priors given information in the form of arbitrary constraints. It should not be too surprising that the extended Maximum Entropy method — which we will henceforth abbreviate as ME, and also refer to as ‘entropic inference’ or ‘entropic updating’ — includes both MaxEnt and Bayes’ rule as special cases.
Footnote 1: Bayes’ rule can handle constraints when they are expressed in the form of data that can be plugged into a likelihood function, but not all constraints are of this kind.
Historically the ME method is a direct descendant of MaxEnt. As we saw in chapter 4, in the MaxEnt framework entropy is interpreted through the Shannon axioms as a measure of the amount of information that is missing in a probability distribution. We discussed some limitations of this approach. The Shannon axioms refer to probabilities of discrete variables; for continuous variables the entropy is not defined. But a more serious objection was raised: even if we grant that the Shannon axioms do lead to a reasonable expression for the entropy, to what extent do we believe the axioms themselves? Shannon’s third axiom, the grouping property, is indeed sort of reasonable, but is it necessary? Is entropy the only consistent measure of uncertainty or of information? What is wrong with, say, the standard deviation? Indeed, there exist examples in which the Shannon entropy does not seem to reflect one’s intuitive notion of information [Uffink 1995]. One could introduce other entropies justified by different choices of axioms (see, for example, [Renyi 1961] and [Tsallis 1988]). Which one should we adopt? If different systems are to be handled using different Renyi entropies, how do we handle composite systems?
From our point of view the real limitation is that neither Shannon nor Jaynes were concerned with the problem of updating. Shannon was analyzing the capacity of communication channels and characterizing the diversity of the messages that could potentially be generated by a source (section 4.8). His entropy makes no reference to prior distributions. On the other hand, as we already mentioned, Jaynes conceived MaxEnt as a method to assign probabilities on the basis of constraint information and a fixed underlying measure, not an arbitrary prior. He never meant to update from one probability distribution to another.
Considerations such as these motivated several attempts to develop ME directly as a method for updating probabilities without invoking questionable measures of uncertainty [Shore and Johnson 1980; Skilling 1988-1990; Csiszar 1991, 2008; Caticha 2003, 2014a]. The important contribution by Shore and Johnson was the realization that one could axiomatize the updating method itself rather than the information measure. Their axioms are justified on the basis of a fundamental principle of consistency — if a problem can be solved in more than one way the results should agree — but the axioms themselves and other assumptions they make have raised some objections [Karbelkar 1986; Uffink 1995]. Despite such criticism Shore and Johnson’s pioneering papers have had an enormous influence: they identified the correct goal to be achieved.
The main goal of this chapter is to design a framework for updating — the method of entropic inference. The concept of relative entropy is introduced as a tool for reasoning — it is designed to perform a certain function defined through certain design criteria or specifications. There is no implication that the method is “true”, or that it succeeds because it achieves some special contact with reality. Instead the claim is that the method succeeds in the sense that it works as designed — and that this is satisfactory because it leads to empirically adequate models.2
Footnote 2: The presentation below is based on work presented in a sequence of papers [Caticha 2003; Caticha Giffin 2006; Caticha 2007, 2014a; Vanslette 2017] and in earlier versions of these lectures [Caticha 2008, 2012c].
As we argued earlier when developing the theory of degrees of belief, our general approach differs from the way in which many physical theories have been developed in the past. The more traditional approach consists of first setting up the mathematical formalism and then seeking an acceptable interpretation. The drawback of this procedure is that questions can always be raised about the uniqueness of the proposed interpretation, and about the criteria that make it acceptable or not.
In contrast, here we proceed in the opposite order: we first decide what we are talking about and what goal we want to achieve, and only then do we proceed to construct a suitable mathematical formalism designed with that specific goal in mind. The advantage is that issues of meaning and interpretation are resolved from the start. The preeminent example of this approach is Cox’s algebra of probable inference (see chapter 2), which clarified the meaning and use of the notion of probability: after Cox it is no longer possible to question whether degrees of belief can be interpreted as probabilities. Similarly, here the concept of entropy is introduced as a tool for reasoning without recourse to notions of heat, multiplicity of states, disorder, uncertainty, or even an amount of information. In this approach entropy needs no interpretation. We do not need to know what ‘entropy’ means; we only need to know how to use it. Incidentally, this may help explain why previous research failed to find an unobjectionably precise meaning for the concept of entropy — there is none to be found.
Since the PMU is the driving force behind both Bayesian and ME updating it is worthwhile to investigate the precise relation between the two. We will show that Bayes’ rule can be derived as a special case of the ME method. This important result was first obtained by Williams (see [Williams 1980; Diaconis 1982]) before the use of relative entropy as a tool for inference had been properly understood. It is not, therefore, surprising that Williams’ achievement did not receive the widespread appreciation it deserved. The virtue of the derivation presented here [Caticha Giffin 2006], which hinges on translating information in the form of data into a constraint that can be processed using ME, is that it is particularly clear. It throws light on Bayes’ rule and demonstrates its complete compatibility with ME updating. Thus, within the ME framework maximum entropy and Bayesian methods are unified into a single consistent theory of inference. One advantage of this insight is that it allows a number of generalizations of Bayes’ rule (see section 2.10.2). Another is that it has implications for physics: it provides an important missing piece for the old puzzles of quantum mechanics concerning the so-called collapse of the wave function and the measurement problem (see Chapter 11).
There is yet another function that the ME method must perform in order to fully qualify as a method of inductive inference. Once we have decided that the distribution of maximum entropy is to be preferred over all others the following question arises immediately: the maximum of the entropy functional is never infinitely sharp, so are we really confident that distributions that lie very close to the maximum are totally ruled out? We must find a quantitative way to assess the extent to which distributions with lower entropy are ruled out. This topic, which completes the formulation of the ME method, will be addressed in chapter 8.

6.1 What is information?


The term ‘information’ is used with a wide variety of different meanings [Cover Thomas 1991; Landauer 1991; Jaynes 2003; Caticha 2007, 2014a; Golan 2008, 2018; Floridi 2011]. There is the Shannon notion of information, a technical term that, as we have seen, is meant to measure an amount of information and is quite divorced from semantics. The goal of information theory, or better, communication theory, is to characterize the sources of information, to measure the capacity of the communication channels, and to learn how to control the degrading effects of noise. It is somewhat ironic but nevertheless true that this “information” theory is unconcerned with the central Bayesian issue of how a message affects the beliefs of an ideally rational agent.
There is also an algorithmic notion of information, which captures the notion of complexity [Cover and Thomas 1991] and originates in the work of [Solomonoff 1964], [Kolmogorov 1965] and [Chaitin 1975], and the related Minimum Description Length principle of [Rissanen 1978, 1986]. The algorithmic approach has been developed as an alternative approach to induction, learning, artificial intelligence, and as a general theory of knowledge — it has been suggested that data compression is one of the principles that governs human cognition. Despite their potential relevance to our subject, these algorithmic approaches will not be pursued here.
It is not unusual to hear that systems “carry” or “contain” information and that “information is physical”.3 This mode of expression can perhaps be traced to the origins of information theory in Shannon’s theory of communication. We say that we have received information when, among the vast variety of messages that could have been generated by a distant source, we discover which particular message was actually sent. It is thus that the message “carries” information. The analogy with physics is immediate: the set of all possible states of a physical system can be likened to the set of all possible messages, and the actual state of the system corresponds to the message that was actually sent. Thus, the system “conveys” a message: the system “carries” information about its own state. Sometimes the message might be difficult to read, but it is there nonetheless. This language — information is physical — useful as it has turned out to be, does not, however, exhaust the meaning of the word ‘information’.
Footnote 3: The general context is the thermodynamics of computation. See [Landauer 1991; Bennett 1982, 2003] and references therein. For a critical appraisal see [Norton 2011, 2013].
Here we will follow a different path. We seek an epistemic notion of information that is somewhat closer to the everyday colloquial use of the term — roughly, information is what I get when my question has been answered. Indeed, a fully Bayesian information theory requires an explicit account of the relation between information and the beliefs of ideally rational agents. Furthermore, implicit in the recognition that most of our beliefs are held on the basis of incomplete information is the idea that our beliefs would be better if only we had more information. Thus a theory of probability demands a theory for updating probabilities.
The desire and need to update our assessment of what beliefs we ought to hold is driven by the conviction that not all beliefs, not all probability assignments, are equally good. The concern with ‘good’ and ‘better’ bears on the issue of whether probabilities are subjective, objective, or somewhere in between. We argued earlier (in Chapter 1) that what makes one probability assignment better than another is that the adoption of better beliefs has real consequences: they provide better guidance about how to cope with the world, and in this pragmatic sense, they provide a better guide to the “truth”. Thus, objectivity is desirable; objectivity is the goal. Probabilities are useful to the extent that they incorporate some degree of epistemic objectivity.4 What we seek are updating mechanisms that allow us to process information and incorporate its objective features into our beliefs. Bayes’ rule behaves precisely in this way. We saw in section 2.10.3 that as more and more data are taken into account the original (possibly subjective) prior becomes less and less relevant, and all rational agents become more and more convinced of the same truth. This is crucial: were it not this way Bayesian reasoning would not be deemed acceptable.
Footnote 4: We recall from Section 1.1.3 that probabilities are ontologically subjective but epistemically they can span the range from being fully subjective to fully objective.
To set the stage for the discussion below consider some examples. Suppose a new piece of information is acquired. This could take a variety of forms. The typical example in data analysis would be something like: the prior probability of a certain proposition might have been q and after analyzing some data we feel rationally justified in asserting that a better assignment would be p. More explicitly, propositions such as “the value of the variable X lies between x − ε and x + ε” might initially have had probabilities that were broadly spread over the range of x; after a measurement is performed the new data might induce us to revise our beliefs to a distribution that favors values in a narrower, more localized region.
The typical example in statistical mechanics would run something like this: Total ignorance about the state of a system is expressed by a prior distribution that assigns equal probabilities to all microstates. The information that the system happens to be in thermal equilibrium induces us to update such beliefs to a probability distribution satisfying the constraint that the expected energy takes on a specific value, ⟨ε⟩ = E.
Here is another more generic example. Let’s say we have received a message — but the carrier of information could equally well have been input from our senses or data from an experiment. If the message agrees with our prior beliefs we can safely ignore it. The message is boring; it carries no news; literally, it carries no information. The interesting situation arises when the message surprises us; it is not what we expected. A message that disagrees with our prior beliefs presents us with a problem that demands a decision. If the source of the message is not deemed reliable then the contents of the message can be safely ignored — it carries no information; it is no different from noise. On the other hand, if the source of the message is deemed reliable then we have an opportunity to improve our beliefs — we ought to update our beliefs to agree with the message. Choosing between these two options requires a rational decision, a judgement. The message (or the sensation, or the data) becomes “information” precisely at that moment when, as a result of our evaluation, we feel compelled to revise our beliefs.
We are now ready to address the question: What, after all, is ‘information’? The answer is pragmatic:

Information is what information does.

Information is defined by its effects: (a) it induces us to update from prior beliefs to posterior beliefs, and (b) it restricts our options as to what we are honestly and rationally allowed to believe. This, I propose, is the defining characteristic of information.

Information is that which induces a change from one state of rational


belief to another state that is more appropriately constrained.

One signi…cant aspect of this notion is that for a rational agent, the identi…-
cation of what constitutes information — as opposed to mere noise — already
involves a judgement, an evaluation; it is a matter of facts and also a matter of
values. Furthermore, once a certain proposition has been identi…ed as informa-
tion, the revision of beliefs acquires a moral component; it is no longer optional:
it becomes a moral imperative.
Another aspect is that the notion that information is directly related to
changing our minds does not involve any talk about amounts of information.
Nevertheless it allows precise quantitative calculations. Indeed, constraints on
the acceptable posteriors are precisely the kind of information the method of
maximum entropy is designed to handle.
Figure 6.1: (a) In mechanics, force is defined as that which affects motion. (b) Inference is dynamics too: information is defined as that which affects rational beliefs.

The mathematical representation of information is in the form of constraints on posterior probability distributions. The constraints are the information.

Constraints can take a wide variety of forms including, in addition to the examples mentioned above, anything capable of affecting beliefs. For example, in Bayesian inference the likelihood function constitutes information because it contributes to constrain our posterior beliefs. And constraints need not be just in the form of expected values; they can specify the functional form of a distribution or be imposed through various geometrical relations. (See Chapters 8 and 11.)
and 11.)
Concerning the act of updating it may be worthwhile to point out an analogy with dynamics — the study of change. In Newtonian dynamics the state of motion of a system is described in terms of momentum — the “quantity” of motion — while the change from one state to another is explained in terms of an applied force or impulse. Similarly, in Bayesian inference a state of belief is described in terms of probabilities — a “degree” of belief — and the change from one state to another is due to information (see Fig. 6.1). Just as a force is that which induces a change from one state of motion to another, so information is that which induces a change from one state of belief to another. Updating is a form of dynamics. In Chapter 11 we will reverse the logic and derive dynamical laws of physics as examples of entropic updating of probabilities — an entropic dynamics.
What about prejudices and superstitions? What about divine revelations? Do they constitute information? Perhaps they lie outside our restriction to beliefs of ideally rational agents, but to the extent that their effects are indistinguishable from those of other sorts of information, namely, they affect beliefs, they should qualify as information too. Whether the sources of such information are reliable or not is quite another matter. False information is information too. In fact, even ideally rational agents can be affected by false information because the evaluation that assures them that the data was competently collected or that the message originated from a reliable source involves an act of judgement that is not completely infallible. Strictly, all those judgements, which constitute the first step of the inference process, are themselves the end result of other inference processes that are not immune from uncertainty.
What about limitations in our computational power? Such practical limitations are unavoidable and they do influence our inferences, so should they be considered information? No. Limited computational resources may affect the numerical approximation to the value of, say, an integral, but they do not affect the actual value of the integral. Similarly, limited computational resources may affect the approximate, imperfect reasoning of real humans and real computers but they do not affect the reasoning of those ideal rational agents that are the subject of our present concerns.

6.2 The design of entropic inference


Once we have decided, as a result of the confrontation of new information with old beliefs, that our beliefs require revision, the problem becomes one of deciding how precisely this ought to be done. First we identify some general features of the kind of belief revision that one might consider desirable, of the kind of belief revision that one might count as rational. Then we design a method, a systematic procedure, that implements those features. To the extent that the method performs as desired we can claim success. The point is not that success derives from our method having achieved some intimate connection to the inner wheels of reality; success just means that the method seems to be working. Whatever criteria of rationality we choose, they are meant to be only provisional — they are not immune from further change and improvement.
Typically the new information will not affect our beliefs in just one proposition — in which case the updating would be trivial. Tensions immediately arise because the beliefs in various propositions are not independent; they are interconnected by demands of consistency. Therefore the new information also affects our beliefs in all those “neighboring” propositions that are directly linked to it, and these in turn affect their neighbors, and so on. The effect can potentially spread over the whole network of beliefs; it is the whole web of beliefs that must be revised.
The one obvious requirement is that the updated beliefs ought to agree with the newly acquired information. Unfortunately, this requirement, while necessary, is not sufficiently restrictive: we can update in many ways that preserve both internal consistency and consistency with the new information. Additional criteria are needed. What rules would an ideally rational agent choose?

6.2.1 General criteria


The rules are motivated by the same pragmatic design criteria that motivate the design of probability theory itself — universality, consistency, and practical utility. But this is admittedly too vague; we must be very specific about the precise way in which they are implemented.

Universality
The goal is to design a method for induction, for reasoning when not much is known. In order for the method to perform its function — to be useful — we require that it be of universal applicability. Consider the alternative: we could design methods that are problem-specific, and employ different induction methods for different problems. Such a framework, unfortunately, would fail us precisely when we need it most, namely, in those situations where the information available is so incomplete that we do not know which method to employ.
We can argue this point somewhat differently. It is quite conceivable that different situations could require different problem-specific induction methods. What we want to design here is a general-purpose method that captures what all the other problem-specific methods have in common.

Parsimony
To specify the updating we adopt a very conservative criterion that recognizes
the value of information: what has been laboriously learned in the past is valu-
able and should not be disregarded unless rendered obsolete by new information.
The only aspects of one’s beliefs that should be updated are those for which new
evidence has been supplied. Thus we adopt a

Principle of Minimal Updating (PMU): Beliefs should be updated only to


the minimal extent required by the new information.

This version of the principle generalizes the earlier version presented in section
2.10.2 which was restricted to information in the form of data.
The special case of updating in the absence of new information deserves a comment. The PMU states that when there is no new information ideally rational agents should not change their minds.5 In fact, it is difficult to imagine any notion of rationality that would allow the possibility of changing one’s mind for no apparent reason. This is important and it is worthwhile to consider it from a different angle. Degrees of belief, probabilities, are said to be subjective: two different agents might not share the same beliefs and could conceivably assign probabilities differently. But subjectivity does not mean arbitrariness. It is not a blank check allowing rational agents to change their minds for no good reason. Valuable prior information should not be discarded unless it is absolutely necessary.

5 Our concern here is with ideally rational agents who have fully processed all information acquired in the past. Our subject is not the psychology of actual humans who often change their minds by processes that are not fully conscious.
Minimal updating offers yet another pragmatic advantage. As we shall see below, rather than identifying what features of a distribution are singled out for updating and then specifying the detailed nature of the update, we will adopt design criteria that stipulate what is not to be updated. The practical advantage of this approach is that it enhances objectivity — there are many ways to change something but only one way to keep it the same. The analogy with mechanics can be pursued further: if updating is a form of dynamics, then minimal updating is the analogue of inertia. Rationality and objectivity demand a considerable amount of inertia.

Independence
The next general requirement turns out to be crucially important: without it the very possibility of scientific theories would be compromised. The point is that every scientific model, whatever the topic, if it is to be useful at all, must assume that all relevant variables have been taken into account and that whatever was left out — the rest of the universe — should not matter. To put it another way: in order to do science we must be able to understand parts of the universe without having to understand the universe as a whole. Granted, it is not necessary that the understanding be complete and exact; it must be merely adequate for our purposes.
The assumption, then, is that it is possible to focus our attention on a suitably chosen system of interest and neglect the rest of the universe because they are “sufficiently independent.” Thus, in any form of science the notion of statistical independence must play a central and privileged role. This idea — that some things can be neglected, that not everything matters — is implemented by imposing a criterion that tells us how to handle independent systems. The requirement is quite natural: Whenever two systems are a priori believed to be independent and we receive information about one it should not matter if the other is included in the analysis or not. This amounts to requiring that independence be preserved unless information about correlations is explicitly introduced.
Again we emphasize: none of these criteria are imposed by Nature. They
are desirable for pragmatic reasons; they are imposed by design.

6.2.2 Entropy as a tool for updating probabilities


Consider a set of propositions {x} about which we are uncertain. The proposition x can be discrete or continuous, in one or in several dimensions. It could, for example, represent the microstate of a physical system, a point in phase space, or an appropriate set of quantum numbers. The uncertainty about x is described by a probability distribution q(x). Our goal is to update from the prior distribution q(x) to a posterior distribution p(x) when new information — by which we mean a set of constraints — becomes available. The question is:
which distribution among all those that are in principle acceptable — they all
satisfy the constraints — should we select?
Our goal is to design a method that allows a systematic search for the preferred posterior distribution. The central idea, first proposed in [Skilling 1988],6 is disarmingly simple: to select the posterior first rank all candidate distributions in increasing order of preference and then pick the distribution that ranks the highest. Irrespective of what it is that makes one distribution “preferable” over another (we will get to that soon enough) it is clear that any such ranking must be transitive: if distribution p_1 is preferred over distribution p_2, and p_2 is preferred over p_3, then p_1 is preferred over p_3. Transitive rankings are implemented by assigning to each p a real number S[p], which is called the entropy of p, in such a way that if p_1 is preferred over p_2, then S[p_1] > S[p_2]. The selected distribution (one or possibly many, for there may be several equally preferred distributions) is that which maximizes the entropy functional.
The importance of this strategy of ranking distributions cannot be overesti-
mated: it implies that the updating method will take the form of a variational
principle — the method of Maximum Entropy (ME) — and that the latter will
involve a certain functional — the entropy — that maps distributions to real
numbers and that is designed to be maximized. These features are not imposed
by Nature; they are all imposed by design. They are dictated by the func-
tion that the ME method is supposed to perform. (Thus, it makes no sense to
seek a generalization in which entropy is a complex number or a vector; such a
generalized entropy would just not perform the desired function.)
Next we specify the ranking scheme, that is, we choose a specific functional form for the entropy S[p]. Note that the purpose of the method is to update from priors to posteriors so the ranking scheme must depend on the particular prior q and therefore the entropy S must be a functional of both p and q. The entropy S[p, q] describes a ranking of the distributions p relative to the given prior q. S[p, q] is the entropy of p relative to q, and accordingly S[p, q] is commonly called relative entropy. This is appropriate and sometimes we will follow this practice. However, since all entropies are relative, even when relative to a uniform distribution, the qualifier ‘relative’ is redundant and can be dropped. This is somewhat analogous to the situation with energy: it is implicitly understood that all energies are relative to some reference frame or some origin of potential energy but there is no need to constantly refer to a ‘relative energy’ — it is just not done.
The functional S[p, q] is designed by a process of elimination — one might call it a process of eliminative induction. First we state the desired design criteria; this is the crucial step that defines what makes one distribution preferable over another. Candidate functionals that fail to satisfy the criteria are discarded — hence the qualifier ‘eliminative’. As we shall see the criteria adopted below are sufficiently constraining that there is a single entropy functional S[p, q] that survives the process of elimination.
6 [Skilling 1988] deals with the more general problem of ranking positive additive distribu-

tions which includes the intensity of images as well as probability distributions.



This approach has a number of virtues. First, to the extent that the design criteria are universally desirable, the single surviving entropy functional will be of universal applicability too. Second, the reason why alternative entropy candidates are eliminated is quite explicit — at least one of the design criteria is violated. Thus, the justification behind the single surviving entropy is not that it leads to demonstrably correct inferences, but rather, that all other candidates demonstrably fail to perform as desired.

6.2.3 Specific design criteria

Consider a lattice of propositions generated by a set X of atomic propositions (mutually exclusive and exhaustive propositions) labeled by a discrete index i = 1, 2, ..., n. The extension to infinite sets and to continuous labels will turn out to be straightforward. The index i might, for example, label the microstates of a physical system but, since the argument below is supposed to be of general validity, we shall not assume that the labels themselves carry any particular significance. We can always permute labels and this should have no effect on the updating of probabilities.
The structure of the lattice — its members are related to each other by disjunctions (or) and conjunctions (and) — is reflected in the consistency of the web of beliefs which is implemented through the sum and product rules. We adopt two design criteria one of which refers to propositions that are mutually exclusive and the other to propositions that are independent. These represent two extreme situations. At one end we have highly correlated propositions (if one proposition is true the other is false and vice versa); at the other end we have totally uncorrelated propositions (the truth or falsity of one proposition has no effect on the truth or falsity of the other). One relation is described by a simplified sum rule, p(i ∨ j) = p(i) + p(j), and the other by a simplified product rule, p(i ∧ j) = p(i)p(j).7
Two design criteria and their consequences for the functional form of the
entropy are given below. Detailed proofs are deferred to the next section.

Mutually exclusive subdomains


DC1 Probabilities that are conditioned on one subdomain are not affected by information about other non-overlapping subdomains.

Consider a subdomain D ⊂ X composed of atomic propositions i ∈ D and suppose the information to be processed refers to some other subdomain D′ ⊂ X that does not overlap with D, D ∩ D′ = ∅. In the absence of any new information about D the PMU demands we do not change our minds about probabilities that are conditional on D. Thus, we design the inference method so that q(i|D), the prior probability of i conditioned on i ∈ D, is not updated. The selected
7 For an alternative approach to the foundations of inference that exploits the various

symmetries of the lattice of propositions see [Knuth 2005, 2006; Knuth Skilling 2012].
conditional posterior is

P(i|D) = q(i|D) .   (6.1)

(We adopt the following notation: priors are denoted by q, candidate posteriors by lower case p, and the selected posterior by upper case P. We shall write either p(i) or p_i.)
We emphasize: the point is not that we make the unwarranted assumption that keeping q(i|D) unchanged is guaranteed to lead to correct inferences. It need not; induction is risky. The point is, rather, that in the absence of any evidence to the contrary there is no reason to change our minds and the prior information takes priority.
The consequence of DC1 is that non-overlapping domains of i contribute additively to the entropy,

S[p, q] = \sum_i F(p_i, q_i) ,   (6.2)

where F is some unknown function of two arguments. The proof is given in section 6.3.
Comment 1: It is essential that DC1 refers to conditional probabilities: local information about a domain D′ can have a non-local effect on the total probability of another domain D. An example may help to see why: Consider a loaded die with faces i = 1 ... 6. A priori we have no reason to favor any face, therefore q(i) = 1/6. Then we are told that the die is loaded in favor of 2. The criterion DC1 tells nothing about how to update the P(i)’s. If the die were very loaded in favor of 2, say, P(2) = 0.9, then it must be that P(i) < 1/6 for i ≠ 2 and therefore all P(i)’s must be updated. Let us continue with the example: suppose we are further told that the die is loaded so that p(2) = 2p(5). The criterion DC1 is meant to capture the fact that information about faces 2 and 5 does not change our preferences among the remaining four faces D = {1, 3, 4, 6}; the DC1 implies that the conditional probabilities remain unchanged, P(i|D) = q(i|D) = 1/4; it says nothing about whether P(i) for i ∈ D is less or more than 1/6.8
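A quick numerical illustration may help. The sketch below (Python; the numbers are only illustrative, and the entropy functional used is the one that will be derived below, eq.(6.12)) updates the uniform die prior subject to the constraint p(2) = 2p(5) and confirms that the conditional probabilities on D = {1, 3, 4, 6} remain equal to 1/4 even though each P(i) changes.

import numpy as np
from scipy.optimize import minimize

q = np.full(6, 1/6)                                 # uniform prior for the die

def neg_entropy(p):                                 # -S[p,q]
    return np.sum(p * np.log(p / q))

cons = [{'type': 'eq', 'fun': lambda p: p.sum() - 1},          # normalization
        {'type': 'eq', 'fun': lambda p: p[1] - 2 * p[4]}]      # p(2) = 2 p(5)

res = minimize(neg_entropy, q, constraints=cons,
               bounds=[(1e-9, 1)] * 6, method='SLSQP')
P = res.x
D = [0, 2, 3, 5]                                    # faces 1, 3, 4, 6 (zero-based)
print(P.round(4))                                   # each P(i) differs from 1/6
print((P[D] / P[D].sum()).round(4))                 # conditionals on D: 0.25 each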
Comment 2: An important special case is the “update” from a prior q(i) to a posterior P(i) in a situation in which no new information is available. The locality criterion DC1 applied to a situation where the subdomain D covers the whole space of i’s, D = X, requires that in the absence of any new information the prior conditional probabilities are not to be updated: P(i|X) = q(i|X) or P(i) = q(i).
Comment 3: The locality criterion DC1 includes Bayesian conditionalization as a special case. Indeed, if the information is given through the constraint p(D̃) = 0, where D̃ is the complement of D, then P(i|D) = q(i|D), which is referred to as Bayesian conditionalization. More explicitly, if θ is the variable to be inferred on the basis of information about a likelihood function q(i|θ) and observed data i′, then the update from the prior q to the posterior P,

q(i, θ) = q(i) q(θ|i)  →  P(i, θ) = P(i) P(θ|i) ,   (6.3)

consists of updating q(i) → P(i) = \delta_{i i'} to agree with the new information and invoking the PMU so that P(θ|i′) = q(θ|i′) remains unchanged. Therefore,

P(i, θ) = \delta_{i i'} q(θ|i′)  and  P(θ) = q(θ|i′) ,   (6.4)

which is Bayes’ rule (see sections 2.10.2 and 6.6 below). Thus, entropic inference is designed to include Bayesian inference as a special case. Note however that imposing DC1 is not identical to imposing Bayesian conditionalization: DC1 is not restricted to information in the form of absolute certainties such as p(D) = 1.

8 For i ∈ D, if p_2 < 2/9 then P_i > 1/6; if p_2 > 2/9 then P_i < 1/6.
Comment 4: If the label i is turned into a continuous variable x the criterion DC1 requires that information that refers to points infinitely close but just outside the domain D will have no influence on probabilities conditional on D. This may seem surprising as it may lead to updated probability distributions that are discontinuous. Is this a problem? No.
In certain situations (common in e.g. physics) we might have explicit reasons to believe that conditions of continuity or differentiability should be imposed and this information might be given to us in a variety of ways. The crucial point, however — and this is a point that we keep and will keep reiterating — is that unless such information is in fact explicitly given we should not assume it. If the new information leads to discontinuities, so be it. The inference process should not be expected to discover and replicate information with which it was not supplied.

Subsystem independence

DC2 When two systems are a priori believed to be independent and we receive independent information about one then it should not matter if the other is included in the analysis or not.

Consider a system of propositions labelled by a composite index, i = (i_1, i_2) ∈ X = X_1 × X_2. For example, {i_1} = X_1 and {i_2} = X_2 might describe the microstates of two separate physical systems. Assume that all prior evidence led us to believe the two subsystems are independent, that is, any two propositions i_1 ∈ X_1 and i_2 ∈ X_2 are believed to be independent. This belief is reflected in the prior distribution: if the individual subsystem priors are q_1(i_1) and q_2(i_2), then the prior for the whole system is q_1(i_1) q_2(i_2). Next suppose that new information is acquired such that q_1(i_1) would by itself be updated to P_1(i_1) and that q_2(i_2) would itself be updated to P_2(i_2). DC2 requires that S[p, q] be such that the joint prior q_1(i_1) q_2(i_2) updates to the product P_1(i_1) P_2(i_2) so that inferences about one subsystem do not affect inferences about the other.
The consequence of DC2 is to fully determine the unknown function F in (6.2) so that probability distributions p(i) should be ranked relative to the prior q(i) according to the relative entropy,

S[p, q] = -\sum_i p(i) \log \frac{p(i)}{q(i)} .   (6.5)

Comment 1: We emphasize that the point is not that when we have no evidence for correlations we draw the firm conclusion that the systems must necessarily be independent. Induction involves risk; the systems might in actual fact be correlated through some unknown interaction potential. The point is rather that if the joint prior reflected independence and the new evidence is silent on the matter of correlations, then the evidence we actually have — namely, the prior — takes precedence and there is no reason to change our minds. As before, a feature of the probability distribution — in this case, independence — will not be updated unless the evidence requires it.
Comment 2: We also emphasize that DC2 is not a consistency requirement. The argument we deploy is not that both the prior and the new information tell us the systems are independent in which case consistency requires that it should not matter whether the systems are treated jointly or separately. DC2 refers to a situation where the new information does not say whether the systems are independent or not. Rather, the updating is being designed so that the independence reflected in the prior is maintained in the posterior by default.
Comment 3: The generalization to continuous variables x ∈ X is approached as a Riemann limit from the discrete case (see Section 4.6). A continuous probability density p(x) or q(x) can be approximated by the discrete distributions. Divide the region of interest X into a large number N of small cells. The probabilities of each cell are

p_i = p(x_i) \Delta x_i  and  q_i = q(x_i) \Delta x_i ,   (6.6)

where \Delta x_i is an appropriately small interval. The discrete entropy of p_i relative to q_i is

S_N = -\sum_{i=1}^{N} \Delta x_i \, p(x_i) \log \frac{p(x_i) \Delta x_i}{q(x_i) \Delta x_i} ,   (6.7)

and in the limit as N → ∞ and \Delta x_i → 0 we get the Riemann integral

S[p, q] = -\int dx \, p(x) \log \frac{p(x)}{q(x)} .   (6.8)

(To simplify the notation we include multi-dimensional integrals by writing d^n x = dx.)
It is easy to check that the ranking of distributions induced by S[p, q] is invariant under coordinate transformations.9 More explicitly, consider a change from old coordinates x to new coordinates x′ such that x = x(x′). The new volume element dx′ includes the corresponding Jacobian,

dx = \gamma(x') \, dx'  where  \gamma(x') = \left| \frac{\partial x}{\partial x'} \right| .   (6.9)

Since p(x) is a density, the transformed function p′(x′) is such that p(x) dx = p′(x′) dx′ is invariant. Therefore

p(x) = \frac{p'(x')}{\gamma(x')}  and  q(x) = \frac{q'(x')}{\gamma(x')} ,   (6.10)

and (6.8) gives

S[p', q'] = S[p, q] .   (6.11)

This shows that the two rankings, the one according to S[p, q] and the other according to S[p′, q′], coincide.

9 The insight that coordinate invariance could be derived as a consequence of the requirement of subsystem independence first appeared in [Vanslette 2017].
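This invariance is easy to verify numerically. The following sketch (Python; the Gaussian densities and the change of variables x′ = e^x are my own choices for illustration) computes S[p, q] in the original coordinates and S[p′, q′] in the transformed coordinates and checks that the two values agree.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = norm(loc=0.0, scale=1.0).pdf                   # a density p(x)
q = norm(loc=0.5, scale=2.0).pdf                   # a prior density q(x)

S = quad(lambda x: -p(x) * np.log(p(x) / q(x)), -10, 10)[0]

# New coordinates x' = exp(x): gamma(x') = dx/dx' = 1/x', and the densities
# transform as p'(x') = p(log x') / x' and q'(x') = q(log x') / x'.
def p_new(xp): return p(np.log(xp)) / xp
def q_new(xp): return q(np.log(xp)) / xp

S_new = quad(lambda xp: -p_new(xp) * np.log(p_new(xp) / q_new(xp)),
             np.exp(-10), np.exp(10), points=[0.1, 1.0, 10.0])[0]

print(S, S_new)                                    # the two values agree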

6.2.4 The ME method


We can now summarize the overall conclusion:

The ME method: We want to update from a prior distribution q to a posterior distribution when there is new information in the form of constraints C that specify a family {p} of allowed posteriors. The posterior is selected through a ranking scheme that recognizes the value of prior information and the privileged role of independence. The preferred posterior P within the family {p} is that which maximizes the relative entropy,

S[p, q] = -\sum_i p_i \log \frac{p_i}{q_i}  or  S[p, q] = -\int dx \, p(x) \log \frac{p(x)}{q(x)} ,   (6.12)

subject to the constraints C.

This extends the method of maximum entropy beyond its original purpose
as a rule to assign probabilities from a given underlying measure (MaxEnt) to a
method for updating probabilities from any arbitrary prior (ME). Furthermore,
the logic behind the updating procedure does not rely on any particular meaning
assigned to the entropy, either in terms of information, or heat, or disorder.
Entropy is merely a tool for inductive inference. No interpretation for S[p; q]
is given and none is needed. We do not need to know what entropy means; we
only need to know how to use it.
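As a concrete illustration of how the method is used in practice, the sketch below (Python; the prior, the function a_i and the value A are arbitrary toy inputs of my own) maximizes S[p, q] subject to normalization and one expectation-value constraint. The maximizer has the exponential form P_i ∝ q_i e^{-λ a_i}, with the multiplier λ fixed by the constraint; the overall sign attached to λ is a matter of convention.

import numpy as np
from scipy.optimize import brentq

q = np.array([0.1, 0.2, 0.3, 0.4])      # prior
a = np.array([0.0, 1.0, 2.0, 3.0])      # quantity whose expectation is constrained
A = 1.5                                  # required expected value <a> = A

def expected_a(lam):
    w = q * np.exp(-lam * a)             # canonical (exponential) form of the maximizer
    return np.dot(a, w / w.sum())

lam = brentq(lambda l: expected_a(l) - A, -20, 20)   # fix lambda from the constraint
P = q * np.exp(-lam * a)
P /= P.sum()
print(P.round(4), np.dot(a, P))          # selected posterior and its expected value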
Comment: In chapter 8 we will refine the method further. There we will address the question of assessing the extent to which distributions that are close to the entropy maximum ought to be ruled out or should be included in the analysis. Their contribution — which accounts for fluctuation phenomena — turns out to be particularly significant in situations where the entropy maximum is not particularly sharp.
The derivation above has singled out a unique S[p, q] to be used in inductive inference. Other “entropies” (such as the one-parameter family of entropies proposed in [Renyi 1961, Aczel Daróczy 1975, Amari 1985, Tsallis 1988]; see Section 6.5.3 below) might turn out to be useful for other purposes — perhaps as measures of some kinds of information, or measures of discrimination or
distinguishability among distributions, or of ecological diversity, or for some altogether different function — but they are unsatisfactory for the purpose of updating in that they do not perform the functions stipulated by the design criteria DC1 and DC2.

6.3 The proofs


In this section we establish the consequences of the two criteria leading to the final result eq.(6.12). The details of the proofs are important not just because they lead to our final conclusions, but also because the translation of the verbal statement of the criteria into precise mathematical form is a crucial part of unambiguously specifying what the criteria actually say.

DC1: Locality for mutually exclusive subdomains


Here we prove that criterion DC1 leads to the expression eq.(6.2) for S[p, q]. Consider the case of a discrete variable, p_i with i = 1 ... n, so that S[p, q] = S(p_1 ... p_n; q_1 ... q_n). Suppose the space of states X is partitioned into two non-overlapping domains D and D̃ with D ∪ D̃ = X, and that the information to be processed is in the form of a constraint that refers to the domain D̃,

\sum_{j \in \tilde{D}} a_j p_j = A .   (6.13)

DC1 states that the constraint on D̃ does not have an influence on the conditional probabilities p_{i|D}. It may however influence the probabilities p_i within D through an overall multiplicative factor. To deal with this complication consider then a special case where the overall probabilities of D and D̃ are constrained too,

\sum_{i \in D} p_i = P_D  and  \sum_{j \in \tilde{D}} p_j = P_{\tilde{D}} ,   (6.14)

with P_D + P_{\tilde{D}} = 1. Under these special circumstances constraints on D̃ will not influence the p_i's within D, and vice versa.
To obtain the posterior maximize S[p, q] subject to these three constraints,

\delta \left\{ S - \lambda \left[ \sum_{i \in D} p_i - P_D \right] - \tilde{\lambda} \left[ \sum_{j \in \tilde{D}} p_j - P_{\tilde{D}} \right] - \mu \left[ \sum_{j \in \tilde{D}} a_j p_j - A \right] \right\} = 0 ,

leading to

\frac{\partial S}{\partial p_i} = \lambda   for i ∈ D ,   (6.15)

\frac{\partial S}{\partial p_j} = \tilde{\lambda} + \mu a_j   for j ∈ D̃ .   (6.16)

Eqs.(6.13)-(6.16) are n + 3 equations we must solve for the p_i's and the three Lagrange multipliers. Since S = S(p_1 ... p_n; q_1 ... q_n) its derivative

\frac{\partial S}{\partial p_i} = f_i(p_1 ... p_n; q_1 ... q_n)   (6.17)

could in principle also depend on all 2n variables. But this violates the locality criterion because any arbitrary change in a_j within D̃ would influence the p_i's within D. The only way that probabilities conditioned on D can be shielded from arbitrary changes in the constraints pertaining to D̃ is that for any i ∈ D the function f_i depends only on p_j's with j ∈ D. Furthermore, this must hold not just for one particular partition of X into domains D and D̃, it must hold for all conceivable partitions including the partition into atomic propositions. Therefore f_i can depend only on p_i,

\frac{\partial S}{\partial p_i} = f_i(p_i; q_1 ... q_n) .   (6.18)

But the power of the locality criterion is not exhausted yet. The information to be incorporated into the posterior can enter not just through constraints but also through the prior. Suppose that the local information about domain D̃ is altered by changing the prior within D̃. Let q_j → q_j + δq_j for j ∈ D̃. Then (6.18) becomes

\frac{\partial S}{\partial p_i} = f_i(p_i; q_1 ... q_j + \delta q_j ... q_n)   (6.19)

which shows that p_i with i ∈ D will be influenced by information about D̃ unless f_i with i ∈ D is independent of all the q_j's for j ∈ D̃. Again, this must hold for all possible partitions into D and D̃, and therefore,

\frac{\partial S}{\partial p_i} = f_i(p_i, q_i)  for all i ∈ X .   (6.20)

The choice of the functions f_i(p_i, q_i) can be restricted further. If we were to maximize S[p, q] subject to constraints

\sum_i p_i = 1  and  \sum_i a_i p_i = A   (6.21)

we get

\frac{\partial S}{\partial p_i} = f_i(p_i, q_i) = \lambda + \alpha a_i  for all i ∈ X ,   (6.22)

where λ and α are Lagrange multipliers. Solving for p_i gives the posterior,

P_i = g_i(q_i, \lambda, \alpha, a_i)   (6.23)

for some functions g_i. As stated in Section 6.2.3 we do not assume that the labels i themselves carry any particular significance. This means, in particular, that for any proposition labelled i we want the selected posterior P_i to depend only on the prior q_i and on the constraints — that is, on λ, α, and a_i. We do not
want to have different updating rules for different propositions: two different propositions i and i′ with the same priors q_i = q_{i′} and the same constraints a_i = a_{i′} should be updated to the same posteriors, P_i = P_{i′}. In other words the functions g_i and f_i must be independent of i. Therefore

\frac{\partial S}{\partial p_i} = f(p_i, q_i)  for all i ∈ X .   (6.24)

Integrating, one obtains

S[p, q] = \sum_i F(p_i, q_i) + constant   (6.25)

for some still undetermined function F. The constant has no effect on the maximization and can be dropped.
The corresponding expression for a continuous variable x is obtained replacing i by x, and the sum over i by an integral over x, leading to eq.(6.2),

S[p, q] = \int dx \, F(p(x), q(x)) .   (6.26)

Comment: One might wonder whether in taking the continuum limit there might be room for introducing first and higher derivatives of p and q so that the function F might include more arguments,

F \overset{?}{=} F\!\left(p, q, \frac{dp}{dx}, \frac{dq}{dx}, \ldots\right) .   (6.27)

The answer is no! As discussed in the previous section one must not allow the inference method to introduce assumptions about continuity or differentiability unless such conditions are explicitly introduced as information. In the absence of any information to the contrary the prior information takes precedence; if this leads to discontinuities we must accept them. On the other hand, we may find ourselves in situations where our intuition insists that the discontinuities should just not be there. The right way to handle such situations (see section 4.12) is to recognize the existence of additional constraints concerning continuity that must be explicitly taken into account.

DC2: Independent subsystems


Let the microstates of a composite system be labeled by (i_1, i_2) ∈ X = X_1 × X_2. We shall consider two special cases.

Case (a) — Suppose nothing is said about subsystem 2 and the information about subsystem 1 is extremely constraining: for subsystem 1 we maximize S_1[p_1, q_1] subject to the constraint that p_1(i_1) is P_1(i_1), the selected posterior being, naturally, p_1(i_1) = P_1(i_1). For subsystem 2 we maximize S_2[p_2, q_2] subject only to normalization so there is no update, P_2(i_2) = q_2(i_2).

When the systems are treated jointly, however, the inference is not nearly as trivial. We want to maximize the entropy of the joint system,

S[p, q] = \sum_{i_1, i_2} F(p(i_1, i_2), q_1(i_1) q_2(i_2)) ,   (6.28)

subject to normalization,

\sum_{i_1, i_2} p(i_1, i_2) = 1 ,   (6.29)

and the constraint on subsystem 1,

\sum_{i_2} p(i_1, i_2) = P_1(i_1) .   (6.30)

Notice that this is not just one constraint: we have one constraint for each value of i_1, and each constraint must be supplied with its own Lagrange multiplier, \lambda_1(i_1). Then,

\delta \left\{ S - \sum_{i_1} \lambda_1(i_1) \left[ \sum_{i_2} p(i_1, i_2) - P_1(i_1) \right] - \alpha \left[ \sum_{i_1, i_2} p(i_1, i_2) - 1 \right] \right\} = 0 .   (6.31)

The independent variations \delta p(i_1, i_2) yield

f(p(i_1, i_2), q_1(i_1) q_2(i_2)) = \alpha + \lambda_1(i_1) ,   (6.32)

where f is given in (6.24),

\frac{\partial S}{\partial p} = \frac{\partial}{\partial p} F(p, q_1 q_2) = f(p, q_1 q_2) .   (6.33)

Next we impose that the selected posterior is the product P_1(i_1) q_2(i_2). The function f must be such that

f(P_1 q_2, q_1 q_2) = \alpha + \lambda_1 .   (6.34)

Since the RHS is independent of the argument i_2, the f function on the LHS must be such that the i_2-dependence cancels out and this cancellation must occur for all values of i_2 and all choices of the prior q_2. Therefore we impose that for any value of x the function f(p, q) must satisfy

f(px, qx) = f(p, q) .   (6.35)

Choosing x = 1/q in the first equation we get

f\!\left(\frac{p}{q}, 1\right) = f(p, q)  or  \frac{\partial F}{\partial p} = f(p, q) = \phi\!\left(\frac{p}{q}\right) .   (6.36)

Thus, the function f(p, q) has been constrained to a function \phi(p/q) of a single argument.

Case (b) — Next we consider a situation in which both subsystems are updated and the information is assumed to be extremely constraining: when the subsystems are treated separately q_1(i_1) is updated to P_1(i_1) and q_2(i_2) is updated to P_2(i_2). When the systems are treated jointly we require that the joint prior for the combined system q_1(i_1) q_2(i_2) be updated to P_1(i_1) P_2(i_2).
First we treat the subsystems separately. Maximize the entropy of subsystem 1,

S[p_1, q_1] = \sum_{i_1} F(p_1(i_1), q_1(i_1))  subject to  p_1(i_1) = P_1(i_1) .   (6.37)

To each constraint — one constraint for each value of i_1 — we must supply one Lagrange multiplier, \lambda_1(i_1). Then,

\delta \left\{ S - \sum_{i_1} \lambda_1(i_1) \left[ p_1(i_1) - P_1(i_1) \right] \right\} = 0 .   (6.38)

Using eq.(6.36),

\frac{\partial S}{\partial p_1} = \frac{\partial}{\partial p_1} F(p_1, q_1) = \phi\!\left(\frac{p_1}{q_1}\right) ,   (6.39)

and, imposing that the selected posterior be P_1(i_1), we find that the function \phi must obey

\phi\!\left(\frac{P_1(i_1)}{q_1(i_1)}\right) = \lambda_1(i_1) .   (6.40)

Similarly, for system 2 we find

\phi\!\left(\frac{P_2(i_2)}{q_2(i_2)}\right) = \lambda_2(i_2) .   (6.41)

Next we treat the two subsystems jointly. Maximize the entropy of the joint system,

S[p, q] = \sum_{i_1, i_2} F(p(i_1, i_2), q_1(i_1) q_2(i_2)) ,   (6.42)

subject to the following constraints on the joint distribution p(i_1, i_2):

\sum_{i_2} p(i_1, i_2) = P_1(i_1)  and  \sum_{i_1} p(i_1, i_2) = P_2(i_2) .   (6.43)

Again, there is one constraint for each value of i_1 and of i_2 and we introduce Lagrange multipliers \lambda_1(i_1) and \lambda_2(i_2). Then,

\delta \left\{ S - \sum_{i_1} \lambda_1(i_1) \left[ \sum_{i_2} p(i_1, i_2) - P_1(i_1) \right] - \{1 \leftrightarrow 2\} \right\} = 0 ,   (6.44)

where \{1 \leftrightarrow 2\} indicates a third term, similar to the second, with 1 and 2 interchanged. Using eq.(6.36),

\frac{\partial S}{\partial p} = \frac{\partial}{\partial p} F(p, q_1 q_2) = \phi\!\left(\frac{p}{q_1 q_2}\right) ,   (6.45)

the independent variations \delta p(i_1, i_2) yield

\phi\!\left(\frac{p(i_1, i_2)}{q_1(i_1) q_2(i_2)}\right) = \lambda_1(i_1) + \lambda_2(i_2) ,   (6.46)
and we impose that the selected posterior be the product P_1(i_1) P_2(i_2). Therefore, the function \phi must be such that

\phi\!\left(\frac{P_1 P_2}{q_1 q_2}\right) = \lambda_1 + \lambda_2 .   (6.47)

To solve this equation we take the exponential of both sides,

\Phi\!\left(\frac{P_1 P_2}{q_1 q_2}\right) = e^{\lambda_1} e^{\lambda_2}  where  \Phi = \exp \phi ,   (6.48)

and rewrite it as

\Phi\!\left(\frac{P_1 P_2}{q_1 q_2}\right) e^{-\lambda_2(i_2)} = e^{\lambda_1(i_1)} .   (6.49)

This shows that for any value of i_1, the dependences in the LHS on i_2 through P_2/q_2 and \lambda_2 must cancel each other out. In particular, if for some subset of i_2's the subsystem 2 is updated so that P_2 = q_2, which amounts to no update at all, the i_2 dependence on the left is eliminated but the i_1 dependence remains unaffected,

\Phi\!\left(\frac{P_1}{q_1}\right) e^{-\lambda_2^0} = e^{\lambda_1(i_1)} ,   (6.50)

where \lambda_2^0 is some constant independent of i_2. A similar argument with \{1 \leftrightarrow 2\} yields

\Phi\!\left(\frac{P_2}{q_2}\right) e^{-\lambda_1^0} = e^{\lambda_2(i_2)} ,   (6.51)

where \lambda_1^0 is a constant. Taking the exponential of (6.40) and (6.41) leads to

\Phi\!\left(\frac{P_1}{q_1}\right) e^{-\lambda_2^0} = e^{\lambda_1 - \lambda_2^0}  and  \Phi\!\left(\frac{P_2}{q_2}\right) e^{-\lambda_1^0} = e^{\lambda_2 - \lambda_1^0} .   (6.52)

Substituting back into (6.48), we get

\bar{\Phi}\!\left(\frac{P_1 P_2}{q_1 q_2}\right) = \bar{\Phi}\!\left(\frac{P_1}{q_1}\right) \bar{\Phi}\!\left(\frac{P_2}{q_2}\right) ,   (6.53)

where a constant factor e^{-(\lambda_1^0 + \lambda_2^0)} has been absorbed into a new function \bar{\Phi}. The general solution of this functional equation is a power,

\bar{\Phi}(xy) = \bar{\Phi}(x) \bar{\Phi}(y)  \Longrightarrow  \bar{\Phi}(x) = x^a ,   (6.54)

so that

\phi(x) = a \log x + b ,   (6.55)

where a and b are constants. Finally, we can integrate (6.36),

\frac{\partial F}{\partial p} = \phi\!\left(\frac{p}{q}\right) = a \log \frac{p}{q} + b ,   (6.56)

to get

F(p, q) = a p \log \frac{p}{q} + b' p + c ,   (6.57)

where b′ and c are constants.
At this point the entropy takes the general form

S[p, q] = \sum_i \left( a p_i \log \frac{p_i}{q_i} + b' p_i + c \right) .   (6.58)

The additive constant c may be dropped: it contributes a term that does not depend on the probabilities and has no effect on the ranking scheme. Furthermore, since S[p, q] will be maximized subject to constraints that include normalization, which is implemented by adding a term \alpha \sum_i p_i, the b′ constant can always be absorbed into the undetermined multiplier \alpha. Thus, the b′ term has no effect on the selected distribution and can be dropped too. Finally, a is just an overall multiplicative constant; it does not affect the overall ranking except in the trivial sense that inverting the sign of a will transform the maximization problem to a minimization problem or vice versa. We can therefore set a = -1 so that maximum S corresponds to maximum preference, which gives us eq.(6.12) and concludes our derivation.
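The design criterion DC2 can also be checked numerically on the final result. In the sketch below (Python; the priors and required marginals are arbitrary toy numbers of my own) the joint relative entropy of eq.(6.12) is maximized with a product prior q_1 q_2 subject to constraints that fix both marginals; the selected joint posterior comes out as the product P_1 P_2, as the criterion demands.

import numpy as np
from scipy.optimize import minimize

q1 = np.array([0.2, 0.3, 0.5])
q2 = np.array([0.6, 0.1, 0.3])
P1 = np.array([0.5, 0.25, 0.25])           # required marginal for subsystem 1
P2 = np.array([0.2, 0.5, 0.3])             # required marginal for subsystem 2
Q = np.outer(q1, q2)                        # independent joint prior q1(i1) q2(i2)

def neg_S(p):                               # -S[p, q1 q2]
    return np.sum(p * np.log(p / Q.ravel()))

cons = [{'type': 'eq', 'fun': lambda p: p.reshape(3, 3).sum(axis=1) - P1},
        {'type': 'eq', 'fun': lambda p: p.reshape(3, 3).sum(axis=0) - P2}]

res = minimize(neg_S, Q.ravel(), constraints=cons,
               bounds=[(1e-9, 1)] * 9, method='SLSQP')
P_joint = res.x.reshape(3, 3)
print(np.abs(P_joint - np.outer(P1, P2)).max())   # close to zero (solver tolerance)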

6.4 Consistency with the law of large numbers


Entropic methods of inference are of general applicability but there exist special situations — such as, for example, those involving large numbers of independent subsystems — where inferences can be made by purely probabilistic methods without ever invoking the concept of entropy. It is important to check that the two methods of calculation are consistent with each other.
Consider a system composed of a large number N of subsystems that are independent and identical. Let the microstates for each individual system be described by a discrete variable i = 1 ... m. Let the number of subsystems found in state i be n_i, and let f_i = n_i/N be the corresponding frequency.10 The probability of a particular frequency distribution f = (f_1 ... f_m) generated by the prior q is given by the multinomial distribution,

Q_N(f|q) = \frac{N!}{n_1! \cdots n_m!} \, q_1^{n_1} \cdots q_m^{n_m}  with  \sum_{i=1}^m n_i = N .   (6.59)

When the n_i are sufficiently large we can use Stirling's approximation,

\log n! = n \log n - n + \log \sqrt{2\pi n} + O(1/n) .   (6.60)

10 This is the “frequency” with which one observes the microstate i in the large sample N. Perhaps ‘fraction’ might be a better term.



Then

\log Q_N(f|q) \approx N \log N - N + \log\sqrt{2\pi N} - \sum_i \left( n_i \log n_i - n_i + \log\sqrt{2\pi n_i} - n_i \log q_i \right)
= -N \sum_i \frac{n_i}{N} \log \frac{n_i}{N q_i} - \sum_i \log\sqrt{\frac{n_i}{N}} - (m-1) \log\sqrt{2\pi N}
= N S[f, q] - \sum_i \log\sqrt{f_i} - (m-1) \log\sqrt{2\pi N} ,   (6.61)

where S[f, q] is given by eq.(6.12),

S[f, q] = -\sum_i f_i \log \frac{f_i}{q_i} .   (6.62)

Therefore for large N, Q_N(f|q) can be written as

Q_N(f|q) \approx C_N \left( \prod_i f_i \right)^{-1/2} \exp\left( N S[f, q] \right) ,   (6.63)

where C_N is a normalization constant. The Gibbs inequality S[f, q] ≤ 0, eq.(4.27), shows that for large N the probability Q_N(f|q) shows an exceedingly sharp peak. The most likely frequency distribution is numerically equal to the probability distribution q_i. This is the weak law of large numbers. Equivalently, we can rewrite it as

\frac{1}{N} \log Q_N(f|q) \approx S[f, q] + r_N ,   (6.64)

where r_N is a correction that vanishes (in probability) as N → ∞. This means that finding the most probable frequency distribution is equivalent to maximizing the entropy S[f, q]: the most probable frequency distribution f is q.
We can take this calculation one step farther and ask what is the most probable frequency distribution among the subset of distributions that satisfies a constraint such as the sample average

\sum_i a_i f_i = \bar{a} .   (6.65)

The answer is given by (6.64): for large N, maximizing the probability Q_N(f|q) subject to the constraint \bar{a} = A is equivalent to maximizing the entropy S[f, q] subject to \bar{a} = A. In the limit of large N the frequencies f_i converge (in probability) to the desired posterior P_i while the sample average \bar{a} = \sum_i a_i f_i converges (also in probability) to the expected value ⟨a⟩ = A.
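The following sketch (Python; a three-state toy example with numbers of my own choosing) illustrates this consistency: the frequency distribution that maximizes the exact multinomial probability Q_N(f|q) among those satisfying a sample-average constraint is compared with the distribution selected by maximizing S[f, q] subject to the same constraint, and for moderately large N the two nearly coincide.

import numpy as np
from math import lgamma
from scipy.optimize import brentq

q = np.array([0.5, 0.3, 0.2])
a = np.array([1.0, 2.0, 3.0])
A, N = 2.0, 300                              # constrained sample average, sample size

# Entropic answer: canonical form P_i proportional to q_i exp(beta a_i), with <a> = A.
def mean_a(beta):
    w = q * np.exp(beta * a)
    return np.dot(a, w / w.sum())
beta = brentq(lambda b: mean_a(b) - A, -30, 30)
P = q * np.exp(beta * a); P /= P.sum()

# Brute force: most probable (n1, n2, n3) with n1+n2+n3 = N and sample mean = A.
def log_multinomial(n):
    return (lgamma(N + 1) - sum(lgamma(k + 1) for k in n)
            + np.dot(n, np.log(q)))
best, best_lp = None, -np.inf
for n1 in range(N + 1):
    for n2 in range(N + 1 - n1):
        n = np.array([n1, n2, N - n1 - n2])
        if abs(np.dot(a, n) / N - A) < 0.5 / N:    # constraint satisfied on the grid
            lp = log_multinomial(n)
            if lp > best_lp:
                best, best_lp = n, lp
print(P.round(4), (best / N).round(4))             # the two nearly coincide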
[Csiszar 1984] and [Grendar 2001] have argued that the asymptotic argument above provides by itself a valid justification for the ME method of updating. An agent whose prior is q receives the information ⟨a⟩ = A which can be reasonably interpreted as a sample average \bar{a} = A over a large ensemble of N trials. The agent’s beliefs are updated so that the posterior P coincides with the most
probable f distribution. This is quite compelling but, of course, as a justification of the ME method it is restricted to situations where it is natural to think in terms of ensembles with large N. This justification is not nearly as compelling for singular events for which large ensembles either do not exist or are too unnatural and contrived. From our point of view the asymptotic argument above does not by itself provide a fully convincing justification for the universal validity of the ME method but it does provide considerable inductive support. It serves as a valuable consistency check that must be passed by any inductive inference procedure that claims to be of general applicability.11

6.5 Random remarks


6.5.1 On priors
All entropies are relative entropies. In the case of a discrete variable, if one assigns equal a priori probabilities, q_i = 1, one obtains the Boltzmann-Gibbs-Shannon entropy, S[p] = -\sum_i p_i \log p_i. The notation S[p] has a serious drawback: it misleads one into thinking that S depends on p only. In particular, we emphasize that whenever S[p] is used, the prior measure q_i = 1 has been implicitly assumed. In Shannon’s axioms, for example, this choice is implicitly made in his first axiom, when he states that the entropy is a function of the probabilities S = S(p_1 ... p_n) and nothing else, and also in his second axiom when the uniform distribution p_i = 1/n is singled out for special treatment.
The absence of an explicit reference to a prior q_i may erroneously suggest that prior distributions have been rendered unnecessary and can be eliminated. It suggests that it is possible to transform information (i.e., constraints) directly into posterior distributions in a totally objective and unique way. This was Jaynes’ hope for the MaxEnt program. If this were true the old controversy, of whether probabilities are subjective or objective, would have been resolved in favor of complete objectivity. But the prior q_i = 1 is implicit in S[p]; the postulate of equal a priori probabilities or Laplace’s “Principle of Insufficient Reason” still plays a major, though perhaps hidden, role. Any claims that probabilities assigned using maximum entropy will yield absolutely objective results are unfounded; not all subjectivity has been eliminated. Just as with Bayes’ theorem, what is objective here is the manner in which information is processed to update from a prior to a posterior, and not the prior probabilities themselves. And even then the updating is objective because we have agreed to adopt very specific criteria — this is objectivity by design.
Choosing the prior density q(x) can be tricky. Sometimes symmetry considerations can be useful in fixing the prior (three examples were given in section 4.6) but otherwise there is no fixed set of rules to translate information into a probability distribution except, of course, for Bayes’ theorem and the ME method themselves.
11 It is significant that other “entropies”, e.g. [Renyi 1961, Aczel Daróczy 1975, Amari 1985, Tsallis 1988], do not pass this test. (Section 6.5.3 below.)



What if the prior q(x) vanishes for some values of x? S[p, q] can be infinitely negative when q(x) vanishes within some region D. In other words, the ME method confers an overwhelming preference on those distributions p(x) that vanish whenever q(x) does. One must emphasize that this is as it should be; it is not a problem. As we saw in section 2.10.4 a similar situation also arises in the context of Bayes’ theorem where a vanishing prior represents a tremendously serious commitment because no amount of data to the contrary would allow us to revise it. In both ME and Bayes updating we should recognize the implications of assigning a vanishing prior. Assigning a very low but non-zero prior represents a safer and less prejudiced representation of one’s beliefs.
For more on the choice of priors see the review [Kass Wasserman 1996]; in particular for entropic priors see [Rodriguez 1990-2003, Caticha Preuss 2004].

6.5.2 Informative and non-informative priors*


Stein shrinking phenomenon

6.5.3 Comments on other axiomatizations


One feature that distinguishes the axiomatizations proposed by various authors is how they justify maximizing a functional. In other words, why maximum entropy? In the approach of Shore and Johnson this question receives no answer; it is just one of the axioms. Csiszar provides a better answer. He derives the ‘maximize a functional’ rule from reasonable axioms of regularity and locality [Csiszar 1991]. In Skilling’s and in the approach developed here the rule is not derived, but it does not go unexplained either: it is imposed by design, it is justified by the function that S is supposed to perform, namely, to achieve a transitive ranking.
Both Shore and Johnson and Csiszar require, and it is not clear why, that updating from a prior must lead to a unique posterior, and accordingly, there is a restriction that the constraints define a convex set. In Skilling’s approach and in the one advocated here there is no requirement of uniqueness; we are perfectly willing to entertain situations where the available information points to several equally preferable distributions. To this subject we will return in chapter 8.
There is another important difference between the axiomatic approach presented by Csiszar and the design approach presented here. Our ME method is designed to be of universal applicability. As with all inductive procedures, any particular instance of induction can turn out to be wrong — perhaps because, for example, not all relevant information has been taken into account — but this does not change the fact that ME is still the unique inductive inference method obeying rational design criteria. On the other hand Csiszar’s version of the MaxEnt method is not designed to generalize beyond its axioms. This version of the method was developed for linear constraints and, therefore, one should not feel justified to perform deductions beyond the cases of linear constraints. In our case, the application to non-linear constraints is precisely the
kind of induction the ME method was designed to perform.
It is interesting that if instead of axiomatizing the inference process, one axiomatizes the entropy itself by specifying those properties expected of a measure of separation between (possibly unnormalized) distributions one is led to a continuum of η-entropies [Amari 1985],

S_\eta[p, q] = \frac{1}{\eta(\eta+1)} \int dx \left[ (\eta+1) p - \eta q - p^{\eta+1} q^{-\eta} \right] ,   (6.66)

labelled by a parameter η. These entropies are equivalent, for the purpose of updating, to the relative Renyi entropies [Renyi 1961, Aczel 1975]. The shortcoming of this approach is that it is not clear when and how such entropies are to be used, which features of a probability distribution are being updated and which preserved, or even in what sense do these entropies measure an amount of information. Remarkably, if one further requires that S_η be additive over independent sources of uncertainty, as one could reasonably expect from a measure, then the continuum in η is restricted to just the two values η = 0 and η = −1 which correspond to the logarithmic entropies S[p, q] and S[q, p].
For the special case when p is normalized and a uniform prior q = 1 we get (dropping the term linear in q)

S_\eta = \frac{1}{\eta(\eta+1)} \left( 1 - \int dx \, p^{\eta+1} \right) .   (6.67)

A related entropy

S'_{\eta'} = \frac{1}{\eta'} \left( 1 - \int dx \, p^{\eta'+1} \right)   (6.68)

has been proposed in [Tsallis 1988] and forms the foundation of a so-called non-extensive statistical mechanics (see section 5.5). Clearly these two entropies are equivalent in that they generate equivalent variational problems — maximizing S_η is equivalent to maximizing S'_{\eta'}. To conclude our brief remarks on the entropies S_η we point out that, quite apart from the difficulty of achieving consistency with the law of large numbers, some of the probability distributions obtained maximizing S_η may also be derived through a more standard use of MaxEnt or ME as advocated in these lectures (section 5.5).

6.6 Bayes’ rule as a special case of ME

Since the ME method and Bayes’ rule are both designed for updating probabilities, and both invoke a Principle of Minimal Updating, it is important to explore the relations between them. In section 6.2.3 we showed that ME is designed to include Bayes’ rule as a special case. Here we would like to revisit this topic in greater depth, and, in particular, to explore some variations and generalizations [Caticha Giffin 2006].
As described in section 2.10 the goal is to update our beliefs about θ ∈ Θ (θ represents one or many parameters) on the basis of three pieces of information:
(1) the prior information codified into a prior distribution q(θ); (2) the data x ∈ X (obtained in one or many experiments); and (3) the known relation between θ and x given by the model as defined by the sampling distribution or likelihood, q(x|θ). The updating consists of replacing the prior probability distribution q(θ) by a posterior distribution P(θ) that applies after the data has been processed.
The crucial element that will allow Bayes’ rule to be smoothly incorporated into the ME scheme is the realization that before the data is available not only do we not know θ, we do not know x either. Thus, the relevant space for inference is not the space Θ but the product space Θ × X and the relevant joint prior is q(x, θ) = q(θ) q(x|θ). Let us emphasize two points: first, the likelihood function is an integral part of the prior distribution; and second, the information about how x is related to θ is contained in the functional form of the distribution q(x|θ) — for example, whether it is a Gaussian or a Cauchy distribution or something else — and not in the numerical values of the arguments x and θ which are, at this point, still unknown.
Next data is collected and the observed values turn out to be x′. We must update to a posterior that lies within the family of distributions p(x, θ) that reflect the fact that x is now known to be x′,

p(x) = \int d\theta \, p(\theta, x) = \delta(x - x') .   (6.69)

This data information constrains but is not sufficient to determine the joint distribution

p(x, \theta) = p(x) p(\theta|x) = \delta(x - x') \, p(\theta|x') .   (6.70)

Any choice of p(θ|x′) is in principle possible. So far the formulation of the problem parallels section 2.10 exactly. We are, after all, solving the same problem. Next we apply the ME method and show that we get the same answer.
According to the ME method the selected joint posterior P(x, θ) is that which maximizes the entropy,

S[p, q] = -\int dx \, d\theta \, p(x, \theta) \log \frac{p(x, \theta)}{q(x, \theta)} ,   (6.71)

subject to the appropriate constraints. Note that the information in the data, eq.(6.69), represents an infinite number of constraints on the family p(x, θ): for each value of x there is one constraint and one Lagrange multiplier λ(x). Maximizing S, (6.71), subject to (6.69) and normalization,

\delta \left\{ S + \alpha \left[ \int dx \, d\theta \, p(x, \theta) - 1 \right] + \int dx \, \lambda(x) \left[ \int d\theta \, p(x, \theta) - \delta(x - x') \right] \right\} = 0 ,   (6.72)

yields the joint posterior,

P(x, \theta) = q(x, \theta) \, \frac{e^{\lambda(x)}}{Z} ,   (6.73)
where Z is a normalization constant, and the multiplier λ(x) is determined from (6.69),

\int d\theta \, q(x, \theta) \, \frac{e^{\lambda(x)}}{Z} = q(x) \, \frac{e^{\lambda(x)}}{Z} = \delta(x - x') ,   (6.74)

so that the joint posterior is

P(x, \theta) = q(x, \theta) \, \frac{\delta(x - x')}{q(x)} = \delta(x - x') \, q(\theta|x) .   (6.75)

The corresponding marginal posterior probability P(θ) is

P(\theta) = \int dx \, P(\theta, x) = q(\theta|x') = q(\theta) \, \frac{q(x'|\theta)}{q(x')} ,   (6.76)

which is recognized as Bayes’ rule. Thus, Bayes’ rule is derivable from and therefore consistent with the ME method.
To summarize: the prior q(x, θ) = q(x) q(θ|x) is updated to the posterior P(x, θ) = P(x) P(θ|x) where P(x) = δ(x − x′) is fixed by the observed data while P(θ|x′) = q(θ|x′) remains unchanged. Note that in accordance with the philosophy that drives the ME method one only updates those aspects of one’s beliefs for which corrective new evidence has been supplied.
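The equivalence can be verified directly on a small discrete model. In the sketch below (Python; the two-hypothesis prior and likelihood are toy numbers of my own) the joint relative entropy is maximized subject to the constraint P(x) = δ_{x,x′}, and the resulting marginal for θ is compared with the posterior obtained from Bayes’ rule.

import numpy as np
from scipy.optimize import minimize

q_theta = np.array([0.3, 0.7])                     # prior over two hypotheses
lik = np.array([[0.9, 0.1],                        # q(x|theta): rows theta, columns x
                [0.4, 0.6]])
Q = q_theta[:, None] * lik                         # joint prior q(theta, x)
x_obs = 1                                          # observed datum x'

def neg_S(p):                                      # -S[p,q] for the joint distribution
    return np.sum(p * np.log(p / Q.ravel()))

def neg_S_grad(p):
    return np.log(p / Q.ravel()) + 1.0

target = np.array([0.0, 1.0])                      # constraint: P(x) = delta_{x,x'}
cons = [{'type': 'eq',
         'fun': lambda p: p.reshape(2, 2).sum(axis=0) - target}]
res = minimize(neg_S, np.full(4, 0.25), jac=neg_S_grad, constraints=cons,
               bounds=[(1e-9, 1)] * 4, method='SLSQP')
P_theta_ME = res.x.reshape(2, 2).sum(axis=1)       # marginal over x

P_theta_Bayes = q_theta * lik[:, x_obs]
P_theta_Bayes /= P_theta_Bayes.sum()
print(P_theta_ME.round(4), P_theta_Bayes.round(4))  # the two agree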
I conclude with a few simple examples that show how ME allows generalizations of Bayes’ rule. The general background for these generalized Bayes problems is the familiar one: We want to make inferences about some variables θ on the basis of information about other variables x and of a relation between them.

Bayes updating with uncertain data


As before, the prior information consists of our prior beliefs about θ given by the distribution q(θ) and a likelihood function q(x|θ) so the joint prior q(x, θ) is known. But now the information about x is much more limited. The data is uncertain: x is not known. The marginal posterior p(x) is no longer a sharp delta function but some other known distribution, p(x) = P_D(x). This is still an infinite number of constraints

p(x) = \int d\theta \, p(\theta, x) = P_D(x) ,   (6.77)

that are easily handled by ME. Maximizing S, (6.71), subject to (6.77) and normalization, leads to

P(x, \theta) = P_D(x) \, q(\theta|x) .   (6.78)

The corresponding marginal posterior,

P(\theta) = \int dx \, P_D(x) \, q(\theta|x) = q(\theta) \int dx \, P_D(x) \, \frac{q(x|\theta)}{q(x)} ,   (6.79)

is known as Jeffrey’s rule, which we met earlier in section 2.10.
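Numerically, Jeffrey’s rule is just a mixture of Bayes posteriors weighted by P_D(x). A minimal sketch (Python; toy numbers of my own) reads:

import numpy as np

q_theta = np.array([0.3, 0.7])
lik = np.array([[0.9, 0.1],        # q(x|theta)
                [0.4, 0.6]])
P_D = np.array([0.2, 0.8])         # uncertain data: known marginal for x

q_x = q_theta @ lik                               # q(x)
q_theta_given_x = (q_theta[:, None] * lik) / q_x  # q(theta|x)
P_theta = q_theta_given_x @ P_D                   # sum_x P_D(x) q(theta|x)
print(P_theta)                                    # Jeffrey-conditioned posterior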



Bayes updating with information about x moments


Now we have even less information about the “data” x: the marginal distribution p(x) is not known. All we know about p(x) is an expected value

\langle f \rangle = \int dx \, p(x) f(x) = F .   (6.80)

Maximizing S, (6.71), subject to (6.80) and normalization,

\delta \left\{ S + \alpha \left[ \int dx \, d\theta \, p(x, \theta) - 1 \right] + \beta \left[ \int dx \, d\theta \, p(x, \theta) f(x) - F \right] \right\} = 0 ,   (6.81)

yields the joint posterior,

P(x, \theta) = q(x, \theta) \, \frac{e^{\beta f(x)}}{Z} ,   (6.82)

where the normalization constant Z and the multiplier β are obtained from

Z = \int dx \, q(x) \, e^{\beta f(x)}  and  \frac{d \log Z}{d\beta} = F .   (6.83)

The corresponding marginal posterior is

P(\theta) = q(\theta) \int dx \, q(x|\theta) \, \frac{e^{\beta f(x)}}{Z} .   (6.84)
These two examples (6.79) and (6.84) are sufficiently intuitive that one could have written them down directly without deploying the full machinery of the ME method, but they do serve to illustrate the essential compatibility of Bayesian and Maximum Entropy methods. Next we consider a slightly less trivial example.

Updating with data and information about moments


Here we follow [Giffin Caticha 2007]. In addition to data about x we have additional information about θ in the form of a constraint on the expected value of some function f(θ),

\int dx \, d\theta \, P(x, \theta) f(\theta) = \langle f(\theta) \rangle = F .   (6.85)

In the standard Bayesian practice it is possible to impose constraint information at the level of the prior, but this information need not be preserved in the posterior. What we do here that differs from the standard Bayes’ rule is that we can require that the constraint (6.85) be satisfied by the posterior distribution.
Maximizing the entropy (6.71) subject to normalization, the data constraint (6.69), and the moment constraint (6.85) yields the joint posterior,

P(x, \theta) = q(x, \theta) \, \frac{e^{\lambda(x) + \beta f(\theta)}}{z} ,   (6.86)
where z is a normalization constant,

z = \int dx \, d\theta \, e^{\lambda(x) + \beta f(\theta)} \, q(x, \theta) .   (6.87)

The Lagrange multipliers λ(x) are determined from the data constraint, (6.69),

\frac{e^{\lambda(x)}}{z} = \frac{\delta(x - x')}{Z \, q(x')}  where  Z(\beta, x') = \int d\theta \, e^{\beta f(\theta)} \, q(\theta|x') ,   (6.88)

so that the joint posterior becomes

P(x, \theta) = \delta(x - x') \, q(\theta|x') \, \frac{e^{\beta f(\theta)}}{Z} .   (6.89)

The remaining Lagrange multiplier β is determined by imposing that the posterior P(x, θ) satisfy the constraint (6.85). This yields an implicit equation for β,

\frac{\partial \log Z}{\partial \beta} = F .   (6.90)

Note that since Z = Z(β, x′) the multiplier β will depend on the observed data x′. Finally, the new marginal distribution for θ is

P(\theta) = q(\theta|x') \, \frac{e^{\beta f(\theta)}}{Z} = q(\theta) \, \frac{q(x'|\theta)}{q(x')} \, \frac{e^{\beta f(\theta)}}{Z} .   (6.91)

For β = 0 (no moment constraint) we recover Bayes’ rule. For β ≠ 0 Bayes’ rule is modified by a “canonical” exponential factor yielding an effective likelihood function.
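A small numerical sketch may make the role of the canonical factor concrete. In the example below (Python; the Gaussian prior and likelihood, the choice f(θ) = θ, and the value F are all illustrative assumptions of my own) the multiplier β is found from the condition ⟨f(θ)⟩ = F imposed on the posterior, which is equivalent to the implicit equation (6.90); setting β = 0 would recover the plain Bayes posterior.

import numpy as np
from scipy.optimize import brentq

theta = np.linspace(-5, 5, 2001)
dth = theta[1] - theta[0]
prior = np.exp(-0.5 * theta**2); prior /= prior.sum() * dth     # q(theta)
x_obs, sigma = 1.0, 1.0
lik = np.exp(-0.5 * (x_obs - theta)**2 / sigma**2)              # q(x'|theta)
f = theta                                                        # constrained function
F = 0.2                                                          # required <f(theta)>

def posterior(beta):
    w = prior * lik * np.exp(beta * f)
    return w / (w.sum() * dth)

def moment(beta):
    return np.sum(f * posterior(beta)) * dth

beta = brentq(lambda b: moment(b) - F, -20, 20)
P = posterior(beta)
print(beta, np.sum(f * P) * dth)        # beta = 0 would give the plain Bayes posterior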

Updating with uncertain data and an unknown likelihood


The following example [Caticha 2010] derives and generalizes Zellner’s Bayesian Method of Moments [Zellner 1997]. Usually the relation between x and θ is given by a known likelihood function q(x|θ) but suppose this relation is not known. This is the case when the joint prior is so ignorant that information about x tells us nothing about θ and vice versa; such a prior treats x and θ as statistically independent, q(x, θ) = q(x) q(θ). Since we have no likelihood function the information about the relation between θ and the data x must be supplied elsewhere. One possibility is through a constraint. Suppose that in addition to normalization and the uncertain data constraint, eq.(6.77), we also know that the expected value over θ of a function f(x, θ) is

\langle f \rangle_x = \int d\theta \, p(\theta|x) f(x, \theta) = F(x) .   (6.92)

We seek a posterior P(x, θ) that maximizes (6.71). Introducing Lagrange multipliers α, λ(x), and β(x),

0 = \delta \left\{ S + \alpha \left[ \int dx \, d\theta \, p(x, \theta) - 1 \right] + \int dx \, \lambda(x) \left[ \int d\theta \, p(x, \theta) - P_D(x) \right] + \int dx \, \beta(x) \left[ \int d\theta \, p(x, \theta) f(x, \theta) - P_D(x) F(x) \right] \right\} ,   (6.93)
the variation over δp(x, θ) yields

P(x, \theta) = \frac{1}{\zeta} \, q(x) q(\theta) \, e^{\lambda(x) + \beta(x) f(x, \theta)} ,   (6.94)

where ζ is a normalization constant. The multiplier λ(x) is determined from (6.77),

P(x) = \int d\theta \, P(\theta, x) = \frac{1}{\zeta} \, q(x) \, e^{\lambda(x)} \int d\theta \, q(\theta) \, e^{\beta(x) f(x, \theta)} = P_D(x) ,   (6.95)

then,

P(x, \theta) = P_D(x) \, \frac{q(\theta) \, e^{\beta(x) f(x, \theta)}}{\int d\theta' \, q(\theta') \, e^{\beta(x) f(x, \theta')}} ,   (6.96)

so that

P(\theta|x) = \frac{P(x, \theta)}{P(x)} = \frac{q(\theta) \, e^{\beta(x) f(x, \theta)}}{Z(x)}  with  Z(x) = \int d\theta' \, q(\theta') \, e^{\beta(x) f(x, \theta')} .   (6.97)

The multiplier β(x) is determined from (6.92),

\frac{1}{Z(x)} \frac{\partial Z(x)}{\partial \beta(x)} = F(x) .   (6.98)

The corresponding marginal posterior is

P(\theta) = \int dx \, P_D(x) \, P(\theta|x) = q(\theta) \int dx \, P_D(x) \, \frac{e^{\beta(x) f(x, \theta)}}{Z(x)} .   (6.99)

In the limit when the data are sharply determined, P_D(x) = δ(x − x′), the posterior takes the form of Bayes theorem,

P(\theta) = q(\theta) \, \frac{e^{\beta(x') f(x', \theta)}}{Z(x')} ,   (6.100)

where up to a normalization factor e^{\beta(x') f(x', \theta)} plays the role of the likelihood and the normalization constant Z plays the role of the evidence.
In conclusion, these examples demonstrate that the method of maximum entropy can fully reproduce the results obtained by the standard Bayesian methods and allows us to extend them to situations that lie beyond their reach such as when the likelihood function is not known.

6.7 Commuting and non-commuting constraints


The ME method allows one to process information in the form of constraints.
When we are confronted with several constraints we must be particularly cau-
tious. In what order should they be processed? Or should they be processed
6.7 Commuting and non-commuting constraints 189

together? The answer depends on the problem at hand. (Here we follow [Gi¢ n
Caticha 2007].)
We refer to constraints as commuting when it makes no di¤erence whether
they are handled simultaneously or sequentially. The most common example is
that of Bayesian updating on the basis of data collected in several independent
experiments. In this case the order in which the observed data x0 = fx01 ; x02 ; : : :g
is processed does not matter for the purpose of inferring . (See section 2.10.3)
The proof that ME is completely compatible with Bayes’rule implies that data
constraints implemented through functions, as in (6.69), commute. It is useful
to see how this comes about.
When an experiment is repeated it is common to refer to the value of x in
the …rst experiment and the value of x in the second experiment. This is a dan-
gerous practice because it obscures the fact that we are actually talking about
two separate variables. We do not deal with a single x but with a composite
x = (x1 ; x2 ) and the relevant space is X1 X2 . After the …rst experiment
yields the value x01 , represented by the constraint C1 : P (x1 ) = (x1 x01 ), we can
perform a second experiment that yields x02 and is represented by a second con-
straint C2 : P (x2 ) = (x2 x02 ). These constraints C1 and C2 commute because
they refer to di¤ erent variables x1 and x2 . An experiment, once performed and
its outcome observed, cannot be un-performed ; its result cannot be un-observed
by a second experiment. Thus, imposing the second constraint does not imply
a revision of the …rst.
In general constraints need not commute and when this is the case the order
in which they are processed is critical. For example, suppose the prior is q and
we receive information in the form of a constraint, C1 . To update we maximize
the entropy S[p; q] subject to C1 leading to the posterior P1 as shown in Figure
6.2. Next we receive a second piece of information described by the constraint
C2 . At this point we can proceed in two very di¤erent ways:
(a) Sequential updating. Having processed C1 , we use P1 as the current prior
and maximize S[p; P1 ] subject to the new constraint C2 . This leads us to the
posterior Pa .
(b) Simultaneous updating. Use the original prior q and maximize S[p; q]
subject to both constraints C1 and C2 simultaneously. This leads to the poste-
rior Pb .
At …rst sight it might appear that there exists a third possibility of simulta-
neous updating: (c) use P1 as the current prior and maximize S[p; P1 ] subject to
both constraints C1 and C2 simultaneously. Fortunately, and this is a valuable
check for the consistency of the ME method, it is easy to show that case (c)
is equivalent to case (b). Whether we update from q or from P1 the selected
posterior is Pb .
To decide which path (a) or (b) is appropriate we must be clear about how the
ME method handles constraints. The ME machinery interprets a constraint such
as C1 in a very mechanical way: all distributions satisfying C1 are in principle
allowed and all distributions violating C1 are ruled out.
Updating to a posterior P1 consists precisely in revising those aspects of the
prior q that disagree with the new constraint C1 . However, there is nothing …nal
190 Entropy III: Updating Probabilities

Figure 6.2: Illustrating the di¤erence between processing two constraints C1 and
C2 sequentially (q ! P1 ! Pa ) and simultaneously (q ! Pb or q ! P1 ! Pb ).

about the distribution P1 . It is just the best we can do in our current state of
knowledge and we fully expect that future information may require us to revise
it further. Indeed, when new information C2 is received we must reconsider
whether the original C1 remains valid or not. Are all distributions satisfying
the new C2 really allowed, even those that violate C1 ? If this is the case then
the new C2 takes over and we update from P1 to Pa . The constraint C1 may
still retain some lingering e¤ect on the posterior Pa through P1 , but in general
C1 has now become obsolete.
Alternatively, we may decide that the old constraint C1 retains its validity.
The new C2 is not meant to revise C1 but to provide an additional re…nement
of the family of allowed posteriors. If this is the case, then the constraint that
correctly re‡ects the new information is not C2 but the more restrictive C1 ^ C2 .
The two constraints should be processed simultaneously to arrive at the correct
posterior Pb .
To summarize: sequential updating is appropriate when old constraints be-
come obsolete and are superseded by new information; simultaneous updating is
appropriate when old constraints remain valid. The two cases refer to di¤erent
states of information and therefore we expect that they will result in di¤erent
inferences. These comments are meant to underscore the importance of under-
standing what information is and how it is processed by the ME method; failure
to do so will lead to errors that do not re‡ect a shortcoming of the ME method
but rather a misapplication of it.
6.8 Conclusion 191

6.8 Conclusion
Any Bayesian account of the notion of information cannot ignore the fact that
Bayesians are concerned with the beliefs of rational agents. The relation be-
tween information and beliefs must be clearly spelled out. The de…nition we
have proposed — that information is that which constrains rational beliefs and
therefore forces the agent to change its mind — is convenient for two reasons.
First, the information/belief relation very explicit, and second, the de…nition is
ideally suited for quantitative manipulation using the ME method.
Dealing with uncertainty requires that one solve two problems. First, one
must represent a state of partial knowledge as a consistent web of interconnected
beliefs. The instrument to do it is probability. Second, when new information
becomes available the beliefs must be updated. The instrument for this is rela-
tive entropy. It is the only candidate for an updating method that is of universal
applicability; that recognizes the value of prior information; and that recognizes
the privileged role played by the notion of independence in science. The re-
sulting general method — the ME method — can handle arbitrary priors and
arbitrary constraints; and it includes MaxEnt and Bayes’rule as special cases.
The design of the ME method is essentially complete. However, the fact
that ME operates by ranking distributions according to preference immediately
raises questions about why should distributions that lie very close to the entropy
maximum be totally ruled out; and if not ruled out completely, to what extent
should they contribute to the inference. Do they make any di¤erence? To what
extent can we even distinguish similar distributions? The discussion of these
matters in the next two chapters will signi…cantly extend the utility of the ME
method as a framework for inference.
Chapter 7

Information Geometry

A main concern of any theory of inference is to pick a probability distribution


from a set of candidates and this immediately raises many questions. What if we
had picked a slightly di¤erent distribution? Would it have made any di¤erence?
What makes two distributions similar? To what extent can we distinguish one
distribution from another? Are there quantitative measures of distinguisha-
bility? The goal of this chapter is to address such questions by introducing
methods of geometry. More speci…cally the goal will be to introduce a notion
of “distance” between two probability distributions.
A parametric family of probability distributions — distributions p (x) la-
beled by parameters = ( 1 : : : n ) — forms a statistical manifold, namely, a
space in which each point, labeled by coordinates , represents a probability
distribution p (x). Generic manifolds do not come with an intrinsic notion of
distance. A metric structure is an optional addition that has to be supplied
separately. This is done by singling out one privileged metric tensor — or just,
the metric — from a family of potential candidates. Statistical manifolds are,
however, an exception. One of the main goals of this chapter is to show that
statistical manifolds already come endowed with a uniquely natural notion of
distance — the so-called information metric. And the geometry is not optional
— it is an intrinsic structural element that characterizes statistical manifolds.
We will not develop the subject in all its possibilities — for a more exten-
sive treatment see [Amari 1985, Amari Nagaoka 2000] — but we do wish to
emphasize one speci…c result. Having a notion of distance means that we also
have a notion of volume and this in turn implies that there is a unique and epis-
temically objective notion of a probability distribution that is uniform over the
space of parameters — that is, a distribution that assigns equal probabilities to
equal volumes. Whether such distributions are maximally non-informative, or
whether they de…ne ignorance, or whether they re‡ect the actual prior beliefs of
any rational agent, are all potentially important issues but they are quite beside
the speci…c point that we want to emphasize, namely, that they are uniform.
The distance d` between two neighboring points and + d is given by
194 Information Geometry

Pythagoras’theorem, which written in terms of a metric tensor gab , is1

d`2 = gab d a d b
: (7.1)

The singular importance of the metric tensor gab derives from a theorem due
µ
to N. Cencov that states that the metric gab on the manifold of probability
distributions is essentially unique: up to an overall scale factor there is only
one metric that takes into account the fact that these are not distances between
simple structureless dots but between probability distributions [Cencov 1981].

7.1 Examples of statistical manifolds


An n-dimensional manifold M is a smooth, possibly curved, space that is locally
like Rn . What this means is that one can set up coordinates (that is a map
M ! Rn ) so that each point 2 M is identi…ed or labelled by n numbers,
= ( 1 : : : n ).
A statistical manifold is a manifold in which each point represents a prob-
ability distribution p (x). Thus, a statistical manifold is a family of probability
distributions p (x) that depend on n parameters = ( 1 : : : n ); the distribu-
tions are labelled by the parameters . As we shall later see a very convenient
notation is p (x) = p(xj ).
Here are some examples:
The multinomial distributions are given by

N!
p(fmi gj ) = ( 1 )m1 ( 2 )m2 : : : ( n mn
) ; (7.2)
m1 !m2 ! : : : mn !

where = ( 1; 2
::: n
),
Pn Pn i
N= i=1 mi and i=1 =1: (7.3)

They form a statistical manifold of dimension (n 1) called a simplex, Sn 1 .


The parameters = ( 1 ; 2 : : : n ) are a convenient choice of coordinates.
The multivariate Gaussian distributions with means a , a = 1 : : : n, and
variance 2 ,

1 1 P
n
p(xj ; ) = 2 )n=2
exp 2
(xa a 2
) ; (7.4)
(2 2 a=1

form an (n + 1)-dimensional statistical manifold. A convenient choice of coor-


dinates is = ( 1 ; : : : ; n ; ).
1 The use of superscripts rather than subscripts for the indices labelling coordinates is a

common and very convenient notational convention in di¤erential geometry. We adopt the
standard Einstein convention of summing over repeated indices P Pwhenever one appears as a
superscript and the other as a subscript, that is, gab f ab = a
ab
b gab f . Furthermore, we
shall follow standard practice and indistinctly refer to both the metric tensor gab and the
quadratic form (7.1) as the ‘metric’.
7.2 Vectors in curved spaces 195

The canonical distributions, eq.(4.82),

1 k
k fi
p(ijF ) = e ; (7.5)
Z
are derived by maximizing the Shannon entropy S[p] subject to constraints on
the expected values of n functions fik = f k (xi ) labeled by superscripts k =
1; 2; : : : n,
P
f k = pi fik = F k : (7.6)
i

They form an n-dimensional statistical manifold. As coordinates we can ei-


ther use the expected values F = (F 1 : : : F n ) or, equivalently, the Lagrange
multipliers, = ( 1 : : : n ).

7.2 Vectors in curved spaces


In this section and the next we brie‡y review some basic notions of di¤erential
geometry. First we will be concerned with the notion of a vector. The treatment
is not meant to be rigorous; the goal is to give an intuitive discussion of the basic
ideas and to establish the notation.

Vectors as displacements
Perhaps the most primitive notion of a vector is associated to a displacement
in space and is visualized as an arrow; other vectors such as velocities and
accelerations are de…ned in terms of such displacements and from these one can
elaborate further to de…ne forces, force …elds and so on.
This notion of vector as a displacement is useful in ‡at spaces but it does not
work in a curved space — a bent arrow is not useful. The appropriate gener-
alization follows from the observation that smoothly curved spaces are “locally
‡at” — by which one means that within a su¢ ciently small region deviations
from ‡atness can be neglected. Therefore one can de…ne the in…nitesimal dis-
placement from a point x = (x1 : : : xn ) by a vector,

d~x = (dx1 : : : dxn ) ; (7.7)

and then de…ne …nite vectors, such as velocities ~v , through multiplication by a


suitably large scalar 1=dt, so that ~v = d~x=dt.
De…ned in this way vectors can no longer be thought of as “contained”in the
original curved space. The set of vectors that one can de…ne at any given point
x of the curved manifold constitute the tangent space Tx at x. An immediate
implication is that vectors at di¤erent locations x1 and x2 belong to di¤erent
spaces, Tx1 and Tx2 , and therefore they cannot be added or subtracted or even
compared without additional structure — we would have to provide criteria that
“connect”the two tangent spaces and stipulate which vector at Tx2 “corresponds
to” or “is the same as” a given vector in Tx1 .
196 Information Geometry

For physics the consequences of curvature are enormous. Concepts that


are familiar and basic in ‡at spaces cannot be de…ned in the curved spaces of
general relativity. For example, the natural de…nition of the relative velocity
of two distant objects involves taking the di¤erence of their two velocities but
in a curved space the operation of subtracting two vectors in di¤erent tangent
spaces is no longer available to us. Similarly, there is no general de…nition of
the total energy or the total momentum of an extended system of particles; the
individual momenta are vectors that live in di¤erent tangent spaces and there
is no unique way to add them.
One objection to using a displacement as the starting point to the construc-
tion of a vector that is visualized as an arrow is that it relies on our intuition
of a curved space as being embedded in a ‡at space of larger dimension. A
number of questions may be raised: What do we gain by introducing these
larger embedding ‡at spaces? Do they physically exist? And what is so special
about ‡atness anyway? So, while visualizing curved spaces in this way can be
useful, it is also a good idea to pursue alternative approaches that do not rely
on embedding.

Vectors as tangents to curves


One alternative approach is to focus our attention directly on the velocities
rather than on the displacements. Introduce a coordinate frame so that a point
x has coordinates xa with a = 1 : : : n. A parametrized curve x( ) is represented
by n functions xa ( ) and the vector ~v tangent to the curve x( ) at the point
labeled by is represented by the n-tuple of real numbers fdxa =d g,
dx1 dxn
~v ( ::: ): (7.8)
d d
(The symbol stands for “is represented by”.) This de…nition relies on the
notion of coordinates which depend on the choice of frame and therefore so do
the components of vectors. When we change to new coordinates,
0 0
xa = f a (x1 : : : xn ) ; (7.9)
the components of the vector in the new frame change accordingly. The new
components are given by the chain rule,2
0 0 0 0
dx1 dxn dxa @xa dxb
~v ( ::: ) where = : (7.10)
d d d @xb d
In this approach a vector at the point x is de…ned as an n-tuple of real numbers
(v 1 : : : v n ) that under a change of coordinate frame transform according to
0
a0 0 0 @xa
v = Xba v b where Xba = : (7.11)
@xb
2 The notation is standard: the primed frame is indicated by priming the indices, not the
0
quantity — that is xa rather than x0a . A derivative with respect to a variable labelled by
an upper index transforms as a variable with a lower index and vice versa. For example,
@f =@xa = @a f , and in eq.(7.10) the index b is summed over.
7.2 Vectors in curved spaces 197

In other words, the vector has di¤erent representations in di¤erent frames but
the vector itself is independent of the choice of coordinates.
The coordinate independence can be made more explicit by introducing the
notion of a basis. A coordinate frame singles out n special vectors f~ea g de…ned
so that the b component of ~ea is
eba = b
a : (7.12)
More explicitly,
~e1 (1; 0 : : : 0); ~e2 (0; 1; 0 : : : 0); : : : ; ~e1 (0; 0 : : : 1) : (7.13)
Any vector ~v can be expressed in terms of the basis vectors,
~v = v a ~ea : (7.14)
The basis vectors in the primed frame f~ea0 g are de…ned in the same way
0
b0
eba0 = a0 : (7.15)
so that using eq.(7.11) we have
0 0
~v = v a ~ea0 = Xba v b ~ea0 = v b ~eb ; (7.16)
where 0
~eb = Xba ~ea0 or, equivalently, ~ea0 = Xab0 ~eb : (7.17)
a
Eq.(7.16) shows that while the components v and the basis vectors ~ea both
depend on the frame, the vector ~v itself is invariant, and eq.(7.17) shows that
the invariance follows from the fact that components and basis vectors transform
according to inverse matrices. Explicitly, using the chain rule,
0 0
0 @xa @xb @xa 0
Xba Xcb0 = 0 = = ca0 : (7.18)
b
@x @x c @xc0
Remark: Eq.(7.16) is the main reason we care about vectors: they are ob-
jects that are independent of the accidents of the particular choice of coordinate
system. Therefore they are good candidates to represent quantities that carry
physical meaning. Conversely, since we can always change coordinates, it is
commonly thought that discussions that avoid coordinates and employ methods
of analysis that are coordinate-free are somehow deeper or more fundamental.
The introduction of coordinates is often regarded as a blemish that is barely
tolerated because it often facilitates computation. However, when we come to
statistical manifolds the situation is di¤erent. In this case coordinates can be
parameters in probability distributions (such as, for example, temperatures or
chemical potentials) that carry a statistical and physical meaning and therefore
have a signi…cance that goes far beyond the mere geometrical role of labelling
points. Thus, there is much to gained from recognizing, on one hand, the
geometrical freedom to assign coordinates and, on the other hand, that often
enough special coordinates are singled out because they carry physically rel-
evant information. The case can therefore be made that when dealing with
statistical manifolds coordinate-dependent methods can in many respects be
fundamentally superior.
198 Information Geometry

Vectors as directional derivatives


There is yet a third way to introduce vectors. Let (x) be a scalar function and
consider its derivative along the parametrized curve x( ) is given by the chain
rule,
d @ dxa @
= = va (7.19)
d @xa d @xa
Note that d =d is independent of the choice of coordinates. Indeed, using the
chain rule
0
d @ a @ @xa a @ 0
= v = v = va : (7.20)
d @xa @xa0 @xa @xa0
Since is any generic function we can write

d @
= va a : (7.21)
d @x

Note further that the partial derivatives @=@xa transform exactly as the basis
vectors, eq.(7.17)
0
@ @xa @ 0 @
= = Xaa ; (7.22)
@x a @xa @xa0 @xa0
so that there is a 1 : 1 correspondence between the directional derivative d=d
and the vector ~v that is tangent to the curve x( ). In fact, we can use one to
represent the other,
d @
~v and ~ea : (7.23)
d @xa
and, since mathematical objects are de…ned purely through their formal rules of
manipulation, it is common practice to ignore the distinction between the two
concepts and set
d @
~v = and ~ea = : (7.24)
d @xa
The partial derivative is indeed appropriate because the “vector” @=@xa is the
derivative along those curves xb ( ) parametrized by the parameter = xa that
are de…ned by keeping the other coordinates constant, xb ( ) = const, for b 6= a.
The vector ~ea has components

@xb
eba = = b
a : (7.25)
@xa

From a physical perspective, however, beyond the rules for formal manipu-
lation mathematical objects are also assigned a meaning, an interpretation, and
it is not clear that the two concepts, the derivative d=d and the tangent vector
~v , should be considered as physically identical. Nevertheless, we can still take
advantage of the isomorphism to calculate using one picture while providing
interpretations using the other.
7.3 Distance and volume in curved spaces 199

7.3 Distance and volume in curved spaces


The notion of a distance between two points is not intrinsic to generic manifolds;
it has to be supplied as an additional structure — the metric tensor. As we shall
see statistical manifolds constitute a most remarkable exception.
To introduce ‘distance’one follows the basic intuition that curved spaces are
locally ‡at: at any point x, within a su¢ ciently small region curvature e¤ects can
be neglected. The idea then is rather simple: within a very small region in the
vicinity of a point x we can always transform from the original coordinates xa to
new coordinates xa^ = f a^ (x1 : : : xn ) that we declare to be locally Cartesian (here
denoted with ^ superscripts). An in…nitesimal displacement has components
given by
@xa^
dxa^ = Xaa^ dxa where Xaa^ = : (7.26)
@xa
In these locally Cartesian coordinates, also called normal coordinates, the cor-
responding in…nitesimal distance can be computed using Pythagoras theorem,
^
d`2 = b dx
^^
a
a
^
dxb : (7.27)

Remark: The fact that at any given point one can always change to normal
coordinates such that Pythagoras’ theorem is locally valid is what de…nes a
Riemannian manifold. However, it is important to realize that while one can
do this at any single arbitrary point of our choice, in general one cannot …nd a
coordinate frame in which eq.(7.27) is simultaneously valid at all points within
an extended region.
Changing back to the original frame
^ ^
d`2 = a
^
b dx dx
^^
a
b
= a
^
b Xa
^^
a Xbb dxa dxb : (7.28)

De…ning the quantities


def a
^ ^
gab = b Xa
^^
a Xbb ; (7.29)
we can write the in…nitesimal Pythagoras theorem in the original generic frame
as
d`2 = gab dxa dxb : (7.30)
The quantities gab are the components of the metric tensor usually abbreviated
to just the ‘metric’. One can easily check that under a coordinate transformation
gab transforms according to
0 0
gab = Xaa Xab ga0 b0 ; (7.31)

so that the in…nitesimal distance d` is independent of the coordinate frame.


To …nd the …nite length between two points along a curve x( ) one integrates
along the curve,
Z Z 1=2
2 2
dxa dxb
`= d` = gab d : (7.32)
1 1
d d
200 Information Geometry

Once we have de…ned a measure of distance we can also measure angles,


areas, volumes and all sorts of other geometrical quantities. To …nd an expres-
sion for the n-dimensional volume element dVn we use the same trick as before:
Transform to normal coordinates xa^ so that the volume element is simply given
by the product
^ ^
dVn = dx1 dx2 : : : dxn^ ; (7.33)
and then transform back to the original coordinates xa using eq.(7.26),
@x
^
dVn = dx1 dx2 : : : dxn = det Xaa^ dn x : (7.34)
@x
This is the volume element written in terms of the coordinates xa but we still
have to calculate the Jacobian of the transformation, j@ x ^=@xj = det Xaa^ . This
is done noting that the transformation of the metric from its Euclidean form
^^
a b to gab , eq.(7.29), is the product of three matrices. Taking the determinant
we get
def 2
g = det(gab ) = det Xaa^ ; (7.35)
so that
det Xaa^ = g 1=2 : (7.36)
We have succeeded in expressing the volume element in terms of the metric
gab (x) in the original coordinates xa . The answer is
dVn = g 1=2 (x)dn x : (7.37)
The volume of any extended region R on the manifold is
Z Z
Vn = dVn = g 1=2 (x)dn x : (7.38)
R R

Example: These ideas are also useful in ‡at spaces when dealing with non-
Cartesian coordinates. The distance element of three-dimensional ‡at space in
spherical coordinates (r; ; ) is
d`2 = dr2 + r2 d 2
+ r2 sin2 d 2
; (7.39)
and the corresponding metric tensor is
0 1
1 0 0
(gab ) = @0 r2 0 A : (7.40)
0 0 r2 sin2
The volume element is the familiar expression
dV = g 1=2 drd d = r2 sin drd d : (7.41)
Important example: A uniform distribution over such a curved manifold is
one which assigns equal probabilities to equal volumes. Therefore,
p(x)dn x / g 1=2 (x)dn x : (7.42)
7.4 Derivations of the information metric 201

7.4 Derivations of the information metric


The distance d` between two neighboring distributions p(xj ) and p(xj + d )
or, equivalently, between the two points and + d , is given by a metric tensor
gab . Our goal is to propose a metric tensor gab for the statistical manifold of
distributions fp(xj )g. We give several di¤erent derivations because this serves
to illuminate the meaning of the information metric, its interpretation, and
ultimately, how it is to be used. The fact that so many di¤erent arguments all
lead to the same tensor is signi…cant. It strongly suggests that the information
metric is special, and indeed, in Section 7.5 we shall show that up to an overall
constant factor re‡ecting the choice of units of length this metric is unique.
Remark: At this point a word of caution (and encouragement) might be called
for. Of course it is possible to be confronted with su¢ ciently singular families
of distributions that are not smooth manifolds and studying their geometry
might seem a hopeless enterprise. Should we give up on geometry? Of course
not. The fact that statistical manifolds can have complicated geometries does
not detract from the value of the methods of information geometry any more
than the existence of physical surfaces with rugged geometries detracts from the
general value of geometry itself.

7.4.1 From distinguishability


We seek a quantitative measure of the extent that two distributions p(xj ) and
p(xj +d ) can be distinguished. The following argument is intuitively appealing.
The advantage of this approach is the emphasis on interpretation — the metric
measures distinguishability — the disadvantage is that the argument does not
address the issue of uniqueness of the metric.
Consider the relative di¤erence,
p(xj + d ) p(xj ) @ log p(xj ) a
= = d : (7.43)
p(xj ) @ a

The expected value of the relative di¤erence, h i, might at …rst sight seem a
good candidate to measure distinguishability, but it does not work because it
vanishes identically,
Z Z
@ log p(xj ) a a @
h i = dx p(xj ) d =d dx p(xj ) = 0: (7.44)
@ a @ a
R
(Depending on the problem the symbol dx will be used to represent either
discrete sums or integrals over one or more dimensions.) However, the variance
does not vanish,
Z
2 2 @ log p(xj ) @ log p(xj ) a b
d` = h i = dx p(xj ) d d : (7.45)
@ a @ b

This is the measure of distinguishability we seek; a small value of d`2 means


that the relative di¤erence is small and the points and + d are di¢ cult
202 Information Geometry

to distinguish. It suggests introducing the matrix gab


Z
def @ log p(xj ) @ log p(xj )
gab = dx p(xj ) (7.46)
@ a @ b

called the Fisher information matrix [Fisher 1925], so that

d`2 = gab d a d b
: (7.47)

Up to now no notion of distance has been introduced. Normally one says that
the reason it is di¢ cult to distinguish two points in, say, the three dimensional
space we seem to inhabit, is that they happen to be too close together. It is very
tempting to invert this intuition and assert that the two points and +d must
be very close together because they are di¢ cult to distinguish. Furthermore,
note that being a variance, d`2 = h 2 i, the quantity d`2 is positive and vanishes
only when the d a vanish. Thus it is natural to interpret gab as the metric tensor
of a Riemannian space. This is the information metric. The realization by C.
R. Rao that gab is a metric in the space of probability distributions [Rao 1945]
gave rise to the subject of information geometry [Amari 1985], namely, the
application of geometrical methods to problems in inference and in information
theory.
Remark: The derivation of (7.46) involved a Taylor expansion, eq.(7.43), to
…rst order in d . One might wonder if keeping higher orders in d would lead
to a “better” metric. The answer is no. What is being done here is de…ning
the metric tensor, not …nding an approximation to it. To illustrate this point
consider the following analogy. The trajectory of a moving particle is given by
xa = xa (t). The position at time t + t is given by the Taylor expansion

1
xa (t + t) = xa (t) + v a (t) t + aa (t) t2 + : : : (7.48)
2
The velocity of the particle is de…ned by the term linear in t. Despite the
Taylor expansion the de…nition of velocity is exact, no approximation is in-
volved; the higher order terms do not constitute an improvement to the notion
of velocity.
Other useful expressions for the information metric are
Z
@p1=2 (xj ) @p1=2 (xj )
gab = 4 dx
@ a @ b
Z 2 1=2
@ p (xj )
= 4 dx p1=2 (xj ) ; (7.49)
@ a@ b

and Z
@ 2 log p(xj ) @ 2 log p
gab = dx p(xj ) = h i: (7.50)
@ a@ b @ a@ b
The coordinates are quite arbitrary; one can freely relabel the points in the
manifold. It is then easy to check that gab are the components of a tensor and
7.4 Derivations of the information metric 203

that the distance d`2 is an invariant, a scalar under coordinate transformations.


Indeed, the transformation
a0 0
= fa ( 1
::: n
) (7.51)

leads to
a0
a @ a a0 @ @ @
d = d and = (7.52)
@ a0 @ a @ a @ a0

so that, substituting into eq.(7.46),


a0 b0
@ @
gab = a b
ga0 b0 (7.53)
@ @

7.4.2 From embedding in a Euclidean space


Consider a discrete variable i = 1 : : : m. The possible probability distributions of
i can be labelled by the probability values themselves: a probability distribution
can be speci…ed by a point p with coordinates (p1 : : : pm ). TheP corresponding
statistical manifold is the simplex Sm 1 = fp = (p1 : : : pm ) : i pi = 1g.
Next change to new coordinates i = (pi )1=2 . In these new coordinates the
equation for the simplex Sm 1 — the normalization condition — reads
Xm
( i )2 = 1 ; (7.54)
i=1

which, if we interpret the i as Cartesian coordinates, is recognized as the equa-


tion of an (m 1)-sphere embedded in an m-dimensional Euclidean space Rm .
This suggests that we assign the simplest possible metric: the distance between
the distribution p(ij ) and its neighbor p(ij + d ) is the Euclidean distance in
Rm , X
d`2 = (d i )2 = ij d i d j : (7.55)
i

Distances between more distant distributions are merely angles de…ned on the
surface of the unit sphere Sm 1 . To express d`2 in terms of the original coordi-
nates pi substitute
1 dpi
d i= (7.56)
2 p1=2 (i)
to get
1 X (dpi )2 1 ij
d`2 = = gij dpi dpj with gij = : (7.57)
4 i p(i) 4 p(i)

Remark: In (7.56) and in gij above we have gone back to the original notation
p(i) rather than pi to emphasize that the repeated index i is not being summed
over.
Except for an overall constant (7.57) is the same information metric (7.47) we
de…ned earlier. Indeed, consider an n-dimensional subspace (n m 1) of the
simplex Sm 1 de…ned by i = i ( 1 ; : : : ; n ). The parameters a , i = 1 : : : n, can
204 Information Geometry

be used as coordinates on the subspace. The Euclidean metric on Rm induces


a metric on the subspace. The distance between p(ij ) and p(ij + d ) is

@ i a@ j
d`2 = ij d
i
d j
d = d
ij
b
@ a @ b
1 X i @ log pi @ log pi a b
= p d d ; (7.58)
4 i @ a @ b

which (except for the factor 1/4) we recognize as the discrete version of (7.46)
and (7.47).
This interesting result does not constitute a “derivation.”There is a priori no
reason why the square root coordinates i should be singled out as special and
attributed a Euclidean metric. But perhaps it helps to lift the veil of mystery
that might otherwise surround the strange expression (7.46).

7.4.3 From embedding in a spherically symmetric space


Instead of embedding the simplex Sm 1 in a very special ‡at Euclidean space we
can slightly improve the derivation of the previous section by embedding Sm 1
in a somewhat more general curved space — a spherically symmetric space.
Just as in the previous section the strategy is to use the known geometry of the
larger embedding space to …nd the metric it induces on the embedded simplex.
The generalization we pursue here o¤ers a number of advantages. It allows
us to …nd the metric tensor in its most general form. As we shall later see in
section 7.5 it applies even for probability distributions that are not normalized.
More importantly, this derivation also has the advantage of emphasizing the
central role played by the rotational symmetry, which will be useful in Chapters
10 and 11 where we derive quantum mechanics as an application of entropic and
information geometry methods.
As in the previous section we transform to new coordinates i = (pi )1=2 . In
these new coordinates P the equation for the simplex Sm 1 — the normalization
condition — reads ( i )2 = 1, which we declare to be the equation of an
(m 1)-sphere embedded in an m-dimensional spherically symmetric space.
What makes an m-dimensional curved space spherically symmetric is that
— like an onion — it can be foliated into spheres. This means that every point
in the space lies on the (m 1)-dimensional “surface” of some sphere,
Xm
( i )2 = r2 = const. (7.59)
i=1
Unlike the ‡at Euclidean case, in these curved spaces the “radius”r is not to be
interpreted as the radial distance to the center; r is just a quantity that labels
the di¤erent spheres.
To construct the metric of a generic spherically symmetric space consider a
short segment that stretches from i to i + d i . The length of a segment is
given by the Pythagoras form

d`2 = d`2r + d`2t ; (7.60)


7.4 Derivations of the information metric 205

where d`r is the length of the segment in the radial direction (i.e., normal to the
sphere) and d`t is the length tangent to the sphere. To calculate d`r consider
two neighboring spheres of radii r and r + dr. Di¤erentiating (7.59) gives,
Xm i d i
dr = : (7.61)
i=1 r
If the sphere were embedded in ‡at space dr would itself be the radial distance
between the two spheres. In our curved space it is not; the best we can do is to
claim the actual radial distance is proportional to dr. Therefore
a(r2 ) P i i 2
d`2r = i d (7.62)
r2
where by spherical symmetry the (positive) function a(r2 ) depends only on r2
so that it is constant over the surface of the sphere. To calculate the tangential
length d`t we note that the actual geometry of the sphere is independent of
the space in which it is embedded. If it were embedded in ‡at space then the
tangential length would be
P
(tang. length)2 = i (d i )2 dr2 (7.63)
In our curved space, the actual tangential distance can involve an overall scale
factor, !
2
2 2 P i 2 1P i i
d`t = b(r ) i (d ) d : (7.64)
r2 i
By spherical symmetry the (positive) scale factor b(r2 ) depends only on r. Sub-
stituting into (7.60), we see that the metric of a generic spherically symmetric
space involves two arbitrary positive functions of r2 ,
1 P i i 2 P
d`2 = 2 a(r2 ) b(r2 ) i d + b(r2 ) i (d i )2 : (7.65)
r
Expressed in terms of the original pi coordinates the metric of a spherically
symmetric space takes the form
P i 2 P 1
d`2 = A(jpj) i dp + B(jpj) i i (dpi )2 ; (7.66)
2p
where
1 1
A(jpj) = [a(jpj) b(jpj)] and B(jpj) = b(jpj) (7.67)
4jpj 2
and
def P
jpj = i pi = r2 : (7.68)
Setting P P
i i
jpj = ip = 1 and i dp =0; (7.69)
gives the metric induced on the simplex Sm 1 ,
P 1
d`2 = B(1) i i (dpi )2 : (7.70)
p
Up to an overall scale this result agrees with previous expressions for the infor-
mation metric.
206 Information Geometry

7.4.4 From asymptotic inference


We have two very similar probability distributions. Which one do we prefer?
To decide we collect data in N independent trials. Then the question becomes:
To what extent does the data support one distribution over the other? This is
a typical inference problem. To be explicit consider multinomial distributions
speci…ed by p = (p1 : : : pn ). (Here it is slightly more convenient to revert to the
notation where indices appear as subscripts.) Suppose the data consists of the
numbers (m1 : : : mm ) where mi is the number of times that outcome i occurs.
The corresponding frequencies are
mi Xn
fi = with mi = N : (7.71)
N i=1

The probability of a particular frequency distribution f = (f1 : : : fn ) is


N!
PN (f jp) = pm1 : : : pm
n
n
: (7.72)
m1 ! : : : mn ! 1
For su¢ ciently large N and mi we can use Stirling’s approximation [see eq.(6.63)],
to get Q
PN (f jp) CN ( fi ) 1=2 exp(N S[f; p]) (7.73)
i

where CN is a normalization constant and S[f; p] is the entropy given by eq.(6.12).


The Gibbs inequality S[f; p] 0, eq.(4.27), shows that for large N the prob-
ability PN (f jp) shows an exceedingly sharp peak. The most likely fi is pi —
this is the weak law of large numbers.
Now we come to the inference: the values of p best supported by the data f
are inferred from Bayes rule,
PN (pjf ) / exp(N S[f; p]) ; (7.74)
where we have used the fact that Q for large N the exponential eN S dominates
both the prior and the pre-factor ( fi ) 1=2 . For large N the data fi supports
the value pi = fi . But the distribution PN (pjf ) is not in…nitely sharp; there
is some uncertainty. Distributions with parameters p0i = fi + pi can only be
distinguished from pi = fi provided pi lies outside a small region of uncertainty
de…ned roughly by
N S[p; p0 ] = N S[p; p + p] 1 (7.75)
so that the probability PN (p0 jf ) is down by e 1
from the maximum. Expanding
to second order,
P pi 1 P ( pi )2
S[p; p + p] = pi log (7.76)
i pi + p i 2 i pi
Thus, the nearest that two neighboring points p and p + p can be while still
being distinguishable in N trials is such that
N P ( pi )2
1: (7.77)
2 i pi
7.4 Derivations of the information metric 207

As N increases the resolution p with p which we can distinguish neighboring


distributions improves roughly as 1= N .
We can now de…ne a “statistical” distance on the simplex Sm 1 . The argu-
ment below is given in [Wootters 1981]; see also [Balasubramanian 1997]. We
de…ne the length of a curve between two given points by counting the number
of distinguishable points that one can …t along the curve,

1
number of distinguishable (in N trials)
Statistical length = ` = lim p
N !1 distributions that …t along the curve
N=2
p (7.78)
Since the number of pdistinguishable points grows as N it is convenient to
introduce
p a factor 1= N so that there is a …nite limit as N ! 1. The factor
2 is purely conventional. p
Remark: It is not actually necessary to include the 2=N factor; this leads to
a notion of statistical length `N de…ned on the space of N -trial multinomials.
(See section 7.6.)
More explicitly, let the curve p = p( ) be parametrized by . The separation
between two neighboring distributions that can barely be resolved in N trials
is
1=2
N P 1 dpi 2 2 N P 1 dpi 2
( ) 1 or ( ) : (7.79)
2 i pi d 2 i pi d
The number of distinguishable distributions within the interval d is d = and
the corresponding statistical length, eq.(7.78), is

1=2 1=2
P 1 dpi 2 P (dpi )2
d` = ( ) d = (7.80)
i pi d i pi

The length of the curve from 0 to 1 is


Z 1
P (dpi )2
`= d` where d`2 = : (7.81)
0 i pi

Thus, the width of the ‡uctuations is the unit used to de…ne a local measure
of “distance”. To the extent that ‡uctuations are intrinsic to statistical prob-
lems the geometry they induce is unavoidably hard-wired into the statistical
manifolds. The statistical P or distinguishability length di¤ers from a possible
Euclidean distance d`2E = (dpi )2 because the ‡uctuations are not uniform
over the space Sm 1 which a¤ects our ability to resolve neighboring points.
Equation (7.81) agrees the previous de…nitions of the information metric.
Consider the n-dimensional subspace (n m 1) of the simplex Sm 1 de…ned
by pi = pi ( 1 ; : : : ; n ). The distance between two neighboring distributions in
this subspace, p(ij ) and p(ij + d ), is

Pm ( p )2 P
m 1 @p
i i a @pj
d`2 = = a
d d b = gab d a d b
(7.82)
i=1 pi i;j=1 pi @ @ b
208 Information Geometry

where
P
m @ log pi @ log pi
gab = pi ; (7.83)
i=1 @ a @ b
which is the discrete version of (7.46).

7.4.5 From relative entropy


The relation we uncovered above between the information metric and entropy,
eq.(7.76), is not restricted to multinomials; it is quite general. Consider the
entropy of one distribution p(xj 0 ) relative to another p(xj ),
Z
0 p(xj 0 )
S( ; ) = dx p(xj 0 ) log : (7.84)
p(xj )
We study how this entropy varies when 0 = + d is in the close vicinity of a
given . As we had seen in section 4.2 –recall the Gibbs inequality S( 0 ; ) 0
with equality if and only if 0 = — the entropy S( 0 ; ) attains an absolute
maximum at 0 = . Therefore, the …rst nonvanishing term in the Taylor
expansion about is second order in d
1 @ 2 S( 0 ; )
S( + d ; ) = d ad b
+ ::: 0; (7.85)
2 @ 0a @ 0b 0=

which suggests de…ning the distance d` by


1 2
S( + d ; ) = d` : (7.86)
2
The second derivative is
Z
@ 2 S( 0 ; ) @ p(xj 0 ) @p(xj 0 )
= dx log + 1
@ 0a @ 0b @ 0a p(xj ) @ 0b
Z 0 0
@ log p(xj ) @p(xj ) p(xj 0 ) @ 2 p(xj 0 )
= dx + [log + 1] :
@ 0a @ 0b p(xj ) @ 0a @ 0b
Evaluating at 0 = and using the fact that the p(xj 0 ) are normalized gives
the desired result,
Z
@S 2 ( 0 ; ) @ log p(xj ) @ log p(xj )
= dx p(xj ) = gab : (7.87)
@ 0a @ 0b 0= @ a @ b

7.5 Uniqueness of the information metric


The most remarkable fact about the information metric is that it is essentially
unique: except for a constant scale factor it is the only Riemannian metric that
adequately takes into account the nature of the points of a statistical manifold,
namely, that these points are not “structureless”, that they are probability
µ
distributions. This theorem was …rst proved by N. Cencov within the framework
of category theory [Cencov 1981]. The proof below follows the treatment in
[Campbell 1986].
7.5 Uniqueness of the information metric 209

Markov embeddings
Consider a discrete variable i = 1; : : : ; n and let the probability of any particular
i be Pr(i) = pi . In practice the limitation to discrete variables is not very
serious because we can choose an n large enough to approximate a continuous
distribution to any desirable degree. However, it is possible to imagine situations
where the continuum limit is tricky — here we avoid such situations.
The set of numbers p = (p1 ; : : : pn ) can be used as coordinates to de…ne
a point on a statistical manifold. In this particular P case the manifold is the
(n 1)-dimensional simplex Sn 1 = fp = (p1 ; : : : pn ) : pi = 1g. The argument
is considerably simpli…ed by considering instead the n-dimensional space of non-
normalized distributions. This is the positive “octant” Rn+ = fp = (p1 ; : : : pn ) :
pi > 0g. The boundary is explicitly avoided so that Rn+ is an open set.
Next we introduce the notion of Markov mappings. The set of values of i
can be grouped or partitioned into M disjoint subsets with 2 M n. Let
A = 1 : : : M label the subsets, then the probability of the Ath subset is
X
Pr(A) = P A = pi : (7.88)
i2A

The space of these coarser


P probability distributions is the simplex SM 1 =
fP = (P 1 ; : : : P M ) : P A = 1g. The corresponding space of non-normalized
+
distributions if the positive octant RM = fP = (P 1 ; : : : P M ) : P A > 0g.
Thus, the act of partitioning (or grouping, or coarse graining) has produced
+
a mapping G : Rn+ ! RM with P = G(p) given by eq.(7.88). This is a many-
to-one map; it has no inverse. An interesting map that runs in the opposite
+
direction RM ! Rn+ can be de…ned by introducing conditional probabilities.
Let
i Pr(ijA) if i 2 A
qA = (7.89)
0 if i 2
=A
with X X
i
qA = Pr(ijA) = 1 : (7.90)
i i2A
i
Thus, to each choice of the set of numbers fqA g we can associate a one-to-one
+ +
map Q : Rm ! Rn with p = Q(P ) de…ned by

pi = q A
i
PA : (7.91)
i
This is a sum over A but since qA = 0 unless i 2 A only one term in the sum is
non-vanishing and the map is clearly invertible. These Q maps, called Markov
+
mappings, de…ne an embedding of RM into Rn+ . Markov mappings preserve
normalization, X X X
pi = i
qA PA = PA : (7.92)
i i A

Example: A coarse graining map G for the case of R3+ ! R2+ is

G(p1 ; p2 ; p3 ) = (p1 ; p2 + p3 ) = (P 1 ; P 2 ) : (7.93)


210 Information Geometry

One Markov map Q running in the opposite direction R2+ ! R3+ could be

1 2
Q(P 1 ; P 2 ) = (P 1 ; P 2 ; P 2 ) = (p1 ; p2 ; p3 ) : (7.94)
3 3
i
This particular map is de…ned by setting all qA = 0 except q11 = 1, q32 = 1=3,
3
and q2 = 2=3.
Example: We can use binomial distributions to analyze the act of tossing a
coin (the outcomes are either heads or tails) or, equally well, the act of throwing
a die (provided we only care about outcomes that are either even or odd). This
amounts to embedding the space of coin distributions (which are binomials,
+
RM with M = 2) as a subspace of the space of die distributions (which are
multinomials, Rn+ with n = 6).
To minimize confusion between the two spaces we will use lower case symbols
to refer to the original larger space Rn+ and upper case symbols to refer to the
+
coarse grained embedded space RM .
Having introduced the notion of Markov embeddings we can now state the
i
basic idea behind Campbell’s argument. For a …xed choice of fqA g, that is for
a …xed Markov map Q, the distribution P and its image p = Q(P ) represent
exactly the same information. In other words, whether we talk about heads/tails
outcomes in coins or about even/odd outcomes in dice, binomials are binomials.
Therefore the map Q is invertible. The Markov image Q(SM 1 ) of the simplex
SM 1 in Sn 1 is statistically “identical” to SM 1 ,

Q(SM 1) = SM 1 ; (7.95)

in the sense that it is just as easy or just as di¢ cult to distinguish the two
distributions P and P + dP as it is to distinguish their images p = Q(P ) and
p + dp = Q(P + dP ). Whatever geometrical relations are assigned to distribu-
tions in SM 1 , exactly the same geometrical relations should be assigned to the
corresponding distributions in Q(SM 1 ). Thus Markov mappings are not just
embeddings, they are congruent embeddings; distances between distributions in
+
RM should match the distances between the corresponding images in Rn+ .
Our goal is to …nd the Riemannian metrics that are invariant under Markov
mappings. It is easy to see why imposing such invariance is extremely restrictive:
+
The fact that distances computed in RM must agree with distances computed in
+
subspaces of Rn introduces a constraint on the allowed metric tensors; but we
+
can always embed RM in spaces of larger and larger dimension which imposes
a larger and larger number of constraints. It could very well have happened
that no Riemannian metric managed to survive such restrictive conditions; it is
quite remarkable that some do and it is even more remarkable that (up to an
uninteresting scale factor which amounts to a choice of the unit of distance) the
surviving Riemannian metric is unique.
The invariance of the metric is conveniently expressed as an invariance of
+
the inner product: inner products among vectors in RM should coincide with
the inner products among the corresponding images in Rn+ : Let vectors tangent
7.5 Uniqueness of the information metric 211

+
to RM be denoted by
~ = V A @ = V AE
V ~A ; (7.96)
@P A
~ A g is a coordinate basis. The inner product of two such vectors is
where fE

~ ;U
~ )M = g (M )
A B
(V AB V U (7.97)

where the metric is


(M ) def ~ ~
gAB = (E A ; EB )M : (7.98)
Similarly, vectors tangent to Rn+ are denoted by

@
~v = v i = v i~ei ; (7.99)
@pi
and the inner product of two such vectors is
(n)
(~v ; ~u)n = gij v i uj (7.100)

where
(n) def
gij = (~ei ; ~ej )n : (7.101)
~ tangent to Rm
The images of vectors V +
under Q are obtained from eq.(7.91)

@ @pi @ i @ ~ A = qA
i
Q A
= A i
= qA or Q E ~ei ; (7.102)
@P @P @p @pi
which leads to
~ = ~v
Q V with v i = qA
i
VA : (7.103)
Therefore, the invariance or isometry we want to impose is expressed as
~ ;U
(V ~ )M = (Q V
~ ;Q U
~ )n = (~v ; ~u)n : (7.104)

The Theorem
+
Let ( ; )M be the inner product on RM for any M 2 f2; 3; : : :g. The theorem
states that the metric is invariant under Markov embeddings if and only if

(M ) AB
gAB = (eA ; eB )M = (jP j) + jP j (jP j) ; (7.105)
PA
def P
where jP j = A P A , and and are smooth (C 1 ) functions with > 0 and
+ > 0. The proof is given in the next section.
An important by-product of this theorem is that (7.105) has turned out to be
the metric of a generic spherically symmetric space, eq.(7.66). In other words,

Invariance under Markovian embeddings implies that the statistical man-


ifold of unnormalized probabilities is spherically symmetric.
212 Information Geometry

As we shall see in Chapters 10 and 11 this fact will turn out to be important in
the derivation of quantum mechanics.
+
The metric above refers to the positive cone RM but ultimately we are
interested in the metric induced on the simplex SM 1 de…ned by jP j = 1. In
order to …nd the induced metric we …rst show that vectors that are tangent to
the simplex SM 1 are such that
def
X
jV j = VA =0 : (7.106)
A

+
Indeed, consider the derivative of any function f = f (jP j) de…ned on RM along
the direction de…ned by V~,

@f df @jP j df
0=VA A
=VA A
= jV j ; (7.107)
@P djP j @P djP j

where we used @jP j=@P A = 1. Therefore jV j = 0.


~ and U
Next consider the inner product of any two vectors V ~,
X AB
~ ;U
(V ~ )M = V AU B (jP j) + jP j (jP j)
PA
AB
X V AU A
= (jP j)jV jjU j + jP j (jP j) : (7.108)
PA
A

For vectors tangent to the simplex SM 1 this simpli…es to


X V AU A
~ ;U
(V ~ )M = (1) : (7.109)
PA
A

Therefore the choice of the function (jP j) is irrelevant and the corresponding
metric on SM 1 is determined up to a multiplicative constant (1) =

AB
gAB = ; (7.110)
PA
which is the information metric that was heuristically suggested earlier, eqs.(7.57)
and (7.58). Indeed, transforming to a generic coordinate frame P A = P A ( 1 ; : : : ; M
)
yields
d`2 = gAB P A P B = gab d a d b (7.111)
with
X @ log P A @ log P A
gab = PA : (7.112)
A @ a @ b

The Proof
The strategy is to consider special cases of Markov embeddings to determine
what kind of constraints they impose on the metric. First we consider the
7.5 Uniqueness of the information metric 213

+
consequences of invariance under the family of Markov maps Q0 that embed RM
0
into itself. In this case n = M and the action of Q is to permute coordinates.
A simple example in which just two coordinates are permuted is

(p1 ; : : : pa ; : : : pb ; : : : pM ) = Q0 (P 1 ; : : : P M )
= (P 1 ; : : : P b ; : : : P a ; : : : P m ) (7.113)
i
The required qA are
a b b a i i
qA = A ; qA = A and qA = A for A 6= a; b ; (7.114)
~ A = q i ~ei , gives
so that eq.(7.102), Q0 E A

~ a = ~eb ;
Q0 E ~ b = ~ea
Q0 E and ~ A = ~eA
Q0 E for A 6= a; b : (7.115)

The invariance
~ A; E
(E ~ B )M = (Q0 E
~ A ; Q0 E
~ B )M (7.116)
yields,
(M ) (M ) (M ) (M )
gaA (P ) = gbA (p) and gbA (P ) = gaA (p) for A 6= a; b (7.117)
(M ) (M ) (M ) (M )
gaa (P ) = gbb (p) and gbb (P ) = gaa (p) (7.118)
(M ) (M )
gAB (P ) = gAB (p) for A; B 6= a; b :
+
These conditions are useful for points along the line through the center of RM ,
1 2 M
P = P = : : : = P . Let Pc = (c=M; : : : ; c=M ) with c = jPc j; we have
pc = Q0 (Pc ) = Pc . Imposing eqs.(7.117) and (7.118) for all choices of the pairs
(a; b) implies
(M )
gAA (Pc ) = FM (c)
(M )
gAB (Pc ) = GM (c) for A 6= B ; (7.119)

where FM and GM are some unspeci…ed functions.


+ +
Next we consider the family of Markov maps Q00 : RM ! RkM with k 2

Q00 (P 1 ; : : : P M ) = (p1 ; : : : pkM )


P1 P1 P2 P2 PM PM
= ( ;::: ; ;::: ;:::; ;::: ): (7.120)
|k {z k } |k {z k } | k {z k }
k tim es k tim es k tim es
00
Q is implemented by choosing

i 1=k if i 2 fk(A 1) + 1; : : : kAg


qA = (7.121)
0 if i2= fk(A 1) + 1; : : : kAg

Under the action of Q00 vectors are transformed according to eq.(7.102),

~ A = qA 1
Q00 E i
~ei = ~ek(A 1)+1 + : : : + ~ekA (7.122)
k
214 Information Geometry

so that the invariance

~ A; E
(E ~ B )M = (Q00 E
~ A ; Q00 E
~ B )kM (7.123)

yields,
kA
X
(M ) 1 (kM )
gAB (P ) = gij (p) : (7.124)
k2
i; j=k(A 1)+1

Along the center lines, Pc = (c=M; : : : ; c=M ) and pc = (c=kM; : : : ; c=kM ), equa-
tions (7.119) and (7.124) give

1 k 1
FM (c) = FkM (c) + GkM (c) (7.125)
k k
and
GM (c) = GkM (c) : (7.126)
But this holds for all values of M and k, therefore GM (c) = (c) where is a
function independent of M . Furthermore, eq.(7.125) can be rewritten as

1 1
[FM (c) (c)] = [FkM (c) (c)] = (c) ; (7.127)
M kM
where (c) is a function independent of the integer M . Indeed, for any two
integers M1 and M2 we have

1 1 1
[FM1 (c) (c)] = [FM1 M2 (c) (c)] = [FM2 (c) (c)] :
M1 M1 M2 M2
(7.128)
Therefore,
FM (c) = (c) + M (c) ; (7.129)
and for points along the center line,
(M )
gAB (Pc ) = (c) + M (c) AB : (7.130)

So far the invariance under the special Markov embeddings Q0 and Q00 has
+
allowed us to …nd the metric of RM for arbitrary M but only along the center
(M ) +
line P = Pc for any c > 0. To …nd the metric gAB (P ) at any arbitrary P 2 RM
+
000
we show that it is possible to cleverly choose the embedding Q : RM ! Rn+
so that the image of P can be brought arbitrarily close to the center line of
Rn+ , Q000 (P ) pc , where the metric is known. Indeed, consider the embeddings
+
Q000 : RM ! Rn+ de…ned by

P1 P1 P2 P2 PM PM
Q000 (P 1 ; : : : P M ) = ( ;::: ; ;::: ;:::; ;::: ): (7.131)
k k k k k k
| 1 {z 1} | 2 {z 2} | M {z M}
k1 tim es k2 tim es km tim es
7.5 Uniqueness of the information metric 215

Q000 is implemented by choosing


8
>
> 1=kA if i 2 f(k1 + : : : + kA 1 + 1); (k1 + : : : + kA 1 + 2);
<
i : : : ; (k1 + : : : + kA )g
qA =
>
> 0 if i 2
= f(k1 + : : : + kA 1 + 1); (k1 + : : : + kA 1 + 2);
:
: : : ; (k1 + : : : + kA )g
(7.132)
+
Next note that any point P in RM can be arbitrarily well approximated by
points of the “rational” form
ck1 ck2 ckM
P = ; ;::: ; (7.133)
n n n
P
where the ks are positive integers and kA = n and jP j = c. For these rational
points the action of Q000 is
c c c
Q000 (P 1 ; : : : P M ) = qA
i
PA = ; ;::: = pc (7.134)
n n n
which lies along the center line of Rn+ where the metric is known, eq.(7.130).
The action of Q000 on vectors, eq.(7.102), gives

~ A = qA 1
Q000 E i
~ei = ~ek1 +:::+kA 1 +1
+ : : : + ~ek1 +:::+kA : (7.135)
kA
Using eq.(7.130) the invariance
~ A; E
(E ~ B )M = (Q000 E
~ A ; Q000 E
~ B )n (7.136)

yields, for A = B,
k1 +:::+k
X A
(M ) 1 (n)
gAA (P ) = 2 gij (pc )
(kA ) i; j=k1 +:::+kA 1 +1

k1 +:::+k
X A
1
= 2 [ (c) + n (c) ij ]
(kA ) i; j=k1 +:::+kA 1 +1

1 h i
2
= 2 (kA ) (c) + kA n (c)
(kA )
n c (c)
= (c) + (c) = (c) + ; (7.137)
kA PA
where we used eq.(7.133), P A = ckA =n. Similarly, for A 6= B,
k1 +:::+k
X A k1 +:::+k
X B
(M ) 1 (n)
gAB (P ) = gij (pc ) (7.138)
kA kB
i=k1 +:::+kA 1 +1 j=k1 +:::+kB 1 +1

1
= kA kB (c) = (c) : (7.139)
kA kB
216 Information Geometry

Therefore,
(M ) ~ A; E
~ B iM = (c) + c (c) AB
gAB = hE ; (7.140)
PA
with c = jP j. This almost concludes the proof.
The sign conditions on and follow from the positive-de…niteness of inner
products. Using eq.(7.108),

X VA 2
~ ;V
(V ~ ) = jV j2 + jP j ; (7.141)
PA
A

~ ;V
we see that for vectors with jV j = 0, (V ~ ) 0 implies that > 0, while for
A A
vectors with V = KP , where K is any constant we have
~ ;V
(V ~ ) = K 2 jP j2 ( + ) > 0 ) + >0: (7.142)

~ ;V
Conversely, we show that if these sign conditions are satis…ed then (V ~) 0
for all vectors. Using Cauchy’s inequality,
2
P 2 P 2 P
xi yi kxi yi k ; (7.143)
i i i

where k:k denotes the modulus, we have


2
!
2 2
P A P VB P P A
P B
VA V : (7.144)
A B P A A

Therefore,

X VA 2
~ ;V
(V ~ ) = jV j2 + jP j jV j2 ( + ) 0; (7.145)
PA
A

with equality if and only if all V A = 0.


We have just proved that for invariance under Markov embeddings it is
necessary that the metrics be of the form (7.140). It remains to prove the
converse, that this condition is sufficient. This is much easier. Indeed,
$$(Q\vec{E}_A,\,Q\vec{E}_B)_n = q_A^i q_B^j\,(\vec{e}_i,\vec{e}_j)_n = \sum_{ij} q_A^i q_B^j\Big[\beta(|p|) + |p|\,\alpha(|p|)\frac{\delta_{ij}}{p^i}\Big]\;. \qquad (7.146)$$
But as noted earlier, Markov mappings $p^i = q_A^i P^A$ are such that $\sum_i q_A^i = 1$ and
they preserve normalization, $|P| = |p|$, therefore
$$(Q\vec{E}_A,\,Q\vec{E}_B)_n = \beta(|P|) + |P|\,\alpha(|P|)\sum_i \frac{q_A^i q_B^i}{p^i}\;. \qquad (7.147)$$

Furthermore, since $q_A^i = 0$ unless $i \in A$,
$$\sum_i \frac{q_A^i q_B^i}{p^i} = \delta_{AB}\sum_i \frac{q_A^i}{P^A} = \frac{\delta_{AB}}{P^A}\;, \qquad (7.148)$$
which finally leads to
$$(Q\vec{E}_A,\,Q\vec{E}_B)_n = \beta(|P|) + |P|\,\alpha(|P|)\frac{\delta_{AB}}{P^A} = (\vec{E}_A,\vec{E}_B)_M\;, \qquad (7.149)$$
which concludes the proof.
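The invariance just proved is easy to test numerically. The sketch below (not part of the text; the particular functions and partition sizes are arbitrary choices for illustration) builds a metric of the form (7.140), pushes the basis vectors forward through a Markov embedding, and checks that the inner products are preserved.

```python
import numpy as np

# Numerical check that metrics of the form
#   g_AB = beta(c) + c*alpha(c)*delta_AB / P^A ,  c = |P|,
# are invariant under Markov embeddings Q : R_M^+ -> R_n^+ .

def metric(P, alpha, beta):
    c = P.sum()
    return beta(c) + c * alpha(c) * np.diag(1.0 / P)

# arbitrary choices satisfying the sign conditions alpha > 0, alpha + beta > 0
alpha = lambda c: 1.0 + c**2
beta  = lambda c: np.sin(c)**2

k = [2, 3, 4]                  # partition sizes k_A, so n = 9, M = 3
M, n = len(k), sum(k)
P = np.array([0.2, 0.5, 0.3])  # a point in R_M^+

# Markov embedding: q_A^i = 1/k_A on block A, zero elsewhere
Q = np.zeros((n, M))
start = 0
for A, kA in enumerate(k):
    Q[start:start + kA, A] = 1.0 / kA
    start += kA

p = Q @ P                      # image point; note |p| = |P|
gM = metric(P, alpha, beta)    # metric on R_M^+ at P
gn = metric(p, alpha, beta)    # metric on R_n^+ at p

# push-forwards of the basis vectors E_A are the columns of Q
print(np.allclose(gM, Q.T @ gn @ Q))   # True: (E_A, E_B)_M = (Q E_A, Q E_B)_n
```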

7.6 The metric for some common distributions


Multinomial distributions
The statistical manifold of multinomials,
$$P_N(n|\theta) = \frac{N!}{n_1!\cdots n_m!}\,\theta_1^{n_1}\cdots\theta_m^{n_m}\;, \qquad (7.150)$$
where
$$n = (n_1\ldots n_m) \quad\text{with}\quad \sum_{i=1}^m n_i = N \quad\text{and}\quad \sum_{i=1}^m \theta_i = 1\;, \qquad (7.151)$$
is the simplex $\mathbf{S}_{m-1}$. The metric is given by eq.(7.83),
$$g_{ij} = \sum_n P_N\,\frac{\partial\log P_N}{\partial\theta_i}\,\frac{\partial\log P_N}{\partial\theta_j} \quad\text{where}\quad 1 \leq i,j \leq m-1\;. \qquad (7.152)$$
The result is
$$g_{ij} = \Big\langle\Big(\frac{n_i}{\theta_i}-\frac{n_m}{\theta_m}\Big)\Big(\frac{n_j}{\theta_j}-\frac{n_m}{\theta_m}\Big)\Big\rangle\;, \qquad (7.153)$$
which, on computing the various correlations, gives
$$g_{ij} = \frac{N}{\theta_i}\,\delta_{ij} + \frac{N}{\theta_m} \quad\text{where}\quad 1 \leq i,j \leq m-1\;. \qquad (7.154)$$

A somewhat simpler expression can be obtained by extending the range of the
indices to include $i,j = m$. This is done as follows. The distance $d\ell$ between
neighboring distributions is
$$d\ell^2 = \sum_{i,j=1}^{m-1}\Big(\frac{N}{\theta_i}\,\delta_{ij} + \frac{N}{\theta_m}\Big)\,d\theta_i\,d\theta_j\;. \qquad (7.155)$$
Using
$$\sum_{i=1}^m \theta_i = 1 \;\Longrightarrow\; \sum_{i=1}^m d\theta_i = 0\;, \qquad (7.156)$$

the second sum can be written as
$$\frac{N}{\theta_m}\sum_{i,j=1}^{m-1} d\theta_i\,d\theta_j = \frac{N}{\theta_m}\,(d\theta_m)^2\;. \qquad (7.157)$$
Therefore,
$$d\ell^2 = \sum_{i,j=1}^{m} g_{ij}\,d\theta_i\,d\theta_j \quad\text{with}\quad g_{ij} = \frac{N}{\theta_i}\,\delta_{ij}\;. \qquad (7.158)$$

Remark: As we saw in the previous section, eq.(7.112), the information metric
is defined up to an overall multiplicative factor. This arbitrariness amounts to a
choice of units. We see here that the distance $d\ell$ between $N$-trial multinomials
contains a factor $\sqrt{N}$. It is a matter of convention whether we decide to include
such factors or not — that is, whether we want to adopt the same length scale
when discussing two different statistical manifolds such as $\mathbf{S}_{m-1}^{(N)}$ and $\mathbf{S}_{m-1}^{(N')}$.

A uniform distribution over the simplex $\mathbf{S}_{m-1}^{(N)}$ is one which assigns equal
probabilities to equal volumes,
$$P(\theta)\,d^{m-1}\theta \;\propto\; g^{1/2}\,d^{m-1}\theta \quad\text{with}\quad g = \frac{N^{m-1}}{\theta_1\theta_2\cdots\theta_m}\;. \qquad (7.159)$$
In the particular case of binomial distributions, $m = 2$ with $\theta_1 = \theta$ and
$\theta_2 = 1-\theta$, the results above become
$$g = g_{11} = \frac{N}{\theta(1-\theta)}\;, \qquad (7.160)$$
so that the uniform distribution over $\theta$ (with $0 < \theta < 1$) is
$$P(\theta)\,d\theta \;\propto\; d\ell = \Big[\frac{N}{\theta(1-\theta)}\Big]^{1/2} d\theta\;. \qquad (7.161)$$
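Equation (7.160) can be verified directly from the definition (7.152). The following sketch (not from the text; the values of N and theta are arbitrary) computes the expectation of the squared score function for an N-trial binomial and compares it with N/theta(1-theta).

```python
import numpy as np
from scipy.stats import binom

# Check of eq.(7.160): the information metric of an N-trial binomial
# equals N / (theta * (1 - theta)).
N, theta = 10, 0.3
n = np.arange(N + 1)
P = binom.pmf(n, N, theta)

# score function  d log P_N / d theta  (with theta_2 = 1 - theta eliminated)
score = n / theta - (N - n) / (1.0 - theta)

g_numeric = np.sum(P * score**2)
g_formula = N / (theta * (1.0 - theta))
print(g_numeric, g_formula)   # both ~ 47.62
```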

Canonical distributions
Let $z$ denote the microstates of a system (e.g., points in phase space) and let
$m(z)$ be the underlying measure (e.g., a uniform density on phase space). The
space of macrostates is a statistical manifold: each macrostate is a canonical
distribution (see sections 4.10 and 5.4) obtained by maximizing entropy $S[p,m]$
subject to $n$ constraints $\langle f^a\rangle = F^a$ for $a = 1\ldots n$, plus normalization,
$$p(z|F) = \frac{1}{Z(\lambda)}\,m(z)\,e^{-\lambda_a f^a(z)} \quad\text{where}\quad Z(\lambda) = \int dz\; m(z)\,e^{-\lambda_a f^a(z)}\;. \qquad (7.162)$$
The set of numbers $F = (F^1\ldots F^n)$ determines one point $p(z|F)$ on the statistical
manifold so we can use the $F^a$ as coordinates.

First, here are some useful facts about canonical distributions. The Lagrange
multipliers $\lambda_a$ are implicitly determined by
$$\langle f^a\rangle = F^a = -\frac{\partial\log Z}{\partial\lambda_a}\;, \qquad (7.163)$$

and it is straightforward to show that a further derivative with respect to $\lambda_b$
yields the covariance matrix. Indeed,
$$\frac{\partial F^a}{\partial\lambda_b} = -\frac{\partial}{\partial\lambda_b}\Big(\frac{1}{Z}\frac{\partial Z}{\partial\lambda_a}\Big)
= \frac{1}{Z^2}\frac{\partial Z}{\partial\lambda_a}\frac{\partial Z}{\partial\lambda_b} - \frac{1}{Z}\frac{\partial^2 Z}{\partial\lambda_a\partial\lambda_b} \qquad (7.164)$$
$$= F^a F^b - \langle f^a f^b\rangle\;. \qquad (7.165)$$
Therefore
$$C^{ab} \stackrel{\text{def}}{=} \langle(f^a - F^a)(f^b - F^b)\rangle = -\frac{\partial F^a}{\partial\lambda_b}\;. \qquad (7.166)$$
Next, using the chain rule
$$\delta_a^c = \frac{\partial\lambda_a}{\partial\lambda_c} = \frac{\partial\lambda_a}{\partial F^b}\,\frac{\partial F^b}{\partial\lambda_c}\;, \qquad (7.167)$$
we see that the matrix
$$C_{ab} = -\frac{\partial\lambda_a}{\partial F^b} \qquad (7.168)$$
is the inverse of the covariance matrix,
$$C_{ab}\,C^{bc} = \delta_a^c\;.$$
Furthermore, using eq.(4.92), we have
$$C_{ab} = -\frac{\partial^2 S(F)}{\partial F^a\partial F^b}\;. \qquad (7.169)$$
The information metric is
$$g_{ab} = \int dz\; p(z|F)\,\frac{\partial\log p(z|F)}{\partial F^a}\,\frac{\partial\log p(z|F)}{\partial F^b}
= \frac{\partial\lambda_c}{\partial F^a}\,\frac{\partial\lambda_d}{\partial F^b}\int dz\; p\,\frac{\partial\log p}{\partial\lambda_c}\,\frac{\partial\log p}{\partial\lambda_d}\;. \qquad (7.170)$$
Using eqs.(7.162) and (7.163),
$$\frac{\partial\log p(z|F)}{\partial\lambda_c} = F^c - f^c(z)\;, \qquad (7.171)$$
therefore,
$$g_{ab} = C_{ca}\,C_{db}\,C^{cd} \;\Longrightarrow\; g_{ab} = C_{ab}\;, \qquad (7.172)$$
so that the metric tensor $g_{ab}$ is the inverse of the covariance matrix $C^{ab}$ which,
by eq.(7.169), is the Hessian of the entropy.

Instead of $F^a$ we could use the Lagrange multipliers $\lambda_a$ themselves as coordinates.
Then the information metric is the covariance matrix,
$$g^{ab} = \int dz\; p(z|\lambda)\,\frac{\partial\log p(z|\lambda)}{\partial\lambda_a}\,\frac{\partial\log p(z|\lambda)}{\partial\lambda_b} = C^{ab}\;. \qquad (7.173)$$

The distance $d\ell$ between neighboring distributions can then be written in either
of two equivalent forms,
$$d\ell^2 = g_{ab}\,dF^a dF^b = g^{ab}\,d\lambda_a\,d\lambda_b\;. \qquad (7.174)$$
The uniform distribution over the space of macrostates assigns equal probabilities
to equal volumes,
$$P(F)\,d^nF \;\propto\; C^{1/2}\,d^nF \quad\text{or}\quad P'(\lambda)\,d^n\lambda \;\propto\; C^{-1/2}\,d^n\lambda\;, \qquad (7.175)$$
where $C = \det C_{ab}$.
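The identity g^{ab} = C^{ab} in eq.(7.173) is easy to confirm numerically. The sketch below (not from the text; the microstates, constraint function, and multiplier are arbitrary illustrative choices) compares the definition of the metric in the lambda coordinates with the variance for a single-constraint canonical distribution on a discrete space.

```python
import numpy as np

# Check of eq.(7.173): in the lambda coordinates the information metric of a
# canonical distribution is the covariance matrix (here a single constraint).
z = np.arange(6)                    # discrete microstates
m = np.ones_like(z, dtype=float)    # uniform underlying measure m(z)
f = z + 1.0                         # constrained quantity f(z)
lam = 0.4                           # Lagrange multiplier

w = m * np.exp(-lam * f)
Z = w.sum()
p = w / Z
F = np.sum(p * f)                   # expected value <f> = F

# g^{11} from the definition, using the score of eq.(7.171): F - f(z)
score = F - f
g_from_definition = np.sum(p * score**2)

# ... which should equal the variance C^{11} = <(f - F)^2>
C = np.sum(p * (f - F)**2)
print(g_from_definition, C)         # identical
```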

Gaussian distributions
Gaussian distributions are a special case of canonical distributions — they maximize
entropy subject to constraints on mean values and correlations. Consider
Gaussian distributions in $D$ dimensions,
$$p(x|\mu,C) = \frac{c^{1/2}}{(2\pi)^{D/2}}\exp\Big[-\frac{1}{2}\,C_{ij}\,(x^i-\mu^i)(x^j-\mu^j)\Big]\;, \qquad (7.176)$$
where $1 \leq i \leq D$, $C_{ij}$ is the inverse of the correlation matrix, and $c = \det C_{ij}$.
The mean values $\mu^i$ are $D$ parameters, while the symmetric $C_{ij}$ matrix is an
additional $\frac{1}{2}D(D+1)$ parameters. Thus the dimension of the statistical manifold
is $D + \frac{1}{2}D(D+1)$.

Calculating the information distance between $p(x|\mu,C)$ and $p(x|\mu+d\mu,\,C+dC)$
is a matter of keeping track of all the indices involved. Skipping all details,
the result is
$$d\ell^2 = g_{ij}\,d\mu^i d\mu^j + g_k^{\,ij}\,dC_{ij}\,d\mu^k + g^{ij\,kl}\,dC_{ij}\,dC_{kl}\;, \qquad (7.177)$$
where
$$g_{ij} = C_{ij}\;,\quad g_k^{\,ij} = 0\;,\quad\text{and}\quad g^{ij\,kl} = \frac{1}{4}\big(C^{ik}C^{jl} + C^{il}C^{jk}\big)\;, \qquad (7.178)$$
where $C^{ik}$ is the correlation matrix, that is, $C^{ik}C_{kj} = \delta^i_j$. Therefore,
$$d\ell^2 = C_{ij}\,d\mu^i d\mu^j + \frac{1}{2}\,C^{ik}C^{jl}\,dC_{ij}\,dC_{kl}\;. \qquad (7.179)$$
To conclude we consider a few interesting special cases. For Gaussians that
differ only in their means the information distance between $p(x|\mu,C)$ and $p(x|\mu+d\mu,\,C)$
is obtained by setting $dC_{ij} = 0$, that is,
$$d\ell^2 = C_{ij}\,d\mu^i d\mu^j\;, \qquad (7.180)$$
which is an instance of eq.(7.172).

Another interesting special case is that of spherically symmetric Gaussians,

$$p(x|\mu,\sigma) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\Big[-\frac{1}{2\sigma^2}\,\delta_{ij}(x^i-\mu^i)(x^j-\mu^j)\Big]\;. \qquad (7.181)$$
The covariance matrix and its inverse are both diagonal and proportional to the
unit matrix,
$$C_{ij} = \frac{1}{\sigma^2}\,\delta_{ij}\;,\quad C^{ij} = \sigma^2\,\delta^{ij}\;,\quad\text{and}\quad c = \sigma^{-2D}\;. \qquad (7.182)$$
Using
$$dC_{ij} = d\Big(\frac{1}{\sigma^2}\Big)\delta_{ij} = -\frac{2\,\delta_{ij}}{\sigma^3}\,d\sigma \qquad (7.183)$$
in eq.(7.179), the induced information metric is
$$d\ell^2 = \frac{1}{\sigma^2}\,\delta_{ij}\,d\mu^i d\mu^j + \frac{1}{2}\,\sigma^4\,\delta^{ik}\delta^{jl}\,\frac{2\,\delta_{ij}}{\sigma^3}\,d\sigma\,\frac{2\,\delta_{kl}}{\sigma^3}\,d\sigma\;, \qquad (7.184)$$
which, using
$$\delta^{ik}\delta^{jl}\delta_{ij}\delta_{kl} = \delta^{k}_{j}\,\delta^{j}_{k} = \delta^k_k = D\;, \qquad (7.185)$$
simplifies to
$$d\ell^2 = \frac{\delta_{ij}}{\sigma^2}\,d\mu^i d\mu^j + \frac{2D}{\sigma^2}\,(d\sigma)^2\;. \qquad (7.186)$$
For Gaussians in one dimension, $D = 1$, the statistical manifold is two dimensional
with coordinates $(\mu,\sigma)$. The metric is
$$d\ell^2 = \frac{1}{\sigma^2}\,(d\mu)^2 + \frac{2}{\sigma^2}\,(d\sigma)^2\;. \qquad (7.187)$$
This space is a pseudo-sphere — a two-dimensional manifold of constant negative
curvature.
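Equation (7.187) can also be recovered directly from the definition of the information metric by integrating the products of the two score functions. The sketch below (not from the text; the values of mu and sigma are arbitrary) does this by numerical quadrature for a one-dimensional Gaussian.

```python
import numpy as np
from scipy import integrate

# Check of eq.(7.187): the metric of a 1-D Gaussian in the coordinates
# (mu, sigma) is diag(1/sigma^2, 2/sigma^2).
mu, sigma = 0.7, 1.3

def p(x):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# scores  d log p / d mu  and  d log p / d sigma
dmu    = lambda x: (x - mu) / sigma**2
dsigma = lambda x: (x - mu)**2 / sigma**3 - 1.0 / sigma

g = np.empty((2, 2))
for i, a in enumerate((dmu, dsigma)):
    for j, b in enumerate((dmu, dsigma)):
        g[i, j], _ = integrate.quad(lambda x: p(x) * a(x) * b(x), -np.inf, np.inf)

print(np.round(g, 6))
print(np.round(np.diag([1.0, 2.0]) / sigma**2, 6))   # same matrix
```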

7.7 Dimensionless distance?


There is one feature of the information distance $d\ell$ in eq.(7.47) that turns out
to be very peculiar: $d\ell$ is dimensionless. (See [Caticha 2015b].) Indeed, we can
easily verify from eq.(7.46) that if $d\theta^i$ has units of length, then $p(x|\theta)$, being a
density, has units of inverse volume, and $g_{ij}$ has units of inverse length squared.
Distances are supposed to be measured in some units, perhaps of length; what
sort of distance is this dimensionless $d\ell$?

A simple example will help clarify this issue. Consider two neighboring
spherically symmetric Gaussian distributions, eq.(7.181), with some fixed variance.
The distance between $p(x|\mu,\sigma)$ and $p(x|\mu+d\mu,\sigma)$ is given by eq.(7.186)
with $d\sigma = 0$,
$$d\ell^2 = \frac{\delta_{ij}}{\sigma^2}\,d\mu^i d\mu^j\;. \qquad (7.188)$$

This is the Euclidean metric $\delta_{ij}$ rescaled by $\sigma^2$. Therefore the dimensionless $d\ell$
represents a distance measured in units of the uncertainty $\sigma$.

The result is not restricted to Gaussian distributions. For example, for
canonical distributions, eq.(7.166) shows that the information metric $g_{ab} = C_{ab}$
is an inverse measure of the fluctuations as given by the variance-covariance matrix
$C^{ab}$. Therefore, the information metric $g_{ab}(\theta)$ measures distinguishability
in units of the local uncertainty implied by the distribution $p(x|\theta)$.

A perhaps unexpected consequence of the notion of a dimensionless information
distance is the following. As we saw in section (7.4.4) we can measure the
length of a curve on a statistical manifold by "counting" the number of points
along it. Counting points depends on a decision about what we mean by a point and,
in particular, on what we mean by two different points. If we agree that two
points ought to be counted separately only when we can distinguish them, then
one can assert that the number of distinguishable points in any finite segment
is finite.

The same idea can be used to measure areas and volumes: just count the number
of distinguishable points within the region of interest. Remarkably this
counting argument allows us to compare the sizes of regions of different dimensionality:
it is meaningful to assert that a surface and a volume are of the same
"information size" whenever they contain the same number of distinguishable
points.

The conclusion is that statistical manifolds are peculiar objects that have
properties that one would normally associate with a continuum but also have
properties that one would normally only associate with discrete structures such
as lattices.
Chapter 8

Entropy IV: Entropic Inference

There is one last issue that must be addressed before one can claim that the design
of the inference method of maximum entropy (ME) is more or less complete.
The goal is to rank probability distributions in order to select a distribution that,
according to some desirability criteria, is preferred over all others. The ranking
tool is entropy; higher entropy represents higher preference. But there is nothing
in our previous arguments to tell us by how much. Suppose the maximum
of the entropy function is not particularly sharp: are we really confident that
distributions with entropy close to the maximum are totally ruled out? We want
a quantitative measure of the extent to which distributions with lower entropy
are ruled out. Or, to phrase this question differently: we can rank probability
distributions $p$ relative to a prior $q$ according to the relative entropy $S[p,q]$, but
any monotonic function of the relative entropy will accomplish the same goal.
Does twice the entropy represent twice the preference or four times as much?
Can we quantify 'preference'? The discussion below follows [Caticha 2000].

8.1 Deviations from maximum entropy


The problem is to update from a prior $q(x)$ given information specified by certain
constraints. The constraints specify a family of candidate distributions
$p_\theta(x) = p(x|\theta)$ which can be conveniently labelled with a finite number of parameters
$\theta^i$, $i = 1\ldots n$. Thus, the parameters $\theta$ are coordinates on the statistical
manifold specified by the constraints. The distributions in this manifold are
ranked according to their entropy $S[p_\theta, q] = S(\theta)$ and the chosen posterior is
the distribution $p(x|\theta_0)$ that maximizes the entropy $S(\theta)$.

The question we now address concerns the extent to which $p(x|\theta_0)$ should
be preferred over other distributions with lower entropy or, to put it differently:
to what extent is it rational to believe that the selected value ought to be the
entropy maximum $\theta_0$ rather than any other value $\theta$? This is a question about

the probability $p(\theta)$ of various values of $\theta$.

The original problem which led us to design the maximum entropy method
was to assign a probability to the quantity $x$; we now see that the full problem is
to assign probabilities to both $x$ and $\theta$. We are concerned not just with $p(x)$ but
rather with the joint distribution $p_J(x,\theta)$; the universe of discourse has been
expanded from $\mathcal{X}$ (the space of $x$s) to the product space $\mathcal{X}\times\Theta$ ($\Theta$ is the space
of parameters $\theta$).

To determine the joint distribution $p_J(x,\theta)$ we make use of essentially the
only method at our disposal — the ME method itself — but this requires that
we address the standard two preliminary questions: first, what is the prior
distribution? What do we know about $x$ and $\theta$ before we receive information
about the constraints? And second, what is this new information that constrains
the allowed joint distributions $p_J(x,\theta)$?

This first question is the more subtle one: when we know absolutely nothing
about the $\theta$s we know neither their physical meaning nor whether there is any
relation to the $x$s. A joint prior that reflects this lack of correlations is a product,
$q_J(x,\theta) = q(x)\,\mu(\theta)$. We will assume that the prior over $x$ is known — it is the
same prior we had used when we updated from $q(x)$ to $p(x|\theta_0)$. But we are
not totally ignorant about the $\theta$s: we know that they label points on some as
yet unspecified statistical manifold $\Theta$. Then there exists a natural measure of
distance in the space $\Theta$. It is given by the information metric $g_{ab}$ introduced
in the previous chapter and the corresponding volume elements are given by
$g^{1/2}(\theta)\,d^n\theta$, where $g(\theta)$ is the determinant of the metric. The uniform prior
for $\theta$, which assigns equal probabilities to equal volumes, is proportional to
$g^{1/2}(\theta)$ and therefore we choose $\mu(\theta) = g^{1/2}(\theta)$. Therefore, the joint prior is
$q_J(x,\theta) = q(x)\,g^{1/2}(\theta)$.

Next we tackle the second question: what are the constraints on the allowed
joint distributions $p_J(x,\theta)$? Consider the space of all joint distributions. To
each choice of the functional form of $p(x|\theta)$ (for example, whether we talk about
Gaussians, Boltzmann-Gibbs distributions, or something else) there corresponds
a different subspace defined by distributions of the form $p_J(x,\theta) = p(\theta)\,p(x|\theta)$.
The crucial constraint is that which specifies the subspace by specifying the
particular functional form of $p(x|\theta)$. This defines the meaning of the $\theta$s and
also fixes the prior $g^{1/2}(\theta)$ on the relevant subspace.

To select the preferred joint distribution $P(x,\theta)$ we maximize the joint entropy
$S[p_J,q_J]$ over all distributions of the form $p_J(x,\theta) = p(\theta)\,p(x|\theta)$ by varying
with respect to $p(\theta)$ with $p(x|\theta)$ fixed. It is convenient to write the entropy as
$$S[p_J,q_J] = -\int dx\,d\theta\; p(\theta)\,p(x|\theta)\,\log\frac{p(\theta)\,p(x|\theta)}{g^{1/2}(\theta)\,q(x)}
= S[p,g^{1/2}] + \int d\theta\; p(\theta)\,S(\theta)\;, \qquad (8.1)$$

where
$$S[p,g^{1/2}] = -\int d\theta\; p(\theta)\,\log\frac{p(\theta)}{g^{1/2}(\theta)} \qquad (8.2)$$

and
$$S(\theta) = -\int dx\; p(x|\theta)\,\log\frac{p(x|\theta)}{q(x)}\;. \qquad (8.3)$$
The notation shows that $S[p,g^{1/2}]$ is a functional of $p(\theta)$ while $S(\theta)$ is a function
of $\theta$. Maximizing (8.1) with respect to variations $\delta p(\theta)$ such that $\int d\theta\,p(\theta) = 1$
yields
$$0 = \int d\theta\,\Big[-\log\frac{p(\theta)}{g^{1/2}(\theta)} + S(\theta) - \log\zeta\Big]\,\delta p(\theta)\;, \qquad (8.4)$$
where the required Lagrange multiplier has been written as $1 - \log\zeta$. Therefore
the probability that the value of $\theta$ should lie within the small volume $g^{1/2}(\theta)\,d^n\theta$
is
$$P(\theta)\,d^n\theta = \frac{1}{\zeta}\,e^{S(\theta)}\,g^{1/2}(\theta)\,d^n\theta \quad\text{with}\quad \zeta = \int d^n\theta\; g^{1/2}(\theta)\,e^{S(\theta)}\;. \qquad (8.5)$$
Equation (8.5) is the result we seek. It tells us that, as expected, the preferred
value of $\theta$ is the value $\theta_0$ that maximizes the entropy $S(\theta)$, eq.(8.3), because
this maximizes the scalar probability density $\exp S(\theta)$. But it also tells us the
degree to which values of $\theta$ away from the maximum are ruled out.
Remark: The density $\exp S(\theta)$ is a scalar function and the presence of the
Jacobian factor $g^{1/2}(\theta)$ makes eq.(8.5) manifestly invariant under changes of
the coordinates $\theta$ in the space $\Theta$.

8.2 The ME method


Back in section 6.2.4 we summarized the method of maximum entropy as follows:

The ME method: We want to update from a prior distribution $q$ to a posterior
distribution when there is new information in the form of constraints
$C$ that specify a family $\{p\}$ of allowed posteriors. The posterior is selected
through a ranking scheme that recognizes the value of prior information
and the privileged role of independence. Within the family $\{p\}$ the preferred
posterior $P$ is that which maximizes the relative entropy $S[p,q]$
subject to the available constraints. No interpretation for $S[p,q]$ is given
and none is needed.

The discussion of the previous section allows us to refine our understanding of
the method. ME is not an all-or-nothing recommendation to pick the single
distribution that maximizes entropy and reject all others. The ME method is
more nuanced: in principle all distributions within the constraint manifold ought
to be included in the analysis; they contribute in proportion to the exponential
of their entropy and this turns out to be significant in situations where the
entropy maximum is not particularly sharp.

Going back to the original problem of updating from the prior $q(x)$ given
information that specifies the manifold $\{p(x|\theta)\}$, the preferred update within
the family $\{p(x|\theta)\}$ is $p(x|\theta_0)$, but to the extent that other values of $\theta$ are not

totally ruled out, a better update is obtained by marginalizing the joint posterior
$P_J(x,\theta) = P(\theta)\,p(x|\theta)$ over $\theta$,
$$P(x) = \int d^n\theta\; P(\theta)\,p(x|\theta) = \int d^n\theta\; g^{1/2}(\theta)\,\frac{e^{S(\theta)}}{\zeta}\,p(x|\theta)\;. \qquad (8.6)$$
In situations where the entropy maximum at $\theta_0$ is very sharp we recover the old
result,
$$P(x) \approx p(x|\theta_0)\;. \qquad (8.7)$$
When the entropy maximum is not very sharp eq.(8.6) is the more honest update.
The discussion in section 8.1 is itself an application of the same old ME
method discussed in section 6.2.4, not on the original space $\mathcal{X}$, but on the
enlarged product space $\mathcal{X}\times\Theta$. Thus, adopting the improved posterior (8.6)
does not reflect a renunciation of the old ME method — only a refinement. To
the summary description of the ME method above we can add the single line:

The ME method can be deployed to assess its own limitations and to take
the appropriate corrective measures.

Remark: Physical applications of the extended ME method are ubiquitous. For
macroscopic systems the preference for the distribution that maximizes $S(\theta)$ can
be overwhelming but for small systems such fluctuations about the maximum
are common. Thus, violations of the second law of thermodynamics can be
seen everywhere — provided we know where to look. Indeed, as we shall see
in the next chapter, eq.(8.5) agrees with Einstein's theory of thermodynamic
fluctuations and extends it beyond the regime of small fluctuations. Another
important application, to be developed in chapter 11, is quantum mechanics —
the ultimate theory of small systems.
We conclude this section by pointing out that there are a couple of interesting
points of analogy between the pair of {maximum likelihood, Bayesian}
methods and the corresponding pair of {MaxEnt, ME} methods. The first point
is that maximizing the likelihood function $L(\theta|x) \stackrel{\text{def}}{=} p(x|\theta)$ selects a single preferred
value of $\theta$ but no measure is given of the extent to which other values
of $\theta$ are ruled out. The method of maximum likelihood does not provide us
with a distribution for $\theta$ — the likelihood function $L(\theta|x)$ is not a probability
distribution for $\theta$. Similarly, maximizing entropy as prescribed by the MaxEnt
method yields a single preferred value of the label $\theta$ but MaxEnt fails to address
the question of the extent to which other values of $\theta$ are ruled out. The second
point of analogy is that neither the maximum likelihood nor the MaxEnt
methods are capable of handling information contained in prior distributions,
while both Bayesian and ME methods can. The latter analogy is to be expected
since neither the maximum likelihood nor the MaxEnt methods were designed
for updating probabilities.

8.3 Avoiding pitfalls –III


Over the years a number of objections and paradoxes have been raised against
the method of maximum entropy. Some were discussed in chapter 4. Here we
discuss some objections of the type discussed in [Shimony 1985] and [Seidenfeld
1986]; see also [van Fraasen 1981 and 1986].1 I believe some of these objections
were quite legitimate at the time they were raised. They uncovered conceptual
limitations with the old MaxEnt as it was understood at the time. I also believe
that in the intervening decades our understanding of entropic inference has
evolved to the point that all these concerns can now be addressed satisfactorily.

8.3.1 The three-sided die


To set the stage for the issues involved consider a three-sided die. The die
has three faces labeled by the number of spots $i = 1,2,3$ with probabilities
$\{\theta_1,\theta_2,\theta_3\} = \theta$. The space of distributions is the simplex $\mathbf{S}_2$ with $\sum_i\theta_i = 1$.
A fair die is one for which $\theta = \theta_C = (\frac{1}{3},\frac{1}{3},\frac{1}{3})$, which lies at the very center of
the simplex (see fig.8.1). The expected number of spots for a fair die is $\langle i\rangle = 2$.
Having $\langle i\rangle = 2$ is no guarantee that the die is fair but if $\langle i\rangle \neq 2$ the die is
necessarily biased.

Next we consider three cases characterized by different states of information.
First we have a situation of complete ignorance. See fig.8.1(a). Nothing is
known about the die; we do not know that it is fair but on the other hand there
is nothing that induces us to favor one face over another. On the basis of this
minimal information we can use MaxEnt: maximize
$$S(\theta) = -\sum_i \theta_i\log\theta_i \qquad (8.8)$$
subject to $\sum_i\theta_i = 1$. The maximum entropy distribution is $\theta_{ME} = \theta_C$.
The second case involves more information: we are told that $r = \langle i\rangle = 2$.
This constraint is shown in fig.8.1(b) as a vertical dashed line that includes
distributions other than $\theta_C$. Therefore $r = 2$ does not imply that the die is
fair. However, maximizing the entropy $S(\theta)$ subject to $\sum_i\theta_i = 1$ and $\langle i\rangle = 2$
leads us to assign $\theta'_{ME} = \theta_C$.

Finally, the third case involves even more information: we are told that the
die is fair. Maximizing $S(\theta)$ subject to the constraint $\theta = \theta_C$ yields, of course,
$\theta''_{ME} = \theta_C$. This is shown in fig.8.1(c).

The fact that MaxEnt assigns the same probability to the three cases might
suggest that the three situations are epistemically identical — which they obviously
are not. This is a source of concern since failing to see a distinction where
one actually exists will inevitably give rise to paradoxes.

A more refined analysis, however, shows that — despite the fact that MaxEnt
assigns the same $\theta_{ME} = \theta_C$ in all three cases — the uncertainties in the vicinity
of $\theta_C$ are all different. Indeed, the fact that the maximum of the entropy $S(\theta)$
1 Other objections raised by these authors, such as the compatibility of Bayesian and en-

tropic methods, have been addressed elsewhere in these lectures.



Figure 8.1: Three different states of information concerning a three-sided die.
(a) Absolute ignorance: the distribution assigned by MaxEnt is $\theta_C = (\frac{1}{3},\frac{1}{3},\frac{1}{3})$.
(b) We know that $r = \langle i\rangle = 2$: the MaxEnt distribution is also $\theta_C$. (c) The die
is known to be fair: we know that $\theta = \theta_C$. Despite the fact that MaxEnt assigns
the same $\theta = \theta_C$ in all three cases the uncertainties about $\theta_C$ are different.

at $\theta_C$ is not particularly sharp indicates that a full-blown ME analysis is called
for. For case (a) of complete ignorance, the probability that $\theta$ lies in any small
region $d^2\theta = d\theta_1 d\theta_2$ of the simplex is given by eq.(8.5),
$$P_a(\theta)\,d\theta_1 d\theta_2 \;\propto\; e^{S(\theta)}\,g^{1/2}(\theta)\,d\theta_1 d\theta_2 \quad\text{with}\quad g(\theta) = \frac{1}{\theta_1\theta_2(1-\theta_1-\theta_2)}\;. \qquad (8.9)$$
The maximum of $P_a(\theta)$ is indeed at the center $\theta_C$ but the distribution is broad
and extends over the whole simplex.

The ME distribution for case (b) is formally similar to case (a),
$$P_b(\theta_2)\,d\theta_2 \;\propto\; e^{S(\theta)}\,g^{1/2}(\theta_2)\,d\theta_2 \quad\text{with}\quad g(\theta_2) = \frac{1}{\theta_2(1-\theta_2)}\;. \qquad (8.10)$$
The maximum of $P_b(\theta_2)$ is also at the center $\theta_C$ but the distribution is confined
to the vertical line defined by $\theta_1 = \theta_3 = (1-\theta_2)/2$ in fig.8.1(b); the probability
over the rest of the simplex is strictly zero.

Finally, in case (c) the distribution is concentrated at the single central point
$\theta_C$,
$$P_c(\theta) = \delta(\theta - \theta_C)\;, \qquad (8.11)$$
and there is absolutely no room for fluctuations.

To summarize: complete ignorance about $i = 1,2,3$ with full knowledge of
$\theta = \theta_C = (\frac{1}{3},\frac{1}{3},\frac{1}{3})$ is not the same as complete ignorance about both $i = 1,2,3$

and $\theta = \{\theta_1,\theta_2,\theta_3\}$. An assessment of 'complete ignorance' can be perfectly
legitimate but to avoid confusion we must be very specific about what variables
we are being ignorant about.
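The breadth of the case (a) distribution is easy to see numerically. The sketch below (not part of the text; the grid resolution is an arbitrary choice) evaluates eq.(8.9) on the simplex and reports the mean and spread of theta_2.

```python
import numpy as np

# Illustration of eq.(8.9): P_a(theta) peaks at the center of the simplex but
# remains broad, unlike the delta function of case (c).
step = 0.002
t1, t2 = np.meshgrid(np.arange(step, 1, step), np.arange(step, 1, step))
t3 = 1.0 - t1 - t2
inside = t3 > step
t1, t2, t3 = t1[inside], t2[inside], t3[inside]

S  = -(t1*np.log(t1) + t2*np.log(t2) + t3*np.log(t3))   # entropy S(theta)
g  = 1.0 / (t1 * t2 * t3)                               # metric determinant, eq.(8.9)
Pa = np.exp(S) * np.sqrt(g)
Pa /= Pa.sum()

mean2 = np.sum(Pa * t2)
std2  = np.sqrt(np.sum(Pa * (t2 - mean2)**2))
print(mean2, std2)   # mean ~ 1/3 (the center), but with a sizable spread
```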

8.3.2 Understanding ignorance


Ignorance, like the vacuum, is not a trivial concept.2 Further opportunities for
confusion arise when we consider constraints $\langle i\rangle = r$ with $r \neq 2$. In fig.8.2
the constraint $\langle i\rangle = r = 1$ is shown as a vertical dashed line. Maximizing $S(\theta)$
subject to $\langle i\rangle = 1$ and normalization leads to the point at the intersection where
the $r = 1$ line crosses the dotted line. The dotted curve is the set of MaxEnt
distributions $\theta_{ME}(r)$ as $r$ spans the range from 1 to 3.

Figure 8.2: The MaxEnt solution for the constraint $\langle i\rangle = r$ for different values
of $r$ leads to the dotted line. If $r$ is unknown averaging over $r$ should lead to
the distribution at the point marked in the figure.

It is tempting (but ill advised) to pursue the following line of thought: We
have a die but we do not know much about it. We do know, however, that the
quantity $\langle i\rangle$ must have some value, call it $r$, about which we are ignorant too.
Now, the most ignorant distribution given $r$ is the MaxEnt distribution $\theta_{ME}(r)$.

2 The title for this section is borrowed from Rodriguez’s paper on the two-envelope paradox

[Rodriguez 1988]. Other papers of his on the general subject of ignorance and geometry (see
the bibliography) are highly recommended for the wealth of insights they contain.

But $r$ is itself unknown so a more honest assignment is an average over $r$,
$$\bar\theta = \int dr\; p(r)\,\theta_{ME}(r)\;, \qquad (8.12)$$
where $p(r)$ reflects our uncertainty about $r$. It may, for example, make sense to
pick a uniform distribution over $r$ but the precise choice is not important for our
purposes. The point is that since the MaxEnt dotted curve is concave the point $\bar\theta$
necessarily lies below $\theta_C$ so that $\bar\theta_2 < 1/3$. And we have a paradox: we started
admitting complete ignorance and through a process that claims to express full
ignorance at every step we reach the conclusion that the die is biased against
$i = 2$. Where is the mistake?

The first clue is symmetry: We started with a situation that treats the
outcomes $i = 1,2,3$ symmetrically and end up with a distribution that is biased
against $i = 2$. The symmetry must have been broken somewhere and it is clear
that this happened at the moment the constraint on $\langle i\rangle = r$ was imposed —
this is shown as vertical lines on the simplex. Had we chosen to express our
ignorance not in terms of the unknown value of $\langle i\rangle = r$ but in terms of some
other function $\langle f(i)\rangle = s$ then we could have easily broken the symmetry in
some other direction. For example, let $f(i)$ be a cyclic permutation of $i$,
$$f(1) = 2\;,\quad f(2) = 3\;,\quad\text{and}\quad f(3) = 1\;; \qquad (8.13)$$
then repeating the analysis above would lead us to conclude that $\bar\theta_1 < 1/3$,
which represents a die biased against $i = 1$. Thus, the question becomes: What
leads us to choose a constraint on $\langle i\rangle$ rather than a constraint on $\langle f\rangle$ when we are
equally ignorant about both?
The discussion in section 4.11 is relevant here. There we identi…ed four
epistemically di¤erent situations:

(A) The ideal case: We know that $\langle f\rangle = F$ and we know that it captures all
the information that happens to be relevant to the problem at hand.

(B) The important case: We know that $\langle f\rangle$ captures all the information that
happens to be relevant to the problem at hand but its actual numerical
value $F$ is not known.

(C) The predictive case: There is nothing special about the function $f$
except that we happen to know its expected value, $\langle f\rangle = F$. In particular,
we do not know whether information about $\langle f\rangle$ is complete or whether it
is at all relevant to the problem at hand.

(D) The extreme ignorance case: We know neither that $\langle f\rangle$ captures relevant
information nor its numerical value $F$. There is nothing that singles
out one function $f$ over any other.

The paradox with the three-sided die arises because two epistemically different
situations, case B and case D, have been confused. On one hand, the unknown

die is meant to reflect a situation of complete ignorance, case D. We do not know
whether it is the constraint $\langle i\rangle$ or any other function $\langle f\rangle$ that captures relevant
information; and their numerical values are also unknown. There is nothing to
single out $\langle i\rangle$ or $\langle f\rangle$ and therefore the correct inference consists of maximizing
$S$ imposing the only constraint we actually know, namely, normalization. The
result is as it should be — a uniform distribution ($\theta_{ME} = \theta_C$).

On the other hand, the argument that led to the assignment of $\bar\theta$ in eq.(8.12)
turns out to be actually correct when applied to an epistemic situation of type
B. Imposing the constraint $\langle i\rangle = r$ when $r$ is unknown and then averaging over
$r$ represents a situation in which we know something. We have some knowledge
that singles out $\langle i\rangle$ — and not any other $\langle f\rangle$ — as the function that captures
information that is relevant to the die. There is some ignorance here — we do
not know $r$ — but this is not extreme ignorance. We can summarize as follows:
Knowing nothing about a die is not the same as knowing that the die is biased
against a particular face but not knowing by how much.
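The bias produced by eq.(8.12) can be exhibited explicitly. The sketch below (not part of the text; the uniform p(r) and grid are illustrative choices) computes the MaxEnt distribution for each value of r and averages over r, showing that the component for i = 2 falls below 1/3.

```python
import numpy as np
from scipy.optimize import brentq

# Illustration of eq.(8.12): averaging the MaxEnt distributions theta_ME(r)
# over a uniform p(r) on (1,3) yields a distribution biased against i = 2.
i = np.array([1.0, 2.0, 3.0])

def theta_maxent(r):
    """MaxEnt distribution on {1,2,3} subject to <i> = r."""
    if np.isclose(r, 2.0):
        return np.ones(3) / 3.0
    # solve for the Lagrange multiplier lam such that <i> = r
    mean_minus_r = lambda lam: np.sum(i * np.exp(-lam * i)) / np.sum(np.exp(-lam * i)) - r
    lam = brentq(mean_minus_r, -50.0, 50.0)
    w = np.exp(-lam * i)
    return w / w.sum()

rs = np.linspace(1.001, 2.999, 2001)          # a uniform p(r), approximated on a grid
theta_bar = np.mean([theta_maxent(r) for r in rs], axis=0)
print(theta_bar)   # the middle entry is < 1/3: the 'ignorant' average is biased against i = 2
```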
A different instance of the same paradox is discussed in [Shimony 1985]. A
physical system can be in any of $n$ microstates labeled $i = 1\ldots n$. When we
know absolutely nothing about the system maximizing entropy subject to the
single constraint of normalization leads to a uniform probability distribution,
$p_u(i) = 1/n$. A different (misleading) way to express complete ignorance is to
argue that the expected energy $\langle\varepsilon\rangle$ must have some value $E$ about which we are
ignorant. Maximizing entropy subject to both $\langle\varepsilon\rangle = E$ and normalization leads
to the usual Boltzmann distributions,
$$p(i|\beta) = \frac{e^{-\beta\varepsilon_i}}{Z(\beta)} \quad\text{where}\quad Z(\beta) = \sum_i e^{-\beta\varepsilon_i}\;. \qquad (8.14)$$
Since the inverse temperature $\beta = \beta(E)$ is itself unknown we must average over
$\beta$,
$$p_t(i) = \int d\beta\; p(\beta)\,p(i|\beta)\;. \qquad (8.15)$$
To the extent that both distributions reflect complete ignorance we must have
$$p_u(i) = p_t(i) \qquad (\text{wrong!})$$
which can only happen provided
$$p(\beta) = \delta(\beta) \quad\text{or}\quad \beta = 0\;. \qquad (8.16)$$

Indeed, setting the Lagrange multiplier $\beta = 0$ in $p(i|\beta)$ amounts to maximizing
entropy without imposing the energy constraint and this leads to the uniform
distribution $p_u(i)$. But now we have a paradox: The first way of expressing complete
ignorance about the system implies we are ignorant about its temperature.
In fact, we do not even know that it has a temperature at all, much less that it
has a single uniform temperature. But we also have a second way of expressing
ignorance and if we impose that the two agree we are led to conclude that $\beta$ has

the precise value $\beta = 0$; we have concluded that the system is infinitely hot —
ignorance is hell.
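A quick computation makes the conflict between the two prescriptions concrete. The sketch below (not part of the text; the energies and the uniform p(beta) are hypothetical choices) averages Boltzmann distributions as in eq.(8.15) and shows that the result is not the uniform p_u(i), unless p(beta) collapses onto beta = 0.

```python
import numpy as np

# Averaging p(i|beta) over a broad p(beta), eq.(8.15), does not reproduce the
# uniform distribution p_u(i) = 1/n of 'complete ignorance'.
eps = np.arange(1, 7)                 # energies eps_i of n = 6 microstates
betas = np.linspace(0.0, 2.0, 2001)   # a broad uniform p(beta) on [0, 2]

def boltzmann(beta):
    w = np.exp(-beta * eps)
    return w / w.sum()

p_t = np.mean([boltzmann(b) for b in betas], axis=0)   # eq.(8.15)
p_u = np.full(len(eps), 1.0 / len(eps))

print(np.round(p_t, 3))   # visibly non-uniform
print(p_u)                # the uniform distribution
```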
The paradox is dissolved once we realize that, just as with the die problem,
we have confused two epistemically different situations — types D and B above:
Knowing nothing about a system is not the same as knowing that it is in thermal
equilibrium at a temperature that happens to be unknown.
It may be worthwhile to rephrase this important point in different words. If
$\mathcal{I}$ is the space of microstates and $\beta$ is some unknown arbitrary quantity in some
space $\mathcal{B}$ the rules of probability theory allow us to write
$$p(i) = \int d\beta\; p(i,\beta) \quad\text{where}\quad p(i,\beta) = p(\beta)\,p(i|\beta)\;. \qquad (8.17)$$
Paradoxes will easily arise if we fail to distinguish a situation of complete ignorance
from a situation where the conditional probability $p(i|\beta)$ — which is what
gives meaning to the parameter $\beta$ — is known. Or, to put it in yet another way:
complete ignorance over the space $\mathcal{I}$ is not the same as complete ignorance over
the full space $\mathcal{I}\times\mathcal{B}$.
Chapter 9

Topics in Statistical
Mechanics*

9.1 An application to ‡uctuations


The starting point for the standard formulation of the theory of fluctuations in
thermodynamic systems (see [Landau 1977, Callen 1985]) is Einstein's inversion
of Boltzmann's formula $S = k\log W$ to obtain the probability of a fluctuation
in the form $W \propto \exp(S/k)$. A careful justification, however, reveals a number of
approximations which, for most purposes, are legitimate and work very well. A
re-examination of fluctuation theory from the point of view of ME is, however,
valuable. Our general conclusion is that the ME point of view allows exact
formulations; in fact, it is clear that deviations from the canonical predictions
can be expected, although in general they will be negligible. Other advantages of
the ME approach include the explicit covariance under changes of coordinates,
the absence of restrictions to the vicinity of equilibrium or to large systems, and
the conceptual ease with which one deals with fluctuations of both the extensive
as well as their conjugate intensive variables. [Caticha 2000]

This last point is an important one: within the canonical formalism (section
5.4) the extensive variables such as energy are uncertain while the intensive
ones such as the temperature or the Lagrange multiplier are fixed parameters;
they do not fluctuate. There are, however, several contexts in which it makes
sense to talk about fluctuations of the conjugate variables. Below we discuss
the standard scenario of an open system that can exchange, say, energy with its
environment.

Consider the usual setting of a thermodynamical system with microstates
labelled by $z$. Let $m(z)\,dz$ be the number of microstates within the range $dz$.
According to the postulate of "equal a priori probabilities" we choose a uniform
prior distribution proportional to the density of states $m(z)$. The canonical
ME distribution obtained by maximizing $S[p,m]$ subject to constraints on the

expected values $\langle f^k\rangle = F^k$ of relevant variables $f^k(z)$, is
$$p(z|F) = \frac{1}{Z(\lambda)}\,m(z)\,e^{-\lambda_k f^k(z)} \quad\text{with}\quad Z(\lambda) = \int dz\; m(z)\,e^{-\lambda_k f^k(z)}\;, \qquad (9.1)$$
and the corresponding entropy is
$$S(F) = \log Z(\lambda) + \lambda_k F^k\;. \qquad (9.2)$$

Fluctuations of the variables $f^k(z)$ or of any other function of the microstate
$z$ are usually computed in terms of the various moments of $p(z|F)$. Within this
context all expected values such as the constraints $\langle f^k\rangle = F^k$ and the entropy
$S(F)$ itself are fixed; they do not fluctuate. The corresponding conjugate variables,
the Lagrange multipliers $\lambda_k = \partial S/\partial F^k$, eq.(4.92), do not fluctuate either.

The standard way to make sense of fluctuations is to couple the system of
interest to a second system, a bath, and allow exchanges of the quantities $f^k$. All
quantities referring to the bath will be denoted by primes: the microstates are
$z'$, the density of states is $m'(z')$, and the variables are $f'^k(z')$, etc. Even though
the overall expected value $\langle f^k + f'^k\rangle = F_T^k$ of the combined system plus bath is
fixed, the individual expected values $\langle f^k\rangle = F^k$ and $\langle f'^k\rangle = F'^k = F_T^k - F^k$ are
allowed to fluctuate. The ME distribution $p_0(z,z')$ that best reflects the prior
information contained in $m(z)$ and $m'(z')$ updated by information on the total
$F_T^k$ is
$$p_0(z,z') = \frac{1}{Z_0}\,m(z)\,m'(z')\,e^{-\lambda_{0k}(f^k(z)+f'^k(z'))}\;. \qquad (9.3)$$
But distributions of lower entropy are not totally ruled out; to explore the
possibility that the quantities $F_T^k$ are distributed between the two systems in a
less than optimal way we consider the joint distributions $p_J(z,z';F)$ constrained
to the form
$$p_J(z,z';F) = p(F)\,p(z|F)\,p(z'|F_T-F)\;, \qquad (9.4)$$
where $p(z|F)$ is the canonical distribution in eq.(9.1), its entropy is eq.(9.2), and
analogous expressions hold for the primed quantities.

We are now ready to write down the probability that the value of $F$ fluctuates
into a small volume $g^{1/2}(F)\,dF$. From eq.(8.5) we have
$$P(F)\,dF = \frac{1}{\zeta}\,e^{S_T(F)}\,g^{1/2}(F)\,dF\;, \qquad (9.5)$$
where $\zeta$ is a normalization constant and the entropy $S_T(F)$ of the system plus
the bath is
$$S_T(F) = S(F) + S'(F_T-F)\;. \qquad (9.6)$$
The formalism simplifies considerably when the bath is large enough that exchanges
of $F$ do not affect it, and $\lambda'$ remains fixed at $\lambda_0$. Then
$$S'(F_T-F) = \log Z'(\lambda_0) + \lambda_{0k}\big(F_T^k - F^k\big) = \text{const} - \lambda_{0k}F^k\;. \qquad (9.7)$$

It remains to calculate the determinant $g(F)$ of the information metric given
by eq.(7.87),
$$g_{ij} = -\frac{\partial^2 S_T(\dot F,F)}{\partial\dot F^i\,\partial\dot F^j} = -\frac{\partial^2}{\partial\dot F^i\,\partial\dot F^j}\Big[S(\dot F,F) + S'(F_T-\dot F,\,F_T-F)\Big]\;, \qquad (9.8)$$
where the dot indicates that the derivatives act on the first argument. The first
term on the right is
$$\frac{\partial^2 S(\dot F,F)}{\partial\dot F^i\,\partial\dot F^j} = -\frac{\partial^2}{\partial\dot F^i\,\partial\dot F^j}\int dz\; p(z|\dot F)\,\log\Big[\frac{p(z|\dot F)}{m(z)}\,\frac{m(z)}{p(z|F)}\Big]
= \frac{\partial^2 S(F)}{\partial F^i\partial F^j} + \int dz\;\frac{\partial^2 p(z|F)}{\partial F^i\partial F^j}\,\log\frac{p(z|F)}{m(z)}\;. \qquad (9.9)$$
To calculate the integral on the right use eq.(9.1) written in the form
$$\log\frac{p(z|F)}{m(z)} = -\log Z(\lambda) - \lambda_k f^k(z)\;, \qquad (9.10)$$
so that the integral vanishes,
$$-\log Z(\lambda)\,\frac{\partial^2}{\partial F^i\partial F^j}\int dz\; p(z|F) \;-\; \lambda_k\,\frac{\partial^2}{\partial F^i\partial F^j}\int dz\; p(z|F)\,f^k(z) = 0\;. \qquad (9.11)$$
Similarly,
$$\frac{\partial^2}{\partial\dot F^i\,\partial\dot F^j}\,S'(F_T-\dot F,\,F_T-F) = \frac{\partial^2 S'(F_T-F)}{\partial F^i\,\partial F^j}
+ \int dz'\;\frac{\partial^2 p(z'|F_T-F)}{\partial F^i\,\partial F^j}\,\log\frac{p(z'|F_T-F)}{m'(z')} \qquad (9.12)$$
and here, using eq.(9.7), both terms vanish. Therefore
$$g_{ij} = -\frac{\partial^2 S(F)}{\partial F^i\partial F^j}\;. \qquad (9.13)$$
We conclude that the probability that the value of $F$ fluctuates into a small
volume $g^{1/2}(F)\,dF$ becomes
$$p(F)\,dF = \frac{1}{\zeta}\,e^{S(F)-\lambda_{0k}F^k}\,g^{1/2}(F)\,dF\;. \qquad (9.14)$$
This equation is exact.


An important difference with the usual theory stems from the presence of
the Jacobian factor $g^{1/2}(F)$. This is required by coordinate invariance and can
lead to small deviations from the canonical predictions. The quantities $\langle\lambda_k\rangle$ and
$\langle F^k\rangle$ may be close but will not in general coincide with the quantities $\lambda_{0k}$ and $F_0^k$
at the point where the scalar probability density attains its maximum. For most
thermodynamic systems however the maximum is very sharp. In its vicinity the

Jacobian can be considered constant, and one obtains the usual results [Landau
1977], namely, that the probability distribution for the fluctuations is given by
the exponential of a Legendre transform of the entropy.

The remaining difficulties are purely computational and of the kind that
can in general be tackled systematically using the method of steepest descent
to evaluate the appropriate generating function. Since we are not interested
in variables referring to the bath we can integrate eq.(9.4) over $z'$, and use
the distribution $P(z,F) = p(F)\,p(z|F)$ to compute various moments. As an
example, the correlation between $\delta\lambda_i = \lambda_i - \langle\lambda_i\rangle$ and $\delta f^j = f^j - \langle f^j\rangle$ or
$\delta F^j = F^j - \langle F^j\rangle$ is
$$\langle\delta\lambda_i\,\delta f^j\rangle = \langle\delta\lambda_i\,\delta F^j\rangle = -\frac{\partial\langle\lambda_i\rangle}{\partial\lambda_{0j}} + \big(\lambda_{0i}-\langle\lambda_i\rangle\big)\big(F_0^j-\langle F^j\rangle\big)\;. \qquad (9.15)$$
When the differences $\lambda_{0i}-\langle\lambda_i\rangle$ or $F_0^j-\langle F^j\rangle$ are negligible one obtains the usual
expression,
$$\langle\delta\lambda_i\,\delta f^j\rangle \approx -\delta_i^j\;. \qquad (9.16)$$

9.2 Variational approximation methods –I*


9.2.1 Mean field theory*
9.2.2 Classical density functional theory*
[Youse… Caticha 2021]
Chapter 10

A Prelude to Dynamics:
Kinematics

In this and the following chapters our main concern will be to deploy the con-
cepts of probability, entropy, and information geometry to formulate quantum
mechanics (QM) as a dynamical model that describes the evolution of probabil-
ities in time. The fact that the dynamical variables are probability distributions
turns out to be highly signi…cant because all changes of probabilities — includ-
ing their time evolution — must be compatible with the basic principles for
updating probabilities. In other words, the kinds of dynamics we seek are those
driven by the maximization of an entropy subject to constraints that carry the
information that is relevant to the particular system at hand.
The goal of entropic dynamics is to generate a trajectory in a space of
probability distributions. As we saw in Chapter 7, these spaces are statistical
manifolds and have an intrinsic metric structure given by the information metric.
Furthermore, our interest in trajectories naturally leads us to consider both their
tangent vectors and their dual covectors because it is these objects that will be
used to represent velocities and momenta respectively. It turns out that, just as
the statistical manifold has a natural metric structure, the statistical manifold
plus all the spaces of tangent covectors is itself a manifold — the cotangent
bundle — that can be endowed with a natural structure called symplectic.1
Our goal will be to formulate an entropic dynamics that naturally re‡ects these
metric and symplectic structures.
But not every curve is a trajectory and not every parameter that labels
points along a curve is time. In order to develop a true dynamics we will have
to construct a concept of time — a problem that will be addressed in the next
chapter. This chapter is devoted to kinematics. As a prelude to a true dynamics
1 The term symplectic was invented by Weyl in 1946. The old name for the family of
groups of transformations that preserve certain antisymmetric bilinear forms had been complex
groups. Since the term was already in use for complex variables, Weyl thought this was
unnecessarily confusing. So he invented a new term by literally translating 'complex' from its
Latin roots com-plexus, which means "together-braided," to its Greek roots συμ-πλεκτικός.

we shall develop some of the tools needed to study families of curves that are
closely associated with the symplectic and metric structures.2
To simplify the discussion in this chapter we shall consider the special case
of a statistical manifold of …nite dimension. Back in Chapter 7 we studied
the information geometry of the manifold associated with a parametric family
of probability distributions. The uncertain variable x can be either discrete
or continuous and the distributions (x) = (xj ) are labeled by parameters
i
(i = 1 : : : n) which will be used as coordinates on the manifold.3 First,
to introduce the main ideas, we shall consider the simpler example in which
the i are generic parameters of no particular signi…cance. Then, we shall
address the example of a simplex — a statistical manifold for which the uncertain
variables are discrete, x = i = 1 : : : n, and the probabilities themselves are used
as coordinates, i = (i). The result is a formalism in which the linearity of the
evolution equations, the emergence of a complex structure, Hilbert spaces, and
a Born rule, are derived rather than postulated.

10.1 Gradients and covectors


Just as we chose displacements as the prototype vectors, we choose the prototype
covectors (also known as covariant vectors and as 1-forms) to be the gradients
of functions. To see how this comes about we note that the derivative $df/d\lambda$ of
the function $f(\theta)$ along the curve $\theta^i = \theta^i(\lambda)$ parametrized by $\lambda$ can be analyzed
in two ways.

The first way consists of interpreting $df/d\lambda$ as the action of an operator $V$
on $f$. Recall from Section 7.2 that the vector $V$ tangent to the curve $\theta^i(\lambda)$ can
be "identified" with a directional derivative. Indeed, if $f(\theta)$ is a scalar function,
then the derivative along the curve is
$$\frac{df}{d\lambda} = V^i\,\frac{\partial f}{\partial\theta^i} \quad\text{where}\quad V^i(\lambda) = \frac{d\theta^i}{d\lambda}\;. \qquad (10.1)$$
Since there is a strict 1-1 correspondence between
$$\frac{d}{d\lambda} = V^i\,\frac{\partial}{\partial\theta^i} \quad\text{and}\quad V = V^i e_i\;, \qquad (10.2)$$
we shall define the action of $V$ on $f$ by
$$V(f) \stackrel{\text{def}}{=} \frac{d}{d\lambda}f = (V^i\partial_i)f\;, \qquad (10.3)$$
2 The material of this chapter is adapted from [Caticha 2019, 2021b] which itself builds and

expands on previous work on the geometric and symplectic structure of quantum mechanics
[Kibble 1979; Heslot 1985; Anandan and Aharonov 1990; Cirelli et al. 1990; Abe 1992;
Hughston 1995; Ashekar and Schilling 1998; de Gosson, Hiley 2011; Elze 2012; Reginatto and
Hall 2011, 2012].
3 We will continue to adopt the standard notation of using upper indices to label coordinates

and components of vectors (e.g. i and V ~ = V i~ei ) and lower indices to denote components
of covectors (e.g. @F=@ i = @i F ). We also adopt the Einstein summation convention: a sum
over an index is understood whenever it appears repeated as an upper and a lower index.

so that
$$\frac{d}{d\lambda} = V\;. \qquad (10.4)$$
The vectors
$$e_i = \frac{\partial}{\partial\theta^i} = \partial_i \qquad (10.5)$$
constitute the "coordinate" basis — a basis of vectors that is adapted to the
coordinate grid in the sense that the vectors $\{e_i\}$ are tangent to the grid lines.
More explicitly, the vector $e_i$ is tangent to the coordinate curve defined by
holding constant all $\theta^j$s with $j \neq i$, and using $\theta^i$ as the parameter along the
curve.
The second way to think about the directional derivative $df/d\lambda$ is to write
$$\nabla f[V] \stackrel{\text{def}}{=} (\partial_i f)\,V^i = \frac{df}{d\lambda} \qquad (10.6)$$
and interpret $df/d\lambda$ as the scalar that results from the action of the linear
functional $\nabla f$ on the vector $V$. Indeed, using linearity the action of $\nabla f$ on the
vector $V$ is
$$\nabla f[V] = V^i\,\nabla f[e_i] \quad\text{so that}\quad \nabla f[e_i] = \partial_i f\;. \qquad (10.7)$$
When the function $f$ is one of the coordinates, $f(\theta) = \theta^j$, we obtain
$$\nabla\theta^j[e_i] = \frac{\partial\theta^j}{\partial\theta^i} = \delta^j_i\;. \qquad (10.8)$$
Furthermore, using the chain rule
$$\nabla f(\theta) = \frac{\partial f}{\partial\theta^i}\,\nabla\theta^i = \partial_i f\;\nabla\theta^i\;, \qquad (10.9)$$
we see that $\{\partial_i f\}$ are the components of the covector $\nabla f$, and that $\{\nabla\theta^i\}$
constitute a covector basis which is dual or reciprocal to the vector basis $\{e_i\}$.

The transformation of vector components and of basis vectors under a change
of coordinates (see eqs.(7.11) and (7.17)) is such that the vectors $V = V^i e_i$ are
invariant. Generic covectors $\omega = \omega_i\,\nabla\theta^i$ can also be defined as invariant objects
whose components $\omega_i$ transform as $\partial_i f$. Using the chain rule the transformation
to primed coordinates, $\theta^i \to \theta^{i'}$, is
$$\frac{\partial f}{\partial\theta^{i'}} = \frac{\partial\theta^j}{\partial\theta^{i'}}\,\frac{\partial f}{\partial\theta^j} \quad\text{or}\quad \partial_{i'}f = \frac{\partial\theta^j}{\partial\theta^{i'}}\,\partial_j f\;. \qquad (10.10)$$
Using
$$\nabla\theta^{i'} = \frac{\partial\theta^{i'}}{\partial\theta^j}\,\nabla\theta^j \quad\text{and}\quad \omega_{i'} = \frac{\partial\theta^j}{\partial\theta^{i'}}\,\omega_j \qquad (10.11)$$
we can check that $\omega$ is indeed invariant,
$$\omega = \omega_i\,\nabla\theta^i = \omega_{i'}\,\nabla\theta^{i'}\;. \qquad (10.12)$$
Alternatively, we can define generic covectors as linear functionals of vectors.
Indeed, using linearity and (10.8) the action of $\omega$ on the vector $V$,
$$\omega[V] = \omega_i\,\nabla\theta^i[V] = \omega_i V^j\,\nabla\theta^i[e_j] = \omega_i V^i\;, \qquad (10.13)$$
is invariant.

10.2 Lie derivatives


The task of de…ning the derivative of a vector …eld (and more generally of tensor
…elds) in a curved manifold amounts to comparing vectors in two neighboring
tangent spaces (see e.g., [Schutz 1980]). The problem is to provide a criterion to
decide which vector in one tangent space is considered as being the “same” as
another vector in a neighboring and therefore di¤ erent tangent space. Such a
criterion would allow us to give meaning to the statement that a particular vector
…eld is “constant”or has a “vanishing derivative.”Solving this problem requires
introducing additional structure. One solution is to introduce a connection …eld
and this leads to the concept of a covariant derivative. Another solution, due to
Sophus Lie, is to de…ne the derivative relative to a “reference”vector …eld. This
approach leads to the concept of the Lie derivative — the derivative of a tensor
…eld with respect to a vector …eld. (See e.g., [Schutz 1980].) Lie derivatives are
introduced as follows.
Since vectors are de…ned as tangents to curves, if we are given a space-…lling
congruence of curves, then we can de…ne the associated vector …eld : to every
point we associate the vector V ( ) that happens to be tangent to the particular
curve that passes through . Conversely, if we are given a vector …eld, then we
can de…ne the corresponding congruence of curves i = i ( ) that are tangent
to V ( ) at every point .
The vector …eld V ( ) can be used to de…ne a di¤eomorphism V the action of
which is to map the point to the point displaced by a parameter “distance”
along the congruence,

if = ( 0) then V ( ) = ( 0 + )= : (10.14)

We shall assume that the map V is su¢ ciently smooth and invertible.
To de…ne the Lie derivative of a scalar function f ( ) along (the congruence
de…ned by) the …eld V ( ) we …rst introduce the notion of Lie-dragging. Given
the function f de…ne a new function f called the pull-back of f under the action
of the map V ,
f ( ) = f( ) : (10.15)
Then the Lie derivative of f along V is de…ned by
def 1
$V f = lim [f ( ) f ( )] : (10.16)
!0

The important point here is that both functions f and f are evaluated at the
same point . The idea is that when Lie-dragging is applied to a vector …eld,
its Lie derivative would involve subtracting vectors located at the same tangent
space. As ! 0,
i
i d
f ( ) = f( ) = f( + )
d
i
@f d df
= f( ) + = f( ) + (10.17)
@ i d d

so that
df
$V f = = V [f ] : (10.18)
d
This result is not particularly surprising: the Lie derivative of f along V is
just the derivative of f along V . It gets more interesting when we apply the
same idea to the Lie derivative of a vector …eld U along the congruence de…ned
by V . We note, in particular, that the Lie derivative of a scalar function is a
scalar function too or, in other words, the Lie derivative is a scalar di¤erential
operator and, therefore, its action on a vector U yields a vector, $V U , that can
itself act on functions, $V U [f ], to yield other scalar functions.
The de…nition of the Lie derivative can be extended to vectors and tensors
by imposing the natural additional requirement that the Lie derivative be a
derivative, that is, it must obey a Leibniz rule. For example, the Lie derivative
$V U of a vector U is de…ned so that for any scalar function f the Lie derivatives
satisfy
def
$V U [f ] = $V U [f ] + U [$V f ] ; (10.19)
while the derivative $V T of a generic tensor T acting on a collection a; b; ::: of
vectors or covectors satis…es
def
$V [T (a; b; :::)] = [$V T ] (a; b; :::) + T ($V a; b; :::) + T (a; $V b; :::) + ::: (10.20)

Next we compute the Lie derivatives of vectors, covectors, and tensors in terms
of their components.

10.2.1 Lie derivative of vectors


We wish to calculate the Lie derivative of U along V . First we note that if
f = f ( ), then
df @f d i
U [f ] = = i (10.21)
d @ d
is just a scalar function of so that

$V U [f ] = V U [f ] : (10.22)

On the other hand, imposing the Leibniz rule (10.19), gives

$V U [f ] = $V U [f ] + U V [f ] : (10.23)

Therefore,
def
$V U = V U U V = [V ; U ] : (10.24)
where we introduced the Lie bracket notation on the right. Since the Lie bracket
is antisymmetric, so is the Lie derivative,

$V U = $U V : (10.25)

To evaluate the Lie derivative in terms of components one calculates the deriv-
atives in (10.24),

V U [f ] = V i @i U j @j f = V i @i U j @j f + V i U j @i @j f ; (10.26)
i j i j i j
U V [f ] = U @i V @j f = U @i V @j f + U V @i @j f : (10.27)

The result is
$$\pounds_V U = [V,U] = \big(V^i\partial_i U^j - U^i\partial_i V^j\big)\,\partial_j\;, \qquad (10.28)$$
which explicitly shows that $\pounds_V U$ is a vector with components
$$(\pounds_V U)^j = [V,U]^j = V^i\partial_i U^j - U^i\partial_i V^j\;. \qquad (10.29)$$
Side remark: Equation (10.29) shows an important difference between the Lie
derivative $\pounds_V U$ and the covariant derivative $\nabla_V U$. The latter depends on $V(\theta)$
only at the point $\theta$. Indeed, if $f(\theta)$ is some scalar function, then $\nabla_{fV}U = f\,\nabla_V U$.
In contrast, $\pounds_V U$ also depends on the derivatives of $V(\theta)$ at $\theta$.
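The component formula (10.29) and the antisymmetry of the bracket are easy to check symbolically. The sketch below (not part of the text; the two vector fields are arbitrary two-dimensional examples) does so with sympy.

```python
import sympy as sp

# Symbolic check of eq.(10.29): components of the Lie bracket [V, U] in 2-D,
# and its antisymmetry [V, U] = -[U, V].
th1, th2 = sp.symbols('theta1 theta2')
theta = (th1, th2)

V = (th1 * th2, sp.sin(th1))    # components V^i of one vector field
U = (th2**2, th1 + th2)         # components U^j of another

def lie_bracket(V, U):
    return [sp.simplify(sum(V[i] * sp.diff(U[j], theta[i])
                            - U[i] * sp.diff(V[j], theta[i])
                            for i in range(2)))
            for j in range(2)]

print(lie_bracket(V, U))        # components of the Lie derivative of U along V
print(lie_bracket(U, V))        # the negatives of the above
```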

10.2.2 Lie derivative of covectors


To calculate the Lie derivative $V ! of ! along the congruence de…ned by V we
…rst recall that $V is a scalar operator so that $V ! is itself a covector,
i
$V ! = [$V !]i r : (10.30)

Next we consider a generic vector …eld U and use the fact that !(U ) = !i U i is
a scalar function. The Lie derivative of the right hand side is

$V (!i U i ) = V (!i U i ) = V j @j (!i U j )


= V j (@j !i )U i + V j !i (@j U i ) : (10.31)

Since $V obeys the Leibniz rule, the Lie derivative of the left hand side can also
be written as

$V [!(U )] = [$V !](U ) + !($V U ) = [$V !]i U i + !i ($V U )i : (10.32)

Equating (10.31) and (10.32), and using (10.29) we get

V j (@j !i )U i + V j !i (@j U i ) = [$V !]i U i + !i (V j @j U i U j @j V i ) ; (10.33)

V j (@j !i )U i = [$V !]i U i !j U i @i V j : (10.34)

Since this must hold for any arbitrary vector U i we get

($V !)i = V j @j !i + !j @i V j : (10.35)



10.2.3 Lie derivative of the metric


To …nd $V G where G is the metric tensor we deploy the same trick as in the
previous sections: use the fact that the action of G on two generic vector …elds
A and B, G(A; B) = Gij Ai B j , is a scalar function and take the Lie derivative
of both sides. The right hand side gives

$V (Gij Ai B j ) = V k @k (Gij Ai B j ) (10.36)

and, using the Leibniz rule on the left hand side, we get

$V [G(A; B)] = [$V G](A; B)] + G($V A; B) + G(A; $V B)


= [$V G]ij Ai B j + Gij ($V A)i B j + Gij Ai ($V B)j : (10.37)

Setting (10.36) equal to (10.37) and using (10.29) leads to the desired expression,

[$V G]ij = V k @k Gij + Gik @j V k + Gkj @i V k : (10.38)

Notice that in the derivation above we have not used the symmetry or any other
properties of G beyond the fact that G( ; ) is a tensor so that G(A; B) is a scalar
function. This means that the expression (10.38) gives the Lie derivative of any
covariant tensor with components Tij ,

[$V T ]ij = V k @k Tij + Tik @j V k + Tkj @i V k : (10.39)

10.3 The cotangent bundle


The construction of an entropic dynamics in the next chapter requires that we
deal with several distinct spaces. One is the space of microstates — the variables
that we are trying to predict. This space will be called the ontic con…guration
space.4 In this chapter we deal with discrete microstates labelled by x = 1 : : : N
such as might describe an N -sided (possibly “quantum”) die. A second space
of interest is the statistical manifold of normalized distributions,
n PN o
P = (xj )j x=1 (xj ) = 1 (10.40)

labelled by coordinates i , i = 1 : : : n. This space will be called the epistemic


con…guration space which we abbreviate to the e-con…guration space. The n-
dimensional space P is a subspace of the (N 1)-dimensional simplex of nor-
PN
malized distributions S = f (x)j x=1 (x) = 1g.
4 These ontic variables are meant to represent something real only within the context of

a particular model. One may consider models where the positions of particles are assumed
ontic. In other models one might assume that it is the …eld variables that are ontic, while the
particles are quantum excitations of the …elds. It is even possible to conceive of hybrid models
in which some ontic variables represent the particles we call “matter”(e.g., the fermions) while
some other ontic variables represent the gauge …elds we call “forces” (e.g., the electromagnetic
…eld).

Given any manifold such as P we can construct two other manifolds that
turn out to be useful. One of these manifolds is denoted by T P and is called
the tangent bundle. The idea is the following. Consider all the curves passing
through a point = ( 1 : : : n ). The set of all the vectors that are tangent to
those curves is a vector space called the tangent space at and is denoted T P .
The space T P composed of P plus all its tangent spaces T P turns out to be a
manifold of a special type generically called a …ber bundle; P is called the base
manifold and T P is called the …ber at the point . Thus, T P is appropriately
called the tangent bundle.
We can also consider the space of all covectors at a point . Such a space is
denoted T P and is called the cotangent space at . The second special man-
ifold we can construct is the …ber bundle composed of P plus all its cotangent
spaces T P . This …ber bundle is denoted by T P and is called the cotangent
bundle.
The reason we care about vectors and covectors is that these are the objects
that will eventually be used to represent velocities and momenta. Indeed, if P
is the e-con…guration space, its associated cotangent bundle T P, which we will
call the e-phase space, will play a central role. But that is for later; for now all
we need is that the tangent and cotangent bundles are geometric objects that
are always available to us independently of any physical considerations.

10.3.1 Vectors, covectors, etc.


A point X 2 T P will be represented as X = ( ; ), where = ( 1 : : : n )
are coordinates on the base manifold P and = ( 1 : : : n ) are some generic
coordinates on the space T P that is cotangent to P at the point . Curves on
T P allow us to de…ne vectors on the tangent spaces T (T P)X . Let X = X( )
be a curve parametrized by , then the vector V tangent to the curve at X =
( ; ) has components d i =d and d i =d , and is written

d d i @ d i @
V = = + ; (10.41)
d d @ i d @ i

where @=@ i and @=@ i are the basis vectors and the index i = 1 : : : n is summed
over. The directional derivative of a function F (X) along the curve X( ) is
i
dF @F d @F d i ~ [V ] ;
= i + = rF (10.42)
d @ d @ i d

where r~ is the gradient in T P, that is, the gradient of a generic function


F (X) = F ( ; ) is
~ = @F r
rF ~ i + @F r ~ i: (10.43)
@ i @ i
~ on the bundle T P.
The tilde ‘~’serves to remind us that this is the gradient r
To simplify the notation further instead of keeping separate track of the i
and i coordinates it is more convenient to combine them into a single X. A

point X = (θ, π) will then be labelled by its coordinates

   X^{αi} = (X^{1i}, X^{2i}) = (θ^i, π_i) ,   (10.44)

where αi is a composite index. The first index α (chosen from the beginning
of the Greek alphabet) takes two values, α = 1, 2. It is used to keep track of
whether i is an upper index θ^i (α = 1) or a lower index π_i (α = 2).^5 Then
eqs.(10.41) and (10.43) are written as

   V = d/dλ = V^{αi} ∂/∂X^{αi} , with V^{αi} = dX^{αi}/dλ = (dθ^i/dλ, dπ_i/dλ) ,   (10.45)

and

   ∇̃F = (∂F/∂X^{αi}) ∇̃X^{αi} .   (10.46)

The repeated indices indicate a double summation over α and i. The action of
the linear functional ∇̃F[·] on a vector V is defined by the action of the basis
covectors ∇̃X^{αi} on the basis vectors, ∂/∂X^{βj} = ∂_{βj},

   ∇̃X^{αi}[∂_{βj}] = ∂X^{αi}/∂X^{βj} = δ^α_β δ^i_j .   (10.47)

Using linearity we find

   ∇̃F[V] = (∂F/∂X^{αi}) V^{αi} = dF/dλ .   (10.48)

10.4 Hamiltonian flows


We have seen that vectors that are tangent to a space-filling congruence of curves
X^{αi} = X^{αi}(λ) define a vector field — a vector V(X) at each point X ∈ T*P.
Conversely, a vector field V(X) defines the congruence of curves X^{αi} = X^{αi}(λ)
that are tangent to the field V(X) at every point X. We are interested in those
special congruences or flows that reflect the structure of the symmetries of the
manifold.

10.4.1 The symplectic form


Once a manifold is supplied with the symmetric bilinear form that we call the
metric tensor a number of remarkable properties follow. The metric tensor
gives the manifold a fairly rigid structure that is described as the geometry of
the manifold. It also induces a natural map from vectors to covectors and a
special significance is given to those transformations that preserve the bilinear
form and to the associated group of isometries.
Something similar occurs when the manifold happens to be a cotangent
bundle. Then it is possible to endow it with an antisymmetric bilinear form,
^5 This allows us the freedom to switch from the composite index αi to the separate indices
α and i as convenience dictates.

called the symplectic form, and the manifold acquires a certain floppy structure
that is somewhat less rigid than that provided by a metric. This structure
is described as the symplectic geometry of the manifold [Arnold 1997][Souriau
1997][Schutz 1980]. As was the case for the metric tensor, the symplectic form
also induces a map from vectors to covectors and the group of transformations
that preserve it is particularly important. It is called the symplectic group
which in Hamiltonian mechanics has long been known as the group of canonical
transformations.
Once local coordinates (θ^i, π_i) on T*P have been established there is a natural
choice of symplectic form

   Ω[·, ·] = ∇̃θ^i[·] ∇̃π_i[·] − ∇̃π_i[·] ∇̃θ^i[·] .   (10.49)

The action of Ω[·, ·] on two vectors V = d/dλ and U = d/dμ is obtained using
(10.47),

   ∇̃θ^i(V) = V^{1i} and ∇̃π_i(V) = V^{2i} .   (10.50)

The result is

   Ω(V, U) = V^{1i} U^{2i} − V^{2i} U^{1i} = Ω_{αi,βj} V^{αi} U^{βj} ,   (10.51)

so that the components of Ω are

   Ω_{αi,βj} = [ 0  1 ; −1  0 ] δ_{ij} .   (10.52)

The form Ω is non-degenerate, that is, for every vector V there exists some
vector U such that Ω[V, U] ≠ 0.
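A quick numerical illustration may be helpful. The following Python/NumPy sketch (ours; all names in it are purely illustrative) builds the block matrix of eq.(10.52) for n = 3 and verifies that it is antisymmetric and non-degenerate.

```python
# Illustrative sketch (not part of the formalism): the canonical symplectic form of
# eq.(10.52) on a 2n-dimensional space is antisymmetric and non-degenerate.
import numpy as np

n = 3
I = np.eye(n)
Omega = np.block([[np.zeros((n, n)), I],
                  [-I, np.zeros((n, n))]])   # components Omega_{alpha i, beta j}

assert np.allclose(Omega, -Omega.T)          # antisymmetry
assert abs(np.linalg.det(Omega)) > 0         # non-degeneracy: Omega[V, .] vanishes only for V = 0
```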
An aside: In the language of exterior calculus the symplectic form can be
derived by first introducing the Poincaré 1-form

   ω = π_i d̃θ^i ,   (10.53)

where d̃ is the exterior derivative on T*P, and the corresponding symplectic
2-form is

   Ω = d̃ω = d̃π_i ∧ d̃θ^i .   (10.54)

By construction Ω is locally exact (Ω = d̃ω) and closed (d̃Ω = 0).
Remark: A generic 2n-dimensional symplectic manifold is a manifold with a
differential two-form ω(·, ·) that is closed (d̃ω = 0) and non-degenerate (that is,
if the vector field V is nowhere vanishing, then the one-form ω(V, ·) is nowhere
vanishing too). The Darboux theorem [Guillemin Sternberg 1984] states that
one can choose coordinates (q^i, p_i) so that at any point the two-form ω can be
written as

   ω = d̃q^i ∧ d̃p_i .   (10.55)

In general this can only be done locally; there is no choice of coordinates that
will accomplish this diagonalization globally. The point of this remark is to
emphasize that our construction of Ω in (10.49) follows a very different logic: we

do not start with a 2n-dimensional manifold with a pre-assigned two-form ω(·, ·)
which we then proceed to locally diagonalize. We start with a 2n-dimensional
manifold and a given set of privileged coordinates (θ, π) which we then use to
construct the globally diagonal symplectic form Ω.

10.4.2 Hamilton’s equations and Poisson brackets


Next we derive the 2n-dimensional T*P analogues of results that are standard
in classical mechanics [Arnold 1997][Souriau 1997][Schutz 1980]. Given a vector
field V(X) on T*P we can integrate V^{αi}(X) = dX^{αi}/dλ to find its integral
curves X^{αi} = X^{αi}(λ). We are particularly interested in those vector fields
V(X) that generate flows that preserve the symplectic structure in the sense
that
   £_V Ω = 0 ,   (10.56)
where the Lie derivative is given by (10.39),

   (£_V Ω)_{αi,βj} = V^{γk} ∂_{γk} Ω_{αi,βj} + Ω_{γk,βj} ∂_{αi} V^{γk} + Ω_{αi,γk} ∂_{βj} V^{γk} .   (10.57)

Since by eq.(10.52) the components Ω_{αi,βj} are constant, ∂_{γk} Ω_{αi,βj} = 0, we can
rewrite £_V Ω as

   (£_V Ω)_{αi,βj} = ∂_{αi}(Ω_{γk,βj} V^{γk}) − ∂_{βj}(Ω_{γk,αi} V^{γk}) ,   (10.58)

which is the exterior derivative (roughly, the curl) of the covector Ω_{γk,αi} V^{γk}. By
Poincaré's lemma, requiring £_V Ω = 0 (a vanishing curl) implies that Ω_{γk,αi} V^{γk}
is the gradient of a scalar function, which we will denote Ṽ(X),

   Ω_{γk,αi} V^{γk} = ∂_{αi} Ṽ or Ω(V, ·) = ∇̃Ṽ(·) .   (10.59)

In the opposite direction we can easily check that (10.59) implies £_V Ω = 0.
Indeed,
   (£_V Ω)_{αi,βj} = ∂_{αi}(∂_{βj} Ṽ) − ∂_{βj}(∂_{αi} Ṽ) = 0 .   (10.60)
Using (10.52), eq.(10.59) is more explicitly written as

   (dθ^i/dλ) ∇̃π_i − (dπ_i/dλ) ∇̃θ^i = (∂Ṽ/∂θ^i) ∇̃θ^i + (∂Ṽ/∂π_i) ∇̃π_i ,   (10.61)

or
   dθ^i/dλ = ∂Ṽ/∂π_i and dπ_i/dλ = −∂Ṽ/∂θ^i ,   (10.62)
which we recognize as Hamilton's equations for a Hamiltonian function Ṽ. This
justifies calling V the Hamiltonian vector field associated to the Hamiltonian
function Ṽ. This is how Hamiltonians enter physics — as a way to generate
vector fields that preserve Ω.
From (10.51), the action of the symplectic form on two Hamiltonian vector
fields V = d/dλ and U = d/dμ generated respectively by Ṽ and Ũ is

   Ω(V, U) = (dθ^i/dλ)(dπ_i/dμ) − (dπ_i/dλ)(dθ^i/dμ) ,   (10.63)

which, using (10.62), gives

   Ω(V, U) = (∂Ṽ/∂θ^i)(∂Ũ/∂π_i) − (∂Ṽ/∂π_i)(∂Ũ/∂θ^i) ≡ {Ṽ, Ũ} ,   (10.64)

where, on the right hand side, we have introduced the Poisson bracket notation.
It is easy to check that the derivative of an arbitrary function F(X) along the
congruence defined by the vector field V = d/dλ, which is given by (10.42) or
(10.48),

   dF/dλ = (∂F/∂X^{αi})(dX^{αi}/dλ) = (∂F/∂θ^i)(dθ^i/dλ) + (∂F/∂π_i)(dπ_i/dλ) ,   (10.65)

can be expressed in terms of Poisson brackets,

   dF/dλ = {F, Ṽ} .   (10.66)

These results are summarized as follows:
(1) The flows that preserve the symplectic structure, £_V Ω = 0, are generated by
Hamiltonian vector fields V associated to Hamiltonian functions Ṽ, eq.(10.62),

   V^{αi} = dX^{αi}/dλ = {X^{αi}, Ṽ} .   (10.67)

(2) The action of Ω on two Hamiltonian vector fields is the Poisson bracket of
the associated Hamiltonian functions,

   Ω(V, U) = Ω_{αi,βj} V^{αi} U^{βj} = {Ṽ, Ũ} .   (10.68)

We end this section with a word of caution. We have uncovered a mathematical
formalism that resembles classical mechanics. We could arbitrarily
choose one particular Hamiltonian function Ṽ and call it H̃, and we could rename
the parameter λ and call it time, but this does not mean that we have
thereby constructed a dynamical theory. One facet of the problem is that the
choice of symplectic form depends on the choice of local coordinates (θ^i, π_i). It
may be natural to assign a privileged statistical or physical significance to the
parameters θ^i but how do we choose the corresponding conjugate momentum
π_i? In classical mechanics this question is settled by appealing to a Lagrangian
L(q, q̇) which allows us to define the conjugate momentum p_i = ∂L/∂q̇^i, but
here we do not have a Lagrangian. Another facet of this same problem is that
time is not merely just another parameter labeling points along a curve. What
makes time special? What choices of Hamiltonian functions qualify as being
the generators of time evolution? Having raised these issues — which will be
addressed in the next chapter — it is nevertheless nothing less than astonishing
to see that the familiar Hamiltonian formalism emerges from purely geometrical
considerations.
Although we have not yet constructed a proper dynamical theory it is desirable
to adopt a more suggestive notation that anticipates the dynamics to

be derived in the next chapter: The flow generated by a Hamiltonian function
H̃(X) and parametrized by t is given by Hamilton's equations

   dθ^i/dt = ∂H̃/∂π_i and dπ_i/dt = −∂H̃/∂θ^i ,   (10.69)

and the evolution of any function f(X) given by the Hamiltonian vector field H(X)
is
   df/dt = H(f) = {f, H̃} with H = (∂H̃/∂π_i) ∂/∂θ^i − (∂H̃/∂θ^i) ∂/∂π_i .   (10.70)

10.5 The information geometry of e-phase space


As a prelude to a true dynamics we wish to characterize those special flows that
reflect the structures intrinsic to the e-phase space T*P. We have already discussed
flows that preserve the symplectic structure. Next we consider the other
natural structure present in a statistical manifold, namely, its metric structure.
The immediate obstacle here is that although the space P is a statistical
manifold and is automatically endowed with a unique information metric, the
cotangent bundle T*P is not a statistical manifold. Thus, our next goal is to
endow T*P with a metric that is compatible with the metric of P.
Once a metric structure is in place we can ask: does the distance between two
neighboring points — the extent to which we can distinguish them — grow or
decrease with the flow? Or does it stay the same? There are many possibilities
but for pragmatic (and esthetic) reasons we are led to consider the simplest form
of flow — one that preserves the metric. This will lead us to study the Hamilton
flows (those that preserve the symplectic structure) that are also Killing flows
(those that preserve the metric structure).

10.5.1 The metric of e-phase space T*P


The present goal is to extend the metric of the statistical manifold P — given
by information geometry — to the full e-phase space, T*P. The extension can
be carried out in many ways; here we focus on a particular extension that will
turn out to be useful for quantum mechanics. The virtue of the formulation
below is that the number of input assumptions is kept to a minimum.
The central idea is that the only metric structure at our disposal is that of
the statistical manifold P. As we saw in Chapter 7,

   δℓ² = g_ij δθ^i δθ^j ,   (10.71)

where
   g_ij(θ) = Σ_x ρ(x|θ) (∂ log ρ(x|θ)/∂θ^i)(∂ log ρ(x|θ)/∂θ^j) .   (10.72)
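For readers who want a concrete example, the sketch below (ours; the softmax parametrization is merely a convenient toy choice) evaluates the information metric (10.72) numerically for a three-outcome distribution with two parameters and confirms that it is symmetric and positive definite.

```python
# Illustrative sketch: the information (Fisher) metric of eq.(10.72) for a
# 3-outcome distribution rho(x|theta) parametrized by two parameters.
import numpy as np

def rho(theta):                      # rho(x|theta) for x = 0, 1, 2 (softmax toy model)
    w = np.exp([theta[0], theta[1], 0.0])
    return w / w.sum()

def fisher_metric(theta, eps=1e-5):
    p = rho(theta)
    dlog = np.zeros((2, 3))          # d log rho / d theta^i by central differences
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        dlog[i] = (np.log(rho(theta + e)) - np.log(rho(theta - e))) / (2*eps)
    return np.einsum('x,ix,jx->ij', p, dlog, dlog)   # g_ij = sum_x rho dlog_i dlog_j

g = fisher_metric(np.array([0.3, -0.7]))
assert np.allclose(g, g.T) and np.all(np.linalg.eigvalsh(g) > 0)  # symmetric, positive definite
```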
Since the only available tensor is g_ij the length element of T*P,

   δℓ̃² = G_{αi,βj} δX^{αi} δX^{βj} = G_{1i,1j} δθ^i δθ^j + 2 G_{1i,2j} δθ^i δπ_j + G_{2i,2j} δπ_i δπ_j ,   (10.73)

must be of the form

   δℓ̃² = α g_ij δθ^i δθ^j + β δθ^i δπ_i + γ g^ij δπ_i δπ_j ,   (10.74)

where α, β, and γ are constants to be determined next.
To fix the value of α we recall that the information metric g_ij is unique up
to an overall multiplicative constant which is ultimately irrelevant; its role is to
set the units of δℓ and of δℓ̃ relative to those of δθ. We can rewrite (10.74) as

   δℓ̃² = α [ g_ij δθ^i δθ^j + β' δθ^i δπ_i + γ' g^ij δπ_i δπ_j ] ,   (10.75)

with new constants β' and γ' and we can either keep or drop α as convenience
or convention dictates. In contrast, the values of β' and γ' are significant; they
are not a matter of convention. For future convenience we shall write γ' = 1/h²
in terms of a new constant h, and choose the irrelevant α as α = h.^6 This allows
us to absorb h into g_ij and write

   g_ij(θ) = h Σ_x ρ(x|θ) (∂ log ρ(x|θ)/∂θ^i)(∂ log ρ(x|θ)/∂θ^j) .   (10.76)
To fix the value of β' we impose an additional requirement that is motivated
by its eventual relevance to physics. Consider a curve [θ(λ), π(λ)] on T*P; its
flow-reversed curve — or λ-reversed curve — is given by

   θ(λ) → θ'(λ) = θ(−λ) and π(λ) → π'(λ) = −π(−λ) .   (10.77)

When projected to P the flow-reversed curve coincides with the original curve,
but it is now traversed in the opposite direction. We shall require that the
speed |dℓ̃/dλ| remains invariant under flow-reversal. Since under flow-reversal
the mixed δθ δπ terms in (10.74) change sign, it follows that invariance implies
that β' = 0.
The net result is that the line element, which has been designed to be fully
determined by information geometry, takes a particularly simple form,

   δℓ̃² = g_ij δθ^i δθ^j + g^ij δπ_i δπ_j .   (10.78)

Remark: We emphasize that assuming that e-phase space is symmetric under


flow-reversal does not amount to imposing that the dynamics itself be time-
reversal invariant. Eventually we will want to construct dynamical models which
exhibit time-reversal symmetry for some interactions and violate it for others.
This requires an e-phase space that allows the symmetry, and any potential
violations will then be due to speci…c interaction terms in the Hamiltonian.
In other words, we shall restrict ourselves to models in which time-reversal
violations are induced at the dynamical level of the Hamiltonian and not at the
kinematical level of the geometry of e-phase space.
^6 In the next chapter we shall find it useful to assign conventional units to the momenta
that are conjugate to the probability densities of the positions of particles. There we shall
rewrite γ' as γ' = 1/ℏ², and the (irrelevant) constant α as α = ℏ. In conventional units the
value of ℏ is fixed by experiment but one can always choose units so that ℏ = 1.

10.5.2 A complex structure for T*P


The metric tensor G_{αi,βj} and its inverse G^{αi,βj} can be used to lower and raise
indices. In particular, we can raise the first index of the symplectic form Ω_{αi,βj}
in eq.(10.52),
   G^{αi,γk} Ω_{γk,βj} = −J^{αi}_{βj} .   (10.79)
(The convenience of introducing a minus sign will become clear later. See
eqs.(10.158) and (10.159).) We note that both G_{αi,βj} and the symplectic form
Ω_{αi,βj} in eq.(10.52) map vectors to covectors while the tensor J^{αi}_{βj} maps vectors
to vectors. Indeed, the action of J is such that Ω maps a vector to a covector
which is then mapped by the inverse G^{−1} back to a vector.
The tensor J has an important property that is most easily derived by writing
G and Ω in block matrix form,

   G = [ g  0 ; 0  g^{−1} ] , G^{−1} = [ g^{−1}  0 ; 0  g ] , Ω = [ 0  1 ; −1  0 ] .   (10.80)

Then eq.(10.79) is

   J = −G^{−1} Ω = [ 0  −g^{−1} ; g  0 ] .   (10.81)

We can immediately check that

   J J = −1 or J^{αi}_{γk} J^{γk}_{βj} = −δ^α_β δ^i_j ,   (10.82)

which shows that J is a square root of the negative unit matrix. This fact is
expressed by saying that J endows T*P with a complex structure.
Furthermore, we can check that the action of J on any two vectors U and
V is an isometry, that is,

   G(JU, JV) = G(U, V) .   (10.83)

Proof: In matrix form the LHS is

   (JU)^T G (JV) = U^T J^T G J V   (10.84)

(the superscript T stands for transpose). But, from (10.80) and (10.81), we have

   J^T G J = G .   (10.85)

Therefore
   (JU)^T G (JV) = U^T G V ,   (10.86)
which is (10.83).
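The algebra above is easily checked numerically. In the Python sketch below (ours; a random positive-definite matrix g stands in for the information metric) the tensor J = −G⁻¹Ω squares to minus the identity and acts as an isometry of G.

```python
# Illustrative check of eqs.(10.81)-(10.85) with a random positive-definite g.
import numpy as np

n, rng = 3, np.random.default_rng(1)
a = rng.normal(size=(n, n))
g = a @ a.T + n * np.eye(n)                     # a positive-definite stand-in for g_ij

Z, I = np.zeros((n, n)), np.eye(n)
G     = np.block([[g, Z], [Z, np.linalg.inv(g)]])
Omega = np.block([[Z, I], [-I, Z]])
J     = -np.linalg.inv(G) @ Omega

assert np.allclose(J @ J, -np.eye(2*n))         # J^2 = -1, eq.(10.82)
assert np.allclose(J.T @ G @ J, G)              # isometry, eq.(10.85)
```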
To summarize, in addition to the symplectic Ω and metric G structures the
cotangent bundle T*P is also endowed with a complex structure J. Such highly
structured spaces are generically known as Kähler manifolds. Here we deal
with a curved Kähler manifold that is special in that it inherits its metric from
information geometry.

10.6 Quantum kinematics: symplectic and metric structures
After considering generic statistical manifolds we shall now specialize to the
type of e-configuration space that is relevant to quantum mechanics. In the
next chapter the uncertain variable x will be a continuous variable that labels
the positions of particles. Here, to simplify the discussion, we shall assume that
x is a discrete variable, x = i = 1 ... n, such as one might use to describe
an n-sided quantum die. The e-configuration space is the (n−1)-dimensional
simplex,
   S = { ρ | ρ(i) ≥ 0 , Σ_{i=1}^n ρ(i) = 1 } ,   (10.87)
and as coordinates we shall use the probabilities themselves, ρ^i = ρ(i). Except
for the inconvenience that the space S is constrained to normalized probabilities
so that the coordinates ρ^i are not independent, a mere substitution θ^i → ρ^i
allows the results of the previous sections to carry through essentially unchanged.
This technical problem can, however, be handled by embedding the (n−1)-
dimensional manifold S into a manifold of one dimension higher, the so-called
positive cone, denoted S^+, where the coordinates ρ^i are unconstrained. Thus, a
point X = (ρ, Φ) in the 2n-dimensional T*S^+ will be labelled by its coordinates
X^{αi} = (X^{1i}, X^{2i}) = (ρ^i, Φ_i), and with the substitution θ^i → ρ^i our previous
results for T*P can be directly imported to T*S^+.
The issue of normalization has, however, important consequences which we
address next.

10.6.1 The normalization constraint


Since our actual interest is not in flows on the extended T*S^+ but on the
constrained T*S of normalized probabilities we shall consider flows that preserve
the normalization of probabilities. Let

   |ρ| ≡ Σ_{i=1}^n ρ^i and Ñ ≡ 1 − |ρ| .   (10.88)

The Hamiltonians H̃ that are relevant to quantum mechanics are such that the
initial condition
   Ñ = 0   (10.89)
is preserved by the flow. However, as we shall see in the next chapter, the actual
quantum Hamiltonians will also preserve the constraint Ñ = const even when
the constant does not vanish.^7 Therefore, we have

   ∂_t Ñ = {Ñ, H̃} = 0 or Σ_i ∂H̃/∂Φ_i = Σ_i dρ^i/dt = 0 .   (10.90)

^7 As we shall see in the next chapter the quantum evolution of probabilities ρ(x) takes
the form of a local conservation equation, eq.(11.49). This means that the Hamiltonian will
preserve the constraint Ñ = const whether the constant vanishes or not.

Since the probabilities ρ^i must remain positive we shall further require that
dρ^i/dt ≥ 0 at the border of the simplex where ρ^i = 0.
In addition to the flow generated by H̃ we can also consider the flow generated
by Ñ and parametrized by ν. From eq.(10.62) the corresponding Hamiltonian
vector field N is given by

   N = N^{αi} ∂/∂X^{αi} with N^{αi} = dX^{αi}/dν = {X^{αi}, Ñ} ,   (10.91)

or, more explicitly,

   N^{1i} = dρ^i/dν = 0 , N^{2i} = dΦ_i/dν = 1 , or N = Σ_i ∂/∂Φ_i .   (10.92)

The congruence of curves generated by Ñ is found by integrating (10.92). The
resulting curves are

   ρ^i(ν) = ρ^i(0) and Φ_i(ν) = Φ_i(0) + ν ,   (10.93)

which amounts to shifting all momenta Φ_i by the i-independent parameter ν.

A Global Gauge Symmetry — We can also see that if Ñ is conserved
along H, then H̃ is conserved along N,

   dH̃/dν = {H̃, Ñ} = 0 ,   (10.94)

which implies that the conserved quantity Ñ is the generator of a symmetry
transformation.
The phase space of interest is the 2(n−1)-dimensional T*S but the description
is simplified by using the n unnormalized coordinates ρ^i of the larger
embedding space T*S^+. The introduction of one superfluous coordinate forces
us to also introduce one superfluous momentum. We eliminate the extra coordinate
by imposing the constraint Ñ = 0. We eliminate the extra momentum
by declaring it unphysical: the shifted point (ρ', Φ') = (ρ, Φ + ν) is declared to
be equivalent to (ρ, Φ), which we describe by saying that (ρ, Φ) and (ρ, Φ + ν) lie
on the same “ray”. This equivalence is described as a global “gauge” symmetry
which, as we shall later see, is the reason why quantum mechanical states are
represented by rays rather than vectors in a Hilbert space.

10.6.2 The embedding space T*S^+


As we saw in section 7.4.3 the metric of a generic embedding space S^+ turns out
to be spherically symmetric. This fact turns out to be significant for quantum
mechanics. The length element is given by eqs.(7.66) and (7.105),

   δℓ² = g_ij δρ^i δρ^j with g_ij = A n_i n_j + (B/2ρ^i) δ_ij ,   (10.95)

where n is a covector with components n_i = 1 for all i = 1 ... n,^8 and A = A(|ρ|)
and B = B(|ρ|) are smooth scalar functions of |ρ| = Σ_i ρ^i. These expressions
can be simplified by a suitable change of coordinates.
The important term here is ‘suitable’. The point is that coordinates are often
chosen because they receive a particularly useful interpretation; the coordinate
might be a temperature, an angle or, in our case, a probability. In such cases the
advantages of the freedom to change coordinates might be severely outweighed
by the loss of a clear physical interpretation. Thus, we seek a change of coordinates
ρ^i → ρ'^i that will preserve the interpretation of ρ as unnormalized or
relative probabilities. Such transformations are of the form,

   ρ^i = ρ^i(ρ') = σ(|ρ'|) ρ'^i ,   (10.96)

where the scale σ is a positive function of |ρ'|. Substituting

   δρ^i = σ̇ |δρ'| ρ'^i + σ δρ'^i ,   (10.97)

where σ̇ = dσ/d|ρ'| and |δρ'| = Σ_i δρ'^i, into (10.95) we find

   δℓ² = Σ_ij [ A + (B/2σρ'^i) δ_ij ] ( σ̇ |δρ'| ρ'^i + σ δρ'^i )( σ̇ |δρ'| ρ'^j + σ δρ'^j ) ,   (10.98)

where we used n_i = 1 and the sum over ij is kept explicit. Then,

   δℓ² = g'_ij δρ'^i δρ'^j with g'_ij = A' n_i n_j + (1/2ρ'^i) σB δ_ij ,   (10.99)

where we introduced a new function A' = A'(|ρ'|),

   A' = A ( σ̇ |ρ'| + σ )² + (B σ̇²/2σ) |ρ'| + B σ̇ .   (10.100)
We can now take advantage of the freedom to choose σ: set σ(|ρ|) = ℏ/B(|ρ|)
where ℏ is a constant.^9 Dropping the primes, the length element is

   δℓ² = g_ij δρ^i δρ^j with g_ij = A(|ρ|) n_i n_j + (ℏ/2ρ^i) δ_ij ,   (10.101)

or,
   δℓ² = A(|ρ|) |δρ|² + Σ_i (ℏ/2ρ^i) (δρ^i)² .   (10.102)

^8 The only reason to introduce the peculiar covector n is to maintain the Einstein convention
of summing over repeated indices,

   A n_i n_j δρ^i δρ^j = A Σ_ij δρ^i δρ^j = A |δρ|² .

^9 The constant ℏ plays the same role as h in (10.76). Of course, ℏ will eventually be
identified with Planck's constant divided by 2π, and one could choose units so that ℏ = 1.

We can check that the inverse tensor g^ij is

   g^ij = (2ρ^i/ℏ) δ^ij + C ρ^i ρ^j where C(|ρ|) = −2A / ( ℏ A |ρ| + ℏ²/2 ) .   (10.103)

We are now ready to write down the metric for T*S^+. We follow the same
argument that led to eq.(10.78) and impose invariance under flow reversal. Since
the only tensors at our disposal are g_ij and g^ij, the length element of T*S^+ must
be of the form,

   δℓ̃² = G_{αi,βj} δX^{αi} δX^{βj} = g_ij δρ^i δρ^j + g^ij δΦ_i δΦ_j .   (10.104)

Therefore, substituting (10.95) and (10.103), δℓ̃² can be more explicitly written
as

   δℓ̃² = A ( Σ_{i=1}^n δρ^i )² + C ( Σ_{i=1}^n ρ^i δΦ_i )² + Σ_{i=1}^n [ (ℏ/2ρ^i)(δρ^i)² + (2ρ^i/ℏ)(δΦ_i)² ] ,   (10.105)

or
   δℓ̃² = A |δρ|² + C |ρ|² ⟨δΦ⟩² + Σ_{i=1}^n [ (ℏ/2ρ^i)(δρ^i)² + (2ρ^i/ℏ)(δΦ_i)² ] ,   (10.106)

where ⟨δΦ⟩ = Σ_i ρ^i δΦ_i / |ρ| denotes the expected value of δΦ.
From (10.104), writing the αβ indices as a 2×2 matrix, the metric tensors are

   G = [ g  0 ; 0  g^{−1} ] , G^{−1} = [ g^{−1}  0 ; 0  g ] .   (10.107)

As before, the tensor G and its inverse G^{−1} can be used to lower and raise
indices. Using G^{−1} to raise the first index of the symplectic form Ω_{αi,βj} as we
did in eq.(10.79), we see that eqs.(10.80) and (10.81),

   Ω = [ 0  1 ; −1  0 ] and J = −G^{−1} Ω = [ 0  −g^{−1} ; g  0 ] ,   (10.108)

remain valid for T*S^+. But ultimately the geometry of T*S^+ is only of marginal
interest; what matters is the geometry it induces on the e-phase space T*S of
normalized probabilities to which we turn next.

10.6.3 The metric induced on the e-phase space T*S


We saw that the e-phase space T*S can be obtained from the space T*S^+ by
the restriction |ρ| = 1 and by identifying the gauge equivalent points (ρ^i, Φ_i)
and (ρ^i, Φ_i + ν n_i). Consider two neighboring points (ρ^i, Φ_i) and (ρ'^i, Φ'_i) with
|ρ| = |ρ'| = 1. The metric induced on T*S will be defined as the shortest T*S^+
distance between (ρ^i, Φ_i) and points on the ray defined by (ρ'^i, Φ'_i). Since the
T*S^+ distance between (ρ^i, Φ_i) and (ρ^i + δρ^i, Φ_i + δΦ_i + ν n_i) is

   δℓ̃²(ν) = g_ij δρ^i δρ^j + g^ij (δΦ_i + ν n_i)(δΦ_j + ν n_j) ,   (10.109)

the metric on T*S will be defined by

   δs̃² = min_ν δℓ̃²(ν) |_{|ρ|=1} .   (10.110)

The value of ν that minimizes (10.109) is

   ν_min = −⟨δΦ⟩ = −Σ_i ρ^i δΦ_i .   (10.111)

Therefore, setting |δρ| = 0, the metric on T*S, which measures the distance
between neighboring rays, is

   δs̃² = Σ_{i=1}^n [ (ℏ/2ρ^i)(δρ^i)² + (2ρ^i/ℏ)(δΦ_i − ⟨δΦ⟩)² ] .   (10.112)
Although the metric (10.112) is expressed in a notation that may be unfamiliar,
it turns out to be equivalent to the well-known Fubini-Study metric.^10
The recognition that the e-phase space is the cotangent bundle of a statistical
manifold has led us to a derivation of the Fubini-Study metric that emphasizes
a natural connection to information geometry.
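The equivalence can be made concrete with a short numerical check (ours). With the normalization conventions used here the Fubini-Study distance between neighboring rays can be written as 2ℏ(⟨δψ|δψ⟩ − |⟨ψ|δψ⟩|²) for a normalized state ψ = ρ^{1/2} e^{iΦ/ℏ}, and the sketch below verifies that this agrees with (10.112).

```python
# Illustrative check: the line element (10.112) equals 2 hbar ( <dpsi|dpsi> - |<psi|dpsi>|^2 )
# for psi = rho^(1/2) exp(i Phi / hbar), with sum(rho) = 1 and sum(drho) = 0.
import numpy as np

n, hbar = 5, 1.0
rng = np.random.default_rng(3)
rho = rng.random(n); rho /= rho.sum()
Phi = rng.normal(size=n)
drho = rng.normal(size=n); drho -= drho.mean()        # keep sum(drho) = 0
dPhi = rng.normal(size=n)

ds2_ED = np.sum(hbar * drho**2 / (2 * rho)
                + (2 * rho / hbar) * (dPhi - np.sum(rho * dPhi))**2)

psi  = np.sqrt(rho) * np.exp(1j * Phi / hbar)
dpsi = psi * (drho / (2 * rho) + 1j * dPhi / hbar)    # eq.(10.118)
ds2_FS = 2 * hbar * (np.vdot(dpsi, dpsi) - abs(np.vdot(psi, dpsi))**2).real

assert np.isclose(ds2_ED, ds2_FS)
```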
Remark: The physical meaning of ℏ has, ever since Planck, been a matter of
interest and wonder. The interpretations range from the deeply metaphysical
“quantum of action”, to the “scale” that defines the boundary between quantum
and classical regimes, to a mere constant that fixes the units of energy relative
to those of frequency (E = ℏω) and which can always be chosen equal to one.
In entropic dynamics the role of ℏ can be characterized in yet another more
geometric way. When assigning the information geometry length to an e-phase
space vector (δρ^i, δΦ_i) there are independent contributions from the coordinate
components δρ^i and the momentum components δΦ_i. The constant ℏ determines the relative
weights of the two contributions.

Back to T*S^+ — Even as we have succeeded in assigning a metric to the
e-phase space T*S it is still true that the normalization constraint is an inconvenience
and that to proceed further in our study of Hamiltonian flows we
are forced to return to the larger embedding space T*S^+. To this end we note
an important feature of the T*S metric (10.112) that we can exploit to our
advantage: the metric is independent of the choice of the function A(|ρ|) in
eq.(10.105) that defines the particular embedding geometry. Therefore, without
any loss of generality, we can impose A(|ρ|) = 0.^11 With these choices we
assign the simplest possible geometries to the embedding spaces S^+ and T*S^+,
namely, they are both flat. The T*S^+ metric, eq.(10.104), then becomes

   δℓ̃² = Σ_{i=1}^n [ (ℏ/2ρ^i)(δρ^i)² + (2ρ^i/ℏ)(δΦ_i)² ] = G_{αi,βj} δX^{αi} δX^{βj} .   (10.113)
i=1 2 i ~
^10 This metric was introduced years before the invention of quantum mechanics by G. Fubini
(1904) and E. Study (1905) in their studies of shortest paths on complex projective spaces.
The latter include the projective Hilbert spaces used in quantum mechanics.
^11 Later we shall explicitly show that choosing A ≠ 0 has no effect on the Hamiltonian flows
that are relevant to quantum mechanics.



Writing the αβ indices as a 2×2 matrix, we have

   [G_{αi,βj}] = [ (ℏ/2ρ^i) δ_ij   0 ; 0   (2ρ^i/ℏ) δ^ij ] ,   (10.114)

and the tensor J, eq.(10.108), which defines the complex structure, becomes

   J^{αi}_{βj} = −G^{αi,γk} Ω_{γk,βj} or [J^{αi}_{βj}] = [ 0   −(2ρ^i/ℏ) δ^i_j ; (ℏ/2ρ^i) δ^i_j   0 ] .   (10.115)

10.6.4 Refining the choice of cotangent space


Having endowed the e-phase space T*S^+ with both metric and complex structures
we can now revisit and refine our choice of cotangent spaces. So far we had
assumed the cotangent space T*_ρ S^+ at ρ to be the flat n-dimensional Euclidean
space R^n. It turns out that the cotangent space that is relevant to quantum
mechanics requires a further restriction. To see what this is we argue that
the fact that T*S^+ is endowed with a complex structure suggests a canonical
transformation from (ρ, Φ) to complex coordinates (ψ, iℏψ*),

   ψ_j = ρ_j^{1/2} e^{iΦ_j/ℏ} and iℏψ*_j = iℏ ρ_j^{1/2} e^{−iΦ_j/ℏ} .   (10.116)

Thus, a point Ψ ∈ T*S^+ has coordinates

   Ψ^{μj} = ( Ψ^{1j} , Ψ^{2j} ) = ( ψ_j , iℏψ*_j ) ,   (10.117)

where the index μ = 1, 2 takes two values (with μ, ν, ... chosen from the middle
of the Greek alphabet).
Since changing the phase Φ_j → Φ_j + 2πℏ in (10.116) yields the same point ψ
we see that the cotangent space T*_ρ S is a flat n-dimensional “hypercube”
(its edges have coordinate length 2πℏ) with the opposite faces identified, something
like periodic boundary conditions.^12 Thus, the new T*_ρ S is still locally
isomorphic to the old R^n, which makes it a legitimate choice of cotangent space.
Remark: The choice of cotangent space is central to the derivation of quantum
mechanics and some additional justification might be desirable. Here we take
the easy way out and argue that identifying the relevant e-phase space in which
the quantum dynamics is played out — see chapter 11 — represents significant
progress even when its physical origin remains unexplained. Nevertheless, a
more illuminating justification has in fact been proposed by Selman Ipek (see
section 4.5 in [Ipek 2021]) in the context of the relativistic dynamics of fields,
a subject that lies outside the scope of the non-relativistic physics discussed in
this book.
^12 Strictly, T*_ρ S is a parallelepiped; from (10.113) we see that the lengths of its edges are
δℓ̃_j = 2π (2ℏρ_j)^{1/2} which vanish at the boundaries of the simplex.



We can check that the transformation from real (ρ, Φ) to complex coordinates
(ψ, iℏψ*) is canonical, that is, iℏψ* is the momentum conjugate to ψ. The
transformation to the ψ = ρ^{1/2} e^{iΦ/ℏ} coordinates proceeds as follows,

   δψ_i = (1/2ρ_i^{1/2}) e^{iΦ_i/ℏ} δρ_i + (i/ℏ) ρ_i^{1/2} e^{iΦ_i/ℏ} δΦ_i = [ δρ_i/2ρ_i + i δΦ_i/ℏ ] ψ_i ,   (10.118)

so that

   δψ_i/ψ_i = δρ_i/2ρ_i + i δΦ_i/ℏ and δψ*_i/ψ*_i = δρ_i/2ρ_i − i δΦ_i/ℏ .   (10.119)

Adding and subtracting these equations we find

   δρ_j = ψ*_j δψ_j + ψ_j δψ*_j and δΦ_j = (ℏ/2iρ_j) ( ψ*_j δψ_j − ψ_j δψ*_j ) .   (10.120)

The action of Ω on two generic vectors V = d/dλ and U = d/dμ is

   Ω(V, U) = Ω_{αj,βk} (dX^{αj}/dλ)(dX^{βk}/dμ) = (dρ^j/dλ)(dΦ_j/dμ) − (dΦ_j/dλ)(dρ^j/dμ) .   (10.121)

In ψ coordinates this becomes

   Ω(V, U) = Ω_{μj,νk} (dΨ^{μj}/dλ)(dΨ^{νk}/dμ) = (dψ_j/dλ)(d(iℏψ*_j)/dμ) − (d(iℏψ*_j)/dλ)(dψ_j/dμ) ,   (10.122)

which shows that the symplectic form Ω,

   [Ω_{μj,νk}] = [ 0  1 ; −1  0 ] δ_{jk} ,   (10.123)

retains the same form as (10.108).^13


Similarly, the metric G on T*S^+, eq.(10.113), becomes

   δℓ̃² = Σ_{j=1}^n (−2i) δψ_j δ(iℏψ*_j) = G_{μj,νk} δΨ^{μj} δΨ^{νk} ,   (10.124)

and the metric tensor and its inverse take a particularly simple form,

   [G_{μj,νk}] = −i [ 0  1 ; 1  0 ] δ_{jk} and [G^{μj,νk}] = i [ 0  1 ; 1  0 ] δ^{jk} .   (10.125)
^13 The canonical transformation is generated by the function

   F(ρ, ψ) = (iℏ/2) Σ_k ρ_k [ 1 + log( ψ_k² / ρ_k ) ]

according to
   ∂F/∂ρ_k = −Φ_k , ∂F/∂ψ_k = iℏ ψ*_k .

Note, in particular, that the choice A(|ρ|) = 0 for the embedding space has led
to a metric tensor that is independent of the coordinates which corroborates
that T*S^+ is indeed flat.
Finally, using G^{μj,σk} to raise the first index of Ω_{σk,νl} gives the components
of the tensor J,

   J^{μj}_{νl} ≡ −G^{μj,σk} Ω_{σk,νl} or [J^{μj}_{νl}] = [ i  0 ; 0  −i ] δ^j_l .   (10.126)

10.7 Quantum kinematics: Hamilton-Killing flows


We have seen that Hamiltonian flows preserve the symplectic form. Our next
goal is to find those Hamiltonian flows H that also happen to preserve the metric
G of T*S, that is, we also want H to be a Killing vector.
Remark: It may be worthwhile to emphasize that while we adopt the usual H
notation associated with time evolution, the vector field H refers to any flow
that preserves the normalization Ñ, the symplectic form Ω, and the metric G.
The condition on H^{μi} is
   £_H G = 0 ,   (10.127)
or, using (10.38),

   (£_H G)_{μj,νk} = H^{σl} ∂_{σl} G_{μj,νk} + G_{σl,νk} ∂_{μj} H^{σl} + G_{μj,σl} ∂_{νk} H^{σl} = 0 .   (10.128)

The metric G, eq.(10.125), gives ∂_{σl} G_{μj,νk} = 0, and the Killing equation simplifies
to
   (£_H G)_{μj,νk} = G_{σl,νk} ∂_{μj} H^{σl} + G_{μj,σl} ∂_{νk} H^{σl} = 0 .   (10.129)

More explicitly,

   [(£_H G)_{jk}] = −i [ ∂H^{2k}/∂ψ_j + ∂H^{2j}/∂ψ_k ,  ∂H^{1k}/∂ψ_j + ∂H^{2j}/∂(iℏψ*_k) ;
                        ∂H^{2k}/∂(iℏψ*_j) + ∂H^{1j}/∂ψ_k ,  ∂H^{1k}/∂(iℏψ*_j) + ∂H^{1j}/∂(iℏψ*_k) ] = 0 .   (10.130)
If we further require that H be a Hamiltonian flow, £_H Ω = 0, then H satisfies
Hamilton's equations,

   H^{1j} = ∂H̃/∂(iℏψ*_j) and H^{2j} = −∂H̃/∂ψ_j ,   (10.131)

and we find

   [(£_H G)_{jk}] = 2i [ ∂²H̃/∂ψ_j∂ψ_k   0 ; 0   (1/ℏ²) ∂²H̃/∂ψ*_j∂ψ*_k ] = 0 ,   (10.132)

so that
   ∂²H̃/∂ψ_j∂ψ_k = 0 and ∂²H̃/∂ψ*_j∂ψ*_k = 0 .   (10.133)

Therefore, in order to generate a flow that preserves both G and Ω, the function
H̃(ψ, ψ*) must be linear in ψ and linear in ψ*,

   H̃(ψ, ψ*) = Σ_{j,k=1}^n ψ*_j Ĥ_{jk} ψ_k + Σ_{j=1}^n ( L̂_j ψ_j + M̂_j ψ*_j ) + const ,   (10.134)

where the kernels Ĥ_{jk}, L̂_j, and M̂_j are independent of ψ and ψ*, and the additive
constant can be dropped because it has no effect on the flow. Imposing that the
flow preserves the normalization constraint Ñ = const, eq.(10.90), implies that
H̃ must be invariant under the phase shift ψ → ψ e^{iν/ℏ}. Therefore, L̂_j = M̂_j = 0,
and we conclude that

   H̃(ψ, ψ*) = Σ_{j,k=1}^n ψ*_j Ĥ_{jk} ψ_k .   (10.135)

The corresponding HK flow is given by Hamilton's equations,

   dψ_j/dλ = H^{1j} = ∂H̃/∂(iℏψ*_j) = (1/iℏ) Σ_{k=1}^n Ĥ_{jk} ψ_k ,   (10.136)

   d(iℏψ*_j)/dλ = H^{2j} = −∂H̃/∂ψ_j = −Σ_{k=1}^n ψ*_k Ĥ_{kj} .   (10.137)

Taking the complex conjugate of (10.136) and comparing with (10.137) shows
that the kernel Ĥ_{jk} is Hermitian, and that the Hamiltonian function H̃ is real,

   Ĥ_{jk} = Ĥ*_{kj} and H̃(ψ, ψ*) = H̃*(ψ, ψ*) .   (10.138)

To summarize: the preservation of the symplectic structure, the metric structure,
and the normalization constraint leads to Hamiltonian functions H̃ that
are bilinear in ψ and ψ*, eq.(10.135). The flow generated by the bilinear Hamiltonian
(10.135) is given by the Poisson bracket or its corresponding Hamilton
equation,

   dψ_j/dλ = {ψ_j, H̃} or iℏ dψ_j/dλ = Σ_{k=1}^n Ĥ_{jk} ψ_k ,   (10.139)

and the latter is recognized as the Schrödinger equation. Beyond being Hermitian,
the actual form of the kernel Ĥ_{jk} remains undetermined. These are the
main results of this chapter.
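As a simple numerical illustration (ours; SciPy is used only for the matrix exponential), the sketch below integrates the flow (10.139) for a random Hermitian kernel and confirms that the normalization Σ_j |ψ_j|² and the Hamiltonian function H̃ are preserved, as the HK construction requires.

```python
# Illustrative check: with a Hermitian kernel H_jk, the flow i hbar dpsi/dlambda = H psi
# preserves sum_j |psi_j|^2 and the value of H~ = <Psi|H|Psi>.
import numpy as np
from scipy.linalg import expm

n, hbar = 4, 1.0
rng = np.random.default_rng(4)
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = (A + A.conj().T) / 2                       # a Hermitian kernel H_jk

psi0 = rng.normal(size=n) + 1j * rng.normal(size=n)
psi0 /= np.linalg.norm(psi0)

lam = 0.8
psi = expm(-1j * H * lam / hbar) @ psi0        # solution of eq.(10.139)

assert np.isclose(np.sum(np.abs(psi)**2), 1.0)                    # normalization preserved
assert np.isclose((psi.conj() @ H @ psi).real,
                  (psi0.conj() @ H @ psi0).real)                  # H~ conserved along the flow
```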

Linearity — The central feature of Hamilton's equations (10.136), or of the
Schrödinger equation (10.139), is that they are linear. Given two solutions ψ^{(1)}
and ψ^{(2)} and arbitrary constants c_1 and c_2, the linear combination

   ψ^{(3)} = c_1 ψ^{(1)} + c_2 ψ^{(2)}   (10.140)

is a solution too and this is extremely useful in calculations. Unfortunately,
these are Hamilton-Killing flows on the embedding space T*S^+ and when the
flow is projected onto the e-phase space T*S the linearity is severely restricted.

If ψ^{(1)} and ψ^{(2)} are normalized the superposition ψ^{(3)} will not in general be
normalized except for appropriately chosen constants. More importantly, the
gauge-transformed states

   ψ'^{(1)} = ψ^{(1)} e^{iν_1/ℏ} and ψ'^{(2)} = ψ^{(2)} e^{iν_2/ℏ}   (10.141)

are supposed to be “physically” equivalent to the original ψ^{(1)} and ψ^{(2)} but in
general the superposition

   ψ'^{(3)} = c_1 ψ'^{(1)} + c_2 ψ'^{(2)}   (10.142)

is not equivalent to ψ^{(3)}. In other words, the mathematical linearity of (10.136)
or (10.139) does not extend to a full blown Superposition Principle for physically
equivalent states.
On the other hand, any point Ψ deserves to be called a “state” in the limited
sense that it may serve as the initial condition for a curve in T*S^+. Since given
two states Ψ^{(1)} and Ψ^{(2)} their superposition Ψ^{(3)} is a state too, we see that the
set of states {Ψ} forms a linear vector space. This is a structure that we can
further exploit.

The effect of curvature on the HK flows — The previous analysis was
based on the fact that the curvature of the embedding space T*S^+, that is the
choice of the function A(|ρ|), has no effect on the metric of the e-phase space
T*S and therefore should have no effect on the HK flows. This allowed us to
simplify the analysis by imposing A = 0 which makes T*S^+ flat. Before we
proceed further it is advisable to verify explicitly that choosing A ≠ 0 still leads
to the same bilinear form for the Hamiltonian (10.135).
We recall eqs.(10.105) and (10.107) and use (10.120). Then, for a generic
function A(|ρ|), the metric of T*S^+ in complex coordinates is

   [G_{μi,νj}] = [ A_− ψ*_i ψ*_j ,  −i δ_ij − (i/ℏ) A_+ ψ*_i ψ_j ;
                  −i δ_ij − (i/ℏ) A_+ ψ_i ψ*_j ,  −(1/ℏ²) A_− ψ_i ψ_j ] ,   (10.143)

where

   A_±(|ρ|) = A ± C ℏ²/4 with C(|ρ|) = −2A / ( ℏ A |ρ| + ℏ²/2 ) .   (10.144)

The condition for H to generate a Hamilton-Killing ‡ow is given by eqs.(10.127),


(10.128), and (10.131). In block matrix form this reads

($H G)1i;1j ($H G)1i;2j


[($H G)ij ] = =0: (10.145)
($H G)2i;1j ($H G)2i;2j

The argument involves some straightforward but lengthy algebra. Substitute


(10.143) into (10.128), and impose the Hamilton ‡ow condition, eq.(10.131).

Then, the 11 and 12 matrix elements are found to be

   0 = (£_H G)_{1i,1j} = 2i ∂²H̃/∂ψ_i∂ψ_j
      + i A_− ψ*_i ( 1 − ψ*_k ∂/∂ψ*_k ) ∂H̃/∂ψ_j + i A_− ψ*_j ( 1 − ψ*_k ∂/∂ψ*_k ) ∂H̃/∂ψ_i
      + (i/ℏ) A_+ ψ*_i ψ_k ∂²H̃/∂ψ_k∂ψ_j + (i/ℏ) A_+ ψ*_j ψ_k ∂²H̃/∂ψ_k∂ψ_i ,   (10.146)

and

   0 = (£_H G)_{1i,2j} = (1/ℏ) A_+ ψ*_i ( 1 − ψ*_k ∂/∂ψ*_k ) ∂H̃/∂ψ_j + (1/ℏ) A_+ ψ_j ( 1 − ψ_k ∂/∂ψ_k ) ∂H̃/∂ψ*_i
      − A_− ψ*_i ψ_k ∂²H̃/∂ψ_k∂ψ_j + (1/ℏ²) A_− ψ_j ψ*_k ∂²H̃/∂ψ*_k∂ψ*_i .   (10.147)

Similarly, the other two matrix elements, 21 and 22, are given by

   (£_H G)_{2i,1j} = (£_H G)_{1j,2i} and (£_H G)_{2i,2j} = −(1/ℏ²) [(£_H G)_{1i,1j}]* ,   (10.148)

which shows that they provide no additional information about the form of H̃
beyond that already provided by eqs.(10.146) and (10.147).
It is easy to verify that the family of bilinear Hamiltonians, eq.(10.135),
provides the desired solution. Indeed, we can easily check that a bilinear H̃
implies that the quantities

   ∂²H̃/∂ψ_i∂ψ_j , ( 1 − ψ*_k ∂/∂ψ*_k ) ∂H̃/∂ψ_j ,   (10.149)

and their complex conjugates all vanish. Then, both eqs.(10.146) and (10.147)
are satisfied identically.
To see that there are no other solutions we argue as follows. Consider
(10.146) as a system of linear equations in the unknown second derivatives
∂²H̃/∂ψ_i∂ψ_j. The coordinates ψ_i and ψ*_i, the first derivatives ∂H̃/∂ψ_i and
∂H̃/∂ψ*_i, and the mixed derivatives ∂²H̃/∂ψ*_i∂ψ_j are independent quantities
that define the constant coefficients in the linear system. The number of unknowns
is n(n+1)/2 and, since

   (£_H G)_{1j,1i} = (£_H G)_{1i,1j} ,   (10.150)

we see that the number of equations matches the number of unknowns. Thus,
since the determinant of the system does not vanish, except possibly for special
values of the ψs, we conclude that the solution

   ∂²H̃/∂ψ_i∂ψ_j = 0   (10.151)

is unique. The observation that the remaining eqs.(10.147) are also satisfied
concludes the proof.
In conclusion: whether the embedding space T*S^+ is flat (A = 0) or not
(A ≠ 0) the HK flows are described by the linear Schrödinger equation (10.139).

10.8 Hilbert space


We just saw that the possible initial conditions for an HK flow, the points
Ψ ∈ T*S^+, form a linear space. In Section 10.6.2 we had seen that the geometry
of the embedding space T*S^+ was not fully determined. This is a freedom we
exploited by setting A(|ρ|) = 0 in eqs.(10.113) and (10.114) so that T*S^+ is
flat. To take full advantage of linearity we would like to further endow the
flat e-phase space with the additional structure of an inner product and thus
transform it into a Hilbert space.^14
The metric tensor defined by (10.125) is supposed to act on vectors on this
space; its action on the points Ψ is not defined — we have a notion of length
for vectors but not a notion of length for points Ψ. But in a flat space a point
is also a vector. Indeed, since the vector tangent to a curve is (up to a scalar
factor) just the difference of two Ψs, we see that points on the manifold and
vectors tangent to the manifold are objects of the same kind. In other words,
the tangent spaces T[T*S^+]_Ψ are identical to the space T*S^+ itself. Thus, the
linear space in which the Ψs are both points and vectors can be identified with
the flat embedding e-phase space T*S^+.
Remark: We have shown that whether the embedding space T*S^+ is flat
(A = 0) or curved (A ≠ 0) the HK flows are described by the very same
linear Schrödinger equation (10.139). But it makes no sense to introduce an
inner product between points in a curved space. It is only when the points
happen to also be vectors that inner products and Hilbert spaces make sense.
It is, therefore, important to emphasize that the whole additional structure of
a Hilbert space is neither necessary nor fundamental. It is a merely useful tool
designed for the specific purpose of exploiting the full calculational advantages
of linearity.

The inner product — The choice of an inner product for the points Ψ is
now “natural” in the sense that the necessary ingredients are already available.
The Hamilton-Killing flows followed from imposing that the symplectic form
Ω, eq.(10.123), and the flat space tensor G, eq.(10.125), be preserved. In order
that the inner product also be preserved it is natural to choose an inner product
defined in terms of those two tensors. We adopt the familiar Dirac notation to
represent the states Ψ as vectors |Ψ⟩. The inner product ⟨Φ|Ψ⟩ is defined in
terms of the tensors G and Ω,

   ⟨Φ|Ψ⟩ = a ( G_{μi,νj} + b Ω_{μi,νj} ) Φ^{μi} Ψ^{νj} ,   (10.152)
^14 We use the term Hilbert space loosely to describe any complex vector space with a Hermitian
inner product. In this chapter we deal with complex vector spaces of finite dimensionality.
The term Hilbert space is more commonly applied to infinite-dimensional vector
spaces of square-integrable functions that can be spanned by a countable basis. In infinite
dimensions all sorts of questions arise concerning the convergence of sums with an infinite
number of terms and various other limiting procedures. It can be rigorously shown that
the conclusions we draw here for finite dimensions also hold in the infinite-dimensional case.

where a and b are constants. Using eqs.(10.123) and (10.125) we get

   ⟨Φ|Ψ⟩ = a ( φ_j , iℏφ*_j ) [ G_{jk} + b Ω_{jk} ] ( ψ_k , iℏψ*_k )^T = aℏ Σ_{j=1}^n [ (1 − ib) φ*_j ψ_j + (1 + ib) φ_j ψ*_j ] .   (10.153)

We shall adopt the standard definitions and conventions. Requiring that ⟨Φ|Ψ⟩ =
⟨Ψ|Φ⟩* implies that a = a* and b = i|b|. Furthermore, the standard convention
that the inner product ⟨Φ|Ψ⟩ be anti-linear in its first factor and linear in the
second leads us to choose b = +i. Finally, we adopt the standard normalization
and set a = 1/2ℏ. The result is the familiar expression for the positive definite
inner product,

   ⟨Φ|Ψ⟩ ≡ (1/2ℏ) ( G_{μj,νk} + i Ω_{μj,νk} ) Φ^{μj} Ψ^{νk} = Σ_{j=1}^n φ*_j ψ_j .   (10.154)

We can check that this inner product is positive-definite,

   ⟨Ψ|Ψ⟩ ≥ 0 with ⟨Ψ|Ψ⟩ = 0 only if |Ψ⟩ = 0 .   (10.155)

The map between points and vectors, Ψ ↔ |Ψ⟩, is defined by

   |Ψ⟩ = Σ_{j=1}^n |j⟩ ψ_j where ψ_j = ⟨j|Ψ⟩ ,   (10.156)

where, to be explicit about the interpretation, we emphasize that j and |j⟩ are
completely different objects: j represents an ontic state while |j⟩ represents an
epistemic state — j is one of the faces of the quantum die, and |j⟩ represents
the state of certainty that the actual face is j, that is ρ(j) = 1. In this “j”
representation, the vectors {|j⟩} form a basis that is orthogonal and complete,

   ⟨k|j⟩ = δ_{jk} and Σ_{j=1}^n |j⟩⟨j| = 1̂ .   (10.157)

Complex structure — The tensors Ω and G were originally meant to act on
tangent vectors but now they can also act on all points Ψ ∈ T*S^+. For example,
the action of the mixed tensor J = −G^{−1}Ω, eq.(10.126), on a point Ψ is

   [(JΨ)^{μj}] = [ i  0 ; 0  −i ] ( ψ_j , iℏψ*_j )^T = ( iψ_j , −i(iℏψ*_j) )^T ,   (10.158)

which shows that J plays the role of multiplication by i, that is, when acting
on a point Ψ the action of J is represented by an operator Ĵ,

   Ψ → iΨ under J is |Ψ⟩ → Ĵ|Ψ⟩ = i|Ψ⟩ .   (10.159)

Hermitian and unitary operators — There is another insight to be derived
from the embedding of e-phase space T*S into a flat T*S^+. From eq.(10.154) we
see that the real part is the inner product of the space R^{2n}. The group of transformations
that preserve this inner product is the orthogonal group O(2n, R).
Similarly, the imaginary part of (10.154) is the symplectic form Ω which is preserved
by transformations of the symplectic group Sp(2n, R). The transformations
that preserve the inner product on the left of (10.154) are the unitary
transformations U(n, C).
We conclude that the subset of symplectic or canonical transformations that
are also rotations in e-phase space turn out to be unitary transformations or, in
terms of the corresponding groups, the unitary group arises as the intersection
of the symplectic and orthogonal groups [Arnold 1997],

   Sp(2n, R) ∩ O(2n, R) = U(n, C) .   (10.160)

The bilinear Hamilton functions H̃(ψ, ψ*) with kernel Ĥ_{jk} in eq.(10.135) can
now be written in terms of a Hermitian operator Ĥ and its matrix elements,

   H̃(ψ, ψ*) = ⟨Ψ|Ĥ|Ψ⟩ and Ĥ_{jk} = ⟨j|Ĥ|k⟩ .   (10.161)

In words: up to normalization the Hamiltonian function H̃ is the expected value
of the Hamiltonian operator Ĥ. The corresponding Hamilton-Killing flows are
given by
   iℏ (d/dλ) ⟨j|Ψ⟩ = ⟨j|Ĥ|Ψ⟩ or iℏ (d/dλ) |Ψ⟩ = Ĥ|Ψ⟩ .   (10.162)
These flows are described by unitary transformations

   |Ψ(λ)⟩ = Û_H(λ) |Ψ(0)⟩ where Û_H(λ) = exp(−iĤλ/ℏ) .   (10.163)

Commutators — The Poisson bracket of two Hamiltonian functions Ũ[ψ, ψ*]
and Ṽ[ψ, ψ*],

   {Ũ, Ṽ} = Σ_{j=1}^n [ (∂Ũ/∂ψ_j)(∂Ṽ/∂(iℏψ*_j)) − (∂Ũ/∂(iℏψ*_j))(∂Ṽ/∂ψ_j) ] ,

can be written in terms of the commutator of the associated operators,

   {Ũ, Ṽ} = (1/iℏ) ⟨Ψ|[Û, V̂]|Ψ⟩ .   (10.164)

Thus the Poisson bracket is (up to the factor 1/iℏ) the expectation of the commutator.
This identity is much sharper than Dirac's pioneering discovery that the quantum commutator
of two quantum variables is merely analogous to the Poisson bracket of the
corresponding classical variables.
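The identity (10.164) is easy to verify numerically; in the sketch below (ours) the Poisson bracket is computed directly from the bilinear forms and compared with the expected commutator.

```python
# Illustrative check of eq.(10.164) for bilinear functions U~ = <Psi|U|Psi>, V~ = <Psi|V|Psi>.
import numpy as np

n, hbar = 3, 1.0
rng = np.random.default_rng(6)

def hermitian():
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

U, V = hermitian(), hermitian()
psi = rng.normal(size=n) + 1j * rng.normal(size=n)

# {U~, V~} = sum_j [ dU~/dpsi_j dV~/d(i hbar psi*_j) - dU~/d(i hbar psi*_j) dV~/dpsi_j ]
dU_dpsi = psi.conj() @ U               # dU~/dpsi_j
dU_dpis = (U @ psi) / (1j * hbar)      # dU~/d(i hbar psi*_j)
dV_dpsi = psi.conj() @ V
dV_dpis = (V @ psi) / (1j * hbar)
poisson = np.sum(dU_dpsi * dV_dpis - dU_dpis * dV_dpsi)

commutator = psi.conj() @ (U @ V - V @ U) @ psi
assert np.isclose(poisson, commutator / (1j * hbar))
```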

10.9 Assessment: is this all there is to Quantum Mechanics?
The framework above takes us a long way towards justifying the mathematical
formalism that underlies quantum mechanics. It clarifies how complex numbers,
the Born rule ρ_i = |ψ_i|², and the linearity of the Schrödinger equation are a
consequence of the symplectic structure and the metric structure associated to
information geometry. The normalization constraint leads to the equivalence of
states along rays in a very convenient Hilbert space. Is there anything else to
explain?
There have been numerous attempts to derive or construct the mathemat-
ical formalism of quantum mechanics by adapting the symplectic geometry of
classical mechanics. Such phase-space methods invariably start from a classical
phase space of positions and momenta (q i ; pi ) and through some series of “quan-
tization rules” posit a correspondence to self-adjoint operators (Q ^ i ; P^i ) which
no longer constitute a phase space. The connection to a classical mechanics is
lost. The interpretation of Q ^ i and P^i and even the answer to the question of
what, if anything, is real or ontic in such a theory all become highly controver-
sial. Probabilities play a secondary role in the formulation; they are introduced
almost as an afterthought, as part of phenomenological rules for how to handle
those mysterious processes called measurements.
In this chapter we have taken a di¤erent starting point that places probabil-
ities at the very foundation. We have discussed special families of curves — the
Hamilton-Killing flows — that promise to be useful for the study of quantum
mechanics. We have shown that the Hamilton-Killing flows that preserve
the symplectic and the metric structures of the e-phase space — the cotangent
bundle of probabilities ρ^i and their conjugate momenta Φ_i — reproduce much
of the mathematical formalism of quantum theory.
But many questions are immediately raised: When we refer to the probability
ρ^i = ρ(i) what is this i that we are uncertain about? It is presumably meant
to represent something real, but are there other observables? Are those other
observables real, or are they created by the measurement process? Is there a
Born rule that applies to them? What are these processes called measurements
and how do we model them? Where is classical mechanics in all this?
The Hamilton-Killing flows will be used to describe the evolution of proba-
bilities in time and this raises further questions. As variables go probabilities
are very peculiar because they carry a purpose. They are meant to guide us
as to what we ought to believe. This means that all their changes — including
their evolution in time — must be compatible with the basic entropic principles
for updating probabilities. Are Hamilton-Killing flows compatible with entropic
updating? And then there is the issue of time itself. Time is not just another
parameter along a curve: what makes time special? These and other questions
will be addressed in the following chapters.
Chapter 11

Entropic Dynamics: Time


and Quantum Theory

Law without Law: “The only thing harder to understand than a law of sta-
tistical origin would be a law that is not of statistical origin, for then there
would be no way for it — or its progenitor principles — to come into
being.”

Two tests: “No test of these views looks like being someday doable, nor more
interesting and more instructive, than a derivation of the structure of
quantum theory... No prediction lends itself to a more critical test than
this, that every law of physics, pushed to the extreme, will be found statis-
tical and approximate, not mathematically perfect and precise.”

J. A. Wheeler^1

“... but it is important to note that the whole content of the theory depends
critically on just what we mean by ‘probability’.”

E. T. Jaynes^2

11.1 Mechanics without mechanism


The drive to explain nature has always led us to seek the mechanisms hidden
behind the phenomena. Descartes, for example, claimed to explain the motion
of planets as being swept along in the flow of some vortices. The model did not
work very well but at least it gave the illusion of a mechanical explanation and
thereby satis…ed a deep psychological need. Newton’s theory fared much better.
He took the important step of postulating that gravity was a universal force
acting at a distance but he abstained from offering any mechanical explanations
1 [Wheeler Zurek 1983, p. 203 and 210]
2 [Jaynes 1957c]

— a stroke of genius immortalized in his famous “hypotheses non fingo.” At
first there were objections. Huygens, for instance, recognized the undeniable
value of Newton’s achievement but was nevertheless deeply disappointed: the
theory works but it does not explain. And Newton agreed. In 1693 he wrote
that any action at a distance would represent “so great an absurdity... that no
man who has in philosophical matters a competent faculty of thinking can ever
fall into it.” [Newton 1693]
Over the following 18th century, however, impressed by the many successes
of Newtonian mechanics, people started to downplay and then even forget their
qualms about the absurdity of an action at a distance. Mechanical explanations
were, of course, still desired but the very meaning of what counted as “mechan-
ical” suffered a gradual but irreversible shift. It no longer meant “caused by
contact forces” but rather “described according to Newton’s laws.” Over time
Newtonian forces, including those mysterious actions at a distance, became
“real” which qualified them to count as the causes behind the phenomena.
But this did not last too long. With Lagrange, Hamilton, and the principle
of least action, the notion of force started to lose some of its recently acquired
fundamental status. Later, after Maxwell succeeded in extending the principles
of dynamics to include the electromagnetic field, the meaning of ‘mechanical
explanation’ changed once again. It no longer meant identifying the Newtonian
forces, but rather finding the right equations of evolution — which is done by
identifying the right Lagrangian or the right Hamiltonian for the theory. Thus,
today gravity is no longer explained through a force but through the curvature
of space-time. And the concept of force finds no place in quantum mechanics
where interactions are described as the evolution of vectors in an abstract Hilbert
space.
The goal of this chapter^3 is to derive non-relativistic quantum theory without
invoking an underlying mechanism — there is no ontic dynamics operating at a
sub-quantum level. This does not mean that such mechanisms do not exist;^{4,5} it
is just that useful models can be constructed without having to go through the
trouble of keeping track of a myriad of microscopic ontic details that often turn
out to be ultimately irrelevant. The idea can be illustrated by contrasting the
two very di¤erent ways in which the theory of Brownian motion was originally
derived by Smoluchowski and by Einstein. In Smoluchowski’s approach one
keeps track of the microscopic details of molecular collisions through a stochastic
Langevin equation and a macroscopic e¤ective theory is then derived by taking
suitable averages. In Einstein’s approach, on the other hand, one focuses directly
on those pieces of information that turn out to be relevant for the prediction
of macroscopic e¤ects. The advantages of Einstein’s approach are twofold. On
^3 The contents of this chapter are directly taken from [Caticha 2019 and 2021] which collect

material that evolved gradually in a series of previous publications [Caticha 2009a, 2010a,
2010b], [Caticha et al 2014], and [Caticha 2012c, 2014b, 2015a, 2017a, 2017b].
4 At present this possibility appears unlikely, but we should not underestimate the cleverness

of future scholars.
5 Non-relativistic quantum mechanics can, of course, be derived from an underlying rela-

tivistic quantum field theory, but no ontic dynamics is assumed to underwrite the latter (see
[Ipek et al 2014, 2018, 2020]).

one hand there is the simplicity that arises from not having to keep track of
irrelevant details that are eventually washed out when taking the averages and,
on the other hand, it allows the intriguing possibility that there is no ontic
sub-quantum dynamics at all.
Quantum mechanics involves probabilities and, therefore, it is a theory of
inference. But this has not always been clear. The center of the controversy
has been the interpretation of the quantum state — the wave function. Does
it represent the actual real state of the system — its ontic state — or does it
represent a state of knowledge about the system — an epistemic state? The
problem has been succinctly stated by Jaynes: “Our present QM formalism
is a peculiar mixture describing in part realities in Nature, in part incomplete
human information about Nature — all scrambled up by Heisenberg and Bohr
into an omelette that nobody has seen how to unscramble.” [Jaynes 1990]^6
The ontic interpretations have been fairly common. At the very beginning,
Schrödinger’s original waves were meant to be real material waves — although
the need to formulate the theory in configuration space immediately made that
interpretation quite problematic.
Then the Copenhagen interpretation — an umbrella designation for the not
always overlapping views of Bohr, Heisenberg, Pauli, and Born [Stapp 1972;
Jammer 1966, 1974] — took over and became the orthodoxy (see, however,
[Howard 2004]). On the question of quantum reality and the epistemic vs. ontic
nature of the quantum state it is deliberately vague. As crystallized in the
standard textbooks, including the classics by [Dirac 1948], [von Neumann 1955]
and [Landau Lifshitz 1977], it regards the quantum state as an objective and
complete specification of the properties of the system but only after they become
actualized through the act of measurement. According to Bohr the connection
between the wave function and the world is indirect. The wave function does
not represent the world itself, but is a mere tool to compute probabilities for
the outcomes of measurements that we are forced to describe using a classical
language that is ultimately inadequate [Bohr 1933, 1958, 1963]. Heisenberg’s
position is somewhat more ambiguous. While he fully agrees with Bohr on the
inadequacy of our classical language to describe a quantum reality, his take
on the wave function is that it describes something more ontic, an objective
tendency or potentiality for events to occur. And then there is also Einstein’s
ensemble or statistical interpretation which is more explicitly epistemic. In his
words, “the ψ-function is to be understood as the description not of a single
system but of an ensemble of systems” [Einstein 1949b, p. 671]. Whether
he meant a virtual ensemble in the sense of Gibbs is not so clear. (See also
[Ballentine 1970, Fine 1996].)
6 An important point to be emphasized here is that the distinction ontic/epistemic is not the

same as the distinction objective/subjective. (See section 1.1.3.) To be explicit, probabilities


are fully epistemic — they are tools for reasoning with incomplete information — but they
can lie anywhere in the spectrum from being completely subjective (two different agents can
hold different beliefs) to being completely objective. In QM, for example, probabilities are
both epistemic and fully objective. Indeed, at the current state of development, anyone who
computes probabilities that disagree with QM will be led to experimental predictions that are
demonstrably wrong.

Bohr, Heisenberg, Einstein and other founders of quantum theory were all
keenly aware of the epistemological and pragmatic elements at the foundation of
quantum mechanics (see e.g., [Stapp 1972] on Bohr, and [Fine 1996] on Einstein)
but, unfortunately, they wrote at a time when the language and the tools of a
quantitative epistemology — the Bayesian and entropic methods that are the
subject of this book — had not yet been sufficiently developed.
The conceptual problems that plagued the orthodox interpretation moti-
vated the creation of ontic alternatives such as the de Broglie-Bohm pilot wave
theory [Bohm Hiley 1993, Holland 1993], Everett’s many worlds interpretation
[Everett 1957, Zeh 2016], and the spontaneous collapse theories [Ghirardi et al
1986, Bassi et al 2013]. In these theories the wave function is ontic, it represents
a real state of a¤airs. On the other side, the epistemic interpretations have had
a growing number of advocates including, for example, [Ballentine 1970, 1998;
Caves et al 2007; Harrigan Spekkens 2010; Friedrich 2011; Leifer 2014].^7 The
end result is that the conceptual struggles with quantum theory have engen-
dered a literature that is too vast to even consider reviewing here. Excellent
sources for the earlier work are found in [Jammer 1966, 1974; Wheeler Zurek
1983]; for more recent work see, e.g. [Schlosshauer 2004; Jaeger 2009; Leifer
2014].
Faced with all this controversy, Jaynes also understood where one might
start looking for a solution: “We suggest that the proper tool for incorporating
human information into science is simply probability theory — not the currently
taught ‘random variable’ kind, but the original ‘logical inference’ kind of James
Bernoulli and Laplace” which he proceeds to explain “is often called Bayesian
inference” and is “supplemented by the notion of information entropy”.
The Entropic Dynamics (ED) developed below achieves ontological clarity
by sharply separating the ontic elements from the epistemic elements — posi-
tions of particles (or distributions of …elds) on one side and probabilities and
their conjugate momenta on the other. In this regard ED is in agreement with
Einstein’s view that “... on one supposition we should in my opinion hold ab-
solutely fast: The real factual situation of the system S2 is independent of what
is done with system S1 which is spatially separated from the former.”[Einstein
1949a, p.85] (See also [Howard 1985].) ED is also in broad agreement with Bell’s
views on the desirability of formulating physics in terms of local “beables”8 (See
Bell’s papers reproduced in [Bell 2004].)
ED is an epistemic dynamics of probabilities and not an ontic dynamics of
particles (or …elds). Of course, if probabilities at one instant are large in one
place and at a later time they are large in some other place one infers that
the particles must have moved — but nothing in ED assumes the existence of
something that has pushed the particles around. ED is a mechanics without
7 For criticism of the epistemic view see e.g. [Zeh 2002; Ferrero et al 2004; Marchildon 2004].
8 In contrast to mere observables, the beables are supposed to represent something that is ontic. In the Bohmian approach and in ED particle positions (and fields) are local beables. According to the Bohmian and the many worlds interpretations the wave function is a nonlocal beable.
a mechanism. We avoid the temptation to share Newton’s belief that a me-


chanics without a mechanism is “so great an absurdity...” and assert that the
laws of quantum mechanics are not laws of nature; they are rules for updating
probabilities about nature. The challenge, of course, is to identify those pieces
of information that happen to induce the correct updating of probabilities.
In the entropic approach one does not merely postulate a mathematical for-
malism and then append an interpretation to it. For the epistemic view of quan-
tum states to be satisfactory it is not su¢ cient to state that wave functions are
tools for codifying the beliefs of an (ideally rational) agent. It is also necessary
to show that the particular ways in which quantum states are handled stand
in complete agreement with the tightly constrained ways in which probabilities
are to be manipulated, computed, and updated. We can be more explicit: it is not sufficient to accept that $|\Psi|^2$ represents a state of knowledge; we must also
provide an epistemic interpretation for the phase of the wave function and, in
particular, we must show that changes or updates of the epistemic — which
include both unitary time evolution according to the Schrödinger equation and
its “collapse”during measurement — are nothing but instances of entropic and
Bayesian updating (see also Chapter 13). The universal applicability of prob-
ability theory including the entropic and Bayesian updating methods leaves
no room for alternative “quantum” probabilities obeying alternative forms of
Bayesian inference.
There is a large literature on reconstructions of quantum mechanics9 and
there are several approaches based on information theory.10 Before we proceed
with our subject it may be worthwhile to preview some of the features that
set ED apart. As mentioned above, one such feature is a strict adherence to
Bayesian and entropic methods. Another is a clear ontological commitment:
over and above all other observables, positions are assigned the privileged role
of being the only ontic variables.11 As one might expect, ED shows some for-
mal similarities with other position-based models such as the de Broglie-Bohm
pilot wave theory [Bohm 1952, Bohm Hiley 1993, Holland 1993] and Nelson’s
stochastic mechanics [Nelson 1966, 1967, 1985].12 Indeed, as we shall see, ED
allows both the Brownian trajectories of stochastic mechanics and the smooth
Bohmian trajectories as special cases. The conceptual di¤erences are otherwise
enormous: both stochastic and Bohmian mechanics operate totally at the onto-
logical level while ED operates almost completely at the epistemological level.
Bohm’s interpretation leads to a deterministic causal theory. Nelson, on the
other hand, seeks a realistic interpretation of quantum theory as arising from
a deeper, possibly non-local, but essentially stochastic and classical reality. For
9 See e.g., [Nelson 1985; Adler 2004; Smolin 2006; de la Peña Cetto 2014; Groessing 2008,

2009; ’t Hooft 2016] and references therein.


1 0 For a very incomplete list where more references can be found see e.g., [Wootters 1981;

Rovelli 1996; Caticha 1998, 2006; Zeilinger 1999; Brukner Zeilinger 2002; Fuchs 2002; Spekkens
2007; Goyal et al 2010; Hardy 2001, 2011, Chiribella et al 2011, D’Ariano 2017].
1 1 Here we are concerned with non-relativistic quantum mechanics. When the ED framework

is applied to relativistic quantum …eld theory it is the …elds that are the only ontic variables
[Ipek Caticha 2014, Ipek et al. 2018, 2020].
1 2 See also [Guerra 1981, Guerra Morato 1983] and references therein.
him stochastic mechanics is “an attempt to build a naively realistic picture of


physical phenomena, an objective representation of physical processes without
reference to any observer” [Nelson 1986].
Incidentally, the fact that ED is committed to the reality of positions (or
…elds) does not stand in con‡ict with Bell’s theorems on the impossibility of
local/causal hidden variables [Bell 2004]. As we shall discuss in some detail
in Chapter 13 the consequences of Bell’s theorem and other no-go theorems
(e.g., [Pussey et al 2012]) are evaded by virtue of ED being a purely epistemic
dynamics.
Yet another distinguishing feature is a deep concern with the nature of time
— even at the non-relativistic level. The issue here is that any discussion of
dynamics must inevitably include a notion of time but, being atemporal, the
rules of inference are silent on this matter. One can make inferences about the
past just as well as about the present or the future. This means that any model
of dynamics based on inference must also include assumptions about time, and
those assumptions must be explicitly stated. In ED an epistemic notion of time
— an “entropic”time — is introduced as a book-keeping device designed to keep
track of changes. The construction of entropic time involves several ingredients.
One must introduce the notion of an instant; one must show that these instants
are suitably ordered; and …nally one must de…ne a convenient measure of the
duration or interval between the successive instants. It turns out that an arrow
of time is generated automatically and entropic time is intrinsically directional.
Finally, as mentioned above, ED consists in the entropic updating of proba-
bilities through information supplied by constraints. The challenge is to identify
criteria that specify how these constraints are chosen and, in particular, how
the constraints are themselves updated. As we saw in Chapter 10 the e-phase
space of probabilities and their conjugate momenta is endowed with natural
metric and symplectic structures. We propose that the preservation of these
structures provides the natural criterion for updating the constraints. The re-
sult is a formalism in which the linearity of the Schrödinger equation and the
emergence of complex numbers are derived rather than postulated. Interestingly,
the introduction of Hilbert spaces turns out to be less a matter of necessity
than of computational convenience. The chapter concludes with remarks on the
connection between entropic time and the presumably more “physical” notion
of time as measured by clocks — we shall argue that what clocks register is, in
fact, entropic time.

11.2 The ontic microstates


The first step in any exercise in inference is to specify the quantities to be inferred. We consider N particles living in a flat Euclidean space $\mathbf{X}$. In Cartesian coordinates the metric is $\delta_{ab}$. The particles are assumed to have definite positions $x_n^a$, which we denote collectively by x. The index $n = 1\ldots N$ labels the particles, and $a = 1,2,3$ the three spatial coordinates. The configuration space for N particles is $\mathbf{X}_N = \mathbf{X}\times\ldots\times\mathbf{X}$. The positions of the particles are unknown
and it is these values that we wish to infer. Since the positions are unknown the main target of our attention will be the probability distribution $\rho(x)$.
The previous paragraph may seem straightforward common sense but the
assumptions it involves are not at all innocent. First, note that ED already sug-
gests a new perspective on the old question of determinism vs. indeterminism.
If we understand quantum mechanics as a generalization of a deterministic clas-
sical mechanics then it is natural to seek the cause of indeterminism. But within
an inference framework that is designed to deal with insu¢ cient information one
must accept uncertainty, probabilities, and indeterminism as the expected and
inevitable norm that requires no explanation. It is the determinism of classical
mechanics that demands an explanation. Indeed, as we shall see in section ??, while most quantities are afflicted by uncertainties there are situations where for
some very specially chosen variables one can, despite the lack of information,
achieve complete predictability. This accounts for the emergence of classical
determinism from an entropic dynamics that is intrinsically indeterministic.
Second, we note that the assumption that the particles have de…nite posi-
tions represents a major departure from the standard interpretation of quantum
mechanics according to which de…nite values can be attained but only as the
result of a measurement. In contrast, positions in ED play the very special role
of de…ning the ontic state of the system. Let us be very explicit: in the ED de-
scription of the double slit experiment, we might not know which slit the particle
goes through, but the particle de…nitely goes through one slit or the other.13
Indeed, as far as positions are concerned, ED agrees with Einstein’s view that
spatially separated objects have de…nite separate ontic states [Einstein 1949a,
p.85].
And third, we emphasize once again that $\rho(x)$ represents probabilities that
are to be manipulated according to exactly the same rules described in previous
chapters — the entropic and Bayesian methods. Just as there is no such thing
as a quantum arithmetic, there is no quantum probability theory either.

11.3 The entropic dynamics of short steps


Having identified the microstates $x \in \mathbf{X}_N$ we can proceed to the dynamics. The first assumption is that change happens. We do not explain why motion happens — ED is silent about any underlying dynamics acting at the sub-quantum level. Instead our task is to produce an estimate of what kind of motion one might reasonably expect.
The goal is to find the probability $P(x'|x)\,d^{3N}x'$ that the system takes a step from the initial position $x \in \mathbf{X}_N$ into a volume element $d^{3N}x'$ centered at a new $x' \in \mathbf{X}_N$. This is done by maximizing the entropy,

$S[P,Q] = -\int dx'\, P(x'|x)\,\log\frac{P(x'|x)}{Q(x'|x)}\,,$   (11.1)

13 See section 2.5.
relative to a prior $Q(x'|x)$, and subject to the appropriate constraints specified below. (For notational simplicity in multidimensional integrals such as (11.1) we will write $dx'$ instead of $d^{3N}x'$.) It is through the choice of prior and constraints that the relevant pieces of information that define the dynamics are introduced.

11.3.1 The prior


Having decided that changes do happen, we need to be a bit more explicit about which changes are to be expected. The main dynamical assumption is that the particles follow trajectories that are continuous. The assumption of continuity introduces an enormous simplification because it implies that a generic motion can be analyzed as the accumulation of many infinitesimally short steps. Therefore, our first goal will be to use eq.(11.1) to find the transition probability $P(x'|x)$ for an infinitesimally short step. The corresponding prior $Q(x'|x)$ is meant to describe the state of knowledge that is common to all short steps before we take into account the additional information that is specific to the particular short step being considered. We shall adopt a prior that incorporates the information that the particles take infinitesimally short steps but is otherwise maximally uninformative. In particular, it will reflect the translational and rotational invariance of the Euclidean space $\mathbf{X}$ and express total ignorance about any correlations. Such a prior can itself be derived from the principle of maximum entropy. Indeed, maximize

$S[Q,\mu] = -\int dx'\, Q(x'|x)\,\log\frac{Q(x'|x)}{\mu(x'|x)}\,,$   (11.2)

relative to a uniform measure $\mu(x'|x)$, subject to normalization, and subject to N independent constraints — one for each particle — that impose short steps and rotational invariance,

$\langle \delta_{ab}\,\Delta x_n^a\,\Delta x_n^b\rangle = \kappa_n\,, \quad (n = 1\ldots N)\,,$   (11.3)

where $\Delta x = x' - x$ and the $\kappa_n$ are small constants. The result is

$Q(x'|x) \propto \mu(x'|x)\,\exp\Big[-\frac{1}{2}\sum_n \alpha_n\,\delta_{ab}\,\Delta x_n^a\,\Delta x_n^b\Big]\,,$   (11.4)

where the Lagrange multipliers $\alpha_n$ are constants that are independent of x but may depend on the index n in order to describe non-identical particles. The $\alpha_n$s will eventually be taken to infinity in order to enforce the fact that the steps are meant to be infinitesimally short. In Cartesian coordinates the uniform measure is a numerical constant that can be absorbed into the normalization and therefore has no effect on Q.14 The result is a product of Gaussians,

$Q(x'|x) \propto \exp\Big[-\frac{1}{2}\sum_n \alpha_n\,\delta_{ab}\,\Delta x_n^a\,\Delta x_n^b\Big]\,,$   (11.5)

14 Indeed, as $\alpha_n \to \infty$ the prior Q becomes independent of any choice of $\mu(x')$ provided the latter is sufficiently smooth.
which describes the a priori lack of correlations among the particles. Next we specify the constraints that convey the information specific to each individual short step.

11.3.2 The phase constraint


In Newtonian dynamics one does not need to explain why a particle perseveres
in its motion in a straight line; what demands an explanation — that is, a
force — is why the particle deviates from inertial motion. In ED one does not
require an explanation for why the particles move. It is taken for granted that
things will not stay put; what requires an explanation is how the motion can be
both directional and highly correlated. The information about such correlations
is introduced through one constraint that acts simultaneously on all particles.
The constraint involves a function $\varphi(x) = \varphi(x_1 \ldots x_N)$ on the 3N-dimensional configuration space, $x \in \mathbf{X}_N$, that we shall refer to as the drift potential. We shall assume that the displacements $\Delta x_n^a$ are such that the expected change of $\varphi(x)$ is constrained to be

$\langle\Delta\varphi(x)\rangle = \sum_{n=1}^{N}\frac{\partial\varphi}{\partial x_n^a}\,\langle\Delta x_n^a\rangle = \kappa'(x)\,,$   (11.6)

where $\kappa'(x)$ is some small but for now unspecified function. This information is already sufficient to construct an interesting entropic dynamics which turns out to be a kind of diffusion where the expected "drift", $\langle\Delta x_n\rangle$, is determined by the "potential" $\varphi$.15
The physical origin of the potential '(x) is at this point unknown so how can
one justify its introduction? First, we note that identifying the relevant con-
straints, such as (11.6), represents signi…cant progress even when their physical
origin remains unexplained. This situation has historical precedents. For exam-
ple, in Newton’s theory of gravity or in the theory of elasticity, the speci…cation
of the forces turned out to be very useful even though their microscopic origin
had not yet been fully understood. Indeed, as we shall show the assumption
of a constraint involving a con…guration space function '(x) is instrumental to
explain quantum phenomena such as entanglement, interference, and tunneling.
A second, more formal justification is motivated by the geometrical discussion in chapter 10. We seek a dynamics in which evolution takes the form of curves or trajectories on the statistical manifold of probabilities $\{\rho\}$. Then, if the probabilities $\rho(x)$ are treated as generalized coordinates, it is only natural to expect that quantities $\varphi(x)$ must at some point be introduced to play the role of their conjugate momenta. What might be surprising is that the single function $\varphi(x)$ will play three roles that might appear to be totally unrelated to each other: first, as a constraint in an entropic inference; second, as the momenta conjugate

15 However, to construct that particular dynamics that describes quantum systems we must further require that $\varphi/\hbar$ be a multi-valued function with the topological properties of an angle — $\varphi(x)$ and $\varphi(x) + 2\pi\hbar$ represent the same "angle" (see section 11.8.4 below).
to generalized coordinates; and third, as (essentially) the phase of the quantum wave function.

11.3.3 The gauge constraints


The minimal constraints described in the previous paragraphs lead to a rich entropic dynamics but by imposing additional constraints we can construct even more realistic models. To incorporate the effect of an external electromagnetic field we shall impose the additional constraints that the expected displacements $\langle\Delta x_n^a\rangle$ of each particle n satisfy

$\langle\Delta x_n^a\rangle\,A_a(\vec{x}_n) = \kappa_n''\quad\text{for}\quad n = 1\ldots N\,.$   (11.7)

These N constraints involve a single vector field $A_a(\vec{x})$ that lives in the 3-dimensional physical space ($\vec{x}\in\mathbf{X}$). This ensures that all particles couple to one single electromagnetic field. The strength of the coupling is given by the values of the $\kappa_n''$. These are small quantities that could be specified directly but, as is often the case in entropic inference, it is much more convenient to specify them indirectly in terms of the corresponding Lagrange multipliers.

11.3.4 The transition probability


Following the by now standard procedure (see section 4.10), the distribution $P(x'|x)$ that maximizes the entropy $S[P,Q]$ in (11.1) relative to (11.5) and subject to (11.6), (11.7), and normalization is

$P(x'|x) \propto \exp\sum_n\Big[-\frac{\alpha_n}{2}\,\delta_{ab}\,\Delta x_n^a\Delta x_n^b + \alpha'\big(\partial_{na}\varphi - \beta_n A_a(\vec{x}_n)\big)\Delta x_n^a\Big]\,,$   (11.8)

where $\alpha_n$, $\alpha'$, and $\alpha'\beta_n$ are Lagrange multipliers, and $\partial_{na} = \partial/\partial x_n^a$.16 It is convenient to rewrite $P(x'|x)$ as

$P(x'|x) = \frac{1}{Z}\exp\Big[-\frac{1}{2}\sum_n \alpha_n\,\delta_{ab}\,\big(\Delta x_n^a - \langle\Delta x_n^a\rangle\big)\big(\Delta x_n^b - \langle\Delta x_n^b\rangle\big)\Big]\,,$   (11.9)

where Z is a normalization constant. Thus, a generic displacement $\Delta x_n^a = x_n'^a - x_n^a$ can be expressed as the sum of an expected drift plus a fluctuation,

$\Delta x_n^a = \langle\Delta x_n^a\rangle + \Delta w_n^a\,,$   (11.10)

given by

$\langle\Delta x_n^a\rangle = \frac{\alpha'}{\alpha_n}\,\delta^{ab}\big[\partial_{nb}\varphi - \beta_n A_b(\vec{x}_n)\big]\,,$   (11.11)

$\langle\Delta w_n^a\rangle = 0 \quad\text{and}\quad \langle\Delta w_n^a\,\Delta w_{n'}^b\rangle = \frac{1}{\alpha_n}\,\delta^{ab}\,\delta_{nn'}\,.$   (11.12)

16 The distribution (11.8) is not merely a local maximum or a stationary point. It yields the absolute maximum of the relative entropy $S[P,Q]$ subject to the constraints. The proof follows the standard argument originally due to Gibbs (see section 4.10).
The directionality of the motion and the correlations among the particles are introduced by a systematic drift in a direction determined by $\partial_{na}\varphi$ and $A_a$, while the position fluctuations remain isotropic and uncorrelated. As $\alpha_n \to \infty$, the trajectory is expected to be continuous. As we shall see below, whether the trajectory is differentiable or not depends on the particular choices of $\alpha_n$ and $\alpha'$. Eqs. (11.11) and (11.12) also show that the effect of $\alpha'$ is to enhance or suppress the magnitude of the drift relative to the fluctuations.
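To make the structure of eqs.(11.10)-(11.12) concrete, the following minimal numerical sketch (not part of the original text) samples a few short steps for a single particle in three dimensions. The drift potential gradient, the vector potential, and the multipliers $\alpha_n$ and $\alpha'$ are arbitrary illustrative choices, with $\beta_n$ absorbed into $A_a$.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative single-particle example; grad_phi, A, alpha_n, alpha_p are
# arbitrary assumptions (beta_n has been absorbed into A).
def grad_phi(x):                 # gradient of an assumed drift potential
    return np.array([1.0, 0.0, 0.0])

def A(x):                        # an assumed constant vector potential
    return np.array([0.0, 0.2, 0.0])

alpha_n, alpha_p = 1.0e4, 1.0    # multipliers alpha_n and alpha'

def short_step(x):
    """Sample Delta x = <Delta x> + Delta w, eqs. (11.10)-(11.12)."""
    drift = (alpha_p / alpha_n) * (grad_phi(x) - A(x))         # eq. (11.11)
    fluct = rng.normal(scale=np.sqrt(1.0 / alpha_n), size=3)   # eq. (11.12)
    return x + drift + fluct

x = np.zeros(3)
for _ in range(5):
    x = short_step(x)
print(x)

The sizes of the drift and of the fluctuations are controlled by the ratio of the two multipliers, exactly as stated above.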

11.3.5 Invariance under gauge transformations


The fact that the constraints (11.6) and (11.7) are not independent — they are both linear in the same displacements $\langle\Delta x_n^a\rangle$ — turns out to be significant: it leads to a gauge symmetry and provides the physical interpretation of the vector potential $A_a(\vec{x})$ as the corresponding gauge connection field. This is evident in eq.(11.8) where $\varphi$ and $A_a$ appear in the combination $\partial_{na}\varphi - \beta_n A_a$ which is invariant under a local gauge transformation,

$A_a(\vec{x}_n) \to A_a(\vec{x}_n) + \partial_a\chi(\vec{x}_n)\,,$   (11.13)

$\varphi(x) \to \varphi(x) + \sum_n \beta_n\,\chi(\vec{x}_n)\,,$   (11.14)

where $\chi(\vec{x})$ is a function in 3d-space and the multipliers $\beta_n$ will later be related to the electric charges $q_n$ by $\beta_n = q_n/c$.

11.4 Entropic time


The transition probability P (x0 jx) in (11.9) describes a single short step. To
predict motion over …nite distances these short steps must be iterated and this
is where time comes in. Time is introduced as a book-keeping device designed
to keep track of the accumulation of short steps.
Since the foundation for any theory of time is dynamics, that is, the theory of
change, it is important to be explicit about what changes one is talking about.
To be clear, ED involves two different kinds of change. One consists of the ontic changes $\Delta x$ of the positions that ED is designed to infer; the other reflects
the epistemic changes of the evolving probabilities. An ontic dynamics such as,
for example, Newtonian mechanics, leads to an ontic notion of time, while a
dynamics of probabilities necessarily leads to an epistemic notion of time.
Our task here is to develop an epistemic model of time that includes (a)
something one might identify as an “instant”, (b) a sense in which these instants
can be ordered, and (c) a convenient concept of “duration” that measures the
separation or interval between instants. A welcome bonus is that the model in-
corporates an intrinsic directionality — an evolution from past instants towards
future instants. Thus, an arrow of time does not have to be externally imposed
but is generated automatically. Such a construction we shall call entropic time
[Caticha 2010ab]. By design, entropic time is epistemic and, therefore, it is
not ontic. Later, in section 11.11, after ED has been more fully developed, we
shall return to the question of whether and how this epistemic notion of time is
related to the presumably more “physical” time that is measured by clocks.

11.4.1 Time as an ordered sequence of instants


A trajectory in ED consists of a succession of short steps that take the system through a sequence of positions $(x_0, x_1, x_2, \ldots)$. Consider, for example, a generic k-th step that takes the system from some unknown $x = x_{k-1}$ to an also unknown, neighboring next point $x' = x_k$. Integrating the joint probability $P(x_{k-1}, x_k)$ over $x_{k-1}$ gives

$P(x_k) = \int dx_{k-1}\,P(x_{k-1}, x_k) = \int dx_{k-1}\,P(x_k|x_{k-1})\,P(x_{k-1})\,.$   (11.15)
This equation follows directly from the laws of probability and, therefore, it is
true independently of any physical assumptions which means that it is not very
useful as it stands. To make it useful, something else must be added.
There is something peculiar about con…guration spaces. For example, when
we represent the position of a particle as a point with coordinates (x1 ; x2 ; x3 ) it
is implicitly understood that the values of the three coordinates x1 , x2 , and x3
hold simultaneously — no surprises here. Things get a bit more interesting when
we describe a system of N particles by a single point x = (~x1 ; ~x2 ; : : : ~xN ) in 3N -
dimensional con…guration space. The point x is meant to represent the state at
one instant, that is, it is also implicitly assumed that all the 3N coordinate values
are simultaneous. What is peculiar about con…guration spaces is that they
implicitly introduce a notion of simultaneity. Furthermore, when we express
uncertainty about the values of an x by means of a probability distribution
P (x) it is also implicitly understood that the di¤erent possible values of x all
refer to the same instant. The system could be at this point here or it could
be at that point there, we might not know which, but whichever it is, the two
possibilities refer to positions at one and the same instant.17 And similarly,
when we consider the transition probability from x to x0 , given by P (x0 jx), it
is implicitly assumed that x refers to one instant, and the possible x0 s all refer
to another instant. Thus, in ED, a probability distribution over con…guration
space provides a criterion of simultaneity.
We can now return to eq.(11.15): if $P(x_{k-1})$ happens to be the probability of different values of x at an "initial" instant of entropic time t, and $P(x_k|x_{k-1})$ is the transition probability from $x_{k-1}$ at one instant to $x_k$ at another instant, then we can interpret $P(x_k)$ as the probability of values of $x_k$ at a "later" instant of entropic time $t' = t + \Delta t$. Accordingly, we write $P(x_{k-1}) = \rho_t(x)$ and $P(x_k) = \rho_{t'}(x')$ so that

$\rho_{t'}(x') = \int dx\,P(x'|x)\,\rho_t(x)\,.$   (11.16)

17 We could of course consider the joint probability $P(\vec{x}_1(t_1), \vec{x}_2(t_2))$ of particle 1 being at $\vec{x}_1$ at time $t_1$ and particle 2 being at $\vec{x}_2$ at time $t_2$, but the set of points $\{\vec{x}_1(t_1), \vec{x}_2(t_2)\}$ is not at all what one would call a configuration space.
Nothing in the laws of probability leading to eq.(11.15) forces the interpretation (11.16) on us — this is the additional ingredient that allows us to construct time and dynamics in our model. If the distribution $\rho_t(x)$ refers to one instant t, then the distribution $\rho_{t'}(x')$ generated by $P(x'|x)$ by means of eq.(11.16) defines what we mean by the "next" instant $t'$. The dynamics is defined by iterating this process. Entropic time is constructed instant by instant: $\rho_{t'}$ is constructed from $\rho_t$, $\rho_{t''}$ is constructed from $\rho_{t'}$, and so on.
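As an illustration of how one instant generates the next, here is a schematic one-dimensional discretization of eq.(11.16) (a sketch of mine, not from the text); the grid, the constant drift velocity b, and the parameters eta, m, dt are all assumptions made only for the example.

import numpy as np

x = np.linspace(-5.0, 5.0, 401)
dx = x[1] - x[0]
eta, m, dt = 1.0, 1.0, 1e-3
b = 0.5 * np.ones_like(x)                   # assumed constant drift velocity

# Gaussian transition matrix P[i, j] ~ P(x_i | x_j), cf. eq. (11.21) in 1d
var = eta * dt / m
P = np.exp(-(x[:, None] - x[None, :] - b[None, :] * dt) ** 2 / (2 * var))
P /= P.sum(axis=0, keepdims=True) * dx      # normalize over x' for each x

rho = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # the "initial" instant
for _ in range(100):                            # construct instants one by one
    rho = (P * rho[None, :]).sum(axis=1) * dx   # eq. (11.16)

print(rho.sum() * dx)          # ~ 1: probability is conserved
print((x * rho).sum() * dx)    # mean has drifted by roughly b * 100 * dt

Each pass of the loop plays the role of one "instant": the updated distribution is all that is needed to generate the next one.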
The construction is intimately related to information and inference. An instant is a complete epistemic state. Its "completeness" consists in the instant being specified by information — codified into the distributions $\rho_t(x)$ and $P(x'|x)$ — that is sufficient for generating the next instant. Thus, the present instant is defined so that, given the present, the future is independent of the past.
Remark: It is common to use equations such as (11.16) to de…ne a special
kind of dynamics, called Markovian, that unfolds in a time de…ned by some
external clocks. In such a Markovian dynamics the speci…cation of the state
at one instant is su¢ cient to determine its evolution into the future. Since the
transition probability P (x0 jx), codi…es information supplied through the prior
and the constraints neither of which refer to anything earlier than the earlier
point x it is clear that formally ED is a Markovian process. There is, however, an
important di¤erence. Equation (11.16) is not being used to de…ne a (Markovian)
dynamics in a pre-existing background time because ED makes no reference to
external clocks. The system is its own clock and (11.16) is used both to de…ne
the dynamics and to construct time itself.18
Remark: In a relativistic theory there is a greater freedom in the choice of
instants which translates into a greater ‡exibility with the notion of simultane-
ity. But even in the relativistic setting the notion of instant delineated above
remains valid. The new element introduced by relativity is that these di¤erent
notions of simultaneity must be consistent with each other. This requirement
of consistency places severe constraints on the allowed forms of relativistic ED
[Ipek et al 2018, 2020].

11.4.2 The arrow of entropic time


The notion of time constructed according to eq.(11.16) is intrinsically directional. There is an absolute sense in which $\rho_{t'}(x')$ occurs after $\rho_t(x)$. To see how this comes about note that the same rules of probability that led us to (11.16) can also lead us to the time-reversed evolution,

$\rho_t(x) = \int dx'\,P(x|x')\,\rho_{t'}(x')\,.$   (11.17)

The temporal asymmetry is due to the fact that the distribution P (x0 jx), eq.(11.8),
is a Gaussian derived using the maximum entropy method, while the time-
1 8 In this respect, entropic time bears some resemblance with the relational notion of time

advocated by J. Barbour in the context of classical physics (see e.g. [Barbour 1994]).
reversed version $P(x|x')$ is related to $P(x'|x)$ by Bayes' theorem,

$P(x|x') = \frac{\rho_t(x)}{\rho_{t'}(x')}\,P(x'|x)\,,$   (11.18)

and, therefore, it is in general not Gaussian.


The puzzle of the arrow of time (see e.g. [Price 1996, Zeh 2007]) arises from
the di¢ culty in deriving a temporal asymmetry from underlying laws of nature
that are symmetric. The ED approach o¤ers a fresh perspective on this issue
because it does not assume any underlying laws of nature — whether they be
symmetric or not. The asymmetry is the inevitable consequence of constructing
time in a dynamics driven by entropic inference.
From the ED point of view the challenge does not consist in explaining the
arrow of time — entropic time only ‡ows forward — but rather in explaining
how it comes about that despite the arrow of time some laws of physics, such
as the Schrödinger equation, turn out to be time reversible. We will revisit this
topic in section 11.11.

11.4.3 Duration
We have argued that the concept of time is intimately connected to the associated dynamics but at this point neither the transition probability $P(x'|x)$ that specifies the dynamics nor the corresponding entropic time have been fully defined yet. It remains to specify how the interval $\Delta t$ between successive instants is encoded into the multipliers $\alpha_n$ and $\alpha'$.
The basic criterion for this choice is convenience: duration is defined so that motion looks simple. The description of motion is simplest when it reflects the symmetry of translations in space and time. We therefore choose $\alpha'$ and $\alpha_n$ to be constants independent of x and t. The resulting entropic time resembles Newtonian time in that it flows "equably everywhere and everywhen."
The particular choice of duration $\Delta t$ in terms of the multipliers $\alpha_n$ and $\alpha'$ can be motivated as follows. In Newtonian mechanics time is defined to simplify the dynamics. The prototype of a classical clock is a free particle that moves equal distances in equal times so that there is a well defined (constant) velocity. In ED time is also defined to simplify the dynamics, but now it is the dynamics of probabilities as prescribed by the transition probability. We define duration so that for short steps the system's expected displacement $\langle\Delta x\rangle$ increases by equal amounts in equal intervals $\Delta t$ so there is a well defined drift velocity. Referring to eq.(11.11) this is achieved by setting the ratio $\alpha'/\alpha_n$ proportional to $\Delta t$ and thus, the transition probability provides us with a clock. For future convenience the proportionality constants will be expressed in terms of some particle-specific constants $m_n$,

$\frac{\alpha'}{\alpha_n} = \frac{1}{m_n}\,\Delta t\,.$   (11.19)

At this point the constants $m_n$ receive no interpretation beyond the fact that their dependence on the particle label n recognizes that the particles need not
be identical. Eventually, however, the $m_n$s will be identified with the particle masses.19
Having specified the ratio $\alpha'/\alpha_n$ it remains to specify $\alpha'$ (or $\alpha_n$). It turns out that different choices of $\alpha'$ lead to qualitatively different motions at the sub-quantum or "microscopic" level. Remarkably, however, all of these sub-quantum motions lead to the same dynamics at the quantum level [Bartolomeo Caticha 2016].
The freedom to choose $\alpha'$ and still reproduce quantum mechanics is yet another manifestation of the fact that ED is an epistemic dynamics of probabilities and not an ontic dynamics of positions. ED gives us the probability $P(x',t'|x,t)$ but it is silent on what caused the motion from x to x' and, beyond the fact that the ontic path is continuous, it is also silent on whether the path is smooth or not. Nevertheless, one must define the duration $\Delta t$ and, to proceed further, one must commit to a definite choice of $\alpha'$. The situation bears some resemblance to gauge theories, where for actual calculations it is necessary to choose a gauge, even if the actual choice should have no effect on physical predictions.
In the next sections we shall explore the consequences of setting $\alpha' = \text{const}$, which leads to the highly irregular sub-quantum Brownian trajectories characteristic of Nelson's stochastic mechanics. Thus, we set

$\alpha' = \frac{1}{\eta} \quad\text{so that}\quad \alpha_n = \frac{m_n}{\eta\,\Delta t}\,,$   (11.20)

where a new constant $\eta$ is introduced. Below we shall comment further on the significance of $\eta$ and in section 11.6 we shall explore a different choice of $\alpha'$. There we shall set $\alpha' \propto 1/\Delta t^2$ and show that it leads to an ED in which the particles follow the smooth trajectories characteristic of Bohmian mechanics.

11.5 Brownian sub-quantum motion and the evolution equation
It is convenient to introduce a notation tailored to configuration space. Let $x^A = x_n^a$, $\partial_A = \partial/\partial x_n^a$, and $\delta^{AB} = \delta^{nn'}\delta^{ab}$, where $A, B, \ldots$ are composite indices labeling both the particles $(n, n', \ldots)$ and their spatial coordinates $(a, b, \ldots)$. With the choice (11.20), the transition probability (11.9) for a drift potential $\varphi$ becomes

$P(x'|x) = \frac{1}{Z}\exp\Big[-\frac{1}{2\eta\Delta t}\,m_{AB}\,\big(\Delta x^A - \langle\Delta x^A\rangle\big)\big(\Delta x^B - \langle\Delta x^B\rangle\big)\Big]\,,$   (11.21)

where we introduced the "mass" tensor and its inverse,

$m_{AB} = m_n\,\delta_{AB} = m_n\,\delta_{nn'}\,\delta_{ab} \quad\text{and}\quad m^{AB} = \frac{1}{m_n}\,\delta^{AB}\,.$   (11.22)

19 If $\Delta t$ and $m_n$ are given units of time and mass then eqs.(11.11) and (11.19) fix the units of $\varphi$.
A generic displacement is then written as a drift plus a fluctuation,

$\Delta x^A = \langle\Delta x^A\rangle + \Delta w^A = b^A\,\Delta t + \Delta w^A\,,$   (11.23)

where, from eqs.(11.11) and (11.25), the drift velocity is

$b^A(x) = \frac{\langle\Delta x^A\rangle}{\Delta t} = m^{AB}\big[\partial_B\varphi(x) - A_B(x)\big]\,,$   (11.24)

and $A(x)$ is the electromagnetic vector potential expressed as a field in configuration space. Its components are

$A_A(x) = A_{an}(x) = \beta_n\,A_a(\vec{x}_n)\,.$   (11.25)

From (11.21) we see that the fluctuation $\Delta w^A$ obeys

$\langle\Delta w^A\rangle = 0 \quad\text{and}\quad \langle\Delta w^A\,\Delta w^B\rangle = \eta\,m^{AB}\,\Delta t\,,$   (11.26)

which shows that the constant $\eta$ controls the strength of the fluctuations. Note that for very short steps, as $\Delta t \to 0$, the fluctuations become dominant: the drift is $\langle\Delta x^A\rangle \sim O(\Delta t)$ while $\Delta w^A \sim O(\Delta t^{1/2})$, which is characteristic of Brownian paths. Thus, with the choice (11.20), the trajectory is continuous but not differentiable: a particle has a definite position but its velocity, the tangent to the trajectory, is completely undefined. To state this more explicitly, since $\Delta w^A \sim O(\Delta t^{1/2})$ but $\langle\Delta w^A\rangle = 0$, we see that the limit $\Delta t \to 0$ and the expectation $\langle\,\cdot\,\rangle$ do not commute,

$\lim_{\Delta t\to 0}\Big\langle\frac{\Delta x^A}{\Delta t}\Big\rangle \neq \Big\langle\lim_{\Delta t\to 0}\frac{\Delta x^A}{\Delta t}\Big\rangle\,,$   (11.27)

because

$\lim_{\Delta t\to 0}\frac{\langle\Delta w^A\rangle}{\Delta t} = 0 \quad\text{while}\quad \lim_{\Delta t\to 0}\frac{\Delta w^A}{\Delta t} = \infty\,.$   (11.28)
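The different scalings of drift and fluctuations can be made vivid with a short numerical comparison (a sketch of mine, with eta, m and a drift velocity b of order one assumed):

import numpy as np

eta = m = b = 1.0   # assumed constants of order one
for dt in [1e-2, 1e-4, 1e-6, 1e-8]:
    drift = b * dt                    # <Delta x> ~ O(dt),        eq. (11.23)
    fluct = np.sqrt(eta * dt / m)     # rms Delta w ~ O(dt**0.5), eq. (11.26)
    print(f"dt={dt:.0e}  drift={drift:.1e}  rms fluct={fluct:.1e}  ratio={fluct/drift:.1e}")

The ratio of fluctuation to drift grows like $\Delta t^{-1/2}$, which is the numerical face of eqs.(11.27)-(11.28).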

11.5.1 The information metric of configuration space


Before studying the dynamics defined by eq.(11.21) we take a brief detour to consider the geometry of the N-particle ontic configuration space, $\mathbf{X}_N$. Since the single particle space $\mathbf{X}$ is described by the Euclidean metric $\delta_{ab}$ we can expect that the N-particle configuration space, $\mathbf{X}_N = \mathbf{X}\times\ldots\times\mathbf{X}$, will also be flat, but for non-identical particles a question might be raised about the relative scales or weights associated to each $\mathbf{X}$ factor. Information geometry provides the answer.
The fact that to each point $x \in \mathbf{X}_N$ there corresponds a probability distribution $P(x'|x)$ means that to the ontic configuration space $\mathbf{X}_N$ we can associate a statistical manifold and, as we saw in chapter 7, its geometry is uniquely determined (up to an overall scale factor) by the information metric,

$\gamma_{AB} = C\int dx'\,P(x'|x)\,\frac{\partial\log P(x'|x)}{\partial x^A}\,\frac{\partial\log P(x'|x)}{\partial x^B}\,,$   (11.29)
where C is a positive constant. Substituting eq.(11.21) into (11.29) in the limit of short steps ($\Delta t \to 0$) yields (see eq. 7.180)

$\gamma_{AB} = \frac{C}{\eta\,\Delta t}\,m_{AB}\,.$   (11.30)

The divergence as $\Delta t \to 0$ arises because the information metric measures statistical distinguishability. As $\Delta t \to 0$ the distributions $P(x'|x)$ and $P(x'|x+\Delta x)$ become more sharply peaked and increasingly easier to distinguish. Therefore, $\gamma_{AB} \to \infty$. To define a geometry that remains useful even for arbitrarily small $\Delta t$ we can choose $C \propto \Delta t$,

$\gamma_{AB} \propto m_{AB} \quad\text{or}\quad \gamma_{an,bn'} \propto m_n\,\delta_{ab}\,\delta_{nn'}\,.$   (11.31)

Thus, up to overall constants the mass tensor is the metric of configuration space. In words: the metric for the N-particle configuration space, $\mathbf{X}_N = \mathbf{X}\times\ldots\times\mathbf{X}$, is a block diagonal matrix in which the block corresponding to each single particle is the flat Euclidean metric $\delta_{ab}$ scaled by the mass of the particle, $m_n\delta_{ab}$.
Ever since the work of H. Hertz in 1894 [Lanczos 1970] it has been standard practice to describe the motion of systems with many particles as the motion of a single point in an abstract space — the configuration space. The choice of the geometry of this ontic configuration space has been based on an examination of the kinetic energy of the system. Historically this choice has been regarded as a matter of convenience, a merely useful convention. We can now see that entropic dynamics points to a uniquely natural choice: up to a global scale factor the metric follows uniquely from information geometry.
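The result (11.30) can be checked by brute force. The following sketch (mine; the values of eta, m, dt and the constant drift b are arbitrary assumptions) estimates the information metric (11.29) for a one-dimensional Gaussian transition probability by Monte Carlo and compares it with m/(eta*dt):

import numpy as np

rng = np.random.default_rng(2)
eta, m, dt, b = 1.0, 2.0, 1e-3, 0.7     # assumed values
var = eta * dt / m

x, N = 0.0, 400_000
xp = rng.normal(loc=x + b * dt, scale=np.sqrt(var), size=N)   # samples of x'
dlogP_dx = (xp - x - b * dt) / var      # d/dx log P(x'|x) for constant b
gamma = np.mean(dlogP_dx ** 2)          # eq. (11.29) with C = 1

print(gamma, m / (eta * dt))            # the two numbers agree, cf. eq. (11.30)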

11.5.2 The evolution equation in differential form


Entropic dynamics is generated by iterating eq.(11.16),

$\rho_{t+\Delta t}(x') = \int dx\,P(x',t+\Delta t|x,t)\,\rho_t(x)\,,$   (11.32)

where the times t and $t+\Delta t$ have been written down explicitly. As so often in physics it is more convenient to rewrite the evolution equation above in differential form. One might be tempted to Taylor expand in $\Delta t$ and $\Delta x = x' - x$, but this is not possible because for small $\Delta t$ the distribution $P(x',t+\Delta t|x,t)$, eq.(11.21), is very sharply peaked at $x' = x$. To handle such singular behavior one follows an indirect procedure that is well known from diffusion theory [Chandrasekhar 1943]: multiply by a smooth test function $f(x')$ and integrate over $x'$,

$\int dx'\,\rho_{t+\Delta t}(x')f(x') = \int dx\Big[\int dx'\,P(x',t+\Delta t|x,t)\,f(x')\Big]\rho_t(x)\,.$   (11.33)

The test function $f(x')$ is assumed sufficiently smooth precisely so that it can be expanded about x. The important point here is that for Brownian paths
eq.(11.26) implies that the terms $(\Delta x)^2$ contribute to $O(\Delta t)$. Then, dropping all terms of order higher than $\Delta t$, the integral in the brackets is

$[\,\cdot\,] = \int dx'\,P(x',t+\Delta t|x,t)\Big[f(x) + \frac{\partial f}{\partial x^A}\Delta x^A + \frac{1}{2}\frac{\partial^2 f}{\partial x^A\partial x^B}\Delta x^A\Delta x^B + \ldots\Big]$
$\qquad = f(x) + b^A(x)\,\Delta t\,\frac{\partial f}{\partial x^A} + \frac{1}{2}\eta\,\Delta t\,m^{AB}\frac{\partial^2 f}{\partial x^A\partial x^B} + \ldots$   (11.34)

where we used eqs.(11.23) and (11.26),

$\lim_{\Delta t\to 0^+}\frac{1}{\Delta t}\int dx'\,P(x',t+\Delta t|x,t)\,\Delta x^A = b^A(x)\,,$
$\lim_{\Delta t\to 0^+}\frac{1}{\Delta t}\int dx'\,P(x',t+\Delta t|x,t)\,\Delta x^A\Delta x^B = \eta\,m^{AB}\,.$   (11.35)

Dropping the primes on the left hand side of (11.33), substituting (11.34) into the right, and dividing by $\Delta t$, gives

$\int dx\,\frac{1}{\Delta t}\big[\rho_{t+\Delta t}(x) - \rho_t(x)\big]f(x) = \int dx\Big[b^A(x)\frac{\partial f}{\partial x^A} + \frac{\eta}{2}m^{AB}\frac{\partial^2 f}{\partial x^A\partial x^B}\Big]\rho_t(x)\,.$   (11.36)

Next integrate by parts on the right and let $\Delta t \to 0$. The result is

$\int dx\,\partial_t\rho_t(x)\,f(x) = \int dx\Big[-\frac{\partial}{\partial x^A}\big(b^A\rho_t\big) + \frac{\eta}{2}m^{AB}\frac{\partial^2\rho_t}{\partial x^A\partial x^B}\Big]f(x)\,.$   (11.37)

Since the test function $f(x)$ is arbitrary, we conclude that

$\partial_t\rho_t = -\partial_A\big(b^A\rho_t\big) + \frac{\eta}{2}\,m^{AB}\partial_A\partial_B\rho_t\,.$   (11.38)

Thus, the differential equation for the evolution of $\rho_t(x)$ takes the form of a Fokker-Planck equation. To proceed with our analysis we shall rewrite (11.38) in several different forms.
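A quick way to see the Fokker-Planck equation (11.38) at work is to integrate it by finite differences in one dimension with a constant drift, where the exact solution is a drifting, spreading Gaussian. The sketch below is mine; the grid, time step and parameter values are assumptions chosen only so that the explicit scheme is stable.

import numpy as np

x = np.linspace(-10.0, 10.0, 801)
dx = x[1] - x[0]
eta, m, b, dt = 1.0, 1.0, 1.0, 1e-4
D = eta / (2 * m)                               # diffusion coefficient in eq. (11.38)

rho = np.exp(-x**2 / 1.0) / np.sqrt(np.pi)      # initial Gaussian, variance 0.5
for _ in range(10_000):                         # evolve to t = 1
    drho = (-b * np.gradient(rho, dx)
            + D * np.gradient(np.gradient(rho, dx), dx))
    rho = rho + dt * drho

t, var = 1.0, 0.5 + 2 * D * 1.0                 # exact: mean b*t, variance 0.5 + 2Dt
exact = np.exp(-(x - b * t)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
print(rho.sum() * dx, np.abs(rho - exact).max())   # normalization ~1, small error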

11.5.3 The current and osmotic velocities


The evolution equation (11.38) can also be rewritten as

$\partial_t\rho_t = -\partial_A\Big[\Big(b^A - \frac{\eta}{2}\,m^{AB}\partial_B\log\rho_t\Big)\rho_t\Big]\,.$   (11.39)

The interpretation is clear: the fact that the particles follow continuous paths implies that the probabilities are conserved locally in the N-particle configuration space, $\mathbf{X}_N$. This continuity equation can be written as

$\partial_t\rho = -\partial_A\big(\rho\,v^A\big)\,,$   (11.40)

where $v^A$, the velocity of the probability flow or current velocity, is

$v^A = b^A + u^A\,,$   (11.41)
where $b^A$ is the drift velocity, eq.(11.24), and $u^A$ is the osmotic velocity,

$u^A \overset{\text{def}}{=} -\eta\,m^{AB}\partial_B\log\rho^{1/2}\,.$   (11.42)

The interpretation is straightforward: eq.(11.42) shows that the osmotic velocity $u^A$ reflects the tendency for probability to flow down the density gradient in a diffusion process which is analogous to Brownian motion. Indeed, in Brownian motion the drift velocity $b^A$ is the response to the gradient of an external potential while $u^A$ is the response to the gradient of a concentration or chemical potential — the so-called osmotic force. The osmotic contribution to the probability flow is the actual diffusion current,

$\rho\,u^A = -\frac{\eta}{2}\,m^{AB}\partial_B\,\rho\,,$   (11.43)

which can be recognized as the 3N-dimensional configuration space version of Fick's law with a diffusion tensor given by $\eta\,m^{AB}/2$.20
Since both $b^A$ and $u^A$ involve gradients the current velocity for Brownian paths can be written as

$v^A = m^{AB}\big(\partial_B\Phi - A_B\big)\,,$   (11.44)

where we introduced a new "potential",

$\Phi = \varphi - \eta\log\rho^{1/2}\,,$   (11.45)

that will be called the phase. (Eventually $\Phi$ will be identified as the phase of the wave function.) Eqs.(11.43) and (11.45) show, once again, that the action of the constant $\eta$ is to control the relative strength of diffusion and drift.
Next we shall rewrite the continuity equation (11.40) in yet another equivalent but very suggestive form involving functional derivatives.

11.5.4 A quick review of functional derivatives


Functional derivatives can be defined by analogy to partial derivatives. A functional $F[\rho]$ is a function of a function, that is, a map that associates a number to a function $\rho$. We can think of $F[\rho]$ as a function of infinitely many variables $\rho(x) = \rho_x$ that are labeled by a continuous index x.
If $f(q) = f(q_1, q_2, \ldots q_n)$ is a function of several variables $q_i$ labeled by a discrete index i, then small changes $q_i \to q_i + dq_i$ induce a small change $df$ that to first order in $dq_i$ is given by

$df \overset{\text{def}}{=} \sum_i \frac{\partial f}{\partial q_i}\,dq_i\,.$   (11.46)

20 The definition of osmotic velocity adopted in [Nelson 1966] and other authors differs from ours by a sign. Nelson takes the osmotic velocity to be the velocity imparted by the external force that is needed to balance the osmotic force (due to concentration gradients) in order to attain equilibrium. Here the osmotic velocity is the velocity associated to the actual diffusion current, eq.(11.43).
The partial derivative $\partial f/\partial q_i$ is defined as the coefficient of the term linear in $dq_i$.
Similarly, if $F[\rho] = F(\ldots\rho_x\ldots\rho_{x'}\ldots)$ is a function of infinitely many variables $\rho_x$ labeled by a continuous index x, then a small change in the function $\rho(x) \to \rho(x) + \delta\rho(x)$ will induce a small change $\delta F$ of the functional $F[\rho]$. To first order in $\delta\rho$ we have

$\delta F \overset{\text{def}}{=} \int dx\,\frac{\delta F}{\delta\rho(x)}\,\delta\rho(x)\,,$   (11.47)

where the functional derivative $\delta F/\delta\rho(x)$ is defined as the coefficient of the term linear in $\delta\rho(x)$.
The virtue of this approach is that it allows us to manipulate and calculate functional derivatives by just following the familiar rules of calculus such as Taylor expansions, integration by parts, etc. For example, if the functional $F[\rho]$ just returns the value of $\rho(x)$ at the point y, that is $F[\rho] = \rho_y$, then

$\delta F = \delta\rho(y) = \int dx\,\frac{\delta F}{\delta\rho(x)}\,\delta\rho(x) \quad\text{implies}\quad \frac{\delta F}{\delta\rho(x)} = \delta(x - y)\,.$   (11.48)
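A simple way to build intuition is to discretize x and treat the values $\rho(x_i)$ as ordinary variables; the functional derivative is then the partial derivative divided by dx. The example below is a sketch of mine for the functional $F[\rho] = \int dx\,\rho^2$, whose functional derivative is $2\rho$:

import numpy as np

x = np.linspace(0.0, 1.0, 101)
dx = x[1] - x[0]
rho = np.sin(np.pi * x) ** 2

F = lambda r: np.sum(r ** 2) * dx          # F[rho] = int dx rho(x)^2
eps = 1e-6
num = np.array([(F(rho + eps * np.eye(len(x))[i]) - F(rho)) / (eps * dx)
                for i in range(len(x))])   # (1/dx) dF/d(rho_i)

print(np.allclose(num, 2 * rho, atol=1e-4))   # exact: delta F / delta rho = 2 rho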

11.5.5 The evolution equation in Hamiltonian form


We can now return to rewriting the continuity equation (11.40) in an alternative form. The important observation is that a functional $\tilde{H}[\rho,\Phi]$ can be found such that (11.40) can be written as

$\partial_t\rho_t(x) = \frac{\delta\tilde{H}}{\delta\Phi(x)}\,.$   (11.49)

The desired $\tilde{H}$ satisfies

$-\partial_A\big[\rho_t\,m^{AB}\big(\partial_B\Phi - A_B\big)\big] = \frac{\delta\tilde{H}}{\delta\Phi(x)}\,,$   (11.50)

which is a linear functional equation that can be easily integrated. The result is

$\tilde{H}[\rho,\Phi] = \int dx\,\frac{1}{2}\,\rho\,m^{AB}\big(\partial_A\Phi - A_A\big)\big(\partial_B\Phi - A_B\big) + F[\rho]\,,$   (11.51)

where the unspecified functional $F[\rho]$ is an integration constant. ($F$ could also depend on $x$ and on $t$.) We can check that a variation $\Phi \to \Phi + \delta\Phi$ followed by an integration by parts reproduces the correct functional derivative,

$\delta\tilde{H} = \int dx\,\frac{1}{2}\,\rho\,m^{AB}\big[\partial_A\delta\Phi\,\big(\partial_B\Phi - A_B\big) + \big(\partial_A\Phi - A_A\big)\,\partial_B\delta\Phi\big]$
$\qquad = -\int dx\,\partial_A\big[\rho\,m^{AB}\big(\partial_B\Phi - A_B\big)\big]\,\delta\Phi\,.$

The continuity equation (11.49) describes a dynamics in which the evolution of the probability density $\rho_t(x)$ is driven by two non-dynamical fields $\Phi(x)$ and
$A(x)$. This is an interesting ED in its own right but it is not QM. Indeed, a quantum dynamics consists in the coupled evolution of two dynamical fields: the density $\rho_t(x)$ and the phase of the wave function. This second field can be naturally introduced into ED by allowing the phase field $\Phi_t(x)$ in (11.49) to become dynamical, which amounts to an ED in which the constraint (11.6) is itself continuously updated at each instant in time. To complete the construction of ED we must identify the appropriate updating criterion (e.g., along the lines of Chapter 10) to formulate an ED in which the phase field $\Phi_t$ guides the evolution of $\rho_t$, and in return, the evolving $\rho_t$ reacts back and induces the evolution of $\Phi_t$.
Remark: One might suspect that the Hamiltonian $\tilde{H}$ in (11.51) will eventually lead us to the concept of energy and this will indeed turn out to be the case. But there is something peculiar about $\tilde{H}$: the variables that define $\tilde{H}$ are probabilities and the phase fields which are both epistemic quantities. It therefore follows that in the ED approach the energy is also an epistemic concept. In ED only positions are ontic; energy is not. Surprising as this may sound, it is not an impediment to formulating laws of physics that are empirically successful.
Remark: We note that once the evolution equation is written in Hamiltonian form in terms of the phase $\Phi$ the constant $\eta$ disappears from the formalism. This means that changes in $\rho$ arise from the combined effect of drift and diffusion and it is no longer possible to attribute any particular effect to one or the other. The fact that it is possible to enhance or suppress the fluctuations relative to the drift to achieve the same overall evolution shows that there is a whole family of ED models that differ at the "microscopic" or sub-quantum level. Nevertheless, as we shall see, all members of this family lead to the same "emergent" Schrödinger equation at the "macroscopic" or quantum level. The model in which fluctuations are (almost) totally suppressed is of particular interest: the system evolves along the smooth lines of probability flow. This suggests that ED includes the Bohmian or causal form of quantum mechanics as a special limiting case. (For more on this see section 11.6.)
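To see eqs.(11.49)-(11.51) in action, one can discretize a one-dimensional example with A = 0 and F[rho] = 0 and check that the numerical functional derivative of $\tilde{H}$ with respect to $\Phi$ reproduces the continuity equation, $-\partial_x(\rho v)$ with $v = \partial_x\Phi/m$. The sketch is mine; the choices of rho and Phi are arbitrary smooth assumptions.

import numpy as np

x = np.linspace(-10.0, 10.0, 401)
dx = x[1] - x[0]
m = 1.0
rho = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
Phi = 0.3 * x                               # phase with constant gradient

def H(rho, Phi):                            # eq. (11.51) in 1d with A = 0, F = 0
    return np.sum(rho * np.gradient(Phi, dx) ** 2 / (2 * m)) * dx

eps = 1e-6
dH_dPhi = np.array([(H(rho, Phi + eps * np.eye(len(x))[i]) - H(rho, Phi)) / (eps * dx)
                    for i in range(len(x))])

continuity = -np.gradient(rho * np.gradient(Phi, dx) / m, dx)   # -d_x(rho v)
print(np.abs(dH_dPhi - continuity).max())   # small, up to discretization error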

11.5.6 The future and past drift velocities


An interesting consequence of the time asymmetry, eq.(11.18), is that the drift velocities towards the future and from the past do not coincide. Let us be more specific. Equation (11.24) gives the mean drift velocity to the future,

$b^A(x) = \lim_{\Delta t\to 0^+}\Big\langle\frac{x^A(t+\Delta t) - x^A(t)}{\Delta t}\Big\rangle_{x(t)} = \lim_{\Delta t\to 0^+}\frac{1}{\Delta t}\int dx'\,P(x'|x)\,\Delta x^A\,,$   (11.52)

where $x = x(t)$, $x' = x(t+\Delta t)$, and $\Delta x^A = x'^A - x^A$. Note that the expectation in (11.52) is conditional on the earlier position $x = x(t)$. One can also define a mean drift velocity from the past,

$b_*^A(x) = \lim_{\Delta t\to 0^+}\Big\langle\frac{x^A(t) - x^A(t-\Delta t)}{\Delta t}\Big\rangle_{x(t)}\,,$   (11.53)
where the expectation is conditional on the later position $x = x(t)$. Shifting the time by $\Delta t$, $b_*^A$ can be equivalently written as

$b_*^A(x') = \lim_{\Delta t\to 0^+}\Big\langle\frac{x^A(t+\Delta t) - x^A(t)}{\Delta t}\Big\rangle_{x(t+\Delta t)} = \lim_{\Delta t\to 0^+}\frac{1}{\Delta t}\int dx\,P(x|x')\,\Delta x^A\,,$   (11.54)

with the same definition of $\Delta x^A$ as in eq.(11.52).
The two drift velocities, towards the future $b^A$ and from the past $b_*^A$, do not coincide. The connection between them was derived by Nelson in [Nelson 1966, 1985] and independently by Jaynes [Jaynes 1989]. It turns out to be a straightforward consequence of Bayes' theorem, eq.(11.18). To derive it expand $\rho_{t'}(x')$ about x in (11.18) to get

$P(x|x') = \frac{\rho_t(x)}{\rho_{t'}(x)}\Big[1 - \partial_B\log\rho_{t'}(x)\,\Delta x^B + \ldots\Big]P(x'|x)\,.$   (11.55)

Next multiply $b_*^A(x')$ by a smooth test function $f(x')$ and integrate,

$\int dx'\,b_*^A(x')\,f(x') = \lim_{\Delta t\to 0^+}\frac{1}{\Delta t}\int dx'\int dx\,P(x|x')\,\Delta x^A\,f(x')\,.$   (11.56)

On the right hand side expand $f(x')$ about x and use (11.55),

$\lim_{\Delta t\to 0^+}\frac{1}{\Delta t}\int dx'\int dx\,\frac{\rho_t(x)}{\rho_{t'}(x)}\,P(x'|x)\,\big[\Delta x^A f(x) - \Delta x^A\Delta x^B f(x)\,\partial_B\log\rho_{t'}(x) + \Delta x^A\Delta x^C\,\partial_C f(x) + \ldots\big]\,.$   (11.57)

Interchange the orders of integration and integrate over $x'$ using eq.(11.26),

$\langle\Delta x^A\Delta x^B\rangle_x = \langle\Delta w^A\Delta w^B\rangle_x + O(\Delta t^{3/2}) = \eta\,m^{AB}\Delta t + O(\Delta t^{3/2})\,,$   (11.58)

to get

$\lim_{\Delta t\to 0^+}\int dx\,\frac{\rho_t(x)}{\rho_{t'}(x)}\Big[\frac{\langle\Delta x^A\rangle_x}{\Delta t}f(x) - \eta\,m^{AB}f(x)\,\partial_B\log\rho_{t'}(x) + \eta\,m^{AB}\partial_B f(x) + \ldots\Big]\,.$   (11.59)

Next take the limit $\Delta t \to 0^+$ and note that the third term vanishes (just integrate by parts). The result is

$\int dx\,b_*^A(x)\,f(x) = \int dx\,\big[b^A(x) - \eta\,m^{AB}\partial_B\log\rho_t(x)\big]\,f(x)\,.$   (11.60)

Since $f(x)$ is arbitrary we get the desired relation,

$b_*^A(x) = b^A(x) - \eta\,m^{AB}\partial_B\log\rho_t(x)\,.$   (11.61)

Incidentally, eq.(11.61) shows that the current $v^A$ and osmotic $u^A$ velocities, eqs.(11.41) and (11.42), can be expressed in terms of the sum and difference of the past and future drift velocities,

$v^A = \frac{1}{2}\big(b^A + b_*^A\big) \quad\text{and}\quad u^A = \frac{1}{2}\big(b_*^A - b^A\big)\,.$   (11.62)

11.6 An alternative: Bohmian sub-quantum motion
In section 11.5.5 we remarked that in the limit $\eta \to 0$ the Brownian trajectories become as smooth as one wishes. In this section we explore an alternative way to achieve the same result. Rather than $\alpha' = \text{const}$, eq.(11.20), and Brownian trajectories, we shall set $\alpha' \propto 1/\Delta t^2$ which leads directly to the smooth sub-quantum trajectories characteristic of a Bohmian mechanics. Thus, recalling eq.(11.19), we set

$\alpha' = \frac{1}{\eta_0\,\Delta t^2} \quad\text{so that}\quad \alpha_n = \frac{m_n}{\eta_0\,\Delta t^3}\,,$   (11.63)

where $\eta_0$ is a new constant.
Our goal is to show that this choice of $\alpha'$ has no effect on the dynamics provided a new "drift potential" $\varphi_{\text{Bohmian}} = \varphi'$ is suitably chosen. Then, with the choice (11.63), the new transition probability (11.9) becomes

$P(x'|x) = \frac{1}{Z}\exp\Big[-\frac{1}{2\eta_0\Delta t}\,m_{AB}\Big(\frac{\Delta x^A}{\Delta t} - b'^A(x)\Big)\Big(\frac{\Delta x^B}{\Delta t} - b'^B(x)\Big)\Big]\,,$   (11.64)

where we used (11.11) to define the drift velocity,

$b'^A(x) = \frac{\langle\Delta x^A\rangle}{\Delta t} = m^{AB}\big[\partial_B\varphi'(x) - A_B(x)\big]\,,$   (11.65)

and the fluctuations $\Delta w^A$ are given by

$\langle\Delta w^A\rangle = 0 \quad\text{and}\quad \langle\Delta w^A\,\Delta w^B\rangle = \eta_0\,m^{AB}\,\Delta t^3\,,$   (11.66)

or

$\Big\langle\Big(\frac{\Delta x^A}{\Delta t} - b'^A\Big)\Big(\frac{\Delta x^B}{\Delta t} - b'^B\Big)\Big\rangle = \eta_0\,m^{AB}\,\Delta t\,.$   (11.67)

It is noteworthy that $\langle\Delta x^A\rangle \sim O(\Delta t)$ and $\Delta w^A \sim O(\Delta t^{3/2})$. This means that as $\Delta t \to 0$ the dynamics is dominated by the drift and the fluctuations become negligible. Indeed, since $\Delta w^A \sim O(\Delta t^{3/2})$, eq.(11.23) shows that the limit

$\lim_{\Delta t\to 0}\frac{\Delta x^A}{\Delta t} = b'^A$   (11.68)
is well defined. In words: the actual velocities of the particles coincide with the expected or drift velocities. From eq.(11.65) we see that these velocities are continuous functions. Since, as we shall later see, these smooth trajectories coincide with the trajectories postulated in Bohmian mechanics, we shall call them Bohmian trajectories to distinguish them from the Brownian trajectories discussed in section 11.5.2.
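The contrast between the two sub-quantum pictures can be seen by evolving a single coordinate with the same drift velocity field under the two choices of $\alpha'$: in the Brownian case the fluctuation per step is of order $\Delta t^{1/2}$, in the Bohmian case it is negligible. This is a sketch of mine with an assumed drift b(x) = -x.

import numpy as np

rng = np.random.default_rng(3)
eta, m, dt, steps = 1.0, 1.0, 1e-3, 1000
b = lambda x: -x                      # assumed drift velocity field

xB = xS = 0.5
for _ in range(steps):
    xB += b(xB) * dt + rng.normal(scale=np.sqrt(eta * dt / m))   # Brownian, eq. (11.20)
    xS += b(xS) * dt                                             # smooth/Bohmian, eq. (11.63)

print(xB, xS)   # both relax toward 0; only xB retains residual jitter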

11.6.1 The evolution equation in differential form


We wish to rewrite the evolution equation (11.16),

$\rho_{t+\Delta t}(x') = \int dx\,P(x',t+\Delta t|x,t)\,\rho_t(x)\,,$   (11.69)

in differential form. Since for small $\Delta t$ the transition probability $P(x',t+\Delta t|x,t)$ is very sharply peaked at $x' = x$ we proceed as in section 11.5.2. We multiply by a smooth test function $f(x')$ and integrate over $x'$,

$\int dx'\,\rho_{t+\Delta t}(x')f(x') = \int dx\Big[\int dx'\,P(x',t+\Delta t|x,t)\,f(x')\Big]\rho_t(x)\,.$   (11.70)

The test function $f(x')$ is assumed sufficiently smooth precisely so that it can be expanded about x. Then, dropping all terms of order higher than $\Delta t$, as $\Delta t \to 0$ the integral in the brackets is

$[\,\cdot\,] = \int dx'\,P(x',t+\Delta t|x,t)\Big[f(x) + \frac{\partial f}{\partial x^A}\big(x'^A - x^A\big) + \ldots\Big] = f(x) + b'^A(x)\,\Delta t\,\frac{\partial f}{\partial x^A} + \ldots$   (11.71)

where we used eq.(11.23). Dropping the primes on the left hand side of (11.70), substituting (11.71) into the right, and dividing by $\Delta t$, gives

$\int dx\,\frac{1}{\Delta t}\big[\rho_{t+\Delta t}(x) - \rho_t(x)\big]f(x) = \int dx\,b'^A(x)\,\frac{\partial f}{\partial x^A}\,\rho_t(x)\,.$   (11.72)

Next integrate by parts on the right and let $\Delta t \to 0$. Since the test function $f(x)$ is arbitrary, we conclude

$\partial_t\rho_t(x) = -\partial_A\big[\rho_t(x)\,b'^A(x)\big]\,,$   (11.73)

which is the desired evolution equation for $\rho_t(x)$ written in differential form. This is a continuity equation where the current velocity is equal to the drift velocity, $v^A = b'^A$.
Thus, whether we deal with Brownian ($\alpha' = \text{const}$) or Bohmian ($\alpha' \propto 1/\Delta t^2$) trajectories we find the same continuity equation

$\partial_t\rho_t(x) = -\partial_A\big[\rho_t(x)\,v^A(x)\big] \quad\text{with}\quad v^A = m^{AB}\big[\partial_B\Phi(x) - A_B(x)\big]\,,$   (11.74)
provided the corresponding drift potentials $\varphi_{\text{Bohmian}}$ and $\varphi_{\text{Brownian}}$ are chosen such that they lead to the same phase field,

$\Phi = \varphi_{\text{Bohmian}} = \varphi_{\text{Brownian}} - \eta\log\rho^{1/2}\,.$   (11.75)

It also follows that whether we deal with Bohmian or Brownian paths, the evolution of probabilities can be expressed in the same Hamiltonian form given in eqs.(11.49) and (11.51).

A fractional Brownian motion? — Our choices of $\alpha'$ led to Brownian and Bohmian paths but more general fractional Brownian motions [Mandelbrot Van Ness 1968] are in principle possible. Consider

$\alpha' = \frac{1}{\eta''\,\Delta t^{\gamma-1}} \quad\text{and}\quad \alpha_n = \frac{m_n}{\eta''\,\Delta t^{\gamma}}\,,$   (11.76)

where $\gamma$ and $\eta''$ are positive constants. We will not pursue this topic further except to note that for $\gamma < 2$ the sub-quantum motion is dominated by fluctuations and the trajectories are non-differentiable, while for $\gamma > 2$ the drift dominates and velocities are well defined.

11.7 The epistemic phase space


In ED we deal with two configuration spaces. One is the ontic configuration space $\mathbf{X}_N = \mathbf{X}\times\mathbf{X}\times\ldots$ of all particle positions, $x = (x_1 \ldots x_N) \in \mathbf{X}_N$. The other is the epistemic configuration space or e-configuration space $\mathbf{S}$ of all normalized probabilities,

$\mathbf{S} = \Big\{\rho(x)\,:\ \rho(x) \geq 0\,,\ \int dx\,\rho(x) = 1\Big\}\,.$   (11.77)

To formulate the coupled dynamics of $\rho$ and $\Phi$ we need a framework to study trajectories in the larger space $\{\rho, \Phi\}$ that we will call the epistemic phase space or e-phase space.
As we saw in chapter 10, given a manifold such as $\mathbf{S}$ its associated cotangent bundle $T^*\mathbf{S}$ is a geometric object of particular interest because it comes automatically endowed with rich symplectic and metric structures.21 This observation leads us to identify the e-phase space $\{\rho, \Phi\}$ with the cotangent bundle $T^*\mathbf{S}$ and we adopt the preservation of those structures as the criterion for updating constraints. The discussion in chapter 10 can be borrowed essentially unchanged once our notation is adapted to account for the fact that the ontic variables we now deal with are continuous rather than discrete.

21 For previous work on the geometric and symplectic structure of quantum mechanics see [Kibble 1979; Heslot 1985; Anandan and Aharonov 1990; Cirelli et al. 1990; Abe 1992; Hughston 1995; Ashtekar and Schilling 1998; de Gosson, Hiley 2011; Elze 2012; Reginatto and Hall 2011, 2012; Caticha 2019, 2021b].
Notation: vectors, covectors, etc. A point $X \in T^*\mathbf{S}$ is represented as

$X = (\rho(x), \Phi(x)) = (\rho^x, \Phi_x)\,,$   (11.78)

where $\rho^x$ represents coordinates on the base manifold $\mathbf{S}$, and $\Phi_x$ represents some generic coordinates on the space $T_\rho^*\mathbf{S}$ that is cotangent to $\mathbf{S}$ at the point $\rho$.
Curves in $T^*\mathbf{S}$ allow us to define vectors. Let $X = X(\lambda)$ be a curve parametrized by $\lambda$, then the vector $V$ tangent to the curve at $X = (\rho, \Phi)$ has components $d\rho^x/d\lambda$ and $d\Phi_x/d\lambda$, and is written

$V = \frac{d}{d\lambda} = \int dx\Big[\frac{d\rho^x}{d\lambda}\frac{\delta}{\delta\rho^x} + \frac{d\Phi_x}{d\lambda}\frac{\delta}{\delta\Phi_x}\Big]\,,$   (11.79)

where $\delta/\delta\rho^x$ and $\delta/\delta\Phi_x$ are the basis vectors. The directional derivative of a functional $F[X]$ along the curve $X(\lambda)$ is

$\frac{dF}{d\lambda} = \int dx\Big[\frac{\delta F}{\delta\rho^x}\frac{d\rho^x}{d\lambda} + \frac{\delta F}{\delta\Phi_x}\frac{d\Phi_x}{d\lambda}\Big] \overset{\text{def}}{=} \tilde{\nabla}F[V]\,,$   (11.80)

where $\tilde{\nabla}$ is the functional gradient in $T^*\mathbf{S}$. The tilde on $\tilde{\nabla}$ serves to distinguish the functional gradient on $T^*\mathbf{S}$ from the spatial gradient $\nabla f = \partial_a f\,\nabla x^a$ on $\mathbf{X}_N$. The gradient of a generic functional $F[X] = F[\rho,\Phi]$ is

$\tilde{\nabla}F = \int dx\Big[\frac{\delta F}{\delta\rho^x}\tilde{\nabla}\rho^x + \frac{\delta F}{\delta\Phi_x}\tilde{\nabla}\Phi_x\Big]\,,$   (11.81)

and the action of the basis covectors $\tilde{\nabla}\rho^x$ and $\tilde{\nabla}\Phi_x$ on the vector $V$ is defined by

$\tilde{\nabla}\rho^x[V] = \frac{d\rho^x}{d\lambda} \quad\text{and}\quad \tilde{\nabla}\Phi_x[V] = \frac{d\Phi_x}{d\lambda}\,,$   (11.82)

that is,

$\tilde{\nabla}\rho^x\Big[\frac{\delta}{\delta\rho^{x'}}\Big] = \delta^x_{\,x'}\,,\quad \tilde{\nabla}\Phi_x\Big[\frac{\delta}{\delta\Phi_{x'}}\Big] = \delta_x^{\,x'}\,,\quad\text{and}\quad \tilde{\nabla}\rho^x\Big[\frac{\delta}{\delta\Phi_{x'}}\Big] = \tilde{\nabla}\Phi_x\Big[\frac{\delta}{\delta\rho^{x'}}\Big] = 0\,.$   (11.83)

The fact that the space $\mathbf{S}$ is constrained to normalized probabilities means that the coordinates $\rho^x$ are not independent. This technical difficulty is handled by embedding the $\infty$-dimensional manifold $\mathbf{S}$ in an ($\infty$+1)-dimensional manifold $\mathbf{S}^+$ where the coordinates $\rho^x$ are unconstrained. Thus, strictly, $\tilde{\nabla}F$ is a covector on $T^*\mathbf{S}^+$, that is, $\tilde{\nabla}F \in T^*(T^*\mathbf{S}^+)_X$, and $\tilde{\nabla}\rho^x$ and $\tilde{\nabla}\Phi_x$ are the corresponding basis covectors.
Instead of keeping separate track of the $\rho^x$ and $\Phi_x$ coordinates it is more convenient to combine them into a single index. A point $X = (\rho, \Phi)$ will then be labelled by its coordinates

$X = (X^{1x}, X^{2x}) = (\rho^x, \Phi_x)\,,$   (11.84)
where $\alpha x$ is a composite index: $\alpha = 1, 2$ keeps track of whether x is an upper index ($\alpha = 1$) or a lower index ($\alpha = 2$). Then eqs.(11.79)-(11.81) are written as

$V = V^{\alpha x}\frac{\delta}{\delta X^{\alpha x}}\,,\quad\text{where}\quad V^{\alpha x} = \frac{dX^{\alpha x}}{d\lambda} = \begin{pmatrix} d\rho^x/d\lambda \\ d\Phi_x/d\lambda \end{pmatrix},$   (11.85)

$\frac{dF}{d\lambda} = \tilde{\nabla}F[V] = \frac{\delta F}{\delta X^{\alpha x}}\,V^{\alpha x} \quad\text{and}\quad \tilde{\nabla}F = \frac{\delta F}{\delta X^{\alpha x}}\,\tilde{\nabla}X^{\alpha x}\,,$   (11.86)

where the repeated upper and lower indices indicate a summation over $\alpha$ and an integration over x. Once we have introduced the composite indices $\alpha x$ to label tensor components there is no further need to draw a distinction between $\rho^x$ and $\rho_x$ — these are coordinates and not the components of a vector. From now on we shall write $\rho(x) = \rho^x = \rho_x$, switching from one notation to another as convenience dictates. On the other hand, for quantities such as $V^{\alpha x}$ or $d\rho^x/d\lambda$ that are the components of vectors it is appropriate to keep x as an upper index.

11.7.1 The symplectic form in ED


In classical mechanics with configuration space $\{q^i\}$ the Lagrangian $L(q,\dot{q})$ is a function on the tangent bundle while the Hamiltonian $H(q,p)$ is a function on the cotangent bundle [Arnold 1997][Souriau 1997][Schutz 1980]. A symplectic form provides a mapping from the tangent to the cotangent bundles. When a Lagrangian is given the map is defined by $p_i = \partial L/\partial\dot{q}^i$ and this automatically defines the corresponding symplectic form. In ED there is no Lagrangian so in order to define the symplectic map we must look elsewhere. The fact that the preservation of a symplectic structure must reproduce the continuity equation (11.49) leads us to identify the phase $\Phi_x$ as the momentum canonically conjugate to $\rho^x$. This identification of the e-phase space $\{\rho, \Phi\}$ with $T^*\mathbf{S}$ is highly non-trivial. It amounts to asserting that there is a privileged symplectic form22

$\Omega = \int dx\Big[\tilde{\nabla}\rho^x\otimes\tilde{\nabla}\Phi_x - \tilde{\nabla}\Phi_x\otimes\tilde{\nabla}\rho^x\Big]\,.$   (11.88)

The action of $\Omega[\,\cdot\,,\,\cdot\,]$ on two vectors $V = d/d\lambda$ and $U = d/d\mu$ is given by

$\Omega[V,U] = \int dx\big[V^{1x}U^{2x} - V^{2x}U^{1x}\big] = \Omega_{\alpha x,\beta x'}\,V^{\alpha x}U^{\beta x'}\,,$   (11.89)

so that the components of $\Omega$ are

$\Omega_{\alpha x,\beta x'} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}\delta(x,x')\,,$   (11.90)

where $\delta(x,x') = \delta_{xx'}$ is the Dirac delta function.

22 Alternatively, the assumption is that the phase $\Phi_x$ transforms as the components of a locally-defined Poincaré 1-form

$\theta = \int dx\,\Phi_x\,\tilde{d}\rho^x\,,$   (11.87)

(where $\tilde{d}$ is the exterior derivative on $T^*\mathbf{S}^+$) and the corresponding symplectic 2-form is $\Omega = \tilde{d}\theta$. By construction $\Omega$ is locally exact ($\Omega = \tilde{d}\theta$) and closed ($\tilde{d}\Omega = 0$).
294 Entropic Dynamics: Time and Quantum Theory

11.7.2 Hamiltonian ‡ows


Next we reproduce the 1-dimensional T S + analogues of the …nite dimensional
Hamiltonian ‡ows studied in section 10.4. Given a vector …eld V [X] in e-phase
space we can integrate V x [X] = dX x =d to …nd its integral curves X x =
X x ( ). We are particularly interested in those vector …elds that generate ‡ows
that preserve the symplectic structure,
$V =0; (11.91)
where the Lie derivative is given by (see eq.(10.39))
x00 ~ ~ x00 ~ x00
($V ) x; x0 =V r x00 x; x0 + x00 ; x0 r x V + x; x00 r x0 V :
(11.92)
Since by eq.(11.90) the components x; x0
~
are constant, r x00 x; x0 = 0, we
rewrite $V as
~ x00 ~ x00
($V ) x; x0 =r x( x00 ; x0 V ) r x0 ( x00 ; x V ); (11.93)
x00
which is the exterior derivative (basically, the curl) of the covector x00 ; xV .
x0
By Poincare’s lemma, requiring $V = 0 (a vanishing curl) implies that x0 ; x V
is the gradient of a scalar function, which we will denote V~ [X],
x0 ~ ~ :
x0 ; x V =r xV (11.94)
Using (11.90) this is more explicitly written as
Z Z " #
d x~ d x~ V~ ~ V~ ~
dx r x r x = dx r x+ r x ; (11.95)
d d x x

or
d x V~ d x V~
= and = ; (11.96)
d x d x

which are Hamilton’s equations for a Hamiltonian function V~ . Thus V is the


Hamiltonian vector vector …eld associated to the Hamiltonian function V~ .
Remark: The Hamiltonian ‡ows that might potentially be of interest tend
to be those associated to Lie groups and, in particular, those that generate
symmetry transformations. Then, to each element of the Lie algebra one can
associate a corresponding Hamiltonian function. This map from the Lie algebra
to Hamiltonian functions is commonly called “the moment map”.
From (11.89), the action of the symplectic form on two Hamiltonian vector
…elds V = d=d and U = d=d generated respectively by V~ and U ~ is
Z
d xd x d xd x
[V ; U ] = dx ; (11.97)
d d d d
which, using (11.96), gives
Z " #
V~ ~
U V~ ~
U def
[V ; U ] = dx = fV~ ; U
~g ; (11.98)
x x x x
11.7 The epistemic phase space 295

where on the right we introduced the Poisson bracket notation.


These results are summarized by the continuum version of eqs.(10.67-10.70):
(1) The condition for the ‡ow generated by the vector …eld V x to preserve the
symplectic structure, $V = 0, is that the ‡ow be generated by a Hamiltonian
function V~ according to eq.(11.96) or equivalently,
x
dX
V x
= = fX x
; V~ g : (11.99)
d
(2) The action of on two Hamiltonian vector …elds (11.98) is the Poisson
bracket of the associated Hamiltonian functions,
x0
[V ; U ] = x; x0 V
x
U = fV~ ; U
~g : (11.100)

The ED that preserves the symplectic structure and reproduces the con-
~ ; ] in
tinuity equation (11.49) is generated by the Hamiltonian functional H[
(11.51),
H~ ~
H
@t x = ; @t x = ; (11.101)
x x

and the evolution of a generic functional f [ ; ] is given by the Poisson bracket,


~ :
@t f = ff; Hg (11.102)

The dynamics, however, is not yet fully determined because the integration
constant F [ ] in (11.51) remains to be speci…ed.

11.7.3 The normalization constraint


Since the particular ‡ow that we will associate with time evolution is required
to reproduce the continuity equation it will also preserve the normalization
constraint,
Z
N~ = 0 where N ~ = 1 j j and j j def = dx x : (11.103)

Indeed, one can check that


~ = fN
@t N ~ ; Hg
~ =0: (11.104)
~ and parametrized by
The Hamiltonian ‡ow (11.99) generated by N is given
by the vector …eld
x
x x dX x ~g :
N =N x
with N = = fX ;N (11.105)
X d
More explicitly, the components are
d x d x
N 1x = = 0 and N 2x = =1: (11.106)
d d
296 Entropic Dynamics: Time and Quantum Theory

From eq.(11.104) we see that if N ~ is conserved along H, then H~ is conserved


along N ,
dH~
~ N
= fH; ~g = 0 ; (11.107)
d
~ is the generator of a global gauge symmetry. Integrating (11.106) one
so that N
…nds the integral curves generated by N ~,

x( )= x (0) and x( )= x (0) + : (11.108)


This shows that the symmetry generated by N ~ is to shift the phase by a
constant independent of x without otherwise changing the dynamics.
As discussed in the last chapter this gauge symmetry is the consequence of
embedding the e-phase space T S of normalized probabilities into the larger
embedding space T S + of unnormalized probabilities. One consequence of in-
troducing a super‡uous coordinate is to also introduce a super‡uous momentum.
The extra coordinate is eliminated by imposing the constraint N ~ = 0. The extra
momentum is eliminated by declaring it unphysical, that is, shifting the phase
x by a constant, eq.(11.108), does not lead to a new state. The two states
( ; ) and ( ; + ) are declared to be equivalent; they lie on the same “ray”.

11.8 The information geometry of e-phase space


The construction of the Hamiltonian H ~ on e-phase space involves three steps.
The goal of dynamics is to determine the evolution of the state ( t ; t ). From a
given initial state ( 0 ; 0 ) two slightly di¤erent Hamiltonians will lead to slightly
di¤erent …nal states, say ( t ; t ) or ( t + t ; t + t ). Will these small changes
make any di¤erence? Can we quantify the extent to which we can distinguish
between two neighboring states? This is precisely the kind of question that
metrics are designed to address. One should then expect that a suitable choice
of metric will help us constrain the form of H. ~ In this section we take the …rst
step and transform the e-phase space T S + from a manifold that is merely
symplectic to a manifold that is both symplectic and Riemannian.
As discussed in chapter 10 the e-phase space is endowed with a natural
metric inherited from information geometry. It is also endowed with a natural
symplectic structure inherited from the entropic dynamics as expressed in the
continuity equation, (11.49). We shall impose the preservation of these two
structures — a Hamilton-Killing or HK ‡ow — as the natural criterion for
updating the constraints. In section 11.9 we shall implement this second step
and identify the general form of the Hamiltonian functions H ~ that generate HK
‡ows.
In ED entropic time is constructed so that time (duration) is de…ned by a
clock provided by the system itself. In section 11.10 we take the …nal third
step in the construction of H ~ and require that the generator of time evolution
be de…ned in terms of the very same clock that provides the measure of time.
There we impose that the Hamiltonian H ~ agree with (11.51) so as to reproduce
the ED evolution of t .
11.8 The information geometry of e-phase space 297

11.8.1 The embedding space T S +


In section (10.6.2) we assigned a metric to the statistical manifold of discrete
unnormalized probabilities. The generalization of the metric from …nite dimen-
sions, eqs.(10.101) and (10.102), to the in…nite-dimensional case is straightfor-
ward.23 The length element for S + is24

x0 ~
`2 = gxx0 x
with gxx0 = A(j j) nx nx0 + xx0 ; (11.109)
2 x
where n is a special covector which, in coordinates, has components nx = 1,
so that Z Z
2
~
`2 = A(j j) dx x + dx ( x )2 ; (11.110)
2 x
and the freedom in the function A(j j) re‡ects the ‡exibility in the choice
of spherically symmetric embedding. The corresponding inverse tensor, see
eq.(10.103), is
0 2 x 2A
g xx = xx0 +C x x0 where C(j j) = : (11.111)
~ ~Aj j + ~2 =2
The metric structure for T S + is obtained following the same argument that
led to eqs.(10.78) and (10.104). The simplest geometry that is invariant under
‡ow reversal and is determined by the information geometry of S + , which is
0
fully described by the tensor gxx0 and its inverse g xx , is given by the length
element
x0 x0 0
`~2 = G x; x0 X x
X = gxx0 x
+ g xx x x0 : (11.112)

More explicitly, `~2 is


Z 2 Z 2 Z
~2 x ~ x 2 2 x 2
` =A dx +C dx x x + dx ( ) + x ;
2 x ~
(11.113)
or
Z
~ 2 x
`~2 = Aj j2 + Cj j2 h i2 + dx ( x 2
) + 2
x : (11.114)
2 x ~
1
In 2 2 matrix form the tensor G and its inverse G can be written as
0
gxx0 0 0 g xx 0
[Gxx0 ] = xx0 ; [Gxx ] = : (11.115)
0 g 0 gxx0
2 3 The mathematics of in…nite dimensional spaces is tricky. The term ‘straightforward’

should be quali…ed with some …ne print to the e¤ect that “we adopt the standard of mathe-
matical rigor typical of theoretical physics.” Ultimately the argument is justi…ed by the fact
that it leads to useful models that are empirically successful. For relevant references see [Cirelli
et al 1990][Pistone Sempi 1995] and also [Jaynes 2003, appendix B].
2 4 The quantities x are the components of a vector so in (11.109) it makes sense to keep

x as an upper index and adopt Einstein’s summation convention.


298 Entropic Dynamics: Time and Quantum Theory

1
Using G to raise the …rst index of the symplectic form x; x0 ,

0 1
[ xx0 ] = xx0 ; (11.116)
1 0

as in eq.(10.79),
x; x00 x
G x00 ; x0 = J x0 ; (11.117)
we …nd
0
0 g xx
[J x x0 ] = : (11.118)
gxx0 0
And just as in the discrete case the square of the J tensor is minus the identity,
x x00 x
J x00 J x0 = x0 = xx0 or JJ = 1; (11.119)

which means that J endows T S + with a complex structure.

11.8.2 The metric induced on the e-phase space T S


The e-phase space T S is obtained from the embedding space T S + by restrict-
ing to normalized probabilities, j j = 1, and by identifying the gauge equivalent
points ( x ; x ) and ( x ; x + ). Consider two rays de…ned by the neighboring
points ( x ; x ) and ( 0x ; 0x ) with j j = j 0 j = 1. The distance induced on T S,
that is, the distance between the two neighboring rays, is de…ned as the short-
est T S + distance between ( x ; x ) and points on the ray de…ned by ( 0x ; 0x ).
Since the T S + distance between ( x ; x ) and ( x + x ; x + x + ) is

x0 0
`~2 ( ) = gxx0 x
+ g xx ( x + )( x0 + ); (11.120)

the metric on T S is de…ned by


def
s~2 = min `~2 : (11.121)

The value of that minimizes (11.120) is


Z
m in = h i= dx x x : (11.122)

Then the metric that measures the distance between neighboring rays on T S
is obtained by substituting m in back into (11.120), and setting j j = 1 and
j j = 0. The result is
Z
~ 2 x
s~2 = dx ( x 2
) + ( x h i)2 : (11.123)
2 x ~

As we saw in section 10.6.3 this is the Fubini-Study metric.


11.8 The information geometry of e-phase space 299

11.8.3 A simpler embedding


To avoid the inconvenience of dealing with normalized probabilities we return
to the embedding space T S + . We take advantage of the fact that the T S
metric (11.123) is independent of the particular choice of the function A(j j)
and choose A(j j) = 0 so that the embedding spaces S + and T S + are assigned
the simplest possible geometries, namely, they are ‡at. With this choice the
T S + metric, eq.(11.112), becomes
Z
~2 ~ 2 x 2 0
` = dx ( x )2 + x = G x; x0 X
x
X x (11.124)
2 x ~
1
where G and its inverse G are
~
2 xx0 0
[Gxx0 ] = x
2 ; (11.125)
0 ~
x
xx0

and
2
0
x
xx0 0
[Gxx ] = ~
~ : (11.126)
0 2 x
xx0

The tensor J, eq.(11.118) that de…nes the complex structure becomes


2
x x; x00 0 x
xx0
J x0 = G x00 ; x0 or [J x x0 ] = ~
~ : (11.127)
2 x
xx0 0

11.8.4 Re…ning the choice of cotangent space


As we saw in section 10.6.4 in the …nite-dimensional case the cotangent spaces
that are relevant to quantum mechanics are ‡at n-dimensional “hypercubes”
that are only locally isomorphic to the old Rn . Here we make the analogous
move for the 1-dimensional case and we choose cotangent spaces that obey
periodic boundary conditions: the coordinates ( x ; x ) and ( x ; x + 2 ~) label
the same point in T S + .
Having thus re…ned our choice of cotangent spaces, the fact that T S + is
endowed with a complex structure suggests introducing complex coordinates,
1=2 i x =~ 1=2 i x =~
x = x e and i~ x = i~ x e ; (11.128)
so that a point 2 T S + has coordinates
1x
x x
= 2x
= ; (11.129)
i~ x

where the index takes two values, = 1; 2.


We can check that the transformation from real coordinates ( ; ) to complex
coordinates ( ; i~ ) is indeed canonical. From (11.128) we have

x = x x + x x ;
~
x = ( x x x x) : (11.130)
2i x
300 Entropic Dynamics: Time and Quantum Theory

The action of on any two vectors V = d=d and U = d=d ,


Z
x x0 d xd x d xd x
[U ; V ] = x; x0 U V = dx ;
d d d d
transforms into
Z
d di~ di~ d x x0
[U ; V ] = dx = x; x0 (11.131)
d d d d
so that in coordinates the symplectic form,

0 1
[ xx0 ] = xx0 ; (11.132)
1 0

retains the same form as (11.90).


Expressed in coordinates the Hamiltonian ‡ow generated by the normal-
ization constraint (11.103),
Z
N~ = 0 with N
~ =1 dx x x ; (11.133)

and parametrized by is given by the vector …eld


x
x x d x ~g ;
N =N x
with N = =f ;N (11.134)
d
or
i
~ x
N= : (11.135)
x
Its integral curves are given by
d x i i =~
= x so that x( )= x (0)e : (11.136)
d ~
~ is conserved along H, then H
We had seen that if N ~ is conserved along N ,

dH~
~ N
= fH; ~g = 0 : (11.137)
d
Therefore N ~ is the generator of a global “gauge”symmetry and the Hamiltonian
~
H is invariant under the transformation x (0) ! x ( ). The interpretation is
that as we embed the e-phase space T S into the larger space T S + we introduce
two additional degrees of freedom. We eliminate one by imposing the constraint
N~ = 0; we eliminate the other by declaring that two states x (0) and x ( )
that lie on the same ray (or gauge orbit) are equivalent in the sense that they
represent the same epistemic state.
In coordinates the metric on T S + , eqs.(11.124) and (11.125), becomes
Z Z
x0
` = 2i dx x i~ x = dxdx0 G x; x0
2 x
; (11.138)
11.9 Hamilton-Killing ‡ows 301

x; x0
where in matrix form the metric tensor G x; x0 and its inverse G are

0 1 0 0 1
[Gxx0 ] = i xx0 and [Gxx ] = i xx0 : (11.139)
1 0 1 0

x; x0
Finally, using G to raise the …rst index of x0 ; x00 gives the components
of the tensor J

x def x; x0 i 0
J x00 = G x0 ; x00 or [J x x00 ] = xx00 : (11.140)
0 i

11.9 Hamilton-Killing ‡ows


Our next goal will be to …nd those Hamiltonian ‡ows H that also happen to
preserve the metric tensor, that is, we want H to be a Killing vector. The
condition for H is $H G = 0, or
x00 x00 x00
($H G) x; x0 =H @ x00 G x; x0 +G x00 ; x0 @ x H
=0: +G x; x00 @ x0 H
(11.141)
Remark: We note that while we adopt the H notation that is usually associated
with evolution in the time parameter t, the vector …eld H might refer to any
~ , the symplectic form , and the metric
‡ow that preserves the normalization N
G.
In complex coordinates eq.(11.139) gives @ x00 G x; x0 = 0, and the Killing
equation simpli…es to
x00 x00
($H G) x; x0 =G x00 ; x0 @ x H +G x; x00 @ x0 H =0; (11.142)

or 2 0 0 3
H 2x H 2x H 1x H 2x
+ ; + i~ x0
[($Q G)xx0 ] = i4 x
0
x0 x
0 5=0: (11.143)
H 2x H 1x H 1x H 1x
i~ x + ; i~ x + i~ x0
x0

x
If we further require that H be a Hamiltonian ‡ow, $H = 0, then we
substitute
H~ ~
H
H 1x = and H 2x = (11.144)
i~ x x

into (10.130) to get


2 ~
H 2 ~
H
= 0 and =0: (11.145)
x x0 x x0

Therefore in order to generate a ‡ow that preserves N~ , G, and , the functional


~
H[ ; ] must be gauge invariant and linear in both and . Therefore,
Z
~
H[ ; ] = dxdx0 x H ^ xx0 x0 ; (11.146)
302 Entropic Dynamics: Time and Quantum Theory

^ xx0 is a possibly non-local kernel. The actual Hamilton-Killing ‡ow is


where H

~ Z
d x H 1 ^ xx0 x0 ;
= H 1x = = dx0 H (11.147)
dt i~ x i~
~ Z
di~ x 2x H ^ xx0 :
=H = = dx0 x0 H (11.148)
dt x

Taking the complex conjugate of (11.147) and comparing with (11.148), shows
^ xx0 is Hermitian,
that the kernel H
^ 0 = Hx0 x ;
H (11.149)
xx

~ are real,
and we can check that the corresponding Hamiltonian functionals H
~ ;
H[ ~ ;
] = H[ ]:

An example: translations — The Hamiltonian ‡ows of interest are those


associated to Lie groups and, in particular, those that generate symmetry trans-
formations. For example, the generator of translations is total momentum. Un-
der a spatial displacement by "a , a function f (x) transforms as

@f
f (x) ! f" (x) = f (x ") or " f (x) = f" (x) f (x) = "a : (11.150)
@xa
The change of a functional F [ ; ] is
Z
F F
" F [ ; ] = dx " x+ " x = fF; P~a "a g (11.151)
x x

where Z Z
X @ @ x
P~a = dx x = dx x (11.152)
n @xa
n
a
@Xcm
a
is interpreted as the expectation of the total linear momentum, and Xcm are
the coordinates of the center of mass,

a 1 P P
Xcm = mn xan where M= mn : (11.153)
M n n

Transforming to complex coordinates we can check that


Z Z
~ P~ @ ~ @
Pa = dx a
= dx a
; (11.154)
n i @xn i @Xcm

and the corresponding kernel P^axx0 is


P~ @ ~ @
P^axx0 = xx0 a
= xx0 a
: (11.155)
n i @xn i @Xcm
11.10 The e-Hamiltonian 303

11.10 The e-Hamiltonian


In previous sections we supplied the e-phase space T S + with a symplectic
structure, a Riemannian metric and, as a welcome by-product, also with a com-
plex structure. Then we showed that the condition for the simplest form of
dynamics — one that preserves normalization and the metric, symplectic, and
complex structures — is a Hamilton-Killing ‡ow generated by a Hamiltonian H ~
that is linear in both and ,
Z
~
H[ ; ] = dxdx0 x H ^ xx0 x0 : (11.156)

The last ingredient in the construction of H ~ is that the e-Hamiltonian that


generates evolution in entropic time must be de…ned in terms of the same clock
~ has to agree with
that provides the measure of entropic time. In other words, H
(11.51) in order to reproduce the entropic dynamics of given by the continuity
eq.(11.49).
To proceed we use the identity
1 ~2 AB ~2 AB
mAB (@A AA )(@B AB ) = m (DA ) DB m @A @B
2 2 8 2
(11.157)
where
i
DA = @A AA and AA (x) = n Aa (xn ) : (11.158)
~
The proof is straightforward: just substitute = 1=2 e =~ into the right hand
~ ; ] in (11.51) in terms of and
side of (11.157). Rewriting H[ we get
Z
~ ; ] = dx ~2 AB
H[ m DA DB + F 0[ ] : (11.159)
2
where Z
~2 AB
F 0[ ] = F [ ] dx m @A @B : (11.160)
8 2
According to (11.156), in order for H[~ ; ] to generate an HK ‡ow we must
impose that F 0 [ ] be linear in both and ,
Z
F [ ] = dxdx0 x V^xx0 x0
0
(11.161)

for some Hermitian kernel V^xx0 , but F 0 [ ] must remain independent of ,


F 0[ ]
=0: (11.162)
x

Substituting = 1=2 i =~
e into (11.161) and using V^x0 x = V^xx0 leads to
Z
F0 2 1=2
= 1=2
x dx0 x0 Im V^xx0 e i( x x0 )=~ =0: (11.163)
x ~
304 Entropic Dynamics: Time and Quantum Theory

This equation must be satis…ed for all choices of x0 . Therefore, it follows that

Im V^xx0 e i( x x0 )=~ =0: (11.164)

Furthermore, this last equation must in turn hold for all choices of x and x0 .
Therefore, the kernel V^xx0 must be local in x,

V^xx0 = xx0 Vx ; (11.165)

where Vx = V (x) is some real function.


We conclude that the Hamiltonian that generates a Hamilton-Killing ‡ow
and agrees with the ED continuity equation must be of the form
Z
~ ; ] = dx ~2 AB
H[ m DA DB + V (x) : (11.166)
2
The evolution of is given by the Hamilton equation,

H~
@t x =f x ; Hg
~ = ; (11.167)
i~ x
which is the Schrödinger equation,
~2 AB
i~@t = m DA DB + V : (11.168)
2
In more standard notation it reads
X ~2 ab @ i @ i
i~@t = a n Aa (xn ) n Ab (xn ) +V :
n 2mn @xn ~ @xbn ~
(11.169)
At this point we can …nally provide the physical interpretation of the various
constants introduced along the way. Since the Schrödinger equation (11.169) is
the tool we use to analyze experimental data we can identify ~ with Planck’s
constant, mn will be interpreted as the particles’masses, and the n are related
to the particles’electric charges qn (in Gaussian units) by
qn
n = : (11.170)
c
For completeness we write the Hamiltonian in the ( ; ) variables,
Z X ab
~ @ qn @ qn
H[ ; ] = d3N x Aa (xn ) Ab (xn )
n 2mn @xan c @xbn c
X ~2 ab @ @
+ + V (x1 : : : xn ) : (11.171)
n 8mn 2 @xa b
n @xn

The Hamilton equations for and are the continuity equation (11.49),
~
H X @ ab
@ qn
@t = = Ab (xn ) ; (11.172)
n @xan mn @xbn c
11.11 Entropic time, physical time, and time reversal 305

and the quantum analogue of the Hamilton-Jacobi equation,

~
H X ab
@ qn @ qn
@t = = Aa (xn ) Ab (xn )
n 2mn @xan c @xbn c
X ~2 ab
@ 2 1=2
+ 1=2 @xa @xb
V (x1 : : : xn ) : (11.173)
n 2mn n n

To summarize: we have just shown that an ED that preserves normalization,


the symplectic and the metric structures of the e-phase space T S + leads to a
linear Schrödinger equation. In particular, such an ED reproduces the Bohmian
quantum potential in (11.173) with the correct coe¢ cients ~2 =2mn .

The action — Now that we have Hamilton’s equations (11.101) one can
invert the usual procedure and construct an action principle from which they
can be derived. De…ne the di¤erential
Z Z " ! ! #
~
H H~
A = dt dx @t x x @t x + x (11.174)
x x

and then integrate to get the action,


Z Z
A[ ; ] = dt dx x @t x
~ ; ]
H[ : (11.175)

By construction, imposing A = 0 leads to eq.(11.101). Introducing the action


A[ ; ] is a maneuver that is useful as a convenient summary of the dynami-
cal equations and in the formal derivation of important consequences such as
Noether’s theorem. From the ED perspective, however, it does not appear to
be particularly fundamental.

11.11 Entropic time, physical time, and time re-


versal
Now that the dynamics has been fully developed we revisit the question of
time. The derivation of laws of physics as examples of inference led us to
introduce the notion of an entropic time which includes assumptions about the
concept of instant, of simultaneity, of ordering, and of duration. It is clear that
entropic time is useful but is this the actual, “physical” time? The answer is
an unquali…ed yes. By deriving the Schrödinger equation (from which we can
obtain the classical limit) we have shown that the t that appears in the laws
of physics is entropic time. Since these are the equations that we routinely use
to design and calibrate our clocks we conclude that what clocks “measure” is
entropic time. No notion of time that is in any way deeper or more “physical”
is needed.
306 Entropic Dynamics: Time and Quantum Theory

This may at …rst be surprising. What ED has led us to is an epistemic


notion of time and one might ask how do epistemic quantities, such as probabil-
ities, temperatures, or energies, ever get to be “measured”. As we shall discuss
in some detail in chapter 13, the answer is that epistemic quantities are not
measured; they are inferred from those other ontic quantities, such as positions,
that are actually directly accessible to observations.
Most interestingly, the entropic model automatically includes an arrow of
time. The statement that the laws of physics are invariant under time reversal
has nothing to do with particles travelling backwards in time. It is instead the
assertion that the laws of physics exhibit a certain symmetry. For a classical
system described by coordinates q and momenta p the symmetry is the statement
that if fqt ; pt g happens to be one solution of Hamilton’s equations then we can
construct another solution fqtT ; pTt g where

qtT = q t and pTt = p t ; (11.176)

but both solutions fqt ; pt g and fqtT ; pTt g describe evolution forward in time.
An alternative statement of time reversibility is the following: if there is
one trajectory of the system that takes it from state fq0 ; p0 g at time t0 to state
fq1 ; p1 g at the later time t1 , then there is another possible trajectory that takes
the system from state fq1 ; p1 g at time t0 to state fq0 ; p0 g at the later time
t1 . The merit of this re-statement is that it makes clear that nothing needs to
travel back in time. Indeed, rather than time reversal the symmetry might be
more appropriately described as momentum or motion or ‡ow reversal.
Since ED is a Hamiltonian dynamics one can expect that similar consid-
erations will apply to QM and indeed they do. It is straightforward to check
that given one solution f t (x); t (x)g that evolves forward in time, we can con-
struct another solution f Tt (x); Tt (x)g that is also evolving forward in time.
The reversed solution is
T T
t (x) = t (x) and t (x) = t (x) : (11.177)

These transformations constitute a symmetry — i.e., the transformed tT (x) is


a solution of the Schrödinger equation — provided the motion of the sources
of the external potentials is also reversed, that is, the potentials Aa (~x; t) and
V (x; t) are transformed according to

ATa (~x; t) = Aa (~x; t) and V T (x; t) = V (x; t) : (11.178)

Expressed in terms of wave functions the time reversal transformation is


T
t (x) = t (x) : (11.179)

The proof that this is a symmetry is straightforward; just take the complex
conjugate of (11.169), and let t ! t.
Remark: In section 10.5.1 we saw that the metric structure of the e-phase
space was designed so that potential time-reversal violations would be induced
at the dynamical level of the Hamiltonian and not at the kinematical level of
11.12 Hilbert space 307

the geometry of e-phase space. The Hamiltonian (11.166) o¤ers us an explicit


example: it leads to a dynamics that can either preserve or violate the time-
reversal symmetry according to whether the potentials A and V obey the extra
condition (11.178) or not.

11.12 Hilbert space


The formulation of the ED of spinless particles is now complete. We note, in
particular, that the notion of Hilbert spaces turned out to be unnecessary to the
formulation of quantum mechanics. However, as we argued in the last chapter,
the introduction of Hilbert spaces is nevertheless a very useful tool designed for
the speci…c purpose of exploiting the full calculational advantages of linearity.
The rest of this section is a straightforward adaptation of the discussion in
section 10.8 from the …nite-dimensional to the in…nite-dimensional case.
We shall assume that the e-phase space T S + of unnormalized probabilities
is ‡at. Then the points 2 T S + are are also vectors and we can deploy the
already available metric and symplectic tensors to construct an inner product.

The inner product — We introduce the Dirac notation to represent the


wave functions x as vectors j i in a Hilbert space. The …nite-dimensional
inner product given by eq.(10.154) is written in in…nite dimensions as
Z
def 1 2x0
h 1j 2i = dx dx0 ( 1x ; i~ 1x ) [Gxx0 + i xx0 ] : (11.180)
2~ i~ 2x0

A straightforward calculation using (11.132) and (11.139) gives


Z
h 1 j 2 i = dx 1 2 : (11.181)

The map x $ j i is de…ned by


Z
j i = dx jxi x where x = hxj i ; (11.182)

where, in this “position” representation, the vectors fjxig form a basis that is
orthogonal and complete,
Z
hxjx0 i = xx0 and dx jxihxj = ^1 : (11.183)

The complex structure — The tensors and G were originally meant to


act on tangent vectors but now they can also act on all points 2 T S + . For
example,
The action of the mixed tensor J = G 1 , eq.(11.140), on a point is

i 0 x i x
[(J )x ] = = ; (11.184)
0 i i~ x i(i~ x)
308 Entropic Dynamics: Time and Quantum Theory

which shows that J plays the role of multiplication by i, that is, when acting
on a point the action of J is represented by an operator J,^

J J
^ i = ij i :
!i is j i ! Jj (11.185)

Hermitian and unitary operators — The bilinear Hamilton functionals


~ ; ] with kernel Q
Q[ ^ xx0 in eq.(11.146) can now be written in terms of a Her-
^ and its matrix elements,
mitian operator Q

~ ;
Q[ ^ i and
] = h jQj ^ xx0 = hxjQjx
Q ^ 0i : (11.186)

The corresponding Hamilton-Killing ‡ows parametrized by are given by

d ^ i or i~ d j i = Qj
^ i:
i~ hxj i = hxjQj (11.187)
d d
These ‡ows are described by unitary transformations

^Q ( )j (0)i where ^Q ( ) = exp i ^


j ( )i = U U Q : (11.188)
~

Commutators — ~[ ;
The Poisson bracket of two Hamiltonian functionals U ]
and V~ [ ; ],
Z !
~
U V~ ~
U V~
~ ; V~ g =
fU dx ;
x i~ x i~ x x

can be written in terms of the commutator of the associated operators,then

~ ; V~ g = 1 ^ ; V^ ]j i :
fU h j[U (11.189)
i~
Thus the Poisson bracket is the expectation of the commutator.

11.13 Summary
We conclude with a summary of the main ideas.

Ontological clarity: Particles have de…nite but unknown positions


and follow continuous trajectories.

ED is a dynamics of probabilities. The probability of a short step


is found by maximizing entropy subject to a constraint expressed
in terms of the phase …eld that introduces directionality and cor-
relations, and when appropriate gauge constraints that account for
external electromagnetic …elds.
11.13 Summary 309

Entropic time: An epistemic dynamics of probabilities inevitably


leads to an epistemic notion of time. The construction of time in-
volves the introduction of the concept of an instant, that the instants
are suitably ordered, and a convenient de…nition of duration. By its
very construction there is a natural arrow of entropic time.
The evolution of probabilities is given by a continuity equation that
is local in con…guration space but leads to non-local correlations in
physical space. Rewriting the continuity equation in Hamiltonian
form allows us to identify the pair ( ; ) as canonically conjugate
variables.
The epistemic phase space f ; g has a natural symplectic geometry.
The e-phase space is assigned a metric structure based on the in-
formation metric of the statistical manifold of probabilities plus the
requirement of symmetry under ‡ow reversal. The joint presence of
symplectic and metric structures implies the existence of a complex
structure.
The introduction of wave functions as complex coordinates sug-
gests that the cotangent spaces T S + that are relevant to quantum
mechanics are “hypercubes” with opposite faces identi…ed.
The dynamics that preserves the symplectic and metric structures as
well as the normalization of probabilities is shown to be described by a
linear Schrödinger equation. The particular form of the Hamiltonian
is determined by requiring that it reproduce the ED evolution of
probabilities in entropic time.
Since clocks are calibrated according to the Schrödinger equation, it
follows that what clocks measure is entropic time. Therefore, entropic
time is “physical” time.
The ontic motion of the particles at the sub-quantum level remains
unspeci…ed. ED allows us to infer di¤erent sub-quantum motions
(i.e., Brownian vs. Bohmian trajectories) and they all lead to the
same emergent quantum mechanics but there is no underlying ontic
dynamics.

ED achieves ontological clarity by a sharp distinction between the ontic and


the epistemic — positions of particles on one side and probabilities and phases
on the other. ED is an epistemic dynamics of probabilities and not an ontic
dynamics of particles. Of course, the probabilities will allow us to infer how the
particles move — but nothing in ED describes what it is that has pushed the
particles around. ED is a mechanics without a mechanism. Christiaan Huygens
would have been disappointed: the theory works but it does not explain. But
then, the question of ‘what counts as an explanation?’ has always been a tricky
one.
310 Entropic Dynamics: Time and Quantum Theory

We can elaborate on this point from a di¤erent direction. The empirical


success of ED suggests that its epistemic probabilities are in agreement with
ontic features of the physical world. It is highly desirable to clarify the precise
nature of this agreement. Consider, for example, a fair die. Its property of
being a perfect cube is an ontic property of the die that is re‡ected at the
epistemic level in the assignment of equal probabilities to each face of the die.
In this example we see that the epistemic probabilities achieve objectivity, and
therefore usefulness, by corresponding to something ontic. The situation in ED
is similar except for one crucial aspect. The ED probabilities are objective and
they are empirically successful. They must therefore re‡ect something real.
However, ED is silent about what those underlying ontic symmetries might
possibly be.
To the extent that ED is a dynamics of probabilities it follows that quantum
mechanics is incomplete in the sense of Einstein — QM does not provide us
with a completely detailed and therefore deterministic description of the system
and its dynamics. On the other hand, the fact that epistemic tools such as
information geometry have turned out to be so central to the derivation of the
mathematical formalism of QM lends strong support to the conjecture that QM
is not only incomplete, but that it is also incompletable.
The trick of embedding the e-phase space T S in the larger space T S +
that is chosen to be a ‡at vector space is clever but optional. It allows one
to make use of the calculational advantages of linearity; it allows, for example,
to rewrite QM in the form of Feynman’s path integrals. This recognition that
Hilbert spaces are not fundamental is one of the signi…cant contributions of the
entropic approach to our understanding of QM — another being the derivation
of linearity. The distinction between Hilbert spaces being necessary in principle
as opposed to merely convenient in practice is not of purely academic interest.
It could be important in the search for a quantum theory that includes gravity:
Shall we follow the usual approaches to quantization that proceed by replacing
classical dynamical variables by an algebra of linear operators acting on some
abstract space? Or, in the spirit of an entropic dynamics, shall we search for
an appropriately constrained dynamics of probabilities and information geome-
tries? [Ipek Caticha 2020]
Chapter 12

Topics in Quantum Theory*

In the Entropic Dynamics (ED) framework quantum theory is derived as an ap-


plication of entropic inference. In this chapter the immediate goal is to demon-
strate that the entropic approach can prove its worth through the clari…cation
and removal of conceptual di¢ culties.
We will tackle three topics that are central to quantum theory: in section
12.1 we discuss the connections between linearity, the superposition principle,
the single-valuedness of wavefunctions, and the quantization of charge. Second,
we consider the introduction and interpretation of quantities other than position,
including momentum, and the corresponding uncertainty relations. Third, we
discuss the classical limit.1

12.1 Linearity and the superposition principle


The Schrödinger equation is linear, that is, a linear combination of solutions
is a solution too. However, this mathematical linearity does not guarantee the
physical linearity that is usually referred to as the superposition principle. The
latter is the physical assumption that if there is one experimental setup that
prepares a system in the (epistemic) state 1 and there is another setup that
prepares the system in the state 2 then, at least in principle, it is possible to
construct yet a third setup that can prepare the system in the superposition

3 = 1 1 + 2 2 ; (12.1)
where 1 and 2 are arbitrary complex numbers. Mathematical linearity refers
to the fact that solutions can be expressed as sums of solutions and there is no
implication that any of these solutions will necessarily describe physical situa-
tions.2 Physical linearity on the other hand — the superposition principle —
1 The presentation follows closely the work presented in [Caticha 2019]

[Johnson Caticha 2011; Nawaz Caticha 2011]. More details can be found in [Johnson 2011;
Nawaz 2012].
2 The di¤usion equation provides an illustration. Fourier series were originally invented

to describe the di¤usion of heat: a physical distribution of temperature, which can only
312 Topics in Quantum Theory*

refers to the fact that the superposition of physical solutions is also a physical
solution. The point to be emphasized is that the superposition principle is a
physical hypothesis of wide applicability that need not, however, be universally
true.

12.1.1 The single-valuedness of


The question “Why should wave functions be single-valued?” has been around
for a long time. In this section we shall argue that the single- or multi-valuedness
of the wave functions is closely related to the question of linearity and the
superposition principle.3
First we show that the mathematical linearity of (11.169) is not su¢ cient to
imply the superposition principle. The idea is that even when j 1 j2 = 1 and
j 2 j2 = 2 are probabilities it is not generally true that j 3 j2 from eq.(12.1) will
also be a probability. Consider moving around a closed loop in con…guration
space. Since the phase (x) can be multi-valued the corresponding wave function
could in principle be multi-valued too. Suppose a generic changes by a phase
factor,
! 0 = ei ; (12.2)
then the superposition 3 of two wave functions 1 and 2 changes into
0 i i
3 ! 3 = 1e
1
1 + 2e
2
2 : (12.3)

The problem is that even if j 1 j2 = 1 and j 2 j2 = 2 are single-valued (because


they are probability densities), the quantity j 3 j2 need not in general be single-
valued. Indeed,
2 2 2
j 3j =j 1j 1 +j 2j 2 + 2 Re[ 1 2 1 2] ; (12.4)

changes into
0 2 2 2 i( 2)
j 3j =j 1j 1 +j 2j 2 + 2 Re[ 1 2e
1
1 2] ; (12.5)
take positive values, is expressed as a sum of sines and cosines which cannot individually
represent physical distributions. Despite the unphysical nature of the individual sine and
cosine components the Fourier expansion is nevertheless very useful.
3 Our discussion parallels [Schrödinger 1938]. Schrödinger invoked time reversal invariance

which was a very legitimate move back in 1938 but today it is preferable to develop an
argument which does not invoke symmetries that are already known to be violated. The answer
proposed in [Pauli 1939] is worthy of note. (See also [Merzbacher 1962].) Pauli proposed that
admissible wave functions must form a basis for representations of the transformation group
that happens to be pertinent to the problem at hand. In particular, Pauli’s argument serves to
discard double-valued wave functions for describing the orbital angular momentum of scalar
particles. The question of single-valuedness was later revived in [Takabayashi 1952, 1983] in
the context of the hydrodynamical interpretation of QM, and later rephrased by in [Wallstrom
1989, 1994] as an objection to Nelson’s stochastic mechanics: are these theories equivalent to
QM or do they merely reproduce a subset of its solutions? Wallstrom’s objection to Nelson’s
stochastic mechanics is that it leads to phases and wave functions that are either both multi-
valued or both single-valued. Both alternatives are unsatisfactory because on one hand QM
requires single-valued wave functions, while on the other hand single-valued phases exclude
states that are physically relevant (e.g., states with non-zero angular momentum).
12.1 Linearity and the superposition principle 313

so that in general
0 2 2
j 3j 6= j 3j ; (12.6)
2
which precludes the interpretation of j 3 j as a probability. That is, even when
the epistemic states 1 and 2 describe actual physical situations, their super-
positions need not.
The problem does not arise when

ei( 1 2)
=1: (12.7)

If we were to group the wave functions into classes each characterized by its own
then we could have a limited version of the superposition principle that ap-
plies within each class. We conclude that beyond the linearity of the Schrödinger
equation we have a superselection rule that restricts the validity of the super-
position principle to wave functions that belong to the same -class.
To …nd the allowed values of we argue as follows. It is natural to assume
that if f ; g (at some given time t0 ) is a physical state then the state with
reversed momenta f ; g (at the same time t0 ) is an equally reasonable physical
state. Basically, the idea is that if particles can be prepared to move in one
direction, then they can also be prepared to move in the opposite direction. In
terms of wave functions the statement is that if t0 is a physically allowed initial
state, then so is t0 .4 Next we consider a generic superposition

3 = 1 + 2 : (12.8)

Is it physically possible to construct superpositions such as (12.8)? The answer is


that while constructing 3 for an arbitrary might be di¢ cult in practice there
is strong empirical evidence that there exist no superselection rules to prevent
us from doing so in principle. Indeed, it is easy to construct superpositions
of wavepackets with momentum p~ and p~, or superpositions of states with
opposite angular momenta, Y`m and Y`; m . We shall assume that in principle
the superpositions (12.8) are physically possible.
According to eq.(12.2) as one moves in a closed loop the wave function 3
will transform into
0 i
3 = 1e + 2e i ; (12.9)
2
and the condition (12.7) for j 3j to be single-valued is

e2i = 1 or ei = 1: (12.10)

Thus, we are restricted to two discrete possibilities 1. Since the wave func-
tions are assumed su¢ ciently well behaved (continuous, di¤erentiable, etc.) we
conclude that they must be either single-valued, ei = 1, or double-valued,
ei = 1.
Thus, the superposition principle appears to be valid in a su¢ ciently large
number of cases to be a useful rule of thumb but it is restricted to either single-
valued or double-valued wave functions. The argument above does not exclude
4 We make no symmetry assumptions such as parity or time reversibility. It need not be

the case that there is any symmetry that relates the time evolution of t0 to that of t0 .
314 Topics in Quantum Theory*

the possibility that a multi-valued wave function might describe an actual phys-
ical situation. What the argument implies is that the superposition principle
would not extend to such states.

12.1.2 Charge quantization


Next we analyze the conditions for the electromagnetic gauge symmetry to be
compatible with the superposition principle. We shall con…ne our attention to
systems that are described by single-valued wave functions (ei = +1).5 The
condition for the wave function to be single-valued is
I
= d`A @A = 2 k ; (12.11)
~ ~
where k is an integer that depends on the loop . Under a local gauge trans-
formation
Aa (~x) ! Aa (~x) + @a (~x) (12.12)
the phase transforms according to (??),
0 P qn
(x) ! (x) = (x) + n (~xn ) : (12.13)
c
The requirement that the gauge symmetry and the superposition principle be
compatible amounts to requiring that the gauge transformed states also be
single-valued, I 0 0
= d`A @A = 2 k0 : (12.14)
~ ~
Thus, the allowed gauge transformations are restricted to functions (~x) such
that I
P qn
n d`an @na (~xn ) = 2 k (12.15)
~c
where k = k 0 k is an integer. Next consider a loop n in which we follow
the coordinates of the nth particle around some closed path in 3-dimensional
space while all the other particles are kept …xed. Then
I
qn
d`a @an (~xn ) = 2 k n (12.16)
~c n n
where k n is an integer. But (~x) and n = are just a function and a loop
in 3-dimensional space, which implies that the integral on the left,
I I
d`an @an (~xn ) = d`a @a (~x) ; (12.17)
n

is independent of n. Therefore the charge qn divided by an integer k n must


be independent of n which means that qn must be an integer multiple of some
basic charge q0 . We conclude that the charges qn are quantized.
5 Double-valued wave functions with ei = 1 will, of course, …nd use in the description of
spin-1/2 particles [Caticha Carrara 2019].
12.2 Momentum in Entropic Dynamics* 315

The issue of charge quantization is ultimately the issue of deciding which


is the gauge group that generates electromagnetic interactions. We could for
example decide to restrict the gauge transformations to single-valued gauge
functions (~x) so that (12.16) is trivially satis…ed irrespective of the charges
being quantized or not. Under such a restricted symmetry group the single-
valued (or double-valued) nature of the wave function is una¤ected by gauge
transformations. If, on the other hand, the gauge functions (~x) are allowed
to be multi-valued, then the compatibility of the gauge transformation (12.12-
12.13) with the superposition principle demands that charges be quantized.
The argument above cannot …x the value of the basic charge q0 because
it depends on the units chosen for the vector potential Aa . Indeed since the
dynamical equations show qn and Aa appearing only in the combination qn Aa
we can change units by rescaling charges and potentials according to Cqn = qn0
and Aa =C = A0a so that qn Aa = qn0 A0a .6
Remark: A similar conclusion — that charge quantization is a re‡ection of the
compactness of the gauge group — can be reached following an argument due
to C. N. Yang [Yang 1970]. Yang’s argument assumes that a Hilbert space has
been established and one has access to the unitary representations of symmetry
groups. Yang considers a gauge transformation
P qn
(x) ! (x) exp i n (~xn ) ; (12.18)
c
with (~x) independent of ~x. If the qn s are not commensurate there is no value
of (except 0) that makes (12.18) be the identity transformation. The gauge
group — translations on the real line — would not be compact. If, on the other
hand, the charges are integer multiples of a basic charge q0 , then two values
of that di¤er by an integer multiple of 2 c=q0 give identical transformations
and the gauge group is compact. In the present ED derivation, however, we
deal with the space T S which is a complex projective space. We cannot adopt
Yang’s argument because a gauge transformation independent of ~x is already
an identity transformation — it leads to an equivalent state in the same ray —
and cannot therefore lead to any constraints on the allowed charges.

12.2 Momentum in Entropic Dynamics*


12.2.1 Uncertainty relations
See [Nawaz Caticha 2011][Bartolomeo Caticha 2016]

12.3 The classical limit*


See [Demme Caticha 2016].
6 For conventional units such that the rescaled basic charge is Cq = e=3 with = e2 =~c =
0
1=137 the scaling factor is C = ( ~c)1=2 =3q0 . A more natural set of units might be to set
q0 = ~c so that the gauge functions (~ x) are angles.
316 Topics in Quantum Theory*

12.4 Elementary applications*


See [DiFranzo 2018].

12.4.1 The free particle*


12.4.2 The double-slit experiment*
12.4.3 Tunneling*
12.4.4 Entangled particles*
Chapter 13

The quantum measurement


problem*

See [Johnson Caticha 2011][Vanslette Caticha 2017][Caticha 2022].

13.1 Measuring position: ampli…cation*


13.2 “Measuring”other observables and the Born
rule*
13.3 Evading no-go theorems*
13.4 Weak measurements*
13.5 The qubit*
13.6 Contextuality*
13.7 Delayed choice experiments*
Chapter 14

Entropic Dynamics of
Fermions*
Chapter 15

Entropic Dynamics of Spin*

See [Caticha Carrara 2019][Carrara Caticha 202?]

15.1 Geometric algebra*


15.1.1 Multivectors and the geometric product*
15.1.2 Spinors*

15.2 Spin and the Pauli equation*


15.3 Entangled spins*
Chapter 16

Entropic Dynamics of
Bosons*

See [Caticha 2012b][Ipek Caticha 2014][Ipek et al 2018, 2020]

16.1 Boson …elds*


16.2 Boson particles*
Chapter 17

Entropy V: Quantum
Entropy*

See [Vanslette 2017]

17.1 Density matrices*


17.2 Ranking density matrices*
17.3 The quantum maximum entropy method*
17.4 Decoherence*
17.5 Variational approximation methods –II*
17.5.1 The variational method*
17.5.2 Quantum density functional formalism*
See [Youse… Caticha 2022]
Chapter 18

Epilogue: Towards a
Pragmatic Realism*

See [Caticha 2014a].

18.1 Background: the tensions within realism*


18.2 Pragmatic realism*
328 Epilogue: Towards a Pragmatic Realism*
References

[Abe 1992] S. Abe, “Quantum-state space metric and correlations,” Phys.


Rev. A 46, 1667 (1992).
[Aczel 1966] J. Aczél, Lectures on Functional Equations and Their Applica-
tions (Academic Press, New York, 1996).
[Aczel 1975] J. Aczél and Z. Daróczy, On Measures of Information and their
Characterizations (Academic Press, New York 1975).
[Adler 2004] S. L. Adler, Quantum Theory as an Emergent Phenomenon (Cam-
bridge U. Press, Cambridge 2004) (arXiv:hep-th/0206120).
[Adriaans 2008] P. W. Adriaans and J.F.A.K. van Benthem (eds.), Handbook
of Philosophy of Information (Elsevier, 2008).
[Adriaans 2012] P. W. Adriaans, “Philosophy of Information”, to appear in
the Stanford Encyclopedia of Philosophy (2012).
[Amari 1985] S. Amari, Di¤ erential-Geometrical Methods in Statistics (Springer-
Verlag, 1985).
[Amari Nagaoka 2000] S. Amari and H. Nagaoka, Methods of Information
Geometry (Am. Math. Soc./Oxford U. Press, Providence 2000).
[Anandan Aharonov 1990] J. Anandan and Y. Aharonov, “Geometry of
Quantum Evolution,” Phys. Rev. Lett. 65, 1697-1700 (1990).
[Arnold 1997] V. I. Arnold, Mathematical Methods of Classical Mechanics
(Springer, Berlin/Heidelberg, 1997).
[Ashtekar Schilling 1998] A. Ashtekar and T. A. Schilling, “Geometrical
Formulation of Quantum Mechanics,” in On Einstein’s Path, ed. by A.
Harvey (Springer, New York, 1998).
[Atkinson Mitchell 1981] C. Atkinson and A. F. S. Mitchell, “Rao’s distance
measure”, Sankhyā 43A, 345 (1981).
[Ay et al 2017] N. Ay, J. Jost, H. Vân Lê, L. Schwanchhöfer, Information
Geometry (Springer, 2017).
330 References

[Balasubramanian 1996] V. Balasubramanian, “A Geometric Formulation


of Occam’s Razor for Inference of Parametric Distributions”(arXiv:adap-
org/9601001).
[Balasubramanian 1997] V. Balasubramanian, “Statistical inference, Occam’s
razor, and statistical mechanics on the space of probability distributions”,
Neural Computation 9, 349 (1997).
[Balian 1991, 1992] R. Balian, From microphysics to macrophysics: methods
and applications of statistical mechanics (Vol. I and II) (Springer, Heidel-
berg, 1991 and 1992).
[Balian 1999] R. Balian, “Incomplete descriptions and relevant entropies”,
Am. J. Phys. 67, 1078 (1999).
[Balian 2014] R. Balian, “The Entropy-Based Quantum Metric”, Entropy 16,
3878 (2014).
[Balian Balazs 1987] R. Balian and N. L. Balazs, “Equiprobability, Infer-
ence, and Entropy in Quantum Theory”, Ann. Phys. 179, 97 (1987).
[Balian Veneroni 1981] R. Balian and M. Vénéroni, “Time-dependent Vari-
ational Principle for Predicting the Expectation Value of an Observable”,
Phys. Rev. Lett. 47, 1353 (1981).
[Balian Veneroni 1987] R. Balian and M. Vénéroni, “Incomplete Descrip-
tions, Relevant Information, and Entropy Production in Collision Processes”,
Ann. Phys. 174, 229 (1987).
[Balian Veneroni 1988] R. Balian and M. Vénéroni, “Static and Dynamic
Variational Principles for Expectation Values of Observables”, Ann. Phys.
187, 29 (1988).
[Ballentine 1970] L. Ballentine, “The statistical interpretation of quantum
mechanics”, Rev. Mod. Phys. 42, 358 (1970).
[Ballentine 1986] L. Ballentine, “Probability theory in quantum mechanics,”
Am. J. Phys. 54, 883 (1986).
[Ballentine 1990] L. Ballentine, “Limitations of the projection postulate”,
Found. Phys. 20, 1329 (1990).
[Ballentine 1998] L. Ballentine, Quantum Mechanics: A Modern Develop-
ment (World Scienti…c, Singapore 1998).
[Barbour 1994a] J. B. Barbour, “The timelessness of quantum gravity: I. The
evidence from the classical theory”, Class. Quant. Grav. 11, 2853(1994).
[Barbour 1994b] J. B. Barbour, “The timelessness of quantum gravity: II.
The appearance of dynamics in static con…gurations”, Class. Quant.
Grav. 11, 2875 (1994).
References 331

[Barbour 1994c] J. B. Barbour, “The emergence of time and its arrow from
timelessness”in Physical Origins of Time Asymmetry, eds. J. Halliwell et
al, (Cambridge U. Press, Cambridge 1994).

[Bartolomeo Caticha 2015] D. Bartolomeo and A. Caticha, “Entropic Dy-


namics: the Schroedinger equation and its Bohmian limit”, in Bayesian
Inference and Maximum Entropy Methods in Science and Engineering,
ed. by A.Gi¢ n and K. Knuth, AIP Conf. Proc. 1757, 030002 (2016)
(arXiv.org:1512.09084).

[Bartolomeo Caticha 2016] D. Bartolomeo and A. Caticha, “Trading drift


and ‡uctuations in entropic dynamics: quantum dynamics as an emergent
universality class,” J. Phys: Conf. Series 701, 012009 (2016)
(arXiv.org:1603.08469).

[Bell 2004] J. S. Bell, Speakable and Unspeakable in Quantum Mechanics (2nd.


ed., Cambridge U. Press, Cambridge 2004).

[Bennett 1982] C. Bennett, “The thermodynamics of computation— A re-


view”, Int. J. Th. Phys. 21, 905 (1982).

[Bennett 2003] C. Bennett, “Notes on Landauer’s principle, reversiblecom-


putation, and Maxwell’s demon”, Studies in History and Philosophy of
Modern Physics 34, 501 (2003).

[Blanchard et al 1986] P. Blanchard, S. Golin and M. Serva, “Repeated mesure-


ments in stochastic mechanics”, Phys. Rev. D34, 3732 (1986).

[Bohm 1952] D. Bohm, “A Suggested Interpretation of the Quantum Theory


in Terms of ‘Hidden’ Variables, I and II”, Phys. Rev. 85, 166 and 180
(1952).

[Bohm Hiley 1993] D. Bohm and B. J. Hiley, The Undivided Universe: an


ontological interpretation on quantum theory (Routledge, New York 1993).

[Bohr 1934] N. Bohr, Atomic Theory and the Description of Nature (1934,
reprinted by Ox Bow Press, Woodbridge Connecticut, 1987).

[Bohr 1958] N. Bohr, Essays 1933-1957 on Atomic Physics and Human Knowl-
edge (1958, reprinted by Ox Bow Press, Woodbridge Connecticut, 1987).

[Bohr 1963] N. Bohr, Essays 1958-1962 on Atomic Physics and Human Knowl-
edge (1963, reprinted by Ox Bow Press, Woodbridge Connecticut, 1987).

[Bollinger 1989] J. J. Bollinger et al., “Test of the Linearity of Quantum


Mechanics by rf Spectroscopy of the 9 Be + Ground State,” Phys. Rev.
Lett. 63, 1031 (1989).

[Bretthorst 1988] G. L. Bretthorst, Bayesian Spectrum Analysis and Para-


meter Estimation (Springer, Berlin 1988); available at http://bayes.wustl.edu.
332 References

[Brillouin 1952] L. Brillouin, Science and Information Theory (Academic Press,


New York, 1952).

[Brillouin 1953] L. Brillouin, “The Negentropy Principle of Information”, J.


Appl. Phys. 24, 1152 (1953).

[Brody Hughston 1997] D. J. Brodie and L. P. Hughston, “Statistical Geom-


etry in Quantum Mechanics,” Phil. Trans. R. Soc. London A 454, 2445
(1998); arXiv:gr-qc/9701051.

[Brukner Zeilinger 2002] C. Brukner and A. Zeilinger, “Information and


Fundamental Elements of the Structure of Quantum Theory,” in Time,
Quantum, Information, ed. L. Castell and O. Ischebeck (Springer, 2003);
arXiv:quant-ph/0212084.

[Callen 1985] H. B. Callen, Thermodynamics and an Introduction to Thermo-


statistics (Wiley, New York, 1985).
µ
[Campbell 1986] L. L. Campbell, “An extended Cencov characterization of
the information metric”, Proc. Am. Math. Soc. 98, 135 (1986).

[Carrara Caticha 2017] N. Carrara and A. Caticha, “Quantum phases in en-


tropic dynamics”, in Bayesian Inference and Maximum Entropy Methods
in Science and Engineering, ed. by A. Polpo et al., Springer Proc. Math.
Stat. 239, 1 (2018); arXiv.org:1708.08977.

[Caticha 1998a] A. Caticha, “Consistency and Linearity in Quantum The-


ory”, Phys. Lett. A244, 13 (1998); arXiv.org/abs/quant-ph/9803086.

[Caticha 1998b] A. Caticha, “Consistency, Amplitudes and Probabilities in


Quantum Theory”, Phys. Rev. A57, 1572 (1998); arXiv.org/abs/quant-
ph/ 9804012.

[Caticha 1998c] A. Caticha, “Insu¢ cient reason and entropy in quantum the-
ory”, Found. Phys. 30, 227 (2000); arXiv.org/abs/quant-ph/9810074.

[Caticha 2000] A. Caticha, “Maximum entropy, ‡uctuations and priors”,


Bayesian Methods and Maximum Entropy in Science and Engineering, ed.
by A. Mohammad-Djafari, AIP Conf. Proc. 568, 94 (2001); arXiv.org/abs/math-
ph/0008017.

[Caticha 2001] A. Caticha, “Entropic Dynamics”, Bayesian Methods and Max-


imum Entropy in Science and Engineering, ed. by R. L. Fry, A.I.P. Conf.
Proc. 617 (2002); arXiv.org/abs/gr-qc/0109068.

[Caticha 2003] A. Caticha, “Relative Entropy and Inductive Inference”,


Bayesian Inference and Maximum Entropy Methods in Science and Engi-
neering, ed. by G. Erickson and Y. Zhai, AIP Conf. Proc. 707, 75 (2004);
arXiv.org/abs/physics/0311093.
References 333

[Caticha 2004] A. Caticha “Questions, Relevance and Relative Entropy”, in


Bayesian Inference and Maximum Entropy Methods in Science and Engi-
neering, R. Fischer et al. A.I.P. Conf. Proc. Vol. 735, (2004); arXiv:cond-
mat/0409175.
[Caticha 2005] A. Caticha, “The Information Geometry of Space and Time”
in Bayesian Inference and Maximum Entropy Methods in Science and En-
gineering, ed. by K. Knuth et al. AIP Conf. Proc. 803, 355 (2006);
arXiv.org/abs/gr-qc/0508108.
[Caticha 2006] A. Caticha, “From Objective Amplitudes to Bayesian Prob-
abilities,” in Foundations of Probability and Physics-4, G. Adenier, C.
Fuchs, and A. Khrennikov (eds.), AIP Conf. Proc. 889, 62 (2007);
arXiv.org/abs/quant-ph/0610076.
[Caticha 2007] A. Caticha, “Information and Entropy”, Bayesian Inference
and Maximum Entropy Methods in Science and Engineering, ed. by K.
Knuth et al., AIP Conf. Proc. 954, 11 (2007); arXiv.org/abs/0710.1068.
[Caticha 2008] A. Caticha, Lectures on Probability, Entropy, and Statistical
Physics (MaxEnt 2008, São Paulo, Brazil); arXiv.org/abs/0808.0012.
[Caticha 2009a] A. Caticha, “From Entropic Dynamics to Quantum Theory”
in Bayesian Inference and Maximum Entropy Methods in Science and En-
gineering, ed. by P. Goggans and C.-Y. Chan, AIP Conf. Proc. 1193, 48
(2009); arXiv.org/abs/0907.4335.
[Caticha 2009b] A. Caticha, “Quantifying Rational Belief”, Bayesian Infer-
ence and Maximum Entropy Methods in Science and Engineering, ed. by
P. Goggans et al., AIP Conf. Proc. 1193, 60 (2009); arXiv.org/abs/0908.3212.
[Caticha 2010a] A. Caticha, “Entropic Dynamics, Time, and Quantum The-
ory”, J. Phys. A 44, 225303 (2011); arXiv.org/abs/1005.2357.
[Caticha 2010b] A. Caticha, “Entropic time”, Bayesian Inference and Maxi-
mum Entropy Methods in Science and Engineering, ed. by A. Mohammad-
Djafari, AIP Conf. Proc. 1305 (2010); arXiv:1011.0746.
[Caticha 2010c] A. Caticha, “Entropic Inference”, Bayesian Inference and
Maximum Entropy Methods in Science and Engineering, ed. by A. Mohammad-
Djafari et al., AIP Conf. Proc. 1305, 20 (2010); arXiv.org: 1011.0723.
[Caticha 2012a] A. Caticha, “Entropic inference: some pitfalls and paradoxes
we can avoid”, in Bayesian Inference and Maximum Entropy Methods in
Science and Engineering, ed. by U. von Toussaint et al., AIP Conf. Proc.
1553, 200 (2013); arXiv.org/abs/1212.6967.
[Caticha 2012b] A. Caticha, “The entropic dynamics of relativistic quantum
fields”, in Bayesian Inference and Maximum Entropy Methods in Science
and Engineering, ed. by U. von Toussaint et al., AIP Conf. Proc. 1553,
176 (2013); arXiv.org/abs/1212.6946.

[Caticha 2012c] A. Caticha, Entropic Inference and the Foundations of Physics,


(EBEB 2012, São Paulo, Brazil); http://www.albany.edu/physics/ACaticha-EIFP-book.pdf.

[Caticha 2014a] A. Caticha, “Towards an Informational Pragmatic Realism”, Minds and Machines 24, 37 (2014); arXiv.org:1412.5644.

[Caticha 2014b] A. Caticha, “Entropic Dynamics: an Inference Approach to


Time and Quantum Theory”, J. Phys: Conf Series 504 (2014) 012009;
arXiv.org:1403.3822.

[Caticha 2015a] A. Caticha, “Entropic Dynamics”, Entropy 17, 6110-6128


(2015); arXiv.org:1509.03222.

[Caticha 2015b] A. Caticha, “Geometry from Information Geometry”, in


Bayesian Inference and Maximum Entropy Methods in Science and En-
gineering, A.Gi¢ n and K. Knuth (eds.), AIP Conf. Proc. 1757, 030001
(2016); arXiv.org:1512.09076.

[Caticha 2017a] A. Caticha, “Entropic Dynamics: Mechanics without Mecha-


nism”, in Recent Advances in Info-Metrics, ed. by M. Chen and A. Golan;
arXiv.org:1704.02663

[Caticha 2017b] A. Caticha, “Entropic Dynamics: Quantum Mechanics from


Entropy and Information Geometry”, Annalen der Physik, 1700408 (2018);
https://doi.org/10.1002/andp.201700408; arXiv.org:1711.02538.

[Caticha 2019] A. Caticha, “The Entropic Dynamics approach to Quantum


Mechanics,” Entropy 21, 943 (2019); arXiv.org:1908.04693.

[Caticha 2021a] A. Caticha, “Entropy, Information, and the Updating of Prob-


abilities,” Entropy 23, 895 (2021); arXiv.org:2107.04529.

[Caticha 2021b] A. Caticha, “Quantum mechanics as Hamilton-Killing flows on


a statistical manifold,” Phys. Sci. Forum 3, 12 (2021); arXiv:2107.08502.

[Caticha 2022] A. Caticha, “Entropic Dynamics and Quantum ‘Measurement’”, MaxEnt 2022, Paris, France.

[Caticha et al 2014] A. Caticha, D. Bartolomeo, and M. Reginatto, “En-


tropic Dynamics: from entropy and information geometry to Hamiltonians
and quantum mechanics”, in Bayesian Inference and Maximum Entropy
Methods in Science and Engineering, ed. by A. Mohammad-Djafari and
F. Barbaresco, AIP Conf. Proc. 1641, 155 (2015); arXiv.org:1412.5629.

[Caticha Cafaro 2007] A. Caticha and C. Cafaro, “From Information Geom-


etry to Newtonian Dynamics”, Bayesian Inference and Maximum Entropy
Methods in Science and Engineering, ed. by K. Knuth et al., AIP Conf.
Proc. 954, 165 (2007) (arXiv.org/abs/0710.1071).

[Caticha Carrara 2019] A. Caticha and N. Carrara, “The entropic dynamics


of spin,” arXiv:2007.15719.

[Caticha Giffin 2006] A. Caticha and A. Giffin, “Updating Probabilities”, Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by A. Mohammad-Djafari, AIP Conf. Proc. 872, 31 (2006); arXiv.org/abs/physics/0608185.

[Caticha Golan 2014] A. Caticha and A. Golan, “An Entropic framework for
Modeling Economies”, Physica A 408, 149 (2014).

[Caticha Preuss 2004] A. Caticha and R. Preuss, “Maximum entropy and


Bayesian data analysis: entropic prior distributions”, Phys. Rev. E70,
046127 (2004); arXiv.org/abs/physics/0307055.

[CatichaN Kinouchi 1998] N. Caticha and O. Kinouchi, “Time ordering in


the evolution of information processing and modulation systems”, Phil.
Mag. B 77, 1565 (1998).

[CatichaN Neirotti 2006] N. Caticha and J. P. Neirotti, “The evolution of


learning systems: to Bayes or not to be”, in Bayesian Inference and Maxi-
mum Entropy Methods in Science and Engineering, ed. by A. Mohammad-
Djafari, AIP Conf. Proc. 872, 203 (2006).

[Caves et al 2007] C. Caves, C. Fuchs, and R. Schack, “Subjective probability


and quantum certainty”, Studies in History and Philosophy of Modern
Physics 38, 244 (2007).
[Cencov 1981] N. N. Čencov, Statistical Decision Rules and Optimal Inference, Transl. Math. Monographs, vol. 53, Am. Math. Soc. (Providence, 1981).

[Chaitin 1975] G. J. Chaitin, “A theory of program size formally identical to


information theory”, J. Assoc. Comp. Mach. 22, 329-340 (1975).

[Chandrasekhar 1943] S. Chandrasekhar, “Stochastic Problems in Physics


and Astronomy” Rev. Mod. Phys. 15, 1 (1943).

[Chiribella et al 2011] G. Chiribella, G. M. D’Ariano, and P. Perinotti, “In-


formational derivation of quantum theory,” Phys. Rev. A 84, 012311
(2011).

[Cirelli et al 1990] R. Cirelli, A. Manià, and L. Pizzochero, “Quantum mechanics as an infinite-dimensional Hamiltonian system with uncertainty structure: Part I and II,” J. Math. Phys. 31, 2891 and 2898 (1990).

[Costa de Beauregard Tribus 1974] O. Costa de Beauregard and M. Tribus,


“Information Theory and Thermodynamics”, Helv. Phys. Acta 47, 238
(1974).

[Cover Thomas 1991] T. Cover and J. Thomas, Elements of Information


Theory (Wiley, New York 1991).

[Cox 1946] R.T. Cox, “Probability, Frequency and Reasonable Expectation”,


Am. J. Phys. 14, 1 (1946).

[Cox 1961] R.T. Cox, The Algebra of Probable Inference (Johns Hopkins, Bal-
timore 1961).

[Cropper 1986] W. H. Cropper, “Rudolf Clausius and the road to entropy”,


Am. J. Phys. 54, 1068 (1986).

[Csiszar 1984] I. Csiszar, “Sanov property, generalized I-projection and a con-


ditional limit theorem”, Ann. Prob. 12, 768 (1984).

[Csiszar 1985] I. Csiszár, “An extended Maximum Entropy Principle and a Bayesian justification”, Bayesian Statistics 2, p. 83, ed. by J. M. Bernardo, M. H. de Groot, D. V. Lindley, and A. F. M. Smith (North Holland, 1985); “MaxEnt, mathematics and information theory”, Maximum Entropy and Bayesian Methods, p. 35, ed. by K. M. Hanson and R. N. Silver (Kluwer, Dordrecht 1996).

[Csiszar 1991] I. Csiszár, “Why least squares and maximum entropy: an ax-
iomatic approach to inference for linear inverse problems”, Ann. Stat. 19,
2032 (1991).

[Csiszar 1996] I. Csiszár, “MaxEnt, Mathematics, and Information Theory”,


p. 35 in Maximum Entropy and Bayesian Methods, K. M. Hanson and R.
N. Silver (eds.) (Kluwer Academic Publishers, Netherlands 1996).

[Csiszar 2008] I. Csiszár, “Axiomatic Characterizations of Information Mea-


sures”, Entropy 10, 261 (2008).

[Dankel 1970] T. G. Dankel, Jr., “Mechanics on Manifolds and the incorpora-


tion of spin into Nelson’s stochastic mechanics”, Arch. Rat. Mech. Anal.
37, 192 (1970).

[de Falco et al 1982] D. de Falco, S. D. Martino and S. De Siena, “Position-


Momentum Uncertainty Relations in Stochastic Mechanics”, Phys. Rev.
Lett. 49, 181 (1982).

[DAriano 2017] G. M. D’Ariano, “Physics without physics: the power of


information-theoretical principles,” Int. J. Th. Phys. 56, 97 (2017).

[de Gosson Hiley 2011] M. A. de Gosson and B. J. Hiley, “Imprints of the Quantum World in Classical Mechanics,” Found. Phys. 41, 1415 (2011).

[De Martino et al 1984] S. De Martino and S. De Siena, “On Uncertainty


Relations in Stochastic Mechanics”, Il Nuovo Cimento 79B, 175 (1984).

[Demme Caticha 2016] A. Demme and A. Caticha, “The classical limit of


entropic quantum dynamics,” in Bayesian Inference and Maximum En-
tropy Methods in Science and Engineering, ed. by G. Verdoolaege, AIP
Conf. Proc. 1853, 090001 (2017) (arXiv.org:1612.01905).

[Dewar 2003] R. Dewar, “Information theory explanation of the fluctuation theorem, maximum entropy production and self-organized criticality in non-equilibrium stationary states”, J. Phys. A: Math. Gen. 36, 631 (2003).

[Dewar 2005] R. Dewar, “Maximum entropy production and the fluctuation theorem”, J. Phys. A: Math. Gen. 38, L371 (2005).

[Diaconis 1982] P. Diaconis and S. L. Zabell, “Updating Subjective Probabil-


ities”, J. Am. Stat. Assoc. 77, 822 (1982).

[DiFranzo 2018] S. DiFranzo, “The Entropic Dynamics Approach to the Par-


adigmatic Quantum Mechanical Phenomena”, Ph.D. thesis, University at
Albany (2018).

[Dirac 1948] P. A. M. Dirac, The Principles of Quantum Mechanics (3rd edition, Oxford University Press, 1948).

[Dohrn Guerra 1978] D. Dohrn and F. Guerra, “Nelson’s stochastic mechan-


ics on Riemannian manifolds”, Lett. Nuovo Cimento 22, 121 (1978).

[Doran Lasenby 2003] C. Doran and A. Lasenby, Geometric Algebra for Physi-
cists (Cambridge U.P., Cambridge UK, 2003).

[Earman 1992] J. Earman, Bayes or Bust?: A Critical Examination of Bayesian


Con…rmation Theory (MIT Press, Cambridge, 1992).

[Earman Redei 1996] J. Earman and M. Rédei, “Why ergodic theory does
not explain the success of equilibrium statistical mechanics”, Brit. J. Phil.
Sci. 47, 63-78 (1996).

[Ehrenfest 1912] P. Ehrenfest and T. Ehrenfest, The Conceptual Foundations


of the Statistical Approach in Mechanics (Cornell U.P., Ithaca, New York,
1959).

[Einstein 1949a] A. Einstein, “Autobiographical Notes” in Albert Einstein:


Philosopher Scientist, ed. by P. A. Schilpp (Open Court, La Salle, Illinois,
1949).

[Einstein 1949b] A. Einstein, “Reply to Criticisms”in Albert Einstein: Philoso-


pher Scientist, ed. by P. A. Schilpp (Open Court, La Salle, Illinois, 1949).

[Ellis 1985] B. Ellis, “What science aims to do”in Images of Science ed. by P.
Churchland and C. Hooker (U. of Chicago Press, Chicago 1985); reprinted
in [Papineau 1996].

[Elze 2002] H. T. Elze and O. Schipper, “Time without time: a stochastic clock model”, Phys. Rev. D66, 044020 (2002).

[Elze 2003] H. T. Elze, “Emergent discrete time and quantization: relativistic particle with extra dimensions”, Phys. Lett. A310, 110 (2003).

[Elze 2011] H.-T. Elze, “Linear dynamics of quantum-classical hybrids,”Phys.


Rev. A 85, 052109 (2012).

[Everett 1957] H. Everett III, “Relative state formulation of quantum me-


chanics”, Rev. Mod. Phys. 29, 454 (1957).

[Faris 1982a] W. G. Faris, “A stochastic picture of spin”, in Stochastic Processes in Quantum Theory and Statistical Physics, ed. by S. Albeverio et al., Lecture Notes in Physics 173 (Springer, 1982).

[Faris 1982b] W. G. Faris, “Spin correlation in stochastic mechanics”, Found.


Phys. 12, 1 (1982).

[Ferrero et al 2004] M. Ferrero, D. Salgado, and J. L. Sánchez-Gómez, “Is


the Epistemic View of Quantum Mechanics Incomplete?”, Found. Phys.
34, 1993 (2004).

[Fine 1996] A. Fine, The Shaky Game: Einstein, Realism and the Quantum Theory (University of Chicago Press, Chicago 1996).

[Fisher 1925] R. A. Fisher, “Theory of statistical estimation”, Proc. Cambridge Philos. Soc. 22, 700 (1925).

[Floridi 2011] L. Floridi, The Philosophy of Information (Oxford U. Press,


Oxford 2011).

[Friederich 2011] S. Friederich, “How to spell out the epistemic conception


of quantum states”, Studies in History and Philosophy of Modern Physics
42, 149 (2011).

[Fritsche Haugk 2009] L. Fritsche and M. Haugk, “Stochastic Foundation of


Quantum Mechanics and the Origin of Particle Spin”, arXiv:0912.3442.

[Fuchs 2002] C. Fuchs, “Quantum mechanics as quantum information (and


only a little more),” in Quantum Theory: Reconstruction of Foundations
ed. by A. Khrennikov (Vaxjo U. Press, 2002) (arXiv:quant-ph/0205039).

[Garrett 1996] A. Garrett, “Belief and Desire”, Maximum Entropy and Bayesian
Methods ed. by G. R. Heidbreder (Kluwer, Dordrecht 1996).

[Gibbs 1875-78] J. W. Gibbs, “On the Equilibrium of Heterogeneous Substances”, Trans. Conn. Acad. III (1875-78), reprinted in The Scientific Papers of J. W. Gibbs (Dover, New York 1961).

[Gibbs 1902] J. W. Gibbs, Elementary Principles in Statistical Mechanics


(Yale U. Press, New Haven 1902; reprinted by Ox Bow Press, Connecticut
1981).

[Giffin Caticha 2007] A. Giffin and A. Caticha, “Updating Probabilities with Data and Moments”, Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by K. Knuth et al., AIP Conf. Proc. 954, 74 (2007) (arXiv.org/abs/0708.1593).

[Gissin 1990] N. Gisin, Phys. Lett. A 143, 1 (1990); J. Polchinski, Phys.


Rev. Lett. 66, 397 (1991).

[Giulini et al 1996] D. Giulini, E. Joos, C. Kiefer, J. Kupsch, I.-O. Sta-


matescu, and H.D. Zeh, Decoherence and the Appearance of a Classical
World in Quantum Theory (Springer, Berlin, 1996).

[Godfrey-Smith 2003] P. Godfrey-Smith, Theory and Reality (U. Chicago


Press, Chicago 2003).

[Golan 2008] A. Golan, “Information and Entropy in Econometrics – A Re-


view and Synthesis”, Foundations and Trends in Econometrics 2, 1–145
(2008).

[Golan 2018] A. Golan, Foundations of Info-Metrics: Modeling, Inference,


and Imperfect Information (Oxford U. P., New York, 2018).

[Golin 1985] S. Golin, “Uncertainty relations in stochastic mechanics”, J.


Math. Phys. 26, 2781 (1985).

[Golin 1986] S. Golin, “Comment on momentum in stochastic mechanics”, J.


Math. Phys. 27, 1549 (1986).


[Good 1950] I. J. Good, Probability and the Weighing of Evidence (Griffin, London 1950).

[Good 1983] I. J. Good, Good Thinking, The Foundations of Probability and


its Applications (University of Minnesota Press, 1983).

[Goyal et al 2010] P. Goyal, K. Knuth, J. Skilling, “Origin of complex quan-


tum amplitudes and Feynman’s rules”, Phys. Rev. A 81, 022109 (2010).

[Grad 1961] H. Grad, “The Many Faces of Entropy”, Comm. Pure and Appl.
Math. 14, 323 (1961).

[Grad 1967] H. Grad, “Levels of Description in Statistical Mechanics and


Thermodynamics”, Delaware Seminar in the Foundations of Physics, ed.
by M. Bunge (Springer-Verlag, New York 1967).

[Gregory 2005] P. C. Gregory, Bayesian Logical Data Analysis for the Phys-
ical Sciences (Cambridge UP, 2005).
[Grendar 2003] M. Grendar, Jr. and M. Grendar “Maximum Probability
and Maximum Entropy Methods: Bayesian interpretation”, Bayesian In-
ference and Maximum Entropy Methods in Science and Engineering, ed.
by G. Erickson and Y. Zhai, AIP Conf. Proc. 707, p. 490 (2004)
(arXiv.org/abs/physics/0308005).
[Greven et al 2003] A. Greven, G. Keller, and G. Warnecke (eds.), Entropy
(Princeton U. Press, Princeton 2003).
[Groessing 2008] G. Groessing, “The vacuum fluctuation theorem: Exact Schrödinger equation via nonequilibrium thermodynamics”, Phys. Lett. A 372, 4556 (2008).
[Groessing 2009] G. Groessing, “On the thermodynamic origin of the quan-
tum potential”, Physica A 388, 811 (2009).
[Guerra 1981] F. Guerra, “Structural aspects of stochastic mechanics and stochastic field theory”, Phys. Rep. 77, 263 (1981).
[Guerra Morato 1983] F. Guerra and L. Morato, “Quantization of dynami-
cal systems and stochastic control theory”, Phys. Rev. D27, 1774 (1983).
[Guillemin Sternberg 1984] V. Guillemin and S. Sternberg, Symplectic tech-
niques in physics (Cambridge U. Press, Cambridge 1984).
[Hacking 2001] I. Hacking, An Introduction to Probability and Inductive Logic
(Cambridge U. Press, Cambridge 2001).
[Hall Reginatto 2002a] M. J. W. Hall and M. Reginatto, “Schrödinger equa-
tion from an exact uncertainty principle”, J. Phys. A 35, 3289 (2002).
[Hall Reginatto 2002b] M. J. W. Hall and M. Reginatto, “Quantum mechan-
ics from a Heisenberg-type equality”, Fortschr. Phys. 50, 646 (2002).
[Hall Reginatto 2016] M. J. W. Hall and M. Reginatto, Ensembles on Configuration Space (Springer, Switzerland, 2016).
[Halpern 1999] J. Y. Halpern, “A Counterexample to Theorems of Cox and Fine”, Journal of Artificial Intelligence Research 10, 67 (1999).
[Hardy 2001] L. Hardy, “Quantum Theory From Five Reasonable Axioms”
(arXiv.org/quant- ph/0101012).
[Hardy 2011] L. Hardy, “Reformulating and Reconstructing Quantum The-
ory” (arXiv.org:1104.2066).
[Harrigan Spekkens 2010] N. Harrigan and R. Spekkens, “Einstein, Incom-
pleteness, and the Epistemic View of Quantum States”, Found. Phys.
40,125 (2010).

[Hawthorne 1993] J. Hawthorne, “Bayesian Induction is Eliminative Induc-


tion”, Philosophical Topics 21, 99 (1993).

[Heisenberg 1958] W. Heisenberg, Physics and Philosophy. The Revolution


in Modern Science (Harper, New York, 1958).

[Hempel 1967] C. G. Hempel, “The white shoe: No red herring”, Brit. J.


Phil. Sci. 18, 239 (1967).

[Hermann 1965] R. Hermann, “Remarks on the Geometric Nature of Quan-


tum Phase Space,” J. Math. Phys. 6, 1768 (1965).

[Heslot 1985] A. Heslot, “Quantum mechanics as a classical theory,” Phys.


Rev. D31, 1341-1348 (1985).

[Hestenes 1966] D. Hestenes, Space-Time Algebra (Gordon and Breach, New


York, 1966; 2nd ed. Springer, Switzerland, 2015).

[Hestenes Sobczyk 1984] D. Hestenes and G. Sobczyk, Clifford Algebra to Geometric Calculus (Reidel, Dordrecht, 1984).

[Holland 1993] P. R. Holland, The Quantum Theory of Motion (Cambridge U. Press, Cambridge 1993).

[Howard 1985] D. Howard, “Einstein on locality and separability,” Stud. Hist. Phil. Sci. 16, 171-201 (1985).

[Howard 2004] D. Howard, “Who invented the ‘Copenhagen Interpretation’? A study in mythology”, Philosophy of Science 71, 669 (2004).

[Howson Urbach 1993] C. Howson and P. Urbach, Scientific Reasoning, the Bayesian Approach (Open Court, Chicago 1993).

[Hughston 1995] L. P. Hughston, “Geometric aspects of quantum mechan-


ics,”in Twistor Theory, ed. by S. A. Huggett (Marcel Dekker, New York,
1995).

[Ipek Caticha 2014] S. Ipek and A. Caticha, “Entropic Quantization of Scalar


Fields”, in Bayesian Inference and Maximum Entropy Methods in Science
and Engineering, ed. by A. Mohammad-Djafari and F. Barbaresco, AIP
Conf. Proc. 1641, 345 (2015); arXiv.org:1412.5637.

[Ipek Caticha 2016] S. Ipek and A. Caticha, “Relational Entropic Dynamics


of Many Particles”, in Bayesian Inference and Maximum Entropy Methods
in Science and Engineering, ed. by A. Giffin and K. Knuth, AIP Conf.
Proc. 1757, 030003 (2016); arXiv.org:1601.01901.

[Ipek et al 2018] S. Ipek, M. Abedi, and A. Caticha, “Entropic Dynamics:


Reconstructing Quantum Field Theory in Curved Spacetime”, Class. Quan-
tum Grav. 36, 205013 (2019); arXiv:1803.07493 [gr-qc].

[Ipek Caticha 2020] S. Ipek and A. Caticha, “The Entropic Dynamics of Quantum Scalar Fields coupled to Gravity,” Symmetry 12, 1324 (2020); arXiv:2006.05036.
[Ipek 2021] S. Ipek, The Entropic Dynamics of Relativistic Quantum Fields in
Curved Spacetime, Ph.D. Thesis, University at Albany, State University
of New York, 2021; arXiv:2105.07042 [gr-qc].
[Jaeger 2009] G. Jaeger, Entanglement, Information, and the Interpretation
of Quantum Mechanics (Springer, Berlin 2009).
[James 1897] W. James, The Will to Believe (1897, reprinted by Dover, New
York 1956).
[James 1907] W. James, Pragmatism (1907, reprinted by Dover, 1995).
[James 1911] W. James, The Meaning of Truth (1911, reprinted by Prometheus,
1997).
[Jammer 1966] M. Jammer, The Conceptual Development of Quantum Me-
chanics (McGraw-Hill, New York 1966).
[Jammer 1974] M. Jammer, The Philosophy of Quantum Mechanics – The
Interpretations of Quantum Mechanics in Historical Perspective (Wiley,
New York 1974).
[Jaynes 1957a] E. T. Jaynes, “How does the Brain do Plausible Reasoning”,
Stanford Univ. Microwave Lab. report 421 (1957); also published in
Maximum Entropy and Bayesian Methods in Science and Engineering,
G. J. Erickson and C. R. Smith (eds.) (Kluwer, Dordrecht 1988) and at
http://bayes.wustl.edu.
[Jaynes 1957b] E. T. Jaynes, “Information Theory and Statistical Mechan-
ics”, Phys. Rev. 106, 620 (1957).
[Jaynes 1957c] E. T. Jaynes, “Information Theory and Statistical Mechanics.
II”, Phys. Rev. 108, 171 (1957).
[Jaynes 1963] E. T. Jaynes, “Information Theory and Statistical Mechanics,”
in Statistical Physics, Brandeis Lectures in Theoretical Physics, K. Ford
(ed.), Vol. 3, p.181 (Benjamin, New York, 1963).
[Jaynes 1965] E. T. Jaynes, “Gibbs vs. Boltzmann Entropies”, Am. J. Phys.
33, 391 (1965).
[Jaynes 1968] E. T. Jaynes, “Prior Probabilities”, IEEE Trans. on Systems
Science and Cybernetics SSC-4, 227 (1968) and at http://bayes.wustl.edu.
[Jaynes 1979] E. T. Jaynes, “Where do we stand on maximum entropy?”The
Maximum Entropy Principle ed. by R. D. Levine and M. Tribus (MIT
Press 1979); reprinted in [Jaynes 1983] and at http://bayes.wustl.edu.

[Jaynes 1980] E. T. Jaynes, “The Minimum Entropy Production Principle”,


Ann. Rev. Phys. Chem. 31, 579 (1980) and at http://bayes.wustl.edu.
[Jaynes 1983] E. T. Jaynes: Papers on Probability, Statistics and Statisti-
cal Physics edited by R. D. Rosenkrantz (Reidel, Dordrecht, 1983), and
papers online at http://bayes.wustl.edu.
[Jaynes 1985] E. T. Jaynes, “Bayesian Methods: General Background”, in
Maximum Entropy and Bayesian Methods in Applied Statistics, J. H. Jus-
tice (ed.) (Cambridge UP, 1985) and at http://bayes.wustl.edu.
[Jaynes 1986] E. T. Jaynes, “Predictive Statistical Mechanics,” in Frontiers
of Nonequilibrium Statistical Physics, G.T. Moore and M.O. Scully (eds.)
(Plenum Press, New York, 1986) and at http://bayes.wustl.edu.
[Jaynes 1988] E. T. Jaynes, "The Evolution of Carnot’s Principle," pp. 267-
281 in Maximum Entropy and Bayesian Methods in Science and Engineer-
ing ed. by G. J. Erickson and C. R. Smith (Kluwer, Dordrecht 1988) and
at http://bayes.wustl.edu.
[Jaynes ] E. T. Jaynes, “Macroscopic Prediction”, pp. 254-269 in Complex Systems – Operational Approaches in Neurobiology, Physics, and Computers, ed. by H. Haken (Springer, Berlin, 1985) and at http://bayes.wustl.edu.
[Jaynes 1989] E. T. Jaynes, “Clearing up the mysteries— the original goal”, in
Maximum Entropy and Bayesian Methods, edited by J. Skilling (Kluwer,
Dordrecht 1989) and at http://bayes.wustl.edu.
[Jaynes 1990] E. T. Jaynes, “Probability in Quantum Theory”, in Complex-
ity, Entropy and the Physics of Information, ed. by W. H. Zurek (Addison-Wesley, Reading MA, 1990) and at http://bayes.wustl.edu.
[Jaynes 1992] E. T. Jaynes, “The Gibbs Paradox”, Maximum Entropy and
Bayesian Methods, ed. by C. R. Smith, G. J. Erickson and P. O. Neudorfer
(Kluwer, Dordrecht 1992) and at http://bayes.wustl.edu.
[Jaynes 2003] E. T. Jaynes, Probability Theory: The Logic of Science edited
by G. L. Bretthorst (Cambridge UP, 2003).
[Jeffrey 2004] R. Jeffrey, Subjective Probability, the Real Thing (Cambridge U. Press, Cambridge 2004).
[Jeffreys 1939] H. Jeffreys, Theory of Probability (Oxford U. Press, Oxford 1939).
[Jeffreys 1946] H. Jeffreys, “An invariant form for the prior probability in estimation problems”, Proc. Roy. Soc. London Ser. A 186, 453 (1946).
[Johnson 2011] D. T. Johnson, “Generalized Galilean Transformations and
the Measurement Problem in the Entropic Dynamics Approach to Quan-
tum Theory”, Ph.D. thesis, University at Albany (2011) (arXiv:1105.1384).

[Johnson Caticha 2010] D. T. Johnson and A. Caticha, “Non-relativistic


gravity in entropic quantum dynamics”, Bayesian Inference and Maxi-
mum Entropy Methods in Science and Engineering, ed. by A. Mohammad-
Djafari, AIP Conf. Proc. 1305, 130 (2010) (arXiv:1010.1467).
[Johnson Caticha 2011] D. T. Johnson and A. Caticha, “Entropic dynamics
and the quantum measurement problem”, Bayesian Inference and Max-
imum Entropy Methods in Science and Engineering, ed. by K. Knuth et
al., AIP Conf. Proc. 1443, 104 (2012) (arXiv:1108.2550).
[Karbelkar 1986] S. N. Karbelkar, “On the axiomatic approach to the maxi-
mum entropy principle of inference”, Pramana –J. Phys. 26, 301 (1986).
[Kass Wasserman 1996] R. E. Kass and L. Wasserman, “The Selection of
Prior Distributions by Formal Rules”, J. Am. Stat. Assoc. 91, 1343
(1996).
[Khinchin 1949] A. I. Khinchin, Mathematical Foundations of Statistical Me-
chanics (Dover, New York, 1949).
[Kibble 1979] T. W. B. Kibble,“Geometrization of Quantum Mechanics,”Comm.
Math. Phys. 65, 189-201 (1979).
[Klein 1970] M. J. Klein, “Maxwell, His Demon, and the Second Law of Ther-
modynamics”, American Scientist 58, 84 (1970).
[Klein 1973] M. J. Klein, “The Development of Boltzmann’s Statistical Ideas”,
The Boltzmann Equation ed. by E. G. D. Cohen and W. Thirring, (Springer
Verlag, 1973).
[Knuth 2002] K. H. Knuth, “What is a question?” in Bayesian Inference
and Maximum Entropy Methods in Science and Engineering, ed. by C.
Williams, AIP Conf. Proc. 659, 227 (2002).
[Knuth 2003] K. H. Knuth, “Deriving laws from ordering relations”, Bayesian
Inference and Maximum Entropy Methods in Science and Engineering, ed.
by G.J. Erickson and Y. Zhai, AIP Conf. Proc. 707, 204 (2003).
[Knuth 2005] K. H. Knuth, “Lattice duality: The origin of probability and
entropy”, Neurocomputing 67C, 245 (2005).
[Knuth 2006] K. H. Knuth, “Valuations on lattices and their application to in-
formation theory,”Proceedings of the 2006 IEEE World Congress on Com-
putational Intelligence (IEEE WCCI 2006) (doi: 10.1109/FUZZY.2006.1681717).
[Knuth Skilling 2012] K. H. Knuth, J. Skilling, “Foundations of Inference,”
Axioms 1, 38-73 (2012).
[Kolmogorov 1965] A. N. Kolmogorov, “Three approaches to the quantitative definition of information”, Problems Inform. Transmission 1, 4-7 (1965).

[Koopman 1955] B. O. Koopman, “Quantum Theory and the Foundations of


Probability”, Proc. Symp. Appl. Math. Vol. VII, 97-102 (1955).

[Kullback 1959] S. Kullback, Information Theory and Statistics (Wiley, New


York 1959).
[Landauer 1961] R. Landauer, “Information is Physical”, Physics Today, May
1991, 23.
[Landau Lifshitz 1977a] L. D. Landau and E. M. Lifshitz, Statistical Physics
(Pergamon, New York 1977).
[Landau Lifshitz 1977b] L. D. Landau and E. M. Lifshitz, Quantum Me-
chanics (3rd edition, Pergamon, New York 1977).
[Landau Lifshitz 1993] L. D. Landau and E. M. Lifshitz, Mechanics (But-
terworth, Oxford 1993).
[Lanczos 1970] C. Lanczos, The Variational Principles of Mechanics (4th edi-
tion, Dover, New York 1986).
[Lebowitz 1993] J. Lebowitz, “Boltzmann’s entropy and time’s arrow”, Physics
Today, September 1993, 32.
[Lebowitz 1999] J. Lebowitz, “Statistical mechanics: a selective review of two
central issues”, Rev. Mod. Phys. 71, S346 (1999).
[Lee and Presse 2012] J. Lee and S. Pressé, “Microcanonical origin of the
maximum entropy principle for open systems”, Phys. Rev. E 86, 041126
(2012).
[Leifer 2014] M. S. Leifer, “Is the quantum state real? An extended review of ψ-ontology theorems”, Quanta 3, 67 (2014); arXiv.org: 1409.1570.
[Lindley 1956] D. V. Lindley, “On a measure of the information provided by
an experiment”, Ann. Math. Statist. 27, 986 (1956).
[Loredo 2003] T. J. Loredo and D. F. Chernoff, “Bayesian adaptive exploration”, Bayesian Inference and Maximum Entropy Methods in Science and Engineering, ed. by G. Erickson and Y. Zhai, AIP Conf. Proc. 707, 330 (2004).

[Lucas 1970] J. R. Lucas, The Concept of Probability (Clarendon Press, Ox-


ford 1970).
[Mackey 1989] M. C. Mackey, “The dynamic origin of increasing entropy”,
Rev. Mod. Phys. 61, 981 (1989).
[Mandelbrot Van Ness 1968] B. B. Mandelbrot and J. W. Van Ness, “Frac-
tional Brownian motions, fractional noises, and applications,” SIAM Re-
view 10, 422 (1968).

[Marchildon 2004] L. Marchildon, “Why Should We Interpret Quantum Mechanics?”, Found. Phys. 34, 1453 (2004).

[McDuff Salamon 2017] D. McDuff and D. Salamon, Introduction to Symplectic Topology (Oxford U.P., Oxford, 2017).

[Mehra 1998] J. Mehra, “Josiah Willard Gibbs and the Foundations of Sta-
tistical Mechanics”, Found. Phys. 28, 1785 (1998).

[Mehrafarin 2004] M. Mehrafarin, “Quantum mechanics from two physical


postulates,”Int. J. Theor. Phys., 44, 429 (2005); arXiv:quant-ph/0402153.

[Merzbacher 1962] E. Merzbacher, “Single Valuedness of Wave Functions”,


Am. J. Phys. 30, 237 (1962).

[Molitor 2015] M. Molitor, “On the relation between geometrical quantum


mechanics and information geometry,” J. Geom. Mech. 7, 169 (2015).

[Myung et al 2000] I. J. Myung, V. Balasubramanian, M. A. Pitt, “Counting probability distributions: Differential geometry and model selection”, Proc. Nat. Acad. Sci. 97, 11170–11175 (2000).

[Nawaz 2012] S. Nawaz, “Momentum and Spin in Entropic Quantum Dynam-


ics”, Ph.D. thesis, University at Albany (2012) .

[Nawaz Caticha 2011] S. Nawaz and A. Caticha, “Momentum and uncer-


tainty relations in the entropic approach to quantum theory”, in Bayesian
Inference and Maximum Entropy Methods in Science and Engineering, ed.
by K. Knuth et al., AIP Conf. Proc. 1443, 112 (2012) (arXiv:1108.2629).

[Nawaz et al 2016] S. Nawaz, M. Abedi, and A. Caticha, “Entropic Dynamics


on Curved Spaces”, in Bayesian Inference and Maximum Entropy Methods
in Science and Engineering, ed. by A. Giffin and K. Knuth, AIP Conf.
Proc. 1757, 030004 (2016) (arXiv.org:1601.01708).

[Neirotti CatichaN 2003] J. P. Neirotti and N. Caticha, “Dynamics of the


evolution of learning algorithms by selection”, Phys. Rev. E 67, 041912
(2003).

[Nelson 1966] E. Nelson, “Derivation of the Schrödinger equation from New-


tonian Mechanics”, Phys. Rev. 150, 1079 (1966).

[Nelson 1967] E. Nelson, Dynamical theories of Brownian motion (Princeton U. P., Princeton 1967; 2nd ed. 2001, http://www.math.princeton.edu/nelson/books.html).

[Nelson 1979] E. Nelson, “Connection between Brownian motion and quan-


tum mechanics”, p.168 in Einstein Symposium Berlin, Lecture Notes in
Physics 100 (Springer-Verlag, Berlin 1979).

[Nelson 1985] E. Nelson, Quantum Fluctuations (Princeton U. Press, Prince-


ton 1985).

[Nelson 1986] E. Nelson, “Field theory and the future of stochastic mechanics”, in Stochastic Processes in Classical and Quantum Systems, ed. by S. Albeverio et al., Lecture Notes in Physics 262 (Springer, Berlin 1986).

[von Neumann 1955] J. von Neumann, Mathematical Foundations of Quan-


tum Mechanics (Princeton University Press, 1955).

[Newton 1693] Isaac Newton’s third letter to Bentley, February 25, 1693 in
Isaac Newton’s papers and letters on Natural Philosophy and related doc-
uments, ed. by I. B. Cohen (Cambridge, 1958), p. 302.

[Norton 2011] J. D. Norton, “Waiting for Landauer”, Studies in History and


Philosophy of Modern Physics 36, 184 (2011).

[Norton 2013] J. D. Norton, “The End of the Thermodynamics of Computa-


tion: A No-Go Result”, Philosophy of Science 80, 1182 (2013).

[Papineau 1996] D. Papineau (ed.), The Philosophy of Science (Oxford U.


Press, Oxford 1996).

[Pauli 1939] W. Pauli, Helv. Phys. Acta 12, 147 (1939) and W. Pauli, Gen-
eral Principles of Quantum Mechanics section 6 (Springer-Verlag, Berlin
1980).

[de la Peña and Cetto 2014] L. de la Peña and A.M. Cetto, The Emerging
Quantum: The Physics Behind Quantum Mechanics (Springer, 2014).

[Peres 1993] A. Peres, Quantum Theory: Concepts and Methods (Kluwer,


Dordrecht 1993).

[Pistone Sempi 1995] G. Pistone and C. Sempi, “An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one”, Ann. Statist. 23, 1543–1561 (1995).

[Plastino and Plastino 1994] A. R. Plastino and A. Plastino, “From Gibbs


microcanonical ensemble to Tsallis generalized canonical distribution”,
Phys. Lett. A193, 140 (1994).

[Presse et al 2013] S. Pressé, K. Ghosh, J. Lee, and K. A. Dill, “Nonadditive


Entropies Yield Probability Distributions with Biases not Warranted by
the Data”, Phys. Rev. Lett. 111, 180604 (2013).

[Price 1996] H. Price, Time’s Arrow and Archimedes’ Point (Oxford U. Press, Oxford 1996).

[Pusey et al 2012] M. F. Pusey, J. Barrett, and T. Rudolph, “On the reality


of the quantum state,”Nature Physics 8, 475-478 (2012); arXiv:1111.3328.

[Putnam 1975] H. Putnam, Mathematics, Matter, and Method, Vol. 1 (Cam-


bridge U. Press, Cambridge 1975).

[Putnam 1979] H. Putnam, “How to be an internal realist and a transcenden-


tal idealist (at the same time)”in Language Logic and Philosophy, Proc. of
the 4th International Wittgenstein Symposium (Kirchberg/Wechsel, Aus-
tria 1979).

[Putnam 1981] H. Putnam, Reason, Truth and History (Cambridge U. Press,


Cambridge 1981).

[Putnam 1987] H. Putnam, The Many Faces of Realism (Open Court, LaSalle,
Illinois 1987).

[Putnam 2003] H. Putnam, The Collapse of the Fact/Value Dichotomy and


Other Essays (Harvard U. Press, Cambridge 2003).

[Rao 1945] C. R. Rao, “Information and the accuracy attainable in the es-
timation of statistical parameters”, Bull. Calcutta Math. Soc. 37, 81
(1945).

[Ramsey ] F. P. Ramsey, Philosophical Papers, ed. by D. H. Mellor (Cam-


bridge U.P., Cambridge, 1990).

[Reginatto Hall 2011] M. Reginatto and M.J.W. Hall, “Quantum theory from
the geometry of evolving probabilities,”AIP Conf. Proc. 1443, 96 (2012);
arXiv:1108.5601.

[Reginatto Hall 2012] M. Reginatto and M.J.W. Hall, “Information geome-


try, dynamics and discrete quantum mechanics,” AIP Conf. Proc. 1553,
246 (2013); arXiv:1207.6718.

[Reginatto 2013] M. Reginatto, “From information to quanta: A derivation of


the geometric formulation of quantum theory from information geometry,”
arXiv:1312.0429.

[Renyi 1961] A. Renyi, “On measures of entropy and information”, Proc. 4th
Berkeley Symposium on Mathematical Statistics and Probability, Vol 1, p.
547 (U. of California Press, Berkeley 1961).

[Riesz 1958] M. Riesz, Clifford Numbers and Spinors (The Institute for Fluid Dynamics and Applied Mathematics, Lecture Series No. 38, U. of Maryland, 1958).

[Rissanen 1978] J. Rissanen, “Modeling by shortest data description”, Auto-


matica 14, 465 (1978).

[Rissanen 1986] J. Rissanen, “Stochastic complexity and modeling”, Ann.


Stat. 14, 1080 (1986).

[Rodriguez 1988] C. C. Rodríguez, “Understanding ignorance”, Maximum


Entropy and Bayesian Methods, G. J. Erickson and C. R. Smith (eds.)
(Kluwer, Dordrecht 1988).

[Rodriguez 1989] C. C. Rodríguez, “The metrics generated by the Kullback


number”, Maximum Entropy and Bayesian Methods, J. Skilling (ed.) (Kluwer,
Dordrecht 1989).

[Rodriguez 1990] C. C. Rodríguez, “Objective Bayesianism and geometry”,


Maximum Entropy and Bayesian Methods, P. F. Fougère (ed.) (Kluwer,
Dordrecht 1990).

[Rodriguez 1991] C. C. Rodríguez, “Entropic priors”, Maximum Entropy and


Bayesian Methods, W. T. Grandy Jr. and L. H. Schick (eds.) (Kluwer,
Dordrecht 1991).

[Rodriguez 1998] C. C. Rodríguez, “Are we cruising a hypothesis space?”


(arxiv.org/abs/physics/9808009).

[Rodriguez 2002] C. C. Rodríguez: “Entropic Priors for Discrete Probabilis-


tic Networks and for Mixtures of Gaussian Models”, Bayesian Inference
and Maximum Entropy Methods in Science and Engineering, R. L. Fry
(ed.) AIP Conf. Proc. 617, 410 (2002) (arXiv.org/abs/physics/0201016).

[Rodriguez 2003] C. C. Rodríguez, “A Geometric Theory of Ignorance”


(omega.albany.edu:8008/ignorance/ignorance03.pdf).

[Rodriguez 2004] C. C. Rodríguez, “The Volume of Bitnets”


(omega.albany.edu:8008/bitnets/bitnets.pdf).

[Rodriguez 2005] C. C. Rodríguez, “The ABC of model selection: AIC, BIC


and the new CIC”, Bayesian Inference and Maximum Entropy Methods
in Science and Engineering, K. Knuth et al. (eds.) AIP Conf. Proc. Vol.
803, 80 (2006) (omega.albany.edu:8008/CIC/me05.pdf).

[Rovelli 1996] C. Rovelli, “Relational Quantum Mechanics”, Int. J. Theor. Phys. 35, 1637 (1996); arXiv:quant-ph/9609002.

[Rozanov 1977] Y. A. Rozanov, Probability Theory, A Concise Course (Dover,


New York 1977).

[Ruppeiner 1979] G. Ruppeiner, “Thermodynamics: a Riemannian geomet-


ric model”, Phys. Rev. A 20, 1608 (1979).

[Ruppeiner 1995] G. Ruppeiner, “Riemannian geometry in thermodynamic fluctuation theory”, Rev. Mod. Phys. 67, 605 (1995).

[Savage 1972] L. J. Savage, The Foundations of Statistics (Dover, 1972).



[Schlosshauer 2004] M. Schlosshauer, “Decoherence, the measurement prob-


lem, and interpretations of quantum mechanics”, Rev. Mod. Phys. 76,
1267 (2004).

[Schrodinger 1930] E. Schrödinger, “About the Heisenberg Uncertainty Relation”, Sitzungsberichte der Preussischen Akademie der Wissenschaften (Phys.-Math. Klasse) 19, 296 (1930); English translation by A. Angelow and M. Batoni: arxiv.org/abs/quant-ph/9903100.

[Schrodinger 1938] E. Schrödinger, “The multivaluedness of the wave func-


tion,” Ann. Phys. 32, 49 (1938).

[Schutz 1980] B. Schutz, Geometrical Methods of Mathematical Physics (Cam-


bridge U.P., UK, 1980).

[Sebastiani Wynn 00] P. Sebastiani and H. P. Wynn, “Maximum entropy


sampling and optimal Bayesian experimental design”, J. Roy. Stat. Soc.
B, 145 (2000).

[Seidenfeld 1986] T. Seidenfeld, “Entropy and Uncertainty”, Philosophy of


Science 53, 467 (1986); reprinted in Foundations of Statistical Inference,
I. B. MacNeill and G. J. Umphrey (eds.) (Reidel, Dordrecht 1987).

[Shannon 1948] C. E. Shannon, “The Mathematical Theory of Communica-


tion”, Bell Syst. Tech. J. 27, 379, 623 (1948).

[Shannon Weaver 1949] C. E. Shannon and W. Weaver, The Mathematical


Theory of Communication, (U. Illinois Press, Urbana 1949).

[Shimony 1985] A. Shimony, “The status of the principle of maximum en-


tropy”, Synthese 63, 35 (1985).

[Shore Johnson 1980] J. E. Shore and R. W. Johnson, “Axiomatic derivation


of the Principle of Maximum Entropy and the Principle of Minimum Cross-
Entropy”, IEEE Trans. Inf. Theory IT-26, 26 (1980).

[Shore Johnson 1981] J. E. Shore and R. W. Johnson, “Properties of Cross-Entropy Minimization”, IEEE Trans. Inf. Theory IT-27, 472 (1981).

[Sivia Skilling 2006] D. S. Sivia and J. Skilling, Data Analysis: a Bayesian


tutorial (Oxford U. Press, Oxford 2006).

[Skilling 1988] J. Skilling, “The Axioms of Maximum Entropy”, Maximum-


Entropy and Bayesian Methods in Science and Engineering, pp. 173-187,
ed. by G. J. Erickson and C. R. Smith (Kluwer, Dordrecht 1988).

[Skilling 1989] J. Skilling, “Classic Maximum Entropy”, Maximum Entropy


and Bayesian Methods, pp. 45-52, ed. by J. Skilling (Kluwer, Dordrecht
1989).

[Skilling 1990] J. Skilling, “Quantified Maximum Entropy”, Maximum Entropy and Bayesian Methods, pp. 341-350, ed. by P. F. Fougère (Kluwer, Dordrecht 1990).

[Smolin 1986a] L. Smolin, “On the nature of quantum fluctuations and their relation to gravitation and the principle of inertia”, Class. Quantum Grav. 3, 347 (1986).

[Smolin 1986b] L. Smolin, “Quantum fluctuations and inertia”, Phys. Lett. 113A, 408 (1986).

[Smolin 2006] L. Smolin, “Could quantum mechanics be an approximation to


another theory?” (arXiv.org/abs/quant-ph/0609109).

[Smith Erickson 1990] C. R. Smith, G. J. Erickson, “Probability Theory and the Associativity Equation”, in Maximum Entropy and Bayesian Methods, ed. by P. F. Fougère (Kluwer, Dordrecht 1990).

[Solomonov 1964] R. J. Solomonoff, “A formal theory of inductive inference”, Information and Control 7, 1-22 and 224-254 (1964).

[Souriau 1997] J.-M. Souriau, Structure of Dynamical Systems – A Symplec-


tic View of Physics, translation by C.H. Cushman-deVries (Birkhäuser,
Boston 1997).

[Spekkens 2005] R. W. Spekkens, “Contextuality for preparations, transformations and unsharp measurements”, Phys. Rev. A 71, 052108 (2005); arXiv:quant-ph/0406166.

[Spekkens 2007] R. Spekkens, “Evidence for the epistemic view of quantum


states: a toy theory”, Phys. Rev. A 75, 032110 (2007).

[Stapp 1972] H. P. Stapp, “The Copenhagen Interpretation”, Am. J. Phys.


40, 1098 (1972).

[Takabayasi 1952] T. Takabayasi, “On the Formulation of Quantum Mechan-


ics associated with Classical Pictures,”Prog. Theor. Phys. 8, 143 (1952).

[Takabayasi 1983] T. Takabayasi, “Vortex, Spin and Triad for Quantum Me-
chanics of Spinning Particle,” Prog. Theor. Phys. 70, 1 (1983).

[’t Hooft 1999] G. ’t Hooft, “Quantum Gravity as a Dissipative Deterministic


System”, Class. Quant. Grav. 16, 3263 (1999) (arXiv:gr-qc/9903084).

[ter Haar 1955] D. ter Haar, “Foundations of Statistical Mechanics”, Rev.


Mod. Phys. 27, 289 (1955).

[Tribus 1961] M. Tribus, “Information Theory as the Basis for Thermostatics


and Thermodynamics”, J. Appl. Mech. (March 1961) p. 1-8.

[Tribus 1969] M. Tribus, Rational Descriptions, Decisions and Designs (Perg-


amon, New York 1969).

[Tribus 1978] M. Tribus, “Thirty Years of Information Theory”, The Maxi-


mum Entropy Formalism, R.D. Levine and M. Tribus (eds.) (MIT Press,
Cambridge 1978).

[Tsallis 1988] C. Tsallis, “Possible Generalization of Boltzmann-Gibbs Statis-


tics”, J. Stat. Phys. 52, 479 (1988).

[Tsallis 2011] C. Tsallis, “The nonadditive entropy Sq and its applications in


physics and elsewhere; some remarks”, Entropy 13, 1765 (2011).

[Tseng Caticha 2001] C.-Y. Tseng and A. Caticha, “Yet another resolution
of the Gibbs paradox: an information theory approach”, Bayesian In-
ference and Maximum Entropy Methods in Science and Engineering, ed.
by R. L. Fry, A.I.P. Conf. Proc. 617, 331 (2002) (arXiv.org/abs/cond-
mat/0109324).

[Tseng Caticha 2008] C. Y. Tseng and A. Caticha, “Using relative entropy to find optimal approximations: An application to simple fluids”, Physica A 387, 6759 (2008) (arXiv:0808.4160).

[Van Horn 2003] K. Van Horn, “Constructing a Logic of Plausible Inference:


a Guide to Cox’s Theorem”, Int. J. Approx. Reasoning 34, 3 (2003).

[Vanslette 2017] K. Vanslette, “Entropic Updating of Probabilities and Density Matrices”, Entropy 19, 664 (2017); arXiv:1710.09373.

[Vanslette Caticha 2016] K. Vanslette and A. Caticha, “Quantum measure-


ment and weak values in entropic quantum dynamics,”in Bayesian Infer-
ence and Maximum Entropy Methods in Science and Engineering, ed. by
G. Verdoolaege, AIP Conf. Proc. 1853, 090003 (2017) (arXiv.org:1701.00781).

[von Mises 1957] R. von Mises, Probability, Statistics and Truth (Dover, 1957).

[Uffink 1995] J. Uffink, “Can the Maximum Entropy Principle be explained as a consistency requirement?”, Studies in History and Philosophy of Modern Physics 26, 223 (1995).

[Uffink 1996] J. Uffink, “The Constraint Rule of the Maximum Entropy Principle”, Studies in History and Philosophy of Modern Physics 27, 47 (1996).

[Uffink 2001] J. Uffink, “Bluff Your Way in the Second Law of Thermodynamics”, Studies in History and Philosophy of Modern Physics 32(3), 305 (2001).

[Uffink 2003] J. Uffink, “Irreversibility and the Second Law of Thermodynamics”, in Entropy, ed. by A. Greven et al. (Princeton UP, 2003).

[Uffink 2004] J. Uffink, “Boltzmann’s Work in Statistical Physics”, The Stanford Encyclopedia of Philosophy (http://plato.stanford.edu).
[Uffink 2006] J. Uffink, “Compendium of the foundations of classical statistical physics”, in Handbook for Philosophy of Physics, J. Butterfield and J. Earman (eds) (2006).
[Uffink 2009] J. Uffink, “Boltzmann’s H-theorem, its discontents, and the birth of statistical mechanics”, Studies in History and Philosophy of Modern Physics 40, 174 (2009).
[van Fraasen 1980] B. C. van Fraasen, The Scientific Image (Clarendon, Oxford 1980).
[van Fraasen 1981] B. C. van Fraasen, “A problem for relative information
minimizers in probability kinematics”, Brit. J. Phil. Sci. 32, 375 (1981).
[van Fraasen 1986] B. C. van Fraasen, “A problem for relative information
minimizers, continued”, Brit. J. Phil. Sci. 37, 453 (1986).
[van Fraasen 1989] B. C. van Fraasen, Laws and Symmetry (Clarendon, Ox-
ford 1989).
[van Fraasen 1997] B. C. van Fraasen, “Structure and Perspective: Philosophical Perplexity and Paradox”, in Logic and Scientific Methods, p. 511, ed. by M. L. Dalla Chiara et al. (Kluwer, Netherlands 1997).
[van Fraasen 2006a] B. C. van Fraasen, “Structure: Its Shadow and Sub-
stance”, Brit. J. Phil. Sci. 57, 275 (2006).
[van Fraasen 2006b] B. C. van Fraasen, “Representation: The Problem for
Structuralism”, Philosophy of Science 73, 536 (2006).
[Vanslette Caticha 2017] K. Vanslette and A. Caticha, “Quantum measure-
ment and weak values in entropic quantum dynamics,” AIP Conf. Proc.
1853, 090003 (2017); arXiv:1701.00781.
[von Toussaint 2011] U. von Toussaint, “Bayesian inference in physics”, Rev.
Mod. Phys. 83, 943 (2011).
[Wallstrom 1989] T. C. Wallstrom, “On the derivation of the Schrödinger
equation from stochastic mechanics”, Found. Phys. Lett. 2, 113 (1989).
[Wallstrom 1990] T. C. Wallstrom, “The stochastic mechanics of the Pauli
equation”, Trans. Am. Math. Soc. 318, 749 (1990).
[Wallstrom 1994] T. C. Wallstrom, “The inequivalence between the Schrödinger
equation and the Madelung hydrodynamic equations”, Phys. Rev. A49,
1613 (1994).

[Wehrl 1978] A. Wehrl, “General properties of entropy”, Rev. Mod. Phys.


50, 221 (1978).
[Weinhold 1975] F. Weinhold, “Metric geometry of equilibrium thermody-
namics”, J. Chem. Phys. 63, 2479 (1975).
[Weinhold 1976] F. Weinhold, “Geometry and thermodynamics”, Phys. To-
day 29, No. 3, 23 (1976).
[Wetterich 2010] C. Wetterich, “Quantum particles from coarse grained clas-
sical probabilities in phase space”, Ann. Phys. 325, 1359 (2010)
(arXiv:1003.3351).
[Wheeler Zurek 1983] J. A. Wheeler and W. H. Zurek, Quantum Theory and
Measurement (Princeton U. Press, Princeton 1983).
[Wigner 1963] E. P. Wigner, “The problem of measurement”, Am J. Phys.
31, 6 (1963).
[Williams 1980] P. M. Williams, “Bayesian Conditionalization and the Prin-
ciple of Minimum Relative Information”, Brit. J. Phil. Sci. 31, 131
(1980).
[Wilson 1981] S. S. Wilson, “Sadi Carnot”, Scientific American, August 1981, p. 134.
[Wootters 1981] W. K. Wootters, “Statistical distance and Hilbert space”,
Phys. Rev. D 23, 357 (1981).
[Yang 1970] C. N. Yang, “Charge Quantization, Compactness of the Gauge
Group, and Flux Quantization,” Phys. Rev. D1, 2360 (1970).
[Yousefi Caticha 2021] A. Yousefi and A. Caticha, “An entropic approach to classical Density Functional Theory,” Phys. Sci. Forum 3, 13 (2021); arXiv:2108.01594.
[Zeh 2002] H. D. Zeh, “The Wave Function: It or Bit?”(arXiv.org/abs/quant-
ph/0204088).
[Zeh 2007] H. D. Zeh, The Physical Basis of the Direction of Time (5th edi-
tion, Springer, Berlin 2007).
[Zeh 2016] H. D. Zeh, “The strange (hi)story of particles and waves”, Z.
Naturf. A 71, 195 (2016); revised version: arXiv:1304.1003v23.
[Zeilinger 1999] A. Zeilinger, “A Foundational Principle for Quantum Mechanics”, Found. Phys. 29, 631 (1999).
[Zellner 1997] A. Zellner, “The Bayesian Method of Moments”, Advances in
Econometrics 12, 85 (1997).
[Zurek 2003] W. H. Zurek, “Decoherence, einselection, and the quantum ori-
gins of the classical”, Rev. Mod. Phys. 75, 715 (2003).
