
UNITEXT for Physics

Carlos Maña

Probability
and Statistics
for Particle
Physics
UNITEXT for Physics

Series editors
Paolo Biscari, Milano, Italy
Michele Cini, Roma, Italy
Attilio Ferrari, Torino, Italy
Stefano Forte, Milano, Italy
Morten Hjorth-Jensen, Oslo, Norway
Nicola Manini, Milano, Italy
Guido Montagna, Pavia, Italy
Oreste Nicrosini, Pavia, Italy
Luca Peliti, Napoli, Italy
Alberto Rotondi, Pavia, Italy
The UNITEXT for Physics series, formerly UNITEXT Collana di Fisica e Astronomia,
publishes textbooks and monographs in Physics and Astronomy, mainly in English,
characterized by a didactic style and comprehensiveness. The books published in
the UNITEXT for Physics series are addressed to graduate and advanced graduate
students, but also to scientists and researchers as important resources for their
education, knowledge and teaching.

More information about this series at http://www.springer.com/series/13351


Carlos Maña

Probability and Statistics


for Particle Physics

Carlos Maña
Departamento de Investigación Básica
Centro de Investigaciones Energéticas,
Medioambientales y Tecnológicas
Madrid
Spain

ISSN 2198-7882 ISSN 2198-7890 (electronic)


UNITEXT for Physics
ISBN 978-3-319-55737-3 ISBN 978-3-319-55738-0 (eBook)
DOI 10.1007/978-3-319-55738-0
Library of Congress Control Number: 2017936885

© Springer International Publishing AG 2017


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Contents

1 Probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Elements of Probability: (Ω, B, μ) . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Events and Sample Space: (Ω) . . . . . . . . . . . . . . . . . . . . 1
1.1.2 σ-algebras (BΩ) and Measurable Spaces (Ω, BΩ) . . . . . . 3
1.1.3 Set Functions and Measure Space: (Ω, BΩ, μ) . . . . . . . . . 6
1.1.4 Random Quantities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Conditional Probability and Bayes Theorem . . . . . . . . . . . . . . . . 14
1.2.1 Statistically Independent Events . . . . . . . . . . . . . . . . . . . 15
1.2.2 Theorem of Total Probability . . . . . . . . . . . . . . . . . . . . . 18
1.2.3 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Distribution Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.1 Discrete and Continuous Distribution Functions . . . . . . . 24
1.3.2 Distributions in More Dimensions . . . . . . . . . . . . . . . . . 28
1.4 Stochastic Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.4.1 Mathematical Expectation . . . . . . . . . . . . . . . . . . . . . . . . 35
1.4.2 Moments of a Distribution . . . . . . . . . . . . . . . . . . . . . . . 36
1.4.3 The “Error Propagation Expression” . . . . . . . . . . . . . . . . 44
1.5 Integral Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.5.1 The Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.5.2 The Mellin Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.6 Ordered Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
1.7 Limit Theorems and Convergence . . . . . . . . . . . . . . . . . . . . . . . . 67
1.7.1 Chebyshev’s Theorem. . . . . . . . . . . . . . . . . . . . . . . . . . . 68
1.7.2 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . 69
1.7.3 Almost Sure Convergence . . . . . . . . . . . . . . . . . . . . . . . 70
1.7.4 Convergence in Distribution . . . . . . . . . . . . . . . . . . . . . . 71
1.7.5 Convergence in Lp Norm . . . . . . . . . . . . . . . . . . . . . . . . 76
1.7.6 Uniform Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 77


Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
2.1 Elements of Parametric Inference . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.2 Exchangeable Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
2.3 Predictive Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.4 Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.5 Exponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.6 Prior Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.6.1 Principle of Insufficient Reason . . . . . . . . . . . . . . . . . . . 96
2.6.2 Parameters of Position and Scale . . . . . . . . . . . . . . . . . . 97
2.6.3 Covariance Under Reparameterizations . . . . . . . . . . . . . . 103
2.6.4 Invariance Under a Group of Transformations . . . . . . . . 109
2.6.5 Conjugated Distributions. . . . . . . . . . . . . . . . . . . . . . . . . 115
2.6.6 Probability Matching Priors . . . . . . . . . . . . . . . . . . . . . . 119
2.6.7 Reference Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
2.7 Hierarchical Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
2.8 Priors for Discrete Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
2.9 Constraints on Parameters and Priors . . . . . . . . . . . . . . . . . 136
2.10 Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
2.10.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
2.10.2 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
2.11 Credible Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
2.12 Bayesian (B) Versus Classical (F) Philosophy . . . . . . . . . . . . 148
2.13 Some Worked Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
2.13.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
2.13.2 Characterization of a Possible Source of Events . . . . . . . 158
2.13.3 Anisotropies of Cosmic Rays . . . . . . . . . . . . . . . . . . . . . 161
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
3 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
3.1 Pseudo-Random Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
3.2 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
3.2.1 Inverse Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
3.2.2 Acceptance-Rejection (Hit-Miss; J. Von Neumann
1951) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
3.2.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
3.2.4 Decomposition of the Probability Density. . . . . . . . . . . . 185
3.3 Everything at Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.3.1 The Compton Scattering . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.3.2 An Incoming Flux of Particles . . . . . . . . . . . . . . . . . . . . 192
3.4 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
3.4.1 Sampling from Conditionals and Gibbs Sampling . . . . . 214
3.5 Evaluation of Definite Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . 218


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
4 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.1 Quantification of Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.2 Expected Information and Entropy . . . . . . . . . . . . . . . . . . . . . . . . 223
4.3 Conditional and Mutual Information . . . . . . . . . . . . . . . . . . . . . . 226
4.4 Generalization for Absolute Continuous Random Quantities . . . . 228
4.5 Kullback–Leibler Discrepancy and Fisher’s Matrix . . . . . . . . . . . 229
4.5.1 Fisher’s Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
4.5.2 Asymptotic Behaviour of the Likelihood Function . . . . . 232
4.6 Some Properties of Information . . . . . . . . . . . . . . . . . . . . . . . . . . 234
4.7 Geometry and Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Introduction

They say that understanding ought to work by the rules of


right reason. These rules are, or ought to be, contained in
Logic; but the actual science of logic is conversant at present
only with things either certain, impossible, or entirely
doubtful, none of which (fortunately) we have to reason on.
Therefore the true logic of this world is the calculus of
Probabilities, which takes account of the magnitude of the
probability which is, or ought to be, in a reasonable man’s
mind.

J.C. Maxwell

These notes, based on a one-semester course on probability and statistics given in


the former Doctoral Program of the Department of Theoretical Physics at the
Universidad Complutense in Madrid, are a more elaborate version of a series of
lectures given at different places to advanced graduate and Ph.D. students.
Although they certainly have to be tailored for undergraduate students, they contain
a humble overview of the basic concepts and ideas one should have in mind before
getting involved in data analysis and I believe they will be a useful reference for
both students and researchers.
I feel, maybe wrongly, that there is a recent tendency in a subset of the particle
physics community to consider statistics as a collection of prescriptions written in
some holy references that are used blindly, with the sole argument that either
“everybody does it that way” or that “it has always been done this way.” In the
lectures, I have tried to demystify the “how-to” recipes, not because they are not
useful but because, on the one hand, they are applicable under some conditions that
tend to be forgotten and, on the other, because if the concepts are clear so will be
the way to proceed (“at least formally”) for the problems one comes across in
particle physics. In the end, the quote from Laplace given at the beginning of the
first lecture is what it is all about.
There is a countable set of books on probability and statistics and a sizable
subset of them are very good, out of which I would recommend the following ones
(a personal choice function). Chapter 1 deals with probability and this is just a
measure, a finite nonnegative measure, so it will be very useful to read some
sections of Measure Theory (2006; Springer) by V.I. Bogachev, in particular
Chaps. 1 and 2 of the first volume. However, for those students who are not yet


familiar with measure theory, there is an appendix to this chapter with a short
digression on some basic concepts. A large fraction of the material presented in this
lecture can be found in more depth, together with other interesting subjects, in the
book Probability: A Graduate Course (2013; Springer Texts in Statistics) by A.
Gut. Chapter 2 is about statistical inference, Bayesian inference in fact, and a must
for this topic is Bayesian Theory (1994; John Wiley & Sons) by J.M. Bernardo
and A.F.M. Smith, which also contains an enlightening discussion of the Bayesian
and frequentist approaches in Appendix B. It is beyond question that in any
worthwhile course on statistics the ubiquitous frequentist methodology has to be
taught as well and there are excellent references on the subject. Students are
encouraged to look, for instance, at Statistical Methods in Experimental Physics
(2006; World Scientific) by F. James, Statistics for Nuclear and Particle Physicists
(1989; Cambridge University Press) by L. Lyons, or Statistical Data Analysis
(1997; Oxford Science Pub.) by G. Cowan. Last, Chap. 3 is devoted to Monte Carlo
simulation, an essential tool in statistics and particle physics, and Chap. 4 to
information theory, and, like for the first chapters, both have interesting references
given along the text.
“Time is short, my strength is limited,…”, Kafka dixit, so many interesting
subjects that deserve a whole lecture by themselves are left aside. To mention some:
a historical development of probability and statistics, Bayesian networks, generalized
distributions (a different approach to probability distributions), decision theory
(game theory), and Markov chains, for which we shall state only the relevant
properties without further explanation.
I am grateful to Drs. J. Berdugo, J. Casaus, C. Delgado, and J. Rodriguez for
their suggestions and a careful reading of the text, and much indebted to Dr. Hisako
Niko. Were it not for her interest, these notes would still be in the drawer. My gratitude
goes also to Mieke van der Fluit for her assistance with the edition.
Chapter 1
Probability

The Theory of Probabilities is basically nothing else but


common sense reduced to calculus
P.S. Laplace

1.1 The Elements of Probability: (Ω, B, μ)

The axiomatic definition of probability was introduced by A.N. Kolmogorov in 1933
and starts with the concepts of sample space (Ω) and space of events (BΩ) with the
structure of a σ-algebra. When the pair (Ω, BΩ) is equipped with a measure μ we have
a measure space (Ω, BΩ, μ) and, if the measure is a probability measure P, we talk
about a probability space (Ω, BΩ, P). Let's discuss all these elements.

1.1.1 Events and Sample Space: (Ω)

To learn about the state of nature, we do experiments and observations of the natural
world and ask ourselves questions about the outcomes. In a general way, the questions
we may ask about the result of an experiment such that the possible
answers are it occurs or it does not occur are called events. There are different kinds
of events and among them we have the elementary events; that is, those results of
the random experiment that can not be decomposed into others of lesser entity. The
sample space (Ω) is the set of all the possible elementary outcomes (events) of a
random experiment and they have to be:
(i) exhaustive: any possible outcome of the experiment has to be included in Ω;
(ii) exclusive: there is no overlap of elementary results.


To study random phenomena we start by specifying the sample space and, therefore,
we have to have a clear idea of what the possible results of the experiment are.
To fix ideas, consider the simple experiment of rolling a die with 6 faces
numbered from 1 to 6. We consider as elementary events

ei = {get the number i on the upper face}; i = 1, . . ., 6

so Ω = {e1, . . ., e6}. Note that any possible outcome of the roll is included in Ω and
we can not have two or more elementary results simultaneously. But there are other
types of events besides the elementary ones. We may be interested for instance in
the parity of the number so we would like to consider also the possible results1

A = {get an even number} and Ac = {get an odd number}

They are not elementary since the result A = {e2, e4, e6} is equivalent to get e2, e4
or e6 and Ac = Ω \ A to get e1, e3 or e5. In general, an event is any subset2 of the
sample space and we shall distinguish between:

elementary events: any element of the sample space Ω;

events: any subset of the sample space;

and two extreme events:

sure events: SS = {get any result contained in Ω} ≡ Ω

impossible events: SI = {get any result not contained in Ω} ≡ ∅

Any event that is neither sure nor impossible is called a random event. Going back to
the rolling of the die, sure events are

SS = {get a number n | 1 ≤ n ≤ 6} = Ω or

SS = {get a number that is even or odd} = Ω

impossible events are

SI = {get an odd number that is not prime} = ∅ or

SI = {get the number 7} = ∅

1 Given two sets A, B ⊂ Ω, we shall denote by Ac the complement of A (that is, the set of all elements
of Ω that are not in A) and by A\B ≡ A ∩ Bc the set difference or relative complement of B in A
(that is, the set of elements that are in A but not in B). It is clear that Ac = Ω\A.
2 This is not completely true if the sample space is non-denumerable since there are subsets that can

not be considered as events. It is however true for the subsets of Rn we shall be interested in. We
shall talk about that in Sect. 1.1.2.2.

and random events are any of the ei or, for instance,

Sr = {get an even number} = {e2 , e4 , e6 }
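As a concrete sketch of these definitions (plain Python sets standing in for events; the representation by face numbers is ours, not the text's):

```python
# Sample space of the die roll: the six elementary events e_i,
# represented here simply by the face numbers.
Omega = {1, 2, 3, 4, 5, 6}

A  = {n for n in Omega if n % 2 == 0}   # {get an even number} = {2, 4, 6}
Ac = Omega - A                          # {get an odd number}  = {1, 3, 5}

sure       = {n for n in Omega if 1 <= n <= 6}   # any result in Omega
impossible = {n for n in Omega if n == 7}        # {get the number 7}

print(sure == Omega, impossible == set())   # True True
```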

Depending on the number of possible outcomes of the experiment, the sample
space can be:

finite: if the number of elementary events is finite;

Example: In the rolling of a die, Ω = {ei ; i = 1, . . . , 6}

so dim(Ω) = 6.

countable: when there is a one-to-one correspondence between the

elements of Ω and N;

Example: Consider the experiment of flipping a coin and stopping

when we get H. Then Ω = {H, TH, TTH, TTTH, . . .}.

non-denumerable: if it is none of the previous;

Example: For the decay time of an unstable particle Ω =

{t ∈ R | t ≥ 0} = [0, ∞) and for the production polar angle
of a particle Ω = {θ ∈ R | 0 ≤ θ ≤ π} = [0, π].
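As a small illustration of the countable case (our own sketch, not part of the original text), the outcomes of the flip-until-heads experiment can be put in one-to-one correspondence with the natural numbers:

```python
# Countable sample space of "flip a coin until we get H":
# the outcome with n tails before the first head corresponds to n in N,
# which makes the one-to-one correspondence with N explicit.
def outcome(n):
    return "T" * n + "H"

print([outcome(n) for n in range(4)])   # ['H', 'TH', 'TTH', 'TTTH']
```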

It is important to note that the events are not necessarily numerical entities. We
could have for instance the die with colored faces instead of numbers. We shall deal
with that when discussing random quantities. Last, given a sample space Ω we shall
talk quite frequently about a partition (or a complete system of events); that is, a
sequence {Si} of events, finite or countable, such that

Ω = ∪i Si (complete system) and Si ∩ Sj = ∅ ∀ i ≠ j (disjoint events).
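The two defining conditions of a partition can be checked mechanically. A minimal sketch (our own illustration, using the even/odd partition of the die):

```python
from itertools import combinations

def is_partition(Omega, S):
    """Check that S = [S1, S2, ...] is a partition of Omega:
    the union of the Si is Omega and the Si are pairwise disjoint."""
    complete = set().union(*S) == set(Omega)
    disjoint = all(not (Si & Sj) for Si, Sj in combinations(S, 2))
    return complete and disjoint

Omega = {1, 2, 3, 4, 5, 6}                             # the die of the text
print(is_partition(Omega, [{2, 4, 6}, {1, 3, 5}]))     # True: even/odd
print(is_partition(Omega, [{1, 2, 3}, {3, 4, 5, 6}]))  # False: 3 repeated
```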

1.1.2 σ-algebras (BΩ) and Measurable Spaces (Ω, BΩ)

As we have mentioned, in most cases we are interested in events other than the
elementary ones. We single them out in a class of events that contains all the
possible results of the experiment we are interested in, such that when we ask
about the union, intersection and complements of events we obtain elements that
belong to the same class. A non-empty family B = {Si}_{i=1}^n of subsets of the sample
space Ω that is closed (or stable) under the operations of union and complement;

that is

Si ∪ Sj ∈ B; ∀ Si, Sj ∈ B and Si^c ∈ B; ∀ Si ∈ B

is an algebra (Boole algebra) if Ω is finite. It is easy to see that if it is closed
under unions and complements it is also closed under intersections and the following
properties hold for all Si, Sj ∈ B:

Ω ∈ B; ∅ ∈ B; Si ∩ Sj ∈ B;
Si^c ∪ Sj^c ∈ B; (Si^c ∪ Sj^c)^c ∈ B; Si \ Sj ∈ B;
∪_{i=1}^m Si ∈ B; ∩_{i=1}^m Si ∈ B

Given a sample space Ω we can construct different Boole algebras depending on
the events of interest. The smallest one is Bm = {∅, Ω}; the minimum algebra that
contains the event A ⊂ Ω has 4 elements: B = {∅, Ω, A, Ac}; and the largest one,
BM = {∅, Ω, all possible subsets of Ω}, will have 2^dim(Ω) elements. From BM we
can generate any other algebra by a finite number of unions and intersections of its
elements.
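This remark can be made concrete with a small brute-force sketch (our own illustration, not from the text): starting from the single event A and closing repeatedly under union and complement, one recovers exactly the four-element algebra {∅, Ω, A, Ac}.

```python
from itertools import combinations

def generated_algebra(Omega, seeds):
    """Smallest family containing `seeds` that is closed under
    union and complement (a Boole algebra, since Omega is finite)."""
    B = {frozenset(S) for S in seeds}
    while True:
        new = {Omega - S for S in B}                      # complements
        new |= {S | T for S, T in combinations(B, 2)}     # pairwise unions
        if new <= B:                                      # nothing new: done
            return B
        B |= new

Omega = frozenset({1, 2, 3, 4, 5, 6})
A = frozenset({2, 4, 6})
B = generated_algebra(Omega, [A])
print(len(B))   # 4: the algebra {∅, Omega, A, A^c}
```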

1.1.2.1 σ-algebras

If the sample space is countable, we have to generalize the Boole algebra such that
the unions and intersections can be done a countable number of times, getting always
events that belong to the same class; that is:

∪_{i=1}^∞ Si ∈ B and ∩_{i=1}^∞ Si ∈ B

with {Si}_{i=1}^∞ ∈ B. These algebras are called σ-algebras. Not all the Boole algebras
satisfy these properties but the σ-algebras are always Boole algebras (closed under
finite union).

Consider for instance an infinite set E and the class A of subsets of E that are either
finite or have finite complements. The finite union of subsets of A belongs to A
because the finite union of finite sets is a finite set and the finite union of sets that
have finite complements has a finite complement. However, a countable union of
finite sets can be an infinite set with an infinite complement (for E = N, take the
union of all the even numbers) and then it does not belong
to A. Thus, A is a Boole algebra but not a σ-algebra.
Let now E be any infinite set and B the class of subsets of E that are either
countable or have countable complements. The finite or countable union of countable
sets is countable and therefore belongs to B. The finite or countable union of sets
whose complement is countable has a countable complement and also belongs to B.
Thus, B is a Boole algebra and a σ-algebra.

1.1.2.2 Borel σ-algebras

Eventually, we are going to assign a probability to the events of interest that belong
to the algebra and, anticipating concepts, probability is just a bounded measure so
we need a class of measurable sets with the structure of a σ-algebra. Now, it turns out
that when the sample space Ω is a non-denumerable topological space there exist
non-measurable subsets that obviously can not be considered as events.3 We are
particularly interested in R (or, in general, in Rn) so we have to construct a family
BR of measurable subsets of R that is
(i) closed under a countable number of intersections: {Bi}_{i=1}^∞ ∈ BR −→ ∩_{i=1}^∞ Bi ∈ BR;
(ii) closed under complements: B ∈ BR −→ B^c = R\B ∈ BR.

Observe that, for instance, the family of all subsets of R satisfies the conditions
(i) and (ii), and the intersection of any collection of families that satisfy them is a
family that also fulfills these conditions, but not all of them are measurable. Measurability is the
key condition. Let's start by identifying what we shall consider the basic sets in R
to generate an algebra. The sample space R is a linear set of points and, among its
subsets, we have the intervals. In particular, if a ≤ b are any two points of R we
have:
• open intervals: (a, b) = {x ∈ R | a < x < b}
• closed intervals: [a, b] = {x ∈ R | a ≤ x ≤ b}
• half-open intervals on the right: [a, b) = {x ∈ R | a ≤ x < b}
• half-open intervals on the left: (a, b] = {x ∈ R | a < x ≤ b}
When a = b the closed interval reduces to a point {x = a} (degenerate interval)
and the other three to the null set and, when a → −∞ or b → ∞, we have the infinite
intervals (−∞, b), (−∞, b], (a, ∞) and [a, ∞). The whole space R can be consid-
ered as the interval (−∞, ∞) and any interval will be a subset of R. Now, consider
the class of all intervals of R of any of the aforementioned types. It is clear that the
intersection of a finite or countable number of intervals is an interval but the union
is not necessarily an interval; for instance [a1 , b1 ] ∪ [a2 , b2 ] with a2 > b1 is not an
interval. Thus, this class is not additive and therefore not a closed family. However,

3 It is not difficult to show the existence of Lebesgue non-measurable sets in R. One simple example
is the Vitali set constructed by G. Vitali in 1905, although there are other interesting examples
(Hausdorff, Banach–Tarski) and they all assume the Axiom of Choice. In fact, the work of R.M.
Solovay around the 1970s shows that one can not prove the existence of Lebesgue non-measurable sets
without it. However, one can not specify the choice function, so one can prove their existence but
can not make an explicit construction in the sense Set Theorists would like. In Probability Theory,
we are interested only in Lebesgue measurable sets so those which are not have nothing to do in
this business and Borel's algebra contains only measurable sets.

it is possible to construct an additive class including, along with the intervals, other
measurable sets so that any set formed by countably many operations of unions,
intersections and complements of intervals is included in the family. Suppose, for
instance, that we take the half-open intervals on the right [a, b), b > a, as the initial
class of sets4 to generate the algebra BR so they are in the bag to start with. The
open, closed and degenerate intervals are

(a, b) = ∪_{n=1}^∞ [a + 1/n, b); [a, b] = ∩_{n=1}^∞ [a, b + 1/n) and {a} = {x ∈ R | x = a} = [a, a]

so they go also to the bag as well as the half-open intervals (a, b] = (a, b) ∪ [b, b]
and the countable union of unitary sets and their complements. Thus, countable sets
like N , Z or Q are in the bag too. Those are the sets we shall deal with.
The smallest family BR (or simply B) of measurable subsets of R that contains
all intervals and is closed under complements and a countable number of intersections
has the structure of a σ-algebra; it is called Borel's algebra and its elements are gener-
ically called Borel's sets or borelians. Last, recall that half-open sets are Lebesgue
measurable (λ((a, b]) = b − a) and so is any set built up from a countable number of
unions, intersections and complements, so all Borel sets are Lebesgue measurable and
every Lebesgue measurable set differs from a Borel set by at most a set of measure
zero. Whatever has been said about R is applicable to the n-dimensional euclidean
space Rn.
The pair (Ω, BΩ) is called a measurable space and in the next section it will be
equipped with a measure and "upgraded" to a measure space and eventually to a
probability space.

1.1.3 Set Functions and Measure Space: (Ω, BΩ, μ)

A function f : A ∈ B −→ R that assigns to each set A ∈ B one, and only one, real
number, finite or not, is called a set function. Given a sequence {Ai}_{i=1}^n of subsets
of B, pair-wise disjoint (Ai ∩ Aj = ∅; i, j = 1, . . ., n; i ≠ j), we say that the set
function is additive (finitely additive) if:

f(∪_{i=1}^n Ai) = Σ_{i=1}^n f(Ai)

or σ-additive if, for a countable sequence {Ai}_{i=1}^∞ of pair-wise disjoint sets of B,

f(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ f(Ai)

4 The same algebra is obtained if one starts with (a, b), (a, b] or [a, b].

It is clear that any σ-additive set function is additive but the converse is not true.
A countably additive set function is a measure on the algebra BΩ, a signed measure
in fact. If the σ-additive set function is μ : A ∈ BΩ −→ [0, ∞) (i.e., μ(A) ≥ 0) for
all A ∈ BΩ it is a non-negative measure. In what follows, whenever we talk about
measures μ, ν, . . . on a σ-algebra we shall assume that they are always non-negative
measures without further specification. If μ(A) = 0 we say that A is a set of zero
measure.
The "trio" (Ω, BΩ, μ), with Ω a non-empty set, BΩ a σ-algebra of the sets of Ω
and μ a measure over BΩ, is called a measure space and the elements of BΩ measurable
sets.
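To see σ-additivity at work in a concrete case (a numerical sketch of ours, not from the text), take the Lebesgue measure of half-open intervals, μ([a, b)) = b − a, and the pairwise disjoint intervals An = [1/(n+1), 1/n), whose countable union is (0, 1): the partial sums of the measures telescope to 1 − 1/(N+1) and converge to μ((0, 1)) = 1.

```python
from fractions import Fraction

def length(a, b):
    """Lebesgue measure (length) of the half-open interval [a, b)."""
    return b - a

# The pairwise disjoint intervals A_n = [1/(n+1), 1/n), n = 1, 2, ...,
# have union (0, 1).  sigma-additivity requires the lengths to sum to 1;
# the partial sums approach it from below (exact rational arithmetic).
N = 1000
partial = sum(length(Fraction(1, n + 1), Fraction(1, n)) for n in range(1, N + 1))
print(partial)   # 1000/1001, i.e. 1 - 1/(N+1)
```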
In the particular case of the n-dimensional euclidean space Ω = Rn, the σ-algebra
is the Borel algebra and all the Borel sets are measurable. Thus, the intervals I of
any kind are measurable sets and satisfy that
(i) If I ⊂ R is measurable −→ I^c = R\I is measurable;
(ii) If {Ii}_{i=1}^∞ ⊂ R are measurable −→ ∪_{i=1}^∞ Ii is measurable.
Countable sets are Borel sets of zero measure for, if μ is the Lebesgue measure (see
Appendix 2), we have that μ([a, b)) = b − a and therefore:

μ({a}) = lim_{n→∞} μ([a, a + 1/n)) = lim_{n→∞} 1/n = 0

Thus, any point is a Borel set with zero Lebesgue measure and, μ being a σ-additive
function, any countable set has zero measure. The converse is not true since there
are borelians with zero measure that are not countable (e.g. Cantor's ternary set).
In general, a measure μ over B satisfies that, for any A, B ∈ B not necessarily
disjoint:

(m.1) μ(A ∪ B) = μ(A) + μ(B\A)


(m.2) μ(A ∪ B) = μ(A) + μ(B) − μ(A ∩ B) (μ(A ∪ B) ≤ μ(A) + μ(B))
(m.3) If A ⊆ B, then μ(B\A) = μ(B) − μ(A) (≥0 since μ(B) ≥ μ(A))
(m.4) μ(∅) = 0

(m.1) A ∪ B is the union of the two disjoint sets A and B\A and the measure is an
additive set function;
(m.2) A ∩ B^c and B are disjoint and their union is A ∪ B, so μ(A ∪ B) = μ(A ∩ B^c) +
μ(B). On the other hand, A ∩ B^c and A ∩ B are disjoint and their union is A, so
μ(A ∩ B^c) + μ(A ∩ B) = μ(A). It is enough to substitute μ(A ∩ B^c) in the
previous expression;
(m.3) from (m.1) and considering that, if A ⊆ B, then A ∪ B = B;
(m.4) from (m.3) with B = A.
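The four properties can be verified directly on a simple concrete measure, e.g. the counting measure μ(A) = card(A) on subsets of a finite set (a sketch of ours, not from the text):

```python
# Counting measure on subsets of a finite sample space: mu(A) = |A|.
def mu(A):
    return len(A)

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

# (m.1) mu(A ∪ B) = mu(A) + mu(B \ A)
assert mu(A | B) == mu(A) + mu(B - A)
# (m.2) inclusion-exclusion, and subadditivity as a consequence
assert mu(A | B) == mu(A) + mu(B) - mu(A & B)
assert mu(A | B) <= mu(A) + mu(B)
# (m.3) for C ⊆ A: mu(A \ C) = mu(A) - mu(C)
C = {1, 2}
assert mu(A - C) == mu(A) - mu(C)
# (m.4) mu(∅) = 0
assert mu(set()) == 0
print("all four properties hold")
```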
A measure μ over a measurable space (Ω, BΩ) is finite if μ(Ω) < ∞ and σ-finite
if Ω = ∪_{i=1}^∞ Ai, with Ai ∈ BΩ and μ(Ai) < ∞. Clearly, any finite measure is σ-finite
but the converse is not necessarily true. For instance, the Lebesgue measure λ in
(Rn, BRn) is not finite because λ(Rn) = ∞ but is σ-finite because

Rn = ∪_{k∈N} [−k, k]^n

and λ([−k, k]^n) = (2k)^n is finite. As we shall see in Chap. 2, in some circumstances
we shall be interested in the limiting behaviour of σ-finite measures over a sequence
of compact sets. As a second example, consider the measurable space (R, B) and μ
such that for A ∈ B, μ(A) = card(A) if A is finite and ∞ otherwise. Since R is
an uncountable union of finite sets, μ is not σ-finite in R. However, it is σ-finite in
(N, BN).

1.1.3.1 Probability Measure

Let (Ω, BΩ) be a measurable space. A measure P over BΩ (that is, with domain
in BΩ), image in the closed interval [0, 1] ⊂ R and such that P(Ω) = 1 (finite) is
called a probability measure and its properties are just those of finite (non-negative)
measures. Making the axioms explicit, a probability measure is a set function with domain
in BΩ and image in the closed interval [0, 1] ⊂ R that satisfies three axioms:

(i) additivity: it is an additive set function;
(ii) non-negativity: it is a measure;
(iii) certainty: P(Ω) = 1.

These properties obviously coincide with those of the frequency and combinatorial
probability (see Note 1). All probability measures are finite (P(Ω) = 1) and any
bounded measure can be converted into a probability measure by proper normalization.
The measurable space (Ω, BΩ) provided with a probability measure P is called
the probability space (Ω, BΩ, P). It is straightforward to see that if A, B ∈ BΩ, then:

(p.1) P(Ac ) = 1 − P(A)


(p.2) P(∅) = 0
(p.3) P(A ∪ B) = P(A) + P(B\A) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B))

Property (p.3) can be extended by recurrence to an arbitrary number of events
{A_i}_{i=1}^n ∈ B_Ω: if S_k = ∪_{j=1}^k A_j, then S_k = A_k ∪ S_{k−1} and P(S_n) = P(A_n) +
P(S_{n−1}) − P(A_n ∩ S_{n−1}).
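This recurrence can be checked by direct enumeration on a small equiprobable sample space. The Python sketch below (the die and the three events are arbitrary illustrative choices, not from the text) accumulates P(S_k) step by step and compares the result with the probability of the union computed directly:

```python
from fractions import Fraction

def prob(event, omega):
    # Laplace's rule on an equiprobable finite sample space: P(A) = card(A)/card(Omega)
    return Fraction(len(event & omega), len(omega))

omega = frozenset(range(1, 7))                      # one roll of a die
events = [frozenset({1, 2}), frozenset({2, 3, 4}), frozenset({4, 5})]

# Recurrence: S_k = A_k U S_{k-1} and P(S_k) = P(A_k) + P(S_{k-1}) - P(A_k n S_{k-1})
s, p = frozenset(), Fraction(0)
for a in events:
    p = prob(a, omega) + p - prob(a & s, omega)
    s = s | a

assert p == prob(s, omega)   # agrees with the direct probability of the union
```

Exact rational arithmetic (`Fraction`) avoids any floating-point noise in the comparison.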
Last, note that in the probability space (R, B, P) (or in (R^n, B^n, P)), the set of
points W = {x ∈ R | P(x) > 0} is countable. Consider the partition

W = ∪_{k=1}^∞ W_k where W_k = {x ∈ R | 1/(k + 1) < P(x) ≤ 1/k}

If x ∈ W then it belongs to one W_k and, conversely, if x belongs to one W_k then it
belongs to W. Each set W_k has at most k points, for otherwise the sum of the probabilities
of its elements would be P(W_k) > 1. Thus, the sets W_k are finite and since W is a countable
1.1 The Elements of Probability: (, B, μ) 9

union of finite sets, it is itself a countable set. In consequence, we can assign finite probabilities
to at most a countable subset of R.
NOTE 1: What is probability?
It is very interesting to see how, along the 500 years of the history of probability, many
people (Galileo, Fermat, Pascal, Huygens, Bernoulli, Gauss, De Moivre, Poisson,…)
approached different problems and developed concepts and theorems (Laws of
Large Numbers, Central Limit, Expectation, Conditional Probability,…), while a proper
definition of probability remained so elusive. Certainly there is a before and after Kolmogorov's
"General Theory of Measure and Probability Theory" and "Grundbegriffe
der Wahrscheinlichkeitsrechnung", so from the mathematical point of view the question
is clear after the 1930s. But, as Poincaré said in 1912: "It is very difficult to give a
satisfactory definition of Probability". Intuitively, what is probability?
The first "definition" of probability was the Combinatorial Probability (∼1650).
This is an objective concept (i.e., independent of the individual) and is based on
Bernoulli's Principle of Symmetry or Insufficient Reason: all the possible outcomes
of the experiment are equally likely. For its evaluation we have to know the cardinal
(ν(·)) of all possible results of the experiment (ν(Ω)), and the probability for an
event A ⊂ Ω is "defined" by Laplace's rule: P(A) = ν(A)/ν(Ω). This concept
of probability, implicitly admitted by Pascal and Fermat and explicitly stated by
Laplace, is an a priori probability in the sense that it can be evaluated before, or even
without, doing the experiment. It is however meaningless if Ω is a countably infinite set
(ν(Ω) = ∞), and one has to justify the validity of the Principle of Symmetry, which
does not always hold and has originated some interesting debates. For instance, in a problem
attributed to D'Alembert, a player A tosses a coin twice and wins if H appears in at
least one toss. According to Fermat, one can get {(T T ), (T H ), (H T ), (H H )} and A
will lose only in the first case so, the four cases being equally likely, the probability
for A to win is P = 3/4. Pascal gave the same result. However, for Roberval one
should consider only {(T T ), (T H ), (H ·)} because A has already won if H appears
at the first toss, so P = 2/3. Obviously, Fermat and Pascal were right because, in
this last case, the three possibilities are not all equally likely and the Principle of
Symmetry does not apply.
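D'Alembert's problem is small enough to settle by brute-force enumeration; a possible Python sketch (the encoding of the outcomes as tuples of characters is ours):

```python
from itertools import product

# The four equally likely outcomes of two tosses, as Fermat listed them
outcomes = list(product("HT", repeat=2))     # (H,H), (H,T), (T,H), (T,T)
wins = [o for o in outcomes if "H" in o]     # A loses only on (T,T)
p_win = len(wins) / len(outcomes)
assert p_win == 3 / 4                        # Fermat and Pascal's answer, not 2/3
```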
The second interpretation of probability is the Frequentist Probability, based
on the idea of frequency of occurrence of an event. If we repeat the experiment n
times and a particular event A_i appears n_i times, the relative frequency of occurrence
is f(A_i) = n_i/n. As n grows, it is observed (an experimental fact) that this number
stabilizes around a certain value and, in consequence, the probability of occurrence
of A_i is defined as P(A_i) ≡ lim_{n→∞} f(A_i). This is an objective concept inasmuch as
it is independent of the observer, and an a posteriori one since it is based on what has
been observed after the experiment has been done, through an experimental limit that
obviously is not attainable. In this sense, it is more a practical rule than a definition.
It was also implicitly assumed by Pascal and Fermat (letters of de Méré to Pascal:
I have observed in my die games…), by Bernoulli in his Ars Conjectandi of 1705
(Law of Large Numbers), and it was finally made explicit at the beginning of the
XX century (Fisher and von Mises).
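The stabilization of the relative frequency is easy to watch with a simulated fair coin. The sketch below is only an illustration (the number of tosses and the seed are arbitrary choices):

```python
import random

random.seed(1)                       # any seed; just makes the run reproducible
n = 100_000
# each toss: heads with probability 1/2
heads = sum(random.random() < 0.5 for _ in range(n))
f = heads / n                        # relative frequency of heads after n tosses

# the frequency has stabilized around the probability 1/2
assert abs(f - 0.5) < 0.01
```

Of course the simulation only illustrates the behaviour; the limit itself is, as the text says, not attainable.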

Both interpretations of probability are restricted to observable quantities. What
happens, for instance, if they are not directly observable? What if we cannot repeat the
experiment a large number of times and/or under the same conditions? Suppose that
you jump from the third floor down to the ground (an imaginary experiment). Certainly, we
can talk about the probability that you break your leg but, how many times can we
repeat the experiment under the same conditions?
During the XX century several people tried to pin down the concept of probability.
Peirce and, mainly, Popper argued that probability represents the propensity of
Nature to give a particular result in a single trial, without any need to appeal to "large
numbers". This assumes that the propensity, and therefore the probability, exists in
an objective way even though the causes may be difficult to understand. Others, like
Knight, proposed that randomness is not a measurable property but just a problem
of knowledge. If we toss a coin and know precisely its shape, mass, the acting forces,
the environmental conditions,… we should be able to determine with certainty whether the
result will be head or tail; but since we lack the necessary information we cannot
predict the outcome with certainty, so we are led to treat this as a random process
and use the Theory of Probability. Physics suggests that it is not only a question of
knowledge: randomness lies deep in the way Nature behaves.
The idea that probability is a quantification of the degree of belief that we have
in the occurrence of an event was used, in a more inductive manner, by Bayes and,
as we shall see, Bayes's theorem and the idea of information play an essential role
in its axiomatization. To quote again Poincaré: "… the probability of the causes,
the most important from the point of view of scientific applications." It remained
an open question whether this quantification is subjective or not. In the 1920s,
Keynes argued that it is not because, if we know all the elements and factors
of the experiment, what is likely to occur or not is determined in an objective sense
regardless of our opinion. On the contrary, Ramsey and de Finetti argued that
the probability to be assigned to a particular event depends on the degree of
knowledge we have (personal beliefs), and this does not have to be shared by everybody,
so it is subjective. Furthermore, they started the way towards a mathematical
formulation of this concept of probability consistent with Kolmogorov's axiomatic
theory. Thus, within the Bayesian spirit, it is logical and natural to consider that
probability is a measure of the degree of belief we have in the occurrence of an
event that characterizes the random phenomenon, and we shall assign probabilities to
events based on the prior knowledge we have. In fact, to some extent, all statistical
procedures used for the analysis of natural phenomena are subjective inasmuch as
they are all based on mathematical idealizations of Nature and all require a priori
judgements and hypotheses that have to be assumed.

1.1.4 Random Quantities

In many circumstances, the possible outcomes of the experiments are not numeric
(a die with colored faces, a person may be sick or healthy, a particle may decay

in different modes,…) and, even in the case they are, the possible outcomes of the
experiment may form a non-denumerable set. Ultimately, we would like to deal with
numeric values and benefit from the algebraic structure of the real numbers and from the
theory behind measurable functions. For this, given a measurable space (Ω, B_Ω),
we define a function X(w) : w ∈ Ω → R that assigns to each event w of the sample
space Ω one and only one real number.

In a more formal way, consider two measurable spaces (Ω, B_Ω) and (Ω′, B_Ω′) and
a function

X(w) : w ∈ Ω → X(w) ∈ Ω′

Obviously, since we are interested in the events that conform the σ-algebra B_Ω,
the same structure has to be maintained in (Ω′, B_Ω′) by the mapping X(w), for
otherwise we won't be able to answer the questions of interest. Therefore, we require
the function X(w) to be Lebesgue measurable with respect to the σ-algebra B_Ω; i.e.:

X^{−1}(B′) = B ⊆ B_Ω ∀ B′ ∈ B_Ω′

so we can ultimately identify P(B′) with P(B). Usually, we are interested in the
case Ω′ = R (or R^n), so B_Ω′ is the Borel σ-algebra and, since we have generated
the Borel algebra B from half-open intervals on the left I_x = (−∞, x] with x ∈ R,
we have that X(w) will be a Lebesgue measurable function over the Borel algebra
(Borel measurable) if, and only if:

X^{−1}(I_x) = {w ∈ Ω | X(w) ≤ x} ∈ B_Ω ∀ x ∈ R

We could have generated as well the Borel algebra from open, closed or half-open
intervals on the right so any of the following relations, all equivalent, serve to define
a Borel measurable function X (w):
(1) {w|X (w) > c} ∈ B ∀c ∈ R;
(2) {w|X (w) ≥ c} ∈ B ∀c ∈ R;
(3) {w|X (w) < c} ∈ B ∀c ∈ R;
(4) {w|X (w) ≤ c} ∈ B ∀c ∈ R
To summarize:
• Given a probability space (Ω, B_Ω, Q), a random variable is a function X(w) :
Ω → R, Borel measurable over the σ-algebra B_Ω, that allows us to work with the
induced probability space (R, B, P).^5
From this definition, it is clear that the name "random variable" is quite unfortunate
inasmuch as it is a single-valued function, neither random nor variable. Thus, at least to
get rid of "variable", the term "random quantity" is frequently used to designate

^5 It is important to note that a random variable X(w) : Ω → R is measurable with respect to the
σ-algebra B_Ω.

a numerical entity associated to the outcome of an experiment, an outcome that is
uncertain before we actually do the experiment and observe the result. We distinguish
between the random quantity X(w), which we shall write in upper case and usually
as X, the dependence on w being understood, and the value x (lower case) taken in
a particular realization of the experiment. If the function X takes values in Ω_X ⊆ R
it will be a one-dimensional random quantity and, if the image is Ω_X ⊆ R^n, it will
be an ordered n-tuple of real numbers (X_1, X_2, …, X_n). Furthermore, attending to
the cardinality of Ω_X, we shall talk about discrete random quantities if it is finite or
countable and about continuous random quantities if it is uncountable. This will be
explained in more depth in Sect. 1.3.1. Last, if for each w ∈ Ω we have |X(w)| < k with
k finite, we shall talk about a bounded random quantity.
The properties of random quantities are those of measurable functions. In
particular, if X(w) : Ω → Ω′ is measurable with respect to B_Ω and Y(x) : Ω′ → Ω′′ is
measurable with respect to B_Ω′, the composition Y(X(w)) : Ω → Ω′′ is measurable with
respect to B_Ω and therefore is a random quantity. We have then that

P(Y ≤ y) = P(Y(X) ≤ y) = P(X ∈ Y^{−1}(I_y))

where Y^{−1}(I_y) is the set of x ∈ Ω′ such that Y(x) ≤ y.

Example 1.1 Consider the measurable space (Ω, B_Ω) and X(w) : Ω → R. Then:

• X(w) = k, constant in R. Denoting A = {w ∈ Ω | X(w) > c}, we have that if
c ≥ k then A = ∅ and if c < k then A = Ω. Since {∅, Ω} ∈ B_Ω, we conclude that
X(w) is a measurable function. In fact, it is left as an exercise to show that for
the minimal algebra B_min = {∅, Ω}, the only measurable functions are
X(w) = constant.
• Let G ∈ B_Ω and X(w) = 1_G(w) (see Appendix 1.1). We have that if I_a = (−∞, a]
with a ∈ R, then a ∈ (−∞, 0) → X^{−1}(I_a) = ∅, a ∈ [0, 1) → X^{−1}(I_a) = G^c, and
a ∈ [1, ∞) → X^{−1}(I_a) = Ω, so X(w) is a measurable function with respect to
B_Ω. A simple function

X(w) = Σ_{k=1}^n a_k 1_{A_k}(w)

where a_k ∈ R and {A_k}_{k=1}^n is a partition of Ω, is Borel measurable, and any random
quantity that takes a finite number of values can be expressed in this way.
• Let Ω = [0, 1]. It is obvious that if G is a non-Lebesgue-measurable subset
of [0, 1], the function X(w) = 1_{G^c}(w) is not measurable over B_{[0,1]} because
a ∈ [0, 1) → X^{−1}(I_a) = G ∉ B_{[0,1]}.
• Consider a coin tossing, the elementary events

e1 = {H} and e2 = {T} → Ω = {e1, e2}

the algebra B_Ω = {∅, Ω, {e1}, {e2}} and the function X : Ω → R that denotes the
number of heads

X(e1) = 1 and X(e2) = 0

Then, for I_a = (−∞, a] with a ∈ R we have that:

a ∈ (−∞, 0) → X^{−1}(I_a) = ∅ ∈ B_Ω
a ∈ [0, 1) → X^{−1}(I_a) = {e2} ∈ B_Ω
a ∈ [1, ∞) → X^{−1}(I_a) = {e1, e2} = Ω ∈ B_Ω

so X(w) is measurable in (Ω, B_Ω, P) and therefore an admissible random quantity
with P(X = 1) = P(e1) and P(X = 0) = P(e2). It will not be an admissible
random quantity for the trivial minimal algebra B_min = {∅, Ω} since {e2} ∉ B_min.
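The measurability check of this last example can be carried out mechanically. In the sketch below the string encoding of the elementary events is a hypothetical choice of representation:

```python
# X(e1) = 1 (heads), X(e2) = 0 (tails); elementary events encoded as strings
X = {"e1": 1, "e2": 0}
algebra = [set(), {"e1", "e2"}, {"e1"}, {"e2"}]   # B = {empty, Omega, {e1}, {e2}}

def preimage(a):
    # X^{-1}((-inf, a]) = {w in Omega | X(w) <= a}
    return {w for w, x in X.items() if x <= a}

# every preimage of a half-open interval must be an event of the algebra
for a in (-1.0, 0.0, 0.5, 1.0, 2.0):
    assert preimage(a) in algebra

# with the minimal algebra {empty, Omega} the check fails for a in [0, 1),
# so X is not an admissible random quantity there
minimal = [set(), {"e1", "e2"}]
assert preimage(0.0) not in minimal
```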

Example 1.2 Let Ω = [0, 1] and consider the sequence of functions X_n(w) =
2^n 1_{Δ_n}(w), where w ∈ Ω, Δ_n = [1/2^n, 1/2^{n−1}] and n ∈ N. Each X_n(w) is measurable
iff ∀r ∈ R, A = {w ∈ Ω | X_n(w) > r} is a Borel set of B_Ω. Then:
(1) r ∈ [2^n, ∞) → A = ∅ ∈ B_Ω with λ(A) = 0;
(2) r ∈ [0, 2^n) → A = [1/2^n, 1/2^{n−1}] ∈ B_Ω with λ(A) = 2/2^n − 1/2^n = 1/2^n;
(3) r ∈ (−∞, 0) → A = [0, 1] = Ω with λ(Ω) = 1.
Thus, each X_n(w) is a measurable function.

Problem 1.1 Consider the experiment of tossing two coins, the elementary events

e1 = {H, H } , e2 = {H, T } , e3 = {T, H } , e4 = {T, T }

the sample space Ω = {e1, e2, e3, e4} and the two algebras

B1 = {∅, , {e1 }, {e4 }, {e1 , e2 , e3 }, {e2 , e3 , e4 }, {e1 , e4 }, {e2 , e3 }}


B2 = {∅, , {e1 , e2 }, {e3 , e4 }}

With respect to which algebras are the functions X(w) : Ω → R such that X(e1) = 2,
X(e2) = X(e3) = 1, X(e4) = 0 (number of heads) and Y(w) : Ω → R such that
Y(e1) = Y(e2) = 1, Y(e3) = Y(e4) = 0 admissible random quantities? (sol.:
X wrt B1; Y wrt B2)

Problem 1.2 Let X_i(w) : Ω → R with i = 1, …, n be random quantities. Show
that

Y = max{X 1 , X 2 } , Y = min{X 1 , X 2 } , Y = sup{X k }nk=1 and Y = inf{X k }nk=1

are admissible random quantities.



Hint: It is enough to observe that

{w|max{X 1 , X 2 } ≤ x} = {w|X 1 (w) ≤ x}∩{w|X 2 (w) ≤ x} ∈ B


{w|min{X 1 , X 2 } ≤ x} = {w|X 1 (w) ≤ x}∪{w|X 2 (w) ≤ x} ∈ B
{w|supn X n (w) ≤ x} = ∩n {w|X n (w) ≤ x} ∈ B
{w|inf n X n (w) < x} = ∪n {w|X n (w) < x} ∈ B.
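For a finite sample space, the set identities in the hint can be verified by direct enumeration; the two random quantities below are arbitrary illustrative choices:

```python
omega = range(8)
X1 = {w: w % 3 for w in omega}           # two arbitrary random quantities on Omega
X2 = {w: (5 * w) % 7 for w in omega}

for x in range(-1, 8):
    # {max(X1, X2) <= x} = {X1 <= x} intersected with {X2 <= x}
    assert ({w for w in omega if max(X1[w], X2[w]) <= x}
            == {w for w in omega if X1[w] <= x} & {w for w in omega if X2[w] <= x})
    # {min(X1, X2) <= x} = {X1 <= x} union {X2 <= x}
    assert ({w for w in omega if min(X1[w], X2[w]) <= x}
            == {w for w in omega if X1[w] <= x} | {w for w in omega if X2[w] <= x})
```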

1.2 Conditional Probability and Bayes Theorem

Suppose an experiment that consists of rolling a die with faces numbered from one
to six, and the event e2 = {get the number two on the upper face}. If the die is fair,
based on the Principle of Insufficient Reason you and your friend would consider it
reasonable to assign equal chances to all the possible outcomes and therefore a
probability P1(e2) = 1/6. Now, if I look at the die and tell you, and only you,
that the outcome of the roll is an even number, you will change your beliefs on the
occurrence of event e2 and assign the new value P2(e2) = 1/3. The two of you assign
different probabilities because you do not share the same knowledge. It may be a
truism, but it is clear that the probability we assign to an event is subjective and is
conditioned by the information we have about the random process. In one way or
another, probabilities are always conditional degrees of belief, since there is always
some state of information (even before we do the experiment we know that whatever
number we get is not less than one and not greater than six) and we always
assume some hypothesis (the die is fair, so we can rely on the Principle of Symmetry).
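Both assignments can be recovered by simple counting on the full and on the reduced sample space; a minimal Python sketch (the set encoding of the events is ours):

```python
from fractions import Fraction

omega = set(range(1, 7))                     # fair die, Laplace's rule
def P(a): return Fraction(len(a & omega), len(omega))

e2, even = {2}, {2, 4, 6}
assert P(e2) == Fraction(1, 6)               # prior assignment P1(e2)
# counting on the reduced sample space where "even" has occurred
assert P(e2 & even) / P(even) == Fraction(1, 3)   # updated assignment P2(e2)
```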
Consider a probability space (Ω, B_Ω, P) and two events A, B ∈ B_Ω that are
not disjoint, so A ∩ B ≠ ∅. The probability for both A and B to happen is P(A ∩
B) ≡ P(A, B). Since Ω = B ∪ B^c and B ∩ B^c = ∅, we have that:

P(A) ≡ P(A ∩ Ω) = P(A ∩ B) + P(A ∩ B^c) = P(A ∩ B) + P(A\B)

where P(A ∩ B) is the probability for A and B to occur and P(A\B) the probability
for A to happen and not B.

What is the probability for A to happen if we know that B has occurred? The probability
of A conditioned to the occurrence of B is called the conditional probability
of A given B and is written P(A|B). This is equivalent to calculating the probability
for A to happen in the probability space (Ω′, B_Ω′, P′), with Ω′ the reduced
sample space where B has already occurred and B_Ω′ the corresponding sub-algebra
that does not contain B^c. We can set P(A|B) ∝ P(A ∩ B) and define (Kolmogorov)
the conditional probability for A to happen once B has occurred as:

P(A|B) ≝ P(A ∩ B)/P(B) = P(A, B)/P(B)

provided that P(B) ≠ 0, for otherwise the conditional probability is not defined. This
normalization factor ensures that P(B|B) = P(B ∩ B)/P(B) = 1. Conditional
probabilities satisfy the basic axioms of probability:

(i) non-negativity: since (A ∩ B) ⊂ B, 0 ≤ P(A|B) ≤ 1;
(ii) unit measure (certainty): P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1;
(iii) σ-additivity: for a countable sequence of disjoint sets {A_i}_{i=1}^∞,

P(∪_{i=1}^∞ A_i | B) = P((∪_{i=1}^∞ A_i) ∩ B)/P(B) = Σ_{i=1}^∞ P(A_i ∩ B)/P(B) = Σ_{i=1}^∞ P(A_i|B)

Generalizing, for n events {A_i}_{i=1}^n we have, with j = 1, …, n − 1, that

P(A_1, …, A_n) = P(A_n, …, A_{j+1} | A_j, …, A_1) P(A_j, …, A_1) =
= P(A_n | A_{n−1}, …, A_1) ⋯ P(A_3 | A_2, A_1) P(A_2 | A_1) P(A_1).

1.2.1 Statistically Independent Events

Two events A, B ∈ B_Ω are statistically independent when the occurrence of one
does not give any information about the occurrence of the other^6; that is, when

P(A, B) = P(A)P(B)

A necessary and sufficient condition for A and B to be independent is that P(A|B) =


P(A) (which implies P(B|A) = P(B)). Necessary because

P(A, B) P(A)P(B)
P(A, B) = P(A)P(B) −→ P(A|B) = = = P(A)
P(B) P(B)

^6 In fact, for the events A, B ∈ B_Ω we should talk about conditional independence, for it is
true that if C ∈ B_Ω, it may happen that P(A, B) = P(A)P(B) but, conditioned on C,
P(A, B|C) ≠ P(A|C)P(B|C), so A and B are related through the event C. On the other hand,
that P(A|B) = P(A) does not imply that B has a "direct" effect on A. Whether this is the case or
not has to be determined by reasoning on the process and/or additional evidence. Bernard Shaw
said that we all should buy an umbrella because there is statistical evidence that doing so gives you
a higher life expectancy. And this is certainly true. However, it is more reasonable to suppose that,
instead of umbrellas having any mysterious influence on our health, in London at the beginning
of the XX century, if you could afford to buy an umbrella you most likely had a well-off status,
healthy living conditions, access to medical care,…

Sufficient because

P(A|B) = P(A) −→ P(A, B) = P(A|B)P(B) = P(A)P(B)

If this is not the case, we say that they are statistically dependent or correlated. In
general, we have that:

P(A|B) > P(A) → the events A and B are positively correlated; that is,
that B has already occurred increases the chances for
A to happen;

P(A|B) < P(A) → the events A and B are negatively correlated; that is,
that B has already occurred reduces the chances for A
to happen;

P(A|B) = P(A) → the events A and B are not correlated so the occur-
rence of B does not modify the chances for A to hap-
pen.

Given a finite collection of events A = {A_i}_{i=1}^n with A_i ∈ B_Ω, they are statistically
independent if

P(A_{i_1}, …, A_{i_m}) = P(A_{i_1}) ⋯ P(A_{i_m})

for any finite subsequence {A_{i_1}, …, A_{i_m}}, 2 ≤ m ≤ n, of distinct events. Thus, for instance,
for a sequence of 3 events {A_1, A_2, A_3} the condition of independence requires that:

P(A1 , A2 ) = P(A1 )P(A2 ); P(A1 , A3 ) = P(A1 )P(A3 ); P(A2 , A3 ) = P(A2 )P(A3 )


and P(A_1, A_2, A_3) = P(A_1)P(A_2)P(A_3)

so the events {A_1, A_2, A_3} may be statistically dependent while being pairwise independent.

dent.

Example 1.3 On four cards (C1, C2, C3 and C4) we write the numbers 1 (C1), 2 (C2),
3 (C3) and 123 (C4) and make a fair random extraction. Let the events be

A_i = {the chosen card has the number i}

with i = 1, 2, 3. Since the extraction is fair, we have that:

P(A_i) = P(C_i) + P(C_4) = 1/2



Now, I look at the card and tell you that it has the number j. Since you know that
A_j has happened, you know that the extracted card was either C_j or C_4, and the
only possibility to have A_i as well as A_j is that the extracted card was C_4, so the
conditional probabilities are

P(A_i|A_j) = 1/2; i, j = 1, 2, 3; i ≠ j

Then, since

P(A_i|A_j) = P(A_i); i, j = 1, 2, 3; i ≠ j

any two events (A_i, A_j) are (pairwise) independent. However:

P(A_1, A_2, A_3) = P(A_1|A_2, A_3) P(A_2|A_3) P(A_3)

and if I tell you that events A_2 and A_3 have occurred, then you are certain that the chosen
card is C_4 and therefore A_1 has happened too, so P(A_1|A_2, A_3) = 1. But then

P(A_1, A_2, A_3) = 1 · (1/2) · (1/2) = 1/4 ≠ P(A_1)P(A_2)P(A_3) = 1/8

so the events {A_1, A_2, A_3} are not independent even though they are pairwise
independent.
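The card example can be checked by enumeration; in the sketch below the cards are represented by the strings written on them (an arbitrary encoding of ours):

```python
from fractions import Fraction

cards = ["1", "2", "3", "123"]              # what is written on C1..C4
def P(event): return Fraction(len(event), len(cards))   # fair extraction

# Ai = {the chosen card has the number i}
A = {i: {c for c in cards if str(i) in c} for i in (1, 2, 3)}

# pairwise independence: P(Ai, Aj) = P(Ai) P(Aj) = 1/4 for i != j
for i in (1, 2, 3):
    for j in (1, 2, 3):
        if i != j:
            assert P(A[i] & A[j]) == P(A[i]) * P(A[j]) == Fraction(1, 4)

# ...but not mutual independence: 1/4 != 1/8
assert P(A[1] & A[2] & A[3]) == Fraction(1, 4)
assert P(A[1]) * P(A[2]) * P(A[3]) == Fraction(1, 8)
```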

Example 1.4 (Bonferroni's Inequality) Given a finite collection A = {A_1, …, A_n} ⊂
B_Ω of events, Bonferroni's inequality states that:

P(A_1 ∩ ⋯ ∩ A_n) ≡ P(A_1, …, A_n) ≥ P(A_1) + ⋯ + P(A_n) − (n − 1)

and gives a lower bound for the joint probability P(A_1, …, A_n). For n = 1 it is
trivially true since P(A_1) ≥ P(A_1). For n = 2 we have that

P(A_1 ∪ A_2) = P(A_1) + P(A_2) − P(A_1 ∩ A_2) ≤ 1 → P(A_1 ∩ A_2) ≥ P(A_1) + P(A_2) − 1

Proceed then by induction: assume the statement is true for n − 1 and see if it holds
for n. Let B_{n−1} = A_1 ∩ ⋯ ∩ A_{n−1} and apply the result we got for n = 2:

P(A_1 ∩ ⋯ ∩ A_n) = P(B_{n−1} ∩ A_n) ≥ P(B_{n−1}) + P(A_n) − 1

but

P(B_{n−1}) = P(B_{n−2} ∩ A_{n−1}) ≥ P(B_{n−2}) + P(A_{n−1}) − 1

so

P(A_1 ∩ ⋯ ∩ A_n) ≥ P(B_{n−2}) + P(A_{n−1}) + P(A_n) − 2

and, iterating, the inequality is demonstrated.
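The inequality is easy to test numerically; the sketch below uses three arbitrary events on an equiprobable 12-point sample space (our own illustrative choice):

```python
from fractions import Fraction
from functools import reduce

omega = frozenset(range(12))
def P(a): return Fraction(len(a), len(omega))

# three arbitrary overlapping events
events = [frozenset(range(0, 9)), frozenset(range(3, 12)), frozenset(range(2, 10))]

joint = reduce(lambda a, b: a & b, events)           # A1 n A2 n A3
bound = sum(P(a) for a in events) - (len(events) - 1)
assert P(joint) >= bound     # Bonferroni: P(joint) >= sum P(Ai) - (n - 1)
```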

1.2.2 Theorem of Total Probability

Consider a probability space (Ω, B_Ω, P) and a partition S = {S_i}_{i=1}^n of the sample
space. Then, for any event A ∈ B_Ω we have that A = A ∩ Ω = A ∩ (∪_{i=1}^n S_i) and
therefore:

P(A) = P(A ∩ (∪_{i=1}^n S_i)) = P(∪_{i=1}^n (A ∩ S_i)) = Σ_{i=1}^n P(A ∩ S_i) = Σ_{i=1}^n P(A|S_i)·P(S_i)

Consider now a second, different partition {B_k}_{k=1}^m of the sample space. Then, for
each set B_k we have

P(B_k) = Σ_{i=1}^n P(B_k|S_i)P(S_i); k = 1, …, m

and

Σ_{k=1}^m P(B_k) = Σ_{i=1}^n P(S_i) Σ_{k=1}^m P(B_k|S_i) = Σ_{i=1}^n P(S_i) = 1

Last, a similar expression can be written for conditional probabilities. Since

P(A, B, S) = P(A|B, S)P(B, S) = P(A|B, S)P(S|B)P(B)

and

P(A, B) = Σ_{i=1}^n P(A, B, S_i)

we have that

P(A|B) = P(A, B)/P(B) = (1/P(B)) Σ_{i=1}^n P(A, B, S_i) = Σ_{i=1}^n P(A|B, S_i)P(S_i|B).

Example 1.5 We have two indistinguishable urns: U1 with three white and two black
balls and U2 with two white and three black balls. What is the probability that
in a random extraction we get a white ball?
Consider the events:

A1 = {choose urn U1}; A2 = {choose urn U2} and B = {get a white ball}

It is clear that A1 ∩ A2 = ∅ and that A1 ∪ A2 = Ω. Now:

P(B|A1) = 3/5; P(B|A2) = 2/5 and P(A1) = P(A2) = 1/2

so we have that

P(B) = Σ_{i=1}^2 P(B|A_i)·P(A_i) = (3/5)(1/2) + (2/5)(1/2) = 1/2

as expected since, out of 10 balls, 5 are white.
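The computation of the example can be reproduced exactly with rational arithmetic; a minimal sketch:

```python
from fractions import Fraction

urns = {"U1": (3, 2), "U2": (2, 3)}     # (white, black) balls in each urn
p_choose = Fraction(1, 2)               # indistinguishable urns, chosen at random

# Theorem of Total Probability: P(B) = sum over i of P(B|Ai) P(Ai)
p_white = sum(Fraction(w, w + b) * p_choose for w, b in urns.values())
assert p_white == Fraction(1, 2)        # 5 white balls out of 10 in total
```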

1.2.3 Bayes Theorem

Given a probability space (Ω, B_Ω, P), we have seen that the joint probability for
two events A, B ∈ B_Ω can be expressed in terms of conditional probabilities as:

P(A, B) = P(A|B)P(B) = P(B|A)P(A)

The Bayes Theorem (Bayes ∼1770s and independently Laplace ∼1770s) states that
if P(B) ≠ 0, then

P(A|B) = P(B|A)P(A)/P(B)

apparently a trivial statement but with profound consequences. Let's see other expressions
of the theorem. If H = {H_i}_{i=1}^n is a partition of the sample space, then

P(A, H_i) = P(A|H_i)P(H_i) = P(H_i|A)P(A)

and from the Total Probability Theorem

P(A) = Σ_{k=1}^n P(A|H_k)P(H_k)

so we have a different expression for Bayes's Theorem:

P(H_i|A) = P(A|H_i)P(H_i)/P(A) = P(A|H_i)P(H_i) / Σ_{k=1}^n P(A|H_k)P(H_k)

Let's summarize the meaning of these terms^7:

P(H_i): the probability of occurrence of the event H_i before we know whether
event A has happened or not; that is, the degree of confidence we
have in the occurrence of the event H_i before we do the experiment,
so it is called the prior probability;

P(A|H_i): the probability for event A to happen given that event H_i has
occurred. This may be different depending on i = 1, 2, …, n and,
when considered as a function of H_i, it is usually called the likelihood;

P(H_i|A): the degree of confidence we have in the occurrence of event H_i
given that the event A has happened. The knowledge that the event
A has occurred provides information about the random process
and modifies the beliefs we had in H_i before the experiment was
done (expressed by P(H_i)), so it is called the posterior probability;

P(A): simply the normalizing factor.

Clearly, if the events A and H_i are independent, the occurrence of A does not provide
any information on the chances for H_i to happen. Whether it has occurred or not does
not modify our beliefs about H_i and therefore P(H_i|A) = P(H_i).
In the first place, it is interesting to note that the occurrence of A restricts the sample
space for H and modifies the prior chances P(H_i) for H_i in the same proportion as
the occurrence of H_i modifies the probability for A, because

P(A|H_i)P(H_i) = P(H_i|A)P(A) → P(H_i|A)/P(H_i) = P(A|H_i)/P(A)

Second, from Bayes Theorem we can obtain relative posterior probabilities (in the
case, for instance, that P(A) is unknown) because

P(H_i|A)/P(H_j|A) = [P(A|H_i)/P(A|H_j)] · [P(H_i)/P(H_j)]

Last, conditioning all the probabilities on H_0 (maybe some conditions that are
assumed), we get a third expression of Bayes Theorem:
^7 Although it is usually the case, the terms prior and posterior do not necessarily imply a temporal
ordering.

P(H_i|A, H_0) = P(A|H_i, H_0)P(H_i|H_0)/P(A|H_0) = P(A|H_i, H_0)P(H_i|H_0) / Σ_{k=1}^n P(A|H_k, H_0)P(H_k|H_0)

where H_0 represents some initial state of information or some conditions that are
assumed. The posterior degree of credibility we have in H_i is certainly meaningful
when we have an initial degree of information, and it is therefore relative to our prior
beliefs. And those are subjective inasmuch as different people may assign a different
prior degree of credibility based on their previous knowledge and experiences. Think
for instance of soccer pools. Different people will assign different prior probabilities
to one team or the other depending on what they know before the match, and this
information may not be shared by all of them. However, to the extent that they share
common prior knowledge, they will arrive at the same conclusions.
Bayes's rule provides a natural way to include new information and update our
beliefs in a sequential way. After the event (data) D_1 has been observed, we have

P(H_i) → P(H_i|D_1) = [P(D_1|H_i)/P(D_1)] P(H_i) ∝ P(D_1|H_i)P(H_i)

Now, if we get additional information provided by the observation of D_2 (new data),
we "update" our beliefs on H_i as:

P(H_i|D_1) → P(H_i|D_2, D_1) = [P(D_2|H_i, D_1)/P(D_2|D_1)] P(H_i|D_1) = P(D_2|H_i, D_1)P(D_1|H_i)P(H_i) / P(D_2, D_1)

and so on with further evidences.
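This sequential mechanism is easy to demonstrate numerically. The sketch below updates the beliefs about two hypothetical hypotheses for a coin (the priors, the bias 3/4 and the data sequence are all illustrative assumptions, not from the text); with exact fractions, the sequential posterior coincides with the one obtained by conditioning on all the data at once:

```python
from fractions import Fraction

# two hypothetical hypotheses about a coin and their head probabilities
p_head = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}
belief = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}   # priors P(Hi)

def update(belief, datum):
    # one Bayes step: posterior proportional to likelihood x prior, then normalize
    post = {h: (p_head[h] if datum == "H" else 1 - p_head[h]) * p
            for h, p in belief.items()}
    norm = sum(post.values())
    return {h: p / norm for h, p in post.items()}

for datum in "HHTH":                 # sequential updating on the observed data
    belief = update(belief, datum)

# same posterior as one batch update on all four observations:
# (3/4)^3 (1/4) vs (1/2)^4 gives P(biased | data) = 27/43
assert belief["biased"] == Fraction(27, 43)
```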

Example 1.6 An important interpretation of Bayes Theorem is based on the
cause–effect relation. Suppose that the event A (effect) has been produced by a
certain cause H_i. We consider all possible causes (so H is a complete set) and, among
them, we are interested in those that seem more plausible to explain the observation of
the event A. Under this scope, we interpret the terms appearing in Bayes's formula as:

P(A|H_i, H_0): the probability that the effect A is produced by the cause (or
hypothesis) H_i;

P(H_i|H_0): the prior degree of credibility we assign to the cause H_i before
we know that A has occurred;

P(H_i|A, H_0): the posterior probability we have for H_i being the cause of the
event (effect) A that has already been observed.

Let's see an example of clinical diagnosis, just because the problem is general
enough and the conclusions may be more disturbing. If you want, replace individuals by
events and, for instance, (sick, healthy) by (signal, background). Now, the incidence
of a certain rare disease is 1 in every 10,000 people and there is an efficient diagnostic
test such that:
(1) If a person is sick, the test gives positive in 99% of the cases;
(2) If a person is healthy, the test may fail and give positive (a false positive) in 0.5%
of the cases.
In this case, the effect is to give positive (T) in the test and the exclusive and
exhaustive hypotheses for the cause are:

H1: be sick and H2: be healthy

with H2 = H1^c. A person, say you, is chosen randomly (H0) among the population
to undergo the test, and the result is positive. Then you get scared when they tell you: "The
probability of giving positive being healthy is 0.5%, very small" (a p-value). There is
nothing wrong with the statement, but it has to be correctly interpreted, and usually
it is not. It means no more and no less than what the expression P(T|H2) says:
"under the assumption that you are healthy (H2), the chances of giving positive are
0.5%", and this is nothing else but a feature of the test. It doesn't say anything about
P(H1|T), the chance that you are sick given that you tested positive, which, in the end,
is what you are really interested in. The two probabilities are related by an additional
piece of information that appears in Bayes's formula: P(H1|H0); that is, under the
hypothesis that you have been chosen at random (H0), what are the prior chances
to be sick? From the prior knowledge we have, the degrees of credibility we assign to
the two hypotheses are

P(H1|H0) = 1/10000 and P(H2|H0) = 1 − P(H1|H0) = 9999/10000

On the other hand, if T denotes the event "give positive in the test", we know that:

P(T|H1) = 99/100 and P(T|H2) = 5/1000
Therefore, Bayes's Theorem tells us that the probability to be sick given a positive
test is

P(H1|T) = P(T|H1)·P(H1|H0) / Σ_{i=1}^2 P(T|H_i)·P(H_i|H0) = (99/100)(1/10000) / [(99/100)(1/10000) + (5/1000)(9999/10000)] ≈ 0.02

Thus, even if the test looks very efficient and you gave positive, the fact that you
were chosen at random and that the incidence of the disease in the population is very
small reduces dramatically the degree of belief you assign to being sick. Clearly, if you
were not chosen randomly but because there is a suspicion from other symptoms
that you are sick, the prior probabilities change.
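The numbers of the example can be reproduced with exact rational arithmetic; a minimal sketch of the computation:

```python
from fractions import Fraction

p_sick = Fraction(1, 10_000)           # incidence of the disease, P(H1|H0)
p_pos_sick = Fraction(99, 100)         # P(T|H1)
p_pos_healthy = Fraction(5, 1_000)     # P(T|H2), rate of false positives

# total probability of a positive test, then Bayes' theorem
p_pos = p_pos_sick * p_sick + p_pos_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_sick * p_sick / p_pos

assert p_sick_given_pos == Fraction(2, 103)   # about 0.019, i.e. roughly 2%
```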

1.3 Distribution Function

A one-dimensional Distribution Function is a real function F : R → R that:

(p.1) is monotonous non-decreasing: F(x_1) ≤ F(x_2) ∀x_1 < x_2 ∈ R;
(p.2) is everywhere continuous on the right: lim_{ε→0+} F(x + ε) = F(x) ∀x ∈ R;
(p.3) F(−∞) ≡ lim_{x→−∞} F(x) = 0 and F(∞) ≡ lim_{x→+∞} F(x) = 1;
and there is [1] a unique Borel measure μ on R that satisfies μ((−∞, x]) = F(x)
for all x ∈ R. In the Theory of Probability, we define the Probability Distribution
Function^8 of the random quantity X(w) : Ω → R as:

F(x) ≝ P(X ≤ x) = P(X ∈ (−∞, x]); ∀x ∈ R

Note that the Distribution Function F(x) is defined for all x ∈ R, so if supp{P(X)} =
[a, b], then F(x) = 0 ∀x < a and F(x) = 1 ∀x ≥ b. From the definition, it is easy
to show the following important properties:
(a) ∀x ∈ R we have that:
(a.1) P(X ≤ x) = F(x);
(a.2) P(X < x) = F(x⁻);
(a.3) P(X > x) = 1 − P(X ≤ x) = 1 − F(x);
(a.4) P(X ≥ x) = 1 − P(X < x) = 1 − F(x⁻);
(b) ∀x_1 < x_2 ∈ R we have that:
(b.1) P(x_1 < X ≤ x_2) = P(X ∈ (x_1, x_2]) = F(x_2) − F(x_1);
(b.2) P(x_1 ≤ X ≤ x_2) = P(X ∈ [x_1, x_2]) = F(x_2) − F(x_1⁻)
(thus, if x_1 = x_2 then P(X = x_1) = F(x_1) − F(x_1⁻));
(b.3) P(x_1 < X < x_2) = P(X ∈ (x_1, x_2)) = F(x_2⁻) − F(x_1) = F(x_2) − F(x_1) − P(X = x_2);
(b.4) P(x_1 ≤ X < x_2) = P(X ∈ [x_1, x_2)) = F(x_2⁻) − F(x_1⁻) = F(x_2) − F(x_1) − P(X = x_2) + P(X = x_1).
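For a discrete random quantity these properties can be verified directly. The sketch below builds the distribution function of the score of a fair die (our own choice of example) and checks (b.1) and the jump of (b.2); the left limit F(x⁻) is approximated by evaluating F just below x:

```python
from bisect import bisect_right
from fractions import Fraction

xs = [1, 2, 3, 4, 5, 6]                 # scores of a fair die

def F(x):
    # distribution function F(x) = P(X <= x): counts the scores not exceeding x
    return Fraction(bisect_right(xs, x), len(xs))

assert F(0.5) == 0 and F(6) == 1              # F vanishes below supp, 1 above
assert F(4) - F(2) == Fraction(1, 3)          # (b.1): P(2 < X <= 4) = P({3, 4})
assert F(3) - F(3 - 1e-9) == Fraction(1, 6)   # (b.2): jump P(X = 3) = F(3) - F(3^-)
```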
The Distribution Function is discontinuous at all x ∈ R where F(x⁻) ≠ F(x⁺).
Let D be the set of all points of discontinuity. If x ∈ D, then F(x⁻) < F(x⁺) since
F is monotonous non-decreasing. Thus, we can associate to each x ∈ D a rational
number r(x) ∈ Q such that F(x⁻) < r(x) < F(x⁺), and all of them will be different
because if x_1 < x_2 ∈ D then F(x_1⁺) ≤ F(x_2⁻). Then, since Q is a countable set,
we have that the set of points of discontinuity of F(x) is either finite or countable.

8 The condition P(X ≤ x) is due to the requirement that F(x) be continuous on the right. This is
not essential in the sense that any non-decreasing function G(x), defined on R, bounded between
0 and 1 and continuous on the left (G(x) = lim_{ε→0+} G(x − ε)) determines a Distribution Function,
defined as F(x) = G(x) at all x where G(x) is continuous and as F(x) = G(x+) where G(x) is discontinuous.
In fact, in the general theory of measure it is more common to consider continuity on the left.
At each of them the Distribution Function has a “jump” of amplitude (property b.2):

F(x) − F(x−) = P(X = x)

and will be continuous on the right (condition p.2).
Last, for each Distribution Function there is a unique probability measure P
defined over the Borel sets of R that assigns the probability F(x2) − F(x1) to
each half-open interval (x1, x2] and, conversely, to any probability measure defined
on a measurable space (R, B) there corresponds one Distribution Function. Thus, the
Distribution Function of a random quantity contains all the information needed to
describe the properties of the random process.

1.3.1 Discrete and Continuous Distribution Functions

Consider the probability space (Ω, F, Q), the random quantity X(w) : w ∈ Ω →
X(w) ∈ R and the induced probability space (R, B, P). The function X(w) is a
discrete random quantity if its range (image) D = {x1, . . ., xi, . . .}, with xi ∈ R,
i = 1, 2, . . ., is a finite or countable set; that is, if {Ak ; k = 1, 2, . . .} is a finite or
countable partition of Ω, the function X(w) is either

simple: X(w) = Σ_{k=1}^{n} xk 1_{Ak}(w)    or    elementary: X(w) = Σ_{k=1}^{∞} xk 1_{Ak}(w)
Then, P(X = xk) = Q(Ak) and the corresponding Distribution Function, defined
for all x ∈ R, will be

F(x) = P(X ≤ x) = Σ_{∀xk ∈ D} P(X = xk) 1_A(xk) = Σ_{∀xk ≤ x} P(X = xk)

with A = (−∞, x] ∩ D, and satisfies:
(i) F(−∞) = 0 and F(+∞) = 1;
(ii) it is a monotonous non-decreasing step-function;
(iii) it is continuous on the right (F(x+) = F(x)) and therefore constant except on the
finite or countable set of points of discontinuity D = {x1, . . .} where

F(xk) − F(xk−) = P(X = xk)

Familiar examples of discrete Distribution Functions are the Poisson, Binomial, Multinomial, …
The random quantity X(w) : Ω → R is continuous if its range is a non-denumerable
set; that is, if for all x ∈ R we have that P(X = x) = 0. In this case, the
Distribution Function F(x) = P(X ≤ x) is continuous for all x ∈ R because
(i) from condition (p.2): F(x+) = F(x);
(ii) from property (b.2): F(x−) = F(x) − P(X = x) = F(x).
Now, consider the measure space (Ω, B, μ) with μ countably additive. If f :
Ω → [0, ∞) is integrable with respect to μ, it is clear that ν(A) = ∫_A f dμ for A ∈ B
is also a non-negative countably additive set function (see Appendix 1.2). More
generally, we have:
• Radon–Nikodym Theorem (Radon (1913), Nikodym (1930)): If ν and μ are two
σ-additive measures on the measurable space (Ω, B) such that ν is absolutely
continuous with respect to μ (ν ≪ μ; that is, for every set A ∈ B for which
μ(A) = 0 it is ν(A) = 0), then there exists a μ-integrable function p(w) such that

ν(A) = ∫_A dν(w) = ∫_A [dν(w)/dμ(w)] dμ(w) = ∫_A p(w) dμ(w)

and, conversely, if such a function exists then ν ≪ μ (see Appendix 1.3 for the
main properties).
The function p(w) = dν(w)/dμ(w) is called the Radon density and is unique up to,
at most, a set of measure zero; that is, if

ν(A) = ∫_A p(w) dμ(w) = ∫_A f(w) dμ(w)

then μ{x | p(x) ≠ f(x)} = 0. Furthermore, if ν and μ are equivalent (ν ∼ μ; μ ≪ ν
and ν ≪ μ) then dν/dμ > 0 almost everywhere. In consequence, if we have a
probability space (R, B, P) with P equivalent to the Lebesgue measure, there exists
a non-negative Lebesgue integrable function (see Appendix 2) p : R → [0, ∞),
unique a.e., such that

P(A) ≡ P(X ∈ A) = ∫_A p(x) dx ; ∀A ∈ B

The function p(x) is called the probability density function and satisfies:

(i) p(x) ≥ 0 ∀x ∈ R;
(ii) on any bounded interval of R, p(x) is bounded and Riemann-integrable;
(iii) ∫_{−∞}^{+∞} p(x) dx = 1.
Thus, for an absolutely continuous random quantity X, the Distribution Function
F(x) can be expressed as

F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(w) dw
Usually we shall be interested in random quantities that take values in a subset
D ⊂ R. It will then be understood that p(x) stands for p(x) 1_D(x), so it is defined for all
x ∈ R. Thus, for instance, if supp{p(x)} = [a, b] then

∫_{−∞}^{+∞} p(x) dx ≡ ∫_{−∞}^{+∞} p(x) 1_{[a,b]}(x) dx = ∫_a^b p(x) dx = 1

and therefore

F(x) = P(X ≤ x) = 1_{[b,∞)}(x) + 1_{[a,b)}(x) ∫_a^x p(u) du

Note that from the previous considerations, the value of the integral will not be
affected if we modify the integrand on a countable set of points. In fact, what we
actually integrate is an equivalence class of functions that differ only on a set of
measure zero. Therefore, a probability density function p(x) has to be continuous
at all x ∈ R except, at most, on a countable set of points. If F(x) is not differentiable
at a particular point, p(x) is not defined on it, but the set of those points has zero
measure. However, if p(x) is continuous in R then F′(x) = p(x) and the value of
p(x) is univocally determined by F(x). We also have that

P(X ≤ x) = F(x) = ∫_{−∞}^{x} p(w) dw  →  P(X > x) = 1 − F(x) = ∫_{x}^{+∞} p(w) dw

and therefore:

P(x1 < X ≤ x2) = F(x2) − F(x1) = ∫_{x1}^{x2} p(w) dw

Thus, since F(x) is continuous at all x ∈ R:

P(x1 < X ≤ x2) = P(x1 < X < x2) = P(x1 ≤ X < x2) = P(x1 ≤ X ≤ x2)

and therefore P(X = x) = 0 ∀x ∈ R (λ({x}) = 0) even though X = x is a
possible outcome of the experiment. In this sense, unlike for discrete random quantities,
probability 0 does not necessarily correspond to impossible events. Well-known
examples of absolutely continuous Distribution Functions are the Normal, Gamma,
Beta, Student, Dirichlet, Pareto, …
Last, if the continuous probability measure P is not absolutely continuous with
respect to the Lebesgue measure λ in R, then the probability density function does
not exist. Those are called singular random quantities, for which F(x) is continuous
but F′(x) = 0 almost everywhere. A well-known example is Dirac's singular
measure δ_{x0}(A) = 1_A(x0), which assigns a measure 1 to a set A ∈ B if x0 ∈ A and 0
otherwise. As we shall see in Examples 1.9 and 1.20, dealing with these cases is
no problem because the Distribution Function always exists. The Lebesgue General
Decomposition Theorem establishes that any Distribution Function can be expressed
as a convex combination

F(x) = Σ_{i=1}^{Nd} ai Fd,i(x) + Σ_{j=1}^{Nac} bj Fac,j(x) + Σ_{k=1}^{Ns} ck Fs,k(x)

of discrete Distribution Functions (Fd(x)), absolutely continuous ones (Fac(x), with
derivative at every point so F′ac(x) = p(x)) and singular ones (Fs(x)). For the cases
we shall deal with, ck = 0.

Example 1.7 Consider a real parameter μ > 0 and a discrete random quantity X
that can take values {0, 1, 2, . . .} with a Poisson probability law:

P(X = k|μ) = e^{−μ} μ^k / Γ(k + 1) ; k = 0, 1, 2, . . .

The Distribution Function will be

F(x|μ) = P(X ≤ x|μ) = e^{−μ} Σ_{k=0}^{m=[x]} μ^k / Γ(k + 1)

where m = [x] is the largest integer less than or equal to x. Clearly, for ε → 0⁺:

F(x + ε|μ) = F([x + ε]|μ) = F([x]|μ) = F(x|μ)

so it is continuous on the right, and for k = 0, 1, 2, . . .

F(k|μ) − F(k − 1|μ) = P(X = k|μ) = e^{−μ} μ^k / Γ(k + 1)

Therefore, for reals x2 > x1 > 0 such that no integer lies in (x1, x2] (i.e., [x1] = [x2]),
P(x1 < X ≤ x2) = F(x2) − F(x1) = 0.

Example 1.8 Consider the function g(x) = e^{−ax} with a > 0 real and support in
(0, ∞). It is non-negative and Riemann integrable in R⁺ so we can define a probability
density

p(x|a) = [e^{−ax} / ∫_0^∞ e^{−au} du] 1_{(0,∞)}(x) = a e^{−ax} 1_{(0,∞)}(x)

and the Distribution Function

F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(u|a) du = { 0 if x < 0 ; 1 − e^{−ax} if x ≥ 0 }
Clearly, F(−∞) = 0 and F(+∞) = 1. Thus, for an absolutely continuous random
quantity X ∼ p(x|a) we have, for reals x2 > x1 > 0:

P(X ≤ x1) = F(x1) = 1 − e^{−ax1}
P(X > x1) = 1 − F(x1) = e^{−ax1}
P(x1 < X ≤ x2) = F(x2) − F(x1) = e^{−ax1} − e^{−ax2}
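A quick numerical cross-check of this example: a midpoint Riemann sum of the density against the closed-form Distribution Function (the step count is an arbitrary choice):

```python
import math

def expo_cdf(x, a):
    """Closed form: F(x) = 1 - exp(-a x) for x >= 0, 0 otherwise."""
    return 1.0 - math.exp(-a * x) if x >= 0.0 else 0.0

def expo_cdf_numeric(x, a, n=200000):
    """Midpoint Riemann sum of p(u|a) = a exp(-a u) over (0, x)."""
    if x <= 0.0:
        return 0.0
    h = x / n
    return sum(a * math.exp(-a * (i + 0.5) * h) * h for i in range(n))

a = 1.0
x1, x2 = 0.5, 2.0
p_interval = expo_cdf(x2, a) - expo_cdf(x1, a)   # P(x1 < X <= x2)
p_tail = 1.0 - expo_cdf(x1, a)                   # P(X > x1) = exp(-a x1)
```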

Example 1.9 The ternary Cantor Set Cs(0, 1) is constructed iteratively. Starting with
the interval Cs0 = [0, 1], at each step one removes the open middle third of each of
the remaining segments. That is, at step one the interval (1/3, 2/3) is removed so
Cs1 = [0, 1/3] ∪ [2/3, 1], and so on. If we denote by Dn the union of the 2^{n−1} disjoint
open intervals removed at step n, each of length 1/3ⁿ, the Cantor set is defined as
Cs(0, 1) = [0, 1] \ ∪_{n=1}^{∞} Dn. It is easy to check that any element X of the Cantor Set
can be expressed as

X = Σ_{n=1}^{∞} Xn / 3ⁿ

with supp{Xn} = {0, 2}⁹, and that Cs(0, 1) is a closed set, uncountable, nowhere
dense in [0, 1] and with zero measure. The Cantor Distribution, whose support is the
Cantor Set, is defined assigning the probabilities P(Xn = 0) = P(Xn = 2) = 1/2.
Thus, X is a continuous random quantity with support on a non-denumerable set
of measure zero and cannot be described by a probability density function. The
Distribution Function F(x) = P(X ≤ x) (Cantor Function; Fig. 1.1) is an example
of a singular Distribution.
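The digit representation above gives a direct way to draw samples from the Cantor Distribution; a minimal sketch, truncating the ternary expansion at a finite depth (an approximation):

```python
import random

def cantor_sample(rng, depth=40):
    """Draw X = sum_n X_n / 3^n with X_n in {0, 2} equiprobable,
    truncating the expansion after 'depth' ternary digits."""
    x = 0.0
    p = 1.0
    for _ in range(depth):
        p /= 3.0
        if rng.random() < 0.5:   # X_n = 2 with probability 1/2
            x += 2.0 * p
    return x

rng = random.Random(1)
sample = [cantor_sample(rng) for _ in range(20000)]
mean = sum(sample) / len(sample)   # E[X] = sum_n E[X_n]/3^n = 1/2
```

Note that, even though every draw lies in [0, 1], the sample concentrates on a set of zero Lebesgue measure.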

1.3.2 Distributions in More Dimensions

The previous considerations can be extended to random quantities in more dimensions,
but with some care. Let's consider the two-dimensional case X = (X1, X2).
The Distribution Function will be defined as:

F(x1, x2) = P(X1 ≤ x1, X2 ≤ x2) ; ∀(x1, x2) ∈ R²


9 Note that the representation of a real number r ∈ [0, 1] as (a1, a2, . . .) ↔ Σ_{n=1}^{∞} an 3^{−n} with
ai ∈ {0, 1, 2} is not unique. In fact x = 1/3 ∈ Cs(0, 1) and can be represented by (1, 0, 0, 0, . . .) or
(0, 2, 2, 2, . . .).
Fig. 1.1 Empirical distribution functions (ordinate) from a Monte Carlo sampling (10⁶ events) of
the Poisson Po(x|5) (discrete; upper left), Exponential Ex(x|1) (absolutely continuous; upper right)
and Cantor (singular; bottom) Distributions

and satisfies:
(i) it is monotonous non-decreasing in both variables; that is, if (x1, x2), (x1′, x2′) ∈ R²:

x1 ≤ x1′ → F(x1, x2) ≤ F(x1′, x2) and x2 ≤ x2′ → F(x1, x2) ≤ F(x1, x2′)

(ii) it is continuous on the right at each (x1, x2) ∈ R²:

lim_{ε→0+} F(x1 + ε, x2) = lim_{ε→0+} F(x1, x2 + ε) = F(x1, x2)

(iii) F(−∞, x2) = F(x1, −∞) = 0 and F(+∞, +∞) = 1.


Now, if (x1, x2), (x1′, x2′) ∈ R² with x1 < x1′ and x2 < x2′, we have that:

P(x1 < X1 ≤ x1′, x2 < X2 ≤ x2′) = F(x1′, x2′) − F(x1, x2′) − F(x1′, x2) + F(x1, x2) ≥ 0

and

P(x1 ≤ X1 ≤ x1′, x2 ≤ X2 ≤ x2′) = F(x1′, x2′) − F(x1−, x2′) − F(x1′, x2−) + F(x1−, x2−) ≥ 0

so, for discrete random quantities, taking x1′ = x1 and x2′ = x2:

P(X1 = x1, X2 = x2) = F(x1, x2) − F(x1−, x2) − F(x1, x2−) + F(x1−, x2−) ≥ 0

will give the amplitude of the jump of the Distribution Function at the points of
discontinuity.
As for the one-dimensional case, for absolutely continuous random quantities we
can introduce a two-dimensional probability density function p(x) : R² → R such that:
(i) p(x) ≥ 0 ∀x ∈ R²;
(ii) on every bounded interval of R², p(x) is bounded and Riemann integrable;
(iii) ∫_{R²} p(x) dx = 1

and

F(x1, x2) = ∫_{−∞}^{x1} du1 ∫_{−∞}^{x2} du2 p(u1, u2)  ←→  p(x1, x2) = ∂²F(x1, x2)/∂x1∂x2.

1.3.2.1 Marginal and Conditional Distributions

It may happen that we are interested only in one of the two random quantities, say,
for instance, X1. Then we ignore all aspects concerning X2 and obtain the one-dimensional
Distribution Function

F1(x1) = P(X1 ≤ x1) = P(X1 ≤ x1, X2 < +∞) = F(x1, +∞)

that is called the Marginal Distribution Function of the random quantity X1. In
the same manner, we have F2(x2) = F(+∞, x2) for the random quantity X2. For
absolutely continuous random quantities,

F1(x1) = F(x1, +∞) = ∫_{−∞}^{x1} du1 ∫_{−∞}^{+∞} p(u1, u2) du2 = ∫_{−∞}^{x1} p1(u1) du1
with p1(x1) the marginal probability density function¹⁰ of the random quantity X1:

p1(x1) = ∂F1(x1)/∂x1 = ∫_{−∞}^{+∞} p(x1, u2) du2

In the same manner, we have for X2

p2(x2) = ∂F2(x2)/∂x2 = ∫_{−∞}^{+∞} p(u1, x2) du1

As we have seen, given a probability space (Ω, B, P), for any two sets A, B ∈ B
the conditional probability for A given B was defined as

P(A|B) ≡ P(A ∩ B)/P(B) ≡ P(A, B)/P(B)

provided P(B) ≠ 0. Intimately related to this definition is the Bayes' rule:

P(A, B) = P(A|B) P(B) = P(B|A) P(A)

Consider now the discrete random quantity X = (X1, X2) with values in Ω ⊂ R².
It is then natural to define

P(X1 = x1|X2 = x2) ≡ P(X1 = x1, X2 = x2) / P(X2 = x2)

and therefore

F(x1|x2) = P(X1 ≤ x1, X2 = x2) / P(X2 = x2)

whenever P(X2 = x2) ≠ 0. For absolutely continuous random quantities we can


express the probability density as

p(x1 , x2 ) = p(x1 |x2 ) p(x2 ) = p(x2 |x1 ) p(x1 )

and define the conditional probability density function as

de f. p(x1 , x2 ) ∂
p(x1 |x2 ) = = F(x1 |x2 )
p(x2 ) ∂x1

10 It is habitual to avoid the indices and write p(x) meaning “the probability density function of the
variable x ” since the distinctive features are clear within the context.
provided again that p2(x2) ≠ 0. This is certainly an admissible density¹¹ since
p(x1|x2) ≥ 0 ∀(x1, x2) ∈ R² and ∫_R p(x1|x2) dx1 = 1.
As stated already, two events A, B ∈ B are statistically independent iff:

P(A, B) ≡ P(A ∩ B) = P(A) · P(B)

Then, we shall say that two random quantities X1 and X2 are statistically
independent if F(x1, x2) = F1(x1)F2(x2); that is, if

P(X1 = x1, X2 = x2) = P(X1 = x1) P(X2 = x2)

for discrete random quantities, and

p(x1, x2) = ∂²[F1(x1)F2(x2)]/∂x1∂x2 = p(x1) p(x2)  ←→  p(x1|x2) = p(x1)

for absolutely continuous random quantities.

Example 1.10 Consider the probability space (Ω, B, λ) with Ω = [0, 1] and λ the
Lebesgue measure. If F is an arbitrary Distribution Function, X : w ∈ [0, 1] →
F⁻¹(w) ∈ R is a random quantity distributed as F. Take the Borel set
I = (−∞, r] with r ∈ R. Since F is a Distribution Function, it is monotonous and
non-decreasing, so we have that:

X⁻¹(I) = {w ∈ Ω | X(w) ≤ r} = {w ∈ [0, 1] | F⁻¹(w) ≤ r} = {w ∈ Ω | w ≤ F(r)} = [0, F(r)] ∈ B

and therefore X(w) = F⁻¹(w) is measurable over B_R and is distributed as

P(X(w) ≤ x) = P(F⁻¹(w) ≤ x) = P(w ≤ F(x)) = ∫_0^{F(x)} dλ = F(x)

Example 1.11 Consider the probability space (R, B, μ) with μ the probability
measure

μ(A) = ∫_A dF ; A ∈ B

11 Recall that for continuous random quantities P(X2 = x2) = P(X1 = x1) = 0. One can justify
this expression with heuristic arguments, essentially considering X1 ∈ Ω1 = (−∞, x1],
X2 ∈ Ωε(x2) = [x2, x2 + ε] and taking the limit ε → 0⁺ of

P(X1 ≤ x1|X2 ∈ Ωε(x2)) = P(X1 ≤ x1, X2 ∈ Ωε(x2))/P(X2 ∈ Ωε(x2)) = [F(x1, x2 + ε) − F(x1, x2)]/[F2(x2 + ε) − F2(x2)]

See however [1], Vol. 2, Chap. 10, for the Radon–Nikodym density with conditional measures.
The function X : w ∈ R → F(w) ∈ [0, 1] is measurable on B. Take I = [a, b) ∈
B_{[0,1]}. Then

X⁻¹(I) = {w ∈ R | a ≤ F(w) < b} = {w ∈ R | F⁻¹(a) ≤ w < F⁻¹(b)} = [wa, wb) ∈ B_R

and it is distributed as X ∼ Un(x|0, 1):

P(X(w) ≤ x) = P(F(w) ≤ x) = P(w ≤ F⁻¹(x)) = ∫_{−∞}^{F⁻¹(x)} dF = x

This is the basis of the Inverse Transform sampling method that we shall see in
Chap. 3 on Monte Carlo techniques.
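A sketch of the idea for the exponential density of Example 1.8, whose inverse is F⁻¹(u) = −ln(1 − u)/a (the parameter and sample size are arbitrary choices):

```python
import math
import random

def sample_exponential(rng, a, n):
    """Inverse Transform: if U ~ Un(0,1), then X = F^{-1}(U) = -ln(1-U)/a ~ Ex(x|a)."""
    return [-math.log(1.0 - rng.random()) / a for _ in range(n)]

rng = random.Random(42)
a = 2.0
xs = sample_exponential(rng, a, 50000)

# Empirical check: the fraction of draws below x0 should approach F(x0) = 1 - exp(-a x0)
x0 = 1.0
emp = sum(1 for x in xs if x <= x0) / len(xs)
```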

Example 1.12 Suppose that the number of eggs a particular insect may lay (X1)
follows a Poisson distribution X1 ∼ Po(x1|μ):

P(X1 = x1|μ) = e^{−μ} μ^{x1} / Γ(x1 + 1) ; x1 = 0, 1, 2, . . .

Now, if the probability for an egg to hatch is θ and X2 represents the number of
offspring, given x1 eggs the probability to have x2 descendants follows a Binomial law
X2 ∼ Bi(x2|x1, θ):

P(X2 = x2|x1, θ) = (x1 choose x2) θ^{x2} (1 − θ)^{x1−x2} ; 0 ≤ x2 ≤ x1

In consequence

P(X1 = x1, X2 = x2|μ, θ) = P(X2 = x2|X1 = x1, θ) P(X1 = x1|μ) =
= (x1 choose x2) θ^{x2} (1 − θ)^{x1−x2} e^{−μ} μ^{x1} / Γ(x1 + 1) ; 0 ≤ x2 ≤ x1

Suppose that we have not observed the number of eggs that were laid. What is the
distribution of the number of offspring? This is given by the marginal probability

P(X2 = x2|θ, μ) = Σ_{x1=x2}^{∞} P(X1 = x1, X2 = x2) = e^{−μθ} (μθ)^{x2} / Γ(x2 + 1) = Po(x2|μθ)

Now, suppose that we have found x2 new insects. What is the distribution of the
number of eggs laid? This will be the conditional probability P(X1 = x1|X2 =
x2, θ, μ) and, since P(X1 = x1, X2 = x2) = P(X1 = x1|X2 = x2) P(X2 = x2), we
have that:

P(X1 = x1|X2 = x2, μ, θ) = P(X1 = x1, X2 = x2) / P(X2 = x2) =
= e^{−μ(1−θ)} [μ(1 − θ)]^{x1−x2} / (x1 − x2)!

with 0 ≤ x2 ≤ x1; that is, again a Poisson with parameter μ(1 − θ) in the number
of unhatched eggs x1 − x2.
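Example 1.12 can be cross-checked by simulation; a sketch (the values of μ and θ are arbitrary, and the Poisson sampler by CDF inversion is one simple choice among many):

```python
import math
import random

def sample_poisson(rng, mu):
    """Po(mu) sampling by inversion of the CDF (adequate for moderate mu)."""
    u, k = rng.random(), 0
    p = math.exp(-mu)
    c = p
    while u > c and p > 0.0:
        k += 1
        p *= mu / k
        c += p
    return k

rng = random.Random(7)
mu, theta = 4.0, 0.25
pairs = []
for _ in range(40000):
    x1 = sample_poisson(rng, mu)                            # eggs laid
    x2 = sum(1 for _ in range(x1) if rng.random() < theta)  # eggs hatched
    pairs.append((x1, x2))

# Marginally, X2 ~ Po(mu * theta), so its mean should approach mu*theta = 1
mean_x2 = sum(x2 for _, x2 in pairs) / len(pairs)
```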

Example 1.13 Let X1 and X2 be two independent Poisson distributed random quantities
with parameters μ1 and μ2. How is Y = X1 + X2 distributed? Since they are
independent:

P(X1 = x1, X2 = x2|μ1, μ2) = e^{−(μ1+μ2)} μ1^{x1} μ2^{x2} / [Γ(x1 + 1) Γ(x2 + 1)]

Then, since X2 = Y − X1:

P(X1 = x, Y = y) = P(X1 = x, X2 = y − x) = e^{−(μ1+μ2)} μ1^{x} μ2^{y−x} / [Γ(x + 1) Γ(y − x + 1)]

Being X2 = y − x ≥ 0, we have the condition y ≥ x, so the marginal probability for
Y will be

P(Y = y) = e^{−(μ1+μ2)} Σ_{x=0}^{y} μ1^{x} μ2^{y−x} / [Γ(x + 1) Γ(y − x + 1)] = e^{−(μ1+μ2)} (μ1 + μ2)^{y} / Γ(y + 1)

that is, Po(y|μ1 + μ2).
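The convolution sum above can be checked term by term; a short sketch (the values of μ1 and μ2 are arbitrary):

```python
import math

def po(k, lam):
    """Poisson probability P(X = k) = e^-lam lam^k / Gamma(k+1)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def convolved(y, mu1, mu2):
    """P(Y = y) = sum_{x=0}^{y} P(X1 = x) P(X2 = y - x)."""
    return sum(po(x, mu1) * po(y - x, mu2) for x in range(y + 1))

mu1, mu2 = 1.5, 2.5
# The convolution should reproduce Po(y | mu1 + mu2) for every y
diffs = [abs(convolved(y, mu1, mu2) - po(y, mu1 + mu2)) for y in range(20)]
```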

Example 1.14 Consider a two-dimensional random quantity X = (X1, X2) that
takes values in R² with the probability density function N(x1, x2|μ = 0, σ = 1, ρ):

p(x1, x2|ρ) = 1/(2π √(1 − ρ²)) exp{ −(x1² − 2ρ x1 x2 + x2²) / (2(1 − ρ²)) }

being ρ ∈ (−1, 1). The marginal densities are:

X1 ∼ p(x1) = ∫_{−∞}^{+∞} p(x1, u2) du2 = (1/√(2π)) e^{−x1²/2}
X2 ∼ p(x2) = ∫_{−∞}^{+∞} p(u1, x2) du1 = (1/√(2π)) e^{−x2²/2}

and since

p(x1, x2|ρ) = p(x1) p(x2) (1/√(1 − ρ²)) exp{ −ρ(ρx1² − 2x1x2 + ρx2²) / (2(1 − ρ²)) }
both quantities will be independent iff ρ = 0. The conditional densities are

p(x1|x2, ρ) = p(x1, x2)/p(x2) = (1/(√(2π) √(1 − ρ²))) exp{ −(x1 − ρx2)² / (2(1 − ρ²)) }
p(x2|x1, ρ) = p(x1, x2)/p(x1) = (1/(√(2π) √(1 − ρ²))) exp{ −(x2 − ρx1)² / (2(1 − ρ²)) }

and when ρ = 0 (thus independent), p(x1|x2) = p(x1) and p(x2|x1) = p(x2). Last,
it is clear that

p(x1, x2|ρ) = p(x2|x1, ρ) p(x1) = p(x1|x2, ρ) p(x2).

1.4 Stochastic Characteristics

1.4.1 Mathematical Expectation

Consider a random quantity X(w) : Ω → R that can be either discrete,

X(w) = Σ_{k=1}^{n} xk 1_{Ak}(w) (simple)   or   X(w) = Σ_{k=1}^{∞} xk 1_{Ak}(w) (elementary),

for which

P(X = xk) = P(Ak) = ∫ 1_{Ak}(w) dP(w)

or absolutely continuous, for which

P(X(w) ∈ A) = ∫_R 1_A(w) dP(w) = ∫_A dP(w) = ∫_A p(w) dw

The mathematical expectation of an n-dimensional random quantity Y = g(X)
is defined as¹²:

E[Y] = E[g(X)] ≡ ∫_{R^n} g(x) dP(x) = ∫_{R^n} g(x) p(x) dx
12 In what follows we consider the Stieltjes–Lebesgue integral, so that for discrete random
quantities the integral reduces to a sum and, in consequence:

∫_{−∞}^{∞} g(x) dP(x) = ∫_{−∞}^{∞} g(x) p(x) dx → Σ_{∀xk} g(xk) P(X = xk).
In general, the function g(x) may be unbounded on supp{X}, so both the sum and the
integral have to be absolutely convergent for the mathematical expectation to exist.
In a similar way, we define the conditional expectation. If X = (X1, . . .,
Xm, . . ., Xn), W = (X1, . . ., Xm) and Z = (Xm+1, . . ., Xn), we have for Y = g(W)
that

E[Y|Z = z0] = ∫_{R^m} g(w) p(w|z0) dw = ∫_{R^m} g(w) p(w, z0)/p(z0) dw.

1.4.2 Moments of a Distribution

Given a random quantity X ∼ p(x), we define the moment of order n (αn) as:

αn ≡ E[X^n] = ∫_{−∞}^{∞} x^n p(x) dx

Obviously, they exist if x^n p(x) ∈ L¹(R), so it may happen that a particular probability
distribution has only a finite number of moments. It is also clear that if the moment
of order n exists, so do the moments of lower order and, if it does not, neither do those
of higher order. In particular, the moment of order 0 always exists (due to the
normalization condition, α0 = 1) and those of even order, if they exist, are non-negative.
A specially important moment is that of order 1: the mean (mean value) μ = E[X], which
has two important properties:

• It is a linear operator, since X = c0 + Σ_{i=1}^{n} ci Xi → E[X] = c0 + Σ_{i=1}^{n} ci E[Xi];
• If X = Σ_{i=1}^{n} ci Xi with {Xi}_{i=1}^{n} independent random quantities, then E[X] = Σ_{i=1}^{n} ci E[Xi].

We can define as well the moments (βn) with respect to any point c ∈ R as:

βn ≡ E[(X − c)^n] = ∫_{−∞}^{∞} (x − c)^n p(x) dx

so the αn are the moments with respect to the origin. It is easy
to see that the second-order moment with respect to c, β2 = E[(X − c)²], is minimal for
c = μ = E[X]. Thus, of special relevance are the moments of order n with respect
to the mean (the central moments)

μn ≡ E[(X − μ)^n] = ∫_{−∞}^{∞} (x − μ)^n p(x) dx

and, among them, the moment of order 2: the variance μ2 = V[X] = σ². It is clear
that μ0 = 1 and, if it exists, μ1 = 0. Note that:
• V[X] = σ² = E[(X − μ)²] ≥ 0;
• It is not a linear operator, since X = c0 + c1X1 → V[X] = σX² = c1² V[X1] = c1² σX1²;
• If X = Σ_{i=1}^{n} ci Xi and {Xi}_{i=1}^{n} are independent random quantities, then
V[X] = Σ_{i=1}^{n} ci² V[Xi].

Usually, it is less tedious to calculate the moments with respect to the origin and
evidently they are related, so from the binomial expansion

(X − μ)^n = Σ_{k=0}^{n} (n choose k) X^k (−μ)^{n−k}  →  μn = Σ_{k=0}^{n} (n choose k) αk (−μ)^{n−k}

The previous definitions are trivially extended to n-dimensional random quantities.
In particular, for 2 dimensions, X = (X1, X2), we have the moments of order
(n, m) with respect to the origin:

αnm = E[X1^n X2^m] = ∫_{R²} x1^n x2^m p(x1, x2) dx1 dx2

so that α10 = μ1 and α01 = μ2, and the moments of order (n, m) with respect to the
mean:

μnm = E[(X1 − μ1)^n (X2 − μ2)^m] = ∫_{R²} (x1 − μ1)^n (x2 − μ2)^m p(x1, x2) dx1 dx2

for which

μ20 = E[(X1 − μ1)²] = V[X1] = σ1²  and  μ02 = E[(X2 − μ2)²] = V[X2] = σ2²
The moment

μ11 = E[(X1 − μ1)(X2 − μ2)] = α11 − α10 α01 = V[X1, X2] = V[X2, X1]

is called the covariance between the random quantities X1 and X2 and, if they are
independent, μ11 = 0. The second order moments with respect to the mean can be
condensed in matrix form, the covariance matrix, defined as:
   
μ20 μ11 V [X 1 , X 1 ] V [X 1 , X 2 ]
V [X] = =
μ11 μ02 V [X 1 , X 2 ] V [X 2 , X 2 ]

Similarly, for X = (X1, X2, . . ., Xn) we have the moments with respect to the origin

α_{k1 k2 ... kn} = E[X1^{k1} X2^{k2} · · · Xn^{kn}] ;

the moments with respect to the mean

μ_{k1 k2 ... kn} = E[(X1 − μ1)^{k1} (X2 − μ2)^{k2} · · · (Xn − μn)^{kn}]

and the covariance matrix

V[X] = ( V[X1, X1]  V[X1, X2]  · · ·  V[X1, Xn] )
       ( V[X1, X2]  V[X2, X2]  · · ·  V[X2, Xn] )
       (    ...         ...    · · ·     ...    )
       ( V[X1, Xn]  V[X2, Xn]  · · ·  V[Xn, Xn] )

whose elements are the second-order central moments, V[Xi, Xj] (e.g., V[X1, X1] = μ20...0,
V[X1, X2] = μ11...0, …, V[Xn, Xn] = μ00...2).

The covariance matrix V[X] = E[(X − μ)(X − μ)ᵀ] has the following properties,
easy to prove from basic matrix algebra relations:
(1) It is a symmetric matrix (V = Vᵀ) with non-negative diagonal elements (Vii ≥ 0);
(2) It is positive definite (xᵀVx ≥ 0 ∀x ∈ R^n, with equality when xi = 0 ∀i);
(3) Being V symmetric, all the eigenvalues are real and the corresponding eigenvectors
orthogonal. Furthermore, since it is positive definite, all eigenvalues are positive;
(4) If J is a diagonal matrix whose elements are the eigenvalues of V and H a matrix
whose columns are the corresponding eigenvectors, then V = HJH⁻¹ (Jordan dixit);
(5) Since V is symmetric, there is an orthogonal matrix C (Cᵀ = C⁻¹) such that
CVCᵀ = D, with D a diagonal matrix whose elements are the eigenvalues of V;
(6) Since V is symmetric and positive definite, there is a non-singular matrix C such
that V = CCᵀ;
(7) Since V is symmetric and positive definite, the inverse V⁻¹ is also symmetric
and positive definite;
(8) (Cholesky Factorization) Since V is symmetric and positive definite, there exists
a unique lower triangular matrix C (Cij = 0 ∀i < j) with positive diagonal
elements such that V = CCᵀ (more about this in Chap. 3).
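Property (8) can be illustrated with a plain textbook Cholesky factorization (a minimal sketch, with no pivoting, assuming V is symmetric and positive definite):

```python
import math

def cholesky(v):
    """Lower triangular C with positive diagonal such that V = C C^T."""
    n = len(v)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(v[i][i] - s)   # positive diagonal
            else:
                c[i][j] = (v[i][j] - s) / c[j][j]
    return c

# Example covariance matrix (sigma1 = 1, sigma2 = 2, rho = 0.5)
v = [[1.0, 1.0],
     [1.0, 4.0]]
c = cholesky(v)
```

This factorization is the basic ingredient for sampling correlated Gaussian quantities, as discussed in Chap. 3.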
Among other things to be discussed later, the moments of the distribution are
interesting because they give an idea of the shape and location of the probability
distribution and, in many cases, the distribution parameters are expressed in terms
of the moments.

1.4.2.1 Position Parameters

Let X ∼ p(x) with support in Ω ⊂ R. The position parameters choose a characteristic
value of X and indicate more or less where the distribution is located. Among them
we have the mean value

μ = α1 = E[X] = ∫_{−∞}^{∞} x p(x) dx
The mean is bounded by the minimum and maximum values the random quantity
can take but, clearly, if Ω ⊂ R it may happen that μ ∉ Ω. If, for instance, Ω = Ω1 ∪ Ω2
is the union of two disconnected regions, μ may lie in between and therefore μ ∉ Ω.
On the other hand, as has been mentioned, the integral has to be absolutely convergent
and there are some probability distributions for which there is no mean value. There
are, however, other interesting location quantities. The mode is the value x0 of X for
which the distribution is maximum; that is,

x0 = arg sup_{x ∈ Ω} p(x)

Nevertheless, it may happen that there are several relative maxima, so we talk
about uni-modal, bi-modal, … distributions. The median is the value xm such that

F(xm) = P(X ≤ xm) = 1/2 → ∫_{−∞}^{xm} p(x) dx = ∫_{xm}^{∞} p(x) dx = P(X > xm) = 1/2

For discrete random quantities, the distribution function is either a finite or countable
combination of indicator functions 1_{Ak}(x), with {Ak}_{k=1}^{n,∞} a partition of Ω, so it
may happen that F(x) = 1/2 ∀x ∈ Ak. Then, any value of the interval Ak can be
considered the median. Last, we may consider the quantiles of order α, defined as the values
qα of the random quantity such that F(qα) = P(X ≤ qα) = α, so the median is the
quantile q_{1/2}.

1.4.2.2 Dispersion Parameters

There are many ways to give an idea of how dispersed the values of the random
quantity are. Usually they are based on the mathematical expectation of a
function that depends on the difference between X and some characteristic value it
may take; for instance E[|X − μ|]. By far, the most usual and important one is the
already defined variance

V[X] = σ² = E[(X − E[X])²] = ∫_R (x − μ)² p(x) dx

provided it exists. Note that if the random quantity X has dimension D[X] = dX,
the variance has dimension D[σ²] = dX², so to have a quantity that gives an idea
of the dispersion and has the same dimension one defines the standard deviation
σ = +√V[X] = +√σ² and, if both the mean value (μ) and the variance exist, the
standardized random quantity

Y = (X − μ)/σ

for which E[Y] = 0 and V[Y] = σY² = 1.
1.4.2.3 Asymmetry and Peakiness Parameters

Related to the higher order moments with respect to the mean, there are two dimensionless
quantities of interest: the skewness and the kurtosis. The first non-trivial odd moment with
respect to the mean is that of order 3, μ3. Since it has dimension D[μ3] = dX³, we
define the skewness (γ1) as the dimensionless quantity

γ1 ≡ μ3/μ2^{3/2} = μ3/σ³ = E[(X − μ)³]/σ³

The skewness is γ1 = 0 for distributions that are symmetric with respect to the mean,
γ1 > 0 if the probability content is more concentrated on the right of the mean and
γ1 < 0 if it is to the left of the mean. Note however that there are many asymmetric
distributions for which μ3 = 0 and therefore γ1 = 0. For unimodal distributions, it
is easy to see that

γ1 = 0 : mode = median = mean
γ1 > 0 : mode < median < mean
γ1 < 0 : mode > median > mean


The kurtosis is defined, again for dimensional considerations, as

μ4 μ4 E[(X − μ)4 ]
γ2 = = =
μ22 σ4 σ4

and gives an idea of how peaked is the distribution. For the Normal distribution γ2 = 3
so in order to have a reference one defines the extended kurtosis as γ2ext = γ2 − 3.
Thus, γ2ext > 0 (<0) indicates that the distribution is more (less) peaked than the
Normal. Again, γ2ext = 0 for the Normal density and for any other distribution for
which μ4 = 3 σ 4 . Last you can check that ∀a, b ∈ R E[(X −μ−a)2 (X −μ−b)2 ] > 0
so, for instance, defining u = a + b, w = ab and taking derivatives, γ2 ≥ 1 + γ12 .
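Both shape parameters are easily estimated from a sample; a sketch (the exponential case, for which γ1 = 2 and γ2 = 9 in the large-sample limit, is an arbitrary choice) that also illustrates the bound γ2 ≥ 1 + γ1²:

```python
import math
import random

def sample_shape(xs):
    """Sample skewness g1 = m3 / m2^(3/2) and kurtosis g2 = m4 / m2^2."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m)**2 for x in xs) / n
    m3 = sum((x - m)**3 for x in xs) / n
    m4 = sum((x - m)**4 for x in xs) / n
    return m3 / m2**1.5, m4 / m2**2

rng = random.Random(3)
# Exponential sample via inverse transform
xs = [-math.log(1.0 - rng.random()) for _ in range(100000)]
g1, g2 = sample_shape(xs)
```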

Example 1.15 Consider the discrete random quantity X ∼ Po(k|λ) with

P(X = k) ≡ Po(k|λ) = e^{−λ} λ^k / Γ(k + 1) ; λ ∈ R⁺ ; k = 0, 1, 2, . . .

The moments with respect to the origin are

αn(λ) ≡ E[X^n] = e^{−λ} Σ_{k=0}^{∞} k^n λ^k / k!
If ak denotes the kth term of the sum, then

a_{k+1} = [λ/(k + 1)] (1 + 1/k)^n ak  →  lim_{k→∞} |a_{k+1}/ak| → 0

so, the series being absolutely convergent, moments of all orders exist. Taking the derivative
of αn(λ) with respect to λ one gets the recurrence relation

α_{n+1}(λ) = λ (αn(λ) + dαn(λ)/dλ) ; α0(λ) = 1

so we can easily get

α0 = 1; α1 = λ; α2 = λ(λ + 1); α3 = λ(λ² + 3λ + 1); α4 = λ(λ³ + 6λ² + 7λ + 1)

and from them

μ0 = 1; μ1 = 0; μ2 = λ; μ3 = λ; μ4 = λ(3λ + 1)

Thus, for the Poisson distribution Po(k|λ) we have that:

E[X] = λ; V[X] = λ; γ1 = λ^{−1/2}; γ2 = 3 + λ^{−1}
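The moments αn can also be obtained numerically from the (truncated) series; a sketch that reproduces E[X] = λ, V[X] = λ and γ1 = λ^{−1/2}:

```python
import math

def poisson_alpha(n, lam, kmax=150):
    """alpha_n = E[X^n] = e^-lam sum_k k^n lam^k / k!, series truncated at kmax."""
    total = 0.0
    term = math.exp(-lam)          # e^-lam lam^k / k! at k = 0
    for k in range(kmax + 1):
        total += k**n * term
        term *= lam / (k + 1)      # iterate to avoid huge factorials
    return total

lam = 5.0
a1 = poisson_alpha(1, lam)                          # expect lam
a2 = poisson_alpha(2, lam)                          # expect lam(lam + 1)
var = a2 - a1**2                                    # V[X] = lam
mu3 = poisson_alpha(3, lam) - 3 * a1 * var - a1**3  # third central moment = lam
skew = mu3 / var**1.5                               # gamma_1 = lam^(-1/2)
```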

Example 1.16 Consider X ∼ Ga(x|a, b) with:

p(x|a, b) = [a^b / Γ(b)] e^{−ax} x^{b−1} 1_{(0,∞)}(x) ; a, b ∈ R⁺

The moments with respect to the origin are

αn = E[X^n] = [a^b / Γ(b)] ∫_0^∞ e^{−ax} x^{b+n−1} dx = [Γ(b + n)/Γ(b)] a^{−n}

the integral being absolutely convergent. Thus we have:

μn = [1/(a^n Γ(b))] Σ_{k=0}^{n} (n choose k) (−b)^{n−k} Γ(b + k)

and in consequence

E[X] = b/a ; V[X] = b/a² ; γ1 = 2/√b ; γ2^ext = 6/b
Example 1.17 For the Cauchy distribution X ∼ Ca(x|1, 1),

p(x) = (1/π) [1/(1 + x²)] 1_{(−∞,∞)}(x)

we have that

αn = E[X^n] = (1/π) ∫_{−∞}^{∞} x^n/(1 + x²) dx

and clearly the integral diverges for n > 1, so there are no moments but the trivial
one α0. Even for n = 1, the integral

∫_{−∞}^{∞} |x|/(1 + x²) dx = 2 ∫_0^{∞} x/(1 + x²) dx = lim_{a→∞} ln(1 + a²)

is not absolutely convergent so, in the strict sense, there is no mean value. However, the
mode and the median are x0 = xm = 0, the distribution is symmetric about x = 0
and for n = 1 there exists the Cauchy Principal Value, which is equal to 0. Had we
introduced the Probability Distributions as a subset of Generalized Distributions, the
Principal Value would be an admissible distribution. It is left as an exercise to show that for:

• Pareto: X ∼ Pa(x|ν, xm) with p(x|xm, ν) ∝ x^{−(ν+1)} 1_{[xm,∞)}(x); xm, ν ∈ R⁺
• Student: X ∼ St(x|ν) with p(x|ν) ∝ (1 + x²/ν)^{−(ν+1)/2} 1_{(−∞,∞)}(x); ν ∈ R⁺

the moments αn = E[X^n] exist iff n < ν.
Another distribution of interest in physics is the Landau Distribution, which describes
the energy lost by a particle when traversing a material under certain conditions. The
probability density, given as the inverse Laplace Transform, is:

p(x) = (1/2πi) ∫_{c−i∞}^{c+i∞} e^{s log s + xs} ds

with c ∈ R⁺ and, closing the contour on the left along a counterclockwise semicircle
with a branch-cut along the negative real axis, it has the real representation

p(x) = (1/π) ∫_0^{∞} e^{−(r log r + xr)} sin(πr) dr

The actual expression of the distribution of the energy loss is quite involved and
some simplifying assumptions have been made; among other things, that the energy
transfer in the collisions is unbounded (no kinematic constraint). But nothing is for
free and the price to pay is that the Landau Distribution has no moments other than
the trivial one of order zero. This is why, instead of mean and variance, one talks about
the most probable energy loss and the full width at half maximum.
1.4.2.4 Correlation Coefficient

The covariance between the random quantities X i and X j was defined as:

V [X i , X j ] = V [X j , X i ] = E[(X i − μi ) (X i − μ j )] = E[X i X j ] − E[X i ]E[X j ]

If X_i and X_j are independent, then E[X_i X_j] = E[X_i]E[X_j] and V[X_i, X_j] = 0. Conversely, if V[X_i, X_j] ≠ 0 then E[X_i X_j] ≠ E[X_i]E[X_j] and, in consequence, X_i and X_j are not statistically independent. Thus, the covariance V[X_i, X_j] serves to quantify, to some extent, the degree of statistical dependence between the random quantities X_i and X_j. Again, for dimensional considerations one defines the correlation coefficient
ρ_{ij} = \frac{V[X_i, X_j]}{\sqrt{V[X_i]\, V[X_j]}} = \frac{E[X_i X_j] − E[X_i]E[X_j]}{σ_i\, σ_j}

Since p(x_i, x_j) is a non-negative function we can write

V[X_i, X_j] = \int_{R^2} \left( (x_i − μ_i)\sqrt{p(x_i, x_j)} \right)\left( (x_j − μ_j)\sqrt{p(x_i, x_j)} \right) dx_i\, dx_j

and from the Cauchy–Schwarz inequality:

−1 ≤ ρi j ≤ 1

The extreme values (+1, −1) will be taken when E[X i X j ] = E[X i ]E[X j ]±σi σ j
and ρ_{ij} = 0 when E[X_i X_j] = E[X_i]E[X_j]. In particular, it is immediate to see that if there is a linear relation between both random quantities, that is, X_i = aX_j + b, then ρ_{ij} = ±1. Therefore, it is a linear correlation coefficient. Note however that:
• If X i and X j are linearly related, ρi j = ±1, but ρi j = ±1 does not imply neces-
sarily a linear relation;
• If X i and X j are statistically independent, then ρi j = 0 but ρi j = 0 does not imply
necessarily statistical independence as the following example shows.

Example 1.18 Let X 1 ∼ p(x1 ) and define a random quantity X 2 as

X 2 = g(X 1 ) = a + bX 1 + cX 1 2

Obviously, X 1 and X 2 are not statistically independent for there is a clear parabolic
relation. However

V[X_1, X_2] = E[X_1 X_2] − E[X_1]E[X_2] = bσ² + c(α_3 − μ³ − μσ²)

with μ, σ² and α_3 respectively the mean, the variance and the moment of order 3 with respect to the origin of X_1. If we take b = cσ^{−2}(μ³ + μσ² − α_3), then V[X_1, X_2] = 0 and so is the (linear) correlation coefficient.
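This zero-correlation-without-independence construction can be checked numerically. A minimal sketch (my own illustration, not part of the text; assumes Python with numpy, and the exponential density, offset and sample size are arbitrary choices):

```python
import numpy as np

# X1 ~ Ex(1): mu = 1, sigma^2 = 1, alpha3 = E[X1^3] = 6.
# With X2 = a + b*X1 + c*X1^2 and b = c*(mu^3 + mu*sigma^2 - alpha3)/sigma^2,
# the covariance V[X1, X2] vanishes although X2 is a function of X1.
rng = np.random.default_rng(0)
x1 = rng.exponential(1.0, size=2_000_000)
mu, sigma2, alpha3 = 1.0, 1.0, 6.0
c = 1.0
b = c*(mu**3 + mu*sigma2 - alpha3)/sigma2      # = -4 here
x2 = 0.5 + b*x1 + c*x1**2
rho = np.corrcoef(x1, x2)[0, 1]
print(rho)   # compatible with 0 up to sampling error
```

The sample correlation is statistically compatible with zero even though X_2 is a deterministic (parabolic) function of X_1.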

NOTE 2: Information as a measure of independence.


The Mutual Information (see Sect. 4.4) serves also to quantify the degree of statistical
dependence between random quantities. Consider for instance the two-dimensional
random quantity X = (X 1 , X 2 )∼ p(x1 , x2 ). Then:
  
I(X_1 : X_2) = \int_{Ω_X} p(x_1, x_2)\, \ln\frac{p(x_1, x_2)}{p(x_1)\, p(x_2)}\, dx_1\, dx_2

and I(X_1 : X_2) ≥ 0 with equality iff p(x_1, x_2) = p(x_1)p(x_2). Let's look, as an example, at the bi-variate Normal distribution N(x|μ, Σ):

p(x|μ, Σ) = (2π)^{-1}\, |\det Σ|^{-1/2}\, \exp\left\{ -\frac{1}{2}(x − μ)^T Σ^{-1} (x − μ) \right\}

with covariance matrix

Σ = \begin{pmatrix} σ_1^2 & ρσ_1σ_2 \\ ρσ_1σ_2 & σ_2^2 \end{pmatrix} \qquad and \qquad \det Σ = σ_1^2\, σ_2^2\, (1 − ρ^2)

Since X_i ∼ N(x_i|μ_i, σ_i); i = 1, 2 we have that:

I(X_1 : X_2) = \int_{Ω_X} p(x|μ, Σ)\, \ln\frac{p(x|μ, Σ)}{p(x_1|μ_1, σ_1)\, p(x_2|μ_2, σ_2)}\, dx_1\, dx_2 = -\frac{1}{2}\ln(1 − ρ^2)

Thus, if X_1 and X_2 are independent (ρ = 0), then I(X_1 : X_2) = 0 and, when ρ → ±1, I(X_1 : X_2) → ∞.
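The closed form I(X_1 : X_2) = −½ ln(1 − ρ²) can be verified by Monte Carlo, estimating the expectation of the log density ratio under the joint distribution. A sketch (not from the text; assumes numpy and scipy, with ρ = 0.8 an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=500_000)
joint = stats.multivariate_normal(mean=[0.0, 0.0], cov=cov)
# I(X1:X2) = E[ln p(x1,x2) - ln p(x1) - ln p(x2)] under the joint density
log_ratio = (joint.logpdf(xy)
             - stats.norm.logpdf(xy[:, 0])
             - stats.norm.logpdf(xy[:, 1]))
I_mc = log_ratio.mean()
I_exact = -0.5*np.log(1.0 - rho**2)
print(I_mc, I_exact)   # both ~ 0.51
```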

1.4.3 The “Error Propagation Expression”

Consider an n-dimensional random quantity X = (X_1, ..., X_n) with E[X_i] = μ_i and the random quantity Y = g(X), with g(x) an infinitely differentiable function. If we make a Taylor expansion of g(X) around E[X] = μ we have

Y = g(X) = g(μ) + \sum_{i=1}^{n}\left(\frac{\partial g(x)}{\partial x_i}\right)_{μ} Z_i + \frac{1}{2!}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{\partial^2 g(x)}{\partial x_i\,\partial x_j}\right)_{μ} Z_i Z_j + R

where Z_i = X_i − μ_i. Now, E[Z_i] = 0 and E[Z_i Z_j] = V[X_i, X_j] = V_{ij}, so

E[Y] = E[g(X)] = g(μ) + \frac{1}{2!}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{\partial^2 g(x)}{\partial x_i\,\partial x_j}\right)_{μ} V_{ij} + \cdots

and therefore

Y − E[Y] = \sum_{i=1}^{n}\left(\frac{\partial g(x)}{\partial x_i}\right)_{μ} Z_i + \frac{1}{2!}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{\partial^2 g(x)}{\partial x_i\,\partial x_j}\right)_{μ} (Z_i Z_j − V_{ij}) + \cdots

Neglecting all but the first term,

V[Y] = E[(Y − E[Y])^2] = \sum_{i=1}^{n}\sum_{j=1}^{n}\left(\frac{\partial g(x)}{\partial x_i}\right)_{μ}\left(\frac{\partial g(x)}{\partial x_j}\right)_{μ} V[X_i, X_j] + \cdots

This is the first order approximation to V[Y] and it is usually reasonable, but it has to be used with care. On the one hand, we have assumed that higher order terms are negligible, and this is not always the case, so further terms of the expansion may have to be considered. Take for instance the simple case Y = X_1 X_2 with X_1 and X_2 independent random quantities. The first order expansion gives V[Y] ≃ μ_1²σ_2² + μ_2²σ_1², and including the second order terms (there are no more) gives V[Y] = μ_1²σ_2² + μ_2²σ_1² + σ_1²σ_2², the correct result. On the other hand, all this is obviously meaningless if the random quantity Y has no variance. This is for instance the case for Y = X_1 X_2^{−1} when X_{1,2} are Normal distributed.
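The Y = X_1 X_2 case above is easy to check numerically. A sketch (illustrative values of my own choosing; assumes numpy):

```python
import numpy as np

# First-order "error propagation" for Y = X1*X2 with independent X1, X2,
# versus the exact V[Y] = mu1^2 s2^2 + mu2^2 s1^2 + s1^2 s2^2 and Monte Carlo.
rng = np.random.default_rng(2)
mu1, s1 = 5.0, 0.5
mu2, s2 = 3.0, 0.2
x1 = rng.normal(mu1, s1, 1_000_000)
x2 = rng.normal(mu2, s2, 1_000_000)
v_first = mu1**2*s2**2 + mu2**2*s1**2      # first-order approximation
v_exact = v_first + s1**2*s2**2            # second-order term completes it
v_mc = np.var(x1*x2)
print(v_first, v_exact, v_mc)
```

With these values the first-order term is already close to the exact variance; shrinking μ_i relative to σ_i makes the missing σ_1²σ_2² term dominant.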

1.5 Integral Transforms

The Integral Transforms of Fourier, Laplace and Mellin are a very useful tool to study the properties of random quantities and their distribution functions. In particular, they will allow us to obtain the distribution of the sum, product and ratio of random quantities, the moments of the distributions, and to study the convergence of a sequence {F_k(x)}_{k=1}^{∞} of distribution functions to F(x).

1.5.1 The Fourier Transform

Let f : R → C be a complex and integrable function (f ∈ L¹(R)). The Fourier Transform F(t), with t ∈ R, of f(x) is defined as:

F(t) = \int_{-\infty}^{\infty} f(x)\, e^{ixt}\, dx

The class of functions for which the Fourier Transform exists is certainly much wider
than the probability density functions p(x) ∈ L 1 (R) (normalized real functions of
real argument) we are interested in for which the transform always exists. If X ∼ p(x),
the Fourier Transform is nothing else but the mathematical expectation

Φ(t) = E[e^{itX}];  t ∈ R

and it is called the Characteristic Function Φ(t). Thus, depending on the character of the random quantity X, we shall have:

• if X is discrete: Φ(t) = \sum_{x_k} e^{itx_k}\, P(X = x_k)
• if X is continuous: Φ(t) = \int_{-\infty}^{+\infty} e^{itx}\, dP(x) = \int_{-\infty}^{+\infty} e^{itx}\, p(x)\, dx

Attending to its definition, the Characteristic Function Φ(t), with t ∈ R, is a complex function with the following properties:

(1) Φ(0) = 1;
(2) Φ(t) is bounded: |Φ(t)| ≤ 1;
(3) Φ(t) has schwarzian symmetry: Φ(−t) = Φ*(t);
(4) Φ(t) is uniformly continuous in R.

The first three properties are obvious. For the fourth one, observe that for any ε > 0 there exists a δ > 0 such that |Φ(t_1) − Φ(t_2)| < ε when |t_1 − t_2| < δ, with t_1 and t_2 arbitrary in R, because

|Φ(t + δ) − Φ(t)| ≤ \int_{-\infty}^{+\infty} |1 − e^{-iδx}|\, dP(x) = 2\int_{-\infty}^{+\infty} |\sin(δx/2)|\, dP(x)

and this integral can be made arbitrarily small by taking a sufficiently small δ.
These properties, that obviously hold also for a discrete random quantity, are necessary but not sufficient for a function Φ(t) to be the Characteristic Function of a distribution P(x) (see Example 1.19). Generalizing to an n-dimensional random quantity X = (X_1, ..., X_n):

Φ(t_1, ..., t_n) = E[e^{i\,t·X}] = E[e^{i(t_1X_1 + \cdots + t_nX_n)}]

so, for the discrete case:

Φ(t_1, ..., t_n) = \sum_{x_1}\cdots\sum_{x_n} e^{i(t_1x_1 + \cdots + t_nx_n)}\, P(X_1 = x_1, ..., X_n = x_n)

and for the continuous case:

Φ(t_1, ..., t_n) = \int_{-\infty}^{+\infty} dx_1 \cdots \int_{-\infty}^{+\infty} dx_n\, e^{i(t_1x_1 + \cdots + t_nx_n)}\, p(x_1, ..., x_n)

The n-dimensional Characteristic Function is such that:

(1) Φ(0, ..., 0) = 1
(2) |Φ(t_1, ..., t_n)| ≤ 1
(3) Φ(−t_1, ..., −t_n) = Φ*(t_1, ..., t_n)

Laplace Transform: For a function f(x) : R^+ → C defined as f(x) = 0 for x < 0, we may also consider the Laplace Transform, defined as

L(s) = \int_0^{\infty} e^{-sx}\, f(x)\, dx

with s ∈ C, provided it exists. For a non-negative random quantity X ∼ p(x) this is just the mathematical expectation E[e^{−sX}] and it is named the Moment Generating Function, since its derivatives give the moments of the distribution (see Sect. 1.5.1.4). While the Fourier Transform exists for f(x) ∈ L¹(R), the Laplace Transform exists if e^{−sx} f(x) ∈ L¹(R^+), and thus for a wider class of functions; although it is formally defined for functions with non-negative support, it may be possible to extend the limits of integration to the whole real line (Bilateral Laplace Transform). However, for the functions we shall be interested in (probability density functions), both Fourier and Laplace Transforms exist and usually there is no major advantage in using one or the other.

Example 1.19 There are several criteria (Bochner, Khinchine, Cramér, …) specifying sufficient and necessary conditions for a function Φ(t) that satisfies the four aforementioned conditions to be the Characteristic Function of a random quantity X ∼ F(x). However, it is easy to find simple functions like

g_1(t) = e^{-t^4} \qquad and \qquad g_2(t) = \frac{1}{1 + t^4}

that satisfy the four stated conditions and that can not be Characteristic Functions associated to any distribution. Let's calculate the moments of order one with respect to the origin and the central one of order two. In both cases (see Sect. 1.5.1.4) we have that:

α1 = μ = E[X ] = 0 and μ2 = σ 2 = E[(X − μ)2 ] = 0

that is, the mean value and the variance are zero, so the distribution function is zero almost everywhere except at X = 0, where P(X = 0) = 1… but this is the Singular Distribution Sn(x|0), which takes the value 1 if X = 0 and 0 otherwise, and whose Characteristic Function is Φ(t) = 1. In general, any function Φ(t) that in a neighbourhood of t = 0 behaves as Φ(t) = 1 + O(t^{2+ε}) with ε > 0 can not be the Characteristic Function associated to a distribution F(x) unless Φ(t) = 1 for all t ∈ R.

Example 1.20 The elements of the Cantor Set C_S(0, 1) can be represented in base 3 as:

X = \sum_{n=1}^{\infty} \frac{X_n}{3^n}

with X n ∈ {0, 2}. This set is non-denumerable and has zero Lebesgue measure so
any distribution with support on it is singular and, in consequence, has no pdf. The
Uniform Distribution on C S (0, 1) is defined assigning a probability P(X n = 0) =
P(X_n = 2) = 1/2 (Geometric Distribution). Then, for the random quantity X_n we have that

Φ_{X_n}(t) = E[e^{itX_n}] = \frac{1}{2}\left(1 + e^{2it}\right)

and for Y_n = X_n/3^n:

Φ_{Y_n}(t) = Φ_{X_n}(t/3^n) = \frac{1}{2}\left(1 + e^{2it/3^n}\right)
Being all X_n statistically independent, we have that

Φ_X(t) = \prod_{n=1}^{\infty} \frac{1}{2}\left(1 + e^{2it/3^n}\right) = \prod_{n=1}^{\infty} e^{it/3^n}\cos(t/3^n) = e^{it/2}\prod_{n=1}^{\infty}\cos(t/3^n)

and, from the derivatives (Sect. 1.5.1.4), it is straightforward to calculate the moments of the distribution. In particular:

Φ'_X(0) = \frac{i}{2} \;\longrightarrow\; E[X] = \frac{1}{2} \qquad and \qquad Φ''_X(0) = -\frac{3}{8} \;\longrightarrow\; E[X^2] = \frac{3}{8}

so V[X] = 1/8.
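These moments can be reproduced by simulation. A sketch (not from the text; assumes numpy, with the ternary expansion truncated at 30 digits, which is far below the statistical precision):

```python
import numpy as np

# Uniform distribution on the Cantor set: X = sum_n X_n/3^n with the
# digits X_n in {0, 2} equiprobable. Check E[X] = 1/2 and V[X] = 1/8.
rng = np.random.default_rng(3)
digits = 2*rng.integers(0, 2, size=(200_000, 30))   # X_n in {0, 2}
x = (digits / 3.0**np.arange(1, 31)).sum(axis=1)
print(x.mean(), x.var())   # ~ 0.5 and ~ 0.125
```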

1.5.1.1 Inversion Theorem (Lévy 1925)

The Inverse Fourier Transform allows us to obtain the distribution function of a random quantity from the Characteristic Function. If X is a continuous random quantity and Φ(t) its Characteristic Function, then the pdf p(x) will be given by

p(x) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{-itx}\, Φ(t)\, dt

provided that p(x) is continuous at x and, if X is discrete:

P(X = x_k) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{-itx_k}\, Φ(t)\, dt

In particular, if the discrete distribution is reticular (that is, all the possible values that the random quantity X may take can be expressed as a + bn with a, b ∈ R, b ≠ 0 and n integer) we have that:

P(X = x_k) = \frac{b}{2\pi}\int_{-\pi/b}^{\pi/b} e^{-itx_k}\, Φ(t)\, dt

From these expressions we can also obtain the relation between the Characteristic Function and the Distribution Function. For discrete random quantities we shall have:

F(x) = \sum_{x_k ≤ x} P(X = x_k) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} Φ(t) \sum_{x_k ≤ x} e^{-itx_k}\, dt

and, in the continuous case, for x_1 < x_2 ∈ R we have that:

F(x_2) − F(x_1) = \int_{x_1}^{x_2} p(x)\, dx = \frac{1}{2\pi i}\int_{-\infty}^{+\infty} \frac{Φ(t)}{t}\left(e^{-itx_1} − e^{-itx_2}\right) dt

so, taking x_1 = 0, we have that (Lévy, 1925):

F(x) = F(0) + \frac{1}{2\pi i}\int_{-\infty}^{+\infty} \frac{Φ(t)}{t}\left(1 − e^{-itx}\right) dt

The Inversion Theorem states that there is a one-to-one correspondence between a distribution function and its Characteristic Function, so to each Characteristic Function corresponds one and only one distribution function, which can be either discrete or continuous but not a combination of both. Therefore, two distribution functions with the same Characteristic Function may differ, at most, on their points of discontinuity which, as we have seen, are a set of zero measure. In consequence, if we have two random quantities X_1 and X_2 with distribution functions P_1(x) and P_2(x), a necessary and sufficient condition for P_1(x) = P_2(x) a.e. is that Φ_1(t) = Φ_2(t) for all t ∈ R.

1.5.1.2 Changes of Variable

Let X ∼ P(x) be a random quantity with Characteristic Function Φ_X(t) and g(X) a one-to-one finite real function defined for all real values of X. The Characteristic Function of the random quantity Y = g(X) will be given by:

Φ_Y(t) = E_Y[e^{itY}] = E_X[e^{itg(X)}]

that is:

Φ_Y(t) = \int_{-\infty}^{+\infty} e^{itg(x)}\, dP(x) \qquad or \qquad Φ_Y(t) = \sum_{x_k} e^{itg(x_k)}\, P(X = x_k)

depending on whether X is continuous or discrete. In the particular case of a linear transformation Y = aX + b with a and b real constants, we have that:

Φ_Y(t) = E_X[e^{it(aX + b)}] = e^{itb}\, Φ_X(at).

1.5.1.3 Sum of Random Quantities

The Characteristic Function is particularly useful to obtain the Distribution Function of a random quantity defined as the sum of independent random quantities. If X_1, ..., X_n are n independent random quantities with Characteristic Functions Φ_1(t), ..., Φ_n(t), then the Characteristic Function of X = X_1 + ··· + X_n will be:

Φ_X(t) = E[e^{itX}] = E[e^{it(X_1 + \cdots + X_n)}] = Φ_1(t) \cdots Φ_n(t)

that is, the product of the Characteristic Functions of each one; a necessary but not sufficient condition for the random quantities X_1, ..., X_n to be independent. In a similar way, we have that if X = X_1 − X_2 with X_1 and X_2 independent random quantities, then

Φ_X(t) = E[e^{it(X_1 − X_2)}] = Φ_1(t)\, Φ_2(−t) = Φ_1(t)\, Φ_2^*(t)

From these considerations, it is left as an exercise to show that:

• Poisson: The sum of n independent random quantities, each distributed as Po(x_k|μ_k) with k = 1, ..., n, is Poisson distributed with parameter μ_s = μ_1 + ··· + μ_n.
• Normal: The sum of n independent random quantities, each distributed as N(x_k|μ_k, σ_k) with k = 1, ..., n, is Normal distributed with mean μ_s = μ_1 + ··· + μ_n and variance σ_s² = σ_1² + ··· + σ_n².
• Cauchy: The sum of n independent random quantities, each Cauchy distributed Ca(x_k|α_k, β_k) with k = 1, ..., n, is Cauchy distributed with parameters α_s = α_1 + ··· + α_n and β_s = β_1 + ··· + β_n.
• Gamma: The sum of n independent random quantities, each distributed as Ga(x_k|α, β_k) with k = 1, ..., n, is Gamma distributed with parameters (α, β_1 + ··· + β_n).
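Two of these closure properties can be checked quickly by Monte Carlo. A sketch (the parameter values are my own arbitrary choices; assumes numpy and scipy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 400_000
mus = [0.7, 1.5, 2.3]
s_pois = sum(rng.poisson(m, n) for m in mus)      # should be Po(mu1+mu2+mu3)
p_mc = (s_pois == 3).mean()
p_th = stats.poisson.pmf(3, sum(mus))
s_norm = rng.normal(1.0, 2.0, n) + rng.normal(-0.5, 1.5, n)
print(p_mc, p_th)                    # both ~ 0.169
print(s_norm.mean(), s_norm.std())   # ~ 0.5 and ~ sqrt(4 + 2.25) = 2.5
```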

Example 1.21 (Difference of Poisson distributed random quantities) Consider two independent random quantities X_1 ∼ Po(x_1|μ_1) and X_2 ∼ Po(x_2|μ_2) and let us find the distribution of X = X_1 − X_2. Since for the Poisson distribution

X_i ∼ Po(μ_i) \;\longrightarrow\; Φ_i(t) = e^{-μ_i(1 − e^{it})}

we have that

Φ_X(t) = Φ_1(t)\, Φ_2^*(t) = e^{-(μ_1 + μ_2)}\, e^{(μ_1 e^{it} + μ_2 e^{-it})}

Obviously, X is a discrete random quantity with integer support Ω_X = {..., −2, −1, 0, 1, 2, ...}; that is, a reticular random quantity with a = 0 and b = 1. Then

P(X = n) = \frac{1}{2\pi}\, e^{-μ_S}\int_{-\pi}^{\pi} e^{-itn}\, e^{(μ_1 e^{it} + μ_2 e^{-it})}\, dt

being μ_S = μ_1 + μ_2. If we take

z = \sqrt{\frac{μ_1}{μ_2}}\; e^{it}

we have

P(X = n) = e^{-μ_S}\left(\frac{μ_1}{μ_2}\right)^{n/2}\frac{1}{2\pi i}\oint_C z^{-n-1}\, e^{\frac{w}{2}(z + 1/z)}\, dz
with w = 2\sqrt{μ_1 μ_2} and C the circle |z| = \sqrt{μ_1/μ_2} around the origin. From the definition of the Modified Bessel Function of the first kind

I_n(z) = \frac{1}{2\pi i}\oint_C t^{-n-1}\, e^{\frac{z}{2}(t + 1/t)}\, dt

with C a circle enclosing the origin anticlockwise, and considering that I_{−n}(z) = I_n(z), we finally have:

P(X = n) = e^{-(μ_1 + μ_2)}\left(\frac{μ_1}{μ_2}\right)^{n/2} I_{|n|}\!\left(2\sqrt{μ_1 μ_2}\right).
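This is the Skellam distribution, which scipy provides directly, so the closed form can be cross-checked. A sketch (μ values arbitrary; assumes scipy):

```python
import numpy as np
from scipy import stats
from scipy.special import iv   # modified Bessel function I_nu

# Compare P(X = n) = exp(-(mu1+mu2)) (mu1/mu2)^(n/2) I_|n|(2 sqrt(mu1 mu2))
# with scipy's stats.skellam over a range of n.
mu1, mu2 = 3.0, 1.5
n = np.arange(-5, 6)
p_formula = (np.exp(-(mu1 + mu2)) * (mu1/mu2)**(n/2.0)
             * iv(np.abs(n), 2.0*np.sqrt(mu1*mu2)))
p_scipy = stats.skellam.pmf(n, mu1, mu2)
print(np.max(np.abs(p_formula - p_scipy)))   # agreement to numerical precision
```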

1.5.1.4 Moments of a Distribution

Consider a continuous random quantity X ∼ P(x) with Characteristic Function

Φ(t) = E[e^{itX}] = \int_{-\infty}^{+\infty} e^{itx}\, dP(x)

and let us assume that the moment of order k exists. Then, upon deriving the Characteristic Function k times with respect to t, we have:

\frac{\partial^k}{\partial t^k}\,Φ(t) = E[i^k X^k e^{itX}] = \int_{-\infty}^{+\infty} (ix)^k\, e^{itx}\, dP(x)

and taking t = 0 we get the moments with respect to the origin:

E[X^k] = \frac{1}{i^k}\left(\frac{\partial^k}{\partial t^k}\,Φ(t)\right)_{t=0}

Consider now the Characteristic Function referred to an arbitrary point a ∈ R; that is:

Φ(t, a) = E[e^{it(X − a)}] = \int_{-\infty}^{+\infty} e^{it(x − a)}\, dP(x) = e^{-ita}\, Φ(t)

In a similar way, upon k-fold derivation with respect to t, we get the moments with respect to an arbitrary point a:

E[(X − a)^k] = \frac{1}{i^k}\left(\frac{\partial^k}{\partial t^k}\,Φ(t, a)\right)_{t=0}

and the central moments if a = E[X] = μ. The extension to n dimensions is immediate: for an n-dimensional random quantity X we shall have, for the moment α_{k_1...k_n} with respect to the origin,

α_{k_1...k_n} = E[X_1^{k_1} \cdots X_n^{k_n}] = \frac{1}{i^{k_1+\cdots+k_n}}\left(\frac{\partial^{k_1+\cdots+k_n}}{\partial t_1^{k_1} \cdots \partial t_n^{k_n}}\,Φ(t_1, ..., t_n)\right)_{t_1=\cdots=t_n=0}

Example 1.22 For the difference of Poisson distributed random quantities analyzed in the previous example, one can easily derive the moments from the derivatives of the Characteristic Function. Since

\log Φ_X(t) = -(μ_1 + μ_2) + (μ_1 e^{it} + μ_2 e^{-it})

we have that

Φ'_X(0) = i(μ_1 − μ_2) \;\longrightarrow\; E[X] = μ_1 − μ_2
Φ''_X(0) = (Φ'_X(0))^2 − (μ_1 + μ_2) \;\longrightarrow\; V[X] = μ_1 + μ_2

and so on.
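These derivative relations can be sanity-checked with finite differences of Φ_X at t = 0. A sketch (arbitrary μ values; assumes numpy):

```python
import numpy as np

# Recover E[X] = mu1 - mu2 and V[X] = mu1 + mu2 from numerical derivatives
# of Phi_X(t) = exp(-(mu1+mu2) + mu1 e^{it} + mu2 e^{-it}) at t = 0.
mu1, mu2 = 4.0, 2.5
phi = lambda t: np.exp(-(mu1 + mu2) + mu1*np.exp(1j*t) + mu2*np.exp(-1j*t))
h = 1e-4
d1 = (phi(h) - phi(-h)) / (2*h)               # ~ Phi'(0) = i (mu1 - mu2)
d2 = (phi(h) - 2*phi(0.0) + phi(-h)) / h**2   # ~ Phi''(0)
mean = (d1/1j).real
var = (d2/1j**2).real - mean**2
print(mean, var)   # ~ 1.5 and ~ 6.5
```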

Problem 1.3 The Moyal Distribution, with density

p(x) = \frac{1}{\sqrt{2\pi}}\,\exp\left\{-\frac{1}{2}\left(x + e^{-x}\right)\right\} 1_{(-\infty,\infty)}(x)

is sometimes used as an approximation to the Landau Distribution. Obtain the Characteristic Function Φ(t) = π^{-1/2}\, 2^{-it}\, Γ(1/2 − it) and show that E[X] = γ_E + \ln 2 and V[X] = π²/2.
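The target values of Problem 1.3 can be checked against scipy, which implements the standard Moyal distribution as `stats.moyal` (a sketch, not a solution of the problem):

```python
import numpy as np
from scipy import stats

# E[X] = gamma_E + ln 2 ~ 1.270 and V[X] = pi^2/2 ~ 4.935 for the Moyal pdf.
mean_th = np.euler_gamma + np.log(2.0)
var_th = np.pi**2 / 2.0
print(stats.moyal.mean(), mean_th)
print(stats.moyal.var(), var_th)
```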

1.5.2 The Mellin Transform

Let f : R^+ → C be a complex and integrable function with support on the positive real axis. The Mellin Transform is defined as:

M(f; s) = M_f(s) = \int_0^{\infty} f(x)\, x^{s-1}\, dx

with s ∈ C, provided that the integral exists. In general, we shall be interested in continuous probability density functions f(x) such that

\lim_{x\to 0^+} f(x) = O(x^{α}) \qquad and \qquad \lim_{x\to\infty} f(x) = O(x^{β})

and therefore

|M(f; s)| ≤ \int_0^{\infty} |f(x)|\, x^{Re(s)-1}\, dx = \int_0^1 |f(x)|\, x^{Re(s)-1}\, dx + \int_1^{\infty} |f(x)|\, x^{Re(s)-1}\, dx ≤ C_1\int_0^1 x^{Re(s)-1+α}\, dx + C_2\int_1^{\infty} x^{Re(s)-1+β}\, dx

The first integral converges for −α < Re(s) and the second for Re(s) < −β, so the Mellin Transform exists and is holomorphic on the band −α < Re(s) < −β, parallel to the imaginary axis Im(s) and determined by the conditions of convergence of the integral. We shall denote the holomorphy band (which can be half of the complex plane or the whole complex plane) by S_f = ⟨−α, −β⟩. Last, to simplify the notation when dealing with several random quantities, we shall write, for X_n ∼ p_n(x), M_n(s) or M_X(s) instead of M(p_n; s).

1.5.2.1 Inversion

For a given function f(t) we have that

M(f; s) = \int_0^{\infty} f(t)\, t^{s-1}\, dt = \int_0^{\infty} f(t)\, e^{(s-1)\ln t}\, dt = \int_{-\infty}^{\infty} f(e^u)\, e^{su}\, du

assuming that the integral exists. Since s ∈ C, we can write s = x + iy, so the Mellin Transform of f(t) is the Fourier Transform of g(u) = f(e^u)\, e^{xu}. Setting now t = e^u we have that

f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} M(f; s = x + iy)\, t^{-(x+iy)}\, dy = \frac{1}{2\pi i}\int_{σ-i\infty}^{σ+i\infty} M(f; s)\, t^{-s}\, ds

where, due to Cauchy's Theorem, σ lies anywhere within the holomorphy band. The uniqueness of the result holds with respect to this strip so, in fact, the Mellin Transform consists of the pair M(s) together with the band ⟨a, b⟩.

Example 1.23 It is clear that to determine the function f(x) from the transform F(s) we have to specify the strip of analyticity for, otherwise, we do not know which poles should be included. Let's see as an example f_1(x) = e^{−x}. We have that

M_1(z) = \int_0^{\infty} e^{-x}\, x^{z-1}\, dx = Γ(z)

holomorphic in the band ⟨0, ∞⟩ so, for the inverse transform, we shall include the poles z = 0, −1, −2, .... For f_2(x) = e^{−x} − 1 we get M_2(s) = Γ(s), the same function, but

\lim_{x\to 0^+} f(x) = O(x^1) \longrightarrow α = 1 \qquad and \qquad \lim_{x\to\infty} f(x) = O(x^0) \longrightarrow β = 0

Thus, the holomorphy strip is ⟨−1, 0⟩ and for the inverse transform we shall include the poles z = −1, −2, .... For f_3(x) = e^{−x} − 1 + x we get M_3(s) = Γ(s), again the same function, but

\lim_{x\to 0^+} f(x) = O(x^2) \longrightarrow α = 2 \qquad and \qquad \lim_{x\to\infty} f(x) = O(x^1) \longrightarrow β = 1

Thus, the holomorphy strip is ⟨−2, −1⟩ and for the inverse transform we include the poles z = −2, −3, ....

1.5.2.2 Useful Properties

Consider a positive random quantity X with continuous density p(x), x ∈ [0, ∞), the Mellin Transform M_X(s) (defined only for x ≥ 0)

M(p; s) = \int_0^{\infty} x^{s-1}\, p(x)\, dx = E[X^{s-1}]

and the Inverse Transform

p(x) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} x^{-s}\, M(p; s)\, ds

defined for all x where p(x) is continuous, with the line of integration contained in the strip of analyticity of M(p; s). Then:

• Moments: E[X^n] = M_X(n + 1);
• For the positive random quantity Z = aX^b (a, b ∈ R and a > 0) we have that

M_Z(s) = \int_0^{\infty} z^{s-1}\, f(z)\, dz = a^{s-1}\int_0^{\infty} x^{b(s-1)}\, p(x)\, dx = a^{s-1}\, M_X(bs − b + 1)

2\pi i\, p(z) = \int_{c-i\infty}^{c+i\infty} z^{-s}\, a^{s-1}\, M_X(bs − b + 1)\, ds

In particular, for Z = 1/X (a = 1 and b = −1) we have that

M_{Z=1/X}(s) = M_X(2 − s)

• If Z = X_1 X_2 \cdots X_n with {X_i}_{i=1}^{n} independent, positively defined random quantities, each distributed as p_i(x_i), we have that

M_Z(s) = \int_0^{\infty} z^{s-1}\, p(z)\, dz = \prod_{i=1}^{n}\int_0^{\infty} x_i^{s-1}\, p_i(x_i)\, dx_i = \prod_{i=1}^{n} E[X_i^{s-1}] = \prod_{i=1}^{n} M_i(s)

2\pi i\, p(z) = \int_{c-i\infty}^{c+i\infty} z^{-s}\, M_1(s) \cdots M_n(s)\, ds

In particular, for n = 2, X = X_1 X_2, it is easy to check that

p(x) = \int_0^{\infty} p_1(w)\, p_2(x/w)\, \frac{dw}{w} = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} x^{-s}\, M_1(s)\, M_2(s)\, ds

Obviously, the strip of holomorphy is S_1 ∩ S_2.


• For X = X_1/X_2, with both X_1 and X_2 positively defined and independent, we have that

M_X(s) = M_1(s)\, M_2(2 − s)

and therefore

p(x) = \int_0^{\infty} p_1(wx)\, p_2(w)\, w\, dw = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} x^{-s}\, M_1(s)\, M_2(2 − s)\, ds
• Consider the distribution function F(x) = \int_0^x p(u)\, du of the random quantity X. Since dF(x) = p(x)\, dx we have that

M(p(x); s) = \int_0^{\infty} x^{s-1}\, dF(x) = \left[x^{s-1} F(x)\right]_0^{\infty} − (s − 1)\int_0^{\infty} x^{s-2}\, F(x)\, dx

and therefore, if \lim_{x\to 0^+}[x^{s-1}F(x)] = 0 and \lim_{x\to\infty}[x^{s-1}F(x)] = 0 we have, relabelling s − 1 → s, that

M(F(x); s) = M\left(\int_0^x p(u)\, du;\, s\right) = -\frac{1}{s}\, M(p(x); s + 1).

1.5.2.3 Some Useful Examples

• Ratio and product of two independent Exponential distributed random quantities

Consider X_1 ∼ Ex(x_1|a_1) and X_2 ∼ Ex(x_2|a_2). The Mellin Transform of X ∼ Ex(x|a) is

M_X(s) = \int_0^{\infty} x^{s-1}\, p(x|a)\, dx = a\int_0^{\infty} x^{s-1}\, e^{-ax}\, dx = \frac{Γ(s)}{a^{s-1}}

and therefore, for Z = 1/X:

M_Z(s) = M_X(2 − s) = \frac{Γ(2 − s)}{a^{1-s}}
In consequence, we have that

• X = X_1 X_2 \longrightarrow M_X(z) = M_1(z)\, M_2(z) = \frac{Γ(z)^2}{(a_1 a_2)^{z-1}}

p(x) = \frac{a_1 a_2}{2\pi i}\int_{c-i\infty}^{c+i\infty} (a_1 a_2 x)^{-z}\, Γ(z)^2\, dz

The poles of the integrand are at z_n = −n and the residues13 are

Res(f(z), z_n) = \frac{(a_1 a_2 x)^n}{(n!)^2}\left(2ψ(n + 1) − \ln(a_1 a_2 x)\right)

and therefore

p(x) = a_1 a_2\sum_{n=0}^{\infty} \frac{(a_1 a_2 x)^n}{(n!)^2}\left(2ψ(n + 1) − \ln(a_1 a_2 x)\right)

13 In the following examples, −π ≤ arg(z) < π.




If we define w = 2\sqrt{a_1 a_2 x},

p(x) = 2\, a_1 a_2\, K_0(2\sqrt{a_1 a_2 x})\, 1_{(0,∞)}(x)

from the Neumann Series expansion of the Modified Bessel Function K_0(w).
• Y = X_1 X_2^{-1} \longrightarrow M_Y(z) = M_1(z)\, M_2(2 − z) = \left(\frac{a_2}{a_1}\right)^{z-1}\frac{π(1 − z)}{\sin(zπ)}

p(x) = \frac{a_1 a_2^{-1}}{2i}\int_{c-i\infty}^{c+i\infty} (a_1 a_2^{-1} x)^{-z}\, \frac{1 − z}{\sin(zπ)}\, dz

Considering again the poles of M_Y(z) at z_n = −n we get the residues

Res(f(z), z_n) = (1 + n)\,(−1)^n \left(\frac{a_1}{a_2}\right)^{n+1} x^n

and therefore:

p(x) = \frac{a_1}{a_2}\sum_{n=0}^{\infty} (1 + n)\left(-\frac{a_1 x}{a_2}\right)^n = \frac{a_1 a_2}{(a_2 + a_1 x)^2}\, 1_{(0,∞)}(x)

To summarize, if X_1 ∼ Ex(x_1|a_1) and X_2 ∼ Ex(x_2|a_2) are independent random quantities:

X = X_1 X_2 ∼ 2\, a_1 a_2\, K_0(2\sqrt{a_1 a_2 x})\, 1_{(0,∞)}(x)
Y = X_1/X_2 ∼ \frac{a_1 a_2}{(a_2 + a_1 x)^2}\, 1_{(0,∞)}(x)
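The product density can be cross-checked by Monte Carlo, comparing the probability content of an interval with a numerical integral of the closed form. A sketch (arbitrary a₁, a₂ and interval; assumes numpy and scipy; note numpy's `exponential` takes the scale 1/a, not the rate a):

```python
import numpy as np
from scipy.special import k0
from scipy.integrate import quad

# X = X1*X2 with X1 ~ Ex(a1), X2 ~ Ex(a2) independent should have density
# 2 a1 a2 K0(2 sqrt(a1 a2 x)) on (0, inf).
rng = np.random.default_rng(5)
a1, a2 = 2.0, 0.5
x = rng.exponential(1/a1, 1_000_000) * rng.exponential(1/a2, 1_000_000)
p_mc = ((x > 0.5) & (x < 1.5)).mean()
p_th, _ = quad(lambda t: 2*a1*a2*k0(2*np.sqrt(a1*a2*t)), 0.5, 1.5)
print(p_mc, p_th)   # agree within Monte Carlo error
```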

• Ratio and product of two independent Gamma distributed random quantities

Consider Y ∼ Ga(x|a, b). Then X = aY ∼ Ga(x|1, b) with Mellin Transform

M_X(s) = \frac{Γ(b + s − 1)}{Γ(b)}

Then, if X_1 ∼ Ga(x_1|1, b_1) and X_2 ∼ Ga(x_2|1, b_2), b_1 ≠ b_2:


• X = X_1 X_2^{-1} \longrightarrow M_X(z) = M_1(z)\, M_2(2 − z) = \frac{Γ(b_1 − 1 + z)\, Γ(b_2 + 1 − z)}{Γ(b_1)\, Γ(b_2)}

2\pi i\, Γ(b_1)\, Γ(b_2)\, p(x) = \int_{c-i\infty}^{c+i\infty} x^{-z}\, Γ(b_1 − 1 + z)\, Γ(b_2 + 1 − z)\, dz

Closing the contour on the left of the line Re(z) = c, contained in the strip of holomorphy ⟨1 − b_1, 1 + b_2⟩, we have poles of order one at b_1 − 1 + z_n = −n with n = 0, 1, 2, ..., that is, at z_n = 1 − b_1 − n. Expansion around z = z_n + ε gives the residues

Res(f(z), z_n) = \frac{(−1)^n}{n!}\, Γ(b_1 + b_2 + n)\, x^{n+b_1-1}

and therefore the quantity X = X_1/X_2 is distributed as

p(x) = \frac{x^{b_1-1}}{Γ(b_1)\, Γ(b_2)}\sum_{n=0}^{\infty}\frac{(−1)^n}{n!}\, Γ(b_1 + b_2 + n)\, x^n = \frac{Γ(b_1 + b_2)}{Γ(b_1)\, Γ(b_2)}\, \frac{x^{b_1-1}}{(1 + x)^{b_1+b_2}}\, 1_{(0,∞)}(x)
n=0

• X = X_1 X_2 \longrightarrow M_X(z) = M_1(z)\, M_2(z) = \frac{Γ(b_1 − 1 + z)\, Γ(b_2 − 1 + z)}{Γ(b_1)\, Γ(b_2)}

Without loss of generality, we may assume that b_2 > b_1, so the strip of holomorphy is ⟨1 − b_1, ∞⟩. Then, with c > 1 − b_1 real,

Γ(b_1)\, Γ(b_2)\, p(x) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} x^{-z}\, Γ(b_1 − 1 + z)\, Γ(b_2 − 1 + z)\, dz

Considering the definition of the Modified Bessel Functions

I_ν(x) = \sum_{n=0}^{\infty}\frac{1}{n!\, Γ(1 + n + ν)}\left(\frac{x}{2}\right)^{2n+ν} \qquad and \qquad K_ν(x) = \frac{π}{2}\,\frac{I_{−ν}(x) − I_ν(x)}{\sin(νπ)}

we get that

p(x) = \frac{2}{Γ(b_1)\, Γ(b_2)}\, x^{(b_1+b_2)/2-1}\, K_ν(2\sqrt{x})\, 1_{(0,∞)}(x)

with ν = b_2 − b_1 > 0.
To summarize, if X_1 ∼ Ga(x_1|a_1, b_1) and X_2 ∼ Ga(x_2|a_2, b_2) are two independent random quantities and ν = b_2 − b_1 > 0, we have that

X = X_1 X_2 ∼ \frac{2\, a_1^{b_1} a_2^{b_2}}{Γ(b_1)\, Γ(b_2)}\left(\frac{a_1}{a_2}\right)^{ν/2} x^{(b_1+b_2)/2-1}\, K_ν(2\sqrt{a_1 a_2 x})\, 1_{(0,∞)}(x)

X = X_1/X_2 ∼ \frac{Γ(b_1 + b_2)}{Γ(b_1)\, Γ(b_2)}\, \frac{a_1^{b_1} a_2^{b_2}\, x^{b_1-1}}{(a_2 + a_1 x)^{b_1+b_2}}\, 1_{(0,∞)}(x)

• Ratio and product of two independent Uniform distributed random quantities

Consider X ∼ Un(x|0, 1). Then M_X(z) = 1/z with S = ⟨0, ∞⟩. For X = X_1 \cdots X_n we have

p(x) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} e^{-z\ln x}\, z^{-n}\, dz = \frac{(-\ln x)^{n-1}}{Γ(n)}\, 1_{(0,1]}(x)

z = 0 being the only pole, of order n.

For X = X_1/X_2 one has to be careful when defining the contours. In principle,

M_X(s) = M_1(s)\, M_2(2 − s) = \frac{1}{s}\,\frac{1}{2 − s}

so the strip of holomorphy is S = ⟨0, 2⟩ and there are two poles, at s = 0 and s = 2. If \ln x < 0 → x < 1 we shall close the Bromwich contour on the left, enclosing the pole at s = 0, and if \ln x > 0 → x > 1 we shall close the contour on the right, enclosing the pole at s = 2, so that the integrals converge. Then it is easy to get

p(x) = \frac{1}{2}\left[1_{(0,1]}(x) + x^{-2}\, 1_{(1,∞)}(x)\right] = \frac{1}{2}\left[Un(x|0, 1) + Pa(x|1, 1)\right]

Note that

E[X^n] = M_X(n + 1) = \frac{1}{n + 1}\,\frac{1}{1 − n}

and therefore there are no moments for n ≥ 1.
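Both uniform results are easy to verify by simulation. A sketch (not from the text; assumes numpy, with n = 3 factors and sample sizes chosen arbitrarily):

```python
import numpy as np

# Product of three Un(0,1): the Mellin transform gives E[X] = M(2) = (1/2)^3.
# Ratio of two Un(0,1): density (1/2)[1_(0,1] + x^-2 1_(1,inf)] implies
# P(X <= 1) = 1/2 and P(X > 2) = 1/4.
rng = np.random.default_rng(6)
u = rng.uniform(size=(1_000_000, 3))
prod_mean = u.prod(axis=1).mean()
ratio = rng.uniform(size=1_000_000) / rng.uniform(size=1_000_000)
print(prod_mean)                                # ~ 0.125
print((ratio <= 1).mean(), (ratio > 2).mean())  # ~ 0.5 and ~ 0.25
```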

Example 1.24 Show that if X_i ∼ Be(x_i|a_i, b_i) with a_i, b_i > 0, then

M_i(s) = \frac{Γ(a_i + b_i)\, Γ(s + a_i − 1)}{Γ(a_i)\, Γ(s + a_i + b_i − 1)}

with S = ⟨1 − a_i, ∞⟩ and therefore:

• X = X_1 X_2

p(x) = N_p\, x^{a_1-1}\,(1 − x)^{b_1+b_2-1}\, F(a_1 − a_2 + b_1,\, b_2,\, b_1 + b_2,\, 1 − x)\, 1_{(0,1)}(x)

with

N_p = \frac{Γ(a_1 + b_1)\, Γ(a_2 + b_2)}{Γ(a_1)\, Γ(a_2)\, Γ(b_1 + b_2)}

• X = X_1/X_2

p(x) = N_1\, x^{-(a_2+1)}\, F(1 − b_2,\, a_1 + a_2,\, b_1 + a_1 + a_2,\, x^{-1})\, 1_{(1,∞)}(x) + N_2\, x^{a_1-1}\, F(1 − b_1,\, a_1 + a_2,\, b_2 + a_1 + a_2,\, x)\, 1_{(0,1)}(x)

with

N_k = \frac{B(a_1 + a_2, b_k)}{B(a_1, b_1)\, B(a_2, b_2)}

Example 1.25 Consider a random quantity

X ∼ p(x|a, b) = \frac{2\, a^{(b+1)/2}}{Γ(b/2 + 1/2)}\, e^{-ax^2}\, x^{b}

with a, b > 0 and x ∈ [0, ∞). Show that

M(s) = \frac{Γ(b/2 + s/2)}{Γ(b/2 + 1/2)}\, a^{-(s-1)/2}

with S = ⟨−b, ∞⟩ and, from this, derive that the probability density function of X = X_1 X_2, with X_1 ∼ p(x_1|a_1, b_1) and X_2 ∼ p(x_2|a_2, b_2) independent, is given by:

p(x) = \frac{4\, a_1 a_2}{Γ(b_1/2 + 1/2)\, Γ(b_2/2 + 1/2)}\,\left(\sqrt{a_1 a_2}\, x\right)^{(b_1+b_2)/2} K_{|ν|}\!\left(2\sqrt{a_1 a_2}\, x\right)

with ν = (b_2 − b_1)/2, and for X = X_1/X_2 by

p(x) = \frac{2\, Γ(b + 1)}{Γ(b_1/2 + 1/2)\, Γ(b_2/2 + 1/2)}\, \frac{a^{1/2}\,(a x^2)^{b_1/2}}{(1 + a x^2)^{b+1}}

with a = a_1/a_2 and b = (b_1 + b_2)/2.

Problem 1.4 Show that if X_{1,2} ∼ Un(x|0, 1), then for X = X_1^{X_2} we have that p(x) = −x^{−1} Ei(\ln x), with Ei(z) the exponential integral, and E[X^m] = m^{−1}\ln(1 + m). Hint: Consider Z = \ln X = X_2 \ln X_1 = −X_2 W_1 and the Mellin Transforms for the Uniform and Exponential densities.

1.5.2.4 Distributions with Support in R

The Mellin Transform is defined for integrable functions with non-negative support. To deal with the more general case X ∼ p(x) with supp{X} = Ω_{x≥0} ∪ Ω_{x<0} ⊆ R we have to:

(1) Express the density as p(x) = p(x)\, 1_{x≥0}(x) + p(x)\, 1_{x<0}(x) ≡ p^+(x) + p^−(x);
(2) Define Y_1 = X when x ≥ 0 and Y_2 = −X when x < 0, so that supp{Y_2} is positive, and find M_{Y_1}(s) and M_{Y_2}(s);
(3) Get from the inverse transform the corresponding densities p_1(z) for the quantity of interest Z_1 = Z(Y_1, X_2, ...) with M_{Y_1}(s), and p_2(z) for Z_2 = Z(Y_2, X_2, ...) with M_{Y_2}(s), and at the end, for p_2(z), make the corresponding change X → −X.

This is usually quite messy and, for most cases of interest, it is far easier to find the distribution of the product and ratio of random quantities with a simple change of variables.

• Ratio of Normal and χ² distributed random quantities Let's study the random quantity X = X_1 (X_2/n)^{−1/2} where X_1 ∼ N(x_1|0, 1) with supp{X_1} = R and X_2 ∼ χ²(x_2|n) with supp{X_2} = R^+. Then, for X_1 we have

p(x_1) = \underbrace{p(x_1)\, 1_{[0,∞)}(x_1)}_{p^+(x_1)} + \underbrace{p(x_1)\, 1_{(−∞,0)}(x_1)}_{p^−(x_1)}

and therefore for X

X ∼ p(x) = p(x)\, 1_{[0,∞)}(x) + p(x)\, 1_{(−∞,0)}(x) = p^+(x) + p^−(x)

Since

M_2(s) = \frac{2^{s-1}\, Γ(n/2 + s − 1)}{Γ(n/2)}

we have, for Z = (X_2/n)^{−1/2}, that

M_Z(s) = n^{(s-1)/2}\, M_2\big((3 − s)/2\big) = \left(\frac{n}{2}\right)^{(s-1)/2}\frac{Γ((n + 1 − s)/2)}{Γ(n/2)}

for 0 < Re(s) < n + 1. For X_1 ∈ [0, ∞) we have that

M_1^+(s) = \frac{2^{s/2}\, Γ(s/2)}{2\sqrt{2π}}; \qquad 0 < Re(s)

and therefore

M_X^+(s) = M_1^+(s)\, M_Z(s) = \frac{n^{s/2}\, Γ(s/2)\, Γ((n + 1 − s)/2)}{2\sqrt{nπ}\, Γ(n/2)}

with holomorphy strip 0 < Re(s) < n + 1. There are poles at s_m = −2m with m = 0, 1, 2, ... on the negative real axis and at s_k = n + 1 + 2k with k = 0, 1, 2, ... on the positive real axis. Closing the contour on the left we include only the s_m, so

p^+(x) = \frac{1}{\sqrt{nπ}\, Γ(n/2)}\sum_{m=0}^{\infty}\frac{(−1)^m}{Γ(m + 1)}\, Γ\!\left(m + \frac{n+1}{2}\right)\left(\frac{x^2}{n}\right)^{m} = \frac{Γ((n + 1)/2)}{\sqrt{nπ}\, Γ(n/2)}\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} 1_{[0,∞)}(x)

For X_1 ∈ (−∞, 0) we should in principle define Y = −X_1 with support in (0, ∞), find M_Y(s), obtain the density for X′ = Y/Z and then obtain the corresponding one for X = −X′. However, in this case it is clear by symmetry that p^−(x) = p^+(−x) and therefore

X ∼ p(x) = \frac{Γ((n + 1)/2)}{\sqrt{nπ}\, Γ(n/2)}\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} 1_{(−∞,∞)}(x) = St(x|n)
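A Monte Carlo check of this Student result is straightforward. A sketch (not from the text; assumes numpy and scipy, with n = 5 an arbitrary choice):

```python
import numpy as np
from scipy import stats

# X = X1 / sqrt(X2/n), X1 ~ N(0,1), X2 ~ chi2(n) independent, should be St(x|n).
rng = np.random.default_rng(7)
n = 5
x = rng.standard_normal(500_000) / np.sqrt(rng.chisquare(n, 500_000)/n)
ks = stats.kstest(x, stats.t(df=n).cdf).statistic
print(ks)   # Kolmogorov-Smirnov distance to St(x|n): tiny
```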

• Ratio and product of Normal distributed random quantities Consider X_1 ∼ N(x_1|μ_1, σ_1) and X_2 ∼ N(x_2|μ_2, σ_2). The Mellin Transform is

M_Y(s) = \frac{e^{-μ^2/4σ^2}}{\sqrt{2π}}\, σ^{s-1}\, Γ(s)\, D_{−s}(∓μ/σ)

with D_a(x) the Whittaker Parabolic Cylinder Functions. The upper sign (−) of the argument corresponds to X ∈ [0, ∞) and the lower one (+) to the quantity Y = −X ∈ (0, ∞). Again, the problem is considerably simplified if μ_1 = μ_2 = 0 because

M_Y(z) = \frac{2^{z/2}}{2\sqrt{2π}}\, σ^{z-1}\, Γ(z/2)

with S = ⟨0, ∞⟩ and, due to symmetry, all the contributions are the same. Thus, summing over the poles at z_n = −2n for n = 0, 1, 2, ... we have, for X = X_1 X_2 and a^{−1} = 4σ_1²σ_2²:

p(x) = \frac{2\sqrt{a}}{π}\sum_{n=0}^{\infty}\frac{(\sqrt{a}\,|x|)^{2n}}{Γ(n + 1)^2}\left(ψ(1 + n) − \ln(\sqrt{a}\,|x|)\right) = \frac{2\sqrt{a}}{π}\, K_0(2\sqrt{a}\,|x|)

Dealing with the general case of μ_i ≠ 0 it is much messier to get compact expressions, and life is easier with a simple change of variables. Thus, for instance, for X = X_1/X_2 we have that

p(x) = \frac{\sqrt{a_1 a_2}}{π}\int_{-\infty}^{\infty} e^{-\{a_1(xw − μ_1)^2 + a_2(w − μ_2)^2\}}\, |w|\, dw

where a_i = 1/(2σ_i²), and if we define

w_0 = a_2 + a_1 x²; \qquad w_1 = a_1 a_2 (xμ_2 − μ_1)² \qquad and \qquad w_2 = (a_1 μ_1 x + a_2 μ_2)/\sqrt{w_0}

one has:

p(x) = \frac{\sqrt{a_1 a_2}}{π}\,\frac{1}{w_0}\, e^{-w_1/w_0}\left(e^{-w_2^2} + \sqrt{π}\, w_2\, \mathrm{erf}(w_2)\right) 1_{(−∞,∞)}(x).

1.6 Ordered Samples

Let X ∼ p(x|θ) be a one-dimensional random quantity and e(n) the experiment that consists of n independent observations and results in the exchangeable sequence {x_1, x_2, ..., x_n}, equivalent to one observation of the n-dimensional random quantity X ∼ p(x|θ) where

p(x|θ) = p(x_1, x_2, ..., x_n|θ) = \prod_{i=1}^{n} p(x_i|θ)

Consider now a monotonic non-decreasing ordering of the observations

\underbrace{x_1 ≤ x_2 ≤ \cdots ≤ x_{k-1}}_{k-1} ≤ x_k ≤ \underbrace{x_{k+1} ≤ \cdots ≤ x_{n-1} ≤ x_n}_{n-k}

and the Statistic of Order k; that is, the random quantity X_{(k)} associated with the kth observation (1 ≤ k ≤ n) of the ordered sample, such that there are k − 1 observations smaller than x_k and n − k above x_k. Since

P(X ≤ x_k|θ) = \int_{-\infty}^{x_k} p(x|θ)\, dx = F(x_k|θ) \qquad and \qquad P(X > x_k|θ) = 1 − F(x_k|θ)

we have that

X_{(k)} ∼ p(x_k|θ, n, k) = C_{n,k}\, p(x_k|θ)\,\left[F(x_k|θ)\right]^{k-1}\left[1 − F(x_k|θ)\right]^{n-k} = C_{n,k}\, p(x_k|θ)\left[\int_{-\infty}^{x_k} p(x|θ)\, dx\right]^{k-1}\left[\int_{x_k}^{\infty} p(x|θ)\, dx\right]^{n-k}

with the bracketed factors being [P(X ≤ x_k)]^{k-1} and [P(X > x_k)]^{n-k}.

The normalization factor

C_{n,k} = k\binom{n}{k}

is given by combinatorial analysis, although in general it is easier to get it by normalization of the final density. With a similar reasoning, we have that the density function of the two-dimensional random quantity X_{(ij)} = (X_i, X_j), j > i, associated to the observations x_i and x_j (Statistic of Order i, j; i < j)14 will be:

14 If the random quantities X_i are not identically distributed the idea is the same, but one has to deal with permutations and the expressions are more involved.

/ xi 0i−1 / xj 0 j−i−1
X (i j) ∼ p(xi , x j |θ, i, j, n) = Cn,i, j p(x|θ)d x p(xi |θ) p(x|θ)d x
−∞ xi



[P(X <xi )]i−1 [P(xi <X ≤ <x j )] j−i−1
 n− j

p(x j |θ) p(x|θ)d x
xj


[P(x j <X )]n− j

where (xi , x j ) ∈ (−∞, x j ]×(−∞, ∞) or (xi , x j ) ∈ (−∞, ∞)×[xi , ∞). Again by


combinatorial analysis or integration we have that

n!
Cn,i, j =
(i − 1)! ( j − i − 1)! (n − j)!
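As a sanity check of the densities above, a small simulation may help. For Un(x|0, 1) one has p(x) = 1 and F(x) = x, so p(xk |θ, n, k) reduces to a Beta-type density with mean k/(n + 1); the values n = 7, k = 3 below are arbitrary illustrative choices.

```python
import math
import random

def order_stat_pdf(x, k, n):
    """Density of the k-th order statistic of n iid Un(0,1) draws:
    C_{n,k} p(x) F(x)^{k-1} (1 - F(x))^{n-k} with p(x) = 1, F(x) = x."""
    c_nk = k * math.comb(n, k)                  # C_{n,k} = k (n choose k)
    return c_nk * x ** (k - 1) * (1 - x) ** (n - k)

n, k = 7, 3                                     # arbitrary illustrative values
dx = 1e-4
grid = [(i + 0.5) * dx for i in range(int(1 / dx))]
norm = sum(order_stat_pdf(x, k, n) for x in grid) * dx        # -> 1
mean = sum(x * order_stat_pdf(x, k, n) for x in grid) * dx    # -> k/(n+1)

# Monte Carlo: take the k-th smallest of n uniform draws many times
random.seed(7)
reps = 50_000
mc_mean = sum(sorted(random.random() for _ in range(n))[k - 1]
              for _ in range(reps)) / reps
```

The numeric normalization confirms the combinatorial factor Cn,k and the Monte Carlo mean lands on k/(n + 1).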

The main Order Statistics we are usually interested in are


• Maximum X (n) = max{X 1 , X 2 , . . ., X n }:

p(xn |·) = n p(xn |θ) [ ∫_{−∞}^{xn} p(x|θ) dx ]^{n−1}

• Minimum X (1) = min{X 1 , X 2 , . . ., X n }:

p(x1 |·) = n p(x1 |θ) [ ∫_{x1}^{∞} p(x|θ) dx ]^{n−1}

• Range R = X (n) − X (1) :

p(x1 , xn |·) = n(n − 1) p(x1 |θ) p(xn |θ) [ ∫_{x1}^{xn} p(x|θ) dx ]^{n−2}

If supp(X ) = [a, b], then R ∈ (0, b − a) and


p(r ) = n(n − 1) ∫_{a}^{b−r} p(w + r ) p(w) [F(w + r ) − F(w)]^{n−2} dw

There is no explicit form unless we specify the Distribution Function F(x|θ).


• Difference S = X (i+1) − X (i) . If supp(X ) = [a, b], then S ∈ (0, b − a) and

p(s) = [Γ(n + 1)/(Γ(i) Γ(n − i))] ∫_{a}^{b−s} p(w + s) p(w) [F(w)]^{i−1} [1 − F(w + s)]^{n−i−1} dw

In the case of discrete random quantities, the idea is the same but a bit more messy
because one has to watch for the discontinuities of the Distribution Function. Thus,
for instance:
• Maximum X (n) = max{X 1 , X 2 , . . ., X n }:

X (n) ≤ x iff all xi are less or equal x and this happens with probability

P(xn ≤ x) = [F(x)]^n

X (n) < x iff all xi are less than x and this happens with probability

P(xn < x) = [F(x − 1)]^n

Therefore

P(xn = x) = P(xn ≤ x) − P(xn < x) = [F(x)]^n − [F(x − 1)]^n

• Minimum X (1) = min{X 1 , X 2 , . . ., X n }:

X (1) ≥ x iff all xi are greater or equal x and this happens with probability

P(x1 ≥ x) = 1 − P(x1 < x) = [1 − F(x − 1)]^n

X (1) > x iff all xi are greater than x and this happens with probability

P(x1 > x) = 1 − P(x1 ≤ x) = [1 − F(x)]^n

Therefore

P(x1 = x) = P(x1 ≤ x) − P(x1 < x) = [1 − P(x1 > x)] − [1 − P(x1 ≥ x)] = [1 − F(x − 1)]^n − [1 − F(x)]^n

Example 1.26 Let X ∼U n(x|a, b) and an iid sample of size n. Then, if L = b − a:

• Maximum: p(xn ) = n (xn − a)^{n−1} /(b − a)^n 1(a,b) (xn )
• Minimum: p(x1 ) = n (b − x1 )^{n−1} /(b − a)^n 1(a,b) (x1 )
• Range R = X (n) − X (1) : p(r ) = [n(n − 1)/L] (r/L)^{n−2} (1 − r/L) 1(0,L) (r )
• Difference S = X (k+1) − X (k) : p(s) = (n/L) (1 − s/L)^{n−1} 1(0,L) (s)
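The densities of Example 1.26 can be cross-checked by simulation through their first moments: E[X (1)] = a + L/(n + 1), E[X (n)] = a + nL/(n + 1) and E[R] = L(n − 1)/(n + 1). This is a quick sketch; the values a = 1, b = 3, n = 5 are arbitrary.

```python
import random

# Monte Carlo check of the first moments implied by Example 1.26
random.seed(2)
a, b, n, reps = 1.0, 3.0, 5, 100_000       # arbitrary illustrative values
L = b - a
sum_min = sum_max = sum_range = 0.0
for _ in range(reps):
    xs = [random.uniform(a, b) for _ in range(n)]
    lo, hi = min(xs), max(xs)
    sum_min += lo
    sum_max += hi
    sum_range += hi - lo
mean_min = sum_min / reps      # expect a + L/(n+1)
mean_max = sum_max / reps      # expect a + n L/(n+1)
mean_range = sum_range / reps  # expect L (n-1)/(n+1)
```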
Example 1.27 Let’s look at the Uniform distribution in more detail. Consider a
random quantity X ∼U n(x|a, b), the experiment e(n) that provides a sample of n
independent events and the ordered sample

Xn = {x1 ≤ x2 ≤ · · · ≤ xk ≤ · · · ≤ xk+ p ≤ · · · ≤ xn−1 ≤ xn }



Then, for the order statistics X k , X k+ p and X k+ p+1 with k, p ∈ N , 1 ≤ k ≤ n − 1 and p ≤ n − k − 1, there are k − 1 observations in (a, xk ), p − 1 in (xk , xk+ p ) and n − (k + p + 1) in (xk+ p+1 , b), so we have that

p(xk , xk+ p , xk+ p+1 |a, b, n, p) ∝ [ ∫_{a}^{xk} ds1 ]^{k−1} [ ∫_{xk}^{xk+p} ds2 ]^{p−1} [ ∫_{xk+p+1}^{b} ds3 ]^{n−(k+p+1)}

Let’s think for instance that those are the arrival times of n events collected with
a detector in a time window [a = 0, b = T ]. If we define w1 = xk+ p − xk and
w2 = xk+ p+1 − xk we have that

p(xk , w1 , w2 |T, n, p) ∝ xk^{k−1} w1^{p−1} (T − xk − w2 )^{n−k−p−1} 1[0,T −w2 ] (xk ) 1[0,w2 ] (w1 ) 1[0,T ] (w2 )

and, after integration of xk :


 
p(w1 , w2 |T, n, p) = [n!/((p − 1)! (n − p − 1)!)] (1/T^n ) w1^{p−1} (T − w2 )^{n−p−1} 1[0,w2 ] (w1 ) 1[0,T ] (w2 )

Observe that the support can be expressed also as 1[0,T ] (w1 )1[w1 ,T ] (w2 ) and that the
distribution of (W1 , W2 ) does not depend on k. The marginal densities are given by:
 
p(w1 |T, n, p) = [n!/((p − 1)! (n − p)!)] (1/T^n ) w1^{p−1} (T − w1 )^{n−p} 1[0,T ] (w1 )

p(w2 |T, n, p) = [n!/(p! (n − p − 1)!)] (1/T^n ) w2^{p} (T − w2 )^{n−p−1} 1[0,T ] (w2 )

and if we take the limit T →∞ and n→∞ keeping the rate λ = n/T constant we
have

lim_{T,n→∞} p(w1 , w2 |T, n, p) = p(w1 , w2 |λ, p) = [λ^{p+1} /Γ( p)] e^{−λw2} w1^{p−1} 1[0,w2 ) (w1 ) 1[0,∞) (w2 )

and

p(w1 |λ, p) = [λ^{p} /Γ( p)] e^{−λw1} w1^{p−1} 1[0,∞) (w1 )

In consequence, under the stated conditions the time difference between two consec-
utive events ( p = 1) tends to an exponential distribution. Let’s consider for simplicity

this limiting behaviour in what follows and leave as an exercise the more involved
case of finite time window T .
Suppose now that after having observed one event, say xk , we have a dead-time of
size a in the detector during which we can not process any data. All the events that
fall in (xk , xk + a) are lost (unless we play with buffers). If the next observed event
is at time xk+ p+1 , we have lost p events and the probability for this to happen is

P(w1 ≤ a, w2 ≥ a|λ, p) = e^{−λa} (λa)^{p} /Γ( p + 1)

that is, Nlost ∼Po( p|λa) regardless of the position of the last recorded time (xk ) in the ordered sequence. As one could easily have intuited, the expected number of events lost for each observed one is E[Nlost ] = λa. Lastly, it is clear that the density for the time difference between two consecutive observed events when p are lost due to the dead-time is

p(w2 |w1 ≤ a, λ, p) = λ e−λ(w2 −a) 1[a,∞) (w2 )

Note that it depends on the dead-time window a and not on the number of events
lost.
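The dead-time argument can be simulated directly; this is a sketch with arbitrary rate λ = 2 and dead-time a = 0.4. Note that it models a non-paralyzable dead-time: the blind window starts at each recorded event and is not extended by lost ones.

```python
import random

# Poisson stream of events; after each recorded event the detector is blind
# during a time a, and events falling inside the window are lost.
random.seed(3)
lam, a, n_events = 2.0, 0.4, 200_000       # arbitrary illustrative values
t, t_rec = 0.0, None
recorded = lost = 0
for _ in range(n_events):
    t += random.expovariate(lam)           # exponential inter-arrival times
    if t_rec is None or t - t_rec >= a:
        t_rec, recorded = t, recorded + 1
    else:
        lost += 1
lost_per_recorded = lost / recorded        # -> E[N_lost] = lam * a
```

The average number of lost events per recorded one comes out at λ a = 0.8, as the Poisson argument predicts.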

Example 1.28 Let X ∼E x(x|λ) and an iid sample of size n. Then:

• Maximum: p(xn ) = n λ e^{−λxn} (1 − e^{−λxn} )^{n−1} 1(0,∞) (xn )
• Minimum: p(x1 ) = n λ e^{−λnx1} 1(0,∞) (x1 )
• Range R = X (n) − X (1) : p(r ) = (n − 1) λ e^{−λr} (1 − e^{−λr} )^{n−2} 1(0,∞) (r )
• Difference S = X (k+1) − X (k) : p(s) = (n − k) λ e^{−λ(n−k)s} 1(0,∞) (s).
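Two of these results (minimum ∼ Ex(nλ), spacing ∼ Ex((n − k)λ)) can be verified numerically through their means; λ = 1.5, n = 6, k = 2 below are arbitrary.

```python
import random

# For n iid Ex(lam) draws: the minimum is Ex(n lam) and the spacing
# X_(k+1) - X_(k) is Ex((n-k) lam); check the means by Monte Carlo.
random.seed(4)
lam, n, k, reps = 1.5, 6, 2, 100_000       # arbitrary illustrative values
sum_min = sum_gap = 0.0
for _ in range(reps):
    xs = sorted(random.expovariate(lam) for _ in range(n))
    sum_min += xs[0]
    sum_gap += xs[k] - xs[k - 1]           # S = X_(k+1) - X_(k)
mean_min = sum_min / reps                  # -> 1/(n lam)
mean_gap = sum_gap / reps                  # -> 1/((n-k) lam)
```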

1.7 Limit Theorems and Convergence

In Probability, the Limit Theorems are statements that, under the conditions of applicability, describe the behavior of a sequence of random quantities or of Distribution Functions. In principle, whenever we can define a distance (or at least a positive-definite set function) we can establish a convergence criterion and, obviously, some will be stronger than others so, for instance, a sequence of random quantities {X i }_{i=1}^{∞} may converge according to one criterion and not to another. The most usual types of convergence, their relations and the Theorems derived from them are:

Distribution   =⇒ Central Limit Theorem; Glivenko–Cantelli Theorem (weak form)
Probability    =⇒ Weak Law of Large Numbers (implies Convergence in Distribution)
Almost Sure    =⇒ Strong Law of Large Numbers (implies Convergence in Probability)
L p (R) Norm   =⇒ Convergence in Quadratic Mean for p = 2 (implies Convergence in Probability)
Uniform        =⇒ Glivenko–Cantelli Theorem

so Convergence in Distribution is the weakest of all since it does not imply any of the others. In principle, there will be no explicit mention of statistical independence of the random quantities of the sequence nor of a specific Distribution Function. In most cases we shall just state the different criteria for convergence and refer to the literature, for instance [2], for further details and proofs. Let's start with the very useful Chebyshev's Theorem.

1.7.1 Chebyshev’s Theorem

Let X be a random quantity that takes values in Ω ⊂ R with Distribution Function F(x) and consider the random quantity Y = g(X ) with g(X ) a non-negative single valued function for all X ∈ Ω. Then, for α ∈ R+

P (g(X ) ≥ α) ≤ E[g(X )]/α

In fact, given a measure space (Ω, B , μ), for any μ-integrable function f (x) and c > 0 we have for A = {x : | f (x)| ≥ c} that c 1 A (x) ≤ | f (x)| for all x and therefore

c μ(A) = ∫_{Ω} c 1 A (x) dμ ≤ ∫_{Ω} | f (x)| dμ

Let’s see two particular cases. First, consider g(X ) = (X − μ)2n where μ = E[X ]
and n a positive integer such that g(X ) ≥ 0 ∀X ∈ . Applying Chebishev’s Theorem:

E[(X − μ)2n ] μ2n


P((X − μ)2n ≥ α) = P(|X − μ| ≥ α1/2n ) ≤ =
α α

For n = 1, if we take α = k² σ² we get the Bienaymé–Chebyshev inequality

P(|X − μ| ≥ kσ) ≤ 1/k²



that is, whatever the Distribution Function of the random quantity X is, the probability that X differs from its expected value μ by more than k times its standard deviation is less or equal than 1/k². As a second case, assume X takes only positive real values and has a first order moment E[X ] = μ. Then (Markov's inequality):

P(X ≥ α) ≤ μ/α   so, taking α = kμ:   P(X ≥ kμ) ≤ 1/k
The Markov and Bienaymé–Chebyshev inequalities provide upper bounds for the probability knowing just the mean value and the variance, although they are usually very conservative. They can be considerably improved if we have more information about the Distribution Function but, as we shall see, the main interest of Chebyshev's inequality lies in its importance to prove Limit Theorems.
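How conservative the bound is can be seen with a concrete distribution. The exponential Ex(x|λ), for which μ = σ = 1/λ, makes the exact tail probability explicit; this is just an illustrative sketch.

```python
import math

# Exact tail P(|X - mu| >= k sigma) for X ~ Ex(1) versus the
# Bienaymé-Chebyshev bound 1/k^2 (here mu = sigma = 1)
mu = sigma = 1.0

def exact_tail(k):
    lo = mu - k * sigma                    # mass below mu - k sigma (none if k >= 1)
    below = 1.0 - math.exp(-lo) if lo > 0 else 0.0
    above = math.exp(-(mu + k * sigma))    # mass above mu + k sigma
    return below + above

checks = [(k, exact_tail(k), 1.0 / k ** 2) for k in (1.5, 2.0, 3.0, 5.0)]
all_bounded = all(tail <= bound for _, tail, bound in checks)
```

For k = 2 the exact tail is e⁻³ ≈ 0.05 while the bound gives 0.25, a factor of five looser.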

1.7.2 Convergence in Probability

The sequence of random quantities {X n (w)}_{n=1}^{∞} converges in probability to X (w) iff:

lim_{n→∞} P(|X n (w) − X (w)| ≥ ε) = 0 ∀ε > 0

or, equivalently, iff:

lim_{n→∞} P(|X n (w) − X (w)| < ε) = 1 ∀ε > 0

Note that P(|X n (w) − X (w)| ≥ ε) is a real number so this is the usual limit for a sequence of real numbers and, in consequence, for all ε > 0 and δ > 0 ∃ n 0 (ε, δ) such that for all n > n 0 (ε, δ) it holds that P(|X n (w) − X (w)| ≥ ε) < δ.
For a sequence of n-dimensional random quantities, this can be generalized with a distance d as lim_{n→∞} P(d(X n (w), X (w)) ≥ ε) = 0 and, as said earlier, Convergence in Probability implies Convergence in Distribution but the converse is not true. An important consequence of the Convergence in Probability is the
• Weak Law of Large Numbers: Consider a sequence of independent random quantities {X i (w)}_{i=1}^{∞}, all with the same Distribution Function and first order moment E[X i (w)] = μ, and define a new random quantity

Z n (w) = (1/n) ∑_{i=1}^{n} X i (w)

Then, the sequence {Z n (w)}_{n=1}^{∞} converges in probability to μ; that is:

lim_{n→∞} P(|Z n (w) − μ| ≥ ε) = 0 ∀ε > 0

The Law of Large Numbers was stated first by J. Bernoulli in 1713 for the Binomial Distribution, generalized (and named Law of Large Numbers) by S.D. Poisson and shown in the general case by A. Khinchin in 1929. In the case the X i (w) have variance V (X i ) = σ², it is straightforward from Chebyshev's inequality:

P (|Z n − μ| ≥ ε) = P((Z n − μ)² ≥ ε²) ≤ E[(Z n − μ)² ]/ε² = σ²/(n ε²)
Intuitively, Convergence in Probability means that when n is very large, the probability that Z n (w) differs from μ by a small amount is very small; that is, Z n (w) gets more concentrated around μ. But "very small" is not zero and it may happen that for some k > n, Z k differs from μ by more than ε. A stronger criterion of convergence is Almost Sure Convergence.

1.7.3 Almost Sure Convergence

A sequence {X n (w)}_{n=1}^{∞} of random quantities converges almost surely to X (w) if, and only if:

lim_{n→∞} X n (w) = X (w)

for all w ∈ Ω except at most on a set W ⊂ Ω of zero measure (P(W ) = 0), so it is also referred to as convergence almost everywhere. This means that for all ε > 0 and all w ∈ W c = Ω − W , ∃ n 0 (ε, w) > 0 such that |X n (w) − X (w)| < ε for all n > n 0 (ε, w). Thus, we have the equivalent forms:

P( lim_{n→∞} |X n (w) − X (w)| ≥ ε ) = 0   or   P( lim_{n→∞} |X n (w) − X (w)| < ε ) = 1

for all ε > 0. Needless to say, the random quantities X 1 , X 2 . . . and X are defined on the same probability space. Again, Almost Sure Convergence implies Convergence in Probability but the converse is not true. An important consequence of the Almost Sure Convergence is the:
• Strong Law of Large Numbers (E. Borel 1909, A.N. Kolmogorov,…): Let {X i (w)}_{i=1}^{∞} be a sequence of independent random quantities all with the same Distribution Function and first order moment E[X i (w)] = μ. Then the sequence {Z n (w)}_{n=1}^{∞} with

Z n (w) = (1/n) ∑_{i=1}^{n} X i (w)

converges almost surely to μ; that is:

P( lim_{n→∞} |Z n (w) − μ| ≥ ε ) = 0 ∀ε > 0

Intuitively, Almost Sure Convergence means that the probability that for some k > n, Z k differs from μ by more than ε tends to zero as n grows.

1.7.4 Convergence in Distribution

Consider the sequence of random quantities {X n (ω)}∞n=1 and of their corresponding


Distribution Functions {Fn (x)}∞n=1 . In the limit n→∞, the random quantity X n (w)
tends to be distributed as X (w)∼F(x) iff

lim_{n→∞} Fn (x) = F(x) ⇐⇒ lim_{n→∞} P(X n ≤ x) = P(X ≤ x) ∀x ∈ C(F)

with C(F) the set of points of continuity of F(x). Expressed in a different manner, the sequence {X n (w)}_{n=1}^{∞} Converges in Distribution to X (w) if, and only if, for all ε > 0 and x ∈ C(F), ∃ n 0 (ε, x) such that |Fn (x) − F(x)| < ε, ∀n > n 0 (ε, x). Note that, in general, n 0 depends on x so it is possible that, given an ε > 0, the value of n 0 for which the condition |Fn (x) − F(x)| < ε is satisfied for certain values of x may not be valid for others. It is important to note also that we have not made any statement about the statistical independence of the random quantities and that the Convergence in Distribution is determined only by the Distribution Functions, so the corresponding random quantities do not have to be defined on the same probability space. To study the Convergence in Distribution, the following theorem is very useful:
• Theorem (Lévy 1937; Cramér 1937): Consider a sequence of Distribution Functions {Fn (x)}_{n=1}^{∞} and of the corresponding Characteristic Functions {Φn (t)}_{n=1}^{∞}. Then:

– if lim_{n→∞} Fn (x) = F(x), then lim_{n→∞} Φn (t) = Φ(t) for all t ∈ R, with Φ(t) the Characteristic Function of F(x);
– conversely, if Φn (t) −→ Φ(t) ∀t ∈ R as n→∞ and Φ(t) is continuous at t = 0, then Fn (x) −→ F(x) as n→∞.
This criterion of convergence is weak in the sense that if there is convergence in probability or almost sure or in quadratic mean then there is convergence in distribution, but the converse is not necessarily true. However, there is a very important consequence of the Convergence in Distribution:

• Central Limit Theorem (Lindeberg–Lévy): Let {X i (w)}_{i=1}^{∞} be a sequence of independent random quantities all with the same Distribution Function and with second order moments so E[X i (w)] = μ and V [X i (w)] = σ². Then the sequence {Z n (w)}_{n=1}^{∞} of random quantities

Z n (w) = (1/n) ∑_{i=1}^{n} X i (w)

with

E[Z n ] = (1/n) ∑_{i=1}^{n} E[X i ] = μ   and   V [Z n ] = (1/n²) ∑_{i=1}^{n} V [X i ] = σ²/n

tends, in the limit n→∞, to be distributed as N (z|μ, σ/√n) or, what is the same, the standardized random quantity

Z̃ n = (Z n − μ)/√V [Z n ] = [ (1/n) ∑_{i=1}^{n} X i − μ ]/(σ/√n)

tends to be distributed as N (x|0, 1).


Consider, without loss of generality, the random quantity Wi = X i − μ so that E[Wi ] = E[X i ] − μ = 0 and V [Wi ] = V [X i ] = σ². Then,

ΦW (t) = 1 − (1/2) t² σ² + O(t^k )

Since we require that the random quantities X i have at least moments of order two, the remaining terms O(t^k ) are either zero or powers of t larger than 2. Then,

Z n = (1/n) ∑_{i=1}^{n} X i = (1/n) ∑_{i=1}^{n} Wi + μ; E[Z n ] = μ; V [Z n ] = σ²_{Zn} = σ²/n

so

Φ_{Zn} (t) = e^{itμ} [ΦW (t/n)]^n −→ lim_{n→∞} Φ_{Zn} (t) = e^{itμ} lim_{n→∞} [ΦW (t/n)]^n

Now, since:

ΦW (t/n) = 1 − (1/2) (t/n)² σ² + O(t^k /n^k ) = 1 − (1/2) (t²/n) σ²_{Zn} + O(t^k /n^k )

we have that:

lim_{n→∞} [ΦW (t/n)]^n = lim_{n→∞} [ 1 − (1/2) (t²/n) σ²_{Zn} + O(t^k /n^k ) ]^n = exp{ −(1/2) t² σ²_{Zn} }

and therefore:

lim_{n→∞} Φ_{Zn} (t) = e^{itμ} e^{−(1/2) t² σ²/n}

so, lim_{n→∞} Z n ∼ N (x|μ, σ/√n).
The first indications about the Central Limit Theorem are due to A. De Moivre
(1733). Later, C.F. Gauss and P.S. Laplace enunciated the behavior in a general
way and, in 1901, A. Lyapunov gave the first rigorous demonstration under more
restrictive conditions. The theorem in the form we have presented here is due to
Lindeberg and Lévy and requires that the random quantities X i are:
(i) Statistically Independent;
(ii) have the same Distribution Function;
(iii) First and Second order moments exist (i.e. they have mean value and variance).
In general, there is a set of Central Limit Theorems depending on which of the previous conditions are satisfied; they justify the empirical fact that many natural phenomena are adequately described by the Normal Distribution. To quote E.T.
Whittaker and G. Robinson (Calculus of Observations):
“Everybody believes in the exponential law of errors;
The experimenters because they think that it can be proved by mathematics;
and the mathematicians because they believe it has been established by
observation”
Example 1.29 From the limiting behavior of the Characteristic Function, show that:

• If X ∼Bi(r |n, p), in the limit p → 0 with np constant it tends to a Poisson Distribution Po(r |μ = np);
• If X ∼Bi(r |n, p), in the limit n → ∞ the standardized random quantity

Z = (X − μ X )/σ X = (X − np)/√(npq) ∼ N (x|0, 1)

• If X ∼Po(r |μ), in the limit μ→∞ the standardized random quantity

Z = (X − μ X )/σ X = (X − μ)/√μ ∼ N (x|0, 1)

• If X ∼χ²(x|ν), in the limit ν→∞ the standardized random quantity

Z = (X − μ X )/σ X = (X − ν)/√(2ν) ∼ N (x|0, 1)

• The Student's Distribution St (x|0, 1, ν) converges to N (x|0, 1) in the limit ν→∞;
• The Snedecor's Distribution Sn(x|ν1 , ν2 ) converges to χ²(x|ν1 ) in the limit ν2 →∞, to St (x|0, 1, ν2 ) in the limit ν1 →∞ and to N (x|0, 1) in the limit ν1 , ν2 →∞.
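The second statement can be checked numerically without any sampling: compute the exact Binomial probability P((X − np)/√(npq) ≤ z) with log-probabilities (to avoid overflow) and compare it with the N(0, 1) Distribution Function; the discrepancy shrinks as n grows. The parameters p = 0.3 and z = 1 are arbitrary choices.

```python
import math

def binom_std_cdf(n, p, z):
    """Exact P((X - np)/sqrt(npq) <= z) for X ~ Bi(n, p), via log-probabilities."""
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    kmax = int(mu + z * sd)
    logs = [math.lgamma(n + 1) - math.lgamma(kk + 1) - math.lgamma(n - kk + 1)
            + kk * math.log(p) + (n - kk) * math.log(1 - p)
            for kk in range(kmax + 1)]
    m = max(logs)
    return math.exp(m) * sum(math.exp(l - m) for l in logs)

phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))   # N(0,1) CDF

err_small = abs(binom_std_cdf(100, 0.3, 1.0) - phi(1.0))
err_large = abs(binom_std_cdf(10_000, 0.3, 1.0) - phi(1.0))
```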


Fig. 1.2 Generated sample from U n(x|0, 1) (1) and sampling distribution of the mean of 2 (2), 5 (3), 10 (4), 20 (5) and 50 (6) generated values

Example 1.30 It is interesting to see the Central Limit Theorem at work. For this, we have done a Monte Carlo sampling of the random quantity X ∼U n(x|0, 1). The sampling distribution is shown in Fig. 1.2(1) and the following ones show the sample mean of n = 2 (Fig. 1.2(2)), 5 (Fig. 1.2(3)), 10 (Fig. 1.2(4)), 20 (Fig. 1.2(5)) and 50 (Fig. 1.2(6)) consecutive values. Each histogram has 500000 events and, as you can see, as n grows the distribution "looks" more Normal. For n = 20 and n = 50 the Normal distribution is superimposed.
The same behavior is observed in Fig. 1.3 where we have generated a sequence of values from a parabolic distribution with minimum at x = 1 and support on Ω = [0, 2]. Last, Fig. 1.4 shows the results for a sampling from the Cauchy Distribution X ∼Ca(x|0, 1). As you can see, the sampling averages follow a Cauchy Distribution regardless of the value of n. For n = 20 and n = 50 a Cauchy and a Normal distribution have been superimposed. In this case, since the Cauchy Distribution has no moments the Central Limit Theorem does not apply.


Fig. 1.3 Generated sample from a parabolic distribution with minimum at x = 1 and support on Ω = [0, 2] (1) and sampling distribution of the mean of 2 (2), 5 (3), 10 (4), 20 (5) and 50 (6) generated values


Example 1.31 Let {X i (w)}_{i=1}^{∞} be a sequence of independent random quantities all with the same Distribution Function, mean value μ and variance σ², and consider the random quantity

Z (w) = (1/n) ∑_{i=1}^{n} X i (w)

What is the value of n such that the probability that Z differs from μ by more than ε is less than δ = 0.01?

From the Central Limit Theorem we know that in the limit n→∞, Z ∼ N (x|μ, σ/√n) so we may consider that, for large n:


Fig. 1.4 Generated sample from a Cauchy distribution Ca(x|0, 1) (1) and sampling distribution of the mean of 2 (2), 5 (3), 10 (4), 20 (5) and 50 (6) generated values

P(|Z − μ| ≥ ε) ≃ ∫_{−∞}^{μ−ε} N (x|μ, σ/√n) dx + ∫_{μ+ε}^{+∞} N (x|μ, σ/√n) dx = 1 − erf( √n ε/(σ √2) ) < δ

For δ = 0.01 we have that

√n ε/σ ≥ 2.575 −→ n ≥ 6.63 σ²/ε².
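For Un(x|0, 1) variates (σ² = 1/12) and, say, ε = 0.05 (an arbitrary choice), the bound gives n ≥ 221; a simulation then shows the tail frequency settling near δ = 0.01:

```python
import math
import random

sigma2, eps, delta = 1.0 / 12.0, 0.05, 0.01
n = round(6.63 * sigma2 / eps ** 2)        # the bound gives n >= 221

random.seed(8)
reps = 20_000
bad = 0
for _ in range(reps):
    z = sum(random.random() for _ in range(n)) / n
    if abs(z - 0.5) >= eps:
        bad += 1
freq = bad / reps                          # should come out close to delta
```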

1.7.5 Convergence in L p Norm

A sequence of random quantities {X n (w)}_{n=1}^{∞} converges to X (w) in L p (R) ( p ≥ 1) norm iff

X (w) ∈ L p (R), X n (w) ∈ L p (R) ∀n and lim_{n→∞} E[|X n (w) − X (w)|^p ] = 0

that is, iff for any real ε > 0 there exists a natural n 0 (ε) > 0 such that for all n ≥ n 0 (ε) it holds that E[|X n (w) − X (w)|^p ] < ε. In the particular case that p = 2 it is called Convergence in Quadratic Mean.
From Chebyshev’s Theorem

E[(X n (w) − X (w)) p ]


P(|X n (w) − X (w)| ≥ α1/ p ) ≤
α
so, taking α =  p , if there is convergence in L p (R) norm:

E[(X n (w) − X (w)) p ]


lim P(|X n (w) − X (w)| ≥ ) ≤ lim = 0 ∀ > 0
n→∞ n→∞ p
and, in consequence, we have convergence in probability.

1.7.6 Uniform Convergence

In some cases, point-wise convergence of Distribution Functions is not strong enough to guarantee the desired behavior and we require a stronger type of convergence. To some extent one may think that, more than a criterion of convergence, Uniform Convergence refers to the way in which it is achieved. Point-wise convergence requires the existence of an n 0 that may depend on ε and on x so that the condition | f n (x) − f (x)| < ε for n ≥ n 0 may be satisfied for some values of x and not for others, for which a different value of n 0 is needed. The idea behind uniform convergence is that we can find a value of n 0 for which the condition is satisfied regardless of the value of x. Thus, we say that a sequence { f n (x)}_{n=1}^{∞} converges uniformly to f (x) iff:

∀ε > 0 , ∃ n 0 ∈ N such that | f n (x) − f (x)| < ε ∀n > n 0 and ∀x

or, in other words, iff:

sup_x | f n (x) − f (x)| −→ 0 as n→∞

Thus, it is a stronger type of convergence that implies point-wise convergence. Intuitively, one may visualize the uniform convergence of f n (x) to f (x) if one can draw a band f (x)±ε that contains all f n (x) for any n sufficiently large. Look for instance at the sequence of functions f n (x) = x(1 + 1/n) with n = 1, 2, . . . and x ∈ R. It is clear that it converges point-wise to f (x) = x because lim_{n→∞} f n (x) = f (x) for all x ∈ R; that is, if we take n 0 (x, ε) = |x|/ε, for all n > n 0 (x, ε) it is true that | f n (x) − f (x)| < ε, but for larger values of x we need larger values of n. Thus, the convergence is not uniform because

supx | f n (x) − f (x)| = supx |x/n| = ∞ ∀n ∈ N

Intuitively, for however small a given ε is, the band f (x)±ε = x±ε does not contain f n (x) for all n sufficiently large. As a second example, take f n (x) = x^n with x ∈ (0, 1). We have that lim_{n→∞} f n (x) = 0 but sup_x | f n (x)| = 1 so the convergence is not uniform. For the cases we shall be interested in, if a Distribution Function F(x) is continuous and the sequence {Fn (x)}_{n=1}^{∞} converges in distribution to F(x) (i.e. point-wise) then it does uniformly too. An important case of uniform convergence is the (sometimes called Fundamental Theorem of Statistics):
• Glivenko–Cantelli Theorem (V. Glivenko-F.P. Cantelli; 1933): Consider the ran-
dom quantity X ∼F(x) and a statistically independent (essential point) sampling of
size n {x1 , x2 , . . ., xn }. The empirical Distribution Function

Fn (x) = (1/n) ∑_{i=1}^{n} 1(−∞,x] (xi )

converges uniformly to F(x); that is (Kolmogorov–Smirnov Statistic):

limn→∞ supx |Fn (x) − F(x)| = 0
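The statement is easy to watch numerically: for Un(x|0, 1), F(x) = x and the Kolmogorov–Smirnov statistic of an ordered sample can be computed at the jump points of Fn (x); its average shrinks roughly like 1/√n. A minimal sketch:

```python
import random

random.seed(9)

def ks_uniform(n):
    """sup_x |F_n(x) - F(x)| for a Un(0,1) sample, where F(x) = x;
    for an ordered sample the sup is attained at a jump of F_n."""
    xs = sorted(random.random() for _ in range(n))
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def mean_ks(n, reps=200):
    return sum(ks_uniform(n) for _ in range(reps)) / reps

m = {n: mean_ks(n) for n in (10, 100, 10_000)}   # decreasing with n
```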

Let’s see the convergence in probability, in quadratic mean and, in consequence,


in distribution. For a fixed value x = x0 , Y = 1(−∞,x0 ] (X ) is a random quantity that
follows a Bernoulli distribution with probability

p = P(Y = 1) = P(1(−∞,x0 ] (x) = 1) = P(X ≤ x0 ) = F(x0 )


P(Y = 0) = P(1(−∞,x0 ] (x) = 0) = P(X > x0 ) = 1 − F(x0 )

and Characteristic Function

ΦY (t) = E[e^{itY} ] = e^{it} p + (1 − p) = e^{it} F(x0 ) + (1 − F(x0 ))

Then, for a fixed value of x we have for the random quantity

Z n (x) = ∑_{i=1}^{n} 1(−∞,x] (xi ) = n Fn (x) −→ Φ_{Zn} (t) = [ e^{it} F(x) + (1 − F(x)) ]^n

and therefore Z n (x) ∼ Bi(k|n, F(x)) so, if W = n Fn (x), then


 
P(W = k|n, F(x)) = [n!/(k! (n − k)!)] F(x)^k (1 − F(x))^{n−k}

with

E[W ] = n F(x) −→ E[Fn (x)] = F(x)

V [W ] = n F(x) (1 − F(x)) −→ V [Fn (x)] = (1/n) F(x) (1 − F(x))
From Chebishev’s Theorem
1
P (|Fn (x) − F(x)| ≥ ) ≤ F(x) (1 − F(x))
n 2
and therefore

lim_{n→∞} P[|Fn (x) − F(x)| ≥ ε] = 0; ∀ε > 0

so the empirical Distribution Function Fn (x) converges in probability to F(x). In fact, since

lim_{n→∞} E[|Fn (x) − F(x)|² ] = lim_{n→∞} F(x) (1 − F(x))/n = 0

it converges also in quadratic mean and therefore in distribution.

Example 1.32 Let X = X 1 / X 2 with X i ∼U n(x|0, 1); i = 1, 2 and Distribution Function

F(x) = (x/2) 1(0,1] (x) + (1 − 1/(2x)) 1(1,∞) (x)
that you can get (exercise) from the Mellin Transform. This is depicted in black in Fig. 1.5. There are no moments for this distribution; that is, E[X^m ] does not exist for m ≥ 1. We have done Monte Carlo samplings of size n = 10, 50 and 100 and the corresponding empirical Distribution Functions

Fn (x) = (1/n) ∑_{i=1}^{n} 1(−∞,x] (xi )

are shown in blue, red and green respectively.
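The comparison shown in Fig. 1.5 can be reproduced numerically by evaluating Fn (x) at a few points; in this sketch the denominator draw is taken as 1 − U, which leaves the distribution unchanged but avoids an (astronomically unlikely) division by zero.

```python
import random

def F(x):
    """Distribution Function of X = X1/X2 with X_i ~ Un(0,1)."""
    return x / 2 if x <= 1 else 1 - 1 / (2 * x)

random.seed(10)
n = 100_000
xs = [random.random() / (1 - random.random()) for _ in range(n)]
points = [0.25, 0.5, 1.0, 2.0, 5.0]
Fn = {x: sum(1 for v in xs if v <= x) / n for x in points}
max_err = max(abs(Fn[x] - F(x)) for x in points)
```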

NOTE 3: Divergence of measures.


Consider the measurable space (Ω, B ) and the probability measures λ, μ << λ and ν << λ. The Kullback's divergence between μ and ν is defined as (see Chap. 4)

K (μ, ν) = ∫_{Ω} (dμ/dλ) log[ (dμ/dλ)/(dν/dλ) ] dλ

Fig. 1.5 Empirical distribution function of Example 1.32 for sample sizes 10 (blue), 50 (green) and 100 (red) together with the distribution function (black)

and the Hellinger distance as

d²_H (μ, ν) = (1/2) ∫_{Ω} | √(dμ/dλ) − √(dν/dλ) |² dλ

For λ the Lebesgue measure, we can write

K ( p, q) = ∫_{Ω} p(x) log[ p(x)/q(x) ] dx   and   d²_H ( p, q) = 1 − ∫_{Ω} √( p(x) q(x)) dx

The Kullback’s divergence will be relevant for Chap. 2. It is left as an exercise to


show that the Normal density that best approximates in Kullback’s sense a given
density p(x) is that with the same mean and variance (assuming they exist) and,
using the Calculus of Variations, that
1 2

k
p(x|λ) = f (x) exp λi h i (x)
i=0


is the form (exponential family) that satisfies k + 1 constraints h i (x) p(x)d x =
ci < ∞; i = 0, . . ., k with specified constants {c j }kj=0 (c0 = 1 for h 0 (x) = 1) and
best approximates a given density q(x). The Hellinger distance is a metric on the
set of all probability measures on B and we shall make use of it, for instance, as a
further check for convergence in Markov Chain Monte Carlo sampling.
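Both quantities are easy to evaluate numerically. For two Normal densities with equal variance the closed forms K = (μ1 − μ2 )²/(2σ²) and d²_H = 1 − exp(−(μ1 − μ2 )²/(8σ²)) provide a check; these closed forms are standard results, not derived in the text, and the parameter values are arbitrary.

```python
import math

mu1, mu2, sigma = 0.0, 1.0, 1.0            # arbitrary illustrative values

def npdf(x, mu):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# midpoint-rule integration on [-10, 10]
dx = 1e-3
grid = [-10 + (i + 0.5) * dx for i in range(int(20 / dx))]
kl = sum(npdf(x, mu1) * math.log(npdf(x, mu1) / npdf(x, mu2)) * dx for x in grid)
h2 = 1 - sum(math.sqrt(npdf(x, mu1) * npdf(x, mu2)) * dx for x in grid)

kl_exact = (mu1 - mu2) ** 2 / (2 * sigma ** 2)                   # -> 0.5
h2_exact = 1 - math.exp(-((mu1 - mu2) ** 2) / (8 * sigma ** 2))
```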

Appendices

Appendix 1: Indicator Function


This is one of the most useful functions in maths. Given a subset A ⊂ Ω, we define the Indicator Function 1 A (x) for all elements x ∈ Ω as:

1 A (x) = 1 if x ∈ A   and   1 A (x) = 0 if x ∈/ A

Given two sets A, B ⊂ Ω, the following relations are obvious:

1 A∩B (x) = min{1 A (x), 1 B (x)} = 1 A (x) 1 B (x)


1 A∪B (x) = max{1 A (x), 1 B (x)} = 1 A (x) + 1 B (x) − 1 A (x) 1 B (x)
1 Ac (x) = 1 − 1 A (x)

It is also called “Characteristic Function” but in Probability Theory we reserve this


name for the Fourier Transform.
Appendix 2: Lebesgue Integral and Lebesgue Measure
The Lebesgue integral extends the Riemann theory of integration and is the natural
integral in the Theory of Probability. There are thousands of good references on the
subject; Chap. 2, Vol. 1 of “Measure Theory” by Bogachev [1] is a recommended
reading but let’s have a short and rather informal introduction for those who did not
yet look into this.
Even though we use the Lebesgue integral in Probability … can we survive with a rough idea of what it is without entering into the mathematical formalism? Yes. The reason is that for the problems we have to deal with in experimental Particle Physics we have either probability density functions that are Riemann integrable (in which case the Riemann integral, when it exists, coincides with that of Lebesgue) or we actually do Lebesgue integrals "unconsciously". Suppose for instance that we have a probability space (Ω, B, μ), a partition Ω = A1 ∪ A2 , the algebra B = {∅, Ω, A1 , A2 } and a probability measure such that μ(A1 ) = 1/3 and μ(A2 ) = 2/3. What is the expected value of the random quantity X (Ak ) = k (which is measurable with respect to B)? We have that

E[X ] = ∑_{k=1}^{2} X (Ak ) μ(Ak ) = ∑_{k=1}^{2} k (k/3) = 5/3

Essentially, this is a Lebesgue integral so, like Molière's Bourgeois Gentleman, we have been speaking prose and didn't even know it! Let's look at another example following the original idea of Lebesgue [3]. In fact, there are more axiomatic ways to define the Lebesgue integral but this is the most intuitive of all. Suppose we want to evaluate
∫_{1}^{2} ln x dx

In principle, for a real valued function f (x) defined on [a, b] the basic approach to evaluate ∫_{a}^{b} f dx is that of Riemann and goes as follows. Consider a partition of the interval [a, b] = ∪_{k=0}^{n−1} Δk with

Δk = [xk , xk+1 ) for k = 0, . . ., n − 2; Δn−1 = [xn−1 , xn ]; x0 = a, xn = b

and a sequence {xk* ∈ Δk }_{k=0}^{n−1} of interior points. Note that the length of each subinterval Δk is (xk+1 − xk ). Now, define the Riemann sum

Sn = ∑_{k=0}^{n−1} f (xk* ) (xk+1 − xk )

and the limit of Sn as the partition gets finer and finer in such a way that max(xk+1 − xk ) → 0. If the limit exists, we say that the function f (x) is Riemann integrable and the limit is the (Riemann) integral. Therefore, for the posed problem:

(1) Take a partition of the domain [1, 2] where xk = 1 + kΔ with x0 = 1 and xn = 2 → Δ = 1/n;
(2) For each subinterval Δk , of length Δ, take xk* = xk so f (xk* ) = ln xk = ln(1 + kΔ);
(3) Evaluate the sum

Sn = ∑_{k=0}^{n−1} [ln xk ] Δ = (1/n) ∑_{k=0}^{n−1} ln(1 + k/n) = (1/n) ln ∏_{k=0}^{n−1} (1 + k/n) = (1/n) ln[ Γ(2n)/(Γ(n) n^n ) ]

and take the limit Δ→0+ (n → ∞). You can check that lim_{n→∞} Sn = 2 ln 2 − 1.
Consider now a measure space (R, B, μ) and a non-negative, bounded and Borel measurable function f (x). Lebesgue's definition of the integral rests on partitioning the range of f (x) instead of the domain. Thus, we start with a partition of [0, sup f ] = ∪_{k=0}^{n−1} Δk where

Δk = [yk , yk+1 ) for k = 0, . . ., n − 2; Δn−1 = [yn−1 , yn ]; y0 = 0, yn = sup f

Being f Borel measurable, μ{ f^{−1} (Δk )} exists so we can evaluate the sum

Sn = ∑_{k=0}^{n−2} yk μ[ f^{−1} (Δk )] + yn−1 μ[ f^{−1} (Δn−1 )]

Again, as the partition gets finer in such a way that max(yk+1 − yk ) → 0, the limit will be the Lebesgue integral provided it exists. For the problem at hand:

(1) Take a partition of the range [0, ln 2] where yk = kΔ, y0 = 0 and yn = ln 2 → Δ = n^{−1} ln 2;
(2) For each subinterval Δk determine the measure of the corresponding set on the support; that is, μ(Δk ) = f^{−1} (yk+1 ) − f^{−1} (yk ) = e^{kΔ} (e^{Δ} − 1);
(3) Evaluate the sum

Sn = ∑_{k=0}^{n−1} yk μ(Δk ) = (e^{Δ} − 1) Δ ∑_{k=0}^{n−1} k e^{kΔ} = 2 ln 2 + Δ e^{Δ} (1 − e^{Δ})^{−1}

and take the limit Δ→0+ (n→∞). As expected, lim_{Δ→0+} Sn = 2 ln 2 − 1.
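The two constructions can be put side by side numerically; both sums converge to 2 ln 2 − 1 ≈ 0.3863. A minimal sketch:

```python
import math

n = 10_000
exact = 2 * math.log(2) - 1

# Riemann: partition the domain [1, 2] into n intervals of length d = 1/n
d = 1.0 / n
riemann = sum(math.log(1 + k * d) * d for k in range(n))

# Lebesgue: partition the range [0, ln 2] into n intervals of length D;
# the set where f lies in [kD, (k+1)D) has measure e^{(k+1)D} - e^{kD}
D = math.log(2) / n
lebesgue = sum(k * D * (math.exp((k + 1) * D) - math.exp(k * D)) for k in range(n))
```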


Partitioning the range of the function and determining the "length" (measure) of each corresponding set on the domain allows us to integrate functions defined over sets for which the Riemann integral does not exist. The typical example that you have almost certainly seen is the integral over [0, 1] of the function f (x) = 1Q∩[0,1] (x). It is nowhere continuous and therefore is not Riemann integrable, but Q is countable so μ(Q∩[0, 1]) = 0 and therefore

∫_{[0,1]} 1Q∩[0,1] dμ = μ(Q∩[0, 1]) = 0

Nevertheless, the crucial difference with respect to Riemann's integral is not the partition of the range but the possibility to perform integrals over "wilder" sets and, for us, the chance to consider arbitrary probability measures over arbitrary sets. But, for this, we have to clarify how to define the measure of a set. In general, we shall be concerned only with R^n and it turns out that there is a unique measure λ on R^n that is invariant under translations and such that for the unit cube λ([0, 1]^n ) = 1: the Lebesgue measure, that assigns to an interval [a, b] ⊂ R what we intuitively would guess: λ([a, b]) = (b − a). However, as explained in Sect. 1.1.2.2, if we want to satisfy these conditions there is a price to pay: not all subsets of R are measurable.
Let’s finish with a more axiomatic introduction and some properties. Consider
the measure space (, B, μ); eventually a probability space with μ a probability
measure. Then, for S⊂B we define
 
de f
μ(S) = dμ = 1 S dμ
S 

where μ(S) may be +∞ (unless it is a finite measure). Now, given a finite partition
{Sk ; k = 1, . . ., n} of  and a simple function


n
S= ak 1 Sk where ak ≥ 0 ∀k and μ(Sk ) < +∞ if ak = 0
k=1

it is natural to define, for a measurable set A ⊂ Ω:

∫_A S dμ =def ∫_{Ω} S 1 A dμ = ∑_{k=1}^{n} ak μ(Sk ∩A)

Then:
(1) Let f be a non-negative measurable function with respect to B (that may take the value +∞). We define:

∫_{Ω} f dμ =def sup{ ∫_{Ω} S dμ; 0 ≤ S ≤ f ; S simple }

that, obviously, may be +∞ in some cases.

(2) Let f be a measurable function that may take negative values and denote by

f + = f 1( f >0) and f − = − f 1( f <0)

so that f = f + − f − and | f | = f + + f − . Then, if

∫_{Ω} f + dμ < +∞ and ∫_{Ω} f − dμ < +∞

we have that

∫_{Ω} | f | dμ = ∫_{Ω} f + dμ + ∫_{Ω} f − dμ < +∞

and we say that the function f is Lebesgue integrable with integral

∫_{Ω} f dμ = ∫_{Ω} f + dμ − ∫_{Ω} f − dμ

Observe that f defined on R is Lebesgue integrable iff it belongs to the Banach


space L 1 (R).
Some of the main properties of the Lebesgue integral are:
(1) If f(x) and g(x) are two non-negative measurable functions such that f = g
almost everywhere; that is, μ({x ∈ Ω | f(x) ≠ g(x)}) = 0, then

    ∫_Ω f dμ = ∫_Ω g dμ

The function f(x) is integrable iff g(x) is integrable and both integrals are the
same;
(2) If f(x) and g(x) are two integrable functions then
  • if a, b ∈ R, it holds that ∫_Ω (a f + b g) dμ = a ∫_Ω f dμ + b ∫_Ω g dμ
  • if f ≤ g it holds that ∫_Ω f dμ ≤ ∫_Ω g dμ
(3) (Monotone Convergence) If {f_k(x)}_{k∈N} is a sequence of non-negative measurable
functions such that f_k(x) ≤ f_{k+1}(x) for all k ∈ N and x ∈ Ω, then

    ∫_Ω lim_k f_k dμ = lim_k ∫_Ω f_k dμ

(the integrals can be infinite);
(4) (Dominated Convergence) If {f_k(x)}_{k∈N} is a sequence of functions that converge
pointwise to f(x) (i.e. lim_{k→∞} f_k(x) = f(x) for all x) and there exists an integrable
function g such that |f_k| ≤ g for all k, then f is integrable and

    lim_{k→∞} ∫_Ω f_k dμ = ∫_Ω f dμ
Appendix 3: Some properties of Radon–Nikodym derivatives

Consider the σ-additive measures μ1, μ2 and μ3 on the measurable space (Ω, B).
Then

(1) If μ1 ≪ μ2 and μ2 ≪ μ3, then μ1 ≪ μ3 and dμ1/dμ3 = (dμ1/dμ2)(dμ2/dμ3) so:

    μ1(A) = ∫_A dμ1 = ∫_A (dμ1/dμ2) dμ2 = ∫_A (dμ1/dμ2)(dμ2/dμ3) dμ3 = ∫_A (dμ1/dμ3) dμ3

(2) If μ1 ≪ μ3 and μ2 ≪ μ3, then d(μ1 + μ2)/dμ3 = dμ1/dμ3 + dμ2/dμ3

(3) If μ1 ≪ μ2 and g is a μ1-integrable function, then

    ∫_A g dμ1 = ∫_A g (dμ1/dμ2) dμ2

(4) If μ1 ≪ μ2 and μ2 ≪ μ1 (equivalently, μ1 ∼ μ2), then dμ1/dμ2 = (dμ2/dμ1)⁻¹

We shall use some of these properties in different places; for instance, relating
mathematical expectations under different probability measures or justifying some
techniques used in Monte Carlo sampling.

References

1. V.I. Bogachev, Measure Theory (Springer, Berlin, 2006)
2. A. Gut, Probability: A Graduate Course, Springer Texts in Statistics (Springer, Berlin, 2013)
3. H.L. Lebesgue, Sur le développement de la notion d'intégrale (1926)
Chapter 2
Bayesian Inference

… some rule could be found, according to which we ought to


estimate the chance that the probability for the happening of an
event perfectly unknown, should lie between any two named
degrees of probability, antecedently to any experiments made
about it; …
An Essay towards solving a Problem in the Doctrine of Chances
By the late Rev. Mr. Bayes…

The goal of statistical inference is to get information from experimental observations
about quantities (parameters, models,…) on which we want to learn something, be
they directly observable or not. Bayesian inference1 is based on the Bayes rule
and considers probability as a measure of the degree of knowledge we have on the
quantities of interest. Bayesian methods provide a framework with enough freedom
to analyze different models, as complex as needed, using in a natural and conceptually
simple way all the information available from the experimental data within a scheme
that allows us to understand the different steps of the learning process:
(1) state the knowledge we have before we do the experiment;
(2) how the knowledge is modified after the data is taken;
(3) how to incorporate new experimental results;
(4) predict what we should expect in a future experiment from the knowledge acquired.
It was Sir R.A. Fisher, one of the greatest statisticians ever, who said that “The
Theory of Inverse Probability (that is how Bayesianism was called at the beginning of
the XX century) is founded upon an error and must be wholly rejected” although, as
time went by, he became a little more acquiescent with Bayesianism. You will see that
Bayesianism is great, rational, coherent, conceptually simple,… “even useful”,… and
worth, at least, taking a look at it and at the more detailed references on the subject

1 For a gentle reading on the subject see [1].



given along the section. At the end, to quote Lindley, “Inside every non-Bayesian
there is a Bayesian struggling to get out”. For a more classical approach to Statistical
Inference see [2] where most of what you will need in Experimental Physics is
covered in detail.

2.1 Elements of Parametric Inference

Consider an experiment designed to provide information about the set of parameters
θ = {θ1, . . . , θk} ∈ Θ ⊆ R^k and whose realization results in the random sample
x = {x1, x2, . . . , xn}. The inferential process entails:
(1) Specification of the probabilistic model for the random quantities of interest;
that is, state the joint density:

    p(θ, x) = p(θ1, θ2, . . . , θk, x1, x2, . . . , xn);   θ ∈ Θ ⊆ R^k;   x ∈ Ω_X

(2) Conditioning the observed data (x) to the parameters (θ) of the model:

    p(θ, x) = p(x|θ) p(θ)

(3) Last, since p(θ, x) = p(x|θ) p(θ) = p(θ|x) p(x) and

    p(x) = ∫_Θ p(x, θ) dθ = ∫_Θ p(x|θ) p(θ) dθ

we have (Bayes Rule) that:

    p(θ|x) = p(x|θ) p(θ) / ∫_Θ p(x|θ) p(θ) dθ

This is the basic equation for parametric inference. The integral in the denominator
does not depend on the parameters (θ) of interest; it is just a normalization factor, so
we can write, in a general way:

    p(θ|x) ∝ p(x|θ) p(θ)

Let’s see these elements in detail:


p(θ|x): This is the Posterior Distribution that quantifies the knowledge we have
on the parameters of interest θ conditioned to the observed data x (that is,
after the experiment has been done) and will allow to perform inferences
about the parameters;
2.1 Elements of Parametric Inference 89

p(x|θ): The Likelihood; the sampling distribution considered as a function of


the parameters θ for the fixed values (already observed) x. Usually, it is
written as (θ; x) to stress the fact that it is a function of the parameters.
The experimental results modify the prior knowledge we have on the
parameters θ only through the likelihood so, for the inferential process,
we can consider the likelihood function defined up to multiplicative factors
provided they do not depend on the parameters.
p(θ): This is a reference function, independent of the results of the experiment,
that quantifies or expresses, in a sense to be discussed later, the knowledge
we have on the parameters θ before the experiment is done. It is termed
Prior Density although, in many cases, it is an improper function and
therefore not a probability density.

2.2 Exchangeable Sequences

The inferential process to obtain information about a set of parameters θ ∈ Θ of a
model X ∼ p(x|θ) with X ∈ Ω_X is based on the realization of an experiment e(1) that
provides an observation {x1}. The n-fold repetition of the experiment under the same
conditions, e(n), will provide the random sample x = {x1, x2, . . . , xn} and this can be
considered as a draw of the n-dimensional random quantity X = (X1, X2, . . . , Xn)
where each Xi ∼ p(x|θ).
In Classical Statistics, the inferential process makes extensive use of the idea that
the observed sample originates from a sequence of independent and identically
distributed (iid) random quantities while Bayesian Inference rests on the less
restrictive idea of exchangeability [3]. An infinite sequence of random quantities
{X_i}_{i=1}^∞ is said to be exchangeable if any finite sub-sequence {X1, X2, . . . , Xn}
is exchangeable; that is, if the joint density p(x1, x2, . . . , xn) is invariant under any
permutation of the indices.
The hypothesis of exchangeability assumes a symmetry of the experimental
observations {x1, x2, . . . , xn} such that the subscripts which identify a particular
observation (for instance the order in which they appear) are irrelevant for the
inferences. Clearly, if {X1, X2, . . . , Xn} are iid then the conditional joint density can be
expressed as:

    p(x1, x2, . . . , xn) = ∏_{i=1}^{n} p(xi)

and therefore, since the product is invariant under reordering, it is an exchangeable
sequence. The converse is not necessarily true2 so the hypothesis of exchangeability
is weaker than the hypothesis of independence.

2 It is easy to check, for instance, that if X0 is a non-trivial random quantity independent of the Xi,
the sequence {X0 + X1, X0 + X2, . . . , X0 + Xn} is exchangeable but not iid.

Now, if {X_i}_{i=1}^∞ is an exchangeable

sequence of real-valued random quantities, it can be shown that, for any finite subset,
there exists a parameter θ ∈ Θ, a parametric model p(x|θ) and a measure dμ(θ) such
that3:

    p(x1, x2, . . . , xn) = ∫_Θ ∏_{i=1}^{n} p(xi|θ) dμ(θ)

Thus, any finite sequence of exchangeable observations is described by a model
p(x|θ) and, if dμ(θ) = p(θ)dθ, there is a prior density p(θ) that we may consider
as describing the available information on the parameter θ before the experiment is
done. This justifies and, in fact, leads to the Bayesian approach in which, by formally
applying Bayes Theorem

    p(x, θ) = p(x|θ) p(θ) = p(θ|x) p(x)

we obtain the posterior density p(θ|x) that accounts for the degree of knowledge
we have on the parameter after the experiment has been performed. Note that the
random quantities of the exchangeable sequence {X1, X2, . . . , Xn} are conditionally
independent given θ but not iid because

    p(x_j) = ∫_Θ p(x_j|θ) dμ(θ) (∏_{i(≠j)=1}^{n} ∫_{Ω_X} p(xi|θ) dxi)

and

    p(x1, x2, . . . , xn) ≠ ∏_{i=1}^{n} p(xi)
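The footnote's example of an exchangeable but non-iid sequence, Yi = X0 + Xi, can be checked numerically. This is a minimal sketch, with standard normal quantities as an illustrative assumption: the Yi share the same marginal and the pair (Y1, Y2) is exchangeable, yet they are correlated and therefore not independent:

```python
import random

random.seed(0)

# Y1 = X0 + X1 and Y2 = X0 + X2 with X0, X1, X2 independent standard normals:
# Cov(Y1, Y2) = Var(X0) = 1 and Var(Yi) = 2, so the correlation is 1/2.
n = 100_000
y1, y2 = [], []
for _ in range(n):
    x0 = random.gauss(0, 1)
    y1.append(x0 + random.gauss(0, 1))
    y2.append(x0 + random.gauss(0, 1))

m1, m2 = sum(y1) / n, sum(y2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
v1 = sum((a - m1) ** 2 for a in y1) / n
v2 = sum((b - m2) ** 2 for b in y2) / n
rho = cov / (v1 * v2) ** 0.5
print(round(rho, 2))   # close to 1/2, far from the 0 required by independence
```

Conditioned on X0, the Yi are independent, which is exactly the structure p(x1, …, xn) = ∫ ∏ p(xi|θ) dμ(θ) with X0 in the role of θ.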

There are situations for which the hypothesis of exchangeability can not be
assumed to hold. That is the case, for instance, when the data collected by an
experiment depend on the running conditions that may be different for different
periods of time, for data provided by two different experiments with different
acceptances, selection criteria, efficiencies,… or the same medical treatment when
applied to individuals from different environments, sex, ethnic groups,… In these
cases, we shall have different units of observation and it may be more sound to assume
partial exchangeability within each unit (data taking periods, detectors, hospitals,…)
and design a hierarchical structure with parameters that account for the relevant
information from each unit, analyzing all the data in a more global framework.

3 This is referred to as De Finetti's Theorem after B. de Finetti (1930s) and was generalized by
E. Hewitt and L.J. Savage in the 1950s. See [4].

Note 4: Suppose that we have a parametric model p1(x|θ) and the exchangeable
sample x1 = {x1, x2, . . . , xn} provided by the experiment e1(n). The inferences on

the parameters θ will be drawn from the posterior density p(θ|x1) ∝ p1(x1|θ) p(θ).
Now, we do a second experiment e2(m), statistically independent of the first, that
provides the exchangeable sample x2 = {x_{n+1}, x_{n+2}, . . . , x_{n+m}} from the model
p2(x|θ). It is sound to take as prior density for this second experiment the posterior
of the first, including therefore the information that we already have about θ,
so

    p(θ|x2) ∝ p2(x2|θ) p(θ|x1) ∝ p2(x2|θ) p1(x1|θ) p(θ).

Since the two experiments are statistically independent and their sequences
exchangeable, if they have the same sampling distribution p(x|θ) we have that
p1(x1|θ) p2(x2|θ) = p(x|θ) where x = {x1, x2} = {x1, . . . , xn, x_{n+1}, . . . , x_{n+m}}
and therefore p(θ|x2) ∝ p(x|θ) p(θ). Thus, the knowledge we have on θ including
the information provided by the experiments e1(n) and e2(m) is determined by
the likelihood function p(x|θ) and, in consequence, under the aforementioned
conditions the realization of e1(n) first and e2(m) after is equivalent, from the
inferential point of view, to the realization of the experiment e(n + m).

2.3 Predictive Inference

Consider the realization of the experiment e1(n) that provides the sample x =
{x1, x2, . . . , xn} drawn from the model p(x|θ). Inferences about θ ∈ Θ are determined
by the posterior density

    p(θ|x) ∝ p(x|θ) π(θ)

Now suppose that, under the same model and the same experimental conditions, we
think about doing a new independent experiment e2(m). What will be the distribution
of the random sample y = {y1, y2, . . . , ym} not yet observed? Consider the
experiment e(n + m) and the sampling density

    p(θ, x, y) = p(x, y|θ) π(θ)

Since both experiments are independent and iid, we have the joint density

    p(x, y|θ) = p(x|θ) p(y|θ)  −→  p(θ, x, y) = p(x|θ) p(y|θ) π(θ)

and, integrating over the parameter θ ∈ Θ:

    p(y, x) = p(y|x) p(x) = ∫_Θ p(y|θ) p(x|θ) π(θ) dθ = p(x) ∫_Θ p(y|θ) p(θ|x) dθ

Thus, we have that

    p(y|x) = ∫_Θ p(y|θ) p(θ|x) dθ

This is the basic expression for predictive inference and allows us to predict the
results y of a future experiment from the results x observed in a previous experiment
within the same parametric model. Note that p(y|x) is the density of the quantities not
yet observed conditioned to the observed sample. Thus, even though the experiments
e(y) and e(x) are statistically independent, the realization of the first one (e(x))
modifies the knowledge we have on the parameters θ of the model and therefore
affects the prediction on future experiments for, if we do not consider the results of
the first experiment or just don't do it, the predictive distribution for e(y) would be

    p(y) = ∫_Θ p(y|θ) π(θ) dθ

It is then clear from the expression of predictive inference that in practice it is
equivalent to consider as prior density for the second experiment the proper density
π(θ) = p(θ|x). If the first experiment provides very little information on the
parameters, then p(θ|x) ≃ π(θ) and

    p(y|x) ≃ ∫_Θ p(y|θ) π(θ) dθ = p(y)

On the other hand, if after the first experiment we know the parameters with high
accuracy then, in a distributional sense, ⟨p(θ|x), ·⟩ ≃ ⟨δ(θ0), ·⟩ and

    p(y|x) ≃ ⟨δ(θ0), p(y|θ)⟩ = p(y|θ0).

2.4 Sufficient Statistics

Consider m random quantities {X1, X2, . . . , Xm} that take values in Ω1 × · · · × Ωm
and a random vector

    T : Ω1 × · · · × Ωm −→ R^{k(m)}

whose k(m) ≤ m components are functions of the random quantities {X_i}_{i=1}^m. Given
the sample {x1, x2, . . . , xm}, the vector t = t(x1, . . . , xm) is a k(m)-dimensional
statistic. The practical interest lies in the existence of statistics that contain all the
relevant information about the parameters so we don't have to work with the whole
sample and can simplify considerably the expressions. Thus, of special relevance are
the sufficient statistics. Given the model p(x1, x2, . . . , xn|θ), the set of statistics
t = t(x1, . . . , xm) is sufficient for θ if, and only if, ∀m ≥ 1 and any prior distribution
π(θ) it holds that

    p(θ|x1, x2, . . . , xm) = p(θ|t)

Since the data act in the Bayes formula only through the likelihood, it is clear that
to specify the posterior density of θ we can consider

p(θ|x1 , x2 , . . . , xm ) = p(θ|t) ∝ p(t|θ) π(θ)

and all other aspects of the data but t are irrelevant. It is obvious however that t =
{x1, . . . , xm} is sufficient and, in principle, gives no simplification in the modeling.
For this we should have k(m) = dim(t) < m (minimal sufficient statistics) and, in
the ideal case, we would like that k(m) = k does not depend on m. Except for some
irregular cases, the only distributions that admit a fixed number of sufficient statistics
independent of the sample size (that is, k(m) = k < m ∀m) are those that belong
to the exponential family.

Example 2.1 (1) Consider the exponential model X ∼ Ex(x|θ) and the iid experiment
e(m) that provides the sample x = {x1, . . . , xm}. The likelihood function is:

    p(x|θ) = θ^m e^{−θ(x1 + ··· + xm)} = θ^{t1} e^{−θ t2}

and therefore we have the sufficient statistic t = (m, ∑_{i=1}^{m} xi) : Ω1 × · · · × Ωm −→ R^{k(m)=2}
(2) Consider the Normal model X ∼ N(x|μ, σ) and the iid experiment e(m) again
with x = {x1, . . . , xm}. The likelihood function is:

    p(x|μ, σ) ∝ σ^{−m} exp{ −(1/2σ²) ∑_{i=1}^{m} (xi − μ)² } = σ^{−t1} exp{ −(1/2σ²)(t3 − 2 μ t2 + μ² t1) }

and t = (m, ∑_{i=1}^{m} xi, ∑_{i=1}^{m} xi²) : Ω1 × · · · × Ωm −→ R^{k(m)=3} is a sufficient
statistic. Usually we shall consider t = {m, x̄, s²} with

    x̄ = (1/m) ∑_{i=1}^{m} xi   and   s² = (1/m) ∑_{i=1}^{m} (xi − x̄)²

the sample mean and the sample variance. Inferences on the parameters μ and σ will
depend on t and all other aspects of the data are irrelevant.

(3) Consider the Uniform model X ∼ Un(x|0, θ) and the iid sampling {x1, x2, . . . ,
xm}. Then t = (m, max{xi, i = 1, . . . , m}) : Ω1 × · · · × Ωm −→ R^{k(m)=2} is a
sufficient statistic for θ.
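The practical content of sufficiency in case (1) can be checked directly: two samples that differ pointwise but share t = (m, ∑xi) give identical likelihoods and hence identical posteriors for any prior. A minimal check with made-up samples:

```python
import math

# Two hypothetical iid exponential samples with the same sufficient statistic
# t = (m, sum(x)); the likelihood theta^m exp(-theta * sum(x)) then coincides
# for every theta.
x_a = [0.3, 1.2, 2.5]     # m = 3, sum = 4.0
x_b = [1.0, 1.0, 2.0]     # m = 3, sum = 4.0, but a different sample

def log_like(theta, xs):
    return len(xs) * math.log(theta) - theta * sum(xs)

for theta in (0.2, 0.75, 1.3, 4.0):
    assert math.isclose(log_like(theta, x_a), log_like(theta, x_b))
print("identical likelihoods for all theta")
```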

2.5 Exponential Family

A probability density p(x|θ), with x ∈ Ω_X and θ ∈ Θ ⊆ R^k, belongs to the
k-parameter exponential family if it has the form:

    p(x|θ) = f(x) g(θ) exp{ ∑_{i=1}^{k} ci φi(θ) hi(x) }

with

    g(θ)^{−1} = ∫_{Ω_X} f(x) exp{ ∑_{i=1}^{k} ci φi(θ) hi(x) } dx < +∞

The family is called regular if supp{X} is independent of θ; irregular otherwise.
If x = {x1, x2, . . . , xn} is an exchangeable random sampling from the k-parameter
regular exponential family, then

    p(x|θ) = (∏_{i=1}^{n} f(xi)) [g(θ)]^n exp{ ∑_{i=1}^{k} ci φi(θ) (∑_{j=1}^{n} hi(xj)) }

and therefore t(x) = (n, ∑_{i=1}^{n} h1(xi), . . . , ∑_{i=1}^{n} hk(xi)) will be a set of
sufficient statistics.

Example 2.2 Several distributions of interest, like Poisson and Binomial, belong to
the exponential family:

(1) Poisson Po(n|μ):   P(n|μ) = e^{−μ} μ^n / Γ(n + 1) = e^{−(μ − n ln μ)} / Γ(n + 1)

(2) Binomial Bi(n|N, θ):   P(n|N, θ) = (N choose n) θ^n (1 − θ)^{N−n} = (N choose n) e^{n ln θ + (N−n) ln(1−θ)}

However, the Cauchy Ca(x|α, β) distribution, for instance, does not because

    p(x1, . . . , xm|α, β) ∝ ∏_{i=1}^{m} [1 + β(xi − α)²]^{−1} = exp{ −∑_{i=1}^{m} log(1 + β(xi − α)²) }

can not be expressed in the exponential family form. In consequence, there are
no minimal sufficient statistics (in other words, t = {m, x1, . . . , xm} is the sufficient
statistic) and we will have to work with the whole sample.

2.6 Prior Functions

In the Bayes rule, p(θ|x) ∝ p(x|θ) p(θ), the prior function p(θ) represents the
knowledge (degree of credibility) that we have about the parameters before the exper-
iment is done and it is a necessary element to obtain the posterior density p(θ|x)
from which we shall make inferences. If we have faithful information on them before
we do the experiment, it is reasonable to incorporate that in the specification of the
prior density (informative prior) so the new data will provide additional information
that will update and improve our knowledge. The specific form of the prior can be
motivated, for instance, by the results obtained in previous experiments. However, it
is usual that before we do the experiment, either we have a vague knowledge of the
parameters compared to what we expect to get from the experiment or simply we
do not want to include previous results to perform an independent analysis. In this
case, all the new information will be contained in the likelihood function p(x|θ) of
the experiment and the prior density (non-informative prior) will be merely a
mathematical element needed for the inferential process. This being the case, we expect
that the whole weight of the inferences rests on the likelihood and the prior function
has the smallest possible influence on them. To learn something from the experiment
it is then desirable to have a situation like the one shown in Fig. 2.1 where the pos-
terior distribution p(θ|x) is dominated by the likelihood function. Otherwise, the
experiment will provide little information compared to the one we had before and,
unless our previous knowledge is based on suspicious observations, it will be wise
to design a better experiment.
A considerable amount of effort has been put to obtain reasonable non-informative
priors that can be used as a standard reference function for the Bayes rule. Clearly,
non-informative is somewhat misleading because we are never in a state of absolute
ignorance about the parameters and the specification of a mathematical model for
the process assumes some knowledge about them (masses and life-times take non-
negative real values, probabilities have support on [0, 1],…). On the other hand, it
doesn’t make sense to think about a function that represents ignorance in a formal
and objective way, so knowing little a priori is relative to what we may expect to
learn from the experiment. Whatever prior we use will certainly have some effect on
the posterior inferences and, in some cases, it would be wise to consider a reasonable
set of them to see what is the effect.
The ultimate task of this section is to present the most usual approaches to derive
a non-informative prior function to be used as a standard reference that contains
little information about the parameters compared to what we expect to get from the
experiment.4 In many cases, these priors will not be Lebesgue integrable (improper
functions) and, obviously, can not be considered as probability density functions that
quantify any knowledge on the parameters (although, with little rigor, sometimes we
still talk about prior densities). If one is reluctant to use them right away one can,
for instance, define them on a sufficiently large compact support that contains the
region where the likelihood is dominant. However, since

4 For a comprehensive discussion see [5].


96 2 Bayesian Inference

[Fig. 2.1 Prior, likelihood and posterior as a function of the parameter θ. In this case,
the prior is a smooth function and the posterior is dominated by the likelihood.]

p(θ|x) dθ ∝ p(x|θ) p(θ) dθ = p(x|θ) dμ(θ)

in most cases it will be sufficient to consider them simply as what they really are: a
measure. In any case, what is mandatory is that the posterior is a well defined proper
density.

2.6.1 Principle of Insufficient Reason

The Principle of Insufficient Reason5 dates back to J. Bernoulli and P.S. Laplace
and, originally, it states that if we have n exclusive and exhaustive hypothesis and
there is no special reason to prefer one over the other, it is reasonable to consider
them equally likely and assign a prior probability 1/n to each of them. This certainly
sounds reasonable and the idea was right the way extended to parameters taking
countable possible values and to those with continuous support that, in case of com-
pact sets, becomes a uniform density. It was extensively used by P.S. Laplace and
T. Bayes, the latter being the first to use a uniform prior density for making inferences on the
parameter of a Binomial distribution, and is usually referred to as the “Bayes-Laplace
Postulate”. However, a uniform prior density is obviously not invariant under repa-
rameterizations. If prior to the experiment we have a very vague knowledge about
the parameter θ ∈ [a, b], we certainly have a vague knowledge about φ = 1/θ or
ζ = logθ and a uniform distribution for θ:

5 Apparently, “Insufficient Reason” was coined by Laplace in reference to Leibniz's Principle
of Sufficient Reason, stating essentially that every fact has a sufficient reason for why it is the way
it is and not some other way.

    π(θ) dθ = dθ / (b − a)

implies that:

    π(φ) dφ = (1/(b − a)) dφ/φ²   and   π(ζ) dζ = (1/(b − a)) e^ζ dζ

Shouldn't we take as well a uniform density for φ or ζ?


Nevertheless, we shall see that a uniform density, that is far from representing
ignorance on a parameter, may be a reasonable choice in many cases even though, if
the support of the parameter is infinite, it is an improper function.

2.6.2 Parameters of Position and Scale

An important class of parameters we are interested in are those of position and scale.
Let’s treat them separately and leave for a forthcoming section the argument behind
that. Start with a random quantity X ∼ p(x|μ) with μ a location parameter. The
density has the form p(x|μ) = f (x − μ) so, taking a prior function π(μ) we can
write

p(x, μ) d x dμ = [ p(x|μ) d x] [π(μ) dμ] = [ f (x − μ) d x] [π(μ) dμ]

Now, consider the random quantity X′ = X + a with a ∈ R a known value. Defining
the new parameter μ′ = μ + a we have

    p(x′, μ′) dx′ dμ′ = [p(x′|μ′) dx′] [π′(μ′) dμ′] = [f(x′ − μ′) dx′] [π(μ′ − a) dμ′]

In both cases the models have the same structure so making inferences on μ from the
sample {x1, x2, . . . , xn} is formally equivalent to making inferences on μ′ from the
shifted sample {x′1, x′2, . . . , x′n}. Since we have the same prior degree of knowledge
on μ and μ′, it is reasonable to take the same functional form for π(·) and π′(·) so:

    π(μ′ − a) dμ′ = π(μ′) dμ′   ∀a ∈ R

and, in consequence:

    π(μ) = constant

If θ is a scale parameter, the model has the form p(x|θ) = θ f(xθ) so, taking a
prior function π(θ), we have that

    p(x, θ) dx dθ = [p(x|θ) dx] [π(θ) dθ] = [θ f(xθ) dx] [π(θ) dθ]

For the scaled random quantity X′ = aX with a ∈ R⁺ known, we have that:

    p(x′, θ′) dx′ dθ′ = [p(x′|θ′) dx′] [π′(θ′) dθ′] = [θ′ f(x′θ′) dx′] [π(aθ′) a dθ′]

where we have defined the new parameter θ′ = θ/a. Following the same argument
as before, it is sound to assume the same functional form for π(·) and π′(·) so:

    π(aθ′) a dθ′ = π(θ′) dθ′   ∀a ∈ R⁺

and, in consequence:

    π(θ) = 1/θ

Both prior functions are improper so they may be made explicit as

    π(μ, θ) ∝ (1/θ) 1_Θ(θ) 1_M(μ)

with Θ, M an appropriate sequence of compact sets, or considered as prior measures
provided that the posterior densities are well defined. Let's see some examples.

Example 2.3 (The Exponential Distribution) Consider the sequence of independent
observations {x1, x2, . . . , xn} of the random quantity X ∼ Ex(x|θ) drawn under the
same conditions. The joint density is

    p(x1, x2, . . . , xn|θ) = θ^n e^{−θ(x1 + x2 + ··· + xn)}

The statistic t = n^{−1} ∑_{i=1}^{n} xi is sufficient for θ and is distributed as

    p(t|θ) = ((nθ)^n / Γ(n)) t^{n−1} exp{−nθt}

It is clear that θ is a scale parameter so we shall take the prior function π(θ) = 1/θ.
Note that if we make the change z = log t and φ = log θ we have that

    p(z|φ) = (n^n / Γ(n)) exp{ n(φ + z) − n e^{φ+z} }

In this parameterization, φ is a position parameter and therefore π(φ) = constant, in
consistency with π(θ). Then, we have the proper posterior for inferences:

    p(θ|t, n) = ((nt)^n / Γ(n)) exp{−ntθ} θ^{n−1};   θ > 0

Consider now the sequence of compact sets Ck = [1/k, k] covering R⁺ as k → ∞.
Then, with support on Ck we have the proper prior density

    πk(θ) = (1/(2 log k)) (1/θ) 1_{Ck}(θ)

and the sequence of posteriors:

    pk(θ|t, n) = ((nt)^n / (γ(n, ntk) − γ(n, nt/k))) exp{−ntθ} θ^{n−1} 1_{Ck}(θ)

with γ(a, x) the Incomplete Gamma Function. It is clear that

    lim_{k→∞} pk(θ|t, n) = p(θ|t, n)
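The closed form above can be cross-checked numerically; the values n = 5 and t = 2 below are arbitrary. The quadrature verifies that p(θ|t, n) integrates to one and has mean 1/t, as expected for a Gamma density with shape n and rate nt:

```python
import math

n, t = 5, 2.0                          # hypothetical sample size and sample mean
const = (n * t) ** n / math.gamma(n)

def post(theta):
    # p(theta|t, n) = (n t)^n / Gamma(n) * theta^(n-1) * exp(-n t theta)
    return const * theta ** (n - 1) * math.exp(-n * t * theta)

h, M = 1e-4, 200_000                   # midpoint rule on (0, 20]
grid = [(i + 0.5) * h for i in range(M)]
Z = sum(post(th) for th in grid) * h
mean = sum(th * post(th) for th in grid) * h
print(round(Z, 4), round(mean, 4))     # close to 1 and to 1/t = 0.5
```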

Example 2.4 (The Uniform Distribution) Consider the random quantity X ∼
Un(x|0, θ) and the independent sampling {x1, x2, . . . , xn}. To draw inferences on
θ, the statistic xM = max{x1, x2, . . . , xn} is sufficient and is distributed as (show
that):

    p(xM|θ) = n (xM^{n−1} / θ^n) 1_{[0,θ]}(xM)

As in the previous case, θ is a scale parameter and, with the change tM = log xM,
φ = log θ is a position parameter. Then, we shall take π(θ) ∝ θ^{−1} and get the
posterior density (Pareto):

    p(θ|xM, n) = n (xM^n / θ^{n+1}) 1_{[xM,∞)}(θ)
Example 2.5 (The one-dimensional Normal Distribution) Consider the random
quantity X ∼ N(x|μ, σ) and the experiment e(n) that provides the independent and
exchangeable sequence x = {x1, x2, . . . , xn} of observations. The likelihood function
will then be:

    p(x|μ, σ) = ∏_{i=1}^{n} p(xi|μ, σ) ∝ (1/σ^n) exp{ −(1/2σ²) ∑_{i=1}^{n} (xi − μ)² }

There is a three-dimensional sufficient statistic t = {n, x̄, s²} where

    x̄ = (1/n) ∑_{i=1}^{n} xi   and   s² = (1/n) ∑_{i=1}^{n} (xi − x̄)²

so we can write

    p(x|μ, σ) ∝ (1/σ^n) exp{ −(n/2σ²) [s² + (x̄ − μ)²] }

In this case we have both position and scale parameters so we take π(μ, σ) =
π(μ)π(σ) ∝ σ^{−1} and get the proper posterior

    p(μ, σ|x) ∝ p(x|μ, σ) π(μ, σ) ∝ (1/σ^{n+1}) exp{ −(n/2σ²) [s² + (x̄ − μ)²] }

• Marginal posterior density of σ: Integrating over the parameter μ ∈ R we have that:

    p(σ|x) = ∫_{−∞}^{+∞} p(μ, σ|x) dμ ∝ σ^{−n} exp{ −n s²/2σ² } 1_{(0,∞)}(σ)

and therefore, the random quantity

    Z = n s²/σ² ∼ χ²(z|n − 1)

• Marginal posterior density of μ: Integrating over the parameter σ ∈ [0, ∞) we have
that:

    p(μ|x) = ∫_{0}^{+∞} p(μ, σ|x) dσ ∝ [1 + (μ − x̄)²/s²]^{−n/2} 1_{(−∞,∞)}(μ)

so the random quantity

    T = √(n − 1) (μ − x̄)/s ∼ St(t|n − 1)

It is clear that p(μ, σ|x) ≠ p(μ|x) p(σ|x) and, in consequence, μ and σ are not
independent.

• Distribution of μ conditioned to σ: Since p(μ, σ|x) = p(μ|σ, x) p(σ|x) we
have that

    p(μ|σ, x) ∝ (1/σ) exp{ −(n/2σ²) (μ − x̄)² }

so μ|σ ∼ N(μ|x̄, σ/√n).
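The two marginal results give a direct recipe to sample the joint posterior: draw Z = n s²/σ² from χ²(n − 1) (a Gamma((n−1)/2, scale 2) variate), invert for σ and then draw μ|σ ∼ N(x̄, σ/√n). The summaries n, x̄, s² below are made up for the illustration:

```python
import math
import random

random.seed(2)

# Assumed sufficient statistic t = {n, xbar, s2} of a Normal sample.
n, xbar, s2 = 20, 1.5, 4.0

mus, zs = [], []
for _ in range(100_000):
    z = random.gammavariate((n - 1) / 2, 2.0)   # Z ~ chi-square(n-1)
    sigma = math.sqrt(n * s2 / z)               # invert Z = n s2 / sigma^2
    mus.append(random.gauss(xbar, sigma / math.sqrt(n)))
    zs.append(z)

mean_mu = sum(mus) / len(mus)   # posterior mean of mu, close to xbar
mean_z = sum(zs) / len(zs)      # close to n - 1
print(round(mean_mu, 2), round(mean_z, 1))
```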

Example 2.6 (Contrast of parameters of Normal Densities) Consider two independent
random quantities X1 ∼ N(x1|μ1, σ1) and X2 ∼ N(x2|μ2, σ2) and the random
samplings x1 = {x11, x12, . . . , x1n1} and x2 = {x21, x22, . . . , x2n2} of sizes n1
and n2 under the usual conditions. From the considerations of the previous example,
we can write

    p(xi|μi, σi) ∝ (1/σi^{ni}) exp{ −(ni/2σi²) [si² + (x̄i − μi)²] };   i = 1, 2

Clearly, (μ1, μ2) are position parameters and (σ1, σ2) scale parameters so, in principle,
we shall take the improper prior function

    π(μ1, σ1, μ2, σ2) = π(μ1)π(μ2)π(σ1)π(σ2) ∝ 1/(σ1 σ2)

However, if we know that both distributions have the same variance, then we
may set σ = σ1 = σ2 and, in this case, the prior function will be

    π(μ1, μ2, σ) = π(μ1)π(μ2)π(σ) ∝ 1/σ

Let's analyze both cases.

• Marginal Distribution of σ1 and σ2: In this case we assume that σ1 ≠ σ2 and we
shall take the prior π(μ1, σ1, μ2, σ2) ∝ (σ1 σ2)^{−1}. Integrating over μ1 and μ2 we get:

    p(σ1, σ2|x1, x2) = p(σ1|x1) p(σ2|x2) ∝ σ1^{−n1} σ2^{−n2} exp{ −(1/2) (n1 s1²/σ1² + n2 s2²/σ2²) }

Now, if we define the new random quantities

    Z = (σ1/s1)² / (σ2/s2)²   and   W = n1 s1²/σ1²

both with support in (0, +∞), and integrate out the latter, we get that Z follows a
Snedecor Distribution Sn(z|n2 − 1, n1 − 1) whose density is

    p(z|x1, x2) = ((ν1/ν2)^{ν1/2} / Be(ν1/2, ν2/2)) z^{(ν1/2)−1} (1 + (ν1/ν2) z)^{−(ν1+ν2)/2} 1_{(0,∞)}(z)

with ν1 = n2 − 1 and ν2 = n1 − 1.

• Marginal Distribution of μ1 and μ2: In this case, it is different whether we assume
that, although unknown, the variances are the same or not. In the first case, we set
σ1 = σ2 = σ and take the reference prior π(μ1, μ2, σ) ∝ σ^{−1}. Defining

    A = n1 [s1² + (x̄1 − μ1)²] + n2 [s2² + (x̄2 − μ2)²]

we can write

    p(μ1, μ2, σ|x1, x2) ∝ (1/σ^{n1+n2+1}) exp{ −A/2σ² }

It is left as an exercise to show that if we make the transformation

    w = μ1 − μ2 ∈ (−∞, +∞);   u = μ2 ∈ (−∞, +∞)   and   z = σ^{−2} ∈ (0, +∞)

and integrate out the last two, we get

    p(w|x1, x2) ∝ [1 + (n1 n2/(n1 + n2)) ((x̄1 − x̄2) − w)² / (n1 s1² + n2 s2²)]^{−(n1+n2−1)/2}

Introducing the more usual terminology

    s² = (n1 s1² + n2 s2²) / (n1 + n2 − 2)

we have that

    p(w|x1, x2) ∝ [1 + (n1 n2/(n1 + n2)) (w − (x̄1 − x̄2))² / (s² (n1 + n2 − 2))]^{−[(n1+n2−2)+1]/2}

and therefore the random quantity

    T = ((μ1 − μ2) − (x̄1 − x̄2)) / (s (1/n1 + 1/n2)^{1/2})

follows a Student's Distribution St(t|ν) with ν = n1 + n2 − 2 degrees of freedom.


Let’s see now the case where we can not assume that the variances are equal.
Taking the prior reference function π(μ1 , μ2 , σ1 , σ2 ) = (σ1 σ2 )−1 we get
 
1 si2 + (x i − μi )2
2
p(μ1 , μ2 , σ1 , σ2 |x 1 , x 2 ) ∝ σ1−(n 1 +1) σ2−(n 2 +1) exp −
2 i=1 σi2 /n i

After the appropriate integrations (left as exercise), defining w = μ1 − μ2 and


u = μ2 we end up with the density
 −n 1 /2  −n 2 /2
(x 1 − w − u)2 (x 2 − u)2
p(w, u|x 1 , x 2 ) ∝ 1+ 1 +
s12 s22

where integral over u ∈ R can not be expressed in a simple way. The density
2.6 Prior Functions 103
 +∞
p(w|x 1 , x 2 ) ∝ p(w, u|x 1 , x 2 ) du
−∞

is called the Behrens-Fisher Distribution. Thus, to make statements on the difference


of Normal means, we should analyze first the sample variances and decide how shall
we treat them.
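Although the integral over u has no simple closed form, it is easy to evaluate numerically. The sample summaries below are invented for the illustration (s1, s2 stand for the sample standard deviations); by the symmetry of the two Student-type factors, the resulting Behrens–Fisher density peaks at x̄1 − x̄2:

```python
# Invented summaries of the two samples.
n1, xb1, s1 = 12, 5.0, 1.0
n2, xb2, s2 = 15, 3.5, 2.0

def joint(w, u):
    # Unnormalized p(w, u|x1, x2): product of the two Student-type kernels.
    a = (1.0 + (xb1 - w - u) ** 2 / s1 ** 2) ** (-n1 / 2)
    b = (1.0 + (xb2 - u) ** 2 / s2 ** 2) ** (-n2 / 2)
    return a * b

# p(w|x1, x2) ∝ ∫ joint(w, u) du, by midpoint quadrature on a wide u-range.
du = 0.02
us = [-5.0 + (i + 0.5) * du for i in range(850)]   # u in (-5, 12)

def p_w(w):
    return sum(joint(w, u) for u in us) * du

dw = 0.02
ws = [-3.0 + (i + 0.5) * dw for i in range(300)]   # w in (-3, 3)
vals = [p_w(w) for w in ws]
mode = ws[max(range(len(ws)), key=vals.__getitem__)]
print(round(mode, 1))   # the density peaks near xb1 - xb2 = 1.5
```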

2.6.3 Covariance Under Reparameterizations

The question of how to establish a reasonable criterion to obtain a prior for a given model p(x|θ) that can be used as a standard reference function was studied by Harold Jeffreys [6] in the mid-twentieth century. The rationale behind the argument is that if we have the model p(x|θ) with θ ∈ Θ ⊆ R^n and make a reparameterization φ = φ(θ), with φ(·) a one-to-one differentiable function, the statements we make about θ should be consistent with those we make about φ and, in consequence, the priors should be related by

π_θ(θ) dθ = π_φ(φ(θ)) |det[∂φ_i(θ)/∂θ_j]| dθ

Now, assume that the Fisher’s matrix (see Sect. 4.5)


I_ij(θ) = E_X[ (∂ log p(x|θ)/∂θ_i) (∂ log p(x|θ)/∂θ_j) ]

exists for this model. Under a differentiable one-to-one transformation φ = φ(θ)


we have that
I_ij(φ) = I_kl(θ) (∂θ_k/∂φ_i) (∂θ_l/∂φ_j)

so it behaves as a covariant symmetric tensor of second order (left as exercise). Then,


since
det[I(φ)] = (det[∂θ_i/∂φ_j])^2 det[I(θ)]

Jeffreys proposed to consider the prior

π(θ) ∝ [det I(θ)]^{1/2}

In fact, if we consider the parameter space as a Riemannian manifold (see Sect. 4.7)
the Fisher’s matrix is the metric tensor (Fisher-Rao metric) and this is just the invariant
104 2 Bayesian Inference

volume element. Intuitively, if we make a transformation such that at a particular value


φ0 = φ(θ 0 ) the Fisher’s tensor is constant and diagonal, the metric in a neighborhood
of φ0 is Euclidean and we have location parameters for which a constant prior is
appropriate and therefore
 
π(φ) dφ ∝ dφ = [det I(θ)]^{1/2} dθ = π(θ) dθ

It should be pointed out that there may be other priors that are also invariant under
reparameterizations and that, as usual, we talk loosely about prior densities although
they usually are improper functions.
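As a small numerical sketch of this consistency condition (our code; it uses the Binomial Jeffreys prior derived below in Example 2.7), one can verify that π_θ(θ)|dθ/dφ| is exactly constant when φ = 2 asin(θ^{1/2}):

```python
import math

def pi_theta(theta):
    """Binomial Jeffreys prior (unnormalized): [theta(1-theta)]^(-1/2)."""
    return (theta * (1.0 - theta)) ** -0.5

def pi_phi(phi):
    """Prior induced on phi = 2*asin(sqrt(theta)): with theta = sin^2(phi/2),
    pi_phi = pi_theta(theta(phi)) * |dtheta/dphi|."""
    theta = math.sin(phi / 2.0) ** 2
    dtheta_dphi = math.sin(phi) / 2.0   # derivative of sin^2(phi/2)
    return pi_theta(theta) * dtheta_dphi

# the induced density is flat in phi, as translation invariance requires
vals = [pi_phi(p) for p in (0.3, 1.0, 2.0, 3.0)]
assert all(abs(v - 1.0) < 1e-12 for v in vals)
```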
For a one-dimensional parameter, the density function expressed in terms of

φ(θ) = ∫ [I(θ)]^{1/2} dθ

may be reasonably well approximated by a Normal density (at least in the parametric region where the likelihood is dominant) because I(φ) is constant (see Sect. 4.5) and then, due to translation invariance, a constant prior for φ is justified. Let's see some examples.
Example 2.7 (The Binomial Distribution) Consider the random quantity
X ∼ Bi(x|θ, n):
 
p(k|n, θ) = (n choose k) θ^k (1 − θ)^{n−k} ;  n, k ∈ N_0 ;  k ≤ n

with 0 < θ < 1. Since E[X ] = nθ we have that:


I(θ) = E_X[ −∂^2 log p(x|n, θ)/∂θ^2 ] = n/(θ(1 − θ))

so the Jeffreys prior (proper in this case) for the parameter θ is

π(θ) ∝ [θ (1 − θ)]−1/2

and the posterior density will therefore be

p(θ|k, n) ∝ θ^{k−1/2} (1 − θ)^{n−k−1/2}

that is, a Be(θ|k + 1/2, n − k + 1/2) distribution. Since

φ = ∫ dθ/[θ(1 − θ)]^{1/2} = 2 asin(θ^{1/2})

we have that θ = sin^2(φ/2) and, parameterized in terms of φ, I(φ) is constant so the distribution “looks” more Normal (see Fig. 2.2).
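A small Monte Carlo sketch (ours, not the book's) of why φ behaves better: by the delta method, the variance of φ̂ = 2 asin((k/n)^{1/2}) is ≈ 1/n independently of θ, which is the variance-stabilizing property behind constant I(φ):

```python
import math, random

random.seed(7)

def sample_binomial(n, theta):
    """One Binomial(n, theta) draw as a sum of Bernoulli trials."""
    return sum(random.random() < theta for _ in range(n))

n, reps = 200, 3000
variances = {}
for theta in (0.2, 0.5, 0.8):
    phis = [2.0 * math.asin(math.sqrt(sample_binomial(n, theta) / n))
            for _ in range(reps)]
    m = sum(phis) / reps
    variances[theta] = sum((p - m) ** 2 for p in phis) / reps

# delta method: Var[phi_hat] ~ 1/n, whatever the value of theta
for theta, v in variances.items():
    assert 0.8 / n < v < 1.2 / n, (theta, v)
```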

Fig. 2.2 Dependence of the likelihood function on the parameter θ (upper) and on φ = 2 asin(θ^{1/2}) (lower) for a Binomial process with n = 10 and k = 1, 5 and 9

Example 2.8 (The Poisson Distribution) Consider the random quantity X ∼


Po(x|μ):

p(x|μ) = e^{−μ} μ^x/Γ(x + 1) ;  x ∈ N ;  μ ∈ R^+

Then, since E[X ] = μ we have


I(μ) = E_X[ −∂^2 log p(x|μ)/∂μ^2 ] = 1/μ

so we shall take as prior (improper):

π(μ) = [I(μ)]^{1/2} = μ^{−1/2}

and make inferences on μ from the proper posterior density

p(μ|x) ∝ e^{−μ} μ^{x−1/2}

that is, a Ga(μ|1, x + 1/2) distribution.
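A quick numeric cross-check (our sketch): integrating the unnormalized posterior e^{−μ} μ^{x−1/2} shows that its mean is Γ(x + 3/2)/Γ(x + 1/2) = x + 1/2, as expected for Ga(μ|1, x + 1/2) in the rate–shape convention used here:

```python
import math

def jeffreys_poisson_posterior_mean(x, upper=80.0, steps=200000):
    """Mean of p(mu|x) ∝ exp(-mu) mu^(x - 1/2), by a plain Riemann sum."""
    h = upper / steps
    num = den = 0.0
    for i in range(1, steps + 1):
        mu = i * h
        w = math.exp(-mu) * mu ** (x - 0.5)
        num += mu * w
        den += w
    return num / den

m = jeffreys_poisson_posterior_mean(3)
assert abs(m - 3.5) < 1e-3                         # x + 1/2
assert abs(math.gamma(4.5) / math.gamma(3.5) - 3.5) < 1e-12
```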



Example 2.9 (The Pareto Distribution) Consider the random quantity X ∼


Pa(x|θ, x0 ) with x0 ∈ R^+ known and density

p(x|θ, x0 ) = (θ/x0) (x0/x)^{θ+1} 1_{(x0,∞)}(x) ;  θ ∈ R^+

Then,
I(θ) = E_X[ −∂^2 log p(x|θ, x0)/∂θ^2 ] = 1/θ^2

so we shall take as prior (improper):

π(θ) ∝ [I(θ)]^{1/2} = θ^{−1}

and make inferences from the posterior density (proper)

p(θ|x, x0 ) = log(x/x0) (x0/x)^θ

Note that if we make the transformation t = log x, the density becomes

p(t|θ, x0 ) = θ x0^θ e^{−θt} 1_{(log x0,∞)}(t)

for which θ is a scale parameter and, from previous considerations, we should take
π(θ) ∝ θ−1 in consistency with Jeffreys’s prior.

Example 2.10 (The Gamma Distribution) Consider the random quantity X ∼ Ga


(x|α, β) with α, β ∈ R^+ and density

p(x|α, β) = (α^β/Γ(β)) e^{−αx} x^{β−1} 1_{(0,∞)}(x)

Show that the Fisher’s matrix is


 
I(α, β) = ( β α^{−2}   −α^{−1} ;  −α^{−1}   Ψ′(β) )

with Ψ′(x) the first derivative of the Digamma Function and, following Jeffreys' rule, we should take the prior

π(α, β) ∝ α^{−1} [β Ψ′(β) − 1]^{1/2}

Note that α is a scale parameter so, from previous considerations, we should take
π(α) ∝ α−1 . Furthermore, if we consider α and β independently, we shall get
π(α, β) = π(α) π(β) ∝ α^{−1} [Ψ′(β)]^{1/2}

Example 2.11 (The Beta Distribution)


Show that for the Be(x|α, β) distribution with density

p(x|α, β) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1} 1_{[0,1]}(x) ;  α, β ∈ R^+

the Fisher’s matrix is given by


 
I(α, β) = ( Ψ′(α) − Ψ′(α + β)   −Ψ′(α + β) ;  −Ψ′(α + β)   Ψ′(β) − Ψ′(α + β) )

with Ψ′(x) the first derivative of the Digamma Function.

Example 2.12 (The Normal Distribution)


Univariate: The Fisher’s matrix is given by
 
I(μ, σ) = ( σ^{−2}  0 ;  0  2σ^{−2} )

so

π1 (μ, σ) ∝ [det I(μ, σ)]^{1/2} ∝ 1/σ^2
However, had we treated the two parameters independently, we should have obtained

1
π2 (μ, σ) = π(μ) π(σ) ∝
σ

The prior π2 ∝ σ −1 is the one we had used in Example 2.5 where the problem was
treated as two one-dimensional independent problems and, as we saw:

T = (n − 1)^{1/2} (μ − x̄)/s ∼ St(t|n − 1)   and   Z = n s^2/σ^2 ∼ χ^2(z|n − 1)

with E[Z] = n − 1. Had we used the prior π1 ∝ σ^{−2}, we would have obtained Z ∼ χ^2(z|n) and therefore E[Z] = n. This is not reasonable. On the one hand, we know from the sampling distribution N(x|μ, σ) that E[n s^2 σ^{−2}] = n − 1. On the other hand, we have two parameters (μ, σ) and integrate over one (σ), so the number of degrees of freedom should be n − 1.

Bivariate: The Fisher’s matrix is given by


 
I(μ1 , μ2 ) = (1 − ρ^2)^{−1} ( σ1^{−2}   −ρ(σ1 σ2)^{−1} ;  −ρ(σ1 σ2)^{−1}   σ2^{−2} )

I(σ1 , σ2 , ρ) = (1 − ρ^2)^{−1} ( (2 − ρ^2)σ1^{−2}   −ρ^2(σ1 σ2)^{−1}   −ρσ1^{−1} ;  −ρ^2(σ1 σ2)^{−1}   (2 − ρ^2)σ2^{−2}   −ρσ2^{−1} ;  −ρσ1^{−1}   −ρσ2^{−1}   (1 + ρ^2)(1 − ρ^2)^{−1} )

I(μ1 , μ2 , σ1 , σ2 , ρ) = ( I(μ1 , μ2 )  0 ;  0  I(σ1 , σ2 , ρ) )

From this,

π(μ1 , μ2 , σ1 , σ2 , ρ) ∝ |det I(μ1 , μ2 , σ1 , σ2 , ρ)|^{1/2} = 1/(σ1^2 σ2^2 (1 − ρ^2)^2)

while if we consider π(μ1 , μ2 , σ1 , σ2 , ρ) = π(μ1 , μ2 )π(σ1 , σ2 , ρ) we get

π(μ1 , μ2 , σ1 , σ2 , ρ) ∝ 1/(σ1 σ2 (1 − ρ^2)^{3/2})

Problem 2.1 Show that for the density p(x|θ); x ∈ Ω ⊆ R^n, the Fisher's matrix (if it exists)

I_ij(θ) = E_X[ (∂ log p(x|θ)/∂θ_i) (∂ log p(x|θ)/∂θ_j) ]

transforms under a differentiable one-to-one transformation φ = φ(θ) as a covariant


symmetric tensor of second order; that is

I_ij(φ) = I_kl(θ) (∂θ_k/∂φ_i) (∂θ_l/∂φ_j)

Problem 2.2 Show that for X ∼ Po(x|μ + b) with b ∈ R + known (Poisson model
with known background), we have that I(μ) = (μ + b)−1 and therefore the posterior
(proper) is given by:

p(μ|x, b) ∝ e^{−(μ+b)} (μ + b)^{x−1/2}



Problem 2.3 Show that for the one parameter mixture model p(x|λ) = λ p1 (x) +
(1 − λ) p2 (x) with p1 (x) = p2 (x) properly normalized and λ ∈ (0, 1),
I(λ) = [1/(λ(1 − λ))] [ 1 − ∫_{−∞}^{∞} (p1(x) p2(x)/p(x|λ)) dx ]

When p1(x) and p2(x) are “well separated”, the integral is ≪ 1 and therefore I(λ) ≃ [λ(1 − λ)]^{−1}. On the other hand, when they “get closer” we can write p2(x) = p1(x) + η(x) with ∫_{−∞}^{∞} η(x) dx = 0 and, after a Taylor expansion for |η(x)| ≪ 1, get to first order

I(λ) ≃ ∫_{−∞}^{∞} [(p1(x) − p2(x))^2/p1(x)] dx + ···

independent of λ. Thus, for this problem it will be sound to consider the prior
π(λ|a, b) = Be(λ|a, b) with parameters between (1/2, 1/2) and (1, 1).
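The two regimes can be checked numerically (our sketch; unit-variance Gaussian components are an illustrative choice, not the text's):

```python
import math

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def fisher_lambda(lam, mu2, lo=-10.0, hi=20.0, steps=30000):
    """I(lambda) for p(x|lambda) = lam*N(x|0,1) + (1-lam)*N(x|mu2,1),
    evaluated from the text's integral formula by a midpoint Riemann sum."""
    h = (hi - lo) / steps
    integral = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        p1, p2 = norm_pdf(x, 0.0), norm_pdf(x, mu2)
        integral += p1 * p2 / (lam * p1 + (1.0 - lam) * p2) * h
    return (1.0 - integral) / (lam * (1.0 - lam))

# well-separated components: I(lam) ~ 1/(lam(1-lam))
i_far = fisher_lambda(0.3, 10.0)
assert abs(i_far - 1.0 / (0.3 * 0.7)) < 0.01
# nearly coincident components: I(lam) almost independent of lam
i_near_a = fisher_lambda(0.3, 0.2)
i_near_b = fisher_lambda(0.7, 0.2)
assert abs(i_near_a - i_near_b) < 0.01
```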

2.6.4 Invariance Under a Group of Transformations

Sometimes we may be interested in providing the prior with invariance under some transformations of the parameters (or a subset of them) considered of interest for the problem at hand. As we have stated, from a formal point of view the prior can be treated as an absolutely continuous measure with respect to Lebesgue, so p(θ|x) dθ ∝ p(x|θ) π(θ) dθ = p(x|θ) dμ(θ). Now, consider the probability space (Ω, B, μ) and a measurable homeomorphism T : Ω→Ω. A measure μ on the Borel algebra B is invariant under the mapping T if for any A ∈ B we have that μ(T^{−1}(A)) = μ(A). We know, for instance, that there is a unique measure λ on R^n that is invariant under translations and such that for the unit cube λ([0, 1]^n) = 1:
tent with the constant prior specified already for position parameters. The Lebesgue
measure is also the unique measure in R n that is invariant under the rotation group
S O(n) (see Problem 2.5). Thus, when expressed in spherical polar coordinates, it
would be reasonable for the spherical surface S n−1 the rotation invariant prior


dμ(φ) = ∏_{k=1}^{n−1} (sin φ_k)^{(n−1)−k} dφ_k

with φn−1 ∈ [0, 2π) and φ j ∈ [0, π] for the rest. We shall use this prior function in
a later problem.
In other cases, the group of invariance is suggested by the model

M : { p(x|θ), x ∈  X , θ ∈  }

in the sense that we can make a transformation of the random quantity X→X′ and absorb the change in a redefinition of the parameters θ→θ′ such that the expression of the probability density remains unchanged. Consider a group of transformations6 G that acts

on the Sample Space: x→x′ = g∘x ;  g ∈ G ;  x, x′ ∈ Ω_X

on the Parametric Space: θ→θ′ = g∘θ ;  g ∈ G ;  θ, θ′ ∈ Ω_θ

The model M is said to be invariant under G if ∀g ∈ G and ∀θ ∈ Ω_θ the random quantity X′ = g∘X is distributed as p(x′|θ′) ≡ p(g∘x|g∘θ). Therefore, transformations of data under G will make no difference on the inferences if we assign consistent “prior beliefs” to the original and transformed parameters. Note that the action of
essential point is that, as Alfred Haar showed in 1933, for the action of the group G
of transformations there is an invariant measure μ (Haar measure; [8]) such that
 
∫_{Ω_X} f(g∘x) dμ(x) = ∫_{Ω_X} f(x′) dμ(x′)

for any Lebesgue integrable function f(x) on Ω_X. Shortly after, it was shown (von Neumann (1934); Weil and Cartan (1940)) that this measure is unique up to a multiplicative constant. In our case, the function will be p(·|θ) 1_{Ω_θ}(θ) and the invariant measure we are looking for is dμ(θ) ∝ π(θ) dθ. Furthermore, since the group may be non-abelian, we shall consider the action on the right and on the left of the parameter space. Thus, we shall have:

∫_Ω p(·|g∘θ) π_L(θ) dθ = ∫_Ω p(·|θ′) π_L(θ′) dθ′

if the group acts on the left and


 
∫_Ω p(·|θ∘g) π_R(θ) dθ = ∫_Ω p(·|θ′) π_R(θ′) dθ′
 

if the action is on the right. Then, we should start by identifying the group of transformations under which the model is invariant (if any; in many cases there is either no invariance or at least no obvious one) and work in the parameter space. The most interesting cases for us are:

6 In this context, the use of Transformation Groups arguments was pioneered by E.T. Jaynes [7].

Affine Transformations: x→x′ = g∘x = a + b x

Matrix Transformations: x→x′ = g∘x = R x

Translations and scale transformations are a particular case of the first and rotations
of the second. Let’s start with the location and scale parameters; that is, a density
 
p(x|μ, σ) dx = (1/σ) f((x − μ)/σ) dx

the Affine group G = {g ≡ (a, b); a ∈ R; b ∈ R^+} so x′ = g∘x = a + bx and the model will be invariant if

(μ′, σ′) = g∘(μ, σ) = (a, b)∘(μ, σ) = (a + bμ, bσ)

Now,

∫ p(·|μ′, σ′) π_L(μ′, σ′) dμ′ dσ′ = ∫ p(·|g∘(μ, σ)) π_L(μ, σ) dμ dσ =

= ∫ p(·|μ′, σ′) π_L[g^{−1}∘(μ′, σ′)] J(μ′, σ′; μ, σ) dμ′ dσ′ =

= ∫ p(·|μ′, σ′) π_L((μ′ − a)/b, σ′/b) (1/b^2) dμ′ dσ′

and this should hold for all (a, b) ∈ R×R^+ so, in consequence:

dμ_L(μ, σ) = π_L(μ, σ) dμ dσ ∝ (1/σ^2) dμ dσ
However, the group of Affine Transformations is non-abelian so if we study the
action on the left, there is no reason why we should not consider also the action on
the right. Since

(μ′, σ′) = (μ, σ)∘g = (μ, σ)∘(a, b) = (μ + aσ, bσ)

the same reasoning leads to (left as exercise):

dμ_R(μ, σ) = π_R(μ, σ) dμ dσ ∝ (1/σ) dμ dσ
The first one (π_L) is the one we obtain using Jeffreys' rule in two dimensions, while π_R is the one we get for position and scale parameters, or from Jeffreys' rule treating both parameters independently; that is, as two one-dimensional problems instead of one two-dimensional problem. Thus, although from the invariance point of view there is no reason why one should prefer one over the other, the right invariant Haar prior

gives more consistent results. In fact ([9, 10]), a necessary and sufficient condition
for a sequence of posteriors based on proper priors to converge in probability to an
invariant posterior is that the prior is the right Haar measure.
Problem 2.4 As a reminder, given a measure space (Ω, B, μ), a mapping T : Ω −→ Ω is measurable if T^{−1}(A) ∈ B for all A ∈ B, and the measure μ is invariant under T if μ(T^{−1}(A)) = μ(A) for all A ∈ B. Show that the measure dμ(θ) = [θ(1 − θ)]^{−1/2} dθ is invariant under the mapping T : [0, 1]→[0, 1] such that T : θ→θ′ = T(θ) = 4θ(1 − θ). This is the Jeffreys prior for the Binomial model Bi(x|N, θ).
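A numerical sketch of this invariance (ours): with F(θ) = 2 asin(θ^{1/2}) the measure of [0, θ], the preimage of [0, a] under T is [0, (1 − √(1−a))/2] ∪ [(1 + √(1−a))/2, 1], and its measure indeed equals F(a):

```python
import math

def F(theta):
    """mu([0, theta]) for d_mu = [theta(1-theta)]^(-1/2) d_theta
    (up to the overall normalization): F(theta) = 2*asin(sqrt(theta))."""
    return 2.0 * math.asin(math.sqrt(theta))

def mu_preimage(a):
    """mu(T^{-1}([0, a])) for T(theta) = 4*theta*(1 - theta); the preimage
    is the union [0, t_minus] U [t_plus, 1]."""
    r = math.sqrt(1.0 - a)
    t_minus, t_plus = (1.0 - r) / 2.0, (1.0 + r) / 2.0
    return F(t_minus) + (F(1.0) - F(t_plus))

for a in (0.1, 0.25, 0.5, 0.9):
    assert abs(mu_preimage(a) - F(a)) < 1e-12
```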
Problem 2.5 Consider the n-dimensional spherical surface S_n of unit radius, x ∈ S_n and the transformation x′ = Rx ∈ S_n where R ∈ SO(n). Show that the Haar invariant measure is the Lebesgue measure on the sphere.
Hint: Recall that R is an orthogonal matrix so R^t = R^{−1}; that |det R| = 1 so J(x′; x) = |∂x/∂x′| = |∂(R^{−1}x′)/∂x′| = |det R| = 1; and that x′^t x′ = x^t x = 1.

Example 2.12 (Bivariate Normal Distribution) Let X = (X 1 , X 2 ) ∼ N (x|0, φ) with


φ = {σ1 , σ2 , ρ}; that is:

p(x|φ) = (2π)^{−1} |det Σ|^{−1/2} exp{ −(1/2) x^t Σ^{−1} x }

with the covariance matrix


Σ = ( σ1^2   ρσ1σ2 ;  ρσ1σ2   σ2^2 )   and   det Σ = σ1^2 σ2^2 (1 − ρ^2)

Using the Cholesky decomposition we can express Σ^{−1} as the product of two lower (or upper) triangular matrices:

Σ^{−1} = (1/det Σ) ( σ2^2   −ρσ1σ2 ;  −ρσ1σ2   σ1^2 ) = A^t A   with   A = ( 1/σ1   0 ;  −ρ/(σ1 (1 − ρ^2)^{1/2})   1/(σ2 (1 − ρ^2)^{1/2}) )


and det Σ = (det[Σ^{−1}])^{−1} = (det A)^{−2}. Thus, in the new parameterization θ = {a_11 , a_21 , a_22 }
 
p(x|θ) = (2π)^{−1} |det A| exp{ −(1/2) x^t A^t A x }

Consider now the group of lower triangular 2×2 matrices

G_l = {T ∈ LT_{2×2} ; T_ii > 0}

Since T^{−1} ∈ G_l, inserting the identity matrix I = T T^{−1} = T^{−1} T we have, for the action:

On the Left:  T∘x → T x = x′ ;  x^t (T^t (T^t)^{−1}) A^t A (T^{−1} T) x ;  M = T

On the Right:  x∘T → T^{−1} x = x′ ;  x^t ((T^t)^{−1} T^t) A^t A (T T^{−1}) x ;  M = T^{−1}

Then

M x = x′ ;  x = M^{−1} x′ ;  x^t = x′^t (M^t)^{−1}   and   dx = dx′/|det M|
so

p(x′|θ) = (2π)^{−1} (|det A|/|det M|) exp{ −(1/2) x′^t (A M^{−1})^t (A M^{−1}) x′ }

and the model is invariant under G_l if the action on the parameter space is

G_l : A −→ A′ = A M^{−1} ;  A = A′ M ;  det A = det A′ · det M

so

p(x′|θ′) = (2π)^{−1} |det A′| exp{ −(1/2) x′^t A′^t A′ x′ }

Then, the Haar equation reads

∫_Ω p(•|A′) π(A′) dA′ = ∫_Ω p(•|g∘A) π(A) dA = ∫_Ω p(•|A′) π(A′M) J(A′; A) dA′

and, in consequence, ∀M ∈ G

π(A′M) J(A′; A) da′_11 da′_21 da′_22 = π(A′) da′_11 da′_21 da′_22

For the action on the left:

M = T = ( a  0 ;  b  c ) ;  a, c > 0  −→  J(A′; A) = a^2 c

and, in consequence

π(a a′_11 , a a′_21 + b a′_22 , c a′_22 ) a^2 c = π(a′_11 , a′_21 , a′_22 )  −→  π(a_11 , a_21 , a_22 ) ∝ 1/(a_11^2 a_22)

For the action on the right:

M = T^{−1} = ( a^{−1}  0 ;  −b(ac)^{−1}  c^{−1} )  −→  J(A′; A) = (a c^2)^{−1}

and, in consequence

π( a′_11/a , (c a′_21 − b a′_22)/(ac) , a′_22/c ) (a c^2)^{−1} = π(a′_11 , a′_21 , a′_22 )  −→  π(a_11 , a_21 , a_22 ) ∝ 1/(a_11 a_22^2)

In terms of the parameters of interest {σ1 , σ2 , ρ}, since

da_11 da_21 da_22 = 1/(σ1^2 σ2^2 (1 − ρ^2)^2) dσ1 dσ2 dρ

we have finally that for invariance under G_l:

π_l^L(σ1 , σ2 , ρ) = 1/(σ1 σ2 (1 − ρ^2)^{3/2})   and   π_l^R(σ1 , σ2 , ρ) = 1/(σ2^2 (1 − ρ^2))

The same analysis with the decomposition in upper triangular matrices leads to

π_u^L(σ1 , σ2 , ρ) = 1/(σ1 σ2 (1 − ρ^2)^{3/2})   and   π_u^R(σ1 , σ2 , ρ) = 1/(σ1^2 (1 − ρ^2))

As we see, in both cases the left Haar invariant prior coincides with Jeffrey’s prior
when {μ1 , μ2 } and {σ1 , σ2 , ρ} are decoupled.

At this point, one may be tempted to use a right Haar invariant prior where the
two parameters σ1 and σ2 are treated on equal footing

π(σ1 , σ2 , ρ) = 1/(σ1 σ2 (1 − ρ^2))

Under this prior, since the sample correlation



r = Σ_i (x_1i − x̄1)(x_2i − x̄2) / [ Σ_i (x_1i − x̄1)^2 Σ_i (x_2i − x̄2)^2 ]^{1/2}

is a sufficient statistic for ρ, we have that the posterior for inferences on the correlation coefficient will be

p(ρ|x) ∝ (1 − ρ^2)^{(n−3)/2} F(n − 1, n − 1, n − 1/2; (1 + rρ)/2)

with F(a, b, c; z) the Hypergeometric Function.

Example 2.13 If θ ∈ Ω_θ −→ g∘θ = φ(θ) = θ′ ∈ Ω_θ, with φ(θ) a one-to-one differentiable mapping, then

∫_Ω p(•|θ) dμ(θ) = ∫_Ω p(•|θ) π(θ) dθ = ∫_Ω p(•|θ′) π(φ^{−1}(θ′)) |∂φ^{−1}(θ′)/∂θ′| dθ′ =

= ∫_Ω p(•|θ′) π(θ′) dθ′ = ∫_Ω p(•|θ′) dμ(θ′)

and therefore, Jeffreys’ prior defines a Haar invariant measure.

2.6.5 Conjugated Distributions

As much as possible, we would like to consider reference priors π(θ|a, b, . . .) versatile enough such that, by varying some of the parameters a, b, . . ., we get diverse forms to analyze the effect on the final results and, on the other hand, to simplify the evaluation of integrals like:

p(x) = ∫ p(x|θ) p(θ) dθ   and   p(y|x) = ∫ p(y|θ) p(θ|x) dθ

This leads us to consider as reference priors the Conjugated Distributions [11].


Let S be a class of sampling distributions p(x|θ) and P the class of prior densities
for the parameter θ. If

p(θ|x) ∈ P for all p(x|θ) ∈ S and p(θ) ∈ P

we say that the class P is conjugated to S. We are mainly interested in the class
of priors P that have the same functional form as the likelihood. In this case, since
both the prior density and the posterior belong to the same family of distributions,
we say that they are closed under sampling. It should be stressed that the criterion for

taking conjugated reference priors is eminently practical and, in many cases, they do
not exist. In fact, only the exponential family of distributions has conjugated prior
densities. Thus, if x = {x1 , x2 , . . . , xn } is an exchangeable random sampling from
the k-parameter regular exponential family, then
p(x|θ) = f(x) g(θ) exp{ Σ_{j=1}^{k} c_j φ_j(θ) [ Σ_{i=1}^{n} h_j(x_i) ] }

and the conjugated prior will have the form:

π(θ|τ) = [1/K(τ)] [g(θ)]^{τ_0} exp{ Σ_{j=1}^{k} c_j φ_j(θ) τ_j }

where θ ∈ Ω_θ, τ = {τ_0 , τ_1 , . . . , τ_k } are the hyperparameters and K(τ) < ∞ is the normalization factor so that ∫ π(θ|τ) dθ = 1. Then, the general scheme will be7:
(1) Choose the class of priors π(θ|τ ) that reflect the structure of the model;
(2) Choose a prior function π(τ ) for the hyperparameters;
(3) Express the posterior density as p(θ, τ |x) ∝ p(x|θ)π(θ|τ )π(τ );
(4) Marginalize for the parameters of interest:

p(θ|x) ∝ p(x|θ)π(θ|τ )π(τ )dτ


or, if desired, get the conditional density

p(θ|x, τ) = p(x, θ, τ)/p(x, τ) = p(x|θ) π(θ|τ)/p(x|τ)

The obvious question that arises is how we choose the prior π(τ) for the hyperparameters. Besides reasonableness, we may consider two approaches. Integrating out the parameters of interest θ, we get

p(τ, x) = π(τ) ∫ p(x|θ) π(θ|τ) dθ = π(τ) p(x|τ)


so we may use any of the procedures under discussion to take π(τ) as the prior for the model p(x|τ) and then obtain

π(θ) = ∫ π(θ|τ) π(τ) dτ

7 We can go a step upwards and assign a prior to the hyperparameters with hyper-hyperparameters, …

The beauty of Bayes' rule, but not very practical in complicated situations. A second approach, uglier but more practical, is the so-called Empirical Method, where we assign numeric values to the hyperparameters suggested by p(x|τ) (for instance, moments, maximum-likelihood estimation, …); that is, setting, in a distributional sense, π(τ) = δ_{τ′} so that ⟨π(τ), p(θ, x, τ)⟩ = p(θ, x, τ′). Thus,

p(θ|x, τ′) ∝ p(x|θ) π(θ|τ′)

Obviously, fixing the hyperparameters assumes a perfect knowledge of them and does not allow for variations, but the procedure may be useful to guess at least where to go.
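A minimal empirical-Bayes sketch for the Poisson–Gamma case of Example 2.14 below (our construction: moment matching on the marginal, a rate–shape Gamma convention, and made-up counts):

```python
def empirical_bayes_poisson(counts):
    """Fix Gamma hyperparameters (rate a, shape b; prior mean b/a) by moment
    matching on the marginal: for a Poisson-Gamma mixture E[n] = E[mu] and
    V[n] = E[mu] + V[mu]; it requires over-dispersed data (v > m)."""
    k = len(counts)
    m = sum(counts) / k
    v = sum((c - m) ** 2 for c in counts) / k
    excess = v - m                     # estimate of V[mu]
    a = m / excess                     # rate
    b = m * m / excess                 # shape
    # conjugate update: p(mu|n) is Gamma with rate a + 1 and shape b + n
    post_means = [(b + n) / (a + 1.0) for n in counts]
    return a, b, post_means

a, b, post = empirical_bayes_poisson([0, 1, 2, 3, 10])
assert abs(b / a - 3.2) < 1e-12        # prior mean reproduces the sample mean
# every posterior mean is shrunk from its count towards the prior mean
assert all(min(n, b / a) <= pm <= max(n, b / a)
           for n, pm in zip([0, 1, 2, 3, 10], post))
```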
Last, it may happen that a single conjugated prior does not represent sufficiently
well our beliefs. In this case, we may consider a k-mixture of conjugated priors


π(θ|τ_1 , . . . , τ_k ) = Σ_{i=1}^{k} w_i π(θ|τ_i)

In fact [12], any prior density for a model that belongs to the exponential family can
be approximated arbitrarily close by a mixture of conjugated priors.

Example 2.14 Let’s see the conjugated prior distributions for some models:
• Poisson model Po(n|μ): Writing

p(n|μ) = e^{−μ} μ^n/Γ(n + 1) = e^{−(μ − n log μ)}/Γ(n + 1)

it is clear that the Poisson distribution belongs to the exponential family and the
conjugated prior density for the parameter μ is

π(μ|τ1 , τ2 ) ∝ e−τ1 μ+τ2 log μ ∝ Ga(μ|τ1 , τ2 )

If we set a prior π(τ1 , τ2 ) for the hyperparameters we can write

p(n, μ, τ1 , τ2 ) = p(n|μ) π(μ|τ1 , τ2 ) π(τ1 , τ2 )

and integrating μ:
p(n, τ1 , τ2 ) = [ Γ(n + τ2) τ1^{τ2} / (Γ(τ2) Γ(n + 1) (1 + τ1)^{n+τ2}) ] π(τ1 , τ2 ) = p(n|τ1 , τ2 ) π(τ1 , τ2 )

• Binomial model Bi(n|N , θ): Writing


   
P(n|N, θ) = (N choose n) θ^n (1 − θ)^{N−n} = (N choose n) e^{n log θ + (N−n) log(1−θ)}

it is clear that it belongs to the exponential family and the conjugated prior density for the parameter θ will be:

π(θ|τ1 , τ2 ) = Be(θ|τ1 , τ2 )
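For instance (our sketch), the closed-under-sampling update is a one-liner; with hyperparameters of the Be(θ|1/2, 1/2) type and 7 successes in 20 trials:

```python
def beta_binomial_update(tau1, tau2, n, N):
    """Bi(n|N, theta) likelihood with a Be(theta|tau1, tau2) prior gives
    the Be(theta|tau1 + n, tau2 + N - n) posterior."""
    return tau1 + n, tau2 + N - n

a, b = beta_binomial_update(0.5, 0.5, 7, 20)
posterior_mean = a / (a + b)
assert (a, b) == (7.5, 13.5)
assert abs(posterior_mean - 7.5 / 21) < 1e-12
```

Since the posterior is again a Beta, a second batch of data can be fed through the same function, which is exactly the closure under sampling described above.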

• Multinomial model Let X = (X 1 , X 2 , . . . , X k ) ∼ Mn(x|θ); that is:


X ∼ p(x|θ) = Γ(n + 1) ∏_{i=1}^{k} θ_i^{x_i}/Γ(x_i + 1) ;  x_i ∈ N_0 , Σ_{i=1}^{k} x_i = n ;  θ_i ∈ [0, 1] , Σ_{i=1}^{k} θ_i = 1

The Dirichlet distribution Di(θ|α):

π(θ|α) = D(α) ∏_{i=1}^{k} θ_i^{α_i−1} ;  α = (α_1 , α_2 , . . . , α_k ) , α_i > 0 , Σ_{i=1}^{k} α_i = α_0 ;  D(α) = Γ(α_0) [ ∏_{i=1}^{k} Γ(α_i) ]^{−1}

is the natural conjugated prior for this model. It is a degenerated distribution in the sense that

π(θ|α) = D(α) [ ∏_{i=1}^{k−1} θ_i^{α_i−1} ] [ 1 − Σ_{i=1}^{k−1} θ_i ]^{α_k−1}

The posterior density will then be θ ∼ Di(θ|x + α) with

E[θ_i] = (x_i + α_i)/(n + α_0)   and   V[θ_i , θ_j] = E[θ_i] (δ_ij − E[θ_j])/(n + α_0 + 1)

The parameters α of the Dirichlet distribution Di(θ|α) determine the expected values E[θ_i] = α_i/α_0. In practice, it is more convenient to control also the variances and use the Generalized Dirichlet Distribution GDi(θ|α, β):

π(θ|α, β) = ∏_{i=1}^{k−1} [Γ(α_i + β_i)/(Γ(α_i)Γ(β_i))] θ_i^{α_i−1} [ 1 − Σ_{j=1}^{i} θ_j ]^{γ_i}

where:

0 < θ_i < 1 ,  Σ_{i=1}^{k−1} θ_i < 1 ,  θ_k = 1 − Σ_{i=1}^{k−1} θ_i

α_i > 0 ,  β_i > 0 ,  and  γ_i = β_i − α_{i+1} − β_{i+1} for i = 1, 2, . . . , k − 2 ;  γ_{k−1} = β_{k−1} − 1

When β_i = α_{i+1} + β_{i+1} it becomes the Dirichlet distribution. For this prior we have that

E[θ_i] = [α_i/(α_i + β_i)] S_i   and   V[θ_i , θ_j] = E[θ_j] [ ((α_i + δ_ij)/(α_i + β_i + 1)) T_i − E[θ_i] ]

where

S_i = ∏_{j=1}^{i−1} β_j/(α_j + β_j)   and   T_i = ∏_{j=1}^{i−1} (β_j + 1)/(α_j + β_j + 1)

with S1 = T1 = 1 and we can have control over the prior means and variances.
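The plain Dirichlet update and its posterior moments can be sketched as follows (our code, using the formulas above):

```python
def dirichlet_update(alpha, x):
    """Multinomial counts x with a Di(theta|alpha) prior give the
    Di(theta|x + alpha) posterior."""
    return [ai + xi for ai, xi in zip(alpha, x)]

def dirichlet_moments(alpha):
    """Mean and covariance matrix of Di(theta|alpha)."""
    a0 = sum(alpha)
    mean = [ai / a0 for ai in alpha]
    cov = [[mean[i] * ((1.0 if i == j else 0.0) - mean[j]) / (a0 + 1.0)
            for j in range(len(alpha))] for i in range(len(alpha))]
    return mean, cov

post = dirichlet_update([1.0, 1.0, 1.0], [3, 5, 2])   # -> Di(theta|4, 6, 3)
mean, cov = dirichlet_moments(post)
assert post == [4.0, 6.0, 3.0]
assert abs(sum(mean) - 1.0) < 1e-12
assert cov[0][1] < 0 < cov[0][0]        # off-diagonal terms are negative
```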

2.6.6 Probability Matching Priors

A pragmatic criterion is that of probability matching priors, for which the one-sided credible intervals derived from the posterior distribution coincide, to a certain level of accuracy, with those derived from the classical approach. This condition leads to a differential equation for the prior distribution [13, 14]. We shall illustrate in the following lines the rationale behind it for the simple one-parameter case, assuming that the needed regularity conditions are satisfied.
Consider then a random quantity X ∼ p(x|θ) and an iid sampling x = {x1 , x2 ,
. . . , xn } with θ the parameter of interest. The classical approach for inferences is
based on the likelihood


n
p(x|θ) = p(x1 , x2 , . . . xn |θ) = p(xi |θ)
i=1

and goes through the following reasoning:


(1) Assume that the parameter θ has a true but unknown value θ0 , so the sample is actually drawn from p(x|θ0);
(2) Find the estimator θm(x) of θ0 as the value of θ that maximizes the likelihood; that is:

θm = arg max_θ {p(x|θ)}  −→  [∂ ln p(x|θ)/∂θ]_{θm} = 0

(3) Given the model X ∼ p(x|θ0 ), after the appropriate change of variables get the
distribution

p(θm |θ0 )

of the random quantity θm (X 1 , X 2 , . . . X n ) and draw inferences from it.


The Bayesian inferential process considers a prior distribution π(θ) and draws infer-
ences on θ from the posterior distribution of the quantity of interest

p(θ|x) ∝ p(x|θ) π(θ)

Let’s start with the Bayesian and expand the term on the right around θm . On the
one hand:
ln[p(x|θ)/p(x|θm)] = (1/2!) [∂^2 ln p(x|θ)/∂θ^2]_{θm} (θ − θm)^2 + (1/3!) [∂^3 ln p(x|θ)/∂θ^3]_{θm} (θ − θm)^3 + ···

Now,

−(1/n) ∂^2 ln p(x|θ)/∂θ^2 = (1/n) Σ_{i=1}^{n} ∂^2(−ln p(x_i|θ))/∂θ^2  −→ (n→∞)  E_X[∂^2(−ln p(x|θ))/∂θ^2] = I(θ)

so we can substitute:
[∂^2 ln p(x|θ)/∂θ^2]_{θm} = −n I(θm)   and   [∂^3 ln p(x|θ)/∂θ^3]_{θm} = −n [∂I(θ)/∂θ]_{θm}

to get

p(x|θ) = e^{ln p(x|θ)} ∝ e^{−(n I(θm)/2)(θ−θm)^2} [ 1 − (n/3!) [∂I(θ)/∂θ]_{θm} (θ − θm)^3 + ··· ]

On the other hand:


π(θ) = π(θm) [ 1 + [ (1/π(θ)) ∂π(θ)/∂θ ]_{θm} (θ − θm) + ··· ]


So, if we define the random quantity T = [n I(θm)]^{1/2} (θ − θm) and consider that

[∂I(θ)/∂θ] I^{−3/2}(θ) = −2 ∂I^{−1/2}(θ)/∂θ
we get finally:
p(t|x) = [exp(−t^2/2)/√(2π)] { 1 + (1/√n) [ ( (I^{−1/2}(θ)/π(θ)) ∂π(θ)/∂θ )_{θm} t + (1/3) (∂I^{−1/2}/∂θ)_{θm} t^3 ] } + O(1/n)

Let’s now find


P(T ≤ z|x) = ∫_{−∞}^{z} p(t|x) dt

Defining
Z(x) = (1/√(2π)) e^{−x^2/2}   and   P(x) = ∫_{−∞}^{x} Z(t) dt

and considering that


∫_{−∞}^{z} Z(t) t dt = −Z(z)   and   ∫_{−∞}^{z} Z(t) t^3 dt = −Z(z) (z^2 + 2)

it is straightforward to get:


P(T ≤ z|x) = P(z) − (Z(z)/√n) [ ( (I^{−1/2}(θ)/π(θ)) ∂π(θ)/∂θ )_{θm} + ((z^2 + 2)/3) (∂I^{−1/2}/∂θ)_{θm} ] + O(1/n)

From this probability distribution, we can infer what the classical approach will get. Since inferences there are drawn from p(x|θ0), we can take a sequence of proper priors πk(θ|θ0) for k = 1, 2, . . . that induce a sequence of distributions such that

lim_{k→∞} ⟨πk(θ|θ0), p(x|θ)⟩ = p(x|θ0)

In the distributional sense, the sequence of distributions generated by

πk(θ|θ0) = (k/2) 1_{[θ0−1/k, θ0+1/k]}(θ) ;  k = 1, 2, . . .

converges to the Delta distribution δ_{θ0} and, from distributional derivatives, as k→∞,

⟨πk(θ|θ0), (d/dθ) I^{−1/2}(θ)⟩ = −⟨(d/dθ) πk(θ|θ0), I^{−1/2}(θ)⟩  −→  [∂I^{−1/2}(θ)/∂θ]_{θ0}


But θ0 = θm + O(1/√n) so, for a sequence of priors that shrink to θ0 ≃ θm,

P(T ≤ z|x) = P(z) − (Z(z)/√n) [ ((z^2 + 1)/3) (∂I^{−1/2}/∂θ)_{θm} ] + O(1/n)

For the terms of order O(1/√n) in both expressions of P(T ≤ z|x) to be the same, we need that:

[ (1/√(I(θ))) (1/π(θ)) ∂π(θ)/∂θ ]_{θm} = −[∂I^{−1/2}/∂θ]_{θm}

and therefore

π(θ) = I 1/2 (θ)

that is, Jeffrey’s prior. In the case of n-dimensional parameters, the reasoning goes
along the same lines but the expressions and the development become much more
lengthy and messy and we refer to the literature.
The procedure for a first order probability matching prior [15, 16] starts from the
likelihood

p(x1 , x2 , . . . xn |θ1 , θ2 , . . . θ p )

and then:
(1) Get the Fisher’s matrix I(θ1 , θ2 , . . . θ p ) and the inverse I −1 (θ1 , θ2 , . . . θ p );
(2) Suppose we are interested in the parameter t = t (θ1 , θ2 , . . . θ p ) a twice contin-
uous and differentiable function of the parameters. Define the column vector
∇_t = ( ∂t/∂θ_1 , ∂t/∂θ_2 , . . . , ∂t/∂θ_p )^T

(3) Define the column vector

η = I^{−1} ∇_t / (∇_t^T I^{−1} ∇_t)^{1/2}   so that   η^T I η = 1

(4) The probability matching prior for the parameter t = t (θ) in terms of θ1 , θ2 , . . . θ p
is given by the equation:

Σ_{k=1}^{p} (∂/∂θ_k) [η_k(θ) π(θ)] = 0

Any solution π(θ1 , θ2 , . . . θ p ) will do the job.


(5) Introduce t = t (θ) in this expression, say, for instance θ1 = θ1 (t, θ2 , . . . θ p ),
and the corresponding Jacobian J (t, θ2 , . . . θ p ). Then we get the prior for the
parameter t of interest and the nuisance parameters θ2 , . . . θ p that, eventually,
will be integrated out.

Example 2.15 Consider two independent random quantities X1 and X2 such that P(X_i = n_i) = Po(n_i|μ_i).
P(X i = n k ) = Po(n k |μi ).



We are interested in the parameter t = μ1/μ2 so, setting μ = μ2 , we have the ordered parameterization {t, μ}. The joint probability is

P(n1 , n2 |μ1 , μ2 ) = P(n1 |μ1 ) P(n2 |μ2 ) = e^{−(μ1 + μ2)} μ1^{n1} μ2^{n2} / [Γ(n1 + 1) Γ(n2 + 1)]

from which we get the Fisher’s matrix


   
I(μ1 , μ2 ) = ( 1/μ1  0 ;  0  1/μ2 )   and   I^{−1}(μ1 , μ2 ) = ( μ1  0 ;  0  μ2 )

We are interested in the parameter t = μ1/μ2 , a twice continuously differentiable function of the parameters, so

∇_t(μ1 , μ2 ) = ( ∂t/∂μ1 , ∂t/∂μ2 )^T = ( μ2^{−1} , −μ1 μ2^{−2} )^T

Therefore:

I^{−1} ∇_t = ( μ1 μ2^{−1} , −μ1 μ2^{−1} )^T ;   S = ∇_t^T I^{−1} ∇_t = μ1(μ1 + μ2)/μ2^3

η = I^{−1} ∇_t / (∇_t^T I^{−1} ∇_t)^{1/2} = ( (μ1 μ2)^{1/2} (μ1 + μ2)^{−1/2} , −(μ1 μ2)^{1/2} (μ1 + μ2)^{−1/2} )^T

so that η^T I η = 1. The probability matching prior for the parameter t = μ1/μ2 in terms of μ1 and μ2 is given by the equation:

Σ_{k=1}^{2} (∂/∂μ_k) [η_k(μ) π(μ)] = 0

so, if f(μ1 , μ2 ) = (μ1 μ2)^{1/2} (μ1 + μ2)^{−1/2}, we have to solve

(∂/∂μ1)[f(μ1 , μ2 ) π(μ1 , μ2 )] = (∂/∂μ2)[f(μ1 , μ2 ) π(μ1 , μ2 )]

Any solution will do, so:

π(μ1 , μ2 ) ∝ f^{−1}(μ1 , μ2 ) = (μ1 + μ2)^{1/2}/(μ1 μ2)^{1/2}

Substituting μ1 = t μ2 and including the Jacobian J = μ2 we have finally:

π(t, μ2 ) ∝ μ2^{1/2} (1 + t)^{1/2} t^{−1/2}

The posterior density will be:

p(t, μ2 |n1 , n2 ) ∝ p(n1 , n2 |t, μ2 ) π(t, μ2 ) ∝ e^{−μ2(1+t)} t^{n1−1/2} (1 + t)^{1/2} μ2^{(n+3/2)−1}

and, integrating out the nuisance parameter μ2 ∈ [0, ∞), we get, with n = n1 + n2 , the posterior density:

p(t|n1 , n2 ) = N t^{n1−1/2} (1 + t)^{−(n+1)}

with N −1 = B(n 1 + 1/2, n 2 + 1/2).
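As a numerical sanity check (our sketch), direct integration confirms that this density is properly normalized with N^{−1} = B(n1 + 1/2, n2 + 1/2), and that its mean is (n1 + 1/2)/(n2 − 1/2), the mean of the corresponding beta-prime distribution:

```python
import math

def beta_fn(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def ratio_posterior_checks(n1, n2, steps=200000, upper=200.0):
    """p(t|n1, n2) = t^(n1 - 1/2) / (B(n1+1/2, n2+1/2) (1+t)^(n1+n2+1));
    return its total mass and mean from a midpoint Riemann sum."""
    norm = beta_fn(n1 + 0.5, n2 + 0.5)
    h = upper / steps
    total = mean = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        p = t ** (n1 - 0.5) / (norm * (1.0 + t) ** (n1 + n2 + 1))
        total += p * h
        mean += t * p * h
    return total, mean

total, mean = ratio_posterior_checks(4, 6)
assert abs(total - 1.0) < 1e-3
assert abs(mean - 4.5 / 5.5) < 1e-3    # (n1 + 1/2)/(n2 - 1/2)
```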

Example 2.16 (Gamma distribution) Show that for Ga(x|α, β):

p(x|α, β) = (α^β/Γ(β)) e^{−αx} x^{β−1} 1_{(0,∞)}(x)

the probability matching prior for the ordering

• {β, α} is π(α, β) = β^{−1/2} α^{−1} [β Ψ′(β) − 1]^{1/2}
• {α, β} is π(α, β) = α^{−1} [Ψ′(β)]^{1/2} [β Ψ′(β) − 1]^{1/2}

to be compared with Jeffreys' prior π_2^J(α, β) = α^{−1} [β Ψ′(β) − 1]^{1/2} and with Jeffreys' prior when both parameters are treated individually, π_{1+1}^J(α, β) = α^{−1} [Ψ′(β)]^{1/2}

Example 2.17 (Bivariate Normal Distribution)


For the ordered parameterization {ρ, σ1 , σ2 }: the Fisher's matrix (see Example 2.12) is:

I(ρ, σ1 , σ2 ) = (1 − ρ^2)^{−1} ( (1 + ρ^2)(1 − ρ^2)^{−1}  −ρσ1^{−1}  −ρσ2^{−1} ;  −ρσ1^{−1}  (2 − ρ^2)σ1^{−2}  −ρ^2(σ1 σ2)^{−1} ;  −ρσ2^{−1}  −ρ^2(σ1 σ2)^{−1}  (2 − ρ^2)σ2^{−2} )

and the inverse:

I^{−1}(ρ, σ1 , σ2 ) = (1/2) ( 2(1 − ρ^2)^2  σ1 ρ(1 − ρ^2)  σ2 ρ(1 − ρ^2) ;  σ1 ρ(1 − ρ^2)  σ1^2  ρ^2 σ1 σ2 ;  σ2 ρ(1 − ρ^2)  ρ^2 σ1 σ2  σ2^2 )

Then

(2/ρ) (∂/∂ρ)[π (1 − ρ^2)] + (∂/∂σ1)[π σ1] + (∂/∂σ2)[π σ2] = 0

for which

π(σ1 , σ2 , ρ) = 1/(σ1 σ2 (1 − ρ^2))

is a solution.

Problem 2.6 Consider


X ∼ p(x|a, b, σ) = [sinh[σ(b − a)]/(2(b − a))] · [1/(cosh[σ(x − a)] cosh[σ(b − x)])] 1_{(−∞,∞)}(x)

where a < b ∈ R and σ ∈ (0, ∞). Show that

E[X] = (b + a)/2   and   V[X] = (b − a)^2/12 + π^2/(12σ^2)

and that, for known σ ≫ 1, the probability matching prior for a and b tends to π_pm(a, b) ∼ (b − a)^{−1/2}. Show also that, under the same limit, π_pm(θ) ∼ θ^{−1/2} for (a, b) = (−θ, θ) and (a, b) = (0, θ). Since p(x|a, b, σ) → Un(x|a, b) as σ→∞, discuss in this last case what the difference is with Example 2.4.

2.6.7 Reference Analysis

The expected amount of information (Expected Mutual Information) on the parameter θ provided by k independent observations of the model p(x|θ), relative to the prior knowledge on θ described by π(θ), is

I[e(k), π(θ)] = ∫ π(θ) dθ ∫ p(z_k|θ) log[ p(θ|z_k)/π(θ) ] d z_k

where z_k = {x_1 , . . . , x_k }. If lim_{k→∞} I[e(k), π(θ)] exists, it will quantify the maximum amount of information that we could obtain on θ from experiments described by this model relative to the prior knowledge π(θ). The central idea of the reference analysis [4, 17] is to take as reference prior for the model p(x|θ) the one that maximizes the maximum amount of information we may get, so it will be the least informative for this model. From the Calculus of Variations, if we introduce the prior π′(θ) = π(θ) + ε η(θ), with π(θ) an extremal of the expected information I[e(k), π(θ)] and η(θ) such that

∫ π(θ) dθ = ∫ π′(θ) dθ = 1  −→  ∫ η(θ) dθ = 0
π(θ)dθ = π  (θ)dθ = 1 −→ η(θ)dθ = 0
  

it is easy to see (left as an exercise) that

π(θ) ∝ exp{ ∫ p(z_k|θ) log p(θ|z_k) d z_k } ≡ f_k(θ)

This is a nice but complicated implicit equation because, on the one hand, f k (θ)
depends on π(θ) through the posterior p(θ|z k ) and, on the other hand, the limit
k→∞ is usually divergent (intuitively, the more precision we want for θ, the more
information is needed and to know the actual value from the experiment requires an
infinite amount of information). This can be circumvented regularizing the expression
as
f k (θ)
π(θ) ∝ π(θ0 ) lim
k→∞ f k (θ0 )

with θ0 any interior point of  (we are used to that in particle physics!). Let’s see
some examples.
Example 2.18 Consider again the exponential model, for which $t = n^{-1}\sum_{i=1}^{n} x_i$ is sufficient for $\theta$ and distributed as

$$p(t|\theta) = \frac{(n\theta)^n}{\Gamma(n)}\,t^{n-1}\,\exp\{-n\theta t\}$$

Taking $\pi'(\theta) = 1_{(0,\infty)}(\theta)$ we have the proper posterior

$$\pi'(\theta|t) = \frac{(nt)^{n+1}}{\Gamma(n+1)}\,\exp\{-n\theta t\}\,\theta^{n}$$

Then $\log \pi'(\theta|t) = -n\theta t + n\log\theta + (n+1)\log t + g_1(n)$ and

$$f_n(\theta) = \exp\left\{\int_{\Omega_X} p(t|\theta)\,\log \pi'(\theta|t)\,dt\right\} = \frac{g_2(n)}{\theta} \;\longrightarrow\; \pi(\theta) \propto \pi(\theta_0)\,\lim_{n\to\infty}\frac{f_n(\theta)}{f_n(\theta_0)} \propto \frac{1}{\theta}$$
Example 2.19 Prior functions depend on the particular model we are treating. To learn about a parameter, we can set up different experimental designs that respond to different models and, even though the parameter is the same, they may have different priors. For instance, we may be interested in the acceptance: the probability to accept an event under some conditions. For this, we can generate a sample of $N$ observed events and see how many ($x$) pass the conditions. This experimental design corresponds to a Binomial distribution

$$p(x|N,\theta) = \binom{N}{x}\,\theta^{x}\,(1-\theta)^{N-x}$$

with $x = \{0, 1, \ldots, N\}$. For this model, the reference prior (also Jeffreys' and PM) is $\pi(\theta) = \theta^{-1/2}(1-\theta)^{-1/2}$ and the posterior is $\theta \sim Be(\theta|x+1/2,\,N-x+1/2)$. Conversely, we can generate events until $r$ are accepted and see how many ($x$) we have generated. This experimental design corresponds to a Negative Binomial distribution

$$p(x|r,\theta) = \binom{x-1}{r-1}\,\theta^{r}\,(1-\theta)^{x-r}$$

where $x = r, r+1, \ldots$ and $r \geq 1$. For this model, the reference prior (Jeffreys' and PM too) is $\pi(\theta) = \theta^{-1}(1-\theta)^{-1/2}$ and the posterior is $\theta \sim Be(\theta|r,\,x-r+1/2)$.
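The two Beta posteriors above are straightforward to compare numerically; the following sketch uses made-up counts ($N$, $x$, $r$ are illustrative values, not data from the text):

```python
# Acceptance posteriors of Example 2.19 (illustrative counts):
# Binomial design -> Be(x+1/2, N-x+1/2); Negative Binomial design -> Be(r, x-r+1/2).
from scipy.stats import beta

# Binomial design: N events generated, x accepted
N, x = 100, 23
post_bi = beta(x + 0.5, N - x + 0.5)

# Negative Binomial design: generate until r accepted, x_nb generated in total
r, x_nb = 23, 100
post_nb = beta(r, x_nb - r + 0.5)

print(post_bi.mean(), post_nb.mean())                  # similar, but not identical
print(post_bi.interval(0.68), post_nb.interval(0.68))  # 68% central credible intervals
```

With the same raw counts, the two designs give slightly different posteriors because the reference prior depends on the sampling model, which is the point of the example.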

Problem 2.7 Consider:

(1) $X \sim Po(x|\theta) = \exp\{-\theta\}\,\dfrac{\theta^x}{\Gamma(x+1)}$ and the experiment $e(k) \xrightarrow{iid} \{x_1, x_2, \ldots, x_k\}$. Take $\pi'(\theta) = 1_{(0,\infty)}(\theta)$ and show that

$$\pi(\theta) \propto \pi(\theta_0)\,\lim_{k\to\infty}\frac{f_k(\theta)}{f_k(\theta_0)} \propto \theta^{-1/2}$$

(2) $X \sim Bi(x|N,\theta) = \binom{N}{x}\,\theta^x\,(1-\theta)^{N-x}$ and the experiment $e(k) \xrightarrow{iid} \{x_1, x_2, \ldots, x_k\}$. Take $\pi'(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}\,1_{(0,1)}(\theta)$ with $a, b > 0$ and show that

$$\pi(\theta) \propto \pi(\theta_0)\,\lim_{k\to\infty}\frac{f_k(\theta)}{f_k(\theta_0)} \propto \theta^{-1/2}(1-\theta)^{-1/2}$$

(Hint: For (1) and (2) consider the Taylor expansion of $\log\Gamma(z)$ around $E[z]$ and the asymptotic behavior of the Polygamma Function $\psi^{(n)}(z) = a_n z^{-n} + a_{n+1} z^{-(n+1)} + \cdots$.)

(3) $X \sim Un(x|0,\theta)$ and the iid sample $\{x_1, x_2, \ldots, x_k\}$. For inferences on $\theta$, show that $f_k(\theta) = \theta^{-1}\,g(k)$ and in consequence the posterior is the Pareto density $Pa(\theta|x_M, k)$, with $x_M = \max\{x_1, x_2, \ldots, x_k\}$ the sufficient statistic.
A very useful constructive theorem to obtain the reference prior is given in [18]. First, a permissible prior for the model $p(x|\theta)$ is defined as a strictly positive function $\pi(\theta)$ such that it renders a proper posterior; that is,

$$\forall x \in \Omega_X: \qquad \int_{\Theta} p(x|\theta)\,\pi(\theta)\,d\theta < \infty$$

and such that, for some approximating sequence of compact sets $\Theta_k \subset \Theta$ with $\lim_{k\to\infty}\Theta_k = \Theta$, the sequence of posteriors $p_k(\theta|x) \propto p(x|\theta)\,\pi_k(\theta)$ converges logarithmically to $p(\theta|x) \propto p(x|\theta)\,\pi(\theta)$. Then, the reference prior is just a permissible prior that maximizes
the maximum amount of information the experiment can provide for the parameter. The constructive procedure for a one-dimensional parameter consists of:

(1) Take $\pi'(\theta)$ as a continuous, strictly positive function such that the corresponding posterior

$$\pi'(\theta|z_k) = \frac{p(z_k|\theta)\,\pi'(\theta)}{\int_{\Theta} p(z_k|\theta)\,\pi'(\theta)\,d\theta}$$

is proper and asymptotically consistent. $\pi'(\theta)$ is arbitrary, so it can be chosen for convenience to simplify the integrals.

(2) Obtain

$$f_k(\theta) = \exp\left\{\int_{\Omega_X} p(z_k|\theta)\,\log \pi'(\theta|z_k)\,dz_k\right\} \qquad \text{and} \qquad h_k(\theta;\theta_0) = \frac{f_k(\theta)}{f_k(\theta_0)}$$

for any interior point $\theta_0 \in \Theta$.

(3) If
(3.1) each $f_k(\theta)$ is continuous;
(3.2) for any fixed $\theta$ and large $k$, $h_k(\theta;\theta_0)$ is either monotonic in $k$ or bounded from above by some $h(\theta)$ that is integrable on any compact set;
(3.3) $\pi(\theta) = \lim_{k\to\infty} h_k(\theta;\theta_0)$ is a permissible prior function;
then $\pi(\theta)$ is a reference prior for the model $p(x|\theta)$. It is important to note that there is no requirement on the existence of the Fisher information $I(\theta)$. If it exists, a simple Taylor expansion of the densities shows that for a one-dimensional parameter $\pi(\theta) = [I(\theta)]^{1/2}$, in consistency with Jeffreys' proposal. Usually the latter is easier to evaluate, but not always, as we shall see.
In many cases $supp(\theta)$ is unbounded and the prior $\pi(\theta)$ is not a proper density. As we have seen, this is not a problem as long as the posterior $p(\theta|z_k) \propto p(z_k|\theta)\,\pi(\theta)$ is proper although, in any case, one can proceed "more formally" considering a sequence of proper priors $\pi_m(\theta)$ defined on a sequence of compact sets $\Theta_m \subset \Theta$ such that $\lim_{m\to\infty}\Theta_m = \Theta$, and taking the limit of the corresponding sequence of posteriors $p_m(\theta|z_k) \propto p(z_k|\theta)\,\pi_m(\theta)$. Usually simple sequences such as $\Theta_m = [1/m, m]$, with $\lim_{m\to\infty}\Theta_m = (0,\infty)$, or $\Theta_m = [-m, m]$, with $\lim_{m\to\infty}\Theta_m = (-\infty,\infty)$, will suffice.
When the parameter $\theta$ is $n$-dimensional, the procedure is more laborious. First, one starts [4] by arranging the parameters in decreasing order of importance $\{\theta_1, \theta_2, \ldots, \theta_n\}$ (as we did for the Probability Matching Priors) and then follows the previous scheme to obtain the conditional prior functions

$$\pi(\theta_n|\theta_1, \theta_2, \ldots, \theta_{n-1})\;\pi(\theta_{n-1}|\theta_1, \theta_2, \ldots, \theta_{n-2})\cdots\pi(\theta_2|\theta_1)\;\pi(\theta_1)$$

For instance, in the case of two parameters and the ordered parameterization $\{\theta, \lambda\}$:
(1) Get the conditional $\pi(\lambda|\theta)$ as the reference prior for $\lambda$ keeping $\theta$ fixed;
(2) Find the marginal model

$$p(x|\theta) = \int_{\Lambda} p(x|\theta,\lambda)\,\pi(\lambda|\theta)\,d\lambda$$

(3) Get the reference prior $\pi(\theta)$ from the marginal model $p(x|\theta)$.

Then $\pi(\theta,\lambda) \propto \pi(\lambda|\theta)\,\pi(\theta)$. This is fine if $\pi(\lambda|\theta)$ and $\pi(\theta)$ are proper functions, which is seldom the case. Otherwise one has to define the appropriate sequence of compact sets observing, among other things, that this has to be done for the full parameter space and that usually the limits depend on the parameters. Suppose that we have the sequence $\Theta_i \times \Lambda_i \xrightarrow{i\to\infty} \Theta \times \Lambda$. Then:
(1) Obtain $\pi_i(\lambda|\theta)$:

$$\pi'(\lambda|\theta)\,1_{\Lambda_i}(\lambda) \;\longrightarrow\; \pi'_i(\lambda|\theta, z_k) = \frac{p(z_k|\theta,\lambda)\,\pi'_i(\lambda|\theta)}{\int_{\Lambda_i} p(z_k|\theta,\lambda)\,\pi'_i(\lambda|\theta)\,d\lambda} \;\longrightarrow\; \pi_i(\lambda|\theta) = \lim_{k\to\infty}\frac{f'_k(\lambda|\Lambda_i,\theta,\ldots)}{f'_k(\lambda_0|\Lambda_i,\theta,\ldots)}$$

(2) Get the marginal density $p_i(x|\theta)$:

$$p_i(x|\theta) = \int_{\Lambda_i} p(x|\theta,\lambda)\,\pi_i(\lambda|\theta)\,d\lambda$$

(3) Determine $\pi_i(\theta)$:

$$\pi'_i(\theta)\,1_{\Theta_i}(\theta) \;\longrightarrow\; \pi'_i(\theta|z_k) = \frac{p_i(z_k|\theta)\,\pi'_i(\theta)}{\int_{\Theta_i} p_i(z_k|\theta)\,\pi'_i(\theta)\,d\theta} \;\longrightarrow\; \pi_i(\theta) = \lim_{k\to\infty}\frac{f'_k(\theta|\Theta_i,\Lambda_i,\ldots)}{f'_k(\theta_0|\Theta_i,\Lambda_i,\ldots)}$$

(4) The reference prior for the ordered parameterization $\{\theta, \lambda\}$ will be:

$$\pi(\theta,\lambda) = \lim_{i\to\infty}\frac{\pi_i(\lambda|\theta)\,\pi_i(\theta)}{\pi_i(\lambda_0|\theta_0)\,\pi_i(\theta_0)}$$
In the case of two parameters, if $\Lambda$ is independent of $\theta$, the Fisher matrix usually exists and, if $I(\theta,\lambda)$ and $S(\theta,\lambda) = I^{-1}(\theta,\lambda)$ are such that

$$I_{22}(\theta,\lambda) = a_1^2(\theta)\,b_1^2(\lambda) \qquad \text{and} \qquad S_{11}(\theta,\lambda) = a_0^{-2}(\theta)\,b_0^{-2}(\lambda)$$

then [19] $\pi(\theta,\lambda) = \pi(\lambda|\theta)\,\pi(\theta) = a_0(\theta)\,b_1(\lambda)$ is a permissible prior even if the conditional reference priors are not proper. The reference priors are usually probability matching priors.
Example 2.20 A simple example is the Multinomial distribution $X \sim Mn(x|\theta)$, with $\dim X = k+1$ and probability

$$p(x|\theta) \propto \theta_1^{x_1}\,\theta_2^{x_2}\cdots\theta_k^{x_k}\,(1-\delta_k)^{x_{k+1}}; \qquad \delta_k = \sum_{j=1}^{k}\theta_j$$
Consider the ordered parameterization $\{\theta_1, \theta_2, \ldots, \theta_k\}$. Then

$$\pi(\theta_1, \theta_2, \ldots, \theta_k) = \pi(\theta_k|\theta_{k-1}, \theta_{k-2}, \ldots, \theta_1)\;\pi(\theta_{k-1}|\theta_{k-2}, \ldots, \theta_1)\cdots\pi(\theta_2|\theta_1)\;\pi(\theta_1)$$

In this case, all the conditional densities are proper,

$$\pi(\theta_m|\theta_{m-1}, \ldots, \theta_1) \propto \theta_m^{-1/2}\,(1-\delta_m)^{-1/2}$$

and therefore

$$\pi(\theta_1, \theta_2, \ldots, \theta_k) \propto \prod_{i=1}^{k}\theta_i^{-1/2}\,(1-\delta_i)^{-1/2}$$

The posterior density will then be

$$p(\theta|x) \propto \left[\prod_{i=1}^{k}\theta_i^{x_i-1/2}\,(1-\delta_i)^{-1/2}\right](1-\delta_k)^{x_{k+1}}$$
Example 2.21 Consider again the case of two independent Poisson distributed random quantities $X_1$ and $X_2$ with joint density

$$P(n_1, n_2|\mu_1, \mu_2) = P(n_1|\mu_1)\,P(n_2|\mu_2) = e^{-(\mu_1+\mu_2)}\,\frac{\mu_1^{n_1}\,\mu_2^{n_2}}{\Gamma(n_1+1)\,\Gamma(n_2+1)}$$

We are interested in the parameter $\theta = \mu_1/\mu_2$, so setting $\mu = \mu_2$ we have the ordered parameterization $\{\theta, \mu\}$ and

$$P(n_1, n_2|\theta, \mu) = e^{-\mu(1+\theta)}\,\frac{\theta^{n_1}\,\mu^{n}}{\Gamma(n_1+1)\,\Gamma(n_2+1)}$$

where $n = n_1 + n_2$. Since $E[X_1] = \mu_1 = \theta\mu$ and $E[X_2] = \mu_2 = \mu$, the Fisher matrix and its inverse will be

$$I = \begin{pmatrix} \mu/\theta & 1 \\ 1 & (1+\theta)/\mu \end{pmatrix}; \qquad \det(I) = \theta^{-1} \qquad \text{and} \qquad S = I^{-1} = \begin{pmatrix} \theta(1+\theta)/\mu & -\theta \\ -\theta & \mu \end{pmatrix}$$

Therefore

$$S_{11} = \theta(1+\theta)/\mu \qquad \text{and} \qquad I_{22} = (1+\theta)/\mu$$
and, in consequence,

$$\pi(\theta)\,f_1(\mu) \propto S_{11}^{-1/2} = \frac{\sqrt{\mu}}{\sqrt{\theta(1+\theta)}} \qquad\qquad \pi(\mu|\theta)\,f_2(\theta) \propto I_{22}^{1/2} = \frac{\sqrt{1+\theta}}{\sqrt{\mu}}$$

Thus, we have for the ordered parameterization $\{\theta, \mu\}$ the reference prior

$$\pi(\theta, \mu) = \pi(\mu|\theta)\,\pi(\theta) \propto \frac{1}{\sqrt{\mu\,\theta\,(1+\theta)}}$$
and the posterior density will be

$$p(\theta, \mu|n_1, n_2) \propto \exp\{-\mu(1+\theta)\}\;\theta^{n_1-1/2}\,(1+\theta)^{-1/2}\,\mu^{n-1/2}$$

and, integrating out the nuisance parameter $\mu \in [0,\infty)$, we finally get

$$p(\theta|n_1, n_2) = N\,\frac{\theta^{n_1-1/2}}{(1+\theta)^{n+1}}$$

with $\theta = \mu_1/\mu_2$, $n = n_1 + n_2$ and $N^{-1} = B(n_1+1/2,\,n_2+1/2)$. The distribution function will be

$$P(\theta|n_1, n_2) = \int_0^{\theta} p(\theta'|n_1, n_2)\,d\theta' = I\big(\theta/(1+\theta);\; n_1+1/2,\; n_2+1/2\big)$$

with $I(x; a, b)$ the Incomplete Beta Function, and the moments, when they exist,

$$E[\theta^m] = \frac{\Gamma(n_1+1/2+m)\,\Gamma(n_2+1/2-m)}{\Gamma(n_1+1/2)\,\Gamma(n_2+1/2)}$$
It is interesting to look at the problem from a different point of view. Consider again the ordered parameterization $\{\theta, \lambda\}$ with $\theta = \mu_1/\mu_2$, but now with the nuisance parameter $\lambda = \mu_1 + \mu_2$. The likelihood will be

$$P(n_1, n_2|\theta, \lambda) = \frac{1}{\Gamma(n_1+1)\,\Gamma(n_2+1)}\;e^{-\lambda}\,\lambda^{n}\;\frac{\theta^{n_1}}{(1+\theta)^{n}}$$

The domains $\Theta = (0,\infty)$ and $\Lambda = (0,\infty)$ are independent. Thus, there is no need to specify the prior for $\lambda$ since

$$p(\theta|n_1, n_2) \propto \pi(\theta)\,\frac{\theta^{n_1}}{(1+\theta)^{n}}\int_{\Lambda} e^{-\lambda}\,\lambda^{n}\,\pi(\lambda)\,d\lambda \;\propto\; \pi(\theta)\,\frac{\theta^{n_1}}{(1+\theta)^{n}}$$

In this case we have that

$$I(\theta) \propto \frac{1}{\theta\,(1+\theta)^2} \;\longrightarrow\; \pi(\theta) = \frac{1}{\theta^{1/2}\,(1+\theta)}$$

and, in consequence,

$$p(\theta|n_1, n_2) = N\,\frac{\theta^{n_1-1/2}}{(1+\theta)^{n+1}}$$
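The closed form of the distribution function makes interval and point summaries for $\theta = \mu_1/\mu_2$ a one-line computation; the counts $n_1$, $n_2$ in the sketch below are made-up values:

```python
# Posterior P(theta|n1,n2) = I(theta/(1+theta); n1+1/2, n2+1/2) for the
# Poisson ratio theta = mu1/mu2 of Example 2.21 (n1, n2 are illustrative counts).
from scipy.special import betainc, betaincinv

n1, n2 = 15, 10  # observed counts (made-up)

def cdf(theta):
    # regularized incomplete beta function evaluated at theta/(1+theta)
    return betainc(n1 + 0.5, n2 + 0.5, theta/(1.0 + theta))

# posterior median of theta: invert through x = theta/(1+theta)
x_med = betaincinv(n1 + 0.5, n2 + 0.5, 0.5)
theta_med = x_med/(1.0 - x_med)

print(theta_med)        # close to n1/n2 for moderate counts
print(cdf(theta_med))   # ~0.5 by construction
```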

Problem 2.8 Show that the reference prior for the Pareto distribution $Pa(x|\theta, x_0)$ (see Example 2.9) is $\pi(\theta, x_0) \propto (\theta x_0)^{-1}$ and that, for an iid sample $x = \{x_1, \ldots, x_n\}$, if $x_m = \min\{x_i\}_{i=1}^{n}$ and $a = \sum_{i=1}^{n}\ln(x_i/x_m)$, the posterior

$$p(\theta, x_0|x) = \frac{n\,a^{n-1}}{x_m\,\Gamma(n-1)}\;e^{-a\theta}\,\theta^{n-1}\left(\frac{x_0}{x_m}\right)^{n\theta-1} 1_{(0,\infty)}(\theta)\,1_{(0,x_m)}(x_0)$$

is proper for a sample size $n > 1$. Obtain the marginal densities

$$p(\theta|x) = \frac{a^{n-1}}{\Gamma(n-1)}\;e^{-a\theta}\,\theta^{n-2}\,1_{(0,\infty)}(\theta) \qquad \text{and}$$

$$p(x_0|x) = \frac{n(n-1)}{a}\;x_0^{-1}\left[1 + \frac{n}{a}\ln\frac{x_m}{x_0}\right]^{-n} 1_{(0,x_m)}(x_0)$$

and show that for large $n$ (see Sect. 2.10.2) $E[\theta] \simeq n\,a^{-1}$ and $E[x_0] \simeq x_m$.

Problem 2.9 Show that for the shifted Pareto distribution (Lomax distribution)

$$p(x|\theta, x_0) = \frac{\theta}{x_0}\left(\frac{x_0}{x+x_0}\right)^{\theta+1} 1_{(0,\infty)}(x); \qquad \theta, x_0 \in \mathbb{R}^{+}$$

the reference prior for the ordered parameterization $\{\theta, x_0\}$ is $\pi_r(\theta, x_0) \propto (x_0\,\theta\,(\theta+1))^{-1}$ and for $\{x_0, \theta\}$ it is $\pi_r(x_0, \theta) \propto (x_0\,\theta)^{-1}$. Show that the first one is a first order probability matching prior while the second is not. In fact, show that for $\{x_0, \theta\}$, $\pi_{pm}(x_0, \theta) \propto (x_0\,\theta^{3/2}\sqrt{\theta+2})^{-1}$ is a matching prior and that for both orderings the Jeffreys prior is $\pi_J(\theta, x_0) \propto (x_0\,(\theta+1)\sqrt{\theta(\theta+2)})^{-1}$.

Problem 2.10 Show that for the Weibull distribution

$$p(x|\alpha, \beta) = \alpha\beta\,x^{\beta-1}\,\exp\{-\alpha x^{\beta}\}\,1_{(0,\infty)}(x)$$

with $\alpha, \beta > 0$, the reference prior functions are

$$\pi_r(\beta, \alpha) = (\alpha\beta)^{-1} \qquad \text{and} \qquad \pi_r(\alpha, \beta) = \left[\alpha\beta\sqrt{\zeta(2) + (\psi(2) - \ln\alpha)^2}\right]^{-1}$$

for the ordered parameterizations $\{\beta, \alpha\}$ and $\{\alpha, \beta\}$ respectively, where $\zeta(2) = \pi^2/6$ is the Riemann Zeta Function and $\psi(2) = 1 - \gamma$ the Digamma Function.

2.7 Hierarchical Structures

In many circumstances, even though the experimental observations respond to the same phenomenon, it is not always possible to consider the full set of observations as an exchangeable sequence; rather, exchangeability holds within subgroups of observations. As stated earlier, this may be the case when the results come from different experiments or when, within the same experiment, data taking conditions (acceptances, efficiencies, ...) change from run to run. A similar situation holds, for instance, for the responses to a drug in trials performed at different hospitals, when the underlying conditions of the population vary between zones, countries, ... In general, we shall have different groups of observations

$$x_1 = \{x_{11}, x_{21}, \ldots, x_{n_1 1}\}$$
$$\vdots$$
$$x_j = \{x_{1j}, x_{2j}, \ldots, x_{n_j j}\}$$
$$\vdots$$
$$x_J = \{x_{1J}, x_{2J}, \ldots, x_{n_J J}\}$$

from $J$ experiments $e_1(n_1), e_2(n_2), \ldots, e_J(n_J)$. Within each sample $x_j$ we can consider that exchangeability holds, and likewise for the sets of observations $\{x_1, x_2, \ldots, x_J\}$. In this case, it is appropriate to consider hierarchical structures.
Let's suppose that for each experiment $e_j(n_j)$ the observations are drawn from the model

$$p(x_j|\theta_j); \qquad j = 1, 2, \ldots, J$$

Since the experiments are independent, we assume that the parameters of the sequence $\{\theta_1, \theta_2, \ldots, \theta_J\}$ are exchangeable and that, although different, they have a common origin since they respond to the same phenomenon. Thus, we can set

$$p(\theta_1, \theta_2, \ldots, \theta_J|\phi) = \prod_{i=1}^{J} p(\theta_i|\phi)$$
Fig. 2.3 Structure of the hierarchical model: the hyperparameter $\phi$ at the top level, the parameters $\theta_1, \theta_2, \ldots, \theta_m$ below it, and the observations $x_1, x_2, \ldots, x_m$ at the bottom.

with $\phi$ the hyperparameters, for which we take a prior $\pi(\phi)$. Then we have the structure (Fig. 2.3)

$$p(x_1, \ldots, x_J, \theta_1, \ldots, \theta_J, \phi) = \pi(\phi)\prod_{i=1}^{J} p(x_i|\theta_i)\,\pi(\theta_i|\phi)$$

This structure can be repeated sequentially if we consider it appropriate to assign a prior $\pi(\phi|\tau)$ to the hyperparameters $\phi$, so that

$$p(x, \theta, \phi, \tau) = p(x|\theta)\,\pi(\theta|\phi)\,\pi(\phi|\tau)\,\pi(\tau)$$

Now, consider the model $p(x, \theta, \phi)$. We may be interested in $\theta$, in the hyperparameters $\phi$, or in both. In general we shall need the conditional densities:

• $p(\phi|x) \propto p(\phi)\int p(x|\theta)\,p(\theta|\phi)\,d\theta$

• $p(\theta|x, \phi) = \dfrac{p(\theta, x, \phi)}{p(x, \phi)}$

• $p(\theta|x) = \dfrac{p(x|\theta)}{p(x)}\,p(\theta) = \dfrac{p(x|\theta)}{p(x)}\displaystyle\int p(\theta|\phi)\,p(\phi)\,d\phi$

The last one can be expressed as

$$p(\theta|x) = \int \frac{p(x, \theta, \phi)}{p(x)}\,d\phi = \int p(\theta|x, \phi)\,p(\phi|x)\,d\phi$$

and, since

$$\frac{p(\theta, \phi)}{p(x)} = p(\theta|\phi)\,\frac{p(\phi)}{p(x)} = p(\theta|\phi)\,\frac{p(\phi|x)}{p(x|\phi)}$$

we can finally write

$$p(\theta|x) = \int p(x|\theta)\,p(\theta|\phi)\,\frac{p(\phi|x)}{p(x|\phi)}\,d\phi$$

In general, these conditional densities have complicated expressions and we shall use Monte Carlo methods to proceed (see Gibbs Sampling, Example 3.15, in Chap. 3). It is important to note that if the prior distributions are not proper, we can have improper marginal and posterior densities, which obviously have no meaning in the inferential process. Usually, conditional densities are better behaved but, in any case, we have to check that this is so. In general, the better behaved the likelihood, the wilder the behavior we can accept for the prior functions. We can also use prior distributions that are a mixture of proper distributions,

$$p(\theta|\phi) = \sum_{i} w_i\,p_i(\theta|\phi)$$

with $w_i \geq 0$ and $\sum_i w_i = 1$, so that the combination is convex and we ensure that it is a proper density, or, extending this to a continuous mixture,

$$p(\theta|\phi) = \int w(\sigma)\,p(\theta|\phi, \sigma)\,d\sigma$$
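The Monte Carlo treatment mentioned above can be sketched for the simplest case. The following is a minimal Gibbs sampler, assuming a Normal-Normal hierarchy with known variances and a flat prior on the hyperparameter (the model, variances and simulated data are assumptions chosen for the illustration, not the book's example):

```python
# Minimal Gibbs sampler for a Normal-Normal hierarchy (illustrative sketch):
#   x_ij ~ N(theta_j, sigma^2),  theta_j ~ N(phi, tau^2),  pi(phi) ~ const.
# Both full conditionals are Normal, so the sampler alternates exact draws.
import numpy as np

rng = np.random.default_rng(1)
sigma, tau, J, n = 1.0, 2.0, 5, 40
true_theta = rng.normal(0.0, tau, size=J)
data = [rng.normal(t, sigma, size=n) for t in true_theta]   # simulated groups

phi, thetas = 0.0, np.zeros(J)
phi_trace = []
for it in range(3000):
    # theta_j | x_j, phi ~ N((n*xbar_j/sigma^2 + phi/tau^2)/w, 1/w), w = n/sigma^2 + 1/tau^2
    w = n/sigma**2 + 1.0/tau**2
    for j in range(J):
        m = (n*data[j].mean()/sigma**2 + phi/tau**2)/w
        thetas[j] = rng.normal(m, 1.0/np.sqrt(w))
    # phi | thetas ~ N(mean(thetas), tau^2/J), from the flat prior on phi
    phi = rng.normal(thetas.mean(), tau/np.sqrt(J))
    if it >= 500:                        # drop burn-in
        phi_trace.append(phi)

print(np.mean(phi_trace))                # MC estimate of the posterior mean of phi
```

With a flat prior and equal group sizes, the posterior mean of $\phi$ coincides with the average of the group means, which the chain recovers up to Monte Carlo error.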

2.8 Priors for Discrete Parameters

So far we have discussed parameters with continuous support, but in some cases the support is either finite or countable. If the parameter of interest can take only a finite set of $n$ possible values, the reasonable option for an uninformative prior is a Discrete Uniform Probability $P(\theta = \theta_i) = 1/n$. In fact, it is shown in Sect. 4.2 that maximizing the expected information provided by the experiment with the normalization constraint (i.e. the probability distribution for which the prior knowledge is minimal) leads to $P(\theta = \theta_i) = 1/n$, in accordance with the Principle of Insufficient Reason.

Even though finite discrete parameter spaces are either the most usual case we shall have to deal with or, at least, a sufficiently good approximation to the real situation, it may happen that a non-informative prior is not the most appropriate (see Example 2.22). On the other hand, if the parameter takes values on a countable set, the problem is more involved. A possible way out is to devise a hierarchical structure in which we assign the discrete parameter $\theta$ a prior $\pi(\theta|\lambda)$, with $\lambda$ a set of continuous hyperparameters. Then, since

$$p(x, \lambda) = \left[\sum_{\theta\in\Theta} p(x|\theta)\,\pi(\theta|\lambda)\right]\pi(\lambda) = p(x|\lambda)\,\pi(\lambda)$$

we get the prior $\pi(\lambda)$ by any of the previous procedures for continuous parameters with the model $p(x|\lambda)$, and obtain

$$\pi(\theta) \propto \int \pi(\theta|\lambda)\,\pi(\lambda)\,d\lambda$$

Different procedures are presented and discussed in [20].


Example 2.22 The absolute value of the electric charge ($Z$) of a particle is to be determined from the number of photons observed by a Cherenkov Counter. We know from test beam studies and Monte Carlo simulations that the number of observed photons $n_\gamma$ produced by a particle of charge $Z$ is well described by a Poisson distribution with parameter $\mu = n_0 Z^2$; that is,

$$P(n_\gamma|n_0, Z) = e^{-n_0 Z^2}\,\frac{(n_0 Z^2)^{n_\gamma}}{\Gamma(n_\gamma+1)}$$

so $E[n_\gamma|Z=1] = n_0$. First, by physics considerations, $Z$ has a finite support $\Omega_Z = \{1, 2, \ldots, n\}$. Second, we know a priori that not all incoming nuclei are equally likely, so a non-informative prior may not be the best choice. In any case, a discrete uniform prior will give the posterior

$$P(Z=k|n_\gamma, n_0, n) = \frac{e^{-n_0 k^2}\,k^{2 n_\gamma}}{\sum_{k'=1}^{n} e^{-n_0 k'^2}\,k'^{2 n_\gamma}}$$
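This discrete posterior can be evaluated directly; the sketch below uses made-up values for $n_0$ and the observed $n_\gamma$ (roughly what a $Z = 2$ nucleus would yield):

```python
# Discrete posterior for the particle charge |Z| of Example 2.22 under a
# uniform prior on {1,...,n}; n0 and n_gamma below are illustrative values.
import numpy as np

def charge_posterior(n_gamma, n0, n):
    k = np.arange(1, n + 1)
    # log-weights for numerical stability: log P ~ -n0 k^2 + 2 n_gamma log k
    logw = -n0*k**2 + 2.0*n_gamma*np.log(k)
    w = np.exp(logw - logw.max())
    return w/w.sum()

post = charge_posterior(n_gamma=180, n0=50, n=6)
print(post)                 # probability mass concentrated around k = 2
print(post.argmax() + 1)    # -> 2
```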

2.9 Constraints on Parameters and Priors

Consider a parametric model $p(x|\theta)$ and the prior $\pi_0(\theta)$. Now suppose we have some information on the parameters that we want to include in the prior. Typically we shall have, say, $k$ constraints of the form

$$\int_{\Theta} g_i(\theta)\,\pi(\theta)\,d\theta = a_i; \qquad i = 1, \ldots, k$$

Then, we have to find the prior $\pi(\theta)$ for which $\pi_0(\theta)$ is the best approximation, in the Kullback-Leibler sense, including the constraints with the corresponding Lagrange multipliers $\lambda_i$; that is, the extremal of

$$F = \int_{\Theta} \pi(\theta)\,\log\frac{\pi(\theta)}{\pi_0(\theta)}\,d\theta + \sum_{i=1}^{k}\lambda_i\left[\int_{\Theta} g_i(\theta)\,\pi(\theta)\,d\theta - a_i\right]$$

Again, it is left as an exercise to show from Calculus of Variations that we have the well-known solution

$$\pi(\theta) \propto \pi_0(\theta)\,\exp\left\{\sum_{i=1}^{k}\lambda_i\,g_i(\theta)\right\} \qquad \text{where the } \lambda_i \text{ are fixed by } \int_{\Theta} g_i(\theta)\,\pi(\theta)\,d\theta = a_i$$
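For a single mean constraint, the exponential tilting above is easy to solve numerically. The following sketch uses arbitrary assumptions (a uniform $\pi_0$ on $(0,1)$ and target mean $a = 0.7$) and finds the multiplier $\lambda$ by root-finding:

```python
# Exponential tilting pi(t) ∝ pi0(t) exp(lambda*t) with the single constraint
# E[t] = a; pi0 = Un(0,1) and a = 0.7 are illustrative choices.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

a = 0.7  # required mean under the tilted prior

def tilted_mean(lam):
    norm, _ = quad(lambda t: np.exp(lam*t), 0.0, 1.0)
    mean, _ = quad(lambda t: t*np.exp(lam*t), 0.0, 1.0)
    return mean/norm

lam = brentq(lambda l: tilted_mean(l) - a, -50.0, 50.0)
print(lam, tilted_mean(lam))   # lambda > 0 pulls the mass towards t = 1
```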
Quite frequently we are forced to include constraints on the support of the parameters: some are non-negative (masses, energies, momenta, life-times, ...), some are bounded in $(0,1)$ ($\beta = v/c$, efficiencies, acceptances, ...). At least from a formal point of view, to account for constraints on the support is a trivial problem. Consider the model $p(x|\theta)$ with $\theta \in \Theta_0$ and a reference prior $\pi_0(\theta)$. Then, our inferences on $\theta$ shall be based on the posterior

$$p(\theta|x) = \frac{p(x|\theta)\,\pi_0(\theta)}{\int_{\Theta_0} p(x|\theta)\,\pi_0(\theta)\,d\theta}$$

Now, if we require that $\theta \in \Theta \subset \Theta_0$, we define

$$g_1(\theta) = 1_{\Theta}(\theta) \;\longrightarrow\; \int_{\Theta_0} g_1(\theta)\,\pi(\theta)\,d\theta = \int_{\Theta} \pi(\theta)\,d\theta = 1 - \epsilon$$

$$g_2(\theta) = 1_{\Theta^c}(\theta) \;\longrightarrow\; \int_{\Theta_0} g_2(\theta)\,\pi(\theta)\,d\theta = \int_{\Theta^c} \pi(\theta)\,d\theta = \epsilon$$

and in the limit $\epsilon \to 0$ we have the restricted reference prior

$$\pi(\theta) = \frac{\pi_0(\theta)}{\int_{\Theta}\pi_0(\theta)\,d\theta}\,1_{\Theta}(\theta)$$

as we would obviously have expected. Therefore

$$p(\theta|x, \theta \in \Theta) = \frac{p(x|\theta)\,\pi(\theta)}{\int_{\Theta} p(x|\theta)\,\pi(\theta)\,d\theta} = \frac{p(x|\theta)\,\pi_0(\theta)}{\int_{\Theta} p(x|\theta)\,\pi_0(\theta)\,d\theta}\,1_{\Theta}(\theta)$$

that is, the same initial expression but normalized in the domain of interest $\Theta$.
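In practice this renormalization is a one-liner. As a sketch (all numbers made up), take a Gaussian "measurement" of a non-negative quantity, with the central value slightly below the physical boundary:

```python
# Restricting a posterior to theta >= 0 by renormalization (illustrative:
# a Gaussian measurement m +- s of a non-negative parameter, flat prior).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

m, s = -0.5, 1.0                           # made-up measurement near the boundary
Z = 1.0 - norm.cdf(0.0, loc=m, scale=s)    # P(theta >= 0): normalization on the allowed region

def restricted_posterior(theta):
    # p(theta|x, theta >= 0) = N(theta|m, s)/Z on [0, inf), zero elsewhere
    return norm.pdf(theta, loc=m, scale=s)/Z if theta >= 0.0 else 0.0

total, _ = quad(restricted_posterior, 0.0, np.inf)
print(total)   # ~1: properly normalized
print(quad(lambda t: t*restricted_posterior(t), 0.0, np.inf)[0])  # posterior mean, > 0
```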

2.10 Decision Problems

Even though all the information we have on the parameters of relevance is contained in the posterior density, it is interesting, as we saw in Chap. 1, to make explicit some particular values that characterize the probability distribution. This certainly entails a considerable and unnecessary reduction of the available information but, in the end, quoting Lord Kelvin, "... when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind". In statistics, to specify a particular value of the parameter is termed Point Estimation and can be formulated in the framework of Decision Theory.

In general, Decision Theory studies how to choose the optimal action among several possible alternatives based on what has been experimentally observed. Given a particular problem, we have to make explicit the set $\Omega_\theta$ of the possible "states of nature", the set $\Omega_X$ of the possible experimental outcomes and the set $\Omega_A$ of the possible actions we can take. Imagine, for instance, that we do a test on an individual suspected to have some disease for which the medical treatment has potentially dangerous collateral effects. Then we have:

$$\Omega_\theta = \{\text{healthy},\ \text{sick}\}$$
$$\Omega_X = \{\text{test positive},\ \text{test negative}\}$$
$$\Omega_A = \{\text{apply treatment},\ \text{do not apply treatment}\}$$

Or consider, for instance, a detector that provides within some accuracy the momentum ($p$) and the velocity ($\beta$) of charged particles. If we want to assign an hypothesis for the mass of the particle, we have that $\Omega_\theta = \mathbb{R}^{+}$ is the set of all possible states of nature (all possible values of the mass), $\Omega_X$ the set of experimental observations (the momentum and the velocity) and $\Omega_A$ the set of all possible actions we can take (assign one or other value for the mass). In this case, we shall take a decision based on the probability density $p(m|p, \beta)$.
Obviously, unless we are in a state of absolute certainty, we cannot take an action without potential losses. Based on the observed experimental outcomes, we can for instance assign the particle a mass $m_1$ when the true state of nature is $m_2 \neq m_1$, or consider that the individual is healthy when he is actually sick. Thus, the first element of Decision Theory is the Loss Function:

$$l(a, \theta): \; (\theta, a) \in \Omega_\theta \times \Omega_A \longrightarrow \mathbb{R}^{+}\cup\{0\}$$

This is a non-negative function, defined for all $\theta \in \Omega_\theta$ and all possible actions $a \in \Omega_A$, that quantifies the loss associated to taking the action $a$ (deciding for $a$) when the state of nature is $\theta$.

Obviously, we do not have a perfect knowledge of the state of nature; what we know comes from the observed data $x$ and is contained in the posterior distribution $p(\theta|x)$. Therefore, we define the Risk Function (the risk associated to taking the action $a$, or deciding for $a$, when we have observed the data $x$) as the expected value of the Loss Function:

$$R(a|x) = E_\theta[l(a, \theta)] = \int l(a, \theta)\,p(\theta|x)\,d\theta$$

Sound enough, the Bayesian decision criterion consists in taking the action $a(x)$ (Bayesian action) that minimizes the risk $R(a|x)$ (minimum risk); that is, the one that minimizes the expected loss under the posterior density function.⁸ Then, we shall encounter two kinds of problems:

• inferential problems, where $\Omega_A = \mathbb{R}$ and $a(x)$ is a statistic that we shall take as estimator of the parameter $\theta$;
• decision problems (or hypothesis testing), where $\Omega_A = \{\text{accept}, \text{reject}\}$ or where we choose one among a set of hypotheses.

Obviously, the actions depend on the loss function (that we have to specify) and on the posterior density and, therefore, on the data through the model $p(x|\theta)$ and the prior function $\pi(\theta)$. It is then possible that, for a particular model, two different loss functions drive to the same decision, or that the same loss function, depending on the prior, leads to different actions.

⁸ The problems studied by Decision Theory can be addressed from the point of view of Game Theory. In this case, instead of Loss Functions one works with Utility Functions $u(\theta, a)$ that, in essence, are nothing else but $u(\theta, a) = K - l(\theta, a) \geq 0$; it is just a matter of personal optimism to

2.10.1 Hypothesis Testing

Consider the case where we have to choose between two exclusive and exhaustive hypotheses $H_1$ and $H_2$ ($=H_1^c$). From the data sample and our prior beliefs we have the posterior probabilities

$$P(H_i|data) = \frac{P(data|H_i)\,P(H_i)}{P(data)}; \qquad i = 1, 2$$

and the actions to be taken are then:

$a_1$: action to take if we decide upon $H_1$
$a_2$: action to take if we decide upon $H_2$

Then, we define the loss function $l(a_i, H_j)$; $i, j = 1, 2$ as:

• $l_{11} = l_{22} = 0$ if we make the correct choice; that is, if we take action $a_1$ when the state of nature is $H_1$, or $a_2$ when it is $H_2$;
• $l_{12} > 0$ if we take action $a_1$ (decide upon $H_1$) when the state of nature is $H_2$;
• $l_{21} > 0$ if we take action $a_2$ (decide upon $H_2$) when the state of nature is $H_1$.
(Footnote 8 continued)
work with "utilities" or "losses". J. Von Neumann and O. Morgenstern introduced in 1944 the idea of expected utility and the criterion to take as optimal action that which maximizes the expected utility.
so the risk function will be

$$R(a_i|data) = \sum_{j=1}^{2} l(a_i|H_j)\,P(H_j|data)$$

that is:

$$R(a_1|data) = l_{11}\,P(H_1|data) + l_{12}\,P(H_2|data)$$
$$R(a_2|data) = l_{21}\,P(H_1|data) + l_{22}\,P(H_2|data)$$

and, according to the minimum Bayesian risk, we shall choose the hypothesis $H_1$ (action $a_1$) if

$$R(a_1|data) < R(a_2|data) \;\longrightarrow\; P(H_1|data)\,(l_{11} - l_{21}) < P(H_2|data)\,(l_{22} - l_{12})$$

Since we have chosen $l_{11} = l_{22} = 0$ in this particular case, we shall take action $a_1$ (decide for hypothesis $H_1$) if

$$\frac{P(H_1|data)}{P(H_2|data)} > \frac{l_{12}}{l_{21}}$$

or action $a_2$ (decide in favor of hypothesis $H_2$) if

$$R(a_2|data) < R(a_1|data) \;\longrightarrow\; \frac{P(H_2|data)}{P(H_1|data)} > \frac{l_{21}}{l_{12}}$$

that is, we take action $a_i$ ($i = 1, 2$) if

$$\frac{P(H_i|data)}{P(H_j|data)} = \left[\frac{P(data|H_i)}{P(data|H_j)}\right]\left[\frac{P(H_i)}{P(H_j)}\right] > \frac{l_{ij}}{l_{ji}}$$

The ratio of likelihoods

$$B_{ij} = \frac{P(data|H_i)}{P(data|H_j)}$$

is called the Bayes Factor $B_{ij}$ and changes our prior beliefs on the two alternative hypotheses based on the evidence we have from the data; that is, it quantifies how strongly the data favor one model over the other. Thus, we shall decide in favor of hypothesis $H_i$ against $H_j$ ($i, j = 1, 2$) if

$$\frac{P(H_i|data)}{P(H_j|data)} > \frac{l_{ij}}{l_{ji}} \;\longrightarrow\; B_{ij} > \frac{P(H_j)}{P(H_i)}\,\frac{l_{ij}}{l_{ji}}$$
If we consider the same loss whichever wrong hypothesis we decide upon, we have $l_{12} = l_{21}$ (Zero-One Loss Function). In general, we shall be interested in testing:

(1) Two simple hypotheses, $H_1$ versus $H_2$, for which the models $M_i = \{X \sim p_i(x|\theta_i)\}$; $i = 1, 2$ are fully specified, including the values of the parameters (that is, $\Theta_i = \{\theta_i\}$). In this case, the Bayes Factor will be given by the ratio of likelihoods

$$B_{12} = \frac{p_1(x|\theta_1)}{p_2(x|\theta_2)} \qquad \left(\text{usually } \frac{p(x|\theta_1)}{p(x|\theta_2)}\right)$$

The classical Bayes Factor is the ratio of the likelihoods for the two competing models evaluated at their respective maxima.

(2) A simple hypothesis ($H_1$) versus a composite hypothesis $H_2$, for which the parameters of the model $M_2 = \{X \sim p_2(x|\theta_2)\}$ have support on $\Theta_2$. Then we have to average the likelihood under $H_2$ and

$$B_{12} = \frac{p_1(x|\theta_1)}{\int_{\Theta_2} p_2(x|\theta)\,\pi_2(\theta)\,d\theta}$$

(3) Two composite hypotheses, in which the models $M_1$ and $M_2$ have parameters that are not specified by the hypotheses, so

$$B_{12} = \frac{\int_{\Theta_1} p_1(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_2} p_2(x|\theta_2)\,\pi_2(\theta_2)\,d\theta_2}$$

and, since $P(H_1|data) + P(H_2|data) = 1$, we can express the posterior probability $P(H_1|data)$ as

$$P(H_1|data) = \frac{B_{12}\,P(H_1)}{P(H_2) + B_{12}\,P(H_1)}$$

Usually, we consider equal prior probabilities for the two hypotheses ($P(H_1) = P(H_2) = 1/2$), but be aware that in some cases this may not be a realistic assumption. Bayes Factors are independent of the prior beliefs on the hypotheses ($P(H_i)$) but, when we have composite hypotheses, we average the likelihood with a prior and, if it is an improper function, they are not well defined. If we have prior knowledge about the parameters, we may take informative priors that are proper, but this is not always the case. One possible way out is to consider sufficiently general proper priors (conjugate priors, for instance), so that the Bayes Factors are well defined, and then study the sensitivity for different reasonable values of the hyperparameters. A more practical and interesting approach to avoid the indeterminacy due to improper priors [21, 22] is to take a subset of the observed sample to render a proper posterior (with, for instance, reference priors) and use that as a proper prior density to compute
the Bayes Factor with the remaining sample. Thus, if the sample $x = \{x_1, \ldots, x_n\}$ consists of iid observations, we may consider $x = \{\mathbf{x}_1, \mathbf{x}_2\}$ and, with the reference prior $\pi(\theta)$, obtain the proper posterior

$$\pi(\theta|\mathbf{x}_1) = \frac{p(\mathbf{x}_1|\theta)\,\pi(\theta)}{\int_{\Theta} p(\mathbf{x}_1|\theta)\,\pi(\theta)\,d\theta}$$

The remaining subsample ($\mathbf{x}_2$) is then used to compute the partial Bayes Factor⁹

$$B_{12}(\mathbf{x}_2|\mathbf{x}_1) = \frac{\int_{\Theta_1} p_1(\mathbf{x}_2|\theta_1)\,\pi_1(\theta_1|\mathbf{x}_1)\,d\theta_1}{\int_{\Theta_2} p_2(\mathbf{x}_2|\theta_2)\,\pi_2(\theta_2|\mathbf{x}_1)\,d\theta_2} = \frac{B_{12}(\mathbf{x}_1, \mathbf{x}_2)}{B_{12}(\mathbf{x}_1)}$$

for the hypothesis testing. Berger and Pericchi propose to use the minimal amount of data needed to specify a proper prior (usually $\max\{\dim(\theta_i)\}$ observations), so as to leave most of the sample for the model testing, and to dilute the dependence on a particular choice of the training sample by evaluating the Bayes Factors with all possible minimal samples and taking as characteristic value the truncated mean, the geometric mean or the median, which is less sensitive to outliers (see Example 2.24). A thorough analysis of Bayes Factors, with their caveats and advantages, is given in [23].
A different alternative to quantify the evidence in favour of a particular model, which avoids the need for the prior specification and is easy to evaluate, is the Schwarz criterion [24] (or "Bayes Information Criterion (BIC)"). The rationale is the following. Consider a sample $x = \{x_1, \ldots, x_n\}$ and two alternative hypotheses for the models $M_i = \{p_i(x|\theta_i);\ \dim(\theta_i) = d_i\}$; $i = 1, 2$. As we can see in Sect. 4.5, under the appropriate conditions we can approximate the likelihood as

$$l(\theta|x) \simeq l(\hat{\theta}|x)\,\exp\left\{-\frac{1}{2}\sum_{k=1}^{d}\sum_{m=1}^{d}(\theta_k - \hat{\theta}_k)\;n\,I_{km}(\hat{\theta})\;(\theta_m - \hat{\theta}_m)\right\}$$

so, taking a uniform prior for the parameters $\theta$, reasonable in the region where the likelihood is dominant, we can approximate

$$J(x) = \int p(x|\theta)\,\pi(\theta)\,d\theta \simeq p(x|\hat{\theta})\,(2\pi/n)^{d/2}\,|\det[I(\hat{\theta})]|^{-1/2}$$

and, ignoring terms that are bounded as $n\to\infty$, define the $BIC(M_i)$ for the model $M_i$ as

$$2\ln J_i(x) \simeq BIC(M_i) \equiv 2\ln p_i(x|\hat{\theta}_i) - d_i\ln n$$

so:

⁹ Essentially, the ratio of the predictive inferences for $\mathbf{x}_2$ after $\mathbf{x}_1$ has been observed.
$$B_{12} \simeq \frac{p_1(x|\hat{\theta}_1)}{p_2(x|\hat{\theta}_2)}\;n^{(d_2-d_1)/2} \;\longrightarrow\; \Delta_{12} = 2\ln B_{12} \simeq 2\ln\frac{p_1(x|\hat{\theta}_1)}{p_2(x|\hat{\theta}_2)} - (d_1 - d_2)\ln n$$

and therefore larger values of $\Delta_{12} = BIC(M_1) - BIC(M_2)$ indicate a preference for the hypothesis $H_1$ ($M_1$) against $H_2$ ($M_2$), it being commonly accepted that for values greater than 6 the evidence is "strong",¹⁰ although in some cases it is worth studying the behaviour with a Monte Carlo sampling. Note that the last term penalises models with a larger number of parameters and that this quantification is sound when the sample size $n$ is much larger than the dimensions $d_i$ of the parameters.
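Once the maximized log-likelihoods are at hand, a BIC comparison is a one-liner. The sketch below (simulated data; models, sample size and seed are illustrative assumptions) compares a fixed-mean Normal against a free-mean Normal:

```python
# Sketch of a BIC comparison: M1 = {N(x|0,1)} with d1 = 0 free parameters
# versus M2 = {N(x|mu,1)} with d2 = 1; data are simulated under M1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=200)

logL1 = norm.logpdf(x, 0.0, 1.0).sum()       # M1: no fitted parameters
mu_hat = x.mean()                            # M2: MLE of the mean
logL2 = norm.logpdf(x, mu_hat, 1.0).sum()

n = len(x)
bic1 = 2.0*logL1 - 0*np.log(n)
bic2 = 2.0*logL2 - 1*np.log(n)
delta12 = bic1 - bic2
# Delta_12 = 2 ln(L1/L2) + ln n; values above ~6 would count as "strong" evidence for M1
print(delta12)
```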
Example 2.23 Suppose that from the information provided by a detector we estimate the mass of an incoming particle and we want to decide upon the two exclusive and alternative hypotheses $H_1$ (particle of type 1) and $H_2$ ($=H_1^c$; particle of type 2). We know from calibration data and Monte Carlo simulations that the mass distributions for both hypotheses are, to a very good approximation, Normal with means $m_1$ and $m_2$ and variances $\sigma_1^2$ and $\sigma_2^2$ respectively. Then, for an observed value $m_0$ of the mass we have

$$B_{12} = \frac{p(m_0|H_1)}{p(m_0|H_2)} = \frac{N(m_0|m_1, \sigma_1)}{N(m_0|m_2, \sigma_2)} = \frac{\sigma_2}{\sigma_1}\,\exp\left\{\frac{(m_0 - m_2)^2}{2\,\sigma_2^2} - \frac{(m_0 - m_1)^2}{2\,\sigma_1^2}\right\}$$

Taking $l_{12} = l_{21}$ and $l_{11} = l_{22} = 0$, the Bayesian decision criterion in favor of the hypothesis $H_1$ is

$$B_{12} > \frac{P(H_2)}{P(H_1)} \;\longrightarrow\; \ln B_{12} > \ln\frac{P(H_2)}{P(H_1)}$$

Thus, we have a critical value $m_c$ of the mass,

$$\sigma_1^2\,(m_c - m_2)^2 - \sigma_2^2\,(m_c - m_1)^2 = 2\,\sigma_1^2\,\sigma_2^2\,\ln\left[\frac{P(H_2)\,\sigma_1}{P(H_1)\,\sigma_2}\right]$$

such that, if $m_0 < m_c$, we decide in favor of $H_1$, and for $H_2$ otherwise. In the case that $\sigma_1 = \sigma_2$ and $P(H_1) = P(H_2)$, then $m_c = (m_1 + m_2)/2$. This, however, may be a quite unrealistic assumption for, if $P(H_1) > P(H_2)$, it may be more likely that the event is of type 1 even when $B_{12} < 1$.
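The decision rule is simple to code. In the sketch below the masses, resolutions and prior probabilities are made-up numbers, not calibration values from the text:

```python
# Two-Gaussian mass hypothesis test of Example 2.23 with a zero-one loss
# (all numerical values are illustrative).
import numpy as np
from scipy.stats import norm

m1, s1 = 0.494, 0.012   # hypothetical mass and resolution for type 1
m2, s2 = 0.938, 0.020   # hypothetical mass and resolution for type 2
P1, P2 = 0.5, 0.5       # prior probabilities for the two hypotheses

def log_B12(m0):
    return norm.logpdf(m0, m1, s1) - norm.logpdf(m0, m2, s2)

def decide(m0):
    # decide H1 iff ln B12 > ln(P2/P1)
    return 'H1' if log_B12(m0) > np.log(P2/P1) else 'H2'

print(decide(0.50), decide(0.93))   # -> H1 H2
```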

Example 2.24 Suppose we have an iid sample x = {x1 , . . . , xn } of size n with X ∼=


N (x|μ, 1) and the two hypothesis H1 = {N (x|0, 1)} and H2 = {N (x|μ, 1); μ = 0}.
Let us take {xi } as the minimum sample and, with the usual constant prior, consider
the proper posterior

1
π(μ|xi ) = √ exp{−(μ − xi )2 /2}

10 If P(H1 ) = P(H2 ) = 1/2, then P(H1 |data) = 0.95 −→ B12 = 19 −→


12 6.
144 2 Bayesian Inference

that we use as a prior for the rest of the sample x  = {x1 , . . . , xi−1 , xi+1 , . . . , xn }.
Then

P(H1 |x  , xi ) P(H1 )

= B12 (i)
P(H2 |x , xi ) P(H2 )
p(x  |0)
where B12 (i) = % ∞ 
= n 1/2 exp{−(n x 2 − xi2 )/2}
−∞ p(x |μ)π(μ|x i )dμ

and x̄ = n⁻¹ Σ_{k=1}^{n} xk. To avoid the effect that a particular choice of the minimal sample
{xi} may have, this is evaluated for all possible minimal samples and the median
(or the geometric mean) of all the B12(i) is taken. Since P(H1|x) + P(H2|x) = 1, if
we assign equal prior probabilities to the two hypotheses (P(H1) = P(H2) = 1/2)
we have that

P(H1|x) = B12/(1 + B12) = [ 1 + n^{−1/2} exp{(n x̄² − med{xi²})/2} ]^{−1}

is the posterior probability that quantifies the evidence in favor of the hypothesis H1 .
It is left as an exercise to compare the Bayes Factor obtained from the geometric mean
with what you would get if you were to take a proper prior π(μ|σ) = N (μ|0, σ).
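A small simulation sketch of this "minimal sample" construction (sample size, seed and the choice of generating the data under H1 are arbitrary):

```python
import math
import random
import statistics

random.seed(1)
n = 30
x = [random.gauss(0.0, 1.0) for _ in range(n)]   # data generated under H1
xbar = sum(x) / n

# Partial Bayes Factor B12(i) for each possible choice of the minimal sample {x_i}
b12 = [math.sqrt(n) * math.exp(-(n * xbar ** 2 - xi ** 2) / 2) for xi in x]

# Median over all minimal samples, and P(H1|x) for equal prior probabilities
b12_med = statistics.median(b12)
p_h1 = b12_med / (1.0 + b12_med)
```

Replacing `statistics.median` by a geometric mean gives the other aggregation mentioned in the text.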

Problem 2.11 Suppose we have n observations (independent, under the same experimental
conditions, …) of energies or decay times of particles above a certain known
threshold and we want to test the evidence of an exponential fall-off against a power law.
Consider then a sample x = {x1 , . . . , xn } of observations with supp(X ) = (1, ∞)
and the two models

M1 : p1 (x|θ) = θ exp{−θ(x − 1)}1(1,∞) (x) and M2 : p2 (x|α) = αx −(α+1) 1(1,∞) (x)

that is, Exponential and Pareto with unknown parameters θ and α. Show that for the
minimal sample {xi} and reference priors, the Bayes Factor B12(i) is given by

B12(i) = ( x̄g ln x̄g / (x̄ − 1) )^n · (xi − 1)/(xi ln xi) = [ p1(x|θ̂)/p2(x|α̂) ] · (xi − 1)/(xi ln xi)

where (x̄, x̄g) are the arithmetic and geometric sample means and (θ̂, α̂) the values
that maximize the likelihoods, and therefore

med{B12(i)}_{i=1}^{n} = ( x̄g ln x̄g / (x̄ − 1) )^n · med{ (xi − 1)/(xi ln xi) }_{i=1}^{n}

Problem 2.12 Suppose we have two experiments ei(ni); i = 1, 2 in which, out of
ni trials, xi successes have been observed and we are interested in testing whether

both treatments are different or not (contingency tables). If we assume Binomial
models Bi(xi|ni, θi) for both experiments and the two hypotheses H1: {θ1 = θ2}
and H2: {θ1 ≠ θ2}, the Bayes Factor will be

B12 = ∫ Bi(x1|n1, θ) Bi(x2|n2, θ) π(θ) dθ / [ ∫ Bi(x1|n1, θ1) π(θ1) dθ1 · ∫ Bi(x2|n2, θ2) π(θ2) dθ2 ]

We may consider proper Beta prior densities Be(θ|a, b). In a specific pharmacological
analysis, a sample of n1 = 52 individuals were administered a placebo and
n2 = 61 were treated with an a priori beneficial drug. After the trial, positive effects
were observed in x1 = 22 out of the 52 and in x2 = 41 out of the 61 individuals.
It is left as an exercise to obtain the posterior probability P(H2|data) with Jeffreys'
(a = b = 1/2) and Uniform (a = b = 1) priors and to determine the BIC difference
Δ12.
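A possible numerical sketch for this evaluation: the binomial coefficients are common to numerator and denominator and cancel, so the Bayes Factor reduces to ratios of Beta functions, computable with log-gamma calls (deriving and interpreting the result remains the exercise):

```python
from math import exp, lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_b12(x1, n1, x2, n2, a, b):
    """ln of the Bayes Factor for H1:{θ1=θ2} vs H2:{θ1≠θ2} with Be(θ|a,b) priors.
    The binomial coefficients cancel between numerator and denominator."""
    joint = log_beta(x1 + x2 + a, n1 + n2 - x1 - x2 + b) - log_beta(a, b)
    split = (log_beta(x1 + a, n1 - x1 + b) - log_beta(a, b)
             + log_beta(x2 + a, n2 - x2 + b) - log_beta(a, b))
    return joint - split

# Data of the pharmacological example: 22/52 (placebo) vs 41/61 (drug)
for a, b, label in [(0.5, 0.5, "Jeffreys"), (1.0, 1.0, "Uniform")]:
    lb = log_b12(22, 52, 41, 61, a, b)
    p_h2 = 1.0 / (1.0 + exp(lb))   # P(H2|data) for equal prior probabilities
    print(label, round(p_h2, 3))
```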

2.10.2 Point Estimation

When we have to face the problem of characterizing the posterior density by a single
number, the most usual Loss Functions are:

• Quadratic Loss: In the simple one-dimensional case, the Loss Function is

l(θ, a) = (θ − a)²

so, minimizing the Risk:

min_a ∫_Θ (θ − a)² p(θ|x) dθ  ⟶  ∫_Θ (θ − a) p(θ|x) dθ = 0

and therefore a = E[θ]; that is, the posterior mean.


In the k-dimensional case, if A = Θ = ℝ^k we shall take as Loss Function

l(θ, a) = (a − θ)ᵀ H (a − θ)

where H is a positive-definite symmetric matrix. It is clear that:



min_a ∫_{ℝ^k} (a − θ)ᵀ H (a − θ) p(θ|x) dθ  ⟶  H a = H E[θ]

so, if H⁻¹ exists, then a = E[θ]. Thus, we have that the Bayesian estimate under a
quadratic loss function is the mean of p(θ|x) (… if it exists!).

• Linear Loss: If A = Θ = ℝ, we shall take the loss function:

l(θ, a) = c1 (a − θ) 1_{θ≤a} + c2 (θ − a) 1_{θ>a}

Then, the estimator will be such that

min_a ∫_Θ l(a, θ) p(θ|x) dθ = min_a { c1 ∫_{−∞}^{a} (a − θ) p(θ|x) dθ + c2 ∫_{a}^{∞} (θ − a) p(θ|x) dθ }

Differentiating with respect to a we get (c1 + c2) P(θ ≤ a) − c2 = 0 and therefore
the estimator will be the value of a such that

P(θ ≤ a) = c2/(c1 + c2)

In particular, if c1 = c2 then P(θ ≤ a) = 1/2 and we shall have the median of the
distribution p(θ|x). In this case, the Loss Function can be expressed more simply as
l(θ, a) = |θ − a|.

• Zero-One Loss: If A = Θ = ℝ^k, we shall take the Loss Function

l(θ, a) = 1 − 1_{Bε(a)}(θ)

where Bε(a) ⊂ Θ is an open ball of radius ε centered at a. The corresponding point
estimator will be:

min_a ∫_Θ (1 − 1_{Bε(a)}(θ)) p(θ|x) dθ = max_a ∫_{Bε(a)} p(θ|x) dθ

It is clear that, in the limit ε → 0, the Bayesian estimator for the Zero-One Loss
Function will be the mode of p(θ|x), if it exists.
As explained in Chap. 1, the mode, the median and the mean can be very different
if the distribution is not symmetric. Which one should we take then? Quadratic
losses, for which large deviations from the true value are penalized quadratically,
are the most common option but, even though for unimodal symmetric distributions the three
statistics coincide, it may be misleading, or even nonsensical, to take this value as a
characteristic number for the information we got about the parameters. In the hypothetical
case that the posterior is essentially the same as the likelihood (which is the case for a
sufficiently smooth prior), the Zero-One Loss points to the classical estimate of the
Maximum Likelihood Method. Other considerations of interest in Classical Statistics
(like bias, consistency, minimum variance, …) have no special relevance in Bayesian
inference.
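The three estimators can be compared numerically by minimizing the corresponding risks on a grid; the skewed toy posterior below (a Gamma density, chosen only for illustration) exhibits the ordering mode < median < mean:

```python
import math

# Skewed toy posterior: Ga(θ|3, 1) ∝ θ² e^{-θ}  (mean 3, median ≈ 2.674, mode 2)
grid = [0.01 * i for i in range(1, 2001)]
w = [t * t * math.exp(-t) for t in grid]
z = sum(w)
p = [v / z for v in w]

def risk(a, loss):
    """Posterior expected loss for the point estimate a."""
    return sum(pi * loss(t, a) for t, pi in zip(grid, p))

candidates = [0.05 * i for i in range(1, 201)]
est_quad = min(candidates, key=lambda a: risk(a, lambda t, c: (t - c) ** 2))  # mean
est_lin = min(candidates, key=lambda a: risk(a, lambda t, c: abs(t - c)))     # median
mode = grid[max(range(len(p)), key=p.__getitem__)]                            # 0-1 loss
```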

Problem 2.13 (The Uniform Distribution) Show that for the posterior density (see
Example 2.4)

p(θ|xM, n) = n xM^n θ^{−(n+1)} 1_[xM,∞)(θ)

the point estimates under quadratic, linear and 0–1 loss functions are

θQL = (n/(n−1)) xM ;  θLL = xM 2^{1/n}  and  θ01L = xM

and discuss which one you consider more reasonable.

2.11 Credible Regions

Let p(θ|x), with θ ∈ Θ ⊆ ℝⁿ, be a posterior density function. A credible region with
probability content 1 − α is a region Vα ⊆ Θ of the parametric space such that

P(θ ∈ Vα) = ∫_{Vα} p(θ|x) dθ = 1 − α
Obviously, for a given probability content credible regions are not unique and a
sound criterion is to specify the one with the smallest possible volume. A region C of
the parametric space Θ is called a Highest Probability Density (HPD) region with probability
content 1 − α if:
(1) P(θ ∈ C) = 1 − α; C ⊆ Θ;
(2) p(θ1|·) ≥ p(θ2|·) for all θ1 ∈ C and θ2 ∉ C except, at most, for a subset of Θ
with zero probability measure.
It is left as an exercise to show that condition (2) implies that the HPD region so
defined is of minimum volume, so both definitions are equivalent. Further properties
that are easy to demonstrate are:
(1) If p(θ|·) is not uniform, the HPD region with probability content 1 − α is unique;
(2) If p(θ1|·) = p(θ2|·), then θ1 and θ2 are both either included in or excluded from the
HPD region;
(3) If p(θ1|·) ≠ p(θ2|·), there is an HPD region for some value of 1 − α that contains
one value of θ and not the other;
(4) C = {θ ∈ Θ | p(θ|x) ≥ kα} where kα is the largest constant for which
P(θ ∈ C) ≥ 1 − α;
(5) If φ = f(θ) is a one-to-one transformation, then
(a) any region with probability content 1 − α for θ will have probability content
1 − α for φ but…
(b) an HPD region for θ will not, in general, be an HPD region for φ unless the
transformation is linear.
In general, the evaluation of credible regions is a somewhat messy task. A simple way through
is to do a Monte Carlo sampling of the posterior density and use the 4th property.
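A sketch of that Monte Carlo strategy, using property (4) with a unimodal Gamma posterior as a stand-in (sample size, seed and density are arbitrary choices for the illustration):

```python
import math
import random

random.seed(3)
alpha = 0.32                       # probability content 1 - α = 0.68
draws = [random.gammavariate(3.0, 1.0) for _ in range(20000)]

def dens(t):
    """Stand-in posterior density: Ga(θ|3, 1) = θ² e^{-θ} / 2."""
    return t * t * math.exp(-t) / 2.0

# Property (4): C = {θ : p(θ|x) ≥ k_α}. Estimate k_α as the α-quantile of the
# density evaluated at the posterior draws, then keep the draws inside C.
pvals = sorted(dens(t) for t in draws)
k_alpha = pvals[int(alpha * len(pvals))]
kept = [t for t in draws if dens(t) >= k_alpha]

frac = len(kept) / len(draws)      # ≈ 0.68 by construction
lo, hi = min(kept), max(kept)      # HPD interval endpoints (unimodal case)
```

For a multimodal density the kept draws would instead delimit a union of disconnected intervals, as noted at the end of this section.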
148 2 Bayesian Inference

For a one-dimensional parameter, the condition that the HPD region with probability
content 1 − α has the minimum length allows one to write a relation that may be useful
to obtain those regions in an easier manner. Let [θ1, θ2] be an interval such that

∫_{θ1}^{θ2} p(θ|·) dθ = 1 − α

For this to be an HPD region we have to find the extremal of the function

φ(θ1, θ2, λ) = (θ2 − θ1) + λ [ ∫_{θ1}^{θ2} p(θ|·) dθ − (1 − α) ]

Taking derivatives we get:

∂φ(θ1, θ2, λ)/∂θi = 0, i = 1, 2  ⟶  p(θ1|·) = p(θ2|·)

∂φ(θ1, θ2, λ)/∂λ = 0  ⟶  ∫_{θ1}^{θ2} p(θ|·) dθ = 1 − α

Thus, from the first two conditions we have that p(θ1|·) = p(θ2|·) and, from the
third, we know that θ1 ≠ θ2. In the special case that the distribution is unimodal and
symmetric, the only possible solution is θ2 = 2E[θ] − θ1.
The HPD regions are useful to summarize the information on the parameters contained
in the posterior density p(θ|x), but it should be clear that there is no justification
to reject a particular value θ0 just because it is not included in the HPD region (or, in
fact, in any confidence region) and that in some circumstances (distributions
with more than one mode, for instance) the HPD region may be the union of disconnected regions.

2.12 Bayesian (B) Versus Classical (F ) Philosophy

The Bayesian philosophy aims at the right questions in a very intuitive and, at least
conceptually, simple manner. However, the "classical" (frequentist) approach to statistics,
which has been very useful in scientific reasoning over the last century, is at
present more widespread in the Particle Physics community, and most of the stirred-up
controversies originate from misinterpretations. It is worth taking a look, for
instance, at [2]. Let's see how a simple problem is attacked by the two schools. "We"
are B, "they" are F.
Suppose we want to estimate the life-time of a particle. We both "assume" an
exponential model X ∼ Ex(x|1/τ) and do an experiment e(n) that provides an iid
sample x = {x1, x2, …, xn}. In this case there is a sufficient statistic t = (n, x̄), with
x̄ the sample mean, so let's define the random quantity

X̄ = (1/n) Σ_{i=1}^{n} Xi ∼ p(x̄|n, τ) = (n/τ)^n (1/Γ(n)) exp{−n x̄ τ⁻¹} x̄^{n−1} 1_(0,∞)(x̄)

What can we say about the parameter of interest τ ?


F will start by finding the estimator (statistic) τ̂ that maximizes the likelihood
(MLE). In this case it is clear that τ̂ = x̄, the sample mean. We may ask about the
rationale behind this because, apparently, there is no serious mathematical reasoning that
justifies the procedure. F will respond that, in a certain sense, even for us this should
be a reasonable way to proceed because, if we have a smooth prior function, the posterior is dominated
by the likelihood and one possible point estimator is the mode of the posterior.
Besides that, he will argue that maximizing the likelihood renders an estimator that
often has "good" properties like unbiasedness, invariance under monotone one-to-one
transformations, consistency (convergence in probability), smallest variance
within the class of unbiased estimators (Cramér–Rao bound), an approximately well-known
distribution, … We may question some of them (unbiased estimators are not
always the best option and invariance… well, if the transformation is not linear the MLE is usually
biased), argue that the others hold in the asymptotic limit, … Anyway,
for this particular case one has that:

E[τ̂] = τ  and  V[τ̂] = τ²/n
and F will claim that "if you repeat the experiment" many times under the same
conditions, you will get a sequence of estimates {τ̂1, τ̂2, …} that eventually will
cluster around the life-time τ. Fine, but we shall point out that, first, although desirable,
we usually do not repeat the experiments (and under the same conditions even more
rarely), so we have just one observed sample (x → x̄ = τ̂) from e(n). Second, "if you
repeat the experiment you will get" is a free and unnecessary hypothesis. You do not
know what you will get, among other things because the model we are considering
may not be the way nature behaves. Besides that, it is quite unpleasant that inferences
on the life-time depend upon what you think you will get if you do what you know
you are not going to do. And third, this is in any case a nice sampling property
of the estimator τ̂, but eventually we are interested in τ, so: what can we say about it?
For us, the answer is clear. Since τ is a scale parameter, we write the posterior density
function

p(τ|n, x̄) = ((n x̄)^n / Γ(n)) exp{−n x̄ τ⁻¹} τ^{−(n+1)} 1_(0,∞)(τ)

for the degree of belief we have on the parameter, and easily get, for instance:

E[τ^k] = (Γ(n−k)/Γ(n)) (n x̄)^k  ⟶  E[τ] = (n/(n−1)) x̄ ;  V[τ] = (n²/((n−1)²(n−2))) x̄² ; …

It could not be cleaner or simpler.



To bound the life-time, F proceeds with the determination of Confidence
Intervals. The classical procedure was introduced by J. Neyman in 1933 and rests on
establishing, for a specified probability content, the domain of the random quantity
(usually a statistic) as a function of the possible values the parameters may take.
Consider a one-dimensional parameter θ and the model X ∼ p(x|θ). Given a desired
probability content β ∈ [0, 1], he determines the interval [x1, x2] ⊂ Ω_X such that

P(X ∈ [x1, x2]) = ∫_{x1}^{x2} p(x|θ) dx = β

for a particular fixed value of θ. Thus, for each possible value of θ he has one interval
[x1 = f1(θ; β), x2 = f2(θ; β)] ⊂ Ω_X and the sequence of those intervals gives a
band in the θ × X region of the real plane. As for the Credible Regions, these
intervals are not uniquely determined, so one usually adds the condition:

(1) ∫_{−∞}^{x1} p(x|θ) dx = ∫_{x2}^{∞} p(x|θ) dx = (1 − β)/2  or

(2) ∫_{x1}^{θ} p(x|θ) dx = ∫_{θ}^{x2} p(x|θ) dx = β/2

or, less often, (3) chooses the interval with the smallest size. Now, for an invertible
mapping xi ⟶ fi(θ) one can write

β = P( f 1 (θ) ≤ X ≤ f 2 (θ)) = P( f 2−1 (X ) ≤ θ ≤ f 1−1 (X ))

and get the random interval [ f 2−1 (X ), f 1−1 (X )] that contains the given value of θ with
probability β. Thus, for each possible value that X may take he will get an interval
[ f 2−1 (X ), f 1−1 (X )] on the θ axis and a particular experimental observation {x} will
single out one of them. This is the Confidence Interval that the frequentist analyst will
quote. Let's continue with the life-time example and take, for illustration, n = 50
and β = 0.68. The bands [x1 = f1(τ), x2 = f2(τ)] in the (τ, X̄) plane, in this case
obtained with the third prescription, are shown in Fig. 2.4(1). They are essentially
straight lines, so P[X̄ ∈ (0.847τ, 1.126τ)] = 0.68. This is a correct statement, but it
doesn't say anything about τ, so he inverts it and gets 0.89 x̄ < τ < 1.18 x̄ in
such a way that an observed value {x̄} singles out an interval on the vertical τ axis.
We, Bayesians, will argue this does not mean that τ has a 0.68 chance to lie in this
interval and the frequentist will certainly agree on that. In fact, this is not an admissible
question for him because in the classical philosophy τ is a number, unknown but a
fixed number. If he repeats the experiment τ will not change; it is the interval that
will be different because x will change. They are random intervals and what the
68% means is just that if he repeats the experiment a large number N of times, he
will end up with N intervals of which ∼68% will contain the true value τ whatever
it is. But the experiment is done only once so: Does the interval derived from this
observation contain τ or not? We don’t know, we have no idea if it does contain τ ,


Fig. 2.4 (1) 68% confidence level bands in the (τ , X ) plane. (2) 68% confidence intervals obtained
for 100 repetitions of the experiment

if it does not, nor how far the unknown true value is. Figure 2.4(2) shows the 68%
confidence intervals obtained after 100 repetitions of the experiment for τ = 2;
67 of them did contain the true value. But when the experiment is done once, he picks
up one of those intervals and has a 68% chance that the one chosen contains the true
value. We, B, shall proceed in a different manner. After integration of the posterior
density we get the HPD interval P[τ ∈ (0.85 x̄, 1.13 x̄)] = 0.68; almost the same,
but with a direct interpretation in terms of what we are interested in. Thus, the two have
an absolutely different philosophy:
F: "Given a particular value of the parameters of interest, how likely is the observed
data?"
B: "Having observed this data, what can we say about the parameters of interest?"
… and the probability of the causes, as Poincaré said, is the most important from the
point of view of scientific applications.
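The frequentist "68% of the intervals cover τ" statement can be checked with a small simulation (τ = 2 and n = 50 as above; the number of repetitions and the seed are arbitrary):

```python
import random

random.seed(11)
tau, n, n_rep = 2.0, 50, 2000
covered = 0
for _ in range(n_rep):
    xbar = sum(random.expovariate(1.0 / tau) for _ in range(n)) / n
    # interval obtained by inverting P[X̄ ∈ (0.847 τ, 1.126 τ)] = 0.68
    if 0.89 * xbar < tau < 1.18 * xbar:
        covered += 1
coverage = covered / n_rep   # fraction of intervals containing the true τ
```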
In many circumstances we are also interested in one-sided intervals. That is, for
instance, the case when the data is consistent with the hypothesis H: {θ = θ0} and
we want to give an upper bound on θ so that P(θ ∈ (−∞, θβ]) = β. The frequentist
rationale is the same: obtain the interval (−∞, x2] ⊂ Ω_X such that

P(X ≤ x2) = ∫_{−∞}^{x2} p(x|θ) dx = β

where x2 = f2(θ); in this case without ambiguity. For the random interval (−∞, f2⁻¹(X)),
F has that

P(θ < f2⁻¹(X)) = 1 − P(θ ≥ f2⁻¹(X)) = 1 − β

so, for a probability content α (say 0.95), one should set β = 1 − α (= 0.05). Now,
consider for instance the example of the anisotropy in cosmic rays discussed in the last

Sect. 2.13.3. For a dipole moment (details are unimportant now) we have a statistic

X ∼ p(x|θ, 1/2) = (exp{−θ²/2} / (√(2π) θ)) exp{−x/2} sinh(θ√x) 1_(0,∞)(x)

where the parameter θ is the dipole coefficient multiplied by a factor that is irrelevant
for the example. It is argued in Sect. 2.13.3 that the reasonable prior for this model
is π(θ) = constant, so we have the posterior

p(θ|x, 1/2) = ( √2 / (√(πx) M(1/2, 3/2, x/2)) ) exp{−θ²/2} θ⁻¹ sinh(θ√x) 1_(0,∞)(θ)

with M(a, b, z) Kummer's Confluent Hypergeometric Function. In fact, θ
has a compact support but, since the observed values of X are consistent with
H0: {θ = 0} and the sample size is very large [AMS13],11 p(θ|x, 1/2) is concentrated
in a small interval (0, ε) and it is easier for the evaluations to extend the
domain to ℝ⁺ without any effect on the results. Then we, Bayesians, shall derive
the one-sided upper credible region [0, θ0.95(x)] with α = 0.95 probability content
simply as:

∫_0^{θ0.95} p(θ|x, 1/2) dθ = α = 0.95

This upper bound is shown as a function of x in Fig. 2.5 under "Bayes" (red line).
Neyman's construction is also straightforward. From

∫_0^{x2} p(x|θ, 1/2) dx = 1 − α = 0.05

(essentially a χ² probability for ν = 3), F will get the upper bound shown in the
same figure under "Neyman" (blue broken line). As you can see, they get closer as x
grows but, first, there is no solution for x ≤ xc = 0.352. In fact, E[X] = θ² + 3, so if
the dipole moment is δ = 0 (θ = 0) then E[X] = 3, and observed values below xc will be
an unlikely downward fluctuation (assuming of course that the model is correct) but
certainly a possible experimental outcome. Moreover, you can see that for values of x
less than 2, even though there is a solution, Neyman's upper bound is underestimated.
To avoid this "little" problem, a different prescription has to be taken.
The most interesting solution is the one proposed by Feldman and Cousins [25],
in which the region Δ_X ⊂ Ω_X of the sample space that is considered for the specified
probability content is determined by the ratio of probability densities. Thus, for a given
value θ0, the region Δ_X is such that

11 [AMS13]: Aguilar M. et al. (2013); Phys. Rev. Lett. 110, 141102.




Fig. 2.5 95% upper bounds on the parameter θ following the Bayesian approach (red), the Neyman
approach (broken blue) and Feldman and Cousins (solid blue line)


Fig. 2.6 (1) Dependence of θm with x. (2) Probability density ratio R(x|θ) for θ = 2


∫_{Δ_X} p(x|θ0) dx = β  with  R(x|θ0) = p(x|θ0)/p(x|θb) > kβ ; ∀x ∈ Δ_X

and where θb is the best estimate of θ for a given {x}; usually the one that maximizes
the likelihood (θm). In our case, it is given by:

θm = 0  if x ≤ √3 ;   θm + θm⁻¹ − √x coth(θm √x) = 0  if x > √3

and the dependence on x is shown in Fig. 2.6(1) (θm ≃ √x for x ≫ 1). As an illustration,
the function R(x|θ) is shown in Fig. 2.6(2) for the particular value θ0 = 2. Following this
procedure,12 the 0.95 probability content band is shown in Fig. 2.5 under "Feldman–Cousins"
(blue line). Note that for large values of x, the confidence region becomes
a two-sided interval. It is true that if we observe a large value of X, the hypothesis H0: {δ = 0}
will not be favoured by the data and a different analysis will be more relevant although,
by a simple modification of the ordering rule, we can still get an upper bound if desired,
or use the standard Neyman procedure.
The Feldman and Cousins prescription allows one to consider constraints on the parameters
in a simpler way than Neyman's procedure and, as opposed to it, will always
provide a region with the specified probability content. However, on the one hand,
these are frequentist intervals and have to be interpreted as such. On the other hand,
for discrete random quantities with image in {x1, …, xk, …} it may not be possible
to satisfy exactly the probability content equation, since for the Distribution Function
one has that F(xk+1) = F(xk) + P(X = xk+1). And last, it is not straightforward
to deal with nuisance parameters. Therefore, the best advice: "Be Bayesian!".
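For the record, the Bayesian upper bound discussed above can be reproduced by direct numerical integration of the (unnormalized) posterior ∝ e^{-θ²/2} sinh(θ√x)/θ; the grid, cutoff and example value of x below are arbitrary numerical choices for the sketch:

```python
import math

def upper_bound(x, cl=0.95, h=0.001, theta_max=10.0):
    """cl-quantile of p(θ|x,1/2) ∝ e^{-θ²/2} sinh(θ√x)/θ via trapezoidal sums."""
    sx = math.sqrt(x)

    def f(t):
        if t < 1e-12:
            return sx                       # limit of sinh(t√x)/t for t → 0
        return math.exp(-t * t / 2.0) * math.sinh(t * sx) / t

    ts = [i * h for i in range(int(theta_max / h) + 1)]
    w = [f(t) for t in ts]
    total = sum(w) - 0.5 * (w[0] + w[-1])   # trapezoid rule; the factor h cancels
    acc = 0.0
    for i in range(1, len(ts)):
        acc += 0.5 * (w[i - 1] + w[i])
        if acc >= cl * total:
            return ts[i]
    return theta_max

theta_95 = upper_bound(3.0)   # e.g. at the no-signal expectation E[X] = 3
```

Unlike the Neyman construction, this returns a finite bound for any observed x, including x below xc.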

2.13 Some Worked Examples

2.13.1 Regression

Consider the exchangeable sequence z = {(x1, y1), (x2, y2), …, (xn, yn)} of n samplings
from the two-dimensional model N(xi, yi|·) = N(xi|μxi, σxi²) N(yi|μyi, σyi²).
Then

p(z|·) ∝ exp{ −(1/2) Σ_{i=1}^{n} [ (xi − μxi)²/σxi² + (yi − μyi)²/σyi² ] }

We shall assume that the precisions σxi and σyi are known and that there is a functional
relation μy = f(μx; θ) with unknown parameters θ. Then, in terms of the new
parameters of interest:

p(z|·) ∝ exp{ −(1/2) Σ_{i=1}^{n} [ (xi − μxi)²/σxi² + (yi − f(μxi; θ))²/σyi² ] }

Consider a linear relation f(μx; a, b) = a + bμx with a, b the unknown parameters, so:

12 In most cases, a Monte Carlo simulation will simplify life.


p(z|·) ∝ exp{ −(1/2) Σ_{i=1}^{n} [ (xi − μxi)²/σxi² + (yi − a − bμxi)²/σyi² ] }

and assume, in the first place, that μxi = xi without uncertainty. Then,

p(y|a, b) ∝ exp{ −(1/2) Σ_{i=1}^{n} (yi − a − bxi)²/σyi² }

There is a set of sufficient statistics for (a, b):

t = {t1, t2, t3, t4, t5} = { Σ_{i=1}^{n} 1/σi², Σ_{i=1}^{n} xi²/σi², Σ_{i=1}^{n} xi/σi², Σ_{i=1}^{n} yi/σi², Σ_{i=1}^{n} xi yi/σi² }
i=1 i

and, after simple algebra, it is easy to write

p(y|a, b) ∝ exp{ −1/(2(1−ρ²)) [ (a − a0)²/σa² + (b − b0)²/σb² − 2ρ (a − a0)(b − b0)/(σa σb) ] }

where the new statistics {a0, b0, σa, σb, ρ} are defined as:

a0 = (t2 t4 − t3 t5)/(t1 t2 − t3²) ,  b0 = (t1 t5 − t3 t4)/(t1 t2 − t3²)

σa² = t2/(t1 t2 − t3²) ,  σb² = t1/(t1 t2 − t3²) ,  ρ = −t3/√(t1 t2)

Both (a, b) are position parameters, so we shall take a uniform prior and in consequence

p(a, b|·) = 1/(2π σa σb √(1−ρ²)) exp{ −1/(2(1−ρ²)) [ (a − a0)²/σa² + (b − b0)²/σb² − 2ρ (a − a0)(b − b0)/(σa σb) ] }

This was obviously expected.
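The statistics {a0, b0, σa, σb, ρ} are trivial to compute from the data; a minimal sketch (with invented data lying exactly on a straight line, so the posterior centres on the true values):

```python
import math

def line_posterior(xs, ys, sigmas):
    """Posterior statistics {a0, b0, σa, σb, ρ} for y = a + b x with known σ_yi."""
    t1 = sum(1.0 / s ** 2 for s in sigmas)
    t2 = sum(xv ** 2 / s ** 2 for xv, s in zip(xs, sigmas))
    t3 = sum(xv / s ** 2 for xv, s in zip(xs, sigmas))
    t4 = sum(yv / s ** 2 for yv, s in zip(ys, sigmas))
    t5 = sum(xv * yv / s ** 2 for xv, yv, s in zip(xs, ys, sigmas))
    d = t1 * t2 - t3 ** 2
    a0 = (t2 * t4 - t3 * t5) / d
    b0 = (t1 * t5 - t3 * t4) / d
    return a0, b0, math.sqrt(t2 / d), math.sqrt(t1 / d), -t3 / math.sqrt(t1 * t2)

# Invented data lying exactly on y = 1 + 2x: the posterior centres on (1, 2)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * xv for xv in xs]
a0, b0, sa, sb, rho = line_posterior(xs, ys, [0.1] * 5)
```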


When the μxi are n unknown parameters, if we take π(μxi) = 1_(0,∞)(μxi) and marginalize
to get the posterior for (a, b), we have

p(a, b|·) ∝ π(a, b) exp{ −(1/2) Σ_{i=1}^{n} (yi − a − b xi)²/(σyi² + b²σxi²) } ∏_{i=1}^{n} (σyi² + b²σxi²)^{−1/2}

In general, the expressions one gets for non-linear regression problems are complicated
and setting up priors is a non-trivial task, but fairly vague priors that are easy to deal with
are usually a reasonable choice. In this case, for instance, one may consider uniform

priors or normal densities N(·|0, σ ≫) for both parameters (a, b) and sample the
proper posterior with a Monte Carlo algorithm (Gibbs sampling will be appropriate).
The same reasoning applies if we want to consider other models or more involved
relations with several explanatory variables, like θi = Σ_{j=1}^{k} αj x_{ij}^{bj}. In counting experiments,
for example, yi ∈ ℕ so we may be interested in a Poisson model Po(yi|μi)
where μi is parameterized in a simple log-linear form ln(μi) = α1 + α2 xi (so μi > 0
for whatever α1, α2 ∈ ℝ). Suppose for instance that we have the sample {(yi, xi)}_{i=1}^{n}.
Then:

p(y|α1, α2, x) ∝ ∏_{i=1}^{n} exp{−μi} μi^{yi} = exp{ α1 s1 + α2 s2 − Σ_{i=1}^{n} e^{α1 + α2 xi} }

where s1 = Σ_{i=1}^{n} yi and s2 = Σ_{i=1}^{n} yi xi. In this case, the Normal distribution
N(αi|ai, σi) with σi ≫ is a reasonably smooth and easy-to-handle proper prior
density for both parameters. Thus, we get the posterior conditional densities

p(αi|αj, y, x) ∝ exp{ −αi²/(2σi²) + αi (ai/σi² + si) − Σ_{k=1}^{n} e^{α1 + α2 xk} } ;  i = 1, 2

that are perfectly suited for the Gibbs sampling to be discussed in Sect. 4.1 of Chap. 3.
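A minimal sketch along these lines, using Metropolis-within-Gibbs rather than exact conditional sampling (a named substitution: the conditionals above are non-standard, so each coordinate is updated with a random-walk Metropolis step; the synthetic data, step sizes, chain length and seed are arbitrary choices):

```python
import math
import random

random.seed(5)

# Synthetic counts with ln μ_i = α1 + α2 x_i and true (α1, α2) = (1.0, 0.5)
xs = [0.1 * i for i in range(30)]

def rpois(mu):
    """Simple Poisson sampler (multiplication method), fine for small μ."""
    limit, k, prod = math.exp(-mu), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

ys = [rpois(math.exp(1.0 + 0.5 * xv)) for xv in xs]

def log_post(a1, a2, s=100.0):
    """Log posterior: Poisson likelihood plus vague N(0, s) priors on α1, α2."""
    prior = -(a1 * a1 + a2 * a2) / (2.0 * s * s)
    return prior + sum(yv * (a1 + a2 * xv) - math.exp(a1 + a2 * xv)
                       for xv, yv in zip(xs, ys))

# Metropolis-within-Gibbs: update one coordinate at a time with a random walk
a = [0.0, 0.0]
chain = []
for _ in range(4000):
    for j in (0, 1):
        prop = a[:]
        prop[j] += random.gauss(0.0, 0.05)
        if math.log(random.random()) < log_post(*prop) - log_post(*a):
            a = prop
    chain.append(tuple(a))

post = chain[2000:]   # discard burn-in
a1_hat = sum(v[0] for v in post) / len(post)
a2_hat = sum(v[1] for v in post) / len(post)
```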
Example 2.25 (Proton Flux in Primary Cosmic Rays) For energies between ∼20 and
∼200 GeV, the flux of protons of the primary cosmic radiation is reasonably well
described by a power law φ(r) = c r^γ, where r is the rigidity13 and γ = d ln φ/ds,
with s = ln r, is the spectral index. At lower energies, this dependence is significantly
modified by the geomagnetic cut-off and the solar wind but, at higher energies, where
these effects are negligible, the observations are not consistent with a single power law
(Fig. 2.7(1)). One may characterize this behaviour with a simple phenomenological
model where the spectral index is no longer constant but has a dependence γ(s) =
α + β tanh[a(s − s0)] such that lim_{s→−∞} γ(s) = γ1 (r → 0) and lim_{s→+∞} γ(s) = γ2
(r → +∞). After integration, the flux can be expressed in terms of the 5 parameters
θ = {φ0, γ1, δ = γ2 − γ1, r0, σ} as:

φ(r; θ) = φ0 r^{γ1} [ 1 + (r/r0)^σ ]^{δ/σ}

For this example, I have used the data above 45 GeV published by the AMS experi-
ment14 and considered only the quoted statistical errors (see Fig. 2.7(1)). Last, for a
better description of the flux the previous expression has been modified to account
for the effect of the solar wind with the force-field approximation in consistency with

13 The rigidity (r) is defined as the momentum (p) divided by the electric charge (Z), so r = p for
protons.
14 [AMS15]: Aguilar M. et al. (2015); PRL 114, 171103 and references therein.


Fig. 2.7 (1) Observed flux multiplied by r^2.7 in m⁻² sr⁻¹ sec⁻¹ GV^1.7 as given in [AMS15];
(2) Posterior density of the parameter γ1 (arbitrary vertical scale); (3) Posterior density of the
parameter δ = γ2 − γ1 (arbitrary vertical scale); (4) Projection of the posterior density p(γ1, δ)

[AMS15]. This is just a technical detail, irrelevant for the purpose of the example.
Then, assuming a Normal model for the observations, we can write the posterior
density

p(θ|data) ∝ π(θ) exp{ −(1/2) Σ_{i=1}^{n} (φi − φ(ri; θ))²/σi² }

I have taken Normal priors with large variances (σi ≫) for the parameters γ1
and δ and restricted the support to ℝ⁺ for {φ0, r0, σ}. The posterior densities for
the parameters γ1 and δ are shown in Fig. 2.7(2, 3) together with the projection
(Fig. 2.7(4)) that gives an idea of the correlation between them. For a visual inspection,
the phenomenological form of the flux is shown in Fig. 2.7(1) (blue line), superimposed
on the data, with the parameters set to their expected posterior values.

2.13.2 Characterization of a Possible Source of Events

Suppose that we observe a particular region Ω of the sky during a time t and denote
by λ the rate at which events from this region are produced. We take a Poisson
model to describe the number of produced events: k ∼ Po(k|λt). Now, denote by
ε the probability to detect one event (detection area, efficiency of the detector, …).
The number of observed events n from the region Ω after an exposure time t and
detection probability ε will follow:

n ∼ Σ_{k=n}^{∞} Bi(n|k, ε) Po(k|λt) = Po(n|ελt)

The approach to the problem will be the same for other counting processes like,
for instance, events collected from a detector for a given integrated luminosity. We
suspect that the events observed in a particular region Ωo of the sky are background
events together with those from an emitting source. To determine the significance
of the potential source we analyze a nearby region, Ωb, to infer the expected
background. If after a time tb we observe nb events from this region with detection
probability εb then, defining β = εb tb, we have that

nb ∼ Po(nb|λb β) = exp{−βλb} (βλb)^{nb} / Γ(nb + 1)

At Ωo we observe no events during a time to with a detection probability εo. Since
no = n1 + n2, with n1 ∼ Po(n1|λs α) signal events (α = εo to) and n2 ∼ Po(n2|λb α)
background events (assuming reasonably that εs = εb = εo in the same region), we
have that

no ∼ Σ_{n1=0}^{no} Po(n1|λs α) Po(no − n1|λb α) = Po(no|(λs + λb)α)

Now, we can do several things. We can assume for instance that the overall rate
from the region Ωo is λ, write no ∼ Po(no|αλ) and study the ratio λ/λb of the
rates from the information provided by the observations in the two different regions.
Then, reparameterizing the model in terms of θ = λ/λb and φ = λb, we have

p(no, nb|·) = Po(no|αλ) Po(nb|βλb) ∝ e^{−βφ(1+γθ)} θ^{no} φ^{no+nb}

where γ = α/β = (εs ts)/(εb tb). For the ordering {θ, φ} we have that the Fisher
matrix and its inverse are

I(θ, φ) = [[ γβφ/θ , γβ ], [ γβ , β(1 + γθ)/φ ]]  and  I⁻¹(θ, φ) = [[ θ(1 + γθ)/(γβφ) , −θ/β ], [ −θ/β , φ/β ]]

Then

π(θ, φ) = π(φ|θ) π(θ) ∝ φ^{−1/2} / √(θ(1 + γθ))

and, integrating out the nuisance parameter φ, we finally get:

p(θ|no, nb, γ) = γ^{no+1/2} θ^{no−1/2} / [ B(no + 1/2, nb + 1/2) (1 + γθ)^{no+nb+1} ]

From this:

E[θ^m] = (1/γ^m) · Γ(no + 1/2 + m) Γ(nb + 1/2 − m) / [ Γ(no + 1/2) Γ(nb + 1/2) ]  ⟶  E[θ] = (1/γ) · (no + 1/2)/(nb − 1/2)

and
 θ0
P(θ ≤ θ0 ) = p(θ|·) dθ = 1 − I B(n b + 1/2, n o + 1/2; (1 + γθ0 )−1 )
0
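The closed-form moments provide a useful cross-check of any numerical implementation of this posterior; a sketch (the values of no, nb and γ are invented, and the integration grid is an arbitrary numerical choice):

```python
from math import exp, lgamma, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def post_density(theta, n_o, n_b, g):
    """p(θ|no,nb,γ) = γ^{no+1/2} θ^{no-1/2} / [B(no+1/2, nb+1/2) (1+γθ)^{no+nb+1}]."""
    return exp((n_o + 0.5) * log(g) + (n_o - 0.5) * log(theta)
               - (n_o + n_b + 1) * log(1.0 + g * theta)
               - log_beta(n_o + 0.5, n_b + 0.5))

n_o, n_b, g = 12, 5, 1.0                       # invented observations
mean_closed = (n_o + 0.5) / (g * (n_b - 0.5))  # E[θ] from the formula above

# Cross-check E[θ] = ∫ θ p(θ|·) dθ by a simple Riemann sum on (0, 100)
h = 0.001
mean_num = sum(h * i * post_density(h * i, n_o, n_b, g)
               for i in range(1, 100000)) * h
```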

with IB(x, y; z) the Incomplete Beta Function. Had we interest in θ = λs/λb, the
corresponding reference prior would be

π(θ, φ) ∝ φ^{−1/2} / √((1 + θ)(δ + θ))  with  δ = (1 + γ)/γ

A different analysis can be performed to make inferences on λs. In this case, we
may consider as an informative prior for the nuisance parameter the posterior we
had from the study of the background in the region Ωb; that is:

p(λb|nb, β) ∝ exp{−βλb} λb^{nb−1/2}

and therefore:

p(λs|·) ∝ π(λs) ∫_0^∞ p(no|α(λs + λb)) p(λb|nb, β) dλb ∝ π(λs) e^{−αλs} Σ_{k=0}^{no} ak λs^{no−k}

where

ak = (no choose k) · Γ(k + nb + 1/2) / (α + β)^k

A reasonable choice for the prior will be a conjugate prior π(λs) = Ga(λs|a, b),
which simplifies the calculations and provides enough freedom to analyze the effect of
different shapes on the inferences. The same reasoning is valid if the knowledge on
λb is represented by a different p(λb|·) from, say, a Monte Carlo simulation. Usual


Fig. 2.8 90% Confidence Belt derived with Feldman and Cousins (filled band) and the Bayesian
HPD region (red lines) for a background parameter μb = 3

distributions in this case are the Gamma and the Normal with non-negative support.
Last, it is clear that if the rate of background events is known with high accuracy
then, with μi = αλi and π(μs) ∝ (μs + μb)^{−1/2}, we have

p(μs|·) = (1/Γ(x + 1/2, μb)) exp{−(μs + μb)} (μs + μb)^{x−1/2} 1_(0,∞)(μs)

with Γ(·, ·) the (upper) incomplete Gamma function.

As an example, we show in Fig. 2.8 the 90% HPD region obtained from the previous
expression (red lines) as a function of x for μb = 3 (conditions as given in the example
of [25]) and the Confidence Belt derived with the Feldman and Cousins approach
(filled band). In this case, μs,m = max{0, x − μb} and therefore, for a given μs:

Σ_{x=x1}^{x2} Po(x|μs + μb) = β  with  R(x|μs) = e^{(μs,m − μs)} ( (μs + μb)/(μs,m + μb) )^x > kβ

for all x ∈ [x1, x2].
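The Feldman–Cousins ordering for this Poisson case can be sketched in a few lines (`x_max` and the example numbers are arbitrary choices for the sketch):

```python
import math

def fc_interval(mu_s, mu_b, beta=0.90, x_max=60):
    """Feldman-Cousins acceptance interval [x1, x2] for x ~ Po(x|μs + μb)."""
    mu = mu_s + mu_b

    def po(x, m):
        return math.exp(-m + x * math.log(m) - math.lgamma(x + 1))

    probs, ratios = [], []
    for x in range(x_max):
        mu_best = max(0.0, x - mu_b)           # best signal rate for this x
        probs.append(po(x, mu))
        ratios.append(po(x, mu) / po(x, mu_best + mu_b))

    accepted, content = [], 0.0
    for x in sorted(range(x_max), key=lambda v: -ratios[v]):
        accepted.append(x)                     # add x values by decreasing R
        content += probs[x]
        if content >= beta:
            break
    return min(accepted), max(accepted)

x1, x2 = fc_interval(0.0, 3.0)   # acceptance region for μs = 0, μb = 3
```

Sweeping μs over a grid and collecting, for each observed x, the μs values whose acceptance region contains x reproduces the confidence belt of Fig. 2.8.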

Problem 2.14 In the search for a new particle, assume that the number of observed
events follows a Poisson distribution with μb = 0.7 known with enough precision
from extensive Monte Carlo simulations. Consider the hypotheses H0: {μs = 0} and
H1: {μs ≠ 0}. It is left as an exercise to obtain the Bayes Factor BF01 with the proper
prior π(μs|μb) = μb(μs + μb)^{−2} proposed in [26], P(H1|n) and the BIC difference
Δ01 as a function of n = 1, …, 7, and to decide when, based on these results, you would
consider that there is evidence for a signal.

2.13.3 Anisotropies of Cosmic Rays

The angular distribution of cosmic rays in galactic coordinates is analyzed searching
for possible anisotropies. A well-behaved real function f(θ, φ) ∈ L²(Ω), with
(θ, φ) ∈ Ω = [0, π] × [0, 2π], can be expressed in the real harmonics basis as:

f(θ, φ) = Σ_{l=0}^{∞} Σ_{m=−l}^{l} alm Ylm(θ, φ)  where  alm = ∫_Ω f(θ, φ) Ylm(θ, φ) dμ;

alm ∈ ℝ and dμ = sin θ dθ dφ. The convention adopted for the spherical harmonic
functions is such that (orthonormal basis):

∫_Ω Ylm(θ, φ) Yl′m′(θ, φ) dμ = δll′ δmm′  and  ∫_Ω Ylm(θ, φ) dμ = √(4π) δl0

In consequence, a probability density function p(θ, φ) with support in Ω can be expanded as

p(θ, φ) = c00 Y00(θ, φ) + ∑_{l=1}^{∞} ∑_{m=−l}^{l} c_lm Y_lm(θ, φ)

The normalization imposes that c00 = 1/√(4π) so we can write

p(θ, φ|a) = (1/4π) (1 + a_lm Y_lm(θ, φ))

where l ≥ 1,

a_lm = 4π c_lm = 4π ∫_Ω p(θ, φ) Y_lm(θ, φ) dμ = 4π E_{p;μ}[Y_lm(θ, φ)]

and summation over repeated indices is understood. Obviously, for any (θ, φ) ∈ Ω we have that p(θ, φ|a) ≥ 0, so the set of parameters a is constrained to a compact support.
Even though we shall study the general case, we are particularly interested in
the expansion up to l = 1 (dipole terms) so, to simplify the notation, we redefine
the indices (l, m) = {(1, −1), (1, 0), (1, 1)} as i = {1, 2, 3} and, accordingly, the
coefficients a = (a1−1 , a10 , a11 ) as a = (a1 , a2 , a3 ). Thus:
p(θ, φ|a) = (1/4π) (1 + a1 Y1 + a2 Y2 + a3 Y3)
In this case, the condition p(θ, φ|a) ≥ 0 implies that the coefficients are bounded by the sphere a1² + a2² + a3² ≤ 4π/3 and therefore the coefficient of anisotropy

δ ≝ √(3/(4π)) (a1² + a2² + a3²)^{1/2} ≤ 1
There are no sufficient statistics for this model but the Central Limit Theorem applies and, given the large amount of data, the experimental observations can be cast in the statistic a = (a1, a2, a3) such that15

p(a|μ) = ∏_{i=1}^{3} N(ai|μi, σi²)

with V(ai) = 4π/n known and with negligible correlations (ρij ≈ 0).
Consider then a k-dimensional random quantity Z = {Z1, . . . , Zk} and the distribution

p(z|μ, σ) = ∏_{j=1}^{k} N(zj|μj, σj²)

The interest is centered on the Euclidean norm ||μ||, with dim{μ} = k, and its square; in particular, in

δ = √(3/(4π)) ||μ||  for k = 3   and   Ck = ||μ||²/k

First, let us define Xj = Zj/σj and ρj = μj/σj, so Xj ∼ N(xj|ρj, 1), and make a transformation of the parameters ρj to spherical coordinates:

ρ1 = ρ cos φ1
ρ2 = ρ sin φ1 cos φ2
ρ3 = ρ sin φ1 sin φ2 cos φ3
..
.
ρk−1 = ρ sin φ1 sin φ2 . . . sin φk−2 cos φk−1
ρk = ρ sin φ1 sin φ2 . . . sin φk−2 sin φk−1

The Fisher’s matrix is the Riemann metric tensor so the square root of the determinant
is the k-dimensional volume element:
dV^k = ρ^{k−1} dρ dS^{k−1}

15 Essentially, a_lm = (4π/n) ∑_{i=1}^{n} Y_lm(θi, φi) for a sample of size n.

with

dS^{k−1} = sin^{k−2}φ1 sin^{k−3}φ2 · · · sin φ_{k−2} dφ1 dφ2 · · · dφ_{k−1} = ∏_{j=1}^{k−1} sin^{(k−1)−j}φj dφj

the (k − 1)-dimensional spherical surface element, with φ_{k−1} ∈ [0, 2π) and φ_{1,...,k−2} ∈ [0, π]. Our interest is in the parameter ρ, so we should consider the ordered parameterization {ρ; φ} with φ = {φ1, φ2, . . . , φ_{k−1}} the nuisance parameters. Since ρ and the φi are independent for all i, we shall take the surface element (that is, the determinant of the submatrix obtained for the angular part) as a proper prior density for
the nuisance parameters. As we have commented in Chap. 1, this is just the Lebesgue
measure on the k − 1 dimensional sphere (the Haar invariant measure under rotations)
and therefore the natural choice for the prior; in other words, a uniform distribution
on the k − 1 dimensional sphere. Thus, we start integrating the angular parameters.
Under the assumption that the variances σi² are all the same, and considering that

∫_0^π e^{±β cos θ} sin^{2ν}θ dθ = √π Γ(ν + 1/2) (2/β)^ν Iν(β)   for Re(ν) > −1/2

one gets p(φ|data) ∝ p(φm|φ) π(φ) where

p(φm|φ, ν) = b (φm/φ)^{ν/2} e^{−b(φ + φm)} Iν(2b√(φm φ))

is properly normalized,

ν = k/2 − 1 ;   φ = ||μ||² ;   φm = ||a||² ;   b = 1/(2σ²) = n/(8π)

and dim{μ} = dim{a} = k. This is nothing else but a non-central χ2 distribution.


From the series expansion of the Bessel functions it is easy to prove that this process is just a compound Poisson-Gamma process

p(φm|φ, ν) = ∑_{k=0}^{∞} Po(k|bφ) Ga(φm|b, ν + k + 1)

and therefore the sampling distribution is a Gamma-weighted Poisson distribution


with the parameter of interest that of the Poisson. From the Mellin Transform:

M(s)⌋_{(−ν,∞)} = [b e^{−bφ} Γ(s + ν) / (Γ(ν + 1) b^s)] M(s + ν, ν + 1, bφ)

with M(a, b, z) Kummer's function, one can easily get the moments (E[φm^n] = M(n + 1)); in particular

E[φm] = φ + b^{−1}(ν + 1)   and   V[φm] = 2φ b^{−1} + b^{−2}(ν + 1)
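The compound Poisson-Gamma representation and the two moments above can be verified by simulation (a sketch assuming numpy; note that numpy's gamma sampler uses the shape-scale convention, so Ga(φm|b, ν + k + 1) corresponds to shape ν + k + 1 and scale 1/b; the parameter values are arbitrary):

```python
import numpy as np

def sample_phi_m(phi, b, nu, size, rng):
    """Sample phi_m from the compound process:
    k ~ Po(k | b*phi), then phi_m ~ Ga(phi_m | b, nu + k + 1)."""
    k = rng.poisson(b * phi, size=size)
    return rng.gamma(nu + k + 1.0, 1.0 / b)

rng = np.random.default_rng(1)
phi, b, nu = 2.0, 4.0, 0.5
phi_m = sample_phi_m(phi, b, nu, size=400_000, rng=rng)

# Compare with E[phi_m] = phi + (nu+1)/b and V[phi_m] = 2*phi/b + (nu+1)/b**2
mean_th = phi + (nu + 1.0) / b
var_th = 2.0 * phi / b + (nu + 1.0) / b ** 2
```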

Now that we have the model p(φm|φ), let's go for the prior function π(φ) or π(δ). One may already guess what we shall get. The first (diagonal) element of the Fisher's matrix corresponds to the norm and is constant, so it would not be surprising to get the Lebesgue measure for the norm, dλ(δ) = π(δ)dδ = c dδ. As a second argument, for large sample sizes (n >> 1) we have b >> 1, so φm ∼ N(φm|φ, σ² = 2φ/b) and, to first order, Jeffreys' prior is π(φ) ∼ φ^{−1/2}. From the reference analysis, if we take for instance

π*(φ) = φ^{(ν−1)/2}

we end up, after some algebra, with

π(φ) ∝ π(φ0) lim_{k→∞} [fk(φ)/fk(φ0)] ∝ lim_{b→∞} (φ0/φ)^{1/2} exp{−3b(φ − φ0)/2 + [I(φ, b) − I(φ0, b)]}

where

I(φ, b) = ∫_0^∞ p(φm|φ) log[ Iν(2b√(φ φm)) / I_{ν/2}(b φm/2) ] dφm

and φ0 any interior point of the parameter space of φ, [0, ∞). From the asymptotic behavior of the Bessel functions one gets

π(φ) ∝ φ^{−1/2}

and therefore π(δ) = c. It is left as an exercise to get the same result with other priors like π*(φ) = c or π*(φ) = φ^{−1/2}.
For this problem, it is easier to derive the prior from the reference analysis. Nevertheless, the Fisher's information can be expressed as:

F(φ; ν) = b² [ −1 + (b e^{−bφ}/φ^{ν/2+1}) ∫_0^∞ e^{−bz} z^{ν/2+1} I²_{ν+1}(2b√(zφ))/Iν(2b√(zφ)) dz ]

and, for large b (large sample size), F(φ; ν) → φ^{−1} regardless of the number of degrees of freedom ν. Thus, Jeffreys' prior is consistent with the result from reference analysis. In fact, from the asymptotic behavior of the Bessel functions in the corresponding expressions of the pdf, one can already see that F(φ; ν) ∼ φ^{−1}. A cross-check from a numeric integration is shown in Fig. 2.9 where, for k = 3, 5, 7 (ν = 1/2, 3/2, 5/2), F(φ; ν) is depicted as a function of φ and compared to 1/φ for a sufficiently large
value of b.

[Fig. 2.9: Fisher's information (numeric integration) as a function of φ for k = 3, 5, 7 (discontinuous lines) and f(φ) = φ^{−1} (continuous line). All are scaled so that F(φ = 0.005, ν) = 1.]

Therefore we shall use π(φ) = φ^{−1/2} for the cases of interest (dipole,
quadrupole, … any-pole).
The posterior densities are:

• For φ = ||μ||²:  p(φ|φm, ν) = N e^{−bφ} φ^{−(ν+1)/2} Iν(2b√(φm φ))  with

N = Γ(ν + 1) b^{1/2−ν} φm^{−ν/2} / [√π M(1/2, ν + 1, b φm)]

The Mellin Transform is

Mφ(s)⌋_{(1/2,∞)} = Γ(s − 1/2) M(s − 1/2, ν + 1, b φm) / [√π b^{s−1} M(1/2, ν + 1, b φm)]

and therefore the moments

E[φ^n] = M(n + 1) = Γ(n + 1/2) M(n + 1/2, ν + 1, b φm) / [√π b^n M(1/2, ν + 1, b φm)]

In the limit |b φm| → ∞, E[φ^n] = φm^n.

• For ρ = ||μ||:  p(ρ|φm, ν) = 2N e^{−bρ²} ρ^{−ν} Iν(2b√(φm) ρ)  and

Mρ(s) = Mφ(s/2 + 1/2)  →  E[ρ^n] = Γ(n/2 + 1/2) M(n/2 + 1/2, ν + 1, b φm) / [√π b^{n/2} M(1/2, ν + 1, b φm)]


In the particular case that k = 3 (dipole; ν = 1/2), we have for δ = 3/4πρ that
the first two moments are:
erf (z) 1
E[δ] = E[δ 2 ] =
aδm M(1, 3/2, −z 2 ) a M(1, 3/2, −z 2 )

with z = 2δm bπ/3 and, when δm →0 we get
1
2 1.38 1 1.04
E[δ] = √ E[δ 2 ] = σδ √
πa n a n

and a one sided 95% upper credible region (see Sect. 2.11 for more details) of δ0.95 =
3.38
√ .
n
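In the δm → 0 limit the posterior of δ reduces, using the small-argument behavior of I_{1/2}, to a half-normal density with scale √(3/n); this identification is an inference from the quoted moments, but it reproduces them numerically (a sketch assuming scipy):

```python
import numpy as np
from scipy.stats import halfnorm

n = 10_000                                   # sample size of the data set
delta = halfnorm(scale=np.sqrt(3.0 / n))     # limiting posterior of delta

e_delta = delta.mean()       # ~ 1.38 / sqrt(n)
sigma_delta = delta.std()    # ~ 1.04 / sqrt(n)
upper95 = delta.ppf(0.95)    # one-sided 95% upper credible bound
```

The 95% quantile evaluates to 1.96 √(3/n) ≈ 3.39/√n, matching the quoted bound up to rounding.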
So far, the analysis has been done assuming that the variances σj² are of the same size (equal, in fact) and that the correlations are small. This is a very reasonable assumption but may not always be the case. The easiest way to proceed then is to perform a transformation of the parameters of interest (μ) to polar coordinates μ(ρ, Ω) and do a Monte Carlo sampling from the posterior:

p(ρ, Ω|z, Σ^{−1}) ∝ [ ∏_{j=1}^{k} N(zj|μj(ρ, Ω), Σ^{−1}) ] π(ρ) dρ dS^{k−1}

with a constant prior for δ, or π(φ) ∝ φ^{−1/2} for φ.

References

1. G. D’Agostini, Bayesian Reasoning in Data Analysis (World Scientific Publishing Co, Singa-
pore, 2003)
2. F. James, Statistical Methods in Experimental Physics (World Scientific Publishing Co, Sin-
gapore, 2006)
3. J.M. Bernardo, The concept of exchangeability and its applications. Far East J. Math. Sci. 4,
111–121 (1996). www.uv.es/~bernardo/Exchangeability.pdf
4. J.M. Bernardo, A.F.M. Smith, Bayesian Theory (Wiley, New York, 1994)
5. R.E. Kass, L. Wasserman, The selection of prior distributions by formal Rules. J. Am. Stat.
Assoc. V 91(453), 1343–1370 (1996)

6. H. Jeffreys, Theory of Probability (Oxford University Press, Oxford, 1939)


7. E.T. Jaynes, Prior Probabilities and Transformation Groups, NSF G23778 (1964)
8. V.I. Bogachev, Measure Theory (Springer, Berlin, 2006)
9. M. Stone, Right haar measures for convergence in probability to invariant posterior distribu-
tions. Ann. Math. Stat. 36, 440–453 (1965)
10. M. Stone, Necessary and sufficient conditions for convergence in probability to invariant pos-
terior distributions. Ann. Math. Stat. 41, 1349–1353 (1970)
11. H. Raiffa, R. Schlaifer, Applied Statistical Decision Theory (Harvard University Press, Cam-
bridge, 1961)
12. S.R. Dalal, W.J. Hall, J. R. Stat. Soc. Ser. B 45, 278–286 (1983)
13. B.L. Welch, H.W. Peers, J. R. Stat. Soc. Ser. B 25, 318–329 (1963)
14. M. Ghosh, R. Mukerjee, Biometrika 84, 970–975 (1984)
15. G.S. Datta, M. Ghosh, Ann. Stat. 24(1), 141–159 (1996)
16. G.S. Datta, R. Mukerjee, Probability Matching Priors and Higher Order Asymptotics (Springer,
New York, 2004)
17. J.M. Bernardo, J. R. Stat. Soc. Ser. B 41, 113–147 (1979)
18. J.O. Berger, J.M. Bernardo, D. Sun, Ann. Stat 37(2), 905–938 (2009)
19. J.M. Bernardo, J.M. Ramón, The Statistician 47, 1–35 (1998)
20. J.O. Berger, J.M. Bernardo, D. Sun, Objective priors for discrete parameter spaces. J. Am. Stat.
Assoc. 107(498), 636–648 (2012)
21. A. O'Hagan, J. R. Stat. Soc. Ser. B 57, 99–138 (1995)
22. J.O. Berger, L.R. Pericchi, J. Am. Stat. Assoc. V 91(433), 109–122 (1996)
23. R.E. Kass, A.E. Raftery, J. Am. Stat. Assoc. V 90(430), 773–795 (1995)
24. G. Schwarz, Ann. Stat. 6, 461–464 (1978)
25. G.J. Feldman, R.D. Cousins (1997). arXiv:physics/9711021v2
26. J.O. Berger, L.R. Pericchi, Ann. Stat. V 32(3), 841–869 (2004)
Chapter 3
Monte Carlo Methods

Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.
J. von Neumann

The Monte Carlo Method is a very useful and versatile numerical technique that allows one to solve a large variety of problems difficult to tackle by other procedures. Even though the central idea is to simulate experiments on a computer and make inferences from the "observed" sample, it is applicable to problems that do not have an explicit random nature; it is enough if they admit an adequate probabilistic formulation. In fact, a frequent use of Monte Carlo techniques is the evaluation of definite integrals that at first sight have no statistical nature but can be interpreted as expected values under some distribution.
Detractors of the method used to argue that one resorts to Monte Carlo Methods because of a manifest incapability to solve the problems by other, more academic, means. Well, consider a "simple" process in particle physics: e+e− → e+e−μ+μ−. Just four particles in the final state; the differential cross section in terms of eight variables that are not independent due to kinematic constraints. To see what we expect for a particular experiment, it has to be integrated within the acceptance region, with dead zones between subdetectors, different materials and resolutions that distort the momenta and energy, detection efficiencies,… Yes, admittedly, we are not able to get nice expressions. Nobody is, in fact, and Monte Carlo comes to our help. Last, it may be a truism, but it is worth mentioning that Monte Carlo is not a magic black box and will not give the answer to our problem out of nothing. It will simply present the available information in a different and more suitable manner after more or less complicated calculations are performed, but all the needed information has to be put in to start with, in some way or another.


In this lecture we shall present and justify essentially all the procedures that are commonly used in particle physics and statistics, leaving aside subjects like Markov Chains, which deserve a whole lecture by themselves and for which only the relevant properties will be stated without demonstration. A general introduction to Monte Carlo techniques can be found in [1].

3.1 Pseudo-Random Sequences

Sequences of random numbers {x1, x2, . . . , xn} are the basis of Monte Carlo simulations and, in principle, their production is equivalent to performing an experiment e(n) sampling n times the random quantity X ∼ p(x|θ). Several procedures have been
developed for this purpose (real experiments, dedicated machines, digits of transcen-
dental numbers,…) but, besides the lack of precise knowledge behind the generated
sequences and the need of periodic checks, the complexity of the calculations we are
interested in demands large sequences and fast generation procedures. We are then
forced to devise simple and efficient arithmetical algorithms to be implemented in a
computer. Obviously, neither are the sequences produced random, nor can we produce truly random sequences by arithmetical algorithms, but we really do not need them. It is enough for them to simulate the relevant properties of truly random sequences and to be such that, if I give you one of these sequences and no additional information, you won't be able to tell after a bunch of tests [2] whether it is a truly random sequence or not (at least for the needs of the problem at hand). That's why they are called pseudo-random although, in what follows, we shall call them random. The most popular (and
essentially the best) algorithms are based on congruential relations (used for this purpose as far back as the 1940s) together with binary and/or shuffling operations with some free parameters that have to be fixed before the sequence is generated. They are fast, easy to implement on any computer and, with an adequate initial setting of the parameters, produce very long sequences with sufficiently good properties. And the easiest and fastest pseudo-random distribution to generate on a computer is the Discrete Uniform.1
Thus, let’s assume that we have a good Discrete Uniform random number
generator2 although, as Marsaglia said, “A Random Number Generator is like sex:
When it is good it is wonderful; when it is bad… it is still pretty good”. Each
call in a computer algorithm will produce an output (x) that we shall represent as
x ⇐ U n(0, 1) and simulates a sampling of the random quantity X ∼ U n(x|0, 1).
Certainly, we are not very much interested in the Uniform Distribution so the task is
to obtain a sampling of densities p(x|θ) other than Uniform from a Pseudo-Uniform
Random Number Generator for which there are several procedures.

1 See [3, 4] for a detailed review on random and quasi-random number generators.
2 For the examples in this lecture I have used RANMAR [5] that can be found, for instance, at the CERN Computing Library.

Table 3.1 Estimation of π from a Binomial random process

Throws (N)    Accepted (n)    π*        σ*
100           83              3.3069    0.1500
1000          770             3.0789    0.0532
10000         7789            3.1156    0.0166
100000        78408           3.1363    0.0052
1000000       785241          3.1410    0.0016

Example 3.1 (Estimate the value of π) As a simple first example, let’s see how we
may estimate the value of π. Consider a circle of radius r inscribed in a square with
sides of length 2r . Imagine now that we throw random points evenly distributed
inside the square and count how many have fallen inside the circle. It is clear that
since the area of the square is 4r² and the area enclosed by the circle is πr², the
probability that a throw falls inside the circle is θ = π/4.

If we repeat the experiment N times, the number n of throws falling inside the circle follows a Binomial law Bi(n|N, θ) and therefore, having observed n out of N trials, we have that

p(θ|n, N) ∝ θ^{n−1/2} (1 − θ)^{N−n−1/2}

Let's take π* = 4E[θ] as point estimator and σ* = 4σθ as a measure of the precision.
The results obtained for samplings of different size are shown in Table 3.1.
It is interesting to see that the precision decreases with the sampling size as 1/√N. This dependence is a general feature of Monte Carlo estimations, regardless of the number of dimensions of the problem.
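The whole example can be sketched in a few lines (assuming numpy; the table above was obviously produced with a different random sequence, so only the orders of magnitude should be compared):

```python
import numpy as np

def estimate_pi(n_throws, rng):
    """Throw points uniformly in the square [-1, 1]^2 and count those inside
    the unit circle; n/N estimates theta = pi/4.  The posterior of theta is
    Be(n + 1/2, N - n + 1/2), whose mean and standard deviation give the
    point estimate and the precision."""
    x = rng.uniform(-1.0, 1.0, n_throws)
    y = rng.uniform(-1.0, 1.0, n_throws)
    n_in = np.count_nonzero(x * x + y * y <= 1.0)
    theta = (n_in + 0.5) / (n_throws + 1.0)                   # E[theta]
    sigma = np.sqrt(theta * (1.0 - theta) / (n_throws + 2.0))  # sigma_theta
    return 4.0 * theta, 4.0 * sigma

rng = np.random.default_rng(7)
pi_est, pi_err = estimate_pi(1_000_000, rng)
```

With 10^6 throws the precision is of the order of 0.002, as in the last row of Table 3.1.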
A similar problem is that of Buffon’s needle: A needle of length l is thrown at
random on a horizontal plane with stripes of width d > l. What is the probability that
the needle intersects one of the lines between the stripes? It is left as an exercise to show that, as given already by Buffon in 1777, Pcut = 2l/(πd). Laplace pointed out,
in what may be the first use of the Monte Carlo method, that doing the experiment
one may estimate the value of π “… although with large error”.
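Buffon's experiment itself is easy to simulate (a sketch assuming numpy): the distance from the needle's centre to the nearest line is uniform in [0, d/2] and its orientation uniform in [0, π); a cut occurs when the projection (l/2) sin θ reaches that distance.

```python
import numpy as np

def buffon(n_throws, l, d, rng):
    """Fraction of needle throws (length l < d) that cut a line;
    Buffon's result gives P_cut = 2*l / (pi*d)."""
    dist = rng.uniform(0.0, d / 2.0, n_throws)   # centre to nearest line
    theta = rng.uniform(0.0, np.pi, n_throws)    # needle orientation
    return np.mean((l / 2.0) * np.sin(theta) >= dist)

rng = np.random.default_rng(0)
p_cut = buffon(1_000_000, l=0.8, d=1.0, rng=rng)
p_exact = 2 * 0.8 / (np.pi * 1.0)
```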

3.2 Basic Algorithms

3.2.1 Inverse Transform

This is, at least formally, the easiest procedure. Suppose we want a sampling of the
continuous one-dimensional random quantity X ∼ p(x)3 so

3 Remember that if supp(X) = Ω ⊆ R, it is assumed that the density is p(x) 1_Ω(x).
P[X ∈ (−∞, x]] = ∫_{−∞}^{x} p(x′) dx′ = ∫_{−∞}^{x} dF(x′) = F(x)

Now, we define the new random quantity U = F(X) with support in [0, 1]. How is it distributed? Well,

F_U(u) ≡ P[U ≤ u] = P[F(X) ≤ u] = P[X ≤ F^{−1}(u)] = ∫_{−∞}^{F^{−1}(u)} dF(x′) = u

and therefore U ∼ Un(u|0, 1). The algorithm is then clear; at step i:

(i1) ui ⇐ Un(u|0, 1)
(i2) xi = F^{−1}(ui)

After repeating the sequence n times we end up with a sampling {x1, x2, . . . , xn} of X ∼ p(x).

Example 3.2 Let's see how we generate a sampling of the Laplace distribution X ∼ La(x|α, β) with α ∈ R, β ∈ (0, ∞) and density

p(x|α, β) = (1/2β) e^{−|x−α|/β} 1_{(−∞,∞)}(x)

The distribution function is

F(x) = ∫_{−∞}^{x} p(x′|α, β) dx′ = (1/2) exp[(x − α)/β]   if x < α ;
       1 − (1/2) exp[−(x − α)/β]   if x ≥ α

Then, if u ⇐ Un(0, 1):

x = α + β ln(2u)   if u < 1/2 ;   x = α − β ln(2(1 − u))   if u ≥ 1/2
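The two branches translate directly into code (a sketch assuming numpy; the sample moments can be checked against E[X] = α and V[X] = 2β²):

```python
import numpy as np

def laplace_inverse(alpha, beta, size, rng):
    """Sample La(x|alpha, beta) by the Inverse Transform rule above."""
    u = rng.uniform(0.0, 1.0, size)
    return np.where(u < 0.5,
                    alpha + beta * np.log(2.0 * u),
                    alpha - beta * np.log(2.0 * (1.0 - u)))

rng = np.random.default_rng(3)
x = laplace_inverse(alpha=1.0, beta=2.0, size=500_000, rng=rng)
# For La(x|1, 2): E[X] = 1 and V[X] = 2*beta^2 = 8
```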

The generalization of the Inverse Transform method to n-dimensional random quantities is trivial. We just have to consider the marginal and conditional distributions

F(x1, x2, . . . , xn) = Fn(xn|xn−1, . . . , x1) · · · F2(x2|x1) F1(x1)

or, for absolutely continuous quantities, the probability densities

p(x1, x2, . . . , xn) = pn(xn|xn−1, . . . , x1) · · · p2(x2|x1) p1(x1)



and then proceed sequentially; that is:


(i1) u1 ⇐ Un(u|0, 1) and x1 = F1^{−1}(u1);
(i2) u2 ⇐ Un(u|0, 1) and x2 = F2^{−1}(u2|x1);
(i3) u3 ⇐ Un(u|0, 1) and x3 = F3^{−1}(u3|x1, x2);
. . .
(in) un ⇐ Un(u|0, 1) and xn = Fn^{−1}(un|xn−1, . . . , x1)
If the random quantities are independent there is a unique decomposition

p(x1, x2, . . . , xn) = ∏_{i=1}^{n} pi(xi)   and   F(x1, x2, . . . , xn) = ∏_{i=1}^{n} Fi(xi)

but, if this is not the case, note that there are n! ways to do the decomposition and some may be easier to handle than others (see Example 3.3).

Example 3.3 Consider the probability density

p(x, y) = 2 e^{−x/y} 1_{(0,∞)}(x) 1_{(0,1]}(y)

We can express the Distribution Function as F(x, y) = F(x|y)F(y) where:

p(y) = ∫_0^∞ p(x, y) dx = 2y   →   F(y) = y²
p(x|y) = p(x, y)/p(y) = (1/y) e^{−x/y}   →   F(x|y) = 1 − e^{−x/y}

Both F(y) and F(x|y) are easy to invert so:

(i1) u ⇐ Un(0, 1) and get y = u^{1/2}
(i2) w ⇐ Un(0, 1) and get x = −y ln w
Repeating the algorithm n times, we get the sequence {(x1 , y1 ), (x2 , y2 ), . . .} that
simulates a sampling from p(x, y).

Obviously, we can also write F(x, y) = F(y|x)F(x) and proceed in an analogous manner. However,

p(x) = ∫_0^1 p(x, y) dy = 2x ∫_x^∞ e^{−u} u^{−2} du

is not so easy to sample.
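The first, convenient decomposition can be sketched in code (assuming numpy; the check uses E[Y] = ∫ y·2y dy = 2/3 and E[X] = E[E[X|Y]] = E[Y] = 2/3, since X|Y = y is exponential with mean y):

```python
import numpy as np

def sample_xy(size, rng):
    """Sample p(x, y) = 2 exp(-x/y) on (0, inf) x (0, 1] via
    F(y) = y^2 -> y = u^(1/2)  and  F(x|y) = 1 - exp(-x/y) -> x = -y ln w."""
    y = np.sqrt(rng.uniform(0.0, 1.0, size))
    x = -y * np.log(rng.uniform(0.0, 1.0, size))
    return x, y

rng = np.random.default_rng(5)
x, y = sample_xy(500_000, rng)
```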


Last, let's see how to use the Inverse Transform procedure for discrete random quantities. If X can take values in Ω_X = {x0, x1, x2, . . .} with probabilities P(X = xk) = pk, the Distribution Function will be:

F0 = P(X ≤ x0 ) = p0
F1 = P(X ≤ x1 ) = p0 + p1
F2 = P(X ≤ x2 ) = p0 + p1 + p2
...

Graphically, we can represent the sequence {0, F0, F1, F2, . . . , 1} as a partition of [0, 1] into consecutive subintervals [Fk−1, Fk] of length pk:

0 |—p0—| F0 |—p1—| F1 |—p2—| F2 · · · 1

Then, it is clear that a random quantity ui drawn from Un(u|0, 1) will determine a point in the interval [0, 1] and will belong to the subinterval [Fk−1, Fk] with probability pk = Fk − Fk−1, so we can set up the following algorithm:
(i1) Get ui ⇐ Un(u|0, 1);
(i2) Find the value xk such that Fk−1 < ui ≤ Fk
The sequence {x0, x1, x2, . . .} so generated will be a sampling of the probability law P(X = xk) = pk. Even though discrete random quantities can be sampled in this way, sometimes there are specific properties based on which faster algorithms can be developed. That is the case, for instance, for the Poisson Distribution, as the following example shows.
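Before that, the generic two-step search above can be sketched as follows (assuming numpy; `probs` is a hypothetical discrete law chosen only for illustration; `np.searchsorted` locates the subinterval [F_{k−1}, F_k] containing u):

```python
import numpy as np

def sample_discrete(probs, size, rng):
    """Inverse-transform sampling of P(X = x_k) = probs[k]: draw
    u ~ Un(0, 1) and locate k such that F_{k-1} < u <= F_k."""
    cdf = np.cumsum(probs)
    return np.searchsorted(cdf, rng.uniform(0.0, 1.0, size))

rng = np.random.default_rng(11)
probs = [0.1, 0.2, 0.3, 0.4]        # hypothetical discrete law
k = sample_discrete(probs, 200_000, rng)
freqs = np.bincount(k, minlength=len(probs)) / k.size
```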

Example 3.4 (Poisson Distribution Po(k|μ)) From the recurrence relation

pk = e^{−μ} μ^k/Γ(k + 1) = (μ/k) pk−1

(i 1 ) u i ⇐ U n(0, 1)
(i 2 ) Find the value k = 0, 1, . . . such that Fk−1 < u i ≤ Fk and deliver xk = k
For the Poisson Distribution, there is a faster procedure. Consider a sequence
of n independent random quantities {X 1 , X 2 , . . . , X n }, each distributed as X i ∼
U n(x|0, 1), and introduce a new random quantity


Wn = ∏_{k=1}^{n} Xk

with supp{Wn} = [0, 1]. Then

Wn ∼ p(wn|n) = (−log wn)^{n−1}/Γ(n)   →   P(Wn ≤ a) = [1/Γ(n)] ∫_{−log a}^{∞} e^{−t} t^{n−1} dt

and if we take a = e^{−μ} we have, in terms of the Incomplete Gamma Function P(a, x):
P(Wn ≤ e^{−μ}) = 1 − P(n, μ) = ∑_{k=0}^{n−1} e^{−μ} μ^k/Γ(k + 1) = Po(X ≤ n − 1|μ)

Therefore,
(i0) Set wp = 1;
(i1) ui ⇐ Un(0, 1) and set wp = wp ui;
(i2) Repeat step (i1) while wp > e^{−μ}; if the condition first fails after k calls, deliver x = k − 1
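The multiplicative algorithm in code (a sketch assuming numpy; for a Poisson law both the sample mean and the sample variance should approach μ):

```python
import numpy as np

def poisson_product(mu, rng):
    """One Poisson variate: multiply uniforms until the running product
    drops to or below exp(-mu); deliver (number of uniforms used) - 1."""
    limit = np.exp(-mu)
    k, w = 0, rng.uniform()
    while w > limit:
        k += 1
        w *= rng.uniform()
    return k

rng = np.random.default_rng(2)
mu = 3.5
sample = np.array([poisson_product(mu, rng) for _ in range(100_000)])
```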

Example 3.5 (Binomial Distribution Bi(k|N, θ)) From the recurrence relation

pk = C(N, k) θ^k (1 − θ)^{N−k} = [θ/(1 − θ)] [(N − k + 1)/k] pk−1

with p0 = (1 − θ)^N:
(i1) ui ⇐ Un(0, 1)
(i2) Find the value k = 0, 1, . . . , N such that Fk−1 < ui ≤ Fk and deliver xk = k

Example 3.6 (Simulation of the response of a Photomultiplier tube) Photomultiplier tubes are widely used devices to detect electromagnetic radiation by means of the
external photoelectric effect. A typical photomultiplier consists of a vacuum tube with
an input window, a photocathode, a focusing and a series of amplifying electrodes
(dynodes) and an electron collector (anode). Several materials are used for the input
window (borosilicate glass, synthetic silica,…) which transmit radiation in different
wavelength ranges and, due to absorptions (in particular in the UV range) and external
reflexions, the transmittance of the window is never 100%. Most photocathodes are
compound semiconductors consisting of alkali metals with a low work function.
When the photons strike the photocathode the electrons in the valence band are
excited and, if they get enough energy to overcome the vacuum level barrier, they are
emitted into the vacuum tube as photoelectrons. The trajectory of the electrons inside
the photomultiplier is determined basically by the applied voltage and the geometry
of the focusing electrode and the first dynode. Usually, the photoelectron is driven
towards the first dynode and originates an electron shower which is amplified in the
following dynodes and collected at the anode. However, a fraction of the incoming
photons passes through the photocathode and originates a smaller electron shower when it strikes the first dynode of the amplification chain.
To study the response of a photomultiplier tube, an experimental set-up has been
made with a LED as photon source. We are interested in the response for isolated
photons so we regulate the current and the frequency so as to have a low intensity
source. Under these conditions, the number of photons that arrive at the window of the photomultiplier is well described by a Poisson law. When one of these photons strikes
on the photocathode, an electron is ejected and driven towards the first dynode to start
the electron shower. We shall assume that the number of electrons so produced follows

also a Poisson law n gen ∼ Po(n|μ). The parameter μ accounts for the efficiency of
this first process and depends on the characteristics of the photocathode, the applied
voltage and the geometry of the focusing electrodes (essentially that of the first
dynode). It has been estimated to be μ = 0.25. Thus, we start our simulation with
(1) n gen ⇐ Po(n|μ)
electrons leaving the photocathode. They are driven towards the first dynode to start
the electron shower but there is a chance that they miss the first and start the shower
at the second. Again, the analysis of the experimental data suggests that this happens with probability pd2 ≈ 0.2. Thus, we have to decide how many of the n_gen electrons start the shower at the second dynode. A Binomial model is appropriate in this case:
(2) n_d2 ⇐ Bi(n_d2|n_gen, pd2) and therefore n_d1 = n_gen − n_d2.
Obviously, we shall do this second step if n gen > 0.
Now we come to the amplification stage. Our photomultiplier has 12 dynodes so
let’s see the response of each of them. For each electron that strikes upon dynode k
(k = 1, . . . , 12), n k electrons will be produced and directed towards the next element
of the chain (dynode k + 1), the number of them again well described by a Poisson
law Po(n k |μk ). If we denote by V the total voltage applied between the photocathode
and the anode and by Rk the resistance previous to dynode k we have that the current
intensity through the chain will be

I = V / ∑_{i=1}^{13} Ri

where we have considered also the additional resistance between the last dynode
and the anode that collects the electron shower. Therefore, the parameters μk are
determined by the relation

μk = a (I Rk)^b

where a and b are characteristic parameters of the photomultiplier. In our case we have
that N = 12, a = 0.16459, b = 0.75, a total applied voltage of 800 V and a resistance
chain of {2.4, 2.4, 2.4, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.2, 2.4} Ohms. It is easy
to see that if the response of dynode k to one electron is modeled as Po(n k |μk ), the
response to n i incoming electrons is described by Po(n k |n i μk ). Thus, we simulate
the amplification stage as:
(3.1) If n d1 > 0: do from k = 1 to 12:

μ = μk n d1 −→ n d1 ⇐ Po(n|μ)

(3.2) If n d2 > 0: do from k = 2 to 12:

μ = μk n d2 −→ n d2 ⇐ Po(n|μ)
[Fig. 3.1: Result of the simulation of the response of a photomultiplier tube. The histogram contains 10^6 events and shows the final ADC distribution, detailing the contribution of the pedestal and the response to 1, 2 and 3 incoming photons.]

Once this is done, we have to convert the number of electrons at the anode into ADC counts. The electron charge is Qe = 1.602176 × 10^{−19} C and in our set-up we have f_ADC = 2.1 × 10^{14} ADC counts per Coulomb so

ADC_pm = (n_d1 + n_d2) (Qe f_ADC)

Last, we have to consider the noise (pedestal). In our case, the number of pedestal
ADC counts is well described by a mixture model with two Normal densities

p_ped(x|·) = α N1(x|10., 1.) + (1 − α) N1(x|10., 1.5)

with α = 0.8. Thus, with probability α we obtain ADC_ped ⇐ N1(x|10, 1), and with probability 1 − α, ADC_ped ⇐ N1(x|10, 1.5), so the total number of ADC counts will be

ADC_tot = ADC_ped + ADC_pm

Obviously, if in step (1) we get n_gen = 0, then ADC_tot = ADC_ped. Figure 3.1 shows the result of the simulation for a sampling size of 10^6 together with the main contributions (1, 2 or 3 initial photoelectrons) and the pedestal. From these results, the parameters of the device can be adjusted (voltage, resistance chain, …) to optimize the response for our specific requirements.
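The full chain can be condensed into a short simulation (a sketch assuming numpy; all parameter values are the ones quoted in the text, and the event loop mirrors steps (1)-(3) plus the pedestal; a smaller sample size is used here to keep the run short):

```python
import numpy as np

# Parameters quoted in the text
MU_PC = 0.25       # mean photoelectrons produced at the photocathode
P_D2 = 0.2         # probability to miss dynode 1 and start at dynode 2
V_TOT = 800.0      # total applied voltage
R = np.array([2.4, 2.4, 2.4, 1.0, 1.0, 1.0, 1.0,
              1.0, 1.0, 1.0, 1.0, 1.2, 2.4])       # resistance chain
A, B = 0.16459, 0.75
Q_E, F_ADC = 1.602176e-19, 2.1e14                  # C, ADC counts per C

I_CHAIN = V_TOT / R.sum()
MU_K = A * (I_CHAIN * R[:12]) ** B                 # dynode gains mu_k

def one_event(rng):
    """Total ADC counts of one trigger: photocathode, 12 dynodes, pedestal."""
    n_gen = rng.poisson(MU_PC)                             # step (1)
    n_d2 = rng.binomial(n_gen, P_D2) if n_gen > 0 else 0   # step (2)
    n_d1 = n_gen - n_d2
    for k in range(12):                                    # step (3.1)
        if n_d1 == 0:
            break
        n_d1 = rng.poisson(MU_K[k] * n_d1)
    for k in range(1, 12):                                 # step (3.2)
        if n_d2 == 0:
            break
        n_d2 = rng.poisson(MU_K[k] * n_d2)
    adc_pm = (n_d1 + n_d2) * Q_E * F_ADC
    # pedestal: alpha*N(10, 1) + (1 - alpha)*N(10, 1.5) with alpha = 0.8
    sigma = 1.0 if rng.uniform() < 0.8 else 1.5
    return rng.normal(10.0, sigma) + adc_pm

rng = np.random.default_rng(4)
adc = np.array([one_event(rng) for _ in range(20_000)])
```

Since P(n_gen = 0) = e^{−0.25} ≈ 0.78, about four events out of five are pedestal-only, as the dominant peak in Fig. 3.1 suggests.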
The Inverse Transform method is conceptually simple and easy to implement for discrete distributions and many continuous distributions of interest. Furthermore, it is efficient in the sense that for each value ui generated as Un(u|0, 1) we get a value xi from F(x). However, with the exception of easy distributions, the inverse function F^{−1}(x) has no simple expression in terms of elementary functions and may be difficult or time consuming to invert. This is, for instance, the case if you attempt to invert the Error Function for the Normal Distribution. Thus, apart from simple cases, the Inverse Transform method is used in combination with other procedures to be described next.
NOTE 5: Bootstrap. Given the iid sample {x1, x2, . . . , xn} of the random quantity X ∼ p(x|θ), we know (Glivenko-Cantelli theorem; see Lecture 1 (7.6)) that:

Fn(x) = (1/n) ∑_{k=1}^{n} 1_{(−∞,x]}(xk)  →  F(x|θ)   uniformly

Essentially, the idea behind the bootstrap is to sample from the empirical Distribution Function Fn(x), which, as we have seen for discrete random quantities, is equivalent to drawing samplings {x1*, x2*, . . . , xn*} of size n from the original sample with replacement. Obviously, increasing the number of resamplings does not provide more information than what is contained in the original data but, used with good sense, each bootstrap sample will lead to a posterior and can also be useful to give insight about the form of the underlying model p(x|θ) and the distribution of some statistics. We refer to [6] for further details.
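A minimal resampling sketch (assuming numpy), here estimating the spread of the sample median, a statistic whose distribution has no simple closed form:

```python
import numpy as np

def bootstrap_stat(data, stat, n_boot, rng):
    """Draw n_boot samples of size len(data) with replacement from the
    empirical distribution F_n and evaluate the statistic on each."""
    data = np.asarray(data)
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    return np.array([stat(resample) for resample in data[idx]])

rng = np.random.default_rng(9)
x = rng.normal(0.0, 1.0, 400)                   # original sample
medians = bootstrap_stat(x, np.median, 2000, rng)
median_err = medians.std()                      # bootstrap spread of the median
```

For a Normal parent the asymptotic standard deviation of the median is √(π/2)/√n ≈ 0.063 for n = 400, which the bootstrap estimate should roughly reproduce.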

3.2.2 Acceptance-Rejection (Hit-Miss; J. Von Neumann 1951)

The Acceptance-Rejection algorithm is easy to implement and allows one to sample a large variety of n-dimensional probability densities with a less detailed knowledge of the function. But nothing is for free; these advantages come at the expense of the generation efficiency.
Let's start with the one dimensional case where X ∼ p(x|θ) is a continuous random quantity with supp(X) = [a, b] and pm = max_x p(x|θ). Consider now two independent random quantities X1 ∼ Un(x1|α, β) and X2 ∼ Un(x2|0, δ) where [a, b] ⊆ supp(X1) = [α, β] and [0, pm] ⊆ supp(X2) = [0, δ]. The covering does not necessarily have to be a rectangle in R² (nor a hypercube in R^{n+1}) and, in fact, in some cases it may be interesting to consider other coverings to improve the efficiency, but the generalization is obvious. Then

p(x1, x2|·) = [1/(β − α)] (1/δ)

Now, let's find the distribution of X1 conditioned to X2 ≤ p(X1|θ):

P(X1 ≤ x | X2 ≤ p(x|θ)) = P(X1 ≤ x, X2 ≤ p(x|θ)) / P(X2 ≤ p(x|θ))
  = [∫_α^x dx1 ∫_0^{p(x1|θ)} p(x1, x2|·) dx2] / [∫_α^β dx1 ∫_0^{p(x1|θ)} p(x1, x2|·) dx2]
  = [∫_α^x p(x1|θ) 1_{[a,b]}(x1) dx1] / [∫_α^β p(x1|θ) 1_{[a,b]}(x1) dx1] = ∫_a^x p(x1|θ) dx1 = F(x|θ)

so we set up the following algorithm:


(i1) ui ⇐ Un(u|α, β) with α ≤ a and β ≥ b, and wi ⇐ Un(w|0, δ) with δ ≥ pm;
(i2) If wi ≤ p(ui|θ) we accept xi = ui; otherwise we reject it and start again from (i1)
Repeating the algorithm n times we get the sampling {x1, x2, . . . , xn} from p(x|θ).
Besides its simplicity, the Acceptance-Rejection scheme does not require to have
normalized densities for it is enough to know an upper bound and in some cases,
for instance when the support of the random quantity X is determined by functional
relations, it is easier to deal with a simpler covering of the support. However, the
price to pay is a low generation efficiency:

de f. accepted trials area under p(x|θ)


 = = ≤ 1
total trials area of the covering

Note that the efficiency so defined refers only to the fraction of accepted trials and,
obviously, the tighter the covering the better. For the Inverse Transform ε = 1, but this
does not necessarily imply that it is more efficient attending to other
considerations. It is interesting to observe that if we do not know the normalization
factor of the density function, it can be estimated as

∫_X p(x|θ) dx ≃ (area of the covering) · ε

Let’s see some examples before we proceed.


Example 3.7 Consider X ∼ Be(x|α, β). In this case, what follows is just for pedagogical purposes, since other procedures to be discussed later are more efficient.
Anyway, the density is

p(x|α, β) ∝ x^(α−1) (1 − x)^(β−1);   x ∈ [0, 1]

Suppose that α, β > 1, so the mode xo = (α − 1)/(α + β − 2) exists and is unique.
Then

pm ≡ max_x {p(x|α, β)} = p(xo|α, β) = (α − 1)^(α−1) (β − 1)^(β−1) / (α + β − 2)^(α+β−2)

so let’s take then the domain [α = 0, β = 1] × [0, pm ] and


(i 1 ) Get xi ∼ U n(x|0, 1) and yi ∼ U n(y|0, pm );
(i 2 ) If
(a) yi ≤ p(xi |α, β) we deliver (accept) xi
(r) yi > p(xi |α, β) we reject xi and start again from (i 1 )
Repeating the procedure n times, we get a sampling {x1 , x2 , . . . , xn } from Be(x|α, β).
In this case we know the normalization, so the area under p(x|α, β) is B(α, β)
and the generation efficiency will be:
180 3 Monte Carlo Methods

ε = B(α, β) (α + β − 2)^(α+β−2) / [(α − 1)^(α−1) (β − 1)^(β−1)]
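A minimal Python sketch of this example (the function name is ours); for α, β > 1 the maximum pm of the unnormalized density is taken at the mode:

```python
import random

def sample_beta_ar(alpha, beta, n, rng=random.Random(7)):
    """Acceptance-Rejection sampling of Be(x|alpha, beta), alpha, beta > 1,
    using the unnormalized density x^(alpha-1) (1-x)^(beta-1)."""
    xo = (alpha - 1.0) / (alpha + beta - 2.0)              # mode
    pm = xo ** (alpha - 1.0) * (1.0 - xo) ** (beta - 1.0)  # p(xo)
    out = []
    while len(out) < n:
        x = rng.random()                                   # (i1): Un(x|0, 1)
        y = rng.uniform(0.0, pm)                           # (i1): Un(y|0, pm)
        if y <= x ** (alpha - 1.0) * (1.0 - x) ** (beta - 1.0):
            out.append(x)                                  # (a): accept
    return out

xs = sample_beta_ar(2.0, 3.0, 4000)
```

Here the sample mean approaches α/(α + β) = 0.4.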

Example 3.8 Let’s generate a sampling of the spatial distributions of a bounded


electron in a Hydrogen atom. In particular, as an example, those for the principal
quantum number n = 3. The wave-function is ψnlm (r, θ, φ) = Rnl (r )Ylm (θ, φ) with:

R30 ∝ (1 − 2r/2 − 2r 2 /27)e−r/3 ; R31 ∝ r (1 − r/6)e−r/3 and R32 ∝ r 2 e−r/3

the radial functions of the 3s, 3p and 3d levels and

|Y10|² ∝ cos²θ;  |Y1±1|² ∝ sin²θ;  |Y20|² ∝ (3cos²θ − 1)²;  |Y2±1|² ∝ cos²θ sin²θ;  |Y2±2|² ∝ sin⁴θ

the angular dependence from the spherical harmonics. Since dμ = r² sinθ dr dθ dφ,
the probability density will be

p(r, θ, φ|n, l, m) = R²nl(r) |Ylm|² r² sinθ = pr(r|n, l) pθ(θ|l, m) pφ(φ)

so we can sample r, θ and φ independently. It is left as an exercise to write down an
explicit sampling procedure. Note however that, for the marginal radial density, the mode
is at r = 13, 12 and 9 for l = 0, 1 and 2 respectively, and the density decreases
exponentially, so even if the support is r ∈ [0, ∞) it will be a reasonable approximation
to take a covering r ∈ [0, rmax) such that P(r ≥ rmax) is small enough. After n = 4000
samplings for the quantum numbers (n, l, m) = (3, 1, 0), (3, 2, 0) and (3, 2, ±1), the
projections on the planes πxy, πxz and πyz are shown in Fig. 3.2.

The generalization of the Acceptance-Rejection method to sample an n-dimensional
density X ∼ p(x|θ) with dim(x) = n is straightforward. Covering with an (n + 1)-dimensional hypercube:
(i1) Get a sampling {x_i^(1), x_i^(2), ..., x_i^(n); y_i} where

x_i^(k) ⇐ Un(x^(k)|α_k, β_k) for k = 1, ..., n;   y_i ⇐ Un(y|0, k) with k ≥ max_x p(x|θ)

(i2) Accept the n-tuple x_i = (x_i^(1), x_i^(2), ..., x_i^(n)) if y_i ≤ p(x_i|θ) or reject it otherwise.

3.2.2.1 Incorrect Estimation of maxx {p(x|·)}

Usually we know the support [α, β] of the random quantity, but the pdf is complicated
enough that we do not know its maximum. Then we start the generation with our best guess
for max_x p(x|·), say k1, and after having generated N1 events (generated, not accepted)
in [α, β] × [0, k1], ... wham!, we generate a value xm such that p(xm) > k1. Certainly,


Fig. 3.2 Spatial probability distributions of an electron in a hydrogen atom corresponding to the
quantum states (n, l, m) = (3, 1, 0), (3, 2, 0) and (3, 2, ±1) (columns 1, 2, 3) and projections (x, y)
and (x, z) = (y, z) (rows 1 and 2) (see Example 3.8)

our estimation of the maximum was not correct. A possible solution is to forget about
what has been generated and start again with the new maximum k2 = p(xm ) > k1
but, obviously, this is not desirable among other things because we have no guarantee
that this is not going to happen again. We had better keep what has been done and proceed
in the following manner:
(1) We have generated N1 pairs (x1, x2) in [α, β] × [0, k1] and, in particular, X2
uniformly in [0, k1]. How many additional pairs Na do we have to generate? Since
the density of pairs is constant in both domains [α, β] × [0, k1] and [α, β] × [0, k2]
we have that

N1 / [(β − α) k1] = (N1 + Na) / [(β − α) k2]   −→   Na = N1 (k2/k1 − 1)

(2) How do we generate them? Obviously in the domain [α, β] × [k1, k2], but from
the truncated density

pe(x|·) ∝ (p(x|·) − k1) 1_{p(x|·)>k1}(x)



(3) Once the Na additional events have been generated (out of which some have been
hopefully accepted) we continue with the usual procedure but on the domain
[α, β]×[0, k2 ].
The whole process is repeated as many times as needed.
NOTE 6: Weighted events.
The Acceptance-Rejection algorithm just explained is equivalent to:
(i 1 ) Sample xi from U n(x|α, β) and u i from U n(u|0, 1);
(i 2 ) Assign to each generated event xi a weight: wi = p(xi |·)/ pm ; 0 ≤ wi ≤1 and
accept the event if u i ≤ wi or reject it otherwise.
It is clear that:
• Events with a higher weight will have a higher chance to be accepted;
• After applying the acceptance-rejection criteria at step (i 2 ), all events will have a
weight either 1 if it has been accepted or 0 if it was rejected.
• The generation efficiency will be

ε = accepted trials / total trials (N) = (1/N) Σ_{i=1}^N w_i = w̄

In some cases it is interesting to keep all the events, accepted or not.
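This weighted-event formulation can be sketched as follows (the toy density and the names are ours):

```python
import random

rng = random.Random(3)
p = lambda x: 2.0 * x      # unnormalized toy density on [0, 1]
pm = 2.0                   # its maximum
N = 20000
weights = []
accepted = 0
for _ in range(N):
    x = rng.uniform(0.0, 1.0)     # (i1): x_i <- Un(0, 1)
    w = p(x) / pm                 # (i2): weight w_i in [0, 1]
    weights.append(w)
    if rng.random() <= w:         # keep the event with probability w_i
        accepted += 1
eff = accepted / N                # approaches the mean weight w-bar
wbar = sum(weights) / N
```

Here w̄ = E[p(X)/pm] = 1/2, so about half of the trials are accepted.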

Example 3.9 Let’s obtain a sampling {x 1 , x 2 , . . .}, dim(x) = n, of points inside a


n-dimensional sphere centered at x c and radius r . For a direct use of the Acceptance-
Rejection algorithm we enclose the sphere in a n-dimensional hypercube


n
Cn = [xic − r, xic + r ]
i=1

and:
(1) xi ⇐ U n(x|xic − r, xic + r ) for i = 1, . . . , n
(2) Accept x i if ρi = ||x i − x c || ≤ r and reject otherwise.
The generation efficiency will be

ε(n) = volume of the sphere / volume of the covering = 2 π^(n/2) / (n Γ(n/2) 2^n)

Note that the sequence {(x_i − x_c)/ρ_i} will be a sampling of points uniformly distributed
on the sphere of radius r = 1. This we can also get as:
(1) z i ⇐ N (z|0, 1) for i = 1, . . . , n
(2) ρ = ||z i || and x i = z i /ρ
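Both recipes of this example can be sketched in Python (the names are ours); the first samples uniformly inside the sphere by Acceptance-Rejection on the enclosing hypercube, the second generates points on the unit sphere from n standard normals:

```python
import math
import random

def in_ball(center, r, rng):
    """One point uniform inside the sphere ||x - x_c|| <= r (steps (1)-(2))."""
    while True:
        x = [rng.uniform(c - r, c + r) for c in center]   # point in the hypercube
        if math.dist(x, center) <= r:                     # rho_i <= r: accept
            return x

def on_unit_sphere(n, rng):
    """One point uniform on the unit sphere from n samples of N(z|0, 1)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    rho = math.sqrt(sum(v * v for v in z))
    return [v / rho for v in z]

rng = random.Random(11)
pts = [in_ball((0.0, 0.0, 0.0), 2.0, rng) for _ in range(2000)]
dirs = [on_unit_sphere(3, rng) for _ in range(2000)]
```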

Except for simple densities, the efficiency of the Acceptance-Rejection algorithm
is not very high and decreases quickly with the number of dimensions. For instance,
we have seen in the previous example that covering the n-dimensional sphere with a
hypercube has a generation efficiency

ε(n) = 2 π^(n/2) / (n Γ(n/2) 2^n)

and lim_{n→∞} ε(n) = 0. Certainly, sometimes we can refine the covering, since there
is no need other than simplicity for a hypercube (see Stratified Sampling), but in
general the origin of the problem will remain: when we generate points uniformly in
whatever domain, we are sampling with constant density regions that have a very low
probability content, or even zero when they have null intersection with the support
of the random quantity X. This happens, for instance, when we want to sample
from a differential cross-section that has very sharp peaks (sometimes spanning several
orders of magnitude, as in the case of bremsstrahlung). Then, the problem of having
a low efficiency is not just the time spent in the generation but also the accuracy and
convergence of the evaluations. We need a cleverer way to generate the sequences,
and the Importance Sampling method comes to our help.

3.2.3 Importance Sampling

The Importance Sampling generalizes the Acceptance-Rejection method, sampling
the density function with higher frequency in regions of the domain where the
probability of acceptance is larger (more important). Let's see the one-dimensional case,
since the extension to n dimensions is straightforward.
Suppose that we want a sampling of X ∼ p(x) with support Ω_X = [a, b] and
F(x) the corresponding Distribution Function. We can always express p(x) as:

p(x) = c g(x) h(x)

where:
(1) h(x) is a probability density function, i.e., non-negative and normalized in Ω_X;
(2) g(x) ≥ 0 ∀x ∈ Ω_X, with a finite maximum gm = max{g(x); x ∈ Ω_X};
(3) c > 0 a constant normalization factor.
Now, consider a sampling {x1 , x2 , . . . , xn } drawn from the density h(x). If we apply
the Acceptance-Rejection criteria with g(x), how are the accepted values distributed?
It is clear that, if gm = max g(x) and Y ∼ Un(y|0, gm),

P(X ≤ x | Y ≤ g(X)) = [∫_a^x ds h(s) ∫_0^g(s) dy] / [∫_a^b ds h(s) ∫_0^g(s) dy]
    = [∫_a^x h(s) g(s) ds] / [∫_a^b h(s) g(s) ds] = F(x)

and therefore, from a sampling of h(x) we get a sampling of p(x) applying
the Acceptance-Rejection with the function g(x). There are infinitely many options for
h(x). First, the simpler the better, for then the Distribution Function can be easily
inverted and the Inverse Transform applied efficiently. The Uniform Density
h(x) = Un(x|a, b) is the simplest one, but then g(x) ∝ p(x) and this is just the
Acceptance-Rejection over p(x). The second consideration is that h(x) should be a fairly
good approximation to p(x), so that g(x) ∝ p(x)/h(x) is as smooth as possible and
the Acceptance-Rejection efficient. Thus, if h(x) > 0 ∀x ∈ [a, b]:

p(x) dx = [p(x)/h(x)] h(x) dx = c g(x) dH(x)

3.2.3.1 Stratified Sampling

The Stratified Sampling is a particular case of the Importance Sampling where the
density p(x); x ∈ Ω_X is approximated by a simple function over Ω_X. Thus, in the
one-dimensional case, if Ω = [a, b) and we take the partition (stratification)

Ω = ∪_{i=1}^n Ω_i = ∪_{i=1}^n [a_{i−1}, a_i);   a_0 = a,  a_n = b

with measure λ(Ω_i) = (a_i − a_{i−1}), we have

h(x) = (1/n) Σ_{i=1}^n 1_{[a_{i−1}, a_i)}(x)/λ(Ω_i)   −→   ∫_{a_0}^{a_n} h(x) dx = 1

Denoting by pm(i) = max_x {p(x) | x ∈ Ω_i}, we have that for the Acceptance-Rejection
algorithm the volume of each sampling domain is Vi = λ(Ω_i) pm(i). In consequence,
for a partition of size n, if Z ∈ {1, 2, ..., n} and we define

P(Z = k) = Vk / Σ_{i=1}^n Vi;   F(k) = P(Z ≤ k) = Σ_{j=1}^k P(Z = j)

we get a sampling of p(x) from the following algorithm:


(i1) u_i ⇐ Un(u|0, 1) and select the partition k = min{ j | F( j) ≥ u_i };
(i2) x_i ⇐ Un(x|a_{k−1}, a_k), y_i ⇐ Un(y|0, pm(k)) and accept x_i if y_i ≤ p(x_i) (reject
otherwise).

3.2.4 Decomposition of the Probability Density

Sometimes it is possible to express in a simple manner the density function as a
linear combination of densities; that is,

p(x) = Σ_{j=1}^k a_j p_j(x);   a_j > 0 ∀ j = 1, 2, ..., k

that are easier to sample. Since normalization imposes that

∫_{−∞}^{∞} p(x) dx = Σ_{j=1}^k a_j ∫_{−∞}^{∞} p_j(x) dx = Σ_{j=1}^k a_j = 1

we can sample from p(x) by selecting, at each step i, one of the k densities p_j(x) with
probability a_j, from which we shall obtain x_i, and therefore sampling with higher
frequency from those densities that have a higher relative weight. Thus:
(i1) Select the density p_j(x) to be sampled at step (i2), with probability a_j;
(i2) Get x_i from the p_j(x) selected at (i1).
It may happen that some densities p_j(x) cannot be easily integrated, so we do
not know a priori the relative weights. If this is the case, we can sample from
f_j(x) ∝ p_j(x) and estimate the corresponding normalizations I_j with the generated
events; for instance, from the sample mean

I_i ≃ (1/n) Σ_{k=1}^n f_i(x_k)

Then, since p_i(x) = f_i(x)/I_i, we have that

p(x|·) = Σ_{i=1}^K a_i f_i(x|·) = Σ_{i=1}^K a_i I_i [f_i(x)/I_i] = Σ_{i=1}^K a_i I_i p_i(x)

so each generated event from f i (x) has a weight wi = ai Ii .


Example 3.10 Suppose we want to sample from the density

p(x) = (3/8) (1 + x²);   x ∈ [−1, 1]
Then, we can take:
 
p1(x) ∝ 1   −→ normalization −→   p1(x) = 1/2
p2(x) ∝ x²  −→ normalization −→   p2(x) = 3x²/2

so:
p(x) = (3/4) p1(x) + (1/4) p2(x)
Then:
(i 1 ) Get u i and wi as U n(u|0, 1);
(i 2 ) Get xi as:

if u i ≤ 3/4 then xi = 2 wi − 1
if u i > 3/4 then xi = (2 wi − 1)1/3

In this case, 75% of the times we sample from the trivial density U n(x| − 1, 1).
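Example 3.10 translates directly into code (a sketch, with our own names):

```python
import random

def sample_example(n, rng=random.Random(9)):
    """Sample p(x) = (3/8)(1 + x^2) on [-1, 1] by decomposition:
    with probability 3/4 draw from p1 = 1/2, else from p2 = 3x^2/2."""
    out = []
    for _ in range(n):
        u, w = rng.random(), rng.random()     # (i1)
        if u <= 0.75:
            out.append(2.0 * w - 1.0)         # inverse transform of p1
        else:
            v = 2.0 * w - 1.0                 # x = (2w - 1)^(1/3), sign kept
            out.append(abs(v) ** (1.0 / 3.0) * (1.0 if v >= 0.0 else -1.0))
    return out

xs = sample_example(20000)
```

By symmetry the sample mean is close to 0, and E[X²] = 2/5 under p.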

3.3 Everything at Work

3.3.1 The Compton Scattering

When problems start to get complicated, we have to combine several of the
aforementioned methods; in this case Importance Sampling, Acceptance-Rejection and
Decomposition of the probability density.
Compton Scattering is one of the main processes that occur in the interaction of
photons with matter. When a photon with an energy greater than the binding energy
interacts with one of the atomic electrons, it suffers an inelastic scattering
resulting in a photon of lower energy and different direction than the incoming one,
and an electron ejected from the atom. If we make the simplifying assumptions
that the atomic electron is initially at rest and neglect the binding energy, we have that
if the incoming photon has an energy Eγ, its energy E′γ after the interaction is:

ε = E′γ/Eγ = 1/[1 + a (1 − cosθ)]

where θ ∈ [0, π] is the angle between the momentum of the outgoing photon and
the incoming one and a = Eγ/me. It is clear that if the dispersed photon goes in the
forward direction (that is, with θ = 0) it will have the maximum possible energy
(ε = 1), and when it goes backwards (that is, θ = π) the smallest possible one
(ε = (1 + 2a)^(−1)). Being a two-body final state, given the energy (or the angle) of
the outgoing photon the rest of the kinematic quantities are determined uniquely:


E_e = Eγ (1 + 1/a − ε)   and   tan θe = cot(θ/2)/(1 + a)

The cross-section for the Compton Scattering can be calculated perturbatively in


Relativistic Quantum Mechanics resulting in the Klein-Nishina expression:

dσ0/dx = (3 σT/8) f(x)

where x = cos(θ), σT = 0.665 barn = 0.665 · 10^(−24) cm² is the Thomson cross-section and

f(x) = [1/(1 + a(1 − x))²] [ 1 + x² + a²(1 − x)²/(1 + a(1 − x)) ]

has all the angular dependence. Due to the azimuthal symmetry there is no explicit
dependence on φ ∈ [0, 2π], which has been integrated out. Last, integrating this
expression for x ∈ [−1, 1] we have the total cross-section of the process:

σ0(Eγ) = (3 σT/4) { [(1 + a)/a²] [2(1 + a)/(1 + 2a) − ln(1 + 2a)/a] + ln(1 + 2a)/(2a) − (1 + 3a)/(1 + 2a)² }

For a material with Z electrons, the atomic cross-section can be approximated by


σ = Z σ0 cm2 /atom.
Let’s see how to simulate this process sampling the angular distribution
p(x) ∼ f (x). Figure 3.3 (left) shows this function for incoming photon energies
of 10, 100 and 1000 MeV. It is clear that it is peaked at x values close to 1 and gets
sharper with the incoming energy; that is, when the angle between the incoming and
outgoing photon momentum becomes smaller. In consequence, for high energy pho-
tons the Acceptance-Rejection algorithm becomes very inefficient. Let’s then define
the functions
f_n(x) = 1/[1 + a (1 − x)]^n

and express f(x) as f(x) = (f1(x) + f2(x) + f3(x)) · g(x), where

g(x) = 1 − (2 − x²) f1(x)/[1 + f1(x) + f2(x)]

The functions f_n(x) are easy enough to use the Inverse Transform method and apply
afterwards the Acceptance-Rejection on g(x) > 0 ∀x ∈ [−1, 1]. The shape of this
function is shown in Fig. 3.3 (right) for different values of the incoming photon
energy; clearly it is much smoother than f(x), so the Acceptance-Rejection
will be significantly more efficient. Normalizing properly the densities,

p_i(x) = (1/w_i) f_i(x)   such that   ∫_{−1}^{1} p_i(x) dx = 1;   i = 1, 2, 3


Fig. 3.3 Functions f (x) (left) and g(x) (right) for different values of the incoming photon energy

we have that, with b = 1 + 2a:

w1 = ln(b)/a;   w2 = (b − 1)/(a b);   w3 = (b² − 1)/(2 a b²)

and therefore

f(x) = (f1(x) + f2(x) + f3(x)) · g(x) = (w1 p1(x) + w2 p2(x) + w3 p3(x)) · g(x)
     = wt (α1 p1(x) + α2 p2(x) + α3 p3(x)) · g(x)

where wt = w1 + w2 + w3 and

α_i = w_i/wt > 0;   i = 1, 2, 3   with   Σ_{i=1}^{3} α_i = 1

Thus, we set up the following algorithm:

(1) Generate u ⇐ Un(u|0, 1):
(1.1) if u ≤ α1 we sample xg ∼ p1(x);
(1.2) if α1 < u ≤ α1 + α2 we sample xg ∼ p2(x); and
(1.3) if α1 + α2 < u we sample xg ∼ p3(x);
(2) Generate w ⇐ Un(w|0, gM) where

gM ≡ max g(x) = g(x = −1) = 1 − b/(1 + b + b²)

If w ≤ g(xg) we accept xg; otherwise we go back to step (1).


Let’s see now to sample from the densities pi (x). If u ⇐ U n(u|0, 1) and
 x
Fi (x) = pi (s) ds i = 1, 2, 3
−1

then:
• x ∼ p1 (x): F1 (x) = 1 − ln(1 + a(1 − x)) −→ xg = 1 + aa − b
u
ln(b)
• x ∼ p2 (x): F2 (x) = b − 1 − 21a −→ xg = b − a1(b+ 2au− 1)
2 2
2 (b − x)
• x ∼ p3 (x): F3 (x) = 4a(11+ a) (b + 1)2 − 1 −→ xg = b − b+1
(b − x)2 [1 + 4a(1 + a)u]1/2
Once we have xg we can deduce the remaining quantities of interest from the
kinematic relations. In particular, the energy of the outgoing photon will be

εg = Eg/Eγ = 1/[1 + a (1 − xg)]

Last, we sample the azimuthal outgoing photon angle as φ ⇐ Un(φ|0, 2π).
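The whole angular sampler can be sketched in Python. The closed forms for the weights w_i and the inverse transforms are our own evaluations of the integrals of f_n(x), so treat the snippet as an illustration rather than the book's code:

```python
import math
import random

def compton_cos_theta(e_gamma, n, me=0.511, rng=random.Random(2)):
    """Sample x = cos(theta) from the Klein-Nishina shape f(x) via the
    decomposition f = (f1 + f2 + f3) * g with fn(x) = 1/(1 + a(1-x))^n."""
    a = e_gamma / me
    b = 1.0 + 2.0 * a
    w1 = math.log(b) / a                 # w_n = integral of f_n over [-1, 1]
    w2 = (b - 1.0) / (a * b)
    w3 = (b * b - 1.0) / (2.0 * a * b * b)
    wt = w1 + w2 + w3
    g_max = 1.0 - b / (1.0 + b + b * b)  # g(x = -1)
    out = []
    while len(out) < n:
        u, v = rng.random(), rng.random()
        if u <= w1 / wt:                                       # x ~ p1
            x = (1.0 + a - b ** v) / a
        elif u <= (w1 + w2) / wt:                              # x ~ p2
            x = 1.0 + 1.0 / a - b / (a * (1.0 + 2.0 * a * v))
        else:                                                  # x ~ p3
            x = 1.0 + 1.0 / a - b / (a * math.sqrt(1.0 + 4.0 * a * (1.0 + a) * v))
        f1 = 1.0 / (1.0 + a * (1.0 - x))
        g = 1.0 - (2.0 - x * x) * f1 / (1.0 + f1 + f1 * f1)    # f2 = f1^2
        if rng.uniform(0.0, g_max) <= g:                       # accept on g
            out.append(x)
    return out

xs = compton_cos_theta(1.0, 2000)    # E_gamma = 1 MeV
```

For a 1 MeV photon the sampled cos θ is strongly forward peaked, as Fig. 3.3 suggests.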


Even though in this example we are going to simulate only the Compton effect,
there are other processes by which the photon interacts with matter. At low energies
(essentially from the ionization energies up to Eγ ≈ 100 keV) the dominant interaction is the
photoelectric effect

γ + atom −→ atom⁺ + e⁻;

at intermediate energies (Eγ ∼ 1−10 MeV) the Compton effect

γ + atom −→ γ + e⁻ + atom⁺

and at high energies (E γ ≥ 100 MeV) the dominant one is pair production

γ + nucleus −→ e+ + e− + nucleus

To first approximation, the contribution of other processes is negligible. Then, at each


step in the evolution of the photon along the material we have to decide first which
interaction is going to occur next. The cross section is a measure of the interaction
probability expressed in cm2 so, since the total interaction cross section will be in
this case:

σt = σphot. + σCompt. + σpair

we decide upon the process i that is going to happen next with probability pi = σi /σt ;
that is, u ⇐ U n(0, 1) and
(1) if u ≤ pphot. we simulate the photoelectric interaction;
(2) if pphot. < u ≤ ( pphot. + pCompt. ): we simulate the Compton effect and otherwise
(3) we simulate the pair production
Once we have decided which interaction is going to happen next, we have to
decide where. The probability that the photon interacts after traversing a distance x
(cm) in the material is given by

Fint = 1 − e−x/λ

where λ is the mean free path. With A the atomic mass number of the material, N_A
Avogadro's number, ρ the density of the material in g/cm³, and σ the cross-section
of the process under discussion, we have that

λ = A/(ρ N_A σ)   [cm]

Thus, if u ⇐ U n(0, 1), the next interaction is going to happen at x = −λlnu along
the direction of the photon momentum.
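These two decisions — which process, and where — can be sketched as follows (the cross-section values and names here are arbitrary placeholders, not real data):

```python
import math
import random

def next_interaction(sigmas, lam, rng):
    """Pick the next process with probability sigma_i / sigma_t and the
    distance to it as x = -lambda * ln(u)."""
    total = sum(sigmas.values())
    u, acc = rng.random(), 0.0
    proc = list(sigmas)[-1]
    for name, s in sigmas.items():
        acc += s / total
        if u <= acc:
            proc = name
            break
    step = -lam * math.log(1.0 - rng.random())   # u in (0, 1]
    return proc, step

rng = random.Random(4)
xs = [next_interaction({"phot": 0.1, "compt": 0.8, "pair": 0.1}, 2.0, rng)
      for _ in range(20000)]
```

The step lengths average to λ, and each process is chosen in proportion to its cross-section.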
As an example, we are going to simulate what happens when a beam of photons
of energy Eγ = 1 MeV (X rays) impinges normally on the side of a rectangular block of
carbon (Z = 6, A = 12.01, ρ = 2.26) of 10 × 10 cm² surface and 20 cm depth. Behind
the block, centered on the surface and in contact with it, we have hidden an iron coin
(Z = 26, A = 55.85, ρ = 7.87) of 2 cm radius and 1 cm thickness. Last, at 0.5 cm
from the coin there is a photographic film that collects the incident photons.
The beam of photons is wider than the block of carbon, so some of them will
go right through without interacting and will burn the film. We have assumed for
simplicity that when the photon energy is below 0.01 MeV the photoelectric effect is
dominant and the ejected electron will be absorbed in the material. The photon will
then be lost and we shall start with the next one. Last, an irrelevant technical issue:

Fig. 3.4 The upper figure shows a sketch of the experimental set-up and the trajectory of one
of the simulated photons until it is detected on the screen. The lower figure shows the density of
photons collected at the screen for an initially generated sample of 10⁵ events

the angular variables of the photon after the interaction are referred to the direction
of the incident photon so in each case we have to do the appropriate rotation.
Figure 3.4 (up) shows the sketch of the experimental set-up and the trajectory
of one of the traced photons of the beam collected by the film. The radiography
obtained after tracing 100,000 photons is shown in Fig. 3.4 (down). The black zone
corresponds to photons that either go straight to the screen or at some point leave the

block before getting to the end. The mid zone corresponds to those photons that cross the carbon
block, and the central circle, with fewer interactions, to those that cross the carbon block
and afterwards the iron coin.

3.3.2 An Incoming Flux of Particles

Suppose we have a detector and we want to simulate a flux of isotropically distributed


incoming particles. It is obvious that generating them one by one in space and
tracing them backwards is extremely inefficient. Consider a large cubic volume V
that encloses the detector, both centered in the reference frame S0 . At time t0 , we
have for particles uniformly distributed inside this volume that:

p(r0) dμ0 = (1/V) dx0 dy0 dz0
Assume now that the velocities are isotropically distributed; that is:

p(v) dμv = (1/4π) sin θ dθ dφ f(v) dv

with f (v) properly normalized. Under independence of positions and velocities at
t0 , we have that:

p(r0, v) dμ0 dμv = (1/V) (1/4π) dx0 dy0 dz0 sin θ dθ dφ f(v) dv
V 4π

Given a square of surface S = (2l)2 , parallel to the (x, y) plane, centered at (0, 0, z c )
and well inside the volume V , we want to find the probability and distribution of
particles that, coming from the top, cross the surface S in unit time.
For a particle having a coordinate z 0 at t0 = 0, we have that z(t) = z 0 + v z t. The
surface S is parallel to the plane (x, y) at z = z c so particles will cross this plane at
time tc = (z c − z 0 )/v z from above iff:
(0) z 0 ≥ z c ; obvious for otherwise they are below the plane S at t0 = 0;
(1) θ ∈ [π/2, π); also obvious because if they are above S at t0 = 0 and cut the
plane at some t > 0, the only way is that v z = v cos θ < 0 → cos θ < 0 → θ ∈
[π/2, π).
But to cross the squared surface S of side 2l we also need that
(2) −l ≤ x(tc ) = x0 + v x tc ≤ l and −l ≤ y(tc ) = y0 + v y tc ≤ l
Last, we want particles crossing in unit time; that is tc ∈ [0, 1] so 0 ≤ tc = (z c −
z 0 )/v z ≤ 1 and therefore
(3) z 0 ∈ [z c , z c − v cos θ]
3.3 Everything at Work 193

Then, the desired subspace with conditions (1), (2), (3) is

Ωc = {θ ∈ [π/2, π); z0 ∈ [zc, zc − v cosθ]; x0 ∈ [−l − vx tc, l − vx tc]; y0 ∈ [−l − vy tc, l − vy tc]}

After integration:
∫_{zc}^{zc − v cosθ} dz0 ∫_{−l−vx tc}^{l−vx tc} dx0 ∫_{−l−vy tc}^{l−vy tc} dy0 = −(2l)² v cosθ

Thus, we have that for the particles crossing the surface S = (2l)2 from above in unit
time

p(θ, φ, v) dθ dφ dv = −[(2l)²/V] (1/4π) sin θ cos θ dθ dφ f(v) v dv

with θ ∈ [π/2, π) and φ ∈ [0, 2π). If we define the average velocity



E[v] = ∫ v f(v) dv

the probability to have a cut per unit time is

Pcut(tc ≤ 1) = ∫_{Ωc × Ωv} p(θ, φ, v) dθ dφ dv = S E[v]/(4V)

and the pdf for the angular distribution of velocities (direction of crossing particles)
is
p(θ, φ) dθ dφ = −(1/π) sin θ cos θ dθ dφ = (1/2π) d(cos²θ) dφ
If we have a density of n particles per unit volume, the expected number of crossings
per unit time due to the n V = n V particles in the volume is

nc = nV Pcut(tc ≤ 1) = S n E[v]/4
so the flux, i.e. the number of particles crossing the surface from one side per unit time and
unit surface, is

Φ0c = nc/S = n E[v]/4

Note that the requirement that the particles cross the square surface S in a finite time
(tc ∈ [0, 1]) modifies the angular distribution of the direction of particles. Instead of

p1 (θ, φ) ∝ sin θ; θ ∈ [0, π); φ ∈ [0, 2π)

we have

p2 (θ, φ) ∝ − sin θ cos θ; θ ∈ [π/2, π); φ ∈ [0, 2π)

The first one spans a solid angle of

∫_0^π dθ ∫_0^{2π} dφ p1(θ, φ) = 4π

while for the second one we have that

∫_{π/2}^{π} dθ ∫_0^{2π} dφ p2(θ, φ) = π

that is, one fourth of the solid angle spanned by the sphere. Therefore, the flux, expressed
as the number of particles crossing from one side of the square surface S per unit time
and solid angle, is

Φc = Φ0c/π = nc/(πS) = n E[v]/(4π)
Thus, if we generate a total of nT = 6nc particles on the surface of a cube,
each face of area S, with the angular distribution

p(θ, φ) dθ dφ = (1/2π) d(cos²θ) dφ

for each surface, with θ ∈ [π/2, π) and φ ∈ [0, 2π) defined with k the normal to the
surface, the equivalent generated flux per unit time, unit surface and solid angle is
ΦT = nT/(6πS), and corresponds to a density of n = 2nT/(3 S E[v]) particles per unit
volume.
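Since cos²θ and φ are uniform under this distribution, a crossing direction can be generated as follows (a minimal sketch; the function name is ours):

```python
import math
import random

def crossing_direction(rng):
    """(theta, phi) for a particle crossing the surface from above:
    cos^2(theta) ~ Un(0, 1) with cos(theta) < 0, phi ~ Un(0, 2*pi)."""
    cos_t = -math.sqrt(rng.random())      # theta in [pi/2, pi)
    return math.acos(cos_t), 2.0 * math.pi * rng.random()

rng = random.Random(10)
angles = [crossing_direction(rng) for _ in range(20000)]
```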
NOTE 7: Sampling some continuous distributions of interest
These are some procedures to sample from continuous distributions of interest. There
are several algorithms for each case with efficiency depending on the parameters but
those outlined here have in general high efficiency. In all cases, it is assumed that
u ⇐ U n(0, 1).

• Beta: Be(x|α, β); α, β ∈ (0, ∞):

p(x|·) = [1/B(α, β)] x^(α−1) (1 − x)^(β−1) 1_(0,1)(x)   −→   x = x1/(x1 + x2)

where x1 ⇐ Ga(x|1/2, α) and x2 ⇐ Ga(x|1/2, β)

• Cauchy: Ca(x|α, β); α ∈ R; β ∈ (0, ∞):

p(x|·) = (β/π)/[1 + β²(x − α)²] 1_(−∞,∞)(x)   −→   x = α + β^(−1) tan(π(u − 1/2))

• Chi-squared: For χ2 (x|ν) see Ga(x|1/2, ν/2).


• Dirichlet: Di(x|α); dim(x, α) = n, α_j ∈ (0, ∞), x_j ∈ (0, 1) and Σ_{j=1}^n x_j = 1:

p(x|α) = [Γ(α1 + ··· + αn)/(Γ(α1) ··· Γ(αn))] ∏_{j=1}^n x_j^(α_j − 1) 1_(0,1)(x_j)   −→   {x_j = z_j/z_0}_{j=1}^n

where z_j ⇐ Ga(z|1, α_j) and z_0 = Σ_{j=1}^n z_j.
Generalized Dirichlet: GDi(x|α, β); dim(β) = n, β_j ∈ (0, ∞), Σ_{j=1}^{n−1} x_j < 1:

p(x1, ..., x_{n−1}|α, β) = ∏_{i=1}^{n−1} [Γ(α_i + β_i)/(Γ(α_i) Γ(β_i))] x_i^(α_i − 1) (1 − Σ_{k=1}^{i} x_k)^(γ_i)

with

γ_i = β_i − α_{i+1} − β_{i+1}  for i = 1, 2, ..., n − 2   and   γ_{n−1} = β_{n−1} − 1

When β_i = α_{i+1} + β_{i+1} it reduces to the usual Dirichlet. If z_k ⇐ Be(z|α_k, β_k), then
x_k = z_k (1 − Σ_{j=1}^{k−1} x_j) for k = 1, ..., n − 1 and x_n = 1 − Σ_{i=1}^{n−1} x_i.

• Exponential: E x(x|α); α ∈ (0, ∞):

p(x|·) = α exp{−αx}1[0,∞) (x) −→ x = −α−1 lnu

• Gamma Distribution Ga(x|α, β); α, β ∈ (0, ∞).

The probability density is

p(x|α, β) = [α^β/Γ(β)] e^(−αx) x^(β−1) 1_(0,∞)(x)

Note that Z = αX ∼ Ga(z|1, β), so let's see how to simulate a sampling of Ga(z|1, β)
and, if α ≠ 1, take x = z/α. Depending on the value of the parameter β we have that:
– β = 1: This is the Exponential distribution Ex(x|1) already discussed;
– β = m ∈ N: As we know, the sum X_s = X_1 + ··· + X_n of n independent random
quantities X_i ∼ Ga(x_i|α, β_i); i = 1, ..., n is a random quantity distributed as
Ga(x_s|α, β_1 + ··· + β_n). Thus, if we have m independent samplings
x_i ⇐ Ga(x|1, 1) = Ex(x|1), that is

x_1 = −ln u_1, ..., x_m = −ln u_m

with u_i ⇐ Un(0, 1), then

x_s = x_1 + x_2 + ··· + x_m = −ln ∏_{i=1}^m u_i

will be a sampling of Ga(x_s|1, β = m).


– β > 1, β ∈ R: Defining m = [β], we have that β = m + δ with δ ∈ [0, 1). Then,
if u_i ⇐ Un(0, 1); i = 1, ..., m and w ⇐ Ga(w|1, δ),

z = −ln ∏_{i=1}^m u_i + w

will be a sampling from Ga(z|1, β). The problem is then reduced to getting a sampling
w ⇐ Ga(w|1, δ) with δ ∈ (0, 1).
– 0 < β < 1: In this case, for small values of x the density is dominated by p(x) ∼ x^(β−1)
and for large values by p(x) ∼ e^(−x). Let's then take the approximant

g(x) = x^(β−1) 1_(0,1)(x) + e^(−x) 1_[1,∞)(x)

Defining

p1(x) = β x^(β−1) 1_(0,1)(x)   −→   F1(x) = x^β
p2(x) = e^(−(x−1)) 1_[1,∞)(x)   −→   F2(x) = 1 − e^(−(x−1))

w1 = e/(e + β) and w2 = β/(e + β), we have that

g(x) ∝ w1 p1(x) + w2 p2(x)

and therefore:
(1) u_i ⇐ Un(0, 1); i = 1, 2, 3
(2) If u1 ≤ w1, set x = u2^(1/β) and accept x if u3 ≤ e^(−x); otherwise go to (1);
If u1 > w1, set x = 1 − ln u2 and accept x if u3 ≤ x^(β−1); otherwise go to (1);


The sequence of accepted values will simulate a sampling from Ga(x|1, β). It is
easy to see that the generation efficiency is

ε(β) = Γ(β + 1) e/(e + β)

and ε_min(β ≈ 0.8) ≈ 0.72.
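Putting the three β-cases together gives a complete sampler of Ga(x|α, β); the snippet below is our own sketch of it:

```python
import math
import random

def gamma_lt1(delta, rng):
    """One sample of Ga(x|1, delta) for 0 < delta < 1 (algorithm (1)-(2) above)."""
    w1 = math.e / (math.e + delta)
    while True:
        u1, u2, u3 = rng.random(), rng.random(), rng.random()
        if u1 <= w1:
            x = u2 ** (1.0 / delta)
            if u3 <= math.exp(-x):
                return x
        else:
            x = 1.0 - math.log(1.0 - u2)        # 1 - u2 in (0, 1]
            if u3 <= x ** (delta - 1.0):
                return x

def gamma_sample(alpha, beta, rng=random.Random(6)):
    """Ga(x|alpha, beta): sum of [beta] unit exponentials plus Ga(1, delta),
    rescaled by alpha."""
    m, delta = int(beta), beta - int(beta)
    z = sum(-math.log(1.0 - rng.random()) for _ in range(m))
    if delta > 0.0:
        z += gamma_lt1(delta, rng)
    return z / alpha

xs = [gamma_sample(2.0, 3.5) for _ in range(5000)]
```

With the text's parametrization the sample mean approaches β/α = 1.75 here.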

• Laplace: La(x|α, β); α ∈ R, β ∈ (0, ∞):

p(x|α, β) = (1/2β) e^(−|x−α|/β) 1_(−∞,∞)(x)   −→   x = α + β ln(2u) if u < 1/2;  x = α − β ln(2(1 − u)) if u ≥ 1/2

• Logistic: Lg(x|α, β); α ∈ R; β ∈ (0, ∞):

p(x|·) = β exp{−β(x − α)}/[1 + exp{−β(x − α)}]² 1_(−∞,∞)(x)   −→   x = α + β^(−1) ln[u/(1 − u)]

• Normal Distribution N (x|μ, V).


There are several procedures to generate samples from a Normal Distribution. Let's
start with the one-dimensional case X ∼ N(x|μ, σ), considering two independent
standardized random quantities X_i ∼ N(x_i|0, 1); i = 1, 2 [7] with joint density

p(x1, x2) = (1/2π) e^(−(x1² + x2²)/2)

After the transformation X1 = R cosΘ and X2 = R sinΘ with R ∈ [0, ∞); Θ ∈ [0, 2π) we have

p(r, θ) = (1/2π) e^(−r²/2) r

Clearly, both quantities R and Θ are independent and their distribution functions

F_r(r) = 1 − e^(−r²/2)   and   F_θ(θ) = θ/(2π)

are easy to invert so, using the Inverse Transform algorithm:
(1) u1 ⇐ Un(0, 1) and u2 ⇐ Un(0, 1);
(2) r = (−2 ln u1)^(1/2) and θ = 2π u2;
(3) x1 = r cosθ and x2 = r sinθ.
Thus, we get two independent samplings x1 and x2 from N (x|0, 1) and
(4) z 1 = μ1 + σ1 x1 and z 2 = μ2 + σ2 x2
will be two independent samplings from N (x|μ1 , σ1 ) and N (x|μ2 , σ2 ).
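Steps (1)-(4) are the polar (Box-Muller) transform; a direct sketch (names are ours):

```python
import math
import random

def box_muller(mu1, s1, mu2, s2, rng=random.Random(8)):
    """Two independent samplings from N(x|mu1, s1) and N(x|mu2, s2)."""
    u1 = 1.0 - rng.random()                  # u1 in (0, 1], so the log is finite
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))       # (2)
    theta = 2.0 * math.pi * u2
    x1, x2 = r * math.cos(theta), r * math.sin(theta)   # (3)
    return mu1 + s1 * x1, mu2 + s2 * x2                 # (4)

pairs = [box_muller(1.0, 2.0, -3.0, 0.5) for _ in range(20000)]
```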
For the n-dimensional case, X ∼ N (x|μ, V ) and V the covariance matrix, we
proceed from the conditional densities

p(x|μ, V ) = p(xn |xn−1 , xn−2 . . . , x1 ; ·) p(xn−1 |xn−2 . . . , x1 ; ·) · · · p(x1 |·)

For high dimensions this is a bit laborious and it is easier if we do first a bit of algebra.
We know from Cholesky’s Factorization Theorem that if V ∈ Rn×n is a symmetric
positive defined matrix there is a unique lower triangular matrix C, with positive
diagonal elements, such that V = CCT . Let then Y be an n-dimensional random
quantity distributed as N (y|0, I) and define a new random quantity

X = μ+CY

Then V−1 = [C−1 ]T C−1 and

Y T Y = (X − μ)T [C−1 ]T [C−1 ] (X − μ) = (X − μ)T [V−1 ] (X − μ)

After some algebra, the elements of the matrix C can be easily obtained as

C_i1 = V_i1/√V_11,   1 ≤ i ≤ n
C_ij = [V_ij − Σ_{k=1}^{j−1} C_ik C_jk]/C_jj,   1 < j < i ≤ n
C_ii = [V_ii − Σ_{k=1}^{i−1} C_ik²]^(1/2),   1 < i ≤ n

and, being lower triangular, C_ij = 0 ∀ j > i. Thus, we have the following algorithm:
(1) Get the matrix C from the covariance matrix V;
(2) Get n independent samplings z_i ⇐ N(0, 1) with i = 1, ..., n;
(3) Get x_i = μ_i + Σ_{j=1}^n C_ij z_j
In particular, for a two-dimensional random quantity we have that

V = ( σ1²     ρσ1σ2 )
    ( ρσ1σ2   σ2²   )

and therefore:

C11 = V11/√V11 = σ1;   C12 = 0
C21 = V21/√V11 = ρσ2;   C22 = (V22 − C21²)^(1/2) = σ2 (1 − ρ²)^(1/2)

so:

C = ( σ1     0                 )
    ( ρσ2    σ2 (1 − ρ²)^(1/2) )

Then, if z_{1,2} ⇐ N(z|0, 1), we have that:

x1 = μ1 + σ1 z1
x2 = μ2 + σ2 (z1 ρ + z2 (1 − ρ²)^(1/2))
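For the bivariate case the algorithm reduces to two lines of arithmetic (a sketch with our own names):

```python
import math
import random

def bivariate_normal(mu1, mu2, s1, s2, rho, n, rng=random.Random(12)):
    """Sample N(x|mu, V) for the 2x2 covariance above via x = mu + C z."""
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = mu1 + s1 * z1
        x2 = mu2 + s2 * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        out.append((x1, x2))
    return out

xy = bivariate_normal(0.0, 0.0, 1.0, 2.0, 0.6, 30000)
```

The sample variances and correlation reproduce σ1², σ2² and ρ.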

• Pareto: Pa(x|α, β); α, β ∈ (0, ∞):

p(x|·) = αβ α x −(α+1) 1(β,∞) (x) −→ x = β u −1/α

• Snedecor: Sn(x|α, β); α, β ∈ (0, ∞):

p(x|·) ∝ x^(α/2−1) (β + αx)^(−(α+β)/2) 1_(0,∞)(x)   −→   x = (x1/α)/(x2/β)

where x1 ⇐ Ga(x|1/2, α/2) and x2 ⇐ Ga(x|1/2, β/2).

• Student: St(x|ν); ν ∈ (0, ∞):

p(x|·) ∝ [1 + x²/ν]^(−(ν+1)/2) 1_(−∞,∞)(x)   −→   x = [ν (u1^(−2/ν) − 1)]^(1/2) sin(2π u2)

where u_{1,2} ⇐ Un(0, 1).

• Uniform: Un(x|a, b); a < b ∈ R:

p(x|·) = (b − a)^(−1) 1_[a,b](x)   −→   x = a + (b − a) u

• Weibull: We(x|α, β); α, β ∈ (0, ∞):

p(x|·) = α β^(−α) x^(α−1) exp{−(x/β)^α} 1_(0,∞)(x)   −→   x = β (−ln u)^(1/α).

3.4 Markov Chain Monte Carlo

With the methods we have used up to now we can simulate samples from distributions
that are more or less easy to handle. Markov Chain Monte Carlo allows us to sample
from more complicated distributions. The basic idea is to consider each sampling as
a state of a system that evolves in consecutive steps of a Markov Chain converging
(asymptotically) to the desired distribution. In its simplest version it was introduced
by Metropolis in the 1950s and generalized by Hastings in the 1970s.
Let’s start for simplicity with a discrete distribution. Suppose that we want a
sampling of size n from the distribution

P(X = k) = πk with k = 1, 2, . . . , N

that is, from the probability vector

π = (π_1, π_2, . . . , π_N);    π_i ∈ [0, 1] ∀ i = 1, . . . , N    and    Σ_{i=1}^{N} π_i = 1

and assume that it is difficult to generate a sample from this distribution by other
procedures. Then, we may start from a sample of size n generated from a simpler
distribution; for instance, a Discrete Uniform with

P_0(X = k) = 1/N;    ∀ k

and from the sample obtained {n_1, n_2, . . . , n_N}, where n = Σ_{i=1}^{N} n_i, we form the initial sample probability vector

π^{(0)} = (π_1^{(0)}, π_2^{(0)}, . . . , π_N^{(0)}) = (n_1/n, n_2/n, . . . , n_N/n)

Once we have the n events distributed in the N classes of the sample space  =
{1, 2, . . . , N } we just have to redistribute them according to some criteria in different
steps so that eventually we have a sample of size n drawn from the desired distribution
P(X = k) = πk .
We can consider the process of redistribution as an evolving system such that,
if at step i the system is described by the probability vector π (i) , the new state at
step i + 1, described by π (i+1) , depends only on the present state of the system (i)
and not on the previous ones; that is, as a Markov Chain. Thus, we start from the
state π (0) and the aim is to find a Transition Matrix P, of dimension N ×N , such that
π^{(i+1)} = π^{(i)} P and that allows us to reach the desired state π. The matrix P is

    ( P_11  P_12  · · ·  P_1N )
P = ( P_21  P_22  · · ·  P_2N )
    (  ⋮     ⋮    · · ·   ⋮   )
    ( P_N1  P_N2  · · ·  P_NN )

where each element (P)_{ij} = P(i→j) ∈ [0, 1] represents the probability for an event in class i to move to class j in one step. Clearly, at any step in the evolution the probability that an event in class i goes to any other class j = 1, . . . , N is 1, so

Σ_{j=1}^{N} (P)_{ij} = Σ_{j=1}^{N} P(i→j) = 1

and therefore P is a probability matrix. If the Markov Chain is:

(1) irreducible; that is, all the states of the system communicate among themselves;
(2) ergodic; that is, the states are:
(2.1) recurrent: being at one state we shall return to it at some point in the evolution with probability 1;
(2.2) positive: we shall return to it in a finite number of steps in the evolution;
(2.3) aperiodic: the system is not trapped in cycles;

then there is a stationary distribution π such that:

(1) π = π P;
(2) Starting at any arbitrary state π^{(0)} of the system, the sequence

π^{(0)}
π^{(1)} = π^{(0)} P
π^{(2)} = π^{(1)} P = π^{(0)} P²
. . .
π^{(n)} = π^{(0)} Pⁿ
. . .

converges asymptotically to the fixed vector π;

(3)

                ( π_1  π_2  · · ·  π_N )
lim_{n→∞} Pⁿ =  ( π_1  π_2  · · ·  π_N )
                (  ⋮    ⋮   · · ·   ⋮  )
                ( π_1  π_2  · · ·  π_N )
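Property (2) is easy to verify numerically; a tiny sketch with an arbitrary 3-state probability matrix (the matrix itself and the starting state are illustrative assumptions):

```python
import numpy as np

# An arbitrary 3x3 probability matrix: every row sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

pi = np.array([1.0, 0.0, 0.0])   # arbitrary starting state pi^(0)
for _ in range(200):
    pi = pi @ P                  # pi^(n) = pi^(0) P^n

print(pi)                        # the fixed vector, satisfying pi = pi P
```

After a couple of hundred iterations the vector no longer changes under multiplication by P, i.e. it satisfies π = πP to machine precision.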

There are infinitely many ways to choose the transition matrix P. A sufficient (although not necessary) condition for this matrix to describe a Markov Chain with fixed vector π is that the Detailed Balance condition is satisfied (i.e., a reversible evolution); that is

π_i (P)_{ij} = π_j (P)_{ji}    ⟺    π_i P(i→j) = π_j P(j→i)

It is clear that if this condition is satisfied, then π is a fixed vector since:

π P = ( Σ_{i=1}^{N} π_i (P)_{i1} , Σ_{i=1}^{N} π_i (P)_{i2} , . . . , Σ_{i=1}^{N} π_i (P)_{iN} ) = π

due to the fact that

Σ_{i=1}^{N} π_i (P)_{ik} = Σ_{i=1}^{N} π_k (P)_{ki} = π_k    for k = 1, 2, . . . , N

Imposing the Detailed Balance condition, we have freedom to choose the elements
(P)i j . We can obviously take (P)i j = π j so that it is satisfied trivially (πi π j = π j πi )
but this means that being at class i we shall select the new possible class j with
probability P(i→ j) = π j and, therefore, to sample directly the desired distribution
that, in principle, we do not know how to do. The basic idea of Markov Chain Monte
Carlo simulation is to take

(P)i j = q( j|i) · ai j

where
q( j|i): is a probability law to select the possible new class j = 1, . . . , N for an
event that is actually in class i;
ai j : is the probability to accept the proposed new class j for an event that is at
i taken such that the Detailed Balance condition is satisfied for the desired
distribution π.
Thus, at each step in the evolution, for an event that is in class i we propose
a new class j to go according to the probability q( j|i) and accept the transition
with probability ai j . Otherwise, we reject the transition and leave the event in the
class where it was. The Metropolis-Hastings [8] algorithm consists in taking the
acceptance function
 
a_{ij} = min{ 1, (π_j q(i|j)) / (π_i q(j|i)) }

It is clear that this election of a_{ij} satisfies the Detailed Balance condition. Indeed, if π_i q(j|i) > π_j q(i|j) we have that:

a_{ij} = min{ 1, (π_j q(i|j))/(π_i q(j|i)) } = (π_j q(i|j))/(π_i q(j|i))    and    a_{ji} = min{ 1, (π_i q(j|i))/(π_j q(i|j)) } = 1

and therefore:

π_i (P)_{ij} = π_i q(j|i) a_{ij} = π_i q(j|i) · (π_j q(i|j))/(π_i q(j|i)) = π_j q(i|j) = π_j q(i|j) a_{ji} = π_j (P)_{ji}

The same holds if π_i q(j|i) < π_j q(i|j), and the condition is trivial if both sides are equal. Clearly, if q(i|j) = π_i then a_{ij} = 1, so the closer the proposal is to the desired distribution the better.

A particularly simple case is to choose a symmetric probability q(j|i) = q(i|j) [9], for which

a_{ij} = min{ 1, π_j / π_i }

In both cases, it is clear that since the acceptance of the proposed class depends upon
the ratio π j /πi , the normalization of the desired probability is not important.
The previous expressions are directly applicable in the case we want to sample an absolutely continuous random quantity X ∼ π(x). If reversibility holds, p(x′|x) π(x) = p(x|x′) π(x′) and therefore

∫_X p(x′|x) dx′ = 1    →    ∫_X p(x′|x) π(x) dx = π(x′) ∫_X p(x|x′) dx = π(x′)

The transition kernel is expressed as

p(x′|x) ≡ p(x→x′) = q(x′|x) · a(x→x′)

and the acceptance probability is given by

a(x→x′) = min{ 1, (π(x′) q(x|x′)) / (π(x) q(x′|x)) }

Let’s see one example.

Example 3.11 (The Binomial Distribution) Suppose we want a sampling of size n of the random quantity X ∼ Bi(x|N, θ). Since x = 0, 1, . . . , N we have i = 1, 2, . . . , N + 1 classes and the desired probability vector, of dimension N + 1, is

π = (p_0, p_1, . . . , p_N)    where    p_k = P(X = k|·) = (N choose k) θ^k (1 − θ)^{N−k}

Let's take for this example N = 10 (that is, 11 classes), θ = 0.45 and n = 100,000. We start from a sampling of size n from a uniform distribution (Fig. 3.5(1)). At each step of the evolution we sweep over the n generated events. For an event that is in bin i we choose a new possible bin j to go to with uniform probability q(j|i). Suppose that we look at an event in bin i = 7 and choose j with equal probability among the 10 possible bins. If, for instance, j = 2, then we accept the move with probability

a_{72} = a(7→2) = min{ 1, π_2/π_7 } = 0.026

if, on the other hand, we have j = 6,



Fig. 3.5 Distributions at steps 0, 2 and 100 (1, 2, 3; blue) of the Markov Chain with the desired
Binomial distribution superimposed in yellow


a_{76} = a(7→6) = min{ 1, π_6/π_7 } = 1

so we make the move of the event. After two sweeps over the whole sample we have the distribution shown in Fig. 3.5(2) and after 100 sweeps that shown in Fig. 3.5(3), both compared to the desired distribution:

π^{(0)} = (0.091, 0.090, 0.090, 0.092, 0.091, 0.091, 0.093, 0.089, 0.092, 0.090, 0.092)
π^{(2)} = (0.012, 0.048, 0.101, 0.155, 0.181, 0.182, 0.151, 0.100, 0.050, 0.018, 0.002)
π^{(100)} = (0.002, 0.020, 0.077, 0.167, 0.238, 0.235, 0.159, 0.074, 0.022, 0.004, 0.000)
π = (0.000, 0.021, 0.076, 0.166, 0.238, 0.234, 0.160, 0.075, 0.023, 0.004, 0.000)
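The whole procedure of this example fits in a few lines; the sketch below uses a smaller sample (20,000 events, 30 sweeps) than the text so it runs quickly, and the seed is an arbitrary choice:

```python
import math
import random

N, theta, n_events = 10, 0.45, 20_000
p = [math.comb(N, k) * theta**k * (1 - theta)**(N - k) for k in range(N + 1)]

rng = random.Random(42)
bins = [rng.randrange(N + 1) for _ in range(n_events)]   # uniform start

for sweep in range(30):
    for e in range(n_events):
        i = bins[e]
        j = rng.randrange(N + 1)                  # uniform proposal q(j|i)
        if rng.random() < min(1.0, p[j] / p[i]):  # a_ij = min{1, pi_j/pi_i}
            bins[e] = j                           # accept the move

counts = [bins.count(k) / n_events for k in range(N + 1)]
print(counts)    # close to the Binomial probabilities p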


Fig. 3.6 Distributions of the mean value, variance and logarithmic discrepancy vs the number of
steps. For the first two, the red line indicates what is expected for the Binomial distribution

The evolution of the moments, in this case the mean value and the variance, with the number of steps is shown in Fig. 3.6, together with the Kullback-Leibler logarithmic discrepancy between each state π^{(n)} and the desired distribution π, defined as

δ_{KL}{π|π^{(n)}} = Σ_k π_k^{(n)} ln( π_k^{(n)} / π_k )

As the previous example shows, we have to let the system evolve for some steps (i.e., some initial sweeps for "burn-in" or "thermalization") to reach stable running conditions and get close to the stationary distribution after starting from an arbitrary state. Once this is achieved, each step in the evolution will be a sampling from the desired distribution, so we do not necessarily have to generate a sample of the desired

size to start with. In fact, we usually don't do that; we choose one admissible state and let the system evolve. Thus, for instance, if we want a sample of X ∼ p(x|·) with x ∈ Ω_X, we may start with a value x_0 ∈ Ω_X. At a given step i the system will be in the state {x} and at step i + 1 the system will be in a new state {x′} if we accept the change x → x′, or remain in the state {x} if we do not accept it. After thermalization, each trial will simulate a sampling of X ∼ p(x|·). Obviously, the sequence of states of the system is not independent so, if correlations are important for the evaluation of the quantities of interest, it is common practice to reduce them by taking for the evaluations one out of every few steps.
As for the thermalization steps, there is no universal criterion to tell whether stable conditions have been achieved. One may look, for instance, at the evolution of the discrepancy between the desired probability distribution and the probability vector of the state of the system, and at the moments of the distribution evaluated with a fraction of the last steps. More details about this are given in [10]. It is interesting also to look at the acceptance rate; i.e. the number of accepted new values over the number of trials. If the rate is low, the proposed new values are rejected with high probability (they are far away from the more likely ones) and therefore the chain will mix slowly. On the contrary, a high rate indicates that the steps are short, successive samplings move slowly around the space and therefore the convergence is slow. In both cases we should think about tuning the parameters of the generation.

Example 3.12 (The Beta distribution) Let's simulate a sample of size 10⁷ from a Beta distribution Be(x|4, 2); that is:

π(x) ∝ x³ (1 − x)    with x ∈ [0, 1]

In this case, we start from the admissible state {x = 0.3} and select a new possible state x′ from the density q(x′|x) = 2x′; not symmetric and independent of x. Thus we generate a new possible state as

F_q(x′) = ∫₀^{x′} q(s|x) ds = ∫₀^{x′} 2s ds = x′²    →    x′ = u^{1/2}  with u ⇐ Un(0, 1)

The acceptance function will then be

a(x→x′) = min{ 1, (π(x′) · q(x|x′)) / (π(x) · q(x′|x)) } = min{ 1, (x′² (1 − x′)) / (x² (1 − x)) }

depending on which we set the system at step i + 1 in the state x′ or leave it in x. After evolving the system for thermalization, the distribution is shown in Fig. 3.7, where we have taken one x value out of 5 consecutive ones. The red line shows the desired distribution Be(x|4, 2).

Fig. 3.7 Sampling of the Beta distribution Be(x|4, 2) (blue) of the Example 3.12 compared to the desired distribution (continuous line) (color figure online)
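Example 3.12 can be sketched compactly as follows (a smaller sample than the text's 10⁷ and an arbitrary seed; the thinning of one value out of 5 is kept):

```python
import random

def beta42_metropolis(n_draws, thin=5, burn=1000, seed=7):
    """Metropolis-Hastings for pi(x) ∝ x^3 (1-x) with independent proposal q(x'|x) = 2x'."""
    rng = random.Random(seed)
    x, out = 0.3, []                      # admissible starting state
    for step in range(burn + n_draws * thin):
        xp = rng.random() ** 0.5          # x' = u^(1/2) samples q
        ratio = (xp * xp * (1.0 - xp)) / (x * x * (1.0 - x))
        if rng.random() < min(1.0, ratio):
            x = xp                        # accept x -> x'
        if step >= burn and (step - burn) % thin == 0:
            out.append(x)
    return out

draws = beta42_metropolis(100_000)
print(sum(draws) / len(draws))    # Be(4,2) mean is 4/(4+2) ≈ 0.667
```

The sample mean reproduces the Be(4, 2) mean α/(α+β) = 2/3 within the Monte Carlo uncertainty.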

Example 3.13 (Path Integrals in Quantum Mechanics)

In Feynman's formulation of non-relativistic Quantum Mechanics, the probability amplitude to find a particle at x_f at time t_f when at t_i it was at x_i is given by

K(x_f, t_f|x_i, t_i) = ∫_{paths} e^{(i/ℏ) S[x(t)]} D[x(t)]

where the integral is performed over all possible trajectories x(t) that connect the initial state x_i = x(t_i) with the final state x_f = x(t_f), S[x(t)] is the classical action functional

S[x(t)] = ∫_{t_i}^{t_f} L(ẋ, x, t) dt

that corresponds to each trajectory, and L(ẋ, x, t) is the Lagrangian of the particle. All trajectories contribute to the amplitude with the same weight but different phase. In principle, small differences in the trajectories cause big changes in the action compared to ℏ and, due to the oscillatory nature of the phase, their contributions cancel. However, the action does not change, to first order, for the trajectories in a neighborhood of the one for which the action is extremal and, since they have similar phases (compared to ℏ), their contributions will not cancel. The set of trajectories around the extremal one that produce changes in the action of the order of ℏ defines the limits of classical mechanics and allows one to recover its laws expressed as the Extremal Action Principle.

The transition amplitude (propagator) allows one to get the wave-function Ψ(x_f, t_f) from Ψ(x_i, t_i) as:

Ψ(x_f, t_f) = ∫ K(x_f, t_f|x_i, t_i) Ψ(x_i, t_i) dx_i    for t_f > t_i

In non-relativistic Quantum Mechanics there are no trajectories evolving backwards in time, so in the definition of the propagator a Heaviside step function θ(t_f − t_i) is implicit. It is clear from this equation that K(x_f, t|x_i, t) = δ(x_f − x_i).
For a local Lagrangian (additive actions), it holds that:

K(x_f, t_f|x_i, t_i) = ∫ K(x_f, t_f|x, t) K(x, t|x_i, t_i) dx

an expression analogous to the Chapman-Kolmogorov equations satisfied by the conditional probabilities of a Markov process. If the Lagrangian is not local, the evolution of the system will depend on the intermediate states and this equation will not hold. On the other hand, if the Classical Lagrangian has no explicit time dependence the propagator admits an expansion (Feynman-Kac Expansion Theorem) in terms of a complete set of eigenfunctions {φ_n} of the Hamiltonian as:

K(x_f, t_f|x_i, t_i) = Σ_n e^{−(i/ℏ) E_n (t_f − t_i)} φ_n(x_f) φ_n(x_i)

where the sum is understood as a sum for discrete eigenvalues and as an integral for continuous eigenvalues. Last, remember that the expected value of an operator A(x) is given by:

⟨A⟩ = ∫ A[x(t)] e^{(i/ℏ) S[x(t)]} D[x(t)] / ∫ e^{(i/ℏ) S[x(t)]} D[x(t)]

Let’s see how to do the integral over paths to get the propagator in a one-
dimensional problem. For a particle that follows a trajectory x(t) between xi = x(ti )
and x f = x(t f ) under the action of a potential V (x(t)), the Lagrangian is:

1
L(ẋ, x, t) = m ẋ(t)2 − V (x(t))
2
and the corresponding action:
 tf
1
S[x(t)] = m ẋ(t)2 − V (x(t)) dt
ti 2

so we have for the propagator:



K (x f , t f |xi , ti ) = ei//h S[x(t)] D[x(t)] =
Tr
   tf 
i 1
= exp m ẋ(t) − V (x(t)) dt D[x(t)]
2
Tr h/ ti 2

Fig. 3.8 Trajectory in a space with discretized time

where the integral is performed over the set Tr of all possible trajectories that start at x_i = x(t_i) and end at x_f = x(t_f). Following Feynman, a way to perform these integrals is to make a partition of the interval (t_i, t_f) in N subintervals of equal length ε (Fig. 3.8); that is, with

ε = (t_f − t_i)/N    so that    t_j − t_{j−1} = ε;    j = 1, 2, . . . , N

Thus, if we identify t_0 = t_i and t_N = t_f we have that

[t_i, t_f) = ∪_{j=0}^{N−1} [t_j, t_{j+1})

On each interval [t_j, t_{j+1}), the possible trajectories x(t) are approximated by straight segments, so they are defined by the sequence

{x_0 = x_i = x(t_i), x_1 = x(t_1), x_2 = x(t_2), . . . , x_{N−1} = x(t_{N−1}), x_N = x_f = x(t_f)}

Obviously, the trajectories so defined are continuous but not differentiable, so we have to redefine the velocity. An appropriate prescription is to substitute ẋ(t_j) by

ẋ(t_j) → (x_j − x_{j−1}) / ε

so the action is finally expressed as:

S_N[x(t)] = Σ_{j=1}^{N} ε [ (1/2) m ((x_j − x_{j−1})/ε)² − V(x_j) ]

Last, the integral over all possible trajectories that start at x_0 and end at x_N is translated in this space with discretized time axis as an integral over the quantities x_1, x_2, . . . , x_{N−1}, so the differential measure for the trajectories D[x(t)] is substituted by

D[x(t)] → A_N Π_{j=1}^{N−1} dx_j

with A_N a normalization factor. The propagator is finally expressed as:

K_N(x_f, t_f|x_i, t_i) = A_N ∫dx_1 · · · ∫dx_{N−1} exp{ (i/ℏ) Σ_{j=1}^{N} [ (1/2) m ((x_j − x_{j−1})/ε)² − V(x_j) ] · ε }

After doing the integrals and taking the limit ε→0 (or N→∞ since the product Nε = (t_f − t_i) is fixed) we get the expression of the propagator. Last, note that the interpretation of the integral over trajectories as the limit of a multiple Riemann integral is valid only in Cartesian coordinates.
To derive the propagator from path integrals is a complex problem and there are few potentials that can be treated exactly (the "simple" Coulomb potential for instance was solved in 1982). The Monte Carlo method allows one to attack this type of problem satisfactorily, but first we have to convert the complex integrand into a positive real function. Since the propagator is an analytic function of time, it can be extended to the whole complex plane of t and we can then perform a rotation of the time axis (Wick's rotation) integrating along

τ = e^{iπ/2} t = i t;    that is,    t → −i τ

Taking as prescription the analytical extension over the imaginary time axis, the oscillatory exponentials are converted into decreasing exponentials, the results are consistent with those derived by other formulations (Schrödinger or Heisenberg for instance) and the analogy with the partition function of Statistical Mechanics is manifest. Then, the action is expressed as:

S[x(t)] → i ∫_{τ_i}^{τ_f} [ (1/2) m ẋ(t)² + V(x(t)) ] dt

Note that the integration limits are real, as corresponds to integrating along the imaginary axis and not merely making a change of variables. After partitioning the time interval, the propagator is expressed as:

K_N(x_f, t_f|x_i, t_i) = A_N ∫dx_1 · · · ∫dx_{N−1} exp{ −(1/ℏ) S_N(x_0, x_1, . . . , x_N) }

where

S_N(x_0, x_1, . . . , x_N) = Σ_{j=1}^{N} [ (1/2) m ((x_j − x_{j−1})/ε)² + V(x_j) ] · ε

and the expected value of an operator A(x) will be given by:

⟨A⟩ = ∫ Π_{j=1}^{N−1} dx_j A(x_0, x_1, . . . , x_N) exp{ −(1/ℏ) S_N(x_0, x_1, . . . , x_N) } / ∫ Π_{j=1}^{N−1} dx_j exp{ −(1/ℏ) S_N(x_0, x_1, . . . , x_N) }

Our goal is to generate N_gen trajectories with the Metropolis criteria according to

p(x_0, x_1, . . . , x_N) ∝ exp{ −(1/ℏ) S_N(x_0, x_1, . . . , x_N) }

Then, over these trajectories we shall evaluate the expected value of the operators of interest A(x)

⟨A⟩ = (1/N_gen) Σ_{k=1}^{N_gen} A(x_0, x_1^{(k)}, . . . , x_{N−1}^{(k)}, x_N)

Last, note that if we take (τ_f, x_f) = (τ, x) and (τ_i, x_i) = (0, x) in the Feynman-Kac expansion we have that

K(x, τ|x, 0) = Σ_n e^{−(1/ℏ) E_n τ} φ_n(x) φ_n(x)

and therefore, for sufficiently large times,

K(x, τ|x, 0) ≈ e^{−(1/ℏ) E_0 τ} φ_0(x) φ_0(x) + · · ·

so basically only the fundamental state will contribute.


Well, now we have everything we need. Let's apply all that first to a harmonic potential


Fig. 3.9 Potential wells studied in the Example 3.13

V(x) = (1/2) k x²

so the discretized action will be:

S_N(x_0, x_1, . . . , x_N) = Σ_{j=1}^{N} [ (1/2) m ((x_j − x_{j−1})/ε)² + (1/2) k x_j² ] · ε

To estimate the energy of the fundamental state we use the Virial Theorem. Since:

⟨T⟩ = (1/2) ⟨x · ∇V(x)⟩

we have that ⟨T⟩ = ⟨V⟩ and therefore

E = ⟨T⟩ + ⟨V⟩ = k ⟨x²⟩

In this example we shall take m = k = 1 (Fig. 3.9).
We start with an initial trajectory from x_0 = x(t_i) = 0 to x_f = x(t_f) = 0 and the intermediate values x_1, x_2, . . . , x_{N−1} drawn from Un(−10, 10), sufficiently large in this case since their support is (−∞, ∞). For the parameters of the grid we took ε = 0.25 and N = 2000. The parameter ε has to be small enough so that the results we obtain are close to what we should have for a continuous time, and N sufficiently large so that τ = Nε is large enough to isolate the contribution of the fundamental state. With this election we have that τ = 2000·0.25 = 500. Obviously, we have to check the stability of the result varying both parameters. Once the grid is fixed, we sweep over all the points x_1, x_2, . . . , x_{N−1} of the trajectory and for each x_j, j = 1, . . . , N − 1 we propose a new candidate x′_j. Then, taking ℏ = 1 we have that:

P(x_j → x′_j) = exp{ −S_N(x_0, x_1, . . . , x′_j, . . . , x_N) }

and

P(x′_j → x_j) = exp{ −S_N(x_0, x_1, . . . , x_j, . . . , x_N) }

so the acceptance function will be:

a(x_j → x′_j) = min{ 1, P(x_j → x′_j) / P(x′_j → x_j) }

Obviously we do not have to evaluate the sum over all the nodes because, when dealing with node j, only the intervals (x_{j−1}, x_j) and (x_j, x_{j+1}) contribute to the sum. Thus, at node j we have to evaluate

a(x_j → x′_j) = min{ 1, exp{ −S_N(x_{j−1}, x′_j, x_{j+1}) + S_N(x_{j−1}, x_j, x_{j+1}) } }

Last, the trajectories obtained with the Metropolis algorithm will eventually follow the desired distribution p(x_0, x_1, . . . , x_N) in the asymptotic limit. To have a reasonable approximation to that, we shall not use the first N_term trajectories (thermalization). In this case we have taken N_term = 1000 and again, we should check the stability of the result. After this, we have generated N_gen = 3000 and, to reduce correlations, we took one out of three for the evaluations; that is, N_used = 1000 trajectories, each one determined by N = 2000 nodes. The distribution of the accepted values x_j will be an approximation to the probability to find the particle at position x in the fundamental state; that is, |Ψ_0(x)|². Figure 3.10 shows the results of the simulation compared to

|Ψ_0(x)|² ∝ e^{−x²}

Fig. 3.10 Squared norm of the fundamental state wave-function for the harmonic potential and one
of the simulated trajectories


Fig. 3.11 Squared norm of the fundamental state wave-function for the quadratic potential and one
of the simulated trajectories

together with one of the many trajectories generated. The sampling average ⟨x²⟩ = 0.486 is a good approximation to the energy of the fundamental state E_0 = 0.5.
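The whole procedure for the harmonic well fits in a short program; the sketch below shrinks the grid (N = 100 instead of 2000) and the number of sweeps so that it runs in seconds — the grid size, proposal width and seed are my choices, not the text's:

```python
import math
import random

eps, N = 0.25, 100                 # time step and number of nodes (smaller than the text)
rng = random.Random(3)
x = [0.0] * (N + 1)                # initial trajectory with x_0 = x_N = 0

def S_local(xl, xm, xr):
    # Only the intervals (x_{j-1}, x_j) and (x_j, x_{j+1}) involve node j.
    return 0.5 * ((xm - xl) ** 2 + (xr - xm) ** 2) / eps + eps * 0.5 * xm * xm

x2_sum, n_meas = 0.0, 0
for sweep in range(3000):
    for j in range(1, N):
        xp = x[j] + rng.uniform(-1.0, 1.0)                 # proposed x'_j
        dS = S_local(x[j-1], xp, x[j+1]) - S_local(x[j-1], x[j], x[j+1])
        if dS < 0.0 or rng.random() < math.exp(-dS):       # Metropolis acceptance
            x[j] = xp
    if sweep >= 1000:                                      # thermalization
        x2_sum += sum(v * v for v in x[1:N]) / (N - 1)
        n_meas += 1

print(x2_sum / n_meas)   # E = k <x^2>, close to E_0 = 0.5
```

Even with this reduced grid the estimate of k⟨x²⟩ lands close to the exact ground-state energy 0.5, with small shifts from the finite ε and the fixed endpoints.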
As a second example, we have considered the potential well (Fig. 3.9)

V(x) = (a²/4) [ (x/a)² − 1 ]²

and, again from the Virial Theorem:

E = (3/(4a²)) ⟨x⁴⟩ − ⟨x²⟩ + a²/4

We took a = 5, a grid of N = 9000 nodes and ε = 0.25 (so τ = 2250) and, as before, N_term = 1000, N_gen = 3000 and N_used = 1000. From the generated trajectories we have the sample moments ⟨x²⟩ = 16.4264 and ⟨x⁴⟩ = 361.4756, so the estimated fundamental state energy is ⟨E_0⟩ = 0.668, to be compared with the exact result E_0 = 0.697. The norm of the wave-function for the fundamental state is shown in Fig. 3.11 together with one of the simulated trajectories exhibiting the tunneling between the two wells.

3.4.1 Sampling from Conditionals and Gibbs Sampling

In many cases, the distribution of an n-dimensional random quantity is either not known explicitly or difficult to sample directly, but sampling the conditionals is easier. In fact, sometimes it may help to introduce an additional random quantity and consider the conditional densities (see the Example 3.14). Consider then the n-dimensional random quantity X = (X_1, . . . , X_n) with density p(x_1, . . . , x_n), the usually simpler conditional densities

p(x1 |x2 , x3 . . . , xn )
p(x2 |x1 , x3 . . . , xn )
..
.
p(xn |x1 , x2 , . . . , xn−1 )

and an arbitrary initial value x^{(0)} = {x_1^{(0)}, x_2^{(0)}, . . . , x_n^{(0)}} ∈ Ω_X. If we take the approximating density q(x_1, x_2, . . . , x_n) and the conditional densities

q(x_1|x_2, x_3, . . . , x_n)
q(x_2|x_1, x_3, . . . , x_n)
. . .
q(x_n|x_1, x_2, . . . , x_{n−1})

we generate for x_1 a proposed new value x_1^{(1)} from q(x_1|x_2^{(0)}, x_3^{(0)}, . . . , x_n^{(0)}) and accept the change with probability

a(x_1^{(0)} → x_1^{(1)}) = min{ 1, [ p(x_1^{(1)}, x_2^{(0)}, . . . , x_n^{(0)}) q(x_1^{(0)}, x_2^{(0)}, . . . , x_n^{(0)}) ] / [ p(x_1^{(0)}, x_2^{(0)}, . . . , x_n^{(0)}) q(x_1^{(1)}, x_2^{(0)}, . . . , x_n^{(0)}) ] } =
= min{ 1, [ p(x_1^{(1)}|x_2^{(0)}, . . . , x_n^{(0)}) q(x_1^{(0)}|x_2^{(0)}, . . . , x_n^{(0)}) ] / [ p(x_1^{(0)}|x_2^{(0)}, . . . , x_n^{(0)}) q(x_1^{(1)}|x_2^{(0)}, . . . , x_n^{(0)}) ] }

After this step, let's denote the value of x_1 by x′_1 (that is, x′_1 = x_1^{(1)} or x′_1 = x_1^{(0)} if it was not accepted). Then, we proceed with x_2. We generate a proposed new value x_2^{(1)} from q(x_2|x′_1, x_3^{(0)}, . . . , x_n^{(0)}) and accept the change with probability

a(x_2^{(0)} → x_2^{(1)}) = min{ 1, [ p(x′_1, x_2^{(1)}, x_3^{(0)}, . . . , x_n^{(0)}) q(x′_1, x_2^{(0)}, x_3^{(0)}, . . . , x_n^{(0)}) ] / [ p(x′_1, x_2^{(0)}, x_3^{(0)}, . . . , x_n^{(0)}) q(x′_1, x_2^{(1)}, x_3^{(0)}, . . . , x_n^{(0)}) ] } =
= min{ 1, [ p(x_2^{(1)}|x′_1, x_3^{(0)}, . . . , x_n^{(0)}) q(x_2^{(0)}|x′_1, x_3^{(0)}, . . . , x_n^{(0)}) ] / [ p(x_2^{(0)}|x′_1, x_3^{(0)}, . . . , x_n^{(0)}) q(x_2^{(1)}|x′_1, x_3^{(0)}, . . . , x_n^{(0)}) ] }

After we run over all the variables, we are in a new state {x′_1, x′_2, . . . , x′_n} and repeat the whole procedure until we consider that stability has been reached, so that we are sufficiently close to sampling the desired density. The same procedure can be applied if we consider it more convenient to express the density

p(x_1, x_2, x_3, . . . , x_n) = p(x_n|x_{n−1}, . . . , x_2, x_1) · · · p(x_2|x_1) p(x_1)

Obviously, in this case we need only one admissible starting value x_1^{(0)}.



Gibbs sampling is a particular case of this approach and consists of sampling sequentially all the random quantities directly from the conditional densities; that is:

q(x_i|x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n) = p(x_i|x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n)

so the acceptance factor is a(x→x′) = 1. This is particularly useful for Bayesian inference since, in more than one dimension, densities are usually specified in conditional form after the ordering of parameters.

Example 3.14 Sometimes it may be convenient to introduce additional random quantities in the problem to ease the treatment. Look for instance at the Student's distribution X ∼ St(x|ν) with

p(x|ν) ∝ (1 + x²/ν)^{−(ν+1)/2}

Since

∫₀^∞ e^{−au} u^{b−1} du = Γ(b) a^{−b}

we can introduce an additional random quantity U ∼ Ga(u|a, b) in the problem with a = 1 + x²/ν and b = (ν + 1)/2 so that

p(x, u|ν) ∝ e^{−au} u^{b−1}    and    p(x) ∝ ∫₀^∞ p(x, u|ν) du ∝ a^{−b} = (1 + x²/ν)^{−(ν+1)/2}

The conditional densities are

p(x|u, ν) = p(x, u|ν)/p(u|ν) = N(x|0, σ);    σ² = ν (2u)^{−1}
p(u|x, ν) = p(x, u|ν)/p(x|ν) = Ga(u|a, b);    a = 1 + x²/ν ,  b = (ν + 1)/2

so, if we start with an arbitrary initial value x ∈ ℝ, we can

(1) Sample U|X: u ⇐ Ga(u|a, b) with a = 1 + x²/ν and b = (ν + 1)/2
(2) Sample X|U: x ⇐ N(x|0, σ) with σ² = ν (2u)^{−1}

and repeat the procedure so that, after equilibrium, X ∼ St(x|ν). We can obviously start from an admissible value u ∈ (0, ∞) and reverse the steps (1) and (2). Thus, instead of sampling from the Student's distribution we sample alternately from the conditional densities: a Normal and a Gamma distribution. Following this approach for ν = 2 and 10³ thermalization sweeps (far beyond the needs), the results of 10⁶ draws are shown in Fig. 3.12 together with the Student's distribution St(x|2).

Fig. 3.12 Sampling of the Student's distribution St(x|2) (blue) compared to the desired distribution (red)

Example 3.15 We have j = 1, . . . , J groups of observations, each with a sample of size n_j; that is, x_j = {x_{1j}, x_{2j}, . . . , x_{n_j j}}. Within each of the J groups, observations are considered an exchangeable sequence and assumed to be drawn from a distribution x_{ij} ∼ N(x|μ_j, σ²) where i = 1, . . . , n_j. Then:

p(x_j|μ_j, σ) = Π_{i=1}^{n_j} N(x_{ij}|μ_j, σ²) ∝ σ^{−n_j} exp{ − Σ_{i=1}^{n_j} (x_{ij} − μ_j)² / (2σ²) }

Then, for the J groups we have the parameters μ = {μ_1, . . . , μ_J} that, in turn, are also considered as an exchangeable sequence drawn from a parent distribution μ_j ∼ N(μ_j|μ, σ_μ²). We reparameterize the model in terms of η = σ_μ^{−2} and φ = σ^{−2} and consider conjugate priors for the parameters, considered independent; that is

π(μ, η, φ) = N(μ|μ_0, σ_0²) Ga(η|c, d) Ga(φ|a, b)

Introducing the sample means x̄_j = n_j^{−1} Σ_{i=1}^{n_j} x_{ij} and x̄ = J^{−1} Σ_{j=1}^{J} x̄_j, and defining μ̄ = J^{−1} Σ_{j=1}^{J} μ_j, after some simple algebra it is easy to see that the conditional densities are:
 
μ_j ∼ N( (n_j σ_μ² x̄_j + μ σ²)/(n_j σ_μ² + σ²) , (σ_μ² σ²)/(n_j σ_μ² + σ²) )

μ ∼ N( (σ_μ² μ_0 + σ_0² J μ̄)/(σ_μ² + J σ_0²) , (σ_μ² σ_0²)/(σ_μ² + J σ_0²) )

η = σ_μ^{−2} ∼ Ga( (1/2) Σ_{j=1}^{J} (μ_j − μ)² + c , J/2 + d )

φ = σ^{−2} ∼ Ga( (1/2) Σ_{j=1}^{J} Σ_{i=1}^{n_j} (x_{ij} − μ_j)² + a , (1/2) Σ_{j=1}^{J} n_j + b )

Thus, we set initially the parameters {μ0 , σ0 , a, b, c, d} and then, at each step
1. Get {μ1 , . . . , μ J } each as μ j ∼ N (·, ·)
2. Get μ ∼ N (·, ·)
3. Get σμ = η −1/2 with η ∼ Ga(·, ·)
4. Get σ = φ−1/2 with φ ∼ Ga(·, ·)
and repeat the sequence until equilibrium is reached and samplings for evaluations
can be done.
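The four steps can be sketched on synthetic data (the data, the hyperparameters {μ0, σ0, a, b, c, d} and the seed are invented for illustration; the text's Ga(rate, shape) maps to `gammavariate(shape, 1/rate)`):

```python
import random

rng = random.Random(5)
# Synthetic data: J = 3 groups of 50 draws from N(m, 1) with different means m.
data = [[rng.gauss(m, 1.0) for _ in range(50)] for m in (-1.0, 0.0, 2.0)]
J = len(data)
n = [len(xj) for xj in data]
xbar = [sum(xj) / len(xj) for xj in data]

mu0, s0sq = 0.0, 100.0            # prior N(mu|mu0, s0sq)
a, b, c, d = 1.0, 1.0, 1.0, 1.0   # Ga priors for phi and eta

mu_j = xbar[:]                    # initial values
mu, smusq, ssq = 0.0, 1.0, 1.0
ssq_draws = []

for step in range(3000):
    for j in range(J):                                   # 1. mu_j | rest
        den = n[j] * smusq + ssq
        mu_j[j] = rng.gauss((n[j] * smusq * xbar[j] + mu * ssq) / den,
                            (smusq * ssq / den) ** 0.5)
    mubar = sum(mu_j) / J                                # 2. mu | rest
    den = smusq + J * s0sq
    mu = rng.gauss((smusq * mu0 + s0sq * J * mubar) / den,
                   (smusq * s0sq / den) ** 0.5)
    rate = 0.5 * sum((m - mu) ** 2 for m in mu_j) + c    # 3. eta = 1/sigma_mu^2
    smusq = 1.0 / rng.gammavariate(J / 2 + d, 1.0 / rate)
    rate = 0.5 * sum((x - mu_j[j]) ** 2                  # 4. phi = 1/sigma^2
                     for j in range(J) for x in data[j]) + a
    ssq = 1.0 / rng.gammavariate(sum(n) / 2 + b, 1.0 / rate)
    if step >= 1000:                                     # keep post-thermalization draws
        ssq_draws.append(ssq)

print(sum(ssq_draws) / len(ssq_draws))   # posterior mean of sigma^2, near the true 1
```

With 150 observations generated at σ² = 1, the posterior mean of σ² lands close to 1, and the μ_j draws concentrate near the group means.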

3.5 Evaluation of Definite Integrals

A frequent use of Monte Carlo sampling is the evaluation of definite integrals. Certainly, there are many numerical methods for this purpose and for low dimensions they usually give a better precision when fairly compared. In those cases one rarely uses Monte Carlo… although sometimes the domain of integration has a very complicated expression and the Monte Carlo implementation is far easier. However, as we have seen, the uncertainty of Monte Carlo estimations decreases with the sampling size N as 1/√N regardless of the number of dimensions so, at some point, it becomes superior. And, besides that, it is fairly easy to estimate the accuracy of the evaluation. Let's see in this section the main ideas.
Suppose we have the n-dimensional definite integral

I = ∫_Ω f(x_1, x_2, . . . , x_n) dx_1 dx_2 . . . dx_n

where (x_1, x_2, . . . , x_n) ∈ Ω and f(x_1, x_2, . . . , x_n) is a Riemann integrable function. If we consider a random quantity X = (X_1, X_2, . . . , X_n) with distribution function P(x) and support in Ω, the mathematical expectation of Y = g(X) ≡ f(X)/p(X) is given by
  
E[Y] = ∫_Ω g(x) dP(x) = ∫_Ω (f(x)/p(x)) dP(x) = ∫_Ω f(x) dx = I

Thus, if we have a sampling {x_1, x_2, . . . , x_N}, of size N, of the random quantity X under P(x) we know, by the Law of Large Numbers, that as N → ∞ the sample means

I_N^{(1)} = (1/N) Σ_{i=1}^{N} g(x_i)    and    I_N^{(2)} = (1/N) Σ_{i=1}^{N} g²(x_i)

converge respectively to E[Y] (and therefore to I) and to E[Y²] (as for the rest, all needed conditions for existence are assumed to hold). Furthermore, if we define

S_I² = (1/N) [ I_N^{(2)} − (I_N^{(1)})² ]

we know by the Central Limit Theorem that the random quantity

Z = (I_N^{(1)} − I) / S_I

is, in the limit N→∞, distributed as N(x|0, 1). Thus, Monte Carlo integration provides a simple way to estimate the integral I and to quantify the accuracy. Depending on the problem at hand, one can envisage several tricks to further improve the accuracy. For instance, if g(x) is a function "close" to f(x), with the same support and known integral I_g, one can write

I = ∫_Ω f(x) dx = ∫_Ω ( f(x) − g(x) ) dx + I_g

and in consequence estimate the value of the integral as

Î = (1/N) Σ_{i=1}^{N} ( f(x_i) − g(x_i) ) + I_g

reducing the uncertainty.
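A minimal sketch of the plain estimator and its CLT-based error, for the one-dimensional test integral ∫₀¹ eˣ dx = e − 1 with uniform sampling, so that p(x) = 1 and g = f (the sample size and seed are arbitrary choices):

```python
import math
import random

rng = random.Random(9)
N = 200_000
s1 = s2 = 0.0
for _ in range(N):
    g = math.exp(rng.random())    # g(x) = f(x)/p(x) with p(x) = 1 on [0, 1]
    s1 += g
    s2 += g * g

I1, I2 = s1 / N, s2 / N                 # I_N^(1), I_N^(2)
S_I = math.sqrt((I2 - I1 * I1) / N)     # S_I^2 = (I_N^(2) - (I_N^(1))^2)/N
print(I1, "+/-", S_I, "  exact:", math.e - 1)
```

The estimate falls within a few S_I of the exact value e − 1 ≈ 1.71828, illustrating how the same run provides both the integral and its accuracy.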

References

1. F. James, Monte Carlo theory and practice. Rep. Prog. Phys. 43, 1145–1189 (1980)
2. D.E. Knuth, The Art of Computer Programming, vol. 2 (Addison-Wesley, Menlo Park, 1981)
3. F. James, J. Hoogland, R. Kleiss, Comput. Phys. Commun. 2–3, 180–220 (1999)
4. P. L’Ecuyer, Handbook of Simulations, Chap. 4 (Wiley, New York, 1998)

5. G. Marsaglia, A. Zaman, Toward a Universal Random Number Generator, Florida State University Report FSU-SCRI-87-50 (1987)
6. D.B. Rubin, Ann. Stat. 9, 130–134 (1981)
7. G.E.P. Box, M.E. Müller, A note on the generation of random normal deviates. Ann. Math.
Stat. 29(2), 610–611 (1958)
8. W.K. Hastings, Biometrika 57, 97–109 (1970)
9. N. Metropolis, A.W. Rosenbluth, M.W. Rosenbluth, A.H. Teller, E. Teller, J. Chem. Phys. 21,
1087–1092 (1953)
10. A.B. Gelman, J.S. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis (Chapman & Hall,
London, 1995)
Chapter 4
Information Theory

Sir, the reason is very plain; knowledge is of two kinds. We know
the subject ourselves, or we know where we can find information
upon it

S. Johnson

The ultimate goal of doing experiments and making observations is to learn about
the way nature behaves and, eventually, unveil the mathematical laws governing
the Universe and predict yet-unobserved phenomena. In less pedantic words, to get
information about the natural world. Information plays a relevant role in a large
number of disciplines (physics, mathematics, biology, image processing, …) and, in
particular, it is an important concept in Bayesian Inference. It is useful for instance
to quantify the similarities or differences between distributions and to evaluate the
different ways we have to analyse the observed data because, in principle, not all
of them provide the same amount of information on the same questions. The first
step will be to quantify the amount of information that we get from a particular
observation.

4.1 Quantification of Information

When we observe the result of an experiment, we get some amount of information


about the underlying random process. How can we quantify the information we have
received? Let’s start with a discrete random quantity X that can take the values
{x1 , x2 , . . . , xk , . . .} with probabilities pi = p(xi ); i = 1, 2, . . . It is reasonable to
assume that the information we get when we observe the event X = xi will depend on
its probability of occurrence pi = p(xi ); that is, I (xi ) = g( pi ). Now, if I tell you that I
have seen a lion in a photo-safari in Kenya, you will not be surprised. In fact, it is quite
natural, a very likely observation, and I hardly give you any valuable information.
However, if I tell you that I have seen a lone lion walking along Westminster Bridge,
you will be quite surprised. This is not expected; a very unlikely observation, worthy of
further investigation. I give you a lot of information. Surprise is Information. Thus,
it is also sensible to assume that if the probability for an event to occur is large
we receive a small amount of information and, conversely, if the probability is very
small we receive a large amount of information. In fact, if the event is a sure event,
$p(x_i) = 1$, its occurrence will provide no information at all. Therefore, we start by
assuming two reasonable hypotheses:

H1: $I(x_i) = f(1/p_i)$ with $f(x)$ a non-negative increasing function;
H2: $f(1) = 0$.
Now, imagine that we repeat the experiment n times under the same conditions
and obtain the sequence of independent events $\{x_1, x_2, \ldots, x_n\}$. We shall assume that
the information provided by this n-tuple of observed results is equal to the sum of
the information provided by each observation separately; that is:

$$ I(x_1, x_2, \ldots, x_n) \,=\, \sum_{i=1}^{n} I(x_i) $$

Being all independent, we have that $p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i)$ and, in conse-
quence:

$$ H_3: \qquad f\left( \frac{1}{p_1 \cdots p_n} \right) \,=\, \sum_{k=1}^{n} f\left( \frac{1}{p_k} \right) $$
Those are the three hypotheses we shall make. Since $p_i \in [0, 1]$, we have that $w_i =
1/p_i \in [1, \infty)$ and therefore we are looking for a function $f(w)$ such that:

(1) $f : w \in [1, \infty) \longrightarrow [0, \infty)$ and increasing;
(2) $f(1) = 0$;
(3) $f(w_1 \cdot w_2 \cdots w_n) = f(w_1) + \cdots + f(w_n)$

The third condition implies that $f(w^n) = n \, f(w)$ so, taking derivatives with respect
to $w$:

$$ w^n \, \frac{\partial f(w^n)}{\partial w^n} \,=\, w \, \frac{\partial f(w)}{\partial w} $$

and this has to hold for all $n \in \mathbb{N}$ and $w \in [1, \infty)$; hence, we can write

$$ w \, \frac{\partial f(w)}{\partial w} \,=\, c \qquad \longrightarrow \qquad f(w) \,=\, c \, \log w $$

with $c$ a positive constant and $f(w)$ an increasing function since $w \geq 1$. Taking
$c = 1$, we define:

• The amount of information we receive about the random process after we have
observed the occurrence of the event $X = x_i$ is:

$$ I(x_i) \,=\, \log \frac{1}{p(x_i)} \,=\, - \log p(x_i) $$

This is the expression Claude Shannon derived in 1948 [1] as a quantification
of information in the context of Communication Theory. The integration constant
$c$ determines the base of the logarithms and therefore the units of information. In
particular:

$$ I(x_i) \,=\, -\ln p(x_i) \;\mathrm{nats} \,=\, -\log_2 p(x_i) \;\mathrm{bits} \,=\, -\log_{10} p(x_i) \;\mathrm{hartleys} $$

and therefore: $1 \, \mathrm{nat} = \log_2 e \; \mathrm{bits} \; (\simeq 1.44) = \log_{10} e \; \mathrm{hartleys} \; (\simeq 0.43)$. In general,
the units will be irrelevant for us and we shall work with natural logarithms.

4.2 Expected Information and Entropy

The amount of information we receive after we have observed the event $X = x_i$,
$I(x_i) = -\log p(x_i)$, depends only on the probability of this particular event but
before we do the experiment we do not know which result we shall get; we know
only that each of the possible outcomes $\{x_1, x_2, \ldots, x_k, \ldots\}$ has a probability $p(x_i)$
to occur. Then, we define

• The amount of information we expect to get from the realization of the random
experiment is:

$$ I(X) \,=\, \sum_i p(x_i) \, I(x_i) \,=\, - \sum_i p(x_i) \, \ln p(x_i) $$

with the prescription lim x→0+ x ln x = 0. It is clear that the expected information
I (X ) does not depend on any particular result but on the probability distribution
associated to the random process.
We can look at this expression from another point of view. If from the experiment
we are going to do we expect to get the amount of information I (X ), before we do the
experiment we have a lack of information I (X ) relative to what we shall have after
the experiment is done. Interpreted in this way, the quantity I (X ) is called entropy
(H (X )) and quantifies the amount of ignorance about the random process that we
expect to reduce after the observation.

Example 4.1 Consider a binary random process described by a random quantity $X$
that can take the values $\{0, 1\}$ with probabilities $p$ and $1-p$ respectively. Then, if
we observe the result $X = 0$ we get $I(X = 0) = -\log p$ units of information and,
if we observe the event $X = 1$, we get $I(X = 1) = -\log(1-p)$. The information
we expect to get from the realization of the experiment $e(1)$ is:

$$ I(X) \,=\, - p \, \log p \,-\, (1 - p) \, \log(1 - p) $$

that, in the case the two results are equally likely ($p = 1/2$) and we take logarithms
in base 2, becomes:

$$ I \,=\, \frac{1}{2} \, \log_2 2 \,+\, \frac{1}{2} \, \log_2 2 \,=\, \log_2 2 \,=\, 1 \; \mathrm{bit} $$
Example 4.2 Consider a discrete random quantity with support on the finite set
$\{x_1, x_2, \ldots, x_n\}$ and probabilities $p_i = p(x_i)$, and consider an experiment that consists
in one observation of $X$. What is the distribution for which the lack of information
is maximal? Defining

$$ \phi(\mathbf{p}, \lambda) \,=\, - \sum_{i=1}^{n} p_i \, \ln p_i \,+\, \lambda \left( \sum_{i=1}^{n} p_i - 1 \right) $$

we have that

$$ \frac{\partial \phi(\mathbf{p}, \lambda)}{\partial p_i} \,=\, -\ln p_i - 1 + \lambda \,=\, 0 \;\longrightarrow\; p_i = e^{\lambda - 1} \,; \qquad
\frac{\partial \phi(\mathbf{p}, \lambda)}{\partial \lambda} \,=\, \sum_{i=1}^{n} p_i - 1 \,=\, 0 \;\longrightarrow\; e^{\lambda - 1} = \frac{1}{n} $$

and therefore $p(x_i) = 1/n$ for $i = 1, \ldots, n$. Thus the entropy, the lack of information, is maximal for the Discrete Uniform
Distribution and its value will be

$$ H_M(X) \,=\, \sum_{i=1}^{n} \frac{1}{n} \, \ln n \,=\, \ln n $$

Thus, for any random quantity with finite support of dimension $n$, $0 \leq H(X) \leq \ln n$,
and the maximum amount of information we expect to get from one observation is
$I(X) = \ln n$.

If the sample space is countable, it is obvious that the distribution that maximizes
the entropy cannot be the Discrete Uniform. Suppose that we know the expected
values of $k$ functions $\{f_j(x)\}_{j=1}^{k}$; that is:

$$ \mu_j \,=\, \sum_{i=1}^{\infty} p(x_i) \, f_j(x_i) \qquad \mathrm{with} \quad j = 1, 2, \ldots, k $$

and let's define

$$ \phi(\mathbf{p}, \lambda, \boldsymbol{\lambda}) \,=\, - \sum_{i=1}^{\infty} p_i \ln p_i \,+\, \lambda \left( \sum_{i=1}^{\infty} p_i - 1 \right) \,+\, \sum_{j=1}^{k} \lambda_j \left( \mu_j - \sum_{i=1}^{\infty} p_i \, f_j(x_i) \right) $$

with $\boldsymbol{\lambda} = \{\lambda_1, \ldots, \lambda_k\}$. Then:

$$ \frac{\partial \phi(\mathbf{p}, \lambda, \boldsymbol{\lambda})}{\partial p_i} \,=\, - \ln p_i - 1 + \lambda - \sum_{j=1}^{k} \lambda_j \, f_j(x_i) \,=\, 0
\;\longrightarrow\; p_i \,=\, \exp\{\lambda - 1\} \, \exp\left\{ - \sum_{j=1}^{k} \lambda_j \, f_j(x_i) \right\} $$

The derivative with respect to $\lambda$ gives the normalization condition

$$ \exp(\lambda - 1) \, \sum_{i=1}^{\infty} \exp\left( - \sum_{j=1}^{k} \lambda_j \, f_j(x_i) \right) \,=\, 1 $$

so if we define

$$ Z(\boldsymbol{\lambda}) \,=\, \left[ \sum_{i=1}^{\infty} \exp\left( - \sum_{j=1}^{k} \lambda_j \, f_j(x_i) \right) \right]^{-1} $$

we have that

$$ p_i \,=\, Z(\boldsymbol{\lambda}) \, \exp\left( - \sum_{j=1}^{k} \lambda_j \, f_j(x_i) \right) $$

The remaining Lagrange multipliers $\boldsymbol{\lambda}$ are determined by the conditions

$$ \mu_j \,=\, \sum_{i=1}^{\infty} p(x_i) \, f_j(x_i) \qquad \mathrm{with} \quad j = 1, 2, \ldots, k $$

Suppose for instance that $X$ may take the values $\{0, 1, 2, \ldots\}$ and we have only one
condition:

$$ \mu \,=\, E[X] \,=\, \sum_{n=0}^{\infty} p_n \, n $$

that is, $f(x_n) = n$. Then $p_n = Z(\lambda_1) \exp\{-\lambda_1 n\}$ and therefore

$$ \mu \,=\, \sum_{n=0}^{\infty} p_n \, n \,=\, Z(\lambda_1) \sum_{n=0}^{\infty} n \, \exp\{-\lambda_1 n\} \;\longrightarrow\; Z(\lambda_1) \,=\, \mu \, e^{\lambda_1} \, (1 - e^{-\lambda_1})^2 $$

Imposing finally that $\sum_{n=0}^{\infty} p_n = 1$ we get $\lambda_1 = \ln\left( \frac{1}{\mu} + 1 \right)$ and in consequence

$$ p_n \,=\, \frac{\mu^n}{(1 + \mu)^{1+n}} \qquad \mathrm{with} \quad n = 0, 1, 2, \ldots \quad \mathrm{and} \quad \mu > 0 $$

is the distribution for which the lack of information is maximal. In this case:

$$ H_M(X) \,=\, - \sum_{i} p(x_i) \, \ln p(x_i) \,=\, \ln\left[ \frac{(1 + \mu)^{1+\mu}}{\mu^{\mu}} \right] $$

is the maximum entropy for any random quantity with countable support and known
mean value $\mu = E[X] > 0$.
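The maximum-entropy distribution just derived is easy to verify numerically (a sketch of ours, not from the text; the value $\mu = 2$ and the truncation point are arbitrary):

```python
import math

def maxent_pmf(mu, n):
    """p_n = mu^n / (1+mu)^(1+n), written as (1-r) r^n with r = mu/(1+mu)
    so that large n underflows gracefully instead of overflowing."""
    r = mu / (1.0 + mu)
    return (1.0 - r) * r ** n

mu = 2.0
probs = [maxent_pmf(mu, n) for n in range(2000)]       # truncated; tail ~ 0
total = sum(probs)
mean = sum(n * p for n, p in enumerate(probs))
entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
h_max = math.log((1.0 + mu) ** (1.0 + mu) / mu ** mu)  # closed form above
```

The numerical entropy of the truncated distribution agrees with the closed form $\ln[(1+\mu)^{1+\mu}/\mu^\mu]$ to floating-point accuracy.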

4.3 Conditional and Mutual Information

Consider two discrete random quantities, $X$ and $Y$, with supports $\Omega_x = \{x_1, x_2, \ldots\}$
and $\Omega_y = \{y_1, y_2, \ldots\}$ and joint probability

$$ P(X = x_i, Y = y_j) \,=\, p(x_i, y_j) \,=\, p(x_i|y_j) \, p(y_j) \,=\, p(y_j|x_i) \, p(x_i) $$

The Expected Information about $(X, Y)$ from an observation will be

$$ I(X, Y) \,=\, - \sum_{\Omega_x} \sum_{\Omega_y} p(x_i, y_j) \, \ln p(x_i, y_j)
\,=\, - \sum_{\Omega_x} \sum_{\Omega_y} p(x_i, y_j) \, \ln p(x_i|y_j) \,-\, \sum_{\Omega_y} p(y_j) \, \ln p(y_j) $$

If we define¹

If we define1

¹ The non-negativity of this and the following expressions of Information can be easily derived from
Jensen's inequality for convex functions: given the probability space $(\mathbb{R}, \mathcal{B}, \mu)$, a $\mu$-integrable
function $X$ and a convex function $\phi$ over the range of $X$, then $\phi\left( \int_{\mathbb{R}} X \, d\mu \right) \leq \int_{\mathbb{R}} \phi(X) \, d\mu$ provided
the last integral exists; that is, $\phi(E[X]) \leq E[\phi(X)]$. Observe that if $\phi$ is a concave function, then
$-\phi$ is convex so the inequality sign is reversed, and that if $\phi$ is twice continuously differentiable
on $[a, b]$, it is convex on that interval iff $\phi''(x) \geq 0$ for all $x \in [a, b]$. Frequent and useful convex
functions are $\phi(x) = \exp(x)$ and $\phi(x) = -\log x$.

$$ I(X|Y) \,=\, - \sum_{\Omega_x} \sum_{\Omega_y} p(x_i, y_j) \, \ln p(x_i|y_j) \;\geq\; 0 $$

we can write:

$$ I(X, Y) \,=\, I(X|Y) + I(Y) \,=\, I(Y|X) + I(X) \,=\, I(Y, X) $$

Now, I (Y ) is the amount of information we expect to get about Y and, if (X, Y ) are
not independent, the knowledge of Y gives some information on X so the remaining
information we expect to get about X is not I (X ) but the smaller quantity I (X |Y ) <
I (X ) because we already know something about it. In entropy language, H (X |Y )
is the amount of ignorance about X that remains after Y is known. It is clear that
if X and Y are independent, I (X |Y ) = I (X ) so the knowledge of Y doesn’t say
anything about X and therefore the remaining information we expect to get about
X is I (X ). The interesting question is: How much information on X is contained
in Y ? (or, entropy wise, By how much the ignorance about X will be reduced if we
observe first the quantity Y ?). Well, if observing X we expect to get I (X ) and after
observing Y we expect to get I (X |Y ), the amount of information that the knowledge
of Y provides on X is the Mutual Information:

I (X : Y ) = I (X ) − I (X |Y )

Again, if they are independent I (X |Y ) = I (X ) so I (X : Y ) = 0. From the previous


properties, it is clear that

I (X : Y ) = I (X ) − I (X |Y ) = I (Y ) − I (Y |X ) = I (X ) + I (Y ) − I (X, Y ) = I (Y : X )

so it is symmetric: the observation of $Y$ provides as much information about $X$ as
the observation of $X$ about $Y$. Writing the terms explicitly:

$$ I(X:Y) \,=\, I(X) + I(Y) - I(X, Y) \,=\, \sum_{\Omega_x} \sum_{\Omega_y} p(x_i, y_j) \ln p(x_i, y_j) \,-\, \sum_{\Omega_x} p(x_i) \ln p(x_i) \,-\, \sum_{\Omega_y} p(y_j) \ln p(y_j) $$

and, in consequence:

• The Mutual Information, the amount of information we expect to get about $X$
(or $Y$) from the knowledge of $Y$ (or $X$), is given by

$$ I(X:Y) \,=\, I(Y:X) \,=\, \sum_{\Omega_x} \sum_{\Omega_y} p(x_i, y_j) \, \ln\left[ \frac{p(x_i, y_j)}{p(x_i) \, p(y_j)} \right] $$
Again from Jensen's inequality for convex functions, $I(X:Y) \geq 0$ with the equality
satisfied if and only if $X$ and $Y$ are independent random quantities; the Mutual
Information is therefore a measure of the statistical dependence between them (see
Note 3).
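The discrete definition translates directly into code. The following Python sketch (our own illustration; the joint tables are arbitrary examples) computes $I(X:Y)$ from a joint probability table and exhibits the two extreme cases:

```python
import math

def mutual_information(joint):
    """I(X:Y) = sum_ij p(xi,yj) ln[p(xi,yj) / (p(xi) p(yj))] in nats, for a
    joint distribution given as a nested list joint[i][j]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                     # 0 ln 0 -> 0 by convention
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Independent quantities: the joint factorizes, so I(X:Y) = 0
indep = [[0.1 * 0.5, 0.1 * 0.5],
         [0.9 * 0.5, 0.9 * 0.5]]
# Perfectly correlated binary quantities: I(X:Y) = ln 2
corr = [[0.5, 0.0],
        [0.0, 0.5]]
```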

4.4 Generalization for Absolute Continuous Random Quantities

Up to now, we have been dealing with discrete random quantities. Consider now a
continuous random quantity $X \sim p(x)$ with support on the compact set $\Omega_x = [a, b]$
and let's get a discrete approximation. Given a partition

$$ \Omega_X \,=\, \bigcup_{n=1}^{N} \Delta_n \,; \qquad \Delta_n \,=\, [\, a + (n-1)\Delta \,,\; a + n\Delta \,] $$

with large $N$ and $\Delta = (b - a)/N > 0$, if $p(x)$ is continuous on $\Delta_n$ we can use the
Mean Value Theorem and write

$$ P(X \in \Delta_n) \,=\, \int_{\Delta_n} p(x) \, dx \,=\, p(x_n) \, \Delta $$

with $x_n$ an interior point of $\Delta_n$. Then, we define the discrete approximation $X_D$
of $X$ that takes values $\{x_1, x_2, \ldots, x_N\}$ with probabilities $p_k = p(x_k) \, \Delta$ such that
$\sum_{k=1}^{N} p_k = 1$ and write

$$ I_D(X_D) \,=\, - \sum_{k=1}^{N} \left[ p(x_k) \Delta \right] \log \left[ p(x_k) \Delta \right] \,=\, - \sum_{k=1}^{N} p_k \log p(x_k) \,-\, \log \Delta \sum_{k=1}^{N} p_k $$

so, in the limit $\Delta \to 0^+$, $X_D$ tends to $X$ in distribution and

$$ I_D(X_D) \;\longrightarrow\; - \int_{a}^{b} p(x) \log p(x) \, dx \,-\, \log \Delta $$

One may be tempted to regularize this expression and define

$$ I(X) \,\stackrel{\mathrm{def.}}{=}\, \lim_{\Delta \to 0} \left[ I_D(X_D) + \log \Delta \right] \,=\, - \int_{\Omega_X} p(x) \log p(x) \, dx $$

… but this doesn't work. This "naive" generalisation is not the limit of the information
for a discrete quantity and is not an appropriate measure of information, among other
things, because $I(X)$ so defined may be negative. For example, if $X \sim Un(x|0, b)$,
we have that $I(X) = \ln b$, and for $0 < b < 1$ this is negative.²
However, this reasoning is meaningful for the Mutual Information because, being
the argument of the logarithm a ratio of probability densities, the limit is well defined.
Thus, we can state that
• Given two continuous random quantities $X$ and $Y$ with probability density $p(x, y)$,
the amount of information on $X$ that is contained in $Y$ is the Mutual Information:

$$ I(X:Y) \,=\, \int_{\Omega_x} dx \int_{\Omega_y} dy \; p(x, y) \, \ln\left[ \frac{p(x, y)}{p(x) \, p(y)} \right]
\,=\, \int_{\Omega_x} dx \int_{\Omega_y} dy \; p(x, y) \, \ln\left[ \frac{p(x|y)}{p(x)} \right] $$

4.5 Kullback–Leibler Discrepancy and Fisher’s Matrix

Consider the random quantity $X \sim p(x|\theta)$, the sequence of observations $\mathbf{x} =
\{x_1, x_2, \ldots, x_n\}$ and the prior density $\pi(\theta)$ that describes the knowledge we have
on $\theta$ before we do the experiment. Then $p(\mathbf{x}, \theta) = p(\theta|\mathbf{x}) \, p(\mathbf{x})$ and, therefore, the
Mutual Information

$$ I(X : \theta) \,=\, \int_{\Omega_X} d\mathbf{x} \int_{\Omega_\theta} d\theta \; p(\mathbf{x}, \theta) \, \ln\left[ \frac{p(\mathbf{x}, \theta)}{p(\mathbf{x}) \, \pi(\theta)} \right]
\,=\, \int_{\Omega_X} d\mathbf{x} \; p(\mathbf{x}) \int_{\Omega_\theta} d\theta \; p(\theta|\mathbf{x}) \, \ln\left[ \frac{p(\theta|\mathbf{x})}{\pi(\theta)} \right] $$

quantifies (Lindley, 1956) the Information we expect to get from the experiment
$e(n)$ on the parameter $\theta \in \Omega_\theta$ when the prior knowledge is represented by $\pi(\theta)$.
Therefore, we have that:

• The amount of Information provided by the data sample $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ on
the parameter $\theta \in \Omega_\theta$ with respect to the prior density $\pi(\theta)$ is:

$$ I(\mathbf{x}|\pi(\theta)) \,=\, \int_{\Omega_\theta} p(\theta|\mathbf{x}) \, \ln\left[ \frac{p(\theta|\mathbf{x})}{\pi(\theta)} \right] d\theta $$

that is, given the experimental sample $\mathbf{x}$, $I(\mathbf{x}|\pi(\theta))$ is the information we need to
update the prior knowledge $\pi(\theta)$ and substitute it by $p(\theta|\mathbf{x})$;


2 Despite of that, the “Differential Entropy” h( p) = −  X p(x) log p(x) d x is a useful quantity
in a different context. It is left as an exercise to show that among all continuous distributions with
support on [a, b], then the Uniform distribution U n(x|a, b) is the one that maximizes the Differential
Entropy, among those with support on [0, ∞) and specified first order moment is the Exponential
E x(x|μ) and, if the second order moment is also constrained, we get the Normal density N (x|μ, σ).

• The amount of Expected Information from the experiment $e(n)$ on the parameter
$\theta \in \Omega_\theta$ with respect to the knowledge contained in $\pi(\theta)$ will be

$$ I(e|\pi(\theta)) \,=\, \int_{\Omega_X} p(\mathbf{x}) \, I(\mathbf{x}|\pi(\theta)) \, d\mathbf{x} \,=\, \int_{\Omega_X} p(\mathbf{x}) \, d\mathbf{x} \int_{\Omega_\theta} p(\theta|\mathbf{x}) \, \ln\left[ \frac{p(\theta|\mathbf{x})}{\pi(\theta)} \right] d\theta
\,=\, \int_{\Omega_X} \int_{\Omega_\theta} p(\mathbf{x}, \theta) \, \ln\left[ \frac{p(\mathbf{x}, \theta)}{\pi(\theta) \, p(\mathbf{x})} \right] d\mathbf{x} \, d\theta \,=\, I(X : \theta) $$

As a general criterion to compare a probability density $p(x)$ with an approximation
$\pi(x)$, Solomon Kullback [2] and Richard Leibler introduced in 1951 the so-called
Kullback–Leibler Discrepancy:

$$ D_{KL}[\, p(\cdot) \,\|\, \pi(\cdot) \,] \,=\, \int_{\Omega_X} p(x) \, \ln\left[ \frac{p(x)}{\pi(x)} \right] dx $$

It is easy to check that:

(i) $D_{KL}[p(\cdot) \| \pi(\cdot)] \geq 0$ with the equality iff $p(x) = \pi(x)$ almost everywhere;
(ii) $D_{KL}[p(\cdot) \| \pi(\cdot)]$ is a convex function with respect to the pair $(p(x), \pi(x))$.

The Kullback–Leibler discrepancy $D_{KL}[p(\cdot) \| \pi(\cdot)]$ is not a metric distance because
it is not symmetric; that is, $D_{KL}[p(\cdot) \| \pi(\cdot)] \neq D_{KL}[\pi(\cdot) \| p(\cdot)]$, and therefore it does
not satisfy the triangular inequality either. However, it can be symmetrized as
$J[p(\cdot); \pi(\cdot)] = D_{KL}[p(\cdot) \| \pi(\cdot)] + D_{KL}[\pi(\cdot) \| p(\cdot)]$.
From the previous expressions, we have that

$$ I(e|\pi(\theta)) \,=\, D_{KL}[\, p(\mathbf{x}, \theta) \,\|\, \pi(\theta) \, p(\mathbf{x}) \,] \qquad \mathrm{and} \qquad I(\mathbf{x}|\pi(\theta)) \,=\, D_{KL}[\, p(\theta|\mathbf{x}) \,\|\, \pi(\theta) \,] $$

In the context of Bayesian Inference, the Kullback–Leibler discrepancy is a natural
measure of information and clearly evidences that the amount of knowledge on the
parameters $\theta$ that we get from the experiment, contained in the posterior density,
is relative to our prior knowledge. This is an important relation for the Reference
Analysis (Sect. 6.7) developed in [3, 4].
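For discrete distributions the discrepancy and its asymmetry are easy to exhibit (a small sketch of ours, not from the text; the example distributions are arbitrary):

```python
import math

def kl(p, q):
    """D_KL[p||q] = sum_i p_i ln(p_i / q_i) for discrete distributions;
    terms with p_i = 0 contribute 0 by the usual convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.5, 0.3, 0.2]
q = [1.0 / 3.0] * 3          # uniform "approximation" of p
d_pq = kl(p, q)
d_qp = kl(q, p)
j_sym = d_pq + d_qp          # the symmetrized discrepancy J[p; q]
```

Both directions are non-negative and vanish only when the two distributions coincide, but $D_{KL}[p \| q] \neq D_{KL}[q \| p]$ in general.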

4.5.1 Fisher’s Matrix

What is the capacity of the experiment to distinguish two infinitesimally close values
$\theta_0$ and $\theta_1 = \theta_0 + \Delta\theta_0$ of the parameters? Or, in other words, how much infor-
mation has to be provided by the experiment so that we can discern between two
infinitesimally close values of $\theta$? Let's analyze the local behaviour of

$$ I(\theta_0 : \theta_1 = \theta_0 + \Delta\theta_0) \,=\, \int_{\Omega_X} p(\mathbf{x}|\theta_0) \, \ln\left[ \frac{p(\mathbf{x}|\theta_0)}{p(\mathbf{x}|\theta_1)} \right] d\mathbf{x} $$

If $\dim\{\theta\} = n$, after a Taylor expansion of $\ln p(\mathbf{x}|\theta_1)$ around $\theta_0$, and considering
that

$$ \int_{\Omega_X} p(\mathbf{x}|\theta) \, d\mathbf{x} \,=\, 1 \;\longrightarrow\; \int_{\Omega_X} p(\mathbf{x}|\theta) \, \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta} \, d\mathbf{x} \,=\, \int_{\Omega_X} \frac{\partial p(\mathbf{x}|\theta)}{\partial \theta} \, d\mathbf{x} \,=\, 0 $$

we get

$$ I(\theta_0 : \theta_0 + \Delta\theta) \,\simeq\, - \frac{1}{2!} \sum_{i=1}^{n} \sum_{j=1}^{n} \Delta\theta_i \, \Delta\theta_j \int_{\Omega_X} p(\mathbf{x}|\theta) \left[ \frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta_i \, \partial \theta_j} \right]_{\theta_0} d\mathbf{x} \,+\, \cdots $$

Thus, if we define the

• Fisher's matrix

$$ \boldsymbol{I}_{ij}(\theta) \,=\, - \int_{\Omega_X} p(\mathbf{x}|\theta) \, \frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta_i \, \partial \theta_j} \, d\mathbf{x} \,=\, E_X\left[ - \frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta_i \, \partial \theta_j} \right] $$

we can write

$$ I(\theta_0 : \theta_0 + \Delta\theta) \,\simeq\, \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \Delta\theta_i \, \boldsymbol{I}_{ij}(\theta_0) \, \Delta\theta_j \,+\, \cdots $$

The Fisher's matrix is a non-negative symmetric matrix that plays a very important
role in statistical inference… provided it exists. This is the case for regular distribu-
tions where:

(1) $\mathrm{supp}_x\{p(\mathbf{x}|\theta)\}$ does not depend on $\theta$;
(2) $p(\mathbf{x}|\theta) \in C^k(\theta)$ for $k \geq 2$; and
(3) the integrand is well-behaved, so $\frac{\partial}{\partial \theta} \int_{\Omega_X} (\bullet) \, d\mathbf{x} \,=\, \int_{\Omega_X} \frac{\partial (\bullet)}{\partial \theta} \, d\mathbf{x}$

Interchanging the derivatives with respect to the parameters $\theta_i$ and the integrals over
$\Omega_X$, it is easy to obtain the equivalent expressions:

$$ \boldsymbol{I}_{ij}(\theta) \,=\, E_X\left[ - \frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta_i \, \partial \theta_j} \right] \,=\, E_X\left[ \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta_i} \, \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta_j} \right] $$

$$ \boldsymbol{I}_{ii}(\theta) \,=\, E_X\left[ - \frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta_i^2} \right] \,=\, E_X\left[ \left( \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta_i} \right)^2 \right] $$

If $\mathbf{X} = (X_1, \ldots, X_n)$ is an n-dimensional random quantity and $\{X_i\}_{i=1}^{n}$ are indepen-
dent, then $\boldsymbol{I}_{\mathbf{X}}(\theta) = \sum_i \boldsymbol{I}_{X_i}(\theta)$. Obviously, if they are iid then $\boldsymbol{I}_{X_i}(\theta) = \boldsymbol{I}_{X}(\theta)$ for all
$i$ and $\boldsymbol{I}_{\mathbf{X}}(\theta) = n \, \boldsymbol{I}_{X}(\theta)$.
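The expression $\boldsymbol{I}_{ii}(\theta) = E_X[(\partial \ln p/\partial \theta_i)^2]$ suggests a direct Monte Carlo check (our own illustration, not from the text; the exponential model, $\tau = 2$ and the sample size are arbitrary choices): for the exponential density with mean $\tau$ the score is $-1/\tau + x/\tau^2$ and the exact Fisher information is $1/\tau^2$.

```python
import random

def fisher_info_mc(tau, n_samples=200_000, seed=7):
    """Monte Carlo estimate of the Fisher information of the exponential
    density p(x|tau) = (1/tau) exp(-x/tau) for the parameter tau, using
    I(tau) = E[(d ln p / d tau)^2] with score -1/tau + x/tau^2.
    The exact value is 1/tau^2."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        x = rng.expovariate(1.0 / tau)        # sample with mean tau
        score = -1.0 / tau + x / tau ** 2
        acc += score * score
    return acc / n_samples

tau = 2.0
i_hat = fisher_info_mc(tau)                   # close to 1/tau^2 = 0.25
```

By the additivity property above, a sample of $n$ iid observations would carry $n/\tau^2$.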

4.5.2 Asymptotic Behaviour of the Likelihood Function

Suppose that the experiment $e(n)$ provides an independent and exchangeable
sequence of observations $\{x_1, x_2, \ldots, x_n\}$ from the model $p(x|\theta)$ with $\dim(\theta) = d$.
The information that the experiment provides about $\theta$ is contained in the likelihood
function and, being a non-negative function, consider for simplicity its logarithm:

$$ w(\theta|\cdot) \,=\, \sum_{i=1}^{n} \log p(x_i|\theta) $$

and the Taylor expansion around the maximum $\hat{\theta}$:

$$ w(\theta|\cdot) \,=\, w(\hat{\theta}|\cdot) \,-\, \frac{n}{2} \sum_{k=1}^{d} \sum_{m=1}^{d} \left[ - \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \log p(x_i|\theta)}{\partial \theta_k \, \partial \theta_m} \Bigg|_{\hat{\theta}} \right] (\theta_k - \hat{\theta}_k)(\theta_m - \hat{\theta}_m) \,+\, \cdots $$

where the second term has been multiplied and divided by $n$. Under sufficiently
regular conditions we have that $\hat{\theta}$ converges in probability to the true value $\theta_0$, so we
can neglect higher order terms and, by the Law of Large Numbers, approximate

$$ \lim_{n \to \infty} \left[ - \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \log p(x_i|\theta)}{\partial \theta_k \, \partial \theta_m} \Bigg|_{\hat{\theta}} \right] \,\simeq\, E_X\left[ - \frac{\partial^2 \log p(x|\theta)}{\partial \theta_k \, \partial \theta_m} \Bigg|_{\hat{\theta}} \right] \,\simeq\, \boldsymbol{I}_{km}(\hat{\theta}) $$

Therefore

$$ w(\theta|\cdot) \,=\, w(\hat{\theta}|\cdot) \,-\, \frac{1}{2} \sum_{k=1}^{d} \sum_{m=1}^{d} (\theta_k - \hat{\theta}_k) \left[ n \, \boldsymbol{I}_{km}(\hat{\theta}) \right] (\theta_m - \hat{\theta}_m) \,+\, \cdots $$

and, under regularity conditions, we can approximate the likelihood function by a
Normal density with mean $\hat{\theta}$ and covariance matrix $\Sigma^{-1} = n \, \boldsymbol{I}(\hat{\theta})$. In fact, for a
Normal density

$$ p(\mathbf{x}|\mu, \mathbf{V}) \,=\, \frac{1}{(2\pi)^{n/2} \, \det[\mathbf{V}]^{1/2}} \, \exp\left\{ - \frac{1}{2} \, (\mathbf{x} - \mu)^T \mathbf{V}^{-1} (\mathbf{x} - \mu) \right\} $$

we have for the parameters $\mu$ that

$$ \frac{\partial \ln p(\mathbf{x}|\mu, \mathbf{V})}{\partial \mu_i} \,=\, \sum_{k=1}^{n} [\mathbf{V}^{-1}]_{ik} \, (x_k - \mu_k) $$

and therefore

$$ \boldsymbol{I}_{ij}(\mu) \,=\, \sum_{k=1}^{n} \sum_{p=1}^{n} [\mathbf{V}^{-1}]_{ik} \, [\mathbf{V}^{-1}]_{jp} \, E_X\left[ (x_k - \mu_k)(x_p - \mu_p) \right]
\,=\, \sum_{k=1}^{n} \sum_{p=1}^{n} [\mathbf{V}^{-1}]_{ik} \, [\mathbf{V}^{-1}]_{jp} \, [\mathbf{V}]_{kp} \,=\, [\mathbf{V}^{-1}]_{ij} $$

that is, the Fisher's Matrix $\boldsymbol{I}_{ij}(\mu)$ is the inverse of the Covariance Matrix.

Example 4.3 The quantity $\boldsymbol{I}_{ij}(\theta)$ as an intrinsic measure of the information that
an experiment will provide about the parameters $\theta$ was introduced by R.A. Fisher
around 1920 in the context of Experimental Design. It depends only on the condi-
tional density $p(\mathrm{data}|\mathrm{parameters})$ and therefore is not a quantification relative to
the knowledge we have on the parameters before the experiment is done, but it is
very useful and has many interesting applications; for instance, to compare different
procedures to analyze the data. Let's see as an example the charge asymmetry of
the angular distribution of the process $e^+ e^- \longrightarrow \mu^+ \mu^-$. If we denote by $\theta$ the angle
between the incoming electron and the outgoing $\mu^-$ and by $x = \cos\theta \in [-1, 1]$, the
angular distribution can be expressed in first order of electroweak perturbations as

$$ p(x|a) \,=\, \frac{3}{8} \, (1 + x^2) \,+\, a \, x $$

where $a$, the asymmetry coefficient, is bounded by $|a| \leq a_m = 3/4$ since $p(x|a) \geq 0$.
Now, suppose that an experiment $e(n)$ provides $n$ independent observations $\mathbf{x} =
\{x_1, x_2, \ldots, x_n\}$ so we have the joint density

$$ p(\mathbf{x}|a) \,=\, \prod_{i=1}^{n} \left[ \frac{3}{8} \, (1 + x_i^2) \,+\, a \, x_i \right] $$

In this case, there are no minimal sufficient statistics but, instead of working with the
whole sample, we may simplify the analysis and classify the events in two categories:
Forward (F) if $\theta \in [0, \pi/2] \rightarrow x \in [0, 1]$, or Backward (B) if $\theta \in [\pi/2, \pi] \rightarrow
x \in [-1, 0]$. Then

$$ p_F \,=\, \int_{0}^{1} p(x|a) \, dx \,=\, \frac{1}{2} \, (1 + a) \qquad \mathrm{and} \qquad p_B \,=\, \int_{-1}^{0} p(x|a) \, dx \,=\, \frac{1}{2} \, (1 - a) \,=\, 1 - p_F $$

and the model we shall use for inferences on $a$ is the simpler Binomial model

$$ P(n_F|n, a) \,=\, \frac{1}{2^n} \binom{n}{n_F} \, (1 + a)^{n_F} \, (1 - a)^{n - n_F} \,; \qquad n_F = 0, 1, \ldots, n $$

For this second analysis (A2) we have that

$$ \boldsymbol{I}_{A_2}(a) \,=\, \sum_{n_F=0}^{n} P(n_F|n, a) \left( \frac{\partial \ln P(n_F|n, a)}{\partial a} \right)^2 \,=\, \frac{n}{1 - a^2} $$

In contrast, for the first analysis (A1) we have that

$$ \boldsymbol{I}_{A_1}(a) \,=\, n \int_{-1}^{1} \frac{x^2}{p(x|a)} \, dx \,=\, n \, \frac{8}{3} \left[ 2 \,+\, \alpha \, \ln \frac{1 - \alpha}{1 + \alpha} \,+\, \frac{\pi}{2} \, \frac{2\alpha^2 - 1}{\sqrt{1 - \alpha^2}} \right] $$

where $\alpha = a/a_m \in [-1, 1]$. For any $a \in [-3/4, 3/4]$, it holds that $\boldsymbol{I}_{A_1}(a) > \boldsymbol{I}_{A_2}(a)$,
so it is preferable to do the first analysis.
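The comparison between the two analyses can be checked numerically (a sketch of ours, not from the text; the integration rule, step count and test values of $a$ are arbitrary). Both quantities below are per-event informations, i.e. $\boldsymbol{I}_{A_1}/n$ and $\boldsymbol{I}_{A_2}/n$:

```python
import math

def p_x(x, a):
    """Angular density (3/8)(1 + x^2) + a x on [-1, 1]."""
    return 0.375 * (1.0 + x * x) + a * x

def fisher_full(a, steps=100_000):
    """Per-event information of the full analysis (A1):
    integral of x^2 / p(x|a) over [-1, 1] by the midpoint rule."""
    h = 2.0 / steps
    total = 0.0
    for k in range(steps):
        x = -1.0 + (k + 0.5) * h
        total += x * x / p_x(x, a) * h
    return total

def fisher_fb(a):
    """Per-event information of the forward-backward counting analysis (A2)."""
    return 1.0 / (1.0 - a * a)

exact_a0 = (8.0 / 3.0) * (2.0 - math.pi / 2.0)   # closed form at a = 0
```

At $a = 0$ the numeric integral reproduces the closed form $(8/3)(2 - \pi/2) \simeq 1.14$, already larger than the forward-backward value of 1, and the ordering holds over the whole allowed range of $a$.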

4.6 Some Properties of Information

• If we do n independent samplings from the same distribution, shall we get
n times the Information on the parameters of interest that we get from one
observation?

Consider the random quantity $X \sim p(x|\theta)$, the sample $\mathbf{x} = \{x_1, \ldots, x_n\}$ and the
Mutual Information

$$ I(X_1, \ldots, X_n : \theta) \,=\, \int_{\Omega_\theta} d\theta \; \pi(\theta) \int_{\Omega_X} p(\mathbf{x}|\theta) \, \ln\left[ \frac{p(\mathbf{x}|\theta)}{p(\mathbf{x})} \right] d\mathbf{x} $$

If the $n$ observations are independent, $p(\mathbf{x}|\theta) = p(x_1|\theta) \cdots p(x_n|\theta)$ and therefore

$$ \int_{\Omega_X} d\mathbf{x} \; p(\mathbf{x}|\theta) \, \ln p(\mathbf{x}|\theta) \,=\, \sum_{i=1}^{n} \int_{\Omega_X} dx_i \; p(x_i|\theta) \, \ln p(x_i|\theta) \,=\, n \int_{\Omega_X} dx \; p(x|\theta) \, \ln p(x|\theta) $$

On the other hand:

$$ \int_{\Omega_\theta} d\theta \; \pi(\theta) \int_{\Omega_X} d\mathbf{x} \; p(\mathbf{x}|\theta) \, \ln p(\mathbf{x}) \,=\, \int_{\Omega_X} d\mathbf{x} \; p(\mathbf{x}) \, \ln p(\mathbf{x}) $$

so:

$$ I(X_1, \ldots, X_n : \theta) \,=\, n \int_{\Omega_\theta} d\theta \; \pi(\theta) \int_{\Omega_X} dx \; p(x|\theta) \, \ln p(x|\theta) \,-\, \int_{\Omega_X} d\mathbf{x} \; p(\mathbf{x}) \, \ln p(\mathbf{x})
\,=\, n \, I(X : \theta) \,-\, D_{KL}[\, p(\mathbf{x}) \,\|\, p(x_1) \cdots p(x_n) \,] $$

and $D_{KL}[p(\mathbf{x}) \| p(x_1) \cdots p(x_n)] = 0$ if, and only if, $p(\mathbf{x}) = p(x_1) \cdots p(x_n)$. How-
ever, since

$$ p(\mathbf{x}) \,=\, \int_{\Omega_\theta} d\theta \; p(x_1|\theta) \cdots p(x_n|\theta) \, \pi(\theta) $$

the random quantities $X_1, X_2, \ldots, X_n$ are correlated through $\theta$ and therefore are not
independent. Thus, since $D_{KL}[p(\mathbf{x}) \| p(x_1) \cdots p(x_n)] > 0$, the information provided
by the experiment $e(n)$ is less than $n$ times the one provided by $e(1)$. This is reasonable
because, as the number of samplings grows, the knowledge about the parameter $\theta$
increases, the prior distribution $\pi(\theta)$ is updated to $p(\theta|x_1, \ldots, x_n)$, and further
independent realizations of the same experiment (under the same conditions) will
provide less information.

This is not the case for Fisher's criterion of Information. In fact, if the observations
are independent, it is trivial to see that for $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$

$$ \boldsymbol{I}_{ij}^{(n)}(\theta) \,=\, E_X\left[ \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta_i} \, \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta_j} \right] \,=\, n \, E_X\left[ \frac{\partial \ln p(x_1|\theta)}{\partial \theta_i} \, \frac{\partial \ln p(x_1|\theta)}{\partial \theta_j} \right] \,=\, n \, \boldsymbol{I}_{ij}^{(1)}(\theta) $$

Example 4.4 Consider a random quantity $X \sim N(x|\mu, \sigma)$ with $\sigma$ known and $\mu$ the
parameter of interest, with a prior density $\pi(\mu) = N(\mu|\mu_0, \sigma_0)$. Then

$$ p(x, \mu|\mu_0, \sigma, \sigma_0) \,=\, N(x|\mu, \sigma) \, N(\mu|\mu_0, \sigma_0) \,; \qquad p(x|\mu_0, \sigma, \sigma_0) \,=\, N\left(x \,\Big|\, \mu_0, \sqrt{\sigma^2 + \sigma_0^2}\right) $$

so the amount of Expected Information from the experiment $e(1)$ on the parameter
$\mu$ with respect to the knowledge contained in $\pi(\mu)$ will be

$$ I(e(1)|\pi(\mu)) \,=\, \int_{-\infty}^{\infty} d\mu \int_{-\infty}^{\infty} dx \; p(x, \mu|\cdot) \, \ln\left[ \frac{p(x, \mu|\cdot)}{\pi(\mu|\cdot) \, p(x|\cdot)} \right] \,=\, \ln\left[ 1 + \frac{\sigma_0^2}{\sigma^2} \right]^{1/2} $$

Consider now the experiment $e(2)$ that provides two independent observations
$\{x_1, x_2\}$. Then

$$ p(x_1, x_2, \mu|\mu_0, \sigma, \sigma_0) \,=\, N(x_1|\mu, \sigma) \, N(x_2|\mu, \sigma) \, N(\mu|\mu_0, \sigma_0) $$

and

$$ p(x_1, x_2|\mu_0, \sigma, \sigma_0) \,=\, N\left(x_1, x_2 \,\Big|\, \mu_0, \mu_0, \sqrt{\sigma^2 + \sigma_0^2}, \sqrt{\sigma^2 + \sigma_0^2}, \rho\right) $$

where $\rho = (1 + \sigma^2/\sigma_0^2)^{-1}$; that is, the random quantities $X_1$ and $X_2$ are correlated
through $\pi(\mu)$. In this case:
 
$$ I(e(2)|\pi(\mu)) \,=\, \frac{1}{2} \, \ln\left[ 1 + 2 \, \frac{\sigma_0^2}{\sigma^2} \right] \,=\, 2 \, I(e(1)|\pi(\mu)) \,+\, \frac{1}{2} \, \ln(1 - \rho^2) $$

so $D_{KL}[p(x_1, x_2|\cdot) \| p(x_1|\cdot) \, p(x_2|\cdot)] = -\frac{1}{2} \ln(1 - \rho^2)$. In general, for $n$ independent
observations we have that

$$ I(e(n)|\pi(\mu)) \,=\, \frac{1}{2} \, \ln\left[ 1 + n \, \frac{\sigma_0^2}{\sigma^2} \right] $$

that behaves asymptotically with $n$ as $I(e(n)|\pi(\mu)) \sim \ln n$.

• In general, the grouping of observations implies a reduction of information.

Consider a partition of the sample space

$$ \Omega_X \,=\, \bigcup_{i=1}^{N} E_i \,; \qquad E_i \cap E_j = \emptyset \quad \forall \, i \neq j $$

and let's write the Mutual Information as:

$$ I(X : \theta) \,=\, \int_{\Omega_X} d\mathbf{x} \int_{\Omega_\theta} d\theta \; p(\mathbf{x}, \theta) \, \ln\left[ \frac{p(\mathbf{x}, \theta)}{p(\mathbf{x}) \, \pi(\theta)} \right]
\,=\, \int_{\Omega_\theta} d\theta \int_{\Omega_X} d\mathbf{x} \; p(\mathbf{x}, \theta) \, \ln\left[ \frac{p(\mathbf{x}, \theta)}{p(\mathbf{x})} \right] \,-\, \int_{\Omega_\theta} d\theta \; p(\theta) \, \ln \pi(\theta) $$

Introducing

$$ \mu_1(E_i, \theta) \,=\, \int_{E_i} p(\mathbf{x}, \theta) \, d\mathbf{x} \qquad \mathrm{and} \qquad \mu_2(E_i) \,=\, \int_{E_i} p(\mathbf{x}) \, d\mathbf{x} $$

we have for the first term that:

$$ \int_{\Omega_\theta} d\theta \sum_{i=1}^{N} \int_{E_i} d\mathbf{x} \; p(\mathbf{x}, \theta) \, \ln\left[ \frac{p(\mathbf{x}, \theta)}{p(\mathbf{x})} \right]
\,=\, \int_{\Omega_\theta} d\theta \sum_{i=1}^{N} \int_{E_i} d\mathbf{x} \; \mu_1(E_i, \theta) \, \frac{p(\mathbf{x}, \theta)}{\mu_1(E_i, \theta)} \, \ln\left[ \frac{p(\mathbf{x}, \theta)/\mu_1(E_i, \theta)}{p(\mathbf{x})/\mu_2(E_i)} \, \frac{\mu_1(E_i, \theta)}{\mu_2(E_i)} \right] $$

$$ \,=\, \sum_{i=1}^{N} \int_{\Omega_\theta} d\theta \; \mu_1(E_i, \theta) \left\{ \ln\left[ \frac{\mu_1(E_i, \theta)}{\mu_2(E_i)} \right] \,+\, \int_{E_i} d\mathbf{x} \; f_1(\mathbf{x}, \theta) \, \ln\left[ \frac{f_1(\mathbf{x}, \theta)}{f_2(\mathbf{x})} \right] \right\} $$

where

$$ f_1(\mathbf{x}, \theta) \,=\, \frac{p(\mathbf{x}, \theta)}{\mu_1(E_i, \theta)} \,\geq\, 0 \qquad \mathrm{and} \qquad f_2(\mathbf{x}) \,=\, \frac{p(\mathbf{x})}{\mu_2(E_i)} \,\geq\, 0 $$

The Mutual Information for the grouped data is

$$ I_G(X : \theta) \,=\, \sum_{i=1}^{N} \int_{\Omega_\theta} d\theta \; \mu_1(E_i, \theta) \, \ln\left[ \frac{\mu_1(E_i, \theta)}{\mu_2(E_i)} \right] \,-\, \int_{\Omega_\theta} d\theta \; p(\theta) \, \ln \pi(\theta) $$

and therefore

$$ I(X : \theta) \,=\, I_G(X : \theta) \,+\, \int_{\Omega_\theta} d\theta \sum_{i=1}^{N} \mu_1(E_i, \theta) \int_{E_i} d\mathbf{x} \; f_1(\mathbf{x}, \theta) \, \ln\left[ \frac{f_1(\mathbf{x}, \theta)}{f_2(\mathbf{x})} \right] $$

Jensen's inequality for convex functions implies that

$$ \int_{E_i} d\mathbf{x} \; f_1(\mathbf{x}, \theta) \, \ln\left[ \frac{f_1(\mathbf{x}, \theta)}{f_2(\mathbf{x})} \right] \,\geq\, 0 $$

and, in consequence, $I(X : \theta) \geq I_G(X : \theta)$, the equality being satisfied if and only if

$$ f_1(\mathbf{x}, \theta) \,=\, f_2(\mathbf{x}) \;\longrightarrow\; \frac{p(\mathbf{x}, \theta)}{p(\mathbf{x})} \,=\, \frac{\mu_1(E_i, \theta)}{\mu_2(E_i)} $$

• If sufficient statistics exist, using them instead of the whole sample does not
reduce the information.

Given a parametric model $p(x_1, x_2, \ldots, x_n|\theta)$, we know that the set of statistics
$\mathbf{t} = \mathbf{t}(x_1, \ldots, x_n)$ is sufficient for $\theta$ iff for all $n \geq 1$ and any prior distribution $\pi(\theta)$
it holds that

$$ p(\theta|x_1, x_2, \ldots, x_n) \,=\, p(\theta|\mathbf{t}) $$

Then, since $p(\theta|\mathbf{t}) \propto p(\mathbf{t}|\theta) \, \pi(\theta)$, we have that

$$ I(X : \theta) \,=\, \int_{\Omega_x} d\mathbf{x} \; p(\mathbf{x}) \int_{\Omega_\theta} d\theta \; p(\theta|\mathbf{x}) \, \ln\left[ \frac{p(\theta|\mathbf{x})}{\pi(\theta)} \right]
\,=\, \int_{\Omega_T} d\mathbf{t} \; p(\mathbf{t}) \int_{\Omega_\theta} d\theta \; p(\theta|\mathbf{t}) \, \ln\left[ \frac{p(\theta|\mathbf{t})}{\pi(\theta)} \right] \,=\, I(T : \theta) $$

It is then clear that for inferences about $\theta$, all the information provided by the data
is contained in the set of sufficient statistics.

Example 4.5 Consider the random quantity $X \sim Ex(x|\lambda)$ and the experiment $e(n)$.
Under independence of the sample $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$ we have that

$$ p(x_1, \ldots, x_n|\lambda) \,=\, \lambda^n \, e^{-\lambda (x_1 + \cdots + x_n)} $$

We know already that there is a sufficient statistic $t = x_1 + \cdots + x_n$ with density
$Ga(t|\lambda, n)$:

$$ p(t|\lambda) \,=\, \frac{\lambda^n}{\Gamma(n)} \, e^{-\lambda t} \, t^{n-1} $$

Therefore, for a prior density $\pi(\lambda)$:

$$ I(e(n)|\pi(\lambda)) \,=\, \int_{0}^{\infty} d\lambda \int_{0}^{\infty} d\mathbf{x} \; p(\mathbf{x}, \lambda) \, \ln\left[ \frac{p(\mathbf{x}, \lambda)}{\pi(\lambda) \, p(\mathbf{x})} \right]
\,=\, \int_{0}^{\infty} d\lambda \int_{0}^{\infty} dt \; p(t, \lambda) \, \ln\left[ \frac{p(t, \lambda)}{\pi(\lambda) \, p(t)} \right] $$

In particular, for a conjugated prior $\pi(\lambda|a, b) = Ga(\lambda|a, b)$ we have that

$$ I(e(n)|\pi(\lambda)) \,=\, \ln\left[ \frac{\Gamma(b)}{\Gamma(n + b)} \right] \,-\, n \,+\, (n + b) \, \Psi(n + b) \,-\, b \, \Psi(b) $$

with $\Psi(x)$ the Digamma Function. This can be written as

$$ I(e(n)|\pi(\lambda)) \,=\, n \, I(e(1)|\pi(\lambda)) \,-\, \left\{ n \left( 1 + \frac{1}{b} \right) \,+\, (n + b) \left[ \Psi(b) - \Psi(n + b) \right] \,-\, \ln\left[ \frac{b^n \, \Gamma(b)}{\Gamma(n + b)} \right] \right\} $$

where the term in brackets is the Kullback–Leibler discrepancy. Again, the asymp-
totic behaviour with $n$ is non-linear: $I(e(n)|\pi(\lambda)) \sim \ln \sqrt{n}$.

4.7 Geometry and Information

This last section is somewhat marginal for the use of Information in the context we
have been interested in but, besides being an active field of research, it illustrates a very
interesting connection with geometry that almost surely will please most physicists.
Consider the family of distributions $F[p(\mathbf{x}|\theta)]$ with $\theta \in \Omega \subseteq \mathbb{R}^n$. They all have the
same functional form, so their difference is determined solely by different values of the
parameters. In fact, there is a one-to-one correspondence between each distribution
$p(\mathbf{x}|\theta) \in F$ and each point $\theta \in \Omega$, and the "separation" between them will be
determined by the geometrical properties of this parametric space. Intuitively, we
can already see that, in general, this space is a non-Euclidean Riemannian Manifold.
Consider for instance the Normal density $N(x|\mu, \sigma)$ and two points $(\mu_1, \sigma_1)$ and
$(\mu_2, \sigma_2)$ of the parametric space $\Omega_{\mu,\sigma} = \mathbb{R} \times \mathbb{R}^+$. For a real constant $a > 0$, if $\mu_2 =
\mu_1 + a$ and $\sigma_1 = \sigma_2$ (same variance, different mean values) we have the Euclidean
distance
$$ d_E \,=\, \sqrt{(\mu_2 - \mu_1)^2 + (\sigma_2 - \sigma_1)^2} \,=\, a $$

We have the same Euclidean distance if μ2 = μ1 and σ2 = σ1 + a (different variance,


same mean values) but, intuitively, one would say that in this second case it will
be more difficult to distinguish the two distributions. Even though the Euclidean
distances are the same, we shall need more information to achieve the same level of
“separation”.
As we have seen (Problem 2.1), the Fisher's Matrix (from now on denoted by
$g_{ij}(\theta)$)

$$ g_{ij}(\theta) \,=\, \int_{\Omega_X} p(\mathbf{x}|\theta) \, \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta_i} \, \frac{\partial \ln p(\mathbf{x}|\theta)}{\partial \theta_j} \, d\mathbf{x} $$

behaves under a change of parameters $\phi = \phi(\theta)$ as a second order covariant tensor;
that is:

$$ g_{ij}(\phi) \,=\, \frac{\partial \theta^k}{\partial \phi^i} \, \frac{\partial \theta^l}{\partial \phi^j} \, g_{kl}(\theta) $$

with the contravariant form $g^{ij}(\theta)$ satisfying $g^{ik}(\theta) \, g_{kj}(\theta) = \delta^i_j$. In the geometric
context it is called the Fisher–Rao metric tensor and defines the geometry of the
parameters' non-Euclidean Riemannian manifold. The differential distance is given
by

$$ (ds)^2 \,=\, g_{ij}(\theta) \, d\theta^i \, d\theta^j $$

and for any two points $\theta_1$ and $\theta_2$ of the parametric space, the distance between them
along the trajectory $\theta(t)$ parametrized by $t \in [t_1, t_2]$, with end points $\theta(t_1) = \theta_1$ and
$\theta(t_2) = \theta_2$, will be:

$$ S(\theta_1, \theta_2) \,=\, \int_{\theta_1}^{\theta_2} ds \,=\, \int_{t_1}^{t_2} \frac{ds}{dt} \, dt \,=\, \int_{t_1}^{t_2} \left[ g_{ij}(\theta) \, \frac{d\theta^i}{dt} \, \frac{d\theta^j}{dt} \right]^{1/2} dt $$

that, for one parameter families, reduces to

$$ S(\theta_1, \theta_2) \,=\, \int_{\theta_1}^{\theta_2} \sqrt{g(\theta)} \, d\theta $$

The path of shortest information distance between two points is always a geodesic
curve determined by the second order differential equation

$$ \frac{d^2 \theta^i}{dt^2} \,+\, \Gamma^i_{jk}(\theta) \, \frac{d\theta^j}{dt} \, \frac{d\theta^k}{dt} \,=\, 0 \qquad \mathrm{with} \qquad \Gamma^i_{jk}(\theta) \,=\, \frac{1}{2} \, g^{im} \left( g_{jm,k} + g_{mk,j} - g_{jk,m} \right) $$
the Christoffel symbols and $t$ the affine parameter along the geodesic. Each point
of the manifold has associated a tensor (Riemann tensor) given, in its covariant
representation, by

$$ R_{iklm} \,=\, \frac{1}{2} \left( g_{im,kl} + g_{kl,im} - g_{il,km} - g_{km,il} \right) \,+\, g_{np} \left( \Gamma^n_{kl} \Gamma^p_{im} - \Gamma^n_{km} \Gamma^p_{il} \right) $$

that depends only on the metric and provides a local measure of the curvature; that
is, of by how much the Fisher–Rao metric is not locally isometric to the Euclidean
space. For an n-dimensional manifold it has $n^4$ components but, due to the sym-
metries $R_{iklm} = R_{lmik} = -R_{kilm} = -R_{ikml}$ and $R_{iklm} + R_{imkl} + R_{ilmk} = 0$, they are
considerably reduced. The only non-trivial contraction of indices of $R_{iklm}$ is the
Ricci curvature tensor $R_{ij} = g^{lm} R_{iljm}$ (symmetric) and its trace $R = g^{ij} R_{ij}$ is
the scalar curvature. For two dimensional manifolds, the Gaussian curvature is
$\kappa = R_{1212}/\det(g_{ij}) = R/2$. Last, note that the invariant differential volume element
is $dV = \sqrt{|\det g(\mathbf{x})|} \, d\mathbf{x}$, and $\sqrt{|\det g(\mathbf{x})|}$ is the by now quite familiar Jeffreys' prior.
We have then that, in this geometrical context, the Information is the source of
curvature of the parametric space and the curvature determines how the information
flows from one point to another. The geodesic distance is an intrinsic distance, invariant
under reparameterisations, and is related to the amount of information difference
between two points; in other words, to how easy it is to discern between them: the larger
the distance, the larger the separation and the easier it will be to discern them.
Let's start with a simple one-parameter case: the Binomial distribution $Bi(n|N, \theta)$.
In this case
$$ g_{11}(\theta) \,=\, \frac{1}{\theta(1-\theta)}\,; \qquad g^{11}(\theta) \,=\, \theta(1-\theta) \qquad \text{and} \qquad \Gamma^1_{11}(\theta) \,=\, -\,\frac{1-2\theta}{2\theta(1-\theta)} $$

so the geodesic equation is:
$$ \frac{d^2\theta}{dt^2} \,-\, \frac{1-2\theta}{2\theta(1-\theta)} \left( \frac{d\theta}{dt} \right)^{\!2} =\, 0 \qquad \longrightarrow \qquad \theta(t) \,=\, \sin^2(at+b) $$

Then, for any two points $\theta_1, \theta_2 \in [0, 1]$ the geodesic distance will be:
$$ S(\theta_1, \theta_2) \,=\, \int_{\theta_1}^{\theta_2} \sqrt{g_{11}(\theta)}\; d\theta \,=\, 2 \left|\, \arcsin \sqrt{\theta}\, \right|_{\theta_1}^{\theta_2} $$

and for two infinitesimally close points $(\theta, \theta + \epsilon)$
$$ S(\theta, \theta + \epsilon) \,\simeq\, \frac{\epsilon}{\sqrt{\theta(1-\theta)}} \,\propto\, \epsilon\, \sqrt{g_{11}} $$
4.7 Geometry and Information 241

that is, Jeffreys' prior. We leave as an exercise to see that for the exponential family
$Ex(x|1/\tau)$
$$ g_{11}(\tau) \,=\, \frac{1}{\tau^2}\,; \qquad \Gamma^1_{11}(\tau) \,=\, -\,\frac{1}{\tau}\,; \qquad S(\tau_1, \tau_2) \,=\, \left| \log \frac{\tau_2}{\tau_1} \right| ; \qquad S(\tau, \tau + \epsilon) \,\simeq\, \frac{\epsilon}{\tau} \,\propto\, \epsilon\, \sqrt{g_{11}} $$
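Both closed forms are easy to verify numerically. The sketch below is my own illustration (the helper name is arbitrary): it integrates $\sqrt{g(\theta)}$ with a midpoint rule and compares the result with the expressions above for the Binomial and exponential cases.

```python
import math

def arc_length(sqrt_g, a, b, n=200000):
    # midpoint-rule integration of sqrt(g(theta)) between a and b
    h = (b - a) / n
    return sum(h * sqrt_g(a + (k + 0.5) * h) for k in range(n))

# Binomial: sqrt(g11) = 1/sqrt(theta(1-theta)),
# closed form S = 2 |arcsin sqrt(theta2) - arcsin sqrt(theta1)|
num_bi = arc_length(lambda t: 1.0 / math.sqrt(t * (1.0 - t)), 0.2, 0.7)
cls_bi = 2.0 * abs(math.asin(math.sqrt(0.7)) - math.asin(math.sqrt(0.2)))

# Exponential: sqrt(g11) = 1/tau, closed form S = |log(tau2/tau1)|
num_ex = arc_length(lambda t: 1.0 / t, 1.0, 3.5)
cls_ex = abs(math.log(3.5))

print(num_bi, cls_bi)  # the two numbers in each pair should agree closely
print(num_ex, cls_ex)
```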

The Normal family $N(x|\mu, \sigma)$ is more interesting. The metric tensor for $\{\mu, \sigma\}$ is
$$ g_{ij} \,=\, \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}; \qquad \det(g_{ij}) \,=\, 2/\sigma^4\,; \qquad g^{ij} \,=\, \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2/2 \end{pmatrix} $$

and therefore the non-zero Christoffel symbols are
$$ \Gamma^1_{12} \,=\, \Gamma^1_{21} \,=\, -1/\sigma\,; \qquad \Gamma^2_{11} \,=\, 1/(2\sigma)\,; \qquad \Gamma^2_{22} \,=\, -1/\sigma $$

From them we can obtain the Gaussian and scalar curvatures:
$$ \kappa \,=\, \frac{R_{1212}}{\det(g_{ij})} \,=\, -\,\frac{1}{2}\,; \qquad R \,=\, g^{ij} g^{lm} R_{iljm} \,=\, -1 $$

and conclude that the parametric manifold $(\mu, \sigma)$ of the Normal family is hyperbolic,
with constant curvature (thus, there is no complete isometric immersion in $E^3$).
The geodesic equations are:
$$ \frac{d^2\mu}{dt^2} \,-\, \frac{2}{\sigma}\, \frac{d\mu}{dt}\, \frac{d\sigma}{dt} \,=\, 0 \qquad \text{and} \qquad \frac{d^2\sigma}{dt^2} \,+\, \frac{1}{2\sigma} \left( \frac{d\mu}{dt} \right)^{\!2} -\, \frac{1}{\sigma} \left( \frac{d\sigma}{dt} \right)^{\!2} =\, 0 $$

so writing
$$ \frac{d\sigma}{dt} \,=\, \frac{d\sigma}{d\mu}\, \frac{d\mu}{dt} \qquad \text{and} \qquad \frac{d^2\sigma}{dt^2} \,=\, \frac{d^2\sigma}{d\mu^2} \left( \frac{d\mu}{dt} \right)^{\!2} +\, \frac{d\sigma}{d\mu}\, \frac{d^2\mu}{dt^2} $$

we have that:
$$ \sigma\, \frac{d^2\sigma}{d\mu^2} \,+\, \left( \frac{d\sigma}{d\mu} \right)^{\!2} +\, \frac{1}{2} \,=\, \frac{d(\sigma\sigma')}{d\mu} \,+\, \frac{1}{2} \,=\, 0 \qquad \longrightarrow \qquad 2\,\sigma^2(\mu) \,=\, a - (\mu - b)^2 $$
where $b - \sqrt{a} < \mu < b + \sqrt{a}$. For the points $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$, the integration
constants are
$$ b \,=\, \frac{\mu_1 + \mu_2}{2} \,+\, \frac{\sigma_1^2 - \sigma_2^2}{\mu_1 - \mu_2} \qquad \text{and} \qquad a \,=\, \frac{[(\mu_1 - \mu_2)^2 + 2(\sigma_1 + \sigma_2)^2]\,[(\mu_1 - \mu_2)^2 + 2(\sigma_1 - \sigma_2)^2]}{4\,(\mu_1 - \mu_2)^2} $$

and the geodesic distance is
$$ S_{12} \,=\, \sqrt{a/2}\, \int_{\mu_1}^{\mu_2} \sigma(\mu)^{-2}\, d\mu \,=\, \sqrt{2}\, \ln \left[ \frac{1+\delta}{1-\delta} \right] $$

with
$$ \delta \,=\, \left[ \frac{(\mu_1 - \mu_2)^2 + 2(\sigma_1 - \sigma_2)^2}{(\mu_1 - \mu_2)^2 + 2(\sigma_1 + \sigma_2)^2} \right]^{1/2} \in\; [0, 1) $$
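In code, the closed form reads as follows (my own sketch; the function name is arbitrary). A convenient sanity check: for $\mu_1 = \mu_2$ the expression collapses to $\sqrt{2}\,|\log(\sigma_2/\sigma_1)|$, the pure scale geodesic, and the distance is symmetric in its arguments.

```python
import math

def fisher_rao_normal(mu1, s1, mu2, s2):
    # closed-form geodesic (Fisher-Rao) distance between N(mu1,s1) and N(mu2,s2)
    num = (mu1 - mu2) ** 2 + 2.0 * (s1 - s2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (s1 + s2) ** 2
    delta = math.sqrt(num / den)
    return math.sqrt(2.0) * math.log((1.0 + delta) / (1.0 - delta))

# pure change of scale: the distance reduces to sqrt(2) * log(sigma2/sigma1)
print(fisher_rao_normal(0.0, 1.0, 0.0, 2.0))  # ≈ sqrt(2) * ln 2 ≈ 0.9803
# symmetry under exchange of the two densities
print(fisher_rao_normal(0.0, 1.0, 1.3, 0.4) == fisher_rao_normal(1.3, 0.4, 0.0, 1.0))
```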

Suppose for instance that $(\mu_1, \sigma_1)$ are given. Then, under the hypothesis
$H_0: \{\mu = \mu_0\}$:
$$ \inf_{\sigma}\, S(\mu_1, \sigma_1; \mu_0, \sigma) \qquad \longrightarrow \qquad \sigma^2 \,=\, \sigma_1^2 + (\mu_1 - \mu_0)^2/2 $$
and, under $H_0: \{\sigma = \sigma_0\}$:
$$ \inf_{\mu}\, S(\mu_1, \sigma_1; \mu, \sigma_0) \qquad \longrightarrow \qquad \mu \,=\, \mu_1 $$
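The first infimum can be checked with a crude grid search (my own illustration; the test point, grid, and tolerances are arbitrary choices), minimizing the closed-form distance over $\sigma$ at fixed $\mu_0$ and comparing with $\sigma^2 = \sigma_1^2 + (\mu_1 - \mu_0)^2/2$:

```python
import math

def fisher_rao_normal(mu1, s1, mu2, s2):
    # closed-form geodesic distance between N(mu1,s1) and N(mu2,s2)
    num = (mu1 - mu2) ** 2 + 2.0 * (s1 - s2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (s1 + s2) ** 2
    delta = math.sqrt(num / den)
    return math.sqrt(2.0) * math.log((1.0 + delta) / (1.0 - delta))

mu1, s1, mu0 = 0.3, 0.8, 1.5
# grid search for inf_sigma S(mu1, s1; mu0, sigma)
sigmas = [0.001 * k for k in range(1, 5001)]
s_star = min(sigmas, key=lambda s: fisher_rao_normal(mu1, s1, mu0, s))
print(s_star ** 2, s1 ** 2 + (mu1 - mu0) ** 2 / 2.0)  # ≈ 1.36 in both cases
```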

Problem 4.1 Show that for the Cauchy distribution
$$ p(x|\alpha, \beta) \,=\, \frac{\beta}{\pi} \left[\, 1 + \beta^2 (x - \alpha)^2 \,\right]^{-1}; \qquad x \in R\,; \;\; (\alpha, \beta) \in R \times R^{+} $$
we have for $\{\alpha, \beta\}$:
$$ g_{ij} \,=\, \begin{pmatrix} \beta^2/2 & 0 \\ 0 & 1/(2\beta^2) \end{pmatrix}; \qquad \Gamma^2_{11} = -\beta^3\,; \quad \Gamma^1_{12} = -\Gamma^2_{22} = \beta^{-1}\,; \quad R_{1212} = -1/2\,; \quad R = 2\kappa = -4 $$

and, as for the Normal family, the geodesic equation has a parabolic dependence
$\beta^{-2} = a - (\alpha - b)^2$. It may help to consider that
$$ \int_{-\infty}^{\infty} p(x|\alpha, \beta) \left[\, 1 + \beta^2 (x - \alpha)^2 \,\right]^{-n} dx \,=\, \frac{\Gamma(n + 1/2)}{\Gamma(1/2)\, \Gamma(n + 1)}\,; \qquad n \in N $$

As for the Normal family, the parametric manifold for the Cauchy family of distributions
is also hyperbolic with constant curvature. Show that this is the general case for
distributions with location and scale parameters $p(x|\alpha, \beta) = \beta\, f[\beta(x - \alpha)]$ where
$x \in R$ and $(\alpha, \beta) \in R \times R^{+}$.
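The hint can be checked numerically for small $n$ (my own sketch; the parameter values, integration range, and step count are arbitrary):

```python
import math

def lhs(n, alpha=0.4, beta=1.7, lo=-1000.0, hi=1000.0, steps=200000):
    # midpoint-rule integral of p(x|alpha,beta) [1 + beta^2 (x-alpha)^2]^(-n)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        u = 1.0 + beta ** 2 * (x - alpha) ** 2
        total += h * (beta / math.pi) / u ** (n + 1)
    return total

def rhs(n):
    # Gamma(n + 1/2) / (Gamma(1/2) Gamma(n + 1))
    return math.gamma(n + 0.5) / (math.gamma(0.5) * math.gamma(n + 1))

for n in (1, 2, 3):
    print(lhs(n), rhs(n))  # each pair should agree to a few decimal places
```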
Problem 4.2 Show that for the metric of Example 2.21 (ratio of Poisson parameters)
$$ g_{ij}(\theta, \mu) \,=\, \begin{pmatrix} \mu/\theta & 1 \\ 1 & (1+\theta)/\mu \end{pmatrix} $$

the Riemann tensor is zero and therefore the manifold for $\{\theta, \mu\}$ is locally isometric
to the two-dimensional Euclidean space. Show that the geodesics are given by $\theta(\mu) =
(b_0 + b_1 \mu^{-1/2})^2$ and, for the affine parameter $t$, $\mu(t) = (a_0 + a_1 t)^2$. If we define the
new parameters $\phi_1 = 2(\theta\mu)^{1/2}$ and $\phi_2 = 2\mu^{1/2}$, what do you expect to get for the
manifold $M\{\phi_1, \phi_2\}$?
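As a hint toward the last question, the metric can be transformed numerically with the covariant tensor rule (this check is my own addition; the finite-difference step and test point are arbitrary):

```python
def g_theta_mu(theta, mu):
    # Fisher-Rao metric of Problem 4.2 in the (theta, mu) coordinates
    return [[mu / theta, 1.0], [1.0, (1.0 + theta) / mu]]

def theta_mu(phi1, phi2):
    # inverse of phi1 = 2 sqrt(theta mu), phi2 = 2 sqrt(mu)
    mu = (phi2 / 2.0) ** 2
    theta = (phi1 / phi2) ** 2
    return theta, mu

def g_phi(phi1, phi2, h=1e-6):
    # tensor rule g_ab(phi) = (d theta^k/d phi^a)(d theta^l/d phi^b) g_kl(theta),
    # with the Jacobian estimated by central differences
    J = [[0.0, 0.0], [0.0, 0.0]]
    for col, (d1, d2) in enumerate([(h, 0.0), (0.0, h)]):
        tp = theta_mu(phi1 + d1, phi2 + d2)
        tm = theta_mu(phi1 - d1, phi2 - d2)
        J[0][col] = (tp[0] - tm[0]) / (2.0 * h)
        J[1][col] = (tp[1] - tm[1]) / (2.0 * h)
    t, m = theta_mu(phi1, phi2)
    g = g_theta_mu(t, m)
    return [[sum(J[k][a] * J[l][b] * g[k][l] for k in range(2) for l in range(2))
             for b in range(2)] for a in range(2)]

print(g_phi(1.3, 0.7))  # ≈ identity matrix: the (phi1, phi2) manifold is Euclidean
```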
As we move along the information geodesic, there are some quantities that
remain invariant. The Killing vectors $\zeta_\mu$ are given by the first order differential
equation $\zeta_{\mu,\nu} + \zeta_{\nu,\mu} - 2\,\Gamma^{\rho}_{\mu\nu}\, \zeta_\rho = 0$ and, if we denote by $u^\mu = dx^\mu/dt$ the tangent
vector to the geodesic, with $t$ the affine parameter, one has that $\zeta_\mu u^\mu = \text{constant}$.
For $n$-dimensional spaces with constant curvature there are $n(n+1)/2$ of them and
clearly any linear combination with constant coefficients will also be a Killing vector.
In the case of the Normal family we have that:
$$ \zeta_\mu \,=\, c_1 \left( \frac{\mu^2 - 2\sigma^2}{4\sigma^2}\,,\; \frac{\mu}{\sigma} \right) \,+\, c_2 \left( \frac{\mu}{2\sigma^2}\,,\; \frac{1}{\sigma} \right) \,+\, c_3 \left( \frac{1}{\sigma^2}\,,\; 0 \right) $$

with $c_{1,2,3}$ integration constants; setting $c_i = 1$, $c_{j \neq i} = 0$ for $i, j = 1, 2, 3$ we get
the 3 independent vectors. Therefore, along the information geodesic:
$$ \frac{\mu^2 - 2\sigma^2}{4\sigma^2}\, \frac{d\mu}{dt} + \frac{\mu}{\sigma}\, \frac{d\sigma}{dt}\,, \qquad \frac{\mu}{2\sigma^2}\, \frac{d\mu}{dt} + \frac{1}{\sigma}\, \frac{d\sigma}{dt} \qquad \text{and} \qquad \frac{1}{\sigma^2}\, \frac{d\mu}{dt} $$
will remain constant. In fact, it is easier to derive from these first order differential
equations the expression of the geodesic as a function of the affine parameter $t$:
$$ \mu(t) \,=\, b + \sqrt{a}\, \tanh(c_1 t + c_0) \qquad \text{and} \qquad \sigma(t) \,=\, \sqrt{\frac{a}{2}}\; \frac{1}{\cosh(c_1 t + c_0)} $$

where, for $t \in [0, 1]$, $(\mu_1, \sigma_1)_{t=0}$ and $(\mu_2, \sigma_2)_{t=1}$:
$$ c_0 \,=\, \tanh^{-1} \left( \frac{\mu_1 - b}{\sqrt{a}} \right) \qquad \text{and} \qquad c_0 + c_1 \,=\, \tanh^{-1} \left( \frac{\mu_2 - b}{\sqrt{a}} \right) $$
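These expressions are easy to sanity-check numerically (my own sketch; the geodesic constants are arbitrary): along the parametrized curve $2\sigma^2 + (\mu - b)^2$ should stay equal to $a$, and the simplest Killing invariant $(1/\sigma^2)\, d\mu/dt$ should be constant in $t$.

```python
import math

a, b, c0, c1 = 2.0, 0.5, -0.3, 1.2  # arbitrary geodesic constants

def mu(t):
    return b + math.sqrt(a) * math.tanh(c1 * t + c0)

def sigma(t):
    return math.sqrt(a / 2.0) / math.cosh(c1 * t + c0)

def killing(t, h=1e-6):
    # (1/sigma^2) dmu/dt, with the derivative by central differences
    dmu = (mu(t + h) - mu(t - h)) / (2.0 * h)
    return dmu / sigma(t) ** 2

for t in (0.0, 0.4, 1.0):
    print(2.0 * sigma(t) ** 2 + (mu(t) - b) ** 2, killing(t))
# first column stays at a = 2; second column stays at the constant 2*c1/sqrt(a)
```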

With respect to the standardized Normal $N(x|0, 1)$, all points $(\mu, \sigma)$ of the
manifold $R \times R^{+}$ with the same geodesic distance $S \geq 0$ are given by
$$ \mu \,=\, \pm \left[\, 4\sigma \cosh(S/\sqrt{2}) - 2(1 + \sigma^2) \,\right]^{1/2} $$
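A quick numerical check of this curve (my own sketch, reusing the closed-form distance derived above; the distance value and test points are arbitrary):

```python
import math

def fisher_rao_normal(mu1, s1, mu2, s2):
    # closed-form geodesic distance between two Normal densities
    num = (mu1 - mu2) ** 2 + 2.0 * (s1 - s2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (s1 + s2) ** 2
    delta = math.sqrt(num / den)
    return math.sqrt(2.0) * math.log((1.0 + delta) / (1.0 - delta))

S = 0.5
for s in (0.8, 1.0, 1.2):
    m2 = 4.0 * s * math.cosh(S / math.sqrt(2.0)) - 2.0 * (1.0 + s * s)
    m = math.sqrt(m2)  # m2 > 0 for these sigma values
    print(fisher_rao_normal(0.0, 1.0, m, s))  # each line ≈ S = 0.5
```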

Figure 4.1 (left) shows the curves in the $(\mu, \sigma)$ plane whose points have the same
geodesic distance with respect to the Normal density $N(x|0, 1)$. The inner set
corresponds to a distance of $d_G = 0.1$, the outer one to $d_G = 1.5$ and those in
between to increasing steps of 0.2.


Fig. 4.1 Left set of points in the $(\mu, \sigma)$ plane that have the same geodesic distance with
respect to the Normal density $N(x|0, 1)$. The inner set corresponds to a distance of $d_G = 0.1$,
the outer one to $d_G = 1.5$ and those in between to increasing steps of 0.2. Right Euclidean
($d_E$), symmetrized Kullback–Leibler discrepancy ($d_{KL}$) and Hellinger distance ($d_H$) for the
set of $(\mu, \sigma)$ points that have the same geodesic distance $d_G = 0.5$

For two normal densities, the symmetrized Kullback–Leibler discrepancy is
$$ J[p_1; p_2] \,=\, D(p_1 \| p_2) + D(p_2 \| p_1) \,=\, \frac{(\mu_1 - \mu_2)^2\, (\sigma_1^2 + \sigma_2^2) + (\sigma_1^2 - \sigma_2^2)^2}{2\, \sigma_1^2\, \sigma_2^2} $$

and the Hellinger distance
$$ d_H[p_1; p_2] \,=\, \left[\, 1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}}\; \exp\left\{ -\frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)} \right\} \right]^{1/2} $$

Figure 4.1 (right) shows the Euclidean ($d_E$), the symmetrized Kullback–Leibler ($d_{KL}$)
and the Hellinger ($d_H$) distances for the set of points that have the same geodesic
distance $d_G = 0.5$.

