Probability and Statistics for Particle Physics
Carlos Maña
UNITEXT for Physics
Series editors
Paolo Biscari, Milano, Italy
Michele Cini, Roma, Italy
Attilio Ferrari, Torino, Italy
Stefano Forte, Milano, Italy
Morten Hjorth-Jensen, Oslo, Norway
Nicola Manini, Milano, Italy
Guido Montagna, Pavia, Italy
Oreste Nicrosini, Pavia, Italy
Luca Peliti, Napoli, Italy
Alberto Rotondi, Pavia, Italy
The UNITEXT for Physics series, formerly UNITEXT Collana di Fisica e Astronomia, publishes textbooks and monographs in physics and astronomy, mainly in English, characterized by a didactic style and comprehensiveness. The books published in the UNITEXT for Physics series are addressed to graduate and advanced graduate students, but also to scientists and researchers, as important resources for their education, knowledge and teaching.
Carlos Maña
Departamento de Investigación Básica
Centro de Investigaciones Energéticas,
Medioambientales y Tecnológicas
Madrid
Spain
1 Probability  1
1.1 The Elements of Probability: (Ω, B, μ)  1
1.1.1 Events and Sample Space: (Ω)  1
1.1.2 σ-algebras (B_Ω) and Measurable Spaces (Ω, B_Ω)  3
1.1.3 Set Functions and Measure Space: (Ω, B_Ω, μ)  6
1.1.4 Random Quantities  10
1.2 Conditional Probability and Bayes Theorem  14
1.2.1 Statistically Independent Events  15
1.2.2 Theorem of Total Probability  18
1.2.3 Bayes Theorem  19
1.3 Distribution Function  23
1.3.1 Discrete and Continuous Distribution Functions  24
1.3.2 Distributions in More Dimensions  28
1.4 Stochastic Characteristics  35
1.4.1 Mathematical Expectation  35
1.4.2 Moments of a Distribution  36
1.4.3 The “Error Propagation Expression”  44
1.5 Integral Transforms  45
1.5.1 The Fourier Transform  45
1.5.2 The Mellin Transform  53
1.6 Ordered Samples  63
1.7 Limit Theorems and Convergence  67
1.7.1 Chebyshev’s Theorem  68
1.7.2 Convergence in Probability  69
1.7.3 Almost Sure Convergence  70
1.7.4 Convergence in Distribution  71
1.7.5 Convergence in Lp Norm  76
1.7.6 Uniform Convergence  77
Appendices  81
References  85
2 Bayesian Inference  87
2.1 Elements of Parametric Inference  88
2.2 Exchangeable Sequences  89
2.3 Predictive Inference  91
2.4 Sufficient Statistics  92
2.5 Exponential Family  94
2.6 Prior Functions  95
2.6.1 Principle of Insufficient Reason  96
2.6.2 Parameters of Position and Scale  97
2.6.3 Covariance Under Reparameterizations  103
2.6.4 Invariance Under a Group of Transformations  109
2.6.5 Conjugated Distributions  115
2.6.6 Probability Matching Priors  119
2.6.7 Reference Analysis  125
2.7 Hierarchical Structures  133
2.8 Priors for Discrete Parameters  135
2.9 Constraints on Parameters and Priors  136
2.10 Decision Problems  137
2.10.1 Hypothesis Testing  139
2.10.2 Point Estimation  145
2.11 Credible Regions  147
2.12 Bayesian (B) Versus Classical (F) Philosophy  148
2.13 Some Worked Examples  154
2.13.1 Regression  154
2.13.2 Characterization of a Possible Source of Events  158
2.13.3 Anisotropies of Cosmic Rays  161
References  166
3 Monte Carlo Methods  169
3.1 Pseudo-Random Sequences  170
3.2 Basic Algorithms  171
3.2.1 Inverse Transform  171
3.2.2 Acceptance-Rejection (Hit-Miss; J. Von Neumann 1951)  178
3.2.3 Importance Sampling  183
3.2.4 Decomposition of the Probability Density  185
3.3 Everything at Work  186
3.3.1 The Compton Scattering  186
3.3.2 An Incoming Flux of Particles  192
3.4 Markov Chain Monte Carlo  199
3.4.1 Sampling from Conditionals and Gibbs Sampling  214
familiar with measure theory, there is an appendix to this chapter with a short
digression on some basic concepts. A large fraction of the material presented in this
lecture can be found in more depth, together with other interesting subjects, in the
book Probability: A Graduate Course (2013; Springer Texts in Statistics) by A. Gut. Chapter 2 is about statistical inference, Bayesian inference in fact, and a must for this topic is Bayesian Theory (1994; John Wiley & Sons) by J.M. Bernardo and A.F.M. Smith, which also contains an enlightening discussion about the Bayesian and frequentist approaches in Appendix B. It is beyond question that in any
worthwhile course on statistics the ubiquitous frequentist methodology has to be
taught as well and there are excellent references on the subject. Students are
encouraged to look, for instance, at Statistical Methods in Experimental Physics
(2006; World Scientific) by F. James, Statistics for Nuclear and Particle Physicists
(1989; Cambridge University Press) by L. Lyons, or Statistical Data Analysis
(1997; Oxford Science Pub.) by G. Cowan. Last, Chap. 3 is devoted to Monte Carlo
simulation, an essential tool in statistics and particle physics, and Chap. 4 to
information theory, and, like for the first chapters, both have interesting references
given along the text.
“Time is short, my strength is limited,…”, Kafka dixit, so many interesting subjects that deserve a whole lecture by themselves are left aside. To mention some: a historical development of probability and statistics, Bayesian networks, generalized distributions (a different approach to probability distributions), decision theory (game theory), and Markov chains, for which we shall state only the relevant properties without further explanation.
I am grateful to Drs. J. Berdugo, J. Casaus, C. Delgado, and J. Rodriguez for their suggestions and a careful reading of the text, and much indebted to Dr. Hisako Niko. Were it not for her interest, these notes would still be in the drawer. My gratitude goes also to Mieke van der Fluit for her assistance with the editing.
Chapter 1
Probability
To learn about the state of nature, we do experiments and observations of the natural
world and ask ourselves questions about the outcomes. In a general way, the objects of the questions we may ask about the result of an experiment, such that the possible answers are it occurs or it does not occur, are called events. There are different kinds of events and among them we have the elementary events; that is, those results of the random experiment that can not be decomposed in others of lesser entity. The
sample space (Ω) is the set of all the possible elementary outcomes (events) of a random experiment and they have to be:
(i) exhaustive: any possible outcome of the experiment has to be included in Ω;
(ii) exclusive: there is no overlap of elementary results.
To study random phenomena we start by specifying the sample space and, therefore,
we have to have a clear idea of what are the possible results of the experiment.
To center the ideas, consider the simple experiment of rolling a die with 6 faces numbered from 1 to 6. We consider as elementary events

e_i = {get the number i on the upper face};  i = 1, …, 6

so Ω = {e1, …, e6}. Note that any possible outcome of the roll is included in Ω and we can not have two or more elementary results simultaneously. But there are other types of events besides the elementary ones. We may be interested for instance in the parity of the number, so we would like to consider also the possible results1

A = {get an even number} = {e2, e4, e6}  and  A^c = {get an odd number} = {e1, e3, e5}.

They are not elementary since the result A = {e2, e4, e6} is equivalent to get e2, e4 or e6 and A^c = Ω\A to get e1, e3 or e5. In general, an event is any subset2 of the sample space and we shall distinguish between:

• the sure event Ω, that occurs in every realization of the experiment;
• the impossible event ∅, that never occurs;
• random events, all the remaining subsets.

Any event that is neither sure nor impossible is called random event. Going back to the rolling of the die, sure events are, for instance, {get a number n ≤ 6} and impossible events {get a number n > 6}.
1 Given two sets A, B ⊂ Ω, we shall denote by A^c the complement of A (that is, the set of all elements of Ω that are not in A) and by A\B ≡ A ∩ B^c the set difference or relative complement of B in A (that is, the set of elements that are in A but not in B). It is clear that A^c = Ω\A.
2 This is not completely true if the sample space is non-denumerable since there are subsets that can
not be considered as events. It is however true for the subsets of Rn we shall be interested in. We
shall talk about that in Sect. 1.1.2.2.
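The die example above can be made concrete with a few set operations. A small Python sketch (the book itself contains no code; the variable names are of course just illustrative):

```python
# Sample space of the die: the six elementary events e1..e6,
# represented here simply by the numbers on the faces.
omega = {1, 2, 3, 4, 5, 6}

# The event A = "get an even number" = {e2, e4, e6}.
A = {e for e in omega if e % 2 == 0}

# Its complement A^c = Omega \ A is the event "get an odd number".
A_c = omega - A

assert A == {2, 4, 6}
assert A_c == {1, 3, 5}
# A and A^c are exhaustive and exclusive: together they cover Omega
# and they share no elementary event.
assert A | A_c == omega and A & A_c == set()
```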
1.1 The Elements of Probability: (, B, μ) 3
Depending on the number of possible outcomes of the experiment, the sample space can be finite, countably infinite or non-denumerable.
It is important to note that the events are not necessarily numerical entities. We
could have for instance the die with colored faces instead of numbers. We shall deal
with that when discussing random quantities. Last, given a sample space Ω we shall talk quite frequently about a partition (or a complete system of events); that is, a sequence {S_i} of events, finite or countable, such that

Ω = ∪_i S_i  (complete system)   and   S_i ∩ S_j = ∅, ∀ i ≠ j  (disjoint events).
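The two defining conditions of a partition are easy to check mechanically. A minimal Python sketch (the function name and test sets are illustrative, not from the text):

```python
def is_partition(omega, blocks):
    """Check that `blocks` is a complete system of disjoint events:
    pairwise disjoint sets whose union is the whole sample space."""
    union = set()
    for S in blocks:
        if union & S:        # overlap with an earlier block -> not disjoint
            return False
        union |= S
    return union == omega    # complete system?

omega = {1, 2, 3, 4, 5, 6}
assert is_partition(omega, [{1, 2}, {3, 4}, {5, 6}])       # valid partition
assert not is_partition(omega, [{1, 2, 3}, {3, 4, 5, 6}])  # blocks overlap
assert not is_partition(omega, [{1, 2}, {3, 4}])           # not exhaustive
```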
As we have mentioned, in most cases we are interested in events other than the elementary ones. We single them out in a class of events that contains all the possible results of the experiment we are interested in, such that when we ask about the union, intersection and complements of events we obtain elements that belong to the same class. A non-empty family B = {S_i}_{i=1}^n of subsets of the sample space Ω that is closed (or stable) under the operations of union and complement forms a Boole algebra; that is:

Ω ∈ B;  ∅ ∈ B;  S_i ∩ S_j ∈ B;  S_i^c ∪ S_j^c ∈ B;  (S_i^c ∪ S_j^c)^c ∈ B;  S_i \ S_j ∈ B;

∪_{i=1}^m S_i ∈ B   and   ∩_{i=1}^m S_i ∈ B.
1.1.2.1 σ-algebras
If the sample space is countable, we have to generalize the Boole algebra such that unions and intersections can be done a countable number of times, always getting events that belong to the same class; that is:

∪_{i=1}^∞ S_i ∈ B   and   ∩_{i=1}^∞ S_i ∈ B

with {S_i}_{i=1}^∞ ∈ B. These algebras are called σ-algebras. Not all the Boole algebras satisfy these properties but the σ-algebras are always Boole algebras (closed under finite union).
Consider for instance an infinite set E and the class A of subsets of E that are either finite or have finite complements. The finite union of subsets of A belongs to A because the finite union of finite sets is a finite set and the finite union of sets that have finite complements has finite complement. However, a countable union of finite sets can be a countably infinite set whose complement is also infinite, so it does not belong to A. Thus, A is a Boole algebra but not a σ-algebra.
Let now E be any infinite set and B the class of subsets of E that are either countable or have countable complements. The finite or countable union of countable sets is countable and therefore belongs to B. The finite or countable union of sets with countable complements has, in turn, a countable complement (the intersection of the individual complements) and belongs to B as well, so B is a σ-algebra.
Eventually, we are going to assign a probability to the events of interest that belong
to the algebra and, anticipating concepts, probability is just a bounded measure so
we need a class of measurable sets with the structure of a σ-algebra. Now, it turns out that when the sample space Ω is a non-denumerable topological space there exist non-measurable subsets that obviously can not be considered as events.3 We are particularly interested in R (or, in general, in R^n) so we have to construct a family B_R of measurable subsets of R that is

(i) closed under a countable number of intersections: {B_i}_{i=1}^∞ ∈ B_R → ∩_{i=1}^∞ B_i ∈ B_R;
(ii) closed under complements: B ∈ B_R → B^c = R\B ∈ B_R.

Observe that, for instance, the family of all subsets of R satisfies the conditions (i) and (ii), and the intersection of any collection of families that satisfy them is a family that also fulfills these conditions, but not all of them are measurable. Measurability is the key condition. Let us start by identifying what we shall consider the basic sets in R to engender an algebra. The sample space R is a linear set of points and, among its subsets, we have the intervals. In particular, if a ≤ b are any two points of R we have:
• open intervals: (a, b) = {x ∈ R | a < x < b}
• closed intervals: [a, b] = {x ∈ R | a ≤ x ≤ b}
• half-open intervals on the right: [a, b) = {x ∈ R | a ≤ x < b}
• half-open intervals on the left: (a, b] = {x ∈ R | a < x ≤ b}
When a = b the closed interval reduces to a point {x = a} (degenerate interval) and the other three to the null set and, when a → −∞ or b → ∞, we have the infinite intervals (−∞, b), (−∞, b], (a, ∞) and [a, ∞). The whole space R can be considered as the interval (−∞, ∞) and any interval will be a subset of R. Now, consider
the class of all intervals of R of any of the aforementioned types. It is clear that the
intersection of a finite or countable number of intervals is an interval but the union
is not necessarily an interval; for instance [a1 , b1 ] ∪ [a2 , b2 ] with a2 > b1 is not an
interval. Thus, this class is not additive and therefore not a closed family. However,
3 It is not difficult to show the existence of Lebesgue non-measurable sets in R. One simple example is the Vitali set constructed by G. Vitali in 1905, although there are other interesting examples (Hausdorff, Banach–Tarski) and they all assume the Axiom of Choice. In fact, the work of R.M. Solovay around the 70s shows that one can not prove the existence of Lebesgue non-measurable sets without it. However, one can not specify the choice function, so one can prove their existence but can not make an explicit construction in the sense Set Theorists would like. In Probability Theory, we are interested only in Lebesgue measurable sets, so those which are not have nothing to do in this business and Borel’s algebra contains only measurable sets.
it is possible to construct an additive class including, along with the intervals, other
measurable sets so that any set formed by countably many operations of unions,
intersections and complements of intervals is included in the family. Suppose, for
instance, that we take the half-open intervals on the right [a, b), b > a as the initial
class of sets4 to generate the algebra BR so they are in the bag to start with. The
open, closed and degenerate intervals are

(a, b) = ∪_{n=1}^∞ [a + 1/n, b);   [a, b] = ∩_{n=1}^∞ [a, b + 1/n)   and   {a} = {x ∈ R | x = a} = [a, a]
so they go also to the bag as well as the half-open intervals (a, b] = (a, b) ∪ [b, b]
and the countable union of unitary sets and their complements. Thus, countable sets
like N , Z or Q are in the bag too. Those are the sets we shall deal with.
The smallest family B_R (or simply B) of measurable subsets of R that contains all intervals and is closed under complements and countable number of intersections has the structure of a σ-algebra, is called Borel’s algebra, and its elements are generically called Borel’s sets or borelians. Last, recall that half-open intervals are Lebesgue measurable (λ((a, b]) = b − a) and so is any set built up from a countable number of unions, intersections and complements, so all Borel sets are Lebesgue measurable and every Lebesgue measurable set differs from a Borel set by at most a set of measure zero. Whatever has been said about R is applicable to the n-dimensional Euclidean space R^n.
The pair (Ω, B_Ω) is called a measurable space and in the next section it will be equipped with a measure and “upgraded” to a measure space and eventually to a probability space.
A function f : A ∈ B_Ω → R that assigns to each set A ∈ B_Ω one, and only one, real number, finite or not, is called a set function. Given a sequence {A_i}_{i=1}^n of subsets of B_Ω pair-wise disjoint (A_i ∩ A_j = ∅; i, j = 1, …, n; i ≠ j), we say that the set function is additive (finitely additive) if:

f(∪_{i=1}^n A_i) = Σ_{i=1}^n f(A_i)

or σ-additive if, for a countable sequence {A_i}_{i=1}^∞ of pair-wise disjoint sets of B_Ω,

f(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ f(A_i)
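A familiar illustration of an additive set function is the counting measure card(·) on a finite space. A minimal Python sketch (the particular sets are arbitrary choices, not from the text):

```python
# The counting measure card(.) assigns to each finite set
# its number of elements.
def card(A):
    return len(A)

# Three pairwise disjoint sets.
A1, A2, A3 = {1, 2}, {3}, {4, 5, 6}
assert not (A1 & A2 or A1 & A3 or A2 & A3)   # pairwise disjoint

# Additivity: card(A1 U A2 U A3) = card(A1) + card(A2) + card(A3)
# when the Ai are disjoint.
assert card(A1 | A2 | A3) == card(A1) + card(A2) + card(A3)
```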
4 The same algebra is obtained if one starts with (a, b), (a, b] or [a, b].
It is clear that any σ-additive set function is additive but the converse is not true.
A countably additive set function is a measure on the algebra B_Ω, a signed measure in fact. If the σ-additive set function is μ : A ∈ B_Ω → [0, ∞) (i.e., μ(A) ≥ 0 for all A ∈ B_Ω), it is a non-negative measure. In what follows, whenever we talk about measures μ, ν, … on a σ-algebra we shall assume that they are always non-negative measures without further specification. If μ(A) = 0 we say that A is a set of zero measure.
The “trio” (Ω, B_Ω, μ), with Ω a non-empty set, B_Ω a σ-algebra of the sets of Ω and μ a measure over B_Ω, is called a measure space and the elements of B_Ω measurable sets.
In the particular case of the n-dimensional Euclidean space Ω = R^n, the σ-algebra is the Borel algebra and all the Borel sets are measurable. Thus, the intervals I of any kind are measurable sets and satisfy that

(i) if I ⊂ R is measurable → I^c = R − I is measurable;
(ii) if {I_i}_{i=1}^∞ ⊂ R are measurable → ∪_{i=1}^∞ I_i is measurable.
Countable sets are Borel sets of zero measure for, if μ is the Lebesgue measure (see
Appendix 2), we have that μ ([a, b)) = b − a and therefore:
μ({a}) = lim_{n→∞} μ([a, a + 1/n)) = lim_{n→∞} 1/n = 0
Thus, any point is a Borel set with zero Lebesgue measure and, μ being a σ-additive function, any countable set has zero measure. The converse is not true since there are borelians with zero measure that are not countable (e.g. Cantor’s ternary set).
In general, a measure μ over B satisfies that, for any A, B ∈ B not necessarily
disjoint:
(m.1) A ∪ B is the union of two disjoint sets A and B\A and the measure is an
additive set function;
(m.2) A ∩ B^c and B are disjoint and their union is A ∪ B so μ(A ∪ B) = μ(A ∩ B^c) + μ(B). On the other hand, A ∩ B^c and A ∩ B are disjoint and their union is A so μ(A ∩ B^c) + μ(A ∩ B) = μ(A). It is enough to substitute μ(A ∩ B^c) in the previous expression;
(m.3) from (m.1) and considering that, if A ⊆ B, then A ∪ B = B
(m.4) from (m.3) with B = A.
A measure μ over a measurable space (Ω, B_Ω) is finite if μ(Ω) < ∞ and σ-finite if Ω = ∪_{i=1}^∞ A_i, with A_i ∈ B_Ω and μ(A_i) < ∞. Clearly, any finite measure is σ-finite but the converse is not necessarily true. For instance, the Lebesgue measure λ in (R^n, B_{R^n}) is not finite because λ(R^n) = ∞ but is σ-finite because

R^n = ∪_{k ∈ N} [−k, k]^n
and λ([−k, k]^n) = (2k)^n is finite. As we shall see in Chap. 2, in some circumstances we shall be interested in the limiting behaviour of σ-finite measures over a sequence of compact sets. As a second example, consider the measurable space (R, B) and μ such that, for A ∈ B, μ(A) = card(A) if A is finite and ∞ otherwise. Since R is an uncountable union of finite sets, μ is not σ-finite in R. However, it is σ-finite in (N, B_N).
Let (Ω, B_Ω) be a measurable space. A measure P over B_Ω (that is, with domain in B_Ω), image in the closed interval [0, 1] ⊂ R and such that P(Ω) = 1 (finite) is called a probability measure and its properties are just those of finite (non-negative) measures. Stating the axioms explicitly, a probability measure is a set function with domain in B_Ω and image in the closed interval [0, 1] ⊂ R that satisfies three axioms:

(i) non-negativity: P(A) ≥ 0 for all A ∈ B_Ω;
(ii) certainty: P(Ω) = 1;
(iii) σ-additivity: P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i) for any countable sequence {A_i}_{i=1}^∞ of pair-wise disjoint sets of B_Ω.

These properties coincide obviously with those of the frequency and combinatorial probability (see Note 1). All probability measures are finite (P(Ω) = 1) and any bounded measure can be converted into a probability measure by proper normalization. The measurable space (Ω, B_Ω) provided with a probability measure P is called the probability space (Ω, B_Ω, P). It is straightforward to see that if A, B ∈ B then, for instance, P(A^c) = 1 − P(A), P(∅) = 0, P(A ∪ B) = P(A) + P(B) − P(A ∩ B) and, if A ⊆ B, P(A) ≤ P(B).
union of finite sets is a countable set. In consequence, we can assign finite probabilities to at most a countable subset of R.
NOTE 1: What is probability?
It is very interesting to see how, along the 500 years of history of probability, many people (Galileo, Fermat, Pascal, Huygens, Bernoulli, Gauss, De Moivre, Poisson,…) have approached different problems and developed concepts and theorems (Laws of Large Numbers, Central Limit, Expectation, Conditional Probability,…), and yet a proper definition of probability has been so elusive. Certainly there is a before and after Kolmogorov’s “General Theory of Measure and Probability Theory” and “Grundbegriffe der Wahrscheinlichkeitsrechnung” so, from the mathematical point of view, the question is clear after the 1930s. But, as Poincaré said in 1912: “It is very difficult to give a satisfactory definition of Probability”. Intuitively, what is probability?
The first “definition” of probability was the Combinatorial Probability (∼1650). This is an objective concept (i.e., independent of the individual) and is based on Bernoulli’s Principle of Symmetry or Insufficient Reason: all the possible outcomes of the experiment are equally likely. For its evaluation we have to know the cardinality (ν(·)) of the set of all possible results of the experiment (ν(Ω)) and the probability for an event A ⊂ Ω is “defined” by Laplace’s rule: P(A) = ν(A)/ν(Ω). This concept of probability, implicitly admitted by Pascal and Fermat and explicitly stated by Laplace, is an a priori probability in the sense that it can be evaluated before, or even without, doing the experiment. It is however meaningless if Ω is a countable set (ν(Ω) = ∞) and one has to justify the validity of the Principle of Symmetry, which does not always hold and has originated some interesting debates. For instance, in a problem attributed to D’Alembert, a player A tosses a coin twice and wins if H appears in at least one toss. According to Fermat, one can get {(T T ), (T H ), (H T ), (H H )} and A will lose only in the first case so, the four cases being equally likely, the probability for A to win is P = 3/4. Pascal gave the same result. However, for Roberval one should consider only {(T T ), (T H ), (H ·)} because A has already won if H appears at the first toss, so P = 2/3. Obviously, Fermat and Pascal were right because, in this last case, the three possibilities are not all equally likely and the Principle of Symmetry does not apply.
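Fermat's enumeration can be checked by brute force over the four equally likely outcomes of two tosses. A small Python sketch (not from the text):

```python
from itertools import product
from fractions import Fraction

# The four equally likely outcomes of two tosses: (TT), (TH), (HT), (HH).
outcomes = list(product("HT", repeat=2))
assert len(outcomes) == 4

# Player A wins if H appears in at least one toss.
wins = [w for w in outcomes if "H" in w]

# Laplace's rule: P(A) = nu(A) / nu(Omega).
p_win = Fraction(len(wins), len(outcomes))
assert p_win == Fraction(3, 4)   # Fermat and Pascal's answer
```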
The second interpretation of probability is the Frequentist Probability, based on the idea of frequency of occurrence of an event. If we repeat the experiment n times and a particular event A_i appears n_i times, the relative frequency of occurrence is f(A_i) = n_i/n. As n grows, it is observed (experimental fact) that this number stabilizes around a certain value and, in consequence, the probability of occurrence of A_i is defined as P(A_i) ≡ lim^{exp}_{n→∞} f(A_i). This is an objective concept inasmuch as it is independent of the observer.
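The stabilization of relative frequencies can be watched numerically with a simulated die. A hedged Python sketch (the seed and sample size are arbitrary choices; the frequency approaches, but of course never exactly equals, 1/6):

```python
import random

random.seed(20210)   # arbitrary seed, only for reproducibility
n = 100_000

# Count occurrences of the event "get a 2" in n simulated rolls.
hits = sum(1 for _ in range(n) if random.randint(1, 6) == 2)
f = hits / n         # relative frequency f(A) = n_i / n

# For large n, f(A) stabilizes around P(A) = 1/6 ~ 0.1667.
assert abs(f - 1/6) < 0.01
```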
In many circumstances, the possible outcomes of the experiments are not numeric
(a die with colored faces, a person may be sick or healthy, a particle may decay
in different modes,…) and, even in the case they are, the possible outcomes of the
experiment may form a non-denumerable set. Ultimately, we would like to deal with
numeric values and benefit from the algebraic structures of the real numbers and the
theory behind measurable functions and, for this, given a measurable space (Ω, B_Ω), we define a function X(w) : w ∈ Ω → R that assigns to each event w of the sample space one and only one real number.
In a more formal way, consider two measurable spaces (Ω, B_Ω) and (Ω′, B_Ω′) and a function

X(w) : w ∈ Ω → X(w) ∈ Ω′

Obviously, since we are interested in the events that conform the σ-algebra B_Ω, the same structure has to be maintained in (Ω′, B_Ω′) by the application X(w), for otherwise we won’t be able to answer the questions of interest. Therefore, we require the function X(w) to be Lebesgue measurable with respect to the σ-algebra B_Ω; i.e.:

X^{-1}(B′) = B ∈ B_Ω,  ∀ B′ ∈ B_Ω′
so we can ultimately identify P(B′) with P(B). Usually, we are interested in the case Ω′ = R (or R^n) so B_Ω′ is the Borel σ-algebra and, since we have generated the Borel algebra B from the half-open intervals I_x = (−∞, x] with x ∈ R, we have that X(w) will be a Lebesgue measurable function over the Borel algebra (Borel measurable) if, and only if:

X^{-1}(I_x) = {w ∈ Ω | X(w) ≤ x} ∈ B_Ω  ∀ x ∈ R
We could have generated as well the Borel algebra from open, closed or half-open
intervals on the right so any of the following relations, all equivalent, serve to define
a Borel measurable function X (w):
(1) {w|X (w) > c} ∈ B ∀c ∈ R;
(2) {w|X (w) ≥ c} ∈ B ∀c ∈ R;
(3) {w|X (w) < c} ∈ B ∀c ∈ R;
(4) {w|X (w) ≤ c} ∈ B ∀c ∈ R
To summarize:
• Given a probability space (Ω, B_Ω, Q), a random variable is a function X(w) : Ω → R, Borel measurable over the σ-algebra B_Ω, that allows us to work with the induced probability space (R, B, P).5
From this definition, it is clear that the name “random variable” is quite unfortunate inasmuch as it is a univocal function, neither random nor variable. Thus, at least to get rid of variable, the term “random quantity” is frequently used to designate
5 It is important to note that a random variable X(w) : Ω → R is measurable with respect to the σ-algebra B_Ω.
Example 1.1 Consider the measurable space (Ω, B_Ω) and X(w) : Ω → R. Then:

X(w) = Σ_{k=1}^n a_k 1_{A_k}(w)

where a_k ∈ R and {A_k}_{k=1}^n is a partition of Ω, is Borel measurable, and any random quantity that takes a finite number of values can be expressed in this way.
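This representation as a finite sum of indicator functions over a partition can be sketched directly in Python (the helper names and the particular quantities are illustrative, not from the text):

```python
def indicator(A):
    """1_A(w): 1 if w belongs to A, 0 otherwise."""
    return lambda w: 1 if w in A else 0

def simple_rv(terms):
    """X(w) = sum_k a_k 1_{A_k}(w), for coefficients a_k and
    a partition {A_k} of the sample space."""
    return lambda w: sum(a * indicator(A)(w) for a, A in terms)

# Number of heads in a single coin toss: X(H) = 1, X(T) = 0.
X = simple_rv([(1, {"H"}), (0, {"T"})])
assert X("H") == 1 and X("T") == 0

# A "parity" quantity for the die: Y = 1 on {2, 4, 6}, Y = 0 on {1, 3, 5}.
Y = simple_rv([(1, {2, 4, 6}), (0, {1, 3, 5})])
assert Y(4) == 1 and Y(3) == 0
```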
• Let Ω = [0, 1]. It is obvious that if G is a non-measurable Lebesgue subset of [0, 1], the function X(w) = 1_{G^c}(w) is not measurable over B_{[0,1]} because a ∈ [0, 1) → X^{-1}(I_a) = G ∉ B_{[0,1]}.
• Consider a coin tossing, the elementary events

e1 = {H} and e2 = {T} → Ω = {e1, e2},

the algebra B_Ω = {∅, Ω, {e1}, {e2}} and the function X : Ω → R that denotes the number of heads:

a ∈ (−∞, 0) → X^{-1}(I_a) = ∅ ∈ B_Ω
a ∈ [0, 1) → X^{-1}(I_a) = {e2} ∈ B_Ω
a ∈ [1, ∞) → X^{-1}(I_a) = {e1, e2} = Ω ∈ B_Ω
Example 1.2 Let Ω = [0, 1] and consider the sequence of functions X_n(w) = 2^n 1_{Δ_n}(w), where w ∈ Ω, Δ_n = [1/2^n, 1/2^{n−1}] and n ∈ N. Each X_n(w) is measurable iff, ∀ r ∈ R, A = {w ∈ Ω | X_n(w) > r} is a Borel set of B_Ω. Then:
(1) r ∈ [2^n, ∞) → A = ∅ ∈ B_Ω with λ(A) = 0;
(2) r ∈ [0, 2^n) → A = [1/2^n, 1/2^{n−1}] ∈ B_Ω with λ(A) = 2/2^n − 1/2^n = 1/2^n;
(3) r ∈ (−∞, 0) → A = [0, 1] = Ω with λ(Ω) = 1.
Thus, each X_n(w) is a measurable function.
Problem 1.1 Consider the experiment of tossing two coins, the elementary events

e1 = {H H}, e2 = {H T}, e3 = {T H}, e4 = {T T} → Ω = {e1, e2, e3, e4}.

The functions X(w) : Ω → R such that X(e1) = 2, X(e2) = X(e3) = 1, X(e4) = 0 (number of heads) and Y(w) : Ω → R such that Y(e1) = Y(e2) = 1, Y(e3) = Y(e4) = 0: with respect to which algebras are they admissible random quantities? (sol.: X wrt B1; Y wrt B2)
Suppose an experiment that consists of rolling a die with faces numbered from one to six and the event e2 = {get the number two on the upper face}. If the die is fair, based on the Principle of Insufficient Reason, you and your friend would consider it reasonable to assign equal chances to any of the possible outcomes and therefore a probability of P1(e2) = 1/6. Now, if I look at the die and tell you, and only you, that the outcome of the roll is an even number, you will change your beliefs on the occurrence of event e2 and assign the new value P2(e2) = 1/3. Both of you assign different probabilities because you do not share the same knowledge, so it may be a truism but it is clear that the probability we assign to an event is subjective and is conditioned by the information we have about the random process. In one way or another, probabilities are always conditional degrees of belief since there is always some state of information (even before we do the experiment we know that whatever number we shall get is not less than one and not greater than six) and we always assume some hypothesis (the die is fair so we can rely on the Principle of Symmetry).
Consider a probability space (Ω, B_Ω, P) and two events A, B ∈ B_Ω that are not disjoint, so A ∩ B ≠ ∅. The probability for both A and B to happen is P(A ∩ B) ≡ P(A, B). Since Ω = B ∪ B^c and B ∩ B^c = ∅ we have that:

P(A) ≡ P(A ∩ Ω) = P(A ∩ B) + P(A ∩ B^c) = P(A, B) + P(A\B)

where P(A, B) is the probability for A and B to occur and P(A\B) the probability for A to happen and not B.
What is the probability for A to happen if we know that B has occurred? The probability of A conditioned to the occurrence of B is called conditional probability of A given B and is expressed as P(A|B). This is equivalent to calculating the probability for A to happen in the probability space (Ω′, B_Ω′, P′) with Ω′ = B the reduced sample space where B has already occurred and B_Ω′ the corresponding sub-algebra that does not contain B^c. We can set P(A|B) ∝ P(A ∩ B) and define (Kolmogorov) the conditional probability for A to happen once B has occurred as:

P(A|B) =_{def.} P(A ∩ B)/P(B) = P(A, B)/P(B)
provided that P(B) ≠ 0, for otherwise the conditional probability is not defined. This normalization factor ensures that P(B|B) = P(B ∩ B)/P(B) = 1. Conditional
probabilities satisfy the basic axioms of probability:
(i) non-negative: since (A ∩ B) ⊂ B, 0 ≤ P(A|B) ≤ 1;
(ii) unit measure (certainty): P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1;
(iii) σ-additive: for a countable sequence of disjoint sets {A_i}_{i=1}^∞,

P(∪_{i=1}^∞ A_i | B) = P((∪_{i=1}^∞ A_i) ∩ B)/P(B) = Σ_{i=1}^∞ P(A_i ∩ B)/P(B) = Σ_{i=1}^∞ P(A_i|B)
Two events A, B ∈ B are statistically independent⁶ if

P(A, B) = P(A)P(B)

in which case

P(A, B) = P(A)P(B) −→ P(A|B) = P(A, B)/P(B) = P(A)P(B)/P(B) = P(A)
⁶ In fact, for the events A, B ∈ B we should talk about conditional independence for it is
true that if C ∈ B, it may happen that P(A, B) = P(A)P(B) but, conditioned on C,
P(A, B|C) ≠ P(A|C)P(B|C), so A and B are related through the event C. On the other hand,
P(A|B) ≠ P(A) does not imply that B has a “direct” effect on A. Whether this is the case or
not has to be determined by reasoning on the process and/or additional evidence. Bernard Shaw
said that we should all buy an umbrella because there is statistical evidence that those who do have
a higher life expectancy. And this is certainly true. However, it is more reasonable to suppose that,
instead of umbrellas having any mysterious influence on our health, in London at the beginning
of the twentieth century whoever could afford to buy an umbrella most likely had a well-off status,
healthy living conditions, access to medical care,…
If this is not the case, we say that they are statistically dependent or correlated. In
general, we have that:

P(A|B) > P(A) → the events A and B are positively correlated; that is, that B has already occurred increases the chances for A to happen;
P(A|B) < P(A) → the events A and B are negatively correlated; that is, that B has already occurred reduces the chances for A to happen;
P(A|B) = P(A) → the events A and B are not correlated, so the occurrence of B does not modify the chances for A to happen.
More generally, the events of a sequence {Ai}_{i=1}^n are (mutually) independent if

P(Aj, . . ., Am) = P(Aj) · · · P(Am)

for any finite subsequence {Ak}_{k=j}^m; 1 ≤ j < m ≤ n of events. Thus, for instance,
for a sequence of 3 events {A1, A2, A3} the condition of independence requires that:

P(Ai, Aj) = P(Ai)P(Aj) for i ≠ j and P(A1, A2, A3) = P(A1)P(A2)P(A3)
Example 1.3 On four cards (C1, C2, C3 and C4) we write the numbers 1 (C1), 2 (C2),
3 (C3) and 123 (C4) and make a fair random extraction. Let Aj (j = 1, 2, 3) be the
event “the extracted card has the number j written on it”. Each number appears on
two of the four cards so P(Aj) = 2/4 = 1/2.

Now, I look at the card and tell you that it has the number j. Since you know that
Aj has happened, you know that the extracted card was either Cj or C4 and the
only possibility for Ai (i ≠ j) to have happened too is that the extracted card was C4,
so the conditional probabilities are

P(Ai |Aj) = 1/2; i, j = 1, 2, 3; i ≠ j

Then, since

P(Ai |Aj) = P(Ai); i, j = 1, 2, 3; i ≠ j

the events are pairwise independent. However, if I tell you that the events A2 and A3
have occurred, then you are certain that the chosen card is C4 and therefore A1 has
happened too, so P(A1 |A2, A3) = 1. But

P(A1, A2, A3) = P(C4 extracted) = 1/4 ≠ P(A1)P(A2)P(A3) = (1/2)(1/2)(1/2) = 1/8

so the events {A1, A2, A3} are not independent even though they are pairwise inde-
pendent.
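These numbers can be checked by exact enumeration of the four equally likely cards (a minimal Python sketch; the card and event names are ours):

```python
# Exact enumeration for Example 1.3: four equiprobable cards,
# A_j = "the extracted card carries the number j".
cards = {"C1": {1}, "C2": {2}, "C3": {3}, "C4": {1, 2, 3}}

def prob(event):
    """P(event) = fraction of the 4 equally likely cards satisfying it."""
    return sum(event(nums) for nums in cards.values()) / len(cards)

A = {j: (lambda nums, j=j: j in nums) for j in (1, 2, 3)}

p1 = prob(A[1])                                       # P(A1) = 1/2
p12 = prob(lambda n: A[1](n) and A[2](n))             # P(A1, A2) = 1/4 = P(A1)P(A2)
p123 = prob(lambda n: all(a(n) for a in A.values()))  # P(A1, A2, A3) = 1/4 != 1/8
```

Pairwise independence holds (1/4 = 1/2 · 1/2), but the triple joint probability is 1/4, not 1/8.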
This leads to Bonferroni's inequality

P(A1, . . ., An) ≥ P(A1) + · · · + P(An) − (n − 1)

which gives a lower bound for the joint probability P(A1, . . ., An). For n = 1 it is
trivially true since P(A1) ≥ P(A1). For n = 2 we have that

P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2) ≤ 1 −→ P(A1 ∩ A2) ≥ P(A1) + P(A2) − 1

Proceed then by induction. Assume the statement is true for n − 1 and see if it is so
for n. If B_{n−1} = A1 ∩ · · · ∩ A_{n−1} and we apply the result we got for n = 2 we have that

P(A1 ∩ · · · ∩ An) = P(B_{n−1} ∩ An) ≥ P(B_{n−1}) + P(An) − 1

but

P(B_{n−1}) = P(B_{n−2} ∩ A_{n−1}) ≥ P(B_{n−2}) + P(A_{n−1}) − 1

so

P(A1 ∩ · · · ∩ An) ≥ P(B_{n−2}) + P(A_{n−1}) + P(An) − 2

and, iterating down to B1 = A1, we get P(A1 ∩ · · · ∩ An) ≥ Σ_{i=1}^n P(Ai) − (n − 1).
(Theorem of Total Probability) If {Si}_{i=1}^n is a partition of the sample space, then
for any event A ∈ B:

P(A) = P(A ∩ Ω) = P(A ∩ (∪_{i=1}^n Si)) = Σ_{i=1}^n P(A ∩ Si) = Σ_{i=1}^n P(A|Si)·P(Si)

In the same way, for events {Bk}_{k=1}^m,

P(Bk) = Σ_{i=1}^n P(Bk |Si)P(Si); k = 1, . . ., m

and, if {Bk}_{k=1}^m is also a partition of Ω,

Σ_{k=1}^m P(Bk) = Σ_{i=1}^n P(Si) Σ_{k=1}^m P(Bk |Si) = Σ_{i=1}^n P(Si) = 1
Since for any two events A, B ∈ B we can write

P(A, B) = Σ_{i=1}^n P(A, B, Si)

we have that

P(A|B) = P(A, B)/P(B) = (1/P(B)) Σ_{i=1}^n P(A, B, Si) = Σ_{i=1}^n P(A|B, Si)P(Si |B).
Example 1.5 We have two indistinguishable urns: U1 with three white and two black
balls and U2 with two white balls and three black ones. What is the probability that
in a random extraction we get a white ball?

Consider the events Ai = “choose urn Ui” (i = 1, 2) and B = “extract a white ball”.
Then

P(B|A1) = 3/5; P(B|A2) = 2/5 and P(A1) = P(A2) = 1/2

so we have that

P(B) = Σ_{i=1}^2 P(B|Ai)·P(Ai) = (3/5)(1/2) + (2/5)(1/2) = 1/2
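The same number can be reproduced by simulating the two-step extraction (a sketch, with the urn contents of the example and a fixed seed for reproducibility):

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
urns = {1: ["w"] * 3 + ["b"] * 2,   # U1: three white, two black
        2: ["w"] * 2 + ["b"] * 3}   # U2: two white, three black

n, white = 200_000, 0
for _ in range(n):
    urn = random.choice((1, 2))          # P(A1) = P(A2) = 1/2
    white += random.choice(urns[urn]) == "w"

p_white = white / n                      # should be close to P(B) = 1/2
```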
Given a probability space (Ω, B, P) we have seen that the joint probability for
two events A, B ∈ B can be expressed in terms of conditional probabilities as:

P(A, B) = P(A|B)P(B) = P(B|A)P(A)

The Bayes Theorem (Bayes ∼1770s and, independently, Laplace ∼1770s) states that
if P(B) ≠ 0, then

P(A|B) = P(B|A)P(A)/P(B)

apparently a trivial statement but with profound consequences. Let's see other expres-
sions of the theorem. If H = {Hi}_{i=1}^n is a partition of the sample space, then

P(A) = Σ_{k=1}^n P(A|Hk)P(Hk)

and therefore

P(Hi |A) = P(A|Hi)P(Hi) / Σ_{k=1}^n P(A|Hk)P(Hk)
P(A|Hi): is the probability for event A to happen given that event Hi has
occurred. This may be different for each i = 1, 2, . . ., n and,
when considered as a function of Hi, it is usually called the likelihood;
Clearly, if the events A and Hi are independent, the occurrence of A does not provide
any information on the chances for Hi to happen. Whether it has occurred or not does
not modify our beliefs about Hi and therefore P(Hi |A) = P(Hi ).
In the first place, it is interesting to note that the occurrence of A restricts the sample
space for H and modifies the prior chances P(Hi) for Hi in the same proportion as
the occurrence of Hi modifies the probability for A, because

P(Hi |A)/P(Hi) = P(A|Hi)/P(A)

Second, from Bayes Theorem we can obtain relative posterior probabilities (in the
case, for instance, that P(A) is unknown) because

P(Hi |A)/P(Hj |A) = [P(A|Hi)P(Hi)] / [P(A|Hj)P(Hj)]

Last, conditioning all the probabilities to H0 (maybe some conditions that are
assumed) we get a third expression of Bayes Theorem⁷

P(Hi |A, H0) = P(A|Hi, H0)P(Hi |H0) / P(A|H0)
⁷ Although it is usually the case, the terms prior and posterior do not necessarily imply a temporal
ordering.
where H0 represents some initial state of information or some conditions that are
assumed. The posterior degree of credibility we have on Hi is certainly meaningful
when we have an initial degree of information and therefore is relative to our prior
beliefs. And those are subjective inasmuch as different people may assign a different
prior degree of credibility based on their previous knowledge and experiences. Think
for instance of soccer pools. Different people will assign different prior probabili-
ties to one or another team depending on what they know before the match, and this
information may not be shared by all of them. However, to the extent that they share
common prior knowledge they will arrive at the same conclusions.
Bayes's rule provides a natural way to include new information and update our
beliefs in a sequential way. After the event (data) D1 has been observed, we have

P(Hi) −→ P(Hi |D1) = P(D1 |Hi)P(Hi)/P(D1) ∝ P(D1 |Hi)P(Hi)

In the third expression of the theorem:

P(A|Hi, H0): is the probability that the effect A is produced by the cause (or
hypothesis) Hi;
P(Hi |A, H0): is the posterior probability we have for Hi being the cause of the
event (effect) A that has already been observed.
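Sequential updating is easy to see numerically. In this sketch (ours, not the book's), two hypotheses about a coin's heads probability, H1: p = 0.7 and H2: p = 0.5, start from a uniform prior, and the posterior after the first toss is used as the prior for the second:

```python
# Two hypotheses about a coin and a uniform prior over them.
p_heads = {"H1": 0.7, "H2": 0.5}
prior = {"H1": 0.5, "H2": 0.5}

def update(prior, toss):
    """One Bayes step: posterior ∝ likelihood × prior, then normalize."""
    like = {h: (p if toss == "h" else 1 - p) for h, p in p_heads.items()}
    post = {h: like[h] * prior[h] for h in prior}
    norm = sum(post.values())
    return {h: v / norm for h, v in post.items()}

after_1 = update(prior, "h")      # posterior after the first head
after_2 = update(after_1, "h")    # that posterior becomes the new prior
```

Processing the two tosses sequentially gives the same posterior as processing both at once with the original prior, which is exactly the point of the sequential form of Bayes's rule.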
Let's see an example of clinical diagnosis, just because the problem is general
enough and the conclusions may be more disturbing. If you want, replace individuals by
events and, for instance, (sick, healthy) by (signal, background). The incidence
of a certain rare disease is 1 in every 10,000 people and there is an efficient diagnostic
test such that:

(1) If a person is sick, the test gives positive in 99% of the cases;
(2) If a person is healthy, the test may fail and give positive (false positive) in 0.5%
of the cases.

In this case, the effect is to give positive (T) in the test and the exclusive and
exhaustive hypotheses for the cause are

H1: be sick and H2: be healthy

with H2 = H1^c. A person, say you, is chosen randomly (H0) among the population
to undergo the test and gives positive. Then you get scared when they tell you: “The
probability of giving positive being healthy is 0.5%, very small” (p-value). There is
nothing wrong with the statement but it has to be correctly interpreted and usually
it is not. It means no more and no less than what the expression P(T |H2) says:
“under the assumption that you are healthy (H2) the chances of giving positive are
0.5%”, and this is nothing else but a feature of the test. It doesn't say anything about
P(H1 |T), the chance you have of being sick having given positive in the test which, in
the end, is what you are really interested in. The two probabilities are related by an additional
piece of information that appears in Bayes's formula: P(H1 |H0); that is, under the
hypothesis that you have been chosen at random (H0), what are the prior chances
to be sick? From the prior knowledge we have, the degree of credibility we assign to
each hypothesis is

P(H1 |H0) = 1/10000 and P(H2 |H0) = 1 − P(H1 |H0) = 9999/10000
On the other hand, if T denotes the event “give positive in the test” we know that:

P(T |H1) = 99/100 and P(T |H2) = 5/1000

Therefore, Bayes's Theorem tells us that the probability to be sick having given positive in
the test is

P(H1 |T) = P(T |H1)·P(H1 |H0) / Σ_{i=1}^2 P(T |Hi)·P(Hi |H0)
         = (99/100)(1/10000) / [(99/100)(1/10000) + (5/1000)(9999/10000)] ≈ 0.02
Thus, even if the test looks very efficient and you gave positive, the fact that you
were chosen at random and that the incidence of the disease in the population is very
small reduces dramatically the degree of belief you assign to being sick. Clearly, if you
were not chosen randomly but because there is a suspicion from other symptoms
that you are sick, the prior probabilities change.
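The arithmetic of the example can be checked directly, using exact rationals (numbers taken from the text):

```python
from fractions import Fraction as F

prior_sick = F(1, 10_000)        # P(H1|H0): incidence of the disease
prior_healthy = 1 - prior_sick   # P(H2|H0)
p_pos_sick = F(99, 100)          # P(T|H1): probability of a positive if sick
p_pos_healthy = F(5, 1000)       # P(T|H2): false-positive rate

# Bayes: P(H1|T) = P(T|H1)P(H1|H0) / [P(T|H1)P(H1|H0) + P(T|H2)P(H2|H0)]
evidence = p_pos_sick * prior_sick + p_pos_healthy * prior_healthy
posterior_sick = p_pos_sick * prior_sick / evidence
```

The exact value is 990/50985 ≈ 0.0194, below the 0.02 quoted in the text.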
1.3 Distribution Function
Given a random quantity X, its Distribution Function is defined⁸ as

F(x) def= P(X ≤ x) = P(X ∈ (−∞, x]); ∀x ∈ R

Note that the Distribution Function F(x) is defined for all x ∈ R so if supp{P(X)} =
[a, b], then F(x) = 0 ∀x < a and F(x) = 1 ∀x ≥ b. From the definition, it is easy
to show the following important properties:
(a) ∀x ∈ R we have that:
(a.1) P(X ≤ x) = F(x) (by definition);
(a.2) P(X < x) = F(x−);
(a.3) P(X > x) = 1 − P(X ≤ x) = 1 − F(x);
(a.4) P(X ≥ x) = 1 − P(X < x) = 1 − F(x−);
(b) ∀x1 < x2 ∈ R we have that:
(b.1) P(x1 < X ≤ x2) = P(X ∈ (x1, x2]) = F(x2) − F(x1);
(b.2) P(x1 ≤ X ≤ x2) = P(X ∈ [x1, x2]) = F(x2) − F(x1−)
(thus, if x1 = x2 then P(X = x1) = F(x1) − F(x1−));
(b.3) P(x1 < X < x2) = P(X ∈ (x1, x2)) = F(x2−) − F(x1) = F(x2) − F(x1) − P(X = x2);
(b.4) P(x1 ≤ X < x2) = P(X ∈ [x1, x2)) = F(x2−) − F(x1−) = F(x2) − F(x1) − P(X = x2) + P(X = x1).
The Distribution Function is discontinuous at all x ∈ R where F(x−) ≠ F(x+).
Let D be the set of all points of discontinuity. If x ∈ D, then F(x−) < F(x+) since
F is monotonous non-decreasing. Thus, we can associate to each x ∈ D a rational
number r(x) ∈ Q such that F(x−) < r(x) < F(x+) and all will be different
because if x1 < x2 ∈ D then F(x1+) ≤ F(x2−). Then, since Q is a countable set,
the set of points of discontinuity of F(x) is either finite or countable.
⁸ The condition P(X ≤ x) is due to the requirement that F(x) be continuous on the right. This is
not essential in the sense that any non-decreasing function G(x), defined on R, bounded between
0 and 1 and continuous on the left (G(x) = lim_{ε→0+} G(x − ε)) determines a distribution function
F(x) defined as G(x) for all x where G(x) is continuous and as G(x+) where G(x) is discontinuous.
In fact, in the general theory of measure it is more common to consider continuity on the left.
At each of them the distribution function has a “jump” of amplitude (property b.2):

P(X = x) = F(x) − F(x−)
Consider the probability space (Ω, F, Q), the random quantity X(w): w ∈ Ω →
X(w) ∈ R and the induced probability space (R, B, P). The function X(w) is a
discrete random quantity if its range (image) D = {x1, . . ., xi, . . .}, with xi ∈ R,
i = 1, 2, . . . a finite or countable set; that is, if {Ak; k = 1, 2, . . .} is a finite or
countable partition of Ω, the function X(w) is either:

simple: X(w) = Σ_{k=1}^n xk 1_{Ak}(w)    or    elementary: X(w) = Σ_{k=1}^∞ xk 1_{Ak}(w)
and, conversely, if such a function exists then ν << μ (see Appendix 1.3 for the
main properties). The function p(w) = dν(w)/dμ(w) is called the Radon density and
is unique up to, at most, a set of measure zero; that is, if

ν(A) = ∫_A p(w) dμ(w) = ∫_A f(w) dμ(w)

then μ{x | p(x) ≠ f(x)} = 0. Furthermore, if ν and μ are equivalent (ν∼μ; μ << ν
and ν << μ) then dν/dμ > 0 almost everywhere. In consequence, if we have a
probability space (R, B, P) with P equivalent to the Lebesgue measure, there exists
a non-negative Lebesgue integrable function (see Appendix 2) p: R −→ [0, ∞),
unique a.e., such that

P(A) ≡ P(X ∈ A) = ∫_A p(x) dx; ∀A ∈ B
and therefore, if supp{P(X)} = [a, b],

F(x) = P(X ≤ x) = 1_{[b,∞)}(x) + 1_{[a,b)}(x) ∫_a^x p(u) du
Note that, from the previous considerations, the value of the integral will not be
affected if we modify the integrand on a countable set of points. In fact, what we
actually integrate is an equivalence class of functions that differ only on a set of
measure zero. Therefore, a probability density function p(x) has to be continuous
for all x ∈ R except, at most, on a countable set of points. If F(x) is not differentiable
at a particular point, p(x) is not defined there, but the set of those points has zero
measure. However, if p(x) is continuous in R then F′(x) = p(x) and the value of
p(x) is univocally determined by F(x). We also have that
P(X ≤ x) = F(x) = ∫_{−∞}^x p(w) dw −→ P(X > x) = 1 − F(x) = ∫_x^{+∞} p(w) dw

and therefore:

P(x1 < X ≤ x2) = F(x2) − F(x1) = ∫_{x1}^{x2} p(w) dw
In general, any Distribution Function can be expressed as the combination

F(x) = Σ_{i=1}^{Nd} ai F_{d,i}(x) + Σ_{j=1}^{Nac} bj F_{ac,j}(x) + Σ_{k=1}^{Ns} ck F_{s,k}(x)

of discrete Distribution Functions (Fd(x)), absolutely continuous ones (Fac(x), with
derivative at every point so F′(x) = p(x)) and singular ones (Fs(x)). For the cases
we shall deal with, ck = 0.
Example 1.7 Consider a real parameter μ > 0 and a discrete random quantity X
that can take values {0, 1, 2, . . .} with a Poisson probability law:

P(X = k|μ) = e^{−μ} μ^k / Γ(k + 1); k = 0, 1, 2, . . .

(Γ(k + 1) = k!). The Distribution Function is

F(x|μ) = P(X ≤ x|μ) = Σ_{k=0}^{m=[x]} e^{−μ} μ^k / Γ(k + 1)

with a discontinuity at each integer k of amplitude

F(k|μ) − F(k − 1|μ) = P(X = k|μ) = e^{−μ} μ^k / Γ(k + 1)

Therefore, for reals x2 > x1 > 0 such that no integer lies in (x1, x2], P(x1 < X ≤ x2) =
F(x2) − F(x1) = 0.
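The jumps of the Poisson Distribution Function can be checked numerically (a pure-Python sketch, with μ = 5 as in Fig. 1.1):

```python
from math import exp, factorial

mu = 5.0

def pmf(k, mu=mu):
    """Poisson probability P(X = k | mu) = e^{-mu} mu^k / k!."""
    return exp(-mu) * mu ** k / factorial(k)

def cdf(x, mu=mu):
    """F(x|mu): sum of the pmf up to the integer part of x."""
    return sum(pmf(k, mu) for k in range(int(x) + 1)) if x >= 0 else 0.0

jump = cdf(3) - cdf(3 - 1e-9)   # amplitude of the discontinuity at k = 3
flat = cdf(2.7) - cdf(2.1)      # no integer in (2.1, 2.7] -> probability 0
```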
Example 1.8 Consider the function g(x) = e^{−ax} with a > 0 real and support in
(0, ∞). It is non-negative and Riemann integrable in R^+ so we can define a proba-
bility density

p(x|a) = e^{−ax} / ∫_0^∞ e^{−au} du = a e^{−ax} 1_{(0,∞)}(x)
Example 1.9 The ternary Cantor Set Cs(0, 1) is constructed iteratively. Starting with
the interval Cs0 = [0, 1], at each step one removes the open middle third of each of
the remaining segments. That is; at step one the interval (1/3, 2/3) is removed so
Cs1 = [0, 1/3]∪[2/3, 1] and so on. If we denote by Dn the union of the 2n−1 disjoint
open intervals removed at step n, each of length 1/3n , the Cantor set is defined as
Cs(0, 1) = [0, 1] \ ∪_{n=1}^∞ Dn. It is easy to check that any element X of the Cantor Set
can be expressed as

X = Σ_{n=1}^∞ Xn / 3^n
with supp{Xn} = {0, 2}⁹ and that Cs(0, 1) is a closed set, uncountable, nowhere
dense in [0, 1] and with zero measure. The Cantor Distribution, whose support is the
Cantor Set, is defined assigning probabilities P(Xn = 0) = P(Xn = 2) = 1/2.
Thus, X is a continuous random quantity with support on a non-denumerable set
of measure zero and cannot be described by a probability density function. The
Distribution Function F(x) = P(X ≤ x) (Cantor Function; Fig. 1.1) is an example
of a singular Distribution.
⁹ Note that the representation of a real number r ∈ [0, 1] as (a1, a2, . . .): r = Σ_{n=1}^∞ an 3^{−n} with
ai ∈ {0, 1, 2} is not unique. In fact x = 1/3 ∈ Cs(0, 1) and can be represented by (1, 0, 0, 0, . . .) or
(0, 2, 2, 2, . . .).
Fig. 1.1 Empirical distribution functions (ordinate) from a Monte Carlo sampling (10⁶ events) of
the Poisson Po(x|5) (discrete; upper left), Exponential Ex(x|1) (absolutely continuous; upper right)
and Cantor (singular; bottom) Distributions
The two-dimensional Distribution Function F(x1, x2) = P(X1 ≤ x1, X2 ≤ x2) satisfies:

(i) it is monotonous non-decreasing in both variables; that is, if x1 ≤ x1′ and x2 ≤ x2′ then
F(x1, x2) ≤ F(x1′, x2′).

Now, if (x1, x2), (x1′, x2′) ∈ R² with x1 < x1′ and x2 < x2′ we have that:

P(x1 < X1 ≤ x1′, x2 < X2 ≤ x2′) = F(x1′, x2′) − F(x1, x2′) − F(x1′, x2) + F(x1, x2) ≥ 0

and the corresponding differences will give the amplitude of the jump of the Distribution
Function at the points of discontinuity.
As for the one-dimensional case, for absolutely continuous random quantities we
can introduce a two-dimensional probability density function p(x): R² −→ R such that:

(i) p(x) ≥ 0; ∀x ∈ R²;
(ii) at every bounded interval of R², p(x) is bounded and Riemann integrable;
(iii) ∫_{R²} p(x) dx = 1

and

F(x1, x2) = ∫_{−∞}^{x1} du1 ∫_{−∞}^{x2} du2 p(u1, u2) ←→ p(x1, x2) = ∂²F(x1, x2)/∂x1 ∂x2.
It may happen that we are interested only in one of the two random quantities, say
X1. Then we ignore all aspects concerning X2 and obtain the one-dimensional
Distribution Function

F1(x1) = F(x1, +∞) = ∫_{−∞}^{x1} du1 ∫_{−∞}^{+∞} du2 p(u1, u2)

with p1(x1) the marginal probability density function¹⁰ of the random quantity X1:

p1(x1) = ∂F1(x1)/∂x1 = ∫_{−∞}^{+∞} p(x1, u2) du2
As we have seen, given a probability space (Ω, B, P), for any two sets A, B ∈ B
the conditional probability for A given B was defined as

P(A|B) def= P(A ∩ B)/P(B) ≡ P(A, B)/P(B)

Thus, for two discrete random quantities we define

P(X1 = x1 |X2 = x2) def= P(X1 = x1, X2 = x2)/P(X2 = x2)

and therefore

F(x1 |x2) = P(X1 ≤ x1, X2 = x2)/P(X2 = x2)

For absolutely continuous random quantities, the conditional density is¹¹

p(x1 |x2) def= p(x1, x2)/p(x2) = ∂F(x1 |x2)/∂x1
10 It is habitual to avoid the indices and write p(x) meaning “the probability density function of the
variable x ” since the distinctive features are clear within the context.
Then, we shall say that two random quantities X1 and X2 are statistically
independent if F(x1, x2) = F1(x1)F2(x2); that is, if

p(x1, x2) = ∂²F1(x1)F2(x2)/∂x1 ∂x2 = p(x1) p(x2) ←→ p(x1 |x2) = p(x1)
Example 1.10 Consider the probability space (Ω, B_Ω, λ) with Ω = [0, 1] and λ the
Lebesgue measure. If F is an arbitrary Distribution Function, X: w ∈ [0, 1] −→
F^{−1}(w) ∈ R is a random quantity distributed as F. Take the Borel set
I = (−∞, r] with r ∈ R. Since F is a Distribution Function, it is monotonous
non-decreasing and we have that:

P(X ≤ r) = λ{w ∈ [0, 1]: F^{−1}(w) ≤ r} = λ{w ∈ [0, 1]: w ≤ F(r)} = F(r)
Example 1.11 Consider the probability space (R, B, μ) with μ the probability mea-
sure

μ(A) = ∫_A dF; A ∈ B

¹¹ Recall that for continuous random quantities P(X2 = x2) = P(X1 = x1) = 0. One can justify
this expression with heuristic arguments; essentially considering X1 ∈ Δ1 = (−∞, x1],
X2 ∈ Δε(x2) = [x2, x2 + ε] and taking the limit ε → 0^+ of P(X1 ∈ Δ1 |X2 ∈ Δε).
This is the basis of the Inverse Transform sampling method that we shall see in
Chap. 3 on Monte Carlo techniques.
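A minimal sketch of the Inverse Transform method for the exponential density p(x|a) = a e^{−ax} of Example 1.8, for which F^{−1}(w) = −ln(1 − w)/a (our choice of distribution; a fixed seed keeps the run reproducible):

```python
import random
from math import log

random.seed(7)
a = 2.0                                   # rate parameter of Ex(x|a)

def sample_exponential(a):
    """Draw w ~ Uniform[0,1) and return F^{-1}(w) = -ln(1-w)/a."""
    w = random.random()
    return -log(1.0 - w) / a

xs = [sample_exponential(a) for _ in range(100_000)]
mean = sum(xs) / len(xs)                  # should approach E[X] = 1/a
```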
Example 1.12 Suppose that the number of eggs a particular insect may lay (X1)
follows a Poisson distribution X1 ∼ Po(x1 |μ):

P(X1 = x1 |μ) = e^{−μ} μ^{x1} / Γ(x1 + 1); x1 = 0, 1, 2, . . .

Now, if the probability for an egg to hatch is θ and X2 represents the number of
offspring, given x1 eggs the probability to have x2 descendants follows a Binomial law
X2 ∼ Bi(x2 |x1, θ):

P(X2 = x2 |x1, θ) = C(x1, x2) θ^{x2} (1 − θ)^{x1−x2}; 0 ≤ x2 ≤ x1

with C(x1, x2) the binomial coefficient. In consequence

P(X1 = x1, X2 = x2 |μ, θ) = P(X2 = x2 |x1, θ) P(X1 = x1 |μ)
Suppose that we have not observed the number of eggs that were laid. What is the
distribution of the number of offspring? This is given by the marginal probability

P(X2 = x2 |θ, μ) = Σ_{x1=x2}^∞ P(X1 = x1, X2 = x2) = e^{−μθ} (μθ)^{x2} / Γ(x2 + 1) = Po(x2 |μθ)

Now, suppose that we have found x2 new insects. What is the distribution of the
number of eggs laid? This will be the conditional probability P(X1 = x1 |X2 =
x2, θ, μ) and, since P(X1 = x1, X2 = x2) = P(X1 = x1 |X2 = x2) P(X2 = x2), we
have that:
P(X1 = x1 |X2 = x2, μ, θ) = P(X1 = x1, X2 = x2) / P(X2 = x2)
                          = e^{−μ(1−θ)} (μ(1 − θ))^{x1−x2} / (x1 − x2)!
Example 1.13 Let X1 and X2 be two independent Poisson distributed random quanti-
ties with parameters μ1 and μ2. How is Y = X1 + X2 distributed? Since they are
independent:

P(X1 = x1, X2 = x2 |μ1, μ2) = e^{−(μ1+μ2)} μ1^{x1} μ2^{x2} / (Γ(x1 + 1) Γ(x2 + 1))

Then, since X2 = Y − X1:

P(X1 = x, Y = y) = P(X1 = x, X2 = y − x) = e^{−(μ1+μ2)} μ1^x μ2^{y−x} / (Γ(x + 1) Γ(y − x + 1))

and, marginalizing over X1,

P(Y = y) = e^{−(μ1+μ2)} Σ_{x=0}^y μ1^x μ2^{y−x} / (Γ(x + 1) Γ(y − x + 1)) = e^{−(μ1+μ2)} (μ1 + μ2)^y / Γ(y + 1)
Consider now the two-dimensional Normal density with standardized marginals and
correlation coefficient ρ:

p(x1, x2 |ρ) = 1/(2π √(1 − ρ²)) exp{ −(x1² − 2ρ x1 x2 + x2²) / (2(1 − ρ²)) }

Since

p(x1, x2 |ρ) = p(x1) p(x2) (1 − ρ²)^{−1/2} exp{ −ρ (ρ x1² − 2 x1 x2 + ρ x2²) / (2(1 − ρ²)) }

the conditional densities are

p(x1 |x2, ρ) = p(x1, x2) / p(x2) = 1/(√(2π) √(1 − ρ²)) exp{ −(x1 − ρ x2)² / (2(1 − ρ²)) }

p(x2 |x1, ρ) = p(x1, x2) / p(x1) = 1/(√(2π) √(1 − ρ²)) exp{ −(x2 − ρ x1)² / (2(1 − ρ²)) }

and when ρ = 0 (thus independent) p(x1 |x2) = p(x1) and p(x2 |x1) = p(x2). Last,
it is clear that p(x1, x2) = p(x1 |x2) p(x2) = p(x2 |x1) p(x1).
¹² In what follows we consider the Lebesgue–Stieltjes integral, so ∫ → Σ for discrete random
quantities and in consequence:

∫_{−∞}^∞ g(x) dP(x) = ∫_{−∞}^∞ g(x) p(x) dx −→ Σ_{∀xk} g(xk) P(X = xk).
In general, the function g(x) may be unbounded on supp{X} so both the sum and the
integral have to be absolutely convergent for the mathematical expectation to exist.
In a similar way, we define the conditional expectation. If X = (X1, . . .,
Xm, . . ., Xn), W = (X1, . . ., Xm) and Z = (X_{m+1}, . . ., Xn) we have for Y = g(W)
that

E[Y |Z = z0] = ∫_{R^m} g(w) p(w|z0) dw = ∫_{R^m} g(w) p(w, z0)/p(z0) dw.
Given a random quantity X ∼ p(x), we define the moment of order n (αn) as:

αn def= E[X^n] = ∫_{−∞}^∞ x^n p(x) dx
Obviously, they exist if x^n p(x) ∈ L¹(R) so it may happen that a particular probability
distribution has only a finite number of moments. It is also clear that if the moment
of order n exists, so do the moments of lower order and, if it does not, neither do those
of higher order. In particular, the moment of order 0 always exists (due to the
normalization condition, α0 = 1) and those of even order, if they exist, are non-negative.
A specially important moment is that of order 1: the mean (mean value) μ = E[X], which
has two important properties:

• It is a linear operator: X = c0 + Σ_{i=1}^n ci Xi −→ E[X] = c0 + Σ_{i=1}^n ci E[Xi];
• If X = ∏_{i=1}^n Xi with {Xi}_{i=1}^n independent random quantities, then E[X] = ∏_{i=1}^n E[Xi].
We can define as well the moments (βn) with respect to any point c ∈ R as:

βn def= E[(X − c)^n] = ∫_{−∞}^∞ (x − c)^n p(x) dx

The αn, for which c = 0, are also called moments with respect to the origin. It is easy
to see that the second order moment with respect to c, β2 = E[(X − c)²], is minimal for
c = μ = E[X]. Thus, of special relevance are the moments of order n with respect
to the mean, or central moments,

μn ≡ E[(X − μ)^n] = ∫_{−∞}^∞ (x − μ)^n p(x) dx
In particular, μ2 = V[X] is the variance and, if X = Σ_{i=1}^n ci Xi with {Xi} independent
random quantities, V[X] = Σ_{i=1}^n ci² V[Xi].
Usually it is less tedious to calculate the moments with respect to the origin and
evidently they are related: from the binomial expansion

(X − μ)^n = Σ_{k=0}^n C(n, k) X^k (−μ)^{n−k} −→ μn = Σ_{k=0}^n C(n, k) αk (−μ)^{n−k}

with C(n, k) the binomial coefficient.
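The relation can be checked on any distribution whose moments we can enumerate; here a small sketch on a three-point discrete distribution of our choosing:

```python
from math import comb

# a toy discrete distribution: values and their probabilities
xs, ps = [0, 1, 3], [0.2, 0.5, 0.3]

def raw(n):
    """alpha_n = E[X^n]"""
    return sum(p * x ** n for x, p in zip(xs, ps))

mu = raw(1)

def central_direct(n):
    """mu_n = E[(X - mu)^n] computed directly from the definition."""
    return sum(p * (x - mu) ** n for x, p in zip(xs, ps))

def central_binomial(n):
    """mu_n from the raw moments via the binomial expansion."""
    return sum(comb(n, k) * raw(k) * (-mu) ** (n - k) for k in range(n + 1))
```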
For a two-dimensional random quantity X = (X1, X2) we define the moments of order
(n, m) with respect to the origin α_{nm} = E[X1^n X2^m], so that α_{10} = μ1 and
α_{01} = μ2, and the moments of order (n, m) with respect to the mean:

μ_{nm} = E[(X1 − μ1)^n (X2 − μ2)^m] = ∫_{R²} (x1 − μ1)^n (x2 − μ2)^m p(x1, x2) dx1 dx2

for which μ_{10} = μ_{01} = 0, μ_{20} = V[X1] and μ_{02} = V[X2]. The moment

μ_{11} = E[(X1 − μ1)(X2 − μ2)] ≡ V[X1, X2]

is called the covariance between the random quantities X1 and X2 and, if they are
independent, μ_{11} = 0. The second order moments with respect to the mean can be
condensed in matrix form, the covariance matrix, defined as:

V[X] = ( μ_{20}  μ_{11} ; μ_{11}  μ_{02} ) = ( V[X1, X1]  V[X1, X2] ; V[X1, X2]  V[X2, X2] )
The covariance matrix V[X] = E[(X − μ)(X − μ)^T] has the following properties
that are easy to prove from basic matrix algebra relations:

(1) It is a symmetric matrix (V = V^T) with non-negative diagonal elements (Vii ≥ 0);
(2) It is positive definite (x^T V x ≥ 0; ∀x ∈ R^n, with equality only when xi = 0 ∀i);
(3) Being V symmetric, all the eigenvalues are real and the corresponding eigen-
vectors orthogonal. Furthermore, since it is positive definite, all eigenvalues are
positive;
(4) If J is a diagonal matrix whose elements are the eigenvalues of V and H a matrix
whose columns are the corresponding eigenvectors, then V = HJH^{−1} (Jordan
dixit);
(5) Since V is symmetric, there is an orthogonal matrix C (C^T = C^{−1}) such that
CVC^T = D with D a diagonal matrix whose elements are the eigenvalues of V;
(6) Since V is symmetric and positive definite, there is a non-singular matrix C such
that V = CC^T;
(7) Since V is symmetric and positive definite, the inverse V^{−1} is also symmetric
and positive definite;
(8) (Cholesky Factorization) Since V is symmetric and positive definite, there exists
a unique lower triangular matrix C (Cij = 0; ∀i < j) with positive diagonal
elements such that V = CC^T (more about this in Chap. 3).
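A hand-rolled factorization for the 2×2 case illustrates property (8) (a sketch with made-up numbers; for real work one would use a linear-algebra library):

```python
from math import sqrt

def cholesky2(v11, v12, v22):
    """Lower-triangular C with V = C C^T for a 2x2 covariance matrix."""
    c11 = sqrt(v11)                 # first diagonal element
    c21 = v12 / c11                 # off-diagonal entry of the lower triangle
    c22 = sqrt(v22 - c21 ** 2)      # requires positive definiteness
    return (c11, 0.0), (c21, c22)

# example: variances 4 and 9, covariance 3 (correlation 0.5)
(C11, _), (C21, C22) = cholesky2(4.0, 3.0, 9.0)
```

Multiplying C by its transpose recovers the original matrix, which is what the test below checks.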
Among other things to be discussed later, the moments of the distribution are
interesting because they give an idea of the shape and location of the probability
distribution and, in many cases, the distribution parameters are expressed in terms
of the moments.
Let X ∼ p(x) with support in Ω ⊂ R. The position parameters pick a characteristic
value of X and indicate roughly where the distribution is located. Among them
we have the mean value

μ = α1 = E[X] = ∫_{−∞}^∞ x p(x) dx
The mean is bounded by the minimum and maximum values the random quantity
can take but, clearly, if Ω ⊂ R it may happen that μ ∉ Ω. If, for instance, Ω = Ω1 ∪ Ω2
is the union of two disconnected regions, μ may lie in between and therefore μ ∉ Ω.
On the other hand, as has been mentioned, the integral has to be absolutely convergent
and there are some probability distributions for which there is no mean value. There
are however other interesting location quantities. The mode is the value x0 of X for
which the distribution is maximum; that is,

x0 = arg sup_{x∈Ω} p(x)
Nevertheless, it may happen that there are several relative maxima, so we talk
about uni-modal, bi-modal,… distributions. The median is the value xm such that

F(xm) = P(X ≤ xm) = 1/2 −→ ∫_{−∞}^{xm} p(x) dx = ∫_{xm}^∞ p(x) dx = P(X > xm) = 1/2

For discrete random quantities, the distribution function is either a finite or count-
able combination of indicator functions 1_{Ak}(x) with {Ak}_{k=1}^{n,∞} a partition of Ω, so it
may happen that F(x) = 1/2 ∀x ∈ Ak. Then any value of the interval Ak can be
considered the median. Last, we may consider the α-quantiles, defined as the values
qα of the random quantity such that F(qα) = P(X ≤ qα) = α, so the median is the
quantile q_{1/2}.
There are many ways to quantify how dispersed the values the random quantity
may take are. Usually they are based on the mathematical expectation of a
function of the difference between X and some characteristic value it
may take; for instance E[|X − μ|]. By far the most usual and important one is the
already defined variance

V[X] = σ² = E[(X − E[X])²] = ∫_R (x − μ)² p(x) dx

provided it exists. Note that if the random quantity X has dimension D[X] = d_X,
the variance has dimension D[σ²] = d_X² so, to have a quantity that gives an idea
of the dispersion and has the same dimension, one defines the standard deviation
σ = +√V[X] = +√σ² and, if both the mean value (μ) and the variance exist, the
standardized random quantity

Y = (X − μ)/σ

which has zero mean and unit variance.
Related to higher order moments with respect to the mean, there are two dimensionless
quantities of interest: the skewness and the kurtosis. The first non-trivial odd moment with
respect to the mean is that of order 3: μ3. Since it has dimension D[μ3] = d_X³ we
define the skewness (γ1) as the dimensionless quantity

γ1 def= μ3 / μ2^{3/2} = μ3 / σ³ = E[(X − μ)³] / σ³

The skewness is γ1 = 0 for distributions that are symmetric with respect to the mean,
γ1 > 0 if the probability content is more concentrated to the right of the mean and
γ1 < 0 if it is to the left of the mean. Note however that there are many asymmetric
distributions for which μ3 = 0 and therefore γ1 = 0. For unimodal distributions, it
is easy to see that γ1 = 0 when mode = median = mean. In a similar way, from the
fourth moment with respect to the mean one defines the kurtosis

γ2 def= μ4 / μ2² = μ4 / σ⁴ = E[(X − μ)⁴] / σ⁴
and gives an idea of how peaked the distribution is. For the Normal distribution γ2 = 3
so, in order to have a reference, one defines the extended kurtosis γ2^{ext} = γ2 − 3.
Thus, γ2^{ext} > 0 (< 0) indicates that the distribution is more (less) peaked than the
Normal. Again, γ2^{ext} = 0 for the Normal density and for any other distribution for
which μ4 = 3σ⁴. Last, you can check that ∀a, b ∈ R, E[(X − μ − a)²(X − μ − b)²] > 0
so, for instance, defining u = a + b, w = ab and taking derivatives, γ2 ≥ 1 + γ1².
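For the exponential density p(x|a) = a e^{−ax} the raw moments are αn = n!/a^n, from which the binomial relation of the previous section gives γ1 = 2 and γ2^{ext} = 6 for any a. A quick check (our choice of example, with an arbitrary rate):

```python
from math import comb, factorial

a = 1.7  # any positive rate gives the same dimensionless shape parameters

def raw(n):
    """alpha_n = E[X^n] = n! / a^n for the exponential density."""
    return factorial(n) / a ** n

mu = raw(1)

def central(n):
    """mu_n from the raw moments via the binomial expansion."""
    return sum(comb(n, k) * raw(k) * (-mu) ** (n - k) for k in range(n + 1))

var = central(2)                   # 1/a^2
skew = central(3) / var ** 1.5     # gamma_1 = 2
kurt_ext = central(4) / var ** 2 - 3   # gamma_2^ext = 6
```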
For a Poisson random quantity

P(X = k) ≡ Po(k|λ) = e^{−λ} λ^k / Γ(k + 1); λ ∈ R^+; k = 0, 1, 2, . . .

the series αn = Σ_k k^n Po(k|λ) is absolutely convergent, so moments of all orders exist.
Taking the derivative of αn(λ) with respect to λ one gets the recurrence relation

α_{n+1}(λ) = λ ( αn(λ) + dαn(λ)/dλ ); α0(λ) = 1

from which, for the central moments,

μ0 = 1; μ1 = 0; μ2 = λ; μ3 = λ; μ4 = λ(3λ + 1)
For the Gamma distribution, with probability density

p(x|a, b) = (a^b / Γ(b)) e^{−ax} x^{b−1} 1_{(0,∞)}(x); a, b ∈ R^+

one gets

E[X] = b/a; V[X] = b/a²; γ1 = 2/√b; γ2^{ext} = 6/b
For the Cauchy distribution, with density

p(x) = (1/π) 1/(1 + x²) 1_{(−∞,∞)}(x)

we have that

αn = E[X^n] = (1/π) ∫_{−∞}^∞ x^n / (1 + x²) dx

and clearly the integral diverges for n > 1 so there are no moments but the trivial
one α0. Even for n = 1, the integral

∫_{−∞}^∞ |x| / (1 + x²) dx = 2 ∫_0^∞ x / (1 + x²) dx = lim_{a→∞} ln(1 + a²)

is not absolutely convergent so, in a strict sense, there is no mean value. However, the
mode and the median are x0 = xm = 0, the distribution is symmetric about x = 0
and for n = 1 the Cauchy Principal Value exists and is equal to 0. Had we
introduced the Probability Distributions as a subset of Generalized Distributions, the
Principal Value would be admissible. It is left as an exercise to show that for:

• Pareto: X ∼ Pa(x|ν, xm) with p(x|xm, ν) ∝ x^{−(ν+1)} 1_{[xm,∞)}(x); xm, ν ∈ R^+
• Student: X ∼ St(x|ν) with p(x|ν) ∝ (1 + x²/ν)^{−(ν+1)/2} 1_{(−∞,∞)}(x); ν ∈ R^+

the moments αn = E[X^n] exist iff n < ν.
Another distribution of interest in physics is the Landau Distribution, which describes
the energy lost by a particle when traversing a material under certain conditions. The
probability density, given as an inverse Laplace Transform, is:

p(x) = (1/2πi) ∫_{c−i∞}^{c+i∞} e^{s log s + xs} ds

with c ∈ R^+ and, closing the contour on the left along a counterclockwise semicircle
with a branch-cut along the negative real axis, it has the real representation

p(x) = (1/π) ∫_0^∞ e^{−(r log r + xr)} sin(πr) dr
The actual expression of the distribution of the energy loss is quite involved and
some simplifying assumptions have been made; among other things, that the energy
transfer in the collisions is unbounded (no kinematic constraint). But nothing is for
free and the price to pay is that the Landau Distribution has no moments other than
the trivial one of order zero. This is why, instead of the mean and the variance, one talks
about the most probable energy loss and the full-width-at-half-maximum.
The covariance between the random quantities Xi and Xj was defined as
V[Xi, Xj] = E[(Xi − μi)(Xj − μj)] and, normalizing by the standard deviations, one
defines the correlation coefficient

ρij = V[Xi, Xj] / (σi σj) with −1 ≤ ρij ≤ 1

The extreme values (+1, −1) will be taken when E[Xi Xj] = E[Xi]E[Xj] ± σi σj
and ρij = 0 when E[Xi Xj] = E[Xi]E[Xj]. In particular, it is immediate to see that
if there is a linear relation between both random quantities, that is Xi = aXj + b,
then ρij = ±1. Therefore, it is a linear correlation coefficient. Note however that:

• ρij quantifies only the degree of linear relation between Xi and Xj;
• If Xi and Xj are statistically independent then ρij = 0, but ρij = 0 does not imply
necessarily statistical independence, as the following example shows.
Consider X1 and the quadratic relation

X2 = g(X1) = a + bX1 + cX1²

Obviously, X1 and X2 are not statistically independent for there is a clear parabolic
relation. However,

V[X1, X2] = b σ² + c (α3 − μσ² − μ³)

with μ, σ² and α3 respectively the mean, the variance and the moment of order 3 with respect
to the origin of X1 and, if we take b = cσ^{−2}(μ³ + μσ² − α3), then V[X1, X2] = 0 and
so is the (linear) correlation coefficient.
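A concrete instance: take X1 uniform on {−1, 0, 1} and X2 = X1² (our toy example, with a = b = 0 and c = 1, and X1 symmetric so that α3 = μ = 0). The two quantities are clearly dependent, yet the covariance vanishes exactly:

```python
# X1 uniform on {-1, 0, 1}; X2 = X1^2 is a deterministic, non-linear relation
vals = [-1, 0, 1]
p = 1 / 3

e_x1 = sum(p * x for x in vals)                        # E[X1] = 0
e_x2 = sum(p * x ** 2 for x in vals)                   # E[X2] = 2/3
cov = sum(p * x * x ** 2 for x in vals) - e_x1 * e_x2  # E[X1^3] - E[X1]E[X2]

# dependence: observing X2 = 0 forces X1 = 0
p_x1_0 = p                       # P(X1 = 0) = 1/3
p_x1_0_given_x2_0 = 1.0          # P(X1 = 0 | X2 = 0) = 1
```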
For a random quantity Y = g(X) with X = (X1, . . ., Xn), a Taylor expansion of g
around μ = E[X] gives, to second order,

E[Y] = E[g(X)] = g(μ) + (1/2!) Σ_{i=1}^n Σ_{j=1}^n (∂²g(x)/∂xi ∂xj)|_μ Vij + · · ·

and therefore, with Zi = Xi − μi,

Y − E[Y] = Σ_{i=1}^n (∂g(x)/∂xi)|_μ Zi + (1/2!) Σ_{i=1}^n Σ_{j=1}^n (∂²g(x)/∂xi ∂xj)|_μ (Zi Zj − Vij) + · · ·

so

V[Y] = E[(Y − E[Y])²] = Σ_{i=1}^n Σ_{j=1}^n (∂g(x)/∂xi)|_μ (∂g(x)/∂xj)|_μ V[Xi, Xj] + · · ·
This is the first order approximation to V[Y] and it is usually reasonable, but it has to
be used with care. On the one hand, we have assumed that higher order terms are
negligible, and this is not always the case, so further terms in the expansion may
have to be considered. Take for instance the simple case Y = X_1 X_2 with X_1 and X_2
independent random quantities. The first order expansion gives V[Y] ≈ μ_1² σ_2² + μ_2² σ_1²,
and including second order terms (there are no more) V[Y] = μ_1² σ_2² + μ_2² σ_1² + σ_1² σ_2²,
the correct result. On the other hand, all this is obviously meaningless if the random
quantity Y has no variance. This is for instance the case for Y = X_1 X_2^{−1} when X_{1,2}
are Normal distributed.
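The Y = X_1 X_2 comparison above is easy to check numerically; a minimal sketch (the means and standard deviations are arbitrary illustrations, and NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 3.0, 0.5, -2.0, 0.8      # illustrative means and std devs

x1 = rng.normal(mu1, s1, 1_000_000)
x2 = rng.normal(mu2, s2, 1_000_000)
y = x1 * x2                                  # independent product

first_order = mu1**2 * s2**2 + mu2**2 * s1**2   # truncated expansion
exact = first_order + s1**2 * s2**2             # adds the second-order term

print(np.var(y), first_order, exact)
```

The sampled variance agrees with the second-order (exact) expression and exposes the deficit of the first-order one.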
The Integral Transforms of Fourier, Laplace and Mellin are a very useful tool to
study the properties of the random quantities and their distribution functions. In
particular, they will allow us to obtain the distribution of the sum, product and ratio
of random quantities, the moments of the distributions and to study the convergence
of a sequence {F_k(x)}_{k=1}^∞ of distribution functions to F(x).
The class of functions for which the Fourier Transform exists is certainly much wider
than that of the probability density functions p(x) ∈ L¹(R) (normalized real functions of
real argument) we are interested in, for which the transform always exists. If X ∼ p(x),
the Fourier Transform (Characteristic Function) is nothing else but the mathematical expectation

Φ_X(t) = E[e^{itX}];  t ∈ R

It has the following properties:
(1) Φ(0) = 1;
(2) Φ(t) is bounded: |Φ(t)| ≤ 1;
(3) Φ(t) has Hermitian (Schwarz) symmetry: Φ(−t) = Φ(t)*;
(4) Φ(t) is uniformly continuous in R.
The first three properties are obvious. For the fourth one, observe that for any ε > 0
there exists a δ > 0 such that |Φ(t_1) − Φ(t_2)| < ε when |t_1 − t_2| < δ, with t_1 and t_2
arbitrary in R, because

|Φ(t + δ) − Φ(t)| ≤ ∫_{−∞}^{+∞} |1 − e^{iδx}| dP(x) = 2 ∫_{−∞}^{+∞} |sin(δx/2)| dP(x)

and this integral can be made arbitrarily small taking a sufficiently small δ.
These properties, which obviously hold also for a discrete random quantity, are
necessary but not sufficient for a function Φ(t) to be the Characteristic Function
of a distribution P(x) (see Example 1.19). Generalizing for an n-dimensional random
quantity X = (X_1, …, X_n):

Φ_X(t) = E[e^{i t·X}];  t ∈ R^n
Example 1.19 There are several criteria (Bochner, Kintchine, Cramèr, …) specifying
necessary and sufficient conditions for a function Φ(t) that satisfies the four
aforementioned conditions to be the Characteristic Function of a random quantity
X ∼ F(x). However, it is easy to find simple functions like

g_1(t) = e^{−t⁴}  and  g_2(t) = 1/(1 + t⁴)

that satisfy the four stated conditions and that can not be Characteristic Functions associated
to any distribution. Let's calculate the moments of order one with respect to
the origin and the central one of order two. In both cases (see Sect. 1.5.1.4) we have
that:

E[X] = −i g'(0) = 0  and  E[X²] = −g''(0) = 0
that is, the mean value and the variance are zero, so the distribution function is
zero almost everywhere except for X = 0, where P(X = 0) = 1 … but this is the
Singular Distribution Sn(x|0) that takes the value 1 if X = 0 and 0 otherwise, whose
Characteristic Function is Φ(t) = 1. In general, any function Φ(t) that in a neighbourhood
of t = 0 behaves as Φ(t) = 1 + O(t^{2+ε}) with ε > 0 can not be the Characteristic
Function associated to a distribution F(x) unless Φ(t) = 1 for all t ∈ R.
Example 1.20 The elements of the Cantor Set C_S(0, 1) can be represented in base
3 as:

X = Σ_{n=1}^∞ X_n / 3^n
with X n ∈ {0, 2}. This set is non-denumerable and has zero Lebesgue measure so
any distribution with support on it is singular and, in consequence, has no pdf. The
Uniform Distribution on C S (0, 1) is defined assigning a probability P(X n = 0) =
P(X_n = 2) = 1/2 (Geometric Distribution). Then, for the random quantity X_n we
have that

Φ_{X_n}(t) = E[e^{itX_n}] = (1/2)(1 + e^{2it})

and for Y_n = X_n/3^n:

Φ_{Y_n}(t) = Φ_{X_n}(t/3^n) = (1/2)(1 + e^{2it/3^n})
Being all the X_n statistically independent, we have that

Φ_X(t) = Π_{n=1}^{+∞} (1/2)(1 + e^{2it/3^n}) = Π_{n=1}^{+∞} e^{it/3^n} cos(t/3^n) = e^{it/2} Π_{n=1}^{+∞} cos(t/3^n)

and, from the derivatives (Sect. 1.5.1.4), it is straightforward to calculate the moments
of the distribution. In particular:

Φ_X^{(1)}(0) = i/2 −→ E[X] = 1/2  and  Φ_X^{(2)}(0) = −3/8 −→ E[X²] = 3/8

so V[X] = 1/8.
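The moments of the Cantor-set distribution can be checked by sampling truncated ternary expansions (a sketch assuming NumPy; the truncation order and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n_terms, n_samples = 40, 200_000

# digits X_n ∈ {0, 2} with probability 1/2 each; X = Σ X_n / 3^n (truncated)
digits = 2 * rng.integers(0, 2, size=(n_samples, n_terms))
weights = 3.0 ** -np.arange(1, n_terms + 1)
x = digits @ weights

mean, var = x.mean(), x.var()
print(mean, var)        # should be close to 1/2 and 1/8
```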
In particular, if the discrete distribution is reticular (that is, all the possible values
that the random quantity X may take can be expressed as a + bn with a, b ∈ R,
b ≠ 0 and n integer) we have that:

P(X = x_k) = (b/2π) ∫_{−π/b}^{π/b} e^{−it x_k} Φ(t) dt
From these expressions we can also obtain the relation between the Characteristic
Function and the Distribution Function. For discrete random quantities we shall
have:

F(x_k) = Σ_{x_j ≤ x_k} P(X = x_j) = (1/2π) ∫_{−∞}^{+∞} [ Σ_{x_j ≤ x_k} e^{−it x_j} ] Φ(t) dt
Let X ∼ P(x) be a random quantity with Characteristic Function Φ_X(t) and g(X) a
one-to-one finite real function defined for all real values of X. The Characteristic
Function of the random quantity Y = g(X) will be given by Φ_Y(t) = E[e^{itg(X)}]; that is:

Φ_Y(t) = ∫_{−∞}^{+∞} e^{itg(x)} dP(x)  or  Φ_Y(t) = Σ_{x_k} e^{itg(x_k)} P(X = x_k)
In particular, if X = X_1 + · · · + X_n with the X_i independent random quantities, then
Φ_X(t) = Φ_1(t) · · · Φ_n(t); that is, the product of the Characteristic Functions of each one, a necessary but not
sufficient condition for the random quantities X_1, …, X_n to be independent. In a
similar way, we have that if X = X_1 − X_2 with X_1 and X_2 independent random
quantities, then Φ_X(t) = Φ_1(t) Φ_2(−t). For instance, since

X_i ∼ Po(μ_i) −→ Φ_i(t) = e^{−μ_i (1 − e^{it})}

we have that

Φ_X(t) = Φ_1(t) Φ_2(−t) = e^{−(μ_1 + μ_2)} e^{(μ_1 e^{it} + μ_2 e^{−it})}
being μ_S = μ_1 + μ_2. If we take

z = √(μ_1/μ_2) e^{it}

we have

P(X = n) = e^{−μ_S} (μ_1/μ_2)^{n/2} (1/2πi) ∮_C z^{−n−1} e^{(w/2)(z + 1/z)} dz
with w = 2√(μ_1 μ_2) and C the circle |z| = √(μ_1/μ_2) around the origin. From the
definition of the Modified Bessel Function of the first kind

I_n(z) = (1/2πi) ∮_C t^{−n−1} e^{(z/2)(t + 1/t)} dt
with C a circle enclosing the origin anticlockwise and considering that I−n (z) = In (z)
we have finally:
P(X = n) = e^{−(μ_1 + μ_2)} (μ_1/μ_2)^{n/2} I_{|n|}(2√(μ_1 μ_2)).
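The Bessel-function expression for the difference of two Poisson quantities (the Skellam distribution) can be checked against a direct convolution of the two Poisson pmfs; a sketch with illustrative values μ_1 = 3, μ_2 = 1.5 and the Bessel function computed from its power series:

```python
import math

def skellam_direct(n, mu1, mu2, kmax=60):
    # P(X1 - X2 = n) summing the joint Poisson probabilities over X2 = k
    total = 0.0
    for k in range(kmax):
        j = n + k                      # then X1 = n + k
        if j < 0:
            continue
        total += (math.exp(-mu1) * mu1**j / math.factorial(j) *
                  math.exp(-mu2) * mu2**k / math.factorial(k))
    return total

def bessel_i(nu, z, mmax=60):
    # power series of the modified Bessel function I_nu (integer nu >= 0)
    return sum((z / 2.0)**(2*m + nu) / (math.factorial(m) * math.factorial(m + nu))
               for m in range(mmax))

def skellam_bessel(n, mu1, mu2):
    return (math.exp(-(mu1 + mu2)) * (mu1 / mu2)**(n / 2.0) *
            bessel_i(abs(n), 2.0 * math.sqrt(mu1 * mu2)))

mu1, mu2 = 3.0, 1.5
for n in (-2, 0, 1, 4):
    print(n, skellam_direct(n, mu1, mu2), skellam_bessel(n, mu1, mu2))
```

Both routes give the same probabilities, including for negative n through I_{−n}(z) = I_n(z).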
Consider a random quantity X with Characteristic Function Φ(t) = E[e^{itX}]
and let us assume that there exists the moment of order k. Then, upon derivation of
the Characteristic Function k times with respect to t we have:

∂^k Φ(t)/∂t^k = E[i^k X^k e^{itX}] = ∫_{−∞}^{+∞} (ix)^k e^{itx} dP(x)

and therefore

E[X^k] = (1/i^k) ∂^k Φ(t)/∂t^k |_{t=0}

In a similar way, with Φ(t, a) = E[e^{it(X−a)}], upon k times derivation with respect to t we get the moments with
respect to an arbitrary point a:

E[(X − a)^k] = (1/i^k) ∂^k Φ(t, a)/∂t^k |_{t=0}
Example 1.22 For the difference of Poisson distributed random quantities analyzed
in the previous example, one can easily derive the moments from the derivatives of
the Characteristic Function. Since

Φ_X(t) = e^{−(μ_1 + μ_2)} e^{(μ_1 e^{it} + μ_2 e^{−it})}

we have that

E[X] = (1/i) Φ_X^{(1)}(0) = μ_1 − μ_2,  V[X] = μ_1 + μ_2

and so on.
Let f : R⁺ → C be a complex and integrable function with support on the real positive
axis. The Mellin Transform is defined as:

M(f; s) = M_f(s) = ∫_0^∞ f(x) x^{s−1} dx

If f(x) = O(x^α) when x → 0⁺ and f(x) = O(x^β) when x → ∞, then

|M(f; s)| ≤ ∫_0^∞ |f(x)| x^{Re(s)−1} dx = ∫_0^1 |f(x)| x^{Re(s)−1} dx + ∫_1^∞ |f(x)| x^{Re(s)−1} dx
         ≤ C_1 ∫_0^1 x^{Re(s)−1+α} dx + C_2 ∫_1^∞ x^{Re(s)−1+β} dx
The first integral converges for −α < Re(s) and the second for Re(s) < −β, so the
Mellin Transform exists and is holomorphic on the band −α < Re(s) < −β, parallel
to the imaginary axis and determined by the conditions of convergence of the
integral. We shall denote the holomorphy band (that can be half of the complex
plane or the whole complex plane) by S_f = ⟨−α, −β⟩. Last, to simplify the notation
when dealing with several random quantities, we shall write for X_n ∼ p_n(x) M_n(s)
or M_{X_n}(s) instead of M(p_n; s).
1.5.2.1 Inversion
The inverse transform follows from that of Fourier. Since s ∈ C, we can write s = x + iy and, setting t = e^u,

M(f; s) = ∫_0^∞ f(t) t^{s−1} dt = ∫_{−∞}^{+∞} f(e^u) e^{xu} e^{iyu} du

assuming that the integral exists; that is, the Mellin Transform of f(t) is the Fourier
Transform of g(u) = f(e^u) e^{xu}. Inverting the Fourier Transform we have that

f(t) = (1/2π) ∫_{−∞}^{+∞} M(f; s = x + iy) t^{−(x+iy)} dy = (1/2πi) ∫_{σ−i∞}^{σ+i∞} M(f; s) t^{−s} ds

where, due to Cauchy's Theorem, σ lies anywhere within the holomorphy band.
The uniqueness of the result holds with respect to this strip so, in fact, the Mellin
Transform consists of the pair: M(s) together with the band ⟨a, b⟩.
Example 1.23 It is clear that to determine the function f(x) from the transform F(s)
we have to specify the strip of analyticity for, otherwise, we do not know which poles
should be included. Let's see as an example f_1(x) = e^{−x}. We have that

M_1(z) = ∫_0^∞ e^{−x} x^{z−1} dx = Γ(z)

holomorphic in the band ⟨0, ∞⟩ so, for the inverse transform, we shall include the
poles z = 0, −1, −2, …. For f_2(x) = e^{−x} − 1 we get M_2(s) = Γ(s), the same
function, but

lim_{x→0⁺} f(x) ∼ O(x¹) −→ α = 1  and  lim_{x→∞} f(x) ∼ O(x⁰) −→ β = 0

Thus, the holomorphy strip is ⟨−1, 0⟩ and for the inverse transform we shall include
the poles z = −1, −2, …. For f_3(x) = e^{−x} − 1 + x we get M_3(s) = Γ(s), again
the same function, but

lim_{x→0⁺} f(x) ∼ O(x²) −→ α = 2  and  lim_{x→∞} f(x) ∼ O(x¹) −→ β = 1

Thus, the holomorphy strip is ⟨−2, −1⟩ and for the inverse transform we include the
poles z = −2, −3, ….
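The example can be reproduced numerically: evaluating the transforms of f_1 and f_2 at real points inside their respective strips recovers Γ(s) in both cases (a sketch; SciPy is assumed for the quadrature, and the evaluation points 2.5 and −0.5 are arbitrary choices inside ⟨0, ∞⟩ and ⟨−1, 0⟩):

```python
import math
from scipy.integrate import quad

def mellin(f, s):
    # numerical Mellin transform M(f; s) = ∫_0^∞ f(x) x^{s-1} dx for real s
    head, _ = quad(lambda x: f(x) * x**(s - 1), 0.0, 1.0)
    tail, _ = quad(lambda x: f(x) * x**(s - 1), 1.0, math.inf)
    return head + tail

f1 = lambda x: math.exp(-x)          # strip <0, inf>:  M1(s) = Gamma(s)
f2 = lambda x: math.exp(-x) - 1.0    # strip <-1, 0>:   M2(s) = Gamma(s) continued

print(mellin(f1, 2.5), math.gamma(2.5))
print(mellin(f2, -0.5), math.gamma(-0.5))
```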
Consider a positive random quantity X with continuous density p(x) and x ∈ [0, ∞).
The Mellin Transform (defined only for x ≥ 0)

M(p; s) = ∫_0^∞ x^{s−1} p(x) dx = E[X^{s−1}]

is defined for all x where p(x) is continuous, with the line of integration contained in
the strip of analyticity of M(p; s). Then:
• Moments: E[X^n] = M_X(n + 1);
• For the positive random quantity Z = a X^b (a, b ∈ R and a > 0) we have that

M_Z(s) = ∫_0^∞ z^{s−1} f(z) dz = a^{s−1} ∫_0^∞ x^{b(s−1)} p(x) dx = a^{s−1} M_X(bs − b + 1)

and, inverting,

2πi p(z) = ∫_{c−i∞}^{c+i∞} z^{−s} a^{s−1} M_X(bs − b + 1) ds

In particular, for Z = 1/X (a = 1, b = −1):

M_{Z=1/X}(s) = M_X(2 − s)
• If Z = X_1 X_2 · · · X_n with {X_i}_{i=1}^n independent positive defined random quantities,
each distributed as p_i(x_i), we have that

M_Z(s) = ∫_0^∞ z^{s−1} p(z) dz = Π_{i=1}^n ∫_0^∞ x_i^{s−1} p_i(x_i) dx_i = Π_{i=1}^n E[X_i^{s−1}] = Π_{i=1}^n M_i(s)

and therefore

2πi p(z) = ∫_{c−i∞}^{c+i∞} z^{−s} M_1(s) · · · M_n(s) ds
• For the ratio Z = X_1 X_2^{−1} of two independent positive random quantities,
M_Z(s) = M_1(s) M_2(2 − s) and therefore

p(x) = ∫_0^∞ p_1(wx) p_2(w) w dw = (1/2πi) ∫_{c−i∞}^{c+i∞} x^{−s} M_1(s) M_2(2 − s) ds
• Consider the distribution function F(x) = ∫_0^x p(u) du of the random quantity X.
Since dF(x) = p(x) dx we have that

M(p(x); s) = ∫_0^∞ x^{s−1} dF(x) = [x^{s−1} F(x)]_0^∞ − (s − 1) ∫_0^∞ x^{s−2} F(x) dx

and therefore, if lim_{x→0⁺}[x^{s−1}F(x)] = 0 and lim_{x→∞}[x^{s−1}F(x)] = 0 we have,
shifting s → s − 1, that

M(F(x); s) = M(∫_0^x p(u) du; s) = −(1/s) M(p(x); s + 1).
As an example, consider the product X = X_1 X_2 of two independent Exponential
quantities X_i ∼ Ex(x_i|a_i). The inverse transform has double poles at z_n = −n with
residues

Res(f(z), z_n) = ((a_1 a_2 x)^n/(n!)²) (2ψ(n + 1) − ln(a_1 a_2 x))

and therefore

p(x) = a_1 a_2 Σ_{n=0}^∞ ((a_1 a_2 x)^n/(n!)²) (2ψ(n + 1) − ln(a_1 a_2 x))

If we define w = 2√(a_1 a_2 x), we recognise the Neumann Series expansion of the
Modified Bessel Function K_0(w), so

p(x) = 2 a_1 a_2 K_0(2√(a_1 a_2 x)) 1_{(0,∞)}(x)
• For the ratio Y = X_1 X_2^{−1} we have

M_Y(z) = M_1(z) M_2(2 − z) = (a_2/a_1)^{z−1} π(1 − z)/sin(zπ)

so that

p(x) = (a_1/a_2) (1/2i) ∫_{c−i∞}^{c+i∞} (a_1 a_2^{−1} x)^{−z} (1 − z)/sin(zπ) dz

and therefore, summing over the poles:

p(x) = (a_1/a_2) Σ_{n=0}^∞ (1 + n)(−1)^n (a_1 x/a_2)^n = (a_1 a_2/(a_2 + a_1 x)²) 1_{(0,∞)}(x)
To summarize, if X_1 ∼ Ex(x_1|a_1) and X_2 ∼ Ex(x_2|a_2) are independent random quantities:

X = X_1 X_2 ∼ 2 a_1 a_2 K_0(2√(a_1 a_2 x)) 1_{(0,∞)}(x)

Y = X_1/X_2 ∼ (a_1 a_2/(a_2 + a_1 x)²) 1_{(0,∞)}(x)
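The ratio result is easy to verify by Monte Carlo: integrating the density above gives the distribution function F(y) = a_1 y/(a_2 + a_1 y), which can be compared with the empirical one (a sketch assuming NumPy; the rates a_1, a_2 are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
a1, a2 = 2.0, 0.7                         # illustrative rates
x1 = rng.exponential(1/a1, 400_000)
x2 = rng.exponential(1/a2, 400_000)
y = x1 / x2

# CDF implied by p(y) = a1*a2/(a2 + a1*y)^2 on (0, inf)
for q in (0.5, 1.0, 3.0):
    print(q, (y <= q).mean(), a1*q/(a2 + a1*q))
```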
For Gamma distributed quantities with unit scale, X ∼ Ga(x|1, b), the Mellin Transform is

M_X(s) = Γ(b + s − 1)/Γ(b)

Closing the contour on the left of the line Re(z) = c contained in the strip of
holomorphy, for the ratio X = X_1/X_2 we have poles of order one at b_1 − 1 + z_n = −n with n =
0, 1, 2, …, that is, at z_n = 1 − b_1 − n. Expansion around z = z_n + ε gives the
residues

Res(f(z), z_n) = ((−1)^n/n!) Γ(b_1 + b_2 + n) x^{n+b_1−1}

and therefore the quantity X = X_1/X_2 is distributed as

p(x) = (x^{b_1−1}/(Γ(b_1) Γ(b_2))) Σ_{n=0}^∞ ((−1)^n/n!) Γ(b_1 + b_2 + n) x^n
     = (Γ(b_1 + b_2)/(Γ(b_1) Γ(b_2))) (x^{b_1−1}/(1 + x)^{b_1+b_2}) 1_{(0,∞)}(x)
For the product X = X_1 X_2, summing the residues, we get that

p(x) = (2/(Γ(b_1) Γ(b_2))) x^{(b_1+b_2)/2−1} K_ν(2√x) 1_{(0,∞)}(x)

with ν = b_2 − b_1 > 0.
To summarize, if X_1 ∼ Ga(x_1|a_1, b_1) and X_2 ∼ Ga(x_2|a_2, b_2) are two independent
random quantities and ν = b_2 − b_1 > 0, we have that

X = X_1 X_2 ∼ (2 a_1^{b_1} a_2^{b_2}/(Γ(b_1) Γ(b_2))) (a_1/a_2)^{ν/2} x^{(b_1+b_2)/2−1} K_ν(2√(a_1 a_2 x)) 1_{(0,∞)}(x)

X = X_1/X_2 ∼ (Γ(b_1 + b_2)/(Γ(b_1) Γ(b_2))) (a_1^{b_1} a_2^{b_2} x^{b_1−1}/(a_2 + a_1 x)^{b_1+b_2}) 1_{(0,∞)}(x)
For the ratio X = X_1/X_2 of two independent Uniform quantities X_{1,2} ∼ Un(x|0, 1),
since M_U(s) = 1/s,

M_X(s) = M_1(s) M_2(2 − s) = (1/s)(1/(2 − s))

so the strip of holomorphy is S = ⟨0, 2⟩ and there are two poles, at s = 0 and s = 2.
If ln x < 0 → x < 1 we shall close the Bromwich contour on the left enclosing
the pole at s = 0, and if ln x > 0 → x > 1 we shall close the contour on the right
enclosing the pole at s = 2 so the integrals converge. Then it is easy to get that

p(x) = (1/2) [1_{(0,1]}(x) + x^{−2} 1_{(1,∞)}(x)] = (1/2) [Un(x|0, 1) + Pa(x|1, 1)]

Note that

E[X^n] = M_X(n + 1) = 1/((n + 1)(1 − n))
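This half-and-half mixture can be seen in a quick Monte Carlo check: the implied distribution function is F(x) = x/2 for x ≤ 1 and 1 − 1/(2x) for x > 1 (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
u1 = rng.random(400_000)
u2 = rng.random(400_000)
x = u1 / u2                              # ratio of two Un(0,1) quantities

# F(x) = x/2 for x <= 1 and 1 - 1/(2x) for x > 1
print((x <= 0.5).mean(), 0.25)
print((x <= 1.0).mean(), 0.5)
print((x <= 4.0).mean(), 1 - 1/8)
```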
For Beta distributed quantities X_i ∼ Be(x_i|a_i, b_i),

M_i(s) = (Γ(a_i + b_i) Γ(s + a_i − 1))/(Γ(a_i) Γ(s + a_i + b_i − 1))

and the densities of the product, with normalization

N_p = (Γ(a_1 + b_1) Γ(a_2 + b_2))/(Γ(a_1) Γ(a_2) Γ(b_1 + b_2))

and of the ratio X = X_1/X_2, with normalization

N_k = B(a_1 + a_2, b_k)/(B(a_1, b_1) B(a_2, b_2))

follow in the same way.
Show that the Mellin Transform of

X ∼ p(x|a, b) = (2 a^{(b+1)/2}/Γ(b/2 + 1/2)) e^{−a x²} x^b

has S = ⟨−b, ∞⟩ and, from this, derive that the probability density function of
X = X_1 X_2, with X_1 ∼ p(x_1|a_1, b_1) and X_2 ∼ p(x_2|a_2, b_2) independent, is given by:

p(x) = (4√(a_1 a_2)/(Γ(b_1/2 + 1/2) Γ(b_2/2 + 1/2))) (√(a_1 a_2) x)^{(b_1+b_2)/2} K_{|ν|}(2√(a_1 a_2) x)

with ν = (b_2 − b_1)/2, and that of the ratio X = X_1/X_2, with a = a_1/a_2 and
b = (b_1 + b_2)/2, by:

p(x) = (2 Γ(b + 1)/(Γ(b_1/2 + 1/2) Γ(b_2/2 + 1/2))) a^{1/2} (a x²)^{b_1/2}/(1 + a x²)^{b+1}
Problem 1.4 Show that if X_{1,2} ∼ Un(x|0, 1), then for X = X_1^{X_2} we have that
p(x) = −x^{−1} Ei(ln x), with Ei(z) the exponential integral, and E[X^m] = m^{−1} ln(1 + m).
Hint: Consider Z = log X = X_2 log X_1 = −X_2 W_1 and the Mellin Transform for
the Uniform and Exponential densities.
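The stated moments E[X^m] = m^{−1} ln(1 + m) can at least be verified by simulation (a sketch assuming NumPy; this only checks the answer, it does not solve the problem):

```python
import math
import numpy as np

rng = np.random.default_rng(4)
u1 = rng.random(1_000_000)
u2 = rng.random(1_000_000)
x = u1 ** u2                   # X = X1^{X2} with X1, X2 ~ Un(0,1)

for m in (1, 2, 3):
    print(m, (x**m).mean(), math.log(1 + m) / m)
```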
The Mellin Transform is defined for integrable functions with non-negative support.
To deal with the more general case X ∼ p(x) with supp{X} ⊆ R we have to:
(1) Express the density as p(x) = p(x) 1_{x≥0}(x) + p(x) 1_{x<0}(x) ≡ p⁺(x) + p⁻(x);
(2) Define Y_1 = X when x ≥ 0 and Y_2 = −X when x < 0, so supp{Y_2} is positive,
and find M_{Y_1}(s) and M_{Y_2}(s);
(3) Get from the inverse transform the corresponding densities: p_1(z) for the quantity
of interest Z_1 = Z(Y_1, X_2, …) with M_{Y_1}(s), and p_2(z) for Z_2 = Z(Y_2, X_2, …)
with M_{Y_2}(s), and at the end make in p_2(z) the corresponding change for X → −X.
This is usually quite messy and for most cases of interest it is far easier to find
the distribution for the product and ratio of random quantities with a simple change
of variables.
• Ratio of Normal and χ² distributed random quantities. Let's study the random
quantity X = X_1 (X_2/n)^{−1/2} where X_1 ∼ N(x_1|0, 1) with supp{X_1} = R and
X_2 ∼ χ²(x_2|n) with supp{X_2} = R⁺. Then, for the positive part of X_1 we have

M_1⁺(s) = (2^{s/2}/(2√(2π))) Γ(s/2);  0 < Re(s)

Since

M_2(s) = (2^{s−1}/Γ(n/2)) Γ(n/2 + s − 1)

and (X_2/n)^{−1/2} = n^{1/2} X_2^{−1/2}, we have that

M_X⁺(s) = M_1⁺(s) n^{(s−1)/2} M_2((3 − s)/2) = (n^{(s−1)/2}/(2√π Γ(n/2))) Γ(s/2) Γ((n + 1 − s)/2)
with holomorphy stripe 0 < Re(s) < n + 1. There are poles at s_m = −2m with
m = 0, 1, 2, … on the negative real axis and s_k = n + 1 + 2k with k = 0, 1, 2, …
on the positive real axis. Closing the contour on the left we include only the s_m, so

p⁺(x) = (1/(√(nπ) Γ(n/2))) Σ_{m=0}^∞ ((−1)^m/Γ(m + 1)) Γ(m + (n + 1)/2) (x²/n)^m
      = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + x²/n)^{−(n+1)/2} 1_{[0,∞)}(x)

and, by symmetry,

X ∼ p(x) = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + x²/n)^{−(n+1)/2} 1_{(−∞,∞)}(x) = St(x|n)
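The final expression is the Student Distribution, and can be compared point by point with a library implementation (a sketch; scipy.stats.t is assumed available, and n = 5 is an arbitrary choice):

```python
import math
from scipy.stats import t

def student_pdf(x, n):
    # density derived above: Γ((n+1)/2)/(√(nπ) Γ(n/2)) (1 + x²/n)^{-(n+1)/2}
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1.0 + x * x / n) ** (-(n + 1) / 2)

for xv in (-2.0, 0.0, 1.5):
    print(xv, student_pdf(xv, 5), t.pdf(xv, 5))
```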
with D_a(x) the Whittaker Parabolic Cylinder Functions. The upper sign (−) of
the argument corresponds to X ∈ [0, ∞) and the lower one (+) to the quantity Y =
−X ∈ (0, ∞). Again, the problem is considerably simplified if μ_1 = μ_2 = 0 because
then

M_Y(z) = (2^{z/2}/(2√(2π))) σ^{z−1} Γ(z/2)

with S = ⟨0, ∞⟩ and, due to symmetry, all contributions are the same. Thus, summing
over the poles at z_n = −2n for n = 0, 1, 2, … we have that for X = X_1 X_2 and
a^{−1} = 4σ_1²σ_2²:

p(x) = (2√a/π) Σ_{n=0}^∞ ((√a|x|)^{2n}/Γ(n + 1)²) (ψ(1 + n) − ln(√a|x|)) = (2√a/π) K_0(2√a|x|)
Dealing with the general case μ_i ≠ 0 it is much more messy to get compact
expressions, and life is easier with a simple change of variables. Thus, for instance,
for X = X_1/X_2 we have that

p(x) = (√(a_1 a_2)/π) ∫_{−∞}^{+∞} e^{−[a_1(xw − μ_1)² + a_2(w − μ_2)²]} |w| dw

and, defining

w_0 = a_2 + a_1 x²;  w_1 = a_1 a_2 (xμ_2 − μ_1)²  and  w_2 = (a_1 μ_1 x + a_2 μ_2)/√w_0

one has:

p(x) = (√(a_1 a_2)/(π w_0)) e^{−w_1/w_0} [e^{−w_2²} + √π w_2 erf(w_2)] 1_{(−∞,∞)}(x).
1.6 Ordered Samples
Let X ∼ p(x|θ) be a one-dimensional random quantity and e(n) the experiment that
consists of n independent observations, resulting in the exchangeable sequence
{x_1, x_2, …, x_n}, equivalent to one observation of the n-dimensional random quantity
X ∼ p(x|θ) where

p(x|θ) = p(x_1, x_2, …, x_n|θ) = Π_{i=1}^n p(x_i|θ)

Consider the ordered sample and the Statistic of Order k; that is, the random quantity X_{(k)} associated with the kth
observation (1 ≤ k ≤ n) of the ordered sample, such that there are k − 1 observations
smaller than x_k and n − k above x_k. Since
P(X ≤ x_k|θ) = ∫_{−∞}^{x_k} p(x|θ) dx = F(x_k|θ)  and  P(X > x_k|θ) = 1 − F(x_k|θ)

we have that

X_{(k)} ∼ p(x_k|θ, n, k) = C_{n,k} p(x_k|θ) [F(x_k|θ)]^{k−1} [1 − F(x_k|θ)]^{n−k}
        = C_{n,k} p(x_k|θ) [∫_{−∞}^{x_k} p(x|θ) dx]^{k−1} [∫_{x_k}^∞ p(x|θ) dx]^{n−k}

with C_{n,k} = n!/((k − 1)!(n − k)!), the two bracketed factors being [P(X ≤ x_k)]^{k−1}
and [P(X > x_k)]^{n−k}.
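For a Uniform parent distribution the order-statistic density above reduces to a Beta density Be(x_k|k, n − k + 1), with mean k/(n + 1) and variance k(n − k + 1)/((n + 1)²(n + 2)); a quick Monte Carlo check (a sketch assuming NumPy; n, k and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 10, 3
samples = np.sort(rng.random((200_000, n)), axis=1)
xk = samples[:, k - 1]              # k-th order statistic (1-indexed)

print(xk.mean(), k / (n + 1))       # Beta(k, n-k+1) mean
print(xk.var(), k * (n - k + 1) / ((n + 1)**2 * (n + 2)))
```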
Similarly, for two order statistics X_{(i)} and X_{(j)} with i < j:

X_{(ij)} ∼ p(x_i, x_j|θ, i, j, n) = C_{n,i,j} [∫_{−∞}^{x_i} p(x|θ) dx]^{i−1} p(x_i|θ) [∫_{x_i}^{x_j} p(x|θ) dx]^{j−i−1} p(x_j|θ) [∫_{x_j}^∞ p(x|θ) dx]^{n−j}

where the bracketed factors are [P(X < x_i)]^{i−1}, [P(x_i < X ≤ x_j)]^{j−i−1} and
[P(x_j < X)]^{n−j}, and

C_{n,i,j} = n!/((i − 1)! (j − i − 1)! (n − j)!)
In particular, for the spacing S = X_{(i+1)} − X_{(i)} of a sample from a density with
support on [a, b]:

p(s) = (Γ(n + 1)/(Γ(i) Γ(n − i))) ∫_a^{b−s} p(w + s) p(w) [F(w)]^{i−1} [1 − F(w + s)]^{n−i−1} dw
In the case of discrete random quantities the idea is the same, but a bit more messy
because one has to watch for the discontinuities of the Distribution Function. Thus,
for instance:
• Maximum X_{(n)} = max{X_1, X_2, …, X_n}:
X_{(n)} ≤ x iff all x_i are less than or equal to x, and this happens with probability
P(X_{(n)} ≤ x) = [F(x)]^n;
X_{(n)} < x iff all x_i are less than x, and this happens with probability [F(x⁻)]^n.
Therefore P(X_{(n)} = x) = [F(x)]^n − [F(x⁻)]^n.
• Minimum X_{(1)} = min{X_1, X_2, …, X_n}:
X_{(1)} ≥ x iff all x_i are greater than or equal to x, and this happens with probability [1 − F(x⁻)]^n;
X_{(1)} > x iff all x_i are greater than x, and this happens with probability [1 − F(x)]^n.
Therefore P(X_{(1)} = x) = [1 − F(x⁻)]^n − [1 − F(x)]^n.
Consider now n independent values drawn from Un(x|0, T) and the order statistics
X_{(k)}, X_{(k+p)} and X_{(k+p+1)}, whose joint density is proportional to
x_k^{k−1} (x_{k+p} − x_k)^{p−1} (T − x_{k+p+1})^{n−(k+p+1)}.
Let's think for instance that those are the arrival times of n events collected with
a detector in a time window [a = 0, b = T]. If we define w_1 = x_{k+p} − x_k and
w_2 = x_{k+p+1} − x_k we have that

p(x_k, w_1, w_2|T, n, p) ∝ x_k^{k−1} w_1^{p−1} (T − x_k − w_2)^{n−k−p−1} 1_{[0,T−w_2]}(x_k) 1_{[0,w_2]}(w_1) 1_{[0,T]}(w_2)

Observe that the support can be expressed also as 1_{[0,T]}(w_1) 1_{[w_1,T]}(w_2) and that the
distribution of (W_1, W_2) does not depend on k. The marginal densities are given by:
p(w_1|T, n, p) = (n choose p) (p/T^n) w_1^{p−1} (T − w_1)^{n−p} 1_{[0,T]}(w_1)

p(w_2|T, n, p) = (n choose p) ((n − p)/T^n) w_2^p (T − w_2)^{n−p−1} 1_{[0,T]}(w_2)
and if we take the limit T → ∞ and n → ∞ keeping the rate λ = n/T constant we
have

p(w_2|λ, p) = (λ^{p+1}/Γ(p + 1)) e^{−λw_2} w_2^p 1_{[0,∞)}(w_2)

and

p(w_1|λ, p) = (λ^p/Γ(p)) e^{−λw_1} w_1^{p−1} 1_{[0,∞)}(w_1)
In consequence, under the stated conditions the time difference between two consec-
utive events ( p = 1) tends to an exponential distribution. Let’s consider for simplicity
this limiting behaviour in what follows and leave as an exercise the more involved
case of finite time window T .
Suppose now that after having observed one event, say xk , we have a dead-time of
size a in the detector during which we can not process any data. All the events that
fall in (xk , xk + a) are lost (unless we play with buffers). If the next observed event
is at time xk+ p+1 , we have lost p events and the probability for this to happen is
P(w_1 ≤ a, w_2 ≥ a|λ, p) = e^{−λa} (λa)^p/Γ(p + 1)
that is, N_lost ∼ Po(p|λa) regardless of the position of the last recorded time (x_k) in the
ordered sequence. As one could easily have intuited, the expected number of events
lost for each observed one is E[N_lost] = λa. Last, it is clear that the density of the
time difference between two consecutive observed events, when p are lost due to the
dead-time, is

p(w|λ, a) = λ e^{−λ(w−a)} 1_{[a,∞)}(w)

Note that it depends on the dead-time window a and not on the number of events
lost.
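The dead-time argument can be simulated directly: generate exponential inter-arrival times, drop every event inside the window a after each observed one, and count the losses per observed event (a sketch assuming NumPy; λ and a are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(6)
lam, a = 2.0, 0.4                       # rate and dead-time window (illustrative)
times = np.cumsum(rng.exponential(1/lam, 500_000))

lost_counts = []
last_obs = times[0]
lost = 0
for t in times[1:]:
    if t < last_obs + a:
        lost += 1                        # event inside the dead-time window: lost
    else:
        lost_counts.append(lost)         # record losses since last observed event
        last_obs, lost = t, 0

lost_counts = np.asarray(lost_counts)
print(lost_counts.mean(), lam * a)       # Poisson mean λa ...
print(lost_counts.var(), lam * a)        # ... and, being Poisson, equal variance
```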
In Probability, the Limit Theorems are statements that, under the conditions of
applicability, describe the behavior of a sequence of random quantities or of Distribution
Functions. In principle, whenever we can define a distance (or at least a
positive defined set function) we can establish a convergence criterion and, obviously,
some will be stronger than others so, for instance, a sequence of random quantities
{X_i}_{i=1}^∞ may converge according to one criterion and not to another. The most usual
types of convergence and their relations are:

Almost Sure Convergence ⟹ Convergence in Probability ⟹ Convergence in Distribution
Convergence in Quadratic Mean ⟹ Convergence in Probability

so Convergence in Distribution is the weakest of all since it does not imply any of the
others. In principle, there will be no explicit mention of statistical independence of
the random quantities of the sequence nor of a specific Distribution Function. In
most cases we shall just state the different criteria for convergence and refer to the
literature, for instance [2], for further details and demonstrations. Let's start with the
very useful Chebyshev's Theorem.
Let X be a random quantity that takes values in Ω ⊂ R with Distribution Function
F(x) and consider the random quantity Y = g(X) with g(X) a non-negative single
valued function for all X ∈ Ω. Then, for α ∈ R⁺

P(g(X) ≥ α) ≤ E[g(X)]/α

In fact, given a measure space (Ω, B, μ), for any μ-integrable function f(x) and
c > 0 we have for A = {x : |f(x)| ≥ c} that c 1_A(x) ≤ |f(x)| for all x, and therefore

c μ(A) = ∫ c 1_A(x) dμ ≤ ∫ |f(x)| dμ
Let's see two particular cases. First, consider g(X) = (X − μ)^{2n} where μ = E[X]
and n a positive integer, such that g(X) ≥ 0 ∀X ∈ Ω. Applying Chebyshev's Theorem
with α = (kσ)^{2n} and n = 1 (Bienaymé–Chebyshev inequality):

P(|X − μ| ≥ kσ) ≤ 1/k²

that is, whatever the Distribution Function of the random quantity X is, the probability
that X differs from its expected value μ by more than k times its standard deviation is
less than or equal to 1/k². As a second case, assume X takes only positive real values
and has a first order moment E[X] = μ. Then (Markov's inequality):

P(X ≥ α) ≤ μ/α  −→(α = kμ)  P(X ≥ kμ) ≤ 1/k
The Markov and Bienaymé–Chebyshev inequalities provide upper bounds for
the probability knowing just the mean value and the variance, although they are usually
very conservative. They can be considerably improved if we have more information
about the Distribution Function but, as we shall see, the main interest of Chebyshev's
inequality lies in its importance for proving Limit Theorems.
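How conservative the bound is can be seen numerically: for an Exponential quantity (mean and variance both 1) the true tail probability sits far below 1/k² (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(1.0, 1_000_000)     # E[X] = 1, V[X] = 1

mu, sigma = 1.0, 1.0
for k in (2, 3, 5):
    p = (np.abs(x - mu) >= k * sigma).mean()
    print(k, p, 1 / k**2)                # empirical tail vs Chebyshev bound 1/k²
```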
Note that P(|X_n(w) − X(w)| ≥ ε) is a real number, so this is the usual limit
for a sequence of real numbers and, in consequence, for all ε > 0 and δ > 0
there exists n_0(ε, δ) such that for all n > n_0(ε, δ) it holds that P(|X_n(w) − X(w)| ≥ ε) < δ.
For a sequence of n-dimensional random quantities, this can be generalized to
lim_{n→∞} P(d(X_n(w), X(w)) ≥ ε) = 0, with d(·,·) the Euclidean distance and, as said earlier,
Convergence in Probability implies Convergence in Distribution but the converse is not true.
An important consequence of the Convergence in Probability is the
• Weak Law of Large Numbers: Consider a sequence of independent random
quantities {X_i(w)}_{i=1}^∞, all with the same Distribution Function and first order moment
E[X_i(w)] = μ, and define a new random quantity

Z_n(w) = (1/n) Σ_{i=1}^n X_i(w)

Then the sequence {Z_n(w)}_{n=1}^∞ converges in probability to μ.
The Law of Large Numbers was stated first by J. Bernoulli in 1713 for the Binomial
Distribution, generalized (and named Law of Large Numbers) by S.D. Poisson and
shown in the general case by A. Khinchin in 1929. In the case the X_i(w) have variance
V[X_i] = σ², it is straightforward from Chebyshev's inequality:

P(|Z_n − μ| ≥ ε) = P((Z_n − μ)² ≥ ε²) ≤ E[(Z_n − μ)²]/ε² = σ²/(nε²)
Intuitively, Convergence in Probability means that when n is very large, the probability
that Z_n(w) differs from μ by a small amount is very small; that is, Z_n(w) gets
more concentrated around μ. But “very small” is not zero, and it may happen that for
some k > n, Z_k differs from μ by more than ε. A stronger criterion of convergence
is the Almost Sure Convergence.
A sequence {X_n(w)}_{n=1}^∞ of random quantities converges almost surely to X(w) if, and
only if:

P(lim_{n→∞} |X_n(w) − X(w)| < ε) = 1

for all ε > 0. Needless to say, the random quantities X_1, X_2, … and X are
defined on the same probability space. Again, Almost Sure Convergence implies
Convergence in Probability but the converse is not true. An important consequence
of the Almost Sure Convergence is the:
• Strong Law of Large Numbers (E. Borel 1909, A.N. Kolmogorov, …): Let
{X_i(w)}_{i=1}^∞ be a sequence of independent random quantities all with the same Distribution
Function and first order moment E[X_i(w)] = μ. Then the sequence {Z_n(w)}_{n=1}^∞
with

Z_n(w) = (1/n) Σ_{i=1}^n X_i(w)

converges almost surely to μ. Intuitively, Almost Sure Convergence means that the probability that for some
k > n, Z_k differs from μ by more than ε becomes smaller as n grows.
A sequence {X_n(w)}_{n=1}^∞ with Distribution Functions {F_n(x)}_{n=1}^∞ Converges in
Distribution to X(w) ∼ F(x) if, and only if,

lim_{n→∞} F_n(x) = F(x)  for all x ∈ C(F)

with C(F) the set of points of continuity of F(x). Expressed in a different manner,
the sequence {X_n(w)}_{n=1}^∞ Converges in Distribution to X(w) if, and only if, for all
ε > 0 and x ∈ C(F), there exists n_0(ε, x) such that |F_n(x) − F(x)| < ε, ∀n > n_0(ε, x). Note
that, in general, n_0 depends on x, so it is possible that, given an ε > 0, the value of
n_0 for which the condition |F_n(x) − F(x)| < ε is satisfied for certain values of x
may not be valid for others. It is important to note also that we have not made any
statement about the statistical independence of the random quantities and that the
Convergence in Distribution is determined only by the Distribution Functions, so the
corresponding random quantities do not have to be defined on the same probability
space. To study the Convergence in Distribution, the following theorem is very
useful:
• Theorem (Lévy 1937; Cramèr 1937): Consider a sequence of Distribution Functions
{F_n(x)}_{n=1}^∞ and of the corresponding Characteristic Functions {Φ_n(t)}_{n=1}^∞. Then:
– if lim_{n→∞} F_n(x) = F(x), then lim_{n→∞} Φ_n(t) = Φ(t) for all t ∈ R, with Φ(t)
the Characteristic Function of F(x);
– conversely, if Φ_n(t) → Φ(t) ∀t ∈ R as n → ∞ and Φ(t) is continuous at t = 0, then
F_n(x) → F(x) as n → ∞.
This criterion of convergence is weak in the sense that if there is convergence in
probability, or almost surely, or in quadratic mean, then there is convergence in distribution,
but the converse is not necessarily true. However, there is a very important
consequence of the Convergence in Distribution:
• Central Limit Theorem (Lindeberg–Lévy): Let {X_i(w)}_{i=1}^∞ be a sequence of independent
random quantities all with the same Distribution Function and with second
order moments, so E[X_i(w)] = μ and V[X_i(w)] = σ². Then the sequence
{Z_n(w)}_{n=1}^∞ of random quantities

Z_n(w) = (1/n) Σ_{i=1}^n X_i(w)

with

E[Z_n] = (1/n) Σ_{i=1}^n E[X_i] = μ  and  V[Z_n] = (1/n²) Σ_{i=1}^n V[X_i] = σ²/n

tends, in the limit n → ∞, to be distributed as N(z|μ, σ/√n) or, what is the same,
the standardized random quantity

Z̃_n = (Z_n − μ)/√V[Z_n] = (1/√n) Σ_{i=1}^n (X_i − μ)/σ

tends to be distributed as N(x|0, 1).
To show this, define W_i = X_i − μ, with E[W_i] = 0, V[W_i] = σ² and Characteristic Function

Φ_W(t) = 1 − (1/2) t²σ² + O(t^k)

Since we require that the random quantities X_i have at least moments of order two,
the remaining terms O(t^k) are either zero or powers of t larger than 2. Then,

Z_n = (1/n) Σ_{i=1}^n X_i = (1/n) Σ_{i=1}^n W_i + μ;  E[Z_n] = μ;  V[Z_n] = σ²_{Z_n} = σ²/n

so

Φ_{Z_n}(t) = e^{itμ} [Φ_W(t/n)]^n  −→  lim_{n→∞} Φ_{Z_n}(t) = e^{itμ} lim_{n→∞} [Φ_W(t/n)]^n

Now, since:

Φ_W(t/n) = 1 − (1/2)(t/n)²σ² + O(t^k/n^k) = 1 − (1/2)(t²/n) σ²_{Z_n} + O(t^k/n^k)

we have that:

lim_{n→∞} [Φ_W(t/n)]^n = lim_{n→∞} [1 − (1/2)(t²/n) σ²_{Z_n} + O(t^k/n^k)]^n = exp{−(1/2) t² σ²_{Z_n}}

and therefore:

Φ_{Z_n}(t) −→ e^{itμ} e^{−t²σ²_{Z_n}/2}  as n → ∞

so, lim_{n→∞} Z_n ∼ N(x|μ, σ/√n).
The first indications about the Central Limit Theorem are due to A. De Moivre
(1733). Later, C.F. Gauss and P.S. Laplace enunciated the behavior in a general
way and, in 1901, A. Lyapunov gave the first rigorous demonstration under more
restrictive conditions. The theorem in the form we have presented here is due to
Lindeberg and Lévy and requires that the random quantities X_i:
(i) are statistically independent;
(ii) have the same Distribution Function;
(iii) have first and second order moments (i.e. they have mean value and variance).
In general, there is a set of Central Limit Theorems depending on which of the
previous conditions are satisfied and justify the empirical fact that many natural
phenomena are adequately described by the Normal Distribution. To quote E.T.
Whittaker and G. Robinson (Calculus of Observations):
“Everybody believes in the exponential law of errors;
The experimenters because they think that it can be proved by mathematics;
and the mathematicians because they believe it has been established by
observation”
Example 1.29 From the limiting behavior of the Characteristic Function, show that:
• If X ∼ Bi(r|n, p), in the limit p → 0 with np constant it tends to a Poisson Distribution
Po(r|μ = np);
• If X ∼ Bi(r|n, p), in the limit n → ∞ the standardized random quantity

Z = (X − μ_X)/σ_X = (X − np)/√(npq)  −→(n→∞)  N(x|0, 1)

• If X ∼ Po(x|μ), in the limit μ → ∞

Z = (X − μ_X)/σ_X = (X − μ)/√μ  −→(μ→∞)  N(x|0, 1)

• If X ∼ χ²(x|ν), in the limit ν → ∞

Z = (X − μ_X)/σ_X = (X − ν)/√(2ν)  −→(ν→∞)  N(x|0, 1)
Fig. 1.2 Generated sample from Un(x|0, 1) (1) and sampling distribution of the mean of 2 (2),
5 (3), 10 (4), 20 (5) and 50 (6) generated values
Example 1.30 It is interesting to see the Central Limit Theorem at work. For this,
we have done a Monte Carlo sampling of the random quantity X ∼ Un(x|0, 1). The
sampling distribution is shown in Fig. 1.2(1) and the following ones show the
sample mean of n = 2 (Fig. 1.2(2)), 5 (Fig. 1.2(3)), 10 (Fig. 1.2(4)), 20 (Fig. 1.2(5))
and 50 (Fig. 1.2(6)) consecutive values. Each histogram has 500000 events and, as you
can see, as n grows the distribution “looks” more Normal. For n = 20 and n = 50
the Normal distribution is superimposed.
The same behavior is observed in Fig. 1.3, where we have generated a sequence
of values from a parabolic distribution with minimum at x = 1 and support on
Ω = [0, 2].
Last, Fig. 1.4 shows the results for a sampling from the Cauchy Distribution
X ∼ Ca(x|0, 1). As you can see, the sampling averages follow a Cauchy Distribution
regardless of the value of n. For n = 20 and n = 50, a Cauchy and a Normal
distribution have been superimposed. In this case, since the Cauchy Distribution
has no moments, the Central Limit Theorem does not apply.
Fig. 1.3 Generated sample from a parabolic distribution with minimum at x = 1 and support on
Ω = [0, 2] (1) and sampling distribution of the mean of 2 (2), 5 (3), 10 (4), 20 (5) and 50 (6) generated
values
Example 1.31 Let {X_i(w)}_{i=1}^∞ be a sequence of independent random quantities, all
with the same Distribution Function, mean value μ and variance σ², and consider the
random quantity

Z(w) = (1/n) Σ_{i=1}^n X_i(w)

What is the value of n such that the probability that Z differs from μ by more than ε is
less than δ = 0.01?
From the Central Limit Theorem we know that in the limit n → ∞, Z ∼
N(x|μ, σ/√n), so we may consider that, for large n:
Fig. 1.4 Generated sample from a Cauchy distribution Ca(x|0, 1) (1) and sampling distribution of
the mean of 2 (2), 5 (3), 10 (4), 20 (5) and 50 (6) generated values
P(|Z − μ| ≥ ε) = ∫_{−∞}^{μ−ε} N(x|μ, σ/√n) dx + ∫_{μ+ε}^{+∞} N(x|μ, σ/√n) dx = 1 − erf(ε√n/(σ√2)) < δ
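Inverting the condition gives n > 2(σ/ε)² [erf⁻¹(1 − δ)]²; with δ = 0.01 and, as illustrative choices not fixed by the example, σ = 1 and ε = 0.1, this is a one-liner (SciPy's erfinv is assumed):

```python
import math
from scipy.special import erfinv

sigma, eps, delta = 1.0, 0.1, 0.01          # sigma and eps are illustrative
n_min = 2 * (sigma / eps) ** 2 * erfinv(1 - delta) ** 2
print(math.ceil(n_min))                      # smallest n meeting the requirement
```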
A sequence {X_n(w)}_{n=1}^∞ Converges in L^p norm (p ≥ 1) to X(w) iff
lim_{n→∞} E[|X_n(w) − X(w)|^p] = 0; that is, iff for any real ε > 0 there exists a natural n_0(ε) > 0 such that for all n ≥ n_0(ε)
it holds that E[|X_n(w) − X(w)|^p] < ε. In the particular case that p = 2 it is called
Convergence in Quadratic Mean.
From Chebyshev's Theorem,

P(|X_n(w) − X(w)| ≥ ε) ≤ E[|X_n(w) − X(w)|^p]/ε^p

so Convergence in L^p norm implies Convergence in Probability. Convergence may
be point-wise but not uniform: consider, for instance, f_n(x) = x(1 + 1/n) with x ∈ R,
which converges point-wise to f(x) = x. For any fixed x, |f_n(x) − f(x)| < ε for n large
enough, but for larger values of x we need larger values of n. Thus, the convergence
is not uniform because sup_x |f_n(x) − f(x)| does not tend to zero.
Intuitively, for whatever small a given ε is, the band f(x) ± ε = x ± ε does not contain
f_n(x) for all n sufficiently large. As a second example, take f_n(x) = x^n with
x ∈ (0, 1). We have that lim_{n→∞} f_n(x) = 0 but sup_x |f_n(x)| = 1, so the convergence
is not uniform. For the cases we shall be interested in, if a Distribution Function F(x)
is continuous and the sequence {F_n(x)}_{n=1}^∞ converges in distribution to F(x) (i.e.
point-wise), then it does uniformly too. An important case of uniform convergence
is the (sometimes called) Fundamental Theorem of Statistics:
• Glivenko–Cantelli Theorem (V. Glivenko, F.P. Cantelli; 1933): Consider the random
quantity X ∼ F(x) and a statistically independent (essential point) sampling of
size n, {x_1, x_2, …, x_n}. The empirical Distribution Function

F_n(x) = (1/n) Σ_{i=1}^n 1_{(−∞,x]}(x_i)

converges uniformly to F(x). Indeed, n F_n(x) is a sum of independent Bernoulli quantities,

Z_n(x) = Σ_{i=1}^n 1_{(−∞,x]}(x_i) = n F_n(x)  −→  Φ_{Z_n}(t) = [e^{it} F(x) + (1 − F(x))]^n

with E[F_n(x)] = F(x) and V[F_n(x)] = F(x)(1 − F(x))/n, so

lim_{n→∞} E[|F_n(x) − F(x)|²] = lim_{n→∞} F(x)(1 − F(x))/n = 0

and the sequence converges also in quadratic mean and therefore in distribution.
is the form (exponential family) that satisfies the k + 1 constraints ∫ h_i(x) p(x) dx = c_i < ∞; i = 0, . . ., k, with specified constants {c_j}_{j=0}^k (c0 = 1 for h0(x) = 1), and
best approximates a given density q(x). The Hellinger distance is a metric on the
set of all probability measures on B and we shall make use of it, for instance, as a
further check for convergence in Markov Chain Monte Carlo sampling.
Appendices
E[X] = Σ_{k=1}^{2} X(A_k) μ(A_k) = Σ_{k=1}^{2} k (k/3) = 5/3
In principle, for a real valued function f(x) defined on [a, b] the basic approach to evaluate ∫_a^b f dx is that of Riemann and goes as follows. Consider a partition of the interval [a, b] = ∪_{k=0}^{n−1} Δ_k, with Δ_k = [x_k, x_{k+1}), and the sum

S_n = Σ_{k=0}^{n−1} f(x_k)(x_{k+1} − x_k)
and the limit of Sn as the partition gets finer and finer in such a way that max(xk+1 −
xk ) → 0. If the limit exists, we say that the function f (x) is Riemann integrable and
the limit is the (Riemann) integral. Therefore, for the posed problem:
(1) Take a partition of the domain [1, 2] where x_k = 1 + kΔ with x0 = 1 and x_n = 2 → Δ = 1/n;
(2) For each subinterval Δ_k, of length Δ, take x_k* = x_k so f(x_k) = ln x_k = ln(1 + kΔ);
(3) Evaluate the sum

S_n = Δ Σ_{k=0}^{n−1} ln x_k = (1/n) Σ_{k=0}^{n−1} ln(1 + k/n) = (1/n) ln Π_{k=0}^{n−1} (1 + k/n) = (1/n) ln [Γ(2n)/(Γ(n) nⁿ)]

and take the limit Δ→0⁺ (n → ∞). You can check that lim_{n→∞} S_n = 2 ln 2 − 1.
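The Riemann construction can be checked numerically (a direct sketch of step (3)):

```python
from math import log

def riemann_ln(n):
    """Left Riemann sum of ln(x) on [1, 2] with n equal subintervals of width 1/n."""
    delta = 1.0 / n
    return delta * sum(log(1.0 + k * delta) for k in range(n))

exact = 2.0 * log(2.0) - 1.0
print(riemann_ln(1000), riemann_ln(100000), exact)
```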
Consider now a measure space (R, B, μ) and a non-negative, bounded and Borel measurable function f(x). Lebesgue's definition of the integral rests on partitioning the range of f(x) instead of the domain. Thus, we start with a partition of [0, sup f] = ∪_{k=0}^{n−1} Δ_k, where Δ_k = [y_k, y_{k+1}), and the sum

S_n = Σ_{k=0}^{n−2} y_k μ[f^{−1}(Δ_k)] + y_{n−1} μ[f^{−1}(Δ_{n−1})]
Again, as the partition gets finer in such a way that max(yk+1 − yk ) → 0, the limit
will be the Lebesgue integral provided it exists. For the problem at hand:
(1) Take a partition of the range [0, ln 2] where y_k = kΔ, y0 = 0 and y_n = ln 2 → Δ = n^{−1} ln 2;
(2) For each subinterval Δ_k determine the length of the corresponding interval on the support; that is, μ(Δ_k) = f^{−1}(y_{k+1}) − f^{−1}(y_k) = e^{kΔ}(e^{Δ} − 1);
(3) Evaluate the sum

S_n = Σ_{k=0}^{n−1} y_k μ(Δ_k) = Δ(e^{Δ} − 1) Σ_{k=0}^{n−1} k e^{kΔ} = 2 ln 2 + Δ e^{Δ} (1 − e^{Δ})^{−1}

and take the limit Δ→0⁺ (n → ∞) to get, again, 2 ln 2 − 1.
Nevertheless, the crucial difference with respect to Riemann's integral is not the partition of the range but the possibility to perform integrals over “wilder” sets and, for
us, the chance to consider arbitrary probability measures over arbitrary sets. But, for
this, we have to clarify how to define the measure of a set. In general, we shall be
concerned only with Rⁿ and it turns out that there is a unique measure λ on Rⁿ that is invariant under translations and such that for the unit cube λ([0, 1]ⁿ) = 1: the Lebesgue measure that assigns to an interval [a, b] ⊂ R what we intuitively would
guess: λ([a, b]) = (b − a). However, as explained in Sect. 1.1.2.2, if we want to
satisfy these conditions there is a price to pay: not all subsets of R are measurable.
Let's finish with a more axiomatic introduction and some properties. Consider the measure space (Ω, B, μ); eventually a probability space with μ a probability measure. Then, for S ∈ B we define

μ(S) ≝ ∫_S dμ = ∫ 1_S dμ

where μ(S) may be +∞ (unless it is a finite measure). Now, given a finite partition {S_k; k = 1, . . ., n} of Ω and a simple function

s = Σ_{k=1}^n a_k 1_{S_k} where a_k ≥ 0 ∀k and μ(S_k) < +∞ if a_k ≠ 0
Then:
(1) Let f be a non-negative measurable function with respect to B (that may take the value +∞). We define:

∫ f dμ ≝ sup { ∫ s dμ ; 0 ≤ s ≤ f ; s simple }

A measurable function f is integrable if

∫ |f| dμ = ∫ f⁺ dμ + ∫ f⁻ dμ < +∞

If f(x) = g(x) almost everywhere, the function f(x) is integrable iff g(x) is integrable and both integrals are the same;
(2) If f(x) and g(x) are two integrable functions then
• if a, b ∈ R, it holds that ∫ (a f + b g) dμ = a ∫ f dμ + b ∫ g dμ
• if f ≤ g it holds that ∫ f dμ ≤ ∫ g dμ
(3) If {f_k(x)}_{k∈N} is a sequence of non-negative measurable functions such that f_k(x) ≤ f_{k+1}(x) for all k ∈ N and x ∈ Ω, then

lim_k ∫ f_k dμ = ∫ lim_k f_k dμ
(1) If μ1 ≪ μ2 and μ2 ≪ μ3, then μ1 ≪ μ3 and dμ1/dμ3 = (dμ1/dμ2)(dμ2/dμ3), so:

μ1(A) = ∫_A dμ1 = ∫_A (dμ1/dμ2) dμ2 = ∫_A (dμ1/dμ2)(dμ2/dμ3) dμ3 = ∫_A (dμ1/dμ3) dμ3

(2) If μ1 ≪ μ3 and μ2 ≪ μ3, then d(μ1 + μ2)/dμ3 = dμ1/dμ3 + dμ2/dμ3
(3) If μ1 ≪ μ2 and g is a μ1-integrable function, then

∫_A g dμ1 = ∫_A g (dμ1/dμ2) dμ2

(4) If μ1 ≪ μ2 and μ2 ≪ μ1 (equivalently: μ1 ∼ μ2) then dμ1/dμ2 = (dμ2/dμ1)^{−1}
We shall use some of these properties in different places; for instance, relating math-
ematical expectations under different probability measures or justifying some tech-
niques used in Monte Carlo Sampling.
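Property (3) is precisely what importance sampling in Monte Carlo exploits: an expectation under μ1 is computed from samples of μ2 weighted by the Radon–Nikodym derivative dμ1/dμ2. A minimal sketch (the two exponential densities are illustrative assumptions):

```python
import random
from math import exp

random.seed(1)
# Target density p(x) = e^{-x} (Exponential, rate 1); instrumental q(x) = 0.5 e^{-x/2}.
# E_p[X] = 1 is estimated with samples from q weighted by w(x) = p(x)/q(x).
n = 200000
total = 0.0
for _ in range(n):
    x = random.expovariate(0.5)            # sample from q
    w = exp(-x) / (0.5 * exp(-0.5 * x))    # Radon-Nikodym derivative dP/dQ at x
    total += w * x
estimate = total / n
print(estimate)
```

The weights are well behaved here because the instrumental density has heavier tails than the target; the opposite choice would give the estimator infinite variance.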
References
given along the section. At the end, to quote Lindley, “Inside every non-Bayesian
there is a Bayesian struggling to get out”. For a more classical approach to Statistical
Inference see [2] where most of what you will need in Experimental Physics is
covered in detail.
p(θ, x) = p(θ1, θ2, . . . , θk, x1, x2, . . . , xn); θ = (θ1, . . . , θk) ∈ Θ ⊆ Rᵏ; x ∈ X
(2) Conditioning the observed data (x) on the parameters (θ) of the model:

p(θ|x) = p(x|θ) p(θ) / ∫ p(x|θ) p(θ) dθ

This is the basic equation for parametric inference. The integral in the denominator does not depend on the parameters (θ) of interest; it is just a normalization factor so, in a general way, we can write p(θ|x) ∝ p(x|θ) p(θ).
p(x1, x2, . . . , xn) = Π_{i=1}^n p(xi)
² It is easy to check, for instance, that if X0 is a non-trivial random quantity independent of the Xi, the sequence {X0 + X1, X0 + X2, . . . , X0 + Xn} is exchangeable but not iid.
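The footnote's construction can be verified by simulation (a sketch with assumed standard Normal quantities): every element of {X0 + Xi} has the same marginal distribution, but any two of them share X0 and are therefore correlated, with correlation 1/2 here.

```python
import random

random.seed(7)
n = 100000
y1, y2 = [], []
for _ in range(n):
    x0 = random.gauss(0.0, 1.0)          # common component shared by all elements
    y1.append(x0 + random.gauss(0.0, 1.0))
    y2.append(x0 + random.gauss(0.0, 1.0))

m1 = sum(y1) / n
m2 = sum(y2) / n
cov = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
v1 = sum((a - m1) ** 2 for a in y1) / n
v2 = sum((b - m2) ** 2 for b in y2) / n
corr = cov / (v1 * v2) ** 0.5
print(corr)  # identically distributed but not independent
```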
sequence of real-valued random quantities it can be shown that, for any finite subset, there exists a parameter θ ∈ Θ, a parametric model p(x|θ) and a measure dμ(θ) such that³:

p(x1, x2, . . . , xn) = ∫_Θ Π_{i=1}^n p(xi|θ) dμ(θ)
we obtain the posterior density p(θ|x) that accounts for the degree of knowledge
we have on the parameter after the experiment has been performed. Note that the
random quantities of the exchangeable sequence {X 1 , X 2 , . . . , X n } are conditionally
independent given θ but not iid because
p(xj) = ∫_Θ p(xj|θ) dμ(θ) (Π_{i(≠j)=1}^n ∫_X p(xi|θ) dxi)

and

p(x1, x2, . . . , xn) ≠ Π_{i=1}^n p(xi)
There are situations for which the hypothesis of exchangeability can not be
assumed to hold. That is the case, for instance, when the data collected by an exper-
iment depends on the running conditions that may be different for different periods
of time, for data provided by two different experiments with different acceptances,
selection criteria, efficiencies,… or the same medical treatment when applied to
individuals from different environments, sex, ethnic groups,… In these cases, we
shall have different units of observation and it may be more sound to assume partial
exchangeability within each unit (data taking periods, detectors, hospitals,…) and
design a hierarchical structure with parameters that account for the relevant infor-
mation from each unit analyzing all the data in a more global framework.
Note 4: Suppose that we have a parametric model p1 (x|θ) and the exchangeable
sample x 1 = {x1 , x2 , . . . , xn } provided by the experiment e1 (n). The inferences on
³ This is referred to as De Finetti's Theorem after B. de Finetti (1930s) and was generalized by E. Hewitt and L.J. Savage in the 1950s. See [4].
the parameters θ will be drawn from the posterior density p(θ|x 1 ) ∝ p1 (x 1 |θ) p(θ).
Now, we do a second experiment e2 (m), statistically independent of the first, that
provides the exchangeable sample x 2 = {xn+1 , xn+2 , . . . , xn+m } from the model
p2 (x|θ). It is sound to take as prior density for this second experiment the pos-
terior of the first including therefore the information that we already have about θ
so
Since the two experiments are statistically independent and their sequences exchangeable, if they have the same sampling distribution p(x|θ) we have that
p1 (x 1 |θ) p2 (x 2 |θ) = p(x|θ) where x = {x 1 , x 2 } = {x1 , . . . , xn , xn+1 , . . . , xn+m }
and therefore p(θ|x 2 ) ∝ p(x|θ) p(θ). Thus, the knowledge we have on θ includ-
ing the information provided by the experiments e1 (n) and e2 (m) is determined by
the likelihood function p(x|θ) and, in consequence, under the aforementioned con-
ditions the realization of e1 (n) first and e2 (m) after is equivalent, from the inferential
point of view, to the realization of the experiment e(n + m).
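For a conjugate model this equivalence can be checked in a few lines (a sketch assuming a Bernoulli model with a Beta prior; the counts are hypothetical):

```python
def beta_update(a, b, successes, trials):
    """Posterior Beta parameters after observing binomial data."""
    return a + successes, b + (trials - successes)

# Prior Be(1, 1); experiment e1: 7 successes in 20 trials; e2: 11 in 30.
post1 = beta_update(1, 1, 7, 20)                     # posterior after e1
post_seq = beta_update(post1[0], post1[1], 11, 30)   # e1 posterior as prior for e2
post_pooled = beta_update(1, 1, 7 + 11, 20 + 30)     # single pooled experiment e(n+m)
print(post_seq, post_pooled)
```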
Consider the realization of the experiment e1 (n) that provides the sample x =
{x1 , x2 , . . . , xn } drawn from the model p(x|θ). Inferences about θ ∈ are deter-
mined by the posterior density
Now suppose that, under the same model and the same experimental conditions, we think about doing a new independent experiment e2(m). What will be the distribution of the random sample y = {y1, y2, . . . , ym} not yet observed? Consider the
experiment e(n + m) and the sampling density
Since both experiments are independent and iid, we have the joint density
This is the basic expression for the predictive inference and allows us to predict the
results y of a future experiment from the results x observed in a previous experiment
within the same parametric model. Note that p( y|x) is the density of the quantities not
yet observed conditioned on the observed sample. Thus, even though the experiments e(y) and e(x) are statistically independent, the realization of the first one (e(x)) modifies the knowledge we have on the parameters θ of the model and therefore affects the prediction on future experiments for, if we do not consider the results of the first experiment or simply do not perform it, the predictive distribution for e(y) would be
p(y) = ∫ p(y|θ) π(θ) dθ
On the other hand, if after the first experiment we know the parameters with high accuracy then, in a distributional sense, ⟨p(θ|x), ·⟩ → ⟨δ(θ0), ·⟩ and

p(y|x) → ⟨δ(θ0), p(y|θ)⟩ = p(y|θ0).
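As a numerical sketch of the predictive construction (assuming a Bernoulli model with a uniform prior and hypothetical counts), the predictive probability of a future success, p(y = 1|x) = ∫ θ p(θ|x) dθ, evaluated on a grid reproduces the closed form (k + 1)/(n + 2):

```python
k, n = 7, 20        # hypothetical: k successes in n Bernoulli trials
m = 20000           # grid points for theta in (0, 1)
h = 1.0 / m
grid = [(i + 0.5) * h for i in range(m)]
# unnormalized posterior with uniform prior: likelihood theta^k (1-theta)^(n-k)
post = [t ** k * (1.0 - t) ** (n - k) for t in grid]
norm = sum(post) * h
pred = sum(t * p for t, p in zip(grid, post)) * h / norm   # p(y=1|x)
closed = (k + 1.0) / (n + 2.0)
print(pred, closed)
```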
T : Ω1 × · · · × Ωm −→ R^{k(m)}

t = t(x1, . . . , xm) is sufficient for θ if, and only if, ∀m ≥ 1 and any prior distribution π(θ) it holds that

p(θ|x1, x2, . . . , xm) = p(θ|t)
Since the data act in the Bayes formula only through the likelihood, it is clear that
to specify the posterior density of θ we can consider
and all other aspects of the data but t are irrelevant. It is obvious however that t =
{x1 , . . . , xm } is sufficient and, in principle, gives no simplification in the modeling.
For this we should have k(m) = dim(t) < m (minimal sufficient statistics) and, in
the ideal case, we would like that k(m) = k does not depend on m. Except for some irregular cases, the only distributions that admit a fixed number of sufficient statistics independently of the sample size (that is, k(m) = k < m ∀m) are those that belong to the exponential family.
Example 2.1 (1) Consider the exponential model X ∼ Ex(x|θ) and the iid experiment e(m) that provides the sample x = {x1, . . . , xm}. The likelihood function is:
and t = (m, Σ_{i=1}^m xi, Σ_{i=1}^m xi²) : Ω1 × · · · × Ωm −→ R^{k(m)=3} a sufficient statistic. Usually we shall consider t = {m, x̄, s²} with

x̄ = (1/m) Σ_{i=1}^m xi and s² = (1/m) Σ_{i=1}^m (xi − x̄)²
the sample mean and the sample variance. Inferences on the parameters μ and σ will
depend on t and all other aspects of the data are irrelevant.
(3) Consider the Uniform model X ∼ Un(x|0, θ) and the iid sampling {x1, x2, . . . , xm}. Then t = (m, max{xi, i = 1, . . . , m}) : Ω1 × · · · × Ωm −→ R^{k(m)=2} is a sufficient statistic for θ.
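Sufficiency of (m, max xi) in case (3) can be illustrated numerically (a sketch with hypothetical samples and a flat prior on a grid): two samples of equal size and equal maximum give exactly the same posterior.

```python
def posterior_on_grid(sample, grid):
    """Normalized posterior for Un(x|0, theta) with a flat prior on the grid:
    likelihood theta^{-m} for theta >= max(sample), zero otherwise."""
    m = len(sample)
    mx = max(sample)
    vals = [theta ** (-m) if theta >= mx else 0.0 for theta in grid]
    norm = sum(vals)
    return [v / norm for v in vals]

grid = [0.01 * i for i in range(1, 1001)]   # theta in (0, 10]
s1 = [0.3, 2.1, 4.9, 1.7]    # m = 4, max = 4.9
s2 = [4.9, 4.9, 0.1, 3.3]    # same m, same max, different values
p1 = posterior_on_grid(s1, grid)
p2 = posterior_on_grid(s2, grid)
print(max(abs(a - b) for a, b in zip(p1, p2)))  # identical posteriors
```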
with

g(θ)^{−1} = ∫_X f(x) exp{ Σ_{i=1}^k ci φi(θ) hi(x) } dx < ∞

and therefore t(x) = (n, Σ_{i=1}^n h1(xi), . . . , Σ_{i=1}^n hk(xi)) will be a set of sufficient statistics.
Example 2.2 Several distributions of interest, like Poisson and Binomial, belong to
the exponential family:
(1) Poisson Po(n|μ): P(n|μ) = e^{−μ} μⁿ/Γ(n + 1) = e^{−(μ − n ln μ)}/Γ(n + 1)

(2) Binomial Bi(n|N, θ): P(n|N, θ) = (N choose n) θⁿ(1 − θ)^{N−n} = (N choose n) e^{n ln θ + (N−n) ln(1−θ)}
However, the Cauchy Ca(x|α, β) distribution, for instance, does not because
p(x1, . . . , xm|α, β) ∝ Π_{i=1}^m [1 + β(xi − α)²]^{−1} = exp{ −Σ_{i=1}^m log(1 + β(xi − α)²) }

cannot be expressed in the exponential family form. In consequence, there is no minimal sufficient statistic of fixed dimension (in other words, t = {n, x1, . . . , xn} is the sufficient statistic) and we will have to work with the whole sample.
2.6 Prior Functions
In the Bayes rule, p(θ|x) ∝ p(x|θ) p(θ), the prior function p(θ) represents the
knowledge (degree of credibility) that we have about the parameters before the exper-
iment is done and it is a necessary element to obtain the posterior density p(θ|x)
from which we shall make inferences. If we have faithful information on them before
we do the experiment, it is reasonable to incorporate that in the specification of the
prior density (informative prior) so the new data will provide additional information
that will update and improve our knowledge. The specific form of the prior can be
motivated, for instance, by the results obtained in previous experiments. However, it
is usual that before we do the experiment, either we have a vague knowledge of the
parameters compared to what we expect to get from the experiment or simply we
do not want to include previous results to perform an independent analysis. In this
case, all the new information will be contained in the likelihood function p(x|θ) of
the experiment and the prior density (non-informative prior) will be merely a mathematical element needed for the inferential process. This being the case, we expect the whole weight of the inferences to rest on the likelihood and the prior function to have the smallest possible influence on them. To learn something from the experiment
it is then desirable to have a situation like the one shown in Fig. 2.1 where the pos-
terior distribution p(θ|x) is dominated by the likelihood function. Otherwise, the
experiment will provide little information compared to the one we had before and,
unless our previous knowledge is based on suspicious observations, it will be wise
to design a better experiment.
A considerable amount of effort has been put into obtaining reasonable non-informative priors that can be used as a standard reference function for the Bayes rule. Clearly,
non-informative is somewhat misleading because we are never in a state of absolute
ignorance about the parameters and the specification of a mathematical model for
the process assumes some knowledge about them (masses and life-times take non-negative real values, probabilities have support on [0, 1],…). On the other hand, it doesn't make sense to think about a function that represents ignorance in a formal and objective way, so knowing little a priori is relative to what we may expect to learn from the experiment. Whatever prior we use will certainly have some effect on the posterior inferences and, in some cases, it would be wise to consider a reasonable set of them to see what the effect is.
The ultimate task of this section is to present the most usual approaches to derive
a non-informative prior function to be used as a standard reference that contains
little information about the parameters compared to what we expect to get from the
experiment.4 In many cases, these priors will not be Lebesgue integrable (improper
functions) and, obviously, can not be considered as probability density functions that
quantify any knowledge on the parameters (although, with little rigor, sometimes we
still talk about prior densities). If one is reluctant to use them right away one can, for instance, define them on a sufficiently large compact support that contains the region where the likelihood is dominant. However, since
in most cases it will be sufficient to consider them simply as what they really are: a
measure. In any case, what is mandatory is that the posterior is a well defined proper
density.
The Principle of Insufficient Reason5 dates back to J. Bernoulli and P.S. Laplace
and, originally, it states that if we have n exclusive and exhaustive hypothesis and
there is no special reason to prefer one over the other, it is reasonable to consider
them equally likely and assign a prior probability 1/n to each of them. This certainly sounds reasonable and the idea was right away extended to parameters taking countable possible values and to those with continuous support that, in the case of compact sets, becomes a uniform density. It was extensively used by P.S. Laplace and T. Bayes, the latter being the first to use a uniform prior density for making inferences on the parameter of a Binomial distribution, and is usually referred to as the “Bayes-Laplace
Postulate”. However, a uniform prior density is obviously not invariant under repa-
rameterizations. If prior to the experiment we have a very vague knowledge about
the parameter θ ∈ [a, b], we certainly have a vague knowledge about φ = 1/θ or
ζ = logθ and a uniform distribution for θ:
⁵ Apparently, “Insufficient Reason” was coined by Laplace in reference to Leibniz's Principle of Sufficient Reason, stating essentially that every fact has a sufficient reason for why it is the way it is and not otherwise.
π(θ) dθ = dθ/(b − a)

implies that:

π(φ) dφ ∝ φ^{−2} dφ and π(ζ) dζ ∝ e^{ζ} dζ
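The non-invariance is easy to see by sampling (a sketch assuming θ uniform on [1, 2], illustrative values): the distribution function of φ = 1/θ is P(φ ≤ t) = 2 − 1/t on [1/2, 1], which is not uniform.

```python
import random

random.seed(3)
n = 100000
# theta uniform on [1, 2] makes phi = 1/theta non-uniform on [1/2, 1]
phis = [1.0 / random.uniform(1.0, 2.0) for _ in range(n)]

def empirical_cdf(values, t):
    return sum(1 for v in values if v <= t) / len(values)

for t in (0.6, 0.75, 0.9):
    print(t, empirical_cdf(phis, t), 2.0 - 1.0 / t)
```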
An important class of parameters we are interested in are those of position and scale.
Let’s treat them separately and leave for a forthcoming section the argument behind
that. Start with a random quantity X ∼ p(x|μ) with μ a location parameter. The
density has the form p(x|μ) = f (x − μ) so, taking a prior function π(μ) we can
write
In both cases the models have the same structure so making inferences on μ from the sample {x1, x2, . . . , xn} is formally equivalent to making inferences on μ′ from the shifted sample {x1′, x2′, . . . , xn′}. Since we have the same prior degree of knowledge on μ and μ′, it is reasonable to take the same functional form for π(·) and π′(·) so:
and, in consequence:
π(μ) = constant
If θ is a scale parameter, the model has the form p(x|θ) = θ f (xθ) so taking a
prior function π(θ) we have that
where we have defined the new parameter θ′ = θ/a. Following the same argument as before, it is sound to assume the same functional form for π(·) and π′(·) so:
and, in consequence:

π(θ) = 1/θ
Both prior functions are improper so they may be made explicit as

π(μ, θ) ∝ (1/θ) 1_Θ(θ) 1_M(μ)

with Θ, M an appropriate sequence of compact sets, or considered as prior measures
provided that the posterior densities are well defined. Let’s see some examples.
p(t|θ) = ((nθ)ⁿ/Γ(n)) t^{n−1} exp{−nθt}
It is clear that θ is a scale parameter so we shall take the prior function π(θ) = 1/θ.
Note that if we make the change z = log t and φ = log θ we have that
p(z|φ) = (nⁿ/Γ(n)) exp{n(φ + z) − n e^{φ+z}}
p(θ|t, n) = ((nt)ⁿ/Γ(n)) exp{−ntθ} θ^{n−1}; θ > 0
πk(θ) = (1/(2 log k)) (1/θ) 1_{Ck}(θ)

pk(θ|t, n) = ((nt)ⁿ exp{−ntθ} θ^{n−1}/[γ(n, ntk) − γ(n, nt/k)]) 1_{Ck}(θ)
For an iid sample x = {x1, . . . , xn} from the Normal model,

p(x|μ, σ) = Π_{i=1}^n N(xi|μ, σ) ∝ (1/σⁿ) exp{−(1/(2σ²)) Σ_{i=1}^n (xi − μ)²}

with

x̄ = (1/n) Σ_{i=1}^n xi and s² = (1/n) Σ_{i=1}^n (xi − x̄)²
so we can write
p(x|μ, σ) ∝ (1/σⁿ) exp{−(n/(2σ²)) [s² + (x̄ − μ)²]}
In this case we have both position and scale parameters so we take π(μ, σ) = π(μ)π(σ) ∝ σ^{−1} and get the proper posterior

p(μ, σ|x) ∝ p(x|μ, σ) π(μ, σ) ∝ (1/σ^{n+1}) exp{−(n/(2σ²)) [s² + (x̄ − μ)²]}
Z = n s²/σ² ∼ χ²(z|n − 1)
It is clear that p(μ, σ|x) ≠ p(μ|x) p(σ|x) and, in consequence, they are not independent.
random samplings x1 = {x11, x12, . . . , x1n1} and x2 = {x21, x22, . . . , x2n2} of sizes n1
and n 2 under the usual conditions. From the considerations of the previous example,
we can write
p(xi|μi, σi) ∝ (1/σi^{ni}) exp{−(ni/(2σi²)) [si² + (x̄i − μi)²]}; i = 1, 2
Clearly, (μ1 , μ2 ) are position parameters and (σ1 , σ2 ) scale parameters so, in princi-
ple, we shall take the improper prior function
π(μ1, σ1, μ2, σ2) = π(μ1)π(μ2)π(σ1)π(σ2) ∝ 1/(σ1σ2)
However, if we know that both distributions have the same variance, then we may set σ = σ1 = σ2 and, in this case, the prior function will be
π(μ1, μ2, σ) = π(μ1)π(μ2)π(σ) ∝ 1/σ
Let’s analyze both cases.
both with support in (0, +∞), and integrating out the last one we get that Z follows a Snedecor Distribution Sn(z|n2 − 1, n1 − 1) whose density is

p(z|x1, x2) = ((ν1/ν2)^{ν1/2}/Be(ν1/2, ν2/2)) z^{(ν1/2)−1} (1 + (ν1/ν2) z)^{−(ν1+ν2)/2} 1_{(0,∞)}(z).
we can write
p(μ1, μ2, σ|x, y) ∝ (1/σ^{n1+n2+1}) exp{−A/(2σ²)}
s² = (n1 s1² + n2 s2²)/(n1 + n2 − 2)
we have that
p(w|x1, x2) ∝ [1 + (n1 n2/(n1 + n2)) (w − (x̄1 − x̄2))²/(s²(n1 + n2 − 2))]^{−[(n1+n2−2)+1]/2}

T = [(μ1 − μ2) − (x̄1 − x̄2)] / [s (1/n1 + 1/n2)^{1/2}]
where the integral over u ∈ R cannot be expressed in a simple way. The density

p(w|x1, x2) ∝ ∫_{−∞}^{+∞} p(w, u|x1, x2) du
The question of how to establish a reasonable criterion to obtain a prior for a given model p(x|θ) that can be used as a standard reference function was studied by Harold Jeffreys [6] in the mid twentieth century. The rationale behind the argument is that if we have the model p(x|θ) with θ ∈ Θ ⊆ Rⁿ and make a reparameterization φ = φ(θ) with φ(·) a one-to-one differentiable function, the statements we make about θ should be consistent with those we make about φ and, in consequence, priors should be related by

π_θ(θ) dθ = π_φ(φ(θ)) |det[∂φi(θ)/∂θj]| dθ
π(θ) ∝ [det[I(θ)]]^{1/2}
In fact, if we consider the parameter space as a Riemannian manifold (see Sect. 4.7)
the Fisher’s matrix is the metric tensor (Fisher-Rao metric) and this is just the invariant
It should be pointed out that there may be other priors that are also invariant under
reparameterizations and that, as usual, we talk loosely about prior densities although
they usually are improper functions.
For a one-dimensional parameter, the density function expressed in terms of

φ ∼ ∫ [I(θ)]^{1/2} dθ

may be reasonably well approximated by a Normal density (at least in the parametric region where the likelihood is dominant) because I(φ) is constant (see Sect. 4.5) and then, due to translation invariance, a constant prior for φ is justified. Let's see some examples.
Example 2.7 (The Binomial Distribution) Consider the random quantity
X ∼ Bi(x|θ, n):
p(x|n, θ) = (n choose x) θˣ (1 − θ)^{n−x}; n, x ∈ N0; x ≤ n
π(θ) ∝ [θ(1 − θ)]^{−1/2}

we have that θ = sin²(φ/2) and, parameterized in terms of φ, I(φ) is constant so the distribution “looks” more Normal (see Fig. 2.2).
Fig. 2.2 Dependence of the likelihood function on the parameter θ (upper) and on φ = 2 asin(θ^{1/2}) (lower) for a Binomial process with n = 10 and k = 1, 5 and 9
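The variance-stabilizing effect of φ = 2 asin(θ^{1/2}) can be checked directly (a sketch): computing the expected Fisher information of the Binomial by summing over x, and transforming with (dθ/dφ)² = θ(1 − θ), gives a constant I(φ) = n.

```python
from math import comb

def fisher_theta(n, theta):
    """Expected Fisher information of Bi(x|n, theta): sum over x of score^2 * p(x)."""
    total = 0.0
    for x in range(n + 1):
        p = comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)
        score = x / theta - (n - x) / (1.0 - theta)
        total += score * score * p
    return total  # equals n / (theta (1 - theta))

n = 10
for theta in (0.2, 0.5, 0.8):
    i_theta = fisher_theta(n, theta)
    i_phi = i_theta * theta * (1.0 - theta)  # I(phi) = I(theta) (d theta/d phi)^2
    print(theta, i_theta, i_phi)             # i_phi is constant and equals n
```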
p(x|μ) = e^{−μ} μˣ/Γ(x + 1); x ∈ N; μ ∈ R⁺
p(x|θ, x0) = (θ/x0)(x0/x)^{θ+1} 1_{(x0,∞)}(x); θ ∈ R⁺
Then,

I(θ) = E_X[−∂² log p(x|θ, x0)/∂θ²] = 1/θ²
so, with π(θ) ∝ θ^{−1}, the posterior for x0 = 1 is p(θ|x, x0) = x^{−θ} log x. Note that in terms of y = log(x/x0) the model reads p(y|θ) = θ e^{−θy}, for which θ is a scale parameter and, from previous considerations, we should take π(θ) ∝ θ^{−1}, in consistency with Jeffreys' prior.
p(x|α, β) = (α^β/Γ(β)) e^{−αx} x^{β−1} 1_{(0,∞)}(x)

with Ψ′(x) the first derivative of the Digamma Function and, following Jeffreys' rule, we should take the prior

π(α, β) ∝ α^{−1} [β Ψ′(β) − 1]^{1/2}
Note that α is a scale parameter so, from previous considerations, we should take
π(α) ∝ α−1 . Furthermore, if we consider α and β independently, we shall get
π(α, β) = π(α)π(β) ∝ α^{−1} [Ψ′(β)]^{1/2}
p(x|α, β) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1} 1_{[0,1]}(x); α, β ∈ R⁺
so
π1(μ, σ) ∝ [det[I(μ, σ)]]^{1/2} ∝ 1/σ²
However, had we treated the two parameters independently, we should have obtained
π2(μ, σ) = π(μ) π(σ) ∝ 1/σ
The prior π2 ∝ σ −1 is the one we had used in Example 2.5 where the problem was
treated as two one-dimensional independent problems and, as we saw:
T = √(n − 1) (μ − x̄)/s ∼ St(t|n − 1) and Z = n s²/σ² ∼ χ²(z|n − 1)
I(σ1, σ2, ρ) = (1 − ρ²)^{−1} ×
⎛ (2 − ρ²)σ1^{−2}    −ρ²(σ1σ2)^{−1}    −ρσ1^{−1} ⎞
⎜ −ρ²(σ1σ2)^{−1}    (2 − ρ²)σ2^{−2}    −ρσ2^{−1} ⎟
⎝ −ρσ1^{−1}          −ρσ2^{−1}          (1 + ρ²)(1 − ρ²)^{−1} ⎠

and, in block form,

I(μ1, μ2, σ1, σ2, ρ) = ⎛ I(μ1, μ2)   0 ⎞
                        ⎝ 0   I(σ1, σ2, ρ) ⎠
From this,

π(μ1, μ2, σ1, σ2, ρ) ∝ |det I(μ1, μ2, σ1, σ2, ρ)|^{1/2} = 1/(σ1²σ2²(1 − ρ²)²)

while, treating {μ1, μ2} and {σ1, σ2, ρ} as independent blocks,

π(μ1, μ2, σ1, σ2, ρ) ∝ 1/(σ1σ2(1 − ρ²)^{3/2})
Problem 2.1 Show that for the density p(x|θ); x ∈ Ω ⊆ Rⁿ, the Fisher matrix (if it exists)

I_{ij}(θ) = E_X[ (∂ log p(x|θ)/∂θi)(∂ log p(x|θ)/∂θj) ]

transforms under a reparameterization φ = φ(θ) as

I_{ij}(φ) = I_{kl}(θ) (∂θk/∂φi)(∂θl/∂φj)
Problem 2.2 Show that for X ∼ Po(x|μ + b) with b ∈ R + known (Poisson model
with known background), we have that I(μ) = (μ + b)−1 and therefore the posterior
(proper) is given by:
Problem 2.3 Show that for the one parameter mixture model p(x|λ) = λ p1(x) + (1 − λ) p2(x) with p1(x) ≠ p2(x) properly normalized and λ ∈ (0, 1),
I(λ) = (1/(λ(1 − λ))) [1 − ∫_{−∞}^{∞} (p1(x) p2(x)/p(x|λ)) dx]
When p1(x) and p2(x) are “well separated”, the integral is ≪ 1 and therefore I(λ) ≃ [λ(1 − λ)]^{−1}. On the other hand, when they “get closer” we can write p2(x) = p1(x) + η(x) with ∫_{−∞}^{∞} η(x) dx = 0 and, after a Taylor expansion for |η(x)| ≪ 1, get to first order

I(λ) ≃ ∫_{−∞}^{∞} ((p1(x) − p2(x))²/p1(x)) dx + · · ·
independent of λ. Thus, for this problem it will be sound to consider the prior
π(λ|a, b) = Be(λ|a, b) with parameters between (1/2, 1/2) and (1, 1).
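The “well separated” limit is easy to verify by numerical integration (a sketch assuming two unit-variance Normal components; the means and λ are illustrative choices):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu):
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2.0 * pi)

def fisher_lambda(lam, mu1, mu2, lo=-10.0, hi=20.0, step=0.001):
    """I(lambda) = integral of (p1 - p2)^2 / (lam p1 + (1-lam) p2) dx (Riemann sum)."""
    total = 0.0
    x = lo
    while x < hi:
        p1, p2 = normal_pdf(x, mu1), normal_pdf(x, mu2)
        total += (p1 - p2) ** 2 / (lam * p1 + (1.0 - lam) * p2) * step
        x += step
    return total

lam = 0.3
i_sep = fisher_lambda(lam, 0.0, 10.0)     # well separated components
print(i_sep, 1.0 / (lam * (1.0 - lam)))   # close to [lam(1-lam)]^{-1}
```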
Sometimes we may be interested in providing the prior with invariance under some transformations of the parameters (or a subset of them) considered of interest for the problem at hand. As we have stated, from a formal point of view the prior can be treated as an absolutely continuous measure with respect to Lebesgue so p(θ|x) dθ ∝ p(x|θ) π(θ) dθ = p(x|θ) dμ(θ). Now, consider the probability space (Θ, B, μ) and a measurable homeomorphism T : Θ→Θ. A measure μ on the Borel algebra B would be invariant by the mapping T if for any A ∈ B we have that μ(T⁻¹(A)) = μ(A). We know, for instance, that there is a unique measure λ on Rⁿ
that is invariant under translations and such that for the unit cube λ([0, 1]n ) = 1:
the Lebesgue measure (in fact, it could have been defined that way). This is consis-
tent with the constant prior specified already for position parameters. The Lebesgue
measure is also the unique measure in R n that is invariant under the rotation group
SO(n) (see Problem 2.5). Thus, when expressed in spherical polar coordinates, it would be reasonable to take for the spherical surface S^{n−1} the rotation invariant prior
dμ(φ) = Π_{k=1}^{n−1} (sin φk)^{(n−1)−k} dφk
with φn−1 ∈ [0, 2π) and φ j ∈ [0, π] for the rest. We shall use this prior function in
a later problem.
In other cases, the group of invariance is suggested by the model

M : {p(x|θ), x ∈ X, θ ∈ Θ}
in the sense that we can make a transformation of the random quantity X→X′ and absorb the change in a redefinition of the parameters θ→θ′ such that the expression of the probability density remains unchanged. Consider a group of transformations⁶ G that acts
for any Lebesgue integrable function f (x) on X . Shortly after, it was shown (Von
Neumann (1934); Weil and Cartan (1940)) that this measure is unique up to a multiplicative constant. In our case, the function will be p(·|θ) 1_Θ(θ) and the invariant measure we are looking for is dμ(θ) ∝ π(θ) dθ. Furthermore, since the group may be non-abelian, we shall consider the action on the right and on the left of the parameter space. Thus, we shall have:
p(·|g∘θ) π_L(θ) dθ = p(·|θ′) π_L(θ′) dθ′
if the action is on the right. Then, we should start by identifying the group of transformations under which the model is invariant (if any; in many cases, either there is no invariance or at least it is not obvious) and work in the parameter space. The most interesting
cases for us are:
6 In this context, the use of Transformation Groups arguments was pioneered by E.T. Jaynes [7].
Translations and scale transformations are a particular case of the first and rotations
of the second. Let’s start with the location and scale parameters; that is, a density
p(x|μ, σ) dx = (1/σ) f((x − μ)/σ) dx
Now,

p(·|μ′, σ′) π_L(μ′, σ′) dμ′ dσ′ = p(·|g∘(μ, σ)) π_L(μ, σ) dμ dσ =
= p(·|μ′, σ′) π_L[g⁻¹(μ′, σ′)] J(μ′, σ′; μ, σ) dμ′ dσ′ =
= p(·|μ′, σ′) π_L((μ′ − a)/b, σ′/b) (1/b²) dμ′ dσ′

and this should hold for all (a, b) ∈ R×R⁺ so, in consequence:

dμ_L(μ, σ) = π_L(μ, σ) dμ dσ ∝ (1/σ²) dμ dσ
However, the group of Affine Transformations is non-abelian so if we study the
action on the left, there is no reason why we should not consider also the action on
the right. Since
dμ_R(μ, σ) = π_R(μ, σ) dμ dσ ∝ (1/σ) dμ dσ
The first one (π_L) is the one we obtain using Jeffreys' rule in two dimensions while π_R is the one we get for position and scale parameters or Jeffreys' rule treating both parameters independently; that is, as two one-dimensional problems instead of as one two-dimensional problem. Thus, although from the invariance point of view there is no reason why one should prefer one over the other, the right invariant Haar prior
gives more consistent results. In fact ([9, 10]), a necessary and sufficient condition
for a sequence of posteriors based on proper priors to converge in probability to an
invariant posterior is that the prior is the right Haar measure.
Problem 2.4 As a reminder, given a measure space (Ω, B, μ), a mapping T : Ω −→ Ω is measurable if T⁻¹(A) ∈ B for all A ∈ B and the measure μ is invariant under T if μ(T⁻¹(A)) = μ(A) for all A ∈ B. Show that the measure dμ(θ) = [θ(1 − θ)]^{−1/2} dθ is invariant under the mapping T : [0, 1]→[0, 1] such that T : θ→θ′ = T(θ) = 4θ(1 − θ). This is the Jeffreys' prior for the Binomial model Bi(x|N, θ).
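The invariance can be probed by simulation (a sketch): θ = sin²(πU/2) with U ∼ U(0, 1) has distribution function F(t) = (2/π) asin(√t), whose density is the measure above, and the transformed sample T(θ) = 4θ(1 − θ) follows the same distribution.

```python
import random
from math import sin, asin, sqrt, pi

random.seed(11)
n = 20000
theta = [sin(pi * random.random() / 2.0) ** 2 for _ in range(n)]  # arcsine sample
t_theta = [4.0 * t * (1.0 - t) for t in theta]                    # transformed sample

def ks_vs_arcsine(sample):
    """Kolmogorov distance to F(t) = (2/pi) asin(sqrt(t))."""
    xs = sorted(sample)
    m = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = 2.0 / pi * asin(sqrt(x))
        d = max(d, abs((i + 1) / m - f), abs(i / m - f))
    return d

print(ks_vs_arcsine(theta), ks_vs_arcsine(t_theta))  # both small
```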
Problem 2.5 Consider the n-dimensional spherical surface Sn of unit radius, x ∈ Sn, and the transformation x′ = Rx ∈ Sn where R ∈ SO(n). Show that the Haar invariant measure is the Lebesgue measure on the sphere.
Hint: Recall that R is an orthogonal matrix so Rᵗ = R⁻¹; that |det R| = 1 so J(x′; x) = |∂x/∂x′| = |∂R⁻¹x′/∂x′| = |det R| = 1; and that x′ᵗx′ = xᵗx = 1.
Using the Cholesky decomposition we can express Σ⁻¹ as the product of two lower (or upper) triangular matrices:

Σ⁻¹ = (1/det[Σ]) ⎛ σ2²   −ρσ1σ2 ⎞ = AᵗA with A = ⎛ 1/σ1   0 ⎞
                  ⎝ −ρσ1σ2   σ1² ⎠                ⎝ −ρ/(σ1√(1 − ρ²))   1/(σ2√(1 − ρ²)) ⎠
and det[Σ] = (det[Σ⁻¹])⁻¹ = (det[A])⁻². Thus, in the new parameterization θ = {a11, a21, a22}

p(x|θ) = (2π)⁻¹ |det[A]| exp{−(1/2) xᵗ AᵗA x}
T∘x → Tx = x′     x∘T → T⁻¹x = x′
xᵗ (Tᵗ(Tᵗ)⁻¹) AᵗA (T⁻¹T) x     xᵗ ((Tᵗ)⁻¹Tᵗ) AᵗA (TT⁻¹) x
M = T     M = T⁻¹

Then

x′ = Mx; x = M⁻¹x′; xᵗ = x′ᵗ(Mᵗ)⁻¹ and dx = dx′/|det[M]|

so

p(x′|θ) = (2π)⁻¹ (|det[A]|/|det[M]|) exp{−(1/2) x′ᵗ (AM⁻¹)ᵗ(AM⁻¹) x′}
and the model is invariant under G_l if the action on the parameter space is A → A′ = AM⁻¹ so

p(x′|θ′) = (2π)⁻¹ |det[A′]| exp{−(1/2) x′ᵗ A′ᵗA′ x′}
and, in consequence, ∀M ∈ G

π(AM) J(A′; A) da11 da21 da22 = π(A) da11 da21 da22
and, in consequence,

π(a a11, a a21 + b a22, c a22) a²c = π(a11, a21, a22) −→ π(a11, a21, a22) ∝ 1/(a11² a22)
and, in consequence,

π(a11/a, (c a21 − b a22)/(ac), a22/c) (1/(ac²)) = π(a11, a21, a22) −→ π(a11, a21, a22) ∝ 1/(a11 a22²)
da11 da21 da22 = (1/(σ1²σ2²(1 − ρ²)²)) dσ1 dσ2 dρ
π_l^L(σ1, σ2, ρ) = 1/(σ1σ2(1 − ρ²)^{3/2}) and π_l^R(σ1, σ2, ρ) = 1/(σ2²(1 − ρ²))

π_u^L(σ1, σ2, ρ) = 1/(σ1σ2(1 − ρ²)^{3/2}) and π_u^R(σ1, σ2, ρ) = 1/(σ1²(1 − ρ²))
As we see, in both cases the left Haar invariant prior coincides with Jeffreys' prior when {μ1, μ2} and {σ1, σ2, ρ} are decoupled.
At this point, one may be tempted to use a right Haar invariant prior where the
two parameters σ1 and σ2 are treated on equal footing
π(σ1, σ2, ρ) = 1/(σ1σ2(1 − ρ²))
is a sufficient statistic for ρ, we have that the posterior for inferences on the correlation coefficient will be
we say that the class P is conjugated to S. We are mainly interested in the class
of priors P that have the same functional form as the likelihood. In this case, since
both the prior density and the posterior belong to the same family of distributions,
we say that they are closed under sampling. It should be stressed that the criterion for taking conjugated reference priors is eminently practical and, in many cases, they do
densities. Thus, if x = {x1 , x2 , . . . , xn } is an exchangeable random sampling from
the k-parameter regular exponential family, then
$$p(\mathbf{x}|\theta) = f(\mathbf{x})\, g(\theta)^{n}\, \exp\left\{\sum_{j=1}^{k} c_j\, \phi_j(\theta) \left[\sum_{i=1}^{n} h_j(x_i)\right]\right\}$$

$$p(\theta|\mathbf{x}, \tau) = \frac{p(\mathbf{x}, \theta, \tau)}{p(\mathbf{x}, \tau)} = \frac{p(\mathbf{x}|\theta)\, \pi(\theta|\tau)}{p(\mathbf{x}|\tau)}$$

The obvious question that arises is how to choose the prior π(τ) for the hyperparameters. Besides reasonableness, we may consider two approaches. Integrating out the parameters θ of interest, we get

$$p(\tau, \mathbf{x}) = \pi(\tau) \int p(\mathbf{x}|\theta)\, \pi(\theta|\tau)\, d\theta = \pi(\tau)\, p(\mathbf{x}|\tau)$$

so we may use any of the procedures under discussion to take π(τ) as the prior for the model p(x|τ) and then obtain

$$\pi(\theta) = \int \pi(\theta|\tau)\, \pi(\tau)\, d\tau$$
7 We can go a step upwards and assign a prior to the hyperparameters with hyper-hyperparameters, …
This is the beauty of Bayes' rule, but it is not very practical in complicated situations. A second, less elegant but more practical approach is the so-called Empirical Method, where we assign numeric values to the hyperparameters suggested by p(x|τ) (for instance, moments, maximum-likelihood estimates, …); that is, setting, in a distributional sense, π(τ) = δ_{τ0} so that ⟨π(τ), p(θ, x, τ)⟩ = p(θ, x, τ0). Thus,

$$p(\theta|\mathbf{x}, \tau_0) \propto p(\mathbf{x}|\theta)\, \pi(\theta|\tau_0)$$

$$\pi(\theta|\tau_1, \ldots, \tau_k) = \sum_{i=1}^{k} w_i\, \pi(\theta|\tau_i)$$

In fact [12], any prior density for a model that belongs to the exponential family can be approximated arbitrarily closely by a mixture of conjugate priors.
Example 2.14 Let's see the conjugate prior distributions for some models:
• Poisson model Po(n|μ): Writing

$$p(n|\mu) = \frac{e^{-\mu}\, \mu^{n}}{\Gamma(n+1)} = \frac{e^{-(\mu - n \log \mu)}}{\Gamma(n+1)}$$

it is clear that the Poisson distribution belongs to the exponential family and the conjugate prior density for the parameter μ is the Gamma density Ga(μ|τ1, τ2), and integrating μ:

$$p(n, \tau_1, \tau_2) = \pi(\tau_1, \tau_2) \int_0^{\infty} p(n|\mu)\, \pi(\mu|\tau_1, \tau_2)\, d\mu = \frac{\Gamma(n + \tau_2)}{\Gamma(n+1)\,\Gamma(\tau_2)}\; \frac{\tau_1^{\tau_2}}{(1 + \tau_1)^{n + \tau_2}}\; \pi(\tau_1, \tau_2) = p(n|\tau_1, \tau_2)\, \pi(\tau_1, \tau_2)$$
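As a quick numerical illustration of this conjugacy, the sketch below updates a Gamma prior with Poisson counts; the counts and hyperparameter values are illustrative, not taken from the text:

```python
# Conjugate (Gamma) update for a Poisson mean:
# prior      pi(mu|tau1, tau2)  proportional to  exp(-tau1*mu) * mu**(tau2 - 1)
# posterior after counts n_1..n_k:  tau1 -> tau1 + k,  tau2 -> tau2 + sum(n_i)
def gamma_update(counts, tau1, tau2):
    # closed-form update: rate gains the number of observations,
    # shape gains the total number of counted events
    return tau1 + len(counts), tau2 + sum(counts)

counts = [3, 5, 4, 6, 2]      # illustrative Poisson observations
tau1, tau2 = 1.0, 2.0         # hypothetical hyperparameter values
t1, t2 = gamma_update(counts, tau1, tau2)
post_mean = t2 / t1           # mean of a Gamma(rate=t1, shape=t2) density
print(t1, t2, post_mean)      # 6.0 22.0 3.666...
```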
it is clear that it belongs to the exponential family and the conjugate prior density for the parameter θ will be:

is the natural conjugate prior for this model. It is a degenerate distribution in the sense that
$$\pi(\boldsymbol{\theta}|\boldsymbol{\alpha}) = D(\boldsymbol{\alpha}) \left[\prod_{i=1}^{k-1} \theta_i^{\alpha_i - 1}\right] \left[1 - \sum_{i=1}^{k-1} \theta_i\right]^{\alpha_k - 1}$$

where:

$$0 < \theta_i < 1\,, \qquad \sum_{i=1}^{k-1} \theta_i < 1\,, \qquad \theta_k = 1 - \sum_{i=1}^{k-1} \theta_i$$

$$\alpha_i > 0\,, \quad \beta_i > 0\,, \qquad \text{and} \qquad \gamma_i = \begin{cases} \beta_i - \alpha_{i+1} - \beta_{i+1}\,; & i = 1, 2, \ldots, k-2 \\ \beta_{k-1} - 1\,; & i = k-1 \end{cases}$$

When βi = αi+1 + βi+1 it becomes the Dirichlet distribution. For this prior we have that

$$E[\theta_i] = \frac{\alpha_i}{\alpha_i + \beta_i}\, S_i \qquad \text{and} \qquad V[\theta_i, \theta_j] = E[\theta_j] \left[\frac{\alpha_i + \delta_{ij}}{\alpha_i + \beta_i + 1}\, T_i - E[\theta_i]\right]$$

where

$$S_i = \prod_{j=1}^{i-1} \frac{\beta_j}{\alpha_j + \beta_j} \qquad \text{and} \qquad T_i = \prod_{j=1}^{i-1} \frac{\beta_j + 1}{\alpha_j + \beta_j + 1}$$
with S1 = T1 = 1 and we can have control over the prior means and variances.
A pragmatic criterion is that of probability matching priors, for which the one-sided credible intervals derived from the posterior distribution coincide, to a certain level of accuracy, with those derived from the classical approach. This condition leads to a differential equation for the prior distribution [13, 14]. In the following lines we illustrate the rationale behind it for the simple one-parameter case, assuming that the needed regularity conditions are satisfied.
Consider then a random quantity X ∼ p(x|θ) and an iid sample x = {x1, x2, . . . , xn} with θ the parameter of interest. The classical approach for inferences is based on the likelihood

$$p(\mathbf{x}|\theta) = p(x_1, x_2, \ldots, x_n|\theta) = \prod_{i=1}^{n} p(x_i|\theta)$$

(3) Given the model X ∼ p(x|θ0), after the appropriate change of variables, get the distribution p(θm|θ0).
Let's start with the Bayesian one and expand the term on the right around θm. On the one hand:

$$\ln \frac{p(\mathbf{x}|\theta)}{p(\mathbf{x}|\theta_m)} = \frac{1}{2!} \left[\frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta^2}\right]_{\theta_m} (\theta - \theta_m)^2 + \frac{1}{3!} \left[\frac{\partial^3 \ln p(\mathbf{x}|\theta)}{\partial \theta^3}\right]_{\theta_m} (\theta - \theta_m)^3 + \cdots$$

Now,

$$-\frac{1}{n}\, \frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta^2} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 (-\ln p(x_i|\theta))}{\partial \theta^2} \;\xrightarrow{\;n \to \infty\;}\; E_X\!\left[\frac{\partial^2 (-\ln p(x|\theta))}{\partial \theta^2}\right] = I(\theta)$$

so we can substitute:

$$\left[\frac{\partial^2 \ln p(\mathbf{x}|\theta)}{\partial \theta^2}\right]_{\theta_m} = -n\, I(\theta_m) \qquad \text{and} \qquad \left[\frac{\partial^3 \ln p(\mathbf{x}|\theta)}{\partial \theta^3}\right]_{\theta_m} = -n \left[\frac{\partial I(\theta)}{\partial \theta}\right]_{\theta_m}$$

to get

$$p(\mathbf{x}|\theta) = e^{\ln p(\mathbf{x}|\theta)} \propto e^{-\frac{n I(\theta_m)}{2}(\theta - \theta_m)^2} \left[1 - \frac{n}{3!} \left[\frac{\partial I(\theta)}{\partial \theta}\right]_{\theta_m} (\theta - \theta_m)^3 + \cdots\right]$$

If we define the random quantity $T = \sqrt{n\, I(\theta_m)}\,(\theta - \theta_m)$ and consider that

$$\frac{\partial I(\theta)}{\partial \theta}\; I^{-3/2}(\theta) = -2\, \frac{\partial I^{-1/2}}{\partial \theta}$$

we finally get:

$$p(t|\mathbf{x}) = \frac{\exp(-t^2/2)}{\sqrt{2\pi}} \left\{1 + \frac{1}{\sqrt{n}} \left[\frac{I^{-1/2}(\theta)}{\pi(\theta)} \left[\frac{\partial \pi(\theta)}{\partial \theta}\right]_{\theta_m} t + \frac{1}{3} \left[\frac{\partial I^{-1/2}}{\partial \theta}\right]_{\theta_m} t^3\right]\right\} + O\!\left(\frac{1}{n}\right)$$

Defining

$$Z(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \qquad \text{and} \qquad P(x) = \int_{-\infty}^{x} Z(t)\, dt$$
From this probability distribution, we can infer what the classical approach will get. Since inferences will be drawn from p(x|θ0), we can take a sequence of proper priors πk(θ|θ0) for k = 1, 2, … that induce a sequence of distributions such that

$$\lim_{k \to \infty}\, \langle\, \pi_k(\theta|\theta_0),\; p(\mathbf{x}|\theta)\, \rangle = p(\mathbf{x}|\theta_0)$$

The priors

$$\pi_k(\theta|\theta_0) = \frac{k}{2}\, \mathbf{1}_{[\theta_0 - 1/k,\; \theta_0 + 1/k]}(\theta)\,; \qquad k = 1, 2, \ldots$$

converge to the Delta distribution δ_{θ0} and, from distributional derivatives, as k→∞,

$$\left\langle \frac{d}{d\theta}\, \pi_k(\theta|\theta_0),\; I^{-1/2}(\theta) \right\rangle = -\left\langle \pi_k(\theta|\theta_0),\; \frac{d}{d\theta}\, I^{-1/2}(\theta) \right\rangle \;\longrightarrow\; -\left[\frac{\partial I^{-1/2}(\theta)}{\partial \theta}\right]_{\theta_0}$$

But θ0 = θm + O(1/√n) so, for a sequence of priors that shrink to θ0 ≃ θm,

$$P(T \leq z|\mathbf{x}) = P(z) - \frac{Z(z)}{\sqrt{n}} \left[\frac{z^2 + 1}{3} \left[\frac{\partial I^{-1/2}}{\partial \theta}\right]_{\theta_m}\right] + O\!\left(\frac{1}{n}\right)$$

For the terms of order O(1/√n) in both expressions of P(T ≤ z|x) to be the same, we need that:

$$\frac{1}{\sqrt{I(\theta)}}\, \frac{1}{\pi(\theta)} \left[\frac{\partial \pi(\theta)}{\partial \theta}\right]_{\theta_m} = -\left[\frac{\partial I^{-1/2}}{\partial \theta}\right]_{\theta_m}$$

and therefore

$$\pi(\theta) \propto I^{1/2}(\theta)$$

that is, Jeffreys' prior. In the case of n-dimensional parameters, the reasoning goes along the same lines but the expressions and the development become much more lengthy and messy, so we refer to the literature.
The procedure for a first-order probability matching prior [15, 16] starts from the likelihood

$$p(x_1, x_2, \ldots, x_n|\theta_1, \theta_2, \ldots, \theta_p)$$

and then:

(1) Get the Fisher matrix I(θ1, θ2, …, θp) and its inverse I⁻¹(θ1, θ2, …, θp);
(2) Suppose we are interested in the parameter t = t(θ1, θ2, …, θp), a twice continuously differentiable function of the parameters. Define the column vector

$$\nabla_t = \left(\frac{\partial t}{\partial \theta_1}, \frac{\partial t}{\partial \theta_2}, \ldots, \frac{\partial t}{\partial \theta_p}\right)^{T}$$

and

$$\eta = \frac{I^{-1} \nabla_t}{(\nabla_t^T\, I^{-1}\, \nabla_t)^{1/2}} \qquad \text{so that} \qquad \eta^T I\, \eta = 1$$

(4) The probability matching prior for the parameter t = t(θ) in terms of θ1, θ2, …, θp is given by the equation:

$$\sum_{k=1}^{p} \frac{\partial}{\partial \theta_k} \left[\eta_k(\theta)\, \pi(\theta)\right] = 0$$
Example 2.15 Consider two independent random quantities X1 and X2 such that

$$P(n_1, n_2|\mu_1, \mu_2) = P(n_1|\mu_1)\, P(n_2|\mu_2) = e^{-(\mu_1 + \mu_2)}\, \frac{\mu_1^{n_1}\, \mu_2^{n_2}}{\Gamma(n_1 + 1)\, \Gamma(n_2 + 1)}$$

Therefore, for inferences on t = μ1/μ2:

$$I^{-1} \nabla_t = \begin{pmatrix} \mu_1\, \mu_2^{-1} \\ -\mu_1\, \mu_2^{-1} \end{pmatrix}\,, \qquad S = \nabla_t^T\, I^{-1}\, \nabla_t = \frac{\mu_1\, (\mu_1 + \mu_2)}{\mu_2^3}$$

$$\eta = \frac{I^{-1} \nabla_t}{(\nabla_t^T\, I^{-1}\, \nabla_t)^{1/2}} = \begin{pmatrix} (\mu_1 \mu_2)^{1/2}\, (\mu_1 + \mu_2)^{-1/2} \\ -(\mu_1 \mu_2)^{1/2}\, (\mu_1 + \mu_2)^{-1/2} \end{pmatrix}$$

so the equation

$$\sum_{k=1}^{2} \frac{\partial}{\partial \mu_k} \left[\eta_k(\boldsymbol{\mu})\, \pi(\boldsymbol{\mu})\right] = 0$$

becomes, with $f(\mu_1, \mu_2) = (\mu_1 \mu_2)^{1/2} (\mu_1 + \mu_2)^{-1/2}$,

$$\frac{\partial}{\partial \mu_1} \left[f(\mu_1, \mu_2)\, \pi(\mu_1, \mu_2)\right] = \frac{\partial}{\partial \mu_2} \left[f(\mu_1, \mu_2)\, \pi(\mu_1, \mu_2)\right]$$

and, integrating the nuisance parameter μ2 ∈ [0, ∞), we get the posterior density:

$$p(t|n_1, n_2) = N\, \frac{t^{n_1 - 1/2}}{(1 + t)^{n + 1}}$$
$$p(x|\alpha, \beta) = \frac{\alpha^{\beta}}{\Gamma(\beta)}\, e^{-\alpha x}\, x^{\beta - 1}\, \mathbf{1}_{(0,\infty)}(x)$$

Then

$$\frac{2}{\rho}\, \frac{\partial}{\partial \rho}\left[\pi\, (1 - \rho^2)\right] + \frac{\partial}{\partial \sigma_1}\left[\pi\, \sigma_1\right] + \frac{\partial}{\partial \sigma_2}\left[\pi\, \sigma_2\right] = 0$$

for which

$$\pi(\sigma_1, \sigma_2, \rho) = \frac{1}{\sigma_1 \sigma_2\, (1 - \rho^2)}$$

is a solution.
$$E[X] = \frac{b + a}{2} \qquad \text{and} \qquad V[X] = \frac{(b - a)^2}{12} + \frac{\pi^2}{12\, \sigma^2}$$

and that, for known σ ≫ 1, the probability matching prior for a and b tends to π_pm(a, b) ∼ (b − a)^{−1/2}. Show also that, in the same limit, π_pm(θ) ∼ θ^{−1/2} for (a, b) = (−θ, θ) and (a, b) = (0, θ). Since p(x|a, b, σ) → Un(x|a, b), discuss in this last case the difference with Example 2.4.
This is a nice but complicated implicit equation because, on the one hand, f_k(θ) depends on π(θ) through the posterior p(θ|z_k) and, on the other hand, the limit k→∞ is usually divergent (intuitively, the more precision we want for θ, the more information is needed, and to know the actual value from the experiment requires an infinite amount of information). This can be circumvented by regularizing the expression as

$$\pi(\theta) \propto \pi(\theta_0)\, \lim_{k \to \infty} \frac{f_k(\theta)}{f_k(\theta_0)}$$

with θ0 any interior point of the parameter space (we are used to that in particle physics!). Let's see some examples.
Example 2.18 Consider again the exponential model, for which $t = n^{-1} \sum_{i=1}^{n} x_i$ is sufficient for θ and distributed as

$$p(t|\theta) = \frac{(n\theta)^n}{\Gamma(n)}\; t^{n-1}\, \exp\{-n\theta t\}$$

$$\pi(\theta|t) = \frac{(nt)^{n+1}}{\Gamma(n+1)}\, \exp\{-n\theta t\}\; \theta^{n}$$
Example 2.19 Prior functions depend on the particular model we are treating. To learn about a parameter, we can use different experimental designs that respond to different models and, even though the parameter is the same, they may have different priors. For instance, we may be interested in the acceptance: the probability to accept an event under some conditions. For this, we can generate a sample of N events and see how many (x) pass the conditions. This experimental design corresponds to a Binomial distribution

$$p(x|N, \theta) = \binom{N}{x}\, \theta^{x}\, (1 - \theta)^{N - x}$$

with x = {0, 1, …, N}. For this model, the reference prior (also Jeffreys' and PM) is π(θ) = θ^{−1/2}(1 − θ)^{−1/2} and the posterior is θ ∼ Be(θ|x + 1/2, N − x + 1/2). Conversely, we can generate events until r are accepted and see how many (x) we have generated. This experimental design corresponds to a Negative Binomial distribution

$$p(x|r, \theta) = \binom{x - 1}{r - 1}\, \theta^{r}\, (1 - \theta)^{x - r}$$

where x = r, r + 1, … and r ≥ 1. For this model, the reference prior (Jeffreys' and PM too) is π(θ) = θ^{−1}(1 − θ)^{−1/2} and the posterior is θ ∼ Be(θ|r, x − r + 1/2).
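The effect of the design can be seen numerically. The sketch below evaluates the two posterior means for the same accepted/generated numbers (the values are illustrative): the stopping rule changes the reference prior and hence the posterior.

```python
# Same accepted/generated numbers, two stopping rules, two posteriors:
# binomial design:        theta | x ~ Be(x + 1/2, N - x + 1/2)
# negative binomial one:  theta | x ~ Be(r, x - r + 1/2)
def beta_mean(a, b):
    # mean of a Beta(a, b) density
    return a / (a + b)

N, x = 20, 7                          # fixed N = 20 trials, x = 7 accepted
a_bin, b_bin = x + 0.5, N - x + 0.5

r, x_nb = 7, 20                       # generate until r = 7 accepted; 20 generated
a_nb, b_nb = float(r), x_nb - r + 0.5

print(beta_mean(a_bin, b_bin))        # 7.5/21  ~ 0.357
print(beta_mean(a_nb, b_nb))          # 7/20.5  ~ 0.341
```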
$$\pi(\theta) \propto \pi(\theta_0)\, \lim_{k \to \infty} \frac{f_k(\theta)}{f_k(\theta_0)} \propto \theta^{-1/2}$$

(2) $X \sim Bi(x|N, \theta) = \binom{N}{x}\, \theta^{x} (1-\theta)^{N-x}$ and the experiment $e(k) \xrightarrow{\,iid\,} \{x_1, x_2, \ldots, x_k\}$. Take $\pi^{\star}(\theta) \propto \theta^{a-1}(1-\theta)^{b-1}\, \mathbf{1}_{(0,1)}(\theta)$ with a, b > 0 and show that

$$\pi(\theta) \propto \pi(\theta_0)\, \lim_{k \to \infty} \frac{f_k(\theta)}{f_k(\theta_0)} \propto \theta^{-1/2}\, (1 - \theta)^{-1/2}$$

(Hint: For (1) and (2) consider the Taylor expansion of log Γ(z, ·) around E[z] and the asymptotic behavior of the Polygamma Function $\psi^{(n)}(z) = a_n z^{-n} + a_{n+1} z^{-(n+1)} + \cdots$).
(3) X ∼ Un(x|0, θ) and the iid sample {x1, x2, …, xk}. For inferences on θ, show that f_k = θ^{−1} g(k) and, in consequence, the posterior is the Pareto Pa(θ|x_M, n), with x_M = max{x1, x2, …, xk} the sufficient statistic.

A very useful constructive theorem to obtain the reference prior is given in [18]. First, a permissible prior for the model p(x|θ) is defined as a strictly positive function π(θ) that renders a proper posterior; that is,

$$\forall \mathbf{x} \in \mathcal{X}: \qquad \int_{\Theta} p(\mathbf{x}|\theta)\, \pi(\theta)\, d\theta < \infty$$

the maximum amount of information the experiment can provide for the parameter. The constructive procedure for a one-dimensional parameter consists of:
(1) Take π*(θ) as a continuous, strictly positive function such that the corresponding posterior
For instance, in the case of two parameters and the ordered parameterization {θ, λ}:
(1) Get the conditional π(λ|θ) as the reference prior for λ keeping θ fixed;
(3) Get the reference prior π(θ) from the marginal model p(x|θ).
Then π(θ, λ) ∝ π(λ|θ) π(θ). This is fine if π(λ|θ) and π(θ) are proper functions, which is seldom the case. Otherwise, one has to define an appropriate sequence of compact sets observing, among other things, that this has to be done for the full parameter space and that usually the limits depend on the parameters. Suppose that we have the sequence $\Theta_i \times \Lambda_i \xrightarrow{\,i \to \infty\,} \Theta \times \Lambda$. Then:
(1) Obtain πi(λ|θ);
(4) The reference prior for the ordered parameterization {θ, λ} will be:

$$\pi(\theta, \lambda) = \lim_{i \to \infty} \frac{\pi_i(\lambda|\theta)\; \pi_i(\theta)}{\pi_i(\lambda_0|\theta_0)\; \pi_i(\theta_0)}$$
If

$$I_{22}(\theta, \lambda) = a_1^2(\theta)\, b_1^2(\lambda) \qquad \text{and} \qquad S_{11}(\theta, \lambda) = a_0^{-2}(\theta)\, b_0^{-2}(\lambda)$$

then [19] π(θ, λ) = π(λ|θ) π(θ) = a0(θ) b1(λ) is a permissible prior even if the conditional reference priors are not proper. The reference priors are usually probability matching priors.

$$p(\mathbf{x}|\boldsymbol{\theta}) \propto \theta_1^{x_1}\, \theta_2^{x_2} \cdots \theta_k^{x_k}\, (1 - \delta_k)^{x_{k+1}}\,; \qquad \delta_k = \sum_{j=1}^{k} \theta_j$$

and therefore

$$\pi(\theta_1, \theta_2, \ldots, \theta_k) \propto \prod_{i=1}^{k} \theta_i^{-1/2}\, (1 - \delta_i)^{-1/2}$$
Example 2.21 Consider again the case of two independent Poisson distributed random quantities X1 and X2 with joint density

$$P(n_1, n_2|\mu_1, \mu_2) = P(n_1|\mu_1)\, P(n_2|\mu_2) = e^{-(\mu_1 + \mu_2)}\, \frac{\mu_1^{n_1}\, \mu_2^{n_2}}{\Gamma(n_1 + 1)\, \Gamma(n_2 + 1)}$$

In the parameterization θ = μ1/μ2 and μ = μ2 this reads

$$P(n_1, n_2|\theta, \mu) = e^{-\mu(1 + \theta)}\, \frac{\theta^{n_1}\, \mu^{n}}{\Gamma(n_1 + 1)\, \Gamma(n_2 + 1)}$$

Therefore, in consequence:

$$\pi(\theta)\, f_1(\mu) \propto S_{11}^{-1/2} = \frac{\sqrt{\mu}}{\sqrt{\theta}\, (1 + \theta)} \qquad \text{and} \qquad \pi(\mu|\theta)\, f_2(\theta) \propto F_{22}^{1/2} = \frac{\sqrt{1 + \theta}}{\sqrt{\mu}}$$

Thus, we have for the ordered parameterization {θ, μ} the reference prior:

$$\pi(\theta, \mu) = \pi(\mu|\theta)\, \pi(\theta) \propto \frac{1}{\sqrt{\mu\theta}\, (1 + \theta)}$$

and the posterior

$$p(\theta|n_1, n_2) = N\, \frac{\theta^{n_1 - 1/2}}{(1 + \theta)^{n + 1}}$$
with I (x; a, b) the Incomplete Beta Function and the moments, when they exist;
In the parameterization θ = μ1/μ2 and λ = μ1 + μ2 we have

$$P(n_1, n_2|\theta, \lambda) = \frac{1}{\Gamma(n_1 + 1)\, \Gamma(n_2 + 1)}\; \frac{\theta^{n_1}}{(1 + \theta)^{n}}\; e^{-\lambda}\, \lambda^{n}$$

The domains are Θ = (0, ∞) and Λ = (0, ∞), independent. Thus, there is no need to specify the prior for λ since

$$p(\theta|n_1, n_2) \propto \pi(\theta)\, \frac{\theta^{n_1}}{(1 + \theta)^{n}} \int e^{-\lambda}\, \lambda^{n}\, \pi(\lambda)\, d\lambda \propto \pi(\theta)\, \frac{\theta^{n_1}}{(1 + \theta)^{n}}$$

Now

$$I(\theta) \propto \frac{1}{\theta\, (1 + \theta)^2} \;\longrightarrow\; \pi(\theta) = \frac{1}{\theta^{1/2}\, (1 + \theta)}$$

and, in consequence,

$$p(\theta|n_1, n_2) = N\, \frac{\theta^{n_1 - 1/2}}{(1 + \theta)^{n + 1}}$$
Problem 2.8 Show that the reference prior for the Pareto distribution Pa(x|θ, x0) (see Example 2.9) is π(θ, x0) ∝ (θ x0)^{−1} and that, for an iid sample x = {x1, …, xn}, if x_m = min{x_i}_{i=1}^{n} and $a = \sum_{i=1}^{n} \ln(x_i/x_m)$, the posterior is

$$p(\theta, x_0|\mathbf{x}) = \frac{n\, a^{n-1}}{x_m\, \Gamma(n-1)}\; e^{-a\theta}\, \theta^{n-1} \left(\frac{x_0}{x_m}\right)^{n\theta - 1} \mathbf{1}_{(0,\infty)}(\theta)\; \mathbf{1}_{(0,x_m)}(x_0)$$

with marginal densities

$$p(\theta|\mathbf{x}) = \frac{a^{n-1}}{\Gamma(n-1)}\; e^{-a\theta}\, \theta^{n-2}\, \mathbf{1}_{(0,\infty)}(\theta) \qquad \text{and} \qquad p(x_0|\mathbf{x}) = \frac{n(n-1)}{a}\; x_0^{-1} \left[1 + \frac{n}{a} \ln \frac{x_m}{x_0}\right]^{-n} \mathbf{1}_{(0,x_m)}(x_0)$$

and show that for large n (see Sect. 2.10.2) E[θ] ≃ n a^{−1} and E[x0] ≃ x_m.
Problem 2.9 Show that for the shifted Pareto distribution (Lomax distribution):

$$p(x|\theta, x_0) = \frac{\theta}{x_0} \left(\frac{x_0}{x + x_0}\right)^{\theta + 1} \mathbf{1}_{(0,\infty)}(x)\,; \qquad \theta, x_0 \in R^{+}$$

for the ordered parameterizations {β, α} and {α, β} respectively, being ζ(2) = π²/6 the Riemann Zeta Function and ψ(2) = 1 − γ the Digamma Function.
Consider J independent experiments with samples

$$\mathbf{x}_1 = \{x_{11}, x_{21}, \ldots, x_{n_1 1}\}\,, \;\; \ldots\,, \;\; \mathbf{x}_j = \{x_{1j}, x_{2j}, \ldots, x_{n_j j}\}\,, \;\; \ldots\,, \;\; \mathbf{x}_J = \{x_{1J}, x_{2J}, \ldots, x_{n_J J}\}$$

and models

$$p(\mathbf{x}_j|\boldsymbol{\theta}_j)\,; \qquad j = 1, 2, \ldots, J$$

Since the experiments are independent, we assume that the parameters of the sequence {θ1, θ2, …, θJ} are exchangeable and that, although different, they can be assumed to have a common origin since they respond to the same phenomena. Thus, we can set

$$p(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_J|\boldsymbol{\phi}) = \prod_{i=1}^{J} p(\boldsymbol{\theta}_i|\boldsymbol{\phi})$$
[Fig. 2.3: hierarchical structure, with each parameter θ1, θ2, …, θm generating the corresponding observation x1, x2, …, xm]
with φ the hyperparameters, for which we take a prior π(φ). Then we have the structure (Fig. 2.3)

$$p(\mathbf{x}_1, \ldots, \mathbf{x}_J, \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_J, \boldsymbol{\phi}) = \pi(\boldsymbol{\phi}) \prod_{i=1}^{J} p(\mathbf{x}_i|\boldsymbol{\theta}_i)\; \pi(\boldsymbol{\theta}_i|\boldsymbol{\phi})$$
Now, consider the model p(x, θ, φ). We may be interested in θ, in the hyperparameters φ, or in both. In general we shall need the conditional densities:

• $p(\boldsymbol{\phi}|\mathbf{x}) \propto p(\boldsymbol{\phi}) \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|\boldsymbol{\phi})\, d\boldsymbol{\theta}$
• $p(\boldsymbol{\theta}|\mathbf{x}, \boldsymbol{\phi}) = \dfrac{p(\boldsymbol{\theta}, \mathbf{x}, \boldsymbol{\phi})}{p(\mathbf{x}, \boldsymbol{\phi})}$ and
• $p(\boldsymbol{\theta}|\mathbf{x}) = \dfrac{p(\mathbf{x}|\boldsymbol{\theta})}{p(\mathbf{x})}\, p(\boldsymbol{\theta}) = \dfrac{p(\mathbf{x}|\boldsymbol{\theta})}{p(\mathbf{x})} \int p(\boldsymbol{\theta}|\boldsymbol{\phi})\, p(\boldsymbol{\phi})\, d\boldsymbol{\phi}$

and, since

$$p(\boldsymbol{\theta}|\mathbf{x}) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|\boldsymbol{\phi})\; \frac{p(\boldsymbol{\phi}|\mathbf{x})}{p(\mathbf{x}|\boldsymbol{\phi})}\; d\boldsymbol{\phi}$$

with wi ≥ 0 and Σ wi = 1, so that the combination is convex and we ensure that it is a proper density, or, extending this to a continuous mixture:

$$p(\boldsymbol{\theta}|\boldsymbol{\phi}) = \int w(\sigma)\, p(\boldsymbol{\theta}|\boldsymbol{\phi}, \sigma)\, d\sigma$$
So far we have discussed parameters with continuous support, but in some cases the support is either finite or countable. If the parameter of interest can take only a finite set of n possible values, the reasonable option for an uninformative prior is a Discrete Uniform Probability P(X = xi) = 1/n. In fact, it is shown in Sect. 4.2 that maximizing the expected information provided by the experiment with the normalization constraint (i.e. the probability distribution for which the prior knowledge is minimal) leads to P(X = xi) = 1/n, in accordance with the Principle of Insufficient Reason.

Even though finite discrete parameter spaces are either the most usual case we shall have to deal with or, at least, a sufficiently good approximation to the real situation, it may happen that a non-informative prior is not the most appropriate (see Example 2.22). On the other hand, if the parameter takes values on a countable set, the problem is more involved. A possible way out is to devise a hierarchical structure in which we assign the discrete parameter θ a prior π(θ|λ), with λ a set of continuous hyperparameters. Then, since

$$p(\mathbf{x}, \lambda) = \left[\sum_{\theta \in \Theta} p(\mathbf{x}|\theta)\, \pi(\theta|\lambda)\right] \pi(\lambda) = p(\mathbf{x}|\lambda)\, \pi(\lambda)$$

we get the prior π(λ) by any of the previous procedures for continuous parameters with the model p(x|λ) and obtain

$$\pi(\theta) \propto \int \pi(\theta|\lambda)\, \pi(\lambda)\, d\lambda$$
$$P(n_\gamma|n_0, Z) = e^{-n_0 Z^2}\; \frac{(n_0 Z^2)^{n_\gamma}}{\Gamma(n_\gamma + 1)}$$

$$P(Z = k|n_\gamma, n_0, n) = \frac{e^{-n_0 k^2}\; k^{2 n_\gamma}}{\sum_{k=1}^{n} e^{-n_0 k^2}\; k^{2 n_\gamma}}$$
Consider a parametric model p(x|θ) and the prior π0(θ). Now suppose we have some information on the parameters that we want to include in the prior. Typically we shall have, say, k constraints of the form

$$\int g_i(\theta)\, \pi(\theta)\, d\theta = a_i\,; \qquad i = 1, \ldots, k$$

Then, we have to find the prior π(θ) for which π0(θ) is the best approximation, in the Kullback–Leibler sense, including the constraints with the corresponding Lagrange multipliers λi; that is, the extremal of

$$F = \int \pi(\theta)\, \log \frac{\pi(\theta)}{\pi_0(\theta)}\, d\theta + \sum_{i=1}^{k} \lambda_i \left[\int g_i(\theta)\, \pi(\theta)\, d\theta - a_i\right]$$

Again, it is left as an exercise to show from the Calculus of Variations the well-known solution

$$\pi(\theta) \propto \pi_0(\theta)\, \exp\left\{\sum_{i=1}^{k} \lambda_i\, g_i(\theta)\right\} \qquad \text{with } \lambda_i \text{ such that } \int g_i(\theta)\, \pi(\theta)\, d\theta = a_i$$
2.9 Constraints on Parameters and Priors
Quite frequently we are forced to include constraints on the support of the parameters: some are non-negative (masses, energies, momenta, life-times, …), some are bounded in (0, 1) (β = v/c, efficiencies, acceptances, …), and so on. At least from a formal point of view, accounting for constraints on the support is a trivial problem. Consider the model p(x|θ) with θ ∈ Θ0 and a reference prior π0(θ). Then, our inferences on θ shall be based on the posterior

$$p(\theta|\mathbf{x}) = \frac{p(\mathbf{x}|\theta)\, \pi_0(\theta)}{\int_{\Theta_0} p(\mathbf{x}|\theta)\, \pi_0(\theta)\, d\theta}$$

If the parameter is constrained to Θ ⊂ Θ0, it is enough to take the prior

$$\pi(\theta) = \frac{\pi_0(\theta)}{\int_{\Theta} \pi_0(\theta)\, d\theta}\; \mathbf{1}_{\Theta}(\theta)$$

that is, the same initial expression but normalized in the domain of interest Θ.
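As a minimal numerical sketch of this renormalization (the unconstrained density below is an illustrative Gaussian, not one of the models above):

```python
import math

# Renormalizing a density on a restricted support Theta:
#   pi(theta) = pi0(theta) * 1_Theta(theta) / (integral of pi0 over Theta)
def pi0(theta, mean=-1.0):
    # hypothetical unconstrained density: Gaussian centered at -1
    return math.exp(-0.5 * (theta - mean) ** 2) / math.sqrt(2.0 * math.pi)

lo, hi, n = 0.0, 10.0, 100000     # support [0, 10] approximating [0, inf)
h = (hi - lo) / n
grid = [lo + (i + 0.5) * h for i in range(n)]
norm = sum(pi0(t) for t in grid) * h         # midpoint-rule normalization
truncated = [pi0(t) / norm for t in grid]    # renormalized truncated density
print(sum(truncated) * h)                    # ~1.0: proper on the new support
```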
Even though all the information we have on the relevant parameters is contained in the posterior density, it is interesting, as we saw in Chap. 1, to make explicit some particular values that characterize the probability distribution. This certainly entails a considerable and unnecessary reduction of the available information but, in the end, quoting Lord Kelvin, "… when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind". In statistics, specifying a particular value of the parameter is termed Point Estimation, and it can be formulated in the framework of Decision Theory.
In general, Decision Theory studies how to choose the optimal action among several possible alternatives based on what has been experimentally observed. Given a particular problem, we have to specify the set Θ of the possible "states of nature", the set X of the possible experimental outcomes, and the set A of the possible actions we can take. Imagine, for instance, that we test an individual suspected of having some disease for which the medical treatment has potentially dangerous side effects. Then, we have:

$$\Theta = \{\text{healthy}, \text{sick}\}$$

Or, for instance, consider a detector that provides, within some accuracy, the momentum (p) and the velocity (β) of charged particles. If we want to assign a hypothesis for the mass of the particle, then Θ = R⁺ is the set of all possible states of nature (all possible values of the mass), X the set of experimental observations (the momentum and the velocity), and A the set of all possible actions we can take (assign one or another value for the mass). In this case, we shall take a decision based on the probability density p(m|p, β).

Obviously, unless we are in a state of absolute certainty, we cannot take an action without potential losses. Based on the observed experimental outcomes, we may for instance assign the particle a mass m1 when the true state of nature is m2 ≠ m1, or consider that the individual is healthy when he is actually sick. Thus, the first element of Decision Theory is the Loss Function l(θ, a): a non-negative function, defined for all θ ∈ Θ and all possible actions a ∈ A, that quantifies the loss associated with taking the action a (deciding for a) when the state of nature is θ.
Obviously, we do not have a perfect knowledge of the state of nature; what we know comes from the observed data x and is contained in the posterior distribution p(θ|x). Therefore, we define the Risk Function (the risk associated with taking the action a, or deciding for a, when we have observed the data x) as the expected value of the Loss Function:

$$R(a|\mathbf{x}) = E_{\boldsymbol{\theta}}[l(a, \boldsymbol{\theta})] = \int_{\Theta} l(a, \boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathbf{x})\, d\boldsymbol{\theta}$$
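A minimal sketch of the minimum-risk prescription: discretize the posterior and the action space, evaluate R(a|x) for each action, and take the argmin. The posterior and the loss below are illustrative; as discussed later in this section, the quadratic loss singles out the posterior mean.

```python
# Bayesian action: minimize the risk R(a|x) = sum_theta l(a, theta) p(theta|x)
thetas = [0.1, 0.2, 0.3, 0.4]
posterior = [0.1, 0.4, 0.3, 0.2]          # illustrative p(theta|x), sums to 1

def loss(a, theta):
    return (a - theta) ** 2               # quadratic loss function

actions = [i / 100 for i in range(101)]   # discretized action space
risk = {a: sum(loss(a, t) * p for t, p in zip(thetas, posterior))
        for a in actions}
best = min(risk, key=risk.get)            # action with minimum risk
post_mean = sum(t * p for t, p in zip(thetas, posterior))
print(best, post_mean)                    # 0.26 0.26: argmin is the mean
```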
Sensibly enough, the Bayesian decision criterion consists of taking the action a(x) (Bayesian action) that minimizes the risk R(a|x) (minimum risk); that is, that minimizes the expected loss under the posterior density function.8 Then, we shall encounter two kinds of problems:

8 The problems studied by Decision Theory can be addressed from the point of view of Game Theory. In this case, instead of Loss Functions one works with Utility Functions u(θ, a) that, in essence, are nothing else but u(θ, a) = K − l(θ, a) ≥ 0; it is just a matter of personal optimism to
2.10 Decision Problems
Consider the case where we have to choose between two exclusive and exhaustive hypotheses H1 and H2 (= H1ᶜ). From the data sample and our prior beliefs we have the posterior probabilities

$$P(H_i|\text{data}) = \frac{P(\text{data}|H_i)\, P(H_i)}{P(\text{data})}\,; \qquad i = 1, 2$$
(Footnote 8 continued)
work with "utilities" or "losses". J. Von Neumann and O. Morgenstern introduced in 1944 the idea of expected utility and the criterion of taking as optimal action that which maximizes the expected utility.
$$R(a_i|\text{data}) = \sum_{j=1}^{2} l(a_i|H_j)\; P(H_j|\text{data})$$

and, according to the minimum Bayesian risk criterion, we shall choose the hypothesis H1 (action a1) if

$$R(a_1|\text{data}) < R(a_2|\text{data}) \;\longrightarrow\; P(H_1|\text{data})\,(l_{11} - l_{21}) < P(H_2|\text{data})\,(l_{22} - l_{12})$$

Since we have chosen l11 = l22 = 0 in this particular case, we shall take action a1 (decide for hypothesis H1) if:

is called the Bayes Factor Bij and changes our prior beliefs on the two alternative hypotheses based on the evidence we have from the data; that is, it quantifies how strongly the data favor one model over the other. Thus, we shall decide in favor of hypothesis Hi against Hj (i, j = 1, 2) if

If we consider the same loss whichever wrong hypothesis we decide upon, we have l12 = l21 (Zero-One Loss Function). In general, we shall be interested in testing:
(1) Two simple hypotheses, H1 versus H2, for which the models Mi = {X ∼ pi(x|θi)}; i = 1, 2 are fully specified, including the values of the parameters (that is, Θi = {θi}). In this case, the Bayes Factor will be given by the ratio of likelihoods

$$B_{12} = \frac{p_1(\mathbf{x}|\theta_1)}{p_2(\mathbf{x}|\theta_2)}\,, \qquad \text{usually} \quad \frac{p(\mathbf{x}|\theta_1)}{p(\mathbf{x}|\theta_2)}$$

The classical Bayes Factor is the ratio of the likelihoods for the two competing models evaluated at their respective maxima.

(2) A simple hypothesis (H1) versus a composite hypothesis H2, for which the parameters of the model M2 = {X ∼ p2(x|θ2)} have support on Θ2. Then we have to average the likelihood under H2 and

$$B_{12} = \frac{p_1(\mathbf{x}|\theta_1)}{\int_{\Theta_2} p_2(\mathbf{x}|\theta)\, \pi_2(\theta)\, d\theta}$$

(3) Two composite hypotheses, in which the models M1 and M2 have parameters that are not specified by the hypotheses, so

$$B_{12} = \frac{\int_{\Theta_1} p_1(\mathbf{x}|\theta_1)\, \pi_1(\theta_1)\, d\theta_1}{\int_{\Theta_2} p_2(\mathbf{x}|\theta_2)\, \pi_2(\theta_2)\, d\theta_2}$$

and, since P(H1|data) + P(H2|data) = 1, we can express the posterior probability P(H1|data) as

$$P(H_1|\text{data}) = \frac{B_{12}\; P(H_1)}{P(H_2) + B_{12}\; P(H_1)}$$

Usually, we consider equal prior probabilities for the two hypotheses (P(H1) = P(H2) = 1/2), but be aware that in some cases this may not be a realistic assumption.
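The conversion from Bayes Factor to posterior probability is trivial to evaluate; a small helper with illustrative values of B12:

```python
# P(H1|data) from the Bayes Factor B12 and the prior probabilities,
# implementing P(H1|data) = B12 P(H1) / (P(H2) + B12 P(H1)).
def posterior_H1(B12, pH1=0.5):
    pH2 = 1.0 - pH1
    return B12 * pH1 / (pH2 + B12 * pH1)

print(posterior_H1(1.0))          # 0.5: the data do not discriminate
print(posterior_H1(9.0))          # 0.9: evidence 9:1 in favour of H1
print(posterior_H1(9.0, 0.1))     # 0.5: a strong prior against H1 offsets B12
```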
Bayes Factors are independent of the prior beliefs on the hypotheses (P(Hi)) but, when we have composite hypotheses, we average the likelihood with a prior, and if it is an improper function the Bayes Factors are not well defined. If we have prior knowledge about the parameters, we may take informative priors that are proper, but this is not always the case. One possible way out is to consider sufficiently general proper priors (conjugate priors, for instance) so that the Bayes Factors are well defined, and then study the sensitivity for different reasonable values of the hyperparameters. A more practical and interesting approach to avoid the indeterminacy due to improper priors [21, 22] is to take a subset of the observed sample to render a proper posterior (with, for instance, reference priors) and use that as a proper prior density to compute the Bayes Factor with the remaining sample. Thus, if the sample x = {x1, …, xn} consists of iid observations, we may consider x = {x1, x2} and, with the reference prior π(θ), obtain the proper posterior π(θ|x1).

The remaining subsample (x2) is then used to compute the partial Bayes Factor9:

$$B_{12}(\mathbf{x}_2|\mathbf{x}_1) = \frac{\int_{\Theta_1} p_1(\mathbf{x}_2|\theta_1)\, \pi_1(\theta_1|\mathbf{x}_1)\, d\theta_1}{\int_{\Theta_2} p_2(\mathbf{x}_2|\theta_2)\, \pi_2(\theta_2|\mathbf{x}_1)\, d\theta_2} = \frac{B_F(\mathbf{x}_1, \mathbf{x}_2)}{B_F(\mathbf{x}_1)}$$
for the hypothesis testing. Berger and Pericchi propose to use the minimal amount of data needed to specify a proper prior (usually max{dim(θi)}) so as to leave most of the sample for the model testing, and to dilute the dependence on a particular choice of the training sample by evaluating the Bayes Factors with all possible minimal samples and choosing the truncated mean, the geometric mean, or the median (less sensitive to outliers) as a characteristic value (see Example 2.24). A thorough analysis of Bayes Factors, with their caveats and advantages, is given in [23].

A different alternative to quantify the evidence in favour of a particular model, which avoids the need to specify priors and is easy to evaluate, is the Schwarz criterion [24] (or "Bayes Information Criterion (BIC)"). The rationale is the following. Consider a sample x = {x1, …, xn} and two alternative hypotheses for the models Mi = {pi(x|θi); dim(θi) = di}; i = 1, 2. As we see in Sect. 4.5, under the appropriate conditions we can approximate the likelihood as
$$l(\theta|\mathbf{x}) \simeq l(\hat{\theta}|\mathbf{x}) \exp\left\{-\frac{1}{2} \sum_{k=1}^{d} \sum_{m=1}^{d} (\theta_k - \hat{\theta}_k)\; n\, I_{km}(\hat{\theta})\; (\theta_m - \hat{\theta}_m)\right\}$$

so, taking a uniform prior for the parameters θ, reasonable in the region where the likelihood is dominant, we can approximate

$$J(\mathbf{x}) = \int p(\mathbf{x}|\theta)\, \pi(\theta)\, d\theta \simeq p(\mathbf{x}|\hat{\theta})\; (2\pi/n)^{d/2}\; |\det[I(\hat{\theta})]|^{-1/2}$$

and, ignoring terms that are bounded as n→∞, define the BIC(Mi) for the model Mi as

$$BIC(M_i) = 2 \ln p_i(\mathbf{x}|\hat{\theta}_i) - d_i \ln n$$

so:
9 Essentially, the ratio of the predictive inferences for x2 after x1 has been observed.
$$B_{12} \simeq \frac{p_1(\mathbf{x}|\hat{\theta}_1)}{p_2(\mathbf{x}|\hat{\theta}_2)}\; n^{(d_2 - d_1)/2} \;\longrightarrow\; \Delta_{12} = 2 \ln B_{12} \simeq 2 \ln \frac{p_1(\mathbf{x}|\hat{\theta}_1)}{p_2(\mathbf{x}|\hat{\theta}_2)} - (d_1 - d_2) \ln n$$

Taking (l12 = l21; l11 = l22 = 0), the Bayesian decision criterion in favor of the hypothesis H1 is:

$$B_{12} > \frac{P(H_2)}{P(H_1)} \;\longrightarrow\; \ln B_{12} > \ln \frac{P(H_2)}{P(H_1)}$$
such that, if m0 < mc, we decide in favor of H1, and for H2 otherwise. In the case that σ1 = σ2 and P(H1) = P(H2), then mc = (m1 + m2)/2. This, however, may be a quite unrealistic assumption, for if P(H1) > P(H2) it may be more likely that the event is of type 1 even when B12 < 1.
$$\pi(\mu|x_i) = \frac{1}{\sqrt{2\pi}}\, \exp\{-(\mu - x_i)^2/2\}$$

which we use as a prior for the rest of the sample x' = {x1, …, x_{i−1}, x_{i+1}, …, xn}. Then

$$\frac{P(H_1|\mathbf{x}', x_i)}{P(H_2|\mathbf{x}', x_i)} = B_{12}(i)\; \frac{P(H_1)}{P(H_2)} \qquad \text{where} \qquad B_{12}(i) = \frac{p(\mathbf{x}'|0)}{\int_{-\infty}^{\infty} p(\mathbf{x}'|\mu)\, \pi(\mu|x_i)\, d\mu} = n^{1/2}\, \exp\{-(n\,\overline{x}^{\,2} - x_i^2)/2\}$$

and $\overline{x} = n^{-1} \sum_{k=1}^{n} x_k$. To avoid the effect that a particular choice of the minimal sample ({xi}) may have, this is evaluated for all possible minimal samples and the median (or the geometric mean) of all the B12(i) is taken. Since P(H1|x) + P(H2|x) = 1, if we assign equal prior probabilities to the two hypotheses (P(H1) = P(H2) = 1/2) we have that

$$P(H_1|\mathbf{x}) = \frac{B_{12}}{1 + B_{12}} = \left[1 + n^{-1/2} \exp\{(n\,\overline{x}^{\,2} - \mathrm{med}\{x_i^2\})/2\}\right]^{-1}$$

is the posterior probability that quantifies the evidence in favor of the hypothesis H1.
It is left as an exercise to compare the Bayes Factor obtained from the geometric mean
with what you would get if you were to take a proper prior π(μ|σ) = N (μ|0, σ).
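The median-based procedure above can be sketched directly; the data below are simulated under H1 (μ = 0) for illustration:

```python
import math, random

# Partial Bayes Factors B12(i) = sqrt(n) exp{-(n xbar^2 - x_i^2)/2} for all
# minimal samples {x_i}, summarized by their median.
random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(50)]   # simulated under H1: mu = 0
n = len(x)
xbar = sum(x) / n

B = sorted(math.sqrt(n) * math.exp(-(n * xbar**2 - xi**2) / 2.0) for xi in x)
B_med = 0.5 * (B[n // 2 - 1] + B[n // 2])         # median over minimal samples
pH1 = B_med / (1.0 + B_med)                       # equal prior probabilities
print(B_med, pH1)
```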
Problem 2.11 Suppose we have n observations (independent, under the same experimental conditions, …) of energies or decay times of particles above a certain known threshold, and we want to test the evidence of an exponential fall against a power law. Consider then a sample x = {x1, …, xn} of observations with supp(X) = (1, ∞) and the two models: Exponential and Pareto, with unknown parameters θ and α respectively. Show that for the minimal sample {xi} and reference priors, the Bayes Factor B12(i) is given by

$$B_{12}(i) = \left(\frac{\overline{x}_g \ln \overline{x}_g}{\overline{x} - 1}\right)^{n}\; \frac{x_i - 1}{x_i \ln x_i} = \frac{p_1(\mathbf{x}|\hat{\theta})}{p_2(\mathbf{x}|\hat{\alpha})}\; \frac{x_i - 1}{x_i \ln x_i}$$
where ($\overline{x}$, $\overline{x}_g$) are the arithmetic and geometric sample means and ($\hat{\theta}$, $\hat{\alpha}$) the values that maximize the likelihoods, and therefore

$$\mathrm{med}\{B_{12}(i)\}_{i=1}^{n} = \left(\frac{\overline{x}_g \ln \overline{x}_g}{\overline{x} - 1}\right)^{n}\; \mathrm{med}\left\{\frac{x_i - 1}{x_i \ln x_i}\right\}_{i=1}^{n}$$
We may consider proper Beta prior densities Be(θ|a, b). In a specific pharmacological trial, a sample of n1 = 52 individuals were administered a placebo and n2 = 61 were treated with an a priori beneficial drug. After the trial, positive effects were observed in x1 = 22 out of the 52 and in x2 = 41 out of the 61 individuals. It is left as an exercise to obtain the posterior probability P(H2|data) with Jeffreys' (a = b = 1/2) and Uniform (a = b = 1) priors, and to determine the BIC difference Δ12.
When we have to face the problem of characterizing the posterior density by a single number, the most usual Loss Functions are:

$$l(\theta, a) = (\theta - a)^2$$

$$l(\boldsymbol{\theta}, \mathbf{a}) = (\mathbf{a} - \boldsymbol{\theta})^{T}\, H\, (\mathbf{a} - \boldsymbol{\theta})$$

so, if H⁻¹ exists, then a = E[θ]. Thus, we have that the Bayesian estimate under a quadratic loss function is the mean of p(θ|x) (… if it exists!).
In particular, if c1 = c2 then P(θ ≤ a) = 1/2 and we shall have the median of the
distribution p(θ|x). In this case, the Loss Function can be expressed more simply as
l(θ, a) = |θ − a|.
It is clear that, in the limit ε → 0, the Bayesian estimator for the Zero-One Loss Function will be the mode of p(θ|x), if it exists.
As explained in Chap. 1, the mode, the median and the mean can be very different if the distribution is not symmetric. Which one should we take then? Quadratic losses, for which large deviations from the true value are penalized quadratically, are the most common option but, even if for unimodal symmetric distributions the three statistics coincide, it may be misleading to take this value as a characteristic number for the information we have about the parameters, or it may even be nonsense. In the hypothetical case that the posterior is essentially the same as the likelihood (which is the case for a sufficiently smooth prior), the Zero-One Loss points to the classical estimate of the Maximum Likelihood Method. Other considerations of interest in Classical Statistics (like bias, consistency, minimum variance, …) have no special relevance in Bayesian inference.
Problem 2.13 (The Uniform Distribution) Show that for the posterior density (see Example 2.4)

$$p(\theta|x_M, n) = n\, \frac{x_M^{n}}{\theta^{n+1}}\; \mathbf{1}_{[x_M, \infty)}(\theta)$$

the point estimates under quadratic, linear and 0–1 loss functions are

$$\theta_{QL} = \frac{n}{n-1}\, x_M\,, \qquad \theta_{LL} = x_M\, 2^{1/n} \qquad \text{and} \qquad \theta_{01L} = x_M$$
Obviously, for a given probability content, credible regions are not unique, and a sound criterion is to specify the one with the smallest possible volume. A region C of the parametric space is called a Highest Probability (HPD) region with probability content 1 − α if:
(1) P(θ ∈ C) = 1 − α; C ⊆ Θ;
(2) p(θ1|·) ≥ p(θ2|·) for all θ1 ∈ C and θ2 ∉ C except, at most, for a subset of Θ with zero probability measure.

It is left as an exercise to show that condition (2) implies that the HPD region so defined is of minimum volume, so both definitions are equivalent. Further properties that are easy to demonstrate are:
(1) If p(θ|·) is not uniform, the HPD region with probability content 1 − α is unique;
(2) If p(θ1|·) = p(θ2|·), then θ1 and θ2 are both either included in or excluded from the HPD region;
(3) If p(θ1|·) ≠ p(θ2|·), there is an HPD region for some value of 1 − α that contains one value of θ and not the other;
(4) C = {θ ∈ Θ | p(θ|x) ≥ k_α} where k_α is the largest constant for which P(θ ∈ C) ≥ 1 − α;
(5) If φ = f(θ) is a one-to-one transformation, then
(a) any region with probability content 1 − α for θ will have probability content 1 − α for φ but…
(b) an HPD region for θ will not, in general, be an HPD region for φ unless the transformation is linear.

In general, the evaluation of credible regions is a bit messy. A simple way through is to do a Monte Carlo sampling of the posterior density and use the 4th property.
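A sketch of this Monte Carlo recipe for a one-dimensional posterior: keep the fraction 1 − α of draws with the highest density. The standard Gaussian used here is illustrative; its 68.3% HPD interval is approximately (−1, 1).

```python
import math, random

# HPD region from a Monte Carlo sample of the posterior (property 4):
# retain the 1-alpha fraction of draws with the highest posterior density.
random.seed(3)
dens = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
sample = [random.gauss(0.0, 1.0) for _ in range(200000)]

alpha = 1.0 - 0.683
ranked = sorted(sample, key=dens, reverse=True)   # highest density first
hpd = ranked[: int((1.0 - alpha) * len(sample))]
print(min(hpd), max(hpd))   # roughly (-1, 1) for this symmetric, unimodal case
```

For multimodal posteriors the retained draws may fall in disconnected clusters, which is exactly the union-of-regions situation mentioned below.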
For a one-dimensional parameter, the condition that the HPD region with probability content 1 − α has the minimum length allows us to write a relation that may be useful to obtain those regions in an easier manner. Let [θ1, θ2] be an interval such that

$$\int_{\theta_1}^{\theta_2} p(\theta|\cdot)\, d\theta = 1 - \alpha$$

For this to be an HPD region, we have to find the extremal of the function

$$\phi(\theta_1, \theta_2, \lambda) = (\theta_2 - \theta_1) + \lambda \left[\int_{\theta_1}^{\theta_2} p(\theta|\cdot)\, d\theta - (1 - \alpha)\right]$$

Thus, from the first two conditions we have that p(θ1|·) = p(θ2|·) and, from the third, we know that θ1 ≠ θ2. In the special case that the distribution is unimodal and symmetric, the only possible solution is θ2 = 2E[θ] − θ1.
The HPD regions are useful to summarize the information on the parameters contained
in the posterior density p(θ|x), but it should be clear that there is no justification
to reject a particular value θ0 just because it is not included in the HPD region (or,
in fact, in any other credible region) and that in some circumstances (distributions
with more than one mode, for instance) the HPD region may be a union of disconnected regions.
2.12 Bayesian (B) Versus Classical (F) Philosophy

The Bayesian philosophy aims at the right questions in a very intuitive and, at least
conceptually, simple manner. However, the “classical” (frequentist) approach to
statistics, which has been very useful in scientific reasoning over the last century, is at
present more widespread in the Particle Physics community, and most of the stirred-up
controversies originate from misinterpretations. It is worth taking a look, for
instance, at [2]. Let’s see how a simple problem is attacked by the two schools. “We”
are B; “they” are F.
Suppose we want to estimate the life-time of a particle. We both “assume” an
exponential model X ∼ Ex(x|1/τ) and do an experiment e(n) that provides an iid
sample x = {x1, x2, . . . , xn}. In this case there is a sufficient statistic t = (n, x̄), with
x̄ the sample mean, so let’s define the random quantity

$$\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \;\sim\; p(\overline{x}\,|n, \tau) = \frac{1}{\Gamma(n)} \left( \frac{n}{\tau} \right)^{n} e^{-n\overline{x}/\tau}\; \overline{x}^{\,n-1}\, 1_{(0,\infty)}(\overline{x})$$

so that, taking $\hat{\tau} = \overline{X}$ as estimator of the life-time,

$$E[\hat{\tau}] = \tau \qquad \text{and} \qquad V[\hat{\tau}] = \frac{\tau^{2}}{n}$$
and F will claim that “if you repeat the experiment” many times under the same
conditions, you will get a sequence of estimators {τ̂1, τ̂2, . . .} that eventually will
cluster around the life-time τ. Fine, but we shall point out that, first, although desirable,
we usually do not repeat the experiments (and under the same conditions even more
rarely), so we have just one observed sample (x → x̄ = τ̂) from e(n). Second, “if you
repeat the experiment you will get” is a free and unnecessary hypothesis. You do not
know what you will get, among other things, because the model we are considering
may not be the way nature behaves. Besides that, it is quite unpleasant that inferences
on the life-time depend upon what you think you will get if you do what you know
you are not going to do. And third, this is in any case a nice sampling property
of the estimator τ̂, but eventually we are interested in τ, so: what can we say about it?
For us, the answer is clear. Since τ is a scale parameter, we write the posterior density
function

$$p(\tau|n, \overline{x}) = \frac{(n\overline{x})^{n}}{\Gamma(n)}\; e^{-n\overline{x}/\tau}\; \tau^{-(n+1)}\, 1_{(0,\infty)}(\tau)$$

for the degree of belief we have on the parameter and easily get, for instance:

$$E[\tau^{k}] = \frac{\Gamma(n-k)}{\Gamma(n)}\, (n\overline{x})^{k} \;\longrightarrow\; E[\tau] = \frac{n}{n-1}\, \overline{x}\,; \qquad V[\tau] = \frac{n^{2}}{(n-1)^{2}(n-2)}\, \overline{x}^{2}\,; \;\ldots$$
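These closed-form moments can be cross-checked by sampling the posterior: since p(τ|n, x̄) corresponds to τ = n x̄/G with G ∼ Gamma(n), a few lines suffice (the values n = 50 and x̄ = 2 below are illustrative, not from the text):

```python
import numpy as np

# Posterior of the life-time: p(tau | n, xbar) is an inverse-gamma density,
# so tau = n*xbar / G with G ~ Gamma(n, 1).
n, xbar = 50, 2.0                       # illustrative experiment
rng = np.random.default_rng(1)
tau = n * xbar / rng.gamma(n, size=500_000)

E_exact = n * xbar / (n - 1)                        # E[tau] = n xbar / (n-1)
V_exact = n**2 * xbar**2 / ((n - 1)**2 * (n - 2))   # V[tau]
# tau.mean() and tau.var() reproduce the closed-form moments
```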
for a particular fixed value of θ. Thus, for each possible value of θ he has one interval
[x1 = f1(θ; β), x2 = f2(θ; β)] ⊂ X, and the sequence of those intervals gives a
band in the θ × X region of the real plane. As for the credible regions, these
intervals are not uniquely determined, so one usually adds the condition:

$$(1)\;\; \int_{-\infty}^{x_1} p(x|\theta)\, dx = \int_{x_2}^{\infty} p(x|\theta)\, dx = \frac{1-\beta}{2} \qquad \text{or} \qquad (2)\;\; \int_{x_1}^{\theta} p(x|\theta)\, dx = \int_{\theta}^{x_2} p(x|\theta)\, dx = \frac{\beta}{2}$$

or, less often, (3) chooses the interval with the smallest size. Now, for an invertible
mapping xi −→ fi(θ) one can write
and get the random interval [f2−1(X), f1−1(X)] that contains the given value of θ with
probability β. Thus, for each possible value that X may take he will get an interval
[f2−1(X), f1−1(X)] on the θ axis, and a particular experimental observation {x} will
single out one of them. This is the Confidence Interval that the frequentist analyst will
quote. Let’s continue with the life-time example and take, for illustration, n = 50
and β = 0.68. The bands [x1 = f1(τ), x2 = f2(τ)] in the (τ, X̄) plane, in this case
obtained with the third prescription, are shown in Fig. 2.4(1). They are essentially
straight lines, so P[X̄ ∈ (0.847τ, 1.126τ)] = 0.68. This is a correct statement, but
it doesn’t say anything about τ, so he inverts it and gets 0.89 X̄ < τ < 1.18 X̄, in
such a way that an observed value {x̄} singles out an interval on the vertical τ axis.
We, Bayesians, will argue that this does not mean that τ has a 0.68 chance to lie in this
interval, and the frequentist will certainly agree on that. In fact, this is not an admissible
question for him because in the classical philosophy τ is a number, unknown but a
fixed number. If he repeats the experiment τ will not change; it is the interval that
will be different because x̄ will change. They are random intervals, and what the
68% means is just that if he repeats the experiment a large number N of times, he
will end up with N intervals of which ∼68% will contain the true value τ, whatever
it is. But the experiment is done only once, so: does the interval derived from this
observation contain τ or not? We don’t know; we have no idea whether it contains τ
or not, nor how far away the unknown true value is.

Fig. 2.4 (1) 68% confidence level bands in the (τ, X̄) plane. (2) 68% confidence intervals obtained
for 100 repetitions of the experiment

Figure 2.4(2) shows the 68% confidence intervals obtained after 100 repetitions of
the experiment for τ = 2; 67 of them did contain the true value. But when the
experiment is done once, he picks
up one of those intervals and has a 68% chance that the one chosen contains the true
value. We, B, shall proceed in a different manner. After integration of the posterior
density we get the HPD interval P[τ ∈ (0.85 x̄, 1.13 x̄)] = 0.68; almost the same,
but with a direct interpretation in terms of what we are interested in. Thus, both have
an absolutely different philosophy:

F: “Given a particular value of the parameters of interest, how likely is the observed
data?”
B: “Having observed this data, what can we say about the parameters of interest?”

… and the probability of the causes, as Poincaré said, is the most important from the
point of view of scientific applications.
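The frequentist coverage statement above is easy to verify numerically: repeating e(n) many times and building the interval (0.89 x̄, 1.18 x̄) each time, close to 68% of the intervals contain the true τ. This is a sketch with arbitrary seed and repetition count, using the factors quoted in the text:

```python
import numpy as np

# Coverage check for the life-time example: each repetition of e(n) yields
# the interval (0.89*xbar, 1.18*xbar); count how often it contains tau.
rng = np.random.default_rng(2)
tau_true, n, reps = 2.0, 50, 20_000
xbar = rng.exponential(tau_true, size=(reps, n)).mean(axis=1)
covered = (0.89 * xbar < tau_true) & (tau_true < 1.18 * xbar)
coverage = covered.mean()   # close to 0.68 by construction of the band
```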
In many circumstances we are also interested in one-sided intervals. That is, for
instance, the case when the data is consistent with the hypothesis H: {θ = θ0} and
we want to give an upper bound on θ so that P(θ ∈ (−∞, θβ]) = β. The frequentist
rationale is the same: obtain the interval (−∞, x2] ⊂ X such that

$$P(X \le x_2) = \int_{-\infty}^{x_2} p(x|\theta)\, dx = \beta$$

where x2 = f2(θ); in this case without ambiguity. For the random interval
(−∞, f2−1(X)), F has that

$$P\left( \theta < f_2^{-1}(X) \right) = 1 - P\left( \theta \ge f_2^{-1}(X) \right) = 1 - \beta$$
so, for a probability content α (say 0.95), one should set β = 1 − α (= 0.05). Now,
consider, for instance, the example of the anisotropy in cosmic rays discussed in
Sect. 2.13.3. For a dipole moment (details are unimportant now) we have a statistic

$$X \sim p(x|\theta, 1/2) = \frac{e^{-\theta^2/2}}{\sqrt{2\pi}\,\theta}\; e^{-x/2}\, \sinh(\theta \sqrt{x})\; 1_{(0,\infty)}(x)$$
where the parameter θ is the dipole coefficient multiplied by a factor that is irrelevant
for the example. It is argued in Sect. 2.13.3 that the reasonable prior for this model
is π(θ) = constant, so we have the posterior

$$p(\theta|x, 1/2) = \sqrt{\frac{2}{\pi x}}\; \frac{1}{M(1/2,\, 3/2,\, x/2)}\; e^{-\theta^2/2}\, \theta^{-1} \sinh(\theta \sqrt{x})\; 1_{(0,\infty)}(\theta)$$

with M(a, b, z) the Kummer confluent hypergeometric function.
The 0.95 upper bound derived from this posterior is shown as a function of x in
Fig. 2.5 under “Bayes” (red line). Neyman’s construction is also straightforward. From

$$\int_0^{x_2} p(x|\theta, 1/2)\, dx = 1 - \alpha = 0.05$$
(essentially a χ2 probability for ν = 3), F will get the upper bound shown in the
same figure under “Neyman” (blue broken line). As you can see, they get closer as x
grows but, first, there is no solution for x ≤ xc = 0.352. In fact, E[X] = θ2 + 3, so if
the dipole moment is δ = 0 (θ = 0), E[X] = 3 and observed values below xc will be
an unlikely downward fluctuation (assuming, of course, that the model is correct) but
certainly a possible experimental outcome. In fact, you can see that for values of x
less than 2, even though there is a solution, Neyman’s upper bound is underestimated.
To avoid this “little” problem, a different prescription has to be adopted.
The most interesting solution is the one proposed by Feldman and Cousins [25],
in which the region X0 ⊂ X that is considered for the specified probability content
is determined by the ratio of probability densities. Thus, for a given value θ0, the
interval X0 is such that
Fig. 2.5 95% upper bounds on the parameter θ following the Bayesian approach (red), the Neyman
approach (broken blue) and Feldman and Cousins (solid blue line)
Fig. 2.6 (1) Dependence of θm with x. (2) Probability density ratio R(x|θ) for θ = 2
$$\int_{X_0} p(x|\theta_0)\, dx = \beta \qquad \text{with} \qquad R(x|\theta_0) = \frac{p(x|\theta_0)}{p(x|\theta_b)} > k_\beta\;; \quad \forall x \in X_0$$

where θb is the best estimate of θ for a given {x}; usually the one that maximizes
the likelihood (θm). In our case, it is given by

$$\theta_m = \begin{cases} 0 & \text{if}\;\; x \le 3 \\[4pt] \text{the root of}\;\; \theta_m + \theta_m^{-1} - \sqrt{x}\, \coth(\theta_m \sqrt{x}) = 0 & \text{if}\;\; x > 3 \end{cases}$$
and the dependence on x is shown in Fig. 2.6(1) (θm ≃ √x for x >>). As an illustration,
the function R(x|θ0) is shown in Fig. 2.6(2) for the particular value θ0 = 2. Following this
procedure,12 the 0.95 probability content band is shown in Fig. 2.5 under “Feldman-
Cousins” (blue line). Note that for large values of x, the confidence region becomes
an interval. It is true that if we observe a large value of X, the hypothesis H0: {δ = 0}
will not be favoured by the data and a different analysis will be more relevant although,
by a simple modification of the ordering rule, we can still get an upper bound if desired
or use the standard Neyman procedure.
The Feldman and Cousins prescription allows one to consider constraints on the
parameters in a simpler way than Neyman’s procedure and, as opposed to it, will always
provide a region with the specified probability content. However, on the one hand,
these are frequentist intervals and have to be interpreted as such. On the other hand,
for discrete random quantities with image in {x1, . . . , xk, . . .} it may not be possible
to satisfy exactly the probability content equation since for the Distribution Function
one has that F(xk+1) = F(xk) + P(X = xk+1). And last, it is not straightforward
to deal with nuisance parameters. Therefore, the best advice: “Be Bayesian!”.
2.13.1 Regression
We shall assume that the precisions σxi and σyi are known and that there is a
functional relation μy = f(μx; θ) with unknown parameters θ. Then, in terms of the
new parameters of interest:

$$p(\,\mathbf{y}|\cdot) \propto \exp\left\{ -\frac{1}{2} \sum_{i=1}^{n} \left[ \frac{(y_i - f(\mu_{x_i}; \boldsymbol{\theta}))^2}{\sigma_{y_i}^2} + \frac{(x_i - \mu_{x_i})^2}{\sigma_{x_i}^2} \right] \right\}$$

$$a_0 = \frac{t_2 t_4 - t_3 t_5}{t_1 t_2 - t_3^2}\,, \qquad b_0 = \frac{t_1 t_5 - t_3 t_4}{t_1 t_2 - t_3^2}$$

$$\sigma_a^2 = \frac{t_2}{t_1 t_2 - t_3^2}\,, \qquad \sigma_b^2 = \frac{t_1}{t_1 t_2 - t_3^2}\,, \qquad \rho = -\frac{t_3}{\sqrt{t_1 t_2}}$$
Both (a, b) are position parameters so we shall take a uniform prior and, in consequence,

$$p(a, b|\cdot) = \frac{1}{2\pi \sigma_a \sigma_b \sqrt{1-\rho^2}}\; \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(a-a_0)^2}{\sigma_a^2} + \frac{(b-b_0)^2}{\sigma_b^2} - 2\rho\, \frac{(a-a_0)}{\sigma_a}\, \frac{(b-b_0)}{\sigma_b} \right] \right\}$$
In general, the expressions one gets for non-linear regression problems are complicated
and setting up priors is a non-trivial task, but fairly vague priors that are easy to deal
with are usually a reasonable choice. In this case, for instance, one may consider uniform
priors or Normal densities N(·|0, σ >>) for both parameters (a, b) and sample the
proper posterior with a Monte Carlo algorithm (Gibbs sampling will be appropriate).
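As a sketch of that Monte Carlo step: since each full conditional of the bivariate Normal posterior p(a, b|·) is again Normal, Gibbs sampling is exact here. The numbers a0, b0, σa, σb and ρ below are purely illustrative:

```python
import numpy as np

# Gibbs sampling sketch for the bivariate Normal posterior p(a, b | .).
a0, b0, sa, sb, rho = 1.0, 0.5, 0.2, 0.1, -0.3   # illustrative values
rng = np.random.default_rng(3)

a, b = a0, b0
draws = []
for _ in range(50_000):
    # a | b  ~  N(a0 + rho*(sa/sb)*(b - b0), sa^2*(1 - rho^2))
    a = rng.normal(a0 + rho * sa / sb * (b - b0), sa * np.sqrt(1 - rho**2))
    # b | a  ~  N(b0 + rho*(sb/sa)*(a - a0), sb^2*(1 - rho^2))
    b = rng.normal(b0 + rho * sb / sa * (a - a0), sb * np.sqrt(1 - rho**2))
    draws.append((a, b))
draws = np.array(draws)[1000:]   # drop burn-in
```

The sample means and the sample correlation of the draws reproduce (a0, b0, ρ), which is the basic sanity check one would run before trusting the sampler on a harder, non-linear problem.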
The same reasoning applies if we want to consider other models or more involved
relations with several explanatory variables like $\theta_i = \sum_{j=1}^{k} \alpha_j x_{ij}^{b_j}$. In counting
experiments, for example, yi ∈ N so we may be interested in a Poisson model Po(yi|μi)
where μi is parameterized in a simple log-linear form ln(μi) = α1 + α2 xi (so μi > 0
for whatever α1, α2 ∈ R). Suppose, for instance, that we have the sample {(yi, xi)}ni=1.
Then:

$$p(\,\mathbf{y}|\alpha_1, \alpha_2, \mathbf{x}) \propto \prod_{i=1}^{n} e^{-\mu_i}\, \frac{\mu_i^{y_i}}{y_i!} = \exp\left\{ \alpha_1 s_1 + \alpha_2 s_2 - \sum_{i=1}^{n} e^{\alpha_1} e^{\alpha_2 x_i} \right\}$$

where $s_1 = \sum_{i=1}^{n} y_i$ and $s_2 = \sum_{i=1}^{n} y_i x_i$. In this case, the Normal distribution
N(αi|ai, σi) with σi >> is a reasonably smooth and easy to handle proper prior
density for both parameters. Thus, we get the posterior conditional densities

$$p(\alpha_i|\alpha_j, \mathbf{y}, \mathbf{x}) \propto \exp\left\{ -\frac{\alpha_i^2}{2\sigma_i^2} + \alpha_i \left( \frac{a_i}{\sigma_i^2} + s_i \right) - \sum_{i=1}^{n} e^{\alpha_1} e^{\alpha_2 x_i} \right\}\,; \qquad i = 1, 2$$
that are perfectly suited for the Gibbs sampling to be discussed in Sect. 4.1 of Chap. 3.
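As a hedged sketch of sampling this posterior (using a joint random-walk Metropolis step rather than the component-wise Gibbs updates discussed in the text), one may fit simulated counts; the true values (1.0, 0.5), the wide N(0, 10²) priors and the proposal scale are assumptions for illustration:

```python
import numpy as np

# Metropolis sketch for the Poisson log-linear model ln(mu_i) = a1 + a2*x_i.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 200)
y = rng.poisson(np.exp(1.0 + 0.5 * x))      # simulated counts, truth (1.0, 0.5)

def log_post(a):
    mu = np.exp(a[0] + a[1] * x)
    # Poisson log-likelihood (up to constants) + wide Normal prior N(0, 10^2)
    return np.sum(y * np.log(mu) - mu) - np.sum(a**2) / (2 * 10.0**2)

a = np.array([np.log(y.mean()), 0.0])       # rough starting point
lp = log_post(a)
chain = []
for _ in range(20_000):
    prop = a + rng.normal(0.0, 0.05, size=2)    # random-walk proposal
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:    # Metropolis accept/reject
        a, lp = prop, lp_prop
    chain.append(a)
chain = np.array(chain)[5000:]              # drop burn-in
```

The posterior means of the chain recover the simulated (α1, α2) within the posterior spread; with real data one would instead alternate draws from the two conditional densities above.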
Example 2.25 (Proton Flux in Primary Cosmic Rays) For energies between ∼20 and
∼200 GeV, the flux of protons of the primary cosmic radiation is reasonably well
described by a power law φ(r) = c r^γ where r is the rigidity13 and γ = d ln φ/ds,
with s = ln r, is the spectral index. At lower energies, this dependence is significantly
modified by the geomagnetic cut-off and the solar wind but at higher energies, where
these effects are negligible, the observations are not consistent with a single power law
(Fig. 2.7(1)). One may characterize this behaviour with a simple phenomenological
model where the spectral index is no longer constant but has a dependence γ(s) =
α + β tanh[a(s − s0 )] such that lims→−∞ γ(s) = γ1 (r →0) and lims→∞ γ(s) = γ2
(r → + ∞). After integration, the flux can be expressed in terms of 5 parameters
θ = {φ0 , γ1 , δ = γ2 − γ1 , r0 , σ} as:
$$\phi(r; \boldsymbol{\theta}) = \phi_0\, r^{\gamma_1} \left[ 1 + \left( \frac{r}{r_0} \right)^{\sigma} \right]^{\delta/\sigma}$$
For this example, I have used the data above 45 GeV published by the AMS experi-
ment14 and considered only the quoted statistical errors (see Fig. 2.7(1)). Last, for a
better description of the flux the previous expression has been modified to account
for the effect of the solar wind with the force-field approximation in consistency with
[AMS15]. This is just a technical detail, irrelevant for the purpose of the example.

Fig. 2.7 (1) Observed flux multiplied by r^2.7 in m−2 sr−1 sec−1 GV1.7 as given in [AMS15];
(2) Posterior density of the parameter γ1 (arbitrary vertical scale); (3) Posterior density of the
parameter δ = γ2 − γ1 (arbitrary vertical scale); (4) Projection of the posterior density p(γ1, δ)
Then, assuming a Normal model for the observations, we can write the posterior
density

$$p(\boldsymbol{\theta}|\text{data}) \propto \pi(\boldsymbol{\theta})\; \exp\left\{ -\sum_{i=1}^{n} \frac{(\phi_i - \phi(r_i; \boldsymbol{\theta}))^2}{2\sigma_i^2} \right\}$$
I have taken Normal priors with large variances (σi >>) for the parameters γ1
and δ and restricted the support to R+ for {φ0 , r0 , σ}. The posterior densities for
the parameters γ1 and δ are shown in Fig. 2.7(2, 3) together with the projection
(Fig. 2.7(4)) that gives an idea of the correlation between them. For a visual inspection,
the phenomenological form of the flux is shown in Fig. 2.7(1) (blue line), superimposed
on the data, when the parameters are set to their expected posterior values.
Suppose that we observe a particular region of the sky during a time t and denote
by λ the rate at which events from this region are produced. We take a Poisson
model to describe the number of produced events: k ∼ Po(k|λt). Now, denote by ε
the probability to detect one event (detection area, efficiency of the detector, …).
The number of observed events n from the region after an exposure time t and
detection probability ε will follow

$$n \sim \sum_{k=n}^{\infty} Bi(n|k, \epsilon)\; Po(k|\lambda t) = Po(n|\epsilon \lambda t)$$
The approach to the problem will be the same for other counting processes like,
for instance, events collected from a detector for a given integrated luminosity. We
suspect that the events observed in a particular region Ωo of the sky are background
events together with those from an emitting source. To determine the significance
of the potential source we analyze a nearby region, Ωb, to infer about the expected
background. If after a time tb we observe nb events from this region with detection
probability εb then, defining β = εb tb, we have that

$$n_b \sim Po(n_b|\lambda_b \beta) = e^{-\beta \lambda_b}\; \frac{(\beta \lambda_b)^{n_b}}{\Gamma(n_b + 1)}$$

while for the on-source region, with α = εs ts,

$$n_o \sim \sum_{n_1=0}^{n_o} Po(n_1|\lambda_s \alpha)\; Po(n_o - n_1|\lambda_b \alpha) = Po(n_o|(\lambda_s + \lambda_b)\alpha)$$
Now, we can do several things. We can assume, for instance, that the overall rate
from the region Ωo is λ, write no ∼ Po(no|αλ) and study the fraction λ/λb of the
rates from the information provided by the observations in the two different regions.
Then, reparameterizing the model in terms of θ = λ/λb and φ = λb we have

$$p(n_o, n_b|\theta, \phi) = Po(n_o|\gamma\beta\,\theta\phi)\; Po(n_b|\beta\phi)$$

where γ = α/β = (εs ts)/(εb tb). For the ordering {θ, φ} we have that the Fisher
matrix and its inverse are

$$I(\theta, \phi) = \begin{pmatrix} \dfrac{\gamma\beta\phi}{\theta} & \gamma\beta \\[8pt] \gamma\beta & \dfrac{\beta(1+\gamma\theta)}{\phi} \end{pmatrix} \qquad \text{and} \qquad I^{-1}(\theta, \phi) = \begin{pmatrix} \dfrac{\theta(1+\gamma\theta)}{\gamma\beta\phi} & -\dfrac{\theta}{\beta} \\[8pt] -\dfrac{\theta}{\beta} & \dfrac{\phi}{\beta} \end{pmatrix}$$
Then

$$\pi(\theta, \phi) = \pi(\phi|\theta)\, \pi(\theta) \propto \frac{\phi^{-1/2}}{\sqrt{\theta (1+\gamma\theta)}}$$

$$p(\theta|n_o, n_b, \gamma) = \frac{\gamma^{n_o+1/2}}{B(n_o+1/2,\, n_b+1/2)}\; \frac{\theta^{n_o-1/2}}{(1+\gamma\theta)^{n_o+n_b+1}}$$

From this:

$$E[\theta^m] = \frac{1}{\gamma^m}\, \frac{\Gamma(n_o+1/2+m)\,\Gamma(n_b+1/2-m)}{\Gamma(n_o+1/2)\,\Gamma(n_b+1/2)} \;\longrightarrow\; E[\theta] = \frac{1}{\gamma}\, \frac{n_o+1/2}{n_b-1/2}$$

and

$$P(\theta \le \theta_0) = \int_0^{\theta_0} p(\theta|\cdot)\, d\theta = 1 - IB\left( n_b+1/2,\; n_o+1/2\,;\; (1+\gamma\theta_0)^{-1} \right)$$
with IB(x, y; z) the Incomplete Beta Function. Had we interest in θ = λs/λb, the
corresponding reference prior would be

$$\pi(\theta, \phi) \propto \frac{\phi^{-1/2}}{\sqrt{(1+\theta)(\delta+\theta)}} \qquad \text{with} \qquad \delta = \frac{1+\gamma}{\gamma}$$

and therefore:
$$p(\lambda_s|\cdot) \propto \pi(\lambda_s) \int_0^{\infty} p(n_o|\alpha(\lambda_s+\lambda_b))\; p(\lambda_b|n_b, \beta)\; d\lambda_b \;\propto\; \pi(\lambda_s)\, e^{-\alpha\lambda_s} \sum_{k=0}^{n_o} a_k\, \lambda_s^{\,n_o-k}$$

where

$$a_k = \binom{n_o}{k}\, \frac{\Gamma(k+n_b+1/2)}{(\alpha+\beta)^{k}}$$
A reasonable choice for the prior will be a conjugate prior π(λs) = Ga(λs|a, b)
that simplifies the calculations and provides enough freedom to analyze the effect of
different shapes on the inferences. The same reasoning is valid if the knowledge of
λb is represented by a different p(λb|·) from, say, a Monte Carlo simulation. Usual
distributions in this case are the Gamma and the Normal with non-negative support.

Fig. 2.8 90% Confidence Belt derived with Feldman and Cousins (filled band) and the Bayesian
HPD region (red lines) for a background parameter μb = 3
Last, it is clear that if the rate of background events is known with high accuracy
then, with μi = αλi and π(μs) ∝ (μs + μb)−1/2, we have

$$p(\mu_s|\cdot) = \frac{e^{-(\mu_s+\mu_b)}\; (\mu_s+\mu_b)^{x-1/2}}{\Gamma(x+1/2,\; \mu_b)}\; 1_{(0,\infty)}(\mu_s)$$

As an example, we show in Fig. 2.8 the 90% HPD region obtained from the previous
expression (red lines) as a function of x for μb = 3 (conditions as given in the example
of [25]) and the Confidence Belt derived with the Feldman and Cousins approach
(filled band). In this case, μs,m = max{0, x − μb} and therefore, for a given μs:

$$\sum_{x_1}^{x_2} Po(x|\mu_s+\mu_b) = \beta \qquad \text{with} \qquad R(x|\mu_s) = e^{(\mu_{s,m}-\mu_s)} \left( \frac{\mu_s+\mu_b}{\mu_{s,m}+\mu_b} \right)^{x} > k_\beta$$
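The construction of one Feldman-Cousins acceptance interval from this ordering rule can be sketched in a few lines; the grid cut-off `xmax` and the example value μs = 2 are assumptions for illustration:

```python
import numpy as np
from math import exp, factorial

def po(x, mu):
    """Poisson probability Po(x | mu)."""
    return exp(-mu) * mu**x / factorial(x)

def fc_interval(mu_s, mu_b=3.0, beta=0.90, xmax=50):
    """Feldman-Cousins acceptance interval for Po(x | mu_s + mu_b), mu_b known."""
    xs = np.arange(xmax)
    p = np.array([po(x, mu_s + mu_b) for x in xs])
    # best estimate mu_s,m = max(0, x - mu_b); ranking ratio R = p / p_best
    best = np.array([po(x, max(0.0, x - mu_b) + mu_b) for x in xs])
    order = np.argsort(-(p / best))          # add x values by decreasing R
    acc, tot = [], 0.0
    for i in order:
        acc.append(xs[i])
        tot += p[i]
        if tot >= beta:                      # stop once the content is reached
            break
    return min(acc), max(acc)

lo, hi = fc_interval(2.0)   # 90% acceptance interval for mu_s = 2, mu_b = 3
```

Repeating this over a grid of μs values and inverting, as in Fig. 2.8, yields the confidence belt; note the discreteness makes the actual content at least β, not exactly β.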
Problem 2.14 In the search for a new particle, assume that the number of observed
events follows a Poisson distribution with μb = 0.7, known with enough precision
from extensive Monte Carlo simulations. Consider the hypotheses H0: {μs = 0} and
H1: {μs ≠ 0}. It is left as an exercise to obtain the Bayes Factor BF01 with the proper
prior π(μs|μb) = μb(μs + μb)−2 proposed in [26], P(H1|n) and the BIC difference
$$f(\theta, \phi) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} a_{lm}\, Y_{lm}(\theta, \phi) \qquad \text{where} \qquad a_{lm} = \int f(\theta, \phi)\, Y_{lm}(\theta, \phi)\, d\mu\,;$$

alm ∈ R and dμ = sin θ dθ dφ. The convention adopted for the spherical harmonic
functions is such that (orthonormal basis):

$$\int Y_{lm}(\theta, \phi)\, Y_{l'm'}(\theta, \phi)\, d\mu = \delta_{ll'}\, \delta_{mm'} \qquad \text{and} \qquad \int Y_{lm}(\theta, \phi)\, d\mu = \sqrt{4\pi}\; \delta_{l0}$$
$$p(\theta, \phi) = c_{00}\, Y_{00}(\theta, \phi) + \sum_{l=1}^{\infty} \sum_{m=-l}^{l} c_{lm}\, Y_{lm}(\theta, \phi)$$

The normalization imposes that c00 = 1/√(4π), so we can write

$$p(\theta, \phi|\mathbf{a}) = \frac{1}{4\pi}\, \bigl( 1 + a_{lm}\, Y_{lm}(\theta, \phi) \bigr)$$

where l ≥ 1,

$$a_{lm} = 4\pi\, c_{lm} = 4\pi \int p(\theta, \phi)\, Y_{lm}(\theta, \phi)\, d\mu = 4\pi\, E_{p;\mu}[Y_{lm}(\theta, \phi)]$$

and summation over repeated indices is understood. Obviously, for any (θ, φ) ∈ Ω we
have that p(θ, φ|a) ≥ 0, so the set of parameters a is constrained to a compact
support.
Even though we shall study the general case, we are particularly interested in
the expansion up to l = 1 (dipole terms) so, to simplify the notation, we redefine
the indices (l, m) = {(1, −1), (1, 0), (1, 1)} as i = {1, 2, 3} and, accordingly, the
coefficients a = (a1−1 , a10 , a11 ) as a = (a1 , a2 , a3 ). Thus:
$$p(\theta, \phi|\mathbf{a}) = \frac{1}{4\pi}\, (1 + a_1 Y_1 + a_2 Y_2 + a_3 Y_3)$$
In this case, the condition p(θ, φ|a) ≥ 0 implies that the coefficients are bounded by
the sphere a1² + a2² + a3² ≤ 4π/3 and therefore the coefficient of anisotropy

$$\delta \;\overset{def.}{=}\; \left( \frac{3}{4\pi} \right)^{1/2} \left( a_1^2 + a_2^2 + a_3^2 \right)^{1/2} \;\le\; 1$$
There are no sufficient statistics for this model, but the Central Limit Theorem
applies and, given the large amount of data, the experimental observations can be
cast in the statistic a = (a1, a2, a3) such that15

$$p(\mathbf{a}|\boldsymbol{\mu}) = \prod_{i=1}^{3} N(a_i|\mu_i, \sigma_i^2)$$

with V(ai) = 4π/n known and with negligible correlations (ρij ≃ 0).
Consider then a k-dimensional random quantity Z = {Z1, . . . , Zk} and the
distribution

$$p(\mathbf{z}|\boldsymbol{\mu}, \boldsymbol{\sigma}) = \prod_{j=1}^{k} N(z_j|\mu_j, \sigma_j^2)$$

The interest is centered on the Euclidean norm ||μ||, with dim{μ} = k, and its square;
in particular, in

$$\delta = \left( \frac{3}{4\pi} \right)^{1/2} ||\boldsymbol{\mu}|| \;\; \text{for}\; k = 3 \qquad \text{and} \qquad C_k = \frac{||\boldsymbol{\mu}||^2}{k}$$
$$\begin{aligned}
\rho_1 &= \rho \cos\phi_1 \\
\rho_2 &= \rho \sin\phi_1 \cos\phi_2 \\
\rho_3 &= \rho \sin\phi_1 \sin\phi_2 \cos\phi_3 \\
&\;\;\vdots \\
\rho_{k-1} &= \rho \sin\phi_1 \sin\phi_2 \cdots \sin\phi_{k-2} \cos\phi_{k-1} \\
\rho_k &= \rho \sin\phi_1 \sin\phi_2 \cdots \sin\phi_{k-2} \sin\phi_{k-1}
\end{aligned}$$
The Fisher matrix is the Riemann metric tensor, so the square root of its determinant
is the k-dimensional volume element:

$$dV^{k} = \rho^{k-1}\, d\rho\; dS^{k-1}$$

with

$$dS^{k-1} = \sin^{k-2}\phi_1\, \sin^{k-3}\phi_2 \cdots \sin\phi_{k-2}\; d\phi_1\, d\phi_2 \cdots d\phi_{k-1} = \prod_{j=1}^{k-1} \sin^{(k-1)-j}\phi_j\; d\phi_j$$

15 Essentially, $a_{lm} = \frac{4\pi}{n} \sum_{i=1}^{n} Y_{lm}(\theta_i, \phi_i)$ for a sample of size n.
the k − 1 dimensional spherical surface element, with φk−1 ∈ [0, 2π) and φ1,...,k−2 ∈
[0, π]. The interest we have is in the parameter ρ, so we should consider the ordered
parameterization {ρ; φ} with φ = {φ1, φ2, . . . , φk−1} nuisance parameters. Since ρ
and the φi are independent for all i, we shall consider the surface element (that is, the
determinant of the submatrix obtained for the angular part) as the (proper) prior density
for the nuisance parameters. As we have commented in Chap. 1, this is just the Lebesgue
measure on the k − 1 dimensional sphere (the invariant Haar measure under rotations)
and therefore the natural choice for the prior; in other words, a uniform distribution
on the k − 1 dimensional sphere. Thus, we start by integrating the angular parameters.
Under the assumption that the variances σi² are all the same and considering that

$$\int_0^{\pi} e^{\pm\beta \cos\theta}\, \sin^{2\nu}\theta\, d\theta = \sqrt{\pi}\, \left( \frac{2}{\beta} \right)^{\nu} \Gamma\left( \nu + \frac{1}{2} \right) I_\nu(\beta) \qquad \text{for}\;\; \operatorname{Re}(\nu) > -\frac{1}{2}$$

one can check that the resulting density p(φm|φ) is properly normalized, where

$$\nu = k/2 - 1\,; \qquad \phi = ||\boldsymbol{\mu}||^2\,; \qquad \phi_m = ||\mathbf{a}||^2\,; \qquad b = \frac{1}{2\sigma^2} = \frac{n}{8\pi}$$

From the Mellin-type moment

$$M(s) = E[\phi_m^{\,s-1}] = \frac{b\, e^{-b\phi}\, \Gamma(s+\nu)}{\Gamma(\nu+1)\; b^{s}}\; M(s+\nu,\, \nu+1,\, b\phi)\,, \qquad s \in (-\nu, \infty)$$
with M(a, b, z) the Kummer function, one can easily get the moments
($E[\phi_m^{\,n}] = M(n+1)$); in particular.
Now that we have the model p(φm|φ), let’s go for the prior function π(φ) or π(δ).
One may already guess what we shall get. The first element of the Fisher matrix
(diagonal) corresponds to the norm and is constant, so it would not be surprising to get
the Lebesgue measure for the norm, dλ(δ) = π(δ) dδ = c dδ. As a second argument,
for large sample sizes (n >>) we have b >>, so φm ∼ N(φm|φ, σ² = 2φ/b) and, to
first order, Jeffreys’ prior is π(φ) ∼ φ−1/2. From the reference analysis, if we take
for instance

$$\pi^{\star}(\phi) = \phi^{(\nu-1)/2}$$

where

$$I(\phi, b) = \int_0^{\infty} p(\phi_m|\phi)\, \log \frac{I_\nu(2b\sqrt{\phi\,\phi_m})}{I_{\nu/2}(b\,\phi_m/2)}\; d\phi_m$$

and φ0 is any interior point of Θ(φ) = [0, ∞), from the asymptotic behaviour of the
Bessel functions one gets

$$\pi(\phi) \propto \phi^{-1/2}$$
and therefore π(δ) = c. It is left as an exercise to get the same result with other
priors like π*(φ) = c or π*(φ) = φ−1/2.

For this problem, it is easier to derive the prior from the reference analysis.
Nevertheless, the Fisher information can be expressed as

$$F(\phi; \nu) = b^2 \left[\, -1 + \frac{b\, e^{-b\phi}}{\phi^{\nu/2+1}} \int_0^{\infty} e^{-bz}\, z^{\nu/2+1}\; \frac{I_{\nu+1}^{2}(2b\sqrt{z\phi})}{I_\nu(2b\sqrt{z\phi})}\; dz \right]$$

and, for large b (large sample size), F(φ; ν) → φ−1 regardless of the number of degrees
of freedom ν. Thus, Jeffreys’ prior is consistent with the result from the reference
analysis. In fact, from the asymptotic behaviour of the Bessel functions in the
corresponding expressions of the pdf, one can already see that F(φ; ν) ∼ φ−1. A cross
check from a numerical integration is shown in Fig. 2.9 where, for k = 3, 5, 7
(ν = 1/2, 3/2, 5/2), F(φ; ν) is depicted as a function of φ and compared to 1/φ (in
black) for a sufficiently large
value of b. Therefore we shall use π(φ) = φ−1/2 for the cases of interest (dipole,
quadrupole, … any-pole).

The posterior densities are:

• For φ = ||μ||²:

$$p(\phi|\phi_m, \nu) = N\, e^{-b\phi}\, \phi^{-(\nu+1)/2}\, I_\nu\!\left( 2b\sqrt{\phi_m\,\phi} \right) \qquad \text{with} \qquad N = \frac{\Gamma(\nu+1)\; b^{1/2-\nu}\; \phi_m^{-\nu/2}}{\sqrt{\pi}\; M(1/2,\, \nu+1,\, b\phi_m)}$$
In the particular case that k = 3 (dipole; ν = 1/2), we have for δ = √(3/4π) ρ that
the first two moments are:

$$E[\delta] = \frac{\operatorname{erf}(z)}{a\, \delta_m\, M(1, 3/2, -z^2)} \qquad\qquad E[\delta^2] = \frac{1}{a\, M(1, 3/2, -z^2)}$$

with $z = 2\delta_m \sqrt{b\pi/3}$ and, when δm → 0, we get

$$E[\delta] = \sqrt{\frac{2}{\pi a}} \simeq \frac{1.38}{\sqrt{n}} \qquad\qquad E[\delta^2] = \frac{1}{a} \qquad\qquad \sigma_\delta \simeq \frac{1.04}{\sqrt{n}}$$

and a one-sided 95% upper credible region (see Sect. 2.11 for more details) of
δ0.95 = 3.38/√n.
So far, the analysis has been done assuming that the variances σj² are of the
same size (equal, in fact) and that the correlations are small. This is a very reasonable
assumption but may not always be the case. The easiest way to proceed then is
to perform a transformation of the parameters of interest (μ) to polar coordinates
μ(ρ, φ) and do a Monte Carlo sampling from the posterior:

$$p(\rho, \boldsymbol{\phi}\,|\,\mathbf{z}, \Sigma^{-1}) \propto \left[\, \prod_{j=1}^{n} N(z_j|\mu_j(\rho, \boldsymbol{\phi}), \Sigma^{-1}) \right] \pi(\rho)\, d\rho\; dS^{\,n-1}$$
References
1. G. D’Agostini, Bayesian Reasoning in Data Analysis (World Scientific Publishing Co, Singapore, 2003)
2. F. James, Statistical Methods in Experimental Physics (World Scientific Publishing Co, Singapore, 2006)
3. J.M. Bernardo, The concept of exchangeability and its applications. Far East J. Math. Sci. 4,
111–121 (1996). www.uv.es/~bernardo/Exchangeability.pdf
4. J.M. Bernardo, A.F.M. Smith, Bayesian Theory (Wiley, New York, 1994)
5. R.E. Kass, L. Wasserman, The selection of prior distributions by formal Rules. J. Am. Stat.
Assoc. V 91(453), 1343–1370 (1996)
3 Monte Carlo Methods

The Monte Carlo method is a very useful and versatile numerical technique that
allows one to solve a large variety of problems difficult to tackle by other procedures.
Even though the central idea is to simulate experiments on a computer and make
inferences from the “observed” sample, it is applicable to problems that do not
have an explicit random nature; it is enough if they admit an adequate probabilistic
formulation. In fact, a frequent use of Monte Carlo techniques is the evaluation of
definite integrals that at first sight have no statistical nature but can be interpreted as
expected values under some distribution.
Detractors of the method used to argue that one uses Monte Carlo methods because
of a manifest incapability to solve the problems by other, more academic, means. Well,
consider a “simple” process in particle physics: ee → eeμμ. Just four particles in
the final state; the differential cross-section in terms of eight variables that are not
independent due to kinematic constraints. To see what we expect for a particular
experiment, it has to be integrated within the acceptance region, with dead zones
between subdetectors, different materials and resolutions that distort the momenta
and energy, detection efficiencies, … Yes, admittedly we are not able to get nice
expressions. Nobody is, in fact, and Monte Carlo comes to our help. Last, it may be a
truism, but it is worth mentioning that Monte Carlo is not a magic black box and will not
give the answer to our problem out of nothing. It will simply present the available
information in a different and more suitable manner after more or less complicated
calculations are performed, but all the needed information has to be put in to start
with in some way or another.
In this lecture we shall present and justify essentially all the procedures that are
commonly used in particle physics and statistics leaving aside subjects like Markov
Chains that deserve a whole lecture by themselves and for which only the relevant
properties will be stated without demonstration. A general introduction to Monte
Carlo techniques can be found in [1].
3.1 Pseudo-Random Sequences

Sequences of random numbers {x1, x2, . . . , xn} are the basis of Monte Carlo
simulations and, in principle, their production is equivalent to performing an experiment
e(n) by sampling n times the random quantity X ∼ p(x|θ). Several procedures have been
developed for this purpose (real experiments, dedicated machines, digits of transcendental
numbers, …) but, besides the lack of precise knowledge behind the generated
sequences and the need for periodic checks, the complexity of the calculations we are
interested in demands long sequences and fast generation procedures. We are then
forced to devise simple and efficient arithmetic algorithms to be implemented on a
computer. Obviously, neither are the sequences so produced random, nor can we produce
truly random sequences by arithmetic algorithms, but we really do not need them. It
is enough for them to simulate the relevant properties of truly random sequences and
to be such that, if I give you one of these sequences and no additional information, you
won’t be able to tell after a bunch of tests [2] whether it is a truly random sequence or
not (at least for the needs of the problem at hand). That’s why they are called pseudo-
random although, in what follows, we shall call them random. The most popular (and
essentially the best) algorithms are based on congruential relations (used for this
purpose as far back as the 1940s) together with binary and/or shuffling operations, with
some free parameters that have to be fixed before the sequence is generated. They
are fast, easy to implement on any computer and, with an adequate initial setting of
the parameters, produce very long sequences with sufficiently good properties. And
the easiest and fastest pseudo-random distribution to be generated on a computer is
the Discrete Uniform.1
Thus, let’s assume that we have a good Discrete Uniform random number
generator2 although, as Marsaglia said, “A Random Number Generator is like sex:
When it is good it is wonderful; when it is bad… it is still pretty good”. Each
call in a computer algorithm will produce an output (x) that we shall represent as
x ⇐ Un(0, 1) and that simulates a sampling of the random quantity X ∼ Un(x|0, 1).
Certainly, we are not very much interested in the Uniform distribution itself, so the task
is to obtain a sampling of densities p(x|θ) other than Uniform from a Pseudo-Uniform
Random Number Generator, for which there are several procedures.
1 See [3, 4] for a detailed review on random and quasi-random number generators.
2 For the examples in this lecture I have used RANMAR [5] that can be found, for instance, at the
CERN Computing Library.
Example 3.1 (Estimate the value of π) As a simple first example, let’s see how we
may estimate the value of π. Consider a circle of radius r inscribed in a square with
sides of length 2r . Imagine now that we throw random points evenly distributed
inside the square and count how many have fallen inside the circle. It is clear that
since the area of the square is 4r 2 and the area enclosed by the circle is πr 2 , the
probability that a throw falls inside the circle is θ = π/4.
If we repeat the experiment N times, the number n of throws falling inside the
circle follows a Binomial law Bi(n|N, θ) and therefore, having observed n out of
N trials, we have the corresponding posterior for θ.
Let’s take π = 4E[θ] as point estimator and σ = 4σθ as a measure of the precision.
The results obtained for samplings of different size are shown in Table 3.1.
It is interesting to see that the uncertainty decreases with the sampling size as
1/√N. This dependence is a general feature of Monte Carlo estimations, regardless
of the number of dimensions of the problem.
A similar problem is that of Buffon's needle: a needle of length l is thrown at
random on a horizontal plane with stripes of width d > l. What is the probability that
the needle intersects one of the lines between the stripes? It is left as an exercise to
show that, as given already by Buffon in 1777, P_cut = 2l/(πd). Laplace pointed out,
in what may be the first use of the Monte Carlo method, that doing the experiment
one may estimate the value of π "… although with large error".
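The hit-or-miss estimate of the example above takes only a few lines; the sketch below (the function name is mine, and Python's default generator stands in for RANMAR; the observed fraction n/N is used in place of E[θ], with which it agrees for large N):

```python
import random

def estimate_pi(n_throws, seed=1):
    """Hit-or-miss estimate of pi: throw points uniformly in the unit
    square and count those inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_throws)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    theta = hits / n_throws                      # estimates pi / 4
    sigma = 4.0 * (theta * (1.0 - theta) / n_throws) ** 0.5
    return 4.0 * theta, sigma
```

Running it for increasing N reproduces the 1/√N fall of the uncertainty quoted above.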
3.2 Basic Algorithms

The Inverse Transform is, at least formally, the easiest procedure. Suppose we want a sampling of the
continuous one-dimensional random quantity X ∼ p(x)³ so
³ Remember that if supp(X) = Ω ⊆ ℝ, it is assumed that the density is p(x)1_Ω(x).
P[X ∈ (−∞, x]] = ∫_{−∞}^{x} p(x′) dx′ = ∫_{−∞}^{x} dF(x′) = F(x)
Now, we define the new random quantity U = F(X ) with support in [0, 1]. How is
it distributed? Well,
F_U(u) ≡ P[U ≤ u] = P[F(X) ≤ u] = P[X ≤ F^{−1}(u)] = ∫_{−∞}^{F^{−1}(u)} dF(x′) = u

that is, U ∼ Un(u|0, 1) and, reciprocally, X = F^{−1}(U) is a sampling of p(x).
Example 3.2 Let's see how we generate a sampling of the Laplace distribution X ∼
La(x|α, β) with α ∈ ℝ, β ∈ (0, ∞) and density

p(x|α, β) = (1/2β) e^{−|x−α|/β} 1_{(−∞,∞)}(x)
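The piecewise inversion of the Laplace distribution function can be sketched as follows (a minimal sketch, not the book's code; the closed forms come from integrating the density on each side of α):

```python
import math
import random

def sample_laplace(alpha, beta, rng=random):
    """Inverse-transform sampling of La(x|alpha, beta): the distribution
    function is F(x) = (1/2) e^{(x-alpha)/beta} for x < alpha and
    F(x) = 1 - (1/2) e^{-(x-alpha)/beta} otherwise, inverted piecewise."""
    u = rng.random()
    if u < 0.5:
        return alpha + beta * math.log(2.0 * u)
    return alpha - beta * math.log(2.0 * (1.0 - u))
```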
For an n-dimensional random quantity with independent components,

p(x₁, x₂, …, xₙ) = ∏_{i=1}^{n} pᵢ(xᵢ) and F(x₁, x₂, …, xₙ) = ∏_{i=1}^{n} Fᵢ(xᵢ)

but, if this is not the case, note that there are n! ways to do the decomposition and
some may be easier to handle than others (see Example 3.3).
For a discrete random quantity with P(X = xₖ) = pₖ, the cumulative probabilities

F₀ = P(X ≤ x₀) = p₀
F₁ = P(X ≤ x₁) = p₀ + p₁
F₂ = P(X ≤ x₂) = p₀ + p₁ + p₂
…

partition the interval [0, 1] as 0 ≤ F₀ ≤ F₁ ≤ F₂ ≤ … ≤ 1.
Then, it is clear that a random quantity u i drawn from U (x|0, 1) will determine a point
in the interval [0, 1] and will belong to the subinterval [Fk−1 , Fk ] with probability
pk = Fk − Fk−1 so we can set up the following algorithm:
(i₁) Get uᵢ ⇐ Un(u|0, 1);
(i₂) Find the value xₖ such that Fₖ₋₁ < uᵢ ≤ Fₖ
The sequence {x₀, x₁, x₂, …} so generated will be a sampling of the probability
law P(X = xₖ) = pₖ. Even though discrete random quantities can be sampled in
this way, sometimes there are specific properties from which faster algorithms
can be developed. That is the case, for instance, of the Poisson Distribution, as the
following example shows.
Example 3.4 (Poisson Distribution Po(k|μ)) From the recurrence relation

pₖ = e^{−μ} μᵏ/Γ(k + 1) = pₖ₋₁ (μ/k), with p₀ = e^{−μ},

we can build the cumulative values Fₖ and:
(i₁) uᵢ ⇐ Un(0, 1);
(i₂) Find the value k = 0, 1, … such that Fₖ₋₁ < uᵢ ≤ Fₖ and deliver x = k
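These two steps, together with the Poisson recurrence above, can be sketched as (function name is mine; the cumulative sum is built on the fly):

```python
import math
import random

def sample_poisson_cdf(mu, rng=random):
    """Discrete inverse transform for Po(k|mu): draw u and walk the
    cumulative probabilities F_k built with p_k = p_{k-1} * mu / k."""
    u = rng.random()
    k, p = 0, math.exp(-mu)       # p_0 = e^{-mu}
    F = p
    while u > F:
        k += 1
        p *= mu / k               # recurrence p_k = p_{k-1} mu / k
        F += p
    return k
```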
For the Poisson Distribution, there is a faster procedure. Consider a sequence
of n independent random quantities {X 1 , X 2 , . . . , X n }, each distributed as X i ∼
U n(x|0, 1), and introduce a new random quantity
Wₙ = ∏_{k=1}^{n} Xₖ
P(Wₙ ≤ e^{−μ}) = 1 − P(n, μ) = ∑_{k=0}^{n−1} e^{−μ} μᵏ/Γ(k + 1) = Po(X ≤ n − 1|μ)
Therefore,
(i₀) Set w_p = 1;
(i₁) uᵢ ⇐ Un(0, 1) and set w_p = w_p uᵢ;
(i₂) Repeat step (i₁) until w_p ≤ e^{−μ}; if (i₁) was executed k times, deliver x = k − 1
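The faster procedure transcribes directly into code (a sketch; the function name is mine):

```python
import math
import random

def sample_poisson_product(mu, rng=random):
    """Po(k|mu) sampling by multiplying uniforms until the running
    product w_p drops below e^{-mu}; if k factors were needed, deliver k - 1."""
    limit = math.exp(-mu)
    w, k = 1.0, 0
    while w > limit:
        w *= rng.random()
        k += 1
    return k - 1
```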
Example 3.5 (Binomial Distribution Bi(k|N, θ)) From the recurrence relation

pₖ = C(N, k) θᵏ (1 − θ)^{N−k} = pₖ₋₁ (θ/(1 − θ)) ((N − k + 1)/k), with p₀ = (1 − θ)^N,
(i 1 ) u i ⇐ U n(0, 1)
(i 2 ) Find the value k = 0, 1, . . . , N such that Fk−1 < u i ≤ Fk and deliver xk = k
When a light pulse reaches the photocathode, the number of emitted photoelectrons
follows also a Poisson law n_gen ∼ Po(n|μ). The parameter μ accounts for the efficiency of
this first process and depends on the characteristics of the photocathode, the applied
voltage and the geometry of the focusing electrodes (essentially that of the first
dynode). It has been estimated to be μ = 0.25. Thus, we start our simulation with
(1) n gen ⇐ Po(n|μ)
electrons leaving the photocathode. They are driven towards the first dynode to start
the electron shower but there is a chance that they miss the first and start the shower
at the second. Again, the analysis of the experimental data suggests that this happens
with probability p_d2 ≃ 0.2. Thus, we have to decide how many of the n_gen electrons
start the shower at the second dynode. A Binomial model is appropriate in this case:
(2) n d2 ⇐ Bi(n d2 |n gen , pd2 ) and therefore n d1 = n gen − n d2 .
Obviously, we shall do this second step if n gen > 0.
Now we come to the amplification stage. Our photomultiplier has 12 dynodes so
let’s see the response of each of them. For each electron that strikes upon dynode k
(k = 1, . . . , 12), n k electrons will be produced and directed towards the next element
of the chain (dynode k + 1), the number of them again well described by a Poisson
law Po(n k |μk ). If we denote by V the total voltage applied between the photocathode
and the anode and by Rk the resistance previous to dynode k we have that the current
intensity through the chain will be
I = V / ∑_{i=1}^{13} Rᵢ
where we have considered also the additional resistance between the last dynode
and the anode that collects the electron shower. Therefore, the parameters μk are
determined by the relation
μk = a (I Rk )b
where a and b are characteristic parameters of the photomultiplier. In our case we have
that N = 12, a = 0.16459, b = 0.75, a total applied voltage of 800 V and a resistance
chain of {2.4, 2.4, 2.4, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.2, 2.4} Ohms. It is easy
to see that if the response of dynode k to one electron is modeled as Po(n k |μk ), the
response to n i incoming electrons is described by Po(n k |n i μk ). Thus, we simulate
the amplification stage as:
(3.1) If n_d1 > 0, do from k = 1 to 12: μ = μₖ n_d1 → n_d1 ⇐ Po(n|μ)
(3.2) If n_d2 > 0, do from k = 2 to 12: μ = μₖ n_d2 → n_d2 ⇐ Po(n|μ)
Once this is done, we have to convert the number of electrons at the anode into ADC
counts. The electron charge is Q_e = 1.602176×10⁻¹⁹ C and in our set-up we have
f_ADC = 2.1×10¹⁴ ADC counts per Coulomb, so

ADC_pm = (n_d1 + n_d2) (Q_e f_ADC)
Last, we have to consider the noise (pedestal). In our case, the number of pedestal
ADC counts is well described by a mixture of two Normal densities with α = 0.8:
with probability α we obtain ADC_ped ⇐ N(x|10, 1.5), and with probability 1 − α,
ADC_ped ⇐ N(x|10, 1), so the total number of ADC counts will be ADC_tot = ADC_pm + ADC_ped.
Obviously, if in step 1) we get n gen = 0, then ADCtot = ADC ped . Figure 3.1 shows
the result of the simulation for a sampling size of 106 together with the main con-
tributions (1, 2 or 3 initial photoelectrons) and the pedestal. From these results, the
parameters of the device can be adjusted (voltage, resistance chain,…) to optimize
the response for our specific requirements.
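The whole chain can be sketched with the parameters quoted in the text. Two assumptions are mine: for large means the Poisson sampling switches to a Gaussian approximation (purely to keep the sketch fast), and the n_d2 shower is started at dynode 2, which is my reading of the truncated amplification step:

```python
import math
import random

rng = random.Random(11)

def poisson(mu):
    """Product-of-uniforms Poisson sampling; Gaussian approximation
    (an assumption of this sketch) for large mu."""
    if mu > 50.0:
        return max(0, round(rng.gauss(mu, math.sqrt(mu))))
    limit, w, k = math.exp(-mu), 1.0, 0
    while w > limit:
        w *= rng.random()
        k += 1
    return k - 1

def binomial(n, p):
    return sum(rng.random() < p for _ in range(n))

# device parameters quoted in the text
V, a, b = 800.0, 0.16459, 0.75
R = [2.4, 2.4, 2.4, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.2, 2.4]
I = V / sum(R)
mu_dyn = [a * (I * R[k]) ** b for k in range(12)]   # dynode gains
QE_F = 1.602176e-19 * 2.1e14                        # ADC counts per electron

def adc_response():
    n_gen = poisson(0.25)                             # photoelectrons
    n_d2 = binomial(n_gen, 0.2) if n_gen > 0 else 0   # miss the first dynode
    n_d1 = n_gen - n_d2
    for k in range(12):
        if n_d1 > 0:
            n_d1 = poisson(mu_dyn[k] * n_d1)
        if k >= 1 and n_d2 > 0:                       # shower starts at dynode 2
            n_d2 = poisson(mu_dyn[k] * n_d2)
    # pedestal: mixture 0.8 N(10, 1.5) + 0.2 N(10, 1)
    ped = rng.gauss(10.0, 1.5) if rng.random() < 0.8 else rng.gauss(10.0, 1.0)
    return (n_d1 + n_d2) * QE_F + ped
```

A histogram of many `adc_response()` calls reproduces the pedestal peak plus the photoelectron contributions of Fig. 3.1.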
The Inverse Transform method is conceptually simple and easy to implement for
discrete distributions and many continuous distributions of interest. Furthermore,
it is efficient in the sense that for each generated value uᵢ from Un(x|0, 1) we get a
value xᵢ from F(x). However, except for simple distributions, the inverse
function F^{−1} has no simple expression in terms of elementary functions and may
be difficult or time consuming to invert. This is, for instance, the case if you attempt
to invert the Error Function for the Normal Distribution. Thus, apart from simple
cases, the Inverse Transform method is used in combination with other procedures
to be described next.
NOTE 5: Bootstrap. Given the iid sample {x₁, x₂, …, xₙ} of the random quantity
X ∼ p(x|θ) we know (Glivenko-Cantelli theorem; see Lecture 1 (7.6)) that

Fₙ(x) = (1/n) ∑_{k=1}^{n} 1_{(−∞,x]}(xₖ)  →(unif.)  F(x|θ)
Essentially, the idea behind the bootstrap is to sample from the empirical Distribution
Function Fₙ(x) which, as we have seen for discrete random quantities, is equivalent to
drawing samples {x₁′, x₂′, …, xₙ′} of size n from the original sample with replacement.
Obviously, increasing the number of resamplings does not provide more information
than what is contained in the original data but, used with good sense, each bootstrap
will lead to a posterior and can also be useful to give insight about the form of the
underlying model p(x|θ) and the distribution of some statistics. We refer to [6] for
further details.
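The resampling idea can be sketched for the sample mean (function name is mine):

```python
import random

def bootstrap_means(sample, n_boot, rng=random):
    """Draw n_boot resamples of the data (with replacement, each of the
    size of the original sample) and return the mean of each resample."""
    n = len(sample)
    return [sum(rng.choice(sample) for _ in range(n)) / n
            for _ in range(n_boot)]
```

The spread of the returned means estimates the sampling uncertainty of the mean without further distributional assumptions.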
Consider the one-dimensional random quantity X ∼ p(x|θ) with support in [α, β] and
two independent quantities X₁ ∼ Un(x|α, β) and X₂ ∼ Un(x|0, δ), with δ ≥ max p(x|θ), so that

p(x₁, x₂|·) = 1/(β − α) · 1/δ

Then

P(X₁ ≤ x | X₂ ≤ p(X₁|θ)) = P(X₁ ≤ x, X₂ ≤ p(X₁|θ))/P(X₂ ≤ p(X₁|θ))
= [∫_α^x dx₁ ∫_0^{p(x₁|θ)} dx₂ p(x₁, x₂|·)] / [∫_α^β dx₁ ∫_0^{p(x₁|θ)} dx₂ p(x₁, x₂|·)]
= [∫_α^x p(x₁|θ) dx₁] / [∫_α^β p(x₁|θ) dx₁] = F(x|θ)

so the values x₁ accepted under the condition x₂ ≤ p(x₁|θ) are a sampling from F(x|θ).
Note that the efficiency ε so defined refers only to the fraction of accepted trials:
obviously, the more adjusted the covering is the better (for the Inverse Transform
ε = 1), but a larger ε does not necessarily imply that the algorithm is more efficient
attending to other considerations. It is interesting to observe that if we do not know
the normalization factor of the density function, it can be estimated as

∫_X p(x|θ) dx ≃ ε · (area of the covering)
Suppose, for instance, that we want a sampling of Be(x|α, β) with α, β > 1, so the
mode x₀ = (α − 1)(α + β − 2)^{−1} exists and is unique. Then

p_m ≡ max_x {p(x|α, β)} = p(x₀|α, β) = (1/B(α, β)) (α − 1)^{α−1}(β − 1)^{β−1}/(α + β − 2)^{α+β−2}

and, covering [0, 1]×[0, p_m], the acceptance efficiency will be

ε = 1/p_m = B(α, β) (α + β − 2)^{α+β−2} / [(α − 1)^{α−1}(β − 1)^{β−1}]
|Y₁⁰|² ∝ cos²θ;  |Y₁^{±1}|² ∝ sin²θ;  |Y₂⁰|² ∝ (3cos²θ − 1)²;  |Y₂^{±1}|² ∝ cos²θ sin²θ;  |Y₂^{±2}|² ∝ sin⁴θ

give the angular dependence from the spherical harmonics. Since dμ = r² sinθ dr dθ dφ,
the probability density will be
(i₂) Accept the n-tuple x_i = (x_i⁽¹⁾, x_i⁽²⁾, …, x_i⁽ⁿ⁾) if yᵢ ≤ p(x_i|θ) or reject it otherwise.
Usually, we know the support [α, β] of the random quantity but the pdf is complicated
enough that we do not know its maximum. Then, we start the generation with our best
guess for max_x p(x|·), say k₁, and after having generated N₁ events (generated, not
accepted) in [α, β]×[0, k₁],… wham!, we generate a value x_m such that p(x_m) > k₁. Certainly,
Fig. 3.2 Spatial probability distributions of an electron in a hydrogen atom corresponding to the
quantum states (n, l, m) = (3, 1, 0), (3, 2, 0) and (3, 2, ±1) (columns 1, 2, 3) and projections (x, y)
and (x, z) = (y, z) (rows 1 and 2) (see Example 3.8)
our estimation of the maximum was not correct. A possible solution is to forget about
what has been generated and start again with the new maximum k2 = p(xm ) > k1
but, obviously, this is not desirable among other things because we have no guarantee
that this is not going to happen again. We better keep what has been done and proceed
in the following manner:
(1) We have generated N1 pairs (x1 , x2 ) in [α, β] × [0, k1 ] and, in particular, X 2
uniformly in [0, k1 ]. How many additional pairs Na do we have to generate? Since
the density of pairs is constant in both domains [α, β] × [0, k1 ] and [α, β] ×
[0, k2 ] we have that
N₁ / ((β − α) k₁) = (N₁ + Nₐ) / ((β − α) k₂)  →  Nₐ = N₁ (k₂/k₁ − 1)
(2) How do we generate them? Obviously in the domain [α, β]×[k1 , k2 ] but from
the truncated density
(3) Once the Na additional events have been generated (out of which some have been
hopefully accepted) we continue with the usual procedure but on the domain
[α, β]×[0, k2 ].
The whole process is repeated as many times as needed.
NOTE 6: Weighted events.
The Acceptance-Rejection algorithm just explained is equivalent to:
(i 1 ) Sample xi from U n(x|α, β) and u i from U n(u|0, 1);
(i 2 ) Assign to each generated event xi a weight: wi = p(xi |·)/ pm ; 0 ≤ wi ≤1 and
accept the event if u i ≤ wi or reject it otherwise.
It is clear that:
• Events with a higher weight will have a higher chance to be accepted;
• After applying the acceptance-rejection criteria at step (i 2 ), all events will have a
weight either 1 if it has been accepted or 0 if it was rejected.
• The generation efficiency will be

ε = (accepted trials)/(total trials (N)) = (1/N) ∑_{i=1}^{N} wᵢ = w̄
Consider, for instance, sampling points uniformly inside an n-dimensional sphere
of radius r centered at x_c. We can cover it with the hypercube

Cₙ = ⨯_{i=1}^{n} [x_{ic} − r, x_{ic} + r]

and:
(1) xᵢ ⇐ Un(x|x_{ic} − r, x_{ic} + r) for i = 1, …, n;
(2) Accept x_i if ρᵢ = ‖x_i − x_c‖ ≤ r and reject otherwise.
The generation efficiency will be

ε(n) = (2 π^{n/2} / (n Γ(n/2))) (1/2ⁿ)
and limₙ→∞ ε(n) = 0. Certainly, we can sometimes refine the covering since there
is no need other than simplicity for a hypercube (see Stratified Sampling) but, in
general, the origin of the problem will remain: when we generate points uniformly in
whatever domain, we are sampling with constant density regions that have a very low
probability content, or even zero when they have null intersection with the support
of the random quantity X. This happens, for instance, when we want to sample
from a differential cross-section that has very sharp peaks (sometimes of several
orders of magnitude, as in the case of bremsstrahlung). Then, the problem of having
a low efficiency is not just the time expended in the generation but the accuracy and
convergence of the evaluations. We need a cleverer way to generate sequences,
and the Importance Sampling method comes to our help.
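The collapse of the acceptance with dimension is easy to check numerically; a quick sketch comparing the hit-or-miss estimate with the closed form ε(n) = 2π^{n/2}/(n Γ(n/2) 2ⁿ):

```python
import math
import random

def ball_acceptance(n, n_trials, rng=random):
    """Fraction of points, uniform in one orthant of the hypercube, that
    fall inside the inscribed unit n-ball (by symmetry, the same ratio
    as for the full cube)."""
    inside = sum(1 for _ in range(n_trials)
                 if sum(rng.random() ** 2 for _ in range(n)) <= 1.0)
    return inside / n_trials

def ball_acceptance_exact(n):
    """epsilon(n) = (2 pi^{n/2} / (n Gamma(n/2))) / 2^n."""
    return 2.0 * math.pi ** (n / 2.0) / (n * math.gamma(n / 2.0) * 2.0 ** n)
```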
Suppose that the density we want to sample can be written as p(x) = c g(x) h(x), where:
(1) h(x) is a probability density function, i.e., non-negative and normalized in X;
(2) g(x) ≥ 0 ∀x ∈ X and has a finite maximum g_m = max{g(x); x ∈ X};
(3) c > 0 is a constant normalization factor.
Now, consider a sampling {x1 , x2 , . . . , xn } drawn from the density h(x). If we apply
the Acceptance-Rejection criteria with g(x), how are the accepted values distributed?
It is clear that, if gm = max(g(x)) and Y ∼ U n(y|0, gm )
P(X ≤ x | Y ≤ g(X)) = [∫_a^x h(x′) dx′ ∫_0^{g(x′)} dy] / [∫_a^b h(x′) dx′ ∫_0^{g(x′)} dy]
= [∫_a^x h(x′) g(x′) dx′] / [∫_a^b h(x′) g(x′) dx′] = F(x)
that is,

p(x) dx = (p(x)/h(x)) h(x) dx = g(x) dH(x).
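As a minimal sketch of the idea, with an exponential proposal h(x) = e^{−x} (drawn by inverse transform) and g(x) = 1/(1 + x), g_m = 1 (my choice of example, not one from the text):

```python
import math
import random

def sample_weighted_exp(rng=random):
    """Sample p(x) proportional to g(x) h(x) with h(x) = e^{-x} and
    g(x) = 1/(1+x): draw x from h, accept with probability g(x)/g_m."""
    while True:
        x = -math.log(1.0 - rng.random())        # x ~ Exp(1)
        if rng.random() <= 1.0 / (1.0 + x):      # accept with prob g(x)
            return x
```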
The Stratified Sampling is a particular case of the Importance Sampling where the
density p(x); x ∈ X is approximated by a simple function over X. Thus, in the
one-dimensional case, if X = [a, b) and we take the partition (stratification)

X = ∪_{i=1}^{n} Xᵢ = ∪_{i=1}^{n} [aᵢ₋₁, aᵢ); a₀ = a, aₙ = b

we can use the piecewise-constant density

h(x) = (1/n) ∑_{i=1}^{n} 1_{[aᵢ₋₁, aᵢ)}(x)/λ(Xᵢ)  →  ∫_{a₀}^{aₙ} h(x) dx = 1
Selecting one of the n subdomains with probability proportional to its volume Vₖ,

P(Z = k) = Vₖ / ∑_{i=1}^{n} Vᵢ;  F(k) = P(Z ≤ k) = ∑_{j=1}^{k} P(Z = j)
If the density to be sampled can be decomposed as the mixture

p(x) = ∑_{j=1}^{k} aⱼ pⱼ(x); aⱼ > 0 ∀j = 1, 2, …, k

with each pⱼ(x) properly normalized, so that

∫_{−∞}^{∞} p(x) dx = ∑_{j=1}^{k} aⱼ ∫_{−∞}^{∞} pⱼ(x) dx = ∑_{j=1}^{k} aⱼ = 1
we can sample from p(x) selecting, at each step i, one of the k densities pi (x) with
probability pi = ai from which we shall obtain xi and therefore sampling with higher
frequency from those densities that have a higher relative weight. Thus:
(i₁) Select the density pᵢ(x) to be sampled at step (i₂) with probability pᵢ = aᵢ;
(i₂) Get xᵢ from the pᵢ(x) selected at (i₁).
It may happen that some densities pⱼ(x) cannot be easily integrated so we do
not know a priori the relative weights. If this is the case, we can sample from
fⱼ(x) ∝ pⱼ(x) and estimate with the generated events the corresponding
normalizations Iᵢ; for instance, from the sample mean

Iᵢ = (1/n) ∑_{k=1}^{n} fᵢ(xₖ)
p(x|·) = ∑_{i=1}^{K} aᵢ fᵢ(x|·) = ∑_{i=1}^{K} aᵢ Iᵢ (fᵢ(x)/Iᵢ) = ∑_{i=1}^{K} aᵢ Iᵢ pᵢ(x)
Consider, for example, the density

p(x) = (3/8)(1 + x²); x ∈ [−1, 1]
Then, we can take p₁(x) ∝ 1 and p₂(x) ∝ x² which, properly normalized, become
p₁(x) = 1/2 and p₂(x) = 3x²/2
so:

p(x) = (3/4) p₁(x) + (1/4) p₂(x)
Then:
(i₁) Get uᵢ and wᵢ as Un(u|0, 1);
(i₂) Get xᵢ as:
  if uᵢ ≤ 3/4 then xᵢ = 2wᵢ − 1
  if uᵢ > 3/4 then xᵢ = (2wᵢ − 1)^{1/3}
In this case, 75% of the times we sample from the trivial density Un(x|−1, 1).
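This two-component example as code (the inverse of F₂ is the sign-preserving cube root, hence `copysign`):

```python
import math
import random

def sample_mixture(rng=random):
    """Composition sampling of p(x) = (3/4) p1(x) + (1/4) p2(x) on [-1, 1]
    with p1(x) = 1/2 and p2(x) = 3 x^2 / 2."""
    u, w = rng.random(), rng.random()
    if u <= 0.75:
        return 2.0 * w - 1.0                 # inverse transform of F1
    y = 2.0 * w - 1.0                        # inverse of F2: signed cube root
    return math.copysign(abs(y) ** (1.0 / 3.0), y)
```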
3.3 Everything at Work

When problems start to get complicated, we have to combine several of the afore-
mentioned methods; in this case Importance Sampling, Acceptance-Rejection and
Decomposition of the probability density.
Compton Scattering is one of the main processes that occur in the interaction of
photons with matter. When a photon with an energy greater than the binding energy
of an atomic electron interacts with it, the photon suffers an inelastic scattering
resulting in a photon of lower energy and different direction than the incoming one,
and an electron ejected from the atom. If we make the simplifying assumptions
that the atomic electron is initially at rest and neglect the binding energy, we have that
if the incoming photon has an energy E_γ, its energy after the interaction (E_γ′) is:
ε = E_γ′/E_γ = 1/(1 + a(1 − cos θ))
where θ ∈ [0, π] is the angle between the momentum of the outgoing photon and
the incoming one and a = E_γ/m_e. It is clear that if the dispersed photon goes in the
forward direction (that is, θ = 0), it will have the maximum possible energy
(ε = 1) and when it goes backwards (θ = π) the smallest possible energy
(ε = (1 + 2a)^{−1}). Being a two-body final state, given the energy (or the angle) of
the outgoing photon the rest of the kinematic quantities are uniquely determined:

E_e = E_γ (1 − ε) and tan θ_e = cot(θ/2)/(1 + a)
dσ₀/dx = (3σ_T/8) f(x)

where x = cos θ, σ_T = 0.665 barn = 0.665·10⁻²⁴ cm² is the Thomson cross-
section and

f(x) = [1/(1 + a(1 − x))²] (1 + x² + a²(1 − x)²/(1 + a(1 − x)))
has all the angular dependence. Due to the azimuthal symmetry, there is no explicit
dependence on φ ∈ [0, 2π], which has been integrated out. Last, integrating this
expression for x ∈ [−1, 1] we have the total cross-section of the process:

σ₀(E_γ) = (3σ_T/4) [ ((1 + a)/a²)(2(1 + a)/(1 + 2a) − ln(1 + 2a)/a) + ln(1 + 2a)/(2a) − (1 + 3a)/(1 + 2a)² ]
To sample x from f(x), decompose f(x) = (f₁(x) + f₂(x) + f₃(x)) g(x) with
fₙ(x) = [1 + a(1 − x)]⁻ⁿ and

g(x) = 1 − [f₁(x)/(1 + f₁(x) + f₂(x))] (2 − x²)

The functions fₙ(x) are easy enough to use the Inverse Transform method and apply
afterwards the Acceptance-Rejection on g(x) > 0 ∀x ∈ [−1, 1]. The shape of this
function is shown in Fig. 3.3 (right) for different values of the incoming photon
energy; it is much smoother than f(x), so the Acceptance-Rejection
will be significantly more efficient. Normalizing properly the densities
pᵢ(x) = fᵢ(x)/wᵢ such that ∫_{−1}^{1} pᵢ(x) dx = 1; i = 1, 2, 3
Fig. 3.3 Functions f (x) (left) and g(x) (right) for different values of the incoming photon energy
with b = 1 + 2a, the corresponding normalizations are

w₁ = (1/a) ln b,  w₂ = (b − 1)/(ab),  w₃ = (b² − 1)/(2ab²)
and therefore

f(x) = (f₁(x) + f₂(x) + f₃(x)) · g(x) = (w₁ p₁(x) + w₂ p₂(x) + w₃ p₃(x)) · g(x)
= w_t (α₁ p₁(x) + α₂ p₂(x) + α₃ p₃(x)) · g(x)

where w_t = w₁ + w₂ + w₃ and

αᵢ = wᵢ/w_t > 0; i = 1, 2, 3 with ∑_{i=1}^{3} αᵢ = 1
g_M ≡ max[g(x)] = g(x = −1) = 1 − b/(1 + b + b²)
then:
• x ∼ p₁(x): F₁(x) = 1 − ln(1 + a(1 − x))/ln b  →  x_g = (1 + a − bᵘ)/a
• x ∼ p₂(x): F₂(x) = [b/(1 + a(1 − x)) − 1]/(b − 1)  →  x_g = 1 − (1/a)[b/(1 + 2au) − 1]
• x ∼ p₃(x): F₃(x) = [b²/(1 + a(1 − x))² − 1]/(b² − 1)  →  x_g = 1 − (1/a)[b/(1 + 4a(1 + a)u)^{1/2} − 1]
Once we have x_g we can deduce the remaining quantities of interest from the kine-
matic relations. In particular, the energy of the outgoing photon will be

ε_g = E_g/E_γ = 1/(1 + a(1 − x_g))
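Putting the pieces together gives the sketch below. It assumes the decomposition fᵢ(x) = [1 + a(1 − x)]⁻ⁱ, which reproduces the g(x) and w₁ quoted in the fragments above; the acceptance step uses g(x) directly, which is valid since 0 < g(x) ≤ 1:

```python
import math
import random

def sample_compton_cos(a, rng=random):
    """Sample x = cos(theta) from the Compton angular shape f(x) via the
    decomposition f = (f1 + f2 + f3) g, with z = 1 + a(1 - x), b = 1 + 2a
    and a = E_gamma / m_e."""
    b = 1.0 + 2.0 * a
    w = (math.log(b) / a,                       # w1
         (b - 1.0) / (a * b),                   # w2
         (b * b - 1.0) / (2.0 * a * b * b))     # w3
    wt = w[0] + w[1] + w[2]
    while True:
        r, v = rng.random() * wt, rng.random()
        if r < w[0]:
            z = b ** v                                    # invert F1
        elif r < w[0] + w[1]:
            z = b / (1.0 + (b - 1.0) * v)                 # invert F2
        else:
            z = b / math.sqrt(1.0 + (b * b - 1.0) * v)    # invert F3
        x = 1.0 - (z - 1.0) / a
        f1, f2 = 1.0 / z, 1.0 / (z * z)
        if rng.random() <= 1.0 - (2.0 - x * x) * f1 / (1.0 + f1 + f2):
            return x                                      # accepted by g(x)
```

For a → 0 the accepted values follow the Thomson shape ∝ 1 + x²; for large a the distribution is strongly forward-peaked.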
γ + atom −→ atom+ + e− ;
γ + atom −→ γ + e− + atom+
and at high energies (E γ ≥ 100 MeV) the dominant one is pair production
γ + nucleus −→ e+ + e− + nucleus
Given the cross-sections σᵢ of each process at the current photon energy and σ_t = ∑ᵢ σᵢ,
we decide upon the process i that is going to happen next with probability pᵢ = σᵢ/σ_t;
that is, u ⇐ Un(0, 1) and
(1) if u ≤ pphot. we simulate the photoelectric interaction;
(2) if pphot. < u ≤ ( pphot. + pCompt. ): we simulate the Compton effect and otherwise
(3) we simulate the pair production
Once we have decided which interaction is going to happen next, we have to
decide where. The probability that the photon interacts after traversing a distance x
(cm) in the material is given by
Fint = 1 − e−x/λ
where λ is the mean free path. With A the atomic mass number of the material, N_A
Avogadro's number, ρ the density of the material in g/cm³, and σ the cross-section
of the process under discussion, we have that

λ = A/(ρ N_A σ) [cm]
Thus, if u ⇐ Un(0, 1), the next interaction is going to happen at x = −λ ln u along
the direction of the photon momentum.
As an example, we are going to simulate what happens when a beam of photons
of energy E_γ = 1 MeV (X rays) impinges normally on the side of a rectangular block of
carbon (Z = 6, A = 12.01, ρ = 2.26) of 10 × 10 cm² surface and 20 cm depth. Behind
the block, centered on its surface and in contact with it, we have hidden an iron coin
(Z = 26, A = 55.85, ρ = 7.87) of 2 cm radius and 1 cm thickness. Last, at 0.5 cm
from the coin there is a photographic film that collects the incident photons.
The beam of photons is wider than the block of carbon so some of them will
go right through without interacting and will burn the film. We have assumed for
simplicity that when the photon energy is below 0.01 MeV, the photoelectric effect is
dominant and the ejected electron is absorbed in the material; the photon is
then lost and we start with the next one. Last, an irrelevant technical issue:
the angular variables of the photon after the interaction are referred to the direction
of the incident photon so in each case we have to do the appropriate rotation.
Figure 3.4 (up) shows the sketch of the experimental set-up and the trajectory
of one of the traced photons of the beam collected by the film. The radiography
obtained after tracing 100,000 photons is shown in Fig. 3.4 (down). The black zone
corresponds to photons that either go straight to the screen or at some point leave the
block before reaching the end. The mid-grey zone corresponds to the photons that cross
the carbon block, and the central circle, with fewer interactions, to those that cross
the carbon block and afterwards the iron coin.
Consider now particles contained in a volume V whose positions r₀ at time t₀ are
uniformly distributed:

p(r₀) dμ₀ = (1/V) dx₀ dy₀ dz₀
Assume now that the velocities are isotropically distributed; that is:
p(v) dμ_v = (1/4π) sin θ dθ dφ f(v) dv
with f (v) properly normalized. Under independence of positions and velocities at
t0 , we have that:
p(r₀, v) dμ₀ dμ_v = (1/V) dx₀ dy₀ dz₀ (1/4π) sin θ dθ dφ f(v) dv
Given a square of surface S = (2l)2 , parallel to the (x, y) plane, centered at (0, 0, z c )
and well inside the volume V , we want to find the probability and distribution of
particles that, coming from the top, cross the surface S in unit time.
For a particle having a coordinate z 0 at t0 = 0, we have that z(t) = z 0 + v z t. The
surface S is parallel to the plane (x, y) at z = z c so particles will cross this plane at
time tc = (z c − z 0 )/v z from above iff:
(0) z 0 ≥ z c ; obvious for otherwise they are below the plane S at t0 = 0;
(1) θ ∈ [π/2, π); also obvious because if they are above S at t0 = 0 and cut the
plane at some t > 0, the only way is that v z = v cos θ < 0 → cos θ < 0 → θ ∈
[π/2, π).
But to cross the squared surface S of side 2l we also need that
(2) −l ≤ x(tc ) = x0 + v x tc ≤ l and −l ≤ y(tc ) = y0 + v y tc ≤ l
Last, we want particles crossing in unit time; that is tc ∈ [0, 1] so 0 ≤ tc = (z c −
z 0 )/v z ≤ 1 and therefore
(3) z 0 ∈ [z c , z c − v cos θ]
After integration:

∫_{z_c}^{z_c − v cos θ} dz₀ ∫_{−l − v_x t_c}^{l − v_x t_c} dx₀ ∫_{−l − v_y t_c}^{l − v_y t_c} dy₀ = −(2l)² v cos θ
Thus, we have that for the particles crossing the surface S = (2l)² from above in unit
time

p(θ, φ, v) dθ dφ dv = −((2l)²/V) (1/4π) sin θ cos θ f(v) v dθ dφ dv
and the pdf for the angular distribution of velocities (direction of the crossing particles)
is

p(θ, φ) dθ dφ = −(1/π) sin θ cos θ dθ dφ = (1/2π) d(cos²θ) dφ
If we have a density of n particles per unit volume, the expected number of crossings
per unit time due to the n_V = nV particles in the volume is

n_c = n_V P_cut(t_c ≤ 1) = (n E[v]/4) S
so the flux, the number of particles crossing the surface from one side per unit time and
unit surface, is

φ_c = n_c/S = n E[v]/4
Note that the requirement that the particles cross the square surface S in a finite time
(t_c ∈ [0, 1]) modifies the angular distribution of the direction of the particles: instead of
the isotropic (1/4π) sin θ dθ dφ we have −(1/π) sin θ cos θ dθ dφ;
that is, one fourth of the solid angle spanned by the sphere. Therefore, the flux expressed
as number of particles crossing from one side of the square surface S per unit time
and unit solid angle is

φ_Ω = φ_c/π = n_c/(πS) = n E[v]/(4π)
Thus, if we generate a total of n_T = 6n_c particles on the surface of a cube,
each face of area S, with the angular distribution

p(θ, φ) dθ dφ = (1/2π) d(cos²θ) dφ

for each face, with θ ∈ [π/2, π) and φ ∈ [0, 2π) defined with k normal to the
surface, the equivalent generated flux per unit time, unit surface and unit solid angle is
φ_T = n_T/(6πS) and corresponds to a density of n = 2n_T/(3S E[v]) particles per unit
volume.
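Sampling the crossing direction is a one-liner per angle: cos²θ is uniform in (0, 1] and cos θ < 0, so cos θ = −√u (a minimal sketch; the function name is mine):

```python
import math
import random

def crossing_direction(rng=random):
    """(theta, phi) with p dtheta dphi = (1/2pi) d(cos^2 theta) dphi
    and theta in [pi/2, pi): cos^2 theta uniform, cos theta = -sqrt(u)."""
    theta = math.acos(-math.sqrt(rng.random()))
    phi = 2.0 * math.pi * rng.random()
    return theta, phi
```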
NOTE 7: Sampling some continuous distributions of interest
These are some procedures to sample from continuous distributions of interest. There
are several algorithms for each case with efficiency depending on the parameters but
those outlined here have in general high efficiency. In all cases, it is assumed that
u ⇐ U n(0, 1).
Beta Be(x|α, β):

p(x|·) = (1/B(α, β)) x^{α−1}(1 − x)^{β−1} 1_{(0,1)}(x)  →  x = x₁/(x₁ + x₂)

with x₁ ⇐ Ga(x|1, α) and x₂ ⇐ Ga(x|1, β).
Cauchy Ca(x|α, β):

p(x|·) = (β/π) [1 + β²(x − α)²]^{−1} 1_{(−∞,∞)}(x)  →  x = α + β^{−1} tan(π(u − 1/2))
Dirichlet Di(x|α); dim(α) = n:

p(x|α) = [Γ(α₁ + ⋯ + αₙ)/(Γ(α₁)⋯Γ(αₙ))] ∏_{j=1}^{n} x_j^{αⱼ−1} 1_{(0,1)}(x_j)  →  {x_j = z_j/z₀}_{j=1}^{n}

where z_j ⇐ Ga(z|1, αⱼ) and z₀ = ∑_{j=1}^{n} z_j.
Generalized Dirichlet GDi(x|α, β); dim(β) = n, βⱼ ∈ (0, ∞), ∑_{j=1}^{n−1} xⱼ < 1:

p(x₁, …, x_{n−1}|α, β) = ∏_{i=1}^{n−1} [Γ(αᵢ + βᵢ)/(Γ(αᵢ)Γ(βᵢ))] xᵢ^{αᵢ−1} (1 − ∑_{k=1}^{i} x_k)^{γᵢ}

with

γᵢ = βᵢ − αᵢ₊₁ − βᵢ₊₁ for i = 1, 2, …, n − 2 and γ_{n−1} = β_{n−1} − 1
Gamma Ga(x|α, β):

p(x|α, β) = (α^β/Γ(β)) e^{−αx} x^{β−1} 1_{(0,∞)}(x)
For integer β = m, if u₁, …, u_m ⇐ Un(0, 1) then x₁ = −ln u₁, …, x_m = −ln u_m are
samplings of Ga(x|1, 1) and

x_s = x₁ + x₂ + ⋯ + x_m = −ln ∏_{i=1}^{m} uᵢ

will be a sampling from Ga(x|1, m). For non-integer β, the problem is reduced to
getting a sampling w ⇐ Ga(w|1, δ) with δ ∈ (0, 1).
0 < β < 1: In this case, for small values of x the density is dominated by p(x) ∼
x β−1 and for large values by p(x) ∼ e−x . Let’s then take the approximant
Defining
and therefore:
(1) uᵢ ⇐ Un(0, 1); i = 1, 2, 3;
(2) If u₁ ≤ w₁, set x = u₂^{1/β} and accept x if u₃ ≤ e^{−x}; otherwise go to (1);
(3) If u₁ > w₁, set x = 1 − ln u₂ and accept x if u₃ ≤ x^{β−1}; otherwise go to (1).
The overall acceptance probability is

ε(β) = Γ(β + 1) e/(e + β)
In polar coordinates, the joint density of two independent N(x|0, 1) quantities becomes

p(r, θ) = (1/2π) e^{−r²/2} r

Clearly, both quantities R and Θ are independent and their distribution functions

F_r(r) = 1 − e^{−r²/2} and F_θ(θ) = θ/(2π)
are easy to invert so, using then the Inverse Transform algorithm:
(1) u₁ ⇐ Un(0, 1) and u₂ ⇐ Un(0, 1);
(2) r = (−2 ln u₁)^{1/2} and θ = 2πu₂;
(3) x1 = r cosθ and x2 = r sinθ.
Thus, we get two independent samplings x1 and x2 from N (x|0, 1) and
(4) z 1 = μ1 + σ1 x1 and z 2 = μ2 + σ2 x2
will be two independent samplings from N (x|μ1 , σ1 ) and N (x|μ2 , σ2 ).
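As code (the `1 - u` guard only avoids log 0, since `random()` can return exactly 0):

```python
import math
import random

def box_muller(mu1, sigma1, mu2, sigma2, rng=random):
    """Two independent Normal samplings from two uniforms:
    r = sqrt(-2 ln u1), theta = 2 pi u2, then (r cos, r sin)."""
    u1 = 1.0 - rng.random()          # avoid log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    t = 2.0 * math.pi * u2
    return mu1 + sigma1 * r * math.cos(t), mu2 + sigma2 * r * math.sin(t)
```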
For the n-dimensional case, X ∼ N (x|μ, V ) and V the covariance matrix, we
proceed from the conditional densities
For high dimensions this is a bit laborious and it is easier if we do first a bit of algebra.
We know from Cholesky’s Factorization Theorem that if V ∈ Rn×n is a symmetric
positive defined matrix there is a unique lower triangular matrix C, with positive
diagonal elements, such that V = CCT . Let then Y be an n-dimensional random
quantity distributed as N (y|0, I) and define a new random quantity
X = μ+CY
After some algebra, the elements of the matrix C can be easily obtained as
C_{i1} = V_{i1}/√V_{11}  (1 ≤ i ≤ n)

C_{ij} = (V_{ij} − ∑_{k=1}^{j−1} C_{ik} C_{jk}) / C_{jj}  (1 < j < i ≤ n)

C_{ii} = (V_{ii} − ∑_{k=1}^{i−1} C_{ik}²)^{1/2}  (1 < i ≤ n)
and, being lower triangular, C_{ij} = 0 ∀j > i. Thus, we have the following algorithm:
(1) Get the matrix C from the covariance matrix V;
(2) Get n independent samplings zᵢ ⇐ N(0, 1) with i = 1, …, n;
(3) Get xᵢ = μᵢ + ∑_{j=1}^{n} C_{ij} zⱼ
In particular, for a two-dimensional random quantity we have that

V = ( σ₁²    ρσ₁σ₂
      ρσ₁σ₂  σ₂²   )

and therefore:

C₁₁ = V₁₁/√V₁₁ = σ₁;  C₁₂ = 0
C₂₁ = V₂₁/√V₁₁ = ρσ₂;  C₂₂ = (V₂₂ − C₂₁²)^{1/2} = σ₂(1 − ρ²)^{1/2}

so:

C = ( σ₁   0
      ρσ₂  σ₂(1 − ρ²)^{1/2} )
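The three steps can be sketched for general n with plain Python lists (no external libraries; function names are mine):

```python
import math
import random

def cholesky_lower(V):
    """Lower-triangular C with V = C C^T, from the recurrences above."""
    n = len(V)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(C[i][k] * C[j][k] for k in range(j))
            if i == j:
                C[i][i] = math.sqrt(V[i][i] - s)
            else:
                C[i][j] = (V[i][j] - s) / C[j][j]
    return C

def sample_mvn(mu, V, rng=random):
    """x = mu + C z, with z a vector of independent N(0, 1) samplings."""
    C = cholesky_lower(V)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mi + sum(C[i][j] * z[j] for j in range(len(mu)))
            for i, mi in enumerate(mu)]
```

For the two-dimensional case above (σ₁ = 1, σ₂ = 2, ρ = 0.6) the factor reproduces C₂₁ = ρσ₂ and C₂₂ = σ₂√(1 − ρ²).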
Snedecor F(x|α, β):

p(x|·) ∝ x^{α/2−1} (β + αx)^{−(α+β)/2} 1_{(0,∞)}(x)  →  x = (x₁/α)/(x₂/β)

with x₁ ⇐ χ²(α) and x₂ ⇐ χ²(β).
3.4 Markov Chain Monte Carlo

With the methods we have used up to now we can simulate samples from distributions
that are more or less easy to handle. Markov Chain Monte Carlo allows us to sample
from more complicated distributions. The basic idea is to consider each sampling as
a state of a system that evolves in consecutive steps of a Markov Chain converging
(asymptotically) to the desired distribution. The simplest version was introduced
by Metropolis in the 1950s and generalized by Hastings in the 1970s.
Let’s start for simplicity with a discrete distribution. Suppose that we want a
sampling of size n from the distribution
P(X = k) = πₖ with k = 1, 2, …, N

where

π = (π₁, π₂, …, π_N); πᵢ ∈ [0, 1] ∀i = 1, …, N and ∑_{i=1}^{N} πᵢ = 1
and assume that it is difficult to generate a sample from this distribution by other
procedures. Then, we may start from a sample of size n generated from a simpler
distribution; for instance, a Discrete Uniform with
1
P0 (X = k) = ; ∀k
N
N
and from the sample obtained {n 1 , n 2 , . . . , n N }, where n = i=1 n i , we form the
initial sample probability vector
Once we have the n events distributed in the N classes of the sample space Ω =
{1, 2, …, N} we just have to redistribute them according to some criteria in different
steps so that eventually we have a sample of size n drawn from the desired distribution
P(X = k) = πₖ.
We can consider the process of redistribution as an evolving system such that,
if at step i the system is described by the probability vector π (i) , the new state at
step i + 1, described by π (i+1) , depends only on the present state of the system (i)
and not on the previous ones; that is, as a Markov Chain. Thus, we start from the
state π (0) and the aim is to find a Transition Matrix P, of dimension N ×N , such that
π (i+1) = π (i) P and allows us to reach the desired state π. The matrix P is
P = ( P₁₁  P₁₂  ⋯  P₁N
      P₂₁  P₂₂  ⋯  P₂N
      ⋮    ⋮    ⋱   ⋮
      P_N1 P_N2 ⋯  P_NN )
where each element (P)i j = P(i→ j) ∈ [0, 1] represents the probability for an event
in class i to move to class j in one step. Clearly, at any step in the evolution the
probability that an event in class i goes to any other class j = 1, . . . , N is 1 so
∑_{j=1}^{N} (P)_{ij} = ∑_{j=1}^{N} P(i→j) = 1
π⁽⁰⁾ → π⁽¹⁾ = π⁽⁰⁾P → π⁽²⁾ = π⁽¹⁾P = π⁽⁰⁾P² → ⋯ → π⁽ⁿ⁾ = π⁽⁰⁾Pⁿ → ⋯
There are infinite ways to choose the transition matrix P. A sufficient (although not
necessary) condition for this matrix to describe a Markov Chain with fixed vector π
is that the Detailed Balance condition is satisfied (i.e., a reversible evolution); that is
πP = (∑_{i=1}^{N} πᵢ(P)_{i1}, ∑_{i=1}^{N} πᵢ(P)_{i2}, …, ∑_{i=1}^{N} πᵢ(P)_{iN}) = π
∑_{i=1}^{N} πᵢ(P)_{ik} = ∑_{i=1}^{N} πₖ(P)_{ki} = πₖ for k = 1, 2, …, N
Imposing the Detailed Balance condition, we have freedom to choose the elements
(P)i j . We can obviously take (P)i j = π j so that it is satisfied trivially (πi π j = π j πi )
but this means that being at class i we shall select the new possible class j with
probability P(i→ j) = π j and, therefore, to sample directly the desired distribution
that, in principle, we do not know how to do. The basic idea of Markov Chain Monte
Carlo simulation is to take
(P)i j = q( j|i) · ai j
where
q( j|i): is a probability law to select the possible new class j = 1, . . . , N for an
event that is actually in class i;
ai j : is the probability to accept the proposed new class j for an event that is at
i taken such that the Detailed Balance condition is satisfied for the desired
distribution π.
Thus, at each step in the evolution, for an event that is in class i we propose
a new class j to go according to the probability q( j|i) and accept the transition
with probability ai j . Otherwise, we reject the transition and leave the event in the
class where it was. The Metropolis-Hastings [8] algorithm consists in taking the
acceptance function
a_{ij} = min{1, (πⱼ q(i|j)) / (πᵢ q(j|i))}
It is clear that this election of ai j satisfies the Detailed Balance condition. Indeed, if
πi q( j|i) > π j q(i| j) we have that:
a_{ij} = min{1, πⱼq(i|j)/(πᵢq(j|i))} = πⱼq(i|j)/(πᵢq(j|i)) and a_{ji} = min{1, πᵢq(j|i)/(πⱼq(i|j))} = 1

and therefore:

πᵢ(P)_{ij} = πᵢ q(j|i) a_{ij} = πᵢ q(j|i) · πⱼq(i|j)/(πᵢq(j|i)) = πⱼ q(i|j) = πⱼ q(i|j) a_{ji} = πⱼ(P)_{ji}
The same holds if π_i q(j|i) < π_j q(i|j) and is trivial if both sides are equal. Clearly, if q(i|j) = π_i then a_{ij} = 1, so the closer q(i|j) is to the desired distribution the better.
In both cases, it is clear that since the acceptance of the proposed class depends upon
the ratio π j /πi , the normalization of the desired probability is not important.
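As an illustration of the scheme just described, here is a minimal Python sketch of Metropolis–Hastings sampling of a discrete distribution. The class weights and the uniform proposal q(j|i) below are arbitrary choices made for the illustration, not values from the text:

```python
import random

random.seed(1)

# Hypothetical target over N = 5 classes: any positive weights work, since
# the acceptance a_ij only uses the ratio pi_j / pi_i (normalization drops out).
weights = [1.0, 2.0, 4.0, 2.0, 1.0]          # proportional to the desired pi
N = len(weights)

def q(j, i):
    """Proposal q(j|i): here uniform over the N classes (it cancels in a_ij)."""
    return 1.0 / N

def mh_step(i):
    j = random.randrange(N)                  # propose j from q(.|i)
    a = min(1.0, (weights[j] * q(i, j)) / (weights[i] * q(j, i)))
    return j if random.random() < a else i   # accept, or stay at i

state, counts = 0, [0] * N
for _ in range(200_000):
    state = mh_step(state)
    counts[state] += 1
# counts[k] / 200000 approximates pi_k = weights[k] / sum(weights)
```

Since only the ratio π_j/π_i enters the acceptance, the weights need not be normalized, as remarked above.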
The previous expressions are directly applicable in the case we want to sample an absolutely continuous random quantity X ∼ π(x). If reversibility holds, p(x′|x)π(x) = p(x|x′)π(x′), and therefore

p(x′|x) ≡ p(x→x′) = q(x′|x) · a(x→x′)
Let's take for this example N = 10 (that is, 11 classes), θ = 0.45 and n = 100,000. We start from a sampling of size n from a uniform distribution (Fig. 3.5(1)). At each step of the evolution we sweep over the n generated events. For an event that is in bin i we choose a new possible bin j with uniform probability q(j|i). Suppose that we look at an event in bin i = 7 and choose j with equal probability among the possible bins. If, for instance, j = 2, then we accept the move with probability

a_{72} = a(7→2) = min{ 1 , π_2/π_7 } = min{ 1 , p_2/p_7 } = 0.026
Fig. 3.5 Distributions at steps 0, 2 and 100 (1, 2, 3; blue) of the Markov Chain with the desired
Binomial distribution superimposed in yellow
a_{76} = a(7→6) = min{ 1 , π_6/π_7 } = min{ 1 , p_6/p_7 } = 1

so we make the move of the event. After two sweeps over the whole sample we have the distribution shown in Fig. 3.5(2), and after 100 sweeps that shown in Fig. 3.5(3), both compared to the desired distribution:
π^(0) = (0.091, 0.090, 0.090, 0.092, 0.091, 0.091, 0.093, 0.089, 0.092, 0.090, 0.092)
π^(2) = (0.012, 0.048, 0.101, 0.155, 0.181, 0.182, 0.151, 0.100, 0.050, 0.018, 0.002)
π^(100) = (0.002, 0.020, 0.077, 0.167, 0.238, 0.235, 0.159, 0.074, 0.022, 0.004, 0.000)
π = (0.000, 0.021, 0.076, 0.166, 0.238, 0.234, 0.160, 0.075, 0.023, 0.004, 0.000)
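The evolution described in this example can be sketched as follows; a smaller sample size than in the text is used so the sketch runs quickly, and a logarithmic discrepancy between the state of the chain and the target is evaluated to monitor convergence:

```python
import math, random

random.seed(7)
N, theta, n_events = 10, 0.45, 5_000

# Desired stationary probabilities: Bi(k|N, theta) for the 11 classes k = 0..N.
p = [math.comb(N, k) * theta**k * (1 - theta)**(N - k) for k in range(N + 1)]

# Step 0: every event placed uniformly over the 11 bins (as in Fig. 3.5(1)).
bins = [random.randrange(N + 1) for _ in range(n_events)]

def sweep(bins):
    """One sweep: propose a uniform new bin for each event and accept with
       probability min(1, p_j / p_i); the uniform q(j|i) cancels in the ratio."""
    for e in range(len(bins)):
        j = random.randrange(N + 1)
        if random.random() < min(1.0, p[j] / p[bins[e]]):
            bins[e] = j

def kl_discrepancy(bins):
    """Logarithmic discrepancy between the state of the chain and the target."""
    freq = [bins.count(k) / len(bins) for k in range(N + 1)]
    return sum(f * math.log(f / p[k]) for k, f in enumerate(freq) if f > 0)

for _ in range(100):
    sweep(bins)
delta = kl_discrepancy(bins)     # close to 0 after 100 sweeps
```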
Fig. 3.6 Distributions of the mean value, variance and logarithmic discrepancy vs the number of
steps. For the first two, the red line indicates what is expected for the Binomial distribution
The evolution of the moments, in this case the mean value and the variance, with the number of steps is shown in Fig. 3.6, together with the Kullback–Leibler logarithmic discrepancy between each state and the desired distribution, defined as

δ_KL{π|π^(n)} = Σ_k π_k^(n) ln [ π_k^(n) / π_k ]
As the previous example shows, we have to let the system evolve for some steps (i.e., some initial sweeps for "burn-in" or "thermalization") to reach stable running conditions and get close to the stationary distribution after starting from an arbitrary state. Once this is achieved, each step in the evolution will be a sampling from the desired distribution, so we do not necessarily have to generate a sample of the desired
size to start with. In fact, we usually don't do that; we choose one admissible state and let the system evolve. Thus, for instance, if we want a sample of X ∼ p(x|·) with x ∈ X, we may start with a value x_0 ∈ X. At a given step i the system will be in the state {x} and at step i + 1 it will be in a new state {x′} if we accept the change x → x′, or remain in the state {x} if we do not. After thermalization, each trial will simulate a sampling of X ∼ p(x|·). Obviously, the sequence of states of the system is not independent so, if correlations are important for the evaluation of the quantities of interest, it is common practice to reduce them by using only one out of every few steps for the evaluations.
As for the thermalization steps, there is no universal criterion to tell whether stable conditions have been achieved. One may look, for instance, at the evolution of the discrepancy between the desired probability distribution and the probability vector of the state of the system, and at the moments of the distribution evaluated with a fraction of the last steps. More details about that are given in [10]. It is also interesting to look at the acceptance rate; i.e., the number of accepted new values over the number of trials. If the rate is low, the proposed new values are rejected with high probability (they are far from the more likely ones) and therefore the chain will mix slowly. Conversely, a high rate indicates that the steps are short: successive samplings move slowly around the space and therefore the convergence is slow. In both cases we should think about tuning the parameters of the generation.
Example 3.12 (The Beta distribution) Let's simulate a sample of size 10^7 from a Beta distribution Be(x|4, 2); that is:

p(x) = 20 x^3 (1 − x) ; x ∈ [0, 1]

In this case, we start from the admissible state {x = 0.3} and select a new possible state x′ from the density q(x′|x) = 2x′, not symmetric and independent of x. Thus we generate a new possible state as

F_q(x′) = ∫_0^{x′} q(s|x) ds = ∫_0^{x′} 2s ds = x′^2 −→ x′ = u^{1/2} with u ∼ Un(0, 1)
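A Python sketch of this sampling follows; a shorter chain than the 10^7 of the text is used for brevity, and the Be(x|4, 2) density enters only up to normalization since just ratios appear in the acceptance:

```python
import random

random.seed(3)

def target(x):
    """Be(x|4,2) density up to normalization: x^3 (1 - x)."""
    return x**3 * (1.0 - x)

def q_density(x):
    """Independence proposal density q(x'|x) = 2x'."""
    return 2.0 * x

x, sample = 0.3, []
for _ in range(200_000):
    u = random.random()
    x_new = u ** 0.5                      # x' = u^(1/2) samples q
    # Hastings ratio for an independence proposal:
    a = min(1.0, target(x_new) * q_density(x) / (target(x) * q_density(x_new)))
    if random.random() < a:
        x = x_new
    sample.append(x)

mean = sum(sample) / len(sample)          # exact mean of Be(4,2) is 4/6
```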
where the integral is performed over all possible trajectories x(t) that connect the initial state x_i = x(t_i) with the final state x_f = x(t_f), S[x(t)] is the classical action functional

S[x(t)] = ∫_{t_i}^{t_f} L(ẋ, x, t) dt

that corresponds to each trajectory, and L(ẋ, x, t) is the Lagrangian of the particle. All trajectories contribute to the amplitude with the same weight but different phase. In principle, small differences in the trajectories cause big changes in the action compared to ℏ and, due to the oscillatory nature of the phase, their contributions cancel. However, the action does not change, to first order, for the trajectories in a neighborhood of the one for which the action is extremal and, since they have similar phases (compared to ℏ), their contributions will not cancel. The set of trajectories around the extremal one that produce changes in the action of the order of ℏ defines the limits of classical mechanics and allows us to recover its laws expressed as the Extremal Action Principle.
where the sum is understood as a sum for discrete eigenvalues and as an integral for continuous eigenvalues. Last, remember that the expected value of an operator A(x) is given by:

⟨A⟩ = ∫ A[x(t)] e^{(i/ℏ) S[x(t)]} D[x(t)] / ∫ e^{(i/ℏ) S[x(t)]} D[x(t)]
Let's see how to do the integral over paths to get the propagator in a one-dimensional problem. For a particle that follows a trajectory x(t) between x_i = x(t_i) and x_f = x(t_f) under the action of a potential V(x(t)), the Lagrangian is:

L(ẋ, x, t) = (1/2) m ẋ(t)^2 − V(x(t))

and the corresponding action:

S[x(t)] = ∫_{t_i}^{t_f} [ (1/2) m ẋ(t)^2 − V(x(t)) ] dt
where the integral is performed over the set Tr of all possible trajectories that start at x_i = x(t_i) and end at x_f = x(t_f). Following Feynman, a way to perform these integrals is to make a partition of the interval (t_i, t_f) in N subintervals of equal length ε (Fig. 3.8); that is, with

ε = (t_f − t_i) / N so that t_j − t_{j−1} = ε ; j = 1, 2, …, N

Thus, if we identify t_0 = t_i and t_N = t_f we have that

[t_i, t_f) = ∪_{j=0}^{N−1} [t_j, t_{j+1})

On each interval [t_j, t_{j+1}), the possible trajectories x(t) are approximated by straight segments, so they are defined by the sequence {x_j = x(t_j)} with

ẋ(t_j) −→ (x_j − x_{j−1}) / ε
so the action is finally expressed as:

S_N[x(t)] = Σ_{j=1}^{N} [ (1/2) m ( (x_j − x_{j−1}) / ε )^2 − V(x_j) ] · ε
Last, the integral over all possible trajectories that start at x_0 and end at x_N is translated in this space with discretized time axis as an integral over the quantities x_1, x_2, …, x_{N−1}, so the differential measure for the trajectories D[x(t)] is substituted by

D[x(t)] −→ A_N Π_{j=1}^{N−1} dx_j
K_N(x_f, t_f|x_i, t_i) =

= A_N ∫ dx_1 ⋯ ∫ dx_{N−1} exp{ (i/ℏ) Σ_{j=1}^{N} [ (1/2) m ( (x_j − x_{j−1}) / ε )^2 − V(x_j) ] · ε }
After doing the integrals, taking the limit ε→0 (or N→∞ since the product Nε = (t_f − t_i) is fixed) we get the expression of the propagator. Last, note that the interpretation of the integral over trajectories as the limit of a multiple Riemann integral is valid only in Cartesian coordinates.
To derive the propagator from path integrals is a complex problem and there are few potentials that can be treated exactly (the "simple" Coulomb potential, for instance, was solved in 1982). The Monte Carlo method allows us to attack this type of problem satisfactorily, but first we have to convert the complex integral into a positive real function. Since the propagator is an analytic function of time, it can be extended to the whole complex plane of t and then we perform a rotation of the time axis (Wick's rotation), integrating along the imaginary time axis.
Taking as prescription the analytical extension over the imaginary time axis, the oscillatory exponentials are converted into decreasing exponentials, the results are consistent with those derived from other formulations (Schrödinger or Heisenberg, for instance) and the analogy with the partition function of Statistical Mechanics becomes manifest. Then, the action is expressed as:

S[x(t)] −→ i ∫_{τ_i}^{τ_f} [ (1/2) m ẋ(τ)^2 + V(x(τ)) ] dτ
Note that the integration limits are real, as corresponds to integrating along the imaginary axis and not just to a simple change of variables. After partitioning the time interval, the propagator is expressed as:

K_N(x_f, t_f|x_i, t_i) = A_N ∫ dx_1 ⋯ ∫ dx_{N−1} exp{ −(1/ℏ) S_N(x_0, x_1, …, x_N) }

where

S_N(x_0, x_1, …, x_N) = Σ_{j=1}^{N} [ (1/2) m ( (x_j − x_{j−1}) / ε )^2 + V(x_j) ] · ε
Our goal is to generate N_gen trajectories with the Metropolis criteria according to

p(x_0, x_1, …, x_N) ∝ exp{ −(1/ℏ) S_N(x_0, x_1, …, x_N) }

Then, over these trajectories, we shall evaluate the expected value of the operators of interest A(x):

⟨A⟩ = (1/N_gen) Σ_{k=1}^{N_gen} A(x_0, x_1^(k), …, x_{N−1}^(k), x_N)
Fig. 3.9 The potentials V(x) of the two examples
V(x) = (1/2) k x^2

so the discretized action will be:

S_N(x_0, x_1, …, x_N) = Σ_{j=1}^{N} [ (1/2) m ( (x_j − x_{j−1}) / ε )^2 + (1/2) k x_j^2 ] · ε
To estimate the energy of the fundamental state we use the Virial Theorem. Since

⟨T⟩ = (1/2) ⟨ x · ∇V(x) ⟩

we have that ⟨T⟩ = ⟨V⟩ and therefore

E = ⟨T⟩ + ⟨V⟩ = k ⟨x^2⟩
P(x_j → x′_j) = exp{ −S_N(x_0, x_1, …, x′_j, …, x_N) }

and

P(x′_j → x_j) = exp{ −S_N(x_0, x_1, …, x_j, …, x_N) }

Obviously we do not have to evaluate the sum over all the nodes because, when dealing with node j, only the intervals (x_{j−1}, x_j) and (x_j, x_{j+1}) contribute to the sum. Thus, at node j we have to evaluate

a(x_j → x′_j) = min{ 1 , exp[ −S_N(x_{j−1}, x′_j, x_{j+1}) + S_N(x_{j−1}, x_j, x_{j+1}) ] }
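A sketch of this local Metropolis update for the discretized Euclidean action of the harmonic oscillator; the units (ℏ = m = k = 1), the values of N, ε and the proposal width, and the periodic trajectory are assumptions of this sketch, not of the text:

```python
import math, random

random.seed(5)

# Assumed parameters: N nodes, time step eps, proposal width delta.
N, eps, delta = 200, 0.25, 1.0

def V(x):
    return 0.5 * x * x                       # harmonic potential, k = 1

def local_action(xl, x, xr):
    """Terms of S_N that contain node x: its two kinetic links plus eps*V(x)."""
    return 0.5 * ((x - xl) ** 2 + (xr - x) ** 2) / eps + eps * V(x)

path = [0.0] * N                             # an admissible starting trajectory

def sweep(path):
    for j in range(N):
        xl, xr = path[j - 1], path[(j + 1) % N]   # periodic neighbours
        x_new = path[j] + delta * (2.0 * random.random() - 1.0)
        dS = local_action(xl, x_new, xr) - local_action(xl, path[j], xr)
        if dS <= 0.0 or random.random() < math.exp(-dS):
            path[j] = x_new                  # Metropolis acceptance at node j

for _ in range(500):                         # thermalization sweeps
    sweep(path)
x2, n_meas = 0.0, 500
for _ in range(n_meas):                      # measurement sweeps
    sweep(path)
    x2 += sum(x * x for x in path) / N
x2 /= n_meas
E0 = x2                                      # Virial Theorem: E0 = k <x^2> ~ 0.5
```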
Last, the trajectories obtained with the Metropolis algorithm will eventually follow the desired distribution p(x_0, x_1, …, x_N) in the asymptotic limit. To have a reasonable approximation to that, we shall not use the first N_term trajectories (thermalization). In this case we have taken N_term = 1000 and, again, we should check the stability of the result. After this, we have generated N_gen = 3000 and, to reduce correlations, we took one out of three for the evaluations; that is, N_used = 1000 trajectories, each one determined by N = 2000 nodes. The distribution of the accepted values x_j will be an approximation to the probability to find the particle at position x in the fundamental state; that is, |Ψ_0(x)|^2. Figure 3.10 shows the results of the simulation compared to the exact |Ψ_0(x)|^2
Fig. 3.10 Squared norm of the fundamental state wave-function for the harmonic potential and one
of the simulated trajectories
Fig. 3.11 Squared norm of the fundamental state wave-function for the quadratic potential and one
of the simulated trajectories
together with one of the many trajectories generated. The sampling average ⟨x^2⟩ = 0.486 is a good approximation to the energy of the fundamental state E_0 = 0.5.
As a second example, we have considered the potential well (Fig. 3.9)

V(x) = (a^2/4) [ (x/a)^2 − 1 ]^2

and, again from the Virial Theorem:

E = (3/(4a^2)) ⟨x^4⟩ − ⟨x^2⟩ + a^2/4
We took a = 5, a grid of N = 9000 nodes and ε = 0.25 (so τ = Nε = 2250) and, as before, N_term = 1000, N_gen = 3000 and N_used = 1000. From the generated trajectories we have the sample moments ⟨x^2⟩ = 16.4264 and ⟨x^4⟩ = 361.4756, so the estimated fundamental state energy is E_0 = 0.668, to be compared with the exact result E_0 = 0.697. The norm of the wave function for the fundamental state is shown in Fig. 3.11 together with one of the simulated trajectories, exhibiting the tunneling between the two wells.
p(x_1|x_2, x_3, …, x_n)
p(x_2|x_1, x_3, …, x_n)
⋮
p(x_n|x_1, x_2, …, x_{n−1})

and an arbitrary initial value x^(0) = {x_1^(0), x_2^(0), …, x_n^(0)} ∈ X. If we take the approximating density q(x_1, x_2, …, x_n) and the conditional densities

q(x_1|x_2, x_3, …, x_n)
q(x_2|x_1, x_3, …, x_n)
⋮
q(x_n|x_1, x_2, …, x_{n−1})
we generate for x_1 a proposed new value x_1^(1) from q(x_1|x_2^(0), x_3^(0), …, x_n^(0)) and accept the change with probability

a(x_1^(0) → x_1^(1)) = min{ 1 , [ p(x_1^(1), x_2^(0), …, x_n^(0)) q(x_1^(0), x_2^(0), …, x_n^(0)) ] / [ p(x_1^(0), x_2^(0), …, x_n^(0)) q(x_1^(1), x_2^(0), …, x_n^(0)) ] } =

= min{ 1 , [ p(x_1^(1)|x_2^(0), …, x_n^(0)) q(x_1^(0)|x_2^(0), …, x_n^(0)) ] / [ p(x_1^(0)|x_2^(0), …, x_n^(0)) q(x_1^(1)|x_2^(0), …, x_n^(0)) ] }
After we run over all the variables, we are in a new state {x_1′, x_2′, …, x_n′} and repeat the whole procedure until we consider that stability has been reached, so that we are sufficiently close to sampling the desired density. The same procedure can be applied if we consider it more convenient to express the density
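A sketch of this component-wise scheme for a hypothetical two-dimensional target density; the correlated normal below (with correlation −1/2) is an assumed example, not from the text:

```python
import math, random

random.seed(11)

def log_p(x1, x2):
    """Hypothetical bivariate target, unnormalized:
       ln p = -(x1^2 + x2^2 + x1*x2)/2, a normal with correlation -1/2."""
    return -0.5 * (x1 * x1 + x2 * x2 + x1 * x2)

def update_component(state, idx, width=2.0):
    """One Metropolis step on coordinate idx with the other coordinate fixed,
       using a symmetric uniform proposal (q cancels in the acceptance)."""
    prop = list(state)
    prop[idx] += width * (random.random() - 0.5)
    d = log_p(*prop) - log_p(*state)
    if d >= 0.0 or random.random() < math.exp(d):
        return prop
    return state

state, draws = [0.0, 0.0], []
for _ in range(100_000):
    for idx in (0, 1):                       # run over all the variables in turn
        state = update_component(state, idx)
    draws.append(tuple(state))
```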
Since

∫_0^∞ e^{−au} u^{b−1} du = Γ(b) a^{−b}
Then, for the J groups we have the parameters μ = {μ_1, …, μ_J} that, in turn, are also considered as an exchangeable sequence drawn from a parent distribution μ_j ∼ N(μ_j|μ, σ_μ^2). We reparameterize the model in terms of η = σ_μ^{−2} and φ = σ^{−2} and consider conjugate priors for the parameters, considered independent; that is
Thus, we set initially the parameters {μ_0, σ_0, a, b, c, d} and then, at each step
1. Get {μ_1, …, μ_J}, each as μ_j ∼ N(·, ·)
2. Get μ ∼ N(·, ·)
3. Get σ_μ = η^{−1/2} with η ∼ Ga(·, ·)
4. Get σ = φ^{−1/2} with φ ∼ Ga(·, ·)
and repeat the sequence until equilibrium is reached and samplings for evaluations can be done.
A frequent use of Monte Carlo sampling is the evaluation of definite integrals. Certainly, there are many numerical methods for this purpose and, for low dimensions, they usually give a better precision when fairly compared. In those cases one rarely uses Monte Carlo… although sometimes the domain of integration has a very complicated expression and the Monte Carlo implementation is far easier. However, as we have seen, the uncertainty of Monte Carlo estimations decreases with the sampling size N as 1/√N regardless of the number of dimensions so, at some point, it becomes superior. And, besides that, it is fairly easy to estimate the accuracy of the evaluation. Let's see in this section the main ideas.
Suppose we have the n-dimensional definite integral

I = ∫ f(x_1, x_2, …, x_n) dx_1 dx_2 ⋯ dx_n

The estimators

I_N^(1) = (1/N) Σ_{i=1}^{N} g(x_i) and I_N^(2) = (1/N) Σ_{i=1}^{N} g^2(x_i)

converge respectively to E[Y] (and therefore to I) and to E[Y^2] (as for the rest, all needed conditions for existence are assumed to hold). Furthermore, if we define

S_I^2 = (1/N) [ I_N^(2) − (I_N^(1))^2 ]

we know by the Central Limit Theorem that the random quantity

Z = ( I_N^(1) − I ) / S_I
is, in the limit N→∞, distributed as N(x|0, 1). Thus, Monte Carlo integration provides a simple way to estimate the integral I and to quantify the accuracy. Depending on the problem at hand, you can envisage several tricks to further improve the accuracy. For instance, if g(x) is a function "close" to f(x), with the same support and known integral I_g, one can write

I = ∫ f(x) dx = ∫ ( f(x) − g(x) ) dx + I_g ≅ (1/N) Σ_{i=1}^{N} ( f(x_i) − g(x_i) ) + I_g
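Both the plain estimate with its accuracy S_I and the trick above can be sketched for the test integral I = ∫_0^1 e^x dx = e − 1, taking g(x) = 1 + x with I_g = 3/2 (an illustrative choice, not from the text):

```python
import math, random

random.seed(2)
Ns = 100_000

def f(x):
    return math.exp(x)      # test integrand: I = int_0^1 e^x dx = e - 1

xs = [random.random() for _ in range(Ns)]       # uniform sampling on [0, 1]
I1 = sum(f(x) for x in xs) / Ns                 # I_N^(1), the estimate of I
I2 = sum(f(x) ** 2 for x in xs) / Ns            # I_N^(2)
S_I = math.sqrt((I2 - I1 ** 2) / Ns)            # accuracy estimate from the text

# Variance-reduction trick: g(x) = 1 + x is "close" to e^x and I_g = 3/2.
g, Ig = (lambda x: 1.0 + x), 1.5
I_cv = sum(f(x) - g(x) for x in xs) / Ns + Ig
```

Since f − g varies much less than f over [0, 1], the second estimate I_cv has a noticeably smaller statistical uncertainty for the same sample.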
References
1. F. James, Monte Carlo theory and practice. Rep. Prog. Phys. 43, 1145–1189 (1980)
2. D.E. Knuth, The Art of Computer Programming, vol. 2 (Addison-Wesley, Menlo Park, 1981)
3. F. James, J. Hoogland, R. Kleiss, Comput. Phys. Commun. 2–3, 180–220 (1999)
4. P. L’Ecuyer, Handbook of Simulations, Chap. 4 (Wiley, New York, 1998)
5. G. Marsaglia, A. Zaman, Toward a Universal Random Number Generator, Florida State Uni-
versity Report FSU-SCRI-87-50 (1987)
6. D.B. Rubin, Ann. Stat. 9, 130–134 (1981)
7. G.E.P. Box, M.E. Müller, A note on the generation of random normal deviates. Ann. Math.
Stat. 29(2), 610–611 (1958)
8. W.K. Hastings, Biometrika 57, 97–109 (1970)
9. N. Metropolis, A.W. Rosenbluth, M.W. Rosenbluth, A.H. Teller, E. Teller, J. Chem. Phys. 21,
1087–1092 (1953)
10. A.B. Gelman, J.S. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis (Chapman & Hall,
London, 1995)
Chapter 4
Information Theory
The ultimate goal of doing experiments and making observations is to learn about the way nature behaves and, eventually, unveil the mathematical laws governing the Universe and predict yet-unobserved phenomena. In less pedantic words, to get information about the natural world. Information plays a relevant role in a large number of disciplines (physics, mathematics, biology, image processing, …) and, in particular, it is an important concept in Bayesian Inference. It is useful, for instance, to quantify the similarities or differences between distributions and to evaluate the different ways we have to analyse the observed data because, in principle, not all of them provide the same amount of information on the same questions. The first step will be to quantify the amount of information that we get from a particular observation.
natural, a very likely observation, and I hardly give you any valuable information. However, if I tell you that I have seen a lone lion walking along Westminster Bridge, you will be quite surprised. This is not expected; a very unlikely observation worthy of further investigation. I give you a lot of information. Surprise is Information. Thus, it is also sensible to assume that if the probability for an event to occur is large we receive a small amount of information and, conversely, if the probability is very small we receive a large amount of information. In fact, if the event is a sure event, p(x_i) = 1, its occurrence will provide no information at all. Therefore, we start by assuming two reasonable hypotheses:

H1: I(x_i) = f(1/p_i) with f(x) a non-negative increasing function;
H2: f(1) = 0.
Now, imagine that we repeat the experiment n times under the same conditions and obtain the sequence of independent events {x_1, x_2, …, x_n}. We shall assume that the information provided by this n-tuple of observed results is equal to the sum of the information provided by each observation separately; that is:

I(x_1, x_2, …, x_n) = Σ_{i=1}^{n} I(x_i)

Being all independent, we have that p(x_1, x_2, …, x_n) = Π_{i=1}^{n} p(x_i) and, in consequence:

H3: f( 1/(p_1 p_2 ⋯ p_n) ) = Σ_{k=1}^{n} f( 1/p_k )
Those are the three hypotheses we shall make. Since p_i ∈ [0, 1], we have that w_i = 1/p_i ∈ [1, ∞) and therefore we are looking for a function f(w) such that:

(1) f: w ∈ [1, ∞) −→ [0, ∞) and increasing;
(2) f(1) = 0;
(3) f(w_1 · w_2 ⋯ w_n) = f(w_1) + ⋯ + f(w_n)

The third condition implies that f(w^n) = n f(w) so, taking derivatives with respect to w:

w^n ∂f(w^n)/∂w^n = w ∂f(w)/∂w

and this has to hold for all n ∈ ℕ and w ∈ [1, ∞); hence, we can write

w ∂f(w)/∂w = c −→ f(w) = c log w

with c a positive constant and f(w) an increasing function since w ≥ 1. Taking c = 1, we define:
• The amount of information we receive about the random process after we have observed the occurrence of the event X = x_i is:

I(x_i) = log [ 1/p(x_i) ] = −log p(x_i)

Depending on the base of the logarithm, the information is expressed in different units:

I(x_i) = −ln p(x_i) "nats" = −log_2 p(x_i) "bits" = −log_10 p(x_i) "hartleys"

and therefore 1 nat = log_2 e bits (≈1.44) = log_10 e hartleys (≈0.43). In general, the units will be irrelevant for us and we shall work with natural logarithms.

• The amount of information we expect to get from the realization of the random experiment is:

I(X) = Σ_i p(x_i) I(x_i) = −Σ_i p(x_i) ln p(x_i)

with the prescription lim_{x→0+} x ln x = 0. It is clear that the expected information I(X) does not depend on any particular result but on the probability distribution associated to the random process.
We can look at this expression from another point of view. If from the experiment
we are going to do we expect to get the amount of information I (X ), before we do the
experiment we have a lack of information I (X ) relative to what we shall have after
the experiment is done. Interpreted in this way, the quantity I (X ) is called entropy
(H (X )) and quantifies the amount of ignorance about the random process that we
expect to reduce after the observation.
I(X) = −p log p − (1 − p) log (1 − p)

that, in the case the two results are equally likely (p = 1/2) and we take logarithms in base 2, becomes:

I = (1/2) log_2 2 + (1/2) log_2 2 = log_2 2 = 1 bit
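These definitions are direct to code; a small sketch computing the expected information of the fair binary case in different units:

```python
import math

def information(p, base=math.e):
    """Information of a single outcome: I(x) = -log p(x) in the chosen base."""
    return -math.log(p, base)

def entropy(probs, base=math.e):
    """Expected information I(X) = -sum_i p_i log p_i."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

h_bits = entropy([0.5, 0.5], base=2)      # fair binary case: exactly 1 bit
h_nats = entropy([0.5, 0.5])              # the same quantity in nats: ln 2
# unit conversion, as in the text: 1 nat = log2(e) bits
```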
Example 4.2 Consider a discrete random quantity with support on the finite set {x_1, x_2, …, x_n} and probabilities p_i = p(x_i), and consider an experiment that consists of one observation of X. What is the distribution for which the lack of information is maximal? Defining

φ(p, λ) = −Σ_{i=1}^{n} p_i ln p_i + λ ( Σ_{i=1}^{n} p_i − 1 )

we have that

∂φ(p, λ)/∂p_i = −ln p_i − 1 + λ = 0 −→ p_i = e^{λ−1}
∂φ(p, λ)/∂λ = Σ_{i=1}^{n} p_i − 1 = 0 −→ e^{λ−1} = 1/n

−→ p(x_i) = 1/n ; i = 1, …, n

Therefore the entropy, the lack of information, is maximal for the Discrete Uniform Distribution and its value will be

H_M(X) = Σ_{i=1}^{n} (1/n) ln n = ln n
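A quick numerical check of this result: drawing random distributions on n points, none exceeds the entropy ln n of the uniform one (n = 6 is an arbitrary choice for the check):

```python
import math, random

random.seed(4)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

n = 6
h_uniform = entropy([1.0 / n] * n)        # = ln n

# Draw random distributions on n points; none should beat ln n.
best = 0.0
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    best = max(best, entropy([wi / s for wi in w]))
```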
μ_j = Σ_{i=1}^{∞} p(x_i) f_j(x_i) with j = 1, 2, …, k
∂φ(p, λ, λ)/∂p_i = −ln p_i − 1 + λ − Σ_{j=1}^{k} λ_j f_j(x_i) = 0

−→ p_i = exp{λ − 1} exp{ −Σ_{j=1}^{k} λ_j f_j(x_i) }

so if we define

Z(λ) = [ Σ_{i=1}^{∞} exp( −Σ_{j=1}^{k} λ_j f_j(x_i) ) ]^{−1}

we have that

p_i = Z(λ) exp( −Σ_{j=1}^{k} λ_j f_j(x_i) )
Suppose, for instance, that X may take the values {0, 1, 2, …} and we have only one condition:

μ = E[X] = Σ_{n=0}^{∞} p_n n

Imposing finally that Σ_{n=0}^{∞} p_n = 1, we get λ_1 = ln(1/μ + 1) and, in consequence,

p_n = μ^n / (1 + μ)^{1+n} with n = 0, 1, 2, … and μ > 0
is the distribution for which the lack of information is maximal. In this case:

H_M(X) = −Σ_{n=0}^{∞} p_n ln p_n = ln [ (1 + μ)^{1+μ} / μ^μ ]

is the maximum entropy for any random quantity with countable support and known mean value μ = E[X] > 0.
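A numerical check of this maximum-entropy distribution (with an assumed mean μ = 2.5): truncating the countable support far in the tail, the probabilities sum to one, reproduce the mean and give the closed-form entropy:

```python
import math

mu = 2.5                                   # an assumed mean value E[X] > 0

def p(n):
    """Maximum-entropy probabilities on {0, 1, 2, ...} with mean mu."""
    return mu ** n / (1.0 + mu) ** (1 + n)

K = 500                                    # truncation; the geometric tail is negligible
norm = sum(p(n) for n in range(K))
mean = sum(n * p(n) for n in range(K))
H = -sum(p(n) * math.log(p(n)) for n in range(K))
H_closed = math.log((1.0 + mu) ** (1.0 + mu) / mu ** mu)
```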
If we define¹

¹ The non-negativity of this and the following expressions of Information can easily be derived from Jensen's inequality for convex functions: given the probability space (ℝ, B, μ), a μ-integrable function X and a convex function φ over the range of X, then φ(∫_ℝ X dμ) ≤ ∫_ℝ φ(X) dμ provided the last integral exists; that is, φ(E[X]) ≤ E[φ(X)]. Observe that if φ is a concave function, then −φ is convex so the inequality sign is reversed, and that if φ is twice continuously differentiable on [a, b], it is convex on that interval iff φ″(x) ≥ 0 for all x ∈ [a, b]. Frequent and useful convex functions are φ(x) = exp(x) and φ(x) = −log x.
I(X|Y) = −Σ_x Σ_y p(x_i, y_j) ln p(x_i|y_j) ≥ 0

we can write:

I(X, Y) = I(X|Y) + I(Y) = I(Y|X) + I(X) = I(Y, X)
Now, I(Y) is the amount of information we expect to get about Y and, if (X, Y) are not independent, the knowledge of Y gives some information on X, so the remaining information we expect to get about X is not I(X) but the smaller quantity I(X|Y) < I(X), because we already know something about it. In entropy language, H(X|Y) is the amount of ignorance about X that remains after Y is known. It is clear that if X and Y are independent, I(X|Y) = I(X), so the knowledge of Y doesn't say anything about X and therefore the remaining information we expect to get about X is I(X). The interesting question is: how much information on X is contained in Y? (or, entropy-wise, by how much will the ignorance about X be reduced if we observe first the quantity Y?). Well, if observing X we expect to get I(X), and after observing Y we expect to get I(X|Y), the amount of information that the knowledge of Y provides on X is the Mutual Information:

I(X : Y) = I(X) − I(X|Y) = I(Y) − I(Y|X) = I(X) + I(Y) − I(X, Y) = I(Y : X)
I(X : Y) = I(X) + I(Y) − I(X, Y) = Σ_x Σ_y p(x_i, y_j) ln p(x_i, y_j) − Σ_x p(x_i) ln p(x_i) − Σ_y p(y_j) ln p(y_j)

and, in consequence:

• The Mutual Information, the amount of information we expect to get about X (or Y) from the knowledge of Y (or X), is given by

I(X : Y) = I(Y : X) = Σ_x Σ_y p(x_i, y_j) ln [ p(x_i, y_j) / ( p(x_i) p(y_j) ) ]
Again from Jensen's inequality for convex functions, I(X : Y) ≥ 0 with the equality satisfied if and only if X and Y are independent random quantities, so the Mutual Information is a measure of the statistical dependence between them (see Note 3).
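The expression can be evaluated directly for a small joint table (the 2×2 probabilities below are an assumed example); for the product of the marginals the same sum vanishes, as stated:

```python
import math

# Hypothetical joint distribution p(x_i, y_j) on a 2x2 table (not independent).
joint = [[0.30, 0.20],
         [0.10, 0.40]]

px = [sum(row) for row in joint]                            # marginal of X
py = [sum(joint[i][j] for i in range(2)) for j in range(2)] # marginal of Y

mi = sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
         for i in range(2) for j in range(2))

# For an independent table p(x)p(y) the same sum is exactly zero.
indep = [[px[i] * py[j] for j in range(2)] for i in range(2)]
mi0 = sum(indep[i][j] * math.log(indep[i][j] / (px[i] * py[j]))
          for i in range(2) for j in range(2))
```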
Up to now, we have been dealing with discrete random quantities. Consider now a continuous random quantity X ∼ p(x) with support on the compact set Ω_X = [a, b] and let's get a discrete approximation. Given a partition

Ω_X = ∪_{n=1}^{N} Δ_n ; Δ_n = [a + (n−1)Δ, a + nΔ]

with large N and Δ = (b − a)/N > 0, if p(x) is continuous on Δ_n we can use the Mean Value Theorem and write

P(X ∈ Δ_n) = ∫_{Δ_n} p(x) dx = p(x_n) Δ

Defining p_k = p(x_k) Δ, we have Σ_{k=1}^{N} p_k = 1 and write

I_D(X_D) = −Σ_{k=1}^{N} [ p(x_k) Δ ] log [ p(x_k) Δ ] = −Σ_{k=1}^{N} p_k log p(x_k) − log Δ
… but this doesn't work. This "naive" generalisation is not the limit of the information for a discrete quantity and is not an appropriate measure of information, among other reasons.
quantifies (Lindley, 1956) the Information we expect to get from the experiment e(n) on the parameter θ ∈ Ω_θ when the prior knowledge is represented by π(θ). Therefore, we have that:

• The amount of Information provided by the data sample x = {x_1, x_2, …, x_n} on the parameter θ ∈ Ω_θ with respect to the prior density π(θ) is:

I(x|π(θ)) = ∫_{Ω_θ} p(θ|x) ln [ p(θ|x) / π(θ) ] dθ

that is, given the experimental sample x, I(x|π(θ)) is the information we need to update the prior knowledge π(θ) and substitute it by p(θ|x);
² Despite that, the "Differential Entropy" h(p) = −∫_X p(x) log p(x) dx is a useful quantity in a different context. It is left as an exercise to show that, among all continuous distributions with support on [a, b], the Uniform distribution Un(x|a, b) is the one that maximizes the Differential Entropy; among those with support on [0, ∞) and specified first-order moment, it is the Exponential Ex(x|μ); and, if the second-order moment is also constrained, we get the Normal density N(x|μ, σ).
• The amount of Expected Information from the experiment e(n) on the parameter θ ∈ Ω_θ with respect to the knowledge contained in π(θ) will be

I(e|π(θ)) = ∫_X p(x) I(x|π(θ)) dx = ∫_X p(x) dx ∫_{Ω_θ} p(θ|x) ln [ p(θ|x) / π(θ) ] dθ =

= ∫_X ∫_{Ω_θ} p(x, θ) ln [ p(x, θ) / ( π(θ) p(x) ) ] dx dθ = I(X : θ)

that is,

I(e|π(θ)) = D_KL[ p(x, θ) ‖ π(θ) p(x) ] and I(x|π(θ)) = D_KL[ p(θ|x) ‖ π(θ) ]
What is the capacity of the experiment to distinguish two infinitesimally close values θ_0 and θ_1 = θ_0 + Δθ_0 of the parameters? Or, in other words, how much information has to be provided by the experiment so that we can discern between two infinitesimally close values of θ? Let's analyze the local behaviour of

I(θ_0 : θ_1 = θ_0 + Δθ_0) = ∫_X p(x|θ_0) ln [ p(x|θ_0) / p(x|θ_1) ] dx
we get

I(θ_0 : θ_0 + Δθ) ≈ −(1/2!) Σ_{i=1}^{n} Σ_{j=1}^{n} Δθ_i Δθ_j ∫_X [ ∂² ln p(x|θ) / ∂θ_i ∂θ_j ]_{θ_0} p(x|θ) dx + ⋯

we can write

I(θ_0 : θ_0 + Δθ) ≈ (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} Δθ_i I_ij(θ_0) Δθ_j + ⋯
Fisher's matrix is a non-negative symmetric matrix that plays a very important role in statistical inference… provided it exists. This is the case for regular distributions where:

(1) supp_x{p(x|θ)} does not depend on θ;
(2) p(x|θ) ∈ C^k(θ) for k ≥ 2; and
(3) the integrand is well behaved, so that (∂/∂θ) ∫_X (•) dx = ∫_X (∂/∂θ)(•) dx
Interchanging the derivatives with respect to the parameters θ_i and the integrals over X, it is easy to obtain the equivalent expressions:

I_ij(θ) = E_X[ −∂² ln p(x|θ) / ∂θ_i ∂θ_j ] = E_X[ ( ∂ln p(x|θ)/∂θ_i ) ( ∂ln p(x|θ)/∂θ_j ) ]

I_ii(θ) = E_X[ −∂² ln p(x|θ) / ∂θ_i² ] = E_X[ ( ∂ln p(x|θ)/∂θ_i )² ]
If X = (X_1, …, X_n) is a sample of n independent observations, the log-likelihood is

w(θ|·) = Σ_{i=1}^{n} log p(x_i|θ)

and, expanding around the maximum θ̂,

w(θ|·) = w(θ̂|·) − (1/2) Σ_{k=1}^{d} Σ_{m=1}^{d} (θ_k − θ̂_k) [ −(1/n) Σ_{i=1}^{n} ∂² log p(x_i|θ) / ∂θ_k ∂θ_m |_{θ̂} ] n (θ_m − θ̂_m) + ⋯

where the second term has been multiplied and divided by n. Under sufficiently regular conditions we have that θ̂ converges in probability to the true value θ_0, so we can neglect higher order terms and, by the Law of Large Numbers, approximate

lim_{n→∞} [ −(1/n) Σ_{i=1}^{n} ∂² log p(x_i|θ) / ∂θ_k ∂θ_m |_{θ̂} ] = E_X[ −∂² log p(x|θ) / ∂θ_k ∂θ_m ]_{θ̂} ≈ I_km(θ̂)

Therefore

w(θ|·) = w(θ̂|·) − (1/2) Σ_{k=1}^{d} Σ_{m=1}^{d} (θ_k − θ̂_k) [ n I_km(θ̂) ] (θ_m − θ̂_m) + ⋯
p(x|μ, V) = (2π)^{−n/2} det[V]^{−1/2} exp{ −(1/2) (x − μ)ᵀ V⁻¹ (x − μ) }

∂ln p(x|μ, V)/∂μ_i = Σ_{k=1}^{n} [V⁻¹]_{ik} (x_k − μ_k)
and therefore

I_ij(μ) = Σ_{k=1}^{n} Σ_{p=1}^{n} [V⁻¹]_{ik} [V⁻¹]_{jp} E_X[ (x_k − μ_k)(x_p − μ_p) ] = Σ_{k=1}^{n} Σ_{p=1}^{n} [V⁻¹]_{ik} [V⁻¹]_{jp} [V]_{kp} = [V⁻¹]_{ij}

that is, the Fisher's Matrix I_ij(μ) is the inverse of the Covariance Matrix.
Example 4.3 The quantity I_ij(θ), as an intrinsic measure of the information that an experiment will provide about the parameters θ, was introduced by R.A. Fisher around 1920 in the context of Experimental Design. It depends only on the conditional density p(data|parameters) and therefore it is not a quantification relative to the knowledge we have on the parameters before the experiment is done, but it is very useful and has many interesting applications; for instance, to compare different procedures to analyze the data. Let's see as an example the charge asymmetry of the angular distribution of the process e⁺e⁻ → μ⁺μ⁻. If we denote by θ the angle between the incoming electron and the outgoing μ⁻ and take x = cos θ ∈ [−1, 1], the angular distribution can be expressed at first order in electroweak perturbations as

p(x|a) = (3/8)(1 + x²) + a x

where a, the asymmetry coefficient, is bounded by |a| ≤ a_m = 3/4 since p(x|a) ≥ 0. Now, suppose that an experiment e(n) provides n independent observations x = {x_1, x_2, …, x_n}, so we have the joint density

p(x|a) = Π_{i=1}^{n} [ (3/8)(1 + x_i²) + a x_i ]
In this case, there are no minimal sufficient statistics but, instead of working with the whole sample, we may simplify the analysis and classify the events in two categories: Forward (F) if θ ∈ [0, π/2] (x ∈ [0, 1]) or Backward (B) if θ ∈ [π/2, π] (x ∈ [−1, 0]). Then

p_F = ∫_0^1 p(x|a) dx = (1/2)(1 + a) and p_B = ∫_{−1}^0 p(x|a) dx = (1/2)(1 − a) = 1 − p_F

and the model we shall use for inferences on a is the simpler Binomial model

P(n_F|n, a) = (1/2^n) C(n, n_F) (1 + a)^{n_F} (1 − a)^{n−n_F} ; n_F = 0, 1, …, n
I_{A2}(a) = Σ_{n_F=0}^{n} P(n_F|n, a) ( ∂ln P(n_F|n, a)/∂a )² = n / (1 − a²)

where α = a/a_m ∈ [−1, 1]. For any a ∈ [−3/4, 3/4], it holds that I_{A1}(a) > I_{A2}(a), so it is preferable to do the first analysis.
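The comparison can be checked numerically. Since ∂p(x|a)/∂a = x, the per-event Fisher information of the full angular analysis is I_{A1}(a) = ∫_{−1}^{1} x²/p(x|a) dx, which can be evaluated with a simple midpoint rule and compared with the per-event Forward–Backward value 1/(1 − a²):

```python
import math

def p(x, a):
    """Angular density p(x|a) = (3/8)(1 + x^2) + a x on [-1, 1]."""
    return 0.375 * (1.0 + x * x) + a * x

def fisher_full(a, steps=20_000):
    """Per-event Fisher information of the full angular analysis:
       I_A1(a) = int_{-1}^{1} x^2 / p(x|a) dx, by the midpoint rule."""
    h = 2.0 / steps
    s = 0.0
    for k in range(steps):
        x = -1.0 + (k + 0.5) * h
        s += x * x / p(x, a) * h
    return s

def fisher_fb(a):
    """Per-event Fisher information of the Forward-Backward counting."""
    return 1.0 / (1.0 - a * a)

a = 0.3
i1, i2 = fisher_full(a), fisher_fb(a)       # expect i1 > i2
```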
4.6 Some Properties of Information

Consider the random quantity X ∼ p(x|θ), the sample x = {x₁, …, xₙ} and the
Mutual Information

$$I(X_1,\ldots,X_n : \theta) = \int_{\Theta} d\theta\,\pi(\theta)\int_{X} d\mathbf{x}\;p(\mathbf{x}|\theta)\,\ln\frac{p(\mathbf{x}|\theta)}{p(\mathbf{x})}$$
If the n observations are independent, p(x|θ) = p(x₁|θ)⋯p(xₙ|θ) and therefore

$$\int_X d\mathbf{x}\;p(\mathbf{x}|\theta)\,\ln p(\mathbf{x}|\theta) \,=\, \sum_{i=1}^{n}\int_X dx_i\;p(x_i|\theta)\,\ln p(x_i|\theta) \,=\, n\int_X dx\;p(x|\theta)\,\ln p(x|\theta)$$

so:

$$I(X_1,\ldots,X_n:\theta) \,=\, n\left[\int_{\Theta} d\theta\,\pi(\theta)\int_X dx\;p(x|\theta)\,\ln p(x|\theta) \,-\, \int_X dx\;p(x)\,\ln p(x)\right] - D_{KL}[\,p(\mathbf{x})\,\|\,p(x_1)\cdots p(x_n)\,] \,=\, n\,I(X:\theta) - D_{KL}[\,p(\mathbf{x})\,\|\,p(x_1)\cdots p(x_n)\,]$$
and D_KL[p(x) ∥ p(x₁)⋯p(xₙ)] = 0 if, and only if, p(x) = p(x₁)⋯p(xₙ). However, since

$$p(\mathbf{x}) = \int_{\Theta} d\theta\; p(x_1|\theta)\cdots p(x_n|\theta)\,\pi(\theta)$$

the random quantities X₁, X₂, …, Xₙ are correlated through θ and therefore are not
independent. Thus, since D_KL[p(x) ∥ p(x₁)⋯p(xₙ)] > 0, the information provided
by the experiment e(n) is less than n times the one provided by e(1). This is reasonable
because, as the number of samplings grows, the knowledge about the parameter θ
increases, the prior distribution π(θ) is updated to p(θ|x₁, …, xₙ) and further
independent realizations of the same experiment (under the same conditions) will
provide less information.
This is not the case for Fisher's criterion of Information. In fact, if the observations
are independent, it is trivial to see that for x = {x₁, x₂, …, xₙ}

$$I^{(n)}_{ij}(\theta) = E_{\mathbf{X}}\!\left[\frac{\partial\ln p(\mathbf{x}|\theta)}{\partial\theta_i}\;\frac{\partial\ln p(\mathbf{x}|\theta)}{\partial\theta_j}\right] = n\,E_{X}\!\left[\frac{\partial\ln p(x_1|\theta)}{\partial\theta_i}\;\frac{\partial\ln p(x_1|\theta)}{\partial\theta_j}\right] = n\,I^{(1)}_{ij}(\theta)$$
Example 4.4 Consider a random quantity X ∼ N (x|μ, σ) with σ known and μ the
parameter of interest with a prior density π(μ) = N (μ|μ0 , σ0 ). Then
$$p(x,\mu|\mu_0,\sigma,\sigma_0) = N(x|\mu,\sigma)\,N(\mu|\mu_0,\sigma_0);\qquad p(x|\mu_0,\sigma,\sigma_0) = N\!\left(x\,\middle|\,\mu_0,\sqrt{\sigma^2+\sigma_0^2}\right)$$
so the amount of Expected Information from the experiment e(1) on the parameter
μ with respect to the knowledge contained in π(μ) will be

$$I(e(1)|\pi(\mu)) = \int_{-\infty}^{\infty}\! d\mu\int_{-\infty}^{\infty}\! dx\;p(x,\mu|\cdot)\,\ln\frac{p(x,\mu|\cdot)}{\pi(\mu)\,p(x|\cdot)} = \frac{1}{2}\,\ln\!\left(1+\frac{\sigma_0^2}{\sigma^2}\right)$$
Consider now the experiment e(2) that provides two independent observations
{x1 , x2 }. Then
$$p(x_1,x_2,\mu|\cdot) = N(x_1|\mu,\sigma)\,N(x_2|\mu,\sigma)\,N(\mu|\mu_0,\sigma_0)$$

and

$$p(x_1,x_2|\mu_0,\sigma,\sigma_0) = N\!\left(x_1,x_2\,\middle|\,\mu_0,\mu_0,\sqrt{\sigma^2+\sigma_0^2},\sqrt{\sigma^2+\sigma_0^2},\rho\right)$$

where ρ = (1 + σ²/σ₀²)⁻¹; that is, the random quantities X₁ and X₂ are correlated
through π(μ). In this case:
$$I(e(2)|\pi(\mu)) = \frac{1}{2}\,\ln\!\left(1+2\,\frac{\sigma_0^2}{\sigma^2}\right) = 2\,I(e(1)|\pi(\mu)) + \frac{1}{2}\,\ln(1-\rho^2)$$
so D_KL[p(x₁,x₂|·) ∥ p(x₁|·)p(x₂|·)] = −½ ln(1−ρ²). In general, for n independent
observations we have that

$$I(e(n)|\pi(\mu)) = \frac{1}{2}\,\ln\!\left(1+n\,\frac{\sigma_0^2}{\sigma^2}\right)$$

which behaves asymptotically with n as I(e(n)|π(μ)) ∼ ln √n.
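These relations are easy to verify numerically. The following sketch (our own illustration, with arbitrary values σ = 1, σ₀ = 3) checks the marginal density by direct integration, the identity I(e(2)) = 2I(e(1)) + ½ ln(1−ρ²), the sub-additive growth I(e(n)) < n I(e(1)), and the ln √n asymptotic behaviour:

```python
import math

def normal(x, m, s):
    # Normal density N(x|m, s)
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def info_n(n, sigma, sigma0):
    # I(e(n)|pi(mu)) = (1/2) ln(1 + n sigma0^2 / sigma^2)
    return 0.5 * math.log(1.0 + n * sigma0 ** 2 / sigma ** 2)

mu0, sigma, sigma0 = 1.0, 1.0, 3.0
rho = 1.0 / (1.0 + sigma ** 2 / sigma0 ** 2)

# marginal: integral of N(x|mu, sigma) N(mu|mu0, sigma0) dmu = N(x|mu0, sqrt(s^2 + s0^2))
x, steps = 0.7, 100000
lo, hi = mu0 - 12.0 * sigma0, mu0 + 12.0 * sigma0
h = (hi - lo) / steps
num = 0.0
for i in range(steps):
    mu = lo + (i + 0.5) * h
    num += normal(x, mu, sigma) * normal(mu, mu0, sigma0) * h
assert abs(num - normal(x, mu0, math.sqrt(sigma ** 2 + sigma0 ** 2))) < 1e-6

# I(e(2)) = 2 I(e(1)) + (1/2) ln(1 - rho^2)
assert abs(info_n(2, sigma, sigma0)
           - (2.0 * info_n(1, sigma, sigma0) + 0.5 * math.log(1.0 - rho ** 2))) < 1e-12

# correlated observations: I(e(n)) < n I(e(1)); and I(e(n)) - ln sqrt(n) -> ln(sigma0/sigma)
for n in range(2, 30):
    assert info_n(n, sigma, sigma0) < n * info_n(1, sigma, sigma0)
assert abs(info_n(10 ** 9, sigma, sigma0) - 0.5 * math.log(10 ** 9)
           - math.log(sigma0 / sigma)) < 1e-6
```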
• Grouping the observations in a partition {E₁, …, E_N} of the sample space can not
increase the information. Introducing

$$\mu_1(E_i,\theta) = \int_{E_i} p(x,\theta)\,dx \qquad\text{and}\qquad \mu_2(E_i) = \int_{E_i} p(x)\,dx$$

we can write

$$\sum_{i=1}^{N}\int_{\Theta} d\theta\int_{E_i} dx\;p(x,\theta)\,\ln\frac{p(x,\theta)}{p(x)} \,=\, \sum_{i=1}^{N}\int_{\Theta} d\theta\;\mu_1(E_i,\theta)\int_{E_i} dx\;\frac{p(x,\theta)}{\mu_1(E_i,\theta)}\,\ln\!\left[\frac{p(x,\theta)/\mu_1(E_i,\theta)}{p(x)/\mu_2(E_i)}\;\frac{\mu_1(E_i,\theta)}{\mu_2(E_i)}\right]$$

$$=\, \sum_{i=1}^{N}\int_{\Theta} d\theta\;\mu_1(E_i,\theta)\left[\ln\frac{\mu_1(E_i,\theta)}{\mu_2(E_i)} \,+\, \int_{E_i} dx\;f_1(x,\theta)\,\ln\frac{f_1(x,\theta)}{f_2(x)}\right]$$

where

$$f_1(x,\theta) = \frac{p(x,\theta)}{\mu_1(E_i,\theta)} \,\geq\, 0 \qquad\text{and}\qquad f_2(x) = \frac{p(x)}{\mu_2(E_i)} \,\geq\, 0$$

are both normalized densities on each E_i.
Since Σᵢ μ₁(E_i, θ) = π(θ), the information provided by the grouped data is

$$I_G(X:\theta) = \sum_{i=1}^{N}\int_{\Theta} d\theta\;\mu_1(E_i,\theta)\,\ln\frac{\mu_1(E_i,\theta)}{\mu_2(E_i)} \,-\, \int_{\Theta} d\theta\;\pi(\theta)\,\ln\pi(\theta)$$

and therefore

$$I(X:\theta) = I_G(X:\theta) \,+\, \sum_{i=1}^{N}\int_{\Theta} d\theta\;\mu_1(E_i,\theta)\int_{E_i} dx\;f_1(x,\theta)\,\ln\frac{f_1(x,\theta)}{f_2(x)}$$

The last term is a weighted sum of Kullback–Leibler discrepancies, so I(X:θ) ≥ I_G(X:θ)
with equality if, and only if,

$$f_1(x,\theta) = f_2(x) \;\longrightarrow\; \frac{p(x,\theta)}{p(x)} = \frac{\mu_1(E_i,\theta)}{\mu_2(E_i)}$$
• If sufficient statistics exist, using them instead of the whole sample does not
reduce the information.
Given a parametric model p(x₁, x₂, …, xₙ|θ), we know that the set of statistics
t = t(x₁, …, xₙ) is sufficient for θ iff, for all n ≥ 1 and any prior distribution π(θ),
it holds that

$$p(\theta|x_1, x_2, \ldots, x_n) = p(\theta|\mathbf{t})$$

It is then clear that, for inferences about θ, all the information provided by the data
is contained in the set of sufficient statistics.
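As a minimal illustration (our own sketch, not from the text): for the Exponential model the likelihood depends on the data only through n and t = Σᵢ xᵢ, so two different samples with the same (n, t) yield identical likelihoods and hence, for any prior, identical posteriors:

```python
import math

def log_likelihood(sample, lam):
    # full-sample log likelihood for Ex(x|lam): sum over log(lam) - lam * x_i
    return sum(math.log(lam) - lam * x for x in sample)

x1 = [0.3, 1.2, 2.5]   # two different samples with the same
x2 = [1.0, 1.0, 2.0]   # size n = 3 and the same t = sum(x) = 4.0

# the likelihoods (and hence the posteriors) coincide for every lambda
for lam in (0.5, 1.0, 3.0):
    assert abs(log_likelihood(x1, lam) - log_likelihood(x2, lam)) < 1e-9
```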
Example 4.5 Consider the random quantity X ∼ Ex(x|λ) and the experiment e(n).
Under independence of the sample x = {x₁, x₂, …, xₙ}, the likelihood depends on
the data only through the sufficient statistic t = Σᵢ xᵢ, with density

$$p(t|\lambda) = \frac{\lambda^{n}}{\Gamma(n)}\;e^{-\lambda t}\;t^{\,n-1}$$

Taking the conjugate prior π(λ) = Ga(λ|b, 1) = e^{−λ} λ^{b−1}/Γ(b), the Expected
Information of the experiment is

$$I(e(n)|\pi(\lambda)) = \ln\frac{\Gamma(b)}{\Gamma(n+b)} \,-\, n \,+\, (n+b)\,\psi(n+b) \,-\, b\,\psi(b)$$

with ψ(x) the Digamma Function.
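A quick numerical sketch of this result (ours; the digamma function is approximated by a central difference of lgamma, which is more than adequate here) checks that the expected information is positive, grows with n and, consistently with the first property of this section, does so sub-additively:

```python
import math

def psi(x, h=1e-5):
    # digamma via a central difference of lgamma (adequate for this check)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def info_exp(n, b):
    # I(e(n)|pi(lambda)) = ln[Gamma(b)/Gamma(n+b)] - n + (n+b) psi(n+b) - b psi(b)
    return (math.lgamma(b) - math.lgamma(n + b) - n
            + (n + b) * psi(n + b) - b * psi(b))

b = 1.0
vals = [info_exp(n, b) for n in range(1, 10)]
assert all(v > 0.0 for v in vals)                       # information is positive ...
assert all(v2 > v1 for v1, v2 in zip(vals, vals[1:]))   # ... and grows with n ...
assert info_exp(2, b) < 2.0 * info_exp(1, b)            # ... but sub-additively
```

For b = 1 one gets I(e(1)) = 1 − γ ≈ 0.4228, with γ the Euler–Mascheroni constant.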
4.7 Geometry and Information

This last section is somewhat marginal for the use of Information in the context that we
have been interested in but, besides being an active field of research, it illustrates a very
interesting connection with geometry that will almost surely please most physicists.
Consider the family of distributions F = {p(x|θ)} with θ ∈ Θ ⊆ ℝⁿ. They all have the
same functional form, so their difference is determined solely by the values of the
parameters. In fact, there is a one-to-one correspondence between each distribution
p(x|θ) ∈ F and each point θ ∈ Θ, and the "separation" between them will be
determined by the geometrical properties of this parametric space. Intuitively, we
can already see that, in general, this space is a non-Euclidean Riemannian Manifold.
Consider for instance the Normal density N(x|μ, σ) and two points (μ₁, σ₁) and
(μ₂, σ₂) of the parametric space Θ_{μ,σ} = ℝ × ℝ⁺. For a real constant a > 0, if μ₂ =
μ₁ + a and σ₁ = σ₂ (same variance, different mean values) we have the Euclidean
distance
$$d_E = \sqrt{(\mu_2-\mu_1)^2 + (\sigma_2-\sigma_1)^2} \,=\, a$$
The metric tensor is the Fisher matrix, g_{ij}(θ) = I_{ij}(θ), with the contravariant form
g^{ij}(θ) satisfying g^{ik}(θ) g_{kj}(θ) = δ^i_j. In the geometric context it is called the
Fisher–Rao metric tensor and defines the geometry of the parameters' non-Euclidean
Riemannian manifold. The differential distance is given by

$$ds^2 = g_{ij}(\theta)\,d\theta^i\,d\theta^j$$
and for any two points θ₁ and θ₂ of the parametric space, the distance between them
along the trajectory θ(t), parametrized by t ∈ [t₁, t₂] with end points θ(t₁) = θ₁ and
θ(t₂) = θ₂, will be:

$$S(\theta_1,\theta_2) = \int_{\theta_1}^{\theta_2} ds = \int_{t_1}^{t_2}\frac{ds}{dt}\,dt = \int_{t_1}^{t_2}\left[g_{ij}(\theta)\,\frac{d\theta^i}{dt}\,\frac{d\theta^j}{dt}\right]^{1/2} dt$$
The path of shortest information distance between two points is always a geodesic
curve, determined by the second order differential equation

$$\frac{d^2\theta^i}{dt^2} + \Gamma^i_{jk}(\theta)\,\frac{d\theta^j}{dt}\,\frac{d\theta^k}{dt} = 0 \qquad\text{with}\qquad \Gamma^i_{jk}(\theta) = \frac{1}{2}\,g^{im}\left(g_{jm,k} + g_{mk,j} - g_{jk,m}\right)$$
the Christoffel symbols and t the affine parameter along the geodesic. Each point
of the manifold has an associated tensor (the Riemann tensor) given, in its covariant
representation, by

$$R_{iklm} = \frac{1}{2}\left(g_{im,kl} + g_{kl,im} - g_{il,km} - g_{km,il}\right) \,+\, g_{np}\left(\Gamma^{n}_{kl}\,\Gamma^{p}_{im} - \Gamma^{n}_{km}\,\Gamma^{p}_{il}\right)$$
that depends only on the metric and provides a local measure of the curvature; that
is, of how much the Fisher–Rao metric fails to be locally isometric to Euclidean
space. For an n-dimensional manifold it has n⁴ components but, due to the symmetries
R_{iklm} = R_{lmik} = −R_{kilm} = −R_{ikml} and R_{iklm} + R_{imkl} + R_{ilmk} = 0, the
number of independent components is considerably reduced. The only non-trivial
contraction of indices of R_{iklm} is the Ricci curvature tensor R_{ij} = g^{lm} R_{iljm}
(symmetric) and its trace R = g^{ij} R_{ij} is the scalar curvature. For two-dimensional
manifolds, the Gaussian curvature is κ = R₁₂₁₂/det(g_{ij}) = R/2. Last, note that the
invariant differential volume element is dV = √|det g(x)| dx, and √|det g(x)| is the
by now quite familiar Jeffreys' prior.
We have then that, in this geometrical context, the Information is the source of
curvature of the parametric space, and the curvature determines how the information
flows from one point to another. The geodesic distance is an intrinsic distance, invariant
under reparameterizations, and is related to the amount of information difference
between two points; in other words, to how easy it is to discern between them: the
larger the distance, the larger the separation and the easier it will be to discern
between them.
Let's start with a simple one-parameter case: the Binomial distribution Bi(n|N, θ).
In this case

$$g_{11}(\theta) = \frac{1}{\theta(1-\theta)}\,;\qquad g^{11}(\theta) = \theta(1-\theta) \qquad\text{and}\qquad \Gamma^1_{11}(\theta) = -\,\frac{1-2\theta}{2\,\theta(1-\theta)}$$
Then, for any two points θ₁, θ₂ ∈ [0, 1] the geodesic distance will be:

$$S(\theta_1,\theta_2) = \int_{\theta_1}^{\theta_2}\sqrt{g_{11}(\theta)}\;d\theta \,=\, 2\left|\arcsin\sqrt{\theta}\,\right|_{\theta_1}^{\theta_2}$$
the integrand √g₁₁(θ) being, once more, Jeffreys' prior. We leave as an exercise to
see that for the exponential family Ex(x|1/τ)

$$g_{11}(\tau) = \frac{1}{\tau^2}\,;\qquad \Gamma^1_{11}(\tau) = -\frac{1}{\tau}\,;\qquad S(\tau_1,\tau_2) = \left|\log\frac{\tau_2}{\tau_1}\right|$$

so that, for an infinitesimal step, S(τ, τ + dτ) = √g₁₁(τ) dτ.
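Both closed forms can be verified by integrating √g₁₁ numerically (a sketch of ours, plain midpoint rule; none of this code is from the text):

```python
import math

def arc_length(g11, t1, t2, steps=200000):
    # S = integral of sqrt(g11(t)) dt between t1 and t2, midpoint rule
    h = (t2 - t1) / steps
    s = 0.0
    for i in range(steps):
        t = t1 + (i + 0.5) * h
        s += math.sqrt(g11(t)) * h
    return s

# Binomial: g11 = 1/(theta (1 - theta))  ->  S = 2 |arcsin sqrt(theta)| between limits
t1, t2 = 0.1, 0.7
S_bin = 2.0 * (math.asin(math.sqrt(t2)) - math.asin(math.sqrt(t1)))
assert abs(arc_length(lambda t: 1.0 / (t * (1.0 - t)), t1, t2) - S_bin) < 1e-6

# Exponential: g11 = 1/tau^2  ->  S = |log(tau2/tau1)|
tau1, tau2 = 0.5, 4.0
assert abs(arc_length(lambda t: 1.0 / t ** 2, tau1, tau2) - math.log(tau2 / tau1)) < 1e-6
```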
The Normal family N(x|μ, σ) is more interesting. The metric tensor for {μ, σ} is

$$g_{ij} = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix};\qquad \det(g_{ij}) = \frac{2}{\sigma^4}\,;\qquad g^{ij} = \begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2/2 \end{pmatrix}$$
with non-zero Christoffel symbols

$$\Gamma^1_{12} = \Gamma^1_{21} = -\frac{1}{\sigma}\,;\qquad \Gamma^2_{11} = \frac{1}{2\sigma}\,;\qquad \Gamma^2_{22} = -\frac{1}{\sigma}$$

from which R₁₂₁₂ = −1/σ⁴ and R = 2κ = −1, and we conclude that the parametric
manifold (μ, σ) of the Normal family is hyperbolic and with constant curvature
(thus, there is no complete isometric immersion in E³).
The geodesic equations are:

$$\frac{d^2\mu}{dt^2} - \frac{2}{\sigma}\,\frac{d\mu}{dt}\,\frac{d\sigma}{dt} = 0 \qquad\text{and}\qquad \frac{d^2\sigma}{dt^2} + \frac{1}{2\sigma}\left(\frac{d\mu}{dt}\right)^2 - \frac{1}{\sigma}\left(\frac{d\sigma}{dt}\right)^2 = 0$$

so writing

$$\frac{d\sigma}{dt} = \frac{d\sigma}{d\mu}\,\frac{d\mu}{dt} \qquad\text{and}\qquad \frac{d^2\sigma}{dt^2} = \frac{d^2\sigma}{d\mu^2}\left(\frac{d\mu}{dt}\right)^2 + \frac{d\sigma}{d\mu}\,\frac{d^2\mu}{dt^2}$$

we have that:

$$\sigma\,\frac{d^2\sigma}{d\mu^2} + \left(\frac{d\sigma}{d\mu}\right)^2 + \frac{1}{2} = \frac{d(\sigma\sigma')}{d\mu} + \frac{1}{2} = 0 \;\longrightarrow\; 2\,\sigma^2(\mu) = a - (\mu-b)^2$$
where b − √a < μ < b + √a. For the points (μ₁, σ₁) and (μ₂, σ₂), the integration
constants a and b are fixed by the conditions 2σ₁² = a − (μ₁ − b)² and
2σ₂² = a − (μ₂ − b)², and the geodesic distance along this curve can be written as

$$S(\mu_1,\sigma_1;\mu_2,\sigma_2) = \sqrt{2}\,\log\frac{1+\delta}{1-\delta} \qquad\text{with}\qquad \delta = \left[\frac{(\mu_1-\mu_2)^2 + 2(\sigma_1-\sigma_2)^2}{(\mu_1-\mu_2)^2 + 2(\sigma_1+\sigma_2)^2}\right]^{1/2} \in [0,1)$$
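Independently of any closed form, the geodesic equations themselves can be integrated numerically (our own sketch, a plain fourth-order Runge–Kutta step) and the conserved quantities checked along the trajectory: both (dμ/dt)/σ², one of the Killing charges discussed at the end of this section, and the squared speed g_ij u^i u^j stay constant:

```python
def rhs(s):
    # geodesic equations of the Normal family metric:
    # mu'' = (2/sigma) mu' sigma' ;  sigma'' = -(mu'^2)/(2 sigma) + (sigma'^2)/sigma
    mu, sg, dmu, dsg = s
    return (dmu, dsg, 2.0 * dmu * dsg / sg, -0.5 * dmu * dmu / sg + dsg * dsg / sg)

def rk4(s, h):
    def shift(state, k, c):
        return tuple(x + c * y for x, y in zip(state, k))
    k1 = rhs(s)
    k2 = rhs(shift(s, k1, 0.5 * h))
    k3 = rhs(shift(s, k2, 0.5 * h))
    k4 = rhs(shift(s, k3, h))
    return tuple(x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for x, a, b, c, d in zip(s, k1, k2, k3, k4))

state = (0.0, 1.0, 0.8, 0.3)             # mu, sigma, dmu/dt, dsigma/dt at t = 0
charge = state[2] / state[1] ** 2        # (dmu/dt)/sigma^2, conserved along a geodesic
speed = (state[2] ** 2 + 2.0 * state[3] ** 2) / state[1] ** 2   # g_ij u^i u^j
for _ in range(2000):                     # integrate up to t = 1
    state = rk4(state, 5e-4)
assert abs(state[2] / state[1] ** 2 - charge) < 1e-9
assert abs((state[2] ** 2 + 2.0 * state[3] ** 2) / state[1] ** 2 - speed) < 1e-9
```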
Suppose for instance that (μ₁, σ₁) are given. Then, under the hypothesis
H₀: {μ = μ₀}:

$$\inf_{\sigma} S(\mu_1,\sigma_1;\mu_0,\sigma) \;\longrightarrow\; \sigma^2 = \sigma_1^2 + \frac{(\mu_1-\mu_0)^2}{2}$$

and, under H₀: {σ = σ₀}:

$$\inf_{\mu} S(\mu_1,\sigma_1;\mu,\sigma_0) \;\longrightarrow\; \mu = \mu_1$$
β " #−1
p(x|α, β) = 1 + β 2 (x − α)2 ; x ∈ R; (α, β) ∈ R × R+
π
we have for {α, β}:
β 2 /2 0
gi j = ;
0 1/(2β 2 )
$$\Gamma^2_{11} = -\beta^3\,;\qquad \Gamma^1_{12} = -\Gamma^2_{22} = \beta^{-1}\,;\qquad R_{1212} = -\frac{1}{2}\,;\qquad R = 2\kappa = -4$$
and, as for the Normal family, the geodesic equation has a parabolic dependence
β⁻² = a − (α − b)². It may help to consider that

$$\int_{-\infty}^{\infty} p(x|\alpha,\beta)\left[1+\beta^2(x-\alpha)^2\right]^{-n} dx = \frac{\Gamma(n+1/2)}{\Gamma(1/2)\,\Gamma(n+1)}\,;\qquad n\in\mathbb{N}$$
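The integral identity is easy to verify numerically (our own sketch): with the substitution β(x − α) = tan t it reduces to (1/π) ∫ cos^{2n} t dt over (−π/2, π/2), independent of α and β:

```python
import math

def lhs(n, steps=200000):
    # (1/pi) * integral of cos(t)^(2n) over (-pi/2, pi/2), i.e. the original
    # integral after the substitution beta (x - alpha) = tan(t); midpoint rule
    h = math.pi / steps
    s = 0.0
    for i in range(steps):
        t = -0.5 * math.pi + (i + 0.5) * h
        s += math.cos(t) ** (2 * n) / math.pi * h
    return s

for n in (1, 2, 3):
    rhs = math.gamma(n + 0.5) / (math.gamma(0.5) * math.gamma(n + 1))
    assert abs(lhs(n) - rhs) < 1e-9
```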
As for the Normal family, the parametric manifold for the Cauchy family of distribu-
tions is also hyperbolic with constant curvature. Show that this is the general case for
distributions with location and scale parameters p(x|α, β) = β f [β(x − α)] where
x ∈ R and (α, β) ∈ R × R+ .
Problem 4.2 Show that for the metric of Example 2.21 (ratio of Poisson parameters)

$$g_{ij}(\theta,\mu) = \begin{pmatrix} \mu/\theta & 1 \\ 1 & (1+\theta)/\mu \end{pmatrix}$$
the Riemann tensor is zero and therefore the manifold for {θ, μ} is locally isometric
to the two-dimensional Euclidean space. Show that the geodesics are given by θ(μ) =
(b₀ + b₁ μ^{−1/2})² and that, in terms of the affine parameter t, μ(t) = (a₀ + a₁ t)². If we
define the new parameters φ₁ = 2(θμ)^{1/2} and φ₂ = 2μ^{1/2}, what do you expect to
get for the manifold M{φ₁, φ₂}?
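A numerical hint for the last question (a sketch of ours, not part of the problem statement): expressing the metric in the coordinates (φ₁, φ₂) through the Jacobian of the inverse map (φ₁, φ₂) ↦ (θ, μ) = (φ₁²/φ₂², φ₂²/4) gives the identity matrix, i.e. a Euclidean manifold:

```python
import math

def theta_mu(phi1, phi2):
    # inverse map: theta = phi1^2 / phi2^2, mu = phi2^2 / 4
    return phi1 ** 2 / phi2 ** 2, phi2 ** 2 / 4.0

def metric(theta, mu):
    # g_ij in (theta, mu) coordinates: [[mu/theta, 1], [1, (1+theta)/mu]]
    return [[mu / theta, 1.0], [1.0, (1.0 + theta) / mu]]

def metric_in_phi(phi1, phi2, h=1e-6):
    # g'_ab = sum_ij (dx^i/dphi^a)(dx^j/dphi^b) g_ij, Jacobian by central differences
    phi = (phi1, phi2)
    J = [[0.0, 0.0], [0.0, 0.0]]
    for a in range(2):
        up, dn = list(phi), list(phi)
        up[a] += h
        dn[a] -= h
        fu, fd = theta_mu(*up), theta_mu(*dn)
        for i in range(2):
            J[i][a] = (fu[i] - fd[i]) / (2.0 * h)
    g = metric(*theta_mu(phi1, phi2))
    return [[sum(J[i][a] * J[j][b] * g[i][j] for i in range(2) for j in range(2))
             for b in range(2)] for a in range(2)]

gp = metric_in_phi(2.0 * math.sqrt(2.0), 2.0)   # i.e. the point theta = 2, mu = 1
for a in range(2):
    for b in range(2):
        assert abs(gp[a][b] - (1.0 if a == b else 0.0)) < 1e-5
```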
As we move along the information geodesic, there are some quantities that
remain invariant. The Killing vectors ζ_μ are given by the first order differential
equation ζ_{μ,ν} + ζ_{ν,μ} − 2Γ^ρ_{μν} ζ_ρ = 0 and, if we denote by u^μ = dx^μ/dt the tangent
vector to the geodesic, with t the affine parameter, one has that ζ_μ u^μ = constant.
For n-dimensional spaces of constant curvature there are n(n+1)/2 of them and
clearly any linear combination with constant coefficients will also be a Killing vector.
In the case of the Normal family we have that:

$$\zeta_\mu = c_1\left(\frac{\mu^2-2\sigma^2}{4\sigma^2},\,\frac{\mu}{\sigma}\right) + c_2\left(\frac{\mu}{2\sigma^2},\,\frac{1}{\sigma}\right) + c_3\left(\frac{1}{\sigma^2},\,0\right)$$

so, along a geodesic, the three quantities

$$\frac{\mu^2-2\sigma^2}{4\sigma^2}\,\frac{d\mu}{dt} + \frac{\mu}{\sigma}\,\frac{d\sigma}{dt}\,;\qquad \frac{\mu}{2\sigma^2}\,\frac{d\mu}{dt} + \frac{1}{\sigma}\,\frac{d\sigma}{dt} \qquad\text{and}\qquad \frac{1}{\sigma^2}\,\frac{d\mu}{dt}$$
will remain constant. In fact, it is easier to derive from these first order differential
equations the expression of the geodesic as a function of the affine parameter t:

$$\mu(t) = b + \sqrt{a}\,\tanh(c_1 t + c_0) \qquad\text{and}\qquad \sigma(t) = \sqrt{\frac{a}{2}}\;\frac{1}{\cosh(c_1 t + c_0)}$$
With respect to the standardized Normal N(x|0, 1), all the points (μ, σ) of the
manifold ℝ × ℝ⁺ with the same geodesic distance S ≥ 0 are given by

$$\mu = \pm\left[4\,\sigma\cosh(S/\sqrt{2}) - 2\,(1+\sigma^2)\right]^{1/2}$$
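Assuming the closed form S = √2 log[(1+δ)/(1−δ)] used above for the geodesic distance, the equidistant curve can be checked by a round trip (our own sketch): pick σ, recover μ from the formula, and verify that the resulting point is indeed at distance S from N(x|0, 1):

```python
import math

def geodesic_distance(mu1, s1, mu2, s2):
    # Fisher-Rao distance for the Normal family, S = sqrt(2) log[(1+delta)/(1-delta)]
    d2 = (mu1 - mu2) ** 2
    delta = math.sqrt((d2 + 2.0 * (s1 - s2) ** 2) / (d2 + 2.0 * (s1 + s2) ** 2))
    return math.sqrt(2.0) * math.log((1.0 + delta) / (1.0 - delta))

def mu_at_distance(S, sigma):
    # mu = +-[4 sigma cosh(S/sqrt(2)) - 2 (1 + sigma^2)]^(1/2)
    return math.sqrt(4.0 * sigma * math.cosh(S / math.sqrt(2.0)) - 2.0 * (1.0 + sigma ** 2))

S = 1.0
for sigma in (0.8, 1.0, 1.2):
    mu = mu_at_distance(S, sigma)
    # the point (mu, sigma) is at geodesic distance S from N(x|0, 1)
    assert abs(geodesic_distance(0.0, 1.0, mu, sigma) - S) < 1e-9

# special case mu1 = mu2: S reduces to sqrt(2) |log(sigma2/sigma1)|
assert abs(geodesic_distance(0.0, 1.0, 0.0, 3.0) - math.sqrt(2.0) * math.log(3.0)) < 1e-9
```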
Figure 4.1 (left) shows the curves in the (μ, σ) plane whose points have the same
geodesic distance with respect to the Normal density N(x|0, 1). The inner set
corresponds to a distance of d_G = 0.1, the outer one to d_G = 1.5 and those in between
to increasing steps of 0.2.
Fig. 4.1 Left: set of points in the (μ, σ) plane that have the same geodesic distance with respect to
the Normal density N(x|0, 1). The inner set corresponds to a distance of d_G = 0.1, the outer one
to d_G = 1.5 and those in between to increasing steps of 0.2. Right: Euclidean distance (d_E),
symmetrized Kullback–Leibler discrepancy (d_KL) and Hellinger distance (d_H) for the set of
(μ, σ) points that have the same geodesic distance d_G = 0.5
References