Bayesian Uncertainty Quantification
Contents
1 Introduction
2 Bayesian Framework
  2.1 Bayes’ theorem
  2.2 Example: The coin flipping problem
  2.3 Example: Linear model
5 Sampling methods
  5.1 Function Inversion
  5.2 Rejection Sampling
  5.3 Markov Chain Monte Carlo
1 Introduction
In science, we attempt to describe, understand, and predict systems via models
that depend on parameters. These models are an approximation of reality
and contain several sources of uncertainty, including modeling and numerical
errors. Furthermore, we often do not know the parameters of the model, or how
sensitive the output of the model is with respect to the parameters. We wish
to describe the uncertainty of these parameters given observations of the real
system. We will here present the steps to complete this process.
In Section 2, we will present Bayes’ theorem and its applications in
the field of uncertainty quantification. In Section 3, we will describe how to
derive analytical estimates to quantify the uncertainty in parameters. In
Section 4, we will introduce the concept of Monte Carlo methods, which are
the basis of most numerical methods used in uncertainty quantification. Finally,
in Section 5, we will present numerical methods able to sample from arbitrary
distributions.
2 Bayesian Framework
2.1 Bayes’ theorem
Let X and Y be two random variables (r.v.) with densities pX and pY. Bayes’
theorem states that

    p_{X|Y}(x \mid y) = \frac{p_{Y|X}(y \mid x)\, p_X(x)}{p_Y(y)}.    (1)
The density pX|Y is called the posterior probability. The term pY|X is viewed
as a function of x, since on the left-hand side of Eq. (1) we condition on the
fixed value y of the random variable Y. As a function of x, this term is called
the likelihood function and is a measure of how likely it is to observe the value
y for the r.v. Y conditioned on the value x for the r.v. X. Notice that pY|X is
not a probability density as a function of x. The term pX is called the prior
distribution and represents our belief about the values of X prior to observing
any values of the random variable Y. Finally, the denominator is defined as
    p_Y(y) = \int p_{Y|X}(y \mid x)\, p_X(x)\, dx,    (2)
and is the normalizing constant that makes the right hand side of Eq. (1) a
probability density function.
In order to simplify the notation, we drop the dependence of the density on
the random variable; which density is meant will be evident from the arguments.
For example, when we write p(x|y) we mean p = pX|Y, and when we write p(x)
we mean p = pX.
Bayes’ theorem in action The way we will use Bayes’ theorem in the next
sections is the following. First, we make some assumptions:
• We assume that we have a computational model that depends on some
parameters. These parameters are considered to be random variables and
will be denoted by X. A prior distribution can be imposed on them,
e.g. if we know that X takes only positive values pX can be the gamma
distribution.
• We have observed a set of data, y. We assume that data are also r.v. that
follow a probability distribution.
• The likelihood function of the data, pY|X, is either known explicitly or
can be modeled based on other assumptions.
Based on these assumptions and using Eq. (1) we are able to find the distribution
of the parameters conditioned on the data. Stated differently, we can answer the
question “what values for the parameters will make the computational model
fit the data better?”.
In order to fix the notation, we will denote the random variable that repre-
sents the parameters and the data with X and D respectively and a realization
from these variables with x and d.
Here, H plays the role of the model parameter x. Suppose we observe “R heads
in N tosses”. We want to estimate the posterior distribution of H given the
observed data d = (R, N ),
p(H | d). (5)
Using Bayes’ theorem, we write
    y = x + \varepsilon,    (9)
Figure 1: Evolution of the posterior density of the bias-weighting of a coin as
the number of data increases. The different lines show posterior densities for
different priors. The first figure, for 0 data points, represents the three priors.
It can be seen that after many observations, the three posterior densities converge
to the same distribution. However, the effect of the prior is evident for smaller
observation data sets, as the biased Gaussian prior converges later to the actual
posterior density. Taken from [1].
where µ and σ are given. After a single observation y = d ∈ R, we can write
from Bayes’ theorem
    p(x \mid d) = \frac{p(d \mid x)\, p(x)}{p(d)}.    (12)
From Eq. (9) and Eq. (11), the likelihood p(d| x) is a Gaussian centered at x
with variance 1,
    p(d \mid x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (d - x)^2 \right).
Substituting the prior and the likelihood inside Eq. (12) gives the posterior
distribution,
    p(x \mid d) \propto \exp\left[ -\frac{1}{2} \left( \frac{(d - x)^2}{1} + \frac{(x - \mu)^2}{\sigma^2} \right) \right],
Note that in this case, the robust posterior prediction gives a smaller confi-
dence interval, so adding data increases the robustness of the prediction. These
quantities are represented in Fig. 2.
The same result can be obtained from Eq. (3) or Eq. (4). In this particular
case, p(y | x) = N (x, 1).
Figure 2: Left: Prior and Posterior distributions, p(x) and p(x |d), of the param-
eter x. Adding data increases our confidence in x. Right: Prior and Posterior
Robust predictions of y. Again, adding data in this particular case increases
the confidence in the prediction.
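As a numerical sanity check of the Gaussian posterior above, the following sketch (assuming NumPy; the values of µ, σ and d are hypothetical, chosen for illustration) compares the closed form obtained by completing the square with moments computed by normalizing the unnormalized posterior on a grid:

```python
import numpy as np

# Hypothetical values: prior N(mu, sigma^2), one datum d, likelihood N(x, 1).
mu, sigma, d = 0.0, 2.0, 1.5

# Completing the square in the exponent of the posterior gives a Gaussian
# with precision 1 + 1/sigma^2.
post_var = 1.0 / (1.0 + 1.0 / sigma**2)
post_mean = post_var * (d + mu / sigma**2)

# Compare with moments of the unnormalized posterior on a grid.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
w = np.exp(-0.5 * ((d - x)**2 + (x - mu)**2 / sigma**2))
w /= w.sum() * dx                          # normalize to a density
num_mean = (x * w).sum() * dx
num_var = ((x - num_mean)**2 * w).sum() * dx

print(post_mean, num_mean)                 # both 1.2 for these values
print(post_var, num_var)                   # both 0.8
```

The grid check is crude but confirms that the posterior is exactly Gaussian here, since prior and likelihood are both Gaussian in x.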
• the best estimate, which is the model parameter for which the posterior
density function is maximized,
• a measure of the reliability of the best estimate.
The posterior distribution around the best estimate can be locally approxi-
mated by a Gaussian distribution, employing the Laplace approximation
method. The Laplace approximation uses the Taylor expansion of a function
around its global maximum in order to construct a Gaussian (exponential) form
of that function. Since the posterior distribution is approximated by a Gaussian,
the logarithm of the posterior plays the role of the function to which the Taylor
expansion is applied. The main idea of the Laplace approximation method
is discussed below.
x̂, which corresponds to the maximum of p(x), we have

    L(x) = L(\hat x) + \frac{1}{2} \frac{\partial^2 L}{\partial x^2}\bigg|_{\hat x} (x - \hat x)^2 + O\big((x - \hat x)^3\big),    (15)
where we used Eq. (13). Keeping only terms up to second order, we can write
the probability distribution as

    p(x) \approx A \exp\left( \frac{1}{2} \frac{\partial^2 L}{\partial x^2}\bigg|_{\hat x} (x - \hat x)^2 \right),
where A = \exp(L(\hat x)) is a constant. We obtained a Gaussian approximation
of the probability density function with variance

    \sigma^2 = -\left( \frac{\partial^2 L}{\partial x^2}\bigg|_{\hat x} \right)^{-1}.
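The one-dimensional recipe above can be sketched numerically. The target below is a hypothetical log-density L(x) = a log x − x (a Gamma-type distribution), chosen because the maximum and the Laplace variance are known analytically: x̂ = a and σ² = a.

```python
# Hypothetical log-density for illustration: L(x) = a * log(x) - x,
# for which x_hat = a and the Laplace variance is sigma^2 = a.
a = 10.0
dL  = lambda x: a / x - 1.0          # L'(x)
d2L = lambda x: -a / x**2            # L''(x)

# Find the maximum x_hat by Newton iteration on the condition L'(x) = 0.
x_hat = 5.0                          # initial guess
for _ in range(50):
    x_hat -= dL(x_hat) / d2L(x_hat)

# Laplace approximation: sigma^2 = -1 / L''(x_hat).
sigma2 = -1.0 / d2L(x_hat)
print(x_hat, sigma2)                 # both equal a = 10 analytically
```

The same second derivative that locates the curvature at the maximum directly yields the width of the Gaussian approximation.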
In two dimensions, x = (x_1, x_2), the best estimate satisfies \nabla L(\hat x) = 0.
The log-likelihood around the best estimate (\hat x_1, \hat x_2) is then approximated by a
Taylor series expansion,

    L(x) \approx L(\hat x) + \frac{1}{2} \left[ \frac{\partial^2 L}{\partial x_1^2}\bigg|_{\hat x} (x_1 - \hat x_1)^2 + \frac{\partial^2 L}{\partial x_2^2}\bigg|_{\hat x} (x_2 - \hat x_2)^2 + 2 \frac{\partial^2 L}{\partial x_1 \partial x_2}\bigg|_{\hat x} (x_1 - \hat x_1)(x_2 - \hat x_2) \right].

Defining

    A = \frac{\partial^2 L}{\partial x_1^2}\bigg|_{\hat x}, \qquad B = \frac{\partial^2 L}{\partial x_2^2}\bigg|_{\hat x}, \qquad C = \frac{\partial^2 L}{\partial x_1 \partial x_2}\bigg|_{\hat x},
Figure 3: Laplace approximation. In the limit of many observation data the
probability distribution function p(x) is locally approximated as a Gaussian
around the best estimate, i.e. the value that maximizes the density function.
D-dimensional approximation In higher dimensions, the Taylor expansion
of L(x) about the best estimate x̂ extends as follows,
    L(x) \approx L(\hat x) + \frac{1}{2} (x - \hat x)^T \nabla \nabla^T L(\hat x)\, (x - \hat x).
The covariance of the Gaussian approximation is obtained from the Hessian
H = \nabla \nabla^T L at the best estimate,

    \Sigma = -H^{-1}(\hat x).
    p(H \mid d) \propto H^R (1 - H)^{N - R},
where the constant does not play any role as we want to find the best estimate.
The first two derivatives read,
    \frac{\partial L}{\partial H} = \frac{R}{H} - \frac{N - R}{1 - H},

    \frac{\partial^2 L}{\partial H^2} = -\frac{R}{H^2} - \frac{N - R}{(1 - H)^2}.
The condition Eq. (13) gives the best estimate \hat H = R/N. The standard deviation
is therefore given by

    \sigma = \left( -\frac{\partial^2 L}{\partial H^2}\bigg|_{\hat H} \right)^{-1/2} = \sqrt{ \frac{\hat H (1 - \hat H)}{N} }.
Note that the certainty increases as we add more data. We can also notice that
it is easier to detect a biased coin than a fair coin: indeed, the uncertainty is
maximized for \hat H = 1/2.
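A quick numerical check of the coin result (assuming NumPy; the counts R and N are hypothetical): normalize H^R (1 − H)^{N−R} on a grid and compare its mean and standard deviation with the Laplace values Ĥ and σ.

```python
import numpy as np

# Hypothetical data for illustration: R heads in N tosses.
R, N = 70, 100

# Laplace approximation from the derivation above.
H_hat = R / N
sigma = np.sqrt(H_hat * (1 - H_hat) / N)

# Normalize the posterior H^R (1-H)^(N-R) on a grid and compute its moments.
H = np.linspace(1e-6, 1.0 - 1e-6, 100001)
dH = H[1] - H[0]
post = H**R * (1.0 - H)**(N - R)
post /= post.sum() * dH
mean = (H * post).sum() * dH
std = np.sqrt(((H - mean)**2 * post).sum() * dH)

print(H_hat, sigma)   # 0.7 and about 0.046
print(mean, std)      # close to the Laplace values for this N
```

The small residual differences reflect the skewness of the exact posterior, which the Gaussian approximation ignores; they shrink as N grows.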
Here, we assume an uninformative uniform prior for the mean of the Gaussian,

    p(\mu) = \begin{cases} \frac{1}{\mu_{\max} - \mu_{\min}}, & \mu_{\min} \le \mu \le \mu_{\max}, \\ 0, & \text{otherwise}. \end{cases}
We compute the log-likelihood,
2. robust posterior prediction:
    p(y \mid d) = \int p(y \mid x)\, p(x \mid d)\, dx,
    \mu_f = \mathbb{E}[f(x)],

    \sigma_f^2 = \mathrm{Var}[f(x)] = \mathbb{E}[f(x)^2] - \mu_f^2,
the Law of Large Numbers and the Central Limit Theorem give

    \lim_{N \to \infty} \hat\mu_{f,N} = \mu_f, \qquad \hat\mu_{f,N} \sim \mathcal{N}\big(\mu_f, \sigma_f^2 / N\big).
In summary:
• The error of the sample estimate decreases as 1/\sqrt{N}.
• The sample estimate is an unbiased estimate of the true value.
• The convergence of the estimate is independent of the dimensionality of the
problem.
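These properties can be illustrated with a short Monte Carlo experiment (assuming NumPy; the integrand f(x) = x² with x ∼ U(0, 1), for which µ_f = 1/3 exactly, is a hypothetical choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical integrand: f(x) = x^2 with x ~ U(0, 1), so E[f(x)] = 1/3.
f = lambda x: x**2

errors = {}
for N in (10**2, 10**4, 10**6):
    x = rng.random(N)
    errors[N] = abs(f(x).mean() - 1.0 / 3.0)
    # The CLT predicts the error scales like sigma_f / sqrt(N).
    print(N, errors[N])
```

Increasing N by a factor of 100 should shrink the typical error by about a factor of 10, independent of how many dimensions x has.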
for generating integers in the interval [0, M − 1]. In this case, Z_i is the remainder
of the division of g(Z_{i-1}, \ldots, Z_{i-m}) by M. In a simpler form, Z_i can be generated
by

    Z_i = \alpha Z_{i-1} \bmod M,

for Z_0 = 1 and i ≥ 1. For the above, M is a large prime number and α an
integer [2].
The resulting sequence of random numbers turns out to be ergodic with
period M − 1. A sequence is accepted when it satisfies certain criteria. For
example, for a choice of M = 10^9, not all α result in a good-quality sequence
of random numbers. A recommended choice is α = 7^5, M = 2^{31} − 1 [3].
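A minimal sketch of this recursion with the recommended parameters α = 7⁵ and M = 2³¹ − 1:

```python
# Lehmer generator: Z_i = alpha * Z_{i-1} mod M, with the recommended
# parameters alpha = 7^5 and M = 2^31 - 1.
ALPHA, M = 7**5, 2**31 - 1

def lehmer(z):
    """Return the next state of the generator."""
    return (ALPHA * z) % M

z, samples = 1, []                  # Z_0 = 1
for _ in range(5):
    z = lehmer(z)
    samples.append(z / M)           # rescale states to uniforms in (0, 1)
print(samples)
```

Dividing each state by M rescales the integer sequence to pseudo-uniform numbers in (0, 1), which is how such generators are typically consumed.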
can be written as

    I = |\Omega| \int_\Omega f(x)\, p(x)\, dx = |\Omega|\, \mathbb{E}_p[f].

Here p is the uniform distribution over \Omega from which samples are drawn,

    p(x) = \begin{cases} \frac{1}{|\Omega|}, & x \in \Omega, \\ 0, & \text{otherwise}. \end{cases}
For these cases, the pseudo-random number generators are sufficient. In the
general case, we want to approximate the integral Eq. (16) for a non-uniform
density p. Usually, the distribution p is known up to a constant factor,

    p(x) = \frac{\phi(x)}{Z},

where Z = \int_{-\infty}^{\infty} \phi(x)\, dx. In the next sections, we will discuss how to generate
random numbers from such distributions.
where the samples \{x^{(k)}\}_{k=1}^N are i.i.d. and follow the density p. There is an
infinite amount of choices for p from which we would like to select one which:
1. is easy to sample,
2. minimizes the error of the estimate for a finite number of samples.
A measure of the error of the estimate is given by

    \mathbb{E}_p\left[ \left( \frac{f(x)}{p(x)} - I \right)^2 \right] = \int \frac{f(x)^2}{p(x)}\, dx - I^2.    (18)

This error vanishes for the choice

    p(x) = \frac{f(x)}{I},
but this expression implies that we already know I. In practice, we choose p
“similar” to f .
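As an importance-sampling sketch (assuming NumPy; both the integrand f and the proposal width s are hypothetical choices): estimate I = ∫ f(x) dx for f(x) = exp(−x²/2), whose exact value is √(2π), using a Gaussian density p that is “similar” to f:

```python
import numpy as np

rng = np.random.default_rng(1)

# Integrand: f(x) = exp(-x^2 / 2), so I = sqrt(2 * pi) exactly.
f = lambda x: np.exp(-0.5 * x**2)

# Importance density p "similar" to f: a zero-mean Gaussian with a
# hypothetical width s = 1.2.
s = 1.2
p = lambda x: np.exp(-0.5 * (x / s)**2) / (s * np.sqrt(2.0 * np.pi))

N = 100_000
x = rng.normal(0.0, s, N)                # i.i.d. samples from p
est = np.mean(f(x) / p(x))               # sample estimate of I = E_p[f / p]

print(est, np.sqrt(2.0 * np.pi))         # estimate vs exact value
```

With s = 1 the ratio f/p would be constant and the variance would vanish, mirroring the optimal choice p = f/I discussed above; s = 1.2 gives a small but nonzero variance.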
5 Sampling methods
5.1 Function Inversion
Let X be a real random variable with probability density function pX and
corresponding cumulative distribution function
    F_X(x) = \int_{-\infty}^{x} p_X(r)\, dr.
The idea behind the “Inverse Transform Sampling” method is that samples from
the density pX (x) can be generated by a transformation
x = g(u),
where u ∼ U(0, 1). We will identify the function g such that X follows the
desired density pX. The densities of X and U should satisfy

    p_X(x)\, dx = p_U(u)\, du,    (19)

which leads to

    p_X(x) = p_U(u)\, \frac{du}{dx} = p_U(u) \left( \frac{dg(u)}{du} \right)^{-1}.
For the r.v. U drawn from a uniform probability distribution, U ∼ p_U(u) =
U(u | 0, 1), the probability of generating a random number between u and u + du
is

    p_U(u)\, du = \begin{cases} du, & 0 \le u < 1, \\ 0, & \text{otherwise}. \end{cases}
Therefore, p_U(u) = 1 for u ∈ [0, 1], and integrating Eq. (19) yields

    \int_{-\infty}^{x} p_X(r)\, dr = \int_0^u p_U(r)\, dr = u,

that is,

    F_X(x) = u.
Assuming that F_X has an inverse F_X^{-1}, we obtain

    x = g(u) = F_X^{-1}(u).
For simple density functions pX for which F_X^{-1} is known, it is therefore easy to
generate samples of X. However, the inverse is not available in general, which led
to the development of the sampling algorithms discussed later in this section.
Consider the exponential density

    p_X(x) = \lambda e^{-\lambda x}, \qquad x \ge 0.

Setting the random variable u := F_X(x), x can be sampled from the inverse
transformation

    x = g(u) = F_X^{-1}(u),

or equivalently

    F_X(x) = u \;\Rightarrow\; 1 - e^{-\lambda x} = u,
which results in

    x = -\frac{1}{\lambda} \ln(1 - u).

If samples \{u^{(k)}\}_{k=1}^N are drawn from U(0, 1), then the samples
x^{(k)} = -\frac{1}{\lambda} \ln(1 - u^{(k)}) follow p_X.
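This is easy to verify numerically (assuming NumPy; the rate λ = 2 is a hypothetical choice):

```python
import numpy as np

rng = np.random.default_rng(2)

lam = 2.0                                # hypothetical rate parameter
u = rng.random(100_000)                  # u^(k) ~ U(0, 1)
x = -np.log(1.0 - u) / lam               # inverse transform derived above

# The exponential density with rate lam has mean 1/lam and variance 1/lam^2.
print(x.mean(), x.var())                 # should be close to 0.5 and 0.25
```

Matching the first two sample moments against the known mean 1/λ and variance 1/λ² is a quick sanity check that the transformation is correct.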
• φ is drawn from a uniform distribution with a simple transformation, so that

    p_\Phi(\phi) = \begin{cases} \frac{1}{2\pi}, & 0 \le \phi \le 2\pi, \\ 0, & \text{otherwise}. \end{cases}
The function

    \chi_{[0, p(x)]}(u) = \begin{cases} 1, & 0 \le u \le p(x), \\ 0, & \text{otherwise}, \end{cases}
Figure 4: Assuming we can sample from the distribution p, we draw a sample
x ∼ p and then a uniform number u ∼ U(0, p(x)). If we marginalize the samples
(x, u) we recover the distribution p, as shown in Eq. (20).
can be seen as the joint distribution p_{X,U} of the random variables X and U
following the distributions p(x) and p(u | x) = U(u | 0, p(x)). Marginalizing the
joint distribution over U, we recover the distribution p,

    p(x) = \int p_{X,U}(x, u)\, du.    (21)
Figure 5: Demonstration of the Accept-Reject algorithm. 1. A sample x is
drawn from the distribution q. 2. A random number u is drawn uniformly in
[0, M q(x)]. 3. The sample x is accepted if u < p(x), i.e., if the point (x, u) is
below the graph of p, and rejected otherwise.
    M > \max_x \frac{p(x)}{q(x)}.
Algorithm 1 Rejection sampling algorithm.
Input: densities p, q and constant M > 0 such that p(x)/q(x) ≤ M for all x
Output: a sample distributed according to p
function RejectionSampling(p, q, M)
    Generate x ∼ q                          ▷ Propose a new sample
    Generate u ∼ U(0, M q(x))
    if u < p(x) then
        return x                            ▷ Accept the proposed sample
    else
        return RejectionSampling(p, q, M)   ▷ Reject and try again
    end if
end function
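A minimal Python sketch of Algorithm 1, written iteratively rather than recursively (the target p(x) = 2x on [0, 1] and the uniform proposal q are hypothetical choices; here max p/q = 2, so M = 2 is the tightest valid bound):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical target: p(x) = 2x on [0, 1]; proposal q = U(0, 1), so q(x) = 1
# and M >= max_x p(x)/q(x) = 2.
p = lambda x: 2.0 * x
M = 2.0

def rejection_sampling():
    """Draw one sample from p (Algorithm 1, written as a loop)."""
    while True:
        x = rng.random()                 # propose x ~ q
        u = rng.uniform(0.0, M * 1.0)    # u ~ U(0, M q(x))
        if u < p(x):
            return x                     # accept the proposed sample

samples = np.array([rejection_sampling() for _ in range(20_000)])
print(samples.mean())                    # E[x] under p is 2/3
```

The loop form avoids deep recursion when the acceptance rate is low; the acceptance rate here is 1/M = 1/2, so on average two proposals are needed per accepted sample.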
    M \ge \max_{x \in [a,b]} (b - a)\, p(x).
Since M should be as small as possible, we select the lower bound in the above
inequality,

    M = (b - a) \max_{x \in [a,b]} p(x).
The proposed state x′ is then accepted with acceptance probability A(x′ | y). The
acceptance probability must be chosen to satisfy the detailed balance condition
Eq. (26). We now define
    q(x \mid y) = \frac{T(y \mid x)\, p_{eq}(x)}{T(x \mid y)\, p_{eq}(y)}.    (28)
Note that q(x | y) ≥ 0. It is easy to check that the detailed balance condition is
satisfied for

    A(x \mid y) = \min\big( 1, q(x \mid y) \big).
The algorithm used to generate one sample from a given state is summarized in
Algorithm 3. Note that it is sufficient to know peq only up to a constant factor.
Indeed, in the M(RT)2 algorithm, peq appears only in a ratio (see Eq. (28)).
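A minimal sketch of one possible M(RT)² implementation (assuming NumPy, a symmetric Gaussian proposal T so that the transition ratio in Eq. (28) cancels, and a hypothetical unnormalized target φ(x) = exp(−x²/2), i.e. p_eq = N(0, 1)):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical unnormalized target phi(x) = exp(-x^2 / 2); only ratios of
# phi are needed, consistent with the remark above.
log_phi = lambda x: -0.5 * x**2

x = 0.0
chain = []
for _ in range(100_000):
    x_prop = x + rng.normal(0.0, 1.0)        # symmetric proposal T
    # Acceptance A = min(1, phi(x_prop) / phi(x)), applied in log space.
    if np.log(rng.random()) < log_phi(x_prop) - log_phi(x):
        x = x_prop
    chain.append(x)                          # rejected moves repeat x

chain = np.array(chain[5_000:])              # discard burn-in
print(chain.mean(), chain.var())             # should be close to 0 and 1
```

Note that a rejected proposal still contributes the current state to the chain, matching the "probability of not moving away from x" term in the convergence argument that follows.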
We will now demonstrate that the M(RT)2 algorithm indeed converges to the
desired distribution peq . We define the probability densities φi of each random
variable xi in the sequence. Given φn , we can write φn+1 (x) as a sum of two
contributions:
• Probability of accepting a new state,

    P(\text{previous state was not } x) = \int A(x \mid y)\, T(x \mid y)\, \phi_n(y)\, dy,

• Probability of not moving away from x, i.e. rejecting the proposed state,

    P(\text{previous state was } x) = \phi_n(x) \int \big( 1 - A(y \mid x) \big)\, T(y \mid x)\, dy.
Combining the above contributions gives

    \phi_{n+1}(x) = \int A(x \mid y)\, T(x \mid y)\, \phi_n(y)\, dy + \phi_n(x) \int \big( 1 - A(y \mid x) \big)\, T(y \mid x)\, dy.    (29)
It can be shown that this recursive relation gives an ergodic system (the system
will return to the states already visited with probability one, and every state
is aperiodic). Therefore, according to Theorem 2 (see also [5]), there exists a
unique equilibrium distribution to which the recursion Eq. (29) converges.
Theorem 2 (Feller). If a random variable defines an ergodic system, then there
exists a unique probability distribution p_{eq} that is a fixed point of the above
recursion.
Proof. We will now show that the fixed point is indeed p_{eq}. We substitute
\phi_n = p_{eq} in Eq. (29) and obtain

    \phi_{n+1}(x) = \int A(x \mid y)\, T(x \mid y)\, p_{eq}(y)\, dy + p_{eq}(x) \int \big( 1 - A(y \mid x) \big)\, T(y \mid x)\, dy
                 = \int \big[ A(x \mid y)\, T(x \mid y)\, p_{eq}(y) - A(y \mid x)\, T(y \mid x)\, p_{eq}(x) \big]\, dy + p_{eq}(x) \int T(y \mid x)\, dy
                 = p_{eq}(x) \int T(y \mid x)\, dy
                 = p_{eq}(x),

where we used the detailed balance condition Eq. (26) and the normalization
\int T(y \mid x)\, dy = 1. Therefore, p_{eq} is the asymptotic density distribution of
the random walk.
References
[1] Devinderjit Sivia and John Skilling. Data analysis: a Bayesian tutorial.
OUP Oxford, 2006.
[2] Derrick H Lehmer. Euclid’s algorithm for large numbers. American Mathe-
matical Monthly, pages 227–233, 1938.
[3] Donald E Knuth. The Art of Computer Programming; Volume 2: Seminu-
merical Algorithms. 1981.
[4] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Au-
gusta H Teller, and Edward Teller. Equation of state calculations by fast
computing machines. The journal of chemical physics, 21(6):1087–1092,
1953.
[5] Francesco Petruccione and Peter Biechele. Stochastic methods for physics
using java: An introduction. 2000.