05 MCMC
Lecture 05
Markov Chain Monte Carlo
Philipp Hennig
03 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
#   date    content                        Ex  |  #   date    content                       Ex
1   20.04.  Introduction                   1   |  14  09.06.  Logistic Regression           8
2   21.04.  Reasoning under Uncertainty        |  15  15.06.  Exponential Families
3   27.04.  Continuous Variables           2   |  16  16.06.  Graphical Models              9
4   28.04.  Monte Carlo                        |  17  22.06.  Factor Graphs
5   04.05.  Markov Chain Monte Carlo       3   |  18  23.06.  The Sum-Product Algorithm     10
6   05.05.  Gaussian Distributions             |  19  29.06.  Example: Topic Models
7   11.05.  Parametric Regression          4   |  20  30.06.  Mixture Models                11
8   12.05.  Understanding Deep Learning        |  21  06.07.  EM
9   18.05.  Gaussian Processes             5   |  22  07.07.  Variational Inference         12
10  19.05.  An Example for GP Regression       |  23  13.07.  Example: Topic Models
11  25.05.  Understanding Kernels          6   |  24  14.07.  Example: Inferring Topics     13
12  26.05.  Gauss-Markov Models                |  25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification              7   |  26  21.07.  Revision
Probabilistic ML — P. Hennig, SS 2021 — Lecture 05: Markov Chain Monte Carlo— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 1
$$F := \int f(x)\, p(x)\, \mathrm{d}x \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f(x_i) =: \hat{F} \qquad \text{if } x_i \sim p$$

$$\mathbb{E}_p(\hat{F}) = F \qquad \operatorname{var}_p(\hat{F}) = \frac{\operatorname{var}_p(f)}{N}$$
Recap from last lecture:
▶ Random numbers can be used to estimate integrals → Monte Carlo algorithms
▶ Although the concept of randomness is fundamentally unsound, Monte Carlo algorithms are competitive in high-dimensional problems (primarily because the advantages of the alternatives degrade rapidly with dimensionality)
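The estimator and its variance can be checked in a few lines (a minimal sketch; the choice $f(x) = x^2$ and $p = \mathcal{N}(0, 1)$ is illustrative, so the true value of $F$ is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# F = E_p[f(x)] with p = N(0, 1) and f(x) = x^2, so the true value is 1.
N = 100_000
x = rng.standard_normal(N)        # x_i ~ p
F_hat = np.mean(x ** 2)           # (1/N) sum_i f(x_i)

# The estimator is unbiased; its standard error shrinks as sqrt(var_p(f) / N).
std_err = np.std(x ** 2, ddof=1) / np.sqrt(N)
```

Note that the convergence rate is independent of the dimension of $x$; only $\operatorname{var}_p(f)$ enters.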
But in High Dimensions, Sampling isn’t Easy, Either!
Sampling is harder than global optimization
$$p(x) = \frac{\tilde{p}(x)}{Z}$$
assuming that it is possible to evaluate the unnormalized density p̃ (but not p) at arbitrary points.
Typical example: Compute moments of a posterior
$$p(x \mid D) = \frac{p(D \mid x)\, p(x)}{\int p(D, x)\, \mathrm{d}x} \qquad \text{as} \qquad \mathbb{E}_{p(x \mid D)}(x^n) \approx \frac{1}{S} \sum_{s=1}^{S} x_s^n \quad \text{with } x_s \sim p(x \mid D)$$
Rejection Sampling
a simple method [Georges-Louis Leclerc, Comte de Buffon, 1707–1788]
[Figure: a density over x ∈ [−4, 10] (vertical scale 0–0.4), illustrating rejection sampling.]
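Rejection sampling can be sketched as follows (a minimal illustration; the unnormalized target p̃ and the envelope q below are arbitrary choices, not the slide's density): draw $x \sim q$, draw a uniform height under the scaled envelope $c\,q(x)$, and accept $x$ if the height falls below $\tilde{p}(x)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_tilde(x):
    # unnormalized bimodal target (illustrative choice)
    return np.exp(-0.5 * (x - 1) ** 2) + 0.5 * np.exp(-0.5 * (x - 4) ** 2)

# Gaussian envelope q(x) = N(x; 2, sigma_q^2), scaled by c so that c*q(x) >= p_tilde(x).
sigma_q = 3.0
def q_pdf(x):
    return np.exp(-0.5 * ((x - 2) / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi))

c = 15.0  # chosen generously so the envelope dominates p_tilde everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(2, sigma_q)          # propose from q
        u = rng.uniform(0, c * q_pdf(x))    # uniform height under the envelope
        if u <= p_tilde(x):                 # accept if below the target
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(2000)
```

Accepted points are exact samples from $p(x) = \tilde{p}(x)/Z$; the loop runs until enough are accepted, so a loose $c$ directly costs runtime.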
The Problem with Rejection Sampling
the curse of dimensionality [MacKay, §29.3]
[Figure: p(x) and the scaled envelope c q(x) over x ∈ [−4, 4].]

Example:
▶ $p(x) = \mathcal{N}(x; 0, \sigma_p^2)$
▶ $q(x) = \mathcal{N}(x; 0, \sigma_q^2)$
▶ $\sigma_q > \sigma_p$
▶ in $D$ dimensions, the optimal $c$ is given by
$$c = \frac{(2\pi\sigma_q^2)^{D/2}}{(2\pi\sigma_p^2)^{D/2}} = \left(\frac{\sigma_q}{\sigma_p}\right)^{D} = \exp\left(D \ln \frac{\sigma_q}{\sigma_p}\right)$$
▶ acceptance rate is ratio of volumes: $1/c$
▶ rejection rate rises exponentially in $D$
▶ for $\sigma_q/\sigma_p = 1.1$, $D = 100$: $1/c < 10^{-4}$
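The last bullet is easy to check numerically:

```python
import math

sigma_ratio = 1.1                 # sigma_q / sigma_p
D = 100
c = sigma_ratio ** D              # = exp(D * ln(sigma_q / sigma_p)) ≈ 1.4e4
acceptance_rate = 1.0 / c         # falls below 1e-4
```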
Importance Sampling
a slightly less simple method
The idea: rewrite the integral with respect to a tractable $q$ and weight the samples,
$$\int f(x)\, p(x)\, \mathrm{d}x = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, \mathrm{d}x \approx \frac{1}{S} \sum_{s=1}^{S} \frac{p(x_s)}{q(x_s)} f(x_s) \quad \text{with } x_s \sim q$$

[Figure: left panel over x ∈ [−2, 8]; right panel showing f(x), g(x) over [−20, 100].]
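A minimal importance-sampling sketch (the target p̃, proposal q, and f are illustrative choices): no draw is rejected; each sample from q is reweighted, and because the target is unnormalized the weights are self-normalized.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_tilde(x):
    return np.exp(-0.5 * (x - 3) ** 2)    # unnormalized N(3, 1)

def q_pdf(x):
    # proposal N(0, 4^2), wide enough to cover the target
    return np.exp(-0.5 * (x / 4) ** 2) / (4 * np.sqrt(2 * np.pi))

S = 100_000
x = rng.normal(0, 4, size=S)              # x_s ~ q
w = p_tilde(x) / q_pdf(x)                 # unnormalized importance weights
w /= w.sum()                              # self-normalize (Z of p_tilde unknown)

f = lambda t: t                           # estimate E_p[x]; true value is 3
F_hat = np.sum(w * f(x))
```

The estimator degrades badly when q is a poor match for p (a few huge weights dominate), which is the high-dimensional failure mode discussed on the previous slides.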
Summary: Simple Practical Monte Carlo Methods
1. Producing exact samples is just as hard as high-dimensional integration. Thus, practical MC methods sample from an unnormalized density $\tilde{p}(x) = Z \cdot p(x)$.
2. Even this, however, is hard, because it is hard to build a globally useful approximation to the integrand.
Markov-Chain Monte Carlo
random walks for drawing random numbers
[Figure: Markov chain x₁ → x₂ → x₃ → x₄.]
An analogy
optimization
In optimization, one usually throws away the intermediate estimates at the end and keeps only the “best guess”. But those estimates do contain information about the shape of p̃!
The Metropolis-Hastings∗ Method
∗ Authorship controversial. Likely inventors: M. Rosenbluth, A. Rosenbluth & E. Teller, 1953
Metropolis-Hastings in pictures
t = 1
[Figure: target p(x) and proposal q(x) over x ∈ [−1, 8], vertical scale 0–0.4.]

$$a = \frac{\tilde{p}(x')\, q(x_t \mid x')}{\tilde{p}(x_t)\, q(x' \mid x_t)} \qquad \text{accept with } p = \min(1, a)$$
[The slide repeats with the same figure and accept rule for t = 2, 3, 4, 5, and 300, showing the chain gradually exploring p(x).]
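The accept rule on these slides can be sketched in code (a minimal random-walk example; the target p̃ and step width are illustrative choices): with a symmetric Gaussian proposal, $q(x_t \mid x') = q(x' \mid x_t)$, so the proposal ratio cancels and $a = \tilde{p}(x')/\tilde{p}(x_t)$.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_tilde(x):
    # unnormalized target (illustrative): N(3, 1) without its constant
    return np.exp(-0.5 * (x - 3) ** 2)

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step * rng.standard_normal()   # symmetric proposal q
        a = p_tilde(x_prop) / p_tilde(x)            # q-ratio cancels for symmetric q
        if rng.uniform() < min(1.0, a):
            x = x_prop                              # accept
        chain[t] = x                                # on rejection, x_t is repeated
    return chain

chain = metropolis_hastings(50_000)
burned = chain[5_000:]   # discard burn-in before computing statistics
```

Note that the normalization $Z$ never appears: only ratios of p̃ are evaluated, which is exactly why MCMC applies to unnormalized posteriors.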
Visualization
by Chi Feng https://github.com/chi-feng
https://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH
Why is this a Monte Carlo Method?
MH draws from p(x) in the limit of ∞ samples
Definition (Ergodicity)
A sequence {xt }t∈N is called ergodic if it
1. is aperiodic (contains no recurring sequence)
2. has positive recurrence: xt = x∗ implies there is a t′ > t such that p(xt′ = x∗ ) > 0
Metropolis-Hastings performs a (biased) random walk
hence diffuses as $\mathcal{O}(s^{1/2})$
[Figure: left, the running estimate x̂; right, the error |x̂ − x| against # samples (10⁰–10⁵, log–log), comparing exact sampling with MH.]
Summary: Practical Monte Carlo Methods
1. Producing exact samples is just as hard as high-dimensional integration. Thus, practical MC methods sample from an unnormalized density $\tilde{p}(x) = Z \cdot p(x)$.
2. Even this, however, is hard, because it is hard to build a globally useful approximation to the integrand.
3. Markov Chain Monte Carlo circumvents this problem by using local operations. It converges well only on the scale at which the local models cover the global problem; thus the local behaviour has to be tuned.
Gibbs Sampling
Preparation for Exercise Sheet [D. & S. Geman, 1984]
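Gibbs sampling updates one variable at a time, drawing each coordinate from its full conditional given the current values of all the others; each such update is a Metropolis-Hastings step that is always accepted. A minimal sketch for a 2-d standard Gaussian with correlation ρ (the target is an illustrative choice; for a Gaussian the conditionals are available in closed form):

```python
import numpy as np

rng = np.random.default_rng(4)

rho = 0.8          # correlation of the 2-d standard-normal target
n_steps = 50_000

x = np.zeros(2)
chain = np.empty((n_steps, 2))
for t in range(n_steps):
    # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x[0] = rho * x[1] + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    x[1] = rho * x[0] + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    chain[t] = x

emp_corr = np.corrcoef(chain[1_000:].T)[0, 1]   # approaches rho
```

The axis-aligned moves are exactly why Gibbs mixes slowly on strongly correlated targets such as the banana in the visualization below.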
Visualization
by Chi Feng https://github.com/chi-feng
https://chi-feng.github.io/mcmc-demo/app.html#GibbsSampling,banana
Proceed with Confidence!
and don’t worry, it’ll be fine …
Hamiltonian Monte Carlo
reduce randomness by smoothing [e.g. DJC MacKay, 2003, §30]
$$H(x, p) = E(x) + K(p) \qquad \text{with, e.g., } K(p) = \frac{1}{2} p^\intercal p$$

▶ do Metropolis-Hastings with $p, x$ coupled by Hamiltonian dynamics
$$\dot{x} := \frac{\partial x}{\partial t} = \frac{\partial H}{\partial p} \qquad \dot{p} := \frac{\partial p}{\partial t} = -\frac{\partial H}{\partial x} \qquad \text{(nb: need to solve an ODE!)}$$
▶ note that, due to the additive structure of the Hamiltonian, this (asymptotically) samples from the factorizing joint
$$P_H(x, p) = \frac{1}{Z_H} \exp(-H(x, p)) = \frac{1}{Z_H} \exp(-E(x)) \cdot \exp(-K(p)) \qquad \text{with} \quad P_H(x) = \int P_H(x, p)\, \mathrm{d}p = P(x)$$

[Portrait: William R. Hamilton, 1805–1865, Dublin]
Why does this improve things?
Hidden gems of Hamiltonian Monte Carlo
$$\dot{x} = \frac{\partial H}{\partial p} \qquad \dot{p} = -\frac{\partial H}{\partial x}$$

▶ If $p(x)$ is locally flat, then after $N$ steps $x$ has changed by $Nhp$, so $\mathcal{O}(N)$, not $\mathcal{O}(\sqrt{N})$ as for Metropolis-Hastings! Hamiltonian MC mixes faster than Metropolis-Hastings.
▶ The Hamiltonian is a conserved quantity:
$$\frac{\mathrm{d}H(p, x)}{\mathrm{d}t} = \frac{\partial H}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial H}{\partial p}\frac{\partial p}{\partial t} = \frac{\partial H}{\partial x} \cdot \frac{\partial H}{\partial p} - \frac{\partial H}{\partial p} \cdot \frac{\partial H}{\partial x} = 0$$
So, if we have managed to simulate the dynamics well, then
$$\delta H = 0 \quad \Rightarrow \quad P_H(x', p') = P_H(x, p)$$
and the MH step will always be accepted:
$$a = \frac{\tilde{p}(x', p')\, q(x_t, p_t \mid x', p')}{\tilde{p}(x_t, p_t)\, q(x', p' \mid x_t, p_t)} = \frac{\exp(-H(x', p'))\, q(x_t, p_t \mid x', p')}{\exp(-H(x_t, p_t))\, q(x', p' \mid x_t, p_t)}$$
HMC is a way to construct really good MH proposals that are always accepted (up to numerical errors).
Implementing Hamiltonian Monte Carlo …
Heun’s method for the Hamiltonian System
$$H(x, p) = E(x) + \frac{1}{2} p^\intercal p \qquad \dot{x} = \frac{\partial H}{\partial p} = p \qquad \dot{p} = -\frac{\partial H}{\partial x} = -\nabla_x E(x)$$

▶ We are trying to solve the ordinary differential equation
$$\frac{\mathrm{d}z(t)}{\mathrm{d}t} = f(z(t)) \quad \text{such that } z(t_0) = z_0, \qquad z(t) = \begin{bmatrix} x(t) \\ p(t) \end{bmatrix}, \qquad f\!\left(\begin{bmatrix} x \\ p \end{bmatrix}\right) = \begin{bmatrix} p(t) \\ -\nabla E(x(t)) \end{bmatrix}$$
▶ Heun's method:
$$z(t_i + h) = z_i + \frac{h}{2}\bigl(f(z_i) + f(z_i + h f(z_i))\bigr)$$
$$\begin{bmatrix} x_{i+1} \\ p_{i+1} \end{bmatrix} = \begin{bmatrix} x_i \\ p_i \end{bmatrix} + \frac{h}{2}\left(\begin{bmatrix} p_i \\ -\nabla E(x_i) \end{bmatrix} + f\!\left(\begin{bmatrix} x_i + h p_i \\ p_i - h \nabla E(x_i) \end{bmatrix}\right)\right)$$
$$= \begin{bmatrix} x_i + \frac{h}{2}\bigl(p_i + p_i - h \nabla E(x_i)\bigr) \\ p_i + \frac{h}{2}\bigl(-\nabla E(x_i) - \nabla E(x_i + h p_i)\bigr) \end{bmatrix} = \begin{bmatrix} x_i + h p_i - \frac{h^2}{2} \nabla E(x_i) \\ p_i - \frac{h}{2}\bigl(\nabla E(x_i) + \nabla E(x_i + h p_i)\bigr) \end{bmatrix}$$
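Putting the pieces together (a minimal sketch; the energy E and the step parameters h, L are illustrative choices): resample a momentum, simulate the Hamiltonian dynamics for L Heun steps, then accept with probability $\min(1, \exp(H(x, p) - H(x', p')))$ to correct for the integrator's error.

```python
import numpy as np

rng = np.random.default_rng(5)

E = lambda x: 0.5 * (x - 3.0) ** 2            # so p(x) = N(3, 1) (illustrative)
grad_E = lambda x: x - 3.0
H = lambda x, p: E(x) + 0.5 * p ** 2          # Hamiltonian with K(p) = p^T p / 2

def heun_step(x, p, h):
    # one Heun step for z' = f(z) with f(x, p) = (p, -grad E(x))
    dx1, dp1 = p, -grad_E(x)
    dx2, dp2 = p + h * dp1, -grad_E(x + h * dx1)
    return x + 0.5 * h * (dx1 + dx2), p + 0.5 * h * (dp1 + dp2)

def hmc(n_samples, x0=0.0, h=0.2, L=10):
    x = x0
    chain = np.empty(n_samples)
    for t in range(n_samples):
        p = rng.standard_normal()             # resample momentum
        x_new, p_new = x, p
        for _ in range(L):                    # simulate the dynamics
            x_new, p_new = heun_step(x_new, p_new, h)
        # MH correction for the numerical error of the integrator
        if rng.uniform() < min(1.0, np.exp(H(x, p) - H(x_new, p_new))):
            x = x_new
        chain[t] = x
    return chain

chain = hmc(20_000)
```

Heun's method is used here to match the slide; in practice HMC uses the leapfrog integrator, which is volume-preserving and time-reversible and therefore gives exact detailed balance under the MH correction.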
Hamiltonian Monte Carlo
moving with momentum
Visualization
by Chi Feng https://github.com/chi-feng
http://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH,banana
How to set the Hyperparameters? The No-U-Turn Sampler
The state of the art in MCMC [Hoffman & Gelman, JMLR 15 (2014), pp. 1593–1623]
Abstract
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that
avoids the random walk behavior and sensitivity to correlated parameters that plague many
MCMC methods by taking a series of steps informed by first-order gradient information.
These features allow it to converge to high-dimensional target distributions much more
quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However,
HMC’s performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS […]
Visualization
by Chi Feng https://github.com/chi-feng
https://chi-feng.github.io/mcmc-demo/app.html#NaiveNUTS,banana
Markov Chain Monte Carlo
▶ breaks down sampling into local dynamics
▶ samples correctly in the asymptotic limit
▶ avoiding random walk behaviour (achieving good asymptotic mixing) requires careful design
▶ Hamiltonian MCMC methods (like NUTS) are currently among the state of the art (sequential MC being an alternative).
▶ they require the solution of an ordinary differential equation (the Hamiltonian dynamics)
▶ their hyperparameters are tuned using elaborate subroutines
▶ this is typical of all good numerical methods!
▶ these methods are available in software packages
Reminder: Monte Carlo methods converge stochastically. This stochastic rate is an optimistic bound
for MCMC, because it has to be scaled by the mixing time. Monte Carlo methods are a powerful, well-
developed tool. But they are most likely not the final solution to integration.
Exercises
Computing with Probabilities, but without tools