Probabilistic Inference and Learning

Lecture 05
Markov Chain Monte Carlo

Philipp Hennig
03 May 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #   date    content                          Ex
 1   20.04.  Introduction                     1
 2   21.04.  Reasoning under Uncertainty
 3   27.04.  Continuous Variables             2
 4   28.04.  Monte Carlo
 5   04.05.  Markov Chain Monte Carlo         3
 6   05.05.  Gaussian Distributions
 7   11.05.  Parametric Regression            4
 8   12.05.  Understanding Deep Learning
 9   18.05.  Gaussian Processes               5
10   19.05.  An Example for GP Regression
11   25.05.  Understanding Kernels            6
12   26.05.  Gauss-Markov Models
13   08.06.  GP Classification                7
14   09.06.  Logistic Regression              8
15   15.06.  Exponential Families
16   16.06.  Graphical Models                 9
17   22.06.  Factor Graphs
18   23.06.  The Sum-Product Algorithm        10
19   29.06.  Example: Topic Models
20   30.06.  Mixture Models                   11
21   06.07.  EM
22   07.07.  Variational Inference            12
23   13.07.  Example: Topic Models
24   14.07.  Example: Inferring Topics        13
25   20.07.  Example: Kernel Topic Models
26   21.07.  Revision

Probabilistic ML — P. Hennig, SS 2021 — Lecture 05: Markov Chain Monte Carlo— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 1

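$$ F := \int f(x)\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) =: \hat F \quad \text{if } x_i \sim p $$

$$ E_p(\hat F) = F, \qquad \mathrm{var}_p(\hat F) = \frac{\mathrm{var}_p(f)}{N} $$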
Recap from last lecture:
▶ Random numbers can be used to estimate integrals → Monte Carlo algorithms
▶ although the concept of randomness is fundamentally unsound, Monte Carlo algorithms are competitive in high-dimensional problems (primarily because the advantages of the alternatives degrade rapidly with dimensionality)
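
The recap formulas can be checked numerically. The sketch below is an illustration only: the choices p = N(0, 1) and f(x) = x² and the helper name `mc_estimate` are assumptions, not part of the lecture. It repeats the estimator many times and compares the empirical variance of the estimate with var_p(f)/N.

```python
import numpy as np

def mc_estimate(f, sampler, n, rng):
    """Monte Carlo estimate F_hat = (1/N) * sum_i f(x_i) with x_i ~ p."""
    x = sampler(rng, n)              # n independent draws from p
    return np.mean(f(x))

def f(x):
    return x ** 2                    # E_p[f] = 1 under p = N(0, 1)

def sampler(rng, n):
    return rng.standard_normal(n)    # p = N(0, 1), chosen for illustration

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    # Repeat the estimator to observe its spread empirically.
    estimates = np.array([mc_estimate(f, sampler, n, rng) for _ in range(2000)])
    # var_p(f) = E[x^4] - E[x^2]^2 = 3 - 1 = 2, so var(F_hat) should be near 2 / n.
    print(n, estimates.mean(), estimates.var(), 2.0 / n)
```

The printed mean should stay near F = 1 for every N (unbiasedness), while the variance shrinks roughly in proportion to 1/N.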

But in High Dimensions, Sampling isn’t Easy, Either!
Sampling is harder than global optimization

To produce exact samples:


▶ need to know cumulative density everywhere
▶ need to know regions of high density (not just local maxima!)
▶ a global description of the entire function
Practical Monte Carlo Methods aim to construct samples from

$$ p(x) = \frac{\tilde p(x)}{Z} $$

assuming that it is possible to evaluate the unnormalized density p̃ (but not p) at arbitrary points.
Typical example: Compute moments of a posterior

$$ p(x \mid D) = \frac{p(D \mid x)\,p(x)}{\int p(D, x)\,dx} \quad \text{as} \quad E_{p(x \mid D)}(x^n) \approx \frac{1}{S}\sum_{s} x_s^n \quad \text{with } x_s \sim p(x \mid D) $$

Rejection Sampling
a simple method [Georges-Louis Leclerc, Comte de Buffon, 1707–1788]


▶ for any p(x) = p̃(x)/Z (normalizer Z not required)


▶ choose q(x) s.t. cq(x) ≥ p̃(x)
▶ draw s ∼ q(x), u ∼ Uniform[0, cq(s)]
▶ reject if u > p̃(s)
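
The four steps above can be sketched as follows. This is a minimal illustration, assuming an unnormalized standard Gaussian target p̃(x) = exp(−x²/2) and a wider Gaussian proposal with σ_q = 2; the function name `rejection_sample` and these choices are hypothetical. With c = σ_q √(2π), the envelope is c q(x) = exp(−x²/(2σ_q²)) ≥ p̃(x) everywhere.

```python
import numpy as np

def rejection_sample(p_tilde, q_sample, q_pdf, c, n_proposals, rng):
    """Draw n_proposals from q and keep those that fall under p_tilde."""
    s = q_sample(rng, n_proposals)              # s ~ q(x)
    u = rng.uniform(0.0, c * q_pdf(s))          # u ~ Uniform[0, c q(s)]
    return s[u <= p_tilde(s)]                   # reject if u > p_tilde(s)

SIGMA_Q = 2.0                                   # proposal width (assumption)

def p_tilde(x):
    return np.exp(-0.5 * x ** 2)                # unnormalized N(0, 1), Z = sqrt(2 pi)

def q_pdf(x):
    return np.exp(-0.5 * (x / SIGMA_Q) ** 2) / (SIGMA_Q * np.sqrt(2.0 * np.pi))

def q_sample(rng, n):
    return SIGMA_Q * rng.standard_normal(n)

c = SIGMA_Q * np.sqrt(2.0 * np.pi)              # makes c q(x) >= p_tilde(x) for all x

rng = np.random.default_rng(0)
samples = rejection_sample(p_tilde, q_sample, q_pdf, c, 100_000, rng)
print(len(samples) / 100_000, samples.mean(), samples.std())
```

For this particular pair the expected acceptance rate is Z/c = 1/σ_q, so about half of the proposals are discarded.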

The Problem with Rejection Sampling
the curse of dimensionality [MacKay, §29.3]

Example:
▶ p(x) = N(x; 0, σ_p²)
▶ q(x) = N(x; 0, σ_q²)
▶ σ_q > σ_p
▶ optimal c is given by

$$ c = \frac{(2\pi\sigma_q^2)^{D/2}}{(2\pi\sigma_p^2)^{D/2}} = \left(\frac{\sigma_q}{\sigma_p}\right)^{D} = \exp\left(D \ln\frac{\sigma_q}{\sigma_p}\right) $$

▶ acceptance rate is ratio of volumes: 1/c
▶ rejection rate rises exponentially in D
▶ for σ_q/σ_p = 1.1, D = 100, 1/c < 10^{−4}

[Figure: p(x) and cq(x) plotted over x from −4 to 4]
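
The last bullet is a one-line computation. The sketch below (the function name `acceptance_rate` is hypothetical) evaluates 1/c = (σ_p/σ_q)^D for a few dimensions to show how quickly the acceptance rate collapses.

```python
import numpy as np

def acceptance_rate(ratio, D):
    """Return 1/c = exp(-D * ln(sigma_q / sigma_p)) for ratio = sigma_q / sigma_p."""
    return np.exp(-D * np.log(ratio))

for D in (1, 10, 100, 1000):
    print(D, acceptance_rate(1.1, D))
```

Even a proposal that is only 10% too wide in each dimension is almost never accepted once D reaches the hundreds.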

Importance Sampling
a slightly less simple method

▶ computing p̃(x), q(x), then throwing them away seems wasteful


▶ instead, rewrite (assume q(x) > 0 if p(x) > 0)

$$ \phi = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \approx \frac{1}{S}\sum_{s} f(x_s)\,\frac{p(x_s)}{q(x_s)} =: \frac{1}{S}\sum_{s} f(x_s)\,w_s \quad \text{if } x_s \sim q(x) $$
▶ this is just using a new function g(x) = f(x)p(x)/q(x), so it is an unbiased estimator
▶ ws is known as the importance (weight) of sample s
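▶ if normalization unknown, can also use p̃(x) = Zp(x), estimating Z from the same samples:

$$ \int f(x)\,p(x)\,dx \approx \frac{1}{Z}\,\frac{1}{S}\sum_{s} f(x_s)\,\frac{\tilde p(x_s)}{q(x_s)} \approx \frac{1}{S}\sum_{s} f(x_s)\,\frac{\tilde p(x_s)/q(x_s)}{\frac{1}{S}\sum_{s'} 1 \cdot \tilde p(x_{s'})/q(x_{s'})} =: \sum_{s} f(x_s)\,\tilde w_s $$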

▶ this is consistent, but biased
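
A minimal sketch of the self-normalized estimator, under assumptions chosen only for illustration: the target is an unnormalized N(1, 1), the proposal is N(0, 2²), and f(x) = x. The helper name `self_normalized_is` is hypothetical.

```python
import numpy as np

def self_normalized_is(f, p_tilde, q_pdf, x):
    """Estimate E_p[f] from draws x ~ q using only the unnormalized p_tilde."""
    w = p_tilde(x) / q_pdf(x)          # unnormalized importance weights
    w_tilde = w / np.sum(w)            # normalized weights, they sum to one
    return np.sum(f(x) * w_tilde)

SIGMA_Q = 2.0                          # proposal width (assumption)

def p_tilde(x):
    return np.exp(-0.5 * (x - 1.0) ** 2)    # unnormalized N(1, 1)

def q_pdf(x):
    return np.exp(-0.5 * (x / SIGMA_Q) ** 2) / (SIGMA_Q * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
x = SIGMA_Q * rng.standard_normal(100_000)                    # x_s ~ q
print(self_normalized_is(lambda v: v, p_tilde, q_pdf, x))     # target mean is 1
```

Because the same samples estimate both the numerator and Z, the normalizer never has to be known, at the price of the bias mentioned above.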


What’s wrong with Importance Sampling?
the curse of dimensionality, revisited
▶ recall that var(φ̂) = var(f)/S; importance sampling replaces var(f) with var(g) = var(f p/q)
▶ var(f p/q) can be very large if q ≪ p somewhere. In many dimensions, usually all but everywhere!
▶ if p has “undiscovered islands”, some samples have p(x)/q(x) → ∞
[Figure: left panel, p(x), q(x) and the weight w(x) over x; right panel, histograms of f(x) and g(x) with log10 sample count on the vertical axis]
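
One way to see the problem numerically is to look at how much of the total weight the single largest sample carries. The sketch below is an assumption-laden illustration, not the experiment from the slide: p = N(0, I), q = N(0, σ_q² I) with σ_q = 1.2, and the diagnostic `max_normalized_weight` is a hypothetical helper. Weights are computed in log space to avoid overflow.

```python
import numpy as np

def max_normalized_weight(D, sigma_q, S, rng):
    """Largest normalized importance weight for p = N(0, I), q = N(0, sigma_q^2 I)."""
    x = sigma_q * rng.standard_normal((S, D))          # x_s ~ q
    sq = np.sum(x ** 2, axis=1)
    # log p(x) - log q(x) for isotropic Gaussians in D dimensions
    log_w = D * np.log(sigma_q) - 0.5 * sq * (1.0 - 1.0 / sigma_q ** 2)
    w = np.exp(log_w - log_w.max())                    # rescale so the largest weight is 1
    return w.max() / w.sum()

rng = np.random.default_rng(0)
for D in (1, 10, 50, 200):
    print(D, max_normalized_weight(D, 1.2, 5000, rng))
```

In one dimension every sample contributes about equally; in a few hundred dimensions a handful of samples dominate the sum, so the estimate effectively rests on very few points.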
Summary: Simple Practical Monte Carlo Methods
1. Producing exact samples is just as hard as high-dimensional integration. Thus, practical MC methods sample from an unnormalized density p̃(x) = Z · p(x)
2. even this, however, is hard, because it is hard to build a globally useful approximation to the integrand

Markov-Chain Monte Carlo
random walks drawing random numbers

▶ problem of importance sampling: samples are generated independently, which requires q to be a good approximation to p everywhere.
▶ instead: generate samples iteratively, so the approximation q only needs to be good locally

Definition (Markov Chains)
A joint distribution p(X) over a sequence of random variables X := [x_1, …, x_N] is said to have the Markov property if

$$ p(x_i \mid x_1, x_2, \dots, x_{i-1}) = p(x_i \mid x_{i-1}). $$

The sequence is then called a Markov chain.

[Figure: chain graph x_1 → x_2 → x_3 → x_4]

An analogy
optimization

assume we wanted to find the maximum of p̃(x)


▶ given current estimate xt
▶ draw proposal x′ ∼ q(x′ | xt )
▶ evaluate

$$ a = \frac{\tilde p(x')}{\tilde p(x_t)} $$

▶ if a ≥ 1, accept: x_{t+1} ← x′
▶ else stay: x_{t+1} ← x_t

Usually, throw away estimates at the end, only keep “best guess”. But the estimates do contain
information about the shape of p̃!

The Metropolis-Hastings∗ Method
∗ Authorship controversial. Likely inventors: M. Rosenbluth, A. Rosenbluth & E. Teller, 1953

we want to find representers (samples) of p̃(x)


▶ given current sample xt
▶ draw proposal x′ ∼ q(x′ | xt ) (for example, q(x′ | xt ) = N (x′ ; xt , σ 2 ))
▶ evaluate

$$ a = \frac{\tilde p(x')}{\tilde p(x_t)} \cdot \frac{q(x_t \mid x')}{q(x' \mid x_t)} $$

▶ if a ≥ 1, accept: x_{t+1} ← x′
▶ else
  ▶ accept with probability a: x_{t+1} ← x′
  ▶ stay with probability 1 − a: x_{t+1} ← x_t
Usually, assume symmetry q(xt | x′ ) = q(x′ | xt ) (the Metropolis method)
▶ no rejection. Every sample counts!
▶ like optimization, but with a chance to move downhill
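
The symmetric (Metropolis) case can be sketched in a few lines. The example target, an equal mixture of two unit-variance Gaussians at 0 and 4, and the function name `metropolis` are assumptions for illustration. The code works with log p̃ for numerical stability, which is an implementation choice and not something the slide prescribes.

```python
import numpy as np

def metropolis(log_p_tilde, x0, sigma, n_steps, rng):
    """Random-walk Metropolis with proposal q(x' | x_t) = N(x'; x_t, sigma^2)."""
    x = float(x0)
    log_px = log_p_tilde(x)
    chain = np.empty(n_steps)
    n_accept = 0
    for t in range(n_steps):
        x_new = x + sigma * rng.standard_normal()      # draw proposal x'
        log_p_new = log_p_tilde(x_new)
        # Symmetric q cancels, so a = p~(x') / p~(x_t); accept with prob. min(1, a).
        a = np.exp(min(0.0, log_p_new - log_px))
        if rng.uniform() < a:
            x, log_px = x_new, log_p_new
            n_accept += 1
        chain[t] = x           # on "stay", the current sample is recorded again
    return chain, n_accept / n_steps

def log_p_tilde(x):
    # Unnormalized equal mixture of N(0, 1) and N(4, 1); its mean is 2.
    return np.logaddexp(-0.5 * x ** 2, -0.5 * (x - 4.0) ** 2)

rng = np.random.default_rng(0)
chain, rate = metropolis(log_p_tilde, x0=0.0, sigma=2.0, n_steps=50_000, rng=rng)
print(chain.mean(), rate)
```

Note that a rejected proposal still produces a chain entry (a repeat of x_t), which is what "every sample counts" refers to.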

Metropolis-Hastings in pictures
[Figure sequence at t = 1, 2, 3, 4, 5 and 300: target p(x) and proposal q(x) over x from −1 to 8, vertical axis "p, q"]

$$ a = \frac{\tilde p(x')}{\tilde p(x_t)} \cdot \frac{q(x_t \mid x')}{q(x' \mid x_t)}, \qquad \text{accept with probability } \min(1, a) $$
Visualization
by Chi Feng https://github.com/chi-feng

https://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH

Why is this a Monte Carlo Method?
MH draws from p(x) in the limit of ∞ samples

Theorem (convergence of Metropolis-Hastings, simplified)


If q(x′ | x_t) > 0 ∀(x′, x_t), then, for any x_0, the distribution of x_t approaches p(x) as t → ∞.

proof (sketch) existence of stationary distribution: detailed balance

▶ MH satisfies detailed balance

$$ \begin{aligned} p(x)\,T(x \to x') &= p(x) \cdot q(x' \mid x)\,\min\left[1, \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right] \\ &= \min\left[p(x)\,q(x' \mid x),\; p(x')\,q(x \mid x')\right] \\ &= p(x') \cdot q(x \mid x')\,\min\left[\frac{p(x)\,q(x' \mid x)}{p(x')\,q(x \mid x')}, 1\right] \\ &= p(x')\,T(x' \to x) \end{aligned} $$

▶ Markov Chains satisfying detailed balance have at least one stationary distribution

$$ \int p(x)\,T(x \to x')\,dx = \int p(x')\,T(x' \to x)\,dx = p(x') \int T(x' \to x)\,dx = p(x') $$
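
On a finite state space the same argument can be checked directly. The sketch below is an illustration under assumptions (a random 5-state target `p` and a random strictly positive proposal matrix `Q`; the helper `mh_transition_matrix` is hypothetical). It builds the MH transition matrix and verifies detailed balance, stationarity, and convergence from a point mass.

```python
import numpy as np

def mh_transition_matrix(p, Q):
    """T[i, j] = Q[i, j] * min(1, p[j] Q[j, i] / (p[i] Q[i, j])) for i != j."""
    ratio = (p[None, :] * Q.T) / (p[:, None] * Q)
    T = Q * np.minimum(1.0, ratio)
    np.fill_diagonal(T, 0.0)
    np.fill_diagonal(T, 1.0 - T.sum(axis=1))   # rejected proposals stay at state i
    return T

rng = np.random.default_rng(0)
K = 5
p = rng.random(K) + 0.1
p /= p.sum()                                   # target distribution
Q = rng.random((K, K)) + 0.1
Q /= Q.sum(axis=1, keepdims=True)              # proposal with Q[i, j] > 0 everywhere

T = mh_transition_matrix(p, Q)
flow = p[:, None] * T                          # flow[i, j] = p(i) T(i -> j)
print(np.allclose(flow, flow.T))               # detailed balance
print(np.allclose(p @ T, p))                   # p is stationary

start = np.zeros(K)
start[0] = 1.0                                 # chain started in state 0
print(start @ np.linalg.matrix_power(T, 2000)) # distribution after 2000 steps
print(p)
```

The discrete sums here play the role of the integrals in the proof sketch.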
Why is this a Monte Carlo Method?
MH draws from p(x) in the limit of ∞ samples

proof (sketch) uniqueness of stationary distribution:

Definition (Ergodicity)
A sequence {x_t}_{t∈N} is called ergodic if it
1. is aperiodic (contains no recurring sequence)
2. has positive recurrence: x_t = x∗ implies there is a t′ > t such that p(x_{t′} = x∗) > 0

→ for MH, {x_t}_{t∈N} is ergodic (by definition)

▶ ergodic Markov Chains have at most one stationary distribution

Theorem (convergence of Metropolis-Hastings, simplified)

If q(x′ | x_t) > 0 ∀(x′, x_t), then, for any x_0, the density of {x_t}_{t∈N} approaches p(x) as t → ∞.

▶ this is not a statement about convergence rate!

Metropolis-Hastings performs a (biased) random walk
hence diffuses O(s^{1/2})

Rule of Thumb: [MacKay, (29.32)]

▶ typical use-case: high-dimensional (D) problem of largest length-scale L, smallest ε, isotropic proposal distribution
▶ have to set width of q to ≈ ε, otherwise acceptance rate r will be very low.
▶ then Metropolis-Hastings does a random walk in D dimensions, moving a distance of

$$ \sqrt{E\left[\|x_t - x_0\|^2\right]} \sim \epsilon \sqrt{r t} $$

▶ so, to create one independent draw at distance L, MCMC has to run for at least

$$ t \sim \frac{1}{r}\left(\frac{L}{\epsilon}\right)^{2} $$

iterations. In practice (e.g. if the distribution has islands), the situation can be much worse.

[Figure: plot with both axes ranging from −2 to 2]
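
A short numerical illustration of the rule of thumb. The one-dimensional walk on a flat density (every proposal accepted, so r = 1) and the helper name `iterations_for_independent_draw` are assumptions for this sketch. The simulated root-mean-square distance should be close to ε√t, and the helper evaluates t ∼ (1/r)(L/ε)² for example values.

```python
import numpy as np

def iterations_for_independent_draw(L, eps, r):
    """Rule of thumb t ~ (1/r) * (L / eps)^2."""
    return (L / eps) ** 2 / r

rng = np.random.default_rng(0)
eps, t = 0.1, 400
# 2000 independent 1-D random walks with step width eps, all steps accepted.
steps = eps * rng.standard_normal((2000, t))
rms = np.sqrt(np.mean(steps.sum(axis=1) ** 2))
print(rms, eps * np.sqrt(t))

# Example values: length-scale L = 10, step eps = 0.1, acceptance rate r = 0.5.
print(iterations_for_independent_draw(L=10.0, eps=0.1, r=0.5))
```

The quadratic dependence on L/ε is what makes narrow proposals expensive: halving ε quadruples the number of iterations per independent draw.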
Metropolis-Hastings performs a (biased) random walk
estimating the mean of a correlated Gaussian

[Figure: left panel, exact value and MH estimate against # samples; right panel, error |x̂ − x| against # samples; both on logarithmic axes from 10^0 to 10^5 samples]

Summary: Practical Monte Carlo Methods
1. Producing exact samples is just as hard as high-dimensional integration. Thus, practical MC methods sample from an unnormalized density p̃(x) = Z · p(x)
2. even this, however, is hard, because it is hard to build a globally useful approximation to the integrand
3. Markov Chain Monte Carlo circumvents this problem by using local operations. It only converges well on the scale in which the local models cover the global problem. Thus the local behaviour has to be tuned.

Gibbs Sampling
Preparation for Exercise Sheet [D. & S. Geman, 1984]

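▶ x_t ← x_{t−1}; x_{ti} ∼ p(x_{ti} | x_{t1}, x_{t2}, …, x_{t(i−1)}, x_{t(i+1)}, …)
▶ a special case of Metropolis-Hastings:
  ▶ q(x′ | x_t) = δ(x′_{\i} − x_{t,\i}) p(x′_i | x_{t,\i})
  ▶ p(x′) = p(x′_i | x′_{\i}) p(x′_{\i}) = p(x′_i | x_{t,\i}) p(x_{t,\i})
  ▶ acceptance rate:

$$ a = \frac{p(x')}{p(x_t)} \cdot \frac{q(x_t \mid x')}{q(x' \mid x_t)} = \frac{p(x'_i \mid x_{t,\setminus i})\,p(x_{t,\setminus i})}{p(x_{ti} \mid x_{t,\setminus i})\,p(x_{t,\setminus i})} \cdot \frac{q(x_t \mid x')}{\delta(x'_{\setminus i} - x_{t,\setminus i})\,p(x'_i \mid x_{t,\setminus i})} = \frac{q(x_t \mid x')}{p(x_{ti} \mid x_{t,\setminus i})\,\delta(x'_{\setminus i} - x_{t,\setminus i})} = 1 $$

The last step uses q(x_t | x′) = δ(x_{t,\i} − x′_{\i}) p(x_{ti} | x′_{\i}) together with x′_{\i} = x_{t,\i}.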
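
A minimal sketch of the update rule for a case where the conditionals are known in closed form. The target, a zero-mean bivariate Gaussian with unit variances and correlation ρ, is an assumption for illustration, as is the function name; its conditionals are x_i | x_j ∼ N(ρ x_j, 1 − ρ²).

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_steps, rng, x0=(0.0, 0.0)):
    """Gibbs sampling for a 2-D Gaussian with unit variances and correlation rho."""
    x = np.array(x0, dtype=float)
    chain = np.empty((n_steps, 2))
    cond_std = np.sqrt(1.0 - rho ** 2)
    for t in range(n_steps):
        for i in (0, 1):
            # Resample coordinate i from p(x_i | x_other); always accepted (a = 1).
            x[i] = rho * x[1 - i] + cond_std * rng.standard_normal()
        chain[t] = x
    return chain

rng = np.random.default_rng(0)
chain = gibbs_bivariate_gaussian(rho=0.8, n_steps=20_000, rng=rng)
print(chain.mean(axis=0), np.corrcoef(chain.T)[0, 1])
```

Each coordinate update moves only along one axis, so strongly correlated targets (ρ close to 1) lead to small steps and slow mixing.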

Visualization
by Chi Feng https://github.com/chi-feng

https://chi-feng.github.io/mcmc-demo/app.html#GibbsSampling,banana

Proceed with Confidence!
and don’t worry, it’ll be fine …

▶ you don’t need to understand the following slides


▶ but a good engineer knows their tools

Hamiltonian Monte Carlo
reduce randomness by smoothing [e.g. DJC MacKay, 2003, §30]

▶ consider Boltzmann distributions P(x) = Z^{−1} exp(−E(x))
▶ augment the state-space by auxiliary momentum variables p = ẋ. Define Hamiltonian (“potential and kinetic energy”)

$$ H(x, p) = E(x) + K(p) \quad \text{with, e.g.} \quad K(p) = \tfrac{1}{2}\,p^{\top} p $$

▶ do Metropolis-Hastings with p, x coupled by Hamiltonian dynamics

$$ \dot x := \frac{\partial x}{\partial t} = \frac{\partial H}{\partial p}, \qquad \dot p := \frac{\partial p}{\partial t} = -\frac{\partial H}{\partial x} $$

nb: need to solve an ODE!
▶ note that, due to additive structure of Hamiltonian, this (asymptotically) samples from the factorizing joint

$$ P_H(x, p) = \frac{1}{Z_H}\exp(-H(x, p)) = \frac{1}{Z_H}\exp(-E(x)) \cdot \exp(-K(p)) \quad \text{with} \quad P_H(x) = \int P_H(x, p)\,dp = P(x) $$

[Portrait: William R Hamilton, 1805 – 1865 (Dublin)]
Why does this improve things?
Hidden gems of Hamiltonian Monte Carlo

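$$ \dot x = \frac{\partial H}{\partial p}, \qquad \dot p = -\frac{\partial H}{\partial x} $$

▶ If P(x) is locally flat, then after N steps of size h, x has moved to x + Nhp, a distance of O(N), not O(√N) as for Metropolis-Hastings! Hamiltonian MC mixes faster than Metropolis-Hastings
▶ The Hamiltonian is a conserved quantity:

$$ \frac{dH(p, x)}{dt} = \frac{\partial H}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial H}{\partial p}\frac{\partial p}{\partial t} = \frac{\partial H}{\partial x} \cdot \frac{\partial H}{\partial p} - \frac{\partial H}{\partial p} \cdot \frac{\partial H}{\partial x} = 0 $$

So, if we have managed to simulate the dynamics well, then

$$ \delta H = 0 \;\Rightarrow\; P_H(x', p') = P_H(x, p) $$

and the MH step will always be accepted!

$$ a = \frac{\tilde p(x', p')}{\tilde p(x_t, p_t)} \cdot \frac{q(x_t, p_t \mid x', p')}{q(x', p' \mid x_t, p_t)} = \frac{\exp(-H(x', p'))}{\exp(-H(x_t, p_t))} \cdot \frac{q(x_t, p_t \mid x', p')}{q(x', p' \mid x_t, p_t)} $$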
HMC is a way to construct really good MH proposals that are always accepted (up to numerical errors).
Implementing Hamiltonian Monte Carlo …
Heun’s method for the Hamiltonian System

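$$ H(x, p) = E(x) + \tfrac{1}{2}\,p^{\top} p, \qquad \dot x = \frac{\partial H}{\partial p} = p, \qquad \dot p = -\frac{\partial H}{\partial x} = -\nabla_x E(x) $$

▶ We are trying to solve the ordinary differential equation

$$ \frac{dz(t)}{dt} = f(z(t)) \quad \text{such that } z(t_0) = z_0, \qquad z(t) = \begin{bmatrix} x(t) \\ p(t) \end{bmatrix}, \qquad f\left(\begin{bmatrix} x \\ p \end{bmatrix}\right) = \begin{bmatrix} p(t) \\ -\nabla E(x(t)) \end{bmatrix} $$

▶ Heun’s method:

$$ z(t_i + h) = z_i + \frac{h}{2}\left(f(z_i) + f(z_i + h f(z_i))\right) $$

$$ \begin{aligned} \begin{bmatrix} x_{i+1} \\ p_{i+1} \end{bmatrix} &= \begin{bmatrix} x_i \\ p_i \end{bmatrix} + \frac{h}{2}\left(\begin{bmatrix} p_i \\ -\nabla E(x_i) \end{bmatrix} + f\left(\begin{bmatrix} x_i + h p_i \\ p_i - h \nabla E(x_i) \end{bmatrix}\right)\right) \\ &= \begin{bmatrix} x_i + \frac{h}{2}\left(p_i + p_i - h \nabla E(x_i)\right) \\ p_i + \frac{h}{2}\left(-\nabla E(x_i) - \nabla E(x_i + h p_i)\right) \end{bmatrix} = \begin{bmatrix} x_i + h p_i - \frac{h^2}{2} \nabla E(x_i) \\ p_i - \frac{h}{2}\left(\nabla E(x_i) + \nabla E(x_i + h p_i)\right) \end{bmatrix} \end{aligned} $$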
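
The expansion above can be verified numerically. The sketch below implements the generic Heun update and the expanded form side by side. The test energy E(x) = x⁴/4 (so ∇E(x) = x³) and all function names are assumptions chosen for illustration.

```python
import numpy as np

def f(z, grad_E):
    """Right-hand side of the Hamiltonian ODE for z = [x, p] with K(p) = p^T p / 2."""
    x, p = z
    return np.array([p, -grad_E(x)])

def heun_step(z, grad_E, h):
    """Generic Heun update: z + h/2 * (f(z) + f(z + h f(z)))."""
    k1 = f(z, grad_E)
    k2 = f(z + h * k1, grad_E)
    return z + 0.5 * h * (k1 + k2)

def heun_step_expanded(x, p, grad_E, h):
    """The same update written out per component, as in the last line of the derivation."""
    x_new = x + h * p - 0.5 * h ** 2 * grad_E(x)
    p_new = p - 0.5 * h * (grad_E(x) + grad_E(x + h * p))
    return x_new, p_new

def grad_E(x):
    return x ** 3                      # gradient of E(x) = x^4 / 4 (example energy)

x = np.array([0.3, -1.2])
p = np.array([0.5, 0.1])
z_new = heun_step(np.array([x, p]), grad_E, h=0.1)
x_new, p_new = heun_step_expanded(x, p, grad_E, h=0.1)
print(np.allclose(z_new[0], x_new), np.allclose(z_new[1], p_new))
```

Both forms need two gradient evaluations per step, at x_i and at x_i + h p_i.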

Hamiltonian Monte Carlo
moving with momentum

import numpy as np
from numpy.random import randn, rand

def HamiltonianMC(findE, gradE, L, Tau, h, x0):
    x = x0                                   # initial sample
    X = np.zeros([L, x.shape[0]])            # sample storage
    X[0, :] = x                              # initialize storage
    E = findE(x); g = gradE(x)               # compute initial objective and gradient
    for l in range(L):                       # loop L times
        p = randn(x.shape[0])                # initial momentum is N(0,1)
        H = p.T @ p / 2 + E                  # evaluate H(x,p)
        xnew = x; gnew = g                   # working copies (rebound below, x is never modified in place)
        for tau in range(Tau):               # make Tau steps; note this update is the leapfrog scheme
            p = p - h / 2 * gnew             # make half-step in p
            xnew = xnew + h * p              # make step in x
            gnew = gradE(xnew)               # find new gradient
            p = p - h / 2 * gnew             # make half-step in p
        Enew = findE(xnew)                   # find new value of E
        Hnew = p.T @ p / 2 + Enew            # find new value of H
        dH = Hnew - H                        # decide whether to accept
        accept = dH < 0 or rand() < np.exp(-dH)
        if accept:                           # on rejection, the old x is stored again
            g = gnew; x = xnew; E = Enew
        X[l, :] = x
    return X

Visualization
by Chi Feng https://github.com/chi-feng

http://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH,banana

How to set the Hyperparameters? The No-U-Turn Sampler
The state of the art in MCMC [Hoffman & Gelman, JMLR 15 (2014), pp. 1593–1623]

The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo

Matthew D. Hoffman, Department of Statistics, Columbia University, New York, NY 10027, USA
Andrew Gelman, Departments of Statistics and Political Science, Columbia University, New York, NY 10027, USA
(arXiv preprint stamp: …46v1 [stat.CO] 18 Nov 2011)

Abstract
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS …
Visualization
by Chi Feng https://github.com/chi-feng

https://chi-feng.github.io/mcmc-demo/app.html#NaiveNUTS,banana

Markov Chain Monte Carlo
▶ breaks down sampling into local dynamics
▶ samples correctly in the asymptotic limit
▶ avoiding random walk behaviour (achieving good asymptotic mixing) requires careful design
▶ Hamiltonian MCMC methods (like NUTS) are currently among the state of the art (sequential MC being an alternative).
▶ they require the solution of an ordinary differential equation (the Hamiltonian dynamics)
▶ their hyperparameters are tuned using elaborate subroutines
▶ this is typical of all good numerical methods!
▶ these methods are available in software packages
Reminder: Monte Carlo methods converge stochastically, with an error of order N^{−1/2} for N independent samples. This stochastic rate is an optimistic bound for MCMC, because it has to be scaled by the mixing time. Monte Carlo methods are a powerful, well-developed tool. But they are most likely not the final solution to integration.

Despite centuries of research, integration remains an open problem.

Exercises
Computing with Probabilities, but without tools

▶ try to build an agent playing the game (with multiple ships)
▶ Things to think about:
  ▶ how to deal with the combinatorial explosion
  ▶ how is it best implemented in practice (in Python)
  ▶ how to build an autonomous agent?
