05 MCMC
Lecture 05
Markov Chain Monte Carlo
Philipp Hennig
03 May 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
#   date    content                        Ex  |  #   date    content                       Ex
1   20.04.  Introduction                   1   |  14  09.06.  Logistic Regression           8
2   21.04.  Reasoning under Uncertainty        |  15  15.06.  Exponential Families
3   27.04.  Continuous Variables           2   |  16  16.06.  Graphical Models              9
4   28.04.  Monte Carlo                        |  17  22.06.  Factor Graphs
5   04.05.  Markov Chain Monte Carlo       3   |  18  23.06.  The Sum-Product Algorithm     10
6   05.05.  Gaussian Distributions             |  19  29.06.  Example: Topic Models
7   11.05.  Parametric Regression          4   |  20  30.06.  Mixture Models                11
8   12.05.  Understanding Deep Learning        |  21  06.07.  EM
9   18.05.  Gaussian Processes             5   |  22  07.07.  Variational Inference         12
10  19.05.  An Example for GP Regression       |  23  13.07.  Example: Topic Models
11  25.05.  Understanding Kernels          6   |  24  14.07.  Example: Inferring Topics     13
12  26.05.  Gauss-Markov Models                |  25  20.07.  Example: Kernel Topic Models
13  08.06.  GP Classification              7   |  26  21.07.  Revision
Probabilistic ML — P. Hennig, SS 2021 — Lecture 05: Markov Chain Monte Carlo— © Philipp Hennig, 2021 CC BY-NC-SA 3.0 1
$$F := \int f(x)\, p(x)\, \mathrm{d}x \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f(x_i) =: \hat{F} \qquad \text{if } x_i \sim p$$

$$\mathbb{E}_p(\hat{F}) = F \qquad \operatorname{var}_p(\hat{F}) = \frac{\operatorname{var}_p(f)}{N}$$
Recap from last lecture:
▶ Random numbers can be used to estimate integrals → Monte Carlo algorithms
▶ Although the concept of randomness is fundamentally unsound, Monte Carlo algorithms are competitive in high-dimensional problems (primarily because the advantages of the alternatives degrade rapidly with dimensionality)
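The estimator and its variance can be checked in a few lines (a minimal sketch; the choice $f(x) = x^2$ and $p = \mathcal{N}(0, 1)$ is illustrative, so the true value of $F$ is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# F = E_p[f(x)] with p = N(0, 1) and f(x) = x^2, so the true value is 1.
N = 100_000
x = rng.standard_normal(N)        # x_i ~ p
F_hat = np.mean(x ** 2)           # (1/N) sum_i f(x_i)

# The estimator is unbiased; its standard error shrinks as sqrt(var_p(f) / N).
std_err = np.std(x ** 2, ddof=1) / np.sqrt(N)
```

Note that the convergence rate is independent of the dimension of $x$; only $\operatorname{var}_p(f)$ enters.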
But in High Dimensions, Sampling isn’t Easy, Either!
Sampling is harder than global optimization
$$p(x) = \frac{\tilde{p}(x)}{Z}$$
assuming that it is possible to evaluate the unnormalized density p̃ (but not p) at arbitrary points.
Typical example: Compute moments of a posterior
$$p(x \mid D) = \frac{p(D \mid x)\, p(x)}{\int p(D, x)\, \mathrm{d}x} \qquad \text{as} \qquad \mathbb{E}_{p(x \mid D)}(x^n) \approx \frac{1}{S} \sum_{s=1}^{S} x_s^n \quad \text{with } x_s \sim p(x \mid D)$$
Rejection Sampling
a simple method [Georges-Louis Leclerc, Comte de Buffon, 1707–1788]
[Figure: a density over x ∈ [−4, 10] (vertical scale 0–0.4), illustrating rejection sampling.]
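Rejection sampling can be sketched as follows (a minimal illustration; the unnormalized target p̃ and the envelope q below are arbitrary choices, not the slide's density): draw $x \sim q$, draw a uniform height under the scaled envelope $c\,q(x)$, and accept $x$ if the height falls below $\tilde{p}(x)$.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_tilde(x):
    # unnormalized bimodal target (illustrative choice)
    return np.exp(-0.5 * (x - 1) ** 2) + 0.5 * np.exp(-0.5 * (x - 4) ** 2)

# Gaussian envelope q(x) = N(x; 2, sigma_q^2), scaled by c so that c*q(x) >= p_tilde(x).
sigma_q = 3.0
def q_pdf(x):
    return np.exp(-0.5 * ((x - 2) / sigma_q) ** 2) / (sigma_q * np.sqrt(2 * np.pi))

c = 15.0  # chosen generously so the envelope dominates p_tilde everywhere

def rejection_sample(n):
    samples = []
    while len(samples) < n:
        x = rng.normal(2, sigma_q)          # propose from q
        u = rng.uniform(0, c * q_pdf(x))    # uniform height under the envelope
        if u <= p_tilde(x):                 # accept if below the target
            samples.append(x)
    return np.array(samples)

xs = rejection_sample(2000)
```

Accepted points are exact samples from $p(x) = \tilde{p}(x)/Z$; the loop runs until enough are accepted, so a loose $c$ directly costs runtime.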
The Problem with Rejection Sampling
the curse of dimensionality [MacKay, §29.3]
[Figure: p(x) and the scaled envelope c q(x) over x ∈ [−4, 4].]

Example:
▶ $p(x) = \mathcal{N}(x; 0, \sigma_p^2)$
▶ $q(x) = \mathcal{N}(x; 0, \sigma_q^2)$
▶ $\sigma_q > \sigma_p$
▶ in $D$ dimensions, the optimal $c$ is given by
$$c = \frac{(2\pi\sigma_q^2)^{D/2}}{(2\pi\sigma_p^2)^{D/2}} = \left(\frac{\sigma_q}{\sigma_p}\right)^{D} = \exp\left(D \ln \frac{\sigma_q}{\sigma_p}\right)$$
▶ acceptance rate is ratio of volumes: $1/c$
▶ rejection rate rises exponentially in $D$
▶ for $\sigma_q/\sigma_p = 1.1$, $D = 100$: $1/c < 10^{-4}$
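The last bullet is easy to check numerically:

```python
import math

sigma_ratio = 1.1                 # sigma_q / sigma_p
D = 100
c = sigma_ratio ** D              # = exp(D * ln(sigma_q / sigma_p)) ≈ 1.4e4
acceptance_rate = 1.0 / c         # falls below 1e-4
```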
Importance Sampling
a slightly less simple method
The idea: rewrite the integral with respect to a tractable $q$ and weight the samples,
$$\int f(x)\, p(x)\, \mathrm{d}x = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, \mathrm{d}x \approx \frac{1}{S} \sum_{s=1}^{S} \frac{p(x_s)}{q(x_s)} f(x_s) \quad \text{with } x_s \sim q$$

[Figure: left panel over x ∈ [−2, 8]; right panel showing f(x), g(x) over [−20, 100].]
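A minimal importance-sampling sketch (the target p̃, proposal q, and f are illustrative choices): no draw is rejected; each sample from q is reweighted, and because the target is unnormalized the weights are self-normalized.

```python
import numpy as np

rng = np.random.default_rng(2)

def p_tilde(x):
    return np.exp(-0.5 * (x - 3) ** 2)    # unnormalized N(3, 1)

def q_pdf(x):
    # proposal N(0, 4^2), wide enough to cover the target
    return np.exp(-0.5 * (x / 4) ** 2) / (4 * np.sqrt(2 * np.pi))

S = 100_000
x = rng.normal(0, 4, size=S)              # x_s ~ q
w = p_tilde(x) / q_pdf(x)                 # unnormalized importance weights
w /= w.sum()                              # self-normalize (Z of p_tilde unknown)

f = lambda t: t                           # estimate E_p[x]; true value is 3
F_hat = np.sum(w * f(x))
```

The estimator degrades badly when q is a poor match for p (a few huge weights dominate), which is the high-dimensional failure mode discussed on the previous slides.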
Summary: Simple Practical Monte Carlo Methods
1. Producing exact samples is just as hard as high-dimensional integration. Thus, practical MC methods sample from an unnormalized density $\tilde{p}(x) = Z \cdot p(x)$.
2. Even this, however, is hard, because it is hard to build a globally useful approximation to the integrand.
Markov-Chain Monte Carlo
random walks for drawing random numbers
[Figure: Markov chain x₁ → x₂ → x₃ → x₄.]
An analogy
optimization
In optimization, one usually throws away the intermediate estimates at the end and keeps only the “best guess”. But those estimates do contain information about the shape of p̃!
The Metropolis-Hastings∗ Method
∗ Authorship controversial. Likely inventors: M. Rosenbluth, A. Rosenbluth & E. Teller, 1953
Metropolis-Hastings in pictures
t = 1
[Figure: target p(x) and proposal q(x) over x ∈ [−1, 8], vertical scale 0–0.4.]

$$a = \frac{\tilde{p}(x')\, q(x_t \mid x')}{\tilde{p}(x_t)\, q(x' \mid x_t)} \qquad \text{accept with } p = \min(1, a)$$
[The slide repeats with the same figure and accept rule for t = 2, 3, 4, 5, and 300, showing the chain gradually exploring p(x).]
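The accept rule on these slides can be sketched in code (a minimal random-walk example; the target p̃ and step width are illustrative choices): with a symmetric Gaussian proposal, $q(x_t \mid x') = q(x' \mid x_t)$, so the proposal ratio cancels and $a = \tilde{p}(x')/\tilde{p}(x_t)$.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_tilde(x):
    # unnormalized target (illustrative): N(3, 1) without its constant
    return np.exp(-0.5 * (x - 3) ** 2)

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    x = x0
    chain = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step * rng.standard_normal()   # symmetric proposal q
        a = p_tilde(x_prop) / p_tilde(x)            # q-ratio cancels for symmetric q
        if rng.uniform() < min(1.0, a):
            x = x_prop                              # accept
        chain[t] = x                                # on rejection, x_t is repeated
    return chain

chain = metropolis_hastings(50_000)
burned = chain[5_000:]   # discard burn-in before computing statistics
```

Note that the normalization $Z$ never appears: only ratios of p̃ are evaluated, which is exactly why MCMC applies to unnormalized posteriors.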
Visualization
by Chi Feng https://github.com/chi-feng
https://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH
Why is this a Monte Carlo Method?
MH draws from p(x) in the limit of ∞ samples
Definition (Ergodicity)
A sequence {xt }t∈N is called ergodic if it
1. is aperiodic (contains no recurring sequence)
2. has positive recurrence: xt = x∗ implies there is a t′ > t such that p(xt′ = x∗ ) > 0
Metropolis-Hastings performs a (biased) random walk
hence diffuses as $\mathcal{O}(s^{1/2})$
[Figure: left, the running estimate x̂; right, the error |x̂ − x| against # samples (10⁰–10⁵, log–log), comparing exact sampling with MH.]
Summary: Practical Monte Carlo Methods
1. Producing exact samples is just as hard as high-dimensional integration. Thus, practical MC methods sample from an unnormalized density $\tilde{p}(x) = Z \cdot p(x)$.
2. Even this, however, is hard, because it is hard to build a globally useful approximation to the integrand.
3. Markov Chain Monte Carlo circumvents this problem by using local operations. It converges well only on the scale at which the local models cover the global problem; thus the local behaviour has to be tuned.
Gibbs Sampling
Preparation for Exercise Sheet [D. & S. Geman, 1984]
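Gibbs sampling updates one variable at a time, drawing each coordinate from its full conditional given the current values of all the others; each such update is a Metropolis-Hastings step that is always accepted. A minimal sketch for a 2-d standard Gaussian with correlation ρ (the target is an illustrative choice; for a Gaussian the conditionals are available in closed form):

```python
import numpy as np

rng = np.random.default_rng(4)

rho = 0.8          # correlation of the 2-d standard-normal target
n_steps = 50_000

x = np.zeros(2)
chain = np.empty((n_steps, 2))
for t in range(n_steps):
    # x1 | x2 ~ N(rho * x2, 1 - rho^2)
    x[0] = rho * x[1] + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    # x2 | x1 ~ N(rho * x1, 1 - rho^2)
    x[1] = rho * x[0] + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    chain[t] = x

emp_corr = np.corrcoef(chain[1_000:].T)[0, 1]   # approaches rho
```

The axis-aligned moves are exactly why Gibbs mixes slowly on strongly correlated targets such as the banana in the visualization below.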
Visualization
by Chi Feng https://github.com/chi-feng
https://chi-feng.github.io/mcmc-demo/app.html#GibbsSampling,banana
Proceed with Confidence!
and don’t worry, it’ll be fine …
Hamiltonian Monte Carlo
reduce randomness by smoothing [e.g. DJC MacKay, 2003, §30]
$$H(x, p) = E(x) + K(p) \qquad \text{with, e.g., } K(p) = \frac{1}{2} p^\intercal p$$

▶ do Metropolis-Hastings with $p, x$ coupled by Hamiltonian dynamics
$$\dot{x} := \frac{\partial x}{\partial t} = \frac{\partial H}{\partial p} \qquad \dot{p} := \frac{\partial p}{\partial t} = -\frac{\partial H}{\partial x} \qquad \text{(nb: need to solve an ODE!)}$$
▶ note that, due to the additive structure of the Hamiltonian, this (asymptotically) samples from the factorizing joint
$$P_H(x, p) = \frac{1}{Z_H} \exp(-H(x, p)) = \frac{1}{Z_H} \exp(-E(x)) \cdot \exp(-K(p)) \qquad \text{with} \quad P_H(x) = \int P_H(x, p)\, \mathrm{d}p = P(x)$$

[Portrait: William R. Hamilton, 1805–1865, Dublin]
Why does this improve things?
Hidden gems of Hamiltonian Monte Carlo
$$\dot{x} = \frac{\partial H}{\partial p} \qquad \dot{p} = -\frac{\partial H}{\partial x}$$

▶ If $p(x)$ is locally flat, then after $N$ steps $x$ has changed by $Nhp$, so $\mathcal{O}(N)$, not $\mathcal{O}(\sqrt{N})$ as for Metropolis-Hastings! Hamiltonian MC mixes faster than Metropolis-Hastings.
▶ The Hamiltonian is a conserved quantity:
$$\frac{\mathrm{d}H(p, x)}{\mathrm{d}t} = \frac{\partial H}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial H}{\partial p}\frac{\partial p}{\partial t} = \frac{\partial H}{\partial x} \cdot \frac{\partial H}{\partial p} - \frac{\partial H}{\partial p} \cdot \frac{\partial H}{\partial x} = 0$$
So, if we have managed to simulate the dynamics well, then
$$\delta H = 0 \quad \Rightarrow \quad P_H(x', p') = P_H(x, p)$$
and the MH step will always be accepted:
$$a = \frac{\tilde{p}(x', p')\, q(x_t, p_t \mid x', p')}{\tilde{p}(x_t, p_t)\, q(x', p' \mid x_t, p_t)} = \frac{\exp(-H(x', p'))\, q(x_t, p_t \mid x', p')}{\exp(-H(x_t, p_t))\, q(x', p' \mid x_t, p_t)}$$
HMC is a way to construct really good MH proposals that are always accepted (up to numerical errors).
Implementing Hamiltonian Monte Carlo …
Heun’s method for the Hamiltonian System
$$H(x, p) = E(x) + \frac{1}{2} p^\intercal p \qquad \dot{x} = \frac{\partial H}{\partial p} = p \qquad \dot{p} = -\frac{\partial H}{\partial x} = -\nabla_x E(x)$$

▶ We are trying to solve the ordinary differential equation
$$\frac{\mathrm{d}z(t)}{\mathrm{d}t} = f(z(t)) \quad \text{such that } z(t_0) = z_0, \qquad z(t) = \begin{bmatrix} x(t) \\ p(t) \end{bmatrix}, \qquad f\!\left(\begin{bmatrix} x \\ p \end{bmatrix}\right) = \begin{bmatrix} p(t) \\ -\nabla E(x(t)) \end{bmatrix}$$
▶ Heun's method:
$$z(t_i + h) = z_i + \frac{h}{2}\bigl(f(z_i) + f(z_i + h f(z_i))\bigr)$$
$$\begin{bmatrix} x_{i+1} \\ p_{i+1} \end{bmatrix} = \begin{bmatrix} x_i \\ p_i \end{bmatrix} + \frac{h}{2}\left(\begin{bmatrix} p_i \\ -\nabla E(x_i) \end{bmatrix} + f\!\left(\begin{bmatrix} x_i + h p_i \\ p_i - h \nabla E(x_i) \end{bmatrix}\right)\right)$$
$$= \begin{bmatrix} x_i + \frac{h}{2}\bigl(p_i + p_i - h \nabla E(x_i)\bigr) \\ p_i + \frac{h}{2}\bigl(-\nabla E(x_i) - \nabla E(x_i + h p_i)\bigr) \end{bmatrix} = \begin{bmatrix} x_i + h p_i - \frac{h^2}{2} \nabla E(x_i) \\ p_i - \frac{h}{2}\bigl(\nabla E(x_i) + \nabla E(x_i + h p_i)\bigr) \end{bmatrix}$$
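Putting the pieces together (a minimal sketch; the energy E and the step parameters h, L are illustrative choices): resample a momentum, simulate the Hamiltonian dynamics for L Heun steps, then accept with probability $\min(1, \exp(H(x, p) - H(x', p')))$ to correct for the integrator's error.

```python
import numpy as np

rng = np.random.default_rng(5)

E = lambda x: 0.5 * (x - 3.0) ** 2            # so p(x) = N(3, 1) (illustrative)
grad_E = lambda x: x - 3.0
H = lambda x, p: E(x) + 0.5 * p ** 2          # Hamiltonian with K(p) = p^T p / 2

def heun_step(x, p, h):
    # one Heun step for z' = f(z) with f(x, p) = (p, -grad E(x))
    dx1, dp1 = p, -grad_E(x)
    dx2, dp2 = p + h * dp1, -grad_E(x + h * dx1)
    return x + 0.5 * h * (dx1 + dx2), p + 0.5 * h * (dp1 + dp2)

def hmc(n_samples, x0=0.0, h=0.2, L=10):
    x = x0
    chain = np.empty(n_samples)
    for t in range(n_samples):
        p = rng.standard_normal()             # resample momentum
        x_new, p_new = x, p
        for _ in range(L):                    # simulate the dynamics
            x_new, p_new = heun_step(x_new, p_new, h)
        # MH correction for the numerical error of the integrator
        if rng.uniform() < min(1.0, np.exp(H(x, p) - H(x_new, p_new))):
            x = x_new
        chain[t] = x
    return chain

chain = hmc(20_000)
```

Heun's method is used here to match the slide; in practice HMC uses the leapfrog integrator, which is volume-preserving and time-reversible and therefore gives exact detailed balance under the MH correction.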
Hamiltonian Monte Carlo
moving with momentum
Visualization
by Chi Feng https://github.com/chi-feng
http://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH,banana
How to set the Hyperparameters? The No-U-Turn Sampler
The state of the art in MCMC [Hoffman & Gelman, JMLR 15 (2014), pp. 1593–1623]
Abstract
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that
avoids the random walk behavior and sensitivity to correlated parameters that plague many
MCMC methods by taking a series of steps informed by first-order gradient information.
These features allow it to converge to high-dimensional target distributions much more
quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However,
HMC’s performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS […]
Visualization
by Chi Feng https://github.com/chi-feng
https://chi-feng.github.io/mcmc-demo/app.html#NaiveNUTS,banana
Markov Chain Monte Carlo
▶ breaks down sampling into local dynamics
▶ samples correctly in the asymptotic limit
▶ avoiding random walk behaviour (achieving good asymptotic mixing) requires careful design
▶ Hamiltonian MCMC methods (like NUTS) are currently among the state of the art (sequential MC being an alternative).
▶ they require the solution of an ordinary differential equation (the Hamiltonian dynamics)
▶ their hyperparameters are tuned using elaborate subroutines
▶ this is typical of all good numerical methods!
▶ these methods are available in software packages
Reminder: Monte Carlo methods converge stochastically. This stochastic rate is an optimistic bound
for MCMC, because it has to be scaled by the mixing time. Monte Carlo methods are a powerful, well-
developed tool. But they are most likely not the final solution to integration.
Exercises
Computing with Probabilities, but without tools