Lecture 11 - 14 Computational Techniques

The document discusses Bayesian statistics and computational techniques, focusing on the Laplace approximation for integrals involving smooth functions. It covers examples, error analysis, and methods such as Monte Carlo, independent Monte Carlo, and importance sampling for approximating posterior expected values. Additionally, it introduces Markov Chain Monte Carlo (MCMC) as a method for sampling from complex posterior distributions.


Bayesian Statistics

Computational Techniques

Shaobo Jin
Department of Mathematics



Computational Techniques: Laplace Approximation

Laplace Approximation to Integral


Suppose that we want to approximate the integral
$$\int h(\theta) \exp\{-\ell(\theta)\}\, d\theta,$$
where $\theta$ has dimension $d \times 1$, and $\ell(\theta)$ and $h(\theta)$ are smooth functions. Suppose that $\ell(\theta)$ is uniquely minimized at $\hat\theta$, so that
$$\frac{\partial \ell(\hat\theta)}{\partial \theta} = 0, \qquad \frac{\partial^2 \ell(\hat\theta)}{\partial \theta\, \partial \theta^T} > 0.$$
Then the Laplace approximation to the above integral is
$$(2\pi)^{d/2} \sqrt{\det\left[\left(\frac{\partial^2 \ell(\hat\theta)}{\partial \theta\, \partial \theta^T}\right)^{-1}\right]}\; \exp\{-\ell(\hat\theta)\}\, h(\hat\theta).$$
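For illustration, here is a minimal numerical sketch of this formula in Python (mine, not from the lecture); `laplace_approx` and `num_hessian` are hypothetical helper names, and the finite-difference Hessian is only an implementation convenience.

```python
import numpy as np
from scipy.optimize import minimize

def num_hessian(f, x, eps=1e-5):
    """Central finite-difference Hessian of a scalar function f at x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

def laplace_approx(ell, h, theta0):
    """Laplace approximation to the integral of h(theta) * exp(-ell(theta))."""
    theta_hat = minimize(ell, theta0).x          # minimizer of ell
    H = num_hessian(ell, theta_hat)              # Hessian at the minimum
    d = len(theta_hat)
    return ((2 * np.pi) ** (d / 2)
            * np.sqrt(np.linalg.det(np.linalg.inv(H)))
            * np.exp(-ell(theta_hat)) * h(theta_hat))

# Sanity check: with ell = x^2/2 and h = 1 the approximation is exact, sqrt(2*pi).
print(laplace_approx(lambda x: 0.5 * x[0] ** 2, lambda x: 1.0, np.array([0.1])))
```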


Laplace Approximation: Example

Example
Suppose that the posterior is
$$\beta \mid y, \sigma^2 \sim N\!\left(\frac{\sum_{i=1}^n y_i}{n+1},\; \frac{\sigma^2}{n+1}\right),$$
$$\sigma^2 \mid y \sim \mathrm{InvGamma}\!\left(2 + \frac{n}{2},\; 2 + \frac{1}{2}\left[\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n+1}\right]\right),$$
where $n = 20$, $\sum_{i=1}^n y_i = 40.4$, and $\sum_{i=1}^n y_i^2 = 93.2$. Approximate $E[\beta \mid y]$ by the Laplace approximation.


Error Analysis

Suppose that we can express $\ell(\theta)$ as $p\, q(\theta)$, so that the integral is
$$\int h(\theta) \exp\{-p\, q(\theta)\}\, d\theta,$$
where $h(\theta)$ and $q(\theta)$ do not depend on $p$, and both are smooth functions.
Let $H \stackrel{\mathrm{def}}{=} \frac{\partial^2 q(\hat\theta)}{\partial \theta\, \partial \theta^T} > 0$. The Laplace approximation satisfies
$$\int h(\theta) \exp\{-p\, q(\theta)\}\, d\theta = (2\pi)^{d/2} \sqrt{\det\!\left(H^{-1}/p\right)}\, \exp\{-p\, q(\hat\theta)\} \left[h(\hat\theta) + O\!\left(p^{-1}\right)\right].$$


Ratio of Integrals

In practice, we often want to approximate a ratio of integrals such as
$$\int h(\theta)\, \pi(\theta \mid x)\, d\theta = \frac{\int h(\theta)\, f(x \mid \theta)\, \pi(\theta)\, d\theta}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}.$$
The naive approach is to approximate the numerator and the denominator separately by the Laplace approximation and take the ratio of the approximations. This yields
$$\int h(\theta)\, \pi(\theta \mid x)\, d\theta \approx h(\hat\theta),$$
which is not recommended: it keeps only the leading term, so the error is of order $O(p^{-1})$, whereas the fully exponential approximation below achieves $O(p^{-2})$.


Moment Generating Function

Consider the moment generating function
$$E[\exp\{t\, h(\theta)\} \mid x] = \int \exp\{t\, h(\theta)\}\, \pi(\theta \mid x)\, d\theta = \frac{\int \exp\{t\, h(\theta)\}\, f(x \mid \theta)\, \pi(\theta)\, d\theta}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}.$$
We apply the Laplace approximation to both the denominator and the numerator, and take the ratio of the approximations. Using the property of the moment generating function,
$$E[h(\theta) \mid x] = \left.\frac{d \log E[\exp\{t\, h(\theta)\} \mid x]}{dt}\right|_{t=0}.$$


Fully Exponential Laplace Approximation

Let
$$\ell(\theta, t) = -t\, h(\theta) - \log f(x \mid \theta) - \log \pi(\theta).$$
The fully exponential Laplace approximation is
$$E[h(\theta) \mid x] \approx h(\hat\theta) - \frac{1}{2} \left.\frac{\partial}{\partial t} \log \det\!\left[\frac{\partial^2 \ell(\tilde\theta(t), t)}{\partial \theta\, \partial \theta^T}\right]\right|_{t=0},$$
where $\tilde\theta(t)$ minimizes $\ell(\theta, t)$ for a given $t$, and $\hat\theta = \tilde\theta(0)$.

Under the same assumptions as for the Laplace approximation, the error rate is $O(p^{-2})$.
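A hedged one-dimensional sketch of this formula in Python, applied to $E[\sigma^2 \mid y]$ in the running example; the shape/scale values, the bounded optimizer, and the finite difference in $t$ are my own implementation choices, not the lecture's:

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 12.0, 9.739                   # InvGamma(a, b) posterior from the example

def neg_log_post(s2):                # -log posterior kernel of sigma^2 | y
    return (a + 1) * np.log(s2) + b / s2

def h(s2):                           # we want E[h(sigma^2) | y] with h the identity
    return s2

def ell(s2, t):                      # ell(theta, t) = -t h(theta) - log posterior kernel
    return -t * h(s2) + neg_log_post(s2)

def d2(f, x, eps=1e-4):              # central second difference
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2

def half_logdet(t):                  # (1/2) log Hessian of ell at theta_tilde(t)
    s2_t = minimize_scalar(lambda s: ell(s, t), bounds=(1e-6, 100.0),
                           method="bounded").x
    return 0.5 * np.log(d2(lambda s: ell(s, t), s2_t))

dt = 1e-4
s2_hat = minimize_scalar(lambda s: ell(s, 0.0), bounds=(1e-6, 100.0),
                         method="bounded").x
approx = h(s2_hat) - (half_logdet(dt) - half_logdet(-dt)) / (2 * dt)
print(approx, "vs the closed-form mean", b / (a - 1))   # should be close
```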




Computational Techniques: Monte Carlo

Expectation Under Posterior


For given data $x$, we often need to compute the posterior expected value of a function $h(\theta, x)$,
$$\mu(x) = \int h(\theta, x)\, \pi(\theta \mid x)\, d\theta.$$
However, it is not always the case that we can find a closed-form expression for $\mu(x)$, so approximations are often needed.
Suppose that we want to approximate
$$E[h(X)] = \int h(x)\, f(x)\, dx,$$
where $f(x)$ is the density of the random variable/vector $X$. A natural approach is to approximate it by the sample mean
$$\bar h = \frac{1}{n} \sum_{i=1}^n h(x_i).$$


Approximate Expectation by Sample Mean

Under mild conditions, the sample mean
$$\bar h = \frac{1}{n} \sum_{i=1}^n h(x_i)$$
has nice properties:
1. Unbiasedness: $E[\bar h] = E[h(X)]$ for any $n$.
2. Consistency: $\bar h \to E[h(X)]$ in probability, as $n \to \infty$.
3. Strong consistency: $\bar h \to E[h(X)]$ almost surely, as $n \to \infty$.
4. Asymptotic normality: $\sqrt{n}\left(\bar h - E[h(X)]\right) \to N(0, \mathrm{Var}[h(X)])$ in distribution, as $n \to \infty$.

The classic methods (e.g., independent Monte Carlo and importance sampling) have these properties.



Computational Techniques: Independent Monte Carlo

Sample From Posterior

For given data $x$, suppose that we want to compute the posterior expected value
$$\mu(x) = \int h(\theta, x)\, \pi(\theta \mid x)\, d\theta.$$
If $\pi(\theta \mid x)$ is a well-known distribution from which we can easily sample, then we draw $n$ independent samples $\theta_1, \ldots, \theta_n$ from $\pi(\theta \mid x)$, and the independent Monte Carlo approximation is
$$\hat\mu_{\mathrm{IMC}} = \frac{1}{n} \sum_{i=1}^n h(\theta_i, x).$$


Independent Monte Carlo: Example

Example
Suppose that the posterior is
$$\beta \mid y, \sigma^2 \sim N\!\left(\frac{\sum_{i=1}^n y_i}{n+1},\; \frac{\sigma^2}{n+1}\right),$$
$$\sigma^2 \mid y \sim \mathrm{InvGamma}\!\left(2 + \frac{n}{2},\; 2 + \frac{1}{2}\left[\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n+1}\right]\right),$$
where $n = 20$, $\sum_{i=1}^n y_i = 40.4$, and $\sum_{i=1}^n y_i^2 = 93.2$. We want to approximate $E[\beta \mid y]$ by independent Monte Carlo. In this example, we know the true value
$$E[\beta \mid y] = \frac{\sum_{i=1}^n y_i}{n+1}.$$
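A minimal Python sketch of this example (the draw count of 10,000 is an arbitrary choice):

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(1)
n, sy, syy = 20, 40.4, 93.2
a, b = 2 + n / 2, 2 + 0.5 * (syy - sy ** 2 / (n + 1))

m = 10_000
sigma2 = invgamma(a, scale=b).rvs(size=m, random_state=rng)  # sigma^2 | y
beta = rng.normal(sy / (n + 1), np.sqrt(sigma2 / (n + 1)))   # beta | y, sigma^2
print(beta.mean(), "vs the true value", sy / (n + 1))        # 40.4/21 = 1.9238
```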



Computational Techniques: Importance Sampling

Importance Distribution

It is common that sampling directly from $\pi(\theta \mid x)$ is not straightforward. Suppose that it is easy to sample directly from another distribution with density $g(\theta \mid x)$ such that $g(\theta \mid x) > 0$ whenever $h(\theta, x)\, \pi(\theta \mid x) \neq 0$.
We can rewrite $\mu(x)$ as
$$\mu(x) = \int h(\theta, x)\, \frac{\pi(\theta \mid x)}{g(\theta \mid x)}\, g(\theta \mid x)\, d\theta = E\!\left[h(\theta, x)\, \frac{\pi(\theta \mid x)}{g(\theta \mid x)} \,\Big|\, x\right],$$
where the expectation is taken with respect to $\theta \mid x \sim g(\theta \mid x)$.
We call $g(\theta \mid x)$ an importance distribution or instrumental distribution.


Importance Sampling Approximation


The importance sampling approximation is
$$\hat\mu_{\mathrm{IS}} = \frac{1}{n} \sum_{i=1}^n \frac{\pi(\theta_i \mid x)}{g(\theta_i \mid x)}\, h(\theta_i, x), \qquad \theta_1, \ldots, \theta_n \sim g(\theta \mid x).$$

Example
Suppose that the posterior is
$$\beta \mid y, \sigma^2 \sim N\!\left(\frac{\sum_{i=1}^n y_i}{n+1},\; \frac{\sigma^2}{n+1}\right),$$
$$\sigma^2 \mid y \sim \mathrm{InvGamma}\!\left(2 + \frac{n}{2},\; 2 + \frac{1}{2}\left[\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i\right)^2}{n+1}\right]\right),$$
where $n = 20$, $\sum_{i=1}^n y_i = 40.4$, and $\sum_{i=1}^n y_i^2 = 93.2$. Approximate $E[\sigma^2 \mid y]$ by importance sampling.
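A Python sketch of this example; using the $\mathrm{InvGamma}(2, 2)$ prior as the importance distribution is my own choice (it keeps the weights bounded here), not one prescribed in the lecture:

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(2)
n, sy, syy = 20, 40.4, 93.2
a, b = 2 + n / 2, 2 + 0.5 * (syy - sy ** 2 / (n + 1))
target = invgamma(a, scale=b)        # pi(sigma^2 | y)
g = invgamma(2, scale=2)             # importance distribution (assumed: the prior)

m = 10_000
s2 = g.rvs(size=m, random_state=rng)
w = target.pdf(s2) / g.pdf(s2)       # importance weights pi/g
print(np.mean(w * s2), "vs the closed-form mean", b / (a - 1))
```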
Computational Techniques: Normalized Importance Sampling

Normalizing Constant
Since we often derive the posterior using $\pi(\theta \mid x) \propto f(x \mid \theta)\, \pi(\theta)$, ignoring the normalizing constant $m(x)$, we cannot always evaluate $\pi(\theta \mid x)$: it is easy to evaluate $f(x \mid \theta)\, \pi(\theta)$, but not $m(x)$.
We can rewrite $\mu(x)$ as
$$\mu(x) = \int h(\theta, x)\, \pi(\theta \mid x)\, d\theta = \frac{\int h(\theta, x)\, f(x \mid \theta)\, \pi(\theta)\, d\theta}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}.$$
We can apply the importance sampling trick to both integrals:
$$\int h(\theta, x)\, f(x \mid \theta)\, \pi(\theta)\, d\theta = E\Bigg[h(\theta, x)\, \underbrace{\frac{f(x \mid \theta)\, \pi(\theta)}{g(\theta \mid x)}}_{\text{importance weight } w(\theta, x)}\Bigg],$$
where $g(\theta \mid x) > 0$ is required whenever $\pi(\theta \mid x) \neq 0$, a stronger condition than for importance sampling.



Normalized Importance Sampling

The importance sampling approximations to the numerator and the denominator are
$$\frac{1}{n} \sum_{i=1}^n w(\theta_i, x)\, h(\theta_i, x) = \frac{1}{n} \sum_{i=1}^n \frac{f(x \mid \theta_i)\, \pi(\theta_i)}{g(\theta_i \mid x)}\, h(\theta_i, x),$$
$$\frac{1}{n} \sum_{i=1}^n w(\theta_i, x) = \frac{1}{n} \sum_{i=1}^n \frac{f(x \mid \theta_i)\, \pi(\theta_i)}{g(\theta_i \mid x)}.$$
The ratio is the normalized importance sampling estimator
$$\hat\mu_{\mathrm{NIS}} = \frac{\sum_{i=1}^n w(\theta_i, x)\, h(\theta_i, x)}{\sum_{i=1}^n w(\theta_i, x)}.$$


Normalized Importance Sampling: Example

We can even ignore the constants in $f(y \mid \theta)\, \pi(\theta)$ in normalized importance sampling.

Example
Consider an iid sample of size $n$ from $Y \mid \beta, \sigma^2 \sim N(\beta, \sigma^2)$. The prior of $\sigma^2$ is $\mathrm{InvGamma}(2, 2)$, and $\beta \mid \sigma^2$ is $N(0, \sigma^2)$. Then,
$$f(y \mid \theta)\, \pi(\theta) \propto \frac{\exp\left\{-\left[(n+1)\beta^2 - 2\beta \sum_{i=1}^n y_i + 4 + \sum_{i=1}^n y_i^2\right] / (2\sigma^2)\right\}}{(\sigma^2)^{(n+1)/2 + 3}}.$$
We observe $n = 20$, $\sum_{i=1}^n y_i = 40.4$, and $\sum_{i=1}^n y_i^2 = 93.2$. Approximate $E[\sigma^2 \mid y]$ by normalized importance sampling.
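A Python sketch using only the unnormalized kernel above; the independent importance distribution for $(\beta, \sigma^2)$ is an assumed, simple choice and not necessarily an efficient one:

```python
import numpy as np
from scipy.stats import invgamma, norm

rng = np.random.default_rng(3)
n, sy, syy = 20, 40.4, 93.2

def log_kernel(beta, s2):            # log f(y | theta) pi(theta) up to a constant
    return (-((n + 1) * beta ** 2 - 2 * beta * sy + 4 + syy) / (2 * s2)
            - ((n + 1) / 2 + 3) * np.log(s2))

g_beta, g_s2 = norm(2.0, 1.0), invgamma(2, scale=2)   # assumed proposal

m = 50_000
beta = g_beta.rvs(size=m, random_state=rng)
s2 = g_s2.rvs(size=m, random_state=rng)
logw = log_kernel(beta, s2) - g_beta.logpdf(beta) - g_s2.logpdf(s2)
w = np.exp(logw - logw.max())        # stabilize; constants cancel in the ratio
print(np.sum(w * s2) / np.sum(w))    # approx E[sigma^2 | y], about 0.885
```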



Randomness
In independent Monte Carlo, importance sampling, and normalized importance sampling, we simulate random numbers from $\pi(\theta \mid x)$ or $g(\theta \mid x)$, so $\hat\mu$ is a random variable. This means that we can construct a confidence interval for $\mu(x)$ using the central limit theorem.

[Figure: density of repeated Monte Carlo approximations (MCInt), concentrated between roughly 1.75 and 2.10.]
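For instance, a brief sketch of a CLT-based interval for $E[\beta \mid y]$ in the running example (the draw count is arbitrary):

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(4)
n, sy, syy = 20, 40.4, 93.2
a, b = 2 + n / 2, 2 + 0.5 * (syy - sy ** 2 / (n + 1))

m = 10_000
s2 = invgamma(a, scale=b).rvs(size=m, random_state=rng)
h = rng.normal(sy / (n + 1), np.sqrt(s2 / (n + 1)))  # draws of beta | y
mu_hat, se = h.mean(), h.std(ddof=1) / np.sqrt(m)
print(f"{mu_hat:.4f} +/- {1.96 * se:.4f}")           # 95% interval via the CLT
```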
Computational Techniques: MCMC

Markov Chain Monte Carlo

We often want to get a sample from the posterior.
- If the posterior follows some well-known distribution, we can generate a sample easily.
- If the posterior does not follow any well-known distribution, Markov chain Monte Carlo (MCMC) is a very popular choice.
The idea of MCMC relies on the Markov property.

Definition
A Markov chain is a sequence of random variables $X_i$ that satisfy the Markov property:
$$P(X_{i+1} \in A \mid X_j = x_j,\ 0 \le j \le i) = P(X_{i+1} \in A \mid X_i = x_i).$$


Transition Kernel

The transition kernel describes how the Markov chain moves from $X_{n-1}$ to $X_n$.
- If $\{X_n\}$ is discrete, the transition kernel is a matrix $K$ with elements $P(X_n = y \mid X_{n-1} = x)$.
- If $\{X_n\}$ is continuous, the Markov property means that
$$P(X_n \in A \mid X_{n-1} = x, \ldots, X_0) = \int_{y \in A} K(x, y)\, dy,$$
$$f(X_n = y \mid X_{n-1} = x, \ldots, X_0) = f(X_n = y \mid X_{n-1} = x) = K(x, y),$$
where the transition kernel $K(x, y)$ is the conditional density of $X_n$ given $X_{n-1} = x$.


Stationary Distribution

Definition
The distribution $p$ on $\Omega$ is a stationary distribution (or invariant distribution) of the Markov chain with transition kernel $K$ if
$$P(y) = \sum_{x \in \Omega} P(x)\, K(x, y) \quad \text{(discrete case)},$$
$$f(y) = \int_{x \in \Omega} f(x)\, K(x, y)\, dx \quad \text{(continuous case)},$$
where $P$ and $f$ denote the stationary probability mass function and density themselves, not generic symbols.

Stationarity means that if the initial state satisfies $X_0 \sim \pi(\theta \mid \text{data})$, then $X_n \sim \pi(\theta \mid \text{data})$ for all $n \ge 0$: every state has the same marginal distribution.
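A toy discrete illustration (mine, not from the lecture): for a two-state chain we can verify $pK = p$ directly and watch the rows of $K^n$ converge to $p$.

```python
import numpy as np

K = np.array([[0.9, 0.1],
              [0.3, 0.7]])                  # K[x, y] = P(X_n = y | X_{n-1} = x)
p = np.array([0.75, 0.25])                  # candidate stationary distribution
print(p @ K)                                # equals p, so p is stationary
print(np.linalg.matrix_power(K, 100)[0])    # rows of K^100 are close to p
```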

Long-Run Property

Theorem
Let $\pi(\cdot)$ be the stationary distribution of the Markov chain. Under some regularity conditions,
$$\lim_{n \to \infty} \sup_A \left|P(X_n \in A \mid X_0 = x) - \pi(A)\right| = 0, \quad \text{almost surely},$$
regardless of the initial state $X_0 = x$.

Since the limiting distribution does not depend on the initial state $x$, the marginal distribution of $X_n$ is approximately the stationary distribution after enough iterations.


Choose the Transition Kernel

Our goal is to simulate data from $\pi(\theta \mid x)$. We need to choose the transition kernel $K$ such that the stationary distribution is $\pi(\theta \mid x)$.

Fact
If $\pi(\theta \mid x)$ and $K(\theta, \theta^*)$ satisfy the detailed balance condition, i.e.,
$$K(\theta, \theta^*)\, \pi(\theta \mid x) = K(\theta^*, \theta)\, \pi(\theta^* \mid x)$$
for any $\theta, \theta^* \in \Theta$, then $\pi(\theta \mid x)$ is the stationary distribution of the Markov chain with transition kernel $K$.



Computational Techniques: Metropolis-Hastings

Proposal Distribution
When we simulate random numbers from a Markov chain, we need a proposal distribution
$$T(\theta, \theta^*) = f(\theta^* \mid \theta).$$
Finding a proposal distribution $T(\theta, \theta^*)$ that itself satisfies the detailed balance condition is difficult.
- So, with probability $A(\theta, \theta^*)$ we let $\theta^{(n+1)} = \theta^*$ (accept), and with probability $1 - A(\theta, \theta^*)$ we let $\theta^{(n+1)} = \theta$ (reject).
- For $\theta^{(n+1)} \neq \theta$, the transition is
$$K(\theta, \theta^*) = T(\theta, \theta^*)\, A(\theta, \theta^*).$$
Hence, we should seek $A$ such that the detailed balance condition is fulfilled.

Deriving the Acceptance Probability $A(\theta, \theta^*)$
The detailed balance condition is fulfilled if we choose the acceptance probability to be
$$A(\theta, \theta^*) = \lambda(\theta, \theta^*)\, \pi(\theta^* \mid x)\, T(\theta^*, \theta) \le 1,$$
$$A(\theta^*, \theta) = \lambda(\theta, \theta^*)\, \pi(\theta \mid x)\, T(\theta, \theta^*) \le 1.$$
The largest $\lambda$ that keeps both acceptance probabilities $\le 1$ is
$$\lambda(\theta, \theta^*) = \min\left\{\frac{1}{\pi(\theta^* \mid x)\, T(\theta^*, \theta)},\ \frac{1}{\pi(\theta \mid x)\, T(\theta, \theta^*)}\right\}.$$
Hence,
$$A(\theta, \theta^*) = \lambda(\theta, \theta^*)\, \pi(\theta^* \mid x)\, T(\theta^*, \theta) = \min\left\{1,\ \frac{\pi(\theta^* \mid x)\, T(\theta^*, \theta)}{\pi(\theta \mid x)\, T(\theta, \theta^*)}\right\}.$$


Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm allows any proposal distribution such that $T(\theta, \theta^*) > 0$ if and only if $T(\theta^*, \theta) > 0$.

Algorithm 1: Metropolis-Hastings Algorithm
1. Choose an initial state $\theta^{(0)}$.
2. For $t$ in $1 : n$:
   (a) Sample a candidate $\theta^*$ from $T(\theta^{(t)}, \cdot \mid x)$.
   (b) Calculate the ratio $R(\theta^{(t)}, \theta^*) = \dfrac{\pi(\theta^* \mid x)\, T(\theta^*, \theta^{(t)})}{\pi(\theta^{(t)} \mid x)\, T(\theta^{(t)}, \theta^*)}$.
   (c) Draw $U \sim U[0, 1]$.
   (d) Update
$$\theta^{(t+1)} = \begin{cases} \theta^*, & \text{if } U \le R(\theta^{(t)}, \theta^*), \\ \theta^{(t)}, & \text{otherwise.} \end{cases}$$

Metropolis-Hastings Algorithm: Example


Since the ratio $R(\theta^{(t)}, \theta^*)$ involves $\pi(\theta^* \mid x) / \pi(\theta^{(t)} \mid x)$, we only need to know $\pi(\cdot \mid x)$ up to a normalizing constant.

Example
Consider an iid sample of size $n$ from $Y \mid \beta, \sigma^2 \sim N(\beta, \sigma^2)$. The prior of $\sigma^2$ is $\mathrm{InvGamma}(2, 2)$, and $\beta \mid \sigma^2$ is $N(0, \sigma^2)$. Then,
$$f(y \mid \theta)\, \pi(\theta) \propto \frac{\exp\left\{-\left[(n+1)\beta^2 - 2\beta \sum_{i=1}^n y_i + 4 + \sum_{i=1}^n y_i^2\right] / (2\sigma^2)\right\}}{(\sigma^2)^{(n+1)/2 + 3}}.$$
We observe $n = 20$, $\sum_{i=1}^n y_i = 40.4$, and $\sum_{i=1}^n y_i^2 = 93.2$. Obtain a sample from the posterior.
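A sketch of a random-walk sampler for this example in Python; the proposal is symmetric, so the $T$ terms cancel in $R$, and the step size, chain length, and burn-in are assumed tuning choices:

```python
import numpy as np

rng = np.random.default_rng(5)
n, sy, syy = 20, 40.4, 93.2

def log_kernel(beta, s2):                 # log pi(theta | y) up to a constant
    if s2 <= 0:
        return -np.inf                    # proposals outside the support get rejected
    return (-((n + 1) * beta ** 2 - 2 * beta * sy + 4 + syy) / (2 * s2)
            - ((n + 1) / 2 + 3) * np.log(s2))

theta, draws = np.array([0.0, 1.0]), []   # initial state (beta, sigma^2)
for t in range(20_000):
    prop = theta + rng.normal(scale=0.3, size=2)       # symmetric random walk
    if np.log(rng.uniform()) <= log_kernel(*prop) - log_kernel(*theta):
        theta = prop                                   # accept
    draws.append(theta)

draws = np.array(draws)[5_000:]           # discard burn-in
print(draws.mean(axis=0))                 # approx E[beta | y] and E[sigma^2 | y]
```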


Detailed Balance: Symmetric Proposal

The Metropolis-Hastings algorithm allows asymmetric proposal distributions. If the proposal distribution is symmetric, i.e., $T(\theta, \theta^*) = T(\theta^*, \theta)$, then the Metropolis-Hastings algorithm reduces to the Metropolis algorithm.

Example
$\theta^* \mid \theta \sim N(\theta, \sigma^2)$ is symmetric, since
$$T(\theta, \theta^*) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(\theta - \theta^*)^2}{2\sigma^2}\right\}.$$


Metropolis Algorithm

Algorithm 2: Metropolis Algorithm
1. Choose an initial state $\theta^{(0)}$.
2. For $t$ in $1 : n$:
   (a) Sample a candidate $\theta^*$ from $T(\theta^{(t)}, \cdot \mid x)$.
   (b) Calculate the ratio $R(\theta^{(t)}, \theta^*) = \dfrac{\pi(\theta^* \mid x)}{\pi(\theta^{(t)} \mid x)}$.
   (c) Draw $U \sim U[0, 1]$.
   (d) Update
$$\theta^{(t+1)} = \begin{cases} \theta^*, & \text{if } U \le R(\theta^{(t)}, \theta^*), \\ \theta^{(t)}, & \text{otherwise.} \end{cases}$$


Some Examples of Metropolis-Hastings Algorithms

Many different MCMC algorithms differ mainly in how the candidate $\theta^*$ is sampled.
- In the random-walk Metropolis algorithm, $\theta^* = \theta^{(t)} + \epsilon$, where $\epsilon$ is sampled from some distribution, e.g., $\mathrm{Uniform}[-a, a]$, normal, etc.
- In the independence sampler, $\theta^*$ is sampled from a $g(\cdot)$ that does not depend on $\theta^{(t)}$.
- The Langevin Metropolis-Hastings algorithm explores the shape of the posterior distribution via $\theta^* = \theta^{(t)} + d^{(t)} + \tau \epsilon$, where $\epsilon \sim N(0, I)$ and
$$d^{(t)} = \frac{\tau^2}{2}\, \frac{\partial \log \pi(\theta^{(t)} \mid x)}{\partial \theta}.$$


Computational Techniques: Gibbs Sampler

Gibbs Sampler: Conditioning


It can be much easier to sample from the conditional distributions than to use Metropolis-Hastings on the joint distribution of $\theta \in \Theta \subseteq \mathbb{R}^d$.
- Suppose that $\theta = (\theta_1, \ldots, \theta_p)$, where $\theta_i \in \mathbb{R}^{d_i}$.
- Let $\pi_{i \mid -i}(\theta_i \mid \theta_{-i}, x)$ be the conditional distribution of $\theta_i$ given $\theta_{-i}$ and $x$, where $\theta_{-i} = (\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_p)$.

Algorithm 3: Basic Gibbs Sampler
1. Choose an initial state $\theta^{(0)}$.
2. For $t$ in $1 : n$, for $i$ in $1 : p$, draw
$$\theta_i^{(t+1)} \sim \pi_{i \mid -i}\left(\theta_i \mid \theta_1^{(t+1)}, \ldots, \theta_{i-1}^{(t+1)}, \theta_{i+1}^{(t)}, \ldots, \theta_p^{(t)}, x\right).$$


Gibbs Sampler: Example

Example
Suppose that our data $X_1, \ldots, X_n$ are iid from $N(\mu, \lambda^{-1})$. The prior distributions of $\mu$ and $\lambda$ are
$$\mu \sim N(\mu_0, \lambda_0^{-1}), \qquad \lambda \sim \mathrm{Exp}(b_0).$$
Use the Gibbs sampler to sample random numbers from the posterior distribution of $(\mu, \lambda)$.
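A sketch of this Gibbs sampler in Python. The full conditionals below follow from the model ($\mu \mid \lambda, x$ is normal, and $\lambda \mid \mu, x$ is Gamma); the synthetic data and hyperparameter values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(2.0, 1.0, size=50)         # synthetic data (assumed)
n, mu0, lam0, b0 = len(x), 0.0, 1.0, 1.0  # assumed hyperparameters

mu, lam, draws = 0.0, 1.0, []
for t in range(10_000):
    # mu | lambda, x ~ N((lam0*mu0 + lam*sum(x)) / (lam0 + n*lam), 1/(lam0 + n*lam))
    prec = lam0 + n * lam
    mu = rng.normal((lam0 * mu0 + lam * x.sum()) / prec, np.sqrt(1 / prec))
    # lambda | mu, x ~ Gamma(n/2 + 1, rate = b0 + sum((x - mu)^2) / 2)
    lam = rng.gamma(n / 2 + 1, 1 / (b0 + 0.5 * np.sum((x - mu) ** 2)))
    draws.append((mu, lam))

print(np.mean(draws[2_000:], axis=0))     # posterior means after burn-in
```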


Why Does Gibbs Sampler Work?


To show that the Gibbs sampler generates random numbers from the desired stationary distribution, we only need to show that
$$\pi(\theta^* \mid x) = \int K(\theta, \theta^*)\, \pi(\theta \mid x)\, d\theta,$$
where $\pi(\cdot \mid x)$ is the posterior itself, not a generic symbol.
For simplicity, consider $p = 2$ and a continuous posterior. The transition kernel $K(\theta, \theta^*)$ is
$$K((\theta_1, \theta_2), (\theta_1^*, \theta_2^*)) = \pi_{1 \mid 2}(\theta_1^* \mid \theta_2, x)\, \pi_{2 \mid 1}(\theta_2^* \mid \theta_1^*, x).$$
This transition kernel satisfies
$$\iint K((\theta_1, \theta_2), (\theta_1^*, \theta_2^*))\, \pi(\theta_1, \theta_2 \mid x)\, d\theta_1\, d\theta_2 = \pi(\theta_1^*, \theta_2^* \mid x).$$


Collapsed Gibbs Sampler

Suppose that $\theta$ can be partitioned into three groups of parameters $(\theta_1, \theta_2, \theta_3)$.
- The Gibbs sampler samples from the full conditional distributions $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, \theta_3^{(t)}, x)$, $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, x)$, and $\theta_3^{(t+1)} \sim \pi(\theta_3 \mid \theta_1^{(t+1)}, \theta_2^{(t+1)}, x)$.
- In the collapsed Gibbs sampler, we integrate out $\theta_3$ analytically and work with $(\theta_1, \theta_2) \sim \pi(\theta_1, \theta_2 \mid x)$:
  - We sample $\theta_1^{(t+1)} \sim \pi(\theta_1 \mid \theta_2^{(t)}, x)$ and $\theta_2^{(t+1)} \sim \pi(\theta_2 \mid \theta_1^{(t+1)}, x)$ by the Gibbs sampler.
  - We then sample $\theta_3^{(t+1)} \sim \pi(\theta_3 \mid \theta_1^{(t+1)}, \theta_2^{(t+1)}, x)$.



Computational Techniques: Hamiltonian Monte Carlo

Hamiltonian Monte Carlo

- The Metropolis algorithm and the Gibbs sampler often move too slowly through the target distribution when its dimension is high.
- Hamiltonian Monte Carlo (HMC) moves much more quickly through the target distribution.
- For each component of the target distribution, HMC adds a momentum variable, and the proposal distribution largely depends on the momentum variable.
- Both the component of the target distribution and the momentum are updated in the MCMC algorithm.


Hamiltonian Dynamics

The idea of HMC originates from Hamiltonian dynamics in physics.
- The state of a system consists of the position $\theta \in \mathbb{R}^d$ and the momentum $\phi \in \mathbb{R}^d$ of the same dimension.
- The Hamiltonian is a function of $\theta$ and $\phi$, denoted by $H(\theta, \phi)$.
- The position and the momentum change over time $t$, as described by Hamilton's equations:
$$\frac{d\theta_i}{dt} = \frac{\partial H}{\partial \phi_i}, \qquad \frac{d\phi_i}{dt} = -\frac{\partial H}{\partial \theta_i}, \qquad i = 1, \ldots, d.$$


Potential and Kinetic Energy


For HMC, the Hamiltonian is usually
$$H(\theta, \phi) = U(\theta) + V(\phi),$$
where $U(\theta) = -\log \pi(\theta \mid x)$ is called the potential energy and $V(\phi)$ is called the kinetic energy.
- We want to sample from $\pi(\theta \mid x)$; the momentum $\phi$ is artificial.
- We often let $\phi \sim N(0, M)$, independent of $\theta \mid x$, for a prespecified covariance matrix $M$, and take $V(\phi)$ to be the negative log density of $\phi$.
- Hamilton's equations then become
$$\frac{d\theta}{dt} = M^{-1} \phi, \qquad \frac{d\phi}{dt} = \frac{\partial \log \pi(\theta \mid x)}{\partial \theta},$$
arranged as column vectors.


Augmentation

Since $\theta$ and $\phi$ are independent, their joint density is
$$f(\theta, \phi \mid x) = \pi(\theta \mid x)\, p(\phi \mid x) = \exp\{-U(\theta) - V(\phi)\} = \exp\{-H(\theta, \phi)\}.$$
We have augmented the problem from sampling $\theta$ from $\pi(\theta \mid x)$ to sampling $(\theta, \phi)$ from $\exp\{-H(\theta, \phi)\}$.
1. We first sample $\phi$ from $N(0, M)$, independent of the current $\theta$. Since $\phi \sim N(0, M)$, we have already sampled $\phi$ from the desired distribution.
2. We then sample $\theta$, where the new state is proposed by Hamiltonian dynamics, obtained by solving the differential equations.


Solve the Differential Equations

To solve the differential equations, we use an approximation known as the leapfrog method. For some stepsize $\epsilon > 0$, we perform half-step updates:
$$\phi\left(t + \frac{\epsilon}{2}\right) = \phi(t) + \frac{\epsilon}{2}\, \frac{\partial \log \pi(\theta(t) \mid x)}{\partial \theta},$$
$$\theta(t + \epsilon) = \theta(t) + \epsilon\, M^{-1} \phi\left(t + \frac{\epsilon}{2}\right),$$
$$\phi(t + \epsilon) = \phi\left(t + \frac{\epsilon}{2}\right) + \frac{\epsilon}{2}\, \frac{\partial \log \pi(\theta(t + \epsilon) \mid x)}{\partial \theta}.$$
Starting from $t = 0$, we can follow the trajectory at times $\{\epsilon, 2\epsilon, \ldots, L\epsilon\}$ and approximate the values of $\theta(L\epsilon)$ and $\phi(L\epsilon)$.


Leapfrog Method to Sample θ


Suppose that the current state is $(\theta, \phi)$.
1. Update $\phi$ with a half step: $\phi \leftarrow \phi + \frac{\epsilon}{2}\, \frac{\partial \log \pi(\theta \mid x)}{\partial \theta}$.
2. For $\ell = 1, \ldots, L - 1$:
   (a) Update the position: $\theta \leftarrow \theta + \epsilon M^{-1} \phi$.
   (b) Update the momentum: $\phi \leftarrow \phi + \epsilon\, \frac{\partial \log \pi(\theta \mid x)}{\partial \theta}$.
3. Make one last update of the position: $\theta \leftarrow \theta + \epsilon M^{-1} \phi$.
4. Make one last half-step update of the momentum: $\phi \leftarrow \phi + \frac{\epsilon}{2}\, \frac{\partial \log \pi(\theta \mid x)}{\partial \theta}$.

Metropolis Step

Suppose that the state after $L$ such updates is $(\theta^*, \phi^*)$. We negate the momentum, so the proposed state is $(\theta^*, -\phi^*)$.
We determine whether to accept the proposal using the Metropolis algorithm, with acceptance probability
$$A((\theta, \phi), (\theta^*, -\phi^*)) = \min\left\{1,\ \frac{\exp\{-H(\theta^*, -\phi^*)\}}{\exp\{-H(\theta, \phi)\}}\right\}.$$
- If the proposed state is accepted, we accept $\theta^*$ as the new state for $\theta$; $\phi^*$ is discarded.
- Whether we accept or reject the proposal, we draw a new momentum in the next iteration, independent of the previous momentum.
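Putting the leapfrog updates and the Metropolis step together, here is a compact HMC sketch in Python; the bivariate normal target, $M = I$, and the tuning values $\epsilon$ and $L$ are my own illustration choices:

```python
import numpy as np

rng = np.random.default_rng(7)

def log_post(theta):                  # assumed target: standard bivariate normal
    return -0.5 * theta @ theta

def grad_log_post(theta):
    return -theta

def hmc_step(theta, eps=0.1, L=20):
    phi = rng.normal(size=theta.shape)              # momentum ~ N(0, I)
    th, ph = theta.copy(), phi.copy()
    ph += 0.5 * eps * grad_log_post(th)             # first half step
    for _ in range(L - 1):
        th += eps * ph                              # full position step (M = I)
        ph += eps * grad_log_post(th)               # full momentum step
    th += eps * ph                                  # last position step
    ph += 0.5 * eps * grad_log_post(th)             # last half step
    # Metropolis step with H(theta, phi) = -log pi(theta | x) + phi'phi / 2;
    # negating the momentum does not change phi'phi / 2.
    log_accept = (log_post(th) - 0.5 * ph @ ph) - (log_post(theta) - 0.5 * phi @ phi)
    return th if np.log(rng.uniform()) <= log_accept else theta

theta, draws = np.zeros(2), []
for _ in range(5_000):
    theta = hmc_step(theta)
    draws.append(theta)
print(np.mean(draws, axis=0), np.var(draws, axis=0))   # roughly 0 and 1
```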


Properties of HMC
Some crucial properties of Hamiltonian dynamics for MCMC updates:
1. Deterministic updates. The Hamiltonian dynamics is deterministic: after running the leapfrog loop $L$ times, we always move the initial state $(\theta_0, \phi_0)$ to the same proposal $(\theta^*, \phi^*)$.
2. Reversibility. The mapping from the state at time $t$, denoted by $(\theta(t), \phi(t))$, to the state at time $t + s$, denoted by $(\theta(t+s), \phi(t+s))$, is one-to-one and has an inverse mapping. If we negate the momentum, we come back from $(\theta(t+s), -\phi(t+s))$ to $(\theta(t), -\phi(t))$.
3. Connection between momentum and position. The momentum changes based on the position, since
$$\frac{d\phi_i}{dt} = -\frac{\partial H}{\partial \theta_i} = \frac{\partial \log \pi(\theta \mid x)}{\partial \theta_i}.$$

Tuning Parameters

Tuning of HMC can occur in several places, such as
1. the distribution of the momentum,
2. the scaling factor (stepsize) $\epsilon$,
3. the number of leapfrog steps $L$ per iteration.
Some theory suggests tuning HMC so that the acceptance probability is around 65%.


No-U-Turn Sampler
The no-U-turn sampler (NUTS) automatically tunes the number of steps $L$: we increase $L$ until the simulated dynamics is long enough that the proposed position $\theta^*$ would start to move back towards the initial position $\theta$ if we ran more steps. This is measured by the angle between $\theta^* - \theta$ and the current momentum $\phi^*$.

A basic NUTS works as follows. Given the initial state,
1. Sample $u \mid \theta, \phi \sim \mathrm{Uniform}[0, \exp\{-H(\theta, \phi)\}]$.
2. Apply the leapfrog method (with some modification) until a U-turn occurs.
3. Sample uniformly from the points in $\{(\theta, \phi) : \exp\{-H(\theta, \phi)\} \ge u\}$ that the leapfrog steps have visited and for which the detailed balance condition is fulfilled.

Adaptively Tune ϵ

Too small an $\epsilon$ wastes computation by taking needlessly tiny steps, while too large an $\epsilon$ causes high rejection rates.
- In HMC, we tune $\epsilon$ in the warm-up stage of MCMC so that the average acceptance probability $\delta$ equals a user-specified value.
- In NUTS, there is no Metropolis accept/reject step, but we can still compute the ratio as if we were using one, and set $\epsilon$ so that this pseudo acceptance probability equals the user-specified value.
- In Stan, the default is $\delta = 0.8$.



Computational Techniques: Convergence Diagnostics

Burn-In Period

The stationary distribution is reached only after enough iterations. If the iterations have not proceeded long enough, the simulated numbers may be unrepresentative of the target distribution.
- To diminish the influence of the starting values, we can discard the early simulations, known as the burn-in.
- There is no gold standard for how long the burn-in period should be.
- Hereafter, when we say the Markov chain has length $n$, we mean that the length is $n$ after the burn-in period.


Mixing
We want the Markov chain to show good mixing.
[Figure: two trace plots over 2000 iterations, illustrating bad mixing (top panel) and good mixing (bottom panel).]


Several Markov Chains


[Figure: trace plots of two chains (chain 1 and chain 2) over 2000 iterations, started from widely separated values.]

One suggestion is to generate several independent Markov chains, starting from widely separated places.

Gelman-Rubin R̂ Statistic: Variation


One way to assess convergence is the Gelman-Rubin $\hat R$ statistic. Suppose that we have simulated $m$ chains, each with $n$ iterations. Say we have a univariate quantity $y_{ij} = f(\theta_j^{(i)})$, where $\theta_j^{(i)}$ is the $i$th value in the $j$th chain.
The variation within the chains is measured by
$$W = \frac{1}{m} \sum_{j=1}^m \left[\frac{1}{n-1} \sum_{i=1}^n (y_{ij} - \bar y_{\cdot j})^2\right],$$
where $\bar y_{\cdot j}$ is the average of $\{y_{ij}\}_{i=1}^n$.
The variation between the chains is measured by
$$B = \frac{n}{m-1} \sum_{j=1}^m (\bar y_{\cdot j} - \bar y_{\cdot\cdot})^2,$$
where $\bar y_{\cdot\cdot}$ is the average of all $\bar y_{\cdot j}$.



Gelman-Rubin R̂ Statistic: Expression

If the Markov chains have reached stationarity, we expect $W$ to be close to $B$. The Gelman-Rubin $\hat R$ statistic is then
$$\hat R = \sqrt{\frac{\frac{n-1}{n} W + \frac{1}{n} B}{W}},$$
which declines to 1 as $n \to \infty$. It is suggested that we keep simulating the Markov chain until $\hat R < 1.1$, or even $< 1.01$.
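A plain Python sketch of this statistic (the function name and toy chains are mine; split-chain variants are discussed on the next slide):

```python
import numpy as np

def gelman_rubin(chains):
    """chains: (m, n) array holding m chains of length n."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()          # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)        # between-chain variance
    return np.sqrt(((n - 1) / n * W + B / n) / W)

rng = np.random.default_rng(8)
print(gelman_rubin(rng.normal(size=(4, 1000))))                      # close to 1
print(gelman_rubin(rng.normal(loc=[[0.0], [3.0]], size=(2, 1000))))  # far above 1
```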


Variants of Gelman-Rubin R̂

Several different versions of $\hat R$ have been proposed. One suggestion is to split each chain into two halves, yielding $2m$ chains of length $n/2$ each, and then compute $\hat R$ as if we had simulated $2m$ chains of length $n/2$.
This is useful for detecting the case where no individual chain has reached stationarity even though the chains together cover a common distribution, e.g., two chains that exhibit an X-shape.


Serial Correlation
Clearly $\theta^{(t+1)}$ and $\theta^{(t)}$ are not independent draws. Inference from autocorrelated draws is generally less precise than inference from the same number of independent draws. However, such serial correlation is not necessarily a problem: at convergence, we have reached the stationary distribution.

Algorithm 4: General MCMC Integral
1. Sample a Markov chain with stationary distribution $\pi(\theta \mid x)$: $\theta^{(1)}, \ldots, \theta^{(n)}$ (after burn-in).
2. Approximate $\mu(x)$ by
$$\hat\mu_{\mathrm{MCMC}} = \frac{1}{n} \sum_{i=1}^n h(\theta^{(i)}, x).$$


Long-Run Property

Theorem
Under some conditions, for every starting state $\theta^{(0)} \in \Theta$:
1. Ergodic theorem: for any initial state,
$$\frac{1}{n} \sum_{i=1}^n h(\theta^{(i)}, x) \overset{a.s.}{\to} E[h(\theta, x) \mid x] = \mu(x).$$
2. Central limit theorem: let $\sigma^2 = \mathrm{Var}[h(\theta, x) \mid x]$ and $\rho_j = \mathrm{corr}\left[h(\theta^{(1)}, x),\, h(\theta^{(j+1)}, x) \mid x\right]$. Then,
$$\sqrt{n}\left[\frac{1}{n} \sum_{i=1}^n h(\theta^{(i)}, x) - \mu(x)\right] \overset{d}{\to} N\left(0,\ \sigma^2 \Big(1 + 2 \sum_{j=1}^\infty \rho_j\Big)\right).$$

Effective Sample Size

If we have an iid sample of size $n$, then
$$\sqrt{n}\left[\frac{1}{n} \sum_{i=1}^n h(\theta^{(i)}, x) - \mu(x)\right] \overset{d}{\to} N(0, \sigma^2).$$
If we have a converged Markov chain of length $n$,
$$\sqrt{n}\left[\frac{1}{n} \sum_{i=1}^n h(\theta^{(i)}, x) - \mu(x)\right] \overset{d}{\to} N\left(0,\ \sigma^2 \Big(1 + 2 \sum_{j=1}^\infty \rho_j\Big)\right).$$
The variance of $\hat\mu_{\mathrm{MCMC}}$ is therefore larger than the variance of $\hat\mu_{\mathrm{IMC}}$. We define
$$n_e = \frac{n}{1 + 2 \sum_{j=1}^\infty \rho_j}$$
as the effective sample size of the Markov chain sample.
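A crude Python sketch of this quantity; truncating the autocorrelation sum at the first negative estimate is an assumed, simple rule rather than the lecture's prescription:

```python
import numpy as np

def effective_sample_size(x):
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n   # autocovariances
    rho = acov / acov[0]                                 # autocorrelations
    s = 0.0
    for j in range(1, n):
        if rho[j] < 0:                                   # simple truncation rule
            break
        s += rho[j]
    return n / (1 + 2 * s)

rng = np.random.default_rng(9)
z = rng.normal(size=5_000)
ar = np.zeros(5_000)                       # AR(1) chain with correlation 0.9
for t in range(1, 5_000):
    ar[t] = 0.9 * ar[t - 1] + z[t]
print(effective_sample_size(z))            # close to 5000 for iid draws
print(effective_sample_size(ar))           # much smaller, near 5000 * (0.1 / 1.9)
```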



Thinning
Some prefer thinning the sequence, keeping only every $k$th draw, in order to reduce serial correlation. But whether or not the Markov chain is thinned, it can be used for inference, provided that it has reached convergence.
Suppose that the length of the Markov chain is $n$. We discard $k - 1$ out of every $k$ observations, so the chain after thinning has length $n/k$. Under some assumptions,
$$\sqrt{n}\, [\hat\mu - \mu(x)] \overset{d}{\to} N(0, \tau^2), \qquad \sqrt{n/k}\, [\hat\mu_k - \mu(x)] \overset{d}{\to} N(0, \tau_k^2),$$
where $\hat\mu$ and $\hat\mu_k$ are the estimators without and with thinning, respectively.
In fact, it has been proved that $k \tau_k^2 > \tau^2$ for any $k > 1$, indicating that discarding $k - 1$ out of every $k$ observations increases the variance.

Simulation Under Posterior


Using MCMC and other methods, we can simulate $n$ random numbers $\theta^{(1)}, \ldots, \theta^{(n)}$ from the posterior distribution $\pi(\theta \mid x)$. Using the simulated $\theta$, we can
1. approximate the posterior mean: $n^{-1} \sum_{i=1}^n \theta^{(i)} \to E[\theta \mid x]$;
2. approximate a posterior probability:
$$\frac{1}{n} \sum_{i=1}^n 1\left(\theta^{(i)} \in A\right) \to E[1(\theta \in A) \mid x] = P(\theta \in A \mid x);$$
3. approximate the predictive density:
$$\frac{1}{n} \sum_{i=1}^n f\left(x_{\mathrm{new}} \mid x, \theta^{(i)}\right) \to \int f(x_{\mathrm{new}} \mid x, \theta)\, \pi(\theta \mid x)\, d\theta;$$
4. approximate the mean of the predictive distribution: $n^{-1} \sum_{i=1}^n x_{\mathrm{new}}^{(i)} \to E[x_{\mathrm{new}} \mid x]$, where $x_{\mathrm{new}}^{(i)}$ is simulated from $f(x_{\mathrm{new}} \mid x, \theta^{(i)})$.
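A brief sketch of points 2-4 for the running normal example (the event $\beta > 2$ and the draw count are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(10)
n, sy, syy = 20, 40.4, 93.2
a, b = 2 + n / 2, 2 + 0.5 * (syy - sy ** 2 / (n + 1))

m = 10_000
s2 = invgamma(a, scale=b).rvs(size=m, random_state=rng)
beta = rng.normal(sy / (n + 1), np.sqrt(s2 / (n + 1)))   # posterior draws
print(np.mean(beta > 2.0))               # approx P(beta > 2 | y)
y_new = rng.normal(beta, np.sqrt(s2))    # one predictive draw per posterior draw
print(y_new.mean())                      # approx mean of the predictive distribution
```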
Computational Techniques: Variational Inference

Approximate Posterior
If the posterior distribution family is difficult to handle, it can be useful to approximate it by another distribution family that is easier to handle.
The Kullback-Leibler divergence between distributions $Q$ and $P$ with respective densities $q$ and $p$ is
$$\mathrm{KL}(q, p) = \int q(\theta) \log\left[\frac{q(\theta)}{p(\theta)}\right] d\theta \ge 0.$$
- We choose a model $\mathcal{D}$ for the posterior, called the variational family.
- The variational density is
$$q^*(\theta \mid x) = \underset{q \in \mathcal{D}}{\arg\min}\ \mathrm{KL}(q(\theta \mid x), \pi(\theta \mid x)).$$


Variational Bayesian Inference

The idea of variational inference (VI) is to use $q^*(\theta \mid x) \in \mathcal{D}$ instead of $\pi(\theta \mid x)$ and to explore the properties of the posterior through it.
- We need to choose $\mathcal{D}$ ourselves.
- Trade-off: a too-simple $\mathcal{D}$ poorly approximates $\pi(\theta \mid x)$, but a too-complex $\mathcal{D}$ is hard to handle.
- One choice is the mean-field variational family $\mathcal{D}_{\mathrm{MF}}$, where
$$q(\theta \mid x) = \prod_{j=1}^m q_j(\theta_j \mid x),$$
that is, the components of $\theta$ are independent. We call $q_j(\theta_j \mid x)$ the $j$th variational factor.


Evidence Lower Bound

The Kullback-Leibler divergence satisfies
$$\mathrm{KL}(q(\theta \mid x), \pi(\theta \mid x)) = \log m(x) - \underbrace{\int q(\theta \mid x) \log\left[\frac{p(\theta, x)}{q(\theta \mid x)}\right] d\theta}_{\text{evidence lower bound } \mathrm{ELBO}(q)}.$$
Since $\mathrm{KL}(q(\theta \mid x), \pi(\theta \mid x)) \ge 0$, the ELBO satisfies
$$\mathrm{ELBO}(q) \le \log m(x),$$
a lower bound on the log-marginal likelihood of $x$. Minimizing the KL divergence is the same as maximizing the ELBO.


Variational Inference in Linear Regression

Example
Suppose that $y \mid \beta \sim N_n(X\beta, \Sigma)$ and $\beta \sim N_p(\mu_0, \Lambda_0^{-1})$, where $\Sigma$ is known. The posterior is $\beta \mid y \sim N(\mu_n, \Lambda_n^{-1})$, where
$$\Lambda_n = \Lambda_0 + X^T \Sigma^{-1} X, \qquad \mu_n = \Lambda_n^{-1}\left(\Lambda_0 \mu_0 + X^T \Sigma^{-1} y\right).$$
Consider the mean-field variational family
$$\mathcal{D}_{\mathrm{MF}} = \left\{N_p(\mu, \Lambda^{-1}) : \mu \in \mathbb{R}^p,\ \Lambda \text{ is diagonal}\right\}.$$
Find the ELBO and the variational density.
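For a Gaussian posterior, the mean-field coordinate updates have a known closed form: each factor is Gaussian with variance $1/\Lambda_{n,jj}$ and mean satisfying $m_j = \mu_{n,j} - \Lambda_{n,jj}^{-1} \sum_{k \neq j} \Lambda_{n,jk}(m_k - \mu_{n,k})$. A Python sketch with made-up data (the fixed point recovers the exact posterior mean but understates the variances):

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(30, 2))                        # made-up design matrix
y = X @ np.array([1.0, -2.0]) + rng.normal(size=30)
Sigma_inv = np.eye(30)                              # known Sigma = I
Lambda0, mu0 = np.eye(2), np.zeros(2)               # assumed prior

Lambda_n = Lambda0 + X.T @ Sigma_inv @ X
mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ Sigma_inv @ y)

m = np.zeros(2)                                     # variational means
for _ in range(50):                                 # coordinate ascent sweeps
    for j in range(2):
        others = [k for k in range(2) if k != j]
        m[j] = mu_n[j] - Lambda_n[j, others] @ (m[others] - mu_n[others]) \
               / Lambda_n[j, j]

print(m, "vs the exact posterior mean", mu_n)
print(1 / np.diag(Lambda_n), "variational variances (underestimates)")
```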


Explicit Expression of DMF

Theorem
Consider the mean-field variational family $\mathcal{D}_{\mathrm{MF}}$, where
$$q(\theta \mid x) = \prod_{j=1}^m q_j(\theta_j \mid x).$$
Let $\theta_k$ be the $k$th group in $\theta$ and
$$q_k^*(\theta_k \mid x) = \underset{q_k}{\arg\min}\ \mathrm{KL}(q(\theta \mid x), \pi(\theta \mid x)).$$
Then,
$$q_k^*(\theta_k \mid x) \propto \exp\left\{\int q_{-k}(\theta_{-k} \mid x) \log \pi(\theta_k \mid \theta_{-k}, x)\, d\theta_{-k}\right\}.$$


Coordinate Ascent Variational Inference Algorithm


The previous theorem suggests the following stepwise conditioning to approximate $q^*(\theta \mid x)$.

Algorithm 5: Coordinate Ascent Variational Inference (CAVI)
1. Choose an initial approximation $\hat q^{(0)}(\theta \mid x) = \prod_{j=1}^m \hat q_j^{(0)}(\theta_j \mid x)$.
2. For $t$ in $1 : T$, for $j$ in $1 : m$, calculate
$$\hat q_j^{(t)}(\theta_j \mid x) \propto \exp\left\{\int q_{-j}(\theta \mid x) \log \pi(\theta_j \mid \theta_{-j}, x)\, d\theta_{-j}\right\},$$
where
$$q_{-j}(\theta \mid x) = \left[\prod_{k=1}^{j-1} \hat q_k^{(t)}(\theta_k \mid x)\right] \prod_{k=j+1}^m \hat q_k^{(t-1)}(\theta_k \mid x).$$

CAVI Algorithm: Example

Example
Suppose that we have an iid sample $X_i \mid \mu, \sigma^2 \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n$. The priors are $\mu \mid \sigma^2 \sim N(\mu_0, \sigma^2 / \lambda_0)$ and $\sigma^2 \sim \mathrm{InvGamma}(a_0, b_0)$. The posterior is
$$\mu \mid x, \sigma^2 \sim N(\mu_n, \sigma^2 / \lambda_n), \qquad \sigma^2 \mid x \sim \mathrm{InvGamma}(a_n, b_n),$$
where $\lambda_n = \lambda_0 + n$, $\mu_n = \lambda_n^{-1}\left(\lambda_0 \mu_0 + \sum_{i=1}^n x_i\right)$, $a_n = a_0 + \frac{n}{2}$, and
$$b_n = b_0 + \frac{1}{2}\left(\sum_{i=1}^n x_i^2 + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2\right).$$



Computational Techniques: Stan

Stan

Stan is a C++ library for Bayesian inference that uses HMC to obtain posterior simulations.
- RStan is the R interface to Stan.
- PyStan is the Python interface to Stan.
- It is a state-of-the-art library for doing Bayesian statistics.
A Stan model consists of
1. data,
2. parameters,
3. the statistical model.


R Package rstanarm

The R package rstanarm emulates standard R model-fitting syntax but uses Stan, via the rstan package, to fit models in the background, so you can skip writing the Stan syntax.
- Various common regression models have been implemented in rstanarm.
- Another benefit is that various visualization tools in R can be used.
