Lec30 Gibbs Sampling
Email: [email protected]
URL: https://www.zabaras.com/
The problem with these algorithms is that we try to sample all the components
of a high-dimensional parameter simultaneously.
Model: Failures of the 𝑖th pump follow a Poisson process with parameter 𝜆𝑖, 1 ≤ 𝑖 ≤ 10. For
an observation time 𝑡𝑖, the number of failures 𝑝𝑖 is a Poisson 𝒫(𝜆𝑖 𝑡𝑖) random variable.
$$p_i \sim \mathcal{P}(\lambda_i t_i), \qquad \lambda_i \sim \mathcal{G}a(\alpha, \beta), \qquad \beta \sim \mathcal{G}a(\gamma, \delta)$$
The joint posterior is
$$\pi\big(\lambda_1, \ldots, \lambda_{10}, \beta \mid t_1, \ldots, t_{10}, p_1, \ldots, p_{10}\big) \propto \prod_{i=1}^{10}\Big\{\lambda_i^{\,p_i+\alpha-1}\, e^{-(t_i+\beta)\lambda_i}\Big\}\, \beta^{\,10\alpha+\gamma-1}\, e^{-\delta\beta}$$
It is not obvious how the inverse CDF method, the accept/reject method, or importance
sampling could be used for this multidimensional distribution!
The conditionals can be obtained by direct inspection of the above posterior:
$$\lambda_i \mid (\beta, t_i, p_i) \sim \mathcal{G}a\big(p_i + \alpha,\; t_i + \beta\big) \quad \text{for } 1 \le i \le 10$$
$$\beta \mid (\lambda_1, \ldots, \lambda_{10}) \sim \mathcal{G}a\Big(\gamma + 10\alpha,\; \delta + \sum_{i=1}^{10}\lambda_i\Big)$$
Instead of directly sampling the vector 𝜃 = (𝜆1 , . . . , 𝜆10 , 𝛽) at once, one could suggest sampling
it iteratively.
We can start with the 𝜆𝑖's for a given guess of 𝛽, followed by an update of 𝛽 given the new samples
𝜆1 , . . . , 𝜆10 .
Note that instead of directly sampling in a space of dimension 11, one samples 11 times in
spaces of dimension 1!
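As a concrete illustration of this componentwise scheme, here is a minimal Python sketch (not the lecture's MatLab code); the observation times, failure counts and the hyperparameters $\alpha$, $\gamma$, $\delta$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative observation times and failure counts for the 10 pumps (assumed values).
t = np.array([94.3, 15.7, 62.9, 126.0, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5])
p = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])

alpha, gam, delta = 1.8, 0.01, 1.0     # assumed hyperparameters
n_iter = 5000
beta = 1.0                             # initial guess for beta
beta_samples = np.empty(n_iter)
lam_samples = np.empty((n_iter, 10))   # kept for later histograms

for i in range(n_iter):
    # lambda_j | beta, t_j, p_j ~ Ga(p_j + alpha, t_j + beta)   (shape, rate)
    lam = rng.gamma(p + alpha, 1.0 / (t + beta))
    # beta | lambda_1..lambda_10 ~ Ga(gam + 10*alpha, delta + sum_j lambda_j)
    beta = rng.gamma(gam + 10 * alpha, 1.0 / (delta + lam.sum()))
    lam_samples[i], beta_samples[i] = lam, beta

print("posterior mean of beta:", beta_samples.mean())
```

NumPy's gamma sampler is parameterized by shape and scale, so the rates $t_i+\beta$ and $\delta+\sum_i\lambda_i$ enter as their reciprocals.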
The validity of the approach described here is derived from the fact that the
sequence $\theta^t := \big(\lambda_1^t, \lambda_2^t, \ldots, \lambda_{10}^t, \beta^t\big)$ is a Markov chain.
$$\mathbb{P}\big(X_n \in A \mid X_0, \ldots, X_{n-1}\big) = \mathbb{P}\big(X_n \in A \mid X_{n-1}\big)$$
and we write:
$$\text{Transition Kernel:}\quad P(x, A) = \mathbb{P}\big(X_n \in A \mid X_{n-1} = x\big)$$
Markov Chain Monte Carlo (MCMC): Given a target distribution 𝜋, we need to design a
transition kernel 𝑃 such that asymptotically
$$\frac{1}{N}\sum_{n=1}^{N} f(X_n) \;\longrightarrow\; \int f(x)\,\pi(x)\,dx \qquad \text{and/or} \qquad X_n \sim \pi$$
$$\pi(x) = \mathcal{N}\!\left(x;\, 0,\; \frac{\sigma^2}{1-\alpha^2}\right)$$
To sample from 𝜋, we just sample the Markov chain and we know that
asymptotically 𝑋𝑛~𝜋.
Of course this problem is only to demonstrate the main idea of MCMC since
we can here sample directly from 𝜋!
In the following example, we choose 𝛼 = 0.4, σ = 5 (see here for a MatLab implementation).
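A minimal Python sketch of this example, under the assumption that the chain used here is the AR(1) recursion $X_n = \alpha X_{n-1} + V_n$ with $V_n \sim \mathcal{N}(0, \sigma^2)$, whose stationary distribution is the normal density above:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma = 0.4, 5.0
n_steps = 100_000

x = 0.0
samples = np.empty(n_steps)
for n in range(n_steps):
    x = alpha * x + sigma * rng.standard_normal()   # one step of the assumed kernel
    samples[n] = x

# Compare the empirical variance with the stationary variance sigma^2 / (1 - alpha^2).
print(samples.var(), sigma**2 / (1 - alpha**2))
```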
We will see that it is not necessary to run 𝑁 Markov chains in parallel in order to obtain 100
samples: one can instead run a single Markov chain and build the histogram from that one
trajectory.
We suggest the estimator $\frac{1}{N}\sum_{n=1}^{N} f(X_n)$, which is the estimator we used before when
$X_n$, $n = 1, \ldots, N$, were independent.
Under relatively mild conditions, such an estimator is consistent despite the fact that the
samples $X_n$, $n = 1, \ldots, N$, are not independent. Under additional conditions, the CLT also
holds with a rate of convergence $1/\sqrt{N}$.
It is usually easy to establish that an MCMC sampler converges towards 𝜋(𝑥) but difficult to
obtain rates of convergence.
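A small numerical check of this, again under the AR(1) kernel assumed in the sketch above, with $f(x) = x^2$, for which $\int f(x)\pi(x)dx = \sigma^2/(1-\alpha^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, sigma = 0.4, 5.0
exact = sigma**2 / (1 - alpha**2)

for N in (10**3, 10**4, 10**5, 10**6):
    x, acc = 0.0, 0.0
    for _ in range(N):
        x = alpha * x + sigma * rng.standard_normal()
        acc += x**2
    est = acc / N                       # ergodic average (1/N) sum f(X_n)
    print(f"N = {N:>7d}  estimate = {est:8.3f}  abs. error = {abs(est - exact):.3f}")
```

The absolute error typically shrinks roughly like $1/\sqrt{N}$, in line with the CLT rate quoted above.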
Initialization: Select deterministically or randomly $\theta^{(0)} = \big(\theta^{1,(0)}, \theta^{2,(0)}\big)$.
Iteration $i$, $i \ge 1$: Sample $\theta^{1,(i)} \sim \pi\big(\theta^1 \mid \theta^{2,(i-1)}\big)$, then sample $\theta^{2,(i)} \sim \pi\big(\theta^2 \mid \theta^{1,(i)}\big)$.
Sampling from conditionals is often feasible even when sampling from the joint is
impossible (e.g. in the nuclear pump data).
$$\iint \pi\big(\theta^1, \theta^2\big)\, P\big(\theta^1, \theta^2;\, \tilde{\theta}^1, \tilde{\theta}^2\big)\, d\theta^1\, d\theta^2 = \int \pi\big(\theta^2\big)\, \pi\big(\tilde{\theta}^1 \mid \theta^2\big)\, \pi\big(\tilde{\theta}^2 \mid \tilde{\theta}^1\big)\, d\theta^2 = \pi\big(\tilde{\theta}^1\big)\, \pi\big(\tilde{\theta}^2 \mid \tilde{\theta}^1\big) = \pi\big(\tilde{\theta}^1, \tilde{\theta}^2\big),$$
so $\pi$ is invariant under this kernel.
$$\frac{1}{N}\sum_{n=1}^{N} f\big(\theta_n^1, \theta_n^2\big) \;\longrightarrow\; \iint f\big(\theta^1, \theta^2\big)\, \pi\big(\theta^1, \theta^2\big)\, d\theta^1\, d\theta^2$$
We have
$$\frac{1}{N}\sum_{n=1}^{N} f(X_n) \;\longrightarrow\; \int f(x)\,\pi(x)\,dx$$
You need to make sure that you do not explore the space in a periodic way (aperiodicity) to
ensure that $X_n \sim \pi$ asymptotically.
Initialization:
Select deterministically or randomly $\theta^{(0)} = \big(\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_p^{(0)}\big)$
Iteration $i$, $i \ge 1$:
For $k = 1{:}p$
Sample $\theta_k^{(i)} \sim \pi\big(\theta_k \mid \theta_{-k}^{(i)}\big)$
where $\theta_{-k}^{(i)} = \big(\theta_1^{(i)}, \ldots, \theta_{k-1}^{(i)}, \theta_{k+1}^{(i-1)}, \ldots, \theta_p^{(i-1)}\big)$
That is:
Update $\theta_1^{(i)}$ from $\pi\big(\cdot \mid \theta_2^{(i-1)}, \ldots, \theta_p^{(i-1)}\big)$
Update $\theta_2^{(i)}$ from $\pi\big(\cdot \mid \theta_1^{(i)}, \theta_3^{(i-1)}, \ldots, \theta_p^{(i-1)}\big)$
……
Update $\theta_p^{(i)}$ from $\pi\big(\cdot \mid \theta_1^{(i)}, \theta_2^{(i)}, \ldots, \theta_{p-1}^{(i)}\big)$
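As a concrete instance of this scheme (with a target that is not taken from the slides), consider a zero-mean bivariate Gaussian with unit variances and correlation $\rho$, for which $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2, 1-\rho^2)$ and $x_2 \mid x_1 \sim \mathcal{N}(\rho x_1, 1-\rho^2)$:

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8
n_iter = 20_000

x1, x2 = 0.0, 0.0                      # deterministic initialization theta^(0)
samples = np.empty((n_iter, 2))
for i in range(n_iter):
    # update x1 from pi(x1 | x2^(i-1))
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
    # update x2 from pi(x2 | x1^(i))
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples[i] = (x1, x2)

print("empirical correlation:", np.corrcoef(samples.T)[0, 1])   # close to rho
```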
Initialization:
Select deterministically or randomly $\theta^{(0)} = \big(\theta_1^{(0)}, \theta_2^{(0)}, \ldots, \theta_p^{(0)}\big)$
Iteration $i$, $i \ge 1$:
Sample $K \sim \mathcal{U}\{1, 2, \ldots, p\}$
Set $\theta_{-K}^{(i)} = \theta_{-K}^{(i-1)}$
Sample $\theta_K^{(i)} \sim \pi\big(\theta_K \mid \theta_{-K}^{(i)}\big)$
where $\theta_{-K}^{(i)} = \big(\theta_1^{(i)}, \ldots, \theta_{K-1}^{(i)}, \theta_{K+1}^{(i)}, \ldots, \theta_p^{(i)}\big)$
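The random-scan variant, for the same assumed bivariate-Gaussian target as in the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
rho, n_iter = 0.8, 20_000

x = np.zeros(2)                        # theta^(0)
samples = np.empty((n_iter, 2))
for i in range(n_iter):
    k = rng.integers(2)                # K ~ U{1, 2} (0-based indexing here)
    other = 1 - k
    # x_K | x_{-K} ~ N(rho * x_{-K}, 1 - rho^2); the other component is kept unchanged
    x[k] = rho * x[other] + np.sqrt(1 - rho**2) * rng.standard_normal()
    samples[i] = x

print("empirical correlation:", np.corrcoef(samples.T)[0, 1])
```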
Random-Scan Gibbs Sampler
Random scan Gibbs: Let $\theta^{(i)} = \big(\theta_1^{(i)}, \theta_2^{(i)}, \ldots, \theta_p^{(i)}\big)$ at step (iteration) $i$.
[Figure: histogram and x1-x2 plot of the Gibbs samples.]
We can see that, in this case of highly correlated variables, the sampling process is
inaccurate.
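A quick check of this claim, reusing the bivariate-Gaussian Gibbs sketch from earlier with increasing correlation $\rho$; for the two-component sampler the lag-1 autocorrelation of the $x_1$-chain is close to $\rho^2$, so mixing degrades sharply as $\rho \to 1$:

```python
import numpy as np

rng = np.random.default_rng(8)

def gibbs_chain(rho, n_iter=20_000):
    """Systematic-scan Gibbs for the bivariate Gaussian; returns the x1-chain."""
    x1 = x2 = 0.0
    out = np.empty(n_iter)
    for i in range(n_iter):
        x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()
        out[i] = x1
    return out

for rho in (0.5, 0.9, 0.99):
    chain = gibbs_chain(rho)
    lag1 = np.corrcoef(chain[:-1], chain[1:])[0, 1]
    print(f"rho = {rho}: lag-1 autocorrelation of x1 = {lag1:.3f}")
```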
[Figure: a further histogram and x1-x2 plot of the Gibbs samples.]
[Figure: x1-x2 scatter plot of the samples.]
A MatLab implementation can be found here. Another MatLab implementation with movie frame animation can be found here.
[Figures: evolution of the sample histograms over the iterations (density vs. x and iteration).]
A MatLab implementation can be found here. This implementation works like a movie frame animation.
The two conditional distributions for the Gibbs sampler are
$$x_1 \mid x_2 \sim \mathcal{B}inom(n, x_2), \qquad x_2 \mid x_1 \sim \mathcal{B}e\big(\alpha + x_1,\; \beta + n - x_1\big)$$
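A minimal sketch of this two-component Gibbs sampler, assuming illustrative values $n = 20$, $\alpha = 2$, $\beta = 4$ (these numbers are not given in the text):

```python
import numpy as np

rng = np.random.default_rng(5)
n, a, b = 20, 2.0, 4.0                 # assumed n, alpha, beta
n_iter = 50_000

x2 = 0.5                               # initial value for x2 in (0, 1)
x1_samples = np.empty(n_iter, dtype=int)
x2_samples = np.empty(n_iter)
for i in range(n_iter):
    x1 = rng.binomial(n, x2)           # x1 | x2 ~ Binom(n, x2)
    x2 = rng.beta(a + x1, b + n - x1)  # x2 | x1 ~ Be(alpha + x1, beta + n - x1)
    x1_samples[i], x2_samples[i] = x1, x2

# For this joint, the x2-marginal is Beta(alpha, beta), with mean alpha/(alpha+beta).
print("mean of x2 samples:", x2_samples.mean(), " Beta mean:", a / (a + b))
```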
[Figure: histogram of $x_2$ together with its exact pdf.]
$$\pi\big(\theta_1 \mid \theta_2^{(t)}, \mathcal{D}\big) \propto \exp\Big(-\tfrac{1}{2}\sum_i \big(x_i - \theta_1 - \theta_2^{(t)}\big)^2\Big)$$
[Figure: trace plot of the last 100 iterations and histogram of the samples.]
You can make this type of problem work by introducing a proper coordinate
transformation.
The transformation $y_1 = x_1 + x_2$, $y_2 = x_2 - x_1$ is introduced; conditioning now on $y_1$
produces a uniform distribution on the union of a negative and a positive interval. Therefore,
one iteration of the Gibbs sampler is sufficient to jump from one disk to the other.
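The following sketch illustrates the reducibility problem in the original $(x_1, x_2)$ coordinates, for an assumed geometry (radius-1 disks centered at $(1.5, -1.5)$ and $(-1.5, 1.5)$; the slides do not give the exact disk positions). Because each coordinate range of one disk is disjoint from that of the other, every conditional is a uniform on a single interval inside the current disk, and the chain never leaves the disk it starts in.

```python
import numpy as np

rng = np.random.default_rng(6)
centers = np.array([[1.5, -1.5], [-1.5, 1.5]])   # assumed disk centers
r = 1.0                                          # assumed radius

def conditional_interval(fixed_value, axis):
    """Interval of the free coordinate compatible with the fixed one (only one disk matches)."""
    for c in centers:
        d = fixed_value - c[1 - axis]
        if abs(d) <= r:
            h = np.sqrt(r**2 - d**2)
            return c[axis] - h, c[axis] + h
    raise ValueError("point outside both disks")

x = np.array([1.5, -1.5])              # start at the center of the first disk
visits_other_disk = 0
for _ in range(10_000):
    for axis in (0, 1):                # systematic scan over x1, x2
        lo, hi = conditional_interval(x[1 - axis], axis)
        x[axis] = rng.uniform(lo, hi)
    visits_other_disk += int(x[0] < 0)

print("visits to the other disk:", visits_other_disk)   # stays 0: the chain is reducible
```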
where we assume as priors $\mathcal{IG}\big(\sigma^2; \tfrac{\nu_0}{2}, \tfrac{\gamma_0}{2}\big)$ and, for $\alpha^2 \ll 1$,
$$\beta_k \sim \tfrac{1}{2}\,\mathcal{N}\big(0,\, \alpha^2\delta^2\sigma^2\big) + \tfrac{1}{2}\,\mathcal{N}\big(0,\, \delta^2\sigma^2\big)$$
$$\Pr(\gamma_k = 0) = \Pr(\gamma_k = 1) = \tfrac{1}{2}$$
$$\beta_k \mid \gamma_k = 0 \sim \mathcal{N}\big(0,\, \alpha^2\delta^2\sigma^2\big), \qquad \beta_k \mid \gamma_k = 1 \sim \mathcal{N}\big(0,\, \delta^2\sigma^2\big)$$
$$p\big(\gamma_k = 1 \mid \beta_k, \sigma^2\big) = \frac{\dfrac{1}{\sqrt{2\pi\delta^2\sigma^2}}\exp\Big(-\dfrac{\beta_k^2}{2\delta^2\sigma^2}\Big)}{\dfrac{1}{\sqrt{2\pi\delta^2\sigma^2}}\exp\Big(-\dfrac{\beta_k^2}{2\delta^2\sigma^2}\Big) + \dfrac{1}{\sqrt{2\pi\alpha^2\delta^2\sigma^2}}\exp\Big(-\dfrac{\beta_k^2}{2\alpha^2\delta^2\sigma^2}\Big)}$$
The same result can be shown with $p(\gamma_k \mid \mathcal{D}, \beta_k, \sigma^2)$. The Gibbs sampler becomes reducible as
$\alpha$ goes to zero.
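A small numerical illustration of the conditional $p(\gamma_k = 1 \mid \beta_k, \sigma^2)$ above; the values of $\alpha$, $\delta$, $\sigma$ are assumptions chosen only to show the behaviour.

```python
import numpy as np

def p_gamma1(beta_k, alpha, delta, sigma):
    """Posterior probability of the slab (gamma_k = 1) under equal prior weights."""
    def normal_pdf(x, var):
        return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    slab = normal_pdf(beta_k, delta**2 * sigma**2)
    spike = normal_pdf(beta_k, alpha**2 * delta**2 * sigma**2)
    return slab / (slab + spike)

alpha, delta, sigma = 0.05, 3.0, 1.0   # assumed hyperparameters (alpha^2 << 1)
for beta_k in (0.0, 0.1, 0.5, 2.0):
    print(beta_k, p_gamma1(beta_k, alpha, delta, sigma))
# Small |beta_k| favours the spike (gamma_k = 0); large |beta_k| favours the slab (gamma_k = 1).
```

Inside a Gibbs sweep one would then draw $\gamma_k$ as a Bernoulli variable with this probability.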
$$\gamma = (\gamma_1, \ldots, \gamma_p), \qquad \beta_\gamma = \{\beta_k : \gamma_k = 1\}, \qquad X_\gamma = \{X_k : \gamma_k = 1\}, \quad \text{and} \quad n_\gamma = \sum_{k=1}^{p}\gamma_k$$
Prior distributions
$$\pi\big(\beta_\gamma, \sigma^2 \mid \gamma\big) = \mathcal{N}\big(\beta_\gamma;\, 0,\; \sigma^2\delta^2 I_{n_\gamma}\big)\; \mathcal{IG}\big(\sigma^2;\, \tfrac{\nu_0}{2}, \tfrac{\gamma_0}{2}\big), \quad \text{and} \quad \pi(\gamma) = \prod_{k=1}^{p}\pi(\gamma_k) = 2^{-p}.$$
where $\pi(\gamma \mid \mathcal{D}) \propto \pi(\mathcal{D} \mid \gamma)\,\pi(\gamma)$ and (using an earlier lecture result on the evidence for this
regression model)
$$\pi(\mathcal{D} \mid \gamma) = \iint \pi\big(\mathcal{D}, \beta_\gamma, \sigma^2 \mid \gamma\big)\, d\beta_\gamma\, d\sigma^2 \propto \big(\delta^2\big)^{-n_\gamma/2}\, \big|\Sigma_\gamma\big|^{1/2} \left(\frac{\gamma_0}{2} + \frac{\sum_{i=1}^{n} y_i^2 - \mu_\gamma^T \Sigma_\gamma^{-1} \mu_\gamma}{2}\right)^{-\frac{\nu_0+n}{2}}$$
with
$$\mu_\gamma = \Sigma_\gamma \sum_{i=1}^{n} y_i\, x_{\gamma,i}, \qquad \Sigma_\gamma^{-1} = \frac{1}{\delta^2}\, I_{n_\gamma} + \sum_{i=1}^{n} x_{\gamma,i}\, x_{\gamma,i}^T$$
$$\pi\big(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma\big) = \mathcal{N}\big(\beta_\gamma;\, \mu_\gamma,\; \sigma^2 \Sigma_\gamma\big)\; \mathcal{IG}\left(\sigma^2;\; \frac{\nu_0+n}{2},\; \frac{\gamma_0 + \sum_{i=1}^{n} y_i^2 - \mu_\gamma^T \Sigma_\gamma^{-1} \mu_\gamma}{2}\right)$$
where
$$\mu_\gamma = \Sigma_\gamma \sum_{i=1}^{n} y_i\, x_{\gamma,i}, \qquad \Sigma_\gamma^{-1} = \frac{1}{\delta^2}\, I_{n_\gamma} + \sum_{i=1}^{n} x_{\gamma,i}\, x_{\gamma,i}^T$$
$$\gamma_i \sim \mathcal{B}(\pi), \ \text{where}\ \pi \sim \mathcal{U}[0,1]$$
$$\gamma_i \sim \mathcal{B}(\pi_i), \ \text{where}\ \pi_i \sim \mathcal{B}e(a, b)$$
g-prior (Zellner)
$$\beta_\gamma \mid \sigma^2, \gamma \sim \mathcal{N}\big(\beta_\gamma;\, 0,\; \sigma^2\delta^2\big(X_\gamma^T X_\gamma\big)^{-1}\big)$$
where here for robustness we additionally use
$$\delta^2 \sim \mathcal{IG}\Big(\frac{a_0}{2}, \frac{b_0}{2}\Big)$$
Such variations in the priors are very important and can affect the performance of the
Bayesian model.
Iteration $i$, $i \ge 1$:
For $k = 1{:}p$
Sample $\gamma_k^{(i)} \sim \pi\big(\gamma_k \mid \mathcal{D}, \gamma_{-k}^{(i)}\big)$,
where $\gamma_{-k}^{(i)} = \big(\gamma_1^{(i)}, \ldots, \gamma_{k-1}^{(i)}, \gamma_{k+1}^{(i-1)}, \ldots, \gamma_p^{(i-1)}\big)$
Optional step: Sample $\big(\beta_\gamma^{(i)}, \sigma^{2\,(i)}\big) \sim \pi\big(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma^{(i)}\big)$
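A sketch of this sampler on simulated data, using the marginal likelihood $\pi(\mathcal{D} \mid \gamma)$ given earlier (up to constants that do not depend on $\gamma$) and the uniform prior $\pi(\gamma) = 2^{-p}$; the data-generating setup and the hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data: only the first two of p = 5 predictors are active (assumed setup).
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

nu0, gamma0, delta2 = 1.0, 1.0, 10.0   # assumed hyperparameters

def log_evidence(gamma):
    """log pi(D | gamma) up to an additive constant independent of gamma."""
    idx = np.flatnonzero(gamma)
    n_g = idx.size
    if n_g == 0:
        quad, logdet = 0.0, 0.0
    else:
        Xg = X[:, idx]
        Sigma_inv = np.eye(n_g) / delta2 + Xg.T @ Xg
        mu = np.linalg.solve(Sigma_inv, Xg.T @ y)
        quad = mu @ Sigma_inv @ mu                 # mu_gamma^T Sigma_gamma^{-1} mu_gamma
        logdet = -np.linalg.slogdet(Sigma_inv)[1]  # log |Sigma_gamma|
    return (-0.5 * n_g * np.log(delta2) + 0.5 * logdet
            - 0.5 * (nu0 + n) * np.log(gamma0 + y @ y - quad))

gamma = np.ones(p, dtype=int)          # initialization gamma^(0)
counts = np.zeros(p)
n_iter, burn_in = 2000, 500
for it in range(n_iter):
    for k in range(p):                 # systematic scan over the p indicators
        g1, g0 = gamma.copy(), gamma.copy()
        g1[k], g0[k] = 1, 0
        l1, l0 = log_evidence(g1), log_evidence(g0)
        p1 = 1.0 / (1.0 + np.exp(l0 - l1))   # pi(gamma_k = 1 | D, gamma_{-k}) under the flat prior
        gamma[k] = rng.random() < p1
    if it >= burn_in:
        counts += gamma

print("posterior inclusion probabilities:", counts / (n_iter - burn_in))
```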
Iteration $i$, $i \ge 1$:
For $k = 1{:}p$
Sample $\gamma_k^{(i)} \sim \pi\big(\gamma_k \mid \mathcal{D}, \gamma_{-k}^{(i)}, \delta^{2\,(i-1)}\big)$,
where $\gamma_{-k}^{(i)} = \big(\gamma_1^{(i)}, \ldots, \gamma_{k-1}^{(i)}, \gamma_{k+1}^{(i-1)}, \ldots, \gamma_p^{(i-1)}\big)$
Sample $\big(\beta_\gamma^{(i)}, \sigma^{2\,(i)}\big) \sim \pi\big(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma^{(i)}, \delta^{2\,(i-1)}\big)$
However, it mixes very slowly because the components are updated one at a time.
Updating correlated components together would significantly increase the convergence speed
of the algorithm, at the cost of increased complexity.