
Lec30 GibbsSampling

The document discusses Gibbs sampling and its application to Bayesian regression variable selection. It begins with an introduction to incremental sampling strategies like Markov chain Monte Carlo (MCMC). Gibbs sampling is introduced as an MCMC method that iteratively samples the conditional distributions of model parameters. This allows sampling high-dimensional distributions by incrementally updating individual parameters. As an example, Gibbs sampling is applied to a hierarchical Bayesian model for failures in a nuclear power plant. The conditional distributions are derived and used within an iterative sampling procedure to obtain samples from the joint posterior distribution.


Gibbs Sampling

Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

October 23, 2020



Contents
• Incremental strategies for sampling, iterative sampling
• Introduction to MCMC, the autoregressive model
• The Gibbs sampler: systematic scan, random scan, Gibbs sampler examples, block and Metropolized Gibbs, application to variable/model selection in linear regression
• The goals for today's lecture: understand the fundamentals of MCMC; learn about the Gibbs sampler and how to apply it to Bayesian regression variable selection.

Suggested reading:
• C.P. Robert and G. Casella, Monte Carlo Statistical Methods, Chapter 3 (Google Books, slides, video)
• D. MacKay, Introduction to Monte Carlo Methods, reprint
• R. Neal, Probabilistic Inference Using Markov Chain Monte Carlo Methods, 1993
• C. Andrieu et al., An Introduction to MCMC for Machine Learning, Machine Learning, 50, 5-43, 2003
• S. Brooks, Markov Chain Monte Carlo Method and Its Application, Journal of the Royal Statistical Society, Series D (The Statistician), Vol. 47, No. 1 (1998), pp. 69-100
• G. Casella and E.I. George, Explaining the Gibbs Sampler, The American Statistician, Vol. 46, 1992, pp. 167-174
• S. Chib and E. Greenberg, Understanding the Metropolis-Hastings Algorithm, The American Statistician, Vol. 49, No. 4 (Nov. 1995), pp. 327-335



Using Incremental Strategies for Sampling
• We have seen that both rejection sampling (RS) and importance sampling (IS) are limited to problems of moderate dimension.

• The problem with these algorithms is that we try to sample all the components of a high-dimensional parameter simultaneously.

• To address this we will look next at incremental strategies:

- Iterative methods: Markov chain Monte Carlo.

- Sequential methods: sequential Monte Carlo.

A. Doucet, Statistical Computing: Monte Carlo Methods, Online course resource



Motivating Example
• Multiple failures in a nuclear plant.

• Model: failures of the 𝑖-th pump follow a Poisson process with rate 𝜆𝑖, 1 ≤ 𝑖 ≤ 10. For an observation time 𝑡𝑖, the number of failures 𝑝𝑖 is a Poisson 𝒫(𝜆𝑖𝑡𝑖) random variable.

• The unknowns consist of 𝜃 := (𝜆1, 𝜆2, . . . , 𝜆10, 𝛽), where 𝛽 is a parameter in the hierarchical model introduced next.

Statistical Computing and MC Methods, A. Doucet, Lecture 10.



Motivating Example: Nuclear Pump Data
• Hierarchical model:

$$\lambda_i \stackrel{i.i.d.}{\sim} \mathcal{G}a(\alpha,\beta), \qquad \beta \sim \mathcal{G}a(\gamma,\delta),$$

with 𝛼 = 1.8, 𝛾 = 0.01, 𝛿 = 1.

• The posterior distribution (see here for the Ga distribution):

$$\pi(\lambda_1,\ldots,\lambda_{10},\beta \mid t_i, p_i) \;\propto\; \prod_{i=1}^{10}\Big[\underbrace{(\lambda_i t_i)^{p_i} e^{-\lambda_i t_i}}_{\mathcal{P}(\lambda_i t_i)}\;\underbrace{\lambda_i^{\alpha-1} e^{-\beta\lambda_i}\,\beta^{\alpha}}_{\lambda_i \sim \mathcal{G}a(\alpha,\beta)}\Big]\;\underbrace{\beta^{\gamma-1} e^{-\delta\beta}}_{\beta \sim \mathcal{G}a(\gamma,\delta)}
\;\propto\; \prod_{i=1}^{10}\lambda_i^{p_i+\alpha-1} e^{-(t_i+\beta)\lambda_i}\;\beta^{10\alpha+\gamma-1} e^{-\delta\beta}$$

• It is not obvious how the inverse-CDF method, the accept/reject method, or importance sampling could be used for this multidimensional distribution!



Conditional Distributions
$$\pi(\lambda_1,\ldots,\lambda_{10},\beta \mid t_i, p_i) \;\propto\; \prod_{i=1}^{10}\lambda_i^{p_i+\alpha-1} e^{-(t_i+\beta)\lambda_i}\;\beta^{10\alpha+\gamma-1} e^{-\delta\beta}$$

• The conditionals can be obtained by direct observation from the above posterior:

$$\lambda_i \mid (\beta, t_i, p_i) \sim \mathcal{G}a(p_i+\alpha,\ t_i+\beta) \quad \text{for } 1 \le i \le 10$$
$$\beta \mid (\lambda_1,\ldots,\lambda_{10}) \sim \mathcal{G}a\Big(\gamma+10\alpha,\ \delta+\sum_{i=1}^{10}\lambda_i\Big)$$

• Instead of directly sampling the vector 𝜃 = (𝜆1, . . . , 𝜆10, 𝛽) at once, one could suggest sampling it iteratively.

• We can start with the 𝜆𝑖's for a given guess of 𝛽, followed by an update of 𝛽 given the new samples 𝜆1, . . . , 𝜆10.



Iterative Sampling
• Given a sample at iteration 𝑡, 𝜃(𝑡) = (𝜆1(𝑡), . . . , 𝜆10(𝑡), 𝛽(𝑡)), one could proceed as follows at iteration 𝑡 + 1:

$$\text{Step 1: } \lambda_i^{(t+1)} \mid (\beta^{(t)}, t_i, p_i) \sim \mathcal{G}a\big(p_i+\alpha,\ t_i+\beta^{(t)}\big) \quad \text{for } 1\le i\le 10$$

$$\text{Step 2: } \beta^{(t+1)} \mid (\lambda_1^{(t+1)},\ldots,\lambda_{10}^{(t+1)}) \sim \mathcal{G}a\Big(\gamma+10\alpha,\ \delta+\sum_{i=1}^{10}\lambda_i^{(t+1)}\Big)$$

• Note that instead of directly sampling in a space of dimension 11, one samples 11 times in spaces of dimension 1! (A short code sketch of this two-step update is given below.)
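A minimal Python sketch of this two-step sampler (the course itself points to MatLab/C++ code). The hyperparameters 𝛼, 𝛾, 𝛿 are those quoted on the previous slides; the vectors p and t below are illustrative placeholders, since the actual pump data are not reproduced on these slides, and numpy's Gamma parameterization uses a scale, i.e. the inverse of the rate written above.

```python
import numpy as np

# Illustrative placeholder data: the actual failure counts p_i and observation
# times t_i of the pump dataset are not reproduced on these slides.
p = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])
t = np.array([94.3, 15.7, 62.9, 126.0, 5.2, 31.4, 1.0, 1.0, 2.1, 10.5])

alpha, gamma_, delta = 1.8, 0.01, 1.0         # hyperparameters from the slides
rng = np.random.default_rng(0)

n_iter = 5_000
beta = 1.0                                    # initial guess for beta
lam_draws = np.empty((n_iter, p.size))
beta_draws = np.empty(n_iter)

for it in range(n_iter):
    # Step 1: lambda_i | beta, t_i, p_i ~ Ga(p_i + alpha, t_i + beta)   (shape, rate)
    lam = rng.gamma(shape=p + alpha, scale=1.0 / (t + beta))
    # Step 2: beta | lambda_1..10 ~ Ga(gamma + 10*alpha, delta + sum_i lambda_i)
    beta = rng.gamma(shape=gamma_ + 10 * alpha, scale=1.0 / (delta + lam.sum()))
    lam_draws[it], beta_draws[it] = lam, beta

# Posterior summaries after discarding a burn-in period.
print(lam_draws[500:].mean(axis=0), beta_draws[500:].mean())
```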



Iterative Sampling
• With this iterative procedure:

• Are we sampling from the desired joint distribution of the 11 variables?

• If yes, how many times should the iteration above be repeated?

• The validity of the approach described here is derived from the fact that the sequence 𝜃(𝑡) := (𝜆1(𝑡), 𝜆2(𝑡), . . . , 𝜆10(𝑡), 𝛽(𝑡)) is a Markov chain.



Introduction to Markov Chain Monte Carlo
• Markov chain: a sequence of random variables {𝑋𝑛, 𝑛 ∈ ℕ} defined on (𝒳, ℬ(𝒳)) such that for any 𝐴 ∈ ℬ(𝒳) the following probability condition is satisfied:

$$\mathbb{P}(X_n \in A \mid X_0,\ldots,X_{n-1}) = \mathbb{P}(X_n \in A \mid X_{n-1}),$$

and we write:

$$\text{Transition kernel: } P(x, A) = \mathbb{P}(X_n \in A \mid X_{n-1} = x)$$

• Markov chain Monte Carlo (MCMC): given a target distribution 𝜋, we need to design a transition kernel 𝑃 such that asymptotically

$$\frac{1}{N}\sum_{n=1}^{N} f(X_n) \xrightarrow{N\to\infty} \int f(x)\,\pi(x)\,dx \quad \text{and/or} \quad X_n \sim \pi$$

• It is easy to simulate the Markov chain even if 𝜋 is complex.



Autoregressive Model
• Consider the autoregressive model, for |𝑎| < 1:

$$X_n = a\,X_{n-1} + V_n, \qquad V_n \sim \mathcal{N}(0,\sigma^2)$$

• The limiting (invariant) distribution is:

$$\pi(x) = \mathcal{N}\!\left(x;\ 0,\ \frac{\sigma^2}{1-a^2}\right)$$

• To sample from 𝜋, we just simulate the Markov chain and we know that asymptotically 𝑋𝑛 ∼ 𝜋.

• Of course this problem is only meant to demonstrate the main idea of MCMC, since here we can sample directly from 𝜋!



Autoregressive Model
• Consider 100 independent Markov chains run in parallel.

• We assume that the initial distribution of these Markov chains is 𝒰[0,20], so initially the Markov chain samples are not distributed according to 𝜋.

• In the following example, we choose 𝑎 = 0.4, 𝜎 = 5 (see here for a MatLab implementation; a short Python sketch of the same experiment is given below).
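The slides link to a MatLab implementation; the following is a minimal Python sketch of the same experiment, assuming 𝑎 = 0.4, 𝜎 = 5 and the 𝒰[0,20] initialization stated above (seed and array sizes are my own choices). It runs 100 chains in parallel and also shows the single-chain time average, anticipating the point made a few slides later.

```python
import numpy as np

a, sigma = 0.4, 5.0
rng = np.random.default_rng(1)

# 100 independent chains, initialized from U[0, 20] (not from the target).
n_chains, n_steps = 100, 100
X = rng.uniform(0.0, 20.0, size=n_chains)
for _ in range(n_steps):
    X = a * X + sigma * rng.normal(size=n_chains)

# After many steps, X is (approximately) distributed as N(0, sigma^2 / (1 - a^2)).
print("empirical var:", X.var(), " target var:", sigma**2 / (1 - a**2))

# A single long chain gives the same answer through its time average (ergodicity).
x, xs = rng.uniform(0.0, 20.0), []
for _ in range(100_000):
    x = a * x + sigma * rng.normal()
    xs.append(x)
print("single-chain var:", np.var(xs[1000:]))  # discard a short burn-in
```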



Example
 A Markov chain with a normal distribution as target distribution.
(Figure: histograms of the chain samples at the initial distribution and at steps 1, 2, 3, 4, and 100; the distribution of the samples gradually approaches the target normal density. A MatLab implementation is referenced in the original slides.)


Example
 Histograms of 100 independent Markov chains with a normal distribution as target distribution.

(Figure: histograms of the 100 chains at the initial distribution and at steps 1, 2, 3, 4, and 100. A MatLab implementation is referenced in the original slides.)


Example
• The target normal distribution seems to "attract" the distribution of the samples and even to be a fixed point of the algorithm.

• We have produced 100 independent samples from the normal distribution.

• We will see that it is not necessary to run 𝑁 Markov chains in parallel in order to obtain 100 samples; one can instead use a single Markov chain and build the histogram from one long trajectory of that chain.



Markov Chain Monte Carlo
• The estimate of the target distribution, through the series of histograms, improves with the number of iterations.

• Assume that we have stored 𝑋𝑛, 𝑛 = 1, . . . , 𝑁 for 𝑁 large and wish to estimate $\int_{\mathcal{X}} f(x)\,\pi(x)\,dx$.

• We suggest the estimator $\frac{1}{N}\sum_{n=1}^{N} f(X_n)$, which is the estimator we used before when the 𝑋𝑛, 𝑛 = 1, . . . , 𝑁 were independent.

• Under relatively mild conditions, such an estimator is consistent despite the fact that the samples 𝑋𝑛, 𝑛 = 1, . . . , 𝑁 are not independent. Under additional conditions, a CLT also holds with a rate of convergence $1/\sqrt{N}$.



Markov Chain Monte Carlo
• We are interested in Markov chains with a transition kernel 𝑃 that has the following three important properties, observed in the autoregressive example:

A. The desired distribution 𝜋 is an invariant distribution of the Markov chain, i.e.

$$\int_{\mathcal{X}} \pi(x)\,P(x,y)\,dx = \pi(y).$$

We will see in a forthcoming lecture that this is satisfied if the following detailed balance (reversibility) equation holds:

$$\pi(x)\,P(x,y) = \pi(y)\,P(y,x)$$

B. The successive distributions of the Markov chain converge towards 𝜋 regardless of the starting point.

C. The estimator $\frac{1}{N}\sum_{n=1}^{N} f(X_n)$ converges towards $\mathbb{E}_\pi\big(f(X)\big)$ and asymptotically 𝑋𝑛 ∼ 𝜋 (a stronger requirement).
Markov Chain Monte Carlo
• Since there is an infinite number of kernels 𝑃(𝑥, 𝑦) that admit 𝜋(𝑥) as their invariant distribution, the main task in MCMC is coming up with good ones.

• Convergence is ensured under very weak assumptions: irreducibility and aperiodicity.

• It is usually easy to establish that an MCMC sampler converges towards 𝜋(𝑥), but difficult to obtain rates of convergence.



The Gibbs Sampler
• The Gibbs sampler is a generic method to sample from a high-dimensional distribution.

• It generates a Markov chain which converges to the target distribution under weak assumptions: irreducibility and aperiodicity.



The Two Component Gibbs Sampler
• Consider the target distribution $\pi(\theta)$ such that $\theta = \{\theta^1, \theta^2\}$. The two-component Gibbs sampler proceeds as follows:

• Initialization:
  Select deterministically or randomly $\theta_0 = (\theta_0^1, \theta_0^2)$.

• Iteration $i$, $i \ge 1$:
  Sample $\theta_i^1 \sim \pi(\theta^1 \mid \theta_{i-1}^2)$
  Sample $\theta_i^2 \sim \pi(\theta^2 \mid \theta_i^1)$

• Sampling from the conditionals is often feasible even when sampling from the joint is impossible (e.g. in the nuclear pump data).



Invariant Distribution
• Clearly $(\theta_i^1, \theta_i^2)$ is a Markov chain. Its transition kernel is:

$$P\big((\theta^1,\theta^2),(\tilde\theta^1,\tilde\theta^2)\big) = \pi(\tilde\theta^1 \mid \theta^2)\,\pi(\tilde\theta^2 \mid \tilde\theta^1)$$

• The invariance equation $\int_{\mathcal{X}} \pi(x)P(x,y)\,dx = \pi(y)$ is satisfied:

$$\iint \pi(\theta^1,\theta^2)\,P\big((\theta^1,\theta^2),(\tilde\theta^1,\tilde\theta^2)\big)\,d\theta^1 d\theta^2
= \iint \pi(\theta^1,\theta^2)\,\pi(\tilde\theta^1\mid\theta^2)\,\pi(\tilde\theta^2\mid\tilde\theta^1)\,d\theta^1 d\theta^2$$
$$= \int \pi(\theta^2)\,\pi(\tilde\theta^1\mid\theta^2)\,\pi(\tilde\theta^2\mid\tilde\theta^1)\,d\theta^2
= \int \pi(\tilde\theta^1,\theta^2)\,\pi(\tilde\theta^2\mid\tilde\theta^1)\,d\theta^2
= \pi(\tilde\theta^1)\,\pi(\tilde\theta^2\mid\tilde\theta^1) = \pi(\tilde\theta^1,\tilde\theta^2)$$



Irreducibility
• That $\pi$ is the invariant distribution of $P$ (or that the detailed balance equation $\pi(\theta)P(\theta,\theta') = \pi(\theta')P(\theta',\theta)$ is satisfied) does not by itself ensure that the Gibbs sampler converges towards the invariant distribution.

• Additionally, it is required to ensure irreducibility: the Markov chain can move to any set $A$ such that $\pi(A) > 0$ from (almost) any starting point.

• This ensures that

$$\frac{1}{N}\sum_{n=1}^{N} f\big(\theta_n^1,\theta_n^2\big) \longrightarrow \iint f(\theta^1,\theta^2)\,\pi(\theta^1,\theta^2)\,d\theta^1 d\theta^2,$$

but not that asymptotically $(\theta_n^1,\theta_n^2) \sim \pi$.



Irreducibility
• A distribution whose support consists of two disconnected pieces (shown in a figure in the original slides) leads to a reducible Gibbs sampler.

• Conditioning on 𝑥₁ < 1, the conditional distribution of 𝑥₂ cannot produce a value in [1,2], so the chain never crosses between the two pieces.



Aperiodicity
• Consider an example with $\mathcal{X} = \{1, 2\}$ and transition probabilities $P(1,2) = P(2,1) = 1$. The invariant distribution is clearly given by $\pi(1) = \pi(2) = 1/2$.

• However, we know that if the chain starts at $X_0 = 1$, then $X_{2n} = 1$ and $X_{2n+1} = 2$ for any $n$.

• We still have

$$\frac{1}{N}\sum_{n=1}^{N} f(X_n) \longrightarrow \int f(x)\,\pi(x)\,dx,$$

but clearly $X_n$ is not distributed according to $\pi$ for any fixed $n$.

• You need to make sure that you do not explore the space in a periodic way, to ensure that $X_n \sim \pi$ asymptotically. (A small simulation of this two-state chain is sketched below.)
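A minimal Python sketch of this two-state example, illustrating the slide's point: the time average of f(X_n) converges to the correct value even though X_n itself keeps alternating deterministically and is never distributed according to 𝜋 at a fixed time.

```python
import numpy as np

# Deterministic two-state chain: P(1,2) = P(2,1) = 1, invariant pi(1) = pi(2) = 1/2.
N = 10_001
X = np.empty(N, dtype=int)
X[0] = 1
for n in range(1, N):
    X[n] = 2 if X[n - 1] == 1 else 1            # always jump to the other state

f = lambda x: x                                  # any test function f
print("time average of f:", f(X).mean())         # approx 1.5 = E_pi[f(X)]
print("X at the first few even times:", X[::2][:5])  # always 1: X_n is not ~ pi
```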



Gibbs Sampler
• If $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ where $p > 2$, the Gibbs sampler still applies.

• Initialization:
  Select deterministically or randomly $\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_p^0)$.

• Iteration $i$, $i \ge 1$:
  For $k = 1{:}p$
    Sample $\theta_k^i \sim \pi(\theta_k \mid \theta_{-k}^i)$,
    where $\theta_{-k}^i = \big(\theta_1^i, \ldots, \theta_{k-1}^i, \theta_{k+1}^{i-1}, \ldots, \theta_p^{i-1}\big)$.



Systematic-Scan Gibbs Sampler
• Systematic-scan Gibbs: let $\theta^i = (\theta_1^i, \theta_2^i, \ldots, \theta_p^i)$.

  Update $\theta_1^i$ from $\pi(\,\cdot \mid \theta_2^{i-1}, \ldots, \theta_p^{i-1})$
  Update $\theta_2^i$ from $\pi(\,\cdot \mid \theta_1^{i}, \theta_3^{i-1}, \ldots, \theta_p^{i-1})$
  ......
  Update $\theta_p^i$ from $\pi(\,\cdot \mid \theta_1^{i}, \theta_2^{i}, \ldots, \theta_{p-1}^{i})$



Random Scan Gibbs Sampler
• Consider again $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ where $p > 2$. We consider the following random-scan Gibbs sampler.

• Initialization:
  Select deterministically or randomly $\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_p^0)$.

• Iteration $i$, $i \ge 1$:
  Sample $K \sim \mathcal{U}\{1,2,\ldots,p\}$
  Set $\theta_{-K}^i = \theta_{-K}^{i-1}$
  Sample $\theta_K^i \sim \pi(\theta_K \mid \theta_{-K}^i)$,
  where $\theta_{-K}^i = \big(\theta_1^i, \ldots, \theta_{K-1}^i, \theta_{K+1}^i, \ldots, \theta_p^i\big)$.
Random-Scan Gibbs Sampler
• Random-scan Gibbs: let $\theta^i = (\theta_1^i, \theta_2^i, \ldots, \theta_p^i)$ at step (iteration) $i$.

  Draw $j$ from $\{1, \ldots, p\}$ with probability $w_j = 1/p$.
  Draw the new coordinate $j$, $\theta_j \mid \theta_{-j} \sim \pi(\,\cdot \mid \theta_{-j})$, and leave the remaining components unchanged; that is, set $\theta_{-j}^i = \theta_{-j}^{i-1}$.

(A generic code sketch of this random-scan update is given below.)
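A generic sketch of the random-scan update in Python, under the assumption that the user supplies one function per coordinate that samples from the corresponding full conditional. The function names and the illustrative bivariate-Gaussian conditionals below are not from the slides; they simply anticipate the example on the following slides.

```python
import numpy as np

def random_scan_gibbs(theta0, conditional_samplers, n_iter, rng=None):
    """Random-scan Gibbs: at each iteration pick one coordinate uniformly at random
    and resample it from its full conditional, leaving the others unchanged."""
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        j = rng.integers(theta.size)               # each coordinate has probability w_j = 1/p
        theta[j] = conditional_samplers[j](theta, rng)
        chain[i] = theta
    return chain

# Illustration: bivariate Gaussian with correlation rho (full conditionals known in closed form).
rho = 0.5
samplers = [
    lambda th, rng: rng.normal(rho * th[1], np.sqrt(1 - rho**2)),  # x1 | x2
    lambda th, rng: rng.normal(rho * th[0], np.sqrt(1 - rho**2)),  # x2 | x1
]
chain = random_scan_gibbs([-3.0, -3.0], samplers, n_iter=10_000)
print(chain[1000:].mean(axis=0), np.corrcoef(chain[1000:].T)[0, 1])
```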



Gibbs Sampler: Example
• Consider the following bivariate target distribution:

$$\pi(x_1,x_2) = \mathcal{N}\!\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\right) \propto \exp\!\left(-\tfrac{1}{2}\,(x_1\ x_2)\,\frac{1}{1-\rho^2}\begin{pmatrix}1&-\rho\\ -\rho&1\end{pmatrix}\begin{pmatrix}x_1\\ x_2\end{pmatrix}\right)$$

• The marginal distribution is given as:

$$\pi(x_2) \propto \exp\!\left(-\tfrac{1}{2}x_2^2\right)$$

• A systematic-scan Gibbs sampler (see a C++ implementation; a Python sketch is also given below) is generated with the following conditionals:

$$x_1^{t+1} \mid x_2^{t} \sim \mathcal{N}\!\left(\rho\, x_2^{t},\ 1-\rho^2\right), \qquad x_2^{t+1} \mid x_1^{t+1} \sim \mathcal{N}\!\left(\rho\, x_1^{t+1},\ 1-\rho^2\right)$$



Gibbs Sampler: Example
• Set 𝜌 = 0.5, number of iterations 10,000, and initial state (𝑥₁, 𝑥₂) = (−3, −3).

(Figure, left: histogram of 𝑥₁, whose exact marginal pdf is the standard Gaussian; right: 𝑥₁ vs. 𝑥₂ scatter plot. C++ programs are referenced in the original slides.)


Gibbs Sampler: Example
• Set 𝜌 = 0.999, number of iterations 10,000, and initial state (𝑥₁, 𝑥₂) = (−3, −3).

• We can see that in this case of highly correlated variables the sampling process is inaccurate: the chain moves in very small steps and explores the target slowly.

(Figure, left: histogram of 𝑥₁, whose exact marginal pdf is the standard Gaussian; right: 𝑥₁ vs. 𝑥₂ scatter plot.)


Convergence of the Gibbs Sampler
• Even when irreducibility and aperiodicity are ensured, the Gibbs sampler can still converge very slowly.

• Consider the target bivariate Gaussian distribution

$$\mathcal{N}\!\left(0,\ \begin{pmatrix} a & b \\ b & a \end{pmatrix}\right)$$

• A systematic-scan Gibbs sampler is generated as

$$x_1^{t+1} \mid x_2^{t} \sim \mathcal{N}\!\left(\frac{b}{a}\,x_2^{t},\ a-\frac{b^2}{a}\right), \qquad
x_2^{t+1} \mid x_1^{t+1} \sim \mathcal{N}\!\left(\frac{b}{a}\,x_1^{t+1},\ a-\frac{b^2}{a}\right)$$

• In this example, we set

$$\mathcal{N}\!\left(0,\ \begin{pmatrix} 100 & 99 \\ 99 & 100 \end{pmatrix}\right)$$

(A short sketch quantifying the slow mixing for this target is given below.)
Convergence of the Gibbs Sampler
• The Gibbs sampling path and the equiprobability curves of the target are plotted below.

(Figure: zig-zag Gibbs sampling path over the elongated equiprobability contours of the highly correlated Gaussian; each move is parallel to a coordinate axis. A C++ implementation is referenced in the original slides.)


MatLab Implementation: Gibbs Sampler
• Consider the Gaussian $\pi(x_1, x_2) = \mathcal{N}(\mu, C)$. Following the conditionals shown earlier, it can be shown that the Gibbs sampler can proceed as follows:

$$x_1^{(t+1)} = \mu_1 - \frac{C^{-1}_{12}}{C^{-1}_{11}}\big(x_2^{(t)}-\mu_2\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{11}}}, \qquad
x_2^{(t+1)} = \mu_2 - \frac{C^{-1}_{12}}{C^{-1}_{22}}\big(x_1^{(t+1)}-\mu_1\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{22}}}, \qquad \texttt{randn} \sim \mathcal{N}(0,1),$$

where $C^{-1}_{jk}$ denotes the $(j,k)$ entry of the precision matrix $C^{-1}$.

(Figure: Gibbs samples overlaid on the contours of the target Gaussian. A MatLab implementation is referenced in the original slides.)


MatLab Implementation: Gibbs Sampler
• Consider the Gaussian $\pi(x_1, x_2) = \mathcal{N}(\mu, C)$. Following the conditionals shown earlier, it can be shown that the Gibbs sampler can proceed as follows:

$$x_1^{(t+1)} = \mu_1 - \frac{C^{-1}_{12}}{C^{-1}_{11}}\big(x_2^{(t)}-\mu_2\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{11}}}, \qquad
x_2^{(t+1)} = \mu_2 - \frac{C^{-1}_{12}}{C^{-1}_{22}}\big(x_1^{(t+1)}-\mu_1\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{22}}}, \qquad \texttt{randn} \sim \mathcal{N}(0,1)$$

(Figure: Gibbs samples overlaid on the contours of the target Gaussian. Another MatLab implementation, with movie-frame animation, is referenced in the original slides.)


Gibbs Sampler for Mixture of Gaussians

(Figure: 3-D plot of the estimated density for a mixture of Gaussians as a function of x and of the Gibbs iteration, up to 1000 iterations. A MatLab implementation is referenced in the original slides.)


Gibbs Sampler for Mixture of Gaussians

(Figure: 3-D plot of the estimated density for a mixture of Gaussians as a function of x and of the Gibbs iteration, up to 1000 iterations. A MatLab implementation, which works like a movie-frame animation, is referenced in the original slides.)


Gibbs Sampler: Example
• Consider the following target distribution on $\{0,1,\ldots,n\}\times[0,1]$:

$$\pi(x_1,x_2) \propto \binom{n}{x_1}\, x_2^{\,x_1+\alpha-1}\,(1-x_2)^{\,n-x_1+\beta-1}$$

• The two conditional distributions for the Gibbs sampler are

$$x_1 \mid x_2 \sim \text{Binom}(n, x_2), \qquad x_2 \mid x_1 \sim \mathcal{B}e(x_1+\alpha,\ n-x_1+\beta)$$

• We set $n = 20$, $\alpha = \beta = 0.5$, initial state $(0,0)$, and 10,000 iterations.

• See here for a C++ implementation and a MatLab implementation; a short Python sketch also follows below.

(Figure, left: histogram of 𝑥₂, whose exact marginal pdf is a Beta distribution; right: histogram of 𝑥₁.)
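A minimal Python sketch of this sampler (the course links to C++ and MatLab code; this is an equivalent of my own, with n, α, β, the initial state, and the iteration count taken from the slide):

```python
import numpy as np

n, alpha, beta_, n_iter = 20, 0.5, 0.5, 10_000
rng = np.random.default_rng(4)
x1, x2 = 0, 0.0                  # initial state (0, 0) as on the slide
X1 = np.empty(n_iter, dtype=int)
X2 = np.empty(n_iter)

for t in range(n_iter):
    x1 = rng.binomial(n, x2)                    # x1 | x2 ~ Binom(n, x2)
    x2 = rng.beta(x1 + alpha, n - x1 + beta_)   # x2 | x1 ~ Be(x1 + alpha, n - x1 + beta)
    X1[t], X2[t] = x1, x2

# The marginal of x2 is Beta(alpha, beta); compare empirical and exact means.
print("empirical mean of x2:", X2.mean(), " exact Beta mean:", alpha / (alpha + beta_))
```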


Gibbs Sampler: Example
• Consider a likelihood defined by the Cauchy distribution 𝒞(𝜇, 1) with two measurements 𝑥₁, 𝑥₂:

$$\ell(\mu \mid \mathcal{D}_n) = \prod_{i=1}^{n=2} f_\mu(x_i) \propto \frac{1}{\big[1+(x_1-\mu)^2\big]\big[1+(x_2-\mu)^2\big]}$$

• We take as prior a normal distribution $\mu \sim \mathcal{N}(0, 10)$.

• This leads to a posterior of the form:

$$\pi(\mu \mid \mathcal{D}) \propto \frac{e^{-\mu^2/20}}{\big[1+(x_1-\mu)^2\big]\big[1+(x_2-\mu)^2\big]}$$

• How do we use the Gibbs sampler to sample from this univariate distribution?



Gibbs Sampler: Example
2

20
e
 ( | D ) ~
1  ( x   ) 1  ( x
1
2
2   ) 2

 We can use Gibbs sampler by noticing:

1 i 1 ( xi   )2 
 e  
di
1  ( xi   ) 2
0

 We can then think 𝜋(𝜇|𝒟) as the marginal of 𝜋(𝜇, 𝜔1 , 𝜔2 |𝒟)


2 2
 i 1 ( xi   )2 
 (  , 1 , 2 | D ) ~ e 20
e
i 1
 

 The Gibbs sampler is based on the following 2 steps:


 Generate 𝜇 (𝑡) ~π 𝜇|𝜔 𝑡−1 ,𝒟

 Generate 𝜔 (𝑡) ~π 𝜔|𝜇 𝑡 ,𝒟


Gibbs Sampler: Example
• The step $\mu^{(t)} \sim \pi(\mu \mid \omega^{(t-1)}, \mathcal{D})$ is straightforward, since

$$\pi(\mu \mid \omega, \mathcal{D}) = \mathcal{N}\!\left(\frac{\sum_i \omega_i x_i}{\sum_i \omega_i + 1/20},\ \frac{1}{2\left(\sum_i \omega_i + 1/20\right)}\right)$$

• The step $\omega^{(t)} \sim \pi(\omega \mid \mu^{(t)}, \mathcal{D})$ is also straightforward:

$$\pi\big(\omega_i \mid \mu^{(t)}, \mathcal{D}\big) = \mathcal{E}xp\!\left(1+(x_i-\mu^{(t)})^2\right)$$

• A MatLab implementation can be found here; a brief Python sketch follows below.
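A minimal Python sketch of this data-augmentation Gibbs sampler, assuming two illustrative observations x = (x₁, x₂) (the actual measurement values are not given on the slides) and the 𝒩(0, 10) prior stated above:

```python
import numpy as np

x = np.array([3.0, -3.0])         # illustrative measurements (not from the slides)
prior_var = 10.0                  # mu ~ N(0, 10), so 1/(2*prior_var) = 1/20 below
n_iter = 10_000
rng = np.random.default_rng(5)

omega = np.ones_like(x)           # initial latent variables
mu_chain = np.empty(n_iter)

for t in range(n_iter):
    # mu | omega, D ~ N( sum(omega*x) / (sum(omega) + 1/20),  1 / (2*(sum(omega) + 1/20)) )
    denom = omega.sum() + 1.0 / (2.0 * prior_var)
    mu = rng.normal(np.dot(omega, x) / denom, np.sqrt(1.0 / (2.0 * denom)))
    # omega_i | mu, D ~ Exp(rate = 1 + (x_i - mu)^2); numpy uses the scale = 1/rate convention
    omega = rng.exponential(scale=1.0 / (1.0 + (x - mu) ** 2))
    mu_chain[t] = mu

print("posterior mean of mu:", mu_chain[1000:].mean())
```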



Gibbs sampler: Example
• On the left, the last 100 iterations of the chain 𝜇(𝑡); on the right, the histogram of the chain 𝜇(𝑡) and a comparison with the target density, for 10,000 iterations.

(Figure as described above. A MatLab implementation is referenced in the original slides.)


Block and Metropolized Gibbs
• Instead of updating single coordinates 𝑥ⱼ, one can update blocks 𝒙_𝐴. This is more efficient but requires knowing the block conditionals 𝜋(𝒙_𝐴 | 𝒙_−𝐴) and being able to sample from them.

• Combinations of Gibbs and Metropolis-Hastings (to be discussed in a follow-up lecture) are popular.

• In Metropolized Gibbs, for example, some coordinates are updated from their conditionals and others using arbitrary proposals, as in Metropolis-Hastings.

• Each individual transition kernel in Gibbs (which updates a single coordinate) is neither irreducible nor aperiodic. However, their combination (random or systematic scan) might be!



Gibbs Sampling
• Consider a target 𝜋(𝑥₁, 𝑥₂) (e.g. a uniform distribution) with disconnected support, as in the figure (two disks). Conditioning on 𝑥₁ < 0, the distribution of 𝑥₂ cannot produce a value in [0,1], so a coordinate-wise Gibbs sampler never jumps between the two disks.

• You can make this type of problem work by introducing a proper coordinate transformation, e.g.

$$y_1 = x_1 + x_2, \qquad y_2 = x_2 - x_1$$

• Conditioning now on 𝑦₁ produces a uniform distribution on the union of a negative and a positive interval. Therefore, one iteration of the Gibbs sampler is sufficient to jump from one disk to the other one.

(Figure: the two disconnected disks in the (𝑥₁, 𝑥₂) plane.)



Gibbs Sampler: Recommendation
• Have as few blocks as possible.

• Put the most correlated variables in the same block. If necessary, reparametrize the model to achieve this.

• Integrate out analytically as many variables as possible.

• There is no general strategy that will work for all problems.



Bayesian Variable Selection in Regression
• We select the following regression model:

$$Y = \sum_{k=1}^{p} \beta_k X_k + \sigma V, \qquad V \sim \mathcal{N}(0,1),$$

where we assume the prior $\sigma^2 \sim \mathcal{IG}\!\left(\frac{\nu_0}{2}, \frac{\gamma_0}{2}\right)$ and, for $\alpha^2 \ll 1$,

$$\beta_k \sim \tfrac{1}{2}\,\mathcal{N}(0,\ \sigma^2\alpha^2\delta^2) + \tfrac{1}{2}\,\mathcal{N}(0,\ \sigma^2\delta^2)$$

• We introduce a latent variable $\gamma_k \in \{0,1\}$ such that:

$$\Pr(\gamma_k = 0) = \Pr(\gamma_k = 1) = \tfrac{1}{2}, \qquad
\beta_k \mid \gamma_k = 0 \sim \mathcal{N}(0,\ \sigma^2\alpha^2\delta^2), \qquad \beta_k \mid \gamma_k = 1 \sim \mathcal{N}(0,\ \sigma^2\delta^2)$$



A Bad Gibbs Sampler
• We have parameters $\beta_{1:p}, \gamma_{1:p}, \sigma^2$ and observe $\mathcal{D} = \{x_i, y_i\}_{i=1}^n$.

• A potential Gibbs sampler consists of sampling iteratively from

$$p(\beta_{1:p} \mid \mathcal{D}, \gamma_{1:p}, \sigma^2)\ \text{(Gaussian)}, \quad p(\sigma^2 \mid \mathcal{D}, \gamma_{1:p}, \beta_{1:p})\ \text{(inverse Gamma)}, \quad \text{and} \quad p(\gamma_{1:p} \mid \mathcal{D}, \beta_{1:p}, \sigma^2).$$

• In particular, $p(\gamma_{1:p} \mid \mathcal{D}, \beta_{1:p}, \sigma^2) = \prod_{k=1}^{p} p(\gamma_k \mid \mathcal{D}, \beta_k, \sigma^2)$ and, using the prior models,

$$p(\gamma_k = 1 \mid \beta_k, \sigma^2) = \frac{\dfrac{1}{\sqrt{2\pi\sigma^2\delta^2}}\exp\!\left(-\dfrac{\beta_k^2}{2\sigma^2\delta^2}\right)}
{\dfrac{1}{\sqrt{2\pi\sigma^2\delta^2}}\exp\!\left(-\dfrac{\beta_k^2}{2\sigma^2\delta^2}\right) + \dfrac{1}{\sqrt{2\pi\sigma^2\delta^2\alpha^2}}\exp\!\left(-\dfrac{\beta_k^2}{2\sigma^2\delta^2\alpha^2}\right)}$$

• The same result can be shown for $p(\gamma_k \mid \mathcal{D}, \beta_k, \sigma^2)$. The Gibbs sampler becomes reducible as $\alpha$ goes to zero: for very small $\alpha$, a coefficient currently assigned to the small-variance component essentially never escapes it.



Bayes Variable Selection
• This is the result of bad modeling. We consider $\alpha \simeq 0$ and write:

$$Y = \sum_{k=1}^{p} \gamma_k\,\beta_k X_k + \sigma V, \qquad V \sim \mathcal{N}(0,1)$$

• Here $\gamma_k = 1$ if $X_k$ is included and $\gamma_k = 0$ otherwise.

• However, this still suggests that $\beta_k$ is defined even when $\gamma_k = 0$.



Bayes Variable Selection
• A neater way to write such models is

$$Y = \sum_{k:\,\gamma_k=1} \beta_k X_k + \sigma V = \beta_\gamma^{T} X_\gamma + \sigma V, \qquad V \sim \mathcal{N}(0,1),$$

where, for a vector $\gamma = (\gamma_1, \ldots, \gamma_p)$,

$$\beta_\gamma = \{\beta_k : \gamma_k = 1\}, \qquad X_\gamma = \{X_k : \gamma_k = 1\}, \qquad n_\gamma = \sum_{k=1}^{p} \gamma_k.$$

• Prior distributions:

$$\pi(\beta_\gamma, \sigma^2 \mid \gamma) = \mathcal{N}\!\left(\beta_\gamma;\ 0,\ \sigma^2\delta^2 I_{n_\gamma}\right)\,\mathcal{IG}\!\left(\sigma^2;\ \frac{\nu_0}{2}, \frac{\gamma_0}{2}\right), \qquad
\pi(\gamma) = \prod_{k=1}^{p}\pi(\gamma_k) = 2^{-p}.$$



Bayes Variable Selection
• We are interested in sampling from the trans-dimensional distribution $\pi(\gamma, \beta_\gamma, \sigma^2 \mid \mathcal{D})$.

• However, we know that

$$\pi(\gamma, \beta_\gamma, \sigma^2 \mid \mathcal{D}) = \pi(\gamma \mid \mathcal{D})\,\pi(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma),$$

where $\pi(\gamma \mid \mathcal{D}) \propto \pi(\mathcal{D} \mid \gamma)\,\pi(\gamma)$ and (using an earlier lecture result on the evidence for this regression model)

$$\pi(\mathcal{D} \mid \gamma) = \iint \pi(\mathcal{D}, \beta_\gamma, \sigma^2 \mid \gamma)\, d\beta_\gamma\, d\sigma^2 \;\propto\; |\Sigma_\gamma|^{1/2}\,\delta^{-n_\gamma}\left(\frac{\gamma_0 + \sum_{i=1}^{n} y_i^2 - \mu_\gamma^{T}\Sigma_\gamma^{-1}\mu_\gamma}{2}\right)^{-\frac{\nu_0+n}{2}},$$

with

$$\mu_\gamma = \Sigma_\gamma\left(\sum_{i=1}^{n} y_i\, x_{\gamma,i}\right), \qquad \Sigma_\gamma^{-1} = \frac{I_{n_\gamma}}{\delta^2} + \sum_{i=1}^{n} x_{\gamma,i}\, x_{\gamma,i}^{T}.$$



Bayes Variable Selection
• The full conditional distribution $\pi(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma)$ is

$$\pi(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma) = \mathcal{N}\!\left(\beta_\gamma;\ \mu_\gamma,\ \sigma^2\Sigma_\gamma\right)\,
\mathcal{IG}\!\left(\sigma^2;\ \frac{\nu_0+n}{2},\ \frac{\gamma_0 + \sum_{i=1}^{n} y_i^2 - \mu_\gamma^{T}\Sigma_\gamma^{-1}\mu_\gamma}{2}\right),$$

where

$$\mu_\gamma = \Sigma_\gamma\left(\sum_{i=1}^{n} y_i\, x_{\gamma,i}\right), \qquad \Sigma_\gamma^{-1} = \frac{I_{n_\gamma}}{\delta^2} + \sum_{i=1}^{n} x_{\gamma,i}\, x_{\gamma,i}^{T}.$$

• The derivation of this conditional was already given in an earlier lecture.



Bayes Variable Selection
• Popular alternative prior models for $\gamma_k$ include

$$\gamma_k \sim \mathcal{B}er(\varpi), \ \text{ where } \varpi \sim \mathcal{U}[0,1]; \qquad
\gamma_k \sim \mathcal{B}er(\varpi_k), \ \text{ where } \varpi_k \sim \mathcal{B}e(\alpha, \beta);$$

and the g-prior (Zellner):

$$\beta_\gamma \mid \gamma, \sigma^2 \sim \mathcal{N}\!\left(\beta_\gamma;\ 0,\ \delta^2\sigma^2 \big(X_\gamma^{T} X_\gamma\big)^{-1}\right),$$

where here, for robustness, we additionally use

$$\delta^2 \sim \mathcal{IG}\!\left(\frac{a_0}{2}, \frac{b_0}{2}\right).$$

• Such variations in the priors are very important and can affect the performance of the Bayesian model.



Bayes Variable Selection: Algorithm
• $\pi(\gamma \mid \mathcal{D})$ is a discrete probability distribution with $2^p$ potential values. We assume $\delta^2$ is known here.

• We can use the Gibbs sampler to sample from it (a code sketch follows below).

• Initialization:
  Select deterministically or randomly $\gamma^0 = (\gamma_1^0, \ldots, \gamma_p^0)$.

• Iteration $i$, $i \ge 1$:
  For $k = 1{:}p$
    Sample $\gamma_k^i \sim \pi(\gamma_k \mid \mathcal{D}, \gamma_{-k}^i)$,
    where $\gamma_{-k}^i = \big(\gamma_1^i, \ldots, \gamma_{k-1}^i, \gamma_{k+1}^{i-1}, \ldots, \gamma_p^{i-1}\big)$.

  Optional step: sample $\big(\beta_\gamma^i, \sigma^{2(i)}\big) \sim \pi\big(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma^i\big)$.
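To make the algorithm concrete, here is a minimal Python sketch (my own, not the course code). It uses the evidence formula from the previous slides, treats δ² as known, and takes π(γ) uniform; the function names, the hyperparameter values (δ² = 100, ν₀ = γ₀ = 1), the seed, and the toy data are illustrative assumptions.

```python
import numpy as np

def log_marginal(gamma, X, y, delta2, nu0, gam0):
    """log pi(D | gamma) up to an additive constant independent of gamma,
    using the evidence formula quoted on the previous slides."""
    n = y.size
    idx = np.flatnonzero(gamma)
    n_g = idx.size
    yy = y @ y
    if n_g == 0:
        half_logdet, quad = 0.0, 0.0
    else:
        Xg = X[:, idx]
        Sigma_inv = np.eye(n_g) / delta2 + Xg.T @ Xg
        mu = np.linalg.solve(Sigma_inv, Xg.T @ y)
        quad = mu @ Sigma_inv @ mu
        half_logdet = -0.5 * np.linalg.slogdet(Sigma_inv)[1]   # log |Sigma_gamma|^(1/2)
    return (half_logdet - 0.5 * n_g * np.log(delta2)
            - 0.5 * (nu0 + n) * np.log(0.5 * (gam0 + yy - quad)))

def gibbs_variable_selection(X, y, n_iter, delta2=100.0, nu0=1.0, gam0=1.0, seed=0):
    """Systematic-scan Gibbs over gamma only (delta^2 known, pi(gamma) uniform)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = np.zeros(p, dtype=int)
    chain = np.empty((n_iter, p), dtype=int)
    for i in range(n_iter):
        for k in range(p):
            gamma[k] = 1
            l1 = log_marginal(gamma, X, y, delta2, nu0, gam0)
            gamma[k] = 0
            l0 = log_marginal(gamma, X, y, delta2, nu0, gam0)
            # pi(gamma_k = 1 | D, gamma_{-k}) = m1 / (m0 + m1), computed stably in log space
            p1 = np.exp(l1 - np.logaddexp(l0, l1))
            gamma[k] = int(rng.random() < p1)
        chain[i] = gamma
    return chain

# Toy usage: only the first two predictors enter the data-generating model.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)
chain = gibbs_variable_selection(X, y, n_iter=2000)
print("posterior inclusion probabilities:", chain[500:].mean(axis=0))
```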



Bayes Variable Selection: Algorithm
• Consider now the case where $\delta^2$ is unknown.

• Initialization:
  Select deterministically or randomly $\big(\gamma^0, \beta_\gamma^0, \sigma^{2(0)}, \delta^{2(0)}\big)$.

• Iteration $i$, $i \ge 1$:
  For $k = 1{:}p$
    Sample $\gamma_k^i \sim \pi\big(\gamma_k \mid \mathcal{D}, \gamma_{-k}^i, \delta^{2(i-1)}\big)$,
    where $\gamma_{-k}^i = \big(\gamma_1^i, \ldots, \gamma_{k-1}^i, \gamma_{k+1}^{i-1}, \ldots, \gamma_p^{i-1}\big)$.

  Sample $\big(\beta_\gamma^i, \sigma^{2(i)}\big) \sim \pi\big(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma^i, \delta^{2(i-1)}\big)$.

  Sample $\delta^{2(i)} \sim \pi\big(\delta^2 \mid \gamma^i, \beta_\gamma^i, \sigma^{2(i)}\big)$.



Bayesian Variable Selection
• This very simple sampler is much more efficient than the ones where 𝛾 is sampled conditional upon (𝛽, 𝜎²).

• However, it mixes very slowly because the components are updated one at a time.

• Updating correlated components together would increase significantly the convergence speed of the algorithm, at the cost of increased complexity.



Bayesian Variable Selection Example
• Top five most likely models for the selection models discussed (table not reproduced here).

Results from: Statistical Computing and MC Methods, A. Doucet.
