
Lec30 GibbsSampling

The document discusses Gibbs sampling and its application to Bayesian regression variable selection. It begins with an introduction to incremental sampling strategies like Markov chain Monte Carlo (MCMC). Gibbs sampling is introduced as an MCMC method that iteratively samples the conditional distributions of model parameters. This allows sampling high-dimensional distributions by incrementally updating individual parameters. As an example, Gibbs sampling is applied to a hierarchical Bayesian model for failures in a nuclear power plant. The conditional distributions are derived and used within an iterative sampling procedure to obtain samples from the joint posterior distribution.


Gibbs Sampling

Prof. Nicholas Zabaras

Email: [email protected]
URL: https://www.zabaras.com/

October 23, 2020



Contents
• Incremental strategies for sampling, iterative sampling
• Introduction to MCMC, the autoregressive model
• The Gibbs sampler: systematic scan, random scan, Gibbs sampler examples, block and Metropolized Gibbs, application to variable/model selection in linear regression
• The goals for today's lecture: understand the fundamentals of MCMC; learn about the Gibbs sampler and how to apply it to Bayesian regression variable selection.

Suggested reading:
• C.P. Robert and G. Casella, Monte Carlo Statistical Methods, Chapter 3 (Google Books, slides, video)
• D. MacKay, Introduction to Monte Carlo Methods, reprint
• R. Neal, Probabilistic Inference Using Markov Chain Monte Carlo Methods, 1993
• C. Andrieu et al., An Introduction to MCMC for Machine Learning, Machine Learning, 50, 5-43, 2003
• S. Brooks, Markov Chain Monte Carlo Method and Its Application, Journal of the Royal Statistical Society, Series D (The Statistician), Vol. 47, No. 1 (1998), pp. 69-100
• G. Casella and E.I. George, Explaining the Gibbs Sampler, The American Statistician, Vol. 46, 1992, pp. 167-174
• S. Chib and E. Greenberg, Understanding the Metropolis-Hastings Algorithm, The American Statistician, Vol. 49, No. 4 (Nov. 1995), pp. 327-335



Using Incremental Strategies for Sampling
• We have seen that both rejection sampling (RS) and importance sampling (IS) are limited to problems of moderate dimension.

• The problem with these algorithms is that we try to sample all the components of a high-dimensional parameter simultaneously.

• To address this we will look next at incremental strategies:

- Iterative methods: Markov chain Monte Carlo.

- Sequential methods: sequential Monte Carlo.

A. Doucet, Statistical Computing: Monte Carlo Methods, Online course resource



Motivating Example
• Multiple failures in a nuclear plant.

• Model: failures of the 𝑖-th pump follow a Poisson process with rate 𝜆𝑖, 1 ≤ 𝑖 ≤ 10. For an observation time 𝑡𝑖, the number of failures 𝑝𝑖 is a Poisson 𝒫(𝜆𝑖𝑡𝑖) random variable.

• The unknowns consist of 𝜃 := (𝜆1, 𝜆2, . . . , 𝜆10, 𝛽), where 𝛽 is a parameter in the hierarchical model introduced next.

Statistical Computing and MC Methods, A. Doucet, Lecture 10.



Motivating Example: Nuclear Pump Data
• Hierarchical model:

$$\lambda_i \stackrel{i.i.d.}{\sim} \mathcal{G}a(\alpha,\beta), \qquad \beta \sim \mathcal{G}a(\gamma,\delta),$$

with 𝛼 = 1.8, 𝛾 = 0.01, 𝛿 = 1.

• The posterior distribution (see here for the Ga distribution):

$$\pi(\lambda_1,\ldots,\lambda_{10},\beta \mid t_i, p_i) \;\propto\; \prod_{i=1}^{10}\Big[\underbrace{(\lambda_i t_i)^{p_i} e^{-\lambda_i t_i}}_{\mathcal{P}(\lambda_i t_i)}\;\underbrace{\lambda_i^{\alpha-1} e^{-\beta\lambda_i}\,\beta^{\alpha}}_{\lambda_i \sim \mathcal{G}a(\alpha,\beta)}\Big]\;\underbrace{\beta^{\gamma-1} e^{-\delta\beta}}_{\beta \sim \mathcal{G}a(\gamma,\delta)}
\;\propto\; \prod_{i=1}^{10}\lambda_i^{p_i+\alpha-1} e^{-(t_i+\beta)\lambda_i}\;\beta^{10\alpha+\gamma-1} e^{-\delta\beta}$$

• It is not obvious how the inverse-CDF method, the accept/reject method, or importance sampling could be used for this multidimensional distribution!



Conditional Distributions
$$\pi(\lambda_1,\ldots,\lambda_{10},\beta \mid t_i, p_i) \;\propto\; \prod_{i=1}^{10}\lambda_i^{p_i+\alpha-1} e^{-(t_i+\beta)\lambda_i}\;\beta^{10\alpha+\gamma-1} e^{-\delta\beta}$$

• The conditionals can be obtained by direct observation from the above posterior:

$$\lambda_i \mid (\beta, t_i, p_i) \sim \mathcal{G}a(p_i+\alpha,\ t_i+\beta) \quad \text{for } 1 \le i \le 10$$
$$\beta \mid (\lambda_1,\ldots,\lambda_{10}) \sim \mathcal{G}a\Big(\gamma+10\alpha,\ \delta+\sum_{i=1}^{10}\lambda_i\Big)$$

• Instead of directly sampling the vector 𝜃 = (𝜆1, . . . , 𝜆10, 𝛽) at once, one could suggest sampling it iteratively.

• We can start with the 𝜆𝑖's for a given guess of 𝛽, followed by an update of 𝛽 given the new samples 𝜆1, . . . , 𝜆10.



Iterative Sampling
• Given a sample at iteration 𝑡, 𝜃(𝑡) = (𝜆1(𝑡), . . . , 𝜆10(𝑡), 𝛽(𝑡)), one could proceed as follows at iteration 𝑡 + 1:

$$\text{Step 1: } \lambda_i^{(t+1)} \mid (\beta^{(t)}, t_i, p_i) \sim \mathcal{G}a\big(p_i+\alpha,\ t_i+\beta^{(t)}\big) \quad \text{for } 1\le i\le 10$$

$$\text{Step 2: } \beta^{(t+1)} \mid (\lambda_1^{(t+1)},\ldots,\lambda_{10}^{(t+1)}) \sim \mathcal{G}a\Big(\gamma+10\alpha,\ \delta+\sum_{i=1}^{10}\lambda_i^{(t+1)}\Big)$$

• Note that instead of directly sampling in a space of dimension 11, one samples 11 times in spaces of dimension 1! (A short code sketch of this two-step update is given below.)
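A minimal Python sketch of this two-step sampler (the course itself points to MatLab/C++ code). The hyperparameters 𝛼, 𝛾, 𝛿 are those quoted on the previous slides; the vectors p and t below are illustrative placeholders, since the actual pump data are not reproduced on these slides, and numpy's Gamma parameterization uses a scale, i.e. the inverse of the rate written above.

```python
import numpy as np

# Illustrative placeholder data: the actual failure counts p_i and observation
# times t_i of the pump dataset are not reproduced on these slides.
p = np.array([5, 1, 5, 14, 3, 19, 1, 1, 4, 22])
t = np.array([94.3, 15.7, 62.9, 126.0, 5.2, 31.4, 1.0, 1.0, 2.1, 10.5])

alpha, gamma_, delta = 1.8, 0.01, 1.0         # hyperparameters from the slides
rng = np.random.default_rng(0)

n_iter = 5_000
beta = 1.0                                    # initial guess for beta
lam_draws = np.empty((n_iter, p.size))
beta_draws = np.empty(n_iter)

for it in range(n_iter):
    # Step 1: lambda_i | beta, t_i, p_i ~ Ga(p_i + alpha, t_i + beta)   (shape, rate)
    lam = rng.gamma(shape=p + alpha, scale=1.0 / (t + beta))
    # Step 2: beta | lambda_1..10 ~ Ga(gamma + 10*alpha, delta + sum_i lambda_i)
    beta = rng.gamma(shape=gamma_ + 10 * alpha, scale=1.0 / (delta + lam.sum()))
    lam_draws[it], beta_draws[it] = lam, beta

# Posterior summaries after discarding a burn-in period.
print(lam_draws[500:].mean(axis=0), beta_draws[500:].mean())
```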



Iterative Sampling
• With this iterative procedure:

• Are we sampling from the desired joint distribution of the 11 variables?

• If yes, how many times should the iteration above be repeated?

• The validity of the approach described here is derived from the fact that the sequence 𝜃(𝑡) := (𝜆1(𝑡), 𝜆2(𝑡), . . . , 𝜆10(𝑡), 𝛽(𝑡)) is a Markov chain.



Introduction to Markov Chain Monte Carlo
• Markov chain: a sequence of random variables {𝑋𝑛, 𝑛 ∈ ℕ} defined on (𝒳, ℬ(𝒳)) such that for any 𝐴 ∈ ℬ(𝒳) the following probability condition is satisfied:

$$\mathbb{P}(X_n \in A \mid X_0,\ldots,X_{n-1}) = \mathbb{P}(X_n \in A \mid X_{n-1}),$$

and we write:

$$\text{Transition kernel: } P(x, A) = \mathbb{P}(X_n \in A \mid X_{n-1} = x)$$

• Markov chain Monte Carlo (MCMC): given a target distribution 𝜋, we need to design a transition kernel 𝑃 such that asymptotically

$$\frac{1}{N}\sum_{n=1}^{N} f(X_n) \xrightarrow{N\to\infty} \int f(x)\,\pi(x)\,dx \quad \text{and/or} \quad X_n \sim \pi$$

• It is easy to simulate the Markov chain even if 𝜋 is complex.



Autoregressive Model
• Consider the autoregressive model, for |𝑎| < 1:

$$X_n = a\,X_{n-1} + V_n, \qquad V_n \sim \mathcal{N}(0,\sigma^2)$$

• The limiting (invariant) distribution is:

$$\pi(x) = \mathcal{N}\!\left(x;\ 0,\ \frac{\sigma^2}{1-a^2}\right)$$

• To sample from 𝜋, we just simulate the Markov chain and we know that asymptotically 𝑋𝑛 ∼ 𝜋.

• Of course this problem is only meant to demonstrate the main idea of MCMC, since here we can sample directly from 𝜋!



Autoregressive Model
• Consider 100 independent Markov chains run in parallel.

• We assume that the initial distribution of these Markov chains is 𝒰[0,20], so initially the Markov chain samples are not distributed according to 𝜋.

• In the following example, we choose 𝑎 = 0.4, 𝜎 = 5 (see here for a MatLab implementation; a short Python sketch of the same experiment is given below).
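The slides link to a MatLab implementation; the following is a minimal Python sketch of the same experiment, assuming 𝑎 = 0.4, 𝜎 = 5 and the 𝒰[0,20] initialization stated above (seed and array sizes are my own choices). It runs 100 chains in parallel and also shows the single-chain time average, anticipating the point made a few slides later.

```python
import numpy as np

a, sigma = 0.4, 5.0
rng = np.random.default_rng(1)

# 100 independent chains, initialized from U[0, 20] (not from the target).
n_chains, n_steps = 100, 100
X = rng.uniform(0.0, 20.0, size=n_chains)
for _ in range(n_steps):
    X = a * X + sigma * rng.normal(size=n_chains)

# After many steps, X is (approximately) distributed as N(0, sigma^2 / (1 - a^2)).
print("empirical var:", X.var(), " target var:", sigma**2 / (1 - a**2))

# A single long chain gives the same answer through its time average (ergodicity).
x, xs = rng.uniform(0.0, 20.0), []
for _ in range(100_000):
    x = a * x + sigma * rng.normal()
    xs.append(x)
print("single-chain var:", np.var(xs[1000:]))  # discard a short burn-in
```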



Example
 A Markov chain with a normal distribution as target distribution.
(Figure: histograms of the chain samples at the initial distribution and at steps 1, 2, 3, 4, and 100; the distribution of the samples gradually approaches the target normal density. A MatLab implementation is referenced in the original slides.)


Example
 Histograms of 100 independent Markov chains with a normal distribution as target distribution.

(Figure: histograms of the 100 chains at the initial distribution and at steps 1, 2, 3, 4, and 100. A MatLab implementation is referenced in the original slides.)


Example
• The target normal distribution seems to "attract" the distribution of the samples and even to be a fixed point of the algorithm.

• We have produced 100 independent samples from the normal distribution.

• We will see that it is not necessary to run 𝑁 Markov chains in parallel in order to obtain 100 samples; one can instead use a single Markov chain and build the histogram from one long trajectory of that chain.



Markov Chain Monte Carlo
• The estimate of the target distribution, through the series of histograms, improves with the number of iterations.

• Assume that we have stored 𝑋𝑛, 𝑛 = 1, . . . , 𝑁 for 𝑁 large and wish to estimate $\int_{\mathcal{X}} f(x)\,\pi(x)\,dx$.

• We suggest the estimator $\frac{1}{N}\sum_{n=1}^{N} f(X_n)$, which is the estimator we used before when the 𝑋𝑛, 𝑛 = 1, . . . , 𝑁 were independent.

• Under relatively mild conditions, such an estimator is consistent despite the fact that the samples 𝑋𝑛, 𝑛 = 1, . . . , 𝑁 are not independent. Under additional conditions, a CLT also holds with a rate of convergence $1/\sqrt{N}$.



Markov Chain Monte Carlo
• We are interested in Markov chains with a transition kernel 𝑃 that has the following three important properties, observed in the autoregressive example:

A. The desired distribution 𝜋 is an invariant distribution of the Markov chain, i.e.

$$\int_{\mathcal{X}} \pi(x)\,P(x,y)\,dx = \pi(y).$$

We will see in a forthcoming lecture that this is satisfied if the following detailed balance (reversibility) equation holds:

$$\pi(x)\,P(x,y) = \pi(y)\,P(y,x)$$

B. The successive distributions of the Markov chain converge towards 𝜋 regardless of the starting point.

C. The estimator $\frac{1}{N}\sum_{n=1}^{N} f(X_n)$ converges towards $\mathbb{E}_\pi\big(f(X)\big)$ and asymptotically 𝑋𝑛 ∼ 𝜋 (a stronger requirement).
Markov Chain Monte Carlo
• Since there is an infinite number of kernels 𝑃(𝑥, 𝑦) that admit 𝜋(𝑥) as their invariant distribution, the main task in MCMC is coming up with good ones.

• Convergence is ensured under very weak assumptions: irreducibility and aperiodicity.

• It is usually easy to establish that an MCMC sampler converges towards 𝜋(𝑥), but difficult to obtain rates of convergence.



The Gibbs Sampler
• The Gibbs sampler is a generic method to sample from a high-dimensional distribution.

• It generates a Markov chain which converges to the target distribution under weak assumptions: irreducibility and aperiodicity.



The Two Component Gibbs Sampler
• Consider the target distribution $\pi(\theta)$ such that $\theta = \{\theta^1, \theta^2\}$. The two-component Gibbs sampler proceeds as follows:

• Initialization:
  Select deterministically or randomly $\theta_0 = (\theta_0^1, \theta_0^2)$.

• Iteration $i$, $i \ge 1$:
  Sample $\theta_i^1 \sim \pi(\theta^1 \mid \theta_{i-1}^2)$
  Sample $\theta_i^2 \sim \pi(\theta^2 \mid \theta_i^1)$

• Sampling from the conditionals is often feasible even when sampling from the joint is impossible (e.g. in the nuclear pump data).



Invariant Distribution
• Clearly $(\theta_i^1, \theta_i^2)$ is a Markov chain. Its transition kernel is:

$$P\big((\theta^1,\theta^2),(\tilde\theta^1,\tilde\theta^2)\big) = \pi(\tilde\theta^1 \mid \theta^2)\,\pi(\tilde\theta^2 \mid \tilde\theta^1)$$

• The invariance equation $\int_{\mathcal{X}} \pi(x)P(x,y)\,dx = \pi(y)$ is satisfied:

$$\iint \pi(\theta^1,\theta^2)\,P\big((\theta^1,\theta^2),(\tilde\theta^1,\tilde\theta^2)\big)\,d\theta^1 d\theta^2
= \iint \pi(\theta^1,\theta^2)\,\pi(\tilde\theta^1\mid\theta^2)\,\pi(\tilde\theta^2\mid\tilde\theta^1)\,d\theta^1 d\theta^2$$
$$= \int \pi(\theta^2)\,\pi(\tilde\theta^1\mid\theta^2)\,\pi(\tilde\theta^2\mid\tilde\theta^1)\,d\theta^2
= \int \pi(\tilde\theta^1,\theta^2)\,\pi(\tilde\theta^2\mid\tilde\theta^1)\,d\theta^2
= \pi(\tilde\theta^1)\,\pi(\tilde\theta^2\mid\tilde\theta^1) = \pi(\tilde\theta^1,\tilde\theta^2)$$



Irreducibility
• That $\pi$ is the invariant distribution of $P$ (or that the detailed balance equation $\pi(\theta)P(\theta,\theta') = \pi(\theta')P(\theta',\theta)$ is satisfied) does not by itself ensure that the Gibbs sampler converges towards the invariant distribution.

• Additionally, it is required to ensure irreducibility: the Markov chain can move to any set $A$ such that $\pi(A) > 0$ from (almost) any starting point.

• This ensures that

$$\frac{1}{N}\sum_{n=1}^{N} f\big(\theta_n^1,\theta_n^2\big) \longrightarrow \iint f(\theta^1,\theta^2)\,\pi(\theta^1,\theta^2)\,d\theta^1 d\theta^2,$$

but not that asymptotically $(\theta_n^1,\theta_n^2) \sim \pi$.



Irreducibility
• A distribution whose support consists of two disconnected pieces (shown in a figure in the original slides) leads to a reducible Gibbs sampler.

• Conditioning on 𝑥₁ < 1, the conditional distribution of 𝑥₂ cannot produce a value in [1,2], so the chain never crosses between the two pieces.



Aperiodicity
• Consider an example with $\mathcal{X} = \{1, 2\}$ and transition probabilities $P(1,2) = P(2,1) = 1$. The invariant distribution is clearly given by $\pi(1) = \pi(2) = 1/2$.

• However, we know that if the chain starts at $X_0 = 1$, then $X_{2n} = 1$ and $X_{2n+1} = 2$ for any $n$.

• We still have

$$\frac{1}{N}\sum_{n=1}^{N} f(X_n) \longrightarrow \int f(x)\,\pi(x)\,dx,$$

but clearly $X_n$ is not distributed according to $\pi$ for any fixed $n$.

• You need to make sure that you do not explore the space in a periodic way, to ensure that $X_n \sim \pi$ asymptotically. (A small simulation of this two-state chain is sketched below.)
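A minimal Python sketch of this two-state example, illustrating the slide's point: the time average of f(X_n) converges to the correct value even though X_n itself keeps alternating deterministically and is never distributed according to 𝜋 at a fixed time.

```python
import numpy as np

# Deterministic two-state chain: P(1,2) = P(2,1) = 1, invariant pi(1) = pi(2) = 1/2.
N = 10_001
X = np.empty(N, dtype=int)
X[0] = 1
for n in range(1, N):
    X[n] = 2 if X[n - 1] == 1 else 1            # always jump to the other state

f = lambda x: x                                  # any test function f
print("time average of f:", f(X).mean())         # approx 1.5 = E_pi[f(X)]
print("X at the first few even times:", X[::2][:5])  # always 1: X_n is not ~ pi
```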



Gibbs Sampler
• If $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ where $p > 2$, the Gibbs sampler still applies.

• Initialization:
  Select deterministically or randomly $\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_p^0)$.

• Iteration $i$, $i \ge 1$:
  For $k = 1{:}p$
    Sample $\theta_k^i \sim \pi(\theta_k \mid \theta_{-k}^i)$,
    where $\theta_{-k}^i = \big(\theta_1^i, \ldots, \theta_{k-1}^i, \theta_{k+1}^{i-1}, \ldots, \theta_p^{i-1}\big)$.



Systematic-Scan Gibbs Sampler
• Systematic-scan Gibbs: let $\theta^i = (\theta_1^i, \theta_2^i, \ldots, \theta_p^i)$.

  Update $\theta_1^i$ from $\pi(\,\cdot \mid \theta_2^{i-1}, \ldots, \theta_p^{i-1})$
  Update $\theta_2^i$ from $\pi(\,\cdot \mid \theta_1^{i}, \theta_3^{i-1}, \ldots, \theta_p^{i-1})$
  ......
  Update $\theta_p^i$ from $\pi(\,\cdot \mid \theta_1^{i}, \theta_2^{i}, \ldots, \theta_{p-1}^{i})$



Random Scan Gibbs Sampler
• Consider again $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ where $p > 2$. We consider the following random-scan Gibbs sampler.

• Initialization:
  Select deterministically or randomly $\theta^0 = (\theta_1^0, \theta_2^0, \ldots, \theta_p^0)$.

• Iteration $i$, $i \ge 1$:
  Sample $K \sim \mathcal{U}\{1,2,\ldots,p\}$
  Set $\theta_{-K}^i = \theta_{-K}^{i-1}$
  Sample $\theta_K^i \sim \pi(\theta_K \mid \theta_{-K}^i)$,
  where $\theta_{-K}^i = \big(\theta_1^i, \ldots, \theta_{K-1}^i, \theta_{K+1}^i, \ldots, \theta_p^i\big)$.
Random-Scan Gibbs Sampler
• Random-scan Gibbs: let $\theta^i = (\theta_1^i, \theta_2^i, \ldots, \theta_p^i)$ at step (iteration) $i$.

  Draw $j$ from $\{1, \ldots, p\}$ with probability $w_j = 1/p$.
  Draw the new coordinate $j$, $\theta_j \mid \theta_{-j} \sim \pi(\,\cdot \mid \theta_{-j})$, and leave the remaining components unchanged; that is, set $\theta_{-j}^i = \theta_{-j}^{i-1}$.

(A generic code sketch of this random-scan update is given below.)
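A generic sketch of the random-scan update in Python, under the assumption that the user supplies one function per coordinate that samples from the corresponding full conditional. The function names and the illustrative bivariate-Gaussian conditionals below are not from the slides; they simply anticipate the example on the following slides.

```python
import numpy as np

def random_scan_gibbs(theta0, conditional_samplers, n_iter, rng=None):
    """Random-scan Gibbs: at each iteration pick one coordinate uniformly at random
    and resample it from its full conditional, leaving the others unchanged."""
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        j = rng.integers(theta.size)               # each coordinate has probability w_j = 1/p
        theta[j] = conditional_samplers[j](theta, rng)
        chain[i] = theta
    return chain

# Illustration: bivariate Gaussian with correlation rho (full conditionals known in closed form).
rho = 0.5
samplers = [
    lambda th, rng: rng.normal(rho * th[1], np.sqrt(1 - rho**2)),  # x1 | x2
    lambda th, rng: rng.normal(rho * th[0], np.sqrt(1 - rho**2)),  # x2 | x1
]
chain = random_scan_gibbs([-3.0, -3.0], samplers, n_iter=10_000)
print(chain[1000:].mean(axis=0), np.corrcoef(chain[1000:].T)[0, 1])
```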



Gibbs Sampler: Example
• Consider the following bivariate target distribution:

$$\pi(x_1,x_2) = \mathcal{N}\!\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\right) \propto \exp\!\left(-\tfrac{1}{2}\,(x_1\ x_2)\,\frac{1}{1-\rho^2}\begin{pmatrix}1&-\rho\\ -\rho&1\end{pmatrix}\begin{pmatrix}x_1\\ x_2\end{pmatrix}\right)$$

• The marginal distribution is given as:

$$\pi(x_2) \propto \exp\!\left(-\tfrac{1}{2}x_2^2\right)$$

• A systematic-scan Gibbs sampler (see a C++ implementation; a Python sketch is also given below) is generated with the following conditionals:

$$x_1^{t+1} \mid x_2^{t} \sim \mathcal{N}\!\left(\rho\, x_2^{t},\ 1-\rho^2\right), \qquad x_2^{t+1} \mid x_1^{t+1} \sim \mathcal{N}\!\left(\rho\, x_1^{t+1},\ 1-\rho^2\right)$$



Gibbs Sampler: Example
• Set 𝜌 = 0.5, number of iterations 10,000, and initial state (𝑥₁, 𝑥₂) = (−3, −3).

(Figure, left: histogram of 𝑥₁, whose exact marginal pdf is the standard Gaussian; right: 𝑥₁ vs. 𝑥₂ scatter plot. C++ programs are referenced in the original slides.)


Gibbs Sampler: Example
• Set 𝜌 = 0.999, number of iterations 10,000, and initial state (𝑥₁, 𝑥₂) = (−3, −3).

• We can see that in this case of highly correlated variables the sampling process is inaccurate: the chain moves in very small steps and explores the target slowly.

(Figure, left: histogram of 𝑥₁, whose exact marginal pdf is the standard Gaussian; right: 𝑥₁ vs. 𝑥₂ scatter plot.)


Convergence of the Gibbs Sampler
• Even when irreducibility and aperiodicity are ensured, the Gibbs sampler can still converge very slowly.

• Consider the target bivariate Gaussian distribution

$$\mathcal{N}\!\left(0,\ \begin{pmatrix} a & b \\ b & a \end{pmatrix}\right)$$

• A systematic-scan Gibbs sampler is generated as

$$x_1^{t+1} \mid x_2^{t} \sim \mathcal{N}\!\left(\frac{b}{a}\,x_2^{t},\ a-\frac{b^2}{a}\right), \qquad
x_2^{t+1} \mid x_1^{t+1} \sim \mathcal{N}\!\left(\frac{b}{a}\,x_1^{t+1},\ a-\frac{b^2}{a}\right)$$

• In this example, we set

$$\mathcal{N}\!\left(0,\ \begin{pmatrix} 100 & 99 \\ 99 & 100 \end{pmatrix}\right)$$

(A short sketch quantifying the slow mixing for this target is given below.)
Convergence of the Gibbs Sampler
• The Gibbs sampling path and the equiprobability curves of the target are plotted below.

(Figure: zig-zag Gibbs sampling path over the elongated equiprobability contours of the highly correlated Gaussian; each move is parallel to a coordinate axis. A C++ implementation is referenced in the original slides.)


MatLab Implementation: Gibbs Sampler
• Consider the Gaussian $\pi(x_1, x_2) = \mathcal{N}(\mu, C)$. Following the conditionals shown earlier, it can be shown that the Gibbs sampler can proceed as follows:

$$x_1^{(t+1)} = \mu_1 - \frac{C^{-1}_{12}}{C^{-1}_{11}}\big(x_2^{(t)}-\mu_2\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{11}}}, \qquad
x_2^{(t+1)} = \mu_2 - \frac{C^{-1}_{12}}{C^{-1}_{22}}\big(x_1^{(t+1)}-\mu_1\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{22}}}, \qquad \texttt{randn} \sim \mathcal{N}(0,1),$$

where $C^{-1}_{jk}$ denotes the $(j,k)$ entry of the precision matrix $C^{-1}$.

(Figure: Gibbs samples overlaid on the contours of the target Gaussian. A MatLab implementation is referenced in the original slides.)


MatLab Implementation: Gibbs Sampler
• Consider the Gaussian $\pi(x_1, x_2) = \mathcal{N}(\mu, C)$. Following the conditionals shown earlier, it can be shown that the Gibbs sampler can proceed as follows:

$$x_1^{(t+1)} = \mu_1 - \frac{C^{-1}_{12}}{C^{-1}_{11}}\big(x_2^{(t)}-\mu_2\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{11}}}, \qquad
x_2^{(t+1)} = \mu_2 - \frac{C^{-1}_{12}}{C^{-1}_{22}}\big(x_1^{(t+1)}-\mu_1\big) + \frac{\texttt{randn}}{\sqrt{C^{-1}_{22}}}, \qquad \texttt{randn} \sim \mathcal{N}(0,1)$$

(Figure: Gibbs samples overlaid on the contours of the target Gaussian. Another MatLab implementation, with movie-frame animation, is referenced in the original slides.)


Gibbs Sampler for Mixture of Gaussians

(Figure: 3-D plot of the estimated density for a mixture of Gaussians as a function of x and of the Gibbs iteration, up to 1000 iterations. A MatLab implementation is referenced in the original slides.)


Gibbs Sampler for Mixture of Gaussians

(Figure: 3-D plot of the estimated density for a mixture of Gaussians as a function of x and of the Gibbs iteration, up to 1000 iterations. A MatLab implementation, which works like a movie-frame animation, is referenced in the original slides.)


Gibbs Sampler: Example
• Consider the following target distribution on $\{0,1,\ldots,n\}\times[0,1]$:

$$\pi(x_1,x_2) \propto \binom{n}{x_1}\, x_2^{\,x_1+\alpha-1}\,(1-x_2)^{\,n-x_1+\beta-1}$$

• The two conditional distributions for the Gibbs sampler are

$$x_1 \mid x_2 \sim \text{Binom}(n, x_2), \qquad x_2 \mid x_1 \sim \mathcal{B}e(x_1+\alpha,\ n-x_1+\beta)$$

• We set $n = 20$, $\alpha = \beta = 0.5$, initial state $(0,0)$, and 10,000 iterations.

• See here for a C++ implementation and a MatLab implementation; a short Python sketch also follows below.

(Figure, left: histogram of 𝑥₂, whose exact marginal pdf is a Beta distribution; right: histogram of 𝑥₁.)
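A minimal Python sketch of this sampler (the course links to C++ and MatLab code; this is an equivalent of my own, with n, α, β, the initial state, and the iteration count taken from the slide):

```python
import numpy as np

n, alpha, beta_, n_iter = 20, 0.5, 0.5, 10_000
rng = np.random.default_rng(4)
x1, x2 = 0, 0.0                  # initial state (0, 0) as on the slide
X1 = np.empty(n_iter, dtype=int)
X2 = np.empty(n_iter)

for t in range(n_iter):
    x1 = rng.binomial(n, x2)                    # x1 | x2 ~ Binom(n, x2)
    x2 = rng.beta(x1 + alpha, n - x1 + beta_)   # x2 | x1 ~ Be(x1 + alpha, n - x1 + beta)
    X1[t], X2[t] = x1, x2

# The marginal of x2 is Beta(alpha, beta); compare empirical and exact means.
print("empirical mean of x2:", X2.mean(), " exact Beta mean:", alpha / (alpha + beta_))
```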


Gibbs Sampler: Example
• Consider a likelihood defined by the Cauchy distribution 𝒞(𝜇, 1) with two measurements 𝑥₁, 𝑥₂:

$$\ell(\mu \mid \mathcal{D}_n) = \prod_{i=1}^{n=2} f_\mu(x_i) \propto \frac{1}{\big[1+(x_1-\mu)^2\big]\big[1+(x_2-\mu)^2\big]}$$

• We take as prior a normal distribution $\mu \sim \mathcal{N}(0, 10)$.

• This leads to a posterior of the form:

$$\pi(\mu \mid \mathcal{D}) \propto \frac{e^{-\mu^2/20}}{\big[1+(x_1-\mu)^2\big]\big[1+(x_2-\mu)^2\big]}$$

• How do we use the Gibbs sampler to sample from this univariate distribution?



Gibbs Sampler: Example
2

20
e
 ( | D ) ~
1  ( x   ) 1  ( x
1
2
2   ) 2

 We can use Gibbs sampler by noticing:

1 i 1 ( xi   )2 
 e  
di
1  ( xi   ) 2
0

 We can then think 𝜋(𝜇|𝒟) as the marginal of 𝜋(𝜇, 𝜔1 , 𝜔2 |𝒟)


2 2
 i 1 ( xi   )2 
 (  , 1 , 2 | D ) ~ e 20
e
i 1
 

 The Gibbs sampler is based on the following 2 steps:


 Generate 𝜇 (𝑡) ~π 𝜇|𝜔 𝑡−1 ,𝒟

 Generate 𝜔 (𝑡) ~π 𝜔|𝜇 𝑡 ,𝒟


Gibbs Sampler: Example
• The step $\mu^{(t)} \sim \pi(\mu \mid \omega^{(t-1)}, \mathcal{D})$ is straightforward, since

$$\pi(\mu \mid \omega, \mathcal{D}) = \mathcal{N}\!\left(\frac{\sum_i \omega_i x_i}{\sum_i \omega_i + 1/20},\ \frac{1}{2\left(\sum_i \omega_i + 1/20\right)}\right)$$

• The step $\omega^{(t)} \sim \pi(\omega \mid \mu^{(t)}, \mathcal{D})$ is also straightforward:

$$\pi\big(\omega_i \mid \mu^{(t)}, \mathcal{D}\big) = \mathcal{E}xp\!\left(1+(x_i-\mu^{(t)})^2\right)$$

• A MatLab implementation can be found here; a brief Python sketch follows below.
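A minimal Python sketch of this data-augmentation Gibbs sampler, assuming two illustrative observations x = (x₁, x₂) (the actual measurement values are not given on the slides) and the 𝒩(0, 10) prior stated above:

```python
import numpy as np

x = np.array([3.0, -3.0])         # illustrative measurements (not from the slides)
prior_var = 10.0                  # mu ~ N(0, 10), so 1/(2*prior_var) = 1/20 below
n_iter = 10_000
rng = np.random.default_rng(5)

omega = np.ones_like(x)           # initial latent variables
mu_chain = np.empty(n_iter)

for t in range(n_iter):
    # mu | omega, D ~ N( sum(omega*x) / (sum(omega) + 1/20),  1 / (2*(sum(omega) + 1/20)) )
    denom = omega.sum() + 1.0 / (2.0 * prior_var)
    mu = rng.normal(np.dot(omega, x) / denom, np.sqrt(1.0 / (2.0 * denom)))
    # omega_i | mu, D ~ Exp(rate = 1 + (x_i - mu)^2); numpy uses the scale = 1/rate convention
    omega = rng.exponential(scale=1.0 / (1.0 + (x - mu) ** 2))
    mu_chain[t] = mu

print("posterior mean of mu:", mu_chain[1000:].mean())
```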



Gibbs sampler: Example
• On the left, the last 100 iterations of the chain 𝜇(𝑡); on the right, the histogram of the chain 𝜇(𝑡) and a comparison with the target density, for 10,000 iterations.

(Figure as described above. A MatLab implementation is referenced in the original slides.)


Block and Metropolized Gibbs
• Instead of updating single coordinates 𝑥ⱼ, one can update blocks 𝒙_𝐴. This is more efficient but requires knowing the block conditionals 𝜋(𝒙_𝐴 | 𝒙_−𝐴) and being able to sample from them.

• Combinations of Gibbs and Metropolis-Hastings (to be discussed in a follow-up lecture) are popular.

• In Metropolized Gibbs, for example, some coordinates are updated from their conditionals and others using arbitrary proposals, as in Metropolis-Hastings.

• Each individual transition kernel in Gibbs (which updates a single coordinate) is neither irreducible nor aperiodic. However, their combination (random or systematic scan) might be!



Gibbs Sampling
• Consider a target 𝜋(𝑥₁, 𝑥₂) (e.g. a uniform distribution) with disconnected support, as in the figure (two disks). Conditioning on 𝑥₁ < 0, the distribution of 𝑥₂ cannot produce a value in [0,1], so a coordinate-wise Gibbs sampler never jumps between the two disks.

• You can make this type of problem work by introducing a proper coordinate transformation, e.g.

$$y_1 = x_1 + x_2, \qquad y_2 = x_2 - x_1$$

• Conditioning now on 𝑦₁ produces a uniform distribution on the union of a negative and a positive interval. Therefore, one iteration of the Gibbs sampler is sufficient to jump from one disk to the other one.

(Figure: the two disconnected disks in the (𝑥₁, 𝑥₂) plane.)



Gibbs Sampler: Recommendation
• Have as few blocks as possible.

• Put the most correlated variables in the same block. If necessary, reparametrize the model to achieve this.

• Integrate out analytically as many variables as possible.

• There is no general strategy that will work for all problems.



Bayesian Variable Selection in Regression
• We select the following regression model:

$$Y = \sum_{k=1}^{p} \beta_k X_k + \sigma V, \qquad V \sim \mathcal{N}(0,1),$$

where we assume the prior $\sigma^2 \sim \mathcal{IG}\!\left(\frac{\nu_0}{2}, \frac{\gamma_0}{2}\right)$ and, for $\alpha^2 \ll 1$,

$$\beta_k \sim \tfrac{1}{2}\,\mathcal{N}(0,\ \sigma^2\alpha^2\delta^2) + \tfrac{1}{2}\,\mathcal{N}(0,\ \sigma^2\delta^2)$$

• We introduce a latent variable $\gamma_k \in \{0,1\}$ such that:

$$\Pr(\gamma_k = 0) = \Pr(\gamma_k = 1) = \tfrac{1}{2}, \qquad
\beta_k \mid \gamma_k = 0 \sim \mathcal{N}(0,\ \sigma^2\alpha^2\delta^2), \qquad \beta_k \mid \gamma_k = 1 \sim \mathcal{N}(0,\ \sigma^2\delta^2)$$



A Bad Gibbs Sampler
• We have parameters $\beta_{1:p}, \gamma_{1:p}, \sigma^2$ and observe $\mathcal{D} = \{x_i, y_i\}_{i=1}^n$.

• A potential Gibbs sampler consists of sampling iteratively from

$$p(\beta_{1:p} \mid \mathcal{D}, \gamma_{1:p}, \sigma^2)\ \text{(Gaussian)}, \quad p(\sigma^2 \mid \mathcal{D}, \gamma_{1:p}, \beta_{1:p})\ \text{(inverse Gamma)}, \quad \text{and} \quad p(\gamma_{1:p} \mid \mathcal{D}, \beta_{1:p}, \sigma^2).$$

• In particular, $p(\gamma_{1:p} \mid \mathcal{D}, \beta_{1:p}, \sigma^2) = \prod_{k=1}^{p} p(\gamma_k \mid \mathcal{D}, \beta_k, \sigma^2)$ and, using the prior models,

$$p(\gamma_k = 1 \mid \beta_k, \sigma^2) = \frac{\dfrac{1}{\sqrt{2\pi\sigma^2\delta^2}}\exp\!\left(-\dfrac{\beta_k^2}{2\sigma^2\delta^2}\right)}
{\dfrac{1}{\sqrt{2\pi\sigma^2\delta^2}}\exp\!\left(-\dfrac{\beta_k^2}{2\sigma^2\delta^2}\right) + \dfrac{1}{\sqrt{2\pi\sigma^2\delta^2\alpha^2}}\exp\!\left(-\dfrac{\beta_k^2}{2\sigma^2\delta^2\alpha^2}\right)}$$

• The same result can be shown for $p(\gamma_k \mid \mathcal{D}, \beta_k, \sigma^2)$. The Gibbs sampler becomes reducible as $\alpha$ goes to zero: for very small $\alpha$, a coefficient currently assigned to the small-variance component essentially never escapes it.



Bayes Variable Selection
• This is the result of bad modeling. We consider $\alpha \simeq 0$ and write:

$$Y = \sum_{k=1}^{p} \gamma_k\,\beta_k X_k + \sigma V, \qquad V \sim \mathcal{N}(0,1)$$

• Here $\gamma_k = 1$ if $X_k$ is included and $\gamma_k = 0$ otherwise.

• However, this still suggests that $\beta_k$ is defined even when $\gamma_k = 0$.



Bayes Variable Selection
• A neater way to write such models is

$$Y = \sum_{k:\,\gamma_k=1} \beta_k X_k + \sigma V = \beta_\gamma^{T} X_\gamma + \sigma V, \qquad V \sim \mathcal{N}(0,1),$$

where, for a vector $\gamma = (\gamma_1, \ldots, \gamma_p)$,

$$\beta_\gamma = \{\beta_k : \gamma_k = 1\}, \qquad X_\gamma = \{X_k : \gamma_k = 1\}, \qquad n_\gamma = \sum_{k=1}^{p} \gamma_k.$$

• Prior distributions:

$$\pi(\beta_\gamma, \sigma^2 \mid \gamma) = \mathcal{N}\!\left(\beta_\gamma;\ 0,\ \sigma^2\delta^2 I_{n_\gamma}\right)\,\mathcal{IG}\!\left(\sigma^2;\ \frac{\nu_0}{2}, \frac{\gamma_0}{2}\right), \qquad
\pi(\gamma) = \prod_{k=1}^{p}\pi(\gamma_k) = 2^{-p}.$$



Bayes Variable Selection
• We are interested in sampling from the trans-dimensional distribution $\pi(\gamma, \beta_\gamma, \sigma^2 \mid \mathcal{D})$.

• However, we know that

$$\pi(\gamma, \beta_\gamma, \sigma^2 \mid \mathcal{D}) = \pi(\gamma \mid \mathcal{D})\,\pi(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma),$$

where $\pi(\gamma \mid \mathcal{D}) \propto \pi(\mathcal{D} \mid \gamma)\,\pi(\gamma)$ and (using an earlier lecture result on the evidence for this regression model)

$$\pi(\mathcal{D} \mid \gamma) = \iint \pi(\mathcal{D}, \beta_\gamma, \sigma^2 \mid \gamma)\, d\beta_\gamma\, d\sigma^2 \;\propto\; |\Sigma_\gamma|^{1/2}\,\delta^{-n_\gamma}\left(\frac{\gamma_0 + \sum_{i=1}^{n} y_i^2 - \mu_\gamma^{T}\Sigma_\gamma^{-1}\mu_\gamma}{2}\right)^{-\frac{\nu_0+n}{2}},$$

with

$$\mu_\gamma = \Sigma_\gamma\left(\sum_{i=1}^{n} y_i\, x_{\gamma,i}\right), \qquad \Sigma_\gamma^{-1} = \frac{I_{n_\gamma}}{\delta^2} + \sum_{i=1}^{n} x_{\gamma,i}\, x_{\gamma,i}^{T}.$$



Bayes Variable Selection
• The full conditional distribution $\pi(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma)$ is

$$\pi(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma) = \mathcal{N}\!\left(\beta_\gamma;\ \mu_\gamma,\ \sigma^2\Sigma_\gamma\right)\,
\mathcal{IG}\!\left(\sigma^2;\ \frac{\nu_0+n}{2},\ \frac{\gamma_0 + \sum_{i=1}^{n} y_i^2 - \mu_\gamma^{T}\Sigma_\gamma^{-1}\mu_\gamma}{2}\right),$$

where

$$\mu_\gamma = \Sigma_\gamma\left(\sum_{i=1}^{n} y_i\, x_{\gamma,i}\right), \qquad \Sigma_\gamma^{-1} = \frac{I_{n_\gamma}}{\delta^2} + \sum_{i=1}^{n} x_{\gamma,i}\, x_{\gamma,i}^{T}.$$

• The derivation of this conditional was already given in an earlier lecture.



Bayes Variable Selection
• Popular alternative prior models for $\gamma_k$ include

$$\gamma_k \sim \mathcal{B}er(\varpi), \ \text{ where } \varpi \sim \mathcal{U}[0,1]; \qquad
\gamma_k \sim \mathcal{B}er(\varpi_k), \ \text{ where } \varpi_k \sim \mathcal{B}e(\alpha, \beta);$$

and the g-prior (Zellner):

$$\beta_\gamma \mid \gamma, \sigma^2 \sim \mathcal{N}\!\left(\beta_\gamma;\ 0,\ \delta^2\sigma^2 \big(X_\gamma^{T} X_\gamma\big)^{-1}\right),$$

where here, for robustness, we additionally use

$$\delta^2 \sim \mathcal{IG}\!\left(\frac{a_0}{2}, \frac{b_0}{2}\right).$$

• Such variations in the priors are very important and can affect the performance of the Bayesian model.



Bayes Variable Selection: Algorithm
• $\pi(\gamma \mid \mathcal{D})$ is a discrete probability distribution with $2^p$ potential values. We assume $\delta^2$ is known here.

• We can use the Gibbs sampler to sample from it (a code sketch follows below).

• Initialization:
  Select deterministically or randomly $\gamma^0 = (\gamma_1^0, \ldots, \gamma_p^0)$.

• Iteration $i$, $i \ge 1$:
  For $k = 1{:}p$
    Sample $\gamma_k^i \sim \pi(\gamma_k \mid \mathcal{D}, \gamma_{-k}^i)$,
    where $\gamma_{-k}^i = \big(\gamma_1^i, \ldots, \gamma_{k-1}^i, \gamma_{k+1}^{i-1}, \ldots, \gamma_p^{i-1}\big)$.

  Optional step: sample $\big(\beta_\gamma^i, \sigma^{2(i)}\big) \sim \pi\big(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma^i\big)$.
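To make the algorithm concrete, here is a minimal Python sketch (my own, not the course code). It uses the evidence formula from the previous slides, treats δ² as known, and takes π(γ) uniform; the function names, the hyperparameter values (δ² = 100, ν₀ = γ₀ = 1), the seed, and the toy data are illustrative assumptions.

```python
import numpy as np

def log_marginal(gamma, X, y, delta2, nu0, gam0):
    """log pi(D | gamma) up to an additive constant independent of gamma,
    using the evidence formula quoted on the previous slides."""
    n = y.size
    idx = np.flatnonzero(gamma)
    n_g = idx.size
    yy = y @ y
    if n_g == 0:
        half_logdet, quad = 0.0, 0.0
    else:
        Xg = X[:, idx]
        Sigma_inv = np.eye(n_g) / delta2 + Xg.T @ Xg
        mu = np.linalg.solve(Sigma_inv, Xg.T @ y)
        quad = mu @ Sigma_inv @ mu
        half_logdet = -0.5 * np.linalg.slogdet(Sigma_inv)[1]   # log |Sigma_gamma|^(1/2)
    return (half_logdet - 0.5 * n_g * np.log(delta2)
            - 0.5 * (nu0 + n) * np.log(0.5 * (gam0 + yy - quad)))

def gibbs_variable_selection(X, y, n_iter, delta2=100.0, nu0=1.0, gam0=1.0, seed=0):
    """Systematic-scan Gibbs over gamma only (delta^2 known, pi(gamma) uniform)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    gamma = np.zeros(p, dtype=int)
    chain = np.empty((n_iter, p), dtype=int)
    for i in range(n_iter):
        for k in range(p):
            gamma[k] = 1
            l1 = log_marginal(gamma, X, y, delta2, nu0, gam0)
            gamma[k] = 0
            l0 = log_marginal(gamma, X, y, delta2, nu0, gam0)
            # pi(gamma_k = 1 | D, gamma_{-k}) = m1 / (m0 + m1), computed stably in log space
            p1 = np.exp(l1 - np.logaddexp(l0, l1))
            gamma[k] = int(rng.random() < p1)
        chain[i] = gamma
    return chain

# Toy usage: only the first two predictors enter the data-generating model.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=100)
chain = gibbs_variable_selection(X, y, n_iter=2000)
print("posterior inclusion probabilities:", chain[500:].mean(axis=0))
```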



Bayes Variable Selection: Algorithm
• Consider now the case where $\delta^2$ is unknown.

• Initialization:
  Select deterministically or randomly $\big(\gamma^0, \beta_\gamma^0, \sigma^{2(0)}, \delta^{2(0)}\big)$.

• Iteration $i$, $i \ge 1$:
  For $k = 1{:}p$
    Sample $\gamma_k^i \sim \pi\big(\gamma_k \mid \mathcal{D}, \gamma_{-k}^i, \delta^{2(i-1)}\big)$,
    where $\gamma_{-k}^i = \big(\gamma_1^i, \ldots, \gamma_{k-1}^i, \gamma_{k+1}^{i-1}, \ldots, \gamma_p^{i-1}\big)$.

  Sample $\big(\beta_\gamma^i, \sigma^{2(i)}\big) \sim \pi\big(\beta_\gamma, \sigma^2 \mid \mathcal{D}, \gamma^i, \delta^{2(i-1)}\big)$.

  Sample $\delta^{2(i)} \sim \pi\big(\delta^2 \mid \gamma^i, \beta_\gamma^i, \sigma^{2(i)}\big)$.



Bayesian Variable Selection
• This very simple sampler is much more efficient than the ones where 𝛾 is sampled conditional upon (𝛽, 𝜎²).

• However, it mixes very slowly because the components are updated one at a time.

• Updating correlated components together would increase significantly the convergence speed of the algorithm, at the cost of increased complexity.



Bayesian Variable Selection Example
• Top five most likely models for the selection models discussed (table not reproduced here).

Results from: Statistical Computing and MC Methods, A. Doucet.
