Lecture 19

Gibbs Sampling Examples, Some Aspects of MCMC


CS698X: Topics in Probabilistic Modeling and Inference
Piyush Rai
Recap: Gibbs Sampling
▪ An instance of MH sampling where the acceptance probability = 1
▪ Based on sampling 𝒛 one “component” at a time, with the proposal being the corresponding conditional distribution
Note: In practice, we won’t use all the 𝐿 samples to approximate the target distribution 𝑝(𝒛), since there will be a burn-in phase and thinning as well.
Denoting the collected samples 𝒛(ℓ) by 𝒛(1), 𝒛(2), …, 𝒛(𝑆), the posterior approximation will be the empirical distribution defined by these samples.

▪ Very easy to derive if the conditional distributions are easy to obtain
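
As a concrete illustration of the burn-in and thinning step mentioned in the note above, here is a minimal sketch (the burn-in length and thinning gap are arbitrary illustrative choices, not recommendations from the lecture):

```python
import numpy as np

def collect_samples(raw_samples, burn_in=1000, thin=10):
    """Discard the burn-in phase and thin the remaining MCMC samples.

    raw_samples: array of shape (L, D) holding all L Gibbs/MH iterates of z.
    Returns the S retained samples z^(1), ..., z^(S) that define the
    empirical approximation of the target p(z).
    """
    return raw_samples[burn_in::thin]        # keep every `thin`-th sample after burn-in

def monte_carlo_expectation(f, samples):
    """Approximate E[f(z)] under p(z) by averaging f over the retained samples."""
    return np.mean([f(z) for z in samples], axis=0)
```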


Deriving A Gibbs Sampler: The General Recipe
▪ Suppose the target is an intractable posterior 𝑝(𝒁|𝑿) where 𝒁 = [𝒛1 , 𝒛2 , … , 𝒛𝑀 ]
▪ Gibbs sampling requires the conditional posteriors 𝑝(𝒛𝑚 |𝒁−𝑚 , 𝑿)
▪ In general, 𝑝(𝒛𝑚 |𝒁−𝑚 , 𝑿) ∝ 𝑝(𝒛𝑚 ) 𝑝(𝑿|𝒛𝑚 , 𝒁−𝑚 ), where 𝒁−𝑚 is assumed “known”

▪ If 𝑝(𝒛𝑚 ) and 𝑝(𝑿|𝒛𝑚 , 𝒁−𝑚 ) are conjugate, the above CP is straightforward to obtain


▪ Another way to get each CP 𝑝(𝒛𝑚 |𝒁−𝑚 , 𝑿) is the following:
▪ Write down the expression of 𝑝(𝑿, 𝒁)
▪ Only the terms that contain 𝒛𝑚 are needed to get the CP of 𝒛𝑚 (up to a proportionality constant)
▪ In 𝑝(𝒛𝑚 |𝒁−𝑚 , 𝑿), we only need to condition on the terms in the Markov Blanket of 𝒛𝑚
▪ Markov Blanket of a variable: its parents, children, and the other parents of its children
▪ Very useful in deriving CP
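
To make the recipe concrete, here is a minimal sketch of a Gibbs sampler for a toy target: a zero-mean bivariate Gaussian with correlation ρ, for which both conditionals are univariate Gaussians. The target, variable names, and iteration count are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho=0.8, num_iters=5000, seed=0):
    """Gibbs sampling for p(z1, z2) = N([0, 0], [[1, rho], [rho, 1]]).

    Each step samples one component from its conditional given the other:
        z1 | z2 ~ N(rho * z2, 1 - rho**2)
        z2 | z1 ~ N(rho * z1, 1 - rho**2)
    Every move is accepted (MH acceptance probability = 1).
    """
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0                                   # arbitrary initialization
    samples = np.empty((num_iters, 2))
    for t in range(num_iters):
        z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))  # sample z1 | z2
        z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))  # sample z2 | z1
        samples[t] = (z1, z2)
    return samples

samples = gibbs_bivariate_gaussian()
print(np.corrcoef(samples[1000:].T))   # empirical correlation ~ rho after burn-in
```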
Gibbs Sampling: An Example
▪ The CPs for the Gibbs sampler for a GMM are as shown in green rectangles below

(Figure on slide: the joint distribution of data and unknowns, with the CP of each unknown shown in a green rectangle. One can verify that the Markov Blanket property holds for each CP.)
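
The CPs themselves are in the slide's figure and are not reproduced here. As a rough illustration only, the sketch below shows what one Gibbs sweep could look like for a heavily simplified GMM: known shared variance σ², uniform mixing proportions, and a 𝒩(0, τ²) prior on each cluster mean. These simplifications and all variable names are assumptions made for the example; the lecture's GMM may treat more quantities as unknown.

```python
import numpy as np

def gibbs_sweep_gmm(x, z, mu, sigma2=1.0, tau2=10.0, rng=np.random.default_rng()):
    """One Gibbs sweep for a simplified GMM with K clusters.

    x : (N,) data, z : (N,) integer cluster assignments, mu : (K,) cluster means.
    CP of z_n:  p(z_n = k | rest) ∝ N(x_n | mu_k, sigma2)   (uniform mixing weights)
    CP of mu_k: Gaussian, by conjugacy of the N(0, tau2) prior with the likelihood.
    """
    N, K = len(x), len(mu)
    # Sample each assignment from its conditional posterior
    for n in range(N):
        logp = -0.5 * (x[n] - mu) ** 2 / sigma2        # log N(x_n | mu_k, sigma2), up to a constant
        p = np.exp(logp - logp.max())
        z[n] = rng.choice(K, p=p / p.sum())
    # Sample each cluster mean from its conditional posterior
    for k in range(K):
        xk = x[z == k]
        prec = 1.0 / tau2 + len(xk) / sigma2           # posterior precision
        mean = (xk.sum() / sigma2) / prec              # posterior mean (prior mean is 0)
        mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))
    return z, mu
```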

Gibbs Sampling: Another Example
Joint distribution of data and unknowns:

𝑝(𝒀, {𝒘𝑗}_{𝑗=1}^{𝐽}, 𝝁𝑤, 𝚺𝑤, 𝜎² | 𝑿)
  = ∏_{𝑗=1}^{𝐽} ∏_{𝑖=1}^{𝑁𝑗} 𝑝(𝑦𝑖𝑗 |𝒙𝑖𝑗 , 𝒘𝑗 , 𝜎²) 𝑝(𝒘𝑗 |𝝁𝑤 , 𝚺𝑤 ) · 𝑝(𝝁𝑤 ) 𝑝(𝚺𝑤 ) 𝑝(𝜎²)
  = ∏_{𝑗=1}^{𝐽} ∏_{𝑖=1}^{𝑁𝑗} 𝒩(𝑦𝑖𝑗 |𝒘𝑗⊤𝒙𝑖𝑗 , 𝜎²) 𝒩(𝒘𝑗 |𝝁𝑤 , 𝚺𝑤 ) · 𝒩(𝝁𝑤 |𝝁0 , 𝐕0 ) IW(𝚺𝑤 |𝜂0 , 𝐒0⁻¹) IG(𝜎²|𝜈0 /2, 𝜈0 𝜎0²/2)

(One can verify that the Markov Blanket property holds for each CP.)
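
For instance, the CP of each 𝒘𝑗 follows from standard Gaussian-Gaussian conjugacy. Below is a minimal sketch of that single update; the closed-form expressions are the standard ones implied by conjugacy, and the function and variable names are assumptions for illustration:

```python
import numpy as np

def sample_w_j(X_j, y_j, mu_w, Sigma_w, sigma2, rng=np.random.default_rng()):
    """CP of w_j in the grouped linear regression model above:
       w_j | rest ~ N(m_j, S_j), with
       S_j = (Sigma_w^{-1} + X_j^T X_j / sigma2)^{-1}
       m_j = S_j (Sigma_w^{-1} mu_w + X_j^T y_j / sigma2)
    X_j : (N_j, D) inputs of group j, y_j : (N_j,) responses of group j.
    """
    Sigma_w_inv = np.linalg.inv(Sigma_w)
    S_j = np.linalg.inv(Sigma_w_inv + X_j.T @ X_j / sigma2)   # posterior covariance
    m_j = S_j @ (Sigma_w_inv @ mu_w + X_j.T @ y_j / sigma2)   # posterior mean
    return rng.multivariate_normal(m_j, S_j)
```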

Gibbs Sampling: Another Example
Bayesian Matrix Factorization: joint distribution of data and unknowns (assuming even the hyperparameters are unknown):

𝑝(𝑹, {𝒖𝑖}_{𝑖=1}^{𝑁}, {𝒗𝑗}_{𝑗=1}^{𝑀}, 𝜆𝑢, 𝜆𝑣, 𝛽)
  = ∏_{(𝑖,𝑗)∈Ω} 𝑝(𝑟𝑖𝑗 |𝒖𝑖 , 𝒗𝑗 , 𝛽) ∏_𝑖 𝑝(𝒖𝑖 |𝜆𝑢 ) ∏_𝑗 𝑝(𝒗𝑗 |𝜆𝑣 ) · 𝑝(𝜆𝑢 ) 𝑝(𝜆𝑣 ) 𝑝(𝛽)
  = ∏_{(𝑖,𝑗)∈Ω} 𝒩(𝑟𝑖𝑗 |𝒖𝑖⊤𝒗𝑗 , 𝛽⁻¹) ∏_𝑖 𝒩(𝒖𝑖 |0, 𝜆𝑢⁻¹𝐈) ∏_𝑗 𝒩(𝒗𝑗 |0, 𝜆𝑣⁻¹𝐈) · Gamma(𝜆𝑢 |𝑎, 𝑏) Gamma(𝜆𝑣 |𝑐, 𝑑) Gamma(𝛽|𝑒, 𝑓)

Here Ω denotes the set of indices that are observed in the ratings matrix. Can also use a non-zero mean and a full covariance matrix for 𝒖𝑖 , 𝒗𝑗 , with Gaussian and Wishart priors respectively*. One can verify that the Markov Blanket property holds for each CP. For example, the CPs of the hyperparameters are:

𝑝(𝜆𝑢 |𝐔) = Gamma(𝜆𝑢 | 𝑎 + 0.5·𝑁𝐾, 𝑏 + 0.5·∑_{𝑖=1}^{𝑁} 𝒖𝑖⊤𝒖𝑖 )
𝑝(𝜆𝑣 |𝐕) = Gamma(𝜆𝑣 | 𝑐 + 0.5·𝑀𝐾, 𝑑 + 0.5·∑_{𝑗=1}^{𝑀} 𝒗𝑗⊤𝒗𝑗 )
𝑝(𝛽|𝐑, 𝐔, 𝐕) = Gamma(𝛽 | 𝑒 + 0.5·|Ω|, 𝑓 + 0.5·∑_{(𝑖,𝑗)∈Ω} (𝑟𝑖𝑗 − 𝒖𝑖⊤𝒗𝑗 )²)
*Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo (Salakhutdinov and Mnih, 2008)
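
As a sketch, the Gamma CPs above translate directly into code; the update for each 𝒖𝑖 (and, symmetrically, each 𝒗𝑗) follows from standard linear-Gaussian conjugacy and is included for completeness even though its formula is not in the extracted text. Note that NumPy's gamma sampler is parameterized by shape and scale, so the rate parameters 𝑏, 𝑑, 𝑓 enter as 1/rate. All function and variable names are illustrative assumptions.

```python
import numpy as np

def gibbs_sweep_bpmf_hyperparams(R_obs, U, V, a, b, c, d, e, f, rng=np.random.default_rng()):
    """One Gibbs sweep for lambda_u, lambda_v, beta in Bayesian matrix factorization.

    R_obs : dict mapping observed (i, j) -> rating r_ij  (the set Omega)
    U : (N, K) user factors, V : (M, K) item factors.
    Implements the Gamma conditionals from the slide; NumPy's gamma() takes
    (shape, scale), so we pass scale = 1 / rate.
    """
    N, K = U.shape
    M, _ = V.shape
    lam_u = rng.gamma(a + 0.5 * N * K, 1.0 / (b + 0.5 * np.sum(U * U)))
    lam_v = rng.gamma(c + 0.5 * M * K, 1.0 / (d + 0.5 * np.sum(V * V)))
    sq_err = sum((r - U[i] @ V[j]) ** 2 for (i, j), r in R_obs.items())
    beta = rng.gamma(e + 0.5 * len(R_obs), 1.0 / (f + 0.5 * sq_err))
    return lam_u, lam_v, beta

def gibbs_update_user(i, R_obs, U, V, lam_u, beta, rng=np.random.default_rng()):
    """CP of u_i (standard linear-Gaussian conjugacy, not spelled out in the text):
       u_i | rest ~ N(m_i, S_i), with
       S_i^{-1} = lam_u * I + beta * sum_j v_j v_j^T,  m_i = beta * S_i * sum_j r_ij v_j,
       where j ranges over the items rated by user i."""
    K = U.shape[1]
    prec = lam_u * np.eye(K)
    b_vec = np.zeros(K)
    for (ii, j), r in R_obs.items():
        if ii == i:
            prec += beta * np.outer(V[j], V[j])
            b_vec += beta * r * V[j]
    cov = np.linalg.inv(prec)
    U[i] = rng.multivariate_normal(cov @ b_vec, cov)
    return U
```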

MCMC: Some Other Aspects

Using the Samples to make Predictions
▪ Using the 𝑆 samples 𝒁(1), 𝒁(2), …, 𝒁(𝑆), our approximation is 𝑝(𝒁) ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝛿_{𝒁(𝑠)}(𝒁)

▪ Any expectation that depends on 𝑝(𝒁) can be approximated as


𝔼[𝑓(𝒁)] = ∫ 𝑓(𝒁) 𝑝(𝒁) 𝑑𝒁 ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑓(𝒁(𝑠))
▪ For Bayesian linear regression, assuming 𝒘, 𝛽, 𝜆 to be unknown, the PPD approximation will be
𝑝(𝑦∗ |𝒙∗ , 𝑿, 𝒚) = ∫ 𝑝(𝑦∗ |𝒙∗ , 𝒘, 𝛽) 𝑝(𝒘, 𝛽, 𝜆|𝑿, 𝒚) 𝑑𝒘 𝑑𝛽 𝑑𝜆 ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑝(𝑦∗ |𝒙∗ , 𝒘(𝑠) , 𝛽(𝑠) )

Here 𝑝(𝒘, 𝛽, 𝜆|𝑿, 𝒚) is the joint posterior over all unknowns, and the right-hand side is the sampling-based approximation of the PPD. Thus, in this case, the PPD is a sum (equal-weight mixture) of 𝑆 Gaussians, so the mean and variance of 𝑦∗ can be computed using properties of sums of Gaussians. Mean: 𝔼[𝑦∗ ] = (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝒘(𝑠)⊤ 𝒙∗ . Variance: exercise! Use the definition of variance and a Monte-Carlo approximation.
▪ Sampling based approx. for PPD of other models can also be obtained likewise
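
A minimal sketch of this sampling-based PPD approximation for Bayesian linear regression: given posterior samples of (𝒘, 𝛽), the PPD is an equal-weight mixture of 𝑆 Gaussians, so its mean and variance follow from mixture properties (the variance line below answers the "exercise" via the law of total variance). Array names and shapes are illustrative assumptions.

```python
import numpy as np

def ppd_from_samples(x_star, w_samples, beta_samples):
    """Monte-Carlo approximation of the posterior predictive for Bayesian linear regression.

    w_samples    : (S, D) posterior samples of the weight vector w
    beta_samples : (S,)   posterior samples of the noise precision beta
    Each sample contributes a Gaussian N(y* | w^(s)T x*, 1/beta^(s)); the PPD is
    their equal-weight mixture, so its mean/variance follow from mixture properties.
    """
    means = w_samples @ x_star                 # per-sample predictive means w^(s)T x*
    variances = 1.0 / beta_samples             # per-sample predictive variances
    ppd_mean = means.mean()                    # E[y*] = average of the S means
    # Var[y*] = E[variance] + Var[mean] (law of total variance for the mixture)
    ppd_var = variances.mean() + means.var()
    return ppd_mean, ppd_var
```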
Sampling Methods: Label Switching Issue
▪ Suppose we are given samples 𝒁(1) , 𝒁(2) , … , 𝒁(𝑆) from the posterior 𝑝(𝒁|𝑿)

▪ We can’t always simply “average” them to get the “posterior mean” of 𝒁

▪ Why: Non-identifiability of latent variables in models with multiple equivalent posterior modes

▪ Example: In clustering via GMM, the likelihood is invariant to how we label clusters
▪ What we call cluster 1 in one sample may be cluster 2 in the next sample (one sample may be from near one of the modes and the other may be from near the other mode)
▪ Say, in GMM, 𝑧𝑛(1) = [1,0] and 𝑧𝑛(2) = [0,1]; both samples imply the same thing
▪ Averaging will give 𝑧𝑛̄ = [0.5, 0.5], which is incorrect

▪ Quantities not affected by permutations of dims of 𝒁 can be safely averaged


▪ E.g., the probability that two points belong to the same cluster (e.g., in GMM)
▪ E.g., predicting the mean of an entry 𝑟𝑖𝑗 in matrix factorization: (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝒖𝑖(𝑠)⊤ 𝒗𝑗(𝑠) (changes in the order of entries of these 𝐾 × 1 vectors across different samples don’t affect the inner product)
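
A small sketch illustrating the point: averaging one-hot assignment samples across a label-switched chain is meaningless, while a permutation-invariant summary such as the co-clustering probability is unaffected. The two toy samples below are made up for illustration.

```python
import numpy as np

# Two posterior samples of cluster assignments for 3 data points (K = 2 clusters).
# They describe the SAME clustering, but with the cluster labels swapped.
z_sample_1 = np.array([[1, 0], [1, 0], [0, 1]])   # points 0,1 in one cluster, point 2 in the other
z_sample_2 = np.array([[0, 1], [0, 1], [1, 0]])   # same partition, labels permuted

# Naive averaging of the one-hot vectors is NOT meaningful:
print((z_sample_1 + z_sample_2) / 2)              # every row becomes [0.5, 0.5]

# A permutation-invariant quantity: P(point n and point m are in the same cluster),
# estimated by averaging the co-clustering indicator over samples.
def same_cluster(z, n, m):
    return float(np.argmax(z[n]) == np.argmax(z[m]))

samples = [z_sample_1, z_sample_2]
p_same_01 = np.mean([same_cluster(z, 0, 1) for z in samples])   # = 1.0, as it should be
p_same_02 = np.mean([same_cluster(z, 0, 2) for z in samples])   # = 0.0
print(p_same_01, p_same_02)
```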
MCMC: Some Practical Aspects
▪ Choice of proposal distribution is important
▪ For MH sampling, Gaussian proposal is popular when 𝒛 is continuous, e.g.,
𝑞(𝒛|𝒛(ℓ−1) ) = 𝒩(𝒛|𝒛(ℓ−1) , 𝐇), where 𝐇 can be based on the Hessian at the MAP of the target distribution (and may change at each iteration)
▪ Other options: Mixture of proposal distributions, data-driven or adaptive proposals
▪ Autocorrelation: Monte Carlo averaging assumes uncorrelated samples. When approximating 𝑓∗ = 𝔼[𝑓] using the MCMC samples {𝒁(𝑠)}_{𝑠=1}^{𝑆} via 𝑓̄ = (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑓𝑠 (where 𝑓𝑠 is the value of 𝑓 at the 𝑠-th MCMC sample), one can quantify what fraction of the total samples behave as if uncorrelated; we want this fraction to be close to 1
▪ Autocorrelation function (ACF) at lag 𝑡 (formula on the slide): lower is better (see the sketch after this slide’s bullets)

▪ Multiple Chains: Run multiple chains, take union of generated samples
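
Returning to autocorrelation, here is a minimal sketch of how the ACF of a scalar chain and a crude effective-sample-size estimate can be computed. These are standard empirical estimators, not formulas taken from the slide; the truncation at a fixed maximum lag is a simplifying assumption.

```python
import numpy as np

def acf(f_vals, max_lag=50):
    """Empirical autocorrelation of the scalar chain f_1, ..., f_S at lags 0..max_lag."""
    f = np.asarray(f_vals, dtype=float)
    f = f - f.mean()
    var = np.mean(f * f)
    return np.array([np.mean(f[:len(f) - t] * f[t:]) / var for t in range(max_lag + 1)])

def effective_sample_size(f_vals, max_lag=50):
    """Crude ESS estimate: S / (1 + 2 * sum of positive-lag autocorrelations).

    ESS / S is roughly the fraction of samples that behave as if uncorrelated;
    we want it close to 1 (equivalently, low autocorrelation at all lags)."""
    rho = acf(f_vals, max_lag)
    return len(f_vals) / (1.0 + 2.0 * np.sum(rho[1:]))
```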


Approximate Inference: VI vs Sampling
▪ VI approximates a posterior distribution 𝑝(𝒁|𝑿) by another distribution 𝑞(𝒁|𝜙)
▪ Sampling uses 𝑆 samples 𝒁(1) , 𝒁(2) , … , 𝒁(𝑆) to approximate 𝑝(𝒁|𝑿)
▪ Sampling can be used within VI (ELBO approx using Monte-Carlo)
▪ In terms of “comparison” between VI and sampling, a few things to be noted
▪ Convergence: VI only converges to a local optimum; sampling (in theory) can recover the exact posterior
▪ Storage: A sampling-based approximation needs to store all the samples, VI only needs the variational parameters 𝜙
▪ Prediction Cost: Sampling always requires Monte-Carlo avging for posterior predictive; with
VI, sometimes we can get closed form posterior predictive
PPD if using sampling: 𝑝(𝑥∗ |𝑿) = ∫ 𝑝(𝑥∗ |𝒁) 𝑝(𝒁|𝑿) 𝑑𝒁 ≈ (1/𝑆) ∑_{𝑠=1}^{𝑆} 𝑝(𝑥∗ |𝒁(𝑠) )
PPD if using VI: 𝑝(𝑥∗ |𝑿) = ∫ 𝑝(𝑥∗ |𝒁) 𝑝(𝒁|𝑿) 𝑑𝒁 ≈ ∫ 𝑝(𝑥∗ |𝒁) 𝑞(𝒁|𝜙) 𝑑𝒁

▪ There is some work on “compressing” sampling-based approximations, i.e., turning the 𝑆 samples into something more compact*


*”Compact approximations to Bayesian predictive distributions” by Snelson and Ghahramani, 2005; and ”Bayesian Dark Knowledge” by Korattikara et al., 2015
Coming Up Next
▪ Avoiding the random-walk behavior of MCMC
▪ Using gradient information of the posterior
▪ Scalable MCMC methods
