
MTH707: Problem Set

Instructor: Dootika Vats

April 4, 2025

1 Chapter 1
1. Let F be the target distribution with density f , and let P (x, ·) be a Markov kernel
with transition density k(x, y). We write the one-step marginal distribution as

    F P (dy) = ∫_X F (dx) P (x, dy) = ∫_X f (x) k(x, y) dx dy .

Although we think of f (x)k(x, y) as the joint density of the current state and the next
state, why is it not actually written as a joint density?

2. Consider an autoregressive process of order 1 (AR(1)). That is, for ρ such that |ρ| < 1
and a given X0 ,

    Xt+1 = ρXt + ϵt+1 ,

where ϵt ∼ iid N (0, σ²). What is the Markov transition kernel for this Markov chain?
What is the Markov transition density?
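The AR(1) update is easy to simulate directly. The sketch below is in Python for illustration (the course's coding questions use R); the values ρ = 0.5, σ = 1, the run length, and the seed are arbitrary choices. The long-run sample variance should approach σ²/(1 − ρ²), the stationary variance that reappears in Problem 6.

```python
import random

def ar1_chain(rho=0.5, sigma=1.0, x0=0.0, n=10_000, seed=1):
    """Simulate X_{t+1} = rho * X_t + eps_{t+1}, with eps ~ iid N(0, sigma^2)."""
    random.seed(seed)
    x = x0
    chain = []
    for _ in range(n):
        x = rho * x + random.gauss(0.0, sigma)
        chain.append(x)
    return chain

chain = ar1_chain()
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
# With rho = 0.5, sigma = 1, the stationary variance sigma^2/(1 - rho^2) = 4/3,
# so var should be close to 1.33 for a long run.
```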

3. For an X0 , consider the following Markov chain:

(a) Draw U ∼ U (0, 1).

(b) If U < 1/2, set Xn+1 = Xn .

(c) Else draw Xn+1 ∼ N (Xn , 1).

Write the Markov transition kernel for this chain. Does this kernel define a measure
that is absolutely continuous with respect to the Lebesgue measure?

4. Consider the following Markov chain: X0 ∼ λ, where λ is a probability distribution.
Set Xt+1 = Xt for all t ≥ 0.

(a) Is this a Markov chain?

(b) What is the Markov transition kernel?

(c) What is a stationary distribution for this kernel?

5. Ideal Slice Sampler: Consider a distribution F with density f . Consider the following
“Slice Sampler”:

(i) Draw Un+1 ∼ U (0, f (Xn ))

(ii) Draw Xn+1 ∼ Uniform(C), where the set C is {x : Un+1 ≤ f (x)}

Answer the following questions:

(a) Write down the Markov transition kernel for this chain.

(b) Write down the Markov transition density for this chain.

(c) Show that F is the stationary distribution for this kernel.
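For a concrete instance, when f (x) ∝ exp(−x²/2) the slice C = {x : u ≤ f (x)} is the interval [−√(−2 log u), √(−2 log u)], so both steps can be drawn exactly. A minimal Python sketch (illustrative only; the seed and run length are arbitrary):

```python
import math, random

def slice_sampler_gaussian(n=20_000, x0=0.0, seed=7):
    """Ideal slice sampler for f(x) proportional to exp(-x^2 / 2).

    Step (i): U ~ Uniform(0, f(x)).
    Step (ii): X ~ Uniform(C), where C = {x : u <= f(x)} is here the interval
    [-sqrt(-2 log u), sqrt(-2 log u)].
    """
    random.seed(seed)
    x, out = x0, []
    for _ in range(n):
        u = random.uniform(0.0, math.exp(-x * x / 2.0))
        u = max(u, 1e-300)                   # guard against log(0)
        a = math.sqrt(-2.0 * math.log(u))    # slice C = [-a, a]
        x = random.uniform(-a, a)
        out.append(x)
    return out

xs = slice_sampler_gaussian()
# Stationarity of F = N(0, 1): sample mean near 0, sample variance near 1.
```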

6. Show that the AR(1) Markov chain satisfies detailed balance with respect to
F = N (0, σ²/(1 − ρ²)) for |ρ| < 1.
7. For the Markov proposal kernel Q(x, ·) = U (x − h, x + h), where h > 0, show that
q(x, y) = q(y, x).

8. Show that a Markov chain transition kernel P is F -communicating for all A ∈ B(X )
iff P is F -irreducible.

9. Let q(x, y) be the proposal density of the Metropolis-Hastings algorithm. Show
that if q(x, y) > 0 for all x, y ∈ X , then P is F -irreducible.

10. Will the following target and proposal distributions in a Metropolis-Hastings
algorithm lead to a reducible or an F -irreducible Markov chain? Explain intuitively,
and mathematically wherever possible:

(a) F = U (0, 1) and Q(x, ·) = U (x − 2, x + 2)

(b) F = U (0, 1) and Q(x, ·) = U (x − .2, x + .2)

(c) F = N (0, 1) and Q(x, ·) = td for d ≥ 1

(d) F = N (0, 1) and Q(x, ·) = U (0, 1)

(e) F = N (0, 1) and Q(x, ·) = U (x − 2, x + 2)

11. What happens to the Metropolis-Hastings algorithm when Q(x, ·) = F ? Describe
mathematically and intuitively.

12. For a symmetric proposal and for c ≥ 0, consider the acceptance probability

    αc (x, y) = f (y) / (f (x) + f (y) + c) .

Show that αc (x, y) will yield an F -symmetric Markov chain. Will this acceptance
probability be useful in practice?

13. Rejection sampling chains: Consider the accept-reject MCMC algorithm with an
independent proposal distribution: q(x, y) = q(y). For c > 0, define the set

    S = {x : f (x) ≤ cq(x)} .

Consider accepting a draw with probability

    α(x, y) = 1                                   if x ∈ S ,
              cq(x) / f (x)                       if x ∉ S, y ∈ S ,
              min{1, [f (y) q(x)] / [f (x) q(y)]} if x ∉ S, y ∉ S .

Show that this procedure yields an F -invariant Markov chain. In the context of iid
accept-reject samplers, what are the potential advantages of this sampler?

14. Code: Consider the target distribution F = N (0, 1) and proposal Q(x, ·) = N (x, h)
(this is obviously not a realistic scenario). Run an MH algorithm with starting value
X0 = 0 and obtain 1000 samples for each of h = 1, 5, 10. What are the estimated
acceptance probabilities for each h? Which h seems best from the trace plots?
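A minimal Python sketch of this experiment (the course's code questions use R; this is an illustrative translation, with h read as the proposal variance and the run length and seed chosen arbitrarily):

```python
import math, random

def rwm_normal(h=1.0, x0=0.0, n=5_000, seed=3):
    """Random-walk Metropolis for F = N(0, 1) with proposal Q(x, .) = N(x, h).

    Returns the chain and the observed acceptance rate."""
    random.seed(seed)
    log_f = lambda z: -z * z / 2.0           # log target, up to a constant
    x, chain, accepts = x0, [], 0
    for _ in range(n):
        y = random.gauss(x, math.sqrt(h))    # h is the proposal variance
        if random.random() < math.exp(min(0.0, log_f(y) - log_f(x))):
            x, accepts = y, accepts + 1
        chain.append(x)
    return chain, accepts / n

for h in (1, 5, 10):
    _, rate = rwm_normal(h=h)
    print(h, round(rate, 2))   # acceptance rate falls as h grows
```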

15. Code: Repeat the previous problem for target distribution F = N100 (0, I100 ), where
I100 is the 100 × 100 identity matrix. That is, the target distribution is a standard
multivariate normal of 100 dimensions. Have the acceptance probabilities for each h
changed? Why? Can you find a better h?

16. Code: Consider target F = N (0, Σ) where Σ is the diagonal matrix with elements
(1, 2, 5, 10, 100). Using Q(x, ·) = N5 (x, hI5 ), tune h to obtain roughly 30% acceptance
rate.

17. Code: For the same target as above, now use Q(x, ·) = N5 (x, Ω), where Ω is a
diagonal matrix with entries (h1 , h2 , h3 , h4 , h5 ). Tune all choices to obtain 30%
acceptance while also retaining good behavior of the Markov chain.

2 Chapter 2
1. For what values of β > 1 is f (x) ∝ e^(−|x|^β) such that

    lim_{|x|→∞} |∇ log f (x)| / |x| = ∞ ?

2. Code: Implement random-walk Metropolis for Bayesian logistic regression for the
dataset Pima.tr in the MASS library in R. Tune this to obtain 30% acceptance.

3. Code: For the same dataset and model, implement the MALA algorithm and tune it
to obtain 60% acceptance. Compare the performance of this algorithm with the RWM
in the previous problem.
4. Code: Implement MALA for f (x) ∝ e^(−x⁴) with various starting values and tuning.
What do you observe?
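A hedged Python sketch of one such run, using ∇ log f (x) = −4x³ and a step size h = 0.1 chosen arbitrarily (the behavior for distant starting values is exactly what the problem asks you to investigate):

```python
import math, random

def mala_quartic(h=0.1, x0=0.0, n=5_000, seed=11):
    """MALA for f(x) proportional to exp(-x^4); grad log f(x) = -4 x^3.

    Proposal: y = x + (h/2) * grad log f(x) + sqrt(h) * Z, Z ~ N(0, 1),
    followed by a Metropolis-Hastings accept-reject correction."""
    random.seed(seed)
    log_f = lambda z: -z ** 4
    grad = lambda z: -4.0 * z ** 3

    def log_q(x, y):
        # log density (up to a constant) of proposing y from x
        m = x + 0.5 * h * grad(x)
        return -(y - m) ** 2 / (2.0 * h)

    x, chain, acc = x0, [], 0
    for _ in range(n):
        y = random.gauss(x + 0.5 * h * grad(x), math.sqrt(h))
        log_alpha = log_f(y) + log_q(y, x) - log_f(x) - log_q(x, y)
        if random.random() < math.exp(min(0.0, log_alpha)):
            x, acc = y, acc + 1
        chain.append(x)
    return chain, acc / n

chain, rate = mala_quartic()
```

Rerunning with starting values far in the tails (where the gradient −4x³ is enormous) illustrates the instability the problem is probing.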

5. Unadjusted Langevin Algorithm: Write down the discretized Langevin dynamics
for the F = N (5, 20) distribution. For h = .001, .01, .1, 1, 10:

(a) Implement this discretized process (without accept-reject) to obtain 10⁴ samples
for each h.

(b) Plot the density estimate of the samples for each value of h and compare with the
truth.

(c) What do you observe?

This algorithm is often called the ‘Unadjusted Langevin Algorithm’.
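Since ∇ log f (x) = −(x − 5)/20 for F = N (5, 20), the discretized dynamics are Xn+1 = Xn + (h/2)(−(Xn − 5)/20) + √h Zn+1. A minimal Python sketch (illustrative; the course's code questions use R, and the seed is arbitrary):

```python
import math, random

def ula_normal(h=0.1, x0=0.0, n=10_000, seed=5, mu=5.0, v=20.0):
    """Unadjusted Langevin algorithm for F = N(mu, v).

    grad log f(x) = -(x - mu) / v; no accept-reject step is applied, so the
    chain targets F only approximately (bias grows with h)."""
    random.seed(seed)
    x, out = x0, []
    for _ in range(n):
        x = x + 0.5 * h * (-(x - mu) / v) + math.sqrt(h) * random.gauss(0.0, 1.0)
        out.append(x)
    return out

# One run per step size, as in parts (a)-(b):
runs = {h: ula_normal(h=h) for h in (0.001, 0.01, 0.1, 1, 10)}
```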

6. Repeat the above, but for MALA (that is, discretized Langevin dynamics with
accept-reject).

7. Suppose µσ is the density of a symmetric distribution with mode at 0. Show that

    ∫_R [1 / (1 + e^(−z ∇ log f (x)))] µσ (z) dz = 1/2 .


8. Show that for g(t) = √t,

    g( e^((y−x) ∇ log f (x)) ) µσ (y − x)

is proportional to the density of the MALA proposal.

3 Chapter 3
No problems for this chapter.

4 Chapter 4
1. Recall that

    αB (x, y) = f (y) q(y, x) / [f (x) q(x, y) + f (y) q(y, x)] .

Show that αB (x, y) ≤ αMH (x, y). That is, Barker’s acceptance probability is less than
or equal to the MH acceptance probability.

2. For i = 1, . . . , d, let Pi be F -invariant Markov transition kernels. Let Pm and Pc be
the mixture and composition kernels, respectively.

(a) Show that Pm and Pc are F -invariant.

(b) If one Pi is F -irreducible, show that both Pm and Pc are F -irreducible.

3. Suppose Pi is F -symmetric for i = 1, . . . , d. Prove that P = PRS is also F -symmetric.

4. Suppose Pi is F -symmetric for i = 1, . . . , d. Note that Pc = P1 P2 . . . Pd is not
F -symmetric in general. However, show that Ps = P1 P2 . . . Pd Pd−1 . . . P2 P1 is
F -symmetric.

5. Recall the component-wise algorithm for a joint target density f (x1 , . . . , xd ). Let
f (xi | x(i) ) be the full conditionals for the components and let q((xi , x(i) ), yi ) be the
respective proposal densities. Recall that the acceptance probability for each component
is

    α(xi , yi ) = min{ 1, [f (yi | x(i) ) q((yi , x(i) ), xi )] / [f (xi | x(i) ) q((xi , x(i) ), yi )] } .

However, in practice, the joint target density is used in the acceptance probability.
That is, the following is evaluated:

    α(xi , yi ) = min{ 1, [f (yi , x(i) ) q((yi , x(i) ), xi )] / [f (x) q((xi , x(i) ), yi )] } .

Why are the two equivalent? What is the advantage of using the second over the first?

6. Consider a target distribution F defined on X with density f (x) = mf˜(x), where
m > 0 is unknown, and f˜ is known. Suppose g(x) is an independent proposal density
with support Y such that

    sup_{x∈Y} f˜(x) / g(x) ≤ M .

Consider an accept-reject MCMC algorithm with acceptance probability

    αI (x, y) = f˜(y) / (M g(y)) .

(a) What is the Markov chain transition kernel, P ?

(b) Show that P is F -invariant.

(c) What conditions are required for P to be F -irreducible?

(d) Notice that αI (x, y) does not depend on the current state x. Does this imply that
X1 , X2 , . . . , Xn drawn from P are independent? Why or why not?

(e) Would this sampler work well if X is high-dimensional? Why or why not?

7. Consider a three-component target density f (x, y, z). Suppose the following three
conditional densities are available to sample from:

    f (z|x, y), f (x|y, z), f (y|z) .

(a) Consider the component-wise algorithm: given (Xn , Yn , Zn ) = (xn , yn , zn ), the
next update is

i. Yn+1 ∼ f (y | zn )

ii. Xn+1 ∼ f (x | yn+1 , zn )

iii. Zn+1 ∼ f (z | xn+1 , yn+1 )

Write the Markov transition density of a deterministic scan combination of these
component-wise updates.

(b) Show that the Markov chain is invariant for the joint density f (x, y, z).

8. Bayesian linear regression: Consider the likelihood

    Yi | β, σ² ∼ ind N (xiᵀβ, σ²) ,

where Yi ∈ R and xi ∈ Rᵖ. We have priors

    β ∼ N (0, σ² Ip )  and  σ² ∼ Inverse Gamma(a, b) .

Find the full conditionals of β and σ², and construct a deterministic scan Gibbs
sampler, writing down its transition density.
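As a hedged illustration of what such a Gibbs sampler looks like, the Python sketch below specializes to p = 1, where both full conditionals reduce to scalar draws (they follow by completing the square and are stated in the docstring). The simulated data, the hyperparameters a = b = 2, and the seeds are arbitrary choices, not part of the problem.

```python
import random

def gibbs_linreg(x, y, a=2.0, b=2.0, n_iter=3_000, seed=13):
    """Deterministic-scan Gibbs sampler for the p = 1 case of the model:
    y_i ~ N(x_i * beta, sigma^2), beta ~ N(0, sigma^2), sigma^2 ~ IG(a, b).

    Full conditionals (by completing the square):
      beta    | sigma^2 ~ N(Sxy / (Sxx + 1), sigma^2 / (Sxx + 1))
      sigma^2 | beta    ~ IG(a + (n + 1)/2, b + (SSR + beta^2)/2)
    """
    random.seed(seed)
    n = len(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    beta, sigma2 = 0.0, 1.0
    betas, sigma2s = [], []
    for _ in range(n_iter):
        beta = random.gauss(sxy / (sxx + 1.0), (sigma2 / (sxx + 1.0)) ** 0.5)
        ssr = sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))
        # If G ~ Gamma(shape, rate 1), then c / G ~ InverseGamma(shape, c).
        g = random.gammavariate(a + (n + 1) / 2.0, 1.0)
        sigma2 = (b + (ssr + beta * beta) / 2.0) / g
        betas.append(beta)
        sigma2s.append(sigma2)
    return betas, sigma2s

# hypothetical data with true beta = 2, sigma = 1
random.seed(99)
x = [random.gauss(0, 1) for _ in range(200)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]
betas, sigma2s = gibbs_linreg(x, y)
```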

9. Bayesian Gaussian Mixture Model: Let xi ∈ Rᵈ be the observed data points,
assumed to come from a mixture of K Gaussian components:

    xi | zi , µ, Σ ∼ N (µzi , Σzi ) ,

where zi ∈ {1, . . . , K} is the latent cluster assignment. Each cluster assignment zi
follows a categorical distribution based on the mixture weights:

    zi | π ∼ Categorical(π) ,

where π = (π1 , . . . , πK ) are the mixture weights. A Dirichlet prior is placed on the
mixture proportions:

    π ∼ Dirichlet(α1 , . . . , αK ) ,

where the αk are concentration parameters. We assume a Normal-Inverse-Wishart
prior on each component:

    µk | Σk ∼ N (µ0 , Σk /κ0 )  and  Σk ∼ Inverse-Wishart(ν0 , Λ0 ) ,

where µ0 , κ0 , ν0 , and Λ0 are hyperparameters.

Show that the full conditionals are:

(a) Sample Cluster Assignments zi : For each data point i, update zi from

    P (zi = k | xi , µk , Σk , π) ∝ πk N (xi | µk , Σk ) .

(b) Sample Mixture Weights π: Given cluster counts nk (the number of data points in
cluster k), update

    π ∼ Dirichlet(α1 + n1 , . . . , αK + nK ) .

(c) Sample Component Means µk and Covariances Σk : Given the subset Xk of data
points assigned to cluster k, compute the sufficient statistics (sample mean x̄k and
scatter Sk ) and update

    µk | Σk , Xk ∼ N (µ*k , Σk /κ*k ) ,

where

    µ*k = (κ0 µ0 + nk x̄k ) / (κ0 + nk ) ,  κ*k = κ0 + nk ,

and

    Σk | Xk ∼ Inverse-Wishart(ν0 + nk , Λ*k ) ,

where

    Λ*k = Λ0 + Sk + [κ0 nk / (κ0 + nk )] (x̄k − µ0 )(x̄k − µ0 )ᵀ .

10. Bayesian Robust Regression with t-Likelihood: A heavy-tailed alternative to
normal regression using a latent variance. The likelihood is

    yi | xi , β, σ², ν ∼ Tν (xiᵀβ, σ²) ,

where Tν is a Student-t distribution. Consider an augmented variable model where a
new variable is included to allow for a Gibbs sampler:

    yi | xi , β, λi , σ² ∼ N (xiᵀβ, λi σ²) ,
    λi ∼ Inverse-Gamma(ν/2, ν/2) .

This construction implies that the marginal distribution of yi | xi , β, σ² is Tν , and thus
adding λi does not change the marginal model.

Assume priors:

    β ∼ N (0, s0 Ip )  and  σ² ∼ Inverse Gamma(a0 , b0 ) .

(a) Show that the distribution of yi | xi , β, σ² is Tν (xiᵀβ, σ²).

(b) Find the full conditionals of β, σ², and λ = (λ1 , . . . , λn ) to construct a Gibbs
sampler.

(c) Write the Markov transition density of a deterministic scan Gibbs sampler.

11. Code: For the 100-dimensional multivariate normal target from Chapter 1, write a
deterministic scan component-wise MCMC algorithm with h = 1 in the proposal
distribution. Store the acceptance probability for each component in the vector
accept.vec and, at the end of the program, output summary(accept.vec).
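One deterministic scan sweep per iteration looks like this (Python for illustration; the problem itself asks for R output via summary(accept.vec)). Since the target factorizes across coordinates, each coordinate's acceptance ratio involves only that coordinate:

```python
import math, random

def componentwise_mh(d=100, h=1.0, n=2_000, seed=29):
    """Deterministic scan component-wise MH for N_d(0, I_d).

    Each sweep proposes y_i ~ N(x_i, h) for i = 1, ..., d in turn; the
    returned vector holds the observed acceptance rate per coordinate."""
    random.seed(seed)
    x = [0.0] * d
    accepts = [0] * d
    for _ in range(n):
        for i in range(d):
            y = random.gauss(x[i], math.sqrt(h))
            # target factorizes, so only coordinate i enters the MH ratio
            if random.random() < math.exp(min(0.0, (x[i] ** 2 - y ** 2) / 2.0)):
                x[i] = y
                accepts[i] += 1
    return [a / n for a in accepts]

accept_vec = componentwise_mh()
```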

12. Code: Consider the bivariate normal target distribution

    F = N ( (2, −2)ᵀ , ( 1  ρ ; ρ  1 ) ) .

Implement an MH sampler (with h chosen so that the acceptance probability is around
35%) and a Gibbs sampler for ρ = 0, ρ = .5, and ρ = .99. For each value of ρ, run the
Markov chain for 1000 steps and store the output of the chain in n × 2 matrices
chain.mh and chain.gibbs.

For each value of ρ, output the marginal density plots and compare with the true
marginal distributions. Also compare the trace plots of MH and Gibbs. For which ρ is
MH better, and for which does Gibbs seem better?
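The Gibbs half of this comparison can be sketched as follows (Python for illustration; the exercise itself asks for R). It uses the standard bivariate normal full conditionals X | Y = y ∼ N(2 + ρ(y + 2), 1 − ρ²) and Y | X = x ∼ N(−2 + ρ(x − 2), 1 − ρ²); the seed is arbitrary:

```python
import math, random

def gibbs_bvn(rho, n=1_000, seed=17, mu=(2.0, -2.0)):
    """Deterministic-scan Gibbs sampler for the bivariate normal target with
    mean (2, -2), unit variances, and correlation rho.

    Full conditionals: X | Y = y ~ N(mu1 + rho*(y - mu2), 1 - rho^2),
    and symmetrically for Y | X."""
    random.seed(seed)
    s = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    x, y = mu
    chain = []
    for _ in range(n):
        x = random.gauss(mu[0] + rho * (y - mu[1]), s)
        y = random.gauss(mu[1] + rho * (x - mu[0]), s)
        chain.append((x, y))
    return chain

chain_gibbs = gibbs_bvn(rho=0.5)
```

As ρ approaches 1 the conditional variance 1 − ρ² shrinks, which is what drives the slow mixing the problem asks you to observe.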

13. Bayesian logistic regression: The Bayesian logistic regression model is one of the
most popular models in MCMC and has been given special attention. Let Y1 , . . . , Ym
be the observed binary responses. For i = 1, . . . , m, let xi = (xi1 , . . . , xip )ᵀ denote the
vector of covariates for the ith response. For β ∈ Rᵖ, the Bayesian logistic regression
setup is

    Yi ∼ Bernoulli( 1 / (1 + exp(−xiᵀβ)) ) .

We assume the following multivariate normal prior on β: β ∼ N (0, 100 I10 ), where I10
is the 10 × 10 identity matrix.

Load the following data in R:

install.packages("mcmc")  # install package
library(mcmc)
data(logit)  # loads the dataset. Do ?logit to learn about the data

(a) Write a Metropolis-Hastings algorithm to sample from the posterior distribution
of β for the above data set.

(b) You may notice that the above algorithm exhibits high autocorrelation. A Gibbs
sampler using data augmentation was introduced for this model by Polson, Scott, and
Windle. An open version of the paper is available here:
https://arxiv.org/pdf/1205.0310.pdf

The Gibbs sampler is presented at the bottom of Page 6. Describe the sampler
in your own words and explain the justification of this algorithm presented in the
paper.

(c) Implement the Gibbs sampling algorithm and compare the performance with
Metropolis-Hastings for the logit dataset.

14. Metropolis-within-Gibbs: We have learned component-wise MCMC updates, and
within that we have learned Gibbs sampling, which requires the full conditional
distribution of each component to be available. But what if, for one or more of the
components, the full conditional distribution is not available?

When at least one of the fi (xi |x(i) ) is not available to sample from, we use a general
proposal qi , as described before, to update that component, and use Gibbs sampling
updates for all other components.

For i = 1, . . . , m, let ti denote the observed failure time for lamp i (data on m lamps
are collected). Suppose

    Ti | λ, β ∼ Weibull(λ, β) ,

where λ > 0 is the scale parameter and β > 0 is the shape parameter. In a Bayesian
paradigm, we further assume prior distributions on these parameters:

    λ ∼ Gamma(a0 , b0 )  and  β ∼ Gamma(a1 , b1 ) .

The resulting posterior distribution is complicated, and its normalizing constant is not
known:

    f (λ, β | T) ∝ λ^(m+a0−1) β^(m+a1−1) (∏_{i=1}^m ti)^(β−1) exp{−λ ∑_{i=1}^m ti^β} exp{−b1 β} exp{−b0 λ} .

It can also be shown that

    λ | β, T ∼ Gamma( m + a0 , b0 + ∑_{i=1}^m ti^β ) .

However, β | λ, T does not have a closed-form expression:

    f (β | λ, T) ∝ β^(m+a1−1) (∏_{i=1}^m ti)^(β−1) exp{−λ ∑_{i=1}^m ti^β} exp{−b1 β} .

In this case, we can implement the following (deterministic scan) Metropolis-within-Gibbs
sampler:

(a) λn+1 ∼ λ | βn , T

(b) Propose Y ∼ Q((λn+1 , βn ), ·) and draw U ∼ U [0, 1].

(c) If U ≤ α((λn+1 , βn ), Y ), where

    α((λn+1 , βn ), y) = min{ 1, [f (y|λn+1 , T) q((y, λn+1 ), βn )] / [f (βn |λn+1 , T) q((βn , λn+1 ), y)] } ,

then set βn+1 = Y .

(d) Else set βn+1 = βn .

The Markov transition density for this sampler is

    k((λ, β), (λ′ , β ′ )) = f (λ′ |β) p((λ′ , β), (λ′ , β ′ )) ,

where p is the MH transition density for the β-update.

Implement the above sampler to sample from the posterior distribution with randomly gen-
erated data.
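A hedged Python sketch of steps (a)–(d) (the course uses R; the random-walk proposal for β, its step size 0.3, the unit hyperparameters, and the simulated data with λ = 1, β = 2 are all arbitrary illustrative choices):

```python
import math, random

def mwg_weibull(t, a0=1.0, b0=1.0, a1=1.0, b1=1.0,
                step=0.3, n_iter=3_000, seed=23):
    """Metropolis-within-Gibbs sketch for the Weibull posterior above, with
    density f(t) = lambda * beta * t^(beta - 1) * exp(-lambda * t^beta).

    (a)     lambda | beta ~ Gamma(m + a0, rate b0 + sum t_i^beta)  [exact draw]
    (b)-(d) beta updated by a random-walk MH step on its full conditional."""
    random.seed(seed)
    m = len(t)
    sum_log_t = sum(math.log(ti) for ti in t)

    def log_cond_beta(beta, lam):
        if beta <= 0:
            return -math.inf
        return ((m + a1 - 1) * math.log(beta) + (beta - 1) * sum_log_t
                - lam * sum(ti ** beta for ti in t) - b1 * beta)

    lam, beta = 1.0, 1.0
    lams, betas, acc = [], [], 0
    for _ in range(n_iter):
        # (a) exact Gamma draw; gammavariate takes (shape, scale = 1/rate)
        rate = b0 + sum(ti ** beta for ti in t)
        lam = random.gammavariate(m + a0, 1.0 / rate)
        # (b)-(d) symmetric random-walk proposal for beta
        prop = random.gauss(beta, step)
        log_alpha = log_cond_beta(prop, lam) - log_cond_beta(beta, lam)
        if random.random() < math.exp(min(0.0, log_alpha)):
            beta, acc = prop, acc + 1
        lams.append(lam)
        betas.append(beta)
    return lams, betas, acc / n_iter

# randomly generated data with lambda = 1, beta = 2: T = (-log U)^(1/2)
random.seed(41)
data = [(-math.log(random.random())) ** 0.5 for _ in range(200)]
lams, betas, rate = mwg_weibull(data)
```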

5 Chapter 5

6 Chapter 7
1. Recall that an F -Harris ergodic Markov chain on X is uniformly ergodic if, for some
M < ∞ and some t ∈ [0, 1),

    ∥P n (x, ·) − F (·)∥ ≤ M tn .

(a) Suppose the full support X is small. That is, there exists an ϵ > 0 and a measure
Q such that for all x ∈ X and for all A ∈ B(X )

P (x, A) ≥ ϵ Q(A) .

Show that since X is small, P is uniformly ergodic with t = (1 − ϵ) and M = 1.

(b) In how many steps, n∗ , will the above Markov chain be within δ TV-distance of
stationarity?

(c) Consider the target distribution F being a Pareto(α, β) and consider an independent
proposal distribution Pareto(α, λ). Show that if λ ≤ β, then the Markov chain is
uniformly ergodic. For what value of λ is n∗ the smallest?

(d) Show that if λ > 2β, then the asymptotic variance for estimating the mean of a
Pareto distribution using the above independent MH algorithm is ∞.

2. Consider an Independence MH for target F = Exp(λ), and Q = Exp(θ).

(a) For what values of θ is the chain uniformly ergodic?

(b) For λ = 5 and θ = 3, how long will a Markov chain take to be within .01 TV
distance of the target?

3. Show that if X is finite, then every F -irreducible, F -invariant, aperiodic, recurrent
Markov chain is uniformly ergodic.

4. Suppose X is compact and let P be an M-H kernel whose proposal q is absolutely
continuous with respect to the Lebesgue measure and satisfies q(x, y) > 0 on X . If
f (x) ≤ k on X , then show that P is uniformly ergodic.

5. Let PRSGS denote the Markov transition kernel of a 2-component random scan Gibbs
sampler that updates the first component with probability r. Show that
PRSGS² ≥ r(1 − r)PDUGS .

Hint: You can use the MTDs to show this. Then

    kRSGS ((x, y), (u, v)) kRSGS ((u, v), (x′ , y ′ ))
      = [ r fX|Y (u|y)δy (v) + (1 − r)fY |X (v|x)δx (u) ] [ r fX|Y (x′ |v)δv (y ′ ) + (1 − r)fY |X (y ′ |u)δu (x′ ) ] .

6. Using the above result, show that if PDUGS is uniformly ergodic, then so is PRSGS .
(This is true even outside of the two-variable case.)
(This is true even outside of the two variable case).
