Problem Set
April 4, 2025
1 Chapter 1
1. Let F be the target distribution with density f , and let P (x, ·) be a Markov kernel
with transition density k(x, y). We write the one-step marginal distribution as
\[ FP(dy) = \int_{\mathcal{X}} F(dx)\, P(x, dy) = \int_{\mathcal{X}} f(x) k(x, y)\, dx\, dy \,. \]
Although we think of f (x)k(x, y) as the joint density of the current state and the next
state, why is it not actually a joint density?
2. Consider an autoregressive (AR(1)) process of order 1. That is, for $\rho$ such that $|\rho| < 1$
and a given $X_0$,
\[ X_{t+1} = \rho X_t + \epsilon_{t+1}, \]
where $\epsilon_t \overset{\text{iid}}{\sim} N(0, \sigma^2)$. What is the Markov transition kernel for this Markov chain?
What is the Markov transition density?
Write the Markov transition kernel for this chain. Does this kernel define a measure
that is absolutely continuous with respect to the Lebesgue measure?
(a) Is this a Markov chain?
5. Ideal Slice Sampler: Consider a distribution F with density f . Consider the following
“Slice Sampler”:
(a) Write down the Markov transition kernel for this chain.
(b) Write down the Markov transition density for this chain.
6. Show that the AR(1) Markov chain satisfies detailed balance with respect to
$F = N\!\left(0, \dfrac{\sigma^2}{1 - \rho^2}\right)$ for $|\rho| < 1$.
7. For the proposal kernel $Q(x, \cdot) = U(x - h, x + h)$ where $h > 0$, show that $q(x, y) =
q(y, x)$.
8. Show that a Markov chain transition kernel P is F -communicating for all A ∈ B(X )
iff P is F -irreducible.
10. Will the following target and proposal distributions in a Metropolis-Hastings algorithm
lead to a reducible or an F -irreducible Markov chain? Explain intuitively, and
mathematically wherever possible:
(e) F = N (0, 1) and Q(x, ·) = U (x − 2, x + 2)
12. For a symmetric proposal and for $c \ge 0$, consider the acceptance probability
\[ \alpha_c(x, y) = \frac{f(y)}{f(x) + f(y) + c} \,. \]
Show that αc (x, y) will yield an F -symmetric Markov chain. Will this acceptance prob-
ability be useful in practice?
13. Rejection sampling chains: Consider the accept-reject MCMC algorithm with an
independent proposal distribution: $q(x, y) = q(y)$. For $c > 0$, define the set
\[ S = \{x : f(x) \le c\, q(x)\} \,. \]
Show that this procedure yields an F -invariant Markov chain. In the context of iid
accept-reject samplers, what are the potential advantages of this sampler?
14. Code: Consider target distribution F = N (0, 1) and proposal Q(x, ·) = N (x, h) (this
is obviously not a realistic scenario). Run a MH algorithm with starting value X0 = 0
and obtain 1000 samples for each of h = 1, 5, 10. What are the estimated acceptance
probabilities for each h? Which h seems best from the trace plots?
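The problems in this set reference R, but a minimal Python starter for the sampler above may help; the structure (starting value $X_0 = 0$, 1000 samples, proposal variance $h$) follows the problem statement, while the seed and helper name `rwm_normal` are my own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def rwm_normal(h, n=1000, x0=0.0):
    """Random-walk Metropolis targeting N(0,1) with proposal N(x, h)."""
    chain = np.empty(n)
    x = x0
    accepts = 0
    for t in range(n):
        y = x + np.sqrt(h) * rng.normal()
        # symmetric proposal, so log acceptance ratio is log f(y) - log f(x)
        if np.log(rng.uniform()) < 0.5 * (x**2 - y**2):
            x = y
            accepts += 1
        chain[t] = x
    return chain, accepts / n

for h in (1, 5, 10):
    chain, acc = rwm_normal(h)
    print(f"h = {h:2d}: acceptance rate ~ {acc:.2f}")
```

Plotting `chain` for each `h` (e.g. with `matplotlib`) gives the trace plots the problem asks to compare.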
15. Code: Repeat the previous problem for target distribution F = N100 (0, I100 ), where
I100 is the 100 × 100 identity matrix. That is, the target distribution is a standard
multivariate normal of 100 dimensions. Have the acceptance probabilities for each h
changed? Why? Can you find a better h?
16. Code: Consider target F = N (0, Σ) where Σ is the diagonal matrix with elements
(1, 2, 5, 10, 100). Using Q(x, ·) = N5 (x, hI5 ), tune h to obtain roughly 30% acceptance
rate.
17. Code: For the same target as above, now use $Q(x, \cdot) = N_5(x, \Omega)$, where $\Omega$ is a diagonal
matrix with entries $(h_1, h_2, h_3, h_4, h_5)$. Tune all choices to obtain 30% acceptance while also
retaining good behavior of the Markov chain.
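A Python sketch of the tuning loop for the two problems above; the target $\Sigma = \mathrm{diag}(1, 2, 5, 10, 100)$ and the 30% goal are from the problems, while the grid of $h$ values and the per-coordinate scaling $h_i \propto \sigma_i^2$ (a standard heuristic, here with factor $2.38^2/5$) are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = np.array([1.0, 2.0, 5.0, 10.0, 100.0])  # diagonal of Sigma

def rwm_diag(h_vec, n=5000):
    """RWM targeting N_5(0, diag(sigma2)) with proposal N_5(x, diag(h_vec))."""
    x = np.zeros(5)
    logf = lambda v: -0.5 * np.sum(v**2 / sigma2)
    accepts = 0
    for _ in range(n):
        y = x + np.sqrt(h_vec) * rng.normal(size=5)
        if np.log(rng.uniform()) < logf(y) - logf(x):
            x, accepts = y, accepts + 1
    return accepts / n

# Problem 16: a common h, tuned by a crude grid search toward ~30% acceptance
for h in (0.5, 1.0, 2.0, 4.0):
    print(f"h = {h}: acceptance ~ {rwm_diag(np.full(5, h)):.2f}")

# Problem 17: per-coordinate h_i proportional to the target variances
h_opt = (2.38**2 / 5) * sigma2
print(f"scaled h: acceptance ~ {rwm_diag(h_opt):.2f}")
```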
2 Chapter 2
1. For what values of $\beta > 1$ is $f(x) \propto e^{-|x|^\beta}$ such that
\[ \lim_{|x| \to \infty} \frac{|\nabla \log f(x)|}{|x|} = \infty \, ? \]
2. Code: Implement random-walk Metropolis for Bayesian logistic regression for the dataset Pima.tr
in the MASS library in R. Tune this to obtain 30% acceptance.
3. Code: For the same dataset and model, implement the MALA algorithm and tune it
to obtain 60% acceptance. Compare the performance of this algorithm with the RWM
in the previous problem.
4. Code: Implement MALA for $f(x) \propto e^{-x^4}$ with various starting values and tuning.
What do you observe?
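A Python starter for this problem, assuming the target $f(x) \propto e^{-x^4}$; the step size $h$, chain length, and starting value are illustrative choices to be varied as the problem asks.

```python
import numpy as np

rng = np.random.default_rng(3)

def mala(h=0.5, n=5000, x0=0.0):
    """MALA targeting f(x) proportional to exp(-x^4)."""
    logf = lambda v: -v**4
    grad = lambda v: -4 * v**3
    x = x0
    chain = np.empty(n)
    accepts = 0
    for t in range(n):
        mean_xy = x + 0.5 * h * grad(x)         # Langevin proposal mean from x
        y = mean_xy + np.sqrt(h) * rng.normal()
        mean_yx = y + 0.5 * h * grad(y)         # reverse proposal mean from y
        # log q(y, x) - log q(x, y) for the Gaussian Langevin proposal
        logq = (-(x - mean_yx)**2 + (y - mean_xy)**2) / (2 * h)
        if np.log(rng.uniform()) < logf(y) - logf(x) + logq:
            x = y
            accepts += 1
        chain[t] = x
    return chain, accepts / n

chain, acc = mala()
print("acceptance ~", round(acc, 2))
```

Trying `x0` far in the tails (where the gradient $-4x^3$ is enormous) illustrates the behavior the problem is probing.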
(a) Implement this discretized process (without accept-reject) to obtain $10^4$ samples
from each implementation.
(b) Plot the density estimate of the samples for each value of h and compare with the
truth.
6. Repeat the above but for MALA (that is, discretized Langevin dynamics with accept-reject).
7. When $\mu_\sigma$ is the density of a symmetric distribution with mode at 0, show that
\[ \int_{\mathbb{R}} \frac{1}{1 + e^{-z \nabla \log f(x)}}\, \mu_\sigma(z)\, dz = \frac{1}{2} \,. \]
8. Show that for $g(t) = \sqrt{t}$,
\[ g\!\left(e^{(y-x)\nabla \log f(x)}\right) \mu_\sigma(y - x) \]
3 Chapter 3
NOTHING
4 Chapter 4
1. Recall that
\[ \alpha_B(x, y) = \frac{f(y)\, q(y, x)}{f(x)\, q(x, y) + f(y)\, q(y, x)} \,. \]
Show that $\alpha_B(x, y) \le \alpha_{MH}(x, y)$. That is, Barker's acceptance probability is less than
or equal to the MH acceptance probability.
5. Recall a component-wise algorithm for a joint target density $f(x_1, \ldots, x_d)$. Let $f(x_i \mid x_{(i)})$
be the full conditionals for the components and let $q((x_i, x_{(i)}), y_i)$ be the respective proposal
densities. Recall that the acceptance probability for each component is
\[ \alpha(x_i, y_i) = \min\left\{1, \frac{f(y_i \mid x_{(i)})\, q((y_i, x_{(i)}), x_i)}{f(x_i \mid x_{(i)})\, q((x_i, x_{(i)}), y_i)} \right\}. \]
However, in practice, the joint target density is used in the acceptance probability.
That is, the following is evaluated:
\[ \alpha(x_i, y_i) = \min\left\{1, \frac{f(y_i, x_{(i)})\, q((y_i, x_{(i)}), x_i)}{f(x)\, q((x_i, x_{(i)}), y_i)} \right\}. \]
Why are the two equivalent? What is the advantage of using the second over the first?
6. Consider target distribution F defined on X with density f (x) = mf˜(x), where m > 0
is unknown, and f˜ is known. Suppose g(x) is an independent proposal density with
support Y such that
\[ \sup_{x \in \mathcal{Y}} \frac{\tilde{f}(x)}{g(x)} \le M \,. \]
Consider the acceptance probability
\[ \alpha_I(x, y) = \frac{\tilde{f}(y)}{M g(y)} \,. \]
(d) Notice that αI (x, y) is independent of the current state x. Does this imply that
X1 , X2 , . . . , Xn drawn from P are independent? Why or why not?
(e) Would this sampler work well if X is high-dimensional? Why or why not?
7. Consider a three component target density f (x, y, z). Suppose the following three
conditional densities are available to sample from
i. Yn+1 ∼ f (y | zn )
(b) Show that the Markov chain is invariant for the joint density f (x, y, z).
\[ Y_i \mid \beta \overset{\text{ind}}{\sim} N(x_i^T \beta, \sigma^2) \]
Find the full conditionals of $\beta$ and $\sigma^2$, and construct a deterministic scan Gibbs sampler, writing down its transition density.
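A hedged Python sketch of such a Gibbs sampler: the priors $\beta \sim N(0, \tau^2 I)$ and $\sigma^2 \sim \text{Inverse-Gamma}(a, b)$, the synthetic data, and all hyperparameter values are my assumptions (the problem's priors are not shown in this excerpt), chosen so that both full conditionals are standard.

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, tau2, a, b = 50, 3, 100.0, 2.0, 2.0
X = rng.normal(size=(m, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=m)  # synthetic data

def gibbs(n=2000):
    beta, sig2 = np.zeros(p), 1.0
    out = np.empty((n, p + 1))
    for t in range(n):
        # beta | sigma^2, y ~ N(V X'y / sigma^2, V), V = (X'X/sigma^2 + I/tau^2)^{-1}
        V = np.linalg.inv(X.T @ X / sig2 + np.eye(p) / tau2)
        beta = rng.multivariate_normal(V @ X.T @ y / sig2, V)
        # sigma^2 | beta, y ~ Inverse-Gamma(a + m/2, b + ||y - X beta||^2 / 2)
        resid = y - X @ beta
        sig2 = 1.0 / rng.gamma(a + m / 2, 1.0 / (b + resid @ resid / 2))
        out[t] = np.append(beta, sig2)
    return out

draws = gibbs()
print("posterior means:", draws.mean(axis=0).round(2))
```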
\[ x_i \mid z_i, \mu_k, \Sigma_k \sim N(\mu_{z_i}, \Sigma_{z_i}) \]
\[ z_i \mid \pi \sim \text{Categorical}(\pi) \]
where π = (π1 , . . . , πK ) are the mixture weights. A Dirichlet prior is placed on the
mixture proportions:
π ∼ Dirichlet(α1 , . . . , αK )
Show that the full conditionals are:
(a) Sample Cluster Assignments zi : For each data point i, update zi from:
P (zi = k | xi , µk , Σk , π) ∝ πk N (xi | µk , Σk )
(b) Sample Mixture Weights π: Given cluster counts nk (number of data points in
cluster k), update:
π ∼ Dirichlet(α1 + n1 , . . . , αK + nK )
(c) Sample Component Means µk and Covariances Σk : Given the subset of data
points assigned to cluster k, compute sufficient statistics (sample mean x̄k and
covariance Sk ):
µk | Σk , Xk ∼ N (µ∗k , Σk /κ∗k )
where
\[ \mu_k^* = \frac{\kappa_0 \mu_0 + n_k \bar{x}_k}{\kappa_0 + n_k}, \qquad \kappa_k^* = \kappa_0 + n_k \]
and
Σk | Xk ∼ Inverse-Wishart(ν0 + nk , Λ∗k )
where
\[ \Lambda_k^* = \Lambda_0 + S_k + \frac{\kappa_0 n_k}{\kappa_0 + n_k} (\bar{x}_k - \mu_0)(\bar{x}_k - \mu_0)^T \,. \]
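A simplified Python sketch of steps (a)-(c) above, with loud simplifications: one dimension, $K = 2$, component variances held fixed, and a flat-prior shortcut for the mean update in place of the full Normal-Inverse-Wishart step; the data and initializations are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
K, alpha = 2, np.ones(2)
# synthetic 1-D data: two well-separated clusters
x = np.concatenate([rng.normal(-2, 1, 60), rng.normal(2, 1, 40)])
mu, sig2 = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
pi = np.full(K, 1 / K)

for sweep in range(200):
    # (a) z_i from P(z_i = k | ...) proportional to pi_k N(x_i | mu_k, sig2_k)
    logp = np.log(pi) - 0.5 * (x[:, None] - mu)**2 / sig2 - 0.5 * np.log(sig2)
    prob = np.exp(logp - logp.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    z = (rng.uniform(size=len(x)) > prob[:, 0]).astype(int)  # K = 2 shortcut
    # (b) pi | z ~ Dirichlet(alpha + cluster counts)
    nk = np.bincount(z, minlength=K)
    pi = rng.dirichlet(alpha + nk)
    # (c) mu_k | z (flat-prior shortcut): draw around the cluster sample mean
    for k in range(K):
        if nk[k] > 0:
            mu[k] = rng.normal(x[z == k].mean(), np.sqrt(sig2[k] / nk[k]))

print("weights ~", pi.round(2), "means ~", mu.round(2))
```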
\[ y_i \mid x_i, \beta, \sigma^2, \nu \sim T_\nu(x_i^\top \beta, \sigma^2) \]
\[ y_i \mid x_i, \beta, \lambda_i, \sigma^2 \sim N(x_i^\top \beta, \lambda_i \sigma^2) \]
\[ \lambda_i \sim \text{Inverse-Gamma}(\nu/2, \nu/2) \]
This representation implies that the marginal distribution of $y_i \mid x_i, \beta, \sigma^2$ is $T_\nu$, and thus adding
this $\lambda_i$ does not change the marginal model.
Assume priors:
(b) Find the full conditionals for $\beta$, $\sigma^2$, and $\lambda = (\lambda_1, \ldots, \lambda_p)$ to construct a Gibbs
sampler.
(c) Write the Markov transition density of a deterministic scan Gibbs sampler.
11. Code: For the 100-dimensional multivariate normal question in the previous example,
write a deterministic scan component-wise MCMC algorithm with h = 1 in the proposal
distribution. Store the acceptance probability for each component in the vector
accept.vec and at the end of the program output
summary(accept.vec)
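A Python starter for this problem (the problem's `accept.vec` and `summary()` are R idioms; here they are mimicked with a NumPy array and a small dict, and the target $N_{100}(0, I_{100})$ with $h = 1$ is from the problem):

```python
import numpy as np

rng = np.random.default_rng(11)
d, n = 100, 1000
x = np.zeros(d)
accept_vec = np.zeros(d)  # per-component acceptance counts

for _ in range(n):
    for i in range(d):  # deterministic scan over components
        prop = x[i] + rng.normal()  # h = 1 proposal for component i
        # only component i changes, so the ratio involves its N(0,1) term only
        if np.log(rng.uniform()) < 0.5 * (x[i]**2 - prop**2):
            x[i] = prop
            accept_vec[i] += 1

accept_vec /= n
# rough analogue of summary(accept.vec) in R
print({"min": accept_vec.min().round(2),
       "mean": accept_vec.mean().round(2),
       "max": accept_vec.max().round(2)})
```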
Implement a MH sampler (with h chosen so that the acceptance probability is around 35%) and a Gibbs
sampler for ρ = 0, ρ = .5, and ρ = .99. For each value of ρ, run the Markov chain for
1000 steps and store the output of the chain in n × 2 matrices chain.mh and chain.gibbs.
For each value of ρ, output the marginal density plots and compare with the true
marginal distribution. Also compare the trace plots of MH and Gibbs. For which ρ
is MH better, and for which ρ does Gibbs seem better?
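A Python sketch of the two samplers being compared; I assume the target is $N_2\!\left(0, \begin{psmallmatrix}1 & \rho\\ \rho & 1\end{psmallmatrix}\right)$ (consistent with varying ρ), and the step size `h` for MH is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)

def gibbs_bvn(rho, n=1000):
    """Deterministic scan Gibbs for N_2(0, [[1, rho], [rho, 1]])."""
    chain = np.zeros((n, 2))
    x1 = x2 = 0.0
    s = np.sqrt(1 - rho**2)
    for t in range(n):
        x1 = rng.normal(rho * x2, s)  # x1 | x2 ~ N(rho x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, s)  # x2 | x1 ~ N(rho x1, 1 - rho^2)
        chain[t] = (x1, x2)
    return chain

def mh_bvn(rho, h, n=1000):
    """Random-walk MH for the same bivariate normal target."""
    Sinv = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))
    logf = lambda v: -0.5 * v @ Sinv @ v
    x = np.zeros(2)
    chain = np.zeros((n, 2))
    for t in range(n):
        prop = x + np.sqrt(h) * rng.normal(size=2)
        if np.log(rng.uniform()) < logf(prop) - logf(x):
            x = prop
        chain[t] = x
    return chain

for rho in (0.0, 0.5, 0.99):
    g = gibbs_bvn(rho)
    m = mh_bvn(rho, h=1.0)
    print(rho, "gibbs sd:", g[:, 0].std().round(2), "mh sd:", m[:, 0].std().round(2))
```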
13. Bayesian logistic regression: The Bayesian logistic regression model is one of the
most popular models in MCMC and has been given special attention. Let $Y_1, \ldots, Y_m$ be the
observed binary responses. For $i = 1, \ldots, m$, let $x_i = (x_{i1}, \ldots, x_{ip})^T$ denote the vector of
covariates for the $i$th response. For $\beta \in \mathbb{R}^p$, the Bayesian logistic regression setup is
\[ Y_i \sim \text{Bernoulli}\left( \frac{1}{1 + \exp(-x_i^T \beta)} \right). \]
We assume the following multivariate normal prior on β, β ∼ N (0, 100I10 ), where I10
is the 10 × 10 identity matrix.
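A hedged Python starter for a random-walk MH sampler on this posterior: the $N(0, 100I)$ prior is from the text, but the synthetic data, the reduced dimension $p = 3$ (instead of 10), and the step size `h` are my choices.

```python
import numpy as np

rng = np.random.default_rng(7)
m, p = 200, 3
X = rng.normal(size=(m, p))
beta_true = np.array([1.0, -1.0, 0.5])  # synthetic truth
ybin = (rng.uniform(size=m) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def log_post(beta):
    eta = X @ beta
    # Bernoulli log-likelihood plus N(0, 100 I) log-prior, up to constants
    return np.sum(ybin * eta - np.logaddexp(0.0, eta)) - beta @ beta / 200.0

def rwm(h=0.05, n=5000):
    beta = np.zeros(p)
    chain = np.empty((n, p))
    accepts = 0
    for t in range(n):
        prop = beta + np.sqrt(h) * rng.normal(size=p)
        if np.log(rng.uniform()) < log_post(prop) - log_post(beta):
            beta, accepts = prop, accepts + 1
        chain[t] = beta
    return chain, accepts / n

chain, acc = rwm()
print("acceptance ~", round(acc, 2),
      "posterior mean ~", chain.mean(axis=0).round(2))
```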
(b) You may notice that the above algorithm exhibits high autocorrelation. A Gibbs
sampler using data augmentation was introduced for this model by Polson, Scott,
and Windle (2013). An open
version of the paper is available here: https://arxiv.org/pdf/1205.0310.pdf
The Gibbs sampler is presented at the bottom of Page 6. Describe the sampler
in your own words and explain the justification of this algorithm presented in the
paper.
(c) Implement the Gibbs sampling algorithm and compare the performance with
Metropolis-Hastings for the logit dataset.
When at least one of the fi (xi |x(i) ) is not available to sample from, then we use a
general proposal qi as described before to update that component, and use Gibbs
sampling updates for all other components.
For $i = 1, \ldots, m$, let $t_i$ denote the observed failure time for lamp $i$ (where $m$ lamps' data
are collected). Suppose
\[ T_i \mid \lambda, \beta \sim \text{Weibull}(\lambda, \beta) \]
where $\lambda > 0$ is the scale parameter and $\beta$ is the shape parameter. In a Bayesian paradigm,
we further assume prior distributions on these parameters.
The resulting posterior distribution is complicated, and its normalizing constant
is not known:
\[ f(\lambda, \beta \mid T) \propto \lambda^{m + a_0 - 1}\, \beta^{m + a_1 - 1} \left( \prod_{i=1}^m t_i \right)^{\beta - 1} \exp\left\{ -\lambda \sum_{i=1}^m t_i^\beta \right\} \exp\{-b_1 \beta\} \exp\{-b_0 \lambda\} \]
\[ f(\beta \mid \lambda, T) \propto \beta^{m + a_1 - 1} \left( \prod_{i=1}^m t_i \right)^{\beta - 1} \exp\left\{ -\lambda \sum_{i=1}^m t_i^\beta \right\} \exp\{-b_1 \beta\} \,. \]
(a) λn+1 ∼ λ | βn , T
then βn+1 = Y .
Implement the above sampler to sample from the posterior distribution with randomly generated data.
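A Python sketch of this Metropolis-within-Gibbs sampler under stated assumptions: $\text{Gamma}(a_0, b_0)$ and $\text{Gamma}(a_1, b_1)$ priors on $\lambda$ and $\beta$ (matching the exponents in the posterior above), a random-walk MH step for $\beta$, and illustrative hyperparameter values and synthetic data.

```python
import numpy as np

rng = np.random.default_rng(8)
a0, b0, a1, b1 = 2.0, 1.0, 2.0, 1.0
t = rng.weibull(1.5, size=100)  # synthetic failure times, true shape 1.5
m_ = len(t)

def log_cond_beta(beta, lam):
    """Log full conditional of beta, up to a constant."""
    if beta <= 0:
        return -np.inf
    return ((m_ + a1 - 1) * np.log(beta) + (beta - 1) * np.log(t).sum()
            - lam * np.sum(t**beta) - b1 * beta)

lam, beta = 1.0, 1.0
draws = []
for _ in range(3000):
    # (a) lambda | beta, T ~ Gamma(m + a0, rate = b0 + sum_i t_i^beta)
    lam = rng.gamma(m_ + a0, 1.0 / (b0 + np.sum(t**beta)))
    # (b) random-walk MH update for beta given lambda
    prop = beta + 0.2 * rng.normal()
    if np.log(rng.uniform()) < log_cond_beta(prop, lam) - log_cond_beta(beta, lam):
        beta = prop
    draws.append((lam, beta))

lam_m, beta_m = np.mean(draws, axis=0)
print(f"posterior means: lambda ~ {lam_m:.2f}, beta ~ {beta_m:.2f}")
```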
5 Chapter 5
6 Chapter 7
1. Recall that an F -Harris ergodic Markov chain on $\mathcal{X}$ is uniformly ergodic if for some
$M < \infty$ and some $t \in [0, 1)$,
\[ \| P^n(x, \cdot) - F(\cdot) \| \le M t^n \,. \]
(a) Suppose the full support X is small. That is, there exists an ϵ > 0 and a measure
Q such that for all x ∈ X and for all A ∈ B(X )
P (x, A) ≥ ϵ Q(A) .
(b) In how many steps, n∗ , will the above Markov chain be within δ TV-distance of
stationarity?
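A sketch of the count for part (b), assuming the minorization condition from part (a) yields the standard coupling bound $\| P^n(x, \cdot) - F(\cdot) \| \le (1 - \epsilon)^n$:

```latex
(1 - \epsilon)^{n^*} \le \delta
\quad\Longleftrightarrow\quad
n^* \ge \frac{\log \delta}{\log(1 - \epsilon)},
\qquad \text{so } n^* = \left\lceil \frac{\log \delta}{\log(1 - \epsilon)} \right\rceil .
```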
(d) Show that if λ > 2β, then the asymptotic variance for estimating the mean of a
Pareto distribution using the above independent MH algorithm is ∞.
(b) For λ = 5 and θ = 3, how long will a Markov chain take to be within .01 TV
distance of the target?
5. Let $P_{RSGS}$ denote the Markov transition density of a 2-component random scan Gibbs
sampler. Then show that $P_{RSGS}^2 \ge r(1 - r) P_{DUGS}$.
Hint: You can use the MTDs to show this.
6. Using the above result, show that if $P_{DUGS}$ is uniformly ergodic, then so is $P_{RSGS}$.
(This is true even outside of the two-variable case.)