Intro To Markov Chain Monte Carlo
Rebecca C. Steorts
Bayesian Methods and Modern Statistics: STA 360/601
Module 7
Gibbs sampling
Two-stage Gibbs sampler
0. Set (x_0, y_0) to some starting value.
1. Sample x_1 ∼ p(x|y_0), that is, from the conditional distribution X | Y = y_0.
   Current state: (x_1, y_0)
   Sample y_1 ∼ p(y|x_1), that is, from the conditional distribution Y | X = x_1.
   Current state: (x_1, y_1)
2. Sample x_2 ∼ p(x|y_1), that is, from the conditional distribution X | Y = y_1.
   Current state: (x_2, y_1)
   Sample y_2 ∼ p(y|x_2), that is, from the conditional distribution Y | X = x_2.
   Current state: (x_2, y_2)
   ⋮
Repeat iterations 1 and 2, M times.
This procedure defines a sequence of pairs of random variables
(X_0, Y_0), (X_1, Y_1), (X_2, Y_2), (X_3, Y_3), . . .
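As a minimal code sketch of this procedure (Python is used here; sample_x_given_y and sample_y_given_x are hypothetical stand-ins for whatever routines draw from the two full conditionals of the model at hand):

```python
# Minimal sketch of the two-stage Gibbs sampler described above.
# sample_x_given_y and sample_y_given_x are hypothetical callables that draw
# from the full conditionals p(x|y) and p(y|x) for the model at hand.
def two_stage_gibbs(x0, y0, sample_x_given_y, sample_y_given_x, M):
    """Run M sweeps starting from (x0, y0); return the list of visited pairs."""
    chain = [(x0, y0)]
    x, y = x0, y0
    for _ in range(M):
        x = sample_x_given_y(y)   # x_{t+1} ~ p(x | y_t)
        y = sample_y_given_x(x)   # y_{t+1} ~ p(y | x_{t+1})
        chain.append((x, y))
    return chain
```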
Markov chain and dependence
Ideal Properties of MCMC
Toy Example
Suppose we want to sample from the joint density p(x, y) ∝ e^{−xy} 1(0 < x < c) 1(0 < y < c), where c > 0, and (0, c) denotes the (open) interval between 0 and c. (This example is due to Casella & George, 1992.)
For fixed y ∈ (0, c), the conditional density of x given y satisfies p(x|y) ∝_x e^{−xy} 1(0 < x < c),¹ that is, X | Y = y is an Exponential(y) random variable truncated to (0, c), which we denote TExp(y, (0, c)); by symmetry, Y | X = x ∼ TExp(x, (0, c)).
¹Under ∝, we write the random variable (x) for clarity.
The Exponential(θ) c.d.f. is F(x|θ) = 1 − e^{−θx}; applying the inverse c.d.f. method to this c.d.f., restricted to S, gives a way to sample from TExp(θ, S).
Let's apply Gibbs sampling, denoting S = (0, c).
0. Initialize x_0, y_0 ∈ S.
1. Sample x_1 ∼ TExp(y_0, S), then sample y_1 ∼ TExp(x_1, S).
2. Sample x_2 ∼ TExp(y_1, S), then sample y_2 ∼ TExp(x_2, S).
   ⋮
N. Sample x_N ∼ TExp(y_{N−1}, S), then sample y_N ∼ TExp(x_N, S).
Figure 1 demonstrates the algorithm, with c = 2 and initial point (x_0, y_0) = (1, 1).
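A minimal code sketch of this sampler (not from the slides; it assumes the joint density ∝ e^{−xy} on (0, c)², and draws each TExp sample by the inverse c.d.f. method applied to F(x|θ) = 1 − e^{−θx}):

```python
import numpy as np

rng = np.random.default_rng(0)

def rtexp(theta, c):
    """Draw from Exp(theta) truncated to (0, c) via the inverse c.d.f.

    On (0, c) the truncated c.d.f. is F(x) = (1 - exp(-theta*x)) / (1 - exp(-theta*c)),
    so x = -log(1 - u*(1 - exp(-theta*c))) / theta for u ~ Uniform(0, 1).
    """
    u = rng.uniform()
    return -np.log(1.0 - u * (1.0 - np.exp(-theta * c))) / theta

def toy_gibbs(c=2.0, x0=1.0, y0=1.0, N=10_000):
    """Gibbs sampler for the toy target proportional to exp(-x*y) on (0, c)^2."""
    xs, ys = np.empty(N), np.empty(N)
    x, y = x0, y0
    for i in range(N):
        x = rtexp(y, c)   # x_i ~ TExp(y_{i-1}, (0, c))
        y = rtexp(x, c)   # y_i ~ TExp(x_i, (0, c))
        xs[i], ys[i] = x, y
    return xs, ys
```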
Figure 1: (Left) Schematic representation of the first 5 Gibbs sampling iterations/sweeps/scans. (Right) Scatterplot of samples from 10^4 Gibbs sampling iterations.
Example: Normal with semi-conjugate prior
Consider X_1, . . . , X_n | µ, λ ∼ N(µ, λ^{−1}), i.i.d. Then independently consider
µ ∼ N(µ_0, λ_0^{−1})
λ ∼ Gamma(a, b)
We know from the Normal–Normal model that, for any fixed value of λ,
µ | λ, x_{1:n} ∼ N(M_λ, L_λ^{−1}),²
where
L_λ = λ_0 + nλ  and  M_λ = (λ_0 µ_0 + λ Σ_{i=1}^n x_i) / (λ_0 + nλ).
²Do this on your own.
To implement Gibbs sampling in this example, each iteration consists of sampling from the two full conditionals:
µ | λ, x_{1:n} ∼ N(M_λ, L_λ^{−1})
λ | µ, x_{1:n} ∼ Gamma(a + n/2, b + (1/2) Σ_{i=1}^n (x_i − µ)²).
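A minimal code sketch of this sampler (not from the slides; it assumes numpy and uses the full conditionals stated above, noting that numpy's Gamma sampler is parameterized by shape and scale, i.e., scale = 1/rate):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_gibbs(x, mu0, lambda0, a, b, N=5_000):
    """Gibbs sampler for the Normal model with semi-conjugate prior.

    Model: x_i | mu, lam ~ N(mu, 1/lam), mu ~ N(mu0, 1/lambda0),
    lam ~ Gamma(a, b) in the shape/rate parameterization.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    sum_x = np.sum(x)
    mu, lam = np.mean(x), 1.0 / np.var(x)      # arbitrary starting values
    mus, lams = np.empty(N), np.empty(N)
    for t in range(N):
        # mu | lam, x_{1:n} ~ N(M_lam, 1/L_lam)
        L = lambda0 + n * lam
        M = (lambda0 * mu0 + lam * sum_x) / L
        mu = rng.normal(M, 1.0 / np.sqrt(L))
        # lam | mu, x_{1:n} ~ Gamma(a + n/2, b + 0.5 * sum (x_i - mu)^2)
        rate = b + 0.5 * np.sum((x - mu) ** 2)
        lam = rng.gamma(a + n / 2.0, 1.0 / rate)   # numpy uses scale = 1/rate
        mus[t], lams[t] = mu, lam
    return mus, lams
```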
Pareto example
Power Law Distribution
The Pareto distribution with shape α > 0 and scale c > 0 has p.d.f.
Pareto(x|α, c) = (αc^α / x^{α+1}) 1(x > c) ∝ (1 / x^{α+1}) 1(x > c).
This is referred to as a power law distribution, because the p.d.f. is
proportional to x raised to a power. Notice that c is a lower bound
on the observed values. In this example, we’ll see how Gibbs
sampling can be used to perform inference for α and c.
Rank City Population
1 Charlotte 731424
2 Raleigh 403892
3 Greensboro 269666
4 Durham 228330
5 Winston-Salem 229618
6 Fayetteville 200564
7 Cary 135234
8 Wilmington 106476
9 High Point 104371
10 Greenville 84554
11 Asheville 85712
12 Concord 79066
⋮
44 Havelock 20735
45 Carrboro 19582
46 Shelby 20323
47 Clemmons 18627
48 Lexington 18931
49 Elizabeth City 18683
50 Boone 17122
Parameter Interpretations
To keep things as simple as possible, let’s use an (improper)
default prior:
p(α, c) ∝ 1(α, c > 0).
Recall
p(x|α, c) = (αc^α / x^{α+1}) 1(x > c)    (2)
and
p(α, c) ∝ 1(α, c > 0).    (3)
To use Gibbs, we need to be able to sample α | c, x_{1:n} and c | α, x_{1:n}.
Defining the Mono distribution
The Mono(a, b) distribution (for a, b > 0) has density Mono(x|a, b) ∝ x^{a−1} 1(0 < x < b), i.e., p.d.f. (a/b^a) x^{a−1} 1(0 < x < b), so its c.d.f. on 0 < x < b is F(x) = x^a / b^a.³
To use the inverse c.d.f. technique, we solve for the inverse of F on 0 < x < b: let u = x^a / b^a and solve for x.
u = x^a / b^a    (5)
b^a u = x^a    (6)
b u^{1/a} = x    (7)
³It turns out that this is an inverse of the Pareto distribution, in the sense that if X ∼ Pareto(α, c) then 1/X ∼ Mono(α, 1/c).
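As a small code sketch of the resulting sampler (a hypothetical helper named rmono, using numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

def rmono(a, b, size=None):
    """Draw from Mono(a, b), whose density is proportional to x^(a-1) on (0, b).

    By the inverse c.d.f. derivation above, if u ~ Uniform(0, 1) then
    x = b * u**(1/a) has c.d.f. F(x) = x^a / b^a on (0, b).
    """
    u = rng.uniform(size=size)
    return b * u ** (1.0 / a)
```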
So, in order to use the Gibbs sampling algorithm to sample from the posterior p(α, c | x_{1:n}), we initialize α and c, and then alternately update them by sampling:
α | c, x_{1:n} ∼ Gamma(n + 1, Σ_{i=1}^n log x_i − n log c)
c | α, x_{1:n} ∼ Mono(nα + 1, x_*),
where x_* = min{x_1, . . . , x_n}.
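A minimal code sketch of this sampler (not from the slides; it assumes numpy, takes x_* to be the sample minimum as above, and inverts the rate because numpy's Gamma sampler uses shape/scale):

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto_gibbs(x, N=10_000):
    """Gibbs sampler for the posterior p(alpha, c | x_1:n) under the flat prior."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sum_log_x = np.sum(np.log(x))
    x_star = np.min(x)                 # smallest observation
    alpha, c = 1.0, x_star / 2.0       # arbitrary starting values with c < x_star
    alphas, cs = np.empty(N), np.empty(N)
    for t in range(N):
        # alpha | c, x_{1:n} ~ Gamma(n + 1, sum log x_i - n log c)  (shape, rate)
        rate = sum_log_x - n * np.log(c)
        alpha = rng.gamma(n + 1, 1.0 / rate)   # numpy uses scale = 1/rate
        # c | alpha, x_{1:n} ~ Mono(n*alpha + 1, x_star), via the inverse c.d.f.
        c = x_star * rng.uniform() ** (1.0 / (n * alpha + 1))
        alphas[t], cs[t] = alpha, c
    return alphas, cs
```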
Ways of visualizing results
Figure 2: Traceplot of α.
Figure 3: Traceplot of c.
Estimated density. We are primarily interested in the posterior on α, since it tells us the scaling relationship between the size of cities and their probability of occurring.
The two vertical lines indicate the lower ℓ and upper u boundaries of an (approximate) 90% credible interval [ℓ, u], that is, an interval that contains 90% of the posterior probability:
P(α ∈ [ℓ, u] | x_{1:n}) = 0.9.
Figure 4: Estimated density of α | x_{1:n} with ≈ 90 percent credible intervals.
Running averages. Panel (d) shows the running average (1/k) Σ_{i=1}^k α_i for k = 1, . . . , N.
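A one-line way to compute this from the saved draws (assuming alphas is the numpy array of posterior samples of α produced by the sampler above):

```python
import numpy as np

# Running average (1/k) * sum_{i=1}^k alpha_i for k = 1, ..., N.
def running_average(alphas):
    return np.cumsum(alphas) / np.arange(1, len(alphas) + 1)
```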
Figure 5: Running average plot.
Survival function
Figure 6(e) shows an empirical estimate of the survival function (based on the empirical c.d.f., F̂(x) = (1/n) Σ_{i=1}^n 1(x ≥ x_i)), along with the posterior survival function, approximated by averaging the Pareto survival function P(X > x | α, c) over the posterior samples of (α, c).
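A sketch of how both curves could be computed from the data and the posterior draws (the exact display formula from the slide is not in the extract; this uses the standard Monte Carlo average of the Pareto survival function, with hypothetical helper names):

```python
import numpy as np

def empirical_survival(x_data, grid):
    """Empirical survival function 1 - F_hat(x), evaluated at each grid point."""
    x_data = np.asarray(x_data, dtype=float)
    return np.array([np.mean(x_data > g) for g in grid])

def posterior_survival(alphas, cs, grid):
    """Monte Carlo approximation of the posterior survival function:
    average P(X > x | alpha, c) = min(1, (c/x)^alpha) over the posterior draws."""
    out = np.empty(len(grid))
    for j, g in enumerate(grid):
        out[j] = np.mean(np.minimum(1.0, (cs / g) ** alphas))
    return out
```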
Figure 6: Empirical vs. posterior survival function.