Sampling Methods: Søren Højsgaard
Søren Højsgaard
Department of Mathematical Sciences
June 3, 2014
Contents
1 Introduction – Bayesian modelling
1 Introduction – Bayesian modelling
• In a Bayesian setting, parameters are treated as random quantities on equal footing
with the random variables.
• The joint distribution of a parameter (vector) θ and data (vector) y is specified
through a prior distribution π(θ) for θ and a conditional distribution p(y | θ) of data
for a fixed value of θ.
• This leads to the joint distribution for data AND parameters
p(y, θ) = p(y | θ) π(θ)
• The prior distribution π(θ) represents our knowledge (or uncertainty) about θ before
data have been observed.
• After data have been observed, interest centres on the posterior distribution
π*(θ) = p(θ | y) ∝ p(y | θ) π(θ), which is typically known only up to a normalizing
constant, so posterior expectations can rarely be computed analytically.
• In such cases one will often resort to sampling based methods: If we can draw samples
θ^(1), . . . , θ^(N) from π*(θ) we can do just as well:
E(g(θ) | π*) ≈ (1/N) Σ_i g(θ^(i))
(a small numerical illustration is given at the end of this list).
• The question is then how to draw samples from π*(θ) where π*(θ) is only known up
to the normalizing constant.
• There are many methods for achieving this; these methods are known as Markov
Chain Monte Carlo (MCMC) methods and will be described elsewhere.
• Sections marked with “*” in the following can be skipped at first reading.
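As a small illustration of the Monte Carlo approximation above (not taken from the notes; the Beta distribution is just an assumed example where the exact expectation is known):
> ## Monte Carlo estimate of E(theta) for theta ~ Beta(3, 7)
> N  <- 10000
> th <- rbeta(N, 3, 7)   ## samples theta^(1), ..., theta^(N)
> mean(th)               ## Monte Carlo estimate of E(theta)
> 3/(3 + 7)              ## exact value, for comparison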
2 Computations using Monte Carlo methods
Consider a random vector X with density / probability mass function p(x) which is the
target distribution (from which we want to sample).
In many real world applications
p(x) = k(x)/c
where k(x) can be evaluated pointwise but the normalizing constant c is unknown or hard to compute.
We reserve h(x) for a proposal distribution which is a distribution from which we can
draw samples.
2.1 Rejection sampling
Choose a constant M such that k(x) ≤ M h(x) for all x. Then repeat the following steps:
1. Draw a proposal x from h.
2. Draw u from a uniform distribution on (0, 1) and set α = k(x)/(M h(x)).
3. If u < α, accept x; otherwise discard x and return to step 1.
The accepted values are draws from the target p.
2.2 Example: Rejection sampling
> k <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
> x <- seq(-4, 4, 0.1)
> plot(x,k(x),type="l")
[Figure: the target k(x) over [-4, 4] (left), and a barplot of the rejection-sampling draws (right).]
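A minimal sketch of the sampling step itself, assuming a uniform proposal h on [-4, 4] and the bound M = 24 (an assumed value: k stays below about 3 on this interval and h = 1/8 there, so k(x) ≤ M h(x)):
> N  <- 10000
> h  <- function(x){ dunif(x, -4, 4) }    ## proposal density
> M  <- 24                                ## bound: k(x) <= M*h(x) on [-4, 4]
> xprop <- runif(N, -4, 4)                ## step 1: draw proposals from h
> u     <- runif(N)                       ## step 2: uniforms for the accept test
> acc   <- u < k(xprop)/(M*h(xprop))      ## step 3: accept if u < alpha
> x.acc <- xprop[acc]
> mean(acc)                               ## acceptance rate
> hist(x.acc, prob=TRUE, breaks=40)
The quiz below refers to this M and to the number of accepted samples, length(x.acc).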
2.3 QUIZ
Reuse the code from above to answer these questions, but please think about what the
results would be before executing the code.
• Suppose we could not easily determine M and hence had to make a conservative
choice; say M = 100 or M = 500 in this context.
What effect would that have on the number of accepted samples, and how would
you have to compensate?
• Suppose we take the proposal distribution h() to be uniform on [-10, 10]. What
effect would that have on the acceptance rate? What if the proposal distribution is
a N(0, 1)? What is the quality of the samples in this case? Hint: Use dnorm() to
evaluate the normal density.
• This scheme also works if p is only known up to proportionality (because the
normalizing constant cancels out in step 3 above).
• Samples from h which "fit best to p" are those most likely to appear in the resample.
However, if h is a poor approximation to p then the "best samples from h" are not
necessarily good samples in the sense of resembling p.
• This leads to schemes (described below) for drawing samples x_1, . . . , x_N; under
certain conditions these samples form an ergodic Markov chain with p(x) as its
stationary distribution.
• Hence, the expected value of any function g of x can be calculated approximately as
∫ g(x) p(x) dx ≈ (1/N) Σ_i g(x_i).
Example: x = x0 + N(0, σ²) (a random walk proposal).
The independence sampler (a special case of the Metropolis–Hastings algorithm): The
proposal h(x | x0) = h(x) does not depend on x0. The acceptance probability becomes
α = min{1, (p(x)/p(x_{t-1})) · (h(x_{t-1})/h(x))}.
For this sampler to work well, h should be a good approximation to p.
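As an illustration (not code from the notes), a minimal independence sampler for the target k() from the rejection sampling example, with a N(0, 2^2) proposal as an assumed choice:
> N <- 10000
> x.ind <- rep.int(NA, N)
> xc <- 0
> for (ii in 1:N){
    xp <- rnorm(1, mean=0, sd=2)              ## proposal; does not depend on xc
    alpha <- min(1, (k(xp)/k(xc)) *
                 (dnorm(xc, 0, 2)/dnorm(xp, 0, 2)))
    if (runif(1) < alpha) xc <- xp
    x.ind[ii] <- xc
  }
> hist(x.ind, prob=TRUE, breaks=40)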
3.3 Example: Metropolis–Hastings algorithm
Random walk Metropolis is straightforward to implement:
> N <- 10000
> x.acc5 <- rep.int(NA, N)
> u <- runif(N)
> acc.count <- 0
> std <- 0.05 ## Spread of proposal distribution
> xc <- 0; ## Starting value
> for (ii in 1:N){
    xp <- rnorm(1, mean=xc, sd=std)   ## proposal
    alpha <- min(1, (k(xp)/k(xc)) *
             (dnorm(xc, mean=xp, sd=std)/dnorm(xp, mean=xc, sd=std)))
    x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
    ## find number of accepted proposals:
    acc.count <- acc.count + (u[ii] < alpha)
  }
> ## Fraction of accepted *new* proposals
> acc.count/N
[1] 0.9846
> par(mfrow=c(1,2), mar=c(2,2,1,1))
> plot(x,k(x),type="l")
> barplot(table(round(x.acc5,1))/length(x.acc5))
3.4 Example: capture–recapture
A number n of individuals are caught and marked. In a second sample, m of the marked
individuals are recaptured and u unmarked individuals are caught; the total number U of
unmarked individuals in the population is unknown. Writing θ for the capture probability
we assume
m ~ bin(n, θ), u ~ bin(U, θ)
So we get
p(m | θ) ~ bin(n, θ),    p(u | θ, U) ~ bin(U, θ)
To complete the model specification we must specify prior distributions for θ and U. These
must reflect our prior knowledge of the problem.
The joint density of data (m, u) and the parameters (θ, U) is then
p(m, u, θ, U) ∝ (U choose u) θ^(m+u) (1 − θ)^(n+U−(m+u)) π_θ(θ) π_U(U)
To fit in with the current notation let x1 = θ, x2 = U and x = (x1, x2). Also notice that
data (m, u) is fixed so we need not write that in the posterior.
p*(x1, x2) ∝ (x2 choose u) x1^(m+u) (1 − x1)^(n+x2−(m+u)) π_x1(x1) π_x2(x2) = k(x1, x2)
> logk <- function(x1, x2, n_, m_, u_){
    R_ <- m_ + u_
    R_*log(x1) + (n_+x2-R_)*log(1-x1) + lchoose(x2, u_) +
      log(dunif(x1, .0, .2)) +        ## log prior for x1 = theta: uniform on [0, 0.2]
      log(disc.pmf(x2, 500, 2000))    ## log prior for x2 = U: discrete uniform on 500, ..., 2000
  }
> disc.pmf <- function(x, a, b){      ## discrete uniform pmf on a, a+1, ..., b
    ifelse(x>=a & x<=b, 1/(b-a+1), 0)
  }
> n_ <- 100
> m_ <- 20
> u_ <- 180
> NN <- 10000 ## Number of samples
> u <- runif(NN)
> th.prop <- runif(NN, .0, 0.5)
> U.prop <- sample(300:3000, NN, replace=T)
> out <- matrix(NA, NN,2)
> xc <- c(0.2, 1500)
> acc.count <- 0
> for (i in 1:NN){
    xp <- c(th.prop[i], U.prop[i])    ## independent uniform proposal for (theta, U)
    alpha <- min(1, exp(logk(xp[1], xp[2], n_, m_, u_) -
                        logk(xc[1], xc[2], n_, m_, u_)))
    xc <- if (u[i] < alpha) xp else xc
    out[i, ] <- xc
    acc.count <- acc.count + (u[i] < alpha)
  }
> acc.count / NN ## Not impressive acceptance ratio
[1] 0.0128
> summary(out[,1])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.103 0.158 0.168 0.166 0.180 0.200
> summary(out[,2])
Min. 1st Qu. Median Mean 3rd Qu. Max.
813 991 1090 1110 1200 1750
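Since the total population size in this example is N = n + U (cf. the capture–recapture table later in the notes), its posterior can be summarised directly from the U samples; this line is an added illustration:
> summary(n_ + out[,2])   ## posterior summary of the population size N = n + U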
> par(mfrow=c(1,2), mar=c(2,2,1,1))
> hist(out[,1], prob=T); lines(density(out[,1]), col="red")
> hist(out[,2], prob=T); lines(density(out[,2]), col="red")
[Figure: histograms with density overlays of the posterior samples of θ (out[,1], left) and U (out[,2], right).]
3.5 Quiz
Using the code from the slides, experiment with the following:
• Set m = 2 and u = 18. How does that affect the posterior distribution? What if you
set m = 40 and u = 360?
• Experiment with narrowing and widening the range of the proposal distributions.
What effect does that have on the output?
• Try changing the prior for U to a Poisson distribution. Hint: dpois is your friend.
• Experiment with changing the number of samples. How many do you need to produce
“nice” histograms?
Notice:
• Item 3. can be restated as: with probability α set x2^t equal to the proposed value;
with probability 1 − α set x2^t = x2^(t−1).
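To make the single-component scheme concrete, here is a sketch (not code from the notes) for the capture–recapture example, where theta and U are updated one at a time with symmetric random walk proposals; the proposal widths (0.02 and 20) are arbitrary illustrative choices:
> NN   <- 10000
> out1 <- matrix(NA, NN, 2)
> xc   <- c(0.15, 1000)                  ## current state (theta, U)
> for (i in 1:NN){
    ## update component 1 (theta) given the current U
    thp <- xc[1] + runif(1, -0.02, 0.02)
    if (thp > 0 && thp < 1){
      a1 <- exp(logk(thp, xc[2], n_, m_, u_) - logk(xc[1], xc[2], n_, m_, u_))
      if (runif(1) < a1) xc[1] <- thp
    }
    ## update component 2 (U) given the current theta
    Up <- xc[2] + sample(-20:20, 1)
    a2 <- exp(logk(xc[1], Up, n_, m_, u_) - logk(xc[1], xc[2], n_, m_, u_))
    if (runif(1) < a2) xc[2] <- Up
    out1[i, ] <- xc
  }
> summary(out1[,1]); summary(out1[,2])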
The Gibbs sampler is a special case of single component Metropolis–Hastings, namely the
case where the proposal distribution h2(x2 | x1^t, x2^(t−1), x3^(t−1)) for updating x2 is chosen to be
p(x2 | x1^t, x3^(t−1)).
Hence for the Gibbs sampler the proposed values are always accepted.
One version of the algorithm is as follows. Suppose a sample x^t = (x1^t, x2^t, x3^t) is available.
1. Sample x1^(t+1) ~ p(x1 | x2^t, x3^t)
2. Sample x2^(t+1) ~ p(x2 | x1^(t+1), x3^t)
3. Sample x3^(t+1) ~ p(x3 | x1^(t+1), x2^(t+1))
• The proposed values are always accepted (because α = 1), so the sampler is very
efficient.
• The sampler requires that we can sample from the conditionals p(x_i | x_−i). In some
cases this is easy; in other cases it is difficult. In general, slice sampling can be
used (and this is what JAGS does).
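A minimal self-contained illustration (not from the notes): a Gibbs sampler for a standard bivariate normal with an assumed correlation rho = 0.8, where both full conditional distributions are known normals:
> rho <- 0.8
> N   <- 5000
> xy  <- matrix(NA, N, 2)
> x1 <- x2 <- 0
> for (i in 1:N){
    ## full conditionals of a standard bivariate normal with correlation rho
    x1 <- rnorm(1, mean=rho*x2, sd=sqrt(1 - rho^2))
    x2 <- rnorm(1, mean=rho*x1, sd=sqrt(1 - rho^2))
    xy[i, ] <- c(x1, x2)
  }
> cor(xy)   ## off-diagonal entries should be close to rho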
Notice: k() is practically zero outside [-4, 4] and in this interval k() takes values in, say,
[0, 3].
Slice sampling is based on the following idea: Sample uniformly in a “large enough” window:
> N <- 5000
> xs <- runif(N, -4, 4)
> ys <- runif(N, 0, 3)
> plot(x,k(x), type='l', lwd=2, col=2)
> points(xs,ys, pch=".")
Keeping the x-values of the points that fall below the curve gives (approximate) samples from the normalized target:
> xg <- xs[ys < k(xs)]   ## keep points under the curve
> hist(xg)
The algorithm goes as follows: given a sample x^t, pick y uniformly in [0, k(x^t)].
> xt<-1; y <- runif(1, 0, k(xt))
> plot(x,k(x), type='l', lwd=2, col=2)
> abline(v=xt, col='green'); abline(h=y, col='blue')
Let S = {x : k(x) ≥ y} be the set of x-values for which the point (x, y) lies below the curve.
Sample x^(t+1) uniformly from S.
1. Sample y uniformly from ]0, k(x^t)]. This defines a horizontal "slice" S = {x : k(x) ≥ y}.
2. Find an interval I = [L, R] around x^t that contains (most of) the slice.
3. Sample x^(t+1) uniformly from the part of the slice within I.
The last two steps can be implemented in many ways. We need an interval width w (chosen
by us).
“Stepping out”: Position an interval of length w randomly around x^t and denote it by
I = [L, R]. Expand both ends in steps of size w until both ends are outside the slice, i.e.
until k(L) < y and k(R) < y. Sample x^(t+1) from the part of the slice within I. (That is,
sample uniformly from I; if a sample falls outside S, just sample again.)
“Doubling”: Position an interval of length w around x^t and denote it by I = [L, R]. Double
the interval until both ends are outside the slice, i.e. until k(L) < y and k(R) < y. Sample
x^(t+1) from the part of the slice within I.
> sliceSample_real <- function(k, xc, w){
    kc <- k(xc)
    y  <- runif(1, 0, kc)
    a  <- runif(1)                ## place w randomly around xc
    l  <- xc - a*w
    u  <- xc + (1-a)*w
    kl <- k(l)
    while (kl > y){               ## expand interval to the left if necessary
      l <- l - w; kl <- k(l)
    }
    ku <- k(u)
    while (ku > y){               ## expand interval to the right if necessary
      u <- u + w; ku <- k(u)
    }
    xp <- runif(1, l, u)          ## propose xp
    kp <- k(xp)
    while (kp < y){               ## shrink interval when xp falls outside the slice
      if (xp < xc) l <- xp else u <- xp
      xp <- runif(1, l, u)
      kp <- k(xp)
    }
    xp
  }
> N <- 3000
> out <- rep.int(NA, N)
> x <- 1
> for (i in 1:N){
    x <- sliceSample_real(k, x, w=1)
    out[i] <- x
  }
> hist( out )
• Sampling on a discrete set a, a + 1, a + 2, . . . , a + b (an integer-valued variable) can be
handled with the same slice sampling idea; see the sketches below.
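The Gibbs/slice code for the capture–recapture example further below uses sliceSample_unit() (for a variable on [0, 1]) and sliceSample_int() (for an integer-valued variable). Minimal sketches of how such samplers could look, assuming they follow the same stepping-out/shrinkage scheme as sliceSample_real():
> sliceSample_unit <- function(k, xc, w){
    ## slice sampling on [0, 1]: as sliceSample_real, but the interval is clipped to [0, 1]
    y <- runif(1, 0, k(xc))
    a <- runif(1)
    l <- max(0, xc - a*w); u <- min(1, xc + (1-a)*w)
    while (l > 0 && k(l) > y) l <- max(0, l - w)
    while (u < 1 && k(u) > y) u <- min(1, u + w)
    repeat{
      xp <- runif(1, l, u)
      if (k(xp) >= y) return(xp)
      if (xp < xc) l <- xp else u <- xp          ## shrink towards xc
    }
  }
> sliceSample_int <- function(k, xc, w){
    ## slice sampling on the integers: interval ends and proposals are integer-valued
    y <- runif(1, 0, k(xc))
    l <- xc - (sample.int(w, 1) - 1)             ## place w integers randomly around xc
    u <- l + w
    while (k(l) > y) l <- l - w                  ## step out to the left
    while (k(u) > y) u <- u + w                  ## step out to the right
    repeat{
      xp <- l + sample.int(u - l + 1, 1) - 1     ## uniform on the integers l, ..., u
      if (k(xp) >= y) return(xp)
      if (xp < xc) l <- xp + 1 else u <- xp - 1  ## shrink towards xc
    }
  }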
The data for the capture–recapture example can be summarised in the following table:

             recaptured   not recaptured   total
  marked     m = 20       n - m = 80       n = 100
  unmarked   u = 180      ?                U (unknown)
  total      R = 200      ?                N (unknown)
So we get
p(m | θ) ~ bin(n, θ),    p(u | θ, U) ~ bin(U, θ)
To complete the model specification we must specify prior distributions for θ and U. These
must reflect our prior knowledge of the problem.
The joint density of data (m, u) and the parameters (θ, U) is then
p(m, u, θ, U) ∝ (U choose u) θ^(m+u) (1 − θ)^(n+U−(m+u)) π_θ(θ) π_U(U)
To fit in with the current notation let x1 = θ, x2 = U and x = (x1, x2). Also notice that
data (m, u) is fixed so we need not write that in the posterior.
p*(x1, x2) ∝ (x2 choose u) x1^(m+u) (1 − x1)^(n+x2−(m+u)) π_x1(x1) π_x2(x2) = k(x1, x2)
> k <- function(x1, x2, n_, m_, u_){
    R_ <- m_ + u_
    z <- R_*log(x1) + (n_+x2-R_)*log(1-x1) + lchoose(x2, u_) +
      + log(dunif(x1, .0, .2)) + log(disc.pmf(x2, 500, 2000))
    exp(z)
  }
> disc.pmf <- function(x, a, b){
    ifelse(x>=a & x<=b, 1/(b-a+1), 0)
  }
> n_ <- 100
> m_ <- 20
> u_ <- 180
> library(doBy)
> kk <- specialize(k, list(n_=n_, m_=m_, u_=u_))
> # Now kk is function only of x1, x2
> args(kk)
function (x1, x2)
NULL
> kk
function (x1, x2)
{
R_ <- 20 + 180
z <- R_ * log(x1) + (100 + x2 - R_) * log(1 - x1) + lchoose(x2,
180) + +log(dunif(x1, 0, 0.2)) + log(disc.pmf(x2, 500,
2000))
exp(z)
}
<environment: 0x07e1c188>
> N <- 10000
> x1t <- .1 # initial values
> x2t <- 1000 # initial values
> out <- matrix(NA, N, 2)
> kk1 <- specialize(kk, list(x2=x2t)); kk1
function (x1)
{
R_ <- 20 + 180
z <- R_ * log(x1) + (100 + 1000 - R_) * log(1 - x1) + lchoose(1000,
180) + +log(dunif(x1, 0, 0.2)) + log(disc.pmf(1000, 500,
2000))
exp(z)
}
<environment: 0x07db16e0>
> kk2 <- specialize(kk, list(x1=x1t)); kk2
function (x2)
{
R_ <- 20 + 180
z <- R_ * log(0.1) + (100 + x2 - R_) * log(1 - 0.1) + lchoose(x2,
180) + +log(dunif(0.1, 0, 0.2)) + log(disc.pmf(x2, 500,
2000))
exp(z)
}
<environment: 0x07d95200>
> for (i in 1:N){
    x1t <- sliceSample_unit(kk1, x1t, w=1)     ## slice update of x1 = theta given x2
    kk2 <- specialize(kk, list(x1=x1t))
    x2t <- sliceSample_int(kk2, x2t, w=10)     ## slice update of x2 = U given x1
    kk1 <- specialize(kk, list(x2=x2t))
    out[i,] <- c(x1t, x2t)
  }
> par(mfrow=c(1,2))
> z<-apply(out, 2, hist)
[Figure: histograms of the sampled values of x1 = θ (left) and x2 = U (right).]
We need to sample x^(t+1) uniformly from the slice S = {x : k(x) ≥ y}. But this is the same
as sampling x^(t+1) uniformly from S = {x : log k(x) ≥ log y = z}, which is numerically much
more convenient when k takes very small values.
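A log-scale variant of sliceSample_real() could therefore look as follows (a sketch; the function name is hypothetical). It takes a function lk returning log k, and assumes lk returns -Inf (rather than NaN) outside the support:
> sliceSample_log <- function(lk, xc, w){
    ## same stepping-out/shrinkage scheme as sliceSample_real(), but all
    ## comparisons are made on the log scale; z plays the role of log(y)
    z <- lk(xc) - rexp(1)        ## equivalent to log(runif(1, 0, k(xc)))
    a <- runif(1)
    l <- xc - a*w
    u <- xc + (1-a)*w
    while (lk(l) > z) l <- l - w
    while (lk(u) > z) u <- u + w
    repeat{
      xp <- runif(1, l, u)
      if (lk(xp) >= z) return(xp)
      if (xp < xc) l <- xp else u <- xp
    }
  }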