Walker 2016 SliceSampling
Stephen G. Walker
May 2006
1. Introduction. The aim of this paper is to introduce a new method for
sampling the well known and widely used mixture of Dirichlet process (MDP)
model. There have been a number of recent contributions to the literature on
this problem, notably Ishwaran & Zarepour (2000) and Papaspiliopoulos &
Roberts (2005). These papers have been concerned with sampling the MDP
model while retaining the random distribution functions.
The source of the complexity is the countable infiniteness of the discrete
masses of the random distribution functions chosen from the
Dirichlet process prior. Ishwaran & Zarepour (2000) circumvent this with an
approximate method based on a truncation of the distributions. Motivated
by the work of Ishwaran & Zarepour (2000), Papaspiliopoulos & Roberts
(2005) proposed an exact algorithm based on the notion of retrospective
sampling. However, the algorithm itself becomes non-trivial when applied
to the MDP model, and involves setting up a detailed balance criterion
with connecting proposal moves (Green, 1995). On the other hand, we find a
simple trick, based on the slice sampling schemes (Damien et al., 1999), which
deals with the infiniteness. The introduction of latent variables makes finite
the part of the random distribution function required to iterate through a
Gibbs sampler. Moreover, all the conditional distributions are easy to sample
and no accept/reject methods are needed.
The first sampler for the MDP model, based on a Gibbs sampler, was
given in the PhD thesis of Escobar (1988); see also Escobar (1994). Alternative approaches
have been proposed by MacEachern (1994) and co-authors; for example,
MacEachern & Müller (1998). A recent survey is given in MacEachern (1998),
and other papers in the book of Dey et al. (1998), and by Müller & Quintana (2004).
Richardson & Green (1997) provide a comparison with more
traditional mixture models and Neal (2000) also discusses ideas for sampling
the MDP model.
Recently, Ishwaran & James (2001) developed a Gibbs sampling scheme
involving more general stick-breaking priors, which is a direct extension of
the Escobar (1988) approach. Escobar's Gibbs sampler makes use of the
Pólya-urn sampling scheme (Blackwell & MacQueen, 1973), and the use of
the Pólya-urn scheme is connected with integrating the random distribution
function from the Dirichlet process out of the model. Recent attempts have
avoided this step and retained the random distribution functions in the
algorithms, notably Ishwaran & Zarepour (2000) and Papaspiliopoulos &
Roberts (2005).
In Section 2 we describe the Dirichlet process mixture model and the
latent variables used in the sampling strategy. In Section 3 we write
down the algorithm for the Gibbs sampler, and Section 4 contains a couple
of illustrative examples. Finally, Section 5 concludes with a brief discussion.
Tiwari (1982), see also Sethuraman (1994), and involving the so-called stick-
breaking prior (see, for example, Freedman, 1963; Connor & Mosimann,
1969). Take v1 , v2 , . . . to be independent and identically distributed beta(1, c)
variables and take θ1 , θ2 , . . . to be independent and identically distributed
from P0 , which we will assume has density g0 with respect to the Lebesgue
measure. Then define
$$P = \sum_{j=1}^{\infty} w_j \,\delta_{\theta_j}.$$
Here $\delta_\theta$ denotes the measure with a point mass of 1 at $\theta$. The weights are
obtained via what is known as a stick-breaking procedure: $w_1 = v_1$ and, for $j > 1$,
$w_j = v_j \prod_{l<j}(1 - v_l)$. Ishwaran & James
(2001) consider a more general model with the $v_j \sim \mathrm{beta}(a_j, b_j)$ and show
that the sum of the weights is 1 almost surely when
$$\sum_{j=1}^{\infty} \log(1 + a_j/b_j) = +\infty.$$
While we work with the v’s which lead to the Dirichlet process, our algorithm
for sampling the MDP model can be extended to cover other stick-breaking
prior distributions in a simple way. This will be elaborated on later in the
paper.
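To fix ideas, here is a minimal sketch in Python/NumPy (illustrative only, not the code used for the paper) of the stick-breaking construction just described, truncated at an arbitrary number of atoms purely for display; the standard normal base measure is chosen only for the example.

import numpy as np

def stick_breaking(c, P0_sampler, n_atoms, rng):
    # v_j ~ beta(1, c) i.i.d.; theta_j ~ P0 i.i.d.; w_j = v_j * prod_{l<j} (1 - v_l).
    # The truncation at n_atoms is for display only; the process itself is infinite.
    v = rng.beta(1.0, c, size=n_atoms)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    theta = P0_sampler(n_atoms, rng)
    return v, w, theta

rng = np.random.default_rng(0)
v, w, theta = stick_breaking(c=1.0, P0_sampler=lambda n, r: r.normal(0.0, 1.0, n),
                             n_atoms=25, rng=rng)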
The MDP model is based on the idea of constructing absolutely contin-
uous random distribution functions and was first considered in Lo (1984).
The random distribution function chosen from a Dirichlet process is almost
surely discrete (Blackwell, 1973). Consequently, consider the random density
function
$$f_P(y) = \int N(y|\theta)\, dP(\theta).$$
Here $N(y|\theta)$ denotes a conditional density function, which will typically be a
normal distribution, the parameters of which are represented by $\theta$; so in
the normal case $\theta = (\mu, \sigma^2)$. Given the form for $P$, we can write
$$f_{w,\theta}(y) = \sum_j w_j\, N(y|\theta_j).$$
The prior distributions for the w and θ have been given earlier.
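As a small illustration, the mixture density above can be evaluated over whatever atoms have been generated so far; a sketch assuming, purely for illustration, unit-variance normal kernels so that $\theta_j$ is just a mean.

import numpy as np

def mixture_density(y, w, theta):
    # f_{w,theta}(y) = sum_j w_j N(y | theta_j), evaluated over the atoms generated so far;
    # unit-variance normal kernels are assumed here for simplicity.
    return np.sum(w * np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2.0 * np.pi))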
Our approach to estimating the model, via Gibbs sampling ideas, is to introduce
a latent variable $u$ such that the joint density of $(y, u)$ given $(w, \theta)$ is given by
$$f_{w,\theta}(y, u) = \sum_j \mathbf{1}(u < w_j)\, N(y|\theta_j).$$
If we let
$$A_w(u) = \{j : w_j > u\},$$
then, given $u$, the joint density becomes a finite sum over $j \in A_w(u)$.
Note, it is quite clear that $A_w(u)$ is a finite set for all $u > 0$. The conditional
density of $y$ given $u$ is given by
$$f_{w,\theta}(y|u) = \frac{1}{f_w(u)} \sum_{j \in A_w(u)} N(y|\theta_j),$$
where $f_w(u) = \sum_j \mathbf{1}(u < w_j)$ is the marginal density for $u$, being defined on
$(0, w^*)$, where $w^*$ is the largest $w_j$.
The usefulness of the latent variable u will become clear later on. A brief
comment here is that the move from an infinite sum to a finite sum, given u,
is going to make a lot of difference when sampling is involved.
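The set $A_w(u)$ is trivial to compute for any finite collection of generated weights; a sketch (the helper name is hypothetical):

import numpy as np

def active_set(w, u):
    # A_w(u) = {j : w_j > u}; finite for any u > 0 since the weights sum to 1.
    return np.flatnonzero(w > u)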
So, given $u$, we have a finite mixture model with equal weights, all equal
to $1/f_w(u)$. We can now introduce a further latent variable, an indicator $k$, which
will identify the component of the mixture from which $y$ is to be taken.
Therefore, consider the joint density
$$f_{w,\theta}(y, u, k) = \mathbf{1}(u < w_k)\, N(y|\theta_k).$$
As has been mentioned, we already know the prior distributions for the w
and θ. Though as it happens, we will use the v’s rather than the w’s when
it comes to sampling.
respectively. Let us proceed to consider the full conditional densities; listed
A to E.
A. We will start with the $u_i$'s. These are easy to find and are the uniform
distributions on the interval $(0, w_{k_i})$.
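In code, step A is a single vectorised draw; a sketch assuming w is the array of weights and k the integer array of current allocations $k_1, \ldots, k_n$.

import numpy as np

def sample_slice_variables(w, k, rng):
    # Step A: u_i | ... ~ Uniform(0, w_{k_i}) for each observation i.
    return rng.uniform(0.0, w[k])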
B. Next we have $\theta_j$, and this is easily seen to have density function given,
up to a constant of proportionality, by
$$f(\theta_j | \cdots) \propto g_0(\theta_j) \prod_{k_i = j} N(y_i|\theta_j).$$
C. Slightly harder, but quite do-able, is the sampling of the vj ’s. For the
joint full conditional density we have
$$f(v| \cdots) \propto \pi(v) \prod_{i=1}^{n} \mathbf{1}(w_{k_i} > u_i),$$
where $\pi(v)$ denotes the prior density of the collection of independent beta variables,
and we have already given the relation between the $w_j$'s and the $v_j$'s. Hence,
$$f(v| \cdots) \propto \pi(v) \prod_{i=1}^{n} \mathbf{1}\!\left( v_{k_i} \prod_{l<k_i}(1 - v_l) > u_i \right).$$
It is quite evident from this that only the $v_j$'s for $j \le k^*$, where $k^*$ is the
maximum of $\{k_1, \ldots, k_n\}$, will be affected; that is, for $j > k^*$, we have
$f(v_j | \cdots) = \mathrm{beta}(1, c)$. For $j \le k^*$, the full conditional is a
beta$(1, c)$ density restricted to the interval $(\alpha_j, \beta_j)$,
where
$$\alpha_j = \max_{\{i:\, k_i = j\}} \left\{ \frac{u_i}{\prod_{l<j}(1 - v_l)} \right\}$$
and
$$\beta_j = 1 - \max_{\{i:\, k_i > j\}} \left\{ \frac{u_i}{v_{k_i} \prod_{l<k_i,\, l \neq j}(1 - v_l)} \right\}.$$
Then the distribution function, on αj < vj < βj , is given by
$$F(v_j) = \frac{(1 - \alpha_j)^c - (1 - v_j)^c}{(1 - \alpha_j)^c - (1 - \beta_j)^c},$$
and so a sample can be taken via the inverse cdf technique. Clearly, it is
now evident that this approach covers more general stick-breaking models;
it is no more difficult to sample a truncated beta variable when we have
vj ∼ beta(aj , bj ) as the priors.
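A sketch of the inverse-cdf draw for a beta$(1, c)$ variable truncated to $(\alpha_j, \beta_j)$, using the distribution function displayed above (illustrative code, not the author's Scilab implementation):

import numpy as np

def truncated_beta_1c(alpha, beta, c, rng):
    # Invert F(v) = [(1-alpha)^c - (1-v)^c] / [(1-alpha)^c - (1-beta)^c] on (alpha, beta).
    a = (1.0 - alpha) ** c
    b = (1.0 - beta) ** c
    r = rng.uniform()
    return 1.0 - (a - r * (a - b)) ** (1.0 / c)

Setting r = 0 and r = 1 returns $\alpha_j$ and $\beta_j$ respectively, so the draw always lands in the required interval.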
D. We also need to sample the $k_i$'s; from the joint density above,
$P(k_i = j | \cdots) \propto \mathbf{1}(w_j > u_i)\, N(y_i|\theta_j)$, so for each $i$ we need to
know the set $\{j : w_j > u_i\}$. So, to cover all the $i$'s, we find the smallest $k^*$ such that
$$\sum_{j=1}^{k^*} w_j > 1 - u^*,$$
where $u^* = \min\{u_1, \ldots, u_n\}$. Hence, we now know how many of the $w_j$'s we
need to sample in order for the chain to proceed; it is $\{w_1, \ldots, w_{k^*}\}$. It is this
$k^*$ that must be found in order to implement the algorithm: one needs to know
how many of the $w_j$ are larger than $u$, and it is only at $k^*$ that one knows for
sure that all have been found. Hence, $k^*$ is not a loose approximation; it is
an exact piece of information.
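A sketch of this step: extend the stick-breaking representation until the displayed condition holds, and then draw each $k_i$ over its finite set of candidates. The function names are illustrative, and unit-variance normal kernels are assumed for the likelihood terms, an assumption of this sketch only.

import numpy as np

def extend_sticks(v, theta, w, u_star, c, P0_sampler, rng):
    # Grow v, theta, w until sum(w) > 1 - u_star, i.e. up to k*.
    v, theta, w = list(v), list(theta), list(w)
    while sum(w) <= 1.0 - u_star:
        v_new = rng.beta(1.0, c)
        w.append(v_new * np.prod(1.0 - np.array(v)))   # w_j = v_j * prod_{l<j} (1 - v_l)
        v.append(v_new)
        theta.append(P0_sampler(1, rng)[0])
    return np.array(v), np.array(theta), np.array(w)

def sample_allocations(y, u, w, theta, rng):
    # Step D: P(k_i = j | ...) proportional to 1(w_j > u_i) N(y_i | theta_j),
    # with N(.|theta_j) taken as a unit-variance normal for this sketch.
    k = np.empty(len(y), dtype=int)
    for i in range(len(y)):
        cand = np.flatnonzero(w > u[i])
        p = np.exp(-0.5 * (y[i] - theta[cand]) ** 2)
        k[i] = rng.choice(cand, p=p / p.sum())
    return k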
For the prior model we have
$$k^* \sim 1 + \mathrm{Poisson}(-c \log u^*).$$
E. Finally, there is the sampling of $c$; its full conditional involves the usual
gamma function $\Gamma(\cdot)$, and a nice way to sample it is given in Escobar & West (1995)
when $\pi(c)$ is a gamma distribution.
Hence, all the conditional densities are easy to sample and the Markov chain
we have constructed is automatic. It requires no tuning nor retrospective
steps.
For density estimation we would like to sample from the predictive distribution,
with density $f(y_{n+1}|y_1, \ldots, y_n)$.
At each iteration we have the $(w_j, \theta_j)$'s and we sample a $\theta_j$ using the weights. The
idea is to sample a uniform random variable $r$ from the unit interval and
to take that $\theta_j$ for which $W_{j-1} < r < W_j$, where $W_j = \sum_{l \le j} w_l$ and $W_0 = 0$.
If more weights are required than currently exist then it is straightforward to sample more,
as we know the additional $v_j$'s for $j > k^*$ are independent and identically
distributed from beta$(1, c)$ and the additional $\theta_j$'s are independent and identically
distributed from $g_0$. Having taken $\theta_j$, we draw $y_{n+1}$ from $N(\cdot|\theta_j)$.
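A sketch of a single predictive draw along these lines; the stick-breaking representation is extended on the fly if the uniform variable r falls beyond the weights generated so far, and a unit-variance normal kernel is again assumed purely for illustration.

import numpy as np

def sample_predictive(v, w, theta, c, P0_sampler, rng):
    # Pick a component by inverting the cumulative weights, then draw y_{n+1} from N(.|theta_j).
    v, w, theta = list(v), list(w), list(theta)
    r = rng.uniform()
    cum, j = 0.0, -1
    while True:
        j += 1
        if j == len(w):
            # need more atoms: v_j ~ beta(1, c) and theta_j ~ P0, as in the text
            v_new = rng.beta(1.0, c)
            w.append(v_new * np.prod(1.0 - np.array(v)))
            v.append(v_new)
            theta.append(P0_sampler(1, rng)[0])
        cum += w[j]
        if r < cum:
            break
    return rng.normal(theta[j], 1.0)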
where $\xi_j = \sum_{k_i = j} y_i$ and $m_j = \sum_{k_i = j} 1$.
We also have
$$f(\lambda_j| \cdots) = \mathrm{Ga}\!\left(\epsilon + m_j/2,\ \epsilon + d_j/2\right),$$
where $d_j = \sum_{k_i = j} (y_i - \theta_j)^2$.
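A sketch of the resulting conjugate updates. The text above gives the Ga$(\epsilon + m_j/2,\ \epsilon + d_j/2)$ update for $\lambda_j$; the update for $\theta_j$ shown here additionally assumes that $g_0$ places a N$(0, 1/\tau)$ prior on the mean, which is an assumption made here for illustration.

import numpy as np

def update_component_params(y, k, lam, eps, tau, rng):
    # Conjugate updates assuming y_i | k_i = j ~ N(theta_j, 1/lambda_j), theta_j ~ N(0, 1/tau)
    # and lambda_j ~ Ga(eps, eps) a priori (the normal prior on theta_j is an assumption here).
    J = int(k.max()) + 1
    theta = np.empty(J)
    for j in range(J):
        yj = y[k == j]
        m_j, xi_j = len(yj), yj.sum()
        prec = tau + m_j * lam[j]                       # posterior precision of theta_j
        theta[j] = rng.normal(lam[j] * xi_j / prec, 1.0 / np.sqrt(prec))
        d_j = np.sum((yj - theta[j]) ** 2)
        lam[j] = rng.gamma(eps + m_j / 2.0, 1.0 / (eps + d_j / 2.0))   # shape, scale = 1/rate
    return theta, lam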
In the simulated data set example that follows, the code was written in
Scilab, which is freely downloadable from the internet.
We sampled 50 random variables independently from the mixture of nor-
mal distributions given by
$$f(y) = \tfrac{1}{3} N(y|{-4}, 1) + \tfrac{1}{3} N(y|0, 1) + \tfrac{1}{3} N(y|8, 1).$$
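The simulated data can be reproduced along the following lines (a Python sketch rather than the original Scilab code; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(1)
n = 50
means = np.array([-4.0, 0.0, 8.0])
labels = rng.integers(0, 3, size=n)      # each component chosen with probability 1/3
y = rng.normal(means[labels], 1.0)       # unit-variance normal kernels, as in f(y) above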
5. Discussion. We have provided a simple and fast way to sample the MDP
model. The key is the introduction of the latent variables which effectively truncate the
weights of the random distribution chosen from the Dirichlet process. The code is
very simple to write, and the sampler is direct in the sense that no accept/reject
sampling or retrospective sampling is required. It is also remarkably quick to run. It
improves on current approaches in the following way: we know exactly how
many of the $w_j$'s and $\theta_j$'s we need to sample at each iteration, namely $k^*$. This
fundamental result eludes the alternative approaches.
Retaining the random distribution function is useful as it removes the
dependence between the $\theta_{k_i}$'s which exists in the Pólya-urn model. However,
retaining the random distributions leads to problems with the countably
infinite representation. In this paper we deal with it by introducing a latent
variable which makes the representation finite for the purposes of proceeding
with the sampling and allowing sampling from the predictive distribution.
The full conditional of the latent variable, given the other variables, is a uniform
distribution.
In the non-conjugate case, that is, when $N(y|\theta)$ and $g_0(\theta)$ form a non-conjugate
pair and the full conditional for $\theta_j$ is perhaps difficult to sample, a possibly
useful solution is again provided by the latent variable ideas presented in Damien et al.
(1999, Sections 4 & 5).
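For instance, a generic univariate slice-sampling update can be used for the $\theta_j$ full conditional when it can only be evaluated up to a constant; the stepping-out scheme sketched below is a standard variant of the auxiliary-variable idea rather than the specific construction of Damien et al. (1999).

import numpy as np

def slice_update(theta, log_target, rng, width=1.0):
    # One slice-sampling update of a scalar theta under an unnormalised log-density.
    log_u = log_target(theta) + np.log(rng.uniform())   # auxiliary slice level
    left = theta - width * rng.uniform()                 # randomly positioned initial interval
    right = left + width
    while log_target(left) > log_u:                      # step out until the interval covers the slice
        left -= width
    while log_target(right) > log_u:
        right += width
    while True:                                          # shrink until a point on the slice is found
        prop = rng.uniform(left, right)
        if log_target(prop) > log_u:
            return prop
        if prop < theta:
            left = prop
        else:
            right = prop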
References
Blackwell, D. (1973). Discreteness of Ferguson selections. The Annals of Statistics 1, 356–358.
Connor, R.J. & Mosimann, J.E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association 64, 194–206.
Damien, P., Wakefield, J.C. & Walker, S.G. (1999). Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society, Series B 61, 331–344.
Dey, D., Sinha, D. & Müller, P. (1998). Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York.
Escobar, M.D. (1988). Estimating the means of several normal populations by nonparametric estimation of the distribution of the means. Unpublished PhD thesis, Yale University.
Escobar, M.D. & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.
Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
MacEachern, S.N. (1998). Computational methods for mixture of Dirichlet process models. In Practical Nonparametric and Semiparametric Bayesian Statistics (eds D. Dey, D. Sinha & P. Müller). Springer, New York.
Neal, R.M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 249–265.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.
[Figure: density estimate for the simulated data set; horizontal axis "data", vertical axis "Density".]
[Figure: number of clusters against iterations of the sampler; horizontal axis "iterations", vertical axis "clusters".]