
INTERNATIONAL CENTRE FOR ECONOMIC RESEARCH

WORKING PAPER SERIES

Stephen G. Walker

SAMPLING THE DIRICHLET MIXTURE MODEL WITH SLICES

Working Paper No. 16/2006


SAMPLING THE DIRICHLET MIXTURE
MODEL WITH SLICES

Stephen G. Walker

Institute of Mathematics, Statistics and Actuarial Science
University of Kent, Canterbury, Kent, CT2 7NZ, U.K.

May 2006

Abstract: We provide a new approach to the sampling of the well known mixture of Dirichlet process model. Recent attention has focused on retaining the random distribution function in the model, but sampling algorithms have then suffered from the countably infinite representation these distributions have. The key to the algorithm detailed in this paper, which also keeps the random distribution functions, is the introduction of a latent variable which allows a known, finite number of objects to be sampled within each iteration of a Gibbs sampler.

Keywords: Bayesian Nonparametrics, Density estimation, Dirichlet process, Gibbs sampler, Slice sampling.

Acknowledgements: The author is an EPSRC Advanced Research Fellow and the paper was written during a visit to the University of Turin funded by ICER.

1. Introduction. The aim of this paper is to introduce a new method for
sampling the well known and widely used mixture of Dirichlet process (MDP)
model. There have been a number of recent contributions to the literature on
this problem, notably Ishwaran & Zarepour (2000) and Papaspiliopoulos &
Roberts (2005). These papers have been concerned with sampling the MDP
model while retaining the random distribution functions.
The issue, and the cause of the complexity, is the countable infiniteness
of the discrete masses of the random distribution functions chosen from the
Dirichlet process prior. Ishwaran & Zarepour (2000) circumvent this with an
approximate method based on a truncation of the distributions. Motivated
by the work of Ishwaran & Zarepour (2000), Papaspiliopoulos & Roberts
(2005) proposed an exact algorithm based on the notion of retrospective
sampling. However, the algorithm itself becomes non-trivial when applied
to the MDP model, and involves setting up a detailed balance criterion
with connecting proposal moves (Green, 1995). On the other hand, we find a
simple trick, based on the slice sampling schemes (Damien et al., 1999), which
deals with the infiniteness. The introduction of latent variables makes finite
the part of the random distribution function required to iterate through a
Gibbs sampler. Moreover, all the conditional distributions are easy to sample
and no accept/reject methods are needed.
The first sampler for the MDP model, based on a Gibbs sampler, was
given in the PhD thesis of Escobar (1988); see also Escobar (1994). Alternative approaches
have been proposed by MacEachern (1994) and co-authors; for example,
MacEachern & Müller (1998). A recent survey is given in MacEachern (1998);
other relevant papers appear in the book of Dey et al. (1998) and in Müller &
Quintana (2004). Richardson & Green (1997) provide a comparison with more
traditional mixture models and Neal (2000) also discusses ideas for sampling
the MDP model.
Recently, Ishwaran & James (2001) developed a Gibbs sampling scheme
involving more general stick-breaking priors, which is a direct extension of
the Escobar (1994) approach. Escobar's Gibbs sampler makes use of the
Pólya-urn sampling scheme (Blackwell & MacQueen, 1973), and the use of
the Pólya-urn scheme is connected with integrating the random distribution
function of the Dirichlet process out of the model.
Recent attempts have avoided this step and retained the random distribu-
tion functions in the algorithms, notably Ishwaran & Zarepour (2000) and
Papaspiliopoulos & Roberts (2005).
In Section 2 we describe the Dirichlet process mixture model and the latent
variables used in the sampling strategy. In Section 3 we will write
down the algorithm for the Gibbs sampler and Section 4 contains a couple
of illustrative examples. Finally, Section 5 concludes with a brief discussion.

2. The Dirichlet Process Model. Let D(c, P0) denote a Dirichlet process
prior (Ferguson, 1973) with scale parameter c > 0 and prior probability
measure P0. So, for example, E(P) = P0 and

Var(P(A)) = \frac{P_0(A)\{1 - P_0(A)\}}{c + 1}
for all appropriate sets A. The posterior distribution of P given n indepen-
dent and identically distributed samples from P is also a Dirichlet process
with new parameters c + n and
\frac{c P_0 + n P_n}{c + n},
where Pn is the empirical distribution function. However, we will not be
needing this particular result.
It is well known that a random probability measure P can be chosen from
D(c, P0 ) via the following sampling scheme, attributable to Sethuraman &
Tiwari (1982), see also Sethuraman (1994), and involving the so-called stick-
breaking prior (see, for example, Freedman, 1963; Connor & Mosimann,
1969). Take v1 , v2 , . . . to be independent and identically distributed beta(1, c)
variables and take θ1 , θ2 , . . . to be independent and identically distributed
from P0 , which we will assume has density g0 with respect to the Lebesgue
measure. Then define

P = \sum_{j=1}^{\infty} w_j \delta_{\theta_j},

where w_1 = v_1 and, for j > 1,

w_j = v_j \prod_{l < j} (1 - v_l).

Here δθ denotes the measure with a point mass of 1 at θ. The weights are
obtained via what is known as a stick-breaking procedure. Ishwaran & James
(2001) consider a more general model with the vj ∼ beta(aj , bj ) and show
that the sum of weights is 1 almost surely when

\sum_{j=1}^{\infty} \log(1 + a_j/b_j) = +\infty.

While we work with the v’s which lead to the Dirichlet process, our algorithm
for sampling the MDP model can be extended to cover other stick-breaking
prior distributions in a simple way. This will be elaborated on later in the
paper.
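As a concrete illustration of the construction (not the paper's code), the following Python sketch draws the first K weights and atoms; the choice of g0 as a standard normal and the value of K are assumptions made purely for illustration, since the algorithm developed below never requires a fixed truncation.

import numpy as np

def stick_breaking(c, K, rng=None):
    # Draw the first K stick-breaking weights w_j = v_j * prod_{l<j}(1 - v_l)
    # with v_j ~ beta(1, c), and atoms theta_j ~ g0 (here: standard normal,
    # an illustrative choice only).
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, c, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    theta = rng.standard_normal(K)
    return v, w, theta

v, w, theta = stick_breaking(c=1.0, K=50)
print(w.sum())  # strictly below 1, approaching 1 as K grows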
The MDP model is based on the idea of constructing absolutely contin-
uous random distribution functions and was first considered in Lo (1984).
The random distribution function chosen from a Dirichlet process is almost
surely discrete (Blackwell, 1973). Consequently, consider the random density
function
f_P(y) = \int N(y \mid \theta) \, dP(\theta).
Here N(y|θ) denotes a conditional density function, which will typically be a
normal distribution and the parameters of which are represented by θ. So in
the normal case θ = (µ, σ 2 ). Given the form for P , we can write
f_{w,\theta}(y) = \sum_j w_j N(y \mid \theta_j).

The prior distributions for the w and θ have been given earlier.
Our approach to estimating the model, via Gibbs sampling ideas, is to
introduce a latent variable u such that the joint density of (y, u) given
(w, θ) is given by
f_{w,\theta}(y, u) = \sum_j 1(u < w_j) \, N(y \mid \theta_j).

Clearly, integrating over u with respect to the Lebesgue measure returns us
the desired density f_{w,θ}(y). Hence, the joint density exists and so there will
also exist a marginal density for u. Alternatively we can write

f_{w,\theta}(y, u) = \sum_{j=1}^{\infty} w_j \, U(u \mid 0, w_j) \, N(y \mid \theta_j)

and so, with probability wj, y and u are independent and are, respectively,
normally and uniformly distributed. Hence, the marginal density for u is given
by

f_w(u) = \sum_{j=1}^{\infty} w_j \, U(u \mid 0, w_j) = \sum_{j=1}^{\infty} 1(u < w_j).

If we let
Aw (u) = {j : wj > u}

then we can equally write


f_{w,\theta}(y, u) = \sum_{j \in A_w(u)} N(y \mid \theta_j).
Note, it is quite clear that Aw (u) is a finite set for all u > 0. The conditional
density of y given u is given by
f_{w,\theta}(y \mid u) = \frac{1}{f_w(u)} \sum_{j \in A_w(u)} N(y \mid \theta_j),

where f_w(u) = \sum_j 1(u < w_j) is the marginal density for u, defined on
(0, w^*), where w^* is the largest w_j.
The usefulness of the latent variable u will become clear later on. A brief
comment here is that the move from an infinite sum to a finite sum, given u,
is going to make a lot of difference when sampling is involved.
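A small sketch (an illustration, not the paper's code) of the set Aw(u): for any u > 0 only finitely many weights can exceed u, so the set returned below is always finite even though P has infinitely many atoms.

import numpy as np

def slice_set(w, u):
    # A_w(u) = {j : w_j > u}, restricted to the weights sampled so far.
    return np.flatnonzero(w > u)

# With the (w, theta) drawn in the earlier sketch and, say, u = 0.05,
# A_w(u) typically contains only a handful of indices.
print(slice_set(w, 0.05))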
So, given u, we have a finite mixture model with equal weights, all equal
to 1/fw (u). We can now introduce a further indicator latent variable which
will identify the component of the mixture from which y is to be taken.
Therefore, consider the joint density

f_{w,\theta}(y, \delta = k, u) = N(y \mid \theta_k) \, 1(k \in A_w(u)).

The complete data likelihood based on a sample of size n is easily seen to be


l_{w,\theta}(\{y_i, u_i, \delta_i = k_i\}_{i=1}^{n}) = \prod_{i=1}^{n} N(y_i \mid \theta_{k_i}) \, 1(u_i < w_{k_i}).

As has been mentioned, we already know the prior distributions for the w
and θ. Though as it happens, we will use the v’s rather than the w’s when
it comes to sampling.
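For completeness, here is a hedged sketch of the complete data likelihood above, taking the kernel N(y|θ) to be a unit-variance normal purely for illustration; the function name and flat data structures are assumptions, not part of the model.

import numpy as np
from scipy.stats import norm

def complete_data_loglik(y, u, k, w, theta):
    # log of prod_i N(y_i | theta_{k_i}) 1(u_i < w_{k_i}),
    # with a unit-variance normal kernel as an illustrative choice.
    if np.any(u >= w[k]):
        return -np.inf  # some indicator 1(u_i < w_{k_i}) is violated
    return norm.logpdf(y, loc=theta[k]).sum()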

3. The Sampling Algorithm. In order to implement a Gibbs sampler we
require the set of full conditional density functions. For the infinite collection
of variables v and θ, it would seem that we would need to sample the entire
set. But this is not required. We only need to sample a finite set of them at
each stage in order to progress to the next iteration. All un-sampled vj ’s and
θj ’s will be independent samples from the priors; that is beta(1, c) and g0 ,
respectively. Let us proceed to consider the full conditional densities; listed
A to E.

A. We will start with the ui ’s. These are easy to find and are the uniform
distributions on the interval
(0, wki ).
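As a sketch (illustrative names only), step A is a single vectorised uniform draw:

import numpy as np

def sample_u(w, k, rng):
    # u_i ~ U(0, w_{k_i}), drawn elementwise for i = 1, ..., n.
    return rng.uniform(0.0, w[k])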

B. Next we have θj , and this is easily seen to be the density function given
up to a constant of proportionality by
f(\theta_j \mid \cdots) \propto g_0(\theta_j) \prod_{k_i = j} N(y_i \mid \theta_j).

If there are no ki equal to j then f (θj | · · ·) = g0 (θj ).
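A sketch of step B for a conjugate special case, namely a normal kernel with known unit variance and g0 = N(0, 1/s); Section 4 treats the normal case with unknown precision. The function name and arguments are illustrative assumptions.

import numpy as np

def sample_theta_j(y, k, j, s, rng):
    yj = y[k == j]
    if yj.size == 0:
        # no observation allocated to component j: draw from the prior g0
        return rng.normal(0.0, 1.0 / np.sqrt(s))
    prec = s + yj.size          # posterior precision under the conjugate pair
    mean = yj.sum() / prec      # posterior mean
    return rng.normal(mean, 1.0 / np.sqrt(prec))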

C. Slightly harder, but quite do-able, is the sampling of the vj ’s. For the
joint full conditional density we have
f(v \mid \cdots) \propto \pi(v) \prod_{i=1}^{n} 1(w_{k_i} > u_i),

where π(v) denotes the joint prior density, a product of independent beta densities, and we have
already given the relation between the wj ’s and the vj ’s. Hence,
 
f(v \mid \cdots) \propto \pi(v) \prod_{i=1}^{n} 1\left( v_{k_i} \prod_{l < k_i} (1 - v_l) > u_i \right).

It is quite evident from this that only the vj ’s for j ≤ k ∗ , where k ∗ is the
maximum of {k1 , . . . , kn }, will be affected; that is, for j > k ∗ , we have
f (vj | · · ·) = beta(1, c). For j ≤ k ∗ we have

f(v_j \mid v_{-j}, \cdots) \propto \mathrm{beta}(v_j \mid 1, c) \, 1(\alpha_j < v_j < \beta_j),

where

\alpha_j = \max_{\{i : k_i = j\}} \left\{ \frac{u_i}{\prod_{l < j} (1 - v_l)} \right\}

and

\beta_j = 1 - \max_{\{i : k_i > j\}} \left\{ \frac{u_i}{v_{k_i} \prod_{l < k_i,\, l \neq j} (1 - v_l)} \right\}.
Then the distribution function, on αj < vj < βj , is given by

F(v_j) = \frac{(1 - \alpha_j)^c - (1 - v_j)^c}{(1 - \alpha_j)^c - (1 - \beta_j)^c}

and so a sample can be taken via the inverse cdf technique. Clearly, it is
now evident that this approach covers more general stick-breaking models;
it is no more difficult to sample a truncated beta variable when we have
vj ∼ beta(aj , bj ) as the priors.
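Step C can be coded directly from the expressions for αj, βj and F; the sketch below is illustrative (the handling of empty maxima, αj = 0 and βj = 1, is an assumption consistent with the indicator constraints).

import numpy as np

def sample_v_j(j, v, u, k, c, rng):
    # cum[m] = prod_{l<m}(1 - v_l); cum has length len(v) + 1
    cum = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    idx_eq = np.flatnonzero(k == j)
    idx_gt = np.flatnonzero(k > j)
    alpha = np.max(u[idx_eq]) / cum[j] if idx_eq.size else 0.0
    if idx_gt.size:
        # v_{k_i} prod_{l<k_i, l != j}(1 - v_l): remove the (1 - v_j) factor
        denom = v[k[idx_gt]] * cum[k[idx_gt]] / (1.0 - v[j])
        beta = 1.0 - np.max(u[idx_gt] / denom)
    else:
        beta = 1.0
    # inverse-cdf draw from beta(1, c) truncated to (alpha, beta)
    r = rng.uniform()
    a_c, b_c = (1.0 - alpha) ** c, (1.0 - beta) ** c
    return 1.0 - (a_c - r * (a_c - b_c)) ** (1.0 / c)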

D. We now discuss the sampling of the indicator variables. We clearly have

pr(δi = k| · · ·) ∝ 1(k ∈ Aw (ui )) N(yi |θk ).

Clearly Aw (ui ) is not empty; at least ki ∈ Aw (ui ).


Before providing details on how to sample this, we mention that without
the latent variables ui , the possible choices of δi would be infinite and prob-
lems then arise with the normalising constant. Papaspiliopoulos & Roberts
(2005) attempted to circumvent the problem via retrospective sampling and
the use of a detailed-balance criterion, which is non-trivial. Our approach
is quite easy to implement. The choice of δi is from the finite set
{k : wk > ui}. So we sample as many of the wk's as are needed to be sure we
have found all of the wk > ui. How do we know this? We are sure there can be no
further index k with wk > ui once the weights sampled so far satisfy

\sum_{j=1}^{k} w_j > 1 - u_i,

since the remaining weights must then sum to less than ui.

So, to cover all the i's, we find the smallest k∗ such that

\sum_{j=1}^{k^*} w_j > 1 - u^*,
where u∗ = min{u1, . . . , un}. Hence, we now know how many of the wk's we
need to sample in order for the chain to proceed; it is {w1, . . . , wk∗}. It is k∗
that must be found in order to implement the algorithm: one needs to know
how many of the wj are larger than u, and it is only at k∗ that one knows for
sure that all have been found. Hence, k∗ is not a loose approximation; it is
an exact piece of information.
Under the prior model,

k^* \sim 1 + \mathrm{Poisson}(-c \log u^*).

See Muliere & Tardella (1998).
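Step D then amounts to extending the stick-breaking representation until the sampled weights sum to more than 1 − u∗, and drawing each δi from the finite set Aw(ui). The sketch below is illustrative only: the unit-variance normal kernel, standard-normal g0 and the function names are assumptions, not the paper's code.

import numpy as np
from scipy.stats import norm

def extend_sticks(v, theta, w, c, u_star, rng):
    v, theta, w = list(v), list(theta), list(w)
    while sum(w) <= 1.0 - u_star:
        v_new = rng.beta(1.0, c)
        w.append(v_new * np.prod(1.0 - np.array(v)))   # new w_j = v_j prod_{l<j}(1 - v_l)
        v.append(v_new)
        theta.append(rng.standard_normal())            # theta_j ~ g0 (illustrative)
    return np.array(v), np.array(theta), np.array(w)

def sample_delta(y, u, w, theta, rng):
    k = np.empty(len(y), dtype=int)
    for i in range(len(y)):
        A = np.flatnonzero(w > u[i])                   # the finite set A_w(u_i)
        p = norm.pdf(y[i], loc=theta[A])
        k[i] = A[rng.choice(len(A), p=p / p.sum())]
    return k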

E. We can incorporate a prior on c, say π(c). We will sample f(c, w, θ | y, u, δ)
as a block, in two stages: first by sampling from f(c | y, u, δ)
and then from f(w, θ | c, y, u, δ). We have already described how to sample from
the latter of these. For the former, it is equivalent to the full conditional
density that would arise from the marginal model, that is the one in which
the random distribution functions are removed from the model. Therefore,
as is well known, it is only the δ and the sample size that provide information
about c. To elaborate, the conditional distribution of c depends
only on the number of clusters, that is, the number of distinct ki's; call it d. Then

f(c \mid d, n) \propto c^{d} \, \Gamma(c) \, \pi(c) / \Gamma(c + n),

where Γ(·) denotes the usual gamma function. A nice way to sample from
this is given in Escobar & West (1995) when π(c) is a gamma distribution.
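A simple, if less elegant, alternative to the Escobar & West scheme is to evaluate the conditional f(c|d, n) ∝ c^d Γ(c) π(c)/Γ(c + n) on a grid and sample from the normalised values; the sketch below assumes a Ga(a, b) prior and an arbitrary grid, both illustrative choices.

import numpy as np
from scipy.special import gammaln

def sample_c(d, n, a, b, rng, grid=None):
    grid = np.linspace(1e-3, 20.0, 2000) if grid is None else grid
    # log of c^d Gamma(c) / Gamma(c + n) plus the log Ga(a, b) prior density,
    # both up to additive constants
    logf = (d * np.log(grid) + gammaln(grid) - gammaln(grid + n)
            + (a - 1.0) * np.log(grid) - b * grid)
    p = np.exp(logf - logf.max())
    return rng.choice(grid, p=p / p.sum())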

Hence, all the conditional densities are easy to sample and the Markov chain
we have constructed is automatic: it requires neither tuning nor retrospective
steps.

For density estimation we would like to sample from the predictive distribution

f(y_{n+1} \mid y_1, \ldots, y_n).

At each iteration we have (wj, θj) and we sample a θj using the weights. The
idea is to sample a uniform random variable r from the unit interval and
to take that θj for which Wj−1 < r < Wj, where Wj = w1 + · · · + wj and W0 = 0. If more weights
are required than currently exist then it is straightforward to sample more
as we know the additional vj ’s for j > k ∗ are independent and identically
distributed from beta(1, c) and the additional θj ’s are independent and iden-
tically distributed from g0 . Having taken θj , we draw yn+1 from N(·|θj ).
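A sketch of the predictive draw (illustrative names; unit-variance normal kernel and standard-normal g0 assumed, as in the earlier sketches):

import numpy as np

def sample_predictive(v, theta, w, c, rng):
    r = rng.uniform()
    while w.sum() <= r:
        # r falls beyond the weights sampled so far: extend with prior draws
        v_new = rng.beta(1.0, c)
        w = np.append(w, v_new * np.prod(1.0 - v))
        v = np.append(v, v_new)
        theta = np.append(theta, rng.standard_normal())
    j = np.searchsorted(np.cumsum(w), r)   # component j with W_{j-1} < r < W_j
    return rng.normal(theta[j], 1.0)       # y_{n+1} ~ N(. | theta_j)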

4. Illustration. Here we present a normal example in which θ = (µ, σ²) and
we will take λ = σ⁻². The prior for the µj's will be independent N(0, 1/s) and
the prior for the λj's will be independent Ga(ε, ε). To complement Section 3
we now provide the conditional distributions for µj and λj. We have
f(\mu_j \mid \cdots) = N\!\left( \frac{\xi_j \lambda_j}{m_j \lambda_j + s}, \; \frac{1}{m_j \lambda_j + s} \right),

where

\xi_j = \sum_{k_i = j} y_i

and

m_j = \sum_{k_i = j} 1.

We also have

f(\lambda_j \mid \cdots) = \mathrm{Ga}(\varepsilon + m_j/2, \; \varepsilon + d_j/2),

where

d_j = \sum_{k_i = j} (y_i - \mu_j)^2.
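These two updates can be coded as follows (a sketch only; eps and s are the hyperparameters of the text, rng a random number generator, and the function name is an assumption):

import numpy as np

def sample_mu_lambda_j(y, k, j, lam_j, s, eps, rng):
    yj = y[k == j]
    m_j, xi_j = yj.size, yj.sum()
    prec = m_j * lam_j + s
    # mu_j ~ N(xi_j lambda_j / (m_j lambda_j + s), 1 / (m_j lambda_j + s))
    mu_j = rng.normal(xi_j * lam_j / prec, 1.0 / np.sqrt(prec))
    # lambda_j ~ Ga(eps + m_j/2, eps + d_j/2); numpy's gamma uses scale = 1/rate
    d_j = np.sum((yj - mu_j) ** 2)
    lam_j = rng.gamma(eps + m_j / 2.0, 1.0 / (eps + d_j / 2.0))
    return mu_j, lam_j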

In the simulated data example that follows, the code was written in
Scilab, which is freely downloadable from the internet.
We sampled 50 random variables independently from the mixture of nor-
mal distributions given by

f(y) = \tfrac{1}{3} N(y \mid -4, 1) + \tfrac{1}{3} N(y \mid 0, 1) + \tfrac{1}{3} N(y \mid 8, 1).

Choosing non-informative specifications, we took ε = 0.5, s = 0.1 and the
gamma prior for c to be Ga(0.1, 0.1). The Gibbs sampler was run for 20,000
iterations and at each iteration from 10,000 onwards a predictive sample yn+1
was taken. A histogram of the 50 data points with the density estimator
based on the 10,000 samples of yn+1 is provided in Figure 1. The density
estimator was obtained using the R density routine with bandwidth set to
0.3.
Figure 2 presents the running average of the number of clusters sampled
at each iteration. It is clear that 10,000 iterations are sufficient for the
chain to reach stationarity, and hence the samples from iteration 10,000 onwards can
be taken as coming from the predictive distribution.

5. Discussion. We have provided a simple and fast way to sample the MDP
model. The key is the introduction of the latent variables which truncate the
weights of the random Dirichlet distributions. The code is simple to
write and the algorithm is direct in the sense that no accept/reject sampling nor
retrospective sampling is required. It is also remarkably quick to run. It
improves on current approaches in the following way: we know exactly how
many of the wj's and θj's we need to sample at each iteration, namely k∗. This
fundamental result eludes the alternative approaches.
Retaining the random distribution function is useful as it removes the
dependence between the θki's which exists in the Pólya-urn model. However,
retaining the random distributions leads to problems with the countably
infinite representation. In this paper we deal with it by introducing a latent
variable which makes the representation finite for the purposes of proceeding
with the sampling and allowing sampling from the predictive distribution.
The full conditional distribution of the latent variable, given the other variables, is
uniform.
In the non-conjugate case, that is when N(y|θ) and g0(θ) form a non-conjugate
pair and are perhaps difficult to sample, a possibly useful solution
is again provided by the latent variable ideas presented in Damien et al.
(1999, Sections 4 & 5).

References

Blackwell, D. (1973). The discreteness of Ferguson selections. Annals of Statistics 1, 356–358.

Blackwell, D. & MacQueen, J.B. (1973). Ferguson distributions via Pólya-urn schemes. Annals of Statistics 1, 353–355.

Connor, R.J. & Mosimann, J.E. (1969). Concepts of independence for proportions with a generalization of the Dirichlet distribution. Journal of the American Statistical Association 64, 194–206.

Damien, P., Wakefield, J.C. & Walker, S.G. (1999). Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society, Series B 61, 331–344.

Dey, D., Sinha, D. & Müller, P. (1998). Practical Nonparametric and Semiparametric Bayesian Statistics. Lecture Notes in Statistics. Springer, New York.

Escobar, M.D. (1988). Estimating the means of several normal populations by nonparametric estimation of the distribution of the means. Unpublished Ph.D. dissertation, Department of Statistics, Yale University.

Escobar, M.D. (1994). Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association 89, 268–277.

Escobar, M.D. & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics 1, 209–230.

Freedman, D.A. (1963). On the asymptotic behaviour of Bayes estimates in the discrete case I. Annals of Mathematical Statistics 34, 1386–1403.

Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.

Ishwaran, H. & Zarepour, M. (2000). Markov chain Monte Carlo in approximate Dirichlet and beta two parameter process hierarchical models. Biometrika 87, 371–390.

Ishwaran, H. & James, L. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96, 161–173.

Lo, A.Y. (1984). On a class of Bayesian nonparametric estimates I. Density estimates. Annals of Statistics 12, 351–357.

MacEachern, S.N. (1994). Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23, 727–741.

MacEachern, S.N. (1998). Computational methods for mixture of Dirichlet process models. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller, D. Sinha, eds.), 23–43. Springer, New York.

MacEachern, S.N. & Müller, P. (1998). Estimating mixtures of Dirichlet process models. Journal of Computational and Graphical Statistics 7, 223–238.

Muliere, P. & Tardella, L. (1998). Approximating distributions of random functionals of Ferguson-Dirichlet priors. Canadian Journal of Statistics 26, 283–297.

Müller, P. & Quintana, F.A. (2004). Nonparametric Bayesian data analysis. Statistical Science 19, 95–110.

Neal, R.M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 249–265.

Richardson, S. & Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B 59, 731–792.

Papaspiliopoulos, O. & Roberts, G.O. (2005). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Submitted.

Sethuraman, J. & Tiwari, R. (1982). Convergence of Dirichlet measures and the interpretation of their parameter. In Proceedings of the Third Purdue Symposium on Statistical Decision Theory and Related Topics (S.S. Gupta & J.O. Berger, eds.). Academic Press, New York.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.


Figure 1: Histogram of the data and density estimate of the predictive density for (1/3)N(−4, 1) + (1/3)N(0, 1) + (1/3)N(8, 1).


Figure 2: Running average for the number of clusters up to iteration 10000
