Dirichlet Process
Yee Whye Teh, University College London
Definition
The Dirichlet process is a stochastic process used in Bayesian nonparametric
models of data, particularly in Dirichlet process mixture models (also known as
infinite mixture models). It is a distribution over distributions, i.e. each draw
from a Dirichlet process is itself a distribution. It is called a Dirichlet process be-
cause it has Dirichlet distributed finite dimensional marginal distributions, just
as the Gaussian process, another popular stochastic process used for Bayesian
nonparametric regression, has Gaussian distributed finite dimensional marginal
distributions. Distributions drawn from a Dirichlet process are discrete, but
cannot be described using a finite number of parameters, thus the classification
as a nonparametric model.
Motivation and Background
In Bayesian nonparametric modeling we are often faced with an unknown distribution which we wish to infer given some observed data. Say we observe x1, . . . , xn, with xi ∼ F being independent and identically distributed draws from the unknown distribution
F . A Bayesian would approach this problem by placing a prior over F then
computing the posterior over F given data. Traditionally, this prior over dis-
tributions is given by a parametric family. But constraining distributions to
lie within parametric families limits the scope and type of inferences that can
be made. The nonparametric approach instead uses a prior over distributions
with wide support, typically the support being the space of all distributions.
Given such a large space over which we make our inferences, it is important
that posterior computations are tractable.
The Dirichlet process is currently one of the most popular Bayesian non-
parametric models. It was first formalized in [1] for general Bayesian statistical
modeling, as a prior over distributions with wide support yet tractable posteri-
ors. Unfortunately the Dirichlet process is limited by the fact that draws from
it are discrete distributions, and generalizations to more general priors did not
have tractable posterior inference until the development of MCMC techniques
[3, 4]. Since then there have been significant developments in terms of infer-
ence algorithms, extensions, theory and applications. In the machine learning
community, work on Dirichlet processes dates back to [5, 6].
Theory
The Dirichlet process (DP) is a stochastic process whose sample paths are proba-
bility measures with probability one. Stochastic processes are distributions over
function spaces, with sample paths being random functions drawn from the dis-
tribution. In the case of the DP, it is a distribution over probability measures,
which are functions with certain special properties which allow them to be inter-
preted as distributions over some probability space Θ. Thus draws from a DP
can be interpreted as random distributions. For a distribution over probability
measures to be a DP, its marginal distributions have to take on a specific form
which we shall give below. We assume that the reader is familiar with a modicum
of measure theory and Dirichlet distributions.
Before we proceed to the formal definition, we will first give an intuitive
explanation of the DP as an infinite dimensional generalization of Dirichlet
distributions. Consider a Bayesian mixture model consisting of K components:
\[ \pi \mid \alpha \sim \operatorname{Dir}\bigl(\tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\bigr) \qquad\qquad \theta_k^* \mid H \sim H \]
\[ z_i \mid \pi \sim \operatorname{Mult}(\pi) \qquad\qquad x_i \mid z_i, \{\theta_k^*\} \sim F(\theta_{z_i}^*) \tag{1} \]
where π are the mixing proportions, zi is the component assignment of data item xi, α is the pseudo-count hyperparameter of the symmetric Dirichlet prior over π, H is the prior distribution over component parameters θk*, and F(θ) is the component distribution parametrized by θ. It can be shown that for large K the number of components actually used to model n data items becomes independent of K and is approximately O(α log n). This implies that
the mixture model stays well-defined as K → ∞, leading to what is known as
an infinite mixture model [5, 6]. This model was first proposed as a way to
sidestep the difficult problem of determining the number of components in a
mixture, and as a nonparametric alternative to finite mixtures whose size can
grow naturally with the number of data items. The more modern definition of
this model uses a DP, with the resulting model called a DP mixture model. The DP itself appears as the K → ∞ limit of the random discrete probability measure Σ_{k=1}^K πk δ_{θk*}, where δθ is a point mass centred at θ. We will return to
the DP mixture towards the end of this entry.
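As an illustrative sketch of this limiting behaviour (assuming Python with numpy; the values of α, n and the truncation levels K are arbitrary choices), one can sample the mixing proportions in (1) from the symmetric Dirichlet prior and count how many components are actually used by n indicator draws; once K is large the count stays near α log(1 + n/α) rather than growing with K.

```python
import numpy as np

def occupied_components(alpha, K, n, rng):
    """Sample pi ~ Dir(alpha/K, ..., alpha/K) and z_1..z_n ~ Mult(pi) as in (1),
    then return the number of distinct components actually used by the data."""
    pi = rng.dirichlet(np.full(K, alpha / K))
    z = rng.choice(K, size=n, p=pi)
    return len(np.unique(z))

rng = np.random.default_rng(0)
alpha, n = 2.0, 1000
for K in (20, 200, 2000):
    counts = [occupied_components(alpha, K, n, rng) for _ in range(200)]
    print(f"K={K:4d}  mean number of occupied components = {np.mean(counts):.2f}")
print(f"alpha*log(1 + n/alpha) = {alpha * np.log(1 + n / alpha):.2f}")
```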
Dirichlet Process
For a random distribution G to be distributed according to a DP, its marginal
distributions have to be Dirichlet distributed [1]. Specifically, let H be a distri-
bution over Θ and α be a positive real number. Then for any finite measurable
partition A1 , . . . , Ar of Θ the vector (G(A1 ), . . . , G(Ar )) is random since G is
random. We say G is Dirichlet process distributed with base distribution H and
concentration parameter α, written G ∼ DP(α, H), if
\[ \bigl(G(A_1), \ldots, G(A_r)\bigr) \sim \operatorname{Dir}\bigl(\alpha H(A_1), \ldots, \alpha H(A_r)\bigr) \tag{2} \]
for every finite measurable partition A1, . . . , Ar of Θ. The base distribution H is the mean of the DP, E[G(A)] = H(A) for measurable A ⊆ Θ, while the concentration parameter α controls how tightly G concentrates around H, with Var[G(A)] = H(A)(1 − H(A))/(α + 1).
A related issue to the above is the coverage of the DP within the class
of all distributions over Θ. We already noted that samples from the DP are
discrete, thus the set of distributions with positive probability under the DP is
small. However it turns out that this set is also large in a different sense: if the
topological support of H (the smallest closed set S in Θ with H(S) = 1) is all
of Θ, then any distribution over Θ can be approximated arbitrarily accurately
in the weak or pointwise sense by a sequence of draws from DP(α, H). This
property has consequences for the consistency of DPs discussed later.
For all but the simplest probability spaces, the number of measurable par-
titions in the definition (2) of the DP can be uncountably large. The natural
question to ask here is whether objects satisfying such a large number of condi-
tions as (2) can exist. There are a number of approaches to establish existence.
[1] noted that the conditions (2) are consistent with each other, and made use
of Kolmogorov’s consistency theorem to show that a distribution over functions
from the measurable subsets of Θ to [0, 1] exists satisfying (2) for all finite
measurable partitions of Θ. However it turns out that this construction does
not necessarily guarantee a distribution over probability measures. [1] also pro-
vided a construction of the DP by normalizing a gamma process. In a later
section we will see that the predictive distributions of the DP are related to the
Blackwell-MacQueen urn scheme. [7] made use of this, along with de Finetti’s
theorem on exchangeable sequences, to prove existence of the DP. All the above
methods made use of powerful and general mathematical machinery to establish
existence, and often require regularity assumptions on H and Θ to apply this
machinery. In a later section, we describe a stick-breaking construction of the
DP due to [8], which is a direct and elegant construction of the DP which need
not impose such regularity assumptions.
Posterior Distribution
Let G ∼ DP(α, H). Since G is a (random) distribution, we can in turn draw
samples from G itself. Let θ1 , . . . , θn be a sequence of independent draws from
G. Note that the θi ’s take values in Θ since G is a distribution over Θ. We are
interested in the posterior distribution of G given observed values of θ1 , . . . , θn .
Let A1 , . . . , Ar be a finite measurable partition of Θ, and let nk = #{i : θi ∈ Ak }
be the number of observed values in Ak . By (2) and the conjugacy between the
Dirichlet and the multinomial distributions, we have:
\[ \bigl(G(A_1), \ldots, G(A_r)\bigr) \mid \theta_1, \ldots, \theta_n \sim \operatorname{Dir}\bigl(\alpha H(A_1) + n_1, \ldots, \alpha H(A_r) + n_r\bigr). \tag{3} \]
Since the above is true for all finite measurable partitions, the posterior distribu-
tion over G must be a DP as well. A little algebra shows that the posterior DP has updated concentration parameter α + n and base distribution (αH + Σ_{i=1}^n δ_{θi})/(α + n), where δ_{θi} is a point mass located at θi and nk = Σ_{i=1}^n δ_{θi}(Ak). In other words, the DP provides a conjugate family of priors over distributions that is closed under posterior updates given observations. Rewriting the posterior DP, we have:
\[ G \mid \theta_1, \ldots, \theta_n \sim \operatorname{DP}\!\left(\alpha + n,\; \frac{\alpha}{\alpha + n} H + \frac{n}{\alpha + n} \frac{\sum_{i=1}^{n} \delta_{\theta_i}}{n}\right) \tag{4} \]
Notice that the posterior base distribution is a weighted average between the prior base distribution H and the empirical distribution (1/n) Σ_{i=1}^n δ_{θi}. The weight
associated with the prior base distribution is proportional to α, while the em-
pirical distribution has weight proportional to the number of observations n.
Thus we can interpret α as the strength or mass associated with the prior. In
the next section we will see that the posterior base distribution is also the pre-
dictive distribution of θn+1 given θ1 , . . . , θn . Taking α → 0, the prior becomes
non-informative in the sense that the predictive distribution is just given by the
empirical distribution. On the other hand, as the number of observations grows large, n ≫ α, the posterior is simply dominated by the empirical distribution
which is in turn a close approximation of the true underlying distribution. This
gives a consistency property of the DP: the posterior DP approaches the true
underlying distribution.
Predictive Distribution and the Blackwell-MacQueen Urn Scheme
Consider now drawing a new sample θn+1 from G, conditioned on θ1, . . . , θn but with G marginalized out. Since θn+1 | G, θ1, . . . , θn ∼ G, for a measurable A ⊆ Θ we have
\[ P(\theta_{n+1} \in A \mid \theta_1, \ldots, \theta_n) = \mathbb{E}\bigl[G(A) \mid \theta_1, \ldots, \theta_n\bigr] = \frac{1}{\alpha + n}\Bigl(\alpha H(A) + \sum_{i=1}^{n} \delta_{\theta_i}(A)\Bigr), \tag{5} \]
where the last step follows from the posterior base distribution of G given the first n observations. Thus with G marginalized out:
\[ \theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \frac{1}{\alpha + n}\Bigl(\alpha H + \sum_{i=1}^{n} \delta_{\theta_i}\Bigr). \tag{6} \]
The sequence of predictive distributions (6) for θ1, θ2, . . . is called the Blackwell-MacQueen urn scheme [7]. The urn analogy is as follows: each value in Θ is a unique color, and draws θ ∼ G are balls, with the drawn value being the color of the ball. In addition we have an urn containing previously seen balls. In the beginning there are no balls in the urn, and we pick a color drawn from H, i.e. draw θ1 ∼ H, paint a ball with that color and drop it into the urn. In subsequent steps, say the (n + 1)st, we either, with probability α/(α + n), pick a new color (draw θn+1 ∼ H), paint a ball with that color and drop the ball into the urn, or, with probability n/(α + n),
reach into the urn to pick a random ball out (draw θn+1 from the empirical
distribution), paint a new ball with the same color and drop both balls back
into the urn.
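The urn scheme translates directly into a sampler for a sequence θ1, θ2, . . . distributed according to the predictive distributions (6), with G marginalized out. A minimal sketch follows (the standard normal base distribution H and the value of α are arbitrary illustrative choices).

```python
import numpy as np

def blackwell_macqueen(alpha, sample_H, n, rng):
    """Draw theta_1, ..., theta_n sequentially from (6): a new colour from H with
    probability alpha/(alpha + i), otherwise a copy of a previously drawn value."""
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            thetas.append(sample_H(rng))            # new ball painted with a colour from H
        else:
            thetas.append(thetas[rng.integers(i)])  # pick a previous ball uniformly and copy it
    return np.array(thetas)

rng = np.random.default_rng(1)
draws = blackwell_macqueen(alpha=5.0, sample_H=lambda r: r.normal(), n=500, rng=rng)
print("unique values among 500 draws:", len(np.unique(draws)))
```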
The Blackwell-MacQueen urn scheme has been used to show the existence
of the DP [7]. Starting from (6), which are perfectly well-defined conditional
distributions regardless of the question of the existence of DPs, we can con-
struct a distribution over sequences θ1 , θ2 , . . . by iteratively drawing each θi
given θ1 , . . . , θi−1 . For n ≥ 1 let
\[ P(\theta_1, \ldots, \theta_n) = \prod_{i=1}^{n} P(\theta_i \mid \theta_1, \ldots, \theta_{i-1}) \tag{7} \]
be the joint distribution over the first n observations, where the conditional
distributions are given by (6). It is straightforward to verify that this random
sequence is infinitely exchangeable. That is, for every n, the probability of
generating θ1 , . . . , θn using (6), in that order, is equal to the probability of
drawing them in any alternative order. More precisely, given any permutation
σ on 1, . . . , n, we have
\[ P(\theta_1, \ldots, \theta_n) = P(\theta_{\sigma(1)}, \ldots, \theta_{\sigma(n)}). \tag{8} \]
Now de Finetti’s theorem states that for any infinitely exchangeable sequence
θ1 , θ2 , . . . there is a random distribution G such that the sequence is composed
of i.i.d. draws from it:
\[ P(\theta_1, \ldots, \theta_n) = \int \prod_{i=1}^{n} G(\theta_i) \, dP(G) \tag{9} \]
In our setting, the prior over the random distribution P (G) is precisely the
Dirichlet process DP(α, H), thus establishing existence.
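Exchangeability can be checked numerically in the special case of a base distribution H with finitely many atoms, since the joint probability (7) is then a finite product of the predictive probabilities (6). A small sketch (the atoms, their probabilities, α and the test sequence are arbitrary choices):

```python
import itertools

def joint_prob(seq, alpha, H):
    """P(theta_1, ..., theta_n) from (7), with each factor given by the predictive
    distribution (6); H maps the atoms of a discrete base distribution to probabilities."""
    p = 1.0
    for i, theta in enumerate(seq):
        p *= (alpha * H[theta] + seq[:i].count(theta)) / (alpha + i)
    return p

alpha = 1.5
H = {"a": 0.2, "b": 0.3, "c": 0.5}
seq = ["a", "b", "a", "c", "a", "b"]
probs = [joint_prob(list(perm), alpha, H) for perm in itertools.permutations(seq)]
print(f"min={min(probs):.12g}  max={max(probs):.12g}")  # equal: every ordering has the same probability
```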
A salient property of the predictive distribution (6) is that it has point masses
located at the previous draws θ1 , . . . , θn . A first observation is that with positive
probability draws from G will take on the same value, regardless of smoothness
of H. This implies that the distribution G itself has point masses. A further
observation is that for a long enough sequence of draws from G, the value of
any draw will be repeated by another draw, implying that G is composed only
of a weighted sum of point masses, i.e. it is a discrete distribution. We will see, two sections below, that this is indeed the case, and give a simple construction
for G called the stick-breaking construction. Before that, we shall investigate
the clustering property of the DP.
Clustering, Partitions and the Chinese Restaurant Process
The fact that draws from G take repeated values induces a clustering of the observations. Assume that H is smooth, so that all repeated values are due to the discreteness property of the DP and not due to H itself. Since the values of draws are repeated, let θ1*, . . . , θm* be the unique values among θ1, . . . , θn, and nk be the number of repeats of θk*. The predictive distribution can be equivalently written as:
\[ \theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \frac{1}{\alpha + n}\Bigl(\alpha H + \sum_{k=1}^{m} n_k \delta_{\theta_k^*}\Bigr). \tag{10} \]
Values θk* with larger nk are more likely to be repeated again, a rich-gets-richer effect, and the induced distribution over partitions of the observations is known as the Chinese restaurant process. This distribution over partitions first appeared in population genetics, where it describes the distribution of allele types in a sample of genes (observations) under simplifying assumptions on the population, and is known under the name of the Ewens sampling formula [2]. Before moving on we shall
consider just one illuminating aspect, specifically the distribution of the number of clusters among n observations. Notice that for i ≥ 1, the observation θi takes on a new value (thus incrementing m by one) with probability α/(α + i − 1), independently of the number of clusters among the previous θ's. Thus the number of clusters m has mean and variance:
\[ \mathbb{E}[m \mid n] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} = \alpha\bigl(\psi(\alpha + n) - \psi(\alpha)\bigr) \simeq \alpha \log\Bigl(1 + \frac{n}{\alpha}\Bigr) \quad \text{for } n, \alpha \gg 0, \tag{11} \]
\[ \mathbb{V}[m \mid n] = \alpha\bigl(\psi(\alpha + n) - \psi(\alpha)\bigr) + \alpha^2\bigl(\psi'(\alpha + n) - \psi'(\alpha)\bigr) \simeq \alpha \log\Bigl(1 + \frac{n}{\alpha}\Bigr) \quad \text{for } n > \alpha \gg 0, \tag{12} \]
where ψ(·) is the digamma function. Note that the number of clusters grows only
logarithmically in the number of observations. This slow growth of the number
of clusters makes sense because of the rich-gets-richer phenomenon: we expect
there to be large clusters thus the number of clusters m has to be smaller than
the number of observations n. Notice that α controls the number of clusters
in a direct manner, with larger α implying a larger number of clusters a priori.
This intuition will help in the application of DPs to mixture models.
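A short simulation confirms (11) and (12); the sketch below assumes scipy for the digamma and trigamma functions, and the values of α and n are arbitrary.

```python
import numpy as np
from scipy.special import digamma, polygamma

def num_clusters(alpha, n, rng):
    """Number of clusters m among n observations: a new cluster is created at
    step i with probability alpha / (alpha + i - 1), independently of the past."""
    return sum(rng.random() < alpha / (alpha + i - 1) for i in range(1, n + 1))

rng = np.random.default_rng(2)
alpha, n = 3.0, 2000
m = np.array([num_clusters(alpha, n, rng) for _ in range(4000)])

mean_exact = alpha * (digamma(alpha + n) - digamma(alpha))
var_exact = mean_exact + alpha**2 * (polygamma(1, alpha + n) - polygamma(1, alpha))
print(f"E[m|n]: simulated {m.mean():.2f}, exact {mean_exact:.2f}, "
      f"approx {alpha * np.log(1 + n / alpha):.2f}")
print(f"V[m|n]: simulated {m.var():.2f}, exact {float(var_exact):.2f}")
```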
Stick-breaking Construction
We have already intuited that draws from a DP are composed of a weighted sum
of point masses. [8] made this precise by providing a constructive definition of
the DP as such, called the stick-breaking construction. This construction is
also significantly more straightforward and general than previous proofs of the
existence of DPs. It is simply given as follows:
\[ \beta_k \sim \operatorname{Beta}(1, \alpha) \qquad\qquad \theta_k^* \sim H \]
\[ \pi_k = \beta_k \prod_{l=1}^{k-1}(1 - \beta_l) \qquad\qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k^*} \tag{13} \]
Then G ∼ DP(α, H). The construction of the weights π can be understood metaphorically as follows: starting with a stick of length 1, we break it at β1 and assign π1 to be the length of the piece we broke off; we then recursively break the remaining portion to obtain π2, π3 and so forth.
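In practice a draw from the stick-breaking construction is truncated once the remaining stick length is negligible. A minimal sketch (the tolerance, α, and the standard normal base distribution are arbitrary illustrative choices):

```python
import numpy as np

def stick_breaking(alpha, sample_H, rng, tol=1e-8):
    """Truncated draw G ~ DP(alpha, H) via (13): returns weights pi_k and atoms theta_k*."""
    weights, atoms = [], []
    remaining = 1.0                       # length of the stick not yet broken off
    while remaining > tol:
        beta = rng.beta(1.0, alpha)       # beta_k ~ Beta(1, alpha)
        weights.append(remaining * beta)  # pi_k = beta_k * prod_{l<k} (1 - beta_l)
        atoms.append(sample_H(rng))       # theta_k* ~ H
        remaining *= 1.0 - beta
    return np.array(weights), np.array(atoms)

rng = np.random.default_rng(3)
pi, theta = stick_breaking(alpha=5.0, sample_H=lambda r: r.normal(), rng=rng)
print("atoms kept:", len(pi), "  total weight:", round(pi.sum(), 6))
```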
Applications
Because of their simplicity, DPs are used across a wide variety of applications
of Bayesian analysis in both statistics and machine learning. The simplest
and most prevalent applications include: Bayesian model validation, density
estimation and clustering via mixture models. We shall briefly describe the first
two classes before detailing DP mixture models.
How does one validate that a model gives a good fit to some observed data?
The Bayesian approach would usually involve computing the marginal prob-
ability of the observed data under the model, and comparing this marginal
probability to that for other models. If the marginal probability of the model
of interest is highest we may conclude that we have a good fit. The choice of
models to compare against is an issue in this approach, since it is desirable to
compare against as large a class of models as possible. The Bayesian nonpara-
metric approach gives an answer to this question: use the space of all possible
distributions as our comparison class, with a prior over distributions. The DP
is a popular choice for this prior, due to its simplicity, wide coverage of the class
of all distributions, and recent advances in computationally efficient inference in
DP models. The approach is usually to use the given parametric model as the
base distribution of the DP, with the DP serving as a nonparametric relaxation
around this parametric model. If the parametric model performs as well or bet-
ter than the DP relaxed model, we have convincing evidence of the validity of
the model.
Another application of DPs is in density estimation [12, 5, 3, 6]. Here we
are interested in modeling the density from which a given set of observations
is drawn. To avoid limiting ourselves to any parametric class, we may again
use a nonparametric prior over all densities. Here again DPs are a popular choice.
However note that distributions drawn from a DP are discrete, thus do not
have densities. The solution is to smooth out draws from the DP with a kernel.
Let G ∼ DP(α, H) and let f (x|θ) be a family of densities (kernels) indexed by
θ. We use the following as our nonparametric density of x:
\[ p(x) = \int f(x \mid \theta) \, G(\theta) \, d\theta \tag{14} \]
Similarly, smoothing out DPs in this way is also useful in the nonparametric
relaxation setting above. As we see below, this way of smoothing out DPs
is equivalent to DP mixture models, if the data distributions F (θ) below are
smooth with densities given by f (x|θ).
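Since a draw G = Σ_k πk δ_{θk*} is discrete, the integral in (14) reduces to the countable mixture p(x) = Σ_k πk f(x|θk*). The sketch below evaluates such a random density on a grid; the Gaussian kernels f(x|θ) = N(x; θ, 1), the N(0, 3²) base distribution, and the fixed truncation level are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
alpha, K = 2.0, 200                       # K is a truncation level assumed large enough

# Truncated stick-breaking draw G ~ DP(alpha, H) with H = N(0, 3^2).
beta = rng.beta(1.0, alpha, size=K)
pi = beta * np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))
theta = rng.normal(0.0, 3.0, size=K)

# Kernel-smoothed density (14): p(x) = sum_k pi_k * N(x; theta_k, 1).
x = np.linspace(-8.0, 8.0, 9)
p = (pi[:, None] * norm.pdf(x[None, :], loc=theta[:, None], scale=1.0)).sum(axis=0)
print(np.round(p, 4))                     # the random density evaluated on a grid
```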
Dirichlet Process Mixture Models
In a DP mixture model we model a set of observations {x1, . . . , xn} using a set of latent parameters {θ1, . . . , θn}. Each θi is drawn independently and identically from G, while each xi has distribution F(θi) parametrized by θi:
\[ x_i \mid \theta_i \sim F(\theta_i) \]
\[ \theta_i \mid G \sim G \]
\[ G \mid \alpha, H \sim \operatorname{DP}(\alpha, H) \tag{15} \]
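Data can be generated from (15) by marginalizing out G and drawing the θi sequentially from the urn scheme (6), followed by xi ∼ F(θi). A minimal sketch of this generative process (the choices H = N(0, 5²) over component means and F(θ) = N(θ, 0.5²) are illustrative assumptions):

```python
import numpy as np

def dp_mixture_data(alpha, n, rng):
    """Generate x_1..x_n from the DP mixture (15), with G marginalized out via (6)."""
    thetas, xs = [], []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            theta = rng.normal(0.0, 5.0)         # new component parameter theta ~ H
        else:
            theta = thetas[rng.integers(i)]      # reuse a previous theta_j (join its cluster)
        thetas.append(theta)
        xs.append(rng.normal(theta, 0.5))        # x_i ~ F(theta_i), here a Gaussian
    return np.array(xs), np.array(thetas)

rng = np.random.default_rng(5)
x, theta = dp_mixture_data(alpha=1.0, n=300, rng=rng)
print("data points:", len(x), "  clusters:", len(np.unique(theta)))
```

Posterior inference in DP mixture models is typically carried out with MCMC methods such as those surveyed in [4].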
Beyond the basic DP mixture model, many generalizations of the DP have been proposed, most of which are obtained by generalizing one of its representations. These include Pólya trees, normalized random measures,
Poisson-Kingman models, species sampling models and stick-breaking priors.
The DP has also been used in more complex models involving more than
one random probability measure. For example, in nonparametric regression we
might have one probability measure for each value of a covariate, and in multi-
task settings each task might be associated with a probability measure with
dependence across tasks implemented using a hierarchical Bayesian model. In
the first situation the class of models is typically called dependent Dirichlet pro-
cesses [16], while in the second the appropriate model is a hierarchical Dirichlet
process [17].
Future Directions
The Dirichlet process, and Bayesian nonparametrics in general, is an active area
of research within both machine learning and statistics. Current research trends
span a number of directions. Firstly there is the issue of efficient inference in
DP models. [4] is an excellent survey of the state-of-the-art in 2000, with all
algorithms based on Gibbs sampling or small-step Metropolis-Hastings MCMC
sampling. Since then there has been much work, including split-and-merge and
large-step auxiliary variable MCMC sampling, sequential Monte Carlo, expec-
tation propagation, and variational methods. Secondly there has been interest
in extending the DP, both in terms of new random distributions, as well as
novel classes of nonparametric objects inspired by the DP. Thirdly, theoretical
issues of convergence and consistency are being explored to provide frequentist
guarantees for Bayesian nonparametric models. Finally there are applications
of such models, to clustering, transfer learning, relational learning, models of
cognition, sequence learning, and regression and classification among others.
We believe DPs and Bayesian nonparametrics will prove to be rich and fertile
grounds for research for years to come.
Cross References
Bayesian Methods, Prior Probabilities, Bayesian Nonparametrics.
Recommended Reading
In addition to the references embedded in the text above, we recommend the
book [18] on Bayesian nonparametrics.
[1] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.
[2] W. J. Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 3:87–112, 1972.
[3] M. D. Escobar and M. West. Bayesian density estimation and inference
using mixtures. Journal of the American Statistical Association, 90:577–
588, 1995.
[4] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture
models. Journal of Computational and Graphical Statistics, 9:249–265,
2000.
[5] R. M. Neal. Bayesian mixture modeling. In Proceedings of the Workshop
on Maximum Entropy and Bayesian Methods of Statistical Analysis, vol-
ume 11, pages 197–211, 1992.
[17] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet
processes. Journal of the American Statistical Association, 101(476):1566–
1581, 2006.
[18] N. Hjort, C. Holmes, P. Müller, and S. Walker, editors. Bayesian Nonpara-
metrics. Number 28 in Cambridge Series in Statistical and Probabilistic
Mathematics. Cambridge University Press, 2010.