
Dirichlet Processes

A gentle tutorial
Khalid El-Arini
SELECT Lab Meeting
October 14, 2008
Motivation

We are given a data set, and are told that it was generated from a mixture of
Gaussian distributions. Unfortunately, no one has any idea how many Gaussians
produced the data.
What to do?

- We can guess the number of clusters, run Expectation Maximization (EM) for
  Gaussian mixture models, look at the results, and then try again.
- We can run hierarchical agglomerative clustering, and cut the tree at a
  visually appealing level.
- We want to cluster the data in a statistically principled manner, without
  resorting to hacks.
Other motivating examples

- Brain imaging: model an unknown number of spatial activation patterns in
  fMRI images [Kim and Smyth, NIPS 2006].
- Topic modeling: model an unknown number of topics across several corpora of
  documents [Teh et al. 2006].

Overview

- Dirichlet distribution, and Dirichlet Process introduction
- Dirichlet Processes from different perspectives
  - Samples from a Dirichlet Process
  - Chinese Restaurant Process representation
  - Stick Breaking
  - Formal Definition
- Dirichlet Process Mixtures
- Inference
The Dirichlet Distribution

Let θ_1, ..., θ_m be nonnegative and sum to 1.
We write: (θ_1, ..., θ_m) ~ Dirichlet(α_1, ..., α_m).
Samples from the distribution lie in the (m − 1)-dimensional probability simplex.

    P(\theta_1, \theta_2, \ldots, \theta_m) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{m} \theta_k^{\alpha_k - 1}
The Dirichlet Distribution

Let (θ_1, ..., θ_m) ~ Dirichlet(α_1, ..., α_m), as above.
- It is a distribution over possible parameter vectors for a multinomial
  distribution, and is the conjugate prior for the multinomial.
- The Beta distribution is the special case of a Dirichlet for 2 dimensions.
- Thus, it is in fact a distribution over distributions.
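As a quick sanity check of these properties, the sketch below (Python/NumPy, which is my choice of tooling; the slides prescribe none) draws samples from a Dirichlet, verifies they lie on the probability simplex, and illustrates conjugacy by adding multinomial counts to the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from Dirichlet(alpha) and check they lie on the probability simplex.
alpha = np.array([2.0, 3.0, 5.0])
theta = rng.dirichlet(alpha, size=1000)          # shape (1000, 3)
assert np.all(theta >= 0)
assert np.allclose(theta.sum(axis=1), 1.0)

# Conjugacy: given multinomial counts n_k, the posterior over theta is
# Dirichlet(alpha_k + n_k).
counts = np.array([10, 0, 4])
posterior_alpha = alpha + counts
print("posterior mean:", posterior_alpha / posterior_alpha.sum())
```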
Dirichlet Process

A Dirichlet Process is also a distribution over distributions.
Let G be Dirichlet Process distributed:

    G ~ DP(α, G_0)

- G_0 is a base distribution
- α is a positive scaling parameter
- G is a random probability measure that has the same support as G_0
Dirichlet Process

Consider a Gaussian base distribution G_0, and draw

    G ~ DP(α, G_0)
Dirichlet Process

    G ~ DP(α, G_0)

- G_0 is continuous, so the probability that any two samples are equal is
  precisely zero.
- However, G is a discrete distribution, made up of a countably infinite
  number of point masses [Blackwell].
  - Therefore, there is always a non-zero probability of two samples colliding.
Overview

- Dirichlet distribution, and Dirichlet Process introduction
- Dirichlet Processes from different perspectives
  - Samples from a Dirichlet Process
  - Chinese Restaurant Process representation
  - Stick Breaking
  - Formal Definition
- Dirichlet Process Mixtures
- Inference
Samples from a Dirichlet Process

    G ~ DP(α, G_0)
    X_n | G ~ G    for n = 1, ..., N    (i.i.d. given G)

Marginalizing out G introduces dependencies between the X_n variables.
(Graphical model: plate over n = 1, ..., N containing X_n, with parent G.)
Samples from a Dirichlet Process

Assume we view these variables in a specific order, and are interested in the
behavior of X_n given the previous n − 1 observations.
Let there be K unique values X_1*, ..., X_K* among the first n − 1 variables,
with n_k of them equal to X_k*. Then

    P(X_n = x \mid X_1, \ldots, X_{n-1}) = \frac{1}{n - 1 + \alpha}\left( \alpha\, G_0(x) + \sum_{k=1}^{K} n_k\, \delta_{X_k^*}(x) \right)

Multiplying these conditionals together with the chain rule gives the joint
distribution, which can be rewritten as P(partition) · P(draws): the probability
of how the observations are grouped into K unique values, times the probability
of drawing those K values from G_0.
Notice that this formulation of the joint distribution does not depend on the
order in which we consider the variables.
Blackwell-MacQueen Urn Scheme

    G ~ DP(α, G_0)
    X_n | G ~ G

Assume that G_0 is a distribution over colors, and that each X_n represents the
color of a single ball placed in the urn.
- Start with an empty urn.
- On step n:
  - With probability proportional to α, draw X_n ~ G_0, and add a ball of that
    color to the urn.
  - With probability proportional to n − 1 (i.e., the number of balls currently
    in the urn), pick a ball at random from the urn. Record its color as X_n,
    and return the ball to the urn, along with a new one of the same color.

[Blackwell and MacQueen, 1973]
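The urn scheme translates almost directly into code. Below is a minimal sketch in Python/NumPy (my choice of language), with a standard Gaussian standing in for the base distribution over colors:

```python
import numpy as np

def blackwell_macqueen_urn(n_samples, alpha, base_draw, rng):
    """Draw X_1, ..., X_N from the Blackwell-MacQueen urn with parameter alpha.

    base_draw(rng) should return one sample from the base distribution G_0.
    """
    draws = []
    for n in range(1, n_samples + 1):
        # With probability alpha / (n - 1 + alpha), draw a new value from G_0;
        # otherwise copy the value of a uniformly chosen previous draw.
        if rng.random() < alpha / (n - 1 + alpha):
            draws.append(base_draw(rng))
        else:
            draws.append(draws[rng.integers(len(draws))])
    return draws

rng = np.random.default_rng(1)
xs = blackwell_macqueen_urn(50, alpha=2.0, base_draw=lambda r: r.normal(), rng=rng)
print("number of unique values among 50 draws:", len(set(xs)))
```

The number of distinct values among N draws grows slowly (roughly like α log N), which is the rich-get-richer clustering behavior the Chinese restaurant analogy describes next.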
Chinese Restaurant Process

Consider a restaurant with infinitely many tables, where the X_n's represent the
patrons of the restaurant. From the above conditional probability distribution,
we can see that a customer is more likely to sit at a table if there are already
many people sitting there. However, with probability proportional to α, the
customer will sit at a new table.

Also known as the clustering effect, and can be seen in the setting of social
clubs. [Aldous]
Stick Breaking

So far, we've just mentioned properties of a distribution G drawn from a
Dirichlet Process. In 1994, Sethuraman developed a constructive way of forming
G, known as stick breaking.
Stick Breaking

1. Draw X_1* from G_0
2. Draw v_1 from Beta(1, α)
3. π_1 = v_1
4. Draw X_2* from G_0
5. Draw v_2 from Beta(1, α)
6. π_2 = v_2 (1 − v_1)
...

In general, π_k = v_k ∏_{j<k} (1 − v_j), and G = Σ_k π_k δ_{X_k*}.
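A truncated version of this construction is straightforward to simulate. The sketch below (Python/NumPy; the truncation at 100 atoms is an arbitrary choice) produces the weights π_k and atom locations X_k* that define an approximate draw G ≈ Σ_k π_k δ_{X_k*}:

```python
import numpy as np

def stick_breaking(alpha, n_atoms, rng, base_draw=lambda r: r.normal()):
    """Truncated stick-breaking construction of G ~ DP(alpha, G_0)."""
    v = rng.beta(1.0, alpha, size=n_atoms)                       # v_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining                                      # pi_k = v_k * prod_{j<k} (1 - v_j)
    atoms = np.array([base_draw(rng) for _ in range(n_atoms)])   # X_k* ~ G_0
    return weights, atoms

rng = np.random.default_rng(2)
weights, atoms = stick_breaking(alpha=2.0, n_atoms=100, rng=rng)
print("total mass captured by 100 atoms:", weights.sum())
```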
Formal Definition (not constructive)

Let α be a positive, real-valued scalar.
Let G_0 be a non-atomic probability distribution over support set A.
If G ~ DP(α, G_0), then for any finite partition (A_1, ..., A_r) of A:

    (G(A_1), \ldots, G(A_r)) \sim \text{Dirichlet}(\alpha G_0(A_1), \ldots, \alpha G_0(A_r))

(The original slide illustrates this with A partitioned into regions A_1, ..., A_7.)
Overview

- Dirichlet distribution, and Dirichlet Process introduction
- Dirichlet Processes from different perspectives
  - Samples from a Dirichlet Process
  - Chinese Restaurant Process representation
  - Stick Breaking
  - Formal Definition
- Dirichlet Process Mixtures
- Inference
Finite Mixture Models

A finite mixture model assumes that the data come from a mixture of a finite
number K of distributions.

    π ~ Dirichlet(α/K, ..., α/K)
    c_n ~ Multinomial(π)                      (component labels)
    θ_k* ~ G_0                                for k = 1, ..., K
    y_n | c_n, θ_1*, ..., θ_K* ~ F( · | θ_{c_n}*)

(Graphical model: plate over n = 1, ..., N containing c_n and y_n; plate over
k = 1, ..., K containing θ_k*; with G_0 and α as fixed hyperparameters.)
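As a concrete instance of this generative process, here is a minimal sketch for a mixture of unit-variance Gaussians, with G_0 taken to be N(0, 5^2) over the component means (the choices of F and G_0 are my assumptions; the slide leaves them abstract):

```python
import numpy as np

def sample_finite_mixture(N, K, alpha, rng):
    """Generate data from the finite mixture model on this slide."""
    pi = rng.dirichlet(np.full(K, alpha / K))      # pi ~ Dirichlet(alpha/K, ..., alpha/K)
    theta_star = rng.normal(0.0, 5.0, size=K)      # theta*_k ~ G_0 (here N(0, 5^2))
    c = rng.choice(K, size=N, p=pi)                # c_n ~ Multinomial(pi)
    y = rng.normal(theta_star[c], 1.0)             # y_n ~ F(. | theta*_{c_n}) (here N(theta, 1))
    return y, c, theta_star

rng = np.random.default_rng(3)
y, c, theta_star = sample_finite_mixture(N=200, K=5, alpha=1.0, rng=rng)
print("components actually used:", np.unique(c).size)
```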
Infinite Mixture Models

An infinite mixture model assumes that the data come from a mixture of an
infinite number of distributions.

    π ~ Dirichlet(α/K, ..., α/K)
    c_n ~ Multinomial(π)
    θ_k* ~ G_0
    y_n | c_n, θ_1*, ..., θ_K* ~ F( · | θ_{c_n}*)

Take the limit as K goes to ∞.
Note: the N data points still come from at most N different components.

[Rasmussen 2000]
Dirichlet Process Mixture

    G ~ DP(α, G_0)             (G consists of a countably infinite number of point masses)
    θ_n | G ~ G                (draw N times from G to get parameters for the different mixture components)
    y_n | θ_n ~ F( · | θ_n)

If the θ_n were drawn from, e.g., a Gaussian, no two values would be the same;
but since they are drawn from a Dirichlet Process-distributed distribution, we
expect a clustering of the θ_n.
The number of unique values for θ_n = the number of mixture components.
CRP Mixture
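Integrating out G gives the equivalent Chinese restaurant process view of the same mixture: seat each observation at a table with the CRP, give each new table a parameter drawn from G_0, and generate the observation from that table's component. A minimal sketch, under the same Gaussian assumptions as the finite-mixture sketch above:

```python
import numpy as np

def sample_crp_mixture(N, alpha, rng):
    """Generate data from a DP (CRP) mixture of unit-variance Gaussians."""
    assignments = []        # table index for each customer
    table_means = []        # theta*_k ~ G_0 for each occupied table
    y = np.empty(N)
    for n in range(N):
        counts = np.bincount(assignments) if assignments else np.array([])
        probs = np.append(counts, alpha).astype(float)
        probs /= probs.sum()                       # P(existing table) ∝ count, P(new) ∝ alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(table_means):                  # new table: draw its parameter from G_0
            table_means.append(rng.normal(0.0, 5.0))
        assignments.append(k)
        y[n] = rng.normal(table_means[k], 1.0)     # y_n ~ F(. | theta*_{c_n})
    return y, np.array(assignments)

rng = np.random.default_rng(4)
y, z = sample_crp_mixture(N=200, alpha=1.0, rng=rng)
print("number of clusters:", z.max() + 1)
```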
Overview

- Dirichlet distribution, and Dirichlet Process introduction
- Dirichlet Processes from different perspectives
  - Samples from a Dirichlet Process
  - Chinese Restaurant Process representation
  - Stick Breaking
  - Formal Definition
- Dirichlet Process Mixtures
- Inference
Inference for Dirichlet Process Mixtures

Expectation Maximization (EM) is generally used for inference in a mixture
model, but G is nonparametric, making EM difficult.
- Markov Chain Monte Carlo techniques [Neal 2000]
- Variational Inference [Blei and Jordan 2006]
Aside: Monte Carlo Methods
[Basic Integration]

We want to compute the integral

    I = \int h(x) f(x)\, dx

where f(x) is a probability density function. In other words, we want E_f[h(X)].
We can approximate this as

    \hat{I} = \frac{1}{N} \sum_{i=1}^{N} h(X_i)

where X_1, X_2, ..., X_N are sampled from f. By the law of large numbers,

    \hat{I} \xrightarrow{p} I

[Lafferty and Wasserman]
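For example, a minimal sketch (Python/NumPy, with f a standard normal and h(x) = x^2 chosen for illustration, so the true value is E[X^2] = 1):

```python
import numpy as np

rng = np.random.default_rng(5)

h = lambda x: x ** 2                 # function whose expectation we want
samples = rng.normal(size=100_000)   # X_1, ..., X_N ~ f = N(0, 1)
estimate = h(samples).mean()         # (1/N) * sum_i h(X_i)
print(estimate)                      # close to E[X^2] = 1 by the law of large numbers
```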
Aside: Monte Carlo Methods
[What if we don't know how to sample from f?]

- Importance Sampling
- Markov Chain Monte Carlo (MCMC)
  - Goal is to generate a Markov chain X_1, X_2, ..., whose stationary
    distribution is f.
  - If so, then (under certain conditions)

        \frac{1}{N} \sum_{i=1}^{N} h(X_i) \xrightarrow{p} I
Aside: Monte Carlo Methods
[MCMC I: Metropolis-Hastings Algorithm]

Goal: generate a Markov chain with stationary distribution f(x).
Initialization:
- Let q(y | x) be an arbitrary distribution that we know how to sample from.
  We call q the proposal distribution. (A common choice is N(x, b^2) for b > 0.)
- Arbitrarily choose X_0.
Assume we have generated X_0, X_1, ..., X_i. To generate X_{i+1}:
- Generate a proposal value Y ~ q(y | X_i).
- Evaluate r := r(X_i, Y), where

      r(x, y) = \min\left\{ \frac{f(y)\, q(x \mid y)}{f(x)\, q(y \mid x)},\ 1 \right\}

- Set X_{i+1} = Y with probability r, and X_{i+1} = X_i with probability 1 − r.
If q is symmetric, r simplifies to min{ f(Y) / f(X_i), 1 }.

[Lafferty and Wasserman]
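A minimal random-walk Metropolis-Hastings sketch in Python/NumPy, using the symmetric proposal N(x, b^2) mentioned above and a standard normal target chosen purely for illustration:

```python
import numpy as np

def metropolis_hastings(log_f, n_steps, x0, b, rng):
    """Random-walk Metropolis-Hastings with proposal q(y | x) = N(x, b^2)."""
    x = x0
    chain = np.empty(n_steps)
    for i in range(n_steps):
        y = rng.normal(x, b)                       # propose Y ~ N(X_i, b^2)
        # q is symmetric, so r = min{ f(Y) / f(X_i), 1 } (computed in log space)
        if np.log(rng.random()) < log_f(y) - log_f(x):
            x = y                                  # accept with probability r
        chain[i] = x                               # otherwise keep X_i
    return chain

log_f = lambda x: -0.5 * x ** 2                    # unnormalized log density of N(0, 1)
rng = np.random.default_rng(6)
chain = metropolis_hastings(log_f, n_steps=10_000, x0=0.0, b=1.0, rng=rng)
print("sample mean and variance:", chain.mean(), chain.var())
```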
Aside: Monte Carlo Methods
[MCMC II: Gibbs Sampling]

Goal: generate a Markov chain with stationary distribution f(x, y).
(Easily extendable to higher dimensions.)
Assumption: we know how to sample from the conditional distributions
f_{X|Y}(x | y) and f_{Y|X}(y | x).
(If not, we run one iteration of Metropolis-Hastings each time we need to
sample from a conditional.)
Initialization: arbitrarily choose X_0, Y_0.
Assume we have generated (X_0, Y_0), ..., (X_i, Y_i). To generate (X_{i+1}, Y_{i+1}):

    X_{i+1} ~ f_{X|Y}(x | Y_i)
    Y_{i+1} ~ f_{Y|X}(y | X_{i+1})

[Lafferty and Wasserman]
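As a standalone illustration (not the DP mixture yet), here is a minimal Gibbs sampler for a bivariate normal with correlation ρ, where both conditionals are Gaussian and known in closed form; the target and ρ = 0.8 are arbitrary choices:

```python
import numpy as np

def gibbs_bivariate_normal(n_steps, rho, rng):
    """Gibbs sampling for (X, Y) ~ N(0, [[1, rho], [rho, 1]])."""
    x, y = 0.0, 0.0                                # arbitrary (X_0, Y_0)
    chain = np.empty((n_steps, 2))
    sd = np.sqrt(1.0 - rho ** 2)                   # conditional standard deviation
    for i in range(n_steps):
        x = rng.normal(rho * y, sd)                # X_{i+1} ~ f(x | Y_i)     = N(rho*y, 1 - rho^2)
        y = rng.normal(rho * x, sd)                # Y_{i+1} ~ f(y | X_{i+1}) = N(rho*x, 1 - rho^2)
        chain[i] = (x, y)
    return chain

rng = np.random.default_rng(7)
chain = gibbs_bivariate_normal(20_000, rho=0.8, rng=rng)
print("empirical correlation:", np.corrcoef(chain.T)[0, 1])   # close to 0.8
```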
MCMC for Dirichlet Process Mixtures
[Overview]

We would like to sample from the posterior distribution:

    P(θ_1, ..., θ_N | y_1, ..., y_N)

If we could, we would be able to determine:
- how many distinct components are likely contributing to our data.
- what the parameters are for each component.

[Neal 2000] is an excellent resource describing several MCMC algorithms for
solving this problem. We will briefly take a look at two of them.
MCMC for Dirichlet Process Mixtures
[Infinite Mixture Model representation]

MCMC algorithms that are based on the infinite mixture model representation of
Dirichlet Process Mixtures are found to be simpler to implement and to converge
faster than those based on the direct representation.
Thus, rather than sampling θ_1, ..., θ_N directly, we will instead sample the
component indicators c_1, ..., c_N, as well as the component parameters θ_c*
for all c in {c_1, ..., c_N}.

[Neal 2000]
MCMC for Dirichlet Process Mixtures
[Gibbs Sampling with Conjugate Priors]

Assume the current state of the Markov chain consists of c_1, ..., c_N, as well
as the component parameters θ_c* for all c in {c_1, ..., c_N}.
To generate the next sample:
1. For i = 1, ..., N:
   - If c_i is currently a singleton, remove θ_{c_i}* from the state.
   - Draw a new value for c_i from the conditional distribution:

         P(c_i = c \mid c_{-i}, y_i, \theta^*) \propto n_{-i,c}\, F(y_i \mid \theta_c^*)              for an existing c
         P(c_i = \text{new} \mid c_{-i}, y_i) \propto \alpha \int F(y_i \mid \theta)\, dG_0(\theta)   for a new c

   - If the new c_i is not associated with any other observation, draw a value
     for θ_{c_i}* from the posterior proportional to F(y_i | θ) G_0(θ).

[Neal 2000, Algorithm 2]
MCMC for Dirichlet Process Mixtures
[Gibbs Sampling with Conjugate Priors]

2. For all c in {c_1, ..., c_N}:
   - Draw a new value for θ_c* from the posterior distribution based on the
     prior G_0 and all the data points currently associated with component c.

This algorithm breaks down when G_0 is not a conjugate prior.

[Neal 2000, Algorithm 2]
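Putting the two steps together, here is a minimal sketch of this conjugate Gibbs sampler for one concrete model choice: univariate Gaussian components with known variance sigma2 and a Gaussian base measure G_0 = N(mu0, tau02) over the component means (the specific likelihood/prior pair is my assumption for illustration; Neal's paper treats the general conjugate case):

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def gibbs_dp_mixture(y, alpha, sigma2, mu0, tau02, n_iters, rng):
    """Sketch of Neal (2000) Algorithm 2 for a DP mixture of Gaussians with
    known variance sigma2 and conjugate prior G_0 = N(mu0, tau02) on the means."""
    N = len(y)
    c = np.zeros(N, dtype=int)                     # start with everyone in component 0
    theta = {0: rng.normal(mu0, np.sqrt(tau02))}   # component means theta*_c

    def posterior_draw(data):
        # Conjugate posterior over a component mean given the data assigned to it.
        var = 1.0 / (1.0 / tau02 + len(data) / sigma2)
        mean = var * (mu0 / tau02 + np.sum(data) / sigma2)
        return rng.normal(mean, np.sqrt(var))

    for _ in range(n_iters):
        # Step 1: resample each indicator c_i.
        for i in range(N):
            old = c[i]
            c[i] = -1                              # remove y_i from its component
            if not np.any(c == old):               # c_i was a singleton: drop theta*_old
                del theta[old]
            labels = list(theta.keys())
            counts = np.array([np.sum(c == k) for k in labels], dtype=float)
            means = np.array([theta[k] for k in labels])
            # P(existing c) proportional to n_{-i,c} * F(y_i | theta*_c)
            weights = counts * normal_pdf(y[i], means, sigma2)
            # P(new c) proportional to alpha * marginal likelihood = alpha * N(y_i; mu0, sigma2 + tau02)
            weights = np.append(weights, alpha * normal_pdf(y[i], mu0, sigma2 + tau02))
            choice = rng.choice(len(weights), p=weights / weights.sum())
            if choice == len(labels):              # new component: draw theta* from posterior given y_i alone
                new_label = max(theta) + 1 if theta else 0
                theta[new_label] = posterior_draw(y[i:i + 1])
                c[i] = new_label
            else:
                c[i] = labels[choice]
        # Step 2: resample each component parameter from its full posterior.
        for k in list(theta.keys()):
            theta[k] = posterior_draw(y[c == k])
    return c, theta

rng = np.random.default_rng(8)
y = np.concatenate([rng.normal(-4.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])
c, theta = gibbs_dp_mixture(y, alpha=1.0, sigma2=1.0, mu0=0.0, tau02=25.0, n_iters=100, rng=rng)
print("inferred number of components:", len(theta))
```

Because the prior is conjugate, both the marginal likelihood of a new component and the per-component posterior are available in closed form, which is exactly the property the next slide notes we lose in the non-conjugate case.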
MCMC for Dirichlet Process Mixtures
[Gibbs Sampling with Auxiliary Parameters]

Recall from the Gibbs sampling overview: if we do not know how to sample from
the conditional distributions, we can interleave one or more Metropolis-Hastings
steps.
- We can apply this technique when G_0 is not a conjugate prior, but it can
  lead to convergence issues [Neal 2000, Algorithms 5-7].
- Instead, we will use auxiliary parameters.

Previously, the state of our Markov chain consisted of c_1, ..., c_N, as well
as component parameters θ_c* for all c in {c_1, ..., c_N}.
When updating c_i, we either:
- choose an existing component c from c_{-i} (i.e., all c_j such that j ≠ i), or
- choose a brand new component.
In the previous algorithm, this involved integrating with respect to G_0, which
is difficult in the non-conjugate case.

[Neal 2000, Algorithm 8]
MCMC for Dirichlet Process Mixtures
[Gibbs Sampling with Auxiliary Parameters]

When updating c_i, we either:
- choose an existing component c from c_{-i} (i.e., all c_j such that j ≠ i), or
- choose a brand new component.
Let K_{-i} be the number of distinct components c in c_{-i}.
WLOG, let these components c_{-i} have values in {1, ..., K_{-i}}.
Instead of integrating over G_0, we will add m auxiliary parameters, each
corresponding to a new component independently drawn from G_0:

    [θ_{K_{-i}+1}*, ..., θ_{K_{-i}+m}*]

Recall that the probability of selecting a new component is proportional to α.
Here, we divide α equally among the m auxiliary components.

[Neal 2000, Algorithm 8]
MCMC for Dirichlet Process Mixtures
[Gibbs Sampling with Auxiliary Parameters]

This takes care of sampling for c_i in the non-conjugate case.
A Metropolis-Hastings step can be used to sample θ_c*.
See Neal's paper for more details.

Worked example (updating c_7 with m = 3 auxiliary components):

    Data:            y_1  y_2  y_3  y_4  y_5  y_6  y_7
    Labels c_j:       1    2    1    3    4    3    ?

    Candidate components:            1    2    3    4    5    6    7
    Their parameters:              θ_1* θ_2* θ_3* θ_4* θ_5* θ_6* θ_7*
    Probability (proportional to):   2    1    2    1   α/3  α/3  α/3

θ_1*, ..., θ_4* belong to the four existing components; θ_5*, θ_6*, θ_7* are the
auxiliary parameters, and each is a fresh draw from G_0. (In Algorithm 8 each of
these prior weights is also multiplied by the likelihood F(y_7 | θ_c*).)

[Neal 2000, Algorithm 8]
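To make the worked example concrete, here is a tiny sketch (Python/NumPy, with α = 1 chosen arbitrarily) that computes the prior part of these selection weights:

```python
import numpy as np

alpha, m = 1.0, 3
existing_labels = [1, 2, 1, 3, 4, 3]         # c_1, ..., c_6

counts = np.bincount(existing_labels)[1:]    # n_{-7,c} for components 1..4: [2, 1, 2, 1]
weights = np.concatenate([counts, np.full(m, alpha / m)])  # each auxiliary component gets alpha/m
print(weights)                               # [2, 1, 2, 1, 0.333..., 0.333..., 0.333...]
print(weights / weights.sum())               # normalized selection probabilities
# (In Algorithm 8 each weight would also be multiplied by F(y_7 | theta*_c).)
```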
Conclusion

- We now have a statistically principled mechanism for solving our original
  problem.
- This was intended as a general and fairly high-level overview of Dirichlet
  Processes.
  - Topics left out include Hierarchical Dirichlet Processes, variational
    inference for Dirichlet Processes, and many more.
- Teh's MLSS '07 tutorial provides a much deeper and more detailed take on
  DPs; highly recommended!
Acknowledgments

Much thanks goes to David Blei.
Some material for this presentation was inspired by slides from Teg Grenager,
Zoubin Ghahramani, and Yee Whye Teh.
References

David Blackwell and James B. MacQueen. Ferguson Distributions via Polya Urn Schemes. Annals of Statistics 1(2), 1973, 353-355.
David M. Blei and Michael I. Jordan. Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis 1(1), 2006.
Thomas S. Ferguson. A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics 1(2), 1973, 209-230.
Zoubin Ghahramani. Non-parametric Bayesian Methods. UAI Tutorial, July 2005.
Teg Grenager. Chinese Restaurants and Stick Breaking: An Introduction to the Dirichlet Process.
R. M. Neal. Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.
C. E. Rasmussen. The Infinite Gaussian Mixture Model. Advances in Neural Information Processing Systems 12, 554-560. (Eds.) Solla, S. A., T. K. Leen and K.-R. Müller, MIT Press, 2000.
Y. W. Teh. Dirichlet Processes. Machine Learning Summer School 2007 Tutorial and Practical Course.
Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association 101(476):1566-1581, 2006.
