Chapter 9 Data Mining
P(X_1) = \sum_{x_2} \cdots \sum_{x_N} P(X_1, X_2 = x_2, \ldots, X_N = x_N)
Marginalization
The previous notation can be simplified to
p(x_1) = \sum_{x_2} \cdots \sum_{x_N} p(x_1, x_2, \ldots, x_N)
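As a minimal illustration (not from the original slides), marginalization amounts to summing a joint probability table over the unwanted variables; the joint table below is made up for the example.

```python
import numpy as np

# Hypothetical joint distribution P(X1, X2, X3) over three binary variables,
# stored as a 2 x 2 x 2 array whose entries sum to 1.
joint = np.array([[[0.10, 0.05],
                   [0.05, 0.10]],
                  [[0.20, 0.10],
                   [0.15, 0.25]]])

# P(X1) = sum over x2 and x3 of P(X1, x2, x3): sum out axes 1 and 2.
p_x1 = joint.sum(axis=(1, 2))
print(p_x1)        # [0.3 0.7]
print(p_x1.sum())  # 1.0
```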
[Figure: a Bayesian network over the variables Y, O, T, H, and W]
P(Y, O, T, H, W) = P(W \mid O, Y)\, P(O \mid Y)\, P(T \mid O, Y)\, P(H \mid T, Y)\, P(Y)
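A small sketch of how such a factorization is used in practice: the joint probability is the product of entries looked up in the network's conditional probability tables. The table values below are hypothetical placeholders, not taken from the slides.

```python
# Conditional probability tables, represented as dicts keyed by (value, parent values).
p_y = {True: 0.6, False: 0.4}                      # P(Y)
p_w = {(True, 'sunny', True): 0.3}                 # P(W | O, Y), one entry shown
p_o = {('sunny', True): 0.5}                       # P(O | Y)
p_t = {('hot', 'sunny', True): 0.4}                # P(T | O, Y)
p_h = {('high', 'hot', True): 0.7}                 # P(H | T, Y)

def joint(y, o, t, h, w):
    """P(Y,O,T,H,W) = P(W|O,Y) P(O|Y) P(T|O,Y) P(H|T,Y) P(Y)."""
    return (p_w[(w, o, y)] * p_o[(o, y)] *
            p_t[(t, o, y)] * p_h[(h, t, y)] * p_y[y])

print(joint(True, 'sunny', 'hot', 'high', True))   # 0.3*0.5*0.4*0.7*0.6 = 0.0252
```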
Estimating Bayesian network
parameters
The log-likelihood of a Bayesian network with
V variables and N examples of complete
variable assignments to the network is
\log L = \sum_{i=1}^{N} \sum_{v=1}^{V} \log P\big(x_v^{(i)} \mid \mathrm{parents}(X_v)^{(i)}\big)
To estimate the fifth parameter pA, just take the
proportion of the instances that are in the A
cluster, then pB=1-pA.
Motivating the EM algorithm
If you knew the five parameters, finding the
(posterior) probabilities that a given instance
comes from each distribution would be easy
Given an instance xi, the probability that it
belongs to cluster A is
P(A \mid x_i) = \frac{P(x_i \mid A)\, P(A)}{P(x_i)} = \frac{N(x_i; \mu_A, \sigma_A)\, p_A}{N(x_i; \mu_A, \sigma_A)\, p_A + N(x_i; \mu_B, \sigma_B)\, p_B}
where N() is the normal or Gaussian distribution
N(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
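A minimal sketch of this E-step computation in Python, using the Gaussian density above; the parameter values passed at the end are made up for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def posterior_A(x, mu_A, sigma_A, p_A, mu_B, sigma_B, p_B):
    """P(A | x) via Bayes' rule for a two-component mixture."""
    num = normal_pdf(x, mu_A, sigma_A) * p_A
    den = num + normal_pdf(x, mu_B, sigma_B) * p_B
    return num / den

# Hypothetical parameter values, just for illustration.
print(posterior_A(0.5, mu_A=0.0, sigma_A=1.0, p_A=0.6, mu_B=3.0, sigma_B=1.0, p_B=0.4))
```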
The EM algorithm for a GMM
Start with a random (but reasonable) assignment to
the parameters
Compute the posterior distribution for the cluster
assignments for each example
Update the parameters based on the expected class
assignments, where the probabilities act as weights.
If wi is the probability that instance i belongs to
cluster A, the mean and std. dev. are
\mu_A = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}
\sigma_A^2 = \frac{w_1 (x_1 - \mu_A)^2 + w_2 (x_2 - \mu_A)^2 + \cdots + w_n (x_n - \mu_A)^2}{w_1 + w_2 + \cdots + w_n}
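A sketch of these weighted updates (the M step) for cluster A, assuming the responsibilities w_i have already been computed as above; the data and weights below are hypothetical.

```python
import numpy as np

def m_step_cluster_A(x, w):
    """Weighted mean, standard deviation, and mixing proportion for cluster A.

    x: 1-D array of instances; w: posterior probabilities P(A | x_i) used as weights.
    """
    w_sum = w.sum()
    mu_A = (w * x).sum() / w_sum
    var_A = (w * (x - mu_A) ** 2).sum() / w_sum
    p_A = w_sum / len(x)   # expected proportion of instances in cluster A
    return mu_A, np.sqrt(var_A), p_A

x = np.array([0.1, 0.4, 2.9, 3.2])
w = np.array([0.9, 0.8, 0.2, 0.1])   # hypothetical responsibilities for cluster A
print(m_step_cluster_A(x, w))
```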
The EM algorithm
Optimizes the marginal likelihood, obtained by
marginalizing over the two components of the
Gaussian mixture
The marginal likelihood is a measure of the
goodness of the clustering and it increases
at each iteration of the EM algorithm.
In practice we use the log marginal likelihood
\sum_{i=1}^{n} \log P(x_i) = \sum_{i=1}^{n} \log \sum_{c_i} P(x_i \mid c_i)\, P(c_i)
= \sum_{i=1}^{n} \log \big[ N(x_i; \mu_A, \sigma_A)\, p_A + N(x_i; \mu_B, \sigma_B)\, p_B \big]
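The log marginal likelihood can be monitored after each iteration to check convergence; here is a sketch, assuming SciPy is available for a numerically stable log-sum-exp (the parameter values are placeholders).

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_likelihood(x, mu_A, sigma_A, p_A, mu_B, sigma_B, p_B):
    """sum_i log[ N(x_i; mu_A, sigma_A) p_A + N(x_i; mu_B, sigma_B) p_B ]."""
    def log_normal(x, mu, sigma):
        return -0.5 * np.log(2 * np.pi) - np.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

    log_terms = np.stack([log_normal(x, mu_A, sigma_A) + np.log(p_A),
                          log_normal(x, mu_B, sigma_B) + np.log(p_B)])
    return logsumexp(log_terms, axis=0).sum()

x = np.array([0.1, 0.4, 2.9, 3.2])
# Should not decrease from one EM iteration to the next.
print(log_marginal_likelihood(x, 0.0, 1.0, 0.5, 3.0, 1.0, 0.5))
```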
Extending the mixture Model
The Gaussian distribution generalizes to n-dimensions
Consider a two-dimensional model consisting of
independent Gaussian distributions for each dimension
We can transform from scalar to matrix notation for a two-dimensional Gaussian distribution as follows:
N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{2\pi\,|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right), \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}
The mean is simply
\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i
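A small sketch of the two-dimensional case with independent dimensions, i.e. a diagonal covariance matrix, using SciPy's multivariate normal; the mean and variances are made-up values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two-dimensional Gaussian with independent dimensions: the covariance matrix is
# diagonal, so the density factorizes into a product of per-dimension Gaussians.
mu = np.array([0.0, 1.0])        # hypothetical mean vector
cov = np.diag([1.0, 4.0])        # variances 1 and 4, zero correlation

x = np.array([0.5, 2.0])
print(multivariate_normal.pdf(x, mean=mu, cov=cov))

# The sample mean of N draws approximates mu = (1/N) * sum_i x_i.
samples = multivariate_normal.rvs(mean=mu, cov=cov, size=1000, random_state=0)
print(samples.mean(axis=0))
```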
Clustering with correlated attributes
If all attributes are continuous one can simply use a full
covariance Gaussian mixture model
But one needs to estimate n(n + 1)/2 parameters per
mixture component for a full covariance matrix model
As we will see later, principal component analysis (PCA)
can be formulated as a probabilistic model, yielding
probabilistic principal component analysis (PPCA)
Approaches known as mixtures of principal component
analyzers or mixtures of factor analyzers provide ways
of using a much smaller number of parameters to
represent large covariance matrices
Mixtures of Factor Analyzers
The idea is to decompose the covariance matrix M
of each cluster into the form M = WWT + D
W is typically a long and skinny matrix of size n x d,
with as many rows as there are dimensions n of
the input, and as many columns d as there are
dimensions in the reduced dimensional space
Standard PCA corresponds to setting D=0
PPCA corresponds to using the form D = σ²I, where
σ² is a scalar parameter and I is the identity matrix
Factor analysis corresponds to using a diagonal
matrix for D
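A sketch of the parameter saving this buys, with made-up sizes n = 20 and d = 3: the covariance M = WWᵀ + D is built from far fewer numbers than a full covariance matrix.

```python
import numpy as np

n, d = 20, 3                        # 20 input dimensions, 3 latent dimensions (hypothetical)
rng = np.random.default_rng(0)

W = rng.normal(size=(n, d))                        # "long and skinny" loading matrix
D = np.diag(rng.uniform(0.1, 1.0, size=n))         # diagonal noise matrix (factor analysis)

M = W @ W.T + D                                    # full n x n covariance, from few parameters

# Parameter counts per mixture component:
print(n * (n + 1) // 2)   # full covariance matrix: 210 parameters
print(n * d + n)          # W plus the diagonal of D: 80 parameters
```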
Clustering using prior distributions
It would be nice if somehow you could penalize the
model for introducing new parameters
One principled way of doing this is to adopt a fully
Bayesian approach in which every parameter has a
prior probability distribution
Then, whenever a new parameter is introduced, its
prior probability must be incorporated into the overall
likelihood figure
Because this will involve multiplying the overall
likelihood by a number less than one (the prior
probability), it will automatically penalize the addition
of new parameters
To improve the overall likelihood, the new parameters
will have to yield a benefit that outweighs the penalty
Autoclass
A comprehensive clustering scheme based on Bayesian
networks that uses the finite mixture model with prior
distributions on all the parameters
It allows both numeric and nominal attributes, and
uses the EM algorithm to estimate the parameters of
the probability distributions to best fit the data
Because there is no guarantee that the EM algorithm
converges to the global optimum, the procedure is
repeated for several different sets of initial values
Can perform clustering with correlated nominal and
correlated numeric attributes
DensiTrees
Rather than showing just the most likely clustering
to the user, it may be best to present all of them,
weighted by probability.
More fully Bayesian techniques for hierarchical
clustering have been developed that produce as
output a probability distribution over possible
hierarchical structures representing a dataset.
A DensiTree shows the set of all hierarchical trees
for a particular dataset
Such visualizations make it easy for people to grasp
the possible hierarchical clusterings
A DensiTree
For a model with parameters θ and hidden variables H, each iteration of the general EM algorithm performs:
1. E step: compute the posterior distribution over the hidden variables H, given the data and the current parameters θ^old
2. M step: set θ^new to the parameter values that maximize the expected log-likelihood under this posterior
3. If the algorithm has not converged, set θ^old = θ^new and return to step 1
The M step corresponds to maximizing the expected
log-likelihood; the overall procedure maximizes the
marginal likelihood of the data
Although discrete hidden variables are used above, the
approach generalizes to continuous ones
EM for Bayesian networks
Unconditional probability distributions are estimated in
the same way as they would be if the variables Ai had
been observed, but with each observation replaced by
its posterior marginal probability (i.e., marginalizing
over the other variables)
...
x_n^{(i+1)} = \arg\max_{x_n}\; p\big(x_n \mid x_1 = x_1^{(i)}, \ldots, x_{n-1} = x_{n-1}^{(i)}\big)
Often converges quickly to an approximate MPE,
but is prone to local minima
Zero temperature limit of a simulated annealer
Gibbs sampling is a popular special case of the
more general Metropolis-Hastings algorithm
Gibbs sampling and simulated annealing are
special cases of the broader class of Markov
Chain Monte Carlo (MCMC) methods
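A toy Gibbs sampler over a two-variable discrete joint, as a sketch of the idea (the joint table is made up): each variable is repeatedly resampled from its conditional distribution given the current value of the other, and the empirical frequencies approach the true joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint P(x1, x2) over two binary variables (rows: x1, cols: x2).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

def gibbs(n_samples=20000):
    x1, x2 = 0, 0
    counts = np.zeros_like(joint)
    for _ in range(n_samples):
        # Sample x1 from P(x1 | x2): a column of the joint, renormalized.
        p = joint[:, x2] / joint[:, x2].sum()
        x1 = rng.choice(2, p=p)
        # Sample x2 from P(x2 | x1): a row of the joint, renormalized.
        p = joint[x1, :] / joint[x1, :].sum()
        x2 = rng.choice(2, p=p)
        counts[x1, x2] += 1
    return counts / n_samples

print(gibbs())   # empirical frequencies approach the true joint
```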
Variational methods
Rather than sampling from a complex distribution
the distribution can be approximated by a simpler,
more tractable, function
Suppose we have a probability model with a set of
hidden variables H and observed variables X
Let p(H | X) be the exact posterior
Define q(H; F) as the variational approximation,
having its own variational parameters F
Variational methods make q close to p, yielding
principled EM algorithms with approximate posteriors
Bibliographic Notes & Further Reading
Sampling and MCMC
EM for PPCA
[Figure: plate-notation graphical models for the PPCA model]
X \approx U_k S_k V_k^T
where U_k (t \times k) and V_k (d \times k) have orthogonal columns, S_k (k \times k) is a diagonal matrix containing the singular values s_1, \ldots, s_k, usually sorted in decreasing order, and X itself is t \times d (t = #terms, d = #docs, k = #topics)
For every k < d, if all but the k largest
singular values are discarded, the resulting reconstruction
of the data matrix is optimal in a least-squares sense
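A sketch of this truncated SVD with NumPy on a made-up term-document matrix; keeping the k largest singular values gives the best rank-k least-squares reconstruction.

```python
import numpy as np

# Hypothetical term-document count matrix X (t terms x d documents).
X = np.array([[3, 0, 1, 0],
              [2, 0, 0, 1],
              [0, 4, 0, 2],
              [0, 3, 1, 2],
              [1, 0, 2, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                           # number of "topics" to keep
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k reconstruction

# X_k is the best rank-k approximation of X in the least-squares (Frobenius) sense.
print(np.linalg.norm(X - X_k))
```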
Probabilistic LSA
[Figure: plate-notation model over documents d_i and their words]
Smoothed LDA
[Figure: plate-notation model]
Rather than applying LDAb naively when looking for trends over
time, the dynamic topic models of Blei and Lafferty (2006) treat the
temporal evolution of topics explicitly; they examined topical trends
in the journal Science.
Griffiths and Steyvers (2004) used Bayesian model selection to
determine the number of topics in their LDAb analysis of
Proceedings of the National Academy of Sciences abstracts.
Griffiths and Steyvers (2004) and Teh et al. (2006) give more details
on the collapsed Gibbs sampling and variational approaches to
LDAb.
Hierarchical Dirichlet processes (Teh et al., 2006) and related
techniques offer alternative approaches to the problem of determining
the number of topics or clusters in hierarchical Bayesian models.
Factor graphs
Represent functions by factoring them into the
product of local functions, each of which acts on
a subset of the full argument set
F(x_1, \ldots, x_n) = \prod_{j=1}^{S} f_j(X_j)
[Figure: a graphical model over x_1, ..., x_5 and the corresponding factor graph with factors f_A, ..., f_E]
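A minimal sketch of the idea, with a made-up global function over three binary variables factored into three local functions:

```python
# F(x1, x2, x3) = f_A(x1) * f_B(x1, x2) * f_C(x2, x3); the factor tables are illustrative.
f_A = {0: 1.0, 1: 2.0}
f_B = {(0, 0): 0.5, (0, 1): 1.5, (1, 0): 1.0, (1, 1): 0.2}
f_C = {(0, 0): 2.0, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 1.0}

def F(x1, x2, x3):
    # The global function is the product of the local functions.
    return f_A[x1] * f_B[(x1, x2)] * f_C[(x2, x3)]

print(F(1, 0, 1))   # 2.0 * 1.0 * 0.1 = 0.2
```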
Factor graphs for naïve Bayes models
Naïve Bayes models have simple factor graphs
P(y, x_1, \ldots, x_n) = P(y) \prod_{i=1}^{n} P(x_i \mid y)
[Figure: the naïve Bayes model over y and x_1, ..., x_n, shown as a directed graphical model and as the corresponding factor graph]
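A small sketch of evaluating the naïve Bayes joint as this product; the probability tables are hypothetical.

```python
# P(y, x1, ..., xn) = P(y) * prod_i P(xi | y), with made-up tables.
p_y = {0: 0.4, 1: 0.6}
p_x_given_y = [                       # one table per attribute, keyed by (xi, y)
    {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8},
    {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.9, (1, 1): 0.1},
]

def joint(y, xs):
    prob = p_y[y]
    for table, x in zip(p_x_given_y, xs):
        prob *= table[(x, y)]
    return prob

print(joint(1, [1, 0]))   # 0.6 * 0.8 * 0.9 = 0.432
```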
P(x_1, x_2, x_3, x_4) = \frac{1}{Z}\, f_A(x_1)\, f_B(x_2)\, f_C(x_3)\, f_D(x_4)\, f_E(x_1, x_2)\, f_F(x_2, x_3)\, f_G(x_3, x_4)\, f_H(x_4, x_1)
Note we used unary and pairwise potentials
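A sketch of such an MRF with binary variables and made-up potential tables, showing how the normalizing constant Z is (in this tiny case) obtained by brute force:

```python
import itertools
import numpy as np

# Unary factors on x1..x4 and pairwise factors on the edges of the cycle; values are illustrative.
unary = {i: np.array([1.0, 2.0]) for i in range(4)}                 # f_A .. f_D
pair = {e: np.array([[2.0, 0.5], [0.5, 2.0]])                        # f_E .. f_H
        for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}

def unnormalized(x):
    p = 1.0
    for i in range(4):
        p *= unary[i][x[i]]
    for (i, j), table in pair.items():
        p *= table[x[i], x[j]]
    return p

# The partition function Z sums the product over all 2^4 configurations.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=4))
print(unnormalized((1, 1, 0, 0)) / Z)   # normalized probability of one configuration
```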
[Figure: an MRF over x_1, ..., x_4 with observed variables y_1, ..., y_4, shown both as a traditional MRF-style undirected graph and as a factor graph]
MRFs and energy functions
An MRF lattice is often repeated over an image
MRFs can be expressed in terms of
an energy function F(X), where
F(X) = \sum_{u=1}^{U} U(X_u) + \sum_{v=1}^{V} V(X_v), \quad \text{and} \quad P(X) = \frac{1}{Z}\exp\big(-F(X)\big)
A commonly used strategy for tasks such as
image segmentation and entity resolution in
text documents is to minimize an energy
function of the form on the previous slide
When such energy functions are submodular,
an exact minimum can be found using
algorithms based on graph cuts; otherwise,
methods such as tree-reweighted message
passing can be used.
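A toy version of this energy-minimization view for a tiny binary "image", with made-up unary (data) and pairwise (smoothness) terms; brute force stands in here for the graph-cut or message-passing solvers used at realistic sizes.

```python
import itertools
import numpy as np

obs = np.array([[0, 0, 1],
                [0, 1, 1]])        # 2x3 "image" of noisy binary observations
LAMBDA = 0.8                       # smoothness weight (assumed value)

def energy(labels):
    # Unary terms: penalty for disagreeing with the observed data.
    unary = np.sum(labels != obs)
    # Pairwise terms: penalty for neighbouring pixels that disagree.
    pairwise = (np.sum(labels[:, :-1] != labels[:, 1:]) +
                np.sum(labels[:-1, :] != labels[1:, :]))
    return unary + LAMBDA * pairwise

# Brute-force minimization over all 2^6 labelings is feasible only for tiny images.
best = min(itertools.product([0, 1], repeat=int(obs.size)),
           key=lambda flat: energy(np.array(flat).reshape(obs.shape)))
print(np.array(best).reshape(obs.shape))
```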
Example: Image Segmentation
Given a labeled dataset of medical imagery, such as a
computed tomography (CT) image
A well-known approach is to learn a Markov random
field model to segment the image into classes of
interest, e.g., anatomical structures or tumors
Image features are combined with spatial context
Marginal probabilities
[Figure: the example factor graph over x_1, ..., x_5 with factors f_A, ..., f_E]
The sum-product algorithm
Computes exact marginals in tree-structured
factor graphs
Begin with variable or function nodes that have
only one connection (leaf nodes)
Leaf function nodes send the message μ_{f→x}(x) = f(x)
to the variable connected to them
Leaf variable nodes send the message μ_{x→f}(x) = 1
to the function connected to them
Other nodes wait until they have received a
message from all neighbors except the one they will
send a message to
Function to variable messages
When ready, function nodes send messages of the
following form to variable x:
\mu_{f \to x}(x) = \sum_{x_1, \ldots, x_m} f(x, x_1, \ldots, x_m) \prod_{i=1}^{m} \mu_{x_i \to f}(x_i)
where x_1, ..., x_m are the other variables connected to the function f
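A sketch of computing one such function-to-variable message with NumPy, for a hypothetical factor over three binary variables and made-up incoming messages:

```python
import numpy as np

# Factor f connects the recipient variable x to two other binary variables x1 and x2.
f = np.random.default_rng(0).random((2, 2, 2))   # f[x, x1, x2], hypothetical values
mu_x1 = np.array([0.4, 0.6])                     # incoming message from x1 to f
mu_x2 = np.array([0.7, 0.3])                     # incoming message from x2 to f

# mu_{f -> x}(x) = sum_{x1, x2} f(x, x1, x2) * mu_x1(x1) * mu_x2(x2)
msg_f_to_x = np.einsum('abc,b,c->a', f, mu_x1, mu_x2)
print(msg_f_to_x)
```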
Sum-product messages
[Figure: the example factor graph over x_1, ..., x_5 with factors f_A, ..., f_E and the messages labelled 1a-6a, 1b-6b, 1c, 5c, 1d, and 5d, indicating the order in which they are passed]
The messages for our example
P(x_3, x_4) = \sum_{x_1} P(x_3 \mid x_1)\, P(x_1) \sum_{x_2} P(x_4 \mid x_1, x_2)\, P(x_2) \sum_{x_5} P(x_5 \mid x_2)
[Figure: the example factor graph, with braces beneath the sub-expressions above marking the corresponding messages 1a through 6a and the reverse messages 1b through 6b]
The complete
algorithm can yield
all single-variable
marginals in the
graph using the
other messages
shown in the
diagram
Finding the most probable configuration
Finding the most probable configuration of all
other variables in our example, given an observed
value for one of them, involves searching for the
joint assignment for which the conditional
probability is maximized
[Figure: the example factor graph]
Kernelized classification
Bibliographic Notes & Further Reading
Kernel Logistic Regression, and Various Vector Machines