Slides 11
INTRODUCTION TO Machine Learning, 2nd Edition
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT
Press (V1.0)
CHAPTER 7:
Clustering
Clustering: Motivation
● Optical Character Recognition
– Two ways to write 7 (with or without the horizontal bar)
– Can't assume a single distribution
– Mixture of an unknown number of templates
● Compared to classification
– Number of classes is known
– Each training sample has a class label
– Supervised learning
Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Example: Color quantization
● Image: each pixel is represented by a 24-bit color
● Colors come from different distributions (e.g., sky, grass)
● No labels telling us whether each pixel is sky or grass
● Want to use only 256 colors in the palette, yet represent the image as closely as possible to the original
● Quantize uniformly: assign a single color to each of the 256 intervals of size 2^24/256
● This wastes palette entries on rarely occurring intervals
Quantization
● Sample (pixels): $X = \{x^t\}_{t=1}^{N}$
● $k$ reference vectors (palette): $m_i,\ i = 1, \dots, k$
● Select the reference vector for each pixel: $\|x^t - m_i\| = \min_j \|x^t - m_j\|$
● Reference vectors are also called codebook vectors or code words
● Compress the image
● Reconstruction error: $E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \|x^t - m_i\|$ where
$b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$
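To make the encoding and reconstruction-error definitions concrete, here is a minimal NumPy sketch (not from the original slides; the array names pixels and codebook are assumptions for the example):

```python
import numpy as np

def encode(pixels, codebook):
    """Assign each pixel to its nearest reference vector (codebook entry).

    pixels:   (N, 3) array of RGB values
    codebook: (k, 3) array of reference vectors (the palette)
    Returns the index of the nearest codebook vector for each pixel.
    """
    # Squared Euclidean distance from every pixel to every codebook vector
    dists = ((pixels[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def reconstruction_error(pixels, codebook, labels):
    """E({m_i} | X) = sum_t ||x^t - m_{label(t)}||."""
    diffs = pixels - codebook[labels]
    return np.linalg.norm(diffs, axis=1).sum()
```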
Encoding/Decoding
Each pixel $x^t$ is encoded by the index $i$ of its nearest code word and decoded by looking up $m_i$:
$b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$
K-means clustering
● Minimize the reconstruction error
$E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \|x^t - m_i\|$
● Take derivatives with respect to $m_i$ and set them to zero:
$m_i = \dfrac{\sum_t b_i^t x^t}{\sum_t b_i^t}$
● Each reference vector is the mean of all instances it represents
K-Means clustering
● Iterative procedure for finding the reference vectors
● Start with random reference vectors
● Estimate the labels $b_i^t$
● Re-compute the reference vectors as the means of their assigned instances
● Repeat until convergence (see the sketch below)
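A minimal NumPy sketch of this iterative procedure (illustrative only; the function name kmeans, the random initialization, and the convergence test are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate label estimation and mean re-computation.

    X: (N, d) data matrix; returns (means, labels).
    """
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as reference vectors
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Estimate labels: assign each instance to its nearest reference vector
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute each reference vector as the mean of its instances
        new_means = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```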
k-means Clustering
Expectation Maximization:
Learning from Data
We want to learn a model with a set of parameter values θ.
We are given a set of data X.
An approach: choose argmax_θ Pr(X | θ).
This is the maximum likelihood (ML) model.
Super Simple Example
Coin I and Coin II (both biased).
Pick a coin at random (uniformly).
Flip it 4 times.
Repeat.
ML estimate of the heads probability: p = h / (h + t)
Missing Data
Observed flip sequences (which coin produced each is unknown):
HHHT  HTTH
TTTH  HTHH
THTT  HTTT
TTHT  HHHH
THHH  HTHT
Oh Boy, Now What!
If we knew the labels (which flips came from which coin), we could find ML values for p and q.
What could we use to compute the labels? p and q!
Computing Labels
p = 3/4, q = 3/10
Pr(Coin I | HHTH)
  = Pr(HHTH | Coin I) Pr(Coin I) / c
  = (3/4)^3 (1/4) (1/2) / c = 0.052734375 / c
Pr(Coin II | HHTH)
  = Pr(HHTH | Coin II) Pr(Coin II) / c
  = (3/10)^3 (7/10) (1/2) / c = 0.00945 / c
Expected Labels
Sequence   Pr(I)  Pr(II)    Sequence   Pr(I)  Pr(II)
HHHT       .85    .15       HTTH       .44    .56
TTTH       .10    .90       HTHH       .85    .15
THTT       .10    .90       HTTT       .10    .90
TTHT       .10    .90       HHHH       .98    .02
THHH       .85    .15       HTHT       .44    .56
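A small Python sketch (not from the slides) of how these expected labels can be computed from p and q; the helper name coin_posterior is made up for the example:

```python
def coin_posterior(seq, p, q, prior_I=0.5):
    """Pr(Coin I | seq) for a sequence of 'H'/'T' flips.

    p and q are the heads probabilities of Coin I and Coin II.
    """
    h = seq.count("H")
    t = seq.count("T")
    like_I = (p ** h) * ((1 - p) ** t) * prior_I
    like_II = (q ** h) * ((1 - q) ** t) * (1 - prior_I)
    return like_I / (like_I + like_II)   # normalizing constant c cancels

sequences = ["HHHT", "TTTH", "THTT", "TTHT", "THHH",
             "HTTH", "HTHH", "HTTT", "HHHH", "HTHT"]
for s in sequences:
    print(s, round(coin_posterior(s, p=0.75, q=0.3), 2))
```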
Wait, I Have an Idea
Pick some model θ0.
Expectation
● Compute expected labels using θi
Maximization
● Compute the ML model θi+1
Repeat.
Could This Work?
Expectation-Maximization (EM)
Pr(X | θi) will not decrease from one iteration to the next.
Sound familiar? It is a type of local (hill-climbing) search.
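Putting the two steps together for the coin example, a hedged sketch of the EM loop (it reuses the coin_posterior helper from the previous sketch; the starting values and iteration count are arbitrary):

```python
def em_two_coins(sequences, p=0.6, q=0.4, n_iter=50):
    """EM for the two-biased-coins example: alternate expected labels and ML updates."""
    for _ in range(n_iter):
        # E-step: expected label (responsibility of Coin I) for each sequence,
        # using coin_posterior defined in the previous sketch
        w = [coin_posterior(s, p, q) for s in sequences]
        # M-step: ML estimates of p and q from expected heads/flip counts
        h = [s.count("H") for s in sequences]
        n = [len(s) for s in sequences]
        p = sum(wi * hi for wi, hi in zip(w, h)) / sum(wi * ni for wi, ni in zip(w, n))
        q = sum((1 - wi) * hi for wi, hi in zip(w, h)) / sum((1 - wi) * ni for wi, ni in zip(w, n))
    return p, q
```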
Mixture Densities
$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$
● where $G_i$ are the components/groups/clusters, $P(G_i)$ are the mixture proportions (priors), and $p(x \mid G_i)$ are the component densities
● Gaussian mixture: $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$ with parameters $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$, estimated from an unlabeled sample $X = \{x^t\}_t$ (unsupervised learning)
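To make the mixture-density formula concrete, a short illustrative sketch that evaluates p(x) for an invented two-component Gaussian mixture using SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component mixture in 2-D
priors = [0.6, 0.4]                                   # P(G_i)
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mu_i
covs = [np.eye(2), 2.0 * np.eye(2)]                   # Sigma_i

def mixture_density(x):
    """p(x) = sum_i p(x | G_i) P(G_i)."""
    return sum(P * multivariate_normal.pdf(x, mean=m, cov=S)
               for P, m, S in zip(priors, means, covs))

print(mixture_density(np.array([1.0, 1.0])))
```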
Example
Expectation Maximization (EM): Motivation
● Data came from several distributions
● Assume each distribution is known up to its parameters
● If we knew which distribution each data instance came from, we could use parametric estimation
● Introduce unobservable (latent) variables which indicate the source distribution
● Run an iterative process:
– Estimate the latent variables from the data and the current estimate of the distribution parameters
– Use the current values of the latent variables to refine the parameter estimates
EM
● Log likelihood:
$\mathcal{L}(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$
● Assume hidden variables $Z$ which, when known, make the optimization much simpler
● Complete likelihood, $\mathcal{L}_C(\Phi \mid X, Z)$, in terms of $X$ and $Z$
● Incomplete likelihood, $\mathcal{L}(\Phi \mid X)$, in terms of $X$ only
Latent Variables
● Unknown
● Can't compute the complete likelihood $\mathcal{L}_C(\Phi \mid X, Z)$
● Can compute its expected value
E-step: $Q(\Phi \mid \Phi^l) = E\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$
E- and M-steps
28
M-step:Φ l +1
= arg max Q ( Φ|Φ l
)
Φ
Example: Mixture of Gaussians
● Data came from a mixture of Gaussians
● Maximize the likelihood assuming we know the latent "indicator variables"
● E-step: compute the expected values of the indicator variables
[Figure: contour where P(G1 | x) = h1 = 0.5]
EM for Gaussian mixtures
● Assume all groups/clusters are Gaussians
● Multivariate, with uncorrelated features
● Same variance in every dimension
● Harden the indicators
– EM: expected values are between 0 and 1
– k-means: 0 or 1
● Under these assumptions, EM reduces to k-means (a sketch of the general Gaussian-mixture EM follows)
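For reference, a compact NumPy/SciPy sketch of EM for a general Gaussian mixture (restricting the covariances to be diagonal with equal variances would give the uncorrelated, same-variance case above); this is an illustrative implementation, not the book's code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a Gaussian mixture: returns priors P(G_i), means mu_i, covariances Sigma_i."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    priors = np.full(k, 1.0 / k)
    means = X[rng.choice(N, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities h_i^t = P(G_i | x^t)
        h = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, mean=means[i], cov=covs[i])
            for i in range(k)
        ])
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, and covariances from responsibilities
        Nk = h.sum(axis=0)
        priors = Nk / N
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return priors, means, covs
```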
Dimensionality Reduction vs. Clustering
● Dimensionality reduction methods find correlations between features and group features
– Age and income are correlated
● Clustering methods find similarities between instances and group instances
– Customers A and B are from the same cluster
Clustering: Usage for supervised
learning
● Describe data in terms of clusters
– Represent all data in a cluster by the cluster mean
– Or by the range of its attributes
● Map data into a new space (preprocessing), as sketched below
– d: dimensionality of the original space
– k: number of clusters
– Use the indicator (membership) variables as the new data representation
– k may be larger than d
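A brief sketch of that preprocessing idea (illustrative; it assumes cluster means such as those returned by the kmeans sketch earlier):

```python
import numpy as np

def cluster_features(X, means):
    """Map each d-dimensional instance to a k-dimensional indicator vector
    (1 for the nearest cluster, 0 elsewhere)."""
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    onehot = np.zeros((len(X), len(means)))
    onehot[np.arange(len(X)), labels] = 1.0
    return onehot

# e.g. means, _ = kmeans(X, k=10); Z = cluster_features(X, means)
# Z can now be fed to a supervised learner in place of (or alongside) X.
```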
Mixture of Mixtures
● In classification, the input comes from a mixture of classes (supervised).
● If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
$p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})$
$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$
Hierarchical Clustering
● Probabilistic view
– Fit a mixture model to the data
– Find code words minimizing the reconstruction error
● Hierarchical clustering
– Group similar items together
– No specific model/distribution assumed
– Items within a group are more similar to each other than to items in different groups
Hierarchical Clustering
● City-block distance:
$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} \left| x_j^r - x_j^s \right|$
Agglomerative Clustering
● Start with clusters each containing a single point
● At each step, merge the two most similar clusters (see the sketch after this list)
● Measures of similarity
– Minimal distance (single link): distance between the closest points of the two groups
– Maximal distance (complete link): distance between the most distant points of the two groups
– Average distance: distance between the group centers
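A short SciPy-based sketch of single-link agglomerative clustering with the city-block distance (illustrative; the sample data is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D data: two loose groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Single link: merge clusters with the minimal distance between closest points
Z = linkage(X, method="single", metric="cityblock")

# Cut the dendrogram to obtain 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```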
Example: Single-Link Clustering
[Figure: dendrogram]
Choosing k
● Defined by the application, e.g., image quantization
● Plot the data in two dimensions using PCA and inspect it visually
● Incremental (leader-cluster) algorithm: add clusters one at a time until an "elbow" appears in the reconstruction error, log likelihood, or intergroup distances (a sketch of the elbow inspection follows)
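An illustrative sketch of the elbow inspection, reusing the kmeans and reconstruction_error functions sketched earlier (the data matrix X is assumed to be an (N, d) array):

```python
import numpy as np

def elbow_curve(X, k_max=10):
    """Reconstruction error as a function of k; look for the 'elbow' where
    adding more clusters stops paying off."""
    errors = []
    for k in range(1, k_max + 1):
        means, labels = kmeans(X, k)                       # sketched earlier
        errors.append(reconstruction_error(X, means, labels))
    return errors

# errs = elbow_curve(X); then plot range(1, len(errs) + 1) against errs
```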