
Lecture Slides for
INTRODUCTION TO Machine Learning, 2nd Edition
ETHEM ALPAYDIN, modified by Leonardo Bobadilla

and some parts from
http://www.cs.tau.ac.il/~apartzin/MachineLearning/
and
www.cs.princeton.edu/courses/archive/fall01/cs302/notes/11.../EM.ppt

© The MIT Press, 2010 [email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Outline

 Previous class: Ch 6, Dimensionality Reduction
 This class: Ch 7, Clustering
CHAPTER 7:

Clustering
Clustering: Motivation

 Optical Character Recognition
 – There are two ways to write "7" (with or without the horizontal bar)
 – We cannot assume a single distribution per class
 – Each class is a mixture of an unknown number of templates

 Compared to classification:
 – The number of classes is known
 – Each training sample has a class label
 – Classification is supervised learning; clustering is unsupervised

Example: Color Quantization

 Image: each pixel is represented by a 24-bit color.
 Colors come from different distributions (e.g., sky, grass).
 We have no label saying whether each pixel is sky or grass.
 We want to use only 256 colors in the palette to represent the image as closely as possible to the original.
 Quantizing uniformly would assign a single color to each of the 2^24/256 intervals (see the sketch below).
 This wastes palette values on rarely occurring intervals.
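
As a concrete illustration of the uniform alternative, here is a minimal NumPy sketch (the function name and the packed-integer color representation are assumptions for illustration): it splits the 2^24 color values into 256 equal intervals regardless of which colors actually occur in the image, which is exactly why rare intervals waste palette entries.

```python
import numpy as np

def uniform_quantize(packed_colors):
    """Map packed 24-bit colors to 256 equal-width intervals (uniform palette)."""
    interval = (1 << 24) // 256               # 2^24 / 256 = 65536 colors per interval
    index = packed_colors // interval         # palette index in 0..255
    return index * interval + interval // 2   # represent each interval by its midpoint

pixels = np.array([0x102030, 0x1020FF, 0xFF00FF])
print(uniform_quantize(pixels))
```
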
Quantization

 Sample (pixels): $X = \{x^t\}_{t=1}^{N}$
 $k$ reference vectors (palette): $m_1, \dots, m_k$
 Select the reference vector for each pixel as the closest one:
   $\|x^t - m_i\| = \min_j \|x^t - m_j\|$
 Reference vectors are also called codebook vectors or code words.
 Compressing the image this way incurs the reconstruction error
   $E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \,\|x^t - m_i\|^2$
 with assignment indicators
   $b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$
 (see the sketch below)
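
A minimal NumPy sketch of this encoding step and the resulting reconstruction error (array names are illustrative; the palette m is assumed to be given):

```python
import numpy as np

def encode(X, m):
    """Assign each sample x^t to its nearest reference vector m_i.

    X: (N, d) samples (e.g., pixel colors), m: (k, d) codebook vectors.
    Returns the chosen code-word index per sample and the reconstruction error.
    """
    dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # (N, k) distances
    labels = dists.argmin(axis=1)                                  # b_i^t as indices
    error = np.sum(dists[np.arange(len(X)), labels] ** 2)          # sum of squared distances
    return labels, error
```
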
Encoding/Decoding

 $b_i^t = \begin{cases} 1 & \text{if } \|x^t - m_i\| = \min_j \|x^t - m_j\| \\ 0 & \text{otherwise} \end{cases}$
K-means Clustering

 Minimize the reconstruction error
   $E(\{m_i\}_{i=1}^{k} \mid X) = \sum_t \sum_i b_i^t \,\|x^t - m_i\|^2$
 Take the derivative with respect to each $m_i$ and set it to zero:
   $m_i = \dfrac{\sum_t b_i^t x^t}{\sum_t b_i^t}$
 Each reference vector is the mean of all the instances it represents.
K-means Clustering: Algorithm

 Iterative procedure for finding the reference vectors:
 – Start with random reference vectors
 – Estimate the labels $b_i^t$
 – Re-compute the reference vectors as the means of the instances they represent
 – Continue until convergence
 (see the sketch below)
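
A compact sketch of this iterative procedure in NumPy (illustrative, not the book's reference implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate label estimation and mean re-computation."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initial reference vectors
    for _ in range(n_iter):
        # Estimate labels b: nearest reference vector for each instance
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Re-compute reference vectors as the means of their assigned instances
        new_m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):   # converged
            break
        m = new_m
    return m, labels
```
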
k-means Clustering

 [Figures from the original slides 10-11 omitted in text extraction]
Expectation Maximization: Learning from Data

 We want to learn a model with a set of parameter values $\theta$.
 We are given a set of data X.
 An approach: $\arg\max_\theta \Pr(X \mid \theta)$.
 This is the maximum-likelihood (ML) model.
Super Simple Example

 Two biased coins, Coin I and Coin II.
 Pick a coin at random (uniformly).
 Flip it 4 times.
 Repeat.

 What are the parameters of the model?

Data

 Coin I   Coin II
 HHHT     TTTH
 HTHH     THTT
 HTTH     TTHT
 THHH     HTHT
 HHHH     HTTT
Probability of X Given θ

 p: probability of H from Coin I
 q: probability of H from Coin II

 Say Coin I produced h heads and t tails, and Coin II produced h' heads and t' tails.
 $\Pr(X \mid \theta) = p^{h}(1-p)^{t}\, q^{h'}(1-q)^{t'}$
 How do we maximize this quantity?
Maximizing p

 Use maximum likelihood: $p = h/(h+t)$, and similarly $q = h'/(h'+t')$ (see the sketch below).
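
A tiny sketch of this estimate applied to the labelled data from the earlier slide; it reproduces the p = 3/4 and q = 3/10 used on the following slides.

```python
def mle_heads(flips):
    """Maximum-likelihood estimate of P(H): fraction of heads, h / (h + t)."""
    h = sum(seq.count("H") for seq in flips)
    n = sum(len(seq) for seq in flips)
    return h / n

coin1 = ["HHHT", "HTHH", "HTTH", "THHH", "HHHH"]   # flips labelled Coin I
coin2 = ["TTTH", "THTT", "TTHT", "HTHT", "HTTT"]   # flips labelled Coin II
print(mle_heads(coin1), mle_heads(coin2))          # 0.75 0.3  ->  p = 3/4, q = 3/10
```
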
Missing Data
HHHT HTTH
TTTH HTHH
THTT HTTT
TTHT HHHH
THHH HTHT
Oh Boy, Now What!

 If we knew the labels (which flips came from which coin), we could find the ML values for p and q.
 What could we use to label the flips?
 p and q!
Computing Labels

 Suppose p = 3/4 and q = 3/10.
 Pr(Coin I | HHTH)
   = Pr(HHTH | Coin I) Pr(Coin I) / c
   = (3/4)^3 (1/4) (1/2) / c = 0.052734375 / c
 Pr(Coin II | HHTH)
   = Pr(HHTH | Coin II) Pr(Coin II) / c
   = (3/10)^3 (7/10) (1/2) / c = 0.00945 / c
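
A short sketch of this Bayes-rule computation; the normalising constant c is the sum of the two joint terms, so it cancels when we normalise.

```python
def posterior_coin1(seq, p=0.75, q=0.3, prior=0.5):
    """Pr(Coin I | seq) for the two-coin model."""
    h, t = seq.count("H"), seq.count("T")
    joint1 = p**h * (1 - p)**t * prior          # Pr(seq | Coin I) Pr(Coin I)
    joint2 = q**h * (1 - q)**t * (1 - prior)    # Pr(seq | Coin II) Pr(Coin II)
    return joint1 / (joint1 + joint2)           # dividing by c = joint1 + joint2

print(posterior_coin1("HHTH"))   # ~0.848, matching the .85 entries in the table below
```
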
Expected Labels

          I     II               I     II
 HHHT    .85   .15      HTTH   .44   .56
 TTTH    .10   .90      HTHH   .85   .15
 THTT    .10   .90      HTTT   .10   .90
 TTHT    .10   .90      HHHH   .98   .02
 THHH    .85   .15      HTHT   .44   .56
Wait, I Have an Idea

 Pick some model $\theta_0$.

 Expectation:
 – Compute expected labels via $\theta_i$

 Maximization:
 – Compute the ML model $\theta_{i+1}$

 Repeat.
Could This Work?

 Yes: this is Expectation-Maximization (EM).
 $\Pr(X \mid \theta_i)$ will not decrease from iteration to iteration.
 Sound familiar? It is a type of (local) search. (A sketch of the full loop on the coin data follows.)
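
Putting the two steps together for the unlabelled coin data, a minimal sketch (the uniform prior on the coins is kept fixed; the starting values for p and q are arbitrary):

```python
def em_two_coins(seqs, p=0.6, q=0.4, n_iter=50):
    """EM for two biased coins with unlabelled flips."""
    n = len(seqs[0])                      # flips per sequence (4 here)
    heads = [s.count("H") for s in seqs]
    for _ in range(n_iter):
        # E-step: expected label (weight of Coin I) for each sequence
        w = []
        for h in heads:
            a = p**h * (1 - p)**(n - h) * 0.5
            b = q**h * (1 - q)**(n - h) * 0.5
            w.append(a / (a + b))
        # M-step: weighted maximum-likelihood re-estimates of p and q
        p = sum(wi * h for wi, h in zip(w, heads)) / (n * sum(w))
        q = sum((1 - wi) * h for wi, h in zip(w, heads)) / (n * sum(1 - wi for wi in w))
    return p, q

data = ["HHHT", "TTTH", "THTT", "TTHT", "THHH",
        "HTTH", "HTHH", "HTTT", "HHHH", "HTHT"]
print(em_two_coins(data))   # climbs to a local maximum of Pr(X | theta)
```
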
Mixture Densities

 $p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$
 where $G_i$ are the components/groups/clusters,
 $P(G_i)$ are the mixture proportions (priors), and
 $p(x \mid G_i)$ are the component densities.

 Gaussian mixture: $p(x \mid G_i) \sim \mathcal{N}(\mu_i, \Sigma_i)$, with parameters $\Phi = \{P(G_i), \mu_i, \Sigma_i\}_{i=1}^{k}$
 estimated from an unlabeled sample $X = \{x^t\}_t$ (unsupervised learning).
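
A small sketch of evaluating such a mixture density with SciPy (the two components and their parameter values are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

priors = [0.6, 0.4]                                     # P(G_i)
means  = [np.zeros(2), np.array([3.0, 3.0])]            # mu_i
covs   = [np.eye(2), 2.0 * np.eye(2)]                   # Sigma_i

def mixture_density(x):
    """p(x) = sum_i p(x | G_i) P(G_i) for a Gaussian mixture."""
    return sum(P * multivariate_normal.pdf(x, mean=mu, cov=S)
               for P, mu, S in zip(priors, means, covs))

print(mixture_density([1.0, 1.0]))
```
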
Example

 [Figure from the original slides omitted in text extraction]
Expectation Maximization (EM): Motivation

 Data came from several distributions.
 Assume each distribution is known up to its parameters.
 If we knew which distribution each data instance came from, we could use parametric estimation.
 Introduce unobservable (latent) variables that indicate the source distribution.
 Run an iterative process:
 – Estimate the latent variables from the data and the current estimate of the distribution parameters
 – Use the current values of the latent variables to refine the parameter estimates
EM

 Log likelihood of a mixture model:
   $\mathcal{L}(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$

 Assume hidden variables Z which, when known, make the optimization much simpler.
 Complete likelihood, $\mathcal{L}_C(\Phi \mid X, Z)$, in terms of X and Z.
 Incomplete likelihood, $\mathcal{L}(\Phi \mid X)$, in terms of X only.
Latent Variables

 Z is unknown, so the complete likelihood $\mathcal{L}_C(\Phi \mid X, Z)$ cannot be computed.
 We can compute its expected value instead.
 E-step: $Q(\Phi \mid \Phi^l) = E\!\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$
E- and M-steps

 Iterate the two steps:
 1. E-step: Estimate Z given X and the current $\Phi^l$
 2. M-step: Find the new $\Phi^{l+1}$ given Z, X, and the old $\Phi^l$

 E-step: $Q(\Phi \mid \Phi^l) = E\!\left[\mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l\right]$
 M-step: $\Phi^{l+1} = \arg\max_{\Phi} Q(\Phi \mid \Phi^l)$
Example: Mixture of Gaussians

 Data came from a mixture of Gaussians.
 Maximize the likelihood assuming we knew the latent "indicator variables".
 E-step: compute the expected values of the indicator variables, i.e. the posteriors $h_i^t = P(G_i \mid x^t, \Phi^l)$.
 [Figure: example where $P(G_1 \mid x) = h_1 = 0.5$]
EM for Gaussian Mixtures

 Assume all groups/clusters are Gaussians.
 If the components are multivariate but uncorrelated, all share the same variance, and we harden the indicators:
 – EM: expected values $h_i^t$ are between 0 and 1
 – k-means: labels $b_i^t$ are 0 or 1
 then the procedure becomes the same as k-means (see the sketch below).
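
A compact NumPy/SciPy sketch of the E- and M-steps for a Gaussian mixture (general covariances, with a small ridge for numerical stability; an illustrative implementation, not the book's code). Replacing the soft h values by hard 0/1 assignments and keeping shared spherical covariances recovers k-means.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a mixture of Gaussians: soft responsibilities h_i^t instead of 0/1 labels."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    priors = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)].copy()
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: h[t, i] = P(G_i | x^t), a value between 0 and 1
        h = np.column_stack([priors[i] * multivariate_normal.pdf(X, means[i], covs[i])
                             for i in range(k)])
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means, and covariances
        Nk = h.sum(axis=0)
        priors = Nk / n
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return priors, means, covs, h
```
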
Dimensionality Reduction vs. Clustering

 Dimensionality reduction methods find correlations between features and group features.
 – e.g., age and income are correlated

 Clustering methods find similarities between instances and group instances.
 – e.g., customers A and B are in the same cluster
Clustering: Usage for Supervised Learning

 Describe the data in terms of clusters:
 – Represent all data in a cluster by the cluster mean
 – Or by the range of its attributes

 Map the data into a new space (preprocessing):
 – d: dimension of the original space
 – k: number of clusters
 – Use the cluster indicator variables as the data representation (see the sketch below)
 – k might be larger than d
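
A small sketch of this preprocessing idea: re-represent each d-dimensional instance by its k cluster memberships (the soft variant uses an illustrative distance-based weighting, not something prescribed by the book):

```python
import numpy as np

def cluster_features(X, centers, soft=False):
    """Map (N, d) data to an (N, k) representation of cluster memberships."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    if soft:
        w = np.exp(-d)                                   # one simple soft weighting
        return w / w.sum(axis=1, keepdims=True)
    b = np.zeros_like(d)
    b[np.arange(len(X)), d.argmin(axis=1)] = 1.0         # hard indicator variables b_i^t
    return b
```
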
Mixture of Mixtures

 In classification, the input comes from a mixture of classes (supervised).
 If each class is also a mixture, e.g. of Gaussians (unsupervised), we have a mixture of mixtures:
   $p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})$
   $p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$
Hierarchical Clustering

 Probabilistic view:
 – Fit a mixture model to the data
 – Find code words minimizing the reconstruction error

 Hierarchical clustering:
 – Group similar items together
 – No specific model/distribution is assumed
 – Items within a group are more similar to each other than to items in different groups
Hierarchical Clustering: Distance Measures

 Minkowski ($L_p$) distance (Euclidean for p = 2):
   $d_m(x^r, x^s) = \left[\sum_{j=1}^{d} |x_j^r - x_j^s|^p\right]^{1/p}$

 City-block distance:
   $d_{cb}(x^r, x^s) = \sum_{j=1}^{d} |x_j^r - x_j^s|$
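
These two distances in a few lines of NumPy:

```python
import numpy as np

def minkowski(xr, xs, p=2):
    """L_p distance; p = 2 is Euclidean, p = 1 is the city-block distance."""
    return np.sum(np.abs(xr - xs) ** p) ** (1.0 / p)

xr, xs = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(xr, xs, p=2))   # 5.0
print(minkowski(xr, xs, p=1))   # 7.0
```
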
Agglomerative Clustering

 Start with clusters each containing a single point.
 At each step, merge the two most similar clusters.

 Measures of similarity between clusters:
 – Minimal distance (single link): distance between the closest points of the two groups
 – Maximal distance (complete link): distance between the most distant points of the two groups
 – Average distance (average link): average of the pairwise distances between the two groups; the centroid variant uses the distance between the group centers
 (A sketch using these linkage criteria follows.)
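
A short sketch using SciPy's hierarchical clustering routines, which implement these linkage criteria (the toy data are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),    # one tight group near the origin
               rng.normal(3.0, 0.3, (10, 2))])   # another near (3, 3)

# method='single' is minimal distance, 'complete' maximal, 'average' average pairwise
Z = linkage(X, method='single')
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into two clusters
print(labels)
```
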
Example: Single-Link Clustering

 [Figure: dendrogram produced by single-link clustering]
Choosing k

 k may be defined by the application, e.g., the palette size in image quantization.
 Plot the data in two dimensions using PCA and inspect it visually.
 Incremental (leader-cluster) algorithm: add clusters one at a time and watch for an "elbow" in the reconstruction error, log likelihood, or intergroup distances (see the sketch below).
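
A minimal sketch of the elbow procedure using scikit-learn's k-means (an assumption of this sketch; any k-means implementation, including the one earlier in these notes, would do): compute the reconstruction error for increasing k, plot it, and look for the point where the improvement levels off.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_max=10):
    """Reconstruction error (inertia) for k = 1..k_max; inspect the curve for the elbow."""
    ks = list(range(1, k_max + 1))
    errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    return ks, errors
```
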
