ML.5-Clustering Techniques (Week 9)

Nhân bản – Phụng sự – Khai phóng

Chapter 5

Clustering Techniques
Machine Learning
CONTENTs

• Clustering Problems

• K-Means

• DBSCAN

• Gaussian Mixtures

Machine Learning 2
CONTENTs

•Clustering Problems
• K-Means

• DBSCAN

• Gaussian Mixtures

Machine Learning 3
Clustering Problem
• Unsupervised learning
• Sometimes the data form clusters, where examples within a cluster are similar to each other and examples in different clusters are dissimilar.

• Grouping data points into clusters, with no labels, is called clustering.

Machine Learning 4
Clustering Problem

• Assume the data $\{x^{(1)}, \ldots, x^{(N)}\}$ lives in a Euclidean space, $x^{(n)} \in \mathbb{R}^d$.

• Assume the data belongs to K classes (patterns).
• Assume data points from the same class are similar, i.e., close in Euclidean distance.
How can we identify those classes (which data points belong to each class)? A small synthetic example follows below.
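To make the setup concrete, here is a small synthetic example (an illustrative sketch, not part of the original slides; it assumes scikit-learn and matplotlib are available): it generates three well-separated 2-D blobs and plots them without using the labels.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Three well-separated groups of 2-D points; the returned labels are ignored,
# since clustering has to work without them.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
plt.scatter(X[:, 0], X[:, 1], s=10)
plt.show()
```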

Machine Learning 5
CONTENTs

• Clustering Problems

•K-Means
• DBSCAN

• Gaussian Mixtures

Machine Learning 6
K-means

• Initialization: randomly initialize cluster centers


• The algorithm iteratively alternates between two steps (a minimal sketch follows below):
• Assignment step: assign each data point to the closest cluster center.
• Refitting step: move each cluster center to the center of gravity (mean) of the data points assigned to it.
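A minimal NumPy sketch of this alternation (an illustrative implementation under the assumption that X is an (N, d) array; it is not the course's reference code):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: alternate the assignment and refitting steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Refitting step: each center moves to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Calling kmeans(X, k=3) returns the final centers and the cluster index of every point.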

Machine Learning 7
K-means

• K-means assumes there are K clusters, and each point is close to its cluster center (the mean of the points in the cluster).
• If we knew the cluster assignments, we could easily compute the means.
• If we knew the means, we could easily compute the cluster assignments.
A chicken-and-egg problem! One can show that finding the exact optimum is NP-hard.
• A very simple (and useful) heuristic: start randomly and alternate between the two steps!

Machine Learning 8
K-means

Machine Learning 9
K-means

Machine Learning 10
K-means
• Finding the Optimal Number of Clusters

• Bad choices for the number of clusters

Machine Learning 11
K-means
• Finding the Optimal Number of Clusters

• Selecting the number of clusters k using the “elbow rule” (sketched below)
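A sketch of the elbow rule, assuming scikit-learn's KMeans and an already-loaded data matrix X (both are assumptions, not given on the slide):

```python
from sklearn.cluster import KMeans

# X is assumed to be an (n_samples, n_features) array that is already loaded.
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances
# Plot k against inertia; the "elbow" where the curve stops dropping sharply
# is a reasonable choice for the number of clusters.
```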

Machine Learning 12
K-means
• Limits of K-Means
• K-Means does not behave very well when the clusters have varying sizes, different densities, or non-spherical shapes.

K-Means fails to cluster these ellipsoidal blobs properly

Machine Learning 13
CONTENTs

• Clustering Problems

• K-Means

•DBSCAN
• Gaussian Mixtures

Machine Learning 14
DBSCAN
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
• Core, Border, and Noise points

Machine Learning 15
DBSCAN
• Clusters as continuous regions of high density.
• DBSCAN algorithm:
• For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) of it. This region is called the instance's ε-neighborhood.
• If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance.
• All instances in the neighborhood of a core instance belong to the same cluster.
• Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly (noise). A usage sketch follows below.
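A usage sketch with scikit-learn's DBSCAN (the data matrix X and the values of eps and min_samples are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# X is assumed to be an (n_samples, n_features) array that is already loaded;
# eps and min_samples are illustrative values, not recommendations.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                          # cluster index per point; -1 marks anomalies
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # True for core instances
```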

Machine Learning 16
DBSCAN: Algorithm

• Let ClusterCount = 0. For every point p:
• 1. If p is not a core point, assign a null label to it [e.g., zero].
• 2. If p is a core point, a new cluster is formed [with label ClusterCount := ClusterCount + 1].
Then find all points density-reachable from p and assign them to this cluster [reassign the zero labels, but not the others].
• Repeat this process until all of the points have been visited.
(Since all the zero labels of border points have been reassigned in step 2, the remaining points with a zero label are noise.)

Machine Learning 17
DBSCAN: Complexity

• Time complexity: O(n²), since for each point it has to be determined whether it is a core point; this can be reduced to O(n·log n) in lower-dimensional spaces by using efficient spatial index structures (n is the number of objects to be clustered).
• Space complexity: O(n).

Machine Learning 18
DBSCAN

Machine Learning 19
DBSCAN

• DBSCAN clustering using two different neighborhood radii

Machine Learning 20
DBSCAN: Optimal Eps
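One common heuristic for choosing ε, sketched here as an illustrative example (it assumes scikit-learn and a data matrix X): sort every point's distance to its k-th nearest neighbor, with k playing the role of min_samples, and read ε off the knee of the resulting curve.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# X is assumed to be an (n_samples, n_features) array; k plays the role of min_samples.
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)      # note: each point counts itself as its first neighbor
k_dist = np.sort(dists[:, -1])   # distance to the k-th neighbor, sorted ascending
# Plot k_dist and read eps off the sharpest bend ("knee") of the curve.
```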

Machine Learning 21
CONTENTs

• Clustering Problems

• K-Means

• DBSCAN

•Gaussian Mixtures

Machine Learning 22
Gaussian Bayes Classifier Reminder

Machine Learning 23
Predicting wealth from age

Machine Learning 24
Learning modelyear, mpg ---> maker

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$

Machine Learning 25
General: O(m²) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$

Machine Learning 26
Aligned: O(m) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$

Machine Learning 27
Spherical: O(1) covariance parameters

$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma^2 \end{pmatrix} = \sigma^2 I$$

Machine Learning 29
Making a Classifier from a Density Estimator

By input type (categorical inputs only / real-valued inputs only / mixed real & categorical):

• Classifier (inputs → predicted category): Joint BC and Naïve BC / Gauss BC / Dec Tree
• Density Estimator (inputs → probability): Joint DE and Naïve DE / Gauss DE / (none listed)
• Regressor (inputs → predicted real number): (none listed on this slide)

Machine Learning 31
Next… back to Density Estimation

What if we want to do density estimation with multimodal or clumpy data?

Machine Learning 32
The GMM assumption

• There are k components. The i'th component is called wi.
• Component wi has an associated mean vector μi.

Machine Learning 33
The GMM assumption

• There are k components. The i'th component is called wi.
• Component wi has an associated mean vector μi.
• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I.

Assume that each datapoint is generated according to the following recipe:

Machine Learning 34
The GMM assumption

• There are k components. The i'th component is called wi.
• Component wi has an associated mean vector μi.
• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I.

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(wi).

Machine Learning 35
The GMM assumption

• There are k components. The i'th component is called wi.
• Component wi has an associated mean vector μi.
• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I.

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(wi).
2. Datapoint ~ N(μi, σ²I)
Machine Learning 36
The General GMM assumption

• There are k components. The i'th component is called wi.
• Component wi has an associated mean vector μi.
• Each component generates data from a Gaussian with mean μi and covariance matrix Σi.

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(wi).
2. Datapoint ~ N(μi, Σi)  (a sampling sketch follows below)
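A small NumPy sketch of this generative recipe (illustrative only; the mixture parameters priors, mus and Sigmas are assumed to be given):

```python
import numpy as np

def sample_gmm(n, priors, mus, Sigmas, seed=0):
    """Draw n points by the recipe above (priors, mus, Sigmas are assumed
    to be given arrays of mixture parameters, with priors summing to 1)."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(priors), size=n, p=priors)              # step 1: pick a component
    return np.array([rng.multivariate_normal(mus[i], Sigmas[i])    # step 2: x ~ N(mu_i, Sigma_i)
                     for i in comps])
```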
Machine Learning 37
Unsupervised Learning: not as hard as it looks

Sometimes easy.
Sometimes impossible.
And sometimes in between.

(In case you're wondering what these diagrams are, they show 2-d unlabeled data (x vectors) distributed in 2-d space; the top one has three very clear Gaussian centers.)

Machine Learning 38
Computing likelihoods in unsupervised case
We have x1, x2, …, xN.
We know P(w1), P(w2), …, P(wk).
We know σ.

P(x | wi, μ1, …, μk) = the probability that an observation from class wi would have value x, given the class means μ1, …, μk.

Can we write an expression for that?

39
Likelihoods in the unsupervised case
We have x1, x2, …, xn.
We have P(w1), …, P(wk). We have σ.
We can define, for any x, P(x | wi, μ1, μ2, …, μk).

Can we define P(x | μ1, μ2, …, μk)?

Can we define P(x1, x2, …, xn | μ1, μ2, …, μk)?

[Yes, if we assume the xi's were drawn independently.]
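Written out, these are the same expressions that reappear on the later "Back to Unsupervised Learning of GMMs" slide:

$$P(x \mid \mu_1,\ldots,\mu_k) = \sum_{j=1}^{k} P(x \mid w_j, \mu_1,\ldots,\mu_k)\,P(w_j)$$

$$P(x_1,\ldots,x_n \mid \mu_1,\ldots,\mu_k) = \prod_{i=1}^{n} P(x_i \mid \mu_1,\ldots,\mu_k)$$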

40
Unsupervised Learning: Mediumly Good News

We now have a procedure such that, if you give me a guess at μ1, μ2, …, μk, I can tell you the probability of the unlabeled data given those μ's.

Suppose the x's are 1-dimensional. (From Duda and Hart.)

There are two classes; w1 and w2


P(w1) = 1/3 P(w2) = 2/3 σ=1.
There are 25 unlabeled datapoints
x1 = 0.608
x2 = -1.590
x3 = 0.235
x4 = 3.949
:
x25 = -0.712

Machine Learning 41
Duda & Hart’s Example
Graph of log P(x1, x2, …, x25 | μ1, μ2) against μ1 and μ2.

Max likelihood = (μ1 = -2.13, μ2 = 1.668).

There is also a local (but not global) maximum, very close to the global one, at (μ1 = 2.085, μ2 = -1.257)*.
* It corresponds to switching w1 and w2.
Machine Learning 42
Duda & Hart’s Example
We can graph the probability density function of the data given our μ1 and μ2 estimates.

We can also graph the true function from which the data was randomly generated.

• They are close. Good.
• The 2nd solution tries to put the “2/3” hump where the “1/3” hump should go, and vice versa.
• In this example unsupervised learning is almost as good as supervised. If x1 … x25 are given the class labels that were used to generate them, then the supervised result is (μ1 = -2.176, μ2 = 1.684). Unsupervised got (μ1 = -2.13, μ2 = 1.668).

Machine Learning 43
Finding the max likelihood μ1, μ2, …, μk
We can compute P(data | μ1, μ2, …, μk).
How do we find the μi's which give the maximum likelihood?

• The normal max likelihood trick: set
$$\frac{\partial}{\partial \mu_i} \log P(\text{data} \mid \mu_1,\ldots,\mu_k) = 0$$
and solve for the μi's. Here you get non-linear, non-analytically-solvable equations.
• Use gradient descent: slow but doable.
• Use a much faster, cuter, and recently very popular method…

44
Expectation Maximization

Machine Learning 45
The E.M. Algorithm
• We’ll get back to unsupervised learning soon.
• But now we’ll look at an even simpler case with hidden information.
• The EM algorithm
❑ Can do trivial things, such as the contents of the next few slides.
❑ An excellent way of doing our unsupervised learning problem, as we’ll see.
❑ Many, many other uses, including inference of Hidden Markov Models (future
lecture).

46
Silly Example
Let events be “grades in a class”
w1 = Gets an A P(A) = ½
w2 = Gets a B P(B) = μ
w3 = Gets a C P(C) = 2μ
w4 = Gets a D P(D) = ½-3μ
(Note 0 ≤ μ ≤1/6)
Assume we want to estimate μ from data. In a given class there were
a A’s
b B’s
c C’s
d D’s
What’s the maximum likelihood estimate of μ given a,b,c,d ?

47
Trivial Statistics
P(A) = ½,  P(B) = μ,  P(C) = 2μ,  P(D) = ½ - 3μ

$$P(a,b,c,d \mid \mu) = K\,(\tfrac{1}{2})^{a}\,\mu^{b}\,(2\mu)^{c}\,(\tfrac{1}{2}-3\mu)^{d}$$
$$\log P(a,b,c,d \mid \mu) = \log K + a\log\tfrac{1}{2} + b\log\mu + c\log 2\mu + d\log(\tfrac{1}{2}-3\mu)$$

For the max-likelihood μ, set ∂ log P / ∂μ = 0:
$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{\tfrac{1}{2}-3\mu} = 0$$

This gives the max-likelihood estimate
$$\mu = \frac{b+c}{6\,(b+c+d)}$$

So if the class got A = 14, B = 6, C = 9, D = 10, then the max-likelihood μ = 1/10.

Machine Learning 49
Same Problem with Hidden Information
Someone tells us that:
• Number of High grades (A's + B's) = h
• Number of C's = c
• Number of D's = d

(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ.)

What is the max-likelihood estimate of μ now?

Machine Learning 50
Same Problem with Hidden Information
Someone tells us that:
• Number of High grades (A's + B's) = h
• Number of C's = c
• Number of D's = d

(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ.)

What is the max-likelihood estimate of μ now?

We can answer this question circularly:

EXPECTATION: If we knew the value of μ we could compute the expected values of a and b. Since the ratio a : b should be the same as the ratio ½ : μ,
$$a = \frac{\tfrac{1}{2}}{\tfrac{1}{2}+\mu}\,h, \qquad b = \frac{\mu}{\tfrac{1}{2}+\mu}\,h$$

MAXIMIZATION: If we knew the expected values of a and b we could compute the maximum-likelihood value of μ:
$$\mu = \frac{b+c}{6\,(b+c+d)}$$
Machine Learning 51
E.M. for our Trivial Problem
(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ.)

We begin with a guess for μ.
We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of μ and of a and b.

Define μ(t) = the estimate of μ on the t'th iteration, and b(t) = the estimate of b on the t'th iteration.

μ(0) = initial guess

E-step:
$$b(t) = \frac{\mu(t)\,h}{\tfrac{1}{2}+\mu(t)} = \mathbb{E}\left[\,b \mid \mu(t)\,\right]$$

M-step:
$$\mu(t+1) = \frac{b(t)+c}{6\,(b(t)+c+d)} = \text{max-likelihood estimate of } \mu \text{ given } b(t)$$

Continue iterating until converged.
Good news: converging to a local optimum is assured.
Bad news: I said “local” optimum.
Machine Learning 52
E.M. Convergence
• The convergence proof is based on the fact that Prob(data | μ) must increase or remain the same between iterations [NOT OBVIOUS]
• But it can never exceed 1 [OBVIOUS]
So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, μ(0) = 0:

t    μ(t)      b(t)
0    0         0
1    0.0833    2.857
2    0.0937    3.158
3    0.0947    3.185
4    0.0948    3.187
5    0.0948    3.187
6    0.0948    3.187

Convergence is generally linear: the error decreases by a constant factor each time step.
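A few lines of Python that reproduce this table (an illustrative check, not part of the slides; it uses the slide's values h = 20, c = 10, d = 10 and μ(0) = 0):

```python
h, c, d = 20, 10, 10
mu, b = 0.0, 0.0
for t in range(1, 7):
    mu = (b + c) / (6 * (b + c + d))   # M-step: max-likelihood mu given the current b
    b = mu * h / (0.5 + mu)            # E-step: expected number of B's given mu
    print(t, round(mu, 4), round(b, 3))
# The printed values match the table above (up to rounding in the last digit),
# converging to mu ~ 0.0948 and b ~ 3.187.
```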
Machine Learning 53
Back to Unsupervised Learning of GMMs
Remember:
• We have unlabeled data x1, x2, …, xR
• We know there are k classes
• We know P(w1), P(w2), P(w3), …, P(wk)
• We don't know μ1, μ2, …, μk

We can write P(data | μ1, …, μk):

$$P(\text{data} \mid \mu_1,\ldots,\mu_k) = p(x_1 \ldots x_R \mid \mu_1,\ldots,\mu_k) = \prod_{i=1}^{R} p(x_i \mid \mu_1,\ldots,\mu_k)$$
$$= \prod_{i=1}^{R}\sum_{j=1}^{k} p(x_i \mid w_j, \mu_1,\ldots,\mu_k)\,P(w_j) = \prod_{i=1}^{R}\sum_{j=1}^{k} K \exp\!\left(-\frac{1}{2\sigma^2}\,(x_i-\mu_j)^2\right) P(w_j)$$
Machine Learning 54
E.M. for GMMs

For max likelihood, we know
$$\frac{\partial}{\partial \mu_j} \log P(\text{data} \mid \mu_1,\ldots,\mu_k) = 0$$

Some wild'n'crazy algebra turns this into: “For max likelihood, for each j,
$$\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1,\ldots,\mu_k)\; x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1,\ldots,\mu_k)}$$
This is a set of nonlinear equations in the μj's.”

If, for each xi, we knew the probability that xi belongs to class wj, i.e. P(wj | xi, μ1, …, μk), then we could easily compute the μj's.

If we knew each μj, then we could easily compute P(wj | xi, μ1, …, μk) for each wj and xi.

…I feel an EM experience coming on!!


55
E.M. for GMMs
Iterate. On the t'th iteration let our estimates be
λt = { μ1(t), μ2(t), …, μc(t) }

E-step: Compute the “expected” classes of all datapoints for each class (just evaluate a Gaussian at xk):
$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\,P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p\!\left(x_k \mid \mu_i(t), \sigma^2 I\right)\, p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid \mu_j(t), \sigma^2 I\right)\, p_j(t)}$$

M-step: Compute the max-likelihood μ given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\; x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}$$
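A compact NumPy sketch of exactly these two steps for this simplified model (shared spherical covariance σ²I and fixed priors are held constant; X, the initial means, the priors and sigma2 are assumed to be given):

```python
import numpy as np

def em_gmm_means(X, mus, priors, sigma2, n_iters=50):
    """EM for the simplified GMM above: priors P(w_i) and the shared spherical
    covariance sigma2 * I are fixed; only the means are re-estimated."""
    for _ in range(n_iters):
        # E-step: responsibilities P(w_i | x_k); the shared Gaussian normalizing
        # constant cancels in the normalization, so unnormalized exponentials suffice.
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, k) squared distances
        resp = priors * np.exp(-d2 / (2.0 * sigma2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted means
        mus = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mus
```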

Machine Learning 56
E.M. Convergence
• This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g. Vector Quantization for Speech Data.

• Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works.
• As with all EM procedures, convergence to a local optimum is guaranteed.
57
E.M. for General GMMs
(pi(t) is shorthand for the estimate of P(wi) on the t'th iteration.)

Iterate. On the t'th iteration let our estimates be
λt = { μ1(t), …, μc(t),  Σ1(t), …, Σc(t),  p1(t), …, pc(t) }

E-step: Compute the “expected” classes of all datapoints for each class (just evaluate a Gaussian at xk):
$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\,P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p\!\left(x_k \mid \mu_i(t), \Sigma_i(t)\right)\, p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid \mu_j(t), \Sigma_j(t)\right)\, p_j(t)}$$

M-step: Compute the max-likelihood parameters given our data's class membership distributions:
$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\; x_k}{\sum_k P(w_i \mid x_k, \lambda_t)} \qquad
\Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\left[x_k-\mu_i(t+1)\right]\left[x_k-\mu_i(t+1)\right]^{T}}{\sum_k P(w_i \mid x_k, \lambda_t)}$$
$$p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R}, \qquad R = \#\text{records}$$
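In practice the same E-step/M-step loop is available off the shelf; a short sketch assuming scikit-learn's GaussianMixture and an already-loaded data matrix X (the choice n_components=3 is illustrative):

```python
from sklearn.mixture import GaussianMixture

# X is assumed to be an (n_samples, n_features) array that is already loaded.
gm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gm.fit(X)                      # runs the E-step / M-step loop until convergence
print(gm.means_)               # the mu_i at convergence
print(gm.covariances_)         # the Sigma_i
print(gm.weights_)             # the mixing proportions p_i = P(w_i)
labels = gm.predict(X)         # hard cluster assignments
resp = gm.predict_proba(X)     # soft responsibilities P(w_i | x_k)
```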
Machine Learning 58
Gaussian Mixture Example: Start

Advance apologies: in black and white this example will be incomprehensible.

Machine Learning 59
After first iteration

Machine Learning 60
After 2nd iteration

Machine Learning 61
After 3rd iteration

Machine Learning 62
After 4th iteration

Machine Learning 63
After 5th iteration

Machine Learning 64
After 6th iteration

Machine Learning 65
After 20th iteration

Machine Learning 66
Some Bio Assay data

Machine Learning 67
GMM clustering of the assay data

Machine Learning 68
Resulting Density Estimator

Machine Learning 69
SUMMARY

• Clustering Problems

• K-Means

• DBSCAN

• Gaussian Mixtures

Machine Learning 70
Nhân bản – Phụng sự – Khai phóng

Enjoy the Course…!

Machine Learning 71
