ML.5-Clustering Techniques (Week 9)
Chapter 5
Clustering Techniques
Machine Learning
CONTENTS
• Clustering Problems
• K-Means
• DBSCAN
• Gaussian Mixtures
Clustering Problem
• Unsupervised learning
• Sometimes the data form clusters, where examples within a cluster are similar to each other, and examples in different clusters are dissimilar.
CONTENTS
• Clustering Problems
• K-Means
• DBSCAN
• Gaussian Mixtures
K-means
• K-means assumes there are K clusters, and each point is close to its cluster center (the mean of the points in the cluster).
• If we knew the cluster assignments, we could easily compute the means.
• If we knew the means, we could easily compute the cluster assignments.
Chicken-and-egg problem! One can show it is NP-hard.
• Very simple (and useful) heuristic: start randomly and alternate between the two steps (a sketch follows below).
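A minimal NumPy sketch of this alternating heuristic (Lloyd's algorithm); the function name, initialization, and stopping rule are illustrative choices, not taken from the slides:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate between assigning points and re-computing means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # no change: converged
            break
        centers = new_centers
    return centers, labels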
K-means
• Finding the Optimal Number of Clusters (a sketch follows below)
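The figures for these slides are not reproduced; two common ways to pick K, and likely what the figures show, are the inertia "elbow" and the silhouette score. A sketch with scikit-learn on synthetic data (the dataset and the range of K are assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)  # toy data

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    # Inertia always decreases as k grows; look for the "elbow".
    # The silhouette score peaks for the best-separated clustering.
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={sil:.3f}")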
K-means
• Limits of K-Means: it does not behave very well when the clusters have varying sizes, different densities, or non-spherical shapes (illustrated in the sketch below).
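A small illustrative sketch of this failure mode, using synthetic blobs stretched into elongated clusters (the dataset, the transformation, and the metric are assumptions, not from the slides):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three blobs, then a linear "stretch" so the clusters become elongated (non-spherical).
X, y = make_blobs(n_samples=600, centers=3, random_state=170)
X = X @ np.array([[0.6, -0.64], [-0.41, 0.85]])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("agreement with the true clusters (ARI):", adjusted_rand_score(y, labels))
# With well-separated spherical blobs the ARI is close to 1.0; with these elongated
# clusters K-Means typically mixes points from different clusters and scores lower.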
CONTENTS
• Clustering Problems
• K-Means
• DBSCAN
• Gaussian Mixtures
DBSCAN
• DBSCAN – Density-Based Spatial Clustering of Applications with Noise
• Core, Border, and Noise points
DBSCAN
• Clusters as continuous regions of high density.
• DBSCAN algorithm:
• For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance's ε-neighborhood.
• If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance.
• All instances in the neighborhood of a core instance belong to the same cluster.
• Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
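A minimal scikit-learn sketch of the procedure just described; the dataset and the eps/min_samples values are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)  # two non-spherical clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                       # cluster index per point, -1 means anomaly/noise
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True # marks the core instances

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("core points:", core_mask.sum(), " noise points:", (labels == -1).sum())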
DBSCAN: Algorithm
DBSCAN: Complexity
DBSCAN: Optimal Eps
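The figure for this slide is not reproduced. A common heuristic for choosing ε, presumably what the slide illustrates, is the k-distance plot: sort every point's distance to its k-th nearest neighbour and look for the "knee" of the curve. A sketch (dataset and k are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

k = 5  # match min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)        # distances to the k nearest neighbours (first one is the point itself)
k_dist = np.sort(dists[:, -1])     # distance to the k-th neighbour, sorted ascending

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbour")
plt.title("Choose eps near the knee of this curve")
plt.show()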
CONTENTS
• Clustering Problems
• K-Means
• DBSCAN
• Gaussian Mixtures
Gaussian Bayes Classifier Reminder
Predicting wealth from age
Learning modelyear, mpg → maker

General: O(m²) parameters

$$
\Sigma =
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}
$$
Aligned: O(m) parameters

$$
\Sigma =
\begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}
$$
Spherical: O(1) cov parameters

$$
\Sigma = \sigma^2 I =
\begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{pmatrix}
$$
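A tiny sketch making these three parameter counts concrete (free parameters of the covariance matrix only); the function name is illustrative:

def cov_param_count(m, kind):
    """Number of free covariance parameters for an m-dimensional Gaussian."""
    if kind == "general":      # full symmetric matrix: m variances + m(m-1)/2 covariances
        return m * (m + 1) // 2
    if kind == "aligned":      # diagonal: one variance per dimension
        return m
    if kind == "spherical":    # a single shared variance
        return 1
    raise ValueError(kind)

for m in (2, 10, 100):
    print(m, {k: cov_param_count(m, k) for k in ("general", "aligned", "spherical")})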
Making a Classifier from a Density Estimator
• Classifier: Inputs → predict category (Dec Tree, Joint BC, Naïve BC, Gauss BC)
• Density Estimator: Inputs → predict probability (Naïve DE, …)
Next… back to Density Estimation
The GMM assumption
• There are k components; component j has an associated mean vector μj and prior probability P(wj).
• Each datapoint is generated by first picking a component j with probability P(wj), and then drawing x from a Gaussian with mean μj and (in the simplest case) covariance σ²I.

Recovering the components from unlabeled data is sometimes easy, sometimes impossible, and sometimes in between.
(In case you're wondering what the original diagrams show: 2-d unlabeled data (x vectors) distributed in 2-d space; the top one has three very clear Gaussian centers.)
Computing likelihoods in the unsupervised case
We have x1, x2, …, xN.
We know P(w1), P(w2), …, P(wk).
We know σ.
P(x | wi, μ1, …, μk) = probability that an observation from class wi would have value x, given the class means μ1 … μk.
We can therefore define, for any x, P(x | wi, μ1, μ2, …, μk).
Unsupervised Learning: Mediumly Good News
Duda & Hart's Example
Graph of log P(x1, x2, …, x25 | μ1, μ2) against μ1 (→) and μ2 (↑).
Finding the max likelihood μ1,μ2..μk
We can compute P( data | μ1,μ2..μk)
How do we find the μi‘s which give max. likelihood?
Expectation Maximization
The E.M. Algorithm
• We’ll get back to unsupervised learning soon.
• But now we’ll look at an even simpler case with hidden information.
• The EM algorithm
❑ Can do trivial things, such as the contents of the next few slides.
❑ An excellent way of doing our unsupervised learning problem, as we’ll see.
❑ Many, many other uses, including inference of Hidden Markov Models (future
lecture).
Silly Example
Let events be “grades in a class”
w1 = Gets an A P(A) = ½
w2 = Gets a B P(B) = μ
w3 = Gets a C P(C) = 2μ
w4 = Gets a D P(D) = ½-3μ
(Note 0 ≤ μ ≤1/6)
Assume we want to estimate μ from data. In a given class there were
a A’s
b B’s
c C’s
d D’s
What’s the maximum likelihood estimate of μ given a,b,c,d ?
Trivial Statistics
P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ
P(a,b,c,d | μ) = K (½)^a (μ)^b (2μ)^c (½ - 3μ)^d
log P(a,b,c,d | μ) = log K + a log ½ + b log μ + c log 2μ + d log(½ - 3μ)

For the max-likelihood μ, set ∂logP/∂μ = 0:

$$
\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0
$$

which gives the max-likelihood estimate

$$
\mu = \frac{b+c}{6(b+c+d)}
$$

So if the class got A: 14, B: 6, C: 9, D: 10, then the max-likelihood μ = 1/10.
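A quick numerical check of this estimate, maximizing the log-likelihood over a grid of the allowed range 0 ≤ μ ≤ 1/6 with the class counts from the slide:

import numpy as np

a, b, c, d = 14, 6, 9, 10
mus = np.linspace(1e-6, 1/6 - 1e-6, 100_000)

# log P(a,b,c,d | mu) up to the constant log K
log_lik = a*np.log(0.5) + b*np.log(mus) + c*np.log(2*mus) + d*np.log(0.5 - 3*mus)

print("numerical argmax:", mus[np.argmax(log_lik)])                 # ~ 0.1
print("closed form (b+c)/(6(b+c+d)):", (b + c) / (6*(b + c + d)))   # = 0.1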
Same Problem with Hidden Information
(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ)
Someone tells us only that
• the number of high grades (A's + B's) = h
• the number of C's = c
• the number of D's = d
What is the maximum likelihood estimate of μ now?

EXPECTATION: Since the ratio a : b should be the same as the ratio ½ : μ, the expected split of the h high grades is

$$
a = \frac{1/2}{1/2 + \mu}\, h, \qquad b = \frac{\mu}{1/2 + \mu}\, h
$$

MAXIMIZATION: If we know the expected values of a and b, we can compute the maximum likelihood value of μ as before:

$$
\mu = \frac{b + c}{6(b + c + d)}
$$
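A minimal sketch of the resulting EM loop for this grades problem, iterating the two steps above from an initial guess (the counts h, c, d are illustrative):

h, c, d = 20, 10, 10   # observed: high grades (A+B), C's, D's -- illustrative counts
mu = 0.08              # initial guess, anywhere in (0, 1/6)

for t in range(20):
    # E-step: expected number of B's among the h high grades, given the current mu
    b = h * mu / (0.5 + mu)
    # M-step: maximum-likelihood mu given the expected b
    mu = (b + c) / (6 * (b + c + d))

print("EM estimate of mu:", mu)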
E.M. for our Trivial Problem
(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ)
• We begin with a guess for μ.
• We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of μ and of a and b.

Back to Gaussian mixtures: given a guess at μ1 … μk, the likelihood of the unlabeled data is

$$
\text{Prob}(\text{data} \mid \mu_1 \ldots \mu_k)
= p(x_1 \ldots x_R \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} p(x_i \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \ldots \mu_k)\, P(w_j)
= \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\Big(-\frac{1}{2\sigma^2}\,\lVert x_i - \mu_j \rVert^2\Big)\, P(w_j)
$$

(where K is the Gaussian normalization constant).
E.M. for GMMs
For maximum likelihood we know that ∂/∂μj log Prob(data | μ1 … μk) = 0.
Some wild 'n' crazy algebra turns this into: for maximum likelihood, for each j,

$$
\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)\, x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)}
$$

If we knew each μj then we could easily compute P(wj | xi, μ1 … μk) for each wj and xi. So we solve iteratively:

E-step: Compute the "expected" classes of all datapoints for each class (just evaluate a Gaussian at xk):

$$
P(w_i \mid x_k, \lambda_t)
= \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p\big(x_k \mid \mu_i(t), \sigma^2 I\big)\, p_i(t)}{\sum_{j=1}^{c} p\big(x_k \mid \mu_j(t), \sigma^2 I\big)\, p_j(t)}
$$

M-step: Compute the maximum-likelihood μ given our data's class membership distributions:

$$
\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}
$$
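A compact NumPy sketch of exactly these two steps for the simple case treated here (known shared σ, equal and fixed priors); names and initialization are illustrative:

import numpy as np

def em_spherical_gmm(X, k, sigma=1.0, n_iters=50, seed=0):
    """EM for k spherical Gaussians with known, shared sigma and fixed equal priors."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initial means: random data points
    for _ in range(n_iters):
        # E-step: responsibilities P(w_j | x_i) ∝ exp(-||x_i - mu_j||^2 / (2 sigma^2))
        # (the equal priors cancel in the normalization)
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_resp = -sq_dists / (2 * sigma**2)
        resp = np.exp(log_resp - log_resp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each mean becomes the responsibility-weighted average of the data
        mu = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mu, resp

# Example: two blobs in 2-d
X = np.vstack([np.random.default_rng(1).normal(0, 1, (100, 2)),
               np.random.default_rng(2).normal(5, 1, (100, 2))])
means, resp = em_spherical_gmm(X, k=2)
print(means)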
E.M. Convergence
• This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g. Vector Quantization for Speech Data.

For general GMMs, each component gets its own mean μj(t), covariance Σj(t), and mixing weight pj(t):

E-step: Compute the "expected" classes of all datapoints for each class (just evaluate a Gaussian at xk, now with covariance Σj(t)):

$$
P(w_i \mid x_k, \lambda_t)
= \frac{p\big(x_k \mid w_i, \mu_i(t), \Sigma_i(t)\big)\, p_i(t)}{\sum_{j=1}^{c} p\big(x_k \mid w_j, \mu_j(t), \Sigma_j(t)\big)\, p_j(t)}
$$

M-step:

$$
\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)},
\qquad
\Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\,\big(x_k - \mu_i(t+1)\big)\big(x_k - \mu_i(t+1)\big)^{\top}}{\sum_k P(w_i \mid x_k, \lambda_t)},
\qquad
p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R}, \quad R = \#\text{records}
$$
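In practice this EM procedure is rarely hand-coded; scikit-learn's GaussianMixture runs the same kind of EM (here with full covariances). A sketch on synthetic data standing in for the examples on the following slides:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=0)

gm = GaussianMixture(n_components=3, covariance_type="full", n_init=5, random_state=0)
gm.fit(X)

print("weights:", gm.weights_)          # p_i
print("means:", gm.means_)              # mu_i
print("converged after", gm.n_iter_, "EM iterations")
labels = gm.predict(X)                  # hard cluster assignment
probs = gm.predict_proba(X)             # soft responsibilities P(w_i | x)
densities = gm.score_samples(X)         # log density: usable as a density estimator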
Gaussian Mixture Example
(Figures: the fitted mixture at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations.)
(Figures: some Bio Assay data; GMM clustering of the assay data; the resulting density estimator.)
SUMMARY
• Clustering Problems
• K-Means
• DBSCAN
• Gaussian Mixtures
Humanity – Service – Liberation (Nhân bản – Phụng sự – Khai phóng)