L 8 Clustering
─ What is Classification?
─ What is Supervised Classification/Learning?
─ What is Unsupervised Classification/Learning?
─ SOM – Self Organizing Maps
Types of Clustering Algorithms
Jain, A. K., Murty, M. N., and Flynn, P. J., Data Clustering: A Survey.
ACM Computing Surveys, 1999. 31: pp. 264-323.
Unsupervised learning:
Finds “natural” groupings of instances given unlabeled data
Clustering Evaluation
• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
–distance measures
–high similarity within a cluster, low across clusters
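The last quality measure can be made concrete with a small sketch. It assumes Euclidean distance and two hand-picked toy clusters; the helper names `mean_within` and `mean_across` are hypothetical, not from any library.

```python
from itertools import combinations
from math import dist

def mean_within(cluster):
    """Average pairwise distance between points of one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_across(a, b):
    """Average distance between points taken from two different clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

# Toy data (assumed for illustration): two well-separated groups.
c1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
c2 = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

within = max(mean_within(c1), mean_within(c2))
across = mean_across(c1, c2)
print(within < across)  # a good clustering keeps "within" well below "across"
```

A small within-cluster distance means high similarity inside a cluster; a large across-cluster distance means low similarity between clusters.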
The Distance Function
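The slide does not state which distance function is used. As an assumption, the sketch below shows two common choices for numeric attributes, Euclidean and Manhattan distance; the function names are illustrative.

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```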
K-means example, step 1
Pick 3 initial cluster centers (k1, k2, k3) randomly.
[Figure: data points in the X-Y plane with the initial centers k1, k2, k3.]
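A minimal sketch of the random initialization. Drawing the centers from among the data points is an assumption here (one common convention); the function name `pick_initial_centers` and the toy points are hypothetical.

```python
import random

def pick_initial_centers(points, k, seed=0):
    """Choose k distinct data points at random to serve as initial centers."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(points, k)

points = [(1, 1), (2, 1), (4, 3), (5, 4), (8, 8), (9, 9)]
centers = pick_initial_centers(points, k=3)
print(centers)
```

Different seeds give different starting centers, which can lead to different final clusters.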
K-means example, step 2
Assign each point to the closest cluster center.
[Figure: X-Y plot; each point is assigned to the nearest of k1, k2, k3.]
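The assignment step can be sketched as follows, assuming Euclidean distance. The points, centers, and the name `assign` are made up for illustration.

```python
from math import dist

def assign(points, centers):
    """Return, for each point, the index of its closest center."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]

points = [(1, 1), (2, 1), (5, 4), (8, 8), (9, 9)]
centers = [(1, 2), (5, 5), (9, 8)]
labels = assign(points, centers)
print(labels)  # [0, 0, 1, 2, 2]
```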
K-means example, step 3
Move each cluster center to the mean of its cluster.
[Figure: X-Y plot; old and new positions of k1, k2, k3.]
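A minimal sketch of the update step: each center becomes the coordinate-wise mean of its assigned points. The data and the name `update_centers` are illustrative, and the sketch assumes no cluster is empty.

```python
def update_centers(points, labels, k):
    """Move each center to the mean of the points assigned to it."""
    centers = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Assumption: every cluster has at least one member.
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers

points = [(1, 1), (2, 1), (5, 4), (8, 8), (9, 9)]
labels = [0, 0, 1, 2, 2]
new_centers = update_centers(points, labels, k=3)
print(new_centers)  # [(1.5, 1.0), (5.0, 4.0), (8.5, 8.5)]
```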
K-means example, step 4
Reassign points that are now closest to a different new cluster center.
Q: Which points are reassigned?
[Figure: X-Y plot with the moved centers k1, k2, k3.]
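The question above can be answered mechanically by comparing labels before and after the centers move. The sketch uses its own toy data (not the points in the figure) and assumes Euclidean distance.

```python
from math import dist

def closest(p, centers):
    """Index of the center nearest to point p."""
    return min(range(len(centers)), key=lambda j: dist(p, centers[j]))

points = [(0, 0), (1, 0), (4, 0), (9, 0), (10, 0)]
old_centers = [(0, 0), (5, 0)]
old_labels = [closest(p, old_centers) for p in points]

# New centers are the means of the old clusters.
new_centers = []
for j in range(len(old_centers)):
    members = [p for p, lab in zip(points, old_labels) if lab == j]
    new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))

new_labels = [closest(p, new_centers) for p in points]
moved = [p for p, a, b in zip(points, old_labels, new_labels) if a != b]
print(moved)  # [(4, 0)]
```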
K-means example, step 4 …
A: three points (shown with animation).
[Figure: X-Y plot with centers k1, k2, k3.]
K-means example, step 4b
Re-compute cluster means.
[Figure: X-Y plot with centers k1, k2, k3.]
K-means example, step 5
Move cluster centers to cluster means.
[Figure: X-Y plot with centers k1, k2, k3.]
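Putting the steps together gives a loop like the sketch below. The stopping rule (stop when no point changes cluster) and the handling of empty clusters are assumptions, since the slides do not state them; `kmeans` here is a hypothetical helper, not a library function.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and mean-update steps until labels stop changing."""
    centers = list(centers)  # work on a copy of the initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # no point was reassigned: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

points = [(0, 0), (1, 0), (4, 0), (9, 0), (10, 0)]
centers, labels = kmeans(points, centers=[(0, 0), (5, 0)])
print(labels)  # [0, 0, 0, 1, 1]
```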
Squared Error Criterion
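The formula for this slide is not present in the text. A common form of the squared error criterion for K-means is given below as an assumption about what was shown; the symbols $E$, $C_j$, and $m_j$ are introduced here for illustration.

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^2,
\qquad
m_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Here $C_j$ is the set of points in cluster $j$ and $m_j$ is its mean, so $E$ is the total squared distance of all points to their cluster centers.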
Pros and cons of K-Means
Advantages
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages
• Must pick number of clusters beforehand
• All items forced into a cluster
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data
K-means variations
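The outlier sensitivity of K-means, and why a medoid-based variation such as K-medoids is less affected, can be seen in a small sketch. The toy cluster and the helper names `centroid` and `medoid` are assumptions for illustration; this is not a full K-medoids implementation.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean, as used by K-means."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def medoid(points):
    """The data point with the smallest total distance to all points."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Four nearby points plus one extreme value (the outlier).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]
print(centroid(cluster))  # (21.2, 21.2): dragged toward the outlier
print(medoid(cluster))    # (2, 2): stays inside the dense group
```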
Clustering Summary
• unsupervised
• many approaches
–K-means – simple, sometimes useful
• K-medoids is less sensitive to outliers
–Hierarchical clustering – works for symbolic attributes
–Can be used to fill in missing values
New Centroid for Cluster 2: (A3+B1+B2+B3+C2)/5 = (6, 6)
New Centroid for Cluster 3: (A2+C1)/2 = (1.5, 3.5)
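The centroid arithmetic above can be checked with a short sketch. The coordinates are an assumption: the text does not list them, and the values below were chosen because they reproduce the stated centroids.

```python
# Assumed coordinates (not given in the source); chosen so that they
# reproduce the centroids stated above.
pts = {"A2": (2, 5), "A3": (8, 4), "B1": (5, 8), "B2": (7, 5),
       "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}

def centroid(names):
    """Coordinate-wise mean of the named points."""
    members = [pts[n] for n in names]
    return tuple(sum(c) / len(members) for c in zip(*members))

print(centroid(["A3", "B1", "B2", "B3", "C2"]))  # (6.0, 6.0)
print(centroid(["A2", "C1"]))                    # (1.5, 3.5)
```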