Clustering Notes
─ What is Classification?
─ What is Supervised Classification/Learning?
─ What is Unsupervised Classification/Learning?
─ SOM – Self Organizing Maps
Types of Clustering Algorithms
Jain, A. K., Murty, M. N., and Flynn, P. J., Data Clustering: A Survey.
ACM Computing Surveys, 1999. 31: pp. 264-323.
[Figure: taxonomy of clustering algorithms, with branches including agglomerative and divisive (hierarchical) methods, gradient descent algorithms, evolutionary methods, and artificial neural networks]
Unsupervised learning: finds “natural” groupings of instances given unlabeled data
Clustering Evaluation
• Manual inspection
• Benchmarking on existing labels
• Cluster quality measures
– distance measures
– high similarity within a cluster, low across clusters
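One simple way to check "high similarity within a cluster, low across clusters" is to compare average pairwise distances. The sketch below is illustrative only: the function name `mean_within_and_across`, the use of Euclidean distance (`math.dist`, Python 3.8+), and the toy data are assumptions, not part of the original notes.

```python
import math
from itertools import combinations

def mean_within_and_across(points, labels):
    """Return (mean within-cluster distance, mean across-cluster distance)."""
    within, across = [], []
    # Visit every unordered pair of points exactly once.
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else across).append(math.dist(p, q))
    return sum(within) / len(within), sum(across) / len(across)

# Toy data (assumed): two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
w, a = mean_within_and_across(points, labels)
print("within:", w, "across:", a)
```

A good clustering should give a within-cluster average that is small relative to the across-cluster average.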
The Distance Function
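The notes do not preserve which distance function was shown here. As a hedged illustration, two commonly used choices are Euclidean and Manhattan distance; the function names below are assumptions made for this sketch.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

The choice of distance function determines which points count as "close", so it directly shapes the clusters that result.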
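K-means example, step 1
Pick 3 initial cluster centers (randomly): k1, k2, k3.
[Figure: scatter plot on X and Y axes showing the initial centers k1, k2, k3]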
K-means example, step 2
Assign each point to the closest cluster center.
[Figure: points on X and Y axes with centers k1, k2, k3]
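The assignment step can be sketched as follows. This is a minimal illustration assuming Euclidean distance (`math.dist`, Python 3.8+); the function name `assign_points` and the sample coordinates are hypothetical.

```python
import math

def assign_points(points, centers):
    """For each point, return the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]

# Assumed sample data: three centers standing in for k1, k2, k3.
points = [(1, 1), (2, 1), (8, 8), (9, 9), (5, 1)]
centers = [(0, 0), (10, 10), (5, 0)]
labels = assign_points(points, centers)
print(labels)  # [0, 0, 1, 1, 2]
```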
K-means example, step 3
Move each cluster center to the mean of its cluster.
[Figure: centers k1, k2, k3 shown at their old and new positions on X and Y axes]
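The update step can be sketched like this. The function name `update_centers` is hypothetical, and leaving an empty cluster's center where it was is an assumption of this sketch (the notes do not say how empty clusters are handled).

```python
def update_centers(points, labels, centers):
    """Move each center to the coordinate-wise mean of its assigned points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            # Assumption: a cluster with no points keeps its old center.
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers

points = [(1, 1), (3, 1), (8, 8)]
labels = [0, 0, 1]
centers = [(0, 0), (10, 10), (5, 0)]
moved = update_centers(points, labels, centers)
print(moved)
```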
K-means example, step 4
Reassign points that are now closest to a different (new) cluster center.
Q: Which points are reassigned?
[Figure: points on X and Y axes with centers k1, k2, k3]
K-means example, step 4 …
A: three points (highlighted with animation)
[Figure: points on X and Y axes with centers k1, k2, k3]
K-means example, step 4b
Re-compute cluster means.
[Figure: points on X and Y axes with centers k1, k2, k3]
K-means example, step 5
Move cluster centers to the cluster means.
[Figure: points on X and Y axes with centers k1, k2, k3]
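Putting the steps of the example together, a complete loop might look like the sketch below. It is not a definitive implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, picking initial centers from the data points themselves, the stopping rule (stop when no point changes cluster), and Euclidean distance are all assumptions of this example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick k initial centers at random
    labels = None
    for _ in range(max_iter):
        # Assign each point to the closest cluster center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assumed stopping rule: no point was reassigned
            break
        labels = new_labels
        # Move each center to the mean of its cluster (empty clusters stay put).
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

data = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```

Different seeds can lead to different final clusters on harder data, because the result depends on the randomly chosen initial centers.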
Squared Error Criterion
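The formula for this criterion is not preserved in the notes. Under the usual definition, which is assumed here, the squared error is the sum over all points of the squared Euclidean distance between each point and the center of its cluster. The function name `squared_error` is hypothetical.

```python
def squared_error(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[l]))
        for p, l in zip(points, labels)
    )

points = [(0, 0), (0, 2), (10, 10)]
labels = [0, 0, 1]
centers = [(0, 1), (10, 10)]
print(squared_error(points, labels, centers))  # 2
```

Lower values mean tighter clusters; K-means can be read as trying to reduce this quantity with each assign-and-move iteration.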
Pros and cons of K-Means
K-means variations
Advantages
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages
• Must pick number of clusters beforehand
• All items forced into a cluster
• Too sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data
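The outlier sensitivity noted above comes from using the mean as the cluster center. A small one-dimensional illustration with made-up values; the comparison with the median is an addition for contrast, not something stated in the notes:

```python
from statistics import mean, median

# Assumed toy cluster: four nearby values plus one extreme outlier.
values = [1, 2, 3, 4, 100]

center_mean = mean(values)      # pulled strongly toward the outlier
center_median = median(values)  # stays with the bulk of the data
print(center_mean, center_median)
```

Here the mean lands far from every ordinary value in the cluster, while the median does not move, which is why more robust center definitions are sometimes considered.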
Hierarchical clustering
• Agglomerative Clustering
– Start with single-instance clusters
– At each step, join the two closest clusters
– Design decision: distance between clusters
• Divisive Clustering
– Start with one universal cluster
– Find two clusters
– Proceed recursively on each subset
– Can be very fast
• Both methods produce a dendrogram
[Figure: dendrogram with leaves ordered g a c i e d k b j f h]
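The agglomerative procedure above can be sketched naively as follows. The "design decision" of how to measure distance between clusters is made explicit as a `linkage` parameter; single linkage (distance between the closest pair of members) is only one possible choice, assumed here for illustration. The names `single_link`, `agglomerative`, and `num_clusters` are hypothetical.

```python
import math

def single_link(c1, c2):
    """Cluster distance = distance between the two closest members (assumed choice)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, num_clusters=1, linkage=single_link):
    """Repeatedly join the two closest clusters; return (clusters, merge history)."""
    clusters = [[p] for p in points]  # start with single-instance clusters
    merges = []
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, num_clusters=2)
print(clusters)
```

The merge history, read from first to last, is the information a dendrogram displays from the leaves upward. This brute-force version rescans all cluster pairs at every step, so it is only suitable for small inputs.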
Partial Supervision of Clustering
[Figure: plot with a "Disputed Data Point" annotation; numeric labels 1 to 5]