Week 3 Clustering
Unsupervised learning
Understand patterns of data (just “x”)
Useful for many reasons
Data mining (“explain”)
Missing data values (“impute”)
Representation (feature generation or selection)
Clustering
“Grouping a set of data objects into clusters according to their similarity”
Partitioning:
$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
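As an illustration, the Euclidean distance between two n-dimensional points can be sketched in a few lines of Python. The function name `euclidean_distance` is an assumption made for this example, not something from the slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```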
Unsupervised learning: finds “natural” groupings of instances given unlabeled data
Clustering Evaluation
Manual inspection
Benchmarking on existing labels
Cluster quality measures
distance measures
high similarity within a cluster, low across clusters
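One simple way to check "high similarity within a cluster, low across clusters" is to compare the average pairwise distance inside clusters with the average distance between points of different clusters. The sketch below is only one possible quality measure; the function names and the toy data are assumptions made for illustration.

```python
import math
from itertools import combinations

def mean_within_distance(clusters):
    """Average distance over all pairs of points that share a cluster.

    Assumes at least one cluster contains two or more points.
    """
    dists = [math.dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(dists) / len(dists)

def mean_between_distance(clusters):
    """Average distance over all pairs of points from different clusters.

    Assumes there are at least two non-empty clusters.
    """
    dists = [
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(dists) / len(dists)

# Toy data: two tight, well-separated groups.
clusters = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
print(mean_within_distance(clusters))   # 1.0
print(mean_between_distance(clusters))  # much larger than the within value
```

A good clustering under this measure has a small within-cluster value relative to the between-cluster value.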
The Distance Function
K-means example, step 1
Pick 3 initial cluster centers k1, k2, k3 (randomly)
[Figure: data points in the X-Y plane with cluster centers k1, k2, k3]
K-means example, step 2
Assign each point to the closest cluster center
[Figure: data points in the X-Y plane with cluster centers k1, k2, k3]
K-means example, step 3
Move each cluster center to the mean of its cluster
[Figure: X-Y plane showing centers k1, k2, k3 at their old and new positions]
K-means example, step 4
Reassign points that are now closest to a different cluster center
Q: Which points are reassigned?
[Figure: data points in the X-Y plane with cluster centers k1, k2, k3]
K-means example, step 4 …
A: three points (shown with animation)
[Figure: data points in the X-Y plane with cluster centers k1, k2, k3]
K-means example, step 4b
Re-compute cluster means
[Figure: data points in the X-Y plane with cluster centers k1, k2, k3]
K-means example, step 5
Move cluster centers to cluster means
[Figure: data points in the X-Y plane with cluster centers k1, k2, k3]
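The steps of the example (pick random centers, assign points, move centers to cluster means, repeat) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the stopping rule (stop when no point changes cluster), and the handling of empty clusters are all assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch; points is a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: pick k initial cluster centers at random from the data.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to the closest cluster center.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assumed stopping rule: no point was reassigned
        assignment = new_assignment
        # Steps 3 and 5: move each center to the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

The result depends on the random initial centers, so different seeds can give different cluster labels and, on harder data, different clusterings.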
Squared Error Criterion
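A common form of the squared error criterion, assumed here, is the sum over all points of the squared Euclidean distance between the point and the center of its assigned cluster. The sketch below computes that quantity; the function name `squared_error` and the toy data are illustrative assumptions.

```python
def squared_error(points, centers, assignment):
    """Sum of squared distances from each point to its assigned center.

    assignment[i] is the index into centers for points[i].
    """
    return sum(
        sum((pi - ci) ** 2 for pi, ci in zip(p, centers[a]))
        for p, a in zip(points, assignment)
    )

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
centers = [(1.0, 0.0), (11.0, 0.0)]
assignment = [0, 0, 1, 1]
print(squared_error(points, centers, assignment))  # 4.0
```

Lower values mean points lie closer to their cluster centers, so this value can be used to compare k-means runs that use the same k.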
Hierarchical clustering
Agglomerative Clustering
Start with single-instance clusters
At each step, join the two closest clusters
Design decision: distance between clusters
Divisive Clustering
Start with one universal cluster
Split it into two clusters
Proceed recursively on each subset
Can be very fast
Both methods produce a dendrogram
[Figure: dendrogram with leaves g, a, c, i, e, d, k, b, j, f, h]
Define a distance between clusters
Initially, every datum is a cluster (return to this)
Initialize: every example is a cluster
Iterate:
Compute distances between all clusters
(store for efficiency)
Merge two closest clusters
Save both the clustering and the sequence of cluster operations (the “dendrogram”)
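The loop above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `agglomerative` and `single_link` are hypothetical, single link is used as the default cluster distance, and distances are recomputed on every iteration rather than stored, which is simpler but slower than caching them.

```python
import math

def single_link(c1, c2):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerative(points, cluster_distance=single_link):
    """Merge clusters bottom-up; return the sequence of merges."""
    # Initialize: every example is its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Compute distances between all cluster pairs and pick the closest.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        # Save the operation so the dendrogram can be rebuilt later.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return merges

for left, right, dist in agglomerative([(0,), (1,), (5,), (7,), (20,)]):
    print(left, "+", right, "at distance", dist)
```

The returned list of merges, read in order, is the information a dendrogram draws: which clusters were joined and at what distance.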
Iteration 1
Iteration 2
Iteration 3
• Builds up a sequence of clusters (“hierarchical”)
• Single Link
• Complete Link
• Average Link
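The three linkage options differ only in how the distance between two clusters is derived from point-to-point distances: single link takes the minimum, complete link the maximum, and average link the mean over all cross-cluster pairs. A minimal sketch, with function names chosen for illustration:

```python
import math

def pairwise(c1, c2):
    """All distances between a point in c1 and a point in c2."""
    return [math.dist(a, b) for a in c1 for b in c2]

def single_link(c1, c2):
    """Distance between the two closest points across the clusters."""
    return min(pairwise(c1, c2))

def complete_link(c1, c2):
    """Distance between the two farthest points across the clusters."""
    return max(pairwise(c1, c2))

def average_link(c1, c2):
    """Mean distance over all cross-cluster point pairs."""
    d = pairwise(c1, c2)
    return sum(d) / len(d)

a = [(0,), (1,)]
b = [(4,), (6,)]
print(single_link(a, b))    # 3.0
print(complete_link(a, b))  # 6.0
print(average_link(a, b))   # 4.5
```

Single link tends to chain clusters together through nearby points, while complete link favors compact clusters; average link sits between the two.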
Questions