Clustering
Ch. 16
What is clustering?
Clustering: the process of grouping a set of objects into classes of
similar objects
Objects within a cluster should be similar.
Objects from different clusters should be dissimilar.
The most common form of unsupervised learning
Unsupervised learning = learning from raw (unlabeled) data, as opposed to supervised learning, where a classification of the examples is given
Applications in search engines:
Structuring search results
Suggesting related pages
Automatic directory construction/update
Finding near identical/duplicate pages
Classification vs. Clustering
Classification:
• Supervised learning
• Learns a method for predicting the instance class from pre-labeled (classified) instances
Clustering:
• Unsupervised learning
• Finds "natural" groupings of instances given unlabeled data
Classification vs. Clustering (cont.)
A good clustering maximizes between-cluster variation (clusters are far apart) and minimizes within-cluster variation (points within a cluster are close together).
1- Partitional clustering
k-Means Clustering
k-means minimizes the sum of squared errors (SSE):
SSE = Σ_k Σ_{obj ∈ C_k} dist(obj, cent_k)²
where obj ranges over the data points in cluster C_k and cent_k is the centroid of cluster C_k
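To make the objective concrete, here is a minimal sketch in Python/NumPy (the function name sse and the argument layout are my own, not from the chapter):

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for k, cent in enumerate(centroids):
        members = points[labels == k]           # points assigned to cluster k
        total += np.sum((members - cent) ** 2)  # squared Euclidean distances
    return total
```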
K-means example, step 1: Pick 3 initial cluster centers (randomly)
K-means example, step 2: Assign each point to the closest cluster center
K-means example, step 3: Move each cluster center to the mean of its cluster
K-means example, step 4a: Reassign the points that are now closest to a different cluster center
Q: Which points are reassigned?
K-means example, step 4b: Re-compute the cluster means
K-means example, step 5: Move the cluster centers to the cluster means
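Putting steps 1-5 together, a bare-bones k-means loop might look like the sketch below (Python/NumPy assumed; empty clusters and other edge cases are not handled):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial cluster centers randomly from the data
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 2/4a: assign each point to the closest cluster center
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3/4b/5: move each center to the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no center moved: converged
            break
        centroids = new_centroids
    return labels, centroids
```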
Example 2:
Suppose that we have eight data points in two-dimensional space as follows:
a (1,3), b (3,3), c (4,3), d (5,3), e (1,2), f (4,2), g (1,1), h (2,1)
SSE = 33.64
Centroid of cluster 1 (points a, e, g) is [(1+1+1)/3, (3+2+1)/3] = (1,2)
Centroid of cluster 2 (points b, c, d, f, h) is (3.6, 2.4)
m1 = (1,2)
m2 = (3.6,2.4)
Point     Distance from m1 (1,2)    Distance from m2 (3.6,2.4)    Old membership    New membership
a (1,3)   1.00                      2.67                          C1                C1
b (3,3)   2.24                      0.85                          C2                C2
c (4,3)   3.61                      0.72                          C2                C2
d (5,3)   4.12                      1.52                          C2                C2
e (1,2)   0.00                      2.63                          C1                C1
f (4,2)   3.00                      0.57                          C2                C2
g (1,1)   1.00                      2.95                          C1                C1
h (2,1)   1.41                      2.13                          C2                C1
Point   Distance from m1 (1,2)    Distance from m2 (3.6,2.4)    Cluster membership
a       1.00                      2.67                          C1
b       2.24                      0.85                          C2
c       3.61                      0.72                          C2
d       4.12                      1.52                          C2
e       0.00                      2.63                          C1
f       3.00                      0.57                          C2
g       1.00                      2.95                          C1
h       1.41                      2.13                          C1
SSE = 30.42
Centroid of cluster 1 (now points a, e, g, h) is [(1+1+1+2)/4, (3+2+1+1)/4] = (1.25,1.75)
Centroid of cluster 2 (points b, c, d, f) is (4, 2.75)
m1 = (1.25,1.75)
m2 = (4,2.75)
Point   Distance from m1 (1.25,1.75)    Distance from m2 (4,2.75)    Old membership    New membership
a       1.27                            3.01                         C1                C1
b       2.15                            1.03                         C2                C2
c       3.02                            0.25                         C2                C2
d       3.95                            1.03                         C2                C2
e       0.35                            3.09                         C1                C1
f       2.76                            0.75                         C2                C2
g       0.79                            3.47                         C1                C1
h       1.06                            2.66                         C1                C1
No point changes its cluster membership, so the algorithm has converged.
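The two iterations above can be replayed in code; the following sketch (Python/NumPy; the helper name iterate is mine) reprints each point's distances and nearest centroid:

```python
import numpy as np

points = {'a': (1, 3), 'b': (3, 3), 'c': (4, 3), 'd': (5, 3),
          'e': (1, 2), 'f': (4, 2), 'g': (1, 1), 'h': (2, 1)}

def iterate(m1, m2):
    """Print each point's distance to both centroids and its new cluster."""
    for name, p in points.items():
        d1 = np.linalg.norm(np.subtract(p, m1))
        d2 = np.linalg.norm(np.subtract(p, m2))
        print(f"{name}: d(m1)={d1:.2f}  d(m2)={d2:.2f}  -> {'C1' if d1 <= d2 else 'C2'}")

iterate((1, 2), (3.6, 2.4))       # first table: h moves from C2 to C1
iterate((1.25, 1.75), (4, 2.75))  # second table: no point changes cluster
```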
2- Hierarchical clustering
• Clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.
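With SciPy this cut is a single call; a minimal sketch (the five data points and the threshold 1.3 are illustrative choices of mine):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]])
Z = linkage(X, method='single')                    # build the dendrogram bottom-up
labels = fcluster(Z, t=1.3, criterion='distance')  # cut at height 1.3
print(labels)  # each connected component below the cut becomes a cluster
```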
Complete link
Distance between the farthest elements of the two clusters
Centroids
Distance between the centroids (means) of the two clusters
Single link method
Distance between the closest elements of the two clusters: D(C_i, C_j) = min over x in C_i, y in C_j of d(x, y)
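These three measures can be written down directly; a sketch in Python/NumPy (function names are mine):

```python
import numpy as np
from itertools import product

def _pairwise(A, B):
    return [np.linalg.norm(np.subtract(a, b)) for a, b in product(A, B)]

def single_link(A, B):    # distance between the closest pair across clusters
    return min(_pairwise(A, B))

def complete_link(A, B):  # distance between the farthest pair across clusters
    return max(_pairwise(A, B))

def centroid_dist(A, B):  # distance between the cluster means
    return np.linalg.norm(np.mean(A, axis=0) - np.mean(B, axis=0))
```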
Single-link clustering
[Figure: six points and the resulting single-link dendrogram; leaves ordered 3, 6, 2, 5, 4, 1, with merge heights between 0.05 and 0.2.]
Example 3 - Single link method
• x1 = (1, 2)
• x2 = (1, 2.5)
• x3 = (3, 1)
• x4 = (4, 0.5)
• x5 = (4, 2)
Merge x1 and x2
Merge x3 and x4
Merge {x3, x4} and x5
Merge {x1, x2} and {x3, x4, x5}
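This merge sequence can be checked with SciPy's linkage (a sketch; the height annotations in the comments are computed from the points above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]])  # x1..x5
Z = linkage(X, method='single')
print(Z)
# Each row merges two clusters at a given height:
# x1+x2 at 0.50, x3+x4 at 1.12, {x3,x4}+x5 at 1.41,
# {x1,x2}+{x3,x4,x5} at 2.24
```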
Example 3 - Complete link method
• x1 = (1, 2)
• x2 = (1, 2.5)
• x3 = (3, 1)
• x4 = (4, 0.5)
• x5 = (4, 2)
Merge x1 and x2
Merge x3 and x4
Merge {x3, x4} and x5
Merge {x1, x2} and {x3, x4, x5}
The dendrogram: [figure]
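Complete link merges in the same order here, but the later merges happen at larger heights; a sketch that also draws the dendrogram (SciPy and matplotlib assumed; the heights in the comments are computed from the points above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 2.5], [3, 1], [4, 0.5], [4, 2]])  # x1..x5
Z = linkage(X, method='complete')
# Merge heights: x1+x2 at 0.50, x3+x4 at 1.12,
# {x3,x4}+x5 at 1.50, {x1,x2}+{x3,x4,x5} at 3.61
dendrogram(Z, labels=['x1', 'x2', 'x3', 'x4', 'x5'])
plt.show()
```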
Pros and Cons of Hierarchical Clustering
Advantages
Dendrograms are great for visualization
Provide hierarchical relations between clusters
Disadvantages
Not easy to choose the level at which to cut the dendrogram into final clusters
Greedy: a merge, once made, can never be undone
Sensitive to the choice of cluster distance measure and to noise/outliers
Experiments have shown that other clustering techniques often outperform hierarchical clustering
There are several variants to overcome its weaknesses
BIRCH: scalable to a large data set
ROCK: clustering categorical data
CHAMELEON: hierarchical clustering using dynamic modelling