Clustering: Partitional Methods
Jing Gao
SUNY Buffalo
Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering, etc.
Partitional Methods
• K-means algorithm
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Partitional Methods
• Center-based
– A cluster is a set of objects such that each object in the
cluster is closer (more similar) to the “center” of its own
cluster than to the center of any other cluster
– The center of a cluster is called the centroid
– Each point is assigned to the cluster with the closest
centroid
– The number of clusters K usually needs to be specified
[Figure: 4 center-based clusters]
K-means
• Partition {x1,…,xn} into K clusters
– K is predefined
• Initialization
– Specify the initial cluster centers (centroids)
• Iteration until no change
– For each object xi
• Calculate the distances between xi and the K centroids
• (Re)assign xi to the cluster whose centroid is the
closest to xi
– Update the cluster centroids based on current
assignment
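In code, the two alternating steps look as follows. This is a minimal NumPy sketch of the algorithm above (the function name and defaults are illustrative, not from the slides):

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        """Minimal K-means. X: (n, d) data array; K: number of clusters."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        # Initialization: pick K distinct data points as the starting centroids
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        assign = None
        for _ in range(max_iter):
            # Assignment step: each point joins the cluster of its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_assign = dists.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break  # assignments unchanged: converged
            assign = new_assign
            # Update step: recompute each centroid as the mean of its members
            for j in range(K):
                members = X[assign == j]
                if len(members) > 0:  # guard against empty clusters (see later slide)
                    centroids[j] = members.mean(axis=0)
        return centroids, assign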
K-means: Initialization
Initialization: Determine the three cluster centers
[Figure: scatter plot with the three initial centroids m1, m2, m3]
K-means Clustering: Cluster Assignment
Assign each object to the cluster whose centroid is closest to it
[Figure: points assigned to the nearest of m1, m2, m3]
K-means Clustering: Update Cluster Centroid
Compute each cluster centroid as the mean of the points in the cluster
[Figure: current centroids m1, m2, m3 with their assigned points]
K-means Clustering: Update Cluster Centroid
Compute each cluster centroid as the mean of the points in the cluster
[Figure: m1, m2, m3 moved to the means of their clusters]
K-means Clustering: Cluster Assignment
Assign each object to the cluster whose centroid is closest to it
[Figure: points reassigned to the nearest of the updated centroids]
K-means Clustering: Update Cluster Centroid
Compute each cluster centroid as the mean of the points in the cluster
[Figure: current centroids m1, m2, m3 with the new assignment]
K-means Clustering: Update Cluster Centroid
Compute each cluster centroid as the mean of the points in the cluster
[Figure: final centroid positions; assignments no longer change]
Partitional Methods
• K-means algorithm
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Sum of Squared Error (SSE)
• Suppose the centroid of cluster Cj is mj
• For each object x in Cj, compute the squared error between x and the
centroid mj
• Sum up the errors over all objects:

    SSE = \sum_{j} \sum_{x \in C_j} (x - m_j)^2

Example (1-D): points {1, 2} form cluster C1 with m1 = 1.5, and points {4, 5} form cluster C2 with m2 = 4.5
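Working through this toy example, each point lies 0.5 from its centroid, so

    SSE = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 4 \times 0.25 = 1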
How to Minimize SSE
    \min \sum_{j} \sum_{x \in C_j} (x - m_j)^2

K-means alternates between two steps, each of which can only decrease this objective: optimize the assignments with the centroids fixed, then optimize the centroids with the assignments fixed.
Cluster Assignment Step
    \min \sum_{j} \sum_{x \in C_j} (x - m_j)^2
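With the centroids m_j fixed, the objective decomposes into one term per object, so SSE is minimized by sending each object to its nearest centroid:

    x \in C_{j^*} \quad \text{where} \quad j^* = \arg\min_j \,(x - m_j)^2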
Example—Cluster Assignment
Given m1 and m2, which cluster does each of the five points belong to? Assign each point to the closest centroid to minimize SSE:

    x_1, x_2, x_3 \in C_1, \qquad x_4, x_5 \in C_2

    SSE = (x_1 - m_1)^2 + (x_2 - m_1)^2 + (x_3 - m_1)^2 + (x_4 - m_2)^2 + (x_5 - m_2)^2

[Figure: five points x1–x5 with centroids m1 and m2]
Cluster Centroid Computation Step
    \min \sum_{j} \sum_{x \in C_j} (x - m_j)^2
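With the assignments fixed, setting the gradient with respect to each centroid to zero shows that the optimal m_j is the mean of its cluster:

    \frac{\partial}{\partial m_j} \sum_{x \in C_j} (x - m_j)^2 = -2 \sum_{x \in C_j} (x - m_j) = 0 \;\Rightarrow\; m_j = \frac{1}{|C_j|} \sum_{x \in C_j} x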
Example—Cluster Centroid Computation
Given the cluster assignment, compute the centers of the two clusters

[Figure: m1 recomputed as the mean of {x1, x2, x3}, m2 as the mean of {x4, x5}]
Comments on the K-Means Method
• Strength
– Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n
– Easy to implement
• Issues
– Need to specify K, the number of clusters
– Converges only to a local minimum: initialization matters
– Empty clusters may appear
Partitional Methods
• K-means algorithm
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Importance of Choosing Initial Centroids
[Figure: successive K-means iterations from one choice of initial centroids, converging to the natural clustering]
Importance of Choosing Initial Centroids
[Figure: successive K-means iterations from a different choice of initial centroids, converging to a poor clustering]
Problems with Selecting Initial Points
With random initialization, the chance that each “real” cluster receives exactly one initial centroid is small, especially when K is large, as the following 10-cluster example shows.
10 Clusters Example
[Figure: Iteration 4 on data containing 10 clusters arranged in five pairs]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: Iterations 1–4 on the 10-cluster data]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: Iteration 4 on the 10-cluster data]
Starting with some pairs of clusters having three initial centroids, while others have only one
10 Clusters Example
[Figure: Iterations 1–4 on the 10-cluster data]
Starting with some pairs of clusters having three initial centroids, while others have only one
Solutions to Initial Centroids Problem
• Multiple runs
– Average the results or choose the one that has the
smallest SSE
• Sample and use hierarchical clustering to determine initial
centroids
• Select more than K initial centroids and then select among
these initial centroids
– Select most widely separated
• Postprocessing—Use K-means’ results as other algorithms’
initialization
• Bisecting K-means
– Not as susceptible to initialization issues
Bisecting K-means
Start with all points in a single cluster; repeatedly pick a cluster (e.g., the one with the largest SSE) and split it in two with standard 2-means, until K clusters are obtained.
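A minimal sketch of this splitting loop, reusing the kmeans function from the earlier sketch (the always-split-largest-SSE rule is one common choice; it assumes each split yields two non-empty halves):

    import numpy as np

    def bisecting_kmeans(X, K):
        """Grow from 1 to K clusters by repeatedly 2-means-splitting the worst cluster."""
        X = np.asarray(X, dtype=float)
        clusters = [X]  # start with all points in one cluster
        while len(clusters) < K:
            # pick the cluster with the largest SSE to split next
            sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
            worst = clusters.pop(int(np.argmax(sses)))
            # split it with standard 2-means
            _, assign = kmeans(worst, 2)
            clusters.append(worst[assign == 0])
            clusters.append(worst[assign == 1])
        return clusters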
Handling Empty Clusters
• Several strategies
– Choose the point that contributes most to SSE
– Choose a point from the cluster with the highest
SSE
– If there are several empty clusters, the above can
be repeated several times
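A sketch of the first strategy (the point contributing most to SSE becomes the new centroid of an empty cluster); this would slot into the update step of the kmeans sketch above:

    import numpy as np

    def reseed_empty(X, centroids, assign):
        """Re-seed each empty cluster with the current worst-fit point."""
        contrib = ((X - centroids[assign]) ** 2).sum(axis=1)  # per-point squared error
        for j in range(len(centroids)):
            if not np.any(assign == j):          # cluster j is empty
                far = int(np.argmax(contrib))    # largest SSE contributor
                centroids[j] = X[far]
                assign[far] = j
                contrib[far] = 0.0               # don't pick the same point twice
        return centroids, assign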
Updating Centers Incrementally
Instead of recomputing all centroids once per pass (the batch update above), each centroid can be updated immediately whenever a point is reassigned. This avoids empty clusters, but introduces order dependence and costs an update per assignment.
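The per-assignment update is cheap. When a point x joins or leaves a cluster with n members and centroid m, the centroid moves by

    m' = \frac{n\,m + x}{n + 1} = m + \frac{x - m}{n + 1} \qquad (x joins a cluster of size n)

    m' = \frac{n\,m - x}{n - 1} = m + \frac{m - x}{n - 1} \qquad (x leaves a cluster of size n)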
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
Partitional Methods
• K-means algorithm
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Variations of the K-Means Method
[Figure: a single outlier drags the mean-based center (+) away from the true cluster center]
K-Medoids Clustering Method
• Difference between K-means and K-medoids
– K-means: computes cluster centers as means, which may not be
actual data points
– K-medoids: each cluster is represented by an actual point in the
cluster (the medoid)
– K-medoids is more robust than K-means in the presence of
outliers, because a medoid is less influenced by outliers or other
extreme values
[Figure: the same data clustered by k-means (left) and k-medoids (right)]
The K-Medoid Clustering Method
– Arbitrarily choose k objects as the initial medoids
– Assign each remaining object to the nearest medoid
– Do loop: compute the total cost of swapping a medoid with a
non-medoid object O; perform the swap if quality is improved
[Figure: iterations illustrating the assignment and swap steps]
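A minimal PAM-style sketch of this swap loop, assuming a precomputed pairwise distance matrix D (written to mirror the steps above, not for efficiency):

    import numpy as np

    def k_medoids(D, K, max_iter=100, seed=0):
        """Swap-based k-medoids. D: (n, n) pairwise distance matrix."""
        rng = np.random.default_rng(seed)
        n = len(D)
        medoids = list(rng.choice(n, size=K, replace=False))  # arbitrary initial medoids

        def total_cost(meds):
            # each object contributes its distance to the nearest medoid
            return D[:, meds].min(axis=1).sum()

        cost = total_cost(medoids)
        for _ in range(max_iter):
            improved = False
            # try every (medoid, non-medoid) swap and keep any that lowers the cost
            for i in range(K):
                for o in range(n):
                    if o in medoids:
                        continue
                    candidate = medoids[:i] + [o] + medoids[i + 1:]
                    c = total_cost(candidate)
                    if c < cost:
                        medoids, cost, improved = candidate, c, True
            if not improved:
                break
        assign = D[:, medoids].argmin(axis=1)  # final assignment to nearest medoid
        return medoids, assign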
K-modes Algorithm
• Handling categorical data: K-modes (Huang ’98)
– Replace the means of clusters with modes
• Given n records in a cluster, the mode is a record made up
of the most frequent attribute values
– Use new dissimilarity measures to deal with categorical objects
• A mixture of categorical and numerical data: the K-prototypes
method

age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair
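A minimal sketch of the two K-modes ingredients, assuming records are tuples of categorical values (simple-matching dissimilarity, and the cluster mode):

    from collections import Counter

    def matching_dissim(a, b):
        """Simple-matching dissimilarity: number of attributes on which two records differ."""
        return sum(x != y for x, y in zip(a, b))

    def mode_record(records):
        """Mode of a cluster: the most frequent value of each attribute."""
        return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

    # Example with two records from the table above:
    r1 = ("<=30", "high", "no", "fair")
    r2 = ("<=30", "medium", "no", "fair")
    print(matching_dissim(r1, r2))  # 1: they differ only on income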
Limitations of K-means: Differing Sizes
K-means tends to produce poor results when the natural clusters have very different sizes: large natural clusters get broken apart
Limitations of K-means: Differing Density
Because SSE weights every point equally, K-means struggles when natural clusters have very different densities
Limitations of K-means: Irregular Shapes
K-means implicitly assumes globular clusters and breaks up elongated or otherwise non-globular ones
Overcoming K-means Limitations
One common remedy is to run K-means with many more clusters than needed, so that each natural cluster is covered by several small pieces, and then merge the pieces in a post-processing step
Take-away Message
K-means and its SSE objective, how the two alternating steps optimize SSE, why initialization matters and how to improve it, the variants (bisecting K-means, K-medoids, K-modes), and the limitations of K-means