
Clustering

Lecture 2: Partitional Methods

Jing Gao
SUNY Buffalo

Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering, etc.

Partitional Methods

• K-means algorithms
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Partitional Methods

• Center-based
– A cluster is a set of objects such that each object is closer
(more similar) to the “center” of its own cluster than to the
center of any other cluster
– The center of a cluster is called the centroid
– Each point is assigned to the cluster with the closest
centroid
– The number of clusters, K, usually must be specified in advance

[Figure: four center-based clusters]
K-means
• Partition {x1,…,xn} into K clusters
– K is predefined
• Initialization
– Specify the initial cluster centers (centroids)
• Iterate until no change
– For each object xi
• Calculate the distances between xi and the K centroids
• (Re)assign xi to the cluster whose centroid is the
closest to xi
– Update the cluster centroids based on current
assignment

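The iteration just described maps directly onto a few lines of NumPy. Below is a minimal sketch, assuming Euclidean distance and a random choice of K data points as the initial centroids; the function name and the convergence test are illustrative choices, not part of the lecture.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means. X: (n, d) array of objects; k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # no change: stop iterating
            break
        centroids = new_centroids
    return labels, centroids
```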
K-means: Initialization
Initialization: Determine the three cluster centers
[Figure: three initial centroids m1, m2, m3 placed among the data points]
K-means Clustering: Cluster Assignment
Assign each object to the cluster whose centroid is closest to the object

[Figure: each point assigned to the nearest of the centroids m1, m2, m3]
K-means Clustering: Update Cluster Centroid
Compute cluster centroid as the center of the points in the cluster

[Figure: recomputing m1, m2, m3 as the centers of their clusters]
K-means Clustering: Update Cluster Centroid
Compute cluster centroid as the center of the points in the cluster

[Figure: the centroids after the update]
K-means Clustering: Cluster Assignment
Assign each object to the cluster whose centroid is closest to the object

[Figure: points reassigned to the nearest of the updated centroids]
K-means Clustering: Update Cluster Centroid
Compute cluster centroid as the center of the points in the cluster

[Figure: recomputing the centroids after reassignment]
K-means Clustering: Update Cluster Centroid
Compute cluster centroid as the center of the points in the cluster

[Figure: the final centroids; the assignments no longer change]
Partitional Methods

• K-means algorithms
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Sum of Squared Error (SSE)
• Suppose the centroid of cluster Cj is mj
• For each object x in Cj, compute the squared error between x and the
centroid mj
• Sum up the error of all the objects

SSE = \sum_j \sum_{x \in C_j} (x - m_j)^2

Example (two 1-D clusters): C1 = {1, 2} with centroid m1 = 1.5,
C2 = {4, 5} with centroid m2 = 4.5

SSE = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1
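As a quick numerical check, a few lines of Python reproduce the 1-D example above; the arrays simply encode the two clusters from the slide.

```python
import numpy as np

points = np.array([1.0, 2.0, 4.0, 5.0])
labels = np.array([0, 0, 1, 1])      # C1 = {1, 2}, C2 = {4, 5}
centroids = np.array([1.5, 4.5])     # m1, m2

sse = np.sum((points - centroids[labels]) ** 2)
print(sse)  # 1.0
```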
How to Minimize SSE

\min \sum_j \sum_{x \in C_j} (x - m_j)^2

• Two sets of variables to minimize
– The cluster assignment: which cluster C_j each object x belongs to
– The cluster centroids m_j


• Block coordinate descent
– Fix the cluster centroid—find cluster assignment that
minimizes the current error
– Fix the cluster assignment—compute the cluster centroids
that minimize the current error

Cluster Assignment Step

\min \sum_j \sum_{x \in C_j} (x - m_j)^2

• The cluster centroids m_j are fixed

• For each object x
– Assign x to the cluster C_j whose centroid m_j is closest to x
– Choosing any other cluster would incur a larger error
• Minimizing the error on each object separately minimizes the
total SSE

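In vectorized form the assignment step is a single argmin over a point-to-centroid distance matrix. The sketch below uses made-up coordinates for five points and two centroids, roughly mirroring the example on the next slide.

```python
import numpy as np

X = np.array([[1.0, 7.0], [2.0, 6.0], [4.0, 5.0], [5.0, 3.0], [6.0, 1.0]])
centroids = np.array([[2.0, 3.0], [5.0, 1.0]])  # m1, m2 (illustrative values)

# Squared Euclidean distance from every point to every centroid: shape (n, k)
dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = dists.argmin(axis=1)  # index of the nearest centroid for each point
print(labels)                  # [0 0 0 1 1]
```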
Example—Cluster Assignment
Given centroids m1 and m2, which cluster does each of the five
points x1, …, x5 belong to?

Assign each point to the closest centroid to minimize the SSE:

x1, x2, x3 ∈ C1;   x4, x5 ∈ C2

SSE = (x1 - m1)^2 + (x2 - m1)^2 + (x3 - m1)^2 + (x4 - m2)^2 + (x5 - m2)^2

[Figure: x1, x2, x3 lie nearest m1; x4, x5 lie nearest m2]
Cluster Centroid Computation Step

\min \sum_j \sum_{x \in C_j} (x - m_j)^2

• For each cluster C_j
– Choose the centroid m_j as the center (mean) of its points:

m_j = \frac{1}{|C_j|} \sum_{x \in C_j} x

• Minimizing the error on each cluster separately minimizes the
total SSE
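The update step is likewise one line per cluster: take the mean of the assigned points. Continuing the illustrative arrays from the assignment sketch:

```python
import numpy as np

X = np.array([[1.0, 7.0], [2.0, 6.0], [4.0, 5.0], [5.0, 3.0], [6.0, 1.0]])
labels = np.array([0, 0, 0, 1, 1])
k = 2

# New centroid of each cluster = center (mean) of its assigned points
centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
print(centroids)  # [[2.33.. 6.  ]
                  #  [5.5   2.  ]]
```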
Example—Cluster Centroid Computation
Given the cluster assignment, compute the centers of the two clusters

[Figure: m1 recomputed as the center of x1, x2, x3; m2 as the center of x4, x5]
Comments on the K-Means Method

• Strength
– Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n
– Easy to implement

• Issues
– Need to specify K, the number of clusters
– Converges only to a local minimum; initialization matters
– Empty clusters may appear
Partitional Methods

• K-means algorithms
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Importance of Choosing Initial Centroids
[Figure: iterations 1 through 6 of K-means from one choice of initial
centroids, converging to a good clustering]
Importance of Choosing Initial Centroids
[Figure: iterations 1 through 5 of K-means from a different choice of initial
centroids, converging to a poorer clustering]
Problems with Selecting Initial Points

• If there are K ‘real’ clusters, then the chance of randomly
selecting exactly one initial centroid from each cluster is small
– The chance is relatively small when K is large
– If each cluster has the same size n, then

P = \frac{K! \, n^K}{(Kn)^K} = \frac{K!}{K^K}

– For example, if K = 10, then the probability is 10!/10^10 ≈ 0.00036
– Sometimes the initial centroids readjust themselves in the
‘right’ way, and sometimes they don’t
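The K = 10 figure is easy to verify; a one-line check of K!/K^K:

```python
from math import factorial

K = 10
print(factorial(K) / K**K)  # 0.00036288
```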
10 Clusters Example
[Figure: iteration 4 of K-means on data with 10 natural clusters (five pairs)]

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: iterations 1 through 4 of K-means on the 10-cluster data]

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: iteration 4 of K-means on the 10-cluster data]

Starting with some pairs of clusters having three initial centroids, while
others have only one
10 Clusters Example
[Figure: iterations 1 through 4 of K-means on the 10-cluster data]

Starting with some pairs of clusters having three initial centroids, while
others have only one
Solutions to Initial Centroids Problem

• Multiple runs
– Average the results or choose the run with the smallest SSE
(see the sketch after this list)
• Sample the data and use hierarchical clustering to determine
initial centroids
• Select more than K initial centroids and then choose K from
among them
– Select the most widely separated ones
• Postprocessing: use the K-means result to initialize other
algorithms
• Bisecting K-means
– Not as susceptible to initialization issues

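A minimal sketch of the multiple-runs remedy, assuming the kmeans(X, k, seed=...) function sketched earlier in these notes; it keeps the run with the smallest SSE.

```python
import numpy as np

def best_of_n_runs(X, k, n_runs=10):
    """Run K-means several times, keeping the result with the lowest SSE."""
    best = None
    for seed in range(n_runs):
        labels, centroids = kmeans(X, k, seed=seed)  # assumed from earlier
        sse = np.sum((X - centroids[labels]) ** 2)   # total squared error
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best[1], best[2]
```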
Bisecting K-means

• Bisecting K-means algorithm
– Variant of K-means that can produce a partitional or a
hierarchical clustering
– Repeatedly splits one cluster into two with K-means until K
clusters are obtained (see the sketch below)
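A minimal sketch of the idea, again assuming the earlier kmeans function: repeatedly bisect one cluster with 2-means until K clusters exist. Splitting the cluster with the largest SSE is one common heuristic; other criteria (e.g., largest size) also appear in practice.

```python
import numpy as np

def bisecting_kmeans(X, k):
    """Produce k clusters by repeatedly splitting the worst cluster in two."""
    clusters = [X]  # start with all points in a single cluster
    while len(clusters) < k:
        # Split the cluster with the largest SSE around its own mean
        sses = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        labels, _ = kmeans(target, 2)  # bisect with ordinary K-means
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters
```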
Handling Empty Clusters

• The basic K-means algorithm can yield empty clusters

• Several strategies for choosing a replacement centroid
– Choose the point that contributes most to the SSE
– Choose a point from the cluster with the highest SSE
– If there are several empty clusters, the above can be
repeated several times
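A sketch of the first strategy: when a cluster comes up empty, reseed it with the point that currently contributes most to the SSE. The function name and array conventions follow the earlier sketches.

```python
import numpy as np

def reseed_empty(X, labels, centroids):
    """Replace each empty cluster's centroid with a worst-fit point."""
    contrib = np.sum((X - centroids[labels]) ** 2, axis=1)  # per-point SSE
    for j in range(len(centroids)):
        if not np.any(labels == j):          # cluster j is empty
            worst = int(np.argmax(contrib))  # biggest SSE contributor
            centroids[j] = X[worst]
            labels[worst] = j
            contrib[worst] = 0.0             # don't pick the same point twice
    return labels, centroids
```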
Updating Centers Incrementally

• In the basic K-means algorithm, centroids are updated after
all points have been assigned

• An alternative is to update the centroids after each
individual assignment (incremental approach)
– Each assignment updates zero or two centroids
– More expensive
– Introduces an order dependency
– Never produces an empty cluster
– Can use “weights” to change the impact of each point
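The incremental update itself is the standard running-mean formula: removing x from cluster a and adding it to cluster b only touches those two centroids. A sketch, assuming a per-cluster count array kept alongside the centroids:

```python
import numpy as np

def move_point(x, a, b, centroids, counts):
    """Move point x from cluster a to cluster b, updating two running means."""
    counts[a] -= 1
    if counts[a] > 0:                                   # a is not emptied
        centroids[a] -= (x - centroids[a]) / counts[a]  # remove x from a's mean
    counts[b] += 1
    centroids[b] += (x - centroids[b]) / counts[b]      # fold x into b's mean

centroids = np.array([[1.5, 6.5], [5.5, 2.0]])
counts = np.array([3, 2])
move_point(np.array([4.0, 5.0]), 0, 1, centroids, counts)
print(centroids)  # [[0.25 7.25]
                  #  [5.   3.  ]]
```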
Pre-processing and Post-processing

• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE

Partitional Methods

• K-means algorithms
• Optimization of SSE
• Improvements on K-means
• K-means variants
• Limitations of K-means
Variations of the K-Means Method

• Most variants of K-means differ in
– Dissimilarity calculations
– Strategies to calculate cluster means

• Two important issues of K-means
– Sensitive to noisy data and outliers
• Remedy: the K-medoids algorithm
– Applicable only to objects in a continuous multi-dimensional space
• Remedy: the K-modes method for categorical data
Sensitive to Outliers

• K-means is sensitive to outliers
– Outlier: an object with extremely large (or small) values
• Outliers may substantially distort the distribution of the data

[Figure: a single outlier pulling a cluster centroid (+) away from the rest
of the data]
K-Medoids Clustering Method
• Difference between K-means and K-medoids
– K-means: computes cluster centers, which may not be original
data points
– K-medoids: each cluster is represented by an actual point in
the cluster (the medoid)
– K-medoids is more robust than K-means in the presence of
outliers, because a medoid is less influenced by outliers or
other extreme values

[Figure: the same data clustered by k-means (left) and k-medoids (right)]
The K-Medoid Clustering Method

• K-medoids clustering: find representative objects (medoids) in clusters
– PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if doing so improves the
total distance of the resulting clustering
• PAM works effectively for small data sets but does not scale well to
large data sets; the time complexity is O(k(n - k)^2) per iteration,
where n is the number of data objects and k the number of clusters
• Efficiency improvements on PAM
– CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
– CLARANS (Ng & Han, 1994): randomized re-sampling
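A compact sketch of the PAM swap loop described above, over a precomputed n-by-n distance matrix D. The greedy first-improvement order and the fixed random seed are implementation choices for illustration, not part of the original algorithm.

```python
import numpy as np

def pam(D, k, max_iter=100):
    """PAM: D is an (n, n) pairwise distance matrix, k the number of medoids."""
    n = len(D)
    medoids = list(np.random.default_rng(0).choice(n, size=k, replace=False))

    def total_cost(meds):
        # Every object pays the distance to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for h in range(n):               # candidate non-medoid objects
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                c = total_cost(candidate)
                if c < cost:                 # swap only if it lowers the cost
                    medoids, cost, improved = candidate, c, True
        if not improved:                     # no beneficial swap: stop
            break
    return medoids
```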
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2. Arbitrarily choose k objects as the initial medoids
and assign each remaining object to the nearest medoid (total cost = 20).
Then loop: randomly select a non-medoid object O_random, compute the total
cost of swapping a medoid with O_random (total cost = 26), and perform the
swap only if it improves the quality. Repeat until no change.]
K-modes Algorithm
• Handling categorical data: K-modes (Huang ’98)
– Replace the means of clusters with modes
• Given n records in a cluster, the mode is a record made up
of the most frequent attribute values
– Use new dissimilarity measures to deal with categorical objects
– For a mixture of categorical and numerical data: the
K-prototype method

age     income   student   credit_rating
<=30    high     no        fair
<=30    high     no        excellent
31…40   high     no        fair
>40     medium   no        fair
>40     low      yes       fair
>40     low      yes       excellent
31…40   low      yes       excellent
<=30    medium   no        fair
<=30    low      yes       fair
>40     medium   yes       fair
<=30    medium   yes       excellent
31…40   medium   no        excellent
31…40   high     yes       fair

mode = (<=30, medium, yes, fair)
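Computing the mode record is a per-column majority vote. A small sketch over the table above, using only the standard library:

```python
from collections import Counter

records = [
    ("<=30",  "high",   "no",  "fair"),
    ("<=30",  "high",   "no",  "excellent"),
    ("31…40", "high",   "no",  "fair"),
    (">40",   "medium", "no",  "fair"),
    (">40",   "low",    "yes", "fair"),
    (">40",   "low",    "yes", "excellent"),
    ("31…40", "low",    "yes", "excellent"),
    ("<=30",  "medium", "no",  "fair"),
    ("<=30",  "low",    "yes", "fair"),
    (">40",   "medium", "yes", "fair"),
    ("<=30",  "medium", "yes", "excellent"),
    ("31…40", "medium", "no",  "excellent"),
    ("31…40", "high",   "yes", "fair"),
]

# Mode record: the most frequent value in each attribute column
mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))
print(mode)  # ('<=30', 'medium', 'yes', 'fair')
```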
Limitations of K-means

• K-means has problems when clusters differ in
– Size
– Density
– Shape (irregular, non-globular clusters)
Limitations of K-means: Differing Sizes

[Figure: original points (left) vs. K-means with 3 clusters (right)]
Limitations of K-means: Differing Density

[Figure: original points (left) vs. K-means with 3 clusters (right)]
Limitations of K-means: Irregular Shapes

[Figure: original points (left) vs. K-means with 2 clusters (right)]
Overcoming K-means Limitations

[Figure: original points (left) vs. K-means with many clusters (right)]

One solution is to use many clusters: K-means then finds parts of the natural
clusters, which must be put back together afterwards.
Overcoming K-means Limitations

[Figure: original points (left) vs. K-means with many clusters (right)]
Overcoming K-means Limitations

[Figure: original points (left) vs. K-means with many clusters (right)]
Take-away Message

• What is partitional clustering?
• How does K-means work?
• How is K-means related to the minimization of SSE?
• What are the strengths and weaknesses of K-means?
• What are the variants of K-means?
