Lecture 3. Partitioning-Based Clustering Methods
Session 1: Basic Concepts of Partitioning Algorithms
Partitioning Algorithms: Basic Concepts
Partitioning method: Discovering the groupings in the data by optimizing a specific
objective function and iteratively improving the quality of partitions
K-partitioning method: Partitioning a dataset D of n objects into a set of K clusters
so that an objective function is optimized (e.g., the sum of squared distances is
minimized), where c_k is the centroid or medoid of cluster C_k
A typical objective function: Sum of Squared Errors (SSE)

SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - c_k \|^2
Problem definition: Given K, find a partition of K clusters that optimizes the chosen
partitioning criterion
Global optimum: finding it requires exhaustively enumerating all possible partitions
Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-Medoids, etc.
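The SSE objective above can be computed directly (a minimal NumPy sketch; the helper name `sse` is mine):

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of Squared Errors: squared L2 distance from each point
    to the centroid of the cluster it is assigned to."""
    return sum(np.sum((points[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

# Toy 1-D data: two clusters with centroids 0.5 and 10.5
points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5], [10.5]])
print(sse(points, labels, centroids))  # 4 * 0.5^2 = 1.0
```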
Session 2: The K-Means Clustering Method
The K-Means Clustering Method
K-Means (MacQueen’67, Lloyd’57/’82)
Each cluster is represented by the center of the cluster
Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows:
Select K points as the initial cluster centroids
Repeat
Assign each point to the cluster with the closest centroid
Recompute the centroid (mean) of each cluster
Until the assignment no longer changes
Distance measures: Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
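The loop above can be sketched in NumPy (a minimal Lloyd-style implementation under the L2 norm; the function name and empty-cluster handling are mine):

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points to the
    nearest centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]  # random seeds
    for _ in range(n_iter):
        # Assignment step: nearest centroid under Euclidean (L2) distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster (keep old centroid if empty)
        new = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                        else centroids[k] for k in range(K)])
        if np.allclose(new, centroids):   # no change => converged
            break
        centroids = new
    return labels, centroids

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
labels, centroids = k_means(X, K=2)
```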
Example: K-Means Clustering
[Figure: K-Means iterations: assign points to clusters, then recompute cluster centers]
Session 3: Initialization of K-Means Clustering
Initialization of K-Means
Different initializations may generate rather different clustering
results (some could be far from optimal)
Original proposal (MacQueen’67): Select K seeds randomly
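One common safeguard against a bad random seed (a sketch of standard practice, not an algorithm from the slides; all names are mine): run K-Means from several random seeds and keep the partition with the lowest SSE.

```python
import numpy as np

def lloyd(X, K, seed, n_iter=50):
    """One K-Means run from a random seed; returns (SSE, labels)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        lab = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
        C = np.array([X[lab == k].mean(axis=0) if (lab == k).any() else C[k]
                      for k in range(K)])
    return ((X - C[lab]) ** 2).sum(), lab

def best_of_restarts(X, K, n_restarts=10):
    """Keep the partition with the lowest SSE across random restarts."""
    return min((lloyd(X, K, s) for s in range(n_restarts)),
               key=lambda run: run[0])

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
best_sse, best_labels = best_of_restarts(X, K=2)
```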
Example: Poor Initialization May Lead to Poor Clustering
[Figure: with a poor initialization, assigning points to clusters and recomputing cluster centers converges to a poor clustering]
Session 4: The K-Medoids Clustering Method
Handling Outliers: From K-Means to K-Medoids
The K-Means algorithm is sensitive to outliers, since an object with an extremely
large value may substantially distort the distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference
point, a medoid can be used, which is the most centrally located object in a cluster
The K-Medoids clustering algorithm:
Select K points as the initial representative objects (i.e., as initial K medoids)
Repeat
Assign each point to the cluster with the closest medoid
Randomly select a non-representative object oi
Compute the total cost S of swapping the medoid m with oi
If S < 0, then swap m with oi to form the new set of medoids
Until convergence criterion is satisfied
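The swap-based outline above can be sketched as follows (a minimal randomized version; the names and the fixed iteration budget are mine):

```python
import numpy as np

def k_medoids(X, K, n_iter=200, seed=0):
    """Randomized-swap K-Medoids: try swapping a medoid with a random
    non-medoid and keep the swap when the total cost decreases."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), K, replace=False))

    def cost(meds):
        # Total distance of every point to its nearest medoid
        d = np.linalg.norm(X[:, None] - X[meds][None], axis=2)
        return d.min(axis=1).sum()

    best = cost(medoids)
    for _ in range(n_iter):
        o = rng.integers(len(X))          # candidate non-medoid o_i
        if o in medoids:
            continue
        trial = medoids.copy()
        trial[rng.integers(K)] = o        # swap one medoid with o_i
        c = cost(trial)
        if c < best:                      # cost change S < 0: accept
            medoids, best = trial, c
    d = np.linalg.norm(X[:, None] - X[medoids][None], axis=2)
    return medoids, d.argmin(axis=1)

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
medoids, labels = k_medoids(X, K=2)
```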
PAM: A Typical K-Medoids Algorithm
[Figure: PAM on a toy data set with K = 2. Select initial K medoids randomly: arbitrarily choose K objects as initial medoids, then assign each remaining object to the nearest medoid. Repeat: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid with O_random, and perform the swap if it improves the clustering quality.]
Discussion on K-Medoids Clustering
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids: Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids, and
Iteratively replaces one of the medoids by one of the non-medoids if it improves
the total sum of the squared errors (SSE) of the resulting clustering
PAM works effectively for small data sets but does not scale well for large data
sets (due to the computational complexity)
Computational complexity: PAM is O(K(n − K)²) per iteration (quite expensive!)
Efficiency improvements on PAM
CLARA (Kaufmann & Rousseeuw, 1990):
PAM on samples: O(Ks² + K(n − K)), where s is the sample size
CLARANS (Ng & Han, 1994): Randomized re-sampling, ensuring efficiency + quality
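CLARA's sample-then-evaluate idea can be sketched like this (the inner `pam_swap` is a tiny randomized stand-in for full PAM, and all names and parameters here are mine):

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Total distance from every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None] - X[medoid_idx][None], axis=2)
    return d.min(axis=1).sum()

def pam_swap(X, idx, K, rng, n_iter=100):
    """Crude PAM stand-in: randomized improving swaps restricted to idx."""
    meds = list(rng.choice(idx, K, replace=False))
    best = total_cost(X, meds)
    for _ in range(n_iter):
        trial = meds.copy()
        trial[rng.integers(K)] = rng.choice(idx)
        c = total_cost(X, trial)
        if len(set(trial)) == K and c < best:
            meds, best = trial, c
    return meds

def clara(X, K, n_samples=5, s=40, seed=0):
    """CLARA: run PAM on random samples, keep the medoid set that has
    the lowest cost on the FULL data set."""
    rng = np.random.default_rng(seed)
    best_meds, best_cost = None, np.inf
    for _ in range(n_samples):
        sample = rng.choice(len(X), size=min(s, len(X)), replace=False)
        meds = pam_swap(X, sample, K, rng)
        c = total_cost(X, meds)           # evaluate on ALL of X
        if c < best_cost:
            best_meds, best_cost = meds, c
    return best_meds
```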
Session 5: The K-Medians and K-Modes Clustering Methods
K-Medians: Handling Outliers by Computing Medians
Medians are less sensitive to outliers than means
Think of the median salary vs. mean salary of a large firm when adding a few top
executives!
K-Medians: Instead of taking the mean value of the objects in a cluster as a reference
point, medians are used (with the L1 norm as the distance measure)
The criterion function for the K-Medians algorithm:

S = \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{j} | x_{ij} - \mathrm{med}_{kj} |

The K-Medians clustering algorithm:
Select K points as the initial cluster medians
Repeat
Assign each point to the cluster with the closest median (L1 distance)
Recompute the median of each dimension in each cluster
Until convergence criterion is satisfied
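A sketch of the K-Medians loop (same shape as K-Means, but with L1 assignment and per-dimension medians; the names are mine):

```python
import numpy as np

def k_medians(X, K, n_iter=100, seed=0):
    """K-Medians: assign by L1 (Manhattan) distance, then update each
    cluster center as the per-dimension median of its members."""
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        dists = np.abs(X[:, None, :] - M[None, :, :]).sum(axis=2)  # L1
        labels = dists.argmin(axis=1)
        new = np.array([np.median(X[labels == k], axis=0) if (labels == k).any()
                        else M[k] for k in range(K)])
        if np.allclose(new, M):
            break
        M = new
    return labels, M

X = np.array([[0, 0], [0, 2], [0, 4], [10, 10], [10, 12], [10, 14]], dtype=float)
labels, medians = k_medians(X, K=2)
```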
Kernel K-Means: each cluster centroid in the kernel space is

c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} \varphi(x_i)

Clustering can be performed without the actual individual projections φ(x_i) and φ(x_j)
for the data points x_i, x_j ∈ C_k
Example: Kernel Functions and Kernel K-Means Clustering
Gaussian radial basis function (RBF) kernel: K(X_i, X_j) = e^{-\| X_i - X_j \|^2 / (2\sigma^2)}
Suppose there are 5 original 2-dimensional points:
x1(0, 0), x2(4, 4), x3(-4, 4), x4(-4, -4), x5(4, -4)
If we set σ to 4, we can compute the pairwise kernel values for these points
E.g., \| x_1 - x_2 \|^2 = (0-4)^2 + (0-4)^2 = 32, therefore K(x_1, x_2) = e^{-32/(2 \cdot 4^2)} = e^{-1}
[Figure: the original data set; the result of K-Means clustering; the result of Gaussian Kernel K-Means clustering]
The above data set cannot be clustered well by K-Means since it contains non-convex clusters
Gaussian RBF kernel transformation maps the data to a kernel matrix K: for any two points
x_i, x_j, K_{x_i x_j} = \varphi(x_i) \cdot \varphi(x_j), where for the Gaussian kernel K(X_i, X_j) = e^{-\| X_i - X_j \|^2 / (2\sigma^2)}
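The worked example above can be checked numerically (a small sketch; `rbf_kernel_matrix` is my name):

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))"""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# The 5 points from the example above
X = np.array([[0, 0], [4, 4], [-4, 4], [-4, -4], [4, -4]], dtype=float)
K = rbf_kernel_matrix(X, sigma=4)
print(K[0, 1])  # ||x1 - x2||^2 = 32, so e^(-32/32) = e^-1 ≈ 0.3679
```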
Recommended Readings
J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proc.
of the 5th Berkeley Symp. on Mathematical Statistics and Probability, 1967
S. Lloyd. Least Squares Quantization in PCM. IEEE Trans. on Information Theory, 28(2), 1982
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John
Wiley & Sons, 1990
R. Ng and J. Han. Efficient and Effective Clustering Method for Spatial Data Mining. VLDB'94
B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear Component Analysis as a Kernel Eigenvalue
Problem. Neural Computation, 10(5):1299–1319, 1998
I. S. Dhillon, Y. Guan, and B. Kulis. Kernel K-Means: Spectral Clustering and Normalized Cuts. KDD’04
D. Arthur and S. Vassilvitskii. K-means++: The Advantages of Careful Seeding. SODA’07
C. K. Reddy and B. Vinzamuri. A Survey of Partitional and Hierarchical Clustering Algorithms, in
(Chap. 4) Aggarwal and Reddy (eds.), Data Clustering: Algorithms and Applications. CRC Press, 2014
M. J. Zaki and W. Meira, Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms.
Cambridge Univ. Press, 2014