19.1. Partitioning-Based Clustering Algorithms

Partitioning-based clustering algorithms optimize a specific objective function to group data into K clusters, commonly using heuristic methods like K-Means, K-Medians, and K-Medoids. K-Means is the most popular method, involving iterative assignment of points to centroids and updating these centroids until convergence. Variations of K-Means, such as K-Medoids and K-Medians, address sensitivity to outliers and non-numerical data, while Kernel K-Means extends the method to detect non-convex clusters by projecting data into high-dimensional space.


Partitioning-Based Clustering Algorithms
CSED, TIET
Partitioning-Based Algorithms- Introduction
▪ Discover groupings in the data by optimizing a specific objective function and iteratively improving the quality of the clusters.

▪ These algorithms partition a dataset D into K clusters so that the objective function is optimized.

▪ Finding the global optimum would require exhaustively enumerating all possible partitions.

▪ In practice, the following heuristic (greedy) methods are used to find the clusters:
▪ K-Means
▪ K-Medians
▪ K-Medoids
K-Means Clustering-Introduction
▪ It is the most popular and most widely used clustering method.
▪ It was proposed by MacQueen in 1967.
▪ In the K-means clustering algorithm, each cluster is represented by its center.
▪ Given K, the number of clusters, the K-means algorithm is outlined as follows:
▪ Select K points as the initial centroids.
▪ Repeat until convergence or for a fixed number of iterations:
▪ Assignment step: Form K clusters by assigning each point to its closest centroid.
▪ Update step: Recompute the centroid (i.e., mean point) of each cluster.
▪ Different distance measures, such as Manhattan distance (L1 norm), Euclidean distance (L2 norm), and cosine distance, can be used to measure the distance of each point from the centroids.
K-Means Algorithm
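A minimal NumPy sketch of the algorithm outlined above (an illustration, not a reference implementation; the function name k_means and its defaults are our own):

```python
import numpy as np

def k_means(X, k, n_iter=100, rng=np.random.default_rng(0)):
    """Minimal K-means sketch: pick k data points as initial centroids,
    then alternate assignment and update steps until the centroids
    stop moving. Assumes no cluster ever becomes empty."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = np.argmin(d, axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels
```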
K-Means- Numerical Example
Suppose we have four medicines with two features: weight index and pH.
Group these medicines into two clusters based on their features using the K-Means algorithm.

Medicine   Attribute 1 (X): Weight Index   Attribute 2 (Y): pH
A          1                               1
B          2                               1
C          4                               3
D          5                               4
Example – Solution
K = 2. Randomly initialize two centers; let them be c1 = (1, 1) and c2 = (2, 1).
Iteration 1
Assignment step: form K clusters by assigning each point to its closest centroid. With c1 = (1, 1) and c2 = (2, 1), point A is closest to c1, while B, C, and D are closest to c2, giving clusters {A} and {B, C, D}.

Update step: recompute the centroid (i.e., mean point) of each cluster:

c1 = (1, 1) and c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3)
Example – Solution (contd.)
Iteration 2
Assignment step: with c1 = (1, 1) and c2 = (11/3, 8/3), points A and B are now closest to c1, while C and D are closest to c2, giving clusters {A, B} and {C, D}.

Update step: recompute the centroid of each cluster:

c1 = ((1+2)/2, (1+1)/2) = (3/2, 1) and c2 = ((4+5)/2, (3+4)/2) = (9/2, 7/2)
Example – Solution (contd.)
Iteration 3
Assignment step: with c1 = (3/2, 1) and c2 = (9/2, 7/2), the clusters remain {A, B} and {C, D}.

Update step: recompute the centroid of each cluster:

c1 = ((1+2)/2, (1+1)/2) = (3/2, 1) and c2 = ((4+5)/2, (3+4)/2) = (9/2, 7/2)

Since neither the centroids nor the clusters change, the algorithm stops. The final clusters are {Medicine A, Medicine B} and {Medicine C, Medicine D}.
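As a sanity check, this worked example can be reproduced with scikit-learn's KMeans (a sketch assuming scikit-learn and NumPy are installed), initializing the centroids at (1, 1) and (2, 1) as above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Medicines A-D as (weight index, pH) points.
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

# Start from the same initial centers as in the worked example.
init = np.array([[1, 1], [2, 1]], dtype=float)
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)           # [0 0 1 1] -> clusters {A, B} and {C, D}
print(km.cluster_centers_)  # [[1.5 1. ], [4.5 3.5]] = (3/2, 1) and (9/2, 7/2)
```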
K-Means Objective Function
▪ The K-means algorithm partitions a dataset D of n objects into K clusters so that the objective function is minimized.
▪ Specifically, it uses the sum of squared errors (SSE): the squared deviations of each sample from the center of the cluster to which it belongs.
▪ The sum of squared errors is given by:

$$SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \|x_i - c_k\|^2$$

▪ This quantity is also called inertia, and its mean over the samples is called distortion.
▪ The SSE never increases when the centers are recomputed in the K-means algorithm, because the new center of a cluster is the mean of its points, which is exactly the point that minimizes that cluster's contribution to the SSE.
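The SSE is straightforward to compute directly; below is a small sketch (function and variable names are our own) that evaluates it for the final clustering of the worked example:

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared errors (inertia): the squared distance of each
    point to the center of the cluster it is assigned to."""
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centers))

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.5, 1.0], [4.5, 3.5]])
print(sse(X, labels, centers))  # 0.5 + 1.0 = 1.5
```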
How to choose K?
▪ There are a variety of ways to find the optimal number of clusters (i.e., K).
▪ The most popular is the elbow method.
▪ In this method, we try a number of distinct values of K and, for each value, compute the sum of squared deviations of the samples from their cluster centers.
▪ To determine the optimal number of clusters, we select the value of K at the "elbow", i.e., the point after which the distortion/inertia starts decreasing in a roughly linear fashion.
▪ For instance, in the figure on the original slide (not reproduced here), the optimal number of clusters for the data is 3.
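A sketch of the elbow method using scikit-learn (the blob data here is an arbitrary stand-in; KMeans exposes the SSE as its inertia_ attribute):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Plot K vs. inertia and look for the "elbow" in the curve.
plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("inertia (SSE)")
plt.show()
```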
How to choose K? (Contd…..)
▪ We can also find the optimal number of clusters using clustering evaluation metrics.

▪ We can try different numbers of clusters (K) and compute the silhouette coefficient for each; the value of K that maximizes the silhouette coefficient is chosen.

▪ If the ground truth is available, we can also use external evaluation metrics such as accuracy, Rand index, adjusted Rand index, purity, etc.
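A sketch of silhouette-based selection of K (assumes scikit-learn; note the silhouette coefficient needs at least two clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

best_k, best_score = None, -1.0
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)  # K with the highest silhouette coefficient
```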
K-Means: Key Points
▪ Efficiency: O(tKn), where n is the number of objects, K the number of clusters, and t the number of iterations.
▪ Normally K, t << n, so the method is efficient.
▪ K-means clustering often terminates at a local optimum.
▪ Initialization can be important for finding high-quality clusters.
▪ It is sensitive to noisy data and outliers.
▪ Variations: K-medians, K-medoids, etc.
▪ K-means is applicable only to objects in a continuous n-dimensional space.
▪ Use K-modes for categorical data.
▪ It is not suitable for discovering clusters with non-convex shapes.
▪ Use density-based methods, kernel K-means, etc., for those.
Variations of K-Means
▪ There are many variations of the K-means method, differing in several aspects:
1. Choosing better initial cluster centroids.
❑ K-means++, Intelligent K-means, Genetic K-means
2. Choosing different representative prototypes for the clusters.
❑ K-medians, K-medoids, K-modes
3. Applying feature transformation techniques.
❑ Weighted K-means, Kernel K-means
Initialization of K-Means
▪ In K-means, the initial centroids are chosen at random. Some initializations may not lead to the global optimum (i.e., the minimum sum of squared errors) but may get stuck in a local minimum.
Initialization of K-Means (contd.)
▪ There are two common remedies for the random initialization problem.

1. Original proposal (MacQueen, 1967): select K seeds randomly.

❑ Run the algorithm multiple times with different seeds and keep the best result (see the sketch after this list).

2. Use advanced versions of K-means for better initialization of the K seeds:

❑ K-Means++
❑ Genetic K-Means
❑ Intelligent K-Means
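For the first remedy, a minimal multi-restart sketch (it reuses the k_means sketch from earlier; the helper name best_of_restarts is our own): run K-means from several seeds and keep the clustering with the lowest SSE.

```python
import numpy as np

def best_of_restarts(X, k, restarts=10):
    """Run K-means from several random seeds; keep the lowest-SSE result."""
    best_sse, best = np.inf, None
    for seed in range(restarts):
        centroids, labels = k_means(X, k, rng=np.random.default_rng(seed))
        sse = sum(np.sum((X[labels == j] - c) ** 2)
                  for j, c in enumerate(centroids))
        if sse < best_sse:
            best_sse, best = sse, (centroids, labels)
    return best
```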
K-Means++
▪ The K-means++ algorithm was proposed by Arthur and Vassilvitskii (2007).
▪ It differs from standard K-means only in how the initial centroids are chosen (the initialization step); the assignment and update steps are unchanged.
▪ The K-means++ initialization proceeds as follows:
1. Randomly select the first centroid from the data points.
2. For each data point, compute its distance from the nearest previously chosen centroid.
3. Select the next centroid from the data points with probability proportional to the squared distance from the nearest previously chosen centroid (i.e., a point far from all existing centroids is most likely to be selected next).
4. Repeat steps 2 and 3 until K centroids have been sampled.
▪ This spreads the initial centroids apart, which mitigates the random initialization problem.
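A sketch of the K-means++ seeding step in NumPy (illustrative only; the function name is our own):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """K-means++ seeding: pick the first center uniformly at random,
    then pick each subsequent center with probability proportional to
    its squared distance from the nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        c = np.array(centers)
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min(((X[:, None, :] - c[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```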
K-Medoid Clustering
▪ K-medoid clustering is also called PAM (Partitioning Around Medoids).
▪ The PAM algorithm was proposed by Kaufman and Rousseeuw in 1987.
▪ The K-means algorithm is sensitive to outliers, since an object with an extreme value can substantially distort the mean of its cluster.
▪ Instead of taking the mean value of the objects in a cluster, the K-medoid algorithm uses a medoid: the most centrally located object in the cluster.
▪ The PAM algorithm works as follows:
1. Start from an initial set of medoids (randomly chosen).
2. Repeat until convergence or for a fixed number of iterations:
❑ Assignment: assign each data point to the nearest medoid.
❑ Update: replace a medoid with a non-medoid object of the same cluster if doing so improves the total error of the resulting clustering.
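The full PAM swap search is expensive; below is a simplified K-medoids sketch that uses alternating (Voronoi-style) iteration rather than PAM's exhaustive swaps (names are our own; assumes no cluster becomes empty):

```python
import numpy as np

def k_medoids(X, k, n_iter=100, rng=np.random.default_rng(0)):
    """Simplified K-medoids: assign points to the nearest medoid, then
    make the new medoid of each cluster the member that minimizes the
    total distance to the other members."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    labels = np.argmin(D[:, medoids], axis=1)
    for _ in range(n_iter):
        new = np.array([
            np.where(labels == j)[0][
                np.argmin(D[np.ix_(labels == j, labels == j)].sum(axis=0))]
            for j in range(k)
        ])
        if np.array_equal(new, medoids):
            break
        medoids = new
        labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels
```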
K-Medoid: Key Points
▪ The PAM algorithm works effectively for small data sets but does not scale well to large data sets, due to its computational complexity.
▪ Computational complexity of PAM: O(K(n-K)^2) per iteration, which is quite expensive (where n is the number of objects and K the number of clusters).
▪ Efficiency improvements on PAM:
▪ CLARA: runs PAM on random samples; O(Ks^2 + K(n-K)), where s is the sample size.
▪ CLARANS: PAM with randomized re-sampling; ensures both efficiency and quality.
K-Median
▪ Medians are less sensitive to outliers than means.
▪ For instance, the median salary of a large firm with a few highly paid executives is less affected by those executives than the mean salary.
▪ In K-medians, the median of the objects in a cluster, rather than the mean, is used as the reference point.
▪ The objective function minimized by the K-medians algorithm is the total Manhattan (L1) deviation:

$$E(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \|x_i - \mathrm{median}_k\|_1$$

▪ K-medians uses the L1 norm (Manhattan distance) to assign the data points to the nearest median.
K-Median (contd.)
▪ The K-medians algorithm works in the following steps:
1. Select K points as the initial representative objects (i.e., the initial K medians).

2. Repeat until convergence or for a fixed number of iterations:

❑ Assignment: assign each data point to the nearest median.

❑ Update: recompute each cluster's median component-wise from the data points in that cluster.
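A minimal K-medians sketch in NumPy (names are our own; assumes no cluster becomes empty):

```python
import numpy as np

def k_medians(X, k, n_iter=100, rng=np.random.default_rng(0)):
    """K-medians sketch: assign by Manhattan (L1) distance, then update
    each center to the component-wise median of its cluster."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment: L1 distance of every point to every median.
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(-1)
        labels = np.argmin(d, axis=1)
        # Update: component-wise median of each cluster.
        new = np.array([np.median(X[labels == j], axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels
```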
K-Mode
▪ K-means cannot handle non-numerical (categorical) data.
▪ Mapping categorical values to numerical values does not generate quality clusters for high-dimensional data.

▪ K-modes is an extension of K-means that replaces the means of clusters with modes.

▪ The K-modes algorithm works in the following steps:
1. Select K points as the initial representative objects (i.e., the initial K modes).
2. Repeat until convergence or for a fixed number of iterations:
❑ Assignment: assign each data point to the nearest representative based on Hamming distance.
❑ Update: recompute the mode of each cluster (i.e., for each feature, take the most frequent value in the cluster).

▪ For a mixture of categorical and numerical features, the K-prototypes algorithm (a combination of K-means and K-modes) is used.
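A minimal K-modes sketch for categorical data (names are our own; assumes no cluster becomes empty):

```python
import numpy as np

def most_frequent(col):
    """Most common category in a 1-D array."""
    vals, counts = np.unique(col, return_counts=True)
    return vals[np.argmax(counts)]

def k_modes(X, k, n_iter=100, rng=np.random.default_rng(0)):
    """K-modes sketch: assign by Hamming distance (number of mismatching
    features), then update each mode feature-wise to the most frequent
    category in its cluster."""
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = (X[:, None, :] != modes[None, :, :]).sum(-1)  # Hamming distance
        labels = np.argmin(dist, axis=1)
        new = np.array([[most_frequent(col) for col in X[labels == j].T]
                        for j in range(k)])
        if np.array_equal(new, modes):
            break
        modes = new
    return modes, labels
```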
Kernel K-Means
▪ The K-means algorithm cannot detect non-convex clusters.
▪ It can only detect clusters that are linearly separable (convex shapes).
▪ Convex-shaped clusters are those in which the line segment joining any two points lies completely inside the cluster.

▪ Idea: project the data into a high-dimensional kernel space, and then perform K-means clustering there.
▪ Map the data points from the input space to a high-dimensional feature space using kernel functions.
▪ Apply K-means clustering in the mapped feature space.
Kernel K-Means (contd.)
▪ Its computational complexity is higher than that of K-means.
▪ It needs to compute and store the n×n kernel matrix generated by applying the kernel function to the original data.
Some Kernel Functions
Gaussian radial basis function: $K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

Linear kernel: $K_{ij} = x_i \cdot x_j$

Polynomial kernel: $K_{ij} = (x_i \cdot x_j + c)^{d}$, for a constant $c$ and degree $d$

Sigmoid kernel: $K_{ij} = \tanh(a\, x_i \cdot x_j + b)$, for constants $a$ and $b$

Log kernel: $K_{ij} = -\log\!\left(\|x_i - x_j\|^{d} + 1\right)$
Kernel K-Means Example
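The original slide's figure is not reproduced here; as a stand-in, below is a sketch of kernel K-means on two concentric rings, a data set plain K-means cannot separate. It relies on the identity $\|\phi(x_i) - m_c\|^2 = K_{ii} - \frac{2}{|C|}\sum_{j \in C} K_{ij} + \frac{1}{|C|^2}\sum_{j,l \in C} K_{jl}$, so the feature map never has to be computed explicitly (names, the toy data, and the choice of $\sigma$ are our own):

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, rng=np.random.default_rng(0)):
    """Kernel K-means sketch on a precomputed n x n kernel matrix K.
    Each point is assigned to the cluster whose (implicit) feature-space
    mean is closest. Assumes no cluster becomes empty."""
    n = K.shape[0]
    labels = rng.integers(k, size=n)  # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = labels == c
            m = idx.sum()
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].sum(axis=1) / m
                          + K[np.ix_(idx, idx)].sum() / m**2)
        new = np.argmin(dist, axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# Two concentric rings: not linearly separable.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
r = np.r_[np.ones(100), 3 * np.ones(100)] + 0.1 * rng.normal(size=200)
X = np.c_[r * np.cos(t), r * np.sin(t)]

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 0.5 ** 2))  # Gaussian RBF kernel, sigma = 0.5
print(kernel_kmeans(K, 2))        # ideally recovers the two rings (up to label permutation)
```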
