Determining Clusters
There are several ways to determine the number of clusters in the k-means
algorithm, including:
1. Elbow method
This well-known method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters (k). The WCSS is the sum of the squared distances between each point and its cluster's centroid. The plot typically shows an elbow shape: the WCSS drops rapidly at first and then levels off. The k value at the bend is taken as the optimal number of clusters.
The goal of the elbow method is to find the value of k that balances minimizing within-cluster variance against overfitting with too many clusters. Here's how it works:
Steps:
1. Run k-means clustering on your dataset for a range of values of k (e.g., 1 to 10).
2. Calculate the Within-Cluster Sum of Squared Errors (WCSS) for each value of k. WCSS is the sum of the squared distances between each point and the centroid of its cluster.
3. Plot WCSS against k and locate the point where the curve bends.
Interpretation:
The "elbow" point is where adding additional clusters provides diminishing returns in
terms of improved clustering (reduced WCSS).
The elbow might not always be very sharp, so interpretation may require judgment. In
such cases, it’s useful to complement it with other methods like the Silhouette Score.
Example:
If you're clustering customer data to segment them into different groups, the elbow method can
help determine how many meaningful groups (clusters) there are without overfitting.
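As a minimal sketch of these steps (assuming scikit-learn is available; the synthetic blob data is purely illustrative):

```python
# Elbow method sketch: compute WCSS (KMeans.inertia_) for a range of k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# WCSS for k = 1..10; the "elbow" is where the drop in WCSS flattens out.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is exactly the WCSS

for k, w in zip(range(1, 11), wcss):
    print(f"k={k}: WCSS={w:.1f}")
```

On data like this, the WCSS falls steeply up to the true number of groups and flattens afterwards; the bend would then be read off a plot of these values.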
2. Cross-validation
This process involves partitioning the data into multiple parts, using each part as a test
set, and calculating the objective function value for each test set. The average of
these values is calculated for each number of clusters, and the number of clusters is
selected where increasing the number of clusters only slightly reduces the objective
function.
In general, k-means clustering doesn't naturally lend itself to the standard cross-validation
approach, which is widely used in supervised learning. This is because k-means is an
unsupervised algorithm, and it doesn't have a clear objective function to measure "accuracy" in
the same way as supervised algorithms. However, there are ways to adapt the idea of cross-
validation for k-means.
Stability-Based Cross-Validation
This approach focuses on checking how stable the clusters are across different subsets of the
data:
Procedure:
1. Randomly split your dataset into training and testing sets (like standard cross-
validation).
2. Run k-means on the training set to generate cluster centroids.
3. For the testing set, assign the points to the nearest centroids obtained from the
training set.
4. Calculate the performance metric based on how well the testing points match
the clusters formed from the training set.
5. Repeat this process for several iterations and different random splits.
Metric: Common metrics include within-cluster sum of squares (WCSS) or cluster labeling
stability (checking how often the same data points end up in the same clusters).
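The procedure above might be sketched as follows; the helper name `held_out_wcss` and the choice of held-out WCSS as the performance metric are illustrative, not a standard API:

```python
# Stability-based cross-validation sketch for k-means:
# fit centroids on a training split, score the held-out points, repeat.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)  # toy data

def held_out_wcss(X, k, n_splits=5, seed=0):
    """Average WCSS of held-out points against training-set centroids."""
    scores = []
    for i in range(n_splits):
        X_tr, X_te = train_test_split(X, test_size=0.5, random_state=seed + i)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr)
        # Squared distance from each test point to its nearest training centroid.
        d2 = ((X_te[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
        scores.append(d2.min(axis=1).mean())
    return float(np.mean(scores))

# Average held-out score per candidate k; look for where it stops improving.
scores = {k: held_out_wcss(X, k) for k in range(2, 7)}
```

The k to select is the one after which further increases only slightly reduce the held-out score, mirroring the elbow logic but measured on unseen points.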
3. Prediction Strength (PS)
This method measures the stability of clusters by repeatedly subsampling the data
and clustering subsets of it. A higher PS indicates more stable clusters.
1. Split the dataset into two halves: a training set and a test set.
2. Apply k-means on the training set to obtain clusters and their centroids.
3. Assign the test set points to the nearest centroids obtained from the training set.
4. Re-cluster the test set using the k-means algorithm (independently from the training
clusters).
5. Compare the clusters between the training and test sets by checking how many points
that belong to the same cluster in the training set are still in the same cluster in the test
set.
6. Calculate the Prediction Strength (PS): The score is the proportion of points that remain consistently clustered across the training and test sets. Formally, for each cluster of the test-set clustering, compute the fraction of point pairs in that cluster that are also assigned to the same cluster by the training-set centroids; PS(k) is the minimum of these fractions over all test clusters.
o If all points that were in the same cluster in the test set are also assigned to the same training-set cluster, the PS will be high (close to 1).
o If points are frequently split across different clusters, the PS will be low.
Key Insights:
PS > 0.8: A high prediction strength (typically above 0.8) indicates that the clusters are
stable, and the clustering solution is reliable.
PS < 0.5: A low prediction strength suggests that the clusters are unstable or may not
generalize well to new data, meaning that the chosen k might not be appropriate.
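A hedged sketch of the six steps above (the helper name `prediction_strength` is illustrative; the pairwise-agreement score follows Tibshirani and Walther's formulation of PS):

```python
# Prediction strength sketch: cluster two halves of the data separately,
# then check how consistently test-set clusters survive under training centroids.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def prediction_strength(X, k, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    X_tr, X_te = X[idx[:half]], X[idx[half:]]            # step 1: split in half
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr)  # step 2
    km_te = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_te)  # step 4
    tr_labels = km_tr.predict(X_te)   # step 3: test points under training centroids
    te_labels = km_te.labels_
    strengths = []
    for c in range(k):                # step 5: compare pairwise co-membership
        members = np.where(te_labels == c)[0]
        n = len(members)
        if n < 2:                     # degenerate cluster: no pairs to check
            strengths.append(0.0)
            continue
        same = 0
        for i in range(n):
            for j in range(i + 1, n):
                if tr_labels[members[i]] == tr_labels[members[j]]:
                    same += 1
        strengths.append(same / (n * (n - 1) / 2))
    return min(strengths)             # step 6: PS is the worst-case agreement

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)  # toy data
ps3 = prediction_strength(X, 3)
```

For well-separated data like this, PS at the true k should land near 1, while too-large k values tend to split clusters and drive PS down.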
4. Silhouette score
This evaluation measure assigns each point a score between -1 and 1, with 1 being the best and -1 the worst. Values close to zero indicate that data points lie on or near the boundary between overlapping clusters.
1 indicates that the data points are very well clustered (the points are far from
neighboring clusters and well within their own).
0 indicates that the points are on or very close to the boundary of clusters (the points are
neither well clustered nor badly clustered).
-1 indicates that the points may be assigned to the wrong clusters (the points are closer to
neighboring clusters than to their own).
The Silhouette score s(i) for each point i is calculated as:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where:
a(i) is the mean intra-cluster distance (the average distance between point i and all other points in the same cluster).
b(i) is the mean nearest-cluster distance (the average distance between point i and the points in the nearest cluster that is not its own).
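A tiny worked example of the silhouette formula, computed by hand with numpy on an illustrative 1-D configuration (no clustering library needed):

```python
# Hand computation of s(i) for one point, following the formula above.
import numpy as np

# Two small 1-D clusters, chosen so the arithmetic is easy to check.
cluster_a = np.array([0.0, 1.0, 2.0])   # the point of interest lives here
cluster_b = np.array([8.0, 9.0, 10.0])  # the nearest other cluster

p = 1.0
a = np.mean([abs(p - x) for x in cluster_a if x != p])  # mean intra-cluster distance
b = np.mean([abs(p - x) for x in cluster_b])            # mean nearest-cluster distance
s = (b - a) / max(a, b)                                 # silhouette of point p
# Here a = 1.0 and b = 8.0, so s = (8 - 1) / 8 = 0.875: well clustered.
```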
1. Fit the K-Means model: Perform clustering on your dataset using K-Means for a chosen
number of clusters k.
2. Calculate Silhouette score: For each data point, compute the intra-cluster and nearest-
cluster distances, and then calculate the Silhouette score.
3. Evaluate different k values: Use the average Silhouette score across all points in the
dataset to assess the quality of the clustering for different k values. Typically, the k with
the highest average Silhouette score indicates the optimal number of clusters.
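The three steps above can be sketched with scikit-learn's built-in `silhouette_score` (the blob dataset and the range of candidate k values are illustrative):

```python
# Choosing k by the average silhouette score across candidate values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)  # toy data

sil = {}
for k in range(2, 8):
    # Step 1: fit K-Means for this k; step 2: score the resulting labels.
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    sil[k] = silhouette_score(X, labels)  # mean s(i) over all points

# Step 3: pick the k with the highest average silhouette score.
best_k = max(sil, key=sil.get)
```

Note that silhouette requires at least two clusters, which is why the candidate range starts at k = 2.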