Determining Clusters
There are several ways to determine the number of clusters in the k-means
algorithm, including:
1. Elbow method
This well-known method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters (k). The WCSS is the sum of the squared distances between each point and its cluster's centroid. The plot typically shows an elbow shape: the WCSS drops rapidly at first and then levels off. The k value at the bend is taken as the optimal number of clusters.
The goal of the elbow method is to find the value of k that balances minimizing within-cluster variance against overfitting with too many clusters. Here's how it works:
Steps:
1. Run k-means clustering on your dataset for a range of values of k (e.g., 1 to 10).
2. Calculate the Within-Cluster Sum of Squared Errors (WCSS) for each value of k. WCSS is the sum of the squared distances between each point and the centroid of its cluster.
3. Plot WCSS against k and locate the point where the curve bends.
Interpretation:
The "elbow" point is where adding additional clusters provides diminishing returns in
terms of improved clustering (reduced WCSS).
The elbow might not always be very sharp, so interpretation may require judgment. In
such cases, it’s useful to complement it with other methods like the Silhouette Score.
Example:
If you're clustering customer data to segment them into different groups, the elbow method can
help determine how many meaningful groups (clusters) there are without overfitting.
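As a minimal sketch of these steps (assuming scikit-learn is available; the synthetic blob data is purely illustrative):

```python
# Elbow method sketch: compute WCSS (KMeans.inertia_) for a range of k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic data with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# WCSS for k = 1..10; the "elbow" is where the drop in WCSS flattens out.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is exactly the WCSS

for k, w in zip(range(1, 11), wcss):
    print(f"k={k}: WCSS={w:.1f}")
```

On data like this, the WCSS falls steeply up to the true number of groups and flattens afterwards; the bend would then be read off a plot of these values.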
2. Cross-validation
This process involves partitioning the data into multiple parts, using each part as a test
set, and calculating the objective function value for each test set. The average of
these values is calculated for each number of clusters, and the number of clusters is
selected where increasing the number of clusters only slightly reduces the objective
function.
In general, k-means clustering doesn't naturally lend itself to the standard cross-validation
approach, which is widely used in supervised learning. This is because k-means is an
unsupervised algorithm, and it doesn't have a clear objective function to measure "accuracy" in
the same way as supervised algorithms. However, there are ways to adapt the idea of cross-
validation for k-means.
Stability-Based Cross-Validation
This approach focuses on checking how stable the clusters are across different subsets of the
data:
Procedure:
1. Randomly split your dataset into training and testing sets (like standard cross-
validation).
2. Run k-means on the training set to generate cluster centroids.
3. For the testing set, assign the points to the nearest centroids obtained from the
training set.
4. Calculate the performance metric based on how well the testing points match
the clusters formed from the training set.
5. Repeat this process for several iterations and different random splits.
Metric: Common metrics include within-cluster sum of squares (WCSS) or cluster labeling
stability (checking how often the same data points end up in the same clusters).
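The procedure above might be sketched as follows; the helper name `held_out_wcss` and the choice of held-out WCSS as the performance metric are illustrative, not a standard API:

```python
# Stability-based cross-validation sketch for k-means:
# fit centroids on a training split, score the held-out points, repeat.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)  # toy data

def held_out_wcss(X, k, n_splits=5, seed=0):
    """Average WCSS of held-out points against training-set centroids."""
    scores = []
    for i in range(n_splits):
        X_tr, X_te = train_test_split(X, test_size=0.5, random_state=seed + i)
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr)
        # Squared distance from each test point to its nearest training centroid.
        d2 = ((X_te[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
        scores.append(d2.min(axis=1).mean())
    return float(np.mean(scores))

# Average held-out score per candidate k; look for where it stops improving.
scores = {k: held_out_wcss(X, k) for k in range(2, 7)}
```

The k to select is the one after which further increases only slightly reduce the held-out score, mirroring the elbow logic but measured on unseen points.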
3. Prediction Strength (PS)
This method measures the stability of clusters by repeatedly subsampling the data
and clustering subsets of it. A higher PS indicates more stable clusters.
1. Split the dataset into two halves: a training set and a test set.
2. Apply k-means on the training set to obtain clusters and their centroids.
3. Assign the test set points to the nearest centroids obtained from the training set.
4. Re-cluster the test set using the k-means algorithm (independently from the training
clusters).
5. Compare the clusters between the training and test sets by checking how many points
that belong to the same cluster in the training set are still in the same cluster in the test
set.
6. Calculate the Prediction Strength (PS): The score is the proportion of points that remain consistently clustered across the training and test sets. Formally, for each cluster of the test-set clustering, compute the fraction of point pairs in that cluster that are also assigned to the same cluster by the training-set centroids; PS(k) is the minimum of these fractions over all test clusters.
o If all points that were in the same cluster in the test set are also assigned to the same training-set cluster, the PS will be high (close to 1).
o If points are frequently split across different clusters, the PS will be low.
Key Insights:
PS > 0.8: A high prediction strength (typically above 0.8) indicates that the clusters are
stable, and the clustering solution is reliable.
PS < 0.5: A low prediction strength suggests that the clusters are unstable or may not
generalize well to new data, meaning that the chosen k might not be appropriate.
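A hedged sketch of the six steps above (the helper name `prediction_strength` is illustrative; the pairwise-agreement score follows Tibshirani and Walther's formulation of PS):

```python
# Prediction strength sketch: cluster two halves of the data separately,
# then check how consistently test-set clusters survive under training centroids.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def prediction_strength(X, k, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    X_tr, X_te = X[idx[:half]], X[idx[half:]]            # step 1: split in half
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_tr)  # step 2
    km_te = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_te)  # step 4
    tr_labels = km_tr.predict(X_te)   # step 3: test points under training centroids
    te_labels = km_te.labels_
    strengths = []
    for c in range(k):                # step 5: compare pairwise co-membership
        members = np.where(te_labels == c)[0]
        n = len(members)
        if n < 2:                     # degenerate cluster: no pairs to check
            strengths.append(0.0)
            continue
        same = 0
        for i in range(n):
            for j in range(i + 1, n):
                if tr_labels[members[i]] == tr_labels[members[j]]:
                    same += 1
        strengths.append(same / (n * (n - 1) / 2))
    return min(strengths)             # step 6: PS is the worst-case agreement

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)  # toy data
ps3 = prediction_strength(X, 3)
```

For well-separated data like this, PS at the true k should land near 1, while too-large k values tend to split clusters and drive PS down.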
4. Silhouette score
This evaluation measure assigns each point a score between -1 and 1, with 1 being the best and -1 the worst. Values close to zero indicate that data points lie on or near the boundary between overlapping clusters.
1 indicates that the data points are very well clustered (the points are far from
neighboring clusters and well within their own).
0 indicates that the points are on or very close to the boundary of clusters (the points are
neither well clustered nor badly clustered).
-1 indicates that the points may be assigned to the wrong clusters (the points are closer to
neighboring clusters than to their own).
The Silhouette score s(i) for each point i is calculated as:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where:
a(i) is the mean intra-cluster distance (the average distance between point i and all other points in the same cluster).
b(i) is the mean nearest-cluster distance (the average distance between point i and the points in the nearest cluster that is not its own).
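A tiny worked example of the silhouette formula, computed by hand with numpy on an illustrative 1-D configuration (no clustering library needed):

```python
# Hand computation of s(i) for one point, following the formula above.
import numpy as np

# Two small 1-D clusters, chosen so the arithmetic is easy to check.
cluster_a = np.array([0.0, 1.0, 2.0])   # the point of interest lives here
cluster_b = np.array([8.0, 9.0, 10.0])  # the nearest other cluster

p = 1.0
a = np.mean([abs(p - x) for x in cluster_a if x != p])  # mean intra-cluster distance
b = np.mean([abs(p - x) for x in cluster_b])            # mean nearest-cluster distance
s = (b - a) / max(a, b)                                 # silhouette of point p
# Here a = 1.0 and b = 8.0, so s = (8 - 1) / 8 = 0.875: well clustered.
```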
1. Fit the K-Means model: Perform clustering on your dataset using K-Means for a chosen
number of clusters k.
2. Calculate Silhouette score: For each data point, compute the intra-cluster and nearest-
cluster distances, and then calculate the Silhouette score.
3. Evaluate different k values: Use the average Silhouette score across all points in the
dataset to assess the quality of the clustering for different k values. Typically, the k with
the highest average Silhouette score indicates the optimal number of clusters.
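The three steps above can be sketched with scikit-learn's built-in `silhouette_score` (the blob dataset and the range of candidate k values are illustrative):

```python
# Choosing k by the average silhouette score across candidate values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)  # toy data

sil = {}
for k in range(2, 8):
    # Step 1: fit K-Means for this k; step 2: score the resulting labels.
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    sil[k] = silhouette_score(X, labels)  # mean s(i) over all points

# Step 3: pick the k with the highest average silhouette score.
best_k = max(sil, key=sil.get)
```

Note that silhouette requires at least two clusters, which is why the candidate range starts at k = 2.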