
K-Means Clustering

The document discusses a proposed method for K-means clustering that does not require specifying the number of clusters (K) in advance. It determines an optimal threshold value for clustering the data dynamically based on the dataset. When the elbow method is unclear about the best K, it uses silhouette scoring to help identify the optimal number of clusters. The method is tested on the iris dataset, where the elbow plot is ambiguous; although K=2 has the highest silhouette score, analysis of the silhouette plots identifies K=3 as optimal for clustering the data.

Uploaded by

Abeer Pareek

A Comprehensive Survey on K-Means Clustering

ABSTRACT

Data mining finds patterns in large datasets and helps decision-makers arrive at real-world solutions. Clustering is one of the core techniques data mining uses to find such patterns, and the K-means method is widely used for partitioning large volumes of data. K-means clustering, however, assumes a fixed number of clusters into which the data set is split. This is its biggest drawback: if the number of clusters is chosen too low, unrelated items are likely to be grouped together; if too many clusters are produced, items with similar properties are likely to be allocated to different clusters. This work presents a K-Means clustering method that addresses the issue. The approach adapts dynamically to the data being clustered: it computes a centroid threshold from the data to estimate the number of clusters. When the elbow method does not clearly indicate the optimal number of clusters, the silhouette score is used to suggest the number of clusters appropriate for the dataset.

INTRODUCTION
Data mining is a fast-growing branch of computer science that borrows from many other fields. It allows patterns in a large database to be detected automatically. Over the past two decades data mining has grown in significance, and there is now tremendous rivalry in the marketplace for information efficiency. As a consequence, data plays a central role in helping companies and governments make essential decisions and supplies an abundance of information for use across industries. In the real world there is a mountain of data from which important insights can be extracted. Because this information is so valuable, it must be recovered in its entirety, structure included, within a specified time limit. Data mining provides a way to filter out background noise: when relevant data from an enormous dataset is needed, it can be retrieved and displayed appropriately. It is a powerful tool for monitoring market conditions, tracking cutting-edge technological developments, managing production in response to customer demand, and much more. In a word, data mining is the process of acquiring knowledge from enormous databases, from which the nature or activity of a pattern can be deduced.

Examining data clusters is a crucial component of both knowledge discovery and data mining. Forming clusters within a large dataset is a way of organizing data based on similarities between records. Clustering can be performed in supervised, semi-supervised, and unsupervised settings, and clustering algorithms are important meta-learning tools for analysing the data supplied by modern applications. The goal of clustering is to arrange data into groups with similar qualities and behaviours. Several clustering strategies for classifying data have been proposed, and the majority of them assume that the number of clusters in a massive data collection is fixed. When the assumed number of clusters is too low, this assumption increases the risk of wrongly placing unrelated objects in the same cluster; when it is too high, data with similar features are more likely to be split across different clusters. In addition, it is difficult to predict how many clusters a real dataset will contain. Here we construct a variant of the K-Means clustering approach that is capable of adapting to the data it receives. The approach determines a threshold value from the data set and then groups the data without fixing the number of clusters in advance. The threshold value is essential to the proposed technique: it determines whether a data point should join an existing cluster or start a new one.
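The paper gives no pseudocode for the threshold rule, so the following is a minimal sketch of one plausible reading, assuming a single distance threshold checked against running centroids; the function name `threshold_cluster` and the threshold value are illustrative, not from the source:

```python
import numpy as np

def threshold_cluster(points, threshold):
    """Group points dynamically: a point joins the nearest existing
    cluster if its distance to that cluster's centroid is within the
    threshold; otherwise it starts a new cluster (K is not fixed)."""
    centroids = []  # running centroid of each cluster
    members = []    # list of member-point lists, one per cluster
    for p in points:
        if centroids:
            dists = [np.linalg.norm(p - c) for c in centroids]
            j = int(np.argmin(dists))
            if dists[j] <= threshold:
                members[j].append(p)
                # update the centroid as the mean of its members
                centroids[j] = np.mean(members[j], axis=0)
                continue
        # no cluster is close enough: open a new one around this point
        centroids.append(np.asarray(p, dtype=float))
        members.append([p])
    return centroids, members

# Two well-separated blobs and a generous threshold yield two clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
cents, groups = threshold_cluster(pts, threshold=1.0)
```

Note that this sketch is order-dependent, which is one reason the choice of threshold matters so much in such schemes.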

THE K-MEANS CLUSTERING ALGORITHM


First we provide an overview of the K-Means technique; the recommended methodology is described in further detail afterwards. K-Means clustering is frequently used because it can be applied to a range of data types, including medical images, text, and more. The efficiency of K-Means depends on the position of the initial centroids: an incorrect selection of centroids leads to unstable clustering results and an increase in iterations, so both time and space complexity grow. Due to its simplicity as a clustering technique, the K-Means approach is widely used in data mining. It is a form of unsupervised learning applied to the traditional clustering problem. In partition-based clustering, a large dataset is first divided into a number of objects, which are then partitioned into groups containing related data points. As the process converges, the K-means algorithm separates the data into K distinct groups, producing independent clusters. The K-Means strategy consists of two components. The algorithm begins by selecting a value of K, the total number of clusters. In the second stage, each data point is assigned to its nearest centre by calculating the Euclidean distance between the point and each of the K centroids. Every data point is thereby placed into a class, and the process repeats until it converges. The steps of the K-Means clustering method are given below.
Step-1: Choose the cluster number, K.

Step-2: Initialize the centre of each of the K clusters. There are several ways to do this, but randomization is the most often used strategy: a random number generator assigns a starting point to each cluster centre.

Step-3: Locate the nearest cluster for each piece of data. The distance between two objects determines the degree of their closeness; likewise, the distance between a record and a cluster's centre is used to determine the record's proximity to that cluster, and the record is allocated to the cluster whose centre is nearest. The figure below depicts the Euclidean distance formula, which is used to compute the distance between each data point and the cluster centres.

Fig 1: Euclidean Distance Formula

    D(i, j) = sqrt( Σk ( xik − cjk )² )

Where:
D(i, j) = the distance between data point i and the centre of cluster j
xik = the value of data point i on attribute k
cjk = the value of the centre of cluster j on attribute k

Step-4: Update each cluster centre from the newly assigned data. The cluster centre is the mean of the data points the cluster contains; the cluster median, or another measure of centre, may also be used in place of the mean.

Step-5: Reassign all objects with respect to the updated cluster centres. The clustering is deemed complete when the cluster centres are no longer subject to modification; otherwise, return to Step-3 and repeat until the centres are unaffected by further updates.
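The five steps above can be sketched in NumPy roughly as follows; this is a minimal illustration, and the function and variable names are our own:

```python
import numpy as np

def k_means(data, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: initialise centroids at k randomly chosen data points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Step-3: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its members
        new_centroids = np.array(
            [data[labels == j].mean(axis=0) for j in range(k)]
        )
        # Step-5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Step-1: choose K; here two small, well-separated blobs suggest K = 2.
data = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
labels, cents = k_means(data, k=2)
```

This sketch omits the empty-cluster case that production implementations must handle; it is meant only to make the five steps concrete.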

PROPOSED METHOD
With the recommended approach there is no need to specify a fixed value for K, the number of clusters, since all data in a large dataset can be grouped dynamically. With standard K-Means, a value of K must be chosen first and clustering then proceeds based on that K. Making that choice up front is difficult, and a poor choice causes a decline in the quality of the K-means clustering results. Using the elbow method, the optimal number of clusters can be decided. However, there are several circumstances in which the elbow curve is inadequate for determining the right K. The silhouette score is then used to estimate how many clusters are required for the data set.

The elbow technique is a graphical method for finding the optimal K in K-means clustering. It relies on the Within-Cluster Sum of Squares (WCSS): the sum of the squared distances between each point in a cluster and its cluster centroid. In the elbow plot, WCSS (y-axis) is plotted against K (x-axis) to show how it varies with K. If the graph has the shape of an elbow, the K value at which the bend occurs is selected; this is called the elbow point. Beyond the elbow, increasing K has little further effect on the WCSS.

Fig 2: Expected Elbow Curve

However, for the majority of real-world datasets it is not clear how to identify the right K using the elbow method; in the elbow curve below, we cannot be sure which point to choose.

Fig 3: Real-world Elbow Curve

Silhouette Method
When the elbow approach fails to locate an elbow point, the silhouette score is a highly effective tool for determining K. The silhouette score ranges from -1 to +1:

• At +1, there is no score variance inside a cluster, and each cluster is immediately distinguishable from the others.

• At 0, clusters begin to overlap.

• At -1, points have been assigned to the wrong cluster.

Fig 4: Silhouette Score for 2 clusters

Silhouette Score = (b − a) / max(a, b)

where a is the average distance between a point and all other points in its own cluster, and b is the average distance between that point and the points in the nearest neighbouring cluster.
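The per-point formula can be illustrated directly; this is a sketch, and `silhouette_point` is our own helper, not a library function:

```python
import numpy as np

def silhouette_point(i, data, labels):
    """Silhouette of point i: (b - a) / max(a, b), where a is the mean
    distance to the other points in its own cluster and b is the mean
    distance to the points of the nearest other cluster."""
    own = labels[i]
    d = np.linalg.norm(data - data[i], axis=1)
    same = d[labels == own]
    a = same.sum() / (len(same) - 1)  # exclude the point itself
    b = min(
        d[labels == other].mean()
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

# Two tight, well-separated clusters give a score close to +1.
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
s0 = silhouette_point(0, data, labels)
```

In practice one would use `sklearn.metrics.silhouette_score`, which averages this quantity over all points.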

Experimental Setup
We applied k-means clustering to the iris dataset to obtain clusters of the different species. In the following graph the elbow point K=4 is selected, but K=3 is also a valid alternative, so it is unclear where the elbow point lies. Using the silhouette plot, we can now determine the value of K.

Fig 5: The Elbow plot finds the elbow point at K=4


Fig 6: Silhouette Plot for K = 2 to 5

For K = 2 the silhouette score is the highest (0.68); however, the score alone is insufficient to determine the optimal K. To determine the right value of K, the following must be confirmed using the silhouette plots:

For a given K, each cluster should, on average, have a silhouette score higher than that of the whole dataset (represented by the red dotted line); the silhouette score is shown along the x-axis of the plot. As the K = 4 and K = 5 clusterings do not meet this requirement, they are eliminated. In addition, cluster sizes should not differ too radically; the width of each cluster in the plot reflects its number of observations. At K = 2 the blue cluster is almost twice as wide as the green cluster, while at K = 3 this blue cluster divides into two smaller clusters of similar size.

According to the Silhouette plot approach, the optimal value for K is three. K = 3 should be
used for the final clustering of the Iris data set.
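The experiment above can be reproduced approximately with scikit-learn; this is a sketch under the assumption of default KMeans settings, so exact scores may differ slightly from the figures:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Average silhouette score for K = 2..5, as in the silhouette plots.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
```

The highest average score lands at K = 2 (about 0.68), which is exactly why the per-cluster plot analysis above is needed before settling on K = 3.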

Conclusion
With the use of the Elbow curve or Silhouette plots, the optimal value of K for K-means clustering can be determined. In several circumstances involving real-world datasets, the optimal K cannot be determined from the Elbow curve alone; in such a case, a Silhouette plot can help discover the optimal number of clusters for the data. In our view, the two approaches should be combined to obtain the optimal value of K for K-means clustering.