K-Means Clustering
ABSTRACT
Data mining finds patterns in large datasets, and those patterns help decision-makers choose
real-world solutions. Data clustering is one of the main techniques data mining uses to discover
such patterns, and the K-Means method is well suited to classifying large volumes of data.
K-Means assumes a fixed number of clusters into which the data set is split. The main drawback
of this assumption is that if the number of clusters is set too low, unrelated items are likely to be
grouped together, while if too many clusters are produced, items with similar properties are
likely to be scattered across different clusters. This work presents a K-Means clustering
procedure that addresses the issue by treating clustering as a dynamic process: the method
computes a threshold from the K-Means centroids to estimate the number of clusters. When the
elbow method does not clearly indicate the optimal number of clusters, the Silhouette score is
used to suggest the number of clusters the dataset needs.
INTRODUCTION
Data mining is one of the most active branches of computer science and draws on many other
fields. It makes it possible to detect patterns in a large database automatically. Over the past
decade or two, data mining has grown in importance, and competition in the marketplace for
information efficiency is now intense. As a consequence, data has played a significant role in
helping companies and governments make essential decisions and has supplied an abundance
of information for use across industries. The real world holds mountains of data from which
important insights may be extracted. Because this information is so valuable, it must be
recovered completely, structure included, within the specified time limit. Data mining provides
a way to remove background noise from data: when relevant records are needed from an
enormous dataset, they can be retrieved and displayed appropriately. It is a powerful tool for
monitoring market conditions, tracking cutting-edge technological developments, managing
production in response to customer demand, and much more. In short, data mining is the
process of acquiring knowledge from enormous databases, from which the nature or activity of
any pattern can be deduced.
Examining data clusters is a crucial component of both knowledge discovery and data mining.
Forming clusters within a large dataset is a way of organizing data based on similarities
between records, and clustering can be carried out in supervised, semi-supervised, and
unsupervised settings. The clustering algorithm is an important meta-learning tool for
analysing the data supplied by modern applications. The goal of clustering is to arrange data
into groups with similar qualities and behaviours. Several clustering strategies for classifying
data have been proposed, and the majority of them assume that the number of clusters in a
massive data collection is fixed. When the chosen number of clusters is too low, this
assumption increases the risk of wrongly placing unrelated objects in the same cluster; when
too many clusters are produced, data with similar features are likely to be split across different
clusters. In addition, it is difficult to predict in advance how many clusters a real dataset will
contain.
Here, we construct a novel form of the K-Means clustering approach that adapts to the data it
receives. The approach determines a threshold value from the data set and then groups the data
without setting a hard limit on the size of each cluster; the data set is ultimately clustered
according to this threshold. The threshold value is therefore essential to the proposed
technique: it decides whether a record should be placed in a new group or merged into an
existing one.
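To illustrate the idea, here is a minimal sketch of one possible threshold rule: a record joins the nearest existing cluster if that cluster's centroid lies within the threshold, and otherwise starts a new cluster. The paper does not spell out the exact threshold computation, so the rule and names below are illustrative assumptions, not the proposed method itself.

```python
import numpy as np

def threshold_cluster(points, threshold):
    """Group points without fixing K in advance: a point joins its nearest
    cluster when that centroid is within `threshold`, else founds a new one.
    (Illustrative rule only; not the paper's exact procedure.)"""
    centroids, members = [], []
    for p in points:
        p = np.asarray(p, dtype=float)
        if centroids:
            dists = [np.linalg.norm(p - c) for c in centroids]
            j = int(np.argmin(dists))
            if dists[j] <= threshold:
                members[j].append(p)
                centroids[j] = np.mean(members[j], axis=0)  # keep centroid current
                continue
        centroids.append(p)  # too far from every centroid: start a new cluster
        members.append([p])
    return centroids, members

pts = np.array([[0., 0.], [0.5, 0.], [10., 10.], [10.5, 10.]])
cents, groups = threshold_cluster(pts, threshold=2.0)  # K emerges as 2
```

Note that the number of clusters is an output here, not an input: the threshold alone controls how many groups form.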
Step-1: Determine the number of clusters, K.
Step-2: Initialize the K cluster centers. There are several ways to do this, but randomization is
the most common strategy: a random starting point is assigned as each cluster's center.
Step-3: Locate the nearest cluster for each data record or object. The distance between two
objects determines their degree of closeness; likewise, the distance between a record and a
cluster's center determines the record's proximity to that cluster, and the record is assigned to
the cluster whose center it is closest to. The Euclidean distance,
d(x, c) = sqrt((x1 - c1)^2 + ... + (xn - cn)^2), is used to compute the distance between each
data point and the cluster centers.
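The distance computation and assignment in Step-3 can be sketched directly in NumPy (the function names here are illustrative):

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def assign_clusters(points, centroids):
    """Step-3: assign each point to the index of its nearest centroid."""
    return np.array([int(np.argmin([euclidean(p, c) for c in centroids]))
                     for p in points])

points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])
print(assign_clusters(points, centroids))  # → [0 0 1]
```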
Step-4: Update each cluster center from the newly assigned data. The cluster's center is the
mean of the records or objects it contains; the median, or another measure of centre, can also
be used in place of the mean.
Step-5: Reassign all objects with respect to the new cluster centers. The clustering is
considered complete when the cluster centers are no longer subject to change; otherwise,
return to Step-3 and repeat until the centers stabilize.
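The steps above fit into a single loop. The following is a plain NumPy sketch of the standard algorithm; the random initialization from the data points, the fixed `seed`, and the unhandled empty-cluster case are simplifications:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick k distinct data points as the initial centers.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step-3: assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: new center = mean of the points assigned to it
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step-5: stop once the centers no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
labels, centers = kmeans(pts, 2)  # two well-separated blobs
```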
PROPOSED METHOD
With our recommended approach there is no need to specify a fixed value for K, the number of
clusters, since all data in a large dataset may be grouped dynamically. When using standard
K-Means, you must first choose a value for K and then begin clustering based on that number.
Making this choice up front is difficult, however, and a poor choice degrades the quality of the
K-Means clustering results. The Elbow method can be used to decide the optimal number of
clusters, but there are several circumstances in which the Elbow curve is inadequate for
determining the right K. In those cases the Silhouette score is used to estimate how many
clusters the data set requires.
Elbow Method
The elbow technique is a graphical way of finding the optimal K for K-Means clustering. The
Within-Cluster Sum of Squares (WCSS) is computed by summing the squared distances
between each point in a cluster and that cluster's centroid. In the elbow plot, WCSS (y-axis) is
plotted against K (x-axis) to show how it varies with K. If the curve has the shape of an elbow,
the K value at which the bend occurs is selected; this is called the elbow point. Beyond the
elbow, increasing K yields little further reduction in WCSS. For the majority of real-world
datasets, however, the right K is not obvious from the elbow method alone, and it can be
unclear which point on the elbow curve below should be chosen.
Silhouette Method
When the Elbow approach fails to locate a clear elbow point, the Silhouette score is a highly
effective tool for choosing K. The Silhouette score ranges from -1 to +1; a score near +1 means
that points lie tightly within their own cluster and each cluster is clearly distinguishable from
the others. For a single point the score is

s = (b - a) / max(a, b)

where a is the average distance from the point to the other points inside its own cluster, and b
is the average distance from the point to the points in the nearest neighbouring cluster.
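Computing the mean Silhouette score over a range of K values is straightforward with scikit-learn's `silhouette_score`, which averages (b - a) / max(a, b) over all points (assuming scikit-learn is available; the variable names are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
scores = {}
for k in range(2, 7):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean of (b - a) / max(a, b)
best_k = max(scores, key=scores.get)  # K with the highest mean silhouette
```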
Experimental Setup
We implemented K-Means clustering on the Iris dataset to obtain clusters of the different
species. In the resulting elbow plot, K = 4 looks like the elbow point, but K = 3 is also a valid
alternative, so the elbow point is unclear. Using the Silhouette plot, we can now determine the
value of K.
For K = 2 the Silhouette score is the highest (0.68), but this alone is insufficient to determine
the optimal K. To determine the right value of K, the following must be confirmed using
Silhouette plots:
For the chosen K, each cluster should, on average, have a Silhouette score higher than that of
the whole dataset (represented by a red dotted line; the Silhouette score is shown along the
x-axis of the plot). As the K = 4 and K = 5 clusterings do not meet this requirement, they are
eliminated.
The widths of the cluster silhouettes should not fluctuate too radically, since the width of each
cluster's silhouette reflects the number of observations it contains. At K = 2, the blue cluster is
almost twice as wide as the green cluster; at K = 3, this blue cluster splits into two smaller
clusters of similar size.
According to the Silhouette plot approach, the optimal value for K is three. K = 3 should be
used for the final clustering of the Iris data set.
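The two checks described above, per-cluster score against the overall average and the relative cluster widths, can be read off numerically with scikit-learn's `silhouette_samples`. This sketch on the Iris data assumes scikit-learn is available:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples, silhouette_score

X = load_iris().data
results = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    overall = silhouette_score(X, labels)  # the "red dotted line" level
    per_point = silhouette_samples(X, labels)
    per_cluster = [per_point[labels == j].mean() for j in range(k)]
    sizes = [int((labels == j).sum()) for j in range(k)]  # plot widths
    results[k] = (overall, per_cluster, sizes)
```

Comparing `per_cluster` against `overall` and inspecting `sizes` for each K reproduces, in numbers, what the Silhouette plots show visually.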
Conclusion
The optimal value of K for K-Means clustering may be determined with the help of the Elbow
curve or Silhouette plots. In several circumstances involving real-world datasets, the optimal
K cannot be determined from the Elbow curve alone; in such cases, a Silhouette plot can help
discover the optimal number of clusters for the data. To obtain the best value of K, the two
approaches are best combined.
References
• https://towardsdatascience.com/elbow-method-is-not-sufficient-to-find-best-k-in-k-means-clustering-fc820da0631d
• https://www.researchgate.net/publication/346813072_Analysis_of_Data_Mining_Using_K-Means_Clustering_Algorithm_for_Product_Grouping
• https://dl.acm.org/doi/10.1145/312129.312248
• https://www.ripublication.com/irph/ijict_spl/ijictv4n17spl_15.pdf
• https://www.researchgate.net/publication/354547481_Comprehensive_Review_of_K-Means_Clustering_Algorithms