Shahapure 2020
Abstract—Clustering is an important phase in data mining. Selecting the number of clusters in a clustering algorithm, e.g. choosing the best value of k in the various k-means algorithms [1], can be difficult. We studied the use of silhouette scores and scatter plots to suggest, and then validate, the number of clusters we specified in running the k-means clustering algorithm on two publicly available data sets. Scikit-learn's [4] silhouette score method, which measures the quality of a clustering, was used to find the mean silhouette coefficient of all the samples for different numbers of clusters. The highest silhouette score indicates the optimal number of clusters. We present several instances of utilizing the silhouette score to determine the best value of k for those data sets.

1. Introduction

Determining the optimal number of clusters for a data set is an important problem in certain clustering algorithms, especially the well-known k-means and similar algorithms [1]. There is no one-size-fits-all method to determine the value of k; the optimal value for a given data set may well depend on the methods used for measuring similarities and the initial seed values used for partitioning. One solution is to inspect the dendrogram resulting from hierarchical clustering, but this remains a somewhat subjective and expensive approach, since hierarchical clustering is intrinsically slower than k-means. Hierarchical clustering could still be applied to several small subsets of the data to find a reasonable estimate of k. We choose the more direct method of analyzing the silhouette scores [5], which measure the quality of clusters. A high average silhouette coefficient value indicates good clustering and helps in deciding the optimal value of the number of clusters k [3]. We present examples of this approach, along with 2D and 3D scatter plots to support, if not validate, the results.

We propose to investigate whether the silhouette score can be used for validation of the number of clusters obtained by running the k-means clustering algorithm on each of several data sets. Dimensionality reduction is done to reduce the number of features and generate a 2D or 3D scatter plot, which helps in visually analyzing the number of clusters and validating the result. The following steps are carried out for the analysis:

1) Obtain the data in the form of tuples.
2) Run sklearn's k-means algorithm on the data set.
3) Obtain cluster labels by fitting the data.
4) Calculate the mean silhouette coefficient by passing the tuples and cluster labels.
5) Repeat this for different numbers of clusters (i.e. values of k).

2. Quality measurement

Scikit-learn's silhouette score function computes the mean silhouette coefficient of all samples. The silhouette coefficient is calculated by taking into account the mean intra-cluster distance a and the mean nearest-cluster distance b for each data point. The silhouette coefficient for a sample is (b − a)/max(a, b).

• A silhouette score with a value near +1 means the data point is in the correct cluster.
• A silhouette score with a value near 0 means the data point might belong in some other cluster.
• A silhouette score with a value near -1 means the data point is in a wrong cluster.

The analysis of silhouette scores for different data sets is given below.

2.1. Iris data set

This is a classic multi-class classification data set provided by scikit-learn. The data set consists of 3 classes, 4 dimensions or features, and 150 samples. Figure 1 shows the silhouette scores for different numbers of clusters, with k ranging from 2 to 10. It can be observed that the silhouette score is the highest for k = 2. In addition, selecting k = 4 or k = 5 results in silhouette scores that are more or less equally bad. Therefore, k = 2 or k = 3 are the only two reasonable choices for this data set.

Figure 1. Silhouette scores for Iris data set
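
The five analysis steps, applied to the Iris data set, can be sketched roughly as follows. This is a minimal illustration rather than the authors' actual script: the variable names, the `n_init=10` setting, and the fixed `random_state=0` are assumptions added here for reproducibility, and the paper does not state which k-means parameters were used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Step 1: obtain the data as an array of feature tuples (150 samples, 4 features).
X = load_iris().data

scores = {}
for k in range(2, 11):  # Step 5: repeat for k = 2, ..., 10
    # Steps 2-3: run k-means and obtain a cluster label for every sample.
    # n_init and random_state are assumptions, not settings taken from the paper.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Step 4: mean silhouette coefficient over all samples.
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette score is the suggested number of clusters.
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: mean silhouette = {s:.3f}")
print("highest score at k =", best_k)
```

With these settings the sweep should show the same qualitative pattern described for Figure 1, with the highest score at k = 2. The exact values can shift slightly with the initialization.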

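
To make the per-sample coefficient (b − a)/max(a, b) from Section 2 concrete, the sketch below computes it by hand for a toy data set. The function name `silhouette_coefficient`, the four example points, and the use of Euclidean distance are illustrative assumptions. Returning 0 for a sample that is alone in its cluster follows the usual convention, which scikit-learn also adopts.

```python
import math
from statistics import mean

def silhouette_coefficient(i, points, labels):
    """Return (b - a) / max(a, b) for sample i, using Euclidean distance."""
    own = labels[i]
    # Distances from sample i to every other sample, grouped by cluster label.
    by_cluster = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            by_cluster.setdefault(lab, []).append(math.dist(points[i], p))
    if own not in by_cluster:
        return 0.0  # sample is alone in its cluster: coefficient defined as 0
    a = mean(by_cluster[own])  # mean intra-cluster distance
    # Mean distance to the nearest other cluster (needs at least two clusters).
    b = min(mean(d) for lab, d in by_cluster.items() if lab != own)
    return (b - a) / max(a, b)

# Hypothetical toy data: two tight, well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

# Sample 0 grouped with its close neighbour: coefficient near +1.
good = silhouette_coefficient(0, points, [0, 0, 1, 1])
# Sample 0 wrongly grouped with the far-away pair: coefficient near -1.
bad = silhouette_coefficient(0, points, [1, 0, 1, 1])

print(f"correctly assigned: {good:.3f}")
print(f"wrongly assigned:   {bad:.3f}")
```

The two calls illustrate the interpretation given in the bullet list: the same point scores close to +1 when assigned to its nearby neighbour and close to -1 when assigned to the distant cluster.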