
Article

Research on K-Value Selection Method of K-Means Clustering Algorithm
Chunhui Yuan and Haitao Yang *
Graduate institute, Space Engineering University, Beijing 101400, China; [email protected]
* Correspondence: [email protected]

Received: 21 May 2019; Accepted: 15 June 2019; Published: 18 June 2019 

Abstract: Among the many clustering algorithms, the K-means clustering algorithm is widely used because of its simplicity and fast convergence. However, the K-value of the clustering needs to be given in advance, and the choice of K-value directly affects the convergence result. To address this problem, we analyze four K-value selection algorithms, namely the Elbow Method, Gap Statistic, Silhouette Coefficient, and Canopy; give pseudo code for each algorithm; and use the standard Iris data set for experimental verification. Finally, the verification results are evaluated, the advantages and disadvantages of the four algorithms for K-value selection are given, and the clustering range of the data set is pointed out.

Keywords: Clustering; K-means; K-value; Convergence

1. Introduction
Cluster analysis is one of the most important research directions in the field of data mining. “Things are clustered and people are grouped”: compared with other data mining methods, clustering can classify data without prior knowledge. Clustering algorithms can be divided into multiple types, for example partition-based, density-based, and model-based [1]. A clustering algorithm is a process of dividing physical or abstract objects into collections of similar objects. A cluster is a collection of data objects; objects in the same cluster are similar to each other and different from objects in other clusters [2]. For a clustering task, we want the objects within a cluster to be as close together as possible; the initial cluster centers are usually chosen from the samples (data points) themselves. However, the randomness of this center-point selection tends to prevent the cluster aggregation from converging. Cluster analysis is based on the similarity within the clustered data sets and is a form of unsupervised learning.
Among the partition-based clustering algorithms, the K-means algorithm has many advantages, such as a simple mathematical idea, fast convergence, and easy implementation [3]. Its application fields are therefore very broad, including the classification of different types of documents, music, and movies, classification based on user purchase behavior, the construction of recommendation systems based on user interests, and so on. With the increase in the amount of data, the traditional K-means algorithm has difficulty meeting actual needs when analyzing massive data sets. In view of the shortcomings of the traditional K-means algorithm, many scholars have proposed improvements based on K-means. For instance, in Reference [4], a simple and efficient implementation of the K-means clustering algorithm is presented to address the problem of the cluster center points not being well determined; it builds a kd-tree data structure for the data points. The algorithm is easy to implement and can, to some extent, avoid falling into a local optimum. To address the problem that traditional clustering algorithms have no way to take advantage of background knowledge (about the domain or the data set), an improved K-means algorithm based on multiple information domains is presented in Reference [5]; the authors apply this method to six data sets and to the real-world problem of automatically detecting road lanes from Global Positioning System (GPS) data. Experiments show that the improved algorithm selects K values more accurately when solving practical problems.
Two algorithms that extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values are reported in Reference [6]; by combining a mode-based clustering approach with a suitable dissimilarity measure, they address the complex and noisy data found in the real world. A Principal Component Analysis (PCA) method is implemented in Reference [7]; the authors use an artificial neural network (ANN) algorithm together with K-nearest neighbor (KNN) and support vector machine (SVM) classification algorithms to extract and analyze features, which effectively realizes the classification of malware.
The clustering algorithm is also applied to the early detection of pulmonary nodules [8]; they propose
a novel optimized method of feature selection for both cluster and classifier components. In the field
of medical imaging, clustering and classification based on selection features effectively improve the
classification performance of Computer-aided detection (CAD) systems. With the advent of deep
learning methods in pattern recognition applications, some scholars have applied them to cluster
analysis. For example, in Reference [9], by studying the performance of a CAD system for lung nodules in computed tomography (CT) as a function of slice thickness, a method of comparing the performance of CAD systems trained on nonuniform data was proposed.
In summary, based on the traditional K-means clustering algorithm, this paper discusses algorithms for quickly determining the K value. The remainder of this paper is organized as follows: Section 2
provides a brief description of the K-means clustering algorithm. Section 3 presents the four K-value
selection algorithms—Elbow Method, Gap Statistic, Silhouette Coefficient and Canopy—and elucidates
the various methods with sample data along with their experimental results. Finally, a discussion and
conclusions are given in Section 4.

2. The K-means Algorithm


The K-means algorithm is a simple iterative clustering algorithm. It uses distance as the similarity metric: given the number of classes K for the data set, it repeatedly computes distance means to update the centroids, and each class is described by its centroid. For a given data set X containing n multidimensional data points and a number of categories K to divide them into, the Euclidean distance is selected as the similarity index, and the clustering target is to minimize the sum of the squared distances of the points to their class centers; that is, to minimize [10]

d = Σ_{k=1}^{K} Σ_{i=1}^{n} ||x_i − u_k||^2    (1)

where K represents the number of cluster centers, u_k represents the kth center, and x_i represents the ith point in the data set. The solution for the centroid u_k is as follows:

∂d/∂u_k = ∂/∂u_k [ Σ_{k=1}^{K} Σ_{i=1}^{n} (x_i − u_k)^2 ] = Σ_{i=1}^{n} −2(x_i − u_k)    (2)

Setting Equation (2) to zero gives u_k = (1/n) Σ_{i=1}^{n} x_i.
The central idea of the algorithm is to randomly select K sample points from the sample set as the initial cluster centers, assign each sample point to the cluster represented by its nearest center, and then take the mean of all sample points in each cluster as the new center of that cluster. These steps are repeated until the cluster centers no longer change or the set number of iterations is reached. The results change with the choice of the initial center points, which makes the results unstable. The determination of the center points depends on the choice of the K value, which is the focus of the algorithm; it directly affects the clustering results, such as whether a local or global optimum is reached [11].
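To make the iterative procedure described above concrete, the following is a minimal NumPy sketch of the K-means loop (random initial centers, nearest-center assignment, mean update); the function name and defaults are our own illustration rather than the authors' implementation.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Minimal K-means: random initial centers, assign points, update centers, repeat.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when the centers no longer move
            break
        centers = new_centers
    return labels, centers

For example, kmeans(X, 3) on the Iris features returns a cluster label for each sample together with the three final centroids.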

3. Research on K-Value Selection Algorithm


For the K-means algorithm, the number of clusters depends on the K-value setting [12]. In practice, the K value is generally difficult to define, yet its choice directly determines how many clusters the data will be divided into. In the early days, the K value was often determined by rough guesswork, and many improved optimization algorithms were later proposed. This paper summarizes representative K-value selection methods and gives further analysis and experimental verification. The experimental simulation environment is an Intel Core i5 dual-core [email protected], 4 GB of memory, and 500 GB of hard disk space.
The experiments use the Iris data set from the UCI Machine Learning Repository [13]. The Iris data set consists of three classes, each with 50 samples, for a total of 150 samples; each sample has four attributes, and each class represents a type of iris. In order to obtain a more obvious clustering effect when testing the pros and cons of the K-value selection algorithms, only the last two attributes of the Iris samples are used in the experiments.
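As a concrete illustration of this setup, the following short scikit-learn sketch loads the data set and keeps only the last two attributes, mirroring the Input line used in the pseudo code of the following sections:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, 2:]      # last two attributes: petal length and petal width
print(X.shape)            # (150, 2): 150 samples, 2 features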

3.1. The Elbow Method Algorithm


The basic idea of the elbow rule is to compute, for a series of K values, the sum of the squared distances between the sample points in each cluster and the centroid of that cluster. This sum of squared errors (SSE) is used as the performance indicator: iterate over the K value and calculate the SSE; smaller values indicate that each cluster is more compact. When the number of clusters approaches the real number of clusters, the SSE drops rapidly; when the number of clusters exceeds the real number, the SSE continues to decline, but much more slowly. The pseudo code of the algorithm is as follows:

Algorithm 1: Elbow Method

Input: iris = datasets.load_iris(), X = iris.data[:, 2:]
Output: d, k
1: d = [];
2: for k in range(1, 9) do
3:     d.append( Σ_{i=1}^{k} Σ_{x∈Ci} dist(x, ci)^2 );
4: return d, k;
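A minimal runnable Python counterpart of this pseudo code might look as follows (a sketch using scikit-learn, whose inertia_ attribute is the SSE used here; the variable names are our own):

from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data[:, 2:]              # last two Iris attributes
sse = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)                       # sum of squared distances to the closest center
for k, d in zip(range(1, 9), sse):
    print(k, round(d, 3))                         # look for the elbow where the decrease flattens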

The K value can be determined by plotting the K-SSE curve and finding the inflection point where the curve turns downward. As shown in Figure 1, there is a very obvious inflection point when K = 2, so when the K value is 2, the data set clustering effect is the best, as shown in Figure 2.

Figure 1. Choosing the value of K: When the selected K value is less than the real value, the cost value will be greatly reduced for every increase of k by 1; when the selected k value is greater than the true K, the change of the cost value will not be so obvious for every increase of k by 1. Thus, the correct K value will be at this turning point, similar to an elbow. As shown, there is a very obvious inflection point when K = 2.

Figure 2. Iris data set clustering renderings: When the K value is 2, the data set clustering effect is the best.

3.2. The Gap Statistic Algorithm

Gap Statistic is an algorithm proposed by Tibshirani et al. [14] to determine the number of clusters in data sets whose number of classes is unknown. The basic idea of Gap Statistic is to introduce reference measurements, which can be obtained by the Monte Carlo sampling method [15], and to calculate the sum of the squares of the Euclidean distances between measurements in each class. The clustering results are compared with those of a constructed reference (null) distribution to determine the optimal number of clusters in the data set. It is calculated as follows:
Gap_n(k) = E*_n[log(W_k)] − log(W_k),   E*_n[log(W_k)] ≈ (1/P) Σ_{b=1}^{P} log(W*_{kb}),   s(k) = sd(k)·√(1 + 1/P)    (3)

where E*_n[log(W_k)] refers to the expectation of log(W_k). This value is usually obtained randomly by Monte Carlo sampling: we repeatedly generate as many random samples as there are in the original data set, uniformly within a rectangular region containing the samples, and compute W_k for each of them, obtaining multiple log(W*_{kb}) values whose average gives an approximation of E*_n[log(W_k)]. P is the number of samplings and s(k) is the standard deviation correction term; finally, Gap(k) can be calculated. The K value corresponding to the maximum value of Gap(k) is the best k; that is, the smallest k satisfying Gap(k) ≥ Gap(k+1) − s(k+1) is taken as the optimal number of clusters. The pseudo code of the algorithm is as follows:

Algorithm 2: Gap Statistic

Input: iris = datasets.load_iris(), X = iris.data[:, 2:]
Output: k
1: def SampleNum, P, MaxK, u, sigma;
2: SampleSet = [];
3: [uM, ~] = size(u);
4: for i = 1 : uM do
5:     SampleSet = [SampleSet; mvnrnd(u(i, :), sigma, fix(SampleNum/uM))];
6: Wk = log(CompuWk(SampleSet, MaxK));
7: for b = 1 : P do
8:     Wkb = log(CompuWk(RefSet(:, :, b), MaxK));
9: for k = 1 : MaxK, OptimusK = 1 do
10:     Gapk = (1/P) * Σ_{b=1}^{P} Wkb(k) − Wk(k);
11:     if Gapk <= Gapk−1 + s(k) and OptimusK == 1 then
12:         OptimusK = k − 1;
13: return k;
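The following is a minimal Python sketch of the gap statistic procedure described above (uniform reference samples drawn in the bounding rectangle of X, P reference data sets, and the Gap(k) ≥ Gap(k+1) − s(k+1) stopping rule); it is our illustrative reimplementation under assumed parameter values, not the authors' code:

import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

def gap_statistic(X, max_k=8, P=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)          # rectangular region containing the samples

    def log_wk(data, k):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        return np.log(km.inertia_)                 # log of the within-cluster dispersion W_k

    gaps, sks = [], []
    for k in range(1, max_k + 1):
        ref = [log_wk(rng.uniform(lo, hi, size=X.shape), k) for _ in range(P)]
        gaps.append(np.mean(ref) - log_wk(X, k))   # Gap(k) = E*[log(Wk)] - log(Wk)
        sks.append(np.std(ref) * np.sqrt(1 + 1.0 / P))
    for k in range(1, max_k):
        if gaps[k - 1] >= gaps[k] - sks[k]:        # smallest k with Gap(k) >= Gap(k+1) - s(k+1)
            return k
    return max_k

X = datasets.load_iris().data[:, 2:]
print(gap_statistic(X))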

It can be seen from Figure 3 that, when K = 2, the optimal number of clusters is obtained.

Figure 3. Observations of the change of the Gap value with the K value: As shown, when K = 2, the optimal number of clusters is obtained.

3.3. The Silhouette Coefficient Algorithm

The Silhouette method was first proposed by Peter J. Rousseeuw [16]. It combines the two factors of cohesion and separation: cohesion is the similarity between an object and its own cluster, and when the comparison is made with other clusters it is called separation. The comparison is expressed through the Silhouette value, which lies in the range −1 to 1. A Silhouette value close to 1 indicates that the object is closely related to its own cluster. If the data clusters in a model are generated with relatively high Silhouette values, the model is suitable and acceptable. It is calculated as follows:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}
     = 1 − a(i)/b(i),   if a(i) < b(i)
     = 0,               if a(i) = b(i)
     = b(i)/a(i) − 1,   if a(i) > b(i)    (4)

Calculation method:
(1) Calculate the average distance a(i) from sample i to the other samples in the same cluster. The smaller a(i) is, the more sample i should be assigned to that cluster. a(i) is referred to as the intra-cluster dissimilarity of sample i. The mean of a(i) over all samples in cluster C is called the cluster dissimilarity of cluster C.
(2) Calculate the average distance b_ij from sample i to all samples of another cluster C_j, which is called the dissimilarity between sample i and cluster C_j. The inter-cluster dissimilarity of sample i is defined as b(i) = min{b_i1, b_i2, ..., b_ik}; the larger b(i) is, the less sample i belongs to the other clusters.
(3) The silhouette coefficient of sample i is defined according to the intra-cluster dissimilarity a(i) and the inter-cluster dissimilarity b(i) of sample i.
The pseudo code of the algorithm is as follows:

Algorithm 3: Silhouette Coefficient

Input: iris = datasets.load_iris(), X = iris.data[:, 2:]
Output: s(i), k
1: def i in X, C, D;
2: a(i) = ( Σ_{n∈C, n≠i} dist(i, n) ) / (nC − 1);  b(i) = ( Σ_{n∈D} dist(i, n) ) / nD;
3: for a(i) → min, i ∈ C; b(i) → max, i ∉ D do
4:     s(i) = (b(i) − a(i)) / max{a(i), b(i)};
5:     if a(i) < b(i), s(i) = 1 − a(i)/b(i);
6:     if a(i) = b(i), s(i) = 0;
7:     if a(i) > b(i), s(i) = b(i)/a(i) − 1;
8: for k = 2, 3, 4, 5, 6 do
9:     labels = KMeans(n_clusters = k).fit(X).labels_;
10: return s(i), k;

s(i) is the silhouette (contour) coefficient of the clustering result, which is a reasonable and effective measure of cluster quality. The closer s(i) is to 1, the more reasonable the clustering of sample i is. From Figure 4, the largest value is s(i) = 0.765, and therefore K = 2 is the optimal cluster number.
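A minimal Python sketch of this selection rule, using scikit-learn's silhouette_score (the mean of s(i) over all samples) instead of the hand-written distance computation above, might be:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = datasets.load_iris().data[:, 2:]
best_k, best_s = None, -1.0
for k in range(2, 7):                                  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).labels_
    s = silhouette_score(X, labels)                    # mean silhouette coefficient for this k
    print(k, round(s, 3))
    if s > best_s:
        best_k, best_s = k, s
print("best k:", best_k)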

Figure 4. s(i) changes with the K value and clustering effect diagram: s(i) is the silhouette (contour) coefficient of the clustering result, which is a reasonable and effective measure of cluster quality. The closer s(i) is to 1, the more reasonable the clustering of sample i is. As shown, the largest value is s(i) = 0.765, and then K = 2 is the optimal cluster number.

3.4. The Canopy Algorithm

The Canopy algorithm can roughly divide the data into several overlapping subsets [17], recorded as canopies. Each subset acts as a cluster, often using a low-cost similarity metric to accelerate clustering [18]. Therefore, Canopy clustering is generally used for the initialization of other clustering algorithms. The formation of canopies requires two distance thresholds T1 and T2, with T1 > T2 (the settings of T1 and T2 can be obtained according to the needs of the user or by cross-validation), and the original data set X is sorted according to certain rules. A data vector A is randomly selected from X, and a distance d between A and each of the other sample data vectors in X is calculated using a rough distance calculation method. The sample data vectors with d less than T1 are mapped into a Canopy, and the sample data vectors with d less than T2 are removed from the list of candidate center vectors. Repeat the above steps until the list of candidate center vectors is empty; that is, X is empty and the algorithm ends [19]. The pseudo code of the algorithm is as follows:

Algorithm 4: Canopy

Input: iris = datasets.load_iris(), X = iris.data[:, 2:]
Output: k
1: def T1, T2, T1 > T2; delete_X = []; Canopy_X = [];
2: for P ∈ X do
3:     d = ||P − Xi||;
4:     if d < T2 then
5:         delete_X = [d];
6:     else Canopy_X = [d];
7: until X = Φ;
8: end;

The algorithm mainly traverses the data continuously. Points with T2 < d < T1 remain in the candidate center list, while points with d < T2 are considered too close to an existing Canopy and are removed so that they cannot become center points later. It can be seen from Figure 5 that, when the Canopy algorithm is used to cluster the Iris data set, there are two center points after convergence; that is, K is 2.
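The following is a minimal Python sketch of this Canopy procedure (a random remaining point is chosen as a center, points within T1 join its canopy, and points within T2 leave the candidate list); the threshold values below are arbitrary examples, and the code is our illustration rather than the authors' implementation:

import numpy as np
from sklearn import datasets

def canopy(X, t1, t2, seed=0):
    # Rough Canopy clustering; assumes t1 > t2.
    assert t1 > t2
    rng = np.random.default_rng(seed)
    candidates = list(range(len(X)))
    centers, canopies = [], []
    while candidates:
        idx = candidates[rng.integers(len(candidates))]   # random remaining point as a center
        centers.append(idx)
        d_all = np.linalg.norm(X - X[idx], axis=1)
        canopies.append(np.where(d_all < t1)[0])          # points within t1 join this canopy
        # Points within t2 of the center are too close and leave the candidate list.
        candidates = [c for c in candidates if d_all[c] >= t2]
    return centers, canopies

X = datasets.load_iris().data[:, 2:]
centers, canopies = canopy(X, t1=2.5, t2=1.8)
print("suggested K:", len(centers))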

Figure 5. A clustering effect map generated by the Canopy method: The algorithm mainly traverses the data continuously; points with T2 < d < T1 can be used as the center list, while points with d < T2 are considered too close to a Canopy and will not be taken as center points in the future. As shown, the Canopy algorithm is used to cluster the Iris data set, and after convergence there are two center points; that is, K is 2.

4. Discussion
In this paper, four K-value selection algorithms, namely the Elbow Method, Gap Statistic, Silhouette Coefficient, and Canopy, are used to cluster the Iris data set to obtain the K value and the clustering result of the data set. For the four algorithms implemented in this paper, the verification results are shown in Table 1.
Table 1. The following table shows the experimental results of the four algorithms, including the
obtained K value and the algorithm execution time, where the Gap Statistic algorithm execution time is
the result when the reference sample P = 100.

No.  Name                    K Value  Execution Time
1    Elbow Method            2        1.830 s
2    Gap Statistic           2        9.763 s
3    Silhouette Coefficient  2        8.648 s
4    Canopy                  2        2.120 s

It can be seen from the above table that each of the four algorithms has its own characteristics. The Elbow Method algorithm uses the SSE as a performance metric, traverses the K value, and finds the inflection point; its complexity is low. Its inadequacy is that the inflection point depends on the relationship between the K value and the distance value: if the inflection point is not obvious, the K value cannot be determined. The Gap Statistic algorithm compares the expected value obtained from averaged reference data sets with the value observed on the actual data set and selects the K value at which the decrease is fastest. However, for many practical large-scale data sets, this method is undesirable in terms of both time complexity and space complexity. Taking this article as an example, in the experiment, when P = 100, the algorithm execution time is 9.763 s, and when P = 1000, the total time spent is 56.970 s. The Silhouette Coefficient algorithm uses cluster cohesion and separation to perform the cluster analysis; minimizing cohesion is equivalent to maximizing separation, so the two are combined into s(i) and the K value is traversed. When s(i) is at its maximum, the corresponding K value is the optimal number of clusters. Because the distance matrix needs to be calculated, its defect is a computational complexity of O(n^2); when the amount of data reaches one million or even ten million samples, the computational overhead becomes very large, so this method is also not suitable for large-scale data sets. The Canopy algorithm divides the data set into several overlapping subsets according to predetermined distance thresholds and repeats aggregation and deletion through distance comparisons until the original data set is empty. Its advantage is that the overlapping subsets increase the fault tolerance and noise immunity of the algorithm, and clustering within canopies effectively avoids the problems caused by large computations.
In summary, we can see that, for the clustering of small data sets, the four methods discussed in this paper all meet the requirements, while for large and complex data sets, the Canopy algorithm is clearly the best choice. Next, we will use real-world multidimensional data containing complex information fields for experimental verification, in order to explore the advantages and disadvantages of each algorithm in more depth and to improve the performance of the algorithms.

Author Contributions: Each author's contribution to this article is as follows: methodology, software, validation,
and data curation, Chunhui Yuan; formal analysis, writing—review and editing, and supervision, Haitao Yang.
Funding: This research received no external funding.
Conflicts of Interest: All authors declare no conflict of interest.

References
1. Zhai, D.; Yu, J.; Gao, F.; Lei, Y.; Feng, D. K-means text clustering algorithm based on centers selection
according to maximum distance. Appl. Res. Comput. 2014, 31, 713–719.
2. Sun, J.; Liu, J.; Zhao, L. Clustering algorithm research. J. Softw. 2008, 19, 48–61. [CrossRef]
3. Li, X.; Yu, L.; Hang, L.; Tang, X. The parallel implementation and application of an improved k-means
algorithm. J. Univ. Electron. Sci. Technol. China 2017, 46, 61–68.
4. Kanungo, T.; Mount, D.M.; Netanyahu, N.S.; Piatko, C.D.; Silverman, R.; Wu, A.Y. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 881–892. [CrossRef]
5. Wagstaff, K.; Cardie, C.; Rogers, S.; Schrödl, S. Constrained k-means clustering with background knowledge.
In Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, USA,
28 June–1 July 2001; pp. 577–584.
6. Huang, Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 1998, 2, 283–304. [CrossRef]
7. Narayanan, B.N.; Djaneye-Boundjou, O.; Kebede, T.M. Performance analysis of machine learning and pattern
recognition algorithms for Malware classification. In Proceedings of the 2016 IEEE National Aerospace and
Electronics Conference (NAECON) and Ohio Innovation Summit (OIS), Dayton, OH, USA, 25–29 July 2016;
pp. 338–342.
8. Narayanan, B.N.; Hardie, R.C.; Kebede, T.M.; Sprague, M.J. Optimized feature selection-based clustering
approach for computer-aided detection of lung nodules in different modalities. Pattern Anal. Appl. 2019, 22,
559–571. [CrossRef]
9. Narayanan, B.N.; Hardie, R.C.; Kebede, T.M. Performance analysis of a computer-aided detection system for
lung nodules in CT at different slice thicknesses. J. Med. Imag. 2018, 5, 014504. [CrossRef] [PubMed]
10. Wang, Q.; Wang, C.; Feng, Z.; Ye, J. Review of K-means clustering algorithm. Electron. Des. Eng. 2012, 20,
21–24.
11. Ravindra, R.; Rathod, R.D.G. Design of electricity tariff plans using gap statistic for K-means clustering based
on consumers monthly electricity consumption data. Int. J. Energ. Sect. Manag. 2017, 2, 295–310.
12. Han, L.; Wang, Q.; Jiang, Z.; Hao, Z. Improved K-means initial clustering center selection algorithm.
Comput. Eng. Appl. 2010, 46, 150–152.
13. UCI. UCI Machine learning repository. Available online: http://archive.ics.uci.edu/ml/ (accessed on 30 March
2019).
14. Tibshirani, R.; Walther, G.; Hastie, T. Estimating the number of clusters in a data set via the gap statistic. J. R.
Statist. Soc. Ser. B (Statist. Methodol.) 2001, 63, 411–423. [CrossRef]
15. Xiao, Y.; Yu, J. Gap statistic and K-means algorithm. J. Comput. Res. Dev. 2007, 44, 176–180.
16. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: Hoboken, NJ, USA, 1990.

17. Esteves, K.M.; Rong, C. Using Mahout for clustering Wikipedia’s latest articles: A comparison between K-means and fuzzy c-means in the cloud. In Proceedings of the 2011 Third IEEE International Conference on Cloud Computing Technology and Science; IEEE Computer Society: Washington, DC, USA, 29 November–1 December 2011; pp. 565–569.
18. Yu, C.; Zhang, R. Research of FCM algorithm based on canopy clustering algorithm under cloud environment.
Comput. Sci. 2014, 41, 316–319.
19. McCallum, A.; Nigam, K.; Ungar, L.H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20–23 August 2000; pp. 169–178.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
