Unsupervised K-Means Clustering Algorithm
ABSTRACT The k-means algorithm is the most widely known and used clustering method, and various extensions of k-means have been proposed in the literature. Although k-means is treated as an unsupervised learning approach to clustering in pattern recognition and machine learning, the algorithm and its extensions are always influenced by their initializations and require the number of clusters to be given a priori. In this sense, the k-means algorithm is not a fully unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initializations and parameter selection and can simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed, and comparisons between the proposed U-k-means and other existing methods are made. Experimental results and comparisons demonstrate these advantages of the proposed U-k-means clustering algorithm.
INDEX TERMS Clustering, K-means, Number of clusters, Initializations, Unsupervised learning schema,
Unsupervised k-means (U-k-means)
I. INTRODUCTION
Clustering is a useful tool in data science. It is a method for finding cluster structure in a data set, characterized by the greatest similarity within the same cluster and the greatest dissimilarity between different clusters. Hierarchical clustering was the earliest clustering method used by biologists and social scientists, whereas cluster analysis later became a branch of statistical multivariate analysis [1,2]. Clustering is also an unsupervised learning approach in machine learning. From a statistical viewpoint, clustering methods are generally divided into probability model-based approaches and nonparametric approaches. Probability model-based approaches assume that the data points are drawn from a mixture probability model, so that a mixture likelihood approach to clustering is used [3]. In model-based approaches, the expectation and maximization (EM) algorithm is the most used [4,5]. For nonparametric approaches, clustering methods are mostly based on an objective function of similarity or dissimilarity measures, and these can be divided into hierarchical and partitional methods, where partitional methods are the most used [2,6,7].
In general, partitional methods suppose that the data set can be represented by finite cluster prototypes with their own objective functions. Therefore, defining the dissimilarity (or distance) between a point and a cluster prototype is essential for partitional methods. The k-means algorithm is the oldest and most popular partitional method [1,8]. K-means clustering has been widely studied with various extensions in the literature and applied in a variety of substantive areas [9,10,11,12]. However, these k-means clustering algorithms are usually affected by initializations and need the number of clusters to be given a priori. In general, the cluster number is unknown. In this case, validity indices can be used to find a cluster number, where they are supposed to be independent of the clustering algorithm [13]. Many cluster validity indices for the k-means clustering algorithm have been proposed in the literature, such as the Bayesian information criterion (BIC) [14], Akaike information criterion (AIC) [15], Dunn's index [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], and the Gap statistic [20].
In the k-means algorithm, the cluster centers a_k and the hard memberships z_ik are updated alternately by

a_k = Σ_{i=1}^{n} z_ik x_i / Σ_{i=1}^{n} z_ik,  and  z_ik = 1 if ||x_i − a_k||² = min_{1≤s≤c} ||x_i − a_s||², z_ik = 0 otherwise.
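As an illustration of these two alternating updates, the following is a minimal NumPy sketch of the standard k-means (Lloyd) iteration; the function and variable names are ours, not from the paper.

import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    """Minimal k-means: X is (n, d), c is the given number of clusters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = X[rng.choice(n, size=c, replace=False)]        # initial centers a_k
    for _ in range(n_iter):
        # membership update: z_ik = 1 for the closest center, 0 otherwise
        dist2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)   # (n, c)
        labels = dist2.argmin(axis=1)
        # center update: a_k = sum_i z_ik x_i / sum_i z_ik
        new_A = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else A[k]
                          for k in range(c)])
        if np.allclose(new_A, A):
            break
        A = new_A
    return labels, A

Note that both the number of clusters c and the random initialization must be supplied, which is exactly the dependence the rest of the paper works to remove.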
Such indices essentially try to maximize the between-cluster distances while minimizing the distance between the cluster centroid and the other data objects. The Silhouette value [18] measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Thus, large positive and large negative silhouette widths (SW) indicate that the corresponding object is well clustered or wrongly clustered, respectively, while objects with an SW value around zero are considered not to be clearly discriminated between clusters. The Gap statistic [20] is a cluster validity measure based on a statistical hypothesis test. It works by comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution at each value of c; the optimal number of clusters is taken as the smallest c satisfying this criterion.

As an efficient method for choosing the number of clusters, X-means, proposed by Pelleg and Moore [23], is probably the most well-known and most used in the literature, for example by Witten et al. [28] and Guo et al. [29]. In X-means, Pelleg and Moore [23] extended k-means by making local decisions about the cluster centers in each iteration of k-means, letting centers split themselves to obtain a better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and a model selection criterion, such as BIC, is then used to drive the splitting process. Although X-means has been the most used method for clustering without a given number of clusters a priori, it still needs a range of cluster numbers specified according to a criterion such as BIC, and it is still influenced by the initialization of the algorithm. On the other hand, Rodriguez and Laio [30] proposed what they called clustering by fast search and find of density peaks (C-FS). To identify the cluster centers, C-FS uses the heuristic approach of a decision graph. However, the performance of C-FS highly depends on two factors, the local density ρ_i and the cutoff distance δ_i.
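To make the validity-index route concrete, the following sketch (our own illustration, using scikit-learn rather than any code from the paper) selects the number of clusters by running k-means over a range of c and keeping the c with the largest average silhouette width.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_c_by_silhouette(X, c_range=range(2, 11), seed=0):
    """Run k-means for each candidate c and keep the c with the best SW."""
    scores = {}
    for c in c_range:
        labels = KMeans(n_clusters=c, n_init=10, random_state=seed).fit_predict(X)
        scores[c] = silhouette_score(X, labels)   # mean silhouette over all points
    best_c = max(scores, key=scores.get)
    return best_c, scores

Note that this route still requires running k-means repeatedly over a user-chosen range of c, which is precisely the dependence the U-k-means construction in the next section removes.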
III. THE UNSUPERVISED K-MEANS CLUSTERING ALGORITHM
A long-standing difficulty of the k-means algorithm and its extensions is that they are usually affected by initializations and by the number of clusters that must be given a priori; even X-means, which can estimate the number of clusters, requires a range of cluster numbers based on BIC, and it is still influenced by initializations. To construct a k-means clustering algorithm that is free of initializations and automatically finds the number of clusters, we use the entropy concept, borrowing the idea of the robust EM algorithm of Yang et al. [31]. We first consider the mixing proportions α_k, in which α_k is seen as the probability of one data point belonging to the kth class. Hence, we use −ln α_k as the information in the occurrence of one data point belonging to the kth class, so that −Σ_{k=1}^{c} α_k ln α_k becomes the average information, i.e., the entropy over the proportions α_k. When α_k = 1/c we say there is no information about α_k, and the entropy attains its maximum value. Therefore, we add this term to the k-means objective function J(z, A) as a penalty and construct a schema that estimates α_k by minimizing the entropy, i.e., by obtaining the most information about α_k. Minimizing −Σ_{k=1}^{c} α_k ln α_k is equivalent to maximizing Σ_{k=1}^{c} α_k ln α_k. For this reason, we use Σ_{k=1}^{c} α_k ln α_k as a penalty term for the k-means objective function J(z, A), giving the objective function

J(z, A, α) = Σ_{i=1}^{n} Σ_{k=1}^{c} z_ik ||x_i − a_k||² − β Σ_{k=1}^{c} n α_k ln α_k.    (1)

In order to determine the number of clusters, we next consider another entropy term that combines the membership variables z_ik and the proportions α_k; based on the same entropy idea, we add a term of the form z_ik ln α_k. Thus, we propose the unsupervised k-means (U-k-means) objective function as follows:

J_UKM(z, A, α) = Σ_{i=1}^{n} Σ_{k=1}^{c} z_ik ||x_i − a_k||² − β Σ_{k=1}^{c} n α_k ln α_k − γ Σ_{i=1}^{n} Σ_{k=1}^{c} z_ik ln α_k.    (2)
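A small numerical sketch of this objective (our own illustration; the variable names are not from the paper): it evaluates J_UKM for given hard memberships Z, centers A, and proportions α, and shows that the entropy term −Σ_k α_k ln α_k is largest when the proportions are uniform.

import numpy as np

def j_ukm(X, Z, A, alpha, beta, gamma):
    """U-k-means objective: distortion - beta*n*sum(alpha ln alpha)
    - gamma*sum_ik z_ik ln alpha_k, with Z a hard 0/1 membership matrix."""
    n = X.shape[0]
    dist2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)      # (n, c)
    distortion = (Z * dist2).sum()
    entropy_term = -beta * n * (alpha * np.log(alpha)).sum()
    mixing_term = -gamma * (Z * np.log(alpha)[None, :]).sum()
    return distortion + entropy_term + mixing_term

# the entropy -sum(alpha ln alpha) is maximal for uniform proportions:
for a in (np.array([0.25, 0.25, 0.25, 0.25]), np.array([0.7, 0.1, 0.1, 0.1])):
    print(a, -(a * np.log(a)).sum())     # about 1.386 vs 0.941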
Introducing a Lagrange multiplier λ for the constraint Σ_{k=1}^{c} α_k = 1, we obtain the Lagrangian

J̃(z, A, α, λ) = Σ_{i=1}^{n} Σ_{k=1}^{c} z_ik ||x_i − a_k||² − β Σ_{k=1}^{c} n α_k ln α_k − γ Σ_{i=1}^{n} Σ_{k=1}^{c} z_ik ln α_k − λ (Σ_{k=1}^{c} α_k − 1).    (3)

We first take the partial derivative of the Lagrangian (3) with respect to z_ik and set it to zero. Thus, the updating equation for z_ik is obtained as follows:

z_ik = 1 if ||x_i − a_k||² − γ ln α_k = min_{1≤s≤c} (||x_i − a_s||² − γ ln α_s), and z_ik = 0 otherwise.    (4)

The mixing proportions α_k are then updated by the weighted-mean formula of Eq. (7). If a new mixing proportion α_k^(t+1) in this update becomes smaller than the old α_k^(t), the smaller proportions keep decreasing while the bigger proportions keep increasing in the subsequent iterations, so that competition between clusters occurs. This situation is similar to the formula in Figueiredo and Jain [32]. If α_k ≤ 0 or α_k ≤ 1/n for some 1 ≤ k ≤ c^(t), these are considered illegitimate proportions. In this situation, we discard those clusters, update the cluster number c^(t) to c^(t+1), and adjust the remaining memberships z*_ik and proportions α*_k by Eqs. (8) and (9). We next consider the parameter learning of β and γ.
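The following sketch (our illustration, not the authors' code) implements the adjusted assignment rule of Eq. (4) together with the discarding of clusters whose proportions fall to α_k ≤ 1/n; the re-normalization used here is a simple stand-in for Eqs. (8)-(9), whose exact form is not reproduced above.

import numpy as np

def assign_and_prune(X, A, alpha, gamma):
    """Hard assignment with the -gamma*ln(alpha_k) bias (Eq. (4)), then drop
    clusters whose mixing proportion has fallen to alpha_k <= 1/n."""
    n = X.shape[0]
    dist2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)   # (n, c)
    cost = dist2 - gamma * np.log(alpha)[None, :]                # adjusted cost
    labels = cost.argmin(axis=1)
    keep = alpha > 1.0 / n                   # illegitimate proportions are discarded
    A, alpha = A[keep], alpha[keep]
    alpha = alpha / alpha.sum()              # simple re-normalization (stand-in for Eqs. (8)-(9))
    # re-assign points using only the surviving clusters
    cost = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2) - gamma * np.log(alpha)[None, :]
    labels = cost.argmin(axis=1)
    return labels, A, alpha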
The learning of β and γ allows the algorithm to automatically reduce the number of clusters while simultaneously obtaining the parameter estimates. Furthermore, the parameter β helps us control the competition among clusters. We discuss β as follows. Applying the rule e^{-1} ≥ −α_k ln α_k ≥ 0 for 0 < α_k < 1, with the entropy E = −Σ_{s=1}^{c} α_s ln α_s ≥ 0, we can bound max_{1≤k≤c} α_k^(t+1) in terms of max_{1≤k≤c} (1/n) Σ_{i=1}^{n} z_ik and max_{1≤k≤c} α_k^(t) (ln α_k^(t) − Σ_{s=1}^{c} α_s^(t) ln α_s^(t)). According to the resulting Eqs. (12) and (13), β^(t+1) is then estimated (Eq. (14)) using the quantity Σ_{k=1}^{c} exp(−n |α_k^(t+1) − α_k^(t)|) / c. Because β can jump at any time, we let β = 0 when the cluster number c is stable; a stable cluster number means that c is no longer decreasing.
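A small sketch of this β schedule (our reading of the update described above; the exact constant and form should be checked against Eqs. (12)-(14) of the original paper): β shrinks as the proportions stop changing and is switched off once the cluster count has stabilized.

import numpy as np

def update_beta(alpha_new, alpha_old, n, c_history):
    """Decrease beta as the mixing proportions converge; set beta = 0 once the
    number of clusters has stopped decreasing.  alpha_new and alpha_old are
    assumed to have the same length here."""
    if len(c_history) >= 2 and c_history[-1] == c_history[-2]:
        return 0.0                                   # cluster number is stable
    c = alpha_new.shape[0]
    return np.exp(-n * np.abs(alpha_new - alpha_old)).sum() / c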
The U-k-means clustering algorithm is summarized as follows.

U-k-means clustering algorithm
Step 1: Fix ε > 0. Give the initial cluster number c^(0) = n, the initial proportions α_k^(0) = 1/n, the initial cluster centers a_k^(0) = x_k for k = 1, ..., n (each data point is taken as an initial cluster center), and the initial learning rates β^(0) = γ^(0) = 1. Set t = 0.
Step 2: Compute z_ik^(t+1) using a_k^(t), α_k^(t), c^(t), β^(t), and γ^(t) by (4).
Step 3: Compute γ^(t+1) by (10).
Step 4: Update α_k^(t+1) with z_ik^(t+1) and α_k^(t) by (7).
Step 5: Compute β^(t+1) with α^(t+1) and α^(t) by (14).
Step 6: Update c^(t) to c^(t+1) by discarding those clusters with α_k^(t+1) ≤ 1/n, and adjust α_k^(t+1) and z_ik^(t+1) by (8) and (9).
Step 7: Update the cluster centers a_k^(t+1) using z_ik^(t+1).
Step 8: If the cluster centers have converged (e.g., max_k ||a_k^(t+1) − a_k^(t)|| < ε), stop; otherwise set t = t + 1 and return to Step 2.

Compared with the RL-FCM objective function [33], the U-k-means objective function is simpler, so RL-FCM is an algorithm with more running time; for the experimental results and comparisons in the next section, we therefore make more comparisons of the proposed U-k-means algorithm with the RL-FCM algorithm. We also analyze the computational complexity of the U-k-means algorithm. The U-k-means algorithm can be divided into three parts: (1) compute the hard membership partition z_ik, with O(ncd); (2) compute the mixing proportions α_k, with O(nc); (3) update the cluster centers a_k, with O(nd). The total computational complexity of the U-k-means algorithm is therefore O(ncd), where n is the number of data points, c is the number of clusters, and d is the dimension of the data points. In comparison, the RL-FCM algorithm [33] has a total computational complexity of O(nc²d).
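Putting the pieces together, here is a compact sketch of the whole U-k-means loop assembled from the steps above. It is our own reconstruction: the exact α_k, β, and γ updates of Eqs. (7), (10), and (14) are only partially recoverable from the text, so simplified stand-ins are used and marked as such, and for robustness the sketch starts from a user-chosen c0 rather than literally c^(0) = n as in the paper.

import numpy as np

def u_kmeans_sketch(X, c0=None, max_iter=300, tol=1e-6, gamma=1.0):
    """Illustrative U-k-means-style loop: biased hard assignment (Eq. (4)),
    proportion update, discarding of clusters with alpha_k <= 1/n, and center
    re-estimation.  The alpha/beta/gamma schedules are simple stand-ins."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    c0 = c0 or min(n, 50)                       # stand-in for the paper's c(0) = n
    A = X[rng.choice(n, size=c0, replace=False)].copy()
    alpha = np.full(c0, 1.0 / c0)
    for _ in range(max_iter):
        # Step 2: hard memberships with the -gamma*ln(alpha_k) bias
        cost = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2) - gamma * np.log(alpha)[None, :]
        labels = cost.argmin(axis=1)
        # Step 4 (stand-in): proportions follow the current hard partition
        alpha = np.bincount(labels, minlength=A.shape[0]) / n
        # Step 6: discard clusters whose proportion fell to <= 1/n
        keep = alpha > 1.0 / n
        if keep.sum() == 0:
            break
        A, alpha = A[keep], alpha[keep]
        alpha = alpha / alpha.sum()
        # Step 7: re-estimate the surviving centers from the hard partition
        cost = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2) - gamma * np.log(alpha)[None, :]
        labels = cost.argmin(axis=1)
        A_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else A[k]
                          for k in range(A.shape[0])])
        if A_new.shape == A.shape and np.abs(A_new - A).max() < tol:
            A = A_new
            break
        A = A_new
    return labels, A, alpha

Each sweep is dominated by the O(ncd) assignment step, in line with the complexity analysis above.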
TABLE II
VALIDITY INDEX VALUES OF CH, SW, DB, GAP-STAT, DNO, DNG, AND DNS FOR THE NOISY DATA SET
c = 3: 0.553, 0.649, 0.866, 0.388, 0.047, 2.603, 0.019 (CH, SW, DB, Gap-stat, DNo, DNg, DNs)
FIGURE 2. (a) 6-cluster dataset with 50 noisy points; (b) final results from U-k-means.
FIGURE 3. (a) 14-cluster dataset; (b) final results from U-k-means.
Example 2. In this example, we consider a data set of 800 points generated from a 3-variate 14-component Gaussian mixture, i.e., 800 data points with 3 dimensions and 14 clusters, as shown in Fig. 3(a). To estimate the number c of clusters, we use CH, SW, DB, Gap-stat, DNo, DNg, and DNs. To create the results of the seven validity indices, we run the k-means algorithm with 25 different initializations. The estimated numbers of clusters from CH, SW, DB, the Gap statistic, DNo, DNg, and DNs, with their percentages, are shown in Table III. It is seen that all validity indices can give the correct number c* = 14 of clusters, except DNg; the Gap-stat index gives the highest percentage of the correct number c* = 14 of clusters, with 64%. We also implement the proposed U-k-means for this data set and then compare it with the R-EM, C-FS, k-means with the true number of clusters, X-means, and RL-FCM clustering algorithms. We mention that U-k-means, R-EM, and RL-FCM are free of parameter selection, but the others depend on parameter selection for finding the number of clusters. Table IV shows the comparison results of the U-k-means, R-EM, C-FS, k-means with the true cluster number c = 14, X-means, and RL-FCM algorithms. Note that C-FS, k-means with the true number of clusters, and X-means are run with 25 different initializations, so their average AR (AV-AR) values are reported.

TABLE III
RESULTS OF THE SEVEN VALIDITY INDICES
True c | CH | SW | DB | Gap-stat | DNo | DNg | DNs
14 | 14 (60%) | 14 (60%) | 14 (60%) | 14 (64%) | 14 (20%) | 2, 4, 5, 10, 11 | 14 (20%)
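For readers who want to reproduce the flavor of this experiment, the following sketch (our own, with arbitrary mixture parameters rather than the paper's exact ones) generates a 3-dimensional Gaussian mixture and checks how often a validity index recovers the simulated number of components over 25 initializations.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(1)
true_c, dim, n_per = 14, 3, 800 // 14
# hypothetical, well-separated component means; the paper's exact parameters differ
means = rng.uniform(0, 40, size=(true_c, dim))
X = np.vstack([rng.normal(m, 1.0, size=(n_per, dim)) for m in means])

hits = 0
for run in range(25):                                    # 25 random initializations
    scores = {c: calinski_harabasz_score(
                  X, KMeans(n_clusters=c, n_init=1, random_state=run).fit_predict(X))
              for c in range(2, 21)}
    hits += (max(scores, key=scores.get) == true_c)
print(f"CH picked c = {true_c} in {hits}/25 runs")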
We next consider six medical data sets: SPECT, Parkinsons, WPBC, Colon, Lung, and Nci9. Detailed descriptions of these data sets, with their feature characteristics, the number c of classes, the number n of instances, and the number d of features, are listed in Table XI. In this experiment, we first preprocess the SPECT, Parkinsons, WPBC, Colon, and Lung data sets using the matrix factorization technique. We then conduct experiments to compare the proposed U-k-means with R-EM, C-FS, k-means with the true c, k-means+Gap-stat, X-means, and RL-FCM. The results are shown in Table XII. For C-FS, k-means with the true c, k-means+Gap-stat, and X-means, we run experiments with 25 different initializations and report the average AR (AV-AR) and the percentage of runs in which the correct number c of clusters is found, as shown in Table XII. It is seen that the proposed U-k-means obtains the correct number of clusters for SPECT, Parkinsons, WPBC, Colon, and Lung, while for the Nci9 data set the U-k-means algorithm obtains c* = 8, which is very close to the true c = 9. In terms of AR, the U-k-means algorithm performs significantly better than the others. The R-EM algorithm estimates the correct number of clusters on SPECT; however, it underestimates the number of clusters on Parkinsons and overestimates it on WPBC. The results of R-EM on the Colon, Lung, and Nci9 data sets are missing because the probabilities of data points belonging to the kth class become illegitimate proportions at the first iteration on these data sets. The C-FS algorithm performs better than k-means+Gap-stat and X-means. The RL-FCM algorithm estimates the correct number of clusters c for the SPECT, Parkinsons, and WPBC data sets, while it overestimates the number of clusters on Colon, Lung, and Nci9 with c* = 62, c* = 9, and c* = 60, respectively.
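The paper states that these data sets are preprocessed with a matrix factorization technique but the exact variant is not spelled out here; as one hypothetical realization, the sketch below uses scikit-learn's non-negative matrix factorization to reduce a feature matrix before clustering.

import numpy as np
from sklearn.decomposition import NMF

def nmf_preprocess(X, n_components=20, seed=0):
    """Reduce an (n, d) feature matrix to an (n, n_components) representation W
    from the factorization X ~ W @ H."""
    X = np.asarray(X, dtype=float)
    X = X - X.min()                      # NMF requires non-negative input
    model = NMF(n_components=n_components, init="nndsvda", max_iter=500, random_state=seed)
    W = model.fit_transform(X)           # reduced representation used for clustering
    return W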
TABLE IV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE DATA SET OF FIG. 3(A)
True c | U-k-means (c*, AR) | R-EM (c*, AR) | C-FS (c*, AV-AR) | k-means with true c (AV-AR) | X-means (c*, AV-AR) | RL-FCM (c*, AR)
14 | 14, 1.00 | 14, 1.00 | 14 (96%), 0.9772 | 0.8160 | 14 (76%), 1.00 | 14, 1.00
TABLE V
MIXING PROPORTIONS, MEAN VALUES AND COVARIANCE MATRICES OF EXAMPLE 3
Mixing proportions: α1 = 0.2, α2 = 0.3, α3 = 0.1, α4 = 0.1, α5 = 0.2, α6 = 0.1 (mean vectors and covariance matrices not reproduced).
TABLE VI
RESULTS OF THE SEVEN VALIDITY INDICES FOR THE DATA SET OF EXAMPLE 3
True c | Optimal number of clusters obtained by CH, SW, DB, Gap-stat, DNo, DNg, DNs
TABLE VII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 3
True c | U-k-means (c*, AR) | R-EM (c*, AR) | C-FS (c*, AR) | k-means with true c (AV-AR) | X-means (c*, AR) | RL-FCM (c*, AR)
6 | 6, 1.00 | 3, - | 6 (84%), 0.8155 | 0.7833 | 6 (100%), 1.00 | 3, -
TABLE VIII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 4
True c | U-k-means (c*, AR) | R-EM (c*, AR) | C-FS (c*, AV-AR) | k-means with true c (AV-AR) | X-means (c*, AV-AR) | RL-FCM (c*, AR)
9 | 9, 1.00 | 12, - | 9 (96%), 0.7641 | 0.9190 | 2, - | 2, -
TABLE IX
DESCRIPTIONS OF THE EIGHT DATA SETS USED IN EXAMPLE 5
Dataset Feature Characteristics Number c of clusters Number n of instances Number d of features
Iris Real 3 150 4
Seeds Real 3 210 7
Australian Categorical, Integer, Real 2 690 14
Flowmeter D Real 4 180 43
Sonar Real 2 208 60
Wine Integer, Real 3 178 13
Horse Categorical, Integer, Real 2 368 27
Waveform (Version 1) Real 3 5000 21
TABLE X
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR DIFFERENT REAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
Data set | True c | U-k-means (c*, AR) | R-EM (c*, AV-AR) | C-FS (c*, AR) | k-means with true c (AR) | k-means+Gap-stat (c*, AV-AR) | X-means (c*, AV-AR) | RL-FCM (c*, AV-AR)
Iris | 3 | 3, 0.8933 | 3, 0.8600 | 3 (84%), 0.7521 | 0.7939 | 4, 5, - | 2, - | 3, 0.9067
TABLE XII
RESULTS FROM VARIOUS ALGORITHMS FOR THE SIX MEDICAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
Data set | True c | U-k-means (c*, AR) | R-EM (c*, AV-AR) | C-FS (c*, AR) | k-means with true c (AR) | k-means+Gap-stat (c*, AR) | X-means (c*, AV-AR) | RL-FCM (c*, AV-AR)
NCI9 | 9 | 8, - | -, - | 2, 4, - | 0.32 | 2, - | 2, - | 60, -
TABLE XIII
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR DIFFERENT REAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
Data set | True c | U-k-means (c*, AR) | R-EM (c*, AR) | C-FS (c*, AV-AR) | k-means with true c (AV-AR) | X-means (c*, AV-AR) | RL-FCM (c*, AV-AR)
Yale Face | 15 | 16, - | -, - | 12, - | 0.34 | 2, 3, - | 2, -
TABLE XIV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE 100 IMAGES SAMPLE OF THE CIFAR-10 DATA SET
Data set | True c | U-k-means | R-EM | C-FS | k-means with true c | X-means | RL-FCM
Example 8. In this example, we apply the U-k-means clustering algorithm to the CIFAR-10 color images [37]. The CIFAR-10 data set consists of 60000 32x32 color images in 10 classes; each pixel is an RGB triplet of unsigned bytes between 0 and 255, so each red, green, and blue channel contributes 1024 entries per image. There are 50000 training images and 10000 test images. The 10 classes in the data set are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Specifically, we take the first 100 color images (10 images per class) from the CIFAR-10 60K image data set for our experiment; the remaining 59900 images serve as the retrieval database. Fig. 6 shows the 100-image sample from the CIFAR-10 data set. The results for the number of clusters and AR are given in Table XIV. From Table XIV, it is seen that the proposed U-k-means and k-means with the true c = 10 give the better results on the 100-image sample of the CIFAR-10 data set. The U-k-means obtains the correct number c* = 10 of clusters in 42.5% of runs with AV-AR = 0.28, and k-means with c = 10 gives the same AV-AR = 0.28. For C-FS, the percentage of runs with the correct number c* = 10 of clusters is only 16.7%, with AV-AR = 0.24. X-means underestimates the number of clusters with c* = 2. The results of R-EM and RL-FCM on this data set are missing because the probabilities of data points belonging to the kth class become illegitimate proportions at the first iteration.
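As a rough illustration of the data preparation (our own sketch; the paper gives no code, and selecting the first 10 images per class here is a simplification of its "first 100 images"), one way to build the 100-image sample as flat feature vectors, assuming TensorFlow's bundled CIFAR-10 loader:

import numpy as np
from tensorflow.keras.datasets import cifar10

(x_train, y_train), _ = cifar10.load_data()            # x_train: (50000, 32, 32, 3) uint8
y_train = y_train.ravel()

# pick 10 images from each of the 10 classes -> 100 images total
idx = np.concatenate([np.flatnonzero(y_train == k)[:10] for k in range(10)])
X = x_train[idx].reshape(len(idx), -1).astype(float) / 255.0   # (100, 3072) in [0, 1]
y = y_train[idx]
# X can now be fed to a clustering routine such as the u_kmeans_sketch above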
We further analyze the performance of U-k-means, R-EM, C-FS, and RL-FCM by comparing their average running times over 25 runs, as shown in Table XV. All algorithms are implemented in MATLAB 2017b. From Table XV, it is seen that the proposed U-k-means is the fastest on all data sets among these algorithms, except that the C-FS algorithm is the fastest for the Waveform data set. Furthermore, in Section III we mentioned that the proposed U-k-means objective function is simpler than the RL-FCM objective function, which saves running time; from Table XV, it is seen that the proposed U-k-means algorithm indeed runs faster than the RL-FCM algorithm.

TABLE XV
AVERAGE RUNNING TIMES (OVER 25 RUNS) OF U-K-MEANS, R-EM, C-FS, AND RL-FCM
Data set | U-k-means | R-EM | C-FS | RL-FCM
Yale Face 32x32 | 0.3741 | - | 5.9634 | 0.4286
CIFAR-10 | 2.6561 | - | 6.4500 | -

V. CONCLUSIONS
In this paper, we propose a new schema with a learning framework for the k-means clustering algorithm. We adopt entropy-type penalty terms to construct a competition schema. The proposed U-k-means algorithm uses the number of data points as the initial number of clusters, which solves the initialization problem. During the iterations, the U-k-means algorithm discards extra clusters, so that an optimal number of clusters is found automatically according to the structure of the data. The advantages of U-k-means are that it is free of initializations and parameter selection, and that it is robust to different cluster volumes and shapes while automatically finding the number of clusters. The proposed U-k-means algorithm was run on several synthetic and real data sets and compared with existing algorithms, such as the R-EM, C-FS, k-means with the true number c, k-means+Gap, and X-means algorithms. The results demonstrate the superiority of the U-k-means clustering algorithm.

REFERENCES
[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[2] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley, 1990.
[3] G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker, 1988.
[4] A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," J. Roy. Stat. Soc., Ser. B, vol. 39, pp. 1-38, 1977.
[5] J. Yu, C. Chaomurilige, M.S. Yang, "On convergence and parameter selection of the EM and DA-EM algorithms for Gaussian mixtures," Pattern Recognition, vol. 77, pp. 188-203, 2018.
[6] A.K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, pp. 651-666, 2010.
[7] M.S. Yang, S.J. Chang-Chien, Y. Nataliani, "A fully-unsupervised possibilistic c-means clustering method," IEEE Access, vol. 6, pp. 78308-78320, 2018.
[8] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, University of California Press, 1967.
[9] M. Alhawarat, M. Hegazi, "Revisiting k-means and topic modeling, a comparison study to cluster Arabic documents," IEEE Access, vol. 6, pp. 42740-42749, 2018.
[10] Y. Meng, J. Liang, F. Cao, Y. He, "A new distance with derivative information for functional k-means clustering algorithm," Information Sciences, vol. 463-464, pp. 166-185, 2018.
[11] Z. Lv, T. Liu, C. Shi, J.A. Benediktsson, H. Du, "Novel land cover change detection method based on k-means clustering and adaptive majority voting using bitemporal remote sensing images," IEEE Access, vol. 7, pp. 34425-34437, 2019.
[12] J. Zhu, Z. Jiang, G.D. Evangelidis, C. Zhang, S. Panga, Z. Li, "Efficient registration of multi-view point sets by k-means clustering," Information Sciences, vol. 488, pp. 205-218, 2019.
[13] M. Halkidi, Y. Batistakis, M. Vazirgiannis, "On clustering validation techniques," J. Intell. Inf. Syst., vol. 17, pp. 107-145, 2001.
[14] R.E. Kass, A.E. Raftery, "Bayes factors," Journal of the American Statistical Association, vol. 90, pp. 773-795, 1995.
[15] H. Bozdogan, "Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions," Psychometrika, vol. 52, pp. 345-370, 1987.
[16] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters," J. Cybernetics, vol. 3, pp. 32-57, 1974.
[17] D. Davies, D. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224-227, 1979.
[18] P.J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
[19] T. Calinski, J. Harabasz, "A dendrite method for cluster analysis," Commun. Stat.-Theory Methods, vol. 3, pp. 1-27, 1974.
[20] R. Tibshirani, G. Walther, T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society: Series B, vol. 63, pp. 411-423, 2001.
[21] N.R. Pal, J. Biswas, "Cluster validation using graph theoretic concepts," Pattern Recognition, vol. 30, pp. 847-857, 1997.
[22] N. Ilc, "Modified Dunn's cluster validity index based on graph theory," Przeglad Elektrotechniczny (Electrical Review), vol. 2, pp. 126-131, 2012.
[23] D. Pelleg, A. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," Proc. of the 17th International Conference on Machine Learning, pp. 727-734, San Francisco, 2000.
[24] E. Rendon, I. Abundez, A. Arizmendi, E.M. Quiroz, "Internal versus external cluster validation indexes," Int. J. Computers and Communications, vol. 5, pp. 27-34, 2011.
[25] Y. Lei, J.C. Bezdek, S. Romani, N.X. Vinh, J. Chan, J. Bailey, "Ground truth bias in external cluster validity indices," Pattern Recognition, vol. 65, pp. 58-70, 2017.
[26] J. Wu, J. Chen, H. Xiong, M. Sie, "External validation measures for k-means clustering: a data distribution perspective," Expert Syst. Appl., vol. 36, pp. 6050-6061, 2009.
[27] L.J. Deborah, R. Baskaran, A. Kannan, "A survey on internal validity measure for cluster validation," Int. J. Comput. & Eng. Surv., vol. 1, pp. 85-102, 2010.
[28] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2000.
[29] G. Guo, L. Chen, Y. Ye, Q. Jiang, "Cluster validation method for determining the number of clusters in categorical sequences," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2936-2948, 2017.
[30] A. Rodriguez, A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[31] M.S. Yang, C.Y. Lai, C.Y. Lin, "A robust EM clustering algorithm for Gaussian mixture models," Pattern Recognition, vol. 45, pp. 3950-3961, 2012.
[32] M.A.T. Figueiredo, A.K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 381-396, 2002.
[33] M.S. Yang, Y. Nataliani, "Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters," Pattern Recognition, vol. 71, pp. 45-59, 2017.
[34] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, a huge collection of artificial and real-world data sets, 1998. https://archive.ics.uci.edu/ml/datasets.html
[35] D. Cai, X. He, J. Han, T.S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, 2010.
[36] D. Cai, X. He, Y. Hu, J. Han, T. Huang, "Learning a spatially smooth subspace for face recognition," Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), pp. 1-7, 2007.
[37] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Technical report, University of Toronto, 2009.

Kristina P. Sinaga received the B.S. and M.S. degrees in mathematics from the University of Sumatera Utara, Indonesia. She is a Ph.D. student at the Department of Applied Mathematics, Chung Yuan Christian University, Taiwan. Her research interests include clustering and pattern recognition.