Unsupervised K-Means Clustering Algorithm
ABSTRACT The k-means algorithm is generally the most known and used clustering method. Various extensions of k-means have been proposed in the literature. Although k-means is an unsupervised learning approach to clustering in pattern recognition and machine learning, the k-means algorithm and its extensions are always influenced by initializations and require the number of clusters to be given a priori. That is, the k-means algorithm is not exactly an unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initializations and parameter selection and can also simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed. Comparisons between the proposed U-k-means and other existing methods are made. Experimental results and comparisons actually demonstrate these good aspects of the proposed U-k-means clustering algorithm.
INDEX TERMS Clustering, K-means, Number of clusters, Initializations, Unsupervised learning schema,
Unsupervised k-means (U-k-means)
specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC or AIC, is used in the splitting process. Although these k-means clustering algorithms, such as those based on cluster validity indices and X-means, can find the number of clusters, they use extra iteration steps outside the clustering algorithm. To our knowledge, no work in the literature allows k-means to be free of initializations and parameter selection while also simultaneously finding the number of clusters. We suppose that this is due to the difficulty of constructing such a k-means algorithm.

In this paper, we first construct a learning procedure for the k-means clustering algorithm that can automatically find the number of clusters without any initialization or parameter selection. We first consider an entropy penalty term for adjusting bias, and then create a learning schema for finding the number of clusters. The organization of this paper is as follows. In Section II, we review some related works. In Section III, we construct the learning schema and then propose the unsupervised k-means (U-k-means) clustering algorithm, which automatically finds the number of clusters. The computational complexity of the proposed U-k-means algorithm is also analyzed. In Section IV, several experimental examples and comparisons with numerical and real data sets are provided to demonstrate the effectiveness of the proposed U-k-means clustering algorithm. Finally, conclusions are stated in Section V.

II. RELATED WORKS
In this section, we review several works that are closely related to ours. K-means is one of the most popular unsupervised learning algorithms for solving the well-known clustering problem. Let X = {x_1, ..., x_n} be a data set in a d-dimensional Euclidean space R^d, and let A = {a_1, ..., a_c} be the c cluster centers. Let z = [z_{ik}]_{n x c}, where z_{ik} is a binary variable (i.e., z_{ik} ∈ {0, 1}) indicating whether the data point x_i belongs to the k-th cluster, k = 1, ..., c. The k-means objective function is

J(z, A) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\|x_i - a_k\|^2.

The k-means algorithm iterates the necessary conditions for minimizing the objective function J(z, A), with updating equations for cluster centers and memberships, respectively, as

a_k = \sum_{i=1}^{n} z_{ik}\, x_i \Big/ \sum_{i=1}^{n} z_{ik}, \qquad z_{ik} = \begin{cases} 1 & \text{if } \|x_i - a_k\|^2 = \min_{1\le s\le c}\|x_i - a_s\|^2, \\ 0 & \text{otherwise}, \end{cases}

where \|x_i - a_k\| is the Euclidean distance between the data point x_i and the cluster center a_k. There exists a difficult problem in k-means, i.e., it needs the number of clusters to be given a priori. However, the number of clusters is generally unknown in real applications. Another problem is that the k-means algorithm is always affected by initializations.

To resolve the above issue of finding the number c of clusters, cluster validity has received much attention. There are several clustering validity indices available for estimating the number c of clusters. Clustering validity indices can be grouped into two major categories: external and internal [24]. External indices evaluate clustering results by comparing the cluster memberships assigned by a clustering algorithm with previously known knowledge, such as externally supplied class labels [25,26]. Internal indices, in contrast, evaluate the goodness of a cluster structure by focusing on the intrinsic information of the data itself [27], so we consider only internal indices. In this paper, the most widely used internal indices, namely the original Dunn's index (DNo) [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22], are chosen for finding the number of clusters and then compared with our proposed U-k-means clustering algorithm.

The DNo [16], DNg [21], and DNs [22] are supposed to be the simplest (internal) validity indices; they compare the size of clusters with the distance between clusters. The DNo, DNg, and DNs indices are computed as the ratio between the minimum distance between two clusters and the size of the largest cluster, so we look for the maximum index value. The Davies-Bouldin index (DB) [17] measures the average similarity between each cluster and its most similar one. The DB validity index attempts to maximize the between-cluster distances while minimizing the distance between the cluster centroid and the other data objects. The Silhouette value [18] measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Thus, large positive and large negative silhouette widths (SW) indicate that the corresponding object is well clustered and wrongly clustered, respectively. Objects with an SW value around zero are considered not to be clearly discriminated between clusters. The Gap statistic [20] is a cluster validity measure based upon a statistical hypothesis test. It works by comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution at each value of c. The optimal number of clusters is the smallest such c.
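The two-stage practice just reviewed, i.e., running k-means for each candidate number of clusters and then picking c by an internal validity index, is exactly the extra outer loop that the proposed U-k-means is designed to avoid. The following minimal sketch illustrates that classical pipeline with scikit-learn's k-means and the Silhouette Width; the candidate range and settings are illustrative assumptions, not the exact protocol of the cited works, and the other indices or BIC-based splitting follow the same pattern.

```python
# A minimal sketch of the classical two-stage procedure reviewed above:
# run k-means for each candidate c and select c by an internal validity
# index (here the Silhouette Width).  Hypothetical helper, not the exact
# protocol used in the cited papers.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_c_by_silhouette(X, c_min=2, c_max=10, random_state=0):
    best_c, best_sw = None, -np.inf
    for c in range(c_min, c_max + 1):
        labels = KMeans(n_clusters=c, n_init=10,
                        random_state=random_state).fit_predict(X)
        sw = silhouette_score(X, labels)   # in [-1, 1]; larger is better
        if sw > best_sw:
            best_c, best_sw = c, sw
    return best_c, best_sw

# Example usage with random data:
# X = np.random.rand(200, 2)
# c_star, sw = select_c_by_silhouette(X)
```

Note that the clustering runs themselves are repeated for every candidate c, which is the extra computational cost that a fully unsupervised scheme would remove.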
For an efficient method for estimating the number of clusters, X-means, proposed by Pelleg and Moore [23], should be the most well-known and widely used in the literature, for example in Witten et al. [28] and Guo et al. [29]. In X-means, Pelleg and Moore [23] extended k-means by making local decisions for cluster centers in each iteration of k-means, with centers splitting themselves to obtain better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC, is used in the splitting process. Although X-means has been the most widely used method for clustering without a given number of clusters a priori, it still needs a specified range of cluster numbers based on a criterion such as BIC. Moreover, it is still influenced by initializations of the algorithm. On the other hand, Rodriguez and Laio [30] proposed an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities.

III. THE UNSUPERVISED K-MEANS CLUSTERING ALGORITHM
There has always existed a difficult problem in the k-means algorithm and its extensions over their long history in the literature. That is, they are usually affected by initializations and require a given number of clusters a priori. We mentioned that the X-means algorithm has been used for clustering without a given number of clusters a priori, but it still needs a specified range of cluster numbers based on BIC, and it is still influenced by initializations. To construct a k-means clustering algorithm that is free of initializations and automatically finds the number of clusters, we use the entropy concept. We borrow the idea from the robust EM algorithm of Yang et al. [31]. We first consider the proportions \alpha_k, in which \alpha_k is seen as the probability of one data point belonging to the kth class. Hence, we use -\ln\alpha_k as the information in the occurrence of one data point belonging to the kth class, so that -\sum_{k=1}^{c}\alpha_k\ln\alpha_k becomes the average of this information. In fact, the term -\sum_{k=1}^{c}\alpha_k\ln\alpha_k is the entropy over the proportions \alpha_k. When \alpha_k = 1/c for all k, we say that there is no information about \alpha_k; at this point the entropy achieves its maximum value. Therefore, we add this term to the k-means objective function J(z, A) as a penalty. We then construct a schema to estimate \alpha_k by minimizing the objective function

J(z, A, \alpha) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c}\alpha_k\ln\alpha_k.   (1)

In order to determine the number of clusters, we next consider another entropy term. We combine the membership variables z_{ik} and the proportions \alpha_k. By using the basis of entropy theory, we suggest a new term in the form of z_{ik}\ln\alpha_k. Thus, we propose the unsupervised k-means (U-k-means) objective function as follows:

J_{UKM}(z, A, \alpha) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c}\alpha_k\ln\alpha_k - \gamma\sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\ln\alpha_k.   (2)

We know that, when \beta and \gamma in Eq. (2) are zero, it becomes the original k-means. The Lagrangian of Eq. (2) with the constraint \sum_{k=1}^{c}\alpha_k = 1 is

\tilde{J}(z, A, \alpha, \lambda) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c}\alpha_k\ln\alpha_k - \gamma\sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\ln\alpha_k - \lambda\Big(\sum_{k=1}^{c}\alpha_k - 1\Big).   (3)

The updating equation for the membership z_{ik} is

z_{ik} = \begin{cases} 1 & \text{if } \|x_i - a_k\|^2 - \gamma\ln\alpha_k = \min_{1\le s\le c}\big(\|x_i - a_s\|^2 - \gamma\ln\alpha_s\big), \\ 0 & \text{otherwise}. \end{cases}   (4)

The updating equation for the cluster center a_k is as follows:

a_k = \sum_{i=1}^{n} z_{ik}\, x_i \Big/ \sum_{i=1}^{n} z_{ik}.   (5)

We next take the partial derivative of the Lagrangian with respect to \alpha_k and obtain \partial\tilde{J}/\partial\alpha_k = -\beta n(\ln\alpha_k + 1) - \gamma\sum_{i=1}^{n} z_{ik}/\alpha_k - \lambda = 0, i.e., -\beta n\,\alpha_k(\ln\alpha_k + 1) - \gamma\sum_{i=1}^{n} z_{ik} - \lambda\alpha_k = 0. Summing this over k and using \sum_{k=1}^{c}\alpha_k = 1 and \sum_{k=1}^{c}\sum_{i=1}^{n} z_{ik} = n gives \lambda = -\beta n\sum_{k=1}^{c}\alpha_k\ln\alpha_k - \beta n - \gamma n. Substituting \lambda back, we get the updating equation for \alpha_k as follows:

\alpha_k^{(t+1)} = \sum_{i=1}^{n} z_{ik}/n + (\beta/\gamma)\,\alpha_k^{(t)}\Big(\ln\alpha_k^{(t)} - \sum_{s=1}^{c}\alpha_s^{(t)}\ln\alpha_s^{(t)}\Big),   (6)

where t denotes the iteration number in the algorithm.

We should mention that Eq. (6) is important for our proposed U-k-means clustering method. In Eq. (6), \sum_{s=1}^{c}\alpha_s\ln\alpha_s is the weighted mean of \ln\alpha_k with the weights \alpha_1, ..., \alpha_c. For the kth mixing proportion \alpha_k^{(t)}, if \ln\alpha_k^{(t)} is less than this weighted mean, then the second term of Eq. (6) is negative and \alpha_k^{(t+1)} is reduced, so small clusters gradually lose their mixing proportions during the iterations. The number of clusters is then updated by discarding these clusters:

c^{(t+1)} = c^{(t)} - \big|\{k : \alpha_k^{(t+1)} < 1/n,\ k = 1, ..., c^{(t)}\}\big|,   (7)

where |\{\cdot\}| denotes the cardinality of the set \{\cdot\}.
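To see how Eqs. (6) and (7) penalize small clusters, the following toy computation applies one \alpha-update to three clusters whose data shares are 0.6, 0.3 and 0.1, and then applies the discarding rule. The shares and the values \beta = 0.5, \gamma = 1 are made-up numbers chosen only to show the competition effect, not values produced by the algorithm.

```python
import numpy as np

# Toy illustration of Eqs. (6)-(7): one alpha-update followed by the
# discarding rule.  The data shares, beta and gamma below are made-up
# numbers chosen only to show the competition effect.
n = 10
share = np.array([0.6, 0.3, 0.1])   # (1/n) * sum_i z_ik for k = 1, 2, 3
alpha = np.array([0.6, 0.3, 0.1])   # current mixing proportions alpha_k^(t)
beta, gamma = 0.5, 1.0

wmean = np.sum(alpha * np.log(alpha))                    # sum_s alpha_s ln alpha_s
alpha_new = share + (beta / gamma) * alpha * (np.log(alpha) - wmean)
keep = alpha_new >= 1.0 / n                              # Eq. (7)

print(alpha_new)   # approx [0.716, 0.254, 0.030]: the smallest cluster shrinks
print(keep)        # [ True  True False]: cluster 3 falls below 1/n and is discarded
```

Because \sum_k \alpha_k (\ln\alpha_k - \sum_s \alpha_s\ln\alpha_s) = 0, the corrections across clusters cancel out, so the update only redistributes mass away from clusters whose proportion is below the weighted geometric average.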
After updating the number of clusters c, the remaining mixing proportions \alpha_k^* and the corresponding memberships z_{ik}^* need to be re-normalized as

\alpha_k^* = \alpha_k^* \Big/ \sum_{s=1}^{c}\alpha_s^*,   (8)

z_{ik}^* = z_{ik}^* \Big/ \sum_{s=1}^{c} z_{is}^*.   (9)

We next consider the parameter learning of \beta and \gamma for the two penalty terms -\beta n\sum_{k=1}^{c}\alpha_k\ln\alpha_k and -\gamma\sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\ln\alpha_k. For \beta we consider exponentially decreasing schedules of the form e^{-c^{(t)}/\tau}: the choice e^{-c^{(t)}/100} decreases faster, while e^{-c^{(t)}/500}, e^{-c^{(t)}/750} and e^{-c^{(t)}/1000} decrease more slowly. We suppose that the parameter \beta should decrease neither too slowly nor too fast, and so we set

\beta^{(t)} = e^{-c^{(t)}/250}.   (10)

Under this competition schema setting, the algorithm can automatically reduce the number of clusters and simultaneously obtain the estimates of the parameters. Furthermore, the parameter \gamma can help us control the competition, and we discuss it as follows. We consider inequalities on the two terms of Eq. (6), namely \max_{1\le k\le c}(1/n)\sum_{i=1}^{n} z_{ik}^{(t)} and \max_{1\le k\le c}\alpha_k^{(t)}\big(\ln\alpha_k^{(t)} - \sum_{s=1}^{c}\alpha_s^{(t)}\ln\alpha_s^{(t)}\big). If \gamma^{(t)} is chosen so that the sum of these two terms, with the second weighted by \beta/\gamma, is bounded by one, then the restriction \max_{1\le k\le c}\alpha_k^{(t+1)} \le 1 is held, and we obtain the updating equation for \gamma^{(t+1)} as a minimum over k = 1, ..., c (Eq. (14)). Because \gamma can jump at any time, we let \gamma = 0 when the cluster number c is stable; a stable cluster number c means that c is no longer decreasing. In our setting, we use all data points as the initial means, a_k = x_k, i.e., c_{initial} = n, and we use \alpha_k = 1/c_{initial}, k = 1, 2, ..., c_{initial}, as the initial mixing proportions. Thus, the proposed U-k-means clustering algorithm can be summarized as follows:

U-k-means clustering algorithm
Step 1: Fix \varepsilon > 0. Give the initial c^{(0)} = n, \alpha_k^{(0)} = 1/n, and the initial cluster centers a_k^{(0)} = x_k, k = 1, ..., n.
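Since the remaining steps of the summary are not reproduced here, the following sketch assembles Eqs. (4)-(10) into one plausible iteration loop. It is a reading of the update rules above, not the authors' reference implementation; in particular the initial value of \gamma, the tiny value substituted for \gamma once c stabilizes, and the convergence test against \varepsilon are assumptions.

```python
import numpy as np

def u_k_means(X, eps=1e-4, max_iter=300):
    """Minimal sketch of the U-k-means iteration assembled from Eqs. (4)-(10).
    This is a reading of the update rules above, not the authors' reference
    implementation; the stopping rule and the initial gamma are assumptions."""
    n, d = X.shape
    c = n                                  # Step 1: c^(0) = n
    A = X.astype(float).copy()             # a_k^(0) = x_k (all points as initial means)
    alpha = np.full(c, 1.0 / c)            # alpha_k^(0) = 1/n
    gamma = 1.0                            # assumed initial value
    labels = np.zeros(n, dtype=int)

    for _ in range(max_iter):
        beta = np.exp(-c / 250.0)          # Eq. (10)

        # Eq. (4): hard memberships with the -gamma * ln(alpha_k) adjustment.
        # (O(n*c*d) time and memory per pass; c = n at the first iteration.)
        dist2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dist2 - gamma * np.log(alpha), axis=1)
        Z = np.zeros((n, c))
        Z[np.arange(n), labels] = 1.0

        # Eq. (6): update the mixing proportions.
        share = Z.mean(axis=0)                         # (1/n) * sum_i z_ik
        wmean = float(np.sum(alpha * np.log(alpha)))   # sum_s alpha_s ln alpha_s
        alpha = share + (beta / gamma) * alpha * (np.log(alpha) - wmean)

        # Eq. (7): discard clusters with alpha_k < 1/n, then Eqs. (8)-(9).
        keep = alpha >= 1.0 / n
        c_new = int(keep.sum())
        alpha = alpha[keep] / alpha[keep].sum()
        Z, A = Z[:, keep], A[keep]

        # Eq. (5): update the centres of the surviving, non-empty clusters.
        A_old = A.copy()
        counts = Z.sum(axis=0)
        nonempty = counts > 0
        A[nonempty] = (Z.T @ X)[nonempty] / counts[nonempty, None]

        shift = float(np.max(np.linalg.norm(A - A_old, axis=1)))
        if c_new == c:
            gamma = 1e-12      # the paper lets gamma -> 0 once c is stable;
                               # a tiny value keeps the division in Eq. (6) defined
            if shift < eps:    # assumed convergence test with the fixed eps
                break
        c = c_new

    return labels, A, alpha
```

The tiny value substituted for \gamma once the cluster number stabilizes is only a numerical convenience for this sketch; the description above simply sets \gamma = 0 at that point.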
for the data point x_i belonging to the k-th cluster. If we compare the proposed U-k-means objective function J_{UKM}(z, A, \alpha) with the RL-FCM objective function J(U, \alpha, A), we find that, apart from the different membership representations \mu_{ik} and z_{ik}, the RL-FCM objective function J(U, \alpha, A) in Yang and Nataliani [33] has more extra terms and parameters, and so the RL-FCM algorithm is more complicated than the proposed U-k-means algorithm and has a longer running time. For the experimental results and comparisons in the next section, we make further comparisons of the proposed U-k-means algorithm with the RL-FCM algorithm. We also analyze the computational complexity of the U-k-means algorithm. In fact, the U-k-means algorithm can be divided into three parts: (1) compute the hard membership partition z_{ik} with O(ncd); (2) compute the mixing proportions \alpha_k with O(nc); (3) update the cluster centers a_k with O(n). The total computational complexity of the U-k-means algorithm is O(ncd), where n is the number of data points, c is the number of clusters, and d is the dimension of the data points. In comparison, the RL-FCM algorithm [33] has a total computational complexity of O(nc^2 d).

IV. EXPERIMENTAL RESULTS AND COMPARISONS
In this section we give some examples with numerical and real data sets to demonstrate the performance of the proposed U-k-means algorithm. We show the unsupervised learning behaviors used to obtain the best number c* of clusters for the U-k-means algorithm. Generally, most clustering algorithms, including k-means, are employed to give different numbers of clusters with associated cluster memberships, and these clustering results are then evaluated by multiple validity measures to determine the most practically plausible clustering result with the estimated number of clusters [13]. Thus, we first compare the U-k-means algorithm with the seven validity indices DNo [16], DNg [21], DNs [22], Gap statistic (Gap-stat) [20], DB [17], SW [18] and CH [19]. Furthermore, comparisons of the proposed U-k-means with k-means [8], robust EM (R-EM) [31], clustering by fast search (C-FS) [30], X-means [23], and RL-FCM [33] are also made. For measuring clustering performance, we use the accuracy rate (AR), AR = \sum_{k=1}^{c} n_{c_k} / n, where n_{c_k} is the number of data points correctly clustered into cluster k and n is the total number of data points. The larger the AR, the better the clustering performance.

Example 1 In this example, we use a data set of 400 data points generated from the 2-variate 6-component Gaussian mixture model f(x; \alpha, \theta) = \sum_{k=1}^{c}\alpha_k f(x; \theta_k), with mean vectors including \mu_4 = (6, 6)^T, \mu_5 = (10, 8)^T and \mu_6 = (7, 10)^T and covariance matrices \Sigma_k = [0.4, 0; 0, 0.4], k = 1, ..., 6, i.e., with 2 dimensions and 6 clusters, as shown in Fig. 1(a). We implement the proposed U-k-means clustering algorithm on the data set of Fig. 1(a), and it obtains the correct number c* = 6 of clusters with AR = 1.00, as shown in Fig. 1(f), after 11 iterations. The validity index values of CH, SW, DB, Gap statistic, DNo, DNg, and DNs are shown in Table I. All indices give the correct number c* = 6 of clusters, except DNg.

Moreover, we consider the data set with noisy points to show the performance of the proposed U-k-means algorithm in a noisy environment. We add 50 uniformly distributed noisy points to the data set of Fig. 1(a), as shown in Fig. 2(a). By implementing the U-k-means algorithm on the noisy data set of Fig. 2(a), it still obtains the correct number c* = 6 of clusters after 28 iterations with AR = 1.00, as shown in Fig. 2(b). The validity index values of CH, SW, DB, Gap-stat, DNo, DNg, and DNs for the noisy data set of Fig. 2(a) are shown in Table II. The five validity indices CH, DB, Gap-stat, DNo and DNs give the correct number of clusters, but SW and DNg give incorrect numbers of clusters.

FIGURE 1. (a) Original data set; (b)-(e) Processes of the U-k-means after 1, 2, 4, and 9 iterations; (f) Convergent results.
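The accuracy rates reported in these examples count, for each cluster, the points that are clustered correctly. The text does not spell out how clusters are matched to the ground-truth classes; the sketch below assumes an optimal one-to-one matching computed with the Hungarian method, which is one common convention and therefore an assumption, not necessarily the authors' procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy_rate(y_true, y_pred):
    """AR = (sum_k n_ck) / n, assuming clusters are matched to classes by an
    optimal one-to-one assignment (an assumption; the paper does not specify)."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # confusion[i, j] = number of points of class i placed in cluster j
    confusion = np.zeros((classes.size, clusters.size), dtype=int)
    for i, cl in enumerate(classes):
        for j, ck in enumerate(clusters):
            confusion[i, j] = np.sum((y_true == cl) & (y_pred == ck))
    row, col = linear_sum_assignment(-confusion)   # maximize matched counts
    return confusion[row, col].sum() / y_true.size
```

In recent SciPy versions, linear_sum_assignment also accepts rectangular matrices, so the same helper works when the estimated number of clusters c* differs from the number of classes.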
TABLE III
RESULTS OF THE SEVEN VALIDITY INDICES
Optimal number of clusters (true c = 14):
CH: 14 (60%) | SW: 14 (60%) | DB: 14 (60%) | Gap-stat: 14 (64%) | DNo: 14 (20%) | DNg: 2, 4, 5, 10, 11 | DNs: 14 (20%)

FIGURE 2. (a) 6-cluster data set with 50 noisy points; (b) Final results from U-k-means.
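For the noise experiment of Fig. 2, a minimal sketch of one way to append uniformly distributed noise points is given below. The text only says that 50 uniformly noisy points are added, so drawing them over the bounding box of the data is an assumption.

```python
import numpy as np

def add_uniform_noise(X, n_noise=50, rng=None):
    """Append n_noise points drawn uniformly over the bounding box of X
    (an assumed convention; the text only says '50 uniformly noisy points')."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    noise = rng.uniform(lo, hi, size=(n_noise, X.shape[1]))
    return np.vstack([X, noise])
```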
Lung. While for the Nci9 data set, the U-k-means algorithm obtains c* = 8, which is very close to the true c = 9. In terms of AR, the U-k-means algorithm performs significantly better than the others. The R-EM algorithm estimates the correct number of clusters on SPECT. However, it underestimates the number of clusters on Parkinsons and overestimates the number of clusters on WPBC. We also report that the results of R-EM on the Colon, Lung and Nci9 data sets are missing because the probabilities of one data point belonging to the kth class on these data sets are known to be illegitimate proportions at the first iteration. The C-FS algorithm performs better than k-means+Gap-stat and X-means. The RL-FCM algorithm estimates the correct number of clusters c for the SPECT, Parkinsons, and WPBC data sets, while it overestimates the number of clusters on Colon, Lung and Nci9 with c* = 62, c* = 9, and c* = 60, respectively.
TABLE IV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE DATA SET OF FIG. 3(A)
Columns: True c | U-k-means (c*/AR) | R-EM (c*/AR) | C-FS (c*/AV-AR) | k-means with true c (AV-AR) | X-means (c*/AV-AR) | RL-FCM (c*/AR)
TABLE V
MIXING PROPORTIONS, MEAN VALUES AND COVARIANCE MATRICES OF EXAMPLE 3
Component 1: alpha_1 = 0.2; mean/covariance entries: 2 4 6 0 0 0 0 1 1 1 0 0 0 0 0 3 5 0 0 1
Component 2: alpha_2 = 0.3; mean/covariance entries: 0 1 3 5 0.1 0.1 0.5 0.5 0 0 2 4 3 1 1 1 0.25 0.5 0.7 2.5
Component 3: alpha_3 = 0.1; mean/covariance entries: 5 5 5 5 4 4 4 4 6 6 6 6 8 8 8 8 1 1 1 1
Component 4: alpha_4 = 0.1; mean/covariance entries: 2 2 2 2 2 1 1 1 1 1 3 3 3 3 3 7 7 7 7 7
Component 5: alpha_5 = 0.2; mean/covariance entries: 1.25 1.3 1.45 1.5 2.25 2.3 2.45 2.5 1 1 1 1 3 3 3 3 2 2 2 2
Component 6: alpha_6 = 0.1; mean/covariance entries: 0 0 1 1 0.5 0.5 2.5 2.5 5 5 1 1 5 5 0 0 0.75 1.5 3.5 5.5
TABLE VI
RESULTS OF THE SEVEN VALIDITY INDICES FOR THE DATA SET OF EXAMPLE 3
Optimal number of clusters obtained by each index (true c = 6):
CH: 6 (88%) | SW: 6 (88%) | DB: 2, 3 | Gap-stat: 6 (88%) | DNo: 6 (16%) | DNg: 6 (8%) | DNs: 6 (12%)
TABLE VII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 3
Columns: True c | U-k-means (c*/AR) | R-EM (c*/AR) | C-FS (c*/AR) | k-means with true c (AV-AR) | X-means (c*/AR) | RL-FCM (c*/AR)
6 | 6 / 1.00 | 3 / - | 6 (84%) / 0.8155 | 0.7833 | 6 (100%) / 1.00 | 3 / -
TABLE VIII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 4
Columns: True c | U-k-means (c*/AR) | R-EM (c*/AR) | C-FS (c*/AR) | k-means with true c (AV-AR) | X-means (c*/AR) | RL-FCM (c*/AR)
TABLE IX
DESCRIPTIONS OF THE EIGHT DATA SETS USED IN EXAMPLE 5
Dataset Feature Characteristics Number c of clusters Number n of instances Number d of features
Iris Real 3 150 4
Seeds Real 3 210 7
Australian Categorical, Integer, Real 2 690 14
Flowmeter D Real 4 180 43
Sonar Real 2 208 60
Wine Integer, Real 3 178 13
Horse Categorical, Integer, Real 2 368 27
Waveform (Version 1) Real 3 5000 21
TABLE X
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR DIFFERENT REAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
Columns: Data set | True c | U-k-means (c*/AR) | R-EM (c*/AR) | C-FS (c*/AR) | k-means with true c (AV-AR) | k-means + Gap-stat (c*/AV-AR) | X-means (c*/AV-AR) | RL-FCM (c*/AR)
Iris | 3 | 3 / 0.8933 | 3 / 0.8600 | 3 (84%) / 0.7521 | 0.7939 | 4, 5 / - | 2 / - | 3 / 0.9067
Seeds | 3 | 3 / 0.9048 | 3 / 0.8476 | 3 (100%) / 0.7944 | 0.8864 | 3 (100%) / 0.8952 | 3 (100%) / 0.890 | 3 / 0.8952
Australian | 2 | 2 / 0.5551 | 4 / - | 2 (100%) / 0.5551 | 0.5551 | 6 / - | 6 / - | 26 / -
Flowmeter D | 4 | 4 / 0.6056 | 3 / - | 4 (100%) / 0.4338 | 0.5833 | 9, 10 / - | 10 / - | 13 / -
Sonar | 2 | 2 / 0.5337 | 5 / - | 2 (80%) / 0.4791 | 0.4791 | 5, 6 / - | 3, 4 / - | 4 / -
Wine | 3 | 3 / 0.7022 | 2 / - | 3 (100%) / 0.5557 | 0.6851 | 2 / - | 3 (64%) / 0.62 | 2 / -
Horse | 2 | 2 / 0.6576 | 4, 6, 8, 10, 14 / - | 2 (100%) / 0.6033 | 0.6055 | 3 / - | 2 (88%) / 0.50 | 7 / -
Waveform (Version 1) | 3 | 3 / 0.4020 | 1 / - | 2 / - | 0.3900 | 1 / - | 8 / - | 3 / 0.3972
TABLE XI
DESCRIPTIONS OF THE SIX MEDICAL DATA SETS USED IN EXAMPLE 6
TABLE XII
RESULTS FROM VARIOUS ALGORITHMS FOR THE SIX MEDICAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
Columns: Data set | True c | U-k-means (c*/AR) | R-EM (c*/AV-AR) | C-FS (c*/AR) | k-means with true c (AR) | k-means + Gap-stat (c*/AR) | X-means (c*/AV-AR) | RL-FCM (c*/AV-AR)
SPECT | 2 | 2 / 0.920 | 2 / 0.562 | 2 (84%) / 0.8408 | 0.5262 | 5, 6 / - | 2 (100%) / 0.5119 | 2 / 0.588
Parkinsons | 2 | 2 / 0.754 | 1 / - | 2 (100%) / 0.7436 | 0.5183 | 2 (100%) / 0.62 | 4, 5 / - | 2 / 0.754
WPBC | 2 | 2 / 0.763 | 198 / - | 2 (100%) / 0.7576 | 0.5927 | 4 / - | 3 / - | 2 / 0.763
TABLE XIII
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR DIFFERENT REAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
Columns: Data set | True c | U-k-means (c*/AR) | R-EM (c*/AR) | C-FS (c*/AV-AR) | k-means with true c (AV-AR) | X-means (c*/AV-AR) | RL-FCM (c*/AV-AR)
Yale Face | 15 | 16 / - | - / - | 12 / - | 0.34 | 2, 3 / - | 2 / -
TABLE XIV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE 100 IMAGES SAMPLE OF THE CIFAR-10 DATA SET
Columns: Data set | True c | U-k-means | R-EM | C-FS | k-means with true c | X-means | RL-FCM
CIFAR-10 (100 images) | 10 | c* = 10 (42.5%), AV-AR = 0.28 | - | c* = 10 (16.7%), AV-AR = 0.24 | AV-AR = 0.28 | c* = 2 | -
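Table XIV summarizes the CIFAR-10 experiment described in Example 8 below. The text does not state which feature representation is fed to the clustering algorithms; the sketch below shows the simplest plausible choice, flattening each 32x32 RGB image into a 3072-dimensional raw-pixel vector scaled to [0, 1], which is an assumption rather than the authors' stated preprocessing.

```python
import numpy as np

def images_to_vectors(images):
    """Flatten an array of shape (n, 32, 32, 3) with uint8 pixel values into
    (n, 3072) float vectors in [0, 1] -- an assumed preprocessing step; the
    text does not specify the feature representation used."""
    images = np.asarray(images, dtype=np.float64)
    return images.reshape(len(images), -1) / 255.0
```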
TABLE XV
COMPARISON OF AVERAGE RUNNING TIMES (IN SECONDS) OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, AND RL-FCM FOR ALL DATA SETS. THE FASTEST RUNNING TIMES ARE HIGHLIGHTED
Data sets | U-k-means | R-EM | C-FS | RL-FCM
Synthetic data sets
Example 1 | 0.3842 | 4.8921 | 5.8050 | 1.3688
Example 2 | 2.9185 | 13.6157 | 7.3559 | 6.0444
Example 3 | 2.1625 | 2.7938 | 10.2817 | 3.2924
Example 4 | 117.2595 | 742.14 | 35.6417 | 438.047
UCI data sets
Iris | 0.2159 | 1.1842 | 6.31581 | 0.4184
Seeds | 0.1455 | 2.0400 | 5.2702 | 0.4472
Australian | 2.0434 | 5.8039 | 6.1772 | 2.3829
Flowmeter D | 0.2834 | 0.6969 | 5.6230 | 0.3054
Sonar | 0.1747 | 0.3148 | 5.8564 | 0.3963
Wine | 0.1980 | 1.4837 | 5.8094 | 0.3060
Horse | 0.6072 | 2.5989 | 5.3442 | 0.6272
Waveform | 330.748 | - | 113.8162 | 474.165
Medical data sets
SPECT | 0.1354 | 0.7211 | 5.9079 | 0.3411
Parkinsons | 0.1487 | 0.5856 | 4.9534 | 0.3958
WPBC | 0.1512 | 0.7922 | 5.2152 | 0.4036
Colon | 0.1653 | - | 4.9608 | 0.2676
Lung | 1.1239 | - | 5.2485 | 1.1167
Nci9 | 0.6186 | - | 6.4794 | 0.5096
Image data sets
Yale Face 32x32 | 0.3741 | - | 5.9634 | 0.4286
CIFAR-10 | 2.6561 | - | 6.4500 | -

Example 8 In this example, we apply the U-k-means clustering algorithm to the CIFAR-10 color images [37]. The CIFAR-10 data set consists of 60000 32x32 color images in 10 classes, i.e., each pixel is an RGB triplet of unsigned bytes between 0 and 255. There are 50000 training images and 10000 test images, and each red, green, and blue channel contains 1024 entries. The 10 classes in the data set are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Specifically, we take the first 100 color images (10 images per class) from the CIFAR-10 60K images data set for our experiment; the remaining 59900 images serve as the retrieval database. Fig. 6 shows the 100 images sampled from the CIFAR-10 images data set. The results for the number of clusters and AR are given in Table XIV. From Table XIV, it is seen that the proposed U-k-means and k-means with the true c = 10 give better results on the 100 images sample of the CIFAR-10 data set. The U-k-means obtains the correct number c* = 10 of clusters in 42.5% of runs with AV-AR = 0.28, and k-means with c = 10 gives the same AV-AR = 0.28. For C-FS, the percentage of runs with the correct number c* = 10 of clusters is only 16.7%, with AV-AR = 0.24. X-means underestimates the number of clusters with c* = 2. The results from R-EM and RL-FCM on this data set are missing because the probabilities of one data point belonging to the kth class on this data set are known to be illegitimate proportions at the first iteration.

We further analyze the performance of U-k-means, R-EM, C-FS, and RL-FCM by comparing their average running times over 25 runs, as shown in Table XV. All algorithms are implemented in MATLAB 2017b. From Table XV, it is seen that the proposed U-k-means is the fastest on all data sets among these algorithms, except that the C-FS algorithm is the fastest for the Waveform data set. Furthermore, in Section III, we mentioned that the proposed U-k-means objective function is simpler than the RL-FCM objective function, which saves running time. From Table XV, it is seen that the proposed U-k-means algorithm actually runs faster than the RL-FCM algorithm.

V. CONCLUSIONS
In this paper we propose a new schema with a learning framework for the k-means clustering algorithm. We adopt the merit of entropy-type penalty terms to construct a competition schema. The proposed U-k-means algorithm uses the number of points as the initial number of clusters to solve the initialization problem. During iterations, the U-k-means algorithm discards extra clusters, so that an optimal number of clusters can be automatically found according to the structure of the data. The advantages of U-k-means are that it is free of initializations and parameter selection, is robust to different cluster volumes and shapes, and automatically finds the number of clusters. The proposed U-k-means algorithm was performed on several synthetic and real data sets and also compared with most existing algorithms, such as the R-EM, C-FS, k-means with the true number c, k-means+gap, and X-means algorithms. The results actually demonstrate the superiority of the U-k-means clustering algorithm.

REFERENCES
[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[2] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley, 1990.
[3] G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker, 1988.
[4] A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," J. Roy. Stat. Soc., Ser. B, vol. 39, pp. 1-38, 1977.
[5] J. Yu, C. Chaomurilige, M.S. Yang, "On convergence and parameter selection of the EM and DA-EM algorithms for Gaussian mixtures," Pattern Recognition, vol. 77, pp. 188-203, 2018.
[6] A.K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, pp. 651-666, 2010.
[7] M.S. Yang, S.J. Chang-Chien and Y. Nataliani, "A fully-unsupervised possibilistic c-means clustering method," IEEE Access, vol. 6, pp. 78308-78320, 2018.
[8] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, University of California Press, 1967.
[9] M. Alhawarat and M. Hegazi, "Revisiting k-means and topic modeling, a comparison study to cluster arabic documents," IEEE Access, vol. 6, pp. 42740-42749, 2018.
[10] Y. Meng, J. Liang, F. Cao, Y. He, "A new distance with derivative information for functional k-means clustering algorithm," Information Sciences, vol. 463-464, pp. 166-185, 2018.
[11] Z. Lv, T. Liu, C. Shi, J.A. Benediktsson, H. Du, "Novel land cover change detection method based on k-means clustering and adaptive majority voting using bitemporal remote sensing images," IEEE Access, vol. 7, pp. 34425-34437, 2019.
[12] J. Zhu, Z. Jiang, G.D. Evangelidis, C. Zhang, S. Panga, Z. Li, "Efficient registration of multi-view point sets by k-means clustering," Information Sciences, vol. 488, pp. 205-218, 2019.
[13] M. Halkidi, Y. Batistakis, M. Vazirgiannis, "On clustering validation techniques," J. Intell. Inf. Syst., vol. 17, pp. 107-145, 2001.
[14] R.E. Kass, A.E. Raftery, "Bayes factors," Journal of the American Statistical Association, vol. 90, pp. 773-795, 1995.
[15] H. Bozdogan, "Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions," Psychometrika, vol. 52, pp. 345-370, 1987.
[16] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters," J. Cybernetics, vol. 3, pp. 32-57, 1974.
[17] D. Davies and D. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224-227, 1979.
[18] P.J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
[19] T. Calinski, J. Harabasz, "A dendrite method for cluster analysis," Commun. Stat.-Theory Methods, vol. 3, pp. 1-27, 1974.
[20] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society: Series B, vol. 63, pp. 411-423, 2001.
[21] N.R. Pal and J. Biswas, "Cluster validation using graph theoretic concepts," Pattern Recognition, vol. 30, pp. 847-857, 1997.
[22] N. Ilc, "Modified Dunn's cluster validity index based on graph theory," Przeglad Elektrotechniczny (Electrical Review), vol. 2, pp. 126-131, 2012.
[23] D. Pelleg, A. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," Proc. of the 17th International Conference on Machine Learning, pp. 727-734, San Francisco, 2000.
[24] E. Rendon, I. Abundez, A. Arizmendi, E.M. Quiroz, "Internal versus external cluster validation indexes," Int. J. Computers and Communications, vol. 5, pp. 27-34, 2011.
[25] Y. Lei, J.C. Bezdek, S. Romani, N.X. Vinh, J. Chan, J. Bailey, "Ground truth bias in external cluster validity indices," Pattern Recognition, vol. 65, pp. 58-70, 2017.
[26] J. Wu, J. Chen, H. Xiong, M. Sie, "External validation measures for k-means clustering: a data distribution perspective," Expert Syst. Appl., vol. 36, pp. 6050-6061, 2009.
[27] L.J. Deborah, R. Baskaran, A. Kannan, "A survey on internal validity measure for cluster validation," Int. J. Comput. & Eng. Surv., vol. 1, pp. 85-102, 2010.
[28] I.H. Witten, E. Frank, M.A. Hall and C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2000.
[29] G. Guo, L. Chen, Y. Ye and Q. Jiang, "Cluster validation method for determining the number of clusters in categorical sequences," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2936-2948, 2017.
[30] A. Rodriguez, A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[31] M.S. Yang, C.Y. Lai and C.Y. Lin, "A robust EM clustering algorithm for Gaussian mixture models," Pattern Recognition, vol. 45, pp. 3950-3961, 2012.
[32] M.A.T. Figueiredo, A.K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 381-396, 2002.
[33] M.S. Yang and Y. Nataliani, "Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters," Pattern Recognition, vol. 71, pp. 45-59, 2017.
[34] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, a huge collection of artificial and real-world data sets, 1998. https://archive.ics.uci.edu/ml/datasets.html
[35] D. Cai, X. He, J. Han and T.S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, 2010.
[36] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, "Learning a spatially smooth subspace for face recognition," Proceedings of IEEE Conference on Computer Vision