
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2988796, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier XXXXXX

Unsupervised K-Means Clustering Algorithm
Kristina P. Sinaga and Miin-Shen Yang
Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li 32023, Taiwan
Corresponding author: Miin-Shen Yang (e-mail: [email protected]).
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 107-2118-M-033-002-MY2.

ABSTRACT The k-means algorithm is generally the best known and most widely used clustering method. Various extensions of k-means have been proposed in the literature. Although k-means is an unsupervised learning approach to clustering in pattern recognition and machine learning, the k-means algorithm and its extensions are always influenced by initializations and require the number of clusters a priori. That is, the k-means algorithm is not exactly an unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initializations and parameter selection and can also simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed. Comparisons between the proposed U-k-means and other existing methods are made. Experimental results and comparisons actually demonstrate these good aspects of the proposed U-k-means clustering algorithm.

INDEX TERMS Clustering, K-means, Number of clusters, Initializations, Unsupervised learning schema,
Unsupervised k-means (U-k-means)
I. INTRODUCTION
Clustering is a useful tool in data science. It is a method for finding cluster structure in a data set that is characterized by the greatest similarity within the same cluster and the greatest dissimilarity between different clusters. Hierarchical clustering was the earliest clustering method used by biologists and social scientists, whereas cluster analysis became a branch of statistical multivariate analysis [1,2]. Clustering is also an unsupervised learning approach in machine learning. From a statistical viewpoint, clustering methods are generally divided into probability model-based approaches and nonparametric approaches. The probability model-based approaches assume that the data points are drawn from a mixture probability model, so that a mixture likelihood approach to clustering is used [3]. In model-based approaches, the expectation-maximization (EM) algorithm is the most used [4,5]. For nonparametric approaches, clustering methods are mostly based on an objective function of similarity or dissimilarity measures, and these can be divided into hierarchical and partitional methods, where partitional methods are the most used [2,6,7].

In general, partitional methods suppose that the data set can be represented by finite cluster prototypes with their own objective functions. Therefore, defining the dissimilarity (or distance) between a point and a cluster prototype is essential for partitional methods. It is known that the k-means algorithm is the oldest and most popular partitional method [1,8]. The k-means clustering has been widely studied with various extensions in the literature and applied in a variety of substantive areas [9,10,11,12]. However, these k-means clustering algorithms are usually affected by initializations and need to be given a number of clusters a priori. In general, the cluster number is unknown. In this case, validity indices can be used to find a cluster number, where they are supposed to be independent of clustering algorithms [13]. Many cluster validity indices for the k-means clustering algorithm have been proposed in the literature, such as the Bayesian information criterion (BIC) [14], Akaike information criterion (AIC) [15], Dunn's index [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22].

For estimating the number of clusters, Pelleg and Moore [23] extended k-means, called X-means, by making local decisions for cluster centers in each iteration of k-means, splitting them to get better clustering. Users need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC or AIC, is used to do the splitting process. Although these k-means clustering algorithms, such as cluster validity indices and X-means, can find the number of clusters, they use extra iteration steps outside the clustering algorithms. As far as we know, no work in the literature allows k-means to be free of initializations and parameter selection and also simultaneously find the number of clusters. We suppose that this is due to the difficulty of constructing this kind of k-means algorithm.

In this paper, we first construct a learning procedure for the k-means clustering algorithm. This learning procedure can automatically find the number of clusters without any initialization and parameter selection. We first consider an entropy penalty term for adjusting bias, and then create a learning schema for finding the number of clusters. The organization of this paper is as follows. In Section II, we review some related works. In Section III, we first construct the learning schema and then propose the unsupervised k-means clustering (U-k-means) with automatic determination of the number of clusters. The computational complexity of the proposed U-k-means algorithm is also analyzed. In Section IV, several experimental examples and comparisons with numerical and real data sets are provided to demonstrate the effectiveness of the proposed U-k-means clustering algorithm. Finally, conclusions are stated in Section V.

II. RELATED WORKS
In this section, we review several works that are closely related to ours. K-means is one of the most popular unsupervised learning algorithms that solve the well-known clustering problem. Let X = {x_1, ..., x_n} be a data set in a d-dimensional Euclidean space R^d. Let A = {a_1, ..., a_c} be the c cluster centers. Let z = [z_ik]_{n x c}, where z_ik is a binary variable (i.e., z_ik in {0, 1}) indicating whether the data point x_i belongs to the k-th cluster, k = 1, ..., c. The k-means objective function is

    J(z, A) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 .

The k-means algorithm is iterated through the necessary conditions for minimizing the k-means objective function J(z, A), with updating equations for cluster centers and memberships, respectively, as

    a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}   and   z_{ik} = \begin{cases} 1 & \text{if } \| x_i - a_k \|^2 = \min_{1 \le k' \le c} \| x_i - a_{k'} \|^2 \\ 0 & \text{otherwise,} \end{cases}

where \| x_i - a_k \| is the Euclidean distance between the data point x_i and the cluster center a_k. There exists a difficult problem in k-means, i.e., it needs to be given a number of clusters a priori. However, the number of clusters is generally unknown in real applications. Another problem is that the k-means algorithm is always affected by initializations.
To resolve the above issue of finding the number c of clusters, cluster validity issues have received much attention. There are several clustering validity indices available for estimating the number c of clusters. Clustering validity indices can be grouped into two major categories: external and internal [24]. External indices evaluate clustering results by comparing the cluster memberships assigned by a clustering algorithm with previously known knowledge, such as externally supplied class labels [25,26]. In contrast, internal indices evaluate the goodness of a cluster structure by focusing on the intrinsic information of the data itself [27], so we consider only internal indices. In this paper, the most widely used internal indices, such as the original Dunn's index (DNo) [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22], are chosen for finding the number of clusters and then compared with our proposed U-k-means clustering algorithm.

The DNo [16], DNg [21], and DNs [22] are supposed to be the simplest (internal) validity indices, where they compare the size of clusters with the distance between clusters. The DNo, DNg, and DNs indices are computed as the ratio between the minimum distance between two clusters and the size of the largest cluster, so we are looking for the maximum value of the index. The Davies-Bouldin index (DB) [17] measures the average similarity between each cluster and its most similar one. The DB validity index attempts to maximize these between-cluster distances while minimizing the distance between the cluster centroid and the other data objects. The Silhouette value [18] is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Thus, large positive and large negative silhouette widths (SW) indicate that the corresponding object is well clustered or wrongly clustered, respectively. Any objects with an SW validity index around zero are considered not to be clearly discriminated between clusters. The Gap statistic [20] is a cluster validity measure based upon a statistical hypothesis test. The Gap statistic works by comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution at each value of c. The optimal number of clusters is the smallest c.
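As a rough illustration of how such internal indices are used in practice, the sketch below scores k-means partitions for a range of candidate cluster numbers with three of the indices mentioned above (SW, CH, and DB) using scikit-learn. It is a generic index-based selection loop under assumed names (X, c_max), not the exact protocol used in the experiments of Section IV.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def select_c_by_indices(X, c_max=10, n_init=25, seed=0):
    """Run k-means for c = 2..c_max and report SW, CH, and DB for each c."""
    scores = {}
    for c in range(2, c_max + 1):
        labels = KMeans(n_clusters=c, n_init=n_init, random_state=seed).fit_predict(X)
        scores[c] = {
            "SW": silhouette_score(X, labels),          # larger is better
            "CH": calinski_harabasz_score(X, labels),   # larger is better
            "DB": davies_bouldin_score(X, labels),      # smaller is better
        }
    best_by_sw = max(scores, key=lambda c: scores[c]["SW"])
    return scores, best_by_sw
```

Note that each index may favor a different c, which is one reason Section IV reports all seven indices side by side.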
For an efficient method for determining the number of clusters, X-means, proposed by Pelleg and Moore [23], should be the most well known and used in the literature, for example by Witten et al. [28] and Guo et al. [29]. In X-means, Pelleg and Moore [23] extended k-means by making local decisions for cluster centers in each iteration of k-means, splitting them to get better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC, is used to do the splitting process. Although X-means has been the most used method for clustering without a given number of clusters a priori, it still needs a specified range of cluster numbers based on a criterion, such as BIC. On the other hand, it is still influenced by the initializations of the algorithm. Furthermore, Rodriguez and Laio [30] proposed an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities, which they called clustering by fast search (C-FS) and find of density peaks. To identify the cluster centers, C-FS uses the heuristic approach of a decision graph. However, the performance of C-FS highly depends on two factors, i.e., the local density ρ_i and the cutoff distance δ_i.
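The criterion-driven selection that X-means relies on can be illustrated, in a simplified form, by fitting models over a range of cluster numbers and keeping the one with the lowest BIC. The sketch below uses scikit-learn's GaussianMixture as a stand-in scorer; it is not the X-means splitting procedure of [23], only a minimal example of BIC-based model selection under assumed names (X, c_min, c_max).

```python
from sklearn.mixture import GaussianMixture

def select_c_by_bic(X, c_min=2, c_max=15, seed=0):
    """Fit a Gaussian mixture for each candidate c and keep the lowest-BIC model."""
    best_c, best_bic = None, float("inf")
    for c in range(c_min, c_max + 1):
        gmm = GaussianMixture(n_components=c, random_state=seed).fit(X)
        bic = gmm.bic(X)   # lower BIC indicates a better trade-off of fit vs. complexity
        if bic < best_bic:
            best_c, best_bic = c, bic
    return best_c, best_bic
```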
III. THE UNSUPERVISED K-MEANS CLUSTERING ALGORITHM
There has always existed a difficult problem in the k-means algorithm and its extensions throughout their long history in the literature. That is, they are usually affected by initializations and require a given number of clusters a priori. We mentioned that the X-means algorithm has been used for clustering without a given number of clusters a priori, but it still needs to specify a range of numbers of clusters based on BIC, and it is still influenced by initializations. To construct a k-means clustering algorithm that is free of initializations and automatically finds the number of clusters, we use the entropy concept. We borrow the idea from the EM algorithm by Yang et al. [31]. We first consider proportions α_k, in which the term α_k is seen as the probability of one data point belonging to the k-th class. Hence, we use -ln α_k as the information in the occurrence of one data point belonging to the k-th class, and so \sum_{k=1}^{c} α_k (-\ln α_k) becomes the average of this information. In fact, the term -\sum_{k=1}^{c} α_k \ln α_k is the entropy over the proportions α_k. When α_k = 1/c for all k, we say that there is no information about α_k. At this point, the entropy achieves its maximum value. Therefore, we add this term to the k-means objective function J(z, A) as a penalty. We then construct a schema to estimate α_k by minimizing the entropy to get the most information about α_k. Minimizing -\sum_{k=1}^{c} α_k \ln α_k is equivalent to maximizing \sum_{k=1}^{c} α_k \ln α_k. For this reason, we use -\sum_{k=1}^{c} α_k \ln α_k as a penalty term for the k-means objective function J(z, A). Thus, we propose a novel objective function as follows:

    J_{UKM1}(z, A, α) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 - β \sum_{k=1}^{c} n α_k \ln α_k    (1)

In order to determine the number of clusters, we next consider another entropy term. We combine the membership variables z_ik and the proportions α_k. Using the basis of entropy theory, we suggest a new term in the form of z_ik ln α_k. Thus, we propose the unsupervised k-means (U-k-means) objective function as follows:

    J_{UKM2}(z, A, α) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 - β \sum_{k=1}^{c} n α_k \ln α_k - γ \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln α_k    (2)
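To make the roles of the two penalty terms in Eq. (2) concrete, the following small helper evaluates the U-k-means objective J_UKM2 for a given hard partition, set of centers, and mixing proportions. It is only a sketch following the formula above, with illustrative names (X, Z, A, alpha, beta, gamma).

```python
import numpy as np

def j_ukm2(X, Z, A, alpha, beta, gamma):
    """Evaluate Eq. (2): distortion minus the two entropy-type penalty terms."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)   # (n, c) squared distances
    distortion = np.sum(Z * d2)                               # sum_i sum_k z_ik ||x_i - a_k||^2
    log_alpha = np.log(alpha)
    entropy_term = beta * n * np.sum(alpha * log_alpha)       # beta * sum_k n alpha_k ln(alpha_k)
    membership_term = gamma * np.sum(Z * log_alpha[None, :])  # gamma * sum_i sum_k z_ik ln(alpha_k)
    return distortion - entropy_term - membership_term
```

With beta = gamma = 0 this reduces to the plain k-means objective J(z, A), as noted after Eq. (2).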

We know that, when β and γ in Eq. (2) are zero, it becomes the original k-means objective. The Lagrangian of Eq. (2) is

    \tilde{J}(z, A, α, λ) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 - β \sum_{k=1}^{c} n α_k \ln α_k - γ \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln α_k - λ \left( \sum_{k=1}^{c} α_k - 1 \right)    (3)

We first take the partial derivative of the Lagrangian (3) with respect to z_ik and set it to zero. Thus, the updating equation for z_ik is obtained as follows:

    z_{ik} = \begin{cases} 1 & \text{if } \| x_i - a_k \|^2 - γ \ln α_k = \min_{1 \le k' \le c} \left( \| x_i - a_{k'} \|^2 - γ \ln α_{k'} \right) \\ 0 & \text{otherwise.} \end{cases}    (4)

The updating equation for the cluster center a_k is as follows:

    a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}    (5)

We next take the partial derivative of the Lagrangian with respect to α_k and obtain

    \frac{\partial \tilde{J}}{\partial α_k} = - β n \ln α_k - β n - γ \frac{\sum_{i=1}^{n} z_{ik}}{α_k} - λ = 0.

Multiplying by α_k and summing over k with \sum_{k=1}^{c} α_k = 1, we have - β n \sum_{k=1}^{c} α_k \ln α_k - β n - γ n - λ = 0, and so λ = - β n \sum_{k=1}^{c} α_k \ln α_k - β n - γ n. Substituting λ back and solving for α_k, we then get the updating equation for α_k as follows:

    α_k^{(t+1)} = \frac{\sum_{i=1}^{n} z_{ik}}{n} + \frac{β}{γ} α_k^{(t)} \left( \ln α_k^{(t)} - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right)    (6)

where t denotes the iteration number in the algorithm.

We should mention that Eq. (6) created above is important for our proposed U-k-means clustering method. In Eq. (6), \sum_{s=1}^{c} α_s \ln α_s is the weighted mean of \ln α_k with the weights α_1, ..., α_c. For the k-th mixing proportion α_k, if \ln α_k^{(t)} is less than the weighted mean, then the new mixing proportion α_k^{(t+1)} will become smaller than the old α_k^{(t)}. That is, the smaller proportions will decrease and the bigger proportions will increase in the next iteration, and then competition will occur. This situation is similar to the formula in Figueiredo and Jain [32]. If α_k \le 0 or α_k < 1/n for some 1 \le k \le c^{(t)}, they are considered to be illegitimate proportions. In this situation, we discard those clusters and then update the cluster number c^{(t)} to be

    c^{(t+1)} = c^{(t)} - \left| \left\{ k \mid α_k^{(t+1)} < 1/n,\; k = 1, \dots, c^{(t)} \right\} \right|    (7)

where |{ }| denotes the cardinality of the set { }. After updating the number of clusters c, the remaining mixing proportions α_{k*} and the corresponding z_{ik*} need to be renormalized by

    α_{k*}^{(t+1)} = \frac{α_{k*}}{\sum_{s=1}^{c^{(t+1)}} α_{s*}}    (8)

    z_{ik*}^{(t+1)} = \frac{z_{ik*}}{\sum_{s=1}^{c^{(t+1)}} z_{is*}}    (9)

We next turn to the parameter learning of β and γ for the two terms \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln α_k and \sum_{k=1}^{c} α_k \ln α_k. Based on some increasing learning rates of the cluster number, e^{-c^{(t)}/100}, e^{-c^{(t)}/250}, e^{-c^{(t)}/500}, e^{-c^{(t)}/750}, and e^{-c^{(t)}/1000}, it is seen that e^{-c^{(t)}/100} decreases faster, but e^{-c^{(t)}/500}, e^{-c^{(t)}/750}, and e^{-c^{(t)}/1000} decrease more slowly. We suppose that the parameter should not decrease too slowly or too quickly, and so we set the parameter γ as

    γ^{(t)} = e^{-c^{(t)}/250}    (10)

Under the competition schema setting, the algorithm can automatically reduce the number of clusters and also simultaneously obtain the estimates of the parameters.
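Putting Eqs. (6)-(10) together, one iteration of the proportion learning can be sketched as follows. This is a minimal NumPy illustration of the competition-and-discard step under the naming used above (Z, alpha, beta, gamma, c); it is one reading of the formulas, not the authors' released code.

```python
import numpy as np

def update_proportions(Z, alpha, beta, gamma):
    """Eqs. (6)-(9): compete the mixing proportions, discard tiny clusters, renormalize."""
    Z = Z.astype(float)
    n = Z.shape[0]
    weighted_mean = np.sum(alpha * np.log(alpha))              # sum_s alpha_s ln(alpha_s)
    alpha_new = Z.sum(axis=0) / n + (beta / gamma) * alpha * (np.log(alpha) - weighted_mean)  # Eq. (6)
    keep = alpha_new >= 1.0 / n                                # Eq. (7): drop illegitimate proportions
    alpha_new, Z = alpha_new[keep], Z[:, keep]
    alpha_new = alpha_new / alpha_new.sum()                    # Eq. (8): renormalize proportions
    row_sum = Z.sum(axis=1, keepdims=True)
    Z = np.divide(Z, row_sum, out=np.zeros_like(Z), where=row_sum > 0)  # Eq. (9): renormalize memberships
    return Z, alpha_new, keep

def gamma_schedule(c):
    """Eq. (10): gamma^(t) = exp(-c^(t) / 250)."""
    return float(np.exp(-c / 250.0))
```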

 
Furthermore, the parameter β can help us control the competition. We discuss the variable β as follows. We first apply the rule β \ge e^{-1} ( - α_k \ln α_k ) \ge 0. If 0 \le α_k \le 1 for all k and we let E = - \sum_{s=1}^{c} α_s \ln α_s \ge 0, then we have β \ge e^{-1} α_k E = e^{-1} α_k ( - \sum_{s=1}^{c} α_s \ln α_s ) \ge 0. Thus, we obtain

    β \ge e^{-1} α_k \left( \ln α_k - \sum_{s=1}^{c} α_s \ln α_s \right)    (11)

Under the constraint \sum_{k=1}^{c} α_k = 1, and only when α_k \ge 1/2, we can have ( \ln α_k - \sum_{s=1}^{c} α_s \ln α_s ) \ge 0. To avoid the situation where all α_k \to 0, the left-hand side of inequality (11) must be larger than \max \{ α_k \mid α_k \ge 1/2,\; k = 1, 2, \dots, c \}. We now have an elementary condition on β with β \ge e^{-1} \max \{ α_k \mid α_k \ge 1/2,\; k = 1, 2, \dots, c \}. Thus, we have β \ge \max \{ α_k e^{-1} \mid α_k \ge 1/2,\; k = 1, 2, \dots, c \} \ge e^{-1}/2. Therefore, to prevent β from becoming too large, we use β \in [0, 1]. If the difference between α_k^{(t+1)} and α_k^{(t)} is small, then β must become large to enhance the competition. If the difference between α_k^{(t+1)} and α_k^{(t)} is large, then β will become small to maintain stability. Thus, we estimate β with

    β = \frac{\sum_{k=1}^{c^{(t)}} \exp \left( - η\, n \left| α_k^{(t+1)} - α_k^{(t)} \right| \right)}{c^{(t)}}    (12)

where η = \min ( 1, \lfloor d/2 - 1 \rfloor ), \lfloor a \rfloor represents the largest integer that is no more than a, and t denotes the iteration number in the algorithm.

On the other hand, we consider the inequalities

    \max_{1 \le k \le c} α_k^{(t+1)} \le \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} + \frac{β}{γ} \max_{1 \le k \le c} α_k^{(t)} \left( \ln \max_{1 \le k \le c} α_k^{(t)} - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right) \le \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} + \frac{β}{γ} \max_{1 \le k \le c} α_k^{(t)} \left( - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right).

If \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} + \frac{β}{γ} \max_{1 \le k \le c} α_k^{(t)} ( - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} ) \le 1, then the restriction \max_{1 \le k \le c} α_k^{(t+1)} \le 1 holds, and then we obtain

    \frac{β}{γ} \le \frac{1 - \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik}}{\max_{1 \le k \le c} α_k^{(t)} \left( - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right)}    (13)

According to Eqs. (12) and (13), we can get

    β^{(t+1)} = \min \left( \frac{\sum_{k=1}^{c^{(t)}} \exp \left( - η\, n \left| α_k^{(t+1)} - α_k^{(t)} \right| \right)}{c^{(t)}},\; \frac{γ^{(t+1)} \left( 1 - \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} \right)}{ - \max_{1 \le k \le c} α_k^{(t)} \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)}} \right)    (14)

Because β can jump at any time, we let β = 0 when the cluster number c is stable. When the cluster number c is stable, it means that c is no longer decreasing.
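A direct transcription of the β rule in Eqs. (12)-(14) is given below. The eta value follows the definition quoted above, and all names (alpha_new, alpha_old, Z, gamma, d) are illustrative; treat this as a sketch of one plausible reading of the update, not a verified reference implementation.

```python
import numpy as np

def update_beta(alpha_new, alpha_old, Z, gamma, d):
    """Eqs. (12)-(14): adaptive competition weight beta, capped by the bound in Eq. (13)."""
    n, c = Z.shape
    eta = min(1, int(np.floor(d / 2.0 - 1)))                               # as defined after Eq. (12)
    beta_12 = np.exp(-eta * n * np.abs(alpha_new - alpha_old)).sum() / c   # Eq. (12)
    neg_weighted_mean = -np.sum(alpha_old * np.log(alpha_old))             # -sum_s alpha_s ln(alpha_s)
    bound_13 = gamma * (1.0 - (Z.sum(axis=0) / n).max()) \
               / max(alpha_old.max() * neg_weighted_mean, 1e-12)           # Eq. (13)
    return min(beta_12, bound_13)                                          # Eq. (14)
```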
In our setting, we use all data points as the initial cluster centers, i.e., a_k = x_k with c_initial = n, and we use α_k = 1/c_initial for all k = 1, 2, ..., c_initial as the initial mixing proportions. Thus, the proposed U-k-means clustering algorithm can be summarized as follows:

U-k-means clustering algorithm

Step 1: Fix ε > 0. Give initial c^(0) = n, α_k^(0) = 1/n, a_k^(0) = x_k (k = 1, ..., n), and initial learning rates β^(0) = γ^(0) = 1. Set t = 0.
Step 2: Compute z_ik^(t+1) using a_k^(t), α_k^(t), c^(t), β^(t), and γ^(t) by (4).
Step 3: Compute γ^(t+1) by (10).
Step 4: Update α_k^(t+1) with z_ik^(t+1) and α_k^(t) by (6).
Step 5: Compute β^(t+1) with γ^(t+1) and α_k^(t) by (14).
Step 6: Update c^(t) to c^(t+1) by discarding those clusters with α_k^(t+1) < 1/n, and adjust α_k^(t+1) and z_ik^(t+1) by (8) and (9).
        IF t ≥ 60 and c^(t-60) − c^(t) = 0, THEN let β^(t+1) = 0.
Step 7: Update a_k^(t+1) with c^(t+1) and z_ik^(t+1) by (5).
Step 8: Compare a_k^(t+1) and a_k^(t).
        IF max_{1 ≤ k ≤ c^(t+1)} || a_k^(t+1) − a_k^(t) || < ε, THEN stop.
        ELSE let t = t + 1 and return to Step 2.
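The listing below strings the steps above into a single runnable NumPy sketch. It follows Eqs. (4)-(14) as reconstructed in this section (hard assignments, proportion competition, cluster discarding, and the γ and β schedules); the variable names and small numerical safeguards are our own additions, so it should be read as an illustration of the procedure rather than as the authors' reference code.

```python
import numpy as np

def u_k_means(X, eps=1e-4, max_iter=200, stable_window=60):
    """Sketch of the U-k-means procedure: start with c = n clusters and let them compete."""
    n, d = X.shape
    A = X.copy()                             # Step 1: every point is an initial center
    alpha = np.full(n, 1.0 / n)
    c, beta, gamma = n, 1.0, 1.0
    eta = min(1, int(np.floor(d / 2.0 - 1)))
    history = [c]

    for t in range(max_iter):
        # Step 2 (Eq. 4): hard assignment with the -gamma*ln(alpha_k) bias
        d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        cost = d2 - gamma * np.log(alpha)[None, :]
        labels = cost.argmin(axis=1)
        Z = np.eye(c)[labels]

        # Step 3 (Eq. 10): gamma schedule
        gamma = np.exp(-c / 250.0)

        # Step 4 (Eq. 6): proportion competition
        wmean = np.sum(alpha * np.log(alpha))
        alpha_new = Z.sum(axis=0) / n + (beta / gamma) * alpha * (np.log(alpha) - wmean)

        # Step 5 (Eqs. 12-14): adaptive beta, capped by the bound in Eq. (13)
        beta_12 = np.exp(-eta * n * np.abs(alpha_new - alpha)).sum() / c
        bound_13 = gamma * (1.0 - (Z.sum(axis=0) / n).max()) / max(alpha.max() * (-wmean), 1e-12)
        beta = min(beta_12, bound_13)

        # Step 6 (Eqs. 7-9): discard illegitimate clusters and renormalize
        keep = alpha_new >= 1.0 / n
        A, alpha = A[keep], alpha_new[keep]
        c = int(keep.sum())
        alpha = np.clip(alpha / alpha.sum(), 1e-12, None)
        history.append(c)
        if len(history) > stable_window and history[-stable_window - 1] == c:
            beta = 0.0                       # freeze the competition once c is stable

        # Step 7 (Eq. 5): update the surviving centers
        labels = (d2[:, keep] - gamma * np.log(alpha)[None, :]).argmin(axis=1)
        A_old = A.copy()
        for k in range(c):
            if np.any(labels == k):
                A[k] = X[labels == k].mean(axis=0)

        # Step 8: stop when the centers stabilize
        if np.max(np.abs(A - A_old)) < eps:
            break
    return labels, A, alpha, c
```

Starting from c = n and shrinking the cluster count as the proportions compete is the behavior the paper reports in Figs. 1(b)-1(f) and Figs. 4(b)-4(f).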
Before we analyze the computational complexity of the proposed U-k-means algorithm, we give a brief review of another clustering algorithm that also used the idea from the EM algorithm by Yang et al. [31]. This is the robust-learning fuzzy c-means (RL-FCM) algorithm proposed by Yang and Nataliani [33]. In Yang and Nataliani [33], they gave the RL-FCM objective function J(U, α, A) with memberships μ_ik that are not binary variables but fuzzy c-memberships with 0 ≤ μ_ik ≤ 1 and \sum_{k=1}^{c} μ_ik = 1, indicating the fuzzy membership of the data point x_i in the k-th cluster. If we compare the proposed U-k-means objective function J_UKM2(z, A, α) with the RL-FCM objective function J(U, α, A), we find that, apart from μ_ik and z_ik being different membership representations, the RL-FCM objective function J(U, α, A) in Yang and Nataliani [33] has more extra terms and parameters, and so the RL-FCM algorithm is more complicated than the proposed U-k-means algorithm, with more running time. For the experimental results and comparisons in the next section, we make more comparisons of the proposed U-k-means algorithm with the RL-FCM algorithm.

We also analyze the computational complexity of the U-k-means algorithm. In fact, the U-k-means algorithm can be divided into three parts: (1) compute the hard membership partition z_ik with O(ncd); (2) compute the mixing proportions α_k with O(nc); and (3) update the cluster centers a_k with O(nd). The total computational complexity of the U-k-means algorithm is O(ncd), where n is the number of data points, c is the number of clusters, and d is the dimension of the data points. Compared with the RL-FCM algorithm [33], the RL-FCM has a total computational complexity of O(nc^2 d).

IV. EXPERIMENTAL RESULTS AND COMPARISONS
In this section we give some examples with numerical and real data sets to demonstrate the performance of the proposed U-k-means algorithm. We show these unsupervised learning behaviors to get the best number c* of clusters for the U-k-means algorithm. Generally, most clustering algorithms, including k-means, are employed to give different numbers of clusters with associated cluster memberships, and then these clustering results are evaluated by multiple validity measures to determine the most practically plausible clustering results with the estimated number of clusters [13]. Thus, we first compare the U-k-means algorithm with the seven validity indices, DNo [16], DNg [21], DNs [22], Gap statistic (Gap-stat) [20], DB [17], SW [18], and CH [19]. Furthermore, comparisons of the proposed U-k-means with k-means [8], robust EM (R-EM) [31], clustering by fast search (C-FS) [30], X-means [23], and RL-FCM [33] are also made. For measuring clustering performance, we use the accuracy rate (AR) with AR = \sum_{k=1}^{c} n(c_k) / n, where n(c_k) is the number of data points that are correctly clustered into cluster k and n is the total number of data points. The larger the AR, the better the clustering performance.
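The AR defined above presumes a correspondence between the obtained clusters and the ground-truth classes. One common way to realize this (our illustration, not necessarily the exact matching used by the authors) is to pick the cluster-to-class assignment that maximizes the number of correctly placed points, e.g., with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy_rate(y_true, y_pred):
    """AR = (correctly clustered points under the best cluster-to-class matching) / n."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: counts[i, j] = points of class i placed in cluster j
    counts = np.array([[np.sum((y_true == ci) & (y_pred == cj)) for cj in clusters]
                       for ci in classes])
    row, col = linear_sum_assignment(-counts)   # maximize the matched counts
    return counts[row, col].sum() / len(y_true)
```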

Example 1. In this example, we use a data set of 400 data points generated from the 2-variate 6-component Gaussian mixture model f(x; α, θ) = \sum_{k=1}^{6} α_k f(x; θ_k) with parameters α_k = 1/6, k = 1, ..., 6, μ_1 = (5, 2)^T, μ_2 = (3, 4)^T, μ_3 = (8, 4)^T, μ_4 = (6, 6)^T, μ_5 = (10, 8)^T, μ_6 = (7, 10)^T, and Σ_k = diag(0.4, 0.4) for k = 1, ..., 6, i.e., with 2 dimensions and 6 clusters, as shown in Fig. 1(a). We implement the proposed U-k-means clustering algorithm for the data set of Fig. 1(a), and it obtains the correct number c* = 6 of clusters with AR = 1.00 after 11 iterations, as shown in Fig. 1(f). The validity index values of CH, SW, DB, Gap statistic, DNo, DNg, and DNs are shown in Table I. All indices give the correct number c* = 6 of clusters, except DNg.

Moreover, we consider the data set with noisy points to show the performance of the proposed U-k-means algorithm in a noisy environment. We add 50 uniformly distributed noisy points to the data set of Fig. 1(a), as shown in Fig. 2(a). By implementing the U-k-means algorithm on the noisy data set of Fig. 2(a), it still obtains the correct number c* = 6 of clusters after 28 iterations with AR = 1.00, as shown in Fig. 2(b). The validity index values of CH, SW, DB, Gap-stat, DNo, DNg, and DNs for the noisy data set of Fig. 2(a) are shown in Table II. The five validity indices CH, DB, Gap-stat, DNo, and DNs give the correct number of clusters, but SW and DNg give incorrect numbers of clusters.

FIGURE 1. (a) Original data set; (b)-(e) Processes of the U-k-means after 1, 2, 4, and 9 iterations; (f) Convergent results.
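For readers who want to reproduce a data set like the one in Example 1, the following snippet draws 400 points from the stated 2-variate 6-component mixture; the random seed is ours and arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[5, 2], [3, 4], [8, 4], [6, 6], [10, 8], [7, 10]], dtype=float)
cov = np.diag([0.4, 0.4])
n_points = 400

# Equal mixing proportions alpha_k = 1/6: sample a component, then a Gaussian point
components = rng.integers(0, 6, size=n_points)
X = np.vstack([rng.multivariate_normal(means[k], cov) for k in components])
```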
TABLE I
VALIDITY INDEX VALUES OF CH, SW, DB, GAP-STAT, DNO, DNG, AND DNS FOR THE DATA SET OF FIG. 1(A)

c    CH      SW      DB      Gap-stat   DNo     DNg     DNs
2    0.511   0.680   0.772   0.183      0.008   6.587   0.001
3    0.553   0.649   0.866   0.388      0.047   2.603   0.019
4    0.605   0.715   0.700   0.469      0.040   1.603   0.016
5    0.754   0.743   0.571   0.619      0.041   4.619   0.020
6    1.277   0.838   0.483   1.067      0.102   4.635   0.048
7    1.155   0.773   0.634   0.991      0.060   0.794   0.022
8    1.054   0.703   0.808   0.930      0.030   0.571   0.004

TABLE II
VALIDITY INDEX VALUES OF CH, SW, DB, GAP-STAT, DNO, DNG, AND DNS FOR THE NOISY DATA SET

c    CH      SW      DB      Gap-stat   DNo     DNg     DNs
2    508.2   0.662   0.792   0.827      0.005   4.547   0.001
3    523.4   0.615   0.913   0.835      0.034   1.828   0.012
4    526.9   0.655   0.748   0.719      0.033   1.752   0.011
5    637.9   0.697   0.607   0.914      0.028   3.397   0.008
6    902.6   0.771   0.538   1.237      0.052   1.502   0.013
7    864.4   0.783   0.558   1.173      0.042   0.797   0.008
8    837.7   0.766   0.666   1.143      0.019   0.497   0.002

FIGURE 2. (a) 6-cluster dataset with 50 noisy points; (b) Final results from U-k-means.

Example 2. In this example, we consider a data set of 800 data points generated from a 3-variate 14-component Gaussian mixture, i.e., with 3 dimensions and 14 clusters, as shown in Fig. 3(a). To estimate the number c of clusters, we use CH, SW, DB, Gap-stat, DNo, DNg, and DNs. To create the results of the seven validity indices, we consider the k-means algorithm with 25 different initializations. The estimated numbers of clusters from CH, SW, DB, Gap statistic, DNo, DNg, and DNs with percentages are shown in Table III. It is seen that all validity indices can give the correct number c* = 14 of clusters, except DNg, where the Gap-stat index gives the highest percentage of the correct number c* = 14 of clusters with 64%. We also implement the proposed U-k-means for the data set, and then compare it with the R-EM, C-FS, k-means with the true number of clusters, X-means, and RL-FCM clustering algorithms. We mention that U-k-means, R-EM, and RL-FCM are free of parameter selection, but the others depend on parameter selection for finding the number of clusters. Table IV shows the comparison results of the U-k-means, R-EM, C-FS, k-means with the true cluster number c = 14, X-means, and RL-FCM algorithms. Note that the C-FS, k-means with the true number of clusters, and X-means algorithms depend on initializations or parameter selection, and so we consider their average AR (AV-AR) under different initializations or parameter selections. From Table IV, it is seen that the proposed U-k-means, R-EM, and RL-FCM clustering algorithms are able to find the correct number of clusters c* = 14 with AR = 1.00, while C-FS obtains the correct c* = 14 in 96% of runs with AV-AR = 0.9772. The k-means with the true c gives AV-AR = 0.8160. The X-means obtains the correct c* = 14 in 76% of runs with AV-AR = 1.00. Note that the numbers in parentheses indicate the percentage of runs obtaining the correct number of clusters under 25 different initial values.

FIGURE 3. (a) 14-cluster dataset; (b) Final results from U-k-means.

TABLE III
RESULTS OF THE SEVEN VALIDITY INDICES

True c   CH         SW         DB         Gap-stat   DNo        DNg               DNs
14       14 (60%)   14 (60%)   14 (60%)   14 (64%)   14 (20%)   2, 4, 5, 10, 11   14 (20%)

FIGURE 4. (a) 9-diamonds data set; (b)-(e) Results of the U-k-means after 1, 3, 5, and 7 iterations; (f) Final results of the U-k-means after 11 iterations.

Example 3. To examine the effectiveness of the proposed U-k-means for finding the number of clusters, we generate a data set of 900 data points from a 20-variate 6-component Gaussian mixture model. The mixing proportions, mean values, and covariance matrices of the Gaussian mixture model are listed in Table V. The validity indices CH, SW, DB, Gap-stat, DNo, DNg, and DNs are used to estimate the number c of clusters. The k-means algorithm with 25 different initializations is used to create the results of the seven validity indices. The estimated numbers of clusters from the seven validity indices with percentages are shown in Table VI, where the parentheses indicate the percentages of validity indices giving the correct number of clusters under 25 different initial values. It is seen that CH, SW, and Gap-stat give the correct number c* = 6 of clusters with the highest percentage. We also implement the U-k-means and compare it with the R-EM, C-FS, k-means with the true number c, X-means, and RL-FCM algorithms. The obtained numbers of clusters and ARs of these algorithms are shown in Table VII. As can be seen, the proposed U-k-means, C-FS, and X-means correctly find the number of clusters for the data set. The R-EM and RL-FCM underestimate the number of clusters for the data set. Both U-k-means and X-means get the best AR.

Example 4. In this example, we consider a synthetic data set of non-spherical shape with 3000 data points, as shown in Fig. 4(a). The U-k-means is implemented for this data set, with the clustering results shown in Figs. 4(b)-4(f). The U-k-means algorithm decreases the number of clusters from 3000 to 2132 after the first iteration. From Figs. 4(b)-4(f), it is seen that the U-k-means algorithm exhibits a fast decrease in the number of clusters. After 11 iterations, the U-k-means algorithm obtains its convergent result with c* = 9 and AR = 1.00, as shown in Fig. 4(f). We next compare the proposed U-k-means algorithm with R-EM, C-FS, k-means with the true c, X-means, and RL-FCM. All the experiments are performed 25 times with parameter selection, where the average AR results under the correct number of clusters are reported in Table VIII. As shown in Table VIII, U-k-means gives the correct number c* = 9 of clusters with AR = 1.00, followed by k-means with the true c = 9, which achieves an average AR = 0.9190, and C-FS with c* = 9 (96%), which achieves an average AR = 0.7641. R-EM overestimates the number of clusters with c* = 12, while X-means and RL-FCM underestimate the number of clusters with c* = 2.

We next consider real data sets. These data sets are from the UCI Machine Learning Repository [34].

Example 5. In this example, we use eight real data sets from the UCI Machine Learning Repository [34], known as Iris, Seeds, Australian credit approval, Flowmeter D, Sonar, Wine, Horse, and Waveform (version 1). Detailed information on these data sets, such as feature characteristics, the number c of classes, the number n of instances, and the number d of features, is listed in Table IX. Since the data features in Seeds, Flowmeter D, Wine, and Waveform (version 1) are distributed over different ranges, and the data features in Australian (credit approval) are of mixed feature types, we first preprocess the data matrices using a matrix factorization technique [35]. This preprocessing technique can put these data on a uniform footing to get good-quality clusters and improve the accuracy rates of clustering algorithms. Clustering results from the U-k-means, R-EM, C-FS, k-means with the true c, k-means+Gap-stat, X-means, and RL-FCM algorithms for the different real data sets are shown in Table X, where the best results are presented in boldface. It is seen that the proposed U-k-means gives the best result in estimating the number c of clusters and accuracy rate among them, except for the Australian data. The C-FS algorithm gives the correct numbers of clusters for the Iris, Seeds, Australian, Flowmeter D, Sonar, Wine, and Horse data sets, while it underestimates the number of clusters for the Waveform data set with c* = 2. The X-means algorithm only obtains the correct number of clusters for the Seeds, Wine, and Horse data sets. The R-EM obtains the correct number of clusters for the Iris and Seeds data sets. The k-means+Gap-stat only obtains a correct number of clusters for the Seeds data set. The RL-FCM algorithm obtains the correct number of clusters for the Iris, Seeds, and Waveform (version 1) data sets. Note that the results in parentheses are the percentages of runs in which the algorithms get the correct number c of clusters.
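Examples 5 and 6 mention a matrix-factorization preprocessing step in the spirit of [35]. As a stand-in illustration only (plain non-negative matrix factorization via scikit-learn, not the graph-regularized NMF of [35]), such a step might look like this:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import minmax_scale

def nmf_preprocess(X, n_components=10, seed=0):
    """Rescale features to [0, 1] and replace them with a low-rank NMF representation."""
    X_scaled = minmax_scale(X)                 # NMF requires non-negative input
    model = NMF(n_components=n_components, init="nndsvda", random_state=seed, max_iter=500)
    return model.fit_transform(X_scaled)       # (n, n_components) coefficient matrix
```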

Example 6. In this example, we use six medical data sets from the UCI Machine Learning Repository [34], known as SPECT, Parkinsons, WPBC, Colon, Lung, and Nci9. Detailed descriptions of these data sets, with feature characteristics, the number c of classes, the number n of instances, and the number d of features, are listed in Table XI. In this experiment, we first preprocess the SPECT, Parkinsons, WPBC, Colon, and Lung data sets using the matrix factorization technique. We also conduct experiments to compare the proposed U-k-means with R-EM, C-FS, k-means with the true c, k-means+Gap-stat, X-means, and RL-FCM. The results are shown in Table XII. For C-FS, k-means with the true c, k-means+Gap-stat, and X-means, we run experiments with 25 different initializations and report their results with the average AR (AV-AR) and the percentages of runs obtaining the correct number c of clusters, as shown in Table XII. It is seen that the proposed U-k-means gets the correct number of clusters for SPECT, Parkinsons, WPBC, Colon, and Lung, while for the Nci9 data set the U-k-means algorithm gets c* = 8 clusters, which is very close to the true c = 9. In terms of AR, the U-k-means algorithm performs significantly better than the others. The R-EM algorithm estimates the correct number of clusters on SPECT. However, it underestimates the number of clusters on Parkinsons and overestimates the number of clusters on WPBC. We also report that the results of R-EM on the Colon, Lung, and Nci9 data sets are missing because the probabilities of data points belonging to the k-th class on these data sets become illegitimate proportions at the first iteration. The C-FS algorithm performs better than k-means+Gap-stat and X-means. The RL-FCM algorithm estimates the correct number of clusters c for the SPECT, Parkinsons, and WPBC data sets, while RL-FCM overestimates the number of clusters on Colon, Lung, and Nci9 with c* = 62, c* = 9, and c* = 60, respectively.

TABLE IV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE DATA SET OF FIG. 3(A)

True c   U-k-means       R-EM            C-FS                  k-means with true c   X-means               RL-FCM
         c*      AR      c*      AR      c*         AV-AR      AV-AR                 c*         AV-AR      c*      AR
14       14      1.00    14      1.00    14 (96%)   0.9772     0.8160                14 (76%)   1.00       14      1.00

TABLE V
MIXING PROPORTIONS, MEAN VALUES AND COVARIANCE MATRICES OF EXAMPLE 3
(20-variate 6-component Gaussian mixture with mixing proportions α_1 = 0.2, α_2 = 0.3, α_3 = 0.1, α_4 = 0.1, α_5 = 0.2, and α_6 = 0.1; mean vectors and covariance matrices omitted.)

TABLE VI
RESULTS OF THE SEVEN VALIDITY INDICES FOR THE DATA SET OF EXAMPLE 3

True c   CH        SW        DB      Gap-stat   DNo       DNg      DNs
6        6 (88%)   6 (88%)   2, 3    6 (88%)    6 (16%)   6 (8%)   6 (12%)

TABLE VII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 3

True c   U-k-means       R-EM          C-FS                 k-means with true c   X-means              RL-FCM
         c*      AR      c*     AR     c*         AR        AV-AR                 c*          AR       c*     AR
6        6       1.00    3      -      6 (84%)    0.8155    0.7833                6 (100%)    1.00     3      -

TABLE VIII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 4

True c   U-k-means       R-EM          C-FS                 k-means with true c   X-means          RL-FCM
         c*      AR      c*     AR     c*         AV-AR     AV-AR                 c*     AV-AR     c*     AR
9        9       1.00    12     -      9 (96%)    0.7641    0.9190                2      -         2      -

TABLE IX
DESCRIPTIONS OF THE EIGHT DATA SETS USED IN EXAMPLE 5

Dataset                Feature Characteristics      Number c of clusters   Number n of instances   Number d of features
Iris                   Real                         3                      150                     4
Seeds                  Real                         3                      210                     7
Australian             Categorical, Integer, Real   2                      690                     14
Flowmeter D            Real                         4                      180                     43
Sonar                  Real                         2                      208                     60
Wine                   Integer, Real                3                      178                     13
Horse                  Categorical, Integer, Real   2                      368                     27
Waveform (Version 1)   Real                         3                      5000                    21

TABLE X
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR DIFFERENT REAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE

Data set               True c   U-k-means      R-EM          C-FS                k-means       k-means+Gap-stat     X-means             RL-FCM
                                c*    AR       c*    AR      c*         AR       with true c   c*          AR       c*         AV-AR    c*    AV-AR
Iris                   3        3     0.8933   3     0.8600  3 (84%)    0.7521   0.7939        4, 5        -        2          -        3     0.9067
Seeds                  3        3     0.9048   3     0.8476  3 (100%)   0.7944   0.8864        3 (100%)    0.8952   3 (100%)   0.890    3     0.8952
Australian             2        2     0.5551   4     -       2 (100%)   0.5551   0.5551        6           -        6          -        26    -
Flowmeter D            4        4     0.6056   3     -       4 (100%)   0.4338   0.5833        9, 10       -        10         -        13    -
Sonar                  2        2     0.5337   5     -       2 (80%)    0.4791   0.4791        5, 6        -        3, 4       -        4     -
Wine                   3        3     0.7022   2     -       3 (100%)   0.5557   0.6851        2           -        3 (64%)    0.62     2     -
Horse                  2        2     0.6576   4,6,8,10,14 - 2 (100%)   0.6033   0.6055        3           -        2 (88%)    0.50     7     -
Waveform (Version 1)   3        3     0.4020   1     -       2          -        0.3900        1           -        8          -        3     0.3972

TABLE XI
DESCRIPTIONS OF THE SIX MEDICAL DATA SETS USED IN EXAMPLE 6

Dataset      Feature Characteristics    Number c of clusters   Number n of instances   Number d of features
SPECT        Categorical                2                      187                     22
Parkinsons   Real                       2                      195                     22
WPBC         Real                       2                      198                     33
Colon        Discrete, Binary           2                      62                      2000
Lung         Continuous, Multi-class    5                      203                     3312
Nci9         Discrete, Multi-class      9                      60                      9712

TABLE XII
RESULTS FROM VARIOUS ALGORITHMS FOR THE SIX MEDICAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE

Data set     True c   U-k-means     R-EM           C-FS                k-means       k-means+Gap-stat    X-means              RL-FCM
                      c*    AR      c*    AV-AR    c*         AR       with true c   c*          AV-AR   c*         AV-AR     c*    AV-AR
SPECT        2        2     0.920   2     0.562    2 (84%)    0.8408   0.5262        5, 6        -       2 (100%)   0.5119    2     0.588
Parkinsons   2        2     0.754   1     -        2 (100%)   0.7436   0.5183        2 (100%)    0.62    4, 5       -         2     0.754
WPBC         2        2     0.763   198   -        2 (100%)   0.7576   0.5927        4           -       3          -         2     0.763
Colon        2        2     0.645   -     -        2 (100%)   0.5813   0.4768        4           -       2 (100%)   0.45      62    -
Lung         5        5     0.788   -     -        5 (100%)   0.6859   0.6818        4, 6, 7, 8  -       2          -         9     -
Nci9         9        8     -       -     -        2, 4       -        0.32          2           -       2          -         60    -

TABLE XIII
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR THE YALE FACE DATA SET WITH THE BEST RESULTS IN BOLDFACE

Data set    True c   U-k-means    R-EM        C-FS           k-means with true c   X-means        RL-FCM
                     c*    AR     c*   AR     c*    AV-AR    AV-AR                 c*     AV-AR   c*    AV-AR
Yale Face   15       16    -      -    -      12    -        0.34                  2, 3   -       2     -

TABLE XIV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE 100-IMAGE SAMPLE OF THE CIFAR-10 DATA SET

Data set   True c   U-k-means             R-EM        C-FS                  k-means with true c   X-means       RL-FCM
                    c*           AV-AR    c*   AR     c*           AV-AR    AV-AR                 c*    AV-AR   c*    AV-AR
CIFAR-10   10       10 (42.5%)   0.311    -    -      10 (3.03%)   0.295    0.280                 2     -       -     -

Example 7. In this example, we apply the U-k-means clustering algorithm to the Yale Face 32x32 data set, as shown in Fig. 5. It has 165 grayscale images in GIF format of 15 individuals [36]. There are 11 images per subject with different facial expressions or configurations: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. In the experiment, we use 135 of the 165 grayscale images. The results from the different algorithms are shown in Table XIII. From Table XIII, although U-k-means cannot correctly estimate the true number c = 15 of clusters for the Yale face data set, it gives c* = 16 clusters, which is close to the true c = 15. The result of the R-EM algorithm is missing because the probabilities of data points belonging to the k-th class on this data set become illegitimate proportions at the first iteration. The C-FS gives c* = 12 and X-means gives c* = 2 or 3. The k-means clustering with the true c = 15 gives AV-AR = 0.34, while RL-FCM gives c* = 2.

FIGURE 5. Yale Face 32x32.

FIGURE 6. The 100-image sample of CIFAR-10.

TABLE XV
COMPARISON OF AVERAGE RUNNING TIMES (IN SECONDS) OF U-K-MEANS, R-EM, C-FS, AND RL-FCM FOR ALL DATA SETS. THE FASTEST RUNNING TIMES ARE HIGHLIGHTED.

Data sets            U-k-means   R-EM       C-FS       RL-FCM
Synthetic data sets
Example 1            0.3842      4.8921     5.8050     1.3688
Example 2            2.9185      13.6157    7.3559     6.0444
Example 3            2.1625      2.7938     10.2817    3.2924
Example 4            117.2595    742.14     35.6417    438.047
UCI data sets
Iris                 0.2159      1.1842     6.31581    0.4184
Seeds                0.1455      2.0400     5.2702     0.4472
Australian           2.0434      5.8039     6.1772     2.3829
Flowmeter D          0.2834      0.6969     5.6230     0.3054
Sonar                0.1747      0.3148     5.8564     0.3963
Wine                 0.1980      1.4837     5.8094     0.3060
Horse                0.6072      2.5989     5.3442     0.6272
Waveform             330.748     -          113.8162   474.165
Medical data sets
SPECT                0.1354      0.7211     5.9079     0.3411
Parkinsons           0.1487      0.5856     4.9534     0.3958
WPBC                 0.1512      0.7922     5.2152     0.4036
Colon                0.1653      -          4.9608     0.2676
Lung                 1.1239      -          5.2485     1.1167
Nci9                 0.6186      -          6.4794     0.5096
Image data sets
Yale Face 32x32      0.3741      -          5.9634     0.4286
CIFAR-10             2.6561      -          6.4500     -

Example 8. In this example, we apply the U-k-means clustering algorithm to the CIFAR-10 color images [37]. The CIFAR-10 data set consists of 60000 32x32 color images in 10 classes, i.e., each pixel is an RGB triplet of unsigned bytes between 0 and 255. There are 50000 training images and 10000 test images. Each red, green, and blue channel contains 1024 values per image. The 10 classes in the data set are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Specifically, we take the first 100 color images (10 images per class) from the CIFAR-10 60K-image data set for our experiment, and the remaining 59900 images serve as the retrieval database. Fig. 6 shows the 100-image sample from the CIFAR-10 data set. The results for the number of clusters and AR are given in Table XIV. From Table XIV, it is seen that the proposed U-k-means and k-means with the true c = 10 give better results on the 100-image sample of the CIFAR-10 data set. The U-k-means obtains the correct number c* = 10 of clusters in 42.5% of runs with AV-AR = 0.28, and k-means with c = 10 gives the same AV-AR = 0.28. For the C-FS, the percentage of runs with the correct number c* = 10 of clusters is only 16.7%, with AV-AR = 0.24. X-means underestimates the number of clusters with c* = 2. The results from R-EM and RL-FCM on this data set are missing because the probabilities of data points belonging to the k-th class become illegitimate proportions at the first iteration.

We further analyze the performance of U-k-means, R-EM, C-FS, and RL-FCM by comparing their average running times over 25 runs, as shown in Table XV. All algorithms are implemented in MATLAB 2017b. From Table XV, it is seen that the proposed U-k-means is the fastest on all data sets among these algorithms, except that the C-FS algorithm is the fastest for the Waveform data set. Furthermore, in Section III we mentioned that the proposed U-k-means objective function is simpler than the RL-FCM objective function, saving running time. From Table XV, it is seen that the proposed U-k-means algorithm indeed runs faster than the RL-FCM algorithm.

V. CONCLUSIONS
In this paper we propose a new schema with a learning framework for the k-means clustering algorithm. We adopt the merit of entropy-type penalty terms to construct a competition schema. The proposed U-k-means algorithm uses the number of points as the initial number of clusters to solve the initialization problem. During iterations, the U-k-means algorithm discards extra clusters, and an optimal number of clusters can then be automatically found according to the structure of the data. The advantages of U-k-means are that it is free of initializations and parameter selection, and it is also robust to different cluster volumes and shapes while automatically finding the number of clusters. The proposed U-k-means algorithm was run on several synthetic and real data sets and also compared with most existing algorithms, such as the R-EM, C-FS, k-means with the true number c, k-means+Gap-stat, and X-means algorithms. The results actually demonstrate the superiority of the U-k-means clustering algorithm.

REFERENCES
[1] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[2] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley, 1990.
[3] G.J. McLachlan and K.E. Basford, Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker, 1988.
[4] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," J. Roy. Stat. Soc., Ser. B, vol. 39, pp. 1-38, 1977.
[5] J. Yu, C. Chaomurilige, and M.S. Yang, "On convergence and parameter selection of the EM and DA-EM algorithms for Gaussian mixtures," Pattern Recognition, vol. 77, pp. 188-203, 2018.
[6] A.K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, pp. 651-666, 2010.
[7] M.S. Yang, S.J. Chang-Chien, and Y. Nataliani, "A fully-unsupervised possibilistic c-means clustering method," IEEE Access, vol. 6, pp. 78308-78320, 2018.
[8] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, University of California Press, 1967.
[9] M. Alhawarat and M. Hegazi, "Revisiting k-means and topic modeling, a comparison study to cluster Arabic documents," IEEE Access, vol. 6, pp. 42740-42749, 2018.
[10] Y. Meng, J. Liang, F. Cao, and Y. He, "A new distance with derivative information for functional k-means clustering algorithm," Information Sciences, vol. 463-464, pp. 166-185, 2018.
[11] Z. Lv, T. Liu, C. Shi, J.A. Benediktsson, and H. Du, "Novel land cover change detection method based on k-means clustering and adaptive majority voting using bitemporal remote sensing images," IEEE Access, vol. 7, pp. 34425-34437, 2019.
[12] J. Zhu, Z. Jiang, G.D. Evangelidis, C. Zhang, S. Panga, and Z. Li, "Efficient registration of multi-view point sets by k-means clustering," Information Sciences, vol. 488, pp. 205-218, 2019.
[13] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On clustering validation techniques," J. Intell. Inf. Syst., vol. 17, pp. 107-145, 2001.
[14] R.E. Kass and A.E. Raftery, "Bayes factors," Journal of the American Statistical Association, vol. 90, pp. 773-795, 1995.
[15] H. Bozdogan, "Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions," Psychometrika, vol. 52, pp. 345-370, 1987.
[16] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters," J. Cybernetics, vol. 3, pp. 32-57, 1974.
[17] D. Davies and D. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224-227, 1979.
[18] P.J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
[19] T. Calinski and J. Harabasz, "A dendrite method for cluster analysis," Commun. Stat.-Theory Methods, vol. 3, pp. 1-27, 1974.
[20] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society: Series B, vol. 63, pp. 411-423, 2001.
[21] N.R. Pal and J. Biswas, "Cluster validation using graph theoretic concepts," Pattern Recognition, vol. 30, pp. 847-857, 1997.
[22] N. Ilc, "Modified Dunn's cluster validity index based on graph theory," Przeglad Elektrotechniczny (Electrical Review), vol. 2, pp. 126-131, 2012.
[23] D. Pelleg and A. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," Proc. 17th International Conference on Machine Learning, pp. 727-734, San Francisco, 2000.
[24] E. Rendon, I. Abundez, A. Arizmendi, and E.M. Quiroz, "Internal versus external cluster validation indexes," Int. J. Computers and Communications, vol. 5, pp. 27-34, 2011.
[25] Y. Lei, J.C. Bezdek, S. Romani, N.X. Vinh, J. Chan, and J. Bailey, "Ground truth bias in external cluster validity indices," Pattern Recognition, vol. 65, pp. 58-70, 2017.
[26] J. Wu, J. Chen, H. Xiong, and M. Sie, "External validation measures for k-means clustering: a data distribution perspective," Expert Syst. Appl., vol. 36, pp. 6050-6061, 2009.
[27] L.J. Deborah, R. Baskaran, and A. Kannan, "A survey on internal validity measure for cluster validation," Int. J. Comput. & Eng. Surv., vol. 1, pp. 85-102, 2010.
[28] I.H. Witten, E. Frank, M.A. Hall, and C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2000.
[29] G. Guo, L. Chen, Y. Ye, and Q. Jiang, "Cluster validation method for determining the number of clusters in categorical sequences," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2936-2948, 2017.
[30] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[31] M.S. Yang, C.Y. Lai, and C.Y. Lin, "A robust EM clustering algorithm for Gaussian mixture models," Pattern Recognition, vol. 45, pp. 3950-3961, 2012.
[32] M.A.T. Figueiredo and A.K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 381-396, 2002.
[33] M.S. Yang and Y. Nataliani, "Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters," Pattern Recognition, vol. 71, pp. 45-59, 2017.
[34] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, 1998. https://archive.ics.uci.edu/ml/datasets.html
[35] D. Cai, X. He, J. Han, and T.S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, 2010.
[36] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, "Learning a spatially smooth subspace for face recognition," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), pp. 1-7, 2007.
[37] A. Krizhevsky and G. Hinton, Learning Multiple Layers of Features from Tiny Images, Technical report, University of Toronto, 2009.

Kristina P. Sinaga received the B.S. degree and the M.S. degree in mathematics from the University of Sumatera Utara, Indonesia. She is a Ph.D. student at the Department of Applied Mathematics, Chung Yuan Christian University, Taiwan. Her research interests include clustering and pattern recognition.

Miin-Shen Yang received the BS degree in mathematics from Chung Yuan Christian University, Chung-Li, Taiwan, in 1977, the MS degree in applied mathematics from National Chiao-Tung University, Hsinchu, Taiwan, in 1980, and the Ph.D. degree in statistics from the University of South Carolina, Columbia, USA, in 1989.

In 1989, he joined the faculty of the Department of Mathematics at Chung Yuan Christian University (CYCU) as an Associate Professor, where he has been a Professor since 1994. From 1997 to 1998, he was a Visiting Professor with the Department of Industrial Engineering, University of Washington, Seattle, USA. During 2001-2005, he was the Chairman of the Department of Applied Mathematics at CYCU. Since 2012, he has been a Distinguished Professor of the Department of Applied Mathematics and the Director of the Chaplain's Office, and he is now the Dean of the College of Science at CYCU. His research interests include clustering algorithms, fuzzy clustering, soft computing, pattern recognition, and machine learning. Dr. Yang was an Associate Editor of the IEEE Transactions on Fuzzy Systems (2005-2011), and is an Associate Editor of Applied Computational Intelligence & Soft Computing.