
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2988796, IEEE Access.

Unsupervised K-Means Clustering Algorithm


Kristina P. Sinaga and Miin-Shen Yang
Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li 32023, Taiwan

Corresponding author: Miin-Shen Yang (e-mail: [email protected]).


This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 107-2118-M-033-002-MY2.

ABSTRACT The k-means algorithm is the best-known and most widely used clustering method, and many extensions of it have been proposed in the literature. Although k-means is nominally an unsupervised learning method for clustering in pattern recognition and machine learning, the algorithm and its extensions are always influenced by their initializations and require the number of clusters to be given a priori. In this sense, k-means is not an entirely unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initialization and parameter selection and can simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed, and comparisons between U-k-means and existing methods are made. Experimental results demonstrate these good properties of the proposed U-k-means clustering algorithm.

INDEX TERMS Clustering, K-means, Number of clusters, Initializations, Unsupervised learning schema,
Unsupervised k-means (U-k-means)

I. INTRODUCTION

Clustering is a useful tool in data science. It is a method for finding cluster structure in a data set, characterized by the greatest similarity within the same cluster and the greatest dissimilarity between different clusters. Hierarchical clustering was the earliest clustering method, used by biologists and social scientists, while cluster analysis later became a branch of statistical multivariate analysis [1,2]. Clustering is also an unsupervised learning approach in machine learning. From a statistical viewpoint, clustering methods are generally divided into probability model-based approaches and nonparametric approaches. The probability model-based approaches assume that the data points are drawn from a mixture probability model, so that a mixture likelihood approach to clustering is used [3]. In model-based approaches, the expectation and maximization (EM) algorithm is the most used [4,5]. For nonparametric approaches, clustering methods are mostly based on an objective function of similarity or dissimilarity measures, and these can be divided into hierarchical and partitional methods, where partitional methods are the most used [2,6,7].

In general, partitional methods suppose that the data set can be represented by finite cluster prototypes with their own objective functions. Therefore, defining the dissimilarity (or distance) between a point and a cluster prototype is essential for partitional methods. The k-means algorithm is the oldest and most popular partitional method [1,8]. K-means clustering has been widely studied with various extensions in the literature and applied in a variety of substantive areas [9,10,11,12]. However, these k-means clustering algorithms are usually affected by initializations and need to be given a number of clusters a priori. In general, the cluster number is unknown. In this case, validity indices can be used to find a cluster number, where they are supposed to be independent of clustering algorithms [13]. Many cluster validity indices for the k-means clustering algorithm have been proposed in the literature, such as the Bayesian information criterion (BIC) [14], Akaike information criterion (AIC) [15], Dunn's index [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22].


For estimating the number of clusters, Pelleg and Moore [23] extended k-means to X-means by making local decisions about whether cluster centers should split themselves in each iteration of k-means in order to obtain a better clustering. Users need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC or AIC, is used to drive the splitting process. Although these approaches, such as cluster validity indices and X-means, can find the number of clusters, they use extra iteration steps outside the clustering algorithm itself. As far as we know, no work in the literature makes k-means free of initializations and parameter selection while also simultaneously finding the number of clusters. We suppose that this is due to the difficulty of constructing such a k-means algorithm.

In this paper, we first construct a learning procedure for the k-means clustering algorithm. This learning procedure can automatically find the number of clusters without any initialization and parameter selection. We first consider an entropy penalty term for adjusting bias, and then create a learning schema for finding the number of clusters. The organization of this paper is as follows. In Section II, we review some related works. In Section III, we construct the learning schema and then propose the unsupervised k-means (U-k-means) clustering algorithm that automatically finds the number of clusters. The computational complexity of the proposed U-k-means algorithm is also analyzed. In Section IV, several experimental examples and comparisons with numerical and real data sets are provided to demonstrate the effectiveness of the proposed U-k-means clustering algorithm. Finally, conclusions are stated in Section V.

II. RELATED WORKS
In this section, we review several works that are closely related to ours. K-means is one of the most popular unsupervised learning algorithms for solving the well-known clustering problem. Let $X = \{x_1, \ldots, x_n\}$ be a data set in a $d$-dimensional Euclidean space $\mathbb{R}^d$, and let $A = \{a_1, \ldots, a_c\}$ be the $c$ cluster centers. Let $z = [z_{ik}]_{n \times c}$, where $z_{ik}$ is a binary variable (i.e., $z_{ik} \in \{0,1\}$) indicating whether the data point $x_i$ belongs to the $k$-th cluster, $k = 1, \ldots, c$. The k-means objective function is

$J(z, A) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \|x_i - a_k\|^2$.

The k-means algorithm iterates the necessary conditions for minimizing the k-means objective function $J(z, A)$, with updating equations for cluster centers and memberships, respectively, as

$a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}$  and  $z_{ik} = \begin{cases} 1 & \text{if } \|x_i - a_k\|^2 = \min_{1 \le s \le c} \|x_i - a_s\|^2 \\ 0 & \text{otherwise,} \end{cases}$

where $\|x_i - a_k\|$ is the Euclidean distance between the data point $x_i$ and the cluster center $a_k$. There is a difficult problem in k-means: it needs to be given a number of clusters a priori, but the number of clusters is generally unknown in real applications. Another problem is that the k-means algorithm is always affected by initializations.

To resolve the issue of finding the number $c$ of clusters, cluster validity has received much attention, and several clustering validity indices are available for estimating $c$. Clustering validity indices can be grouped into two major categories: external and internal [24]. External indices evaluate clustering results by comparing the cluster memberships assigned by a clustering algorithm with previously known knowledge, such as externally supplied class labels [25,26]. Internal indices, on the other hand, evaluate the goodness of a cluster structure by focusing on the intrinsic information of the data itself [27], so we consider only internal indices. In this paper, the most widely used internal indices, namely the original Dunn's index (DNo) [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22], are chosen for finding the number of clusters and then compared with our proposed U-k-means clustering algorithm.

The DNo [16], DNg [21], and DNs [22] are among the simplest internal validity indices; they compare the size of clusters with the distance between clusters. The DNo, DNg, and DNs indices are computed as the ratio between the minimum distance between two clusters and the size of the largest cluster, so we look for the maximum value of the index. The Davies-Bouldin index (DB) [17] measures the average similarity between each cluster and its most similar one; it attempts to maximize the between-cluster distances while minimizing the distance between the cluster centroid and the other data objects. The Silhouette value [18] measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Thus, large positive and large negative silhouette widths (SW) indicate that the corresponding object is well clustered and wrongly clustered, respectively; objects with an SW value around zero are considered not clearly discriminated between clusters. The Gap statistic [20] is a cluster validity measure based upon a statistical hypothesis test. It works by comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution at each value of $c$; the optimal number of clusters is the smallest such $c$.

For an efficient method for determining the number of clusters, X-means, proposed by Pelleg and Moore [23], is the most well-known and used in the literature, for example by Witten et al. [28] and Guo et al. [29]. In X-means, Pelleg and Moore [23] extended k-means by making local decisions about whether cluster centers should split themselves in each iteration of k-means to obtain a better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC, is used to drive the splitting process.
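To make the two update equations reviewed at the start of this section concrete, the following minimal NumPy sketch alternates the membership and center conditions of the k-means objective. It is an illustration written for this review, not code from the paper; the function name and the random initialization scheme are my own choices.

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """Plain k-means (Lloyd iterations) for the objective J(z, A) above."""
    rng = np.random.default_rng(seed)
    A = X[rng.choice(len(X), size=c, replace=False)]   # random initial centers for a given c
    for _ in range(max_iter):
        # membership update: z_ik = 1 for the nearest center, 0 otherwise
        d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # center update: a_k = mean of the points assigned to cluster k
        A_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else A[k]
                          for k in range(c)])
        if np.allclose(A_new, A):
            break
        A = A_new
    return labels, A
```

As the review notes, both the cluster number c and the initialization must be supplied by the user here, which is exactly the dependence that the U-k-means algorithm of Section III aims to remove.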


Although X-means has been the most used method for clustering without a given number of clusters a priori, it still needs a specified range of cluster numbers and a criterion, such as BIC, and it is still influenced by the initializations of the algorithm. On the other hand, Rodriguez and Laio [30] proposed an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities, which they called clustering by fast search and find of density peaks (C-FS). To identify the cluster centers, C-FS uses the heuristic approach of a decision graph. However, the performance of C-FS highly depends on two factors, i.e., the local density $\rho_i$ and the cutoff distance $\delta_i$.

III. THE UNSUPERVISED K-MEANS CLUSTERING ALGORITHM
There has always been a difficult problem in the k-means algorithm and its extensions over their long history in the literature: they are usually affected by initializations and require a given number of clusters a priori. As mentioned above, the X-means algorithm has been used for clustering without a given number of clusters a priori, but it still needs a specified range of cluster numbers based on BIC, and it is still influenced by initializations. To construct a k-means clustering algorithm that is free of initializations and automatically finds the number of clusters, we use the entropy concept. We borrow the idea from the EM algorithm treatment by Yang et al. [31]. We first consider proportions $\alpha_k$, where $\alpha_k$ is seen as the probability of one data point belonging to the $k$th class. Hence, we use $-\ln \alpha_k$ as the information in the occurrence of one data point belonging to the $k$th class, and so $-\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ becomes the average information, i.e., the entropy over the proportions $\alpha_k$. When $\alpha_k = 1/c$ for all $k$, there is no information about $\alpha_k$, and the entropy achieves its maximum value. Therefore, we add this term to the k-means objective function $J(z, A)$ as a penalty, and we construct a schema to estimate $\alpha_k$ by minimizing the entropy so as to get the most information about $\alpha_k$. Minimizing $-\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ is equivalent to maximizing $\sum_{k=1}^{c} \alpha_k \ln \alpha_k$. For this reason, we use $\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ as a penalty term for the k-means objective function $J(z, A)$. Thus, we propose a novel objective function as follows:

$J_{UKM1}(z, A, \alpha) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c} \alpha_k \ln \alpha_k$   (1)

In order to determine the number of clusters, we next consider another entropy term that combines the membership variables $z_{ik}$ and the proportions $\alpha_k$. Based on entropy theory, we suggest a new term of the form $z_{ik} \ln \alpha_k$. Thus, we propose the unsupervised k-means (U-k-means) objective function as follows:

$J_{UKM2}(z, A, \alpha) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c} \alpha_k \ln \alpha_k - \gamma \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik} \ln \alpha_k$   (2)

We note that when $\beta$ and $\gamma$ in Eq. (2) are zero, it becomes the original k-means objective. The Lagrangian of Eq. (2), with multiplier $\lambda$ for the constraint $\sum_{k=1}^{c}\alpha_k = 1$, is

$\tilde{J}(z, A, \alpha, \lambda) = \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik}\|x_i - a_k\|^2 - \beta n \sum_{k=1}^{c} \alpha_k \ln \alpha_k - \gamma \sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik} \ln \alpha_k - \lambda\left(\sum_{k=1}^{c}\alpha_k - 1\right)$   (3)

We first take the partial derivative of the Lagrangian (3) with respect to $z_{ik}$ and set it to zero. The updating equation for $z_{ik}$ is obtained as follows:

$z_{ik} = \begin{cases} 1 & \text{if } \|x_i - a_k\|^2 - \gamma \ln \alpha_k = \min_{1 \le s \le c}\left(\|x_i - a_s\|^2 - \gamma \ln \alpha_s\right) \\ 0 & \text{otherwise.} \end{cases}$   (4)

The updating equation for the cluster center $a_k$ is

$a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}$   (5)

We next take the partial derivative of the Lagrangian with respect to $\alpha_k$ and obtain $\partial \tilde{J}/\partial \alpha_k = -\beta n (\ln \alpha_k + 1) - \gamma \sum_{i=1}^{n} z_{ik}/\alpha_k - \lambda = 0$, and hence $-\beta n \alpha_k(\ln \alpha_k + 1) - \gamma \sum_{i=1}^{n} z_{ik} - \lambda \alpha_k = 0$. Summing over $k$ gives $-\beta n \sum_{k=1}^{c}\alpha_k \ln \alpha_k - \beta n - \gamma n - \lambda = 0$, so that $\lambda = -\beta n \sum_{k=1}^{c}\alpha_k \ln \alpha_k - \beta n - \gamma n$. Substituting $\lambda$ back, we get the updating equation for $\alpha_k$ as follows:

$\alpha_k^{(t+1)} = \frac{\sum_{i=1}^{n} z_{ik}}{n} + \frac{\beta}{\gamma}\,\alpha_k^{(t)}\left(\ln \alpha_k^{(t)} - \sum_{s=1}^{c}\alpha_s^{(t)} \ln \alpha_s^{(t)}\right)$   (6)

where $t$ denotes the iteration number in the algorithm.

We should mention that Eq. (6) is important for our proposed U-k-means clustering method. In Eq. (6), $\sum_{s=1}^{c}\alpha_s \ln \alpha_s$ is the weighted mean of $\ln \alpha_k$ with the weights $\alpha_1, \ldots, \alpha_c$. For the $k$th mixing proportion $\alpha_k^{(t)}$, if $\ln \alpha_k^{(t)}$ is less than this weighted mean, then the new mixing proportion $\alpha_k^{(t+1)}$ will become smaller than the old $\alpha_k^{(t)}$. That is, smaller proportions decrease and larger proportions increase in the next iteration, so competition occurs. This situation is similar to the formula in Figueiredo and Jain [32]. If $\alpha_k \le 0$ or $\alpha_k \le 1/n$ for some $1 \le k \le c^{(t)}$, these are considered illegitimate proportions. In this situation, we discard those clusters and update the cluster number $c^{(t)}$ to

$c^{(t+1)} = c^{(t)} - \left|\left\{k : \alpha_k^{(t+1)} \le 1/n,\; k = 1, \ldots, c^{(t)}\right\}\right|$   (7)

where $|\{\cdot\}|$ denotes the cardinality of the set $\{\cdot\}$.
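As a quick numerical illustration of the competition induced by Eq. (6) (my own example, not from the paper), the sketch below applies one proportion update to an unequal set of proportions and shows that components whose $\ln \alpha_k$ lies below the weighted mean shrink while the others grow; the cluster shares $\frac{1}{n}\sum_i z_{ik}$ are held fixed and equal to the current proportions for clarity, and the values of $\beta$ and $\gamma$ are illustrative.

```python
import numpy as np

alpha = np.array([0.05, 0.15, 0.30, 0.50])   # current mixing proportions
share = np.array([0.05, 0.15, 0.30, 0.50])   # (1/n) * sum_i z_ik, kept equal to alpha here
beta, gamma = 0.5, 1.0                        # illustrative values of the two coefficients

weighted_mean = np.sum(alpha * np.log(alpha))            # sum_s alpha_s * ln(alpha_s)
alpha_new = share + (beta / gamma) * alpha * (np.log(alpha) - weighted_mean)

print(alpha_new)          # small proportions move down, large ones move up
print(alpha_new.sum())    # the competition term sums to zero, so the total stays 1
```

With these numbers the smallest proportion drops to about 0.004, which would fall below the $1/n$ threshold of Eq. (7) for any reasonably sized data set, so the corresponding cluster would be discarded.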


After updating the number of clusters $c$, the remaining mixing proportions $\alpha_{k^*}$ and the corresponding $z_{ik^*}$ need to be re-normalized by

$\alpha_{k^*}^{(t+1)} = \frac{\alpha_{k^*}}{\sum_{s=1}^{c^{(t+1)}} \alpha_{s^*}}$   (8)

$z_{ik^*}^{(t+1)} = \frac{z_{ik^*}}{\sum_{s=1}^{c^{(t+1)}} z_{is^*}}$   (9)

We next consider the learning of the parameters $\beta$ and $\gamma$ for the two terms $\sum_{i=1}^{n}\sum_{k=1}^{c} z_{ik} \ln \alpha_k$ and $\sum_{k=1}^{c} \alpha_k \ln \alpha_k$. Comparing learning rates that decay with the cluster number, namely $e^{-c^{(t)}/100}$, $e^{-c^{(t)}/250}$, $e^{-c^{(t)}/500}$, $e^{-c^{(t)}/750}$, and $e^{-c^{(t)}/1000}$, it is seen that $e^{-c^{(t)}/100}$ decreases faster, while $e^{-c^{(t)}/750}$ and $e^{-c^{(t)}/1000}$ decrease more slowly. We suppose that the parameter $\gamma$ should decrease neither too slowly nor too fast, and so we set

$\gamma^{(t)} = e^{-c^{(t)}/250}$   (10)

Under this competition schema, the algorithm can automatically reduce the number of clusters and simultaneously obtain the estimates of the parameters. Furthermore, the parameter $\beta$ can help us control the competition. We discuss $\beta$ as follows. We first apply the rule $\beta e^{-1} \ge -\alpha_k \ln \alpha_k \ge 0$. If $0 < \alpha_k < 1$ for all $k$ and we let $E = -\sum_{s=1}^{c}\alpha_s \ln \alpha_s \ge 0$, then $\alpha_k E = -\alpha_k \sum_{s=1}^{c}\alpha_s \ln \alpha_s \ge 0$. Thus, we obtain

$\beta e^{-1} \ge -\alpha_k\left(\ln \alpha_k - \sum_{s=1}^{c}\alpha_s \ln \alpha_s\right) = -\left(\alpha_k \ln \alpha_k + \alpha_k E\right)$   (11)

Under the constraint $\sum_{k=1}^{c}\alpha_k = 1$, and only when $\alpha_k \ge 1/2$, we have $\ln \alpha_k - \sum_{s=1}^{c}\alpha_s \ln \alpha_s \ge 0$. To avoid the situation where all $\alpha_k \to 0$, the left-hand side of inequality (11) must be larger than $-\max\{-\alpha_k \mid \alpha_k \ge 1/2,\; k = 1, 2, \ldots, c\}$. We then have an elementary condition on $\beta$, $\beta e^{-1} \ge \max\{\alpha_k \mid \alpha_k \ge 1/2,\; k = 1, 2, \ldots, c\}$, and thus $\beta \ge \max\{\alpha_k e \mid \alpha_k \ge 1/2,\; k = 1, 2, \ldots, c\} \ge e/2$. Therefore, to prevent $\beta$ from becoming too large, we use $\beta \in [0, 1]$. If the difference between $\alpha_k^{(t+1)}$ and $\alpha_k^{(t)}$ is small, then $\beta$ should become large to enhance the competition; if the difference is large, then $\beta$ should become small to maintain stability. Thus, we estimate $\beta$ with

$\beta = \frac{1}{c}\sum_{k=1}^{c} \exp\left(-\eta\, n\, \bigl|\alpha_k^{(t+1)} - \alpha_k^{(t)}\bigr|\right)$   (12)

where $\eta = \min\left(1, \lfloor d/2 - 1 \rfloor\right)$, $\lfloor a \rfloor$ denotes the largest integer that is no more than $a$, and $t$ denotes the iteration number in the algorithm.

On the other hand, we consider the restriction $\max_{1 \le k \le c}\alpha_k^{(t+1)} \le 1$ on Eq. (6). Since $\ln \alpha_k^{(t)} \le 0$ for all $k$, a sufficient condition is $\max_{1 \le k \le c}\frac{1}{n}\sum_{i=1}^{n} z_{ik} - \frac{\beta}{\gamma}\max_{1 \le k \le c}\alpha_k^{(t)}\sum_{s=1}^{c}\alpha_s^{(t)} \ln \alpha_s^{(t)} \le 1$, which gives the restriction on $\beta$:

$\beta \le \frac{\gamma\left(1 - \max_{1 \le k \le c}\frac{1}{n}\sum_{i=1}^{n} z_{ik}\right)}{-\max_{1 \le k \le c}\alpha_k^{(t)}\sum_{s=1}^{c}\alpha_s^{(t)} \ln \alpha_s^{(t)}}$   (13)

According to Eqs. (12) and (13), we get

$\beta^{(t+1)} = \min\left(\frac{1}{c}\sum_{k=1}^{c}\exp\left(-\eta\, n\,\bigl|\alpha_k^{(t+1)} - \alpha_k^{(t)}\bigr|\right),\; \frac{\gamma^{(t)}\left(1 - \max_{1 \le k \le c}\frac{1}{n}\sum_{i=1}^{n} z_{ik}\right)}{-\max_{1 \le k \le c}\alpha_k^{(t)}\sum_{s=1}^{c}\alpha_s^{(t)} \ln \alpha_s^{(t)}}\right)$   (14)

Because $\beta$ can jump at any time, we let $\beta = 0$ once the cluster number $c$ is stable, i.e., once $c$ is no longer decreasing. In our setting, we use all data points as the initial means, $a_k = x_k$, i.e., $c_{\text{initial}} = n$, and we use $\alpha_k = 1/c_{\text{initial}}$, $k = 1, 2, \ldots, c_{\text{initial}}$, as the initial mixing proportions. Thus, the proposed U-k-means clustering algorithm can be summarized as follows:

U-k-means clustering algorithm
Step 1: Fix $\varepsilon > 0$. Give initial $c^{(0)} = n$, $\alpha_k^{(0)} = 1/n$, $a_k^{(0)} = x_i$, and initial learning rates $\beta^{(0)} = \gamma^{(0)} = 1$. Set $t = 0$.
Step 2: Compute $z_{ik}^{(t+1)}$ using $a_k^{(t)}$, $\alpha_k^{(t)}$, $c^{(t)}$, $\beta^{(t)}$, $\gamma^{(t)}$ by (4).
Step 3: Compute $\gamma^{(t+1)}$ by (10).
Step 4: Update $\alpha_k^{(t+1)}$ with $z_{ik}^{(t+1)}$ and $\gamma^{(t+1)}$ by (6).
Step 5: Compute $\beta^{(t+1)}$ with $\alpha_k^{(t+1)}$, $\alpha_k^{(t)}$, and $\gamma^{(t)}$ by (14).
Step 6: Update $c^{(t)}$ to $c^{(t+1)}$ by discarding those clusters with $\alpha_k^{(t+1)} \le 1/n$, and adjust $\alpha_k^{(t+1)}$ and $z_{ik}^{(t+1)}$ by (8) and (9). IF $t \ge 60$ and $c^{(t-60)} - c^{(t)} = 0$, THEN let $\beta^{(t+1)} = 0$.
Step 7: Update $a_k^{(t+1)}$ with $c^{(t+1)}$ and $z_{ik}^{(t+1)}$ by (5).
Step 8: Compare $a_k^{(t+1)}$ and $a_k^{(t)}$. IF $\max_{1 \le k \le c}\|a_k^{(t+1)} - a_k^{(t)}\| < \varepsilon$, THEN stop; ELSE set $t = t + 1$ and return to Step 2.
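To make the inner loop concrete, here is a minimal NumPy sketch of one pass through Steps 2-7 as reconstructed above. It is an illustration written for this text, not the authors' MATLAB implementation: the function name and arguments are my own, the cluster-number history is assumed to be maintained by the caller, and points whose clusters are discarded are simply re-assigned to the nearest surviving center, which stands in for the hard-membership renormalization of Eqs. (8)-(9).

```python
import numpy as np

def u_k_means_step(X, A, alpha, beta, gamma, c_history, stall=60):
    """One U-k-means iteration (Steps 2-7) given current centers A,
    mixing proportions alpha, and learning rates beta, gamma."""
    n, d = X.shape
    c = A.shape[0]
    eta = min(1.0, np.floor(d / 2 - 1))                  # as defined after Eq. (12)

    # Step 2 / Eq. (4): hard assignment biased by -gamma * ln(alpha_k)
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    labels = np.argmin(d2 - gamma * np.log(alpha), axis=1)
    shares = np.bincount(labels, minlength=c) / n        # (1/n) * sum_i z_ik

    # Step 3 / Eq. (10): anneal gamma with the current cluster number
    gamma = np.exp(-c / 250.0)

    # Step 4 / Eq. (6): proportion update with entropy-driven competition
    w_mean = np.sum(alpha * np.log(alpha))
    alpha_new = shares + (beta / gamma) * alpha * (np.log(alpha) - w_mean)

    # Step 5 / Eqs. (12)-(14): re-estimate beta, capped so that max(alpha) stays <= 1
    cap = gamma * (1.0 - shares.max()) / (-alpha.max() * w_mean)
    beta = min(np.mean(np.exp(-eta * n * np.abs(alpha_new - alpha))), cap)

    # Step 6 / Eq. (7): discard clusters whose proportion fell to 1/n or below,
    # then renormalize the survivors as in Eq. (8)
    keep = alpha_new > 1.0 / n
    if not keep.any():
        keep[np.argmax(alpha_new)] = True                # keep at least one cluster
    A, alpha = A[keep], alpha_new[keep] / alpha_new[keep].sum()
    c_history.append(A.shape[0])
    if len(c_history) > stall and c_history[-stall - 1] == c_history[-1]:
        beta = 0.0                                       # freeze competition once c is stable

    # Step 7 / Eq. (5): recompute centers from the surviving hard assignments
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    labels = np.argmin(d2 - gamma * np.log(alpha), axis=1)
    A = np.vstack([X[labels == k].mean(axis=0) if np.any(labels == k) else A[k]
                   for k in range(A.shape[0])])
    return A, alpha, beta, gamma, labels
```

Looping this step from the initialization of Step 1 ($c^{(0)} = n$, $\alpha_k^{(0)} = 1/n$, $a_k^{(0)} = x_k$) until the centers stop moving mirrors the overall procedure of Steps 1-8; the O(ncd) cost per iteration quoted below comes from the distance matrix formed in the assignment step.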


Before we analyze the computational complexity of the proposed U-k-means algorithm, we give a brief review of another clustering algorithm that also used the idea from the EM algorithm by Yang et al. [31]. This is the robust-learning fuzzy c-means (RL-FCM) algorithm proposed by Yang and Nataliani [33]. In Yang and Nataliani [33], the RL-FCM objective function $J(U, \alpha, A)$ uses memberships $\mu_{ik}$ that are not binary variables but fuzzy c-memberships, with $0 \le \mu_{ik} \le 1$ and $\sum_{k=1}^{c}\mu_{ik} = 1$, to indicate the fuzzy membership of the data point $x_i$ in the $k$-th cluster. If we compare the proposed U-k-means objective function $J_{UKM2}(z, A, \alpha)$ with the RL-FCM objective function $J(U, \alpha, A)$, we find that, apart from the different membership representations $\mu_{ik}$ and $z_{ik}$, the RL-FCM objective function in Yang and Nataliani [33] has more extra terms and parameters, so the RL-FCM algorithm is more complicated than the proposed U-k-means algorithm and requires more running time. For the experimental results and comparisons in the next section, we therefore also compare the proposed U-k-means algorithm with the RL-FCM algorithm.

We also analyze the computational complexity of the U-k-means algorithm. The U-k-means algorithm can be divided into three parts: (1) compute the hard membership partition $z_{ik}$ with $O(ncd)$; (2) compute the mixing proportions $\alpha_k$ with $O(nc)$; and (3) update the cluster centers $a_k$ with $O(n)$. The total computational complexity of the U-k-means algorithm is therefore $O(ncd)$, where $n$ is the number of data points, $c$ is the number of clusters, and $d$ is the dimension of the data points. In comparison, the RL-FCM algorithm [33] has a total computational complexity of $O(nc^2 d)$.

IV. EXPERIMENTAL RESULTS AND COMPARISONS
In this section, we give some examples with numerical and real data sets to demonstrate the performance of the proposed U-k-means algorithm, and we show its unsupervised learning behavior in finding the best number $c^*$ of clusters. Generally, most clustering algorithms, including k-means, are run to give different numbers of clusters with associated cluster memberships, and these clustering results are then evaluated by multiple validity measures to determine the most practically plausible clustering result and the estimated number of clusters [13]. Thus, we first compare the U-k-means algorithm with the seven validity indices DNo [16], DNg [21], DNs [22], Gap statistic (Gap-stat) [20], DB [17], SW [18], and CH [19]. Furthermore, comparisons of the proposed U-k-means with k-means [8], robust EM (R-EM) [31], clustering by fast search (C-FS) [30], X-means [23], and RL-FCM [33] are also made. For measuring clustering performance, we use the accuracy rate (AR), defined as $AR = \sum_{k=1}^{c} n(c_k)/n$, where $n(c_k)$ is the number of data points that are correctly clustered into cluster $k$ and $n$ is the total number of data points. The larger the AR, the better the clustering performance.


Example 1. In this example, we use a data set of 400 data points generated from the 2-variate, 6-component Gaussian mixture model $f(x; \alpha, \theta) = \sum_{k=1}^{c}\alpha_k f(x; \theta_k)$ with parameters $\alpha_k = 1/6$ for all $k$, $\mu_1 = (5, 2)^T$, $\mu_2 = (3, 4)^T$, $\mu_3 = (8, 4)^T$, $\mu_4 = (6, 6)^T$, $\mu_5 = (10, 8)^T$, $\mu_6 = (7, 10)^T$, and $\Sigma_1 = \cdots = \Sigma_6 = \begin{pmatrix}0.4 & 0\\ 0 & 0.4\end{pmatrix}$, i.e., with 2 dimensions and 6 clusters, as shown in Fig. 1(a). We implement the proposed U-k-means clustering algorithm on the data set of Fig. 1(a); it obtains the correct number $c^* = 6$ of clusters with AR = 1.00, as shown in Fig. 1(f), after 11 iterations. The validity index values of CH, SW, DB, Gap-stat, DNo, DNg, and DNs are shown in Table I. All indices give the correct number $c^* = 6$ of clusters, except DNg.

FIGURE 1. (a) Original data set; (b)-(e) processes of the U-k-means after 1, 2, 4, and 9 iterations; (f) convergent results.

TABLE I
VALIDITY INDEX VALUES OF CH, SW, DB, GAP-STAT, DNO, DNG, AND DNS FOR THE DATA SET OF FIG. 1(A)
c   CH      SW      DB      Gap-stat   DNo     DNg     DNs
2   0.511   0.680   0.772   0.183      0.008   6.587   0.001
3   0.866   0.388   0.553   0.649      0.047   2.603   0.019
4   0.605   0.715   0.700   0.469      0.040   1.603   0.016
5   0.754   0.743   0.571   0.619      0.041   4.619   0.020
6   1.277   0.838   0.483   1.067      0.102   4.635   0.048
7   1.155   0.773   0.634   0.991      0.060   0.794   0.022
8   0.808   0.930   1.054   0.703      0.030   0.571   0.004

Moreover, we consider a data set with noisy points to show the performance of the proposed U-k-means algorithm in a noisy environment. We add 50 uniformly distributed noisy points to the data set of Fig. 1(a), as shown in Fig. 2(a). By implementing the U-k-means algorithm on the noisy data set of Fig. 2(a), it still obtains the correct number $c^* = 6$ of clusters after 28 iterations with AR = 1.00, as shown in Fig. 2(b). The validity index values of CH, SW, DB, Gap-stat, DNo, DNg, and DNs for the noisy data set of Fig. 2(a) are shown in Table II. The five validity indices CH, DB, Gap-stat, DNo, and DNs give the correct number of clusters, but SW and DNg give incorrect numbers of clusters.

FIGURE 2. (a) 6-cluster data set with 50 noisy points; (b) final results from U-k-means.

TABLE II
VALIDITY INDEX VALUES OF CH, SW, DB, GAP-STAT, DNO, DNG, AND DNS FOR THE NOISY DATA SET
c   CH      SW      DB      Gap-stat   DNo     DNg     DNs
2   508.2   0.662   0.792   0.827      0.005   4.547   0.001
3   523.4   0.615   0.913   0.835      0.034   1.828   0.012
4   526.9   0.655   0.748   0.719      0.033   1.752   0.011
5   637.9   0.697   0.607   0.914      0.028   3.397   0.008
6   902.6   0.771   0.538   1.237      0.052   1.502   0.013
7   864.4   0.783   0.558   1.173      0.042   0.797   0.008
8   837.7   0.766   0.666   1.143      0.019   0.497   0.002

Example 2. In this example, we consider a data set of 800 data points generated from a 3-variate, 14-component Gaussian mixture, i.e., with 3 dimensions and 14 clusters, as shown in Fig. 3(a). To estimate the number $c$ of clusters, we use CH, SW, DB, Gap-stat, DNo, DNg, and DNs. To create the results of the seven validity indices, we run the k-means algorithm with 25 different initializations. The estimated numbers of clusters from CH, SW, DB, Gap statistic, DNo, DNg, and DNs, with percentages, are shown in Table III. It is seen that all validity indices can give the correct number $c^* = 14$ of clusters except DNg, and the Gap-stat index gives the highest percentage (64%) of the correct number $c^* = 14$. We also implement the proposed U-k-means for this data set and then compare it with the R-EM, C-FS, k-means with the true number of clusters, X-means, and RL-FCM clustering algorithms. We mention that U-k-means, R-EM, and RL-FCM are free of parameter selection, while the others depend on initialization or parameter selection for finding the number of clusters. Table IV shows the comparison results of the U-k-means, R-EM, C-FS, k-means with the true cluster number $c = 14$, X-means, and RL-FCM algorithms. Note that the C-FS, k-means with the true number of clusters, and X-means algorithms depend on initials or parameter selection, and so we consider their average AR (AV-AR) under different initials or parameter selections. From Table IV, it is seen that the proposed U-k-means, R-EM, and RL-FCM clustering algorithms are able to find the correct number of clusters $c^* = 14$ with AR = 1.00, while C-FS obtained the correct $c^* = 14$ in 96% of runs with AV-AR = 0.9772. The k-means with the true $c$ gave AV-AR = 0.8160. The X-means obtained the correct $c^* = 14$ in 76% of runs with AV-AR = 1.00. The numbers in parentheses indicate the percentage of runs in which an algorithm obtains the correct number of clusters under 25 different initial values.

FIGURE 3. (a) 14-cluster data set; (b) final results from U-k-means.

TABLE III
RESULTS OF THE SEVEN VALIDITY INDICES
True c   CH         SW         DB         Gap-stat   DNo        DNg               DNs
14       14 (60%)   14 (60%)   14 (60%)   14 (64%)   14 (20%)   2, 4, 5, 10, 11   14 (20%)
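For readers who want to reproduce the flavor of Examples 1 and 2, the sketch below generates a Gaussian-mixture data set like that of Example 1 and scores candidate cluster numbers with three of the internal indices used as baselines (CH, DB, and SW). The scikit-learn calls are my own choice of tooling; the paper's baselines are the indices themselves, not this code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

rng = np.random.default_rng(0)
means = np.array([[5, 2], [3, 4], [8, 4], [6, 6], [10, 8], [7, 10]], dtype=float)
cov = 0.4 * np.eye(2)                        # Sigma_k = diag(0.4, 0.4), as in Example 1
X = np.vstack([rng.multivariate_normal(m, cov, size=400 // 6) for m in means])

for c in range(2, 9):
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
    print(c,
          round(calinski_harabasz_score(X, labels), 1),   # larger is better
          round(davies_bouldin_score(X, labels), 3),      # smaller is better
          round(silhouette_score(X, labels), 3))          # larger is better
```

On well-separated data like this, CH and the silhouette tend to peak, and DB tends to bottom out, at c = 6, which is the behavior these baseline indices are meant to capture.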


Example 3. To examine the effectiveness of the proposed U-k-means for finding the number of clusters, we generate a data set of 900 data points from a 20-variate, 6-component Gaussian mixture model. The mixing proportions, mean values, and covariance matrices of the Gaussian mixture model are listed in Table V. The validity indices CH, SW, DB, Gap-stat, DNo, DNg, and DNs are used to estimate the number $c$ of clusters, with the k-means algorithm run under 25 different initializations to create the results of the seven validity indices. The estimated numbers of clusters from the seven validity indices, with percentages, are shown in Table VI, where the parentheses indicate the percentage of runs in which a validity index gave the correct number of clusters under the 25 different initial values. It is seen that CH, SW, and Gap-stat give the correct number $c^* = 6$ of clusters with the highest percentage. We also implemented the U-k-means and compared it with the R-EM, C-FS, k-means with the true number $c$, X-means, and RL-FCM algorithms. The obtained numbers of clusters and ARs of these algorithms are shown in Table VII. As can be seen, the proposed U-k-means, C-FS, and X-means correctly find the number of clusters for this data set, whereas R-EM and RL-FCM underestimate it. Both U-k-means and X-means obtain the best AR.

Example 4. In this example, we consider a synthetic data set of non-spherical shape with 3000 data points, as shown in Fig. 4(a). The U-k-means is implemented on this data set, with the clustering results shown in Figs. 4(b)-4(f). The U-k-means algorithm decreases the number of clusters from 3000 to 2132 after the first iteration, and from Figs. 4(b)-4(f) it is seen that the number of clusters decreases quickly. After 11 iterations, the U-k-means algorithm obtains its convergent result with $c^* = 9$ and AR = 1.00, as shown in Fig. 4(f). We next compare the proposed U-k-means algorithm with R-EM, C-FS, k-means with the true $c$, X-means, and RL-FCM. All the experiments are performed 25 times with parameter selection, and the average AR results under the correct number of clusters are reported in Table VIII. As shown in Table VIII, U-k-means gives the correct number $c^* = 9$ of clusters with AR = 1.00, followed by k-means with the true $c = 9$, which achieves an average AR = 0.9190, and C-FS with $c^* = 9$ (96%), which achieves an average AR = 0.7641. R-EM overestimates the number of clusters with $c^* = 12$, while X-means and RL-FCM underestimate it with $c^* = 2$.

FIGURE 4. (a) 9-diamonds data set; (b)-(e) results of the U-k-means after 1, 3, 5, and 7 iterations; (f) final results of the U-k-means after 11 iterations.

We next consider real data sets from the UCI Machine Learning Repository [34].

Example 5. In this example, we use eight real data sets from the UCI Machine Learning Repository [34], known as Iris, Seeds, Australian credit approval, Flowmeter D, Sonar, Wine, Horse, and Waveform (version 1). Detailed information on these data sets, such as feature characteristics, the number $c$ of classes, the number $n$ of instances, and the number $d$ of features, is listed in Table IX. Since the data features in Seeds, Flowmeter D, Wine, and Waveform (version 1) are distributed over different ranges, and the data features in Australian (credit approval) are of mixed feature types, we first preprocess the data matrices using a matrix factorization technique [35]. This preprocessing puts the data on a uniform footing, which helps to obtain good-quality clusters and improves the accuracy rates of the clustering algorithms. Clustering results from the U-k-means, R-EM, C-FS, k-means with the true $c$, k-means+Gap-stat, X-means, and RL-FCM algorithms for the different real data sets are shown in Table X, where the best results are presented in boldface. It is seen that the proposed U-k-means gives the best result in estimating the number $c$ of clusters and the accuracy rate, except for the Australian data. The C-FS algorithm gives the correct numbers of clusters for the Iris, Seeds, Australian, Flowmeter D, Sonar, Wine, and Horse data sets, while it underestimates the number of clusters for the Waveform data set with $c^* = 2$. The X-means algorithm only obtains the correct number of clusters for the Seeds, Wine, and Horse data sets. R-EM obtains the correct number of clusters for the Iris and Seeds data sets. The k-means+Gap-stat only obtains a correct number of clusters for the Seeds data set. The RL-FCM algorithm obtains the correct number of clusters for the Iris, Seeds, and Waveform (version 1) data sets. Note that the results in parentheses are the percentages of runs in which an algorithm obtains the correct number $c$ of clusters.

Example 6. In this example, we use six medical data sets from the UCI Machine Learning Repository [34], known as SPECT, Parkinsons, WPBC, Colon, Lung, and Nci9. Detailed descriptions of these data sets, with feature characteristics, the number $c$ of classes, the number $n$ of instances, and the number $d$ of features, are listed in Table XI. In this experiment, we first preprocess the SPECT, Parkinsons, WPBC, Colon, and Lung data sets using the matrix factorization technique. We also conduct experiments to compare the proposed U-k-means with R-EM, C-FS, k-means with the true $c$, k-means+Gap-stat, X-means, and RL-FCM. For C-FS, k-means with the true $c$, k-means+Gap-stat, and X-means, we run experiments with 25 different initializations and report their results with the average AR (AV-AR) and the percentage of runs in which they obtain the correct number $c$ of clusters, as shown in Table XII.


It is seen that the proposed U-k-means obtains the correct number of clusters for SPECT, Parkinsons, WPBC, Colon, and Lung, while for the Nci9 data set the U-k-means algorithm obtains $c^* = 8$, which is very close to the true $c = 9$. In terms of AR, the U-k-means algorithm performs significantly better than the others. The R-EM algorithm estimates the correct number of clusters on SPECT; however, it underestimates the number of clusters on Parkinsons and overestimates it on WPBC. The results of R-EM on the Colon, Lung, and Nci9 data sets are missing because the probabilities of a data point belonging to the $k$th class on these data sets become illegitimate proportions at the first iteration. The C-FS algorithm performs better than k-means+Gap-stat and X-means. The RL-FCM algorithm estimates the correct number of clusters $c$ for the SPECT, Parkinsons, and WPBC data sets, while it overestimates the number of clusters on Colon, Lung, and Nci9 with $c^* = 62$, $c^* = 9$, and $c^* = 60$, respectively.

TABLE IV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE DATA SET OF FIG. 3(A)
          U-k-means      R-EM         C-FS                 k-means with true c   X-means             RL-FCM
True c    c*    AR       c*    AR     c*        AV-AR      AV-AR                 c*        AV-AR     c*    AR
14        14    1.00     14    1.00   14 (96%)  0.9772     0.8160                14 (76%)  1.00      14    1.00

TABLE V
MIXING PROPORTIONS, MEAN VALUES, AND COVARIANCE MATRICES OF EXAMPLE 3
Mixing proportions:  α1 = 0.2, α2 = 0.3, α3 = 0.1, α4 = 0.1, α5 = 0.2, α6 = 0.1
Mean values:
  μ1 = (2, 4, 6, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 3, 5, 0, 0, 1)
  μ2 = (0, 1, 3, 5, 0.1, 0.1, 0.5, 0.5, 0, 0, 2, 4, 3, 1, 1, 1, 0.25, 0.5, 0.7, 2.5)
  μ3 = (5, 5, 5, 5, 4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 1, 1, 1, 1)
  μ4 = (2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7)
  μ5 = (1.25, 1.3, 1.45, 1.5, 2.25, 2.3, 2.45, 2.5, 1, 1, 1, 1, 3, 3, 3, 3, 2, 2, 2, 2)
  μ6 = (0, 0, 1, 1, 0.5, 0.5, 2.5, 2.5, 5, 5, 1, 1, 5, 5, 0, 0, 0.75, 1.5, 3.5, 5.5)
Covariance matrices:  Σk = I (20x20) for all k

TABLE VI
RESULTS OF THE SEVEN VALIDITY INDICES FOR THE DATA SET OF EXAMPLE 3
True c   CH        SW        DB     Gap-stat   DNo       DNg      DNs
6        6 (88%)   6 (88%)   2, 3   6 (88%)    6 (16%)   6 (8%)   6 (12%)

TABLE VII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 3
          U-k-means      R-EM        C-FS               k-means with true c   X-means            RL-FCM
True c    c*    AR       c*    AR    c*        AR       AV-AR                 c*         AR      c*    AR
6         6     1.00     3     -     6 (84%)   0.8155   0.7833                6 (100%)   1.00    3     -

TABLE VIII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 4
          U-k-means      R-EM        C-FS                 k-means with true c   X-means          RL-FCM
True c    c*    AR       c*    AR    c*        AV-AR      AV-AR                 c*    AV-AR      c*    AR
9         9     1.00     12    -     9 (96%)   0.7641     0.9190                2     -          2     -


TABLE IX
DESCRIPTIONS OF THE EIGHT DATA SETS USED IN EXAMPLE 5
Dataset                 Feature characteristics       Number c of clusters   Number n of instances   Number d of features
Iris                    Real                          3                      150                     4
Seeds                   Real                          3                      210                     7
Australian              Categorical, Integer, Real    2                      690                     14
Flowmeter D             Real                          4                      180                     43
Sonar                   Real                          2                      208                     60
Wine                    Integer, Real                 3                      178                     13
Horse                   Categorical, Integer, Real    2                      368                     27
Waveform (Version 1)    Real                          3                      5000                    21

TABLE X
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR DIFFERENT REAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
                       U-k-means       R-EM           C-FS                  k-means with   k-means+Gap-stat    X-means            RL-FCM
Data set      True c   c*     AR       c*      AR     c*          AV-AR     true c AV-AR   c*         AR       c*         AR      c*    AR
Iris          3        3      0.8933   3       0.8600 3 (84%)     0.7521    0.7939         4, 5       -        2          -       3     0.9067
Seeds         3        3      0.9048   3       0.8476 3 (100%)    0.7944    0.8864         3 (100%)   0.8952   3 (100%)   0.890   3     0.8952
Australian    2        2      0.5551   4       -      2 (100%)    0.5551    0.5551         6          -        6          -       26    -
Flowmeter D   4        4      0.6056   3       -      4 (100%)    0.4338    0.5833         9, 10      -        10         -       13    -
Sonar         2        2      0.5337   5       -      2 (80%)     0.4791    0.4791         5, 6       -        3, 4       -       4     -
Wine          3        3      0.7022   2       -      3 (100%)    0.5557    0.6851         2          -        3 (64%)    0.62    2     -
Horse         2        2      0.6576   4, 6, 8, 10, 14  -  2 (100%)  0.6033  0.6055        3          -        2 (88%)    0.50    7     -
Waveform (Version 1)   3   3  0.4020   1       -      2           -         0.3900         1          -        8          -       3     0.3972

TABLE XI
DESCRIPTIONS OF THE SIX MEDICAL DATA SETS USED IN EXAMPLE 6
Dataset       Feature characteristics     Number c of clusters   Number n of instances   Number d of features
SPECT         Categorical                 2                      187                     22
Parkinsons    Real                        2                      195                     22
WPBC          Real                        2                      198                     33
Colon         Discrete, Binary            2                      62                      2000
Lung          Continuous, Multi-class     5                      203                     3312
Nci9          Discrete, Multi-class       9                      60                      9712


TABLE XII
RESULTS FROM VARIOUS ALGORITHMS FOR THE SIX MEDICAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE
                        U-k-means      R-EM           C-FS                 k-means with   k-means+Gap-stat    X-means              RL-FCM
Data set      True c    c*    AR       c*    AV-AR    c*         AR        true c         c*           AR     c*         AV-AR     c*    AV-AR
SPECT         2         2     0.920    2     0.562    2 (84%)    0.8408    0.5262         5, 6         -      2 (100%)   0.5119    2     0.588
Parkinsons    2         2     0.754    1     -        2 (100%)   0.7436    0.5183         2 (100%)     0.62   4, 5       -         2     0.754
WPBC          2         2     0.763    198   -        2 (100%)   0.7576    0.5927         4            -      3          -         2     0.763
Colon         2         2     0.645    -     -        2 (100%)   0.5813    0.4768         4            -      2 (100%)   0.45      62    -
Lung          5         5     0.788    -     -        5 (100%)   0.6859    0.6818         4, 6, 7, 8   -      2          -         9     -
Nci9          9         8     -        -     -        2, 4       -         0.32           2            -      2          -         60    -

Example 7. In this example, we apply the U-k-means clustering algorithm to the Yale Face 32x32 data set, shown in Fig. 5. It contains 165 grayscale images in GIF format of 15 individuals [36], with 11 images per subject under different facial expressions or configurations: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. In the experiment, we use 135 of the 165 grayscale images. The results from the different algorithms are shown in Table XIII. From Table XIII, although the U-k-means cannot correctly estimate the true number $c = 15$ of clusters for the Yale Face data set, it gives $c^* = 16$, which is close to the true $c = 15$. The R-EM result is missing because the probabilities of a data point belonging to the $k$th class on this data set become illegitimate proportions at the first iteration. The C-FS gives $c^* = 12$ and X-means gives $c^* = 2$ or 3. The k-means clustering with the true $c = 15$ gives AV-AR = 0.34, while RL-FCM gives $c^* = 2$.

FIGURE 5. Yale Face 32x32.

TABLE XIII
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR THE YALE FACE DATA SET WITH THE BEST RESULTS IN BOLDFACE
                     U-k-means      R-EM        C-FS              k-means with true c   X-means          RL-FCM
Data set    True c   c*    AR       c*    AR    c*    AV-AR       AV-AR                 c*      AV-AR    c*    AV-AR
Yale Face   15       16    -        -     -     12    -           0.34                  2, 3    -        2     -

TABLE XIV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE 100-IMAGE SAMPLE OF THE CIFAR-10 DATA SET
                     U-k-means             R-EM         C-FS                   k-means with true c   X-means         RL-FCM
Data set    True c   c*           AV-AR    c*    AR     c*           AV-AR     AV-AR                 c*    AV-AR     c*    AV-AR
CIFAR-10    10       10 (42.5%)   0.311    -     -      10 (3.03%)   0.295     0.280                 2     -         -     -

FIGURE 6. The 100-image sample of CIFAR-10.



Example 8. In this example, we apply the U-k-means clustering algorithm to the CIFAR-10 color images [37]. The CIFAR-10 data set consists of 60000 32x32 color images in 10 classes, where each pixel is an RGB triplet of unsigned bytes between 0 and 255. There are 50000 training images and 10000 test images, and each red, green, and blue channel contains 1024 entries per image. The 10 classes in the data set are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Specifically, we take the first 100 color images (10 images per class) from the CIFAR-10 60K-image data set for our experiment, with the remaining 59900 images as the retrieval database. Fig. 6 shows the 100-image sample from the CIFAR-10 data set. The results for the number of clusters and AR are given in Table XIV. From Table XIV, it is seen that the proposed U-k-means and k-means with the true $c = 10$ give better results on the 100-image sample of the CIFAR-10 data set. The U-k-means obtains the correct number $c^* = 10$ of clusters in 42.5% of runs with AV-AR = 0.28, and k-means with $c = 10$ gives the same AV-AR = 0.28. For C-FS, the percentage of runs with the correct number $c^* = 10$ of clusters is only 16.7%, with AV-AR = 0.24. X-means underestimates the number of clusters with $c^* = 2$. The results from R-EM and RL-FCM on this data set are missing because the probabilities of a data point belonging to the $k$th class become illegitimate proportions at the first iteration.

We further analyze the performance of U-k-means, R-EM, C-FS, and RL-FCM by comparing their average running times over 25 runs, as shown in Table XV. All algorithms are implemented in MATLAB 2017b. From Table XV, it is seen that the proposed U-k-means is the fastest among these algorithms for all data sets, except that the C-FS algorithm is the fastest for the Waveform data set. Furthermore, as mentioned in Section III, the proposed U-k-means objective function is simpler than the RL-FCM objective function, which saves running time; from Table XV, it is seen that the proposed U-k-means algorithm indeed runs faster than the RL-FCM algorithm.

TABLE XV
COMPARISON OF AVERAGE RUNNING TIMES (IN SECONDS) OF U-K-MEANS, R-EM, C-FS, AND RL-FCM FOR ALL DATA SETS. THE FASTEST RUNNING TIMES ARE HIGHLIGHTED
Data sets            U-k-means   R-EM       C-FS        RL-FCM
Synthetic data sets
Example 1            0.3842      4.8921     5.8050      1.3688
Example 2            2.9185      13.6157    7.3559      6.0444
Example 3            2.1625      2.7938     10.2817     3.2924
Example 4            117.2595    742.14     35.6417     438.047
UCI data sets
Iris                 0.2159      1.1842     6.31581     0.4184
Seeds                0.1455      2.0400     5.2702      0.4472
Australian           2.0434      5.8039     6.1772      2.3829
Flowmeter D          0.2834      0.6969     5.6230      0.3054
Sonar                0.1747      0.3148     5.8564      0.3963
Wine                 0.1980      1.4837     5.8094      0.3060
Horse                0.6072      2.5989     5.3442      0.6272
Waveform             330.748     -          113.8162    474.165
Medical data sets
SPECT                0.1354      0.7211     5.9079      0.3411
Parkinsons           0.1487      0.5856     4.9534      0.3958
WPBC                 0.1512      0.7922     5.2152      0.4036
Colon                0.1653      -          4.9608      0.2676
Lung                 1.1239      -          5.2485      1.1167
Nci9                 0.6186      -          6.4794      0.5096
Image data sets
Yale Face 32x32      0.3741      -          5.9634      0.4286
CIFAR-10             2.6561      -          6.4500      -

V. CONCLUSIONS
In this paper, we propose a new schema with a learning framework for the k-means clustering algorithm. We adopt the merit of entropy-type penalty terms to construct a competition schema. The proposed U-k-means algorithm uses the number of data points as the initial number of clusters, which solves the initialization problem. During the iterations, the U-k-means algorithm discards extra clusters, so that an optimal number of clusters can be found automatically according to the structure of the data. The advantages of U-k-means are that it is free of initializations and parameter selection and that it is robust to different cluster volumes and shapes while automatically finding the number of clusters. The proposed U-k-means algorithm was evaluated on several synthetic and real data sets and compared with most existing algorithms, such as the R-EM, C-FS, k-means with the true number c, k-means+Gap-stat, and X-means algorithms. The results demonstrate the superiority of the U-k-means clustering algorithm.

REFERENCES
[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[2] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley, 1990.
[3] G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker, 1988.
[4] A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," J. Roy. Stat. Soc., Ser. B, vol. 39, pp. 1-38, 1977.


[5] J. Yu, C. Chaomurilige, M.S. Yang, "On convergence and parameter selection of the EM and DA-EM algorithms for Gaussian mixtures," Pattern Recognition, vol. 77, pp. 188-203, 2018.
[6] A.K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, pp. 651-666, 2010.
[7] M.S. Yang, S.J. Chang-Chien and Y. Nataliani, "A fully-unsupervised possibilistic c-means clustering method," IEEE Access, vol. 6, pp. 78308-78320, 2018.
[8] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, University of California Press, 1967.
[9] M. Alhawarat and M. Hegazi, "Revisiting k-means and topic modeling, a comparison study to cluster Arabic documents," IEEE Access, vol. 6, pp. 42740-42749, 2018.
[10] Y. Meng, J. Liang, F. Cao, Y. He, "A new distance with derivative information for functional k-means clustering algorithm," Information Sciences, vol. 463-464, pp. 166-185, 2018.
[11] Z. Lv, T. Liu, C. Shi, J.A. Benediktsson, H. Du, "Novel land cover change detection method based on k-means clustering and adaptive majority voting using bitemporal remote sensing images," IEEE Access, vol. 7, pp. 34425-34437, 2019.
[12] J. Zhu, Z. Jiang, G.D. Evangelidis, C. Zhang, S. Panga, Z. Li, "Efficient registration of multi-view point sets by k-means clustering," Information Sciences, vol. 488, pp. 205-218, 2019.
[13] M. Halkidi, Y. Batistakis, M. Vazirgiannis, "On clustering validation techniques," J. Intell. Inf. Syst., vol. 17, pp. 107-145, 2001.
[14] R.E. Kass, A.E. Raftery, "Bayes factors," Journal of the American Statistical Association, vol. 90, pp. 773-795, 1995.
[15] H. Bozdogan, "Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions," Psychometrika, vol. 52, pp. 345-370, 1987.
[16] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters," J. Cybernetics, vol. 3, pp. 32-57, 1974.
[17] D. Davies and D. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224-227, 1979.
[18] P.J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
[19] T. Calinski, J. Harabasz, "A dendrite method for cluster analysis," Commun. Stat.-Theory Methods, vol. 3, pp. 1-27, 1974.
[20] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society: Series B, vol. 63, pp. 411-423, 2001.
[21] N.R. Pal and J. Biswas, "Cluster validation using graph theoretic concepts," Pattern Recognition, vol. 30, pp. 847-857, 1997.
[22] N. Ilc, "Modified Dunn's cluster validity index based on graph theory," Przeglad Elektrotechniczny (Electrical Review), vol. 2, pp. 126-131, 2012.
[23] D. Pelleg, A. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," Proc. of the 17th International Conference on Machine Learning, pp. 727-734, San Francisco, 2000.
[24] E. Rendon, I. Abundez, A. Arizmendi, E.M. Quiroz, "Internal versus external cluster validation indexes," Int. J. Computers and Communications, vol. 5, pp. 27-34, 2011.
[25] Y. Lei, J.C. Bezdek, S. Romano, N.X. Vinh, J. Chan, J. Bailey, "Ground truth bias in external cluster validity indices," Pattern Recognition, vol. 65, pp. 58-70, 2017.
[26] J. Wu, J. Chen, H. Xiong, M. Sie, "External validation measures for k-means clustering: a data distribution perspective," Expert Syst. Appl., vol. 36, pp. 6050-6061, 2009.
[27] L.J. Deborah, R. Baskaran, A. Kannan, "A survey on internal validity measure for cluster validation," Int. J. Comput. & Eng. Surv., vol. 1, pp. 85-102, 2010.
[28] I.H. Witten, E. Frank, M.A. Hall and C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2000.
[29] G. Guo, L. Chen, Y. Ye and Q. Jiang, "Cluster validation method for determining the number of clusters in categorical sequences," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2936-2948, 2017.
[30] A. Rodriguez, A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[31] M.S. Yang, C.Y. Lai and C.Y. Lin, "A robust EM clustering algorithm for Gaussian mixture models," Pattern Recognition, vol. 45, pp. 3950-3961, 2012.
[32] M.A.T. Figueiredo, A.K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 381-396, 2002.
[33] M.S. Yang and Y. Nataliani, "Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters," Pattern Recognition, vol. 71, pp. 45-59, 2017.
[34] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, a huge collection of artificial and real-world data sets, 1998. https://archive.ics.uci.edu/ml/datasets.html
[35] D. Cai, X. He, J. Han and T.S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, 2010.
[36] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, "Learning a spatially smooth subspace for face recognition," Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), pp. 1-7, 2007.
[37] A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, Technical report, University of Toronto, 2009.



Kristina P. Sinaga received the B.S. and M.S. degrees in mathematics from the University of Sumatera Utara, Indonesia. She is a Ph.D. student in the Department of Applied Mathematics, Chung Yuan Christian University, Taiwan. Her research interests include clustering and pattern recognition.

Miin-Shen Yang received the B.S. degree in mathematics from Chung Yuan Christian University, Chung-Li, Taiwan, in 1977, the M.S. degree in applied mathematics from National Chiao-Tung University, Hsinchu, Taiwan, in 1980, and the Ph.D. degree in statistics from the University of South Carolina, Columbia, USA, in 1989.
In 1989, he joined the faculty of the Department of Mathematics at Chung Yuan Christian University (CYCU) as an Associate Professor, where he has been a Professor since 1994. From 1997 to 1998, he was a Visiting Professor with the Department of Industrial Engineering, University of Washington, Seattle, USA. During 2001-2005, he was the Chairman of the Department of Applied Mathematics at CYCU. Since 2012, he has been a Distinguished Professor of the Department of Applied Mathematics and the Director of the Chaplain's Office, and he is now the Dean of the College of Science at CYCU. His research interests include clustering algorithms, fuzzy clustering, soft computing, pattern recognition, and machine learning. Dr. Yang was an Associate Editor of the IEEE Transactions on Fuzzy Systems (2005-2011) and is an Associate Editor of Applied Computational Intelligence & Soft Computing.

