
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.2988796, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier XXXXXX

Unsupervised K-Means Clustering Algorithm
Kristina P. Sinaga and Miin-Shen Yang
Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li 32023, Taiwan
Corresponding author: Miin-Shen Yang (e-mail: [email protected]).
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 107-2118-M-033-002-MY2.

ABSTRACT The k-means algorithm is generally the best known and most widely used clustering method. Various extensions of k-means have been proposed in the literature. Although k-means is an unsupervised learning approach to clustering in pattern recognition and machine learning, the k-means algorithm and its extensions are always influenced by initializations and require the number of clusters a priori. That is, the k-means algorithm is not exactly an unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initializations and parameter selection and can also simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed. Comparisons between the proposed U-k-means and other existing methods are made. Experimental results and comparisons actually demonstrate these good aspects of the proposed U-k-means clustering algorithm.

INDEX TERMS Clustering, K-means, Number of clusters, Initializations, Unsupervised learning schema,
Unsupervised k-means (U-k-means)
I. INTRODUCTION
Clustering is a useful tool in data science. It is a method for finding cluster structure in a data set that is characterized by the greatest similarity within the same cluster and the greatest dissimilarity between different clusters. Hierarchical clustering was the earliest clustering method used by biologists and social scientists, whereas cluster analysis became a branch of statistical multivariate analysis [1,2]. Clustering is also an unsupervised learning approach in machine learning. From a statistical viewpoint, clustering methods are generally divided into probability model-based approaches and nonparametric approaches. The probability model-based approaches assume that the data points are drawn from a mixture probability model, so that a mixture likelihood approach to clustering is used [3]. In model-based approaches, the expectation-maximization (EM) algorithm is the most used [4,5]. For nonparametric approaches, clustering methods are mostly based on an objective function of similarity or dissimilarity measures, and these can be divided into hierarchical and partitional methods, where partitional methods are the most used [2,6,7].

In general, partitional methods suppose that the data set can be represented by finite cluster prototypes with their own objective functions. Therefore, defining the dissimilarity (or distance) between a point and a cluster prototype is essential for partitional methods. It is known that the k-means algorithm is the oldest and most popular partitional method [1,8]. The k-means clustering has been widely studied with various extensions in the literature and applied in a variety of substantive areas [9,10,11,12]. However, these k-means clustering algorithms are usually affected by initializations and need to be given a number of clusters a priori. In general, the cluster number is unknown. In this case, validity indices can be used to find a cluster number, where they are supposed to be independent of clustering algorithms [13]. Many cluster validity indices for the k-means clustering algorithm have been proposed in the literature, such as the Bayesian information criterion (BIC) [14], Akaike information criterion (AIC) [15], Dunn's index [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22].

For estimating the number of clusters, Pelleg and Moore [23] extended k-means, called X-means, by making local decisions for cluster centers in each iteration of k-means, splitting them to get better clustering. Users need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC or AIC, is used to do the splitting process. Although these k-means clustering algorithms, such as cluster validity indices and X-means, can find the number of clusters, they use extra iteration steps outside the clustering algorithms. As far as we know, no work in the literature allows k-means to be free of initializations and parameter selection and also simultaneously find the number of clusters. We suppose that this is due to the difficulty of constructing this kind of k-means algorithm.

In this paper, we first construct a learning procedure for the k-means clustering algorithm. This learning procedure can automatically find the number of clusters without any initialization and parameter selection. We first consider an entropy penalty term for adjusting bias, and then create a learning schema for finding the number of clusters. The organization of this paper is as follows. In Section II, we review some related works. In Section III, we first construct the learning schema and then propose the unsupervised k-means clustering (U-k-means) with automatic determination of the number of clusters. The computational complexity of the proposed U-k-means algorithm is also analyzed. In Section IV, several experimental examples and comparisons with numerical and real data sets are provided to demonstrate the effectiveness of the proposed U-k-means clustering algorithm. Finally, conclusions are stated in Section V.

II. RELATED WORKS
In this section, we review several works that are closely related to ours. K-means is one of the most popular unsupervised learning algorithms that solve the well-known clustering problem. Let X = {x_1, ..., x_n} be a data set in a d-dimensional Euclidean space R^d. Let A = {a_1, ..., a_c} be the c cluster centers. Let z = [z_ik]_{n x c}, where z_ik is a binary variable (i.e., z_ik in {0, 1}) indicating whether the data point x_i belongs to the k-th cluster, k = 1, ..., c. The k-means objective function is

    J(z, A) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 .

The k-means algorithm is iterated through the necessary conditions for minimizing the k-means objective function J(z, A), with updating equations for cluster centers and memberships, respectively, as

    a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}   and   z_{ik} = \begin{cases} 1 & \text{if } \| x_i - a_k \|^2 = \min_{1 \le k' \le c} \| x_i - a_{k'} \|^2 \\ 0 & \text{otherwise,} \end{cases}

where \| x_i - a_k \| is the Euclidean distance between the data point x_i and the cluster center a_k. There exists a difficult problem in k-means, i.e., it needs to be given a number of clusters a priori. However, the number of clusters is generally unknown in real applications. Another problem is that the k-means algorithm is always affected by initializations.
To resolve the above issue of finding the number c of clusters, cluster validity issues have received much attention. There are several clustering validity indices available for estimating the number c of clusters. Clustering validity indices can be grouped into two major categories: external and internal [24]. External indices evaluate clustering results by comparing the cluster memberships assigned by a clustering algorithm with previously known knowledge, such as externally supplied class labels [25,26]. In contrast, internal indices evaluate the goodness of a cluster structure by focusing on the intrinsic information of the data itself [27], so we consider only internal indices. In this paper, the most widely used internal indices, such as the original Dunn's index (DNo) [16], Davies-Bouldin index (DB) [17], Silhouette Width (SW) [18], Calinski and Harabasz index (CH) [19], Gap statistic [20], generalized Dunn's index (DNg) [21], and modified Dunn's index (DNs) [22], are chosen for finding the number of clusters and then compared with our proposed U-k-means clustering algorithm.

The DNo [16], DNg [21], and DNs [22] are supposed to be the simplest (internal) validity indices, where they compare the size of clusters with the distance between clusters. The DNo, DNg, and DNs indices are computed as the ratio between the minimum distance between two clusters and the size of the largest cluster, so we are looking for the maximum value of the index. The Davies-Bouldin index (DB) [17] measures the average similarity between each cluster and its most similar one. The DB validity index attempts to maximize these between-cluster distances while minimizing the distance between the cluster centroid and the other data objects. The Silhouette value [18] is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Thus, large positive and large negative silhouette widths (SW) indicate that the corresponding object is well clustered or wrongly clustered, respectively. Any objects with an SW validity index around zero are considered not to be clearly discriminated between clusters. The Gap statistic [20] is a cluster validity measure based upon a statistical hypothesis test. The Gap statistic works by comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution at each value of c. The optimal number of clusters is the smallest c.
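As a rough illustration of how such internal indices are used in practice, the sketch below scores k-means partitions for a range of candidate cluster numbers with three of the indices mentioned above (SW, CH, and DB) using scikit-learn. It is a generic index-based selection loop under assumed names (X, c_max), not the exact protocol used in the experiments of Section IV.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def select_c_by_indices(X, c_max=10, n_init=25, seed=0):
    """Run k-means for c = 2..c_max and report SW, CH, and DB for each c."""
    scores = {}
    for c in range(2, c_max + 1):
        labels = KMeans(n_clusters=c, n_init=n_init, random_state=seed).fit_predict(X)
        scores[c] = {
            "SW": silhouette_score(X, labels),          # larger is better
            "CH": calinski_harabasz_score(X, labels),   # larger is better
            "DB": davies_bouldin_score(X, labels),      # smaller is better
        }
    best_by_sw = max(scores, key=lambda c: scores[c]["SW"])
    return scores, best_by_sw
```

Note that each index may favor a different c, which is one reason Section IV reports all seven indices side by side.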
For an efficient method for determining the number of clusters, X-means, proposed by Pelleg and Moore [23], should be the most well known and used in the literature, for example by Witten et al. [28] and Guo et al. [29]. In X-means, Pelleg and Moore [23] extended k-means by making local decisions for cluster centers in each iteration of k-means, splitting them to get better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as BIC, is used to do the splitting process. Although X-means has been the most used method for clustering without a given number of clusters a priori, it still needs a specified range of cluster numbers based on a criterion, such as BIC. On the other hand, it is still influenced by the initializations of the algorithm. Furthermore, Rodriguez and Laio [30] proposed an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities, which they called clustering by fast search (C-FS) and find of density peaks. To identify the cluster centers, C-FS uses the heuristic approach of a decision graph. However, the performance of C-FS highly depends on two factors, i.e., the local density ρ_i and the cutoff distance δ_i.
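The criterion-driven selection that X-means relies on can be illustrated, in a simplified form, by fitting models over a range of cluster numbers and keeping the one with the lowest BIC. The sketch below uses scikit-learn's GaussianMixture as a stand-in scorer; it is not the X-means splitting procedure of [23], only a minimal example of BIC-based model selection under assumed names (X, c_min, c_max).

```python
from sklearn.mixture import GaussianMixture

def select_c_by_bic(X, c_min=2, c_max=15, seed=0):
    """Fit a Gaussian mixture for each candidate c and keep the lowest-BIC model."""
    best_c, best_bic = None, float("inf")
    for c in range(c_min, c_max + 1):
        gmm = GaussianMixture(n_components=c, random_state=seed).fit(X)
        bic = gmm.bic(X)   # lower BIC indicates a better trade-off of fit vs. complexity
        if bic < best_bic:
            best_c, best_bic = c, bic
    return best_c, best_bic
```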
III. THE UNSUPERVISED K-MEANS CLUSTERING ALGORITHM
There has always existed a difficult problem in the k-means algorithm and its extensions throughout their long history in the literature. That is, they are usually affected by initializations and require a given number of clusters a priori. We mentioned that the X-means algorithm has been used for clustering without a given number of clusters a priori, but it still needs to specify a range of numbers of clusters based on BIC, and it is still influenced by initializations. To construct a k-means clustering algorithm that is free of initializations and automatically finds the number of clusters, we use the entropy concept. We borrow the idea from the EM algorithm by Yang et al. [31]. We first consider proportions α_k, in which the term α_k is seen as the probability of one data point belonging to the k-th class. Hence, we use -ln α_k as the information in the occurrence of one data point belonging to the k-th class, and so \sum_{k=1}^{c} α_k (-\ln α_k) becomes the average of this information. In fact, the term -\sum_{k=1}^{c} α_k \ln α_k is the entropy over the proportions α_k. When α_k = 1/c for all k, we say that there is no information about α_k. At this point, the entropy achieves its maximum value. Therefore, we add this term to the k-means objective function J(z, A) as a penalty. We then construct a schema to estimate α_k by minimizing the entropy to get the most information about α_k. Minimizing -\sum_{k=1}^{c} α_k \ln α_k is equivalent to maximizing \sum_{k=1}^{c} α_k \ln α_k. For this reason, we use -\sum_{k=1}^{c} α_k \ln α_k as a penalty term for the k-means objective function J(z, A). Thus, we propose a novel objective function as follows:

    J_{UKM1}(z, A, α) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 - β \sum_{k=1}^{c} n α_k \ln α_k    (1)

In order to determine the number of clusters, we next consider another entropy term. We combine the membership variables z_ik and the proportions α_k. Using the basis of entropy theory, we suggest a new term in the form of z_ik ln α_k. Thus, we propose the unsupervised k-means (U-k-means) objective function as follows:

    J_{UKM2}(z, A, α) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 - β \sum_{k=1}^{c} n α_k \ln α_k - γ \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln α_k    (2)
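To make the roles of the two penalty terms in Eq. (2) concrete, the following small helper evaluates the U-k-means objective J_UKM2 for a given hard partition, set of centers, and mixing proportions. It is only a sketch following the formula above, with illustrative names (X, Z, A, alpha, beta, gamma).

```python
import numpy as np

def j_ukm2(X, Z, A, alpha, beta, gamma):
    """Evaluate Eq. (2): distortion minus the two entropy-type penalty terms."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)   # (n, c) squared distances
    distortion = np.sum(Z * d2)                               # sum_i sum_k z_ik ||x_i - a_k||^2
    log_alpha = np.log(alpha)
    entropy_term = beta * n * np.sum(alpha * log_alpha)       # beta * sum_k n alpha_k ln(alpha_k)
    membership_term = gamma * np.sum(Z * log_alpha[None, :])  # gamma * sum_i sum_k z_ik ln(alpha_k)
    return distortion - entropy_term - membership_term
```

With beta = gamma = 0 this reduces to the plain k-means objective J(z, A), as noted after Eq. (2).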

We know that, when β and γ in Eq. (2) are zero, it becomes the original k-means objective. The Lagrangian of Eq. (2) is

    \tilde{J}(z, A, α, λ) = \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \| x_i - a_k \|^2 - β \sum_{k=1}^{c} n α_k \ln α_k - γ \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln α_k - λ \left( \sum_{k=1}^{c} α_k - 1 \right)    (3)

We first take the partial derivative of the Lagrangian (3) with respect to z_ik and set it to zero. Thus, the updating equation for z_ik is obtained as follows:

    z_{ik} = \begin{cases} 1 & \text{if } \| x_i - a_k \|^2 - γ \ln α_k = \min_{1 \le k' \le c} \left( \| x_i - a_{k'} \|^2 - γ \ln α_{k'} \right) \\ 0 & \text{otherwise.} \end{cases}    (4)

The updating equation for the cluster center a_k is as follows:

    a_k = \frac{\sum_{i=1}^{n} z_{ik} x_i}{\sum_{i=1}^{n} z_{ik}}    (5)

We next take the partial derivative of the Lagrangian with respect to α_k and obtain

    \frac{\partial \tilde{J}}{\partial α_k} = - β n \ln α_k - β n - γ \frac{\sum_{i=1}^{n} z_{ik}}{α_k} - λ = 0.

Multiplying by α_k and summing over k with \sum_{k=1}^{c} α_k = 1, we have - β n \sum_{k=1}^{c} α_k \ln α_k - β n - γ n - λ = 0, and so λ = - β n \sum_{k=1}^{c} α_k \ln α_k - β n - γ n. Substituting λ back and solving for α_k, we then get the updating equation for α_k as follows:

    α_k^{(t+1)} = \frac{\sum_{i=1}^{n} z_{ik}}{n} + \frac{β}{γ} α_k^{(t)} \left( \ln α_k^{(t)} - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right)    (6)

where t denotes the iteration number in the algorithm.

We should mention that Eq. (6) created above is important for our proposed U-k-means clustering method. In Eq. (6), \sum_{s=1}^{c} α_s \ln α_s is the weighted mean of \ln α_k with the weights α_1, ..., α_c. For the k-th mixing proportion α_k, if \ln α_k^{(t)} is less than the weighted mean, then the new mixing proportion α_k^{(t+1)} will become smaller than the old α_k^{(t)}. That is, the smaller proportions will decrease and the bigger proportions will increase in the next iteration, and then competition will occur. This situation is similar to the formula in Figueiredo and Jain [32]. If α_k \le 0 or α_k < 1/n for some 1 \le k \le c^{(t)}, they are considered to be illegitimate proportions. In this situation, we discard those clusters and then update the cluster number c^{(t)} to be

    c^{(t+1)} = c^{(t)} - \left| \left\{ k \mid α_k^{(t+1)} < 1/n,\; k = 1, \dots, c^{(t)} \right\} \right|    (7)

where |{ }| denotes the cardinality of the set { }. After updating the number of clusters c, the remaining mixing proportions α_{k*} and the corresponding z_{ik*} need to be renormalized by

    α_{k*}^{(t+1)} = \frac{α_{k*}}{\sum_{s=1}^{c^{(t+1)}} α_{s*}}    (8)

    z_{ik*}^{(t+1)} = \frac{z_{ik*}}{\sum_{s=1}^{c^{(t+1)}} z_{is*}}    (9)

We next turn to the parameter learning of β and γ for the two terms \sum_{i=1}^{n} \sum_{k=1}^{c} z_{ik} \ln α_k and \sum_{k=1}^{c} α_k \ln α_k. Based on some increasing learning rates of the cluster number, e^{-c^{(t)}/100}, e^{-c^{(t)}/250}, e^{-c^{(t)}/500}, e^{-c^{(t)}/750}, and e^{-c^{(t)}/1000}, it is seen that e^{-c^{(t)}/100} decreases faster, but e^{-c^{(t)}/500}, e^{-c^{(t)}/750}, and e^{-c^{(t)}/1000} decrease more slowly. We suppose that the parameter should not decrease too slowly or too quickly, and so we set the parameter γ as

    γ^{(t)} = e^{-c^{(t)}/250}    (10)

Under the competition schema setting, the algorithm can automatically reduce the number of clusters and also simultaneously obtain the estimates of the parameters.
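Putting Eqs. (6)-(10) together, one iteration of the proportion learning can be sketched as follows. This is a minimal NumPy illustration of the competition-and-discard step under the naming used above (Z, alpha, beta, gamma, c); it is one reading of the formulas, not the authors' released code.

```python
import numpy as np

def update_proportions(Z, alpha, beta, gamma):
    """Eqs. (6)-(9): compete the mixing proportions, discard tiny clusters, renormalize."""
    Z = Z.astype(float)
    n = Z.shape[0]
    weighted_mean = np.sum(alpha * np.log(alpha))              # sum_s alpha_s ln(alpha_s)
    alpha_new = Z.sum(axis=0) / n + (beta / gamma) * alpha * (np.log(alpha) - weighted_mean)  # Eq. (6)
    keep = alpha_new >= 1.0 / n                                # Eq. (7): drop illegitimate proportions
    alpha_new, Z = alpha_new[keep], Z[:, keep]
    alpha_new = alpha_new / alpha_new.sum()                    # Eq. (8): renormalize proportions
    row_sum = Z.sum(axis=1, keepdims=True)
    Z = np.divide(Z, row_sum, out=np.zeros_like(Z), where=row_sum > 0)  # Eq. (9): renormalize memberships
    return Z, alpha_new, keep

def gamma_schedule(c):
    """Eq. (10): gamma^(t) = exp(-c^(t) / 250)."""
    return float(np.exp(-c / 250.0))
```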

 
Furthermore, the parameter β can help us control the competition. We discuss the variable β as follows. We first apply the rule β \ge e^{-1} ( - α_k \ln α_k ) \ge 0. If 0 \le α_k \le 1 for all k and we let E = - \sum_{s=1}^{c} α_s \ln α_s \ge 0, then we have β \ge e^{-1} α_k E = e^{-1} α_k ( - \sum_{s=1}^{c} α_s \ln α_s ) \ge 0. Thus, we obtain

    β \ge e^{-1} α_k \left( \ln α_k - \sum_{s=1}^{c} α_s \ln α_s \right)    (11)

Under the constraint \sum_{k=1}^{c} α_k = 1, and only when α_k \ge 1/2, we can have ( \ln α_k - \sum_{s=1}^{c} α_s \ln α_s ) \ge 0. To avoid the situation where all α_k \to 0, the left-hand side of inequality (11) must be larger than \max \{ α_k \mid α_k \ge 1/2,\; k = 1, 2, \dots, c \}. We now have an elementary condition on β with β \ge e^{-1} \max \{ α_k \mid α_k \ge 1/2,\; k = 1, 2, \dots, c \}. Thus, we have β \ge \max \{ α_k e^{-1} \mid α_k \ge 1/2,\; k = 1, 2, \dots, c \} \ge e^{-1}/2. Therefore, to prevent β from becoming too large, we use β \in [0, 1]. If the difference between α_k^{(t+1)} and α_k^{(t)} is small, then β must become large to enhance the competition. If the difference between α_k^{(t+1)} and α_k^{(t)} is large, then β will become small to maintain stability. Thus, we estimate β with

    β = \frac{\sum_{k=1}^{c^{(t)}} \exp \left( - η\, n \left| α_k^{(t+1)} - α_k^{(t)} \right| \right)}{c^{(t)}}    (12)

where η = \min ( 1, \lfloor d/2 - 1 \rfloor ), \lfloor a \rfloor represents the largest integer that is no more than a, and t denotes the iteration number in the algorithm.

On the other hand, we consider the inequalities

    \max_{1 \le k \le c} α_k^{(t+1)} \le \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} + \frac{β}{γ} \max_{1 \le k \le c} α_k^{(t)} \left( \ln \max_{1 \le k \le c} α_k^{(t)} - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right) \le \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} + \frac{β}{γ} \max_{1 \le k \le c} α_k^{(t)} \left( - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right).

If \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} + \frac{β}{γ} \max_{1 \le k \le c} α_k^{(t)} ( - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} ) \le 1, then the restriction \max_{1 \le k \le c} α_k^{(t+1)} \le 1 holds, and then we obtain

    \frac{β}{γ} \le \frac{1 - \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik}}{\max_{1 \le k \le c} α_k^{(t)} \left( - \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)} \right)}    (13)

According to Eqs. (12) and (13), we can get

    β^{(t+1)} = \min \left( \frac{\sum_{k=1}^{c^{(t)}} \exp \left( - η\, n \left| α_k^{(t+1)} - α_k^{(t)} \right| \right)}{c^{(t)}},\; \frac{γ^{(t+1)} \left( 1 - \max_{1 \le k \le c} \frac{1}{n} \sum_{i=1}^{n} z_{ik} \right)}{ - \max_{1 \le k \le c} α_k^{(t)} \sum_{s=1}^{c} α_s^{(t)} \ln α_s^{(t)}} \right)    (14)

Because β can jump at any time, we let β = 0 when the cluster number c is stable. When the cluster number c is stable, it means that c is no longer decreasing.
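A direct transcription of the β rule in Eqs. (12)-(14) is given below. The eta value follows the definition quoted above, and all names (alpha_new, alpha_old, Z, gamma, d) are illustrative; treat this as a sketch of one plausible reading of the update, not a verified reference implementation.

```python
import numpy as np

def update_beta(alpha_new, alpha_old, Z, gamma, d):
    """Eqs. (12)-(14): adaptive competition weight beta, capped by the bound in Eq. (13)."""
    n, c = Z.shape
    eta = min(1, int(np.floor(d / 2.0 - 1)))                               # as defined after Eq. (12)
    beta_12 = np.exp(-eta * n * np.abs(alpha_new - alpha_old)).sum() / c   # Eq. (12)
    neg_weighted_mean = -np.sum(alpha_old * np.log(alpha_old))             # -sum_s alpha_s ln(alpha_s)
    bound_13 = gamma * (1.0 - (Z.sum(axis=0) / n).max()) \
               / max(alpha_old.max() * neg_weighted_mean, 1e-12)           # Eq. (13)
    return min(beta_12, bound_13)                                          # Eq. (14)
```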
In our setting, we use all data points as the initial cluster centers, i.e., a_k = x_k with c_initial = n, and we use α_k = 1/c_initial for all k = 1, 2, ..., c_initial as the initial mixing proportions. Thus, the proposed U-k-means clustering algorithm can be summarized as follows:

U-k-means clustering algorithm

Step 1: Fix ε > 0. Give initial c^(0) = n, α_k^(0) = 1/n, a_k^(0) = x_k (k = 1, ..., n), and initial learning rates β^(0) = γ^(0) = 1. Set t = 0.
Step 2: Compute z_ik^(t+1) using a_k^(t), α_k^(t), c^(t), β^(t), and γ^(t) by (4).
Step 3: Compute γ^(t+1) by (10).
Step 4: Update α_k^(t+1) with z_ik^(t+1) and α_k^(t) by (6).
Step 5: Compute β^(t+1) with γ^(t+1) and α_k^(t) by (14).
Step 6: Update c^(t) to c^(t+1) by discarding those clusters with α_k^(t+1) < 1/n, and adjust α_k^(t+1) and z_ik^(t+1) by (8) and (9).
        IF t ≥ 60 and c^(t-60) − c^(t) = 0, THEN let β^(t+1) = 0.
Step 7: Update a_k^(t+1) with c^(t+1) and z_ik^(t+1) by (5).
Step 8: Compare a_k^(t+1) and a_k^(t).
        IF max_{1 ≤ k ≤ c^(t+1)} || a_k^(t+1) − a_k^(t) || < ε, THEN stop.
        ELSE let t = t + 1 and return to Step 2.
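The listing below strings the steps above into a single runnable NumPy sketch. It follows Eqs. (4)-(14) as reconstructed in this section (hard assignments, proportion competition, cluster discarding, and the γ and β schedules); the variable names and small numerical safeguards are our own additions, so it should be read as an illustration of the procedure rather than as the authors' reference code.

```python
import numpy as np

def u_k_means(X, eps=1e-4, max_iter=200, stable_window=60):
    """Sketch of the U-k-means procedure: start with c = n clusters and let them compete."""
    n, d = X.shape
    A = X.copy()                             # Step 1: every point is an initial center
    alpha = np.full(n, 1.0 / n)
    c, beta, gamma = n, 1.0, 1.0
    eta = min(1, int(np.floor(d / 2.0 - 1)))
    history = [c]

    for t in range(max_iter):
        # Step 2 (Eq. 4): hard assignment with the -gamma*ln(alpha_k) bias
        d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        cost = d2 - gamma * np.log(alpha)[None, :]
        labels = cost.argmin(axis=1)
        Z = np.eye(c)[labels]

        # Step 3 (Eq. 10): gamma schedule
        gamma = np.exp(-c / 250.0)

        # Step 4 (Eq. 6): proportion competition
        wmean = np.sum(alpha * np.log(alpha))
        alpha_new = Z.sum(axis=0) / n + (beta / gamma) * alpha * (np.log(alpha) - wmean)

        # Step 5 (Eqs. 12-14): adaptive beta, capped by the bound in Eq. (13)
        beta_12 = np.exp(-eta * n * np.abs(alpha_new - alpha)).sum() / c
        bound_13 = gamma * (1.0 - (Z.sum(axis=0) / n).max()) / max(alpha.max() * (-wmean), 1e-12)
        beta = min(beta_12, bound_13)

        # Step 6 (Eqs. 7-9): discard illegitimate clusters and renormalize
        keep = alpha_new >= 1.0 / n
        A, alpha = A[keep], alpha_new[keep]
        c = int(keep.sum())
        alpha = np.clip(alpha / alpha.sum(), 1e-12, None)
        history.append(c)
        if len(history) > stable_window and history[-stable_window - 1] == c:
            beta = 0.0                       # freeze the competition once c is stable

        # Step 7 (Eq. 5): update the surviving centers
        labels = (d2[:, keep] - gamma * np.log(alpha)[None, :]).argmin(axis=1)
        A_old = A.copy()
        for k in range(c):
            if np.any(labels == k):
                A[k] = X[labels == k].mean(axis=0)

        # Step 8: stop when the centers stabilize
        if np.max(np.abs(A - A_old)) < eps:
            break
    return labels, A, alpha, c
```

Starting from c = n and shrinking the cluster count as the proportions compete is the behavior the paper reports in Figs. 1(b)-1(f) and Figs. 4(b)-4(f).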
Before we analyze the computational complexity of the proposed U-k-means algorithm, we give a brief review of another clustering algorithm that also used the idea from the EM algorithm by Yang et al. [31]. This is the robust-learning fuzzy c-means (RL-FCM) algorithm proposed by Yang and Nataliani [33]. In Yang and Nataliani [33], they gave the RL-FCM objective function J(U, α, A) with memberships μ_ik that are not binary variables but fuzzy c-memberships with 0 ≤ μ_ik ≤ 1 and \sum_{k=1}^{c} μ_ik = 1, indicating the fuzzy membership of the data point x_i in the k-th cluster. If we compare the proposed U-k-means objective function J_UKM2(z, A, α) with the RL-FCM objective function J(U, α, A), we find that, apart from μ_ik and z_ik being different membership representations, the RL-FCM objective function J(U, α, A) in Yang and Nataliani [33] has more extra terms and parameters, and so the RL-FCM algorithm is more complicated than the proposed U-k-means algorithm, with more running time. For the experimental results and comparisons in the next section, we make more comparisons of the proposed U-k-means algorithm with the RL-FCM algorithm.

We also analyze the computational complexity of the U-k-means algorithm. In fact, the U-k-means algorithm can be divided into three parts: (1) compute the hard membership partition z_ik with O(ncd); (2) compute the mixing proportions α_k with O(nc); and (3) update the cluster centers a_k with O(nd). The total computational complexity of the U-k-means algorithm is O(ncd), where n is the number of data points, c is the number of clusters, and d is the dimension of the data points. Compared with the RL-FCM algorithm [33], the RL-FCM has a total computational complexity of O(nc^2 d).

IV. EXPERIMENTAL RESULTS AND COMPARISONS
In this section we give some examples with numerical and real data sets to demonstrate the performance of the proposed U-k-means algorithm. We show these unsupervised learning behaviors to get the best number c* of clusters for the U-k-means algorithm. Generally, most clustering algorithms, including k-means, are employed to give different numbers of clusters with associated cluster memberships, and then these clustering results are evaluated by multiple validity measures to determine the most practically plausible clustering results with the estimated number of clusters [13]. Thus, we first compare the U-k-means algorithm with the seven validity indices, DNo [16], DNg [21], DNs [22], Gap statistic (Gap-stat) [20], DB [17], SW [18], and CH [19]. Furthermore, comparisons of the proposed U-k-means with k-means [8], robust EM (R-EM) [31], clustering by fast search (C-FS) [30], X-means [23], and RL-FCM [33] are also made. For measuring clustering performance, we use the accuracy rate (AR) with AR = \sum_{k=1}^{c} n(c_k) / n, where n(c_k) is the number of data points that are correctly clustered into cluster k and n is the total number of data points. The larger the AR, the better the clustering performance.
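The AR defined above presumes a correspondence between the obtained clusters and the ground-truth classes. One common way to realize this (our illustration, not necessarily the exact matching used by the authors) is to pick the cluster-to-class assignment that maximizes the number of correctly placed points, e.g., with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy_rate(y_true, y_pred):
    """AR = (correctly clustered points under the best cluster-to-class matching) / n."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: counts[i, j] = points of class i placed in cluster j
    counts = np.array([[np.sum((y_true == ci) & (y_pred == cj)) for cj in clusters]
                       for ci in classes])
    row, col = linear_sum_assignment(-counts)   # maximize the matched counts
    return counts[row, col].sum() / len(y_true)
```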

Example 1. In this example, we use a data set of 400 data points generated from the 2-variate 6-component Gaussian mixture model f(x; α, θ) = \sum_{k=1}^{6} α_k f(x; θ_k) with parameters α_k = 1/6, k = 1, ..., 6, μ_1 = (5, 2)^T, μ_2 = (3, 4)^T, μ_3 = (8, 4)^T, μ_4 = (6, 6)^T, μ_5 = (10, 8)^T, μ_6 = (7, 10)^T, and Σ_k = diag(0.4, 0.4) for k = 1, ..., 6, i.e., with 2 dimensions and 6 clusters, as shown in Fig. 1(a). We implement the proposed U-k-means clustering algorithm for the data set of Fig. 1(a), and it obtains the correct number c* = 6 of clusters with AR = 1.00 after 11 iterations, as shown in Fig. 1(f). The validity index values of CH, SW, DB, Gap statistic, DNo, DNg, and DNs are shown in Table I. All indices give the correct number c* = 6 of clusters, except DNg.

Moreover, we consider the data set with noisy points to show the performance of the proposed U-k-means algorithm in a noisy environment. We add 50 uniformly distributed noisy points to the data set of Fig. 1(a), as shown in Fig. 2(a). By implementing the U-k-means algorithm on the noisy data set of Fig. 2(a), it still obtains the correct number c* = 6 of clusters after 28 iterations with AR = 1.00, as shown in Fig. 2(b). The validity index values of CH, SW, DB, Gap-stat, DNo, DNg, and DNs for the noisy data set of Fig. 2(a) are shown in Table II. The five validity indices CH, DB, Gap-stat, DNo, and DNs give the correct number of clusters, but SW and DNg give incorrect numbers of clusters.

FIGURE 1. (a) Original data set; (b)-(e) Processes of the U-k-means after 1, 2, 4, and 9 iterations; (f) Convergent results.
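For readers who want to reproduce a data set like the one in Example 1, the following snippet draws 400 points from the stated 2-variate 6-component mixture; the random seed is ours and arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[5, 2], [3, 4], [8, 4], [6, 6], [10, 8], [7, 10]], dtype=float)
cov = np.diag([0.4, 0.4])
n_points = 400

# Equal mixing proportions alpha_k = 1/6: sample a component, then a Gaussian point
components = rng.integers(0, 6, size=n_points)
X = np.vstack([rng.multivariate_normal(means[k], cov) for k in components])
```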
TABLE I
VALIDITY INDEX VALUES OF CH, SW, DB, GAP-STAT, DNO, DNG, AND DNS FOR THE DATA SET OF FIG. 1(A)

c    CH      SW      DB      Gap-stat   DNo     DNg     DNs
2    0.511   0.680   0.772   0.183      0.008   6.587   0.001
3    0.553   0.649   0.866   0.388      0.047   2.603   0.019
4    0.605   0.715   0.700   0.469      0.040   1.603   0.016
5    0.754   0.743   0.571   0.619      0.041   4.619   0.020
6    1.277   0.838   0.483   1.067      0.102   4.635   0.048
7    1.155   0.773   0.634   0.991      0.060   0.794   0.022
8    1.054   0.703   0.808   0.930      0.030   0.571   0.004

TABLE II
VALIDITY INDEX VALUES OF CH, SW, DB, GAP-STAT, DNO, DNG, AND DNS FOR THE NOISY DATA SET

c    CH      SW      DB      Gap-stat   DNo     DNg     DNs
2    508.2   0.662   0.792   0.827      0.005   4.547   0.001
3    523.4   0.615   0.913   0.835      0.034   1.828   0.012
4    526.9   0.655   0.748   0.719      0.033   1.752   0.011
5    637.9   0.697   0.607   0.914      0.028   3.397   0.008
6    902.6   0.771   0.538   1.237      0.052   1.502   0.013
7    864.4   0.783   0.558   1.173      0.042   0.797   0.008
8    837.7   0.766   0.666   1.143      0.019   0.497   0.002

FIGURE 2. (a) 6-cluster dataset with 50 noisy points; (b) Final results from U-k-means.

Example 2. In this example, we consider a data set of 800 data points generated from a 3-variate 14-component Gaussian mixture, i.e., with 3 dimensions and 14 clusters, as shown in Fig. 3(a). To estimate the number c of clusters, we use CH, SW, DB, Gap-stat, DNo, DNg, and DNs. To create the results of the seven validity indices, we consider the k-means algorithm with 25 different initializations. The estimated numbers of clusters from CH, SW, DB, Gap statistic, DNo, DNg, and DNs with percentages are shown in Table III. It is seen that all validity indices can give the correct number c* = 14 of clusters, except DNg, where the Gap-stat index gives the highest percentage of the correct number c* = 14 of clusters with 64%. We also implement the proposed U-k-means for the data set, and then compare it with the R-EM, C-FS, k-means with the true number of clusters, X-means, and RL-FCM clustering algorithms. We mention that U-k-means, R-EM, and RL-FCM are free of parameter selection, but the others depend on parameter selection for finding the number of clusters. Table IV shows the comparison results of the U-k-means, R-EM, C-FS, k-means with the true cluster number c = 14, X-means, and RL-FCM algorithms. Note that the C-FS, k-means with the true number of clusters, and X-means algorithms depend on initializations or parameter selection, and so we consider their average AR (AV-AR) under different initializations or parameter selections. From Table IV, it is seen that the proposed U-k-means, R-EM, and RL-FCM clustering algorithms are able to find the correct number of clusters c* = 14 with AR = 1.00, while C-FS obtains the correct c* = 14 in 96% of runs with AV-AR = 0.9772. The k-means with the true c gives AV-AR = 0.8160. The X-means obtains the correct c* = 14 in 76% of runs with AV-AR = 1.00. Note that the numbers in parentheses indicate the percentage of runs obtaining the correct number of clusters under 25 different initial values.

FIGURE 3. (a) 14-cluster dataset; (b) Final results from U-k-means.

TABLE III
RESULTS OF THE SEVEN VALIDITY INDICES

True c   CH         SW         DB         Gap-stat   DNo        DNg               DNs
14       14 (60%)   14 (60%)   14 (60%)   14 (64%)   14 (20%)   2, 4, 5, 10, 11   14 (20%)

FIGURE 4. (a) 9-diamonds data set; (b)-(e) Results of the U-k-means after 1, 3, 5, and 7 iterations; (f) Final results of the U-k-means after 11 iterations.

Example 3. To examine the effectiveness of the proposed U-k-means for finding the number of clusters, we generate a data set of 900 data points from a 20-variate 6-component Gaussian mixture model. The mixing proportions, mean values, and covariance matrices of the Gaussian mixture model are listed in Table V. The validity indices CH, SW, DB, Gap-stat, DNo, DNg, and DNs are used to estimate the number c of clusters. The k-means algorithm with 25 different initializations is used to create the results of the seven validity indices. The estimated numbers of clusters from the seven validity indices with percentages are shown in Table VI, where the parentheses indicate the percentages of validity indices giving the correct number of clusters under 25 different initial values. It is seen that CH, SW, and Gap-stat give the correct number c* = 6 of clusters with the highest percentage. We also implement the U-k-means and compare it with the R-EM, C-FS, k-means with the true number c, X-means, and RL-FCM algorithms. The obtained numbers of clusters and ARs of these algorithms are shown in Table VII. As can be seen, the proposed U-k-means, C-FS, and X-means correctly find the number of clusters for the data set. The R-EM and RL-FCM underestimate the number of clusters for the data set. Both U-k-means and X-means get the best AR.

Example 4. In this example, we consider a synthetic data set of non-spherical shape with 3000 data points, as shown in Fig. 4(a). The U-k-means is implemented for this data set, with the clustering results shown in Figs. 4(b)-4(f). The U-k-means algorithm decreases the number of clusters from 3000 to 2132 after the first iteration. From Figs. 4(b)-4(f), it is seen that the U-k-means algorithm exhibits a fast decrease in the number of clusters. After 11 iterations, the U-k-means algorithm obtains its convergent result with c* = 9 and AR = 1.00, as shown in Fig. 4(f). We next compare the proposed U-k-means algorithm with R-EM, C-FS, k-means with the true c, X-means, and RL-FCM. All the experiments are performed 25 times with parameter selection, where the average AR results under the correct number of clusters are reported in Table VIII. As shown in Table VIII, U-k-means gives the correct number c* = 9 of clusters with AR = 1.00, followed by k-means with the true c = 9, which achieves an average AR = 0.9190, and C-FS with c* = 9 (96%), which achieves an average AR = 0.7641. R-EM overestimates the number of clusters with c* = 12, while X-means and RL-FCM underestimate the number of clusters with c* = 2.

We next consider real data sets. These data sets are from the UCI Machine Learning Repository [34].

Example 5. In this example, we use eight real data sets from the UCI Machine Learning Repository [34], known as Iris, Seeds, Australian credit approval, Flowmeter D, Sonar, Wine, Horse, and Waveform (version 1). Detailed information on these data sets, such as feature characteristics, the number c of classes, the number n of instances, and the number d of features, is listed in Table IX. Since the data features in Seeds, Flowmeter D, Wine, and Waveform (version 1) are distributed over different ranges, and the data features in Australian (credit approval) are of mixed feature types, we first preprocess the data matrices using a matrix factorization technique [35]. This preprocessing technique can put these data on a uniform footing to get good-quality clusters and improve the accuracy rates of clustering algorithms. Clustering results from the U-k-means, R-EM, C-FS, k-means with the true c, k-means+Gap-stat, X-means, and RL-FCM algorithms for the different real data sets are shown in Table X, where the best results are presented in boldface. It is seen that the proposed U-k-means gives the best result in estimating the number c of clusters and accuracy rate among them, except for the Australian data. The C-FS algorithm gives the correct numbers of clusters for the Iris, Seeds, Australian, Flowmeter D, Sonar, Wine, and Horse data sets, while it underestimates the number of clusters for the Waveform data set with c* = 2. The X-means algorithm only obtains the correct number of clusters for the Seeds, Wine, and Horse data sets. The R-EM obtains the correct number of clusters for the Iris and Seeds data sets. The k-means+Gap-stat only obtains a correct number of clusters for the Seeds data set. The RL-FCM algorithm obtains the correct number of clusters for the Iris, Seeds, and Waveform (version 1) data sets. Note that the results in parentheses are the percentages of runs in which the algorithms get the correct number c of clusters.
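Examples 5 and 6 mention a matrix-factorization preprocessing step in the spirit of [35]. As a stand-in illustration only (plain non-negative matrix factorization via scikit-learn, not the graph-regularized NMF of [35]), such a step might look like this:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import minmax_scale

def nmf_preprocess(X, n_components=10, seed=0):
    """Rescale features to [0, 1] and replace them with a low-rank NMF representation."""
    X_scaled = minmax_scale(X)                 # NMF requires non-negative input
    model = NMF(n_components=n_components, init="nndsvda", random_state=seed, max_iter=500)
    return model.fit_transform(X_scaled)       # (n, n_components) coefficient matrix
```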

Example 6. In this example, we use six medical data sets from the UCI Machine Learning Repository [34], known as SPECT, Parkinsons, WPBC, Colon, Lung, and Nci9. Detailed descriptions of these data sets, with feature characteristics, the number c of classes, the number n of instances, and the number d of features, are listed in Table XI. In this experiment, we first preprocess the SPECT, Parkinsons, WPBC, Colon, and Lung data sets using the matrix factorization technique. We also conduct experiments to compare the proposed U-k-means with R-EM, C-FS, k-means with the true c, k-means+Gap-stat, X-means, and RL-FCM. The results are shown in Table XII. For C-FS, k-means with the true c, k-means+Gap-stat, and X-means, we run experiments with 25 different initializations and report their results with the average AR (AV-AR) and the percentages of runs obtaining the correct number c of clusters, as shown in Table XII. It is seen that the proposed U-k-means gets the correct number of clusters for SPECT, Parkinsons, WPBC, Colon, and Lung, while for the Nci9 data set the U-k-means algorithm gets c* = 8 clusters, which is very close to the true c = 9. In terms of AR, the U-k-means algorithm performs significantly better than the others. The R-EM algorithm estimates the correct number of clusters on SPECT. However, it underestimates the number of clusters on Parkinsons and overestimates the number of clusters on WPBC. We also report that the results of R-EM on the Colon, Lung, and Nci9 data sets are missing because the probabilities of data points belonging to the k-th class on these data sets become illegitimate proportions at the first iteration. The C-FS algorithm performs better than k-means+Gap-stat and X-means. The RL-FCM algorithm estimates the correct number of clusters c for the SPECT, Parkinsons, and WPBC data sets, while RL-FCM overestimates the number of clusters on Colon, Lung, and Nci9 with c* = 62, c* = 9, and c* = 60, respectively.

TABLE IV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE DATA SET OF FIG. 3(A)

True c   U-k-means       R-EM            C-FS                  k-means with true c   X-means               RL-FCM
         c*      AR      c*      AR      c*         AV-AR      AV-AR                 c*         AV-AR      c*      AR
14       14      1.00    14      1.00    14 (96%)   0.9772     0.8160                14 (76%)   1.00       14      1.00

TABLE V
MIXING PROPORTIONS, MEAN VALUES AND COVARIANCE MATRICES OF EXAMPLE 3
(20-variate 6-component Gaussian mixture with mixing proportions α_1 = 0.2, α_2 = 0.3, α_3 = 0.1, α_4 = 0.1, α_5 = 0.2, and α_6 = 0.1; mean vectors and covariance matrices omitted.)

TABLE VI
RESULTS OF THE SEVEN VALIDITY INDICES FOR THE DATA SET OF EXAMPLE 3

True c   CH        SW        DB      Gap-stat   DNo       DNg      DNs
6        6 (88%)   6 (88%)   2, 3    6 (88%)    6 (16%)   6 (8%)   6 (12%)

TABLE VII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 3

True c   U-k-means       R-EM          C-FS                 k-means with true c   X-means              RL-FCM
         c*      AR      c*     AR     c*         AR        AV-AR                 c*          AR       c*     AR
6        6       1.00    3      -      6 (84%)    0.8155    0.7833                6 (100%)    1.00     3      -

TABLE VIII
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR EXAMPLE 4

True c   U-k-means       R-EM          C-FS                 k-means with true c   X-means          RL-FCM
         c*      AR      c*     AR     c*         AV-AR     AV-AR                 c*     AV-AR     c*     AR
9        9       1.00    12     -      9 (96%)    0.7641    0.9190                2      -         2      -

TABLE IX
DESCRIPTIONS OF THE EIGHT DATA SETS USED IN EXAMPLE 5

Dataset                Feature Characteristics      Number c of clusters   Number n of instances   Number d of features
Iris                   Real                         3                      150                     4
Seeds                  Real                         3                      210                     7
Australian             Categorical, Integer, Real   2                      690                     14
Flowmeter D            Real                         4                      180                     43
Sonar                  Real                         2                      208                     60
Wine                   Integer, Real                3                      178                     13
Horse                  Categorical, Integer, Real   2                      368                     27
Waveform (Version 1)   Real                         3                      5000                    21

TABLE X
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR DIFFERENT REAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE

Data set               True c   U-k-means      R-EM          C-FS                k-means       k-means+Gap-stat     X-means             RL-FCM
                                c*    AR       c*    AR      c*         AR       with true c   c*          AR       c*         AV-AR    c*    AV-AR
Iris                   3        3     0.8933   3     0.8600  3 (84%)    0.7521   0.7939        4, 5        -        2          -        3     0.9067
Seeds                  3        3     0.9048   3     0.8476  3 (100%)   0.7944   0.8864        3 (100%)    0.8952   3 (100%)   0.890    3     0.8952
Australian             2        2     0.5551   4     -       2 (100%)   0.5551   0.5551        6           -        6          -        26    -
Flowmeter D            4        4     0.6056   3     -       4 (100%)   0.4338   0.5833        9, 10       -        10         -        13    -
Sonar                  2        2     0.5337   5     -       2 (80%)    0.4791   0.4791        5, 6        -        3, 4       -        4     -
Wine                   3        3     0.7022   2     -       3 (100%)   0.5557   0.6851        2           -        3 (64%)    0.62     2     -
Horse                  2        2     0.6576   4,6,8,10,14 - 2 (100%)   0.6033   0.6055        3           -        2 (88%)    0.50     7     -
Waveform (Version 1)   3        3     0.4020   1     -       2          -        0.3900        1           -        8          -        3     0.3972

TABLE XI
DESCRIPTIONS OF THE SIX MEDICAL DATA SETS USED IN EXAMPLE 6

Dataset      Feature Characteristics    Number c of clusters   Number n of instances   Number d of features
SPECT        Categorical                2                      187                     22
Parkinsons   Real                       2                      195                     22
WPBC         Real                       2                      198                     33
Colon        Discrete, Binary           2                      62                      2000
Lung         Continuous, Multi-class    5                      203                     3312
Nci9         Discrete, Multi-class      9                      60                      9712

TABLE XII
RESULTS FROM VARIOUS ALGORITHMS FOR THE SIX MEDICAL DATA SETS WITH THE BEST RESULTS IN BOLDFACE

Data set     True c   U-k-means     R-EM           C-FS                k-means       k-means+Gap-stat    X-means              RL-FCM
                      c*    AR      c*    AV-AR    c*         AR       with true c   c*          AV-AR   c*         AV-AR     c*    AV-AR
SPECT        2        2     0.920   2     0.562    2 (84%)    0.8408   0.5262        5, 6        -       2 (100%)   0.5119    2     0.588
Parkinsons   2        2     0.754   1     -        2 (100%)   0.7436   0.5183        2 (100%)    0.62    4, 5       -         2     0.754
WPBC         2        2     0.763   198   -        2 (100%)   0.7576   0.5927        4           -       3          -         2     0.763
Colon        2        2     0.645   -     -        2 (100%)   0.5813   0.4768        4           -       2 (100%)   0.45      62    -
Lung         5        5     0.788   -     -        5 (100%)   0.6859   0.6818        4, 6, 7, 8  -       2          -         9     -
Nci9         9        8     -       -     -        2, 4       -        0.32          2           -       2          -         60    -

TABLE XIII
CLUSTERING RESULTS FROM VARIOUS ALGORITHMS FOR THE YALE FACE DATA SET WITH THE BEST RESULTS IN BOLDFACE

Data set    True c   U-k-means    R-EM        C-FS           k-means with true c   X-means        RL-FCM
                     c*    AR     c*   AR     c*    AV-AR    AV-AR                 c*     AV-AR   c*    AV-AR
Yale Face   15       16    -      -    -      12    -        0.34                  2, 3   -       2     -

TABLE XIV
RESULTS OF U-K-MEANS, R-EM, C-FS, K-MEANS WITH THE TRUE C, X-MEANS, AND RL-FCM FOR THE 100-IMAGE SAMPLE OF THE CIFAR-10 DATA SET

Data set   True c   U-k-means             R-EM        C-FS                  k-means with true c   X-means       RL-FCM
                    c*           AV-AR    c*   AR     c*           AV-AR    AV-AR                 c*    AV-AR   c*    AV-AR
CIFAR-10   10       10 (42.5%)   0.311    -    -      10 (3.03%)   0.295    0.280                 2     -       -     -

Example 7. In this example, we apply the U-k-means clustering algorithm to the Yale Face 32x32 data set, as shown in Fig. 5. It has 165 grayscale images in GIF format of 15 individuals [36]. There are 11 images per subject with different facial expressions or configurations: center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. In the experiment, we use 135 of the 165 grayscale images. The results from the different algorithms are shown in Table XIII. From Table XIII, although U-k-means cannot correctly estimate the true number c = 15 of clusters for the Yale face data set, it gives c* = 16 clusters, which is close to the true c = 15. The result of the R-EM algorithm is missing because the probabilities of data points belonging to the k-th class on this data set become illegitimate proportions at the first iteration. The C-FS gives c* = 12 and X-means gives c* = 2 or 3. The k-means clustering with the true c = 15 gives AV-AR = 0.34, while RL-FCM gives c* = 2.

FIGURE 5. Yale Face 32x32.

FIGURE 6. The 100-image sample of CIFAR-10.

TABLE XV
COMPARISON OF AVERAGE RUNNING TIMES (IN SECONDS) OF U-K-MEANS, R-EM, C-FS, AND RL-FCM FOR ALL DATA SETS. THE FASTEST RUNNING TIMES ARE HIGHLIGHTED.

Data sets            U-k-means   R-EM       C-FS       RL-FCM
Synthetic data sets
Example 1            0.3842      4.8921     5.8050     1.3688
Example 2            2.9185      13.6157    7.3559     6.0444
Example 3            2.1625      2.7938     10.2817    3.2924
Example 4            117.2595    742.14     35.6417    438.047
UCI data sets
Iris                 0.2159      1.1842     6.31581    0.4184
Seeds                0.1455      2.0400     5.2702     0.4472
Australian           2.0434      5.8039     6.1772     2.3829
Flowmeter D          0.2834      0.6969     5.6230     0.3054
Sonar                0.1747      0.3148     5.8564     0.3963
Wine                 0.1980      1.4837     5.8094     0.3060
Horse                0.6072      2.5989     5.3442     0.6272
Waveform             330.748     -          113.8162   474.165
Medical data sets
SPECT                0.1354      0.7211     5.9079     0.3411
Parkinsons           0.1487      0.5856     4.9534     0.3958
WPBC                 0.1512      0.7922     5.2152     0.4036
Colon                0.1653      -          4.9608     0.2676
Lung                 1.1239      -          5.2485     1.1167
Nci9                 0.6186      -          6.4794     0.5096
Image data sets
Yale Face 32x32      0.3741      -          5.9634     0.4286
CIFAR-10             2.6561      -          6.4500     -

Example 8. In this example, we apply the U-k-means clustering algorithm to the CIFAR-10 color images [37]. The CIFAR-10 data set consists of 60000 32x32 color images in 10 classes, i.e., each pixel is an RGB triplet of unsigned bytes between 0 and 255. There are 50000 training images and 10000 test images. Each red, green, and blue channel contains 1024 values per image. The 10 classes in the data set are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Specifically, we take the first 100 color images (10 images per class) from the CIFAR-10 60K-image data set for our experiment, and the remaining 59900 images serve as the retrieval database. Fig. 6 shows the 100-image sample from the CIFAR-10 data set. The results for the number of clusters and AR are given in Table XIV. From Table XIV, it is seen that the proposed U-k-means and k-means with the true c = 10 give better results on the 100-image sample of the CIFAR-10 data set. The U-k-means obtains the correct number c* = 10 of clusters in 42.5% of runs with AV-AR = 0.28, and k-means with c = 10 gives the same AV-AR = 0.28. For the C-FS, the percentage of runs with the correct number c* = 10 of clusters is only 16.7%, with AV-AR = 0.24. X-means underestimates the number of clusters with c* = 2. The results from R-EM and RL-FCM on this data set are missing because the probabilities of data points belonging to the k-th class become illegitimate proportions at the first iteration.

We further analyze the performance of U-k-means, R-EM, C-FS, and RL-FCM by comparing their average running times over 25 runs, as shown in Table XV. All algorithms are implemented in MATLAB 2017b. From Table XV, it is seen that the proposed U-k-means is the fastest on all data sets among these algorithms, except that the C-FS algorithm is the fastest for the Waveform data set. Furthermore, in Section III we mentioned that the proposed U-k-means objective function is simpler than the RL-FCM objective function, saving running time. From Table XV, it is seen that the proposed U-k-means algorithm indeed runs faster than the RL-FCM algorithm.

V. CONCLUSIONS
In this paper we propose a new schema with a learning framework for the k-means clustering algorithm. We adopt the merit of entropy-type penalty terms to construct a competition schema. The proposed U-k-means algorithm uses the number of points as the initial number of clusters to solve the initialization problem. During iterations, the U-k-means algorithm discards extra clusters, and an optimal number of clusters can then be automatically found according to the structure of the data. The advantages of U-k-means are that it is free of initializations and parameter selection, and it is also robust to different cluster volumes and shapes while automatically finding the number of clusters. The proposed U-k-means algorithm was run on several synthetic and real data sets and also compared with most existing algorithms, such as the R-EM, C-FS, k-means with the true number c, k-means+Gap-stat, and X-means algorithms. The results actually demonstrate the superiority of the U-k-means clustering algorithm.

REFERENCES
[1] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall, 1988.
[2] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley, 1990.
[3] G.J. McLachlan and K.E. Basford, Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker, 1988.
[4] A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm (with discussion)," J. Roy. Stat. Soc., Ser. B, vol. 39, pp. 1-38, 1977.
[5] J. Yu, C. Chaomurilige, and M.S. Yang, "On convergence and parameter selection of the EM and DA-EM algorithms for Gaussian mixtures," Pattern Recognition, vol. 77, pp. 188-203, 2018.
[6] A.K. Jain, "Data clustering: 50 years beyond k-means," Pattern Recognition Letters, vol. 31, pp. 651-666, 2010.
[7] M.S. Yang, S.J. Chang-Chien, and Y. Nataliani, "A fully-unsupervised possibilistic c-means clustering method," IEEE Access, vol. 6, pp. 78308-78320, 2018.
[8] J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297, University of California Press, 1967.
[9] M. Alhawarat and M. Hegazi, "Revisiting k-means and topic modeling, a comparison study to cluster Arabic documents," IEEE Access, vol. 6, pp. 42740-42749, 2018.
[10] Y. Meng, J. Liang, F. Cao, and Y. He, "A new distance with derivative information for functional k-means clustering algorithm," Information Sciences, vol. 463-464, pp. 166-185, 2018.
[11] Z. Lv, T. Liu, C. Shi, J.A. Benediktsson, and H. Du, "Novel land cover change detection method based on k-means clustering and adaptive majority voting using bitemporal remote sensing images," IEEE Access, vol. 7, pp. 34425-34437, 2019.
[12] J. Zhu, Z. Jiang, G.D. Evangelidis, C. Zhang, S. Panga, and Z. Li, "Efficient registration of multi-view point sets by k-means clustering," Information Sciences, vol. 488, pp. 205-218, 2019.
[13] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On clustering validation techniques," J. Intell. Inf. Syst., vol. 17, pp. 107-145, 2001.
[14] R.E. Kass and A.E. Raftery, "Bayes factors," Journal of the American Statistical Association, vol. 90, pp. 773-795, 1995.
[15] H. Bozdogan, "Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions," Psychometrika, vol. 52, pp. 345-370, 1987.
[16] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters," J. Cybernetics, vol. 3, pp. 32-57, 1974.
[17] D. Davies and D. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, pp. 224-227, 1979.
[18] P.J. Rousseeuw, "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.
[19] T. Calinski and J. Harabasz, "A dendrite method for cluster analysis," Commun. Stat.-Theory Methods, vol. 3, pp. 1-27, 1974.
[20] R. Tibshirani, G. Walther, and T. Hastie, "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society: Series B, vol. 63, pp. 411-423, 2001.
[21] N.R. Pal and J. Biswas, "Cluster validation using graph theoretic concepts," Pattern Recognition, vol. 30, pp. 847-857, 1997.
[22] N. Ilc, "Modified Dunn's cluster validity index based on graph theory," Przeglad Elektrotechniczny (Electrical Review), vol. 2, pp. 126-131, 2012.
[23] D. Pelleg and A. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," Proc. 17th International Conference on Machine Learning, pp. 727-734, San Francisco, 2000.
[24] E. Rendon, I. Abundez, A. Arizmendi, and E.M. Quiroz, "Internal versus external cluster validation indexes," Int. J. Computers and Communications, vol. 5, pp. 27-34, 2011.
[25] Y. Lei, J.C. Bezdek, S. Romani, N.X. Vinh, J. Chan, and J. Bailey, "Ground truth bias in external cluster validity indices," Pattern Recognition, vol. 65, pp. 58-70, 2017.
[26] J. Wu, J. Chen, H. Xiong, and M. Sie, "External validation measures for k-means clustering: a data distribution perspective," Expert Syst. Appl., vol. 36, pp. 6050-6061, 2009.
[27] L.J. Deborah, R. Baskaran, and A. Kannan, "A survey on internal validity measure for cluster validation," Int. J. Comput. & Eng. Surv., vol. 1, pp. 85-102, 2010.
[28] I.H. Witten, E. Frank, M.A. Hall, and C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2000.
[29] G. Guo, L. Chen, Y. Ye, and Q. Jiang, "Cluster validation method for determining the number of clusters in categorical sequences," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, pp. 2936-2948, 2017.
[30] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[31] M.S. Yang, C.Y. Lai, and C.Y. Lin, "A robust EM clustering algorithm for Gaussian mixture models," Pattern Recognition, vol. 45, pp. 3950-3961, 2012.
[32] M.A.T. Figueiredo and A.K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 381-396, 2002.
[33] M.S. Yang and Y. Nataliani, "Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters," Pattern Recognition, vol. 71, pp. 45-59, 2017.
[34] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, 1998. https://archive.ics.uci.edu/ml/datasets.html
[35] D. Cai, X. He, J. Han, and T.S. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1548-1560, 2010.
[36] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, "Learning a spatially smooth subspace for face recognition," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), pp. 1-7, 2007.
[37] A. Krizhevsky and G. Hinton, Learning Multiple Layers of Features from Tiny Images, Technical report, University of Toronto, 2009.

Kristina P. Sinaga received the B.S. degree and the M.S. degree in mathematics from the University of Sumatera Utara, Indonesia. She is a Ph.D. student at the Department of Applied Mathematics, Chung Yuan Christian University, Taiwan. Her research interests include clustering and pattern recognition.

Miin-Shen Yang received the BS degree in mathematics from Chung Yuan Christian University, Chung-Li, Taiwan, in 1977, the MS degree in applied mathematics from National Chiao-Tung University, Hsinchu, Taiwan, in 1980, and the Ph.D. degree in statistics from the University of South Carolina, Columbia, USA, in 1989.

In 1989, he joined the faculty of the Department of Mathematics at Chung Yuan Christian University (CYCU) as an Associate Professor, where he has been a Professor since 1994. From 1997 to 1998, he was a Visiting Professor with the Department of Industrial Engineering, University of Washington, Seattle, USA. During 2001-2005, he was the Chairman of the Department of Applied Mathematics at CYCU. Since 2012, he has been a Distinguished Professor of the Department of Applied Mathematics and the Director of the Chaplain's Office, and he is now the Dean of the College of Science at CYCU. His research interests include clustering algorithms, fuzzy clustering, soft computing, pattern recognition, and machine learning. Dr. Yang was an Associate Editor of the IEEE Transactions on Fuzzy Systems (2005-2011), and is an Associate Editor of Applied Computational Intelligence & Soft Computing.