
Pattern Recognition 71 (2017) 45–59

Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters

Miin-Shen Yang a,∗, Yessica Nataliani a,b
a Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li 32023, Taiwan
b Department of Information Systems, Satya Wacana Christian University, Salatiga 50711, Indonesia

Article history: Received 27 August 2016; Revised 12 May 2017; Accepted 20 May 2017; Available online 22 May 2017.

Keywords: Fuzzy clustering; Fuzzy c-means (FCM); Robust learning-based schema; Number of clusters; Entropy penalty terms; Robust-learning FCM (RL-FCM)

Abstract

In fuzzy clustering, the fuzzy c-means (FCM) algorithm is the most commonly used clustering method. Various extensions of FCM have been proposed in the literature. However, the FCM algorithm and its extensions are usually affected by initializations and parameter selection, with the number of clusters to be given a priori. Although there are some works that address these problems in FCM, there is no work that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index, and independent of a given number of clusters. In this paper, we construct a robust learning-based FCM framework, called the robust-learning FCM (RL-FCM) algorithm, so that it becomes free of the fuzziness index m and of initializations, requires no parameter selection, and can also automatically find the best number of clusters. We first use entropy-type penalty terms for adjusting bias while remaining free of the fuzziness index, and then create a robust learning-based schema for finding the best number of clusters. The computational complexity of the proposed RL-FCM algorithm is also analyzed. Comparisons between RL-FCM and other existing methods are made. Experimental results and comparisons demonstrate these good aspects of the proposed RL-FCM, which exhibits three robust characteristics: 1) robust to initializations and free of the fuzziness index, 2) robust to (without) parameter selection, and 3) robust to the number of clusters (with an unknown number of clusters).

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Clustering is a useful tool for data analysis. It is a method for finding groups within data with the most similarity in the same cluster and the most dissimilarity between different clusters. Hierarchical clustering is regarded as the earliest clustering method, used by biologists and social scientists. Afterwards, cluster analysis became a branch of statistical multivariate analysis [1]. It is also an approach to unsupervised learning and one of the major techniques in pattern recognition and machine learning. From the statistical point of view, clustering methods may be divided into a probability model-based approach and a nonparametric approach. A probability model-based approach assumes that the data set follows a mixture model of probability distributions so that a mixture likelihood approach to clustering is used [2], where the expectation and maximization (EM) algorithm [3] is the most popular. For a nonparametric approach, clustering methods may be based on an objective function of similarity or dissimilarity measures, where partitional methods are the most used. Graph clustering has also been developed and discussed in the literature [4]; for example, clustering based on similarity of user behavior for targeted advertising was investigated in Aggarwal et al. [5].

Partitional clustering methods suppose that the data set can be represented by finite cluster prototypes with their own objective functions. Therefore, defining the dissimilarity (or distance) between data points and cluster prototypes is essential for partitional methods. It is known that the k-means (also called hard c-means) algorithm is the oldest and most popular partitional method [6]. For an efficient estimation of the number of clusters, Pelleg and Moore [7] extended k-means, called X-means, by making local decisions for cluster centers in each iteration of k-means, with the centers splitting themselves to get better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC), is used to drive the splitting process. Although the k-means and X-means algorithms are widely used, these crisp clustering methods restrict each data point to belong to exactly one cluster with crisp cluster memberships, so they are well fitted for sharp boundaries between clusters in data, but not good for unsharp (or vague) boundaries.

∗ Corresponding author. E-mail address: [email protected] (M.-S. Yang).


Since Zadeh [8] proposed fuzzy sets, which introduced the idea of partial memberships described by membership functions, they have been successfully applied in clustering. Fuzzy clustering has been widely studied and applied in a variety of substantive areas for more than 45 years [9–12], since Ruspini [13] first proposed fuzzy c-partitions as a fuzzy approach to clustering in the 1970s. In fuzzy clustering, the fuzzy c-means (FCM) clustering algorithm proposed by Dunn [14] and Bezdek [9] is the most well-known and used method. There are many extensions and variants of FCM in the literature. The first important extension to FCM was proposed by Gustafson and Kessel (GK) [15], in which the Euclidean distance in the FCM objective function was replaced by the Mahalanobis distance. Afterwards, many further extensions followed, such as extensions to maximum-entropy clustering (MEC) by Karayiannis [16], Miyamoto and Umayahara [17] and Wei and Fahn [18], extensions to Lp norms by Hathaway et al. [19], the extension of FCM to alpha-cut implemented fuzzy clustering algorithms by Yang et al. [20], the extension of FCM for treating very large data by Havens et al. [21], an augmented FCM for clustering spatiotemporal data by Izakian et al. [22], and so forth. However, these fuzzy clustering algorithms always need to be given a number of clusters a priori. In general, the cluster number c is unknown. In this case, validity indices can be used to find a cluster number c, where they are supposed to be independent of clustering algorithms. Many cluster validity indices for fuzzy clustering algorithms have been proposed in the literature, such as the partition coefficient (PC) [23], partition entropy (PE) [24], normalizations of PC and PE [25–26], fuzzy hypervolume (FHV) [27], and XB (Xie and Beni [28]).

Frigui and Krishnapuram [29] proposed the robust competitive agglomerative (RCA) algorithm by adding a loss function of clusters and a weight function of data points to clusters. The RCA algorithm can be used for determining a cluster number. Starting with a large cluster number, RCA reduces the number by discarding clusters with small cardinality. Some initial parameter values are needed in RCA, such as a time constant, a discarding threshold, a tuning factor, etc. Another clustering algorithm was presented by Rodriguez and Laio [30] for clustering by fast search, called C-FS, using a similarity matrix for finding density peaks. They proposed the C-FS algorithm by assigning a cutoff distance dc and selecting a decision window so that it can automatically determine a number of clusters. In [30], the cutoff distance dc becomes another parameter on which clustering results are heavily dependent. Recently, Fazendeiro and Oliveira [31] presented a fuzzy clustering algorithm with an unknown number of clusters based on an observer position, called a focal point. With this point, the observer can select a suitable viewpoint while searching for clusters that is actually appropriate to the underlying data structure. After the focal point is chosen, the initialization of cluster centers must be generated randomly. The inverse of the XB index is used to compute the validity measure, and its maximal value is chosen to get the best number of clusters. Although these algorithms can find a number of clusters during their iteration procedures, they are still dependent on initializations and parameter selections.

Up to now, there has been no work in the literature that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index, and independent of a given number of clusters. We think that this may be due to the difficulty of constructing this kind of robust FCM. In this paper, we try to construct a robust learning-based framework for fuzzy clustering, especially for the FCM algorithm. This framework can automatically find the best number of clusters, without any initialization and parameter selection, and it is also free of the fuzziness index m. We first consider some entropy-type penalty terms for adjusting the bias, and then create a robust-learning mechanism for finding the best number of clusters. The organization of this paper is as follows. In Section 2, we construct a robust learning-based framework for fuzzy clustering. The robust-learning FCM (RL-FCM) clustering algorithm is also presented in this section. In Section 3, several experimental examples and comparisons with numeric and real data sets are provided to demonstrate the effectiveness of the proposed RL-FCM, which can automatically find the best number of clusters. Finally, conclusions are stated in Section 4.

2. Robust-learning fuzzy c-means clustering algorithm

Let $X = \{x_1, \ldots, x_n\}$ be a data set in a d-dimensional Euclidean space $R^d$ and $V = \{v_1, \ldots, v_c\}$ be the c cluster centers, with the Euclidean distance $d_{ik} = \|x_i - v_k\|$ so that $d_{ik}^2 = \sum_{j=1}^{d}(x_{ij} - v_{kj})^2$. The fuzzy c-means (FCM) objective function [9–10] is given by $J_m(U,V) = \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}^m d_{ik}^2$, where $m > 1$ is the fuzziness index and $\mu = \{\mu_{ik}\}_{n\times c} \in M_{fcn}$ is a fuzzy partition matrix with $M_{fcn} = \{\mu = [\mu_{ik}]_{n\times c} \mid \forall i, \forall k,\ 0 \le \mu_{ik} \le 1,\ \sum_{k=1}^{c}\mu_{ik} = 1,\ 0 < \sum_{i=1}^{n}\mu_{ik} < n\}$. The FCM algorithm is iterated through the necessary conditions for minimizing $J_m(U,V)$, with the updating equations for cluster centers and memberships given by $v_k = \sum_{i=1}^{n}\mu_{ik}^m x_i / \sum_{i=1}^{n}\mu_{ik}^m$ and $\mu_{ik} = (d_{ik})^{-2/(m-1)} / \sum_{t=1}^{c}(d_{it})^{-2/(m-1)}$.

We know that the FCM algorithm is dependent on initial values, and some parameters need to be given a priori, such as the fuzziness index m, the cluster center initialization, and also the number of clusters. Although there exist some works in the literature that solve some of these problems in FCM, such as Dembélé and Kastner [32] and Schwämmle and Jensen [33] on estimating the fuzziness index m for clustering microarray data, there is no work that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index m, and independent of a given number of clusters. Next, we construct a robust learning-based schema for FCM to solve these problems simultaneously. Our basic idea is as follows: we first consider all data points as initial cluster centers, i.e., the number of data points is the initial number of clusters. After that, we use the mixing proportion $\alpha_k$ of cluster k, which acts like a cluster weight, and discard those clusters whose $\alpha_k$ values are less than one over the number of data points. The proposed algorithm can iteratively obtain the best number of clusters until it converges.

For a data set $X = \{x_1, \ldots, x_n\}$ in $R^d$ with c cluster centers, to make FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index m, and able to automatically find the best number of clusters, we add several entropy terms to the FCM objective function. First, to construct an algorithm free of the fuzziness index m, we replace m by adding an extra term that is a function of $\mu_{ik}$. In this sense, we consider the concept of MEC [16–18] by adding the entropy term of memberships $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}$. Moreover, we use a learning function r, i.e. $r\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}$, to learn the effects of the entropy term for adjusting bias. We next use the mixing proportions $\alpha = (\alpha_1, \cdots, \alpha_c)$ of the clusters, where $\alpha_k$ represents the probability of one data point belonging to the kth cluster, with the constraint $\sum_{k=1}^{c}\alpha_k = 1$. Hence, $-\ln\alpha_k$ is the information in the occurrence of a data point belonging to the kth cluster. Thus, we add the entropy term $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_k$ to summarize the average information for the occurrence of a data point belonging to the corresponding cluster over the fuzzy memberships. Furthermore, we borrow the idea of Yang et al. [34] for the EM algorithm by using the entropy term $\sum_{k=1}^{c}\alpha_k\ln\alpha_k$ to represent the average information for the occurrence of each data point belonging to the corresponding cluster. In total, the entropy terms of the mixing proportions in probability and of the average occurrence in probability over fuzzy memberships are used for learning to find the best number of clusters.
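Before the RL-FCM objective is assembled from these terms below, it may help to have the classical FCM baseline reviewed above in code form. The following is a minimal sketch of standard FCM (not the proposed RL-FCM), assuming NumPy, a user-supplied fuzziness index m, and a cluster number c given a priori; all function and variable names are ours.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal standard FCM: alternate the membership and center updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = X[rng.choice(n, c, replace=False)]       # random initial centers (FCM needs c a priori)
    for _ in range(max_iter):
        D2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # squared distances d_ik^2
        D2 = np.maximum(D2, 1e-12)                                 # avoid division by zero
        U = D2 ** (-1.0 / (m - 1.0))             # mu_ik proportional to d_ik^(-2/(m-1))
        U /= U.sum(axis=1, keepdims=True)
        W = U ** m
        V_new = (W.T @ X) / W.sum(axis=0)[:, None]   # v_k = sum_i mu_ik^m x_i / sum_i mu_ik^m
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return U, V
```

Note how both the random initial centers and the fuzziness index m appear explicitly here; these are exactly the dependencies that the RL-FCM construction below removes.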

According to the above construction for FCM, we propose the robust-learning FCM (RL-FCM) objective function as follows:

$$J(U,\alpha,V) = \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik} d_{ik}^2 - r_1\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_k + r_2\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik} - r_3 n\sum_{k=1}^{c}\alpha_k\ln\alpha_k, \qquad (1)$$

where $r_1, r_2, r_3 \ge 0$ and $d_{ik}^2 = \|x_i - v_k\|^2 = \sum_{j=1}^{d}(x_{ij} - v_{kj})^2$. The Lagrangian function of (1) is

$$\tilde{J}(U,\alpha,\lambda_1,\lambda_2,V) = \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik} d_{ik}^2 - r_1\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_k + r_2\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik} - r_3 n\sum_{k=1}^{c}\alpha_k\ln\alpha_k - \lambda_1\Big(\sum_{k=1}^{c}\mu_{ik} - 1\Big) - \lambda_2\Big(\sum_{k=1}^{c}\alpha_k - 1\Big). \qquad (2)$$

By considering the Lagrangian function in (2), the updating equations for the membership function, cluster centers, and mixing proportions are obtained as follows. The updating equation for the RL-FCM objective function $J(U,\alpha,V)$ with respect to $v_k$ is Eq. (3):

$$v_k = \sum_{i=1}^{n}\mu_{ik} x_i \Big/ \sum_{i=1}^{n}\mu_{ik}. \qquad (3)$$

By taking the partial derivative of the Lagrangian in Eq. (2) with respect to $\mu_{ik}$ and setting it to zero, we obtain $\partial\tilde{J}/\partial\mu_{ik} = d_{ik}^2 - r_1\ln\alpha_k + r_2(\ln\mu_{ik} + 1) - \lambda_1 = 0$, and then $\ln\mu_{ik} = (-d_{ik}^2 + r_1\ln\alpha_k + \lambda_1 - r_2)/r_2$. Thus, the updating equation for $\mu_{ik}$ is obtained as follows:

$$\mu_{ik} = \exp\!\Big(\frac{-d_{ik}^2 + r_1\ln\alpha_k}{r_2}\Big) \Big/ \sum_{t=1}^{c}\exp\!\Big(\frac{-d_{it}^2 + r_1\ln\alpha_t}{r_2}\Big). \qquad (4)$$

Similarly, we have $\partial\tilde{J}/\partial\alpha_k = -r_1\sum_{i=1}^{n}\mu_{ik}/\alpha_k - r_3 n(\ln\alpha_k + 1) - \lambda_2 = 0$. By multiplying by $\alpha_k$, we obtain

$$-r_1\sum_{i=1}^{n}\mu_{ik} - r_3 n\alpha_k(\ln\alpha_k + 1) - \lambda_2\alpha_k = 0 \qquad (5)$$

and then $-r_1\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik} - \sum_{k=1}^{c} n r_3\alpha_k\ln\alpha_k - \sum_{k=1}^{c} n r_3\alpha_k - \sum_{k=1}^{c}\lambda_2\alpha_k = 0$. We get

$$\lambda_2 = -n r_1 - n r_3\sum_{k=1}^{c}\alpha_k\ln\alpha_k - n r_3. \qquad (6)$$

By substituting (6) into (5), we have $-r_1\sum_{i=1}^{n}\mu_{ik} - n r_3\alpha_k(\ln\alpha_k + 1) - \big(-n r_1 - n r_3\sum_{k=1}^{c}\alpha_k\ln\alpha_k - n r_3\big)\alpha_k = 0$. Thus, the updating equation for $\alpha_k$ is obtained as follows:

$$\alpha_k^{(new)} = \frac{1}{n}\sum_{i=1}^{n}\mu_{ik} + \frac{r_3}{r_1}\,\alpha_k^{(old)}\Big(\ln\alpha_k^{(old)} - \sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}\Big). \qquad (7)$$

For solving the initialization problem, all data points are assigned as initial clusters for the first iteration. That is, $c^{(0)} = n$ and $\alpha_k^{(0)} = 1/c = 1/n$, $k = 1, \cdots, c$. There are competitions between these mixing proportions according to Eq. (7). Iteratively, the algorithm can find the final number of clusters c by utilizing the following Eq. (8). When $\alpha_k^{(new)} < 1/n$, we discard the illegitimate mixing proportion $\alpha_k^{(new)}$. Therefore, the updated number of clusters $c^{(new)}$ is

$$c^{(new)} = c^{(old)} - \big|\{\alpha_k^{(new)} \mid \alpha_k^{(new)} < 1/n,\ k = 1, 2, \ldots, c^{(old)}\}\big|, \qquad (8)$$

where $|\{\cdot\}|$ denotes the cardinality of the set $\{\cdot\}$. After updating the number of clusters c, the remaining mixing proportions $\alpha_k^*$ and corresponding $\mu_{ik}^*$ need to be re-normalized by

$$\alpha_k^* = \alpha_k^* \Big/ \sum_{t=1}^{c^{(new)}}\alpha_t^* \qquad (9)$$

$$\mu_{ik}^* = \mu_{ik}^* \Big/ \sum_{t=1}^{c^{(new)}}\mu_{it}^*. \qquad (10)$$

Eqs. (9) and (10) keep the constraints $\sum_{k=1}^{c^{(new)}}\alpha_k^* = 1$ and $\sum_{k=1}^{c^{(new)}}\mu_{ik}^* = 1$. We utilize this concept to estimate the best number of clusters $c^*$.

A new problem is how to learn the values of the three parameters $r_1$, $r_2$, and $r_3$ for the three penalty terms $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_k$, $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}$, and $\sum_{k=1}^{c}\alpha_k\ln\alpha_k$, respectively. Considering some decreasing learning rates, such as $e^{-t}$, $e^{-t/10}$, $e^{-t/100}$, and $e^{-t/1000}$, we know that $y = e^{-t/1000}$ decreases more slowly, while $y = e^{-t}$ decreases faster. Since $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_k$ has an effect on both the membership partition and the mixing proportions, we assume that $r_1$ should be set to decrease neither too slowly nor too fast. Therefore, we set $r_1$ as

$$r_1^{(t)} = e^{-t/10}. \qquad (11)$$

On the other hand, because the term $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}$ is the entropy of the partition memberships $\mu_{ik}$ and has an effect on the clustering results, the parameter $r_2$ should maintain a large value and does not need too much variation in the iterative process. In this sense, we consider a decreasing learning rate for $r_2$ by assigning it

$$r_2^{(t)} = e^{-t/100}. \qquad (12)$$

In order to avoid $\sum_{k=1}^{c}\alpha_k\ln\alpha_k$ interfering with $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}$ when the algorithm is stable, the term $\sum_{k=1}^{c}\alpha_k\ln\alpha_k$ needs a large effect in the initial iterations and a small effect when the algorithm is stable. Since $r_3$ is a control scale for the entropy of $\alpha_k$, we consider that $r_3$ is related to the variation of the mixing proportions $|\alpha_k^{(new)} - \alpha_k^{(old)}|$. Our goal is that $r_3$ can control the competition of the mixing proportions. Therefore, $r_3$ is first defined as

$$r_3 = \frac{\sum_{k=1}^{c}\exp\big(-\eta\, n\,|\alpha_k^{(new)} - \alpha_k^{(old)}|\big)}{c}, \qquad (13)$$

where $\eta = \min\{1, 2/d^{\lfloor d/2\rfloor - 1}\}$ and the notation $\lfloor a\rfloor$ denotes the largest integer no more than a. In Eq. (13), if $|\alpha_k^{(new)} - \alpha_k^{(old)}|$ is small, then $r_3$ becomes large to enhance the competition. If $|\alpha_k^{(new)} - \alpha_k^{(old)}|$ is large, then $r_3$ is small to maintain stability. In addition, the competition of the mixing proportions for higher dimensional data needs a larger value of $r_3$; therefore, $\eta = \min\{1, 2/d^{\lfloor d/2\rfloor - 1}\}$ is used to adjust $r_3$. Furthermore, we need to consider the restriction $\max_{1\le k\le c}\alpha_k^{(new)} \le 1$. However, $\max_{1\le k\le c}\alpha_k^{(new)} \le \max_{1\le k\le c}\big(\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}\big) + \tfrac{r_3}{r_1}\max_{1\le k\le c}\alpha_k^{(old)}\big(\ln\max_{1\le k\le c}\alpha_k^{(old)} - \sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}\big) < \max_{1\le k\le c}\big(\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}\big) - \tfrac{r_3}{r_1}\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}$. Therefore, if $\max_{1\le k\le c}\big(\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}\big) - \tfrac{r_3}{r_1}\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)} \le 1$, then the restriction will hold. It follows that

$$r_3 \le \Big(1 - \max_{1\le k\le c}\frac{1}{n}\sum_{i=1}^{n}\mu_{ik}\Big) \Big/ \Big(-\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}\Big). \qquad (14)$$
Combining Eq. (13) and Eq. (14), we obtain

$$r_3 = \min\left\{\frac{\sum_{k=1}^{c}\exp\big(-\eta\, n\,|\alpha_k^{(new)} - \alpha_k^{(old)}|\big)}{c},\ \ \frac{1 - \max_{1\le k\le c}\frac{1}{n}\sum_{i=1}^{n}\mu_{ik}}{-\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}}\right\}. \qquad (15)$$
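As a concrete illustration of the schedules in Eqs. (11)–(15), the following sketch computes $r_1^{(t)}$, $r_2^{(t)}$, and $r_3$ from the iteration counter and the old and new mixing proportions. It is our own minimal reading of the formulas (NumPy assumed; the function name and signature are ours, and degenerate cases such as a single remaining cluster are not guarded).

```python
import numpy as np

def learning_rates(t, alpha_new, alpha_old, U, n, d):
    """r1(t), r2(t) from Eqs. (11)-(12) and r3 from Eqs. (13)-(15)."""
    r1 = np.exp(-t / 10.0)                                    # Eq. (11)
    r2 = np.exp(-t / 100.0)                                   # Eq. (12)
    c = alpha_old.shape[0]
    eta = min(1.0, 2.0 / d ** (np.floor(d / 2.0) - 1))        # eta = min{1, 2 / d^(floor(d/2)-1)}
    r3_competition = np.exp(-eta * n * np.abs(alpha_new - alpha_old)).sum() / c   # Eq. (13)
    # Eq. (14): cap r3 so that max_k alpha_k^(new) cannot exceed 1
    cap = (1.0 - U.mean(axis=0).max()) / (
        -alpha_old.max() * (alpha_old * np.log(alpha_old)).sum())
    return r1, r2, min(r3_competition, cap)                   # Eq. (15)
```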

When the number of clusters c is stable, the competition of the mixing proportions stops, and the parameter r3 can be set to 0 at that point. In our experiments, the number of clusters c is usually stable once the iteration count t is larger than or equal to 100. Thus, we give the flowchart of the proposed robust-learning FCM (RL-FCM) in Fig. 1, and the RL-FCM clustering algorithm is summarized as follows:
RL-FCM Algorithm
Fix ε > 0. Give initials c^(0) = n, v_k^(0) = x_k (k = 1, …, n), α_k^(0) = 1/n, and initial learning rates r1^(0) = r2^(0) = r3^(0) = 1. Let t = 1.
Step 1: Compute μ_ik^(t) using v_k^(t−1), α_k^(t−1), c^(t−1), r1^(t−1), r2^(t−1) by Eq. (4).
Step 2: Update r1^(t) and r2^(t) by Eqs. (11) and (12), respectively.
Step 3: Update α_k^(t) with μ_ik^(t) and α_k^(t−1) by Eq. (7).
Step 4: Update r3^(t) with α_k^(t) and α_k^(t−1) by Eq. (15).
Step 5: Update c^(t−1) to c^(t) by discarding those clusters with α_k^(t) ≤ 1/n using Eq. (8), and normalize α_k^(t) and μ_ik^(t) by Eqs. (9) and (10), respectively.
        IF t ≥ 100 and c^(t−100) − c^(t) = 0, THEN let r3^(t) = 0.
Step 6: Update v_k^(t) using c^(t) and μ_ik^(t) by Eq. (3).
Step 7: Compare v_k^(t) and v_k^(t−1).
        IF max_{1≤k≤c^(t−1)} ‖v_k^(t) − v_k^(t−1)‖ < ε, STOP.
        ELSE let t = t + 1 and return to Step 1.
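The step-by-step listing above translates almost directly into code. The following is a compact, illustrative sketch of the whole RL-FCM loop with Eqs. (3), (4), (7)–(10) and the schedules of Eqs. (11)–(15) inlined; it assumes NumPy, adds a standard log-sum-exp stabilization to Eq. (4), and is our reading of the algorithm rather than the authors' reference implementation.

```python
import numpy as np

def rl_fcm(X, eps=1e-5, max_iter=300):
    """Sketch of the RL-FCM loop: start with c = n clusters, let them compete, discard weak ones."""
    n, d = X.shape
    V = X.copy()                           # every data point is an initial cluster center (c(0) = n)
    alpha = np.full(n, 1.0 / n)
    r1, r2, r3 = 1.0, 1.0, 1.0
    c_hist = [n]
    for t in range(1, max_iter + 1):
        # Step 1: membership update, Eq. (4), with log-sum-exp stabilization
        D2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        logits = (-D2 + r1 * np.log(alpha)) / r2
        logits -= logits.max(axis=1, keepdims=True)
        U = np.exp(logits)
        U /= U.sum(axis=1, keepdims=True)
        # Step 2: learning-rate schedules, Eqs. (11)-(12)
        r1, r2 = np.exp(-t / 10.0), np.exp(-t / 100.0)
        # Step 3: mixing-proportion update, Eq. (7)
        entropy = (alpha * np.log(alpha)).sum()
        alpha_new = U.mean(axis=0) + (r3 / r1) * alpha * (np.log(alpha) - entropy)
        # Step 4: r3 update, Eqs. (13)-(15)
        c = alpha.shape[0]
        eta = min(1.0, 2.0 / d ** (np.floor(d / 2.0) - 1))
        cap = (1.0 - U.mean(axis=0).max()) / max(-alpha.max() * entropy, 1e-12)
        r3 = min(np.exp(-eta * n * np.abs(alpha_new - alpha)).sum() / c, cap)
        # Step 5: discard clusters with small mixing proportion and renormalize, Eqs. (8)-(10)
        keep = alpha_new >= 1.0 / n
        alpha = alpha_new[keep] / alpha_new[keep].sum()
        U = U[:, keep] / U[:, keep].sum(axis=1, keepdims=True)
        c_hist.append(int(keep.sum()))
        if t >= 100 and c_hist[t] == c_hist[t - 100]:
            r3 = 0.0                       # cluster number stable: stop the competition
        # Step 6: center update, Eq. (3)
        V_old = V[keep]
        V = (U.T @ X) / U.sum(axis=0)[:, None]
        # Step 7: convergence test on the surviving centers
        if np.max(np.linalg.norm(V - V_old, axis=1)) < eps:
            break
    return U, V, alpha
```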

We analyze the computational complexity of the RL-FCM algorithm as follows. The RL-FCM algorithm can be divided into three parts: (1) computing the membership partition μ_ik, which needs O(nc²d); (2) computing the mixing proportions α_k, which needs O(nc²); and (3) updating the cluster centers v_k, which needs O(nc). Because the big-O notation (i.e., O(·)) only considers the upper bound on the growth rate of the function, the total computational complexity of the RL-FCM algorithm is O(nc²d), where n is the number of data points, c is the number of clusters, and d is the dimension of the data points. In fact, RL-FCM has the same computational complexity as FCM, i.e., O(nc²d). The main difference between RL-FCM and FCM is the initial number c of clusters, where c is given a priori in FCM, but c decreases from n to c_final, with n >> c_final, in RL-FCM. We should mention that, even though the RL-FCM algorithm uses the number n of data points as the number c of clusters (i.e., c = n) at the beginning of the iterations, the time per iteration decreases rapidly after several iterations. This occurs because the clusters with α_k ≤ 1/c = 1/n are discarded during the iterations, so that the number c of clusters decreases rapidly after some iterations. We will demonstrate this behavior of RL-FCM in the next section.

Fig. 1. Flowchart of RL-FCM.

In the next Example 1, we demonstrate these robust learning behaviors of RL-FCM for obtaining the best number c∗ of clusters. For measuring clustering performance, we use the accuracy rate (AR), with $AR = \sum_{k=1}^{c} n(c_k)/n$, where $n(c_k)$ is the number of data points correctly clustered into cluster k, and n is the total number of data points. The larger the AR, the better the clustering performance.

For the Gaussian mixture model in Example 1, data are generated from the d-variate normal mixture model $f(x;\alpha,\theta) = \sum_{k=1}^{c}\alpha_k f(x;\theta_k) = \sum_{k=1}^{c}\alpha_k (2\pi)^{-d/2}\,|\Sigma_k|^{-1/2}\, e^{-\frac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)}$, where $\alpha_k > 0$ denotes the mixing proportions with $\sum_{k=1}^{c}\alpha_k = 1$, and $f(x;\theta_k)$ denotes the density of x from the kth group with the corresponding parameter $\theta_k$ consisting of a mean vector $\mu_k$ and a covariance matrix $\Sigma_k$.

Fig. 2. (a) 3-cluster dataset; (b) Clustering result after 2 iterations with 289 clusters; (c) Clustering result after 5 iterations with 190 clusters; (d) Clustering result after 10 iterations with 78 clusters; (e) Clustering result after 15 iterations with 10 clusters; (f) Final result from RL-FCM with 3 clusters.

Example 1. In this example, we use a data set with 500 data points generated from a three-component Gaussian mixture distribution, with parameters $\alpha_1 = \alpha_2 = \alpha_3 = 1/3$, $\mu_1 = (0\ 0)^{\top}$, $\mu_2 = (7\ 0)^{\top}$, $\mu_3 = (14\ 0)^{\top}$, and $\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, as shown in Fig. 2(a). After two iterations, the number of clusters decreases rapidly from 500 to 289, as shown in Fig. 2(b). The RL-FCM algorithm decreases the number of clusters to 190, 78, and 10 clusters after 5, 10, and 15 iterations (see Fig. 2(c)–(e)), respectively. Finally, after 22 iterations (see Fig. 2(f)), the RL-FCM algorithm reaches convergence, where three clusters are formed with c∗ = 3 and AR = 1.00.

3. Experimental results and comparisons

In this section we present some experimental examples with artificial and real data sets, as well as image segmentations, to show the performance of the proposed RL-FCM algorithm. The validity indices of partition coefficient (PC) [23], partition entropy (PE) [24], modified PC (MPC) [25], modified PE (MPE) [26], fuzzy hypervolume (FHV) [27], and Xie and Beni (XB) [28] are computed and compared with the number of clusters obtained from RL-FCM. Comparisons between RL-FCM, FCM, robust competitive agglomeration (RCA) [29], clustering by fast search (C-FS) [30], observer-biased (OB) [31], and robust EM (R-EM) [34] are also made, using the fuzziness index m = 2.

Example 2. In this example, we use a data set with 500 data points generated from a five-component Gaussian mixture distribution, with parameters $\alpha_k = 1/5$, $k = 1, \ldots, 5$, $\mu_1 = (0\ 0)^{\top}$, $\mu_2 = (0\ 6)^{\top}$, $\mu_3 = (10\ 0)^{\top}$, $\mu_4 = (10\ 6)^{\top}$, $\mu_5 = (5\ 3)^{\top}$, $\Sigma_1 = \Sigma_2 = \Sigma_3 = \Sigma_4 = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, and $\Sigma_5 = \begin{pmatrix}0.01 & 0\\ 0 & 10\end{pmatrix}$, as shown in Fig. 3(a). The data set appears as two circles at the top side, two circles at the bottom side, and a separated horizontal line shape in the center. The proposed RL-FCM algorithm is applied to this data set, and the number of clusters decreases from 500 to 5 after 26 iterations, as shown in Fig. 3(b). We find that RL-FCM can detect five clusters with c∗ = 5 and AR = 0.9960, while using FCM with the number of clusters assigned as c = 5 yields the clustering results shown in Fig. 3(c) with the same AR.

As is known, cluster validity indices are usually used to find the number of clusters. In this example, PC [23], PE [24], MPC [25], MPE [26], FHV [27], and XB [28] are computed to check whether RL-FCM is valid or not. All validity index values are shown in Table 1. The indices PC and PE give the best number of clusters as c∗ = 2, but the indices MPC, MPE, FHV, and XB give the best number of clusters as c∗ = 5.

Table 1
Validity index values of PC, PE, MPC, MPE, FHV and XB for the data set in Fig. 3(a).

c | PC | PE | MPC | MPE | FHV | XB
2 | 0.7497 | 0.4013 | 0.4994 | 0.4210 | 13.3712 | 0.1482
3 | 0.6414 | 0.6335 | 0.4621 | 0.4234 | 13.3343 | 0.2399
4 | 0.7219 | 0.5572 | 0.6292 | 0.5980 | 7.6920 | 0.0990
5 | 0.7643 | 0.5202 | 0.7053 | 0.6768 | 5.2695 | 0.0553
6 | 0.7280 | 0.6134 | 0.6736 | 0.6577 | 5.6192 | 0.1313
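For completeness, the two simplest validity indices used in these comparisons, the partition coefficient (PC) and the partition entropy (PE), can be computed directly from the fuzzy membership matrix. The sketch below follows their standard definitions [23,24]; NumPy is assumed, and the function names are ours.

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/n) * sum_i sum_k mu_ik^2; larger values indicate a crisper partition."""
    n = U.shape[0]
    return (U ** 2).sum() / n

def partition_entropy(U, eps=1e-12):
    """PE = -(1/n) * sum_i sum_k mu_ik * log(mu_ik); smaller values are better."""
    n = U.shape[0]
    return -(U * np.log(U + eps)).sum() / n
```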

Fig. 3. (a) 5-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 5.

Furthermore, to demonstrate the performance of RL-FCM on noisy data, we add 50 uniformly distributed noisy points (i.e., background outliers) to the data set of Fig. 3(a). The noisy data set is shown in Fig. 4(a). We implement the RL-FCM algorithm on the noisy data set, with clustering results shown in Fig. 4(b). We find that the RL-FCM algorithm still obtains five clusters after 20 iterations with AR = 0.9940. In this case, the RL-FCM algorithm is quite robust in this noisy environment.

Fig. 4. (a) 5-cluster dataset with noisy points; (b) Final result from RL-FCM.

Example 3. In this example, we show another data set with 13 blocks generated from a continuous uniform distribution, where each block contains 50 data points, as shown in Fig. 5(a). Using the RL-FCM algorithm, we get 13 clusters after 19 iterations. We find that RL-FCM can detect 13 clusters with c∗ = 13 and AR = 1.00, as shown in Fig. 5(b). The validity index values of PC, PE, MPC, MPE, FHV and XB are shown in Table 2, where the validity indices PC, PE and MPC give the best number of clusters as c∗ = 2, and the validity indices MPE, FHV, and XB give the best number of clusters as c∗ = 13.

Fig. 5. (a) 13-cluster data set; (b) Final result from RL-FCM.

Table 2
Validity index values of PC, PE, MPC, MPE, FHV and XB for the data set in Fig. 5(a).

c | PC | PE | MPC | MPE | FHV | XB
2 | 0.8173 | 0.3109 | 0.6347 | 0.5514 | 9.4148 | 0.0985
3 | 0.6920 | 0.5526 | 0.5380 | 0.4970 | 9.9570 | 0.1395
4 | 0.6201 | 0.7160 | 0.4935 | 0.4835 | 10.0563 | 0.1968
5 | 0.5714 | 0.8538 | 0.4642 | 0.4695 | 10.2332 | 0.1750
6 | 0.5571 | 0.9122 | 0.4685 | 0.4909 | 9.6476 | 0.1611
7 | 0.5478 | 0.9715 | 0.4725 | 0.5008 | 9.0646 | 0.1420
8 | 0.5553 | 0.9845 | 0.4917 | 0.5266 | 8.3539 | 0.1419
9 | 0.5530 | 1.0206 | 0.4971 | 0.5355 | 7.9127 | 0.1121
10 | 0.5773 | 0.9970 | 0.5303 | 0.5670 | 6.5653 | 0.1184
11 | 0.5866 | 0.9987 | 0.5453 | 0.5835 | 5.9894 | 0.0910
12 | 0.6072 | 0.9707 | 0.5715 | 0.6094 | 5.1708 | 0.0826
13 | 0.6201 | 0.9540 | 0.5885 | 0.6281 | 4.6007 | 0.0683
14 | 0.6055 | 1.0003 | 0.5751 | 0.6210 | 4.7423 | 0.2000

We continue by adding 100 noisy points to the data set of Fig. 5(a). The noisy data set is shown in Fig. 6(a). The effectiveness of RL-FCM in handling noise is demonstrated in Fig. 6(b), where we can see that the RL-FCM algorithm still obtains 13 clusters after 20 iterations with AR = 1.00.

Fig. 6. (a) 13-cluster dataset with 100 noisy points; (b) Final results from RL-FCM.

Fig. 7. (a) Original data set; (b) Final result from RL-FCM; (c) Final result from FCM with c = 6.

Example 4. Modifying the setting of Fazendeiro and Oliveira [31], 300 data points of a 2D Gaussian mixture distribution are constructed as seen in Fig. 7(a), with parameters $\alpha_k = 1/6$, $\mu_1 = (1\ 3)^{\top}$, $\mu_2 = (3\ 2)^{\top}$, $\mu_3 = (3\ 7)^{\top}$, $\mu_4 = (5\ 5)^{\top}$, $\mu_5 = (7\ 8)^{\top}$, $\mu_6 = (9\ 6)^{\top}$, and $\Sigma_k = \begin{pmatrix}0.4 & 0\\ 0 & 0.4\end{pmatrix}$, $k = 1, \ldots, 6$, where two clusters are overlapping. The proposed RL-FCM algorithm can obtain six clusters, as can the clustering procedure of Fazendeiro and Oliveira [31], after 28 iterations, as shown in Fig. 7(b), with AR = 0.9667. Using FCM with c = 6, the average AR is 0.8910 (Fig. 7(c)). The validity index values of PC, PE, MPC, MPE, FHV and XB are shown in Table 3, where the validity indices MPC, MPE, FHV, and XB give the number of clusters as c∗ = 6 and the validity indices PC and PE give the number of clusters as c∗ = 2.

Table 3
Validity index values of PC, PE, MPC, MPE, FHV and XB for the data set in Fig. 7(a).

c | PC | PE | MPC | MPE | FHV | XB
2 | 0.7856 | 0.3458 | 0.5713 | 0.5010 | 4.6115 | 0.1063
3 | 0.7293 | 0.5035 | 0.5940 | 0.5417 | 3.6960 | 0.1178
4 | 0.7016 | 0.5935 | 0.6021 | 0.5719 | 3.2528 | 0.1744
5 | 0.6727 | 0.6809 | 0.5909 | 0.5769 | 3.0853 | 0.1804
6 | 0.7142 | 0.6284 | 0.6571 | 0.6493 | 2.2999 | 0.1023
7 | 0.6622 | 0.7420 | 0.6059 | 0.6187 | 2.5522 | 0.5717

Moreover, to show the performance of RL-FCM on a more overlapping data set, we generate 500 data points of a 2D Gaussian mixture distribution, as seen in Fig. 8(a), with parameters $\alpha_k = 1/6$, $\mu_1 = (1.5\ 3)^{\top}$, $\mu_2 = (3\ 3)^{\top}$, $\mu_3 = (3\ 6)^{\top}$, $\mu_4 = (5\ 5)^{\top}$, $\mu_5 = (6\ 7)^{\top}$, $\mu_6 = (8\ 6.5)^{\top}$, and $\Sigma_k = \begin{pmatrix}0.4 & 0\\ 0 & 0.4\end{pmatrix}$, $k = 1, \ldots, 6$. The proposed RL-FCM algorithm can still obtain six clusters after 32 iterations, as shown in Fig. 8(b), with AR = 0.9140, while using FCM with c = 6, the average AR is 0.8986 (Fig. 8(c)).

Example 5. Considering a larger number of clusters, in this example we generate 500 data points from a 2D Gaussian mixture distribution, as shown in Fig. 9(a), where the data set consists of 25 clusters with $\Sigma_k = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$, $k = 1, \ldots, 25$. Using the RL-FCM algorithm, the data set can be well separated into 25 clusters after 31 iterations, as shown in Fig. 9(b), with AR = 1.00. Using FCM with c = 25, the average AR is 0.8558, as shown in Fig. 9(c).

Furthermore, we continue with a data set with $\Sigma_k = \begin{pmatrix}3 & 0\\ 0 & 3\end{pmatrix}$, $k = 1, \ldots, 25$, as shown in Fig. 10(a). We find that RL-FCM still obtains 25 clusters after 36 iterations with AR = 0.9160, as shown in Fig. 10(b). Using FCM with c = 25, the average AR is 0.8340, as shown in Fig. 10(c).

Example 6. In this example, we consider 3-dimensional data with 17 blocks generated from uniform distributions, where each block contains 50 data points, as shown in Fig. 11(a). Using the RL-FCM algorithm, we get 17 clusters after 51 iterations, as shown in Fig. 11(b). RL-FCM can detect 17 clusters with AR = 1.00, while using FCM with c = 17, the average AR is 0.8851, as shown in Fig. 11(c).

We continue with a larger number of clusters, namely 21 blocks generated from uniform distributions, as shown in Fig. 12(a). RL-FCM obtains 21 clusters after 49 iterations with AR = 0.9962, as shown in Fig. 12(b). Using FCM with c = 21, the average AR is 0.8826, as shown in Fig. 12(c).

We also study different learning behaviors for the parameters r1 and r2. For all of the above examples, we implement the RL-FCM algorithm using four learning functions, i.e., e−t, e−t/10, e−t/100, and e−t/1000. Clustering results obtained using r1 = e−t indicate that the cluster bias is adjusted too fast, so it is very difficult to target optimal cluster centers with the correct number of clusters. On the other hand, when r1 = e−t/1000, the cluster bias is adjusted too slowly, so better clustering results with the correct number of clusters cannot be obtained. Tables 4 and 5 show the cluster number results using different learning rates for r1 and r2 (ARs are presented inside brackets), respectively. We can also see the behavior of r1 and r2 from these tables. In most cases, a larger r1 yields a larger cluster number, while for r2, a larger r2 gives a smaller cluster number. Using r1 = e−t/10, r1 = e−t/100, r1 = e−t/1000 or r2 = e−t/10, r2 = e−t/100, r2 = e−t/1000, some cluster bias can be adjusted well and targeted to optimal cluster centers. However, r1 = e−t/10 and r2 = e−t/100 most often give the best number of clusters with the smallest error rate. Overall, we recommend the learning functions r1 = e−t/10 and r2 = e−t/100 as the decreasing learning rates of the parameters r1 and r2, respectively, in the proposed RL-FCM algorithm.

Table 4
Cluster number results using different learning functions for r1 with r2 = e−t/100 (ARs presented inside brackets).

r1 | Ex. 1 (true c = 3) | Ex. 2 (true c = 5) | Ex. 3 (true c = 13) | Ex. 4 (true c = 6)
e−t | 2 | 2 | 4 | 4
e−t/10 | 3 (1.0000) | 5 (0.9960) | 13 (1.0000) | 6 (0.9667)
e−t/100 | 9 | 12 | 13 (1.0000) | 6 (0.9633)
e−t/1000 | 9 | 9 | 13 (1.0000) | 6 (0.9633)

Fig. 8. (a) Original data set; (b) Final result from RL-FCM; (c) Final result from FCM with c = 6.

Fig. 9. (a) 25-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 25.

Fig. 10. (a) 25-cluster data set with closer cluster; (b) Final result from RL-FCM; (c) Final result from FCM with c = 25.

Example 7 (Iris data set). In this example, we use the Iris real data set from the UCI Machine Learning Repository [35], which contains 150 data points with four attributes, i.e., sepal length (SL, in cm), sepal width (SW, in cm), petal length (PL, in cm), and petal width (PW, in cm). The Iris data set originally has three clusters (i.e., setosa, versicolor, and virginica). Using the RL-FCM algorithm, we get three clusters in 23 iterations, as shown in Fig. 13. We find that RL-FCM and R-EM can detect three clusters with c∗ = 3 and an AR of 0.9067 (or 14 error counts) and 0.5200 (or 72 error counts), respectively. Using FCM with the number of clusters assigned as c = 3 for the Iris data set, we generally get an average AR of 0.8933 (or 16 error counts).

Fig. 11. (a) 17-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 17.

Fig. 12. (a) 21-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 21.

Table 5
Cluster number results using different learning functions for r2 with r1 = e−t/10 (ARs presented inside brackets).

r2 | Ex. 1 (true c = 3) | Ex. 2 (true c = 5) | Ex. 3 (true c = 13) | Ex. 4 (true c = 6)
e−t | 56 | 49 | 46 | 25
e−t/10 | 6 | 4 | 13 (1.0000) | 5
e−t/100 | 3 (1.0000) | 5 (0.9960) | 13 (1.0000) | 6 (0.9667)
e−t/1000 | 3 (1.0000) | 2 | 13 (1.0000) | 3

Table 6
Validity index values of PC, PE, MPC, MPE, FHV and XB for the Iris data.

c | PC | PE | MPC | MPE | FHV | XB
2 | 0.8920 | 0.1961 | 0.7841 | 0.7172 | 0.0216 | 0.0542
3 | 0.7632 | 0.3959 | 0.6748 | 0.6396 | 0.0205 | 0.1371
4 | 0.7065 | 0.5617 | 0.6087 | 0.5948 | 0.0209 | 0.1958
5 | 0.6654 | 0.6759 | 0.5818 | 0.5800 | 0.0204 | 0.2283
6 | 0.6065 | 0.8178 | 0.5278 | 0.5436 | 0.0217 | 0.3273

The validity index values of PC, PE, MPC, MPE, FHV and XB for the Iris data set are shown in Table 6, where all validity indices, except the index FHV with c∗ = 5, give the best number of clusters as c∗ = 2. However, no validity index gives three clusters with c∗ = 3. For comparison, with suitable parameter selections, RCA and OB give the correct cluster number with average AR = 0.9667 and 0.8975, respectively, while C-FS gives two clusters.

Fig. 13. Final result from RL-FCM for the Iris data set.

Example 8 (Breast data set). The Breast data set consists of 699 instances and 9 attributes, divided into two clusters (benign and malignant) [35]. One attribute with missing values is discarded in this experiment. RL-FCM obtains two clusters with AR = 0.9528, while R-EM obtains seven clusters. Using proper parameter selections, RCA, OB, and C-FS give two clusters with average AR = 0.6552, 0.9471, and 0.7761, respectively. Using FCM with c = 2 gives AR = 0.9356.

Table 7
Cluster numbers obtained by RL-FCM, R-EM, RCA, OB, and C-FS using different parameter selections.

Dataset | True c | RL-FCM | R-EM | RCA | OB | C-FS
Ex. 1 | 3 | 3 | 3 | 3, 4 | 3, 4 | 3
Ex. 2 | 5 | 5 | 5 | 3, 4, 5 | 5, 6, 7, 8 | 5
Ex. 3 | 13 | 13 | 15 | 9, 13, 19, 27, 31, 32 | 2, 3, 4 | 2, 3, 4, 5, 13, 14
Ex. 4 | 6 | 6 | 5 | 4, 5, 6, 7, 8, 9, 10 | 4, 6 | 2, 3, 5, 6
Iris | 3 | 3 | 3 | 2, 3 | 2, 3 | 2
Breast | 2 | 2 | 7 | 2 | 2, 3 | 2
Seeds | 3 | 3 | 3 | 2, 3, 4 | 2 | 2, 3, 4

Table 8
Percentages of RCA, OB, and C-FS runs that obtain the correct cluster number c using 60 different parameter selections.

Dataset | RCA | OB | C-FS
Ex. 1 | 96.67% | 81.67% | 100%
Ex. 2 | 16.67% | 73.33% | 100%
Ex. 3 | 1.67% | 0% | 35%
Ex. 4 | 15% | 25% | 40%
Iris | 41.67% | 13.33% | 0%
Breast | 1.67% | 16.67% | 50%
Seeds | 68.33% | 0% | 65%

Table 9
Average AR and RI from RL-FCM, R-EM, RCA, OB, C-FS and FCM using the true number c of clusters.

Dataset | Measure | RL-FCM | R-EM | RCA | OB | C-FS | FCM
Ex. 1 | AR | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Ex. 1 | RI | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000
Ex. 2 | AR | 0.9960 | 1.0000 | 0.9980 | 0.9925 | 0.9845 | 0.9960
Ex. 2 | RI | 0.9967 | 1.0000 | 0.9984 | 0.9940 | 0.9930 | 0.9967
Ex. 3 | AR | 1.0000 | – | 1.0000 | – | 0.9136 | 0.8736
Ex. 3 | RI | 1.0000 | – | 1.0000 | – | 0.9486 | 0.9795
Ex. 4 | AR | 0.9667 | – | 0.8633 | 0.9664 | 0.9192 | 0.8910
Ex. 4 | RI | 0.9794 | – | 0.9400 | 0.9705 | 0.9558 | 0.9551
Iris | AR | 0.9067 | 0.5200 | 0.9667 | 0.8975 | – | 0.8933
Iris | RI | 0.8923 | 0.7212 | 0.9575 | 0.8836 | – | 0.8797
Breast | AR | 0.9528 | – | 0.6552 | 0.9471 | 0.7761 | 0.9356
Breast | RI | 0.9099 | – | 0.5475 | 0.8996 | 0.7070 | 0.8794
Seeds | AR | 0.8952 | 0.8857 | 0.9034 | – | 0.7593 | 0.8952
Seeds | RI | 0.8744 | 0.8677 | 0.8843 | – | 0.7683 | 0.8744

Example 9 (Seeds data set). The Seeds data set consists of 210 instances and 7 attributes, divided into three clusters [35]. RL-FCM and R-EM obtain the correct cluster number with AR = 0.8952 and 0.8857, respectively. Using proper parameter selections, RCA and C-FS give three clusters with average AR = 0.9034 and 0.7593, respectively, while OB detects this data set as two clusters. Using FCM with c = 3 gives AR = 0.8952.

We next make more comparisons of the proposed RL-FCM with R-EM, RCA, OB, and C-FS. We implement RL-FCM, R-EM, RCA, OB, and C-FS on the data sets of Examples 1–4 and 7–9 using the different parameter selections required by RCA, OB, and C-FS. We summarize the obtained numbers of clusters from these algorithms in Table 7. Because RCA, OB, and C-FS need some parameter selections, they obtain several possible numbers of clusters that depend on the parameter selections. However, RL-FCM and R-EM, being algorithms with no initialization and no parameter selection, obtain only one number of clusters. Note that, for RCA, the required parameter selections are a time constant, a discarding threshold, and an initial value. For OB, we need to assign initial focal points, an increasing value, and also initial cluster centers, while for C-FS, the selection of a cutoff distance and a decision window are required. Furthermore, we run RCA, OB, and C-FS using 60 different parameter selections on the data sets of Examples 1–4 and 7–9. We then calculate the percentages of runs that obtain the correct number of clusters under these 60 different parameter selections by RCA, OB, and C-FS. These percentages with the correct number of clusters are shown in Table 8. From Tables 7 and 8, we find that the proposed RL-FCM obtains the correct number of clusters and always presents better results than R-EM, RCA, OB, and C-FS.

Furthermore, we make comparisons of average AR when the true number c of clusters is assigned for RL-FCM, FCM, RCA, OB, C-FS, and R-EM. Besides AR, we also consider the Rand index (RI). In 1971, Rand [36] proposed objective criteria for the evaluation of clustering methods, known as RI. Up to now, RI has been popularly used for measuring the similarity between two clustering partitions. Let C be the set of original clusters in a data set and C∗ be the set of clusters obtained by the clustering algorithm. For a pair of points (x_i, x_j), a is the number of pairs in which both points belong to the same cluster in C and in C∗, b is the number of pairs in which both points belong to the same cluster in C and to different clusters in C∗, c is the number of pairs in which both points belong to two different clusters in C and to the same cluster in C∗, and d is the number of pairs in which both points belong to two different clusters in both C and C∗. The RI is defined by RI = (a + d)/(a + b + c + d), and so the larger the RI, the better the clustering performance. The average AR and RI with the true number c of clusters assigned to RL-FCM, FCM, RCA, OB, C-FS, and R-EM are shown in Table 9. From Tables 8 and 9, we find that the proposed RL-FCM always presents better accuracy than these existing clustering algorithms.
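As a small companion to the definition above, the following sketch counts the pair statistics a, b, c, and d over all point pairs and returns RI. It is a direct, unoptimized reading of the formula; integer label vectors are assumed, and the function name is ours.

```python
from itertools import combinations

def rand_index(labels_true, labels_found):
    """RI = (a + d) / (a + b + c + d) over all pairs of points."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_found = labels_found[i] == labels_found[j]
        if same_true and same_found:
            a += 1          # together in both partitions
        elif same_true and not same_found:
            b += 1          # together in C, split in C*
        elif not same_true and same_found:
            c += 1          # split in C, together in C*
        else:
            d += 1          # split in both partitions
    return (a + d) / (a + b + c + d)
```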

In the next example, we compare the capability of RL-FCM, R-EM and the validity indices to find the number of clusters for real data sets that have a larger number of clusters. These are the libra, soybean (large), and letter data sets from the UCI Machine Learning Repository [35].

Example 10. Concerning the capability of RL-FCM to find the number of clusters for real data with a larger number of clusters, in this example we use three real data sets: libra, soybean (large), and letter [35]. Note that, since the soybean data set has missing values, we discard the points with missing values. The original n and true c are 307 and 19, respectively, but after discarding the points with missing values, it has 266 data points with 15 clusters. The data number n, feature dimension d, and true cluster number c are described in Table 10. The numbers of clusters found by RL-FCM, R-EM and the validity indices are also shown in Table 10. We see that it is difficult to find the exact true number of clusters with these methods. However, the RL-FCM algorithm finds numbers of clusters that are always the closest to the true numbers of clusters.

Table 10
Three real data sets with a larger number of clusters.

Data set | n × d | True c | RL-FCM | R-EM | PC | PE | MPC | MPE | FHV | XB
Libra | 360 × 90 | 15 | 13 | 2 | 2 | 2 | 2 | 20 | 2 | 2
Soybean (large) | 266 × 36 | 15 | 13 | 8 | 2 | 2 | 2 | 5 | 4 | 2
Letter | 20,000 × 17 | 26 | 22 | 1 | 2 | 2 | 3 | 3 | 2 | 2

We also consider the real data sets of USPS handwriting [37], Olivetti face [30], and ovarian cancer [38], which have higher attribute dimensions.

Example 11. In this example, we consider three real data sets with high attribute dimensions. These are USPS handwriting (150 × 256) [37], Olivetti face (100 × 1024) [30], and ovarian cancer (216 × 4000) [38]. We implement RL-FCM, R-EM, PC, PE, MPC, MPE, FHV and XB on the three data sets. The obtained numbers of clusters are shown in Table 11. We find that it becomes more difficult to get the correct number of clusters. The R-EM cannot get the correct number of clusters. For ovarian cancer, all validity indices get the correct number of clusters, but they cannot get the correct number of clusters for USPS handwriting and Olivetti face. The RL-FCM also cannot get the correct number of clusters, but the numbers of clusters obtained by RL-FCM are very close to the true number c of clusters for all three data sets.

Table 11
Cluster numbers obtained by RL-FCM, R-EM, PC, PE, MPC, MPE, FHV and XB.

Data set | n × d | True c | RL-FCM | R-EM | PC | PE | MPC | MPE | FHV | XB
USPS | 150 × 256 | 10 | 8 | 1 | 2 | 2 | 2 | 8 | 2 | 2
Olivetti face | 100 × 1024 | 10 | 9 | 1 | 2 | 2 | 2 | 15 | 2 | 2
Ovarian cancer | 216 × 4000 | 2 | 3 | 1 | 2 | 2 | 2 | 2 | 2 | 2

Besides synthetic and real data sets, some images are used in the following experiments.

Example 12 (MRI image). In this example, we consider the ophthalmology MRI image taken from Yang et al. [39]. This MRI image is from a 2-year-old patient. She was diagnosed with retinoblastoma in her left eye, an inborn malignant neoplasm of the retina with frequent metastasis beyond the lamina cribrosa. The CT scan image showed a large tumor with calcification occupying the vitreous cavity of her left eye. The first MR image was acquired with its grayscale image of 400 × 286 = 114,400 pixels. This MR image showed an intra-muscle cone tumor mass with high T1-weighted image signals and low T2-weighted image signals noted in the left eyeball. The tumor measured 20 mm in diameter and occupied nearly the whole vitreous cavity. Since a shady signal abnormality along the optic nerve to the level of the optic chiasm toward the brain was suspected, the second MR image of the brain, with 283 × 292 = 82,636 pixels and its window selection image, was acquired and analyzed, as shown in Fig. 14. From the picture, one lesion was clearly seen in the MR image. However, some vague shadows of lesions were suspected with tumor invasion. These suspected abnormalities are not easily detectable. For the purpose of detecting these abnormal tissues, a window of the area around the chiasma is selected from the brain MR image, as shown in Fig. 15(a). According to [39], the ophthalmologist recommended that this MRI be categorized into three groups: connective tissue, nervous tissue, and tumor tissue. Using the RL-FCM algorithm, we also get three clusters with c∗ = 3 in 44 iterations, as shown in Fig. 15(b). We can see occult lesions clearly enhanced by the proposed RL-FCM algorithm (marked by a circle).

Fig. 14. The brain MR image with its window selection image.

Fig. 15. (a) Window selected MR image; (b) Clustering with the RL-FCM.

Example 13 (Lena image). In this example, we use the Lena image with 256 × 256 pixels from the Matlab database, as shown in Fig. 16(a). Using the RL-FCM algorithm, we get seven clusters with c∗ = 7 in 117 iterations, as shown in Fig. 16(c). We can see that the RL-FCM algorithm gives clear segmentation results. From the histogram, as shown in Fig. 16(b), it can be seen that the number of peaks is more than six. We also use the cluster validity indices of PC, PE, MPC, MPE, FHV and XB to find the best number of clusters. The charts of the PC, PE, MPC, MPE, FHV, and XB validity indices are shown in Fig. 17, where the best numbers of clusters c∗ from PC, PE, MPC, MPE, FHV and XB are 2, 2, 98, 100, 4, and 5, respectively.

Example 14 (Peppers image). In this example, we use the peppers image with 384 × 512 pixels from the Matlab database, as shown in Fig. 18(a). Using the RL-FCM algorithm, we get six clusters with c∗ = 6 in 132 iterations, as shown in Fig. 18(c). The results of the cluster validity indices PC, PE, MPC, MPE, FHV and XB are 2, 2, 100, 100, 17, and 15, respectively. We can see that the RL-FCM algorithm gives clear segmentation results. From the histogram, as shown in Fig. 18(b), it can be seen that the number of peaks is around five to six.
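For the image experiments above, each pixel must first be turned into a feature vector before a c-means-type algorithm can be applied, and the resulting cluster labels folded back into a label map. The sketch below shows one common way to do this with intensity (or RGB) pixel features; the choice of features is our assumption and is not specified in the paper, and the `rl_fcm` call refers to the sketch given earlier.

```python
import numpy as np

def image_to_features(img):
    """Flatten an H x W (grayscale) or H x W x 3 (RGB) image into an (n, d) feature matrix."""
    if img.ndim == 2:
        return img.reshape(-1, 1).astype(float)        # one intensity feature per pixel
    return img.reshape(-1, img.shape[2]).astype(float)

def labels_to_segmentation(labels, spatial_shape):
    """Fold a length-n label vector back into an image of the given spatial shape."""
    return labels.reshape(spatial_shape)

# Usage sketch: cluster pixel features with RL-FCM and view the hard segmentation.
# U, V, alpha = rl_fcm(image_to_features(img))
# seg = labels_to_segmentation(U.argmax(axis=1), img.shape[:2])
```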

Fig. 16. (a) Original Lena image; (b) Histogram of (a); (c) Clustering using RL-FCM.

Fig. 17. Plot of validity indexes for Fig. 16(a): (a) PC; (b) PE; (c) MPC; (d) MPE; (e) FHV; (f) XB.

Fig. 18. (a) Original peppers image; (b) Histogram of (a); (c) Clustering using RL-FCM.

Fig. 19. Plots of per iteration time as implementing RL-FCM for different data sets: (a) Gaussian mixture model of Example 1; (b) Iris data set; (c) Peppers image.

Finally, we would like to demonstrate that, although the algorithm uses the number of data points as the number of clusters (i.e., c = n) in the beginning iteration, the iteration time decreases rapidly after several iterations. This occurs because the clusters with α_k ≤ 1/n are discarded during the iterations, so that the number of clusters c decreases rapidly. To demonstrate this phenomenon, we show the running time in seconds per iteration for the data sets of Example 1 (Gaussian mixture model with c = 3), Example 7 (Iris data set), and Example 14 (Peppers image) in Fig. 19(a)–(c), respectively. As we can see, the running time decreases rapidly after the 10th iteration.

4. Conclusions

In this paper we proposed a new schema with a learning framework for fuzzy clustering algorithms, especially for the fuzzy c-means (FCM) algorithm. We adopted the merit of entropy-type penalty terms for adjusting the bias and for becoming free of the fuzziness index in FCM. We then created the robust-learning FCM (RL-FCM) clustering algorithm. The proposed RL-FCM uses the number of data points as the initial number of clusters to solve the initialization problem. It then discards those clusters whose mixing proportion values are less than one over the number of data points, so that the best cluster number can be found automatically according to the structure of the data. The advantages of RL-FCM are that it is free of initializations and parameters, and that it is robust to different cluster volumes and shapes, noisy points, and outliers, while automatically finding the best number of clusters. The computational complexity of RL-FCM was also analyzed. The main difference in computational time between RL-FCM and FCM lies in the beginning iterations, where the number of data points is assigned as the initial number of clusters. However, in general, the iteration time of RL-FCM decreases rapidly after several iterations. On the other hand, for very large data sets, we may consider using a grid base to divide the dimensions of the feature space into grids and then choosing only one data point in each grid as an initial cluster center, so that the computational time of RL-FCM can be reduced.

Several numerical data and real data sets, together with MRI and image segmentation, are used to show these good aspects of RL-FCM. Experimental results and comparisons demonstrated the effectiveness and superiority of the proposed RL-FCM algorithm. Beyond the experiments in this paper, the RL-FCM algorithm could be applied to text mining, face recognition, marketing segmentation and gene expression. As a whole, the proposed RL-FCM is an effective and useful robust learning-based clustering algorithm. Although the RL-FCM algorithm actually improves the performance of fuzzy clustering algorithms, especially by being capable of obtaining the number of clusters, it still has limitations in handling high-dimensional data sets. We think that, for high-dimensional data, a suitable schema with feature selection should be considered. For our future work, we will consider data with high dimensions by building a new feature selection procedure into the RL-FCM algorithm.

Acknowledgments

The authors would like to thank the anonymous referees for their helpful comments in improving the presentation of this paper. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 105-2118-M-033-004-MY2.

References

[1] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[2] G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York, 1988.
[3] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. Series B 39 (1977) 1–38.
[4] S.E. Schaeffer, Graph clustering, Comput. Sci. Rev. 1 (2007) 27–64.
[5] C.C. Aggarwal, J.L. Wolf, P.S. Yu, Method for targeted advertising on the web based on accumulated self-learning data, clustering users and semantic node graph techniques, U.S. Patent No. 6714975 (2004).
[6] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, University of California Press, 1967, pp. 281–297.
[7] D. Pelleg, A. Moore, X-means: extending K-means with efficient estimation of the number of clusters, in: Proceedings of the 17th International Conference on Machine Learning, San Francisco, 2000, pp. 727–734.
[8] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[9] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[10] M.S. Yang, A survey of fuzzy clustering, Math. Comput. Model. 18 (1993) 1–16.
[11] A. Baraldi, P. Blonda, A survey of fuzzy clustering algorithms for pattern recognition, Part I and II, IEEE Trans. Syst. Man Cybern. Part B 29 (1999) 778–801.
[12] F. Hoppner, F. Klawonn, R. Kruse, T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, Wiley, New York, 1999.
[13] E. Ruspini, A new approach to clustering, Inf. Control 15 (1969) 22–32.
[14] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters, J. Cybern. 3 (1974) 32–57.
[15] D.E. Gustafson, W.C. Kessel, Fuzzy clustering with a fuzzy covariance matrix, in: Proceedings of IEEE CDC, California, 1979, pp. 761–766.
[16] N.B. Karayiannis, MECA: maximum entropy clustering algorithm, in: Proceedings of the IEEE International Conference on Fuzzy Systems, 1, Orlando, FL, 1994, pp. 630–635.
[17] S. Miyamoto, K. Umayahara, Fuzzy clustering by quadratic regularization, in: Proceedings of the 7th IEEE International Conference on Fuzzy Systems, 2, Piscataway, NJ, 1998, pp. 1394–1399.
[18] C. Wei, C. Fahn, The multisynapse neural network and its application to fuzzy clustering, IEEE Trans. Neural Netw. 13 (2002) 600–618.
[19] R.J. Hathaway, J.C. Bezdek, Y. Hu, Generalized fuzzy c-means clustering strategies using Lp norm distances, IEEE Trans. Fuzzy Syst. 8 (2000) 576–582.
[20] M.S. Yang, K.L. Wu, J.N. Hsieh, J. Yu, Alpha-cut implemented fuzzy clustering algorithms and switching regressions, IEEE Trans. Syst. Man Cybern. Part B 38 (2008) 588–603.
[21] T.C. Havens, J.C. Bezdek, C. Leckie, L.O. Hall, M. Palaniswami, Fuzzy c-means algorithms for very large data, IEEE Trans. Fuzzy Syst. 20 (2012) 1130–1146.
[22] H. Izakian, W. Pedrycz, I. Jamal, Clustering spatiotemporal data: an augmented fuzzy c-means, IEEE Trans. Fuzzy Syst. 21 (2013) 855–868.

[23] J.C. Bezdek, Numerical taxonomy with fuzzy sets, J. Math. Biol. 1 (1974) 57–71.
[24] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybern. 3 (1974) 58–73.
[25] M. Roubens, Pattern classification problems with fuzzy sets, Fuzzy Sets Syst. 1 (1978) 239–253.
[26] R.N. Dave, Validating fuzzy partitions obtained through c-shells clustering, Pattern Recognit. Lett. 17 (1996) 613–623.
[27] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 773–781.
[28] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 841–847.
[29] H. Frigui, R. Krishnapuram, A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 21 (6) (1999) 450–465.
[30] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[31] P. Fazendeiro, J.V. de Oliveira, Observer-biased fuzzy clustering, IEEE Trans. Fuzzy Syst. 23 (2015) 85–97.
[32] D. Dembélé, P. Kastner, Fuzzy c-means method for clustering microarray data, Bioinformatics 19 (2003) 973–980.
[33] V. Schwämmle, O.N. Jensen, A simple and fast method to determine the parameters for fuzzy c-means cluster analysis, Bioinformatics 26 (2010) 2841–2848.
[34] M.S. Yang, C.Y. Lai, C.Y. Lin, A robust EM clustering algorithm for Gaussian mixture models, Pattern Recognit. 45 (2012) 3950–3961.
[35] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, a huge collection of artificial and real-world data sets, 1998.
[36] W.M. Rand, Objective criteria for the evaluation of clustering methods, J. Amer. Stat. Assoc. 66 (1971) 846–850.
[37] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 550–554.
[38] T.P. Conrads, M. Zhou, E.F. Petricoin III, L. Liotta, T.D. Veenstra, Cancer diagnosis using proteomic patterns, Expert Rev. Mol. Diagn. 3 (2003) 411–420.
[39] M.S. Yang, Y.J. Hu, K.C.R. Lin, C.C.L. Lin, Segmentation techniques for tissue differentiation in MRI of ophthalmology using fuzzy clustering algorithms, Magn. Reson. Imaging 20 (2002) 173–179.

Miin-Shen Yang received the BS degree in mathematics from Chung Yuan Christian University, Chung-Li, Taiwan, in 1977, the MS degree in applied mathematics from National Chiao-Tung University, Hsinchu, Taiwan, in 1980, and the PhD degree in statistics from the University of South Carolina, Columbia, USA, in 1989.

In 1989, he joined the faculty of the Department of Mathematics at Chung Yuan Christian University (CYCU) as an Associate Professor, where, since 1994, he has been a Professor. From 1997 to 1998, he was a Visiting Professor with the Department of Industrial Engineering, University of Washington, Seattle. During 2001–2005, he was the Chairman of the Department of Applied Mathematics at CYCU. His current research interests include clustering, pattern recognition, machine learning, and neural fuzzy systems.

Dr. Yang was an Associate Editor of the IEEE Transactions on Fuzzy Systems (2005–2011), and is an Associate Editor of Applied Computational Intelligence & Soft Computing and Editor-in-Chief of Advances in Computational Research. He received the 2008 Outstanding Associate Editor award of the IEEE Transactions on Fuzzy Systems, IEEE; the 2009 Outstanding Research Award of CYCU; the 2012–2018 Distinguished Professorship of CYCU; and the 2016 Outstanding Research Award of CYCU.

Yessica Nataliani received the BS degree in mathematics and the MS degree in computer science from Gadjah Mada University, Yogyakarta, Indonesia, in 2004 and 2006, respectively.

She is currently a Ph.D. student at the Department of Applied Mathematics, Chung Yuan Christian University, Taiwan. Her research interests include cluster analysis and pattern recognition.
