Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Article history: Received 27 August 2016; revised 12 May 2017; accepted 20 May 2017; available online 22 May 2017.

Keywords: Fuzzy clustering; Fuzzy c-means (FCM); Robust learning-based schema; Number of clusters; Entropy penalty terms; Robust-learning FCM (RL-FCM)

Abstract

In fuzzy clustering, the fuzzy c-means (FCM) algorithm is the most commonly used clustering method. Various extensions of FCM have been proposed in the literature. However, the FCM algorithm and its extensions are usually affected by initializations and parameter selection, with the number of clusters given a priori. Although some works have addressed these problems in FCM, there is no work that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index, and without a given number of clusters. In this paper, we construct a robust learning-based FCM framework, called the robust-learning FCM (RL-FCM) algorithm, so that it becomes free of the fuzziness index m and of initializations, requires no parameter selection, and can also automatically find the best number of clusters. We first use entropy-type penalty terms for adjusting the bias while remaining free of the fuzziness index, and then create a robust learning-based schema for finding the best number of clusters. The computational complexity of the proposed RL-FCM algorithm is also analyzed. Comparisons between RL-FCM and other existing methods are made. Experimental results and comparisons demonstrate these good aspects of the proposed RL-FCM, which exhibits three robust characteristics: 1) robustness to initializations together with freedom from the fuzziness index, 2) robustness to (i.e., absence of) parameter selection, and 3) robustness to the number of clusters (i.e., an unknown number of clusters).

∗ Corresponding author. E-mail address: [email protected] (M.-S. Yang).
http://dx.doi.org/10.1016/j.patcog.2017.05.017
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction

Clustering is a useful tool for data analysis. It is a method for finding groups within data with the most similarity in the same cluster and the most dissimilarity between different clusters. Hierarchical clustering is regarded as the earliest clustering method, used by biologists and social scientists. Afterwards, cluster analysis became a branch of statistical multivariate analysis [1]. It is also an approach to unsupervised learning and one of the major techniques in pattern recognition and machine learning. From a statistical point of view, clustering methods may be divided into probability model-based approaches and nonparametric approaches. A probability model-based approach assumes that the data set follows a mixture model of probability distributions, so that a mixture likelihood approach to clustering is used [2], where the expectation and maximization (EM) algorithm [3] is the most popular. For a nonparametric approach, clustering methods may be based on an objective function of similarity or dissimilarity measures, where partitional methods are the most used. Graph clustering has also been developed and discussed in the literature [4]; for example, clustering based on similarity of user behavior for targeted advertising was investigated in Aggarwal et al. [5].

Partitional clustering methods suppose that the data set can be represented by finite cluster prototypes with their own objective functions. Therefore, defining the dissimilarity (or distance) between data points and cluster prototypes is essential for partitional methods. It is known that the k-means (or hard c-means) algorithm is the oldest and most popular partitional method [6]. For an efficient estimation of the number of clusters, Pelleg and Moore [7] extended k-means to X-means, which makes local decisions for cluster centers in each iteration of k-means by splitting them to get a better clustering. Users only need to specify a range of cluster numbers in which the true cluster number reasonably lies, and then a model selection criterion, such as the Bayesian information criterion (BIC) or Akaike information criterion (AIC), is used to drive the splitting process. Although the k-means and X-means algorithms are widely used, these crisp clustering methods restrict each data point to belong to exactly one cluster with crisp cluster memberships, so that they fit well for sharp boundaries between clusters in data, but not for unsharp (or vague) boundaries.
Since Zadeh [8] proposed fuzzy sets, which introduced the idea of partial memberships described by membership functions, they have been successfully applied in clustering. Fuzzy clustering has been widely studied and applied in a variety of substantive areas for more than 45 years [9–12], since Ruspini [13] first proposed fuzzy c-partitions as a fuzzy approach to clustering in the 1970s. In fuzzy clustering, the fuzzy c-means (FCM) clustering algorithm proposed by Dunn [14] and Bezdek [9] is the most well-known and used method. There are many extensions and variants of FCM proposed in the literature. The first important extension to FCM was proposed by Gustafson and Kessel (GK) [15], in which the Euclidean distance in the FCM objective function was replaced by the Mahalanobis distance. Afterwards, many further extensions to FCM appeared, such as extensions to maximum-entropy clustering (MEC) by Karayiannis [16], Miyamoto and Umayahara [17] and Wei and Fahn [18], extensions to Lp norms by Hathaway et al. [19], the extension of FCM to alpha-cut implemented fuzzy clustering algorithms by Yang et al. [20], the extension of FCM for treating very large data by Havens et al. [21], an augmented FCM for clustering spatiotemporal data by Izakian et al. [22], and so forth. However, these fuzzy clustering algorithms always need the number of clusters to be given a priori. In general, the cluster number c is unknown. In this case, validity indices can be used to find a cluster number c, where they are supposed to be independent of clustering algorithms. Many cluster validity indices for fuzzy clustering algorithms have been proposed in the literature, such as the partition coefficient (PC) [23], the partition entropy (PE) [24], normalizations of PC and PE [25–26], the fuzzy hypervolume (FHV) [27] and XB (Xie and Beni [28]).

Frigui and Krishnapuram [29] proposed the robust competitive agglomerative (RCA) algorithm by adding a loss function of clusters and a weight function of data points to clusters. The RCA algorithm can be used for determining a cluster number. Starting with a large cluster number, RCA reduces the number by discarding clusters with low cardinality. Some initial parameter values are needed in RCA, such as a time constant, a discarding threshold, a tuning factor, etc. Another clustering algorithm was presented by Rodriguez and Laio [30] for clustering by fast search, called C-FS, using a similarity matrix for finding density peaks. They proposed the C-FS algorithm by assigning a cutoff distance dc and selecting a decision window so that it can automatically determine a number of clusters. In [30], the cutoff distance dc becomes another parameter, and clustering results are heavily dependent on this cutoff parameter dc. Recently, Fazendeiro and Oliveira [31] presented a fuzzy clustering algorithm with an unknown number of clusters based on an observer position, called a focal point. With this point, the observer can select a suitable point while searching for clusters that is actually appropriate to the underlying data structure. After the focal point is chosen, the initialization of cluster centers must be generated randomly. The inverse of the XB index is used to compute the validity measure, and its maximal value is chosen to get the best number of clusters. Although these algorithms can find a number of clusters during their iteration procedures, they are still dependent on initializations and parameter selections.

Up to now, there has been no work in the literature that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index, and without a given number of clusters. We think that this may be due to the difficulty of constructing this kind of robust FCM. In this paper, we try to construct a robust learning-based framework for fuzzy clustering, especially for the FCM algorithm. This framework can automatically find the best number of clusters, without any initialization and parameter selection, and it is also free of the fuzziness index m. We first consider some entropy-type penalty terms for adjusting the bias, and then create a robust-learning mechanism for finding the best number of clusters. The organization of this paper is as follows. In Section 2, we construct a robust learning-based framework for fuzzy clustering. The robust-learning FCM (RL-FCM) clustering algorithm is also presented in this section. In Section 3, several experimental examples and comparisons with numeric and real data sets are provided to demonstrate the effectiveness of the proposed RL-FCM, which can automatically find the best number of clusters. Finally, conclusions are stated in Section 4.

2. Robust-learning fuzzy c-means clustering algorithm

Let $X = \{x_1, \ldots, x_n\}$ be a data set in a d-dimensional Euclidean space $\mathbb{R}^d$ and $V = \{v_1, \ldots, v_c\}$ be the c cluster centers, with the squared Euclidean norm denoted by $d_{ik}^2 = \|x_i - v_k\|^2 = \sum_{j=1}^{d}(x_{ij} - v_{kj})^2$. The fuzzy c-means (FCM) objective function [9–10] is given by $J_m(U, V) = \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}^m d_{ik}^2$, where $m > 1$ is the fuzziness index, $U = [\mu_{ik}]_{n\times c} \in M_{fcn}$ is a fuzzy partition matrix with $M_{fcn} = \{U = [\mu_{ik}]_{n\times c} \mid \forall i, \forall k,\ 0 \le \mu_{ik} \le 1,\ \sum_{k=1}^{c}\mu_{ik} = 1,\ 0 < \sum_{i=1}^{n}\mu_{ik} < n\}$, and $d_{ik} = \|x_i - v_k\|$ is the Euclidean distance. The FCM algorithm is iterated through the necessary conditions for minimizing $J_m(U, V)$, with the updating equations for cluster centers and memberships given by $v_k = \sum_{i=1}^{n}\mu_{ik}^m x_i \big/ \sum_{i=1}^{n}\mu_{ik}^m$ and $\mu_{ik} = (d_{ik})^{-2/(m-1)} \big/ \sum_{t=1}^{c}(d_{it})^{-2/(m-1)}$.

We know that the FCM algorithm is dependent on initial values, and some parameters need to be given a priori, such as the fuzziness index m, the cluster center initialization, and also the number of clusters. Although there exist some works in the literature to solve some of these problems in FCM, such as Dembélé and Kastner [32] and Schwämmle and Jensen [33] on estimating the fuzziness index m for clustering microarray data, there is no work that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index m, and without a given number of clusters. Next, we construct a robust learning-based schema for FCM to simultaneously solve these problems. Our basic idea is that we first consider all data points as initial cluster centers, i.e., the number of data points is the initial number of clusters. After that, we use the mixing proportion $\alpha_k$ of cluster k, which acts like a cluster weight, and discard those clusters whose $\alpha_k$ values are less than one over the number of data points. The proposed algorithm can then iteratively obtain the best number of clusters until it converges.

For a data set $X = \{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$ with c cluster centers, to make FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index m, and able to automatically find the best number of clusters, we add several entropy terms to the FCM objective function. First, to construct an algorithm free of the fuzziness index m, we replace m by adding an extra term that is a function of $\mu_{ik}$. In this sense, we consider the concept of MEC [16–18] by adding the entropy term of memberships $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}$. Moreover, we use a learning function r, i.e. $r\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}$, to learn the effects of the entropy term for adjusting the bias. We next use the mixing proportions $\alpha = (\alpha_1, \cdots, \alpha_c)$ of the clusters, where $\alpha_k$ represents the probability of one data point belonging to the kth cluster, with the constraint $\sum_{k=1}^{c}\alpha_k = 1$. Hence, $-\ln\alpha_k$ is the information in the occurrence of a data point belonging to the kth cluster. Thus, we add the entropy term $\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_k$ to summarize the average information for the occurrence of a data point belonging to the corresponding cluster over fuzzy memberships. Furthermore, we borrow the idea of Yang et al. [34] in the EM algorithm by using the entropy term $\sum_{k=1}^{c}\alpha_k\ln\alpha_k$ to represent the average information for the occurrence of each data point belonging to the corresponding cluster. In total, the entropy terms of the mixing proportions in probability and of the average occurrence in probability over fuzzy memberships are used for learning to find the best number of clusters.
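For reference, the standard FCM iteration summarized above can be written compactly in NumPy. The following is only a minimal sketch of the baseline FCM that RL-FCM modifies: the number of clusters c, the fuzziness index m, and the initial centers must all be supplied by the user, which is exactly the dependence the proposed method aims to remove (function and variable names are ours, for illustration only).

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Baseline fuzzy c-means: alternate the membership and center updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = X[rng.choice(n, size=c, replace=False)]          # random initial centers
    for _ in range(max_iter):
        # squared Euclidean distances d_ik^2 between points and centers
        D2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # membership update: mu_ik = d_ik^{-2/(m-1)} / sum_t d_it^{-2/(m-1)}
        W = D2 ** (-1.0 / (m - 1.0))
        U = W / W.sum(axis=1, keepdims=True)
        # center update: v_k = sum_i mu_ik^m x_i / sum_i mu_ik^m
        Um = U ** m
        V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
        if np.abs(V_new - V).max() < tol:
            V = V_new
            break
        V = V_new
    return U, V
```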
According to the above construction for FCM, we propose the robust-learning FCM (RL-FCM) objective function as follows:

$J(U, \alpha, V) = \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\, d_{ik}^2 - r_1\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_k + r_2\sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik} - r_3\, n\sum_{k=1}^{c}\alpha_k\ln\alpha_k, \qquad (1)$

with $\alpha_k^{(0)} = 1/c = 1/n$, $k = 1, \cdots, c$. There are competitions between these mixing proportions according to Eq. (7). Iteratively, the algorithm can find the final number of clusters c by utilizing the following Eq. (8). When $\alpha_k^{(new)} < 1/n$, we discard the illegitimate mixing proportion $\alpha_k^{(new)}$. Therefore, the updated number of clusters $c^{(new)}$ is

$c^{(new)} = c^{(old)} - \left|\left\{\alpha_k^{(new)} \mid \alpha_k^{(new)} < 1/n,\ k = 1, 2, \ldots, c^{(old)}\right\}\right|, \qquad (8)$

where $|\{\,\cdot\,\}|$ denotes the cardinality of the set $\{\,\cdot\,\}$. After updating the number of clusters c, the remaining mixing proportions $\alpha_k^{*}$ and the corresponding $\mu_{ik}^{*}$ need to be re-normalized by

$\alpha_k^{*} = \alpha_k^{*} \Big/ \sum_{t=1}^{c^{(new)}}\alpha_t^{*}. \qquad (9)$

Moreover,

$\max_{1\le k\le c}\Big(\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}\Big) + r_3\max_{1\le k\le c}\alpha_k^{(old)}\Big(\ln\max_{1\le i\le c}\alpha_i^{(old)} - \sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}\Big) < \max_{1\le k\le c}\Big(\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}\Big) + r_3\Big(-\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}\Big).$

Therefore, if $\max_{1\le k\le c}\big(\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}\big) - r_3\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)} \le 1$, then the restriction holds. It follows that

$r_3 \le \dfrac{1 - \max_{1\le k\le c}\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}}{-\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}}. \qquad (14)$

Combining Eq. (13) and Eq. (14), we obtain

$r_3 = \min\left(\dfrac{\sum_{k=1}^{c}\exp\big(-\eta\, n\,\big|\alpha_k^{(new)} - \alpha_k^{(old)}\big|\big)}{c},\ \dfrac{1 - \max_{1\le k\le c}\tfrac{1}{n}\sum_{i=1}^{n}\mu_{ik}}{-\max_{1\le k\le c}\alpha_k^{(old)}\sum_{t=1}^{c}\alpha_t^{(old)}\ln\alpha_t^{(old)}}\right). \qquad (15)$
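Because the closed-form updating equations (Eqs. (2)–(7)) are not reproduced in the text above, the following NumPy sketch should be read as a reconstruction rather than as the authors' exact algorithm: the membership and center updates are derived here from Eq. (1) by Lagrange multipliers, the mixing-proportion update follows the form suggested by the inequality preceding Eq. (14), the learning rates r1 and r2 use the decreasing schedules e^(−t/10) and e^(−t/100) recommended in Section 3, r3 follows Eq. (15), and η (whose definition belongs to the missing Eq. (13)) is left as a user parameter.

```python
import numpy as np

def rl_fcm(X, eta=1.0, max_iter=100, tol=1e-6):
    """Sketch of RL-FCM reconstructed from Eqs. (1), (8), (9) and (15)."""
    n, d = X.shape
    V = X.copy()                      # every data point is an initial cluster center, so c = n
    alpha = np.full(n, 1.0 / n)       # alpha_k^(0) = 1/c = 1/n
    r3 = 1.0                          # starting value; updated each iteration via Eq. (15)
    for t in range(1, max_iter + 1):
        r1, r2 = np.exp(-t / 10.0), np.exp(-t / 100.0)   # schedules recommended in Section 3
        # squared distances between points and the surviving centers
        D2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        # membership update implied by Eq. (1): mu_ik proportional to alpha_k^(r1/r2) * exp(-d_ik^2 / r2)
        logU = (r1 / r2) * np.log(alpha)[None, :] - D2 / r2
        logU -= logU.max(axis=1, keepdims=True)          # stabilize the exponential
        U = np.exp(logU)
        U /= U.sum(axis=1, keepdims=True)
        # mixing-proportion update in the form suggested by the text before Eq. (14)
        ent = float(np.sum(alpha * np.log(alpha)))
        alpha_new = U.mean(axis=0) + r3 * alpha * (np.log(alpha) - ent)
        # learning rate r3 from Eq. (15)
        r3 = min(float(np.mean(np.exp(-eta * n * np.abs(alpha_new - alpha)))),
                 (1.0 - U.mean(axis=0).max()) / (-alpha.max() * ent + 1e-12))
        # Eq. (8): discard clusters whose mixing proportion falls below 1/n, then re-normalize (Eq. (9))
        keep = alpha_new >= 1.0 / n
        alpha = alpha_new[keep] / alpha_new[keep].sum()
        U = U[:, keep]
        U /= U.sum(axis=1, keepdims=True)
        # center update from Eq. (1): v_k = sum_i mu_ik x_i / sum_i mu_ik
        V_new = (U.T @ X) / U.sum(axis=0)[:, None]
        if V_new.shape == V.shape and np.abs(V_new - V).max() < tol:
            V = V_new
            break
        V = V_new
    return U, V, alpha
```

In this sketch the cluster count collapses from c = n toward a data-driven number of clusters, since clusters with small mixing proportions are pruned at every iteration.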
Fig. 2. (a) 3-cluster dataset; (b) Clustering result after 2 iterations with 289 clusters; (c) Clustering result after 5 iterations with 190 clusters; (d) Clustering result after 10
iterations with 78 clusters; (e) Clustering result after 15 iterations with 10 clusters; (f) Final result from RL-FCM with 3 clusters.
3. Experimental results and comparisons

In the first iterations, the number of clusters decreases rapidly from 500 to 289, as shown in Fig. 2(b). The RL-FCM algorithm decreases the number of clusters to 190, 78, and 10 clusters after 5, 10, and 15 iterations (see Fig. 2(c)–(e)), respectively. Finally, after 22 iterations (see Fig. 2(f)), the RL-FCM algorithm obtains its convergence, where three clusters are formed with c* = 3 and AR = 1.00.

Table 1
Validity index values of PC, PE, MPC, MPE, FHV and XB for the data set in Fig. 3(a).

c    PC       PE       MPC      MPE      FHV       XB
2    0.7497   0.4013   0.4994   0.4210   13.3712   0.1482
3    0.6414   0.6335   0.4621   0.4234   13.3343   0.2399
4    0.7219   0.5572   0.6292   0.5980    7.6920   0.0990
5    0.7643   0.5202   0.7053   0.6768    5.2695   0.0553
6    0.7280   0.6134   0.6736   0.6577    5.6192   0.1313
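To connect this narrative with the sketch given in Section 2, a hypothetical run of the rl_fcm sketch on a three-component Gaussian mixture in the spirit of Example 1 could look as follows; the means, spreads, and sample sizes below are invented for illustration only and are not the parameters used in the paper.

```python
import numpy as np

# Synthetic 3-cluster data in the spirit of Example 1 (made-up parameters).
rng = np.random.default_rng(1)
means = [(0.0, 0.0), (4.0, 4.0), (8.0, 0.0)]
X = np.vstack([rng.normal(m, 0.7, size=(167, 2)) for m in means])   # about 500 points

U, V, alpha = rl_fcm(X)    # starts from c = n clusters and prunes them
print("final number of clusters:", V.shape[0])
print("final mixing proportions:", np.round(alpha, 3))
```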
Fig. 3. (a) 5-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 5.
Fig. 4. (a) 5-cluster dataset with noisy points; (b) Final result from RL-FCM.
Fig. 6. (a) 13-cluster dataset with 100 noisy points; (b) Final results from RL-FCM.
Table 2
Validity index values of PC, PE, MPC, MPE, FHV and XB for the data set
in Fig. 5(a).
Example 3. In this example, we show another data set with 13 blocks generated from a continuous uniform distribution, where each block contains 50 data points, as shown in Fig. 5(a). Using the RL-FCM algorithm, we get 13 clusters after 19 iterations. We find

We continue by adding 100 noisy points to the data set of Fig. 5(a). The noisy data set is shown in Fig. 6(a). The effectiveness of RL-FCM in handling noise is demonstrated in Fig. 6(b), where we can see that the RL-FCM algorithm still obtains 13 clusters after
Fig. 7. (a) Original data set; (b) Final result from RL-FCM; (c) Final result from FCM with c = 6.
Table 3
Validity index values of PC, PE, MPC, MPE, FHV and XB for the data set in Fig. 7(a).

Table 4
Cluster number results using different learning functions for r1 with r2 = e^(−t/100) (ARs presented inside brackets).
Fig. 8. (a) Original data set; (b) Final result from RL-FCM; (c) Final result from FCM with c = 6.
Fig. 9. (a) 25-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 25.
Fig. 10. (a) 25-cluster data set with closer cluster; (b) Final result from RL-FCM; (c) Final result from FCM with c = 25.
These choices often give the best number of clusters with the smallest error rate. Overall, we recommend the learning functions r1 = e^(−t/10) and r2 = e^(−t/100) as decreasing learning schedules for the parameters r1 and r2, respectively, in the proposed RL-FCM algorithm.

Example 7 (Iris data set). In this example, we use the Iris real data set from the UCI Machine Learning Repository [35], which contains 150 data points with four attributes, i.e., sepal length (SL, in cm), sepal width (SW, in cm), petal length (PL, in cm), and petal width (PW, in cm). The Iris data set originally has three clusters (i.e., setosa, versicolor, and virginica). Using the RL-FCM algorithm, we get three clusters in 23 iterations, as shown in Fig. 13. We find that RL-FCM and R-EM can detect three clusters with c* = 3 and an AR of 0.9067 (or 14 error counts) and 0.5200 (or 72 error counts), respectively. Using FCM with the number of clusters assigned as c = 3 for the Iris data set, we generally get an average AR of 0.8933 (or 16 error counts).
Fig. 11. (a) 17-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 17.
Fig. 12. (a) 21-cluster dataset; (b) Final result from RL-FCM; (c) Final result from FCM with c = 21.
Table 5
Cluster number results using different learning functions for r2 with r1 = e^(−t/10) (ARs presented inside brackets).

                    Ex. 1        Ex. 2        Ex. 3         Ex. 4
True c              3            5            13            6
r2 = e^(−t)         56           49           46            25
r2 = e^(−t/10)      6            4            13 (1.0000)   5
r2 = e^(−t/100)     3 (1.0000)   5 (0.9960)   13 (1.0000)   6 (0.9667)
r2 = e^(−t/1000)    3 (1.0000)   2            13 (1.0000)   3

Table 6
Validity index values of PC, PE, MPC, MPE, FHV and XB for the Iris data.

c    PC       PE       MPC      MPE      FHV      XB
2    0.8920   0.1961   0.7841   0.7172   0.0216   0.0542
3    0.7632   0.3959   0.6748   0.6396   0.0205   0.1371
4    0.7065   0.5617   0.6087   0.5948   0.0209   0.1958
5    0.6654   0.6759   0.5818   0.5800   0.0204   0.2283
6    0.6065   0.8178   0.5278   0.5436   0.0217   0.3273
The validity index values of PC, PE, MPC, MPE, FHV and XB for the Iris data set are shown in Table 6, where all validity indices, except the index FHV with c* = 5, give the best number of clusters as c* = 2. However, no validity index gives three clusters with c* = 3. For comparison, with parameter selections, RCA and OB give the correct cluster number with average AR = 0.9667 and 0.8975, respectively, while C-FS gives two clusters.
Example 8 (Breast data set). The Breast data set consists of 699 instances and 9 attributes, divided into two clusters (benign and malignant) [35]. One attribute with missing values is discarded in this experiment. RL-FCM obtains two clusters with AR = 0.9528, while R-EM obtains seven clusters. Using proper parameter selections, RCA, OB, and C-FS give two clusters with average AR = 0.6552, 0.9471, and 0.7761, respectively. Using FCM with c = 2 gives AR = 0.9356.

Fig. 13. Final result from RL-FCM for the Iris data set.
Table 7
Cluster numbers obtained by RL-FCM, R-EM, RCA, OB, and C-FS using different parameter selections.

Dataset   True c   RL-FCM   R-EM   RCA                      OB           C-FS
Ex. 1     3        3        3      3, 4                     3, 4         3
Ex. 2     5        5        5      3, 4, 5                  5, 6, 7, 8   5
Ex. 3     13       13       15     9, 13, 19, 27, 31, 32    2, 3, 4      2, 3, 4, 5, 13, 14
Ex. 4     6        6        5      4, 5, 6, 7, 8, 9, 10     4, 6         2, 3, 5, 6
Iris      3        3        3      2, 3                     2, 3         2
Breast    2        2        7      2                        2, 3         2
Seeds     3        3        3      2, 3, 4                  2            2, 3, 4
Table 8
Percentages of RCA, OB, and C-FS that obtain the correct cluster number c using 60 different parameter selections.

Dataset   RCA      OB       C-FS
Ex. 1     96.67%   81.67%   100%
Ex. 2     16.67%   73.33%   100%
Ex. 3     1.67%    0%       35%
Ex. 4     15%      25%      40%
Iris      41.67%   13.33%   0%
Breast    1.67%    16.67%   50%
Seeds     68.33%   0%       65%

Table 9
Average AR and RI from RL-FCM, R-EM, RCA, OB, C-FS and FCM using the true number c of clusters.

Dataset         RL-FCM   R-EM     RCA      OB       C-FS     FCM
Ex. 1     AR    1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
          RI    1.0000   1.0000   1.0000   1.0000   1.0000   1.0000
Ex. 2     AR    0.9960   1.0000   0.9980   0.9925   0.9845   0.9960
          RI    0.9967   1.0000   0.9984   0.9940   0.9930   0.9967
Ex. 3     AR    1.0000   –        1.0000   –        0.9136   0.8736
          RI    1.0000   –        1.0000   –        0.9486   0.9795
Ex. 4     AR    0.9667   –        0.8633   0.9664   0.9192   0.8910
          RI    0.9794   –        0.9400   0.9705   0.9558   0.9551
Iris      AR    0.9067   0.5200   0.9667   0.8975   –        0.8933
          RI    0.8923   0.7212   0.9575   0.8836   –        0.8797
Breast    AR    0.9528   –        0.6552   0.9471   0.7761   0.9356
          RI    0.9099   –        0.5475   0.8996   0.7070   0.8794
Seeds     AR    0.8952   0.8857   0.9034   –        0.7593   0.8952
          RI    0.8744   0.8677   0.8843   –        0.7683   0.8744

Example 9 (Seeds data set). The Seeds data set consists of 210 instances and 7 attributes, divided into three clusters [35]. RL-FCM
and R-EM obtain the correct cluster number with AR = 0.8952 and 0.8857, respectively. Using proper parameter selections, RCA and C-FS give three clusters with average AR = 0.9034 and 0.7593, respectively, while OB detects this data set as two clusters. Using FCM with c = 3 gives AR = 0.8952.

We next make more comparisons of the proposed RL-FCM with R-EM, RCA, OB, and C-FS. We implement RL-FCM, R-EM, RCA, OB, and C-FS on the data sets of Examples 1–4 and 7–9 using the different parameter selections that are required by RCA, OB, and C-FS. We summarize the obtained numbers of clusters from these algorithms in Table 7. Because RCA, OB, and C-FS need some parameter selections, they obtain several possible numbers of clusters that depend on the parameter selections. However, RL-FCM and R-EM, because they are algorithms with no initialization and no parameter selection, obtain only one number of clusters. Note that, for RCA, the required parameter selections are a time constant, a discarding threshold, and an initial value. For OB, we need to assign initial focal points, an increasing value, and also initial cluster centers, while for C-FS, the selection of a cutoff distance and a decision window is required. Furthermore, we run RCA, OB, and C-FS using 60 different parameter selections on the data sets of Examples 1–4 and 7–9. We then calculate the percentages of runs that obtain the correct number of clusters under these 60 different parameter selections for RCA, OB, and C-FS. These percentages with the correct number of clusters are shown in Table 8. From Tables 7 and 8, we find that the proposed RL-FCM obtains the correct number of clusters and always presents better results than R-EM, RCA, OB, and C-FS.

Furthermore, we make comparisons of average AR when the true number c of clusters is assigned for RL-FCM, FCM, RCA, OB, C-FS, and R-EM. Besides AR, we also consider the Rand index (RI). In 1971, Rand [36] proposed objective criteria for the evaluation of clustering methods, known as RI. Up to now, RI has been popularly used for measuring the similarity between two clustering partitions. Let C be the set of original clusters in a data set and C∗ be the set of clusters obtained by the clustering algorithm. For a pair of points (x_i, x_j), a is the number of pairs for which both points belong to the same cluster in C and C∗, b is the number of pairs for which both points belong to the same cluster in C and different clusters in C∗, c is the number of pairs for which both points belong to two different clusters in C and the same cluster in C∗, and d is the number of pairs for which both points belong to two different clusters in C and C∗. The RI is defined by RI = (a + d)/(a + b + c + d), and so the larger RI is, the better the clustering performance is. The average AR and RI with the true number c of clusters assigned to RL-FCM, FCM, RCA, OB, C-FS, and R-EM are shown in Table 9. From Tables 8 and 9, we find that the proposed RL-FCM always presents better accuracy than these existing clustering algorithms.

In the next example, we compare the capability of finding the number of clusters by RL-FCM, R-EM and the validity indexes for real data sets that have a higher number of clusters. These are libra, soybean (large) and letter from the UCI Machine Learning Repository [35].

Example 10. Concerning the capability of finding the number of clusters using RL-FCM for real data with a larger number of clusters, in this example we use three real data sets, libra, soybean (large), and letter [35]. Note that, for soybean, since the data set has missing values, we discard the points with missing values. The original n and true c are 307 and 19, respectively, but, after discarding the points with missing values, it has 266 data points with 15 clusters. The data number n, feature dimension d, and true cluster number c are described in Table 10. The resulting numbers of clusters found by RL-FCM, R-EM, and the validity indexes are also shown in Table 10. We see that it is difficult to find exactly the true number of clusters by these methods. However, the RL-FCM algorithm can find numbers of clusters that are always the closest to the true numbers of clusters.

We also consider the real data sets of USPS handwriting [37], Olivetti face [30], and ovarian cancer [38], which have higher attribute dimensions.

Example 11. In this example, we consider three real data sets with high attribute dimensions. These are USPS handwriting (150 × 256) [37], Olivetti face (100 × 1024) [30], and ovarian cancer [38].
Table 10
Three real data sets with more number of clusters.

Data set          n × d          True c   RL-FCM   R-EM   PC   PE   MPC   MPE   FHV   XB
Libra             360 × 90       15       13       2      2    2    2     20    2     2
Soybean (large)   266 × 36       15       13       8      2    2    2     5     4     2
Letter            20,000 × 17    26       22       1      2    2    3     3     2     2

Table 11
Cluster numbers obtained by RL-FCM, R-EM, PC, PE, MPC, MPE, FHV and XB.
Fig. 14. The brain MR image with its window selection image.
Fig. 15. (a) Window selected MR image; (b) Clustering with the RL-FCM.
Fig. 16. (a) Original Lena image; (b) Histogram of (a); (c) Clustering using RL-FCM.
Fig. 17. Plot of validity indexes for Fig. 16(a): (a) PC; (b) PE; (c) MPC; (d) MPE; (e) FHV; (f) XB.
Fig. 18. (a) Original peppers image; (b) Histogram of (a); (c) Clustering using RL-FCM.
Fig. 19. Plots of per-iteration time when implementing RL-FCM on different data sets: (a) Gaussian mixture model of Example 1; (b) Iris data set; (c) Peppers image.
Finally, we would like to demonstrate that, although the algorithm uses the number of data points as the number of clusters (i.e., c = n) in the beginning iteration, the iteration time decreases rapidly after several iterations. This occurs because the clusters with α_k ≤ 1/n are discarded during the iterations, so that the number of clusters c decreases rapidly. To demonstrate this phenomenon, we show the running time in seconds per iteration for the data sets of Example 1 (Gaussian mixture model with c = 3), Example 7 (Iris data set), and Example 14 (Peppers image) in Fig. 19(a)–(c), respectively. As we can see, the running time decreases rapidly after the 10th iteration.

4. Conclusions

In this paper we proposed a new schema with a learning framework for fuzzy clustering algorithms, especially for the fuzzy c-means (FCM) algorithm. We adopted the merit of entropy-type penalty terms for adjusting the bias and also freeing FCM of the fuzziness index. We then created the robust-learning FCM (RL-FCM) clustering algorithm. The proposed RL-FCM uses the number of data points as the initial number of clusters to solve the initialization problem. It then discards those clusters whose mixing proportion values are less than one over the number of data points, such that the best cluster number can be found automatically according to the structure of the data. The advantages of RL-FCM are that it is free of initializations and parameter selection, robust to different cluster volumes and shapes as well as to noisy points and outliers, and automatically finds the best number of clusters. The computational complexity of RL-FCM was also analyzed. The main difference in computational time between RL-FCM and FCM lies in the beginning iteration, where the number of data points is assigned as the initial number of clusters. However, in general, the iteration time for RL-FCM decreases rapidly after several iterations. On the other hand, for very large data sets, we may consider using a grid-based scheme that divides the dimensions of the feature space into grids and then chooses only one data point in each grid as an initial cluster center, so that the computational time of RL-FCM can be reduced.

Several numerical data sets and real data sets with MRI and image segmentation are used to show these good aspects of RL-FCM. Experimental results and comparisons actually demonstrated the effectiveness and superiority of the proposed RL-FCM algorithm. Beyond the experiments in this paper, the RL-FCM algorithm could be applied to text mining, face recognition, marketing segmentation and gene expression. As a whole, the proposed RL-FCM is an effective and useful robust learning-based clustering algorithm. Although the RL-FCM algorithm actually improves the performance of fuzzy clustering algorithms, especially in its capability of obtaining the number of clusters, it still has limitations in handling high-dimensional data sets. We think that, for high-dimensional data, a suitable schema with feature selection should be considered. For our future work, we will consider data with high dimensions by building a new feature selection procedure into the RL-FCM algorithm.

Acknowledgments

The authors would like to thank the anonymous referees for their helpful comments in improving the presentation of this paper. This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 105-2118-M-033-004-MY2.

References

[1] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[2] G.J. McLachlan, K.E. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York, 1988.
[3] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. Series B 39 (1977) 1–38.
[4] S.E. Schaeffer, Graph clustering, Comput. Sci. Rev. 1 (2007) 27–64.
[5] C.C. Aggarwal, J.L. Wolf, P.S. Yu, Method for targeted advertising on the web based on accumulated self-learning data, clustering users and semantic node graph techniques, U.S. Patent No. 6714975 (2004).
[6] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1, University of California Press, 1967, pp. 281–297.
[7] D. Pelleg, A. Moore, X-means: extending K-means with efficient estimation of the number of clusters, in: Proceedings of the 17th International Conference on Machine Learning, San Francisco, 2000, pp. 727–734.
[8] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[9] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[10] M.S. Yang, A survey of fuzzy clustering, Math. Comput. Model. 18 (1993) 1–16.
[11] A. Baraldi, P. Blonda, A survey of fuzzy clustering algorithms for pattern recognition Part I and II, IEEE Trans. Syst. Man Cybern. Part B 29 (1999) 778–801.
[12] F. Hoppner, F. Klawonn, R. Kruse, T. Runkler, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, Wiley, New York, 1999.
[13] E. Ruspini, A new approach to clustering, Inf. Control 15 (1969) 22–32.
[14] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact, well-separated clusters, J. Cybern. 3 (1974) 32–57.
[15] D.E. Gustafson, W.C. Kessel, Fuzzy clustering with a fuzzy covariance matrix, in: Proceedings of IEEE CDC, California, 1979, pp. 761–766.
[16] N.B. Karayiannis, MECA: maximum entropy clustering algorithm, in: Proceedings of the IEEE International Conference on Fuzzy Systems, 1, Orlando, FL, 1994, pp. 630–635.
[17] S. Miyamoto, K. Umayahara, Fuzzy clustering by quadratic regularization, in: Proceedings of the 7th IEEE International Conference on Fuzzy Systems, 2, Piscataway, NJ, 1998, pp. 1394–1399.
[18] C. Wei, C. Fahn, The multisynapse neural network and its application to fuzzy clustering, IEEE Trans. Neural Netw. 13 (2002) 600–618.
[19] R.J. Hathaway, J.C. Bezdek, Y. Hu, Generalized fuzzy c-means clustering strategies using Lp norm distances, IEEE Trans. Fuzzy Syst. 8 (2000) 576–582.
[20] M.S. Yang, K.L. Wu, J.N. Hsieh, J. Yu, Alpha-cut implemented fuzzy clustering algorithms and switching regressions, IEEE Trans. Syst. Man Cybern. Part B 38 (2008) 588–603.
[21] T.C. Havens, J.C. Bezdek, C. Leckie, L.O. Hall, M. Palaniswami, Fuzzy c-means algorithms for very large data, IEEE Trans. Fuzzy Syst. 20 (2012) 1130–1146.
[22] H. Izakian, W. Pedrycz, I. Jamal, Clustering spatiotemporal data: an augmented fuzzy c-means, IEEE Trans. Fuzzy Syst. 21 (2013) 855–868.
[23] J.C. Bezdek, Numerical taxonomy with fuzzy sets, J. Math. Biol. 1 (1974) 57–71.
[24] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybern. 3 (1974) 58–73.
[25] M. Roubens, Pattern classification problems with fuzzy sets, Fuzzy Sets Syst. 1 (1978) 239–253.
[26] R.N. Dave, Validating fuzzy partition obtained through c-shells clustering, Pattern Recognit. Lett. 17 (1996) 613–623.
[27] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 773–781.
[28] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 841–847.
[29] H. Frigui, R. Krishnapuram, A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 21 (6) (1999) 450–465.
[30] A. Rodriguez, A. Laio, Clustering by fast search and find of density peaks, Science 344 (6191) (2014) 1492–1496.
[31] P. Fazendeiro, J.V. de Oliveira, Observer-biased fuzzy clustering, IEEE Trans. Fuzzy Syst. 23 (2015) 85–97.
[32] D. Dembélé, P. Kastner, Fuzzy c-means method for clustering microarray data, Bioinformatics 19 (2003) 973–980.
[33] V. Schwämmle, O.N. Jensen, A simple and fast method to determine the parameters for fuzzy c-means cluster analysis, Bioinformatics 26 (2010) 2841–2848.
[34] M.S. Yang, C.Y. Lai, C.Y. Lin, A robust EM clustering algorithm for Gaussian mixture models, Pattern Recognit. 45 (2012) 3950–3961.
[35] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, a huge collection of artificial and real-world data sets, 1998.
[36] W.M. Rand, Objective criteria for the evaluation of clustering methods, J. Amer. Stat. Assoc. 66 (1971) 846–850.
[37] J.J. Hull, A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 550–554.
[38] T.P. Conrads, M. Zhou, E.F. Petricoin III, L. Liotta, T.D. Veenstra, Cancer diagnosis using proteomic patterns, Expert Rev. Mol. Diagn. 3 (2003) 411–420.
[39] M.S. Yang, Y.J. Hu, K.C.R. Lin, C.C.L. Lin, Segmentation techniques for tissue differentiation in MRI of ophthalmology using fuzzy clustering algorithms, Magn. Reson. Imaging 20 (2002) 173–179.
Miin-Shen Yang received the BS degree in mathematics from the Chung Yuan Christian University, Chung-Li, Taiwan, in 1977, the MS degree in applied mathematics from
the National Chiao-Tung University, Hsinchu, Taiwan, in 1980, and the PhD degree in statistics from the University of South Carolina, Columbia, USA, in 1989.
In 1989, he joined the faculty of the Department of Mathematics in the Chung Yuan Christian University (CYCU) as an Associate Professor, where, since 1994, he has been
a Professor. From 1997 to 1998, he was a Visiting Professor with the Department of Industrial Engineering, University of Washington, Seattle. During 2001–2005, he was
the Chairman of the Department of Applied Mathematics in CYCU. His current research interests include clustering, pattern recognition, machine learning, and neural fuzzy
systems.
Dr. Yang was an Associate Editor of the IEEE Transactions on Fuzzy Systems (2005–2011), and is an Associate Editor of the Applied Computational Intelligence & Soft
Computing and Editor-in-Chief of Advances in Computational Research. He received the 2008 Outstanding Associate Editor award of the IEEE Transactions on Fuzzy Systems, the 2009 Outstanding Research Award of CYCU, the 2012–2018 Distinguished Professorship of CYCU, and the 2016 Outstanding Research Award of CYCU.
Yessica Nataliani received the BS degree in mathematics and the MS degree in computer science from the Gadjah Mada University, Yogyakarta, Indonesia, in 2004 and 2006,
respectively.
She is currently a Ph.D. student at the Department of Applied Mathematics in the Chung Yuan Christian University, Taiwan. Her research interests include cluster analysis and pattern recognition.