Survey of Clustering Methods
1. Introduction
Data clustering is the process of identifying natural groupings or clusters within
multidimensional data based on some similarity measure (e.g. Euclidean distance)
[Jain et al. 1999; Jain et al. 2000]. It is an important process in pattern recognition and
machine learning [Hamerly and Elkan 2002]. Furthermore, data clustering is a central
process in Artificial Intelligence (AI) [Hamerly 2003]. Clustering algorithms are used
in many applications, such as image segmentation [Coleman and Andrews 1979; Jain
and Dubes 1988; Turi 2001], vector and color image quantization [Kaukoranta et al.
1998; Baek et al. 1998; Xiang 1997], data mining [Judd et al. 1998], compression
[Abbas and Fahmy 1994], machine learning [Carpineto and Romano 1996], etc. A
cluster is usually identified by a cluster center (or centroid) [Lee and Antonsson
2000]. Data clustering is a difficult problem in unsupervised pattern recognition as the
clusters in data may have different shapes and sizes [Jain et al. 2000].
Due to the vast amount of research conducted in the area of clustering, a
survey investigating representative state-of-the-art clustering methods is valuable.
The purpose of this paper is to provide such a survey. Since addressing all clustering
methods in one paper is not feasible, the paper focuses on an overview of a set of
representative clustering methods.
The remainder of this paper is organized as follows: Section 2 provides
background material. Section 3 surveys different clustering techniques. Several
clustering validation techniques are presented in Section 4. Methods for determining
the number of clusters in a data set are given in Section 5. Section 6 provides a brief
introduction to the use of Self-Organizing Maps for clustering. Clustering using
stochastic techniques is investigated in Section 7. Finally, Section 8 concludes the
paper.
2. Background
This section defines the terms used throughout the paper and provides the reader
with the necessary background material to follow the discussion in the paper.
2.1. Definitions
The following terms are used in this paper:
• A pattern (or feature vector), z, is a single object or data point used by the
clustering algorithm [Jain et al. 1999].
• A feature (or attribute) is an individual component of a pattern [Jain et al.
1999].
• A cluster is a set of similar patterns, and patterns from different clusters
are not similar [Everitt 1974].
• Hard (or Crisp) clustering algorithms assign each pattern to one and only
one cluster.
• Fuzzy clustering algorithms assign each pattern to each cluster with some
degree of membership.
• A distance measure is a metric used to evaluate the similarity of patterns
[Jain et al. 1999].
The clustering problem can be formally defined as follows [Veenman et al. 2003]:
Given a data set Z = {z_1, z_2, …, z_p, …, z_{N_p}}, where z_p is a pattern in the N_d-
dimensional feature space, and N_p is the number of patterns in Z, then the clustering
of Z is the partitioning of Z into K clusters {C_1, C_2, …, C_K} satisfying the following
conditions:
• Each pattern should be assigned to a cluster, i.e.
  ∪_{k=1}^{K} C_k = Z
• Each cluster has at least one pattern assigned to it, i.e.
  C_k ≠ ∅, k = 1, …, K
• Each pattern is assigned to one and only one cluster (in the case of hard
  clustering only), i.e.
  C_k ∩ C_kk = ∅, where k ≠ kk
2.2. Similarity Measures
As previously mentioned, clustering is the process of identifying natural
groupings or clusters within multidimensional data based on some similarity measure.
Hence, similarity measures are fundamental components in most clustering algorithms
[Jain et al. 1999].
The most popular way to evaluate a similarity measure is the use of distance
measures. The most widely used distance measure is the Euclidean distance defined
as
d(z_u, z_w) = √( Σ_{j=1}^{N_d} (z_{u,j} − z_{w,j})^2 ) = ||z_u − z_w||     (1)
Euclidean distance is a special case (when α = 2) of the Minkowski metric [Jain et al.
1999] defined as
d_α(z_u, z_w) = ( Σ_{j=1}^{N_d} |z_{u,j} − z_{w,j}|^α )^{1/α} = ||z_u − z_w||_α     (2)
Another similarity measure is the cosine (normalized inner product) measure, defined as

⟨z_u, z_w⟩ = ( Σ_{j=1}^{N_d} z_{u,j} z_{w,j} ) / ( ||z_u|| ||z_w|| )     (3)

Another popular measure is the Mahalanobis distance, defined as

d_M(z_u, z_w) = (z_u − z_w) Σ^{−1} (z_u − z_w)^T     (4)
where Σ is the covariance matrix of the patterns. The Mahalanobis distance gives
different features different weights based on their variances and pairwise linear
correlations. Thus, this metric implicitly assumes that the densities of the classes are
multivariate Gaussian [Jain et al. 1999].
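To make these measures concrete, the following Python sketch (an illustrative implementation using NumPy, not code from any of the cited works) computes the four measures of Eqs. (1)-(4) for a pair of patterns:

import numpy as np

def euclidean(zu, zw):
    # Eq. (1): special case of the Minkowski metric with alpha = 2
    return np.sqrt(np.sum((zu - zw) ** 2))

def minkowski(zu, zw, alpha=2):
    # Eq. (2): order-alpha Minkowski metric
    return np.sum(np.abs(zu - zw) ** alpha) ** (1.0 / alpha)

def cosine_similarity(zu, zw):
    # Eq. (3): normalized inner product of the two patterns
    return np.dot(zu, zw) / (np.linalg.norm(zu) * np.linalg.norm(zw))

def mahalanobis(zu, zw, cov):
    # Eq. (4): weights features by their variances and pairwise linear correlations
    diff = zu - zw
    return float(diff @ np.linalg.inv(cov) @ diff)

# Toy usage on a data set Z of Np patterns in Nd dimensions
Z = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0]])
print(euclidean(Z[0], Z[1]))
print(minkowski(Z[0], Z[1], alpha=1))   # alpha = 1 gives the city-block distance
print(cosine_similarity(Z[0], Z[1]))
print(mahalanobis(Z[0], Z[1], np.cov(Z, rowvar=False)))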
3. Clustering Techniques
Most clustering algorithms are based on two popular techniques known as
hierarchical and partitional clustering [Frigui and Krishnapuram 1999; Leung et al.
2000]. In the following, an overview of both techniques is presented with an elaborate
discussion of popular hierarchical and partitional clustering algorithms.
In general, an iterative clustering algorithm recomputes each centroid as a weighted
average of the patterns [Hamerly and Elkan 2002]:

m_k = ( Σ_{∀z_p} u(m_k | z_p) w(z_p) z_p ) / ( Σ_{∀z_p} u(m_k | z_p) w(z_p) )     (5)

where u(m_k | z_p) is the membership of pattern z_p in cluster k. The weight function,
w(z_p), in Eq. (5) defines how much influence pattern z_p has in recomputing the
centroids in the next iteration, where w(z_p) > 0 [Hamerly and Elkan 2002]. The
weight function was proposed by Zhang [2000].
Different stopping criteria can be used in an iterative clustering algorithm, for
example:
• stop when the change in centroid values is smaller than a user-specified
value,
• stop when the quantization error is small enough, or
• stop when a maximum number of iterations has been exceeded.
In the following, popular iterative clustering algorithms are described by defining
the membership and weight functions in Eq. (5).
3.2.1. The K-means Algorithm
The most widely used partitional algorithm is the iterative K-means approach
[Forgy 1965]. The objective function that the K-means optimizes is
J_K-means = Σ_{k=1}^{K} Σ_{∀z_p ∈ C_k} d^2(z_p, m_k)     (6)
Hence, the K-means algorithm minimizes the intra-cluster distance [Hamerly and
Elkan 2002]. The K-means algorithm starts with K centroids (initial values for the
centroids are randomly selected or derived from a priori information). Then, each
pattern in the data set is assigned to the closest cluster (i.e. closest centroid). Finally,
the centroids are recalculated according to the associated patterns. This process is
repeated until convergence is achieved.
The membership and weight functions for K-means are defined as
u(m_k | z_p) = 1 if d^2(z_p, m_k) = min_{kk=1,…,K} d^2(z_p, m_kk), and 0 otherwise     (7)

w(z_p) = 1     (8)
Hence, K-means has a hard membership function. Furthermore, K-means has a
constant weight function, thus, all patterns have equal importance [Hamerly and
Elkan 2002].
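As an illustration, a minimal K-means sketch in Python/NumPy is given below. It is not the code of any cited work; the random initialization and the stopping tolerance are choices made here for illustration.

import numpy as np

def kmeans(Z, K, max_iter=100, tol=1e-6, seed=None):
    # Minimal K-means sketch: hard membership (Eq. 7), constant weights (Eq. 8)
    rng = np.random.default_rng(seed)
    # Initialize the K centroids with randomly chosen patterns
    centroids = Z[rng.choice(len(Z), size=K, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each pattern joins the cluster of its closest centroid
        dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: recompute each centroid from its associated patterns
        new_centroids = np.array([
            Z[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)])
        # Stop when the change in centroid values is small enough
        if np.max(np.linalg.norm(new_centroids - centroids, axis=1)) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated Gaussian blobs
Z = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(Z, K=2, seed=0)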
The K-means algorithm has the following main advantages [Turi 2001]:
• it is very easy to implement, and
• its time complexity is O(Np) making it suitable for very large data sets.
However, the K-means algorithm has the following drawbacks [Davies 1997]:
• the algorithm is data-dependent,
• it is a greedy algorithm that depends on the initial conditions, which may
cause the algorithm to converge to suboptimal solutions, and
• the user needs to specify the number of clusters in advance.
The K-medoids algorithm is similar to K-means with one major difference,
namely, the centroids are taken from the data itself [Hamerly 2003]. The objective of
K-medoids is to find the most centrally located patterns within the clusters [Halkidi et
al. 2001]. These patterns are called medoids. Finding a single medoid requires
O(N_p^2). Hence, K-medoids is not suitable for moderately large data sets.
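As an illustration of this cost, the following sketch (illustrative only) finds the medoid of a single cluster by evaluating all pairwise distances:

import numpy as np

def medoid(cluster):
    # Most centrally located pattern of a cluster: O(n^2) pairwise distances
    d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    return cluster[np.argmin(d.sum(axis=1))]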
3.2.2. The Fuzzy C-means Algorithm
The fuzzy C-means (FCM) algorithm assigns each pattern to each cluster with some
degree of membership. The objective function that FCM optimizes is

J_FCM = Σ_{k=1}^{K} Σ_{p=1}^{N_p} u_{k,p}^q d^2(z_p, m_k)     (9)

where q is the fuzziness exponent, with q ≥ 1. Increasing the value of q will make the
algorithm more fuzzy; u_{k,p} is the membership value for the pth pattern in the kth cluster,
satisfying the following constraints:
• u_{k,p} ≥ 0, p = 1, …, N_p and k = 1, …, K
• Σ_{k=1}^{K} u_{k,p} = 1, p = 1, …, N_p
The membership and weight functions for FCM are defined as [Hamerly and
Elkan 2002]
u(m_k | z_p) = ||z_p − m_k||^{−2/(q−1)} / ( Σ_{kk=1}^{K} ||z_p − m_kk||^{−2/(q−1)} )     (10)

w(z_p) = 1     (11)
Hence, FCM has a soft membership function and a constant weight function. In
general, FCM performs better than K-means [Hamerly 2003] and it is less affected by
the presence of uncertainty in the data [Liew et al. 2000]. However, as in K-means it
requires the user to specify the number of clusters in the data set. In addition, it may
converge to local optima [Jain et al. 1999].
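A minimal FCM sketch follows. It is illustrative only and uses the standard FCM centroid update, in which each pattern is weighted by its membership raised to the power q.

import numpy as np

def fcm(Z, K, q=2.0, max_iter=100, tol=1e-6, seed=None):
    # Minimal fuzzy C-means sketch using the soft membership of Eq. (10)
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=K, replace=False)]
    for _ in range(max_iter):
        # Distances between every pattern and every centroid (small floor avoids division by zero)
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # Soft memberships (Eq. 10): proportional to ||z_p - m_k||^(-2/(q-1)); rows sum to one
        u = d ** (-2.0 / (q - 1.0))
        u /= u.sum(axis=1, keepdims=True)
        # Standard FCM centroid update: each pattern weighted by u^q
        um = u ** q
        new_centroids = (um.T @ Z) / um.sum(axis=0)[:, None]
        if np.max(np.linalg.norm(new_centroids - centroids, axis=1)) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, u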
Krishnapuram and Keller [1993; 1996] proposed a possibilistic clustering
algorithm, called possibilistic C-means. Possibilistic clustering is similar to fuzzy
clustering; the main difference is that in possibilistic clustering the membership
values may not sum to one [Turi 2001]. Possibilistic C-means works well in the
presence of noise in the data set. However, it has several drawbacks, namely [Turi
2001],
• it is likely to generate coincident clusters,
• it requires the user to specify the number of clusters in advance,
• it converges to local optima, and
• it depends on initial conditions.
3.2.3. The Gaussian Expectation-Maximization Algorithm
Another popular clustering algorithm is the Expectation-Maximization (EM)
algorithm [McLachlan and Krishnan 1997; Rendner and Walker 1984; Bishop 1995].
EM is used for parameter estimation in the presence of some unknown data [Hamerly
2003]. EM partitions the data set into clusters by determining a mixture of Gaussians
fitting the data set. Each Gaussian has a mean and covariance matrix [Alldrin et al.
2003]. The objective function that the EM optimizes as defined by Hamerly and Elkan
[2002] is
J_EM = − Σ_{p=1}^{N_p} log( Σ_{k=1}^{K} p(z_p | m_k) p(m_k) )     (12)

The membership and weight functions for EM are defined as

u(m_k | z_p) = p(z_p | m_k) p(m_k) / p(z_p)     (13)

w(z_p) = 1     (14)
Hence, EM has a soft membership function and a constant weight function. The
algorithm starts with an initial estimate of the parameters. Then, an expectation step is
applied where the known data values are used to compute the expected values of the
unknown data [Hamerly 2003]. This is followed by a maximization step where the
known and expected values of the data are used to generate a new estimate of the
parameters. The expectation and maximization steps are repeated until convergence.
Results from Veenman et al. [2002] and Hamerly [2003] showed that K-means
performs comparably to EM. Furthermore, Alldrin et al. [2003] stated that EM
fails on high-dimensional data sets due to numerical precision problems. They also
observed that Gaussians often collapsed to delta functions [Alldrin et al. 2003]. In
addition, EM depends on the initial estimate of the parameters [Hamerly 2003; Turi
2001] and it requires the user to specify the number of clusters in advance. Moreover,
EM assumes that the density of each cluster is Gaussian which may not always be true
[Ng et al. 2001].
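The following is a compact EM sketch for a Gaussian mixture. It is illustrative only; for brevity each Gaussian is restricted to a diagonal covariance matrix, which is a simplification of the general algorithm described above.

import numpy as np

def gmm_em(Z, K, max_iter=100, seed=None):
    # Minimal EM sketch for a Gaussian mixture with diagonal covariances
    rng = np.random.default_rng(seed)
    Np, Nd = Z.shape
    means = Z[rng.choice(Np, size=K, replace=False)]
    variances = np.ones((K, Nd)) * Z.var(axis=0)   # per-dimension variances
    priors = np.full(K, 1.0 / K)                   # p(m_k)
    for _ in range(max_iter):
        # E-step: responsibilities u(m_k | z_p) = p(z_p | m_k) p(m_k) / p(z_p)  (Eq. 13)
        log_p = (-0.5 * np.sum((Z[:, None, :] - means[None, :, :]) ** 2 / variances
                               + np.log(2.0 * np.pi * variances), axis=2)
                 + np.log(priors))
        log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exponentiation
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and variances from the responsibilities
        Nk = resp.sum(axis=0)
        priors = Nk / Np
        means = (resp.T @ Z) / Nk[:, None]
        variances = (resp.T @ (Z ** 2)) / Nk[:, None] - means ** 2 + 1e-6
    return means, variances, priors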
3.2.4. The K-harmonic Means Algorithm
Recently, Zhang and colleagues [1999; 2000] proposed a novel algorithm called
K-harmonic means (KHM), with promising results. In KHM, the harmonic mean of
the distance of each cluster center to every pattern is computed. The cluster centroids
are then updated accordingly. The objective function that the KHM optimizes is
J_KHM = Σ_{p=1}^{N_p} K / ( Σ_{k=1}^{K} 1 / ||z_p − m_k||^α )     (15)

where α is a user-specified parameter. The membership and weight functions for
KHM are defined as [Hamerly and Elkan 2002]

u(m_k | z_p) = ||z_p − m_k||^{−α−2} / ( Σ_{kk=1}^{K} ||z_p − m_kk||^{−α−2} )     (16)

w(z_p) = ( Σ_{k=1}^{K} ||z_p − m_k||^{−α−2} ) / ( Σ_{k=1}^{K} ||z_p − m_k||^{−α} )^2     (17)
Hence, KHM has a soft membership function and a varying weight function. KHM
assigns higher weights for patterns that are far from all the centroids to help the
centroids in covering the data [Hamerly and Elkan 2002].
Contrary to K-means, KHM is less sensitive to initial conditions and does not
have the problem of collapsing Gaussians exhibited by EM [Alldrin et al. 2003].
Experiments conducted by Zhang et al. [1999], Zhang [2000] and Hamerly and Elkan
[2002] showed that KHM outperformed K-means, FCM (according to Hamerly and
Elkan [2002]) and EM.
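A minimal KHM sketch based on Eqs. (5), (16) and (17) is given below. It is illustrative only; the default value of α is an arbitrary choice, since α is a user-specified parameter.

import numpy as np

def khm(Z, K, alpha=3.5, max_iter=100, seed=None):
    # Minimal K-harmonic means sketch: soft memberships and varying weights
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=K, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # Soft membership (Eq. 16): proportional to ||z_p - m_k||^(-alpha-2)
        u = d ** (-alpha - 2.0)
        u /= u.sum(axis=1, keepdims=True)
        # Varying weight (Eq. 17): patterns far from all centroids receive larger weights
        w = (d ** (-alpha - 2.0)).sum(axis=1) / (d ** (-alpha)).sum(axis=1) ** 2
        # Generalized centroid update (Eq. 5)
        uw = u * w[:, None]
        centroids = (uw.T @ Z) / uw.sum(axis=0)[:, None]
    return centroids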
3.2.5. Hybrid 2
Hamerly and Elkan [2002] proposed a variation of KHM, called Hybrid 2 (H2),
which uses the soft membership function of KHM (i.e. Eq. (16)) and the constant
weight function of K-means (i.e. Eq. (8)). Hamerly and Elkan [2002] showed that H2
outperformed K-means, FCM and EM. However, KHM, in general, performed
slightly better than H2.
K-means, FCM, EM, KHM and H2 are linear time algorithms (i.e. their time
complexity is O(Np)) making them suitable for very large data sets. According to
Hamerly [2003], FCM, KHM and H2, all of which use soft membership functions, are
the best available clustering algorithms.
A well-known validity index is Dunn's index [Dunn 1974], defined as

D = min_{k=1,…,K} { min_{kk=k+1,…,K} { dist(C_k, C_kk) / max_{a=1,…,K} diam(C_a) } }     (18)

where dist(C_k, C_kk) is the dissimilarity function between two clusters C_k and C_kk,
defined as

dist(C_k, C_kk) = min_{u ∈ C_k, w ∈ C_kk} d(u, w),

where d(u, w) is the Euclidean distance between u and w; diam(C) is the diameter of a
cluster, defined as

diam(C) = max_{u,w ∈ C} d(u, w)
An "optimal" value of K is the one that maximizes the Dunn's index. Dunn's index
suffers from the following problems [Halkidi et al. 2001]:
• it is computationally expensive, and
• it is sensitive to the presence of noise.
Several Dunn-like indices were proposed in Pal and Biswas [1997] to reduce the
sensitivity to the presence of noise.
Another well known index, proposed by Davies and Bouldin [1979], minimizes
the average similarity between each cluster and the one most similar to it. The Davies
and Bouldin index is defined as
DB = (1/K) Σ_{k=1}^{K} max_{kk=1,…,K, kk≠k} { ( diam(C_k) + diam(C_kk) ) / dist(C_k, C_kk) }     (19)
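Both indices can be computed directly from their definitions, as in the following sketch (illustrative only; a hard clustering is assumed, given as one integer label per pattern):

import numpy as np

def dunn_index(Z, labels):
    # Dunn's index (Eq. 18): larger values indicate compact, well-separated clusters
    clusters = [Z[labels == k] for k in np.unique(labels)]
    diam = lambda c: np.max(np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2))
    dist = lambda a, b: np.min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2))
    max_diam = max(diam(c) for c in clusters)
    return min(dist(clusters[k], clusters[kk]) / max_diam
               for k in range(len(clusters)) for kk in range(k + 1, len(clusters)))

def davies_bouldin(Z, labels):
    # Davies-Bouldin index as given in Eq. (19): smaller values are better
    clusters = [Z[labels == k] for k in np.unique(labels)]
    K = len(clusters)
    diam = lambda c: np.max(np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2))
    dist = lambda a, b: np.min(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2))
    return sum(max((diam(clusters[k]) + diam(clusters[kk])) / dist(clusters[k], clusters[kk])
                   for kk in range(K) if kk != k)
               for k in range(K)) / K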
Turi [2001] proposed a validity index defined as

V = ( c × N(2,1) + 1 ) × intra / inter     (20)
where c is a user specified parameter and N(2,1) is a Gaussian distribution with mean
2 and standard deviation of 1. The "intra" term is the average of all the distances
between each data point and its cluster centroid, defined as
intra = (1/N_p) Σ_{k=1}^{K} Σ_{∀u ∈ C_k} ||u − m_k||^2
This term is used to measure the compactness of the clusters. The "inter" term is the
minimum distance between the cluster centroids, defined as
inter = min{ ||m_k − m_kk||^2 }, ∀ k = 1, …, K−1 and kk = k+1, …, K.
This term is used to measure the separation of the clusters. An "optimal" value of K is
the one that minimizes the V index.
According to Turi [2001], this index performed better than both Dunn's index
and the index of Davies and Bouldin on the tested cases.
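A sketch of the V index follows. It is illustrative only: the Gaussian term is assumed here to be evaluated at the number of clusters K, and the default value of the user-specified parameter c is an arbitrary choice.

import numpy as np

def v_index(Z, labels, centroids, c=25.0):
    # V index (Eq. 20): smaller values indicate a better clustering
    # centroids: (K, Nd) array; labels: integer array of length Np
    K = len(centroids)
    # intra: average squared distance between each pattern and its cluster centroid
    intra = np.mean(np.sum((Z - centroids[labels]) ** 2, axis=1))
    # inter: minimum squared distance between any two cluster centroids
    inter = min(np.sum((centroids[k] - centroids[kk]) ** 2)
                for k in range(K - 1) for kk in range(k + 1, K))
    gauss = np.exp(-(K - 2.0) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)  # N(2,1) evaluated at K (assumed)
    return (c * gauss + 1.0) * intra / inter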
Two recent validity indices are S_Dbw [Halkidi and Vazirgiannis 2001] and
CDbw [Halkidi and Vazirgiannis 2002]. S_Dbw measures the compactness of a data
set by the cluster variance, whereas separation is measured by the density between
clusters. The S_Dbw index is defined as

S_Dbw(K) = Scat(K) + Dens_bw(K)     (21)

The first term, Scat(K) = (1/K) Σ_{k=1}^{K} ||σ(C_k)|| / ||σ(Z)||, measures the average
scattering (compactness) of the clusters, where σ(C_k) is the variance of cluster C_k and
σ(Z) is the variance of data set Z; ||z|| is defined as ||z|| = (z^T z)^{1/2}, where z is a vector.
The second term in Eq. (21) evaluates the density of the area between the two
clusters in relation to the density of the two clusters. Thus, the second term is a
measure of the separation of the clusters, defined as
Dens_bw(K) = ( 1 / (K(K−1)) ) Σ_{k=1}^{K} Σ_{kk=1, kk≠k}^{K} density(b_{k,kk}) / max{ density(C_k), density(C_kk) }
where b_{k,kk} is the middle point of the line segment defined by m_k and m_kk. The term
density(b) is defined as
density(b) = Σ_{ll=1}^{n_{k,kk}} f(z_ll, b)
where n_{k,kk} is the total number of patterns in clusters C_k and C_kk (i.e. n_{k,kk} = n_k + n_kk).
The function f(z,b) is defined as
f(z, b) = 0 if d(z, b) > σ, and 1 otherwise
where

σ = (1/K) Σ_{k=1}^{K} ||σ(C_k)||
An "optimal" value of K is the one that minimizes the S_Dbw index. Halkidi and
Vazirgiannis [2001] showed that, in tested cases, S_Dbw successfully found the
"optimal" number of clusters whereas other well-known indices often failed to do so.
However, S_Dbw does not work properly for arbitrary shaped clusters.
To address this problem, Halkidi and Vazirgiannis [2002] proposed a multi-
representative validity index, CDbw, in which each cluster is represented by a user-
specified number of points, instead of one representative as is done in S_Dbw.
Furthermore, CDbw uses intra-cluster density to measure the compactness of a data
set, and uses the density between clusters to measure their separation.
More recently, Veenman et al. [2002; 2003] proposed a validity index that
minimizes the intra-cluster variability while constraining the intra-cluster variability
of the union of any two clusters. The sum of squared errors is used to measure the
intra-cluster variability, while a minimum variance for the union of two clusters is
used to implement the joint intra-cluster variability constraint. The index is defined as
IV = min Σ_{k=1}^{K} n_k Var(C_k)     (22)

where

Var(C_k) = (1/n_k) Σ_{z_p ∈ C_k} ||z_p − m_k||^2

subject to the constraint

Var(C_k ∪ C_kk) ≥ σ_max^2, ∀ C_k, C_kk, k ≠ kk

where σ_max^2 is a user-specified parameter. This parameter has a profound effect on the
final result.
The above validity indices are suitable for hard clustering. Validity indices have also
been developed for fuzzy clustering; the interested reader is referred to Halkidi et al.
[2001] for more information.
There are also several information-theoretic criteria for determining the number of
clusters in a data set, such as Akaike's information criterion (AIC) [Akaike 1974], the
minimum description length (MDL) [Rissanen 1978] and the Merhav-Gutman-Ziv
(MGZ) criterion [Merhav 1989]. These criteria are based on the likelihood and differ
in the penalty term they use to penalize a large number of clusters. According to
Langan et al. [1998], MGZ requires the user to specify a priori the value of a
parameter that has a profound effect on the resultant number of clusters. Furthermore,
the penalty terms of AIC and MDL are generally ineffective because the associated
log-likelihood function generally dominates the penalty term. To address this issue,
Langan et al. [1998] proposed a cluster validation criterion that has no penalty term
and applied it to the image segmentation problem with promising results.
Another widely used criterion is the Bayesian information criterion (BIC), defined as

BIC(C | Z) = l̂(Z | C) − ( K(N_d + 1) / 2 ) log N_p     (23)

where l̂(Z | C) is the log-likelihood of the data set Z given the partitioning C.
Lorette et al. [2000] proposed a fuzzy approach that dynamically determines the
number of clusters in a data set by minimizing the objective function

J_UFC = Σ_{k=1}^{K} Σ_{p=1}^{N_p} u_{k,p}^q d^2(z_p, m_k) − β Σ_{k=1}^{K} p_k log(p_k)     (24)
where q is the fuzziness exponent, u_{k,p} is the membership value for the pth pattern in
the kth cluster, β is a parameter that decreases as the run progresses, and p_k is the a
priori probability of cluster C_k, defined as
p_k = (1/N_p) Σ_{p=1}^{N_p} u_{k,p}     (25)
The first term of Eq. (24) is the objective function of FCM which is minimized when
each cluster consists of one pattern. The second term is an entropy term that is
minimized when all the patterns are assigned to one cluster. Lorette et al. [2000] use
this objective function to derive new update equations for the membership and
centroid parameters.
The algorithm starts with a large number of clusters. Then, the membership
values and centroids are updated using the new update equations. This is followed by
applying Eq. (25) to update the a priori probabilities. If p_k < ε, then cluster k is
discarded; ε is a user-specified parameter. This procedure is repeated until
convergence. The drawback of this approach is that it requires the parameter ε to be
specified in advance. The performance of the algorithm is sensitive to the value of ε.
Similarly, Boujemaa [2000] proposed an algorithm, based on a generalization
of the competitive agglomeration clustering algorithm introduced by Frigui and
Krishnapuram [1997].
The fuzzy algorithms discussed above modify the objective function of FCM.
In general, these approaches are sensitive to initialization and other parameters [Frigui
and Krishnapuram 1999]. Frigui and Krishnapuram [1999] proposed a robust
competitive clustering algorithm based on the process of competitive agglomeration.
The algorithm starts with a large number of small clusters. Then, during the execution
of the algorithm, adjacent clusters compete for patterns. Clusters losing the
competition will eventually disappear [Frigui and Krishnapuram 1999]. However, this
algorithm also requires the user to specify a parameter that has a significant effect on
the generated result.
Use competitive learning to train the weight vectors such that all the nodes within
the neighborhood of the winning node are moved toward zp:
    w_k(t+1) = w_k(t) + η(t)[z_p − w_k(t)]   if k ∈ ∆_w(t)
    w_k(t+1) = w_k(t)                         otherwise
Endloop
Linearly decrease η(t) and reduce ∆_w(t)
Until some convergence criteria are satisfied
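A compact SOM training sketch corresponding to the update rule above is given below (illustrative only; the map size, learning-rate schedule and neighbourhood radius are arbitrary choices):

import numpy as np

def train_som(Z, grid_shape=(5, 5), epochs=20, eta0=0.5, radius0=2.0, seed=None):
    # Minimal SOM training sketch following the competitive-learning update above
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    Nd = Z.shape[1]
    weights = rng.random((rows, cols, Nd))           # one weight vector per map node
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(epochs):
        eta = eta0 * (1.0 - t / epochs)              # linearly decreasing learning rate
        radius = max(radius0 * (1.0 - t / epochs), 1.0)  # shrinking neighbourhood
        for zp in rng.permutation(Z):
            # Winning node: the node whose weight vector is closest to z_p
            dists = np.linalg.norm(weights - zp, axis=2)
            winner = np.unravel_index(np.argmin(dists), (rows, cols))
            # Move every node within the neighbourhood of the winner toward z_p
            in_hood = np.linalg.norm(coords - np.array(winner), axis=2) <= radius
            weights[in_hood] += eta * (zp - weights[in_hood])
    return weights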
One problem with simulated annealing is that it is very slow in finding an optimal
solution [Jain et al. 1999].
Tabu search [Glover 1989; Glover 1990] has also been used for hard clustering
[Al-Sultan 1995] and fuzzy clustering [Delgado et al. 1997] with encouraging results.
A hybrid approach combining both K-means and tabu search that performs better than
both K-means and tabu search was proposed by Fränti et al. [1998]. Recently, Chu and
Roddick [2003] proposed a hybrid approach combining both tabu search and
simulated annealing that outperforms the hybrid proposed by Fränti et al. [1998].
However, the performance of simulated annealing and tabu search depends on the
selection of several control parameters [Jain et al. 1999].
Most clustering approaches discussed so far perform local search to find a solution
to a clustering problem. Evolutionary algorithms [Michalewicz and Fogel 2000]
which perform global search have also been used for clustering [Jain et al. 1999].
Raghavan and Birchand [1979] used GAs [Goldberg 1989] to minimize the squared
error of a clustering solution. In this approach, each chromosome represents a
partition of Np patterns into K clusters. Hence, the size of each chromosome is Np.
This representation has a major drawback in that it increases the search space by a
factor of K!. The crossover operator may also result in inferior offspring [Jain et al.
1999].
Babu and Murty [1993] proposed a hybrid approach combining K-means and
GAs that performed better than the GA. In this approach, a GA is only used to feed K-
means with good initial centroids [Jain et al. 1999].
Recently, Maulik and Bandyopadhyay [2000] proposed a GA-based clustering
where each chromosome represents K centroids. Hence, a floating point
representation is used. The fitness function is defined as the inverse of the objective
function of K-means (refer to Eq. (6)). The GA-based clustering algorithm is
summarized in Figure 4.
According to Maulik and Bandyopadhyay [2000], this approach outperformed
K-means on the tested cases. One drawback of this approach is that it requires the user
to specify the number of clusters in advance.
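The following sketch illustrates the general idea of GA-based clustering with floating-point chromosomes encoding K centroids and a fitness equal to the inverse of the K-means objective in Eq. (6). It is not the exact algorithm of Maulik and Bandyopadhyay [2000]; the genetic operators and parameter values used here are illustrative assumptions.

import numpy as np

def ga_clustering(Z, K, pop_size=20, generations=50, p_cross=0.8, p_mut=0.05, seed=None):
    rng = np.random.default_rng(seed)
    Np, Nd = Z.shape

    def fitness(chrom):
        # Inverse of the K-means objective (Eq. 6) for the centroids encoded in the chromosome
        centroids = chrom.reshape(K, Nd)
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        return 1.0 / (np.sum(np.min(d, axis=1) ** 2) + 1e-12)

    # Each chromosome encodes K centroids, initialized with patterns from the data set
    pop = np.array([Z[rng.choice(Np, size=K, replace=False)].ravel()
                    for _ in range(pop_size)])
    for _ in range(generations):
        fit = np.array([fitness(c) for c in pop])
        # Fitness-proportionate (roulette-wheel) selection of parents
        parents = pop[rng.choice(pop_size, size=pop_size, p=fit / fit.sum())]
        children = parents.copy()
        # Single-point crossover on consecutive pairs of parents
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = rng.integers(1, K * Nd)
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        # Gaussian mutation of individual genes
        mask = rng.random(children.shape) < p_mut
        children[mask] += rng.normal(0.0, Z.std(), size=mask.sum())
        # Elitism: keep the best chromosome of the current generation
        children[0] = pop[np.argmax(fit)]
        pop = children
    best = max(pop, key=fitness)
    return best.reshape(K, Nd)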
Lee and Antonsson [2000] used an evolution strategy (ES) [Bäck et al. 1991] to
dynamically cluster a data set. The proposed ES implemented variable length
individuals to search for both the centroids and the number of clusters. Each
individual represents a set of centroids. The length of each individual is randomly
chosen from a user-specified range of cluster numbers. The centroids of each
individual are then randomly initialized. Mutation is applied to the individuals by
adding/subtracting a Gaussian random variable with zero mean and unit standard
deviation. Two point crossover is also used as a "length changing operator". A
(10+60) ES selection is used where 10 is the number of parents and 60 is the number
of offspring generated in each generation. The best ten individuals from the set of
parents and offspring are used for the next generation. A modification of the mean
square error is used as the fitness function, defined as
J_ES = √(K + 1) · Σ_{k=1}^{K} Σ_{∀z_p ∈ C_k} d(z_p, m_k)     (26)
8. Summary
This paper presented an overview of different clustering methods. First, the
data clustering problem was defined, followed by definitions of the terms used in this
paper and a brief overview of different similarity measures. Clustering techniques
were then discussed, together with several clustering validation techniques. Methods
that automatically determine the number of clusters in a data set were then presented.
Finally, an overview of clustering using SOMs and stochastic techniques was given.
References
H. Abbas and M. Fahmy. Neural Networks for Maximum Likelihood Clustering.
Signal Processing, vol. 36, no.1, pp. 111-126, 1994.
H. Akaike. A New Look at the Statistical Model Identification. IEEE Transactions on
Automatic Control, vol. AC-19, Dec. 1974.
N. Alldrin, A. Smith and D. Turnbull. Clustering with EM and K-means, unpublished
Manuscript, 2003, http://louis.ucsd.edu/~nalldrin/research/cse253\_wi03.pdf
(visited 15 Nov 2003).
K. Al-Sultan. A Tabu Search Approach to Clustering Problems. Pattern Recognition,
vol. 28, pp. 1443-1451, 1995.
M. Anderberg. Cluster Analysis for Applications. Academic Press, New York, USA,
1973.
G. Babu and M. Murty. A Near-Optimal Initial Seed Value Selection in K-means
Algorithm Using a Genetic Algorithm. Pattern Recognition Letters, vol. 14, no.
10, pp. 763-769, 1993.
F. Bach and M. Jordan. Learning Spectral Clustering. Neural Information Processing
Systems 16 (NIPS 2003), 2003.
T. Bäck, F. Hoffmeister and H. Schwefel. A Survey of Evolution Strategies. In
Proceedings of the Fourth International Conference on Genetic Algorithms and
their Applications, pp. 2-9, 1991.
S. Baek, B. Jeon, D. Lee and K. Sung. Fast Clustering Algorithm for Vector
Quantization. Electronics Letters, vol. 34, no. 2, pp. 151-152, 1998.
G. Ball and D. Hall. A Clustering Technique for Summarizing Multivariate Data.
Behavioral Science, vol. 12, pp. 153-155, 1967.
J. Bezdek. A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 2, pp. 1-8,
1980.
J. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum
Press, 1981.
H. Bischof, A. Leonardis and A. Selb. MDL Principle for Robust Vector
Quantization. Pattern Analysis and Applications, vol. 2, pp. 59-72, 1999.
C. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
N. Boujemaa. On Competitive Unsupervised Clustering. In the International
Conference on Pattern Recognition (ICPR'00), vol. 1, pp. 1631-1634, 2000.
C. Carpineto and G. Romano. A Lattice Conceptual Clustering System and Its
Application to Browsing Retrieval. Machine Learning, vol. 24, no. 2, pp. 95-122,
1996.
S. Chu and J. Roddick. A Clustering Algorithm Using Tabu Search Approach with
Simulated Annealing for Vector Quantization. Chinese Journal of Electronics,
vol. 12, no. 3, pp. 349-353, 2003.
F. Chung. Spectral Graph Theory. Society Press, 1997.
G. Coleman and H. Andrews. Image Segmentation by Clustering. In Proceedings of
IEEE, vol. 67, pp. 773-785, 1979.
D. Comaniciu and P. Meer. Mean Shift: A Robust Approach Toward Feature Space
Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.
24, no. 5, pp. 603-619, 2002.
H. Dai and W. Ma. A Novelty Bayesian Method for Unsupervised Learning of Finite
Mixture Models. In Proceedings of the 3rd International Conference on Machine
Learning and Cybernetics, Shanghai, China, pp. 3574-3578, 2004.
E. Davies. Machine Vision: Theory, Algorithms, Practicalities. Academic Press, 2nd
Edition, 1997.
D. Davies and D. Bouldin. A Cluster Separation Measure. IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 1, no. 2, 1979.
M. Delgado, A. Skarmeta and H. Barberá. A Tabu Search Approach to the Fuzzy
Clustering Problem. In the Sixth IEEE International Conference on Fuzzy
Systems, Barcelona, 1997.
J. C. Dunn. Well Separated Clusters and Optimal Fuzzy Partitions. Journal of
Cybernetics, vol. 4, pp. 95-104, 1974.
B. Everitt. Cluster Analysis. Heinemann Books, London, 1974.
M. Figueiredo and A. Jain. Unsupervised Learning of Finite Mixture Models. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp.
381-396, 2002.
E. Forgy. Cluster Analysis of Multivariate Data: Efficiency versus Interpretability of
Classification. Biometrics, vol. 21, pp. 768-769, 1965.
H. Frigui and R. Krishnapuram. Clustering by Competitive Agglomeration. Pattern
Recognition Letters, vol. 30, no. 7, pp. 1109-1119, 1997.
H. Frigui and R. Krishnapuram. A Robust Competitive Clustering Algorithm with
Applications in Computer Vision. IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 21, no.5, pp. 450-465, 1999.
P. Fränti, J. Kivijärvi and O. Nevalainen. Tabu Search Algorithm for Codebook
Generation in Vector Quantization. Pattern Recognition, vol. 31, no. 8, pp. 1139-
1148, 1998.
I. Gath and A. Geva. Unsupervised Optimal Fuzzy Clustering. IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 773-781, 1989.
F. Glover. Tabu Search – Part I. ORSA Journal on Computing, vol. 1, no. 3, pp. 190-
206, 1989.
F. Glover. Tabu Search – Part II. ORSA Journal on Computing, vol. 2, no. 1, pp. 4-32,
1990.
D. Goldberg. Genetic Algorithms in search, optimization and machine learning.
Addison-Wesley, 1989.
M. Halkidi, Y. Batistakis and M. Vazirgiannis. On Clustering Validation Techniques.
Intelligent Information Systems Journal, Kluwer Publishers, vol. 17, no. 2-3,
pp.107-145, 2001.
M. Halkidi and M. Vazirgiannis. Clustering Validity Assessment: Finding the
Optimal Partitioning of a data set. In Proceedings of ICDM Conference, CA,
USA, 2001.
M. Halkidi and M. Vazirgiannis. Clustering Validity Assessment using Multi
representative. In Proceedings of the Hellenic Conference on Artificial
Intelligence, SETN, Thessaloniki, Greece, 2002.
G. Hamerly. Learning Structure and Concepts in Data using Data Clustering, PhD
Thesis. University of California, San Diego, 2003.
G. Hamerly and C. Elkan. Alternatives to the K-means Algorithm that Find Better
Clusterings. In Proceedings of the ACM Conference on Information and
Knowledge Management (CIKM-2002), pp. 600-607, 2002.
G. Hamerly and C. Elkan. Learning the K in K-means. In The Seventh Annual
Conference on Neural Information Processing Systems, 2003.
K. Huang. A Synergistic Automatic Clustering Technique (Syneract) for
Multispectral Image Analysis. Photogrammetric Engineering and Remote
Sensing, vol. 1, no.1, pp. 33-40, 2002.
A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey,
USA, 1988.
A. Jain, R. Duin and J. Mao. Statistical Pattern Recognition: A Review. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no.1, pp. 4-
37, 2000.
A. Jain, M. Murty and P. Flynn. Data Clustering: A Review. ACM Computing
Surveys, vol. 31, no. 3, pp. 264-323, 1999.
D. Judd, P. Mckinley and A. Jain. Large-scale Parallel Data Clustering. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp.
871-876, 1998.
R. Kass and L. Wasserman. A Reference Bayesian Test for Nested Hypotheses and its
Relationship to the Schwarz Criterion. Journal of the American Statistical
Association, vol. 90, no. 431, pp. 928-934, 1995.
T. Kaukoranta, P. Fränti and O. Nevalainen. A New Iterative Algorithm for VQ
Codebook Generation. International Conference on Image Processing, pp. 589-
593, 1998.
J. Kennedy and R. Eberhart. Particle Swarm Optimization. In Proceedings of IEEE
International Conference on Neural Networks, Perth, Australia, vol. 4, pp. 1942-
1948, 1995.
R. Klein and R. Dubes. Experiments in Projection and Clustering by Simulated
Annealing. Pattern Recognition, vol. 22, pp. 213-220, 1989.
T. Kohonen. Self-Organizing Maps. Springer Series in Information Sciences, 30,
Springer-Verlag, New York, USA, 1995.
R. Krishnapuram and J. Keller. A Possibilistic Approach to Clustering. IEEE Transactions
on Fuzzy Systems, vol. 1, no. 2, pp. 98-110, 1993.
R. Krishnapuram and J. Keller. The Possibilistic C-Means Algorithm: Insights and
Recommendations. IEEE Transactions on Fuzzy Systems, vol. 4, no. 3, pp. 385-
393, 1996.
D. Langan, J. Modestino and J. Zhang. Cluster Validation for Unsupervised
Stochastic Model-Based Image Segmentation. IEEE Transactions on Image
Processing, vol. 7, no. 2, pp. 180-195, 1998.
C. Lee and E. Antonsson. Dynamic Partitional Clustering Using Evolution Strategies.
In The Third Asia-Pacific Conference on Simulated Evolution and Learning,
2000.
Y. Leung, J. Zhang and Z. Xu. Clustering by Scale-Space Filtering. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no.12, pp.
1396-1410, 2000.
A. Liew, S. Leung and W. Lau. Fuzzy Image Clustering Incorporating Spatial
Continuity. In IEE Proceedings Vision, Image and Signal Processing, vol. 147,
no. 2, 2000.
A. Lorette, X. Descombes and J. Zerubia. Fully Unsupervised Fuzzy Clustering with
Entropy Criterion. In International Conference on Pattern Recognition (ICPR'00),
vol. 3, pp. 3998-4001, 2000.
S. Lu and K. Fu. A Sentence-to-Sentence Clustering Procedure for Pattern Analysis.
IEEE Transaction on Systems, Man and Cybernetics, vol. 8, pp. 381-389, 1978.
J. MacQueen. Some Methods for Classification and Analysis of Multivariate
Observations. In Proceedings Fifth Berkeley Symposium on Mathematics,
Statistics and Probability, vol. 1, pp. 281-297, 1967.
U. Maulik and S. Bandyopadhyay. Genetic Algorithm-Based Clustering Technique.
Pattern Recognition, vol. 33, pp. 1455-1465, 2000.
G. McLachlan and T. Krishnan. The EM algorithm and Extensions. John Wiley &
Sons, Inc., 1997.
K. Mehrotra, C. Mohan and S. Ranka. Elements of Artificial Neural Networks. MIT
Press, 1997.
N. Merhav. The Estimation of the Model Order in Exponential Families. IEEE
Transactions on Information Theory, vol. 35, Sep. 1989.
Z. Michalewicz and D. Fogel. How to Solve It: Modern Heuristics. Springer-Verlag,
Berlin, 2000.
A. Ng, M. Jordan and Y. Weiss. On Spectral Clustering: Analysis and an Algorithm.
In Proceedings of Neural Information Processing Systems (NIPS 2001), 2001.
J. Oliver, R. Baxter and C. Wallace. Unsupervised Learning using MML. In
Proceedings of the 13th International Conference Machine Learning (ICML'96),
pp. 364-372, San Francisco, USA, 1996.
J. Oliver and D. Hand. Introduction to Minimum Encoding Inference. Technical
Report no. 94/205. Department of Computer Science, Monash University,
Australia, 1994.
M. Omran. Particle Swarm Optimization Methods for Pattern Recognition and Image
Processing, PhD Thesis. Department of Computer Science, University of Pretoria,
South Africa, 2005.
M. Omran, A. Engelbrecht and A. Salman. Differential Evolution Methods for
Unsupervised Image Classification. To appear in the IEEE Congress on
Evolutionary Computation (CEC2005), September 2005.
M. Omran, A. Engelbrecht and A. Salman. Particle Swarm Optimization Method for
Image Clustering. International Journal of Pattern Recognition and Artificial
Intelligence, vol. 19, no. 3, pp. 297-322, May 2005.
M. Omran, A. Salman and A. Engelbrecht. Image Classification using Particle Swarm
Optimization. In Conference on Simulated Evolution and Learning, Singapore, pp.
370-374, November 2002.
N. Pal and J. Biswas. Cluster Validation using Graph Theoretic Concepts. Pattern
Recognition, vol. 30, no. 6, 1997.
A. Pandya and R. Macy. Pattern Recognition with Neural Networks in C++. CRC
Press, 1996.
S. Paterlini and T. Krink. High Performance Clustering with Differential Evolution. In
the Congress on Evolutionary Computation (CEC2004), vol. 2, pp. 2004-2011,
2004.
D. Pelleg and A. Moore. X-means: Extending K-means with Efficient Estimation of
the Number of Clusters. In Proceedings of the 17th International Conference on
Machine Learning, pp. 727-734, Morgan Kaufmann, San Francisco, CA, 2000.
V. Raghavan and K. Birchand. A Clustering Strategy Based on a Formalism of the
Reproductive Process in a Natural System. In Proceedings of the Second
International Conference on Information Storage and Retrieval, pp. 10-22, 1979.
R. Rendner and H. Walker. Mixture Densities, Maximum Likelihood and the EM
Algorithm. SIAM Review, vol. 26, no. 2, 1984.
J. Rissanen. Modeling by Shortest Data Description. Automatica, vol. 14, pp. 465-
471, 1978.
S. Roberts, D. Husmeier, L. Rezek and W. Penny. Bayesian Approaches to Gaussian
Mixture Modeling. IEEE Transactions in Pattern Recognition and Machine
Intelligence, vol. 20, no. 11, pp. 1133-1142, 1998.
C. Rosenberger and K. Chehdi. Unsupervised Clustering Method with Optimal
Estimation of the Number of Clusters: Application to Image Segmentation. In The
International Conference on Pattern Recognition (ICPR'00), vol. 1, pp. 1656-
1659, 2000.
P. Scheunders. A Comparison of Clustering Algorithms Applied to Color Image
Quantization. Pattern Recognition Letters, vol. 18, no. 11-13, pp. 1379-1384,
1997.
P. Sneath and R. Sokal. Numerical Taxonomy. Freeman, London, UK, 1973.
R. Storn and K. Price. Differential Evolution – a Simple and Efficient Adaptive
Scheme for Global Optimization over Continuous Spaces, Technical Report TR-
95-012, ICSI, 1995.
M. Su. Cluster Analysis: Chapter two Lecture notes, 2002,
http://selab.csie.ncu.edu.tw/~muchun/course/cluster/CHAPTER%202.pdf (visited
15 August 2004).
S. Theodoridis and K. Koutroubas. Pattern Recognition. Academic Press, 1999.
J. Tou. DYNOC – A Dynamic Optimal Cluster-seeking Technique. International
Journal of Computer and Information Sciences, vol. 8, no. 6, pp. 541-547, 1979.
R.H. Turi. Clustering-Based Colour Image Segmentation, PhD Thesis. Monash
University, Australia, 2001.
P. Van Laarhoven and E. Aarts. Simulated Annealing: Theory and Applications.
Kluwer Academic Publishers, 1987.
C. Veenman, M. Reinders and E. Backer. A Maximum Variance Cluster Algorithm.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9,
pp. 1273-1280, 2002.
C. Veenman, M. Reinders and E. Backer. A Cellular Coevolutionary Algorithm for
Image Segmentation. IEEE Transactions on Image Processing, vol. 12, no. 3, pp.
304-316, 2003.
C. Wallace. An Improved Program for Classification. Technical Report no. 47.
Department of Computer Science, Monash University, Australia, 1984.
C. Wallace and D. Boulton. An Information Measure for Classification. The
Computer Journal, vol. 11, pp. 185-194, 1968.
C. Wallace and D. Dowe. Intrinsic Classification by MML – the snob program. In
Proceedings Seventh Australian Joint Conference on Artificial Intelligence, UNE,
Armidale, NSW, Australia, pp. 37-44, 1994.
Z. Xiang. Color Image Quantization by Minimizing the Maximum Inter-cluster
Distance. ACM Transactions on Graphics, vol. 16, no. 3, pp. 260-276, 1997.
B. Zhang. Generalized K-Harmonic Means - Boosting in Unsupervised Learning.
Technical Report HPL-2000-137. Hewlett-Packard Labs, 2000.
B. Zhang, M. Hsu and U. Dayal. K-Harmonic Means - A Data Clustering Algorithm.
Technical Report HPL-1999-124. Hewlett-Packard Labs, 1999.
Z. Zivkovic and F. van der Heijden. Recursive Unsupervised Learning of Finite
Mixture Models. IEEE Transactions of Pattern Analysis and Machine
Intelligence, vol. 26, no. 5, pp. 651-656, 2004.