

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 3, MAY 2005 645

Survey of Clustering Algorithms


Rui Xu, Student Member, IEEE and Donald Wunsch II, Fellow, IEEE

Abstract—Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure and cluster validation, are also discussed.

Index Terms—Adaptive resonance theory (ART), clustering, clustering algorithm, cluster validation, neural networks, proximity, self-organizing feature map (SOFM).

(Manuscript received March 31, 2003; revised September 28, 2004. This work was supported in part by the National Science Foundation and in part by the M. K. Finley Missouri Endowment. The authors are with the Department of Electrical and Computer Engineering, University of Missouri-Rolla, Rolla, MO 65409 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNN.2005.845141)

I. INTRODUCTION

WE ARE living in a world full of data. Every day, people encounter a large amount of information and store or represent it as data, for further analysis and management. One of the vital means in dealing with these data is to classify or group them into a set of categories or clusters. Actually, as one of the most primitive activities of human beings [14], classification plays an important and indispensable role in the long history of human development. In order to learn a new object or understand a new phenomenon, people always try to seek the features that can describe it, and further compare it with other known objects or phenomena, based on the similarity or dissimilarity, generalized as proximity, according to some certain standards or rules. Basically, classification systems are either supervised or unsupervised, depending on whether they assign new inputs to one of a finite number of discrete supervised classes or unsupervised categories, respectively [38], [60], [75]. In supervised classification, the mapping from a set of input data vectors ($\mathbf{x} \in \mathbb{R}^d$, where $d$ is the input space dimensionality) to a finite set of discrete class labels ($y \in \{1, \ldots, C\}$, where $C$ is the total number of class types) is modeled in terms of some mathematical function $y = y(\mathbf{x}, \mathbf{w})$, where $\mathbf{w}$ is a vector of adjustable parameters. The values of these parameters are determined (optimized) by an inductive learning algorithm (also termed inducer), whose aim is to minimize an empirical risk functional (related to an inductive principle) on a finite data set of input–output examples, $\{(\mathbf{x}_i, y_i),\ i = 1, \ldots, N\}$, where $N$ is the finite cardinality of the available representative data set [38], [60], [167]. When the inducer reaches convergence or terminates, an induced classifier is generated [167].

In unsupervised classification, called clustering or exploratory data analysis, no labeled data are available [88], [150]. The goal of clustering is to separate a finite unlabeled data set into a finite and discrete set of "natural," hidden data structures, rather than provide an accurate characterization of unobserved samples generated from the same probability distribution [23], [60]. This can make the task of clustering fall outside of the framework of unsupervised predictive learning problems, such as vector quantization [60] (see Section II-C), probability density function estimation [38], [60] (see Section II-D), and entropy maximization [99]. It is noteworthy that clustering differs from multidimensional scaling (perceptual maps), whose goal is to depict all the evaluated objects in a way that minimizes the topographical distortion while using as few dimensions as possible. Also note that, in practice, many (predictive) vector quantizers are also used for (nonpredictive) clustering analysis [60].

Nonpredictive clustering is a subjective process in nature, which precludes an absolute judgment as to the relative efficacy of all clustering techniques [23], [152]. As pointed out by Backer and Jain [17], "in cluster analysis a group of objects is split up into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create 'interesting' clusters), such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups."¹

¹The preceding quote is taken verbatim from verbiage suggested by the anonymous associate editor, a suggestion which we gratefully acknowledge.

Clustering algorithms partition data into a certain number of clusters (groups, subsets, or categories). There is no universally agreed upon definition [88]. Most researchers describe a cluster by considering the internal homogeneity and the external separation [111], [124], [150], i.e., patterns in the same cluster should be similar to each other, while patterns in different clusters should not. Both the similarity and the dissimilarity should be examinable in a clear and meaningful way. Here, we give some simple mathematical descriptions of several types of clustering, based on the descriptions in [124].

Given a set of input patterns $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_j, \ldots, \mathbf{x}_N\}$, where $\mathbf{x}_j = (x_{j1}, x_{j2}, \ldots, x_{jd})^T \in \mathbb{R}^d$ and each measure $x_{ji}$ is said to be a feature (attribute, dimension, or variable):

• (Hard) partitional clustering attempts to seek a $K$-partition of $\mathbf{X}$, $C = \{C_1, \ldots, C_K\}$ ($K \le N$), such that
1) $C_i \neq \emptyset$, $i = 1, \ldots, K$;
2) $\bigcup_{i=1}^{K} C_i = \mathbf{X}$;
3) $C_i \cap C_j = \emptyset$, $i, j = 1, \ldots, K$ and $i \neq j$.


• Hierarchical clustering attempts to construct a tree-like nested structure partition of $\mathbf{X}$, $H = \{H_1, \ldots, H_Q\}$ ($Q \le N$), such that $C_i \in H_m$, $C_j \in H_l$, and $m > l$ imply $C_i \subset C_j$ or $C_i \cap C_j = \emptyset$ for all $i$, $j \neq i$, $m, l = 1, \ldots, Q$.

For hard partitional clustering, each pattern only belongs to one cluster. However, a pattern may also be allowed to belong to all $K$ clusters with a degree of membership, $u_{ij} \in [0, 1]$, which represents the membership coefficient of the $j$th object in the $i$th cluster and satisfies the following two constraints:

$\sum_{i=1}^{K} u_{ij} = 1, \ \forall j$ and $\sum_{j=1}^{N} u_{ij} < N, \ \forall i$

as introduced in fuzzy set theory [293]. This is known as fuzzy clustering, reviewed in Section II-G.

Fig. 1. Clustering procedure. The typical cluster analysis consists of four steps with a feedback pathway. These steps are closely related to each other and affect the derived clusters.

Fig. 1 depicts the procedure of cluster analysis with four basic steps.

1) Feature selection or extraction. As pointed out by Jain et al. [151], [152] and Bishop [38], feature selection chooses distinguishing features from a set of candidates, while feature extraction utilizes some transformations to generate useful and novel features from the original ones. Both are very crucial to the effectiveness of clustering applications. Elegant selection of features can greatly decrease the workload and simplify the subsequent design process. Generally, ideal features should be of use in distinguishing patterns belonging to different clusters, immune to noise, and easy to extract and interpret. We elaborate the discussion on feature extraction in Section II-L, in the context of data visualization and dimensionality reduction. More information on feature selection can be found in [38], [151], and [250].

2) Clustering algorithm design or selection. This step is usually combined with the selection of a corresponding proximity measure and the construction of a criterion function. Patterns are grouped according to whether they resemble each other. Obviously, the proximity measure directly affects the formation of the resulting clusters. Almost all clustering algorithms are explicitly or implicitly connected to some definition of proximity measure. Some algorithms even work directly on the proximity matrix, as defined in Section II-A. Once a proximity measure is chosen, the construction of a clustering criterion function makes the partition of clusters an optimization problem, which is well defined mathematically, and has rich solutions in the literature. Clustering is ubiquitous, and a wealth of clustering algorithms has been developed to solve different problems in specific fields. However, there is no clustering algorithm that can be universally used to solve all problems. "It has been very difficult to develop a unified framework for reasoning about it (clustering) at a technical level, and profoundly diverse approaches to clustering" [166], as proved through an impossibility theorem. Therefore, it is important to carefully investigate the characteristics of the problem at hand, in order to select or design an appropriate clustering strategy.

3) Cluster validation. Given a data set, each clustering algorithm can always generate a division, no matter whether the structure exists or not. Moreover, different approaches usually lead to different clusters; and even for the same algorithm, parameter identification or the presentation order of input patterns may affect the final results. Therefore, effective evaluation standards and criteria are important to provide the users with a degree of confidence for the clustering results derived from the used algorithms. These assessments should be objective and have no preferences to any algorithm. Also, they should be useful for answering questions like how many clusters are hidden in the data, whether the clusters obtained are meaningful or just an artifact of the algorithms, or why we choose some algorithm instead of another. Generally, there are three categories of testing criteria: external indices, internal indices, and relative indices. These are defined on three types of clustering structures, known as partitional clustering, hierarchical clustering, and individual clusters [150]. Tests for the situation where no clustering structure exists in the data are also considered [110], but seldom used, since users are confident of the presence of clusters. External indices are based on some prespecified structure, which is the reflection of prior information on the data, and used as a standard to validate the clustering solutions (a small example is sketched after this list). Internal tests are not dependent on external information (prior knowledge). On the contrary, they examine the clustering structure directly from the original data. Relative criteria place the emphasis on the comparison of different clustering structures, in order to provide a reference, to decide which one may best reveal the characteristics of the objects. We will not survey the topic in depth and refer interested readers to [74], [110], and [150]. However, we will cover more details on how to determine the number of clusters in Section II-M. Some more recent discussion can be found in [22], [37], [121], [180], and [181]. Approaches for fuzzy clustering validity are reported in [71], [104], [123], and [220].

4) Results interpretation. The ultimate goal of clustering is to provide users with meaningful insights from the original data, so that they can effectively solve the problems encountered. Experts in the relevant fields interpret the data partition. Further analyses, even experiments, may be required to guarantee the reliability of the extracted knowledge.
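As a concrete illustration of an external index, the sketch below computes the Rand index, one widely used external criterion; the choice of this particular index is ours for illustration and is not singled out by the survey. It compares a clustering against a prespecified reference partition by counting the object pairs on which the two partitions agree.

from itertools import combinations

def rand_index(labels_true, labels_pred):
    """External validation: fraction of object pairs on which the
    candidate clustering and the reference partition agree."""
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:   # both together or both apart
            agree += 1
    return agree / len(pairs)

# Toy usage: a reference partition and a clustering result.
print(rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))  # -> 0.8

A value of 1 indicates perfect agreement with the reference structure; internal and relative indices, by contrast, are computed without such a reference.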


Note that the flow chart also includes a feedback pathway. Cluster analysis is not a one-shot process. In many circumstances, it needs a series of trials and repetitions. Moreover, there are no universal and effective criteria to guide the selection of features and clustering schemes. Validation criteria provide some insights on the quality of clustering solutions. But even how to choose the appropriate criterion is still a problem requiring more effort.

Clustering has been applied in a wide variety of fields, ranging from engineering (machine learning, artificial intelligence, pattern recognition, mechanical engineering, electrical engineering), computer sciences (web mining, spatial database analysis, textual document collection, image segmentation), life and medical sciences (genetics, biology, microbiology, paleontology, psychiatry, clinic, pathology), to earth sciences (geography, geology, remote sensing), social sciences (sociology, psychology, archeology, education), and economics (marketing, business) [88], [127]. Accordingly, clustering is also known as numerical taxonomy, learning without a teacher (or unsupervised learning), typological analysis, and partition. The diversity reflects the important position of clustering in scientific research. On the other hand, it causes confusion, due to the differing terminologies and goals. Clustering algorithms developed to solve a particular problem, in a specialized field, usually make assumptions in favor of the application of interest. These biases inevitably affect performance in other problems that do not satisfy these premises. For example, the K-means algorithm is based on the Euclidean measure and, hence, tends to generate hyperspherical clusters. But if the real clusters are in other geometric forms, K-means may no longer be effective, and we need to resort to other schemes. This situation also holds true for mixture-model clustering, in which a model is fit to data in advance.

Clustering has a long history, with lineage dating back to Aristotle [124]. General references on clustering techniques include [14], [75], [77], [88], [111], [127], [150], [161], [259]. Important survey papers on clustering techniques also exist in the literature. Starting from a statistical pattern recognition viewpoint, Jain, Murty, and Flynn reviewed the clustering algorithms and other important issues related to cluster analysis [152], while Hansen and Jaumard described the clustering problems under a mathematical programming scheme [124]. Kolatch and He investigated applications of clustering algorithms for spatial database systems [171] and information retrieval [133], respectively. Berkhin further expanded the topic to the whole field of data mining [33]. Murtagh reported the advances in hierarchical clustering algorithms [210] and Baraldi surveyed several models for fuzzy and neural network clustering [24]. Some more survey papers can also be found in [25], [40], [74], [89], and [151]. In addition to the review papers, comparative research on clustering algorithms is also significant. Rauber, Paralic, and Pampalk presented empirical results for five typical clustering algorithms [231]. Wei, Lee, and Hsu placed the emphasis on the comparison of fast algorithms for large databases [280]. Scheunders compared several clustering techniques for color image quantization, with emphasis on computational time and the possibility of obtaining global optima [239]. Applications and evaluations of different clustering algorithms for the analysis of gene expression data from DNA microarray experiments were described in [153], [192], [246], and [271]. Experimental evaluation on document clustering techniques, based on hierarchical and K-means clustering algorithms, was summarized by Steinbach, Karypis, and Kumar [261].

In contrast to the above, the purpose of this paper is to provide a comprehensive and systematic description of the influential and important clustering algorithms rooted in statistics, computer science, and machine learning, with emphasis on new advances in recent years.

The remainder of the paper is organized as follows. In Section II, we review clustering algorithms, based on the natures of generated clusters and techniques and theories behind them. Furthermore, we discuss approaches for clustering sequential data, large data sets, data visualization, and high-dimensional data through dimension reduction. Two important issues on cluster analysis, including proximity measure and how to choose the number of clusters, are also summarized in the section. This is the longest section of the paper, so, for convenience, we give an outline of Section II in bullet form here:

II. Clustering Algorithms
• A. Distance and Similarity Measures (see also Table I)
• B. Hierarchical
  — Agglomerative: single linkage, complete linkage, group average linkage, median linkage, centroid linkage, Ward's method, balanced iterative reducing and clustering using hierarchies (BIRCH), clustering using representatives (CURE), robust clustering using links (ROCK)
  — Divisive: divisive analysis (DIANA), monothetic analysis (MONA)
• C. Squared Error-Based (Vector Quantization)
  — K-means, iterative self-organizing data analysis technique (ISODATA), genetic K-means algorithm (GKA), partitioning around medoids (PAM)
• D. pdf Estimation via Mixture Densities
  — Gaussian mixture density decomposition (GMDD), AutoClass
• E. Graph Theory-Based


  — Chameleon, Delaunay triangulation graph (DTG), highly connected subgraphs (HCS), clustering identification via connectivity kernels (CLICK), cluster affinity search technique (CAST)
• F. Combinatorial Search Techniques-Based
  — Genetically guided algorithm (GGA), TS clustering, SA clustering
• G. Fuzzy
  — Fuzzy c-means (FCM), mountain method (MM), possibilistic c-means clustering algorithm (PCM), fuzzy c-shells (FCS)
• H. Neural Networks-Based
  — Learning vector quantization (LVQ), self-organizing feature map (SOFM), ART, simplified ART (SART), hyperellipsoidal clustering network (HEC), self-splitting competitive learning network (SPLL)
• I. Kernel-Based
  — Kernel K-means, support vector clustering (SVC)
• J. Sequential Data
  — Sequence similarity
  — Indirect sequence clustering
  — Statistical sequence clustering
• K. Large-Scale Data Sets (see also Table II)
  — CLARA, CURE, CLARANS, BIRCH, DBSCAN, DENCLUE, WaveCluster, FC, ART
• L. Data Visualization and High-Dimensional Data
  — PCA, ICA, projection pursuit, Isomap, LLE, CLIQUE, OptiGrid, ORCLUS
• M. How Many Clusters?

TABLE I: SIMILARITY AND DISSIMILARITY MEASURES FOR QUANTITATIVE FEATURES

Applications in two benchmark data sets, the traveling salesman problem, and bioinformatics are illustrated in Section III. We conclude the paper in Section IV.

II. CLUSTERING ALGORITHMS

Different starting points and criteria usually lead to different taxonomies of clustering algorithms [33], [88], [124], [150], [152], [171]. A rough but widely agreed frame is to classify clustering techniques as hierarchical clustering and partitional clustering, based on the properties of the clusters generated [88], [152]. Hierarchical clustering groups data objects with a sequence of partitions, either from singleton clusters to a cluster including all individuals or vice versa, while partitional clustering directly divides data objects into some prespecified number of clusters without the hierarchical structure. We follow this frame in surveying the clustering algorithms in the literature. Beginning with the discussion on proximity measure, which is the basis for most clustering algorithms, we focus on hierarchical clustering and classical partitional clustering algorithms in Section II-B–D. Starting from part E, we introduce and analyze clustering algorithms based on a wide variety of theories and techniques, including graph theory, combinatorial search techniques, fuzzy set theory, neural networks, and kernel techniques.


Compared with graph theory and fuzzy set theory, which had already been widely used in cluster analysis before the 1980s, the other techniques have been finding their applications in clustering just in recent decades. In spite of the short history, much progress has been achieved. Note that these techniques can be used for both hierarchical and partitional clustering. Considering the more frequent requirement of tackling sequential data sets, large-scale data sets, and high-dimensional data sets in many current applications, we review clustering algorithms for them in the following three parts. We focus particular attention on clustering algorithms applied in bioinformatics. We offer a more detailed discussion on how to identify the appropriate number of clusters, which is particularly important in cluster validity, in the last part of the section.

TABLE II: COMPUTATIONAL COMPLEXITY OF CLUSTERING ALGORITHMS

A. Distance and Similarity Measures

It is natural to ask what kind of standards we should use to determine the closeness, or how to measure the distance (dissimilarity) or similarity between a pair of objects, an object and a cluster, or a pair of clusters. In the next section on hierarchical clustering, we will illustrate linkage metrics for measuring proximity between clusters. Usually, a prototype is used to represent a cluster so that it can be further processed like other objects. Here, we focus on reviewing measure approaches between individuals due to the previous consideration.

A data object is described by a set of features, usually represented as a multidimensional vector. The features can be quantitative or qualitative, continuous or binary, nominal or ordinal, which determine the corresponding measure mechanisms.

A distance or dissimilarity function $D$ on a data set is defined to satisfy the following conditions.
1) Symmetry. $D(\mathbf{x}_i, \mathbf{x}_j) = D(\mathbf{x}_j, \mathbf{x}_i)$;
2) Positivity. $D(\mathbf{x}_i, \mathbf{x}_j) \ge 0$ for all $\mathbf{x}_i$ and $\mathbf{x}_j$.
If the conditions
3) Triangle inequality. $D(\mathbf{x}_i, \mathbf{x}_j) \le D(\mathbf{x}_i, \mathbf{x}_k) + D(\mathbf{x}_k, \mathbf{x}_j)$ for all $\mathbf{x}_i$, $\mathbf{x}_j$, and $\mathbf{x}_k$; and
4) Reflexivity. $D(\mathbf{x}_i, \mathbf{x}_j) = 0$ iff $\mathbf{x}_i = \mathbf{x}_j$
also hold, it is called a metric.

Likewise, a similarity function $S$ is defined to satisfy the conditions in the following.
1) Symmetry. $S(\mathbf{x}_i, \mathbf{x}_j) = S(\mathbf{x}_j, \mathbf{x}_i)$;
2) Positivity. $0 \le S(\mathbf{x}_i, \mathbf{x}_j) \le 1$ for all $\mathbf{x}_i$ and $\mathbf{x}_j$.
If it also satisfies the conditions
3) $S(\mathbf{x}_i, \mathbf{x}_j)\, S(\mathbf{x}_j, \mathbf{x}_k) \le \left[ S(\mathbf{x}_i, \mathbf{x}_j) + S(\mathbf{x}_j, \mathbf{x}_k) \right] S(\mathbf{x}_i, \mathbf{x}_k)$ for all $\mathbf{x}_i$, $\mathbf{x}_j$, and $\mathbf{x}_k$; and
4) $S(\mathbf{x}_i, \mathbf{x}_j) = 1$ iff $\mathbf{x}_i = \mathbf{x}_j$,
it is called a similarity metric.

For a data set with $N$ input patterns, we can define an $N \times N$ symmetric matrix, called the proximity matrix, whose $(i, j)$th element represents the similarity or dissimilarity measure for the $i$th and $j$th patterns $(i, j = 1, \ldots, N)$.

Typically, distance functions are used to measure continuous features, while similarity measures are more important for qualitative variables. We summarize some typical measures for continuous features in Table I. The selection of different measures is problem dependent. For binary features, a similarity measure is commonly used (dissimilarity measures can be obtained by simply using $D = 1 - S$). Suppose we use two binary subscripts to count features in two objects: $n_{00}$ and $n_{11}$ represent the number of simultaneous absence or presence of features in two objects, and $n_{01}$ and $n_{10}$ count the features present only in one object. Then two types of commonly used similarity measures for data points $\mathbf{x}_i$ and $\mathbf{x}_j$ are illustrated in the following.

• $S_{ij} = \dfrac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{10} + n_{01}}$ (simple matching coefficient), $S_{ij} = \dfrac{n_{11} + n_{00}}{n_{11} + n_{00} + 2(n_{10} + n_{01})}$ (Rogers and Tanimoto measure), and $S_{ij} = \dfrac{n_{11} + n_{00}}{n_{11} + n_{00} + (n_{10} + n_{01})/2}$ (Gower and Legendre measure). These measures compute the match between two objects directly. Unmatched pairs are weighted based on their contribution to the similarity.

• $S_{ij} = \dfrac{n_{11}}{n_{11} + n_{10} + n_{01}}$ (Jaccard coefficient), $S_{ij} = \dfrac{n_{11}}{n_{11} + 2(n_{10} + n_{01})}$ (Sokal and Sneath measure), and $S_{ij} = \dfrac{n_{11}}{n_{11} + (n_{10} + n_{01})/2}$ (Gower and Legendre measure). These measures focus on the co-occurrence features while ignoring the effect of co-absence.
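As a small numerical illustration of these binary measures, the sketch below (ours; the function names are our own) counts $n_{11}$, $n_{00}$, $n_{10}$, and $n_{01}$ for two binary feature vectors and evaluates the simple matching and Jaccard coefficients.

def binary_counts(x, y):
    """Count simultaneous presence/absence and single-object presence."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return n11, n00, n10, n01

def simple_matching(x, y):
    n11, n00, n10, n01 = binary_counts(x, y)
    return (n11 + n00) / (n11 + n00 + n10 + n01)   # rewards co-absence too

def jaccard(x, y):
    n11, n00, n10, n01 = binary_counts(x, y)
    return n11 / (n11 + n10 + n01)                 # ignores co-absence

x = [1, 0, 1, 1, 0, 0]
y = [1, 0, 0, 1, 0, 1]
print(simple_matching(x, y), jaccard(x, y))        # 0.666..., 0.5

The example makes the distinction between the two families visible: the shared absences at two positions raise the simple matching value but leave the Jaccard coefficient untouched.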


For nominal features that have more than two states, a simple strategy is to map them into new binary features [161], while a more effective method utilizes the matching criterion

$D(\mathbf{x}_i, \mathbf{x}_j) = \sum_{l=1}^{d} \delta(x_{il}, x_{jl})$, where $\delta(x_{il}, x_{jl}) = \begin{cases} 1 & \text{if } x_{il} \text{ and } x_{jl} \text{ do not match} \\ 0 & \text{if } x_{il} \text{ and } x_{jl} \text{ match} \end{cases}$

[88]. Ordinal features order multiple states according to some standard and can be compared by using the continuous dissimilarity measures discussed in [161]. Edit distance for alphabetic sequences is discussed in Section II-J. More discussion on sequence and string comparisons can be found in [120] and [236].

Generally, for objects consisting of mixed variables, we can map all these variables into the interval (0, 1) and use measures like the Euclidean metric. Alternatively, we can transform them into binary variables and use binary similarity functions. The drawback of these methods is the information loss. A more powerful method was described by Gower in the form of $S(\mathbf{x}_i, \mathbf{x}_j) = \sum_{l=1}^{d} w_{ijl}\, s_{ijl} \big/ \sum_{l=1}^{d} w_{ijl}$, where $s_{ijl}$ indicates the similarity for the $l$th feature and $w_{ijl}$ is a 0–1 coefficient based on whether the measure of the two objects is missing [88], [112].

B. Hierarchical Clustering

Hierarchical clustering (HC) algorithms organize data into a hierarchical structure according to the proximity matrix. The results of HC are usually depicted by a binary tree or dendrogram. The root node of the dendrogram represents the whole data set and each leaf node is regarded as a data object. The intermediate nodes, thus, describe the extent to which the objects are proximal to each other; and the height of the dendrogram usually expresses the distance between each pair of objects or clusters, or an object and a cluster. The ultimate clustering results can be obtained by cutting the dendrogram at different levels. This representation provides very informative descriptions and visualization for the potential data clustering structures, especially when real hierarchical relations exist in the data, like the data from evolutionary research on different species of organisms. HC algorithms are mainly classified as agglomerative methods and divisive methods. Agglomerative clustering starts with $N$ clusters, each of which includes exactly one object. A series of merge operations are then followed out that finally lead all objects to the same group. Divisive clustering proceeds in an opposite way. In the beginning, the entire data set belongs to a cluster and a procedure successively divides it until all clusters are singleton clusters. For a cluster with $N$ objects, there are $2^{N-1} - 1$ possible two-subset divisions, which is very expensive in computation [88]. Therefore, divisive clustering is not commonly used in practice. We focus on agglomerative clustering in the following discussion; some divisive clustering applications for binary data can be found in [88]. Two divisive clustering algorithms, named MONA and DIANA, are described in [161].

The general agglomerative clustering can be summarized by the following procedure.
1) Start with $N$ singleton clusters. Calculate the proximity matrix for the $N$ clusters.
2) Search the minimal distance $D(C_i, C_j) = \min_{1 \le m, l \le N,\ m \neq l} D(C_m, C_l)$, where $D(\cdot, \cdot)$ is the distance function discussed before, in the proximity matrix, and combine clusters $C_i$ and $C_j$ to form a new cluster.
3) Update the proximity matrix by computing the distances between the new cluster and the other clusters.
4) Repeat steps 2)–3) until all objects are in the same cluster.

Based on the different definitions for distance between two clusters, there are many agglomerative clustering algorithms. The simplest and most popular methods include the single linkage [256] and complete linkage techniques [258]. For the single linkage method, the distance between two clusters is determined by the two closest objects in different clusters, so it is also called the nearest neighbor method. On the contrary, the complete linkage method uses the farthest distance of a pair of objects to define the inter-cluster distance. Both the single linkage and the complete linkage methods can be generalized by the recurrence formula proposed by Lance and Williams [178] as

$D\big(C_l, (C_i, C_j)\big) = \alpha_i D(C_l, C_i) + \alpha_j D(C_l, C_j) + \beta D(C_i, C_j) + \gamma \left| D(C_l, C_i) - D(C_l, C_j) \right|$

where $D(\cdot, \cdot)$ is the distance function and $\alpha_i$, $\alpha_j$, $\beta$, and $\gamma$ are coefficients that take values dependent on the scheme used. The formula describes the distance between a cluster $C_l$ and a new cluster formed by the merge of two clusters $C_i$ and $C_j$. Note that when $\alpha_i = \alpha_j = 1/2$, $\beta = 0$, and $\gamma = -1/2$, the formula becomes

$D\big(C_l, (C_i, C_j)\big) = \min\big(D(C_l, C_i), D(C_l, C_j)\big)$

which corresponds to the single linkage method. When $\alpha_i = \alpha_j = 1/2$, $\beta = 0$, and $\gamma = 1/2$, the formula is

$D\big(C_l, (C_i, C_j)\big) = \max\big(D(C_l, C_i), D(C_l, C_j)\big)$

which corresponds to the complete linkage method.

Several more complicated agglomerative clustering algorithms, including group average linkage, median linkage, centroid linkage, and Ward's method, can also be constructed by selecting appropriate coefficients in the formula. A detailed table describing the coefficient values for different algorithms is offered in [150] and [210]. Single linkage, complete linkage, and average linkage consider all points of a pair of clusters when calculating their inter-cluster distance, and are also called graph methods. The others are called geometric methods since they use geometric centers to represent clusters and determine their distances. Remarks on the important features and properties of these methods are summarized in [88].
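The following sketch (our illustration, not code from the survey) runs the agglomerative procedure above on a precomputed distance matrix, using the Lance–Williams recurrence with the single-linkage or complete-linkage coefficients just given.

import numpy as np

def agglomerate(D, linkage="single"):
    """Naive agglomerative clustering on a symmetric distance matrix D.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples."""
    gamma = -0.5 if linkage == "single" else 0.5   # Lance-Williams coefficient
    n = len(D)
    clusters = list(range(n))                      # active cluster ids
    dist = {}                                      # pairwise cluster distances
    for i in range(n):
        for j in range(i + 1, n):
            dist[(i, j)] = D[i][j]
    merges, next_id = [], n
    while len(clusters) > 1:
        # step 2): find the closest pair of active clusters
        (ci, cj), dij = min(dist.items(), key=lambda kv: kv[1])
        merges.append((ci, cj, dij))
        clusters.remove(ci)
        clusters.remove(cj)
        # step 3): Lance-Williams update of distances to the new cluster
        for cl in clusters:
            d_li = dist.pop(tuple(sorted((cl, ci))))
            d_lj = dist.pop(tuple(sorted((cl, cj))))
            dist[(cl, next_id)] = 0.5 * d_li + 0.5 * d_lj + gamma * abs(d_li - d_lj)
        del dist[(ci, cj)]
        clusters.append(next_id)
        next_id += 1
    return merges

# Toy usage: four points on a line at positions 0, 1, 5, 6.
pts = np.array([[0.0], [1.0], [5.0], [6.0]])
D = np.abs(pts - pts.T)
print(agglomerate(D.tolist(), linkage="single"))

Swapping the coefficient sign reproduces complete linkage; other schemes from the coefficient table in [150], [210] would replace the update line accordingly.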


More inter-cluster distance measures, especially mean-based ones, were introduced by Yager, with further discussion of their possible effect in controlling the hierarchical clustering process [289].

The common criticism of classical HC algorithms is that they lack robustness and are, hence, sensitive to noise and outliers. Once an object is assigned to a cluster, it will not be considered again, which means that HC algorithms are not capable of correcting possible previous misclassifications. The computational complexity of most HC algorithms is at least $O(N^2)$ and this high cost limits their application to large-scale data sets. Other disadvantages of HC include the tendency to form spherical shapes and the reversal phenomenon, in which the normal hierarchical structure is distorted.

In recent years, with the requirement for handling large-scale data sets in data mining and other fields, many new HC techniques have appeared and greatly improved the clustering performance. Typical examples include CURE [116], ROCK [117], Chameleon [159], and BIRCH [295].

The main motivations of BIRCH lie in two aspects: the ability to deal with large data sets and robustness to outliers [295]. In order to achieve these goals, a new data structure, the clustering feature (CF) tree, is designed to store the summaries of the original data. The CF tree is a height-balanced tree, with each internal vertex composed of entries defined as $[\mathrm{CF}_i, \mathrm{child}_i]$, $i = 1, \ldots, B$, where $\mathrm{CF}_i$ is a representation of the cluster and is defined as $\mathrm{CF}_i = (N_i, \mathrm{LS}_i, \mathrm{SS}_i)$, where $N_i$ is the number of data objects in the cluster, $\mathrm{LS}_i$ is the linear sum of the objects, and $\mathrm{SS}_i$ is the squared sum of the objects; $\mathrm{child}_i$ is a pointer to the $i$th child node, and $B$ is a threshold parameter that determines the maximum number of entries in the vertex. Each leaf is composed of entries in the form of $[\mathrm{CF}_i]$, $i = 1, \ldots, L$, where $L$ is the threshold parameter that controls the maximum number of entries in the leaf. Moreover, the leaves must follow the restriction that the diameter of each entry in the leaf is less than a threshold $T$. The CF tree structure captures the important clustering information of the original data while reducing the required storage. Outliers are eliminated from the summaries by identifying the objects sparsely distributed in the feature space. After the CF tree is built, an agglomerative HC is applied to the set of summaries to perform global clustering. An additional step may be performed to refine the clusters. BIRCH can achieve a computational complexity of $O(N)$.

Noticing the restriction of centroid-based HC, which is unable to identify arbitrary cluster shapes, Guha, Rastogi, and Shim developed a HC algorithm, called CURE, to explore more sophisticated cluster shapes [116]. The crucial feature of CURE lies in the usage of a set of well-scattered points to represent each cluster, which makes it possible to find rich cluster shapes other than hyperspheres and avoids both the chaining effect [88] of the minimum linkage method and the tendency to favor clusters with similar sizes of centroid. These representative points are further shrunk toward the cluster centroid according to an adjustable parameter in order to weaken the effects of outliers. CURE utilizes a random sample (and partition) strategy to reduce computational complexity. Guha et al. also proposed another agglomerative HC algorithm, ROCK, to group data with qualitative attributes [117]. They used a novel measure, "link," to describe the relation between a pair of objects and their common neighbors. Like CURE, a random sample strategy is used to handle large data sets. Chameleon is constructed from graph theory and will be discussed in Section II-E.

Relative hierarchical clustering (RHC) is another exploration that considers both the internal distance (distance between a pair of clusters which may be merged to yield a new cluster) and the external distance (distance from the two clusters to the rest), and uses the ratio of them to decide the proximities [203]. Leung et al. showed an interesting hierarchical clustering based on scale-space theory [180]. They interpreted clustering using a blurring process, in which each datum is regarded as a light point in an image, and a cluster is represented as a blob. Li and Biswas extended agglomerative HC to deal with both numeric and nominal data. The proposed algorithm, called similarity-based agglomerative clustering (SBAC), employs a mixed data measure scheme that pays extra attention to less common matches of feature values [183]. Parallel techniques for HC are discussed in [69] and [217], respectively.

C. Squared Error-Based Clustering (Vector Quantization)

In contrast to hierarchical clustering, which yields a successive level of clusters by iterative fusions or divisions, partitional clustering assigns a set of objects into $K$ clusters with no hierarchical structure. In principle, the optimal partition, based on some specific criterion, can be found by enumerating all possibilities. But this brute force method is infeasible in practice, due to the expensive computation [189]. Even for a small-scale clustering problem (organizing 30 objects into 3 groups), the number of possible partitions already exceeds $10^{13}$. Therefore, heuristic algorithms have been developed in order to seek approximate solutions.

One of the important factors in partitional clustering is the criterion function [124]. The sum of squared error function is one of the most widely used criteria. Suppose we have a set of objects $\mathbf{x}_j \in \mathbf{X}$, $j = 1, \ldots, N$, and we want to organize them into $K$ subsets $C = \{C_1, \ldots, C_K\}$. The squared error criterion then is defined as

$J(\Gamma, M) = \sum_{i=1}^{K} \sum_{j=1}^{N} \gamma_{ij} \left\| \mathbf{x}_j - \mathbf{m}_i \right\|^2$

where
$\Gamma = [\gamma_{ij}]$ is a partition matrix, with $\gamma_{ij} = 1$ if $\mathbf{x}_j \in$ cluster $C_i$ and $\gamma_{ij} = 0$ otherwise, subject to $\sum_{i=1}^{K} \gamma_{ij} = 1$ for all $j$;
$M = [\mathbf{m}_1, \ldots, \mathbf{m}_K]$ is the cluster prototype or centroid (means) matrix;
$\mathbf{m}_i = (1/N_i) \sum_{j=1}^{N} \gamma_{ij}\, \mathbf{x}_j$ is the sample mean for the $i$th cluster; and
$N_i = \sum_{j=1}^{N} \gamma_{ij}$ is the number of objects in the $i$th cluster.
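A direct numerical reading of this criterion is shown below (our sketch; the variable names are ours): given data points and a hard assignment, it computes the cluster means and the value of $J$.

import numpy as np

def squared_error(X, labels, K):
    """Sum of squared errors J for a hard K-partition.
    X: (N, d) data matrix; labels[j] in {0, ..., K-1} plays the role of gamma."""
    J = 0.0
    for i in range(K):
        members = X[labels == i]               # objects assigned to cluster i
        if len(members) == 0:
            continue
        m_i = members.mean(axis=0)             # sample mean of the i-th cluster
        J += np.sum((members - m_i) ** 2)      # within-cluster squared distances
    return J

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
print(squared_error(X, labels, K=2))           # 0.5 + 0.5 = 1.0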


Note the relation between the sum of squared error criterion and the scatter matrices defined in multiclass discriminant analysis [75],

$S_T = S_W + S_B$

where
$S_T = \sum_{j=1}^{N} (\mathbf{x}_j - \bar{\mathbf{m}})(\mathbf{x}_j - \bar{\mathbf{m}})^T$ is the total scatter matrix;
$S_W = \sum_{i=1}^{K} \sum_{\mathbf{x}_j \in C_i} (\mathbf{x}_j - \mathbf{m}_i)(\mathbf{x}_j - \mathbf{m}_i)^T$ is the within-class scatter matrix;
$S_B = \sum_{i=1}^{K} N_i (\mathbf{m}_i - \bar{\mathbf{m}})(\mathbf{m}_i - \bar{\mathbf{m}})^T$ is the between-class scatter matrix; and
$\bar{\mathbf{m}} = (1/N) \sum_{j=1}^{N} \mathbf{x}_j$ is the mean vector for the whole data set.

It is not difficult to see that the criterion based on the trace of $S_W$ is the same as the sum of squared error criterion. To minimize the squared error criterion is equivalent to minimizing the trace of $S_W$ or maximizing the trace of $S_B$. We can obtain a rich class of criterion functions based on the characteristics of $S_W$ and $S_B$ [75].

The K-means algorithm is the best-known squared error-based clustering algorithm [94], [191].
1) Initialize a K-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix $M = [\mathbf{m}_1, \ldots, \mathbf{m}_K]$.
2) Assign each object $\mathbf{x}_j$ in the data set to the nearest cluster $C_w$, i.e., $\mathbf{x}_j \in C_w$ if $\|\mathbf{x}_j - \mathbf{m}_w\| < \|\mathbf{x}_j - \mathbf{m}_i\|$ for $j = 1, \ldots, N$ and $i \neq w$, $i = 1, \ldots, K$.
3) Recalculate the cluster prototype matrix based on the current partition.
4) Repeat steps 2)–3) until there is no change for each cluster.

The K-means algorithm is very simple and can be easily implemented in solving many practical problems. It can work very well for compact and hyperspherical clusters. The time complexity of K-means is $O(NKd)$. Since $K$ and $d$ are usually much less than $N$, K-means can be used to cluster large data sets. Parallel techniques for K-means have been developed that can largely accelerate the algorithm [262]. The drawbacks of K-means are also well studied, and as a result, many variants of K-means have appeared in order to overcome these obstacles. We summarize some of the major disadvantages, with the proposed improvements, in the following.

1) There is no efficient and universal method for identifying the initial partitions and the number of clusters $K$. The convergence centroids vary with different initial points. A general strategy for the problem is to run the algorithm many times with random initial partitions. Peña, Lozano, and Larrañaga compared the random method with three other classical initial partition methods, by Forgy [94], Kaufman [161], and MacQueen [191], based on effectiveness, robustness, and convergence speed criteria [227]. According to their experimental results, the random and Kaufman's methods work much better than the other two under the first two criteria and, by further considering the convergence speed, they recommended Kaufman's method. Bradley and Fayyad presented a refinement algorithm that first applies K-means to several random subsets from the original data [43]. The set formed from the union of the subset solutions (centroids of the clusters) is clustered again, setting each subset solution as the initial guess. The starting points for the whole data are obtained by choosing the solution with the minimal sum of squared distances. Likas, Vlassis, and Verbeek proposed a global K-means algorithm consisting of a series of K-means clustering procedures with the number of clusters varying from 1 to $K$ [186]. After finding the centroid for the case where only one cluster exists, at each $k$, $k = 2, \ldots, K$, the previous $k - 1$ centroids are fixed and the new centroid is selected by examining all data points. The authors claimed that the algorithm is independent of the initial partitions and provided accelerating strategies. But a problem of computational complexity exists, due to the requirement for executing K-means $N$ times for each value of $k$. An interesting technique, called ISODATA, developed by Ball and Hall [21], deals with the estimation of $K$. ISODATA can dynamically adjust the number of clusters by merging and splitting clusters according to some predefined thresholds (in this sense, the problem of identifying the initial number of clusters becomes that of parameter (threshold) tweaking). The new $K$ is used as the expected number of clusters for the next iteration.

2) The iteratively optimal procedure of K-means cannot guarantee convergence to a global optimum. Stochastic optimization techniques, like simulated annealing (SA) and genetic algorithms (also see Section II-F), can find the global optimum at the price of expensive computation. Krishna and Murty designed new operators in their hybrid scheme, GKA, in order to achieve global search and fast convergence [173]. The defined biased mutation operator is based on the Euclidean distance between an object and the centroids and aims to avoid getting stuck in a local optimum. Another operator, the K-means operator (KMO), replaces the computationally expensive crossover operators and alleviates the complexities coming with them. An adaptive learning rate strategy for online-mode K-means is illustrated in [63]. The learning rate is exclusively dependent on the within-group variations and can be adjusted without involving any user activities. The proposed enhanced LBG (ELBG) algorithm adopts a roulette mechanism typical of genetic algorithms to become near-optimal and, therefore, is not sensitive to initialization [222].

3) K-means is sensitive to outliers and noise. Even if an object is quite far away from the cluster centroid, it is still forced into a cluster and, thus, distorts the cluster shapes. ISODATA [21] and PAM [161] both consider the effect of outliers in clustering procedures. ISODATA gets rid of clusters with few objects. The splitting operation of ISODATA eliminates the possibility of elongated clusters typical of K-means. PAM utilizes real data points (medoids) as the cluster prototypes and avoids the effect of outliers. Based on the same consideration, a K-medoids algorithm is presented in [87], which searches the discrete 1-medians as the cluster centroids.

4) The definition of "means" limits the application only to numerical variables. The K-medoids algorithm mentioned previously is a natural choice when the computation of means is unavailable, since the medoids do not need any computation and always exist [161]. Huang [142] and Gupta et al. [118] defined different dissimilarity measures to extend K-means to categorical variables. For Huang's method, the clustering goal is to minimize the cost function $E = \sum_{l=1}^{K} \sum_{j=1}^{N} w_{lj}\, d(\mathbf{x}_j, \mathbf{q}_l)$, where $w_{lj}$ is an element of the partition matrix, $d(\cdot, \cdot)$ is a matching-based dissimilarity measure, and $Q = \{\mathbf{q}_1, \ldots, \mathbf{q}_K\}$ is a set of $d$-dimensional vectors. Each vector $\mathbf{q}_l$ is known as a mode and is defined to minimize the sum of distances $\sum_{j=1}^{N} w_{lj}\, d(\mathbf{x}_j, \mathbf{q}_l)$. The proposed K-modes algorithm operates in a similar way as K-means.

Several recent advances on K-means and other squared error-based clustering algorithms, with their applications, can be found in [125], [155], [222], [223], [264], and [277].
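Before moving to mixture-density methods, here is a minimal K-means sketch following steps 1)–4) above (our illustration; initialization simply takes the first $K$ points, one of the many possible choices discussed under disadvantage 1)).

import numpy as np

def k_means(X, K, max_iter=100):
    """Lloyd-style K-means: alternate nearest-prototype assignment and
    prototype recomputation until the assignment stops changing."""
    M = X[:K].copy()                       # step 1): initialize prototypes
    labels = None
    for _ in range(max_iter):
        # step 2): assign each object to its nearest prototype
        dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                          # step 4): no change -> converged
        labels = new_labels
        # step 3): recalculate the cluster prototype matrix
        for i in range(K):
            if np.any(labels == i):
                M[i] = X[labels == i].mean(axis=0)
    return labels, M

X = np.array([[0.0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])
labels, M = k_means(X, K=2)
print(labels, M)

On this toy set the procedure converges in a couple of iterations to the two compact groups; rerunning with different initial prototypes is the usual way to cope with the initialization sensitivity noted above.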


D. Mixture Densities-Based Clustering (pdf Estimation via Mixture Densities)

In the probabilistic view, data objects are assumed to be generated according to several probability distributions. Data points in different clusters were generated by different probability distributions. They can be derived from different types of density functions (e.g., multivariate Gaussian or $t$-distribution), or the same families, but with different parameters. If the distributions are known, finding the clusters of a given data set is equivalent to estimating the parameters of several underlying models. Suppose the prior probability (also known as mixing probability) $P(C_i)$ for cluster $C_i$, $i = 1, \ldots, K$ (here, $K$ is assumed to be known and methods for estimating $K$ are discussed in Section II-M), and the conditional probability density $p(\mathbf{x} \mid C_i, \boldsymbol{\theta}_i)$ (also known as component density), where $\boldsymbol{\theta}_i$ is the unknown parameter vector, are known. Then, the mixture probability density for the whole data set is expressed as

$p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{i=1}^{K} p(\mathbf{x} \mid C_i, \boldsymbol{\theta}_i)\, P(C_i)$

where $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K)$ and $\sum_{i=1}^{K} P(C_i) = 1$. As long as the parameter vector $\boldsymbol{\theta}$ is decided, the posterior probability for assigning a data point to a cluster can be easily calculated with Bayes's theorem. Here, the mixtures can be constructed with any type of component, but more commonly, multivariate Gaussian densities are used due to their complete theory and analytical tractability [88], [297].

Maximum likelihood (ML) estimation is an important statistical approach for parameter estimation [75] and it considers the best estimate as the one that maximizes the probability of generating all the observations, which is given by the joint density function

$p(\mathbf{X} \mid \boldsymbol{\theta}) = \prod_{j=1}^{N} p(\mathbf{x}_j \mid \boldsymbol{\theta})$

or, in a logarithmic form,

$l(\boldsymbol{\theta}) = \sum_{j=1}^{N} \ln p(\mathbf{x}_j \mid \boldsymbol{\theta}).$

The best estimate can be achieved by solving the log-likelihood equations $\partial l(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}_i = 0$.

Unfortunately, since the solutions of the likelihood equations cannot be obtained analytically in most circumstances [90], [197], iteratively suboptimal approaches are required to approximate the ML estimates. Among these methods, the expectation-maximization (EM) algorithm is the most popular [196]. EM regards the data set as incomplete and divides each data point $\mathbf{x}_j$ into two parts, $\mathbf{x}_j = \{\mathbf{x}_j^g, \mathbf{x}_j^m\}$, where $\mathbf{x}_j^g$ represents the observable features and $\mathbf{x}_j^m = (x_{j1}^m, \ldots, x_{jK}^m)$ is the missing data, where $x_{ji}^m$ takes the value 1 or 0 according to whether $\mathbf{x}_j^g$ belongs to the $i$th component or not. Thus, the complete data log-likelihood is

$l(\boldsymbol{\theta}) = \sum_{j=1}^{N} \sum_{i=1}^{K} x_{ji}^m \ln \left[ P(C_i)\, p(\mathbf{x}_j^g \mid \boldsymbol{\theta}_i) \right].$

The standard EM algorithm generates a series of parameter estimates $\{\boldsymbol{\theta}^0, \boldsymbol{\theta}^1, \ldots, \boldsymbol{\theta}^T\}$, where $T$ represents the reaching of the convergence criterion, through the following steps:
1) initialize $\boldsymbol{\theta}^0$ and set $t = 0$;
2) E-step: compute the expectation of the complete data log-likelihood, $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^t) = E[\,l(\boldsymbol{\theta}) \mid \mathbf{x}^g, \boldsymbol{\theta}^t\,]$;
3) M-step: select a new parameter estimate that maximizes the $Q$-function, $\boldsymbol{\theta}^{t+1} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^t)$;
4) increase $t = t + 1$; repeat steps 2)–3) until the convergence condition is satisfied.

The major disadvantages of the EM algorithm are its sensitivity to the selection of initial parameters, the effect of a singular covariance matrix, the possibility of convergence to a local optimum, and the slow convergence rate [96], [196]. Variants of EM for addressing these problems are discussed in [90] and [196].

A valuable theoretical note is the relation between the EM algorithm and the K-means algorithm. Celeux and Govaert proved that the classification EM (CEM) algorithm under a spherical Gaussian mixture is equivalent to the K-means algorithm [58].
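As a compact illustration of these E- and M-steps, the sketch below (ours; it assumes one-dimensional Gaussian components and a crude quantile-based initialization) fits a two-component Gaussian mixture by alternating responsibility computation and parameter updates.

import numpy as np

def em_gmm_1d(x, K, n_iter=50):
    """EM for a 1-D Gaussian mixture: returns priors, means, variances."""
    n = len(x)
    pri = np.full(K, 1.0 / K)                       # mixing probabilities P(C_i)
    mu = np.quantile(x, np.linspace(0.1, 0.9, K))   # spread initial means
    var = np.full(K, np.var(x))
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point
        dens = np.array([pri[i] / np.sqrt(2 * np.pi * var[i])
                         * np.exp(-(x - mu[i]) ** 2 / (2 * var[i]))
                         for i in range(K)])        # shape (K, n)
        resp = dens / dens.sum(axis=0)              # responsibilities
        # M-step: re-estimate priors, means, and variances
        nk = resp.sum(axis=1)
        pri = nk / n
        mu = (resp @ x) / nk
        var = np.array([(resp[i] * (x - mu[i]) ** 2).sum() / nk[i]
                        for i in range(K)])
    return pri, mu, var

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm_1d(x, K=2))

Hardening the responsibilities to 0/1 at each E-step (the classification EM variant mentioned above) and fixing equal spherical variances recovers a K-means-like update.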


Fraley and Raftery described a comprehensive mixture-model based clustering scheme [96], which was implemented as a software package, known as MCLUST [95]. In this case, the component density is multivariate Gaussian, with a mean vector $\boldsymbol{\mu}$ and a covariance matrix $\boldsymbol{\Sigma}$ as the parameters to be estimated. The covariance matrix for each component can further be parameterized by virtue of eigenvalue decomposition, represented as $\boldsymbol{\Sigma}_k = \lambda_k \mathbf{D}_k \mathbf{A}_k \mathbf{D}_k^T$, where $\lambda_k$ is a scalar, $\mathbf{D}_k$ is the orthogonal matrix of eigenvectors, and $\mathbf{A}_k$ is the diagonal matrix based on the eigenvalues of $\boldsymbol{\Sigma}_k$ [96]. These three elements determine the geometric properties of each component. After the maximum number of clusters and the candidate models are specified, an agglomerative hierarchical clustering is used to ignite the EM algorithm by forming an initial partition, which includes at most the maximum number of clusters, for each model. The optimal clustering result is achieved by checking the Bayesian information criterion (BIC) value discussed in Section II-M. GMDD is also based on multivariate Gaussian densities and is designed as a recursive algorithm that sequentially estimates each component [297]. GMDD views data points that are not generated from a distribution as noise and utilizes an enhanced model-fitting estimator to construct each component from the contaminated model. AutoClass considers more families of probability distributions (e.g., Poisson and Bernoulli) for different data types [59]. A Bayesian approach is used in AutoClass to find out the optimal partition of the given data based on the prior probabilities. Its parallel realization is described in [228]. Other important algorithms and programs include Multimix [147], the EM-based mixture program (EMMIX) [198], and Snob [278].

E. Graph Theory-Based Clustering

The concepts and properties of graph theory [126] make it very convenient to describe clustering problems by means of graphs. Nodes of a weighted graph correspond to data points in the pattern space and edges reflect the proximities between each pair of data points. If the dissimilarity matrix is defined as

$d_{ij} = \begin{cases} 1 & \text{if } D(\mathbf{x}_i, \mathbf{x}_j) < d_0 \\ 0 & \text{otherwise} \end{cases}$

where $d_0$ is a threshold value, the graph is simplified to an unweighted threshold graph. Both the single linkage HC and the complete linkage HC can be described on the basis of the threshold graph. Single linkage clustering is equivalent to seeking maximally connected subgraphs (components) while complete linkage clustering corresponds to finding maximally complete subgraphs (cliques) [150]. Jain and Dubes illustrated and discussed more applications of graph theory (e.g., Hubert's algorithm and Johnson's algorithm) for hierarchical clustering in [150]. Chameleon [159] is a newly developed agglomerative HC algorithm based on the $k$-nearest-neighbor graph, in which an edge is eliminated if both vertices are not within the $k$ closest points related to each other. At the first step, Chameleon divides the connectivity graph into a set of subclusters with the minimal edge cut. Each subgraph should contain enough nodes in order for effective similarity computation. By combining both the relative interconnectivity and relative closeness, which make Chameleon flexible enough to explore the characteristics of potential clusters, Chameleon merges these small subsets and, thus, comes up with the ultimate clustering solutions. Here, the relative interconnectivity (or closeness) is obtained by normalizing the sum of weights (or average weight) of the edges connecting the two clusters over the internal connectivity (or closeness) of the clusters. The DTG is another important graph representation for HC analysis. Cherng and Lo constructed a hypergraph (each edge is allowed to connect more than two vertices) from the DTG and used a two-phase algorithm that is similar to Chameleon to find clusters [61]. Another DTG-based application, known as the AMOEBA algorithm, is presented in [86].

Graph theory can also be used for nonhierarchical clusters. Zahn's clustering algorithm seeks connected components as clusters by detecting and discarding inconsistent edges in the minimum spanning tree [150]. Hartuv and Shamir treated clusters as HCSs, where "highly connected" means the connectivity (the minimum number of edges needed to disconnect a graph) of the subgraph is at least half as great as the number of the vertices [128]. A minimum cut (mincut) procedure, which aims to separate a graph with a minimum number of edges, is used to find these HCSs recursively. Another algorithm, called CLICK, is based on the calculation of the minimum weight cut to form clusters [247]. Here, the graph is weighted and the edge weights are assigned a new interpretation, by combining probability and graph theory. The edge weight between nodes $i$ and $j$ is defined as

$w_{ij} = \ln \frac{\operatorname{Prob}(i \text{ and } j \text{ belong to the same cluster} \mid S_{ij})}{\operatorname{Prob}(i \text{ and } j \text{ do not belong to the same cluster} \mid S_{ij})}$

where $S_{ij}$ represents the similarity between the two nodes. CLICK further assumes that the similarity values within clusters and between clusters follow Gaussian distributions with different means and variances, respectively. Therefore, the previous equation can be rewritten, by using Bayes' theorem, as

$w_{ij} = \ln \frac{p_{\mathrm{mates}}\, f(S_{ij} \mid \mu_W, \sigma_W^2)}{(1 - p_{\mathrm{mates}})\, f(S_{ij} \mid \mu_B, \sigma_B^2)}$

where $p_{\mathrm{mates}}$ is the prior probability that two objects belong to the same cluster and $(\mu_B, \sigma_B^2)$ and $(\mu_W, \sigma_W^2)$ are the means and variances for between-cluster similarities and within-cluster similarities, respectively. These parameters can be estimated either from prior knowledge or by using parameter estimation methods [75]. CLICK recursively checks the current subgraph, and generates a kernel list, which consists of the components satisfying some criterion function. Subgraphs that include only one node are regarded as singletons, and are separated for further manipulation. Using the kernels as the basic clusters, CLICK carries out a series of singleton adoptions and cluster merges to generate the resulting clusters. Additional heuristics are provided to accelerate the algorithm performance.
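To make the threshold-graph view concrete, the following sketch (our illustration) builds the unweighted threshold graph for a given $d_0$ and extracts its connected components, which, as noted above, coincide with single linkage clusters at that threshold.

import numpy as np

def threshold_components(D, d0):
    """Clusters = connected components of the graph with an edge (i, j)
    whenever D[i][j] < d0 (single linkage at threshold d0)."""
    n = len(D)
    unvisited = set(range(n))
    components = []
    while unvisited:
        stack = [unvisited.pop()]
        comp = set(stack)
        while stack:                      # depth-first traversal
            i = stack.pop()
            for j in list(unvisited):
                if D[i][j] < d0:          # edge of the threshold graph
                    unvisited.remove(j)
                    comp.add(j)
                    stack.append(j)
        components.append(sorted(comp))
    return components

pts = np.array([[0.0], [1.0], [1.5], [7.0], [7.4]])
D = np.abs(pts - pts.T)
print(sorted(threshold_components(D, d0=2.0)))    # [[0, 1, 2], [3, 4]]

Replacing "connected component" by "maximal clique" in the same graph would give the complete-linkage counterpart, at a much higher computational cost.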


model, with a probability . Proofs were given for recovering Hall, Özyurt, and Bezdek proposed a GGA that can be re-
the uncorrupted graph with a high probability. CAST is the garded as a general scheme for center-based (hard or fuzzy)
heuristic implementation of the original theoretical version. clustering problems [122]. Fitness functions are reformulated
CAST creates clusters sequentially, and each cluster begins from the standard sum of squared error criterion function in
with a random and unassigned data point. The relation between order to adapt the change of the construction of the optimiza-
a data point and a cluster being built is determined by the tion problem (only the prototype matrix is needed)
affinity, defined as , and the affinity threshold
parameter . When , it means that the data point is
highly related to the cluster and vice versa. CAST alternately
adds high affinity data points or deletes low affinity data points
from the cluster until no more changes occur. for hard clustering

F. Combinatorial Search Techniques-Based Clustering


The basic object of search techniques is to find the global
or approximate global optimum for combinatorial optimization for fuzzy clustering
problems, which usually have NP-hard complexity and need to
search an exponentially large solution space. Clustering can be where , is the distance between
regarded as a category of optimization problems. Given a set of the th cluster and the th data object, and is the fuzzification
data points , clustering algorithms aim parameter.
to organize them into subsets that optimize GGA proceeds with the following steps.
some criterion function. The possible partition for points into 1) Choose appropriate parameters for the algorithm. Ini-
clusters is given by the formula [189] tialize the population randomly with individuals,
each of which represents a prototype matrix and
is encoded as gray codes. Calculate the fitness value for
each individual.
2) Use selection (tournament selection) operator to
As shown before, even for small and , the computa- choose parental members for reproduction.
tional complexity is extremely expensive, not to mention the 3) Use crossover (two-point crossover) and mutation (bit-
large-scale clustering problems frequently encountered in recent wise mutation) operator to generate offspring from the
Simple local search techniques, like hill-climbing algorithms, are utilized to find the partitions, but they are easily stuck in local minima and therefore cannot guarantee optimality. More complex search methods can explore the solution space more flexibly and efficiently; evolutionary algorithms (EAs) [93], SA [165], and tabu search (TS) [108] are known as stochastic optimization methods, while deterministic annealing (DA) [139], [234] is the most typical deterministic search technique.

Inspired by the natural evolution process, evolutionary computation, which consists of genetic algorithms (GAs), evolution strategies (ESs), evolutionary programming (EP), and genetic programming (GP), optimizes a population of structures by using a set of evolutionary operators [93]. An optimization function, called the fitness function, is the standard for evaluating the optimizing degree of the population, in which each individual has its corresponding fitness. Selection, recombination, and mutation are the most widely used evolutionary operators. The selection operator ensures the continuity of the population by favoring the best individuals in the next generation. The recombination and mutation operators support the diversity of the population by exerting perturbations on the individuals. Among the many EAs, GAs [140] are the most popular approaches applied in cluster analysis. In GAs, each individual is usually encoded as a binary bit string, called a chromosome. After an initial population is generated according to some heuristic rules or just randomly, a series of operations, including selection, crossover, and mutation, are iteratively applied to the population until the stop condition is satisfied.

In the genetically guided algorithm (GGA) of Hall et al. [122], the clustering criterion function is reformulated in order to adapt to the change in the construction of the optimization problem (only the prototype matrix $\mathbf{M}$ is needed)

$$R_{1}(\mathbf{M}) = \sum_{j=1}^{N}\min_{1\le i\le K} D_{ij}^{2} \quad \text{for hard clustering}$$

$$R_{m}(\mathbf{M}) = \sum_{j=1}^{N}\Big(\sum_{i=1}^{K}D_{ij}^{2/(1-m)}\Big)^{1-m} \quad \text{for fuzzy clustering}$$

where $D_{ij}$ is the distance between the $i$th cluster and the $j$th data object, and $m$ is the fuzzification parameter.

GGA proceeds with the following steps.
1) Choose appropriate parameters for the algorithm. Initialize the population randomly; each individual represents a prototype matrix and is encoded as gray codes. Calculate the fitness value for each individual.
2) Use the selection (tournament selection) operator to choose parental members for reproduction.
3) Use the crossover (two-point crossover) and mutation (bitwise mutation) operators to generate offspring from the individuals chosen in step 2).
4) Determine the next generation by keeping the individuals with the highest fitness.
5) Repeat steps 2)–4) until the termination condition is satisfied.
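To make these steps concrete, here is a minimal sketch of a GGA-style loop. The specific design choices (real-valued prototype encoding instead of gray codes, uniform prototype crossover, Gaussian mutation, elitist replacement) are illustrative assumptions, not the published algorithm; the reformulated hard-clustering criterion above serves as the fitness.

```python
import numpy as np

def hard_criterion(prototypes, data):
    """Reformulated hard-clustering criterion R_1: sum of squared distances to the nearest prototype."""
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def ga_cluster(data, k, pop_size=30, generations=100, rng=np.random.default_rng(0)):
    n, dim = data.shape
    # 1) Random initial population: each individual is a K x dim prototype matrix.
    pop = data[rng.integers(0, n, size=(pop_size, k))]
    fitness = np.array([hard_criterion(ind, data) for ind in pop])
    for _ in range(generations):
        # 2) Tournament selection of parents (lower criterion value = fitter).
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        parents = pop[np.where(fitness[idx[:, 0]] < fitness[idx[:, 1]], idx[:, 0], idx[:, 1])]
        # 3) Crossover (uniform prototype swap between two parents) and Gaussian mutation.
        mates = parents[rng.permutation(pop_size)]
        mask = rng.random((pop_size, k, 1)) < 0.5
        children = np.where(mask, parents, mates)
        children = children + 0.05 * rng.standard_normal(children.shape) * data.std(0)
        # 4) Elitist replacement: keep the best individuals among parents and children.
        merged = np.concatenate([pop, children])
        merged_fit = np.concatenate([fitness, [hard_criterion(c, data) for c in children]])
        best = np.argsort(merged_fit)[:pop_size]
        pop, fitness = merged[best], merged_fit[best]
    return pop[0]  # best prototype matrix found
```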


Other GAs-based clustering applications have appeared based on a similar framework. They differ in the meaning of an individual in the population, the encoding methods, the fitness function definition, and the evolutionary operators [67], [195], [273]. The algorithm CLUSTERING in [273] includes a heuristic scheme for estimating the appropriate number of clusters in the data. It also uses a nearest-neighbor algorithm to divide the data into small subsets before GAs-based clustering, in order to reduce the computational complexity. GAs are very useful for improving the performance of $K$-means algorithms. Babu and Murty used GAs to find good initial partitions [15]. Krishna and Murty combined GAs with $K$-means and developed the GKA algorithm, which can find the global optimum [173]. As indicated in Section II-C, the algorithm ELBG uses the roulette mechanism to address the problems caused by bad initialization [222]. It is worthwhile to note that ELBG is equivalent to another algorithm, the fully automatic clustering system (FACS) [223], in terms of quantization level detection; the difference lies in the input parameters employed (ELBG adopts the number of quantization levels, while FACS uses the desired distortion error). Beyond the previous applications, GAs can also be used for hierarchical clustering. Lozano and Larrañaga discussed the properties of ultrametric distance [127] and reformulated hierarchical clustering as an optimization problem that tries to find the closest ultrametric distance for a given dissimilarity with Euclidean norm [190]. They suggested an order-based GA to solve the problem. Clustering algorithms based on ESs and EP are described and analyzed in [16] and [106], respectively.

TS is a combinatorial search technique that uses a tabu list to guide the search process, which consists of a sequence of moves. The tabu list stores part or all of the previously selected moves according to a specified size; these moves are forbidden in the current search and are called tabu. In the TS clustering algorithm developed by Al-Sultan [9], a set of candidate solutions is generated from the current solution with some strategy. Each candidate solution represents an allocation of the data objects to clusters. The candidate with the optimal cost function is selected as the current solution and appended to the tabu list if it is not already in the tabu list or if it meets the aspiration criterion, which can overrule the tabu restriction. Otherwise, the remaining candidates are evaluated in the order of their cost function values until one satisfies these conditions. When all the candidates are tabu, a new set of candidate solutions is created, followed by the same search process. The search proceeds until the maximum number of iterations is reached. Sung and Jin's method includes more elaborate search processes with packing and releasing procedures [266]. They also used a secondary tabu list to keep the search from being trapped in potential cycles. A fuzzy version of TS clustering can be found in [72].

SA is also a sequential and global search technique and is motivated by the annealing process in metallurgy [165]. SA allows the search process to accept a worse solution with a certain probability, usually expressed as $\exp(-\Delta E/T)$, where $\Delta E$ is the change of the energy (cost function) and $T$ is a control parameter known as the temperature. The temperature goes through an annealing schedule from an initial high value to an ultimate low value, which means that SA attempts to explore the solution space more completely at high temperatures while favoring the solutions that lead to lower energy at low temperatures. SA-based clustering was reported in [47] and [245]; the former illustrated an application of SA clustering to evaluate different clustering criteria, and the latter investigated the effects of the input parameters on the clustering performance.

Hybrid approaches that combine these search techniques have also been proposed. A tabu list is used in a GA clustering algorithm to preserve the variety of the population and avoid repeated computation [243]. An application of SA for improving TS was reported in [64]; the algorithm further reduces the possible moves to local optima.

The main drawback that plagues search techniques-based clustering algorithms is parameter selection. More often than not, search techniques introduce more parameters than other methods (like $K$-means), and there are no theoretical guidelines for selecting appropriate and effective parameters. Hall et al. provided some methods for setting parameters in their GAs-based clustering framework [122], but most of these criteria are still obtained empirically. The same situation exists for TS and SA clustering [9], [245]. Another problem is the computational cost paid for convergence to global optima; the high computational requirement limits their applications in large-scale data sets.

G. Fuzzy Clustering

Except for GGA, the clustering techniques we have discussed so far are referred to as hard or crisp clustering, which means that each object is assigned to only one cluster. For fuzzy clustering, this restriction is relaxed, and an object can belong to all of the clusters with a certain degree of membership [293]. This is particularly useful when the boundaries among the clusters are ambiguous and not well separated. Moreover, the memberships may help us discover more sophisticated relations between a given object and the disclosed clusters.

FCM is one of the most popular fuzzy clustering algorithms [141]. FCM can be regarded as a generalization of ISODATA [76] and was realized by Bezdek [35]. FCM attempts to find a partition ($K$ fuzzy clusters) for a set of $N$ data points $\{\mathbf{x}_j\}_{j=1}^{N}$ while minimizing the cost function

$$J(\mathbf{U}, \mathbf{M}) = \sum_{j=1}^{N}\sum_{i=1}^{K} u_{ij}^{m} D_{ij}^{2}$$

where
$\mathbf{U} = [u_{ij}]$ is the fuzzy partition matrix and $u_{ij}\in[0,1]$ is the membership coefficient of the $j$th object in the $i$th cluster;
$\mathbf{M} = [\mathbf{m}_1, \ldots, \mathbf{m}_K]$ is the cluster prototype (mean or center) matrix;
$m\in[1,\infty)$ is the fuzzification parameter, usually set to 2 [129];
$D_{ij} = D(\mathbf{x}_j, \mathbf{m}_i)$ is the distance measure between $\mathbf{x}_j$ and $\mathbf{m}_i$.
We summarize the standard FCM as follows, in which the Euclidean ($L_2$) norm distance function is used.
1) Select appropriate values for $m$, $K$, and a small positive number $\varepsilon$. Initialize the prototype matrix $\mathbf{M}$ randomly. Set the step variable $t = 0$.
2) Calculate (at $t = 0$) or update (at $t > 0$) the membership matrix $\mathbf{U}$ by
$$u_{ij}^{(t+1)} = 1\Big/\sum_{l=1}^{K}\big(D_{ij}/D_{lj}\big)^{2/(m-1)}$$
for $i = 1, \ldots, K$ and $j = 1, \ldots, N$.
3) Update the prototype matrix $\mathbf{M}$ by
$$\mathbf{m}_{i}^{(t+1)} = \Big(\sum_{j=1}^{N}\big(u_{ij}^{(t+1)}\big)^{m}\mathbf{x}_j\Big)\Big/\sum_{j=1}^{N}\big(u_{ij}^{(t+1)}\big)^{m}$$
for $i = 1, \ldots, K$.
4) Repeat steps 2)–3) until $\|\mathbf{M}^{(t+1)} - \mathbf{M}^{(t)}\| < \varepsilon$.
Numerous FCM variants and other fuzzy clustering algorithms have appeared as a result of the intensive investigation of the distance measure functions, the effect of the weighting exponent on fuzziness control, the optimization approaches for fuzzy partition, and improvements of the drawbacks of FCM [84], [141].

Like its hard counterpart, FCM also suffers from the presence of noise and outliers and the difficulty of identifying the initial partitions.
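A compact sketch of the alternating FCM update summarized above follows; the Euclidean distance, the fuzzifier $m = 2$, and random initial prototypes are illustrative assumptions.

```python
import numpy as np

def fcm(data, k, m=2.0, eps=1e-4, max_iter=200, rng=np.random.default_rng(0)):
    """Standard fuzzy c-means: alternate membership and prototype updates until the prototypes stabilize."""
    n, _ = data.shape
    prototypes = data[rng.choice(n, size=k, replace=False)]
    u = None
    for _ in range(max_iter):
        # Squared Euclidean distances D_ij between prototype i and object j.
        d2 = ((prototypes[:, None, :] - data[None, :, :]) ** 2).sum(-1) + 1e-12
        # Membership update: u_ij = 1 / sum_l (D_ij / D_lj)^(2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)
        # Prototype update: weighted means with weights u_ij^m.
        w = u ** m
        new_prototypes = (w @ data) / w.sum(axis=1, keepdims=True)
        if np.linalg.norm(new_prototypes - prototypes) < eps:
            prototypes = new_prototypes
            break
        prototypes = new_prototypes
    return prototypes, u
```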


Yager and Filev proposed the mountain method (MM) in order to estimate the centers of clusters [290]. Candidate centers consist of a set of vertices that are formed by building a grid on the pattern space. The mountain function for a vertex $\mathbf{v}_j$ is defined as

$$M(\mathbf{v}_j) = \sum_{i=1}^{N}\exp\!\big(-\alpha\, d(\mathbf{x}_i, \mathbf{v}_j)\big)$$

where $d(\mathbf{x}_i, \mathbf{v}_j)$ is the distance between the $i$th data object and the $j$th node, and $\alpha$ is a positive constant. Therefore, the closer a data object is to a vertex, the more the data object contributes to the mountain function. The vertex with the maximum value of the mountain function is selected as the first center. A procedure, called mountain destruction, is performed to get rid of the effects of the selected center. This is achieved by subtracting from the mountain function value of each remaining vertex an amount dependent on the current maximum mountain function value and the distance between the vertex and the center. The process iterates until the ratio between the current maximum and the original maximum mountain value falls below some threshold. The connection of MM with several other fuzzy clustering algorithms was further discussed in [71]. Gath and Geva described an initialization strategy of unsupervised tracking of cluster prototypes in their two-layer clustering scheme, in which FCM and fuzzy ML estimation are effectively combined [102].

Kersten suggested that the city block distance ($L_1$ norm) could improve the robustness of FCM to outliers [163]. Furthermore, Hathaway, Bezdek, and Hu extended FCM to a more universal case by using the Minkowski distance ($L_p$ norm, $p \ge 1$) and seminorm for models that operate either directly on the data objects or indirectly on the dissimilarity measures [130]. According to their empirical results, the object-data-based models with $L_1$ and $L_2$ norms are recommended. They also pointed out the possible improvement of models for other norms, at the price of more complicated optimization operations. PCM is another approach for dealing with outliers [175]. Under this model, the memberships are interpreted from a possibilistic view, i.e., as "the compatibilities of the points with the class prototypes" [175]. The effect of noise and outliers is abated with the consideration of typicality. In this case, the first condition for the membership coefficient described in Section I is relaxed, so that the memberships of an object across the clusters are no longer required to sum to one. Accordingly, the cost function is reformulated as

$$J(\mathbf{U}, \mathbf{M}) = \sum_{j=1}^{N}\sum_{i=1}^{K}u_{ij}^{m}D_{ij}^{2} + \sum_{i=1}^{K}\eta_i\sum_{j=1}^{N}\big(1 - u_{ij}\big)^{m}$$

where the $\eta_i$ are some positive constants. The additional term tends to give credit to memberships with large values. A modified version designed to find an appropriate number of clusters is proposed in [294]. Davé and Krishnapuram further elaborated the discussion on fuzzy clustering robustness and indicated its connection with robust statistics [71]. Relations among some widely used fuzzy clustering algorithms were discussed and their similarities to some robust statistical methods were also reviewed. They reached a unified framework as the conclusion of the discussion and proposed generic algorithms for robust clustering.

The standard FCM alternates the calculation of the membership and prototype matrices, which causes a computational burden for large-scale data sets. Kolen and Hutcheson accelerated the computation by combining the updates of the two matrices [172]. Hung and Yang proposed a method to reduce the computational time by identifying more accurate cluster centers [146]. FCM variants were also developed to deal with other data types, such as symbolic data [81] and data with missing values [129].

A family of fuzzy $c$-shells algorithms has also appeared to detect different types of cluster shapes, especially contours (lines, circles, ellipses, rings, rectangles, hyperbolas) in a two-dimensional data space. They use "shells" (curved surfaces [70]) as the cluster prototypes instead of the points or surfaces used in traditional fuzzy clustering algorithms. In the case of FCS [36], [70], the proposed cluster prototype is represented as a hyperspherical shell $\beta_i = (\mathbf{c}_i, r_i)$ (a circle in the two-dimensional case), where $\mathbf{c}_i$ is the center and $r_i$ is the radius. A distance function, defined as the deviation of $\|\mathbf{x}_j - \mathbf{c}_i\|$ from the radius $r_i$, measures the distance from a data object to the prototype $\beta_i$. Similarly, other cluster shapes can be achieved by defining appropriate prototypes and corresponding distance functions, examples including fuzzy $c$-spherical shells (FCSS) [176], fuzzy $c$-rings (FCR) [193], fuzzy $c$-quadratic shells (FCQS) [174], and fuzzy $c$-rectangular shells (FCRS) [137]. See [141] for further details.

Fuzzy set theories can also be used to create a hierarchical cluster structure. Geva proposed a hierarchical unsupervised fuzzy clustering (HUFC) algorithm [104], which can effectively explore data structure at different levels like HC, while establishing the connections between each object and cluster in the hierarchy through the memberships. This design lets HUFC overcome one of the major disadvantages of HC, i.e., that HC cannot reassign an object once it has been designated to a cluster. Fuzzy clustering is also closely related to neural networks [24], and we will see more discussion in the following section.

H. Neural Networks-Based Clustering

Neural networks-based clustering has been dominated by SOFMs and adaptive resonance theory (ART), both of which are reviewed here, followed by a brief discussion of other approaches.

In competitive neural networks, active neurons reinforce their neighborhood within certain regions, while suppressing the activities of other neurons (so-called on-center/off-surround competition). Typical examples include LVQ and SOFM [168], [169]. Intrinsically, LVQ performs supervised learning and is not categorized as a clustering algorithm [169], [221]. But its learning properties provide an insight into describing the potential data structure using the prototype vectors in the competitive layer. By pointing out the limitations of LVQ, including sensitivity to initiation and the lack of a definite clustering objective, Pal, Bezdek, and Tsao proposed a general LVQ algorithm for clustering, known as GLVQ [221] (also see [157] for its improved version GLVQ-F). They constructed the clustering problem as an optimization process based on minimizing a loss function, which is defined on the locally weighted error between the input pattern and the winning prototype. They also showed the relations between LVQ and the online $K$-means algorithm.
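The winner-take-all prototype update underlying this family can be sketched in a few lines. The following illustrative fragment performs the online $K$-means update referred to in the GLVQ discussion; the count-based decaying learning rate is an assumed, common schedule.

```python
import numpy as np

def online_kmeans(data, k, epochs=10, rng=np.random.default_rng(0)):
    """Sequential winner-take-all updates: move only the winning prototype toward each pattern."""
    prototypes = data[rng.choice(len(data), size=k, replace=False)].copy()
    counts = np.zeros(k)
    for _ in range(epochs):
        for x in rng.permutation(data):
            winner = np.argmin(((prototypes - x) ** 2).sum(axis=1))
            counts[winner] += 1
            lr = 1.0 / counts[winner]          # decaying learning rate
            prototypes[winner] += lr * (x - prototypes[winner])
    return prototypes
```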


Soft LVQ algorithms, e.g., fuzzy algorithms for LVQ (FALVQ), were discussed in [156].

The objective of SOFM is to represent high-dimensional input patterns with prototype vectors that can be visualized in a (usually two-dimensional) lattice structure [168], [169]. Each unit in the lattice is called a neuron, and adjacent neurons are connected to each other, which gives a clear topology of how the network fits itself to the input space. Input patterns are fully connected to all neurons via adaptable weights, and during the training process, neighboring input patterns are projected into the lattice, corresponding to adjacent neurons. In this sense, some authors prefer to think of SOFM as a method for displaying latent data structure in a visual way rather than as a clustering approach [221]. Basic SOFM training goes through the following steps.
1) Define the topology of the SOFM; initialize the prototype vectors $\mathbf{w}_j$ randomly.
2) Present an input pattern $\mathbf{x}$ to the network; choose the winning node $J$ that is closest to $\mathbf{x}$, i.e., $J = \arg\min_{j}\|\mathbf{x} - \mathbf{w}_j\|$.
3) Update the prototype vectors
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + h_{Jj}(t)\big[\mathbf{x} - \mathbf{w}_j(t)\big]$$
where $h_{Jj}(t)$ is the neighborhood function, often defined as
$$h_{Jj}(t) = \eta(t)\exp\!\Big(-\frac{\|\mathbf{r}_J - \mathbf{r}_j\|^{2}}{2\sigma^{2}(t)}\Big)$$
where $\eta(t)$ is the monotonically decreasing learning rate, $\mathbf{r}_j$ represents the position of the corresponding neuron, and $\sigma(t)$ is the monotonically decreasing kernel width function, or
$$h_{Jj}(t) = \begin{cases}\eta(t), & \text{if node } j \text{ belongs to the neighborhood of the winning node } J\\ 0, & \text{otherwise.}\end{cases}$$
4) Repeat steps 2)–3) until no change of neuron position that is more than a small positive number is observed.
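A minimal sketch of this training loop follows; the lattice size, epoch count, and linear decay schedules for the learning rate and kernel width are illustrative assumptions.

```python
import numpy as np

def train_sofm(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, rng=np.random.default_rng(0)):
    """Basic SOFM training: winner search plus Gaussian-neighborhood prototype updates."""
    rows, cols = grid
    positions = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    weights = rng.random((rows * cols, data.shape[1]))
    t, t_max = 0, epochs * len(data)
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = t / t_max
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 1e-2
            # Step 2: winner is the prototype closest to the input pattern.
            winner = np.argmin(((weights - x) ** 2).sum(axis=1))
            # Step 3: Gaussian neighborhood function on the lattice positions.
            d2 = ((positions - positions[winner]) ** 2).sum(axis=1)
            h = lr * np.exp(-d2 / (2 * sigma ** 2))
            weights += h[:, None] * (x - weights)
            t += 1
    return weights.reshape(rows, cols, -1)
```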
While SOFM enjoy the merits of input space density approximation and independence of the order of input patterns, a number of user-dependent parameters cause problems when they are applied in practice. Like the $K$-means algorithm, SOFM need to predefine the size of the lattice, i.e., the number of clusters, which is unknown in most circumstances. Additionally, a trained SOFM may suffer from input space density misrepresentation [132], where areas of low pattern density may be over-represented and areas of high density under-represented. Kohonen reviewed a variety of SOFM variants in [169], which improve the drawbacks of basic SOFM and broaden its applications. SOFM can also be integrated with other clustering approaches (e.g., the $K$-means algorithm or HC) to provide more effective and faster clustering; [263] and [276] illustrate two such hybrid systems.

ART was developed by Carpenter and Grossberg as a solution to the plasticity and stability dilemma [51], [53], [113]. ART can learn arbitrary input patterns in a stable, fast, and self-organizing way, thus overcoming the learning instability that plagues many other competitive networks. ART is not, as is popularly imagined, a neural network architecture. It is a learning theory, holding that resonance in neural circuits can trigger fast learning. As such, it subsumes a large family of current and future neural network architectures, with many variants. ART1 is the first member, which only deals with binary input patterns [51], although it can be extended to arbitrary input patterns by a variety of coding mechanisms. ART2 extends the applications to analog input patterns [52], and ART3 introduces a new mechanism originating from elaborate biological processes to achieve more efficient parallel search in hierarchical structures [54]. By incorporating two ART modules, which receive the input patterns (ART$_a$) and the corresponding labels (ART$_b$), respectively, together with an inter-ART module, the resulting ARTMAP system can be used for supervised classification [56]. The match tracking strategy ensures the consistency of category prediction between the two ART modules by dynamically adjusting the vigilance parameter of ART$_a$. Also see fuzzy ARTMAP in [55]. A similar idea, omitting the inter-ART module, is known as LAPART [134].

The basic ART1 architecture consists of two layers of nodes, the feature representation field $F_1$ and the category representation field $F_2$. They are connected by adaptive weights, the bottom-up weight matrix $\mathbf{W}^{12}$ and the top-down weight matrix $\mathbf{W}^{21}$. The prototypes of clusters are stored in layer $F_2$. After $F_2$ is activated according to the winner-takes-all competition, an expectation is reflected in layer $F_1$ and compared with the input pattern. The orienting subsystem with the specified vigilance parameter $\rho$ ($0 \le \rho \le 1$) determines whether the expectation and the input are closely matched, and therefore controls the generation of new clusters. It is clear that the larger $\rho$ is, the more clusters are generated. Once weight adaptation occurs, both bottom-up and top-down weights are updated simultaneously. This is called resonance, from which the name comes. The ART1 algorithm can be described as follows.
1) Initialize the weight matrices $\mathbf{W}^{12}$ and $\mathbf{W}^{21}$: the bottom-up weights are set to small positive values sorted in descending order, valid for any binary input pattern, and the top-down weights are set to 1.
2) For a new pattern $\mathbf{x}$, calculate the input $T_j$ from layer $F_1$ to layer $F_2$ for every node $j$; the input takes one form if $j$ is an uncommitted node first activated and another form, based on $\mathbf{x}\wedge\mathbf{w}_j^{21}$, if $j$ is a committed node, where $\wedge$ represents the logic AND operation.
3) Activate layer $F_2$ by choosing node $J$ with the winner-takes-all rule $T_J = \max_j T_j$.
4) Compare the expectation from layer $F_2$ with the input pattern. If $|\mathbf{x}\wedge\mathbf{w}_J^{21}|/|\mathbf{x}| \ge \rho$, go to step 5a); otherwise, go to step 5b).


5)
a) Update the corresponding weights for the active node as $\mathbf{w}_J^{21}(\text{new}) = \mathbf{x}\wedge\mathbf{w}_J^{21}(\text{old})$, with the bottom-up weights $\mathbf{w}_J^{12}$ updated accordingly from $\mathbf{x}\wedge\mathbf{w}_J^{21}(\text{old})$.
b) Send a reset signal to disable the current active node through the orienting subsystem and return to step 3).
6) Present another input pattern; return to step 2) until all patterns are processed.
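A much-simplified sketch of this loop is given below. The choice function of the form $|\mathbf{x}\wedge\mathbf{w}|/(\beta+|\mathbf{w}|)$ is one common concrete assumption, and fast learning is performed by template intersection; the vigilance $\rho$ controls how readily new categories are created.

```python
import numpy as np

def art1(patterns, rho=0.7, beta=1.0, epochs=5):
    """Simplified ART1-style clustering of binary patterns with fast learning."""
    templates = []                      # top-down templates w_j (binary vectors)
    labels = np.zeros(len(patterns), dtype=int)
    for _ in range(epochs):
        for idx, x in enumerate(patterns):
            x = np.asarray(x, dtype=bool)
            denom = max(int(x.sum()), 1)
            # Rank committed nodes by the choice (bottom-up) value T_j.
            order = sorted(range(len(templates)),
                           key=lambda j: -np.logical_and(x, templates[j]).sum() / (beta + templates[j].sum()))
            for j in order:
                match = np.logical_and(x, templates[j]).sum() / denom
                if match >= rho:                                    # vigilance test passed: resonance
                    templates[j] = np.logical_and(x, templates[j])  # fast learning: template intersection
                    labels[idx] = j
                    break
            else:                                                   # all committed nodes reset: recruit a new node
                templates.append(x.copy())
                labels[idx] = len(templates) - 1
    return templates, labels
```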
Fig. 2. ART1 architecture. Two layers are included in the attentional subsystem, connected via bottom-up and top-down adaptive weights. Their interactions are controlled by the orienting subsystem through a vigilance parameter.

Note the relation between the ART network and other clustering algorithms described in traditional and statistical language. Moore used several clustering algorithms to explain the clustering behaviors of ART1 and thereby induced and proved a number of important properties of ART1, notably its equivalence to varying $K$-means clustering [204]. She also showed how to adapt these algorithms under the ART1 framework. In [284] and [285], the ease with which ART may be used for hierarchical clustering is also discussed.

Fuzzy ART (FA) benefits from the incorporation of fuzzy set theory and ART [57]. FA maintains operations similar to ART1 and uses fuzzy set operators to replace the binary operators, so that it can work for all real data sets. FA exhibits many desirable characteristics, such as fast and stable learning and atypical pattern detection. Huang et al. investigated and revealed more properties of FA, classified as template, access, reset, and the number of learning epochs [143]. The criticisms of FA are mostly focused on its inefficiency in dealing with noise and the deficiency of its hyperrectangular representation of clusters in many circumstances [23], [24], [281]. Williamson described Gaussian ART (GA) to overcome these shortcomings [281], in which each cluster is modeled with a Gaussian distribution and represented geometrically as a hyperellipsoid. GA does not inherit the offline fast learning property of FA, as indicated by Anagnostopoulos et al. [13], who proposed different ART architectures, hypersphere ART (HA) [12] for hyperspherical clusters and ellipsoid ART (EA) [13] for hyperellipsoidal clusters, to explore a more efficient representation of clusters while keeping important properties of FA. Baraldi and Alpaydin proposed SART following their general ART clustering networks framework, which is described through a feedforward architecture combined with a match comparison mechanism [23]. As specific examples, they illustrated the symmetric fuzzy ART (SFART) and fully self-organizing SART (FOSART) networks. These networks outperform ART1 and FA according to their empirical studies [23].

In addition to these, many other neural network architectures have been developed for clustering. Most of these architectures utilize prototype vectors to represent clusters, e.g., the cluster detection and labeling network (CDL) [82], HEC [194], and SPLL [296]. HEC uses a two-layer network architecture to estimate the regularized Mahalanobis distance, which is equated to the Euclidean distance in a transformed whitened space. CDL is also a two-layer network, with an inverse squared Euclidean metric. CDL requires the match between the input patterns and the prototypes to be above a threshold, which is dynamically adjusted. SPLL emphasizes initiation-independent and adaptive generation of clusters. It begins with a random prototype in the input space and iteratively chooses and divides prototypes until no further split is available. The divisibility of a prototype is based on the consideration that each prototype should represent only one natural cluster, instead of a combination of several clusters. Simpson employed hyperbox fuzzy sets to characterize clusters [100], [249]. Each hyperbox is delineated by a min and a max point, and data points build their relations with the hyperbox through the membership function. The learning process undergoes a series of expansion and contraction operations, until all clusters are stable.

I. Kernel-Based Clustering

Kernel-based learning algorithms [209], [240], [274] are based on Cover's theorem. By nonlinearly transforming a set of complex and nonlinearly separable patterns into a higher-dimensional feature space, we can obtain the possibility of separating these patterns linearly [132]. The difficulty of the curse of dimensionality can be overcome by the kernel trick, arising from Mercer's theorem [132]. By designing and calculating an inner-product kernel, we can avoid the time-consuming, sometimes even infeasible, process of explicitly describing the nonlinear mapping and computing the corresponding points in the transformed space.

In [241], Schölkopf, Smola, and Müller depicted a kernel-$K$-means algorithm in the online mode. Suppose we have a set of patterns $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and a nonlinear map $\Phi$ into a feature space $F$ with arbitrarily high dimensionality. The object of the algorithm is to find $K$ centers so that we can minimize the distance between the mapped patterns and their closest center

$$\min_{\mathbf{m}_1,\ldots,\mathbf{m}_K}\ \sum_{j=1}^{N}\ \min_{1\le h\le K}\big\|\Phi(\mathbf{x}_j) - \mathbf{m}_h\big\|^{2}$$

where $\mathbf{m}_h$ is the center of the $h$th cluster and lies in the span of $\Phi(\mathbf{x}_1), \ldots, \Phi(\mathbf{x}_N)$, and $k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x})\cdot\Phi(\mathbf{y})$ is the inner-product kernel.

Define the cluster assignment variable
$$\zeta_{hj} = \begin{cases}1, & \text{if } \mathbf{x}_j \text{ belongs to cluster } h\\ 0, & \text{otherwise.}\end{cases}$$


Then the kernel-$K$-means algorithm can be formulated as follows.
1) Initialize the centers $\mathbf{m}_h$, $h = 1, \ldots, K$, with the first $K$ observation patterns.
2) Take a new pattern $\mathbf{x}_{t+1}$ and calculate the assignment variables: $\zeta_{h,t+1} = 1$ if $\|\Phi(\mathbf{x}_{t+1}) - \mathbf{m}_h\|^{2} \le \|\Phi(\mathbf{x}_{t+1}) - \mathbf{m}_i\|^{2}$ for all $i$, and $\zeta_{h,t+1} = 0$ otherwise, where the feature-space distances are evaluated entirely through the kernel expansion.
3) Update the mean vector $\mathbf{m}_h$ whose corresponding $\zeta_{h,t+1}$ is 1,
$$\mathbf{m}_h^{\text{new}} = \mathbf{m}_h^{\text{old}} + \xi\big(\Phi(\mathbf{x}_{t+1}) - \mathbf{m}_h^{\text{old}}\big)$$
where $\xi$ is the learning rate (the reciprocal of the number of patterns assigned to the cluster so far).
4) Adapt the coefficients $\alpha_{hj}$ of the kernel expansion $\mathbf{m}_h = \sum_j\alpha_{hj}\Phi(\mathbf{x}_j)$ for each $\Phi(\mathbf{x}_j)$ as
$$\alpha_{hj}^{\text{new}} = (1-\xi)\,\alpha_{hj}^{\text{old}}\ \ \text{for } j \le t, \qquad \alpha_{h,t+1}^{\text{new}} = \xi\ \ \text{for } j = t+1.$$
5) Repeat steps 2)–4) until convergence is achieved.
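A batch-style sketch conveys the same kernel trick more compactly than the online recursion: the distance from a mapped point to a cluster mean is computed purely from kernel evaluations. The Gaussian kernel and batch reassignment below are illustrative assumptions, not the online formulation above.

```python
import numpy as np

def kernel_kmeans(data, k, gamma=1.0, iters=50, rng=np.random.default_rng(0)):
    """Batch kernel K-means: feature-space distances via the kernel trick, no explicit mapping Phi."""
    n = len(data)
    # Gaussian (RBF) kernel matrix K(x, y) = exp(-gamma * ||x - y||^2).
    sq = ((data[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    labels = rng.integers(0, k, size=n)
    for _ in range(iters):
        dist = np.empty((n, k))
        for c in range(k):
            members = labels == c
            nc = max(members.sum(), 1)
            # ||Phi(x) - m_c||^2 = K(x,x) - 2/|C| * sum_j K(x,x_j) + 1/|C|^2 * sum_jl K(x_j,x_l)
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / nc
                          + K[np.ix_(members, members)].sum() / nc ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```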
Two variants of kernel-$K$-means were introduced in [66], motivated by SOFM and ART networks. These variants consider the effects of neighborhood relations while adjusting the cluster assignment variables, and use a vigilance parameter to control the process of producing mean vectors. The authors also illustrated the application of these approaches in case-based reasoning systems.

An alternative kernel-based clustering approach is given in [107]. The problem was formulated as determining an optimal partition that minimizes the trace of the within-group scatter matrix in the feature space

$$\operatorname{tr}\big(\mathbf{S}_W^{\Phi}\big) = \sum_{k=1}^{K}\sum_{j=1}^{N}\zeta_{kj}\,\big\|\Phi(\mathbf{x}_j) - \mathbf{m}_k\big\|^{2}$$

where $\zeta_{kj}$ is the cluster assignment variable defined previously, $\mathbf{m}_k$ is the feature-space mean of the $k$th cluster, and $N_k$ is the total number of patterns in the $k$th cluster. Note that the kernel function utilized in this case is the radial basis function (RBF), and the quadratic term $\frac{1}{N_k^{2}}\sum_{i,j}\zeta_{ki}\zeta_{kj}k(\mathbf{x}_i,\mathbf{x}_j)$, which appears when the trace is expanded with the kernel, can be interpreted as a measure of the denseness of the $k$th cluster.

Ben-Hur et al. presented a new clustering algorithm, SVC, in order to find a set of contours used as the cluster boundaries in the original data space [31], [32]. These contours can be formed by mapping back the smallest enclosing sphere in the transformed feature space. The RBF kernel is chosen in this algorithm, and, by adjusting the width parameter of the RBF, SVC can form either agglomerative or divisive hierarchical clusters. When some points are allowed to lie outside the hypersphere, SVC can deal with outliers effectively. An extension, called multiple spheres support vector clustering, was proposed in [62], which incorporates the concept of fuzzy membership.

Kernel-based clustering algorithms have many advantages.
1) It is more possible to obtain a linearly separable hyperplane in the high-dimensional, or even infinite, feature space.
2) They can form arbitrary cluster shapes other than hyperellipsoids and hyperspheres.
3) Kernel-based clustering algorithms, like SVC, have the capability of dealing with noise and outliers.
4) For SVC, there is no requirement for prior knowledge to determine the system topological structure. In [107], the kernel matrix can provide the means to estimate the number of clusters.
Meanwhile, there are also some problems requiring further consideration and investigation. Like many other algorithms, how to determine the appropriate parameters, for example, the width of the Gaussian kernel, is not trivial. The problem of computational complexity may become serious for large data sets.

The process of constructing the sum-of-squared clustering algorithm [107] and the $K$-means algorithm [241] presents a good example of how to reformulate more powerful nonlinear versions of many existing linear algorithms, provided that the scalar product can be obtained. Theoretically, it is important to investigate whether these nonlinear variants keep the useful and essential properties of the original algorithms and how Mercer kernels contribute to the improvement of the algorithms. The effect of different types of kernel functions, which are rich in the literature, is also an interesting topic for further exploration.

J. Clustering Sequential Data

Sequential data are sequences with variable length and many other distinct characteristics, e.g., dynamic behaviors, time constraints, and large volume [120], [265]. Sequential data can be generated from DNA sequencing, speech processing, text mining, medical diagnosis, stock markets, customer transactions, web data mining, and robot sensor analysis, to name a few [78], [265]. In recent decades, sequential data have grown explosively. For example, in genetics, the statistics released on October 15, 2004 (Release 144.0) show that there are 43 194 602 655 bases from 38 941 263 sequences in the GenBank database [103], and release 45.0 of SWISSPROT on October 25, 2004 contains 59 631 787 amino acids in 163 235 sequence entries [267]. Cluster analysis explores the potential patterns hidden in large amounts of sequential data in the context of unsupervised learning and therefore provides a crucial way to meet the current challenges. Generally, strategies for sequential clustering mostly fall into three categories.


1) Sequence Similarity: The first scheme is based on the measure of the distance (or similarity) between each pair of sequences. Then, proximity-based clustering algorithms, either hierarchical or partitional, can group the sequences. Since many sequential data are expressed in an alphabetic form, like DNA or protein sequences, conventional measures are inappropriate. If a sequence comparison is regarded as a process of transforming a given sequence into another with a series of substitution, insertion, and deletion operations, the distance between the two sequences can be defined by virtue of the minimum number of required operations. A common analysis process is alignment, illustrated in Fig. 3. The defined distance is known as the edit distance or Levenshtein distance [120], [236]. These edit operations are weighted (punished or rewarded) according to some prior domain knowledge, and the distance herein is equivalent to the minimum cost to complete the transformation. In this sense, the similarity or distance between two sequences can be reformulated as an optimal alignment problem, which fits well into the framework of dynamic programming.

Fig. 3. Illustration of a sequence alignment. A series of edit operations is performed to change the sequence CLUSTERING into the sequence CLASSIFICATION.

Given two sequences $\mathbf{a} = (a_1, \ldots, a_p)$ and $\mathbf{b} = (b_1, \ldots, b_q)$, the basic dynamic programming-based sequence alignment algorithm, also known as the Needleman-Wunsch algorithm, can be depicted by the following recursive equation [78], [212]:

$$S(i, j) = \max\big\{S(i-1, j-1) + s(a_i, b_j),\ S(i-1, j) + s(a_i, -),\ S(i, j-1) + s(-, b_j)\big\}$$

where $S(i, j)$ is defined as the best alignment score between the sequence segment $(a_1, \ldots, a_i)$ of $\mathbf{a}$ and $(b_1, \ldots, b_j)$ of $\mathbf{b}$, and $s(a_i, b_j)$, $s(a_i, -)$, and $s(-, b_j)$ represent the cost of aligning $a_i$ to $b_j$, aligning $a_i$ to a gap (denoted as $-$), and aligning $b_j$ to a gap, respectively. The computational result for each position $(i, j)$ is recorded in an array, together with a pointer that stores the current optimal operation and provides an effective path for backtracking the alignment.

The Needleman-Wunsch algorithm considers the comparison of the whole length of the two sequences and therefore performs a globally optimal alignment. However, in many circumstances it is also important to find local similarity among sequences. The Smith-Waterman algorithm achieves this by allowing a new alignment to begin during the recursive computation and an alignment to stop anywhere in the dynamic programming matrix [78], [251]. This change is summarized in the following:

$$S(i, j) = \max\big\{0,\ S(i-1, j-1) + s(a_i, b_j),\ S(i-1, j) + s(a_i, -),\ S(i, j-1) + s(-, b_j)\big\}.$$

For both the global and local alignment algorithms, the computational complexity is proportional to the product of the lengths of the two sequences, which is very expensive, especially for a clustering problem that requires an all-against-all pairwise comparison. A wealth of speed-up methods has been developed to improve the situation [78], [120]. We will see more discussion in Section III-E in the context of biological sequence analysis. Other examples include applications for speech recognition [236] and navigation pattern mining [131].
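The recursions above translate almost directly into code; the following sketch, with illustrative unit match/mismatch/gap costs, computes either the global (Needleman-Wunsch) or the local (Smith-Waterman) alignment score.

```python
import numpy as np

def align(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Dynamic-programming sequence alignment score.
    local=False gives Needleman-Wunsch (global); local=True gives Smith-Waterman."""
    S = np.zeros((len(a) + 1, len(b) + 1))
    if not local:                      # global alignment: leading gaps are penalized
        S[:, 0] = gap * np.arange(len(a) + 1)
        S[0, :] = gap * np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            best = max(S[i - 1, j - 1] + s, S[i - 1, j] + gap, S[i, j - 1] + gap)
            S[i, j] = max(0, best) if local else best
    return S.max() if local else S[-1, -1]

# Alignment-based similarity between the two sequences from the figure:
print(align("CLUSTERING", "CLASSIFICATION"))
```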


2) Indirect Sequence Clustering: The second approach employs an indirect strategy, which begins with the extraction of a set of features from the sequences. All the sequences are then mapped into the transformed feature space, where classical vector space-based clustering algorithms can be used to form clusters. Obviously, feature extraction becomes the essential factor that decides the effectiveness of these algorithms. Guralnik and Karypis discussed the potential dependency between two sequential patterns and suggested both global and local approaches to prune the initial feature sets in order to better represent the sequences in the new feature space [119]. Morzy et al. utilized sequential patterns as the basic elements in agglomerative hierarchical clustering and defined a co-occurrence measure as the standard for fusing smaller clusters [207]. These methods greatly reduce the computational complexity and can be applied to large-scale sequence databases. However, the process of feature selection inevitably causes the loss of some information in the original sequences and needs extra attention.

3) Statistical Sequence Clustering: Typically, the first two approaches are used to deal with sequential data composed of alphabets, while the third paradigm, which aims to construct statistical models to describe the dynamics of each group of sequences, can be applied to numerical or categorical sequences. The most important method is hidden Markov models (HMMs) [214], [219], [253], which first gained their popularity in the application of speech recognition [229]. A discrete HMM describes an unobservable stochastic process consisting of a set of states, each of which is related to another stochastic process that emits observable symbols. Therefore, the HMM is completely specified by the following.
1) A finite set $S = \{s_1, \ldots, s_Q\}$ of $Q$ states.
2) A discrete set $V = \{v_1, \ldots, v_L\}$ of $L$ observation symbols.
3) A state transition distribution $A = \{a_{il}\}$, where $a_{il} = P(l\text{th state at time } t+1 \mid i\text{th state at time } t)$.
4) A symbol emission distribution $B = \{b_i(k)\}$, where $b_i(k) = P(v_k \text{ at time } t \mid i\text{th state at time } t)$.
5) An initial state distribution $\boldsymbol{\pi} = \{\pi_i\}$, where $\pi_i = P(i\text{th state at } t = 1)$.
After an initial state is selected according to the initial distribution $\boldsymbol{\pi}$, a symbol is emitted with emission distribution $B$. The next state is decided by the state transition distribution $A$, and it also generates a symbol based on $B$. The process repeats until the last state is reached. Note that the procedure generates a sequence of symbol observations instead of states, which is where the name "hidden" comes from. HMMs are well founded theoretically [229]. Dynamic programming and the EM algorithm have been developed to solve the three basic problems of HMMs, as follows.
1) Likelihood (forward or backward algorithm). Compute the probability of an observation sequence given a model.
2) State interpretation (Viterbi algorithm). Find an optimal state sequence by optimizing some criterion function given the observation sequence and the model.
3) Parameter estimation (Baum–Welch algorithm). Design suitable model parameters to maximize the probability of the observation sequence under the model.
The equivalence between an HMM and a recurrent back-propagation network was elucidated in [148], and a universal framework was constructed to describe both the computational and the structural properties of the HMM and the neural network.

Smyth proposed an HMM-based clustering model which, similar to the theories introduced in mixture densities-based clustering, assumes that each cluster is generated based on some probability distribution [253]. Here, HMMs are used rather than the common Gaussian or related distributions. In addition to the form of finite mixture densities, the mixture model can also be described by means of a single composite HMM whose transition matrix takes the block-diagonal form

$$\mathbf{A} = \operatorname{diag}(\mathbf{A}_1, \ldots, \mathbf{A}_K)$$

where $\mathbf{A}_k$ is the transition distribution for the $k$th cluster. The initial distribution of the HMM is determined based on the prior probability of each cluster. The basic learning process starts with a parameter initialization scheme to form a rough partition, with the log-likelihood of each sequence serving as the distance measure. The partition is further refined by training the overall HMM over all sequences with the classical EM algorithm. A Monte-Carlo cross-validation method was used to estimate the possible number of clusters. An application with a modified HMM model that considers the effect of context for clustering facial display sequences is illustrated in [138]. Oates et al. addressed the initialization problem by pregrouping the sequences with agglomerative hierarchical clustering, which operates on the proximity matrix determined by the dynamic time warping (DTW) technique [214]. The area formed between one original sequence and a new sequence, generated by warping the time dimension of another original sequence, reflects the similarity of the two sequences. Li and Biswas suggested several objective criterion functions, based on posterior probability and information theory, for the structural selection of HMMs and cluster validity [182]. More recent advances on HMMs and other related topics are reviewed in [30].
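Since the DTW proximity just mentioned is itself a small dynamic program, a minimal sketch (squared pointwise cost, no warping-window constraint, both assumed choices) is:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two numeric sequences of possibly different lengths."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Best of the three admissible warping moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A proximity matrix of pairwise DTW distances can then feed any hierarchical clustering routine.
```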
Other model-based sequence clustering approaches include mixtures of first-order Markov chains [255] and linear models like the autoregressive moving average (ARMA) model [286]. Usually, they are combined with EM for parameter estimation [286]. Smyth [255] and Cadez et al. [50] further generalize a universal probabilistic framework to model mixed data measurements, which include both conventional static multivariate vectors and dynamic sequence data.

This paradigm models clusters directly from the original data, without an additional process that may cause information loss. It provides more intuitive ways to capture the dynamics of data and more flexible means to deal with variable-length sequences. However, determining the number of model components remains a complicated and uncertain process [214], [253]. Also, the selected model is required to have sufficient complexity in order to interpret the characteristics of the data.

K. Clustering Large-Scale Data Sets

Scalability becomes more and more important for clustering algorithms as the complexity of data increases, mainly manifesting in two aspects: enormous data volume and high dimensionality. The examples illustrated in the sequential clustering section are just some of the many applications that require this capability. With further advances in database and Internet technologies, clustering algorithms will face even more severe challenges in handling the rapid growth of data. We summarize the computational complexity of some typical and classical clustering algorithms in Table II, together with several newly proposed approaches specifically designed to deal with large-scale data sets. Several points can be generalized from the table.
1) Obviously, classical hierarchical clustering algorithms, including single linkage, complete linkage, average linkage, centroid linkage, and median linkage, are not appropriate for large-scale data sets due to their quadratic computational complexities in both execution time and storage space.
2) The $K$-means algorithm has a time complexity of $O(NKd)$ per iteration and a space complexity of $O(N + K)$. Since $N$ is usually much larger than both $K$ and $d$, the complexity becomes near linear in the number of samples in the data set. The $K$-means algorithm is effective in clustering large-scale data sets, and efforts have been made to overcome its disadvantages [142], [218].
3) Many novel algorithms have been developed to cluster large-scale data sets, especially in the context of data mining [44], [45], [85], [135], [213], [248]. Many of them can scale the computational complexity linearly with the input size and demonstrate the possibility of handling very large data sets.
a) Random sampling approach, e.g., CLARA (clustering large applications) [161] and CURE [116]. The key point is that appropriate sample sizes can effectively maintain the important geometrical properties of clusters. Furthermore, Chernoff bounds can provide an estimate of the lower bound of the minimum sample size, given a low probability that points in any cluster are missed in the sample set [116].


CLARA represents each cluster with a medoid, while CURE chooses a set of well-scattered and center-shrunk points.
b) Randomized search approach, e.g., clustering large applications based on randomized search (CLARANS) [213]. CLARANS views clustering as a search process in a graph, in which each node corresponds to a set of medoids. It begins with an arbitrary node as the current node and examines a set of neighbors, defined as nodes differing in only one data object, to seek a better solution, i.e., any neighbor with a lower cost becomes the current node. If the maximum number of neighbors specified by the user has been reached, the current node is accepted as a winning node. This process iterates several times, as specified by the user. Though CLARANS achieves better performance than algorithms like CLARA, the total computational time is still quadratic, which makes CLARANS not very effective on very large data sets.
c) Condensation-based approach, e.g., BIRCH [295]. BIRCH generates and stores compact summaries of the original data in a CF tree, as discussed in Section II-B. This new data structure efficiently captures the clustering information and largely reduces the computational burden. BIRCH was generalized into a broader framework in [101], with two algorithmic realizations named BUBBLE and BUBBLE-FM.
d) Density-based approach, e.g., density-based spatial clustering of applications with noise (DBSCAN) [85] and density-based clustering (DENCLUE) [135]. DBSCAN requires that the density in a neighborhood of an object be high enough if the object belongs to a cluster. DBSCAN creates a new cluster from a data object by absorbing all objects in its neighborhood, where the neighborhood needs to satisfy a user-specified density threshold; DBSCAN uses an R*-tree structure for more efficient queries (see the sketch following this list). DENCLUE seeks clusters with local maxima of the overall density function, which reflects the comprehensive influence of data objects on their neighborhoods in the corresponding data space.
e) Grid-based approach, e.g., WaveCluster [248] and fractal clustering (FC) [26]. WaveCluster assigns data objects to a set of units dividing the original feature space and employs wavelet transforms on these units to map the objects into the frequency domain. The key idea is that clusters can be easily distinguished in the transformed space. FC combines the concepts of incremental clustering and fractal dimension. Data objects are incrementally added to the clusters, specified through an initial process, and represented as cells in a grid, under the condition that the fractal dimension of a cluster must remain relatively stable.
4) Most of the algorithms listed previously lack the capability of dealing with high-dimensional data. Their performance degenerates as the dimensionality increases. Some algorithms, like FC and DENCLUE, have shown some successful applications in such cases, but these are still far from completely effective.
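A bare-bones sketch of the density-based idea in d) is given below; the Euclidean eps-neighborhoods and the brute-force distance matrix (instead of an R*-tree index) are simplifying assumptions.

```python
import numpy as np

def dbscan(data, eps=0.5, min_pts=5):
    """Minimal DBSCAN: grow clusters by absorbing the eps-neighborhoods of core points."""
    n = len(data)
    labels = np.full(n, -1)            # -1 marks noise / unassigned
    visited = np.zeros(n, dtype=bool)
    dist = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(np.flatnonzero(dist[p] <= eps))
        if len(neighbors) < min_pts:
            continue                   # p is not a core point (it may later join a cluster as a border point)
        labels[p] = cluster_id
        seeds = neighbors
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id
            if not visited[q]:
                visited[q] = True
                q_neighbors = np.flatnonzero(dist[q] <= eps)
                if len(q_neighbors) >= min_pts:      # q is itself a core point: expand further
                    seeds.extend(q_neighbors)
        cluster_id += 1
    return labels
```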
In addition to the aforementioned approaches, several other techniques also play significant roles in clustering large-scale data sets. Parallel algorithms can use computational resources more effectively and greatly improve overall performance in terms of both time and space complexity [69], [217], [262]. Incremental clustering techniques do not require storage of the entire data set and can handle it one pattern at a time. If a pattern displays enough closeness to a cluster according to some predefined criteria, it is assigned to that cluster; otherwise, a new cluster is created to represent the object. A typical example is the ART family [51]–[53] discussed in Section II-H. Most incremental clustering algorithms are dependent on the order of the input patterns [51], [204]. Bradley, Fayyad, and Reina proposed a scalable clustering framework considering seven relevant important characteristics for dealing with large databases [44]. Applications of the framework were illustrated for the $K$-means algorithm and EM mixture models [44], [45].

L. Exploratory Data Visualization and High-Dimensional Data Analysis Through Dimensionality Reduction

For most of the algorithms summarized in Table II, although they can deal with large-scale data, they are not sufficient for analyzing high-dimensional data. The term "curse of dimensionality," first used by Bellman to indicate the exponential growth of complexity in multivariate function estimation under high-dimensionality situations [28], is generally used to describe the problems accompanying high-dimensional spaces [34], [132]. It has been proved theoretically that, when the dimensionality of the space is high enough, the distance to the nearest point becomes no different from the distance to the other points [34]. Therefore, clustering algorithms that are based on distance measures may no longer be effective in a high-dimensional space. Fortunately, in practice, many high-dimensional data have an intrinsic dimensionality that is much lower than the original dimension [60]. Dimension reduction is important in cluster analysis; it not only makes the high-dimensional data addressable and reduces the computational cost, but also provides users with a clearer picture and visual examination of the data of interest. However, dimensionality reduction methods inevitably cause some loss of information and may damage the interpretability of the results, even distorting the real clusters.

One natural strategy for dimensionality reduction is to extract important components from the original data that can contribute to the division of clusters. Principal component analysis (PCA), or the Karhunen-Loève transformation, is one of the typical approaches and is concerned with constructing a linear combination of a set of vectors that can best describe the variance of the data. Given the input pattern matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$, the linear mapping $\mathbf{Y} = \mathbf{W}^{T}\mathbf{X}$ projects $\mathbf{X}$ into a low-dimensional subspace, where $\mathbf{Y}$ is the resulting matrix and $\mathbf{W}$ is the projection matrix whose columns are the eigenvectors corresponding to the largest eigenvalues of the covariance matrix $\mathbf{C}$, calculated from the whole data set (hence, the column vectors of $\mathbf{W}$ are orthonormal).


PCA estimates the matrix $\mathbf{W}$ while minimizing the sum of squares of the error of approximating the input vectors. In this sense, PCA can be realized through a three-layer neural network, called an auto-associative multilayer perceptron, with linear activation functions [19], [215]. In order to extract more complicated nonlinear data structure, nonlinear PCA was developed, and one of the typical examples is kernel PCA. As with the methods discussed in Section II-I, kernel PCA first maps the input patterns into a feature space; similar steps are then applied to solve the eigenvalue problem with the new covariance matrix in the feature space. Alternatively, extra hidden layers with nonlinear activation functions can be added to the auto-associative network for this purpose [38], [75].

PCA is appropriate for Gaussian distributions, since it relies on second-order relationships in the covariance matrix. Other linear transforms, like independent component analysis (ICA) and projection pursuit, which use higher-order statistical information, are more suited to non-Gaussian distributions [60], [151]. The basic goal of ICA is to find the components that are most statistically independent from each other [149], [154]. In the context of blind source separation, ICA aims to separate the independent source signals from the mixed observation signal. This problem can be formulated in several different ways [149], and one of the simplest forms (without considering noise) is represented as $\mathbf{x} = \mathbf{A}\mathbf{s}$, where $\mathbf{x}$ is the $d$-dimensional observable vector, $\mathbf{s}$ is the $d$-dimensional source vector assumed to be statistically independent, and $\mathbf{A}$ is a nonsingular mixing matrix. ICA can also be realized by virtue of multilayer perceptrons, and [158] illustrates one such example. The proposed ICA network includes whitening, separation, and basis vector estimation layers, with corresponding learning algorithms; the authors also indicated its connection to the auto-associative multilayer perceptron. Projection pursuit is another statistical technique for seeking low-dimensional projection structures in multivariate data [97], [144]. Generally, projection pursuit regards the normal distribution as the least interesting projection and optimizes certain indices that measure the degree of nonnormality [97]. PCA can be considered a special example of projection pursuit, as indicated in [60]. More discussion of the relations among PCA, ICA, projection pursuit, and other relevant techniques is offered in [149] and [158].

Different from PCA, ICA, and projection pursuit, multidimensional scaling (MDS) is a nonlinear projection technique [75], [292]. The basic idea of MDS lies in fitting the original multivariate data into a low-dimensional structure while aiming to maintain the proximity information. The distortion is measured through some criterion function, e.g., the sum of squared errors between the real distances and the projection distances. The isometric feature mapping (Isomap) algorithm is another nonlinear technique, based on MDS [270]. Isomap estimates the geodesic distance between a pair of points, which is the shortest path between the points on a manifold, by virtue of the measured input-space distances, e.g., the commonly used Euclidean distance. This extends the capability of MDS to explore more complex nonlinear structures in the data. The locally linear embedding (LLE) algorithm addresses the nonlinear dimensionality reduction problem from a different starting point [235]. LLE emphasizes the local linearity of the manifold and assumes that the local relations in the original ($d$-dimensional) data space are also preserved in the projected low-dimensional ($m$-dimensional) space. These relations are represented through a weight matrix describing how each point is related to the reconstruction of other data points. Therefore, the procedure for dimensionality reduction can be constructed as the problem of finding $m$-dimensional vectors $\mathbf{y}_i$ so that the criterion function $\sum_i\|\mathbf{y}_i - \sum_j w_{ij}\mathbf{y}_j\|^{2}$ is minimized. Another interesting nonlinear dimensionality reduction approach, known as the Laplacian eigenmap algorithm, is presented in [27].
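Since PCA is the workhorse among these projections, a small sketch may be useful: an eigendecomposition of the sample covariance, keeping the leading components before clustering (the target dimensionality of two is an illustrative choice).

```python
import numpy as np

def pca_project(data, m=2):
    """Project data onto the m leading principal components (columns of W)."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    W = eigvecs[:, np.argsort(eigvals)[::-1][:m]]   # top-m eigenvectors
    return centered @ W

# Typical use: reduce dimensionality first, then run any clustering algorithm on the projection.
```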
As discussed in Section II-H, SOFM also provide good visualization for high-dimensional input patterns [168]. SOFM map input patterns onto a one- or, usually, two-dimensional lattice structure consisting of nodes associated with different clusters. An application to clustering a large set of document data is illustrated in [170], in which 6 840 568 patent abstracts were projected onto a SOFM with 1 002 240 nodes.

Subspace-based clustering addresses the challenge by exploring the relations of data objects under different combinations of features. Clustering in quest (CLIQUE) [3] employs a bottom-up scheme to seek dense rectangular cells in all subspaces with high density of points. Clusters are generated as the connected components of a graph whose vertices stand for the dense units. The resulting minimal description of the clusters is obtained through the merge of these rectangles. OptiGrid [136] is designed to obtain an optimal grid partitioning. This is achieved by constructing the best cutting hyperplanes through a set of projections; the time complexity of OptiGrid lies in the interval between $O(Nd)$ and $O(Nd\log N)$. ORCLUS (arbitrarily ORiented projected CLUster generation) [2] defines a generalized projected cluster as a densely distributed subset of data objects in a subspace, along with a subset of vectors that represent the subspace. The dimensionality of the subspace is prespecified by users as an input parameter, and several strategies are proposed to guide its selection. The algorithm begins with a set of randomly selected seeds with the full dimensionality. This dimensionality and the number of clusters are decayed according to some factors at each iteration, until the number of clusters reaches the predefined value. Each iteration consists of three basic operations, known as assignment, vector finding, and merge. The overall time and space complexity of ORCLUS is governed mainly by the number of initial seeds, on which its scalability to large data sets obviously relies. A generalized subspace clustering model, pCluster, was proposed in [279]; these pClusters are formed by a depth-first clustering algorithm. Several other interesting applications, including a Clindex (CLustering for INDexing) scheme and wavelet transforms, are shown in [184] and [211], respectively.

M. How Many Clusters?

The clustering process partitions data into an appropriate number of subsets. Although for some applications users can determine the number of clusters, $K$, in terms of their expertise, under more circumstances the value of $K$ is unknown and needs to be estimated exclusively from the data themselves. Many clustering algorithms require $K$ to be provided as an input parameter, and it is obvious that the quality of the resulting clusters is largely dependent on the estimation of $K$.


A division with too many clusters complicates the result, making it hard to interpret and analyze, while a division with too few clusters causes loss of information and misleads the final decision. Dubes called the problem of determining the number of clusters "the fundamental problem of cluster validity" [74].

A large number of attempts have been made to estimate the appropriate $K$, and some representative examples are illustrated in the following.
1) Visualization of the data set. For data points that can be effectively projected onto a two-dimensional Euclidean space, commonly depicted with a histogram or scatterplot, direct observation can provide good insight into the value of $K$. However, the complexity of most real data sets restricts the effectiveness of this strategy to a small scope of applications.
2) Construction of certain indices (or stopping rules). These indices usually emphasize intra-cluster compactness and inter-cluster isolation and consider the comprehensive effects of several factors, including the defined squared error, the geometric or statistical properties of the data, the number of patterns, the dissimilarity (or similarity), and the number of clusters. Milligan and Cooper compared and ranked 30 indices according to their performance over a series of artificial data sets [202]. Among these indices, the Calinski and Harabasz index [74] achieves the best performance and can be represented as
$$\mathrm{CH}(K) = \frac{\operatorname{tr}(\mathbf{S}_B)/(K-1)}{\operatorname{tr}(\mathbf{S}_W)/(N-K)}$$
where $N$ is the total number of patterns and $\operatorname{tr}(\mathbf{S}_B)$ and $\operatorname{tr}(\mathbf{S}_W)$ are the traces of the between- and within-class scatter matrices, respectively. The $K$ that maximizes the value of $\mathrm{CH}(K)$ is selected as the optimal. It is worth noting that these indices may be data dependent: the good performance of an index for certain data does not guarantee the same behavior for different data. As pointed out by Everitt, Landau, and Leese, "it is advisable not to depend on a single rule for selecting the number of groups, but to synthesize the results of several techniques" [88].
3) Optimization of some criterion function under a probabilistic mixture-model framework. In a statistical framework, finding the correct number of clusters (components), $K$, is equivalent to fitting a model to the observed data and optimizing some criterion [197]. Usually, the EM algorithm is used to estimate the model parameters for a given $K$, which goes through a predefined range of values. The value of $K$ that maximizes (or minimizes) the defined criterion is regarded as optimal. Smyth presented a Monte-Carlo cross-validation method, which randomly divides the data into training and test sets a number of times according to a certain fraction (an empirically effective choice is reported in [252]). $K$ is selected either directly based on the criterion function or on some calculated posterior probabilities. A large number of criteria that combine concepts from information theory have been proposed in the literature (a brief sketch of their use follows this list). Typical examples include the following.
• Akaike's information criterion (AIC) [4], [282]
$$\mathrm{AIC}(K) = -2\hat{L}_{K} + 2N_{p}$$
where $N_p$ is the total number of parameters estimated (the number of parameters per cluster times the number of clusters) and $\hat{L}_K$ is the maximum log-likelihood for $K$ clusters given the total of $N$ patterns. $K$ is selected with the minimum value of $\mathrm{AIC}(K)$.
• Bayesian inference criterion (BIC) [226], [242]
$$\mathrm{BIC}(K) = \hat{L}_{K} - \frac{N_{p}}{2}\log N.$$
$K$ is selected with the maximum value of $\mathrm{BIC}(K)$.
More criteria, such as the minimum description length (MDL) [114], [233], minimum message length (MML) [114], [216], cross-validation-based information criterion (CVIC) [254], and covariance inflation criterion (CIC) [272], are summarized with their characteristics in [197]. As with the previous discussion of validation indices, there is no criterion that is superior to the others in the general case; the selection among different criteria still depends on the data at hand.
4) Other heuristic approaches based on a variety of techniques and theories. Girolami performed an eigenvalue decomposition of the kernel matrix in the high-dimensional feature space and used the dominant components in the decomposition summation as an indication of the possible existence of clusters [107]. Kothari and Pitts described a scale-based method, in which the distance from a cluster centroid to other clusters in its neighborhood is considered (added as a regularization term to the original squared error criterion, Section II-C) [160]. The neighborhood of clusters works as a scale parameter, and the $K$ that is persistent over the largest interval of the neighborhood parameter is regarded as optimal.
for selecting the number of groups, but to synthesize Besides the previous methods, constructive clustering algo-
the results of several techniques” [88]. rithms can adaptively and dynamically adjust the number of
3) Optimization of some criterion functions under prob- clusters rather than use a prespecified and fixed number. ART
abilistic mixture-model framework. In a statistical networks generate a new cluster, only when the match between
framework, finding the correct number of clusters the input pattern and the expectation is below some prespecified
(components) , is equivalent to fitting a model with confidence value [51]. A functionally similar mechanism is
observed data and optimizing some criterion [197]. used in the CDL network [82]. The robust competitive clus-
Usually, the EM algorithm is used to estimate the tering algorithm (RCA) describes a competitive agglomeration
model parameters for a given , which goes through process that progresses in stages, and clusters that lose in the
a predefined range of values. The value of that competition are discarded, and absorbed into other clusters [98].
maximizes (or minimizes) the defined criterion is This process is generalized in [42], which attains the number
regarded as optimal. Smyth presented a Monte-Carlo of clusters by balancing the effect between the complexity
cross-validation method, which randomly divides data and the fidelity. Another learning scheme, SPLL iteratively
into training and test sets times according to a cer- divides cluster prototypes from a single prototype until no
tain fraction ( works well from the empirical more prototypes satisfy the split criterion [296]. Several other
results) [252]. The is selected either directly based constructive clustering algorithms, including the FACS and
on the criterion function or some posterior probabili- plastic neural gas, can be accessed in [223] and [232], re-
ties calculated. spectively. Obviously, the problem of determining the number

Authorized licensed use limited to: Georgia Institute of Technology. Downloaded on March 19, 2009 at 19:15 from IEEE Xplore. Restrictions apply.
666 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 3, MAY 2005

of clusters is converted into a parameter selection problem,


and the resulting number of clusters is largely dependent on
parameter tweaking.
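As a concrete illustration of strategies 2) and 3), the following minimal Python sketch sweeps a range of K values, scoring K-means partitions with the Calinski-Harabasz index and Gaussian mixtures (fitted by EM) with BIC. It assumes numpy and scikit-learn, the helper name estimate_k and the synthetic three-blob data are purely illustrative, and scikit-learn's bic() follows the convention that smaller is better (the negative of the form above up to scaling).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import calinski_harabasz_score

def estimate_k(X, k_range=range(2, 11)):
    """Sketch of two strategies for choosing K:
    (i) maximize the Calinski-Harabasz index over K-means partitions;
    (ii) fit Gaussian mixtures by EM and minimize BIC."""
    ch_scores, bics = {}, {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        ch_scores[k] = calinski_harabasz_score(X, labels)   # larger is better
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bics[k] = gmm.bic(X)                                # smaller is better in sklearn
    return max(ch_scores, key=ch_scores.get), min(bics, key=bics.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # three well-separated Gaussian blobs, so both criteria should recover K = 3
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 5, 10)])
    print(estimate_k(X))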

III. APPLICATIONS
We illustrate applications of clustering techniques in three aspects. The first is for two classical benchmark data sets that are widely used in pattern recognition and machine learning. Then, we show an application of clustering for the traveling salesman problem. The last topic is on bioinformatics. We deal with classical benchmarks in Sections III-A and III-B and the traveling salesman problem in Section III-C. A more extensive discussion of bioinformatics is in Sections III-D and III-E.

A. Benchmark Data Sets—IRIS


The iris data set [92] is one of the most popular data sets used to examine the performance of novel methods in pattern recognition and machine learning. It can be downloaded from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. There are three categories in the data set (i.e., iris setosa, iris versicolor, and iris virginica), each having 50 patterns with four features [i.e., sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW)]. Iris setosa can be linearly separated from iris versicolor and iris virginica, while iris versicolor and iris virginica are not linearly separable (see Fig. 4(a), in which only three features are used). Fig. 4(b) depicts the clustering result with a standard K-means algorithm. It is clear to see that K-means can correctly differentiate iris setosa from the other two iris plants. But for iris versicolor and virginica, there exist 16 misclassifications. This result is similar to those (around 15 errors) obtained with other classical clustering algorithms [221]. Table III summarizes some of the clustering results reported in the literature. From the table, we can see that many newly developed approaches can greatly improve the clustering performance on the iris data set (around 5 misclassifications); some can even achieve 100% accuracy. Therefore, the data can be well classified with appropriate methods.

Fig. 4. (a) Iris data set. There are three iris categories, each having 50 samples with 4 features. Here, only three features are used: PL, PW, and SL. (b) K-means clustering result with 16 classification errors observed.

TABLE III
SOME CLUSTERING RESULTS FOR THE IRIS DATA SET
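For reference, the K-means result described above is easy to reproduce approximately. The sketch below (assuming numpy and scikit-learn) clusters the iris data into three groups and counts errors by mapping each cluster to its majority class; exact counts vary with initialization, so it should be read as illustrative rather than as the experiment reported here.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Map each cluster to the majority true class among its members, then count errors.
errors = 0
for c in range(3):
    members = (labels == c)
    majority = np.bincount(y[members]).argmax()
    errors += int(np.sum(y[members] != majority))
print("misclassifications:", errors)   # typically on the order of 15-17 out of 150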

B. Benchmark Data Sets—MUSHROOM


Unlike the iris data set, all of the features of the mushroom data set, which is also accessible at the UCI Machine Learning Repository, are nominal rather than numerical. These 23 species of gilled mushrooms are categorized as either edible or poisonous. The total number of instances is 8124, with 4208 being edible and 3916 poisonous. The 22 features are summarized in Table IV with their corresponding possible values. Table V illustrates some experimental results from the literature. As indicated in [117] and [277], traditional clustering strategies, like K-means and hierarchical clustering, work poorly on this data set. The accuracy for K-means is just around 69% [277], and the clusters formed by classical HC are mixed with nearly similar proportions of both edible and poisonous objects [117]. The results reported for newly developed algorithms, which are specifically designed for tackling categorical or mixture data, greatly improve the situation [117], [183]. The algorithm ROCK divides the objects into 21 clusters, with most of them (except one) consisting of only one category, which increases the accuracy to almost 99%. The algorithm SBAC works on a subset of 200 randomly selected objects, 100 for each category, and the general results show a correct partition of 3 clusters (two for edible mushrooms, one for poisonous ones). In both studies, the constitution of each feature for the generated clusters is also illustrated, and it is observed that some features, like cap-shape and ring-type, represent themselves identically for both categories and, thus, suggest the poor performance of traditional approaches. Meanwhile, the feature odor shows good discrimination for the different types of mushrooms. Usually, the value almond, anise, or none indicates the edibility of mushrooms, while the value pungent, foul, or fishy means a high possibility of the presence of poisonous contents in the mushrooms.
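To make the contrast with numerical-data algorithms concrete, a toy K-modes-style routine is sketched below: cluster "centers" are per-feature modes and dissimilarity is a simple mismatch count. This is not ROCK or SBAC, the small integer-coded feature matrix is hypothetical, and only numpy is assumed.

import numpy as np

def kmodes(X, k, n_iter=20, seed=0):
    """Toy K-modes: mismatch-count dissimilarity, mode-based centers.
    X is an (n_samples, n_features) array of nonnegative categorical codes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each object to the center it mismatches least
        d = (X[:, None, :] != centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # update each center to the per-feature mode of its members
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centers[c] = [np.bincount(col).argmax() for col in members.T]
    return labels, centers

# hypothetical encoded categorical data: rows are objects, columns are nominal features
X = np.array([[0, 1, 2], [0, 1, 2], [1, 0, 2], [3, 2, 0], [3, 2, 1], [3, 2, 0]])
labels, _ = kmodes(X, k=2)
print(labels)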


TABLE IV
FEATURES FOR THE MUSHROOM DATA SET

TABLE V
SOME CLUSTERING RESULTS FOR THE MUSHROOM DATA SET

C. Traveling Salesman Problem

The traveling salesman problem (TSP) is one of the most studied examples in an important class of problems known as NP-complete problems. Given a complete undirected graph G = (V, E), where V is a set of vertices and E is a set of edges each relating two vertices with an associated nonnegative integer cost, the most general form of the TSP is equivalent to finding any Hamiltonian cycle, which is a tour over G that begins and ends at the same vertex and visits the other vertices exactly once. The more common form of the problem is the optimization problem of trying to find the shortest Hamiltonian cycle, and in particular, the most common is the Euclidean version, where the vertices and edges all lie in the plane. Mulder and Wunsch applied a divide-and-conquer clustering technique, with ART networks, to scale the problem to a million cities [208]. The divide-and-conquer paradigm gives the flexibility to hierarchically break large problems into arbitrarily small clusters depending on what tradeoff between accuracy and speed is desired. In addition, the subproblems provide an excellent opportunity to take advantage of parallel systems for further optimization. As the first stage of the algorithm, the ART network is used to sort the cities into clusters. The vigilance parameter is used to set a maximum distance from the current pattern. A vigilance parameter between 0 and 1 is used as a percentage of the global space to determine the vigilance distance. Values were chosen based on the desired number and size of the individual clusters. The clusters were then each passed to a version of the Lin-Kernighan (LK) algorithm [187]. The last step combines the subtours back into one complete tour. Tours of good quality for problems of up to 1,000,000 cities were obtained within 25 minutes on a 2 GHz AMD Athlon MP processor with 512 MB of DDR RAM. Fig. 5 shows the resulting tours for 1,000, 10,000, and 1,000,000 cities, respectively.

It is worthwhile to emphasize the relation between the TSP and very large-scale integrated (VLSI) circuit clustering, which partitions a sophisticated system into smaller and simpler subcircuits to facilitate the circuit design. The object of the partitions is to minimize the number of connections among the components. One strategy for solving the problem is based on geometric representations, either linear or multidimensional [8]. Alpert and Kahng considered a solution to the problem as the "inverse" of the divide-and-conquer TSP method and used a linear tour of the modules to form the subcircuit partitions [7]. They adopted the spacefilling curve heuristic for the TSP to construct the tour so that connected modules are still close in the generated tour. A dynamic programming method was used to generate the resulting partitions. More detailed discussion of VLSI circuit clustering can be found in the survey by Alpert and Kahng [7].
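The divide-and-conquer scheme can be sketched compactly. The toy below substitutes K-means for the ART stage and a greedy nearest-neighbor subtour for the Lin-Kernighan step, so it illustrates only the decomposition idea of [208], not the actual algorithm; numpy and scikit-learn are assumed, and the city coordinates are random.

import numpy as np
from sklearn.cluster import KMeans

def greedy_subtour(points):
    """Greedy nearest-neighbor ordering of one cluster (stand-in for Lin-Kernighan)."""
    unvisited = list(range(len(points)))
    tour = [unvisited.pop(0)]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: np.linalg.norm(points[i] - last))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

def cluster_and_stitch(cities, n_clusters=10, seed=0):
    """Divide-and-conquer sketch: cluster the cities, solve each subtour,
    then concatenate the subtours in a simple angular order of the centroids."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(cities)
    diffs = km.cluster_centers_ - cities.mean(axis=0)
    order = np.argsort(np.arctan2(diffs[:, 1], diffs[:, 0]))
    tour = []
    for c in order:
        idx = np.where(km.labels_ == c)[0]
        sub = greedy_subtour(cities[idx])
        tour.extend(idx[sub])
    return tour

cities = np.random.default_rng(0).random((1000, 2))
print(len(cluster_and_stitch(cities)))   # 1000: every city appears exactly once in the tour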


Fig. 5. Clustering divide-and-conquer TSP resulting tours for (a) 1 k, (b) 10 k, and (c) 1 M cities. The clustered LK algorithm achieves a significant speedup and shows good scalability.

D. Bioinformatics—Gene Expression Data

Recently, advances in genome sequencing projects and DNA microarray technologies have been achieved. The first draft of the human genome sequence project was completed in 2001, several years earlier than expected [65], [275]. The genomic sequence data for other organisms (e.g., Drosophila melanogaster and Escherichia coli) are also abundant. DNA microarray technologies provide an effective and efficient way to measure gene expression levels of thousands of genes simultaneously under different conditions and tissues, which makes it possible to investigate gene activities from the angle of the whole genome [79], [188]. With sequences and gene expression data in hand, investigating the functions of genes and identifying their roles in the genetic process become increasingly important. Analyses under traditional laboratory techniques are time-consuming and expensive. They fall far behind the explosively increasing generation of new data. Among the large number of computational methods used to accelerate the exploration of life science, clustering can reveal the hidden structures of biological data, and is particularly useful for helping biologists investigate and understand the activities of uncharacterized genes and proteins and, further, the systematic architecture of the whole genetic network. We demonstrate the applications of clustering algorithms in bioinformatics from two aspects. The first part is based on the analysis of gene expression data generated from DNA microarray technologies. The second part describes clustering processes that directly work on linear DNA or protein sequences. The assumption is that functionally similar genes or proteins usually share similar patterns or primary sequence structures.

DNA microarray technologies generate many gene expression profiles. Currently, there are two major microarray technologies based on the nature of the attached DNA: cDNA with length varying from several hundred to a thousand bases, or oligonucleotides containing 20-30 bases. For cDNA technologies, a DNA microarray consists of a solid substrate to which a large amount of cDNA clones are attached according to a certain order [79]. Fluorescently labeled cDNA, obtained from RNA samples of interest through the process of reverse transcription, is hybridized with the array. A reference sample with a different fluorescent label is also needed for comparison. Image analysis techniques are then used to measure the fluorescence of each dye, and the ratio reflects relative levels of gene expression. For a high-density oligonucleotide microarray, oligonucleotides are fixed on a chip through photolithography or solid-phase DNA synthesis [188]. In this case, absolute gene expression levels are obtained. After the normalization of the fluorescence intensities, the gene expression profiles are represented as a matrix X = (x_ij), where x_ij is the expression level of the ith gene in the jth condition, tissue, or experimental stage. Gene expression data analysis consists of a three-level framework based on the complexity, ranging from the investigation of single gene activities to the inference of the entire genetic network [20]. The intermediate level explores the relations and interactions between genes under different conditions, and currently attracts the most attention. Generally, cluster analysis of gene expression data is composed of two aspects: clustering genes [80], [206], [260], [268], [283], [288] or clustering tissues or experiments [5], [109], [238].
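To make the gene-clustering aspect concrete, the sketch below applies average-linkage hierarchical clustering with a correlation-based distance (one minus the Pearson coefficient) to the rows of a small synthetic expression matrix, in the spirit of the approach of Eisen et al. [80] discussed next; numpy and scipy are assumed and the data are purely illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# synthetic expression matrix: 40 "genes" (rows) x 8 "conditions" (columns),
# built from two underlying expression profiles plus noise
profiles = rng.normal(size=(2, 8))
genes = np.vstack([profiles[i % 2] + 0.3 * rng.normal(size=8) for i in range(40)])

# 'correlation' distance = 1 - Pearson correlation, so co-expressed genes are close
Z = linkage(genes, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)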


Results of gene clustering may suggest that genes in the same group have similar functions, or that they share the same transcriptional regulation mechanism. Cluster analysis, for grouping functionally similar genes, gradually became popular after the successful application of the average-linkage hierarchical clustering algorithm to the expression data of the budding yeast Saccharomyces cerevisiae and the reaction of human fibroblasts to serum by Eisen et al. [80]. They used the Pearson correlation coefficient to measure the similarity between two genes, and provided a very informative visualization of the clustering results. Their results demonstrate that functionally similar genes tend to reside in the same clusters formed by their expression patterns, even under a relatively small set of conditions. Herwig et al. developed a variant of the K-means algorithm to cluster a set of 2029 human cDNA clones and adopted mutual information as the similarity measure [230]. Tomayo et al. [268] made use of SOFM to cluster gene expression data, and its application in hematopoietic differentiation provided new insight for further research. Graph-theory-based clustering algorithms, like CAST [29] and CLICK [247], showed very promising performance in tackling different types of gene expression data. Since many genes usually display more than one function, fuzzy clustering may be more effective in exposing these relations [73]. Gene expression data are also important to elucidate the genetic regulation mechanism in a cell. By examining the corresponding DNA sequences in the control regions of a cluster of co-expressed genes, we may identify potential short and consensus sequence patterns, known as motifs, and further investigate their interaction with transcriptional binding factors, leading to different gene activities. Spellman et al. clustered 800 genes according to their expression during the yeast cell cycle [260]. Analyses of 8 major gene clusters unravel the connection between co-expression and co-regulation. Tavazoie et al. partitioned 3000 genes into 30 clusters with the K-means algorithm [269]. For each cluster, 600 base pairs of upstream sequences of the genes were searched for potential motifs. 18 motifs were found from 12 clusters in their experiments, and 7 of them can be verified according to previous empirical results in the literature. A more comprehensive investigation can be found in [206].

As to another application, clustering tissues or experiments is valuable in identifying samples that are in different disease states, discovering or predicting different cancer types, and evaluating the effects of novel drugs and therapies [5], [109], [238]. Golub et al. described the restriction of traditional cancer classification methods, which are mostly dependent on the morphological appearance of tumors, and divided cancer classification into two challenges: class discovery and class prediction. They utilized SOFM to discriminate two types of human acute leukemias: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) [109]. According to their results, two subsets of ALL, with different origins of lineage, can be well separated. Alon et al. performed a two-way clustering for both tissues and genes and revealed the potential relations, represented as visualizing patterns, among them [6]. Alizadeh et al. demonstrated the effectiveness of molecular classification of cancers by their gene expression profiles and successfully distinguished two molecularly distinct subtypes of diffuse large B-cell lymphoma, which cause a high percentage of failures in clinical treatment [5]. Furthermore, Scherf et al. constructed a gene expression database to study the relationship between genes and drugs for 60 human cancer cell lines, which provides an important criterion for therapy selection and drug discovery [238]. Other applications of clustering algorithms for tissue classification include mixtures of multivariate Gaussian distributions [105], ellipsoidal ART [287], and graph theory-based methods [29], [247]. In most of these applications, important genes that are tightly related to the tumor types are identified according to their expression differentiation under the different cancerous categories, which are in accord with our prior recognition of the roles of these genes, to a large extent [5], [109]. For example, Alon et al. found that 5 of 20 statistically significant genes were muscle genes, and the corresponding muscle indices provided an explanation for false classifications [6].

Fig. 7 illustrates an application of hierarchical clustering and SOFM to gene expression data. This data set is from the diagnostic research of small round blue-cell tumors (SRBCTs) of childhood and consists of 83 samples from four categories, known as Burkitt lymphomas (BL), the Ewing family of tumors (EWS), neuroblastoma (NB), and rhabdomyosarcoma (RMS), and 5 non-SRBCT samples [164]. Gene expression levels of 6567 genes were measured using cDNA microarrays for each sample, 2308 of which passed the filter and were kept for further analyses. These genes are further ranked according to the scores calculated by some criterion functions [109]. Generally, these criterion functions attempt to seek a subset of genes that contribute most to the discrimination of different cancer types. This can be regarded as a feature selection process. However, problems like how many genes are really required, and whether these selected genes are really biologically meaningful, are still not answered satisfactorily. Hierarchical clustering was performed by the program CLUSTER and the results were visualized by the program TreeView, developed by Eisen at Stanford University. Fig. 7(a) and (b) depicts the clustering results for both the top 100 genes, selected by the Fisher scores, and the samples. Graphic visualization is achieved by associating each data point with a certain color according to the corresponding scale. Some clustering patterns are clearly displayed in the image. Fig. 7(c) depicts a 5-by-5 SOFM topology for all genes, with each cluster represented by the centroid (mean) for each feature (sample). 25 clusters are generated, and the number of genes in each cluster is also indicated. The software package GeneCluster, developed by the Whitehead Institute/MIT Center for Genome Research (WICGR), was used in this analysis.
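As a rough counterpart to the 5-by-5 SOFM step described above, the following is a minimal self-organizing map written directly in numpy (rectangular grid, Gaussian neighborhood, decaying learning rate and radius). It is not the GeneCluster implementation, and the 200-by-8 "expression" matrix is synthetic.

import numpy as np

def train_som(data, grid=(5, 5), n_epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Tiny self-organizing feature map: each grid node holds a prototype vector;
    the best-matching node and its grid neighbors are pulled toward each sample."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.normal(size=(rows * cols, data.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    n_steps = n_epochs * len(data)
    step = 0
    for epoch in range(n_epochs):
        for x in rng.permutation(data):
            frac = step / n_steps
            lr = lr0 * (1 - frac)                   # decaying learning rate
            sigma = sigma0 * (1 - frac) + 1e-3      # shrinking neighborhood radius
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-dist2 / (2 * sigma ** 2))   # Gaussian neighborhood function
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

data = np.random.default_rng(1).normal(size=(200, 8))   # e.g., 200 genes x 8 conditions
prototypes = train_som(data, grid=(5, 5))
clusters = np.argmin(((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2), axis=1)
print(np.bincount(clusters, minlength=25))   # genes assigned to each of the 25 map nodes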


Fig. 6. Basic procedure of cDNA microarray technology [68]. Fluorescently labeled cDNAs, obtained from target and reference samples through reverse transcription, are hybridized with the microarray, which is comprised of a large amount of cDNA clones. Image analysis measures the ratio of the two dyes. Computational methods, e.g., hierarchical clustering, further disclose the relations among genes and corresponding conditions.

Fig. 7. Hierarchical and SOFM clustering of the SRBCT gene expression data set. (a) Hierarchical clustering result for the 100 selected genes under 83 tissue samples. The gene expression matrix is visualized through a color scale. (b) Hierarchical clustering result for the 83 tissue samples. Here, the dimension is 100, as 100 genes are selected as in (a). (c) SOFM clustering result for the 2308 genes. A 5 x 5 SOFM is used and 25 clusters are formed. Each cluster is represented by the average values.


Although clustering techniques have already achieved many impressive results in the analysis of gene expression data, there are still many problems that remain open. Gene expression data sets usually are characterized as:
1) small sets of samples with high-dimensional features;
2) high redundancy;
3) inherent noise;
4) sparsity of the data.
Most of the published data sets usually include fewer than 20 samples for each tumor type, but with as many as thousands of gene measures [80], [109], [238], [268]. This is partly caused by the lag of experimental conditions (e.g., sample collection), in contrast to the rapid advancement of microarray and sequencing technologies. In order to evaluate existing algorithms more reasonably and develop more effective new approaches, more data with enough samples or more conditional observations are needed. But from the trend of gene chip technologies, which also follow Moore's law for semiconductor chips [205], the current status will still exist for a long time. This problem is more serious in the application of gene expression data to cancer research, in which clustering algorithms are required to be capable of effectively finding potential patterns under a large number of irrelevant factors, as a result of the introduction of too many genes. At the same time, feature selection, which is also called informative gene selection in this context, also plays a very important role. Without any doubt, clustering algorithms should be feasible in both time and space complexity. Due to the nature of the manufacturing process of the microarray chip, noise can inevitably be introduced into the expression data during different stages. Accordingly, clustering algorithms should have noise and outlier detection mechanisms in order to remove their effects. Furthermore, different algorithms usually form different clusters for the same data set, which is a general problem in cluster analysis. How to evaluate the quality of the generated clusters of genes, and how to choose appropriate algorithms for a specified application, are particularly crucial for gene expression data research, because sometimes even biologists cannot identify the real patterns from the artifacts of the clustering algorithms, due to the limitations of biological knowledge. Some recent results can be accessed in [29], [247], and [291].

E. Bioinformatics—DNA or Protein Sequences Clustering

DNA (deoxyribonucleic acid) is the hereditary material existing in all living cells. A DNA molecule is a double helix consisting of two strands, each of which is a linear sequence composed of four different nucleotides—adenine, guanine, thymine, and cytosine, abbreviated as the letters A, G, T, and C, respectively. Each letter in a DNA sequence is also called a base. Proteins determine most of cells' structures, functions, properties, and regulatory mechanisms. The primary structure of a protein is also a linear and alphabetic chain, with the difference that each unit represents an amino acid, of which there are twenty types in total. Proteins are encoded by certain segments of DNA sequences through a two-stage process (transcription and translation). These segments are known as genes or coding regions. Investigation of the relations between DNA and proteins, as well as their own functions and properties, is one of the important research directions in both genetics and bioinformatics.

The similarity between newly sequenced genes or proteins and annotated genes or proteins usually offers a cue to identify their functions. Searching corresponding databases for a new DNA or protein sequence has already become routine in genetic research. In contrast to sequence comparison and search, cluster analysis provides a more effective means to discover complicated relations among DNA and protein sequences. We summarize the following clustering applications for DNA and protein sequences:
1) function recognition of uncharacterized genes or proteins [119];
2) structure identification of large-scale DNA or protein databases [237], [257];
3) redundancy decrease of large-scale DNA or protein databases [185];
4) domain identification [83], [115];
5) expressed sequence tag (EST) clustering [49], [200].

As described in Section II-J, classical dynamic programming algorithms for global and local sequence alignment are too intensive in computational complexity. This becomes worse because of the existence of a large volume of nucleic acids and amino acids in the current DNA or protein databases, e.g., bacteria genomes range from 0.5 to 10 Mbp, fungi genomes range from 10 to 50 Mbp, while the human genome is around 3310 Mbp [18] (Mbp means million base pairs). Thus, conventional dynamic programming algorithms are computationally infeasible. In practice, sequence comparison or proximity measurement is achieved via some heuristics. Well-known examples include BLAST and FASTA with many variants [10], [11], [224]. The key idea of these methods is to identify regions that may have potentially high matches, with a list of prespecified high-scoring words, at an early stage. Therefore, further search only needs to focus on these regions with expensive but accurate algorithms. Recognizing the benefit coming from the separation of word matching and sequence alignment for computational burden reduction, Miller, Gurd, and Brass described three algorithms focusing on specific problems [199]. The implementation of the scheme for large database vs. database comparison exhibits an apparent improvement in computation time. Kent and Zahler designed a three-pass algorithm, called wobble aware bulk aligner (WABA) [162], for aligning large-scale genomic sequences of different species, which employs a seven-state pairwise hidden Markov model [78] for more effective alignments. In [201], Miller summarized the current research status of genomic sequence comparison and suggested valuable directions for further research efforts.

Many clustering techniques have been applied to organize DNA or protein sequence data. Some directly operate on a proximity measure; some are based on feature extraction, while others are constructed on statistical models. Somervuo and Kohonen illustrated an application of SOFM to cluster protein sequences in the SWISSPROT database [257]. FASTA was used to calculate the sequence similarity. The resulting two-dimensional SOFM provides a visualized representation of the relations within the entire sequence database. Based on the similarity measure of gapped BLAST, Sasson et al. utilized an agglomerative hierarchical clustering paradigm to cluster all protein sequences in SWISSPROT [237]. The effects of four merging rules, differing in the interpretation of cluster centers, on the resulting protein clusters were examined. The advantages as well as the potential risk of the concept of transitivity were also elucidated in the paper. According to the transitivity relation, two sequences that do not show high sequence similarity by virtue of direct comparison may be homologous (having a common ancestor) if there exists an intermediate sequence similar to both of them. This makes it possible to detect remote homologues that cannot be observed by similarity comparison. However, unrelated sequences may be clustered together due to the effects of these intermediate sequences [237]. Bolten et al. addressed the problem with the construction of a directed graph, in which each protein sequence corresponds to a vertex and edges are weighted based on the alignment score between two sequences and the self-alignment score of each sequence [41]. Clusters were formed through the search for strongly connected components (SCCs), each of which is a maximal subset of vertices such that, for each pair of vertices u and v in the subset, there exist directed paths from u to v and vice versa. A minimum normalized cut algorithm for detecting protein families and a minimum spanning tree (MST) application for seeking domain information were presented in [1] and [115], respectively. In contrast with the aforementioned proximity-based methods, Guralnik and Karypis transformed protein or DNA sequences into a new feature space, based on detected subpatterns working as the sequence features, and clustered them with the K-means algorithm [119]. The method is immune from all-against-all expensive sequence comparison and is suitable for analyzing large-scale databases.
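A feature-space formulation of the kind used by Guralnik and Karypis [119] can be sketched as follows: each sequence is mapped to a vector of k-mer counts and the vectors are clustered with K-means. The snippet below is a simplified stand-in, not their algorithm (fixed-length 3-mers over toy DNA strings), and assumes numpy and scikit-learn.

import itertools
import numpy as np
from sklearn.cluster import KMeans

def kmer_counts(seq, k=3, alphabet="ACGT"):
    """Represent a DNA sequence as a normalized vector of overlapping k-mer counts."""
    kmers = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1
    return v / max(len(seq) - k + 1, 1)

# toy sequences: two AT-rich and two GC-rich strings (purely illustrative)
seqs = ["ATATATATTTAATA", "TTATATAATATATT", "GCGCGGCCGCGGCG", "CGGCGCGCCGGCGC"]
X = np.array([kmer_counts(s) for s in seqs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the AT-rich and GC-rich sequences should fall into different clusters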


Krogh demonstrated the power of hidden Markov models (HMMs) in biological sequence modeling and in the clustering of protein families [177]. Fig. 8 depicts a typical structure of an HMM, in which match states (abbreviated with the letter M), insert states (I), and delete states (D) are represented as rectangles, diamonds, and circles, respectively [78], [177]. These states correspond to substitution, insertion, and deletion in edit operations. For convenience, a begin state and an end state are added to the model, denoted by the letters B and E. Letters, either from the 4-letter nucleotide alphabet or from the 20-letter amino acid alphabet, are generated from match and insert states according to some emission probability distributions. Delete states do not produce any symbols, and are used to skip the match states. HMMs are required in order to describe clusters, or families (subfamilies), which are regarded as a mixture model and processed with an EM learning algorithm similar to the single-HMM case. An example of clustering subfamilies of 628 globins shows encouraging results. Further discussion can be found in [78] and [145].

Fig. 8. HMM architecture [177]. There are three different states, match (M), insert (I), and delete (D), corresponding to the substitution, insertion, and deletion operations, respectively. A begin (B) and end (E) state are also introduced to represent the start and end of the process. The process goes through a series of states according to the transition probability, and emits either the 4-letter nucleotide or the 20-letter amino acid alphabet based on the emission probability.

IV. CONCLUSION

As an important tool for data exploration, cluster analysis examines unlabeled data, by either constructing a hierarchical structure or forming a set of groups according to a prespecified number. This process includes a series of steps, ranging from preprocessing and algorithm development to solution validity and evaluation. Each of them is tightly related to the others and poses great challenges to the scientific disciplines. Here, we place the focus on the clustering algorithms and review a wide variety of approaches appearing in the literature. These algorithms evolve from different research communities, aim to solve different problems, and have their own pros and cons. Though we have already seen many examples of successful applications of cluster analysis, there still remain many open problems due to the existence of many inherent uncertain factors. These problems have already attracted and will continue to attract intensive efforts from broad disciplines. We summarize and conclude the survey by listing some important issues and research trends for clustering algorithms.
1) There is no clustering algorithm that can be universally used to solve all problems. Usually, algorithms are designed with certain assumptions and favor some type of biases. In this sense, it is not accurate to say "best" in the context of clustering algorithms, although some comparisons are possible. These comparisons are mostly based on some specific applications, under certain conditions, and the results may become quite different if the conditions change.
2) New technology has generated more complex and challenging tasks, requiring more powerful clustering algorithms. The following properties are important to the efficiency and effectiveness of a novel algorithm:
I) generate arbitrary shapes of clusters rather than be confined to some particular shape;
II) handle a large volume of data as well as high-dimensional features with acceptable time and storage complexities;
III) detect and remove possible outliers and noise;
IV) decrease the reliance of algorithms on user-dependent parameters;
V) have the capability of dealing with newly occurring data without relearning from scratch;
VI) be immune to the effects of the order of input patterns;
VII) provide some insight into the number of potential clusters without prior knowledge;
VIII) show good data visualization and provide users with results that can simplify further analysis;
IX) be capable of handling both numerical and nominal data or be easily adaptable to some other data types.
Of course, some more detailed requirements for specific applications will affect these properties.
3) At the preprocessing and post-processing phases, feature selection/extraction (as well as standardization and normalization) and cluster validation are as important as the clustering algorithms. Choosing appropriate and meaningful features can greatly reduce the burden of subsequent designs, and result evaluations reflect the degree of confidence to which we can rely on the generated clusters. Unfortunately, both processes lack universal guidance. Ultimately, the tradeoff among different criteria and methods is still dependent on the applications themselves.

ACKNOWLEDGMENT

The authors would like to thank the Eisen Laboratory at Stanford University for use of their CLUSTER and TreeView software and the Whitehead Institute/MIT Center for Genome Research for use of their GeneCluster software. They would also like to thank S. Mulder for the part on the traveling salesman problem and also acknowledge extensive comments from the reviewers and the anonymous associate editor.

REFERENCES

[1] F. Abascal and A. Valencia, "Clustering of proximal sequence space for the identification of protein families," Bioinformatics, vol. 18, pp. 908-921, 2002.
[2] C. Aggarwal and P. Yu, "Redefining clustering for high-dimensional applications," IEEE Trans. Knowl. Data Eng., vol. 14, no. 2, pp. 210-225, Feb. 2002.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," in Proc. ACM SIGMOD Int. Conf. Management of Data, 1998, pp. 94-105.
[4] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Autom. Control, vol. AC-19, no. 6, pp. 716-722, Dec. 1974.
[5] A. Alizadeh et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature, vol. 403, pp. 503-511, 2000.
[6] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proc. Nat. Acad. Sci. USA, pp. 6745-6750, 1999.


[7] C. Alpert and A. Kahng, “Multi-way partitioning via spacefilling curves [36] J. Bezdek and R. Hathaway, “Numerical convergence and interpretation
and dynamic programming,” in Proc. 31st ACM/IEEE Design Automa- c
of the fuzzy -shells clustering algorithms,” IEEE Trans. Neural Netw.,
tion Conf., 1994, pp. 652–657. vol. 3, no. 5, pp. 787–793, Sep. 1992.
[8] , “Recent directions in netlist partitioning: A survey,” VLSI J., vol. [37] J. Bezdek and N. Pal, “Some new indexes of cluster validity,” IEEE
19, pp. 1–81, 1995. Trans. Syst., Man, Cybern. B, Cybern., vol. 28, no. 3, pp. 301–315, Jun.
[9] K. Al-Sultan, “A Tabu search approach to the clustering problem,” Pat- 1998.
tern Recognit., vol. 28, no. 9, pp. 1443–1451, 1995. [38] C. Bishop, Neural Networks for Pattern Recognition. New York: Ox-
ford Univ. Press, 1995.
l l
[10] S. Altschul et al., “Gapped BLAST and PSI-BLAST: A new generation
of protein database search programs,” Nucleic Acids Res., vol. 25, pp. [39] L. Bobrowski and J. Bezdek, “c-Means clustering with the and
3389–3402, 1997. norms,” IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 545–554,
[11] S. Altschul et al., “Basic local alignment search tool,” J. Molec. Biol., May-Jun. 1991.
vol. 215, pp. 403–410, 1990. [40] H. Bock, “Probabilistic models in cluster analysis,” Comput. Statist.
[12] G. Anagnostopoulos and M. Georgiopoulos, “Hypersphere ART and Data Anal., vol. 23, pp. 5–28, 1996.
ARTMAP for unsupervised and supervised incremental learning,” in [41] E. Bolten, A. Sxhliep, S. Schneckener, D. Schomburg, and R. Schrader,
Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Networks (IJCNN’00), “Clustering protein sequences—Structure prediction by transitive ho-
vol. 6, Como, Italy, pp. 59–64. mology,” Bioinformatics, vol. 17, pp. 935–941, 2001.
[13] , “Ellipsoid ART and ARTMAP for incremental unsupervised and [42] N. Boujemaa, “Generalized competitive clustering for image segmen-
supervised learning,” in Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural tation,” in Proc. 19th Int. Meeting North American Fuzzy Information
Processing Soc. (NAFIPS’00), Atlanta, GA, 2000, pp. 133–137.
Networks (IJCNN’01), vol. 2, Washington, DC, 2001, pp. 1221–1226.
[14] M. Anderberg, Cluster Analysis for Applications. New York: Aca-
K
[43] P. Bradley and U. Fayyad, “Refining initial points for -means clus-
tering,” in Proc. 15th Int. Conf. Machine Learning, 1998, pp. 91–99.
demic, 1973.
[44] P. Bradley, U. Fayyad, and C. Reina, “Scaling clustering algorithms to
[15] G. Babu and M. Murty, “A near-optimal initial seed value selection in
K -means algorithm using a genetic algorithm,” Pattern Recognit. Lett.,
large databases,” in Proc. 4th Int. Conf. Knowledge Discovery and Data
Mining (KDD’98), 1998, pp. 9–15.
vol. 14, no. 10, pp. 763–769, 1993. [45] , “Clustering very large databases using EM mixture models,” in
[16] , “Clustering with evolution strategies,” Pattern Recognit., vol. 27, Proc. 15th Int. Conf. Pattern Recognition, vol. 2, 2000, pp. 76–80.
no. 2, pp. 321–329, 1994. [46] , “Clustering very large databases using EM mixture models,” in
[17] E. Backer and A. Jain, “A clustering performance measure based on Proc. 15th Int. Conf. Pattern Recognition, vol. 2, 2000, pp. 76–80.
fuzzy set decomposition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. [47] D. Brown and C. Huntley, “A practical application of simulated an-
PAMI-3, no. 1, pp. 66–75, Jan. 1981. nealing to clustering,” Pattern Recognit., vol. 25, no. 4, pp. 401–412,
[18] P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Ap- 1992.
proach, 2nd ed. Cambridge, MA: MIT Press, 2001. [48] C. Burges, “A tutorial on support vector machines for pattern recogni-
[19] P. Baldi and K. Hornik, “Neural networks and principal component anal- tion,” Data Mining Knowl. Discov., vol. 2, pp. 121–167, 1998.
ysis: Learning from examples without local minima,” Neural Netw., vol. [49] J. Burke, D. Davison, and W. Hide, “d2 Cluster: A validated method for
2, pp. 53–58, 1989. clustering EST and full-length cDNA sequences,” Genome Res., vol. 9,
[20] P. Baldi and A. Long, “A Bayesian framework for the analysis of mi- pp. 1135–1142, 1999.
croarray expression data: Regularized t-test and statistical inferences of [50] I. Cadez, S. Gaffney, and P. Smyth, “A general probabilistic framework
gene changes,” Bioinformatics, vol. 17, pp. 509–519, 2001. for clustering individuals and objects,” in Proc. 6th ACM SIGKDD Int.
[21] G. Ball and D. Hall, “A clustering technique for summarizing multi- Conf. Knowledge Discovery and Data Mining, 2000, pp. 140–149.
variate data,” Behav. Sci., vol. 12, pp. 153–155, 1967. [51] G. Carpenter and S. Grossberg, “A massively parallel architecture for
[22] S. Bandyopadhyay and U. Maulik, “Nonparametric genetic clustering: a self-organizing neural pattern recognition machine,” Comput. Vis.
Comparison of validity indices,” IEEE Trans. Syst., Man, Cybern. C, Graph. Image Process., vol. 37, pp. 54–115, 1987.
Appl. Rev., vol. 31, no. 1, pp. 120–125, Feb. 2001. [52] , “ART2: Self-organization of stable category recognition codes for
[23] A. Baraldi and E. Alpaydin, “Constructive feedforward ART clustering analog input patterns,” Appl. Opt., vol. 26, no. 23, pp. 4919–4930, 1987.
networks—Part I and II,” IEEE Trans. Neural Netw., vol. 13, no. 3, pp. [53] , “The ART of adaptive pattern recognition by a self-organizing
645–677, May 2002. neural network,” IEEE Computer, vol. 21, no. 3, pp. 77–88, Mar. 1988.
[24] A. Baraldi and P. Blonda, “A survey of fuzzy clustering algorithms for [54] , “ART3: Hierarchical search using chemical transmitters in self-
pattern recognition—Part I and II,” IEEE Trans. Syst., Man, Cybern. B, organizing pattern recognition architectures,” Neural Netw., vol. 3, no.
Cybern., vol. 29, no. 6, pp. 778–801, Dec. 1999. 23, pp. 129–152, 1990.
[25] A. Baraldi and L. Schenato, “Soft-to-hard model transition in clustering: [55] G. Carpenter, S. Grossberg, N. Markuzon, J. Reynolds, and D. Rosen,
A review,”, Tech. Rep. TR-99-010, 1999. “Fuzzy ARTMAP: A neural network architecture for incremental super-
[26] D. Barbará and P. Chen, “Using the fractal dimension to cluster datasets,” vised learning of analog multidimensional maps,” IEEE Trans. Neural
in Proc. 6th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Netw., vol. 3, no. 5, pp. 698–713, 1992.
Mining, 2000, pp. 260–264. [56] G. Carpenter, S. Grossberg, and J. Reynolds, “ARTMAP: Supervised
[27] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques real-time learning and classification of nonstationary data by a self-or-
for embedding and clustering,” in Advances in Neural Information ganizing neural network,” Neural Netw., vol. 4, no. 5, pp. 169–181, 1991.
[57] G. Carpenter, S. Grossberg, and D. Rosen, “Fuzzy ART: Fast stable
Processing Systems, T. G. Dietterich, S. Becker, and Z. Ghahramani,
learning and categorization of analog patterns by an adaptive resonance
Eds. Cambridge, MA: MIT Press, 2002, vol. 14.
system,” Neural Netw., vol. 4, pp. 759–771, 1991.
[28] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton,
[58] G. Celeux and G. Govaert, “A classification EM algorithm for clustering
NJ: Princeton Univ. Press, 1961. and two stochastic versions,” Comput. Statist. Data Anal., vol. 14, pp.
[29] A. Ben-Dor, R. Shamir, and Z. Yakhini, “Clustering gene expression 315–332, 1992.
patterns,” J. Comput. Biol., vol. 6, pp. 281–297, 1999. [59] P. Cheeseman and J. Stutz, “Bayesian classification (AutoClass):
[30] Y. Bengio, “Markovian models for sequential data,” Neural Comput. Theory and results,” in Advances in Knowledge Discovery and Data
Surv., vol. 2, pp. 129–162, 1999. Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
[31] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik, “Support vector Eds. Menlo Park, CA: AAAI Press, 1996, pp. 153–180.
clustering,” J. Mach. Learn. Res., vol. 2, pp. 125–137, 2001. [60] V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory,
[32] , “A support vector clustering method,” in Proc. Int. Conf. Pattern and Methods. New York: Wiley, 1998.
Recognition, vol. 2, 2000, pp. 2724–2727. [61] J. Cherng and M. Lo, “A hypergraph based clustering algorithm for spa-
[33] P. Berkhin. (2001) Survey of clustering data mining techniques. [On- tial data sets,” in Proc. IEEE Int. Conf. Data Mining (ICDM’01), 2001,
line]. Available: http://www.accrue.com/products/rp_cluster_review.pdf pp. 83–90.
http://citeseer.nj.nec.com/berkhin02survey.html [62] J. Chiang and P. Hao, “A new kernel-based fuzzy clustering approach:
[34] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is nearest Support vector clustering with cell growing,” IEEE Trans. Fuzzy Syst.,
neighbor meaningful,” in Proc. 7th Int. Conf. Database Theory, 1999, vol. 11, no. 4, pp. 518–527, Aug. 2003.
pp. 217–235. K
[63] C. Chinrungrueng and C. Séquin, “Optimal adaptive -means algo-
[35] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algo- rithm with dynamic adjustment of learning rate,” IEEE Trans. Neural
rithms. New York: Plenum, 1981. Netw., vol. 6, no. 1, pp. 157–169, Jan. 1995.


[64] S. Chu and J. Roddick, “A clustering algorithm using the Tabu search [92] R. Fisher, “The use of multiple measurements in taxonomic problems,”
approach with simulated annealing,” in Data Mining II—Proceedings Annu. Eugenics, pt. II, vol. 7, pp. 179–188, 1936.
of Second International Conference on Data Mining Methods and [93] D. Fogel, “An introduction to simulated evolutionary optimization,”
Databases, N. Ebecken and C. Brebbia, Eds, Cambridge, U.K., 2000, IEEE Trans. Neural Netw., vol. 5, no. 1, pp. 3–14, Jan. 1994.
pp. 515–523. [94] E. Forgy, “Cluster analysis of multivariate data: Efficiency vs. inter-
[65] I. H. G. S. Consortium, “Initial sequencing and analysis of the human pretability of classifications,” Biometrics, vol. 21, pp. 768–780, 1965.
genome,” Nature, vol. 409, pp. 860–921, 2001. [95] C. Fraley and A. Raftery, “MCLUST: Software for model-based cluster
[66] J. Corchado and C. Fyfe, “A comparison of kernel methods for instan- analysis,” J. Classificat., vol. 16, pp. 297–306, 1999.
tiating case based reasoning systems,” Comput. Inf. Syst., vol. 7, pp. [96] , “Model-Based clustering, discriminant analysis, and density esti-
29–42, 2000. mation,” J. Amer. Statist. Assoc., vol. 97, pp. 611–631, 2002.
[67] M. Cowgill, R. Harvey, and L. Watson, “A genetic algorithm approach [97] J. Friedman, “Exploratory projection pursuit,” J. Amer. Statist. Assoc.,
to cluster analysis,” Comput. Math. Appl., vol. 37, pp. 99–108, 1999. vol. 82, pp. 249–266, 1987.
[68] C. Cummings and D. Relman, “Using DNA microarray to study host- [98] H. Frigui and R. Krishnapuram, “A robust competitive clustering algo-
microbe interactions,” Genomics, vol. 6, no. 5, pp. 513–525, 2000. rithm with applications in computer vision,” IEEE Trans. Pattern Anal.
[69] E. Dahlhaus, “Parallel algorithms for hierarchical clustering and appli- Mach. Intell., vol. 21, no. 5, pp. 450–465, May 1999.
cations to split decomposition and parity graph recognition,” J. Algo- [99] B. Fritzke. (1997) Some competitive learning methods. [Online]. Avail-
rithms, vol. 36, no. 2, pp. 205–240, 2000. able: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/VDM/re-
c
[70] R. Davé, “Adaptive fuzzy -shells clustering and detection of ellipses,” search/gsn/JavaPaper
IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 643–662, Sep. 1992. [100] B. Gabrys and A. Bargiela, “General fuzzy min-max neural network for
[71] R. Davé and R. Krishnapuram, “Robust clustering methods: A unified clustering and classification,” IEEE Trans. Neural Netw., vol. 11, no. 3,
view,” IEEE Trans. Fuzzy Syst., vol. 5, no. 2, pp. 270–293, May 1997. pp. 769–783, May 2000.
[72] M. Delgado, A. Skármeta, and H. Barberá, “A Tabu search approach [101] V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French, “Clus-
to the fuzzy clustering problem,” in Proc. 6th IEEE Int. Conf. Fuzzy tering large datasets in arbitrary metric spaces,” in Proc. 15th Int. Conf.
Systems, vol. 1, 1997, pp. 125–130. Data Engineering, 1999, pp. 502–511.
c
[73] D. Dembélé and P. Kastner, “Fuzzy -means method for clustering mi- [102] I. Gath and A. Geva, “Unsupervised optimal fuzzy clustering,” IEEE
croarray data,” Bioinformatics, vol. 19, no. 8, pp. 973–980, 2003. Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 773–781, Jul. 1989.
[74] Handbook of Pattern Recognition and Computer Vision, C. Chen, L. [103] GenBank Release Notes 144.0.
Pau, and P. Wang, Eds., World Scientific, Singapore, 1993, pp. 3–32. R. [104] A. Geva, “Hierarchical unsupervised fuzzy clustering,” IEEE Trans.
Dubes, “Cluster analysis and related issue”. Fuzzy Syst., vol. 7, no. 6, pp. 723–733, Dec. 1999.
[75] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New [105] D. Ghosh and A. Chinnaiyan, “Mixture modeling of gene expression
York: Wiley, 2001. data from microarray experiments,” Bioinformatics, vol. 18, no. 2, pp.
[76] J. Dunn, “A fuzzy relative of the ISODATA process and its use in de- 275–286, 2002.
tecting compact well separated clusters,” J. Cybern., vol. 3, no. 3, pp. [106] A. Ghozeil and D. Fogel, “Discovering patterns in spatial data using
32–57, 1974. evolutionary programming,” in Proc. 1st Annu. Conf. Genetic Program-
[77] B. Duran and P. Odell, Cluster Analysis: A Survey. New York: ming, 1996, pp. 512–520.
Springer-Verlag, 1974. [107] M. Girolami, “Mercer kernel based clustering in feature space,” IEEE
[78] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Trans. Neural Netw., vol. 13, no. 3, pp. 780–784, May 2002.
Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cam- [108] F. Glover, “Tabu search, part I,” ORSA J. Comput., vol. 1, no. 3, pp.
bridge, U.K.: Cambridge Univ. Press, 1998. 190–206, 1989.
[79] M. Eisen and P. Brown, “DNA arrays for analysis of gene expression,” [109] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov,
Methods Enzymol., vol. 303, pp. 179–205, 1999. H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E.
[80] M. Eisen, P. Spellman, P. Brown, and D. Botstein, “Cluster analysis and Lander, “Molecular classification of cancer: Class discovery and class
display of genome-wide expression patterns,” in Proc. Nat. Acad. Sci. prediction by gene expression monitoring,” Science, vol. 286, pp.
USA, vol. 95, 1998, pp. 14 863–14 868. 531–537, 1999.
[81] Y. El-Sonbaty and M. Ismail, “Fuzzy clustering for symbolic data,” IEEE [110] A. Gordon, “Cluster validation,” in Data Science, Classification, and Re-
Trans. Fuzzy Syst., vol. 6, no. 2, pp. 195–204, May 1998. lated Methods, C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H. Bock,
[82] T. Eltoft and R. deFigueiredo, “A new neural network for cluster-de- and Y. Bada, Eds. New York: Springer-Verlag, 1998, pp. 22–39.
tection-and-labeling,” IEEE Trans. Neural Netw., vol. 9, no. 5, pp. [111] , Classification, 2nd ed. London, U.K.: Chapman & Hall, 1999.
1021–1035, Sep. 1998. [112] J. Gower, “A general coefficient of similarity and some of its properties,”
[83] A. Enright and C. Ouzounis, “GeneRAGE: A robust algorithm for se- Biometrics, vol. 27, pp. 857–872, 1971.
quence clustering and domain detection,” Bioinformatics, vol. 16, pp. [113] S. Grossberg, “Adaptive pattern recognition and universal encoding II:
451–457, 2000. Feedback, expectation, olfaction, and illusions,” Biol. Cybern., vol. 23,
[84] S. Eschrich, J. Ke, L. Hall, and D. Goldgof, “Fast accurate fuzzy clus- pp. 187–202, 1976.
tering through data reduction,” IEEE Trans. Fuzzy Syst., vol. 11, no. 2, [114] P. Grünwald, P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri,
pp. 262–270, Apr. 2003. “Minimum encoding approaches for predictive modeling,” in Proc. 14th
[85] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A density-based algorithm Int. Conf. Uncertainty in AI (UAI’98), 1998, pp. 183–192.
for discovering clusters in large spatial databases with noise,” in Proc. [115] X. Guan and L. Du, “Domain identification by clustering sequence
2nd Int. Conf. Knowledge Discovery and Data Mining (KDD’96), 1996, alignments,” Bioinformatics, vol. 14, pp. 783–788, 1998.
pp. 226–231. [116] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering algo-
[86] V. Estivill-Castro and I. Lee, “AMOEBA: Hierarchical clustering rithm for large databases,” in Proc. ACM SIGMOD Int. Conf. Manage-
based on spatial proximity using Delaunay diagram,” in Proc. 9th Int. ment of Data, 1998, pp. 73–84.
Symp. Spatial Data Handling (SDH’99), Beijing, China, 1999, pp. [117] , “ROCK: A robust clustering algorithm for categorical attributes,”
7a.26–7a.41. Inf. Syst., vol. 25, no. 5, pp. 345–366, 2000.
[87] V. Estivill-Castro and J. Yang, “A fast and robust general purpose clus- K
[118] S. Gupata, K. Rao, and V. Bhatnagar, “ -means clustering algorithm
tering algorithm,” in Proc. 6th Pacific Rim Int. Conf. Artificial Intelli- for categorical attributes,” in Proc. 1st Int. Conf. Data Warehousing and
gence (PRICAI’00), R. Mizoguchi and J. Slaney, Eds., Melbourne, Aus- Knowledge Discovery (DaWaK’99), Florence, Italy, 1999, pp. 203–208.
tralia, 2000, pp. 208–218. [119] V. Guralnik and G. Karypis, “A scalable algorithm for clustering sequen-
[88] B. Everitt, S. Landau, and M. Leese, Cluster Analysis. London: tial data,” in Proc. 1st IEEE Int. Conf. Data Mining (ICDM’01), 2001,
Arnold, 2001. pp. 179–186.
[89] D. Fasulo, “An analysis of recent work on clustering algorithms,” Dept. [120] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer
Comput. Sci. Eng., Univ. Washington, Seattle, WA, Tech. Rep. 01-03-02, Science and Computational Biology. Cambridge, U.K.: Cambridge
1999. Univ. Press, 1997.
[90] M. Figueiredo and A. Jain, “Unsupervised learning of finite mixture [121] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity
models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. methods: Part I & II,” SIGMOD Record, vol. 31, no. 2–3, 2002.
381–396, Mar. 2002. [122] L. Hall, I. Özyurt, and J. Bezdek, “Clustering with a genetically opti-
[91] D. Fisher, “Knowledge acquisition via incremental conceptual clus- mized approach,” IEEE Trans. Evol. Comput., vol. 3, no. 2, pp. 103–112,
tering,” Mach. Learn., vol. 2, pp. 139–172, 1987. 1999.


[123] R. Hammah and J. Curran, “Validity measures for the fuzzy cluster analysis of orientations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1467–1472, Dec. 2000.
[124] P. Hansen and B. Jaumard, “Cluster analysis and mathematical programming,” Math. Program., vol. 79, pp. 191–215, 1997.
[125] P. Hansen and N. Mladenović, “J-means: A new local search heuristic for minimum sum of squares clustering,” Pattern Recognit., vol. 34, pp. 405–413, 2001.
[126] F. Harary, Graph Theory. Reading, MA: Addison-Wesley, 1969.
[127] J. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
[128] E. Hartuv and R. Shamir, “A clustering algorithm based on graph connectivity,” Inf. Process. Lett., vol. 76, pp. 175–181, 2000.
[129] R. Hathaway and J. Bezdek, “Fuzzy c-means clustering of incomplete data,” IEEE Trans. Syst., Man, Cybern., vol. 31, no. 5, pp. 735–744, 2001.
[130] R. Hathaway, J. Bezdek, and Y. Hu, “Generalized fuzzy c-means clustering strategies using Lp norm distances,” IEEE Trans. Fuzzy Syst., vol. 8, no. 5, pp. 576–582, Oct. 2000.
[131] B. Hay, G. Wets, and K. Vanhoof, “Clustering navigation patterns on a website using a sequence alignment method,” in Proc. Intelligent Techniques for Web Personalization: 17th Int. Joint Conf. Artificial Intelligence, 2001, pp. 1–6.
[132] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[133] Q. He, “A review of clustering algorithms as applied to IR,” Univ. Illinois at Urbana-Champaign, Tech. Rep. UIUCLIS-1999/6+IRG, 1999.
[134] M. Healy, T. Caudell, and S. Smith, “A neural architecture for pattern sequence verification through inferencing,” IEEE Trans. Neural Netw., vol. 4, no. 1, pp. 9–20, Jan. 1993.
[135] A. Hinneburg and D. Keim, “An efficient approach to clustering in large multimedia databases with noise,” in Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD’98), 1998, pp. 58–65.
[136] ——, “Optimal grid-clustering: Toward breaking the curse of dimensionality in high-dimensional clustering,” in Proc. 25th VLDB Conf., 1999, pp. 506–517.
[137] F. Hoeppner, “Fuzzy shell clustering algorithms in image processing: Fuzzy c-rectangular and 2-rectangular shells,” IEEE Trans. Fuzzy Syst., vol. 5, no. 4, pp. 599–613, Nov. 1997.
[138] J. Hoey, “Clustering contextual facial display sequences,” in Proc. 5th IEEE Int. Conf. Automatic Face and Gesture Recognition (FGR’02), 2002, pp. 354–359.
[139] T. Hofmann and J. Buhmann, “Pairwise data clustering by deterministic annealing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 1, pp. 1–14, Jan. 1997.
[140] J. Holland, Adaption in Natural and Artificial Systems. Ann Arbor, MI: Univ. Michigan Press, 1975.
[141] F. Höppner, F. Klawonn, and R. Kruse, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition. New York: Wiley, 1999.
[142] Z. Huang, “Extensions to the K-means algorithm for clustering large data sets with categorical values,” Data Mining Knowl. Discov., vol. 2, pp. 283–304, 1998.
[143] J. Huang, M. Georgiopoulos, and G. Heileman, “Fuzzy ART properties,” Neural Netw., vol. 8, no. 2, pp. 203–213, 1995.
[144] P. Huber, “Projection pursuit,” Ann. Statist., vol. 13, no. 2, pp. 435–475, 1985.
[145] R. Hughey and A. Krogh, “Hidden Markov models for sequence analysis: Extension and analysis of the basic method,” CABIOS, vol. 12, no. 2, pp. 95–107, 1996.
[146] M. Hung and D. Yang, “An efficient fuzzy c-means clustering algorithm,” in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 225–232.
[147] L. Hunt and J. Jorgensen, “Mixture model clustering using the MULTIMIX program,” Australia and New Zealand J. Statist., vol. 41, pp. 153–171, 1999.
[148] J. Hwang, J. Vlontzos, and S. Kung, “A systolic neural network architecture for hidden Markov models,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12, pp. 1967–1979, Dec. 1989.
[149] A. Hyvärinen, “Survey of independent component analysis,” Neural Comput. Surv., vol. 2, pp. 94–128, 1999.
[150] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[151] A. Jain, R. Duin, and J. Mao, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, 2000.
[152] A. Jain, M. Murty, and P. Flynn, “Data clustering: A review,” ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[153] D. Jiang, C. Tang, and A. Zhang, “Cluster analysis for gene expression data: A survey,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 11, pp. 1370–1386, Nov. 2004.
[154] C. Jutten and J. Herault, “Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture,” Signal Process., vol. 24, no. 1, pp. 1–10, 1991.
[155] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, “An efficient K-means clustering algorithm: Analysis and implementation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 881–892, Jul. 2000.
[156] N. Karayiannis, “A methodology for construction fuzzy algorithms for learning vector quantization,” IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 505–518, May 1997.
[157] N. Karayiannis, J. Bezdek, N. Pal, R. Hathaway, and P. Pai, “Repairs to GLVQ: A new family of competitive learning schemes,” IEEE Trans. Neural Netw., vol. 7, no. 5, pp. 1062–1071, Sep. 1996.
[158] J. Karhunen, E. Oja, L. Wang, R. Vigário, and J. Joutsensalo, “A class of neural networks for independent component analysis,” IEEE Trans. Neural Netw., vol. 8, no. 3, pp. 486–504, May 1997.
[159] G. Karypis, E. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” IEEE Computer, vol. 32, no. 8, pp. 68–75, Aug. 1999.
[160] R. Kathari and D. Pitts, “On finding the number of clusters,” Pattern Recognit. Lett., vol. 20, pp. 405–416, 1999.
[161] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, 1990.
[162] W. Kent and A. Zahler, “Conservation, regulation, synteny, and introns in a large-scale C. Briggsae—C. elegans genomic alignment,” Genome Res., vol. 10, pp. 1115–1125, 2000.
[163] P. Kersten, “Implementation issues in the fuzzy c-medians clustering algorithm,” in Proc. 6th IEEE Int. Conf. Fuzzy Systems, vol. 2, 1997, pp. 957–962.
[164] J. Khan, J. Wei, M. Ringnér, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. Antonescu, C. Peterson, and P. Meltzer, “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Med., vol. 7, no. 6, pp. 673–679, 2001.
[165] S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.
[166] J. Kleinberg, “An impossibility theorem for clustering,” in Proc. 2002 Conf. Advances in Neural Information Processing Systems, vol. 15, 2002, pp. 463–470.
[167] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. 14th Int. Joint Conf. Artificial Intelligence, 1995, pp. 338–345.
[168] T. Kohonen, “The self-organizing map,” Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[169] ——, Self-Organizing Maps, 3rd ed. New York: Springer-Verlag, 2001.
[170] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, “Self organization of a massive document collection,” IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 574–585, May 2000.
[171] E. Kolatch. (2001) Clustering algorithms for spatial databases: A survey. [Online]. Available: http://citeseer.nj.nec.com/436843.html
[172] J. Kolen and T. Hutcheson, “Reducing the time complexity of the fuzzy c-means algorithm,” IEEE Trans. Fuzzy Syst., vol. 10, no. 2, pp. 263–267, Apr. 2002.
[173] K. Krishna and M. Murty, “Genetic K-means algorithm,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 3, pp. 433–439, Jun. 1999.
[174] R. Krishnapuram, H. Frigui, and O. Nasraoui, “Fuzzy and possiblistic shell clustering algorithms and their application to boundary detection and surface approximation—Part I and II,” IEEE Trans. Fuzzy Syst., vol. 3, no. 1, pp. 29–60, Feb. 1995.
[175] R. Krishnapuram and J. Keller, “A possibilistic approach to clustering,” IEEE Trans. Fuzzy Syst., vol. 1, no. 2, pp. 98–110, Apr. 1993.
[176] R. Krishnapuram, O. Nasraoui, and H. Frigui, “The fuzzy c spherical shells algorithm: A new approach,” IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 663–671, Sep. 1992.
[177] A. Krogh, M. Brown, I. Mian, K. Sjölander, and D. Haussler, “Hidden Markov models in computational biology: Applications to protein modeling,” J. Molec. Biol., vol. 235, pp. 1501–1531, 1994.
[178] G. Lance and W. Williams, “A general theory of classification sorting strategies: 1. Hierarchical systems,” Comput. J., vol. 9, pp. 373–380, 1967.
[179] M. Law and J. Kwok, “Rival penalized competitive learning for model-based sequence clustering,” in Proc. 15th Int. Conf. Pattern Recognition, vol. 2, 2000, pp. 195–198.
[180] Y. Leung, J. Zhang, and Z. Xu, “Clustering by scale-space filtering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1396–1410, Dec. 2000.
[181] E. Levine and E. Domany, “Resampling method for unsupervised estimation of cluster validity,” Neural Comput., vol. 13, pp. 2573–2593, 2001.
[182] C. Li and G. Biswas, “Temporal pattern generation using hidden Markov model based unsupervised classification,” in Advances in Intelligent Data Analysis, ser. Lecture Notes in Computer Science, D. Hand, K. Kok, and M. Berthold, Eds. New York: Springer-Verlag, 1999, vol. 1642.
[183] ——, “Unsupervised learning with mixed numeric and nominal data,” IEEE Trans. Knowl. Data Eng., vol. 14, no. 4, pp. 673–690, Jul.-Aug. 2002.
[184] C. Li, H. Garcia-Molina, and G. Wiederhold, “Clustering for approximate similarity search in high-dimensional spaces,” IEEE Trans. Knowl. Data Eng., vol. 14, no. 4, pp. 792–808, Jul.-Aug. 2002.
[185] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous sequences to reduce the size of large protein databases,” Bioinformatics, vol. 17, pp. 282–283, 2001.
[186] A. Likas, N. Vlassis, and J. Verbeek, “The global K-means clustering algorithm,” Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003.
[187] S. Lin and B. Kernighan, “An effective heuristic algorithm for the traveling salesman problem,” Operat. Res., vol. 21, pp. 498–516, 1973.
[188] R. Lipshutz, S. Fodor, T. Gingeras, and D. Lockhart, “High density synthetic oligonucleotide arrays,” Nature Genetics, vol. 21, pp. 20–24, 1999.
[189] G. Liu, Introduction to Combinatorial Mathematics. New York: McGraw-Hill, 1968.
[190] J. Lozano and P. Larrañaga, “Applying genetic algorithms to search for the best hierarchical clustering of a dataset,” Pattern Recognit. Lett., vol. 20, pp. 911–918, 1999.
[191] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp., vol. 1, 1967, pp. 281–297.
[192] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biological data analysis: A survey,” IEEE/ACM Trans. Computat. Biol. Bioinformatics, vol. 1, no. 1, pp. 24–45, Jan. 2004.
[193] Y. Man and I. Gath, “Detection and separation of ring-shaped clusters using fuzzy clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 8, pp. 855–861, Aug. 1994.
[194] J. Mao and A. Jain, “A self-organizing network for hyperellipsoidal clustering (HEC),” IEEE Trans. Neural Netw., vol. 7, no. 1, pp. 16–29, Jan. 1996.
[195] U. Maulik and S. Bandyopadhyay, “Genetic algorithm-based clustering technique,” Pattern Recognit., vol. 33, pp. 1455–1465, 2000.
[196] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: Wiley, 1997.
[197] G. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.
[198] G. McLachlan, D. Peel, K. Basford, and P. Adams, “The EMMIX software for the fitting of mixtures of normal and t-components,” J. Statist. Software, vol. 4, 1999.
[199] C. Miller, J. Gurd, and A. Brass, “A RAPID algorithm for sequence database comparisons: Application to the identification of vector contamination in the EMBL databases,” Bioinformatics, vol. 15, pp. 111–121, 1999.
[200] R. Miller et al., “A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base,” Genome Res., vol. 9, pp. 1143–1155, 1999.
[201] W. Miller, “Comparison of genomic DNA sequences: Solved and unsolved problems,” Bioinformatics, vol. 17, pp. 391–397, 2001.
[202] G. Milligan and M. Cooper, “An examination of procedures for determining the number of clusters in a data set,” Psychometrika, vol. 50, pp. 159–179, 1985.
[203] R. Mollineda and E. Vidal, “A relative approach to hierarchical clustering,” in Pattern Recognition and Applications, Frontiers in Artificial Intelligence and Applications, M. Torres and A. Sanfeliu, Eds. Amsterdam, The Netherlands: IOS Press, 2000, vol. 56, pp. 19–28.
[204] B. Moore, “ART1 and pattern clustering,” in Proc. 1988 Connectionist Models Summer School, 1989, pp. 174–185.
[205] S. Moore, “Making chips to probe genes,” IEEE Spectr., vol. 38, no. 3, pp. 54–60, Mar. 2001.
[206] Y. Moreau, F. Smet, G. Thijs, K. Marchal, and B. Moor, “Functional bioinformatics of microarray data: From expression to regulation,” Proc. IEEE, vol. 90, no. 11, pp. 1722–1743, Nov. 2002.
[207] T. Morzy, M. Wojciechowski, and M. Zakrzewicz, “Pattern-oriented hierarchical clustering,” in Proc. 3rd East Eur. Conf. Advances in Databases and Information Systems, 1999, pp. 179–190.
[208] S. Mulder and D. Wunsch, “Million city traveling salesman problem solution by divide and conquer clustering with adaptive resonance neural networks,” Neural Netw., vol. 16, pp. 827–832, 2003.
[209] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201, Mar. 2001.
[210] F. Murtagh, “A survey of recent advances in hierarchical clustering algorithms,” Comput. J., vol. 26, no. 4, pp. 354–359, 1983.
[211] F. Murtagh and M. Berry, “Overcoming the curse of dimensionality in clustering by means of the wavelet transform,” Comput. J., vol. 43, no. 2, pp. 107–120, 2000.
[212] S. Needleman and C. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J. Molec. Biol., vol. 48, pp. 443–453, 1970.
[213] R. Ng and J. Han, “CLARANS: A method for clustering objects for spatial data mining,” IEEE Trans. Knowl. Data Eng., vol. 14, no. 5, pp. 1003–1016, Sep.-Oct. 2002.
[214] T. Oates, L. Firoiu, and P. Cohen, “Using dynamic time warping to bootstrap HMM-based clustering of time series,” in Sequence Learning, ser. LNAI 1828, R. Sun and C. Giles, Eds. Berlin, Germany: Springer-Verlag, 2000, pp. 35–52.
[215] E. Oja, “Principal components, minor components, and linear neural networks,” Neural Netw., vol. 5, pp. 927–935, 1992.
[216] J. Oliver, R. Baxter, and C. Wallace, “Unsupervised learning using MML,” in Proc. 13th Int. Conf. Machine Learning (ICML’96), Lorenza Saitta, 1996, pp. 364–372.
[217] C. Olson, “Parallel algorithms for hierarchical clustering,” Parallel Comput., vol. 21, pp. 1313–1325, 1995.
[218] C. Ordonez and E. Omiecinski, “Efficient disk-based K-means clustering for relational databases,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 8, pp. 909–921, Aug. 2004.
[219] L. Owsley, L. Atlas, and G. Bernard, “Self-organizing feature maps and hidden Markov models for machine-tool monitoring,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2787–2798, Nov. 1997.
[220] N. Pal and J. Bezdek, “On cluster validity for the fuzzy c-means model,” IEEE Trans. Fuzzy Syst., vol. 3, no. 3, pp. 370–379, Aug. 1995.
[221] N. Pal, J. Bezdek, and E. Tsao, “Generalized clustering networks and Kohonen’s self-organizing scheme,” IEEE Trans. Neural Netw., vol. 4, no. 4, pp. 549–557, Jul. 1993.
[222] G. Patanè and M. Russo, “The enhanced-LBG algorithm,” Neural Netw., vol. 14, no. 9, pp. 1219–1237, 2001.
[223] ——, “Fully automatic clustering system,” IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1285–1298, Nov. 2002.
[224] W. Pearson, “Improved tools for biological sequence comparison,” Proc. Nat. Acad. Sci., vol. 85, pp. 2444–2448, 1988.
[225] D. Peel and G. McLachlan, “Robust mixture modeling using the t-distribution,” Statist. Comput., vol. 10, pp. 339–348, 2000.
[226] D. Pelleg and A. Moore, “X-means: Extending K-means with efficient estimation of the number of clusters,” in Proc. 17th Int. Conf. Machine Learning (ICML’00), 2000, pp. 727–734.
[227] J. Peña, J. Lozano, and P. Larrañaga, “An empirical comparison of four initialization methods for the K-means algorithm,” Pattern Recognit. Lett., vol. 20, pp. 1027–1040, 1999.
[228] C. Pizzuti and D. Talia, “P-AutoClass: Scalable parallel clustering for mining large data sets,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 629–641, May-Jun. 2003.
[229] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[230] Ralf-Herwig, A. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O’Brien, “Large-scale clustering of cDNA-fingerprinting data,” Genome Res., pp. 1093–1105, 1999.
[231] A. Rauber, J. Paralic, and E. Pampalk, “Empirical evaluation of clustering algorithms,” J. Inf. Org. Sci., vol. 24, no. 2, pp. 195–209, 2000.
[232] S. Ridella, S. Rovetta, and R. Zunino, “Plastic algorithm for adaptive vector quantization,” Neural Comput. Appl., vol. 7, pp. 37–51, 1998.
[233] J. Rissanen, “Fisher information and stochastic complexity,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 40–47, Jan. 1996.
[234] K. Rose, “Deterministic annealing for clustering, compression, classification, regression, and related optimization problems,” Proc. IEEE, vol. 86, no. 11, pp. 2210–2239, Nov. 1998.
[235] S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[236] D. Sankoff and J. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Stanford, CA: CSLI Publications, 1999.
[237] O. Sasson, N. Linial, and M. Linial, “The metric space of proteins—Comparative study of clustering algorithms,” Bioinformatics, vol. 18, pp. s14–s21, 2002.
[238] U. Scherf, D. Ross, M. Waltham, L. Smith, J. Lee, L. Tanabe, K. Kohn, W. Reinhold, T. Myers, D. Andrews, D. Scudiero, M. Eisen, E. Sausville, Y. Pommier, D. Botstein, P. Brown, and J. Weinstein, “A gene expression database for the molecular pharmacology of cancer,” Nature Genetics, vol. 24, no. 3, pp. 236–244, 2000.
[239] P. Scheunders, “A comparison of clustering algorithms applied to color image quantization,” Pattern Recognit. Lett., vol. 18, pp. 1379–1384, 1997.
[240] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.
[241] B. Schölkopf, A. Smola, and K. Müller, “Nonlinear component analysis as a kernel eigenvalue problem,” Neural Computat., vol. 10, no. 5, pp. 1299–1319, 1998.
[242] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, no. 2, pp. 461–464, 1978.
[243] G. Scott, D. Clark, and T. Pham, “A genetic clustering algorithm guided by a descent algorithm,” in Proc. Congr. Evolutionary Computation, vol. 2, Piscataway, NJ, 2001, pp. 734–740.
[244] P. Sebastiani, M. Ramoni, and P. Cohen, “Sequence learning via Bayesian clustering by dynamics,” in Sequence Learning, ser. LNAI 1828, R. Sun and C. Giles, Eds. Berlin, Germany: Springer-Verlag, 2000, pp. 11–34.
[245] S. Selim and K. Alsultan, “A simulated annealing algorithm for the clustering problems,” Pattern Recognit., vol. 24, no. 10, pp. 1003–1008, 1991.
[246] R. Shamir and R. Sharan, “Algorithmic approaches to clustering gene expression data,” in Current Topics in Computational Molecular Biology, T. Jiang, T. Smith, Y. Xu, and M. Zhang, Eds. Cambridge, MA: MIT Press, 2002, pp. 269–300.
[247] R. Sharan and R. Shamir, “CLICK: A clustering algorithm with applications to gene expression analysis,” in Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 2000, pp. 307–316.
[248] G. Sheikholeslami, S. Chatterjee, and A. Zhang, “WaveCluster: A multi-resolution clustering approach for very large spatial databases,” in Proc. 24th VLDB Conf., 1998, pp. 428–439.
[249] P. Simpson, “Fuzzy min-max neural networks—Part 2: Clustering,” IEEE Trans. Fuzzy Syst., vol. 1, no. 1, pp. 32–45, Feb. 1993.
[250] J. Sklansky and W. Siedlecki, “Large-scale feature selection,” in Handbook of Pattern Recognition and Computer Vision, C. Chen, L. Pau, and P. Wang, Eds. Singapore: World Scientific, 1993, pp. 61–124.
[251] T. Smith and M. Waterman, “New stratigraphic correlation techniques,” J. Geology, vol. 88, pp. 451–457, 1980.
[252] P. Smyth, “Clustering using Monte Carlo cross-validation,” in Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining, 1996, pp. 126–133.
[253] ——, “Clustering sequences with hidden Markov models,” in Advances in Neural Information Processing, M. Mozer, M. Jordan, and T. Petsche, Eds. Cambridge, MA: MIT Press, 1997, vol. 9, pp. 648–654.
[254] ——, “Model selection for probabilistic clustering using cross validated likelihood,” Statist. Comput., vol. 10, pp. 63–72, 1998.
[255] ——, “Probabilistic model-based clustering of multivariate and sequential data,” in Proc. 7th Int. Workshop on Artificial Intelligence and Statistics, 1999, pp. 299–304.
[256] P. Sneath, “The application of computers to taxonomy,” J. Gen. Microbiol., vol. 17, pp. 201–226, 1957.
[257] P. Somervuo and T. Kohonen, “Clustering and visualization of large protein sequence databases by means of an extension of the self-organizing map,” in LNAI 1967, 2000, pp. 76–85.
[258] T. Sorensen, “A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyzes of the vegetation on Danish commons,” Biologiske Skrifter, vol. 5, pp. 1–34, 1948.
[259] H. Späth, Cluster Analysis Algorithms. Chichester, U.K.: Ellis Horwood, 1980.
[260] P. Spellman, G. Sherlock, M. Ma, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher, “Comprehensive identification of cell cycle-regulated genes of the Yeast Saccharomyces Cerevisiae by microarray hybridization,” Mol. Biol. Cell, vol. 9, pp. 3273–3297, 1998.
[261] “Tech. Rep. 00–034,” Univ. Minnesota, Minneapolis, 2000.
[262] K. Stoffel and A. Belkoniene, “Parallel K-means clustering for large data sets,” in Proc. EuroPar’99 Parallel Processing, 1999, pp. 1451–1454.
[263] M. Su and H. Chang, “Fast self-organizing feature map algorithm,” IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 721–733, May 2000.
[264] M. Su and C. Chou, “A modified version of the K-means algorithm with a distance based on cluster symmetry,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 674–680, Jun. 2001.
[265] R. Sun and C. Giles, “Sequence learning: Paradigms, algorithms, and applications,” in LNAI 1828. Berlin, Germany, 2000.
[266] C. Sung and H. Jin, “A Tabu-search-based heuristic for clustering,” Pattern Recognit., vol. 33, pp. 849–858, 2000.
[267] SWISS-PROT Protein Knowledgebase Release 45.0 Statistics.
[268] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub, “Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation,” Proc. Nat. Acad. Sci., pp. 2907–2912, 1999.
[269] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church, “Systematic determination of genetic network architecture,” Nature Genetics, vol. 22, pp. 281–285, 1999.
[270] J. Tenenbaum, V. Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, pp. 2319–2323, 2000.
[271] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown, “Clustering methods for the analysis of DNA microarray data,” Dept. Statist., Stanford Univ., Stanford, CA, Tech. Rep.
[272] R. Tibshirani and K. Knight, “The covariance inflation criterion for adaptive model selection,” J. Roy. Statist. Soc. B, vol. 61, pp. 529–546, 1999.
[273] L. Tseng and S. Yang, “A genetic approach to the automatic clustering problem,” Pattern Recognit., vol. 34, pp. 415–424, 2001.
[274] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[275] J. Venter et al., “The sequence of the human genome,” Science, vol. 291, pp. 1304–1351, 2001.
[276] J. Vesanto and E. Alhoniemi, “Clustering of the self-organizing map,” IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 586–600, May 2000.
[277] K. Wagstaff, S. Rogers, and S. Schroedl, “Constrained K-means clustering with background knowledge,” in Proc. 8th Int. Conf. Machine Learning, 2001, pp. 577–584.
[278] C. Wallace and D. Dowe, “Intrinsic classification by MML—The SNOB program,” in Proc. 7th Australian Joint Conf. Artificial Intelligence, 1994, pp. 37–44.
[279] H. Wang, W. Wang, J. Yang, and P. Yu, “Clustering by pattern similarity in large data sets,” in Proc. ACM SIGMOD Int. Conf. Management of Data, 2002, pp. 394–405.
[280] C. Wei, Y. Lee, and C. Hsu, “Empirical comparison of fast clustering algorithms for large data sets,” in Proc. 33rd Hawaii Int. Conf. System Sciences, Maui, HI, 2000, pp. 1–10.
[281] J. Williamson, “Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps,” Neural Netw., vol. 9, no. 5, pp. 881–897, 1996.
[282] M. Windham and A. Culter, “Information ratios for validating mixture analysis,” J. Amer. Statist. Assoc., vol. 87, pp. 1188–1192, 1992.
[283] S. Wu, A. W.-C. Liew, H. Yan, and M. Yang, “Cluster analysis of gene expression data based on self-splitting and merging competitive learning,” IEEE Trans. Inf. Technol. Biomed., vol. 8, no. 1, pp. 5–15, Jan. 2004.
[284] D. Wunsch, “An optoelectronic learning machine: Invention, experimentation, analysis of first hardware implementation of the ART1 neural network,” Ph.D. dissertation, Univ. Washington, Seattle, WA, 1991.
[285] D. Wunsch, T. Caudell, C. Capps, R. Marks, and R. Falk, “An optoelectronic implementation of the adaptive resonance neural network,” IEEE Trans. Neural Netw., vol. 4, no. 4, pp. 673–684, Jul. 1993.
[286] Y. Xiong and D. Yeung, “Mixtures of ARMA models for model-based time series clustering,” in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 717–720.
[287] R. Xu, G. Anagnostopoulos, and D. Wunsch, “Tissue classification through analysis of gene expression data using a new family of ART architectures,” in Proc. Int. Joint Conf. Neural Networks (IJCNN’02), vol. 1, 2002, pp. 300–304.
[288] Y. Xu, V. Olman, and D. Xu, “Clustering gene expression data using graph-theoretic approach: An application of minimum spanning trees,” Bioinformatics, vol. 18, no. 4, pp. 536–545, 2002.
[289] R. Yager, “Intelligent control of the hierarchical agglomerative clustering process,” IEEE Trans. Syst., Man, Cybern., vol. 30, no. 6, pp. 835–845, 2000.
[290] R. Yager and D. Filev, “Approximate clustering via the mountain method,” IEEE Trans. Syst., Man, Cybern., vol. 24, no. 8, pp. 1279–1284, 1994.
[291] K. Yeung, D. Haynor, and W. Ruzzo, “Validating clustering for gene expression data,” Bioinformatics, vol. 17, no. 4, pp. 309–318, 2001.
[292] F. Young and R. Hamer, Multidimensional Scaling: History, Theory, and Applications. Hillsdale, NJ: Lawrence Erlbaum, 1987.
[293] L. Zadeh, “Fuzzy sets,” Inf. Control, vol. 8, pp. 338–353, 1965.
[294] J. Zhang and Y. Leung, “Improved possibilistic C-means clustering algorithms,” IEEE Trans. Fuzzy Syst., vol. 12, no. 2, pp. 209–217, Apr. 2004.
[295] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient data clustering method for very large databases,” in Proc. ACM SIGMOD Conf. Management of Data, 1996, pp. 103–114.
[296] Y. Zhang and Z. Liu, “Self-splitting competitive learning: A new on-line clustering paradigm,” IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 369–380, Mar. 2002.
[297] X. Zhuang, Y. Huang, K. Palaniappan, and Y. Zhao, “Gaussian mixture density modeling, decomposition, and applications,” IEEE Trans. Image Process., vol. 5, no. 9, pp. 1293–1302, Sep. 1996.

Rui Xu (S’00) received the B.E. degree in electrical engineering from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 1997, and the M.E. degree in electrical engineering from Sichuan University, Chengdu, Sichuan, in 2000. He is currently pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Missouri-Rolla.
His research interests include machine learning, neural networks, pattern classification and clustering, and bioinformatics.
Mr. Xu is a Student Member of the IEEE Computational Intelligence Society, Engineering in Medicine and Biology Society, and the International Society for Computational Biology.

Donald C. Wunsch II (S’87–M’92–SM’94–F’05) received the B.S. degree in applied mathematics from the University of New Mexico, Albuquerque, and the M.S. degree in applied mathematics and the Ph.D. degree in electrical engineering from the University of Washington, Seattle.
He is the Mary K. Finley Missouri Distinguished Professor of Computer Engineering, University of Missouri-Rolla, where he has been since 1999. His prior positions were Associate Professor and Director of the Applied Computational Intelligence Laboratory, Texas Tech University, Lubbock; Senior Principal Scientist, Boeing; Consultant, Rockwell International; and Technician, International Laser Systems. He has well over 200 publications, and has attracted over $5 million in research funding. He has produced eight Ph.D. recipients—four in electrical engineering, three in computer engineering, and one in computer science.
Dr. Wunsch has received the Halliburton Award for Excellence in Teaching and Research, and the National Science Foundation CAREER Award. He served as a Voting Member of the IEEE Neural Networks Council, Technical Program Co-Chair for IJCNN’02, General Chair for IJCNN’03, International Neural Networks Society Board of Governors Member, and is now President of the International Neural Networks Society.