
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 3, MAY 2005

Survey of Clustering Algorithms


Rui Xu, Student Member, IEEE and Donald Wunsch II, Fellow, IEEE

Abstract—Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure and cluster validation, are also discussed.

Index Terms—Adaptive resonance theory (ART), clustering, clustering algorithm, cluster validation, neural networks, proximity, self-organizing feature map (SOFM).

I. INTRODUCTION

WE ARE living in a world full of data. Every day, people encounter a large amount of information and store or represent it as data, for further analysis and management. One of the vital means of dealing with these data is to classify or group them into a set of categories or clusters. Actually, as one of the most primitive activities of human beings [14], classification plays an important and indispensable role in the long history of human development. In order to learn a new object or understand a new phenomenon, people always try to seek the features that can describe it, and further compare it with other known objects or phenomena, based on the similarity or dissimilarity, generalized as proximity, according to some certain standards or rules. Basically, classification systems are either supervised or unsupervised, depending on whether they assign new inputs to one of a finite number of discrete supervised classes or unsupervised categories, respectively [38], [60], [75]. In supervised classification, the mapping from a set of input data vectors (x in R^d, where d is the input space dimensionality) to a finite set of discrete class labels (y in {1, ..., C}, where C is the total number of class types) is modeled in terms of some mathematical function y = y(x, w), where w is a vector of adjustable parameters. The values of these parameters are determined (optimized) by an inductive learning algorithm (also termed inducer), whose aim is to minimize an empirical risk functional (related to an inductive principle) on a finite data set of input-output examples {(x_j, y_j), j = 1, ..., N}, where N is the finite cardinality of the available representative data set [38], [60], [167]. When the inducer reaches convergence or terminates, an induced classifier is generated [167].

In unsupervised classification, called clustering or exploratory data analysis, no labeled data are available [88], [150]. The goal of clustering is to separate a finite unlabeled data set into a finite and discrete set of "natural," hidden data structures, rather than provide an accurate characterization of unobserved samples generated from the same probability distribution [23], [60]. This can make the task of clustering fall outside of the framework of unsupervised predictive learning problems, such as vector quantization [60] (see Section II-C), probability density function estimation [38], [60] (see Section II-D), and entropy maximization [99]. It is noteworthy that clustering differs from multidimensional scaling (perceptual maps), whose goal is to depict all the evaluated objects in a way that minimizes the topographical distortion while using as few dimensions as possible. Also note that, in practice, many (predictive) vector quantizers are also used for (nonpredictive) clustering analysis [60].

Nonpredictive clustering is a subjective process in nature, which precludes an absolute judgment as to the relative efficacy of all clustering techniques [23], [152]. As pointed out by Backer and Jain [17], "in cluster analysis a group of objects is split up into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create 'interesting' clusters), such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups"1.

Clustering algorithms partition data into a certain number of clusters (groups, subsets, or categories). There is no universally agreed upon definition [88]. Most researchers describe a cluster by considering the internal homogeneity and the external separation [111], [124], [150], i.e., patterns in the same cluster should be similar to each other, while patterns in different clusters should not. Both the similarity and the dissimilarity should be examinable in a clear and meaningful way. Here, we give some simple mathematical descriptions of several types of clustering, based on the descriptions in [124].

Given a set of input patterns X = {x_1, ..., x_j, ..., x_N}, where x_j = (x_j1, x_j2, ..., x_jd) in R^d and each measure x_ji is said to be a feature (attribute, dimension, or variable).

• (Hard) partitional clustering attempts to seek a K-partition of X, C = {C_1, ..., C_K} (K <= N), such that
1) C_i is not empty, i = 1, ..., K;
2) the union of C_1, ..., C_K equals X;
3) C_i and C_j are disjoint for i, j = 1, ..., K and i != j.

Manuscript received March 31, 2003; revised September 28, 2004. This work was supported in part by the National Science Foundation and in part by the M. K. Finley Missouri Endowment.
The authors are with the Department of Electrical and Computer Engineering, University of Missouri-Rolla, Rolla, MO 65409 USA (e-mail: [email protected]; [email protected]).
Digital Object Identifier 10.1109/TNN.2005.845141
1The preceding quote is taken verbatim from verbiage suggested by the anonymous associate editor, a suggestion which we gratefully acknowledge.



Fig. 1. Clustering procedure. The typical cluster analysis consists of four steps with a feedback pathway. These steps are closely related to each other and affect
the derived clusters.

• Hierarchical clustering attempts to construct a tree-like, nested structure partition of X, H = {H_1, ..., H_Q} (Q <= N), such that C_i in H_m, C_j in H_l, and m > l imply C_i is a subset of C_j or C_i and C_j are disjoint, for all i, j != i, m, l = 1, ..., Q.

For hard partitional clustering, each pattern belongs to one and only one cluster. However, a pattern may also be allowed to belong to all K clusters with a degree of membership u_ij in [0, 1], which represents the membership coefficient of the jth object in the ith cluster and satisfies the following two constraints:

sum over i of u_ij = 1 for all j, and sum over j of u_ij < N for all i

as introduced in fuzzy set theory [293]. This is known as fuzzy clustering, reviewed in Section II-G.
Fig. 1 depicts the procedure of cluster analysis with four basic
steps.
1) Feature selection or extraction. As pointed out by
Jain et al. [151], [152] and Bishop [38], feature
selection
chooses distinguishing features from a set of candidates,
while feature extraction utilizes some transformations
to generate useful and novel features from the
original ones. Both are very crucial to the
effectiveness of clus- tering applications. Elegant
selection of features can greatly decrease the
workload and simplify the subse-
quent design process. Generally, ideal features should be
of use in distinguishing patterns belonging to
different clusters, immune to noise, easy to extract and
interpret. We elaborate the discussion on feature
extraction in Section II-L, in the context of data
visualization and dimensionality reduction. More
information on feature selection can be found in [38],
[151], and [250].
2) Clustering algorithm design or selection. This step is
usually combined with the selection of a corresponding
proximity measure and the construction of a criterion
function. Patterns are grouped according to whether
they resemble each other. Obviously, the proximity
measure directly affects the formation of the resulting
clusters. Almost all clustering algorithms are explicitly
or implicitly connected to some definition of proximity
measure. Some algorithms even work directly on the
proximity matrix, as defined in Section II-A. Once a
proximity measure is chosen, the construction of a
clustering criterion function makes the partition of
clusters an optimization problem, which is well defined
mathematically, and has rich solutions in the literature.
Clustering is ubiquitous, and a wealth of clustering algo-
rithms has been developed to solve different problems in
specific fields. However, there is no clustering algorithm
that can be universally used to solve all problems. “It has
been very difficult to develop a unified framework for
reasoning about it (clustering) at a technical level, and
profoundly diverse approaches to clustering abound in the research community" [166], as
proved through an impossibility theorem. Therefore, it
is important to carefully investigate the characteristics
of the problem at hand, in order to select or design an
appropriate clustering strategy.
3) Cluster validation. Given a data set, each clustering
algorithm can always generate a division, no matter
whether the structure exists or not. Moreover, different
approaches usually lead to different clusters; and even
for the same algorithm, parameter identification or
the presentation order of input patterns may affect the
final results. Therefore, effective evaluation standards
and criteria are important to provide the users with a
degree of confidence for the clustering results derived
from the used algorithms. These assessments should
be objective and have no preferences to any algorithm.
Also, they should be useful for answering questions
like how many clusters are hidden in the data, whether
the clusters obtained are meaningful or just an artifact
of the algorithms, or why we choose some algorithm
instead of another. Generally, there are three categories
of testing criteria: external indices, internal indices,
and relative indices. These are defined on three types
of clustering structures, known as partitional clus-
tering, hierarchical clustering, and individual clusters
[150]. Tests for the situation, where no clustering
structure exists in the data, are also considered [110],
but seldom used, since users are confident of the pres-
ence of clusters. External indices are based on some
prespecified structure, which is the reflection of prior
information on the data, and used as a standard to
validate the clustering solutions. Internal tests are not
dependent on external information (prior knowledge).
On the contrary, they examine the clustering structure
directly from the original data. Relative criteria place
the emphasis on the comparison of different
clustering structures, in order to provide a reference,
to decide which one may best reveal the
characteristics of the objects. We will not survey the
topic in depth and refer interested readers to [74],
[110], and [150]. However, we will cover more
details on how to determine the number of clusters in
Section II-M. Some more recent discussion can be
found in [22], [37], [121], [180], and [181].
Approaches for fuzzy clustering validity are reported
in [71], [104], [123], and [220].
4) Results interpretation. The ultimate goal of clustering
is to provide users with meaningful insights from the
original data, so that they can effectively solve the
problems encountered. Experts in the relevant fields in-
terpret the data partition. Further analyses, even exper-
iments, may be required to guarantee the reliability of
extracted knowledge.
Note that the flow chart also includes a feedback pathway.
Cluster analysis is not a one-shot process. In many circumstances,
it needs a series of trials and repetitions. Moreover, there are
no universal and effective criteria to guide the selection of
features and clustering schemes. Validation criteria provide some
insights on the quality of clustering solutions. But even how to
choose the appropriate criterion is still a problem requiring more
efforts.
Clustering has been applied in a wide variety of fields,
ranging from engineering (machine learning, artificial intelli-
gence, pattern recognition, mechanical engineering, electrical
engineering), computer sciences (web mining, spatial database
analysis, textual document collection, image segmentation),
life and medical sciences (genetics, biology, microbiology,
paleontology, psychiatry, clinic, pathology), to earth sciences
(geography, geology, remote sensing), social sciences (soci-
ology, psychology, archeology, education), and economics
(marketing, business) [88], [127]. Accordingly, clustering is
also known as numerical taxonomy, learning without a teacher
(or unsupervised learning), typological analysis and partition.
The diversity reflects the important position of clustering in
scientific research. On the other hand, it causes confusion, due
to the differing terminologies and goals. Clustering algorithms
developed to solve a particular problem, in a specialized field,
usually make assumptions in favor of the application of interest.
These biases inevitably affect performance in other problems
that do not satisfy these premises. For example, the K-means
algorithm is based on the Euclidean measure and, hence, tends
to generate hyperspherical clusters. But if the real clusters are
in other geometric forms, K-means may no longer be effective,
and we need to resort to other schemes. This situation also
holds true for mixture-model clustering, in which a model is
fit to data in advance.
Clustering has a long history, with lineage dating back to Aris-
totle [124]. General references on clustering techniques
include [14], [75], [77], [88], [111], [127], [150], [161],
[259]. Important
survey papers on clustering techniques also exist in the literature.
Starting from a statistical pattern recognition viewpoint, Jain,
Murty, and Flynn reviewed the clustering algorithms and other important issues related to cluster analysis [152], while Hansen and Jaumard described the clustering problems under a mathematical programming scheme [124]. Kolatch and He investigated applications of clustering algorithms for spatial database systems [171] and information retrieval [133], respectively. Berkhin further expanded the topic to the whole field of data mining [33].
Murtagh reported the advances in hierarchical clustering
algorithms [210] and
Baraldi surveyed several models for fuzzy and neural network
clustering [24]. Some more survey papers can also be found
in [25], [40], [74], [89], and [151]. In addition to the review
papers, comparative research on clustering algorithms is also
significant. Rauber, Paralic, and Pampalk presented empirical
results for five typical clustering algorithms [231]. Wei, Lee,
and Hsu placed the
emphasis on the comparison of fast algorithms for large databases
[280]. Scheunders compared several clustering techniques
for color image quantization, with emphasis on
computational time
and the possibility of obtaining global optima [239]. Applications
and evaluations of different clustering algorithms for the
analysis of gene expression data from DNA microarray
experiments were described in [153], [192], [246], and [271].
Experimental evaluation on document clustering techniques, based on hierarchical and K-means clustering algorithms, was summarized by Steinbach,
Karypis, and Kumar [261].
In contrast to the above, the purpose of this paper is to
pro- vide a comprehensive and systematic description of the
influ- ential and important clustering algorithms rooted in
statistics, computer science, and machine learning, with
emphasis on new advances in recent years.
The remainder of the paper is organized as follows. In
Sec- tion II, we review clustering algorithms, based on the
natures of generated clusters and techniques and theories
behind them. Furthermore, we discuss approaches for
clustering sequential data, large data sets, data visualization,
and high-dimensional data through dimension reduction.
Two important issues on cluster analysis, including
proximity measure and how to choose the number of
clusters, are also summarized in the section. This is the
longest section of the paper, so, for conve- nience, we give
an outline of Section II in bullet form here:
II. Clustering Algorithms
• A. Distance and Similarity
Measures (See also Table I)
• B. Hierarchical
— Agglomerative
Single linkage, complete linkage, group
average linkage, median linkage, centroid
linkage, Ward’s method, balanced iterative
reducing and clustering using hierarchies
(BIRCH), clustering using rep- resentatives
(CURE), robust clustering using links (ROCK)
— Divisive
divisive analysis (DIANA), monothetic
analysis (MONA)
• C. Squared Error-Based (Vector Quantization)
— K-means, iterative self-organizing data analysis technique (ISODATA), genetic K-means algorithm (GKA), partitioning around medoids (PAM)
• D. pdf Estimation via Mixture Densities
— Gaussian mixture density decomposition (GMDD), AutoClass
• E. Graph Theory-Based
— Chameleon, Delaunay triangulation graph (DTG), highly connected subgraphs (HCS), clustering identification via connectivity kernels (CLICK), cluster affinity search technique (CAST)
TABLE I
SIMILARITY AND DISSIMILARITY MEASURE FOR QUANTITATIVE FEATURES

• F. Combinatorial Search Techniques-Based
— Genetically guided algorithm (GGA), TS clustering, SA clustering
• G. Fuzzy
— Fuzzy c-means (FCM), mountain method (MM), possibilistic c-means clustering algorithm (PCM), fuzzy c-shells (FCS)
• H. Neural Networks-Based
— Learning vector quantization (LVQ), self-organizing feature map (SOFM), ART, simplified ART (SART), hyperellipsoidal clustering network (HEC), self-splitting competitive learning network (SPLL)
• I. Kernel-Based
— Kernel K-means, support vector clustering (SVC)
• J. Sequential Data
— Sequence similarity
— Indirect sequence clustering
— Statistical sequence clustering
• K. Large-Scale Data Sets (See also Table II)
— CLARA, CURE, CLARANS, BIRCH, DBSCAN, DENCLUE, WaveCluster, FC, ART
• L. Data Visualization and High-Dimensional Data
— PCA, ICA, projection pursuit, Isomap, LLE, CLIQUE, OptiGrid, ORCLUS
• M. How Many Clusters?

Applications in two benchmark data sets, the traveling salesman problem, and bioinformatics are illustrated in Section III. We conclude the paper in Section IV.

II. CLUSTERING ALGORITHMS

Different starting points and criteria usually lead to different taxonomies of clustering algorithms [33], [88], [124], [150], [152], [171]. A rough but widely agreed frame is to classify clustering techniques as hierarchical clustering and partitional clustering, based on the properties of clusters generated [88], [152]. Hierarchical clustering groups data objects with a sequence of partitions, either from singleton clusters to a cluster including all individuals or vice versa, while partitional clustering directly divides data objects into some prespecified number of clusters without the hierarchical structure. We follow this frame in surveying the clustering algorithms in the literature. Beginning with the discussion on proximity measure, which is the basis for most clustering algorithms, we focus on hierarchical clustering and classical partitional clustering algorithms in Sections II-B-D. Starting from part E, we introduce and analyze clustering algorithms based on a wide variety of theories and techniques, including graph theory, combinatorial search techniques, fuzzy set theory, neural networks, and kernel techniques. Compared with graph theory and fuzzy set theory, which had already been widely used in cluster analysis before the 1980s, the other techniques have been finding their applications in clustering just in the recent decades. In spite of the short history, much progress has been achieved. Note that these techniques can be used for both hierarchical and partitional clustering. Considering the more frequent requirement of tackling sequential, large-scale, and high-dimensional data sets in many current applications, we review clustering algorithms for them in the following three parts. We focus particular attention on clustering algorithms applied in bioinformatics. We offer more detailed discussion on how to identify the appropriate number of clusters, which is particularly important in cluster validity, in the last part of the section.

A. Distance and Similarity Measures

It is natural to ask what kind of standards we should use to determine the closeness, or how to measure the distance (dissimilarity) or similarity between a pair of objects, an object and a cluster, or a pair of clusters. In the next section on hierarchical clustering, we will illustrate linkage metrics for measuring proximity between clusters. Usually, a prototype is used to represent a cluster so that it can be further processed like other objects. Here, we focus on reviewing measure approaches between individuals due to the previous consideration.

A data object is described by a set of features, usually represented as a multidimensional vector. The features can be quantitative or qualitative, continuous or binary, nominal or ordinal, which determine the corresponding measure mechanisms.

A distance or dissimilarity function on a data set is defined to satisfy the following conditions.
1) Symmetry. D(x_i, x_j) = D(x_j, x_i);
2) Positivity. D(x_i, x_j) >= 0 for all x_i and x_j.
If conditions
3) Triangle inequality. D(x_i, x_j) <= D(x_i, x_k) + D(x_k, x_j) for all x_i, x_j, and x_k; and
4) Reflexivity. D(x_i, x_j) = 0 if and only if x_i = x_j
also hold, it is called a metric.
Likewise, a similarity function is defined to satisfy the conditions in the following.
1) Symmetry. S(x_i, x_j) = S(x_j, x_i);
2) Positivity. 0 <= S(x_i, x_j) <= 1 for all x_i and x_j.
If it also satisfies conditions
3) S(x_i, x_j) S(x_j, x_k) <= [S(x_i, x_j) + S(x_j, x_k)] S(x_i, x_k) for all x_i, x_j, and x_k; and
4) S(x_i, x_j) = 1 if and only if x_i = x_j,
it is called a similarity metric.

TABLE II
COMPUTATIONAL COMPLEXITY OF CLUSTERING ALGORITHMS

For a data set with N input patterns, we can define an N x N symmetric matrix, called the proximity matrix, whose (i, j)th element represents the similarity or dissimilarity measure for the ith and jth patterns (i, j = 1, ..., N).
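As a concrete illustration of the proximity matrix just described, the short Python sketch below builds an N x N Euclidean distance matrix for a small set of patterns. It is our own illustration; the function and variable names are not from the paper.

import numpy as np

def proximity_matrix(X):
    """Return the N x N matrix of pairwise Euclidean distances.

    X is an (N, d) array: N patterns, each with d features.
    Entry (i, j) holds D(x_i, x_j); the matrix is symmetric with a zero diagonal.
    """
    diff = X[:, None, :] - X[None, :, :]        # shape (N, N, d)
    return np.sqrt((diff ** 2).sum(axis=-1))    # shape (N, N)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
D = proximity_matrix(X)
print(D.round(3))    # symmetric, D[i, i] == 0, e.g. D[0, 1] == 1.0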
Typically, distance functions are used to measure
continuous features, while similarity measures are more
important for qualitative variables. We summarize some typical measures for continuous features in Table I. The selection of different measures is problem dependent. For binary features, a similarity measure is commonly used (a dissimilarity measure can be obtained simply as D_ij = 1 - S_ij). Suppose we use two binary subscripts to count features in two objects: n_11 and n_00 represent the number of simultaneous presence or absence of features in the two objects, and n_10 and n_01 count the features present in only one of the objects. Then two types of commonly used similarity measures for data points x_i and x_j are illustrated in the following.

Simple matching coefficient: S_ij = (n_11 + n_00) / (n_11 + n_00 + n_10 + n_01)
Rogers and Tanimoto measure: S_ij = (n_11 + n_00) / (n_11 + n_00 + 2(n_10 + n_01))
Gower and Legendre measure: S_ij = (n_11 + n_00) / (n_11 + n_00 + (n_10 + n_01)/2)

These measures compute the match between two objects directly. Unmatched pairs are weighted based on their contribution to the similarity.

Jaccard coefficient: S_ij = n_11 / (n_11 + n_10 + n_01)
Sokal and Sneath measure: S_ij = n_11 / (n_11 + 2(n_10 + n_01))
Gower and Legendre measure: S_ij = n_11 / (n_11 + (n_10 + n_01)/2)
These measures focus on the co-occurrence features while ignoring the effect of co-absence.
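The binary measures above reduce to simple counts. The sketch below, our own illustration using the n_11/n_00/n_10/n_01 counts as reconstructed above, computes the simple matching and Jaccard coefficients for two binary feature vectors.

import numpy as np

def binary_counts(xi, xj):
    xi, xj = np.asarray(xi, bool), np.asarray(xj, bool)
    n11 = np.sum(xi & xj)        # features present in both objects
    n00 = np.sum(~xi & ~xj)      # features absent from both objects
    n10 = np.sum(xi & ~xj)       # present only in the first object
    n01 = np.sum(~xi & xj)       # present only in the second object
    return n11, n00, n10, n01

def simple_matching(xi, xj):
    n11, n00, n10, n01 = binary_counts(xi, xj)
    return (n11 + n00) / (n11 + n00 + n10 + n01)

def jaccard(xi, xj):
    n11, n00, n10, n01 = binary_counts(xi, xj)
    return n11 / (n11 + n10 + n01)   # co-absences (n00) are ignored

a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 0, 1]
print(simple_matching(a, b), jaccard(a, b))   # 0.8 and 0.666...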
For nominal features that have more than two states, a simple strategy needs to map them into new binary features [161], while a more effective method utilizes the matching criterion

S_ij = (1/d) sum over l of S_ijl, where S_ijl = 0 if x_il and x_jl do not match, and S_ijl = 1 if x_il and x_jl match

[88]. Ordinal features order multiple states according to some standard and can be compared by using continuous dissimilarity measures discussed in [161]. Edit distance for alphabetic sequences is discussed in Section II-J. More discussion on sequences and strings comparisons can be found in [120] and [236].
Generally, for objects consisting of mixed variables, we can
map all these variables into the interval (0, 1) and use mea-
sures like the Euclidean metric. Alternatively, we can trans-
form them into binary variables and use binary similarity
func- tions. The drawback of these methods is the information
loss. A more powerful method was described by Gower in the
form of S_ij = (sum over l of w_ijl S_ijl) / (sum over l of w_ijl), where S_ijl indicates the similarity for the lth feature and w_ijl is a 0-1 coefficient based on whether the measure of the two objects is missing [88], [112].

B. Hierarchical Clustering
Hierarchical clustering (HC) algorithms organize data into a
hierarchical structure according to the proximity matrix. The re-
sults of HC are usually depicted by a binary tree or
dendrogram. The root node of the dendrogram represents the
whole data set and each leaf node is regarded as a data
object. The interme- diate nodes, thus, describe the extent that
the objects are prox- imal to each other; and the height of
the dendrogram usually expresses the distance between each
pair of objects or clusters, or an object and a cluster. The
ultimate clustering results can be obtained by cutting the
dendrogram at different levels. This representation provides
very informative descriptions and visu- alization for the
potential data clustering structures, especially when real
hierarchical relations exist in the data, like the data from
evolutionary research on different species of organisms. HC
algorithms are mainly classified as agglomerative methods
and divisive methods. Agglomerative clustering starts with
N clusters and each of them includes exactly one object. A series
of merge operations are then followed out that finally lead all
objects to the same group. Divisive clustering proceeds in an
opposite way. In the beginning, the entire data set belongs to
a cluster and a procedure successively divides it until all clus-
ters are singleton clusters. For a cluster with N objects, there are 2^(N-1) - 1 possible two-subset divisions, which is very expensive in computation [88]. Therefore, divisive clustering is not commonly used in practice. We focus on the agglomerative clustering in the following discussion; some divisive clustering applications for binary data can be found in [88], and two divisive clustering algorithms, named MONA and DIANA, are described in [161].

The general agglomerative clustering can be summarized by the following procedure.
1) Start with N singleton clusters. Calculate the proximity matrix for the N clusters.
2) Search the minimal distance

D(C_i, C_j) = min over all pairs of current clusters C_m, C_l (m != l) of D(C_m, C_l)

where D(., .) is the distance function discussed before, in the proximity matrix, and combine clusters C_i and C_j to form a new cluster.
3) Update the proximity matrix by computing the dis-
tances between the new cluster and the other
clusters.
4) Repeat steps 2)–3) until all objects are in the same
cluster.
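The four-step agglomerative procedure above can be prototyped in a few lines. The sketch below uses single linkage (minimum pairwise distance) as the inter-cluster distance and, for readability, stops when a requested number of clusters remains rather than building the full dendrogram; it is an illustration of the loop, not an efficient implementation, and the names are ours.

import numpy as np

def single_linkage_agglomerative(X, num_clusters):
    """Merge clusters until num_clusters remain, following steps 1)-4)."""
    clusters = [[i] for i in range(len(X))]                   # step 1: singleton clusters
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))     # proximity matrix

    def dist(a, b):                                           # single linkage distance
        return min(D[i, j] for i in a for j in b)

    while len(clusters) > num_clusters:
        # step 2: find the closest pair of clusters
        pairs = [(dist(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        # step 3: merge them; distances to the merged cluster are recomputed by dist
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
print(single_linkage_agglomerative(X, 2))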
Based on the different definitions for distance between two
clusters, there are many agglomerative clustering algorithms.
The simplest and most popular methods include single
linkage
[256] and complete linkage technique [258]. For the single
linkage method, the distance between two clusters is deter-
mined by the two closest objects in different clusters, so
it is also called nearest neighbor method. On the contrary, the
complete linkage method uses the farthest distance of a pair
of objects to define inter-cluster distance. Both the single
linkage and the complete linkage method can be generalized
by the recurrence formula proposed by Lance and Williams
[178] as

D(C_l, (C_i, C_j)) = alpha_i D(C_l, C_i) + alpha_j D(C_l, C_j) + beta D(C_i, C_j) + gamma |D(C_l, C_i) - D(C_l, C_j)|

where D(., .) is the distance function and alpha_i, alpha_j, beta, and gamma are coefficients that take values dependent on the scheme used. The formula describes the distance between a cluster C_l and a new cluster formed by the merge of two clusters C_i and C_j. Note that when alpha_i = alpha_j = 1/2, beta = 0, and gamma = -1/2, the formula becomes

D(C_l, (C_i, C_j)) = min(D(C_l, C_i), D(C_l, C_j))

which corresponds to the single linkage method. When alpha_i = alpha_j = 1/2, beta = 0, and gamma = 1/2, the formula is

D(C_l, (C_i, C_j)) = max(D(C_l, C_i), D(C_l, C_j))

which corresponds to the complete linkage method.


Several more complicated agglomerative clustering algo-
rithms, including group average linkage, median linkage,
centroid linkage, and Ward’s method, can also be constructed
by selecting appropriate coefficients in the formula. A
detailed table describing the coefficient values for different
algorithms is offered in [150] and [210]. Single linkage,
complete linkage and average linkage consider all points of a
pair of clusters, when calculating their inter-cluster distance,
and are also called graph methods. The others are called
geometric methods since they use geometric centers to
represent clusters and determine their distances. Remarks on
important features and properties of these methods are
summarized in [88]. More inter-cluster
distance measures, especially the mean-based ones, were introduced by Yager, with further discussion on their possible effect to control the hierarchical clustering process [289].

The common criticism for classical HC algorithms is that they lack robustness and are, hence, sensitive to noise and outliers. Once an object is assigned to a cluster, it will not be considered again, which means that HC algorithms are not capable of correcting possible previous misclassification. The computational complexity for most HC algorithms is at least O(N^2) and this high cost limits their application in large-scale data sets. Other disadvantages of HC include the tendency to form spherical shapes and the reversal phenomenon, in which the normal hierarchical structure is distorted.

In recent years, with the requirement for handling large-scale data sets in data mining and other fields, many new HC techniques have appeared and greatly improved the clustering performance. Typical examples include CURE [116], ROCK [117], Chameleon [159], and BIRCH [295].

The main motivations of BIRCH lie in two aspects, the ability to deal with large data sets and the robustness to outliers [295]. In order to achieve these goals, a new data structure, the clustering feature (CF) tree, is designed to store the summaries of the original data. The CF tree is a height-balanced tree, with each internal vertex composed of entries defined as [CF_i, child_i], i = 1, ..., B, where CF_i is a representation of the ith cluster and is defined as CF_i = (N_i, LS_i, SS_i), where N_i is the number of data objects in the cluster, LS_i is the linear sum of the objects, and SS_i is the squared sum of the objects; child_i is a pointer to the ith child node, and B is a threshold parameter that determines the maximum number of entries in the vertex. Each leaf is composed of entries in the form of [CF_i], i = 1, ..., L, where L is the threshold parameter that controls the maximum number of entries in the leaf. Moreover, the leaves must follow the restriction that the diameter of each entry in the leaf is less than a threshold T. The CF tree structure captures the important clustering information of the original data while reducing the required storage. Outliers are eliminated from the summaries by identifying the objects sparsely distributed in the feature space. After the CF tree is built, an agglomerative HC is applied to the set of summaries to perform global clustering. An additional step may be performed to refine the clusters. BIRCH can achieve a computational complexity of O(N).
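To make the CF entries concrete, the sketch below stores the (N, LS, SS) triple and shows its additivity when two subclusters are merged, plus how a centroid can be read back from the summary alone. This is our own illustration under the CF definition given above, not the BIRCH implementation, and it omits the tree logic entirely.

import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum, and squared sum of the points."""
    def __init__(self, points):
        points = np.asarray(points, float)
        self.n = len(points)
        self.ls = points.sum(axis=0)         # linear sum, a d-dimensional vector
        self.ss = (points ** 2).sum()        # squared sum (kept as a scalar here)

    def merge(self, other):
        """CF triples are additive: merging subclusters just adds the components."""
        merged = ClusteringFeature(np.empty((0, len(self.ls))))
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged

    def centroid(self):
        return self.ls / self.n              # derived from the summary alone

a = ClusteringFeature([[1.0, 2.0], [2.0, 2.0]])
b = ClusteringFeature([[8.0, 9.0]])
print(a.merge(b).centroid())                 # [3.666..., 4.333...]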
Noticing the restriction of centroid-based HC, which is unable to identify arbitrary cluster shapes, Guha, Rastogi, and Shim developed a HC algorithm, called CURE, to explore more sophisticated cluster shapes [116]. The crucial feature of CURE lies in the usage of a set of well-scattered points to represent each cluster, which makes it possible to find rich cluster shapes other than hyperspheres and avoids both the chaining effect [88] of the minimum linkage method and the tendency to favor clusters with similar sizes of centroid. These representative points are further shrunk toward the cluster centroid according to an adjustable parameter in order to weaken the effects of outliers. CURE utilizes a random sample (and partition) strategy to reduce computational complexity. Guha et al. also proposed another agglomerative HC algorithm, ROCK, to group data with qualitative attributes [117]. They used a novel measure "link" to describe the relation between a pair of objects and their common neighbors. Like CURE, a random sample strategy is used to handle large data sets. Chameleon is constructed from graph theory and will be discussed in Section II-E.

Relative hierarchical clustering (RHC) is another exploration that considers both the internal distance (distance between a pair of clusters which may be merged to yield a new cluster) and the external distance (distance from the two clusters to the rest), and uses the ratio of them to decide the proximities [203]. Leung et al. showed an interesting hierarchical clustering based on scale-space theory [180]. They interpreted clustering using a blurring process, in which each datum is regarded as a light point in an image, and a cluster is represented as a blob. Li and Biswas extended agglomerative HC to deal with both numeric and nominal data. The proposed algorithm, called similarity-based agglomerative clustering (SBAC), employs a mixed data measure scheme that pays extra attention to less common matches of feature values [183]. Parallel techniques for HC are discussed in [69] and [217], respectively.

C. Squared Error-Based Clustering (Vector Quantization)

In contrast to hierarchical clustering, which yields a successive level of clusters by iterative fusions or divisions, partitional clustering assigns a set of objects into K clusters with no hierarchical structure. In principle, the optimal partition, based on some specific criterion, can be found by enumerating all possibilities. But this brute force method is infeasible in practice, due to the expensive computation [189]. Even for a small-scale clustering problem (organizing 30 objects into 3 groups), the number of possible partitions is on the order of 10^13. Therefore, heuristic algorithms have been developed in order to seek approximate solutions.

One of the important factors in partitional clustering is the criterion function [124]. The sum of squared error function is one of the most widely used criteria. Suppose we have a set of objects x_j, j = 1, ..., N, and we want to organize them into K subsets C = {C_1, ..., C_K}. The squared error criterion is then defined as

J(Gamma, M) = sum over i = 1..K, j = 1..N of gamma_ij ||x_j - m_i||^2

where
Gamma = [gamma_ij]   a partition matrix; gamma_ij = 1 if x_j belongs to cluster i, and 0 otherwise, with sum over i of gamma_ij = 1 for all j;
M = [m_1, ..., m_K]   the cluster prototype or centroid (means) matrix;
m_i = (1/N_i) sum over j of gamma_ij x_j   the sample mean for the ith cluster;
N_i = sum over j of gamma_ij   the number of objects in the ith cluster.
Note the relation between the sum of squared error criterion and the scatter matrices defined in multiclass discriminant analysis [75],

S_T = S_W + S_B

where
S_T   total scatter matrix;
S_W = sum over i of sum over x_j in C_i of (x_j - m_i)(x_j - m_i)^T   within-class scatter matrix;
S_B = sum over i of N_i (m_i - m)(m_i - m)^T   between-class scatter matrix;
m = (1/N) sum over j of x_j   mean vector for the whole data set.

It is not difficult to see that the criterion based on the trace of S_W is the same as the sum of squared error criterion. To minimize the squared error criterion is equivalent to minimizing the trace of S_W or maximizing the trace of S_B. We can obtain a rich class of criterion functions based on the characteristics of S_W and S_B [75].
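The equivalence between the sum of squared error criterion and the trace of the within-class scatter matrix can be checked numerically. The toy sketch below uses our own variable names and a hard partition given as integer labels.

import numpy as np

def sse_and_trace_sw(X, labels):
    """Return (J, trace(S_W)) for a hard partition given by integer labels."""
    d = X.shape[1]
    J = 0.0
    S_W = np.zeros((d, d))
    for k in np.unique(labels):
        Ck = X[labels == k]
        mk = Ck.mean(axis=0)                  # cluster mean m_k
        J += ((Ck - mk) ** 2).sum()           # sum of squared errors
        S_W += (Ck - mk).T @ (Ck - mk)        # within-class scatter contribution
    return J, np.trace(S_W)

X = np.array([[0., 0.], [0., 2.], [5., 5.], [6., 5.]])
labels = np.array([0, 0, 1, 1])
print(sse_and_trace_sw(X, labels))            # both values equal 2.5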
The K-means algorithm is the best-known squared error-based clustering algorithm [94], [191].
1) Initialize a K-partition randomly or based on some prior knowledge. Calculate the cluster prototype matrix M = [m_1, ..., m_K].
2) Assign each object x_j in the data set to the nearest cluster C_w, i.e., x_j belongs to C_w if ||x_j - m_w|| <= ||x_j - m_i|| for j = 1, ..., N and i != w, i = 1, ..., K.
3) Recalculate the cluster prototype matrix based on the


current partition.
4) Repeat steps 2)–3) until there is no change for each
cluster.
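A compact Python version of steps 1)-4) is given below, assuming Euclidean distance and random initial prototypes; it is a sketch for illustration, not a library-quality implementation, and the names are ours.

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), size=K, replace=False)]       # step 1: initial prototypes
    for _ in range(max_iter):
        # step 2: assign each object to the nearest prototype
        dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # step 3: recalculate the prototype matrix from the current partition
        new_M = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else M[k]
                          for k in range(K)])
        if np.allclose(new_M, M):                           # step 4: stop when nothing changes
            break
        M = new_M
    return labels, M

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, K=2)
print(centroids)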
The K-means algorithm is very simple and can be easily implemented in solving many practical problems. It can work very well for compact and hyperspherical clusters. The time complexity of K-means is O(NKd). Since K and d are usually much less than N, K-means can be used to cluster large data sets. Parallel techniques for K-means have been developed that can largely accelerate the algorithm [262]. The drawbacks of K-means are also well studied, and as a result, many variants of K-means have appeared in order to overcome these obstacles.
We summarize some of the major disadvantages with the pro-
posed improvement in the following.
1) There is no efficient and universal method for iden-
tifying the initial partitions and the number of clus-
ters K. The convergence centroids vary with different
initial points. A general strategy for the problem is
to run the algorithm many times with random initial
partitions. Peña, Lozano, and Larrañaga compared
the random method with other three classical initial
parti- tion methods by Forgy [94], Kaufman [161], and
Mac- Queen [191], based on the effectiveness,
robustness, and convergence speed criteria [227].
According to their experimental results, the random
and Kaufman’s method work much better than the
other two under the first two criteria and by further
considering the conver- gence speed, they
recommended Kaufman’s method. Bradley and
Fayyad presented a refinement algorithm that first utilizes K-means M times on M random subsets from the original data [43]. The set formed from the union of the M solutions (centroids of the clusters) of the subsets is clustered M times again, setting each subset solution as the initial guess. The starting points for the whole data are obtained by choosing the solution with the minimal sum of squared distances. Likas, Vlassis, and Verbeek proposed a global K-means algorithm consisting of a series of K-means clustering procedures with the number of clusters varying from 1 to K [186]. After finding the centroid for only one cluster existing, at each k, the previous k - 1 centroids are fixed and the new centroid is selected by
examining all data points. The authors claimed that the
algorithm is independent of the initial partitions and
provided accelerating strategies. But the problem on
computational complexity exists, due to the require-
ment for executing K-means N times for each value of k.
An interesting technique, called ISODATA, devel-
oped by Ball and Hall [21], deals with the estimation
of K. ISODATA can dynamically adjust the number of
clusters by merging and splitting clusters according to
some predefined thresholds (in this sense, the problem
of identifying the initial number of clusters becomes
that of parameter (threshold) tweaking). The new is
used as the expected number of clusters for the next it-
eration.
2) The iteratively optimal procedure of K-means cannot
guarantee convergence to a global optimum. The sto-
chastic optimal techniques, like simulated annealing
(SA) and genetic algorithms (also see part II.F), can
find the global optimum with the price of expensive
computation. Krishna and Murty designed new opera-
tors in their hybrid scheme, GKA, in order to achieve
global search and fast convergence [173]. The defined
biased mutation operator is based on the Euclidean
distance between an object and the centroids and aims
to avoid getting stuck in a local optimum. Another
operator, the K-means operator (KMO), replaces the
computationally expensive crossover operators and
alleviates the complexities coming with them. An
adaptive learning rate strategy for the online mode
K-means is illustrated in [63]. The learning rate is
exclusively dependent on the within-group variations
and can be adjusted without involving any user activi-
ties. The proposed enhanced LBG (ELBG) algorithm
adopts a roulette mechanism typical of genetic algo-
rithms to become near-optimal and therefore, is not
sensitive to initialization [222].
3) K-means is sensitive to outliers and noise. Even if an
object is quite far away from the cluster centroid, it is
still forced into a cluster and, thus, distorts the cluster
shapes. ISODATA [21] and PAM [161] both consider
the effect of outliers in clustering procedures. ISO-
DATA gets rid of clusters with few objects. The split-
ting operation of ISODATA eliminates the possibility
of elongated clusters typical of K-means. PAM utilizes real data points (medoids) as the cluster prototypes and avoids the effect of outliers. Based on the same consideration, a K-medoids algorithm is presented in [87] by searching the discrete 1-medians as the cluster centroids.
4) The definition of "means" limits the application only to numerical variables. The K-medoids algorithm mentioned previously is a natural choice, when the computation of means is unavailable, since the medoids do not need any computation and always exist [161]. Huang [142] and Gupta et al. [118] defined different dissimilarity measures to extend K-means to categorical variables. For Huang's method, the clustering goal is to minimize the cost function

E = sum over i = 1..K, j = 1..N of gamma_ij D(x_j, m_i)

where gamma_ij in {0, 1} indicates whether x_j belongs to the ith cluster and D(., .) is a matching-based dissimilarity for categorical values, with a set of K d-dimensional vectors M = {m_1, ..., m_K}, where m_i = (m_i1, ..., m_id). Each vector m_i is known as a mode and is defined to minimize the sum of distances sum over j of gamma_ij D(x_j, m_i). The proposed K-modes algorithm operates in a similar way as K-means.

Several recent advances on K-means and other squared-error based clustering algorithms with their applications can be found in [125], [155], [222], [223], [264], and [277].

D. Mixture Densities-Based Clustering (pdf Estimation via Mixture Densities)

In the probabilistic view, data objects are assumed to be generated according to several probability distributions. Data points in different clusters were generated by different probability distributions. They can be derived from different types of density functions (e.g., multivariate Gaussian or t-distribution), or the same families, but with different parameters. If the distributions are known, finding the clusters of a given data set is equivalent to estimating the parameters of several underlying models. Suppose the prior probability (also known as mixing probability) P(C_i) for cluster C_i, i = 1, ..., K (here, K is assumed to be known and methods for estimating K are discussed in Section II-M), and the conditional probability density p(x|C_i, theta_i) (also known as component density), where theta_i is the unknown parameter vector, are known. Then, the mixture probability density for the whole data set is expressed as

p(x|theta) = sum over i = 1..K of p(x|C_i, theta_i) P(C_i)

where theta = (theta_1, ..., theta_K) and sum over i of P(C_i) = 1. As long as the parameter vector theta is decided, the posterior probability for assigning a data point to a cluster can be easily calculated with Bayes's theorem. Here, the mixtures can be constructed with any types of components, but more commonly, multivariate Gaussian densities are used due to their complete theory and analytical tractability [88], [297].

Maximum likelihood (ML) estimation is an important statistical approach for parameter estimation [75] and it considers the best estimate as the one that maximizes the probability of generating all the observations, which is given by the joint density function

p(X|theta) = product over j = 1..N of p(x_j|theta)

or, in a logarithm form,

L(theta) = sum over j = 1..N of ln p(x_j|theta).

The best estimate can be achieved by solving the log-likelihood equations dL(theta)/dtheta_i = 0, i = 1, ..., K.

Unfortunately, since the solutions of the likelihood equations cannot be obtained analytically in most circumstances [90], [197], iteratively suboptimal approaches are required to approximate the ML estimates. Among these methods, the expectation-maximization (EM) algorithm is the most popular [196]. EM regards the data set as incomplete and divides each data point x_j into two parts, x_j = {x_j^g, x_j^m}, where x_j^g represents the observable features and x_j^m = (x_j1^m, ..., x_jK^m) is the missing data, where x_jk^m takes value 1 or 0 according to whether x_j belongs to the kth component or not. Thus, the complete data log-likelihood is

L_c(theta) = sum over j = 1..N, k = 1..K of x_jk^m ln [P(C_k) p(x_j|theta_k)].
The standard EM algorithm generates a series of parameter estimates {theta^0, theta^1, ..., theta^T}, where T represents the reaching of the convergence criterion, through the following steps:
1) initialize theta^0 and set t = 0;
2) e-step: compute the expectation of the complete data log-likelihood, Q(theta, theta^t) = E[L_c(theta) | X^g, theta^t];
3) m-step: select a new parameter estimate that maximizes the Q-function, theta^(t+1) = arg max over theta of Q(theta, theta^t);
4) increase t = t + 1; repeat steps 2)-3) until the convergence condition is satisfied.
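For concreteness, a bare-bones EM loop for a one-dimensional Gaussian mixture is sketched below; the variable names and the one-dimensional simplification are ours, not the paper's.

import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                    # mixing probabilities P(C_k)
    mu = rng.choice(x, size=K, replace=False)   # initial component means
    var = np.full(K, x.var())                   # initial component variances
    for _ in range(iters):
        # E-step: expected memberships (posterior probabilities of each component)
        resp = np.stack([pi[k] * gaussian(x, mu[k], var[k]) for k in range(K)])
        resp /= resp.sum(axis=0, keepdims=True)
        # M-step: re-estimate the parameters that maximize the Q-function
        Nk = resp.sum(axis=1)
        pi = Nk / len(x)
        mu = (resp * x).sum(axis=1) / Nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk
    return pi, mu, var

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm_1d(x))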
The major disadvantages for EM algorithm are the
sensitivity to the selection of initial parameters, the effect of
a singular co- variance matrix, the possibility of convergence
to a local op- timum, and the slow convergence rate [96],
[196]. Variants of EM for addressing these problems are
discussed in [90] and [196].
A valuable theoretical note is the relation between the EM
algorithm and the -means algorithm. Celeux and Govaert
proved that classification EM (CEM) algorithm under a
spherical Gaussian mixture is equivalent to the K-means
algorithm [58].
Fraley and Raftery described a comprehensive mixture-model based clustering scheme [96], which was implemented as a software package, known as MCLUST [95]. In this case, the component density is multivariate Gaussian, with a mean vector and a covariance matrix as the parameters to be estimated. The covariance matrix for each component can further be parameterized by virtue of eigenvalue decomposition, represented as Sigma = lambda D A D^T, where lambda is a scalar, D is the orthogonal matrix of eigenvectors, and A is the diagonal matrix based on the eigenvalues of Sigma [96]. These three elements determine the geometric properties of each component. After the maximum number of clusters and the candidate models are specified, an agglomerative hierarchical clustering is used to ignite the EM algorithm by forming an initial partition, which includes at most the maximum number of clusters, for each model. The optimal clustering result is achieved by checking the Bayesian information criterion (BIC) value discussed in Section II-M. GMDD is also based on multivariate Gaussian densities and is designed as a recursive algorithm that sequentially estimates each component [297]. GMDD views data points that are not generated from a distribution as noise and utilizes an enhanced model-fitting estimator to construct each component from the contaminated model. AutoClass considers more families of probability distributions (e.g., Poisson and Bernoulli) for different data types [59]. A Bayesian approach is used in AutoClass to find out the optimal partition of the given data based on the prior probabilities. Its parallel realization is described in [228]. Other important algorithms and programs include Multimix [147], EM based mixture program (EMMIX) [198], and Snob [278].

E. Graph Theory-Based Clustering

The concepts and properties of graph theory [126] make it very convenient to describe clustering problems by means of graphs. Nodes of a weighted graph correspond to data points in the pattern space and edges reflect the proximities between each pair of data points. If the dissimilarity matrix is defined as

D_ij = 1 if d(x_i, x_j) < d_0, and D_ij = 0 otherwise

where d_0 is a threshold value, the graph is simplified to an unweighted threshold graph. Both the single linkage HC and the complete linkage HC can be described on the basis of the threshold graph. Single linkage clustering is equivalent to seeking maximally connected subgraphs (components) while complete linkage clustering corresponds to finding maximally complete subgraphs (cliques) [150]. Jain and Dubes illustrated and discussed more applications of graph theory (e.g., Hubert's algorithm and Johnson's algorithm) for hierarchical clustering in [150]. Chameleon [159] is a newly developed agglomerative HC algorithm based on the k-nearest-neighbor graph, in which an edge is eliminated if both vertices are not within the k closest points related to each other. At the first step, Chameleon divides the connectivity graph into a set of subclusters with the minimal edge cut. Each subgraph should contain enough nodes in order for effective similarity computation. By combining both the relative interconnectivity and relative closeness, which make Chameleon flexible enough to explore the characteristics of potential clusters, Chameleon merges these small subsets and, thus, comes up with the ultimate clustering solutions. Here, the relative interconnectivity (or closeness) is obtained by normalizing the sum of weights (or average weight) of the edges connecting the two clusters over the internal connectivity (or closeness) of the clusters. DTG is another important graph representation for HC analysis. Cherng and Lo constructed a hypergraph (each edge is allowed to connect more than two vertices) from the DTG and used a two-phase algorithm that is similar to Chameleon to find clusters [61]. Another DTG-based application, known as the AMOEBA algorithm, is presented in [86].

Graph theory can also be used for nonhierarchical clusters. Zahn's clustering algorithm seeks connected components as clusters by detecting and discarding inconsistent edges in the minimum spanning tree [150]. Hartuv and Shamir treated clusters as HCS, where "highly connected" means the connectivity (the minimum number of edges needed to disconnect a graph) of the subgraph is at least half as great as the number of the vertices [128]. A minimum cut (mincut) procedure, which aims to separate a graph with a minimum number of edges, is used to find these HCSs recursively. Another algorithm, called CLICK, is based on the calculation of the minimum weight cut to form clusters [247]. Here, the graph is weighted and the edge weights are assigned a new interpretation, by combining probability and graph theory. The edge weight between nodes i and j is defined as shown in

w_ij = ln [Prob(i and j belong to the same cluster | S_ij) / Prob(i and j do not belong to the same cluster | S_ij)]

where S_ij represents the similarity between the two nodes. CLICK further assumes that the similarity values within clusters and between clusters follow Gaussian distributions with different means and variances, respectively. Therefore, the previous equation can be rewritten by using Bayes' theorem as

w_ij = ln [p_0 f(S_ij | mu_W, sigma_W^2) / ((1 - p_0) f(S_ij | mu_B, sigma_B^2))]

where p_0 is the prior probability that two objects belong to the same cluster, f(.|mu, sigma^2) denotes the Gaussian density, and (mu_B, sigma_B^2) and (mu_W, sigma_W^2) are the means and variances for between-cluster similarities and within-cluster similarities, respectively. These parameters can be estimated either from prior knowledge, or by using parameter
either from prior knowledge, or by using parameter
estimation methods [75]. CLICK recursively checks the
current subgraph, and generates a kernel list, which consists
of the components satisfying some criterion function.
Subgraphs that include only one node are regarded as
singletons, and are separated for further manipulation. Using
the kernels as the basic clusters, CLICK carries out a series
of singleton adoptions and cluster merge to generate the
resulting clusters. Additional heuristics are provided to
accelerate the algorithm performance.
Similarly, CAST considers a probabilistic model in
designing a graph theory-based clustering algorithm [29].
Clusters are modeled as corrupted clique graphs, which, in
ideal conditions, are regarded as a set of disjoint cliques.
The effect of noise is incorporated by adding or removing
edges from the ideal model, with a given probability. Proofs were given for recovering
the uncorrupted graph with a high probability. CAST is the
heuristic implementation of the original theoretical version.
CAST creates clusters sequentially, and each cluster begins
with a random and unassigned data point. The relation
between a data point and a cluster being built is determined by
the affinity, defined as a(x_j) = sum over x_i in C of S(x_j, x_i) for the cluster C being built, and the affinity threshold
parameter t. When a(x_j) >= t|C|, it means that the data point is
highly related to the cluster and vice versa. CAST alternately
adds high affinity data points or deletes low affinity data
points from the cluster until no more changes occur.
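The add/remove loop of CAST can be sketched as below; the affinity definition and threshold test follow the reconstruction above and should be read as our paraphrase of the idea rather than the authors' exact pseudocode.

import numpy as np

def cast_like(S, t):
    """Greedy CAST-style clustering from a similarity matrix S and threshold t."""
    unassigned = set(range(len(S)))
    clusters = []
    while unassigned:
        C = [unassigned.pop()]                 # open a cluster with a free data point
        changed, guard = True, 0
        while changed and guard < 100:         # guard keeps the sketch from oscillating
            changed, guard = False, guard + 1
            # add the highest-affinity free point if it passes the threshold
            if unassigned:
                aff = {j: S[j, C].sum() for j in unassigned}
                j = max(aff, key=aff.get)
                if aff[j] >= t * len(C):
                    C.append(j); unassigned.remove(j); changed = True
            # remove the lowest-affinity member if it falls below the threshold
            if len(C) > 1:
                aff_in = {i: S[i, C].sum() - S[i, i] for i in C}
                i = min(aff_in, key=aff_in.get)
                if aff_in[i] < t * (len(C) - 1):
                    C.remove(i); unassigned.add(i); changed = True
        clusters.append(C)
    return clusters

S = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(cast_like(S, t=0.5))    # e.g. [[0, 1], [2]]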

F. Combinatorial Search Techniques-Based Clustering


The basic objective of search techniques is to find the global
or approximate global optimum for combinatorial
optimization problems, which usually have NP-hard
complexity and need to search an exponentially large solution
space. Clustering can be regarded as a category of
optimization problems. Given a set of N data points, clustering algorithms aim to organize them into K subsets that optimize some criterion function. The number of possible partitions of N points into K clusters is given by the formula [189]

N(N, K) = (1/K!) sum over m = 0..K of (-1)^m (K choose m) (K - m)^N.

As shown before, even for small N and K, the computa-


tional complexity is extremely expensive, not to mention the
large-scale clustering problems frequently encountered in recent
decades. Simple local search techniques, like hill-climbing al-
gorithms, are utilized to find the partitions, but they are easily
stuck in local minima and therefore cannot guarantee optimality.
More complex search methods (e.g., evolutionary algorithms
(EAs) [93], SA [165], and Tabu search (TS) [108] are known as
stochastic optimization methods, while deterministic
annealing (DA) [139], [234] is the most typical deterministic
search tech- nique) can explore the solution space more
flexibly and effi- ciently.
Inspired by the natural evolution process, evolutionary
com- putation, which consists of genetic algorithms (GAs),
evolution strategies (ESs), evolutionary programming (EP),
and genetic programming (GP), optimizes a population of
structure by using a set of evolutionary operators [93]. An
optimization function, called the fitness function, is the standard
for evaluating the opti- mizing degree of the population, in which
each individual has its corresponding fitness. Selection,
recombination, and mutation are the most widely used
evolutionary operators. The selection operator ensures the
continuity of the population by favoring the best individuals in
the next generation. The recombination and mutation operators
support the diversity of the population by ex- erting
perturbations on the individuals. Among many EAs, GAs
[140] are the most popular approaches applied in cluster anal-
ysis. In GAs, each individual is usually encoded as a binary
bit string, called a chromosome. After an initial population is
gener- ated according to some heuristic rules or just randomly,
a series of operations, including selection, crossover and
mutation, are iteratively applied to the population until the
Hall, Özyurt, and Bezdek proposed a GGA that can be re-
garded as a general scheme for center-based (hard or fuzzy)
clustering problems [122]. Fitness functions are reformulated
from the standard sum of squared error criterion function in
order to adapt the change of the construction of the optimiza-
tion problem (only the prototype matrix is needed)

f(M) = sum over j of min over i of (D_ij)^2   for hard clustering

f(M) = sum over j of (sum over i of D_ij^(2/(1-m)))^(1-m)   for fuzzy clustering

where D_ij = ||x_j - m_i|| is the distance between the ith cluster prototype and the jth data object, and m > 1 is the fuzzification parameter.
GGA proceeds with the following steps.
1) Choose appropriate parameters for the algorithm.
Ini- tialize the population randomly with
individuals, each of which represents a
prototype matrix and is encoded as gray codes.
Calculate the fitness value for each individual.
2) Use selection (tournament selection) operator to
choose parental members for reproduction.
3) Use crossover (two-point crossover) and mutation
(bit- wise mutation) operator to generate offspring
from the individuals chosen in step 2).
4) Determine the next generation by keeping the
individ- uals with the highest fitness.
5) Repeat steps 2)–4) until the termination condition is
satisfied.
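The GGA loop above can be mimicked in miniature. The sketch below evolves prototype matrices with tournament selection, uniform crossover, and Gaussian mutation on real-valued genes (a simplification of the gray-coded scheme described in the text), scoring individuals by the hard-clustering fitness reconstructed above; it is our own illustration, not the cited algorithm.

import numpy as np

def fitness(M, X):
    # hard-clustering criterion: each object charged to its nearest prototype
    d = ((X[:, None, :] - M[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def ga_clustering(X, K, pop_size=20, gens=50, seed=0):
    rng = np.random.default_rng(seed)
    pop = [X[rng.choice(len(X), K, replace=False)].copy() for _ in range(pop_size)]
    for _ in range(gens):
        scores = np.array([fitness(M, X) for M in pop])

        def tournament():
            a, b = rng.integers(pop_size, size=2)
            return pop[a] if scores[a] < scores[b] else pop[b]

        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            mask = rng.random(p1.shape) < 0.5                 # uniform crossover
            child = np.where(mask, p1, p2)
            child += rng.normal(0, 0.1, child.shape) * (rng.random(child.shape) < 0.05)
            children.append(child)
        best = pop[int(scores.argmin())]                      # keep the best individual (elitism)
        pop = children
        pop[0] = best
    scores = np.array([fitness(M, X) for M in pop])
    return pop[int(scores.argmin())]

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
print(ga_clustering(X, K=2))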
Other GAs-based clustering applications have appeared
based on a similar framework. They are different in the
meaning of an individual in the population, encoding
methods, fitness function definition, and evolutionary
operators [67], [195], [273]. The algorithm
CLUSTERING in [273] includes a heuristic scheme for
estimating the appropriate number of clusters in the data. It
also uses a nearest-neighbor algorithm to divide data into
small subsets, before GAs-based clustering, in order to
reduce the computational complexity. GAs are very useful
for improving the performance of K-means algorithms. Babu
and Murty used GAs to find good initial partitions [15].
Krishna and Murty combined GA with K-means and devel-
oped GKA algorithm that can find the global optimum [173].
As indicated in Section II-C, the algorithm ELBG uses the
roulette mechanism to address the problems due to the bad
initialization [222]. It is worthwhile to note that ELBG is
equivalent to another algorithm, fully automatic clustering
system (FACS) [223], in terms of quantization level
detection. The difference lies in the input parameters
employed (ELBG adopts the number of quantization
levels, while FACS uses the desired distortion error).
Besides the previous applications, GAs can also be used for hierarchical clustering. Lozano and Larrañaga discussed the
properties of ultrametric distance [127] and reformulated the
hierarchical clustering as an optimization
problem that tries to find the closest ultrametric distance for a
given dissimilarity with Euclidean norm [190]. They suggested
an order-based GA to solve the problem. Clustering
algorithms based on ESs and EP are described and analyzed in
[16] and [106], respectively.
TS is a combinatory search technique that uses the tabu list
to guide the search process consisting of a sequence of moves.
The tabu list stores part or all of previously selected moves
according to the specified size. These moves are forbidden in
the current search and are called tabu. In the TS clustering
algorithm developed by Al-Sultan [9], a set of candidate solutions are generated from the current solution with some strategy. Each candidate solution represents the allocations of data objects in clusters. The candidate with the optimal cost function is selected as the current solution and appended to the tabu list, if it is not already in the tabu list or meets the aspiration criterion, which can overrule the tabu restriction. Otherwise, the remaining candidates are evaluated in the order of their cost function values, until all these conditions are satisfied. When all the candidates are tabu, a new set of candidate solutions are created, followed by the same search process. The search process proceeds until the maximum number of iterations is reached. Sung and Jin's method includes more elaborate search processes with the packing and releasing procedures [266]. They also used a secondary tabu list to keep the search from getting trapped in potential cycles. A fuzzy version of TS clustering can be found in [72].
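A compact sketch of the tabu-search loop described above is given below. The move generation (reassigning one randomly chosen object), the candidate-set size, and the tabu-list length are illustrative choices, not the settings of [9].

```python
import numpy as np

def sse_cost(X, labels, k):
    """Sum of squared errors of a hard assignment (the cost to minimize)."""
    cost = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            cost += ((members - members.mean(axis=0)) ** 2).sum()
    return cost

def tabu_clustering(X, k, n_candidates=20, tabu_size=10, max_iter=200, seed=0):
    """Candidate solutions reassign one object; recent moves are tabu unless
    they satisfy the aspiration criterion (improving the best cost so far)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    current = rng.integers(k, size=n)
    best, best_cost = current.copy(), sse_cost(X, current, k)
    tabu = []  # recently applied (object, new_cluster) moves
    for _ in range(max_iter):
        candidates = []
        for _ in range(n_candidates):
            i, c = int(rng.integers(n)), int(rng.integers(k))
            cand = current.copy()
            cand[i] = c
            cost = sse_cost(X, cand, k)
            if (i, c) not in tabu or cost < best_cost:  # aspiration criterion
                candidates.append((cost, (i, c), cand))
        if not candidates:
            continue  # all candidates tabu; draw a fresh candidate set
        cost, move, cand = min(candidates, key=lambda t: t[0])
        current = cand
        tabu.append(move)
        if len(tabu) > tabu_size:
            tabu.pop(0)
        if cost < best_cost:
            best, best_cost = cand.copy(), cost
    return best, best_cost
```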
SA is also a sequential and global search technique and is motivated by the annealing process in metallurgy [165]. SA allows the search process to accept a worse solution with a certain probability. The probability is controlled by a parameter known as the temperature $T$ and is usually expressed as $\exp(-\Delta E/T)$, where $\Delta E$ is the change of the energy (cost function). The temperature goes through an annealing schedule from initial high to ultimate low values, which means that SA attempts to explore the solution space more completely at high temperatures, while favoring solutions that lead to lower energy at low temperatures. SA-based clustering was reported in [47] and [245]. The former illustrated an application of SA clustering to evaluate different clustering criteria and the latter investigated the effects of input parameters on the clustering performance.
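The core of any SA-based clustering scheme is the Metropolis acceptance rule built on the probability $\exp(-\Delta E/T)$; a minimal sketch follows, with the cooling schedule left as a simple illustrative multiplicative decay.

```python
import math
import random

def accept(current_cost, candidate_cost, temperature):
    """Metropolis rule: always accept improvements, and accept a worse
    candidate solution with probability exp(-dE / T)."""
    d_e = candidate_cost - current_cost
    if d_e <= 0:
        return True
    return random.random() < math.exp(-d_e / temperature)

# A cooling schedule drives T from a high initial value toward zero, e.g.
# T = 0.95 * T after each sweep, so worse moves become increasingly unlikely.
```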
Hybrid approaches that combine these search techniques have also been proposed. A tabu list is used in a GA clustering algorithm to preserve the variety of the population and avoid repeating computation [243]. An application of SA for improving TS was reported in [64]. The algorithm further reduces the possible moves to local optima.
The main drawback that plagues the search techniques-based clustering algorithms is parameter selection. More often than not, search techniques introduce more parameters than other methods (like k-means). There are no theoretical guidelines to select the appropriate and effective parameters. Hall et al. provided some methods for setting parameters in their GAs-based clustering framework [122], but most of these criteria are still obtained empirically. The same situation exists for TS and SA clustering [9], [245]. Another problem is the computational complexity paid for convergence to global optima. The high computational requirement limits their applications in large-scale data sets.
G. Fuzzy Clustering
Except for GGA, the clustering techniques we have discussed so far are referred to as hard or crisp clustering, which means that each object is assigned to only one cluster. For fuzzy clustering, this restriction is relaxed, and the object can belong to all of the clusters with a certain degree of membership [293]. This is particularly useful when the boundaries among the clusters are ambiguous and not well separated. Moreover, the memberships may help us discover more sophisticated relations between a given object and the disclosed clusters.
FCM is one of the most popular fuzzy clustering
algorithms [141]. FCM can be regarded as a generalization
of ISODATA
[76] and was realized by Bezdek [35]. FCM attempts to find a partition ($c$ fuzzy clusters) for a set of data objects $\mathbf{x}_j$, $j=1,\dots,N$, while minimizing the cost function

$$J(U,M)=\sum_{i=1}^{c}\sum_{j=1}^{N}\big(u_{ij}\big)^{m} D_{ij}$$

where
$U=[u_{ij}]_{c\times N}$ is the fuzzy partition matrix and $u_{ij}\in[0,1]$ is the membership coefficient of the $j$th object in the $i$th cluster;
$M=[\mathbf{m}_1,\dots,\mathbf{m}_c]$ is the cluster prototype (mean or center) matrix;
$m\in[1,\infty)$ is the fuzzification parameter and usually is set to 2 [129];
$D_{ij}=D(\mathbf{x}_j,\mathbf{m}_i)$ is the distance measure between $\mathbf{x}_j$ and $\mathbf{m}_i$ (the squared Euclidean distance in the algorithm below).
We summarize the standard FCM as follows, in which the Euclidean ($L_2$) norm distance function is used.
1) Select appropriate values for $m$, $c$, and a small positive number $\varepsilon$. Initialize the prototype matrix $M$ randomly. Set step variable $t=0$.
2) Calculate (at $t=0$) or update (at $t>0$) the membership matrix $U$ by

$$u_{ij}^{(t+1)}=\bigg(\sum_{l=1}^{c}\Big(D_{ij}/D_{lj}\Big)^{1/(m-1)}\bigg)^{-1} \qquad \text{for } i=1,\dots,c \text{ and } j=1,\dots,N.$$

3) Update the prototype matrix $M$ by

$$\mathbf{m}_i^{(t+1)}=\bigg(\sum_{j=1}^{N}\big(u_{ij}^{(t+1)}\big)^{m}\mathbf{x}_j\bigg)\bigg/\sum_{j=1}^{N}\big(u_{ij}^{(t+1)}\big)^{m}.$$

4) Repeat steps 2)–3) until $\big\|M^{(t+1)}-M^{(t)}\big\|<\varepsilon$.
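A compact implementation of the two alternating updates in steps 2) and 3), with squared Euclidean distances and the stopping test of step 4), might look as follows (a sketch; the defaults are illustrative).

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Standard FCM: alternate the membership and prototype updates until
    the prototype matrix stops moving."""
    rng = np.random.default_rng(seed)
    n = len(X)
    M = X[rng.choice(n, c, replace=False)].astype(float)  # prototype matrix
    U = None
    for _ in range(max_iter):
        # Squared Euclidean distances D_ij, arranged as (object j, cluster i).
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2) + 1e-12
        # u_ij = 1 / sum_l (D_ij / D_lj)^(1/(m-1))
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)
        Um = U ** m
        M_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
        if np.linalg.norm(M_new - M) < eps:
            M = M_new
            break
        M = M_new
    return U, M
```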


Numerous FCM variants and other fuzzy clustering algorithms have appeared as a result of intensive investigation of the distance measure functions, the effect of the weighting exponent on fuzziness control, the optimization approaches for fuzzy partition, and improvements of the drawbacks of FCM [84], [141].
Like its hard counterpart, FCM also suffers from the presence of noise and outliers and the difficulty of identifying initial partitions. Yager and Filev proposed the mountain method (MM) in order to estimate the centers of clusters [290]. Candidate centers consist of a set of vertices that are formed by building a grid on the pattern space. The mountain function for the $i$th vertex $\mathbf{v}_i$ is defined as

$$M(\mathbf{v}_i)=\sum_{j=1}^{N}\exp\big(-\alpha\,D(\mathbf{x}_j,\mathbf{v}_i)\big)$$

where $D(\mathbf{x}_j,\mathbf{v}_i)$ is the distance between the $j$th data object and the $i$th node, and $\alpha$ is a positive constant. Therefore, the closer a data object is to a vertex, the more the data object contributes to the mountain function. The vertex with the maximum value of the mountain function is selected as the first
center. A procedure, called mountain destruction, is performed to get rid of the effects of the selected center. This is achieved by subtracting from the mountain function value of each remaining vertex an amount dependent on the current maximum mountain function value and the distance between the vertex and the center. The process iterates until the ratio between the current maximum and the first maximum is below some threshold. The connection of MM with several other fuzzy clustering algorithms was further discussed in [71]. Gath and Geva described an initialization strategy of unsupervised tracking of cluster prototypes in their two-layer clustering scheme, in which FCM and fuzzy ML estimation are effectively combined [102].
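The following sketch walks through the mountain-function evaluation and mountain destruction just described. The grid resolution, the constants in the exponentials, and the stopping ratio are illustrative choices rather than values from [290], and the grid construction is only practical for low-dimensional pattern spaces.

```python
import numpy as np
from itertools import product

def mountain_centers(X, grid_per_dim=10, alpha=5.0, beta=5.0, stop_ratio=0.1):
    """Mountain method sketch: score grid vertices, repeatedly take the highest
    peak as a center, then destroy the mountain around it."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    axes = [np.linspace(lo[d], hi[d], grid_per_dim) for d in range(X.shape[1])]
    V = np.array(list(product(*axes)))                       # candidate vertices
    dist = np.sqrt(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    m = np.exp(-alpha * dist).sum(axis=1)                    # mountain function
    centers, first_max = [], m.max()
    while m.max() > stop_ratio * first_max:
        j = int(np.argmax(m))
        centers.append(V[j])
        # Mountain destruction: subtract the selected peak's influence.
        d_center = np.sqrt(((V - V[j]) ** 2).sum(axis=1))
        m = m - m[j] * np.exp(-beta * d_center)
    return np.array(centers)
```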
Kersten suggested that the city block distance (or $L_1$ norm) could improve the robustness of FCM to outliers [163]. Furthermore, Hathaway, Bezdek, and Hu extended FCM to a more universal case by using the Minkowski distance ($L_p$ norm) and seminorm for the models that operate either directly on the data objects or indirectly on the dissimilarity measures [130]. According to their empirical results, the object-data-based models with the $L_1$ and $L_2$ norms are recommended. They also pointed out the possible improvement of the models for other norms, at the price of more complicated optimization operations. PCM is another approach for dealing with outliers [175]. Under this model, the memberships are interpreted by a possibilistic view, i.e., “the compatibilities of the points with the class prototypes” [175]. The effect of noise and outliers is abated with the consideration of typicality. In this case, the first condition for the membership coefficient described in Section I is relaxed: the memberships of an object are no longer required to sum to one across clusters. Accordingly, the cost function is reformulated as

$$J(U,M)=\sum_{i=1}^{c}\sum_{j=1}^{N}\big(u_{ij}\big)^{m} D_{ij}+\sum_{i=1}^{c}\eta_{i}\sum_{j=1}^{N}\big(1-u_{ij}\big)^{m}$$

where the $\eta_i$ are some positive constants. The additional term tends to give credit to memberships with large values. A modified version, designed to find appropriate clusters, is proposed in [294]. Davé and Krishnapuram further elaborated the discussion on fuzzy clustering robustness and indicated its connection with robust statistics [71]. Relations among some widely used fuzzy clustering algorithms were discussed and their similarities to some robust statistical methods were also reviewed. They reached a unified framework as the conclusion of the previous discussion and proposed generic algorithms for robust clustering.
The standard FCM alternates the calculation of the membership and prototype matrices, which causes a computational burden for large-scale data sets. Kolen and Hutcheson accelerated the computation by combining updates of the two matrices [172]. Hung and Yang proposed a method to reduce the computational time by identifying more accurate cluster centers [146]. FCM variants have also been developed to deal with other data types, such as symbolic data [81] and data with missing values [129].
A family of fuzzy c-shells algorithms has also appeared to detect different types of cluster shapes, especially contours (lines, circles, ellipses, rings, rectangles, hyperbolas) in a two-dimensional data space. They use the "shells" (curved surfaces [70]) as the cluster prototypes instead of the points or surfaces of traditional fuzzy clustering algorithms. In the case of FCS [36], [70], the proposed cluster prototype is represented as a $p$-dimensional hyperspherical shell ($p=2$ for circles), characterized by a center $\mathbf{c}_i$ and a radius $r_i$. A distance function is defined as $D(\mathbf{x}_j,\mathbf{c}_i,r_i)=\big(\|\mathbf{x}_j-\mathbf{c}_i\|-r_i\big)^{2}$ to measure the distance from a data object to the prototype. Similarly, other cluster shapes can be achieved by defining appropriate prototypes and corresponding distance functions; examples include fuzzy c-spherical shells (FCSS) [176], fuzzy c-rings (FCR) [193], fuzzy c-quadric shells (FCQS) [174], and fuzzy c-rectangular shells (FCRS) [137]. See [141] for further details.
Fuzzy set theories can also be used to create a hierarchical cluster structure. Geva proposed a hierarchical unsupervised fuzzy clustering (HUFC) algorithm [104], which can effectively explore data structure at different levels like HC, while establishing the connections between each object and cluster in the hierarchy with the memberships. This design makes HUFC overcome one of the major disadvantages of HC, i.e., HC cannot reassign an object once it is designated into a cluster. Fuzzy clustering is also closely related to neural networks [24], and we will see more discussions in the following section.

H. Neural Networks-Based Clustering


Neural networks-based clustering has been dominated by
SOFMs and adaptive resonance theory (ART), both of which
are reviewed here, followed by a brief discussion of other
approaches.
In competitive neural networks, active neurons reinforce their neighborhood within certain regions, while suppressing the activities of other neurons (so-called on-center/off-surround competition). Typical examples include LVQ and SOFM [168], [169]. Intrinsically, LVQ performs supervised learning, and is not categorized as a clustering algorithm [169], [221]. But its learning properties provide an insight into describing the potential data structure using the prototype vectors in the competitive layer. By pointing out the limitations of LVQ, including sensitivity to initialization and lack of a definite clustering objective, Pal, Bezdek, and Tsao proposed a general LVQ algorithm for clustering, known as GLVQ [221] (also see [157] for its improved version GLVQ-F). They constructed the clustering problem as an optimization process based on minimizing a loss function, which is defined on the locally weighted error between the input pattern and the winning prototype. They also showed the relations between LVQ and the online k-means algorithm. Soft LVQ algorithms, e.g., fuzzy algorithms for LVQ (FALVQ), were discussed in [156].
The objective of SOFM is to represent high-dimensional
input patterns with prototype vectors that can be visualized in
a usually two-dimensional lattice structure [168], [169]. Each
unit in the lattice is called a neuron, and adjacent neurons are
connected to each other, which gives the clear topology of
how the network fits itself to the input space. Input patterns
are fully connected to all neurons via adaptable weights, and
during the training process, neighboring input patterns are
projected into the lattice, corresponding to adjacent neurons.
In this sense, some authors prefer to think of SOFM as a method for displaying latent data structure in a visual way rather than as a clustering approach [221]. Basic SOFM training
goes through the following steps.
1) Define the topology of the SOFM. Initialize the prototype vectors $\mathbf{w}_j$, $j=1,\dots,K$, randomly.
2) Present an input pattern $\mathbf{x}$ to the network. Choose the winning node $J$ that is closest to $\mathbf{x}$, i.e., $J=\arg\min_j \|\mathbf{x}-\mathbf{w}_j\|$.
3) Update the prototype vectors

$$\mathbf{w}_j(t+1)=\mathbf{w}_j(t)+h_{Jj}(t)\,\big[\mathbf{x}-\mathbf{w}_j(t)\big]$$

where $h_{Jj}(t)$ is the neighborhood function that is often defined as

$$h_{Jj}(t)=\eta(t)\exp\!\Big(-\frac{\|\mathbf{p}_J-\mathbf{p}_j\|^{2}}{2\sigma^{2}(t)}\Big)$$

where $\eta(t)$ is the monotonically decreasing learning rate, $\mathbf{p}_j$ represents the position of the corresponding neuron, and $\sigma(t)$ is the monotonically decreasing kernel width function, or

$$h_{Jj}(t)=\begin{cases}\eta(t), & \text{if node } j \text{ belongs to the neighborhood of the winning node } J\\ 0, & \text{otherwise.}\end{cases}$$

4) Repeat steps 2)–3) until no change of neuron position that is more than a small positive number is observed.
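A minimal training loop implementing steps 1)–4) with the Gaussian neighborhood function is sketched below; the lattice size, decay schedules, and initial rates are illustrative.

```python
import numpy as np

def train_sofm(X, grid=(10, 10), n_epochs=20, eta0=0.5, sigma0=3.0, seed=0):
    """Minimal SOFM: winner search plus Gaussian-neighborhood update with
    exponentially decaying learning rate and kernel width."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.standard_normal((rows * cols, X.shape[1]))        # prototype vectors
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    t, t_max = 0, n_epochs * len(X)
    for _ in range(n_epochs):
        for x in rng.permutation(X):
            eta = eta0 * np.exp(-t / t_max)                   # learning rate
            sigma = sigma0 * np.exp(-t / t_max)               # kernel width
            winner = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            grid_d2 = ((pos - pos[winner]) ** 2).sum(axis=1)
            h = eta * np.exp(-grid_d2 / (2.0 * sigma ** 2))   # neighborhood fn
            W += h[:, None] * (x - W)
            t += 1
    return W, pos
```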
While SOFM enjoy the merits of input space density approximation and independence of the order of input patterns, a number of user-dependent parameters cause problems when applied in real practice. Like the k-means algorithm, SOFM need to predefine the size of the lattice, i.e., the number of clusters, which is unknown in most circumstances. Additionally, a trained SOFM may suffer from input space density misrepresentation [132], where areas of low pattern density may be over-represented and areas of high density under-represented. Kohonen reviewed a variety of variants of SOFM in [169], which improve the drawbacks of basic SOFM and broaden its applications. SOFM can also be integrated with other clustering approaches (e.g., the k-means algorithm or HC) to provide more effective and faster clustering. [263] and [276] illustrate two such hybrid systems.
ART was developed by Carpenter and Grossberg, as a solution to the plasticity and stability dilemma [51], [53], [113]. ART can learn arbitrary input patterns in a stable, fast, and self-organizing way, thus overcoming the effect of learning instability that plagues many other competitive networks.
ART is not, as is popularly imagined, a neural network
architecture. It is a learning theory, that resonance in neural
circuits can trigger fast learning. As such, it subsumes a large
family of current and future neural networks architectures,
with many variants. ART1 is the first member, which only
deals with binary input patterns [51], although it can be
extended to arbitrary input patterns by a variety of coding
mechanisms. ART2 extends the applications to analog input
patterns [52] and ART3 introduces a new mechanism originating from elaborate biological processes to achieve more efficient parallel search in hierarchical structures [54].
By incorporating two ART modules, which receive input patterns (ART$_a$) and corresponding labels (ART$_b$), respectively, with an inter-ART module, the resulting ARTMAP system can be used for supervised classification [56]. The match tracking strategy ensures the consistency of category prediction between the two ART modules by dynamically adjusting the vigilance parameter of ART$_a$. Also see fuzzy ARTMAP in [55]. A similar idea, omitting the inter-ART module, is known as LAPART [134].
The basic ART1 architecture consists of two layers of nodes, the feature representation field $F_1$ and the category representation field $F_2$. They are connected by adaptive weights, the bottom-up weight matrix $W^{12}$ and the top-down weight matrix $W^{21}$. The prototypes of clusters are stored in layer $F_2$. After $F_2$ is activated according to the winner-takes-all competition, an expectation is reflected in layer $F_1$ and compared with the input pattern. The orienting subsystem with the specified vigilance parameter $\rho$ determines whether the expectation and the input are closely matched, and therefore controls the generation of new clusters. It is clear that the larger $\rho$ is, the more clusters are generated. Once weight adaptation occurs, both bottom-up and top-down weights are updated simultaneously. This is called resonance, from which the name comes (see Fig. 2).

Fig. 2. ART1 architecture. Two layers are included in the attentional subsystem, connected via bottom-up and top-down adaptive weights. Their interactions are controlled by the orienting subsystem through a vigilance parameter.

The ART1 algorithm can be described as follows.
1) Initialize the weight matrices $W^{12}$ and $W^{21}$ as $W^{12}_{ij}=\alpha_j$, where the $\alpha_j$ are sorted in descending order and satisfy $0<\alpha_j<1/(\beta+|\mathbf{x}|)$ for $\beta>0$ and any binary input pattern $\mathbf{x}$, and $W^{21}_{ji}=1$.
2) For a new pattern $\mathbf{x}$, calculate the input from layer $F_1$ to layer $F_2$ as

$$T_j=\begin{cases}\alpha_j\,|\mathbf{x}|, & \text{if } j \text{ is an uncommitted node first activated}\\[4pt] \dfrac{|\mathbf{x}\cap \mathbf{W}^{21}_{j}|}{\beta+|\mathbf{W}^{21}_{j}|}, & \text{if } j \text{ is a committed node}\end{cases}$$

where $\cap$ represents the logic AND operation.
3) Activate layer $F_2$ by choosing node $J$ with the winner-takes-all rule $T_J=\max_j\{T_j\}$.
4) Compare the expectation from layer $F_2$ with the input pattern. If $\rho\le |\mathbf{x}\cap \mathbf{W}^{21}_{J}|\,/\,|\mathbf{x}|$, go to step 5a); otherwise go to step 5b).
5)
a) Update the corresponding weights for the active node as $\mathbf{W}^{21}_{J}(\text{new})=\mathbf{x}\cap \mathbf{W}^{21}_{J}(\text{old})$ and $\mathbf{W}^{12}_{J}(\text{new})=\mathbf{W}^{21}_{J}(\text{new})\big/\big(\beta+|\mathbf{W}^{21}_{J}(\text{new})|\big)$;
b) Send a reset signal to disable the current active node by the orienting subsystem and return to step 3).
6) Present another input pattern; return to step 2) until all patterns are processed.
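The sketch below captures the behavior of the loop above (winner-take-all search, vigilance test, fast learning by logical AND, and creation of a new cluster on reset) in a compressed form. It stores only the top-down templates and uses one common choice-function parameterization; it is not a full implementation of the $F_1$/$F_2$ weight dynamics.

```python
import numpy as np

def art1(patterns, rho=0.7, beta=1.0):
    """Simplified ART1 sketch for binary row vectors."""
    patterns = np.asarray(patterns, dtype=int)
    templates = []            # top-down templates, one per committed cluster
    labels = []
    for x in patterns:
        # Choice function for committed nodes, searched in descending order.
        scores = [np.logical_and(x, w).sum() / (beta + w.sum()) for w in templates]
        assigned = None
        for j in np.argsort(scores)[::-1]:
            match = np.logical_and(x, templates[j]).sum() / max(x.sum(), 1)
            if match >= rho:                     # vigilance test passed: resonance
                templates[j] = np.logical_and(x, templates[j]).astype(int)
                assigned = int(j)
                break                            # otherwise: reset, try next node
        if assigned is None:                     # no committed node matches
            templates.append(x.copy())           # commit a new cluster
            assigned = len(templates) - 1
        labels.append(assigned)
    return labels, templates
```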
Note the relation between the ART network and other clustering algorithms described in traditional and statistical language. Moore used several clustering algorithms to explain the clustering behaviors of ART1 and therefore induced and proved a number of important properties of ART1, notably its equivalence to varying k-means clustering [204]. She also showed how to adapt these algorithms under the ART1 framework. In [284] and [285], the ease with which ART may be used for hierarchical clustering is also discussed.
Fuzzy ART (FA) benefits from the incorporation of fuzzy set theory and ART [57]. FA maintains similar operations to ART1 and uses the fuzzy set operators to replace the binary operators, so that it can work for all real data sets. FA exhibits many desirable characteristics such as fast and stable learning and atypical pattern detection. Huang et al. investigated and revealed more properties of FA, classified as template, access, reset, and the number of learning epochs [143]. The criticisms of FA are mostly focused on its inefficiency in dealing with noise and the deficiency of hyperrectangular representation for clusters in many circumstances [23], [24], [281]. Williamson described Gaussian ART (GA) to overcome these shortcomings [281], in which each cluster is modeled with a Gaussian distribution and represented geometrically as a hyperellipsoid. GA does not inherit the offline fast learning property of FA, as indicated by Anagnostopoulos et al. [13], who proposed different ART architectures: hypersphere ART (HA) [12] for hyperspherical clusters and ellipsoid ART (EA) [13] for hyperellipsoidal clusters, to explore a more efficient representation of clusters, while keeping important properties of FA. Baraldi and Alpaydin proposed SART following their general ART clustering networks frame, which is described through a feedforward architecture combined with a match comparison mechanism [23]. As specific examples, they illustrated symmetric fuzzy ART (SFART) and fully self-organizing SART (FOSART) networks. These networks outperform ART1 and FA according to their empirical studies [23].
In addition to these, many other neural network architectures have been developed for clustering. Most of these architectures utilize prototype vectors to represent clusters, e.g., the cluster detection and labeling network (CDL) [82], HEC [194], and SPLL [296]. HEC uses a two-layer network architecture to estimate the regularized Mahalanobis distance, which is equated to the Euclidean distance in a transformed whitened space. CDL is also a two-layer network with an inverse squared Euclidean metric. CDL requires the match between the input patterns and the prototypes to be above a threshold, which is dynamically adjusted. SPLL emphasizes initialization-independent and adaptive generation of clusters. It begins with a random prototype in the input space and iteratively
chooses and divides prototypes until no further split is available. The divisibility of a prototype is based on the consideration that each prototype represents only one natural cluster, instead of a combination of several clusters.


Simpson employed hyperbox fuzzy sets to characterize clusters [100], [249]. Each hyperbox is delineated by a min and max point, and data points build their relations with the hyperbox through the membership function. The learning process experiences a series of expansion and contraction operations, until all clusters are stable.

I. Kernel-Based Clustering
Kernel-based learning algorithms [209], [240], [274] are based on Cover's theorem. By nonlinearly transforming a set of complex and nonlinearly separable patterns into a higher-dimensional feature space, we can obtain the possibility of separating these patterns linearly [132]. The difficulty of the curse of dimensionality can be overcome by the kernel trick, arising from Mercer's theorem [132]. By designing and calculating an inner-product kernel, we can avoid the time-consuming, sometimes even infeasible, process of explicitly describing the nonlinear mapping and computing the corresponding points in the transformed space.
In [241], Schölkopf, Smola, and Müller depicted a kernel-k-means algorithm in the online mode. Suppose we have a set of patterns $\mathbf{x}_1,\dots,\mathbf{x}_N$ and a nonlinear map $\Phi:\Re^{d}\to F$. Here, $F$ represents a feature space with arbitrarily high dimensionality. The objective of the algorithm is to find $K$ centers so that we can minimize the distance between the mapped patterns and their closest center

$$\big\|\Phi(\mathbf{x})-\mathbf{m}_l\big\|^{2}=\Big\|\Phi(\mathbf{x})-\sum_{j=1}^{N}\gamma_{lj}\Phi(\mathbf{x}_j)\Big\|^{2}=k(\mathbf{x},\mathbf{x})-2\sum_{j=1}^{N}\gamma_{lj}\,k(\mathbf{x},\mathbf{x}_j)+\sum_{i=1}^{N}\sum_{j=1}^{N}\gamma_{li}\gamma_{lj}\,k(\mathbf{x}_i,\mathbf{x}_j)$$

where $\mathbf{m}_l$ is the center for the $l$th cluster and lies in a span of $\Phi(\mathbf{x}_1),\dots,\Phi(\mathbf{x}_N)$, and $k(\mathbf{x},\mathbf{x}_j)=\Phi(\mathbf{x})\cdot\Phi(\mathbf{x}_j)$ is the inner-product kernel.
Define the cluster assignment variable

$$M_{il}=\begin{cases}1, & \text{if } \mathbf{x}_i \text{ belongs to cluster } l\\ 0, & \text{otherwise.}\end{cases}$$
Then the kernel-k-means algorithm can be formulated as follows.
1) Initialize the centers $\mathbf{m}_l$, $l=1,\dots,K$, with the first $K$ observation patterns.
2) Take a new pattern $\mathbf{x}_i$ and calculate $M_{il}$ as

$$M_{il}=\begin{cases}1, & \text{if } \big\|\Phi(\mathbf{x}_i)-\mathbf{m}_l\big\|^{2}<\big\|\Phi(\mathbf{x}_i)-\mathbf{m}_h\big\|^{2}\ \text{ for all } h\neq l\\ 0, & \text{otherwise}\end{cases}$$

where the distances are evaluated with the inner-product kernel as shown previously.
3) Update the mean vector $\mathbf{m}_l$ whose corresponding $M_{il}$ is 1

$$\mathbf{m}_l^{\text{new}}=\mathbf{m}_l^{\text{old}}+\zeta\big(\Phi(\mathbf{x}_i)-\mathbf{m}_l^{\text{old}}\big)$$

where $\zeta$ is the learning rate.
4) Adapt the coefficients $\gamma_{lj}$ for each $\Phi(\mathbf{x}_j)$ as

$$\gamma_{lj}^{\text{new}}=\begin{cases}\gamma_{lj}^{\text{old}}(1-\zeta), & \text{for } j\neq i\\ \zeta, & \text{for } j=i.\end{cases}$$

5) Repeat steps 2)–4) until convergence is achieved.
Two variants of kernel-k-means were introduced in [66], motivated by SOFM and ART networks. These variants consider effects of neighborhood relations while adjusting the cluster assignment variables, and use a vigilance parameter to control the process of producing mean vectors. The authors also illustrated the application of these approaches in case-based reasoning systems.
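For illustration, the batch form of kernel k-means, which is simpler than the online update above but rests on the same kernel expansion of the distance, can be sketched as follows; it assumes the Gram matrix has been computed beforehand.

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Batch kernel k-means sketch: K is the N x N Gram matrix k(x_i, x_j).
    Distances to feature-space cluster means are computed from K only, so the
    nonlinear map never has to be evaluated explicitly."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(k, size=n)
    diag = np.diag(K)
    for _ in range(n_iter):
        d2 = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                d2[:, c] = np.inf
                continue
            # ||phi(x_i) - m_c||^2 = K_ii - 2/|C| sum_j K_ij + 1/|C|^2 sum_jl K_jl
            d2[:, c] = diag - 2.0 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Example: an RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma**2))
# can be built from the data before calling kernel_kmeans.
```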

An alternative kernel-based clustering approach is in [107]. The problem was formulated to determine an optimal partition to minimize the trace of the within-group scatter matrix in the feature space

$$\mathrm{tr}\big(S_{W}^{\Phi}\big)=\sum_{l=1}^{K}\sum_{i=1}^{N}M_{il}\,\big\|\Phi(\mathbf{x}_i)-\mathbf{m}_l\big\|^{2}=\sum_{l=1}^{K}\bigg(\sum_{i=1}^{N}M_{il}\,k(\mathbf{x}_i,\mathbf{x}_i)-\frac{1}{N_l}\sum_{i=1}^{N}\sum_{j=1}^{N}M_{il}M_{jl}\,k(\mathbf{x}_i,\mathbf{x}_j)\bigg)$$

where $N_l$ is the total number of patterns in the $l$th cluster. Note that the kernel function utilized in this case is the radial basis function (RBF), and the within-cluster kernel sum $\frac{1}{N_l^{2}}\sum_{i,j}M_{il}M_{jl}\,k(\mathbf{x}_i,\mathbf{x}_j)$ can be interpreted as a measure of the denseness of the $l$th cluster.
Ben-Hur et al. presented a new clustering algorithm, SVC, in order to find a set of contours used as the cluster boundaries in the original data space [31], [32]. These contours can be formed by mapping back the smallest enclosing sphere in the transformed feature space. RBF is chosen in this algorithm and, by adjusting the width parameter of the RBF, SVC can form either agglomerative or divisive hierarchical clusters. When some points are allowed to lie outside the hypersphere, SVC can deal with outliers effectively. An extension, called multiple spheres support vector clustering, was proposed in [62], which combines the concept of fuzzy membership.
Kernel-based clustering algorithms have many advantages.
1) It is more possible to obtain a linearly separable hyperplane in the high-dimensional, or even infinite, feature space.
2) They can form arbitrary cluster shapes other than hyperellipsoids and hyperspheres.
3) Kernel-based clustering algorithms, like SVC, have the capability of dealing with noise and outliers.
4) For SVC, there is no requirement for prior knowledge to determine the system topological structure. In [107], the kernel matrix can provide the means to estimate the number of clusters.
Meanwhile, there are also some problems requiring further consideration and investigation. Like many other algorithms, how to determine the appropriate parameters, for example, the width of the Gaussian kernel, is not trivial. The problem of computational complexity may become serious for large data sets.
The process of constructing the sum-of-squared clustering algorithm [107] and the k-means algorithm [241] presents a good example of reformulating more powerful nonlinear versions of many existing linear algorithms, provided that the scalar product can be obtained. Theoretically, it is important to investigate whether these nonlinear variants can keep some useful and essential properties of the original algorithms and how Mercer kernels contribute to the improvement of the algorithms. The effect of different types of kernel functions, which are rich in the literature, is also an interesting topic for further exploration.

J. Clustering Sequential Data

Sequential data are sequences with variable length and many other distinct characteristics, e.g., dynamic behaviors, time constraints, and large volume [120], [265]. Sequential data can be generated from DNA sequencing, speech processing, text mining, medical diagnosis, stock markets, customer transactions, web data mining, and robot sensor analysis, to name a few [78], [265]. In recent decades, sequential data have grown explosively. For example, in genetics, the recent statistics released on October 15, 2004 (Release 144.0) show that there are 43 194 602 655 bases from 38 941 263 sequences in the GenBank database [103], and release 45.0 of SWISSPROT on October 25, 2004 contains 59 631 787 amino acids in 163 235 sequence entries [267]. Cluster analysis explores potential patterns hidden in the large number of sequential data in the context of unsupervised learning and therefore provides a crucial way to meet the current challenges. Generally, strategies for sequential clustering mostly fall into three categories.
1) Sequence Similarity: The first scheme is based on the measure of the distance (or similarity) between each pair of sequences. Then, proximity-based clustering algorithms, either hierarchical or partitional, can group the sequences. Since many sequential data are expressed in an alphabetic form, like DNA or protein sequences, conventional measure methods are inappropriate. If a sequence comparison is regarded as a process of transforming a given sequence into another with a series of substitution, insertion, and deletion operations, the distance between the two sequences can be defined by virtue of the minimum number of required operations. A common analysis process is alignment, illustrated in Fig. 3. The defined distance is known as the edit distance or Levenshtein distance [120], [236]. These edit operations are weighted (punished or rewarded) according to some prior domain knowledge, and the distance herein is equivalent to the minimum cost to complete the transformation. In this sense, the similarity or distance between two sequences can be reformulated as an optimal alignment problem, which fits well in the framework of dynamic programming.

Fig. 3. Illustration of a sequence alignment. A series of edit operations is performed to change the sequence CLUSTERING into the sequence CLASSIFICATION.

Given two sequences $\mathbf{x}=(x_1,\dots,x_n)$ and $\mathbf{y}=(y_1,\dots,y_m)$, the basic dynamic programming-based sequence alignment algorithm, also known as the Needleman-Wunsch algorithm, can be depicted by the following recursive equation [78], [212]:

$$S(i,j)=\max\big\{S(i-1,j-1)+s(x_i,y_j),\ S(i-1,j)+s(x_i,-),\ S(i,j-1)+s(-,y_j)\big\}$$

where $S(i,j)$ is defined as the best alignment score between the sequence segment $x_1,\dots,x_i$ of $\mathbf{x}$ and $y_1,\dots,y_j$ of $\mathbf{y}$, and $s(x_i,y_j)$, $s(x_i,-)$, and $s(-,y_j)$ represent the cost for aligning $x_i$ to $y_j$, aligning $x_i$ to a gap (denoted as $-$), or aligning $y_j$ to a gap, respectively. The computational results for each position at $i$ and $j$ are recorded in an array with a pointer that stores the current optimal operations and provides an effective path for backtracking the alignment.
The Needleman-Wunsch algorithm considers the comparison of the whole length of two sequences and therefore performs a global optimal alignment. However, it is also important to find local similarity among sequences in many circumstances. The Smith-Waterman algorithm achieves that by allowing the beginning of a new alignment during the recursive computation, and the stop of an alignment anywhere in the dynamic programming matrix [78], [251]. This change is summarized in the following:

$$S(i,j)=\max\big\{0,\ S(i-1,j-1)+s(x_i,y_j),\ S(i-1,j)+s(x_i,-),\ S(i,j-1)+s(-,y_j)\big\}$$

For both the global and local alignment algorithms, the computational complexity is $O(nm)$, which is very expensive, especially for a clustering problem that requires an all-against-all pairwise comparison. A wealth of speed-up methods has been developed to improve the situation [78], [120]. We will see more discussion in Section III-E in the context of biological sequence analysis. Other examples include applications for speech recognition [236] and navigation pattern mining [131].
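A minimal dynamic-programming implementation of the global alignment score is sketched below; the scoring values are placeholders, and the comment notes the single change that yields the local (Smith-Waterman) variant.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming (Needleman-Wunsch).
    Adding a 0 branch to the maximization and tracking the matrix maximum
    would turn this into the local (Smith-Waterman) variant."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]

# Example: needleman_wunsch("CLUSTERING", "CLASSIFICATION") returns the optimal
# global alignment score under the illustrative +1/-1/-1 scoring scheme.
```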
2) Indirect Sequence Clustering: The second approach employs an indirect strategy, which begins with the extraction of a set of features from the sequences. All the sequences are then mapped into the transformed feature space, where classical vector space-based clustering algorithms can be used to form clusters. Obviously, feature extraction becomes the essential factor that decides the effectiveness of these algorithms. Guralnik and Karypis discussed the potential dependency between two sequential patterns and suggested both the global and the local approaches to prune the initial feature sets in order to better represent sequences in the new feature space [119]. Morzy et al. utilized the sequential patterns as the basic element in agglomerative hierarchical clustering and defined a co-occurrence measure as the standard for fusion of smaller clusters [207]. These methods greatly reduce the computational complexities and can be applied to large-scale sequence databases. However, the process of feature selection inevitably causes the loss of some information in the original sequences and needs extra attention.
3) Statistical Sequence Clustering: Typically, the first two approaches are used to deal with sequential data composed of alphabets, while the third paradigm, which aims to construct
statistical models to describe the dynamics of each group of
sequences, can be applied to numerical or categorical sequences. The most important method is hidden Markov models (HMMs) [214], [219], [253], which first gained its popularity in the application of speech recognition [229]. A discrete HMM describes an unobservable stochastic process consisting of a set of states, each of which is related to another stochastic process that emits observable symbols. Therefore, the HMM is completely specified by the following.
1) A finite set of states $S=\{s_1,\dots,s_{N_s}\}$.
2) A discrete set of observation symbols $V=\{v_1,\dots,v_{M}\}$.
3) A state transition distribution $A=\{a_{ij}\}$, where $a_{ij}=P(q_{t+1}=s_j\,|\,q_t=s_i)$, i.e., the probability of the $j$th state at time $t+1$ given the $i$th state at time $t$.
4) A symbol emission distribution $B=\{b_j(k)\}$, where $b_j(k)=P(o_t=v_k\,|\,q_t=s_j)$, i.e., the probability of emitting $v_k$ at the $j$th state at time $t$.
5) An initial state distribution $\pi=\{\pi_i\}$, where $\pi_i=P(q_1=s_i)$, i.e., the probability of the $i$th state at $t=1$.
After an initial state is selected according to the initial distribution $\pi$, a symbol is emitted with the emission distribution $B$. The next state is decided by the state transition distribution $A$, and it also generates a symbol based on $B$. The process repeats until reaching the last state. Note that the procedure generates
a sequence of symbol observations instead of states, which is are combined with EM for parameter estimation [286]. Smyth
where the name “hidden” comes from. HMMs are well [255] and Cadez et al. [50] further generalize a universal prob-
founded theoretically [229]. Dynamic programming and EM abilistic framework to model mixed data measurement, which
algorithm are developed to solve the three basic problems of includes both conventional static multivariate vectors and dy-
HMMs as the following. namic sequence data.
1) Likelihood (forward or backward algorithm). Com- The paradigm models clusters directly from original data
pute the probability of an observation sequence given without additional process that may cause information loss.
a model. They provide more intuitive ways to capture the dynamics
2) State interpretation (Vertbi algorithm). Find an op- of data and more flexible means to deal with variable length
timal state sequence by optimizing some criterion sequences. However, determining the number of model com-
function given the observation sequence and the ponents remains a complicated and uncertain process [214],
model. [253]. Also, the model selected is required to have sufficient
3) Parameter estimation (Baum–Welch algorithm). De- complexity, in order to interpret the characteristics of data.
sign suitable model parameters to maximize the prob-
ability of observation sequence under the model.
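As a minimal illustration of the likelihood problem listed in 1), the forward algorithm can be sketched as follows; the two-state model at the bottom is a toy example, not data from the text.

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """Forward algorithm: probability of an observation sequence given an HMM
    (A: state transitions, B: emission probabilities, pi: initial distribution)."""
    alpha = pi * B[:, obs[0]]             # joint prob. of first symbol and state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one step and emit symbol o
        # (rescaling alpha at each step avoids underflow on long sequences)
    return alpha.sum()

# Toy example with 2 states and 2 symbols:
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward_likelihood(A, B, pi, [0, 1, 0]))
```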
The equivalence between an HMM and a recurrent back- K. Clustering Large-Scale Data Sets
propagation network was elucidated in [148], and a universal
framework was constructed to describe both the Scalability becomes more and more important for clustering
computational and the structural properties of the HMM and algorithms with the increasing complexity of data, mainly
the neural network. man- ifesting in two aspects: enormous data volume and high
Smyth proposed an HMM-based clustering model, which, dimen- sionality. Examples, illustrated in the sequential
similar to the theories introduced in mixture densities-based clustering sec- tion, are just some of the many applications
clustering, assumes that each cluster is generated based on some that require this ca- pability. With the further advances of
probability distribution [253]. Here, HMMs are used rather than database and Internet tech- nologies, clustering algorithms will
the common Gaussian or -distribution. In addition to the form face more severe challenges in handling the rapid growth of
of finite mixture densities, the mixture model can also be de- data. We summarize the com- putational complexity of some
scribed by means of an HMM with the transition matrix typical and classical clustering algorithms in Table II with
several newly proposed approaches specifically designed to
deal with large-scale data sets. Several points can be
generalized through the table.
1) Obviously, classical hierarchical clustering algo-
where is the transition distribution for the th rithms, including single-linkage, complete linkage,
cluster. The initial distribution of the HMM is determined average linkage, centroid linkage and median
based on the prior probability for each cluster. The basic linkage, are not appropriate for large-scale data sets
learning process starts with a parameter initialization scheme due to the quadratic computational complexities in
to form a rough partition with the log-likelihood of each both execu- tion time and store space.
sequence serving as the distance measure. The partition is 2) -means algorithm has a time complexity of
further re- fined by training the overall HMM over all and space complexity of . Since is usu-
sequences with the classical EM algorithm. A Monte-Carlo ally much larger than both and , the complexity be-
cross validation method was used to estimate the possible comes near linear to the number of samples in the
number of clusters. An application with a modified HMM data sets. -means algorithm is effective in clustering
model that considers the effect of context for clustering facial large- scale data sets, and efforts have been made in
display sequences is illustrated in [138]. Oates et al. addressed order to overcome its disadvantages [142], [218].
the initial problem by pregrouping the sequences with the 3) Many novel algorithms have been developed to
agglomerative hierarchical clustering, which operates on the cluster large-scale data sets, especially in the context
proximity matrix determined by the dynamic time warping of data mining [44], [45], [85], [135], [213], [248].
(DTW) technique [214]. The area formed between one original Many of them can scale the computational
sequence and a new sequence, generated by warping the time complexity linearly to the input size and demonstrate
dimension of another original sequence, reflects the similarity the possibility of han- dling very large data sets.
of the two sequences. Li and Biswas suggested several a) Random sampling approach, e.g., CLARA clus-
objective criterion functions based on posterior probability tering large applications (CLARA) [161] and CURE
and information theory for structural selection of HMMs and [116]. The key point lies that the appropriate sample
cluster validity [182]. More recent advances on HMMs and sizes can effectively maintain the important geomet-
other related topics are reviewed in [30]. rical properties of clusters. Furthermore, Chernoff
Other model-based sequence clustering includes mixtures boundscanprovideestimationforthelowerboundof
of first-order Markov chain [255] and a linear model like the minimum sample size, given the low probability
autore- gressive moving average (ARMA) model [286]. that points in each cluster are missed in the sample
Usually, they set [116]. CLARA represents each cluster with a
dimension- ality. Some algorithms, like FC and
medoid while CURE chooses a set of well-scattered DENCLUE, have
and center-shrunk points.
b) Randomized search approach, e.g., clustering
large applications based on randomized search
(CLARANS) [213]. CLARANS sees the clustering
as a search process in a graph, in which each node
corresponds to a set of medoids. It begins with an
arbitrary node as the current node and examines a set
of neighbors, defined as the node consisting of
only one different data object, to seek a better
solution, i.e., any neighbor, with a lower cost,
becomes the current node. If the maximum number
of neighbors, specified by the user, has been
reached, the current node is accepted as a winning
node. This process iterates several times as
specified by users. Though CLARANS achieves
better performance than algo- rithms like CLARA,
the total computational time is still quadratic,
which makes CLARANS not quite effective in
very large data sets.
c) Condensation-based approach, e.g., BIRCH [295].
BIRCH generates and stores the compact sum-
maries of the original data in a CF tree, as
discussed in Section II-B. This new data structure
efficiently captures the clustering information and
largely reduces the computational burden. BIRCH
was generalized into a broader framework in [101]
with two algorithms realization, named as
BUBBLE and BUBBLE-FM.
d) Density-based approach, e.g., density based
spatial clustering of applications with noise
(DBSCAN)
[85] and density-based clustering (DENCLUE)
[135]. DBSCAN requires that the density in a
neighborhood for an object should be high enough
if it belongs to a cluster. DBSCAN creates a new
cluster from a data object by absorbing all objects in
its neighborhood. The neighborhood needs to sat-
isfy a user-specified density threshold. DBSCAN
uses a -tree structure for more efficient queries.
DENCLUE seeks clusters with local maxima of
the overall density function, which reflects the
comprehensive influence of data objects to their
neighborhoods in the corresponding data space.
e) Grid-based approach, e.g., WaveCluster [248] and
fractal clustering (FC) [26]. WaveCluster assigns
dataobjectstoasetofunitsdividedintheoriginalfea-
ture space, and employs wavelet transforms on these
units, to map objects into the frequency domain. The
key idea is that clusters can be easily distinguished in
the transformed space. FC combines the concepts of
both incremental clustering and fractal dimension.
Data objects are incrementally added to the clusters,
specified through an initial process, and represented
as cells in a grid, with the condition that the fractal
dimension of cluster needs to keep relatively stable.
4) Most algorithms listed previously lack the capability of
dealing with data with high dimensionality. Their
per- formances degenerate with the increase of
subspace, where is the resulting matrix and is the
shown some successful applications in such cases, projection matrix whose columns are the eigenvectors
but these are still far from completely effective. that correspond to the largest eigenvalues of the co-
In addition to the aforementioned approaches, several variance matrix , calculated from the whole data set (hence,
other techniquesalsoplaysignificantrolesinclusteringlarge- the column vectors of are orthonormal). PCA estimates the
scaledata sets. Parallel algorithms can more effectively use matrix while minimizing the sum of squares of the error
computational resources,
andgreatlyimproveoverallperformanceinthecontext ofboth
time andspace complexity [69], [217], [262]. Incremental
clustering techniques do not require the storage of the entire
data set, and can handle it in a one-pattern-at-a-time way. If
the pat- tern displays enough closeness to a cluster
according to some predefined criteria, it is assigned to the
cluster. Otherwise, a new cluster is created to represent the
object. A typical example is the ART family [51]–[53]
discussed in Section II-H. Most incre- mental clustering
algorithms are dependent on the order of the input patterns
[51], [204]. Bradley, Fayyad, and Reina proposed a scalable
clustering framework, considering seven relevant im-
portant characteristics in dealing with large databases [44].
Ap- plications of the framework were illustrated for the
-means al- gorithm and EM mixture models [44], [45].

L. Exploratory Data Visualization and High-


Dimensional Data Analysis Through Dimensionality
Reduction
For most of the algorithms summarized in Table II,
although they can deal with large-scale data, they are not
sufficient for analyzing high-dimensional data. The term,
“curse of dimen- sionality,” which was first used by
Bellman to indicate the ex- ponential growth of complexity
in the case of multivariate func- tion estimation under a
high dimensionality situation [28], is generally used to
describe the problems accompanying high di- mensional
spaces [34], [132]. It is theoretically proved that the
distance between the nearest points is no different from
that of other points when the dimensionality of the space is
high enough [34]. Therefore, clustering algorithms that are
based on the distance measure may no longer be effective in a
high dimen- sional space. Fortunately, in practice, many
high-dimensional data usually have an intrinsic
dimensionality that is much lower than the original
dimension [60]. Dimension reduction is impor- tant in cluster
analysis, which not only makes the high-dimen- sional data
addressable and reduces the computational cost, but
provides users with a clearer picture and visual
examination of the data of interest. However, dimensionality
reduction methods inevitably cause some loss of
information, and may damage the interpretability of the
results, even distort the real clusters.
One natural strategy for dimensionality reduction is to
extract important components from original data, which
can contribute to the division of clusters. Principle
component analysis (PCA) or Karhunen-Loéve
transformation is one of the typical approaches, which is
concerned with constructing a linear combination of a set
of vectors that can best describe the variance of data.
Given the input pattern
matrix
, the linear mapping
projects into a low-dimensional
the local linearity of the manifold and assumes that the local
of approximating the input vectors. In this sense, PCA can relations in the original data space
be realized through a three-layer neural network, called an
auto-associative multilayer perceptron, with linear activation
functions [19], [215]. In order to extract more complicated
nonlinear data structure, nonlinear PCA was developed and
one of the typical examples is kernel PCA. As methods
discussed in Section II-I, kernel PCA first maps the input
patterns into a feature space. The similar steps are then
applied to solve the eigenvalue problem with the new
covariance matrix in the fea- ture space. In another way, extra
hidden layers with nonlinear activation functions can be added
into the auto-associative network for this purpose [38], [75].
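A minimal sketch of the linear PCA projection described above is given below; kernel PCA would instead eigendecompose a centered Gram matrix, and the auto-associative network realization is not shown.

```python
import numpy as np

def pca(X, q):
    """Linear PCA sketch: project centered data onto the q eigenvectors of the
    sample covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :q]             # top-q principal directions
    return Xc @ W, W                        # projected data and loadings
```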
PCA is appropriate for Gaussian distributions since it relies on
second-order relationships in the covariance matrix, Other linear
transforms, like independent component analysis (ICA) and pro-
jection pursuit, which use higher order statistical information,
are more suited for non-Gaussian distributions [60], [151]. The
basic goal of ICA is to find the components that are most
statis- tically independent from each other [149], [154]. In the
context of blind source separation, ICA aims to separate the
independent source signals from the mixed observation signal.
This problem can be formulated in several different ways
[149], and one of the simplest form (without considering
noise) is represented as
, where is the -dimensional observable vector,
is the -dimensional source vector assumed to be statistically
independent, and is a nonsingular mixing matrix. ICA
can also be realized by virtue of multilayer perceptrons, and[158]
illustrates one of such examples. The proposed ICA network
includes whitening, separation, and basis vectors estimation
layers, with corresponding learning algorithms. The authors
also indicated its connection to the auto-associative multilayer
perceptron. Projection pursuit is another statistical technique for
seeking low-dimensional projection structures for multivariate
data [97], [144]. Generally, projection pursuit regards the
normal distribution as the least interesting projections and
optimizes some certain indices that measure the degree of
nonnormality [97]. PCA can be considered as a special
example of projection pursuit, as indicated in [60]. More
discussions on the relations among PCA, ICA, projection
pursuit, and other relevant tech- niques are offered in [149]
and [158].
Different from PCA, ICA, and projection pursuit, Multidi-
mensional scaling (MDS) is a nonlinear projection technique
[75], [292]. The basic idea of MDS lies in fitting original mul-
tivariate data into a low-dimensional structure while aiming to
maintain the proximity information. The distortion is
measured through some criterion functions, e.g., in the
sense of sum of squared error between the real distance and
the projection distance. The isometric feature mapping
(Isomap) algorithm is another nonlinear technique, based on
MDS [270]. Isomap estimates the geodesic distance between a
pair of points, which is the shortest path between the points on
a manifold, by virtue of the measured input-space distances,
e.g., the Euclidean distance usually used. This extends the
capability of MDS to explore more complex nonlinear
structures in the data. Locally linear embedding (LLE)
algorithm addresses the nonlinear dimensionality reduction
problem from a different starting point [235]. LLE emphasizes
an input parameter, and it is obvious that the quality of
( -dimensional) are also preserved in the projected low-di- resulting clusters is largely dependent on the estimation of .
mensional space ( -dimensional). This is represented A division with
through a weight matrix, describing how each point is related
to the reconstruction of another data point. Therefore, the
procedure for dimensional reduction can be constructed as
the problem that finding -dimensional vectors so that the
criterion function is minimized.
Another inter- esting nonlinear dimensionality reduction
approach, known as Laplace eigenmap algorithm, is
presented in [27].
As discussed in Section II-H, SOFM also provide good
visu- alization for high-dimensional input patterns [168].
SOFM map input patterns into a one or usually two dimensional
lattice struc- ture, consisting of nodes associated with
different clusters. An application for clustering of a large set
of documental data is illustrated in [170], in which 6 840
568 patent abstracts were
projected onto a SOFM with 1 002 240 nodes.
Subspace-based clustering addresses the challenge by ex-
ploring the relations of data objects under different combina-
tions of features. clustering in quest (CLIQUE) [3] employs a
bottom-up scheme to seek dense rectangular cells in all sub-
spaces with high density of points. Clusters are generated as
the connected components in a graph whose vertices stand
for the dense units. The resulting minimal description of the
clusters is obtained through the merge of these rectangles.
OptiGrid [136] is designed to obtain an optimal grid-
partitioning. This is achieved by constructing the best cutting
hyperplanes through a set of projections. The time
complexity for OptiGrid is in the interval of and
. ORCLUS (arbitrarily ORiented projected
CLUster generation) [2] defines a generalized pro- jected
cluster as a densely distributed subset of data objects in a
subspace, along with a subset of vectors that represent the
subspace. The dimensionality of the subspace is prespecified by
users as an input parameter, and several strategies are
proposed in guidance of its selection. The algorithm begins
with a set of randomly selected seeds with the full
dimensionality. This dimensionality and the number of
clusters are decayed according to some factors at each
iteration, until the number of clusters reaches the predefined
values. Each repetition con- sists of three basic operations,
known as assignment, vector finding, and merge. ORCLUS
has the overall time complexity of
and space complexity of.
Obviously, the scalability to large data sets relies on the
number of initial seeds . A generalized subspace
clustering model, pCluster was proposed in [279]. These
pClusters are formed by a depth-first clustering algorithm.
Several other interesting applications, including a Clindex
(CLustering for INDEXing) scheme and wavelet transform,
are shown in [184] and [211], respectively.

M. How Many Clusters?


The clustering process partitions data into an appropriate
number of subsets. Although for some applications, users can
determine the number of clusters, , in terms of their
expertise, under more circumstances, the value of is
unknown and needs to be estimated exclusively from the data
themselves. Many clustering algorithms ask to be provided as
posterior probabili- ties calculated.
too many clusters complicates the result, therefore, makes it
hard to interpret and analyze, while a division with too few
clusters causes the loss of information and misleads the final
decision. Dubes called the problem of determining the number
of clusters “the fundamental problem of cluster validity” [74].
A large number of attempts have been made to estimate the
appropriate and some of representative examples are illus-
trated in the following.
1) Visualization of the data set. For the data points that
can be effectively projected onto a two-dimensional
Euclidean space, which are commonly depicted with
a histogram or scatterplot, direct observations can
pro- vide good insight on the value of . However, the
com- plexity of most real data sets restricts the
effectiveness of the strategy only to a small scope of
applications.
2) Construction of certain indices (or stopping rules).
These indices usually emphasize the compactness of
intra-cluster and isolation of inter-cluster and
consider the comprehensive effects of several factors,
including the defined squared error, the geometric or
statistical properties of the data, the number of
patterns, the dis- similarity (or similarity), and the
number of clusters. Milligan and Cooper compared
and ranked 30 indices according to their performance
over a series of artifi- cial data sets [202]. Among
these indices, the Caliñski and Harabasz index [74]
achieve the best performance and can be represented
as

CH

where is the total number of patterns and


and are the trace of the between and within
class scatter matrix, respectively. The that maxi-
mizes the value of CH is selected as the optimal.
It is worth noting that these indices may be data de-
pendent. The good performance of an index for cer-
tain data does not guarantee the same behavior with
different data. As pointed out by Everitt, Landau, and
Leese, “it is advisable not to depend on a single rule
for selecting the number of groups, but to synthesize
the results of several techniques” [88].
3) Optimization of some criterion functions under prob-
abilistic mixture-model framework. In a statistical
framework, finding the correct number of clusters
(components) , is equivalent to fitting a model with
observed data and optimizing some criterion [197].
Usually, the EM algorithm is used to estimate the
model parameters for a given , which goes through
a predefined range of values. The value of that
maximizes (or minimizes) the defined criterion is
regarded as optimal. Smyth presented a Monte-Carlo
cross-validation method, which randomly divides
data into training and test sets times according to a
cer- tain fraction ( works well from the
empirical results) [252]. The is selected either
directly based on the criterion function or some
and plastic neural gas, can be accessed in [223] and [232], re-
A large number of criteria, which combine spectively. Obviously, the problem of determining the number
concepts from information theory, have been
proposed in the literature. Typical examples
include,
• Akaike’s information criterion (AIC) [4], [282]

AIC

where is the total number of patterns, is the


number of parameters for each cluster, is the total
number of parameters estimated, and is the max-
imum log-likelihood. is selected with the minimum
value of AIC .
• Bayesian inference criterion (BIC) [226],

[242] BIC

is selected with the maximum value of BIC .


More criteria, such as minimum description
length (MDL) [114], [233], minimum message length
(MML) [114], [216], cross validation-based
information crite- rion (CVIC) [254] and covariance
inflation criterion (CIC) [272], with their
characteristics, are summarized in [197]. Like the
previous discussion for validation index, there is no
criterion that is superior to others in general case.
The selection of different criteria is still dependent
on the data at hand.
4) Other heuristic approaches based on a variety of
tech- niques and theories. Girolami performed
eigenvalue decomposition on the kernel matrix in
the high-dimen- sional feature space and used the
dominant compo- nents in the decomposition
summation as an indication of the possible existence
of clusters [107]. Kothari and Pitts described a
scale-based method, in which the distance from a
cluster centroid to other clusters in its
neighborhood is considered (added as a regulariza-
tion term in the original squared error criterion,
Sec- tion II-C) [160]. The neighborhood of clusters
work as a scale parameter and the that is persistent
in the largest interval of the neighborhood
parameter is re- garded as the optimal.
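Returning to the index-based strategy in item 2), the Caliński-Harabasz index is commonly computed as the ratio of the between-cluster to within-cluster scatter traces, each normalized by its degrees of freedom; a sketch follows, to be evaluated over a range of candidate numbers of clusters and maximized.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH(K) = [tr(S_B)/(K-1)] / [tr(S_W)/(N-K)] for a hard partition."""
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    N, ks = len(X), np.unique(labels)
    K = len(ks)
    overall = X.mean(axis=0)
    tr_b = sum(len(X[labels == c]) *
               ((X[labels == c].mean(axis=0) - overall) ** 2).sum() for c in ks)
    tr_w = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum() for c in ks)
    return (tr_b / (K - 1)) / (tr_w / (N - K))
```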
Besides the previous methods, constructive clustering
algo- rithms can adaptively and dynamically adjust the
number of clusters rather than use a prespecified and fixed
number. ART networks generate a new cluster, only when
the match between the input pattern and the expectation is
below some prespecified confidence value [51]. A
functionally similar mechanism is used in the CDL network
[82]. The robust competitive clus- tering algorithm (RCA)
describes a competitive agglomeration process that
progresses in stages, and clusters that lose in the competition
are discarded, and absorbed into other clusters [98]. This
process is generalized in [42], which attains the number of
clusters by balancing the effect between the complexity
and the fidelity. Another learning scheme, SPLL iteratively
divides cluster prototypes from a single prototype until no
more prototypes satisfy the split criterion [296]. Several
other constructive clustering algorithms, including the FACS
categorical or mixture data, greatly improve the situation [117],
of clusters is converted into a parameter selection problem, [183]. The algorithm ROCK
and the resulting number of clusters is largely dependent on
parameter tweaking.

III. APPLICATIONS
We illustrate applications of clustering techniques in three as-
pects. The first is for two classical benchmark data sets that
are widely used in pattern recognition and machine learning.
Then, we show an application of clustering for the traveling
salesman problem. The last topic is on bioinformatics. We deal
with clas- sical benchmarks in Sections III-A and III-B and the
traveling salesman problem in Section III-C. A more extensive
discussion of bioinformatics is in Sections III-D and III-E.

A. Benchmark Data Sets—IRIS


The iris data set [92] is one of the most popular data sets for examining the performance of novel methods in pattern recognition and machine learning. It can be downloaded from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. There are three categories in the data set (i.e., iris setosa, iris versicolor, and iris virginica), each having 50 patterns with four features [i.e., sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW)]. Iris setosa can be linearly separated from iris versicolor and iris virginica, while iris versicolor and iris virginica are not linearly separable (see Fig. 4(a), in which only three features are used). Fig. 4(b) depicts the clustering result with a standard k-means algorithm. It is clear that k-means can correctly differentiate iris setosa from the other two iris plants, but for iris versicolor and iris virginica there exist 16 misclassifications. This result is similar to those (around 15 errors) obtained with other classical clustering algorithms [221]. Table III summarizes some of the clustering results reported in the literature. From the table, we can see that many newly developed approaches can greatly improve the clustering performance on the iris data set (around 5 misclassifications); some can even achieve 100% accuracy. Therefore, the data can be well classified with appropriate methods.
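For reference, the following is a minimal sketch of the k-means experiment described above, assuming scikit-learn is available: cluster the iris data into k = 3 groups and count how many samples fall outside the majority class of their cluster.

```python
# k-means on the iris data: count samples outside the majority class of each cluster.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

errors = 0
for c in range(3):
    members = y[km.labels_ == c]           # true labels of samples in cluster c
    errors += len(members) - np.bincount(members).max()
print("misclassifications:", errors)
```

With a standard run, the count typically lands near the 15-16 errors reported above, although the exact value depends on initialization.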

Fig. 4. (a) Iris data set. There are three iris categories, each having 50 samples with four features; here only three features are used: PL, PW, and SL. (b) k-means clustering result, with 16 classification errors observed.

TABLE III
SOME CLUSTERING RESULTS FOR THE IRIS DATA SET

B. Benchmark Data Sets—MUSHROOM

Unlike the iris data set, all of the features of the mushroom data set, which is also accessible at the UCI Machine Learning Repository, are nominal rather than numerical. These 23 species of gilled mushrooms are categorized as either edible or poisonous. The total number of instances is 8 124, with 4 208 being edible and 3 916 poisonous. The 22 features are summarized in Table IV with their corresponding possible values. Table V illustrates some experimental results from the literature. As indicated in [117] and [277], traditional clustering strategies, like k-means and hierarchical clustering, work poorly on this data set. The accuracy of k-means is just around 69% [277], and the clusters formed by classical HC are mixed with nearly equal proportions of both edible and poisonous objects [117]. The results reported for newly developed algorithms, which are specifically designed for tackling categorical or mixture data, greatly improve the situation [117], [183]. The algorithm ROCK divides the objects into 21 clusters, most of which (except one) consist of only one category, which increases the accuracy to almost 99%. The algorithm SBAC works on a subset of 200 randomly selected objects, 100 for each category, and the general results show the correct partition into 3 clusters (two for edible mushrooms, one for poisonous ones). In both studies, the constitution of each feature in the generated clusters is also illustrated, and it is observed that some features, like cap-shape and ring-type, present themselves identically for both categories and thus suggest the poor performance of traditional approaches. Meanwhile, the feature odor shows good discrimination between the different types of mushrooms: usually the value almond, anise, or none indicates edibility, while the value pungent, foul, or fishy indicates a high possibility of poisonous content.

TABLE IV
FEATURES FOR THE MUSHROOM DATA SET

TABLE V
SOME CLUSTERING RESULTS FOR THE MUSHROOM DATA SET
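Neither ROCK nor SBAC is reproduced here. As a simpler, hedged illustration of clustering purely nominal records such as the mushroom features, the sketch below follows the k-modes idea [142] (simple-matching dissimilarity with cluster modes in place of means); the toy records and parameters are invented for the example.

```python
# A k-modes-style sketch for nominal data: simple-matching (Hamming)
# dissimilarity, and the per-attribute most frequent value as the cluster "mode".
import numpy as np

def kmodes(records, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(records, dtype=object)
    modes = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # assign each record to the mode it mismatches least
        dist = np.array([[np.sum(x != m) for m in modes] for x in X])
        labels = dist.argmin(axis=1)
        # recompute each mode attribute-wise (most frequent value per column)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                modes[c] = [max(set(col), key=list(col).count) for col in members.T]
    dist = np.array([[np.sum(x != m) for m in modes] for x in X])
    return dist.argmin(axis=1), modes

# toy usage with three nominal attributes per record
data = [("x", "s", "n"), ("x", "s", "y"), ("b", "f", "w"), ("b", "f", "w")]
labels, modes = kmodes(data, k=2)
```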

C. Traveling Salesman Problem

The traveling salesman problem (TSP) is one of the most studied examples in an important class of problems known as NP-complete problems. Given a complete undirected graph G = (V, E), where V is a set of vertices and E is a set of edges, each relating two vertices with an associated nonnegative integer cost, the most general form of the TSP is equivalent to finding any Hamiltonian cycle, which is a tour over G that begins and ends at the same vertex and visits the other vertices exactly once. The more common form of the problem is the optimization problem of trying to find the shortest Hamiltonian cycle, and in particular, the most common is the Euclidean version, where the vertices and edges all lie in the plane. Mulder and Wunsch applied a divide-and-conquer clustering technique, with ART networks, to scale the problem to a million cities [208]. The divide-and-conquer paradigm gives the flexibility to hierarchically break large problems into arbitrarily small clusters, depending on what tradeoff between accuracy and speed is desired. In addition, the subproblems provide an excellent opportunity to take advantage of parallel systems for further optimization. As the first stage of the algorithm, the ART network is used to sort the cities into clusters. The vigilance parameter is used to set a maximum distance from the current pattern; a vigilance value between 0 and 1 is used as a percentage of the global space to determine the vigilance distance. Values were chosen based on the desired number and size of individual clusters. The clusters were then each passed to a version of the Lin-Kernighan (LK) algorithm [187]. The last step combines the subtours back into one complete tour. Tours of good quality for problems of up to 1 000 000 cities were obtained within 25 minutes on a 2 GHz AMD Athlon MP processor with 512 M of DDR RAM. Fig. 5 shows the resulting tours for 1 000, 10 000, and 1 000 000 cities, respectively.

It is worthwhile to emphasize the relation between the TSP and very large-scale integrated (VLSI) circuit clustering, which partitions a sophisticated system into smaller and simpler subcircuits to facilitate the circuit design. The objective of the partitioning is to minimize the number of connections among the components. One strategy for solving the problem is based on geometric representations, either linear or multidimensional [8]. Alpert and Kahng considered a solution to the problem as the "inverse" of the divide-and-conquer TSP method and used a linear tour of the modules to form the subcircuit partitions [7]. They adopted the spacefilling curve heuristic for the TSP to construct the tour so that connected modules remain close in the generated tour. A dynamic programming method was used to generate the resulting partitions. A more detailed discussion of VLSI circuit clustering can be found in the survey by Alpert and Kahng [7].

Fig. 5. Clustering divide-and-conquer TSP resulting tours for (a) 1 k, (b) 10 k, (c) 1 M cities. The clustered LK algorithm achieves a significant speedup and shows good scalability.
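A rough sketch of the divide-and-conquer idea is given below, with two deliberate substitutions: k-means stands in for the ART clustering stage and a greedy nearest-neighbor pass stands in for the Lin-Kernighan subtour solver, so the code illustrates the decomposition strategy rather than the reported system.

```python
# Divide-and-conquer TSP sketch: cluster the cities, tour each cluster,
# then stitch the subtours following a tour over the cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

def nearest_neighbor_tour(points):
    # greedy tour over a small set of points, starting from the first one
    order, remaining = [0], set(range(1, len(points)))
    while remaining:
        last = points[order[-1]]
        nxt = min(remaining, key=lambda i: np.linalg.norm(points[i] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def divide_and_conquer_tsp(cities, n_clusters=10, seed=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(cities)
    # order the clusters themselves by a tour over their centroids
    cluster_order = nearest_neighbor_tour(km.cluster_centers_)
    tour = []
    for c in cluster_order:
        idx = np.where(km.labels_ == c)[0]
        sub = nearest_neighbor_tour(cities[idx])
        tour.extend(idx[sub])            # stitch subtours in cluster order
    return tour

cities = np.random.default_rng(0).random((1000, 2))
tour = divide_and_conquer_tsp(cities, n_clusters=20)
```

Because each subtour is computed independently, the per-cluster step is also the natural place to exploit the parallelism mentioned above.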

D. Bioinformatics—Gene Expression Data


Recently, great advances have been achieved in genome sequencing projects and DNA microarray technologies. The first draft of the human genome sequence was completed in 2001, several years earlier than expected [65], [275]. Genomic sequence data for other organisms (e.g., Drosophila melanogaster and Escherichia coli) are also abundant. DNA microarray technologies provide an effective and efficient way to measure the expression levels of thousands of genes simultaneously under different conditions and tissues, which makes it possible to investigate gene activities from the perspective of the whole genome [79], [188]. With sequences and gene expression data in hand, investigating the functions of genes and identifying their roles in the genetic process become increasingly important. Analyses based on traditional laboratory techniques are time-consuming and expensive, and fall far behind the explosively increasing generation of new data. Among the large number of computational methods used to accelerate the exploration of life science, clustering can reveal the hidden structures of biological data, and is particularly useful for helping biologists investigate and understand the activities of uncharacterized genes and proteins and, further, the systematic architecture of the whole genetic network. We demonstrate the applications of clustering algorithms in bioinformatics from two aspects. The first part is based on the analysis of gene expression data generated from DNA microarray technologies. The second part describes clustering processes that work directly on linear DNA or protein sequences. The assumption is that functionally similar genes or proteins usually share similar patterns or primary sequence structures.

DNA microarray technologies generate many gene expression profiles. Currently, there are two major microarray technologies, based on the nature of the attached DNA: cDNA, with lengths varying from several hundred to a thousand bases, or oligonucleotides containing 20–30 bases. For cDNA technologies, a DNA microarray consists of a solid substrate to which a large number of cDNA clones are attached in a fixed order [79]. Fluorescently labeled cDNA, obtained from RNA samples of interest through the process of reverse transcription, is hybridized with the array, and a reference sample with a different fluorescent label is also needed for comparison. Image analysis techniques are then used to measure the fluorescence of each dye, and the ratio reflects the relative levels of gene expression. For a high-density oligonucleotide microarray, oligonucleotides are fixed on a chip through photolithography or solid-phase DNA synthesis [188]; in this case, absolute gene expression levels are obtained. After normalization of the fluorescence intensities, the gene expression profiles are represented as a matrix in which each entry is the expression level of the ith gene under the jth condition, tissue, or experimental stage. Gene expression data analysis consists of a three-level framework based on complexity, ranging from the investigation of single gene activities to the inference of the entire genetic network [20]. The intermediate level explores the relations and interactions between genes under different conditions, and currently attracts the most attention. Generally, cluster analysis of gene expression data is composed of two aspects: clustering genes [80], [206], [260], [268], [283], [288] or clustering tissues or experiments [5], [109], [238].
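As a small illustration of the preprocessing just described for two-channel cDNA data, the sketch below forms per-spot log ratios and applies a crude median-centering normalization; the intensity arrays are simulated placeholders rather than real scanner output.

```python
# Two-channel cDNA preprocessing sketch: log ratio per spot, median-centered per array.
import numpy as np

rng = np.random.default_rng(0)
cy5 = rng.gamma(shape=2.0, scale=500.0, size=(2000, 8))   # test-sample intensities
cy3 = rng.gamma(shape=2.0, scale=500.0, size=(2000, 8))   # reference intensities

expr = np.log2(cy5 / cy3)                  # rows: genes, columns: conditions
expr -= np.median(expr, axis=0)            # per-array median centering
```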
Results of gene clustering may suggest that genes in the same group have similar functions, or that they share the same transcriptional regulation mechanism. Cluster analysis for grouping functionally similar genes gradually became popular after the successful application of the average-linkage hierarchical clustering algorithm to the expression data of budding yeast Saccharomyces cerevisiae and the reaction of human fibroblasts to serum by Eisen et al. [80]. They used the Pearson correlation coefficient to measure the similarity between two genes, and provided a very informative visualization of the clustering results. Their results demonstrate that functionally similar genes tend to reside in the same clusters formed by their expression patterns, even under a relatively small set of conditions. Herwig et al. developed a variant of the k-means algorithm to cluster a set of 2 029 human cDNA clones and adopted mutual information as the similarity measure [230]. Tamayo et al. [268] made use of SOFM to cluster gene expression data, and its application to hematopoietic differentiation provided new insight for further research. Graph-theory-based clustering algorithms, like CAST [29] and CLICK [247], showed very promising performance in tackling different types of gene expression data. Since many genes usually display more than one function, fuzzy clustering may be more effective in exposing these relations [73]. Gene expression data are also important for elucidating the genetic regulation mechanism in a cell. By examining the corresponding DNA sequences in the control regions of a cluster of co-expressed genes, we may identify potential short and consensus sequence patterns, known as motifs, and further investigate their interaction with transcriptional binding factors, leading to different gene activities. Spellman et al. clustered 800 genes according to their expression during the yeast cell cycle [260]. Analyses of 8 major gene clusters unravel the connection between co-expression and co-regulation. Tavazoie et al. partitioned 3 000 genes into 30 clusters with the k-means algorithm [269]. For each cluster, the 600 base pairs of upstream sequence of the genes were searched for potential motifs. 18 motifs were found from 12 clusters in their experiments, and 7 of them could be verified according to previous empirical results in the literature. A more comprehensive investigation can be found in [206].
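A minimal sketch of the Eisen-style analysis mentioned above, assuming SciPy is available: the genes (rows of a genes-by-conditions matrix) are clustered by average-linkage hierarchical clustering under a Pearson-correlation-based distance. The expression matrix here is a random placeholder.

```python
# Average-linkage hierarchical clustering of genes with correlation distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

expr = np.random.default_rng(1).normal(size=(200, 12))    # placeholder expression matrix
dist = pdist(expr, metric="correlation")                  # 1 - Pearson correlation
tree = linkage(dist, method="average")                    # average-linkage dendrogram
labels = fcluster(tree, t=10, criterion="maxclust")       # cut into 10 gene clusters
```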
As for the other application, clustering tissues or experiments is valuable for identifying samples that are in different disease states, for discovering or predicting different cancer types, and for evaluating the effects of novel drugs and therapies [5], [109], [238]. Golub et al. described the restriction of traditional cancer classification methods, which are mostly dependent on the morphological appearance of tumors, and divided cancer classification into two challenges: class discovery and class prediction. They utilized SOFM to discriminate two types of human acute leukemia: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) [109]. According to their results, two subsets of ALL, with different origins of lineage, can be well separated. Alon et al. performed a two-way clustering for both tissues and genes and revealed the potential relations, represented as visualized patterns, among them [6]. Alizadeh et al. demonstrated the effectiveness of molecular classification of cancers by their gene expression profiles and successfully distinguished two molecularly distinct subtypes of diffuse large B-cell lymphoma, which cause a high percentage of failures in clinical treatment [5]. Furthermore, Scherf et al. constructed a gene expression database to study the relationship between genes and drugs for 60 human cancer cell lines, which provides an important criterion for therapy selection and drug discovery [238]. Other applications of clustering algorithms for tissue classification include mixtures of multivariate Gaussian distributions [105], ellipsoidal ART [287], and graph-theory-based methods [29], [247]. In most of these applications, important genes that are tightly related to the tumor types are identified according to their expression differentiation under the different cancerous categories, which is, to a large extent, in accord with our prior recognition of the roles of these genes [5], [109]. For example, Alon et al. found that 5 of 20 statistically significant genes were muscle genes, and the corresponding muscle indices provided an explanation for false classifications [6].
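The two-way clustering attributed to Alon et al. above can be approximated, for illustration, by clustering the rows (genes) and the columns (tissues) of the same matrix separately and reordering both axes for display; SciPy is assumed and the matrix is a placeholder.

```python
# Two-way clustering sketch: cluster genes and tissues independently, then
# reorder both axes of the expression matrix for a heat-map style display.
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

expr = np.random.default_rng(2).normal(size=(500, 40))      # genes x tissues
gene_order = leaves_list(linkage(expr, method="average", metric="correlation"))
tissue_order = leaves_list(linkage(expr.T, method="average", metric="correlation"))
reordered = expr[np.ix_(gene_order, tissue_order)]           # matrix ready for display
```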
Fig. 7 illustrates an application of hierarchical clustering and SOFM to gene expression data. This data set comes from diagnostic research on the small round blue-cell tumors (SRBCTs) of childhood and consists of 83 samples from four categories, known as Burkitt lymphoma (BL), the Ewing family of tumors (EWS), neuroblastoma (NB), and rhabdomyosarcoma (RMS), plus 5 non-SRBCT samples [164]. Gene expression levels of 6 567 genes were measured using cDNA microarrays for each sample, 2 308 of which passed the filter and were kept for further analyses. These genes were further ranked according to scores calculated by some criterion functions [109]. Generally, such criterion functions attempt to seek a subset of genes that contribute most to the discrimination of the different cancer types. This can be regarded as a feature selection process. However, problems like how many genes are really required, and whether the selected genes are really biologically meaningful, are still not answered satisfactorily. Hierarchical clustering was performed with the program CLUSTER and the results were visualized with the program TreeView, developed by Eisen at Stanford University. Fig. 7(a) and (b) depicts the clustering results for both the top 100 genes, selected by their Fisher scores, and the samples. Graphic visualization is achieved by associating each data point with a certain color according to the corresponding scale, and some clustering patterns are clearly displayed in the image. Fig. 7(c) depicts a 5-by-5 SOFM topology for all the genes, with each cluster represented by the centroid (mean) for each feature (sample). 25 clusters are generated, and the number of genes in each cluster is also indicated. The software package GeneCluster, developed by the Whitehead Institute/MIT Center for Genome Research (WICGR), was used in this analysis.
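The GeneCluster package itself is not reproduced here; the following is a minimal plain-NumPy sketch of a small self-organizing map of the kind used above: a 5 x 5 grid of prototypes is trained on the gene rows, and each gene is then assigned to its best-matching unit. The grid size, learning schedule, and placeholder data are illustrative choices.

```python
# Minimal SOFM sketch: online training of a 5 x 5 prototype grid on gene rows.
import numpy as np

def train_som(X, rows=5, cols=5, iters=5000, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    W = rng.normal(size=(rows * cols, X.shape[1]))          # prototype vectors
    for t in range(iters):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))      # best-matching unit
        lr = lr0 * (1 - t / iters)
        sigma = sigma0 * (1 - t / iters) + 0.5
        h = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)                      # pull neighbors toward x
    return W

expr = np.random.default_rng(3).normal(size=(2308, 83))     # placeholder: genes x samples
W = train_som(expr)
assignments = np.argmin(
    np.linalg.norm(expr[:, None, :] - W[None, :, :], axis=2), axis=1)
```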
Although clustering techniques have already achieved many impressive results in the analysis of gene expression data, there are still many problems that remain open. Gene expression data sets are usually characterized by
1) small sample sets with high-dimensional features;
2) high redundancy;
3) inherent noise;
4) sparsity of the data.
Most of the published data sets include fewer than 20 samples for each tumor type, but with as many as thousands of gene measurements [80], [109], [238], [268]. This is partly caused by the lag in experimental conditions (e.g., sample collection), in contrast to the rapid advancement of microarray and sequencing technologies. In order to evaluate existing algorithms more reasonably and to develop more effective new approaches, more data with enough samples, or more conditional observations, are needed. But judging from the trend of gene chip technologies, which also follow Moore's law for semiconductor chips [205], the current status will persist for a long time. This problem is even more serious in the application of gene expression data to cancer research, in which clustering algorithms are required to effectively find potential patterns in the presence of the large number of irrelevant factors introduced by including too many genes. At the same time, feature selection, also called informative gene selection in this context, plays a very important role. Without any doubt, clustering algorithms should be feasible in both time and space complexity. Due to the nature of the manufacturing process of the microarray chip, noise is inevitably introduced into the expression data at different stages. Accordingly, clustering algorithms should have noise and outlier detection mechanisms in order to remove these effects. Furthermore, different algorithms usually form different clusters for the same data set, which is a general problem in cluster analysis. How to evaluate the quality of the generated clusters of genes, and how to choose appropriate algorithms for a specified application, are particularly crucial for gene expression data research, because sometimes even biologists cannot identify the real patterns from the artifacts of the clustering algorithms, due to the limitations of biological knowledge. Some recent results can be accessed in [29], [247], and [291].

Fig. 6. Basic procedure of cDNA microarray technology [68]. Fluorescently labeled cDNAs, obtained from target and reference samples through reverse transcription, are hybridized with the microarray, which is comprised of a large number of cDNA clones. Image analysis measures the ratio of the two dyes. Computational methods, e.g., hierarchical clustering, further disclose the relations among genes and corresponding conditions.

Fig. 7. Hierarchical and SOFM clustering of the SRBCT gene expression data set. (a) Hierarchical clustering result for the 100 selected genes under 83 tissue samples. The gene expression matrix is visualized through a color scale. (b) Hierarchical clustering result for the 83 tissue samples. Here, the dimension is 100, as 100 genes are selected as in (a). (c) SOFM clustering result for the 2 308 genes. A 5 x 5 SOFM is used and 25 clusters are formed. Each cluster is represented by the average values.

E. Bioinformatics—DNA or Protein Sequences Clustering


DNA (deoxyribonucleic acid) is the hereditary material existing in all living cells. A DNA molecule is a double helix consisting of two strands, each of which is a linear sequence composed of four different nucleotides—adenine, guanine, thymine, and cytosine, abbreviated as the letters A, G, T, and C, respectively. Each letter in a DNA sequence is also called a base. Proteins determine most of a cell's structures, functions, properties, and regulatory mechanisms. The primary structure of a protein is also a linear, alphabetic chain, with the difference that each unit represents an amino acid, of which there are twenty types in total. Proteins are encoded by certain segments of DNA sequences through a two-stage process (transcription and translation). These segments are known as genes or coding regions. Investigation of the relations between DNA and proteins, as well as of their own functions and properties, is one of the important research directions in both genetics and bioinformatics.
The similarity between newly sequenced genes or proteins and annotated genes or proteins usually offers a cue to identify their functions. Searching the corresponding databases for a new DNA or protein sequence has already become routine in genetic research. In contrast to sequence comparison and search, cluster analysis provides a more effective means to discover complicated relations among DNA and protein sequences. We summarize the following clustering applications for DNA and protein sequences:
1) function recognition of uncharacterized genes or proteins [119];
2) structure identification of large-scale DNA or protein databases [237], [257];
3) redundancy reduction of large-scale DNA or protein databases [185];
4) domain identification [83], [115];
5) expressed sequence tag (EST) clustering [49], [200].
As described in Section II-J, classical dynamic programming algorithms for global and local sequence alignment are too intensive in computational complexity. This becomes worse because of the large volume of nucleic acids and amino acids in current DNA and protein databases; e.g., bacterial genomes are from 0.5 to 10 Mbp, fungal genomes range from 10 to 50 Mbp, while the human genome is around 3 310 Mbp [18] (Mbp means million base pairs). Thus, conventional dynamic programming algorithms are computationally infeasible. In practice, sequence comparison or proximity measurement is achieved via some heuristics. Well-known examples include BLAST and FASTA, with their many variants [10], [11], [224]. The key idea of these methods is to identify, at an early stage, regions that may have potentially high matches with a list of prespecified high-scoring words, so that further search only needs to focus on these regions with expensive but accurate algorithms. Recognizing the benefit that separating word matching from sequence alignment brings to the reduction of the computational burden, Miller, Gurd, and Brass described three algorithms focusing on specific problems [199]. The implementation of the scheme for large database vs. database comparison exhibits an apparent improvement in computation time. Kent and Zahler designed a three-pass algorithm, called the wobble aware bulk aligner (WABA) [162], for aligning large-scale genomic sequences of different species, which employs a seven-state pairwise hidden Markov model [78] for more effective alignments. In [201], Miller summarized the current research status of genomic sequence comparison and suggested valuable directions for further research efforts.
Many clustering techniques have been applied to organize DNA or protein sequence data. Some operate directly on a proximity measure; some are based on feature extraction, while others are constructed on statistical models. Somervuo and Kohonen illustrated an application of SOFM to cluster the protein sequences in the SWISSPROT database [257]. FASTA was used to calculate the sequence similarity, and the resulting two-dimensional SOFM provides a visualized representation of the relations within the entire sequence database. Based on the similarity measure of gapped BLAST, Sasson et al. utilized an agglomerative hierarchical clustering paradigm to cluster all protein sequences in SWISSPROT [237]. The effects of four merging rules, differing in their interpretation of the cluster centers, on the resulting protein clusters were examined. The advantages, as well as the potential risk, of the concept of transitivity were also elucidated in the paper. According to the transitivity relation, two sequences that do not show high sequence similarity by virtue of direct comparison may be homologous (having a common ancestor) if there exists an intermediate sequence similar to both of them. This makes it possible to detect remote homologues that cannot be observed by direct similarity comparison. However, unrelated sequences may also be clustered together due to the effects of such intermediate sequences [237]. Bolten et al. addressed this problem with the construction of a directed graph, in which each protein sequence corresponds to a vertex and edges are weighted based on the alignment score between two sequences and the self-alignment score of each sequence [41]. Clusters were formed through the search for strongly connected components (SCCs), each of which is a maximal subset of vertices such that, for each pair of vertices u and v in the subset, there exist directed paths both from u to v and from v to u. A minimum normalized cut algorithm for detecting protein families and a minimum spanning tree (MST) application for seeking domain information were presented in [1] and [115], respectively. In contrast with the aforementioned proximity-based methods, Guralnik and Karypis transformed protein or DNA sequences into a new feature space, based on detected subpatterns working as the sequence features, and clustered them with the k-means algorithm [119]. The method is free of expensive all-against-all sequence comparison and is suitable for analyzing large-scale databases.

Krogh demonstrated the power of hidden Markov models (HMMs) in modeling biological sequences and clustering protein families [177]. Fig. 8 depicts a typical HMM structure, in which match states (abbreviated with the letter M), insert states (I), and delete states (D) are represented as rectangles, diamonds, and circles, respectively [78], [177]. These states correspond to substitution, insertion, and deletion in edit operations. For convenience, a begin state and an end state are added to the model, denoted by the letters B and E. Letters, either from the 4-letter nucleotide alphabet or from the 20-letter amino acid alphabet, are generated from match and insert states according to some emission probability distributions. Delete states do not produce any symbols and are used to skip the match states. Multiple HMMs are required in order to describe clusters, or families (subfamilies); they are regarded as a mixture model and trained with an EM learning algorithm similar to that for the single-HMM case. An example of clustering subfamilies of 628 globins shows encouraging results. Further discussion can be found in [78] and [145].

Fig. 8. HMM architecture [177]. There are three different states, match (M), insert (I), and delete (D), corresponding to substitution, insertion, and deletion operations, respectively. A begin (B) and an end (E) state are also introduced to represent the start and end of the process. The process goes through a series of states according to the transition probabilities and emits letters from either the 4-letter nucleotide or the 20-letter amino acid alphabet based on the emission probabilities.
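Returning to the feature-based route of Guralnik and Karypis discussed above, the sketch below is a simplified stand-in: each sequence is represented by normalized counts of short k-mers (in place of the detected subpatterns) and the count vectors are clustered with k-means, so no all-against-all alignment is needed. The toy sequences and the choice of k are illustrative.

```python
# Feature-space sequence clustering sketch: k-mer count vectors + k-means.
import numpy as np
from itertools import product
from sklearn.cluster import KMeans

def kmer_counts(seq, k=3, alphabet="ACGT"):
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    v = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1
    return v / max(1, len(seq) - k + 1)        # normalize by sequence length

seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTTTGGGGCCCC", "TTTTGGGTCCCC"]
X = np.vstack([kmer_counts(s) for s in seqs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```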

IV. CONCLUSION

As an important tool for data exploration, cluster analysis examines unlabeled data, by either constructing a hierarchical structure or forming a set of groups according to a prespecified number. This process includes a series of steps, ranging from preprocessing and algorithm development to solution validity and evaluation. Each step is tightly related to the others and poses great challenges to the scientific disciplines. Here, we place the focus on the clustering algorithms and review a wide variety of approaches appearing in the literature. These algorithms evolved from different research communities, aim to solve different problems, and have their own pros and cons. Though we have already seen many examples of successful applications of cluster analysis, there still remain many open problems due to the existence of many inherent uncertain factors. These problems have already attracted, and will continue to attract, intensive efforts from broad disciplines. We summarize and conclude the survey by listing some important issues and research trends for clustering algorithms.
1) There is no clustering algorithm that can be universally used to solve all problems. Usually, algorithms are designed with certain assumptions and favor some types of biases. In this sense, it is not accurate to say "best" in the context of clustering algorithms, although some comparisons are possible. These comparisons are mostly based on some specific applications, under certain conditions, and the results may become quite different if the conditions change.
2) New technology has generated more complex and challenging tasks, requiring more powerful clustering algorithms. The following properties are important to the efficiency and effectiveness of a novel algorithm:
I) generate arbitrary shapes of clusters rather than be confined to some particular shape;
II) handle a large volume of data as well as high-dimensional features with acceptable time and storage complexities;
III) detect and remove possible outliers and noise;
IV) decrease the reliance of algorithms on user-dependent parameters;
V) have the capability of dealing with newly occurring data without relearning from scratch;
VI) be immune to the effects of the order of input patterns;
VII) provide some insight into the number of potential clusters without prior knowledge;
VIII) show good data visualization and provide users with results that can simplify further analysis;
IX) be capable of handling both numerical and nominal data, or be easily adaptable to some other data type.
Of course, some more detailed requirements for specific applications will affect these properties.
3) At the preprocessing and postprocessing phases, feature selection/extraction (as well as standardization and normalization) and cluster validation are as important as the clustering algorithms. Choosing appropriate and meaningful features can greatly reduce the burden of subsequent designs, and result evaluations reflect the degree of confidence to which we can rely on the generated clusters. Unfortunately, both processes lack universal guidance. Ultimately, the tradeoff among different criteria and methods is still dependent on the applications themselves.
ACKNOWLEDGMENT

The authors would like to thank the Eisen Laboratory at Stanford University for the use of their CLUSTER and TreeView software and the Whitehead Institute/MIT Center for Genome Research for the use of their GeneCluster software. They would also like to thank S. Mulder for the part on the traveling salesman problem, and they acknowledge the extensive comments from the reviewers and the anonymous associate editor.

REFERENCES
[1] F. Abascal and A. Valencia, “Clustering of proximal sequence space
for the identification of protein families,” Bioinformatics, vol. 18, pp.
908–921, 2002.
[2] C. Aggarwal and P. Yu, “Redefining clustering for high-dimensional ap-
plications,” IEEE Trans. Knowl. Data Eng., vol. 14, no. 2, pp. 210–
225, Feb. 2002.
[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic
subspace clustering of high dimensional data for data mining applica-
tions,” in Proc. ACM SIGMOD Int. Conf. Management of Data,
1998, pp. 94–105.
[4] H. Akaike, “A new look at the statistical model identification,” IEEE
Trans. Autom. Control, vol. AC-19, no. 6, pp. 716–722, Dec. 1974.
[5] A. Alizadeh et al., “Distinct types of diffuse large B-cell Lymphoma
identified by gene expression profiling,” Nature, vol. 403, pp. 503–
511, 2000.
[6] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, "Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays," Proc. Nat. Acad. Sci. USA, pp. 6745–6750, 1999.
[7] C. Alpert and A. Kahng, “Multi-way partitioning via spacefilling [36] J. Bezdek and R. Hathaway, “Numerical convergence and
curves and dynamic programming,” in Proc. 31st ACM/IEEE Design interpretation of the fuzzy -shells clustering algorithms,” IEEE Trans.
Automa- tion Conf., 1994, pp. 652–657. Neural Netw., vol. 3, no. 5, pp. 787–793, Sep. 1992.
[8] , “Recent directions in netlist partitioning: A survey,” VLSI J., [37] J. Bezdek and N. Pal, “Some new indexes of cluster validity,” IEEE
vol. 19, pp. 1–81, 1995. Trans. Syst., Man, Cybern. B, Cybern., vol. 28, no. 3, pp. 301–315,
[9] K. Al-Sultan, “A Tabu search approach to the clustering problem,” Jun. 1998.
Pat- tern Recognit., vol. 28, no. 9, pp. 1443–1451, 1995. [38] C. Bishop, Neural Networks for Pattern Recognition. New York: Ox-
[10] S. Altschul et al., “Gapped BLAST and PSI-BLAST: A new ford Univ. Press, 1995.
generation of protein database search programs,” Nucleic Acids Res., [39] L. Bobrowski and J. Bezdek, “c-Means clustering with the and
vol. 25, pp. 3389–3402, 1997. norms,” IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 545–554,
[11] S. Altschul et al., “Basic local alignment search tool,” J. Molec. Biol., May-Jun. 1991.
vol. 215, pp. 403–410, 1990. [40] H. Bock, “Probabilistic models in cluster analysis,” Comput. Statist.
[12] G. Anagnostopoulos and M. Georgiopoulos, “Hypersphere ART and Data Anal., vol. 23, pp. 5–28, 1996.
ARTMAP for unsupervised and supervised incremental learning,” in [41] E. Bolten, A. Sxhliep, S. Schneckener, D. Schomburg, and R.
Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Networks Schrader, “Clustering protein sequences—Structure prediction by
(IJCNN’00), vol. 6, Como, Italy, pp. 59–64. transitive ho- mology,” Bioinformatics, vol. 17, pp. 935–941, 2001.
[13] , “Ellipsoid ART and ARTMAP for incremental unsupervised [42] N. Boujemaa, “Generalized competitive clustering for image segmen-
and supervised learning,” in Proc. IEEE-INNS-ENNS Int. Joint Conf. tation,” in Proc. 19th Int. Meeting North American Fuzzy Information
Processing Soc. (NAFIPS’00), Atlanta, GA, 2000, pp. 133–137.
Neural Networks (IJCNN’01), vol. 2, Washington, DC, 2001, pp.
[43] P. Bradley and U. Fayyad, “Refining initial points for -means clus-
1221–1226.
tering,” in Proc. 15th Int. Conf. Machine Learning, 1998, pp. 91–99.
[14] M. Anderberg, Cluster Analysis for Applications. New York: Aca-
[44] P. Bradley, U. Fayyad, and C. Reina, “Scaling clustering algorithms to
demic, 1973.
large databases,” in Proc. 4th Int. Conf. Knowledge Discovery and
[15] G. Babu and M. Murty, “A near-optimal initial seed value selection in
Data Mining (KDD’98), 1998, pp. 9–15.
-means algorithm using a genetic algorithm,” Pattern Recognit. [45] , “Clustering very large databases using EM mixture models,” in
Lett., vol. 14, no. 10, pp. 763–769, 1993. Proc. 15th Int. Conf. Pattern Recognition, vol. 2, 2000, pp. 76–80.
[16] , “Clustering with evolution strategies,” Pattern Recognit., vol. [46] , “Clustering very large databases using EM mixture models,” in
27, no. 2, pp. 321–329, 1994. Proc. 15th Int. Conf. Pattern Recognition, vol. 2, 2000, pp. 76–80.
[17] E. Backer and A. Jain, “A clustering performance measure based on [47] D. Brown and C. Huntley, “A practical application of simulated an-
fuzzy set decomposition,” IEEE Trans. Pattern Anal. Mach. Intell., nealing to clustering,” Pattern Recognit., vol. 25, no. 4, pp. 401–412,
vol. PAMI-3, no. 1, pp. 66–75, Jan. 1981. 1992.
[18] P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Ap- [48] C. Burges, “A tutorial on support vector machines for pattern recogni-
proach, 2nd ed. Cambridge, MA: MIT Press, 2001. tion,” Data Mining Knowl. Discov., vol. 2, pp. 121–167, 1998.
[19] P. Baldi and K. Hornik, “Neural networks and principal component anal- [49] J. Burke, D. Davison, and W. Hide, “d2 Cluster: A validated method for
ysis: Learning from examples without local minima,” Neural Netw., clustering EST and full-length cDNA sequences,” Genome Res., vol.
vol. 2, pp. 53–58, 1989. 9, pp. 1135–1142, 1999.
[20] P. Baldi and A. Long, “A Bayesian framework for the analysis of mi- [50] I. Cadez, S. Gaffney, and P. Smyth, “A general probabilistic framework
croarray expression data: Regularized t-test and statistical inferences for clustering individuals and objects,” in Proc. 6th ACM SIGKDD
of gene changes,” Bioinformatics, vol. 17, pp. 509–519, 2001. Int. Conf. Knowledge Discovery and Data Mining, 2000, pp. 140–149.
[21] G. Ball and D. Hall, “A clustering technique for summarizing multi- [51] G. Carpenter and S. Grossberg, “A massively parallel architecture for
variate data,” Behav. Sci., vol. 12, pp. 153–155, 1967. a self-organizing neural pattern recognition machine,” Comput. Vis.
[22] S. Bandyopadhyay and U. Maulik, “Nonparametric genetic clustering: Graph. Image Process., vol. 37, pp. 54–115, 1987.
Comparison of validity indices,” IEEE Trans. Syst., Man, Cybern. C, [52] , “ART2: Self-organization of stable category recognition codes
Appl. Rev., vol. 31, no. 1, pp. 120–125, Feb. 2001. for analog input patterns,” Appl. Opt., vol. 26, no. 23, pp. 4919–4930,
[23] A. Baraldi and E. Alpaydin, “Constructive feedforward ART 1987.
clustering networks—Part I and II,” IEEE Trans. Neural Netw., vol. [53] , “The ART of adaptive pattern recognition by a self-organizing
13, no. 3, pp. 645–677, May 2002. neural network,” IEEE Computer, vol. 21, no. 3, pp. 77–88, Mar.
[24] A. Baraldi and P. Blonda, “A survey of fuzzy clustering algorithms for 1988.
pattern recognition—Part I and II,” IEEE Trans. Syst., Man, Cybern. [54] , “ART3: Hierarchical search using chemical transmitters in self-
B, Cybern., vol. 29, no. 6, pp. 778–801, Dec. 1999. organizing pattern recognition architectures,” Neural Netw., vol. 3, no.
[25] A. Baraldi and L. Schenato, “Soft-to-hard model transition in clustering: 23, pp. 129–152, 1990.
A review,”, Tech. Rep. TR-99-010, 1999. [55] G. Carpenter, S. Grossberg, N. Markuzon, J. Reynolds, and D. Rosen,
[26] D. Barbará and P. Chen, “Using the fractal dimension to cluster datasets,” “Fuzzy ARTMAP: A neural network architecture for incremental
in Proc. 6th ACM SIGKDD Int. Conf. Knowledge Discovery and Data super- vised learning of analog multidimensional maps,” IEEE Trans.
Mining, 2000, pp. 260–264. Neural Netw., vol. 3, no. 5, pp. 698–713, 1992.
[27] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques [56] G. Carpenter, S. Grossberg, and J. Reynolds, “ARTMAP: Supervised
for embedding and clustering,” in Advances in Neural Information real-time learning and classification of nonstationary data by a self-or-
Processing Systems, T. G. Dietterich, S. Becker, and Z. Ghahramani, ganizing neural network,” Neural Netw., vol. 4, no. 5, pp. 169–181, 1991.
[57] G. Carpenter, S. Grossberg, and D. Rosen, “Fuzzy ART: Fast stable
Eds. Cambridge, MA: MIT Press, 2002, vol. 14.
learning and categorization of analog patterns by an adaptive
[28] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton,
resonance system,” Neural Netw., vol. 4, pp. 759–771, 1991.
NJ: Princeton Univ. Press, 1961.
[58] G. Celeux and G. Govaert, “A classification EM algorithm for clustering
[29] A. Ben-Dor, R. Shamir, and Z. Yakhini, “Clustering gene expression
and two stochastic versions,” Comput. Statist. Data Anal., vol. 14, pp.
patterns,” J. Comput. Biol., vol. 6, pp. 281–297, 1999.
315–332, 1992.
[30] Y. Bengio, “Markovian models for sequential data,” Neural Comput. [59] P. Cheeseman and J. Stutz, “Bayesian classification (AutoClass):
Surv., vol. 2, pp. 129–162, 1999. Theory and results,” in Advances in Knowledge Discovery and Data
[31] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik, “Support vector Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R.
clustering,” J. Mach. Learn. Res., vol. 2, pp. 125–137, 2001. Uthurusamy, Eds. Menlo Park, CA: AAAI Press, 1996, pp. 153–180.
[32] , “A support vector clustering method,” in Proc. Int. Conf. [60] V. Cherkassky and F. Mulier, Learning From Data: Concepts,
Pattern Recognition, vol. 2, 2000, pp. 2724–2727. Theory, and Methods. New York: Wiley, 1998.
[33] P. Berkhin. (2001) Survey of clustering data mining techniques. [On- [61] J. Cherng and M. Lo, “A hypergraph based clustering algorithm for
line]. Available: http://www.accrue.com/products/rp_cluster_review.pdf spa- tial data sets,” in Proc. IEEE Int. Conf. Data Mining (ICDM’01),
http://citeseer.nj.nec.com/berkhin02survey.html 2001, pp. 83–90.
[34] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is nearest [62] J. Chiang and P. Hao, “A new kernel-based fuzzy clustering approach:
neighbor meaningful,” in Proc. 7th Int. Conf. Database Theory, 1999, Support vector clustering with cell growing,” IEEE Trans. Fuzzy Syst.,
pp. 217–235. vol. 11, no. 4, pp. 518–527, Aug. 2003.
[35] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algo- [63] C. Chinrungrueng and C. Séquin, “Optimal adaptive -means algo-
rithms. New York: Plenum, 1981. rithm with dynamic adjustment of learning rate,” IEEE Trans. Neural
Netw., vol. 6, no. 1, pp. 157–169, Jan. 1995.
tering,” Mach. Learn., vol. 2, pp. 139–172, 1987.
[64] S. Chu and J. Roddick, “A clustering algorithm using the Tabu search
approach with simulated annealing,” in Data Mining II—Proceedings
of Second International Conference on Data Mining Methods and
Databases, N. Ebecken and C. Brebbia, Eds, Cambridge, U.K., 2000,
pp. 515–523.
[65] I. H. G. S. Consortium, “Initial sequencing and analysis of the human
genome,” Nature, vol. 409, pp. 860–921, 2001.
[66] J. Corchado and C. Fyfe, “A comparison of kernel methods for instan-
tiating case based reasoning systems,” Comput. Inf. Syst., vol. 7, pp.
29–42, 2000.
[67] M. Cowgill, R. Harvey, and L. Watson, “A genetic algorithm
approach to cluster analysis,” Comput. Math. Appl., vol. 37, pp. 99–
108, 1999.
[68] C. Cummings and D. Relman, “Using DNA microarray to study host-
microbe interactions,” Genomics, vol. 6, no. 5, pp. 513–525, 2000.
[69] E. Dahlhaus, “Parallel algorithms for hierarchical clustering and appli-
cations to split decomposition and parity graph recognition,” J. Algo-
rithms, vol. 36, no. 2, pp. 205–240, 2000.
[70] R. Davé, “Adaptive fuzzy -shells clustering and detection of
ellipses,”
IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 643–662, Sep. 1992.
[71] R. Davé and R. Krishnapuram, “Robust clustering methods: A unified
view,” IEEE Trans. Fuzzy Syst., vol. 5, no. 2, pp. 270–293, May 1997.
[72] M. Delgado, A. Skármeta, and H. Barberá, “A Tabu search approach
to the fuzzy clustering problem,” in Proc. 6th IEEE Int. Conf. Fuzzy
Systems, vol. 1, 1997, pp. 125–130.
[73] D. Dembélé and P. Kastner, “Fuzzy -means method for clustering mi-
croarray data,” Bioinformatics, vol. 19, no. 8, pp. 973–980, 2003.
[74] Handbook of Pattern Recognition and Computer Vision, C. Chen, L.
Pau, and P. Wang, Eds., World Scientific, Singapore, 1993, pp. 3–32. R.
Dubes, “Cluster analysis and related issue”.
[75] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. New
York: Wiley, 2001.
[76] J. Dunn, “A fuzzy relative of the ISODATA process and its use in de-
tecting compact well separated clusters,” J. Cybern., vol. 3, no. 3, pp.
32–57, 1974.
[77] B. Duran and P. Odell, Cluster Analysis: A Survey. New York:
Springer-Verlag, 1974.
[78] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cam-
bridge, U.K.: Cambridge Univ. Press, 1998.
[79] M. Eisen and P. Brown, “DNA arrays for analysis of gene
expression,”
Methods Enzymol., vol. 303, pp. 179–205, 1999.
[80] M. Eisen, P. Spellman, P. Brown, and D. Botstein, “Cluster analysis and
display of genome-wide expression patterns,” in Proc. Nat. Acad. Sci.
USA, vol. 95, 1998, pp. 14 863–14 868.
[81] Y. El-Sonbaty and M. Ismail, “Fuzzy clustering for symbolic data,” IEEE
Trans. Fuzzy Syst., vol. 6, no. 2, pp. 195–204, May 1998.
[82] T. Eltoft and R. deFigueiredo, “A new neural network for cluster-de-
tection-and-labeling,” IEEE Trans. Neural Netw., vol. 9, no. 5, pp.
1021–1035, Sep. 1998.
[83] A. Enright and C. Ouzounis, “GeneRAGE: A robust algorithm for se-
quence clustering and domain detection,” Bioinformatics, vol. 16, pp.
451–457, 2000.
[84] S. Eschrich, J. Ke, L. Hall, and D. Goldgof, “Fast accurate fuzzy clus-
tering through data reduction,” IEEE Trans. Fuzzy Syst., vol. 11, no. 2,
pp. 262–270, Apr. 2003.
[85] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A density-based
algorithm for discovering clusters in large spatial databases with
noise,” in Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining
(KDD’96), 1996, pp. 226–231.
[86] V. Estivill-Castro and I. Lee, “AMOEBA: Hierarchical clustering
based on spatial proximity using Delaunay diagram,” in Proc. 9th Int.
Symp. Spatial Data Handling (SDH’99), Beijing, China, 1999, pp.
7a.26–7a.41.
[87] V. Estivill-Castro and J. Yang, “A fast and robust general purpose
clus- tering algorithm,” in Proc. 6th Pacific Rim Int. Conf. Artificial
Intelli- gence (PRICAI’00), R. Mizoguchi and J. Slaney, Eds.,
Melbourne, Aus- tralia, 2000, pp. 208–218.
[88] B. Everitt, S. Landau, and M. Leese, Cluster Analysis. London:
Arnold, 2001.
[89] D. Fasulo, “An analysis of recent work on clustering algorithms,”
Dept. Comput. Sci. Eng., Univ. Washington, Seattle, WA, Tech. Rep.
01-03-02, 1999.
[90] M. Figueiredo and A. Jain, “Unsupervised learning of finite mixture
models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp.
381–396, Mar. 2002.
[91] D. Fisher, “Knowledge acquisition via incremental conceptual clus-
Computer Science and Computational Biology. Cambridge, U.K.:
Cambridge Univ. Press, 1997.
[92] R. Fisher, “The use of multiple measurements in taxonomic
[121] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity
problems,”
methods: Part I & II,” SIGMOD Record, vol. 31, no. 2–3, 2002.
Annu. Eugenics, pt. II, vol. 7, pp. 179–188, 1936.
[122] L. Hall, I. Özyurt, and J. Bezdek, “Clustering with a genetically opti-
[93] D. Fogel, “An introduction to simulated evolutionary
mized approach,” IEEE Trans. Evol. Comput., vol. 3, no. 2, pp. 103–112,
optimization,”
1999.
IEEE Trans. Neural Netw., vol. 5, no. 1, pp. 3–14, Jan. 1994.
[94] E. Forgy, “Cluster analysis of multivariate data: Efficiency vs.
inter- pretability of classifications,” Biometrics, vol. 21, pp. 768–
780, 1965.
[95] C. Fraley and A. Raftery, “MCLUST: Software for model-based
cluster analysis,” J. Classificat., vol. 16, pp. 297–306, 1999.
[96] , “Model-Based clustering, discriminant analysis, and density
esti- mation,” J. Amer. Statist. Assoc., vol. 97, pp. 611–631, 2002.
[97] J. Friedman, “Exploratory projection pursuit,” J. Amer. Statist.
Assoc., vol. 82, pp. 249–266, 1987.
[98] H. Frigui and R. Krishnapuram, “A robust competitive clustering
algo- rithm with applications in computer vision,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 21, no. 5, pp. 450–465, May 1999.
[99] B. Fritzke. (1997) Some competitive learning methods. [Online].
Avail- able: http://www.neuroinformatik.ruhr-uni-
bochum.de/ini/VDM/re- search/gsn/JavaPaper
[100] B. Gabrys and A. Bargiela, “General fuzzy min-max neural
network for clustering and classification,” IEEE Trans. Neural
Netw., vol. 11, no. 3, pp. 769–783, May 2000.
[101] V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French,
“Clus- tering large datasets in arbitrary metric spaces,” in Proc.
15th Int. Conf. Data Engineering, 1999, pp. 502–511.
[102] I. Gath and A. Geva, “Unsupervised optimal fuzzy clustering,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 773–781,
Jul. 1989.
[103] GenBank Release Notes 144.0.
[104] A. Geva, “Hierarchical unsupervised fuzzy clustering,” IEEE
Trans. Fuzzy Syst., vol. 7, no. 6, pp. 723–733, Dec. 1999.
[105] D. Ghosh and A. Chinnaiyan, “Mixture modeling of gene
expression data from microarray experiments,” Bioinformatics, vol.
18, no. 2, pp. 275–286, 2002.
[106] A. Ghozeil and D. Fogel, “Discovering patterns in spatial data
using evolutionary programming,” in Proc. 1st Annu. Conf. Genetic
Program- ming, 1996, pp. 512–520.
[107] M. Girolami, “Mercer kernel based clustering in feature space,”
IEEE Trans. Neural Netw., vol. 13, no. 3, pp. 780–784, May 2002.
[108] F. Glover, “Tabu search, part I,” ORSA J. Comput., vol. 1, no. 3,
pp. 190–206, 1989.
[109] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.
Mesirov,
H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E.
Lander, “Molecular classification of cancer: Class discovery and
class prediction by gene expression monitoring,” Science, vol. 286,
pp. 531–537, 1999.
[110] A. Gordon, “Cluster validation,” in Data Science, Classification, and
Re- lated Methods, C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.
Bock, and Y. Bada, Eds. New York: Springer-Verlag, 1998, pp.
22–39.
[111] , Classification, 2nd ed. London, U.K.: Chapman & Hall, 1999.
[112] J. Gower, “A general coefficient of similarity and some of its
properties,”
Biometrics, vol. 27, pp. 857–872, 1971.
[113] S. Grossberg, “Adaptive pattern recognition and universal encoding
II: Feedback, expectation, olfaction, and illusions,” Biol. Cybern.,
vol. 23, pp. 187–202, 1976.
[114] P. Grünwald, P. Kontkanen, P. Myllymäki, T. Silander, and H.
Tirri, “Minimum encoding approaches for predictive modeling,” in
Proc. 14th Int. Conf. Uncertainty in AI (UAI’98), 1998, pp. 183–
192.
[115] X. Guan and L. Du, “Domain identification by clustering sequence
alignments,” Bioinformatics, vol. 14, pp. 783–788, 1998.
[116] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering
algo- rithm for large databases,” in Proc. ACM SIGMOD Int. Conf.
Manage- ment of Data, 1998, pp. 73–84.
[117] , “ROCK: A robust clustering algorithm for categorical
attributes,”
Inf. Syst., vol. 25, no. 5, pp. 345–366, 2000.
[118] S. Gupata, K. Rao, and V. Bhatnagar, “ -means clustering
algorithm for categorical attributes,” in Proc. 1st Int. Conf. Data
Warehousing and Knowledge Discovery (DaWaK’99), Florence,
Italy, 1999, pp. 203–208.
[119] V. Guralnik and G. Karypis, “A scalable algorithm for clustering
sequen- tial data,” in Proc. 1st IEEE Int. Conf. Data Mining
(ICDM’01), 2001, pp. 179–186.
[120] D. Gusfield, Algorithms on Strings, Trees, and Sequences:
[123] R. Hammah and J. Curran, “Validity measures for the fuzzy cluster anal- [153] D. Jiang, C. Tang, and A. Zhang, “Cluster analysis for gene
ysis of orientations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. expression data: A survey,” IEEE Trans. Knowl. Data Eng., vol. 16,
12, pp. 1467–1472, Dec. 2000. no. 11, pp. 1370–1386, Nov. 2004.
[124] P. Hansen and B. Jaumard, “Cluster analysis and mathematical program- [154] C. Jutten and J. Herault, “Blind separation of sources, Part I: An adaptive
ming,” Math. Program., vol. 79, pp. 191–215, 1997. algorithms based on neuromimetic architecture,” Signal Process., vol.
[125] P. Hansen and N. Mladenoviæ, “J-means: A new local search heuristic 24, no. 1, pp. 1–10, 1991.
for minimum sum of squares clustering,” Pattern Recognit., vol. 34, [155] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A.
pp. 405–413, 2001. Wu, “An efficient -means clustering algorithm: Analysis and imple-
[126] F. Harary, Graph Theory. Reading, MA: Addison-Wesley, 1969. mentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7,
[127] J. Hartigan, Clustering Algorithms. New York: Wiley, 1975. pp. 881–892, Jul. 2000.
[128] E. Hartuv and R. Shamir, “A clustering algorithm based on graph con- [156] N. Karayiannis, “A methodology for construction fuzzy algorithms for
nectivity,” Inf. Process. Lett., vol. 76, pp. 175–181, 2000. learning vector quantization,” IEEE Trans. Neural Netw., vol. 8, no. 3,
[129] R. Hathaway and J. Bezdek, “Fuzzy -means clustering of incomplete pp. 505–518, May 1997.
data,” IEEE Trans. Syst., Man, Cybern., vol. 31, no. 5, pp. 735–744, [157] N. Karayiannis, J. Bezdek, N. Pal, R. Hathaway, and P. Pai, “Repairs
2001. to GLVQ: A new family of competitive learning schemes,” IEEE
[130] R. Hathaway, J. Bezdek, and Y. Hu, “Generalized fuzzy -means clus- Trans. Neural Netw., vol. 7, no. 5, pp. 1062–1071, Sep. 1996.
tering strategies using norm distances,” IEEE Trans. Fuzzy Syst., vol. [158] J. Karhunen, E. Oja, L. Wang, R. Vigário, and J. Joutsensalo, “A class
8, no. 5, pp. 576–582, Oct. 2000. of neural networks for independent component analysis,” IEEE Trans.
[131] B. Hay, G. Wets, and K. Vanhoof, “Clustering navigation patterns on Neural Netw., vol. 8, no. 3, pp. 486–504, May 1997.
a website using a sequence alignment method,” in Proc. Intelligent [159] G. Karypis, E. Han, and V. Kumar, “Chameleon: Hierarchical clustering
Tech- niques for Web Personalization: 17th Int. Joint Conf. Artificial using dynamic modeling,” IEEE Computer, vol. 32, no. 8, pp. 68–75,
Intelli- gence, vol. s.l, 2001, pp. 1–6, 200. Aug. 1999.
[160] R. Kathari and D. Pitts, “On finding the number of clusters,” Pattern
[132] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd
Recognit. Lett., vol. 20, pp. 405–416, 1999.
ed. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[161] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An
[133] Q. He, “A review of clustering algorithms as applied to IR,” Univ. Illinois
Introduction to Cluster Analysis: Wiley, 1990.
at Urbana-Champaign, Tech. Rep. UIUCLIS-1999/6+IRG, 1999.
[134] M. Healy, T. Caudell, and S. Smith, "A neural architecture for pattern sequence verification through inferencing," IEEE Trans. Neural Netw., vol. 4, no. 1, pp. 9–20, Jan. 1993.
[135] A. Hinneburg and D. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. 4th Int. Conf. Knowledge Discovery and Data Mining (KDD'98), 1998, pp. 58–65.
[136] A. Hinneburg and D. Keim, "Optimal grid-clustering: Toward breaking the curse of dimensionality in high-dimensional clustering," in Proc. 25th VLDB Conf., 1999, pp. 506–517.
[137] F. Hoeppner, "Fuzzy shell clustering algorithms in image processing: Fuzzy c-rectangular and 2-rectangular shells," IEEE Trans. Fuzzy Syst., vol. 5, no. 4, pp. 599–613, Nov. 1997.
[138] J. Hoey, "Clustering contextual facial display sequences," in Proc. 5th IEEE Int. Conf. Automatic Face and Gesture Recognition (FGR'02), 2002, pp. 354–359.
[139] T. Hofmann and J. Buhmann, "Pairwise data clustering by deterministic annealing," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 1, pp. 1–14, Jan. 1997.
[140] J. Holland, Adaption in Natural and Artificial Systems. Ann Arbor, MI: Univ. Michigan Press, 1975.
[141] F. Höppner, F. Klawonn, and R. Kruse, Fuzzy Cluster Analysis: Methods for Classification, Data Analysis, and Image Recognition. New York: Wiley, 1999.
[142] Z. Huang, "Extensions to the k-means algorithm for clustering large data sets with categorical values," Data Mining Knowl. Discov., vol. 2, pp. 283–304, 1998.
[143] J. Huang, M. Georgiopoulos, and G. Heileman, "Fuzzy ART properties," Neural Netw., vol. 8, no. 2, pp. 203–213, 1995.
[144] P. Huber, "Projection pursuit," Ann. Statist., vol. 13, no. 2, pp. 435–475, 1985.
[145] R. Hughey and A. Krogh, "Hidden Markov models for sequence analysis: Extension and analysis of the basic method," CABIOS, vol. 12, no. 2, pp. 95–107, 1996.
[146] M. Hung and D. Yang, "An efficient fuzzy c-means clustering algorithm," in Proc. IEEE Int. Conf. Data Mining, 2001, pp. 225–232.
[147] L. Hunt and J. Jorgensen, "Mixture model clustering using the MULTIMIX program," Australia and New Zealand J. Statist., vol. 41, pp. 153–171, 1999.
[148] J. Hwang, J. Vlontzos, and S. Kung, "A systolic neural network architecture for hidden Markov models," IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12, pp. 1967–1979, Dec. 1989.
[149] A. Hyvärinen, "Survey of independent component analysis," Neural Comput. Surv., vol. 2, pp. 94–128, 1999.
[150] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[151] A. Jain, R. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, 2000.
[152] A. Jain, M. Murty, and P. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[162] W. Kent and A. Zahler, "Conservation, regulation, synteny, and introns in a large-scale C. briggsae–C. elegans genomic alignment," Genome Res., vol. 10, pp. 1115–1125, 2000.
[163] P. Kersten, "Implementation issues in the fuzzy c-medians clustering algorithm," in Proc. 6th IEEE Int. Conf. Fuzzy Systems, vol. 2, 1997, pp. 957–962.
[164] J. Khan, J. Wei, M. Ringnér, L. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. Antonescu, C. Peterson, and P. Meltzer, "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Med., vol. 7, no. 6, pp. 673–679, 2001.
[165] S. Kirkpatrick, C. Gelatt, and M. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.
[166] J. Kleinberg, "An impossibility theorem for clustering," in Proc. 2002 Conf. Advances in Neural Information Processing Systems, vol. 15, 2002, pp. 463–470.
[167] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. 14th Int. Joint Conf. Artificial Intelligence, 1995, pp. 338–345.
[168] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, pp. 1464–1480, Sep. 1990.
[169] T. Kohonen, Self-Organizing Maps, 3rd ed. New York: Springer-Verlag, 2001.
[170] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, "Self organization of a massive document collection," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 574–585, May 2000.
[171] E. Kolatch. (2001) Clustering algorithms for spatial databases: A survey. [Online]. Available: http://citeseer.nj.nec.com/436843.html
[172] J. Kolen and T. Hutcheson, "Reducing the time complexity of the fuzzy c-means algorithm," IEEE Trans. Fuzzy Syst., vol. 10, no. 2, pp. 263–267, Apr. 2002.
[173] K. Krishna and M. Murty, "Genetic K-means algorithm," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 3, pp. 433–439, Jun. 1999.
[174] R. Krishnapuram, H. Frigui, and O. Nasraoui, "Fuzzy and possibilistic shell clustering algorithms and their application to boundary detection and surface approximation—Part I and II," IEEE Trans. Fuzzy Syst., vol. 3, no. 1, pp. 29–60, Feb. 1995.
[175] R. Krishnapuram and J. Keller, "A possibilistic approach to clustering," IEEE Trans. Fuzzy Syst., vol. 1, no. 2, pp. 98–110, Apr. 1993.
[176] R. Krishnapuram, O. Nasraoui, and H. Frigui, "The fuzzy c spherical shells algorithm: A new approach," IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 663–671, Sep. 1992.
[177] A. Krogh, M. Brown, I. Mian, K. Sjölander, and D. Haussler, "Hidden Markov models in computational biology: Applications to protein modeling," J. Molec. Biol., vol. 235, pp. 1501–1531, 1994.
[178] G. Lance and W. Williams, "A general theory of classification sorting strategies: 1. Hierarchical systems," Comput. J., vol. 9, pp. 373–380, 1967.
[179] M. Law and J. Kwok, "Rival penalized competitive learning for model-based sequence clustering," in Proc. 15th Int. Conf. Pattern Recognition, vol. 2, 2000, pp. 195–198.
[180] Y. Leung, J. Zhang, and Z. Xu, "Clustering by scale-space filtering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1396–1410, Dec. 2000.
[181] E. Levine and E. Domany, "Resampling method for unsupervised estimation of cluster validity," Neural Comput., vol. 13, pp. 2573–2593, 2001.
[182] C. Li and G. Biswas, "Temporal pattern generation using hidden Markov model based unsupervised classification," in Advances in Intelligent Data Analysis, ser. Lecture Notes in Computer Science, vol. 1642, D. Hand, K. Kok, and M. Berthold, Eds. New York: Springer-Verlag, 1999.
[183] C. Li and G. Biswas, "Unsupervised learning with mixed numeric and nominal data," IEEE Trans. Knowl. Data Eng., vol. 14, no. 4, pp. 673–690, Jul.-Aug. 2002.
[184] C. Li, H. Garcia-Molina, and G. Wiederhold, "Clustering for approximate similarity search in high-dimensional spaces," IEEE Trans. Knowl. Data Eng., vol. 14, no. 4, pp. 792–808, Jul.-Aug. 2002.
[185] W. Li, L. Jaroszewski, and A. Godzik, "Clustering of highly homologous sequences to reduce the size of large protein databases," Bioinformatics, vol. 17, pp. 282–283, 2001.
[186] A. Likas, N. Vlassis, and J. Verbeek, "The global k-means clustering algorithm," Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003.
[187] S. Lin and B. Kernighan, "An effective heuristic algorithm for the traveling salesman problem," Operat. Res., vol. 21, pp. 498–516, 1973.
[188] R. Lipshutz, S. Fodor, T. Gingeras, and D. Lockhart, "High density synthetic oligonucleotide arrays," Nature Genetics, vol. 21, pp. 20–24, 1999.
[189] G. Liu, Introduction to Combinatorial Mathematics. New York: McGraw-Hill, 1968.
[190] J. Lozano and P. Larrañaga, "Applying genetic algorithms to search for the best hierarchical clustering of a dataset," Pattern Recognit. Lett., vol. 20, pp. 911–918, 1999.
[191] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp., vol. 1, 1967, pp. 281–297.
[192] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," IEEE/ACM Trans. Computat. Biol. Bioinformatics, vol. 1, no. 1, pp. 24–45, Jan. 2004.
[193] Y. Man and I. Gath, "Detection and separation of ring-shaped clusters using fuzzy clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 8, pp. 855–861, Aug. 1994.
[194] J. Mao and A. Jain, "A self-organizing network for hyperellipsoidal clustering (HEC)," IEEE Trans. Neural Netw., vol. 7, no. 1, pp. 16–29, Jan. 1996.
[195] U. Maulik and S. Bandyopadhyay, "Genetic algorithm-based clustering technique," Pattern Recognit., vol. 33, pp. 1455–1465, 2000.
[196] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: Wiley, 1997.
[197] G. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley, 2000.
[198] G. McLachlan, D. Peel, K. Basford, and P. Adams, "The EMMIX software for the fitting of mixtures of normal and t-components," J. Statist. Software, vol. 4, 1999.
[199] C. Miller, J. Gurd, and A. Brass, "A RAPID algorithm for sequence database comparisons: Application to the identification of vector contamination in the EMBL databases," Bioinformatics, vol. 15, pp. 111–121, 1999.
[200] R. Miller et al., "A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base," Genome Res., vol. 9, pp. 1143–1155, 1999.
[201] W. Miller, "Comparison of genomic DNA sequences: Solved and unsolved problems," Bioinformatics, vol. 17, pp. 391–397, 2001.
[202] G. Milligan and M. Cooper, "An examination of procedures for determining the number of clusters in a data set," Psychometrika, vol. 50, pp. 159–179, 1985.
[203] R. Mollineda and E. Vidal, "A relative approach to hierarchical clustering," in Pattern Recognition and Applications, Frontiers in Artificial Intelligence and Applications, vol. 56, M. Torres and A. Sanfeliu, Eds. Amsterdam, The Netherlands: IOS Press, 2000, pp. 19–28.
[204] B. Moore, "ART1 and pattern clustering," in Proc. 1988 Connectionist Models Summer School, 1989, pp. 174–185.
[205] S. Moore, "Making chips to probe genes," IEEE Spectr., vol. 38, no. 3, pp. 54–60, Mar. 2001.
[206] Y. Moreau, F. Smet, G. Thijs, K. Marchal, and B. Moor, "Functional bioinformatics of microarray data: From expression to regulation," Proc. IEEE, vol. 90, no. 11, pp. 1722–1743, Nov. 2002.
[207] T. Morzy, M. Wojciechowski, and M. Zakrzewicz, "Pattern-oriented hierarchical clustering," in Proc. 3rd East Eur. Conf. Advances in Databases and Information Systems, 1999, pp. 179–190.
[208] S. Mulder and D. Wunsch, "Million city traveling salesman problem solution by divide and conquer clustering with adaptive resonance neural networks," Neural Netw., vol. 16, pp. 827–832, 2003.
[209] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201, Mar. 2001.
[210] F. Murtagh, "A survey of recent advances in hierarchical clustering algorithms," Comput. J., vol. 26, no. 4, pp. 354–359, 1983.
[211] F. Murtagh and M. Berry, "Overcoming the curse of dimensionality in clustering by means of the wavelet transform," Comput. J., vol. 43, no. 2, pp. 107–120, 2000.
[212] S. Needleman and C. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. Molec. Biol., vol. 48, pp. 443–453, 1970.
[213] R. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining," IEEE Trans. Knowl. Data Eng., vol. 14, no. 5, pp. 1003–1016, Sep.-Oct. 2002.
[214] T. Oates, L. Firoiu, and P. Cohen, "Using dynamic time warping to bootstrap HMM-based clustering of time series," in Sequence Learning, ser. LNAI 1828, R. Sun and C. Giles, Eds. Berlin, Germany: Springer-Verlag, 2000, pp. 35–52.
[215] E. Oja, "Principal components, minor components, and linear neural networks," Neural Netw., vol. 5, pp. 927–935, 1992.
[216] J. Oliver, R. Baxter, and C. Wallace, "Unsupervised learning using MML," in Proc. 13th Int. Conf. Machine Learning (ICML'96), L. Saitta, Ed., 1996, pp. 364–372.
[217] C. Olson, "Parallel algorithms for hierarchical clustering," Parallel Comput., vol. 21, pp. 1313–1325, 1995.
[218] C. Ordonez and E. Omiecinski, "Efficient disk-based K-means clustering for relational databases," IEEE Trans. Knowl. Data Eng., vol. 16, no. 8, pp. 909–921, Aug. 2004.
[219] L. Owsley, L. Atlas, and G. Bernard, "Self-organizing feature maps and hidden Markov models for machine-tool monitoring," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2787–2798, Nov. 1997.
[220] N. Pal and J. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Trans. Fuzzy Syst., vol. 3, no. 3, pp. 370–379, Aug. 1995.
[221] N. Pal, J. Bezdek, and E. Tsao, "Generalized clustering networks and Kohonen's self-organizing scheme," IEEE Trans. Neural Netw., vol. 4, no. 4, pp. 549–557, Jul. 1993.
[222] G. Patanè and M. Russo, "The enhanced-LBG algorithm," Neural Netw., vol. 14, no. 9, pp. 1219–1237, 2001.
[223] G. Patanè and M. Russo, "Fully automatic clustering system," IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1285–1298, Nov. 2002.
[224] W. Pearson, "Improved tools for biological sequence comparison," Proc. Nat. Acad. Sci., vol. 85, pp. 2444–2448, 1988.
[225] D. Peel and G. McLachlan, "Robust mixture modeling using the t-distribution," Statist. Comput., vol. 10, pp. 339–348, 2000.
[226] D. Pelleg and A. Moore, "X-means: Extending K-means with efficient estimation of the number of clusters," in Proc. 17th Int. Conf. Machine Learning (ICML'00), 2000, pp. 727–734.
[227] J. Peña, J. Lozano, and P. Larrañaga, "An empirical comparison of four initialization methods for the K-means algorithm," Pattern Recognit. Lett., vol. 20, pp. 1027–1040, 1999.
[228] C. Pizzuti and D. Talia, "P-AutoClass: Scalable parallel clustering for mining large data sets," IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 629–641, May-Jun. 2003.
[229] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[230] R. Herwig, A. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O'Brien, "Large-scale clustering of cDNA-fingerprinting data," Genome Res., pp. 1093–1105, 1999.
[231] A. Rauber, J. Paralic, and E. Pampalk, "Empirical evaluation of clustering algorithms," J. Inf. Org. Sci., vol. 24, no. 2, pp. 195–209, 2000.
[232] S. Ridella, S. Rovetta, and R. Zunino, "Plastic algorithm for adaptive vector quantization," Neural Comput. Appl., vol. 7, pp. 37–51, 1998.
[233] J. Rissanen, "Fisher information and stochastic complexity," IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 40–47, Jan. 1996.
[234] K. Rose, "Deterministic annealing for clustering, compression, classification, regression, and related optimization problems," Proc. IEEE, vol. 86, no. 11, pp. 2210–2239, Nov. 1998.
[235] S. Roweis and L. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[236] D. Sankoff and J. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Stanford, CA: CSLI Publications, 1999.
[237] O. Sasson, N. Linial, and M. Linial, "The metric space of proteins—Comparative study of clustering algorithms," Bioinformatics, vol. 18, pp. s14–s21, 2002.
[238] U. Scherf, D. Ross, M. Waltham, L. Smith, J. Lee, L. Tanabe, K. Kohn, W. Reinhold, T. Myers, D. Andrews, D. Scudiero, M. Eisen, E. Sausville, Y. Pommier, D. Botstein, P. Brown, and J. Weinstein, "A gene expression database for the molecular pharmacology of cancer," Nature Genetics, vol. 24, no. 3, pp. 236–244, 2000.
[239] P. Scheunders, "A comparison of clustering algorithms applied to color image quantization," Pattern Recognit. Lett., vol. 18, pp. 1379–1384, 1997.
[240] B. Schölkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2002.
[241] B. Schölkopf, A. Smola, and K. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computat., vol. 10, no. 5, pp. 1299–1319, 1998.
[242] G. Schwarz, "Estimating the dimension of a model," Ann. Statist., vol. 6, no. 2, pp. 461–464, 1978.
[243] G. Scott, D. Clark, and T. Pham, "A genetic clustering algorithm guided by a descent algorithm," in Proc. Congr. Evolutionary Computation, vol. 2, Piscataway, NJ, 2001, pp. 734–740.
[244] P. Sebastiani, M. Ramoni, and P. Cohen, "Sequence learning via Bayesian clustering by dynamics," in Sequence Learning, ser. LNAI 1828, R. Sun and C. Giles, Eds. Berlin, Germany: Springer-Verlag, 2000, pp. 11–34.
[245] S. Selim and K. Alsultan, "A simulated annealing algorithm for the clustering problems," Pattern Recognit., vol. 24, no. 10, pp. 1003–1008, 1991.
[246] R. Shamir and R. Sharan, "Algorithmic approaches to clustering gene expression data," in Current Topics in Computational Molecular Biology, T. Jiang, T. Smith, Y. Xu, and M. Zhang, Eds. Cambridge, MA: MIT Press, 2002, pp. 269–300.
[247] R. Sharan and R. Shamir, "CLICK: A clustering algorithm with applications to gene expression analysis," in Proc. 8th Int. Conf. Intelligent Systems for Molecular Biology, 2000, pp. 307–316.
[248] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: A multi-resolution clustering approach for very large spatial databases," in Proc. 24th VLDB Conf., 1998, pp. 428–439.
[249] P. Simpson, "Fuzzy min-max neural networks—Part 2: Clustering," IEEE Trans. Fuzzy Syst., vol. 1, no. 1, pp. 32–45, Feb. 1993.
[250] J. Sklansky and W. Siedlecki, "Large-scale feature selection," in Handbook of Pattern Recognition and Computer Vision, C. Chen, L. Pau, and P. Wang, Eds. Singapore: World Scientific, 1993, pp. 61–124.
[251] T. Smith and M. Waterman, "New stratigraphic correlation techniques," J. Geology, vol. 88, pp. 451–457, 1980.
[252] P. Smyth, "Clustering using Monte Carlo cross-validation," in Proc. 2nd Int. Conf. Knowledge Discovery and Data Mining, 1996, pp. 126–133.
[253] P. Smyth, "Clustering sequences with hidden Markov models," in Advances in Neural Information Processing, vol. 9, M. Mozer, M. Jordan, and T. Petsche, Eds. Cambridge, MA: MIT Press, 1997, pp. 648–654.
[254] P. Smyth, "Model selection for probabilistic clustering using cross validated likelihood," Statist. Comput., vol. 10, pp. 63–72, 1998.
[255] P. Smyth, "Probabilistic model-based clustering of multivariate and sequential data," in Proc. 7th Int. Workshop on Artificial Intelligence and Statistics, 1999, pp. 299–304.
[256] P. Sneath, "The application of computers to taxonomy," J. Gen. Microbiol., vol. 17, pp. 201–226, 1957.
[257] P. Somervuo and T. Kohonen, "Clustering and visualization of large protein sequence databases by means of an extension of the self-organizing map," in LNAI 1967, 2000, pp. 76–85.
[258] T. Sorensen, "A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons," Biologiske Skrifter, vol. 5, pp. 1–34, 1948.
[259] H. Späth, Cluster Analysis Algorithms. Chichester, U.K.: Ellis Horwood, 1980.
[260] P. Spellman, G. Sherlock, M. Ma, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein, and B. Futcher, "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization," Mol. Biol. Cell, vol. 9, pp. 3273–3297, 1998.
[261] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," Univ. Minnesota, Minneapolis, Tech. Rep. 00-034, 2000.
[262] K. Stoffel and A. Belkoniene, "Parallel k-means clustering for large data sets," in Proc. EuroPar'99 Parallel Processing, 1999, pp. 1451–1454.
[263] M. Su and H. Chang, "Fast self-organizing feature map algorithm," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 721–733, May 2000.
[264] M. Su and C. Chou, "A modified version of the K-means algorithm with a distance based on cluster symmetry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 674–680, Jun. 2001.
[265] R. Sun and C. Giles, "Sequence learning: Paradigms, algorithms, and applications," in LNAI 1828. Berlin, Germany: Springer-Verlag, 2000.
[266] C. Sung and H. Jin, "A Tabu-search-based heuristic for clustering," Pattern Recognit., vol. 33, pp. 849–858, 2000.
[267] SWISS-PROT Protein Knowledgebase Release 45.0 Statistics.
[268] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. Lander, and T. Golub, "Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation," Proc. Nat. Acad. Sci., pp. 2907–2912, 1999.
[269] S. Tavazoie, J. Hughes, M. Campbell, R. Cho, and G. Church, "Systematic determination of genetic network architecture," Nature Genetics, vol. 22, pp. 281–285, 1999.
[270] J. Tenenbaum, V. Silva, and J. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319–2323, 2000.
[271] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown, "Clustering methods for the analysis of DNA microarray data," Dept. Statist., Stanford Univ., Stanford, CA, Tech. Rep.
[272] R. Tibshirani and K. Knight, "The covariance inflation criterion for adaptive model selection," J. Roy. Statist. Soc. B, vol. 61, pp. 529–546, 1999.
[273] L. Tseng and S. Yang, "A genetic approach to the automatic clustering problem," Pattern Recognit., vol. 34, pp. 415–424, 2001.
[274] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[275] J. Venter et al., "The sequence of the human genome," Science, vol. 291, pp. 1304–1351, 2001.
[276] J. Vesanto and E. Alhoniemi, "Clustering of the self-organizing map," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 586–600, May 2000.
[277] K. Wagstaff, S. Rogers, and S. Schroedl, "Constrained K-means clustering with background knowledge," in Proc. 8th Int. Conf. Machine Learning, 2001, pp. 577–584.
[278] C. Wallace and D. Dowe, "Intrinsic classification by MML—The SNOB program," in Proc. 7th Australian Joint Conf. Artificial Intelligence, 1994, pp. 37–44.
[279] H. Wang, W. Wang, J. Yang, and P. Yu, "Clustering by pattern similarity in large data sets," in Proc. ACM SIGMOD Int. Conf. Management of Data, 2002, pp. 394–405.
[280] C. Wei, Y. Lee, and C. Hsu, "Empirical comparison of fast clustering algorithms for large data sets," in Proc. 33rd Hawaii Int. Conf. System Sciences, Maui, HI, 2000, pp. 1–10.
[281] J. Williamson, "Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps," Neural Netw., vol. 9, no. 5, pp. 881–897, 1996.
[282] M. Windham and A. Cutler, "Information ratios for validating mixture analysis," J. Amer. Statist. Assoc., vol. 87, pp. 1188–1192, 1992.
[283] S. Wu, A. W.-C. Liew, H. Yan, and M. Yang, "Cluster analysis of gene expression data based on self-splitting and merging competitive learning," IEEE Trans. Inf. Technol. Biomed., vol. 8, no. 1, pp. 5–15, Jan. 2004.
[284] D. Wunsch, "An optoelectronic learning machine: Invention, experimentation, analysis of first hardware implementation of the ART1 neural network," Ph.D. dissertation, Univ. Washington, Seattle, WA, 1991.
[285] D. Wunsch, T. Caudell, C. Capps, R. Marks, and R. Falk, "An optoelectronic implementation of the adaptive resonance neural network," IEEE Trans. Neural Netw., vol. 4, no. 4, pp. 673–684, Jul. 1993.
[286] Y. Xiong and D. Yeung, "Mixtures of ARMA models for model-based time series clustering," in Proc. IEEE Int. Conf. Data Mining, 2002, pp. 717–720.
[287] R. Xu, G. Anagnostopoulos, and D. Wunsch, "Tissue classification through analysis of gene expression data using a new family of ART architectures," in Proc. Int. Joint Conf. Neural Networks (IJCNN'02), vol. 1, 2002, pp. 300–304.
[288] Y. Xu, V. Olman, and D. Xu, "Clustering gene expression data using graph-theoretic approach: An application of minimum spanning trees," Bioinformatics, vol. 18, no. 4, pp. 536–545, 2002.
[289] R. Yager, "Intelligent control of the hierarchical agglomerative clustering process," IEEE Trans. Syst., Man, Cybern., vol. 30, no. 6, pp. 835–845, 2000.
[290] R. Yager and D. Filev, "Approximate clustering via the mountain method," IEEE Trans. Syst., Man, Cybern., vol. 24, no. 8, pp. 1279–1284, 1994.
[291] K. Yeung, D. Haynor, and W. Ruzzo, "Validating clustering for gene expression data," Bioinformatics, vol. 17, no. 4, pp. 309–318, 2001.
[292] F. Young and R. Hamer, Multidimensional Scaling: History, Theory, and Applications. Hillsdale, NJ: Lawrence Erlbaum, 1987.
[293] L. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, pp. 338–353, 1965.
[294] J. Zhang and Y. Leung, "Improved possibilistic C-means clustering algorithms," IEEE Trans. Fuzzy Syst., vol. 12, no. 2, pp. 209–217, Apr. 2004.
[295] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. ACM SIGMOD Conf. Management of Data, 1996, pp. 103–114.
[296] Y. Zhang and Z. Liu, "Self-splitting competitive learning: A new on-line clustering paradigm," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 369–380, Mar. 2002.
[297] X. Zhuang, Y. Huang, K. Palaniappan, and Y. Zhao, "Gaussian mixture density modeling, decomposition, and applications," IEEE Trans. Image Process., vol. 5, no. 9, pp. 1293–1302, Sep. 1996.

Rui Xu (S'00) received the B.E. degree in electrical engineering from Huazhong University of Science and Technology, Wuhan, Hubei, China, in 1997, and the M.E. degree in electrical engineering from Sichuan University, Chengdu, Sichuan, in 2000. He is currently pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Missouri-Rolla.
His research interests include machine learning, neural networks, pattern classification and clustering, and bioinformatics.
Mr. Xu is a Student Member of the IEEE Computational Intelligence Society, Engineering in Medicine and Biology Society, and the International Society for Computational Biology.

Donald C. Wunsch II (S'87–M'92–SM'94–F'05) received the B.S. degree in applied mathematics from the University of New Mexico, Albuquerque, and the M.S. degree in applied mathematics and the Ph.D. degree in electrical engineering from the University of Washington, Seattle.
He is the Mary K. Finley Missouri Distinguished Professor of Computer Engineering, University of Missouri-Rolla, where he has been since 1999. His prior positions were Associate Professor and Director of the Applied Computational Intelligence Laboratory, Texas Tech University, Lubbock; Senior Principal Scientist, Boeing; Consultant, Rockwell International; and Technician, International Laser Systems. He has well over 200 publications, and has attracted over $5 million in research funding. He has produced eight Ph.D. recipients—four in electrical engineering, three in computer engineering, and one in computer science.
Dr. Wunsch has received the Halliburton Award for Excellence in Teaching and Research, and the National Science Foundation CAREER Award. He served as a Voting Member of the IEEE Neural Networks Council, Technical Program Co-Chair for IJCNN'02, General Chair for IJCNN'03, International Neural Networks Society Board of Governors Member, and is now President of the International Neural Networks Society.