
European Journal of Operational Research 135 (2001) 413–427

www.elsevier.com/locate/dsw

Theory and Methodology

Genetic clustering algorithms

Yu-Chiun Chiou a, Lawrence W. Lan b,*

a Aviation and Maritime Management Department, Chang Jung Christian University, Tainan 711, Taiwan, ROC
b Institute of Traffic and Transportation, National Chiao Tung University, 4F, 114 Sec. 1, Chung-Hsiao W. Rd., Taipei 10012, Taiwan, ROC

Received 15 November 1999; accepted 20 November 2000

* Corresponding author. Tel.: +886-2-2311-0094; fax: +886-2-2331-2160; e-mail: [email protected]; http://www.itt.nctu.edu.tw (L.W. Lan).

Abstract

This study employs genetic algorithms to solve clustering problems. Three models, SICM, STCM and CSPM, are developed according to different coding/decoding techniques. The effectiveness and efficiency of these models under varying problem sizes are analyzed in comparison to a conventional statistics clustering method (the agglomerative hierarchical clustering method). The results for small scale problems (10–50 objects) indicate that CSPM is the most effective but least efficient method, STCM is second most effective and efficient, and SICM is least effective because of its long chromosome. The results for medium-to-large scale problems (50–200 objects) indicate that CSPM is still the most effective method. Furthermore, we have applied CSPM to solve an exemplified p-Median problem. The good results demonstrate that CSPM is highly applicable. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Genetic algorithms; Clustering; p-Median problem

1. Introduction

Clustering, so-called set partitioning, is a basic and widely applied methodology. Application fields include statistics, mathematical programming (such as location selection, network partitioning, routing, scheduling and assignment problems) and computer science (including pattern recognition, learning theory, image processing and computer graphics). Clustering mainly groups all objects into several mutually exclusive clusters in order to achieve the maximum or minimum of an objective function. Clustering rapidly becomes computationally intractable as the problem scale increases, because of the combinatorial character of the method. Brucker [7] and Welch [34] proved that, for specific objective functions, clustering becomes an NP-hard problem when the number of clusters exceeds 3. Even the best algorithms developed for some specific objective functions exhibit complexities of $O(N^3 \log N)$ or $O(N^3)$ [15], leaving much room for improvement. The heuristic algorithms for clustering can be divided into four categories: conventional statistics clustering, mathematical programming, network programming, and genetic algorithms (GAs).

The algorithms for conventional statistics clustering [3,14,18,25,33] include the agglomerative hierarchical clustering method and K-means. The algorithms for mathematical programming [8,11,17,27,28,30–32] range from dynamic programming, Lagrangian relaxation, linear relaxation, column generation and branch-and-price to Lipschitz continuous methods. The algorithms for network programming [2,12] include graph theoretic relaxations and network relaxation. GA-based algorithms have developed rapidly in recent years [4,6,10,19,20,22–24,26], including the group-numbers encoding method (e.g. binary code, Boolean matching code), the group-separators encoding method and the evolution program method.

While the aforementioned studies have proposed ways to solve clustering problems, two main research gaps still remain. First, the number of clusters must be subjectively determined in advance; it cannot be determined simultaneously by the model. Therefore, the above studies involve a complex procedure that exhaustively compares the optimum clustering for every given number of clusters and then selects the number of clusters with the best objective value. Exceptions to this gap are the studies of Lozano et al. [22] and Lunchian et al. [24], which only solved the optimal number of clusters without developing an explicit algorithm for the assignment problem. Second, most of the non-GA based algorithms are limited in application. They are proposed under a specific form of the objective function such as a convex function, or under the assumption that the feasible set is a convex hull, or with the help of additional information such as the gradient of the objective function.

GAs, first proposed by Holland [16], are general-purpose search algorithms that have the characteristics of stochastic search, multi-point search, direct search and parallel search. Related articles have proved the effectiveness and efficiency of GAs in application to combinatorial optimization problems [5,9,13,21,29]. Directly using the fitness to evaluate the chromosomes, GAs can be applied to various objective functions without a need for additional information in the search. This study attempts to develop coding/decoding techniques for GAs to solve simultaneously for the optimal number of clusters and the optimal clustering result, in comparison to the conventional statistics clustering method (the agglomerative hierarchical clustering method).

2. Mathematical model of clustering

The mathematical model of clustering for a given number (m) of clusters is

$$[\mathrm{CA}_m] \quad \max\; F(X) \tag{1}$$

subject to

$$\sum_j X_{ij} = 1 \quad \text{all } i, \tag{2}$$

$$\sum_j X_{jj} = m, \tag{3}$$

$$X_{ij} \le X_{jj} \quad \text{all } i, j, \tag{4}$$

$$X_{ij} = \{0, 1\} \quad \text{all } i, j, \tag{5}$$

where $X_{ij} = 1$ denotes that the ith object is assigned to the jth cluster, $X_{ij} = 0$ otherwise, $i, j = \{1, 2, \ldots, N\}$; N is the number of objects, m is the number of clusters, and $F(X)$ is the objective function. In the application field of statistics, $F(X)$ can be generally defined as [30]:

$$F(X)^d = \min_{P_m \in \pi_m} \left\| \{ d(S_1), \ldots, d(S_m) \} \right\|_p, \tag{6}$$

$$F(X)^r = \max_{P_m \in \pi_m} \left\| \{ r(S_i, S_j);\ 1 \le i < j \le m \} \right\|_p, \tag{7}$$

$$F(X)^s = \max_{P_m \in \pi_m} \left\| \{ s(S_1), \ldots, s(S_m) \} \right\|_p, \tag{8}$$

where $S_i$ is the ith cluster, $i = 1, \ldots, m$; $P_m$ is a clustering result of m-clustering, $P_m = \{S_1, \ldots, S_m\}$; and $\pi_m$ is the set of all possible clustering results of m-clustering. $F(X)^d$ is the diameter function of clustering, where $d(S_j) = \max_{O_k, O_l \in S_j} d_{kl}$ is the diameter of the jth cluster; $F(X)^r$ is the distance function of clustering, where $r(S_i, S_j) = \min_{O_k \in S_i,\, O_l \in S_j} d_{kl}$ is the distance or discrimination between the ith and jth clusters; $F(X)^s$ is the split function of clustering, where $s(S_j) = \min_{O_k \in S_j,\, O_l \notin S_j} d_{kl}$ is the shortest distance between the jth cluster and the other clusters. $\|\cdot\|_p$ represents the $l_p$-norm, $\{\cdot\}$ represents the vector of diameters or distances, $O_k, O_l$ are the kth and lth objects, respectively, and $d_{kl}$ is the distance between objects $O_k$ and $O_l$.

The total number of feasible solutions of $[\mathrm{CA}_m]$ is

$$|\pi_m| = \frac{1}{m!} \sum_{j=0}^{m} (-1)^{m-j} \binom{m}{j} j^N$$

[1]. There are a total of 511 feasible solutions for 10 objects (N = 10) to be divided into 2 clusters (m = 2), and a total of 42,525 feasible solutions for 10 objects to be divided into 5 clusters.

If the number of clusters (m) is not given exogenously, then $[\mathrm{CA}_m]$ without constraint (3) becomes [CA], that is:

$$[\mathrm{CA}] \quad \max\; F(X) \tag{9}$$

subject to

$$\sum_j X_{ij} = 1 \quad \text{all } i, \tag{10}$$

$$X_{ij} \le X_{jj} \quad \text{all } i, j, \tag{11}$$

$$X_{ij} = \{0, 1\} \quad \text{all } i, j. \tag{12}$$

The number of feasible solutions of [CA] is $|\pi| = \sum_{m=1}^{N} |\pi_m|$. There are 52 feasible solutions at N = 5, 113,608 at N = 10 and $1.99 \times 10^7$ at N = 15. Obviously, the complexity of [CA] is exponential in the problem size.
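The count $|\pi_m|$ is the Stirling number of the second kind. As a quick illustration (ours, not part of the original paper), the formula reproduces the counts quoted above:

```python
from math import comb, factorial

def partitions_into_m(n: int, m: int) -> int:
    """|pi_m|: the number of ways to split n objects into m non-empty
    clusters (Stirling number of the second kind)."""
    return sum((-1) ** (m - j) * comb(m, j) * j ** n
               for j in range(m + 1)) // factorial(m)

print(partitions_into_m(10, 2))   # 511   (N = 10, m = 2)
print(partitions_into_m(10, 5))   # 42525 (N = 10, m = 5)
print(sum(partitions_into_m(5, m) for m in range(1, 6)))   # 52 (|pi| at N = 5)
```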
3. Genetic algorithms

Genetic algorithms are general-purpose search algorithms that use principles inspired by natural population genetics to evolve solutions to problems. The basic idea of genetic algorithms is to maintain a population of chromosomes that represent candidate solutions. A chromosome is composed of a series of genes that represent decision variables or parameters. Each member of the population is evaluated and assigned a measure of its fitness as a solution. There are three genetic operators: selection, crossover and mutation [13].

The first operator, selection, assigns reproduction probabilities to chromosomes based on their fitness. A Monte Carlo wheel is often employed; that is, the higher the fitness of a chromosome, the more likely it is to be selected.

The second operator, crossover, combines the features of two parent structures to form two offspring. The simplest way to perform a crossover is to swap a corresponding segment of the parents. One-point crossover, two-point crossover and uniform crossover are often employed.

The last operator, mutation, alters one or more genes of an offspring with a very low probability to avoid being trapped in a local optimum. The resulting offspring is then evaluated and inserted back into the population. This process continues until a predetermined criterion (e.g. a maximum number of generations, a minimum fitness improvement between two adjacent generations or a certain mature rate) is reached.

4. Problem formulation

The effectiveness and efficiency of GAs vary with the coding/decoding technique. This study proposes three coding/decoding techniques for GAs to solve clustering problems: the simultaneously clustering method (SICM), the stepwise clustering method (STCM) and the cluster seed points method (CSPM). These models are then compared with a conventional statistics clustering model, the agglomerative hierarchical clustering method (AHCM). The details of these four models are described as follows.

4.1. Agglomerative hierarchical clustering method (AHCM)

AHCM involves a series of successive merges. Initially, there are as many clusters as objects. These initial groups are merged according to their degree of improvement in the objective values. Eventually, all subgroups are fused into a single cluster [33]. The following are the steps in AHCM for grouping N objects in a maximizing problem, for example:
Step 0. Start with N clusters, each containing a single object, that is, $S_i = \{O_i\}$, $i = 1, \ldots, N$, and an $N \times N$ symmetric matrix of increments of the objective function, $M_F = \{\Delta F_{ij};\ i, j = 1, \ldots, N;\ i \neq j\}$, where $\Delta F_{ij}$ represents the increment of the objective value when the ith cluster and the jth cluster are fused into a single cluster. Let $k = 1$.

Step 1. If $\Delta F_{vu} = \max\{\Delta F_{ij};\ i, j = 1, \ldots, N - k;\ i \neq j\}$ and $v > u$, then let $S_u^{(k+1)} = S_u^{(k)} \cup S_v^{(k)}$, $S_1^{(k+1)} = S_1^{(k)}, \ldots, S_{u-1}^{(k+1)} = S_{u-1}^{(k)}$, $S_{u+1}^{(k+1)} = S_{u+1}^{(k)}, \ldots, S_v^{(k+1)} = S_{v+1}^{(k)}, \ldots, S_{N-k-1}^{(k+1)} = S_{N-k}^{(k)}$ and $S_{N-k}^{(k+1)} = \emptyset$. Calculate the objective value $F(X)^{(k)}$ of the partition. Let $k = k + 1$.

Step 2. Repeat Step 1 until $k = N - 1$. Then $F(X)^* = \max\{F(X)^{(k)};\ k = 1, \ldots, N - 1\}$.
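The merge loop is easy to prototype. The following is a minimal sketch of AHCM (ours, not the authors' code); `objective` is a placeholder callable that scores a whole partition, and for brevity the sketch re-evaluates the objective for each candidate merge rather than maintaining the incremental matrix $M_F$:

```python
def ahcm(objects, objective):
    """Agglomerative hierarchical clustering: start from singletons, fuse the
    pair of clusters whose merge scores best, keep the best level seen."""
    clusters = [[o] for o in objects]                 # Step 0: N singletons
    best_value = objective(clusters)
    best_partition = [c[:] for c in clusters]
    while len(clusters) > 1:                          # Steps 1-2: N-1 merges
        candidates = [
            (objective(clusters[:u] + [clusters[u] + clusters[v]]
                       + clusters[u + 1:v] + clusters[v + 1:]), u, v)
            for u in range(len(clusters))
            for v in range(u + 1, len(clusters))
        ]
        value, u, v = max(candidates)                 # best fusion at this level
        clusters = (clusters[:u] + [clusters[u] + clusters[v]]
                    + clusters[u + 1:v] + clusters[v + 1:])
        if value > best_value:
            best_value, best_partition = value, [c[:] for c in clusters]
    return best_value, best_partition
```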
4.2. Simultaneously clustering method (SICM)

If there must be at least two objects to form a cluster, the maximal number of clusters, K, is equal to [N/2] ([ ] is the Gauss sign). There are then altogether $N \times K$ decision variables in problem [CA], as shown in Table 1. If two or more variables have the value 1 in the same column, the corresponding objects form a cluster. However, if each $X_{ik}$ is encoded as a gene, the chromosome becomes too long (for instance, the number of decision variables for 10 objects is 50, for 20 objects is 200, and for 60 objects is 1800), resulting in an insufficiency of computer memory. In addition, it is difficult to handle the constraint that one and only one variable equals 1 (and all others equal 0) in the same row.

Table 1
Relationship between N objects and K clusters

Object | Cluster 1 | Cluster 2 | ... | Cluster k | ... | Cluster K
1      | X11 | X12 | ... | X1k | ... | X1K
2      | X21 | X22 | ... | X2k | ... | X2K
...    |
i      | Xi1 | Xi2 | ... | Xik | ... | XiK
...    |
N      | XN1 | XN2 | ... | XNk | ... | XNK

In order to deal with this problem, SICM uses a coding/decoding technique that replaces each row of the decision variable matrix with a shorter gene string. Take 17 objects for example: there are at most 8 partition sets (8 = [17/2]). Because each object may be assigned to any set, these clusters require three genes to represent them, as shown in Table 2.

Table 2
Matching rules for gene strings, integers, and clusters

Gene string | Integer | Cluster
000 | 0 | 1
001 | 1 | 2
010 | 2 | 3
011 | 3 | 4
100 | 4 | 5
101 | 5 | 6
110 | 6 | 7
111 | 7 | 8

Replacing each row with these three genes not only curtails the length of the chromosome (four genes can represent a problem of 33 objects, five genes a problem of 65 objects) but also avoids the problem that an object might be assigned to several clusters or be unassigned. Table 3 illustrates a feasible clustering result for these 17 objects. The chromosome of Table 3 is composed of 51 genes (000001011000101101010000001000000011011000000101010). Every three genes of the chromosome are decoded sequentially into an integer of 0–7, representing the cluster into which each object is assigned according to the matching rules stated in Table 2. After being decoded, the chromosome represents that five clusters are formed; the clusters consist of 6, 2, 2, 3, 3 objects, respectively.
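To make the decoding concrete, here is a small sketch (ours) that maps a gene string to cluster numbers according to Table 2:

```python
def decode_sicm(chromosome: str, genes_per_object: int = 3) -> list[int]:
    """Decode a 0/1 gene string into one cluster number per object,
    following the binary-to-integer matching rules of Table 2."""
    return [int(chromosome[i:i + genes_per_object], 2) + 1   # clusters numbered from 1
            for i in range(0, len(chromosome), genes_per_object)]

# The 51-gene chromosome of Table 3 (17 objects, 3 genes each):
chromosome = "000001011000101101010000001000000011011000000101010"
print(decode_sicm(chromosome))
# [1, 2, 4, 1, 6, 6, 3, 1, 2, 1, 1, 4, 4, 1, 1, 6, 3] -> five non-empty clusters
```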
Table 3
Relationship between objects and clusters (17 objects for example)

Object | Membership in clusters 1-8 | Encoding
1  | 1 0 0 0 0 0 0 0 | 000
2  | 0 1 0 0 0 0 0 0 | 001
3  | 0 0 0 1 0 0 0 0 | 011
4  | 1 0 0 0 0 0 0 0 | 000
5  | 0 0 0 0 0 1 0 0 | 101
6  | 0 0 0 0 0 1 0 0 | 101
7  | 0 0 1 0 0 0 0 0 | 010
8  | 1 0 0 0 0 0 0 0 | 000
9  | 0 1 0 0 0 0 0 0 | 001
10 | 1 0 0 0 0 0 0 0 | 000
11 | 1 0 0 0 0 0 0 0 | 000
12 | 0 0 0 1 0 0 0 0 | 011
13 | 0 0 0 1 0 0 0 0 | 011
14 | 1 0 0 0 0 0 0 0 | 000
15 | 1 0 0 0 0 0 0 0 | 000
16 | 0 0 0 0 0 1 0 0 | 101
17 | 0 0 1 0 0 0 0 0 | 010
Subtotal | 6 2 2 3 0 3 0 0 |

4.3. Stepwise clustering method (STCM)

STCM successively solves the optimal binary clustering of a cluster until the objective value cannot be further improved. An initial single cluster containing all objects is divided into two subgroups such that the objective function is optimized at this stage. Through each binary clustering process, a cluster is divided into two subgroups. A cluster is called fathomed when it cannot be further binary clustered to improve the objective value. This concept is similar to that of branch-and-bound. When all clusters are fathomed, STCM has attained the optimal clustering. The framework of the model is depicted in Fig. 1.

Fig. 1. Framework of STCM.

The following are the algorithms for STCM under the depth-first principle, which fathoms one branch at a time.
Step 0. Let $S^{(0)}$ stand for the cluster containing all objects. The problem of optimally dividing $S^{(0)}$ into two subgroups, namely $\bar{S}^{(0)}$ and $\bar{\bar{S}}^{(0)}$, can be formulated as the following 0–1 mathematical programming:

$$\mathrm{MP}^{(0)}: \quad \max\; F(X) \tag{13}$$

subject to

$$X_i = \{0, 1\}, \quad i = 1, \ldots, |S^{(0)}|, \tag{14}$$

where $X_i = 1$ denotes that the ith object of $S^{(0)}$ is grouped in the cluster $\bar{S}^{(0)}$, and $X_i = 0$ denotes that the ith object of $S^{(0)}$ is grouped in the cluster $\bar{\bar{S}}^{(0)}$. $X_i$ is encoded as a gene of the chromosome (the length of the chromosome is $|S^{(0)}|$); then GAs are employed to solve $\mathrm{MP}^{(0)}$ by maximizing $F(X)^{(0)}$ to attain the optimal binary clustering: $\bar{S}^{(0)} = \{O_i \mid X_i = 1,\ O_i \in S^{(0)}\}$ and $\bar{\bar{S}}^{(0)} = \{O_i \mid X_i = 0,\ O_i \in S^{(0)}\}$.

Step 1. Let $S^{(1)} = \bar{S}^{(0)}$ and renumber the objects of $S^{(1)}$. Formulate the optimal binary clustering problem of $S^{(1)}$ as $\mathrm{MP}^{(1)}$, which is also solved by GAs. $F(X)^{(1)}$ is the objective value of the optimal binary clustering of $S^{(1)}$ under the assumption that the other cluster ($\bar{\bar{S}}^{(0)}$) remains unchanged. The clustering result is $\bar{S}^{(1)} = \{O_i \mid X_i = 1,\ O_i \in S^{(1)}\}$ and $\bar{\bar{S}}^{(1)} = \{O_i \mid X_i = 0,\ O_i \in S^{(1)}\}$. Three clusters are formed: $\bar{\bar{S}}^{(0)}$, $\bar{S}^{(1)}$ and $\bar{\bar{S}}^{(1)}$. $F(X)^{(1)}$ is the objective value of these three clusters.

Step 2. Let $S^{(i)} = \bar{S}^{(i-1)}$ and solve $\mathrm{MP}^{(i)}$ by GAs.

Step 3. Repeat Step 2 until $\bar{S}^{(k)} = \emptyset$; then this branch is fathomed. There is a total of $k + 1$ clusters, that is, $\bar{\bar{S}}^{(0)}, \bar{\bar{S}}^{(1)}, \ldots, \bar{\bar{S}}^{(k-1)}$ and $S^{(k)}$; $S^{(k)}$ can no longer be divided in the following steps. $F(X)^{(k)}$ is the optimal objective value of these $k + 1$ clusters.

Step 4. Choose one of the remaining branches to be binarily clustered by repeating Steps 2 and 3 until it is fathomed.

Step 5. If all branches are fathomed, then stop; the clusters formed are the optimal clustering result of STCM. Otherwise, go to Step 4.

In comparison to SICM, the coding/decoding of STCM is much simpler because the length of the chromosome is largely curtailed and can be further reduced over the stages of the optimization. Consider N objects for instance: let $|S^{(0)}| = N$ denote the N objects in set $S^{(0)}$; the length of the chromosome at stage 0 is N. If $|\bar{\bar{S}}^{(0)}| = L_0$, the length of the chromosome at stage 1 is $N - L_0$. If $|\bar{\bar{S}}^{(1)}| = L_1$, the length of the chromosome at stage 2 can be further shortened to $N - L_0 - L_1$, and so forth.
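A rough sketch of this stepwise scheme is given below (ours; the paper solves each $\mathrm{MP}^{(i)}$ with a GA, abstracted here as a hypothetical `solve_binary_split` callable that returns the best 0/1 gene assignment for one cluster given the rest of the partition):

```python
def stcm(objects, objective, solve_binary_split):
    """Stepwise clustering: bisect clusters depth first while a binary split
    improves the overall objective; a cluster whose best split no longer
    improves the objective is fathomed and kept as a final cluster."""
    fathomed, open_branches = [], [list(objects)]
    while open_branches:
        cluster = open_branches.pop()          # depth first: one branch at a time
        if len(cluster) < 2:
            fathomed.append(cluster)
            continue
        genes = solve_binary_split(cluster, fathomed + open_branches, objective)
        ones = [o for o, g in zip(cluster, genes) if g == 1]
        zeros = [o for o, g in zip(cluster, genes) if g == 0]
        unsplit = objective(fathomed + open_branches + [cluster])
        split = objective(fathomed + open_branches + [ones, zeros])
        if ones and zeros and split > unsplit:
            open_branches.extend([zeros, ones])   # continue on the X_i = 1 side first
        else:
            fathomed.append(cluster)              # fathomed: no improving split
    return fathomed
```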
4.4. Cluster seed points method (CSPM)

CSPM first employs GAs to select the most suitable cluster seeds from all objects, then assigns the rest of the objects to each cluster according to their similarity to the cluster seed or to their degree of improvement of the objective function. The number of cluster seeds represents the number of clusters, and the characteristics of these cluster seeds determine the clustering result. The framework of CSPM is depicted in Fig. 2.

The following are the steps of the assignment algorithm in Fig. 2.

Step 0. Let $k = 1$ and let S be the set of all objects, that is, $S = \{O_1, \ldots, O_N\}$. $CP_m$ is a set of cluster seeds, that is, $CP_m = \{c_1, \ldots, c_m\}$. NP is the set of non-cluster seeds, that is, $NP = S - CP_m$. Let $S_j = \{c_j\}$, $j = 1, \ldots, m$.

Step 1. Let $O_k$ denote the kth object of NP. If
$$F(S_1^{(k)}, \ldots, S_{j-1}^{(k)}, S_j^{(k)} \cup \{O_k\}, S_{j+1}^{(k)}, \ldots, S_m^{(k)}) = \max_i \left\{ F(S_1^{(k)}, \ldots, S_{i-1}^{(k)}, S_i^{(k)} \cup \{O_k\}, S_{i+1}^{(k)}, \ldots, S_m^{(k)}) \right\},$$
then $O_k$ is assigned to the jth cluster.

Step 2. Let $S_j^{(k)} = S_j^{(k-1)} \cup \{O_k\}$ and $k = k + 1$. If $k < N - m + 1$, return to Step 1; otherwise terminate.

Once the clustering result, $P_m$, is obtained, the objective value $F(X)$, which represents the fitness of this set of cluster seeds, is also determined. CSPM then employs GAs to search for the optimal cluster seeds by encoding variables $X_i$ as genes representing the related objects, where $X_i = 1$ denotes that the ith object is chosen as a cluster seed, and $X_i = 0$ otherwise.
Fig. 2. Framework of CSPM.
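A compact sketch of the assignment algorithm (ours, assuming an `objective` callable that scores a partition) makes the greedy nature of Steps 0–2 explicit; the outer GA that proposes candidate seed sets is omitted:

```python
def cspm_assign(objects, seeds, objective):
    """Greedy CSPM assignment: start one cluster per seed, then put each
    remaining object into the cluster whose augmented partition scores best."""
    clusters = [[seed] for seed in seeds]                  # Step 0
    for obj in (o for o in objects if o not in seeds):     # Steps 1 and 2
        scores = [objective(clusters[:j] + [clusters[j] + [obj]] + clusters[j + 1:])
                  for j in range(len(clusters))]
        best_j = max(range(len(clusters)), key=scores.__getitem__)
        clusters[best_j].append(obj)
    return clusters, objective(clusters)   # the partition and its fitness
```

The returned fitness is what the outer GA would use to compare candidate seed strings.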

5. Computational experiments

5.1. Experimental design

A random number generator is used to generate 200 two-dimensional objects $(a_i, b_i)$, $i = 1, \ldots, 200$, as shown in Fig. 3. All objects with coordinates $(a_i, b_i)$ are uniformly distributed within the square whose four corners have coordinates (0, 0), (0, 1), (1, 0) and (1, 1). In order to analyze the effectiveness and efficiency of the four models under varying problem sizes, we use only the first 50 objects in testing small problems (10–50 objects) and use all the objects in testing medium-to-large problems (50–200 objects).

Fig. 3. Distribution of 200 two-dimensional objects.
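A one-line data generator reproduces this setup (a sketch; the seed value is arbitrary and not from the paper):

```python
import random

random.seed(2001)  # arbitrary seed, for reproducibility only
objects = [(random.random(), random.random()) for _ in range(200)]  # (a_i, b_i) in the unit square
small_scale = objects[:50]   # the first 50 objects, used for the 10-50 object tests
```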
Generally, minimizing the sum of squared errors is chosen as the objective function for clustering problems. However, it is only applicable to problems in which the number of clusters is specified. Employing this objective function to solve for the optimal number of clusters would optimally result in N clusters, such that the sum of squared errors is 0. Nevertheless, in this study, the objective function is to maximize the ratio of between-clusters variability to within-clusters variability, to determine simultaneously the optimal number of clusters and the optimal result of clustering [15]:
$$F = \frac{N - m}{m - 1} \cdot \frac{\sum_{k=1}^{m} n_k \left[ (\bar{a}_k - \bar{\bar{a}})^2 + (\bar{b}_k - \bar{\bar{b}})^2 \right]}{\sum_{k=1}^{m} \sum_{i=1}^{n_k} \left[ (a_i - \bar{a}_k)^2 + (b_i - \bar{b}_k)^2 \right]}, \tag{15}$$

where $\bar{a}_k = (1/n_k) \sum_{i=1}^{n_k} a_i$, $\bar{b}_k = (1/n_k) \sum_{i=1}^{n_k} b_i$, $\bar{\bar{a}} = (1/N) \sum_{i=1}^{N} a_i$, $\bar{\bar{b}} = (1/N) \sum_{i=1}^{N} b_i$, and $n_k$ is the number of objects in the kth cluster, that is, $n_k = |S_k|$. In order to avoid clusters with only one object, a penalty of $J \times M$ (J is the number of one-object clusters; M is a big number) is added to the objective value.
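A direct transcription of Eq. (15) follows (our sketch; it assumes $2 \le m < N$, applies the one-object penalty by subtraction on the reading that the "added" penalty is meant to worsen the maximized objective, and uses an arbitrary value for the big number M, which the paper does not specify):

```python
BIG_M = 1e6  # the "big number" M; its exact value is not given in the paper

def fitness(clusters):
    """Variance-ratio objective of Eq. (15) for clusters of (a, b) points,
    penalized for one-object clusters. Assumes 2 <= m < N."""
    points = [p for cluster in clusters for p in cluster]
    n, m = len(points), len(clusters)
    grand_a = sum(a for a, _ in points) / n
    grand_b = sum(b for _, b in points) / n
    between = within = 0.0
    for cluster in clusters:
        nk = len(cluster)
        mean_a = sum(a for a, _ in cluster) / nk
        mean_b = sum(b for _, b in cluster) / nk
        between += nk * ((mean_a - grand_a) ** 2 + (mean_b - grand_b) ** 2)
        within += sum((a - mean_a) ** 2 + (b - mean_b) ** 2 for a, b in cluster)
    singletons = sum(len(cluster) == 1 for cluster in clusters)
    return (n - m) / (m - 1) * between / within - singletons * BIG_M
```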
The mechanism of the genetic algorithms in these three models is set as follows: a population of 100 per generation, roulette wheel selection, two-point crossover at a rate of 1.00 and gene mutation at a rate of 0.01. The stopping rule is preset as a mature rate reaching 80%; that is, the GAs continue to evolve until over 80% of the chromosomes have the same fitness in an epoch. Due to the stochastic characteristics of GAs, the following empirical comparisons of our proposed methods are analyzed using hypothesis tests on the results obtained from 30 different executions.
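A sketch of these settings on binary chromosomes is shown below (ours, not the authors' implementation; `fitness_of` is a hypothetical callable, and the roulette wheel assumes non-negative fitness values):

```python
import random

POP_SIZE, MUTATION_RATE, MATURE_RATE = 100, 0.01, 0.80   # crossover rate is 1.00

def roulette_select(population, fitnesses):
    """Monte Carlo wheel: selection probability proportional to fitness."""
    return random.choices(population, weights=fitnesses, k=2)

def two_point_crossover(mom, dad):
    """Swap the segment between two random cut points."""
    i, j = sorted(random.sample(range(len(mom)), 2))
    return mom[:i] + dad[i:j] + mom[j:], dad[:i] + mom[i:j] + dad[j:]

def mutate(genes):
    """Flip each gene independently with a low probability."""
    return [1 - g if random.random() < MUTATION_RATE else g for g in genes]

def evolve(population, fitness_of, max_generations=1000):
    """Evolve until over 80% of the chromosomes share the same fitness."""
    best = population[0]
    for _ in range(max_generations):
        fitnesses = [fitness_of(c) for c in population]
        best = population[fitnesses.index(max(fitnesses))]
        tally = {}
        for f in fitnesses:
            tally[f] = tally.get(f, 0) + 1
        if max(tally.values()) >= MATURE_RATE * len(population):
            break                                    # mature rate reached
        offspring = []
        while len(offspring) < POP_SIZE:
            mom, dad = roulette_select(population, fitnesses)
            offspring.extend(mutate(kid) for kid in two_point_crossover(mom, dad))
        population = offspring[:POP_SIZE]
    return best
```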
5.2. The results

5.2.1. Small scale problems (N ≤ 50)

Table 4 and Fig. 4 summarize the objective values solved by AHCM, SICM, STCM and CSPM under various problem sizes. The objective values solved by SICM are all significantly (at the 5% level of significance) inferior to those of AHCM, showing the ineffectiveness of SICM. At N = 10, 20 and 30, the objective values solved by STCM are not significantly different from those of AHCM, but at N = 40 and 50 STCM is more effective than AHCM. The objective values solved by CSPM are not significantly different from those of AHCM at N = 10 and 20, but CSPM is more effective than AHCM at N = 30, 40 and 50. For further comparison between STCM and CSPM, we test the null hypothesis H0: F̄2 = F̄3 against the alternative hypothesis H1: F̄2 < F̄3 at N = 10, 20, 30, 40 and 50, respectively. The corresponding Z values are 0.00, 0.69, 13.45, 16.70 and 11.17, indicating that CSPM is significantly more effective than STCM at N = 30, 40 and 50. The testing results show that CSPM is the most effective method, followed by STCM; SICM is the least effective due to its long chromosome (five times that of CSPM), which leads to the need to search a larger feasible space.

Table 4
Effectiveness of four methods (N ≤ 50)^a

Number of objects | AHCM F | SICM F̄1 (dF1), F̄1/F (Z1) | STCM F̄2 (dF2), F̄2/F (Z2) | CSPM F̄3 (dF3), F̄3/F (Z3)
10 | 13.06 | 12.45 (0.73), 0.95 (-4.63*) | 13.06 (0.00), 1.00 (0.00) | 13.06 (0.00), 1.00 (0.00)
20 | 26.95 | 11.38 (4.49), 0.42 (-18.97*) | 26.77 (0.69), 0.99 (-1.43) | 26.87 (0.36), 1.00 (-1.28)
30 | 28.70 | 13.83 (3.83), 0.48 (-21.25*) | 28.82 (2.81), 1.00 (0.22) | 36.40 (1.28), 1.27 (32.90*)
40 | 37.10 | 11.55 (4.44), 0.31 (-31.55*) | 38.72 (2.88), 1.04 (3.08*) | 48.99 (1.75), 1.32 (37.20*)
50 | 38.24 | 14.81 (4.50), 0.39 (-28.53*) | 51.18 (4.47), 1.34 (15.86*) | 61.01 (1.81), 1.60 (68.98*)

^a (1) F stands for the objective value of AHCM. (2) F̄i and dFi represent the means and standard deviations of the objective values solved by SICM, STCM and CSPM over 30 different executions, i = 1, 2, 3. (3) The effectiveness index of the ith method is defined as F̄i/F. (4) $Z_i = (\bar{F}_i - F)/(dF_i/\sqrt{30})$, which follows the standard Normal distribution (0, 1). (5) * denotes that the null hypothesis (H0: F̄i = F) is rejected at the 5% level of significance.
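As a quick check of the Z statistic defined in note (4) (our arithmetic, using the CSPM column at N = 30):

```python
from math import sqrt

F, mean_F3, sd_F3 = 28.70, 36.40, 1.28        # AHCM value, CSPM mean and s.d. at N = 30
Z3 = (mean_F3 - F) / (sd_F3 / sqrt(30))
print(round(Z3, 2))   # 32.95, matching the 32.90 of Table 4 up to rounding
```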
Fig. 4. Effectiveness of four methods (N ≤ 50).

Table 5
Efficiencies of four methods (N ≤ 50)^a

Number of objects | AHCM SS | SICM SS̄1 (dSS1), SS/SS̄1 (Z1) | STCM SS̄2 (dSS2), SS/SS̄2 (Z2) | CSPM SS̄3 (dSS3), SS/SS̄3 (Z3)
10 | 172 | 1987 (1193), 0.09 (9.06) | 2917 (418), 0.06 (36.08) | 11,372 (1724), 0.02 (36.08)
20 | 1347 | 7383 (2506), 0.18 (16.08) | 7533 (1020), 0.18 (17.06) | 133,743 (42,927), 0.01 (17.06)
30 | 4521 | 12,963 (3538), 0.35 (20.02) | 14,323 (2018), 0.32 (12.74) | 511,915 (220,156), 0.01 (12.74)
40 | 10,696 | 33,020 (10,250), 0.32 (17.62) | 20,970 (1776), 0.51 (15.14) | 1,257,755 (454,935), 0.01 (15.14)
50 | 20,871 | 68,033 (17,141), 0.31 (21.73) | 28,080 (2143), 0.74 (19.25) | 2,226,590 (633,469), 0.01 (19.25)

^a (1) SS stands for the number of solutions searched by AHCM. (2) SS̄i and dSSi represent the means and standard deviations of the number of solutions searched by SICM, STCM and CSPM over 30 different executions, i = 1, 2, 3. (3) The efficiency index of the ith method is defined as SS/SS̄i. (4) $Z_i = (\bar{SS}_i - SS)/(dSS_i/\sqrt{30})$, which follows the standard Normal distribution (0, 1). (5) * denotes that the null hypothesis (H0: SS̄i = SS) is rejected at the 5% level of significance.

Table 5 summarizes the number of solutions searched until the "optimal solution" is obtained by the four methods under various problem sizes. Obviously, AHCM, which has the least number of solutions searched, is the most efficient method. Further comparison between SICM and STCM is made by a two-tailed test of H0: SS̄1 = SS̄2 against H1: SS̄1 ≠ SS̄2. The Z values at N = 10, 20, 30, 40 and 50 are 4.03, 0.30, 1.83, -6.34 and -12.67, respectively, implicitly showing that SICM is more efficient than STCM at N = 10, not significantly different from STCM at N = 20 and 30, and less efficient than STCM at N = 40 and 50. Similarly, the results of two-tailed tests of H0: SS̄1 = SS̄3 against H1: SS̄1 ≠ SS̄3 and H0: SS̄2 = SS̄3 against H1: SS̄2 ≠ SS̄3 show that CSPM is the least efficient method.

Fig. 5 shows an optimal clustering result of CSPM at N = 50. The figure shows that, by maximizing the ratio of between-clusters variability to within-cluster variability, these 50 objects have been divided into 11 clusters. Each cluster contains 2–8 objects. Since the objects in the same clusters are adjacent to each other and no object is obviously grouped into a wrong cluster, the result of the clustering appears to be good.
Fig. 5. Optimal clustering result of CSPM (N = 50).

5.2.2. Medium-to-large scale problems (50 ≤ N ≤ 200)

If SICM were applied to larger scale problems, the length of the chromosome would be too long, saturating the computer memory. Thus we only analyze the clustering results of AHCM, STCM and CSPM, illustrated in Fig. 6. The figure shows clearly that CSPM is the most effective, especially for the larger scale problems. AHCM is slightly more effective than STCM; that is, STCM is more effective than AHCM for small scale problems (N ≤ 50), but less effective than AHCM for medium-to-large scale problems (50 ≤ N ≤ 200).

Fig. 6. Means and 95% upper/lower confidence intervals of STCM and CSPM (50 ≤ N ≤ 200).

6. Applications

CSPM can be applied to p-Median problems because of not only its effectiveness but also its search procedure (cluster seed points are chosen first, and the rest of the objects are assigned later). A p-Median problem is chosen as an example to examine the applicability of this method.

6.1. Problem statement

There are 25 districts uniformly distributed in a square area. Each link that connects two adjacent districts is 1 km long. We consider setting up several public facilities, such as hospitals, schools or fire departments, in some districts to serve all districts. More facilities represent a higher total set-up cost. However, if facilities are insufficient or set up in the wrong place, the total cost of inconvenience across all districts using the facilities will increase. Thus the problem of optimal locations
and serviceable areas of facilities can be formulated as follows:

$$[\mathrm{SC}] \quad \min\; Z = a \sum_j X_{jj} + b \sum_i \sum_j \left( |a_i - a_j| + |b_i - b_j| \right) X_{ij} \tag{16}$$

subject to

$$\sum_j X_{ij} = 1 \quad \text{all } i, \tag{17}$$

$$X_{ij} \le X_{jj} \quad \text{all } i, j,\ i = 1, \ldots, 25, \tag{18}$$

$$X_{ij} = \{0, 1\} \quad \text{all } i, j,\ i = 1, \ldots, 25,\ j = 1, \ldots, 25, \tag{19}$$

where Z represents the total cost function, a is the set-up cost of a public facility (dollars/facility), b is the unit distance cost for a district to access its public facility (dollars/kilometer), and $(a_i, b_i)$ is the coordinate of the ith district. $X_{jj} = 1$ denotes that the jth district has a public facility, $X_{jj} = 0$ otherwise; $X_{ij} = 1$ denotes that the ith district uses the public facility located at the jth district, $X_{ij} = 0$ otherwise. The mechanism of the GAs follows the set-ups in Section 5.1.
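The total cost of Eq. (16) is easy to evaluate for a candidate facility set. The sketch below (ours) places the 25 districts on a 5 × 5 unit-spaced grid and assigns each district to its nearest open facility by Manhattan distance, which satisfies constraints (17)–(19):

```python
def total_cost(facilities, a, b=1.0, side=5):
    """Eq. (16) on a side x side unit-spaced grid: set-up cost per facility
    plus the Manhattan-distance cost of each district to its nearest facility."""
    districts = [(x, y) for x in range(side) for y in range(side)]
    distance_cost = sum(min(abs(x - fx) + abs(y - fy) for fx, fy in facilities)
                        for x, y in districts)
    return a * len(facilities) + b * distance_cost

# One facility at the centre district, as in the a = 15 case:
print(total_cost([(2, 2)], a=15.0))   # 75.0 = 15 set-up + 60 distance (cf. Fig. 14)
```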

6.2. The results

Without loss of generality, let b = 1 dollar/km. Figs. 7–14 show the optimal designs of CSPM according to different set-up costs (a), ranging from 0.5 to 15 dollars/facility. In these figures, a circle denotes a district, a shadowed circle indicates that a public facility has been set up in that district, and the districts connected with links are all served by the same public facility.

Fig. 7. The optimal design at a = 0.5.
Fig. 8. The optimal design at a = 1.
Fig. 9. The optimal design at a = 2.
At a = 0.5 dollars/facility, the optimal number of facilities is 25; that is, each district has one public facility and the total distance cost is 0 (Fig. 7). At a = 3 dollars/facility, the optimal number of facilities is 6 and the total distance cost is 22 dollars (Fig. 10). At a = 15 dollars/facility, the optimal number of facilities is only one and the total distance cost reaches 60 dollars (Fig. 14). Fig. 15 further illustrates that as the set-up cost of a facility increases, the optimal number of facilities decreases and the total distance cost increases.

Fig. 10. The optimal design at a = 3.
Fig. 11. The optimal design at a = 4.
Fig. 12. The optimal design at a = 5.
Fig. 13. The optimal design at a = 10.
Fig. 14. The optimal design at a = 15.

6.3. Concluding remarks

This study discusses the effectiveness and efficiency of solving clustering problems by employing GAs. Varying the techniques of coding/decoding, we proposed the SICM, STCM and CSPM models and tested them with 200 two-dimensional objects. The results from small scale problems (10–50 objects) show that CSPM is the most effective but least efficient, STCM is second most effective and efficient, and SICM is least effective because of the long chromosome it requires. The results from medium-to-large problems (50–200 objects) indicate that CSPM is still the most effective method, while AHCM is slightly more effective than STCM. Both results show that CSPM can solve clustering problems effectively. CSPM can be easily applied to p-Median or location selection problems because of its effectiveness and its search procedure (cluster seed points are chosen first, and the rest of the objects are assigned later). CSPM is highly applicable, as evidenced by the reasonable results of the application to an exemplified p-Median problem.

Since the search space for CSPM is proportional to $2^N$ (the search space for N = 10 is $2^{10} = 1024$; for N = 100 it is $2^{100} \approx 1.27 \times 10^{30}$), the larger the scale of the problem, the less effectiveness of CSPM would be anticipated. Future studies can examine the feasibility of employing a hybrid GA model (combining other heuristic algorithms, such as simulated annealing) to further enhance its effectiveness and efficiency. Clustering problems solved by GAs on personal computers, however, involve the storage of a certain population of chromosomes, which is likely to cause an insufficiency of computer memory as the problem scale gets larger (say N > 200). Therefore, reducing the storage requirements is worthy of further investigation. Future studies can also compare the effectiveness and efficiency of our proposed methods with the GA-based algorithms mentioned in the references.

Fig. 15. Optimal number of facilities and total distance costs vs. facility set-up cost.
Acknowledgements

The authors are greatly indebted to two referees for their constructive comments. This paper is partially sponsored by the National Science Council of the Republic of China under contract number NSC88-2211-E-009-019.

References

[1] M. Abramowitz, I.A. Stegun (Eds.), Handbook of Mathematical Functions, National Bureau of Standards, Applied Mathematical Series 55, U.S. Department of Commerce, 1964.
[2] A.I. Ali, H. Thiagarajan, A network relaxation based enumeration algorithm for set partitioning, European Journal of Operational Research 38 (1989) 76–85.
[3] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[4] C. Alippi, R. Cucchiara, Cluster partitioning in image analysis classification: A genetic algorithm approach, in: Proceedings of the 1992 IEEE International Conference on Computer Systems and Software Engineering, 1992, pp. 423–427.
[5] H. Aytug, G. Koehler, J.L. Snowdon, Genetic learning of dynamic scheduling within a simulation environment, Computers and Operations Research 21 (1994) 909–925.
[6] J.N. Bhuyan, V.V. Raghavan, V.K. Elayavalli, Genetic algorithms with an ordered representation, in: Proceedings of the Fourth International Conference on Genetic Algorithms, 1991, pp. 408–415.
[7] P. Brucker, On the complexity of clustering problems, in: M. Beckmenn, H.P. Kunzi (Eds.), Optimization and Operations Research, Lecture Notes in Economics and Mathematical Systems, vol. 157, Springer, Berlin, 1978, pp. 45–54.
[8] D.G. Cattrysse, M. Salomon, L.N. Van Wassenhove, A set partitioning heuristic for the generalized assignment problem, European Journal of Operational Research 72 (1994) 167–174.
[9] D.G. Conway, M.A. Venkataramanan, Genetic search and dynamic facility layout problem, Computers and Operations Research 21 (1994) 955–960.
[10] R. Cucchiara, Analysis and comparison of different genetic models for the clustering problem in image analysis, in: Proceedings of the International Conference on Artificial Neural Nets and Genetic Algorithms, Innsbruck, Austria, 1993, pp. 423–427.
[11] J. Etcheberry, The set covering problem: A new implicit enumeration algorithm, Operations Research 25 (1977) 760–772.
[12] E. El-Darzi, G. Mitra, Graph theoretic relaxations of set covering and set partitioning problems, European Journal of Operational Research 87 (1995) 109–121.
[13] D. Goldberg, Genetic Algorithms in Search Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[14] P. Hansen, M. Delattre, Complete-link cluster analysis by graph coloring, Journal of the American Statistical Association 73 (1978) 397–403.
[15] P. Hansen, B. Jaumard, Minimum sum of diameters clustering, Journal of Classification 4 (1987) 215–226.
[16] J.H. Holland, Adaptation in Nature and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.
[17] R.E. Jensen, A dynamic programming algorithm for cluster analysis, Operations Research 12 (1969) 1034–1057.
[18] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Second ed., Prentice-Hall, New York, 1988.
[19] D.R. Jones, M.A. Beltramo, Solving partitioning problems with genetic algorithms, in: Proceedings of the Fourth International Conference on Genetic Algorithms, 1991, pp. 442–449.
[20] F. Kettaf, J. Asselin de Beauville, Genetic and fuzzy based clustering, in: Proceedings of the Fifth Conference of the International Federation of Classification Societies, 1996, pp. 100–103.
[21] F.T. Lin, C.Y. Kao, C.C. Hsu, Applying the genetic approach to simulated annealing in solving some NP-hard problems, IEEE Transactions on Systems, Man and Cybernetics 23 (1993) 1752–1767.
[22] J.A. Lozano, P. Larranga, M. Grana, Partitional cluster analysis with genetic algorithm: Searching for the number of clusters, Data Science, Classification and Related Methods (1998) 117–125.
[23] C.B. Lucasius, A.D. Dane, G. Kateman, On k-medoid clustering of large data sets with the aid of a genetic algorithm: Background, feasibility, and comparison, Analytica Chimica Acta 282 (1993) 647–669.
[24] S. Lunchian, H. Lunchian, M. Petriuc, Evolutionary automated classification, in: Proceedings of the First IEEE Conference on Evolutionary Computation, 1994, pp. 585–589.
[25] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297.
[26] I.R. Moraczawski, W. Borkowski, A. Kierzek, Clustering geobotanical data with the use of a genetic algorithm, Coenoses 10 (1995) 17–28.
[27] E.C. Moshe, C.A. Tovey, J.C. Ammons, Circuit partitioning via set partitioning and column generation, Operations Research 44 (1996) 65–76.
[28] J.M. Mulvey, H.P. Crowder, Cluster analysis: An application of Lagrangian relaxation, Management Science 25 (1979) 329–340.
[29] A.L. Nordstrom, S. Tufekci, A genetic algorithm for the talent scheduling problem, Computers and Operations Research 21 (1994) 941–954.
[30] J. Pintér, G. Pesti, Set partition by globally optimized cluster seed points, European Journal of Operational Research 51 (1991) 127–135.
[31] M. Savelsbergh, A branch-and-price algorithm for the generalized assignment problem, Operations Research 45 (1997) 831–841.
[32] M.A. Trick, A linear relaxation heuristic for the generalized assignment problem, Naval Research Logistics 39 (1992) 137–152.
[33] J.H. Ward, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (1963) 236–244.
[34] J.W. Welch, Algorithmic complexity: Three NP-hard problems in computational statistics, Journal of Statistical Computation and Simulation 15 (1983) 17–25.
