Genetic Clustering Algorithms
a Aviation and Maritime Management Department, Chang Jung Christian University, Tainan 711, Taiwan, ROC
b Institute of Traffic and Transportation, National Chiao Tung University, 4F, 114 Sec. 1, Chung-Hsiao W. Rd., Taipei 10012, Taiwan, ROC
Abstract
This study employs genetic algorithms to solve clustering problems. Three models, SICM, STCM and CSPM, are developed according to different coding/decoding techniques. The effectiveness and efficiency of these models under varying problem sizes are analyzed in comparison to a conventional statistics clustering method (the agglomerative hierarchical clustering method). The results for small-scale problems (10–50 objects) indicate that CSPM is the most effective but least efficient method, STCM is second most effective and efficient, and SICM is least effective because of its long chromosome. The results for medium-to-large-scale problems (50–200 objects) indicate that CSPM is still the most effective method. Furthermore, we have applied CSPM to solve an exemplified p-median problem. The good results demonstrate that CSPM is usefully applicable. © 2001 Elsevier Science B.V. All rights reserved.
Y.-C. Chiou, L.W. Lan / European Journal of Operational Research 135 (2001) 413–427
conventional statistics clustering, mathematical programming, network programming, and genetic algorithms (GAs). The algorithms for conventional statistic clustering [3,14,18,25,33] include the agglomerative hierarchical clustering method and K-means. The algorithms for mathematical programming [8,11,17,27,28,30–32] range from dynamic programming, Lagrangian relaxation, linear relaxation, column generation and branch-and-price to Lipschitz continuous methods. The algorithms for network programming [2,12] include graph-theoretic relaxations and network relaxation. The algorithms for GAs have been rapidly developed recently [4,6,10,19,20,22–24,26], including the group-numbers encoding method (e.g. binary code, Boolean matching code), the group-separators encoding method and the evolution program method.

While the aforementioned studies have proposed ways to solve clustering problems, two main research gaps still remain. First, the number of […]

[…] mathematical functions without a need for additional information in the search. This study attempts to develop coding/decoding techniques for GAs to solve simultaneously for the optimal number of clusters and the optimal clustering result, in comparison with the conventional statistic clustering method (the agglomerative hierarchical clustering method).

2. Mathematical model of clustering

The mathematical model of clustering for a given number (m) of clusters is

[CA_m]    Max F(X)    (1)

subject to

    Σ_j X_ij = 1,    all i,    (2)
    Σ_j X_jj = m,    (3)
cluster with other clusters. ‖·‖_p represents the l_p-norm. {·} represents the measure of a vector of diameter or distance. O_k, O_l are the kth and lth objects, respectively.

The total number of feasible solutions of [CA_m] is |π_m| = (1/m!) Σ_{j=0}^{m} (−1)^j C(m, j)(m − j)^N [1]. There are a total of 511 feasible solutions for 10 objects (N = 10) to be divided into 2 clusters (m = 2). There are a total of 42,525 feasible solutions for 10 objects to be divided into 5 clusters.

If the number of clusters (m) is not given exogenously, then [CA_m], without the constraint (3), becomes [CA], that is:

[CA]    Max F(X)    (9)

subject to

    Σ_j X_ij = 1,    all i,    (10)
    X_ij ≤ X_jj,    all i, j,    (11)
    X_ij ∈ {0, 1},    all i, j.    (12)

As to [CA], the number of feasible solutions is |π| = Σ_{m=1}^{N} |π_m|. There are 52 feasible solutions at N = 5, 113,608 at N = 10 and 1.99 × 10^7 at N = 15. Obviously, the complexity of [CA] is exponential in the problem size.

3. Genetic algorithms

Genetic algorithms are general-purpose search algorithms that use principles inspired by natural population genetics to evolve solutions to problems. The basic idea of genetic algorithms is to maintain a population of chromosomes that represent candidate solutions. A chromosome is composed of a series of genes that represent decision variables or parameters. Each member of the population is evaluated and assigned a measure of its fitness as a solution. There are three genetic operators: selection, crossover and mutation [13].

The first operator – selection – is to assign reproduction possibilities to chromosomes based on their fitness. A Monte Carlo wheel is often employed. That is, the higher the fitness of a chromosome, the more likely it is to be selected.

The second operator – crossover – is to combine the features of two parent structures to form two offspring. The simplest way to make a crossover is to swap a corresponding segment of the parents. One-point crossover, two-point crossover and uniform crossover are often employed.

The last operator – mutation – is to alter one or more genes of the offspring with a very low probability to avoid being trapped in a local optimum. The resulting offspring is then evaluated and inserted back into the population. This process continues until a predetermined criterion (e.g. maximum number of generations, minimum improvement of fitness between two adjacent generations or a certain mature rate) is reached.

4. Problem formulation

The effectiveness and efficiency of GAs vary with various coding/decoding techniques. This study proposes three coding/decoding techniques for GAs to solve clustering problems. They are the simultaneously clustering method (SICM), the stepwise clustering method (STCM) and the cluster seed points method (CSPM). Then, these models are compared with a conventional statistics clustering model – the agglomerative hierarchical clustering method (AHCM). The details of these four models are described as follows.

4.1. Agglomerative hierarchical clustering method (AHCM)

AHCM involves a series of successive merges. Initially, there are as many clusters as objects. These initial groups are merged according to their degree of improvement in the objective values. Eventually, all subgroups are fused into a single cluster [33]. The following are the steps in AHCM for grouping N objects in a maximizing problem, for example:

Step 0. Start with N clusters, each containing a single object, that is, S_i^1 = {O_i}, i = 1, …, N. An […]
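The partition counts quoted in Section 2 can be checked numerically. A minimal sketch (written for this note, not from the paper) implementing |π_m| = (1/m!) Σ_{j=0}^{m} (−1)^j C(m, j)(m − j)^N and |π| = Σ_m |π_m|:

```python
from math import comb, factorial

def num_partitions(n, m):
    """|pi_m|: ways to split n objects into exactly m non-empty clusters
    (the Stirling number of the second kind)."""
    return sum((-1) ** j * comb(m, j) * (m - j) ** n
               for j in range(m + 1)) // factorial(m)

def total_partitions(n):
    """|pi| = sum of |pi_m| over m = 1..n, the count for model [CA]."""
    return sum(num_partitions(n, m) for m in range(1, n + 1))

print(num_partitions(10, 2))  # 511 feasible solutions (N = 10, m = 2)
print(num_partitions(10, 5))  # 42525 (N = 10, m = 5)
print(total_partitions(5))    # 52 (N = 5, m free)
```

The three printed values agree with the counts stated in the text.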
Table 3
Relationship between objects and clusters (17 objects for example)

Object     Cluster (1 2 3 4 5 6 7 8)   Encoding
 1         1 0 0 0 0 0 0 0            000
 2         0 1 0 0 0 0 0 0            001
 3         0 0 0 1 0 0 0 0            011
 4         1 0 0 0 0 0 0 0            000
 5         0 0 0 0 0 1 0 0            101
 6         0 0 0 0 0 1 0 0            101
 7         0 0 1 0 0 0 0 0            010
 8         1 0 0 0 0 0 0 0            000
 9         0 1 0 0 0 0 0 0            001
10         1 0 0 0 0 0 0 0            000
11         1 0 0 0 0 0 0 0            000
12         0 0 0 1 0 0 0 0            011
13         0 0 0 1 0 0 0 0            011
14         1 0 0 0 0 0 0 0            000
15         1 0 0 0 0 0 0 0            000
16         0 0 0 0 0 1 0 0            101
17         0 0 1 0 0 0 0 0            010
Subtotal   7 2 2 3 0 3 0 0
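The encoding column of Table 3 concatenates one fixed-width binary gene per object, so decoding is just reading each 3-bit gene as a cluster number. A minimal sketch (the helper function is illustrative, not from the paper):

```python
def decode(chromosome, n_objects, bits=3):
    """Map a SICM-style bit string to cluster memberships:
    gene '000' -> cluster 1, '001' -> cluster 2, ..., '101' -> cluster 6."""
    clusters = {}
    for i in range(n_objects):
        gene = chromosome[i * bits:(i + 1) * bits]
        clusters.setdefault(int(gene, 2) + 1, []).append(i + 1)
    return clusters

# The 17 genes of Table 3, in object order.
genes = ["000", "001", "011", "000", "101", "101", "010", "000", "001",
         "000", "000", "011", "011", "000", "000", "101", "010"]
result = decode("".join(genes), 17)
print({k: len(v) for k, v in sorted(result.items())})
# {1: 7, 2: 2, 3: 2, 4: 3, 6: 3}
```

The per-cluster counts recovered this way match the rows of Table 3.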
branch-and-bound. When all clusters are fathomed, STCM has attained the optimal clustering. The framework of the model is depicted by Fig. 1. The following are the algorithms for STCM under the depth-first principle, which fathoms individual branches one at a time.
Step 0. Let S^0 stand for the cluster containing all objects. The problem of optimally dividing S^0 into two subgroups, namely S^0' and S^0'', can be formulated as the following 0–1 mathematical programming:

[MP^0]    Max F^0(X)    (13)

[…] chromosome can be largely curtailed and can be further reduced in the evolutions of the optimization stages. Consider N objects for instance: let |S^0| = N denote N objects in set S^0; the length of the chromosome at stage 0 is N. If |S^0'| = L0, the length of the chromosome at stage 1 is N − L0. If |S^1'| = L1, the length of the chromosome at stage 2 can be further shortened to N − L0 − L1, and so forth.
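The shrinking chromosome length across STCM stages can be tabulated directly; a minimal sketch (the stage sizes used in the example are hypothetical):

```python
def chromosome_lengths(n_objects, closed_cluster_sizes):
    """Chromosome length at each STCM stage: objects already fixed into a
    closed cluster drop out of the encoding at the next stage."""
    lengths = [n_objects]
    remaining = n_objects
    for size in closed_cluster_sizes:
        remaining -= size
        lengths.append(remaining)
    return lengths

# Hypothetical example: N = 17 objects; stage 0 closes a cluster of
# L0 = 7 objects, stage 1 closes a cluster of L1 = 4 objects.
print(chromosome_lengths(17, [7, 4]))  # [17, 10, 6]
```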
5. Computational experiments
Table 4
Effectiveness of the four methods (N ≦ 50)^a

Number of   AHCM     SICM                     STCM                     CSPM
objects     F        F1      dF1              F2      dF2              F3      dF3
                     (F1/F)  (Z1)             (F2/F)  (Z2)             (F3/F)  (Z3)
10          13.06    12.45   0.73             13.06   0.00             13.06   0.00
                     (0.95)  (-4.63*)         (1.00)  (0.00)           (1.00)  (0.00)
20          26.95    11.38   4.49             26.77   0.69             26.87   0.36
                     (0.42)  (-18.97*)        (0.99)  (-1.43)          (1.00)  (-1.28)
30          28.70    13.83   3.83             28.82   2.81             36.40   1.28
                     (0.48)  (-21.25*)        (1.00)  (0.22)           (1.27)  (32.90*)
40          37.10    11.55   4.44             38.72   2.88             48.99   1.75
                     (0.31)  (-31.55*)        (1.04)  (3.08*)          (1.32)  (37.20*)
50          38.24    14.81   4.50             51.18   4.47             61.01   1.81
                     (0.39)  (-28.53*)        (1.34)  (15.86*)         (1.60)  (68.98*)

^a (1) F stands for the objective value of AHCM. (2) Fi and dFi represent the mean and standard deviation of the objective values solved by SICM, STCM and CSPM with 30 different executions, i = 1, 2, 3. (3) The effectiveness index of the ith method is defined as Fi/F. (4) Zi = (Fi − F)/(dFi/√30), which follows the standard Normal distribution N(0, 1). (5) * denotes that the null hypothesis (H0: Fi = F) is rejected at the 5% level of significance.
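The Z statistic in note (4) can be recomputed from any table row; a minimal sketch. Using the rounded SICM entries at N = 10 gives about -4.58, close to the printed -4.63 (the small difference comes from rounding of the reported mean and standard deviation):

```python
from math import sqrt

def z_stat(mean_i, benchmark, sd_i, n=30):
    """Z_i = (mean_i - benchmark) / (sd_i / sqrt(n)); n = 30 executions."""
    return (mean_i - benchmark) / (sd_i / sqrt(n))

# SICM at N = 10 in Table 4: mean 12.45, sd 0.73, AHCM value F = 13.06.
z = z_stat(12.45, 13.06, 0.73)
print(round(z, 2))  # -4.58
```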
Table 5
Efficiencies of the four methods (N ≦ 50)^a

Number of   AHCM     SICM                     STCM                     CSPM
objects     SS       SS1      dSS1            SS2      dSS2            SS3        dSS3
                     (SS/SS1) (Z1)            (SS/SS2) (Z2)            (SS/SS3)   (Z3)
10          172      1987     1193            2917     418             11,372     1724
                     (0.09)   (9.06)          (0.06)   (36.08)         (0.02)     (36.08)
20          1347     7383     2506            7533     1020            133,743    42,927
                     (0.18)   (16.08)         (0.18)   (17.06)         (0.01)     (17.06)
30          4521     12,963   3538            14,323   2018            511,915    220,156
                     (0.35)   (20.02)         (0.32)   (12.74)         (0.01)     (12.74)
40          10,696   33,020   10,250          20,970   1776            1,257,755  454,935
                     (0.32)   (17.62)         (0.51)   (15.14)         (0.01)     (15.14)
50          20,871   68,033   17,141          28,080   2143            2,226,590  633,469
                     (0.31)   (21.73)         (0.74)   (19.25)         (0.01)     (19.25)

^a (1) SS stands for the number of solutions searched by AHCM. (2) SSi and dSSi represent the mean and standard deviation of the number of solutions searched by SICM, STCM and CSPM with 30 different executions, i = 1, 2, 3. (3) The efficiency index of the ith method is defined as SS/SSi. (4) Zi = (SSi − SS)/(dSSi/√30), which follows the standard Normal distribution N(0, 1). (5) * denotes that the null hypothesis (H0: SSi = SS) is rejected at the 5% level of significance.
Table 5 summarizes the number of solutions searched until the "optimal solution" is obtained by the four methods under various problem sizes. Obviously, AHCM, which has the least number of solutions searched, is the most efficient method. Further comparison between SICM and STCM is made by a two-tailed test of H0: SS1 = SS2 against H1: SS1 ≠ SS2. The Z values at N = 10, 20, 30, 40 and 50 are 4.03, 0.30, 1.83, -6.34 and -12.67, respectively, implicitly showing that SICM is more efficient than STCM at N = 10, not significantly different from STCM at N = 20 and 30, and less efficient than STCM at N = 40 and 50. Similarly, the results of two-tailed tests of H0: SS1 = SS3 against H1: SS1 ≠ SS3 and H0: SS2 = SS3 against H1: SS2 ≠ SS3 have shown that CSPM is the least efficient method.

Fig. 5 shows an optimal clustering result of CSPM at N = 50. The figure shows that, by maximizing the ratio of between-clusters variability to within-cluster variability, these 50 objects have been divided into 11 clusters. Each cluster contains 2–8 objects. Since the objects in the same clusters are adjacent to each other and no object is obviously grouped into a wrong cluster, the result of clustering appears to be good.
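The criterion described above, the ratio of between-clusters to within-cluster variability, can be illustrated on a one-dimensional toy example (the paper's exact objective F may differ in detail; this SSB/SSW form is a common instance):

```python
def between_within_ratio(clusters):
    """Ratio of between-clusters sum of squares (SSB) to within-cluster
    sum of squares (SSW) for a list of 1-D clusters."""
    all_points = [x for c in clusters for x in c]
    grand = sum(all_points) / len(all_points)
    ssb = sum(len(c) * (sum(c) / len(c) - grand) ** 2 for c in clusters)
    ssw = sum((x - sum(c) / len(c)) ** 2 for c in clusters for x in c)
    return ssb / ssw

# Two tight, well-separated clusters give a large ratio.
print(between_within_ratio([[1, 2, 3], [7, 8, 9]]))  # 13.5
```

A partition that splits tight groups, or merges distant ones, lowers this ratio, which is why maximizing it yields clusters whose members are adjacent to each other.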
6. Applications
Fig. 6. Means and 95% upper/lower confidence intervals of STCM and CSPM (50 ≦ N ≦ 200).
subject to

    Σ_j X_ij = 1,    all i,    (17)
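For intuition, a tiny p-median instance can be solved by exhaustive enumeration; the sketch below (hypothetical one-dimensional data, not the paper's exemplified instance) picks the p facility sites that minimize the total distance from each object to its nearest facility:

```python
from itertools import combinations

def p_median(points, p):
    """Brute-force p-median: try every choice of p sites among the points
    and keep the one with minimum total nearest-facility distance."""
    best_cost, best_sites = float("inf"), None
    for sites in combinations(points, p):
        cost = sum(min(abs(x - s) for s in sites) for x in points)
        if cost < best_cost:
            best_cost, best_sites = cost, sites
    return best_cost, best_sites

cost, sites = p_median([0, 1, 10, 11], 2)
print(cost)  # 2
```

Enumeration is only viable for tiny instances; the combinatorial growth is exactly why the paper turns to CSPM for larger problems.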
Fig. 15. Optimal number of facilities and total distance costs vs. facility set-up cost.
[31] M. Savelsbergh, A branch-and-price algorithm for the generalized assignment problem, Operations Research 45 (1997) 831–841.
[32] M.A. Trick, A linear relaxation heuristic for the generalized assignment problem, Naval Research Logistics 39 (1992) 137–152.
[33] J.H. Ward, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (1963) 236–244.
[34] J.W. Welch, Algorithmic complexity: Three NP-hard problems in computational statistics, Journal of Statistical Computation and Simulation 15 (1983) 17–25.