A Method To Estimate The Statistical Confidence of Cluster Separation
With 1 Figure
under consideration of those existing parameters that have to be normalized accordingly in the case of different scalings):

$$\bar{e}_k = \frac{1}{L} \sum_{i=(k-1)L+1}^{kL} e_i \tag{3}$$

By applying the Euclidean distance, the following objective function $a(g)$ for each grouping step $g$ can be defined:

$$a(g) = \sum_{k=1}^{K} \sum_{i \in k} \lVert e_i - \bar{e}_k \rVert^2 \tag{4}$$

By considering the Euclidean distance, each grouping step can be seen as a displacement of the element $e_i$ into that cluster which contains the respective nearest centroid. The objective function can thus be minimized:

$$a(g) \to \min \quad \forall g \tag{5}$$

This procedure is repeated until a local minimum of the objective function is reached. The objective function reaches a local minimum if two successive grouping steps show the same result; the iteration is in this case discontinued, i.e., the optimum classification with respect to the given number of clusters has been reached.

An important disadvantage of this method is that one does not know whether an absolute minimum or just a secondary minimum of the objective function has been obtained (Fovell, 1993; Milligan and Cooper, 1985). For this reason the quality of separation is unknown, as is the objective number of clusters. The following procedure shows a solution to this problem.

2. Definition of a Quality Criterion to Separate Clusters

The quality criterion represents the statistical security of the cluster separation. The definition of this criterion can be described as follows: After having reached the local minimum, each cluster is equipped with a generally varying number of elements. Each element is defined by N parameters, i.e., it is located in an N-dimensional parameter space. As each cluster consists of a certain number of elements, each cluster represents a scatter plot of elements in the above space. If the clustering leads to a local secondary minimum, overlaps occur between the scatter plots of single clusters. The principle of this method is presented in Fig. 1 which depicts the projection of two parameters within the N-dimensional space. The number of overlaps O of the two clusters a and b of N parameters can accordingly be defined as follows:

$$O_{a,b} = \sum_{i_a=1}^{L_a} \sum_{i_b=1}^{L_b} \sum_{j=1}^{N} O_{i_a,i_b,j}, \qquad a = 1, \ldots, k-1; \quad b = 2, \ldots, k \tag{6}$$

with

$$O_{i_a,i_b,j} = \begin{cases} 1, & P_{i_b,j} \ge P_{i_a,j} \\ 0, & P_{i_b,j} < P_{i_a,j} \end{cases} \tag{7}$$
under the additional condition

$$\bar{e}_1 > \bar{e}_2 > \cdots > \bar{e}_k. \tag{8}$$

If $O_{a,b} = 0$, then the clusters a and b are completely separated from each other. The maximum possible number of overlaps is

$$O^{\max}_{a,b} = N L_a L_b. \tag{9}$$

This number is reached if both clusters cover the same region within the N-dimensional space.

Thus by applying Eqs. (6) to (9) the quality of the separation of clusters can be determined statistically by the following steps:

1. Calculation of the mean number of maximum possible overlaps $\bar{O}^{\max}$ and the mean actual number of overlaps $\bar{O}$ over all combinations of cluster pairs.
2. Undertake a test to see whether $\bar{O}$ and $\bar{O}^{\max}$ originate from the same basic population. Assuming that there is a Gaussian normal distribution, the Student's t-test can be used. (Because of the necessary normalization of the parameters, a normal distribution is generally realized.) The null hypothesis implies that both mean values originate from the same population. The clusters can be separated only when the null hypothesis is rejected. In that case, the procedure continues as follows:
3. The ratio $v_{a,b}$ of the actual to the maximum possible number of overlaps is determined for each cluster pair:

$$v_{a,b} = \frac{O_{a,b}}{O^{\max}_{a,b}} \tag{10}$$

4. The mean value $\bar{v}$ over all $v_{a,b}$ is calculated. It is the empirical estimate of the actual occurrence probability of overlaps.
5. If the mean values $\bar{O}$ and $\bar{O}^{\max}$ are not identical, point 2 implies that there is, according to the chosen level of significance, a statistically significant separation of those clusters for which $v_{a,b} \le \bar{v}$.
6. The quality of separation in the case $v_{a,b} > \bar{v}$ still needs to be determined. The point is hence to clarify whether a certain value of the number of actual overlaps $O_{a,b}$ is compatible with the mean value of all numbers of actual overlaps $\bar{O}$. If one interprets the overlaps as empirical occurrence frequencies, a statistical comparison between both is possible. This can be done for instance by the $\chi^2$-test (e.g., Taubenheim, 1969) which can be written as follows:

$$\chi^2 = \frac{(O_{a,b} - \bar{O})^2 \, (2\bar{O}^{\max} - 1)}{(O_{a,b} + \bar{O}) \, (2\bar{O}^{\max} - O_{a,b} - \bar{O})} \tag{11}$$

with one degree of freedom (df). The result of this test can be interpreted in the following way: If the calculated $\chi^2$-value is greater than a given threshold of significance, the frequency of overlaps exceeding the mean value $\bar{O}$ differs significantly from that mean value. The separation between the clusters is therefore statistically not significant.

3. Determination of an Optimum Number of Clusters

The optimum number of clusters is defined as that number which leads to the best separation between all clusters. The method presented above allows the optimum number of clusters for the non-hierarchical clustering to be determined in the best possible way. The following procedure is required to this end:

1. If a clustering with a given initial number of clusters does not lead to a separation, then the initial number of clusters is varied until at least a single statistically reliable separation between one cluster and the rest exists.
2. If point 1 is fulfilled, the elements of the separated clusters are noted as being a final partial result.
3. The initial series is reduced by the separated cluster elements.
4. This algorithm is repeated using the method presented in Section 2 until all clusters are statistically reliably separated.
5. The optimum number of clusters results from the number of clusters separated per algorithm step.

4. Rank Sum Analysis

In addition to the previous investigations, a rank sum analysis can be carried out which allows the clustering to be checked and determines the internal structure of the clusters more precisely. After the clustering process is finished, each cluster contains a certain number of elements. The order of elements is random, i.e., there is no knowledge as to a possibly existing order which, however, is of
106 F.-W. Gerstengarbe and P. C. Werner
Table 1. Values of the Group Centroids with Regard to the Initial Number of Clusters
Cluster 1 2 3 4 5 6 7 8 9
Cluster 1 2 3 4 5 6 7 8 9
- Minimization of the objective function according to Eqs. (4) and (5),

The values of the group centroids are given in Table 1. Because of normalization of the parameters, the values are without units. The group centroids are ranked in order from large to small values. This corresponds to a ranking of the clusters from "warm" to "cool" (see Table 2).

5.2 Calculation of the Quality of the Separation of Clusters

As shown in Table 3, actual overlaps exist. That is why the quality of cluster separation must be checked. The necessary maximum number of overlaps is given in Table 4. Using the values of Tables 3 and 4 the ratios of overlaps were calculated. The result can be seen in Table 5. According to the null hypothesis of the $\chi^2$-test, Table 6 shows all those clusters which cannot be statistically reliably separated from each other (to the 1% or 5% margin of error). Therefore it is necessary to realize an additional separation of clusters.
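The overlap counts, their maxima and the resulting ratios used in this section (Eqs. 6, 7, 9 and 10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the array layout (one row per element, one column per normalized parameter) are hypothetical, the indicator in Eq. (7) is read as "greater than or equal", and the first argument is taken to be the cluster with the larger centroid (Eq. 8).

```python
import numpy as np

def overlap_count(cluster_a, cluster_b):
    """Number of overlaps O_ab between two clusters (Eqs. 6 and 7).

    cluster_a, cluster_b: array-likes of shape (L_a, N) and (L_b, N).
    cluster_a is assumed to be the cluster with the larger centroid.
    """
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    # Broadcast to shape (L_a, L_b, N): every element of b is compared with
    # every element of a, parameter by parameter.
    return int(np.sum(b[None, :, :] >= a[:, None, :]))

def max_overlaps(cluster_a, cluster_b):
    """Maximum possible number of overlaps N * L_a * L_b (Eq. 9)."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    return a.shape[0] * b.shape[0] * a.shape[1]

def overlap_ratio(cluster_a, cluster_b):
    """Ratio v_ab of actual to maximum possible overlaps (Eq. 10)."""
    return overlap_count(cluster_a, cluster_b) / max_overlaps(cluster_a, cluster_b)

# Hypothetical example: two clusters with two elements and two parameters each.
warm = [[3.0, 3.0], [4.0, 5.0]]
cool = [[1.0, 1.0], [2.0, 4.0]]
ratio = overlap_ratio(warm, cool)
```

In this toy example only one of the eight element-parameter comparisons counts as an overlap (the second parameter of the second "cool" element exceeds that of the first "warm" element), so the ratio is 1/8.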
Cluster    1    2    3    4    5    6    7    8    9
1               23    2    0    2    0    0    0    0
2                   120   28   93   18    0    1    0
3                        113  253   79    7    2    9
4                             105   31   17    5    3
5                                  127  130    3   17
6                                       317   52   82
7                                             79   90
8                                                 210
9
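For a triangular table of pairwise overlap counts such as the one above, the comparison of a single pair's count with the mean count by means of a $\chi^2$-test with one degree of freedom might be sketched as follows. The statistic used here is the standard 2x2 contingency $\chi^2$ with the (n - 1) factor for two samples of equal size; that this is exactly the form used in the paper is an assumption, as are the function names and the example numbers. The critical values 3.841 (5%) and 6.635 (1%) are the usual tabulated values for one degree of freedom.

```python
CHI2_CRIT_5_PERCENT = 3.841  # chi-square critical value, 1 df, 5% level
CHI2_CRIT_1_PERCENT = 6.635  # chi-square critical value, 1 df, 1% level

def chi_square_overlap(o_ab, o_mean, o_max_mean):
    """Chi-square statistic (1 df) comparing a pair's overlap count with the mean.

    o_ab:       actual number of overlaps of the cluster pair
    o_mean:     mean actual number of overlaps over all pairs
    o_max_mean: mean maximum possible number of overlaps (assumed sample size)
    """
    numerator = (o_ab - o_mean) ** 2 * (2 * o_max_mean - 1)
    denominator = (o_ab + o_mean) * (2 * o_max_mean - o_ab - o_mean)
    if denominator == 0:
        # Both counts are 0 or both equal the maximum: no difference to test.
        return 0.0
    return numerator / denominator

def separation_not_significant(o_ab, o_mean, o_max_mean,
                               critical=CHI2_CRIT_5_PERCENT):
    """True if the pair's overlap exceeds the mean by a significant amount."""
    return o_ab > o_mean and chi_square_overlap(o_ab, o_mean, o_max_mean) > critical

# Hypothetical numbers: a pair with 253 overlaps, a mean of 50 overlaps and
# a mean maximum of 1000 possible overlaps.
flag = separation_not_significant(253, 50, 1000, critical=CHI2_CRIT_1_PERCENT)
```

With these hypothetical inputs the statistic is far above the 1% threshold, so the pair would be flagged as not reliably separated; a pair whose count is close to the mean stays below the threshold.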
Cluster 1 2 3 4 5 6 7 8 9
Cluster 1 2 3 4 5 6 7 8 9
Table 7. Results of the Cluster and Rank Sum Analysis "Summers". The Clusters are Ranked in Order from "warm" to "cold". Per Cluster: 1st Column year, 2nd Column rank of the rank sum
Cluster 1 2 3 4 5
Cluster 6 7 8 9 10
Cluster 11 12 13 14
the rank sum analysis. The classification of the clusters is carried out according to the above results, i.e., the clusters are ranked in order from "very warm summers" (cluster 1) to "very cool summers" (cluster 14) on the basis of the group centroids. Within the clusters the classification of the elements is carried out in the order of the ranks of the rank sums. Because of the fact that the ranks of the rank sums are also ordered from "very warm summers" to "very cool summers" (increasing ranks) within each cluster one gets a sequence of the elements (summers) from "warm" to "cool". One can see that the results of the rank sum analysis are in good correspondence with the ranked clusters even when single elements are arranged differently within the line from "warm" to "cool".

6. Conclusions

The results presented show that the suggested procedure is the first which allows the quality of the separation of clusters to be calculated in a statistically well-founded way; it replaces the often
adverse effects of a given number of clusters when employing non-hierarchical cluster analysis by applying the optimum number of clusters which guarantees a statistically reliable separation of all clusters from each other. The procedure is completed by applying the described rank sum analysis which makes it possible to indicate the internal order of the elements for each calculated cluster. The combined application of the described procedures is thus a suitable method to improve the use of cluster analysis methods and to indicate whether an application of cluster analysis is possible or not.

References

Forgy, E. W., 1965: Cluster analysis of multivariate data: efficiency versus interpretability of classifications (abstract). Biometrics, 21, 768.
Fovell, R. G., Fovell, M. C., 1993: Climate zones of the conterminous United States defined using cluster analysis. Journal of Climate, 6, 2103-2135.
Gerstengarbe, F.-W., Werner, P. C., 1992: The time structure of extreme summers in Central Europe between 1901 and 1980. Meteor. Zeitschrift, N. F., 1, 285-289.
Milligan, G. W., Cooper, M. C., 1985: An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.
Steinhausen, D., Langer, K., 1977: Clusteranalyse - Einführung in Methoden und Verfahren der automatischen Klassifikation. Berlin: Walter de Gruyter, 411 pp.
Taubenheim, J., 1969: Statistische Auswertung geophysikalischer und meteorologischer Daten. Leipzig: Akad. Verlagsges. Geest & Portig, 386 pp.

Authors' address: Dr. F.-W. Gerstengarbe, Dr. P. C. Werner, Potsdam Institute for Climate Impact Research, P.O. Box 60 12 03, D-14412 Potsdam, Federal Republic of Germany.