Theor. Appl. Climatol. 57, 103-110 (1997)

Theoretical and Applied Climatology
© Springer-Verlag 1997
Printed in Austria

Potsdam Institute for Climate Impact Research, Potsdam, Germany

A Method to Estimate the Statistical Confidence of Cluster Separation
F.-W. Gerstengarbe and P. C. Werner

With 1 Figure

Received November 24, 1995


Revised May 2, 1996

Summary

Cluster analysis comprises several multivariate methods for the separation of patterns (clusters). The definition of the optimum or universally best cluster analysis is an unresolved issue. Three problems are of special importance: 1. The statistical confidence of cluster separation. 2. The definition of the optimal number of clusters. 3. The description of the internal cluster structure. Two new methods addressing these problems are presented. On the basis of non-hierarchical minimum-distance cluster analysis, a new method is described that allows a separation of clusters in a statistically well-founded way. This method solves problems one and two. Using a newly developed special rank-sum analysis, a solution to the third problem is possible. An example shows the practicability of the proposed procedures.

1. Introduction

The main idea of cluster analysis is to relate to each other an existing number M of elements $e_i$, each of which is described by N parameters p, i.e.:

$$e_i = f(p_{i1}, \ldots, p_{iN}). \qquad (1)$$

Two main techniques are possible:

Using hierarchical methods, different sequences of groups on different levels may be constructed. The result is a hierarchy of clusters in a "tree structure". The disadvantage of this technique lies in the fact that an exchange of elements is impossible once the "tree structure" is built up. This disadvantage restricts the application.

With non-hierarchical methods, the elements $e_i$ are simultaneously partitioned into a given number of clusters K. By displacing the elements between the clusters according to a given quality criterion, a given initial partition is developed step by step into a steadily improving grouping until the optimum is reached; for more details, see Steinhausen and Langer (1977). The starting point of the following method is the non-hierarchical minimum-distance method according to Forgy (1965). The starting condition when applying this method is to have the elements $e_i$ equally distributed over a number K of given clusters (initial partition). In the case of M given elements and K clusters, each cluster receives L = M/K elements as follows:

$$
\begin{aligned}
e_1, \ldots, e_L &\in C_1 \\
e_{L+1}, \ldots, e_{2L} &\in C_2 \\
&\;\;\vdots \\
e_{(k-1)L+1}, \ldots, e_{kL} &\in C_k
\end{aligned}
\qquad (2)
$$

(The number of clusters K must be defined empirically; the number of elements depends on the data series and the problem being investigated.)

A so-called group centroid $\bar{c}_k$ is then calculated for each k of the K clusters (cluster mean value
under consideration of those existing parameters that have to be normalized accordingly in the case of different scalings):

$$\bar{c}_k = \frac{1}{L} \sum_{i=(k-1)L+1}^{kL} e_i \qquad (3)$$

By applying the Euclidean distance, the following objective function a(g) for each grouping step g can be defined:

$$a(g) = \sum_{k=1}^{K} \sum_{i \in k} |e_i - \bar{c}_k|^2. \qquad (4)$$

By considering the Euclidean distance, each grouping step can be seen as a displacement of the element $e_i$ into that cluster which contains the respective nearest centroid. The objective function can thus be minimized:

$$a(g)\ \forall g \to \min. \qquad (5)$$

This procedure is repeated until a local minimum of the objective function is reached. The objective function reaches a local minimum if two successive grouping steps show the same result; the iteration is in this case discontinued, i.e., the optimum classification with respect to the given number of clusters has been reached.

An important disadvantage of this method is that one does not know whether an absolute minimum or just a secondary minimum of the objective function has been obtained (Fovell, 1993; Milligan and Cooper, 1985). For this reason the quality of separation is unknown, as is the objective number of clusters. The following procedure shows a solution to this problem.

2. Definition of a Quality Criterion to Separate Clusters

The quality criterion represents the statistical confidence of the cluster separation. The definition of this criterion can be described as follows: After the local minimum has been reached, each cluster contains a generally varying number of elements. Each element is defined by N parameters, i.e., it is located in an N-dimensional parameter space. Each cluster consists of a certain number of elements and therefore represents a scatter plot of elements in this space. If the clustering leads to a local secondary minimum, overlaps occur between the scatter plots of single clusters. The principle of this method is presented in Fig. 1, which depicts the projection of two parameters within the N-dimensional space. The number of overlaps O of the two clusters a and b of N parameters can accordingly be defined as follows:

$$O_{a,b} = \sum_{i_a=1}^{L_a} \sum_{i_b=1}^{L_b} \sum_{j=1}^{N} o_{i_a,i_b,j}, \qquad a = 1, \ldots, k-1; \quad b = 2, \ldots, k \qquad (6)$$

with

$$o_{i_a,i_b,j} = \begin{cases} 1, & p_{i_b,j} \ge p_{i_a,j} \\ 0, & p_{i_b,j} < p_{i_a,j} \end{cases} \qquad (7)$$

Fig. 1. Principle scheme of the description of the clustering quality (square/cross: overlapped clusters; double cross: fully separated cluster). [Scatter plot not reproduced; horizontal axis: PARAMETER 1, 0 to 35; annotation in the plot: overlaps between the clusters.]
Equations (6) and (7) apply under the additional condition

$$\bar{c}_1 > \bar{c}_2 > \cdots > \bar{c}_k. \qquad (8)$$

If $O_{a,b} = 0$, then the clusters a and b are completely separated from each other. The maximum possible number of overlaps is

$$O_{a,b}^{\max} = N L_a L_b. \qquad (9)$$

This number is reached if both clusters cover the same region within the N-dimensional space.

Thus, by applying Eqs. (6) to (9), the quality of the separation of clusters can be determined statistically by the following steps:

1. Calculation of the mean number of maximum possible overlaps $\bar{O}^{\max}$ and the mean actual number of overlaps $\bar{O}$ over all combinations of cluster pairs.
2. A test of whether $\bar{O}$ and $\bar{O}^{\max}$ originate from the same basic population. Assuming a Gaussian normal distribution, Student's t-test can be used. (Because of the necessary normalization of the parameters, a normal distribution is generally realized.) The null hypothesis implies that both mean values originate from the same population. The clusters can be separated only when the null hypothesis is rejected. In that case, the procedure continues as follows:
3. The ratio $v_{a,b}$ of the actual to the maximum possible number of overlaps is determined for each cluster pair:

$$v_{a,b} = \frac{O_{a,b}}{O_{a,b}^{\max}}. \qquad (10)$$

4. The mean value $\bar{v}$ over all $v_{a,b}$ is calculated. It is the empirical estimate of the actual occurrence probability of overlaps.
5. If the mean values are not identical (point 2), there is, according to the chosen level of significance, a statistically significant separation of those clusters for which $v_{a,b} \le \bar{v}$.
6. The quality of separation in the case $v_{a,b} > \bar{v}$ still needs to be determined. The point is hence to clarify whether a certain value of the number of actual overlaps $O_{a,b}$ is compatible with the mean value $\bar{O}$ of all numbers of actual overlaps. If one interprets the overlaps as empirical occurrence frequencies, a statistical comparison between both is possible. This can be done, for instance, by the $\chi^2$-test (e.g., Taubenheim, 1969), which can be written as follows:

$$\chi^2 = \frac{(O_{a,b} - \bar{O})^2 \,(2\bar{O}^{\max} - 1)}{(O_{a,b} + \bar{O})\,(2\bar{O}^{\max} - O_{a,b} - \bar{O})}, \qquad v_{a,b} > \bar{v} \qquad (11)$$

with one degree of freedom (df).

The result of this test can be interpreted in the following way: If the calculated $\chi^2$-value is greater than a given threshold of significance, the frequency of overlaps exceeding the mean value $\bar{O}$ differs significantly from this mean value. The separation between the clusters is therefore statistically not significant.

3. Determination of an Optimum Number of Clusters

The optimum number of clusters is defined as that number which leads to the best separation between all clusters. The method presented above allows the optimum number of clusters for the non-hierarchical clustering to be determined in the best possible way. The following procedure is required to this end:

1. If a clustering with a given initial number of clusters does not lead to a separation, then the initial number of clusters is varied until at least a single statistically reliable separation between one cluster and the rest exists.
2. If point 1 is fulfilled, the elements of the separated clusters are noted as being a final partial result.
3. The initial series is reduced by the separated cluster elements.
4. This algorithm is repeated using the method presented in Section 2 until all clusters are statistically reliably separated.
5. The optimum number of clusters results from the number of clusters separated per algorithm step.

4. Rank sum Analysis

In addition to the previous investigations, a rank sum analysis can be carried out which allows the clustering to be checked and determines the internal structure of the clusters more precisely. After the clustering process is finished, each cluster contains a certain number of elements. The order of elements is random, i.e., there is no knowledge as to a possibly existing order which, however, is of
interest to certain investigations. This problem can be solved by interpretation of the elements $e_i$ (see Eq. (1)). The disadvantage is that this procedure depends directly on the cluster analysis. The following method is therefore suggested to solve the problem. For cluster analysis, the elements are described by their parameters, whereas for rank sum analysis, the elements are described by the ranks of their parameters, the rank of a parameter being determined by its position within the existing values of this parameter. Thus each element can be considered as a function of the ranks of its parameters:

$$e_i = h(R_{i1}, \ldots, R_{iN}) \qquad (12)$$

with $R_{ij}$ = rank of the parameter $p_{ij}$.

Rank sum analysis starts with the calculation of the weighted sum obtained from the ranks of the parameters for each element. In order to scale the parameters with one another and to reduce their possible interdependence to a minimum, it is useful to weight the parameters. The weighting is based on the correlation between the parameters: The starting point is the distribution-free fourfold table test (Taubenheim, 1969), which allows the estimation of the tetrachoric correlation coefficient r between two parameters:

$$r = \sin(q \cdot \pi/2) \qquad (13)$$

with $q = 1 - 4a/M$ (quadrant ratio), a = number of values within the first quadrant, M = number of values of all quadrants.

The weights $w_j$ (j = 2, ..., N) are determined under the following conditions:

- Determination of a reference parameter with the weight $w_1 = 1$ (which is necessary to determine the weights by means of correlation).
- The weight of the rank of a parameter is supposed to reach at least the value of 1/N, which is the case if the correlation coefficient between this parameter and the reference parameter is r = 1.
- If the correlation decreases (in absolute value), the weight increases, whereas if a correlation does not exist the weight is one.

The weighting depends on the sign of the correlation:

$$w_j = \frac{[N - (N-1)\,|r_j|]\,\operatorname{sign}(r_j)}{N}, \qquad j = 2, \ldots, N. \qquad (14)$$

By employing (14), the rank sums $RS_i$ for the element $e_i$ can be determined as follows:

$$RS_i = \sum_{j=1}^{N} w_j p_{ij}, \qquad i = 1, \ldots, KL. \qquad (15)$$

Having determined all rank sums, they can in turn be assigned ranks $R_i(RS_i)$. If these ranks are assigned to the respective elements $e_i$, then each cluster is equipped with a structure ordered according to ranks.

5. Example

Many meteorological events are characterized by more than one parameter, for instance the description of seasonal temperature conditions. A preliminary study (Gerstengarbe and Werner, 1992) led to the conclusion that the temperature conditions with regard to Central European summers have to be described by five parameters:

- Number of summer days: $T_{\max} \ge 25$ °C, May-September,
- Number of hot days: $T_{\max} \ge 30$ °C, May-September,
- Heat sum: $T_{\max} > 20$ °C, May-September,
- Summer mean: $\sum T / n$, June-August,
- Extreme value mean: $(T_{M1} + T_{M2} + T_{M3})/3$, June-August,

with n = number of days June-August, $T_{M1}$, $T_{M2}$, $T_{M3}$ = monthly maximum of the air temperature in June, July, August, $T_{\max}$ = daily maximum of the air temperature, T = daily mean of the air temperature.

To make them easier to follow, the following calculations are, without restricting generality, based on the investigation of only one station. The time series of the meteorological station at Potsdam covering the period 1893-1993 were available for the daily mean and daily maximum of the air temperature. The aim of the study is to classify the 101 summers of the Potsdam station.

5.1 Calculation of the non-Hierarchical Cluster Analysis with a fixed Initial Number of Clusters

The following quantities have been determined for the calculation: number of elements M = 101, number of parameters N = 5, number of clusters K = 9. The calculation steps are as follows:

- Determination of the initial partition of Eq. (2),
- Calculation of the group centroids according to Eq. (3),

Table 1. Values of the Group Centroids with Regard to the Initial Number of Clusters

Cluster           1      2      3      4      5      6      7      8      9
$\bar{c}_k$   12.17   5.37   1.50   1.20   0.27  -2.26  -2.70  -5.43  -6.04

Table 2. Results of the Cluster Analysis with a Fixed Number of Clusters

Cluster 1 2 3 4 5 6 7 8 9

1947 1983 1938 1949 1943 1919 1967 1984 1898


1992 1982 1986 1901 1900 1931 1933 1977 1993
1911 1917 1895 1914 1945 1979 1955 1894 1918
1975 1939 1906 1904 1981 1988 1974 1899
1976 1970 1966 1941 1940 1980 1923
1934 1951 1908 1928 1897 1987 1903
1959 1968 1915 1961 1924 1962 1926
1989 1991 1942 1920 1927 1913
1921 1925 1936 1910 1958 1907
1944 1990 1930 1978 1896 1956
1964 1946 1893 1960 1912 1965
1932 1905 1952 1922 1916
1963 1957 1954 1902
1937 1985 1909
1971 1972
1969
1929
1973
1953
1948
1935
1950

- Minimization of the objective function according to Eqs. (4) and (5).

The values of the group centroids are given in Table 1. Because of the normalization of the parameters, the values are without units. The group centroids are ranked in order from large to small values. This corresponds to a ranking of the clusters from "warm" to "cool" (see Table 2).

5.2 Calculation of the Quality of the Separation of Clusters

Calculation steps for cluster separation are:

- Calculation of the actual overlaps according to Eqs. (6)-(8),
- Calculation of the maximum possible number of overlaps according to Eq. (9),
- Determination of the ratios of overlaps according to Eq. (10),
- $\chi^2$-test.

As shown in Table 3, actual overlaps exist. That is why the quality of cluster separation must be checked. The necessary maximum number of overlaps is given in Table 4. Using the values of Tables 3 and 4, the ratios of overlaps were calculated. The result can be seen in Table 5. According to the null hypothesis of the $\chi^2$-test, Table 6 shows all those clusters which cannot be statistically reliably separated from each other (to the 1% or 5% margin of error). Therefore it is necessary to carry out an additional separation of clusters.

5.3 Optimization of the Number of Clusters

The optimum number of clusters is determined according to the method described in Section 3. The results are presented in Table 7. It shows that the number of clusters increases from 9 (initial number of clusters) to 14 (number of statistically significantly separated clusters).
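
The clustering that produces the initial result discussed above follows the steps of Section 5.1: initial partition (Eq. (2)), group centroids (Eq. (3)), and reassignment of every element to its nearest centroid until two successive grouping steps agree (Eqs. (4) and (5)). The following Python sketch illustrates these steps under several assumptions that the text does not settle: all elements are reassigned in one batch per step, leftover elements go to the last cluster when M is not divisible by K, an emptied cluster is simply skipped, and all function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance, the summand of Eq. (4)."""
    return sum((x - y) ** 2 for x, y in zip(p, c))


def centroids_of(elements: List[Point], labels: List[int], k: int) -> List[Optional[List[float]]]:
    """Eq. (3): component-wise mean per cluster (None for an empty cluster)."""
    result: List[Optional[List[float]]] = []
    for c in range(k):
        members = [e for e, lab in zip(elements, labels) if lab == c]
        if members:
            dim = len(members[0])
            result.append([sum(e[j] for e in members) / len(members) for j in range(dim)])
        else:
            result.append(None)
    return result


def nearest(e: Point, cents: List[Optional[List[float]]]) -> int:
    """Index of the nearest non-empty cluster centroid."""
    candidates = [c for c, cen in enumerate(cents) if cen is not None]
    return min(candidates, key=lambda c: sq_dist(e, cents[c]))


def minimum_distance_clustering(elements: List[Point], k: int,
                                max_steps: int = 100) -> Tuple[List[int], float]:
    """Return (cluster label per element, value of the objective function)."""
    m = len(elements)
    if not 1 <= k <= m:
        raise ValueError("need 1 <= k <= number of elements")
    block = m // k  # L = M / K, rounded down (assumption for non-divisible M)
    labels = [min(i // block, k - 1) for i in range(m)]  # Eq. (2)
    for _ in range(max_steps):
        cents = centroids_of(elements, labels, k)
        new_labels = [nearest(e, cents) for e in elements]
        if new_labels == labels:  # two successive grouping steps agree
            break
        labels = new_labels
    cents = centroids_of(elements, labels, k)
    objective = sum(sq_dist(e, cents[lab]) for e, lab in zip(elements, labels))
    return labels, objective


# Toy one-parameter data (illustrative only, not the Potsdam series)
data = [(0.0,), (1.0,), (10.0,), (11.0,), (0.5,), (10.5,)]
labels, objective = minimum_distance_clustering(data, k=2)
print(labels, objective)
```

As in the text, the result is only a local minimum of the objective function; a different initial order of the elements can lead to a different grouping.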
Table 3. Number of Actual Overlaps

Cluster      1      2      3      4      5      6      7      8      9
1                  23      2      0      2      0      0      0      0
2                        120     28     93     18      0      1      0
3                               113    253     79      7      2      9
4                                      105     31     17      5      3
5                                             127    130      3     17
6                                                    317     52     82
7                                                            79     90
8                                                                  210
9

Table 4. Maximum Number of Overlaps

Cluster      1      2      3      4      5      6      7      8      9
1                 325    505    565    790    985   1150   1255   1465
2                       1495   1995   3870   5495   6870   7745   9495
3                               735   3510   5915   7950   9245  11835
4                                     3070   5735   7990   9425  12295
5                                            3635   6715   8675  12595
6                                                   3790   6205  11035
7                                                          2795   8395
8                                                                 6085
9

Table 5. Ratio of Overlaps (ratio in units of 10^-4)

Cluster      1      2      3      4      5      6      7      8      9
1                 707   39.6      0   25.3      0      0      0      0
2                        803    140    240   32.8      0    1.3      0
3                              1537    721    134    8.8    2.2    7.6
4                                      342   50.4   21.3    5.3    2.4
5                                             349    194    3.4   13.5
6                                                    836   83.8   74.3
7                                                          28.6   10.7
8                                                                  345
9
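
As a worked check of Eq. (10), the ratio of overlaps can be recomputed for a few cluster pairs directly from the entries of Tables 3 and 4. The short Python sketch below uses four pairs copied from those tables; the dictionary layout and the function name `overlap_ratio` are illustrative choices, not part of the original method description.

```python
# Entries for selected cluster pairs (a, b), copied from Tables 3 and 4
actual = {(1, 2): 23, (2, 3): 120, (3, 4): 113, (8, 9): 210}
maximum = {(1, 2): 325, (2, 3): 1495, (3, 4): 735, (8, 9): 6085}


def overlap_ratio(o_ab: int, o_max: int) -> float:
    """Eq. (10): ratio of the actual to the maximum possible number of overlaps."""
    return o_ab / o_max


ratios = {pair: overlap_ratio(actual[pair], maximum[pair]) for pair in actual}

# Print the ratios in units of 10^-4, the scaling used in Table 5
for pair, v in sorted(ratios.items()):
    print(pair, f"{v * 1e4:.1f}")
```

For these four pairs the recomputed values agree with the corresponding entries of Table 5 to within rounding of the last digit.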

Table 6. Error Probability for the Significant Difference Between the Frequency of Occurrence of "Overlaps" of the Clusters among each other

Cluster 1 2 3 4 5 6 7 8 9
1 1
2 1 5
3 1 1
4 1
5 1
6 1
7
8 1
9

5.4 Rank sum Analysis

Calculation steps for rank sum analysis are:

- Calculation of the correlation coefficients according to Eq. (13),
- Determination of the weights according to Eq. (14),
- Calculation of the rank sums according to Eq. (15),
- Determination of the ranks of the rank sums,
- Assignment of the ranks to the elements (summers).

Table 7 shows, in addition to the results of the optimum clustering, those that are obtained from

Table 7. Results of the Cluster and Rank Sum Analysis of the "Summers". The clusters are ranked in order from "warm" to "cold". Per cluster: 1st column: year; 2nd column: rank of the rank sum

Cluster 1 2 3 4 5

1947 1 1921 12 1917 6 1976 8 1944 13


1992 2 1964 14 1934 9 1932 15 1971 18
1911 3 1959 10 1950 26 1969 19
1983 4 1989 11 1953 22
1982 5 1948 23
1975 7

Cluster 6 7 8 9 10

1963 16 1973 21 1949 29 1943 28 1925 39


1937 17 1935 25 1939 31 1904 45 1905 43
1929 20 1895 30 1901 32 1930 52 1952 56
1938 24 1970 33
1986 27 1951 34
1914 37
1990 41

Cluster 11 12 13 14

1900 35 1967 54 1931 60 1898 76


1968 36 1933 55 1979 63 1993 78
1991 38 1955 59 1981 64 1918 82
1906 40 1988 61 1941 66 1899 83
1946 42 1940 65 1928 68 1923 85
1945 44 1897 67 1961 69 1977 87
1966 46 1920 70 1903 88
1908 47 1910 71 1926 89
1915 48 1924 72 1913 91
1942 49 1978 73 1974 92
1919 50 1960 74 1980 93
1936 51 1922 75 1907 94
1893 53 1927 77 1987 95
1957 57 1954 79 1956 96
1985 58 1958 80 1965 97
1972 62 1896 81 1916 98
1912 84 1902 99
1894 86 1909 100
1984 90 1962 101
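
The second column of Table 7 is the rank of the rank sum. How such ranks could be obtained from the calculation steps of Section 5.4 is sketched below in Python. Several points are assumptions rather than statements of the paper: the weight formula is read as w_j = [N - (N-1)|r_j|] sign(r_j) / N with sign(0) taken as +1, the summands of the rank sum are taken to be the weighted ranks of the parameters, rank 1 is given to the largest parameter value (warmest) and to the smallest rank sum, ties are broken by input order, and all function names and data are hypothetical.

```python
from typing import List, Sequence


def weight(r: float, n_params: int) -> float:
    """Assumed reading of Eq. (14); r is the correlation with the reference parameter."""
    sign = -1.0 if r < 0 else 1.0
    return (n_params - (n_params - 1) * abs(r)) / n_params * sign


def ranks(values: Sequence[float], descending: bool = True) -> List[int]:
    """Rank positions starting at 1; ties keep their input order (assumption)."""
    order = sorted(range(len(values)),
                   key=lambda i: -values[i] if descending else values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def rank_sums(params: List[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted sum of the per-parameter ranks for each element (cf. Eq. (15))."""
    n_params = len(weights)
    per_param = [ranks([row[j] for row in params]) for j in range(n_params)]
    return [sum(weights[j] * per_param[j][i] for j in range(n_params))
            for i in range(len(params))]


# Toy data: 3 elements, N = 2 parameters (illustrative only)
params = [(30.0, 5.0), (20.0, 3.0), (25.0, 1.0)]
weights = [1.0, weight(0.5, 2)]          # reference weight w_1 = 1
sums = rank_sums(params, weights)
final_ranks = ranks(sums, descending=False)
print(weights, sums, final_ranks)
```

With these assumptions a fully correlated parameter (r = 1) receives the minimum weight 1/N and an uncorrelated one the weight 1, as required by the conditions stated in Section 4.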

the rank sum analysis. The classification of the clusters is carried out according to the above results, i.e., the clusters are ranked in order from "very warm summers" (cluster 1) to "very cool summers" (cluster 14) on the basis of the group centroids. Within the clusters the classification of the elements is carried out in the order of the ranks of the rank sums. Because the ranks of the rank sums are also ordered from "very warm summers" to "very cool summers" (increasing ranks), within each cluster one gets a sequence of the elements (summers) from "warm" to "cool". One can see that the results of the rank sum analysis are in good correspondence with the ranked clusters even when single elements are arranged differently within the line from "warm" to "cool".

6. Conclusions

The results presented show that the suggested procedure is the first which allows the quality of the separation of clusters to be calculated in a statistically well-founded way; it replaces the often
adverse effects of a given number of clusters when employing non-hierarchical cluster analysis by applying the optimum number of clusters, which guarantees a statistically reliable separation of all clusters from each other. The procedure is completed by applying the described rank sum analysis, which makes it possible to indicate the internal order of the elements for each calculated cluster. The combined application of the described procedures is thus a suitable method to improve the use of cluster analysis methods, and the methods indicate whether an application of cluster analysis is possible or not.

References

Forgy, E.W., 1965: Cluster analysis of multivariate data: efficiency versus interpretability of classifications (abstract). Biometrics, 21, 768.
Fovell, R.G., Fovell, M.C., 1993: Climate zones of the conterminous United States defined using cluster analysis. Journal of Climate, 6, 2103-2135.
Gerstengarbe, F.-W., Werner, P.C., 1992: The time structure of extreme summers in Central Europe between 1901 and 1980. Meteor. Zeitschrift, N. F., 1, 285-289.
Milligan, G.W., Cooper, M.C., 1985: An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159-179.
Steinhausen, D., Langer, K., 1977: Clusteranalyse - Einführung in Methoden und Verfahren der automatischen Klassifikation. Berlin: Walter de Gruyter, 411 pp.
Taubenheim, J., 1969: Statistische Auswertung geophysikalischer und meteorologischer Daten. Leipzig: Akad. Verlagsges. Geest & Portig, 386 pp.

Authors' address: Dr. F.-W. Gerstengarbe, Dr. P.C. Werner, Potsdam Institute for Climate Impact Research, P.O. Box 60 1203, D-14412 Potsdam, Federal Republic of Germany.
