Introduction
In many empirical fields there is an increasing interest in identifying
those groupings or clusterings of the "objects" under study that best represent
certain empirically measured relations of similarity. For example, often
large arrays of data are collected, but strong theoretical structures (which
might otherwise guide the analysis) are lacking; the problem is then one of
discovering whether there is any structure (i.e., natural arrangement of the
objects into homogeneous groups) inherent in the data themselves. Recent
work along these lines in the biological sciences has gone under the name
"numerical taxonomy" [Sokal and Sneath, 1963].
Although the techniques to be described here may find useful application
in biology, medicine, and other fields as well, we shall use psychology as an
illustrative field of application. In that field, the "objects" under study might,
for example, be individual human or animal subjects, or various visual or
acoustic stimuli presented to such subjects. We might want to use measures
that we have obtained on the similarities (or psychological "proximities")
among the "objects" to classify the objects into optimally homogeneous
groups; that is, similar objects are assigned to the same group and dissimilar
objects to different groups.
Suitable data on the similarities among the objects (from which such a
natural grouping might be derived) may be obtained directly or indirectly.
For example, sometimes one obtains for every pair of objects a subjective
rating of similarity or (what is often very closely related) a measure of the
confusion or "interchangeability" of the objects. Less directly, we may
measure a number of attributes of the objects (often termed a profile of
measures) and combine them to form a single measure of similarity. Various
kinds of measures of profile similarity can be used for this purpose (e.g.,
product-moment correlation, covariance, or the sum of squared or absolute
differences between corresponding components of the profiles).
The problem, of course, is that if the number of objects is large, the
resulting array of similarity measures (containing, as it does, one value for
each pair of objects) can be so enormous that the underlying pattern or
structure is not evident from inspection alone. This paper discusses procedures
which, when applied to such an array of similarity measures, construct
a hierarchical system of clustering representations, ranging from one in
which each of the n objects is represented as a separate cluster to one in
which all n objects are grouped together as a single cluster.
An algorithm for finding such a clustering representation was sought
that would have the following features:
                 Object Number:  1   3   5   6   4   2

"Strength"    .00    [1]  [3]  [5]  [6]  [4]  [2]
or            .04    [1]  [3, 5]  [6]  [4]  [2]
"Value"       .07    [1]  [3, 5, 6]  [4]  [2]
              .23    [1, 3, 5, 6]  [2, 4]
              .31    [1, 2, 3, 4, 5, 6]

FIGURE 1
A Hierarchical Clustering Scheme
Notice the main features of such a result. The first clustering (top row)
is the "weak" clustering: each object is a cluster, so with six objects we have
six clusters. This is given the "value" or "rating" .00. Next we have a cluster-
ing with five clusters; the set [3, 5] is one cluster, and the remaining four
objects are themselves clusters. This is given the value .04. At level .07
we have a clustering with four clusters [1], [4], [2], and [3, 5, 6]. At level
.23 we have the two clusters [1, 3, 5, 6] and [2, 4], and finally at level .31
we have the "strong" clustering, with all objects in the same cluster.
We examine the following relevant features of this model. First, the
"values" start at 0 and increase strictly as we read down the table. Second,
and more important, the clusterings "increase" also, hierarchically; each
clustering (except, evidently, the first) is obtained by the merging of clusters
at the previous level. For example, if level .23 had had clusters [1, 3], [5, 6, 4],
and [2], we would not have had a hierarchical clustering; the cluster [1, 3]
cannot be obtained by merging any of the .07 level clusters. Finally, we see
that the first clustering is the weak clustering and the last is the strong
clustering.
We now abstract from this simple example to the general notion of a
hierarchical clustering scheme. We assume we have n objects, represented by
the integers 1 through n. We also have a sequence of m + 1 clusterings,
C_0, C_1, ..., C_m, and with each clustering C_j we have a number α_j, its
value. We require that C_0 be the weak clustering of the n objects, with α_0 = 0,
and that C_m be the strong clustering. We require also that the numbers α_j
increase: α_{j-1} ≤ α_j for j = 1, 2, ..., m, and that the clusterings "increase"
also, where C_{j-1} < C_j means that every cluster in C_j is the merging (or
union) of clusters in C_{j-1}. This general arrangement will be referred to as a
hierarchical clustering scheme, or HCS for short.
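To make this definition concrete, the short sketch below (in Python; the original paper contains no such code, and the names used here are purely illustrative) encodes the HCS of Figure 1 as a list of (value, clustering) pairs and checks the defining conditions: the values are nondecreasing from 0, the first clustering is weak, the last is strong, and every cluster at one level is a union of clusters from the level before.

    # Illustrative sketch only: an HCS as a list of (value, partition) pairs.
    def partition(*clusters):
        return frozenset(frozenset(c) for c in clusters)

    hcs = [
        (0.00, partition({1}, {2}, {3}, {4}, {5}, {6})),   # weak clustering
        (0.04, partition({1}, {2}, {3, 5}, {4}, {6})),
        (0.07, partition({1}, {2}, {3, 5, 6}, {4})),
        (0.23, partition({1, 3, 5, 6}, {2, 4})),
        (0.31, partition({1, 2, 3, 4, 5, 6})),             # strong clustering
    ]

    def is_hcs(hcs, objects):
        values = [v for v, _ in hcs]
        ok = values[0] == 0 and all(a <= b for a, b in zip(values, values[1:]))
        ok = ok and hcs[0][1] == partition(*({x} for x in objects))   # weak
        ok = ok and hcs[-1][1] == partition(set(objects))             # strong
        # every cluster must be a union of clusters from the previous level
        for (_, prev), (_, curr) in zip(hcs, hcs[1:]):
            for cluster in curr:
                parts = [c for c in prev if c <= cluster]
                ok = ok and cluster == frozenset().union(*parts)
        return ok

    print(is_hcs(hcs, {1, 2, 3, 4, 5, 6}))   # expected: True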
This section will demonstrate that every HCS gives rise to a particular
kind of distance, or metric, between the objects 1, 2, ..., n, and, conversely,
that every such metric gives rise to an HCS.
Given an HCS, we define a distance d between the objects by

d(x, y) = α_j ,

where C_j is the clustering of lowest value in which x and y appear in the same
cluster. The resulting distances for the HCS of Figure 1 are shown in Table 1.
TABLE 1
Distance Matrix Corresponding to Figure 1

  d     1     2     3     4     5     6
  1   .00   .31   .23   .31   .23   .23
  2   .31   .00   .31   .23   .31   .31
  3   .23   .31   .00   .31   .04   .07
  4   .31   .23   .31   .00   .31   .31
  5   .23   .31   .04   .31   .00   .07
  6   .23   .31   .07   .31   .07   .00
Now let x, y, and z be any three objects, and suppose that d(x, y) = α_j
and d(y, z) = α_k. Thus x and y are in the same cluster in C_j, and y and z
are in the same cluster in C_k. Because the clusterings are hierarchical, one
of these clusters includes the other; in fact, it is the cluster corresponding to
the larger of j and k. Let this integer be l; then in C_l, x, y, and z are all in
the same cluster. From the definition of d, we see thus that

d(x, z) ≤ α_l .

But l = max [j, k], and the α's increase as their subscripts do, so

d(x, z) ≤ α_l = max [α_j , α_k] = max [d(x, y), d(y, z)],

which is the ultrametric inequality.
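As a quick numerical check (an illustrative sketch, not part of the original paper), one can verify that the distances of Table 1 satisfy the ultrametric inequality for every triple of objects:

    # Check that the Table 1 distances satisfy d(x, y) <= max(d(x, z), d(y, z)).
    from itertools import permutations

    objects = [1, 2, 3, 4, 5, 6]
    d = {
        (1, 2): .31, (1, 3): .23, (1, 4): .31, (1, 5): .23, (1, 6): .23,
        (2, 3): .31, (2, 4): .23, (2, 5): .31, (2, 6): .31,
        (3, 4): .31, (3, 5): .04, (3, 6): .07,
        (4, 5): .31, (4, 6): .31, (5, 6): .07,
    }
    d.update({(y, x): v for (x, y), v in list(d.items())})   # symmetry
    d.update({(x, x): 0.0 for x in objects})                 # zero self-distance

    print(all(d[x, y] <= max(d[x, z], d[y, z])
              for x, y, z in permutations(objects, 3)))      # expected: True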
TABLE 2
Distance Matrix for Table 1 After First Clustering

   d       1     2   [3, 5]   4     6
   1     .00   .31    .23   .31   .23
   2     .31   .00    .31   .23   .31
 [3, 5]  .23   .31    .00   .31   .07
   4     .31   .23    .31   .00   .31
   6     .23   .31    .07   .31   .00
We proceed as before, finding the smallest nonzero entry (.07, between
[3, 5] and 6) and clustering these together to obtain a clustering, at level .07,
containing a cluster [3, 5, 6] and individual clusters [1], [2], and [4]. Once
again, we define the distance from [3, 5, 6] to 1, 2, or 4 in a unique manner,
construct another distance matrix, and so on. Eventually we end up clustering
all objects together to get the strong clustering, and we find that we have
completely reconstructed Fig. 1.
The key to the above process is being able to replace two (or more)
objects by a cluster, and still being able to define the distance between such
clusters and other objects or clusters. This property in turn depends on two
essential facts: that d satisfies the ultrametric inequality, and that, at each
stage, we cluster the minimum distances.
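The process just described can be written out compactly. The sketch below (illustrative Python, not the FORTRAN program mentioned later in this paper; the function name is mine) reconstructs an HCS from a distance matrix satisfying the ultrametric inequality by repeatedly merging the pair of clusters at the smallest nonzero distance and carrying the (unambiguous) distances over to the merged cluster.

    # Illustrative sketch: rebuild an HCS from an ultrametric distance matrix.
    def hcs_from_ultrametric(objects, d):
        clusters = [frozenset([x]) for x in objects]
        dist = {(a, b): d[next(iter(a)), next(iter(b))]
                for a in clusters for b in clusters if a != b}
        levels = [(0.0, list(clusters))]
        while len(clusters) > 1:
            (a, b), value = min(dist.items(), key=lambda kv: kv[1])
            merged = a | b
            clusters = [c for c in clusters if c not in (a, b)] + [merged]
            # For an ultrametric, d(a, z) == d(b, z) for every other cluster z,
            # so the distance from the merged cluster is defined unambiguously.
            new_dist = {}
            for x in clusters:
                for y in clusters:
                    if x != y:
                        xx = a if x == merged else x
                        yy = a if y == merged else y
                        new_dist[x, y] = dist[xx, yy]
            dist = new_dist
            levels.append((value, list(clusters)))
        return levels

    # With the Table 1 distances (the dictionary d defined above), this recovers
    # the clusterings of Figure 1; the two merges that Figure 1 shows at the
    # single level .23 appear here as two successive steps with the same value.
    for value, clustering in hcs_from_ultrametric([1, 2, 3, 4, 5, 6], d):
        print(value, [sorted(c) for c in clustering])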
We now generalize this method, to enable us to get an HCS, given n objects
and a metric d on them which satisfies the ultrametric inequality.
By assumption,

d(y, z) ≤ d(x, z),

and, since x and y are merged at the smallest remaining distance, d(x, y) ≤ d(y, z).
Thus, by the ultrametric inequality,

d(x, z) ≤ max [d(x, y), d(y, z)] = d(y, z),

so that d(x, z) = d(y, z), and the distance from the merged cluster to z is well
defined.
When d does not satisfy the ultrametric inequality, the two distances d(x, z)
and d(y, z) need not be equal; we may instead define the distance between the
merged cluster [x, y] and any third object or cluster z to be f(d(x, z), d(y, z))
for some function f, and then proceed as in Sect. I, above. It would be natural
to require that if d(x, z) = d(y, z), then

f(d(x, z), d(y, z)) = d(x, z) = d(y, z).
Then if d satisfies the ultrametric inequality, the process will give the same
HCS as the "natural" one described in Sect. I. This still leaves us with a
large number of choices for f: geometric means, various weighted averages,
and so on. We evidently need stronger conditions on the function f.
This work was strongly influenced by the work of Shepard [1962a, 1962b]
and Kruskal [1964] on multidimensional scaling, in which the results are
invariant under monotone transformations of the similarity matrix. Since
much data of psychological interest is of this type, it seemed worthwhile
to try to develop a clustering program with this feature. Immediately we
ruled out most of the common functions for f, since the operations of addi-
tion, multiplication, square root, and so on are not monotone invariant.
The functions max and min, however, give rise to monotone invariant
clustering methods; the corresponding methods may be summarized as
follows:
Minimum Method:  d([x, y], z) = min [d(x, z), d(y, z)],

Maximum Method:  d([x, y], z) = max [d(x, z), d(y, z)],

when x and y are two objects and/or clusters of C_{j-1} which cluster in C_j,
and z is any third object or cluster of C_{j-1}.
NOTE: It is tacitly assumed in the discussion of the methods that the dis-
tances in the original matrix are all distinct except for 0. This is not important
in the Minimum Method, but difficulties do arise when applying the Maxi-
mum Method to matrices with large numbers of identical entries. In practice
this restriction rarely produces an ambiguous result.
The above two methods are clearly related to the method described
in Sect. I. In particular, if d satisfies the ultrametric inequality, the two
methods reduce to the method of Sect. I, as promised.
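For concreteness, here is a compact sketch of both methods (again illustrative Python rather than the published FORTRAN program; the function and variable names are mine). It differs from the ultrametric reconstruction sketched above only in that the distance from a merged cluster to any third cluster is now formed explicitly with min or max.

    # Illustrative sketch of the Minimum and Maximum Methods on an arbitrary
    # distance matrix.  dist maps frozenset({i, j}) of object labels to a distance.
    def johnson_cluster(objects, dist, method):
        combine = min if method == "minimum" else max
        clusters = [frozenset([x]) for x in objects]
        d = {frozenset([a, b]): dist[frozenset([next(iter(a)), next(iter(b))])]
             for a in clusters for b in clusters if a != b}
        levels = [(0.0, list(clusters))]
        while len(clusters) > 1:
            pair, value = min(d.items(), key=lambda kv: kv[1])
            a, b = pair
            merged = a | b
            rest = [c for c in clusters if c not in (a, b)]
            # distance from the merged cluster to every remaining cluster
            new_d = {frozenset([merged, z]): combine(d[frozenset([a, z])],
                                                     d[frozenset([b, z])])
                     for z in rest}
            # distances among the remaining clusters are unchanged
            new_d.update({frozenset([x, y]): d[frozenset([x, y])]
                          for x in rest for y in rest if x != y})
            clusters = rest + [merged]
            d = new_d
            levels.append((value, list(clusters)))
        return levels

    # a small hypothetical example:
    objs = ["a", "b", "c", "d"]
    dm = {frozenset(p): v for p, v in
          [(("a", "b"), 2.0), (("a", "c"), 6.0), (("a", "d"), 10.0),
           (("b", "c"), 5.0), (("b", "d"), 9.0), (("c", "d"), 4.0)]}
    for value, clustering in johnson_cluster(objs, dm, "maximum"):
        print(value, [sorted(c) for c in clustering])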
Monotone Invariance
One of the requirements that we have set for our methods is that the
solutions be invariant under monotone transformations of the original data.
Monotone invariant processes are those which are dependent only on the
rank order of the data. In our methods, we use the matrix elements twice;
first, we find the smallest nonzero matrix element, and second, we form the
maximum or minimum of two matrix elements. Both these processes may
be carried out knowing nothing of the data except the rank order. Thus the
clusterings are unaffected by monotone transformations of the similarity
matrix. The value assigned to each clustering is itself one of the original
matrix entries, selected purely on the basis of rank order; thus a monotone
transformation of the similarity matrix transforms the values of the
clusterings but leaves the clusterings themselves invariant.
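As a small illustration of this invariance (using the hypothetical johnson_cluster sketch and the example matrix dm given earlier), squaring all the distances is a monotone transformation of these positive values; it changes the values attached to the clusterings but not the clusterings themselves:

    # Squaring positive distances preserves their rank order, so the sequence of
    # clusterings is unchanged; only the values attached to them are transformed.
    dm_squared = {pair: v ** 2 for pair, v in dm.items()}
    before = [set(c) for _, c in johnson_cluster(objs, dm, "minimum")]
    after = [set(c) for _, c in johnson_cluster(objs, dm_squared, "minimum")]
    print(before == after)   # expected: True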
Given a clustering, we say that the chain distance d' from x to y is the minimal
chain size of all chains from x to y:

d'(x, y) = min [size of chain],

where the minimum is taken over all chains from x to y.
It turns out that d' satisfies the ultrametric inequality and is indeed associated
with the HCS we obtain from the Minimum Method. The chain distance
intuitively measures a kind of connectedness of x and y through intermediate
points. We may thus describe the value of a Minimum Method clustering by

Value of Clustering = max [d'(x, y)],

where the maximum is taken over all x and y in the same cluster.
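A chain distance of this kind is easy to compute directly. In the sketch below (illustrative only; here the "size" of a chain is taken to be its largest link, which is the reading under which d' reproduces the Minimum Method), d'(x, y) is found by a minimax variant of shortest-path relaxation, applied to the Table 1 dictionary d defined earlier:

    # Illustrative minimax chain distance: the size of a chain is taken to be its
    # largest link, and d'(x, y) is the smallest such size over all chains x -> y.
    def chain_distance(objects, d, x, y):
        best = {obj: float("inf") for obj in objects}   # best chain size so far
        best[x] = 0.0
        unvisited = set(objects)
        while unvisited:
            u = min(unvisited, key=lambda o: best[o])
            unvisited.remove(u)
            if u == y:
                return best[y]
            for v in unvisited:
                # extend the best chain ending at u by the single link (u, v)
                best[v] = min(best[v], max(best[u], d[u, v]))
        return best[y]

    print(chain_distance(objects, d, 1, 2))   # .31: for an ultrametric, d' equals d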
[Figure 2: clustering diagram over the sixteen consonants, at similarity values
ranging from 2.635 down to 0.279; diagram not reproduced.]

FIGURE 2
The HCS Obtained on the Basis of Miller and Nicely's Table VII
by the Minimum Method
[Figure 3: clustering diagram over the sixteen consonants, at similarity values
ranging from 2.635 down to 0.000; diagram not reproduced.]

FIGURE 3
The HCS Obtained on the Basis of Miller and Nicely's Table VII
by the Maximum Method
For these data the Maximum and Minimum Methods yield very similar
results. The principal difference between the two representations is confined
to the order in which the last three clusters [p, t, k, f, θ, s, ʃ], [b, d, g, v, ð, z, ʒ],
and [m, n] combine into two clusters. For the Maximum Method the last
two of these combine with each other before joining the first, whereas for
the Minimum Method the first two combine with each other before joining
the last. Otherwise, although the precise numerical values associated with
the clusterings differ somewhat between the two methods, the topological
structures of the two representations are alike. That is, above the level of
three clusters, we find that exactly the same subclusters appear in both
representations and (consequently) that each such subcluster divides into
exactly the same sub-subclusters. This close agreement suggests that these
data do not seriously violate the assumed ultrametric structure.
Moreover, both of the obtained HCS's are meaningfully related to the
distinctive features presumed [e.g., by Miller and Nicely, 1955] to govern
the discrimination of consonant phonemes. At the level of five clusters, for
example, the sixteen phonemes divide into the unvoiced stops [p, t, k], the
corresponding voiced stops [b, d, g], the unvoiced fricatives [f, θ, s, ʃ],
the corresponding voiced fricatives [v, ð, z, ʒ], and the (voiced) nasals
[m, n]. Then, at the level of three clusters, the stops and fricatives coalesce,
separately for the voiced and for the unvoiced phonemes, to yield just the
nasals, the remaining voiced consonants, and the corresponding unvoiced
consonants.
Analyses of others of Miller and Nicely's matrices (which were obtained
under different conditions of filtering) led to clusterings that, although highly
consistent (across independent sets of data), departed systematically from
the HCS's presented here for their Table VII. These divergent results will
be covered in a forthcoming report by Shepard; their detailed discussion here
would require too extensive a detour into the substantive problems of psycho-
acoustics. One further observation should perhaps be made here, though,
regarding these further analyses. In this particular kind of application, at
least, it has generally appeared that, to the extent that there is an appreciable
departure between the HCS's obtained by the Maximum and Minimum
Methods, the results of the Maximum Method are the more
meaningful or interpretable. That is, the search for compact clusters (of
small over-all "diameter") has proved more useful than the search for in-
ternally "connected" but potentially long chain-like clusters. The reverse
may of course prove to be true in other types of applications.
V. Discussion
Relation to Other, Similar Methods
Although the methods described here were developed independently,
they were subsequently found to be closely related to some methods that
had been proposed previously.
A Computer Program
Another step that has been taken here is the construction of a computer
program that will carry out both the Maximum and Minimum Methods on
an arbitrary matrix of similarities or "proximities." (The program is written
in FORTRAN and is suitable for IBM machines of the 709-7090 class.)
The solutions displayed in the present Figures 2 and 3 were in fact computed
and printed out (in the form shown) by this program. When necessary the
program also determines an appropriate reordering of the "objects" so
that such a table can be constructed. On an IBM 7094, the analysis is com-
pleted quite rapidly; in another application with 64 objects, solutions were
obtained for both the Minimum and Maximum Methods in just 10.1 seconds.
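For readers who wish to reproduce this kind of analysis with present-day software (a suggestion of mine, not part of the original program), the Minimum and Maximum Methods correspond to what are now commonly called single-linkage and complete-linkage clustering, e.g. as implemented in scipy:

    # Minimal modern sketch (assumes numpy and scipy are available).
    import numpy as np
    from scipy.spatial.distance import squareform
    from scipy.cluster.hierarchy import linkage

    # Table 1 distance matrix for objects 1..6, as a full symmetric array.
    D = np.array([
        [.00, .31, .23, .31, .23, .23],
        [.31, .00, .31, .23, .31, .31],
        [.23, .31, .00, .31, .04, .07],
        [.31, .23, .31, .00, .31, .31],
        [.23, .31, .04, .31, .00, .07],
        [.23, .31, .07, .31, .07, .00],
    ])
    Z_min = linkage(squareform(D), method="single")    # Minimum Method
    Z_max = linkage(squareform(D), method="complete")  # Maximum Method
    print(Z_min)   # merge heights reproduce the values .04, .07, .23, .31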
Possible Extensions
Sokal and Sneath [1963, p. 190] have pointed out that, in methods like
our Minimum or Maximum Methods, the merging of two clusters depends
upon a single similarity value (viz., the least or greatest in the appropriate
set). They suggest that, for greater robustness of the solution, it may some-
times be desirable to use some sort of average value instead. As we have
already noted, to base such a procedure upon averages of the more obvious
types is to lose the invariance, sought here, under monotone transformations
of the similarity values. More importantly, the solutions would no longer
REFERENCES
Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 1964, 29, 1-27.
McQuitty, L. L. Hierarchical linkage analysis for the isolation of types. Educational and Psychological Measurement, 1960, 20, 55-67.
Miller, G. A. and Nicely, P. E. An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 1955, 27, 338-352.
Shepard, R. N. Analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 1962a, 27, 125-140.
Shepard, R. N. Analysis of proximities: Multidimensional scaling with an unknown distance function. II. Psychometrika, 1962b, 27, 219-246.
Sneath, P. H. A. The application of computers to taxonomy. Journal of General Microbiology, 1957, 17, 201-226.
Sokal, R. R. and Sneath, P. H. A. Principles of Numerical Taxonomy. San Francisco: W. H. Freeman, 1963.
Sørensen, T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter, 1948, 5(4), 1-34.
Ward, J. H., Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 1963, 58, 236-244.