0% found this document useful (0 votes)

74 views14 pages

Johnson1967 PDF

This document describes hierarchical clustering schemes (HCS), which arrange objects into successively merged clusters. The author develops a correspondence between HCS and distance metrics. Given an HCS, a distance metric can be defined for each pair of objects based on the level at which they are first clustered together. Conversely, a distance metric can be used to recover the clustering structure of an HCS. Two rapid clustering methods are presented that generate HCS from a similarity matrix in a way that is invariant to monotonic transformations of the data.

Uploaded by

Kunal Ambardekar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

74 views14 pages

Johnson1967 PDF

Uploaded by

Kunal Ambardekar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

PSYCI~OI~ETRIKA--VOL32~ NO.

3
SEPTEI~IBER, 1967

HIERARCHICAL CLUSTERING SCHEMES*

STEPHEN C. JOHNSON
BELL TELEPHONE LABORATORIES,
MURRAY HILL, NEW JERSEY

Techniques for partitioning objects into optimally homogeneous

groups on the basis of empirical measures of similarity among those objects
have received increasing attention in several different fields. This paper
develops a useful correspondence between any hierarchical system of such
clusters, and a particular type of distance measure. The correspondence
gives rise to two methods of clustering that are computationally rapid and
invariant under monotonic transformations of the data. In an explicitly
defined sense, one method forms clusters that are optimally "connected,"
while the other forms clusters that are optimally "compact."

Introduction
I n m a n y empirical fields there is an increasing interest in identifying
those groupings or clusterings of the "objects" under study that best represent
certain empirically measured relations of similarity. For example, often
large arrays of data are collected, but strong theoretical structures (which
might otherwise guide the analysis) are lacking; the problem is then one of
discovering whether there is any structure (i.e., natural arrangement of the
objects into homogeneous groups) inherent in the data themselves. Recent
work along these lines in the biological sciences has gone under the name
"numerical t a x o n o m y " [Sokal, 1963].
Although the techniques to be described here m a y find useful application
in biology, medicine and other fields as well, we shall use psychology as an
illustrative field of application. In t h a t field, the "objects" under study might,
for example, be individual h u m a n or animal subjects, or various visual or
acoustic stimuli presented to such subjects. We might want to use measures
t h a t we have obtained on the similarities (or psychological "proximities")
among the "objects" to classify the obiects into optimally homogeneous
groups; t h a t is, similar obiects are assigned to different groups.
Suitable data on the similarities among the objects (from which such a
natural grouping might be derived) m a y be obtained directly or indirectly.
For example, sometimes one obtains for every pair of objects a subjective

*I am indebted to R. N. Shepard and J. D. Carroll for many stimulating discussions

about this work, and for aid in preparing this paper.
241
242 PSYCHOMETRIKA

rating of similarity, or, (what is often very closely related) a measure of the
confusion or "interchangeability" of the objects. Less directly, we may
measure a number of attributes of the objects (often termed a profile of
measures) and combine them to form a single measure of similarity. Various
kinds of measures of profile similarity can be used for this purpose (e.g.,
product-moment-correlation, covariance, or the sum of squared or absolute
differences between corresponding components of the profiles).
The problem of course, is that if the number of objects is large, the re-
sulting array of similarity measures (containing, as it does, one value for
each pair of objects) can be so enormous that the underlying pattern or struc-
ture is not evident from inspection alone. This paper discusses procedures
which, when applied to such an array of similarity measures, constructs
a hierarchical system of clustering representations, ranging from one in
which each of the n objects is represented as a separate cluster to one in
which all n objects are grouped together as a single cluster.
An algorithm for finding such a clustering representation was sought
that would have the following features:

1. The input should consist solely of the n ( n - 1)/2 similarity measures

among the n objects under study. This is in contrast to some previous methods
which additionally require that each object be initially represented as a point
in Euclidean space. (In many applications the restriction to a representa-
tion of the grouping in the concrete, spatial sense of an Euclidean metric
seems unnecessarily and undesirably severe).
2. There should be a clear, explicit, and intuitive description of the
clustering; i.e., the clusters should mean something. Some of the published
clustering methods have nice algorithms, but when they have been carried
out it is difficult to see exactly what problem has been solved.
3. The clustering procedure should be essentially invariant under
monotone transformations of the similarity data. Often in psychology we
have confidence in our data only up to rank-order; the absolute numbers
obtained from the experiments may lie along virtually any scale. The method
of Ward [1963], which inspired much of this current study, is indeed so
general as to permit :monotone invariant methods, but they are not explicitly
treated.

The notion of a hierarchical clustering scheme, the central idea of this

paper, was abstracted from examples given by Ward [1963]. We first consider
such schemes, and develop a correspondence between hierarchical clustering
schemes and a certain type of metric. Two recursive methods are then given
for obtaining hierarchical clustering schemes from a given similarity matrix,
and finally the significance of these two methods is discussed and illustrated
by application to real data.
S T E P H E N C. J O H N S O N 2~

I. Clusterings and Metrics

Figure 1 gives the typical results of a hierarchical clustering method,
such as those discussed b y Ward [1963] and others.

Object Number

1 3 5 6 4 2
.00
"Strength" .04 X X X X X
or .07 X X X X X X X X X
"Value" .23 X X X X X X X X X X X X X X X X X X
.31 X X X X X X X X X X X X X X X X X X-X X X
Fm~ms 1
A Hierarchical Clustering Scheme

Notice the main features of such a result. T h e first clustering (top row)
is the "weak" clustering--each object is a cluster, so with six objects we have
six clusters. This is given the "value" or "rating" .00. Next we have a cluster-
ing with five clusters; the set [3, 5] is one cluster, and the remaining four
objects are themselves clusters. This is given the value .04. At level .07
we have a clustering with four clusters [1], [4], [2], and [3, 5, 6]. At level
.23 we have the two clusters [1, 3, 5, 6] and [2, 4], and finally at level .31
we have the "strong" clustering, with all objects in the same cluster.
We examine the following relevant features of this model. First, the
"values" start at 0 and increase strictly as we read down the table. Second,
and more important, the clusterings "increase" also, hierarchically; each
clustering (except, evidently, the first) is obtained b y the merging of clusters
at the previous level. For example, if level .23 had had clusters [1, 3], [5, 6, 4],
and [2] we would have not had a hierarchical clustering; the cluster [1, 3]
cannot be obtained b y merging any of the .07 level clusters. Finally we see
t h a t the first clustering is the weak clustering and the last is the strong
clustering.
We now abstract from this simple example to the general notion of a
hierarchical clustering scheme. We assume we have n objects, represented b y
the integers I through n. We have also a sequence of m Jr 1 clusterings,
Co , C~ , . - - , C~ , and with each clustering Ci we have a number as , its
value. W e require t h a t Co be the weak clustering of the n objects, with ao = 0,
and t h a t C~ be the strong clustering. We require also t h a t the numbers as
increase; a;_~ _< ai , for j = 1, 2, . . . , m, and the clusters "increase" also,
where again C;_1 < C~ means t h a t every cluster in Ci is the merging (or
union) of clusters in C~_~ . This general arrangement will be referred to as a
hierarchical clustering scheme, or HCS for short.
This section will demonstrate that every HCS gives rise to a particular
kind of distance, or ~netrie, between the objects 1, 2, • • • , n, and, conversely,
244 PSYCHOMETRIKA

t h a t given such a metric we m a y recover the H C S from it. This reduces

the s t u d y of H C S ' s to the s t u d y of these metrics.
First, we shall assume t h a t we are given an HCS, a sequence of cluster-
ings Co , - - - , C,, with values ao , . . . , ~ • F o r each pair x, y of objects,
we shall define d(x, y), a number, and prove t h a t d is a metric.
W e define d as follows: given the two objects x and y, we notice t h a t
in Cm (the strong clustering) x and y are in the same cluster. L e t j be the
least integer in the set [0, I, . . . , m] such that, in the clustering Ci , x and
y are in the same cluster. We define

d(x, y) = .~.

F o r example, in Fig. 1 we h a v e d(d, 5) = .04 (since 3 and 5 are clustered

a t level .04 b u t not at level .0), d(1, 4) = .31 (since 1 and 4 are clustered
a t level .31 b u t not at level .23), d(1, 6) = .23, d(5, 5) = .00, d(4, 2) = .23,
and so o n - - t h e complete distance matrix is given in T a b l e 1

TABLE 1
Distance Matrix Corresponding to Figure 1

d 1 2 3 4 5 6

1 0 .31 .23 .31 .23 .23

2 .31 0 .31 .23 .31 .31
3 .23 .31 0 .31 .04 .07
4 .31~ .23 .31 0 .31 .31
5 .23 .31 .04 .31 0 .07
6 .23 .31 .07 .31 .07 0

A few things are immediate f r o m the definition: for example x and x

are in the same cluster (evidently!) for all C; ; 0 is the smallest j, so b y defini-
tion

d(x, x) -'- ao --- O.

Conversely, if d(x, y) = 0 for some x and y, it implies t h a t x and y

are in the same cluster in Co--but, Co being the weak clustering, the only
element in the same cluster with x is x i t s e l f - - t h a t is, d(x, y) -- 0 implies
x = y. T h u s d(x, y) = 0 if and only if x = y.
W e see also t h a t d(x, y) = d(y, x) for all objects x and y. To show t h a t
d is a good metric it remains only to show the triangle inequality. L e t x, y,
and z be a n y three objects, and let
d(x, y) = ai
d(y, z) = ak
S T E P H E N C° J O H N S O N 245

Thus x and y are in the same cluster in C~, and y and z are in the same
cluster in C , . Because the clusterings are hierarchical, one of these clusters
includes the other; in fact, t h a t cluster corresponding to the larger of j
and k. Let this integer be ~, then in C~, x, y, and z are all in the same cluster.
From the definition of d, we see thus t h a t
d ( z , y) < ~ , .

But l = max [j, k], and the a's increase as their subscripts do, so

ac -- max [a~ , ak],

or, finally,
d(x, z) <_ max [d(x, y), d(y, z)].
This is called the ultrametric inequality; we have shown that d satisfies
it. I t is plainly stronger than the triangle inequality, which would merely
require
d(x, z) < d(x, y) --k d(y, z),
for it is evident that
max [d(x, y), d(y, z)] ~ d(x, y) q- d(y, z),
SO

d(x, z) <_ d(x, y) -b d(y, z).

Thus we have taken a HCS and obtained a metric d on the objects
which satisfies the ultrametric inequality. We now do the converse--given
a distance matrix (such as Table 1) representing some metric d which satisfies
the ultrametric inequality, we will construct a HCS (such as Fig. 7) from it.
At level 0, we have the weak clustering--six clusters, each with but one object
in them. The smallest element of the distance matrix, aside from the O's,
is the .04 entry t h a t appears between objects 3 and 5. Accordingly, we create
a clustering with value .04 with 3 and 4 in the same cluster, and the other
objects constituting clusters by themselves. Now, we notice one very nice
property of Table 1 : 3 and 5 are exactly the same distance from any other
object--that is, if x denotes object 1, 2, 4, or 6 - - then

d(3, x) = d(5, x), all x.

Thus, in fact it makes sense to talk about the distance from x to the
cluster [3, 5].
We indicate this distance in Table 2. In effect we have a new object,
[3, 5], which replaces 3 and 5 in the matrix. B u t now we get our next cluster-
ing by using the matrix in Table 2 and applying the same process, i.e., taking
246 PSYCHOMETRIKA

TABLE 2
Distance Matrix for Table 1 After First Clustering

1 2 [3, 5] 4 6

1 0 .31 .23 .31 .23

2 .31 0 .31 .23 .31
[3, ~ .23 .31 0 .31 .07
4 .31 .23 .31 0 .31
6 .23 .31 .07 .31 0

the smallest nonzero entry (.07, between {3, 5} and 6) and clustering these
together to obtain a clustering, at level .07, containing a cluster [3, 5, 6]
and individual clusters [1], [2], and [4]. Once again, we define the distance
from [3, 5, 6] to 1, 2, or 4 in a unique manner, construct another distance
matrix, and so on. Eventually we end up clustering all obiects together to
get the strong clustering, and we find that we have completely reconstructed
Fig. 1.
The key to the above process is being able to replace two (or more)
objects by a cluster, and still being able to define the distance between such
clusters and other objects or clusters. This property in turn depends on two
essential facts: that d ~utisfies the ultrametric inequality, and that, at each
stage, we cluster the minimum distances.
We now generalize this method, to enable us to get a HCS, given n objects
and a metric d on them which satisfies the ultrametric inequality.

Step 1. Clustering Co, with value 0, is the weak clustering.

Step 2. Assume we are given the clustering Ci-1 with the distance matrix
between each cluster or object and every other. Let ai be the smallest
nonzero entry in the matrix. Merge the pair of points and/or clusters
with distance a i , to create C~, of value a i .
Step 3. We may create a new distance matrix, treating the new clusters
as objects, in an unambiguous manner.
That is, if x and y are two objects (possibly clusters) at level
Ct-z, and if d(x, y) = c~ (so that x and y become clustered in C;),
and if z is any other object or cluster at level C~_~ , then d(x, z ) =
d(y, z). The proof of this is easily sketched--if d(x, z) ~ d(y, z)
one must be larger--say d(x, z) > d(y, z). Then, however, the ultra-
metric inequality demands

d(x, z) < max (d(x, y), d(y, z))

max (cL~ , d(y, z))
STEPHEN C. JOHNSON 247

B y assumption,
d(y, z) < d(x, z).
Thus,
d(x, z) < _ . , .

B u t a; was chosen to be the least nonzero distance in the matrix; thus

d(x, z) = at •
This then in turn requires that
d(y, z) < d(x, z) = a, ,
a contradiction, since no nonzero distance can be strictly smaller
than ai • Thus the hypothesis that d(x, z) ~ d(y, z) leads to a con-
tradiction; the two distances must be equal, and we can construct
our reduced matrix.
Step 4. We now repeat Steps 2 and 3 until we finally obtain the strong
clustering--we are then finished.

This procedure evidently produces a hierarchical clustering scheme,

since each clustering is a merging of clusters from the previous clustering
and the ai increase. There is thus a complete correspondence between HCS's
on the one hand, and metrics satisfying the ultrametric inequality on the
other.

II. The Two Methods

In the first section we developed a natural way of going from a metric d,
satisfying the ultrametric inequality, to an HCS. In general, however, (if
only because of noisy data) the similarity matrix does not satisfy the ultra-
metric inequality--we will thus try to modify our method to give us reasonable
clusterings in this case.
When we went from a metric to an tICS in Sect. I, we required the
ultrametric inequality only in Step 3; we assumed that we have clusters a n d / o r
objects x and y from C~-1 which clustered in C~ (i.e., d(x, y) = ai). We then
took any third cluster or object z and attempted to define d([x, y], z). The
ultrametric inequality told us that d(x, z) = d(y, z), and thus led to a natural
definition:
d([x, y], z) = d(z, z) -- d(y, z).
In general we m a y not expect d(x, z) = d(y, z), but we stiff may formally
define d([x, y], z) as some function )¢ of d(x, z) and d(y, z),

d([x, y], z) = l(d(x, z), d(y, z)),

243 PSYCH0~IETRIKA.

and then proceed as :in Sect. I, above. It would be natural to require that if
d(x, z) = d(y, z), then
f(d(x, z), d(y, z)) = d(x, z) -- d(y, z).

Then if d satisfies the ultrametric inequality, the process will give the same
HCS as the "natural" one described in Sect. I. This still leaves us with a
large number of choices for/--geometric means, various weighted averages,
and so on. We evidently need stronger conditions on the function f.
This work was strongly influenced by the work of Shepard [1962a & b
and Kruskal [1964] on multidimensional scaling, in which the results are
invariant under montone transformations of the similarity matrix. Since
much data of psychological interest is of this type, it seemed worthwhile
to try to develop a clustering program with this feature. Immediately we
ruled out most of the common functions for 1, since the operations of addi-
tion, multiplication, square root, and so on are not monotone invariant.
The functions max and rain, however, give rise to monotone invariant
clustering methods; the corresponding methods may be summarized as
follows:

M i n i m u m Method: Given a similarity function d on n objects, we build

an HCS as follows:
Step 1. Clustering Co, with value 0, is the weak clustering.
Step 2. Assume we are given the clustering C;-1 with the similarity func-
tion d, defined for all objects or clusters in Ci-1. Let at be a minimal
nonzero entry in the matrix. Merge the pair of objects and/or
clusters with distance a; to create Ca, of value ai •
Step 3. We create a new similarity function for Ci in the following manner:
if x and y are clustered in Ca and not in Ci_l (i.e., d(x, y) = at)
we define the distance from the cluster Ix, y] to any third object
or cluster, z, by

d([x, y], z) = min [d(x, z), d(y, z)].

If x and y are objects and/or clusters in Ci-1 not clustered in Ci ,

d(x, y) remains the same. We obtain a new similarity function d for
C4 in this way.
Step 4. We now repeat Steps 2 and 3 until we finally obtain the strong
clustering--we are then finished.

M a x i m u m Method. Same as the Minimum Method, except in Step 3, where

we define
d([x, y], z) -= max [d(x, z), d(y, z)]
STEPHEN C. J O H N S O N 249

when x and y are two objects and/or clusters of Ci-~ which cluster in Ci ,
and z is any third object or cluster of C~.
NOTE: It is tacitly assumed in the discussion of the methods that the dis-
tances in the original matrix are all distinct except for 0. This is not important
in the Minimum Method, but difficulties do arise when applying the Maxi-
mum Method to matrices with large numbers of identical entries. In practice
this restriction rarely produces an ambiguous result.
The above two methods are clearly related to the method described
in Sect. I. In particular, if d satisfies the ultrametric inequality, the two
methods reduce to the method of Sect. I, as promised.

III. Nature o] the Solutions

Monotone Invariance
One of the requirements that we have set for our methods is that the
solutions be invariant under monotone transformations of the original data.
Monotone invariant processes are those which are dependent only on the
rank order of the data. In our methods, we use the matrix elements twice;
first, we find the smallest nonzero matrix element, and second, we form the
maximum or minimum of two matrix elements. Both these processes may
be carried out knowing nothing of the data except the rank order. Thus the
clusterings are unaffected by monotone transformations of the similarity
matrix. The values assigned to the clusterings also are determined merely
by rank order--thus a monotone transformation of the similarity matrix
transforms the values of the clus~erings, but leaves the clusterings invariant.

What the Clusterings Mean

Both methods depict basic attributes of the original similarity matrix.
In particular, the values assigned to the clusterings have simple meanings
in the two methods.
If we are given a clustering obtained by the Maximum Method, we may
represent the value of the clustering as follows: for each cluster in the cluster-
ing, compute the diameter of the cluster (the largest intra-cluster distance).
For a given Maximum Method clustering, the value of the clustering is
the maximum diameter o] the clusters in the clustering. At any stage, the
distance from object/cluster x to object/cluster y is exactly the diameter
of the set x union y. This gives us a simple means of visualizing the cluster-
ings--the Maximum Method attempts at each stage to minimize the diameter
of the clusters.
The analysis for the Minimum Method is slightly more involved, but
equally basic. A chain from object x to object y is any sequence of objects
zo, z~, • • • , z, with zo -- x and ze -- y. The size of a chain is the largest link
distance:
250 PSYCHOMETRtKA

size = max [d(zi-, , zi)].

i~l,...,~

Given a clustering, we say that the chain distance d' from x to y is the minimal
chain size of all chains from x to y;
d'(x, y) = min [size of chain.]
al I ehatna
from z to

I t turns out that d' satisfies the ultrametric inequality and is indeed associated
with the HCS we obtain from the Minimum Method. The chain distance
intuitively measures a kind of connectedness of x and y through intermediate
points. We may thus describe the value of a Minimum Method clustering by
Value of Clustering = max [d'(x, y)].
z and
in s a m e cluster

The above statements may be proved easily by induction; the proofs

are omitted here as not being central to the argument.

IV. Illustrative Application

In the course of exploring some alternative methods for analyzing exist-
ing data on the confusability of various speech sounds under different condi-
tions of filtering and noise, R. N. Shepard recently applied the two methods
described here to the data of Miller and Nicely [1955] on confusions among
sixteen English consonants. As an illustration of the kind of meaningful
results that can be obtained in this way we present the HCS's that were
obtained for one of Miller and Nicely's sets of data; viz., their Table VII
(based on the condition in which only the low audio frequencies from 200
to 300 cps were passed).
For the purpose of applying the present methods, a symmetric matrix
was constructed givhlg, for each of the n(n -- 1)/2 pairs of consonants x
and y, a measure of their similarity, s(x, y), defined by
l(z, ~ + f(y, x.)
s(x, y) = ](x, x) ](y, y) '
where ](x, y), for example, is the frequency with which the consonant x
was heard as the consonant y according to Miller and Nicely's Table VII.
In the analysis, then, the similarity estimate, s(x, y), is treated as an approxi-
mately monotonically decreasing function of an assumed underlying dis-
tance d(x, y).
The representations obtained by the Maximum and Minimum Methods
are shown in Figs. 2 and 3, respectively. Across the top, in each table, are
indicated the phonetic symbols for the sixteen consonants studied by Miller
and Nicely. (In both cases, it turned out that the diagram for the HCS
could be constructed with these consonants in the same order and, indeed,
in exactly the order in which they originally were listed by Miller and Nicely.)
STEPHEN C. J O H N S O N 251

T h e n u m b e r s d o w n t h e l e f t - h a n d side of each t a b l e are t h e s i m i l a r i t y v a l u e s

associated w i t h each clustering i n t h e hierarchical r e p r e s e n t a t i o n . ( N o t i c e
t h a t since s(x, y) is inversely r e l a t e d t o distance, these n u m b e r s b e g i n a t
for t h e weak c l u s t e r i n g a n d decrease.

Consonants
Similarity
Value p tk fo ,S b d # v ~z 5m n

2.635 kx:
2.234: XXX XXX
2.230 XXX XXX XXX
2.185 XXX XXXXX XXX
2.123 XXX XXXXX XXX xxx
2.108 XXXXX XXXXX XXX XXX
1.870 XXXXX XXXXX XXXXX XXX
1.683 XXXXX XXXXXXX XXXXX XXX
1.604 XXXXX XXXXXXX XXXXX XXX KXX
1.577 XXXXX X X X ~ X X XXXXX XXXXX .~XX
1.567 XXXXX XXXXXXX XXXXX ~XXXXXX XXX
1.065 XXXXXXXXXXXXX XXXXX XXXXXXX XXX
1.009 XXXXXXXXXXXXX XXXXXXXXXXXXX XXX
0.425 XXXXXXXXXXXXXXXXXXXXXXXXXXX XXX
O.279 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
FIGuRv. 2
The HCS Obtained on the basis of Miller and Nicely's Table VII
by the Minimum Method

Consonants
Similarity
Value p tk /e s S b d g , ~z 5m n
co

2.635 xxx
2.234 xxx XXX
2.230 XXX . XXX XXX
2.123 XXX . XXX XXX
1.855 XXXXX XXX XXX XXX
1.683 XXXXX XXXXX XXX XXX
1.604 XXXXX XXXXX XXX XXX X X X
1.525 XXXXX XXXXX XXX XXXXX XXX
1.186 XXXXX XXXXX XXX X X X X X X X X X X
1.119 XXXXX XXXXX XXXXX XXXXXXX XXX
0.939 XXXXX XXXXXXX XXXXX X X X X X X X X X X
0.422 X X X X X X X X X X X X X XXXXX XXXXXXX XXX
0.302 XXXXXXXXXXXXX XXXXXXXXXXXXX XXX
0.019 XXXXXXXXXXXXX XXXXXXXXXXXXXXXXX
0.000 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
FIGURE 3
The HCS obtained on the basis of Miller and Nieely's Table VII
by the Maximum Method
252 pSYCHOMETRIK&

For these data the Maximum and Minimum Methods yield very similar
results. The princip~d difference between the two representations is confined
to the order in which the last three clusters [p, t, k, f, o, s, S], [b, d, g, v, ~, z, 3],
and [m, n] combine into two clusters. For the Maximum Method the last
two of these combine with each other before joining the first, whereas for
the Minimum Method the first two combine with each other before joining
the last. Otherwise, although the precise numerical values associated with
the clusterings differ somewhat between the two methods, the topological
structures of the two representations are alike. That is, above the level of
three clusters, we find that exactly the same subclusters appear in both
representations and (consequently) that each such subcluster divides into
exactly the same sub-subclusters. This close agreement suggests that these
data do not seriously violate the assumed ultrametric structure.
Moreover, both of the obtained HCS's are meaningfully related to the
distinctive features presumed [e.g., by Miller and Nicely, 1955] to govern
the discrimination of consonant phonemes. At the level of five clusters, for
example, the sixteen phonemes divide into the unvoiced stops [p, t, k], the
corresponding voiced stops [b, d, g], the unvoiced fricatives [], o, s, S],
the corresponding voiced fricatives [v, ~, z, 5], and the (voiced) nasals
[m, n]. Then, at the level of three clusters, the stops and fricatives coalesce
for the voiced and unvoiced phonemes, separately, to yield just the nasals,
the remaining voiced, and the corresponding unvoiced consonants.
Analyses of other of Miller and Nicely's matrices (which were obtained
under different conditions of filtering) led to clusterings that, although highly
consistent (across independent sets of data), departed systematically from
the HCS's presented here for their Table VII. These divergent results will
be covered in a forthcoming report by Shepard; their detailed discussion here
would require too extensive a detour into the substantive problems of psycho-
acoustics. One further observation should perhaps be made here, though,
regarding these further analyses. In this particular kind of application anyway,
it has generally appeared that, to the extent that there is an appreciable
departure between the HCS's obtained by the Maximum and Minimum
Methods, the results of the Maximum Method have appeared to be the more
meaningful or interpretable. That is, the search for compact clusters (of
small over-all "diameter") has proved more useful than the search for in-
ternally "connected" but potentially long chain-like clusters. The reverse
may of course prove to be true in other types of applications.

V. Discussion
Relation to Other, Similar Methods
Although the methods described here were developed independently,
they were subsequently found to be closely related to some methods that
STEPHEN C. JOHNSON 253

had been developed earlier for applications to biological taxonomy. In par-

ticular, what are here called the Minimum Method and the Maximum
Method appear to be essentially like the earlier methods of Sneath [1957]
and of SCrensen [1948], respectively, (which methods Sokal and Sheath
[1963, pp. 180-181] have, in turn, called "clustering by single linkage" and
"clustering by complete linkage," respectively). A more recent method of
this same general type is that of "hierarchical linkage analysis" proposed by
McQuitty [1960]. In view of the general upsurge of interest in clustering
methods that is currently taking place in a number of different fields, it is
likely that similar methods have been proposed by others as well.
Apart from what function it may serve to bring such methods to the
attention of psychologists, the present report has the advantage of providing,
for the first time, a unifying conceptual formulation; specifically, a formula-
tion based upon the notion of the ultrametrie. This ultrametric conceptualiza-
tion, moreover, leads directly to readily mechanizable computing algorithms
for both the Minimum and Maximum Methods, as well as certain inter-
mediate methods (which are briefly mentioned below). More importantly,
it allows one to specify--as had not previously been done--precisely what
type of underlying structure is being assumed and, hence, precisely what
problem is being solved.

A Computer Program
Another step that has been taken here is the construction of a computer
program that will carry out both the Maximum and Minimum Methods on
an arbitrary matrix of similarities or "proximities." (The program is written
in F O R T R A N and is suitable for IBM machines of the 709-7090 class.)
The solutions displayed in the present Tables 11 and 12 were in fact computed
and printed out (in the form shown) by this program. When necessary the
program also determines an appropriate reordering of the "objects" so
that such a table can be constructed. On an IBM 7094, the analysis is com-
pleted quite rapidly; in another application with 64 objects, solution~s were
obtained for both the Minimum and Maximum Methods in just 10.1 seconds.

Possible Extensions
Sokal and Sneath [1963, p. 190] have pointed out that, in methods like
our Minimum or Maximum Methods, the merging of two clusters depends
upon a single similarity value (viz., the least or greatest in the appropriate
set). They suggest that, for greater robustness of the solution, it may some-
times be desirable to use some sort of average value instead. As we have
already noted, to base such a procedure upon averages of the more obvious
types is to lose the invariance, sought here, under monotone transformations
of the similarity values. More importantly, the solutions would no longer
254 PSYCHOMETRIKA

have the clear-cut meaning of the "connected" or " c o m p a c t " solutions

obtained b y the conceptually simpler Min. and Max. Methods.
Nevertheless, when this seems desirable, the methods described here
can be (and, indeed, have been) modified to yield solutions intermediate
between those obtained b y these two extreme methods. J. D. Carroll (personal
communication) has suggested an average method based upon medians which,
of course, do have the desired property of monotone invariance. T h e main
problem, in the case of medians, is the choice of an appropriate procedure for
dealing with the ambiguities that tend to arise when two or more of the
initial similarity estimates are tied. Moreover, as in the case of other sorts
of averages, the solution no longer lends itself to a simple characterization
(e.g., in terms of "compactness" or "connectedness").
Finally, a different kind of possible extension will be briefly indicated
here, in the form of a presently unsolved problem. I n Section I we saw t h a t the
construction of an HCS is equivalent to finding a metric which satisfies
the ultrametric inequality. Given a similarity measure d, we would in general
like to find the closest metric D which satisfies the ultrametric inequality--
various measures of closeness could be used. For example, we could use a
rank-order correlation between d(x, y) and D(x, y) over all objects x and y.
T o the author's knowledge, this problem is unsolved.

REFERENCES
Krnskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric
hypothesis. Psychometrik~a, 1964, 29, 1-27.
McQuitty, L. L. Hierarchical linkage analysis for the isolation of types. Educational and
Psychological Measurement, 1960, 20, 55-67.
Miller, G. A. and Nicely, P. E. An analysis of perceptual confusions among some English
consonants. Journal of the Acoustical Society of America, 1955, 27, 338-352.
Shepard, tL N. Analysis of proximities: Multidimensional scaling with an unknown dis-
tance function. I. Psychometrika, 1962a, 27, 125-140.
Shepard, R. N. Analysis of proximities: Multidimensional scaling with an unknown distance
function. IL Psychometrika, 1962b, 27, 219-246.
Sneath, P. H. A. The application of computers to taxonomy. Journal of General Microbiology,
1957, 17, 201-226.
Sokal, R. R. and Sneath, P. H. A. Principles of Numerical Taxonomy. San Francisco" W. H.
Freeman, 1963.
S~rensen, T. A method of establishing groups of equal amplitude in plant sociology based on
similarity of species content and its application to analyses of the vegetation on
Danish commons. Biologislce Skrifler, 1948, 5 (4), 1-34.
Ward, J. H., Jr. Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 1963, 58, 236-244.

Manuscript received 5/9/66

Revised manuscript received 1~/1~/66

Grade 1 Mathematics Lesson Plan
100% (33)
Grade 1 Mathematics Lesson Plan
4 pages
4.1 Clustering
No ratings yet
4.1 Clustering
80 pages
Chapter 4 - Cluster Analysis
No ratings yet
Chapter 4 - Cluster Analysis
55 pages
DM & W - Unit - 3
No ratings yet
DM & W - Unit - 3
34 pages
ML12 Clustering
No ratings yet
ML12 Clustering
34 pages
Lecture - 11 Hierarchical Clustering
No ratings yet
Lecture - 11 Hierarchical Clustering
28 pages
Research 1
No ratings yet
Research 1
36 pages
Unit 5 Clustering
No ratings yet
Unit 5 Clustering
70 pages
Clustering
No ratings yet
Clustering
80 pages
A Hash-Based Co-Clustering Algorithm For Categorical Data
No ratings yet
A Hash-Based Co-Clustering Algorithm For Categorical Data
12 pages
Using Hierarchical Cluster Analysis in Nursing Research: Jason W. Beckstead
No ratings yet
Using Hierarchical Cluster Analysis in Nursing Research: Jason W. Beckstead
13 pages
4 3 Topic Notes New
No ratings yet
4 3 Topic Notes New
9 pages
CLustering Methods
No ratings yet
CLustering Methods
2 pages
7.2. Clustering Methods
No ratings yet
7.2. Clustering Methods
46 pages
Hierarchical Clustering Dendrograms
No ratings yet
Hierarchical Clustering Dendrograms
12 pages
001 - Clustering - Jain - Dubes (1) - 69-103
No ratings yet
001 - Clustering - Jain - Dubes (1) - 69-103
40 pages
Chapter 3 Unsupervised Learning
No ratings yet
Chapter 3 Unsupervised Learning
45 pages
10 1002@9781118445112 Stat02449 Pub2
No ratings yet
10 1002@9781118445112 Stat02449 Pub2
13 pages
Unit 4 - Data Warehousing and Mining
No ratings yet
Unit 4 - Data Warehousing and Mining
51 pages
TIMO Final 2020-2021 P3
100% (3)
TIMO Final 2020-2021 P3
5 pages
Hierarchical Clustering
No ratings yet
Hierarchical Clustering
26 pages
DA Seminar
No ratings yet
DA Seminar
29 pages
Clustering Today
No ratings yet
Clustering Today
52 pages
Agglomerative Clustering
No ratings yet
Agglomerative Clustering
44 pages
Lecture 6 Clustring
No ratings yet
Lecture 6 Clustring
7 pages
Lec 35
No ratings yet
Lec 35
18 pages
DEU CSC5045 Intelligent System Applications Using Fuzzy - 4+clustering
No ratings yet
DEU CSC5045 Intelligent System Applications Using Fuzzy - 4+clustering
61 pages
Clustering
No ratings yet
Clustering
38 pages
Data Mining and Machine Learning: Fundamental Concepts and Algorithms
No ratings yet
Data Mining and Machine Learning: Fundamental Concepts and Algorithms
16 pages
CLUSTERING
No ratings yet
CLUSTERING
16 pages
Hierarchicalclustering
No ratings yet
Hierarchicalclustering
20 pages
An Overview On Clustering Methods: T. Soni Madhulatha
No ratings yet
An Overview On Clustering Methods: T. Soni Madhulatha
7 pages
Lecture-9 Cluster Analysis - LAK
No ratings yet
Lecture-9 Cluster Analysis - LAK
4 pages
Clustering Techniques
No ratings yet
Clustering Techniques
23 pages
Lecture 7 Clustring
No ratings yet
Lecture 7 Clustring
10 pages
Clustering
No ratings yet
Clustering
75 pages
Cluster Analysis
No ratings yet
Cluster Analysis
34 pages
4.5-Cluster Analysis
No ratings yet
4.5-Cluster Analysis
17 pages
6902 An Applied Algorithmic Foundation For Hierarchical Clustering
No ratings yet
6902 An Applied Algorithmic Foundation For Hierarchical Clustering
10 pages
CV w4 - Recognition - Statistical Based
No ratings yet
CV w4 - Recognition - Statistical Based
42 pages
Slide TIF311 DM 10 11
No ratings yet
Slide TIF311 DM 10 11
49 pages
Hierarchical Clustering
No ratings yet
Hierarchical Clustering
11 pages
Recursive Hierarchical Clustering Algorithm
No ratings yet
Recursive Hierarchical Clustering Algorithm
7 pages
A New Hierarchical Clustering Algorithm
No ratings yet
A New Hierarchical Clustering Algorithm
5 pages
Module 4 ML
No ratings yet
Module 4 ML
11 pages
Introduction To Clustering: Alka Arora Sr. Scientist
No ratings yet
Introduction To Clustering: Alka Arora Sr. Scientist
57 pages
Clustering Jain Dubes (1) - 69-103
No ratings yet
Clustering Jain Dubes (1) - 69-103
5 pages
Expt 5
No ratings yet
Expt 5
3 pages
Unit-6 Clustering Techniques
No ratings yet
Unit-6 Clustering Techniques
110 pages
Hierarchical Clustering
No ratings yet
Hierarchical Clustering
35 pages
Unit-7 Finalized
No ratings yet
Unit-7 Finalized
20 pages
102 - Introduction To Mass Communication 1
100% (2)
102 - Introduction To Mass Communication 1
72 pages
Clustering: EE-671 Prof L. Behera, IITK
No ratings yet
Clustering: EE-671 Prof L. Behera, IITK
33 pages
Group 3contemporary
50% (2)
Group 3contemporary
22 pages
Agglomerative Hierarchical Clustering Algorithm-A Review: K.Sasirekha, P.Baby
No ratings yet
Agglomerative Hierarchical Clustering Algorithm-A Review: K.Sasirekha, P.Baby
3 pages
Clustering Hierarchical PDF
No ratings yet
Clustering Hierarchical PDF
31 pages
Chapter 8 - Clustering
No ratings yet
Chapter 8 - Clustering
42 pages
Cluster Analysis Concept & Methods
No ratings yet
Cluster Analysis Concept & Methods
14 pages
Cluster Analysis BRM Session 14
No ratings yet
Cluster Analysis BRM Session 14
25 pages
TwoStep Cluster Analysis
No ratings yet
TwoStep Cluster Analysis
35 pages
Clustering Hierarchical Algorithms
100% (1)
Clustering Hierarchical Algorithms
21 pages
Class-2 Maths 1
100% (2)
Class-2 Maths 1
7 pages
Economic Theory by ShumPeter
No ratings yet
Economic Theory by ShumPeter
16 pages
Data Clustering..
No ratings yet
Data Clustering..
10 pages
Prickly Pear
100% (2)
Prickly Pear
83 pages
Lec 2. Processing Room Design
No ratings yet
Lec 2. Processing Room Design
59 pages
BuildGui in Matlab
100% (1)
BuildGui in Matlab
515 pages
Din. Digital Temperature Controller: Ttm-000 Series User'S Manual Operation Flow and Setting Menu
No ratings yet
Din. Digital Temperature Controller: Ttm-000 Series User'S Manual Operation Flow and Setting Menu
1 page
Interview Questions (TD)
No ratings yet
Interview Questions (TD)
9 pages
God of Small Things
100% (2)
God of Small Things
6 pages
(Hotel Name) Feedback Form: Customer Name: Address: Email/Phone Account
No ratings yet
(Hotel Name) Feedback Form: Customer Name: Address: Email/Phone Account
2 pages
Passenger Elevators (High-Speed Custom-Type)
No ratings yet
Passenger Elevators (High-Speed Custom-Type)
19 pages
CFL CPM Guidelines
No ratings yet
CFL CPM Guidelines
30 pages
Name: Dian Puspita Wati Class: XI Science Three: Summary of English Lesson
No ratings yet
Name: Dian Puspita Wati Class: XI Science Three: Summary of English Lesson
22 pages
EJ815370
No ratings yet
EJ815370
10 pages
Geometry
No ratings yet
Geometry
1 page
Computational Linguistics and Audio-Visual Readability: Analysing Linguistic Features of Intralingual-Subtitles Corpora
No ratings yet
Computational Linguistics and Audio-Visual Readability: Analysing Linguistic Features of Intralingual-Subtitles Corpora
14 pages
CASE 01 - Walker Wire Products Co.
No ratings yet
CASE 01 - Walker Wire Products Co.
2 pages
Enterprise Network Products Recommended Version List 2015Q4
No ratings yet
Enterprise Network Products Recommended Version List 2015Q4
36 pages
Mysteries of The Universe
No ratings yet
Mysteries of The Universe
37 pages
Job Satisfaction, Job Performance, and Effort and Reexamination Using Agency Theory
No ratings yet
Job Satisfaction, Job Performance, and Effort and Reexamination Using Agency Theory
15 pages
Frtool - The User's Guide: Frequency Response Controller Design Tool
No ratings yet
Frtool - The User's Guide: Frequency Response Controller Design Tool
21 pages
Project-Design-Boost (2023)
No ratings yet
Project-Design-Boost (2023)
7 pages
DOEFINAL
No ratings yet
DOEFINAL
16 pages
About The Affluent Worker de
No ratings yet
About The Affluent Worker de
18 pages
CS614-Assignment 1 Solution Spring 2024
No ratings yet
CS614-Assignment 1 Solution Spring 2024
4 pages
Android - Simple Tab Bar Example
No ratings yet
Android - Simple Tab Bar Example
7 pages
Representation Theory of Finite Groups
From Everand
Representation Theory of Finite Groups
Martin Burrow
4/5 (2)
Recursive Analysis
From Everand
Recursive Analysis
R. L. Goodstein
No ratings yet

Johnson1967 PDF

Uploaded by

Johnson1967 PDF

Uploaded by

PSYCI~OI~ETRIKA--VOL32~ NO.

HIERARCHICAL CLUSTERING SCHEMES*

Techniques for partitioning objects into optimally homogeneous

*I am indebted to R. N. Shepard and J. D. Carroll for many stimulating discussions

1. The input should consist solely of the n ( n - 1)/2 similarity measures

The notion of a hierarchical clustering scheme, the central idea of this

I. Clusterings and Metrics

t h a t given such a metric we m a y recover the H C S from it. This reduces

F o r example, in Fig. 1 we h a v e d(d, 5) = .04 (since 3 and 5 are clustered

1 0 .31 .23 .31 .23 .23

A few things are immediate f r o m the definition: for example x and x

d(x, x) -'- ao --- O.

Conversely, if d(x, y) = 0 for some x and y, it implies t h a t x and y

ac -- max [a~ , ak],

d(x, z) <_ d(x, y) -b d(y, z).

d(3, x) = d(5, x), all x.

1 0 .31 .23 .31 .23

Step 1. Clustering Co, with value 0, is the weak clustering.

d(x, z) < max (d(x, y), d(y, z))

B u t a; was chosen to be the least nonzero distance in the matrix; thus

This procedure evidently produces a hierarchical clustering scheme,

II. The Two Methods

d([x, y], z) = l(d(x, z), d(y, z)),

M i n i m u m Method: Given a similarity function d on n objects, we build

d([x, y], z) = min [d(x, z), d(y, z)].

If x and y are objects and/or clusters in Ci-1 not clustered in Ci ,

M a x i m u m Method. Same as the Minimum Method, except in Step 3, where

III. Nature o] the Solutions

What the Clusterings Mean

size = max [d(zi-, , zi)].

The above statements may be proved easily by induction; the proofs

IV. Illustrative Application

T h e n u m b e r s d o w n t h e l e f t - h a n d side of each t a b l e are t h e s i m i l a r i t y v a l u e s

had been developed earlier for applications to biological taxonomy. In par-

have the clear-cut meaning of the "connected" or " c o m p a c t " solutions

Manuscript received 5/9/66

You might also like