A Characterization of Linkage-Based Hierarchical Clustering
Abstract
The class of linkage-based algorithms is perhaps the most popular class of hierarchical
algorithms. We identify two properties of hierarchical algorithms, and prove that linkage-
based algorithms are the only ones that satisfy both of these properties. Our character-
ization clearly delineates the difference between linkage-based algorithms and other hier-
archical methods. We formulate an intuitive notion of locality of a hierarchical algorithm
that distinguishes between linkage-based and “global” hierarchical algorithms like bisecting
k-means, and prove that popular divisive hierarchical algorithms produce clusterings that
cannot be produced by any linkage-based algorithm.
1. Introduction
Clustering is a fundamental and immensely useful task, with many important applications.
There are many clustering algorithms, and these algorithms often produce different results
on the same data. Faced with a concrete clustering task, a user needs to choose an appropri-
ate algorithm. Currently, such decisions are often made in a very ad hoc, if not completely
random, manner. Users are aware of the costs involved in employing different clustering
algorithms, such as running times, memory requirements, and software purchasing costs.
However, there is very little understanding of the differences in the outcomes that these
algorithms may produce.
It has been proposed to address this challenge by identifying significant properties
that distinguish between different clustering paradigms (see, for example, Ackerman et al.
(2010b) and Fisher and Van Ness (1971)). By focusing on the input-output behaviour of al-
gorithms, these properties shed light on essential differences between them (Ackerman et al.
(2010b, 2012)). Users could then choose desirable properties based on domain expertise,
and select an algorithm that satisfies these properties.
In this paper, we focus on hierarchical algorithms, a prominent class of clustering algorithms. These algorithms output dendrograms, which the user can then traverse to obtain
the desired clustering. Dendrograms provide a convenient method for exploring multiple
clusterings of the data. Notably, for some applications the dendrogram itself, not any clus-
tering found in it, is the desired final outcome. One such application is found in the field
of phylogeny, which aims to reconstruct the tree of life.
One popular class of hierarchical algorithms is linkage-based algorithms. These algo-
rithms start with singleton clusters, and repeatedly merge pairs of clusters until a den-
drogram is formed. This class includes commonly-used algorithms such as single-linkage,
average-linkage, complete-linkage, and Ward’s method.
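To make these merge criteria concrete, here is a small illustrative sketch (ours, not taken from any particular library) of how single, average, and complete linkage score a pair of clusters, given a pairwise distance function d:

from itertools import product

def single_linkage(c1, c2, d):
    # Smallest distance across the two clusters.
    return min(d(x, y) for x, y in product(c1, c2))

def average_linkage(c1, c2, d):
    # Mean distance over all cross-cluster pairs.
    return sum(d(x, y) for x, y in product(c1, c2)) / (len(c1) * len(c2))

def complete_linkage(c1, c2, d):
    # Largest distance across the two clusters.
    return max(d(x, y) for x, y in product(c1, c2))

At each step, a linkage-based algorithm merges the pair of current clusters minimizing the chosen score; Ward's method follows the same loop with a variance-based score.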
In this paper, we provide a property-based characterization of hierarchical linkage-based
algorithms. We identify two properties of hierarchical algorithms that are satisfied by all
linkage-based algorithms, and prove that at the same time no algorithm that is not linkage-
based can satisfy both of these properties.
The popularity of linkage-based algorithms leads to a common misconception that
linkage-based algorithms are synonymous with hierarchical algorithms. We show that even
when the internal workings of algorithms are ignored, and the focus is placed solely on their
input-output behaviour, there are natural hierarchical algorithms that are not linkage-based.
We define a large class of divisive algorithms that includes the popular bisecting k-means al-
gorithm, and show that no linkage-based algorithm can simulate the input-output behaviour
of any algorithm in this class.
2. Previous Work
Our work falls within the larger framework of studying properties of clustering algorithms.
Several authors study such properties from an axiomatic perspective. For instance, Wright
(1973) proposes axioms of clustering functions in a weighted setting, where every domain
element is assigned a positive real weight, and its weight may be distributed among multiple
clusters. A recent, and influential, paper in this line of work is Kleinberg’s impossibility
result (Kleinberg (2003)), where he proposes three axioms of partitional clustering functions
and proves that no clustering function can simultaneously satisfy these properties.
Properties have been used to study different aspects of clustering. Ackerman and Ben-
David (2008) consider properties satisfied by clustering quality measures, showing that
properties analogous to Kleinberg’s axioms are consistent in this setting. Meila (2005)
studies properties of criteria for comparing clusterings, functions that map pairs of cluster-
ings to real numbers, and identifies properties that are sufficient to uniquely identify several
such criteria. Puzicha et al. (2000) explore properties of clustering objective functions.
They propose a few natural properties of clustering objective functions, and then focus on
objective functions that arise by requiring functions to decompose into additive form.
Most relevant to our work are previous results distinguishing linkage-based algorithms
based on their properties. Most of these results are concerned with the single-linkage al-
gorithm. In the hierarchical clustering setting, Jardine and Sibson (1971) and Carlsson and
Mémoli (2010) formulate a collection of properties that define single linkage.
Zadeh and Ben-David (2009) characterize single linkage in the partitional setting where
instead of constructing a dendrogram, clusters are merged until a given number of clusters
remain. Finally, Ackerman et al. (2010a) characterize linkage-based algorithms in the same
partitional setting in terms of a few natural properties. These results enable a comparison of linkage-based algorithms with other partitional clustering paradigms.
Figure 1: A dendrogram of domain set {x1 , . . . , x8 }. The horizontal lines represent levels
and every leaf is associated with an element of the domain.
3. Definitions
A distance function is a symmetric function d : X × X → R⁺ such that d(x, x) = 0 for all x ∈ X. The data sets that we consider are pairs (X, d), where X is some finite domain set and d is a distance function over X. We say that a distance function d over X extends a distance function d′ over X′ ⊆ X, denoted d′ ⊆ d, if d′(x, y) = d(x, y) for all x, y ∈ X′. Two distance functions d over X and d′ over X′ agree on a data set Y if Y ⊆ X, Y ⊆ X′, and d(x, y) = d′(x, y) for all x, y ∈ Y.
A k-clustering C = {C1, C2, . . . , Ck} of a data set X is a partition of X into k non-empty disjoint subsets of X (so ∪i Ci = X). A clustering of X is a k-clustering of X for some 1 ≤ k ≤ |X|. For a clustering C, let |C| denote the number of clusters in C. For x, y ∈ X and a clustering C of X, we write x ∼C y if x and y belong to the same cluster in C, and x ≁C y otherwise.
Given a rooted tree T where the edges are oriented away from the root, let V(T) denote the set of vertices in T, and E(T) denote the set of edges in T. We use the standard interpretation of the terms leaf, descendant, parent, and child.
A dendrogram over a data set X is a binary rooted tree where the leaves correspond to elements of X. In addition, every node is assigned a level, using a level function (η); leaves are placed at level 0, parents have higher levels than their children, and no level is empty. See Figure 1 for an illustration. Formally,

Definition 1 (dendrogram) A dendrogram over a data set X is a triple (T, M, η), where T is a binary rooted tree, M : leaves(T) → X is a bijection, and η : V(T) → {0, 1, . . . , h} is onto for some h ≥ 0, such that

1. For every leaf x ∈ V(T), η(x) = 0.

2. For every (x, y) ∈ E(T), η(x) > η(y).
For x ∈ V(T), the sub-dendrogram of (T, M, η) rooted at x is the dendrogram (T′, M′, η′) where

1. T′ is the subtree of T rooted at x,

2. M′(y) = M(y) for every leaf y of T′, and

3. For all y, z ∈ V(T′), η′(y) < η′(z) if and only if η(y) < η(z).

We use the following notions of isomorphism.
1. We say that (X, d) and (X′, d′) are isomorphic domains, denoted (X, d) ≅X (X′, d′), if there exists a bijection φ : X → X′ so that d(x, y) = d′(φ(x), φ(y)) for all x, y ∈ X.

2. We say that two clusterings (or partitions) C of some domain (X, d) and C′ of some domain (X′, d′) are isomorphic clusterings, denoted (C, d) ≅C (C′, d′), if there exists a domain isomorphism φ : X → X′ so that x ∼C y if and only if φ(x) ∼C′ φ(y).

3. We say that (T1, η1) and (T2, η2) are isomorphic trees, denoted (T1, η1) ≅T (T2, η2), if there exists a bijection H : V(T1) → V(T2) so that

(a) for all x, y ∈ V(T1), (x, y) ∈ E(T1) if and only if (H(x), H(y)) ∈ E(T2), and

(b) for all x ∈ V(T1), η1(x) = η2(H(x)).
3. Richness: For all data sets {(X1, d1), . . . , (Xk, dk)} where Xi ∩ Xj = ∅ for all i ≠ j, there exists a distance function d̂ over X1 ∪ · · · ∪ Xk that extends each of the di (for i ≤ k) so that {X1, . . . , Xk} is a clustering in F(X1 ∪ · · · ∪ Xk, d̂).

A linkage function is a function

ℓ : {(X1, X2, d) | d is a distance function over X1 ∪ X2} → R⁺

such that,

1. ℓ is representation independent: For all (X1, X2) and (X1′, X2′), if ({X1, X2}, d) ≅C ({X1′, X2′}, d′), then ℓ(X1, X2, d) = ℓ(X1′, X2′, d′).

2. ℓ is monotonic: For all (X1, X2, d), if d′ is a distance function over X1 ∪ X2 that is ({X1, X2}, d)-outer-consistent, then ℓ(X1, X2, d′) ≥ ℓ(X1, X2, d).
Note that the above definition implies that there exists a linkage function that can be used to simulate the output of F. We start by assigning every element of the domain to a leaf node. We then use the linkage function to identify the closest pair of nodes (with respect to the clusters that they represent), and repeatedly merge the closest pairs of nodes that do not yet have parents, until only one such node remains.
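This simulation is just a greedy loop; the following minimal sketch (identifiers ours, assuming a linkage function such as those sketched in the introduction) illustrates it.

def linkage_based_dendrogram(points, d, linkage):
    # Leaves: one singleton cluster per element, at level 0.
    nodes = [(frozenset([p]), 0) for p in points]
    merges = []
    level = 0
    while len(nodes) > 1:
        # Find the pair of parentless nodes with the smallest linkage value.
        i, j = min(
            ((a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))),
            key=lambda pair: linkage(nodes[pair[0]][0], nodes[pair[1]][0], d),
        )
        level += 1
        merged = (nodes[i][0] | nodes[j][0], level)
        merges.append((nodes[i][0], nodes[j][0], merged[0]))
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return merges

For example, plugging a single-linkage score into linkage_based_dendrogram reproduces single-linkage's merge order on any finite data set.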
4.3 Locality
We introduce a new property of hierarchical algorithms. Locality states that if we select a
clustering from a dendrogram (a union of disjoint clusters that appear in the dendrogram),
and run the hierarchical algorithm on the data underlying this clustering, we obtain a result
that is consistent with the original dendrogram.
Formally, F is local if for every data set (X, d) with F(X, d) = (T, M, η) and every clustering C = {C1, . . . , Ck} whose clusters appear in F(X, d), setting X′ = C1 ∪ · · · ∪ Ck and F(X′, d|X′) = (T′, M′, η′):

1. For every 1 ≤ i ≤ k, the sub-dendrogram of F(X, d) rooted at v(Ci) is also the sub-dendrogram of F(X′, d|X′) rooted at v(Ci).

2. For all x, y ∈ V(T′), η′(x) < η′(y) if and only if η(x) < η(y), identifying each node of T′ within these sub-dendrograms with its counterpart in T.
Locality is often a desirable property. Consider for example the field of phylogenetics,
which aims to reconstruct the tree of life. If an algorithm clusters phylogenetic data cor-
rectly, then if we cluster any subset of the data, we should get results that are consistent
with the original dendrogram.
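Locality can be phrased as an executable check; the sketch below (ours) assumes a function clusters_of(X, d) that returns the set of clusters (as frozensets) appearing in the algorithm's dendrogram over (X, d). It tests only the cluster-preservation part of locality; the relative-level condition would additionally need the level function.

def violates_locality(clusters_of, X, d, selected):
    # `selected` is a disjoint family of clusters appearing in the
    # dendrogram over (X, d); restrict the domain to their union.
    sub_domain = frozenset().union(*selected)
    sub_clusters = clusters_of(sub_domain, d)
    original = clusters_of(X, d)
    for cluster in selected:
        # Every original cluster nested inside a selected cluster must
        # reappear when we cluster only the selected points.
        inner = {c for c in original if c <= cluster}
        if not inner <= sub_clusters:
            return True
    return False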
Given a dendrogram D = (T, M, η) of (X, d) and a cluster A in D, the A-cut of D is the clustering cutA(D) = {C(u) | u ∈ V(T), η(u) ≤ η(v(A)) and η(parent(u)) > η(v(A))}, where v(A) denotes the node representing A and C(u) denotes the cluster represented by node u; that is, the A-cut consists of the clusters represented by the nodes that lie at, or directly below, the level of v(A). For a hierarchical function F, we write cutA F(X, d) for the A-cut of F(X, d). Note that for any cluster A in D of (X, d), the A-cut is a clustering of X, and A is one of the clusters in that clustering.
For example, consider the diagram in Figure 2. Let A = {x3, x4}. The horizontal line on level 4 of the dendrogram represents the intuitive notion of a cut. To obtain the corresponding clustering, we select all clusters represented by nodes on the line, and for the remaining clusters, we choose clusters represented by nodes that lie directly below the horizontal cut. In this example, clusters {x3, x4} and {x5, x6, x7, x8} are represented by nodes directly on the line, and {x1, x2} is a cluster represented by a node directly below the marked horizontal line.
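Reading a cut off a leveled tree is mechanical; here is a sketch under assumed node fields (level, parent, cluster; all names ours):

def a_cut(nodes, v_A):
    # Clusters of nodes at or below v_A's level whose parents sit
    # strictly above that level (the root counts as unbounded above).
    cut_level = v_A.level
    def parent_level(n):
        return n.parent.level if n.parent is not None else float("inf")
    return [n.cluster for n in nodes
            if n.level <= cut_level and parent_level(n) > cut_level]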
Recall that a distance function d′ over X is (C, d)-outer-consistent if d′(x, y) = d(x, y) whenever x ∼C y, and d′(x, y) ≥ d(x, y) whenever x ≁C y. A hierarchical function F is outer-consistent if for every data set (X, d), every cluster A in F(X, d), and every distance function d′ that is (cutA F(X, d), d)-outer-consistent, the A-cut of F(X, d′) is the same as the A-cut of F(X, d).
5. Main Result
The following is our characterization of linkage-based hierarchical algorithms.

Theorem: A hierarchical clustering function F is linkage-based if and only if F is local and outer-consistent.
We prove the result in the following subsections (one for each direction of the iff). In
the last part of this section, we demonstrate the necessity of both properties.
We first show that every hierarchical function that is local and outer-consistent is linkage-based. That is, we show that there exists a linkage function ℓ so that, when ℓ is used in Definition 6, the output on every data set (X, d) is F(X, d). Due to the representation independence of F, one can assume w.l.o.g. that the domain sets over which F is defined are (finite) subsets of the set of natural numbers, N.
Definition 12 (The (pseudo-) partial ordering <F) We consider triples of the form (A, B, d), where A ∩ B = ∅ and d is a distance function over A ∪ B. Two triples (A, B, d) and (A′, B′, d′) are equivalent, denoted (A, B, d) ≅ (A′, B′, d′), if they are isomorphic as clusterings, namely, if ({A, B}, d) ≅C ({A′, B′}, d′).

<F is a binary relation over equivalence classes of such triples, indicating that F merges a pair of clusters earlier than another pair of clusters. Formally, denoting ≅-equivalence classes by square brackets, we define it by: [(A, B, d)] <F [(A′, B′, d′)] if

1. At most two sets in {A, B, A′, B′} are equal and no set is a strict subset of another.

2. There exists a data set (X, d*) with F(X, d*) = (T, M, η) such that

(a) d* extends both d and d′,
(b) There exist (x, y), (x, z) ∈ E(T) such that C(x) = A ∪ B, C(y) = A, and C(z) = B,

(c) For all D ∈ {A′, B′}, either D ⊆ A ∪ B or D ∈ cutA∪B F(X, d*), and

(d) η(v(A′)) < η(v(A ∪ B)) and η(v(B′)) < η(v(A ∪ B)).
Definition 14 (≅F) [(A, B, d)] and [(A′, B′, d′)] are F-equivalent, denoted [(A, B, d)] ≅F [(A′, B′, d′)], if either they are isomorphic as clusterings, ({A, B}, d) ≅C ({A′, B′}, d′), or

1. At most two sets in {A, B, A′, B′} are equal and no set is a strict subset of another.

2. There exists a data set (X, d*) with F(X, d*) = (T, M, η) such that

(a) d* extends both d and d′,
(b) There exist (x, y), (x, z) ∈ E(T) such that C(x) = A ∪ B, C(y) = A, and C(z) = B,

(c) There exist (x′, y′), (x′, z′) ∈ E(T) such that C(x′) = A′ ∪ B′, C(y′) = A′, and C(z′) = B′, and

(d) η(x) = η(x′).
Lemma 13 Let F be a hierarchical function that is local and outer-consistent, and suppose that (A, B, d1) ≅F (C, D, d2). Then for every triple (E, F, d3) that is comparable with both (A, B, d1) and (C, D, d2):

• if (A, B, d1) ≅F (E, F, d3) then (C, D, d2) ≅F (E, F, d3), and

• if (A, B, d1) <F (E, F, d3) then (C, D, d2) <F (E, F, d3).
Proof Let X = A ∪ B ∪ C ∪ D ∪ E ∪ F. By richness (condition 3 of Definition 4), there exists a distance function d that extends di for i ∈ {1, 2, 3} so that {A ∪ B, C ∪ D, E ∪ F} is a clustering in F(X, d). Assume that (E, F, d3) is comparable with both (A, B, d1) and (C, D, d2). By way of contradiction, assume that (A, B, d1) ≅F (E, F, d3) and (C, D, d2) <F (E, F, d3). Then by locality, in F(X, d), η(v(A ∪ B)) = η(v(E ∪ F)).

Observe that by locality, since (C, D, d2) <F (E, F, d3), we have η(v(C ∪ D)) < η(v(E ∪ F)) in F(X, d). Therefore (again by locality) η(v(A ∪ B)) ≠ η(v(C ∪ D)) in any data set that extends d1 and d2, contradicting that (A, B, d1) ≅F (C, D, d2).
Note that <F is not transitive. In particular, if (A, B, d1 ) <F (C, D, d2 ) and (C, D, d2 ) <F
(E, F, d3 ), it may be that (A, B, d1 ) and (E, F, d3 ) are incomparable. To show that <F can
be extended to a partial ordering, we first prove the following “anti-cycle” property.
Lemma 16 Given a hierarchical function F that is local and outer-consistent, there exists
no finite sequence (A1 , B1 , d1 ) <F · · · <F (An , Bn , dn ) <F (A1 , B1 , d1 ).
Proof By way of contradiction, assume that such a sequence exists. By richness, there exists a distance function d that extends each of the di so that {A1 ∪ B1, A2 ∪ B2, . . . , An ∪ Bn} is a clustering in F(⋃i(Ai ∪ Bi), d) = (T, M, η).
Let i0 be such that η(v(Ai0 ∪ Bi0)) ≤ η(v(Aj ∪ Bj)) for all j ≠ i0. By the circular structure of the sequence with respect to <F, there exists j0 so that (Aj0, Bj0, dj0) <F (Ai0, Bi0, di0). This contradicts Lemma 13.
Lemma 17 Let P be a cycle-free, antisymmetric binary relation over a countable domain. Then P can be extended to a partial ordering, and there is a mapping φ of the domain into the positive reals that respects that ordering.

Proof First we convert the relation P into a partial order by defining a < b whenever there exists a sequence x1, . . . , xk so that P(a, x1), P(x1, x2), . . . , P(xk, b). This is a partial ordering because P is antisymmetric and cycle-free. To map the partial order to the positive reals, we first enumerate the elements, which can be done because the domain is countable. The first element is then mapped to any value φ(x1). By induction, we assume that the first n elements are mapped in an order-preserving manner. Let xi1, . . . , xik be all the members of {x1, . . . , xn} that are below xn+1 in the partial order. Let r1 = max{φ(xi1), . . . , φ(xik)}, and similarly let r2 be the minimum among the images of all the members of {x1, . . . , xn} that are above xn+1 in the partial order. Finally, let φ(xn+1) be any real number between r1 and r2. It is easy to see that φ now maps {x1, . . . , xn, xn+1} in a way that respects the partial order.
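The inductive argument is constructive; for a finite, transitively closed, cycle-free relation it amounts to the following sketch (identifiers ours):

def order_embedding(elements, below):
    # `below(a, b)` is the transitively closed partial order.
    phi = {}
    for x in elements:  # any fixed enumeration of the domain
        lower = [phi[y] for y in phi if below(y, x)]
        upper = [phi[y] for y in phi if below(x, y)]
        r1 = max(lower, default=0.0)
        r2 = min(upper, default=r1 + 2.0)
        # r1 < r2 holds because the mapping so far is order-preserving.
        phi[x] = (r1 + r2) / 2
    return phi

Transitivity guarantees max(lower) < min(upper) at every step, so each new image can indeed be placed strictly between them.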
Finally, we define our linkage function by embedding the ≅F-equivalence classes into the positive real numbers in an order-preserving way, as implied by applying Lemma 17 to <F. Namely, ℓF : {[(A, B, d)] : A ⊆ N, B ⊆ N, A ∩ B = ∅, and d is a distance function over A ∪ B} → R⁺ is such that [(A, B, d)] <F [(A′, B′, d′)] implies ℓF[(A, B, d)] < ℓF[(A′, B′, d′)].
Lemma 18 The function ℓF is a linkage function for any hierarchical function F that satisfies locality and outer-consistency.
Proof Representation independence of ℓF is immediate from its definition on equivalence classes, so it remains to verify monotonicity. By way of contradiction, assume that there exist d1 and d2, where d2 is ({X1, X2}, d1)-outer-consistent, such that (X1, X2, d2) <F (X1, X2, d1). Let d3 over X1 ∪ X2 be a distance function such that d3 is ({X1, X2}, d1)-outer-consistent and d2 is ({X1, X2}, d3)-outer-consistent. In particular, d3 can be constructed as follows:

• d3(x, y) = (d1(x, y) + d2(x, y))/2 whenever x ∈ X1 and y ∈ X2, and

• d3(x, y) = d1(x, y) = d2(x, y) otherwise.
The following lemma concludes the proof that every local, outer-consistent hierarchical algorithm is linkage-based.
Lemma 20 Given any hierarchical function F that satisfies locality and outer-consistency, let ℓF be the linkage function defined above. Let LℓF denote the linkage-based algorithm that ℓF defines. Then LℓF agrees with F on every input data set.
Proof Let (X, d) be any data set. We prove that at every level s, the nodes at level s in F(X, d) represent the same clusters as the nodes at level s in LℓF(X, d). In both F(X, d) = (T, M, η) and LℓF(X, d) = (T′, M′, η′), level 0 consists of |X| nodes, each representing a unique element of X.
Assume the result holds below level k. We show that pairs of nodes that do not have parents below level k have minimal ℓF value only if they are merged at level k in F(X, d).
Consider F(X, d) at level k. Since the dendrogram has no empty levels, let x ∈ V(T) where η(x) = k. Let x1 and x2 be the children of x in F(X, d). Since η(x1), η(x2) < k, these nodes also appear in LℓF(X, d) below level k, and neither node has a parent below level k.
If x is the only node in F(X, d) above level k − 1, then it must also occur in LℓF(X, d). Otherwise, there exists a node y1 ∈ V(T), y1 ∉ {x1, x2}, so that η(y1) < k and η(parent(y1)) ≥ k. Let X′ = C(x) ∪ C(y1). By locality, cutC(x) F(X′, d|X′) = {C(x), C(y1)}, y1 is below x, and x1 and x2 are the children of x. Therefore, (C(x1), C(x2), d) <F (C(x1), C(y1), d) and ℓF(C(x1), C(x2), d) < ℓF(C(x1), C(y1), d).
Assume that there exists y2 ∈ V(T), y2 ∉ {x1, x2, y1}, so that η(y2) < k and η(parent(y2)) ≥ k. If parent(y1) = parent(y2) and η(parent(y1)) = k, then (C(x1), C(x2), d) ≅F (C(y1), C(y2), d) and so ℓF(C(x1), C(x2), d) = ℓF(C(y1), C(y2), d).
Otherwise, let X′ = C(x) ∪ C(y1) ∪ C(y2). By richness, there exists a distance function d* that extends d|C(x) and d|(C(y1) ∪ C(y2)), so that {C(x), C(y1) ∪ C(y2)} is a clustering in F(X′, d*). Note that by locality, the node v(C(y1) ∪ C(y2)) has children v(C(y1)) and v(C(y2)) in F(X′, d*). By moving C(x) away from C(y1) ∪ C(y2) (an outer-consistent change) in both F(X′, d*) and F(X′, d|X′), we can make the two distance functions equal. Then by outer-consistency, cutC(x) F(X′, d|X′) = {C(x), C(y1), C(y2)} and by locality y1 and y2 are below x. Therefore, (C(x1), C(x2), d) <F (C(y1), C(y2), d) and so ℓF(C(x1), C(x2), d) < ℓF(C(y1), C(y2), d).
For the converse direction, we must show that every linkage-based function is both local and outer-consistent. Locality is established by Lemma 21; the following proof establishes outer-consistency.

Proof Let C = {C1, C2, . . . , Ck} be the Ci-cut of F(X, d) for some 1 ≤ i ≤ k. Let d′ be (C, d)-outer-consistent. Then for all 1 ≤ i ≤ k and all X1, X2 ⊆ Ci, ℓ(X1, X2, d) = ℓ(X1, X2, d′), while for all X1 ⊆ Ci, X2 ⊆ Cj, for any i ≠ j, ℓ(X1, X2, d) ≤ ℓ(X1, X2, d′) by monotonicity. Therefore, for all 1 ≤ j ≤ k, the sub-dendrogram rooted at v(Cj) in F(X, d) also appears in F(X, d′). All nodes added after these sub-dendrograms are at a higher level than the level of v(Ci). And since the Ci-cut is represented by nodes that occur on levels no higher than the level of v(Ci), the Ci-cut in F(X, d′) is the same as the Ci-cut in F(X, d).
To demonstrate the necessity of locality, consider a hierarchical function that applies average-linkage whenever the domain has an even number of elements and single-linkage otherwise. This function is not local: there are data sets with an even number of elements where average-linkage detects an odd-sized cluster, for which single-linkage would produce a different dendrogram.
Now, consider the following function:

ℓ(X1, X2, d) = 1 / max{d(x, y) : x ∈ X1, y ∈ X2}.
The function ℓ is not a linkage function since it fails the monotonicity condition. The function ℓ also does not conform with the intended meaning of a linkage function. For instance, ℓ(X1, X2, d) is smaller than ℓ(X1′, X2′, d′) when all the distances between X1 and X2 are (arbitrarily) larger than any distance between X1′ and X2′. If we then consider the hierarchical clustering function F that results by utilizing ℓ in a greedy fashion to construct a dendrogram (by repeatedly merging the closest clusters according to ℓ), then the function F is local by the same argument as the proof of Lemma 21. We now demonstrate that F is not outer-consistent. Consider a data set (X, d) such that for some A ⊂ X, the A-cut of F(X, d) is a clustering with at least 3 clusters where every cluster consists of at least 2 elements. Then if we move two clusters sufficiently far away from each other and all other data, they will be merged by the algorithm before any of the other clusters are formed, and so the A-cut on the resulting data changes following an outer-consistent change. As such, F is not outer-consistent.
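A small numeric check makes the monotonicity failure concrete (purely illustrative):

def ell(c1, c2, d):
    # The function from the text: reciprocal of the largest cross distance.
    return 1.0 / max(d(x, y) for x in c1 for y in c2)

d_near = lambda x, y: 1.0    # all cross distances equal 1
d_far = lambda x, y: 10.0    # an outer-consistent change: distances grow

assert ell({"a"}, {"b"}, d_far) < ell({"a"}, {"b"}, d_near)  # ell decreased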
6. Divisive Algorithms
Our formalism provides a precise sense in which linkage-based algorithms make only local
considerations, while many divisive algorithms inevitably take more global considerations
into account. This fundamental distinction between these paradigms can be used to help
select a suitable hierarchical algorithm for specific applications.
This distinction also implies that many divisive algorithms cannot be simulated by any linkage-based algorithm, showing that the class of hierarchical algorithms is strictly richer than the class of linkage-based algorithms (even when focusing only on the input-output behaviour of algorithms).
A 2-clustering function F maps a data set (X, d) to a 2-partition of X. An F-Divisive algorithm is a divisive algorithm that uses a 2-clustering function F to decide how to split nodes. Formally,

Definition 23 (F-Divisive) Given a 2-clustering function F, the F-Divisive hierarchical function maps a data set (X, d) to a dendrogram (T, M, η) such that C(root(T)) = X and, for every non-leaf node x ∈ V(T) with children y and z, {C(y), C(z)} = F(C(x), d|C(x)).
Note that Definition 23 does not place restrictions on the level function. This allows for some flexibility in the levels; intuitively, it does not force an order in which nodes are split.
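A generic F-Divisive procedure recurses with the plug-in 2-clustering function; the sketch below (names ours) omits the level function, consistent with the flexibility just noted. Bisecting k-means is the instance where two_clustering is 2-means.

def f_divisive(points, d, two_clustering):
    # Top-down: split every node of size > 1 using the 2-clustering
    # function; singletons become leaves.
    points = frozenset(points)
    if len(points) <= 1:
        return {"cluster": points, "children": []}
    left, right = two_clustering(points, d)
    return {"cluster": points,
            "children": [f_divisive(left, d, two_clustering),
                         f_divisive(right, d, two_clustering)]}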
The following property describes clustering functions that utilize contextual information found in the remainder of the data set when partitioning a subset of the domain. A 2-clustering function F is context-sensitive if there exist distance functions d ⊂ d′ such that F({x, y, z}, d) = {{x}, {y, z}}, while F({x, y, z, w}, d′) = {{x, y}, {z, w}}.
Many 2-clustering functions, including k-means, min-sum, and min-diameter, are context-sensitive (see Corollary 29 below). Natural divisive algorithms, such as bisecting k-means (k-means-Divisive), rely on context-sensitive 2-clustering functions.
Whenever a 2-clustering function F is context-sensitive, the F-Divisive function is not local.
Proof Since F is context-sensitive, there exist distance functions d ⊂ d′ so that {x} and {y, z} are the children of the root in F-Divisive({x, y, z}, d), while in F-Divisive({x, y, z, w}, d′), {x, y} and {z, w} are the children of the root, and z and w are the children of the node representing {z, w}. Therefore, {{x, y}, {z}} is a clustering in F-Divisive({x, y, z, w}, d′). But cluster {x, y} is not in F-Divisive({x, y, z}, d), so the clustering {{x, y}, {z}} is not in F-Divisive({x, y, z}, d), and so F-Divisive is not local.
We say that two hierarchical algorithms disagree if they may output dendrograms with
different clusterings. Formally,
Definition 27 Two hierarchical functions F0 and F1 disagree if there exists a data set
(X, d) and a clustering C of X so that C is in Fi (X, d) but not in F1−i (X, d), for some
i ∈ {0, 1}.
Theorem 28 If F is a context-sensitive 2-clustering function, then the F-Divisive function disagrees with every linkage-based hierarchical function.

Corollary 29 The divisive algorithms that are based on the following 2-clustering functions disagree with every linkage-based function: k-means, min-sum, and min-diameter.
7. Conclusions
In this paper, we provide the first property-based characterization of hierarchical linkage-based clustering. Our characterization shows the existence of hierarchical methods that cannot be simulated by any linkage-based method, revealing inherent input-output differences between agglomerative and divisive hierarchical algorithms.
This work falls in the larger framework of property-based analysis of clustering algorithms, which aims to provide a better understanding of these techniques as well as aid users in the crucial task of algorithm selection. It is important to note that our characterization is not intended to demonstrate the superiority of linkage-based methods over other hierarchical techniques, but rather to enable users to make informed trade-offs when choosing algorithms. In particular, properties investigated in previous work should also be considered. Future work will continue to investigate important properties, with the ultimate goal of providing users with a property-based taxonomy of popular clustering methods that would enable selecting suitable methods for a wide range of applications.
8. Acknowledgements
We would like to thank David Loker for several helpful discussions. We would also like
to thank the anonymous referees whose comments and suggestions greatly improved this
paper.
References
M. Ackerman and S. Ben-David. Measures of clustering quality: A working set of axioms
for clustering. In Proceedings of Neural Information Processing Systems (NIPS), pages
121–128, 2008.
L. Fisher and J.W. Van Ness. Admissible clustering procedures. Biometrika, 58(1):91–104,
1971.
R.B. Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Proceedings of the
Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 639–646. AUAI
Press, 2009.