Survey
Graph clustering
Article history: Received 29 January 2007; received in revised form 8 May 2007; accepted 28 May 2007.

Abstract: In this survey we overview the definitions and methods for graph clustering, that is, finding sets of “related” vertices in graphs. We review the many definitions for what is a cluster in a graph and measures of cluster quality. Then we present global algorithms for producing a clustering for the entire vertex set of an input graph, after which we discuss the task of identifying a cluster for a specific seed vertex by local computation. Some ideas on the application areas of graph clustering algorithms are given. We also address the problematics of evaluating clusterings and benchmarking cluster algorithms.

© 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cosrev.2007.05.001
Section 6. Section 7 discusses the difficulty of comparing, evaluating and benchmarking graph-clustering methods. Applications are reviewed in Section 8. In Section 9 we glance at open problems and future directions, and in Section 10 we conclude the survey.

2. Terminology and definitions

In this section we first review the necessary terminology to facilitate discussion in the rest of the survey. We provide some of the basic definitions of computational complexity, approximation algorithms, graph theory, and Markov chains. Readers familiar with these topics are encouraged to proceed directly to Section 3.

2.1. Computational complexity

The worst-case running time of an algorithm for a problem instance of size x is the number of computation steps needed to execute the algorithm for the most difficult instance of size x possible. The instance size is measured in some fixed units, typically integers or bits — the effect of the unit selection will vanish as we proceed to the definitions of complexity used in the survey. Hence x is a positive integer, x ∈ Z+.

Equally, the worst-case memory consumption is the number of memory units that the algorithm will need to simultaneously occupy in the worst possible case for an instance of size x. In computational complexity, the interest is in characterizing how the running time and memory consumption grow when x grows. Let f(x) be a function of x that determines the number of computation steps (or alternatively the units of memory) needed in the worst case, given x ∈ Z+.

The worst-case complexity of an algorithm is denoted by O(g(x)), where g(x) is a function of the input size x such that f(x) grows no faster than g(x). This means that there exists a positive constant c such that

f(x) ≤ c · g(x)    (1)

for all sufficiently large values of x. In general, g(x) is formed by ignoring constant multipliers in f(x) and only keeping the highest-order term.

Stating that an algorithm has run time or memory consumption f(x) = Ω(h(x)) in turn means that there exists a positive constant d such that

f(x) ≥ d · h(x)    (2)

for all sufficiently large values of x. The difference is that f(x) = O(g(x)) provides an upper bound, whereas f(x) = Ω(h(x)) is a lower bound on how the complexity grows. Furthermore, we write that f(x) = Θ(g(x)) if both f(x) = O(g(x)) and f(x) = Ω(g(x)) hold.

This kind of “rounding” of the functions and the study of their behaviour for large values of x is called asymptotic analysis. For more information on these notations for worst-case complexity and other related definitions, we recommend the basic textbook on algorithms by Cormen et al. [60].

Decision problems are characterized by a set of problem instances and a set of solutions, together with a relation associating a particular problem instance to a possibly empty subset of solutions. Essentially, in a decision problem, we ask whether the solution set mapped to a given instance is nonempty. If a solution can be constructed to a given problem instance in time that is polynomial in the length of the representation of the problem instance, the corresponding decision problem is said to be in class P. Formally, a problem is in P if it has an algorithm with time complexity bounded by some polynomial of the input size x.

For quite a few important problems, there are no known polynomial-time algorithms. However, such a problem may still have a polynomial-time verification algorithm that can check whether a given certificate y that has length polynomial in x provides a feasible solution to a problem instance of size x. Formally, NP is the class of all languages that are decided by nondeterministic Turing machines in polynomial time [192]. Practically, this means that problems with polynomial-time verification algorithms form the class NP. Note that P is a subclass of NP.

Furthermore, a decision problem S is reducible to a decision problem T if there exists a polynomial-time reduction f such that for any x ∈ S, f(x) ∈ T. This is denoted by S ≤ᴾₘ T. A problem T is said to be NP-hard if S ≤ᴾₘ T for all problems S ∈ NP. An NP-hard problem T is said to be NP-complete if additionally T ∈ NP. In this survey, many NP-complete problems will be mentioned. For more information on NP-completeness and the complexity classes, we recommend the classical reference text of Garey and Johnson [103] and the textbook of Papadimitriou [192].

2.2. Approximation algorithms

In some applications, it may not be worth the effort to compute the best possible solution to the problem at hand, but a not-too-bad solution will suffice. Whenever exact computation is time-consuming, impossible, or simply not justified by the needs of the application, heuristic and approximate methods are useful. Many such methods provide a nondeterministic output, meaning that the method may output a different solution on different executions. Hence, one may need to repeatedly execute a heuristic algorithm and then filter the output with respect to some quality measure.

The goal of an approximation algorithm is to find efficiently a solution that differs by no more than a fixed factor from the exact solution. By efficient, one usually means “in polynomial time”. Approximation algorithms are the practical approach for solving large instances of NP-complete problems and problems harder than that. A good approximation algorithm should have provably polynomial running time.

The problems for which approximation algorithms are most commonly used are optimization problems. In an optimization problem the task is to choose from a large set of possible solutions the one that gives the best value for a certain function. The goal may either be the minimization of a cost function or the maximization of a fitness function.

When searching for an approximate solution to an optimization problem, it is a matter of application to define how much the approximate solution may differ from
the exact solution to be acceptable. As the approximation algorithm does not aim to find the best possible solution, but rather one that is not very far from being the best, one should present provable bounds on how far from the optimal solution the approximate solution can be. Such a bound is called the approximation factor and is less than one for maximization problems and greater than one for minimization problems. The extremal value of the factor over all problem instances (the minimum for maximization problems and the maximum for minimization problems) is called the approximation ratio. A constant-factor approximation algorithm is one where the value of the solution found is at most a constant multiple of the optimum solution. If there exists a systematic method to approximate the solution to arbitrary factors, the method is called a (polynomial-time) approximation scheme, abbreviated PTAS. The complexity class of problems that have a PTAS is also denoted by PTAS.

The complexity class NPO is a function problem class: find an n-bit string x that maximizes a given cost function C(x), where the function C is computable in polynomial time by a deterministic Turing machine. The class NPO is called NP-optimization. The complexity class APX is a subclass of NPO such that each problem in APX allows constant-factor polynomial-time approximation algorithms. There exist problems that are in APX but not in PTAS, unless P = NP. APX-hard problems have a PTAS-reduction from every other problem in APX. Assuming P ≠ NP, no APX-hard problem can have a PTAS. We recommend the book of Vazirani [226] on approximation algorithms as well as the book of Ausiello et al. [14] on the complexity of approximation algorithms.

2.3. Graph theory

A graph G is a pair of sets G = (V, E). V is the set of vertices and the number of vertices n = |V| is the order of the graph. The set E contains the edges of the graph. In an undirected graph, each edge is an unordered pair {v, w}. In a directed graph (also called a digraph in much literature), edges are ordered pairs. The vertices v and w are called the endpoints of the edge. The edge count |E| = m is the size of the graph. In a weighted graph, a weight function ω : E → R is defined that assigns a weight to each edge. A graph is planar if it can be drawn in a plane without any of the edges crossing.

In this survey, we define the density of a graph G = (V, E) as the ratio of the number of edges present to the maximum possible,

\delta(G) = \frac{m}{\binom{n}{2}}.    (3)

For n ∈ {0, 1}, we set δ(G) = 0. A graph of density one is called complete.

If {v, u} ∈ E, we say that v is a neighbour of u. The set of neighbours for a given vertex v is called the neighbourhood of v and is denoted by Γ(v). A vertex v is a member of its own neighbourhood Γ(v) if and only if the graph contains a reflexive edge {v, v}.

The adjacency matrix AG of a given graph G = (V, E) of order n is an n × n matrix AG = (a^G_{v,u}) where

a^G_{v,u} = \begin{cases} 1, & \text{if } \{v,u\} \in E, \\ 0, & \text{otherwise.} \end{cases}    (4)

The number of edges incident on a given vertex v is the degree of v and is denoted by deg(v). A graph is regular if all of the vertices have the same degree; if ∀v ∈ V in G = (V, E) we have deg(v) = k, the graph G is k-regular. The diagonal degree matrix of a graph G = (V, E) is

D = \begin{pmatrix}
\deg(v_1) & 0 & 0 & \cdots & 0 & 0 \\
0 & \deg(v_2) & 0 & \cdots & 0 & 0 \\
0 & 0 & \deg(v_3) & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \deg(v_{n-1}) & 0 \\
0 & 0 & 0 & \cdots & 0 & \deg(v_n)
\end{pmatrix}.    (5)

A partition of the vertices V of a graph G = (V, E) into two nonempty sets S and V\S is called a cut and is denoted by (S, V\S). A cut is uniquely identified by defining a set S; hence any subset of V can be called a cut. As the sets S and V\S define the same cut, it is often preferred to denote by S the smaller set, hence requiring |S| ≤ ⌊n/2⌋.

The cut size is the number of edges that connect vertices in S to vertices in V\S:

c(S, V\S) = |{{v, u} ∈ E | u ∈ S, v ∈ V\S}|.    (6)

We denote by

\deg(S) = \sum_{v \in S} \deg(v)    (7)

the sum of degrees in a cut S. Note that in the presence of edge weights, the cut size is generally redefined as the sum of the weights of the edges crossing the cut instead of using simply the number of edges that cross it.

A path from v to u in a graph G = (V, E) is a sequence of edges in E starting at vertex v0 = v and ending at vertex v_{k+1} = u:

{v, v1}, {v1, v2}, . . . , {v_{k−1}, v_k}, {v_k, u}.    (8)

If such a path exists, v and u are connected. The path is simple if no vertex is repeated, that is, for all i ∈ [0, k+1] and j ∈ [0, k+1], v_i ≠ v_j unless i = j.

The length of a path is the number of edges on it, and the distance between v and u is the length of the shortest path connecting them in G. The distance from a vertex to itself is zero: the path from a vertex to itself is an empty edge sequence. A graph is connected if there exist paths between all pairs of vertices. If there are vertices that cannot be reached from others, the graph is disconnected. The minimum number of edges that would need to be removed from G in order to make it disconnected is the edge-connectivity of the graph. A cycle is a simple path that begins and ends at the same vertex. A graph that contains no cycle is acyclic and is also called a forest. A connected forest is called a tree.

A subgraph GS = (S, ES) of G = (V, E) is composed of a set of vertices S ⊆ V and a set of edges ES ⊆ E such that {v, u} ∈ ES implies v, u ∈ S; the graph G is a supergraph of GS. A connected acyclic subgraph that includes all vertices is called a spanning tree of the graph. A spanning tree has necessarily exactly n − 1 edges. If the edges are assigned weights, the spanning tree with smallest total weight is called the minimum spanning tree. Note that there may exist several minimum spanning trees that may even be edge-disjoint.
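To make these definitions concrete, the following short Python sketch (our illustration, not part of the survey; the example graph and variable names are arbitrary) builds the adjacency matrix of Eq. (4) and the degree matrix of Eq. (5) for a small undirected graph, and evaluates the density of Eq. (3), the cut size of Eq. (6) and the degree sum of Eq. (7) for a vertex subset S.

import numpy as np

# A small undirected, unweighted example graph on vertices 0..5 (hypothetical data).
vertices = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]

n = len(vertices)
A = np.zeros((n, n), dtype=int)          # adjacency matrix, Eq. (4)
for v, u in edges:
    A[v, u] = A[u, v] = 1

deg = A.sum(axis=1)                      # vertex degrees
D = np.diag(deg)                         # diagonal degree matrix, Eq. (5)

m = len(edges)
density = m / (n * (n - 1) / 2)          # graph density, Eq. (3)

S = {0, 1, 2}                            # one side of a cut (S, V \ S)
cut_size = sum(1 for v, u in edges if (v in S) != (u in S))   # Eq. (6)
deg_S = sum(deg[v] for v in S)           # sum of degrees in S, Eq. (7)

print(density, cut_size, deg_S)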
An induced subgraph of a graph G = (V, E) is the graph with the vertex set S ⊆ V and an edge set E(S) that includes all such edges {v, u} in E with both of the vertices v and u included in the set S:

E(S) = {{v, u} | v ∈ S, u ∈ S, {v, u} ∈ E}.    (9)

We denote the subgraph induced by the vertex subset S by G(S) (or by GS where it is clear that the subgraph is an induced subgraph). An induced subgraph that is a complete graph is called a clique. An induced subgraph with an empty edge set is called an independent set. We define the local density of an induced subgraph in G = (V, E) to be simply

\delta(G(S)) = \frac{|E(S)|}{\binom{|S|}{2}}    (10)

in accordance with Eq. (3). An alternative definition of density in the literature is the ratio of the edge count to the vertex count,

\delta'(G(S)) = \frac{|E(S)|}{|S|},    (11)

which yields possible alternative definitions of global graph density for G, such as the average density

\delta'(G) = \frac{m}{n},    (12)

or the maximum density

\delta'_{\max}(G) = \max_{S \subset V} \frac{|E(S)|}{|S|}.    (13)

Two graphs Gi = (Vi, Ei) and Gj = (Vj, Ej) are isomorphic if there exists a bijective (one-to-one) mapping f : Vi → Vj (called an isomorphism) such that {v, w} ∈ Ei if and only if {f(v), f(w)} ∈ Ej.

The spectrum of a graph G = (V, E) is defined as the list of eigenvalues (together with their multiplicities) of its adjacency matrix AG. Spectral properties can be computed for both undirected and directed graphs, as well as unweighted and weighted, but by far the easiest case is that of undirected, unweighted simple graphs, which will be the focus of the brief treatment of spectral graph theory in this section.

It is often more convenient to study the eigenvalues of the Laplacian matrix L = D − AG than those of AG itself [51]. The normalized Laplacian is defined as

\mathcal{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A_G D^{-1/2},    (14)

where I is an n × n identity matrix (with ones on the diagonal, other elements being zero). An element-wise definition for the normalized Laplacian, which is easier to understand intuitively than the matrix version, is the following:

\mathcal{L}_{u,v} = \begin{cases} 1, & \text{if } u = v \text{ and } \deg(v) > 0, \\ -\dfrac{1}{\sqrt{\deg(u)\,\deg(v)}}, & \text{if } u \in \Gamma(v), \\ 0, & \text{otherwise.} \end{cases}    (15)

As these matrices are symmetrical, their eigenvalues are real and nonnegative. Using the normalized Laplacian is convenient as the eigenvalues of L all fall within the interval [0, 2]. The smallest eigenvalue is always zero, as the matrix is singular, and the corresponding eigenvector is simply a vector with each element being the square root of the degree of the corresponding vertex.

A pleasant consequence of having the spectra limited to the interval [0, 2] is that it makes comparing the spectra of two graphs easier [229,51]. However, even nonisomorphic graphs can share the same spectrum [229]. Graphs that have the same spectrum are called cospectral (also isospectral) [63]. When the equality of the sets of pairwise distinct eigenvalues holds, but the multiplicities do not coincide, the graphs are weakly cospectral [215]. A survey on the extent to which the spectrum determines a graph is given by van Dam and Haemers [223]. The spectra of real-world graphs are studied by Farkas et al. [86].

For comparing spectra of two graphs, it has been found better to compare the spectra of their line graphs [63]. A line graph GL = (VL, EL) of a given graph G = (V, E) is a graph where the vertex set VL corresponds to the set of edges E of the original graph G and two vertices vij ∈ VL (corresponding to the edge {vi, vj} ∈ E) and vkℓ ∈ VL (corresponding to the edge {vk, vℓ} ∈ E) are connected by an edge {vij, vkℓ} ∈ EL if and only if the vertex subsets {vi, vj} and {vk, vℓ} have a nonempty intersection,

{vi, vj} ∩ {vk, vℓ} ≠ ∅,    (16)

meaning that the two original edges represented by the vertices vij and vkℓ share one or both of their endpoints.

The (degree-adjusted) Rayleigh quotient [51] is the ratio

\frac{\sum_{v \in V} g(v)\,(\mathcal{L}g)(v)}{\sum_{v \in V} g(v)^2},    (17)

where g : V → R is viewed as a column vector assigning arbitrary real values to the vertices. Simplifying with an assignment g(v) = D^{1/2} f(v) and Eq. (14) we get

\frac{\sum_{v \in V} g(v)\,(\mathcal{L}g)(v)}{\sum_{v \in V} g(v)^2}
= \frac{\sum_{v \in V} f(v)\,(Lf)(v)}{\sum_{v \in V} \bigl(D^{1/2} f(v)\bigr)^2}
= \frac{\sum_{v \in V} f(v) \sum_{w \in V} L_{v,w} f(w)}{\sum_{v \in V} f(v)^2 \deg(v)}
= \frac{\sum_{v \in V} f(v) \sum_{\{v,u\} \in E} \bigl(f(v) - f(u)\bigr)}{\sum_{v \in V} f(v)^2 \deg(v)}
= \frac{\sum_{\{v,u\} \in E} \bigl(f(v) - f(u)\bigr)^2}{\sum_{v \in V} f(v)^2 \deg(v)}.    (18)

It is easy to see that the minimum value of this ratio is zero, obtained by any function f(v) that assigns the same value to the vertices in each connected component of the graph. The Rayleigh quotient in its basic form is written for any real and symmetrical¹ matrix B and a vector x as

\frac{x^T B x}{x^T x}    (19)

and is widely used as an approximation to the extreme eigenvalues of B: the ratio is minimized when x is the eigenvector corresponding to the smallest eigenvalue of B, and this minimum value of the Rayleigh quotient is the eigenvalue itself. Similarly, the maximum gives the largest eigenvalue.

¹ In general, the definition of the Rayleigh quotient works for complex Hermitian matrices.
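The spectral statements above are easy to verify numerically. The sketch below (our own illustration on the same small example graph used earlier; it assumes the graph has no isolated vertices) forms the normalized Laplacian of Eq. (14) with NumPy, checks that its eigenvalues lie in [0, 2] with zero as the smallest, and evaluates a Rayleigh quotient as in Eq. (19).

import numpy as np

# Hypothetical example graph (same conventions as the earlier sketch).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for v, u in edges:
    A[v, u] = A[u, v] = 1

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))       # assumes no isolated vertices

# Normalized Laplacian, Eq. (14): I - D^{-1/2} A D^{-1/2}
L_norm = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt

eigvals, eigvecs = np.linalg.eigh(L_norm)       # symmetric matrix => real spectrum
assert eigvals.min() > -1e-9 and eigvals.max() < 2 + 1e-9

# The eigenvector for eigenvalue zero is proportional to the square roots of the degrees.
print(eigvals)
print(eigvecs[:, 0] / np.sqrt(deg))             # approximately constant (up to sign)

# A Rayleigh quotient, Eq. (19) with B = L_norm, lies between the extreme eigenvalues.
g = np.random.randn(n)
rq = g @ L_norm @ g / (g @ g)
assert eigvals[0] - 1e-9 <= rq <= eigvals[-1] + 1e-9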
These properties hold for the degree-adjusted version as well, and hence zero is the smallest of the real, nonnegative eigenvalues of the normalized Laplacian. The accuracy of iterative optimization of the Rayleigh quotient is studied by Li [164].

The right eigenvector associated to the second-smallest eigenvalue of the Laplacian matrix is called a Fiedler vector, named after Fiedler [92,93], for his contributions to algebraic graph theory. If we instead use the normalized Laplacian, the corresponding vector is a normalized Fiedler vector.

2.4. Markov chains

A Markov chain is a stochastic process in which future states only depend on the current state, not the past, taking values from some countable state space. The probabilities for moving to another state from the current state form the transition matrix of the Markov chain.

In general, each Markov chain, independent of how the transition probabilities are defined, can be represented by a weighted directed graph where each state corresponds to a vertex, each edge corresponds to a transition that has nonzero probability, and the edge weight is the probability in question. For an unweighted graph, when one moves from one vertex to another choosing a neighbouring vertex uniformly at random, the transition matrix that results is the normalized adjacency matrix D⁻¹AG of the graph G. This means that the probability of moving from vertex v to w is simply

p_{v,w} = \begin{cases} \dfrac{1}{\deg(v)}, & \text{if } w \in \Gamma(v), \\ 0, & \text{otherwise.} \end{cases}    (20)

Such a walk is called random, blind, regular or simple, as it is but one of many possible definitions of walks in graphs. One may impose different kinds of weight functions on the neighbouring vertices, which generalizes to the definition of a Markov chain with the state set being the vertex set of the graph. For more insight into the mathematics and interpretations of random walks on graphs, see for example the survey by Lovász [167] or the textbook in preparation by Aldous and Fill [6]. The latter emphasizes the Markov chain connection.

The first passage time from state j to state i is the time step when the chain first visits state i when started at state j. The absorption time from state j to state i is the first passage time in a modified chain, where state i is made into an absorbing state by removing all of its outbound transitions.

The spectrum of the transition matrix can be used to evaluate the mixing time of the chain, which is the time it takes for the chain to reach its stationary distribution. The stationary distribution is a distribution that no longer changes over time as more and more transitions are being performed. It defines for each state the probability that the walk is at that state if a single observation is made after the walk has been run for a sufficiently long time. The stationary distribution can be obtained by computing the left eigenvector corresponding to the largest eigenvalue of the transition matrix. The primary eigenvalue λ1 of any transition matrix is one, as is the case for any stochastic matrix. The Perron–Frobenius theorem [113] states that for the non-principal eigenvalues, |λi| ≤ λ1 = 1. If there is more than one eigenvalue with the value one, the chain has more than one stationary distribution.

The eigenvectors form a basis for a vector space. As any vector, including the initial distribution, can be represented as an eigenvalue decomposition in the vector space determined by the eigenvectors, and all λi other than those corresponding to stationary distributions have absolute value smaller than one, the corresponding components get smaller and smaller as the chain is run. This implies that the smaller the eigenvalues λi are, the faster the chain converges to the stationary distribution [139]. For estimating the mixing time, the second eigenvalue of the transition matrix is already quite useful [211]. For more information on mixing times, we recommend Behrends’ book [23].
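As a small illustration of this random-walk view (our own sketch on an invented example graph, not material from the survey), the code below builds the transition matrix of Eq. (20) as D⁻¹A, recovers the stationary distribution from the leading left eigenvector, and reports the second-largest eigenvalue modulus, which relates to the mixing time discussed above.

import numpy as np

# Hypothetical undirected example graph, as in the earlier sketches.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
n = 6
A = np.zeros((n, n))
for v, u in edges:
    A[v, u] = A[u, v] = 1
deg = A.sum(axis=1)

# Transition matrix of the blind random walk, Eq. (20): P = D^{-1} A.
P = A / deg[:, None]

# Stationary distribution: left eigenvector of P for the eigenvalue one ...
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()

# ... which for a connected undirected graph is proportional to the degrees.
print(pi)
print(deg / deg.sum())

# The second-largest eigenvalue modulus gives a rough handle on the mixing time.
moduli = np.sort(np.abs(eigvals))[::-1]
print("second eigenvalue modulus:", moduli[1])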
3. Graph clustering

In this section, we begin the difficult work of defining what constitutes a cluster in a graph and what a clustering should be like; we also discuss some special classes of graphs. In some of the clustering literature, a cluster in a graph is called a community [186,107].

Formally, given a data set, the goal of clustering is to divide the data set into clusters such that the elements assigned to a particular cluster are similar or connected in some predefined sense. However, not all graphs have a structure with natural clusters. Nonetheless, a clustering algorithm outputs a clustering for any input graph. If the structure of the graph is completely uniform, with the edges evenly distributed over the set of vertices, the clustering computed by any algorithm will be rather arbitrary. Quality measures – and if feasible, visualizations – will help to determine whether there are significant clusters present in the graph and whether a given clustering reveals them or not.

In order to give a more concrete idea of what clusters are, we present here a small example. On the left in Fig. 1 we have an adjacency matrix of a graph with n = 210 vertices and m = 1505 edges: the 2m black dots (two for each edge) represent the ones of the matrix, whereas white areas correspond to zero elements. When the vertices are ordered randomly, there is no apparent structure in the adjacency matrix and one cannot trivially interpret the presence, number, or quality of clusters inherent in the graph. However, once we run a graph clustering algorithm (in this example, an algorithm of Schaeffer [205]) and re-order the vertices according to their respective clusters, we obtain a diagonalized version of the adjacency matrix, shown on the right in Fig. 1. Now the cluster structure is evident: there are 17 dense clusters of varying orders and some sparse connections between the clusters.

Matrix diagonalization in itself is an important application of clustering algorithms, as there are efficient computational methods available for processing diagonalized matrices, for example, to solve linear systems. Such computations in turn enable efficient algorithms for graph partitioning [214], as the graph partitioning problem can be written in the form of a set of linear equations. The goal in graph partitioning is to minimize the number of edges that cross from one subgroup of vertices to another, usually posing limits on the number of groups as well as on the relative size of the groups.
Fig. 1 – The adjacency matrix of a 210-vertex graph with 1505 edges composed of 17 dense clusters. On the left, the vertices are ordered randomly and the graph structure can hardly be observed. On the right, the vertex ordering is by cluster and the 17-cluster structure is evident. Each black dot corresponds to an element of the adjacency matrix that has the value one, the white areas correspond to elements with the value zero.

Fig. 2 – Two graphs both of which have 84 vertices and 358 edges. The graph on the left is a uniform random graph of the Gn,m model [84,85] and the graph on the right has a relaxed caveman structure [228]. Both graphs were drawn with spring-force visualization [203].

3.1. Generation models for clustered graphs

Gilbert [106] presented in 1959 a process for generating uniform random graphs with n vertices: each of the n(n − 1)/2 possible edges is included in the graph with probability p, considering each pair of vertices independently. In such uniform random graphs, the degree distribution is Poissonian. Also, as the edges are by construction distributed uniformly, no dense clusters can be expected.

A generalization of the Gilbert model, especially designed to produce clusters, is the planted ℓ-partition model [59]: a graph is generated with n = ℓ · k vertices that are partitioned into ℓ groups each with k vertices. Two probability parameters p and q < p are used to construct the edge set: each pair of vertices that are in the same group share an edge with the higher probability p, whereas each pair of vertices in different groups shares an edge with the lower probability q. The goal of the planted partition problem is to recover such a planted partition into ℓ clusters of k vertices each, instead of optimizing some measure on the partition. McSherry [173] discusses also planted versions of other problems such as k-clique and graph colouring.
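A minimal generator for the planted ℓ-partition model can be written directly from the definition; the sketch below is our own illustration, and the parameter names (ell, k, p, q) are merely illustrative choices.

import random

def planted_partition(ell, k, p, q, seed=None):
    """Generate a planted ell-partition graph: ell groups of k vertices each;
    intra-group edges appear with probability p, inter-group edges with q < p."""
    rng = random.Random(seed)
    n = ell * k
    group = [v // k for v in range(n)]       # vertices 0..k-1 form group 0, and so on
    edges = []
    for v in range(n):
        for u in range(v + 1, n):
            prob = p if group[v] == group[u] else q
            if rng.random() < prob:
                edges.append((v, u))
    return n, edges, group

# Example: 4 planted clusters of 25 vertices, dense inside, sparse between.
n, edges, group = planted_partition(ell=4, k=25, p=0.3, q=0.02, seed=1)
print(n, len(edges))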
A generalization of the Gilbert model, especially designed structure where clusters can be further divided into natural
to produce clusters, is the planted `-partition model [59]: a graph subclusters.
is generated with n = ` · k vertices that are partitioned into The procedure to create a relaxed caveman graph is the
` groups each with k vertices. Two probability parameters following [228]: a connection probability p ∈ (0, 1] of the
p and q < p are used to construct the edge set: each pair top level of the hierarchy is given as a parameter, together
of vertices that are in the same group share an edge with with a scaling coefficient s that adjusts the density of the
the higher probability p, whereas each pair of vertices in lower-level caves. The minimum nmin and maximum nmax
different groups shares an edge with the lower probability for the numbers of subcomponents (subcaves at higher levels,
r. The goal of the planted partition problem is to recover such vertices at the bottom level) are given as parameters. The
Should a vertex be allowed different “levels of membership” in several clusters? In document clustering, such a situation is easily imaginable: a document can be mainly about fishing, for example, but also address sailing-related issues, and hence could be clustered into “fishing” with 0.9 membership, for example, and into “sailing” with a level of 0.3. Another solution would be creating a supercluster to include all documents related to fishing and sailing, but the downside is that there can be documents on fishing that have no relation to sailing whatsoever.

For general clustering tasks, fuzzy clustering algorithms have been proposed [129,105], as well as validity measures [239]. Within graph clustering, not much work can be found on fuzzy clustering, and in general, the past decade has been quiet on the area of fuzzy clustering. Yan and Hsiao [241] present a fuzzy graph-clustering algorithm and apply it to circuit partitioning. A study on general clustering methods using fuzzy set theory is presented by Dave and Krishnapuram [68].

A fuzzy graph GR = (V, R) is composed of a set of vertices and a fuzzy edge-relation R that is reflexive and symmetrical, together with a membership function µR that assigns to each fuzzy edge a level of “presence” in the graph [75]. Different nonfuzzy graphs can be obtained by thresholding µR such that only those edges {v, u} for which µR(v, u) ≥ τ are included as edges in Gτ. The graph Gτ is called a cut graph of GR.

Dong et al. [75] present a clustering method based on a connectivity property of fuzzy graphs, assuming that the vertices represent a set of objects that is being clustered based on a distance measure. Their algorithm first preclusters the data into subclusters based on the distance measure, after which a fuzzy graph is constructed for each subcluster and a cut graph of the resulting graph is used to define what constitutes a cluster. Dong et al. also discuss the modifications needed in the current clustering upon updates in the database that contains the objects to be clustered.

Fuzzy clustering has not been established as a widely accepted approach for graph clustering, but it offers a more relaxed alternative for applications where assigning each vertex to just one cluster seems restricting while the vertex does relate more strongly to some of the candidate clusters than to others.
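As a small illustration of the cut-graph construction described above (our own sketch; the membership values and vertex labels are invented), thresholding the fuzzy edge-relation at a level τ yields an ordinary graph Gτ:

def cut_graph(membership, tau):
    """Return the edge set of the cut graph G_tau of a fuzzy graph: keep exactly
    those fuzzy edges whose membership value is at least tau.

    `membership` maps an unordered vertex pair (a frozenset) to a value in [0, 1]."""
    return {pair for pair, mu in membership.items() if mu >= tau}

# Hypothetical fuzzy edge-relation on four vertices.
mu_R = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"b", "c"}): 0.4,
    frozenset({"c", "d"}): 0.75,
    frozenset({"a", "d"}): 0.1,
}

print(cut_graph(mu_R, tau=0.5))   # two fuzzy edges survive: {a, b} and {c, d}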
3.3. Representations of clusters for different classes of graphs

It is common that in applications, the graphs are not just simple, unweighted and undirected. If more than one edge is allowed between two vertices, instead of a binary adjacency matrix it is customary to use a matrix that determines for each pair of vertices how many edges they share. Graphs with such edge multiplicities are called multigraphs.

Also, should the graph be weighted, cutting an important edge (with a large weight) when separating a cluster is to be punished more heavily than cutting a few unimportant edges (with very small weights). Edge multiplicities can in essence be treated as edge weights, but the situation naturally gets more complicated if the multiple edges themselves have weights.

Luckily, many measures extend rather fluently to incorporate weights or multiplicities. It is especially easy when the possible values are confined to a known range, as this range can be transformed into the interval [0, 1], where one corresponds to a “full” edge, intermediate values to “partial” edges, and zero to there being no edge between two vertices. With such a transformation, we may compute density not by counting edges but by summing over the edge weights in the unit line: the internal density of a cluster C (Eq. (22) in Section 3.2) is rewritten as

\delta_{\mathrm{int}}(C) = \frac{1}{|C|\,(|C| - 1)} \sum_{\substack{\{v,u\} \in E \\ v,u \in C}} \omega(v, u)    (25)

to account for the degree of “presence” of the edges. Now a cluster of high density has either many edges or important edges, and a low-density cluster has either few or unimportant edges. It may be desirable to do a nonlinear transformation from the original weight set to the unit line to adjust the distribution of edge importance if the clustering results obtained by the linear transformation appear noisy or otherwise unsatisfactory.

3.3.1. Bipartite graphs

A bipartite graph is a graph where the vertex set V can be split into two sets A and B such that all edges lie between those two sets: if {v, w} ∈ E, either v ∈ A and w ∈ B or v ∈ B and w ∈ A. Such graphs are natural for many application areas where the vertices represent two distinct classes of objects, such as customers and products; an edge could signify for example that a certain customer has bought a certain product. Possible clustering tasks could be grouping the customers by the types of products they purchase or grouping products purchased by the same people — the motivation could be targeted marketing, for instance. Carrasco et al. [41] study a graph of advertisers and keywords used in advertisements to identify submarkets by clustering.

A bipartite graph G = (A ∪ B, E) with edges crossing only between A and B and not within can be transformed into two graphs GA and GB. Consider two vertices v and w in A. As the graph is bipartite, Γ(v) ⊆ B as well as Γ(w) ⊆ B, and these two neighbourhoods may overlap. The more neighbours the two vertices in A share, the more “similar” they are. Hence, we create a graph GA = (A, EA) such that

{v, w} ∈ EA if and only if Γ(v) ∩ Γ(w) ≠ ∅.    (26)

Similarly a graph GB results from connecting vertices whose neighbourhoods in A overlap. Weighted versions of GA and GB can be obtained by setting

ω(v, w) = |Γ(v) ∩ Γ(w)|,    (27)

possibly normalizing with the maximum degree of the graph.

Clusterings can be computed either for the original bipartite graph G or for the derived graphs GA and GB [41]. An intuition on how this works can be developed by thinking of the set A as books in a bookstore and the set B as the customers of the store, the edges connecting a book to a customer if the customer has bought that book. Two customers have a similar taste in books if they have bought many of the same books, and two books appeal to a similar audience if several customers have purchased both. Hence the overlap of the neighbourhoods on the one side of the graph reflects the similarity of the vertices of the other side and vice versa.
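The projection of a bipartite graph onto one of its sides, Eqs. (26) and (27), is straightforward to implement; the following sketch (our own, with a toy books-and-customers example) returns the weighted edges of GA.

from itertools import combinations

def project_bipartite(A_side, adjacency):
    """Project a bipartite graph onto the vertex set A_side.

    `adjacency` maps each vertex of A_side to its set of neighbours on the other
    side; two A-vertices are joined when their neighbourhoods intersect (Eq. (26)),
    with weight equal to the size of the overlap (Eq. (27))."""
    weights = {}
    for v, w in combinations(sorted(A_side), 2):
        overlap = adjacency[v] & adjacency[w]
        if overlap:
            weights[(v, w)] = len(overlap)
    return weights

# Hypothetical customers-and-books example: books as side A, customers as side B.
bought_by = {
    "book1": {"ann", "bob"},
    "book2": {"ann", "bob", "eve"},
    "book3": {"eve"},
}
print(project_bipartite(bought_by.keys(), bought_by))
# {('book1', 'book2'): 2, ('book2', 'book3'): 1}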
Bipartite graphs often arise as representations of hypergraphs, that is, graphs where the edges are generalized to subsets of V of arbitrary order instead of being restricted to pairs of vertices. An example is factor graphs. Direct clustering of such hypergraphs is another problem of interest.

3.3.2. Directed graphs

Up to now, we have been dealing with undirected graphs. Let us turn to directed graphs, which require special attention, as the connections that are used in defining the clustering are asymmetrical and so is the adjacency matrix. This causes relevant changes in the structure of the Laplacian matrix and hence makes spectral analysis more complex.

Web graphs [36] are directed graphs formed by web pages as vertices and hyperlinks as edges. A clustering of a higher-level web graph formed by all Chilean domains was presented by Virtanen [227]. Clustering of web pages can help identify topics and group similar pages. This opens applications in search-engine technology; building artificial clusters is known to be a popular trick among websites of adult content to try to fool the PageRank algorithm [35] used by Google to rate the quality of websites.

The basic PageRank algorithm assigns initially to each web page the same importance value, which is then iteratively distributed uniformly among all of its neighbours. The goal of the PageRank algorithm is to reach a value distribution close to a stationary converged state using relatively few iterations. The amount of “importance” left at each vertex at the end of the iteration becomes the PageRank of that vertex, the higher the better. A large cluster formation of N vertices can be constructed to “maintain” the initial values assigned within the cluster without letting them drain out, eventually accumulating them to a single target vertex that would hence receive a value close to N times the initial value assigned to each vertex when the iterative computation of PageRank is stopped, even though the true value after global convergence would be low. Identifying such clusters helps to overcome the problem.

Another example of directed graphs are web logs (widely known as blogs), which are web pages on which users can make comments — identifying communities of users regularly commenting on each other is not only a task of clustering a directed graph, but also involves a temporal element: the blogs evolve over time and hence the graph is under constant change as new links between existing blogs appear and new blogs are created [157]. The blog graph can be viewed directly as the graph of the links between the blogs or as a bipartite graph of blogs and users. Also the content of the online shared-effort encyclopedia Wikipedia forms an interesting directed graph.
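To illustrate the iterative redistribution of importance described above, here is a bare-bones power-iteration sketch of the PageRank idea (our own simplification, not Google's implementation; the damping factor and the tiny link graph are arbitrary illustrative choices).

def pagerank(out_links, damping=0.85, iterations=50):
    """Iteratively redistribute importance along directed links (power iteration).

    `out_links` maps each page to the list of pages it links to."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # equal initial importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, targets in out_links.items():
            if not targets:                       # dangling page: spread evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Tiny hypothetical link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(links))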
4. Measures for identifying clusters

There are two main approaches for identifying a good cluster: one may either compute some values for the vertices and then classify the vertices into clusters based on the values obtained, or compute a fitness measure over the set of possible clusters and then choose among the set of cluster candidates those that optimize the measure used. In this section we first overview vertex similarity measures that can be used in the former manner to identify the cluster of a specific vertex or to group all of the vertices into a set of clusters, and then present possible cluster fitness measures that serve for methods that produce the clustering by comparing different groupings and selecting one that meets or optimizes a certain criterion.

4.1. Vertex similarity

There are many clustering algorithms based on similarities between the vertices. Should the vertices represent documents, for example, one could compute content-based similarity values for all pairs of documents and use the similarity matrix as a basis for the clustering, attempting to group together vertices that are not only well connected but also similar to each other. The higher the similarity, the stronger the need to cluster the vertices together. Computing such similarities is not necessarily simple, and in some cases evaluating the similarity of two vertices may turn out to be a task even more complex than the clustering of the graph once the similarities are known.

If a similarity measure has been defined for the vertices, the cluster should contain vertices with close-by values and exclude those for which the values differ significantly from the values of the included vertices. If instead of similarity we use a distance measure, the cluster boundary should be located in an area where including more of the outside vertices would drastically increase the intracluster distances (for example, the sum of squares of all-pairs distances). Hence, with distance measures, it is desirable to cluster together vertices that have small distances to each other.

4.1.1. Distance and similarity measures

Defining or selecting an appropriate similarity or distance function depends on the task at hand. A large number of similarity measures have been used in the literature over the decades [122,233]. Given a data set, a distance measure dist(d_i, d_j) should fulfil the following criteria:

1. The distance from a datum to itself is zero: dist(d_i, d_i) = 0.
2. The distances are symmetrical: dist(d_i, d_j) = dist(d_j, d_i).
3. The triangle inequality holds:
   dist(d_i, d_j) ≤ dist(d_i, d_k) + dist(d_k, d_j).    (28)

For points in an n-dimensional Euclidean space, possible distance measures for two data points d_i = (d_{i,1}, d_{i,2}, . . . , d_{i,n}) and d_j = (d_{j,1}, d_{j,2}, . . . , d_{j,n}) include the Euclidean distance

\operatorname{dist}(d_i, d_j) = \sqrt{\sum_{k=1}^{n} (d_{i,k} - d_{j,k})^2},    (29)

which is the L2 norm, the Manhattan distance

\operatorname{dist}(d_i, d_j) = \sum_{k=1}^{n} |d_{i,k} - d_{j,k}|,    (30)

which is the L1 norm, and the L∞ norm

\operatorname{dist}(d_i, d_j) = \max_{k \in [1,n]} |d_{i,k} - d_{j,k}|.    (31)
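These three norms are easy to compute directly from their definitions; a plain Python sketch (illustrative only, with invented data points) follows.

def euclidean(di, dj):
    """L2 norm, Eq. (29)."""
    return sum((x - y) ** 2 for x, y in zip(di, dj)) ** 0.5

def manhattan(di, dj):
    """L1 norm, Eq. (30)."""
    return sum(abs(x - y) for x, y in zip(di, dj))

def chebyshev(di, dj):
    """L-infinity norm, Eq. (31)."""
    return max(abs(x - y) for x, y in zip(di, dj))

# Two hypothetical data points in R^3.
p, q = (1.0, 2.0, 0.0), (4.0, 0.0, 1.0)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))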
A typical example of a non-Euclidean space is that formed by vector representations of textual data: for a collection of m text documents D1, D2, . . . , Dm, each term ti that appears in at least one of the documents is represented by a dimension. Denote the number of terms by n — note that typically non-informative words like articles and prepositions are filtered out to reduce the dimensionality. Each document Dj is then represented by a datum dj such that the element at position i is the frequency at which term ti appears in document Dj, i.e., how many times the term ti is included in the document. Typically the frequencies are normalized to eliminate the effect of variations in document length. Usually these frequencies are then multiplied by a factor that is inversely proportional to the number of documents in which the term appears, to give more weight to terms that appear in fewer documents. The product measure is called term-frequency inverse-document-frequency (tf-idf) and is commonly used in the field of data mining and also well studied [235].

Once the vectors are computed, a similarity measure can be applied. Possibilities include variations of the aforementioned three distances as well as the dot product and/or the angle between the vectors. A common measure that utilizes the latter two is the cosine similarity, also known as the Ochiai coefficient: for two vectors d_i = (d_{i,1}, d_{i,2}, . . . , d_{i,n}) and d_j = (d_{j,1}, d_{j,2}, . . . , d_{j,n}), their cosine similarity is the angle

\rho(d_i, d_j) = \arccos\left( \frac{d_i \cdot d_j}{\sqrt{\sum_{k=1}^{n} d_{i,k}^2}\;\sqrt{\sum_{k=1}^{n} d_{j,k}^2}} \right).    (32)

As the resulting measure is an angle in [0, π), the most dissimilar value is π/2 and zero is the best possible similarity. An example of using cosine similarity in clustering is the work of Lakroum et al. [159].
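The following sketch (our own; it uses a simple logarithmic idf weighting, which is only one of several common tf-idf variants, and invented toy documents) builds crude tf-idf vectors and compares them with the cosine measure of Eq. (32).

import math
from collections import Counter

def tf_idf_vectors(documents):
    """Build simple tf-idf vectors: term frequency scaled by log(m / document frequency)."""
    m = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({t for doc in tokenized for t in doc})
    df = {t: sum(1 for doc in tokenized if t in doc) for t in terms}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(m / df[t]) for t in terms])
    return vectors

def cosine_angle(di, dj):
    """Angle between two vectors, Eq. (32); smaller angles mean more similar documents."""
    dot = sum(x * y for x, y in zip(di, dj))
    norm_i = math.sqrt(sum(x * x for x in di))
    norm_j = math.sqrt(sum(x * x for x in dj))
    cosine = max(-1.0, min(1.0, dot / (norm_i * norm_j)))   # guard against rounding
    return math.acos(cosine)

docs = ["fishing boats and fishing gear", "sailing boats and sailing gear", "tax law"]
v = tf_idf_vectors(docs)
print(cosine_angle(v[0], v[1]), cosine_angle(v[0], v[2]))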
Many similarity measures are based on the Jaccard index [133], defined for sets A and B as

\rho(A, B) = \frac{|A \cap B|}{|A \cup B|}.    (33)

This is easily transformed into a distance measure: dist(A, B) = 1 − ρ(A, B). This idea generalizes to n-dimensional binary vectors A = (a1, a2, . . . , an) and B = (b1, b2, . . . , bn) as follows: denote by Ci,j the number of positions k ∈ [1, n] in which ak = i and bk = j. The Jaccard similarity coefficient for such vectors A and B is

\rho(A, B) = \frac{C_{1,1}}{C_{0,1} + C_{1,0} + C_{1,1}}    (34)

and their Jaccard distance is

\operatorname{dist}(A, B) = \frac{C_{1,0} + C_{0,1}}{C_{0,1} + C_{1,0} + C_{1,1}}.    (35)

An advantage of the Jaccard index and the derived measures is that they can be applied to categorical data, where the data attributes are not numerical but rather represent the presence or absence of a property. An example of the application of the Jaccard coefficient is the work of Dong et al. [75].

The cosine similarity was extended to coincide with the Jaccard similarity for n-dimensional binary vectors A = (a1, a2, . . . , an) and B = (b1, b2, . . . , bn); the extension is called the Tanimoto coefficient [220], defined as

\rho(A, B) = \frac{A \cdot B}{\sum_{k=1}^{n} a_k^2 + \sum_{k=1}^{n} b_k^2 - A \cdot B}.    (36)

There exists a distance measure based on the Tanimoto coefficient that fulfils the triangle inequality (Eq. (28)) [166].

For string data, other typical distance measures include the edit distance (also known as the Levenshtein distance [118]), which is the number of character insertions and/or deletions that need to be made in order to transform di into dj, possibly with different costs for different operations. Cohen, Ravikumar, and Fienberg [57] give a survey on string-similarity metrics, and a comprehensive survey on clustering feature vectors (i.e. points in high-dimensional space) is given by Jain et al. [79,134,135].

4.1.2. Adjacency-based measures

In some applications, the vertices lack additional properties and there is nothing in the vertices themselves that would allow the computation of a similarity matrix. The situation is not desperate, however, as the edges incident to the vertices can be used to derive similarity measures for the vertices, either using the adjacency information directly or through some more sophisticated computation. In this section we review vertex-similarity measures based on the structural properties of the graph instead of some application-specific properties imposed on the vertices.

Possibly the most straightforward manner of determining whether two vertices are similar using only the adjacency information is to study the overlap of their neighbourhoods in G = (V, E), computing the ratio of the intersection and the union of the two sets,

\omega(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{|\Gamma(v) \cup \Gamma(w)|},    (37)

arriving at the Jaccard similarity of Eq. (34) in Section 4.1.1. The measure takes values in [0, 1]: zero when there are no common neighbours, one when the neighbourhoods are identical.

Another measure is the so-called (Pearson) correlation of the columns (or rows) in a modified adjacency matrix C = AG + I (the modification simply forces all reflexive edges to be present). The Pearson correlation is defined for two vertices vi and vj corresponding to the columns i and j of C as

\frac{n \sum_{k=1}^{n} c_{i,k} c_{j,k} - \deg(v_i)\deg(v_j)}{\sqrt{\deg(v_i)\deg(v_j)\,\bigl(n - \deg(v_i)\bigr)\bigl(n - \deg(v_j)\bigr)}}.    (38)

This value can then be used as an edge weight ω(vi, vj) to construct a symmetrical similarity matrix. It reaches the value one if and only if the two vertices have the same neighbourhood, and for neighbourhoods with no overlap the value is a negative number no less than −1, depending on the degrees of the vertices. Correlations can also be applied to other measures than the plain adjacency data to determine cluster structure [40].
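Both adjacency-based measures can be computed directly from adjacency lists. In the sketch below (our own illustration; it treats the degrees in Eq. (38) as the column sums of C, i.e. counting the added reflexive edge, and assumes those column sums are strictly between 0 and n), two vertices with identical closed neighbourhoods indeed obtain correlation one.

import math

def neighbourhood_jaccard(adj, v, w):
    """Neighbourhood overlap of Eq. (37): |Γ(v) ∩ Γ(w)| / |Γ(v) ∪ Γ(w)|."""
    union = adj[v] | adj[w]
    return len(adj[v] & adj[w]) / len(union) if union else 0.0

def pearson_similarity(adj, v, w, n):
    """Pearson correlation of two columns of C = A_G + I, following Eq. (38).

    Degrees are taken as column sums of C; assumes 0 < degree < n."""
    cv = adj[v] | {v}                      # column of C for v (reflexive edge added)
    cw = adj[w] | {w}
    deg_v, deg_w = len(cv), len(cw)
    dot = len(cv & cw)                     # sum over k of c_{v,k} * c_{w,k}
    num = n * dot - deg_v * deg_w
    den = math.sqrt(deg_v * deg_w * (n - deg_v) * (n - deg_w))
    return num / den

# Hypothetical adjacency lists (undirected).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(neighbourhood_jaccard(adj, 0, 1))        # vertices 0 and 1 share neighbour 2
print(pearson_similarity(adj, 0, 1, n=len(adj)))   # identical columns of C -> 1.0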
The conductance Φ(S) of a cut (S, V\S) is defined for any proper nonempty subset S ⊂ V in a graph G = (V, E) as follows:

\Phi(S) = \frac{c(S, V \setminus S)}{\min\{\deg(S), \deg(V \setminus S)\}}.    (40)

Finding a cut with minimum conductance is NP-hard [210]. Variants of conductance include normalized cut [121,209] and expansion [95,142], as well as the cut ratio. The problem of optimizing the cut ratio is called the sparsest cut problem and it is known to be NP-hard [172].

In order to arrive at more independence measures, we define the internal degree of a cluster C to be the number of edges connecting vertices in C to each other:

degint(C) = |{{v, u} ∈ E | v, u ∈ C}|    (41)

and the external degree of a cluster to be the number of edges that connect it to the rest of the graph:

degext(C) = |{{v, u} ∈ E | v ∈ C, u ∈ V\C}|.    (42)

Note that the external degree is in fact the size of the cut (C, V\C). Using these definitions, we arrive at another “independence” measure used in clustering: the relative density ρ(C) [178]. Relative density is the ratio of the internal degree to the number of incident edges:

\rho(C) = \frac{\deg_{\mathrm{int}}(C)}{\deg_{\mathrm{int}}(C) + \deg_{\mathrm{ext}}(C)} = \frac{\sum_{v \in C} \deg_{\mathrm{int}}(v, C)}{\sum_{v \in C} \bigl(\deg_{\mathrm{int}}(v, C) + 2\deg_{\mathrm{ext}}(v, C)\bigr)}.    (43)

For cluster candidates with only one vertex (and any other candidate that is an independent set), we set ρ(C) = 0. Thresholding ρ(C) is NP-complete [210].

The computational challenge lies in identifying subgraphs within the input graph that reach a certain value of a measure, whether of density or independence, as the number of possible subgraphs is exponential. Consequently, finding the subgraph that optimizes the measure (i.e. a subgraph of a given order k that reaches the maximum value of a measure in the graph) is computationally hard. However, as the computation of the measure for a known subgraph is polynomial, we may use these measures to evaluate whether or not a given subgraph is a good cluster. We will return to this property of easy computation of these quality measures in Section 6.3.
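Evaluating these fitness measures for a known candidate cluster is indeed straightforward; the following sketch (our own, on an invented example graph, assuming both sides of the cut are incident to at least one edge) computes the internal and external degrees, the relative density of Eq. (43) and the conductance of Eq. (40).

def cluster_measures(edges, cluster):
    """Compute internal degree (Eq. (41)), external degree (Eq. (42)),
    relative density (Eq. (43)) and conductance (Eq. (40)) of a candidate cluster."""
    C = set(cluster)
    deg = {}
    deg_int = deg_ext = 0
    for v, u in edges:
        deg[v] = deg.get(v, 0) + 1
        deg[u] = deg.get(u, 0) + 1
        if v in C and u in C:
            deg_int += 1
        elif (v in C) != (u in C):
            deg_ext += 1
    rel_density = deg_int / (deg_int + deg_ext) if deg_int + deg_ext else 0.0
    deg_S = sum(d for v, d in deg.items() if v in C)
    deg_rest = sum(d for v, d in deg.items() if v not in C)
    conductance = deg_ext / min(deg_S, deg_rest)   # assumes both sides touch edges
    return deg_int, deg_ext, rel_density, conductance

# Hypothetical graph with a fairly dense candidate cluster {0, 1, 2, 3}.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(cluster_measures(edges, {0, 1, 2, 3}))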
5. Global methods for graph clustering

This section addresses methods that are designed to obtain a global clustering for a given graph. The existing global approaches are capable of dealing with up to a few million vertices on sparse graphs [128,185,187]. In a global clustering, each vertex of the input graph is assigned a cluster in the output of the method, whereas in a local clustering, the cluster assignments are only done for a certain subset of vertices, commonly only one vertex. Local clustering will be the topic of Section 6. A brief survey on some global clustering methods is given by Newman [184].

When applying graph clustering to a specific application, one needs to choose or design the clustering algorithm to be used within a certain “framework” of defining what is a global clustering of a graph. One decision is whether the clusters C1, . . . , Ck should form a partition such that

Ci ∩ Cj = ∅ if i ≠ j,    (44)

or alternatively a cover of the data set — the latter option allows each datum to belong to more than one cluster, but each datum needs to be assigned to at least one cluster. In applications, both cases are, however, possible; we will return to the application-specific nature of clustering in Section 8.

5.1. Complexity of global clustering

In this section we discuss some related problems where a dataset – which can be represented as a (weighted) complete graph – is divided into clusters that optimize a certain criterion. Understanding the approximability of, and the algorithms for, these problems helps to understand how good global clustering algorithms can be.

The minimum k-clustering problem is the combinatorial optimization problem where a finite data set D is given together with a distance function d : D × D → N, where d satisfies the triangle inequality (Eq. (28)). The task is to partition D into k clusters C1, C2, . . . , Ck, where Ci ∩ Cj = ∅ for i ≠ j, such that the maximum intracluster distance is minimized (i.e. the maximum distance between two points assigned to the same cluster). This problem is approximable within a factor of two, but not approximable within (2 − ε) for any ε > 0 [112,125].

A related problem is the minimum k-centre problem, where a complete graph is given with a distance function d : V × V → N and the goal is to construct a set of centres C ⊆ V of fixed order |C| = k such that the maximum distance from a vertex to the nearest centre is minimized. Essentially, this is not a graph problem as the data set is simply a set of datums and their distances: the edges play no role.

If the distance function satisfies the triangle inequality, the minimum k-centre problem can be approximated within a factor of two [125], but it is not approximable within (2 − ε) for any ε > 0 [131]. Without the triangle inequality, the problem is not in APX [126]. A capacitated version, where the triangle inequality does hold but the number of vertices “served” by a single centre vertex is bounded from above by a constant, is approximable within a factor of five [146]: a centre serves a vertex if it is the closest centre to that vertex. Another capacitated version, where the maximum distance is bounded by a constant and the task is to choose a minimum-order set of centres [19], is approximable within a factor of log c + 1, where c is the capacity of each centre [21]. Agarwal and Procopiuc [4] present an exact and an approximation algorithm for the k-centre problem with extensions to various distance metrics as well as both exact and approximate algorithms for the capacitated version of the problem.

A weighted version of the k-centre problem, where the distance of a vertex to a centre is multiplied by the weight of the vertex and the maximum of this product is to be minimized, is approximable within a factor of two [195], but it cannot be approximated within (2 − ε) for any ε > 0. If it is not the maximum distance that is of interest, but the sum of the distances to the nearest centre is minimized instead while
keeping the order of the centre set fixed, the problem is called the minimum k-median problem. Feder and Greene [88] show that the problems of minimizing the maximum intracluster distance in general and of minimizing the distance from each point to its cluster centre cannot be approximated within a factor close to two for points in R^d if d ≥ 2, unless P = NP.

A popular algorithm for clustering vector data with respect to a distance function is the k-means algorithm [119]. The basic idea of the k-means method is to cluster a set of points in some metric space into k clusters by iteratively improving k cluster centres and grouping each point to the cluster with the closest centre; the centres are chosen to minimize the sum of squares of the intracluster distances. Unfortunately, k-means is NP-hard even for k = 2 [1].

For more variants of such problems and the related complexity results, we refer the reader to the excellent online resource of Crescenzi and Kann [62] with a brief definition, a summary of the results, and bibliographical references for numerous problems in combinatorial optimization.
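For concreteness, a minimal Lloyd-style iteration conveying the k-means idea sketched above follows (our own simplification, not the algorithm of [119] verbatim; initial centres are sampled at random and empty clusters keep their old centre).

import random

def k_means(points, k, iterations=100, seed=0):
    """Lloyd-style iteration: alternately assign points to the nearest centre
    and move each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[i].append(p)
        new_centres = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centres.append(tuple(sum(p[d] for p in members) / len(members)
                                         for d in range(dim)))
            else:                          # keep an empty cluster's centre unchanged
                new_centres.append(centres[i])
        if new_centres == centres:         # converged
            break
        centres = new_centres
    return centres, clusters

# Hypothetical 2-D points forming two groups.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
print(k_means(pts, k=2))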
system needs to maintain a queue buffer for the incoming
data and may congest. The approach of Guha et al. [116] is a
5.2. Iterative or online computation of global clusterings
constant-factor approximation for a k-means algorithm that
uses 2k medians during the observation phase that are used
Clustering can either be performed to all of the data elements
compress information on the entire dataset. The algorithm
at once, or iteratively, assigning one element at a time to
has multiple phases and relies on subroutines defined in
an appropriate cluster. Approaches that require the entire other work, such as an algorithm by Jain and Vazirani [136]
graph to be accessible simultaneously do not scale well for for finding k medians in a data set of order O(k).
large data sets. In iterative clustering, the cluster assignments Zanghi, Ambroise and Miele [247] present an online
made to elements upon their first processing may either be clustering algorithm for graphs that clusters the graph into
considered immutable or may be changed later on to optimize k clusters, although they ran the algorithm in parallel for
some property of the clustering being computed. various values of k = 2, 3, . . . and choose the clustering that
If a clustering algorithm operates one datum at a time, maximizes the integrated classification likelihood (discussed
having only the knowledge of previously encountered data, in Section 7.2). Their approach is based on assuming a certain,
it is said to operate online. Also methods that process a group relatively high probability for connections within clusters and
of elements at a time are possible. Such online algorithms for a smaller one for intercluster connections, similar to that of
clustering provide a partial clustering for the data already the planted partition model.
seen from an unknown data stream to be clustered. They can
be designed to dynamically determine the number of clusters 5.3. Hierarchical clustering
to use, often relying on some threshold value to determine
when a newly arriving datum needs to be assigned a new A global clustering does not have to be a single partition or
cluster instead of merging it to an existing cluster. cover, but it may also be defined as a hierarchical structure,
In order to successfully cluster a large database, an online where each top-level cluster is composed of subclusters and
clustering method should scan the database at most once. It so forth. This is useful in situations where the graph structure
should also be able to provide some solution, at least a crude itself is hierarchical, and a single cluster can naturally be
approximate, at any time, while incrementally incorporating composed further to obtain a more fine-grained clustering or
newly added data into the existing clustering [32]. Such an alternatively merged with another cluster to obtain a coarser
approach of incremental clustering is useful for clustering data division into clusters.
sets that undergo frequent modification, such as addition, It depends on the application and the input data whether
removal or editing of the data elements [43]. Incremental it makes sense to compute a hierarchy of clusterings or a flat
clustering has been suggested for web page classification by clustering. In a flat clustering, each cluster is defined by vertex
Wong and Fu [236]. subset C ⊆ V as the subgraph induced by C, although often
Toussaint [222] studied iterative online clustering for the subset itself is called a cluster. In a hierarchical clustering,
points in space by identifying the nearest neighbour of the point each level of the clustering hierarchy defines a different
being clustered among the set of already clustered points: subset, and usually the clusters defined by the higher levels
the new arrival is assigned to the same cluster than the contain the clusters of the lower levels as subgraphs.
neighbour. For graph clustering, the distance measure used For example, if the number of clusters in which to
should preferably incorporate some structural information on group the data is known a priori, there is little use in
connectivity among the vertices further than the immediate knowing an entire hierarchy and it may be better to resort to
neighbourhood. flat clustering. However, in many contexts, the hierarchical
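Returning for a moment to the online setting of Section 5.2: the threshold rule for deciding between joining an existing cluster and opening a new one can be illustrated with a short sketch. The function below is a toy illustration only, not any of the cited algorithms; the similarity function `sim` and the threshold `tau` are placeholder names of our own.

```python
def online_cluster(stream, sim, tau):
    """Assign each arriving item to the most similar cluster seen so far, or
    open a new cluster if no similarity reaches the threshold tau.
    Toy sketch of threshold-based online clustering, not a published method."""
    clusters = []                       # each cluster is a list of items
    for item in stream:
        best, best_score = None, tau
        for c in clusters:
            # score the candidate cluster by its most similar member
            score = max(sim(item, member) for member in c)
            if score >= best_score:
                best, best_score = c, score
        if best is None:
            clusters.append([item])     # nothing similar enough: new cluster
        else:
            best.append(item)           # merge the item into the best cluster
    return clusters

# Example with one-dimensional points and similarity 1 / (1 + distance):
points = [0.1, 0.2, 5.0, 5.1, 0.15, 4.9]
print(online_cluster(points, lambda a, b: 1.0 / (1.0 + abs(a - b)), tau=0.5))
```

With the toy threshold above, the two groups of nearby points end up in separate clusters; in a graph setting the similarity would be replaced by a structural measure, as the text suggests.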
Fig. 4 – A company is divided into three departments, each of which is formed by two or three 5–7 person teams. Each
person is represented by a vertex, and an edge is placed between two people if they interact on work-related matters on a
daily basis. The teams and the departments are encompassed by dotted lines.
Fig. 5 – An example dendrogram that groups 23 elements into clusters at four intermediate levels, the root cluster
containing the entire dataset and the leaf clusters each containing one data point. Any level of the dendrogram, indicated
by dotted lines in the picture, can be interpreted as a clustering, grouping together as a cluster those elements that remain
in the same branch of the dendrogram tree above the line. In the hierarchy, the cluster [1, 2] is a subcluster of [1, 5] which in
turn is a subcluster of [1, 8], and so forth.
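The caption's reading of a dendrogram, in which any level can be interpreted as a flat clustering, can be mimicked with a few lines of code. The sketch below is purely illustrative and uses a minimal representation of our own (a list of pairwise merges); it is not tied to any of the algorithms discussed in the text.

```python
def flat_clustering(n_items, merges, level):
    """Return the flat clustering obtained after performing the first `level`
    merges of an agglomerative dendrogram.  `merges` is a list of (i, j) pairs
    of item indices whose current clusters are joined.  Toy sketch only."""
    clusters = [{i} for i in range(n_items)]      # start from singletons
    for i, j in merges[:level]:
        a = next(c for c in clusters if i in c)
        b = next(c for c in clusters if j in c)
        if a is not b:
            clusters.remove(a)
            clusters.remove(b)
            clusters.append(a | b)
    return clusters

# A 5-element toy dendrogram: {0,1} and {2,3} merge first, then {2,3,4},
# and the root joins everything.  Cutting after three merges gives two clusters.
merges = [(0, 1), (2, 3), (3, 4), (0, 2)]
print(flat_clustering(5, merges, level=3))        # [{0, 1}, {2, 3, 4}]
```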
However, in many contexts, the hierarchical structure itself can be of interest. This is the case in the field of social networks: within a city, a workplace will form a cluster and so will a school, but within workplaces and schools, a new level of clusters appears from work teams, classes, etc. See Fig. 4 for a toy example with this kind of a structure. Comparing the existing "official" hierarchy, such as the one defined by department boundaries and team memberships, with a hierarchical clustering of the current person-to-person contact graph may give reorganizational insight and reveal hidden self-organization among the people. One of the seminal studies on social network theory was that of Zachary [244], who predicted a split in a karate club by analysing the interactions between the members.

Clustering methods that produce multi-level clusterings are called hierarchical clustering algorithms, as opposed to flat clusterings that comprise a single partition or cover. A hierarchical clustering is generally constructed by generating a sequence of partitions, where each subcluster belongs to one supercluster in its entirety. The root cluster contains at most all of the data, and each of the leaf clusters contains at least one data element; semantically relevant clusters usually appear on intermediate levels.

Such a tree is called a dendrogram; an example is shown in Fig. 5. If at each iteration each cluster is split into two, a balanced binary tree of clusters of different levels will result. If the graph has high structural variations, such as large density differences, interpretation of the tree needs some insight: at a certain level of the hierarchy, some natural clusters may already have been split into two or more parts while sets of other natural clusters are still to be identified within the remaining subgraphs.
In practice, a good algorithm for clustering will not only consider the clusterings at certain levels of the tree (drawn in the figure), but also clusterings that result from cutting different branches of the dendrogram at different levels. Determining where to cut the dendrogram usually involves the optimization of some global quality measure for the entire clustering. Such measures are discussed in Section 7.

Hierarchical clustering algorithms can be further divided into two classes, depending on whether the partition is refined or coarsened during each iteration:

• top-down or divisive algorithms (addressed in Section 5.4) that split the dataset iteratively or recursively into smaller and smaller clusters [34,37,95,97,107,138,185,186,209], and
• bottom-up or agglomerative algorithms (addressed in Section 5.5) that start with each data element in its own singleton cluster or another set of small initial clusters, iteratively merging these clusters into larger ones [41,74,78,128].

At each step, the clustering algorithm selects the clusters to merge or split by optimizing a certain criterion on the data set. A stopping condition may be imposed on the algorithm to select the best clustering with respect to a quality measure on the current cluster set.

Křivánek and Morávek [158] present results on the complexity of problems related to hierarchical clustering. They formulate the problem as clustering a set of objects with respect to a dissimilarity matrix into a dendrogram. The goal is to find a k-level dendrogram that minimizes a measure that combines the dissimilarity values and the grouping made at each level of the dendrogram for the objects being clustered. In their terminology, level one consists of the singleton sets of the objects and level k consists of the entire set of objects. They show that for k ≥ 3 the problem is NP-hard. If the goal is to find the dendrogram of arbitrary height that minimizes the measure, the problem is also NP-hard.

5.4. Divisive global clustering

Divisive clustering algorithms are a class of hierarchical methods that work top-down, recursively partitioning the graph into clusters. The split at each iteration is typically into two sets, but there is no reason why a clustering algorithm could not divide a vertex set into more than two sets for the next iteration. The various criteria for determining where to split the graph are discussed in this section.

5.4.1. Cuts

One intuitive approach is to look for small cuts (as defined in Section 2.3) in the input graph. Note that the notion of a cut can be naturally defined for directed and undirected graphs as well as for weighted or unweighted ones. The minimum cut in a given (weighted) graph can be found efficiently with a maximum-flow algorithm [60,83,96].

We wish to split the graph in two by removing a cut. Remembering that we want the clusters to be dense subsets with respect to the global density of the graph, a well-chosen cut should separate two or more clusters, instead of breaking into two the vertex set of any single cluster. There are two complications with this idea, however. Firstly, we would like to be able to make some statements regarding the relative order of the subgraphs separated by the cut: just cutting out single vertices does not help much in computing a clustering, as removing vertices one by one results in singleton clusters and does not reveal any higher-level structural properties. However, posing restrictions on the orders of the resulting subgraphs makes the complexity of the problem harder.

For example, requiring the two "sides" of the split in the graph to have the same order results in an NP-hard problem: minimum bisection is the problem of dividing a 2n-vertex graph into two n-vertex subgraphs such that the cut size is minimized [104]. Deciding whether such a cut exists remains NP-complete even for regular graphs [37] and for graphs with bounded maximum degree [170], but the problem is polynomial for trees [170] and graphs with bounded treewidth [212]. Feige and Krauthgamer [89] provide approximations for the minimum bisection problem. For a survey on related problems, called graph layout problems, see Díaz, Petit, and Serna [71].

As graph bisection is NP-hard, ℓ-partition, where the graph is to be partitioned into ℓ equal-sized groups such that the grouping minimizes the total number of edges crossing from one group to another, is also NP-hard. Johnson et al. [138] discuss efficient strategies for solving min-cut clustering problems from an integer programming viewpoint.

Condon and Karp [59] present an ℓ-bisection algorithm that finds in linear time the optimal partition with probability $1 - \exp(-n^{\Theta(\varepsilon)})$ under the planted ℓ-partition model with $p \geq r + n^{-1/2+\varepsilon}$ for constant ε. Their algorithm greedily classifies the vertices into two groups, L1 and R1, minimizing the total number of edges crossing the various cuts. The processing order of the vertices is based on randomly and uniformly sampling vertex pairs among the unprocessed vertices. This division is then recursively applied to the sets L1 and R1 to create a second-level division into four groups, and further until the desired group size ℓ has been reached. Note that the algorithm will fail for some combinations of ℓ and n. They also present a nonrecursive version. Dubhashi, Laura and Panconesi [80] further develop the approach of Condon and Karp to cluster categorical data rather than graphs.

The second complication with cut-based methods is shared by most hierarchical divisive algorithms: one needs to know when to stop splitting the graph. Setting limits on the cluster order or the number of clusters can be feasible in the presence of a priori information on what the clustering should be like. Another approach is to optimize some cluster quality index; such issues are discussed in Section 7.

Hartuv and Shamir [120] propose a divisive clustering algorithm that uses a density-based stopping condition. Their intuition is that for some vertices to belong to the same cluster, they should be highly connected to each other, whereas there should not be many paths connecting them to vertices outside the cluster. The splitting of the graph is done by removing from the graph at each iteration the edges that cross the current minimum cut. For each connected component, they check whether the component is highly connected. If it is, it will not be divided further. If it is not highly connected, the iteration continues with the removal of the edges crossing the minimum cut. Their definition of a highly connected graph is that the edge-connectivity of a graph of order n is above n/2.
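The stopping condition of Hartuv and Shamir lends itself to a compact recursive sketch. The code below is only an illustration in the spirit of their procedure; it assumes that the networkx library is available and that the input graph is connected, and it simply reuses the library's minimum_edge_cut and edge_connectivity routines rather than the authors' own implementation.

```python
import networkx as nx

def highly_connected(G):
    # Hartuv-Shamir criterion: edge connectivity above half the number of vertices.
    return nx.edge_connectivity(G) > G.number_of_nodes() / 2

def hcs_clusters(G):
    """Recursively remove minimum-cut edges until every remaining connected
    component is highly connected; return the components as clusters.
    Illustrative sketch only; the input graph is assumed to be connected."""
    if G.number_of_nodes() <= 2 or highly_connected(G):
        return [set(G.nodes())]
    H = G.copy()
    H.remove_edges_from(nx.minimum_edge_cut(G))
    clusters = []
    for component in nx.connected_components(H):
        clusters.extend(hcs_clusters(G.subgraph(component).copy()))
    return clusters

# Two 4-cliques joined by a single edge split cleanly into two clusters.
G = nx.complete_graph(4)
G.add_edges_from((u + 4, v + 4) for u, v in nx.complete_graph(4).edges())
G.add_edge(0, 4)
print(hcs_clusters(G))    # [{0, 1, 2, 3}, {4, 5, 6, 7}]
```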
Rather than simple cut size, a popular criterion for partitioning is that of low conductance [41,209,210], believed to be in general superior to simple minimum cuts when used in graph clustering [95,142]. One reason for such preference is that conductance also takes into account the orders of the sets that are being cut apart, often yielding more significant separations. Although finding a cut with minimum conductance is NP-hard [210], several clustering algorithms have been proposed based on variants of conductance [142,121,95,34,48,108], generally iteratively finding a cut with low conductance or normalized cut size and splitting the graph. The general idea is that conductance and other similar measures tend to reach optimal values at cluster boundaries and not within clusters, as each cluster should be internally dense while being sparsely connected to the rest of the graph. Hence the division should not likely break a natural cluster but rather separate clusters from other clusters. Care must be taken to define the stopping condition in order to recognize when the subgraph that is being cut contains only one cluster and should not be cut further. Usually the minimum conductance of a single-cluster graph is much higher than the conductance of a multicluster one, but the magnitude of the difference depends on the graph structure. No global threshold or percentage for the relative increase can be given that would correctly stop the clustering for all graphs at the most natural cluster division.

The workarounds to solving an NP-complete problem are various; Johnson et al. [138], for example, choose a cut with conductance almost as low as the minimum over all cuts (implementational details by Cheng et al. [47]), whereas Carrasco et al. [41] use an exact, fast algorithm of Lang and Rao [160] for finding a cut S′ that is better than S such that S′ ⊂ S (i.e. improving the cut only by removing vertices from the selected subset). Matula and Shahrokhi [172] present an efficient method for finding sparsest cuts for a broad class of graphs. Arora et al. [10] give an $O(\sqrt{\log n})$-approximation algorithm for sparsest cut and conductance.

Shi and Malik [209] obtain a clustering (which in their application is actually a segmentation of an image) by computing the eigenvector associated with the second-smallest eigenvalue of a Laplacian matrix that incorporates edge weights. Using the components of this eigenvector as vertex weights, they search for the smallest normalized cut. If the value of the normalized cut is below a predetermined threshold, the graph is partitioned in two using the cut set that gives the minimum normalized cut. If a partitioning is made, the above process is repeated for the two subgraphs thus created.

He et al. [121] study the normalized-cut method of Shi and Malik [209] and discuss its connections to the k-means algorithm. He et al. restrict to the case of binary edge weights for the k-means method. Applying matrix algebra, they are able to show that what is being optimized in the normalized-cut method is actually the same that is optimized in the iteration of the k-means method, with slight modifications on how neighbouring vertices are weighted. Another method based on matrix algebra is that of Drineas et al. [1] who present a clustering algorithm for large graphs that is based on computing the singular value decomposition of a suitably selected random submatrix.

5.4.2. Maximum flow

As the well-known connection between maximum-flow and minimum-cut problems suggests (see for example [60,83,96]), there exist clustering algorithms based on flow computations [34,41,94,160]. The algorithms for computing maximum flows in graphs (such as that of Goldberg and Tarjan [111]) are efficient and hence such operations are not too costly to be used as subroutines for clustering moderate-size instances. Most flow-based methods take edge weights readily into account; flow computations extend even to cases where the edge capacities are single-parameter functions instead of constants [102]. Flake et al. [95] identify clusters by inserting an artificial sink and calculating flows to that sink. The minimum cuts that correspond to the maximum flows are used to build a minimum-cut tree, as defined by Gomory and Hu [8]. The minimum-cut tree is then used to define what is the cluster of a given vertex v with respect to the artificial sink vertex. The algorithm is designed for undirected weighted graphs, and the weight assigned to all of the edges of the artificial sink vertex is a parameter of the method. It requires some intuition to choose a good value for the parameter. The basic version of the algorithm simply treats all connected components of the min-cut tree after the artificial sink has been removed as clusters, but the authors also present a recursive version that incorporates adjusting the parameter and imposing quality criteria such as desirable cluster order and cluster count.

5.4.3. Spectral methods

When a graph is formed by a collection of k disjoint cliques, the normalized Laplacian (Eq. (14)) is a block-diagonal matrix that has eigenvalue zero with multiplicity k, and the corresponding eigenvectors serve as indicator functions of membership in the corresponding cliques: the elements of the clique have a different value (of larger magnitude) than the other vertices. Any deviation caused by introducing edges between the cliques causes k − 1 of the k eigenvalues that were zero to become slightly larger than zero, and also the corresponding eigenvectors change. However, some of the underlying structure can still be seen in the eigenvectors of the Laplacian even when edges are added to connect the cliques and when some edges are removed from within the original cliques.

This phenomenon is the basis of spectral clustering, where an eigenvector or a combination of several eigenvectors is used as a vertex similarity measure for computing the clusters. For example, clustering and other analysis of the network of the Internet autonomous-system domains has been done with spectral methods [109,178]. A comprehensive introduction to the mathematics involved in spectral graph theory is the textbook of Chung [51]; we also recommend the textbook of Biggs [27]. The dissertation of McSherry [174] provides an overview of the area, also applying it to graph partitioning.

Also other matrices than the Laplacian can be used to compute such spectral measures; if instead of the adjacency matrix of a simple graph, the input is some kind of a similarity matrix for a complete graph, similar computations still may yield good results.
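Returning to the conductance criterion discussed at the beginning of this subsection, the quantity itself is simple to evaluate for a candidate cut. Conventions differ on the normalization; the sketch below uses one common variant (cut size divided by the smaller of the two side volumes) and a representation of our own, namely the unweighted graph as an adjacency dictionary.

```python
def conductance(adj, S):
    """Conductance of the cut (S, V \\ S) in an unweighted graph given as an
    adjacency dictionary {vertex: set of neighbours}.  One common convention:
    cut size divided by the smaller of the two side volumes (sums of degrees)."""
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    return cut / min(vol_S, vol_rest)

# Two triangles joined by one edge: the natural split has low conductance.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(conductance(adj, {0, 1, 2}))    # 1/7, about 0.14
```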
The downside is that computing or even approximating eigenvalues and eigenvectors is not fast for all graphs, and hence such methods may face scalability issues when applied to massive graphs.

Spectral clustering is typically based on computing the eigenvectors corresponding to the second-smallest eigenvalue of the normalized Laplacian or some eigenvector of some other matrix representing the graph structure. Possible matrices include modifications of the adjacency matrix such as the transition matrix of a blind random walk on the graph. The component values of the resulting eigenvector are used as vertex-similarity values to determine the clustering. A recent survey of properties of the second-smallest eigenvalue is given by de Abreu [70].

Spectral methods are in general computationally demanding, although a distributed algorithm for decentralizing the computational load has been proposed by Kempe and McSherry [144]. The idea of the decentralization is that the eigenvector elements of each vertex for the first k eigenvectors can be computed by a separate processor, assuming that all processors are aware of the weights of the incident edges and that messages may be exchanged between the processors that are in charge of neighbouring vertices. The number of messages needed is O(k³) per round of computation, and the number of rounds needed in order to obtain a deviation ε > 0 from the exact eigenvectors is $O(\log^2(n\varepsilon^{-1})\,T)$, where T is the mixing time of the random walk on the input graph. The basis of the decentralized algorithm is the orthogonal iteration method, where random initial vectors are iteratively first multiplied by the adjacency matrix and then orthonormalized. The decentralization is in effect decentralizing the matrix multiplication and the orthonormalization.

How well spectral measures work as separators in clustering is studied by Guattery and Miller [115]. They concentrate on the problem of finding a separator that divides the vertices of the graph into two sets such that the number of edges crossing the boundary is small (if not minimal), using the information contained in the Fiedler vector or possibly combining information from several eigenvectors. Guattery and Miller find that using the Fiedler vector to partition a graph into two equal-sized vertex sets works poorly for a family of bounded-degree planar graphs, and that there exists a family of graphs for which spectral methods in general work poorly.

Qiu and Hancock [198] present a spectral method for clustering graphs based on the Fiedler vector of the graph. Bach and Jordan [15] address the question of clustering a complete weighted graph with a spectral method, also discussing approximations for the eigenvectors. Ng et al. [188] complain that there are several spectral clustering algorithms that all work a little differently with respect to utilizing the eigenvectors and that commonly no proofs are presented regarding the quality of the produced clustering. Although their work addresses the field of clustering points in space, the situation is similar in spectral graph clustering: there is no one canonical way to utilize spectral methods, and even the matrix the spectrum of which is used is not always the same.

A spectral clustering method for directed weighted graphs is given by Capoccia et al. [40]. Their idea is to compute eigenvectors and use the correlations between the elements to determine the cluster structure. Kannan et al. [142] show that in general, spectral methods (or actually, one common variant) find good clusterings. Ding and He [72] propose a spectral method that directly computes k clusters in a complete weighted graph. The problem of determining what value of k to choose – a requisite for the employment of various clustering algorithms – is discussed in Section 7.2.

Goh, Kahng and Kim [110] have studied the spectrum created by the Barabási–Albert generation method for scale-free graphs with two outgoing edges per added vertex. For their studies, they computed the exact spectrum for graphs of up to 5000 vertices and determined the first few of the largest eigenvalues for graphs of order as high as 400,000, which gives a hint on the scalability of spectral methods, although the techniques they used were not the most modern. Theoretical results on the spectra of graphs with defined degree distributions are also available [53]. Saerens et al. [204] discuss the relation between principal components analysis of graphs and spectral clustering.

A graph partition into two sets with few edges between the sets, obtained using the magnitudes (or signs) of the components of an eigenvector (or a combination of eigenvectors), is called a spectral bisection [115]. Spectral measures perform well on such 2-classification tasks [72]. When three clusters are present, spectral information groups two of these together in the sense that the separation between these two and the third one is clear and easily interpreted from the second (and third) eigenvector, but the other two are harder to distinguish [123].

Intuition on how the two-classification works is relatively easy to gain through the behaviour of the Rayleigh quotient (as in Eq. (18)) when the function f(v) is interpreted as an indicator vector: a positive value indicates that the vertex belongs to a cluster C_A and a negative value that it belongs to a cluster C_B. As the edges should be mostly internal to the clusters, almost all differences in the sum of the Rayleigh quotient are zero. Only the edges connecting vertices in C_A with those of C_B contribute to the sum.

Suppose we normalize the vector represented by f(·), for example to have norm n, and consider a vector, other than the vector of all ones, that minimizes the Rayleigh quotient. If such a vector has both positive and negative values, the positive ones get assigned to one of the classes and the negative ones to the other class in order to minimize the ratio. There will be only one "gap" in the values of f(·) and that gap will show the class boundary.

If we, however, wish to perform classification into more than two classes, some number of classes get assigned negative values and the rest positive ones, and the "gaps" of the values assigned to each class will vary, making it much harder to automatically determine the division. The remedy for the multicluster problem is to perform the two-classification iteratively, using the spectra of the resulting induced subgraphs. This will yield a divisive hierarchical clustering algorithm. Spielman and Teng [213] show that such partitioning performs well on bounded-degree planar graphs and finite element meshes. Their analysis is based on a relation of the cut size with the second-smallest eigenvalue of the Laplacian. By showing that for d-dimensional well-shaped meshes this eigenvalue is $O(n^{-2/d})$, they can show that spectral methods can be applied to identifying cuts of size $O(n^{(d-1)/d})$.
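The sign-based split described above is easy to state in code. The following dense-matrix sketch computes the Fiedler vector of the unnormalized Laplacian with numpy and splits the vertices by sign; it illustrates the idea only and is not suitable for large graphs, where iterative eigensolvers would be needed.

```python
import numpy as np

def spectral_bisection(A):
    """Split the vertices by the signs of the Fiedler vector, i.e. the eigenvector
    of the Laplacian L = D - A for its second-smallest eigenvalue.
    Dense-matrix sketch of the idea only."""
    L = np.diag(A.sum(axis=1)) - A
    eigenvalues, eigenvectors = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = eigenvectors[:, 1]
    side_a = [v for v in range(len(A)) if fiedler[v] >= 0]
    side_b = [v for v in range(len(A)) if fiedler[v] < 0]
    return side_a, side_b

# Two triangles {0,1,2} and {3,4,5} connected by the edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u, v] = A[v, u] = 1
print(spectral_bisection(A))   # the two triangles land on opposite sides
```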
Pothen et al. [196] present a heuristic algorithm for minimum bisection using the eigenvectors of the Laplacian matrix.

Brandes et al. [34] propose a clustering algorithm that first computes edge weights using k distinct eigenvectors of the normalized adjacency matrix, associated with the largest eigenvalues less than one. Using these weights, a minimum spanning tree is computed. Partitioning the spanning tree by removing all edges with weight below a certain threshold and considering each connected component of the resulting forest to be a cluster defines a clustering, and for different values of the threshold, different clusterings are obtained. A dendrogram can be constructed by removing the edges one by one starting from the weakest edge.

5.4.4. Betweenness

In order to cluster an unweighted graph G = (V, E), Newman and Girvan [186] impose weights on the edges based on structural properties of the graph G. The idea is based on that of node-betweenness defined by Freeman [100] for use in sociological studies. The weight used by Newman and Girvan [186] is the betweenness of an edge {v, w}, which is the number of shortest paths connecting any pair of vertices that pass through the edge. Freeman in turn studied node-betweenness, which is defined for each vertex as the number of shortest paths in the graph that pass through that vertex. As a computational detail, one should note that there may exist multiple paths of the same length between a given pair of vertices. Hence each of these shortest paths should be accounted for in proportion to their number when computing the betweenness values of the edges that these paths use. If there are k shortest paths connecting v and u, each of them will have weight 1/k in the betweenness calculations of the edges on those paths.

Girvan and Newman [107,186] assume edges with high betweenness to be links between clusters instead of internal links within a cluster: the several shortest paths passing through these edges are the shortest paths connecting the members of one cluster to those of another. Hence they split the network into clusters by removing one by one the edges with high betweenness values. If more than one edge has the highest betweenness value, one of them is chosen randomly and removed. The removal is followed by recalculation of the betweenness values, as the shortest paths have possibly been altered. This gives a clustering algorithm polynomial in n and m.

Straightforward algorithms for computing the betweenness of an edge operate in O(n · m) time. Brandes [33] proposes algorithms for betweenness-centrality computations that require O(n + m) space. The running time for his unweighted version is O(nm) and the weighted version runs in O(nm + n² log n) time. He solves for each vertex once a single-source shortest-path problem, with small modifications to either Dijkstra's algorithm or the breadth-first search algorithm (see for example [60] for shortest-path algorithms). Newman proposes a method based on random walks rather than exact betweenness-value computation [181]. Comellas and Gago Álvarez [58] derive bounds on the betweenness values using the spectrum of the graph.

Fortunato, Latora and Marchiori [97] propose a hierarchical method closely related to the betweenness-method of Newman and Girvan, but instead of the betweenness values, they use information centrality [162], which is defined for each edge as the relative decrease in the average efficiency of the graph upon the removal of the edge. Latora and Marchiori [161] define the efficiency of a pair of distinct vertices v, u ∈ V as the inverse of their distance in the graph, 1/dist(v, u), and the average efficiency of the graph G is defined as the average of the individual efficiencies over all n(n − 1) ordered pairs of distinct vertices.

The trick with these hierarchical, edge-removal-based clustering methods is to decide when to stop the partitioning, just as was the case with cut-removal methods. Newman [185] proposes computing a quality measure called modularity over the entire clustering at each iteration and stopping when there is no improvement. Many formulations of essentially the same measure exist, depending on whether the graph is weighted and whether minimization or maximization is used. Modularity is in general defined for weighted graphs, where the weights represent some application-specific attributes. For unweighted graphs, one can simply set ω(v, w) = 1 for all edges to obtain a working definition of modularity.

In terms of the edge weights, modularity M(C_1, . . . , C_k) is defined over a specific clustering into k known clusters C_1, . . . , C_k as

M(C_1, \ldots, C_k) = \sum_{i=1}^{k} E_{i,i} \;-\; \sum_{i \neq j,\; i,j \in \{1,\ldots,k\}} E_{i,j},    (45)

where

E_{i,j} = \sum_{\{v,u\} \in E,\; v \in C_i,\; u \in C_j} \omega(v, u),    (46)

with each edge {v, u} ∈ E included at most once in the computation. Defining the internal and external degrees in terms of these modularity measures gives

\deg_{int}(C_i) = E_{i,i} \quad\text{and}\quad \deg_{ext}(C_i) = -E_{i,i} + \sum_{j=1}^{k} E_{i,j}.    (47)

Newman [180] also presents a formulation of modularity in matrix form, using the spectrum of the k × k modularity matrix, where the elements are the values E_{i,j}. Newman also uses the spectral information to derive centrality measures for the vertices of the input graph.

Note that as modularity directly incorporates the number of internal edges per cluster, the orders of the clusters tend to have an effect: small clusters in simple graphs may only contribute a few internal edges. Danon et al. [66] provide a modification of a modularity-based algorithm of Newman [185] to accommodate clusters of varying orders without slowing down the computation.

5.4.5. Voltage and potential

Electrical circuits provide reasonable intuition for graph clustering: think of the graph as a circuit that has a unit resistor on each edge. Calculate the potentials at all of the vertices (i.e. the voltages for all the edges), and then cluster the vertices according to the resulting potentials.
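The electrical intuition can be made concrete with a small linear-algebra sketch. The code below is an illustration of the idea only (it is not the tool whose output Fig. 7 shows): it assumes that two pole vertices are chosen by hand, fixes their potentials at 1 and 0, solves the Kirchhoff equations for the remaining vertices with numpy, and splits the vertices by thresholding the resulting potentials.

```python
import numpy as np

def potentials(A, source, sink):
    """Potentials of a unit-resistor network: fix the source at 1 V and the sink
    at 0 V, and solve the Kirchhoff equations L x = 0 at the remaining vertices.
    Dense sketch of the electrical intuition described above."""
    n = len(A)
    L = np.diag(A.sum(axis=1)) - A
    x = np.zeros(n)
    x[source] = 1.0
    free = [v for v in range(n) if v not in (source, sink)]
    # interior equations: L[free, free] x_free = -L[free, source] * 1.0
    b = -L[np.ix_(free, [source])].flatten()
    x[free] = np.linalg.solve(L[np.ix_(free, free)], b)
    return x

# Two triangles joined by the edge (2, 3); poles 0 and 5 sit on opposite sides.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u, v] = A[v, u] = 1
volt = potentials(A, source=0, sink=5)
print([v for v in range(6) if volt[v] >= 0.5],
      [v for v in range(6) if volt[v] < 0.5])   # [0, 1, 2] [3, 4, 5]
```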
Fig. 7 – On the left, the definition of the circuit of Fig. 6 in SPICE syntax, given as input to the DMCS-SPICE Java Applet [218].
The transient analysis, tracing the voltages at each vertex, gives the figure on the right, where the two clusters are clearly
separated. The voltage vector is (−1.93, −2.06, −2.06, −2.19, −2.72, −2.82, −2.82, −2.82, −2.93).
Fig. 9 – Similarity matrices for a 75-vertex graph with 900 edges, composed of six clusters constructed by computing the
Fiedler vector for each vertex. On the left, exact vectors were used, and on the right, a locally computable approximation of
Orponen and Schaeffer [191] for the Fiedler vector was used.
Naturally, in addition to dividing the graph top-down into clusters, one may also work bottom-up, merging singleton sets of vertices iteratively into clusters. Such methods are called agglomerative clustering algorithms. Typically a similarity measure is used to select the vertices to be merged into a cluster. The similarity measure could be based, for example, on the relative overlap of the vertex neighbourhoods or some semantic-based value. Possible similarity measures are addressed in Section 4.1. Sometimes merges are limited to assigning a singleton vertex into a cluster, sometimes also larger clusters may be merged; the former case aims at a single flat clustering, the latter at a hierarchy of clusterings. As with divisive methods, iterative merging typically continues until some threshold or a desired number of clusters is reached.

An approach that has its roots in the clustering of point sets is to begin grouping the vertices into clusters by forming a two-vertex cluster from the two most similar vertices. This is very similar to what is often done in the case of clustering points in space [135]. The intuition is that at least the two closest points should be placed in the same cluster with each other. Such merging then continues until only a desired number of clusters remains or another stopping condition is met. At each iteration, one picks the two clusters (singletons or larger) that have the highest similarity value to be merged. Note that a function must be defined for determining the similarity of vertex subsets of different orders.

This approach is generally known as the pairwise nearest neighbours method and the merging criterion in general is based on greedy optimization of an objective function. The method is studied in the dissertation of Virmajoki [99].

Agglomerative clustering algorithms include that of Carrasco et al. [41] for bipartite graphs and that of Hopcroft et al. [128] for general graphs. Du [78] first clusters the graph into initial clusters using information on vertex degrees and then agglomeratively combines the initial clusters until an agreeable clustering is achieved. Clauset et al. [56] present a modularity-based method that runs in O(n log² n) time in practice for sparse natural graphs. The method performs greedy optimization of modularity (Eq. (45)) similarly to the method of Newman [185]: Clauset et al. greedily maximize the ratio of internal edges for cluster members that have nonzero external degree.

Another modularity-optimizing approach is presented by Donetti and Muñoz [74], who perform agglomerative clustering using spectral properties to construct the full cluster hierarchy and then select a clustering from the resulting tree maximizing modularity. The idea is to first let the hierarchical clustering algorithm create the entire dendrogram for the data and then optimize modularity over all possible sets of dendrogram "nodes". As the chosen nodes represent vertex sets, candidate sets of chosen nodes must form a cover of the vertex set. The set of nodes that gives the highest modularity is chosen to be the final, flat clustering.

For large graphs, global clustering becomes computationally demanding. For massive data sets, the running time of a clustering algorithm should not grow faster than O(n) in order to be scalable; sublinearity is strongly preferable. As for memory consumption, storing the complete edge set, which for dense graphs has size O(n²), is also often infeasible. There are applications where the input cannot possibly be read into the main memory at once and the computational cost of swapping the memory contents may prove critical. For large enough graphs, even sparsity does not help much: for example, the World Wide Web has billions of vertices and many more edges, setting it out of reach of the global algorithms.

However, if the graph is stored in a format that allows access to connected subgraphs or adjacency lists of nearby vertices, ideas similar to agglomerative clustering can be applied: clusters can be computed one at a time based on only partial views of the graph topology. This is called local clustering. An example of a data structure allowing local access to adjacency information is a search tree of adjacency lists with vertex identifiers as keys.

Additional motivation for local clustering methods comes from large networks that are not explicitly available, but rather require on-demand generation or exploration with a crawler, such as the programs that are used to index the World Wide Web for search-engine construction [50,221].

For many applications, computing the desired answer by a clustering algorithm only requires a small subset of vertices to be clustered instead of the whole graph. Such tasks include locating documents or genes closely related to a given "seed" data set. The scalability problem of global clustering is avoided, as the graph as a whole does not need to be processed unless a single cluster contains nearly the entire graph. Also, clusters for different seeds may be simultaneously obtained by parallel computation.

In this section we study local approaches for finding a good cluster containing a specified seed vertex or a set of vertices by examining only a limited number of vertices at a time, proceeding in the "vicinity" of the seed vertex. We denote by C(v) the cluster of vertex v, that is, the resulting cluster when using v as the seed vertex.

An application-specific detail that arises in local clustering is whether the clustering should be symmetrical: that is, if vertex v belongs to the cluster C(u), should u necessarily belong to the cluster C(v)? For example, networks of social contacts tend to be asymmetrical: you may consider someone to be your acquaintance while that person does not remember having met you before. Similarly you may be considered a friend by someone you would think a mere acquaintance. Directed input graphs can be expected to allow more natural clusterings when symmetry is not required.

It is noteworthy that local clustering algorithms may be used to obtain a global clustering of the entire input graph [17,54,205]: the options include, for example, initiating the procedure n times using each vertex as the seed vertex once and applying some majority-vote rule or a quality measure to combine the local clusters into a global clustering.
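Going back to the pairwise nearest-neighbours scheme described earlier in this section, a toy version fits in a few lines. The sketch below uses the relative overlap of vertex neighbourhoods as the similarity between vertex sets (one of the options mentioned above) and merges greedily until a requested number of clusters remains; all names are our own and the quadratic search over cluster pairs is not meant to scale.

```python
def jaccard(adj, A, B):
    """Similarity of two vertex sets: overlap of their combined neighbourhoods."""
    na = set().union(*(adj[v] for v in A)) | A
    nb = set().union(*(adj[v] for v in B)) | B
    return len(na & nb) / len(na | nb)

def agglomerate(adj, k):
    """Greedy pairwise-nearest-neighbour merging: start from singletons and
    repeatedly merge the two most similar clusters until k clusters remain."""
    clusters = [frozenset([v]) for v in adj]
    while len(clusters) > k:
        best = max(((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
                   key=lambda pair: jaccard(adj, pair[0], pair[1]))
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters

# Two triangles joined by one edge are recovered as the two clusters.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(agglomerate(adj, 2))
```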
Also computing a fuzzy global clustering from a set of locally determined clusters is an option.

Another approach for deriving a global clustering through a local method is to select seed vertices according to some preference rule and to exclude the local clusters found from further clustering, hence limiting the number of times the local procedure is executed. The computation for clusters with different seeds may in some cases be near-trivially parallelized, as the clusters are formed independently, as in the case of the Fiedler-vector method of Orponen and Schaeffer [191].

6.1. Definition of locality in graphs

In order to distinguish local computation from global computation, we must define what information about an input graph we consider to be locally available. For a single seed vertex s, we assume that at least the adjacency list containing the identifiers of the vertices in Γ(s) is known and that the algorithm may "crawl" into any of the neighbouring vertices. A wider definition would also allow direct access from a vertex v to its second neighbours:

\bigcup_{v \in \Gamma(s)} \Gamma(v).    (48)

Optionally we can allow knowledge of the degrees of the vertices in Γ(s). However, such information could also be obtained by on-demand local crawling from s.

We also allow a local algorithm to remember any adjacency information that has already been seen, although in practice restrictions are posed on the amount of memory available. Therefore, once a local algorithm has computed a candidate cluster C for s, it knows the list of vertices included in C, any edge internal to C, as well as a list of border vertices directly adjacent to the cluster (i.e. vertices that are neighbours of at least one included vertex but are not themselves included in C).

For edge weights, the weight of each edge that has at least one endpoint in the subgraph is considered locally available. For vertex weights, the weights of the included vertices and their immediate neighbours are considered locally available.

6.2. Local search

Local search methods are heuristic and/or probabilistic algorithms designed to find near-optimal solutions among large, complex sets of solution candidates. These methods aim not to explore the entire solution space but rather to study, possibly with probabilistic decision-making, a limited region that contains at least good if not the best solutions. The extent to which the input graph is traversed depends on the local search method applied.

Each solution candidate is represented by a state, and the set of states is called the state space. A neighbourhood relation is defined over the space of possible solutions, with the goal of examining solution candidates one by one and then moving to a neighbouring candidate. The neighbourhood relation should be such that the search may navigate from one state to another with light computation. One also needs to define a fitness function that measures the quality of the solution represented by a state. The computational cost of evaluating the function should be small or at least moderate, as during the course of a search it will be repeatedly evaluated for different solution candidates.

The rule for choosing the next state to proceed to may be heuristic, which means that a fitness function is evaluated for all neighbours and the outcome is used to choose to which neighbour the search will proceed. Always moving to the neighbour with the best fitness is a greedy strategy. The selection may also involve a probabilistic element, for example proceeding to each neighbour with a probability proportional to the value of the fitness function. While proceeding through the search space, the search algorithm always remembers at least the best state visited and the associated fitness.

Typically a limit is imposed on the number of steps the search may take. The search terminates either when it encounters a solution that has the best theoretically possible fitness (ideally corresponding to having found an exact optimum) or when it reaches the step-count limit. The solution candidate yielding the best fitness value is the output of the local search procedure. The search may be iterated several times in order to cover more of the state space. For deterministic heuristics, one should use random initial states to explore different parts of the search space, but for probabilistic heuristics, also fixed initial states work.

Some common local search procedures are hill-climbing, deterministic and probabilistic versions of tabu search, and simulated annealing. Care must be taken in guiding the search in the neighbourhood: the presence of local optima is likely, and the search must be guided in such a manner that it is possible to "escape" a local optimum with reasonable effort. For this purpose, simulated annealing allows the search to proceed to a lower-fitness neighbour in the search space with a probability that decreases over time as the search proceeds. The speed at which the probability decreases is controlled by two parameters: the initial temperature and a cool-down coefficient; the aim is to mimic cooling in metals. In addition, one fixes the number of iterations to be computed and the number of steps taken per each iteration. The parameters are usually chosen by running some initial experiments and choosing a parameter set that gives promising results [155].

Graph partitioning by simulated annealing has been studied by Johnson et al. [137], comparing it to the Kernighan–Lin algorithm [145], which is a well-known method for partitioning weighted graphs with respect to the edge weights. A similar approach for points in space has been suggested by Klein and Dubes [149], who compared the clusterings achieved with simulated annealing to those of a k-means algorithm. Booth et al. [29] study data partitioning with another stochastic search method, namely the Metropolis–Hastings algorithm [6]. Felner [91] uses heuristic search to solve the graph partitioning problem using heuristics for estimating the size of the optimal partition based on graph structure.

Schaeffer [205] computes with simulated annealing the cluster C(s) of a single, given seed vertex s, only considering all possible clusters in which s can be assigned.
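A bare-bones version of such a single-seed annealing search is sketched below. It is a generic illustration under parameter names of our own, not the implementation of [205]; in particular, unlike the method described above, it does not restore connectivity after a removal. Any locally computable fitness measure of the kind discussed in Section 6.3 can be plugged in; the example uses the local density of the candidate cluster.

```python
import math
import random

def anneal_cluster(adj, seed, fitness, steps=2000, temp=1.0, cooling=0.995):
    """Simulated-annealing local search for a cluster around `seed`.
    Moves either add a boundary vertex or drop a non-seed member; a move that
    lowers the fitness is still accepted with probability exp(delta / temp)."""
    current = {seed} | set(adj[seed])
    best, best_fit = set(current), fitness(adj, current)
    for _ in range(steps):
        candidate = set(current)
        boundary = sorted({u for v in current for u in adj[v]} - current)
        removable = sorted(current - {seed})
        if boundary and (not removable or random.random() < 0.5):
            candidate.add(random.choice(boundary))          # grow the candidate
        elif removable:
            candidate.remove(random.choice(removable))      # shrink the candidate
        delta = fitness(adj, candidate) - fitness(adj, current)
        if delta >= 0 or random.random() < math.exp(delta / temp):
            current = candidate
            if fitness(adj, current) > best_fit:
                best, best_fit = set(current), fitness(adj, current)
        temp *= cooling
    return best

def local_density(adj, C):
    # fraction of possible intra-cluster pairs that are actually edges
    if len(C) < 2:
        return 0.0
    internal = sum(1 for v in C for u in adj[v] if u in C) // 2
    return 2 * internal / (len(C) * (len(C) - 1))

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(anneal_cluster(adj, seed=0, fitness=local_density))  # typically {0, 1, 2}
```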
This reduces the size of the search space significantly: instead of having to construct a global clustering of all the vertices in the input graph, one only needs to determine for each vertex whether it is included in C(s) or not. For such a procedure to be local, the fitness function used should be locally computable in the sense that the fitness of a cluster candidate should not depend on global properties of the graph.

In initiating the local search procedure, Schaeffer [205] uses a fixed initial state where the cluster candidate is formed by the seed vertex v itself along with its neighbourhood Γ(v). The neighbourhood relation is formed by allowing two kinds of operations: the addition of an adjacent vertex and the removal of an included vertex. Upon the removal of u ∈ C(v), u ≠ v, she ensures connectivity by setting the connected component containing v to be the next cluster candidate. The approach generalizes to a variety of initial states, fitness functions, and heuristics.

For a good review on local search methods, we recommend the book by Michalewicz and Fogel [177] and also the book by Aarts and Lenstra [2]. Stochastic search is by no means limited to local computation. For moderate-size instances, one may define as the search space the set of all possible clusterings and locally optimize a clustering quality measure by moving in the space of all possible clusterings.

6.3. Fitness functions

Scaling the internal degree deg_int(v, C) of a vertex v by the maximum number of neighbours that the vertex could have in C, namely |C| − 1, gives a measure in [0, 1]:

\delta(v, C) = \frac{\deg_{int}(v, C)}{|C| - 1}.    (50)

This measure indicates how densely v is connected to C and it should give a high value if C is a good cluster for v. We also want to make sure that the vertex is not densely connected to other parts of the graph, and hence define a measure in [0, 1] for vertex introversion, namely the ratio of internal edges to all edges incident on v:

\rho(v, C) = \frac{\deg_{int}(v, C)}{\deg(v)}.    (51)

If both of the above measures have a high value, we can assume v to be correctly classified into C. If either one is low, it would be worthwhile to try reassigning v to some other cluster.

The quality of a given cluster can be evaluated on the basis of the suitability of the included vertices; a possible measure for cluster density would be a scaled sum of the vertex densities of Eq. (50):

\delta_s(C) = \frac{1}{|C|} \sum_{v \in C} \delta(v, C) = \frac{1}{|C|\,(|C| - 1)} \sum_{v \in C} \deg_{int}(v, C).    (52)
and Eq. (54) above. A cluster is properly introvert if Eq. (54) has a high value and the capacity of the cut is low.

Relative density favours introvert clusters, i.e. subgraphs with few connections to other parts of the graph. Introversion measures are, however, optimized for any connected component, in which all edges are by definition internal, yielding zero for cut capacity and one for relative density as well as for the summation of Eq. (54). This imposes restrictions on their usage as fitness functions, as a local search method would prefer selecting any connected component as a cluster even if it would allow intuitively pleasing divisions into smaller clusters.

Bagrow and Bollt [17] grow a cluster candidate by breadth-first search level by level (always adding all the neighbours of presently included vertices at once), optimizing the emerging degree of the cluster candidate, which is effectively the external degree. A threshold is used to determine when to stop growing the cluster further, in order to avoid including the entire connected component of the start vertex. Also Clauset [54] uses a measure similar to relative density. The local clustering algorithm of Clauset greedily optimises the fraction of the internal edges of boundary vertices only, i.e. vertices v ∈ C such that deg_ext(v, C) > 0.

One possible interpretation of the relative density is as follows: consider a global, partitional clustering of G = (V, E) into clusters C_1, . . . , C_k. Evidently

\sum_{i=1}^{k} \left( \deg_{int}(C_i) + \deg_{ext}(C_i) \right) = m + \sum_{i=1}^{k} \deg_{ext}(C_i),    (56)

as every external edge has endpoints in exactly two clusters. Now, for a clustering to be of high quality in terms of introversion, as m is a constant, we are interested in minimizing \sum_{i=1}^{k} \deg_{ext}(C_i), which means that out of all clusterings into k clusters, one clustering is better than another if any two clusters have a smaller external degree whereas the external degrees of the others remain unaltered. Note that modifying just a single cluster is not possible, as a removed vertex must be included into another cluster. The computation is even more tedious if the number of clusters is allowed to vary.

Hence, to approximate this global optimum, each cluster may locally attempt to minimize its own deg_ext(C_i); as the cluster should also attempt to be the maximal-order cluster with the minimal external degree, it should favour higher values of deg_int(C_i) over lower ones, meaning that it attempts to maximize deg_int(C_i) while minimizing deg_ext(C_i), which can be directly achieved by maximizing the ratio deg_int(C_i)/deg_ext(C_i). This measure, however, can take arbitrary positive values over connected cluster candidates and may result in division by zero in the absence of external edges. In order to scale it to values in [0, 1] and avoid division by zero, we add to the denominator the value of the numerator, which yields exactly Eq. (43).

The relative density is the probability that a randomly chosen edge incident on the cluster is an internal edge, whereas the local density can be interpreted as the probability that two randomly chosen cluster members are connected by an edge. In a good global clustering, when picking an edge uniformly at random, we would like the probability that it is internal to a cluster to be high. Also, we would like the probability that two vertices that are in the same cluster are connected to be high, interpreting strong connectivity as an indicator of vertex similarity.

The cluster fitness function used by Schaeffer [205,206] is the product of the local (Eq. (53)) and relative (Eq. (43)) densities,

F(C) = \delta(C) \cdot \rho(C) = \frac{2\,\deg_{int}(C)^2}{|C|\,(|C| - 1)\,\left(\deg_{int}(C) + \deg_{ext}(C)\right)}.    (57)

It is but one of the many possible combinations of the local and relative density measures.

7. Comparison, evaluation and benchmarking

For traditional methods of clustering points in space, clusters that are of different orders or shapes often produce difficulties, as do clusters that overlap each other [98]. Similarly in graph clustering, when the clusters are of different orders and have varying densities, global methods tend to run into difficulties in correctly classifying them.

Properties of good clusterings are discussed by Kleinberg [150]. He defines an axiomatic framework for clustering a data set S = {1, 2, . . . , n} of "abstract points" using the notion of a clustering function f that takes as a parameter a distance function d : S × S → R and returns a partition of the data set into clusters based on the distance function d. The distance function must be such that all reflexive distances are zero and all other distances are positive and symmetrical. The triangle inequality is left as an option instead of being required. Kleinberg lists three desirable properties of a clustering function:

1. Scale-invariance: given any distance function d and a constant α > 0, it should hold that multiplying all distances by α does not change the clustering,
2. Richness: the range of f is the set of all partitions of S, meaning that the function is capable of producing any of the possible partitions of the data set S given the appropriate distance function d,
3. Consistency: let P = (C_1, C_2, . . . , C_k) be a partition of S given a distance function d, and let d′ be another distance function such that for all i, j ∈ S
   • if i and j belong to the same cluster C_ℓ of the partition P, it applies that d′(i, j) ≤ d(i, j), and
   • if i and j belong to different clusters of P, it applies that d′(i, j) ≥ d(i, j);
   then f(d′) = P, meaning that no modification to the distance function that never lengthens an intracluster distance and never shrinks an intercluster distance should cause the clustering to change.

The theorem of Kleinberg [150] is that for n ≥ 2 no clustering function f exists that satisfies all of the above properties 1, 2, and 3. Unfortunately, these properties do not translate directly into graph clustering in general. We may apply them all to the scenario of clustering a complete weighted graph where the weights are assigned by the distance function d and the vertex set is S, but when not all edges are present or when the graph is unweighted, it is not straightforward to fill the role of the distance function in the definitions of Kleinberg [150].
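Returning to the cluster fitness function of Eq. (57), it can be evaluated for a candidate cluster as follows; the sketch assumes an unweighted graph stored as an adjacency dictionary and is meant only to illustrate the formula.

```python
def cluster_fitness(adj, C):
    """Eq. (57): product of the local density 2*deg_int(C) / (|C| (|C|-1)) and
    the relative density deg_int(C) / (deg_int(C) + deg_ext(C))."""
    internal = sum(1 for v in C for u in adj[v] if u in C) // 2  # each edge twice
    external = sum(1 for v in C for u in adj[v] if u not in C)
    if len(C) < 2 or internal + external == 0:
        return 0.0
    return (2 * internal ** 2) / (len(C) * (len(C) - 1) * (internal + external))

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(cluster_fitness(adj, {0, 1, 2}))      # 0.75: a dense, introvert cluster
print(cluster_fitness(adj, {0, 1, 2, 3}))   # lower: vertex 3 mostly looks outward
```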
candidate: richness could be interpreted as the capability In many cases, the graph that was clustered was based on
of producing all clusterings while choosing an appropriate some data set, all information of which was not completely
edge set E, consistency in the sense that adding intra-cluster utilized in the construction of the graph. Carrasco et al. [41]
edges and/or removing intercluster edges should not change also use that additional information to evaluate the quality of
the clustering, but scale-invariance in the absence of edge clusterings obtained by different methods.
weights is more challenging without modifying the vertex set Determining whether one clustering algorithm is better
as well. For a general weighted graph, scale-invariance is the than another would be simplified were there a canonical
easiest property to check: the clustering should not change if set of benchmark cases, i.e. graphs for which a “correct”
all weights are multiplied by a positive constant. clustering is known. Typically used graphs include the
In practise, a question easier than “is this a good karate club social network of Zachary [244] and other social
clustering” is “which of these two clusterings is better”. With networks for which a semantic division into clusters is known
flat clusterings, we may define comparison measures such beforehand (such as known research groups in scientific
as overlaps or agreements between two clustering. Such collaboration networks). However, as clustering tends to be
measures are straightforward to define iterating over the rather application specific, comparing any two algorithms
vertices; for example, if clustering Ci (1), Ci (2), . . . , Ci (n) and not always makes sense, as the motivation and intended
clustering Cj (1), Cj (2), . . . , Cj (n) have a value close to one for application areas differ.
Problems arise in the evaluation task especially when
1 X Ci (v) ∩ Cj (v) a global clustering needs to be compared with a local
, (58) clustering, as global clusterings as often partitions or at least
n
v∈V Ci (v) ∪ Cj (v)
symmetrical whereas the question posed by local clustering
in a sense the clusterings agree well. However, the measure allows for covers and asymmetrical cluster-membership
does not behave well if the clusters of one clustering are in relations.
fact subclusters of the clusters of the other clustering. For
moderate-size graphs, visualization again helps, for example, 7.1. The parameter jungle
by colouring the vertices according to the clusterings. When
comparing two hierarchical clusterings, more complicated Typically, clustering algorithms have at least a few param-
schemes are needed to evaluate to which extent the two eters. In comparing the output of different algorithms, one
divisions agree, especially if the dendrograms are of different needs to choose the parameters of the two algorithms under
heights. comparison fairly. This is not always trivial, as the resulting
Two flat clusterings can also be compared, or the quality of a single clustering evaluated, by examining the adjacency matrix of the graph ordered by clusters — an example of an adjacency-matrix visualization was given in Fig. 1. For comparison, first order the matrix by placing the vertices in an order that follows the first clustering and see if the second clustering also produces a near block-diagonal structure, and then repeat, ordering by the second clustering. For evaluating the cluster quality, such visualization helps reveal the presence of dense clusters. Mathematically this could be achieved, for example, by calculating the distance of each element that has the value one to the diagonal in the adjacency matrix — the smaller the value, the better the clustering. Block-diagonalization has been utilized in relation to clustering by Schaeffer [206] and Carrasco et al. [41].
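A minimal sketch of the distance-to-diagonal idea mentioned above, assuming the graph is given as an edge list and the clustering as a list of vertex sets covering every vertex (the helper below is illustrative and not taken from the cited works):

    def block_diagonal_score(edges, clusters):
        """Average distance of the nonzero adjacency-matrix entries from the
        diagonal once the vertices are ordered cluster by cluster; the
        smaller the score, the more block-diagonal the matrix looks."""
        position = {}
        for cluster in clusters:
            for v in sorted(cluster):
                position[v] = len(position)
        distances = [abs(position[u] - position[v]) for (u, v) in edges]
        return sum(distances) / len(distances)

    edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
    print(block_diagonal_score(edges, [{1, 2, 3}, {4, 5}]))   # 1.2
    print(block_diagonal_score(edges, [{1, 4}, {2, 3, 5}]))   # 2.2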
Another option is to compute vertex similarity measures (such as those of Section 4.1) within and across clusters — a clustering is good if the intracluster similarities are high and the intercluster similarities low. Similarly, one could use cluster quality measures (such as those of Section 4.2) and prefer a clustering that has higher overall quality. The vertex similarities can also be used to construct a minimum spanning tree of the graph, using the inverse of the similarity as a distance measure. A good clustering is such that each cluster corresponds to a connected subtree of the minimum spanning tree [245]. An alternative use of this observation, other than using it to evaluate the quality of a given clustering, is to cluster the spanning tree instead of the graph as a whole, as this usually leaves a large portion of the edges out of consideration and eases the computation.
The quality of a clustering can also be assessed against benchmark instances for which a correct clustering is known. Typically used graphs include the karate club social network of Zachary [244] and other social networks for which a semantic division into clusters is known beforehand (such as known research groups in scientific collaboration networks). However, as clustering tends to be rather application-specific, comparing any two algorithms does not always make sense, as their motivations and intended application areas differ.

Problems arise in the evaluation task especially when a global clustering needs to be compared with a local clustering, as global clusterings are often partitions, or at least symmetrical, whereas the question posed by local clustering allows for covers and asymmetrical cluster-membership relations.

7.1. The parameter jungle

Typically, clustering algorithms have at least a few parameters. In comparing the output of different algorithms, one needs to choose the parameters of the two algorithms under comparison fairly. This is not always trivial, as the resulting clustering may depend heavily on the parameter values chosen. Therefore, before defining quality indices, we address the problem of parameter selection.

The purpose of the parameters is to attempt to overcome difficulties caused by structural properties inherent in the data set, such as varying densities. Determining the optimal values for the parameters is usually nontrivial or even impossible, and the methods may be highly sensitive to the choice of parameter values. A common parameter is the number of clusters to compute. When clustering data such as speech or handwritten characters, aiming to identify which sounds in the speech correspond to the same phoneme or which characters of the writing correspond to the same letter, the number of clusters is determined by the number of phonemes or the size of the alphabet. In many cases, however, the user will not have any a priori information on the number of clusters. For example, when clustering a social network based on phone-call data, knowing the number of callers and calls made does not give any concrete information on how many clusters the graph could be expected to form. The use of methods requiring the number of clusters as a parameter is more straightforward when the user has at least some information on the range of possible cluster counts.

A problem related to choosing the number of clusters is that many algorithms implicitly assume that the clusters should be of similar orders, even though this is not necessarily the case in real-world data. The problem may be avoided either by local clustering methods, where the size of the other clusters present plays no role, or by resorting to methods that have been designed to find clusters of different orders.

Modularity M(C_1, . . . , C_k) of a clustering, as defined in Eq. (45), evaluates a related property for weighted graphs. The higher the modularity, the better the clustering, as for a high-modularity clustering the total weight of intracluster edges is large and the total weight of intercluster edges is small. Modularity is in essence the graph-theoretical equivalent of minimizing the sum of squares of distances within clusters and maximizing it between the clusters for a clustering of a set of points in space [135], closely related to the Davies–Bouldin index [69].

Danon et al. [67] compare several graph clustering methods in terms of their sensitivity to changes in the input data and their running time, using modularity as a quality measure. They conclude that the most accurate methods are computationally expensive, but that it depends on the application whether speed or accuracy is more crucial. The intuition of intercluster sparsity combined with intracluster density has also been used by Brandes et al. [34], both with modularity-like formulations and conductance-based notions, to evaluate the performance of clustering algorithms.
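Since Eq. (45) is defined earlier in the survey and is not repeated here, the sketch below uses the common Newman–Girvan formulation of modularity for a weighted graph (the intracluster weight fraction of each cluster minus the squared fraction of edge ends incident to it), which may differ from Eq. (45) in normalization details; the edge-list representation is chosen only for illustration:

    def modularity(weighted_edges, clusters):
        """Newman-Girvan-style modularity of a clustering.

        `weighted_edges` is a list of (u, v, w) triples and `clusters` a list
        of vertex sets covering every vertex; high values mean heavy
        intracluster and light intercluster edge weight."""
        index = {}
        for c, cluster in enumerate(clusters):
            for v in cluster:
                index[v] = c
        total = float(sum(w for (_, _, w) in weighted_edges))
        internal = [0.0] * len(clusters)   # weight of edges inside cluster c
        incident = [0.0] * len(clusters)   # weight of edge ends touching c
        for u, v, w in weighted_edges:
            if index[u] == index[v]:
                internal[index[u]] += w
            incident[index[u]] += w
            incident[index[v]] += w
        return sum(internal[c] / total - (incident[c] / (2.0 * total)) ** 2
                   for c in range(len(clusters)))

    edges = [(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0), (3, 4, 1.0), (4, 5, 1.0)]
    print(modularity(edges, [{1, 2, 3}, {4, 5}]))   # 0.22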
Other measures for evaluating a single cluster are distance measures such as the average or maximum distance (i.e. length of the shortest path) within the included vertices, which should be small for a good cluster. These measures are useful if a method returns two candidate clusters and only one is to be chosen. A clustering fitness measure used by Wu et al. [237] compares the differences in average path lengths of the original graph and a graph where each cluster is contracted into a single vertex, with distances calculated by having that single vertex represent all of its member vertices. This measure is called the distortion of the graph geodesics. An idea applied to clustering points in space that also gives insight into graph clustering is to use the sum of cluster diameters as a quality measure, preferring clusterings that yield smaller sums [44]. The motivation behind diameter-based clustering is that cluster members should be structurally close to each other, and hence connected by short paths. Also, it would be desirable that the diameters of the individual clusters were clearly smaller than the diameter of the graph as a whole.
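The sum-of-diameters measure described above is straightforward to evaluate on an unweighted graph. A sketch, assuming the graph is given as an adjacency list (a dictionary of neighbour sets) and restricting the shortest paths to stay inside the cluster under consideration:

    from collections import deque

    def cluster_diameter(adjacency, cluster):
        """Longest shortest-path distance between any two cluster members,
        using only intermediate vertices that belong to the cluster."""
        worst = 0
        for source in cluster:
            dist = {source: 0}
            queue = deque([source])
            while queue:
                u = queue.popleft()
                for v in adjacency[u]:
                    if v in cluster and v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            if len(dist) < len(cluster):       # the cluster is disconnected
                return float('inf')
            worst = max(worst, max(dist.values()))
        return worst

    def sum_of_diameters(adjacency, clusters):
        return sum(cluster_diameter(adjacency, c) for c in clusters)

    adjacency = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
    print(sum_of_diameters(adjacency, [{1, 2, 3}, {4, 5}]))   # 1 + 1 = 2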
Boutin and Hascoet [30] discuss and compare different cluster quality indices with respect to different instances and different clustering methods. They find that many of the quality indices are difficult to interpret and compare. One problem is that one should know what measures the algorithms under comparison are directly optimizing, as using the same measures as quality indices is not very informative: if method A optimizes measure f and method B optimizes measure g, one needs some intuition on what combinations of the values of f and g are desirable for good clusters. Preferably one should also assess the clusterings with quality measures other than f and g. This effectively brings us back to posing the question "what is a good cluster" that already served as a starting point for designing or choosing a clustering algorithm!

The integrated classification likelihood is a measure that can be used to choose the number of clusters [25,26]. It is based on assuming the data to follow a finite mixture model. Mixture models are statistical models for classification that deal with the probability that a given element belongs to a certain class. Model-based clustering aims to "recover" the classification in the data assuming the data to follow a finite mixture model, similarly to the planted partition problem [248]. Also Fraley and Raftery [98] discuss likelihood methods for determining the number of clusters.

Such definitions rely on the notion of likelihood, which is the hypothetical probability that the observation made would have been generated in a certain way and not in any of the other possible ways within a finite set. In the context of clustering, the goal is to estimate the likelihood that a given sample belongs to a certain cluster. Several likelihood-based formulations exist for evaluating the quality of a given clustering assuming that the input data was generated under a specific probabilistic model.

For points in space, the aforementioned Davies–Bouldin index [69] is commonly used to choose the number of clusters: one repeats the same algorithm varying the parameter that determines the cluster count and chooses the clustering that optimizes the Davies–Bouldin index. For graphs in general, modularity (Eq. (45)) could be optimized. In the presence of a priori information on the generation model, the aforementioned optimization of a likelihood measure will work.
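The model-selection recipe just described, rerunning the algorithm over a range of cluster counts and keeping the clustering that scores best under the chosen index, amounts to a short loop. In the sketch below, cluster_into(graph, k) stands for whatever k-clustering algorithm is being tuned and quality(graph, clustering) for the chosen index (modularity, a negated Davies–Bouldin index, or a likelihood); both are placeholders rather than functions defined in this survey:

    def choose_cluster_count(graph, candidate_counts, cluster_into, quality):
        """Cluster the graph once per candidate count and return the score,
        count and clustering that maximize the quality index."""
        best = None
        for k in candidate_counts:
            clustering = cluster_into(graph, k)
            score = quality(graph, clustering)
            if best is None or score > best[0]:
                best = (score, k, clustering)
        return best

    # Example call: sweep 2..10 clusters with some algorithm and some index.
    # best_score, best_k, best_clustering = choose_cluster_count(
    #     graph, range(2, 11), cluster_into=my_algorithm, quality=my_index)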
When optimizing a quality index, it would be useful to know the best possible value that can be reached for a given input graph, i.e. at least how many intercluster edges there will be and at least how many intracluster edges will be missing. Knowing this allows one to determine whether a given clustering is actually globally good, whereas not knowing the optimum only justifies comparisons between two clusterings to determine whether one is better than the other.

Considering the application-specific nature of clustering problems, there seems to be no answer that would satisfy all. Often the only sensible way to evaluate the quality of a clustering is to see how well it performs for the application at hand, i.e. how costly the computation is and what the benefits of utilizing the obtained clustering are.

7.3. Scalability and stability under perturbations

The amount of digitally available information grows rapidly and hence the scalability of computational methods is becoming an increasingly critical issue. By scalability we mean that a great increase in the size of the problem instance should only cause moderate effects in the amount of computational resources needed. With global methods, often both computation time and memory requirements pose problems — in addition to designing novel scalable methods, parallel and distributed versions of known clustering algorithms help to achieve at least some scalability, as the cost of additional hardware is no longer tremendous. Another option is to resort to approximations of qualitatively good but computationally demanding methods. A further idea, applied by Milenova and Campos [179] to cluster high-dimensional feature-vector data, is to use sampling. Combinations of sampling techniques and local clustering algorithms could well yield easily scalable methods that produce high-quality global clusterings; one idea would be to choose a set of seed vertices with a carefully designed sampling method that gives preference to vertices that structurally make good cluster seeds, and to iteratively combine the information of the local clusters of the seed vertices to obtain a global clustering. For more information on sampling vertices of massive graphs, see for example the Markov-chain constructions discussed by Schaeffer [206] or the path-sampling method of Clauset and Moore [55], further discussed by Achlioptas et al. [3] and Dall'Asta et al. [65].

In the field of feature-vector data clustering, scalability issues have been addressed more thoroughly. Zaïane et al. [246] present an experimental study of different clustering methods and also discuss the difficulty of cluster validation. Farnstrom et al. [87] present a highly scalable variation of the k-means algorithm that scans a large data set once and produces a clustering using a small memory buffer — a buffer that accommodates just one per cent of the input data already serves to produce good clusterings.

Another potentially critical issue is that in clustering applications, one may wish to maintain a clustering for a graph that undergoes frequent modifications. It is however application-dependent whether the changes in the clustering should be limited to the area of the modification or whether the change should propagate and alter the cluster structure in general, and if so, to what extent.

Raghavan and Yu [200] study the stability of clustering methods when the input data is perturbed. Examples of possible perturbations are the introduction or removal of a few edges and/or vertices. They measure the stability of a clustering algorithm by computing a clustering for the original data and again for the perturbed data, then calculating how many operations it would take to transform the latter set of clusters into the former. Raghavan and Yu compare different graph-theoretical clustering methods and cluster definitions. Also Hopcroft et al. [128] evaluate their agglomerative clustering algorithm with respect to perturbations.
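A perturbation test in this spirit can be mimicked with any clustering procedure and any comparison measure; rather than counting transformation operations as in [200], the sketch below simply rewires a small fraction of the edges, reclusters, and reports the agreement between the two results, for instance with the Eq. (58) measure sketched earlier. Here cluster(edges) is a placeholder for the algorithm under study and vertices is the vertex list:

    import random

    def perturb(edges, vertices, fraction=0.05, seed=0):
        """Drop roughly `fraction` of the edges at random and add as many
        random new ones, keeping the edge count unchanged."""
        rng = random.Random(seed)
        kept = [e for e in edges if rng.random() > fraction]
        while len(kept) < len(edges):
            u, v = rng.sample(vertices, 2)
            if (u, v) not in kept and (v, u) not in kept:
                kept.append((u, v))
        return kept

    def stability(edges, vertices, cluster, agreement):
        """Agreement between the clusterings computed for the original and
        for the perturbed graph; values near one indicate a stable method."""
        original = cluster(edges)
        perturbed = cluster(perturb(edges, vertices))
        return agreement(original, perturbed)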
8. Applications of graph clustering

As has been emphasized repeatedly throughout the survey, the task of clustering is highly application-specific. In this section we review some of the key application areas of graph clustering, although it is not to be forgotten that many problems allow the utilization of other representations as well, and hence clustering algorithms for feature vectors or other kinds of classification systems, for example, may equally be applied. We begin by viewing how data sets composed of points in n-dimensional space can be transformed into graphs.

8.1. Data transformations

The range of interesting clustering applications is wide, as many if not practically all systems of interacting (or simply coexisting) entities can be modelled in some way as graphs. For data that are not readily in graph form, several transformations into graph representations are possible. In this section we discuss some of the various possibilities to convert feature-vector data into graph format. Transformations in the other direction exist as well [234], but as the focus of this survey is on graph-theoretical clustering algorithms, we do not address those.

One option for converting feature-vector data into graph format is the Delaunay graph. The Delaunay graph of a set of points on a plane can be constructed by representing each point by a vertex and placing an edge between each pair of points that are Voronoi neighbours [135]. The approach naturally generalizes to higher dimensions. Two points are Voronoi neighbours if their Voronoi cells are adjacent [13]. The Voronoi cell of a datum is formed by those points in the data space that are closer to that data point than to any other. The boundaries of the Voronoi cells are hyperplanes that partition the space in which the data lie.

More often than relying on Delaunay graphs, when transforming feature-vector data into graph format, the data elements d ∈ D are represented by the vertices, and an edge is placed between two elements depending on their similarity under some measure, selected according to the application. Vertex similarity measures are addressed in Section 4.1.

Any data for which a similarity measure has been defined can be transformed into a complete weighted graph using its connectivity matrix [M]_{i,j}, where the element m_{i,j} contains the similarity measure ρ(d_i, d_j) for data elements d_i and d_j. If the similarity values are symmetrical, i.e. ρ(d_i, d_j) = ρ(d_j, d_i), an undirected graph can be formed by representing each datum d_i by v_i ∈ V and using

ω(v_i, v_j) = ρ(d_i, d_j).    (59)

For asymmetrical similarities, the resulting graph is directed and hence the number of edges is doubled. In both cases, the number of edges is O(n²); for large data sets, this is computationally infeasible. The number of edges in the graph can be controlled by setting a threshold value ξ such that

{v_i, v_j} ∈ E if and only if ρ(d_i, d_j) ≥ ξ,    (60)

although choosing the value of the threshold is application-specific and not always easily justified. Such edge elimination is referred to as sparsification [219].
business analysis, marketing, improving the infrastructure, like data. Wu et al. [237] address the database organization
and identifying anomalous use. issue for graphs, providing a solution where the data storage
In computer networks, clustering may be used to identify format itself supports quick, approximate computation of
relevant substructures and to analyse the connectivity shortest paths and distances for a special class of nonuniform
for purposes of modelling or structural optimization; the networks called scale-free networks [22]. A similar idea is
canonical example is the Internet and the structure of presented by Agrawal and Jagadish [5], implicitly assuming
the Autonomous Domains [178]. One example of topology an underlying cluster structure in the input graph. Their
design through clustering is the work of Grout and method was modified to explicitly use clustering algorithms
Cunningham [114]. In the World Wide Web, clustering of by Schaeffer [206]. Bradley et al. [32] use a k-means like
hypertext documents – representing each web page by a iterative algorithm to determine a clustering for a large
vertex and each hyperlink by an edge – helps to identify database in one scan using a limited memory buffer.
topics and other entities formed by several interconnected
Of course, one may also work the other way around and convert a graph into a set of feature vectors, then utilize some clustering algorithm for general data sets. The task of constructing the feature vectors for the vertices is addressed by Wilson et al. [234], who use a spectral matrix and construct symmetric, permutation-invariant polynomials from the matrix elements, then use the coefficients of the polynomials as feature vectors. In practice these vectors allow for a locally linear embedding in a low-dimensional space.

8.2. Information networks and usage information

In any communication network, graph clustering serves as a tool for analysis, modelling and prediction of the function, usage and evolution of the network. Applications include business analysis, marketing, improving the infrastructure, and identifying anomalous use.

In computer networks, clustering may be used to identify relevant substructures and to analyse the connectivity for purposes of modelling or structural optimization; the canonical example is the Internet and the structure of the Autonomous Domains [178]. One example of topology design through clustering is the work of Grout and Cunningham [114]. In the World Wide Web, clustering of hypertext documents – representing each web page by a vertex and each hyperlink by an edge – helps to identify topics and other entities formed by several interconnected documents [227,236].

When it comes to Internet telephony and chat services such as Yahoo Messenger, Microsoft's Messenger Live, and Skype, interesting usage statistics for optimizing related software and hardware configurations can be obtained by representing each user as a vertex and placing (weighted) edges between two users when they communicate over the system. For example, in a multiserver environment, savings could be obtained by grouping a dense cluster of users on the same server, as this would reduce the interserver traffic.

Similar analysis can help traditional teleoperators identify "frequent call clusters", i.e. groups of people that all mainly call each other (such as families, coworkers, or groups of teenage friends), and hence better design and target the widely spread offers on special rates for calling a limited number of prespecified phone numbers. Clustering the caller information can also help to identify changes in the communication pattern of a certain client: when long calls are repeatedly being made outside the cluster, the phone may have been stolen or the client may simply have decided not to pay the bill anymore. For fraud detection, call durations and a geographical embedding would be most helpful in determining what forms the cluster of "normal call destinations" for a specific client and which calls are "out of the ordinary".

Clustering algorithms are also used in the structural design and operation of ad hoc [130,165,194] and sensor networks [101]. For networks with a dynamic topology, with frequent changes in the edge structure, local clustering methods prove useful, as the network nodes can make local decisions on how to modify the clustering to better reflect the current network topology [207]. Imposing a cluster structure on a dynamic network eases the routing task [156,216].

8.3. Database systems

When storing a large set of data, a key question is how to group the data onto pages in physical memory. A single page is typically large enough to contain multiple elements but only a small fraction of the entire data set. Therefore, a desirable grouping would be such that when a datum is retrieved, relevant data would come along with it, so that possible related future queries might benefit from the already retrieved page. Also traditional concerns such as the complexity of searching, inserting, deleting, and modifying the stored data must be attended to, in addition to the relevancy concern in the paging design.

Diwan et al. [73] propose paging by clustering for tree-like data. Wu et al. [237] address the database organization issue for graphs, providing a solution where the data storage format itself supports quick, approximate computation of shortest paths and distances for a special class of nonuniform networks called scale-free networks [22]. A similar idea is presented by Agrawal and Jagadish [5], implicitly assuming an underlying cluster structure in the input graph. Their method was modified to explicitly use clustering algorithms by Schaeffer [206]. Bradley et al. [32] use a k-means-like iterative algorithm to determine a clustering for a large database in one scan using a limited memory buffer.

8.4. Biological and sociological networks

In the field of bioinformatics, graph clustering tasks typically deal with classification of gene expression data (specifically gene-activation dependencies) [240,31] and protein interactions [16,193,148,243,7,141]. Another biological application of clustering is epidemic spreading. Newman [182] studies SIR-type epidemic processes in a special class of graphs and finds that graphs with a cluster structure have smaller epidemics, but a lower epidemic threshold, making it easier for diseases to spread. Applications of local clustering in social networks include identifying groups of individuals "exposed" to the influence of a certain individual of interest, such as identifying terrorist networks when a member is known or locating potentially infected people when an infected and contagious individual is encountered.

Cluster analysis of a social network also helps to identify mechanisms underlying, for example, the formation of trends (relevant to market studies) and voter behaviour. In the current information society, the study of social networks tends to overlap with the study of information networks, as the popularity and significance of electronic messaging have become overwhelming. However, traditional studies where the daily contacts of individuals are mapped and classified do coexist with the studies of chats and web logs.

8.5. Other applications

In the business world, other than market analysis based on social or communication networks, also stock market data can be clustered: represent each stock by a vertex and place weighted edges to represent the correlations of the valuations of the stocks in the stock market. Such a representation allows for the identification of clusters of stocks that either all gain or lose value together, or alternatively – varying the cluster definition – stocks that appear to behave independently of each other. Such knowledge is useful in portfolio management when one wishes to distribute and/or concentrate investments.
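The stock-market construction just described can be prototyped by weighting vertex pairs with the Pearson correlation of their valuation histories; the series, names and threshold below are made up purely for illustration:

    from math import sqrt

    def pearson(x, y):
        """Pearson correlation of two equally long series of valuations."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sqrt(sum((a - mx) ** 2 for a in x))
        sy = sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    def correlation_graph(series, threshold=0.8):
        """One vertex per stock; an edge weighted by the correlation of the
        valuations whenever that correlation reaches the threshold."""
        names = sorted(series)
        edges = {}
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                c = pearson(series[names[i]], series[names[j]])
                if c >= threshold:
                    edges[(names[i], names[j])] = c
        return edges

    prices = {'A': [10, 11, 12, 13], 'B': [20, 22, 24, 26], 'C': [5, 4, 4, 3]}
    print(correlation_graph(prices))
    # an edge appears only between the two co-moving stocks A and B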
A clustering analysis of the global air transportation network is given by Guimerà et al. [117]. In logistics, the hub-location problem [20,39] and other kinds of facility-location problems [61] are of interest. Several clustering-based heuristic solutions have been proposed for the hub-location problem [153,189], and heuristics and approximation algorithms that rely on a graph representation and a clustering computation exist for the facility-location problem as well [49,217]. A related problem of sales territory design [249,140] also has proposed solutions building on the use of clustering methods [201,225].

Clustering also serves in manufacturing, where identification of clusters of similar parts helps to smooth the production line (called group technology). Chen et al. [46] cluster transaction databases in the hope of finding profit-increasing patterns, using a noise-insensitive similarity measure between items based on their cooccurrence relationships.

9. Open problems and future directions

In the previous sections we reviewed three major open problems of graph clustering:
• Parameter selection: how is the user to determine the parameter values to give as input to the clustering algorithm,
• Scalability: how do the runtime and memory consumption of the algorithm behave for massive input graphs, and
• Evaluation: how to decide which of several clusterings is the best.

For a non-expert user, ideally there would be few if any parameters, and the output of the method would at least not be highly sensitive to the parameter values. The scalability issue can be resolved either by resorting to approximation algorithms for the existing methods or by novel approaches. Parallelization of local methods could also offer a solution. Especially the field of data mining frequently needs to evaluate cluster structures in very large datasets.

The problematics of evaluation could be eased by the creation of a benchmark set that allows comparison between different clustering methods at the design phase. For the end user, it would be helpful in evaluating the output of different algorithms (or of the same method with different parameter values) if the clustering measures and/or the quality indices were intuitively pleasant even to users who are neither mathematicians nor computer scientists — should the measure be easily explained in lay person's terms, the user could analyse whether the vertices considered to be similar were actually grouped as similar.

More extensions of the existing graph clustering algorithms to weighted graphs would be of great interest, as well as novel methods for clustering directed graphs. In application areas, there is certainly a need to cluster also multigraphs and hypergraphs.

The theoretical foundations of graph clustering are not yet fully explored; we believe that there may well be several supposedly distinct graph clustering algorithms that fundamentally compute the same exact thing. However, we do not expect there to be a single universal answer to the questions of what is a good cluster in a graph and how to find it, as the field is highly application-specific.

10. Concluding remarks

In this survey we have given an overview of some of the essential definitions and techniques of graph clustering. In general, it seems that many of the good measures of clustering are intertwined: cut-based methods are in a sense spectral methods, which in turn are related to random walks, which model the behaviour of electrical networks and also serve to do betweenness-like computations, and so forth. These theoretical connections between many of the methods give reason to believe we are on the right track: the field of graph clustering seems to be revolving around fundamentally similar definitions, although some of the starting points for the algorithms are quite far apart.

We reviewed both global and local approaches and discussed the delicate issues of selecting an appropriate method for the task at hand, selecting good parameter values, and evaluating the quality of the resulting clustering. The tools available are already almost as varied as the applications of graph clustering, although much work still remains to be done.

Acknowledgements

This work has been supported by the Academy of Finland under grants 81120 (STADYCS, 2002–2003) and 206235 (ANNE, 2004–2006), the Helsinki Graduate School in Computer Science and Engineering (2001–2002), the Nokia Foundation (2004), and the Rotary Foundation (2005). For their valuable comments, the author thanks Pekka Orponen, the editors and the anonymous reviewer, whose comments greatly improved the structure of the presentation.

REFERENCES

[1] P. Drineas, A. Frieze, R. Kannan, S. Vempala, V. Vinay, Clustering in large graphs and matrices, Machine Learning 56 (2004) 9–33.
[2] E. Aarts, J.K. Lenstra, Local Search in Combinatorial Optimization, John Wiley & Sons, Inc., Chichester, UK, 1997.
[3] D. Achlioptas, A. Clauset, D. Kempe, C. Moore, On the bias of traceroute sampling (or: Why almost every network looks like it has a power law), in: H.N. Gabow, R. Fagin (Eds.), Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing, STOC, ACM Press, New York, NY, USA, 2005.
[4] P.K. Agarwal, C.M. Procopiuc, Exact and approximation algorithms for clustering, in: Proceedings of the Ninth Annual ACM–SIAM Symposium on Discrete Algorithms, ACM–SIAM, Philadelphia, PA, USA, 1998.
[5] R. Agrawal, H.V. Jagadish, Algorithms for searching massive [26] C. Biernacki, G. Govaert, Using the classification likelihood
graphs, IEEE Transactions on Knowledge and Data Engineer- to choose the number of clusters, Computing Science and
ing 6 (2) (1994) 225–238. Statistics 29 (2) (1997) 451–457.
[6] D.J. Aldous, J.A. Fill, Reversibe Markov Chains and Random [27] N. Biggs, Algebraic Graph Theory, 2nd ed., Cambridge
Walks on Graphs. http://www.stat.berkeley.edu/aldous/ University Press, Cambridge, UK, 1994.
RWG/book.html, 2001 (in preparation). [28] I.M. Bomze, M. Budinich, P.M. Pardalos, M. Pelillo, The
[7] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, maximum clique problem, in: D.-Z. Du, P.M. Parda-
S. Kanaya, Development and implementation of an los (Eds.), in: Handbook of Combinatorial Optimization,
algorithm for detection of protein complexes in large vol. Supplement Volume A, Kluwer Academic Publishers,
interaction networks, BMC Bioinformatics 7 (2006) 207. Boston, MA, USA, 1999, pp. 1–74.
[8] R.G. , T. Hu, Multiterminal network flows, SIAM Journal 9 [29] J.G. Booth, G. Casella, J.P. Hobert, Clustering using objective
(1961) 551–570. functions and stochastic search, Journal of the Royal Statis-
[9] R. Andersen, F.R.K. Chung, K. Lang, Local partitioning using tical Society, Series B (2007) (submitted for publication).
PageRank vectors, in: Proceedings of the Fourty-seventh [30] F. Boutin, M. Hascoet, Cluster validity indices for graph
Annual Symposium on Foundations of Computer Science, partitioning, in: Proceedings of the Eighth International
FOCS, IEEE Computer Society Press, Washington, DC, USA, Conference on Information Visualisation, IEEE Computer
2006. Society, 2004.
[10] S. Arora, S. Rao, U. Vazirani, Expander flows, geometric [31] F. Boyer, A. Morgat, L. Labarre, J. Pothier, A. Viari, Syntons,
embeddings and graph partitioning, in: Proceedings of the metabolons and interactons: An exact graph-theoretical
Thirty-Sixth Annual Symposium on Theory of Computing, approach for exploring neighbourhood between genomic
STOC, ACM Press, New York, NY, USA, 2004. and functional data, Bioinformatics 21 (23) (2005) 4209–4215.
[11] Y. Asahiro, R. Hassin, K. Iwama, Complexity of finding dense [32] P.S. Bradley, U.M. Fayyad, C. Reina, Scaling clustering
subgraphs, Discrete Applied Mathematics 121 (1) (2002) algorithms to large databases, in: Proceedings of the Fourth
15–26. International Conference on Knowledge Discovery and Data
[12] D. Auber, M. Delest, Y. Chiricota, Strahler based graph Mining, KDD, ACM, New York, NY, USA, 1998.
clustering using convolution, in: Proceedings of the Eighth [33] U. Brandes, A faster algorithm for betweenness centrality,
International Conference on Information Visualisation, IEEE Journal of Mathematical Sociology 25 (2) (2001) 163–177.
Computer Society, 2004.
[34] U. Brandes, M. Gaertler, D. Wagner, Experiments on
[13] F. Aurenhammer, Voronoi diagrams — A survey of a
graph clustering algorithms, in: G. Di Battista, U. Zwick
fundamental geometric data structure, ACM Computing
(Eds.), Proceedings of the Eleventh European Symposium
Surveys 23 (3) (1991) 345–405.
on Algorithms, in: Lecture Notes in Computer Science,
[14] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti
vol. 2832, Springer-Verlag GmbH, Heidelberg, Germany,
Spaccamela, M. Protasi, Complexity and Approximation:
2003.
Combinatorial optimization problems and their approxima-
[35] S. Brin, L. Page, The anatomy of a large-scale hypertextual
bility properties, Springer-Verlag GmbH, Berlin, Heidelberg,
Web search engine, Computer Networks and ISDN Systems
Germany, 1999.
30 (1–7) (1998) 107–117.
[15] F.R. Bach, M.I. Jordan, Learning spectral clustering, Tech.
[36] A.Z. Broder, S.R. Kumar, F. Maghoul, P. Raghavan,
Rep. UCB/CSD-03-1249, Computer Science Division, Univer-
S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, Graph
sity of California, Berkeley, CA, USA, Jun. 2003.
structure in the Web, Computer Networks 33 (1–6) (2000)
[16] G.D. Bader, C.W.V. Hogue, An automated method for finding
309–320.
molecular complexes in large protein interaction networks,
[37] T.N. Bui, F.T. Leighton, S. Chaudhuri, M. Sipser, Graph
BMC Bioinformatics 4(2).
[17] J.P. Bagrow, E.M. Bollt, Local method for detecting communi- bisection algorithms with good average case behavior,
ties, Physical Review E 72 (2005) 046108. Combinatorica 7 (2) (1987) 171–191.
[18] N. Bansal, A. Blum, S. Chawla, Correlation clustering, [38] H. Bunke, P. Foggia, C. Guidobaldi, M. Vento, Graph clus-
Machine Learning 56 (1–3) (2004) 89–113. tering using the weighted minimum common supergraph,
[19] J. Bar-Ilan, G. Kortsarz, D. Peleg, How to allocate network in: E.R. Hancock, M. Vento (Eds.), Proceedings of the Fourth
centers, Journal of Algorithms 15 (3) (1993) 385–415. IARP International Workshop on Graph Based Representa-
[20] J. Bar-Ilan, G. Kortsarz, D. Peleg, How to allocate network tions in Pattern Recognition, in: Lecture Notes in Computer
centers, Journal of Algorithms 15 (3) (1993) 385–415. Science, vol. 2726, Springer-Verlag GmbH, Berlin, Heidel-
[21] J. Bar-Ilan, D. Peleg, Approximation algorithms for selecting berg, Germany, 2003.
network centers, in: F.K.H.A. Dehne, J.-R. Sack, N. Santoro [39] J.F. Campbell, Hub location and the p-hub median problem,
(Eds.), Proceedings of the Second Workshop on Algorithms Operations Research 44 (6) (1996) 923–935.
and Data Structures, WADS’91, in: Lecture Notes in [40] A. Capoccia, V. Servedioa, G. Caldarellia, F. Colaiorib,
Computer Science, vol. 519, Springer-Verlag GmbH, Berlin, Detecting communities in large networks, Physica A:
Heidelberg, Germany, 1991. Statistical Mechanics and its Applications 352 (2–4) (2005)
[22] A.-L. Barabási, R. Albert, Emergence of scaling in random 669–676.
networks, Science 286 (5439) (1999) 509–512. [41] J.J.M. Carrasco, D.C. Fain, K.J. Lang, L. Zhukov, Clustering
[23] E. Behrends, Introduction to Markov Chains, with Special of bipartite advertiser-keyword graph, in: Proceedings of
Emphasis on Rapid Mixing, Vieweg & Sohn, Braunschweig, the Third IEEE International Conference on Data Mining,
Wiesbaden, Germany, 2000. Workshop on Clustering Large Data Sets, 2003.
[24] L.M.A. Bettencourt, Tipping the balances of a small world, [42] D. Chakrabarti, C. Faloutsos, Graph mining: Laws, genera-
Tech. Rep. MIT-CTP-3361 (cond-mat/0304321 at arXiv.org), tors, and algorithms, ACM Computing Surveys 38 (1) (2006)
Center for Theoretical Physics, Massachusetts Institute of Article No. 2.
Technology, Cambridge, MA, USA, 2002. [43] M. Charikar, C. Chekuri, T. Feder, R. Motwani, Incre-
[25] C. Biernacki, G. Celeux, G. Govaert, Assessing a mixture mental clustering and dynamic information retrieval,
model for clustering with the integrated completed in: F.T. Leighton, P. Shor (Eds.), Proceedings of the Twenty-
likelihood, IEEE Transactions on Pattern Analysis and ninth Annual Symposium on Theory of Computing, STOC,
Machine Intelligence 22 (7) (2000) 719–725. ACM Press, New York, NY, USA, 1997.
[44] M. Charikar, R. Panigrahy, Clustering to minimize the sum [65] L. Dall’Asta, I. Alvarez-Hamelin, A. Barrat, A. Vázquez,
of cluster diameters, Journal of Computer and System A. Vespignani, Exploring networks with traceroute-like
Sciences 68 (2) (2004) 417–441. probes: Theory and simulations, Theoretical Computer
[45] J. Cheeger, A lower bound for the smallest eigenvalue of the Science 355 (1) (2006) 6–24.
laplacian, in: Problems in Analysis: Symposium in Honor [66] L. Danon, A. Díaz Guilera, A. Arenas, The effect of size
of Salomon Bochner (1969), Princeton University Press, heterogeneity on community identification in complex
Princeton, NJ, USA, 1970. networks, Journal of Statistical Mechanics Theory and
[46] N. Chen, A. Chen, L. Zhou, L. Lu, A graph-based clustering Experiment (2006) P11010.
algorithm in large transaction databases, Intelligent Data [67] L. Danon, A. Díaz Guilera, J. Duch, A. Arenas, Comparing
Analysis 5 (4) (2001) 327–338. community structure identification, Journal of Statistical
[47] D. Cheng, R. Kannan, S. Vempala, G. Wang, On a Mechanics Theory and Experiment (2005) P09008.
recursive spectral algorithm for clustering from pairwise [68] R.N. Dave, R. Krishnapuram, Robust clustering methods: A
similarities, Tech. Rep. MIT-LCS-TR-906, Laboratory of unified view, IEEE Transactions on Fuzzy Systems 5 (2) (1997)
Computer Science, Massachusetts Institute of Technology, 270–293.
Boston, MA, USA, 2003. [69] D.L. Davies, D.W. Bouldin, A cluster separation measure,
[48] D. Cheng, S. Vempala, R. Kannan, G. Wang, A divide- IEEE Transactions on Pattern Analysis and Machine
and-merge methodology for clustering, in: Proceedings of Intelligence 1 (4) (1979) 224–227.
the Twenty-fourth Symposium on Principles of Database [70] N.M.M. de Abreu, Old and new results on algebraic
Systems, ACM Press, New York, NY, USA, 2005. connectivity of graphs, Linear Algebra and its Applications
[49] F.A. Chudak, D.B. Shmoys, Improved approximation algo- 423 (1) (2007) 53–73.
rithms for the uncapacitated facility location problem, SIAM [71] J. Díaz, J. Petit, M. Serna, A survey of graph layout problems,
Journal on Computing 33 (1) (2003) 1–25. ACM Computing Surveys 34 (3) (2002) 313–356.
[50] T.Y. Chun, World Wide Web robots: An overview, Online [72] C. Ding, X. He, Linearized cluster assignment via spectral
Information Review 22 (3) (1999) 135–142. ordering, in: Proceedings of the Twenty-First International
[51] F.R.K. Chung, Spectral Graph Theory, American Mathemati- Conference on Machine Learning, vol. 69, ACM Press, New
cal Society, Providence, RI, USA, 1997. York, NY, USA, 2004.
[52] F.R.K. Chung, Random walks and local cuts in graphs, Linear [73] A.A. Diwan, S. Rane, S. Seshadri, S. Sudarshan, Clustering
Algebra and its Applications. techniques for minimizing external path length, in: Pro-
[53] F.R.K. Chung, L. Lu, V. Vu, The spectra of random graphs with ceedings of the Twenty-second International Conference on
given expected degrees, Internet Mathematics 1 (3) (2004) Very Large Data Bases (VLDB), Morgan Kaufmann Publish-
257–275. ers, San Francisco, CA, USA, 1996.
[54] A. Clauset, Finding local community structure in networks,
[74] L. Donetti, M.A. Muñoz, Detecting network communities:
Physical Review E 72 (2005) 026132.
A new systematic and efficient algorithm, Journal of
[55] A. Clauset, C. Moore, Accuracy and scaling phenomena
Statistical Mechanics (2004) P10012.
in Internet mapping, Physical Review Letters 94 (1) (2005)
[75] Yihong Dong, Yueting Zhuang, Ken Chen, Xiaoying Tai,
018701.
A hierarchical clustering algorithm based on fuzzy graph
[56] A. Clauset, M.E.J. Newman, C. Moore, Finding community
connectedness, Fuzzy Sets and Systems 157 (13) (2006)
structure in very large networks, Physical Review E 70 (6)
1760–1774.
(2004) 066111.
[76] S.N. Dorogovtsev, J.F.F. Mendes, Evolution of networks,
[57] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison
Advances in Physics 51 (4) (2002) 1079–1187.
of string distance metrics for name-matching tasks,
[77] P.G. Doyle, J.L. Snell, Random Walks and Electric Networks,
in: S. Kambhampati, C.A. Knoblock (Eds.), Proceedings of
Mathematical Association of America, Washington, DC,
IJCAI-03 Workshop on Information Integration on the Web,
USA, 1984.
IIWeb-03, AAAI, 2003.
[78] H. Du, An algorithm for detecting community structure of
[58] F. Comellas, S. Gago Álvarez, Spectral bounds for the
social networks based on prior knowledge and modularity,
betweenness of a graph, Linear Algebra and its Applications
Complexity 12 (3) (2007) 53–60.
423 (1) (2007) 74–80.
[59] A. Condon, R.M. Karp, Algorithms for graph partitioning [79] R.C. Dubes, A.K. Jain, Clustering methodologies in ex-
on the planted partition model, Random Structures & ploratory data analysis, Advances in Computers 19 (1980)
Algorithms 18 (2) (2001) 116–140. 113–228.
[60] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduc- [80] D.P. Dubhashi, L. Laura, A. Panconesi, Analysis and exper-
tion to Algorithms, 2nd ed., MIT Press and McGraw Hill, imental evaluation of a simple algorithm for collaborative
Cambridge, MA, USA, 2001, pp. 643–700. filtering in planted partition models, in: P.K. Pandya, J. Rad-
[61] G. Cornuéjols, G.L. Nemhauser, L.A. Wolsey, The unca- hakrishnan (Eds.), Proceedings of the Twenty-Third Confer-
pacitated facility location problem, in: P.B. Mirchandani, ence on the Foundations of Software Technology and Theo-
R.L. Francis (Eds.), Discrete Location Theory, John Wiley and retical Computer Science, in: Lecture Notes in Computer Sci-
Sons, Inc., New York, NY, USA, 1990, pp. 119–171. ence, vol. 2914, Springer-Verlag GmbH, Berlin, Heidelberg,
[62] P. Crescenzi, V. Kann, A compendium of np optimization Germany, 2003.
problems. http://www.csc.kth.se/viggo/wwwcompendium/ [81] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd
wwwcompendium.html, accessed on May 18, 2007. ed., John Wiley & Sons, Inc., New York, NY, USA, 2001.
[63] D. Cvetković, Signless laplacians and line graphs, Bulletin, [82] J. Edachery, A. Sen, F.J. Brandenburg, Graph clustering
Classe des Sciences Mathématiques et Naturelles, Sciences using distance-k cliques, in: Proceedings of the Seventh
mathématiques Académie Serbe des Sciences et des Arts International Symposium on Graph Drawing, in: Lecture
CXXXI (30) (2005) 85–92. Notes in Computer Science, vol. 1731, Springer-Verlag
[64] L. da F. Costa, F.A. Rodrigues, G. Travieso, P.R. Villas GmbH, Berlin, Heidelberg, Germany, 1999.
Boas, Characterization of complex networks: A survey of [83] P. Elias, A. Feinstein, C.E. Shannon, Note on maximum flow
measurements, Tech. Rep. cond-mat/0505185 arXiv.org, May through a network, IRE Transactions on Information Theory
2005. IT-2 (1956) 117–119.
[84] P. Erdős, A. Rényi, On random graphs I, in: Selected Papers [106] E.N. Gilbert, Random graphs, Annals of Mathematical
of Alfréd Rényi, vol. 2, Akadémiai Kiadó, Budapest, Hungary, Statistics 30 (4) (1959) 1141–1144.
1976, pp. 308–315. First publication in Publ. Math. Debrecen [107] M. Girvan, M.E.J. Newman, Community structure in social
1959. and biological networks, Proceedings of the National
[85] P. Erdős, A. Rényi, On the evolution of random graphs, Academy of Sciences, USA 99 (2002) 8271–8276.
in: Selected Papers of Alfréd Rényi, vol. 2, Akadémiai Kiadó, [108] C. Gkantsidis, M. Mihail, A. Saberi, Conductance and
Budapest, Hungary, 1976, pp. 482–525. First publication in congestion in power law graphs, in: Proceedings of the
MTA Mat. Kut. Int. Közl. 1960. International Conference on Measurement and Modeling of
[86] I.J. Farkas, I. Derényi, A.-L. Barabási, T. Vicsek, Spectra of Computer Systems, ACM Press, New York, NY, USA, 2003.
“real-world” graphs: Beyond the semicircle law, Physical [109] C. Gkantsidis, M. Mihail, E. Zegura, Spectral analysis
Review E 64 (2) (2001) 026704. of Internet topologies, in: Proceedings of the Twenty-
[87] F. Farnstrom, J. Lewis, C. Elkan, Scalability for clustering second Annual Joint Conference of the IEEE Computer and
algorithms revisited, SIGKDD Explorations 2 (2) (2000) 1–7. Communications Societies, INFOCOM, vol. 1, IEEE, New
[88] T. Feder, D.H. Greene, Optimal algorithms for approximate York, NY, USA, 2003.
clustering, in: Proceedings of the Twentieth Annual ACM [110] K.-I. Goh, B. Kahng, D. Kim, Spectra and eigenvectors of
Symposium on Theory of Computing, STOC, ACM Press, scale-free networks, Physical Review E 64 (5) (2001) 051903.
New York, NY, USA, 1988. [111] A.V. Goldberg, R.E. Tarjan, A new approach to the maximum-
[89] U. Feige, R. Krauthgamer, A polylogarithmic approximation flow problem, Journal of the ACM 35 (4) (1988) 921–940.
of the minimum bisection, SIAM Journal on Computing 31
[112] T.F. Gonzalez, Clustering to minimize the maximum
(4) (2002) 1090–1118.
intercluster distance, Theoretical Computer Science 38
[90] U. Feige, D. Peleg, G. Kortsarz, The dense k-subgraph (1985) 293–306.
problem, Algoritmica 29 (3) (2001) 410–421.
[113] G.R. Grimmett, D.R. Stirzaker, Probability and Random
[91] A. Felner, Finding optimal solutions to the graph partition-
Processes, 3rd ed., Oxford University Press, Oxford, UK, 2001.
ing problem with heuristic search, Annals of Mathematics
[114] V. Grout, S. Cunningham, A constrained version of a cluster-
and Artificial Intelligence 45 (3–4) (2005) 292–322.
ing algorithm for switch placement and interconnection in
[92] M. Fiedler, Algebraic connectivity of graphs, Czechoslovak
large networks, in: T. Philip (Ed.), Proceedings of the Nine-
Mathematical Journal 23 (1973) 298–305.
teenth International Conference on Computer Applications
[93] M. Fiedler, A property of eigenvectors of nonnegative
in Industry and Engineering, CAINE, ICSA, 2006.
symmetric matrices and its application to graph theory,
[115] S. Guattery, G.L. Miller, On the quality of spectral separators,
Czechoslovak Mathematical Journal 25 (1975) 619–633.
SIAM Journal on Matrix Analysis and Applications 19 (3)
[94] G.W. Flake, S. Lawrence, C.L. Giles, F.M. Coetzee, Self-
(1998) 701–719.
organization and identification of Web communities, IEEE
[116] S. Guha, N. Mishra, R. Motwani, L. O’Callaghan, Clustering
Computer 35 (3) (2002) 66–71.
data streams, in: Proceedings of the Fourty-first Annual
[95] G.W. Flake, R.E. Tarjan, K. Tsioutsiouliklis, Graph clustering
Symposium on Foundations of Computer Science, FOCS,
and minimum cut trees, Internet Mathematics 1 (1) (2004)
IEEE Computer Society Press, Los Alamitos, CA, USA, 2000.
385–408.
[117] R. Guimerà, S. Mossa, A. Turtschi, L.A. Nunes Amaral,
[96] L.R. Ford Jr., D.R. Fulkerson, Maximum flow through a
The worldwide air transportation network: Anomalous
network, Canadian Journal of Mathematics 8 (1956) 399–404.
centrality, community structure, and cities’ global roles,
[97] S. Fortunato, V. Latora, M. Marchiori, Method to find
Proceedings of the National Academy of Science of the
community structures based on information centrality,
United States of America 102 (22) (2005) 7794–7799.
Physical Review E 70 (2004) 056104.
[118] D. Gusfield, Algorithms on Strings, Trees, and Sequences:
[98] C. Fraley, A.E. Raftery, How many clusters? Which clustering
Computer Science and Computational Biology, Cambridge
method? Answers via model-based cluster analysis, The
University Press, Cambridge, UK, 1997.
Computer Journal 41 (8) (1998) 578–588.
[99] P. Fränti, O. Virmajoki, V. Hautamäki, Fast PNN-based clus- [119] J.A. Hartigan, M.A. Wong, Algorithm AS 136: A k-means
tering using k-nearest neighbor graph, IEEE Transactions clustering algorithm, Applied Statistics 29 (1979) 100–108.
on Pattern Analysis and Machine Intelligence 28 (11) (2006) [120] E. Hartuv, R. Shamir, A clustering algorithm based on graph
1875–1881. connectivity, Information Processing Letters 76 (4–6) (2000)
[100] L.C. Freeman, A set of measures of centrality based upon 175–181.
betweenness, Sociometry 40 (1) (1977) 35–41. [121] X. He, H. Zha, C.H.Q. Ding, H.D. Simon, Web document clus-
[101] T. Furuta, M. Sasaki, F. Ishizaki, A. Suzuki, H. Miyazawa, tering using hyperlink structures, Computational Statistics
A new cluster formation method for sensor networks & Data Analysis 41 (1) (2002) 19–45.
using facility location theory, Tech. Rep. NANZAN-TR-2006- [122] C. Hennig, B. Hausdorf, Design of dissimilarity measures:
01, Nanzan Academic Society Mathematical Sciences and A new dissimilarity measure between species distribution
Information Engineering, Nagoya, Japan, August 2006. ranges, in: V. Batagelj, H.-H. Bock, A. Ferligoj, A. Ziberna
[102] G. Gallo, M.D. Grigoriadis, R.E. Tarjan, A fast parametric (Eds.), Data Science and Classification, Studies in Classifica-
maximum flow algorithm and applications, SIAM Journal on tion, Data Analysis, and Knowledge Organization, Springer-
Computing 18 (1) (1989) 30–55. Verlag GmbH, Berlin, Germany, 2006, pp. 29–38.
[103] M.R. Garey, D.S. Johnson, Computers and Intractability: A [123] D.J. Higham, G. Kalna, M. Kibble, Spectral clustering and its
Guide to the Theory of NP-Completeness, W.H. Freeman, use in bioinformatics, Journal of Computational and Applied
San Francisco, CA, USA, 1979. Mathematics 204 (1) (2007) 25–37.
[104] M.R. Garey, D.S. Johnson, L.J. Stockmeyer, Some simplified [124] A. Hlaoui, S. Wang, Median graph computation for graph
NP-complete graph problems, Theoretical Computer Sci- clustering, Soft Computing — A Fusion of Foundations
ence 1 (3) (1976) 237–267. Methodologies and Applications 10 (1) (2006) 47–53.
[105] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, [125] D.D. Hochbaum, D.B. Shmoys, A unified approach to
IEEE Transactions on Pattern Analysis and Machine approximation algorithms for bottleneck problems, Journal
Intelligence 11 (7) (1989) 773–780. of the ACM 33 (3) (1986) 533–550.
[126] D.S. Hochbaum, Various notions of approximations: Good, [145] B.W. Kernighan, S. Lin, An efficient heuristic procedure for
better, best, and more, in: D.S. Hochbaum (Ed.), Approxi- partitioning graphs, Bell System Technical Journal 49 (2)
mation Algorithms for NP-hard Problems, PWS Publishing (1970) 291–308.
Company, Boston, MA, USA, 1997, pp. 346–398 (Chapter 9). [146] S. Khuller, Y.J. Sussmann, The capacitated k-center problem,
[127] K. Holzapfel, S. Kosub, M.G. Maaß, H. Täubig, The com- in: J. Díaz, M.J. Serna (Eds.), Proceedings of the Fourth Annual
plexity of detecting fixed-density clusters, in: R. Petreschi, European Symposium on Algorithms, ESA, in: Lecture Notes
G. Persiano, R. Silvestri (Eds.), Proceedings of the Fifth Italian in Computer Science, vol. 1136, Springer-Verlag GmbH,
Conference on Algorithms and Complexity, CIAC, in: Lec- Berlin, Heidelberg, Germany, 1996.
ture Notes in Computer Science, vol. 2653, Springer-Verlag [147] S. Kim, Graph theoretic sequence clustering algorithms and
GmbH, Berlin, Germany, 2003. their applications to genome comparison, in: J.T.L. Wang,
[128] J.E. Hopcroft, O. Khan, B. Kulis, B. Selman, Natural C.H. Wu, P.P. Wang (Eds.), Computational Biology and
communities in large linked networks, in: Proceedings Genome Informatics, World Scientific Publishing Company,
of the Ninth International Conference on Knowledge 2003, pp. 81–116 (Chapter 4).
Discovery and Data Mining, KDD, ACM, New York, NY, USA, [148] A.D. King, N. Przulj, I. Jurisica, Protein complex prediction
2003. via cost-based clustering, Bioinformatics 20 (17) (2004)
[129] F. Höppner, F. Klawonn, R. Kruse, T. Runkler, Fuzzy Cluster 3013–3020.
Analysis: Methods for Classification, Data Analysis and [149] R.W. Klein, R.C. Dubes, Experiments in projection and
Image Recognition, John Wiley & Sons, Inc., Hoboken, NJ, clustering by simulated annealing, Pattern Recognition 22
USA, 1999. (2) (1989) 213–220.
[130] T.-C. Hou, T.-J. Tsai, An access-based clustering protocol for [150] J. Kleinberg, An Impossibility Theorem for Clustering, MIT
multihop wireless ad hoc networks, IEEE Journal on Selected Press, Cambridge, MA, USA, 2002.
Areas in Communications 19 (7) (2001) 1201–1210. [151] J.M. Kleinberg, S. Lawrence, The structure of the Web,
[131] W.-L. Hsu, G.L. Nemhauser, Easy and hard bottleneck Science 294 (5548) (2001) 1849–1850.
location problems, Discrete and Applied Mathematics 1 [152] J.M. Kleinberg, E. Tardos, Approximation algorithms for
(1979) 209–216. classification problems with pairwise relationships: Metric
[132] H. Hu, X. Yan, Y. Huang, J. Han, X.J. Zhou, Mining coherent labeling and Markov random fields, Journal of the ACM 49
dense subgraphs across massive biological networks (5) (2002) 14–23.
for functional discovery, Bioinformatics (Suppl. 1) (2005) [153] J.G. Klincewicz, Heuristics for the p-hub location problem,
213–221. European Journal of Operational Research 53 (1991) 25–37.
[154] M. Kozdron, The discrete dirichlet problem. http://citeseer.
[133] P. Jaccard, Distribution de la flore alpine dans la Bassin de
ist.psu.edu/293959.html, April 2000.
Dranses et dans quelques regions voisines, Bulletin del la
[155] D.L. Kreher, D.R. Stinson, Combinatorial Algorithms: Gener-
Société Vaudoisedes Sciences Naturelles 37 (1901) 241–272.
ation, Enumeration, and Search, CRC Press, Boca Raton, FL,
cited in [122].
USA, 1998.
[134] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data,
[156] P. Krishna, N.H. Vaidya, M. Chatterjee, D.K. Pradhan, A
Prentice-Hall, Englewood, NJ, USA, 1988.
cluster-based approach for routing in dynamic networks,
[135] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: A review,
ACM SIGCOMM Computer Communication Review 27 (2)
ACM Computing Surveys 31 (3) (1999) 264–323.
(1997) 49–64.
[136] K. Jain, V.V. Vazirani, Primal-dual approximation algorithms
[157] S.R. Kumar, J. Novak, P. Raghavan, A. Tomkins, On the
for metric facility location and k-median problems,
bursty evolution of blogspace, in: Proceedings of the Twelfth
in: Proceedings of the Fourtieth Annual Symposium on
International World-Wide Web Conference, WWW, ACM
Foundations of Computer Science, FOCS, IEEE Computer
Press, New York, NY, USA, 2003.
Society, Washington, DC, USA, 1999.
[158] M. Křivánek, J. Morávek, NP-hard problems in hierarchical-
[137] D.S. Johnson, C.R. Aragon, L.A. McGeoch, C. Schevon, tree clustering, Acta Informatica 23 (3) (1986) 311–323.
Optimization by simulated annealing: An experimental [159] S. Lakroum, V. Devlaminck, P. Terrier, P. Biela Enberg,
evaluation. Part I, graph partitioning, Operations Research J.-G. Postaire, Clustering of the Poincare vectors, in: IEEE
37 (6) (1989) 865–892. International Conference on Image Processing, vol. 2, IEEE,
[138] E.J.L. Johnson, A. Mehrotra, G.L. Nemhauser, Min-cut 2005.
clustering, Mathematical Programming 62 (1) (1993) 133–151. [160] K. Lang, S. Rao, A flow-based method for improving the ex-
[139] N. Kahale, A semidefinite bound for mixing rates of Markov pansion or conductance of graph cuts, in: G.L. Nemhauser,
chains, Random Structures and Algorithms 11 (4) (1998) D. Bienstock (Eds.), Proceedings of the Tenth International
299–313. Conference on Integer Programming and Combinatorial Op-
[140] J. Kalcsics, S. Nickel, M. Schröder, Toward a unified timization, IPCO, in: Lecture Notes in Computer Science,
territorial design approach: Applications, algorithms, and vol. 3064, Springer-Verlag GmbH, Berlin, Heidelberg, Ger-
GIS integration, TOP 13 (1) (2005) 1–56. many, 2004.
[141] N. Kannan, S. Selvaraj, M.M. Gromiha, S. Vishveshwara, [161] V. Latora, M. Marchiori, Efficient behavior of small-world
Clusters in α/β barrel proteins: Implications for protein networks, Physical Review Letters 87 (19) (2001) 198701.
structure, function, and folding: A graph theoretical [162] V. Latora, M. Marchiori, A measure of centrality based on the
approach, Proteins 43 (2) (2001) 103–112. network efficiency, Tech. Rep. cond-mat/0402050, arXiv.org,
[142] R. Kannan, S. Vempala, A. Vetta, On clusterings — good, bad February 2004.
and spectral, Journal of the ACM 51 (3) (2004) 497–515. [163] G.F. Lawler, Intersections of Random Walks, Probability and
[143] R.M. Karp, Reducibility among combinatorial problems, its Applications, Birkhäuser, Boston, MA, USA, 1991.
in: Proceedings of a Symposium on the Complexity of [164] R.-C. Li, Accuracy of computed eigenvectors via optimizing
Computer Computations, IBM, Plenum, NY, USA, 1972. a rayleigh quotient, Bit Numerical Mathematics 44 (3) (2004)
[144] D. Kempe, F. McSherry, A decentralized algorithm for 585–593.
spectral analysis, in: L. Babai (Ed.), Proceedings of the Thirty- [165] C.R. Lin, M. Gerla, Adaptive clustering for mobile wireless
sixth Annual Symposium on Theory of Computing, STOC, networks, IEEE Journal on Selected Areas in Communica-
ACM Press, New York, NY, USA, 2004. tions 15 (7) (1997) 1265–1275.
[166] A.H. Lipkus, A proof of the triangle inequality for the Tanimoto distance, Journal of Mathematical Chemistry 26 (1–3) (1999) 263–265.
[167] L. Lovász, Random walks on graphs: A survey, in: Bolyai Society Mathematical Studies, 2, in: Combinatorics, Pál Erdős is Eighty, vol. 2, Bolyai Mathematical Society, 1996, pp. 353–397.
[168] B. Luo, R.C. Wilson, E.R. Hancock, Spectral feature vectors for graph clustering, in: T. Caelli, A. Amin, R.P. Duin, M. Kamel, D. de Ridder (Eds.), Proceedings of the Joint IAPR International Workshops on Syntactical and Structural Pattern Recognition and Statistical Pattern Recognition, in: Lecture Notes in Computer Science, vol. 2396, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2002.
[169] B. Luo, R.C. Wilson, E.R. Hancock, Spectral clustering of graphs, in: G. Goos, J. Hartmanis, J. van Leeuwen (Eds.), Proceedings of the Tenth International Conference on Computer Analysis of Images and Patterns, CAIP, in: Lecture Notes in Computer Science, vol. 2756, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2003.
[170] R.M. MacGregor, On partitioning a graph: A theoretical and empirical study, Ph.D. Thesis, University of California, Berkeley, CA, USA, 1978.
[171] H. Matsuda, T. Ishihara, A. Hashimoto, Classifying molecular sequences using a linkage graph with their pairwise similarities, Theoretical Computer Science 210 (2) (1999) 305–325.
[172] D.W. Matula, F. Shahrokhi, Sparsest cuts and bottlenecks in graphs, Discrete Applied Mathematics 27 (1–2) (1990) 113–123.
[173] F. McSherry, Spectral partitioning of random graphs, in: Proceedings of the Forty-Second IEEE Symposium on Foundations of Computer Science, FOCS, IEEE Computer Society Press, Washington, DC, USA, 2001.
[174] F. McSherry, Spectral methods for data analysis, Ph.D. Thesis, University of Washington, Seattle, WA, USA, 2004.
[175] M. Meilă, J. Shi, Learning segmentation by random walks, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems, NIPS, MIT Press, Cambridge, MA, USA, 2000.
[176] M. Meilă, J. Shi, A random walks view of spectral segmentation, in: Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, Morgan Kaufmann, San Francisco, CA, USA, 2001.
[177] Z. Michalewicz, D.B. Fogel, How to Solve It: Modern Heuristics, 2nd ed., Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2004.
[178] M. Mihail, C. Gkantsidis, A. Saberi, E. Zegura, On the semantics of internet topologies, Tech. Rep. GIT-CC-02-07, College of Computing, Georgia Institute of Technology, Atlanta, GA, USA, 2002.
[179] B.L. Milenova, M.M. Campos, O-cluster: Scalable clustering of large high dimensional data sets, in: Proceedings of the IEEE International Conference on Data Mining, ICDM, 2002.
[180] M.E.J. Newman, Finding community structure in networks using the eigenvectors of matrices, Physical Review E 74 (3) (2006) 036104.
[181] M.E.J. Newman, A measure of betweenness centrality based on random walks, Tech. Rep. cond-mat/0309045, arXiv.org, September 2003.
[182] M.E.J. Newman, Properties of highly clustered networks, Physical Review E 68 (2) (2003) 026121.
[183] M.E.J. Newman, The structure and function of complex networks, SIAM Review 45 (2) (2003) 167–256.
[184] M.E.J. Newman, Detecting community structure in networks, The European Physical Journal B 38 (2) (2004) 321–330.
[185] M.E.J. Newman, Fast algorithm for detecting community structure in networks, Physical Review E 69 (6) (2004) 066133.
[186] M.E.J. Newman, M. Girvan, Mixing patterns and community structure in networks, in: R. Pastor-Satorras, M. Rubi, A. Díaz-Guilera (Eds.), Statistical Mechanics of Complex Networks: Proceedings of the XVIII Sitges Conference on Statistical Mechanics, in: Lecture Notes in Physics, vol. 625, Springer-Verlag GmbH, Berlin, Germany, 2003.
[187] M.E.J. Newman, M. Girvan, Finding and evaluating community structure in networks, Physical Review E 69 (2) (2004) 026113.
[188] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Proceedings of the Fourteenth Conference on Advances in Neural Information Processing Systems, vol. 2, The MIT Press, Cambridge, MA, USA, 2002.
[189] M. O’Kelly, A clustering approach to the planar hub location problem, Annals of Operations Research 40 (1) (1992) 339–353.
[190] P. Orponen, S.E. Schaeffer, Locally computable approximations for spectral clustering and absorption times of random walks (in preparation).
[191] P. Orponen, S.E. Schaeffer, Local clustering of large graphs by approximate Fiedler vectors, in: S. Nikoletseas (Ed.), Proceedings of the Fourth International Workshop on Efficient and Experimental Algorithms, WEA, in: Lecture Notes in Computer Science, vol. 3505, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2005.
[192] C.H. Papadimitriou, Computational Complexity, Addison Wesley, Reading, MA, USA, 1993.
[193] J.B. Pereira-Leal, A.J. Enright, C.A. Ouzounis, Detection of functional modules from protein interaction networks, Proteins: Structure, Function, and Bioinformatics 54 (1) (2003) 49–57.
[194] C.E. Perkins (Ed.), Ad Hoc Networking, Addison Wesley, Reading, MA, USA, 2001.
[195] J. Plesník, A heuristic for the p-center problem in graphs, Discrete Applied Mathematics 17 (1987) 263–268.
[196] A. Pothen, H.D. Simon, K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM Journal on Matrix Analysis and Applications 11 (3) (1990) 430–452.
[197] J. Puhan, T. Tuma, I. Fajfar, Spice for Windows 95/98/NT, Elektrotehniški vestnik (Electrotechnical Review, Ljubljana, Slovenia) 65 (5) (1998) 267–271.
[198] H. Qiu, E.R. Hancock, Graph matching and clustering using spectral partitions, Pattern Recognition 39 (1) (2004) 22–34.
[199] J.M. Rabaey, The SPICE circuit simulator, EECS Department of the University of California at Berkeley. http://bwrc.eecs.berkeley.edu/Classes/ICBook/SPICE/.
[200] V.V. Raghavan, C.T. Yu, A comparison of the stability characteristics of some graph theoretic clustering methods, IEEE Transactions on Pattern Analysis and Machine Intelligence 3 (4) (1981) 393–403.
[201] R.Z. Ríos-Mercado, E. Fernández, A reactive GRASP for a sales territory design problem with multiple balancing requirements, Tech. Rep. PISIS-2006-12, Graduate Program in Systems Engineering, Universidad Autónoma de Nuevo León, San Nicolás de los Garza, Mexico, September 2006.
[202] A. Robles-Kelly, E.R. Hancock, Graph edit distance from spectral seriation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (3) (2005) 365–378.
[203] K.A. Rytkönen, A spring-force visualization algorithm implemented in Java (2003), unpublished.
[204] M. Saerens, F. Fouss, L. Yen, P. Dupont, The principal components analysis of a graph, and its relationships to spectral clustering, in: J.-F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), Proceedings of the Fifteenth European Conference on Machine Learning, ECML, in: Lecture Notes in Computer Science, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2004.
[205] S.E. Schaeffer, Stochastic local clustering for massive graphs, in: T.B. Ho, D. Cheung, H. Liu (Eds.), Proceedings of the Ninth Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD, in: Lecture Notes in Computer Science, vol. 3518, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2005.
[206] S.E. Schaeffer, Algorithms for nonuniform networks, Ph.D. Thesis, Helsinki University of Technology, Espoo, Finland, April 2006.
[207] S.E. Schaeffer, S. Marinoni, M. Särelä, P. Nikander, Dynamic local clustering for hierarchical ad hoc networks, in: Proceedings of the IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks, SECON’06, International Workshop on Wireless Ad-hoc and Sensor Networks, IWWAN’06, subtrack, IEEE Communications Society, New York, NY, USA, 2006.
[208] R. Shamir, R. Sharan, D. Tsur, Cluster graph modification problems, in: Proceedings of the Twenty-eighth International Workshop on Graph-Theoretic Concepts in Computer Science, in: Lecture Notes in Computer Science, vol. 2573, Springer-Verlag GmbH, Berlin, Germany, 2002.
[209] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 888–901.
[210] J. Šíma, S.E. Schaeffer, On the NP-completeness of some graph cluster measures, in: J. Wiedermann, G. Tel, J. Pokorný, M. Bieliková, J. Štuller (Eds.), Proceedings of the Thirty-second International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM, in: Lecture Notes in Computer Science, vol. 3831, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2006.
[211] A. Sinclair, Algorithms for Random Generation & Counting: A Markov Chain Approach, Birkhäuser, Boston, MA, USA, 1993.
[212] K. Soumyanath, J.S. Deogun, On bisection width of partial k-trees, Congressus Numerantium 74 (1990) 45–51.
[213] D.A. Spielman, S.-H. Teng, Spectral partitioning works: Planar graphs and finite element meshes, in: Proceedings of the Thirty-seventh IEEE Symposium on Foundations of Computing, FOCS, IEEE Computer Society Press, Los Alamitos, CA, USA, 1996.
[214] D.A. Spielman, S.-H. Teng, Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, in: L. Babai (Ed.), Proceedings of the Thirty-sixth Annual Symposium on Theory of Computing, STOC, ACM Press, New York, NY, USA, 2004.
[215] S.P. Strunkov, On weakly cospectral graphs, Mathematical Notes 80 (4) (2006) 590–592. Translated from Matematicheskie Zametki 80 (4), pp. 627–629.
[216] J. Sucec, I. Marsic, Clustering overhead for hierarchical routing in mobile ad hoc networks, in: Proceedings of the Twenty-first Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 3, IEEE Computer Society Press, Los Alamitos, CA, USA, 2002.
[217] C. Swamy, D.B. Shmoys, Fault-tolerant facility location, in: Proceedings of the Fourteenth Annual ACM–SIAM Symposium on Discrete Algorithms, SODA, ACM, SIAM, 2003.
[218] B. Świercz, Ł. Starzak, M. Zubert, A. Napieralski, DMCS-SPICE circuit analysis. http://lux.dmcs.p.lodz.pl/~swierczu/java_gui.php.
[219] P.-N. Tan, M. Steinbach, V. Kumar, Cluster Analysis: Additional Issues and Algorithms, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005, pp. 569–650.
[220] T. Tanimoto, IBM Internal Report, November 17, 1957.
[221] M. Thelwall, A web crawler design for data mining, Journal of Information Science 27 (5) (2001) 319–325.
[222] G.T. Toussaint, Proximity graphs for nearest neighbor decision rules: Recent progress, in: Proceedings of the Thirty-Fourth Symposium on Computing and Statistics, Interface-2002, The Interface Foundation of North America, Fairfax Station, VA, USA, 2002.
[223] E.R. van Dam, W.H. Haemers, Which graphs are determined by their spectrum? Linear Algebra and its Applications 373 (2003) 241–272.
[224] S.M. van Dongen, Graph clustering by flow simulation, Ph.D. Thesis, Universiteit Utrecht, Utrecht, The Netherlands, May 2000.
[225] L. Vargas Suárez, R.Z. Ríos-Mercado, F. López, Usando GRASP para resolver un problema de definición de territorios de atención comercial, in: M. Arenas, F. Herrera, M. Lozano, J. Merelo, G. Romero, A. Sánchez (Eds.), Proceedings of the IV Spanish Conference on Metaheuristics, in: Evolutionary and Bioinspired Algorithms, vol. 2, Granada, Spain, 2005 (in Spanish).
[226] V.V. Vazirani, Approximation Algorithms, Springer-Verlag GmbH, Berlin, Germany, 2001.
[227] S.E. Virtanen, Clustering the Chilean web, in: Proceedings of the First Latin American Web Congress, LAWEB, IEEE Computer Society, Los Alamitos, CA, USA, 2003.
[228] S.E. Virtanen, Properties of nonuniform random graph models, Tech. Rep. HUT-TCS-A77, Helsinki University of Technology, Laboratory for Theoretical Computer Science, Espoo, Finland, May 2003.
[229] D. Vukadinović, P. Huang, T. Erlebach, On the spectrum and structure of Internet topology graphs, in: H. Unger, T. Böhme, A.R. Mikler (Eds.), Proceedings of the Second International Workshop on Innovative Internet Computing Systems, in: Lecture Notes in Computer Science, vol. 2346, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2002.
[230] T. Washio, H. Motoda, Multi relational data mining (MRDM): State of the art of graph-based data mining, ACM SIGKDD Explorations Newsletter 5 (1) (2003) 59–68.
[231] D.J. Watts, Small Worlds, Princeton University Press, Princeton, NJ, USA, 1999.
[232] R. Weber, P. Zezula, Is similarity search useful for high dimensional spaces? in: Proceedings of the Tenth International Workshop on Database and Expert Systems Applications, 1999.
[233] W.T. Williams, M.B. Dale, P. Macnaughton-Smith, An objective method of weighting in similarity analysis, Nature 201 (426).
[234] R. Wilson, X. Bai, E.R. Hancock, Graph clustering using symmetric polynomials and local linear embedding, in: British Machine Vision Conference, 2003.
[235] S.M. Wong, Y.Y. Yao, An information-theoretic measure of term specificity, Journal of the American Society for Information Science 43 (1) (1992) 54–61.
[236] W.-C. Wong, A.W. Fu, Incremental document clustering for web page classification, in: J. Qun (Ed.), International Conference on Information Society in the 21st Century: Emerging Technologies and New Challenges, The University of Aizu, Aizu-Wakamatsu, Fukushima, Japan, 2000.
[237] A.Y. Wu, M. Garland, J. Han, Mining scale-free networks using geodesic clustering, in: W. Kim, R. Kohavi, J. Gehrke, W. DuMouchel (Eds.), Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining, KDD, ACM Press, New York, NY, USA, 2004.
[238] F. Wu, B.A. Huberman, Finding communities in linear time: A physics approach, The European Physical Journal B 38 (2) (2004) 331–338.
[239] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (8) (1991) 841–847.
[240] Y. Xu, V. Olman, D. Xu, Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees, Bioinformatics 18 (4) (2002) 536–545.
[241] J.-T. Yan, P.-Y. Hsiao, A new fuzzy-clustering-based approach for two-way circuit partitioning, in: Proceedings of the Eighth International Conference on VLSI Design, IEEE, New York, NY, USA, 1995.
[242] B. Yang, J. Liu, An efficient probabilistic approach to network community mining, in: J. Yao, P. Lingras, W.-Z. Wu, M. Szczuka, N. Cercone, D. Slezak (Eds.), Rough Sets and Knowledge Technology, Second International Conference, RSKT 2007, 14–16 May, Toronto, Canada, in: Lecture Notes in Computer Science, vol. 4481, 2007, pp. 267–275.
[243] Q. Yang, S. Lonardi, A parallel algorithm for clustering protein–protein interaction networks, in: Workshops and Poster Abstracts of the 2005 IEEE Computational Systems Bioinformatics Conference, 2005.
[244] W.W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33 (1977) 452–473.
[245] C.T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Transactions on Computers C-20 (1) (1971) 68–86.
[246] O.R. Zaïane, A. Foss, C.-H. Lee, W. Wang, On data clustering analysis: Scalability, constraints, and validation, in: Proceedings of the Sixth Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD, in: Lecture Notes in Computer Science, vol. 2336, Springer-Verlag GmbH, Berlin, Heidelberg, Germany, 2002.
[247] H. Zanghi, C. Ambroise, V. Miele, Fast online graph clustering via Erdős–Rényi mixture, Tech. Rep. 8, Jouy-en-Josas/Paris/Evry, France, April 2007 (submitted for publication).
[248] S. Zhong, J. Ghosh, A unified framework for model-based clustering, Journal of Machine Learning Research 4 (2003) 1001–1037.
[249] A.A. Zoltners, P. Sinha, Sales territory alignment: A review and model, Management Science 29 (1983) 1237–1256.