On Spectral Clustering: Analysis and an Algorithm
Abstract
1 Introduction
The task of finding good clusters has been the focus of considerable research in
machine learning and pattern recognition. For clustering points in $\mathbb{R}^n$ (a main
application focus of this paper), one standard approach is based on generative models,
in which algorithms such as EM are used to learn a mixture density. These
approaches suffer from several drawbacks. First, to use parametric density estimators,
harsh simplifying assumptions usually need to be made (e.g., that the density
of each cluster is Gaussian). Second, the log likelihood can have many local optima,
and therefore multiple restarts are required to find a good solution using iterative
algorithms. Algorithms such as K-means have similar problems.
A promising alternative that has recently emerged in a number of fields is to use
spectral methods for clustering. Here, one uses the top eigenvectors of a matrix
derived from the distance between points. Such algorithms have been successfully
used in many applications including computer vision and VLSI design [5, 1]. But
despite their empirical successes, different authors still disagree on exactly which
eigenvectors to use and how to derive clusters from them (see [11] for a review).
Also, the analysis of these algorithms, which we briefly review below, has tended to
focus on simplified algorithms that only use one eigenvector at a time.
One line of analysis makes the link to spectral graph partitioning, in which the second
eigenvector of a graph's Laplacian is used to define a semi-optimal cut. Here,
the eigenvector is seen as solving a relaxation of an NP-hard discrete graph partitioning
problem [3], and it can be shown that cuts based on the second eigenvector
give a guaranteed approximation to the optimal cut [9, 3]. This analysis can be
extended to clustering by building a weighted graph in which nodes correspond to
datapoints and edges are related to the distances between the points. Since the majority
of analyses in spectral graph partitioning appear to deal with partitioning the
graph into exactly two parts, these methods are then typically applied recursively
to find k clusters (e.g., [9]). Experimentally, it has been observed that using more
eigenvectors and directly computing a k-way partitioning is better (e.g., [5, 1]).
Here, we build upon the recent work of Weiss [11] and Meila and Shi [6], who
analyzed algorithms that use k eigenvectors simultaneously in simple settings. We
propose a particular manner to use the k eigenvectors simultaneously, and give
conditions under which the algorithm can be expected to do well.
2 Algorithm
Given a set of points $S = \{s_1, \ldots, s_n\}$ in $\mathbb{R}^l$ that we want to cluster into k subsets:
1. Form the affinity matrix $A \in \mathbb{R}^{n \times n}$ defined by $A_{ij} = \exp(-\|s_i - s_j\|^2 / 2\sigma^2)$ if $i \neq j$, and $A_{ii} = 0$.
2. Define D to be the diagonal matrix whose (i, i)-element is the sum of A's i-th row, and construct the matrix $L = D^{-1/2} A D^{-1/2}$.¹
3. Find $x_1, x_2, \ldots, x_k$, the k largest eigenvectors of L (chosen to be orthogonal to each other in the case of repeated eigenvalues), and form the matrix $X = [x_1\, x_2\, \cdots\, x_k] \in \mathbb{R}^{n \times k}$ by stacking the eigenvectors in columns.
4. Form the matrix Y from X by renormalizing each of X's rows to have unit length (i.e., $Y_{ij} = X_{ij} / (\sum_j X_{ij}^2)^{1/2}$).
5. Treating each row of Y as a point in $\mathbb{R}^k$, cluster them into k clusters via K-means or any other algorithm (that attempts to minimize distortion).
6. Finally, assign the original point $s_i$ to cluster j if and only if row i of the matrix Y was assigned to cluster j.
Here, the scaling parameter $\sigma^2$ controls how rapidly the affinity $A_{ij}$ falls off with
the distance between $s_i$ and $s_j$, and we will later describe a method for choosing
it automatically. We also note that this is only one of a large family of possible
algorithms, and later discuss some related methods (e.g., [6]).
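For concreteness, the full pipeline can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the use of numpy, scipy and scikit-learn's KMeans, and the function and argument names, are assumptions made here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(S, k, sigma):
    """Sketch of steps 1-6 for an (n, l) array of points S and a fixed sigma."""
    # Step 1: affinity matrix A_ij = exp(-||s_i - s_j||^2 / (2 sigma^2)), zero diagonal.
    A = np.exp(-cdist(S, S, metric="sqeuclidean") / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Step 2: L = D^{-1/2} A D^{-1/2}, with D the diagonal matrix of A's row sums.
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))

    # Step 3: X = the k largest eigenvectors of L, stacked in columns.
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues returned in ascending order
    X = eigvecs[:, -k:]

    # Step 4: renormalize each row of X to unit length to obtain Y.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Steps 5-6: cluster Y's rows with K-means; row i's label becomes s_i's cluster.
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```

The K-means call here simply stands in for any distortion-minimizing clustering step; Section 4 describes how $\sigma$ can be chosen automatically instead of being passed in by hand.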
At first sight, this algorithm seems to make little sense. Since we run K-means
in step 5, why not just apply K-means directly to the data? Figure 1e shows an
example. The natural clusters in $\mathbb{R}^2$ do not correspond to convex regions, and K-means
run directly finds the unsatisfactory clustering in Figure 1i. But once we map
the points to $\mathbb{R}^k$ (Y's rows), they form tight clusters (Figure 1h) from which our
method obtains the good clustering shown in Figure 1e. We note that the clusters
in Figure 1h lie at 90° to each other relative to the origin (cf. [8]).
¹Readers familiar with spectral graph theory [3] may be more familiar with the Laplacian
$I - L$. But as replacing L with $I - L$ would complicate our later discussion, and only
changes the eigenvalues (from $\lambda_i$ to $1 - \lambda_i$) and not the eigenvectors, we instead use L.
3 Analysis of algorithm
To analyze the algorithm, it is informative to first consider an "ideal" case in which the
three clusters $S_1$, $S_2$, $S_3$ (of sizes $n_1$, $n_2$, $n_3$, with $n = n_1 + n_2 + n_3$) are infinitely
far apart, so that all inter-cluster affinities are zero. Ordering the points by cluster, the
resulting affinity matrix $\hat{A}$, and hence $\hat{L}$, are block diagonal:

$$\hat{A} = \begin{bmatrix} \hat{A}^{(11)} & 0 & 0 \\ 0 & \hat{A}^{(22)} & 0 \\ 0 & 0 & \hat{A}^{(33)} \end{bmatrix}, \qquad \hat{L} = \begin{bmatrix} \hat{L}^{(11)} & 0 & 0 \\ 0 & \hat{L}^{(22)} & 0 \\ 0 & 0 & \hat{L}^{(33)} \end{bmatrix} \tag{1}$$
where we have adopted the convention of using parenthesized superscripts to index
into subblocks of vectors/matrices, and $\hat{L}^{(ii)} = (\hat{D}^{(ii)})^{-1/2} \hat{A}^{(ii)} (\hat{D}^{(ii)})^{-1/2}$. Here,
$\hat{A}^{(ii)} = A^{(ii)} \in \mathbb{R}^{n_i \times n_i}$ is the matrix of "intra-cluster" affinities for cluster i. For
future use, also define $\hat{d}^{(i)} \in \mathbb{R}^{n_i}$ to be the vector containing $\hat{D}^{(ii)}$'s diagonal elements,
and $\hat{d} \in \mathbb{R}^n$ to contain $\hat{D}$'s diagonal elements.
To construct $\hat{X}$, we find $\hat{L}$'s first $k = 3$ eigenvectors. Since $\hat{L}$ is block diagonal, its
eigenvalues and eigenvectors are the union of the eigenvalues and eigenvectors of its
blocks (the latter padded appropriately with zeros). It is straightforward to show
that $\hat{L}^{(ii)}$ has a strictly positive principal eigenvector $x_1^{(i)} \in \mathbb{R}^{n_i}$ with eigenvalue
1. Also, since $\hat{A}^{(ii)}_{jk} > 0$ ($j \neq k$), the next eigenvalue is strictly less than 1 (see,
e.g., [3]). Thus, stacking $\hat{L}$'s eigenvectors in columns to obtain $\hat{X}$, we have:

$$\hat{X} = \begin{bmatrix} x_1^{(1)} & 0 & 0 \\ 0 & x_1^{(2)} & 0 \\ 0 & 0 & x_1^{(3)} \end{bmatrix} \in \mathbb{R}^{n \times 3}. \tag{2}$$
Actually, a subtlety needs to be addressed here. Since 1 is a repeated eigenvalue
of $\hat{L}$, we could just as easily have picked any other 3 orthogonal vectors spanning
the same subspace as $\hat{X}$'s columns, and defined them to be our first 3 eigenvectors.
That is, $\hat{X}$ could have been replaced by $\hat{X}R$ for any orthogonal matrix $R \in \mathbb{R}^{3 \times 3}$
($R^T R = R R^T = I$). Note that this immediately suggests that one use considerable
caution in attempting to interpret the individual eigenvectors of L, as the choice
of $\hat{X}$'s columns is arbitrary up to a rotation, and can easily change due to small
perturbations to A or even differences in the implementation of the eigensolvers.
Instead, what we can reasonably hope to guarantee about the algorithm will be
arrived at not by considering the (unstable) individual columns of $\hat{X}$, but instead
the subspace spanned by the columns of $\hat{X}$, which can be considerably more stable.
Next, when we renormalize each of $\hat{X}$'s rows to have unit length, we obtain:

$$\hat{Y} = \begin{bmatrix} \hat{Y}^{(1)} \\ \hat{Y}^{(2)} \\ \hat{Y}^{(3)} \end{bmatrix} = \begin{bmatrix} \vec{1} & 0 & 0 \\ 0 & \vec{1} & 0 \\ 0 & 0 & \vec{1} \end{bmatrix} R \tag{3}$$

where $\vec{1}$ denotes a column vector of ones of the appropriate length ($n_i$ for the i-th block),
and we have used $\hat{Y}^{(i)} \in \mathbb{R}^{n_i \times k}$ to denote the i-th subblock of $\hat{Y}$. Letting $\hat{y}^{(i)}_j$
denote the j-th row of $\hat{Y}^{(i)}$, we therefore see that $\hat{y}^{(i)}_j$ is the i-th row of the orthogonal
matrix R. This gives us the following proposition.
Proposition 1 Let A's off-diagonal blocks $A^{(ij)}$, $i \neq j$, be zero. Also assume
that each cluster $S_i$ is connected.² Then there exist k orthogonal vectors $r_1, \ldots, r_k$
($r_i^T r_j = 1$ if $i = j$, 0 otherwise) so that $\hat{Y}$'s rows satisfy

$$\hat{y}^{(i)}_j = r_i \tag{4}$$

for all $i = 1, \ldots, k$, $j = 1, \ldots, n_i$.

In other words, there are k mutually orthogonal points on the surface of the unit
k-sphere around which $\hat{Y}$'s rows will cluster. Moreover, these clusters correspond
exactly to the true clustering of the original data.
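Proposition 1 is easy to check numerically. The following sketch (an assumed toy setup, not an experiment from the paper) builds three far-apart blobs so that the affinity matrix is numerically block diagonal, runs steps 1-4, and verifies that the rows of Y collapse onto k mutually orthogonal points.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three far-apart 2-D blobs: inter-cluster affinities underflow to 0, so A is
# (numerically) block diagonal, matching the ideal case of Proposition 1.
centers = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 50.0]])
S = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
labels = np.repeat([0, 1, 2], 30)

sigma = 1.0
sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
A = np.exp(-sq / (2 * sigma ** 2))
np.fill_diagonal(A, 0.0)
d = A.sum(axis=1)
L = A / np.sqrt(np.outer(d, d))

X = np.linalg.eigh(L)[1][:, -3:]                      # three largest eigenvectors
Y = X / np.linalg.norm(X, axis=1, keepdims=True)

# Rows within a cluster coincide; rows across clusters are mutually orthogonal.
R = np.vstack([Y[labels == i].mean(axis=0) for i in range(3)])
print(np.round(R @ R.T, 3))                           # approximately the identity matrix
print(max(np.linalg.norm(Y[labels == i] - R[i], axis=1).max() for i in range(3)))  # ~0
```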
In the general case, A's off-diagonal blocks are non-zero, but we still hope to recover
guarantees similar to Proposition 1. Viewing $E = A - \hat{A}$ as a perturbation to the
"ideal" $\hat{A}$ that results in $A = \hat{A} + E$, we ask: When can we expect the resulting rows
of Y to cluster similarly to the rows of $\hat{Y}$? Specifically, when will the eigenvectors
of L, which we now view as a perturbed version of $\hat{L}$, be "close" to those of $\hat{L}$?
Matrix perturbation theory [10] indicates that the stability of the eigenvectors of a
matrix is determined by the eigengap. More precisely, the subspace spanned by $\hat{L}$'s
first 3 eigenvectors will be stable to small changes to $\hat{L}$ if and only if the eigengap
$\delta = |\lambda_3 - \lambda_4|$, the difference between the 3rd and 4th eigenvalues of $\hat{L}$, is large. As
discussed previously, the eigenvalues of $\hat{L}$ are the union of the eigenvalues of $\hat{L}^{(11)}$,
$\hat{L}^{(22)}$, and $\hat{L}^{(33)}$, and $\lambda_3 = 1$. Letting $\lambda_j^{(i)}$ be the j-th largest eigenvalue of $\hat{L}^{(ii)}$, we
therefore see that $\lambda_4 = \max_i \lambda_2^{(i)}$. Hence, the assumption that $|\lambda_3 - \lambda_4|$ be large is
exactly the assumption that $\max_i \lambda_2^{(i)}$ be bounded away from 1.
Assumption A1. There exists $\delta > 0$ so that, for all $i = 1, \ldots, k$, $\lambda_2^{(i)} \le 1 - \delta$.
Note that $\lambda_2^{(i)}$ depends only on $\hat{L}^{(ii)}$, which in turn depends only on $\hat{A}^{(ii)} = A^{(ii)}$,
the matrix of intra-cluster similarities for cluster $S_i$. The assumption on $\lambda_2^{(i)}$ has a
very natural interpretation in the context of clustering. Informally, it captures the
idea that if we want an algorithm to find the clusters $S_1$, $S_2$ and $S_3$, then we require
that each of these sets $S_i$ really look like a "tight" cluster. Consider an example
in which $S_1 = S_{1.1} \cup S_{1.2}$, where $S_{1.1}$ and $S_{1.2}$ are themselves two well separated
clusters. Then $S = S_{1.1} \cup S_{1.2} \cup S_2 \cup S_3$ looks like (at least) four clusters, and it
would be unreasonable to expect an algorithm to correctly guess what partition of
the four clusters into three subsets we had in mind.
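To make the eigengap computation concrete, the sketch below (toy affinity blocks, assumed here purely for illustration) confirms that the block-diagonal $\hat{L}$ has its first three eigenvalues equal to 1 and that $\lambda_4 = \max_i \lambda_2^{(i)}$, so $\delta = |\lambda_3 - \lambda_4|$.

```python
import numpy as np
from scipy.linalg import block_diag

def normalized_affinity(block):
    """L^(ii) = (D^(ii))^{-1/2} A^(ii) (D^(ii))^{-1/2} for one intra-cluster block."""
    d = block.sum(axis=1)
    return block / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(1)
# Three symmetric intra-cluster affinity blocks with zero diagonals (assumed toy data).
blocks = []
for n_i in (10, 12, 8):
    B = rng.uniform(0.5, 1.0, size=(n_i, n_i))
    B = (B + B.T) / 2
    np.fill_diagonal(B, 0.0)
    blocks.append(B)

# Ideal case: L_hat is block diagonal, so its spectrum is the union of the blocks' spectra.
L_hat = block_diag(*[normalized_affinity(B) for B in blocks])
eigs = np.sort(np.linalg.eigvalsh(L_hat))[::-1]                 # descending order
lambda2 = [np.sort(np.linalg.eigvalsh(normalized_affinity(B)))[-2] for B in blocks]

print(eigs[:3])                    # the top three eigenvalues are all 1 (one per block)
print(eigs[3], max(lambda2))       # lambda_4 of L_hat equals max_i lambda_2^(i)
print("eigengap:", eigs[2] - eigs[3])
```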
This connection between the eigengap and the cohesiveness of the individual clusters
can be formalized in a number of ways.
Assumption A1.1. Define the Cheeger constant [3] of the cluster $S_i$ to be

$$h(S_i) = \min_{I} \frac{\sum_{j \in I,\, k \notin I} A^{(ii)}_{jk}}{\min\left\{ \sum_{j \in I} \hat{d}^{(i)}_j,\ \sum_{k \notin I} \hat{d}^{(i)}_k \right\}} \tag{5}$$

where the outer minimum is over all index subsets $I \subseteq \{1, \ldots, n_i\}$. Assume that
there exists $\delta > 0$ so that $(h(S_i))^2 / 2 \ge \delta$ for all i.
²This condition is satisfied by $A^{(ii)}_{jk} > 0$ ($j \neq k$), which is true in our case.
A standard result in spectral graph theory shows that Assumption A1.1 implies
Assumption A1. Recall that $\hat{d}^{(i)}_j = \sum_k A^{(ii)}_{jk}$ characterizes how "well connected"
or how "similar" point j is to the other points in the same cluster. The term inside
the $\min_I\{\cdot\}$ characterizes how well $(I, \bar{I})$ partitions $S_i$ into two subsets, and the
minimum over I picks out the best such partition. Specifically, if there is a partition
of $S_i$'s points so that the weight of the edges across the partition is small, and so
that each of the partitions has moderately large "volume" (sum of $\hat{d}^{(i)}_j$'s), then the
Cheeger constant will be small. Thus, the assumption that the Cheeger constants
$h(S_i)$ be large is exactly that the clusters $S_i$ be hard to split into two subsets.
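For a small cluster, the Cheeger constant in Eq. (5) can be computed by brute force over all subsets I, which also lets one check the standard bound $\lambda_2 \le 1 - h^2/2$ behind "A1.1 implies A1". The sketch below is illustrative only: it is exponential in the cluster size and uses an assumed random affinity block.

```python
import itertools
import numpy as np

def cheeger_constant(A_block):
    """Brute-force h(S_i) from Eq. (5) for one intra-cluster affinity matrix (small n_i only)."""
    n = A_block.shape[0]
    d = A_block.sum(axis=1)                          # intra-cluster degrees d_j^(i)
    best = np.inf
    for r in range(1, n):                            # all nonempty proper subsets I
        for I in itertools.combinations(range(n), r):
            I = np.array(I)
            J = np.setdiff1d(np.arange(n), I)        # complement of I
            cut = A_block[np.ix_(I, J)].sum()        # weight of edges crossing the partition
            vol = min(d[I].sum(), d[J].sum())        # smaller "volume" of the two sides
            best = min(best, cut / vol)
    return best

rng = np.random.default_rng(2)
A_block = rng.uniform(0.2, 1.0, size=(8, 8))
A_block = (A_block + A_block.T) / 2
np.fill_diagonal(A_block, 0.0)

h = cheeger_constant(A_block)
d = A_block.sum(axis=1)
lam2 = np.sort(np.linalg.eigvalsh(A_block / np.sqrt(np.outer(d, d))))[-2]
print(h, 1 - h ** 2 / 2, lam2)    # Cheeger's inequality: lambda_2 <= 1 - h^2 / 2
```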
We can also relate the eigengap to the mixing time of a random walk (as in [6])
defined on the points of a cluster, in which the chance of transitioning from point i
to j is proportional to $A_{ij}$, so that we tend to jump to nearby points. Assumption
A1 is equivalent to assuming that, for such a walk defined on the points of any
one of the clusters $S_i$, the corresponding transition matrix has second eigenvalue at
most $1 - \delta$. The mixing time of a random walk is governed by the second eigenvalue;
thus, this assumption is exactly that the walks mix rapidly. Intuitively, this will be
true for tight (or at least fairly "well connected") clusters, and untrue if a cluster
consists of two well-separated sets of points so that the random walk takes a long
time to transition from one half of the cluster to the other. Assumption A1 can also
be related to the existence of multiple paths between any two points in the same
cluster.
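The random-walk reading of Assumption A1 can be checked directly: the transition matrix $P = D^{-1}A$ built from one cluster's affinities is similar to the symmetric $D^{-1/2} A D^{-1/2}$, so they share eigenvalues, and A1 says the walk's second eigenvalue stays below $1 - \delta$. A small sketch, with the two example clusters assumed here for illustration:

```python
import numpy as np

def second_eigenvalue_of_walk(points, sigma=1.0):
    """lambda_2 of the random-walk transition matrix P = D^{-1} A on one cluster.

    P is similar to the symmetric D^{-1/2} A D^{-1/2}, so we read the eigenvalues
    off the symmetric matrix instead."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    return np.sort(np.linalg.eigvalsh(A / np.sqrt(np.outer(d, d))))[-2]

rng = np.random.default_rng(3)
tight = rng.normal(scale=0.5, size=(40, 2))                          # one cohesive cluster
split = np.vstack([tight[:20], tight[20:] + np.array([8.0, 0.0])])   # two far-apart halves

print(second_eigenvalue_of_walk(tight))   # well below 1: the walk mixes quickly
print(second_eigenvalue_of_walk(split))   # close to 1: slow mixing between the halves
```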
Assumption A2. There is some fixed $\epsilon_1 > 0$, so that for every $i_1, i_2 \in \{1, \ldots, k\}$,
$i_1 \neq i_2$, we have that

$$\sum_{j \in S_{i_1}} \sum_{k \in S_{i_2}} \frac{A_{jk}^2}{d_j d_k} \le \epsilon_1. \tag{6}$$
To gain intuition about this, consider the case of two "dense" clusters $i_1$ and $i_2$ of
size O(n) each. Since $d_j$ measures how "connected" point j is to other points in
the same cluster, it will be $d_j = O(n)$ in this case, so each term of the sum, which is
over $O(n^2)$ terms, is divided by $d_j d_k = O(n^2)$. Thus, as long as the individual $A_{jk}$'s
are small, the sum will also be small, and the assumption will hold with small $\epsilon_1$.
Whereas $d_j$ measures how connected $s_j \in S_i$ is to the rest of $S_i$, $\sum_{k : k \notin S_i} A_{jk}$
measures how connected $s_j$ is to points in other clusters. The next assumption is
that all points must be more connected to points in the same cluster than to points
in other clusters; specifically, that the ratio between these two quantities be small.
Assumption A3. For some fixed $\epsilon_2 > 0$, for every $i = 1, \ldots, k$, $j \in S_i$, we have:
(7)
For intuition about this assumption, again consider the case of densely connected
clusters (as we did previously). Here, the quantity in parentheses on the right-hand
side is O(1), so this becomes equivalent to demanding that the following ratio be
small: $(\sum_{k : k \notin S_i} A_{jk}) / d_j = (\sum_{k : k \notin S_i} A_{jk}) / (\sum_{k : k \in S_i} A_{jk}) = O(\epsilon_2)$.
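Given an affinity matrix and a candidate labelling, the quantities behind Assumptions A2 and A3 can be measured directly. The helper below is hypothetical (it is not from the paper, and for A3 it computes only the inter/intra ratio discussed above, not the full right-hand side of Eq. (7)):

```python
import numpy as np

def cross_cluster_quantities(A, labels):
    """Hypothetical helper: epsilon_1-style sums (Eq. 6) and A3-style inter/intra ratios.

    `labels[j]` is the cluster of point j; d_j is taken as the intra-cluster degree,
    as in the surrounding discussion."""
    n = len(labels)
    d_intra = np.array([A[j, labels == labels[j]].sum() for j in range(n)])

    eps1_sums = {}
    clusters = np.unique(labels)
    for i1 in clusters:
        for i2 in clusters:
            if i1 < i2:
                block = A[np.ix_(labels == i1, labels == i2)]
                denom = np.outer(d_intra[labels == i1], d_intra[labels == i2])
                eps1_sums[(i1, i2)] = (block ** 2 / denom).sum()   # should be <= epsilon_1

    # A3-style ratio: each point's connection to other clusters relative to its own cluster.
    inter = np.array([A[j, labels != labels[j]].sum() for j in range(n)])
    ratios = inter / d_intra                                        # should be O(epsilon_2)
    return eps1_sums, ratios
```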
Assumption A4. There is some constant $C > 0$ so that for every $i = 1, \ldots, k$,
$j = 1, \ldots, n_i$, we have $\hat{d}^{(i)}_j \ge (\sum_{k=1}^{n_i} \hat{d}^{(i)}_k) / (C n_i)$.

This last assumption is a fairly benign one, requiring that no point in a cluster be "too
much less" connected than the other points in the same cluster.
Theorem 2 Let Assumptions A1, A2, A3 and A4 hold. Set $\epsilon = \sqrt{k(k-1)\epsilon_1 + k\epsilon_2^2}$.
If $\delta > (2 + \sqrt{2})\epsilon$, then there exist k orthogonal vectors $r_1, \ldots, r_k$ ($r_i^T r_j = 1$ if $i = j$,
0 otherwise) so that Y's rows satisfy
(8)
Thus, the rows of Y will form tight clusters around k well-separated points (at 90°
from each other) on the surface of the k-sphere according to their "true" cluster $S_i$.
4 Experiments
To test our algorithm, we applied it to seven clustering problems. Note that whereas
$\sigma^2$ was previously described as a human-specified parameter, the analysis also suggests
a particularly simple way of choosing it automatically: For the right $\sigma^2$,
Theorem 2 predicts that the rows of Y will form k "tight" clusters on the surface
of the k-sphere. Thus, we simply search over $\sigma^2$, and pick the value that, after
clustering Y's rows, gives the tightest (smallest distortion) clusters. K-means in
Step 5 of the algorithm was also inexpensively initialized using the prior knowledge
that the clusters are about 90° apart.³ The results of our algorithm are shown in
Figure 1a-g. Giving the algorithm only the coordinates of the points and k, the
different clusters found are shown in the figure via the different symbols (and colors,
where available). The results are surprisingly good: Even for clusters that do
not form convex regions or that are not cleanly separated (such as in Figure 1g),
the algorithm reliably finds clusterings consistent with what a human would have
chosen.
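The automatic choice of $\sigma^2$ just described amounts to a one-dimensional search scored by the K-means distortion of Y's rows. A sketch under assumed helper names and an assumed candidate grid:

```python
import numpy as np
from sklearn.cluster import KMeans

def embed_rows(S, k, sigma):
    """Rows of Y (steps 1-4 of the algorithm) for a given sigma; compact restatement."""
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))
    X = np.linalg.eigh(L)[1][:, -k:]
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def choose_sigma(S, k, candidates=np.logspace(-2, 1, 20)):
    """Pick the sigma whose Y-rows give the tightest (lowest-distortion) K-means clusters."""
    best = None
    for sigma in candidates:                          # the grid itself is an assumption here
        Y = embed_rows(S, k, sigma)
        km = KMeans(n_clusters=k, n_init=10).fit(Y)
        if best is None or km.inertia_ < best[0]:     # inertia_ = K-means distortion
            best = (km.inertia_, sigma, km.labels_)
    return best[1], best[2]                           # chosen sigma and its clustering
```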
We note that there are other, related algorithms that can give good results on a
subset of these problems, but we are aware of no equally simple algorithm that
can give results comparable to these. For example, we noted earlier how K-means
easily fails when clusters do not correspond to convex regions (Figure 1i). Another
alternative may be a simple "connected components" algorithm that, for a threshold
T, draws an edge between points $s_i$ and $s_j$ whenever $\|s_i - s_j\|_2 \le T$, and takes the
resulting connected components to be the clusters. Here, T is a parameter that can
(say) be optimized to obtain the desired number of clusters k. The result of this
algorithm on the threecircles-joined dataset with $k = 3$ is shown in Figure 1j.
One of the "clusters" it found consists of a singleton point at (1.5, 2). It is clear
that this method is very non-robust.
We also compare our method to the algorithm of Meila and Shi [6] (see Figure 1k).
Their method is similar to ours, except for the seemingly cosmetic difference that
they normalize A's rows to sum to 1 and use its eigenvectors instead of L's, and do
not renormalize the rows of X to unit length. A refinement of our analysis suggests
that this method might be susceptible to bad clusterings when the degree to which
different clusters are connected ($\sum_j d^{(i)}_j$) varies substantially across clusters.
³Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then
repeatedly choose as the next centroid the row of Y that is closest to being 90° from
all the centroids (formally, from the worst-case centroid) already picked. The resulting
K-means was run only once (no restarts) to give the results presented. K-means with the
more conventional random initialization and a small number of restarts also gave identical
results. In contrast, our implementation of Meila and Shi's algorithm used 2000 restarts.
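One reading of this initialization as code (a sketch, not the authors' implementation): start from a random row of Y, then greedily add the row whose largest absolute cosine against the centroids chosen so far is smallest, i.e. the row closest to 90° from its worst-case centroid.

```python
import numpy as np

def orthogonal_init(Y, k, seed=0):
    """Footnote-3 style initialization: rows of Y are unit vectors, so a small
    worst-case |cosine| against the chosen centroids means 'close to 90 degrees'."""
    rng = np.random.default_rng(seed)
    centroids = [Y[rng.integers(len(Y))]]            # first centroid: a random row of Y
    for _ in range(k - 1):
        worst = np.abs(Y @ np.array(centroids).T).max(axis=1)
        centroids.append(Y[np.argmin(worst)])        # row closest to orthogonal to all so far
    return np.array(centroids)

# These centroids could then seed a single K-means run, e.g.
# KMeans(n_clusters=k, init=orthogonal_init(Y, k), n_init=1).fit(Y)
```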
[Figure 1 plots omitted. Recoverable panel titles include "flips, 8 clusters"; "threecircles-joined, 3 clusters (connected components)"; "3 clusters (Meila and Shi algorithm)"; and "flips, 6 clusters (Kannan et al. algorithm)".]
Figure 1: Clustering examples, with clusters indicated by different symbols (and colors,
where available). (a-g) Results from our algorithm, where the only parameter varied across
runs was k. (h) Rows of Y (jittered, subsampled) for the twocircles dataset. (i) K-means.
(j) A "connected components" algorithm. (k) Meila and Shi algorithm. (l) Kannan et al.
Spectral Algorithm I. (See text.)
5 Discussion
There are some intriguing similarities between spectral clustering methods and Kernel
PCA, which has been empirically observed to perform clustering [7, 2]. The main
difference between the first steps of our algorithm and Kernel PCA with a Gaussian
kernel is the normalization of A (to form L) and X. These normalizations do improve
the performance of the algorithm, but it is also straightforward to extend our
analysis to prove conditions under which Kernel PCA will indeed give clustering.
While different in detail, Kannan et al. [4] give an analysis of spectral clustering
that also makes use of matrix perturbation theory, for the case of an affinity matrix
with row sums equal to one. They also present a clustering algorithm based on
k singular vectors, one that differs from ours in that it identifies clusters with
individual singular vectors. In our experiments, that algorithm very frequently
gave poor results (e.g., Figure 1l).
Acknowledgments
We thank Marina Meila for helpful conversations about this work. We also thank
Alice Zheng for helpful comments. A. Ng is supported by a Microsoft Research
fellowship. This work was also supported by a grant from Intel Corporation, NSF
grant IIS-9988642, and ONR MURI N00014-00-1-0637.
References
[1] C. Alpert, A. Kahng, and S. Yao. Spectral partitioning: The more eigenvectors, the better. Discrete Applied Mathematics, 90:3-26, 1999.
[2] N. Cristianini, J. Shawe-Taylor, and J. Kandola. Spectral kernel methods for clustering. In Neural Information Processing Systems 14, 2002.
[3] F. Chung. Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[4] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000.
[5] J. Malik, S. Belongie, T. Leung, and J. Shi. Contour and texture analysis for image segmentation. In Perceptual Organization for Artificial Vision Systems. Kluwer, 2000.
[6] M. Meila and J. Shi. Learning segmentation by random walks. In Neural Information Processing Systems 13, 2001.
[7] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[8] G. Scott and H. Longuet-Higgins. Feature grouping by relocalisation of eigenvectors of the proximity matrix. In Proc. British Machine Vision Conference, 1990.
[9] D. Spielman and S. Teng. Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science, 1996.
[10] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
[11] Y. Weiss. Segmentation using eigenvectors: A unifying view. In International Conference on Computer Vision, 1999.