$\mu_1, \mu_2, \ldots, \mu_k$, each with variance 1 in every direction. Then the density of the mixture is
$$F(x) = \text{Prob}[x] = \frac{1}{k}\,\frac{1}{(2\pi)^{d/2}} \sum_{t=1}^{k} e^{-|x-\mu_t|^2}.$$
Denote by $\mu(x)$ the center nearest to $x$. Since the exponential function falls off fast, we have the approximation
$$F(x) \approx \frac{1}{k}\,\frac{1}{(2\pi)^{d/2}}\, e^{-|x-\mu(x)|^2}.$$
So, given a sample of points $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ drawn according to the mixture, the likelihood of a particular $\mu_1, \mu_2, \ldots, \mu_k$, namely, the (posterior) probability of generating the sample if $\mu_1, \mu_2, \ldots, \mu_k$ were in fact the centers, is approximately
$$\frac{1}{k^n}\,\frac{1}{(2\pi)^{nd/2}} \prod_{i=1}^{n} e^{-|x^{(i)}-\mu(x^{(i)})|^2} = c\, e^{-\sum_{i=1}^{n} |x^{(i)}-\mu(x^{(i)})|^2}.$$
So, minimizing the sum of squared distances to the cluster centers finds the maximum likelihood $\mu_1, \mu_2, \ldots, \mu_k$, and this suggests the criterion: the sum of squared distances to the cluster centers.
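To make the correspondence concrete, here is a minimal numpy sketch of the computation just described; the function names are ours, not from the text, and the constant terms are the ones appearing in the approximate likelihood above. Minimizing the sum of squared distances to the nearest centers minimizes the approximate negative log-likelihood.

```python
import numpy as np

def sum_sq_to_nearest(X, centers):
    # squared distance from each point to each center, then take the nearest
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def approx_neg_log_likelihood(X, centers):
    # negative log of the approximate likelihood derived above:
    # n log k + (nd/2) log(2 pi) + sum_i |x^(i) - mu(x^(i))|^2
    n, d = X.shape
    k = len(centers)
    return n * np.log(k) + 0.5 * n * d * np.log(2 * np.pi) + sum_sq_to_nearest(X, centers)
```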
On the other hand, if the generating process had an exponential probability distribution, with the probability law
$$\text{Prob}\left[(x_1, x_2, \ldots, x_d)\right] = \frac{1}{2^d} \prod_{i=1}^{d} e^{-|x_i-\mu_i|} = \frac{1}{2^d}\, e^{-\sum_{i=1}^{d}|x_i-\mu_i|} = \frac{1}{2^d}\, e^{-|x-\mu|_1},$$
one would use the $L_1$ norm (not the $L_2$ norm or the square of the $L_1$ norm), since the probability density decreases with the $L_1$ distance from the center. The intuition here is that the distance used to cluster data should be related to the actual distribution of the data.
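In the same spirit as the Gaussian case, the negative log of this density is, up to an additive constant, the sum of $L_1$ distances to the centers. A minimal sketch of the corresponding criterion (the function name and the "nearest center" approximation mirroring the Gaussian argument are ours):

```python
import numpy as np

def l1_cost_to_nearest(X, centers):
    # sum over points of the L1 distance to the nearest center;
    # minimizing this (approximately) maximizes the likelihood under the
    # coordinate-wise exponential model above
    d1 = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
    return d1.min(axis=1).sum()
```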
The choice of whether to use a distance measure and cluster together points that are close, or to use a similarity measure and cluster together points with high similarity, and which particular distance or similarity measure to use, can be crucial to the application. However, there is not much theory on these choices; they are determined by empirical domain-specific knowledge. One general observation is worth making. Using distance squared instead of distance gives outliers more influence, since the square function magnifies large values, which means a small number of outliers may make a clustering look bad. On the other hand, distance squared has some mathematical advantages; see for example Corollary 8.3, which asserts that with the distance squared criterion, the centroid is the correct cluster center. The widely used k-means criterion is based on the sum of squared distances.
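A tiny numerical illustration of the outlier effect (the data and the choice of candidate centers are our own example; the fact that the median minimizes the $L_1$ cost is a standard fact not proved in the text): with one far-away point, squaring makes the outlier dominate the criterion and drags the best single center far from the bulk of the data.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 100.0])   # four nearby points and one far outlier

def sq_cost(c):      # sum of squared distances; minimized at the mean (Corollary 8.3)
    return ((x - c) ** 2).sum()

def l1_cost(c):      # sum of distances; minimized at the median (standard fact)
    return np.abs(x - c).sum()

print(x.mean(), np.median(x))                     # 21.2 vs 2.0: the outlier drags the mean away
print(sq_cost(x.mean()), sq_cost(np.median(x)))   # squared criterion prefers the dragged-out mean
print(l1_cost(x.mean()), l1_cost(np.median(x)))   # L1 criterion prefers the median near the bulk
```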
In the formulations we have discussed so far, we have one number (e.g., the sum of squared distances to the cluster centers) as the measure of goodness of a clustering, and we try to optimize that number (to find the best clustering according to the measure). This approach does not always yield the desired results, since it is often hard to optimize exactly (most clustering problems are NP-hard). Often, there are polynomial time algorithms to find an approximately optimal solution. But such a solution may be far from the optimal (desired) clustering. We will see in Section 8.4 how to formalize some realistic conditions under which an approximately optimal solution indeed gives us a desired clustering as well. But first we see some simple algorithms for getting a good clustering according to some natural measures.
8.2 A Simple Greedy Algorithm for k-clustering
There are many algorithms for clustering high-dimensional data. We start with a
simple one. Suppose we use the k-center criterion. The k-center criterion partitions the
points into k clusters so as to minimize the maximum distance of any point to its cluster
center. Call the maximum distance of any point to its cluster center the radius of the
clustering. There is a k-clustering of radius r if and only if there are k spheres, each of
radius r, which together cover all the points. Below, we give a simple algorithm to find k spheres covering a set of points. The lemma following shows that this algorithm needs to use a radius that is off by a factor of at most two from the optimal k-center solution.
The Greedy k-clustering Algorithm
Pick any data point to be the first cluster center. At time $t$, for $t = 2, 3, \ldots, k$, pick any data point that is not within distance $r$ of an existing cluster center; make it the $t^{\text{th}}$ cluster center.
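A short Python sketch of this greedy procedure (our own rendering; the function name and the choice to scan the points in input order are ours):

```python
import numpy as np

def greedy_k_centers(X, k, r):
    """Try to cover the points of X with k balls of radius r.
    Returns the chosen centers, or None if some point is left uncovered
    (by Lemma 8.1, this cannot happen if a k-clustering of radius r/2 exists)."""
    centers = [X[0]]                      # pick any data point as the first center
    for x in X:
        if len(centers) == k:
            break
        # pick a point not within distance r of any existing center
        if min(np.linalg.norm(x - c) for c in centers) > r:
            centers.append(x)
    # check whether every point is now within distance r of some center
    covered = all(min(np.linalg.norm(x - c) for c in centers) <= r for x in X)
    return np.array(centers) if covered else None
```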
Lemma 8.1 If there is a k-clustering of radius $\frac{r}{2}$, then the above algorithm finds a k-clustering with radius at most $r$.
Proof: Suppose for contradiction that the algorithm using radius $r$ fails to find a k-clustering. This means that after the algorithm chooses $k$ centers, there is still at least one data point that is not in any sphere of radius $r$ around a picked center. This is the only possible mode of failure. But then there are $k + 1$ data points, with each pair more than distance $r$ apart. Clearly, no two such points can belong to the same cluster in any k-clustering of radius $\frac{r}{2}$, contradicting the hypothesis.
There are in general two variations of the clustering problem for each of the criteria.
We could require that each cluster center be a data point or allow a cluster center to be
any point in space. If we require each cluster center to be a data point, the problem can
be solved in time $\binom{n}{k}$ times a polynomial in $n$ and $d$: enumerate each possible set of $k$ data points as the centers and, for each choice, assign every point to its nearest center, which minimizes
$$\sum_{j=1}^{k} \sum_{a_i \in S_j} |c_j - a_i|^2,$$
where $c_j$ is the center of cluster $j$.
Suppose we have already determined the clustering, or the partitioning into $S_1, S_2, \ldots, S_k$. What are the best centers for the clusters? The following lemma shows that the answer is the centroids, the coordinate-wise means, of the clusters.
Lemma 8.2 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of the squared distances of the $a_i$ to any point $x$ equals the sum of the squared distances to the centroid plus the number of points times the squared distance from the point $x$ to the centroid. That is,
$$\sum_{i} |a_i - x|^2 = \sum_{i} |a_i - c|^2 + n|c - x|^2,$$
where $c$ is the centroid of the set of points.
Proof:
$$\sum_{i} |a_i - x|^2 = \sum_{i} |a_i - c + c - x|^2 = \sum_{i} |a_i - c|^2 + 2(c - x) \cdot \sum_{i} (a_i - c) + n|c - x|^2.$$
Since $c$ is the centroid, $\sum_{i} (a_i - c) = 0$. Thus,
$$\sum_{i} |a_i - x|^2 = \sum_{i} |a_i - c|^2 + n|c - x|^2.$$
A corollary of Lemma 8.2 is that the centroid minimizes the sum of squared distances, since the second term, $n|c - x|^2$, is always nonnegative.
Corollary 8.3 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of squared distances of the $a_i$ to a point $x$ is minimized when $x$ is the centroid, namely $x = \frac{1}{n}\sum_{i} a_i$.
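A quick numerical check of Lemma 8.2 and Corollary 8.3 (a sketch on random data of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(20, 3))          # 20 points in 3 dimensions
c = a.mean(axis=0)                    # the centroid
x = rng.normal(size=3)                # an arbitrary point

lhs = ((a - x) ** 2).sum()
rhs = ((a - c) ** 2).sum() + len(a) * ((c - x) ** 2).sum()
assert np.isclose(lhs, rhs)           # Lemma 8.2
assert ((a - c) ** 2).sum() <= lhs    # Corollary 8.3: the centroid is no worse than x
```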
Another expression for the sum of squared distances of a set of $n$ points to their centroid is the sum of all pairwise squared distances divided by $n$. First, a simple notational issue. For a set of points $\{a_1, a_2, \ldots, a_n\}$, the sum $\sum_{i=1}^{n} \sum_{j=i+1}^{n} |a_i - a_j|^2$ counts the quantity $|a_i - a_j|^2$ once for each pair $(i, j)$ with $j > i$. However, $\sum_{i,j} |a_i - a_j|^2$ counts each $|a_i - a_j|^2$ twice, so the latter sum is twice the first sum.
Lemma 8.4 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of the squared distances between all pairs of points equals the number of points times the sum of the squared distances of the points to the centroid of the points. That is,
$$\sum_{j>i} |a_i - a_j|^2 = n \sum_{i} |a_i - c|^2,$$
where $c$ is the centroid of the set of points.
Proof: Lemma 8.2 states that for every $x$,
$$\sum_{i} |a_i - x|^2 = \sum_{i} |a_i - c|^2 + n|c - x|^2.$$
Letting $x$ range over all $a_j$ and summing the $n$ equations yields
$$\sum_{i,j} |a_i - a_j|^2 = n \sum_{i} |a_i - c|^2 + n \sum_{j} |c - a_j|^2 = 2n \sum_{i} |a_i - c|^2.$$
Observing that $\sum_{i,j} |a_i - a_j|^2 = 2 \sum_{j>i} |a_i - a_j|^2$ yields the result that
$$\sum_{j>i} |a_i - a_j|^2 = n \sum_{i} |a_i - c|^2.$$
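Lemma 8.4 can be checked numerically in the same way (again a sketch on arbitrary random data):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=(15, 4))
n = len(a)
c = a.mean(axis=0)

pairwise = sum(((a[i] - a[j]) ** 2).sum() for i in range(n) for j in range(i + 1, n))
assert np.isclose(pairwise, n * ((a - c) ** 2).sum())   # Lemma 8.4
```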
The k-means clustering algorithm
A natural algorithm for k-means clustering is given below. There are two unspecified
aspects of the algorithm. One is the set of starting centers and the other is the stopping
condition.
k-means algorithm
Start with k centers.
Cluster each point with the center nearest to it.
Find the centroid of each cluster and replace the set of old centers with the centroids.
Repeat the above two steps until the centers converge (according to some criterion).
The k-means algorithm always converges, but possibly to a local minimum. To show convergence, we argue that the cost of the clustering, the sum of the squares of the distances of each point to its cluster center, never increases. Each iteration consists of two steps. First, consider the step which finds the centroid of each cluster and replaces the old centers with the new centers. By Corollary 8.3, this step does not increase the sum of internal cluster distances squared. The other step reclusters by assigning each point to its nearest cluster center, which also does not increase the internal cluster distances.
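Here is a minimal sketch of the iteration just described. The initialization (sampling k data points) and the tolerance-based stopping rule are our own choices; the text deliberately leaves both unspecified.

```python
import numpy as np

def kmeans(X, k, iters=100, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # start with k centers
    for _ in range(iters):
        # cluster each point with the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # replace each center by the centroid of its cluster (keep the old center if empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:       # centers have converged
            return new_centers, labels
        centers = new_centers
    return centers, labels
```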
8.4 Meaningful Clustering via Singular Value Decomposition
Optimizing a criterion such as k-means is often not an end in itself. It is a means to finding a good (meaningful) clustering. How do we define a meaningful clustering? Here is a possible answer: an optimal clustering is meaningful if it is unique, in the sense that any other nearly optimal clustering agrees with it on most data points. We will formalize this below. But the bad news is that we will soon see that this is too much to ask for, and many common data sets do not admit such a clustering. Luckily though, the discussion will lead us to a weaker requirement which has the twin properties of being met by many data sets as well as admitting an efficient (SVD-based) algorithm to find the clustering.
We start with some notation.
We denote by $n$ the number of (data) points to be clustered; they are listed as the rows $A_i$ of an $n \times d$ matrix $A$. A clustering (partition) of the data points is represented by a matrix of cluster centers $C$, which is also an $n \times d$ matrix; the $i$-th row of $C$ is the center of the cluster that $A_i$ belongs to. So $C$ has only $k$ distinct rows. We refer to $A$ as the data and to $C$ as the clustering.
Definition: The cost of the clustering $C$ is the sum of distances squared to the cluster centers; so we have
$$\text{cost}(A, C) = ||A - C||_F^2.$$
The mean squared distance (MSD) of a clustering $C$ is just $\text{cost}(A, C)/n$.
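With $A$ and $C$ stored as numpy arrays of shape (n, d) as in the definition above, these quantities are one-liners (a sketch; the function names are ours):

```python
import numpy as np

def cost(A, C):
    # sum of squared distances of the rows of A to their cluster centers (rows of C)
    return np.linalg.norm(A - C, ord='fro') ** 2

def msd(A, C):
    return cost(A, C) / len(A)
```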
We will say two clusterings of $A$ differ in $s$ points if $s$ is the minimum number of data points to be reassigned to get from one clustering to the other. [Note: a clustering is specified by just a partition; the cluster centers are just the centroids of the data points in a cluster.] Here is the first attempt at defining when a clustering is meaningful:
A k-clustering $C$ is meaningful if every k-clustering $C'$ of $A$ whose cost is at most $(1 + \epsilon)\,\text{cost}(A, C)$ differs from $C$ in fewer than $\epsilon n$ data points (here $\epsilon > 0$ is a small parameter).

Claim: If $C$ is a meaningful k-clustering in which every cluster has $\Omega(\epsilon n)$ data points, then every two cluster centers $\mu$ and $\mu'$ of $C$ satisfy $|\mu - \mu'|^2 \geq \Omega(\text{MSD}(C))$.
Proof: We will prove the claim by showing that if two cluster centers in $C$ are too close, then we may move $\epsilon n$ points from one cluster to the other without increasing the cost by more than a factor of $(1 + O(\epsilon))$, thus contradicting the assumption that $C$ is meaningful.

Let $\mu$ and $\mu'$ be two cluster centers and let $T$ be the cluster with cluster center $\mu$. Project each data point $A_i \in T$ onto the line through $\mu$ and $\mu'$, and let $d_i$ be the distance of the projected point to $\mu$. Let $T_1$ be the subset of $T$ whose projections land on the $\mu'$ side of $\mu$, and $T_2$ the rest. Since $\mu$ is the centroid of $T$,
$$\sum_{i \in T_1} d_i = \sum_{i \in T_2} d_i.$$
Since each $A_i \in T$ is closer to $\mu$ than to $\mu'$, for $i \in T_1$ we have $d_i \leq |\mu - \mu'|/2$, and so
$$\sum_{i \in T_1} d_i \leq |T|\,|\mu - \mu'|/2 \quad \text{and hence} \quad \sum_{i \in T_2} d_i \leq |T|\,|\mu - \mu'|/2.$$
So $\sum_{i \in T} d_i \leq |T|\,|\mu - \mu'|$.

Now from the assumption that $|T| \geq \Omega(\epsilon n)$, we have $|T| \geq 2\epsilon n$. So the $\epsilon n$-th smallest $d_i$ is at most
$$\frac{1}{|T| - \epsilon n} \sum_{i \in T} d_i \leq \frac{|T|}{|T| - \epsilon n}\,|\mu - \mu'| \leq 2|\mu - \mu'|.$$
Moving the $\epsilon n$ points of $T$ with the smallest $d_i$ to the cluster of $\mu'$ therefore increases the cost by at most $O(\epsilon n\,|\mu - \mu'|^2)$; if $|\mu - \mu'|^2$ were $o(\text{MSD}(C))$, this increase would be only $o(\epsilon)\,\text{cost}(A, C)$, giving a nearly optimal clustering that differs from $C$ in $\epsilon n$ points. By the assumption that $C$ is meaningful (under the proposed definition above), we must thus have $|\mu - \mu'|^2 \geq \Omega(\text{MSD}(C))$, as claimed.
But we will now see that the condition that $|\mu - \mu'|^2 \geq \Omega(\text{MSD}(C))$ is too strong for some common data sets. Consider two spherical Gaussians in $d$-space, each with variance 1 in each direction. Clearly, if we have data generated from an (equal weight) mixture of the two, the correct 2-clustering one would seek is to split the points into the Gaussians they were generated from, with the actual centers of the Gaussians (or very close points) as the cluster centers. But the MSD of this clustering is approximately $d$. So, by the Claim, for $C$ to be meaningful, the centers must be $\Omega(\sqrt{d})$ apart. It is easy to see, however, that if the separation is $\Omega(\sqrt{\ln n})$, the clusters are already distinct: the projection of each Gaussian onto the line joining their centers is a 1-dimensional Gaussian of variance 1, so the probability that any data point lies more than a tenth of the way (in the projection) to the wrong center is at most $O(1/n^2)$; so, by a union bound, all data points have their distance to the wrong center at least 10 times the distance to the correct center, provided we measure distances only in the projection. Since the variance is 1 in each direction, the mean squared distance to the cluster center in each direction is only $O(1)$. So, in this example, it would make more sense to tie the required inter-center separation to the maximum mean squared distance in any one direction to the cluster center, rather than to the overall MSD. We will be able to achieve this in general.
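The gap between the two quantities is easy to see numerically. The following simulation sketch (dimensions, sample size, and the $\Theta(\sqrt{\ln n})$ separation are parameters of our choosing) generates an equal-weight mixture of two unit-variance spherical Gaussians and compares the MSD of the true clustering with the per-direction mean squared distance.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 100, 10000
mu0 = np.zeros(d)
mu1 = np.zeros(d)
mu1[0] = 4 * np.sqrt(np.log(n))   # separation Theta(sqrt(log n)) along the first coordinate

X = np.vstack([mu0 + rng.normal(size=(n // 2, d)),
               mu1 + rng.normal(size=(n // 2, d))])
C = np.vstack([np.tile(mu0, (n // 2, 1)), np.tile(mu1, (n // 2, 1))])  # true centers as the clustering

msd = ((X - C) ** 2).sum() / n                      # approximately d
per_direction = ((X - C) ** 2).mean(axis=0).max()   # approximately 1 in every coordinate direction
print(msd, per_direction)
```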
Definition: Let $A$ and $C$ be the data and cluster center matrices, respectively. The mean squared distance in a direction, denoted $\text{MSDD}(A, C)$, is the maximum over all unit length vectors $v$ of the mean squared distance of the data points to their cluster centers in the direction $v$, namely,
$$\text{MSDD}(A, C) = \frac{1}{n} \max_{v : |v| = 1} |(A - C)v|^2 = \frac{1}{n}\,||A - C||_2^2,$$
where we have used the basic definition of the largest singular value to get the last expression.
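Since $\text{MSDD}(A, C)$ is just the squared largest singular value of $A - C$ divided by $n$, it is a one-liner in numpy (a sketch; the function name is ours):

```python
import numpy as np

def msdd(A, C):
    # largest singular value of A - C, squared, divided by n
    return np.linalg.norm(A - C, ord=2) ** 2 / len(A)
```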
Theorem 8.5 Suppose there exists a k-clustering $C$ of the data points $A$ with (i) $\Omega(n)$ data points per cluster and (ii) distance at least $\Omega(\sqrt{k\,\text{MSDD}(A, C)})$ between every two cluster centers. Then every cluster center of $C$ has a cluster center of the SVD clustering $C'$ (obtained by projecting the data points onto the span of the top $k$ right singular vectors of $A$ and finding a 2-approximate optimal k-clustering, under the sum of squared distances, of the projected points) within distance $O(\sqrt{k\,\text{MSDD}(A, C)})$.
Proof: Let $\epsilon = \text{MSDD}(A, C)$. Let $T$ be the set of data points in a cluster of $C$ with centroid $\mu$, and suppose for contradiction that $\mu$ has no cluster center of $C'$ within distance $O(\sqrt{\epsilon k})$. Let $\bar{A}_i$ denote the projection of $A_i$ onto the SVD subspace; so the SVD clustering actually clusters the points $\bar{A}_i$. Recall the notation that $C'_i$ is the cluster center of $C'$ closest to $\bar{A}_i$. The cost of a data point $A_i \in T$ in the SVD solution is
$$|\bar{A}_i - C'_i|^2 = |(\mu - C'_i) - (\mu - \bar{A}_i)|^2 \geq \frac{1}{2}|\mu - C'_i|^2 - |\mu - \bar{A}_i|^2 \geq \Omega(\epsilon k) - |\mu - \bar{A}_i|^2,$$
where we have used $|a - b|^2 \geq \frac{1}{2}|a|^2 - |b|^2$ for any two vectors $a$ and $b$.

Adding over all points in $T$, the cost of $C'$ is at least $\Omega(\epsilon k n) - ||\bar{A} - C||_F^2$. Now, one way to cluster the points $\bar{A}_i$ is to just use the same cluster centers as $C$; the cost of this clustering is $||\bar{A} - C||_F^2$. So the optimal clustering of the points $\bar{A}_i$ costs at most $||\bar{A} - C||_F^2$, and since the algorithm finds a 2-approximate clustering, the cost of the SVD clustering is at most $2||\bar{A} - C||_F^2$. So we get
$$2||\bar{A} - C||_F^2 \geq \Omega(\epsilon k n) - ||\bar{A} - C||_F^2 \implies ||\bar{A} - C||_F^2 \geq \Omega(\epsilon k n).$$
We will prove that $||\bar{A} - C||_F^2 \leq 5k\,||A - C||_2^2 = 5k\epsilon n$ in Lemma 8.7. By (i), we have $k \leq O(1)$, so $||\bar{A} - C||_F^2 = O(\epsilon n)$. So we get a contradiction (with a suitable choice of constants under the $\Omega$).
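A sketch of the SVD-based procedure the proof refers to (our own rendering): project the rows of $A$ onto the span of the top $k$ right singular vectors and then cluster the projected points. We use a plain Lloyd iteration as a stand-in for the 2-approximate clustering step assumed in the proof; it is not itself guaranteed to be a 2-approximation.

```python
import numpy as np

def svd_cluster(A, k, iters=100, seed=0):
    """Project the rows of A onto the top-k right singular vectors, then run the
    k-means (Lloyd) iteration on the projected points as a stand-in for the
    2-approximate clustering step assumed in the proof."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    A_bar = A @ Vt[:k].T @ Vt[:k]          # best rank-k approximation of A (rows projected)
    rng = np.random.default_rng(seed)
    centers = A_bar[rng.choice(len(A_bar), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((A_bar[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centers = np.array([A_bar[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels
```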
Note that in the Lemma below, the main inequality has Frobenius norm on the left hand
side, but only operator norm on the right hand side. This makes it stronger than having
either Frobenius norm on both sides or operator norm on both sides.
(In Lemma 8.7, $C$ is not assumed to satisfy condition (ii) of the Theorem.)
Lemma 8.7 Suppose $A$ is an $n \times d$ matrix, $C$ is an $n \times d$ rank-$k$ matrix, and $\bar{A}$ is the best rank-$k$ approximation to $A$. Then
$$||\bar{A} - C||_F^2 \leq 5k\,||A - C||_2^2.$$

Proof: Let $u_1, u_2, \ldots, u_p$ be an orthonormal basis of the span of the rows of $\bar{A}$ and $C$, chosen so that $u_1, \ldots, u_k$ are the top $k$ right singular vectors of $A$; note that $p \leq 2k$. Since the rows of $\bar{A} - C$ lie in this span,
$$||\bar{A} - C||_F^2 = \sum_{i=1}^{k} |(\bar{A} - C)u_i|^2 + \sum_{i=k+1}^{p} |(\bar{A} - C)u_i|^2.$$
Since $\{u_i \mid 1 \leq i \leq k\}$ are the top $k$ singular vectors of $A$ and since $\bar{A}$ is the rank-$k$ approximation to $A$, for $1 \leq i \leq k$ we have $Au_i = \bar{A}u_i$ and thus $|(\bar{A} - C)u_i|^2 = |(A - C)u_i|^2$. For $i > k$, $\bar{A}u_i = 0$, and thus $|(\bar{A} - C)u_i|^2 = |Cu_i|^2$. From this it follows that
$$||\bar{A} - C||_F^2 = \sum_{i=1}^{k} |(A - C)u_i|^2 + \sum_{i=k+1}^{p} |Cu_i|^2 \leq k\,||A - C||_2^2 + \sum_{i=k+1}^{p} |Au_i + (C - A)u_i|^2.$$
Using $|a + b|^2 \leq 2|a|^2 + 2|b|^2$,
$$||\bar{A} - C||_F^2 \leq k\,||A - C||_2^2 + 2\sum_{i=k+1}^{p} |Au_i|^2 + 2\sum_{i=k+1}^{p} |(C - A)u_i|^2 \leq k\,||A - C||_2^2 + 2(p - k)\,\sigma_{k+1}^2(A) + 2(p - k)\,||A - C||_2^2.$$
Using $p \leq 2k$, which implies $p - k \leq k$,
$$||\bar{A} - C||_F^2 \leq k\,||A - C||_2^2 + 2k\,\sigma_{k+1}^2(A) + 2k\,||A - C||_2^2. \qquad (8.1)$$
As we saw in Chapter 4, for any rank-$k$ matrix $B$, $||A - B||_2 \geq \sigma_{k+1}(A)$, and so $\sigma_{k+1}(A) \leq ||A - C||_2$; plugging this into (8.1), we get the Lemma.
249