
8 Clustering

8.1 Some Clustering Examples


Clustering comes up in many contexts. For example, one might want to cluster journal articles into clusters of articles on related topics. In doing this, one first represents a document by a vector. This can be done using the vector space model introduced in Chapter 2. Each document is represented as a vector with one component for each term giving the frequency of the term in the document. Alternatively, a document may be represented by a vector whose components correspond to documents in the collection and the $i^{th}$ vector's $j^{th}$ component is a 0 or 1 depending on whether the $i^{th}$ document referenced the $j^{th}$ document. Once one has represented the documents as vectors, the problem becomes one of clustering vectors.
Another context where clustering is important is the study of the evolution and growth of communities in social networks. Here one constructs a graph where nodes represent individuals and there is an edge from one node to another if the person corresponding to the first node sent an email or instant message to the person corresponding to the second node. A community is defined as a set of nodes where the frequency of messages within the set is higher than what one would expect if the set of nodes in the community were a random set. Clustering partitions the set of nodes of the graph into sets of nodes where the sets consist of nodes that send more messages to one another than one would expect by chance. Note that clustering generally asks for a strict partition into subsets, although in reality in this case, for instance, a node may well belong to several communities.
In these clustering problems, one defines either a similarity measure between pairs of objects or a distance measure (a notion of dissimilarity). One measure of similarity between two vectors a and b is the cosine of the angle between them:
$$\cos(a, b) = \frac{a^T b}{|a|\,|b|}.$$
To get a distance measure, subtract the cosine similarity from one:
$$\text{dist}(a, b) = 1 - \cos(a, b).$$
Another distance measure is the Euclidean distance. There is an obvious relationship between cosine similarity and Euclidean distance. If a and b are unit vectors, then
$$|a - b|^2 = (a - b)^T (a - b) = |a|^2 + |b|^2 - 2a^T b = 2\left(1 - \cos(a, b)\right).$$
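As a quick illustration, here is a minimal NumPy sketch of these two measures and of the unit-vector identity above (the function names are our own, not from the text):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = a^T b / (|a| |b|)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    # dist(a, b) = 1 - cos(a, b)
    return 1.0 - cosine_similarity(a, b)

# For unit vectors a and b, |a - b|^2 = 2(1 - cos(a, b)).
a = np.random.randn(5); a /= np.linalg.norm(a)
b = np.random.randn(5); b /= np.linalg.norm(b)
assert np.isclose(np.linalg.norm(a - b) ** 2, 2 * cosine_distance(a, b))
```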
In determining the distance function to use, it is useful to know something about the origin of the data. In problems where we have to cluster nodes of a graph, we may represent each node as a vector, namely, as the row of the adjacency matrix corresponding to the node. One notion of dissimilarity here is the square of the Euclidean distance. For 0-1 vectors, this measure is just the number of "uncommon" 1's, whereas the dot product is the number of common 1's. In many situations one has a stochastic model of how the data was generated. An example is customer behavior. Suppose there are d products and n customers. A reasonable assumption is that each customer generates the basket of goods he or she buys from a probability distribution. A basket specifies the amount of each good bought. One hypothesis is that there are only k types of customers, $k \ll n$.
Each customer type is characterized by a probability density used by all customers of
that type to generate their baskets of goods. The densities may all be Gaussians with
different centers and covariance matrices. We are not given the probability densities,
only the basket bought by each customer, which is observable. Our task is to cluster the
customers into the k types. We may identify the customer with his or her basket which is
a vector. One way to formulate the problem mathematically is via a clustering criterion
which we then optimize. Some potential criteria are to partition the customers into k
clusters so as to minimize
1. the sum of distances between all pairs of customers in the same cluster,
2. the sum of distances of all customers to their cluster center (any point in space
may be designated as the cluster center), or
3. the sum of squared distances of all customers to their cluster center.
The last criterion is called the k-means criterion and is widely used. A variant called
the k-median criterion minimizes the sum of distances (not squared) to the cluster center.
Another possibility, called the k-center criterion, is to minimize the maximum distance
of any point to its cluster center.
The chosen criterion can affect the results. To illustrate, suppose we have data generated according to an equal weight mixture of k spherical Gaussian densities centered at $\mu_1, \mu_2, \ldots, \mu_k$, each with variance 1 in every direction. Then the density of the mixture is
$$F(x) = \text{Prob}[x] = \frac{1}{k} \frac{1}{(2\pi)^{d/2}} \sum_{t=1}^{k} e^{-|x - \mu_t|^2}.$$
Denote by $\mu(x)$ the center nearest to x. Since the exponential function falls off fast, we have the approximation
$$F(x) \approx \frac{1}{k} \frac{1}{(2\pi)^{d/2}} e^{-|x - \mu(x)|^2}.$$
So, given a sample of points $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ drawn according to the mixture, the likelihood of a particular $\mu_1, \mu_2, \ldots, \mu_k$, namely, the (posterior) probability of generating the sample if $\mu_1, \mu_2, \ldots, \mu_k$ were in fact the centers, is approximately
$$\frac{1}{k^n} \frac{1}{(2\pi)^{nd/2}} \prod_{i=1}^{n} e^{-|x^{(i)} - \mu(x^{(i)})|^2} = c\, e^{-\sum_{i=1}^{n} |x^{(i)} - \mu(x^{(i)})|^2}.$$
So, minimizing the sum of squared distances to cluster centers finds the maximum likelihood $\mu_1, \mu_2, \ldots, \mu_k$, and this suggests the criterion: sum of squared distances to the cluster centers.
On the other hand, if the generating process had an exponential probability distribution, with the probability law
$$\text{Prob}\left[(x_1, x_2, \ldots, x_d)\right] = \frac{1}{2^d} \prod_{i=1}^{d} e^{-|x_i - \mu_i|} = \frac{1}{2^d} e^{-\sum_{i=1}^{d} |x_i - \mu_i|} = \frac{1}{2^d} e^{-|x - \mu|_1},$$
one would use the $L_1$ norm (not the $L_2$ norm or the square of the $L_1$ norm) since the probability density decreases exponentially with the $L_1$ distance from the center. The intuition here is that the distance used to cluster data should be related to the actual distribution of the data.
The choice of whether to use a distance measure and cluster together points that are close, or to use a similarity measure and cluster together points with high similarity, and which particular distance or similarity measure to use, can be crucial to the application. However, there is not much theory on these choices; they are determined by empirical domain-specific knowledge. One general observation is worth making. Using distance squared instead of distance favors outliers more, since the square function magnifies large values, which means a small number of outliers may make a clustering look bad. On the other hand, distance squared has some mathematical advantages; see for example Corollary 8.3, which asserts that with the distance squared criterion, the centroid is the correct cluster center. The widely used k-means criterion is based on the sum of squared distances.
In the formulations we have discussed so far, we have one number (e.g., the sum of squared distances to the cluster center) as the measure of goodness of a clustering and we try to optimize that number (to find the best clustering according to the measure). This approach does not always yield the desired results, since it is often hard to optimize exactly (most clustering problems are NP-hard). Often, there are polynomial time algorithms to find an approximately optimal solution. But such a solution may be far from the optimal (desired) clustering. We will see in Section 8.4 how to formalize some realistic conditions under which an approximately optimal solution indeed gives us a desired clustering as well. But first we see some simple algorithms for getting a good clustering according to some natural measures.
8.2 A Simple Greedy Algorithm for k-clustering
There are many algorithms for clustering high-dimensional data. We start with a simple one. Suppose we use the k-center criterion. The k-center criterion partitions the points into k clusters so as to minimize the maximum distance of any point to its cluster center. Call the maximum distance of any point to its cluster center the radius of the clustering. There is a k-clustering of radius r if and only if there are k spheres, each of radius r, which together cover all the points. Below, we give a simple algorithm to find k spheres covering a set of points. The lemma following shows that this algorithm needs to use a radius that is off by a factor of at most two from the optimal k-center solution.
The Greedy k-clustering Algorithm

Pick any data point to be the first cluster center. At time t, for $t = 2, 3, \ldots, k$, pick any data point that is not within distance r of an existing cluster center; make it the $t^{th}$ cluster center.
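A short NumPy sketch of this greedy procedure, taking the radius r as an input; the text allows "any" uncovered data point to be picked, so here we simply scan in index order:

```python
import numpy as np

def greedy_k_clustering(points, k, r):
    """Greedy k-clustering with radius parameter r (a sketch of the
    procedure above).  Returns the chosen centers if every point ends up
    within distance r of some center, and None if k centers do not
    suffice at this radius."""
    centers = [points[0]]                      # first center: any data point
    for _ in range(2, k + 1):                  # pick centers 2, 3, ..., k
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers],
                      axis=0)
        uncovered = np.flatnonzero(dist > r)
        if uncovered.size == 0:                # everything already covered
            return np.array(centers)
        centers.append(points[uncovered[0]])
    dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
    return np.array(centers) if np.all(dist <= r) else None
```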
Lemma 8.1 If there is a k-clustering of radius $\frac{r}{2}$, then the above algorithm finds a k-clustering with radius at most r.

Proof: Suppose for contradiction that the algorithm using radius r fails to find a k-clustering. This means that after the algorithm chooses k centers, there is still at least one data point that is not in any sphere of radius r around a picked center. This is the only possible mode of failure. But then there are k + 1 data points, with each pair more than distance r apart. Clearly, no two such points can belong to the same cluster in any k-clustering of radius $\frac{r}{2}$, contradicting the hypothesis.
There are in general two variations of the clustering problem for each of the criteria. We could require that each cluster center be a data point or allow a cluster center to be any point in space. If we require each cluster center to be a data point, the problem can be solved in time $\binom{n}{k}$ times a polynomial in the length of the data. First, exhaustively enumerate all sets of k data points as the possible sets of k cluster centers, then associate each point to its nearest center and select the best clustering. No such naive enumeration procedure is available when cluster centers can be any point in space. But, for the k-means problem, Corollary 8.3 shows that once we have identified the data points that belong to a cluster, the best choice of cluster center is the centroid. Note that the centroid might not be a data point.
8.3 Lloyd's Algorithm for k-means Clustering
In the k-means criterion, a set $A = \{a_1, a_2, \ldots, a_n\}$ of n points in d-dimensions is partitioned into k clusters, $S_1, S_2, \ldots, S_k$, so as to minimize the sum of squared distances of each point to its cluster center. That is, A is partitioned into clusters, $S_1, S_2, \ldots, S_k$, and a center is assigned to each cluster so as to minimize
$$d(S_1, S_2, \ldots, S_k) = \sum_{j=1}^{k} \sum_{a_i \in S_j} (c_j - a_i)^2$$
where $c_j$ is the center of cluster j.
Suppose we have already determined the clustering, or the partitioning into $S_1, S_2, \ldots, S_k$. What are the best centers for the clusters? The following lemma shows that the answer is the centroids, the coordinate-wise means, of the clusters.
Lemma 8.2 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of the squared distances of the $a_i$ to any point x equals the sum of the squared distances to the centroid plus the number of points times the squared distance from the point x to the centroid. That is,
$$\sum_i |a_i - x|^2 = \sum_i |a_i - c|^2 + n|c - x|^2$$
where c is the centroid of the set of points.
Proof:
$$\sum_i |a_i - x|^2 = \sum_i |a_i - c + c - x|^2 = \sum_i |a_i - c|^2 + 2(c - x)^T \sum_i (a_i - c) + n|c - x|^2$$
Since c is the centroid, $\sum_i (a_i - c) = 0$. Thus,
$$\sum_i |a_i - x|^2 = \sum_i |a_i - c|^2 + n|c - x|^2$$
A corollary of Lemma 8.2 is that the centroid minimizes the sum of squared distances, since the second term, $n|c - x|^2$, is always nonnegative.
Corollary 8.3 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of squared distances of the $a_i$ to a point x is minimized when x is the centroid, namely $x = \frac{1}{n}\sum_i a_i$.
Another expression for the sum of squared distances of a set of n points to their centroid is the sum of all pairwise squared distances divided by n. First, a simple notational issue. For a set of points $\{a_1, a_2, \ldots, a_n\}$, the sum $\sum_{i=1}^{n} \sum_{j=i+1}^{n} |a_i - a_j|^2$ counts the quantity $|a_i - a_j|^2$ once for each ordered pair (i, j), j > i. However, $\sum_{i,j} |a_i - a_j|^2$ counts each $|a_i - a_j|^2$ twice, so the latter sum is twice the first sum.
Lemma 8.4 Let $\{a_1, a_2, \ldots, a_n\}$ be a set of points. The sum of the squared distances between all pairs of points equals the number of points times the sum of the squared distances of the points to the centroid of the points. That is,
$$\sum_{j>i} |a_i - a_j|^2 = n \sum_i |a_i - c|^2$$
where c is the centroid of the set of points.
Proof: Lemma 8.2 states that for every x,
$$\sum_i |a_i - x|^2 = \sum_i |a_i - c|^2 + n|c - x|^2.$$
Letting x range over all $a_j$ and summing the n equations yields
$$\sum_{i,j} |a_i - a_j|^2 = n \sum_i |a_i - c|^2 + n \sum_j |c - a_j|^2 = 2n \sum_i |a_i - c|^2.$$
Observing that $\sum_{i,j} |a_i - a_j|^2 = 2 \sum_{j>i} |a_i - a_j|^2$ yields the result that
$$\sum_{j>i} |a_i - a_j|^2 = n \sum_i |a_i - c|^2.$$
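Both identities (Lemmas 8.2 and 8.4) are easy to sanity-check numerically; a small sketch assuming NumPy, on a random point set of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(20, 3))    # 20 points in 3 dimensions
c = a.mean(axis=0)              # centroid
x = rng.normal(size=3)          # an arbitrary point

# Lemma 8.2: sum_i |a_i - x|^2 = sum_i |a_i - c|^2 + n |c - x|^2
lhs = np.sum(np.linalg.norm(a - x, axis=1) ** 2)
rhs = (np.sum(np.linalg.norm(a - c, axis=1) ** 2)
       + len(a) * np.linalg.norm(c - x) ** 2)
assert np.isclose(lhs, rhs)

# Lemma 8.4: sum_{j > i} |a_i - a_j|^2 = n * sum_i |a_i - c|^2
pairwise = sum(np.linalg.norm(a[i] - a[j]) ** 2
               for i in range(len(a)) for j in range(i + 1, len(a)))
assert np.isclose(pairwise, len(a) * np.sum(np.linalg.norm(a - c, axis=1) ** 2))
```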
The k-means clustering algorithm

A natural algorithm for k-means clustering is given below. There are two unspecified aspects of the algorithm. One is the set of starting centers and the other is the stopping condition.
k-means algorithm
Start with k centers.
Cluster each point with the center nearest to it.
Find the centroid of each cluster and replace the set of old centers with the centroids.
Repeat the above two steps until the centers converge (according to some criterion).
The k-means algorithm always converges, but possibly to a local minimum. To show convergence, we argue that the cost of the clustering, the sum of the squares of the distances of each point to its cluster center, always improves. Each iteration consists of two steps. First, consider the step which finds the centroid of each cluster and replaces the old centers with the new centers. By Corollary 8.3, this step improves the sum of internal cluster distances squared. The other step reclusters by assigning each point to its nearest cluster center, which also improves the internal cluster distances.
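Below is a minimal NumPy sketch of the algorithm. The initialization (k random data points) and the stopping rule (total center movement below a tolerance) are our own choices, since the text leaves both unspecified:

```python
import numpy as np

def lloyd_kmeans(points, k, max_iters=100, tol=1e-8, seed=0):
    """Lloyd's algorithm for k-means (a sketch)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # step 1: assign each point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: replace each center by the centroid of its cluster
        # (the best center for a fixed cluster, by Corollary 8.3)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)
```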
8.4 Meaningful Clustering via Singular Value Decomposition
Optimizing a criterion such as k-means is often not an end in itself. It is a means to finding a good (meaningful) clustering. How do we define a meaningful clustering? Here is a possible answer: an optimal clustering is meaningful if it is unique, in the sense that any other nearly optimal clustering agrees with it on most data points. We will formalize this below. But the bad news is that we will soon see that this is too much to ask for and many common data sets do not admit such a clustering. Luckily though, the discussion will lead us to a weaker requirement which has the twin properties of being met by many data sets as well as admitting an efficient (SVD-based) algorithm to find the clustering. We start with some notation.
We denote by n the number of (data) points to be clustered; they are listed as the rows $A_i$ of an $n \times d$ matrix A. A clustering (partition) of the data points is represented by a matrix of cluster centers C which is also an $n \times d$ matrix; the $i^{th}$ row of C is the center of the cluster that $A_i$ belongs to. So C has only k distinct rows. We refer to A as the data and to C as the clustering.

Definition: The cost of the clustering C is the sum of squared distances to the cluster centers; so we have
$$\text{cost}(A, C) = ||A - C||_F^2.$$
The mean squared distance (MSD) of a clustering C is just $\text{cost}(A, C)/n$.

We will say two clusterings of A differ in s points if s is the minimum number of data points to be reassigned to get from one clustering to the other. [Note: a clustering is specified by just a partition; the cluster centers are just the centroids of the data points in a cluster.] Here is the first attempt at defining when a clustering is meaningful:
A k-clustering C is meaningful if every k-clustering C′ which differs from C in at least $\varepsilon n$ points has cost at least $(1 + \Omega(\varepsilon))\,\text{cost}(C)$. [$\varepsilon$ is a small constant.]
Claim 8.1 If C is meaningful under this definition, and each of its clusters has $\Omega(n)$ data points in it, then for any two cluster centers $\mu, \mu'$ of C,
$$|\mu - \mu'|^2 \geq \Omega(\text{MSD}(C)).$$
Proof: We will prove the claim by showing that if two cluster centers in C are too close, then we may move $\varepsilon n$ points from one cluster to the other without increasing the cost by more than a factor of $(1 + O(\varepsilon))$, thus contradicting the assumption that C is meaningful.

Let T be the cluster with cluster center $\mu$. Project each data point $A_i \in T$ onto the line through $\mu, \mu'$, and let $d_i$ be the distance of the projected point to $\mu$. Let $T_1$ be the subset of T whose projections land on the $\mu'$ side of $\mu$ and let $T_2 = T \setminus T_1$ be the points whose projections land on the other side. Since $\mu$ is the centroid of T, we have $\sum_{i \in T_1} d_i = \sum_{i \in T_2} d_i$. Since each $A_i \in T$ is closer to $\mu$ than to $\mu'$, for $i \in T_1$ we have $d_i \leq |\mu - \mu'|/2$ and so $\sum_{i \in T_1} d_i \leq |T|\,|\mu - \mu'|/2$; hence also $\sum_{i \in T_2} d_i \leq |T|\,|\mu - \mu'|/2$. So,
$$\sum_{i \in T} d_i \leq |T|\,|\mu - \mu'|.$$
Now from the assumption that $|T| \in \Omega(n)$, we have $|T| \geq 2\varepsilon n$. So the $\varepsilon n^{th}$ smallest $d_i$ is at most $\frac{|T|}{|T| - \varepsilon n}|\mu - \mu'| \leq 2|\mu - \mu'|$. We can now get a new clustering C′ as follows: move the $\varepsilon n$ points $A_i$ in T with the smallest $d_i$ to the cluster with $\mu'$ as center. Recompute the centers. The recomputation of the centers can only reduce the cost (as we saw in Corollary 8.3). What about the move? The move can only add cost (distance squared) in the direction of the line joining $\mu, \mu'$, and this extra cost is at most $4\varepsilon n |\mu - \mu'|^2$. By the assumption that C is meaningful (under the proposed definition above), we must thus have $|\mu - \mu'|^2 \geq \Omega(\text{MSD}(C))$ as claimed.
But we will now see that the condition that $|\mu - \mu'|^2 \geq \Omega(\text{MSD}(C))$ is too strong for some common data sets. Consider two spherical Gaussians in d-space, each with variance 1 in each direction. Clearly, if we have data generated from an (equal weight) mixture of the two, the correct 2-clustering one would seek is to split them into the Gaussians they were generated from, with the actual centers of the Gaussians (or a very close point) as the cluster centers. But the MSD of this clustering is approximately d. So, by the Claim, for C to be meaningful, the centers must be $\Omega(\sqrt{d})$ apart. It is easy to see however that if the separation is $\Omega(\sqrt{\ln n})$, the clusters are distinct: the projection of each Gaussian to the line joining their centers is a 1-dimensional Gaussian of variance 1, so the probability that any data point lies more than a tenth of the way (in the projection) to the wrong center is at most $O(1/n^2)$, so by a union bound, all data points have their distance to the wrong center at least 10 times the distance to the correct center, provided we measure distances only in the projection. Since the variance is 1 in each direction, the mean squared distance to the cluster center in each direction is only O(1). So, in this example, it would make more sense to require an inter-center separation on the order of the square root of the maximum mean squared distance in any one direction to the cluster center. We will be able to achieve this in general.
Definition: Let A, C be the data and cluster centers matrices respectively. The mean squared distance in a direction (denoted MSDD(A, C)) is the maximum over all unit length vectors v of the mean squared distance of data points to their cluster centers in the direction v, namely,
$$\text{MSDD}(A, C) = \frac{1}{n} \max_{v : |v| = 1} |(A - C)v|^2 = \frac{1}{n} ||A - C||_2^2,$$
where we have used the basic definition of the largest singular value to get the last expression.
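Because of that last identity, MSDD is a one-liner to compute; a minimal NumPy sketch (the function name is our own):

```python
import numpy as np

def msdd(A, C):
    """MSDD(A, C) = (1/n) max_{|v|=1} |(A - C) v|^2, i.e. the squared
    largest singular value (operator norm) of A - C, divided by n."""
    n = A.shape[0]
    return np.linalg.norm(A - C, ord=2) ** 2 / n
```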
Theorem 8.5 Suppose there exists a k-clustering C of data points A with (i) $\Omega(n)$ data points per cluster and (ii) distance at least $\Omega(\sqrt{\text{MSDD}(A, C)})$ between any two cluster centers. Then any clustering returned by the following simple algorithm (which we henceforth call the SVD clustering of A) differs from C in at most $\varepsilon n$ points (here, the hidden constants in $\Omega$ depend on $\varepsilon$).

Find the SVD of A. Project the data points to the space of the top k (right) singular vectors of A. Return a 2-approximate⁸ k-means clustering in the projection.
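A sketch of this algorithm in NumPy, with the k-means subroutine passed in as a parameter; for illustration one could plug in the lloyd_kmeans sketch from Section 8.3, though Lloyd's algorithm is only a heuristic and does not by itself carry the 2-approximation guarantee the theorem assumes:

```python
import numpy as np

def svd_cluster(A, k, kmeans):
    """Sketch of the algorithm in Theorem 8.5.  `kmeans` is any k-means
    routine returning (centers, labels)."""
    # top k right singular vectors of A (rows of Vt)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k].T                     # d x k, orthonormal columns
    A_proj = A @ V_k @ V_k.T           # project each row onto their span
    _, labels = kmeans(A_proj, k)      # cluster the projected points
    return labels
```

For instance, `labels = svd_cluster(A, k, lloyd_kmeans)` runs the projection followed by the Lloyd's-algorithm sketch above.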
Remark: Note that if we did the approximate optimization in the whole space (without projecting), we would not succeed in general. In the example of the two spherical Gaussians above, for the correct clustering C, MSDD(A, C) = O(1). But if the inter-center separation is just O(1), then $\varepsilon n$ points of the first Gaussian may be put into the second at an added cost (where cost is distance squared in the whole space) of only $O(\varepsilon n)$ as we argued, whereas the cost of clustering C is O(nd).

⁸A 2-approximate clustering is one which has cost at most twice the optimal cost.
Remark: The Theorem also implies a sort of uniqueness of the clustering C, namely, that any other clustering satisfying both (i) and (ii) differs from C in at most $2\varepsilon n$ points, as seen from the following: the Theorem applies to the other clustering as well, since it satisfies the hypothesis. So, it also differs from the SVD clustering in at most $\varepsilon n$ points. Thus, C and the other clustering cannot differ in more than $2\varepsilon n$ points.
The proof of the Theorem will use the following two Lemmas, which illustrate the power of SVD. The first Lemma says that for any clustering C with $\Omega(n)$ points per cluster, the SVD clustering described in the Theorem finds cluster centers fairly close to the cluster centers in C, where "close" is measured in terms of MSDD(A, C). The argument will be that one candidate clustering in the SVD projection is to just use the same centers as C (projected to the space), and if the SVD clustering does not have a cluster center close to a particular cluster of C, it ends up paying too high a penalty compared to this candidate clustering to be 2-optimal.
Lemma 8.6 Suppose A is the data matrix and C a clustering with $\Omega(n)$ points in each cluster.⁹ Suppose C′ is the SVD clustering (described in the Theorem) of A. Then for every cluster center of C, there is a cluster center of C′ within distance $O(\sqrt{\text{MSDD}(A, C)})$.
Proof: Let $\Delta = \text{MSDD}(A, C)$. Let T be the set of data points in a cluster of C, and suppose for contradiction that the centroid $\mu$ of T has no cluster center of C′ at distance $O(\sqrt{k\Delta})$. Let $\tilde{A}_i$ denote the projection of $A_i$ onto the SVD subspace; so the SVD clustering actually clusters the points $\tilde{A}_i$. Recall the notation that $C'_i$ is the cluster center of C′ closest to $\tilde{A}_i$. The cost of a data point $A_i \in T$ in the SVD solution is
$$|\tilde{A}_i - C'_i|^2 = |(\mu - C'_i) - (\mu - \tilde{A}_i)|^2 \geq \tfrac{1}{2}|\mu - C'_i|^2 - |\mu - \tilde{A}_i|^2 \geq \Omega(k\Delta) - |\mu - \tilde{A}_i|^2,$$
where we have used $|a - b|^2 \geq \tfrac{1}{2}|a|^2 - |b|^2$ for any two vectors a and b.

Adding over all points in T, the cost of C′ is at least $\Omega(k\Delta n) - ||\tilde{A} - C||_F^2$ (note that for $i \in T$, $C_i = \mu$, so $\sum_{i \in T} |\mu - \tilde{A}_i|^2 \leq ||\tilde{A} - C||_F^2$). Now, one way to cluster the points $\tilde{A}_i$ is to just use the same cluster centers as C; the cost of this clustering is $||\tilde{A} - C||_F^2$. So the optimal clustering of the points $\tilde{A}_i$ costs at most $||\tilde{A} - C||_F^2$, and since the algorithm finds a 2-approximate clustering, the cost of the SVD clustering is at most $2||\tilde{A} - C||_F^2$. So we get
$$2||\tilde{A} - C||_F^2 \geq \Omega(k\Delta n) - ||\tilde{A} - C||_F^2 \implies ||\tilde{A} - C||_F^2 \geq \Omega(k\Delta n).$$
We will prove that $||\tilde{A} - C||_F^2 \leq 5k\Delta n$ in Lemma 8.7. By (i), we have $k \in O(1)$, so $||\tilde{A} - C||_F^2 = O(\Delta n)$. So we get a contradiction (with a suitable choice of constants under the $\Omega$).
Note that in the Lemma below, the main inequality has Frobenius norm on the left hand
side, but only operator norm on the right hand side. This makes it stronger than having
either Frobenius norm on both sides or operator norm on both sides.
⁹Note that C is not assumed to satisfy condition (ii) of the Theorem.
Lemma 8.7 Suppose A is an $n \times d$ matrix and suppose C is an $n \times d$ rank k matrix. Let $\tilde{A}$ be the best rank k approximation to A found by SVD. Then $||\tilde{A} - C||_F^2 \leq 5k\,||A - C||_2^2$.
Proof: Let $u_1, u_2, \ldots, u_k$ be the top k singular vectors of A. Extend the set of the top k singular vectors to an orthonormal basis $u_1, u_2, \ldots, u_p$ of the vector space spanned by the rows of $\tilde{A}$ and C. Note that $p \leq 2k$ since $\tilde{A}$ is spanned by $u_1, u_2, \ldots, u_k$ and C is of rank at most k. Then,
$$||\tilde{A} - C||_F^2 = \sum_{i=1}^{k} |(\tilde{A} - C)u_i|^2 + \sum_{i=k+1}^{p} |(\tilde{A} - C)u_i|^2.$$
Since $\{u_i \mid 1 \leq i \leq k\}$ are the top k singular vectors of A and since $\tilde{A}$ is the rank k approximation to A, for $1 \leq i \leq k$, $Au_i = \tilde{A}u_i$ and thus $|(\tilde{A} - C)u_i|^2 = |(A - C)u_i|^2$. For $i > k$, $\tilde{A}u_i = 0$, thus $|(\tilde{A} - C)u_i|^2 = |Cu_i|^2$. From this it follows that
$$||\tilde{A} - C||_F^2 = \sum_{i=1}^{k} |(A - C)u_i|^2 + \sum_{i=k+1}^{p} |Cu_i|^2 \leq k||A - C||_2^2 + \sum_{i=k+1}^{p} |Au_i + (C - A)u_i|^2.$$
Using $|a + b|^2 \leq 2|a|^2 + 2|b|^2$,
$$||\tilde{A} - C||_F^2 \leq k||A - C||_2^2 + 2\sum_{i=k+1}^{p} |Au_i|^2 + 2\sum_{i=k+1}^{p} |(C - A)u_i|^2 \leq k||A - C||_2^2 + 2(p - k)\,\sigma_{k+1}^2(A) + 2(p - k)\,||A - C||_2^2.$$
Since $p \leq 2k$ implies $p - k \leq k$,
$$||\tilde{A} - C||_F^2 \leq k||A - C||_2^2 + 2k\,\sigma_{k+1}^2(A) + 2k\,||A - C||_2^2. \qquad (8.1)$$
As we saw in Chapter 4, for any rank k matrix B, $||A - B||_2 \geq \sigma_{k+1}(A)$ and so $\sigma_{k+1}(A) \leq ||A - C||_2$; plugging this in, we get the Lemma.
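As a sanity check, the inequality of Lemma 8.7 can be verified numerically; a short NumPy sketch on a random instance of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 30, 4
A = rng.normal(size=(n, d))

# C: an n x d matrix of rank at most k (each row is one of k candidate centers)
centers = rng.normal(size=(k, d))
C = centers[rng.integers(0, k, size=n)]

# A_tilde: best rank-k approximation to A, via the SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = (U[:, :k] * s[:k]) @ Vt[:k]

lhs = np.linalg.norm(A_tilde - C, 'fro') ** 2     # ||A_tilde - C||_F^2
rhs = 5 * k * np.linalg.norm(A - C, ord=2) ** 2   # 5k ||A - C||_2^2
assert lhs <= rhs                                 # Lemma 8.7
```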
Proof (of Theorem 8.5): Let $\Delta = \sqrt{\text{MSDD}(A, C)}$. We will use Lemma 8.6. To every cluster center $\mu$ of C, there is a cluster center $\nu$ of the SVD clustering within distance $O(\Delta)$. Also, since C satisfies (ii) of the Theorem, the mapping from $\mu$ to $\nu$ is 1-1. C and the SVD clustering differ on a data point $A_i$ if the following happens: $A_i$ belongs to cluster center $\mu$ of C whose closest cluster center in C′ is $\nu$, but $\tilde{A}_i$ (the projection of $A_i$