0% found this document useful (0 votes)
6 views20 pages

Big Data Algo

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views20 pages

Big Data Algo

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

Turning Big data into tiny data:

Constant-size coresets for k-means, PCA and projective clustering


Dan Feldman∗ Melanie Schmidt† Christian Sohler†

Abstract 1 Introduction
We prove that the sum of the squared Euclidean
Big Data. Scientists regularly encounter limitations
distances from the n rows of an n × d matrix A to any
due to large data sets in many areas. Data sets grow
compact set that is spanned by k vectors in Rd can
in size because they are increasingly being gathered
be approximated up to (1 + ε)-factor, for an arbitrary
by ubiquitous information-sensing mobile devices,
small ε > 0, using the O(k/ε2 )-rank approximation of
aerial sensory technologies (remote sensing), genome
A and a constant. This implies, for example, that the
sequencing, cameras, microphones, radio-frequency
optimal k-means clustering of the rows of A is (1 + ε)-
identification chips, finance (such as stocks) logs,
approximated by an optimal k-means clustering of
internet search, and wireless sensor networks [30, 38].
their projection on the O(k/ε2 ) first right singular
The world’s technological per-capita capacity to store
vectors (principle components) of A.
information has roughly doubled every 40 months
A (j, k)-coreset for projective clustering is a small
since the 1980s [31]; as of 2012, every day 2.5
set of points that yields a (1 + ε)-approximation to
etabytes(2.5 × 1018 ) of data were created [4]. Data
the sum of squared distances from the n rows of A
sets as the ones described above and the challenges
to any set of k affine subspaces, each of dimension at
involved when analyzing them is often subsumed in
most j. Our embedding yields (0, k)-coresets of size
the term Big Data.
O(k) for handling k-means queries, (j, 1)-coresets of
Gartner, and now much of the industry use the
size O(j) for PCA queries, and (j, k)-coresets of size
“3Vs” model for describing Big Data [14]: increasing
(log n)O(jk) for any j, k ≥ 1 and constant ε ∈ (0, 1/2).
volume n (amount of data), its velocity (update time
Previous coresets usually have a size which is linearly
per new observation) and its variety d (dimension,
or even exponentially dependent of d, which makes
or range of sources). The main contribution of this
them useless when d ∼ n.
paper is that it deals with cases where both n and d
Using our coresets with the merge-and-reduce ap-
are huge, and does not assume d  n.
proach, we obtain embarrassingly parallel streaming
algorithms for problems such as k-means, PCA and Data analysis. Classical techniques to analyze
projective clustering. These algorithms use update and/or summarize data sets include clustering, i.e.
time per point and memory that is polynomial in log n the partitioning of data into subsets of similar char-
and only linear in d. acteristics, and dimension reduction which allows to
For cost functions other than squared Euclidean consider the dimensions of a data set that have the
distances we suggest a simple recursive coreset con- highest variance. In this paper we mainly consider
O(1)
struction that produces coresets of size k 1/ε for problems that minimize the sum of squared error, i.e.
k-means and a special class of bregman divergences we try to find a set of geometric centers (points, lines
that is less dependent on the properties of the squared or subspaces), such that the sum of squared distances
Euclidean distance. from every input point to its nearest center is mini-
mized.
Examples are the k-means or sum of squares clus-
tering problem where the centers are points. An-
other example is the j-subspace mean problem, where
k = 1 and the center is a j-subspace, i. e., the sum
of squared distances to the points is minimized over
∗ MIT, Distributed Robotics Lab.
all j-subspaces. The j-rank approximation of a ma-
Email: dan-
[email protected] trix is the projection of its rows on their j-subspace
† TU Dortmund, Germany, Email: {melanie.schmidt, mean. Principal component analysis (PCA) is an-
christian.sohler}@tu-dortmund.de other example where k = 1, and the center is an

1434 Copyright © SIAM.


Unauthorized reproduction of this article is prohibited.
affine subspace. Constrained versions of this prob- while passing over the streaming data. Our paper
lem (that are usually NP-hard) include the non- support this non-relational model in the sense that,
negative matrix factorization (NNMF) [2], when the unlike most of previous results, we do not assume
j-subspace should be spanned by positive vectors, and that either d or n are bounded or known in advance.
Latent Dirichlet Allocation (LDA) [3] which general- We only assume that the first coordinates (i, j) are
izes NNMF by assuming a more general prior distri- increasing for every new inserted value.
bution that defines the probability for every possible
subspace. 2 Related work
The most general version of the problem that we
study is the linear or affine j-Subspace k-Clustering Coresets. The term coreset was coined by Agarwal,
problem, where the centers are j-dimensional linear Har-Peled and Varadarajan [10] in the context of
or affine subspaces and k ≥ 1 is arbitrary. extend measures of point sets. They proved that
In the context of Big Data it is of high interest to every point set P contains a small subset of points
find methods to reduce the size of the streaming data such that for any direction, the directional width of
but keep its main characteristics according to these the point set will be approximated. They used their
optimization problems. result to obtain kinetic and streaming algorithms
to approximately maintain extend measures of point
Coresets. A small set of points that approximately sets.
maintains the properties of the original point with Application to Big Data. An off-line coreset
respect to a certain problem is called a coreset (or construction can immediately turned into streaming
core-set). Coresets are a data reduction technique, and embarrassingly parallel algorithms that use small
which means that they tackle the first two ‘Vs’ of Big amount of memory and update time. This is done
Data, volume n and velocity (update time), for a large using a merge-and-reduce technique as explained in
family of problems in machine learning and statistics. Section 10. This technique makes coresets a practi-
Intuitively, a coreset is a semantic compression of a cal and provably accurate tool for handling Big data.
given data set. The approximation is with respect to The technique goes back to the work of Bentley and
a given (usually infinite) set Q of query shapes: for Saxe [11] and has been first applied to turn core-
every shape in Q the sum of squared distances from set constructions into streaming algorithms in [10].
the original data and the coreset is approximately Popular implementations of this technique include
the same. Running optimization algorithms on the Hadoop [40].
small coreset instead of the original data allows us
to compute the optimal query much faster, under Coresets for k-points clustering (j = 0). The
different constraints and definition of optimality. first coreset construction for clustering problems was
Coresets are usually of size at most logarithmic done by Badoiu, Har-Peled and Indyk [13], who
in the number n of observations, and have similar showed that for k-center, k-median and k-means clus-
update time per point. However, they do not handle tering, an approximate solution has a small witness
the variety of sources d in the sense that their size is set (a subset of the input points) that can be used
linear or even exponential in d. In particular, existing to generate the solution. This way, they obtained im-
coreset are useless for dealing with Big Data when proved clustering algorithms. Har-Peled and Mazum-
d ∼ n. In this paper we suggest coresets of size dar [29] gave a stronger definition of coresets for k-
independent of d , while still independent or at most median and k-means clustering. Given a point set P
logarithmic in n. this definition requires that for any set of k centers C
the cost of the weighted coreset S with respect to P is
Non-SQL databases. Big data is difficult to work approximated upto a factor of (1 + ε). Here, S is not
with using relational databases of n records in d necessarily a subset of P . We refer to their definition
columns. Instead, in NOSQL is a broad class of as a strong coreset.
database management systems identified by its non- Har-Peled and Kushal [28] showed strong coresets
adherence to the widely used relational database for low-dimensional space, of size independent of
management system model. In NoSQL, every point the number of input points n. Frahling and Sohler
in the input stream consists of tuples (object, feature, designed a strong coreset that allows to efficiently
value), such as (”Scott”, ”age”, 25). More generally, maintain a coreset for k-means in dynamic data
the tuple can be decoded as (i, j, value) which means streams [24]. The first construction of a strong coreset
that the entry of the input matrix in the ith row and for k-median and k-means of size polynomial in the
jth column is value. In particular, the total number dimension was done by Chen [15]. Langberg and
of observations and dimensions (n and d) is unknown Schulman [35] defined the notion of sensitivity of an

1435 Copyright © SIAM.


Unauthorized reproduction of this article is prohibited.
input point, and used it to construct strong coresets of the considered subspace j. Feldman and Langberg
size O(k 2 d2 /ε2 ), i.e, independent of n. Feldman and [21] gave a strong coreset whose size is polynomial in
Langberg [21] showed that small total sensitivity and both d and j.
VC-dimension for a family of shapes yields a small
coreset, and provided strong coresets of size O(kd/ε2 ) Constrained optimization (e.g. NNMF, LDA).
for the k-median problem. Non-negative matrix factorization (NNMF) [2], and
Feldman, Monemizadeh and Sohler [22] gave a Latent Dirichlet Allocation (LDA) [3] are constrained
construction of a weak coreset of size independent of versions of the j-subspace problem. In NNMF the de-
both the number of input points n and the dimension sired j-subspace should be spanned by positive vec-
d. The disadvantage of weak coresets is that, unlike tors, and LDA is a generalization of NNMF, where
the result in this paper, they only give a guarantee we are given additinal prior probabilities (e.g., mul-
for centers coming from a certain set of candidates tiplicative cost) for every candidate solution. An im-
(instead of any set of centers as in the case of strong portant practical advantage of coresets (unlike weak
coresets). coresets) is that they can be used with existing al-
Feldman and Schulman [23] developed a strong gorithms and heuristics for such constrained opti-
coreset for the k-median problem when the centers are mization problems. The reason is that strong core-
weighted, and for generalized distance functions that set approximates the sum of squared distances to any
handle outliers. There are also efficient implemen- given shape from a family of shapes (in this case, j-
tation of k-means approximation algorithm based on subspaces), independently for a specific optimization
coresets, both in the streaming [9] and non-streaming problem. In particular, running heuristics for a con-
setting [25]. strained optimization problem on an ε-coreset would
yield the same quality of approximations compared to
j-subspace clustering (k = 1). Coresets have also such run on the original input, up to provable (1 + ε)-
been developed for the subspace clustering problem. approximation. While NNMF and LDA are NP-hard,
The j-dimensional subspace of Rd that minimizes coresets for j-subspaces can be constructed in poly-
the sum of squared distances
 to n input points can nomial (and usually practical) time.
be computed in O(min nd2 , dn2 ) time using the
Singular Value Decomposition (SVD). The affine j- k-lines clustering (j = 1). For k-lines clustering,
dimensional subspace (i.e, flat that does not intersect Feldman, Fiat and Sharir [19] show strong coresets for
the origin) that minimizes this sum can be computed low-dimensional space, and [21] improve the result for
similarly using principle component analysis (PCA), high-dimensional space. In [23] the size of the coreset
which is the SVD technique applied after translating reduced to be polynomial in 2O(k) log n.
 Har-Peled
the origin of the input points to their mean. proved a lower bound of min 2k , log n for the size
In both cases, the solution for the higher di- of such coresets.
mensional (j + 1)-(affine)-subspace clustering prob- Projective clustering (k, j > 1). For general
lem can be obtained by extending the solution to the projective clustering, where k, j ≥ 1, Har-Peled [27]
j-(affine)-subspace clustering problem by one dimen- showed that there is no strong coreset of size sub-
sion. This implicitely defines an orthogonal basis of linear in n, even for the family of pair of planes in R3
Rd . The vectors of these bases are called right singu- (j = k = 2, d = 3). However, recently Varadarajan
lar vectors, for the subspace case, and principle com- and Xiao [36] showed that there is such a coreset if the
ponents for the affine case. input points are on an integer grid whose side length
A line of research developed weak coresets for is polynomial in n.
faster approximations [16, 17, 39, 26]. These results
are usually based on the idea to sample sets of points Bregman divergences. Bregman divergences are
with probability roughly proportional to the volume a class of distance measures that are used frequently
of the simplex spanned by the points, because such in machine learning and they include l22 . Banerjee et
a simplex will typically have a large extension in the al. [12] generalize the Lloyd’s algorithm to Bregman
directions defined by the first left singular vectors. divergences for clustering points (j = 0). In general,
See recent survery in [8] for weak coresets. Bregman divergences have singularities at which the
Feldman, Fiat and Sharir [19] developed a strong cost might go to infinity. Therefore, researchers
coreset for the subspace approximation problem in studied clustering for so-called μ-similar Bregman
low dimensional spaces, and Feldman, Monemizadeh, divergences, where there is an upper and lower bound
Sohler and Woodruff [20] provided such a coreset for by constant μ times a Mahalanobis distance [7].
high dimensional spaces that is polynomial in the The only known coreset construction for clustering
dimension of the input space d and exponential in under Bregman divergences is a weak coreset of
3
1436 Copyright © SIAM.
Unauthorized reproduction of this article is prohibited.
size O(k log n/ε2 log(|Γ|k log n)) by Ackermann and by
Blömer [6].

n
(3.1) dist2 (A, S) = dist2 (pi , S).
3 Background and Notation i=1
We deal with clustering problems on point sets in
Euclidean space, where a point set is represented by Thus, if L is a linear j-subspace and Y is a
the rows of a matrix A. The input matrix A and all matrix with n − j orthonormal columns spanning
other matrices in this paper are over the reals. L⊥ , then dist(p, L) = pT Y 2 . We generalize the
notation to n × n matrices and define, dist(A, L) =
Notations and Assumptions. The number of n dist(Ai∗ , L) for any compact set L. Here Ai∗ is
i=1
input points is denoted by n. For simplicity of the ith row of A. Furthermore, we write dist2 (A, L) =
notation, we assume that d = n and thus A is an  n  2
i=1 dist(Ai∗ , L) .
n × n matrix. Otherwise, we add (n − d) columns (or
d − n rows) containing all zeros to A. We label the Definition 1. (Linear (Affine) j-Subspace k-
entries of A by ai,j . The ith row of A will be denoted Clustering) Given a set of n points in n-dimensional
as Ai∗ and the jth columns as A∗j . space as an n × n matrix A, the k j-subspace
The identity matrix of Rj is denoted as I ∈ Rj×j . clustering problem is to find a set L of k linear
For a matrix X with entries x i,j , we denote the (affine) j-dimensional subspaces L1 , . . . , Lk of Rd
 2
Frobenius norm of X by X2 = i,j xi,j . We say
that minimizes the sum of squared distances to the
nearest subspace, i.e.,
that a matrix X ∈ R n×j
has orthonormal columns
if its columns are orthogonal unit vectors. Such a  n
matrix is also called orthogonal matrix. Notice that cost(A, L) = min dist2 (Ai∗ , Lj )
T j=1...,k
every orthogonal matrix X satisfies X X = I. i=1
The columns of a matrix X span a linear sub-
is minimized over every choice of L1 , · · · , Lk .
space L if all points in L can be written as linear
combinations of the columns of A. This implies that
the columns contain a basis of L. Singular Value Decomposition. An important
A j-dimensional linear subspace L ⊆ R will tool from linear algebra that we will use in this
n

be represented by an n × j basis matrix X with paper is the singular value decomposition T


of a matrix
orthonormal columns that span L. The projection A. Recall that A = U DV is the Singular Value
of a point set (matrix) A on a linear subspace L Decomposition (SVD) of A if U, V ∈ R n×n
are
represented by X will be the matrix (point set) orthogonal matrices, and D ∈ R n×n
is a diagonal
AX ∈ Rn×j . These coordinates are with respect to matrix with non-increasing entries. Let (s1 , · · · , sn )
2
the column space of X. The projections of A on L denote the diagonal of D .
using the coordinates of Rn are the rows of AXX T . For an integer j between 0 to n, the first j
columns of V span a linear subspace L∗ that minimize
Distances to Subspaces. We will often compute the sum of squared distances to the points (rows) of
the squared Euclidean distance of a point set given as A, over all j-dimensional linear subspaces in Rn . This
a matrix A to a linear subspace L. This distance is sum equals sj+1 + . . . + sn , i.e. for any n × (n − j)
given by AY 22 , where Y is an n×(n−j) matrix with matrix Y with orthonormal columns, we have
orthonormal columns that span L⊥ , the orthogonal
complement of L. Therefore, we will also sometimes n
2
represent a linear subspace L by such a matrix Y . (3.2) AY  2 ≥ si .
i=j+1
An affine subspace is a translation of a linear
subspace and as such can be written as p + L, where The projection of the points of A on L∗ are the rows
p ∈ Rn is the translation vector and L is a linear of U D. The sum of the squared projection of the
subspace. points of A on L∗ is s1 + . . . + sj . Note that the sum
For a compact set S ⊆ Rd and a vector p in Rd , of squared distances to the origin (the optimal and
we denote the Euclidean distance between p and (its only 0-subspace) is s1 + . . . + sn .
closest points in) S by
Coresets. In this paper we introduce a new notion
dist2 (p, S) := minp − s22 . of coresets, which is a small modification of the
s∈S
earlier definition by Har-Peled and Mazumdar [29]
For an n × d matrix A whose rows are p1 , · · · , pn , we for the k-median and k-means problem (here adapted
define the sum of the squared distances from A to S to the more general setting of subspace clustering)

1437 Copyright © SIAM.


Unauthorized reproduction of this article is prohibited.
that is commonly used in coreset constructions for Corollary 4.1. Let P be a set of points in Rd , and
these problems. The new idea is to allow to add a k, j ≥ 0 be a pair of integers. There is a set Q in
constant C to the cost of the coreset. Interestingly, Rd , a weight function w : Q → [0, ∞) and a constant
this simple modification allows us to obtain improved c > 0 such that the following holds.
coreset constructions. For every set B which is the union of k affine
j-subspaces of Rd we have
Definition 2. Let A be an n × d matrix whose rows
 
represents n points in Rd . An m × n matrix M (1 − ε) dist2 (p, B) ≤ w(p)dist2 (p, B) + c
is called (k, ε)-coreset for the j-subspace k-clustering p∈P p∈Q
problem of A, if there is a constant c such that for 
2
every choice of k j-dimensional subspaces L1 , . . . , Lk ≤(1 + ε) dist (p, B),
we have p∈P

(1−ε) cost(A, L) ≤ cost(M, L)+c ≤ (1+ε) cost(A, L). and


1. |Q| = O(j/ε) if k = 1
4 Our results
In this section we summarize our results. 2. |Q| = O(k 2 /ε4 ) if j = 0
The main technical result is a proof that the sum 3. |Q| = poly(2k log n, 1/ε) if j = 1,
of squared distances from a set of points in Rd (rows
of an n × d matrix A) to any other compact set 4. |Q| = poly(2kj , 1/ε) if j, k > 1, under the
that is spanned by k vectors of Rd can be (1 + ε)- assumption that the coordinates of the points in
approximated using the O(k/ε)-rank approximation P are integers between 1 and nO(1) .
of A, together with a constant that depends only on
In particular the size of Q is independent of d.
A. Here, a distance between a point p to a set is the
Euclidean distance of p to the closest point in this set.
The O(k/ε)-rank approximation of A is the projection PCA and k-rank approximation. For the first
of the rows of A on the k-dimensional subspace that case of the last theorem, we obtain a small coreset
minimizes their sum of squared distances. Hence, for k-dimensional subspaces with no multiplicative
we prove that the low rank approximation of A can weights, which contains only O(k/ε) points in Rd .
be considered as a coreset for its n rows. While That is, its size is independent of both d and n.
the coreset also has n rows, its dimensionality is Corollary 4.2. Let A be an n × d matrix. Let
independent of d, but only on k and the desired error. m = k/ε + k − 1 for some k ≥ 1 and 0 < ε < 1
Formally: and suppose that m ≤ n − 1. Then, there is an m × d
Theorem 4.1. Let A be an n × d matrix, k ≥ 1 be matrix à and a constant c ≥ 1, such that for every
an integer and 0 < ε < 1. Suppose that Am is the m- k-subspace S of R we have
d

rank approximation of A, where m := bk/ε2 ≤ n for a


(1 − ε)dist2 (A, S) ≤ dist2 (Ã, S) + c
sufficiently large constant b. Then for every compact (4.3)
set S that is contained in a k-dimensional subspace of ≤(1 + ε)dist2 (A, S)
Rd , we have
Equality (4.3) can also be written using matrix nota-
(1 − ε)dist2 (A, S) ≤ dist2 (Am , S) + A − Am 2 tion: for every d × (d − k) matrix Y whose columns
2 are orthonormal we have
≤(1 + ε)dist (A, S),
(1 − ε)AY 2 ≤ ÃY  + c
where dist2 (A, S) is the sum of squared distances from
each row on A to its closest point in S. ≤(1 + ε)AY 2 .

Note that A takes nd space, while the pair Am Notice that in Theorem 4.1, Am is still n-
and the constant A−Am  can be stored using nm+1 dimensional and holds n points. We obtain Corol-
space. lary 4.2 by obbserving that Am X = D(m) V T X
for every X ∈ Rn×n−j . As the only non-zero entries
Coresets. Combining our main theorem with known
of D(m) V T are in the first m rows, we can store these
results [21, 19, 36] we gain the following coresets for
as the matrix Ã. This can be considered as an exact
projective clustering. All the coresets can handle
coreset for Am .
Big data (streaming, parallel computation and fast
update time), as explained in Section 10 and later in Streaming. In the streaming model, the input
this section. points (rows of A) arrive on-line (one by one) and
5
1438 Copyright © SIAM.
Unauthorized reproduction of this article is prohibited.
we need to maintain the desired output for the points Latent Drichlet analysis (LDA) [3] is a generaliza-
that arrived so far. We aim that both the update tion of NNMF, where a prior (multiplicative weight)
time per point and the required memory (space) will is given for every possible k-subspace in Rd . In prac-
be small. Usually, linear in d and polynomial in log n.tice, especially when the corresponding optimization
For computing the k-rank approximation of a problem is NP-hard (as in the case of NNMF and
matrix A efficiently, in parallel or in the streaming LDA), running popular heuristics on the coreset pair
model we cannot use Corollary 4.2 directly: First, Ã and c may not only turn them into faster, stream-
because it assumes that we already have the O(k/ε)- ing and parallel algorithms. It might actually yield
rank approximation of A, and second, that A assumed better results (i.e, “1 − ε” approximations) compared
to be in memory which takes nd space. However, to running the heuristics on A; see [33].
using merge-and-reduce we only need to apply the In principle component analysis (PCA) we usu-
construction of the theorem on very small matrices ally interested in the affine k-subspace that mini-
A of size independent of d in overall time that is mizes the sum of squared distances to the rows of
linear in both n and d, and space that is logarithmic A. That is, the subspace may not intersect the ori-
in O(log n). The construction is also embarrassingly gin. To this end, we replace k by k + 1 in the previous
parallel; see Fig. 3 and discussion in Section 10. theorems, and compute the optimal affine k-subspace
The following corollary follows from Theo- rather than the (k + 1) optimal subspace of the small
rem 10.1 and the fact that computing the SVD for matrix Ã.
an m × d matrix takes O(dm2 ) time when m ≤ d. For the k-rank approximation we use the follow-
ing corollary with Z as the empty set of constraints.
Corollary 4.3. Let A be the n × d matrix whose n
Otherwise, for PCA, NNMF, or LDA we use the cor-
rows are vectors seen so far in a stream of vectors in
responding constraints.
Rd . For every n ≥ 1 we can maintain a matrix à and
c ≥ 0 that satisfies (4.3) where Corollary 4.4. Let A be an n × d matrix. Let
Ak denote an n × k matrix of rank at most k that
1. Ã is of size 2m × 2m for m = k/ε
minimizes A−Ak 2 among a given (possibly infinite)
2. The update time per row insertion, and overall set Z of such matrices. Let à and c be defined as
space used is in Corollary 4.3, and let Ãk denote the matrix that
minimizes à − Ãk 2 among the matrices in Z. Then
O(1)
k log n
d· A − Ãk 2 ≤ (1 + ε)A − Ak 2 .
ε

Using the last corollary, we can efficiently com- Moreover, A − Ak 2 can be approximated using
pute a (1 + ε)-approximation to the subspace that à and c, as
minimizes the sum of squared distances to the rows
(1 − ε)A − Ãk 2 ≤ Ã − Ãk 2 + c
of a huge matrix A. After computing à and c for
A as described in Corollary 4.3, we compute the k- ≤(1 + ε)A − Ãk 2 .
subspace S ∗ that minimizes the sum of squared dis-
tances to the small matrix Ã. By (4.3), S ∗ approxi- k-means. The k-mean of A is the set S ∗ of k points
mately minimizes the sum of squared distances to the in Rd that minimizes the sum of squared distances
rows of A. To obtain an approximation to the k-rank dist2 (A, S ∗ ) to the rows of A among every k points in
approximation of A, we project the rows of A on S ∗ Rd . It is not hard to prove that the k-mean of the k-
in O(ndk) time. rank approximation Ak of A is a 2-approximation for
Since à approximates dist2 (A, S) for any k- the k-mean of A in term of sum of squared distances
subspace of Rd (not only S ∗ ), computing the sub- to the k centers [18]. Since every set of k points is
space that minimizes dist2 (Ã, S) under arbitrary con- contained in a k-subspace of Rd , the k-mean of Am
straints would yield a (1 + ε)-approximation to the is a (1 + ε)-approximation to the k-means of A in
subspace that minimizes dist2 (A, S) under the same term of sum of squared distances. Since the k-mean
constraints. Such problems include the non-negative of à is clearly in the span of Ã, we conclude from our
matrix factorization (NNMF, also called pLSA, or main theorem the following corollary that generalizes
probabilistic LSA) which aims to compute a k- the known results from 2-approximation to (1 + ε)-
subspace S ∗ that minimizes sum of squared distances approximation.
to the rows of A as defined above, with the addi-
tional constraint that the entries of S ∗ will all be Corollary 4.5. Let Am denote the m = bk/ε2 rank
non-negative. approximation of an n × d matrix A, where b is

1439 Copyright © SIAM.


Unauthorized reproduction of this article is prohibited.
a sufficiently large constant. Then the sum of the probability 1, no dependency of time and space on
squared distances from the rows of A to the k-mean of δ). Second, the dimension of our subspace is k/ε ,
Am is a (1 + ε)-approximation for the sum of squared that is, independent of n and only linear in 1/ε.
distances to the k-mean of A. Finally, the sum of squared distances to any set that
is spanned by k-vectors is approximated, rather than
While the last corollary projects the input points the inter distance between each pair of input rows.
to a O(k/ε)-dimensional subspace, the number of
rows (points) is still n. We use existing coreset
constructions for k-means on the lower dimensional Generalizations for Bregman divergences. Our
points. These constructions are independent of the last result is a simple recursive coreset construction
number of points and linear in the dimension. Since that yields the first strong coreset for k-clustering
we apply these coresets on Am , the resulting coreset with μ-similar Bregman divergences. The coreset
O(1/ε)
size is independent of both n and d. we obtain has size k . The idea behind the
Notice the following ‘coreset’ of a similar type for construction is to distinguish between the cases that
the case k = 1. Let A be the mean of A. Then the the input point set can be clustered into k clusters at
following ‘triangle inequality’ holds for every point a cost of at most (1−ε) times the cost of a 1-clustering
s∈R : d or not. We apply the former case recursively on the
  clusters of an optimal (or approximate) solution until
dist2 (A, s) = dist2 (A, A ) + n · dist2 (A, s). either the cost of the clustering has dropped to at
most ε times the cost of an optimal solution for the
Thus, A forms an exact coreset consisting of one input point set or we reach the other case. In both
d-dimensional
  point together with the constant cases we can then replace the clustering by a so-called
dist2 (A, A ). clustering feature containing the number of points,
As in our result, and unlike previous coreset con- the mean and the sum of distances to the mean. Note
structions, this coreset for 1-mean is of size indepen- that our result uses a stronger assumption than [6]
dent of d and uses additive constant on the right hand because we assume that the divergence is μ-similar
side. Our results generalizes this exact simple coresets on the complete Rd while in [6] this only needs to
for k-means where k ≥ 2 while introducing (1 + ε)- hold for a subset X ⊂ Rd . However, compared to
approximation. [6] we gain a strong coreset, and the coreset size is
Unlike the k-rank approximation that can be independent of the number of points.
computed using the SVD of A, the k-means prob-
lem is NP-hard when k is not a constant, analog to 5 Implementation in Matlab
constrained versions (e.g. NNMF or LDA) that are
Our main coreset construction is easy to implement
also NP-hard. Again, we can run (possibly inefficient)
if a subroutine for SVD is already available, for
heuristics or constant factor approximations for com-
example, Algorithm 1 shows a very short Matlab
puting the k-mean of A under different constraints
implementation. The first part initializes the n × d-
in the streaming and parallel model by running the
matrix A with random entries, and also sets the
corresponding algorithms on Ã.
parameters j and ε. In the second part, the actual
Comparison to Johnson-Lindestrauss Lemma. coreset construction happens. We set m := j/ε +
The JL-Lemma states that projecting the n rows j − 1 according to Corollary 6.1. Then we calculate
of an n × d matrix A on a random subspace of the first m singular vectors of A and store it in the
dimension Ω(log(n)/ε2 ) in Rd preserves the Euclidean matrices U , D and V . Notice that for small matrices,
distance between every pair of the rows up to a it is better to use the full svd svd(A,0). Then, we
factor of (1 + ε), with high (arbitrary small constant)  compute the coreset matrix C and the constant c =
n
probability 1 − δ. In particular, the k-mean of the i=m+1 si . In the third part, we check the quality
projected rows minimizes the k-means cost of the of our coreset. For this, we compute a random query
  is because subspace represented by a matrix Q with j random
original rows up to a factor of (1 + ε). This
the minimal cost depends only on the n2 distances orthogonal columns, calculate the sum of the squared
between the n rows [34]. Note that we get the same distances of A to Q, and of C to Q. In the end, err
approximation by the projection Am of A on an m- contains the multiplicative error of our coreset.
dimensional subspace. The Matlab code of Algorithm 2 is based on
While our embedding also projects the rows onto Corollary 4.5 for constructing a coreset C for k-
a linear subspace of small dimension, its construction mean queries. This coreset has n points that are
and properties are different from a random subspace. lying on O(k/ε2 ) subspace, rather than the original
First, our embedding is deterministic (succeeds with d-dimensional space. Note that further reduction for
7
1440 Copyright © SIAM.
Unauthorized reproduction of this article is prohibited.
a coreset of size independent of both n and d can be >> %% Coreset Computation %%%
obtained by applying additional construction on C; >> %% for j-Subspace Approximation %%%
see Corollary 9.1. The code in Algorithms 1 and 2 is
only for demonstration and explanation purposes. In >> % 1. Creating Random Input
particular, it should be noted that >> n=10000;
>> d=2000;
• For simplicity, we choose A as a random Gaus- >> A=rand(n,d);
sian matrix. Real data sets usually contain more >> j=2;
structure. >> epsilon=0.1;
• We choose the value of m according to the
>> % 2. The Actual Coreset Construction
corresponding theorems, that are based on worst
>> m=j+ceil(j/epsilon)-1; % i.e., m=21.
case analysis. In practice, as can be seen in our
>> [U, D, V]=svds(A,m);
experiments, significantly smaller values for m
>> C = D*V’; % C is an m-by-m matrix
can be used to obtain the same error bound.
>> c=norm(A,’fro’)^2-norm(D,’fro’)^2;
The desired value of m can be chosen using
hill climbing or binary search techniques on the
>> % 3. Compare sum of squared distances of j
integer m.
>> % random orthogonal columns Q to A and C
• In Algorithm 1 we compare the coreset approx- >> Q=orth(rand(d,d-j));
imation for a fixed query subspace, and not the >> costA= norm(A*Q,’fro’)^2;
maximum error over all such subspaces, which is >> costC= c+norm(C*Q,’fro’)^2;
bounded in Corollary 6.1. >> errJsubspaceApprox = abs(costC/costA - 1)

• In Algorithm 2 we actually get a better solution errJsubspaceApprox = 2.4301e-004


using the coreset, compared to the original set
(ε < 0). This is possible since we used Matlab’s
Algorithm 1: A Matlab implementation of the coreset
heuristic for k-means and not an optimal solu-
construction for j-subspace approximation.
tion.

6 Subspace Approximation for one Linear


Subspace value decomposition A = U DV T . Our first step
is to replace the matrix A by its best rank m
We will first develop a coreset for the problem of approximation with respect to the squared Frobenius
approximating the sum of squared distances of a point norm, namely, by U D(m) V T , where D(m) is the
set to one linear j-dimensional subspace. Let L ⊆ Rn matrix that contains the m largest diagonal entries
be a j-dimensional subspace represented by an n × j of D and that is 0 otherwise. We show the following
matrix X with orthonormal columns spanning X and simple lemma about the error of this approximation
an n × (n − j) matrix Y with orthonormal columns with respect to squared Frobenius norm.
spanning the orthogonal complement L⊥ of L. Above,
we recalled the fact that for a given matrix A ∈ Rn×n
containing n points of dimension n as its rows, the
distances AY 22 of the points to L is Lemma 6.1. Let A ∈ R be an n × n matrix with
n×n
sum of squared
n T
at least i=j+1 si , where si is the ith singular value the singular value decomposition A = U DV , and let
of A (sorted non increasingly). When L is the span X be a n × j matrix whose columns are orthonormal.
of the first j eigenvectors of an orthonormal basis of Let ε ∈ (0, 1] and m ∈ N with n − 1 ≥ m ≥
(m)
AT A (where the eigen vectors [=left singular vectors j + j/ε −1, and let D be the matrix that contains
of A] are sorted according to the corresponding left the first m diagonal entries of D and is 0 otherwise.
singular values), then this value is exactly matched. Then
Thus, the subspace that is spanned by the first j left
singular
n vectors of A achieves a minimum value of n
i=j+1 i .
s 0 ≤ U DV T X22 − U D(m) V T X22 ≤ ε · si .
Now, we will show that m := j + j/ε − 1 i=j+1
appropriately chosen points suffice to approximately
compute the cost of every j-dimensional subspace L.
We obtain these points by considering the singular Proof. We first observe that U DV T X22 −

1441 Copyright © SIAM.


Unauthorized reproduction of this article is prohibited.
>> %% Coreset Computation for 2-means %%%

>> % 1. Creating Random Input


>> n=10000;
>> d=2000;
>> A=rand(n,d); % n-by-d random matrix
>> j=2;
>> epsilon=0.1;

>> %2. Coreset Construction


>> m=ceil(j/epsilon^2)+j-1; % i.e, m=201
>> [U, D, V]=svds(A,m);
>> c=norm(A,’fro’)^2-norm(D,’fro’)^2;
>> C = U*D*V; % C is an n-by-m matrix
Figure 1: Vizualization of a point set that is projected
>> % 3. Compute k-mean for A and its coreset down to a 1-dimensional subspace. Notice that both
>> k=2; subspace here and in all other pictures are of the same
>> [~,centersA]=... dimension to keep the picture 2-dimensional, but the
kmeans(A,k,’onlinephase’,’off’); query subspace should have smaller dimension.
>> [~,centersC]=...
kmeans(C,k,’onlinephase’,’off’);
where the first equality follows since U has or-
>> % 4.Evaluate sum of squared distances thonormal columns, the second inequality since for
2 (m)
>> [~,distsAA]=knnsearch(centersA,A); T
n =nV X we
M DM 
have  2 − D M 22 =
2 m n 2
>> [~,distsCA]=knnsearch(centersC,A); j=1 si mij − j=1 si mij =
i=1
n n 2
i=1
(m) 2
>> [~,distsCC]=knnsearch(centersC,C); i=m+1 j=1 si mij = (D − D )M  , and
>> costAA=sum(distsAA.^2); % opt. k-means the inequality follows because the spectral norm is
>> costCA=sum(distsCA.^2); % appr. cost consistent with the Euclidean norm. It follows for
>> costCC=sum(distsCC.^2); our choice of m that

>> % Evaluate quality of coreset solution 


m+1 
n
jsm+1 ≤ ε·(m−j +1)sm+1 ≤ ε· si ≤ ε· si .
>> epsApproximation = costCA/costAA - 1
i=j+1 i=j+1

epsApproximation = -4.2727e-007 2

>> % Evaluate quality of cost estimation In the following we will show that one can use
>> % using coreset the first m rows of D(m) V T as a coreset for the
>> epsEstimation = abs(costCA/(costCC+c) - 1) linear j-dimensional subspace 1-clustering problem.
We exploit the observation that by the Pythagorean
epsEstimation = 5.1958e-014 theorem it holds that AY 22 + AX22 = A22 .
Thus, AY 22 can be decomposed as the differ-
ence between A22 , the squared lengths of the
Algorithm 2: A Matlab implementation of the coreset points in A, and AX 2
2 , the squared lengths of
construction for k-means queries. the projection of A on L. Now, when using
U D(m) V T instead of A, U D(m) V T Y 22 decomposes
into U D(m) V T 22 − U D(m) V T X22 in nthe same
U D(m) V T X22 is always non-negative. Then way. By noting  that A22 is actually i=1 si and
(m) T 2 m
U D V 2 is s
i=1 i , it is clear that storing the
U DV T X22 − U D(m) V T X22 remaining terms of the sum is sufficient to account
= DV T X22 − D(m) V T X22 for the difference between these two norms. Approxi-
(m)
= (D − D )V X2 T 2 mating AY 22 by U D(m) V T Y 22 thus reduces to ap-
proximate AX22 by U D(m) V T X22 . We show that
≤ (D − D(m) )22 X2S
the difference between these two is only an ε-fraction
= j · sm+1 of AY 22 , and explain after the corollary why this
implies the desired coreset result.
9
1442 Copyright © SIAM.
Unauthorized reproduction of this article is prohibited.
Thus, we can replace the matrix U D(m) V T in the
above corollary by D(m) V T . This is interesting,
because all rows except the first m rows of this
new matrix have only 0 entries and so they don’t
contribute to D(m) V T X22 . Therefore, we will define
our coreset matrix S to be matrix consisting of the
first m = O(j/ε) rows of D(m) V T . The rows of this
matrix will be the coreset points.
In the following, we summarize our results and
state them for n points in d-dimensional space.
Corollary 6.1. Let A be an n × n matrix whose n
rows represent n points in n-dimensional space. Let
A = U DV T be the SVD of A and let D(m) be a matrix
Figure 2: The distances of a point and its projection that contains the first m = O(j/ε) diagonal entries
to a query subspace. of D and is 0 otherwise. Then the rows of D(m) V T
form a coreset for the linear j-subspace 1-clustering
problem.
Lemma 6.2. (Coreset for Linear Subspace 1-
Clustering) Let A ∈ Rn×n be an n × n matrix with If one is familiar with the coreset literature it
singular value decomposition A = U DV T , Y be may seem a bit strange that the resulting point set
an n × (n − j) matrix with orthonormal columns, is unweighted, i.e. we replace n unweighted points
0 ≤ ε ≤ 1, and m ∈ N with n − 1 ≥ m ≥ j + j/ε − 1. by m unweighted points. However, for this problem
Then the weighting can be implicitly done by scaling.
Alternatively, we could also define our coreset to be

n
the set of the first m rows of V T where the ith row is
0 ≤ U D(m) V T Y 22 + si − AY 22 ≤ ε · AY 22 .
weighted by si .
i=m+1

Proof. We have AY 22 ≤ U D(m) VT Y 22 + U (D − 7 Dimensionality Reduction for Projective
D(m) )V T Y 22 ≤ U D(m) V T Y 22 + i=m+1 si , which
n Clustering Problems

proves that U D(m) V T Y 22 + i=m+1 si − AY 22 is
n In order to deal with k subspaces we will use a form
non-negative. We now follow the outline sketched of dimensionality reduction. To define this reduction,
above. By the Pythagorean Theorem, AY 22 = let L be a linear j-dimensional subspace represented
A22 − AX22 , where X has orthonormal columns by an n × j matrix X with orthonormal columns and
and spans the space orthogonal to the space spanned with Y being an n × (n − j) matrix with orthonormal
n
by Y . Using, A 2
= s and, U D(m) V T 22 = columns that spans L⊥ . Notice that for any matrix
m 2 i=1 i
M we can write the projection of the points in the
i=1 si , we obtain
rows of M to L as M XX T , and that these projected

n
points are still n-dimensional, but lie within the j-
U D(m) V T Y 22 + si − AY 22 dimensional subspace L. Our first step will be to
i=m+1
show that if we project both U DV T and A :=
(m) T 2
= U D V 2 − U D(m) V T X22 U D(m) V T on X by computing U DV T XX T and

n U D(m) V T XX T , then the sum of squared distances
+ si − A22 + AX22 of the corresponding rows of the projection is small
i=m+1 compared to the cost of the projection. In other
= AX22 − U D(m) V T X22 words, after the projection the points of A will be

n relatively close to their counterparts of A . Notice
≤ ε· si ≤ ε · AY 22 the difference to the similar looking Lemma 6.1: In
i=j+1 Lemma 6.1, we showed that if we project A to L and
sum up the squared lengths of the projections, then
where the first inequality follows from Lemma 6.1. 2 this sum is similar to the sum of the lengths of the
projections of A . In the following corollary, we look
Now we observe that by orthonormality of the at the distances between a projection of a point from
columns of U we have U M 22 = M 22 for any matrix A and the projection of the corresponding point in
M , which implies that U DV T X22 = DV T X22 . A , then we square these distances and show that the

1443 Copyright © SIAM.


Unauthorized reproduction of this article is prohibited.
sum of them is small. Either dist(p, B) ≤ p − q/ε or p − q < εdist(p, B).
Hence,
Corollary 7.1. Let A ∈ Rn×n and let A = U DV T
be its singular value decomposition. Let D(m) be a p − q2
dist(p, B)p − q ≤ + εdist2 (p, B).
matrix whose first m diagonal entries are the same as ε
that of D and let it be 0, otherwise. Let X ∈ Rn×j be Combining the last inequality with (8.4) yields
a matrix with orthonormal columns and let n − 1 ≥
m ≥ j + j/ε − 1 and let Y ∈ Rn×(n−j) a matrix |dist2 (p, B) − dist2 (q, B)|
with orthonormal columns that spans the orthogonal 2p − q2
complement of the column space of X. Then ≤ p − q2 + + 2εdist2 (p, B)
ε
U DV T XX T − U D(m) V T XX T 22 ≤ ε · AY 22 3p − q2
≤ + 2εdist2 (p, B).
ε
Proof. We have U DV T XX T − U D(m) V T XX T 22 = Finally, the lemma follows by replacing ε with ε/4. 2
DV T X − D(m) V T X22 = DV T X22 −
(m) T 2
Dn V X2 =U DV X2 − U D(m) V T X22 ≤
T 2 We can combine the above lemma with Corollary
ε i=j+1 si ≤ ε i=1 si ≤ εAY 22 , where the third 7.1 by replacing ε in the corollary with ε2 /30 and
n

last inequality follows from Lemma 6.1. 2 summing the error of approximating an input point
p by its projection q. If C is contained in a j ∗ -
Now assume we want to use A = U D(m) V T dimensional subspace, the error will be sufficiently
to estimate the cost of an j-dimensional affine sub- small for m ≥ j ∗ + 30j ∗ /ε2 − 1. This is done in
space k-clustering problem. Let L1 , . . . , Lk be a set the proof of the theorem below.
of affine subspaces and let L be a j ∗ -dimensional
subspace containing C = L1 ∪ · · · ∪ Lk , j ∗ ≥ Theorem 8.1. Let A ∈ Rn×n be a matrix with
k(j + 1). Then by the Pythagorean theorem we can singular value decomposition A = U DV T and let
write dist2 (A, C) = dist2 (A, L) + dist2 (AXX T , C), ε > 0. Let j ≥ 1 be an integer, j ∗ = j + 1 and
where X is a matrix with orthonormal columns m = j ∗ + 30j ∗ /ε2 − 1 such that m ≤ n − 1.
whose span is L. We claim that dist2 (A , C) + Furthermore, let A = U D(m) V T , where D(m) is a
 n 2 diagonal matrix whose diagonal is the first m diagonal
i=m+1 si is a good approximation for dist (A, C).
We also know that dist (A , C) = dist (A , L) + entries of D followed by n − m zeros.
2  2 

dist2 (A XX T , C). Furthermore, by Corollary Then for any compact set C, which is contained
2  n 2 in a j-dimensional subspace, we have
6.2, |dist (A , L) + i=m+1 si − dist (A, L)| ≤
2 2
εdist (A, L) ≤ εdist (A, C). Thus, if we can n

prove that |dist2 (AXX T , C) − dist2 (A XX T )|22 ≤ dist2 (A, C) − dist2 (A , C) + si
ε · dist(A, C) we have shown that |dist(A, C) − i=m+1

(dist(A , C) + i=m+1 si )| ≤ 2εdist(A, C), which will ≤ε · dist2 (A, C).


n

prove our dimensionality reduction result.


In order to do so, we can use the following ‘weak Proof. Suppose that X ∈ R
n×(j+1)
has orthonormal
triangle inequality’, which is well known in the coreset columns that span C. Let Y ∈ R
n×(n−(j+1))
denote a
literature, and can be generalized for other norms and matrix whose orthonormal columns are also orthogo-
distance functions (say, m-estimators). nal to the columns of X. Since AY 2 is the sum of
squared distances to the column space of X, we have
Lemma 7.1. For any 1 > ε > 0, a compact set
C ⊆ Rn , and p, q ∈ Rn , (8.5) AY 2 ≤ dist2 (A, C).

12p − q2 ε Fix an integer i, 1 ≤ i ≤ n, and let pi denote


|dist2 (p, C)−dist2 (q, C)| ≤ + dist2 (p, C). the ith row of the matrix AXX T . That is, pi is the
ε 2
projection of a row of A on X. Let pi denote the ith
8 Proof of Lemma 7.1 row of A XX T . Using Corollary 7.1, while replacing
Proof. Using the triangle inequality, ε with ε = ε2 /30

n
(8.4) (8.6) pi − pi 2
|dist2 (p, B) − dist2 (q, B)| i=1
= |dist(p, B) − dist(q, B)| · (dist(p, B) + dist(q, B)) =U DV XX T − U D(m) V XX T 2
≤ p − q · (2dist(p, B) + p − q) ε2
≤ · AY 2 .
≤ p − q2 + 2dist(p, B)p − q. 30
11
1444 Copyright © SIAM.
Unauthorized reproduction of this article is prohibited.
Notice that if ui is the ith row of U , then the Proof. Let B1 be an n × n matrix whose columns
ith row of A can be written as Ai∗ = ui DV T , are an orthonormal basis of Rn such that the first
and thus for its projection p to X it holds p = d1 columns span L1 and the first d1 + d2 columns
ui DV T XX T . Similarly, p = ui D(m) V T XX T . Thus, span L2 . Let B2 an n × n matrix whose columns
the sum of the squared distances between all p are an orthonormal basis of Rn such that the first d1
and their corresponding p is just ||U DV T XX T − columns are identical with B1 and the first d1 + d2
U D(m) V T XX T ||22 , which is bounded by ε ||AY ||2 by columns span L2 . Define U = B1 B2T . It is easy to
Corollary 7.1 and since we set m = j ∗ + 30j ∗ /ε2 − 1 verify that U satisfies the conditions of the lemma. 2
in the precondition of this Theorem. Together, we get


n
2 2 
dist (A, C) − dist (A , C) − si
i=j+1 Let L be any m + j ∗ -dimensional subspace that
contains the m-dimensional subspace in which the
(8.7) = AY 2 − A Y 2
points in A lie, let C be an arbitrary compact set,

n  which can, for example, be a set of centers (points,
− si + (dist2 (p, C) − dist2 (p , C))
lines, subspaces, rings, etc.) and let L2 be a j ∗ -
i=j+1 p∈P
 2
dimensional subspace that contains C. If we now set
(8.8) ≤ ε AY  d1 := m and d2 := j ∗ and let L1 be the subspace
n
12pi − pi 2 ε spanned by the rows of A , then Lemma 9.1 implies
+ + · dist2 (pi , C) that every given compact set (and so any set of
i=1
ε 2
centers) can be rotated into L. We could now proceed
12ε ε by computing a coreset in a m + j ∗ -dimensional
(8.9) ≤ ε + + dist2 (A, C)
ε 2 subspace and then rotate every set of centers to this
≤ εdist2 (A, C), subspace. However, the last step is not necessary
because of the following.
where (8.7) follows from the Pythagorean Theo- The mapping defined by any orthonormal matrix
rem, (8.8) follows from Lemma 7.1 and Corollary 6.2, U is an isometry, and thus applying it does not change
and (8.9) follows from (8.5) and (8.6). 2 Euclidean distances. So, Lemma 9.1 implies that
the sum of squared distances of C to the rows of
9 Small Coresets for Projective Clustering A is the same as the sum of squared distances of
In this section we use the result of the previous U (C) := {U x : x ∈ C} to A . Now assume that we
section to prove that there is a strong coreset of size have a coreset for the subspace L. U (C) is still a union
independent of the dimension of the space. In the of k subspaces, and it is in L. This implies that the
last section, we showed that the projection A of A sum of squared distances to U (C) is approximated
is a coreset for A. A still has n points, which are n- by the coreset. But this is identical to the sum of
dimensional but lie in an m-dimensional subspace. To squared distances to C and so this is approximated
reduce the number of points, we want to apply known by the coreset as well.
coreset constructions to A within the low dimensional Thus, in order to construct a coreset for a set
subspace. However, this would mean that our coreset of n points in Rn we proceed as follows. In a first
only holds for centers that are also from the low step we use the dimensionality reduction to reduce
dimensional subspace, but of course we want that the input point set to a set of n points that lie in an
the centers can be chosen from the full dimensional m-dimensional subspace. Then we construct a coreset
space. We get around this problem by applying for a (d + j ∗ )-dimensional subspace that contains the
the coreset constructions to a slightly larger space low-dimensional point set. By the discussion above,
than the subspace that A lives in. The following this will be a coreset.
lemma provides us the necessary tool to complete the Finally, combining this with known results [21,
argumentation. 19, 36] we gain the following corolarry.

Lemma 9.1. Let L1 be a d1 -dimensional subspace of


Rn and let L be a (d1 +d2 )-dimensional subspace of Rn Corollary 9.1. Let P be a set of points in Rd , and
that contains L1 . Let L2 be a d2 -dimensional subspace k, j ≥ 0 be a pair of integers. There is a set Q in
of Rn . Then there is an orthonormal matrix U such Rd , a weight function w : Q → [0, ∞) and a constant
that U x = x for any x ∈ L1 and U x ∈ L for any c > 0 such that the following holds.
x ∈ L2 . For every set B which is the union of k affine

1445 Copyright © SIAM.


Unauthorized reproduction of this article is prohibited.
j-subspaces of Rd we have The following definition is a generalization of
  Definition 2, where the ground set X = Rd × [0, ∞) ×
(1 − ε) dist2 (p, B) ≤ w(p)dist2 (p, B) + c [0, ∞) is the set of points in Rd , each assigned with
p∈P p∈Q multiplicative and additive weight, Q is all k-tuples

≤(1 + ε) 2
dist (p, B), of j-dimensional query subspaces in Rd for some fixed
p∈P
j, k ≥ 1. For a given pair j, k ≥ 1 of integers,
a coreset scheme for the (j, k)-projective clustering
and problem Alg(P, ε) is an algorithm whose input is a
set P ⊆ X of weighted points and an error parameter
1. |Q| = O(j/ε) if k = 1 ε. It outputs a set C (called coreset) such that for
2. |Q| = O(k 2 /ε4 ) if j = 0 every query subspace L ∈ Q, its cost
 
3. |Q| = poly(2k log n, 1/ε) if j = 1, f (P, q) := w · dist2 (p, L) + w
(p,w,w )∈P (p,w,w )∈P
4. |Q| = poly(2kj , 1/ε) if j, k > 1, under the
assumption that the coordinates of the points in approximated, up to a factor of (1 ± ε),
by f (C, q). 
P are integers between 1 and n O(1)
. In Definition 2, we denote c = (p,w,w )∈C w
and assumed that P has no weights. However, in the
In particular the size of Q is independent of d. merged-and-reduce approach that is described in this
section, we apply the algorithm Alg recursively on
10 Fast, streaming, and parallel its output coresets, and thus must assume that the
implementations input is also weighted. Since the distribution of the
One major advantage of coresets is that they can additive weights across the points has no effect on f ,
be constructed in parallel as well as in a streaming we can assume that only one arbitrary point in P has
setting. a positive additive weight.
That are can be constructed (‘embarrassingly’) in Definition 3. (Coreset Scheme) Let X and Q be
parallel is due to the fact that the union of coresets two sets. Let f : 2X × Q → [0, ∞) be a function
is again a coreset. More precisely, if a data set that maps every subset of X and q ∈ Q to a positive
is split into  subsets, and we compute a (1 + ε )- real. An ε-coreset scheme Alg(·, ·) for (X, Q) is an
coreset of size s for every subset, then the union algorithm that gets as input a set P ⊆ X and a
of the  coresets is a (1 + ε )-coreset of the whole parameter ε ∈ (0, 1/10). It then returns a subset
data set and has size  · s. This is especially helpful C := A(P, ε) of X such that for every q ∈ Q we
if the data is already given in a distributed fashion have
on  different machines. Then, every machine will
simply compute a coreset. The small coresets can (1 − ε)f (C, q) ≤ f (P, q) ≤ (1 + ε)f (C, q)
then be sent to a master device which approximately
The following theorem is a generalized and im-
solves the optimization problem. Our coreset results
proved version of the coreset for streaming k-means
for subspace approximation for one j-dimensional
from [29, Theorem 7.2]. The improvement is due to
subspace, for k-means and for general projective
the use of (ε/ log n)-coresets rather than (ε/ log2 n)-
clustering thus directly lead to distributed algorithms
coresets, which are usually constructed in less time
for solving these problems approximately.
and space. Similar generalizations have been used
If we want to compute a coreset with size indepen-
in other papers. The proof uses certain composition
dent of  on the master device or in a parallel setting
properties that coresets satisfy.
where the data was split and the coresets are later
collected, then we can compute a (1 + ε )-coreset of Theorem 10.1. Let X be a set representing a (pos-
this merged coreset, gaining a (1 + ε )2 -coreset. For sibly unbounded) stream of items, let Q be a set
ε := ε/3, this results in a (1 + ε)-coreset because ε ∈ (0, 1/10) and let Alg(·, ·) be an ε-coreset scheme
(1 + ε/3)2 < 1 + ε. for (X, Q). Suppose that for every input set P ⊆ X of
In the streaming setting, data points arrive one by one, and it is impossible to remember the entire data set due to memory constraints. We state a fairly general variant of the technique ‘Merge & Reduce’ [29] allowing us to use coresets in a stream while also being able to compute in parallel. For this, we define the notion of coreset schemes.
In Definition 2, f(P, q) := Σ_{(p,w,w')∈P} w · dist²(p, L) + Σ_{(p,w,w')∈P} w' is approximated, up to a factor of (1 ± ε), by f(C, q). There, we denote c' = Σ_{(p,w,w')∈C} w' and assumed that P has no weights. However, in the merge-and-reduce approach that is described in this section, we apply the algorithm Alg recursively on its output coresets, and thus must assume that the input is also weighted. Since the distribution of the additive weights across the points has no effect on f, we can assume that only one arbitrary point in P has a positive additive weight.

Definition 3. (Coreset Scheme) Let X and Q be two sets. Let f : 2^X × Q → [0, ∞) be a function that maps every subset of X and every q ∈ Q to a positive real. An ε-coreset scheme Alg(·, ·) for (X, Q) is an algorithm that gets as input a set P ⊆ X and a parameter ε ∈ (0, 1/10). It then returns a subset C := Alg(P, ε) of X such that for every q ∈ Q we have

(1 − ε) f(C, q) ≤ f(P, q) ≤ (1 + ε) f(C, q).

The following theorem is a generalized and improved version of the coreset for streaming k-means from [29, Theorem 7.2]. The improvement is due to the use of (ε/log n)-coresets rather than (ε/log² n)-coresets, which are usually constructed in less time and space. Similar generalizations have been used in other papers. The proof uses certain composition properties that coresets satisfy.

Theorem 10.1. Let X be a set representing a (possibly unbounded) stream of items, let Q be a set, let ε ∈ (0, 1/10), and let Alg(·, ·) be an ε-coreset scheme for (X, Q). Suppose that for every input set P ⊆ X of |P| ≤ n' items and every ε ∈ (0, 1/10) we can compute a coreset Alg(P, ε) of size at most a(ε, n') using at most t(ε, n') time and s(ε, n') space, and that there is an integer m(ε) such that a(ε, 2m(ε)) ≤ m(ε). Then, we can dynamically maintain an ε-coreset C for the n items seen so far at any point in the stream, where
(i) it holds |C| ≤ a(ε'/9, m(ε') · O(log n)) for an ε' = O(ε/log n);

(ii) the construction of C takes at most

s(ε', 2m(ε')) + s(ε/9, m(ε') · O(log n))

space, with additional m(ε') · O(log n) items;

(iii) the update time of C per point insertion to the stream is

t(ε', 2m(ε')) · O(log n) + t(ε/9, O(m(ε') log n));

(iv) the amortized update time can be divided by M using M ≥ 1 processors in parallel.

Figure 3: Tree construction for generating coresets in parallel or from a data stream. Black arrows indicate ‘merge-and-reduce’ operations. The (intermediate) coresets C1, . . . , C7 are enumerated in the order in which they would be generated in the streaming case. In the parallel case, C1, C2, C4 and C5 would be constructed in parallel, followed by parallel construction of C3 and C6, finally resulting in C7. The figure is taken from [33].

Proof. We first show how to construct an ε-coreset only for the first n items p1, · · · , pn in the stream, for a given n ≥ 2, using space and update time as in (i) and (ii).
Put ε' = ε/(6 log n) and denote m' = m(ε'). The ε'-coreset of the first 2m' − 1 items in the stream is simply their union. After the insertion of p_{2m'} we replace the first 2m' items by their ε'-coreset C1. The coreset for p1, · · · , p_{2m'+i} is the union of C1 and p_{2m'+1}, · · · , p_{2m'+i} for every i = 1, · · · , 2m'. When p_{4m'} is inserted, we replace p_{2m'+1}, · · · , p_{4m'} by their ε'-coreset C2. Using the assumption of the theorem, we have |C1| + |C2| ≤ 2m'. We can thus compute an ε'-coreset of size at most m' for the union C1 ∪ C2, and delete C1 and C2 from memory.
We continue to construct a binary tree of coresets as in Fig. 3. That is, every leaf and node contains at most m' items. Whenever we have two coresets in the same level, we replace them by a coreset in a higher level. The height of the tree is bounded from above by log n. Hence, at every given moment during the streaming of n items, there are at most O(log n) ε'-coresets in memory, each of size at most m'. For any n, we can obtain a coreset for the first n points at the moment after the nth point was added by computing an (ε/3)-coreset C for the union of these at most O(log n) coresets in memory.
The approximation error of a coreset in the tree with respect to its leaves (the original items) is increased by a multiplicative factor of (1 + ε') in every tree level. Hence, the overall multiplicative error of the union of coresets in memory is (1 + ε')^{log n}. Using the definition of ε', we obtain

(1 + ε')^{log n} = (1 + ε/(6 log n))^{log n} ≤ e^{ε/6} < 1 + ε/3.

Since (1 + ε/3)² ≤ 1 + ε by the assumption ε ∈ (0, 1/10), we have that C is an ε-coreset for the union of all the n items seen so far. We now prove claims (i)–(iv) for fixed n.
(i) C is an ε/3 coreset for O(log n) sets of size at most m', and thus |C| = a(ε/3, m' · O(log n)).
(ii) In every item insertion, there are O(log n) coresets in memory, each of size at most m', and additional O(m') items at the leaves. Since we use additional space of s(ε', 2m') for constructing a coreset in the tree, and s(ε/3, m' · O(log n)) space for constructing C, the overall space for the construction is bounded by

s(ε', 2m') + s(ε/3, m' · O(log n))

with additional m' · O(log n) items.
(iii) When a new item is inserted, we need to apply the coreset construction at most O(log n) times, once for each level. Together with the computation time of C, the overall time is

t(ε', 2m') · O(log n) + t(ε/3, O(m' log n))

per point insertion.
(iv) follows since the coresets can be computed in parallel for each level. More precisely, the n/(2^i m') coresets in the ith level can be computed in parallel by dividing them into min{M, n/(2^i m')} machines.
The main problem with the above construction is that it assumes that n is known in advance. To remove this assumption, we compute the above coreset trees independently for batches (sets) with exponentially increasing size: a tree for the first batch of n1 = 2m' items in the stream (which consists of a single node), then a new tree for the next batch of
n2 = 4m' items, and in general a tree for the ith batch of the following ni = 2^i m' items in the stream for i ≥ 1. For every i ≥ 1, let εi = ε/(3c log ni) and compute an εi-coreset in each node of the ith tree, rather than an ε'-coreset. After all the ni items of the ith batch were read, its tree consists of a single (root) (ε/3)-coreset Ci of its ni leaves, where |Ci| = m(εi).
After the insertion of the nth item of the stream to the current jth batch, we have j = O(log n). We output an (ε/3)-coreset C for the union of the previous coresets C1 ∪ C2 ∪ · · · ∪ Cj−1 with the O(log n) (εj)-coresets in the current (last) tree; hence, C is an ε-coreset for the n items. Once the jth batch is complete, its root Cj is the union of the O(log n) (ε/(3c log n))-coresets in its tree; hence, Cj is an (ε/3)-coreset for the items that were read in the current batch, and thus C1 ∪ C2 ∪ · · · ∪ Cj is an (ε/3)-coreset for all the n items. We then output an (ε/3)-coreset C for C1 ∪ C2 ∪ · · · ∪ Cj, which is an ε-coreset for the n items. The space and time bounds are dominated by the last batch of size nj ≤ n, so the time and space have the same bounds as above for a fixed set of size n.
For parallel construction of an unbounded stream, we simply send the ith point in the stream to machine number (i mod M) for every i ≥ 1. Each machine computes a coreset for its given sub-stream, as explained above for a single machine. The coreset at every given moment is the union of the current coresets in all the M machines. It may then be written in parallel to a shared memory on one of the machines.
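The binary tree of the proof can be maintained with a simple bucket structure. The sketch below is a minimal illustration under the same black-box assumption as before (coreset(points, eps)); m and eps_prime play the roles of m' and ε' from the proof, and query performs the final reduction of the O(log n) coresets in memory. It is not the paper's implementation.

    class MergeReduceStream:
        """Maintain merge-and-reduce coresets over a stream (one tree, fixed eps')."""

        def __init__(self, coreset, m, eps_prime):
            self.coreset = coreset      # black-box: coreset(points, eps) -> list of points
            self.m = m                  # bucket capacity m' = m(eps')
            self.eps = eps_prime        # accuracy used inside the tree
            self.buffer = []            # raw leaf items, fewer than 2m' of them
            self.levels = []            # levels[i] is None or a coreset of 2^i * 2m' items

        def insert(self, item):
            self.buffer.append(item)
            if len(self.buffer) < 2 * self.m:
                return
            # Reduce the full buffer to a level-0 coreset and carry like binary addition:
            # two coresets on the same level are merged and reduced to the next level.
            carry = self.coreset(self.buffer, self.eps)
            self.buffer = []
            level = 0
            while level < len(self.levels) and self.levels[level] is not None:
                carry = self.coreset(self.levels[level] + carry, self.eps)
                self.levels[level] = None
                level += 1
            if level == len(self.levels):
                self.levels.append(None)
            self.levels[level] = carry

        def query(self, eps_final):
            """Return a coreset of everything seen so far (the final (eps/3)-reduction)."""
            union = self.buffer + [p for c in self.levels if c is not None for p in c]
            return self.coreset(union, eps_final)

At any point in time the structure stores at most O(log n) coresets of size at most m' each, plus one raw buffer of fewer than 2m' items.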
11 Small Coresets for Other Dissimilarity Measures
In this section, we describe an alternative way to prove the existence of coresets independent of the number of input points and the dimension. It has an exponential dependency on ε^{−1} and thus leads to larger coresets for k-means. However, we show that the construction also works for a restricted class of μ-similar Bregman divergences, indicating that it is applicable to a wider class of distance functions.

Setting. We are interested in formulating our strategy for more general distance measures than only ℓ2. Thus, for the formulation of the needed conditions, we just assume any distance function d : R^d × R^d → R≥0 which satisfies d(x, x) = 0. As usual, we use the abbreviations

cost(U) = min_{z∈V} Σ_{x∈U} d(x, z),
cost(U, C) = Σ_{x∈U} min_{c∈C} d(x, c), and
opt(U, k) = min_{C⊂V, |C|=k} cost(U, C)

for U, C ⊂ R^d. We are interested in a coreset Z for P for the approximate computation of cost(P, C) given an arbitrary set C of k centers.
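To fix this notation, the following brute-force sketch evaluates cost(U), cost(U, C) and opt(U, k) for tiny inputs, with d chosen as the squared Euclidean distance and the candidate set V restricted to the input points so that the search is finite; both choices are illustrative assumptions, not part of the definitions above.

    from itertools import combinations

    def d2(x, y):
        """Squared Euclidean distance, one example with d(x, x) = 0."""
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

    def cost_with_centers(U, C, d=d2):
        # cost(U, C): every point pays its distance to the closest center in C.
        return sum(min(d(x, c) for c in C) for x in U)

    def cost_one(U, V, d=d2):
        # cost(U): the best single center z in V for all of U.
        return min(sum(d(x, z) for x in U) for z in V)

    def opt(U, k, V, d=d2):
        # opt(U, k): best set of k centers chosen from the candidate set V.
        return min(cost_with_centers(U, C, d) for C in combinations(V, k))

    # Tiny usage example (V restricted to the input points for finiteness):
    P = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
    print(cost_one(P, P), opt(P, 2, P))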
Clustering Features. A classical coreset is a weighted set of points, for which the clustering cost is computed in a weighted fashion. In the sections above, we already saw that it can be beneficial to deviate from this definition. Here, we will construct a coreset consisting of so-called clustering features. These are motivated by the well-known formula

(11.10)  Σ_{p∈P'} d(p, z) = |P'| d(z, μ(P')) + Σ_{p∈P'} d(p, μ(P'))

that holds for squared Euclidean distances as well as for Bregman divergences, if μ(P') denotes the centroid of P'. With the help of (11.10), the sum of the distances of points in a set P' to a center c can be calculated exactly when only given the clustering features |P'|, Σ_{p∈P'} p and Σ_{p∈P'} d(p, 0).
When given a CF and a set of centers, the optimal choice to cluster the points represented by the clustering feature without splitting them is to assign them to the center closest to the centroid of the CF, which follows from (11.10). Thus, when clustering a set of CFs Z with a set of centers C, we define the clustering cost costCF(Z, C) as the cost for clustering every CF with the center closest to its centroid. We are interested in the following coreset type.

Definition 4. Let P be a point set in R^d. Let d : R^d × R^d → R be a distance measure that satisfies (11.10). A CF-coreset is a set Z of clustering features, each consisting of |P'|, Σ_{p∈P'} p and Σ_{p∈P'} d(p, 0) for a subset P' ⊆ P, such that

|cost(P, C) − costCF(Z, C)| ≤ ε · cost(P, C)

holds for every set C of k centers.

In the following, we abbreviate CF-coreset by coreset. Notice that the error of the coreset compared to P is only induced by points which were summarized in one clustering feature but would optimally be clustered with different centers from C.
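For squared Euclidean distances, a clustering feature and the cost costCF can be implemented directly from (11.10); the sketch below (illustrative class and function names, not taken from the paper) stores exactly the three summaries |P'|, Σ_{p∈P'} p and Σ_{p∈P'} ‖p‖².

    import numpy as np

    class ClusteringFeature:
        """Summary (|P'|, sum of points, sum of squared norms) of a point set P'."""

        def __init__(self, points):
            pts = np.asarray(points, dtype=float)
            self.n = len(pts)                       # |P'|
            self.linear_sum = pts.sum(axis=0)       # sum_{p in P'} p
            self.sq_sum = (pts ** 2).sum()          # sum_{p in P'} ||p||^2 = sum d(p, 0)

        def centroid(self):
            return self.linear_sum / self.n

        def cost_to(self, z):
            # Formula (11.10) for d = squared Euclidean:
            #   sum_p ||p - z||^2 = |P'| * ||z - mu||^2 + sum_p ||p - mu||^2,
            # where sum_p ||p - mu||^2 = sq_sum - |P'| * ||mu||^2.
            mu = self.centroid()
            within = self.sq_sum - self.n * np.dot(mu, mu)
            return self.n * np.dot(z - mu, z - mu) + within

    def cost_cf(features, centers):
        """costCF(Z, C): each feature is charged to the center closest to its centroid."""
        centers = np.asarray(centers, dtype=float)
        total = 0.0
        for cf in features:
            mu = cf.centroid()
            closest = centers[np.argmin(((centers - mu) ** 2).sum(axis=1))]
            total += cf.cost_to(closest)
        return total

Clustering features of disjoint sets can be merged by adding their three components, which is what makes them convenient summaries.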
Construction and Sufficient Conditions. Although we are mainly interested in CF-coresets, our algorithm will also work for other distance functions, even if they do not satisfy (11.10), as long as the distance function satisfies the following two conditions. This is independent of the actual definition of the coreset used, as long as the coreset approximates the cost as in Definition 4. We thus formulate the conditions and the algorithm for any distance function with any coreset notion, and costc denotes the cost function used for clustering with the unknown coreset type.
1. If cost(P) ≤ (1 + f1(ε)) Σ_{i=1}^{k} cost(Pi) for all partitionings P1, . . . , Pk of P into k subsets, then there exists a coreset Z of size g(k, ε) such that for any set C of k centers we have |cost(P, C) − costc(Z, C)| ≤ ε cost(P, C). So, if an optimal k-clustering of P is at most a (1 + ε)-factor cheaper than the best 1-clustering, then this must induce a coreset for P.

2. If opt(P', f2(k)) ≤ f3(ε) opt(P, k) for P' ⊂ P, then there exists a set Z of size h(f2(k), ε) such that for any set of centers C we have |cost(P', C) − costc(Z, C)| ≤ ε cost(P, C).

Algorithm. Given the above conditions, we use the following algorithm (the value of ν is defined later); P* is the input point set:

Partition(P, k)

1. Let C ⊂ V be a set with |C| = k that satisfies opt(P, k) = cost(P, C). Consider the partitioning defined by Pc = {p ∈ P | d(p, c) = min_{c'∈C} d(p, c')} into k subsets (ties broken arbitrarily).

   (a) If cost(P) ≤ (1 + f1(ε)) Σ_{c∈C} cost(Pc), stop.

   (b) Else, if this is the νth level of the recursion, still stop.

2. If it is not, call Partition(Pc, k) for all c ∈ C and then stop.
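The recursion of Partition can be transcribed almost literally. In the sketch below, best_k_centers, cost and dist are assumed black boxes (an optimal or approximate k-clustering routine, the optimal 1-clustering cost, and the distance d, respectively); the two returned lists correspond to the sets at which the recursion stopped in step 1(a) and in step 1(b).

    def partition(P, k, f1_eps, nu, best_k_centers, cost, dist, level=0):
        """Recursive Partition(P, k); returns (sets stopped in 1(a), sets stopped in 1(b)).

        Assumed black boxes (hypothetical names, not from the paper):
          best_k_centers(P, k) -> an optimal (or approximate) set of k centers
          cost(P)              -> optimal 1-clustering cost of P
          dist(p, c)           -> the distance function d
        """
        C = best_k_centers(P, k)
        # Step 1: split P by assigning every point to its closest center.
        parts = [[] for _ in C]
        for p in P:
            parts[min(range(len(C)), key=lambda i: dist(p, C[i]))].append(p)
        parts = [part for part in parts if part]
        # Step 1(a): stop if one center is almost as good as k centers for P.
        if cost(P) <= (1 + f1_eps) * sum(cost(part) for part in parts):
            return [P], []
        # Step 1(b): stop if the recursion depth nu is reached; these sets form Q.
        if level == nu:
            return [], [P]
        # Step 2: otherwise recurse on every part.
        stopped_a, stopped_b = [], []
        for part in parts:
            a, b = partition(part, k, f1_eps, nu, best_k_centers, cost, dist, level + 1)
            stopped_a += a
            stopped_b += b
        return stopped_a, stopped_b

The sets in the first list are each compressed via Condition 1 (for k-means, into a single clustering feature), while the sets in the second list form Q and are handled jointly via Condition 2, as the following analysis shows.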
Note that the first time that we reach step 2, P is the input point set, and thus opt(P, k) = opt(P*, k) and Σ_{c∈C} cost(Pc) ≤ opt(P, k)/(1 + ε) = opt(P*, k)/(1 + ε). Let Q denote the set of all subsets generated by the algorithm on level ν and for which the algorithm stopped in line 1b. On the ith level of the recursion, the sum of the costs of all sets is at most opt(P*, k)/(1 + f1(ε))^i. For ν = log_{1+f1(ε)} (1/f3(ε)), this is smaller than f3(ε) · opt(P*, k). Thus, we have at most f(k) := k^ν sets in Q, and Σ_{U∈Q} cost(U) ≤ ε · opt(P, k). By Condition 2, this implies the existence of a set Z of size h(k^ν, ε) which has an error of at most ε opt(P*, k).
For all sets where we stop in step 1a, Condition 1 directly gives a coreset of size g(k, ε). The union of these coresets gives a coreset for the union of all sets which stopped in step 1a. Altogether, they induce an error of less than ε opt(P, k). Together with the ε opt(P, k) error induced by the sets in Q, this gives a total error of 2ε opt(P, k). So, if we start everything with ε/2, we get a coreset for P with error ε opt(P, k). The size of the coreset then is k^ν · g(k, ε/2) + h(k^ν, ε/2).

Lemma 11.1. If the cost function satisfies the above two conditions, then there exists a coreset of size k^ν · g(k, ε/2) + h(k^ν, ε/2) for ν = log_{1+f1(ε/2)} (1/f3(ε/2)).

For k-means, we can use clustering features and achieve that g ≡ 1 and h(k^ν, ε) = k^ν. Thus, the overall coreset size is 2k^{log_{1+f1(ε)} (1/f3(ε))}. We do not present this in detail as the coreset is larger than the k-means coreset coming from our first construction. However, the proof can be deduced from the following proof for μ-similar Bregman divergences, as the k-means case is easier.

11.1 Coresets for μ-similar Bregman divergences
Let P be the input point set. Let dφ : S × S → R be an m-similar Bregman divergence, i.e., dφ is defined on a convex set S ⊂ R^d and there exists a Mahalanobis distance dB such that m · dB(p, q) ≤ dφ(p, q) ≤ dB(p, q) for all points p, q ∈ R^d and an m ∈ (0, 1] (note that we use m-similar instead of μ-similar in order to prevent confusion with the centroid μ). We need the additional restriction on the convex set S that for every pair p, q of points from P, S contains all points within a ball of radius (4/(mε)) · d(p, q) around p, and we call such a set P-covering. Thus, in addition to the convex hull of the point set, a P-covering set may have to be larger by a factor dependent on m, ε and the diameter of P. Because of this additional restriction, our setting is much more restricted than in [6]. It is an interesting open question how to remove this restriction or even relax the μ-similarity further.
Notice that because dB is a Mahalanobis distance, there exists a regular matrix B with dB(x, y) = ‖B(x − y)‖² for all points x, y ∈ R^d. In particular, m · ‖B(x − y)‖² ≤ dφ(x, y) ≤ ‖B(x − y)‖². By [12], Bregman divergences (also if they are not m-similar) satisfy the Bregman version of equation (11.10), i.e.,

(11.11)  Σ_{p∈P'} dφ(p, z) = Σ_{p∈P'} dφ(p, μ) + |P'| dφ(μ, z).
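Identity (11.11) (with (11.10) as the special case of the squared Euclidean distance, the Bregman divergence of ‖x‖², which is 1-similar with B the identity matrix) is easy to check numerically; the snippet below is only such a sanity check of the formula, not part of the construction.

    import numpy as np

    rng = np.random.default_rng(0)
    P = rng.normal(size=(50, 3))      # a random point set P'
    z = rng.normal(size=3)            # an arbitrary center
    mu = P.mean(axis=0)               # centroid mu(P')

    d = lambda x, y: np.sum((x - y) ** 2)   # squared Euclidean = Bregman div. of ||x||^2

    lhs = sum(d(p, z) for p in P)
    rhs = sum(d(p, mu) for p in P) + len(P) * d(mu, z)
    assert np.isclose(lhs, rhs)       # identity (11.11) holds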
Condition 1. To show that Condition 1 holds, we set f1(ε) = 1/(1 + 4/(m·ε))² and assume that we are given a point set S that satisfies that for every partitioning of S into k subsets S1, . . . , Sk it holds that

Σ_{s∈S} dφ(s, μ(S)) ≤ (1 + f1(ε)) Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj))
⇔  Σ_{j=1}^{k} Σ_{s∈Sj} dφ(s, μ(Sj)) + Σ_{j=1}^{k} |Sj| dφ(μ(Sj), μ(S)) ≤ (1 + f1(ε)) Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj))

⇔  Σ_{j=1}^{k} |Sj| dφ(μ(Sj), μ(S)) ≤ [1/(1 + 4/(m·ε))²] · Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj)).

We show that this restricts the error of clustering all points in S with the same center, more specifically, with the center c(μ(S)), the center closest to μ(S). To do so, we virtually add points to S. For every j = 1, . . . , k, we add one point with weight (1/4)ε·m·|Sj| and coordinate μ(S) + (4/(m·ε))(μ(S) − μ(Sj)) to Sj. Notice that this point lies within the convex set A that dB is defined on because we assumed that S is P-covering. The additional point shifts the centroid of Sj to μ(S) because

[|Sj| · μ(Sj) + (εm/4)|Sj| (μ(S) + (4/(m·ε))(μ(S) − μ(Sj)))] / [(1 + εm/4)|Sj|]
= [(εm/4)|Sj| (μ(S) + (4/(m·ε)) μ(S))] / [(1 + εm/4)|Sj|]
= μ(S).
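The centroid-shift computation above can be verified numerically as well; the snippet below adds the virtual point to an arbitrary subset Sj exactly as described (the concrete values of m, ε and the point sets are made-up test data) and checks that the weighted centroid of the augmented set equals μ(S).

    import numpy as np

    rng = np.random.default_rng(1)
    S = rng.normal(size=(40, 2))          # the whole set S
    Sj = S[:10]                            # one part of the partitioning
    mu_S, mu_Sj = S.mean(axis=0), Sj.mean(axis=0)
    m, eps = 0.5, 0.1

    w = 0.25 * eps * m * len(Sj)                                  # weight of the added point
    q = mu_S + (4.0 / (m * eps)) * (mu_S - mu_Sj)                 # its coordinate
    new_centroid = (Sj.sum(axis=0) + w * q) / (len(Sj) + w)       # weighted centroid of Sj'
    assert np.allclose(new_centroid, mu_S)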
We name the set consisting of Sj together with the weighted added point S'_j, and the union of all S'_j is S'. Now, clustering S' with the center c(μ(S)) is certainly an upper bound for the clustering cost of S with c(μ(S)). Additionally, when clustering S'_j with only one center, c(μ(S)) is optimal, so clustering S'_j with c(μ(Sj)) can only be more expensive. Thus, clustering all S'_j with the centers c(μ(Sj)) gives an upper bound on the cost of clustering S with c(μ(S)). So, to complete the proof, we have to upper bound the cost of clustering all S'_j with the respective centers c(μ(Sj)). We do this by bounding the additional cost of clustering the added points with c(μ(Sj)), which is

Σ_{j=1}^{k} (εm/4)|Sj| · dφ(μ(S) + (4/(m·ε))(μ(S) − μ(Sj)), c(μ(Sj)))
≤ Σ_{j=1}^{k} (εm/4)|Sj| · ‖B(μ(S) + (4/(m·ε))(μ(S) − μ(Sj)) − c(μ(Sj)))‖²
= ‖a‖²

for the k-dimensional vector a defined by aj := √(εm|Sj|/4) · ‖B(μ(S) + (4/(m·ε))(μ(S) − μ(Sj)) − c(μ(Sj)))‖. By the triangle inequality,

aj ≤ √(εm|Sj|/4) · ‖B((1 + 4/(mε))(μ(S) − μ(Sj)))‖ + √(εm|Sj|/4) · ‖B(μ(Sj) − c(μ(Sj)))‖ = bj + dj

with bj = √(εm|Sj|/4) · ‖B((1 + 4/(mε))(μ(S) − μ(Sj)))‖ and dj = √(εm|Sj|/4) · ‖B(μ(Sj) − c(μ(Sj)))‖. Then,

‖a‖ ≤ ‖b + d‖ ≤ ‖b‖ + ‖d‖,

where we use the triangle inequality again for the second inequality. Now we observe that

‖b‖² = Σ_{j=1}^{k} (εm/4)|Sj| · ‖B((1 + 4/(εm))(μ(S) − μ(Sj)))‖²
= (εm/4) Σ_{j=1}^{k} |Sj| (1 + 4/(mε))² ‖B(μ(Sj) − μ(S))‖²
≤ (εm/4)(1 + 4/(mε))² Σ_{j=1}^{k} |Sj| (1/m) dφ(μ(Sj), μ(S))
≤ (ε/4) Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj)).

Additionally, by the definition of m-similarity and by Equation (11.11) it holds that

‖d‖² = Σ_{j=1}^{k} (εm/4)|Sj| · ‖B(μ(Sj) − c(μ(Sj)))‖²
≤ (ε/4) Σ_{j=1}^{k} |Sj| dφ(μ(Sj), c(μ(Sj)))
≤ (ε/4) Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj)).

This implies that ‖a‖ ≤ ‖b‖ + ‖d‖ ≤ 2 √((ε/4) Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj))) and thus

‖a‖² ≤ ε Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj)).

This means that Condition 1 holds: if a k-clustering of S is not much cheaper than a 1-clustering, then assigning all points in S to the same center yields a (1 + ε)-approximation for arbitrary center sets. Thus, we can use a clustering feature to store S, which means that g(k, ε) ≡ 1.

Condition 2. For the second condition, assume that 𝒮 is a set of subsets of P representing the f2(k) subsets according to an optimal f2(k)-clustering. Let a set C of k centers be given, and define the partitioning S1, . . . , Sk for every S ∈ 𝒮 according to C as above.
By Equation (11.11) and by the precondition of Condition 2,

Σ_{S∈𝒮} Σ_{j=1}^{k} |Sj| dφ(μ(Sj), μ(S))
= Σ_{S∈𝒮} Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(S)) − Σ_{S∈𝒮} Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj))
≤ f3(ε) · opt(P, k).

We use the same technique as in the proof that Condition 1 holds. There are two changes: First, there are |𝒮| sets where the centroids of the subsets must be moved to the centroid of the specific S (where in the above proof, we only had one set S). Second, the bound depends on opt(P, k) instead of Σ_{S∈𝒮} cost(S), so the approximation is dependent on opt(P, k) as well, but this is consistent with the statement in Condition 2. The complete proof that Condition 2 holds can be found in the appendix; it is very similar to the proof that Condition 1 holds, with minor changes due to the two differences. We even set f3(ε) = f1(ε).

Theorem 11.1. If dφ : S × S → R is an m-similar Bregman divergence on a convex and P-covering set S with m ∈ (0, 1], then there exists a coreset consisting of clustering features of constant size, i.e., the size only depends on k and ε.

Proof. We have seen that the two conditions hold with f1(ε) = f3(ε) = 1/(1 + 4/(m·ε))², and g ≡ 1 and h(k^ν, ε) = k^ν. By Lemma 11.1, this implies a coreset size of

2k^ν = 2k^{log_{1+f1(ε/2)} (1/f3(ε/2))} = 2k^{log_{1+1/(1+8/(m·ε))²} ((1+8/(m·ε))²)}.

Acknowledgements. The project CG Learning acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 255827.
This work has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 ‘Providing Information by Resource-Constrained Analysis’, project A2.

References

[1] Community cleverness required. Nature, 455 (7209): 1, 4 September 2008. doi:10.1038/455001.
[2] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, volume 13, pp. 556–562, 2001.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, pp. 993–1022, 2003.
[4] IBM: What is big data? Bringing big data to the enterprise. ibm.com/software/data/bigdata/, accessed on the 3rd of October 2012.
[5] Sandia sees data management challenges spiral. HPC Projects, 4 August 2009.
[6] M. Ackermann and J. Blömer. Coresets and approximate clustering for Bregman divergences. Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1088–1097, 2009.
[7] M. Ackermann, J. Blömer, and C. Sohler. Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms, 6(4), 2010.
[8] M. W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning, 3(2), pp. 123–224, 2011, Now Publishers Inc.
[9] M. Ackermann, C. Lammersen, M. Märtens, C. Raupach, C. Sohler, and K. Swierkot. StreamKM++: A Clustering Algorithm for Data Streams. Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 173–187, 2010.
[10] P. Agarwal, S. Har-Peled, and K. Varadarajan. Approximating extent measures of points. Journal of the ACM, 51(4): 606–635, 2004.
[11] J. L. Bentley and J. B. Saxe. Decomposable searching problems I. Static-to-dynamic transformation. Journal of Algorithms, volume 1 (4), pp. 301–358, 1980.
[12] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman Divergences. Journal of Machine Learning Research, volume 6: 1705–1749, 2005.
[13] M. Bădoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pp. 396–407, 2002.
[14] M. Beyer. Gartner Says Solving ‘Big Data’ Challenge Involves More Than Just Managing Volumes of Data. Gartner. Retrieved 13 July 2011. http://www.gartner.com/it/page.jsp?id=1731916.
[15] K. Chen. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications. SIAM Journal on Computing, 39(3): 923–947, 2009.
[16] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix Approximation and Projective Clustering via Volume Sampling. Theory of Computing, 2(1): 225–247, 2006.
[17] A. Deshpande and K. Varadarajan. Sampling-based dimension reduction for subspace approximation. Proceedings of the 39th Annual ACM Symposium on Theory of Computing (STOC), pp. 641–650, 2007.
[18] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and
V. Vinay. Clustering in large graphs via the Singular Value Decomposition. Machine Learning, volume 56 (1–3), pp. 9–33, 2004.
[19] D. Feldman, A. Fiat, and M. Sharir. Coresets for Weighted Facilities and Their Applications. Proceedings of the 47th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 315–324, 2006.
[20] D. Feldman, M. Monemizadeh, C. Sohler, and D. Woodruff. Coresets and Sketches for High Dimensional Subspace Approximation Problems. Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 630–649, 2010.
[21] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC), pp. 569–578, 2011.
[22] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. Proceedings of the 23rd Annual ACM Symposium on Computational Geometry, pp. 11–18, 2007.
[23] D. Feldman and L. Schulman. Data reduction for weighted and outlier-resistant clustering. Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1343–1354, 2012.
[24] G. Frahling and C. Sohler. Coresets in dynamic geometric data streams. Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pp. 209–217, 2005.
[25] G. Frahling and C. Sohler. A Fast k-Means Implementation Using Coresets. International Journal of Computational Geometry and Applications, 18(6): 605–625, 2008.
[26] S. Har-Peled. How to Get Close to the Median Shape. CGTA, 36 (1): 39–51, 2007.
[27] S. Har-Peled. No Coreset, No Cry. Proceedings of the 24th Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pp. 324–335, 2004.
[28] S. Har-Peled and A. Kushal. Smaller coresets for k-median and k-means clustering. Proceedings of the 21st Annual ACM Symposium on Computational Geometry, pp. 126–134, 2005.
[29] S. Har-Peled and S. Mazumdar. Coresets for k-means and k-median clustering and their applications. Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 291–300, 2004.
[30] J. Hellerstein. Parallel Programming in the Age of Big Data. Gigaom Blog, 9 November 2008.
[31] M. Hilbert and P. Lopez. The World's Technological Capacity to Store, Communicate, and Compute Information. Science, volume 332, No. 6025, pp. 60–65, 2011.
[32] A. Jacobs. The Pathologies of Big Data. ACM Queue, 6 July 2009.
[33] D. Feldman, A. Krause, and M. Faulkner. Scalable Training of Mixture Models via Coresets. Proceedings of the 25th Conference on Neural Information Processing Systems (NIPS), 2011.
[34] J. Matoušek. On Approximate Geometric k-Clustering. Discrete & Computational Geometry, volume 24, No. 1, pp. 66–84, 2000.
[35] M. Langberg and L. J. Schulman. Universal epsilon-approximators for Integrals. Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 598–607, 2010.
[36] K. Varadarajan and X. Xiao. A near-linear algorithm for projective clustering integer points. Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2012.
[37] O. J. Reichman, M. B. Jones, and M. P. Schildhauer. Challenges and Opportunities of Open Data in Ecology. Science, 331 (6018): 703–705. doi:10.1126/science.1197962, 2011.
[38] T. Segaran and J. Hammerbacher. Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly Media, p. 257, 2009.
[39] N. Shyamalkumar and K. Varadarajan. Efficient subspace approximation algorithms. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 532–540, 2007.
[40] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.

A Complete Proof for Condition 2
For the second condition, assume that 𝒮 is a set of subsets of P representing the f2(k) subsets according to an optimal f2(k)-clustering. Let a set C of k centers be given, and define the partitioning S1, . . . , Sk for every S ∈ 𝒮 according to C as above. By Equation (11.11) and by the precondition of Condition 2,

Σ_{S∈𝒮} Σ_{j=1}^{k} |Sj| dφ(μ(Sj), μ(S))
= Σ_{S∈𝒮} Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(S)) − Σ_{S∈𝒮} Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj))
≤ f3(ε) · opt(P, k).

We use the same technique as in the proof that Condition 1 holds. There are two changes: First, there are |𝒮| sets where the centroids of the subsets must be moved to the centroid of the specific S (where in the above proof, we only had one set S). Second, the bound depends on opt(P, k) instead of Σ_{S∈𝒮} cost(S), so the approximation is dependent on opt(P, k) as well, but this is consistent with the statement in Condition 2.
We set f3(ε) = f1(ε) and again virtually add points. For each S ∈ 𝒮 and each subset Sj of S, we add a point with weight (m·ε/4)|Sj| and coordinate μ(S) + (4/(m·ε))(μ(S) − μ(Sj)) to Sj. Notice that these points lie
within the convex set A that dB is defined on because we assumed that S is P-covering. We name the new sets S'_j and S' as in the proof of Condition 1. Notice that the centroid of S'_j is now

[|Sj| · μ(Sj) + (εm/4)|Sj| (μ(S) + (4/(m·ε))(μ(S) − μ(Sj)))] / [(1 + εm/4)|Sj|] = μ(S)

in all cases. Again, clustering S' with c(μ(S)) is an upper bound for the clustering cost of S with c(μ(S)), and because the centroid of S'_j is μ(S), clustering every S'_j with c(μ(Sj)) is an upper bound on clustering S with c(μ(S)). Finally, we have to upper bound the cost of clustering all S'_j in all S ∈ 𝒮 with c(μ(Sj)), which we again do by bounding the additional cost incurred by the added points. Adding this cost over all S yields

Σ_{S∈𝒮} Σ_{j=1}^{k} (εm/4)|Sj| · dφ(μ(S) + (4/(m·ε))(μ(S) − μ(Sj)), c(μ(Sj)))
≤ Σ_{S∈𝒮} Σ_{j=1}^{k} (εm/4)|Sj| · ‖B(μ(S) + (4/(m·ε))(μ(S) − μ(Sj)) − c(μ(Sj)))‖² = ‖a‖².

For the last equality, we define |𝒮| vectors a^S by a^S_j := √(εm|Sj|/4) · ‖B(μ(S) + (4/(m·ε))(μ(S) − μ(Sj)) − c(μ(Sj)))‖ and concatenate them in arbitrary but fixed order to get a k · |𝒮| dimensional vector a. By the triangle inequality, a^S_j ≤ √(εm|Sj|/4) · ‖B((1 + 4/(mε))(μ(S) − μ(Sj)))‖ + √(εm|Sj|/4) · ‖B(μ(Sj) − c(μ(Sj)))‖ = b^S_j + d^S_j with b^S_j = √(εm|Sj|/4) · ‖B((1 + 4/(mε))(μ(S) − μ(Sj)))‖ and d^S_j = √(εm|Sj|/4) · ‖B(μ(Sj) − c(μ(Sj)))‖. Define b and d by concatenating the vectors b^S and d^S, respectively, in the same order as used for a. Then we can again conclude that

‖a‖ ≤ ‖b + d‖ ≤ ‖b‖ + ‖d‖,

where we use the triangle inequality for the second inequality. Now we observe that

‖b‖² = Σ_{S∈𝒮} Σ_{j=1}^{k} (εm/4)|Sj| · ‖B((1 + 4/(εm))(μ(S) − μ(Sj)))‖²
= (εm/4) Σ_{S∈𝒮} Σ_{j=1}^{k} |Sj| (1 + 4/(mε))² ‖B(μ(Sj) − μ(S))‖²
≤ (εm/4)(1 + 4/(mε))² Σ_{S∈𝒮} Σ_{j=1}^{k} |Sj| (1/m) dφ(μ(Sj), μ(S))
≤ (ε/4) opt(P, k).

Additionally, by the definition of m-similarity and by Equation (11.11) it holds that

‖d‖² = Σ_{S∈𝒮} Σ_{j=1}^{k} (εm/4)|Sj| · ‖B(μ(Sj) − c(μ(Sj)))‖²
≤ (ε/4) Σ_{S∈𝒮} Σ_{j=1}^{k} |Sj| dφ(μ(Sj), c(μ(Sj)))
≤ (ε/4) Σ_{S∈𝒮} Σ_{j=1}^{k} Σ_{x∈Sj} dφ(x, μ(Sj)) ≤ (ε/4) opt(P, k).

This implies that ‖a‖ ≤ ‖b‖ + ‖d‖ ≤ 2 √((ε/4) opt(P, k)) and thus

‖a‖² ≤ ε opt(P, k).