Big Data Algo
Abstract

We prove that the sum of the squared Euclidean distances from the n rows of an n × d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1 + ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε²)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1 + ε)-approximated by an optimal k-means clustering of their projection on the first O(k/ε²) right singular vectors (principal components) of A.

A (j, k)-coreset for projective clustering is a small set of points that yields a (1 + ε)-approximation to the sum of squared distances from the n rows of A to any set of k affine subspaces, each of dimension at most j. Our embedding yields (0, k)-coresets of size O(k) for handling k-means queries, (j, 1)-coresets of size O(j) for PCA queries, and (j, k)-coresets of size (log n)^{O(jk)} for any j, k ≥ 1 and constant ε ∈ (0, 1/2). Previous coresets usually have a size that depends linearly or even exponentially on d, which makes them useless when d ∼ n.

Using our coresets with the merge-and-reduce approach, we obtain embarrassingly parallel streaming algorithms for problems such as k-means, PCA and projective clustering. These algorithms use update time per point and memory that are polynomial in log n and only linear in d.

For cost functions other than squared Euclidean distances we suggest a simple recursive coreset construction that produces coresets of size k^{(1/ε)^{O(1)}} for k-means and a special class of Bregman divergences that is less dependent on the properties of the squared Euclidean distance.

1 Introduction

Big Data. Scientists regularly encounter limitations due to large data sets in many areas. Data sets grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), genome sequencing, cameras, microphones, radio-frequency identification chips, finance (such as stocks) logs, internet search, and wireless sensor networks [30, 38]. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [31]; as of 2012, every day 2.5 exabytes (2.5 × 10^18 bytes) of data were created [4]. Data sets such as the ones described above, and the challenges involved in analyzing them, are often subsumed under the term Big Data.

Gartner, and now much of the industry, use the "3Vs" model for describing Big Data [14]: increasing volume n (amount of data), its velocity (update time per new observation) and its variety d (dimension, or range of sources). The main contribution of this paper is that it deals with cases where both n and d are huge, and does not assume d ≪ n.

Data analysis. Classical techniques to analyze and/or summarize data sets include clustering, i.e. the partitioning of data into subsets of similar characteristics, and dimension reduction, which allows one to consider the dimensions of a data set that have the highest variance. In this paper we mainly consider problems that minimize the sum of squared errors, i.e. we try to find a set of geometric centers (points, lines or subspaces), such that the sum of squared distances from every input point to its nearest center is minimized.

Examples are the k-means or sum-of-squares clustering problem, where the centers are points. Another example is the j-subspace mean problem, where k = 1 and the center is a j-subspace, i.e., the sum of squared distances to the points is minimized over all j-subspaces. The j-rank approximation of a matrix is the projection of its rows on their j-subspace mean. Principal component analysis (PCA) is another example where k = 1, and the center is an affine subspace.

∗ MIT, Distributed Robotics Lab. Email: dan-[email protected]
† TU Dortmund, Germany, Email: {melanie.schmidt, christian.sohler}@tu-dortmund.de
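To make the claim about k-means and the first O(k/ε²) principal components concrete, here is a minimal Matlab sketch (in the spirit of the Matlab implementation referenced later as Algorithm 2, but not taken from it). The data, the constants and the use of the Statistics Toolbox function kmeans are illustrative assumptions.

% Minimal sketch (not the paper's code): k-means on the projection of the rows
% of A onto its first m = O(k/eps^2) right singular vectors (principal components).
k = 5;                          % number of centers (illustrative)
epsilon = 0.25;                 % approximation parameter (illustrative)
m = ceil(k/epsilon^2);          % number of right singular vectors to keep
A = randn(5000, 1000);          % placeholder n x d data matrix
[~, ~, V] = svd(A, 'econ');     % right singular vectors of A
Am = A * V(:, 1:m);             % n x m coordinates in the span of the top m directions
[idx, C] = kmeans(Am, k, 'Replicates', 5);   % requires the Statistics Toolbox
Cd = C * V(:, 1:m)';            % centers lifted back to the original d-dimensional space

The paper's guarantee is that an optimal k-means clustering of the projected rows is a (1 + ε)-approximate clustering of the original rows; the snippet only shows the workflow, not the guarantee.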
Note that A takes nd space, while the pair A_m and the constant ‖A − A_m‖₂² can be stored using nm + 1 space.

Coresets. Combining our main theorem with known results [21, 19, 36] we gain the following coresets for projective clustering. All the coresets can handle Big Data (streaming, parallel computation and fast update time), as explained in Section 10 and later in this section.

Notice that in Theorem 4.1, A_m is still n-dimensional and holds n points. We obtain Corollary 4.2 by observing that ‖A_m X‖₂² = ‖D^(m) V^T X‖₂² for every X ∈ R^{n×(n−j)}. As the only non-zero entries of D^(m) V^T are in the first m rows, we can store these as the matrix Ã. This can be considered as an exact coreset for A_m.
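The "exact coreset" observation can be verified directly: storing only the m non-zero rows of D^(m) V^T preserves the quantity ‖A_m X‖₂² for every X. The sketch below does this check for random data; all sizes and the test matrix X are illustrative assumptions.

% Minimal check: the first m rows of D^(m)*V' form an exact coreset for Am,
% i.e. both matrices give the same sum of squared projections for any X.
n = 400; d = 400; m = 10; j = 3;     % illustrative sizes
A = randn(n, d);
[U, D, V] = svd(A);                  % A = U*D*V'
Dm = D; Dm(m+1:end, :) = 0;          % keep only the first m diagonal entries
Am = U * Dm * V';                    % the m-rank approximation of A
Atilde = Dm(1:m, :) * V';            % the m non-zero rows of D^(m)*V'
X = orth(randn(d, d-j));             % arbitrary matrix with orthonormal columns
norm(Am*X, 'fro')^2 - norm(Atilde*X, 'fro')^2   % zero up to rounding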
Streaming. In the streaming model, the input points (rows of A) arrive on-line (one by one) and we need to maintain the desired output for the points that arrived so far. We aim for both the update time per point and the required memory (space) to be small — usually, linear in d and polynomial in log n.

For computing the k-rank approximation of a matrix A efficiently, in parallel or in the streaming model, we cannot use Corollary 4.2 directly: first, because it assumes that we already have the O(k/ε)-rank approximation of A, and second, because A is assumed to be in memory, which takes nd space. However, using merge-and-reduce we only need to apply the construction of the theorem on very small matrices of size independent of d, in overall time that is linear in both n and d, and space that is logarithmic in n. The construction is also embarrassingly parallel; see Fig. 3 and the discussion in Section 10.

The following corollary follows from Theorem 10.1 and the fact that computing the SVD of an m × d matrix takes O(dm²) time when m ≤ d.

Corollary 4.3. Let A be the n × d matrix whose n rows are the vectors seen so far in a stream of vectors in R^d. For every n ≥ 1 we can maintain a matrix Ã and c ≥ 0 that satisfy (4.3), where

1. Ã is of size 2m × 2m for m = k/ε;

2. the update time per row insertion, and the overall space used, is
\[
d \cdot \Bigl(\frac{k \log n}{\varepsilon}\Bigr)^{O(1)} .
\]

Using the last corollary, we can efficiently compute a (1 + ε)-approximation to the subspace that minimizes the sum of squared distances to the rows of a huge matrix A. After computing Ã and c for A as described in Corollary 4.3, we compute the k-subspace S* that minimizes the sum of squared distances to the small matrix Ã. By (4.3), S* approximately minimizes the sum of squared distances to the rows of A. To obtain an approximation to the k-rank approximation of A, we project the rows of A on S* in O(ndk) time.

Since Ã approximates dist²(A, S) for any k-subspace S of R^d (not only S*), computing the subspace that minimizes dist²(Ã, S) under arbitrary constraints would yield a (1 + ε)-approximation to the subspace that minimizes dist²(A, S) under the same constraints. Such problems include non-negative matrix factorization (NNMF, also called pLSA, or probabilistic LSA), which aims to compute a k-subspace S* that minimizes the sum of squared distances to the rows of A as defined above, with the additional constraint that the entries of S* all be non-negative.

Latent Dirichlet allocation (LDA) [3] is a generalization of NNMF, where a prior (multiplicative weight) is given for every possible k-subspace in R^d. In practice, especially when the corresponding optimization problem is NP-hard (as in the case of NNMF and LDA), running popular heuristics on the coreset pair Ã and c may not only turn them into faster, streaming and parallel algorithms; it might actually yield better results (i.e., "1 − ε" approximations) compared to running the heuristics on A; see [33].

In principal component analysis (PCA) we are usually interested in the affine k-subspace that minimizes the sum of squared distances to the rows of A. That is, the subspace need not intersect the origin. To this end, we replace k by k + 1 in the previous theorems, and compute the optimal affine k-subspace rather than the optimal (k + 1)-subspace of the small matrix Ã.

For the k-rank approximation we use the following corollary with Z as the empty set of constraints. Otherwise, for PCA, NNMF, or LDA we use the corresponding constraints.

Corollary 4.4. Let A be an n × d matrix. Let A_k denote an n × d matrix of rank at most k that minimizes ‖A − A_k‖² among a given (possibly infinite) set Z of such matrices. Let Ã and c be defined as in Corollary 4.3, and let Ã_k denote the matrix that minimizes ‖Ã − Ã_k‖² among the matrices in Z. Then
\[
\|A - \tilde{A}_k\|^2 \le (1 + \varepsilon)\,\|A - A_k\|^2 .
\]
Moreover, ‖A − A_k‖² can be approximated using Ã and c, as
\[
(1 - \varepsilon)\,\|A - \tilde{A}_k\|^2 \;\le\; \|\tilde{A} - \tilde{A}_k\|^2 + c \;\le\; (1 + \varepsilon)\,\|A - \tilde{A}_k\|^2 .
\]

k-means. The k-mean of A is the set S* of k points in R^d that minimizes the sum of squared distances dist²(A, S*) to the rows of A among every set of k points in R^d. It is not hard to prove that the k-mean of the k-rank approximation A_k of A is a 2-approximation for the k-mean of A in terms of the sum of squared distances to the k centers [18]. Since every set of k points is contained in a k-subspace of R^d, the k-mean of A_m is a (1 + ε)-approximation to the k-means of A in terms of the sum of squared distances. Since the k-mean of Ã is clearly in the span of Ã, we conclude from our main theorem the following corollary, which generalizes the known results from 2-approximation to (1 + ε)-approximation.

Corollary 4.5. Let A_m denote the m = bk/ε² rank approximation of an n × d matrix A, where b is
epsApproximation = -4.2727e-007

>> % Evaluate quality of cost estimation
>> % using coreset
>> epsEstimation = abs(costCA/(costCC+c) - 1)

epsEstimation = 5.1958e-014

Algorithm 2: A Matlab implementation of the coreset construction for k-means queries.

‖U D^(m) V^T X‖₂² is always non-negative. Then
\[
\begin{aligned}
\|U D V^T X\|_2^2 - \|U D^{(m)} V^T X\|_2^2
&= \|D V^T X\|_2^2 - \|D^{(m)} V^T X\|_2^2 \\
&= \|(D - D^{(m)}) V^T X\|_2^2 \\
&\le \|D - D^{(m)}\|_2^2 \cdot \|X\|_F^2 \\
&= j \cdot s_{m+1} .
\end{aligned}
\]

In the following we will show that one can use the first m rows of D^(m) V^T as a coreset for the linear j-dimensional subspace 1-clustering problem. We exploit the observation that by the Pythagorean theorem it holds that ‖AY‖₂² + ‖AX‖₂² = ‖A‖₂². Thus, ‖AY‖₂² can be decomposed as the difference between ‖A‖₂², the squared lengths of the points in A, and ‖AX‖₂², the squared lengths of the projection of A on L. Now, when using U D^(m) V^T instead of A, ‖U D^(m) V^T Y‖₂² decomposes into ‖U D^(m) V^T‖₂² − ‖U D^(m) V^T X‖₂² in the same way. By noting that ‖A‖₂² is actually Σ_{i=1}^{n} s_i and ‖U D^(m) V^T‖₂² is Σ_{i=1}^{m} s_i, it is clear that storing the remaining terms of the sum is sufficient to account for the difference between these two norms. Approximating ‖AY‖₂² by ‖U D^(m) V^T Y‖₂² thus reduces to approximating ‖AX‖₂² by ‖U D^(m) V^T X‖₂². We show that the difference between these two is only an ε-fraction of ‖AY‖₂², and explain after the corollary why this implies the desired coreset result.
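Before the formal statement, the decomposition just described can be illustrated numerically: the sketch below compares ‖AY‖₂² with ‖U D^(m) V^T Y‖₂² plus the tail sum of the s_i for a random square matrix and a random j-dimensional query subspace. All sizes are illustrative assumptions.

% Minimal numeric illustration: ||A*Y||^2 is approximated by
% ||U*D^(m)*V'*Y||^2 plus the tail sum of the squared singular values s_i.
n = 300; j = 2; epsilon = 0.2;          % illustrative sizes
m = ceil(j + j/epsilon);                % m = O(j/eps), as in the text
A = randn(n, n);
[U, D, V] = svd(A);
s = diag(D).^2;                         % s_i = squared singular values of A
Dm = D; Dm(m+1:end, :) = 0;
X = orth(randn(n, j));                  % basis of a random j-dimensional subspace L
Y = null(X');                           % orthonormal basis of the complement of L
costA = norm(A*Y, 'fro')^2;             % sum of squared distances of the rows of A to L
costApprox = norm(U*Dm*V'*Y, 'fro')^2 + sum(s(m+1:end));
relativeError = abs(costApprox - costA) / costA   % small; the lemma below bounds it by epsilon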
Figure 2: The distances of a point and its projection to a query subspace.

Lemma 6.2. (Coreset for Linear Subspace 1-Clustering) Let A ∈ R^{n×n} be an n × n matrix with singular value decomposition A = U D V^T, let Y be an n × (n − j) matrix with orthonormal columns, let 0 ≤ ε ≤ 1, and let m ∈ N with n − 1 ≥ m ≥ j + j/ε − 1. Then
\[
0 \;\le\; \|U D^{(m)} V^T Y\|_2^2 + \sum_{i=m+1}^{n} s_i - \|AY\|_2^2 \;\le\; \varepsilon \cdot \|AY\|_2^2 .
\]

Proof. We have ‖AY‖₂² ≤ ‖U D^(m) V^T Y‖₂² + ‖U (D − D^(m)) V^T Y‖₂² ≤ ‖U D^(m) V^T Y‖₂² + Σ_{i=m+1}^{n} s_i, which proves that ‖U D^(m) V^T Y‖₂² + Σ_{i=m+1}^{n} s_i − ‖AY‖₂² is non-negative. We now follow the outline sketched above. By the Pythagorean Theorem, ‖AY‖₂² = ‖A‖₂² − ‖AX‖₂², where X has orthonormal columns and spans the space orthogonal to the space spanned by Y. Using ‖A‖₂² = Σ_{i=1}^{n} s_i and ‖U D^(m) V^T‖₂² = Σ_{i=1}^{m} s_i, we obtain
\[
\begin{aligned}
\|U D^{(m)} V^T Y\|_2^2 + \sum_{i=m+1}^{n} s_i - \|AY\|_2^2
&= \|U D^{(m)} V^T\|_2^2 - \|U D^{(m)} V^T X\|_2^2 + \sum_{i=m+1}^{n} s_i - \|A\|_2^2 + \|AX\|_2^2 \\
&= \|AX\|_2^2 - \|U D^{(m)} V^T X\|_2^2 \\
&\le \varepsilon \cdot \sum_{i=j+1}^{n} s_i \;\le\; \varepsilon \cdot \|AY\|_2^2 ,
\end{aligned}
\]
where the first inequality follows from Lemma 6.1. □

Now we observe that by orthonormality of the columns of U we have ‖U M‖₂² = ‖M‖₂² for any matrix M, which implies that ‖U D V^T X‖₂² = ‖D V^T X‖₂².

Thus, we can replace the matrix U D^(m) V^T in the above corollary by D^(m) V^T. This is interesting, because all rows except the first m rows of this new matrix have only 0 entries, and so they don't contribute to ‖D^(m) V^T X‖₂². Therefore, we will define our coreset matrix S to be the matrix consisting of the first m = O(j/ε) rows of D^(m) V^T. The rows of this matrix will be the coreset points.

In the following, we summarize our results and state them for n points in d-dimensional space.

Corollary 6.1. Let A be an n × n matrix whose n rows represent n points in n-dimensional space. Let A = U D V^T be the SVD of A and let D^(m) be a matrix that contains the first m = O(j/ε) diagonal entries of D and is 0 otherwise. Then the rows of D^(m) V^T form a coreset for the linear j-subspace 1-clustering problem.

If one is familiar with the coreset literature, it may seem a bit strange that the resulting point set is unweighted, i.e. we replace n unweighted points by m unweighted points. However, for this problem the weighting can be implicitly done by scaling. Alternatively, we could also define our coreset to be the set of the first m rows of V^T, where the i-th row is weighted by s_i.

7 Dimensionality Reduction for Projective Clustering Problems

In order to deal with k subspaces we will use a form of dimensionality reduction. To define this reduction, let L be a linear j-dimensional subspace represented by an n × j matrix X with orthonormal columns, and let Y be an n × (n − j) matrix with orthonormal columns that spans L^⊥. Notice that for any matrix M we can write the projection of the points in the rows of M to L as M X X^T, and that these projected points are still n-dimensional, but lie within the j-dimensional subspace L. Our first step will be to show that if we project both U D V^T and A′ := U D^(m) V^T on X by computing U D V^T X X^T and U D^(m) V^T X X^T, then the sum of squared distances of the corresponding rows of the projections is small compared to the cost of the projection. In other words, after the projection the points of A will be relatively close to their counterparts of A′. Notice the difference to the similar looking Lemma 6.1: In Lemma 6.1, we showed that if we project A to L and sum up the squared lengths of the projections, then this sum is similar to the sum of the lengths of the projections of A′. In the following corollary, we look at the distances between a projection of a point from A and the projection of the corresponding point in A′, then we square these distances and show that the
last inequality follows from Lemma 6.1. □

Now assume we want to use A′ = U D^(m) V^T to estimate the cost of a j-dimensional affine subspace k-clustering problem. Let L_1, . . . , L_k be a set of affine subspaces and let L be a j*-dimensional subspace containing C = L_1 ∪ · · · ∪ L_k, j* ≥ k(j + 1). Then by the Pythagorean theorem we can write dist²(A, C) = dist²(A, L) + dist²(A X X^T, C), where X is a matrix with orthonormal columns whose span is L. We claim that dist²(A′, C) + Σ_{i=m+1}^{n} s_i is a good approximation for dist²(A, C). We also know that dist²(A′, C) = dist²(A′, L) + dist²(A′ X X^T, C). Furthermore, by Corollary 6.2, |dist²(A′, L) + Σ_{i=m+1}^{n} s_i − dist²(A, L)| ≤ ε·dist²(A, L) ≤ ε·dist²(A, C). Thus, if we can prove that |dist²(A X X^T, C) − dist²(A′ X X^T, C)| ≤ ε·dist²(A, C), we have shown that |dist²(A, C) − dist²(A′, C) − Σ_{i=m+1}^{n} s_i| is small, as claimed. This amounts to summing the error of approximating an input point p by its projection q. If C is contained in a j*-dimensional subspace, the error will be sufficiently small for m ≥ j* + 30j*/ε² − 1. This is done in the proof of the theorem below.

Theorem 8.1. Let A ∈ R^{n×n} be a matrix with singular value decomposition A = U D V^T and let ε > 0. Let j ≥ 1 be an integer, j* = j + 1 and m = j* + 30j*/ε² − 1 such that m ≤ n − 1. Furthermore, let A′ = U D^(m) V^T, where D^(m) is a diagonal matrix whose diagonal is the first m diagonal entries of D followed by n − m zeros. Then for any compact set C, which is contained in a j-dimensional subspace, we have
\[
\Bigl|\, \mathrm{dist}^2(A, C) - \Bigl(\mathrm{dist}^2(A', C) + \sum_{i=m+1}^{n} s_i\Bigr) \Bigr| \;\le\; \varepsilon \cdot \mathrm{dist}^2(A, C) .
\]

\[
\begin{aligned}
\mathrm{dist}^2(A, C) - \mathrm{dist}^2(A', C) - \sum_{i=m+1}^{n} s_i
&\overset{(8.7)}{=} \|AY\|^2 - \|A'Y\|^2 - \sum_{i=m+1}^{n} s_i + \sum_{p \in P} \bigl(\mathrm{dist}^2(p, C) - \mathrm{dist}^2(p', C)\bigr) \\
&\overset{(8.8)}{\le} \varepsilon\,\|AY\|^2 + \sum_{i=1}^{n} \Bigl(\frac{12\,\|p_i - p'_i\|^2}{\varepsilon} + \frac{\varepsilon}{2}\,\mathrm{dist}^2(p_i, C)\Bigr) \\
&\overset{(8.9)}{\le} \varepsilon \cdot \mathrm{dist}^2(A, C),
\end{aligned}
\]
where (8.7) follows from the Pythagorean Theorem, (8.8) follows from Lemma 7.1 and Corollary 6.2, and (8.9) follows from (8.5) and (8.6). □

9 Small Coresets for Projective Clustering

In this section we use the result of the previous section to prove that there is a strong coreset of size independent of the dimension of the space. In the last section, we showed that the projection A′ of A is a coreset for A. A′ still has n points, which are n-dimensional but lie in an m-dimensional subspace. To reduce the number of points, we want to apply known coreset constructions to A′ within the low-dimensional subspace. However, this would mean that our coreset only holds for centers that are also from the low-dimensional subspace, but of course we want the centers to be chosen from the full-dimensional space. We get around this problem by applying the coreset constructions to a slightly larger space than the subspace that A′ lives in. The following lemma provides us the necessary tool to complete the argumentation.

Let L be any (m + j*)-dimensional subspace that contains the m-dimensional subspace in which the points in A′ lie, let C be an arbitrary compact set, which can, for example, be a set of centers (points, lines, subspaces, rings, etc.), and let L_2 be a j*-dimensional subspace that contains C. If we now set d_1 := m and d_2 := j* and let L_1 be the subspace spanned by the rows of A′, then Lemma 9.1 implies that every given compact set (and so any set of centers) can be rotated into L. We could now proceed by computing a coreset in an (m + j*)-dimensional subspace and then rotate every set of centers to this subspace. However, the last step is not necessary because of the following.

The mapping defined by any orthonormal matrix U is an isometry, and thus applying it does not change Euclidean distances. So, Lemma 9.1 implies that the sum of squared distances of C to the rows of A′ is the same as the sum of squared distances of U(C) := {Ux : x ∈ C} to A′. Now assume that we have a coreset for the subspace L. U(C) is still a union of k subspaces, and it is in L. This implies that the sum of squared distances to U(C) is approximated by the coreset. But this is identical to the sum of squared distances to C, and so this is approximated by the coreset as well.

Thus, in order to construct a coreset for a set of n points in R^n we proceed as follows. In a first step we use the dimensionality reduction to reduce the input point set to a set of n points that lie in an m-dimensional subspace. Then we construct a coreset for an (m + j*)-dimensional subspace that contains the low-dimensional point set. By the discussion above, this will be a coreset.

Finally, combining this with known results [21, 19, 36] we gain the following corollary.
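As a sanity check of Theorem 8.1, the sketch below evaluates both sides for a center set C consisting of k points, so that C is contained in a subspace of dimension at most k and j* = k + 1 in the theorem. The data and all parameters are illustrative assumptions.

% Minimal numeric illustration of Theorem 8.1 with C a set of k points.
n = 500; k = 3; epsilon = 0.5;
jstar = k + 1;                          % k points span a subspace of dimension at most k
m = ceil(jstar + 30*jstar/epsilon^2 - 1);
A = randn(n, n);
[U, D, V] = svd(A);
s = diag(D).^2;
Dm = D; Dm(m+1:end, :) = 0;
Aprime = U * Dm * V';                   % A' = U*D^(m)*V'
C = randn(k, n);                        % k arbitrary center points, one per row
dA = inf(n, 1); dAp = inf(n, 1);        % min squared distance of each row to C
for i = 1:k
    dA  = min(dA,  sum((A      - C(i,:)).^2, 2));
    dAp = min(dAp, sum((Aprime - C(i,:)).^2, 2));
end
lhs = sum(dA);                          % dist^2(A, C)
rhs = sum(dAp) + sum(s(m+1:end));       % dist^2(A', C) + tail of the s_i
relativeGap = abs(lhs - rhs) / lhs      % at most roughly epsilon by the theorem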
\[
\frac{\,|S_j| \cdot \mu(S_j) + \frac{\varepsilon m}{4}\,|S_j| \Bigl(\mu(S) + \frac{4}{m\varepsilon}\bigl(\mu(S) - \mu(S_j)\bigr)\Bigr)}{\Bigl(1 + \frac{\varepsilon m}{4}\Bigr)|S_j|} \;=\; \mu(S)
\]
in all cases. Again, clustering S with c(μ(S)) is an upper bound for the clustering cost of S with c(μ(S)), and because the centroid of S_j is μ(S), clustering every S_j with c(μ(S_j)) is an upper bound on clustering S with c(μ(S)). Finally, we have to upper bound the cost of clustering all S_j in all S with c(μ(S_j)), which we again do by bounding the additional cost incurred by the added points. Adding this cost over all S yields
\[
\sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{\varepsilon m}{4}\,|S_j| \cdot d_\varphi\!\Bigl(\mu(S) + \frac{4}{m\varepsilon}\bigl(\mu(S) - \mu(S_j)\bigr),\, c(\mu(S_j))\Bigr)
\;\le\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{\varepsilon m}{4}\,|S_j| \cdot \Bigl\|B\Bigl(\mu(S) + \frac{4}{m\varepsilon}\bigl(\mu(S) - \mu(S_j)\bigr) - c(\mu(S_j))\Bigr)\Bigr\|^2 \;=\; a^2 .
\]

\[
\cdots \;=\; \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{\varepsilon m}{4}\,|S_j| \Bigl(1 + \frac{4}{m\varepsilon}\Bigr)^{\!2} \bigl\|B\bigl(\mu(S_j) - \mu(S)\bigr)\bigr\|^2
\;\le\; \Bigl(1 + \frac{\varepsilon m}{4}\Bigr)\frac{1}{m\varepsilon} \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} |S_j|\, d_\varphi\bigl(\mu(S_j), \mu(S)\bigr)
\;\le\; \frac{\varepsilon}{4}\,\mathrm{opt}(P, k).
\]

Additionally, by the definition of m-similarity and by Equation (11.11) it holds that
\[
d^2 = \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \frac{\varepsilon m}{4}\,|S_j|\, \bigl\|B\bigl(\mu(S_j) - c(\mu(S_j))\bigr)\bigr\|^2
\;\le\; \frac{\varepsilon}{4} \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} |S_j|\, d_\varphi\bigl(\mu(S_j), c(\mu(S_j))\bigr)
\;\le\; \frac{\varepsilon}{4} \sum_{S \in \mathcal{S}} \sum_{j=1}^{k} \sum_{x \in S_j} d_\varphi\bigl(x, \mu(S_j)\bigr).
\]
This implies that a ≤ b + d ≤ √(ε/2 · opt(P, k)) and thus a² ≤ ε · opt(P, k).
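The proof above manipulates a Bregman divergence d_φ together with the centroids μ(S) and μ(S_j). For readers less used to this notation, the following sketch spells out the generic definition and, for the particular choice φ(x) = ‖x‖² (which is only the simplest instance, not the paper's general setting), checks that it reduces to the squared Euclidean distance and that the centroid minimizes the summed divergence.

% Generic Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>.
phi     = @(x) sum(x.^2);               % the choice phi(x) = ||x||^2
gradphi = @(x) 2*x;
dphi    = @(x, y) phi(x) - phi(y) - dot(gradphi(y), x - y);
x = randn(5, 1); y = randn(5, 1);
dphi(x, y) - norm(x - y)^2              % zero up to rounding: squared Euclidean distance
% The centroid mu(S) of a point set minimizes the summed divergence to a center,
% which is the kind of centroid property exploited with mu(S) and mu(S_j) above.
S  = randn(100, 5);                     % one point per row
mu = mean(S, 1)';
c0 = randn(5, 1);                       % an arbitrary alternative center
costAtMu = sum(arrayfun(@(i) dphi(S(i,:)', mu), 1:size(S,1)));
costAtC0 = sum(arrayfun(@(i) dphi(S(i,:)', c0), 1:size(S,1)));
[costAtMu, costAtC0]                    % the first is never larger than the second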