Abstract—A truncated singular value decomposition (SVD) is a powerful tool for analyzing modern datasets. However, the massive volume and rapidly changing nature of the datasets often make it too expensive to compute the SVD of the whole dataset at once. It is more attractive to use only a part of the dataset at a time and incrementally update the SVD. A randomized algorithm has been shown to be a great alternative to a traditional updating algorithm due to its ability to efficiently filter out the noises and extract the relevant features of the dataset. Though it is often faster than the traditional algorithm, in order to extract the relevant features, the randomized algorithm may need to access the data multiple times, and this data access creates a significant performance bottleneck. To improve the performance of the randomized algorithm for updating the SVD, we study, in this paper, two sampling algorithms that access the data only two or three times, respectively. We present several case studies to show that only a small fraction of the data may be needed to maintain the quality of the updated SVD, while our performance results on a hybrid CPU/GPU computer demonstrate the potential of the sampling algorithms to improve the performance of the randomized algorithm.

Index Terms—sample; randomize; update SVD; out-of-core;

I. INTRODUCTION

To analyze the modern datasets with a wide variety and veracity, a truncated singular value decomposition (SVD) [1] of the matrix representing the data is a powerful tool. The ability of the SVD to filter out noises and extract the underlying features of the data has been demonstrated in many data analysis tools, including Latent Semantic Indexing (LSI) [2], recommendation systems [3], population clustering [4], and subspace tracking [5]. Also, as the modern datasets are constantly being updated and analyzed, we develop a good understanding of the data (e.g., the singular value distribution), which can be used to tune the performance or the robustness of computing the SVD for that particular application (e.g., the required numerical rank for the accurate data analysis, or the number of data passes needed to compute the SVD). Furthermore, these tuning parameters stay roughly the same for different datasets from the same applications.

With the increase in the external storage capacity, the amount of data generated from observations, experiments, and simulations has been growing at an unprecedented rate. These phenomena have led to the emergence of numerous massive datasets in many areas of study including science, engineering, medicine, finance, social media, and e-commerce. The specific applications that generate the rapidly-changing massive datasets include the communication and electric grids, transportation and financial systems, personalized services on the internet, particle or astro physics, and genome sequencing.

Hence, besides the variety and veracity of the dataset, the data analysis tool must address the challenges associated with the volume and velocity of the changes made to the dataset. For instance, the computers may not have enough compute power to accommodate such rapidly growing or changing data if the computational complexity of the data analysis tool grows superlinearly with the data size. In addition, accessing the data through the local memory hierarchy is expensive, and accessing these data in the external storage is even more costly. Therefore, the data analysis tool needs to be data-pass efficient. In particular, it may become too costly to compute the SVD of the whole dataset at once, or to recompute the SVD every time the changes are made to the dataset. In some applications, recomputing the SVD may not even be possible because the original data, for which the SVD has been already computed, is no longer available. To address these challenges, an attractive approach is to update (rather than recompute) the SVD. For example, we can incrementally update the SVD using only a part of the matrix that fits in the core memory at a time. Hence, the whole matrix is moved to the core memory only once.

A randomized algorithm has been shown to be an efficient method to update the SVD [6]. To reduce both the computational and data access costs, it projects the data onto a smaller subspace before computing the updated SVD. Compared with the state-of-the-art updating algorithm [7], the randomized algorithm often compresses the data into a smaller projection subspace with a lower communication latency cost. As a result, the randomized algorithm could obtain much higher performance on a modern computer, where communication has become significantly more expensive than the arithmetic operations, both in terms of time and energy consumption. In addition, the randomized algorithm accesses the data only through the dense or sparse matrix-matrix multiplication (GEMM or SpMM), whose highly-optimized implementations are provided by many vendors. In other applications, the external storage (e.g., a database) may provide a functionality to compute the matrix multiplication and only transfer the resulting vectors to the memory, thus avoiding the explicit generation and transfer of the matrix into the memory.

To filter out the noises and extract the relevant features, however, the randomized algorithm may require multiple data passes that become the performance bottleneck. In this paper, we use two methods to reduce this bottleneck. 1) We integrate data sampling into the randomized algorithm. Namely, we first sample the new data using the information gathered while compressing the previous data. Then, the randomized algorithm only uses the sampled data (which fits in the memory) to update the SVD.
We present two sampling algorithms, requiring only two or three data passes, respectively (one pass to sample the data, and one or two additional passes to generate the projection subspace). 2) We study a randomized algorithm to incrementally update the SVD using a subset of the sampled rows. The algorithm does not access the rows that have been already compressed and uses only the sampled rows that can fit in the memory at a time. This becomes attractive when we fail to sample enough rows or all the sampled rows do not fit in the memory at once.

We present several case studies, in which we needed to sample only a fraction of the data to maintain the quality of the updated SVD. We also show the potential of the sampling algorithm to improve the performance of the randomized algorithm on multicore CPUs with an NVIDIA GPU using different implementations and data configurations. As the cost of data access grows due to the properties of both data and hardware, such sampling algorithms would likely play more critical roles in analyzing the modern datasets. Throughout this paper, we use a_{i,j} and v_i to denote the (i,j)-th entry of a matrix A and the i-th entry of a vector v, respectively.

II. PROBLEM STATEMENT

We assume that a rank-k approximation A_k = U_k Σ_k V_k^T of an m-by-n matrix A has been computed, where Σ_k is a k-by-k diagonal matrix whose diagonal entries approximate the k dominant singular values of A, and the columns of U_k and V_k approximate the corresponding left and right singular vectors, respectively. We then consider computing a rank-k approximation of a matrix Â,

  Â ≈ Û_k Σ̂_k V̂_k^T,    (1)

where Â = [A D] and an m-by-d matrix D represents the new set of columns being added to A. This problem is of particular interest for the term-document matrices from latent semantic indexing (LSI) [8], and it is commonly referred to as the updating-documents problem. Two other updating problems exist, updating-terms and updating-weights, which add a new set of matrix rows and update a set of the matrix entries, respectively. Though we focus only on updating documents, all three problems can be expressed as low-rank corrections to the original matrix. In many cases, Â is tall and skinny, having more rows than columns (i.e., m ≫ n, d).

All the algorithms studied in this paper belong to a class of subspace projection methods:

Alg. 1. Subspace Projection Method:
1) Generate a pair of k + ℓ orthonormal basis vectors P̂ and Q̂ that approximately span the range and domain of the matrix Â, respectively,
     Â ≈ P̂ Q̂^T,    (2)
   where ℓ is an oversampling parameter [9] selected to enhance the performance or robustness of the algorithm.
2) Use a standard deterministic algorithm to compute the SVD of the projected matrix B := P̂^T Â Q̂,
     X Σ̂ Y^T := SVD(B).
3) Compute the low-rank approximation (1) such that the diagonal entries of Σ̂_k are the k dominant singular values of B, and Û_k := P̂ X_k and V̂_k := Q̂ Y_k, where X_k and Y_k are the corresponding left and right singular vectors of B, respectively.

The first step of generating the projection subspace typically dominates the performance, and is the focus of this paper.
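For reference, the last two steps of Alg. 1 can be written as a short NumPy sketch; it assumes the bases P̂ and Q̂ have already been generated by one of the algorithms discussed in the following sections, and the function and argument names below are ours, not part of the paper.

```python
import numpy as np

def project_and_truncate(A_hat, P_hat, Q_hat, k):
    """Steps 2-3 of the subspace projection method (Alg. 1).

    A_hat : m x (n+d) updated matrix [A D]
    P_hat : m x (k+l) orthonormal basis for the (approximate) range
    Q_hat : (n+d) x p orthonormal basis for the (approximate) domain
    Returns a rank-k approximation U_k, sigma_k, V_k of A_hat.
    """
    B = P_hat.T @ (A_hat @ Q_hat)        # small projected matrix B = P_hat^T A_hat Q_hat
    X, sigma, Yt = np.linalg.svd(B)      # dense, deterministic SVD of B
    U_k = P_hat @ X[:, :k]               # U_hat_k := P_hat X_k
    V_k = Q_hat @ Yt[:k, :].T            # V_hat_k := Q_hat Y_k
    return U_k, sigma[:k], V_k
```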
III. PREVIOUS ALGORITHMS

Two algorithms have been previously used to generate the basis vectors P̂ and Q̂ of (2).

A. Updating Algorithm

The updating algorithm [7] computes the basis vectors P̂ and Q̂ by first orthogonalizing D against the current approximate left singular vectors U_k,

  D̂ := (I − U_k U_k^T) D,    (3)

and then computing its QR factorization to orthonormalize the resulting D̂,

  P R := QR(D̂),

such that P is an m-by-d orthonormal matrix and R is a d-by-d upper-triangular matrix. The basis vectors are then given by

  P̂ = [U_k  P]   and   Q̂ = [ V_k   0  ],    (4)
                             [  0   I_d ]

where I_d is a d-by-d identity matrix. The resulting (k + d)-by-(k + d) projected matrix B ≡ P̂^T Â Q̂ is given by

  B = [ Σ_k  U_k^T D ].
      [  0      R    ]

The algorithm is shown to compute a good approximation to the truncated SVD of the matrix Â, especially when its singular values have the so-called "low-rank-plus-shift" distribution [10]. Since the "low-rank" and the "shift" respectively correspond to the relevant features and the noises of the underlying data, the matrices from many applications of our interest have this type of singular value distribution.

In (3), the algorithm first accesses the matrix D through SpMM (or GEMM) to compute U_k^T D, and then accesses it again to accumulate the results of SpMM and compute D̂. In practice, to reduce the large amount of memory needed to store the m-by-d dense vectors P, we incrementally update the SVD by adding a subset of the new columns D at a time. However, all the columns of D are still orthogonalized against U_k. In addition, the accumulated cost of computing the SVD of the matrices B and updating U_k and V_k could still be significant.
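To make (3) and (4) concrete, the following is a minimal NumPy sketch of one update step in the spirit of the updating algorithm; it forms D̂ explicitly and is therefore only reasonable when the d new columns fit in memory. The function and variable names are ours.

```python
import numpy as np

def updating_algorithm_add_columns(U_k, s_k, V_k, D):
    """One update in the spirit of the updating algorithm (Section III-A).

    U_k : m x k left singular vectors, s_k : k approximate singular values,
    V_k : n x k right singular vectors, D : m x d new columns appended to A.
    Returns the updated rank-k factors of [A D].
    """
    d = D.shape[1]
    n, k = V_k.shape
    UtD = U_k.T @ D                        # first pass over D
    D_hat = D - U_k @ UtD                  # D_hat = (I - U_k U_k^T) D, second pass
    P, R = np.linalg.qr(D_hat)             # P R := QR(D_hat)
    # Projected matrix B = [[diag(s_k), U_k^T D], [0, R]] of size (k+d) x (k+d)
    B = np.block([[np.diag(s_k), UtD],
                  [np.zeros((d, k)), R]])
    X, s, Yt = np.linalg.svd(B)
    # Basis vectors of (4): P_hat = [U_k P], Q_hat = blkdiag(V_k, I_d)
    P_hat = np.hstack([U_k, P])
    Q_hat = np.block([[V_k, np.zeros((n, d))],
                      [np.zeros((d, k)), np.eye(d)]])
    return P_hat @ X[:, :k], s[:k], Q_hat @ Yt[:k, :].T
```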
B. Randomized Algorithm

To reduce the cost of the updating algorithm, a randomized algorithm [6] applies normalized power iterations to the matrix D̂ of (3) without explicitly forming D̂:

Alg. 2. Randomized Algorithm for Adding Columns:
1. Generate ℓ Gaussian random vectors Q
2. Compute QR factorization of Q, Q R := QR(Q)
for j = 1, 2, . . . , s do
  3. Approximate the matrix range, P := (I − U_k U_k^T) D Q
  4. Compute QR factorization of P, P R := QR(P)
  if j < s then
    5. Approximate the matrix domain, Q := D^T P
    6. Compute QR factorization of Q, Q R := QR(Q)
  end if
end for
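The structure of Alg. 2 maps directly onto a few dense or sparse matrix products. The sketch below follows that structure in NumPy; the oversampling parameter ell and the iteration count s follow the notation above, and the function name is ours.

```python
import numpy as np

def rand_subspace_add_columns(U_k, D, ell, s, rng=None):
    """Power iteration of Alg. 2: build a basis P for the range of (I - U_k U_k^T) D.

    U_k : m x k current left singular vectors, D : m x d new columns,
    ell : number of Gaussian probe vectors, s : number of iterations.
    Returns P (m x ell) and the last right-hand basis Q (d x ell).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = D.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, ell)))   # steps 1-2
    for j in range(1, s + 1):
        W = D @ Q                                        # SpMM/GEMM with D
        P, _ = np.linalg.qr(W - U_k @ (U_k.T @ W))       # steps 3-4: range of D_hat
        if j < s:
            Q, _ = np.linalg.qr(D.T @ P)                 # steps 5-6: domain of D_hat
    return P, Q
```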
After the power iteration, the basis vectors P̂ and Q̂ are computed as in (4), but using the P generated by the iteration. To further reduce the cost of solving the projected system, a smaller right projection subspace was also proposed, where the basis vectors Q generated by the power iteration are used,

  Q̂ = [ V_k  0 ].
      [  0   Q ]

We refer to these two projection schemes as the randomized algorithms with one-sided and two-sided approximation, or as Rnd1 and Rnd2 in short, respectively. The respective (k + ℓ)-by-(k + d) and (k + ℓ)-by-(k + ℓ) projected matrices B ≡ P̂^T Â Q̂ are given by

  B = [ Σ_k  U_k^T D ]   and   B = [ Σ_k  U_k^T D Q ].    (5)
      [  0    P^T D  ]             [  0       R      ]

The smaller B of Rnd2 leads to a lower computational cost. More importantly, however, if the result of DQ is saved at Step 3 of the above algorithm, the projected matrix B of Rnd2 can be computed without an additional SpMM. On the other hand, to compute B, Rnd1 requires the additional SpMM to multiply D with the vectors U_k and P.

When the updating algorithm of Section III-A incrementally adds ℓ columns at a time, it performs d/ℓ orthogonalizations and projections. Hence, if the randomized algorithm converges in fewer than d/ℓ iterations, it has a lower orthogonalization cost than the updating algorithm. Since the randomized algorithm typically requires only a couple of iterations, it could obtain significant speedups over the updating algorithm (see Section VII). However, while the updating algorithm only accesses the matrix D twice, the randomized algorithm accesses D (2s − 1) times over the s iterations. Though D is accessed only through SpMM, this data access often dominates the performance of the randomized algorithm.

IV. SAMPLING ALGORITHM

To lower the data access cost of the randomized algorithm, in this section, we integrate sampling. For instance, instead of iterating with the new data D, we may iterate with the row-sampled matrix D̃_r that contains only a subset of its rows,

  D̃_r = S_r C_r D,    (6)

where the c-by-m matrix C_r samples the rows of the matrix D, and the c-by-c matrix S_r scales the sampled rows (i.e., S_r C_r has a single nonzero entry on each row, (S_r C_r)_{i,c_i} = s_{i,i}, where c_i is the index of the i-th sampled row). The following algorithm generates our sampling matrix C:

Alg. 3. Algorithm to Sample Rows:
1. Generate an m-length probability distribution p such that p_i is the probability that the i-th row is sampled, and Σ_{i=1}^{m} p_i = 1
2. Compute the probability intervals t such that t_i = Σ_{j=1}^{i} p_j
3. Sample c rows of D following the distribution p
for i = 1, 2, . . . , c do
  3.1. Draw a uniformly distributed random number γ in the interval (0, 1)
  3.2. Select the sampled row c_i such that c_i = arg max_i (γ < t_i | i = 1, 2, . . . , m)
  if sampling without replacement and c_i has been previously selected then
    3.3. Draw a new γ and go to Step 3.2
  end if
end for

In this paper, we use the following two types of distributions p. The first is the uniform distribution (i.e., p_i = 1/m), while the second uses the leverage scores such that p_i = (1/k) Σ_{j=1}^{k} u_{i,j}^2, where u_{i,j} is the (i, j)-th entry of the current approximate left singular vectors U_k. Then, when sampling with replacement, we use the scaling matrix S_r = √(m/c) I_c, while without replacement, we let S_r = diag(1/√(c p_{c_1}), . . . , 1/√(c p_{c_c})).

Though we could have used any sampling algorithm, we focused on these two distributions because they are readily available without an additional data pass over D. Theoretical studies have been conducted on these two distributions, including an upper bound on the approximation error of the sampled Gram matrix [11]. Since we iterate on the Gram matrix, this provides a good theoretical motivation for using these distributions. Finally, these two distributions have been used in many studies, and provide a baseline performance for our frameworks. If a more effective sampling, or sketching [12], scheme exists for a particular application, then it can be easily integrated in our framework. Our focus is to use these two basic distributions to sample our specific matrix D̂ = (I − U_k U_k^T) D and study their effectiveness for updating the SVD.
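A minimal NumPy sketch of this row sampling is given below. It draws c rows by inverse-CDF selection over the intervals t_i of Alg. 3 (using searchsorted in place of an explicit scan), supports the uniform and leverage-score distributions, and applies the 1/√(c p_{c_i}) scaling, which reduces to √(m/c) for the uniform distribution. The function names are ours.

```python
import numpy as np

def row_probabilities(U_k, use_leverage=True):
    """Uniform p_i = 1/m, or leverage scores p_i = (1/k) * sum_j U_k[i, j]^2."""
    m, k = U_k.shape
    if not use_leverage:
        return np.full(m, 1.0 / m)
    return np.sum(U_k ** 2, axis=1) / k

def sample_rows(D, p, c, with_replacement=True, rng=None):
    """Alg. 3-style sampling: pick c rows of D following p, then scale them."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.cumsum(p)                                        # probability intervals t_i
    rows = []
    while len(rows) < c:
        gamma = rng.uniform(0.0, 1.0)                       # step 3.1
        i = min(int(np.searchsorted(t, gamma)), len(p) - 1) # step 3.2: first i with gamma < t_i
        if not with_replacement and i in rows:
            continue                                        # step 3.3: redraw gamma
        rows.append(i)
    rows = np.asarray(rows)
    scale = 1.0 / np.sqrt(c * p[rows])                      # diagonal entries of S_r
    D_r = scale[:, None] * D[rows, :]                       # D_tilde_r = S_r C_r D
    return D_r, rows, scale
```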
A. Row Sampling

We now describe our first randomized sampling framework to update the SVD. Given the row-sampling and scaling matrices C_r and S_r, we approximate the Gram matrix of (I − U_k U_k^T) D using the two row-sampled matrices D̃_r of (6) and Ũ_k, where Ũ_k is generated through the QR factorization of the row-sampled matrix S_r C_r U_k, i.e., Ũ_k R̃ := QR(S_r C_r U_k). Then, we generate the right-projection subspace Q through power iterations on the approximate normal equation without explicitly forming the Gram matrix:

Alg. 4. Row-sampling for Adding Columns (Smp1):
1. Sample and scale rows of the matrices D and U_k,
   D̃_r := S_r C_r D and Ũ_k R̃ := QR(S_r C_r U_k)
2. Generate the right projection subspace Q by power iterating with the Gram matrix of (I − Ũ_k Ũ_k^T) D̃_r
   2.1. Generate ℓ Gaussian random vectors Q
   2.2. Compute QR factorization, Q R := QR(Q)
   for j = 1, 2, . . . , s − 1 do
     2.3. Approximate the matrix range, Q := D̃_r^T (I − Ũ_k Ũ_k^T) D̃_r Q
     2.4. Compute QR factorization, Q R := QR(Q)
   end for
3. Generate the left projection subspace,
   P := (I − U_k U_k^T) D Q
   P R := QR(P)
4. Generate the projected matrix B of (5).

After the power iteration, the left-projection subspace P is computed through SpMM with the original matrix D (Step 3 of Alg. 4). To generate B of Rnd1, we need one more SpMM to compute D^T [U_k P].
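Putting the pieces together, the following NumPy sketch is in the spirit of Smp1 (Alg. 4): the power iteration touches only the sampled rows, and the full matrix D is used once at the end to form the left subspace P. The arguments D_r and U_r are assumed to be the sampled-and-scaled rows produced by a routine such as the one sketched after Alg. 3; all names are ours.

```python
import numpy as np

def smp1_subspaces(U_k, D, D_r, U_r, ell, s, rng=None):
    """Row-sampling framework Smp1 (in the spirit of Alg. 4).

    U_k : m x k current left singular vectors;  D : m x d new columns;
    D_r : c x d sampled-and-scaled rows of D;   U_r : c x k sampled-and-scaled rows of U_k.
    Returns the left/right projection subspaces P (m x ell) and Q (d x ell).
    """
    rng = np.random.default_rng() if rng is None else rng
    U_t, _ = np.linalg.qr(U_r)                        # U_tilde_k R_tilde := QR(S_r C_r U_k)
    d = D.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, ell)))   # steps 2.1-2.2
    for _ in range(s - 1):                                # steps 2.3-2.4
        W = D_r @ Q
        W = W - U_t @ (U_t.T @ W)                     # (I - U_tilde_k U_tilde_k^T) D_tilde_r Q
        Q, _ = np.linalg.qr(D_r.T @ W)                # Gram-matrix power iteration
    W = D @ Q                                         # step 3: the only pass over the full D
    P, _ = np.linalg.qr(W - U_k @ (U_k.T @ W))        # P R := QR((I - U_k U_k^T) D Q)
    return P, Q
```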
[Fig. 1: illustration of the two sampling schemes, sketching the approximations D̃_r^T D̃_r Q ≈ Q, D̃_r^T P ≈ Q, and D̃_c Q ≈ P used by Smp1 and Smp2.]

Fig. 2. Computational and data access costs of the randomized and sampling algorithms.

                                  Rnd1             Rnd2       Smp1                Smp2
  Matrix operation with D
    # of Sp/GEMM                  s                s          (s − 1)τ + 1        (s − 1)τ
    # of Sp/GEMM^T                s                s − 1      (s − 1)τ            (s − 1)τ + 1
  Dense computation (flop count)
    Orth                          ℓ²ms             ℓ²ms       ℓ²((s − 1)τ + 1)m   ℓ²((s − 1)τ + 1)m
    SVD(B)                        (k + d)(k + ℓ)²  (k + ℓ)³   (k + ℓ)³            (k + ℓ)³
On the other hand, if we store the result of DQ at Step 3, then we can generate B of Rnd2 without any additional SpMM. We refer to Alg. 4 as Smp1, and use Smp1,1 and Smp1,2 to distinguish Smp1 with the projection schemes of Rnd1 and Rnd2, respectively.

B. Row-column Sampling

Our second sampling framework approximates the results of SpMM with D and D^T using D̃_c and D̃_r, which sample the columns and rows of D, respectively. Though we do not have the leverage scores for the columns of D, we can still, for example, sample the columns based on their norms (requiring a data pass over D) or using the uniform distribution p.

Alg. 5. Row-column Sampling for Adding Columns (Smp2):
1. Sample and scale the rows of the matrices D and U_k,
   D̃_r := S_r C_r D and Ũ_k R̃ := QR(S_r C_r U_k)
2. Sample and scale the columns of the matrix D,
   D̃_c := D C_c S_c
3. Generate the projection subspaces P and Q using the randomized algorithm on (I − Ũ_k Ũ_k^T) D̃
   3.1. Generate ℓ Gaussian random vectors Q
   3.2. Compute QR factorization, Q R := QR(Q)
   for j = 1, 2, . . . , s − 1 do
     3.3. Approximate the matrix range, P := (I − U_k U_k^T) D̃_c (S_c C_c Q)
     3.4. Compute QR factorization of P, P R := QR(P)
     if j < s then
       3.5. Approximate the matrix domain, Q := D̃_r^T (I − Ũ_k Ũ_k^T)(S_r C_r P)
       3.6. Compute QR factorization of Q, Q R := QR(Q)
     end if
   end for
4. Generate the projected matrix B of (5).

Since this approach uses the power iteration to generate both the right- and left-projection subspaces P and Q, it does not require the extra SpMM that is needed by Smp1 to generate P. However, we still need to perform SpMM with D to generate the projected matrix B. It also has the additional cost of orthogonalizing P during the power iteration. We refer to this as Smp2, and use Smp2,1 and Smp2,2 to refer to Smp2 with the projection schemes of Rnd1 and Rnd2, respectively.

Fig. 1 illustrates these two sampling schemes, and Fig. 2 lists their computational and data access costs.
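For comparison, a compact NumPy sketch in the spirit of Smp2 (Alg. 5) is shown below; the power iteration alternates between the column-sampled matrix D̃_c (to approximate the range) and the row-sampled matrix D̃_r (to approximate the domain). Here cols/rows are the sampled index sets and col_scale/row_scale the diagonals of S_c and S_r, so that D̃_c (S_c C_c Q) becomes D_c @ (col_scale[:, None] * Q[cols, :]). The interface and names are ours, and s ≥ 2 is assumed, as in Alg. 5.

```python
import numpy as np

def smp2_subspaces(U_k, D_c, cols, col_scale, D_r, rows, row_scale, U_r, ell, s, rng=None):
    """Row-column sampling framework Smp2 (in the spirit of Alg. 5).

    D_c : m x cc sampled-and-scaled columns of D (indices `cols`, scaling `col_scale`);
    D_r : c x d sampled-and-scaled rows of D (indices `rows`, scaling `row_scale`);
    U_r : c x k sampled-and-scaled rows of U_k.  Returns P (m x ell) and Q (d x ell).
    """
    rng = np.random.default_rng() if rng is None else rng
    U_t, _ = np.linalg.qr(U_r)                        # U_tilde_k
    d = D_r.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, ell)))
    P = None
    for _ in range(s - 1):                            # Alg. 5 iterates j = 1, ..., s-1
        W = D_c @ (col_scale[:, None] * Q[cols, :])   # D_tilde_c (S_c C_c Q)
        P, _ = np.linalg.qr(W - U_k @ (U_k.T @ W))    # steps 3.3-3.4
        W = row_scale[:, None] * P[rows, :]           # S_r C_r P
        W = W - U_t @ (U_t.T @ W)
        Q, _ = np.linalg.qr(D_r.T @ W)                # steps 3.5-3.6
    return P, Q
```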
V. RANDOMIZATION TO ADD ROWS

In many applications, we have a good understanding of the data, including how much data should be sampled. However, in some cases, we may fail to sample enough rows for the updated SVD to satisfy the desired accuracy. In such cases, we could discard the updated SVD, increase the sampling size, and recompute the updated SVD using the new set of sampled rows. Instead, in this section, we look at updating the already-updated SVD using the additional sampled rows. Such a scheme is also attractive when all the sampled rows do not fit in the memory at once, because it allows an incremental update of the SVD using only a subset of the sampled rows at a time, which fits in the memory.

Assuming that we have updated the SVD using Smp1, we use the randomized algorithm to update the projection basis vectors P and Q by adding more sampled rows to Ũ_k and D̃_r. The algorithm is based on the power iteration on the Gram matrix, as shown below, where D̄_r and Ū_k represent the new set of sampled rows, while P̄ and Q̄ are the new set of basis vectors to be generated.

Alg. 6. Randomized Algorithm for Adding Sampled Rows:
1. Sample more rows and generate D̄_r and Ū_k
2. Generate ℓ Gaussian random vectors P̄
3. Compute QR factorization of P̄, P̄ R̄ := QR(P̄)
for j = 1, 2, . . . , s do
  4. Approximate the matrix range,
     Q̄ := D̄_r^T (I − Ū_k Ū_k^T) P̄
     Q̄ := (I − Q Q^T) Q̄
  5. Compute QR factorization, Q̄ R̄ := QR(Q̄), if requested
  if j < s then
    6. Approximate the matrix domain, P̄ := (I − Ū_k Ū_k^T) D̄_r Q̄
    7. Compute QR factorization, P̄ R̄ := QR(P̄)
  end if
end for

Then, Q is updated by compressing the new projection subspace [Q Q̄], where the projected matrix B is given by

  B = [      I       0 ]
      [ P̄^T D̄_r Q   R̄ ]

since P̄ and Q̄ are orthogonal to Ū_k and Q, respectively. The main motivation of the above algorithm is to avoid accessing the previously sampled rows D̃_r that have been already compressed into the low-rank representation P and Q.
In many applications, we have a good understanding of the and requires a different approximation accuracy. In this sec-
data including how much data should be sampled. However, tion, we examine a few test cases to study the effectiveness of
in some cases, we may fail to sample enough rows for the the sampling and randomized algorithms to update the trun-
updated SVD to satisfy the desired accuracy. In such cases, cated SVD. To this end, we focus on a powerful data analysis
we could discard the updated SVD, increase the sampling size, tool, the principal component analysis (PCA) [13]. In PCA,
Smp1,1 Smp1,2 Smp2,1 Smp2,2
0.66 0.66 0.66 0.66
Smp (u, 0.4) Smp (u, 0.5) Smp (u, 0.6) Smp (u, 0.6)
1,1 1,2 2,1 2,2
0.64 Smp (u, 0.3) 0.64 Smp (u, 0.4) 0.64 Smp (u, 0.5) 0.64 Smp (u, 0.5)
1,1 1,2 2,1 2,2
0.62 Smp1,1(u, 0.2) 0.62 Smp1,2(u, 0.3) 0.62 Smp2,1(u, 0.4) 0.62 Smp2,2(u, 0.4)
0.6 Smp1,1(u, 0.1) 0.6 Smp1,2(u, 0.2) 0.6 Smp2,1(u, 0.3) 0.6 Smp2,2(u, 0.3)
0.58 Smp1,1(l, 0.4) 0.58 Smp1,2(l, 0.5) 0.58 Smp2,1(l, 0.6) 0.58 Smp2,2(l, 0.6)
0.56 Smp1,1(l, 0.3) 0.56 Smp1,2(l, 0.4) 0.56 Smp2,1(l, 0.5) 0.56 Smp2,2(l, 0.5)
Average precision
Average precision
Average precision
Average precision
Smp1,1(l, 0.2) Smp1,2(l, 0.3) Smp2,1(l, 0.4) Smp2,2(l, 0.4)
0.54 0.54 0.54 0.54
Smp1,1(l, 0.1) Smp1,2(l, 0.2) Smp2,1(l, 0.3) Smp2,2(l, 0.3)
0.52 0.52 0.52 0.52
0.5 0.5 0.5 0.5
0.48 0.48 0.48 0.48
0.46 0.46 0.46 0.46
0.44 0.44 0.44 0.44
0.42 0.42 0.42 0.42
0.4 0.4 0.4 0.4
0.38 0.38 0.38 0.38
0 50 100 150 200 250 300 350 400 450 500 0 50 100 150 200 250 300 350 400 450 500 0 50 100 150 200 250 300 350 400 450 500 0 50 100 150 200 250 300 350 400 450 500
Number of documents added (d) Number of documents added (d) Number of documents added (d) Number of documents added (d)
Fig. 3. Average 11-point interpolated precision with different sampling rate for medline, where the first argument of (u, τ ) or (`, τ ) in the legend indicates
c
that the uniform probabilistic distribution or the leverage score is used, respectively, and τ is the sample rate, m , while we fixed n = 533 and s = 3
(m = 5735). The line shows the mean precisions of ten runs, while the markers above and below the line show the highest and lowest precisions, respectively.
[Fig. 4: two panels; y-axis: average precision; x-axis: number of documents added (d).]
Fig. 4. Average 11-point interpolated precision for medline with Rnd(s) and Smp(ℓ or u, τ, s), where the first argument ℓ or u specifies whether the leverage score or the uniform distribution is used to sample the rows.

In PCA, multidimensional data is projected onto a low-dimensional subspace given by the truncated SVD such that related items are close to each other in the low-dimensional subspace. Here, we examine three particular applications of PCA: LSI, data clustering or classification, and image processing. The matrices used for LSI and classification are sparse, while the matrices used for the clustering and image processing are dense. The results with the randomized and sampling algorithms are the mean of ten runs.

A. Sampling to Add Columns

We first investigate the required sampling rate (i.e., how much data needs to be sampled) to maintain the quality of the updated SVD. We also compare the effectiveness of different sampling schemes.

1) Latent Semantic Indexing: For text mining [14], Latent Semantic Indexing (LSI) [2] is an effective information retrieval tool since it can resolve the ambiguity due to the [...] Text to Matrix Generator (TMG) with the TREC dataset¹. We then preprocessed the matrices using the lxn.bpx weighting scheme [16]. To compare the different combinations of sampling and projection schemes, Fig. 3 shows the average 11-point interpolated precisions [16] of the updated SVD for the medline dataset. Specifically, the figure shows the average precision when the randomized sampling was used to add d new documents to the rank-30 approximation of the first 533 documents. The first sampling scheme Smp1 obtained higher precisions than the second scheme Smp2, while the first projection scheme was slightly more effective, obtaining higher precisions than the second scheme (e.g., Smp1,1 was more effective than Smp1,2). Then, Fig. 4 compares the results with the three previous algorithms: 1) recomputing the SVD, 2) the updating algorithm (adding increments of 500 documents at a time), and 3) the randomized algorithm without sampling. Overall, only about 20 ∼ 50% of the new data was needed to obtain precisions that were equivalent to those obtained using the previous algorithms. We have observed similar results for the other datasets like cranfield.

2) Data Clustering and Classification: PCA has been successfully used to extract the underlying genetic structure of human populations [17], [18], [19]. To study the potential of the sampling algorithm, we used it to update the SVD when a new population is incrementally added to the dataset from the HapMap project². We randomly filled in the missing data with either −1, 0, or 1 with the probabilities based on the available information for the SNP. Fig. 5(a) shows the correlation coefficient of the resulting population cluster, which is computed using MATLAB's k-means algorithm [...]

1 http://scgroup20.ceid.upatras.gr:8000/tmg, http://ir.dcs.gla.ac.uk/resources
2 http://hapmap.ncbi.nlm.nih.gov
Fig. 5:

              JPT+MEX   +ASW   +GIH   +CEU   +LWK   +CHB
  Recompute      1.00    1.00   1.00   0.99   0.76   0.72
  No update      1.00    0.81   0.59   0.67   0.56   0.47
  Inc-Update     1.00    1.00   1.00   0.98   0.76   0.75
  Rnd2           1.00    1.00   1.00   0.99   0.76   0.73
  Smp1,2 (u)     1.00    1.00   1.00   0.99   0.76   0.60
  Smp1,2 (ℓ)     1.00    1.00   1.00   0.99   0.76   0.71
(a) Population clustering where 83 African ancestry in south west USA (ASW), 88 Gujarati Indian in Houston (GIH), 165 European ancestry in Utah (CEU), 90 Luhya in Webuye, Kenya (LWK), and 84 Han Chinese in Beijing (CHB) were incrementally added to the 116,565-SNP matrix of 86 Japanese in Tokyo and 77 Mexican ancestry in Los Angeles, USA (JPT and MEX). We used the fixed parameters (s = 3, τ = 0.9%).

              crude+interest   +money-fx   +trade   +ship   +grain
  Recompute         1.00          0.74      0.74     0.61    0.59
  No update         1.00          0.62      0.42     0.44    0.39
  Inc-Update        1.00          0.75      0.74     0.60    0.58
  Rnd2              1.00          0.75      0.75     0.62    0.60
  Smp1,2 (u)        1.00          0.76      0.76     0.63    0.60
  Smp1,2 (ℓ)        1.00          0.75      0.75     0.62    0.60
(b) Document classification where there are 253, 190, 206, 251, 108, and 41 documents of the crude, interest, money-fx, trade, ship, and grain categories, with 19,368 terms. We used the fixed parameters (s = 3, τ = 25%).

[Fig. 7: two panels; y-axis: average precision; x-axes: number of documents added / total number of documents; legends: Smp(10%), Smp(20%), Smp(30%), Rnd1(20%), Rnd1(30%), Rnd2(20%), Rnd2(30%).]
Fig. 7. Average 11-point interpolated precision for medline when adding sampled rows with (n, k) = (233, 50).
[Fig. 8: bar chart of execution time in seconds for the different SpMM implementations.]
Fig. 8. Performance of Smp1,2 (τ = .5, s = 3) using different SpMM implementations with D^T: ('N', D^T) explicitly stores the matrix D^T, ('T', D) transposes D stored in the CSR format, and in-place performs SpMM with D̃_r using D stored in the CSR format (n = 5,000, d = 5,000).
[...] numerically stable in our experiments (the test matrices had [...]

Fig. 10. Time in seconds (average of five runs) for GPU and I/O data transfer (m = 25,000, n + d = 10,000, ℓ = 100).
(b) Non-CPU-resident:
                         n/100:    10      25      50      75     100
  SSD  READ (contig)   (s)        1.2     2.7     5.1     7.9    10.6
              (MB/s)            177.1   185.9   197.5   188.7   193.4
       READ (by row)   (s)        1.3     3.2     6.4     8.0    10.8
              (MB/s)            155.9   156.8   156.9   187.6   184.7
       READ (noncont)  (s)       14.0    13.0    10.9     9.1     9.2
              (MB/s)             14.3    38.5    91.9   165.5   218.1
  HD   READ (contig)   (s)       1.29    3.18    5.02    7.09    9.13
              (MB/s)            155.6   157.3   199.4   211.5   219.1
       READ (by-row)   (s)       1.28    3.20    5.62    6.90    10.2
              (MB/s)            155.7   156.4   177.9   217.4   196.8
       READ (noncont)  (s)       10.0    11.3    12.2    12.0    11.9
              (MB/s)             20.0    44.2    81.7   125.4   167.5
  GEMM('N', D)    (s)            0.02    0.05    0.10    0.18    0.23
         (Gflop/s)              232.1   260.4   248.5   272.4   275.4
  GEMM('T', D)    (s)            0.03    0.05    0.09    0.13    0.17
         (Gflop/s)              184.2   240.4   280.6   296.1   297.7
  GEMM('N', D^T)  (s)            0.03    0.06    0.10    0.17    0.20
         (Gflop/s)              185.6   216.6   252.4   226.1   250.0

Fig. 12. Performance with non-CPU-resident SpMM when adding 25% of columns, and assuming 25% of D fits in the main memory (i.e., n = 7,500, d = 2,500, and s = 3).
(a) netflix matrix with local HD / SSD:
             SpMM(D)          SpMM(D^T)        total              speedup
  Recomp     937.2 / 941.5    620.4 / 622.3    1558.1 / 1564.0     1.0 /  1.0
  Rnd2       231.2 / 232.4    154.3 / 155.0     385.7 /  387.6     4.0 /  4.0
  Smp1,2      97.1 /  96.9      0.1 /   0.1      97.2 /   97.1    16.0 / 16.1
(b) dense matrix with local HD / SSD:
             GEMM(D)          GEMM(D^T)        total              speedup
  Recomp     139.7 / 158.1    101.0 / 105.4     240.9 / 263.4      1.0 / 1.0
  Rnd2       146.6 / 169.9    110.0 / 116.9     257.0 / 286.8      0.9 / 0.9
  Smp1,2      38.6 /  61.4      0.1 /   0.1      38.7 /  61.5      6.2 / 4.3
(c) dense matrix, D is stored in a separate file, with local HD:
             GEMM(D)   GEMM(D^T)   total   speedup
  Rnd2         37.6       26.0      63.7     3.8×
  Smp1,2       24.2        0.1      24.3     9.9×