Abstract
Principal component analysis (PCA) is widely used for dimension reduction and embedding of real
data in social network analysis, information retrieval, and natural language processing, etc. In this
work we propose a fast randomized PCA algorithm for processing large sparse data. The algorithm
has similar accuracy to the basic randomized SVD (rPCA) algorithm (Halko et al., 2011), but is
largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy
for practical usage. Experiments on real data show that the proposed algorithm is up to 9.1X faster
than the basic rPCA algorithm without accuracy loss, and is up to 20X faster than the svds in
Matlab with little error. The algorithm computes the first 100 principal components of a large
information retrieval dataset with 12,869,521 persons and 323,899 keywords in less than 400 seconds
on a 24-core machine, while all conventional methods fail due to the out-of-memory issue.
Keywords: Principal Component Analysis; Singular Value Decomposition; Randomized Algorithms
1. Introduction
In machine learning applications, principal component analysis (PCA) is widely used for dimension
reduction of input data. It often serves as a preprocessing step for both supervised and unsupervised
learning methods. For problems in social network analysis, information retrieval, natural
language processing (NLP), and even recommender systems, where the input data matrix is usually
sparse, PCA or the equivalent truncated SVD is also widely applied. For example, the latent semantic
analysis (LSA) (Deerwester et al., 1990), which builds dense representations of documents in NLP,
includes PCA as a major step. However, applying PCA to real-world problems that require
processing large data accurately often incurs prohibitive computational time. Accelerating
PCA for large sparse data is therefore of great necessity.
The standard method for performing PCA is calculating truncated singular value decomposition
(SVD). For a sparse matrix, this is usually implemented with svds in Matlab (Lehoucq et al., 1998),
or lansvd in PROPACK (Larsen, 2004), which is an accelerated version of svds. However, if
the dimensions of the matrix are large and more than dozens of principal components/directions are
needed, these conventional methods induce large computational expense or even fail due to
excessive memory cost. An alternative is the randomized method for PCA, which has gained a lot of
attention in recent years. The main idea of randomized matrix methods is to use random projection
to identify the subspace capturing the dominant actions of a matrix (Halko et al., 2011; Yu et al.,
2018). Then, a near-optimal low-rank decomposition of the matrix can be computed, so that we can
further obtain an approximate PCA. A comprehensive presentation of the relevant techniques and
theories is in Halko et al. (2011). This randomized technique has been extended to compute PCA of
data sets that are too large to be stored in RAM (Yu et al., 2017), or to speed up the distributed PCA
(Liang et al., 2014). For general SVD or PCA computation, approaches based on this technique have also
been proposed (Voronin and Martinsson, 2015; Li et al., 2017). They outperform the conventional
techniques for calculating a few principal components. Recently, a compressed SVD (cSVD)
algorithm was proposed in Benjamin Erichson et al. (2017), which is based on a variant of the
method in Halko et al. (2011) but runs faster for image and video processing applications. Another
idea for computing PCA of large data is to perform eigenvalue decomposition on the product of the
data matrix's transpose and itself. However, this only has benefit when handling low-dimensional data
(fewer than several thousand in dimension).
Although there is a lot of work on randomized PCA techniques, it mostly targets dense data.
Compared with the deterministic methods, these techniques involve the same or fewer floating-point
operations (flops), and are more efficient for large high-dimensional dense data because they exploit
modern computing architectures. However, for large sparse real-world data, these benefits may not
exist. Investigating the randomized PCA technique for large sparse data and comparing it with other
existing techniques are of great interest.
In this work, we first analyze the adaptability of some acceleration skills for the basic random-
ized PCA (rPCA) algorithm to sparse data, followed by theoretical proofs and computational cost
analysis. Then, we propose a modified power iteration scheme which allows an odd number of passes
over the data matrix and thus provides a more flexible trade-off between runtime and accuracy. We also
devise a technique to efficiently handle the data matrix with more columns than rows, which is ig-
nored in existing work. To wrap them up, we propose a fast randomized PCA algorithm for sparse
data (frPCA) and its variant algorithm frPCAt, suitable for the data matrices with more columns
and more rows, respectively. Theoretical analysis is performed to reveal how the efficiency of the
proposed algorithms varies with the sparsity of data, the power iteration parameter, and the number
of principal components wanted. In the section of experimental results, we first validate the
accuracy and efficiency of the proposed algorithms with some synthetic data. The results show that
the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm and 20X faster than svds, with negligible
loss of accuracy. Then, real large data in social network, information retrieval and recommender
system problems are tested. The results show the proposed algorithm is up to 8.7X faster than the
basic rPCA. And, it successfully handles the largest case in less than 400 seconds and with 23 GB of
memory, while svds fails due to the out-of-memory issue (requesting more than 32 GB of memory).
For reproducibility, the codes and test data in this work will be shared on GitHub (https:
//github.com/XuFengthucs/frPCA_sparse).
2. Preliminaries
In the algorithm descriptions, Matlab conventions are used for specifying row/column indices of a
matrix and some operations on sparse matrices.
The singular value decomposition (SVD) of a matrix A ∈ Rm×n is

A = UΣV^T,   (1)
where U = [u1 , u2 , · · · ] and V = [v1 , v2 , · · · ] are orthogonal matrices which represent the left and
right singular vectors, respectively. The diagonal matrix Σ contains the singular values (σ1 , σ2 , · · · )
of A in descending order. Suppose that Uk and Vk are matrices with the first k columns of U and
V respectively, and Σk is the diagonal matrix containing the first k singular values of A. Then, the
truncated SVD of A can be represented as:
A ≈ A_k = U_k Σ_k V_k^T.   (2)
Notice that A_k is the best rank-k approximation of the initial matrix A in either spectral norm or
Frobenius norm (Eckart and Young, 1936).
The approximation properties of SVD explain the equivalence between the truncated SVD and
PCA. Suppose each row of matrix A is an observed data. The matrix is assumed to be centered, i.e.,
the mean of all rows is a zero row vector. Then, the leading left singular vectors ui are the principal
components. Particularly, u1 is the first principal component.
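This equivalence can be checked numerically with a tiny NumPy sketch (toy data, not an example from the paper): after centering the rows of A, the leading left singular vectors scaled by the singular values coincide with the projections of the data onto the principal directions.

import numpy as np

# Toy check of the SVD/PCA equivalence stated above.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 20))      # 500 observations (rows), 20 features
Ac = A - A.mean(axis=0)                 # center: the mean of all rows becomes a zero row vector
U, S, Vt = np.linalg.svd(Ac, full_matrices=False)
# The first principal component u1 satisfies  Ac v1 = sigma_1 u1.
assert np.allclose(Ac @ Vt[0], S[0] * U[:, 0])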
The built-in function svds in Matlab is a common choice to compute truncated SVD. It is based
on a Krylov subspace iterative method and is especially efficient for sparse matrix. For a dense
matrix A ∈ Rm×n , svds costs O(mnk) flops for computing rank-k truncated SVD. The cost be-
comes O(nnz(A)k) flops when A is sparse, where nnz(·) means the number of nonzero elements.
lansvd in PROPACK (Larsen, 2004) is also an efficient program, written in Matlab/Fortran, for
computing the dominant singular values/vectors of a sparse matrix. lansvd can cost two to three
times less CPU time than svds. However, there is no parallel version of lansvd, so that its actual
runtime on a modern computer is often longer than that of svds.
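For reference, the same kind of conventional baseline can be reproduced with the svds routine in scipy.sparse.linalg, the counterpart of Matlab's svds; the matrix and k below are arbitrary illustrative choices, not data from the paper.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

A = sp.random(10000, 2000, density=1e-3, format='csr', random_state=0)
k = 50
U, S, Vt = svds(A, k=k)                 # Krylov-subspace solver, roughly O(nnz(A) k) flops
order = np.argsort(S)[::-1]             # svds does not guarantee a descending order
U, S, Vt = U[:, order], S[order], Vt[order]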
Performing the power iteration (PI) scheme, i.e., performing the randomized QB procedure on
(AA^T)^p A instead of A, can achieve better accuracy. The orthonormalization operation "orth(·)" is used to
alleviate the round-off error in the floating-point computation. It can be implemented with a call to
a packaged QR factorization (e.g., qr(X, 0) in Matlab).
The basic rPCA algorithm with the PI scheme has an accuracy guarantee (Halko et al., 2011;
Musco and Musco, 2015). Its flop count is:

FC1 = pCqr nl^2 + (p + 1)Cqr ml^2 + (2p + 2)Cmul nnz(A)l + Cmul mlk + Csvd nl^2.   (4)
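Since the listing of Alg. 1 is not reproduced here, the following minimal NumPy sketch illustrates the basic randomized PCA with the PI scheme in the spirit of Halko et al. (2011); details such as where the orthonormalization is applied may differ from the paper's Alg. 1.

import numpy as np

def basic_rpca(A, k, p=1, s=5):
    # Randomized truncated SVD/PCA with power iteration; l = k + s is the sketch size.
    n = A.shape[1]
    l = k + s
    Omega = np.random.randn(n, l)              # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)             # Q = orth(A * Omega)
    for _ in range(p):                         # power iteration on (A A^T)^p A
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = (A.T @ Q).T                            # B = Q^T A (formed without dense-times-sparse)
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], S[:k], Vt[:k, :].T        # U (m x k), S (k), V (n x k)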
3. Methodology
3.1. The Ideas for Acceleration
Because many real data sets can be regarded as sparse matrices, accelerating the basic rPCA algorithm
for sparse matrices is our focus. In Alg. 1, the matrix multiplications in Steps 2 and 7 occupy the
majority of the computing time if A is dense. However, this is not true for a sparse matrix, and therefore
optimizing the other steps will bring substantial acceleration.
In existing work, some ideas have been proposed to accelerate the basic rPCA algorithm. In
Voronin and Martinsson (2015), the idea of using eigendecomposition to compute SVD in Step 8 of
Alg. 1 was proposed. It was pointed out that in the power iteration, orthonormalization after each
matrix multiplication is not necessary. In Li et al. (2017), the power iteration was accelerated by
replacing QR factorization with LU factorization, and the Gaussian matrix was replaced with a
random matrix with uniform distribution. In Benjamin Erichson et al. (2017), the rPCA algorithm
without power iteration was discussed for dense matrices in image and video processing problems. It
employs a variant of the basic rPCA algorithm, where the random matrix is multiplied to the left of
A. The algorithm was accelerated by using sparse random matrices and using eigendecomposition
to obtain the orthonormal basis of the subspace.
For handling sparse A, we just use the Gaussian matrix for Ω, because other random matrices may cause
AΩ to be rank-deficient. The useful ideas for a faster randomized PCA for sparse matrices are:
• Use the eigendecomposition for computing economic SVD of B,
• Replace the orthonormal Q with the left singular vector matrix U,
• Perform LU factorization in the power iteration,
• Perform orthonormalization after every other matrix-matrix multiplication in power iteration.
Firstly, we formulate the eigendecomposition-based SVD as an eigSVD algorithm (described in
Alg. 2), where "eig(·)" computes the eigendecomposition and "spdiags(·)" is used for constructing
a sparse diagonal matrix. In Alg. 2, the "diag(·)" in Step 3 is the function that transforms a diagonal
matrix to a vector. Step 4 constructs a sparse diagonal matrix Ŝ = diag(S)^{-1}, where "./" is the
element-wise division operator. The eigSVD algorithm's correctness is given as Lemma 1.
Algorithm 2 eigSVD
Input: A ∈ Rm×n (m ≥ n)
Output: U ∈ Rm×n , S ∈ Rn , V ∈ Rn×n
1: B = AT A
2: [V, D] = eig(B)
3: S = sqrt(diag(D))
4: Ŝ = spdiags(1./S, 0, n, n)
5: U = AVŜ
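A minimal NumPy sketch of Alg. 2 (the function name eig_svd is ours), assuming the input is a dense array with full column rank, which is the typical case for the tall, thin dense blocks appearing inside the algorithms below:

import numpy as np

def eig_svd(A):
    # Economic SVD of A (m x n, m >= n, full column rank) via the small Gram matrix.
    B = A.T @ A                    # Step 1: n x n Gram matrix
    d, V = np.linalg.eigh(B)       # Step 2: eigenpairs, eigenvalues in ascending order
    S = np.sqrt(d)                 # Step 3: singular values (ascending)
    U = (A @ V) / S                # Steps 4-5: U = A V diag(1/S)
    return U, S, V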
Lemma 1 The matrix U, S and V produced by Alg. 2 form the economic SVD of matrix A.
Proof Suppose the SVD of A is

A = UΣV^T = U(:, 1:n)Σ̂V^T,   (5)

where Σ̂, a square diagonal matrix, is the first n rows of Σ. Eq. (5) is the economic SVD of A.
Then Step 1 computes
B = A^T A = VΣ̂^2 V^T.   (6)
The right-hand side of (6) is the eigendecomposition of B. This means that in Step 2, D = Σ̂^2 and V is
the right singular vector matrix of A. Therefore, the values of S in Step 3 are the diagonal elements
of Σ̂ and the Ŝ in Step 4 equals Σ̂^{-1}. In Step 5, U = AVŜ = AVΣ̂^{-1} = U(:, 1:n). The last
equality is derived from (5). This proves the lemma.
We assume that performing the eigendecomposition of an n × n matrix costs Ceig n^3 flops. Notice that
the eigSVD algorithm is especially efficient if m ≫ n, because B becomes a small n × n matrix. Also,
the computed singular values in S are in ascending order. Numerical issues can arise if matrix A
does not have full column rank. So, the eigSVD algorithm is only applicable to special situations.
Notice that “eig(·)” in Step 2 of Alg. 2 can be replaced with “eigs(·)” to compute the largest
k eigenvalues/eigenvectors, so that the algorithm can produce the results of truncated SVD. This
results in an eigSVDs algorithm (see Algorithm 3), which can also be used to compute PCA.
Algorithm 3 eigSVDs
Input: A ∈ Rm×n (m ≥ n), k
Output: U ∈ Rm×k, S ∈ Rk, V ∈ Rn×k
1: B = AT A
2: [V, D] = eigs(B, k)
3: S = sqrt(diag(D))
4: Ŝ = spdiags(1./S, 0, k, k)
5: U = AVŜ
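A corresponding sketch of Alg. 3 with scipy.sparse.linalg.eigsh computing the k largest eigenpairs of A^T A (again, the function name is ours and the input is assumed to have sufficient numerical rank):

import numpy as np
from scipy.sparse.linalg import eigsh

def eig_svds(A, k):
    # Truncated SVD from the k largest eigenpairs of the Gram matrix A^T A.
    B = A.T @ A                        # n x n, stays sparse if A is sparse
    d, V = eigsh(B, k=k, which='LM')   # k largest eigenvalues/eigenvectors
    S = np.sqrt(np.maximum(d, 0.0))    # clip tiny negative values caused by round-off
    U = (A @ V) / S                    # left singular vectors
    return U, S, V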
Secondly, the idea that the orthonormal Q can be replaced with the left singular matrix U can
be explained with Lemma 2.
Lemma 2 In the basic rPCA algorithm, the orthonormal matrix Q contains a set of orthonormal basis
vectors of the subspace range(AΩ) or range((AA^T)^p AΩ). As long as Q holds this property, no matter how
it is produced, the result of the basic rPCA algorithm will not change.
Proof From Step 2 of Alg. 1 we see that Q is an orthonormal matrix, and its columns are a set of
orthonormal basis vectors of the subspace range(AΩ). If p > 0, from Steps 3∼6 we can see that the
orthonormal matrix Q contains a set of orthonormal basis vectors of the subspace range((AA^T)^p AΩ). The result
of the basic rPCA algorithm is actually QB = QQ^T A, which further equals USV^T. Notice that
QQ^T is an orthogonal projector onto the subspace range(Q), if Q is an orthonormal matrix. The
orthogonal projector is uniquely determined by the subspace (see (Golub and Van Loan, 1996) or
Section 8.2 of (Halko et al., 2011)), i.e., range(AΩ) or range((AA^T)^p AΩ). Therefore, as long
as Q contains a set of orthonormal basis vectors of the subspace, QQ^T is identical and the basic rPCA
algorithm's results will not change.
Both the QR factorization and the SVD of the same matrix produce an orthonormal basis of its range
space (column space), in Q and U, respectively. Therefore, with Lemma 2, we see that Q can be
replaced by the U from SVD in the basic rPCA algorithm.
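Lemma 2 can also be checked numerically on toy data: the orthogonal projector QQ^T depends only on the range, so a QR-based Q and an SVD-based U of the same matrix yield the same projector.

import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((200, 15))                 # e.g., Y = A * Omega
Q, _ = np.linalg.qr(Y)                             # orthonormal basis via QR
U, _, _ = np.linalg.svd(Y, full_matrices=False)    # orthonormal basis via SVD
assert np.allclose(Q @ Q.T, U @ U.T)               # identical projectors onto range(Y)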
Thirdly, LU factorization is used in the power iteration to replace QR factorization for saving runtime.
This does not affect the algorithm's correctness, as proved in Lemma 3.
Lemma 3 In the basic rPCA algorithm, the “orth(·)” operation in the power iteration, except the
last one, can be replaced by LU factorization. This does not affect the algorithm’s accuracy in exact
arithmetic.
Proof Firstly, if the "orth(·)" is not performed, the power iteration produces a Q containing a set of
basis vectors of the subspace range((AA^T)^p AΩ). As mentioned before, the "orth(·)" is just for alleviating
the round-off error, and after using it Q still represents range((AA^T)^p AΩ).
The pivoted LU factorization of a matrix M is:

PM = LU,   (8)
where P is a permutation matrix, and L and U are lower triangular and upper triangular matrices,
respectively. Obviously, M = (P^T L)U, where P^T L has the same column space as M. Therefore,
replacing "orth(·)" with LU factorization (using P^T L) also produces a basis of range((AA^T)^p AΩ).
Then, based on Lemma 2, this does not affect the algorithm's result in exact arithmetic.
Notice that the LU factor P^T L has scaled matrix entries and linearly independent columns,
since L is a lower triangular matrix with unit diagonal and P just means a row permutation. Therefore,
it also alleviates the round-off error.
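The same kind of toy check applies to Lemma 3: the LU factor P^T L spans the same column space as M, so orthonormalizing either one gives the same projector.

import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(2)
M = rng.standard_normal((300, 20))
PL, _ = lu(M, permute_l=True)              # PL = P^T L, i.e., L with its rows permuted back
Q1, _ = np.linalg.qr(M)
Q2, _ = np.linalg.qr(PL)
assert np.allclose(Q1 @ Q1.T, Q2 @ Q2.T)   # same subspace, same orthogonal projector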
Finally, the orthonormalization or LU factorization in the power iteration can be performed after
every other matrix-matrix multiplication. This hardly harms the accuracy but remarkably reduces the runtime.
3.2. A Modified Power Iteration Scheme and Handling Matrix with More Columns
For a sparse matrix A, the power iteration in the basic rPCA algorithm (Alg. 1) is computationally
expensive, because it includes the multiplication of two dense matrices. We also notice that each
time we increase the power parameter by one, two more matrix multiplications are induced, resulting
in a large increase of the computation cost. This makes the trade-off between runtime and
accuracy inconvenient. To alleviate this issue, we propose a modified power iteration scheme, which allows
an odd number of passes over A and thus provides a more convenient performance trade-off for the rPCA
algorithm.
We first observe that, if the power parameter p > 0, Steps 1 and 2 of Alg. 1 can be simply
replaced with:
1: Q = Ω = randn(m, k + s)
For the same power parameter p, this reduces one pass over matrix A. Because the singular values
of (AA^T)^p decay more quickly than those of (AA^T)^{p−1}A, performing the randomized QB procedure on
(AA^T)^p is more accurate than on (AA^T)^{p−1}A. It means:
||A − Q̂Q̂^T A|| < ||A − QQ^T A||,   (9)
where Q̂ is the orthonormal matrix produced from (AA^T)^p Ω while Q is produced from (AA^T)^{p−1}AΩ.
This justifies the modification. Therefore, we can modify Steps 1 and 2 of Alg. 1 in this way,
without any other change, to realize an odd number of passes over A.
Another modification of Alg. 1 is motivated by the observation that the flop count of Alg. 1, i.e., (4),
is not favorable to the case with m < n, because Csvd is much larger than the other constants
(Cqr and Cmul). Although we may run the algorithm on A^T, the transpose of a sparse matrix
is not easily obtained due to the storage format of sparse matrices.
Actually, there is a variant of the basic rPCA algorithm (Benjamin Erichson et al., 2017; Li
et al., 2017), where the random matrix is multiplied to the left of A. With the same idea, we derive
an algorithm called basic rPCAt described as Alg. 4. Its flop count is:
FC4 = pCqr ml^2 + (p + 1)Cqr nl^2 + (2p + 2)Cmul nnz(A)l + Cmul nlk + Csvd ml^2.   (10)
Therefore, we can derive that when Alg. 1 and Alg. 4 handle a sparse matrix A ∈ Rm×n with
m < n,
FC1 − FC4 = (Csvd − Cqr − Cmul)(n − m)l^2 > 0.   (11)
The reason is that Csvd is much larger than Cqr and Cmul . Eq. (11) shows Alg. 4 is more efficient
than Alg. 1 when handling the matrix with more columns. Thus, we shall choose between them
according to the matrix’s dimensions, so as to achieve the best runtime performance.
Theorem 1 The frPCA algorithm (Alg. 5) is mathematically equivalent to the basic rPCA algo-
rithm (Alg. 1) when p = (q − 2)/2.
Proof When p = (q − 2)/2, the number of power iterations is the same for both algorithms. One
difference between Alg. 5 and Alg. 1 is in the power iteration (the "for" loop). Based on Lemma
1 we see that eigSVD accurately produces a set of orthonormal basis vectors. Besides, based on Lemmas 2
and 3, we see that the power iteration in Alg. 5 is mathematically equivalent to that in Alg. 1. The other
difference is the last three steps in Alg. 5. Their correctness is due to Lemma 1 and the fact that the singular
values produced by eigSVD are in ascending order.
Algorithm 5 frPCA
Input: A ∈ Rm×n (m ≤ n), k, pass parameter q ≥ 2
Output: U ∈ Rm×k , S ∈ Rk , V ∈ Rn×k
1: if q is an even number then
2: Q = randn(n, k + s)
3: Q = AQ
4: if q > 2 then [Q, ∼] = lu(Q) else [Q, ∼, ∼] = eigSVD(Q)
5: else
6: Q = randn(m, k + s)
7: end if
8: for i = 1, 2, 3, · · · , ⌊(q − 1)/2⌋ do
9:     if i == ⌊(q − 1)/2⌋ then
10:        [Q, ∼, ∼] = eigSVD(A(A^T Q))
11:    else
12:        [Q, ∼] = lu(A(A^T Q))
13:    end if
14: end for
15: [Û, Ŝ, V̂] = eigSVD(A^T Q)
16: ind = k + s : −1 : s + 1
17: U = QV̂(:, ind), V = Û(:, ind), S = Ŝ(ind)
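The following compact sketch mirrors the listing of Alg. 5, reusing the eig_svd sketch given earlier (function names are ours, not the released C/MKL implementation); it assumes A is a scipy.sparse matrix (or a dense array) with m ≤ n and numerical rank at least k + s.

import numpy as np
from scipy.linalg import lu

def frpca(A, k, q=6, s=5):
    # A sketch of Alg. 5; q is the total number of passes over A (q >= 2).
    m, n = A.shape
    l = k + s
    if q % 2 == 0:
        Q = A @ np.random.randn(n, l)            # one pass over A
        if q > 2:
            Q, _ = lu(Q, permute_l=True)
        else:
            Q, _, _ = eig_svd(Q)
    else:
        Q = np.random.randn(m, l)                # odd q: start directly with a random Q
    iters = (q - 1) // 2
    for i in range(iters):                       # each iteration: two passes over A
        Y = A @ (A.T @ Q)
        if i == iters - 1:
            Q, _, _ = eig_svd(Y)                 # orthonormalize before the final projection
        else:
            Q, _ = lu(Y, permute_l=True)         # cheap LU instead of QR
    Uh, Sh, Vh = eig_svd(A.T @ Q)                # economic SVD of the n x l matrix
    ind = np.arange(l - 1, s - 1, -1)            # largest k values (eig_svd returns ascending order)
    return Q @ Vh[:, ind], Sh[ind], Uh[:, ind]   # U (m x k), S (k), V (n x k)

With q = 2 the loop is skipped and only two passes over A are made; each additional unit of q adds exactly one more pass, which is the flexibility discussed above.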
Below we analyze the flop count of Alg. 5. Suppose the multiplication of M ∈ R^{m×l} and
N ∈ R^{l×l} costs Cmul ml^2 flops, where Cmul reflects one addition and one multiplication. If LU
factorization is performed on M, it takes ml^2 − l^3/2 subtraction and multiplication operations, i.e.,
Clu(ml^2 − l^3/2) flops. If l ≪ m, the LU factorization costs a runtime similar to that of the matrix
multiplication. So, for the purpose of runtime comparison, we assume Cmul ml^2 ≈ Clu(ml^2 − l^3/2)
in the following analysis. Considering that l ≪ min(m, n), we derive the flop count of Alg. 5 for the
case where q is an even number:

FC5 = (q/2 − 1)Clu(ml^2 − l^3/2) + qCmul nnz(A)l + Cmul mlk + 2Cmul(m + n)l^2 + 2Ceig l^3
    ≈ (q/2 − 1)Cmul ml^2 + qCmul nnz(A)l + Cmul mlk + 2Cmul(m + n)l^2.   (12)
As we will see soon, Alg. 5 is more efficient for handling a matrix A with dimension m <
n. So, we also propose a variant fast rPCA algorithm (denoted by frPCAt) by applying the
acceleration techniques to Alg. 4. The resulting algorithm is described as Alg. 6.
Theorem 2 The variant fast rPCA algorithm (Alg. 6) is mathematically equivalent to the basic
rPCAt algorithm (Alg. 4) when p = (q − 2)/2.
Proof When p = (q − 2)/2, the number of power iterations is the same. The differences between
Alg. 6 and Alg. 4 are also in the power iteration and in the last three steps of Alg. 6. Based on Lemma
1 we see that eigSVD accurately produces a set of orthonormal basis vectors. Besides, based on Lemmas
2 and 3, we see that the power iteration in Alg. 6 is mathematically equivalent to that in Alg. 4. The
correctness of the last three steps is due to Lemma 1 and the fact that the singular values produced by
eigSVD are in ascending order.
Algorithm 6 frPCAt
Input: A ∈ Rm×n (m ≥ n), k, pass parameter q ≥ 2
Output: U ∈ Rm×k , S ∈ Rk , V ∈ Rn×k
1: if q is an even number then
2: Q = randn(k + s, m)
3: Q = (QA)^T
4: if q == 2 then [Q, ∼, ∼] = eigSVD(Q) else [Q, ∼] = lu(Q)
5: else
6: Q = randn(n, k + s)
7: end if
8: for i = 1, 2, 3, · · · , ⌊(q − 1)/2⌋ do
9:     if i == ⌊(q − 1)/2⌋ then
10:        [Q, ∼, ∼] = eigSVD(A^T(AQ))
11:    else
12:        [Q, ∼] = lu(A^T(AQ))
13:    end if
14: end for
15: [Û, Ŝ, V̂] = eigSVD(AQ)
16: ind = k + s : −1 : s + 1
17: U = Û(:, ind), V = QV̂(:, ind), S = Ŝ(ind)
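A matching sketch of Alg. 6 (again reusing the eig_svd sketch; names are ours). Purely for illustration, this sketch forms (ΩA)^T as A^T Ω^T, since with scipy's CSR/CSC formats the transpose is a cheap view; this sidesteps the storage concern about explicit sparse transposition discussed in Section 3.2.

import numpy as np
from scipy.linalg import lu

def frpcat(A, k, q=6, s=5):
    # A sketch of Alg. 6 for A with m >= n; q is the total number of passes over A (q >= 2).
    m, n = A.shape
    l = k + s
    if q % 2 == 0:
        Q = A.T @ np.random.randn(m, l)          # (Omega A)^T computed as A^T Omega^T; one pass over A
        if q == 2:
            Q, _, _ = eig_svd(Q)
        else:
            Q, _ = lu(Q, permute_l=True)
    else:
        Q = np.random.randn(n, l)
    iters = (q - 1) // 2
    for i in range(iters):
        Y = A.T @ (A @ Q)
        if i == iters - 1:
            Q, _, _ = eig_svd(Y)
        else:
            Q, _ = lu(Y, permute_l=True)
    Uh, Sh, Vh = eig_svd(A @ Q)                  # economic SVD of the m x l matrix
    ind = np.arange(l - 1, s - 1, -1)
    return Uh[:, ind], Sh[ind], Q @ Vh[:, ind]   # U (m x k), S (k), V (n x k)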
Similarly, we can analyze the flop count of the variant fast rPCA algorithm (Alg. 6):

FC6 = (q/2 − 1)Clu(nl^2 − l^3/2) + qCmul nnz(A)l + Cmul nlk + 2Cmul(m + n)l^2 + 2Ceig l^3
    ≈ (q/2 − 1)Cmul nl^2 + qCmul nnz(A)l + Cmul nlk + 2Cmul(m + n)l^2.   (13)
Now, the difference of the flop counts of Alg. 5 and Alg. 6 is

FC6 − FC5 = (q/2 − 1)Clu(n − m)l^2 + Cmul(n − m)lk < 0,   (14)
if they are performed on A ∈ Rm×n (m > n). It means Alg. 6 is more efficient for handling a matrix
with more rows. Accordingly, Alg. 5 is more efficient for handling a matrix with more columns.
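In practice, this shape-based choice can be wrapped in a tiny dispatcher around the two sketches above (the wrapper name is ours):

def fast_rpca(A, k, q=6, s=5):
    # Pick the variant suggested by the shape analysis above.
    m, n = A.shape
    return frpca(A, k, q, s) if m <= n else frpcat(A, k, q, s)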
To evaluate how the proposed fast PCA algorithm accelerates the basic rPCA algorithm, we give
the following analysis on the theoretical speedup based on flop counts. As Alg. 1 and Alg. 6 are
both efficient for the situation with m ≥ n, we analyze the ratio of their flop counts. According
to (4) and (13), the speedup ratio of the frPCAt algorithm to the basic rPCA algorithm is (assuming
that p = (q − 2)/2):
Sp1 = FC1/FC6
    ≈ [(q/2 − 1)Cqr nl^2 + (q/2)Cqr ml^2 + qCmul nnz(A)l + Cmul mlk + Csvd nl^2] / [(q/2 − 1)Cmul nl^2 + qCmul nnz(A)l + Cmul nlk + 2Cmul(m + n)l^2].   (15)
Denote t = nnz(A)/m as the average number of nonzeros per row, α = t/l as a sparsity parameter
related to the rank parameter l, and β = n/m as a matrix shape parameter (β ≤ 1). We further
derive:
Sp1 ≈ [(q/2 − 1)Cqr β + (q/2)Cqr + qCmul α + Cmul + Csvd β] / [(q/2 + 1)Cmul β + qCmul α + Cmul β + 2Cmul].   (16)
Based on this, we have the following theorem.
Theorem 3 The speedup ratio of the variant fast PCA algorithm (Alg. 6) to the basic rPCA algo-
rithm (Alg. 1), Sp1, depends on the number of passes over A (denoted by q), the ratio of average
number of nonzeros per row to the rank parameter l (denoted by α), and the number of columns
over the number of rows (denoted by β). Sp1 becomes higher as α decreases. And,
lim_{q→∞} Sp1 = (Cqr β + Cqr + 2Cmul α) / (Cmul β + 2Cmul α),   (17)
which approaches 2Cqr/Cmul for a very sparse square matrix A (α is small and β equals 1).
Here, Cqr and Cmul are the constants for the flop counts of QR factorization and matrix-matrix
multiplication respectively.
Proof Firstly, based on (15) and (16), one can show that the derivative of Sp1 with respect to α is
negative, so Sp1 becomes higher as α decreases. Then, letting q → ∞ in (16) results in (17). Finally,
if α → 0 and β = 1, the speedup ratio approaches 2Cqr/Cmul. This is the upper bound of the speedup
for a square or approximately square A, and it is certainly greater than 1.
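To see what the model (16)-(17) predicts numerically, one can plug in illustrative constants; the values of Cqr, Cmul and Csvd below are assumptions made for the sake of the example, not measurements from the paper.

Cqr, Cmul, Csvd = 2.0, 1.0, 20.0        # hypothetical flop-count constants

def sp1(q, alpha, beta):
    # Speedup model of Eq. (16).
    num = (q/2 - 1)*Cqr*beta + (q/2)*Cqr + q*Cmul*alpha + Cmul + Csvd*beta
    den = (q/2 + 1)*Cmul*beta + q*Cmul*alpha + Cmul*beta + 2*Cmul
    return num / den

print(sp1(q=10, alpha=0.1, beta=1.0))                         # a sparse, roughly square matrix
print((Cqr*1.0 + Cqr + 2*Cmul*0.1) / (Cmul*1.0 + 2*Cmul*0.1)) # the q -> infinity limit, Eq. (17)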
A similar theorem can be derived for Alg. 5. With the theorems, we see that the proposed fast
rPCA algorithm accelerates the basic rPCA algorithm without loss of accuracy. Besides, it allows
an odd number of passes over matrix A, providing a better trade-off between runtime and accuracy.
4. Experiments
All experiments are carried out on a Linux server with two 12-core Intel Xeon E5-2630 CPUs (2.30
GHz), and 32 GB RAM. The proposed algorithms Alg. 5 and Alg. 6 are implemented in C with
MKL libraries (Int, 2018) and OpenMP directives for multi-thread computing. QR factorization,
LU factorization and other basic linear algebra operations are realized through LAPACK routines
which are automatically executed in parallel on the multi-core CPUs. svds in Matlab 2016b is
used as the accurate truncated SVD. eigSVDs is another algorithm for comparison and is efficiently
implemented in Matlab 2016b. Because lansvd in Matlab/Fortran is not well parallelized, it runs
slower than svds in our experiments. And, considering that k ≪ min(m, n), the method of calculating
all the singular values/vectors by eigSVD and then truncating is not competitive in
runtime. Therefore, we do not include lansvd and eigSVD in the comparisons.
In the experiments, we choose Alg. 5 or Alg. 6 as the proposed fast algorithm according to the shape
of the test matrix. The oversampling parameter is always set to s = 5, and all runtimes are in seconds.
Table 1: The runtimes of different PCA algorithms for matrices with different sparsity.
Algorithm        | Matrix 1    | Matrix 2    | Matrix 3    | Matrix 4    | Matrix 5    | Matrix 6
                 | time   Sp2  | time   Sp2  | time   Sp2  | time   Sp2  | time   Sp2  | time   Sp2
svds             | 36.0   *    | 25.5   *    | 21.0   *    | 178.9  *    | 149.5  *    | 131.1  *
eigSVDs          | 459.2  0.1  | 104.6  0.2  | 37.2   0.6  | 278.7  0.6  | 156.2  1.1  | 75.7   1.7
Alg. 1 (p = 5)   | 13.1   2.8  | 10.0   2.5  | 9.76   2.2  | 99.9   1.8  | 90.7   1.6  | 84.5   1.6
Alg. 6 (q = 11)  | 4.32   8.3  | 1.58   16   | 1.05   20   | 17.2   10   | 13.8   11   | 10.2   13
Table 1 shows that the speedup ratio of Alg. 6 increases as nnz(A) decreases, whether compared
with svds or with Alg. 1. The proposed algorithm is up to 20X faster than svds and
9.1X faster than the basic rPCA algorithm (both achieved on Matrix 3). In Fig. 1, the curves of
eigSVDs are not shown, as they are indistinguishable from those of svds. From the figure, we see
that the randomized PCA algorithms are indistinguishable from svds for the first tens of singular
values. Alg. 6 is also indistinguishable from Alg. 1. This validates the effectiveness of the proposed
algorithm with an odd number of passes over A.
Fig. 2(a) shows the first principal component (i.e., u1) of Matrix 2 computed by svds and
Alg. 6, whose results look indistinguishable (only 1.4 × 10^{−10} difference in the l∞-norm). For the other
principal components, we calculate the correlation coefficient between the results obtained with both
methods. As shown in Fig. 2(b), the correlation coefficients are close to 1. The largest deviation
occurs for the 29th principal component, with value 0.9988.
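The comparison used here can be reproduced with a small helper (the function name is ours): for each component, take the absolute correlation coefficient between the corresponding columns returned by the two methods, which also removes the sign ambiguity of singular vectors (a detail the paper does not specify).

import numpy as np

def component_correlations(U1, U2):
    # Column-wise absolute correlation between two sets of principal components.
    r = min(U1.shape[1], U2.shape[1])
    return np.array([abs(np.corrcoef(U1[:, j], U2[:, j])[0, 1]) for j in range(r)])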
Secondly, we test the randomized algorithms with different q and p parameters: q = 2, 4, 6, 9, 11
and p = 0, 1, 2, 4, 5 are set for Alg. 6 and Alg. 1, respectively. The runtimes of both algorithms for
computing the first 100 principal components of Matrix 2 are listed in Table 2.
From the table, we see that the speedup ratio increases with q. At the same time, we plot Fig. 3
to show the curves of the computed singular values. From it we see that, with q or p increasing, the
singular values approach the accurate values. And, since the proposed algorithm allows an odd
number of passes over A, it has better flexibility.
[Figure 1: curves of the computed singular values σi obtained with svds, Alg. 1 (p = 5) and Alg. 6 (q = 11).]
Figure 2: The accuracy of Alg. 6 (q = 11) on the principal components of Matrix 2 (in comparison with
the results from svds). (a) The numeric values (sorted) of the first principal component. (b) The correlation
coefficients for the first 30 principal components.
If lower accuracy is allowed, the frPCAt algorithm runs much
faster. For example, with q = 4 it is actually 27X faster than svds.
Lastly, we construct matrices with different dimensions. We keep only the first
107,966 columns of Matrix 6 to obtain Matrix 7, and the first 107,966 rows of Matrix 6 to obtain Matrix
8. We test the different PCA algorithms on them. The results are listed in Table 3. From it we see that
Table 2: The runtimes of the basic rPCA algorithm and the proposed algorithm with different q values.
q=2 q=4 q=6 q=9 q = 11
Algorithm
time Sp1 time Sp1 time Sp1 time Sp1 time Sp1
Alg.1 (p = b(q − 1)/2c) 2.03 * 3.70 * 5.32 * 8.48 * 10.0 *
Alg.6 0.69 2.9 0.94 3.9 1.04 5.1 1.32 6.4 1.58 6.3
[Figure 3: curves of the first 100 computed singular values σi from Alg. 6 with q = 2, 4, 6, 9, 11 and from Alg. 1 with p = 0, 1, 2, 4, 5.]
eigSVDs (Alg. 3) is more efficient than svds when m is much larger than n. And, Alg. 5 runs
faster than Alg. 6 when m < n. This validates the analysis in Section 3.3.
Table 3: The runtimes of different PCA algorithms for matrices with different dimensions.
Algorithm        | Matrix 6 (647,789×323,896) | Matrix 7 (647,989×107,966) | Matrix 8 (107,966×323,896)
                 | time    Sp2                | time    Sp2                | time    Sp2
svds             | 131.1   *                  | 100.4   *                  | 51.3    *
eigSVDs          | 75.7    1.7                | 16.1    6.3                | 47.0    1.1
Alg. 1 (p = 5)   | 84.5    1.6                | 59.6    1.7                | 35.6    1.4
Alg. 5 (q = 11)  | 14.2    12                 | 7.19    14                 | 2.78    18
Alg. 6 (q = 11)  | 10.2    13                 | 5.91    18                 | 3.79    14
Table 4: The runtimes of different PCA algorithms for real large sparse matrices.
Sparse Data  | svds   | eigSVDs | Alg. 1 (p = 5) | Alg. 6 (q = 12)
             | time   | time    | time           | time    Sp1   Sp2
MovieLens    | 108.4  | 566.2   | 34.8           | 12.5    2.8   8.5
Aminer       | *      | *       | 1448.3         | 398.7   3.6   *
SNAP         | 22.7   | 124.4   | 14.8           | 1.74    8.7   13
[Figure 4: curves of the computed singular values σi of the real data, obtained with svds, Alg. 1 (p = 5) and Alg. 6 (q = 12).]
8.3 according to Eq. (19). They approximate the Sp1 values in Table 4, which validates the analysis in
Theorem 3. In Figure 4, we plot the computed singular values, showing the good accuracy of the
proposed algorithm. The memory costs of svds, Alg. 1 and Alg. 6 are 1.1 GB, 1.0 GB and 0.87
GB on MovieLens, and 0.58 GB, 0.55 GB and 0.35 GB on SNAP, respectively. This suggests that the
rSVD algorithms need less memory than svds. For the largest dataset, Aminer, the memory costs
of Alg. 1 and Alg. 6 are 25 GB and 23 GB, respectively, while svds fails due to the out-of-memory issue.
5. Conclusions
A fast randomized PCA algorithm including several techniques is proposed for sparse matrices. It is
faster than svds and the basic rPCA algorithm, with a speedup ratio of up to 20X over svds and 9.1X
over the basic rPCA algorithm. On real data from information retrieval, recommender system and
network analysis problems, the proposed frPCA algorithm performs well, while the svds and eigSVDs algorithms
may fail due to large memory cost. The frPCA algorithm runs up to 13X faster than svds and 8.7X
faster than the basic rPCA algorithm for the network analysis dataset, with little accuracy loss.
References
Aminer. https://www.aminer.cn, 2018.
N. Benjamin Erichson, Steven L. Brunton, and J. Nathan Kutz. Compressed singular value decomposition
for image and video processing. In Proc. IEEE International Conference on Computer
Vision (ICCV), pages 1880–1888, Oct. 2017.