Parallel Efficient Sparse Matrix-Matrix Multiplication On Multicore Platforms
1 Introduction
Sparse Matrix-Matrix Multiplication (SpGEMM) is an important kernel used in many High Performance Computing applications, such as algebraic multigrid solvers [4] and graph analytic kernels [7,10,12,17]. Compared to the efficiency of the corresponding dense GEMM routines, SpGEMM suffers from poor performance on most parallel hardware. The difficulty in optimizing SpGEMM lies in the irregular memory access patterns, the unknown pattern of non-zeros in the output matrix, poor data locality, and load imbalance during the computation. For sparse matrices whose non-zero patterns follow power-law distributions (e.g. graphs from the social network and recommendation system domains), this leads to poor efficiency, as some portions of the output are very dense while others are very sparse.
This paper presents an optimized implementation of SpGEMM for two matrices A and B that efficiently utilizes current multicore hardware. We have parallelized SpGEMM through row-based blocking of A and column-based blocking of B.
By using a dense array to accumulate partial (sparse vector) results, we obtain superior performance compared to previous implementations. We also maintain a CSR input and output format and include data structure transformation and memory allocation costs in our reported runtimes.
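For concreteness, the following is a minimal sketch of the CSR (Compressed Sparse Row) representation assumed in the rest of this paper; the struct and field names (rowptr, colidx, vals) are illustrative only and are not necessarily those used in our implementation.

```cpp
#include <vector>

// Minimal CSR container used for illustration throughout this paper.
// rowptr has nrows + 1 entries; the non-zeros of row i occupy positions
// [rowptr[i], rowptr[i+1]) of colidx (column indices) and vals (values).
struct CSRMatrix {
    int nrows = 0;
    int ncols = 0;
    std::vector<int>    rowptr;  // size nrows + 1
    std::vector<int>    colidx;  // size nnz
    std::vector<double> vals;    // size nnz
};
```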
Our main contributions are as follows:
Fig. 1. Data access pattern of Gustavson [11] and Partitioned SpGEMM algorithms.
Consider the following notation: $A_{i,j}$ denotes a single entry of matrix A, $A_{i,:}$ denotes the $i$th row of A, and $A_{:,i}$ denotes the $i$th column of A. The computation of the entire row $i$ of C can then be written as $C_{i,:} = \sum_k A_{i,k} \cdot B_{k,:}$. This computation is shown in Fig. 1(a). The figure shows the situation for a sparse matrix A, where $A_{i,k}$ is non-zero only for some values of $k$, and computations only occur on the corresponding rows of B. The product of the scalar $A_{i,k}$ with the non-zeros in $B_{k,:}$ basically scales the elements of the row $B_{k,:}$ and has the same sparsity structure as $B_{k,:}$. This product then needs to be summed into $C_{i,:}$. Note that $C_{i,:}$ is the sum of the various sparse vectors obtained from the products above, and is in general sparse, although its non-zero structure and density may be quite different from those of A and B.
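As a concrete illustration of this row-wise formulation, the sketch below computes a single row $C_{i,:}$ from CSR inputs (using the rowptr/colidx/vals arrays sketched above). The ordered-map accumulator is purely for exposition and is not the data structure used by our optimized kernel, which is described below.

```cpp
#include <map>
#include <vector>

// Illustrative only: compute C(i,:) = sum_k A(i,k) * B(k,:) for one row i,
// with A and B in CSR form. The map keeps the result sparse and sorted by
// column index; the optimized kernel replaces it with a dense accumulator.
std::map<int, double> spgemm_row(
    int i,
    const std::vector<int>& A_rowptr, const std::vector<int>& A_colidx,
    const std::vector<double>& A_vals,
    const std::vector<int>& B_rowptr, const std::vector<int>& B_colidx,
    const std::vector<double>& B_vals) {
  std::map<int, double> acc;                      // column index -> partial sum
  for (int p = A_rowptr[i]; p < A_rowptr[i + 1]; ++p) {
    const int k = A_colidx[p];                    // A(i,k) is non-zero
    const double a_ik = A_vals[p];
    for (int q = B_rowptr[k]; q < B_rowptr[k + 1]; ++q) {
      acc[B_colidx[q]] += a_ik * B_vals[q];       // scale row B(k,:) and add
    }
  }
  return acc;
}
```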
Gustavson [11] proposed a single-threaded algorithm for SpGEMM based on the Compressed Sparse Row (CSR) format. This is a straightforward implementation of the computations described in Fig. 1(a). This algorithm can be parallelized over rows of A to run on multi-core processors. However, as we show below, this basic algorithm does not take full advantage of architectural features such as caches present on modern hardware and makes inefficient use of the limited bandwidth available at various levels of the memory hierarchy. We now show the optimizations that we perform to overcome these bottlenecks.
To accumulate the partial results for a row $C_{i,:}$, we use a dense array X with one entry per column of C. For each non-zero $A_{i,k}$, we walk the non-zeros of $B_{k,:}$ and simply add each partial product to X[c], where c is the column index of the non-zero. Finally, once all additions are complete, we need to convert this dense array back to a sparse CSR representation when writing back to C. While the additions themselves have very low overhead with this data structure, it is very inefficient to scan the entire dense array X (most of whose elements are zero) and write back only the few non-zero elements into the sparse format. Indeed, our experiments indicate that this scan takes more than 20X the time required for the additions themselves. Hence, in addition to X, we keep an index array that stores the non-zero indices of X. This array can be maintained cheaply during the addition process by simply appending a column index c when it is first written to, i.e. when the existing array value X[c] is zero. We then iterate only over this sparse index array when writing to C and reset the corresponding values in X.
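The sketch below shows one way the dense accumulator X and its companion index array might be combined in a per-row kernel. The function and variable names are assumptions for illustration, and the column blocking of B, per-thread buffer management, and dynamic load balancing described in Sect. 2 are omitted.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the per-row kernel: accumulate C(i,:) into a dense array X and
// record which columns were touched, so only those entries are scanned,
// written to C, and reset. X must be all zeros on entry and is restored to
// all zeros before returning, so it can be reused for the next row.
void spgemm_row_dense_acc(
    int i,
    const std::vector<int>& A_rowptr, const std::vector<int>& A_colidx,
    const std::vector<double>& A_vals,
    const std::vector<int>& B_rowptr, const std::vector<int>& B_colidx,
    const std::vector<double>& B_vals,
    std::vector<double>& X,          // dense accumulator, size = ncols(B)
    std::vector<int>& nz_cols,       // scratch list of touched columns of X
    std::vector<int>& C_colidx,      // output: column indices of C(i,:)
    std::vector<double>& C_vals) {   // output: values of C(i,:)
  nz_cols.clear();
  for (int p = A_rowptr[i]; p < A_rowptr[i + 1]; ++p) {
    const int k = A_colidx[p];
    const double a_ik = A_vals[p];
    for (int q = B_rowptr[k]; q < B_rowptr[k + 1]; ++q) {
      const int c = B_colidx[q];
      if (X[c] == 0.0) nz_cols.push_back(c);   // first write to column c
      X[c] += a_ik * B_vals[q];
    }
  }
  // Keep columns sorted within the row and drop duplicates (a column can be
  // appended twice if a partial sum cancels to exactly zero along the way).
  std::sort(nz_cols.begin(), nz_cols.end());
  nz_cols.erase(std::unique(nz_cols.begin(), nz_cols.end()), nz_cols.end());
  // Write back only the touched columns and reset X for the next row.
  for (int c : nz_cols) {
    C_colidx.push_back(c);
    C_vals.push_back(X[c]);
    X[c] = 0.0;
  }
}
```

In a parallel setting, each thread would typically own its own X and nz_cols buffers, sized to the number of columns of B (or of the column block of B it is assigned), while rows of A are distributed across threads.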
3 Experimental Results
3.1 Experimental Setup
We used an Intel Xeon E5-2697 v3 processor based system for the experiments. The system consists of two processors, each with 14 cores running at 2.6 GHz (a total of 28 cores), with 36 MB L3 cache and 64 GB memory. The system is based on the Haswell microarchitecture and runs Red Hat Linux (version 6.5). All our code is developed in C/C++ and compiled with the Intel C++ compiler (version 15.0.1) using the -O3 flag. We pin threads to cores for efficient NUMA behavior by setting KMP_AFFINITY to granularity=fine,compact,1 in our experiments [2].
Table 1. Matrices used in the evaluation.

Name     | Rows      | nnz        | Avg Degree | Max Degree | nnz(C)        | Time (sec) | Type
harbor   | 46,835    | 4,701,167  | 100        | 289        | 7,900,918     | 0.0681     | Symmetric
hood     | 220,542   | 10,768,436 | 49         | 77         | 34,242,181    | 0.0802     | Symmetric
qcd      | 49,152    | 1,916,928  | 39         | 39         | 10,911,745    | 0.0373     | Asymmetric
consph   | 83,334    | 6,010,480  | 72         | 81         | 26,539,737    | 0.0675     | Symmetric
pwtk     | 217,918   | 11,634,424 | 53         | 180        | 32,772,237    | 0.0821     | Symmetric
PR02R    | 161,070   | 8,185,136  | 51         | 92         | 30,969,454    | 0.0756     | Asymmetric
mono     | 169,410   | 5,036,288  | 30         | 719        | 41,377,965    | 0.0965     | Asymmetric
webbase  | 1,000,005 | 3,105,536  | 3          | 4,700      | 51,111,997    | 0.1597     | Asymmetric
audikw   | 943,695   | 39,297,771 | 42         | 346        | 164,772,225   | 0.2053     | Asymmetric
mou.gene | 45,101    | 14,506,196 | 322        | 8,033      | 190,430,984   | 0.9016     | Asymmetric
cage14   | 1,505,785 | 27,130,349 | 18         | 82         | 236,999,813   | 0.2469     | Asymmetric
dielFilt | 1,102,824 | 45,204,422 | 41         | 271        | 270,082,366   | 0.2687     | Asymmetric
rmat_er  | 262,144   | 16,777,150 | 64         | 181        | 704,303,135   | 0.7538     | Asymmetric
rmat_g   | 262,144   | 16,749,883 | 64         | 3,283      | 1,283,506,475 | 1.1721     | Asymmetric
rmat_b   | 262,144   | 16,749,883 | 64         | 54,250     | 1,648,990,052 | 2.4519     | Asymmetric
We compare the performance of our implementation with (i) the Combinatorial BLAS Library (CombBLAS v1.3 [1]), (ii) the Intel Math Kernel Library (MKL, version 11.2.1 [3]), (iii) the BHSPARSE implementation on the NVIDIA GeForce GTX Titan GPU reported in [13], and (iv) the BHSPARSE implementation on the AMD Radeon HD 7970 GPU reported in [13]. These represent some of the most recent SpGEMM implementations that perform well on modern hardware [13]. We consider all 15 datasets in the comparison; however, since the GPU SpGEMM code is not available online, we use the published results for the 6 overlapping datasets from [13]. Figure 2 shows the performance of these implementations in GFlops. As can be seen, our code achieves up to 18.4 GFlops (geomean 6.6 GFlops), whereas CombBLAS, MKL, BHSPARSE on the NVIDIA GTX Titan, and BHSPARSE on the AMD Radeon HD 7970 achieve up to 0.3, 11.7, 2.8, and 5.0 GFlops (geomean 0.09, 3.6, 1.5, and 2.1 GFlops), respectively. CombBLAS uses the DCSC matrix format, which involves an additional layer of indirection (more cache line accesses) and leads to poor performance. MKL does not partition the columns of B and uses a static partitioning scheme that can cause performance degradation. We note that our performance is up to 7.3X higher (geomean 3.9X) than that of the best BHSPARSE GPU implementation. This is despite the nearly 2X higher peak flops of the GPU cards compared to the Intel Xeon processor used. We attribute this to the impact of algorithmic optimizations such as our partitioning techniques, cache optimizations, and dynamic load balancing (more details are in Sect. 2).
We next demonstrate scalability using the 7 largest datasets in Table 1. Figure 3 shows the scalability of our algorithm and of Intel MKL. We exclude CombBLAS from this comparison because it is significantly slower (76X slower on average), and the GPU implementations because their code is unavailable. Figure 3(a) and Fig. 3(b) show scaling results for our algorithm and MKL, respectively. As can be seen, we achieve up to 28X speedup using 28 threads with respect to single-threaded execution.
4 Related Work
Gustavson introduced an SpGEMM algorithm with work complexity proportional to the number of non-zeros in A, the total number of flops, the number of rows in C, and the number of columns in C [11]. This algorithm, and how it relates to our work, is described in detail in Sect. 2. A similar algorithm is used by MATLAB, which processes a column of C at a time and uses a dense vector with values, indices, and valid flags for accumulating sparse partial results [9]. Buluç and Gilbert address the case of hypersparse matrices, where the number of non-zeros is less than the number of columns or rows [5]. The authors introduce the doubly compressed sparse column (DCSC) format and two new SpGEMM algorithms to handle this case.
There have also been several efforts to optimize the performance of SpGEMM for parallel and heterogeneous hardware. Sulatycke and Ghose analyzed the cache behavior of different loop orderings of SpGEMM with sparse A and B and dense C, and proposed a cache-efficient parallel algorithm that divided work across rows of A and C [16]. Siegel et al. created a framework for running SpGEMM on a cluster of heterogeneous (2x CPU + 2x GPU) nodes, and demonstrated a significant improvement in load balance through dynamic scheduling as compared to a static approach [15]. Zhu et al. propose a custom hardware implementation to accelerate SpGEMM [18]. They use custom logic integrated with 3D-stacked memory to retrieve columns from A, and merge intermediate results into C using content-addressable memories (CAMs). Liu and Vinter describe an SpGEMM algorithm and implementation for GPUs [13]. In this implementation, rows of C are divided into bins based on an upper bound on the size of their intermediate results and are processed using different routines.
References
1. Combinatorial BLAS v1.3. http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/
2. Thread affinity interface. https://software.intel.com/en-us/node/522691
3. Intel Math Kernel Library (2015). https://software.intel.com/en-us/intel-mkl
4. Bell, N., Dalton, S., Olson, L.N.: Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J. Sci. Comput. 34(4), C123–C152 (2012)
5. Buluç, A., Gilbert, J.: On the representation and multiplication of hypersparse matrices. In: Proceedings of IPDPS, pp. 1–11, April 2008
6. Buluç, A., Gilbert, J.R.: Parallel sparse matrix-matrix multiplication and indexing: implementation and experiments. CoRR abs/1109.3739 (2011)
7. Chan, T.M.: More algorithms for all-pairs shortest paths in weighted graphs. SIAM J. Comput. 39(5), 2075–2089 (2010)
8. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)
9. Gilbert, J., Moler, C., Schreiber, R.: Sparse matrices in MATLAB: design and implementation. SIAM J. Matrix Anal. Appl. 13(1), 333–356 (1992)
10. Gilbert, J.R., Reinhardt, S., Shah, V.B.: High-performance graph algorithms from parallel sparse matrices. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 260–269. Springer, Heidelberg (2007)
11. Gustavson, F.G.: Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Trans. Math. Softw. 4(3), 250–269 (1978)
12. Kaplan, H., Sharir, M., Verbin, E.: Colored intersection searching via sparse rectangular matrix multiplication. In: Symposium on Computational Geometry, pp. 52–60. ACM (2006)
13. Liu, W., Vinter, B.: An efficient GPU general sparse matrix-matrix multiplication for irregular data. In: Proceedings of IPDPS, pp. 370–381. IEEE (2014)
14. Murphy, R.C., Wheeler, K.B., Barrett, B.W., Ang, J.A.: Introducing the Graph 500. Cray User's Group (2010)
15. Siegel, J., et al.: Efficient sparse matrix-matrix multiplication on heterogeneous high performance systems. In: IEEE Cluster Computing, pp. 1–8 (2010)
16. Sulatycke, P., Ghose, K.: Caching-efficient multithreaded fast multiplication of sparse matrices. In: Proceedings of IPPS/SPDP 1998, pp. 117–123, March 1998
17. Vassilevska, V., Williams, R., Yuster, R.: Finding heaviest h-subgraphs in real weighted graphs, with applications. CoRR abs/cs/0609009 (2006)
18. Zhu, Q., Graf, T., Sumbul, H., Pileggi, L., Franchetti, F.: Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware. In: IEEE HPEC, pp. 1–6 (2013)