Parallel Efficient Sparse Matrix-Matrix Multiplication On Multicore Platforms
1 Introduction
Sparse Matrix-Matrix Multiplication (SpGEMM) is an important kernel used in many High Performance Computing applications, such as algebraic multigrid solvers [4] and graph analytic kernels [7,10,12,17]. Compared to the efficiency of the corresponding dense GEMM routines, SpGEMM suffers from poor performance on most parallel hardware. The difficulty in optimizing SpGEMM lies in the irregular memory access patterns, the unknown pattern of non-zeros in the output matrix, poor data locality, and load imbalance during the computation. For sparse matrices whose non-zero patterns follow power-law distributions (e.g. graphs from the social network and recommendation system domains), this leads to poor efficiency, as some portions of the output are very dense while others are very sparse.
This paper presents an optimized implementation of SpGEMM for two matrices A and B that efficiently utilizes current multicore hardware. We have parallelized SpGEMM through row-based blocking of A and column-based blocking of B.
By using a dense array to accumulate partial (sparse vector) results, we obtain superior performance compared to previous implementations. We also maintain a CSR input and output format and include data structure transformation and memory allocation costs in our reported runtimes.
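For concreteness, the following is a minimal sketch of the CSR (Compressed Sparse Row) representation assumed in the rest of this paper; the struct and field names (rowptr, colidx, vals) are illustrative only and are not necessarily those used in our implementation.

```cpp
#include <vector>

// Minimal CSR container used for illustration throughout this paper.
// rowptr has nrows + 1 entries; the non-zeros of row i occupy positions
// [rowptr[i], rowptr[i+1]) of colidx (column indices) and vals (values).
struct CSRMatrix {
    int nrows = 0;
    int ncols = 0;
    std::vector<int>    rowptr;  // size nrows + 1
    std::vector<int>    colidx;  // size nnz
    std::vector<double> vals;    // size nnz
};
```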
Our main contributions are as follows:
Fig. 1. Data access pattern of Gustavson [11] and Partitioned SpGEMM algorithms.
Consider the following notation: $A_{i,j}$ denotes a single entry of matrix A, $A_{i,:}$ denotes the $i$th row of A, and $A_{:,i}$ denotes the $i$th column of A. The computation of the entire row $i$ of C can then be written as $C_{i,:} = \sum_k A_{i,k} \cdot B_{k,:}$. This computation is shown in Fig. 1(a). The figure shows the situation for a sparse matrix A, where $A_{i,k}$ is non-zero only for some values of $k$, and computations only occur on the corresponding rows of B. The product of the scalar $A_{i,k}$ with the non-zeros in $B_{k,:}$ basically scales the elements of the row $B_{k,:}$ and has the same sparsity structure as $B_{k,:}$. This product then needs to be summed into $C_{i,:}$. Note that $C_{i,:}$ is the sum of the various sparse vectors obtained from the products above, and is in general sparse, although its non-zero structure and density may be quite different from those of A and B.
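As a concrete illustration of this row-wise formulation, the sketch below computes a single row $C_{i,:}$ from CSR inputs (using the rowptr/colidx/vals arrays sketched above). The ordered-map accumulator is purely for exposition and is not the data structure used by our optimized kernel, which is described below.

```cpp
#include <map>
#include <vector>

// Illustrative only: compute C(i,:) = sum_k A(i,k) * B(k,:) for one row i,
// with A and B in CSR form. The map keeps the result sparse and sorted by
// column index; the optimized kernel replaces it with a dense accumulator.
std::map<int, double> spgemm_row(
    int i,
    const std::vector<int>& A_rowptr, const std::vector<int>& A_colidx,
    const std::vector<double>& A_vals,
    const std::vector<int>& B_rowptr, const std::vector<int>& B_colidx,
    const std::vector<double>& B_vals) {
  std::map<int, double> acc;                      // column index -> partial sum
  for (int p = A_rowptr[i]; p < A_rowptr[i + 1]; ++p) {
    const int k = A_colidx[p];                    // A(i,k) is non-zero
    const double a_ik = A_vals[p];
    for (int q = B_rowptr[k]; q < B_rowptr[k + 1]; ++q) {
      acc[B_colidx[q]] += a_ik * B_vals[q];       // scale row B(k,:) and add
    }
  }
  return acc;
}
```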
Gustavson [11] proposed a single-threaded algorithm for SpGEMM based on the Compressed Sparse Row (CSR) format. This is a straightforward implementation of the computations described in Fig. 1(a). This algorithm can be parallelized over rows of A to run on multi-core processors. However, as we show below, this basic algorithm does not take full advantage of architectural features such as caches present on modern hardware and makes inefficient use of the limited bandwidth available at various levels of the memory hierarchy. We now show the optimizations that we perform to overcome these bottlenecks.
To accumulate the partial results for a row $C_{i,:}$, we use a dense array X with one entry per column of C. For each non-zero $A_{i,k}$, we walk the non-zeros of $B_{k,:}$ and simply add each partial product to X[c], where c is the column index of the non-zero. Finally, once all additions are complete, we need to convert this dense array back to a sparse CSR representation when writing back to C. While the additions themselves have very low overhead with this data structure, it is very inefficient to scan the entire dense array X (most of whose elements are zero) and write back only the few non-zero elements into the sparse format. Indeed, our experiments indicate that this scan takes more than 20X the time required for the additions themselves. Hence, in addition to X, we keep an index array that stores the non-zero indices of X. This array can be maintained cheaply during the addition process by simply appending a column index c when it is first written to, i.e. when the existing array value X[c] is zero. We then iterate only over this sparse index array when writing to C and reset the corresponding values in X.
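The sketch below shows one way the dense accumulator X and its companion index array might be combined in a per-row kernel. The function and variable names are assumptions for illustration, and the column blocking of B, per-thread buffer management, and dynamic load balancing described in Sect. 2 are omitted.

```cpp
#include <algorithm>
#include <vector>

// Sketch of the per-row kernel: accumulate C(i,:) into a dense array X and
// record which columns were touched, so only those entries are scanned,
// written to C, and reset. X must be all zeros on entry and is restored to
// all zeros before returning, so it can be reused for the next row.
void spgemm_row_dense_acc(
    int i,
    const std::vector<int>& A_rowptr, const std::vector<int>& A_colidx,
    const std::vector<double>& A_vals,
    const std::vector<int>& B_rowptr, const std::vector<int>& B_colidx,
    const std::vector<double>& B_vals,
    std::vector<double>& X,          // dense accumulator, size = ncols(B)
    std::vector<int>& nz_cols,       // scratch list of touched columns of X
    std::vector<int>& C_colidx,      // output: column indices of C(i,:)
    std::vector<double>& C_vals) {   // output: values of C(i,:)
  nz_cols.clear();
  for (int p = A_rowptr[i]; p < A_rowptr[i + 1]; ++p) {
    const int k = A_colidx[p];
    const double a_ik = A_vals[p];
    for (int q = B_rowptr[k]; q < B_rowptr[k + 1]; ++q) {
      const int c = B_colidx[q];
      if (X[c] == 0.0) nz_cols.push_back(c);   // first write to column c
      X[c] += a_ik * B_vals[q];
    }
  }
  // Keep columns sorted within the row and drop duplicates (a column can be
  // appended twice if a partial sum cancels to exactly zero along the way).
  std::sort(nz_cols.begin(), nz_cols.end());
  nz_cols.erase(std::unique(nz_cols.begin(), nz_cols.end()), nz_cols.end());
  // Write back only the touched columns and reset X for the next row.
  for (int c : nz_cols) {
    C_colidx.push_back(c);
    C_vals.push_back(X[c]);
    X[c] = 0.0;
  }
}
```

In a parallel setting, each thread would typically own its own X and nz_cols buffers, sized to the number of columns of B (or of the column block of B it is assigned), while rows of A are distributed across threads.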
3 Experimental Results
3.1 Experimental Setup
We used an Intel Xeon E5-2697 v3 processor based system for the experiments. The system consists of two processors, each with 14 cores running at 2.6 GHz (a total of 28 cores), with 36 MB L3 cache and 64 GB memory. The system is based on the Haswell microarchitecture and runs Red Hat Linux (version 6.5). All our code is developed in C/C++ and compiled with the Intel C++ compiler (version 15.0.1) using the -O3 flag. We pin threads to cores for efficient NUMA behavior by setting KMP_AFFINITY to granularity=fine,compact,1 in our experiments [2].
Table 1. Matrices used in the evaluation.

Name     | Rows      | nnz        | Avg Degree | Max Degree | nnz(C)        | Time (sec) | Type
harbor   | 46,835    | 4,701,167  | 100        | 289        | 7,900,918     | 0.0681     | Symmetric
hood     | 220,542   | 10,768,436 | 49         | 77         | 34,242,181    | 0.0802     | Symmetric
qcd      | 49,152    | 1,916,928  | 39         | 39         | 10,911,745    | 0.0373     | Asymmetric
consph   | 83,334    | 6,010,480  | 72         | 81         | 26,539,737    | 0.0675     | Symmetric
pwtk     | 217,918   | 11,634,424 | 53         | 180        | 32,772,237    | 0.0821     | Symmetric
PR02R    | 161,070   | 8,185,136  | 51         | 92         | 30,969,454    | 0.0756     | Asymmetric
mono     | 169,410   | 5,036,288  | 30         | 719        | 41,377,965    | 0.0965     | Asymmetric
webbase  | 1,000,005 | 3,105,536  | 3          | 4,700      | 51,111,997    | 0.1597     | Asymmetric
audikw   | 943,695   | 39,297,771 | 42         | 346        | 164,772,225   | 0.2053     | Asymmetric
mou.gene | 45,101    | 14,506,196 | 322        | 8,033      | 190,430,984   | 0.9016     | Asymmetric
cage14   | 1,505,785 | 27,130,349 | 18         | 82         | 236,999,813   | 0.2469     | Asymmetric
dielFilt | 1,102,824 | 45,204,422 | 41         | 271        | 270,082,366   | 0.2687     | Asymmetric
rmat_er  | 262,144   | 16,777,150 | 64         | 181        | 704,303,135   | 0.7538     | Asymmetric
rmat_g   | 262,144   | 16,749,883 | 64         | 3,283      | 1,283,506,475 | 1.1721     | Asymmetric
rmat_b   | 262,144   | 16,749,883 | 64         | 54,250     | 1,648,990,052 | 2.4519     | Asymmetric
We compare the performance of our implementation with (i) the Combinatorial BLAS Library (CombBLAS v1.3 [1]), (ii) the Intel Math Kernel Library (MKL, version 11.2.1 [3]), (iii) the BHSPARSE implementation on the NVIDIA GeForce GTX Titan GPU reported in [13], and (iv) the BHSPARSE implementation on the AMD Radeon HD 7970 GPU reported in [13]. These represent some of the most recent SpGEMM implementations that perform well on modern hardware [13]. We consider all 15 datasets in the comparison; however, since the GPU SpGEMM code is not available online, we use the published results for the 6 overlapping datasets from [13]. Figure 2 shows the performance of these implementations in GFlops. As can be seen, our code achieves up to 18.4 GFlops (geomean 6.6 GFlops), whereas CombBLAS, MKL, BHSPARSE on the NVIDIA GTX Titan, and BHSPARSE on the AMD Radeon HD 7970 achieve up to 0.3, 11.7, 2.8, and 5.0 GFlops (geomean 0.09, 3.6, 1.5, and 2.1 GFlops), respectively. CombBLAS uses the DCSC matrix format, which involves an additional layer of indirection (more cache line accesses) and leads to poor performance. MKL does not partition the columns of B and uses a static partitioning scheme that can cause performance degradation. We note that our performance is up to 7.3X higher (geomean 3.9X) than that of the best BHSPARSE GPU implementation. This is despite the nearly 2X higher peak flops of the GPU cards compared to the Intel Xeon processor used. We attribute this to the impact of algorithmic optimizations such as our partitioning techniques, cache optimizations, and dynamic load balancing (more details are in Sect. 2).
We next demonstrate scalability using the 7 largest datasets in Table 1. Figure 3 shows the scalability of our algorithm and of Intel MKL. We exclude CombBLAS from this comparison because it is significantly slower (76X slower on average), and the GPU implementations because their code is unavailable. Figure 3(a) and Fig. 3(b) show scaling results for our algorithm and MKL, respectively. As can be seen, we achieve up to 28X speedup using 28 threads with respect to single-threaded execution.
4 Related Work
Gustavson introduced an SpGEMM algorithm with work complexity proportional to the number of non-zeros in A, the total number of flops, the number of rows in C, and the number of columns in C [11]. This algorithm, and how it relates to our work, is described in detail in Sect. 2. A similar algorithm is used by MATLAB, which processes a column of C at a time and uses a dense vector with values, indices, and valid flags for accumulating sparse partial results [9]. Buluç and Gilbert address the case of hypersparse matrices, where the number of non-zeros is less than the number of columns or rows [5]. The authors introduce the doubly compressed sparse column (DCSC) format and two new SpGEMM algorithms to handle this case.
There have also been several efforts to optimize the performance of SpGEMM for parallel and heterogeneous hardware. Sulatycke and Ghose analyzed the cache behavior of different loop orderings of SpGEMM with sparse A and B and dense C, and proposed a cache-efficient parallel algorithm that divided work across rows of A and C [16]. Siegel et al. created a framework for running SpGEMM on a cluster of heterogeneous (2x CPU + 2x GPU) nodes, and demonstrated a significant improvement in load balance through dynamic scheduling as compared to a static approach [15]. Zhu et al. propose a custom hardware implementation to accelerate SpGEMM [18]. They use custom logic integrated with 3D-stacked memory to retrieve columns from A, and merge intermediate results into C using content-addressable memories (CAMs). Liu and Vinter describe an SpGEMM algorithm and implementation for GPUs [13]. In this implementation, rows of C are divided into bins based on an upper bound on the size of their intermediate results and are processed using different routines.
References
1. Combinatorial BLAS v1.3. http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/
2. Thread affinity interface. https://software.intel.com/en-us/node/522691
3. Intel Math Kernel Library (2015). https://software.intel.com/en-us/intel-mkl
4. Bell, N., Dalton, S., Olson, L.N.: Exposing fine-grained parallelism in algebraic multigrid methods. SIAM J. Sci. Comput. 34(4), C123–C152 (2012)
5. Buluç, A., Gilbert, J.: On the representation and multiplication of hypersparse matrices. In: Proceedings of IPDPS, pp. 1–11, April 2008
6. Buluç, A., Gilbert, J.R.: Parallel sparse matrix-matrix multiplication and indexing: implementation and experiments. CoRR abs/1109.3739 (2011)
7. Chan, T.M.: More algorithms for all-pairs shortest paths in weighted graphs. SIAM J. Comput. 39(5), 2075–2089 (2010)
8. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)
9. Gilbert, J., Moler, C., Schreiber, R.: Sparse matrices in MATLAB: design and implementation. SIAM J. Matrix Anal. Appl. 13(1), 333–356 (1992)
10. Gilbert, J.R., Reinhardt, S., Shah, V.B.: High-performance graph algorithms from parallel sparse matrices. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 260–269. Springer, Heidelberg (2007)
11. Gustavson, F.G.: Two fast algorithms for sparse matrices: multiplication and permuted transposition. ACM Trans. Math. Softw. 4(3), 250–269 (1978)
12. Kaplan, H., Sharir, M., Verbin, E.: Colored intersection searching via sparse rectangular matrix multiplication. In: Symposium on Computational Geometry, pp. 52–60. ACM (2006)
13. Liu, W., Vinter, B.: An efficient GPU general sparse matrix-matrix multiplication for irregular data. In: Proceedings of IPDPS, pp. 370–381. IEEE (2014)
14. Murphy, R.C., Wheeler, K.B., Barrett, B.W., Ang, J.A.: Introducing the Graph 500. Cray User's Group (2010)
15. Siegel, J., et al.: Efficient sparse matrix-matrix multiplication on heterogeneous high performance systems. In: IEEE Cluster Computing, pp. 1–8 (2010)
16. Sulatycke, P., Ghose, K.: Caching-efficient multithreaded fast multiplication of sparse matrices. In: Proceedings of IPPS/SPDP 1998, pp. 117–123, March 1998
17. Vassilevska, V., Williams, R., Yuster, R.: Finding heaviest h-subgraphs in real weighted graphs, with applications. CoRR abs/cs/0609009 (2006)
18. Zhu, Q., Graf, T., Sumbul, H., Pileggi, L., Franchetti, F.: Accelerating sparse matrix-matrix multiplication with 3D-stacked logic-in-memory hardware. In: IEEE HPEC, pp. 1–6 (2013)