
Merge-based Parallel Sparse Matrix-Vector Multiplication

Duane Merrill (NVIDIA Corporation, Santa Clara, CA)
Michael Garland (NVIDIA Corporation, Santa Clara, CA)

Abstract—We present a strictly balanced method for the parallel computation of sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format without preprocessing, inspection, reformatting, or supplemental encoding. Regardless of nonzero structure, our equitable 2D merge-based decomposition tightly bounds the workload assigned to each processing element. Furthermore, our technique is suitable for recursively partitioning CSR datasets themselves into multi-scale, distributed, NUMA, and GPU environments that are constrained by fixed-size local memories.

We evaluate our method on both CPU and GPU microarchitectures across a very large corpus of diverse sparse matrix datasets. We show that traditional CsrMV methods are inconsistent performers, often subject to order-of-magnitude performance variation across similarly-sized datasets. In comparison, our method provides predictable performance that is substantially uncorrelated to the distribution of nonzeros among rows and broadly improves upon that of current CsrMV methods.

Keywords—SpMV, sparse matrix, parallel merge, merge-path, many-core, GPU, linear algebra

I. INTRODUCTION

High performance algorithms for sparse linear algebra are important within many application domains, including computational science, graph analytics, and machine learning. The sparse matrix-vector product (SpMV) is of particular importance for solving sparse linear systems, eigenvalue systems, Krylov subspace methods, and other similar problems. When generalized to other algebraic semi-rings, it is also an important building block for large-scale combinatorial graph algorithms [1]–[3]. In its simplest form, a SpMV operation computes y = Ax where the matrix A is sparse and vectors x and y are dense.

Patterns of SpMV usage are often highly repetitive within iterative solvers and graph computations, making it a performance-critical component whose overhead can dominate overall application running time. Achieving good performance on today's parallel architectures requires complementary strategies for workload decomposition and matrix storage formatting that provide both uniform processor utilization and efficient use of memory bandwidth, regardless of matrix nonzero structure [4], [5]. These design objectives have inspired many custom matrix formats and encodings that exploit both the structural properties of a given matrix and the organization of the underlying machine architecture. In fact, more than sixty specialized SpMV algorithms and sparse matrix formats have been proposed for GPU processors alone, which exemplify the current trend of increased parallelism in high performance computer architecture [6].

However, improving SpMV performance with innovative matrix formatting comes at a significant practical cost. Applications rarely maintain sparse matrices in custom encodings, instead preferring general-purpose encodings such as the Compressed Sparse Row (CSR) format for in-memory representation (Fig. 1). The CSR encoding is free of architecture-specific blocking, reordering, annotations, etc. that would otherwise hinder portability. Consequently, specialized or supplementary formats ultimately burden the application with both (1) preprocessing time for inspection and formatting, which may be tens of thousands of times greater than the SpMV operation itself; and (2) excess storage overhead because the original CSR matrix is likely required by other routines and cannot be discarded.

An ideal CsrMV would deliver consistently high performance without format conversion or augmentation. As shown in Algorithm 1, CsrMV is principally a row-wise summation of partial matrix-vector dot-products. The independence of rows and the associativity of addition provide ample opportunities for parallelism. However, contemporary CsrMV strategies that attempt to parallelize these loops independently are subject to performance degradation arising from irregular row lengths and/or wide aspect ratios [7]–[11]. Despite various heuristics for constraining imbalance, they often underperform for small-world or scale-free datasets having a minority of rows that are many orders of magnitude longer than average. Furthermore, such underutilization often scales with increased processor parallelism. This is particularly true of GPUs, where the negative impact of a few long-running threads can be drastic.

To illustrate the effects of row-based workload imbalance, consider the three similarly-sized sparse matrices in Table 1. All three comprise approximately 3M nonzeros, yet have quite different row-length variations. As variation increases, the row-based CsrMV implementations within Intel MKL [10] and NVIDIA cuSPARSE [11] are progressively unable to map their workloads equitably across parallel threads. This nonlinear performance response makes it difficult to establish reasonable performance expectations.

Fig. 1. Example three-array CSR sparse matrix encoding. The 4x4 matrix A

    1.0  --   1.0  --
    --   --   --   --
    --   --   3.0  3.0
    4.0  4.0  4.0  4.0

is stored as

    values_A         = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0]
    column_indices_A = [0, 2, 2, 3, 0, 1, 2, 3]
    row_offsets_A    = [0, 2, 2, 4, 8]

ALGORITHM 1: The sequential CsrMV algorithm
Input: CSR matrix A, dense vector x
Output: Dense vector y such that y ← Ax

    for (int row = 0; row < A.n; ++row)
    {
        y[row] = 0.0;
        for (int nz = A.row_offsets[row]; nz < A.row_offsets[row + 1]; ++nz)
        {
            y[row] += A.values[nz] * x[A.column_indices[nz]];
        }
    }
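For illustration (this example driver is not part of the paper's code), the following self-contained C++ program runs the loop nest of Algorithm 1 over the Fig. 1 encoding with x = [1, 1, 1, 1]; it prints y = [2, 0, 6, 16].

    #include <cstdio>
    #include <vector>

    int main()
    {
        // Three-array CSR encoding of the 4x4 example matrix in Fig. 1
        std::vector<double> values         = {1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0};
        std::vector<int>    column_indices = {0, 2, 2, 3, 0, 1, 2, 3};
        std::vector<int>    row_offsets    = {0, 2, 2, 4, 8};
        int n = 4;                                      // number of rows

        std::vector<double> x = {1.0, 1.0, 1.0, 1.0};   // dense input vector
        std::vector<double> y(n);

        // Sequential CsrMV (the loop nest of Algorithm 1)
        for (int row = 0; row < n; ++row)
        {
            y[row] = 0.0;
            for (int nz = row_offsets[row]; nz < row_offsets[row + 1]; ++nz)
                y[row] += values[nz] * x[column_indices[nz]];
        }

        for (int row = 0; row < n; ++row)
            printf("y[%d] = %g\n", row, y[row]);        // prints 2, 0, 6, 16
        return 0;
    }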

TABLE 1: Relative consistency of double-precision CsrMV performance among similarly-sized matrices from the University of Florida Sparse Matrix Collection [31], evaluated using two Intel Xeon E5-2690v2 CPUs and one NVIDIA Tesla K40 GPU

                                            thermomech_dK               cnr-2000              ASIC_320k
                                            (temperature deformation)   (web connectivity)    (circuit simulation)
    Number of nonzeros (nnz)                2,846,228                   3,216,152             2,635,364
    Row-length coefficient of variation     0.10                        2.1                   61.4
    CPU x2:  MKL (GFLOPs)                   17.9                        13.4                  11.8
             Merge-based (GFLOPs)           21.2                        22.8                  23.2
    GPU x1:  cuSPARSE (GFLOPs)              12.4                        5.9                   0.12
             Merge-based (GFLOPs)           15.5                        16.7                  14.1
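The row-length coefficient of variation reported in Table 1 is the ratio of the standard deviation of the row lengths to their mean. A minimal sketch that computes it from a CSR row-offsets array (illustrative only; whether the population or sample deviation is used is an assumption here):

    #include <cmath>
    #include <vector>

    // Coefficient of variation (stddev / mean) of CSR row lengths, where the
    // length of row i is row_offsets[i+1] - row_offsets[i].
    double RowLengthCV(const std::vector<int>& row_offsets)
    {
        int n = (int)row_offsets.size() - 1;        // number of rows
        double mean = double(row_offsets[n]) / n;   // nnz / n

        double ss = 0.0;                            // sum of squared deviations
        for (int i = 0; i < n; ++i)
        {
            double len = row_offsets[i + 1] - row_offsets[i];
            ss += (len - mean) * (len - mean);
        }
        return std::sqrt(ss / n) / mean;            // population stddev / mean
    }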

Our CsrMV does not suffer from such imbalance because it equitably splits the aggregate work performed by the loop nest as a whole. To do so, we adapt a simple parallelization strategy originally developed for efficient merging of sorted sequences [12]–[14]. The central idea is to frame the parallel CsrMV decomposition as a logical merger of two lists:

A. The row descriptors (e.g., the CSR row-offsets)

B. The natural numbers ℕ (e.g., the indices of the CSR nonzero values).

Individual processing elements are assigned equal-sized shares of this logical merger, with each processing element performing a two-dimensional search to isolate the corresponding region within each list that would comprise its share. The regions can then be treated as independent CsrMV subproblems and consumed directly from the CSR data structures using the sequential method of Algorithm 1. This equitable multi-partitioning ensures that no single processing element can be overwhelmed by assignment to (a) arbitrarily-long rows or (b) an arbitrarily-large number of zero-length rows. Furthermore, this method requires no preprocessing overhead, reordering, or extensions of the CSR matrix format with additional data structures. Matrices can be presented to our implementation directly as constructed by the application.

Our merge-based decomposition is also useful for recursively partitioning CSR datasets themselves within multi-scale and/or distributed memories (e.g., NUMA CPU architecture, hierarchical GPU architecture, and multi-node HPC systems). Local storage allocation is simplified because our method guarantees equitable storage partitioning.

Whereas prior SpMV investigations have studied performance on a few dozen select datasets, we perform our evaluation using the entire University of Florida Sparse Matrix Collection (currently 4,201 non-trivial datasets). This experimental corpus covers a wide gamut of sparsity patterns, ranging from well-structured matrices (typical of discretized geometric manifolds) to power-law matrices (typical of network graphs). For shared-memory CPU and GPU architectures, we show MKL and cuSPARSE CsrMV to commonly exhibit performance discrepancies of one or more orders of magnitude among similarly-sized matrices. We also show that our performance is substantially uncorrelated to irregular row-lengths and highly correlated to problem size. We typically match or exceed the performance of MKL and cuSPARSE, achieving 1.6x and 1.1x respective average speedup for medium and large-sized datasets, and up to 198x for highly irregular datasets. Our method also performs favorably with respect to specialized formats (CSB [15], HYB [7]), even those that leverage expensive per-processor and per-matrix auto-tuning (pOSKI [16], yaSpMV [17]).

II. BACKGROUND

A. General-purpose sparse matrix formats

For a given n x m matrix A, sparse formats aim to store only the nonzero entries of A. This can result in substantial savings, as the number of nonzero entries nnz in many datasets is O(m+n).

Fig. 2. Basic CsrMV parallel decomposition strategies across four threads: (a) row-based (unbalanced decomposition of nonzero data); (b) nonzero-splitting (unbalanced decomposition of row descriptor data); (c) merge-based (balanced data decomposition).

The simplest general-purpose representation is the Coordinate (COO) format, which records each nonzero as an index-value triple (i, j, Aij). COO lends itself to SpMV parallelizations based upon segmented scan primitives [18]–[20], which provide strong guarantees of workload balance and stable performance across all nonzero distributions [7]. However, pure CooMV implementations do not often achieve high levels of performance, in part because the format has relatively high storage overhead.

The CSR format eliminates COO row-index repetition by storing the nonzero values and column indices in row-major order, and building a separate row-offsets array such that the entries for row i in the other arrays occupy the half-open interval [row_offsets[i], row_offsets[i+1]). CSR is perhaps the most commonplace general-purpose format for in-memory storage.

B. Specialized sparse matrix formats

Due to its importance as an algorithmic building block, there is a long history of SpMV performance optimization under varying assumptions of matrix sparsity structure [4]. A primary goal of many specialized formats is to regularize computation and memory accesses, often through padding and/or matrix reorganization. For example, the ELL format pads all rows to be the same length [7], but can degenerate into dense storage in the presence of singularly large rows.

Reorganization strategies attempt to map similarly-sized rows onto co-scheduled thread groups. Bell and Garland investigated a packetized (PKT) format that partitions the matrix and then lays out the entries of each piece in a way that balances work across threads [21]. Ashari et al. suggested binning the rows by length so that rows of similar length can be processed together [9]. The Sliced ELL format [22] can provide similar benefits, as it also bins rows by length. Although these heuristics improve balance for many scenarios, their coarseness can still lead to processor underutilization.

Many specialized formats also aim to reduce memory I/O via index compression. Blocking is a common extension of the storage formats above, where only a single index is used to locate a small, dense block of matrix entries. (Our merge-based CsrMV is compatible with blocking, as the CSR structures are used in the same way to refer to sparse blocks instead of individual sparse nonzeros.) Other sophisticated compression schemes attempt to optimize the bit-encoding of the matrix, often at the expense of significant formatting overhead [23], [24]. The yaSpMV BCCOO format is a variation of block-compressed COO that uses bit flags to store the row indices in line with column indices [17].

Hybrid and multi-level formats are also commonplace. The matrix can be partitioned into separate regions, each of which may be stored in a different format. The pOSKI autotuning framework explores a wide gamut of multi-level blocking schemes [16]. The compressed sparse block (CSB) format is a nested COO-of-COO representation where tuples at both levels are kept in a Morton Z-order [15]. The HYB format combines an ELL portion with a COO portion [7]. Su et al. demonstrated an auto-tuning framework based on the Cocktail format that would explore arbitrary hybridizations [25]. For extended reading, Williams et al. explore a range of techniques for multicore CPUs [5], [26], and Filippone et al. provide a comprehensive survey of SpMV methods for GPUs [6].

C. Contemporary CsrMV parallelization strategies

CsrMV implementations typically adhere to one of two general parallelization strategies: (a) row splitting, a variation of row-based work distribution; or (b) nonzero splitting, an equitable parallel partitioning of nonzero data.

Row splitting. This practice assigns regularized sections of long rows to multiple processors in order to limit the number of data items assigned to each processor. The partial sums from related sub-rows can be aggregated in a subsequent pass. The differences in length between rows that are smaller than the splitting size can still contribute load imbalance between threads. The splitting of long rows can be done statically via a preprocessing step that encodes the dataset into an alternative format, or dynamically using a work-sharing runtime. The dynamic variant requires runtime task distribution, a behavior likely to incur processor contention and limited scalability on massively parallel systems.
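Both variants refine the basic row-based decomposition of Fig. 2a, which simply parallelizes the outer loop of Algorithm 1 over rows. A minimal OpenMP sketch of that baseline (illustrative, not any of the cited implementations) makes plain why skewed row lengths translate directly into idle threads:

    // Compile with -fopenmp. Naive row-parallel CsrMV: each thread is handed a
    // contiguous block of rows, so its work is proportional to the nonzeros in
    // those rows -- unbounded when row lengths are highly skewed or many rows
    // are empty.
    void RowParallelCsrmv(
        int num_rows, const int* row_offsets, const int* column_indices,
        const double* values, const double* x, double* y)
    {
        #pragma omp parallel for schedule(static)
        for (int row = 0; row < num_rows; ++row)
        {
            double sum = 0.0;
            for (int nz = row_offsets[row]; nz < row_offsets[row + 1]; ++nz)
                sum += values[nz] * x[column_indices[nz]];
            y[row] = sum;
        }
    }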

Fig. 3. CPU SpMV performance-consistency across the U.Fl. Sparse Matrix Collection (4,201 datasets), arranged by number of nonzero elements (Intel Xeon E5-2690v2 x 2). Each panel plots GFLOPs (double) and runtime (ms) against dataset nnz: (a) Intel MKL CsrMV; (b) our merge-based CsrMV; (c) Compressed Sparse Blocks (CsbMV); (d) pOSKI tuned SpMV.
Fig. 4. GPU SpMV performance-consistency across the U.Fl. Sparse Matrix Collection (4,201 datasets), arranged by number of nonzero elements (NVIDIA K40). Each panel plots GFLOPs and runtime (ms) against dataset nnz: (a) NVIDIA cuSPARSE CsrMV; (b) our merge-based CsrMV; (c) NVIDIA cuSPARSE HybMV; (d) yaSpMV tuned BccooMV (fp32).

Vectorization is a common variant of row-splitting in which a group of threads is assigned to process each row. Vectorization is typical of GPU-based implementations such as cuSPARSE CsrMV [11]. Although bandwidth coalescing is improved by strip-mined access patterns, there is potential for significant underutilization in the form of inactive SIMD lanes when row-lengths do not align with SIMD widths [7]. Recent work by Greathouse and Daga [8] and Ashari et al. [9] adaptively vectorizes based on row length, thus avoiding the problem that vectorized CSR kernels perform poorly on short rows. However, these methods require additional storage and preprocessing time to augment CSR matrices with supplemental data structures that capture thread assignment.

Nonzero splitting. An alternative to row-based parallelization is to assign an equal share of the nonzero data (i.e., the values and column-indices arrays) to each processor. Processors then determine to which row(s) their data items belong by searching within the row-offsets array. As each processor consumes its section of nonzeros, it must track its progress through the row-offsets array. The CsrMV parallelizations from Dalton et al. [27] and Baxter [28] perform this searching as an offline preprocessing step, storing processor-specific waypoints within supplemental data structures. (Despite their nomenclature, these implementations are not merge-based, but rather examples of nonzero-splitting.)

As illustrated in Fig. 2b, imbalance is still possible because some processors may consume arbitrarily more row offsets than others in the presence of empty rows. Although such hypersparse [29] matrices are fairly rare in numerical algebra (nonsingular square matrices must have nnz ≥ n), they occur frequently in computations on graphs. For example, more than 20% of web-crawl vertices (webbase-2001), 12% of Amazon vertices (amazon-2008), and 50% of vertices within the synthetic RMAT datasets generated for the Graph500 benchmark have zero out-degree [30], [31].

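A minimal sketch of the nonzero-splitting setup just described (illustrative only, not the code of [27] or [28]): each processor takes an equal share of the nonzeros and locates the row owning its first nonzero with a binary search into the row-offsets array. The traversal of the row offsets themselves remains unevenly divided, which is the imbalance depicted in Fig. 2b.

    #include <algorithm>

    // For processor 'pid' of 'p', return the index of the row that owns its
    // first nonzero, given an equal split of the nnz nonzeros.
    int StartingRow(int pid, int p, int nnz, const int* row_offsets, int num_rows)
    {
        int items_per_proc = (nnz + p - 1) / p;              // ceil(nnz / p)
        int first_nz       = std::min(pid * items_per_proc, nnz);

        // First row whose end-offset exceeds first_nz (skips any empty rows)
        const int* end = row_offsets + num_rows + 1;
        const int* it  = std::upper_bound(row_offsets + 1, end, first_nz);
        return int(it - row_offsets) - 1;
    }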
Fig. 5. Intel MKL CsrMV performance (GFLOPs, double) as a function of aspect ratio (Xeon E5-2690v2 x2). As row-count falls below the width of the machine (40 threads), performance falls by up to 8x.

Fig. 6. NVIDIA cuSPARSE CsrMV performance (GFLOPs, double) as a function of aspect ratio (Tesla K40). As row-count falls below the width of the machine (960 warps), performance falls by up to 270x.

D. Illustrations of CsrMV workload imbalance

Fig. 3a and Fig. 4a illustrate underutilization from irregular sparsity structure, presenting MKL and cuSPARSE CsrMV elapsed running time performance across the entire Florida Sparse Matrix Collection. Ideally we would observe a smooth, continuous performance response that is highly correlated to data set size (nnz). However, many performance outliers are readily visible (despite being plotted on a log-scale), often running one or two orders of magnitude slower than similarly-sized datasets. This is especially pronounced on the GPU, where parallel efficiency is particularly hindered by workload imbalance.

Furthermore, Fig. 5 and Fig. 6 highlight the performance cliffs that are common to row-based CsrMV when computing on short, wide matrices having many columns and few rows. We use MKL and cuSPARSE to compute CsrMV across different aspect ratios of dense matrices (stored in sparse CSR format), all comprising the same number of nonzero values. Performance drops severely as the number of rows falls below the number of available hardware thread contexts. Although the CSR format is general-purpose, these implementations discriminate by aspect ratio against entire genres of short, wide sparse matrices.

III. MERGE-BASED CSRMV

Our parallel CsrMV decomposition is similar to the fine-grained merge-path method for efficiently merging two sorted lists A and B [12]–[14]. The fundamental property of this method is that each processor is provided with an equal share of |A|+|B| steps, regardless of list sizing and value content.

A. Parallel "merge-path"

In general, a merge computation can be viewed as a decision path of length |A|+|B| in which progressively larger elements are consumed from A and B. Fig. 7 illustrates this as a two-dimensional grid in which the elements of A are arranged along the x-axis and the elements of B are arranged along the y-axis. The decision path begins in the top-left corner and ends in the bottom-right. When traced sequentially, the merge path moves right when consuming the elements from A and down when consuming from B. As a consequence, the path coordinates describe a complete schedule of element comparison and consumption across both input sequences. Furthermore, each path coordinate can be linearly indexed by its grid diagonal, where diagonals are enumerated from top-left to bottom-right. Per convention, the semantics of merge always prefer items from A over those from B when comparing same-valued items. This results in exactly one decision path.

To parallelize across p threads, the grid is sliced diagonally into p swaths of equal width, and it is the job of each thread to establish the route of the decision path through its swath. The fundamental insight is that any given path coordinate (i,j) can be found independently using the two-dimensional search procedure presented in Algorithm 3. More specifically, the two elements A_i and B_j scheduled to be compared at diagonal k can be found via constrained binary search along that diagonal: find the first (i,j) where A_i is greater than all of the items consumed before B_j, given that i+j=k. Each thread need only search the first diagonal in its swath; the remainder of its path segment can be trivially established via sequential comparisons seeded from that starting coordinate.

An important property of this parallelization strategy is that the decision path can be partitioned hierarchically, trivially enabling parallelization across large, multi-scale systems. Assuming the number of processors p is a finite, constant property of any given level of the underlying machine (unrelated to N=|A|+|B|), the total work remains O(N).

B. Adaptation for CSR SpMV

As illustrated in Fig. 8, we can compute CsrMV using the merge-path decomposition by logically merging the row-offsets vector with the sequence of natural numbers ℕ used to index the values and column-indices vectors. We emphasize that this merger is never physically realized, but rather serves to guide the equitable consumption of the CSR matrix. By design, each contiguous vertical section of the decision path corresponds to a row of nonzeros in the CSR sparse matrix. As threads follow the merge-path, they accumulate matrix-vector dot-products when moving downwards. When moving rightwards, threads then flush these accumulated values to the corresponding row output in y and reset their accumulator. The partial sums from rows that span multiple threads can be aggregated in a subsequent reduce-value-by-key "fix-up" pass. The result is that we always partition equal amounts of CsrMV work across parallel threads, regardless of matrix structure.
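The following self-contained sketch (illustrative, not the paper's implementation) makes the merge-path partitioning of Section III.A concrete for two sorted integer lists: each worker binary-searches its starting diagonal exactly as in Algorithm 3 and then merges sequentially, consuming the same number of items no matter how the values are distributed between the lists.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Coord { int x, y; };     // x indexes list A, y indexes list B

    // Constrained binary search along a diagonal (same scheme as Algorithm 3)
    Coord MergePathSearch(int diagonal, const std::vector<int>& a, const std::vector<int>& b)
    {
        int x_min = std::max(diagonal - (int)b.size(), 0);
        int x_max = std::min(diagonal, (int)a.size());
        while (x_min < x_max)
        {
            int pivot = (x_min + x_max) >> 1;
            if (a[pivot] <= b[diagonal - pivot - 1])
                x_min = pivot + 1;          // keep top-right half of the range
            else
                x_max = pivot;              // keep bottom-left half of the range
        }
        return { std::min(x_min, (int)a.size()), diagonal - x_min };
    }

    int main()
    {
        std::vector<int> a = {0, 10, 20, 30};
        std::vector<int> b = {1, 2, 3, 4, 5, 6, 7, 8};
        int p     = 3;                                  // number of workers
        int total = (int)(a.size() + b.size());         // 12 merge items
        int share = (total + p - 1) / p;                // 4 items per worker

        std::vector<int> out(total);
        for (int w = 0; w < p; ++w)                     // each worker is independent
        {
            Coord c = MergePathSearch(std::min(w * share, total), a, b);
            for (int k = w * share; k < std::min((w + 1) * share, total); ++k)
            {
                // Sequential merge step, seeded at the worker's starting coordinate
                if (c.y >= (int)b.size() || (c.x < (int)a.size() && a[c.x] <= b[c.y]))
                    out[k] = a[c.x++];                  // move right (consume from A)
                else
                    out[k] = b[c.y++];                  // move down (consume from B)
            }
        }
        for (int v : out) printf("%d ", v);             // 0 1 2 3 4 5 6 7 8 10 20 30
        printf("\n");
        return 0;
    }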

Fig. 7. An example merge-path visualization for the parallel merger of two sorted lists by three threads. Regardless of array content, the path is twelve items long and threads consume exactly four items each: (a) the "merge path" traverses a 2D grid from top-left to bottom-right; (b) the grid is evenly partitioned by "diagonals", with a 2D search along each diagonal for starting coordinates; (c) each thread runs the sequential merge algorithm from its starting coordinates.

Fig. 8. An example merge-based CsrMV visualization for the sparse matrix presented in Fig. 1. Threads consume exactly four CSR items each, where an item is either a nonzero or a row-offset: (a) the row end-offsets are logically merged with the indices of the matrix nonzeros; (b) the grid is evenly partitioned by "diagonals", with a 2D search along each diagonal for starting coordinates; (c) each thread accumulates nonzeros when moving downward and outputs to y when moving rightward.

C. CPU implementation

To illustrate the simplicity of this method, we present our parallel OpenMP C++ merge-based CsrMV implementation in its entirety in Algorithm 2. Lines 3-6 establish the merger lists and merge-path lengths (total and per-thread). Entering the parallel section, lines 16-21 identify the thread's starting and ending diagonals and then search for the corresponding 2D starting and ending coordinates within the merge grid. (We use a counting iterator to supply the elements of ℕ.) In lines 24-32, each thread effectively executes the sequential CsrMV Algorithm 1 along its share of the merge path. Lines 35-36 accumulate any nonzeros for a partial row shared with the next thread. Lines 39-40 save the thread's running total and row-id for subsequent fix-up. Back in the sequential phase, lines 44-46 update the values in y for rows that span multiple threads.

To underscore the importance of workload balance, our implementation performs no architecture-specific optimizations other than affixing thread affinity to prevent migration across cores and sockets. We do not explicitly incorporate software pipelining, branch elimination, SIMD intrinsics, advanced pointer arithmetic, prefetching, register-blocking, cache-blocking, TLB-blocking, matrix-blocking, or index compression.

D. CPU NUMA Optimization

We did, however, explore NUMA opportunities for speedup when consuming datasets too large to fit in aggregate last-level cache. As is typical with most multi-socket CPU systems, each socket on our Xeon E5-2690v2 platform provisions its own channels of off-chip physical memory into the shared virtual address space. For applications willing to engage in a two-phase "inspector/executor" relationship with our CsrMV implementation, our NUMA-aware variant will perform a one-time merge-path search to identify the sections of the values

ALGORITHM 2. The parallel merge-based CsrMV algorithm (C++ OpenMP)
Input: Number of parallel threads, CSR matrix A, dense vector x
Output: Dense vector y such that y ← Ax

 1  void OmpMergeCsrmv(int num_threads, const CsrMatrix& A, double* x, double* y)
 2  {
 3      int* row_end_offsets = A.row_offsets + 1;                   // Merge list A: row end-offsets
 4      CountingInputIterator<int> nz_indices(0);                   // Merge list B: natural numbers (NZ indices)
 5      int num_merge_items = A.num_rows + A.num_nonzeros;          // Merge path total length
 6      int items_per_thread = (num_merge_items + num_threads - 1) / num_threads;   // Merge items per thread
 7
 8      int row_carry_out[num_threads];
 9      double value_carry_out[num_threads];
10
11      // Spawn parallel threads
12      #pragma omp parallel for schedule(static) num_threads(num_threads)
13      for (int tid = 0; tid < num_threads; tid++)
14      {
15          // Find starting and ending MergePath coordinates (row-idx, nonzero-idx) for each thread
16          int diagonal     = min(items_per_thread * tid, num_merge_items);
17          int diagonal_end = min(diagonal + items_per_thread, num_merge_items);
18          CoordinateT thread_coord     = MergePathSearch(diagonal, row_end_offsets, nz_indices,
19                                                         A.num_rows, A.num_nonzeros);
20          CoordinateT thread_coord_end = MergePathSearch(diagonal_end, row_end_offsets, nz_indices,
21                                                         A.num_rows, A.num_nonzeros);
22
23          // Consume merge items, whole rows first
24          double running_total = 0.0;
25          for (; thread_coord.x < thread_coord_end.x; ++thread_coord.x)
26          {
27              for (; thread_coord.y < row_end_offsets[thread_coord.x]; ++thread_coord.y)
28                  running_total += A.values[thread_coord.y] * x[A.column_indices[thread_coord.y]];
29
30              y[thread_coord.x] = running_total;
31              running_total = 0.0;
32          }
33
34          // Consume partial portion of thread's last row
35          for (; thread_coord.y < thread_coord_end.y; ++thread_coord.y)
36              running_total += A.values[thread_coord.y] * x[A.column_indices[thread_coord.y]];
37
38          // Save carry-outs
39          row_carry_out[tid] = thread_coord_end.x;
40          value_carry_out[tid] = running_total;
41      }
42
43      // Carry-out fix-up (rows spanning multiple threads)
44      for (int tid = 0; tid < num_threads - 1; ++tid)
45          if (row_carry_out[tid] < A.num_rows)
46              y[row_carry_out[tid]] += value_carry_out[tid];
47  }
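As a worked illustration of this listing on the Fig. 1 matrix with two threads: num_merge_items = 4 + 8 = 12 and items_per_thread = 6, and the merge-path searches yield coordinates (0,0) and (2,4) for thread 0 and (2,4) and (4,8) for thread 1. Thread 0 therefore consumes two row end-offsets (rows 0 and 1) plus nonzeros 0-3, writing y[0] and y[1] and carrying out a partial sum for row 2; thread 1 consumes two row end-offsets (rows 2 and 3) plus nonzeros 4-7, writing y[2] (from an empty accumulation) and y[3]. The sequential fix-up then adds thread 0's carry-out into y[2]. Each thread touches exactly six merge items even though the final row holds half of all nonzeros.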

and column-indices arrays that will be consumed by each thread, and then instruct the operating system to migrate those pages onto the NUMA node where those threads will run. The algorithm and CSR data structures remain unchanged through this preprocessing step.

E. GPU Implementation

We implemented our GPU version in CUDA C++ using a two-level merge-path parallelization to accommodate the hierarchical GPU processor organization [32]. At the coarsest level, the merge path is equitably divided among thread blocks, of which we invoke only as many as needed to fully occupy the GPU's multiprocessors (a finite constant uncorrelated to problem size). Each thread block then proceeds to consume its share of the merge path in fixed-size path-chunks. Path-chunk length is determined by the amount of local storage resources available to each thread block. For example, a path-chunk of 896 items = 128 threads-per-block * 7 items-per-thread.

Threads can then locally search for their 2D-path coordinates relative to the thread block's current path coordinate. The search range of this second level of path processing is restricted to a fixed-size tile of chunk-items x chunk-items of the merge grid, so overall work complexity is unaffected. The thread block then copies the corresponding regions of row-offsets (list A) and matrix-value dot-products (list B) into local shared memory with efficient strip-mined, coalesced loads, forming local-row-offsets and local-nonzeros.

After the sublists are copied to local shared memory, threads independently perform sequential CsrMV, each consuming exactly items-per-thread. However, the doubly-nested loop in lines 24-32 of Algorithm 2 is inefficient for SIMD predication. As shown in Algorithm 4, we can restructure the nest as a single loop that runs for items-per-thread iterations for all threads.

The two levels of merge-path parallelization require two levels of "fix-up". At the thread level, we use block-wide reduce-value-by-key primitives from the CUB library [33]. At the thread block level, we use device-wide reduce-value-by-key primitives, also from the CUB library.

We emphasize that our merge-based method allows us to make maximal utilization of the GPU's fixed-size shared memory resources, regardless of the ratio between row-offsets and nonzeros consumed during each path-chunk. This is not possible for row-based or nonzero-splitting strategies that are unable to provide tight bounds on local workloads.
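A minimal host-side sketch of the fixed-size chunking arithmetic described above (illustrative; the constants and structure names are assumptions, not the actual CUDA kernel configuration), using the 128-thread, 7-item example from the text:

    #include <algorithm>

    // Coarse-grained bookkeeping for the two-level GPU decomposition: the merge
    // path is cut into fixed-size chunks, and a bounded grid of thread blocks
    // consumes those chunks.
    struct GpuSchedule
    {
        int items_per_chunk;    // merge items per thread block per chunk
        int num_chunks;         // total path chunks for this matrix
        int num_blocks;         // thread blocks to launch
    };

    GpuSchedule MakeSchedule(int num_rows, int num_nonzeros, int resident_blocks)
    {
        const int BLOCK_THREADS    = 128;   // illustrative tile shape
        const int ITEMS_PER_THREAD = 7;     // 128 * 7 = 896-item path-chunks

        GpuSchedule s;
        s.items_per_chunk   = BLOCK_THREADS * ITEMS_PER_THREAD;
        int num_merge_items = num_rows + num_nonzeros;            // |A| + |B|
        s.num_chunks = (num_merge_items + s.items_per_chunk - 1) / s.items_per_chunk;
        s.num_blocks = std::min(resident_blocks, s.num_chunks);   // never more than needed
        return s;
    }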

ALGORITHM 3: 2D merge-path search
Input: Diagonal index, lengths of lists A and B, iterators (pointers) to lists A and B
Output: The 2D coordinate (x,y) of the intersection of the merge decision path with the specified grid diagonal

    CoordinateT MergePathSearch(
        int diagonal, int a_len, int b_len,
        AIteratorT a, BIteratorT b)
    {
        // Diagonal search range (in x coordinate space)
        int x_min = max(diagonal - b_len, 0);
        int x_max = min(diagonal, a_len);

        // 2D binary-search along the diagonal search range
        while (x_min < x_max) {
            OffsetT pivot = (x_min + x_max) >> 1;
            if (a[pivot] <= b[diagonal - pivot - 1]) {
                // Keep top-right half of diagonal range
                x_min = pivot + 1;
            } else {
                // Keep bottom-left half of diagonal range
                x_max = pivot;
            }
        }

        return CoordinateT(
            min(x_min, a_len),      // x coordinate in A
            diagonal - x_min);      // y coordinate in B
    }

ALGORITHM 4: Re-writing the inner loops of Algorithm 2 (lines 24-33) as a single loop

    ...
    // Consume exactly items-per-thread merge items
    double running_total = 0.0;
    for (int i = 0; i < items_per_thread; ++i)
    {
        if (nz_indices[thread_coord.y] < row_end_offsets[thread_coord.x])
        {
            // Move down (accumulate)
            running_total += A.values[thread_coord.y] *
                             x[A.column_indices[thread_coord.y]];
            ++thread_coord.y;
        }
        else
        {
            // Move right (output row-total and reset)
            y[thread_coord.x] = running_total;
            running_total = 0.0;
            ++thread_coord.x;
        }
    }
    ...

TABLE 2: SpMV implementations under evaluation

    Implementation                  Format (decomposition)                        Platforms   Precision        Autotuning     Avg. preprocessing
                                                                                                               (per matrix)   overhead (per matrix)
    Merge-based CsrMV               CSR (merge)                                   CPU, GPU    double, single   no             -
    NUMA merge-based CsrMV          CSR (merge)                                   CPU         double           no             44x
    MKL CsrMV                       CSR (row-based)                               CPU         double           no             -
    Compressed Sparse Blocks
      CsbMV [15], [35]              COO-of-COO (block-row splitting)              CPU         double           no             -
    cuSPARSE CsrMV                  CSR (vectorized row-based)                    GPU         double           no             -
    NVIDIA cuSPARSE HybMV [21]      ELL + COO (row-based + segmented reduction)   GPU         double           no             19x
    pOSKI SpMV [16]                 multi-level block-compressed CSR (opaque)     CPU         double           yes            484x
    yaSpMV BccooMV [17]             block-compressed COO (segmented reduction)    GPU         single           yes            155,000x

IV. EVALUATION

We primarily compare against the CsrMV implementations provided by Intel's Math Kernel Library v11.3 [10] and NVIDIA's cuSPARSE library v7.5 [11]. To highlight the competitiveness of our merge-based CsrMV method in the absolute, we also compare against several non-CSR formats expressly designed to tolerate row-length variation: CSB [15], HYB [7], pOSKI [16], and yaSpMV [17]. The properties of these implementations are further detailed in Table 2. We define preprocessing overhead as a one-time inspection, encoding, or tuning activity whose running time is counted separately from any subsequent SpMV computations. We normalize it as the ratio between preprocessing time versus a single SpMV running time. We do not count our merge-based binary searching and fix-up as preprocessing, but rather as part of SpMV running time. However, we do report our NUMA page migration as preprocessing overhead.

Our corpus of sparse test matrices is comprised of the 4,201 non-vector, non-complex matrices currently catalogued by the University of Florida Sparse Matrix Collection [31].

Our evaluation hardware is comprised of a dual-socket NUMA CPU platform with two Intel Xeon E5-2690v2 CPUs and one NVIDIA Tesla K40 GPU. Each CPU is comprised of 10 cores with two-way hyper-threading (40 threads total) and 25MB L3 cache (50MB total). Together, they achieve a Stream Triad [34] score of 77.9 GB/s. The K40 is comprised of 15 SMs capable of concurrently executing 960 warps of 32 threads each (30k threads total), provides 1.5MB of L2 cache and 720KB of texture cache (2.3MB aggregate cache), and a Stream Triad score of 249 GB/s (ECC off).

A. Performance consistency

Relative performance consistency is highlighted in the performance landscapes of Fig. 3 and Fig. 4, which plot running time as a function of matrix size measured by nonzero count (nnz). The presence of significant outliers is readily apparent for both MKL and cuSPARSE CsrMV parallelizations. Our merge-based strategy visibly conveys a more consistent performance response. These observations are also validated statistically. Fig. 9a presents the degree to which each implementation is able to decouple SpMV throughput from row-length irregularity. Relative to MKL and cuSPARSE CsrMV, our merge-based performance is much less associated with row-length irregularity, and even improves upon the specialized GPU formats.

Furthermore, Fig. 9b presents the degree to which each implementation conforms to the general expectation of a linear correspondence between SpMV running time and matrix size. This general metric incorporates all aspects of the SpMV computation that might affect performance predictability, such as cache hierarchy, nonzero locality, etc. Our GPU performance is significantly more predictable than that of the other GPU implementations, and we roughly match or exceed the predictability of the other CPU implementations.
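The consistency metrics of Fig. 9 are correlation coefficients over the 4,201 datasets. For reference, a standard sample Pearson correlation is sketched below (the exact estimator used in the paper is not specified, so this is an assumption):

    #include <cmath>
    #include <vector>

    // Sample Pearson correlation between two equal-length series, e.g.,
    // per-dataset GFLOPs vs. row-length coefficient of variation (Fig. 9a),
    // or elapsed running time vs. nnz (Fig. 9b).
    double Pearson(const std::vector<double>& u, const std::vector<double>& v)
    {
        int n = (int)u.size();
        double mu = 0.0, mv = 0.0;
        for (int i = 0; i < n; ++i) { mu += u[i]; mv += v[i]; }
        mu /= n; mv /= n;

        double suv = 0.0, suu = 0.0, svv = 0.0;
        for (int i = 0; i < n; ++i)
        {
            suv += (u[i] - mu) * (v[i] - mv);
            suu += (u[i] - mu) * (u[i] - mu);
            svv += (v[i] - mv) * (v[i] - mv);
        }
        return suv / std::sqrt(suu * svv);
    }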

Fig. 9. Metrics of SpMV performance consistency for MKL CsrMV (CPU), merge-based CsrMV (CPU), CSB SpMV (CPU), pOSKI SpMV (CPU), cuSPARSE CsrMV (GPU), merge-based CsrMV (GPU), HYB SpMV (GPU), and yaSpMV (fp32 GPU): (a) row-length imperviousness, the correlation between GFLOPs throughput and row-length variation (closer to 0.0 is better); (b) performance predictability, the correlation between elapsed running time and matrix size nnz (closer to 1.0 is better).


Fig. 10. Relative speedup of merge-based CsrMV across the U.Fl. sparse matrices, arranged by nnz: (a) vs. MKL (small matrices < 6M nnz, 4,013 datasets); (b) vs. MKL (large matrices > 6M nnz, 188 datasets); (c) vs. cuSPARSE (small matrices < 300K nnz, 3,364 datasets); (d) vs. cuSPARSE (large matrices > 300K nnz, 837 datasets). Summary statistics:

                                                             Max       Min       Small-matrix    Large-matrix    Harmonic
                                                             speedup   speedup   harmonic mean   harmonic mean   mean
                                                                                 speedup         speedup         speedup
    CPU merge-based CsrMV        vs. MKL CsrMV               15.8      0.51      1.22            1.06            1.21
                                 vs. CSB SpMV                445       0.65      9.21            1.09            6.58
                                 vs. pOSKI SpMV              24.4      0.59      11.0            1.10            7.86
    CPU NUMA merge-based CsrMV   vs. MKL CsrMV               15.7      0.50      1.25            1.58            1.26
                                 vs. CSB SpMV                464       0.87      9.63            1.66            7.65
                                 vs. pOSKI SpMV              24.0      1.07      11.5            1.72            9.14
    GPU merge-based CsrMV        vs. cuSPARSE CsrMV          198       0.34      0.79            1.13            0.84
                                 vs. cuSPARSE HybMV          5.96      0.24      1.41            0.96            1.29
                                 vs. yaSpMV BccooMV (fp32)   2.43      0.39      0.78            0.75            0.78

We note that our merge-based CsrMV is not entirely free of performance variation among similarly-sized matrices. These irregularities are primarily the result of differing cache responses to different access patterns within the dense vector x.

B. Performance throughput

Fig. 10 presents relative SpMV speedup. Our CPU merge-based methods achieve up to 15.8x speedup versus MKL CsrMV on highly irregular matrices. We generally match or outperform MKL, particularly for smaller datasets that fit within aggregate last-level cache (nnz < 6M). Our NUMA-aware variant is better able to extend this advantage to large datasets that cannot be captured on-chip, and is 26% faster than MKL on average. Furthermore, our CsrMV is capable of approximating the consistency of the CSB and pOSKI implementations while broadly improving upon their performance. This is particularly true for small problems where the added complexity of these specialized formats is not hidden by the latencies of off-chip memory accesses.

Fig. 11. CsrMV throughput (GFLOPs, double) for commonly-evaluated large-matrix datasets: (a) multi-socket CPU (E5-2690v2 x 2), comparing MKL CsrMV, merge-based CsrMV, and NUMA merge-based CsrMV; (b) single-socket GPU (K40), comparing cuSPARSE CsrMV and merge-based CsrMV.

The wider parallelism of the GPU amplifies the detrimental effects of workload imbalance, where our CsrMV achieves up to 198x speedup versus cuSPARSE. For the K40 GPU, the threshold for being captured within aggregate cache is much smaller (nnz < 300K). In general, we are 13% faster than cuSPARSE for medium and large-sized datasets that do not fit in cache.

However, our CsrMV can be up to 50% slower than cuSPARSE for small datasets that align well with its vectorized, row-based method. The high latencies of the GPU can be problematic for our CsrMV on small matrices (towards which the Florida repository is biased) where the basic SpMV workload is not sufficient to overcome our method-specific overheads of (1) merge-path coordinate search and (2) a second kernel for inter-thread fix-up. For comparison, the cuSPARSE CsrMV is able to process more than half of the datasets in under 15μs, whereas the latency of an empty kernel invocation alone is approximately 5μs.

For small problems, the HybMV implementation performs worse than our merge-based CsrMV because its fix-up stage incurs even higher latencies than ours. For larger, non-cacheable datasets, our performance is comparable with the HybMV implementation.

In comparison to yaSpMV BCCOO, our fp32 version of merge-based CsrMV is slower by 22% on average. This exceeds our expectations, given that BCCOO is able to compress away nearly 50% of all indexing data.

C. Curated datasets

Fig. 11 compares CsrMV performance across commonly-evaluated, large-sized matrices [26], [35]. For the CPU, these matrices are large enough to exceed aggregate last-level cache and comprise substantially more rows than hardware threads. Although the row-based MKL CsrMV can benefit from the inherent workload balancing aspects of oversubscription, our merge-based CsrMV still demonstrates substantial speedup for ultra-irregular matrices (Wikipedia and Circuit5M). Furthermore, our NUMA-aware CsrMV performs well in this regime because matrix data is not being transferred across CPU sockets. Compared to MKL and our standard merge-based CsrMV, the NUMA benefit provides average speedups of 1.7x and 1.4x, respectively.

For the GPU, the performance benefits of balanced computation typically outweigh the merge-specific overheads for searching and fix-up, and we achieve speedups of up to 46x (for Circuit5M) and a harmonic mean speedup of 1.26x. However, cuSPARSE CsrMV performs marginally better for select matrices having row-lengths that align well with the architecture's SIMD width (dense4k, bone010, Rucci1).

Finally, in comparing our GPU CsrMV versus our regular and NUMA-optimized CPU implementations, the single K40 outperforms our two-socket Xeon system on average by 2.2x and 1.5x, respectively. (These speedups are lower than the 3.2x bandwidth advantage of the K40 GPU because the larger CPU cache hierarchy is better suited for capturing accesses to the dense vector x.)

V. CONCLUSION

As modern processors continue to exhibit wider parallelism, workload imbalance can quickly become the high-order performance limiter for segmented computations such as CSR SpMV. To make matters worse, data-dependent performance degradations often pose significant challenges for applications that require predictable performance response.

In this work, we have adapted merge-based parallel decomposition for computing a well-balanced SpMV directly on CSR matrices without offline analysis, format conversion, or the construction of side-band data. This decomposition is particularly useful for bounding workloads across multi-scale processors and systems with fixed-size local memories. To the best of our knowledge, no prior work achieves these goals.

Furthermore, we have conducted the broadest SpMV evaluation to date to demonstrate the practical shortcomings of existing CsrMV implementations on real-world data. Our study reveals that contemporary CsrMV methods are inconsistent performers, whereas the performance response of our method is substantially impervious to row-length irregularity. The importance of workload balance is further underscored by the simplicity of our implementation. Even in the absence of architecture-specific optimization strategies, our method is generally capable of matching or exceeding the performance of contemporary CsrMV implementations.

Finally, our merge-based decomposition is orthogonal to (and yet supportive of) bandwidth optimization strategies such as index compression, blocking, and relabeling. In fact, the elimination of workload imbalance from our method is likely to amplify the benefits of these techniques.

REFERENCES

[1] J. Kepner, D. A. Bader, A. Buluç, J. R. Gilbert, T. G. Mattson, and H. Meyerhenke, "Graphs, Matrices, and the GraphBLAS: Seven Good Reasons," CoRR, vol. abs/1504.01039, 2015.

[2] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics, 2011.

[3] M. Mohri, "Semiring Frameworks and Algorithms for Shortest-distance Problems," J. Autom. Lang. Comb., vol. 7, no. 3, pp. 321–350, Jan. 2002.

[4] R. W. Vuduc, "Automatic Performance Tuning of Sparse Matrix Kernels," Ph.D. Dissertation, University of California, Berkeley, 2003.

[5] S. Williams, N. Bell, J. W. Choi, M. Garland, L. Oliker, and R. Vuduc, "Sparse Matrix-Vector Multiplication on Multicore and Accelerators," in Scientific Computing with Multicore and Accelerators, J. Kurzak, D. A. Bader, and J. Dongarra, Eds. Taylor & Francis, 2011, pp. 83–109.

[6] S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo, "Sparse Matrix-Vector Multiplication on GPGPUs," ACM Trans. Math. Softw., to appear.

[7] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, New York, NY, USA, 2009, pp. 18:1–18:11.

[8] J. L. Greathouse and M. Daga, "Efficient Sparse Matrix-vector Multiplication on GPUs Using the CSR Storage Format," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Piscataway, NJ, USA, 2014, pp. 769–780.

[9] A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan, "Fast Sparse Matrix-vector Multiplication on GPUs for Graph Applications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Piscataway, NJ, USA, 2014, pp. 781–792.

[10] Intel Math Kernel Library (MKL) v11.3. Intel Corporation, 2015.

[11] NVIDIA cuSPARSE v7.5. NVIDIA Corporation, 2013.

[12] S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk, "Merge Path - Parallel Merging Made Simple," in Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, Washington, DC, USA, 2012, pp. 1611–1618.

[13] S. Baxter and D. Merrill, "Efficient Merge, Search, and Set Operations on GPUs," Mar-2013. [Online]. Available: http://on-demand.gputechconf.com/gtc/2013/presentations/S3414-Efficient-Merge-Search-Set-Operations.pdf.

[14] N. Deo, A. Jain, and M. Medidi, "An optimal parallel algorithm for merging using multiselection," Inf. Process. Lett., vol. 50, no. 2, pp. 81–87, Apr. 1994.

[15] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks," in Proc. SPAA, Calgary, Canada, 2009.

[16] A. Jain, "pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures," Master's Thesis, University of California at Berkeley, 2008.

[17] S. Yan, C. Li, Y. Zhang, and H. Zhou, "yaSpMV: Yet Another SpMV Framework on GPUs," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 2014, pp. 107–118.

[18] G. E. Blelloch, M. A. Heroux, and M. Zagha, "Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors," School of Computer Science, Carnegie Mellon University, CMU-CS-93-173, Aug. 1993.

[19] D. Merrill, "Allocation-oriented Algorithm Design with Application to GPU Computing," Ph.D. Dissertation, University of Virginia, 2011.

[20] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, "Scan primitives for GPU computing," in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, Aire-la-Ville, Switzerland, 2007, pp. 97–106.

[21] N. Bell and M. Garland, "Efficient Sparse Matrix-Vector Multiplication on CUDA," NVIDIA Corporation, NVIDIA Technical Report NVR-2008-004, Dec. 2008.

[22] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures," in High Performance Embedded Architectures and Compilers, vol. 5952, Y. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds. Springer Berlin Heidelberg, 2010, pp. 111–125.

[23] W. Tang, W. Tan, R. Goh, S. Turner, and W. Wong, "A Family of Bit-Representation-Optimized Formats for Fast Sparse Matrix-Vector Multiplication on the GPU," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1–1, 2014.

[24] W. T. Tang, W. J. Tan, R. Ray, Y. W. Wong, W. Chen, S. Kuo, R. S. M. Goh, S. J. Turner, and W.-F. Wong, "Accelerating Sparse Matrix-vector Multiplication on GPUs Using Bit-representation-optimized Schemes," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, New York, NY, USA, 2013, pp. 26:1–26:12.

[25] B.-Y. Su and K. Keutzer, "clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs," in Proceedings of the 26th ACM International Conference on Supercomputing, New York, NY, USA, 2012, pp. 353–364.

[26] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, "Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms," in Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2007, pp. 38:1–38:12.

[27] S. Dalton, S. Baxter, D. Merrill, L. Olson, and M. Garland, "Optimizing Sparse Matrix Operations on GPUs Using Merge Path," in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, 2015, pp. 407–416.

[28] S. Baxter, Modern GPU. NVIDIA Research, 2013.

[29] A. Buluç and J. R. Gilbert, "On the representation and multiplication of hypersparse matrices," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, 2008, pp. 1–11.

[30] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A Recursive Model for Graph Mining," in SIAM International Conference on Data Mining, 2004.

[31] T. Davis and Y. Hu, "University of Florida Sparse Matrix Collection." [Online]. Available: http://www.cise.ufl.edu/research/sparse/matrices/. [Accessed: 11-Jul-2011].

[32] NVIDIA, "CUDA." [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html. [Accessed: 25-Aug-2011].

[33] D. Merrill, CUB v1.5.3: CUDA Unbound, a library of warp-wide, block-wide, and device-wide GPU parallel primitives. NVIDIA Research, 2015.

[34] J. D. McCalpin, "Memory Bandwidth and Machine Balance in Current High Performance Computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25, Dec. 1995.

[35] X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, "Efficient Sparse Matrix-vector Multiplication on x86-based Many-core Processors," in Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, New York, NY, USA, 2013, pp. 273–282.

APPENDIX

A. Abstract

This artifact comprises the source code, datasets, and build instructions on GitHub that can be used to reproduce our merge-based CsrMV results (alongside those of Intel MKL and NVIDIA cuSPARSE CsrMV) presented in our SC'2016 paper Merge-based Parallel Sparse Matrix-Vector Multiplication.

B. Description

1) Check-list (artifact meta information):

• Algorithm: sparse matrix-vector multiplication
• Program: CUDA and C/C++ OpenMP code
• Compilation: CPU OpenMP code: Intel icc (v16.0.1 as tested); GPU CUDA code: NVIDIA nvcc and GNU gcc (v7.5.17 and v4.4.7 as tested, respectively)
• Binary: CUDA and OpenMP executables
• Data set: Publicly available matrix market files
• Run-time environment: CentOS with CUDA toolkit/driver and Intel Parallel Studio XE 2016 (v6.4, v7.5, and v2016.0.109 as tested, respectively)
• Hardware: Any CUDA GPU with compute capability at least 3.0 (NVIDIA K40 as tested); any Intel CPU (Xeon CPU E5-2695 v2 @ 2.40GHz as tested)
• Output: Matrix dataset statistics, elapsed running time, GFLOPs throughput
• Experiment workflow: git clone projects; download the datasets; run the test scripts; observe the results
• Publicly available?: Yes

2) How delivered

The CPU-based and GPU-based CsrMV artifacts can be found in the merge-spmv open source project hosted on GitHub. The software comprises code, build, and evaluation instructions, and is provided under BSD license.

3) Hardware dependences

For adequate reproducibility, we suggest an NVIDIA GPU with compute capability at least 3.5 and an Intel CPU with AVX or wider vector extensions.

4) Software dependences

The CPU-based CsrMV evaluation requires the Intel C++ compiler and Math Kernel Library, both of which are included with Intel Parallel Studio. The GPU-based CsrMV evaluation requires the CUDA GPU driver, nvcc CUDA compiler, and cuSPARSE library, all of which are included with the CUDA Toolkit. Both artifacts have been tested on CentOS 6.4 and Ubuntu 12.04/14.04, and are expected to run correctly under other Linux distributions.

5) Datasets

At the time of writing, our matrix parsers only support input files encoded using matrix market format. All matrix market datasets used in this evaluation are publicly available from the Florida Sparse Matrix Repository. Datasets can be downloaded individually from the UF website:

$ https://www.cise.ufl.edu/research/sparse/matrices/

Additionally, the merge-spmv project provides users with the script get_uf_datasets.sh that will download and unpack the entire corpus used in this evaluation. (Warning: this may be several hundred gigabytes.)

C. Installation

First you must clone the merge-spmv code to the local machine:

$ git clone https://github.com/dumerrill/merge-spmv.git

Then you must use GNU make to build the CPU and GPU test drivers:

$ cd merge-spmv
$ make cpu-spmv
$ make gpu-spmv sm=<cuda-arch, e.g., 350>

Finally, you can (optionally) download and unpack the entire UF repository:

$ ./get_uf_datasets.sh <path/to/dataset-dir>

D. Experimental workflow

Before running any experiments, users will need to export the Intel OpenMP environment variable that establishes thread-affinity (one-time):

$ export KMP_AFFINITY=granularity=core,scatter

You can run the CPU program on the specified dataset:

$ ./cpu-spmv --mtx=<path/to/dataset.mtx>

Or run the GPU program on the specified dataset:

$ ./gpu-spmv --mtx=<path/to/dataset.mtx>

(Note: use the --help commandline option to see extended commandline usage, including options for specifying the number of timing iterations and indicating which GPU to use in multi-GPU systems.)

Finally you can run a side-by-side evaluation on the entire corpus:

$ ./eval-csrmv.sh <driver-prog> <path/to/dataset-dir> > output.csv

E. Evaluation and expected result

The expected results include matrix statistics (dimensions, nonzero count, coefficient of row-length variation), elapsed running time, and GFLOPs throughput.

F. Notes

For up-to-date instructions, please visit the merge-spmv project's GitHub page (https://github.com/dumerrill/merge-spmv).

