Parallel and Scalable Sparse Basic Linear Algebra Subprograms
FACULTY OF SCIENCE
PhD Thesis
Weifeng Liu
Submitted: 26/10/2015
To Huamin and Renheng
for enduring my absence while working on the PhD and the thesis
Acknowledgments
Abstract
Sparse basic linear algebra subprograms (BLAS) are fundamental building blocks for
numerous scientific computations and graph applications. Compared with Dense
BLAS, parallelization of Sparse BLAS routines entails extra challenges due to the irreg-
ularity of sparse data structures. This thesis proposes new fundamental algorithms
and data structures that accelerate Sparse BLAS routines on modern massively paral-
lel processors: (1) a new heap data structure named ad-heap, for faster heap operations
on heterogeneous processors, (2) a new sparse matrix representation named CSR5, for
faster sparse matrix-vector multiplication (SpMV) on homogeneous processors such
as CPUs, GPUs and Xeon Phi, (3) a new CSR-based SpMV algorithm for a variety of
tightly coupled CPU-GPU heterogeneous processors, and (4) a new framework and
associated algorithms for sparse matrix-matrix multiplication (SpGEMM) on GPUs
and heterogeneous processors.
The thesis compares the proposed methods with state-of-the-art approaches on
six homogeneous and five heterogeneous processors from Intel, AMD and nVidia.
Using a benchmark suite of 38 sparse matrices in total, the experimental results show
that the proposed methods obtain significant performance improvements over the best
existing algorithms.
Resumé
Contents

1 Introduction
  1.1 Organization of The Thesis
  1.2 Author’s Publications

I Foundations

2 Sparsity and Sparse BLAS
  2.1 What Is Sparsity?
    2.1.1 A Simple Example
    2.1.2 Sparse Matrices
  2.2 Where Are Sparse Matrices From?
    2.2.1 Finite Element Methods
    2.2.2 Social Networks
    2.2.3 Sparse Representation of Signals
  2.3 What Are Sparse BLAS?
    2.3.1 Level 1: Sparse Vector Operations
    2.3.2 Level 2: Sparse Matrix-Vector Operations
    2.3.3 Level 3: Sparse Matrix-Matrix Operations
  2.4 Where Does Parallelism Come From?
    2.4.1 Fork: From Matrix to Submatrix
    2.4.2 Join: From Subresult to Result
  2.5 Challenges of Parallel and Scalable Sparse BLAS
    2.5.1 Indirect Addressing
    2.5.2 Selection of Basic Primitives
    2.5.3 Data Decomposition, Load Balancing and Scalability
    2.5.4 Sparse Output of Unknown Size

3 Parallelism in Architectures
  3.1 Overview
  3.2 Multicore CPU
  3.3 Manycore GPU
  3.4 Manycore Coprocessor and CPU
1. Introduction

In the past few decades, basic linear algebra subprograms (BLAS) [24, 60, 61] have
been attracting attention in many fields of scientific computing and simulation. When
the inputs of BLAS routines are large and sparse, exploiting their sparsity can
significantly reduce runtime and space complexity. For this case, a set of new routines,
called Sparse BLAS [68, 70, 71, 64], has been designed. Because a large number of
real-world applications can benefit from exploiting sparsity, designing parallel and
scalable data structures and algorithms for Sparse BLAS has become an important
research area in the era of massively parallel processing.
This thesis presents the author’s parallel and scalable data structures and algo-
rithms for Sparse BLAS on modern multicore, manycore and heterogeneous pro-
cessors. Specifically, four main contributions have been made: (1) a new heap data
structure named ad-heap, for faster heap operations on heterogeneous processors, (2)
a new sparse matrix representation named CSR5, for faster sparse matrix-vector mul-
tiplication (SpMV) on homogeneous processors such as CPUs, GPUs and Xeon Phi, (3)
a new CSR-based SpMV algorithm for a variety of tightly coupled CPU-GPU hetero-
geneous processors, and (4) a new framework and associated algorithms for sparse
matrix-matrix multiplication (SpGEMM) on GPUs and heterogeneous processors.
Part III, composed of three chapters, gives the author’s algorithms for Sparse
BLAS routines on modern homogeneous and heterogeneous processors. Chapter 7
focuses on several methods for adding two sparse vectors, which is the main building
block for the more complex SpGEMM operation. Chapter 8 describes the author’s
two SpMV algorithms for sparse matrices in the CSR format and in the CSR5 format.
Chapter 9 describes the author’s approach for SpGEMM, the most complex Sparse
BLAS operation.
The last chapter concludes the thesis and suggests future work.
Appendix A lists detailed information on the 38 sparse matrices used as the benchmark
suite in this thesis. Appendix B describes the experimental platforms used.
6. Huamin Ren, Weifeng Liu, Søren Ingvor Olsen, Sergio Escalera, Thomas B.
Moeslund. “Unsupervised Behavior-Specific Dictionary Learning for Abnormal
Event Detection”. 26th British Machine Vision Conference (BMVC ’15), 2015.
pp. 28.1–28.13. ISBN 1-901725-53-7. (Reference [152])
Part I
Foundations
2. Sparsity and Sparse BLAS
Figure 2.1: A group of ordered balls of different colors and their representation.
Now assume that the gray balls are unimportant in a certain scenario and we
want to save memory space in the computer system, so that only the red, green and blue
balls need to be stored. The expected storage form is the array shown in Figure 2.2.
However, setting aside low-level details of modern computer storage systems and complex
data structures such as linked lists, typical computer memory is organized in linear
order and thus cannot “label” unused entries in a hollow data structure. In other words,
by using the single array, all “gray” entries still occupy memory space to guarantee
correct indexing.
To avoid wasting memory space for the gray balls, a one-dimensional integer array
can be introduced to store indices of the red, green and blue balls. We can see that the
indices are implicitly given in the first example, but in the second, they are explicitly
Figure 2.2: The expected storage form when the gray balls are not important.
stored. Therefore, all “gray” entries can be removed from the original color array, and
we still know the original positions of the other entries. Figure 2.3 gives the new
storage form composed of two arrays rather than one. We can search for index 4 (as
input) in the index array to find that the 5th ball is at position 2 (as output). In other
words, we now know that position 2 of the index array stores index 4. Afterwards, it
is easy to find that position 2 (as new input) of the color array is green (as new output).
Similarly, when we search for index 6 in the index array and find that it is not in the
array, we can say that the 7th ball is gray.
Figure 2.3: Red, green and blue entries and their indices are stored in two arrays.
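To make the two-array representation concrete, the following C++ sketch (a hypothetical illustration; the stored positions and colors are made up and do not correspond exactly to Figure 2.3) keeps only the colored balls and answers a query for an original position:

    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        // Compressed storage: only the "important" balls are kept.
        // 'index' stores original positions, 'color' the values there.
        std::vector<int>  index = {0, 2, 4, 7};          // original positions (hypothetical)
        std::vector<char> color = {'r', 'b', 'g', 'r'};  // values at those positions

        int query = 4;                                   // look up original position 4
        for (std::size_t i = 0; i < index.size(); ++i) {
            if (index[i] == query) {
                std::cout << "ball " << query << " has color " << color[i] << "\n";
                return 0;
            }
        }
        std::cout << "ball " << query << " is gray (not stored)\n";
        return 0;
    }

If the index array is kept sorted, the linear search above can be replaced by a binary search, a point revisited in Section 4.5.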
By using the above representation, the input data are actually “compressed”.
Given an input array composed of n entries, each occupying α storage units, the array
needs αn storage units in total. Assuming it contains p “important” entries and n − p
“unimportant” entries, the space cost of storing the array in the compressed fashion is
(α + β)p, where β is the storage cost of the index of an entry. Because α and β are
constant factors, the compressed storage format is much more space-efficient than the
ordinary format when n is very large and $p \ll n$. Furthermore, computational cost
can also be saved if only “important” entries are involved in a procedure.
We say that a dataset is sparse if many entries of the set are “unimportant”, and
the dataset can be represented in a compressed style to avoid space and time cost of
processing its “unimportant” entries. In contrast, we say that a dataset is dense or
full if all entries are seen as “important” and stored explicitly.
Figure 2.4: A simple mesh with 2 elements and 4 vertices. The numbers outside the
mesh are indices of the four vertices. The numbers inside the mesh are local indices
of vertices of each element.
Because each element has 3 vertices, the number of local degrees of freedom
(DoF) per element is 3. In this case, each element generates a local dense matrix of
size 3 × 3 and contributes its entries to a global sparse matrix of size 4 × 4 (because
there are 4 vertices, i.e., the total number of DoFs in the mesh). This procedure is
called finite element assembly [2].
Assume that the local matrix of the first element is
$$E_0 = \begin{array}{c|ccc}
 & 0 & 1 & 2 \\ \hline
0 & a & d & g \\
1 & b & e & h \\
2 & c & f & i
\end{array},$$
where the three entries in the column vector $[0, 1, 2]^T$ to the left of the matrix will be
row indices in the global matrix, and the three entries in the row vector $[0, 1, 2]$ on
the top of the matrix will be column indices in the global matrix. Thus each entry in
the matrix knows the position to which its value should be written. For example, the
value $g$ will be stored to location $(0, 2)$ in the global matrix. After adding all entries,
we obtain a sparse global matrix
$$A = \begin{bmatrix}
a & d & g & 0 \\
b & e & h & 0 \\
c & f & i & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}.$$
Similarly, we can form a local matrix for the second element:
$$E_1 = \begin{array}{c|ccc}
 & 0 & 2 & 3 \\ \hline
0 & j & m & p \\
2 & k & n & q \\
3 & l & o & r
\end{array},$$
where the three entries in the column vector $[0, 2, 3]^T$ to the left of the matrix will be
row indices in the global matrix, and the three entries in the row vector $[0, 2, 3]$ on the
top of the matrix will be column indices in the global matrix. By adding all entries
of this local matrix to the obtained global sparse matrix, we update the $4 \times 4$ sparse
matrix to
$$A = \begin{bmatrix}
a + j & d & g + m & p \\
b & e & h & 0 \\
c + k & f & i + n & q \\
l & 0 & o & r
\end{bmatrix}.$$
The sparse matrix contains more zero entries, and thus becomes sparser, when the
domain is divided into more elements. Appendix A collects some matrices generated
from FEM applications.
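The assembly procedure can be sketched as follows in C++ (a minimal, hypothetical example: the numeric entries of the local matrices are invented, and the global matrix is stored densely only to keep the scatter-add visible; a real code would accumulate into a sparse structure):

    #include <array>
    #include <cstdio>

    int main() {
        // Global 4x4 matrix (dense here only for clarity of the example).
        std::array<std::array<double, 4>, 4> A{};

        // Local 3x3 matrices of the two elements (hypothetical values) and
        // the global indices (DoFs) of their three vertices.
        double E0[3][3] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double E1[3][3] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
        int dof0[3] = {0, 1, 2};   // element 0 touches vertices 0, 1, 2
        int dof1[3] = {0, 2, 3};   // element 1 touches vertices 0, 2, 3

        auto assemble = [&](const double (&E)[3][3], const int (&dof)[3]) {
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 3; ++j)
                    A[dof[i]][dof[j]] += E[i][j];   // scatter-add into the global matrix
        };
        assemble(E0, dof0);
        assemble(E1, dof1);

        for (auto& row : A) {
            for (double v : row) std::printf("%5.1f ", v);
            std::printf("\n");
        }
        return 0;
    }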
(a) Social network including 6 individuals. (b) Graph representation of the social network.
An arrow means “gives like to”.
Figure 2.5: A small and simple social network and its graph representation.
individual does not like anybody but is liked by all the others. This social network
can be represented by the graph shown in Figure 2.5(b).
Actually, we can directly extract a sparse matrix
$$A = \begin{bmatrix}
0 & 1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0
\end{bmatrix}$$
from Figure 2.5(a) and recognize the equivalence between the sparse matrix A and the
graph in Figure 2.5(b). When the number of social actors in a network increases, the
sparse matrix obtained from it becomes sparser and sparser. The reason is that, generally,
each individual's friend circle is only a very small proportion of the whole network.
That is
$$x = \begin{bmatrix} 2 \\ 2 \\ 3 \\ 3 \\ 5 \\ 5 \end{bmatrix} = Da' = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 2 \\ 3 \\ 5 \end{bmatrix}.$$
We can call the dense vector x a signal, and can see that even though D requires
larger memory space than B, the sparse vector $a'$ occupies less space than a. When we
need to compute or store a large number of signals, using sparse vectors like $a'$ is
more space-efficient, provided that D is kept unchanged for all signals. In this case, a large
number of sparse vectors constitute a sparse matrix A. By multiplying D with A, a
group of signals is reconstructed. This technique is widely used in applications
such as signal processing and audio/image analysis [153, 152]. The above matrix D is
called a dictionary (or overcomplete dictionary), and the sparse vector is called a sparse
approximation or sparse representation of a signal.
This operation is basically a gather operation. All nonzero entries of the sparse
vector are multiplied with the entries at the same indices in the dense vector. The sum
of the results of these multiplications is the output of the dot product operation.
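A minimal C++ sketch of this gather-style dot product is given below (the function and parameter names are illustrative, not taken from any library):

    #include <cstddef>
    #include <vector>

    // Dot product of a sparse vector (idx, val) with a dense vector y:
    // gather the dense entries at the sparse indices and accumulate.
    double sparse_dense_dot(const std::vector<int>& idx,
                            const std::vector<double>& val,
                            const std::vector<double>& y) {
        double dot = 0.0;
        for (std::size_t k = 0; k < idx.size(); ++k)
            dot += val[k] * y[idx[k]];   // gather y[idx[k]]
        return dot;
    }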
$$dot \leftarrow x_s \cdot y_s,$$
where both input vectors x and y are sparse, and the output dot is a scalar.
In this operation, a nonzero entry in an input sparse vector does not contribute to
the dot product result if its index is not found in the nonzero entry list of the other
input sparse vector. Thus the dot product of two sparse vectors can be converted into a set
intersection problem. That is, only the nonzero entries having the same index in both
inputs are multiplied. The sum of the separate results of these multiplications is then the
output of the dot product.
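Assuming both sparse vectors keep their nonzero indices sorted, the set intersection can be performed by a single merge-style pass, as in the following illustrative C++ sketch:

    #include <cstddef>
    #include <vector>

    // Dot product of two sparse vectors, each given as sorted index/value
    // arrays. Only indices present in both inputs contribute, so the loop
    // is a merge-style set intersection.
    double sparse_sparse_dot(const std::vector<int>& xi, const std::vector<double>& xv,
                             const std::vector<int>& yi, const std::vector<double>& yv) {
        double dot = 0.0;
        std::size_t i = 0, j = 0;
        while (i < xi.size() && j < yi.size()) {
            if      (xi[i] < yi[j]) ++i;
            else if (xi[i] > yi[j]) ++j;
            else { dot += xv[i] * yv[j]; ++i; ++j; }  // common index found
        }
        return dot;
    }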
$$z_d \leftarrow x_d + y_s,$$
or
$$z_d \leftarrow x_s + y_d,$$
where one of the input vectors x and y is sparse, the other is dense, and the output z
is a dense vector.
Because the indices of the sparse vector are known, the addition is actually a
scatter operation. In other words, the nonzero entries of the sparse vector are added
to the entries at the same indices in the dense vector, and the results are written to the
corresponding positions of the output vector. All the other entries, at indices not
found in the sparse vector, directly copy the corresponding values from the dense
input.
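The scatter-based addition can be sketched as follows (again an illustrative C++ fragment with hypothetical names):

    #include <cstddef>
    #include <vector>

    // z = x_s + y_d: copy the dense vector, then scatter-add the nonzero
    // entries of the sparse vector into the positions given by its indices.
    std::vector<double> sparse_plus_dense(const std::vector<int>& xi,
                                          const std::vector<double>& xv,
                                          const std::vector<double>& y) {
        std::vector<double> z = y;              // entries not touched by x are copied
        for (std::size_t k = 0; k < xi.size(); ++k)
            z[xi[k]] += xv[k];                  // scatter into position xi[k]
        return z;
    }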
nonzero entries of x and adding the resulting sparse vectors together to obtain the
sparse vector y. In this case, a sparse vector–sparse vector addition operation is also
utilized.
$$C_d \leftarrow A_d + B_s,$$
or
$$C_d \leftarrow A_s + B_d,$$
where one of the two input matrices A and B is dense and the other is sparse, and the
resulting matrix C is dense.
This operation is similar to the dense vector – sparse vector addition operation in
Level 1 Sparse BLAS, except multiple rows/columns are considered.
$$C_s \leftarrow A_s + B_s,$$
where
$$A = A_0 + A_1 + A_2.$$
The second method is by sparsity structure. We can divide the above sparse matrix
A into a diagonal matrix $A'_0$ and another matrix $A'_1$ composed of the remaining entries.
Then we have
$$A'_0 = \begin{bmatrix}
a & 0 & 0 & 0 & 0 & 0 \\
0 & e & 0 & 0 & 0 & 0 \\
0 & 0 & i & 0 & 0 & 0 \\
0 & 0 & 0 & j & 0 & 0 \\
0 & 0 & 0 & 0 & k & 0 \\
0 & 0 & 0 & 0 & 0 & n
\end{bmatrix}, \quad
A'_1 = \begin{bmatrix}
0 & d & g & 0 & 0 & 0 \\
b & 0 & h & 0 & 0 & 0 \\
c & f & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & m \\
0 & 0 & 0 & 0 & l & 0
\end{bmatrix},$$
where
$$A = A'_0 + A'_1.$$
The third method is by the number of nonzero entries. We can evenly divide A,
which contains 14 nonzero entries, into two submatrices, $A''_0$ and $A''_1$, that contain 7
nonzero entries each. Then we have
$$A''_0 = \begin{bmatrix}
a & d & g & 0 & 0 & 0 \\
b & e & h & 0 & 0 & 0 \\
c & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}, \quad
A''_1 = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & f & i & 0 & 0 & 0 \\
0 & 0 & 0 & j & 0 & 0 \\
0 & 0 & 0 & 0 & k & m \\
0 & 0 & 0 & 0 & l & n
\end{bmatrix},$$
where
$$A = A''_0 + A''_1.$$
Using the first method as an example, suppose we want to compute one of the Level 2 BLAS
routines that multiplies A with a dense vector $x = [1, 2, 3, 4, 5, 6]^T$. We can compute
$$y_0 = A_0 x = \begin{bmatrix}
a & d & g & 0 & 0 & 0 \\
b & e & h & 0 & 0 & 0 \\
c & f & i & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{bmatrix}
= \begin{bmatrix} 1a + 2d + 3g \\ 1b + 2e + 3h \\ 1c + 2f + 3i \\ 0 \\ 0 \\ 0 \end{bmatrix},$$
$$y_1 = A_1 x = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & j & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \\ 4j \\ 0 \\ 0 \end{bmatrix},$$
$$y_2 = A_2 x = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & k & m \\
0 & 0 & 0 & 0 & l & n
\end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 5k + 6m \\ 5l + 6n \end{bmatrix}.$$
$$y = y_0 + y_1 + y_2
= \begin{bmatrix} 1a + 2d + 3g \\ 1b + 2e + 3h \\ 1c + 2f + 3i \\ 0 \\ 0 \\ 0 \end{bmatrix}
+ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 4j \\ 0 \\ 0 \end{bmatrix}
+ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 5k + 6m \\ 5l + 6n \end{bmatrix}
= \begin{bmatrix} 1a + 2d + 3g \\ 1b + 2e + 3h \\ 1c + 2f + 3i \\ 4j \\ 5k + 6m \\ 5l + 6n \end{bmatrix}.$$
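The fork-join pattern of this first method can be sketched in C++ as below. This is a hypothetical illustration: the submatrices are stored densely only to keep the code short, and the OpenMP pragma stands in for whatever parallel runtime is actually used; a real implementation would keep each submatrix in a compressed format.

    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;  // dense here for clarity

    // Fork: multiply each submatrix A_b with x independently.
    // Join: sum the partial results y_b into the final vector y.
    std::vector<double> fork_join_spmv(const std::vector<Matrix>& blocks,
                                       const std::vector<double>& x) {
        const std::size_t m = blocks.front().size();
        std::vector<std::vector<double>> partial(blocks.size(),
                                                 std::vector<double>(m, 0.0));
        #pragma omp parallel for                       // fork
        for (int b = 0; b < static_cast<int>(blocks.size()); ++b) {
            for (std::size_t i = 0; i < m; ++i)
                for (std::size_t j = 0; j < x.size(); ++j)
                    partial[b][i] += blocks[b][i][j] * x[j];
        }
        std::vector<double> y(m, 0.0);                 // join
        for (const auto& yb : partial)
            for (std::size_t i = 0; i < m; ++i)
                y[i] += yb[i];
        return y;
    }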
input vectors. This is obviously slower than a scatter or gather operation when one
operand is dense, and can be much slower than trivial dot product of two dense
vectors.
Therefore, the selection of the best building blocks plays a crucial role in the imple-
mentation of Sparse BLAS. Many basic primitives such as sorting, searching, insertion,
merging, reduction and prefix-sum scan all have their own application scenarios on
different hardware platforms. Chapter 4 of this thesis describes useful parallel
primitives such as sorting networks, evaluates multiple parallel merge algorithms,
and develops new parallel primitives such as fast segmented sum. We further show
that the contributions of Chapter 4 greatly improve the performance of the Sparse BLAS
routines described in Chapters 7–9.
3.1 Overview
The performance of Sparse BLAS operations depends highly on the underlying hardware. In
the past decade, a variety of chip multiprocessors (CMPs) have replaced single-core
processors as the main targets of shared-memory Sparse BLAS algorithm design. This
chapter introduces four mainstream CMPs: the multicore CPU, the manycore GPU, the many-
core coprocessor and CPU, and the emerging tightly coupled CPU-GPU heterogeneous
processor.
games. Because basic numerical linear algebra operations play crucial roles in real-
time 3D computer graphics, GPUs are designed for this set of operations. Because
GPUs offer higher peak performance and bandwidth, numerical linear algebra appli-
cations can deliver much higher performance than when merely using multicore CPUs. In
2007, the birth of nVidia's CUDA programming model made GPU program-
ming much easier for researchers without background knowledge of shading
languages such as GLSL1 and HLSL2. To describe non-graphics applications
on GPUs, a new research direction, general-purpose computation on graphics hardware
(or GPGPU for short), has naturally been established [100, 134]. Figure 3.2 shows a
GPU composed of four cores, private scratchpad memory, and private and shared
caches.
Figure 3.2: A GPU composed of four cores, scratchpad memory, and caches.
Because CUDA and OpenCL are both widely used in GPU programming and
they actually deliver comparable performance [74], our Sparse BLAS algorithms
support both of them. We use the CUDA implementation on nVidia GPUs and the OpenCL
implementation on AMD GPUs.
1 The OpenGL Shading Language from the OpenGL Architecture Review Board (ARB).
2 The High Level Shading Language from the Microsoft Corp.
For simplicity, we define the following unified terminologies: (1) thread denotes
thread in CUDA and work item in OpenCL, (2) thread bunch denotes warp in nVidia
GPU and wavefront in AMD GPU, (3) thread group denotes thread block or cooperative
thread array (CTA) in CUDA and work group in OpenCL, (4) core denotes streaming
multiprocessor (SMX) or Maxwell streaming multiprocessor (SMM) in nVidia GPU and
compute unit in AMD GPU, and (5) scratchpad memory denotes shared memory in CUDA
and local memory in OpenCL.
Because GPU cores can execute massively parallel lightweight threads on SIMD
units for higher aggregate throughput, applications with good data-level parallelism
can be significantly accelerated by GPUs [42, 43, 41]. As such, some modern supercom-
puters already use GPUs as their accelerators. However, utilizing the power
of GPUs requires rewriting programs in CUDA, OpenCL or other GPU-oriented
programming models. This brings non-trivial software engineering overhead for
applications with a lot of legacy code [37, 165]. Furthermore, because GPUs have their
own memory, data have to be transferred back and forth between CPU-controlled
system memory and GPU-controlled device memory. This data transfer may offset
the performance gains obtained from GPUs. Also, programming the data transfer
makes software more complicated and harder to maintain [180]. Figure 3.3 shows a
loosely coupled CPU-GPU heterogeneous system. We can see that the CPU and the
GPU are connected by a PCIe interface, which delivers much lower bandwidth than
the GPU device memory.
Figure 3.3: A loosely coupled CPU-GPU heterogeneous system with PCIe interface.
Figure 3.4: A manycore CPU including 12 small cores and their private caches.
Though the small cores are not as powerful as their counterparts in multicore CPUs,
a manycore coprocessor or CPU can include at least tens of such small cores. Thus
its computational power is also impressive [73] compared with GPUs. The first
generations of such chips were released in a coprocessor form, which is similar to
an accelerator but has more flexible memory management. The later generations
can be used independently and thus have no fundamental difference from classic
multicore CPUs. Figure 3.4 plots a manycore coprocessor/CPU composed of 12 small
cores and their private caches.
One obvious advantage of manycore coprocessors and CPUs is programma-
bility. Ideally, unchanged legacy code for classic CPUs can run smoothly on those
chips. As a result, a large number of applications are already well prepared. Moreover, the
data transfer overhead in a CPU-GPU system can be completely avoided when using
a manycore CPU as the only compute device in a system.
The CPU cores have higher single-thread performance due to out-of-order ex-
ecution, branch prediction and large caches. The GPU cores execute massively
parallel lightweight threads on SIMD units for higher aggregate throughput.
The two types of compute units have completely different ISAs and separate
cache sub-systems.
Compared to the loosely coupled CPU-GPU heterogeneous systems shown in Fig-
ure 3.3, the two types of cores in a heterogeneous processor share one single unified
address space instead of using separate address spaces (i.e., a system memory space and
a GPU device memory space). One obvious benefit is avoiding data transfer through
connection interfaces (e.g., the PCIe link), which is one of the most well-known bottle-
necks of coprocessor/accelerator computing [82]. Additionally, GPU cores can access
more memory by paging memory to and from disk. Further, the consistent pageable
shared virtual memory can be fully or partially coherent, meaning that much more
efficient CPU-GPU interactions are possible due to the elimination of heavyweight synchro-
nization (i.e., flushing and GPU cache invalidation). Currently, programming on the
unified address space and low-overhead kernel launch are supported by HSA [90],
OpenCL [136] and CUDA [137].
To leverage the heterogeneous processors, existing research has concentrated on
various coarse-grained methods that exploit task, data and pipeline parallelism in
the heterogeneous processors. However, it is still an open question whether or not
the new features of the emerging heterogeneous processors can expose fine-grained
parallelism in fundamental data structure and algorithm design. Also, whether
new designs can outperform their conventional counterparts plus the coarse-grained
parallelization is a further question. A lot of prior work has concentrated on exploiting
coarse-grained parallelism or one-side computation in the heterogeneous processors.
The current literature can be classified into four groups: (1) eliminating data transfer,
(2) decomposing tasks and data, (3) pipelining, and (4) prefetching data.
Eliminating data transfer over a PCIe bus is one of the most distinct advantages
brought by the heterogeneous processors, thus its influence on performance and
energy consumption has been relatively well studied. Research [49, 181, 133] reported
that various benchmarks can obtain performance improvements from the AMD
APUs because of reduced data movement cost. Besides the performance benefits,
research [169, 138] demonstrated that non-negligible power savings can be achieved
by running programs on the APUs rather than on discrete GPUs, because of a shorter
data path and the elimination of the PCIe bus and controller. Further, Daga and
Nutter [48] showed that using the much larger system memory makes searches on
very large B+ trees possible.
Decomposing tasks and data is also widely studied in heterogeneous system
research. Research [106, 174] proposed scheduling approaches that map workloads
onto the most appropriate core types in the single-ISA heterogeneous processors. In
recent years, as GPU computing is becoming more and more important, scheduling
on multi-ISA heterogeneous environments has been a hot topic. StarPU [9], Qilin
[125], Glinda [163] and HDSS [22] are representatives that can simultaneously execute
suitable compute programs for different data portions on CPUs and GPUs.
Pipelining is another widely used approach that divides a program into multiple
stages and executes them on most suitable compute units in parallel. Heterogeneous
environments further enable pipeline parallelism to minimize serial bottleneck in
Amdahl’s Law [88, 148, 59, 133]. Chen et al. [44] pipelined map and reduce stages
on different compute units. Additionally, the pipelining scheme can also expose wider
design dimensions. Wang et al. [181] used the CPU to relieve the GPU's workload after
each iteration finished, thus the overall execution time was largely reduced. He
et al. [87] exposed data parallelism in pipeline parallelism by using both CPU and
GPU for every high-level data parallel stage.
Prefetching data can be considered with heterogeneity as well. Once GPU and
CPU share one cache block, the idle integrated GPU compute units can be leveraged
as prefetchers for improving single thread performance of the CPU [186, 187], and
vice versa [189]. Further, Arora et al. [6] argued that stride-based prefetchers are
likely to become significantly less relevant on the CPU if a GPU is integrated.
4. Parallelism in Data Structures
and Algorithms
4.1 Overview
Fundamental data structures and algorithms play key roles in implementing any
computer program. Data structures are approaches to storing information; algorithms
are methods for solving problems. To accelerate Sparse BLAS on modern computer
architectures, effectively utilizing parallel-friendly data structures and scalable algo-
rithms is crucial. This chapter introduces some parallel building blocks used in the
implementation of Sparse BLAS.
4.2 Contributions
This chapter makes the following contributions:
• A new method that uses offset information for converting a complex segmented
sum operation to a simple inclusive all-scan operation.
• A performance comparison of five recently developed merge methods on three
nVidia and AMD GPUs. To the best of our knowledge, no literature has reported
the performance of merging short sequences of size less than $2^{12}$, which fully use the
on-chip scratchpad memory.
• A new heap data structure named ad-heap is proposed for faster fundamental
heap operations on heterogeneous processors. To the best of our knowledge,
the ad-heap is the first fundamental data structure that efficiently leveraged the
two different types of cores in the emerging heterogeneous processors through
fine-grained frequent interactions between the CPUs and the GPUs.
From the point of view of pseudocode representation, this thesis uses the
keyword in parallel right after a for loop to indicate that a routine can be executed in
parallel. For example, the sequential version of Algorithm 1 is shown in Algorithm 2.
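As a concrete counterpart of the "in parallel" keyword, the following C++/OpenMP fragment expresses an element-wise vector addition as a data-parallel loop (element-wise addition is only a representative example; it is not necessarily the operation used in Algorithms 1 and 2):

    #include <vector>

    // "for i = 0 to n-1 in parallel" corresponds to a data-parallel loop;
    // with OpenMP the element-wise vector addition can be written as:
    void vector_add(const std::vector<double>& a,
                    const std::vector<double>& b,
                    std::vector<double>& c) {
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(c.size()); ++i)
            c[i] = a[i] + b[i];          // each iteration is independent
    }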
4.3.2 Reduction
Sometimes it is required to calculate a scalar from an array. The scalar can be the
maximum/minimum value or the sum of all of the entries of the array. A reduction
operation is commonly used to obtain the scalar. Algorithm 3 gives pseudocode of a
serial reduction-sum. This method loops through each entry in the array and sums all
entries up to obtain the scalar.
Suppose that the length of the input array is n; the serial reduction then has a work com-
plexity of O(n). In modern microprocessors, data in memory can be efficiently streamed
to a compute unit by prefetching, and thus has a high cache hit rate. But when a system
has more compute units, they cannot automatically be used for higher performance.
Therefore, a parallel reduction has been designed through a balanced tree structure.
The height (i.e., level or depth) of a balanced binary tree is $\log_2 n + 1$ if it has n nodes
(i.e., leaves). So an array of size n to be reduced can be associated with a balanced
binary tree of depth $\log_2 n + 1$. Figure 4.1 shows an example of executing reduction-
sum on an array of size 8 with a balanced tree of 4 levels. We can see that the addition
operations between every two levels of the tree can run in parallel. For example, the
four additions between the two levels at the top can run independently in parallel.
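A C++ sketch of such a tree-structured reduction-sum is shown below (an illustrative implementation, not the thesis code; the OpenMP pragma marks the loop whose iterations correspond to one level of the balanced tree):

    #include <cstddef>
    #include <vector>

    // Tree-structured reduction-sum: in each of the ~log2(n) rounds, pairs of
    // partial sums are combined. All additions within one round are independent,
    // which is the source of parallelism.
    double reduction_sum(std::vector<double> a) {   // taken by value: reduced in place
        std::size_t n = a.size();
        while (n > 1) {
            const std::size_t half = n / 2;
            #pragma omp parallel for
            for (long long i = 0; i < static_cast<long long>(half); ++i)
                a[i] += a[n - 1 - i];               // pair entry i with entry n-1-i
            n -= half;                              // the first n-half partial sums remain
        }
        return a.empty() ? 0.0 : a[0];
    }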
and an associative operator ⊕ as inputs of a scan operation, the output of the inclusive
scan is another array of the same size,
$$a_{in} = [a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})],$$
while the exclusive scan shifts the same partial results one position to the right and
places the identity element of ⊕ at the front,
$$a_{ex} = [0, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})].$$
We see that, element-wise,
$$a_{ex} = a_{in} - a.$$
For example, setting the ⊕ operator to arithmetic + and assuming the input is
[3, 6, 7, 8, 9, 12], the outputs will be [3, 9, 16, 24, 33, 45] and [0, 3, 9, 16, 24, 33] through
the inclusive scan and the exclusive scan, respectively.
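With the C++17 standard library, the same two outputs can be reproduced directly (a small illustrative program):

    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> a = {3, 6, 7, 8, 9, 12};
        std::vector<int> inc(a.size()), exc(a.size());

        std::inclusive_scan(a.begin(), a.end(), inc.begin());      // 3 9 16 24 33 45
        std::exclusive_scan(a.begin(), a.end(), exc.begin(), 0);   // 0 3 9 16 24 33

        for (int v : inc) std::cout << v << ' ';
        std::cout << '\n';
        for (int v : exc) std::cout << v << ' ';
        std::cout << '\n';
        return 0;
    }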
Scan can convert seemingly serial computations into parallel operations. Unstruc-
tured sparse matrices have rows/columns of irregular sizes and thus provide a good
scenario for using the parallel scan operation. The following subsection gives an exam-
ple.
and to obtain
$$A_{odd} = \begin{bmatrix}
0 & f & 0 \\
a & g & n \\
b & 0 & m \\
0 & h & 0 \\
0 & 0 & 0 \\
0 & i & n
\end{bmatrix},$$
only containing three odd columns of A. Because the two sparse matrices are required
to be stored in compressed style, parallelizing this partitioning operation is not trivial.
A serial implementation can be easily achieved by sequentially iterating over all entries
and saving the ones in odd columns. However, parallel processing is expected to deliver
higher performance. This is achievable by using the scan primitive.
First, an intermediate array is introduced to label each nonzero entry. A nonzero
entry is labeled as 1 if it is in an odd column; otherwise 0 is used. So the 2nd
row, together with this auxiliary array, is stored as
col_idx = [0, 2, 3, 4, 5],
val = [a, g, k, n, o],
aux_array = [1, 1, 0, 1, 0].
Running an exclusive scan (with + as the operator) on aux_array gives [0, 1, 2, 2, 3],
which stores the target positions of the odd-column entries. Now it is easy to save the odd
columns of the whole 2nd row of A by Algorithm 5 in parallel. This method can be
easily extended to the whole matrix.
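A compact C++ sketch of this scan-then-scatter compaction for a single row is given below (hypothetical function and array names; std::exclusive_scan stands in for whatever parallel scan primitive is actually used):

    #include <numeric>
    #include <vector>

    // Keep only the flagged entries of one row. 'aux' marks the entries to keep;
    // an exclusive scan of 'aux' gives each kept entry's target position, so all
    // entries can then be written out independently (in parallel).
    void compact_row(const std::vector<int>& col_idx,
                     const std::vector<double>& val,
                     const std::vector<int>& aux,
                     std::vector<int>& out_idx,
                     std::vector<double>& out_val) {
        std::vector<int> pos(aux.size());
        std::exclusive_scan(aux.begin(), aux.end(), pos.begin(), 0);
        const int kept = pos.empty() ? 0 : pos.back() + aux.back();
        out_idx.resize(kept);
        out_val.resize(kept);
        #pragma omp parallel for
        for (int k = 0; k < static_cast<int>(aux.size()); ++k)
            if (aux[k]) {                      // flagged entry: scatter to pos[k]
                out_idx[pos[k]] = col_idx[k];
                out_val[pos[k]] = val[k];
            }
    }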
4.4.2 All-Scan
A scan operation is called all-scan if it runs on the whole input array, as opposed to
part of it. Algorithm 6 shows a serial out-of-place inclusive scan. Similar to the serial
reduction described above, all-scan iterates through all entries of the array and lets
each entry of the output array store the sum of all entries of the input array except
those to its right. Sometimes, the scan is required to be in-place for less memory
allocation. Algorithm 7 gives an in-place version of the serial all-scan.
A parallel scan algorithm has been given by Blelloch [25, 26, 27]. The method divides
the all-scan into two phases, up-sweep and down-sweep, each organized as a balanced
tree as described above. Figure 4.2 gives an example of conducting all-scan on an array
of size 8. We can see that the up-sweep phase basically executes a parallel reduction
operation, and the down-sweep phase transmits zeros and adds previous partial sums to
subsequent entries. Algorithm 10 shows pseudocode of this operation. Its work
complexity is O(n), and it can be completed in $O(\log_2 n)$ time with enough parallel
resources.
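For reference, a serial C++ sketch in the spirit of the up-sweep/down-sweep scheme is shown below (an illustrative version that assumes the input length is a power of two; within each level of either phase, the loop iterations touch disjoint entries and could therefore be executed in parallel):

    #include <cstddef>
    #include <vector>

    // Exclusive all-scan via up-sweep (reduction) and down-sweep phases.
    // The input length n is assumed to be a power of two for clarity.
    void updown_exclusive_scan(std::vector<int>& a) {
        const std::size_t n = a.size();
        if (n < 2) { if (n == 1) a[0] = 0; return; }
        for (std::size_t d = 1; d < n; d *= 2)              // up-sweep
            for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
                a[i] += a[i - d];
        a[n - 1] = 0;                                       // clear the root
        for (std::size_t d = n / 2; d >= 1; d /= 2)         // down-sweep
            for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
                const int t = a[i - d];                     // left child
                a[i - d] = a[i];                            // pass the prefix down
                a[i] += t;                                  // right child gets prefix + left sum
            }
    }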
Recently, the in-place inclusive and exclusive all-scan operations have already
been defined by a set of Workgroup Functions of OpenCL 2.0 [136]. Using vendors’
implementation may bring high efficiency and programming productivity.
whether an addition is required. Also, if any one of the inputs of the addition is
performed, the target position is flagged as TRUE in the up-sweep phase, or modified
to FALSE if the entry is involved in any step of the down-sweep phase.
Sorting is a fundamental algorithm in computer science and finds broad use in
Sparse BLAS. Recall the first example of this thesis and Figure 2.3: searching for the 5th
ball takes longer if the index array is not sorted. If the array is sorted, a fast binary search
can be utilized to obtain the position of index 4 in O(log n) as opposed to O(n) time.
In the past decades, many sorting algorithms have been proposed. This thesis
mainly focuses on parallel-friendly sorting methods such as sorting network ap-
proaches and the recently developed merge path method.
to another form
$$A' = \begin{array}{c|cccccc}
1 & a & 0 & g & k & n & o \\
5 & 0 & e & i & l & n & q \\
0 & 0 & c & f & j & 0 & 0 \\
2 & b & 0 & 0 & 0 & m & 0 \\
3 & 0 & d & h & 0 & 0 & 0 \\
4 & 0 & 0 & 0 & 0 & 0 & p
\end{array},$$
in which the longer rows appear before shorter ones, where the column vector to
the left of the matrix contains indices of the rows of the matrix. This permutation
operation can be implemented by sorting an array of key-value pairs: the key is the
number of nonzero entries of each row, and the value is the index of each row of the
matrix. In this example, the array of key-value pairs is
$$a_{\langle key, value \rangle} = [\langle 3, 0\rangle, \langle 5, 1\rangle, \langle 2, 2\rangle, \langle 2, 3\rangle, \langle 1, 4\rangle, \langle 5, 5\rangle].$$
Then the value part of the key-value array is actually the new order of rows of the
matrix.
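This permutation-by-sorting step can be reproduced with a few lines of C++ (an illustrative example using the key-value pairs given above):

    #include <algorithm>
    #include <iostream>
    #include <utility>
    #include <vector>

    int main() {
        // <key, value> = <number of nonzeros in the row, row index>.
        std::vector<std::pair<int, int>> a = {{3, 0}, {5, 1}, {2, 2},
                                              {2, 3}, {1, 4}, {5, 5}};
        // Sort by key in descending order; a stable sort keeps the original
        // relative order of rows with equal length.
        std::stable_sort(a.begin(), a.end(),
                         [](const auto& l, const auto& r) { return l.first > r.first; });

        // The value parts now give the new row order: 1 5 0 2 3 4.
        for (const auto& kv : a) std::cout << kv.second << ' ';
        std::cout << '\n';
        return 0;
    }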
Figure 4.4: An example of bitonic sort. The arrows indicate ascending order operated
by comparators.
Figure 4.5: An example of odd-even sort. The arrows indicate ascending order
operated by comparators.
less than or equal to it in the other input array) and adds the rank to its local index
to obtain its final position in the output array. Algorithm 17 gives pseudocode
of this method. Because the binary search of each entry is independent of all the
others, the operation can run in parallel very well. The overall cost of ranking
merge is $O(n_1 \log_2 n_2 + n_2 \log_2 n_1)$, where $n_1$ and $n_2$ are the sizes of the two input arrays,
respectively. Similar to its serial version, ranking merge is required to be out-of-place.
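A C++ sketch of the ranking merge is shown below (an illustrative version, not the GPU implementation evaluated later; lower_bound and upper_bound are used so that equal keys coming from the two inputs do not collide):

    #include <algorithm>
    #include <vector>

    // Ranking merge of two sorted arrays a and b: every entry finds, by binary
    // search, how many entries of the other array precede it; its final position
    // is its own index plus that rank. All searches are independent and can run
    // in parallel. The merge is out-of-place, as noted above.
    std::vector<int> ranking_merge(const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> out(a.size() + b.size());
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(a.size()); ++i) {
            const auto rank = std::lower_bound(b.begin(), b.end(), a[i]) - b.begin();
            out[i + rank] = a[i];
        }
        #pragma omp parallel for
        for (int j = 0; j < static_cast<int>(b.size()); ++j) {
            const auto rank = std::upper_bound(a.begin(), a.end(), b[j]) - a.begin();
            out[j + rank] = b[j];
        }
        return out;
    }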
Sorting network algorithms such as bitonic sort and odd-even sort can also be
used for merging. In their last stage (e.g., stage 3 in Figures 4.4 and 4.5), the input
is already composed of two ordered subarrays of the same size. Then the merge
methods can further sort the two subarrays.
subarray including entry of value 0 and another subarray including two entries of
values 1 and 2, respectively. Then the ordered three entries are stored to the first 3
positions of the resulting array.
Algorithm 18 gives pseudocode of merge path. Lines 7–25 conduct binary search
on the 45-degree anti-diagonals to obtain positions of the intersections. Lines 26–33
complete serial merge between those intersections. More detailed description and
complexity analysis of the GPU merge path algorithm can be found in [81].
4.5.6 A Comparison
Since merging is an important operation in sparse matrix computations (such as the
addition of two sparse vectors described in Chapter 7), the most efficient algorithm
needs to be selected. Considering that the on-chip scratchpad memory of GPUs is controlled
by the user and may offer performance gains for short inputs and outputs, we consider
some recently implemented merge algorithms [157, 81, 99, 146, 145, 91, 55] for GPUs.
First, because the main objective of the research [145, 91, 55] is efficiently merging
large data in the global memory, they still use basic methods, such as bitonic sort and
ranking-based merge, as building blocks for small data in the scratchpad memory.
Peters et al. [146] proposed a locality-oriented advanced bitonic sort method that can
reduce synchronization overhead by merging data in fast private memory instead
of relatively slow shared memory. Therefore we experimentally evaluate 5 GPU
merge algorithms: (1) ranking merge [157], (2) merge path [81], (3) basic oddeven
merge [99], (4) basic bitonic merge [99], and (5) advanced bitonic merge [146]. The
implementation of algorithm (2) is extracted from the Modern GPU library [18].
The implementations of algorithms (3) and (4) are extracted from the nVidia CUDA
SDK. We implement algorithms (1) and (5). Additionally, another reason why we
conduct the evaluation is that none of the above literature presented the performance of
merging short sequences of size less than $2^{12}$, which is the most important length range for
the relatively short rows of our benchmark suite.
Our evaluation results of merging 32-bit keys, 32-bit key-32-bit value pairs and
32-bit key-64-bit value pairs are shown in Figure 4.7. The experimental platforms are
nVidia GeForce GTX Titan Black, nVidia GeForce GTX 980 and AMD Radeon R9 290X
(see detailed specifications of the GPUs listed in Table B.2 of Appendix B). Each of the
five algorithms merges two short ordered sequences of size l into one ordered output
sequence of size 2l. The sorting network methods in our evaluation only execute the
last stage, since both inputs are sorted. To saturate the throughput of the GPUs, the whole
problem size is set to $2^{25}$. For example, $2^{14}$ thread groups are launched while
each of them merges two sub-sequences of size $l = 2^{10}$. We execute each problem set
through multiple thread groups of different sizes and record the best performance for
the evaluation.
In Figure 4.7, we can see that the GPU merge path algorithm almost always
outperforms the other methods when the sub-sequence size is no less than $2^8$. The extra
advantages of the merge path method are that it can evenly assign the workload to
threads and can easily deal with input sequences of arbitrary sizes.
We can see that the ranking merge is faster than the merge path method in Fig-
ure 4.7(f). But this algorithm requires more scratchpad memory and thus cannot scale
to longer sequences. The basic bitonic merge and the basic oddeven merge in general
do not show better performance and cannot easily deal with data of arbitrary sizes.
The advanced bitonic sort method is always the slowest because it loads data from
the scratchpad memory into thread private memory (the register file or an off-chip memory
space) for data locality. However, due to the small, or even negative, latency gap
between the scratchpad memory and the thread private memory, these load operations
actually reduce the overall performance. Thus this method should only be used for
migrating global memory accesses to scratchpad memory accesses.
We can also see that the AMD Radeon R9 290X GPU is almost always much faster
than the two nVidia GPUs in all tests. The reason is that the capacity of the scratchpad
memory (2816 kB, 64 kB/core × 44 cores, in the AMD GPU; 1536 kB, 96 kB/core
× 16 cores, in the nVidia Maxwell-based GTX 980 GPU; and 720 kB, 48 kB/core ×
15 cores, in the nVidia Kepler-based GTX Titan Black GPU) heavily influences the
performance of merging small sequences. For the same reason, the GTX 980 GPU
delivers better overall performance than the GTX Titan Black GPU. On the other hand,
even though the AMD GPU has 64 kB of scratchpad memory per core, each instance
of the kernel program can only use up to 32 kB. Thus the AMD GPU cannot scale to
longer sub-sequences (e.g., $2^{12}$ with 32-bit key-32-bit value pairs) that can be executed
on the nVidia GPUs.
Figure 4.7: Performance comparison of merging 32-bit keys, 32-bit key-32-bit value
pairs and 32-bit key-64-bit value pairs through 5 GPU merge algorithms: ranking
merge, merge path, basic oddeven merge, basic bitonic merge and advanced bitonic
merge on three different GPUs: nVidia GeForce GTX Titan Black, nVidia GeForce
GTX 980 and AMD Radeon R9 290X.
two simulated heterogeneous processors composed of real AMD CPU-GPU and Intel
CPU and nVidia GPU, respectively. The experimental results show that compared
with the optimal scheduling method that executes the fastest d-heaps on the stan-
dalone CPUs and GPUs in parallel, the ad-heap achieves up to 1.5x and 3.6x speedup
on the two platforms, respectively.
Because of the padded head, each node adds an offset = d − 1 to its index
in the implicit array. Given a node of index i, its array index becomes i + offset. Its
parent node's (if i ≠ 0) array index is ⌊(i − 1)/d⌋ + offset. If any, its first child node
is located at array index di + 1 + offset and its last child node at array index di + d + offset.
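The index arithmetic can be captured in a few lines of C++ (a hypothetical helper struct, shown only to make the formulas above concrete):

    // Index arithmetic of an implicit d-heap stored in a padded array:
    // 'offset' = d - 1 accounts for the padded head.
    struct DHeapIndex {
        int d;                                      // fan-out of the heap
        int offset() const { return d - 1; }
        int array_index(int i)  const { return i + offset(); }
        int parent(int i)       const { return (i - 1) / d + offset(); }   // valid for i != 0
        int first_child(int i)  const { return d * i + 1 + offset(); }
        int last_child(int i)   const { return d * i + d + offset(); }
    };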
Given an established non-empty max-d-heap, we can execute three typical heap
operations:
• insert operation adds a new node at the end of the heap, increases the heap size
to n + 1, and takes O(logd n) worst-case time to reconstruct the heap property,
• delete-max operation copies the last node to the position of the root node, de-
creases the heap size to n−1, and takes O(d logd n) worst-case time to reconstruct
the heap property, and
• update-key operation updates a node, keeps the heap size unchanged, and takes
O(d logd n) worst-case time (if the root node is updated) to reconstruct the heap
property.
• find-maxchild operation takes O(d) time to find the maximum child node for a
given parent node, and
two design objectives: (1) maximizing the throughput of the large number of SIMD
units for faster find-maxchild operations, and (2) minimizing the negative impact of the
single-thread compare-and-swap operations.
ad-heap Operations
The corresponding operations of the ad-heap data structure are redesigned as well.
Again, for simplicity and without loss of generality, we only consider the update-key
operation described in Algorithm 19.
Before the update-key operation starts, the bridge is constructed in the on-chip
scratchpad memory of a GPU and the node counter is initialized to zero. Then in
each iteration (lines 6–12 of Algorithm 19), a group of lightweight SIMD threads
in the GPU simultaneously execute the find-maxchild operation (i.e., in parallel load
at most d child nodes to the scratchpad memory and run the streaming reduction
scheme to find the index and the value of the maximum child node). After each
find-maxchild and compare operation, if a swap operation is needed, one of the SIMD
threads adds a new index-value pair (index of the current parent node and value of
the maximum child node) to the bridge and updates the node counter. If the current
level is not the last level, the new value of the child node can be stored in a register
and be reused as the parent node of the next level. Otherwise, the single SIMD thread
stores the new indices and values of both the parent node and the child node
to the bridge. Because the on-chip scratchpad memory is normally two orders of
magnitude faster than the off-chip memory, the cost of the single-thread operations
is negligible. When all iterations are finished, at most 2h + 1 SIMD threads store the
bridge from the on-chip scratchpad memory to continuous off-chip memory using
⌈(2h + 1)/w⌉ off-chip memory transactions. The single program multiple data (SPMD)
pseudocode is shown in Algorithm 21. Here we do not give parallel pseudocode
of the find-maxchild operation, since it is very similar to the reduction operation described
in Algorithm 4. After the bridge is dumped, a signal object is transferred to the
GPU-CPU queue.
Triggered by the synchronization signal from the queue, one of the CPU cores
sequentially loads the entries from the bridge and stores them to the real heap space
in linear time. Note that no data transfer, address mapping or explicit coherence
maintenance is required due to the unified memory space with cache coherence. And
because the entries in the bridge are located in continuous memory space, the CPU
cache system can be efficiently utilized. When all entries are updated, the whole
update-key operation is completed. The pseudocode of the CPU workload in the
update-key operation is shown in Algorithm 22.
Referring to the command queue in the OpenCL specification and the Architected
Queuing Language (AQL) in the HSA design, we list the pseudocode of the update-key
operation in Algorithm 23.
We can see that although the overall time complexity is not reduced, the two
types of compute units focus more on the off-chip memory behaviors that they are
good at. We can calculate that the GPU needs hd/w + (2h + 1)/w off-chip memory
transactions instead of the h(d/w + 1) needed by the d-heap. For example, given
a 7-level 32-heap and setting w to 32, the d-heap needs 14 off-chip memory transactions
while the ad-heap only needs 8. Since the cost of off-chip memory access dominates
execution time, the practical GPU performance can be improved significantly. Further,
from the CPU perspective, all read transactions are from the bridge in continuous
cache blocks and all write transactions only trigger non-time-critical cache write
misses to random positions. Therefore the CPU workload performance can also be
expected to be good.
ad-heap Simulator
From the perspective of the programming model, synchronization mechanism among
compute units is redefined. Recently, several CPU-GPU fast synchronization ap-
proaches [44, 108, 127] have been proposed. In this section, we implement the ad-heap
operations through the synchronization mechanism designed by the HSA Foundation.
According to the current HSA design [108], each compute unit executes its task and
sends a signal object of size 64 Byte to a low-latency shared memory queue when it
has completed the task. Thus with HSA, CPUs and GPUs can queue tasks to each
other and to themselves. Further, the communications can be dispatched in the user
Algorithm 21 The SPMD GPU workload in the update-key operation of the ad-heap.
 1: function GPU-WORKLOAD(*heap, d, n, h, newv)
 2:   tid ← GET-THREAD-LOCALID()
 3:   i ← 0
 4:   v ← newv
 5:   *bridge ← SCRATCHPAD-MALLOC(2h + 1)
 6:   if tid = 0 then
 7:     bridge[0] ← 0                      ▷ initialize the node counter
 8:   end if
 9:   while di + 1 < n do
10:     ⟨maxi, maxv⟩ ← FIND-MAXCHILD(*heap, d, n, i)
11:     if maxv > v then
12:       if tid = 0 then                  ▷ insert an index-value pair
13:         bridge[2 * bridge[0] + 1] ← i
14:         bridge[2 * bridge[0] + 2] ← maxv
15:         bridge[0] ← bridge[0] + 1
16:       end if
17:       i ← maxi
18:     else
19:       break
20:     end if
21:   end while
22:   if tid = 0 then                      ▷ insert the last index-value pair
23:     bridge[2 * bridge[0] + 1] ← i
24:     bridge[2 * bridge[0] + 2] ← v
25:     bridge[0] ← bridge[0] + 1
26:   end if
27:   if tid < 2h + 1 then                 ▷ dump the bridge to off-chip memory
28:     heap[tid] ← bridge[tid]
29:   end if
30:   return
31: end function
mode of the operating systems, thus the traditional “GPU kernel launch” method
(through the operating system kernel services and the GPU drivers) is avoided and
the CPU-GPU communication latency is significantly reduced. Figure 4.10 shows an
example of the shared memory queue.
Because the HSA programming tools for the heterogeneous processor hardware
described in this section are not yet available, we conduct experiments on
simulated heterogeneous processor platforms composed of real standalone CPUs and
GPUs. The ad-heap simulator has two stages:
(1) Pre-execution stage. For a given input list and a size d, we first count the
number of the update-key operations and the numbers of the subsequent find-maxchild
and compare-and-swap operations by pre-executing the work through the d-heap
on the CPU. We write $N_u$, $N_f$, $N_c$ and $N_s$ to denote the numbers of the update-
• The CPU part reads the entries in the $N_u$ bridges (back from the GPU) and writes
$N_u(N_s + 1)$ values to the corresponding entry indices. This part takes $T_{cc}$ time
on the CPU.
After the simulation runs, we use the overlapped work time on the CPU and the GPU as
the execution time of the ad-heap, since the two types of cores are able to work in parallel.
Thus the final execution time is the longer one of $T_{cc} + T_{cq}$ and $T_{gc}$.
Because of the features of the heterogeneous processors, the costs of device/host
memory copy and GPU kernel launch are not included in our timer. Note that
because we use both the CPU and the GPU separately, the simulated heterogeneous
processor platform is assumed to have the accumulated off-chip memory bandwidth
of both processors. Moreover, we also assume that the GPU supports the device
fission function defined in the OpenCL 1.2 specification, so that cores in current GPU
devices can be used as sub-devices, which are more like the GPUs in the HSA design.
Thus one CPU core and one GPU core can cooperate to deal with one ad-heap. The
simulator is programmed in C++ and CUDA/OpenCL.
entries, the root node is the kth smallest entry and the heap contains the k smallest
entries of the input sub-list.
In our ad-heap implementation, we execute the heapify function (i.e., the first construc-
tion of the heap) on the GPU and the root node comparison operations (i.e., deciding
whether an update-key operation is required) on the CPU. Besides the execution time
described in the ad-heap simulator, the execution time of the above two operations
is recorded in our timer as well.
According to the capacity limitation of the GPU device memory, we set the sizes of the
list sets to $2^{28}$ and $2^{25}$ on the two machines, respectively, the data type to 32-bit integer
(randomly generated), the size of each sub-list to the same length l (from $2^{11}$ to $2^{21}$), and
k to 0.1l.
Experimental Results
Primary Y-axis-aligned line graphs in Figures 4.11(a)–(e) and 4.12(a)–(e) show the
selection rates of the d-heaps (on the CPUs and the GPUs) and the ad-heap (on the
simulators) over different sizes of the sub-lists and d values on machine 1 and machine 2,
respectively. In all tests, all cores of the CPUs are utilized. We can see that, for the
performance of the d-heaps in all groups, the multicore CPUs are almost always faster
than the GPUs, even though the larger d values significantly reduce the throughput of
the CPUs. Thus, for the conventional d-heap data structure, the CPUs are still the better
choice for the heap-based k-selection problem. For the ad-heap, the fastest d value is
always 32. On the one hand, the smaller d values cannot fully utilize the computation and
bandwidth resources of the GPUs. On the other hand, the larger d values lead to much
more data loading but do not bring correspondingly shallower heaps.
Secondary Y-axis-aligned stacked columns in Figures 4.11(a)–(e) and 4.12(a)–(e)
show the execution time of the three parts (CPU compute, CPU queue and GPU
compute) of the ad-heap simulators. On machine 1, the execution time of the
GPU compute is always longer than the total time of the CPU work, because the raw
performance of the integrated GPU is too low to accelerate the find-maxchild
operations and the memory sub-system in the APU is not completely designed for
the GPU memory behaviors. On machine 2, the ratio of CPU time to GPU time
is much more balanced (in particular, when d = 32) due to the much stronger discrete
GPU.
Figures 4.11(f) and 4.12(f) show aggregated performance numbers, including the best
results of the former five groups and the optimal scheduling method that runs the fastest
d-heaps on the CPUs and the GPUs in parallel. In these two sub-figures,
we can see that the ad-heap obtains up to 1.5x and 3.6x speedup over the optimal
scheduling method when the d value is equal to 32 and the sub-list size is equal to $2^{18}$
and $2^{19}$, respectively. Note that the optimal scheduling method is also assumed to
utilize the accumulated off-chip memory bandwidth of both processors.
We can see that, among all the candidates, only the ad-heap maintains relatively
stable performance as the problem size grows. The performance numbers
support our ad-heap design, which benefits from the main features of the two types of
cores, while the CPU d-heaps suffer from wider find-maxchild operations and the GPU
d-heaps suffer from more single-thread compare-and-swap operations.
Figure 4.11: Selection rates and ad-heap execution time over different sizes of the
sub-lists on machine 1 (i.e., the one with AMD CPU and GPU), for (a) d = 8, (b) d = 16,
(c) d = 32 and (d) d = 64. The line-shaped data series are aligned to the primary Y-axis.
The stacked column-shaped data series are aligned to the secondary Y-axis.
Figure 4.12: Selection rates and ad-heap execution time over different sizes of the
sub-lists on machine 2 (i.e., the one with Intel CPU and nVidia GPU), for (a) d = 8,
(b) d = 16, (c) d = 32 and (d) d = 64. The line-shaped data series are aligned to the
primary Y-axis. The stacked column-shaped data series are aligned to the secondary Y-axis.
analysis. The experimental results showed that the ad-heap can obtain up to 1.5x and
3.6x speedup over the optimal scheduling method on two representative machines,
respectively. To the best of our knowledge, the ad-heap is the first fundamental data
structure that efficiently leverages the two different types of cores in the emerging
heterogeneous processors through fine-grained frequent interactions between the
CPUs and the GPUs. Further, the performance numbers also showed that redesigning
data structures and algorithms is necessary for exposing the higher computational power
of the heterogeneous processors.
Compared with the prior work, our ad-heap not only takes advantage of reduced
data movement cost but also utilizes the computational power of both types of cores.
As shown in the previous section, we found that the 8-heap is the best choice for the CPU
and the 32-heap is the fastest on the GPU; thus the optimal scheduling method should
execute the best d-heap operations on both types of cores in parallel. However, our
results showed that the ad-heap is much faster than the optimal scheduling method.
Thus scheduling is not always the best approach, even when task or data parallelism
is obvious. Actually, in the ad-heap, the find-maxchild operation can be seen as a
parallelizable stage of its higher-level operation delete-max or update-key. However,
the ad-heap is different from the previous work because it utilizes the advantages of the
heterogeneous processors through frequent fine-grained interactions between the
CPUs and the GPUs. If the two types of cores shared the last level cache, the ad-heap
could naturally obtain benefits from heterogeneous prefetching, because the bridge and
the nodes to be modified would already be loaded into the on-chip cache by the GPUs prior
to being written back by the CPUs.
Because of the legacy CPU and GPU architecture design, in this section we chose
to focus on a heterogeneous processor environment with separate last-level cache
sub-systems, as plotted in Figure 3.5(a). Conducting experiments on a shared last-
level cache heterogeneous processor like the one in Figure 3.5(b) can be an interesting
direction for future work. Additionally, our approach is different from the previous work since we
treat both GPUs and CPUs as compute units, not just prefetchers.
Part II
5. Existing Storage Formats
5.1 Overview
Because different sparse matrices have different sparsity structures, the selection of a
representation (i.e., format) for a certain matrix may affect its memory space requirement
and the efficiency of algorithms running on it. As a result, research on sparse matrix
representation attracts a lot of attention. This chapter introduces some widely used
and recently developed sparse matrix representations.
Some representative formats are classified into three groups: (1) four basic formats
supported in many mathematical software packages, (2) three hybrid formats
mainly developed for irregular matrices, and (3) eight extended formats constructed
based on the basic formats.
However, the DIA format is designed only for sparse matrices dominated by
diagonals. If some weak diagonals have very few nonzero entries, the DIA format
may waste a large amount of memory space. Figure 5.3 shows an example matrix $A_2$.
The diagonals at positions -2 and 2 each have only one nonzero entry. Thus
each of them wastes 5 memory units on padded fill-ins. In general, the DIA format
may waste a large amount of memory space when storing more irregular sparse matrices.
Figure 5.4 plots the ELL format in the row-major order of the matrix $A_2$. We can
see that $nnzr_{max} = 4$ in this case, whereby the sizes of col_idx and val are 6 × 4.
Because the 1st, 2nd, 5th and 6th rows have fewer nonzero entries than the
longest row, some fill-ins are padded to keep the arrays col_idx and val rectangular.
Note that the ELL data can also be stored in the column-major order for aligned
memory access.
Figure 5.4: A sparse matrix and its ELL format in the row major order.
Unfortunately, for matrices with a relatively long row, the ELL format may need
many fill-ins. Figure 5.5 gives an example matrix $A_3$. Because the 3rd row is much
longer than the others, many fill-ins are explicitly stored, wasting memory space.
Figure 5.5: A sparse matrix and its ELL format in the row-major order.
However, the COO format still stores some redundant row index information if a row contains more than one nonzero entry. In Figure 5.6, we can see that the row_idx array stores four ‘2’s for the nonzero entries in the 3rd row, and two ‘5’s for the 6th row. Thus, despite its generality, the COO format may waste memory space in common cases.
Because the CSR format does not store redundant information, it is the most widely used format for sparse matrices. (The CSR format is also known as the Compressed Row Storage (CRS) format [156].) As a result, Sparse BLAS algorithms for matrices in the CSR format are basic functions of a sparse matrix library. Actually, the CSR and COO formats can be further compressed [101, 172, 183], but a description of those methods is beyond the scope of this chapter.
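As a brief illustration of the difference in index storage, the C sketch below (an illustration, not taken from the thesis; the struct names are hypothetical) contrasts the two layouts: COO keeps one row index per nonzero entry, whereas CSR compresses them into m+1 row pointers.

/* COO: one (row, column, value) triple per nonzero entry; the row index
 * is repeated for every entry of a long row. */
typedef struct {
    int m, n, nnz;
    int    *row_idx;   /* length nnz */
    int    *col_idx;   /* length nnz */
    double *val;       /* length nnz */
} coo_matrix;

/* CSR: the nnz row indices are compressed into m+1 row pointers;
 * row i occupies the index range [row_ptr[i], row_ptr[i+1]). */
typedef struct {
    int m, n, nnz;
    int    *row_ptr;   /* length m+1 */
    int    *col_idx;   /* length nnz */
    double *val;       /* length nnz */
} csr_matrix;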
5.3.2 Cocktail
Su and Keutzer proposed a more aggressive hybrid format called Cocktail [170]. It constructs a framework that analyzes the sparsity structure of an input matrix and stores it in a combination of 9 formats from 3 categories (i.e., diagonal based formats, flat formats and block based formats). Figure 5.9 shows that the matrix A4 can be stored more efficiently if its ELL part is stored in the DIA format.
5.3.3 SMAT
However, given an irregular sparse matrix, selecting the best combination is NP-hard. Li et al. [115] proposed a machine learning based autotuner, trained on 2373 matrices from the University of Florida Sparse Matrix Collection [56], for deciding a combination of formats for a new input matrix. Similar to SMAT, Sedaghati et al. [160] constructed machine learning classifiers for automatic selection of the best format for a given input on a target device.
Figure 5.12: A sparse matrix and its BCOO format of 2 × 2 dense blocks.
terms of the number of nonzero entries in each row, then aligns the entries to their left as the ELL format does, and finally stores the entries in dense blocks in the column-major order for better cache locality. Because of the sorting, a new array called perm is introduced for storing the permutation, i.e., the original row positions. Figure 5.15 shows an example. We can see that the 3rd row is longer than the others and is thus placed at the front.
5.4.8 CSR-Adaptive
Greathouse and Daga proposed the CSR-Adaptive format [80, 50], which uses an array row_block instead of the binning information of the ACSR format for grouping rows of comparable length. In this format, contiguous rows are organized into row blocks if the sum of their lengths is no larger than the size of the usable on-chip memory. The array row_block records the starting and ending points of the row groups. Figure 5.17 gives an example that assumes each row block does not contain more than 4 nonzero entries. We can see that the 1st row block contains 2 rows, the 2nd one contains 1 row, and so on. If the row blocks have roughly the same number of nonzero entries, good load balancing can be expected on massively parallel devices. As with the ACSR format, the CSR data are not changed in the CSR-Adaptive format.
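A sequential sketch of how such a row_block array could be built is shown below; the on-chip capacity parameter max_nnz_per_block and the exact grouping rule are simplifications of the CSR-Adaptive scheme described in [80, 50], not the authors' implementation.

/* Group contiguous CSR rows into row blocks whose combined number of
 * nonzero entries does not exceed max_nnz_per_block. Row block k covers
 * the rows [row_block[k], row_block[k+1]); the function returns the
 * number of row blocks. A single row that alone exceeds the limit gets a
 * row block of its own (CSR-Adaptive treats such rows separately). */
int build_row_blocks(const int *row_ptr, int m,
                     int max_nnz_per_block, int *row_block)
{
    int nblocks = 0;
    int block_start = 0;
    row_block[0] = 0;
    for (int i = 1; i <= m; i++) {
        if (row_ptr[i] - row_ptr[block_start] > max_nnz_per_block) {
            /* close the current block before row i; keep at least one row */
            int end = (i - 1 > block_start) ? i - 1 : i;
            row_block[++nblocks] = end;
            block_start = end;
        }
    }
    if (row_block[nblocks] != m)
        row_block[++nblocks] = m;
    return nblocks;
}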
6. The CSR5 Storage Format
6.1 Overview
In the previous chapter, we saw that many advanced formats have been proposed for a variety of purposes. However, they may incur a non-trivial format conversion cost if the input matrix is stored in a basic format such as CSR. The conversion cost mainly comes from the expensive structure-dependent parameter tuning of a storage format.
For example, some block based formats [33, 35, 45, 177, 178, 188] such as BCCOO
require finding a good 2D block size. Moreover, some hybrid formats [21, 170] such
as HYB may need completely different partitioning parameters for distinct input
matrices.
To avoid the format conversion overhead, the ACSR [7] and the CSR-Adaptive [80,
50] directly use the CSR data. However, they may provide very low performance for
irregular matrices due to unavoidable load imbalance. Furthermore, none of them
can avoid an overhead from preprocessing, since certain auxiliary data for the basic
CSR format have to be generated.
Therefore, to be practical, an efficient format must satisfy two criteria: (1) it should
limit format conversion cost by avoiding structure-dependent parameter tuning, and
(2) it should support fast sparse matrix computations for both regular and irregular
matrices.
6.2 Contributions
To meet these two criteria, this chapter proposes CSR5 (Compressed Sparse Row 5),
a new format directly extending the classic CSR format. The CSR5 format leaves
one of the three arrays of the CSR format unchanged, stores the other two arrays
in an in-place tile-transposed order, and adds two groups of extra auxiliary infor-
mation. The format conversion from the CSR to the CSR5 merely needs two tuning
parameters: one is hardware-dependent and the other is sparsity-dependent (but
structure-independent). Because the added two groups of information are usually
much shorter than the original three in the CSR format, very limited extra space is
required. Furthermore, the CSR5 format is SIMD-friendly and thus can be easily
implemented on all mainstream processors with SIMD units. (We call the storage format CSR5 because it has five groups of data, instead of three in the classic CSR.) Because of the structure-independence and the SIMD utilization, CSR5-based sparse matrix algorithms such as SpMV can deliver stable, high throughput for both regular and irregular matrices.
Figure 6.2: The CSR5 storage format of the sparse matrix A. The five groups of information include row_ptr, tile_ptr, col_idx, val and tile_desc.
For the matrices with large nnz/row, σ may need to be small. The reason is that once a whole tile is located inside one matrix row (i.e., only one segment is in the tile), the segmented sum converts to a fast reduction sum.
Therefore, for the nnz/row to σ mapping on GPUs, we define three simple bounds: r, s and t. The first bound r is designed to prevent a too small σ. The second bound s is used for preventing a too large σ. But when nnz/row is larger than the third bound t, σ is set to a small value u. Then we have
$$\sigma = \begin{cases} r & \text{if } nnz/row \le r, \\ nnz/row & \text{if } r < nnz/row \le s, \\ s & \text{if } s < nnz/row \le t, \\ u & \text{if } t < nnz/row. \end{cases}$$
The three bounds, r, s and t, and the value u are hardware-dependent, meaning
that for a given processor, they can be fixed for use. For example, to execute double
precision SpMV on nVidia Maxwell GPUs and AMD GCN GPUs, we always set
<r, s, t, u> = <4, 32, 256, 4> and <4, 7, 256, 4>, respectively. As for future processors
with new architectures, we can obtain the four values through some simple bench-
marks during initialization, and then use them for later runs. So the parameter σ can be decided once very basic information about the matrix and the underlying hardware is known.
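A minimal sketch of this selection is given below; it directly implements the piecewise mapping above, while the function name and the integer rounding of nnz/row (the average row length) are assumptions made for illustration.

/* Choose the tile height sigma from the average number of nonzero entries
 * per row, given the hardware-dependent bounds r, s, t and the fallback
 * value u from the piecewise mapping above. */
int choose_sigma(int nnz, int m, int r, int s, int t, int u)
{
    int nnz_per_row = nnz / m;            /* average row length */
    if (nnz_per_row <= r) return r;       /* prevent a too small sigma */
    if (nnz_per_row <= s) return nnz_per_row;
    if (nnz_per_row <= t) return s;       /* prevent a too large sigma */
    return u;                             /* very long rows: small sigma */
}

/* For double precision SpMV on an nVidia Maxwell GPU, the text uses
 * <r, s, t, u> = <4, 32, 256, 4>:
 *     int sigma = choose_sigma(nnz, m, 4, 32, 256, 4);                  */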
Therefore, we can see that the parameter tuning time becomes negligible because
ω and σ are easily obtained. This can save a great deal of preprocessing time.
‘1000 ... 000’, and tile pointer 0 is stored as ‘0000 ... 000’. To the best of our knowledge, an index of 31 or 63 bits is completely compatible with most numerical libraries such as Intel MKL. Moreover, the reference implementation of the HPCG benchmark [62] also uses a 32-bit signed integer for problem dimensions no larger than $2^{31}$ and a 64-bit signed integer for problem dimensions larger than that. Thus it is safe to use 1 bit as the empty row hint and the other 31 or 63 bits as a ‘real’ row index.
The third array, seg_offset of size ω, is used for accelerating the local segmented sum in the workload of each tile. The local segmented sum is an essential step that synchronizes partial sums in a 2D tile (imagine multiple columns in the tile come from the same matrix row). In the previous segmented sum (or segmented scan) methods [28, 40, 161, 63], the local segmented sum is complex and not efficient enough. Thus we use the method described in Section 4.4.4, which introduces a seg_offset array for fast segmented sum.
To generate seg_offset, we let each column search its right neighbor columns and count the number of contiguous columns without any TRUEs in their bit_flag. Using Tile 0 in Figure 6.2 as an example, its 2nd column has one and only one right neighbor column (the 3rd column) without any TRUEs in its bit_flag. Thus the 2nd column’s seg_offset value is 1. In contrast, because the other three columns (the 1st, 3rd and 4th) do not have any ‘all FALSE’ right neighbors, their values in seg_offset are 0. Algorithm 25 shows how to generate seg_offset using a SIMD-friendly method.
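The following serial sketch illustrates the counting rule described above (the thesis's Algorithm 25 gives the actual SIMD-friendly version). The layout assumed here, bit_flag stored column by column with sigma flags per column, is a simplification of the packed tile_desc encoding.

#include <stdbool.h>

/* Serial illustration of seg_offset generation for one 2D tile.
 * bit_flag[c * sigma + i] is assumed to be the flag of the i-th entry of
 * column c. seg_offset[c] counts how many contiguous right neighbor
 * columns of column c contain no TRUE flag at all. */
void generate_seg_offset(const bool *bit_flag, int omega, int sigma,
                         int *seg_offset)
{
    for (int c = 0; c < omega; c++) {
        int count = 0;
        for (int rc = c + 1; rc < omega; rc++) {
            bool has_true = false;
            for (int i = 0; i < sigma; i++) {
                if (bit_flag[rc * sigma + i]) { has_true = true; break; }
            }
            if (has_true)
                break;          /* stop at the first neighbor with a TRUE */
            count++;
        }
        seg_offset[c] = count;
    }
}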
The seg_offset array is used for calculating the fast segmented sum through an inclusive prefix-sum scan. See Section 4.4.4 for details.
The last array, empty_offset, occurs when and only when a 2D tile includes any empty rows (i.e., its tile pointer is negative). Because an empty row of a matrix has the same row pointer as its rightmost non-empty neighbor row (recall the second row in the matrix A in Figure 6.2), y_offset would record an incorrect offset for it. We correct for this by storing correct offsets for the segments within a tile. Thus the length of empty_offset is the number of segments (i.e., the total number of TRUEs in bit_flag) in a tile. For example, Tile 0 in Figure 6.2 has 4 entries in its empty_offset since its bit_flag includes 4 TRUEs. Algorithm 26 lists the pseudocode that generates empty_offset for a tile that contains at least one empty row.
the CSR format. Additionally, some applications such as finite element methods can
directly assemble sparse matrices in the CSR5 format from data sources.
7. Level 1: Sparse Vector Operations
7.1 Overview
As shown in Section 2.3, many Level 2 and Level 3 sparse operations can directly use Level 1 routines as building blocks; thus their performance is crucial for all Sparse BLAS routines. This chapter mainly focuses on the sparse vector–sparse vector addition operation, since it is used as part of the SpGEMM method described in Chapter 9.
7.2 Contributions
In this chapter, a heuristic approach for adding two sparse vectors is developed.
Depending on the upper bound length of the sparse output vector, the approach
maps the sparse input vectors to a heap method, a bitonic ESC method or a merge
method.
input sparse vectors are short enough and the size of them is long enough.
Figure 7.1: Two steps of an example of the heap method. From (a) to (b), the root entry is fused with the first entry in the resulting sequence since they share the same index. From (b) to (c), the root entry is inserted into the sequence since they have different indices. After each step, the heap property is reconstructed.
Figure 7.3: An example of the merge method. The original status of the resulting
sequence contains the 1st input vector. The input sequence as the 2nd input vector
can be stored in the register file. Its mask sequence and the resulting sequence are in
the scratchpad memory.
1 Actually according to the CSR format standard, the column indices in each row do not necessarily
have to be sorted. But most implementations choose to do so, thus our method reasonably makes this
assumption.
8. Level 2: Sparse Matrix-Vector Operations
8.1 Overview
In the Level 2 Sparse BLAS, multiplication of a sparse matrix and a dense or sparse
vector is very useful and challenging. In this chapter, we mainly consider fast algo-
rithms for sparse matrix-vector multiplication (SpMV for short), which is perhaps the most widely used non-trivial BLAS routine in computational science and modeling. In the next chapter, a Level 3 BLAS routine, SpGEMM, will be introduced. That operation can be seen as a more generic case of the multiplication of a sparse matrix and a sparse vector.
The SpMV operation multiplies a sparse matrix A of size m × n by a dense vector
x of size n and gives a dense vector y of size m. The naïve sequential implementation
of SpMV can be very simple, and can be easily parallelized by adding a few pragma
directives for the compilers [112]. But to accelerate large-scale computation, parallel
SpMV is still required to be hand-optimized with specific data storage formats and
algorithms [7, 8, 16, 21, 33, 35, 38, 45, 58, 76, 80, 102, 115, 116, 122, 170, 173, 166, 177,
184, 188, 191, 193, 194, 192] (see Chapter 5 for some recently proposed formats for
specific hardware architectures such as GPUs and Xeon Phi). The experimental results
showed that these formats can provide performance improvement for various SpMV
benchmarks.
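As a reference point, such a naïve CSR SpMV with a single pragma could look like the sketch below (a generic illustration, not the thesis's benchmark code; the function name is hypothetical).

/* Naive CSR-based SpMV, y = A*x, parallelized over rows with one OpenMP
 * pragma. When row lengths differ greatly, this per-row parallelization
 * becomes load imbalanced, which is the problem addressed in this chapter. */
void spmv_csr_naive(int m, const int *row_ptr, const int *col_idx,
                    const double *val, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}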
However, the completely new formats bring several new problems. The first one
is backward-compatibility. When the input data are stored in basic formats (e.g., CSR),
a format conversion is required for using the new format based SpMV. In practice,
fusing a completely new format into well-established toolkits (e.g., PETSc [12]) for
scientific software is not a trivial task [132] because of the format conversion. More-
over, Kumbhar [107] pointed out that once an application (in particular a non-linear
solver) needs repeated format conversion after a fixed small number of iterations, the
new formats may degrade overall performance. Furthermore, Langr and Tvrdík [110]
demonstrated that isolated SpMV performance is insufficient to evaluate a new for-
mat. Thus more evaluation criteria, such as format conversion cost and memory
footprint, must be taken into consideration. Secondly, when the SpMV operation is
used with other sparse building blocks (such as preconditioning operations [116] and
SpGEMM [118]) that require basic storage formats, using the all-new formats is less
feasible.
Therefore, if we can reduce the cost of format conversion and extra memory
footprint to a certain level, a new format can offer faster SpMV, compared to the
CSR format, in an iterative scenario. Otherwise, accelerating CSR-based SpMV can
give the best performance, in particular when the number of iterations is low. In this
chapter, we introduce our SpMV algorithms both for the CSR format and for the CSR5
format described in Chapter 6.
8.2 Contributions
Our work described in this chapter particularly focuses on CSR-based SpMV on
CPU-GPU heterogeneous processors and CSR5-based SpMV on CPUs, GPUs and
Xeon Phi.
The main idea of our CSR-based SpMV algorithm is to first speculatively execute SpMV on a heterogeneous processor’s GPU cores, targeting high throughput computation, and then locally re-arrange the resulting vector using the CPU cores of the same chip, which offer low-latency memory access. To achieve a load balanced first step and to utilize both the CPU and the GPU cores, we improved the conventional segmented sum method by generating auxiliary information (e.g., the segment descriptor) at runtime and recognizing empty rows on the fly. Compared to the row block methods for the CSR-based SpMV, our method delivers load balanced computation to achieve higher throughput. Compared with the classic segmented sum method for the CSR-based SpMV, our approach decreases the overhead of global synchronization and removes the pre- and post-processing regarding empty rows.
The CSR-based SpMV work makes the following contributions:
The CSR5 format described in Chapter 6 extends the basic CSR format to use SIMD units efficiently, and can thus be an alternative to the CSR format. The CSR5 format merely needs a very short format conversion time (a few SpMV operations) and a very small extra memory footprint (around 2% of the CSR data). Because the CSR5 format shares data with the CSR format, the CSR-based Sparse BLAS routines can efficiently work with the CSR5 format.
The CSR5-based SpMV work makes the following contributions:
• The work is implemented on four mainstream devices: CPU, nVidia GPU, AMD
GPU and Intel Xeon Phi.
• The CSR5-based SpMV is evaluated in both isolated SpMV tests and iteration-
based scenarios.
Although various strategies, such as data streaming [51, 80], memory coalescing [58], data reordering or reconstruction [16, 84, 149], static or dynamic binning [7, 80], Dynamic Parallelism [7] and dynamic row distribution [123], have been developed, none of them can fundamentally solve the problem of load imbalance, and thus they provide relatively low SpMV performance for the CSR format.
Blelloch et al. [28] pointed out that the segmented sum may be more attractive for the
CSR-based SpMV, since it is SIMD friendly and insensitive to the sparsity structure
of the input matrix, thus overcoming the shortcomings of the row block methods. A
serial version of segmented sum is shown in Algorithm 12.
In the SpMV operation, the segmented sum treats each matrix row as a segment
and calculates a partial sum for the entry-wise products generated in each row.
The SpMV operation using the segmented sum methods consists of four steps: (1) generating an auxiliary bit_flag array of size nnz from the row_ptr array, where an entry in bit_flag is flagged as TRUE if its location matches the first nonzero entry of a row and FALSE otherwise, (2) calculating all intermediate entries (i.e., entry-wise products) into an array of size nnz, (3) executing the parallel segmented sum on the array, and (4) collecting all partial sums into the result vector y if a row is not empty. Algorithm 30 lists the pseudocode. Figure 8.2 illustrates an example using
the matrix A plotted in Figure 8.1. We can see that once the heaviest workload, i.e.,
step 3, is parallelized through a fast segmented sum method described in [40, 63, 161],
nearly perfect load balance can be expected in all steps of Algorithm 30.
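A serial sketch of these four steps is shown below (the segmented sum itself is also written serially for clarity; Algorithm 30 is the parallel version). The function name is hypothetical and only the structure of the four steps follows the text.

#include <stdbool.h>
#include <stdlib.h>

/* Serial illustration of the four-step segmented-sum SpMV, y = A*x,
 * for a CSR matrix with m rows and nnz nonzero entries. */
void spmv_segmented_sum(int m, int nnz, const int *row_ptr,
                        const int *col_idx, const double *val,
                        const double *x, double *y)
{
    bool   *bit_flag = calloc(nnz, sizeof(bool));
    double *product  = malloc(nnz * sizeof(double));

    /* Step 1: scatter the row starts from row_ptr into bit_flag. */
    for (int i = 0; i < m; i++)
        if (row_ptr[i] < row_ptr[i + 1])          /* skip empty rows */
            bit_flag[row_ptr[i]] = true;

    /* Step 2: compute the entry-wise products in the nonzero entry space. */
    for (int j = 0; j < nnz; j++)
        product[j] = val[j] * x[col_idx[j]];

    /* Step 3: segmented sum over product, segments delimited by bit_flag;
     * after the backward sweep each segment head holds its segment's sum. */
    for (int j = nnz - 2; j >= 0; j--)
        if (!bit_flag[j + 1])
            product[j] += product[j + 1];

    /* Step 4: gather the partial sums from the segment heads into y
     * (empty rows simply receive zero). */
    for (int i = 0; i < m; i++)
        y[i] = (row_ptr[i] < row_ptr[i + 1]) ? product[row_ptr[i]] : 0.0;

    free(bit_flag);
    free(product);
}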
Figure 8.3 gives a performance comparison of the row block method from the cuSPARSE v6.5 library and the segmented sum method from the cuDPP v2.2 library [76, 161]. It shows that the row block method can significantly outperform the segmented sum method when doing SpMV on relatively regular matrices (see Table A.1 for details of the used benchmark suite). On the other hand, the row block method only gives very low performance for irregular matrices.
Why is this the case? We can see that step 1 of the segmented sum method is a scatter operation and step 4 is a gather operation, both in the row space of size m. This prevents the two steps from being fused with steps 2 and 3, which work in the nonzero entry space of size nnz. In this case, more global synchronizations and global memory accesses may degrade the overall performance. Previous research [21, 170] has found that the segmented sum may be more suitable for the COO-based SpMV, since the fully stored row index data can move steps 1 and 4 into the nonzero entry space:
the bit flag array can be generated by comparison of neighbor row indices, and
the partial sums in the product array can be directly saved to y since their final
locations are easily known from the row index array. Further, Yan et al. [188] and
Tang et al. [173] reported that some variants of the COO format can also benefit from
the segmented sum. However, it is well known that accessing row indices in the COO
pattern brings higher off-chip memory pressure, which is just what the CSR format
tries to avoid.
In Figure 8.3, we can also see that the CSR5-based SpMV can obtain up to 4x speedup over the CSR-based SpMV using the segmented sum primitive [161]. The main reason is that the CSR5-based SpMV can utilize both the segmented sum for load balance and the compressed row data for better load/store efficiency. The details will be shown below.
Data Decomposition
In the proposed CSR-based SpMV, we first evenly decompose nonzero entries of the
input matrix to multiple small tiles for load balanced data parallelism. Here we define
a tile as a 2D array of size W × T . The width T is the size of a thread-bunch, which is
the minimum SIMD execution unit in a given vector processor. It is also known as
wavefront in AMD GPUs or warp in nVidia GPUs. The height W is the workload (i.e.,
the number of nonzero entries to be processed) of a thread. A tile is a basic work unit
in matrix-based segmented sum method [161, 63], which is used as a building block
in our SpMV algorithm. Actually, the term “tile” is equivalent to the term “matrix”
used in original description of the segmented scan algorithms [161, 63]. Here we use
“tile” to avoid confusion between a work unit of matrix shape and a sparse matrix in
SpMV.
Since a thread-bunch can be too small (e.g., as low as 8 in current Intel GPUs) to amortize the scheduling cost, we combine multiple thread-bunches into one thread-group for possibly higher throughput. We define B to denote the number of thread-bunches in one thread-group. Additionally, we let each thread-bunch compute S contiguous tiles. Thus higher on-chip resource reuse and faster global synchronization are expected.
Therefore, we can calculate that each thread-group deals with $BSWT$ nonzero entries. Thus the whole nonzero entry space of size nnz can be evenly assigned to $\lceil nnz/(BSWT) \rceil$ thread-groups. Figure 8.4 shows an example of the data decomposition. In this example, we set B = 2, S = 2, W = 4, and T = 2. Thus each thread-group is responsible for 32 nonzero entries. Then $\lceil nnz/32 \rceil$ thread-groups are dispatched.
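In code, the dispatch size is a single ceiling division; the function below (hypothetical name) reproduces the example above.

/* Number of thread-groups needed to cover nnz nonzero entries when each
 * thread-group processes B*S*W*T of them (ceiling division). */
int num_thread_groups(int nnz, int B, int S, int W, int T)
{
    int per_group = B * S * W * T;
    return (nnz + per_group - 1) / per_group;
}

/* With the example parameters B = 2, S = 2, W = 4 and T = 2 from the text,
 * per_group is 32, so ceil(nnz/32) thread-groups are dispatched. */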
Figure 8.4: Data decomposition on the nonzero entry space. nnz nonzero entries
are assigned to multiple thread-groups. In this case, each thread-group consists of 2
thread-bunches (i.e., B = 2). The number of threads in each thread-bunch is equal
to 2 (i.e., T = 2). The workload per thread is 4 (i.e., W = 4). The number of iterative
steps in each thread-bunch is 2 (i.e., S = 2).
Figure 8.5: An example of our CSR-based SpMV algorithm. The input sparse matrix
contains 48 nonzero entries in 12 rows (10 non-empty rows and 2 empty rows). One
thread-bunch composed of 4 threads is launched in this 2-iteration process. The arrays
synchronizer and speculator store tuples (shown with angular brackets).
the locations to store the generated partial sums. Lines 3–7 of Algorithm 31 give the code expression of this step. In our example shown in Figure 8.5, the 2 tiles of size 24 have 3 boundaries {0, 24, 48}. The results of the binary search of {0, 24, 48} on the CSR row pointer array are {0, 4, 12}. Note that the binary search needs to return the rightmost match if multiple slots have the same value.
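A minimal serial sketch of such a rightmost binary search is given below; the thesis's Algorithm 31 performs it on the device, and the function name here is hypothetical.

/* Rightmost binary search on the CSR row pointer: returns the largest
 * index i in [0, m] such that row_ptr[i] <= boundary, so that equal
 * entries (produced by empty rows) resolve to the rightmost match. */
int search_row_rightmost(const int *row_ptr, int m, int boundary)
{
    int lo = 0, hi = m;                      /* row_ptr has m + 1 entries */
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;    /* upper middle to make progress */
        if (row_ptr[mid] <= boundary)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}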
Then each thread-bunch executes an iteration of S steps. Lines 8–59 of Algorithm 31 give the code expression of this step. Each iteration deals with one tile. By calculating the offsets between the left boundary of a tile and the covered row indices, a local segment descriptor is generated (lines 14–21 in Algorithm 31). For example, the left boundary of the second tile is 24 and its row index range is 4–12. We need to compute the offsets between 24 and the row pointer values {19, 27, 29, 29, 34, 37, 37, 44, 48}. Then we obtain a group of offsets {-5, 3, 5, 5, 10, 13, 13, 20, 24}. After removing duplicate values and overflowed values on the left and the right sides, the effective part {3, 5, 10, 13, 20} in fact implies the local segment descriptor for the current tile. We can easily convert it to a binary expression {0, 0, 0, 1, 0, 1, 0, ... , 0, 0, 1, 0, 0, 0} through a scatter operation in on-chip scratchpad memory. Moreover, since each tile is an independent work unit, the first bit of its segment descriptor should be TRUE. Thus the final expression becomes {1, 0, 0, 1, 0, 1, 0, ... , 0, 0, 1, 0, 0, 0}. In Figure 8.5, the filled and empty circles are the heads (i.e., 1s or TRUEs) and bodies (i.e., 0s or FALSEs) of segments, respectively.
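The sketch below mirrors this offset-and-scatter construction serially; the function and parameter names are hypothetical, and the device code in Algorithm 31 additionally performs the duplicate detection in parallel.

#include <stdbool.h>

/* Illustration of local segment descriptor generation for one tile.
 * The row pointer entries with indices in [row_start, row_end] (obtained
 * from the boundary searches of the previous step) are compared against
 * the tile's left boundary tile_left; valid offsets are scattered as TRUE. */
void build_segment_descriptor(const int *row_ptr, int row_start, int row_end,
                              int tile_left, int tile_size, bool *descriptor)
{
    for (int j = 0; j < tile_size; j++)
        descriptor[j] = false;
    for (int i = row_start; i <= row_end; i++) {
        int offset = row_ptr[i] - tile_left;
        /* out-of-range offsets are discarded; duplicates (empty rows)
         * simply overwrite the same slot, which is what triggers the
         * "dirty" marking in Algorithm 31 */
        if (offset > 0 && offset < tile_size)
            descriptor[offset] = true;
    }
    descriptor[0] = true;       /* each tile is an independent work unit */
}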
While generating the segment descriptor, each thread detects whether or not its right neighbor wants to write to the same slot. If yes (like the duplicate offset values {..., 5, 5, ...} and {..., 13, 13, ...} in the above example), we know that this tile contains at least one empty row, since an empty row is expressed as two contiguous indices of the same value in the CSR row pointer array. Then we mark this tile as “dirty” (line 19 in Algorithm 31). Further, the dirty counter array stored in DRAM is incremented by an atomic operation, and this tile’s offset is recorded in the speculator array (lines 53–58 in Algorithm 31). In our example, dirty counter is 1 and the speculator array has a pair of offsets ⟨4, 12⟩ (shown with angular brackets in Figure 8.5).
Then we calculate and save the element-wise products in scratchpad memory, based on the nonzero entries’ column indices, values and the corresponding values in the vector x. Lines 22–26 of Algorithm 31 show the code expression of this step. When finished, we transmit the sum of the last segment to an intermediate space for the next iteration (lines 27–31 in Algorithm 31). In our example, the first tile’s last value 5e is transmitted to the next tile. Then we execute the matrix-based segmented sum (lines 32–33) on the tile. Because the segmented sum algorithm used here is very similar to the method described in [28], we refer the reader to [28] and several previous GPU segmented sum algorithms [161, 63] for details. But note that, compared to [28], our method makes one difference: we store partial sums in a compact pattern (i.e., values are arranged in order from the first location in the thread work space), rather than saving them to the locations of the corresponding segment heads. For this reason, we need to record the starting position and the number of partial sums. Then we can use an ordinary exclusive scan operation (lines 34–35) for obtaining contiguous indices of the partial sums in y. In Figure 8.5, we can see that the partial sums (expressed as filled hexagons)
are aggregated in the compact fashion. Note that empty hexagons are intermediate
partial sums, which are already added to the correct position of segment heads.
Finally, we store the partial sums to known locations in the resulting vector. Lines 36–52 of Algorithm 31 show the code expression. As an exception, the sum result of the first segment in a thread-bunch is stored to the synchronizer array (lines 40–43), since the first row of each thread-bunch may cross multiple thread-bunches. This is a well known issue when conducting basic primitives, such as reduction and prefix-sum scan, using more than one thread-group, since thread-groups cannot communicate with each other. In fact, an atomic add operation could be utilized to avoid the global synchronization. But we choose not to use relatively slow global atomic operations and instead let a CPU core finish the global synchronization later on. Lines 62–68 of Algorithm 31 show the corresponding code expression. Since the problem size (i.e., $\lceil nnz/(SWT) \rceil$) can be too small to saturate a GPU core, a CPU core is in fact faster for accessing short arrays linearly stored in DRAM. Taking the first tile in Figure 8.5 as an example, its first partial sum is 3a, which is stored with its global index 0 to the synchronizer. After that, the value 3a is added to position 0 of y.
When the above steps are complete, the resulting vector is numerically complete, except that some values generated by dirty tiles are not in their correct locations. In Figure 8.5, we can see that after synchronization, the vector y is already numerically identical to its final form, but the entries 5g, 3h, 7i and 4j generated by the second tile are located in wrong slots.
The checking prediction stage first checks the value of the dirty counter array. If it is zero, the previous prediction is correct and the result of the first stage is the final result; if it is not zero, the predicted entries generated by dirty tiles are scattered to their correct positions in the resulting vector. In this procedure, the CSR row pointer array has to be read to get the correct row distribution information. Again, we use a CPU core for this irregular linear memory access, which is more suitable for the cache sub-systems in CPUs. In our example, the entries 5g, 3h, 7i and 4j are moved to their correct positions. Then the SpMV operation is done.
Complexity Analysis
Our CSR-based SpMV algorithm pre-allocates three auxiliary arrays, synchronizer, dirty counter and speculator, in off-chip DRAM. The space complexity of synchronizer is $\lceil nnz/(SWT) \rceil$, equivalent to the number of thread-bunches. The size of dirty counter is a constant 1. The speculator array needs a size of $\lceil nnz/(WT) \rceil$, equivalent to the number of tiles. Since W and T are typically set to relatively large values, the auxiliary arrays only slightly increase the overall space requirement.
For each thread-bunch, we execute S + 1 binary searches in the row pointer array of size m + 1. Thus the work complexity of this part is $O(\lceil nnz/(SWT) \rceil \times (S+1) \times \log_2(m+1)) = O(nnz \log_2(m)/WT)$. On the whole, generating the segment descriptor needs O(m) time. Collecting the element-wise products needs O(nnz) time. For each tile, the segmented sum needs $O(WT + \log_2(T))$ time. Thus all segmented sum operations need $O(\lceil nnz/(WT) \rceil (WT + \log_2(T))) = O(nnz + nnz \log_2(T)/WT)$ time. Saving entries to y needs O(m) time. Synchronization takes $O(\lceil nnz/(SWT) \rceil) = O(nnz/(SWT))$ time. Possible re-arrangement needs O(m) time in the worst case. Thus the overall work complexity of our CSR-based SpMV algorithm is $O(m + nnz + nnz(\log_2(m) + \log_2(T))/WT)$.
Implementation Details
Based on the above analysis, we can see that when the input matrix is fixed, the cost
of our SpMV algorithm only depends on two parameters: T and W . In our algorithm
implementation, T is set to SIMD length of the used processor. Choosing W needs
to consider the capacity of on-chip scratchpad memory. The other two parameters
B and S are empirically chosen. Table 8.1 shows the selected parameters. Note that
double precision is not currently supported in Intel OpenCL implementation for its
GPUs.
We implement the first stage of our algorithm in OpenCL for the Intel and AMD
platforms (and CUDA for the nVidia platform) for GPU execution and the second
stage in standard C language running on the CPU part. Since our algorithm needs the CPU and GPU to share some arrays, we allocate all arrays in the Shared Virtual Memory supported by OpenCL for the best performance. On the nVidia platform, we use
Unified Memory in CUDA SDK.
approaches are running on all CPU cores of the used heterogeneous processors. For
the Intel CPU, we report results from MKL, since it always delivers the best perfor-
mance and the pOSKI is not supported by the used Microsoft Windows operating
system. For the AMD CPU, we report the best results of the three libraries, since none
of the three libraries outperforms all the others. For the ARM CPU included in the
nVidia Tegra K1 platform, we only report results from OpenMP, since the current
pOSKI and Intel MKL implementations do not support the ARM architecture. Moreover, a single-threaded naïve CPU implementation is included in our benchmark as well.
On GPUs, we benchmark variants of the CSR-scalar and the CSR-vector algorithms
proposed in [21]. The OpenCL version of the CSR-scalar method is extracted from PARALUTION v1.0.0 [126] and evaluated on the AMD platform. The OpenCL
implementation of the CSR-vector method is extracted from semantically equivalent
CUDA code in the CUSP library v0.4.0 and executed on both the Intel and the AMD
platforms. On the nVidia platform, we run the CSR-based SpMV from vendor-
supplied cuSPARSE v6.0 and CUSP v0.4.0 libraries.
For all tests, we run SpMV 200 times and record the averages. The implicit data transfer (i.e., copying the matrix and vector data from their sources to OpenCL Shared Virtual Memory or CUDA Unified Memory) is not included in our evaluation, since the SpMV operation is normally one building block of a more complex application. All participating methods conduct general SpMV, meaning that symmetry is not considered although some input matrices are symmetric. The throughput (flops per second) is calculated by $2 \times nnz / runtime$.
The bandwidth (bytes per second) is calculated by
Benchmark Suite
Performance Analysis
Figures 8.6 and 8.7 show throughput of single precision and double precision SpMV
of the tested CSR-based approaches, respectively.
In Figure 8.6, we can see that on the Intel heterogeneous processor, our approach
obtains up to 6.90x and on average 2.57x speedup over the CSR-vector method run-
ning on the used GPU. Although the speedup mainly comes from irregular matrices,
our method generally does not obviously lose performance on regular matrices. Fur-
ther, compared to CPU cores running MKL, both GPU SpMV algorithms are slower.
For our algorithm, the main reason is that the integrated GPU implements scratchpad
memory in its L3 cache, which has one order of magnitude higher latency compared
to the fast scratchpad in nVidia or AMD GPUs. Our algorithm in fact heavily uses scratchpad memory for storing and reusing the segment descriptor, the element-wise products and other data shared by threads. Thus even though the GPU part of the Intel
heterogeneous processor has higher single precision theoretical peak performance
than its CPU part, the delivered SpMV throughput is lower than expected. For the
CSR-vector method, the low performance has another reason: small thread-bunch
of size 8 dramatically increases loop overhead [15], which is one of the well known
bottlenecks [74] of GPU programming.
In Figures 8.6 and 8.7, we can see that on the AMD heterogeneous processor, our
Figure 8.6: Throughput (GFlop/s) of the single precision CSR-based SpMV algorithms
running on the three platforms.
Figure 8.7: Throughput (GFlop/s) of the double precision CSR-based SpMV algo-
rithms running on the AMD and the nVidia platforms.
method delivers up to 71.90x (94.05x) and on average 22.17x (22.88x) speedup over
the single (double) precision CSR-scalar method running on the used GPU. Compared
to the GPU CSR-vector method, our algorithm achieves up to 16.07x (14.43x) and
on average 5.61x (4.47x) speedup. The CSR-scalar and the CSR-vector methods give
very low throughput while running the last 6 irregular matrices, because of the
problem of load imbalance. Further, we find that the Intel heterogeneous processor’s GPU is actually faster than the AMD GPU when running the last 6 matrices. The reason is that the shorter thread-bunch (8 in the Intel GPU vs. 64 in the AMD GPU) reduces the SIMD idle cost by executing a much shorter vector width on a dramatically imbalanced row distribution. On the other hand, for several very regular matrices with short rows, e.g., Epidemiology, the CSR-scalar method offers the best performance because of almost perfect load balance and execution of short rows without loop cost. For most regular matrices, our method delivers performance comparable to the best CPU algorithm.
In Figures 8.6 and 8.7, we can see that on the nVidia platform, our method delivers
up to 5.91x (6.20x) and on average 2.69x (2.53x) speedup over the single (double)
precision SpMV in the CUSP library running on the used GPU. Compared to cuSPARSE, our method has higher speedups. Since both libraries use the CSR-vector algorithm, those speedups are within expectations. Considering that the Tegra K1 platform only contains a single GPU core, the problem of load imbalance on this device is not as severe as on the AMD platform above. As a result, the speedups are not as high as those on the AMD processor. Here our method delivers on average 1.41x (1.42x) speedup over the OpenMP-accelerated SpMV on the quad-core ARM CPU in the single (double) precision benchmark.
Figure 8.8 shows bandwidth utilization of our algorithm proposed in this section.
We can see that the regular matrices can use bandwidth more efficiently compared to
the irregular ones. Considering the throughput speedups listed above, our method
can obtain higher bandwidth utilization than the other CSR-based SpMV algorithms
running on GPUs.
Parameter Selection
We further conduct experiments to explore how the selected parameters influence the overall performance.
Figure 8.9 shows the dependency of the overall performance (harmonic means of the 20 benchmarks) on the parameters, while we fix all the parameters except for the parameter W (i.e., the workload per thread). We can see that in general the overall performance goes up as the parameter W increases. This trend matches the complexity analysis described above. However, when W is larger than a certain value, the overall performance degrades. The reason is that device occupancy may decrease when more on-chip scratchpad memory is allocated for the W T work space of each thread-bunch.
Figure 8.10 shows the trend of the overall performance while we change parameter
S (i.e., the number of iterations of each thread-bunch) and fix all the other parameters.
We can see that if we assign more work to each thread-bunch, a better performance
can be expected. The performance improvement mainly comes from higher on-chip
resource reuse.
Figure 8.8: Bandwidth utilization (GB/s) of our CSR-based SpMV algorithm running on the three platforms. The theoretical bandwidths from the hardware specifications are marked using black lines.
(a) Intel, SP (b) AMD, SP (c) nVidia, SP (d) AMD, DP (e) nVidia, DP
Figure 8.9: Single precision (SP) and double precision (DP) SpMV performance of our algorithm on the three platforms while parameter W changes and all the others are fixed to the best observed values (see Table 8.1).
(a) Intel, SP (b) AMD, SP (c) nVidia, SP (d) AMD, DP (e) nVidia, DP
Figure 8.10: Single precision (SP) and double precision (DP) SpMV performance of our algorithm on the three platforms while parameter S changes and all the others are fixed to the best observed values (see Table 8.1).
While running the CSR5-based SpMV, each column in a tile can extract information from bit_flag and label the segments in its local data with three colors: (1) red means a sub-segment unsealed from its top, (2) green means a completely sealed segment in the middle, and (3) blue means a sub-segment unsealed from its bottom. There is one exception: if a column is unsealed both from its top and from its bottom, it is colored red.
Algorithm 32 shows the pseudocode of the CSR5-based SpMV algorithm. Fig-
ure 8.11 plots an example of this procedure. We can see that the green segments can
directly save their partial sums to y without any synchronization, since the indices can be calculated by using tile_ptr and y_offset. In contrast, the red and the blue sub-segments have to further add their partial sums together, since they are not complete segments. For example, the sub-segments B2, R2 and R3 in Figure 8.11 contribute to the same row, thus an addition is required. This addition operation needs the fast segmented sum shown in Algorithm 13 and Figure 4.3. Furthermore, if a tile has any empty rows, the empty_offset array is accessed to get the correct global indices in y.
Figure 8.11: The CSR5-based SpMV in a tile. Partial sums of the green segments are
directly stored to y. The red and the blue sub-segments require an extra segmented
sum before issuing off-chip write.
Considering the synchronization among tiles: since the same matrix row can be influenced by multiple 2D tiles running concurrently, the first and the last segments of a tile need to be stored to y by an atomic add (or via a global auxiliary array used in a device-level reduction, scan or segmented scan [63, 161]). In Figure 8.11, the atomic add operations are highlighted by arrow lines with plus signs.
For the last entries that are not in a complete tile (e.g., the last two nonzero entries of the matrix in Figure 6.2), we execute a conventional CSR-vector method after all of the complete 2D tiles have been consumed. Note that even though the last tile (i.e., the incomplete one) does not have tile_desc arrays, it can extract a starting position from tile_ptr.
In Algorithm 32, we can see that the main computation (lines 5–21) only contains
very basic arithmetic and logic operations that can be easily programmed on all
mainstream processors with SIMD units. As the most complex part in our algorithm,
the fast segmented sum operation (line 22) only requires a prefix-sum scan, which
has been well-studied and can be efficiently implemented by using CUDA, OpenCL
or x86 SIMD intrinsics.
DDR3-1600 memory and 64-bit Ubuntu Linux v14.04 installed. Host of the Xeon Phi
is a machine with Intel Xeon E5-2680 v2 CPU, quad-channel DDR3-1600 memory and
64-bit Red Hat Enterprise Linux v6.5 installed. Detailed specifications of the used
four devices are listed in Tables B.1, B.2 and B.3.
Here we evaluate double precision SpMV. Therefore the cuDPP library [76, 161], clSpMV [170] and yaSpMV [188] are not included, since they only support single precision floating point as the data type. Two recently published methods [103, 173] are not tested since their source code is not yet available to us.
We use the OpenCL profiling scheme for timing SpMV on the AMD platform and record wall-clock time on the other three platforms. For all participating formats and algorithms, we evaluate SpMV 10 times (each time contains 1000 runs and records the average) and report the best observed result.
Benchmark Suite
In Table 8.4, we list 24 sparse matrices as our benchmark suite for all platforms. The
first 20 matrices have been widely adopted in previous SpMV research [8, 21, 80, 122,
170, 184, 188]. The other 4 matrices are chosen since they have more diverse sparsity
structures. Table A.1 in Appendix A gives more details of the matrices.
To achieve a high degree of differentiation, we categorize the 24 matrices in
Table 8.4 into two groups: (1) regular group with the upper 14 matrices, (2) irregular
group with the lower 10 matrices. This classification is mainly based on the minimum,
average and maximum lengths of the rows. Matrix dc2 is a representative of the group of irregular matrices. Its longest single row contains 114K nonzero entries, i.e., 15% of the nonzero entries of the whole matrix, which has 117K rows. This sparsity pattern challenges the design of an efficient storage format and SpMV algorithm.
Although the CSR-Adaptive method can obtain better scalability than the CSR-vector method, it still cannot achieve near perfect load balance. On the Xeon Phi, the CSR5 is slower than
Intel MKL and the ESB format. The main reason is that the current generation of Xeon Phi can only issue up to 4 relatively slow threads per core (i.e., up to 4 × 60 threads in total on the used device), and thus the latency of gathering entries from the vector x becomes the main bottleneck. Therefore reordering or partitioning the nonzero entries based on the column index for better cache locality, as done in the ESB-based SpMV, behaves well. However, later on we will show that this strategy leads to a very high preprocessing cost.
Figure 8.13 shows double precision SpMV performance of the 10 irregular matrices.
We can see that the irregularity can dramatically impact SpMV throughput of some
approaches. On the CPU platform, the row block method based Intel MKL is now
slower than the other methods. The CSR5 outperforms the others because of better
SIMD efficiency from the AVX2 intrinsics. On the nVidia GPU, the CSR5 brings the
best performance because of the near perfect load balance. The other two irregularity-
oriented formats, HYB and ACSR, behave well but still suffer from imbalanced work
Figure 8.12: The SpMV performance of the 14 regular matrices. (nGPU=nVidia GPU,
aGPU=AMD GPU)
Effects of Auto-Tuning
The format conversion from the CSR to the CSR5 includes four steps: (1) memory allocation, (2) generating tile_ptr, (3) generating tile_desc, and (4) transposition of the col_idx and val arrays. Figure 8.15 shows the cost of the four steps for the 24
matrices (the x axis is the matrix ids) on the four used platforms. Cost of one single
SpMV operation is used for normalizing format conversion cost on each platform.
We can see that the conversion cost can be on average as low as the overhead of a
few SpMV operations on the two GPUs. On the two x86 platforms, the conversion
time is longer (up to cost of around 10–20 SpMV operations). The reason is that the
conversion code is manually SIMDized using CUDA or OpenCL on GPUs, but only
auto-parallelized by OpenMP on x86 processors.
Iteration-Based Scenarios
Since both the preprocessing (i.e., format conversion from a basic format) time and the
SpMV time are important for real-world applications, we have designed an iteration-
based benchmark. This benchmark measures the overall performance of a solver with
n iterations. We assume the input matrix is already stored in the CSR format, so the overall cost of using the CSR format in these scenarios is $n\,T_{spmv}^{csr}$, where $T_{spmv}^{csr}$ is the execution time of one CSR-based SpMV operation. For a new format, the overall cost is $T_{pre}^{new} + n\,T_{spmv}^{new}$, where $T_{pre}^{new}$ is the preprocessing time and $T_{spmv}^{new}$ is the time of one SpMV operation using the new format. Thus we can calculate the speedup of a new format over the CSR format in these scenarios as $(n\,T_{spmv}^{csr})/(T_{pre}^{new} + n\,T_{spmv}^{new})$.
Tables 8.5 and 8.6 show the new formats’ preprocessing cost (i.e., $T_{pre}^{new}/T_{spmv}^{new}$) and
their speedups over the CSR format in the iteration-based scenarios when n = 50 and
n = 500. The emboldened font in the tables shows the highest positive speedups on
each platform. The compared baseline is the fastest CSR-based SpMV implementation
(i.e., Intel MKL, nVidia cuSPARSE/CUSP, CSR-vector from CUSP, and Intel MKL,
respectively) on each platform. We can see that because of the very low preprocessing
overhead, the CSR5 can further outperform the previous methods when doing 50
iterations and 500 iterations. Although two GPU methods, the ACSR format and
the CSR-Adaptive approach, in general have shorter preprocessing time, they suffer
from lower SpMV performance and thus cannot obtain the best speedups. On all
platforms, the CSR5 always achieves the highest overall speedups. Moreover, the
CSR5 is the only format that obtains higher performance than the CSR format when
only 50 iterations are required.
Table 8.5: Preprocessing cost and its impact on the iteration-based scenarios.
Table 8.6: Preprocessing cost and its impact on the iteration-based scenarios.
potential block layout. Therefore, block-based formats are mainly suitable for some
matrices generated from scientific computation problems, but may not fit irregular
matrices generated from graph applications. The methods proposed in this chapter are insensitive to the sparsity structure of the input matrix; thus a generally better performance is achieved.
A lot of research has focused on improving the row block method for CSR-based SpMV.
Williams et al. [184] proposed multiple optimization techniques for SpMV on multi-
core CPUs and Cell B.E. processor. Nishtala et al. [139] designed a high-level data
partitioning method for SpMV to achieve better cache locality on multicore CPUs.
Pichel et al. [147] evaluated how reordering techniques influence performance of
SpMV on GPUs. Baskaran and Bordawekar [16] improved off-chip and on-chip
memory access patterns of SpMV on GPUs. Reguly and Giles [151] improved thread
cooperation for better GPU cache utilization. Ashari et al. [7] utilized static reordering
and the Dynamic Parallelism scheme offered by nVidia GPUs for fast SpMV operation.
Greathouse et al. [80] grouped contiguous rows for better runtime load balancing
on GPUs. LightSpMV [123] proposed to dynamically distribute matrix rows over warps in order to obtain a more balanced CSR-based SpMV without the requirement of generating auxiliary data structures, and implemented this approach using atomic operations and warp shuffle functions as the fundamental building blocks. However, again, the row block methods cannot achieve good performance for input matrices with dramatically imbalanced row distributions. In contrast, our methods are independent of the sparsity structure of the input matrix.
As mentioned, using the segmented sum method as a building block is potentially a better generic approach to the CSR-based SpMV. An early segmented-sum-based GPU SpMV method was introduced by Sengupta et al. [161] and Garland [76] and implemented in the cuDPP library [86]. But the costs of the segmented sum and the global memory accesses degrade the overall SpMV performance. Zhang [195] improved backward seg-
mented scan for a better cache efficiency and implemented the CSR-based SpMV
on multicore CPUs. Recently, nVidia’s Modern GPU library [18] implemented an
improved reduction method, which has been used as a back-end of cuDPP. However, its performance still suffered from the pre- and post-processing of empty rows in global memory space. The segmented sum methods have been used in two recently published papers [173, 188] for SpMV on either GPUs or Xeon Phi. However, both of them need to store the matrix in COO-like formats to utilize the segmented sum. Our CSR-based SpMV method, in contrast, uses scratchpad memory more efficiently and utilizes the two types of cores in a heterogeneous processor for better workload distribution. Moreover, the CSR5 format saves the useful row index information in a compact way, and thus can be more efficient both for the format conversion and for the SpMV operation.
Compared with the CSR5 work designed for cross-platform SpMV on CPUs, GPUs and Xeon Phi, our CSR-based SpMV approach does not need to perform any format conversion or generate any auxiliary data for the input CSR matrix. Considering that the format conversion from the CSR to the CSR5 merely costs a few SpMV operations, the CSR5-based SpMV and the CSR-based SpMV can each find their own application scenarios, such as solvers with different numbers of iterations.
9. Level 3: Sparse Matrix-Matrix Operations
9.1 Overview
General matrix-matrix multiplication (GEMM) [114, 23, 128] is one of the most crucial operations in computational science and modeling. The operation multiplies a matrix A of size m × k with a matrix B of size k × n and gives a resulting matrix C of size m × n. In many linear solvers and graph problems, such as the algebraic multigrid method (AMG) [20], breadth first search [79], finding shortest paths [39], colored intersection [95] and sub-graphs [175], it is required to exploit the sparsity of the two input matrices and the resulting matrix, because their dense forms normally need huge storage space and computation cost for the zero entries. Therefore SpGEMM becomes a common building block in these applications.
Compared to CPUs, modern graphics processing units (GPUs) promise much
higher peak floating-point performance and memory bandwidth. Thus a lot of
research has concentrated on GPU accelerated sparse matrix-dense vector multipli-
cation [21, 120, 121] and sparse matrix-dense matrix multiplication [143, 176] and
achieved relatively attractive performance. However, despite the prior achievements
on these GPU sparse BLAS routines, massive parallelism in GPUs is still significantly
underused for the SpGEMM algorithm, because it has to handle three more challeng-
ing problems: (1) the number of nonzero entries in the resulting matrix is unknown
in advance, (2) very expensive parallel insert operations at random positions in the
resulting matrix dominate the execution time, and (3) load balancing must account
for sparse data in both input matrices with diverse sparsity structures.
Previous GPU SpGEMM methods [20, 57, 140, 51, 52, 83] have proposed a few
solutions for the above problems and demonstrated relatively good time and space
complexity. However, the experimental results showed that they either only work best
for fairly regular sparse matrices [57, 140, 83], or bring extra high memory overhead
for matrices with some specific sparsity structures [20, 51, 52]. Moreover, in the usual sense, none of these methods can consistently outperform the well optimized SpGEMM approach [92] for multicore CPUs.
9.2 Contributions
The work described in this chapter particularly focuses on improving GPU SpGEMM
performance for matrices with arbitrary irregular sparsity structures by proposing
more efficient methods to solve the above three problems on GPUs and emerging
CPU-GPU heterogeneous processors.
This chapter makes the following contributions:
• A hybrid method that initially allocates memory of upper bound size for short
rows and progressively allocates memory for long rows. The experimental
results show that our method saves a large amount of global memory space and
efficiently utilizes the very limited on-chip scratchpad memory.
• An efficient parallel insert method for long rows of the resulting matrix by
using the fastest merge algorithm available on GPUs. We make an experimental
evaluation and choose GPU merge path algorithm from five candidate GPU
merge approaches.
• A load balancing oriented heuristic method that assigns rows of the resulting
matrix to multiple bins with different subsequent computational methods. Our
approach guarantees load balancing in all calculation steps.
compared to merely using its GPU cores, our framework delivers on average 1.2x (up
to 1.8x) speedup while utilizing re-allocatable shared virtual memory in the system.
$c_{i*} = (a_{ik}b_{k1} + a_{il}b_{l1},\ a_{ik}b_{k2} + a_{il}b_{l2},\ \ldots,\ a_{ik}b_{kp} + a_{il}b_{lp}).$
We can see that in this case, only the entries in the kth and the lth rows of B contribute to the ith row of C. Then the row vector form instead of the column vector form is used for the matrix B. So we obtain
$c_{i*} = a_{ik}b_{k*} + a_{il}b_{l*}.$
Since the matrix B is sparse as well, again without loss of generality, we assume that the kth row of B has only two nonzero entries, in the rth and the tth columns, and the lth row of B also has only two nonzero entries, in the sth and the tth columns. So the two rows are given by $b_{k*} = (b_{kr}, b_{kt})$ and $b_{l*} = (b_{ls}, b_{lt})$. Then
$c_{i*} = a_{ik}(b_{kr}, b_{kt}) + a_{il}(b_{ls}, b_{lt}).$
Because the matrix C is also sparse and the ith row of C only has three nonzero entries, in the rth, the sth and the tth columns, the row can be given by
$c_{i*} = (c_{ir}, c_{is}, c_{it}),$
where $c_{ir} = a_{ik}b_{kr}$, $c_{is} = a_{il}b_{ls}$ and $c_{it} = a_{ik}b_{kt} + a_{il}b_{lt}$.
In general there are more nonzero entries per row of the matrices A, B and C. But from the above derivation we can see that the SpGEMM can be represented by operations on row vectors of the matrices. Therefore, in this work we store all sparse matrices in the CSR format. Actually, the compressed sparse column (CSC) format is also widely used for sparse matrices stored in column-major order [77]. The SpGEMM in the CSC format is almost the same as in the CSR format, except that rows are changed to columns and vice versa.
The above CSR-based SpGEMM algorithm can be expressed by the pseudocode in Algorithm 33. An early description of this algorithm was given by Gustavson [85].
The first method, the precise method, pre-computes the exact nnz(C) before allocating memory for the numeric multiplication. This method is relatively expensive since the SpGEMM operation in the same pattern is executed twice.
The second method, the probabilistic method, estimates an imprecise nnz(C). This group of approaches [4, 47, 144] is based on random sampling and probability analysis on the input matrices. Since they do not guarantee a safe lower bound for the
resulting matrix C and extra memory has to be allocated while the estimation fails,
they were mostly used for estimating the shortest execution time of multiplication of
multiple sparse matrices.
The third method, the upper bound method, computes an upper bound of the number of
nonzero entries in the resulting matrix C and allocates the corresponding memory space.
Numerically, the upper bound size equals $nnz(\hat{C})$, or half of $flops$, the number of
necessary arithmetic operations. The ESC algorithms use this method for memory
pre-allocation. Even though this approach saves the cost of the pre-computation in the
precise method, it brings another problem: the intermediate matrix $\hat{C}$ may be too
large to fit in the device global memory. Since the SpGEMM algorithm does not take
into consideration cancellation that eliminates zero entries generated by arithmetic
operations, the resulting matrix is normally larger than the input matrices. Table 9.2
shows that $nnz(\hat{C})$ is much larger than $nnz(C)$ when squaring some matrices. For
example, the sparse matrix Wind Tunnel generates 626.1 million nonzero entries (or
7.5 GB of memory space for 32-bit indices and 64-bit values) for the intermediate matrix $\hat{C}$,
while the real product C (i.e., $A^2$) only contains 32.8 million nonzero entries. Although
the upper bound method can partition the intermediate matrix $\hat{C}$ into multiple sub-
matrices, higher global memory pressure may reduce overall performance.
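Such a per-row upper bound can be computed with one pass over A, accumulating for each row i the row lengths of B selected by the column indices of $a_{i*}$. A minimal serial C++ sketch with raw CSR arrays and illustrative names (the GPU variant in this chapter assigns one thread per entry of U):

```cpp
// Sketch: per-row upper bound u[i] for C = A * B, i.e. the number of arithmetic
// products contributed to row i: u[i] = sum over nonzeros a_ik of nnz(b_{k*}).
#include <vector>

std::vector<int> upper_bound_per_row(int a_rows,
                                     const std::vector<int>& a_row_ptr,
                                     const std::vector<int>& a_col_idx,
                                     const std::vector<int>& b_row_ptr) {
    std::vector<int> u(a_rows, 0);
    for (int i = 0; i < a_rows; ++i)
        for (int p = a_row_ptr[i]; p < a_row_ptr[i + 1]; ++p) {
            int k = a_col_idx[p];
            u[i] += b_row_ptr[k + 1] - b_row_ptr[k];   // nnz of row k of B
        }
    return u;   // the sum of u equals nnz(C_hat), i.e. half of flops
}
```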
The last method, the progressive method, first allocates memory of a proper size, starts
the sparse matrix computation and re-allocates the buffer if a larger space is required. Some
CPU sparse matrix libraries use this method. For instance, the sparse matrix computation
in Matlab [77] increases the buffer by a ratio of 50% if the current memory space
is exhausted.
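As an illustration of this grow-on-demand strategy, the following C++ sketch enlarges a nonzero buffer by roughly 50% whenever it is exhausted; the struct and its members are ours, and only the growth policy mirrors the Matlab behaviour cited above.

```cpp
// Sketch of a progressive (grow-on-demand) nonzero buffer.
#include <algorithm>
#include <cstddef>
#include <vector>

struct NnzBuffer {
    std::vector<int>    col_idx;
    std::vector<double> val;
    std::size_t         used = 0;

    explicit NnzBuffer(std::size_t initial) : col_idx(initial), val(initial) {}

    void push(int col, double v) {
        if (used == col_idx.size()) {                       // space exhausted
            std::size_t grown = std::max(col_idx.size() + col_idx.size() / 2,
                                         col_idx.size() + 1);   // grow by ~50%
            col_idx.resize(grown);
            val.resize(grown);
        }
        col_idx[used] = col;
        val[used]     = v;
        ++used;
    }
};
```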
Since the upper bound method sacrifices space efficiency for the sake of improved
performance and the progressive method is good at saving space, we use a hybrid
method composed of both approaches. However, compared to the relatively
convenient upper bound method, it is hard to directly implement a progressive
method for discrete GPUs. The reason is that although modern GPU devices are able
to allocate device global memory while kernels are running, they still cannot
re-allocate device memory on the fly. We will describe our hybrid method designed
for discrete GPUs in the next section.
On the other hand, emerging heterogeneous processors, composed of multiple
CPU cores and GPU cores in one chip, supply both flexibility and efficiency. In this
architecture, the integrated GPU cores can directly use system memory allocated by the
CPU part. Data transfer through connection interfaces such as the PCIe link can then
be avoided, yielding higher performance [82]. This gives our SpGEMM algorithm a
chance to let the integrated GPUs use re-allocatable system memory for better overall
performance. Later on, we will show the corresponding performance gain using
an AMD APU.
Load Balancing
Because the distribution patterns of nonzero entries in both input sparse matrices can
be very diverse (consider the plots of the matrices in Table 9.2), input space-based data
decomposition [57, 171] normally does not bring efficient load balancing. One exception
is computing SpGEMM for huge sparse matrices on large-scale distributed memory
systems, where 2D and 3D decompositions of the input space have demonstrated good
load balancing and scalability by utilizing efficient communication strategies [79, 13, 35].
However, in this chapter we mainly consider load balancing for fine-grained parallelism
in GPU and CPU-GPU shared memory architectures.
Therefore we use the other group of load balancing methods, based on output
space decomposition. Dalton et al. [52] presented a method that sorts the rows of the
intermediate matrix $\hat{C}$, divides it into 3 sub-matrices that include the rows in different
size ranges, and uses differentiated ESC methods for the sub-matrices. We have a
similar consideration, but our implementation is completely different. We do not
strictly sort the rows of the intermediate matrix $\hat{C}$ but just assign rows to a fixed number
of bins through a much faster linear-time traversal on the CPU. Moreover, we decompose
the output space in a more detailed way that guarantees much more efficient load
balancing. We will demonstrate that our method is load balanced in all stages,
maximizing resource utilization of the GPUs.
The first stage, calculating the upper bound, computes the upper bound of the number
of nonzero entries in each row of the resulting matrix C and stores these values in an
array U of size m, using one GPU thread for computing each entry of the array U.
Algorithm 34 describes this procedure.
The second stage, binning, deals with load balancing and memory pre-allocation.
We first allocate 38 bins and put them into five bin groups. The bins contain the indices
of the entries in the array U and are represented as one array of size m with 38 segments.
Then all rows are assigned to the corresponding bins according to their numbers of nonzero
entries. Finally, based on the sizes of the bins, we allocate a temporary matrix for the
nonzero entries in the resulting matrix C.
The first bin group includes one bin that contains the indices of the rows of size 0.
The second bin group also only has one bin that contains the indices of the rows of
size 1. Because the rows in the first two bins only require trivial operations, they are
excluded from subsequent more complex computation on GPUs. Thus a better load
balancing can be expected.
The third bin group is composed of 31 bins that contain the indices of the rows
of sizes 2–32, respectively. Since the sizes of these rows are no more than the size of
a single thread bunch (32 in current nVidia GPUs or 64 in current AMD GPUs) and
these rows require non-trivial computation, using one thread bunch or one thread
group for each row cannot deliver efficient instruction throughput on GPUs. Therefore,
we use one thread for each row. Further, because each bin only contains the rows
of the same upper bound size, the bins can be executed separately on GPUs with
different kernel programs for efficient load balancing. In other words, 31 GPU kernel
programs will be executed for the 31 bins, if they are not empty.
The fourth bin group consists of 4 bins that contain the indices of the rows located
in size ranges 33–64, 65–128, 129–256 and 257–512, respectively. The rows of these sizes
are grouped for three reasons: (1) each of them is large enough to be efficiently
executed by a thread group, (2) each of them is small enough for the scratchpad memory
(48 kB per core in current nVidia Kepler GPUs, 96 kB per core in current nVidia
Maxwell GPUs and 64 kB per core in current AMD Graphics Core Next, or GCN,
GPUs), and (3) the final sizes of these rows in the resulting matrix C are predictable
within a reasonably small range (no less than the lower bound of size 1 and no more than
the corresponding upper bound sizes). Even though the rows in each bin do not
have exactly the same upper bound size, a good load balancing still can be expected
because each row is executed by using one thread group and inter-thread group load
balancing is naturally guaranteed by the GPU low-level scheduling sub-systems.
The fifth bin group includes the last bin, which contains the indices of the remaining
rows of size larger than 512. These rows have two common features: (1) their sizes
can be too large (recall $nnzr(\hat{C})$ in Table 9.2) to fit in the scratchpad memory, and (2)
predicting the final sizes of these rows to a small range (scratchpad memory level) is
not possible in advance. Therefore, we execute them with a unified progressive method
described later. Again, because we use one thread group for each row, load balancing
is naturally guaranteed.
Since we do not use the precise method for memory pre-allocation, a temporary
memory space for the resulting matrix C is required. We design a hybrid method that
allocates a CSR format sparse matrix $\tilde{C}$ of the same size as the resulting matrix C as
the temporary matrix. We set $nnz(\tilde{c}_{i*})$ to $u_i$ while the row index i is located in the bin
groups 1–4, because compared with modern GPU global memory capacity, the total
space requirement of these rows is relatively small. For the rows in the bin group 5,
we set $nnz(\tilde{c}_{i*})$ to a fixed size of 256, since this is normally an efficient working size for
the scratchpad memory. Therefore, if all of the indices of the rows are
in the bin groups 1–4, our hybrid method converts to the upper bound method; at
the other extreme, our method converts to the progressive method. But generally,
we obtain benefits from both individual methods. Stage 2 is executed on the CPU
since it only requires a few simple linear-time traversals, which are more efficient on
the CPU cache sub-systems. The pseudocode is shown in Algorithm 35.
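A minimal CPU-side sketch of this binning step follows (illustrative C++, not the thesis's Algorithm 35; the 38-bin layout and the 256-entry default for bin group 5 follow the description above):

```cpp
// Sketch of Stage 2: assign each row i to one of 38 bins based on its upper bound u[i].
//   bin  0        : u == 0                   (bin group 1)
//   bin  1        : u == 1                   (bin group 2)
//   bins 2..32    : u == 2 .. 32             (bin group 3, one bin per exact size)
//   bins 33..36   : 33-64, 65-128, 129-256, 257-512  (bin group 4)
//   bin  37       : u > 512                  (bin group 5)
#include <vector>

static int bin_of(int u) {
    if (u <= 32)  return u;          // bins 0..32 cover sizes 0..32 exactly
    if (u <= 64)  return 33;
    if (u <= 128) return 34;
    if (u <= 256) return 35;
    if (u <= 512) return 36;
    return 37;
}

// Returns, for each of the 38 bins, the row indices assigned to it, and fills
// nnz_tilde[i] with the temporary row size used for the matrix C~.
std::vector<std::vector<int>> binning(const std::vector<int>& u,
                                      std::vector<int>& nnz_tilde) {
    std::vector<std::vector<int>> bins(38);
    nnz_tilde.assign(u.size(), 0);
    for (int i = 0; i < static_cast<int>(u.size()); ++i) {
        int b = bin_of(u[i]);
        bins[b].push_back(i);
        // Bin groups 1-4 get the full upper bound; bin group 5 gets a fixed 256.
        nnz_tilde[i] = (b == 37) ? 256 : u[i];
    }
    return bins;
}
```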
The third stage, computing the resulting matrix, generates $nnz(c_{i*})$ and the final
nonzero entries stored in the temporary matrix $\tilde{C}$.
For the rows in the bin groups 1–2, we simply update the numbers of the corresponding
nonzero entries. For the rows in the bin groups 3–5, we use three entirely different
methods: (1) the heap method, (2) the bitonic ESC method, and (3) the merge method,
respectively. Note that each bin has a counter (at the host side) that records the number
of rows it includes, so the host can easily decide whether a GPU kernel will be issued
for a certain bin. In other words, our approach only issues kernels for non-empty bins.
The heap method described in Section 7.4.1 is used for each row in the bin group
3. For the rows in each bin of the bin group 4, a typical ESC algorithm described
in Section 7.4.2 is used. For the rows in the bin group 5, our method inserts each
input nonzero entry into the corresponding row of the resulting matrix C (lines 7–11 in
Algorithm 33) in parallel. Because the resulting rows in the bin group 5 may involve
many more entries, the merge method described in Section 7.4.3 is used.
As we allocate a limited scratchpad memory space for the resulting sequence of
the bin group 5, a potential overflow may happen. In this case, we first compare the total
size of the two sequences (note that the input sequence is still in the thread registers,
not yet in the scratchpad memory) with the allocated size of the resulting sequence
in the scratchpad memory. If a merge operation is not possible, our method records
the current computation position as a checkpoint and dumps the resulting sequence from
the scratchpad memory to the global memory. Then the host allocates more global
memory (we use 2x each time) and re-launches the kernel with a 2x larger scratchpad
memory setting. The relaunched kernels obtain the checkpoint information, load the
existing results into the scratchpad memory and continue the computation. The global
memory dumping and reloading introduce extra overhead, but in practice they do not
significantly affect the total execution time, for three reasons: (1) the global
memory access is almost completely coalesced, (2) the latency can be hidden
by subsequent computation, and (3) this overhead is only a small fraction of a large
computation (short rows normally do not face this problem). For very long rows
that exceed the scratchpad memory capacity, our method still allocates a space in the
scratchpad memory as a level-1 merge sequence, executes the same merge operations
on it, and merges the level-1 sequence in the scratchpad memory with the resulting
sequence in the global memory only once, right before the kernel returns.
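The host-side control flow of this progressive re-allocation can be sketched as follows; this is a simplified C++ illustration with a hypothetical try_merge_from_checkpoint standing in for the real merge kernel, and the 256-entry initial size and 2x growth follow the description above.

```cpp
// Host-side sketch of the progressive 2x re-allocation for the long rows in
// bin group 5. try_merge_from_checkpoint() is a hypothetical stand-in: a real
// implementation would re-launch the GPU merge kernel (with a 2x larger
// scratchpad) and report whether the allocated result space overflowed.
#include <cstddef>
#include <vector>

// Placeholder so the sketch compiles; here we simply pretend the row needs
// 1024 result slots, so the loop below doubles 256 -> 512 -> 1024 and stops.
static bool try_merge_from_checkpoint(std::vector<double>& result_buf) {
    return result_buf.size() >= 1024;
}

void merge_long_row() {
    std::size_t capacity = 256;                 // initial working size per long row
    std::vector<double> result_buf(capacity);

    // Keep doubling the global result buffer until the merge completes. Dumped
    // partial results stay in result_buf and are reloaded by the relaunched
    // kernel from the recorded checkpoint.
    while (!try_merge_from_checkpoint(result_buf)) {
        capacity *= 2;
        result_buf.resize(capacity);
    }
}
```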
It is worth noting that the parameters of the binning depend on the specifications
(e.g., thread bunch size and scratchpad memory capacity) of the GPU architectures. In this
chapter, we use the abovementioned fixed-size parameters for assigning the rows to
the bins since current nVidia GPUs and AMD GPUs have comparable hardware
specifications. However, the strategies in stages 2 and 3 can easily be extended for
future GPUs with changed architecture designs.
The fourth stage, arranging data, first sums the numbers of nonzero entries in all
rows of the resulting matrix C and allocates its final memory space. Then our method
copies the existing nonzero entries from the temporary matrix $\tilde{C}$ to the resulting matrix
C. For the rows in the bin group 1, the copy operation is not required. For the rows
in the bin group 2, we use one thread for each row. For the rest of the rows in the
bin groups 3–5, we use one thread group for each row. After all copy operations, the
SpGEMM computation is done.
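The first step of this stage amounts to an exclusive prefix sum over the per-row nonzero counts produced in stage 3; a minimal C++ sketch (function and variable names are ours):

```cpp
// Sketch of Stage 4: an exclusive prefix sum over the per-row nonzero counts
// gives the final CSR row pointer of C, after which the nonzeros are copied
// from the temporary matrix C~ into their final positions.
#include <numeric>
#include <vector>

std::vector<int> build_row_ptr(const std::vector<int>& nnz_c) {
    std::vector<int> row_ptr(nnz_c.size() + 1, 0);
    // exclusive scan: row_ptr[i+1] = nnz_c[0] + ... + nnz_c[i]
    std::partial_sum(nnz_c.begin(), nnz_c.end(), row_ptr.begin() + 1);
    return row_ptr;                     // row_ptr.back() == nnz(C)
}
```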
We use four platforms (one CPU and three GPUs) shown in Table 9.1 for evaluating
the SpGEMM algorithms. Tables B.1 and B.2 list specifications of the used hardware.
The host side of all GPUs is a quad-core 3.7GHz CPU in an AMD A10-7850K APU
with 8 GB DDR3-1600 dual-channel system memory and 64-bit Ubuntu Linux 14.04.
Performance Comparison
Figures 9.2 and 9.3 show the execution time of the Galerkin product $P^TAP$ in constructing
an AMG hierarchy (typically including 3–5 levels) for a smoothed aggregation pre-
conditioner in single precision and double precision, respectively. The input system
matrix A is from a 2D 5-point, 2D 9-point, 3D 7-point or 3D 27-point Poisson prob-
lem, respectively. The two 2D problems have dimensions 1024 × 1024 and generate
system matrices of size 1048576 × 1048576. The two 3D problems have dimensions
101 × 101 × 101 and generate system matrices of size 1030301 × 1030301. The SpGEMM
approaches in three libraries, CUSP v0.4.0, cuSPARSE v6.5 and bhSPARSE, are tested
on nVidia GeForce GTX Titan Black and GeForce GTX 980 GPUs. To obtain the best
SpGEMM performance, CUSP uses the coordinate (COO) format for its input matri-
ces. The other two libraries use the CSR format. Because the operation multiplies
three sparse matrices $P^T$, A and P, the order of multiplication may influence overall
performance. Here we test the two possible orders $(P^TA)P$ and $P^T(AP)$. In our
experiments, matrix data transfer time between the host and the device is not included,
since the SpGEMM is normally one of the building blocks of more complex problems
running completely on GPUs.
In Figures 9.2 and 9.3, we can see that our method is consistently faster than
the SpGEMM algorithms in the other two libraries. When using the system matrix from
the 3D 27-point Poisson problem, bhSPARSE delivers up to 2.6x and up to 2.7x speedups
over cuSPARSE and CUSP, respectively. On average, speedups of 1.9x and 1.7x are
achieved when compared with the above two libraries, respectively.
As for the order of multiplication, we can see that our method in general gives
better performance when computing $P^T(AP)$, compared to running $(P^TA)P$. In contrast,
the order of multiplication does not bring an obvious performance difference for CUSP.
When cuSPARSE is used, $(P^TA)P$ delivers better throughput for the two 2D problems,
but degrades throughput for the two 3D problems.
Figure 9.2: Execution time (in milliseconds) comparison of single precision SpGEMM
(SpSGEMM) from three libraries CUSP, cuSPARSE and bhSPARSE in the context of
smoothed aggregation preconditioner with Jacobi smoother. The system matrices are
from four Poisson problems. Both $(P^TA)P$ and $P^T(AP)$ are tested on two nVidia
GPUs.
Figure 9.3: Execution time (in milliseconds) comparison of double precision SpGEMM
(SpDGEMM) from three libraries CUSP, cuSPARSE and bhSPARSE in the context of
smoothed aggregation preconditioner with Jacobi smoother. The system matrices are
from four Poisson problems. Both $(P^TA)P$ and $P^T(AP)$ are tested on two nVidia
GPUs.
Besides the input matrix A, the work complexities of the different SpGEMM
algorithms also depend on the intermediate matrix $\hat{C}$ and the resulting matrix C.
So we list the characteristics of the three matrices in Table 9.2. The set of characteristics
includes the matrix dimension (n), the number of nonzero entries (nnz) and the average
number of nonzero entries per row (nnzr). The upper 9 matrices in the table have
relatively regular nonzero entry distributions, mostly on the diagonal. The other 14
matrices include various irregular sparsity structures.
Table 9.2: Overview of sparse matrices for benchmarking matrix squaring. Here
$nnz(\hat{C})$ is the upper bound size of $A^2$. Numerically, $nnz(\hat{C})$ equals half of $flops$,
the number of necessary arithmetic operations while doing SpGEMM. $nnz(C)$ is the
number of nonzero entries in the resulting matrix $C = A^2$.

Id   Name   n   nnz(A), nnzr(A)   nnz($\hat{C}$), nnzr($\hat{C}$)   nnz(C), nnzr(C)
r1 FEM/Cantilever 63 K 4 M, 64 269.5 M, 4315 17.4 M, 279
r2 Economics 207 K 1.3 M, 6 7.6 M, 37 6.7 M, 32
r3 Epidemiology 526 K 2.1 M, 4 8.4 M, 16 5.2 M, 10
r4 Filter3D 106 K 2.7 M, 25 86 M, 808 20.2 M, 189
r5 Wind Tunnel 218 K 11.6 M, 53 626.1 M, 2873 32.8 M, 150
r6 FEM/Ship 141 K 7.8 M, 55 450.6 M, 3199 24.1 M, 171
r7 FEM/Harbor 47 K 2.4 M, 51 156.5 M, 3341 7.9 M, 169
r8 Protein 36 K 4.3 M, 119 555.3 M, 15249 19.6 M, 538
r9 FEM/Spheres 83 K 6 M, 72 463.8 M, 5566 26.5 M, 318
i1 2cubes sphere 102 K 1.6 M, 16 27.5 M, 270 9 M, 88
i2 FEM/Accelerator 121 K 2.6 M, 22 79.9 M, 659 18.7 M, 154
i3 Cage12 130 K 2 M, 16 34.6 M, 266 15.2 M, 117
i4 Hood 221 K 10.8 M, 49 562 M, 2548 34.2 M, 155
i5 M133-b3 200 K 0.8 M, 4 3.2 M, 16 3.2 M, 16
i6 Majorbasis 160 K 1.8 M, 11 19.2 M, 120 8.2 M, 52
i7 Mario002 390 K 2.1 M, 5 12.8 M, 33 6.4 M, 17
i8 Mono 500Hz 169 K 5 M, 30 204 M, 1204 41.4 M, 244
i9 Offshore 260 K 4.2 M, 16 71.3 M, 275 23.4 M, 90
i10 Patents main 241 K 0.6 M, 2 2.6 M, 11 2.3 M, 9
i11 Poisson3Da 14 K 0.4 M, 26 11.8 M, 871 3 M, 219
i12 QCD 49 K 1.9 M, 39 74.8 M, 1521 10.9 M, 222
i13 Circuit 171 K 1 M, 6 8.7 M, 51 5.2 M, 31
i14 Webbase 1M 3.1 M, 3 69.5 M, 70 51.1 M, 51
Performance Comparison
The single precision and double precision absolute performance of the SpGEMM
algorithms that compute $C = A^2$ are shown in Figures 9.4 and 9.5, respectively. Four
GPU methods from CUSP v0.4.0, cuSPARSE v6.5, RMerge [83] and bhSPARSE are
evaluated on three GPUs: nVidia GeForce GTX Titan Black, nVidia GeForce GTX 980
and AMD Radeon R9 290X. One CPU method in Intel MKL v11.0 is evaluated on Intel
Xeon E5-2630 CPU. The performance of another recent ESC-based GPU SpGEMM
work [52] is not included in the comparison because its source code is not available to
us yet. The Intel MKL SpGEMM program is multithreaded and utilizes all six cores
in the Intel Xeon CPU. For GPU algorithms, again, the host-device data transfer time
is not included.
We first compare the performance of the four different GPU SpGEMM algorithms
on the nVidia GPUs. We can see that bhSPARSE outperforms CUSP, cuSPARSE
and RMerge on most of the sparse matrices in the benchmark suite. Compared to the two
vendor supplied libraries, our method obtains better SpSGEMM and SpDGEMM
performance on 21 and 21 of the 23 matrices over CUSP, and on 19
and 21 matrices over cuSPARSE, respectively. Compared to RMerge, another CUDA-
specific method, bhSPARSE achieves better SpSGEMM and SpDGEMM performance
on 19 and 10 matrices on the GTX Titan Black GPU, and on 19 and 20 matrices on the
GTX 980 GPU.
From the perspective of speedup, our method delivers on average 4.6x (up to 9.6x)
and 3.1x (up to 8.8x) speedup on SpSGEMM performance over CUSP and cuSPARSE,
and on average 4.6x (up to 9.9x) and 3.1x (up to 9.5x) speedup on SpDGEMM perfor-
mance over them, respectively. Compared to RMerge, our method offers on average
1.4x (up to 2.5x) speedup and 2.8x (up to 4.9x) speedup for SpSGEMM and on average
1.0x (up to 1.5x) and 2.1x (up to 3.4x) speedup for SpDGEMM on the GTX Titan Black
GPU and GTX 980 GPU, respectively.
We can see that the cuSPARSE method outperforms our approach only when
the input matrices are fairly regular (i.e., belong to the first 9 matrices in
Table 9.2). For all irregular matrices and some regular ones, our bhSPARSE is always
more efficient. On the other hand, the absolute performance of the CUSP method is
very stable since its execution time almost only depends on the number of necessary
arithmetic operations, which makes this approach insensitive to sparsity structures.
This insensitivity may actually bring better performance on matrices with certain
specific sparsity structures. However, in most cases, the CUSP method suffers from
higher global memory pressure. The RMerge method offers significant speedups over
the other methods on three matrices (i.e., Epidemiology, M133-b3 and Mario002), which
are characterized by short rows. However, for the other matrices, RMerge supplies
relatively lower performance due to imbalanced workload and high-overhead global
memory operations between iterative steps. Further, we can see that since RMerge
mainly relies on the computational power of the SIMD units, its performance decreases
from the GTX Titan Black (2880 CUDA cores running at 889 MHz) to the GTX 980 (2048
CUDA cores running at 1126 MHz). In contrast, our method also depends on the capacity
of the scratchpad memory. Thus we can see that bhSPARSE obtains better performance
while using GTX 980 (1536 kB scratchpad) over GTX Titan Black (720 kB scratchpad).
The relative performance (harmonic mean) of the SpGEMM algorithms that com-
pute $C = A^2$ is shown in Figure 9.6. We can see that our method in general delivers the
best performance on the used testbeds while running the 23 matrices as a benchmark
suite. If we set the Intel MKL SpGEMM performance in this scenario as the baseline,
our approach is the only GPU SpGEMM that consistently outperforms the well-optimized
CPU method.
Figure 9.7 shows the comparison of the three memory pre-allocation methods while
benchmarking $C = A^2$. We can see that, for small matrices (e.g., 2cubes sphere),
our hybrid method shows exactly the same space requirements as the upper bound
method does. However, for large matrices, the memory sizes allocated by our hybrid
method are much closer to those allocated by the precise method. Taking
the matrix Protein as an example, our hybrid method requires 2.7x the memory space of
the precise method, while the upper bound method requires 20.6x. One exception is
the matrix Webbase, for which our hybrid method actually allocates more memory space
than the upper bound method. The reasons are that the reduction rate from the
intermediate matrix $\hat{C}$ to the resulting matrix C is very low (see Table 9.2) and
our 2x progression mechanism happens to allocate memory beyond the upper bound size.
But overall, our hybrid method avoids the space overhead of the upper bound method
and the execution time of the precise method without introducing any significant extra
space requirements.
Figure 9.7: Global memory requirement comparison of the precise method, our hybrid
method and the upper bound method, when benchmarking $C = A^2$ on the 23 matrices.
The memory requirement of the precise method includes the two input matrices and
the resulting matrix. The memory requirements of the other two methods also contain
additional intermediate matrices. “Hmean” refers to harmonic mean.
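For orientation, the footprints compared in Figure 9.7 can be summarized with a rough accounting of our own (not a formula from the thesis), assuming CSR storage with 32-bit indices and 64-bit values as in the Wind Tunnel example earlier in this chapter:

```latex
% Rough, illustrative accounting (ours); B = A in the C = A^2 benchmark.
\begin{align*}
\mathrm{mem}_{\mathrm{CSR}}(M)  &\approx 12\,nnz(M) + 4\,(\#\mathrm{rows}(M)+1)\ \text{bytes},\\
\mathrm{mem}_{\mathrm{precise}} &\approx \mathrm{mem}(A) + \mathrm{mem}(B) + \mathrm{mem}(C),\\
\mathrm{mem}_{\mathrm{upper}}   &\approx \mathrm{mem}(A) + \mathrm{mem}(B) + \mathrm{mem}(C) + \mathrm{mem}(\hat{C}),\\
\mathrm{mem}_{\mathrm{hybrid}}  &\approx \mathrm{mem}(A) + \mathrm{mem}(B) + \mathrm{mem}(C) + \mathrm{mem}(\tilde{C}),
\end{align*}
```

where $\hat{C}$ is the upper bound intermediate matrix and $\tilde{C}$ the temporary matrix of the hybrid method.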
for SpSGEMM and SpDGEMM, respectively. Therefore, our GPU SpGEMM method
may deliver further performance improvement on future GPUs with re-allocatable
memory, or on emerging heterogeneous processors composed of CPU cores and
GPU cores. Moreover, both CPU cores and GPU cores can be utilized for Stage 3 in
our framework. We leave this heterogeneous workload partitioning (similar to the
methods described in [162, 164]) to future work.
Our experimental results show that the proposed SpGEMM method in general
outperforms the previous SpGEMM methods designed for CPUs and GPUs.
10. Conclusion and Future Work
10.1 Conclusion
This thesis studied some key routines of Sparse BLAS and some fundamental data
structures and algorithms as their building blocks.
Chapter 4 proposed ad-heap, a new efficient heap data structure for the tightly
coupled CPU-GPU heterogeneous processors. Empirical studies were conducted
based on the theoretical analysis. The experimental results showed that the ad-heap
can obtain up to 1.5x and 3.6x performance of the optimal scheduling method on two
representative machines, respectively. To the best of our knowledge, the ad-heap is
the first fundamental data structure that efficiently leveraged the two different types
of cores in the emerging heterogeneous processors through fine-grained frequent
interactions between the CPUs and the GPUs. Further, the performance numbers
also showed that redesigning data structures and algorithms is necessary for exposing
the higher computational power of heterogeneous processors.
Chapter 6 proposed the CSR5 format. The format conversion from the CSR to the
CSR5 was very fast because of the format’s insensitivity to sparsity structure of the
input matrix.
Chapter 7 developed three methods for sparse vector addition. Those methods
have been used for adding rows of different sizes in the SpGEMM operation described
in Chapter 9.
Chapter 8 proposed an efficient method for SpMV on heterogeneous processors
using the CSR storage format. On three mainstream platforms from Intel, AMD
and nVidia, the method greatly outperforms row block method CSR-based SpMV
algorithms running on GPUs. The performance gain mainly comes from the newly
developed speculative segmented sum strategy that efficiently utilizes different types
of cores in a heterogeneous processor.
Chapter 8 also proposed the CSR5-based cross-platform SpMV algorithm for
CPUs, GPUs and Xeon Phi. The CSR5-based SpMV was implemented by a redesigned
segmented sum algorithm with higher SIMD utilization compared to the classic
methods. The experimental results showed that the CSR5 delivered high throughput
both in the isolated SpMV tests and in the iteration-based scenarios.
Chapter 9 demonstrated an efficient SpGEMM framework and corresponding
algorithms on GPUs and emerging CPU-GPU heterogeneous processors for solving
the three challenging problems in the SpGEMM. In the two experimental scenarios
using matrices with diverse sparsity structures as input, the SpGEMM algorithm
delivered excellent absolute and relative performance as well as space efficiency over
the previous GPU SpGEMM methods. Moreover, on average, the approach obtained
around twice the performance of the state-of-the-art CPU SpGEMM method. Further,
the method obtained higher performance on emerging heterogeneous processors with
re-allocatable memory.
Bibliography

[1] David Abrahams and Aleksey Gurtovoy. C++ Template Metaprogramming: Concepts,
Tools, and Techniques from Boost and Beyond (C++ in Depth Series). Addison-
Wesley Professional, 2004.
[4] Rasmus Resen Amossen, Andrea Campagna, and Rasmus Pagh. Better Size
Estimation for Sparse Matrix Products. Algorithmica, pages 741–757, 2014.
[5] Pham Nguyen Quang Anh, Rui Fan, and Yonggang Wen. Reducing Vector I/O
for Faster GPU Sparse Matrix-Vector Multiplication. In Parallel and Distributed
Processing Symposium, 2015 IEEE International, IPDPS ’15, pages 1043–1052, 2015.
[6] M. Arora, S. Nath, S. Mazumdar, S.B. Baden, and D.M. Tullsen. Redefining the
role of the cpu in the era of cpu-gpu integration. Micro, IEEE, 32(6):4–16, 2012.
[7] Arash Ashari, Naser Sedaghati, John Eisenlohr, Srinivasan Parthasarathy, and
P. Sadayappan. Fast Sparse Matrix-Vector Multiplication on GPUs for Graph
Applications. In Proceedings of the International Conference for High Performance
Computing, Networking, Storage and Analysis, SC ’14, pages 781–792, 2014.
[8] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P. Sadayappan. An Efficient
Two-dimensional Blocking Strategy for Sparse Matrix-vector Multiplication on
GPUs. In Proceedings of the 28th ACM International Conference on Supercomputing,
ICS ’14, pages 273–282, 2014.
[9] Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre Wacre-
nier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multi-
core Architectures. Concurr. Comput. : Pract. Exper., 23(2):187–198, feb 2011.
[10] Brett W. Bader and Tamara G. Kolda. Algorithm 862: MATLAB Tensor Classes
for Fast Algorithm Prototyping. ACM Transactions on Mathematical Software,
32(4):635–653, 2006.
[11] Brett W. Bader and Tamara G. Kolda. Efficient MATLAB Computations with
Sparse and Factored Tensors. SIAM Journal on Scientific Computing, 30(1):205–
231, 2007.
[12] Satish Balay, Shrirang Abhyankar, Mark F. Adams, Jed Brown, Peter Brune, Kris
Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G.
Knepley, Lois Curfman McInnes, Karl Rupp, Barry F. Smith, and Hong Zhang.
PETSc Users Manual. Technical Report ANL-95/11 - Revision 3.5, Argonne
National Laboratory, 2014.
[13] Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Benjamin Lipshitz,
Oded Schwartz, and Sivan Toledo. Communication Optimal Parallel Multipli-
cation of Sparse Random Matrices. In Proceedings of the 25th ACM Symposium on
Parallelism in Algorithms and Architectures, SPAA ’13, pages 222–231, 2013.
[14] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin. Efficient and Scalable
Computations with Sparse Tensors. In High Performance Extreme Computing
(HPEC), 2012 IEEE Conference on, pages 1–6, 2012.
[15] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy,
J. Ramanujam, Atanas Rountev, and P. Sadayappan. A Compiler Framework
for Optimization of Affine Loop Nests for GPGPUs. In Proceedings of the 22Nd
Annual International Conference on Supercomputing, ICS ’08, pages 225–234, 2008.
[16] Muthu Manikandan Baskaran and Rajesh Bordawekar. Optimizing Sparse
Matrix-Vector Multiplication on GPUs. Technical Report RC24704, IBM, 2008.
[17] K. E. Batcher. Sorting networks and their applications. In Proceedings of the
April 30–May 2, 1968, Spring Joint Computer Conference, AFIPS ’68 (Spring), pages
307–314, 1968.
[18] Sean Baxter. Modern GPU. nVidia, 2013.
[19] Bob Beaty. DKit: C++ Library of Atomic and Lockless Data Structures, 2012.
[20] N. Bell, S. Dalton, and L. Olson. Exposing Fine-Grained Parallelism in Algebraic
Multigrid Methods. SIAM Journal on Scientific Computing, 34(4):C123–C152,
2012.
[21] Nathan Bell and Michael Garland. Implementing Sparse Matrix-Vector Multi-
plication on Throughput-Oriented Processors. In Proceedings of the Conference
on High Performance Computing Networking, Storage and Analysis, SC ’09, pages
18:1–18:11, 2009.
[22] Mehmet E. Belviranli, Laxmi N. Bhuyan, and Rajiv Gupta. A dynamic self-
scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans.
Archit. Code Optim., 9(4):57:1–57:20, jan 2013.
[23] P. Bjorstad, F. Manne, T. Sorevik, and M. Vajtersic. Efficient Matrix Multipli-
cation on SIMD Computers. SIAM Journal on Matrix Analysis and Applications,
13(1):386–401, 1992.
[38] Bradford L. Chamberlain and Lawrence Snyder. Array Language Support for
Parallel Sparse Computation. In Proceedings of the 15th International Conference
on Supercomputing, ICS ’01, pages 133–145, 2001.
[39] Timothy M. Chan. More Algorithms for All-pairs Shortest Paths in Weighted
Graphs. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of
Computing, STOC ’07, pages 590–598, 2007.
[40] Siddhartha Chatterjee, G.E. Blelloch, and Marco Zagha. Scan Primitives for
Vector Computers. In Supercomputing ’90., Proceedings of, pages 666–675, 1990.
[41] Shuai Che, M. Boyer, Jiayuan Meng, D. Tarjan, J.W. Sheaffer, Sang-Ha Lee, and
K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In
Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on,
pages 44–54, 2009.
[42] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer,
and Kevin Skadron. A Performance Study of General-purpose Applications on
Graphics Processors Using CUDA. J. Parallel Distrib. Comput., 68(10):1370–1380,
2008.
[43] Shuai Che, J.W. Sheaffer, M. Boyer, L.G. Szafaryn, Liang Wang, and K. Skadron.
A Characterization of the Rodinia Benchmark Suite with Comparison to Con-
temporary CMP Workloads. In Workload Characterization (IISWC), 2010 IEEE
International Symposium on, pages 1–11, 2010.
[44] Linchuan Chen, Xin Huo, and Gagan Agrawal. Accelerating mapreduce on
a coupled cpu-gpu architecture. In Proceedings of the International Conference
on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages
25:1–25:11, 2012.
[45] Jee W. Choi, Amik Singh, and Richard W. Vuduc. Model-Driven Autotuning
of Sparse Matrix-vector Multiply on GPUs. In Proceedings of the 15th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP
’10, pages 115–126, 2010.
[46] E.S. Chung, P.A. Milder, J.C. Hoe, and Ken Mai. Single-Chip Heterogeneous
Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In
Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International Symposium
on, pages 225–236, 2010.
[49] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. On the efficacy of a fused
cpu+gpu processor (or apu) for parallel computing. In Proceedings of the 2011
Symposium on Application Accelerators in High-Performance Computing, SAAHPC
’11, pages 141–149, 2011.
[50] Mayank Daga and Joseph L. Greathouse. Structural Agnostic SpMV: Adapting
CSR-Adaptive for Irregular Matrices. In High Performance Computing (HiPC),
2015 22nd International Conference on, HiPC ’15, 2015.
[51] Steven Dalton and Nathan Bell. CUSP: A C++ Templated Sparse Matrix
Library.
[52] Steven Dalton, Luke Olson, and Nathan Bell. Optimizing Sparse Matrix-Matrix
Multiplication for the GPU. ACM Transactions on Mathematical Software, 41(4),
2015.
[54] Hoang-Vu Dang and Bertil Schmidt. The sliced coo format for sparse matrix-
vector multiplication on cuda-enabled gpus. Procedia Computer Science, 9(0):57 –
66, 2012.
[55] A. Davidson, D. Tarjan, M. Garland, and J.D. Owens. Efficient Parallel Merge
Sort for Fixed and Variable Length Keys. In Innovative Parallel Computing (InPar),
2012, pages 1–9, 2012.
[56] Timothy A. Davis and Yifan Hu. The University of Florida Sparse Matrix
Collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, dec 2011.
[58] Yangdong (Steve) Deng, Bo David Wang, and Shuai Mu. Taming Irregular
EDA Applications on GPUs. In Proceedings of the 2009 International Conference
on Computer-Aided Design, ICCAD ’09, pages 539–546, 2009.
[59] Mrinal Deo and Sean Keely. Parallel suffix array and least common prefix for
the gpu. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming, PPoPP ’13, pages 197–206, 2013.
[61] J. J. Dongarra, Jeremy Du Croz, Sven Hammarling, and I. S. Duff. A Set of Level
3 Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 16(1):1–17, 1990.
[62] J. J. Dongarra and Michael A. Heroux. Toward a New Metric for Ranking High
Performance Computing Systems. Technical Report SAND2013-4744, Sandia
National Laboratories, 2013.
[63] Yuri Dotsenko, Naga K. Govindaraju, Peter-Pike Sloan, Charles Boyd, and John
Manferdelli. Fast Scan Algorithms on Graphics Processors. In Proceedings of the
22Nd Annual International Conference on Supercomputing, ICS ’08, pages 205–213,
2008.
[64] Iain S. Duff. A Survey of Sparse Matrix Research. Proceedings of the IEEE,
65(4):500–535, 1977.
[65] Iain S Duff, Albert M Erisman, and John K Reid. Direct Methods for Sparse
Matrices. Oxford University Press, Inc., 1986.
[66] Iain S. Duff, Roger G. Grimes, and John G. Lewis. Sparse Matrix Test Problems.
ACM Trans. Math. Softw., 15(1):1–14, 1989.
[67] Iain S. Duff, Roger G. Grimes, and John G. Lewis. The Rutherford-Boeing Sparse
Matrix Collection. Technical Report RAL-TR-97-031, Rutherford Appleton
Laboratory, UK, 1997.
[68] Iain S. Duff, Michael A. Heroux, and Roldan Pozo. An Overview of the Sparse
Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical
Forum. ACM Trans. Math. Softw., 28(2):239–267, 2002.
[69] Iain S. Duff and J. Koster. On algorithms for permuting large entries to the
diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl., 22(4):973–996, 2000.
[70] Iain S. Duff, Michele Marrone, Giuseppe Radicati, and Carlo Vittoli. Level 3
basic linear algebra subprograms for sparse matrices: A user-level interface.
ACM Trans. Math. Softw., 23(3):379–401, 1997.
[71] Iain S. Duff and Christof Vömel. Algorithm 818: A Reference Model Implemen-
tation of the Sparse BLAS in Fortran 95. ACM Trans. Math. Softw., 28(2):268–283,
2002.
[73] Jianbin Fang, Henk Sips, LiLun Zhang, Chuanfu Xu, Yonggang Che, and
Ana Lucia Varbanescu. Test-driving Intel Xeon Phi. In Proceedings of the 5th
ACM/SPEC International Conference on Performance Engineering, ICPE ’14, pages
137–148, 2014.
[75] Timothy Furtak, José Nelson Amaral, and Robert Niewiadomski. Using simd
registers and instructions to enable instruction-level parallelism in sorting
algorithms. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel
Algorithms and Architectures, SPAA ’07, pages 348–357, 2007.
[77] J. Gilbert, C. Moler, and R. Schreiber. Sparse Matrices in MATLAB: Design and
Implementation. SIAM Journal on Matrix Analysis and Applications, 13(1):333–356,
1992.
[78] John R. Gilbert, William W. Pugh, and Tatiana Shpeisman. Ordered Sparse
Accumulator and its Use in Efficient Sparse Matrix Computation. United States
Patent US 5983230 A, nov 1999.
[79] John R. Gilbert, Steve Reinhardt, and Viral B. Shah. High-Performance Graph
Algorithms from Parallel Sparse Matrices. In Bo Kågström, Erik Elmroth, Jack
Dongarra, and Jerzy Waśniewski, editors, Applied Parallel Computing. State of
the Art in Scientific Computing, volume 4699 of Lecture Notes in Computer Science,
pages 260–269. Springer Berlin Heidelberg, 2007.
[80] Joseph L. Greathouse and Mayank Daga. Efficient Sparse Matrix-Vector Mul-
tiplication on GPUs using the CSR Storage Format. In Proceedings of the In-
ternational Conference for High Performance Computing, Networking, Storage and
Analysis, SC ’14, pages 769–780, 2014.
[81] Oded Green, Robert McColl, and David A. Bader. GPU Merge Path: A GPU
Merging Algorithm. In Proceedings of the 26th ACM International Conference on
Supercomputing, ICS ’12, pages 331–340, 2012.
[82] C. Gregg and K. Hazelwood. Where is the Data? Why You Cannot Debate CPU
vs. GPU Performance Without the Answer. In Performance Analysis of Systems
and Software (ISPASS), 2011 IEEE International Symposium on, pages 134–144,
2011.
[83] Felix Gremse, Andreas Höfter, Lars Ole Schwen, Fabian Kiessling, and Uwe
Naumann. GPU-Accelerated Sparse Matrix-Matrix Multiplication by Iterative
Row Merging. SIAM Journal on Scientific Computing, 37(1):C54–C71, 2015.
[84] Dahai Guo and William Gropp. Adaptive Thread Distributions for SpMV on a
GPU. In Proceedings of the Extreme Scaling Workshop, pages 2:1–2:5, 2012.
[85] Fred G. Gustavson. Two Fast Algorithms for Sparse Matrices: Multiplication
and Permuted Transposition. ACM Trans. Math. Softw., 4(3):250–269, sep 1978.
[86] Mark Harris, John D. Owens, and Shubho Sengupta. CUDPP Documentation.
nVidia, 2.2 edition, Aug 2014.
[87] Jiong He, Mian Lu, and Bingsheng He. Revisiting co-processing for hash joins
on the coupled cpu-gpu architecture. Proc. VLDB Endow., 6(10):889–900, aug
2013.
[88] M.D. Hill and M.R. Marty. Amdahl’s law in the multicore era. Computer,
41(7):33–38, 2008.
[89] Kaixi Hou, Hao Wang, and Wu-chun Feng. ASPaS: A Framework for Auto-
matic SIMDization of Parallel Sorting on x86-based Many-core Processors. In
Proceedings of the 29th ACM International Conference on Supercomputing, ICS ’15,
pages 383–392, 2015.
[90] HSA Foundation. HSA Programmer’s Reference Manual: HSAIL Virtual ISA and
Programming Model, Compiler Writer’s Guide, and Object Format (BRIG), 0.95
edition, May 2013.
[93] Donald B. Johnson. Priority queues with update and finding minimum span-
ning trees. Information Processing Letters, 4(3):53 – 57, 1975.
[94] Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian T. Lewis, Chunling
Hu, and Keshav Pingali. Adaptive Heterogeneous Scheduling for Integrated
GPUs. In Proceedings of the 23rd International Conference on Parallel Architectures
and Compilation, PACT ’14, pages 151–162, 2014.
[95] Haim Kaplan, Micha Sharir, and Elad Verbin. Colored Intersection Searching
via Sparse Rectangular Matrix Multiplication. In Proceedings of the Twenty-second
Annual Symposium on Computational Geometry, SCG ’06, pages 52–60, 2006.
[96] S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the
Future of Parallel Computing. Micro, IEEE, 31(5):7–17, 2011.
[97] J. Kepner and J. Gilbert. Graph Algorithms in the Language of Linear Algebra.
Society for Industrial and Applied Mathematics, 2011.
[98] Changkyu Kim, Tim Kaldewey, Victor W. Lee, Eric Sedlar, Anthony D. Nguyen,
Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey. Sort vs.
Hash Revisited: Fast Join Implementation on Modern Multi-core CPUs. Proc.
VLDB Endow., 2(2):1378–1389, aug 2009.
[99] Peter Kipfer and Rüdiger Westermann. Improved GPU Sorting. In Matt Pharr,
editor, GPU Gems 2: Programming Techniques for High-Performance Graphics and
General-Purpose Computation, chapter 46, pages 733–746. Addison-Wesley, mar
2005.
[100] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors
(Second Edition). Morgan Kaufmann, second edition edition, 2013.
[101] Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. Optimizing sparse
matrix-vector multiplication using index and value compression. In Proceedings
of the 5th Conference on Computing Frontiers, CF ’08, pages 87–96, 2008.
[102] Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. Exploiting Com-
pression Opportunities to Improve SpMxV Performance on Shared Memory
Systems. ACM Trans. Archit. Code Optim., 7(3):16:1–16:31, 2010.
[103] Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R.
Bishop. A unified sparse matrix data format for efficient general sparse matrix-
vector multiply on modern processors with wide simd units. SIAM Journal on
Scientific Computing, 2014.
[104] Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, Kenneth Skovhede, and
Brian Vinter. Bohrium: A Virtual Machine Approach to Portable Parallelism.
In Proceedings of the 2014 IEEE International Parallel & Distributed Processing
Symposium Workshops, pages 312–321, 2014.
[105] R. Kumar, D.M. Tullsen, N.P. Jouppi, and P. Ranganathan. Heterogeneous Chip
Multiprocessors. Computer, 38(11):32–38, nov 2005.
[106] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P.
Jouppi, and Keith I. Farkas. Single-isa heterogeneous multi-core architectures
for multithreaded workload performance. In Proceedings of the 31st Annual
International Symposium on Computer Architecture, ISCA ’04, pages 64–, 2004.
[107] Pramod Kumbhar. Performance of PETSc GPU Implementation with Sparse
Matrix Storage Schemes. Master’s thesis, The University of Edinburgh, Aug
2011.
[108] George Kyriazis. Heterogeneous system architecture: A technical review. Tech-
nical report, AMD, aug 2013.
[109] Anthony LaMarca and Richard Ladner. The influence of caches on the perfor-
mance of heaps. J. Exp. Algorithmics, 1, jan 1996.
[110] D. Langr and P. Tvrdı́k. Evaluation Criteria for Sparse Matrix Storage Formats.
IEEE Transactions on Parallel and Distributed Systems, [in press], 2015.
[111] Janghaeng Lee, Mehrzad Samadi, Yongjun Park, and Scott Mahlke. Transparent
CPU-GPU Collaboration for Data-parallel Kernels on Heterogeneous Systems.
In Proceedings of the 22nd International Conference on Parallel Architectures and
Compilation Techniques, PACT ’13, pages 245–256, 2013.
[112] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. Openmp to gpgpu: A
compiler framework for automatic translation and optimization. In Proceed-
ings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP ’09, pages 101–110, 2009.
[113] Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, and Richard Vuduc. An
Input-adaptive and In-place Approach to Dense Tensor-times-matrix Multiply.
In Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, SC ’15, pages 76:1–76:12, 2015.
[114] Jiajia Li, Xingjian Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. An
Optimized Large-scale Hybrid DGEMM Design for CPUs and ATI GPUs. In
Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12,
pages 377–386, 2012.
[115] Jiajia Li, Guangming Tan, Mingyu Chen, and Ninghui Sun. Smat: An input
adaptive auto-tuner for sparse matrix-vector multiplication. In Proceedings
of the 34th ACM SIGPLAN Conference on Programming Language Design and
Implementation, PLDI ’13, pages 117–126, 2013.
[117] Weifeng Liu and Brian Vinter. Ad-heap: An efficient heap data structure for
asymmetric multicore processors. In Proceedings of Workshop on General Purpose
Processing Using GPUs, GPGPU-7, pages 54:54–54:63, 2014.
[118] Weifeng Liu and Brian Vinter. An Efficient GPU General Sparse Matrix-Matrix
Multiplication for Irregular Data. In Proceedings of the 2014 IEEE 28th Interna-
tional Parallel and Distributed Processing Symposium, IPDPS ’14, pages 370–381,
2014.
[119] Weifeng Liu and Brian Vinter. A Framework for General Sparse Matrix-Matrix
Multiplication on GPUs and Heterogeneous Processors. Journal of Parallel and
Distributed Computing, 85:47 – 61, 2015.
[120] Weifeng Liu and Brian Vinter. CSR5: An Efficient Storage Format for Cross-
Platform Sparse Matrix-Vector Multiplication. In Proceedings of the 29th ACM
International Conference on Supercomputing, ICS ’15, pages 339–350, 2015.
[121] Weifeng Liu and Brian Vinter. Speculative Segmented Sum for Sparse Matrix-
Vector Multiplication on Heterogeneous Processors. Parallel Computing, pages –,
2015.
[122] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. Efficient
Sparse Matrix-Vector Multiplication on x86-based Many-Core Processors. In
Proceedings of the 27th International ACM Conference on International Conference on
Supercomputing, ICS ’13, pages 273–282, 2013.
[123] Yongchao Liu and B. Schmidt. LightSpMV: faster CSR-based sparse matrix-
vector multiplication on CUDA-enabled GPUs. In Application-specific Systems,
Architectures and Processors (ASAP), 2015 IEEE 26th International Conference on,
2015.
[124] Anders Logg, Kent-Andre Mardal, and Garth Wells. Automated solution of
differential equations by the finite element method: The FEniCS book, volume 84.
Springer Science & Business Media, 2012.
[125] Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. Qilin: Exploiting parallelism
on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the
42Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO
42, pages 45–55, 2009.
[126] Dimitar Lukarski and Nico Trost. PARALUTION - User Manual. Technical
Report 1.0.0, PARALUTION Labs UG (haftungsbeschränkt) Co. KG, Feb 2015.
[127] Daniel Lustig and Margaret Martonosi. Reducing gpu offload latency via
fine-grained cpu-gpu synchronization. In Proceedings of the 2013 IEEE 19th
International Symposium on High Performance Computer Architecture (HPCA),
HPCA ’13, pages 354–365, 2013.
[128] Fredrik Manne. Load Balancing in Parallel Sparse Matrix Computations. PhD
thesis, Department of Informatics, University of Bergen, Norway, 1993.
[129] Michele Martone. Efficient Multithreaded Untransposed, Transposed or Sym-
metric Sparse Matrix-Vector Multiplication with the Recursive Sparse Blocks
Format. Parallel Computing, 40(7):251 – 270, 2014.
[130] K. Matam, S.R.K.B. Indarapu, and K. Kothapalli. Sparse Matrix-Matrix Multi-
plication on Modern Architectures. In High Performance Computing (HiPC), 2012
19th International Conference on, pages 1–10, 2012.
[131] T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Faloutsos, J. Feo,
J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, A. Lumsdaine,
D. Padua, S. Poole, S. Reinhardt, M. Stonebraker, S. Wallach, and A. Yoo. Stan-
dards for graph algorithm primitives. In High Performance Extreme Computing
Conference (HPEC), 2013 IEEE, pages 1–2, 2013.
[132] Victor Minden, Barry Smith, and Matthew G. Knepley. Preliminary Implementa-
tion of PETSc Using GPUs. In David A. Yuen, Long Wang, Xuebin Chi, Lennart
Johnsson, Wei Ge, and Yaolin Shi, editors, GPU Solutions to Multi-scale Prob-
lems in Science and Engineering, Lecture Notes in Earth System Sciences, pages
131–140. Springer Berlin Heidelberg, 2013.
[133] Perhaad Mistry, Yash Ukidave, Dana Schaa, and David Kaeli. Valar: A bench-
mark suite to study the dynamic behavior of heterogeneous systems. In Proceed-
ings of the 6th Workshop on General Purpose Processor Using Graphics Processing
Units, GPGPU-6, pages 54–65, 2013.
[134] Sparsh Mittal and Jeffrey S. Vetter. A Survey of CPU-GPU Heterogeneous
Computing Techniques. ACM Comput. Surv., 47(4):69:1–69:35, 2015.
[135] Alexander Monakov, Anton Lokhmotov, and Arutyun Avetisyan. Automati-
cally tuning sparse matrix-vector multiplication for gpu architectures. In High
Performance Embedded Architectures and Compilers, volume 5952 of Lecture Notes
in Computer Science, pages 111–125. Springer Berlin Heidelberg, 2010.
[136] Aaftab Munshi. The OpenCL Specification. Khronos OpenCL Working Group,
2.0 edition, Mar 2014.
[137] Dan Negrut, Radu Serban, Ang Li, and Andrew Seidl. Unified Memory in
CUDA 6: A Brief Overview and Related Data Access/Transfer Issues. Technical
Report TR-2014-09, University of Wisconsin–Madison, Jun 2014.
[138] Naoki Nishikawa, Keisuke Iwai, and Takakazu Kurokawa. Power efficiency
evaluation of block ciphers on gpu-integrated multicore processor. In Yang
Xiang, Ivan Stojmenovic, Bernady O. Apduhan, Guojun Wang, Koji Nakano,
and Albert Zomaya, editors, Algorithms and Architectures for Parallel Processing,
volume 7439 of Lecture Notes in Computer Science, pages 347–361. Springer Berlin
Heidelberg, 2012.
[139] Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick.
When Cache Blocking of Sparse Matrix Vector Multiply Works and Why. Ap-
plicable Algebra in Engineering, Communication and Computing, 18(3):297–311,
2007.
[141] nVidia. NVIDIA Tegra K1 A New Era in Mobile Computing, 1.1 edition, Jan 2014.
[142] S. Odeh, O. Green, Z. Mwassi, O. Shmueli, and Y. Birk. Merge Path - Par-
allel Merging Made Simple. In Parallel and Distributed Processing Symposium
Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 1611–1618,
2012.
[143] Gloria Ortega, Francisco Vázquez, Inmaculada Garcı́a, and Ester M. Garzón.
FastSpMM: An Efficient Library for Sparse Matrix Matrix Product on GPUs.
The Computer Journal, 57(7):968–979, 2014.
[144] Rasmus Pagh and Morten Stöckel. The Input/Output Complexity of Sparse
Matrix Multiplication. In Andreas S. Schulz and Dorothea Wagner, editors,
Algorithms - ESA 2014, Lecture Notes in Computer Science, pages 750–761.
Springer Berlin Heidelberg, 2014.
[147] Juan C. Pichel, Francisco F. Rivera, Marcos Fernández, and Aurelio Rodrı́guez.
Optimization of Sparse Matrix-Vector Multiplication Using Reordering Tech-
niques on GPUs. Microprocessors and Microsystems, 36(2):65 – 77, 2012.
[149] A. Pinar and M.T. Heath. Improving performance of sparse matrix-vector mul-
tiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing,
SC ’99, 1999.
[152] Huamin Ren, Weifeng Liu, Søren Ingvor Olsen, Sergio Escalera, and Thomas B.
Moeslund. Unsupervised Behavior-Specific Dictionary Learning for Abnormal
Event Detection. In Mark W. Jones Xianghua Xie and Gary K. L. Tam, editors,
Proceedings of the British Machine Vision Conference (BMVC), pages 28.1–28.13.
BMVA Press, 2015.
[153] Huamin Ren and T.B. Moeslund. Abnormal Event Detection Using Local Sparse
Representation. In Advanced Video and Signal Based Surveillance (AVSS), 2014
11th IEEE International Conference on, pages 125–130, 2014.
[154] Karl Rupp, Florian Rudolf, and Josef Weinbub. ViennaCL - A High Level Linear
Algebra Library for GPUs and Multi-Core CPUs. In GPUScA, pages 51–56,
2010.
[155] Yousef Saad. Sparskit : A basic tool kit for sparse matrix computations. Techni-
cal Report RIACS-90-20, Research Institute for Advanced Computer Science,
NASA Ames Research Center, Moffett Field, CA, 1990.
[156] Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial
and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003.
[158] Erik Saule, Kamer Kaya, and Ümit V. Çatalyürek. Performance evaluation of
sparse matrix multiplication kernels on intel xeon phi. In Parallel Processing
and Applied Mathematics, Lecture Notes in Computer Science, pages 559–570.
Springer Berlin Heidelberg, 2014.
[159] M.J. Schulte, M. Ignatowski, G.H. Loh, B.M. Beckmann, W.C. Brantley, S. Gu-
rumurthi, N. Jayasena, I. Paul, S.K. Reinhardt, and G. Rodgers. Achieving
Exascale Capabilities through Heterogeneous Computing. Micro, IEEE, 35(4):26–
36, 2015.
[161] Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan
Primitives for GPU Computing. In Proceedings of the 22Nd ACM SIGGRAPH/EU-
ROGRAPHICS Symposium on Graphics Hardware, GH ’07, pages 97–106, 2007.
[162] Jie Shen, Jianbin Fang, Ana Lucia Varbanescu, and Henk Sips. An Application-
Centric Evaluation of OpenCL on Multi-Core CPUs. Parallel Computing,
39(12):834–850, 2013.
[163] Jie Shen, Ana Lucia Varbanescu, Henk Sips, Michael Arntzen, and Dick G.
Simons. Glinda: A framework for accelerating imbalanced applications on
heterogeneous platforms. In Proceedings of the ACM International Conference on
Computing Frontiers, CF ’13, pages 14:1–14:10, 2013.
[164] Jie Shen, Ana Lucia Varbanescu, Peng Zou, Yutong Lu, and Henk Sips. Im-
proving Performance by Matching Imbalanced Workloads with Heterogeneous
Platforms. In Proceedings of the 28th ACM International Conference on Supercom-
puting, ICS ’14, pages 241–250, 2014.
[165] A. Sidelnik, S. Maleki, B.L. Chamberlain, M.J. Garzaran, and D. Padua. Perfor-
mance Portability with the Chapel Language. In Parallel Distributed Processing
Symposium (IPDPS), 2012 IEEE 26th International, pages 582–594, 2012.
[167] Kenneth Skovhede, Morten N. Larsen, and Brian Vinter. Extending Distributed
Shared Memory for the Cell Broadband Engine to a Channel Model. In Applied
Parallel and Scientific Computing - 10th International Conference, PARA 2010, pages
108–118, 2010.
[168] Kenneth Skovhede, Morten N. Larsen, and Brian Vinter. Programming the
CELL-BE using CSP. In 33rd Communicating Process Architectures Conference,
CPA 2011, pages 55–70, 2011.
[169] Kyle L. Spafford, Jeremy S. Meredith, Seyong Lee, Dong Li, Philip C. Roth, and
Jeffrey S. Vetter. The tradeoffs of fused memory hierarchies in heterogeneous
computing architectures. In Proceedings of the 9th Conference on Computing
Frontiers, CF ’12, pages 103–112, 2012.
[172] Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen,
Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, and Weng-Fai
Wong. Accelerating sparse matrix-vector multiplication on gpus using bit-
representation-optimized schemes. In Proceedings of SC13: International Confer-
ence for High Performance Computing, Networking, Storage and Analysis, SC ’13,
pages 26:1–26:12, 2013.
[173] Wai Teng Tang, Ruizhe Zhao, Mian Lu, Yun Liang, Huynh Phung Huynh, Xibai
Li, and Rick Siow Mong Goh. Optimizing and Auto-Tuning Scale-Free Sparse
Matrix-Vector Multiplication on Intel Xeon Phi. In Proceedings of the 13th Annual
IEEE/ACM International Symposium on Code Generation and Optimization, CGO
’15, pages 136–145, 2015.
[174] Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and
Joel Emer. Scheduling heterogeneous multi-cores through performance impact
estimation (pie). In Proceedings of the 39th Annual International Symposium on
Computer Architecture, ISCA ’12, pages 213–224, 2012.
[175] Virginia Vassilevska, Ryan Williams, and Raphael Yuster. Finding Heaviest H-
subgraphs in Real Weighted Graphs, with Applications. ACM Trans. Algorithms,
6(3):44:1–44:23, jul 2010.
[176] F. Vazquez, G. Ortega, J.J. Fernandez, I. Garcia, and E.M. Garzon. Fast Sparse
Matrix Matrix Product Based on ELLR-T and GPU Computing. In Parallel
and Distributed Processing with Applications (ISPA), 2012 IEEE 10th International
Symposium on, pages 669–674, 2012.
[177] Richard Vuduc, James W Demmel, and Katherine A Yelick. OSKI: A Library of
Automatically Tuned Sparse Matrix Kernels. Journal of Physics: Conference Series,
16(1):521–530, 2005.
[178] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels.
PhD thesis, University of California, Berkeley, Dec 2003.
[179] Richard W. Vuduc and Hyun-Jin Moon. Fast Sparse Matrix-Vector Multiplication
by Exploiting Variable Block Structure. In High Performance Computing and
Communications, volume 3726 of Lecture Notes in Computer Science, pages 807–
816. Springer Berlin Heidelberg, 2005.
[180] Hao Wang, S. Potluri, D. Bureddy, C. Rosales, and D.K. Panda. GPU-Aware MPI
on RDMA-Enabled Clusters: Design, Implementation and Evaluation. Parallel
and Distributed Systems, IEEE Transactions on, 25(10):2595–2605, 2014.
[181] Jin Wang, Norman Rubin, Haicheng Wu, and Sudhakar Yalamanchili. Ac-
celerating simulation of agent-based models on heterogeneous architectures.
[195] Nan Zhang. A Novel Parallel Scan for Multicore Processors and Its Application
in Sparse Matrix-Vector Multiplication. Parallel and Distributed Systems, IEEE
Transactions on, 23(3):397–404, Mar 2012.
[196] Xianyi Zhang, Chao Yang, Fangfang Liu, Yiqun Liu, and Yutong Lu. Optimizing
and Scaling HPCG on Tianhe-2: Early Experience. In Algorithms and Architectures
for Parallel Processing, volume 8630 of Lecture Notes in Computer Science, pages
28–41. Springer International Publishing, 2014.
A. Benchmark Suite
To maintain a relatively fair performance comparison, Duff et al. [66] advocated building publicly accessible sparse matrix collections. Over the past several decades, a few collections (e.g., the SPARSKIT Collection [155], the Rutherford-Boeing Sparse Matrix Collection [67], the Matrix Market Collection [29] and the University of Florida Sparse Matrix Collection [56]) have been established and widely used for evaluating sparse matrix algorithms. The University of Florida Sparse Matrix Collection [56] is a superset of the earlier collections and currently contains over 2,700 sparse matrices. Therefore, all matrices used in this thesis (except a dense matrix named Dense) are downloaded from this collection. Details of the matrices are listed in alphabetical order in Table A.1; a short code sketch of loading such a matrix precedes the table, and a second sketch after the table shows how the per-row nonzero statistics can be reproduced.
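The collection distributes its matrices in the Matrix Market coordinate format, among other formats. As a point of reference only, the following minimal C++ sketch (illustrative code, not code from this thesis; the CsrMatrix type and the load_mtx_as_csr function are invented names) reads a "coordinate real general" Matrix Market file and assembles the CSR arrays (row pointer, column indices, values) used throughout the thesis. Symmetric, pattern and complex matrices would require extra handling.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Illustrative CSR container; field names are this sketch's own, not the thesis'.
struct CsrMatrix {
    int rows = 0, cols = 0;
    std::vector<int> row_ptr;    // size rows + 1
    std::vector<int> col_idx;    // size nnz
    std::vector<double> val;     // size nnz
};

// Reads a "coordinate real general" Matrix Market file and builds CSR.
bool load_mtx_as_csr(const std::string &path, CsrMatrix &A) {
    std::ifstream in(path);
    if (!in) return false;
    std::string line;
    // Skip the banner and '%' comment lines; the next line holds the sizes.
    while (std::getline(in, line))
        if (!line.empty() && line[0] != '%') break;
    int nnz = 0;
    std::istringstream header(line);
    if (!(header >> A.rows >> A.cols >> nnz)) return false;
    // Read the coordinate triplets (Matrix Market indices are 1-based).
    std::vector<int> ri(nnz), ci(nnz);
    std::vector<double> vv(nnz);
    for (int k = 0; k < nnz; ++k) {
        if (!(in >> ri[k] >> ci[k] >> vv[k])) return false;
        --ri[k]; --ci[k];
    }
    // Counting-sort style CSR assembly: count per-row nonzeros, prefix-sum, scatter.
    A.row_ptr.assign(A.rows + 1, 0);
    for (int k = 0; k < nnz; ++k) ++A.row_ptr[ri[k] + 1];
    for (int i = 0; i < A.rows; ++i) A.row_ptr[i + 1] += A.row_ptr[i];
    A.col_idx.resize(nnz);
    A.val.resize(nnz);
    std::vector<int> next(A.row_ptr.begin(), A.row_ptr.end() - 1);
    for (int k = 0; k < nnz; ++k) {
        const int dst = next[ri[k]]++;
        A.col_idx[dst] = ci[k];
        A.val[dst] = vv[k];
    }
    return true;
}

int main(int argc, char **argv) {
    CsrMatrix A;
    if (argc < 2 || !load_mtx_as_csr(argv[1], A)) {
        std::cerr << "usage: load_mtx <matrix.mtx>" << std::endl;
        return 1;
    }
    std::cout << A.rows << " x " << A.cols
              << ", nnz = " << A.row_ptr[A.rows] << std::endl;
    return 0;
}

The CSR assembly above is a simple counting sort over the row indices; the thesis itself does not prescribe this particular loader.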
Table A.1: The benchmark suite (the sparsity plots of the original table are omitted).
Name: 2cubes_sphere
#rows: 101,492
#columns: 101,492
#nonzeros: 1,647,264
#nonzeros per row (min; avg; max): 5; 16; 31
Author: E. Um
Kind: Electromagnetics problem
Description: FEM, electromagnetics, 2 cubes in a sphere.
Name: ASIC_680k
#rows: 682,862
#columns: 682,862
#nonzeros: 3,871,773
#nonzeros per row (min; avg; max): 1; 6; 395,259
Author: R. Hoekstra
Kind: Circuit simulation problem
Description: Sandia, Xyce circuit simulation matrix
(stripped).
Name: Boyd2
#rows: 466,316
#columns: 466,316
#nonzeros: 1,500,397
#nonzeros per row (min; avg; max): 2; 3; 93,262
Author: N. Gould
Kind: Optimization problem
Description: KKT matrix - convex QP (CUTEr).
Name: Cage12
#rows: 130,228
#columns: 130,228
#nonzeros: 2,032,536
#nonzeros per row (min; avg; max): 5; 15; 33
Author: A. van Heukelum
Kind: Directed weighted graph
Description: DNA electrophoresis, 12 monomers in polymer.
Name: Circuit
#rows: 170,998
#columns: 170,998
#nonzeros: 958,936
#nonzeros per row (min; avg; max): 1; 5; 353
Author: S. Hamm
Kind: Circuit simulation problem
Description: Motorola circuit simulation.
Name: Circuit5M
#rows: 5,558,326
#columns: 5,558,326
#nonzeros: 59,524,291
#nonzeros per row (min; avg; max): 1; 10; 1,290,501
Author: K. Gullapalli
Kind: Circuit simulation problem
Description: Large circuit from Freescale Semiconductor.
Name: Dc2
#rows: 116,835
#columns: 116,835
#nonzeros: 766,396
#nonzeros per row (min; avg; max): 1; 6; 114,190
Author: T. Lehner
Kind: Subsequent circuit simulation problem
Description: IBM EDA circuit simulation matrix.
Name: Dense
#rows: 2,000
#columns: 2,000
#nonzeros: 4,000,000
#nonzeros per row (min; avg; max): 2,000; 2,000; 2,000
Author: Unknown
Kind: Dense
Description: Dense matrix in sparse format.
Name: Economics
#rows: 206,500
#columns: 206,500
#nonzeros: 1,273,389
#nonzeros per row (min; avg; max): 1; 6; 44
Author: Unknown
Kind: Economic problem
Description: Macroeconomic model.
Name: Epidemiology
#rows: 525,825
#columns: 525,825
#nonzeros: 2,100,225
#nonzeros per row (min; avg; max): 2; 3; 4
Author: Unknown
Kind: 2D/3D problem
Description: 2D Markov model of epidemic.
Name: Eu-2005
#rows: 862,664
#columns: 862,664
#nonzeros: 19,235,140
#nonzeros per row (min; avg; max): 0; 22; 6,985
Author: Università degli Studi di Milano
Kind: Directed graph
Description: Small web crawl of .eu domain.
Name: FEM/Accelerator
#rows: 121,192
#columns: 121,192
#nonzeros: 2,624,331
#nonzeros per row (min; avg; max): 0; 21; 81
Author: Unknown
Kind: 2D/3D problem
Description: FEM/Accelerator: Accelerator cavity design.
Name: FEM/Cantilever
#rows: 62,451
#columns: 62,451
#nonzeros: 4,007,383
#nonzeros per row (min; avg; max): 1; 64; 78
Author: Unknown
Kind: 2D/3D problem
Description: FEM/Cantilever.
Name: FEM/Harbor
#rows: 46,835
#columns: 46,835
#nonzeros: 2,329,092
#nonzeros per row (min; avg; max): 4; 50; 145
Author: S. Bova
Kind: Computational fluid dynamics problem
Description: 3D CFD model, Charleston harbor.
Name: FEM/Ship
#rows: 140,874
#columns: 140,874
#nonzeros: 7,813,404
#nonzeros per row (min; avg; max): 24; 55; 102
Author: C. Damhaug
Kind: Structural problem
Description: DNV-Ex 4: Ship section/detail from production run-1999-01-17.
Name: FEM/Spheres
#rows: 83,334
#columns: 83,334
#nonzeros: 6,010,480
#nonzeros per row (min; avg; max): 1; 72; 81
Author: Unknown
Kind: 2D/3D problem
Description: FEM/Spheres: FEM concentric spheres.
Name: Filter3D
#rows: 106,437
#columns: 106,437
#nonzeros: 2,707,179
#nonzeros per row (min; avg; max): 8; 25; 112
Author: D. Hohlfield, T. Bechtold, H. Zappe
Kind: Model reduction problem
Description: Oberwolfach: tunable optical filter.
Name: FullChip
#rows: 2,987,012
#columns: 2,987,012
#nonzeros: 26,621,983
#nonzeros per row (min; avg; max): 1; 8; 2,312,481
Author: K. Gullapalli
Kind: Circuit simulation problem
Description: Circuit simulation from Freescale Semiconductor.
Name: Ga41As41H72
#rows: 268,096
#columns: 268,096
#nonzeros: 18,488,476
#nonzeros per row (min; avg; max): 18; 68; 702
Author: Y. Zhou, Y. Saad, M. Tiago, J. Chelikowsky
Kind: Theoretical/quantum chemistry problem
Description: Real-space pseudopotential method.
Name: Hood
#rows: 220,542
#columns: 220,542
#nonzeros: 10,768,436
#nonzeros per row (min; avg; max): 21; 48; 77
Author: J. Weiher
Kind: Structural problem
Description: INDEED Test Matrix (DC-mh).
Name: In-2004
#rows: 1,382,908
#columns: 1,382,908
#nonzeros: 16,917,053
#nonzeros per row (min; avg; max): 0; 12; 7,753
Author: Università degli Studi di Milano
Kind: Directed graph
Description: Small web crawl of .in domain.
Name: Ins2
#rows: 309,412
#columns: 309,412
#nonzeros: 2,751,484
#nonzeros per row (min; avg; max): 5; 8; 309,412
Author: A. Andrianov
Kind: Optimization problem
Description: ins2 matrix from SAS Institute Inc.
Name: LP
#rows: 4,284
#columns: 1,096,894
#nonzeros: 11,284,032
#nonzeros per row (min; avg; max): 1; 2,633; 56,181
Author: P. Nobili
Kind: Linear programming problem
Description: Italian railways (H. Mittelmann test set).
Name: M133-b3
#rows: 200,200
#columns: 200,200
#nonzeros: 800,800
#nonzeros per row (min; avg; max): 4; 4; 4
Author: V. Welker
Kind: Combinatorial problem
Description: Simplicial complexes from Homology.
Name: Majorbasis
#rows: 160,000
#columns: 160,000
#nonzeros: 1,750,416
#nonzeros per row (min; avg; max): 6; 10; 11
Author: Q. Li and M. Ferris
Kind: Optimization problem
Description: MCP; mixed complementarity optimization
problem; similar to QLi/crashbasis.
Name: Mario002
#rows: 389,874
#columns: 389,874
#nonzeros: 2,097,566
#nonzeros per row (min; avg; max): 2; 5; 7
Author: N. Gould, Y. Hu, J. Scott
Kind: Duplicate 2D/3D problem
Description: Larger matrix from Mario for which MA47 analysis is slow.
Name: Mip1
#rows: 66,463
#columns: 66,463
#nonzeros: 10,352,819
#nonzeros per row (min; avg; max): 4; 155; 66,395
Author: A. Andrianov
Kind: Optimization problem
Description: mip1 matrix from SAS Institute Inc.
Name: QCD
#rows: 49,152
#columns: 49,152
#nonzeros: 1,916,928
#nonzeros per row (min; avg; max): 39; 39; 39
Author: B. Medeke
Kind: Theoretical/quantum chemistry problem
Description: Quantum chromodynamics conf5.4-00l8x8-2000.
Name: Rajat21
#rows: 411,676
#columns: 411,676
#nonzeros: 1,876,011
#nonzeros per row (min; avg; max): 1; 4; 118,689
Author: Rajat
Kind: Circuit simulation problem
Description: Rajat/rajat21 circuit simulation matrix.
Name: Si41Ge41H72
#rows: 185,639
#columns: 185,639
#nonzeros: 15,011,265
#nonzeros per row (min; avg; max): 13; 80; 662
Author: Y. Zhou, Y. Saad, M. Tiago, J. Chelikowsky
Kind: Theoretical/quantum chemistry problem
Description: Real-space pseudopotential method.
Name: Transient
#rows: 178,866
#columns: 178,866
#nonzeros: 961,368
#nonzeros per row (min; avg; max): 1; 5; 60,423
Author: K. Gullapalli
Kind: Circuit simulation problem
Description: Small circuit from Freescale Semiconductor.
Name: Webbase
#rows: 1,000,005
#columns: 1,000,005
#nonzeros: 3,105,536
#nonzeros per row (min; avg; max): 1; 3; 4,700
Author: Unknown
Kind: Weighted directed graph
Description: Web connectivity matrix.
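The "#nonzeros per row (min; avg; max)" figures in Table A.1 follow directly from a CSR row pointer: row i holds row_ptr[i+1] - row_ptr[i] nonzeros. The short C++ sketch below (again illustrative rather than thesis code) reproduces the statistic on a toy row pointer array.

#include <algorithm>
#include <cstdio>
#include <vector>

// Prints the min/avg/max number of nonzeros per row from a CSR row pointer.
void row_length_stats(const std::vector<int> &row_ptr) {
    const int rows = static_cast<int>(row_ptr.size()) - 1;
    int min_len = row_ptr[1] - row_ptr[0];
    int max_len = min_len;
    for (int i = 1; i < rows; ++i) {
        const int len = row_ptr[i + 1] - row_ptr[i];
        min_len = std::min(min_len, len);
        max_len = std::max(max_len, len);
    }
    const double avg = static_cast<double>(row_ptr[rows]) / rows;
    std::printf("min %d; avg %.1f; max %d\n", min_len, avg, max_len);
}

int main() {
    // Toy 4-row example with row lengths 2, 0, 3 and 1.
    const std::vector<int> row_ptr = {0, 2, 2, 5, 6};
    row_length_stats(row_ptr);  // prints: min 0; avg 1.5; max 3
    return 0;
}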
B. Benchmark Platforms
A variety of platforms are used for evaluating the proposed sparse matrix algorithms
in this thesis. The CPUs, GPUs, Xeon Phi and tightly coupled CPU-GPU heteroge-
neous processors are listed separately in this Appendix.
B.1 CPUs
The two CPUs used are listed in Table B.1.
Table B.1: Two CPUs used for benchmarking.
B.2 GPUs
The two nVidia GPUs and one AMD GPU used are listed in Table B.2.
B.3 Xeon Phi
The Intel Xeon Phi used is listed in Table B.3; a short sketch after the table shows how its peak rates follow from the other entries.
Table B.3: The Intel Xeon Phi used for benchmarking.
Vendor: Intel
Family: Xeon Phi
Device: 5110P
Codename: Knights Corner
#Cores: 60
#SIMD units: 60 × 2 × 512-bit wide
Clock: 1.05 GHz
SP flop/cycle: 1920
SP peak: 2016 GFlop/s
DP flop/cycle: 960
DP peak: 1008 GFlop/s
L1 data cache: 60 × 32 kB
L2 cache: 60 × 512 kB
Memory: 8 GB GDDR5
Bandwidth: 320 GB/s
ECC: on
OS (64-bit): Red Hat Enterprise Linux v6.5
Device driver: v3.4-1
µOS: v2.6.38.8
Compiler: Intel C/C++ v15.0.1
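The peak rates in Table B.3 are consistent with core count × vector lanes × 2 × clock, where the factor 2 reflects one fused multiply-add (two flops) per lane per cycle; that reading of the "60 × 2 × 512-bit" entry is this sketch's assumption, not a statement from the thesis. The short C++ sketch below reproduces the flop/cycle and GFlop/s figures.

#include <cstdio>

int main() {
    // Entries taken from Table B.3; the FMA factor of 2 is assumed, not stated.
    const int cores = 60;
    const int vector_bits = 512;
    const double clock_ghz = 1.05;

    const int sp_lanes = vector_bits / 32;               // 16 single-precision lanes
    const int dp_lanes = vector_bits / 64;               // 8 double-precision lanes
    const int sp_flop_per_cycle = cores * sp_lanes * 2;  // 60 * 16 * 2 = 1920
    const int dp_flop_per_cycle = cores * dp_lanes * 2;  // 60 *  8 * 2 =  960

    std::printf("SP: %d flop/cycle, %.0f GFlop/s\n",
                sp_flop_per_cycle, sp_flop_per_cycle * clock_ghz);  // 2016 GFlop/s
    std::printf("DP: %d flop/cycle, %.0f GFlop/s\n",
                dp_flop_per_cycle, dp_flop_per_cycle * clock_ghz);  // 1008 GFlop/s
    return 0;
}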
B.4 Heterogeneous Processors
Table B.4: Intel and nVidia heterogeneous processors used for benchmarking.
D. Short Biography
Weifeng Liu is currently a Ph.D. candidate at Niels Bohr Institute, Faculty of Science,
University of Copenhagen, Denmark. He works in the eScience Center under the supervision of Professor Brian Vinter. Before moving to Copenhagen, he worked as a
senior researcher in high performance computing technology at SINOPEC Exploration
& Production Research Institute for about six years. He received his B.E. degree and
M.E. degree in computer science, both from China University of Petroleum, Beijing,
in 2002 and 2006, respectively. He is a member of the ACM, the IEEE, the CCF and
the SIAM.
His research interests include numerical linear algebra and parallel computing,
particularly in designing algorithms for sparse matrix computations on throughput-
oriented processors. His algorithms run on a variety of many-core devices (e.g., nVidia
GPUs, AMD GPUs, Intel GPUs and Intel Xeon Phi) and CPU-GPU heterogeneous
processors (e.g., nVidia Tegra, AMD Kaveri and Intel Broadwell).