fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms
YUEDAN CHEN, GUOQING XIAO, and KENLI LI, College of Computer Science and Electronic
Engineering, Hunan University, and National Supercomputing Center in Changsha
FRANCESCO PICCIALLI, Department of Electrical Engineering and Information Technologies,
University of Naples Federico II
ALBERT Y. ZOMAYA, School of Information Technologies, University of Sydney
Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations
in many high-performance scientific and engineering applications. The inherent irregularity and poor data locality lead to two main challenges in scaling SpMSpV over high-performance computing (HPC) systems:
(i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the ir-
regular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained
parallel SpMSpV (fgSpMSpV) framework on Sunway TaihuLight supercomputer to alleviate the challenges
for large-scale real-world applications. First, fgSpMSpV adopts an MPI+OpenMP+X parallelization model
to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both
pre-/post-processing and the main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel execution to reduce the pre-processing and adapt to the parallelism and memory hierarchy of the Sunway system, while still taming redundant and random memory accesses in SpMSpV, through a set of techniques including the fine-grained partitioner, re-collection method, and Compressed Sparse Column Vector (CSCV) matrix format.
Third, fgSpMSpV uses several optimization techniques to further utilize the computing resources. fgSpMSpV
on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization tech-
niques with various sparsity of the input. Additionally, fgSpMSpV is implemented on an NVIDIA Tesla P100 GPU and applied to the breadth-first search (BFS) application. fgSpMSpV on a P100 GPU obtains speedups of up to 134.38× over the state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves speedups of up to 21.68× over the state-of-the-art.
Additional Key Words and Phrases: Heterogeneous, HPC, manycore, optimization, parallelism, SpMSpV
The research was partially funded by the National Key R&D Programs of China (Grant No. 2020YFB2104000), the Programs
of National Natural Science Foundation of China (Grant Nos. 62172157, 61860206011, 61806077), the Programs of Hunan
Province, China (Grant Nos. 2020RC2032, 2021RC3062, 2021JJ40109, 2021JJ40121), the Programs of China Postdoctoral
Council (Grant Nos. PC2020025, 2021M701153), the Program of Zhejiang Lab (Grant No. 2022RC0AB03), and the General
Program of Fundamental Research of Shen Zhen (Grant No. JCYJ20210324135409026).
Authors’ addresses: Y. Chen, G. Xiao (corresponding author), and K. Li, College of Computer Science and Electronic Engi-
neering, Hunan University, and National Supercomputing Center in Changsha, Changsha, Hunan 410082, China; emails:
{chenyuedan, xiaoguoqing, lkl}@hnu.edu.cn; F. Piccialli, Department of Electrical Engineering and Information Technolo-
gies, University of Naples Federico II, Naples 80100, Italy; email: [email protected]; A. Y. Zomaya, School of
Information Technologies, University of Sydney, Sydney, NSW 2006, Australia; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
2329-4949/2022/04-ART8 $15.00
https://doi.org/10.1145/3512770
1 INTRODUCTION
Sparse matrices are a common kind of data source in a wide variety of high-performance scien-
tific and industrial engineering applications, such as recommender systems [29], social network
services [23], numerical simulation [42], business intelligence [24], cryptography [11], and algo-
rithms for least squares and eigenvalue problems [33].
Large-scale sparse matrices are frequently used in various applications for several reasons [28]. On one hand, with the exploding numbers of commodities, netizens, and network nodes, there is an increasing need to depict the relationships among the involved entities with matrices. Typically, the matrices obtained from the aforementioned applications contain very few non-zero elements and, hence, can be naturally modeled as sparse matrices. On the other hand, it is difficult to completely observe the relationships among a large number of entities. For instance, for the millions of users and items in recommender systems, each entry in a rating matrix models the usage history of an item by a user. Correspondingly, the matrix is usually very sparse, with many missing values, since each user touches only a finite set of items.
SpMSpV is one of the most fundamental and important operations in all kinds of high-
performance scientific computing and engineering applications, such as machine learning [30, 38],
graph computations [31, 34], data analysis [12, 15], and so on. The mathematical formulation of
SpMSpV is y = A × x, where the input matrix A, the input vector x, and the output vector y are
sparse. For example, many graph computation algorithms are implemented using matrix primitives, such as bipartite graph matching [5, 8], breadth-first search (BFS) [14], connected components [7, 39], and maximal independent sets [13]; their computational essence is to convert a set of active vertices (usually called the “current frontier”) into a new set of active vertices (usually called the “next frontier”). This computational pattern of “frontier expansion” can be neatly captured by SpMSpV, where the input sparse matrix A represents the graph, the input sparse vector x represents the current frontier, and the output sparse vector y represents the next frontier.
Moreover, SpMSpV can even be applied to some typical sparse matrix-dense vector multiplication (SpMV) algorithms, such as PageRank [27], to achieve incremental convergence [4]. Different from SpMV [35, 36], however, the sparsity of x in SpMSpV causes many floating-point operations (multiplications and additions) on non-zeros of A to be omitted, which results in a large amount of redundant data and irregular data accesses in SpMSpV. Moreover, SpMSpV can be considered as a specific case of general sparse matrix-sparse matrix multiplication (SpGEMM) [16], where the second sparse matrix of SpGEMM has only one column. Some efficient SpGEMM algorithms, such as Gustavson's SpGEMM algorithm [21], are suitable for sequential SpMSpV rather than parallel SpMSpV. The reason is that each SpMSpV operation in SpGEMM involves much less work, which requires novel approaches to scale SpMSpV to multiple threads.
Since SpMSpV plays a fundamental role in many scientific and real-world applications, there
has been significant research to scale SpMSpV for large-scale applications by utilizing the pow-
erful parallel computing capacity of state-of-the-art multi-core and manycore processors, such as
Central Processing Unit (CPU) [34], Graphics Processing Unit (GPU) [37], and so on. To ef-
fectively scale, it is necessary to consider both the algorithmic characteristics of SpMSpV and the
architectural advantages of computing platforms. In this article, we study how to scale SpMSpV
over HPC systems as characterized by the Sunway TaihuLight.
The Sunway TaihuLight with 40960 heterogeneous manycore SW26010 chips [17, 22] had held
its top position as the fastest supercomputer in the world from June 2016 to June 2018. As shown
in Figure 1, the first stage of parallelism of Sunway comes from the four Core Groups (CGs)
in the SW26010. Within a CG, the Management Processing Element (MPE), also called host,
manages tasks on the CG, while the Computing Processing Element (CPE) cluster containing
64 CPEs, also called the device, provides the second stage of parallelism. The MPE and CPEs share an 8 GB DDR3 memory. Each CPE has a no-cache memory structure: instead of a data cache, each CPE has only a 64 KB fast scratch-pad memory (SPM), also known as local device memory (LDM). The heterogeneous, multi-stage parallelism and the no-cache memory structure of the Sunway system present both challenges and opportunities for optimizing SpMSpV.
Challenges. Scaling SpMSpV over HPC systems faces two problems:
(i) A large amount of redundant data. Only non-zeros in x, non-zeros in the corresponding
columns in A, and non-zeros in y are the necessary data in SpMSpV. A large amount of redun-
dant data in A, x, and y causes excessive memory footprint, unnecessary data transmission, and
useless computation, which is unfavorable for utilization of bandwidth and parallel resources.
(ii) The irregular data access pattern. The sparsity of A, x, and y results in unpredictable mem-
ory references and random memory accesses in SpMSpV, so that there could be several problems
in exploiting the computing power of the platform, such as expensive latency of non-coalesced
memory accesses, possibility of parallel write collisions, and load imbalance. In addition, the paral-
lelization design of SpMSpV is required to adapt to the heterogeneous and multi-stage parallelism
and no-cache structure of the Sunway.
Therefore, in this article, the fgSpMSpV framework is devised to address the above-mentioned
problems and optimize SpMSpV on the Sunway architecture. The contributions made in this article
are summarized as follows:
(i) We investigate a hybrid MPI+OpenMP+X -based approach for fgSpMSpV on heteroge-
neous HPC architectures. fgSpMSpV exploits heterogeneous inter-node MPI communication
and OpenMP+X intra-node parallelism to accelerate both pre-/post-processing and main
SpMSpV computation in real-world applications.
(ii) We devise an adaptive parallel execution for fgSpMSpV to reduce the pre-processing of each SpMSpV in real-world applications and leverage the heterogeneous, multi-stage parallelism and memory hierarchy of the Sunway TaihuLight, while still taming redundant and random
memory accesses. fgSpMSpV adapts to and exploits the hardware architecture by adopting
the fine-grained partitioner, re-organizes the necessary data in SpMSpV and optimizes mem-
ory access behavior by utilizing the re-collection method, and better preserves the efficiency
of parallel SpMSpV by using the CSCV sparse matrix format.
(iii) We further propose performance optimization techniques for fgSpMSpV to fully utilize the
Single Instruction Multiple Data (SIMD) technique, take advantages of the transmission
bandwidth, and synchronize computation with communication of fgSpMSpV.
(iv) We evaluate fgSpMSpV and its key techniques on the Sunway TaihuLight supercomputer
using real-world datasets. In addition, we also implement fgSpMSpV on an NVIDIA Tesla
P100 and apply it to the BFS application to verify its generality, flexibility, and efficiency.
2 RELATED WORK
Several approaches have been proposed to parallelize the SpMSpV algorithm on various platforms.
GraphMat [34] optimizes parallel SpMSpV for large-scale graph analytics on CPUs, where the distri-
bution structure of non-zeros in the input matrix determines the SpMSpV computation (matrix-
driven). SpMSpV-bucket [6] uses a list of buckets to partition the necessary columns in the input
matrix on the fly and parallelize SpMSpV, where the distribution structure of non-zeros in the input
vector determines the computation (vector-driven). This article proposes a vector-driven SpMSpV
method that re-collects the necessary non-zeros of x, y, and A for a fine-grained and efficient
parallelization of SpMSpV on an HPC architecture.
SpMSpV can be interpreted as a specific case of SpGEMM, where the second sparse matrix of
SpGEMM only has one column. Akbudak et al. [2, 3] utilize the hypergraph model and bipartite
graph model to optimize parallel SpGEMM. Ballard et al. [9] devise a fine-grained hypergraph
model to reduce the sparsity-dependent communication bounds in SpGEMM. Dalton et al. [19]
decompose SpGEMM into three computational stages (expansion, sorting, and compression) to
exploit the intrinsic fine-grained parallelism of throughput-oriented processors, such as GPU.
Additionally, SpMSpV can also be considered as a special case of SpMV where the input vec-
tor x is sparse. Li et al. [26] devise the adaptive SpMV/SpMSpV framework on GPUs that uses
a machine-learning based kernel selector to select the appropriate SpMV/SpMSpV kernel based
on the computing pattern, workload distribution, and write-back strategy. Chen et al. [18] design
the large-scale SpMV on the Sunway TaihuLight that includes two parts. The first part performs
the parallel partial Compressed Sparse Row (CSR)-based SpMV, and the second part performs
the parallel accumulation on the results of the first part. Zhang et al. [40] propose the Blocked
Compressed Common Coordinate (BCCOO) format that uses bit flags to alleviate the band-
width problem of SpMV, and further design partitioning method to divide the matrix into vertical
slices to improve data locality. MpSpMV [1] splits the non-zeros of the input matrix into two parts: a single-precision part and a double-precision part. The non-zeros within the range (−1, 1) belong to the single-precision part and are multiplied by the input vector x using single precision. The other non-zeros belong to the double-precision part and are multiplied by x using double precision. The final double-precision result is created by combining the calculation results of the two parts.
Appropriate data structures for the sparse matrix and the sparse vector are decisive for the performance of sparse matrix operations. Langr and Tvrdík [25] list the state-of-the-art com-
pressed storage formats for sparse matrices. Azad and Buluç [6] use the Compressed Sparse
Column (CSC) format to accelerate parallel SpMSpV. Benatia et al. [10] choose the most suitable
matrix format, from Coordinate (COO), CSR, the Blocked Compressed Sparse Row (BCSR),
ELLPACK (ELL), and Diagonal (DIA) formats, for SpMV by the proposed cost-aware classifica-
tion models on Weighted Support Vector Machines (WSVMs). Zhou et al. [41] select the most
proper matrix format for SpMV by focusing on the runtime prediction and format conversion
overhead.
There has been some work on designing the customized parallelization for sparse matrix opera-
tions, including SpGEMM [16] and SpMV [36], based on the heterogeneous manycore architecture
of Sunway. Especially, the parallel implementation of SpGEMM where the second sparse matrix
is stored in CSC format can be directly used in parallelization design for SpMSpV on the Sunway.
However, there are some performance bottlenecks of the proposed SpGEMM parallelization, i.e.,
redundant memory usage and unnecessary data transfer for the first input matrix, irregular ac-
cesses to columns of the first input matrix, and random accesses to and possible parallel writing
collisions on the results. Moreover, the parallel SpGEMM design on Sunway is not well suited to
the parallelization of SpMSpV. Therefore, this article designs a suitable and efficient parallelization
and optimization for SpMSpV with an adaptive column-wise sparse matrix format on the Sunway,
which solves the performance bottlenecks in [16] as well.
3 BACKGROUND
3.1 Notation of SpMSpV
We define nnz(·) as a function that outputs the number of non-zeros in its input. The input sparse
matrix A of SpMSpV contains M rows, N columns, and nnz(A) non-zeros. The input sparse vector
x has length of N and contains nnz(x ) non-zeros. The output sparse vector y has length of M and
contains nnz(y) non-zeros.
Operations on some non-zeros of A may be omitted since the corresponding elements in x are zeros. Therefore, the number of operations in SpMSpV (including additions and multiplications) must be less than 2 × nnz(A). If A is stored by rows, a necessary step is to find
the non-zeros in each row of A that correspond to the non-zeros in x for calculation. However, if A
is stored by columns, each non-zero of x only accesses the corresponding column in A, as shown
in Figure 2. The non-zero element of x multiplies each non-zero in the corresponding column of
A, and the multiplication results are accumulated to the corresponding elements of y. There is no
need to search for the non-zeros of A involved in the calculation when A is stored by columns, which simplifies the
calculation of SpMSpV. In addition, only the corresponding nnz(x ) columns of A are accessed by
non-zeros of x, which reduces redundant data accesses in SpMSpV.
Consequently, in this article, we only discuss the column-wise SpMSpV. The CSC format is one
of the most popular column-wise compressed storage formats for sparse matrices. It uses three
arrays to store non-zeros in the sparse matrix A by columns, including an integer array storing the
pointers to each column (p[N +1]), an integer array storing the row id of each non-zero (r[nnz(A)]),
and a floating-point array storing the numerical value of each non-zero (v[nnz(A)]). As described
in Algorithm 1, SpMSpV based on CSC format is executed by columns.
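For concreteness, the following minimal C sketch performs the column-wise, CSC-based SpMSpV in the spirit of Algorithm 1 (which is not reproduced here); holding x and y as dense arrays and the function name itself are illustrative simplifications rather than the paper's implementation.

```c
/* Column-wise SpMSpV over a CSC matrix: for every non-zero x[j], scale column j
 * of A and accumulate it into y. p, r, v are the CSC arrays described above. */
void spmspv_csc(int N, const int *p, const int *r, const double *v,
                const double *x, double *y)
{
    for (int j = 0; j < N; ++j) {
        if (x[j] == 0.0)                  /* skip columns whose x entry is zero */
            continue;
        for (int k = p[j]; k < p[j + 1]; ++k)
            y[r[k]] += v[k] * x[j];       /* scatter-accumulate into output row */
    }
}
```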
Fig. 2. An example of column-wise SpMSpV, where nnz(A) = 11, nnz(x) = 4, nnz(y) = 7, N > 4, M > 7, and only four columns of A are accessed.
3.2 The Sunway Architecture
Each CG is equipped with an 8 GB DDR3 shared memory. The MPE of each CG has a 32 KB L1
cache for instructions and a 256 KB L2 cache for instructions and data. Each CPE of a CG has a
16 KB L1 cache for instructions and a 64 KB LDM for data instead of a cache. The LDM of each CPE
provides higher bandwidth and lower access latency than that of the main memory. Bulk transfer
between the LDM of each CPE and the main memory of the CG uses the Direct Memory Access
(DMA) transmission. Nevertheless, each CPE accesses non-coalesced data from the main memory
by Global Load/Store (gld/gst) operations with high overhead.
The Sunway architecture brings three main challenges for designing and implementing com-
puting kernels, as follows:
— How to properly coordinate the heterogeneous computing architecture.
— How to fully develop the multi-stage parallelism (CG- and CPE-stage).
— How to fully utilize the “main memory + LDM” memory hierarchy.
Therefore, an appropriate heterogeneous parallelization is important for the computing kernel
to properly coordinate the MPE and CPEs in each CG. In addition, an adaptive parallel execution
is also critical for the computing kernel to utilize the computing capability and manage data based
on the memory structure of Sunway.
4 OVERVIEW
Figure 3 presents the overview of fgSpMSpV framework on a CG of each SW26010 processor.
fgSpMSpV uses the following three key designs to mitigate the challenges of SpMSpV on HPC
architectures:
In the MPI+OpenMP+X model, X is Athread for the CPE cluster on the Sunway and CUDA for the GPU. Therefore, fgSpMSpV uses the MPI+OpenMP+Athread parallelization model on the Sunway TaihuLight.
In each CG, the MPE parallelizes the re-collection method and CSCV storing using OpenMP to prepare the CSCV-stored sparse matrix and the compressed input sparse vector, and allocates parallel SpMSpV tasks to the CPE cluster. The CPE cluster parallelizes and optimizes SpMSpV via Athread. However, the fast LDM on each CPE has only 64 KB of memory, which requires each CPE to fetch an appropriate amount of data from main memory at a time for parallel computation. Therefore, there are several rounds of processing on each CPE. In each round, the CPE loads data of a size suitable for the LDM from main memory and then executes parallel computation on the received data. Finally, the CPE sends the results back to main memory.
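The round-based processing can be sketched as follows in C; the dma_get/dma_put helpers (plain memcpy here) and the placeholder reduction stand in for the Athread DMA interface and for the real per-round SpMSpV kernel, and are assumptions for illustration only.

```c
#include <string.h>
#include <stddef.h>

#define LDM_BYTES (64 * 1024)                    /* LDM capacity per CPE       */
#define CHUNK (LDM_BYTES / 2 / sizeof(double))   /* half of the LDM per fetch  */

/* Hypothetical stand-ins for the Athread DMA get/put primitives. */
static void dma_get(void *ldm, const void *mem, size_t n) { memcpy(ldm, mem, n); }
static void dma_put(void *mem, const void *ldm, size_t n) { memcpy(mem, ldm, n); }

/* Round-based processing on one CPE: in each round an LDM-sized chunk is
 * fetched from main memory and processed locally; the running result stays in
 * the LDM and is written back to main memory only once, after all rounds. */
void cpe_rounds(const double *main_mem, size_t total, double *result_mem)
{
    static double buf[CHUNK];            /* staged input chunk ("in the LDM") */
    double partial = 0.0;                /* per-CPE partial result            */
    for (size_t done = 0; done < total; done += CHUNK) {
        size_t n = (total - done < CHUNK) ? (total - done) : CHUNK;
        dma_get(buf, main_mem + done, n * sizeof(double));
        for (size_t i = 0; i < n; ++i)   /* placeholder for the real kernel   */
            partial += buf[i];
    }
    dma_put(result_mem, &partial, sizeof(partial));
}
```

Keeping the partial result in the LDM and writing it back only once avoids repeated, non-coalesced stores to main memory.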
Fig. 4. The fine-grained partitioner for the example of fgSpMSpV presented in Figure 2 on the Sunway, where
α = 2 and β = 2.
There is an array of positions of all tiles in A, denoted as Pos[α × β + 1], which provides a fast,
accurate, and intuitive way for each CG and CPE to find the start and end positions of each block
and tile in A. Pos[i × β] and Pos[(i + 1) × β] − 1 are indices of the first and last rows of the block
on CGi in A, respectively, where i = {0, 1, 2, . . . , α − 1}. Pos[i × β + j] and Pos[i × β + j + 1] − 1 are
indices of the first and last rows of the tile on the CPEj of CGi in A, where j = {0, 1, 2, . . . , β − 1}.
Correspondingly, the length of CPE-segy on the CPEj of CGi is Pos[i × β + j + 1] − Pos[i × β + j].
(3) Fine-grained Partitioning. To make better use of the limited memory of the LDM, each tile is further partitioned by columns into a number of CVSs, where each CVS has a size suitable for loading into the LDM. Each CVS has inc column vectors. Correspondingly, x is partitioned into N/inc segments, denoted as LDM-segx, and each LDM-segx has inc elements.
Fig. 5. The re-collection method for x, A, and y in the example presented in Figure 2.
where A(:, j) is the jth column of A. There are three arrays storing A′: the integer array storing column pointers (Colp[nnz(x) + 1]), the integer array storing row ids of non-zeros (Rows[nnz(A′)]), and the floating-point array storing numerical values (Vals[nnz(A′)]).
For the example shown in Figure 2, M′ = 7 and the re-collected A′ is stored by three arrays, i.e., Colp[5] = {0, 3, 6, 7, 8}, Rows[8] = {1, 2, 4, 0, 4, 6, 3, 5}, and Vals[8] = {a1, a2, a3, a4, a5, a6, a7, a8}, as presented in Figure 5.
Re-collection for y—The output vector y is compressed into y′, where only the non-zeros of y are stored. There are two arrays storing y′: the integer array storing the indices of the non-zeros in y (Yi[M′]), and the floating-point array storing the numerical values of the non-zeros (Yv[M′]).
For the example shown in Figure 2, y′ is stored by two arrays, i.e., Yi[7] = {i1, i2, i3, i4, i5, i6, i7} and Yv[7] = {y1, y2, y3, y4, y5, y6, y7}, as shown in Figure 5.
Therefore, only computational data are stored and can be continuously accessed in re-collected
SpMSpV, which exploits the data locality of SpMSpV and improves the memory bandwidth utiliza-
tion. Algorithm 2 describes the sequential algorithm of re-collected SpMSpV.
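A minimal C sketch of the sequential re-collected SpMSpV, in the spirit of Algorithm 2 (not reproduced here), is shown below; the array names follow Figure 5, and the assumption that the row ids in Rows have already been remapped to 0..nnz(y)−1 reflects the Mark-based re-collection described in the overhead analysis of Section 8.

```c
/* Sequential re-collected SpMSpV. colp/rows/vals store the re-collected matrix
 * A'; xv holds the numerical values of the nnz(x) non-zeros of x; yv (length
 * nnz(y)) accumulates the values of y'. */
void spmspv_recollected(int nnzx, const int *colp, const int *rows,
                        const double *vals, const double *xv, double *yv)
{
    for (int j = 0; j < nnzx; ++j)                 /* only necessary columns  */
        for (int k = colp[j]; k < colp[j + 1]; ++k)
            yv[rows[k]] += vals[k] * xv[j];        /* contiguous accesses     */
}
```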
(1) On each CG, the CG-stage re-collected SpMSpV multiplies a block′ of A′ by x′, where the block′ contains the nnz(x) necessary columns in the corresponding block of A, and the nnz(x) column indices correspond to the indices in array Xi[nnz(x)]. The result is the corresponding segment of the re-collected y (y′), denoted as CG-segy′.
Fig. 6. The parallelization of fgSpMSpV for the example presented in Figure 4 on the Sunway, where α = 2,
β = 2, and inc = 2.
(2) On each CPE of the CG, the CPE-stage re-collected SpMSpV multiplies a tile′ in the block′ by x′, where the tile′ contains the nnz(x) columns in the tile of the block, and the nnz(x) column indices correspond to the indices in array Xi[nnz(x)]. The result is the corresponding segment of CG-segy′, denoted as CPE-segy′.
An auxiliary array Pos[α × β + 1] is built to mark the start and end positions of each block′ and tile′ in A′. Pos[i × β] and Pos[(i + 1) × β] − 1 are the indices of the first and last rows of the block′ on CGi in A′, respectively, where i = {0, 1, 2, . . . , α − 1}. Pos[i × β + j] and Pos[i × β + j + 1] − 1 are the indices of the first and last rows of the tile′ on the CPEj of CGi in A′, where j = {0, 1, 2, . . . , β − 1}. Correspondingly, the length of CPE-segy′ on the CPEj of CGi is Pos[i × β + j + 1] − Pos[i × β + j].
(3) Furthermore, the fine-grained re-collected SpMSpV multiplies a CVS in the tile′ by the corresponding segment of x′ (LDM-segx′), where each CVS has inc column vectors of the tile′, and each LDM-segx′ has inc non-zeros. The result is accumulated to the corresponding CPE-segy′.
In particular, as shown in Figure 6, each empty CVS, in which all elements are zeros, and the corresponding LDM-segx′ will not be fetched by the CPE. Additionally, to reduce non-coalesced memory accesses and atomic operations, an array is allocated in each LDM to store the numerical values of CPE-segy′. The results on the CPE are accumulated to the corresponding elements of this array. The results cached in this array are not returned to main memory until the calculations on all the CVSs of the tile′ are completed.
To adapt to the fine-grained parallelization design of fgSpMSpV, we devise a Compressed Sparse Column Vector (CSCV) format for A′ that stores non-zeros by the column vectors of each CVS. The CSCV format is a fine-grained variant of the CSC format. There are also three arrays of CSCV to store A′, i.e., the integer array of column vector pointers (SColp[nnz(x) × α × β + 1]), the integer array of row indices of non-zeros of A′ (SRows[nnz(A′)]), and the floating-point array of numerical values (SVals[nnz(A′)]). The details are as follows:
— SColp stores the positions where each column vector begins and ends in the tiles.
— SRows stores the row id of each non-zero in the tiles.
— SVals stores the value of each non-zero in the tiles.
For the example of the parallelization shown in Figure 4, the CSCV format stores the re-collected matrix A′ by three arrays:
— SColp[17] = {0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 7, 8};
— SRows[8] = {1, 0, 0, 1, 0, 0, 1, 0};
— SVals[8] = {a1, a4, a2, a7, a3, a5, a6, a8}.
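One possible in-memory layout for the CSCV format, written as a C struct, is sketched below; the struct and field names are illustrative assumptions rather than the paper's code.

```c
/* Sketch of a CSCV container. scolp has nnz(x)*alpha*beta + 1 entries: for the
 * tile' assigned to each CPE of each CG it records where every one of the
 * nnz(x) column vectors begins and ends inside srows/svals. */
typedef struct {
    int    *scolp;   /* column-vector pointers, length nnz(x)*alpha*beta + 1 */
    int    *srows;   /* row id of each stored non-zero, length nnz(A')       */
    double *svals;   /* numerical value of each non-zero, length nnz(A')     */
} cscv_t;
```

Under this layout, the non-zeros of column vector c of the tile′ held by the CPE with global index t = i × β + j span positions scolp[t · nnz(x) + c] to scolp[t · nnz(x) + c + 1] − 1 of srows and svals, which is consistent with the SColp[17] example above.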
Algorithm 3 presents the parallel algorithm for fgSpMSpV on a CPE. The data in the arrays of the tile′ and x′, i.e., SColp, SRows, SVals, and Xv, can be accessed continuously by each CPE. The CPE allocates LDM space for these fetched arrays, i.e., scolpldm, srowsldm, svalsldm, and xvldm in Algorithm 3. The arrays of the tile′ and x′ are fetched from main memory to the LDM using DMA. As described in Algorithm 3, after each CPE ensures that the CVS is non-empty (len ≠ 0) according to the array SColp, the required data in arrays SRows, SVals, and Xv are fetched. In addition, the array storing the numerical values of the CPE-segy′ on each CPE, denoted as yvldm[Pos[i × β + j + 1] − Pos[i × β + j]], is allocated and cached in the LDM. The results are cached in this array, and the calculation results of each processing round update the array yvldm. The data in array yvldm are not sent back to main memory (using DMA) until all the tasks on the CPE are finished.
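The per-CPE loop of Algorithm 3 can be sketched as follows; the dma_get/dma_put helpers are the same hypothetical memcpy stand-ins used earlier, the MAX_* buffer sizes are illustrative LDM sizing assumptions, and, for simplicity, nc + 1 column pointers are fetched per round.

```c
#include <string.h>
#include <stddef.h>

#define MAX_INC   512   /* illustrative cap on inc                      */
#define MAX_NNZ  2048   /* illustrative cap on non-zeros per CVS        */
#define MAX_YLEN 1024   /* illustrative cap on the CPE-segy' length     */

/* Hypothetical DMA helpers; real code would use the Athread DMA interface. */
static void dma_get(void *ldm, const void *mem, size_t n) { memcpy(ldm, mem, n); }
static void dma_put(void *mem, const void *ldm, size_t n) { memcpy(mem, ldm, n); }

/* Per-CPE kernel in the spirit of Algorithm 3. tile_colp points to the nnz(x)
 * column-vector pointers of this CPE's tile' inside SColp; rows, vals, and xv
 * live in main memory; ylen = Pos[i*beta+j+1] - Pos[i*beta+j]. */
void cpe_fgspmspv(int nnzx, int inc, const int *tile_colp, const int *rows,
                  const double *vals, const double *xv, double *yv, int ylen)
{
    static int    scolp_ldm[MAX_INC + 1], srows_ldm[MAX_NNZ];
    static double svals_ldm[MAX_NNZ], xv_ldm[MAX_INC], yv_ldm[MAX_YLEN];

    for (int i = 0; i < ylen; ++i) yv_ldm[i] = 0.0;   /* cached CPE-segy'    */

    for (int c0 = 0; c0 < nnzx; c0 += inc) {          /* one CVS per round   */
        int nc = (nnzx - c0 < inc) ? (nnzx - c0) : inc;
        dma_get(scolp_ldm, tile_colp + c0, (nc + 1) * sizeof(int));
        int beg = scolp_ldm[0], len = scolp_ldm[nc] - beg;
        if (len == 0) continue;                       /* skip empty CVSs     */
        dma_get(srows_ldm, rows + beg, len * sizeof(int));
        dma_get(svals_ldm, vals + beg, len * sizeof(double));
        dma_get(xv_ldm, xv + c0, nc * sizeof(double));
        for (int c = 0; c < nc; ++c)                  /* column vectors      */
            for (int k = scolp_ldm[c] - beg; k < scolp_ldm[c + 1] - beg; ++k)
                yv_ldm[srows_ldm[k]] += svals_ldm[k] * xv_ldm[c];
    }
    dma_put(yv, yv_ldm, ylen * sizeof(double));       /* write back once     */
}
```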
A SIMD variable must have 32-byte boundary alignment (16-byte boundary alignment for single-precision floating-point data), as shown in Figure 7; otherwise, it will impair the performance of SIMD.
On each CPE, the non-zeros of each CV in the CVSs are successively multiplied by a corresponding element in x′. However, the number of non-zeros of each CV is irregular, which makes it difficult to vectorize the non-zeros of each column with 32-byte boundary alignment. Therefore, we propose a padding technique for non-vanishing CVs. The CV padding technique pads each non-vanishing CV so that its number of elements is a multiple of four. Therefore, all the non-vanishing CVs on each CPE can be efficiently vectorized with 32-byte alignment.
For a CV with nnz(CV) non-zeros, if the CV is empty, i.e., nnz(CV) = 0, it does not need to be padded, and the number of elements in the CV after padding, denoted as nnz(CV′), remains 0. If the number of non-zeros in the CV is a multiple of four, i.e., nnz(CV) ≠ 0 and nnz(CV) mod 4 = 0, it does not need to be padded, and nnz(CV′) remains nnz(CV). Otherwise, nnz(CV) ≠ 0 and nnz(CV) mod 4 ≠ 0, so the CV needs to be padded with 4 − (nnz(CV) mod 4) elements, i.e., nnz(CV′) = nnz(CV) + 4 − (nnz(CV) mod 4). Therefore, nnz(CV′) on a CPE can be expressed as

$$
\mathrm{nnz}(CV') =
\begin{cases}
0, & \mathrm{nnz}(CV) = 0, \\
\mathrm{nnz}(CV), & \mathrm{nnz}(CV) \neq 0 \ \text{and} \ \mathrm{nnz}(CV) \bmod 4 = 0, \\
\mathrm{nnz}(CV) + 4 - (\mathrm{nnz}(CV) \bmod 4), & \mathrm{nnz}(CV) \neq 0 \ \text{and} \ \mathrm{nnz}(CV) \bmod 4 \neq 0.
\end{cases}
\tag{2}
$$
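Equation (2) translates directly into a small helper function; the following C sketch (the function name is an illustrative assumption) computes the padded length of a CV.

```c
/* Number of elements in a column vector after the CV padding of Equation (2):
 * non-empty CVs are padded up to the next multiple of four so that they can be
 * vectorized with 256-bit (4 x double) SIMD and 32-byte alignment. */
static inline int padded_nnz_cv(int nnz_cv)
{
    if (nnz_cv == 0)             /* empty CVs are never padded */
        return 0;
    int rem = nnz_cv % 4;
    return rem ? nnz_cv + 4 - rem : nnz_cv;
}
```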
7.2 DMA Bandwidth Optimization
As shown in Algorithm 3, each CVS has inc column vectors, so that the number of transferred
elements in both the integer array SColp and the floating-point number array Xv is inc. The num-
ber of transferred elements in both the integer array SRows and the floating-point number array
SVals is len, where len is the number of non-zeros in the non-vanishing CVS and can be calculated
by SColp. Moreover, the number of transferred elements of the floating-point number array of the
CPE-segy on the CPEj of CGi is Pos[i × β + j + 1] − Pos[i × β + j]. All floating-point numbers are
set as doubles, each of which has 8 bytes. Each integer number has 4 bytes. The memory storage
of LDM is 64 KB, so the amount of the transferred data in each round must meet:
4inc + 8inc + 4len + 8len + 8(Pos[i × β + j + 1] − Pos[i × β + j]) ≤ 64 × 1024. (3)
The above-mentioned arrays are transferred between main memory and LDM using DMA. Ac-
cording to the performance characteristics of DMA transmission, when the amount of data trans-
ferred is a multiple of 128 bytes and exceeds 512 bytes, the DMA bandwidth achieves an expected
performance. To improve the DMA bandwidth utilization, the amount of each transferred array in
each processing round should meet the optimization criteria.
The array Yv storing CPE-segy is transferred by the CPE only once, and its size is irregular
on each CPE. In addition, the data size in arrays SRows and SVals for storing each non-vanishing
CVS is irregular. So, it is challenging for these three arrays to meet the DMA optimization criteria.
However, for the amount of transferred data in array SColp in each round, we have 4inc = 128n
and 4inc ≥ 512, where n ∈ N + . For array Xv, we have 8inc = 128m and 8inc ≥ 512, where m ∈ N + .
Consequently, the DMA transmissions of SColp and Xv can achieve the optimized performance when the parameter inc satisfies Equation (4), i.e., when inc is a multiple of 32 and no smaller than 128, so that both 4inc and 8inc are multiples of 128 bytes and at least 512 bytes.
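A possible way to pick inc that respects both the LDM capacity constraint of Equation (3) and the DMA-friendliness criteria discussed above is sketched below in C; the search range, the fallback value, and the interface (including the len_max and ylen bounds) are illustrative assumptions rather than the paper's tuning procedure.

```c
/* Pick the largest inc that (i) keeps one round of data within the 64 KB LDM
 * (Equation (3)) and (ii) keeps the SColp/Xv transfers DMA-friendly (inc is a
 * multiple of 32 and at least 128). len_max is an upper bound on the non-zeros
 * of one CVS; ylen is the length of the cached CPE-segy' array. */
static int choose_inc(int len_max, int ylen)
{
    for (int inc = 4096; inc >= 128; inc -= 32) {
        long bytes = 4L * inc + 8L * inc           /* SColp + Xv              */
                   + 4L * len_max + 8L * len_max   /* SRows + SVals           */
                   + 8L * ylen;                    /* cached CPE-segy' values */
        if (bytes <= 64L * 1024L)
            return inc;
    }
    return 128;  /* fall back to the smallest DMA-friendly value */
}
```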
8 PERFORMANCE ANALYSIS
This section will detail the overhead analysis for re-collecting x, A, and y and the parallel runtime
analysis of fgSpMSpV. Table 1 lists the explanations of important symbols used in this article.
Table 1. Symbols and Their Descriptions
Symbol Description
α The number of CGs used in Sunway system
β The number of CPEs used in each CG
CGi The i-th CG of Sunway, where i ∈ {0, 1, 2, . . . , α − 1}
CPEj The j-th CPE of each CG, where j ∈ {0, 1, 2, . . . , β − 1}
A The M × N input sparse matrix of SpMSpV
nnz(·) The number of non-zeros in its input
T The execution time of fgSpMSpV on each CPE
L The overhead of loading data by each CPE
Lf The overhead of loading data in the first processing round on each CPE
C The overhead of calculations on each CPE
Cp The overhead of calculations in the penultimate round on each CPE
R The overhead of returning data to memory by each CPE
numerical value of each non-zero. Arrays Xi and Xv are obtained by going through N elements of
x. Therefore, the time complexity of re-collecting x is O(N ).
For re-collecting A, the input matrix A (M rows, N columns, and nnz(A) non-zeros) is compressed and re-collected into A′ (M′ rows, nnz(x) columns, and nnz(A′) non-zeros). First, the nnz(A′) non-zeros in the corresponding nnz(x) columns of A are traversed to obtain the arrays SColp[nnz(x) + 1] and SVals[nnz(A′)], which respectively store the position pointers of each column in A′ and the numerical values of the nnz(A′) non-zeros. Besides, an array Mark[M] is obtained, in which every element is zero except Mark[i], where i is the row index of a non-zero of A′. This step has a time complexity of O(nnz(A′)). Second, the M elements of array Mark are traversed to replace each non-zero element of Mark with its rank among all the non-zero elements of Mark, i.e., the order in which it appears. This step has a time complexity of O(M). Then, the nnz(A′) non-zeros of A′ are traversed again based on array Mark to obtain the array SRows[nnz(A′)], which stores the row indices of the nnz(A′) non-zeros; this step has a time complexity of O(nnz(A′)). Therefore, the time complexity of re-collecting A is O(nnz(A′) + M).
For re-collecting y, the output vector y is compressed into y′, in which Yi[nnz(y)] stores the index of each non-zero and Yv[nnz(y)] stores the numerical value of each non-zero. Array Yv is obtained by performing the SpMSpV computation, while array Yi is obtained by traversing the corresponding nnz(A′) non-zeros in A; their row indices in A are the elements of array Yi. Therefore, the time complexity of the re-collection for y is O(nnz(A′)).
In conclusion, the re-collection method has a time complexity of O(N + nnz(A′) + M).
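The three passes of re-collecting A can be sketched in C as follows; the interface and the in-place remapping of the row ids in the third pass are illustrative simplifications of the Mark-based procedure described above, and the output array names follow Figure 5.

```c
#include <stdlib.h>

/* Re-collect A into A', assuming A is given in CSC (p, r, v) and xi lists the
 * nnz(x) column indices of the non-zeros of x. Outputs: Colp (nnzx+1 entries),
 * Rows and Vals (nnz(A') entries); nnz_y receives nnz(y). */
void recollect_A(int M, const int *p, const int *r, const double *v,
                 int nnzx, const int *xi,
                 int *Colp, int *Rows, double *Vals, int *nnz_y)
{
    int *mark = calloc(M, sizeof(int));
    int cnt = 0;
    /* Pass 1: gather the needed columns, fill Colp/Vals, and mark used rows. */
    Colp[0] = 0;
    for (int j = 0; j < nnzx; ++j) {
        int col = xi[j];
        for (int k = p[col]; k < p[col + 1]; ++k) {
            Vals[cnt] = v[k];
            Rows[cnt] = r[k];          /* original row id, remapped in pass 3 */
            mark[r[k]] = 1;
            ++cnt;
        }
        Colp[j + 1] = cnt;
    }
    /* Pass 2: replace marks with the rank (order of appearance) of each row. */
    int rank = 0;
    for (int i = 0; i < M; ++i)
        if (mark[i]) mark[i] = ++rank;      /* 1-based rank; 0 means unused   */
    /* Pass 3: remap row ids of A' to the compressed index space 0..nnz(y)-1. */
    for (int k = 0; k < cnt; ++k)
        Rows[k] = mark[Rows[k]] - 1;
    *nnz_y = rank;
    free(mark);
}
```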
overhead in the final round on the CPE. Using the optimized prefetching, the execution time of
fgSpMSpV on a CPE, denoted as T , is reduced and can be expressed as
T = max{L − L_f, C − C_p} + L_f + C_p + R. (6)
In the following two subsections, we will detail the analysis of L, L f , R, C, and Cp .
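A double-buffering sketch of the optimized prefetching behind Equation (6) is given below; the asynchronous DMA stand-ins, the buffer size, and the summation used as the per-round computation are all hypothetical placeholders rather than the Athread API or the actual kernel.

```c
#include <string.h>
#include <stddef.h>

#define CHUNK 1024

/* Hypothetical asynchronous DMA stand-ins: dma_get_async starts a transfer and
 * dma_wait blocks until it completes (modeled synchronously with memcpy here). */
static void dma_get_async(void *ldm, const void *mem, size_t n) { memcpy(ldm, mem, n); }
static void dma_wait(void) { }

/* Double buffering: while round t is computed from one LDM buffer, the data of
 * round t+1 is prefetched into the other, so the load of round t+1 overlaps
 * with the computation of round t (cf. Equation (6)). */
double cpe_prefetch_sum(const double *src, size_t total)
{
    static double buf[2][CHUNK];
    size_t rounds = (total + CHUNK - 1) / CHUNK, done = 0;
    size_t first = (total < CHUNK) ? total : CHUNK;
    double acc = 0.0;

    dma_get_async(buf[0], src, first * sizeof(double));   /* first load (L_f) */
    dma_wait();
    for (size_t t = 0; t < rounds; ++t) {
        size_t n = (total - done < CHUNK) ? (total - done) : CHUNK;
        size_t next = total - done - n;                    /* size of round t+1 */
        if (next > CHUNK) next = CHUNK;
        if (t + 1 < rounds)                                /* prefetch round t+1 */
            dma_get_async(buf[(t + 1) & 1], src + done + n, next * sizeof(double));
        for (size_t i = 0; i < n; ++i)                     /* compute round t    */
            acc += buf[t & 1][i];
        dma_wait();                                        /* ensure t+1 arrived */
        done += n;
    }
    return acc;
}
```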
$$
C - C_p = \frac{1}{4S}\sum_{k=\xi}^{\zeta} 2\,\mathrm{nnz}(CV'_k) - \frac{C_s}{4\,\mathrm{nnz}(tile')}\sum_{k=\lambda}^{\zeta} \mathrm{nnz}(CV'_k)
= \frac{C_s}{4\,\mathrm{nnz}(tile')}\sum_{k=\xi}^{\lambda-1} \mathrm{nnz}(CV'_k),
\tag{12}
$$

$$
T_{i\times\beta+j} = \max\left\{ \frac{12}{B}\left[\mathrm{nnz}(x) - inc + 1 + \sum_{k=\varphi+1}^{\zeta} \mathrm{nnz}(CV'_k)\right],\ \frac{C_s}{4\,\mathrm{nnz}(tile')}\sum_{k=\xi}^{\lambda-1} \mathrm{nnz}(CV'_k) \right\}
+ \frac{12}{B}\left[inc + \sum_{k=\xi}^{\varphi} \mathrm{nnz}(CV'_k)\right]
+ \frac{C_s}{4\,\mathrm{nnz}(tile')}\sum_{k=\lambda}^{\zeta} \mathrm{nnz}(CV'_k)
+ \frac{8\,(\mathrm{Pos}[i\times\beta+j+1] - \mathrm{Pos}[i\times\beta+j])}{B},
\tag{13}
$$

where ξ = (i × β + j)·nnz(x), φ = (i × β + j)·nnz(x) + inc − 1, λ = (i × β + j + 1)·nnz(x) − inc, and ζ = (i × β + j + 1)·nnz(x) − 1.
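For reference, the runtime model of Equations (12) and (13) can be evaluated numerically with a small C routine such as the one below; the parameter names (B for the DMA bandwidth, Cs, nnz_tile for nnz(tile′), ylen for the CPE-segy′ length) and the interface are assumptions based on the symbols used above, under the reconstruction of the two equations given here.

```c
/* Evaluate the runtime model of Equations (12)-(13) for the CPE with global
 * index t = i*beta + j. nnz_cv[] holds nnz(CV'_k) for all column vectors,
 * indexed globally as in the where-clause above. */
double model_T(const int *nnz_cv, int nnzx, int inc, int t,
               double B, double Cs, long nnz_tile, long ylen)
{
    int xi_ = t * nnzx, phi = xi_ + inc - 1;
    int lam = (t + 1) * nnzx - inc, zeta = (t + 1) * nnzx - 1;
    long s_all = 0, s_tail = 0, s_first = 0, s_last = 0;
    for (int k = xi_;     k <= zeta; ++k) s_all   += nnz_cv[k];
    for (int k = phi + 1; k <= zeta; ++k) s_tail  += nnz_cv[k];
    for (int k = xi_;     k <= phi;  ++k) s_first += nnz_cv[k];
    for (int k = lam;     k <= zeta; ++k) s_last  += nnz_cv[k];

    double L_minus_Lf = 12.0 * (nnzx - inc + 1 + s_tail) / B;       /* L - L_f */
    double C_minus_Cp = Cs * (s_all - s_last) / (4.0 * nnz_tile);   /* C - C_p */
    double Lf = 12.0 * (inc + s_first) / B;
    double Cp = Cs * s_last / (4.0 * nnz_tile);
    double R  = 8.0 * ylen / B;
    double overlap = (L_minus_Lf > C_minus_Cp) ? L_minus_Lf : C_minus_Cp;
    return overlap + Lf + Cp + R;                                   /* Eq. (6) */
}
```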
9 EVALUATION
9.1 Evaluation Setup
fgSpMSpV is implemented in the C language. Most of the experiments are conducted on the Sun-
way TaihuLight supercomputer. The CPE-layer parallel efficiency of fgSpMSpV is exploited using
Athread and tested on one CG, and the CG-layer parallel efficiency is exploited using MPI and
Table 2. The Tested Sparse Matrices
tested on multiple CGs. The Sunway system provides a C compiler that is customized for compiling and linking programs on the MPE and CPEs of the heterogeneous SW26010 processor. The compile command for MPE programs is mpicc -host, for CPE programs it is mpicc -slave, and the hybrid
linking command is mpicc -hybrid. The compiler switches used include −msimd (turning on the
SIMD function). One CG has the maximum calculation speed of 742.5 GFLOPS and peak memory
bandwidth of 34 GB/s. Each MPE and each CPE of a CG runs at 1.45 GHz. Additionally, we
also implement fgSpMSpV on the GPU and conduct experiments on a machine that is equipped
with two Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (14 cores each) with 256GB RAM and one
NVIDIA Tesla P100 GPU with 16 GB of HBM2 stacked memory. Except for the comparison experiments with the state-of-the-art kernels on the GPU, where the floating-point numbers are single-precision, the floating-point numbers in all the experiments are double-precision.
Table 2 presents the 20 sparse matrices that are widely used in various related work [6, 26, 34].
All the sparse matrices except soc-orkut are selected from the SuiteSparse Matrix Collection [20].
soc-orkut is selected from the Network Data Repository [32]. For experiments on the Sunway Tai-
huLight, due to the limitations of main memory (8GB) on each CG as well as LDM (64KB) on each
CPE, the first ten smaller-dimension sparse matrices in Table 2 are tested on up to one CG, and
the last 10 larger-dimension sparse matrices are tested on up to 16 CGs.
According to the examples of BFS and PageRank applications on the wikipedia-2007 dataset
provided in Ref. [26], the density of x in each SpMSpV operation varies from 2.80 × 10−7 to 0.67.
Therefore, in our experiments, we set the sparsity of the input vector x to five classes, i.e., nnz(x ) =
0.01%N , nnz(x ) = 0.1%N , nnz(x ) = 1%N , nnz(x ) = 20%N , and nnz(x ) = N .
and the DMA bandwidth is hard to improve. Especially for nnz(x) = 0.01%N, the number of re-collected necessary columns is very small (ranging from 2 to 20), which increases the difficulty of improving the parallel execution efficiency.
Figure 10 presents the runtime of the main SpMSpV computation in fgSpMSpV varying the
number of CPEs in a CG (α = 1 and β = {1, 4, 16, 64}). On average, fgSpMSpV achieves the speedup
of 13.42× (min: 6.86×, max: 21.48×) when nnz(x ) = 20%N , 8.05× (min: 4.97×, max: 10.93×) when
nnz(x ) = 1%N , 1.79× (min: 1.13×, max: 2.79×) when nnz(x ) = 0.1%N , and 1.29× (min: 1.09×,
max: 1.54×) when nnz(x ) = 0.01%N . From Figure 10, fgSpMSpV obtains higher speedups on a CG
with 64 CPEs over that with 1 CPE as x becomes sparser (from nnz(x ) = 20%N to nnz(x ) = 0.01%N ).
The size of array SColp[nnz(x ) × α × β + 1] scales linearly with the values of α and β. According to
Algorithm 3, despite the increase in the number of CPEs (β), the amount of data to be transferred
in arrays SColp and Xv remains unchanged, which impairs the CPE-stage scalability of fgSpMSpV.
fgSpMSpV shows better CPE-stage scalability on scircuit. When the number of CPEs is small, the CPE-segy requires a large amount of memory in each LDM due to the relatively large dimension of scircuit. Moreover, the maximum number of non-zeros in the columns of scircuit is relatively large. Considering the limitation of LDM memory, inc should be set to a small value to ensure that each CVS can fit in the LDM. Nevertheless, with the increase of the number of CPEs, the size of each CVS and the length of CPE-segy on each CPE decrease, so that the size of the LDM allows for a larger inc, which, in turn, makes better use of the DMA bandwidth.
As we can see from Figure 10, there is almost no speedup when the CPE-stage scalability of
fgSpMSpV is tested for sparser inputs (nnz(x ) = 0.01%N and nnz(x ) = 0.1%N ); hence, the CG-
stage scalability on the 10 smaller-dimension datasets is only tested when nnz(x ) = 20%N and
nnz(x ) = 1%N . Figure 11 presents the CG-stage scalability on the 10 smaller-dimension sparse
matrices, where up to 16 CGs are used (α = {1, 2, 4, 8, 16} and β = 64). On average, fgSpMSpV
achieves the speedup of 1.83× (min: 1.38×, max: 2.52×) when nnz(x ) = 20%N and 1.64× (min:
1.45×, max: 1.92×) when nnz(x ) = 1%N . The size of data in arrays SColp and Xv to be transferred
by each CPE remains unchanged as the value of α increases, which has a more negative impact on
the scalability of CG-stage parallelism than that of CPE-stage parallelism. This is because the num-
ber of non-zeros in each tile and the length of each CPE-segy keep decreasing with the increasing
of α, correspondingly reducing the communication overhead of transferring arrays SRows, SVals,
and Yv and computation overhead on each CPE. Moreover, for smaller datasets, as α increases, the
proportion of these overheads in the overall parallel SpMSpV overhead largely decreases, while the
communication overhead of transferring arrays SColp and Xv remains unchanged, which impairs
the CG-stage scalability of fgSpMSpV.
Figure 12 shows the CG-stage scalability on the 10 larger-dimension sparse matrices, where the
number of CGs is scaled from 16 to 128 (α = {16, 32, 64, 128} and β = 64). Some experimental re-
sults on the three largest-dimension datasets (indochina-2004, wb-edu, and road_usa) are missing in
Figure 12. The runtime of indochina-2004 does not appear in the figure when nnz(x ) = 20%. When
nnz(x ) = 1%, the corresponding figure only shows the runtime of indochina-2004 on 128 CGs.
In addition, Figure 12 does not show the runtime of wb-edu on 16 CGs and that of road_usa on
16, 32, and 64 CGs. The reason is that the LDM memory limits the execution of fgSpMSpV for
large-dimension sparse matrices when x has a relatively large density. Additionally, according to the proposed adaptive parallel execution, the more CGs are used, the finer the granularity of the CVSs that can fit in the LDM. In addition, we observe that fgSpMSpV exhibits poor CG-stage scalability. Due to the small amount of computation in SpMSpV, the communication overhead is the main performance bottleneck of parallel SpMSpV. The communication overhead on each CPE corresponds to the cost of loading nnz(x) + 1 column pointers from array SColp, nnz(A′)/(α × β) row indices with additional padded elements from array SRows, and nnz(A′)/(α × β) non-zeros with additional padded elements from array SVals. For the tested datasets, which have high sparsity, the overhead of transferring SColp has a greater share of the communication overhead of fgSpMSpV. This is the reason why fgSpMSpV shows unsatisfactory CG-stage scalability on the tested datasets.
9.2.2 Effects of the Key Techniques. Figure 13 presents the runtime of main SpMSpV computa-
tion in fgSpMSpV on a CG respectively using the proposed re-collection and optimization tech-
niques. We first test the SpMSpV runtime without any optimization techniques, as shown as bars
labeled “SpMSpV without Re-collection and Optimizations”. Then we only re-collect x to test the
SpMSpV runtime, as shown as bars labeled “SpMSpV with Re-collecting x”. By comparing “SpMSpV
with Re-collecting x” with “SpMSpV without Re-collection and Optimizations” bars, we observed
that, on average, the re-collection of x contributes the performance improvement of 1.46% when
the density of x is 20%, 14.44% when the density is 1%, 34.97% when the density is 0.1%, and 41.20%
when the density is 0.01%. The useless elements in x are removed due to the re-collection of x;
hence, the overhead of data fetching from host memory to LDM decreases. Moreover, the sparser
x is, the more useless elements in x will be removed. This is the reason why the performance
improves significantly and continues to obtain gains from lower density of x.
Subsequently, we test the SpMSpV runtime of fgSpMSpV with re-collecting x, A, and y, as shown
as “SpMSpV with Re-collecting x, A, and y” bars in Figure 13. By comparing “SpMSpV with Re-
collecting x, A, and y” with “SpMSpV with Re-collecting x” bars, on average, the re-collection of
A and y contributes the performance improvement of 75.98% when nnz(x ) = 20%, 50.37% when
nnz(x ) = 1%, 45.53% when nnz(x ) = 0.1%, and 32.53% when nnz(x ) = 0.01%. If re-collection is not
used or only x is re-collected, there exists irregular memory accesses to the necessary columns
of A, and each CPE would only load non-zeros of one column of A each time, which impairs the
bandwidth utilization in fgSpMSpV. Re-collecting A and y addresses these problems.
Finally, we test the SpMSpV runtime using re-collection and the three optimization strategies
introduced in Section 7, as shown as “SpMSpV with Re-collection and Optimizations” bars in
Figure 13, to present the effects of the three optimizations by comparing with “SpMSpV with Re-
collecting x, A, and y” bars. On average, the performance improves 12.24% when nnz(x ) = 20%,
13.41% when nnz(x ) = 1%, 14.78% when nnz(x ) = 0.1%, and 14.86% when nnz(x ) = 0.01%.
As shown in Figure 13, no significant difference is found between the effects of optimizations
for different sparsity of x. Therefore, Figure 14 further presents the performance contributions
on a CG from different optimization techniques, i.e., the CV padding, DMA bandwidth optimiza-
tion, and the optimized prefetching, when nnz(x ) = 20%N . Figure 14 first presents the execution
time of fgSpMSpV that does not use any optimization, labeled “Parallel Re-collected SpMSpV”. Sec-
ond, fgSpMSpV then uses the CV padding technique, and its performance is shown by the bar label
“Padding”. We can observe that the performance contribution of the CV padding technique is quite
limited (1.44% on average) by comparing label “Parallel Re-collected SpMSpV” and label “Padding”.
The CV padding technique enables the utilization of SIMD, which optimizes the computation part
of the parallel SpMSpV. However, the CV padding increases the amount of communication data
at the same time, which has negative impact on its optimization effect. Third, fgSpMSpV with CV
padding further uses the DMA bandwidth optimization, and the label “DMA” presents the perfor-
mance of fgSpMSpV that uses the CV padding technique and DMA bandwidth optimization. The
performance contribution of the DMA bandwidth optimization can be observed by comparing la-
bel “Padding” and label “DMA”, which is 9.58% on average. Finally, we further use the optimized
prefetching technique for fgSpMSpV, and label “Buffering” presents the performance of fgSpMSpV
Fig. 14. Effects of the three optimization techniques on a CG, where nnz(x ) = 20%N .
Fig. 15. The runtime comparison of serial re-collection, OpenMP-based re-collection, and the main parallel
SpMSpV computation in fgSpMSpV on a CG.
that uses all the proposed optimizations. The performance contribution of the optimized prefetch-
ing technique is 1.52% on average. As shown in Figure 14, the DMA bandwidth optimization plays
a more important role in performance optimizations of fgSpMSpV.
Either the serial re-collection or OpenMP-based re-collection is executed in fgSpMSpV. To an-
alyze the speedups of OpenMP-based re-collection over serial re-collection and the impact of the
re-collection overhead on the overall fgSpMSpV runtime, Figure 15 shows the overheads of serial
re-collection, OpenMP-based re-collection, and the main parallel SpMSpV computation, respec-
tively. The runtime overheads of serial re-collection and OpenMP-based re-collection are mea-
sured on the MPE of a CG, and the parallel SpMSpV is measured on the CPE cluster. As shown in
Figure 15, the average speedup of OpenMP-based re-collection is 3.07× (min: 2.20×, max: 4.47×)
when nnz(x ) = 20%, 2.40× (min: 2.04×, max: 2.66×) when nnz(x ) = 1%, 1.79× (min: 1.15×, max:
2.18×) when nnz(x ) = 0.1%, and 0.45× (min: 0.15×, max: 0.70×) when nnz(x ) = 0.01% over the
serial re-collection. The OpenMP-based re-collection on a CG achieves higher speedups while still
being a large share of the overall fgSpMSpV overhead when x has higher density. However, when
nnz(x ) = 0.01%, even though serial re-collection runs faster than OpenMP-based re-collection (be-
cause the very few necessary columns that need to be re-collected in the tested datasets limit the parallelism of the re-collection method), the overhead of serial or OpenMP-based re-
collection is a smaller share of the overall overhead of fgSpMSpV. Therefore, the overhead of pre-
processing on a single CG is expected to be a smaller share of the overall overhead of fgSpMSpV
when the input vector has higher sparsity. Despite these overheads, fgSpMSpV can still bring
substantial performance improvements in real-world applications, as shown in the following ex-
perimental results.
We implement fgSpMSpV on a CPU+GPU machine to test its performance on all the 20 sparse
matrices, where the MPI+OpenMP+CUDA parallelization model is used. Figure 16 shows the
runtime comparison of serial re-collection, OpenMP-based re-collection, and the main parallel
SpMSpV computation on an NVIDIA Tesla P100 GPU. As can be seen from Figure 16, the OpenMP-
based re-collection achieves higher speedups over serial re-collection with denser density of x and
larger dimension of A. However, different from on the Sunway system, the main parallel SpMSpV
computation dominates the runtime of fgSpMSpV on the CPU+GPU machine. On average, the
OpenMP-based re-collection overhead is 10.34% (min: 4.02%, max: 20.93%) of the parallel SpMSpV
computation overhead when nnz(x ) = 20%, 13.38% (min: 3.33%, max: 29.40%) of the parallel SpM-
SpV computation overhead when nnz(x ) = 1%, 19.72% (min: 1.86%, max: 42.80%) of the parallel
SpMSpV computation overhead when nnz(x ) = 0.1%, and 21.27% (min: 2.63%, max: 66.58%) of the
parallel SpMSpV computation overhead when nnz(x ) = 0.01%.
Table 3 shows the storage comparison of CSC, CSCV, and padded CSCV formats, where the size
of an integer is 4, and the size of a double-precision floating-point number is 8. The columns whose
indices i satisfy (i + 1) mod 5 = 0 are selected for storage comparison when nnz(x ) = 20%, whose
indices i satisfy (i + 1) mod 100 = 0 are selected for nnz(x ) = 1%, whose indices i satisfy (i + 1)
mod 1000 = 0 are selected for nnz(x ) = 0.1%, and whose indices i satisfy (i + 1) mod 10000 = 0
are selected for nnz(x ) = 0.01%, where i ∈ {0, 1, 2, . . . , N − 1}. Compared to the CSC format, CSCV
format saves an average of 84.28% in memory footprint (40.38% when nnz(x ) = 20%, 97.06%,
when nnz(x ) = 1%, 99.71% when nnz(x ) = 0.1%, and 99.97% when nnz(x ) = 0.01%). Using the
CV padding technique, the padded CSCV format increases the memory footprint by 60.08% on
average compared to the CSCV format (62.09% when nnz(x ) = 20%, 60.50% when nnz(x ) = 1%,
60.54% when nnz(x ) = 0.1%, and 57.19% when nnz(x ) = 0.01%). In addition, as can be seen from
Table 3, the CV padding technique pads fewer numbers for matrices with lower density of x.
9.2.3 Relative Performance of fgSpMSpV. Figure 17 compares the performance of fgSpMSpV on
an NVIDIA Tesla P100 GPU with two efficient SpMSpV kernels used in the adaptive SpMV/SpMSpV
framework [26] (the sorted merge-based SpMSpV kernel and the merge-based SpMSpV kernel
without sorting). On average, compared to the sorted merge-based SpMSpV and the merge-based
SpMSpV, fgSpMSpV achieves the speedups of 0.22× (min: 0.03×, max: 0.63×) and 0.18× (min: 0.03×,
max: 0.60×) when nnz(x ) = 20%, 2.83× (min: 0.50×, max: 8.04×) and 2.68× (min: 0.50×, max: 7.99×)
when nnz(x ) = 1%, 14.91× (min: 2.80×, max: 53.275×) and 13.85× (min: 2.83×, max: 53.16×) when
nnz(x ) = 0.1%, and 32.61× (min: 5.43×, max: 134.38×) and 32.14× (min: 5.43×, max: 134.18×) when
nnz(x ) = 0.01%. We have two important observations from Figure 17. First, fgSpMSpV outperforms
the two benchmark kernels, except when nnz(x ) = 20%. This is because the array SColp[nnz(x ) ×
α ×β +1] in CSCV format has high memory footprint when the density of x is high, which requires
more overhead of memory accessing. Second, the runtime of fgSpMSpV drops much faster than
Fig. 16. The runtime comparison of serial re-collection, OpenMP-based re-collection, and the main parallel
SpMSpV computation in fgSpMSpV on an NVIDIA Tesla P100 GPU.
the two benchmark kernels with the increasing sparsity of x. Due to the re-collection method,
as x becomes sparser, the memory footprint of CSCV format and the communication overhead
dramatically decrease, and the performance of fgSpMSpV is significantly improved.
In many real-world applications, the sparsity of the input vector varies dramatically during
the program execution. To investigate the usefulness of fgSpMSpV in this case, this experiment is
carried out by applying it to the BFS real-world application on an NVIDIA Tesla P100. The adaptive
SpMV/SpMSpV framework [26] can automatically optimize the SpMV/SpMSpV kernel on GPUs
Table 3. Storage Comparison (MB) of CSC, CSCV, and Padded CSCV Formats
Dataset | CSC | CSCV, Padded CSCV (nnz(x) = 20%) | CSCV, Padded CSCV (nnz(x) = 1%) | CSCV, Padded CSCV (nnz(x) = 0.1%) | CSCV, Padded CSCV (nnz(x) = 0.01%)
raefsky3 17.128 4.443 8.986 0.222 0.449 0.022 0.045 0.002 0.004
pdb1HYS 25.208 6.787 13.464 0.342 0.680 0.035 0.070 0.003 0.005
rma10 27.347 7.718 14.960 0.393 0.766 0.038 0.073 0.004 0.007
cant 23.526 7.707 13.917 0.385 0.695 0.038 0.069 0.004 0.007
2cubes_sphere 10.394 6.958 9.628 0.347 0.478 0.034 0.047 0.003 0.005
cop20k_A 16.050 9.045 13.216 0.448 0.651 0.044 0.063 0.004 0.006
cage12 23.757 11.013 17.219 0.550 0.860 0.055 0.085 0.006 0.009
144 23.812 12.800 20.449 0.434 0.541 0.043 0.054 0.004 0.005
scircuit 11.626 10.542 13.466 0.527 0.674 0.052 0.065 0.005 0.006
mac_econ_fwd500 15.361 12.998 16.884 0.650 0.844 0.064 0.082 0.006 0.008
dielFilterV3real 521.530 157.325 295.294 7.853 14.735 0.783 1.468 0.076 0.143
hollywood-2009 662.562 187.216 362.624 9.366 18.143 0.911 1.755 0.109 0.218
kron_g500-logn21 1049.893 242.106 428.380 11.840 20.799 1.131 1.956 0.116 0.203
soc-orkut 1228.503 389.423 713.527 19.391 35.489 1.941 3.553 0.204 0.377
wikipedia-20070206 528.939 277.849 416.095 14.357 21.888 1.528 2.404 0.134 0.197
soc-LiveJournal1 808.063 395.505 607.249 19.822 30.472 1.953 2.979 0.198 0.304
ljournal-2008 924.807 442.368 683.021 22.191 34.320 2.198 3.384 0.178 0.241
indochina-2004 2249.690 807.884 1402.324 41.579 72.882 5.146 9.593 0.360 0.600
wb-edu 691.663 611.464 785.752 30.643 39.452 3.003 3.802 0.288 0.351
road_usa 421.563 1235.340 1323.389 61.774 66.186 6.177 6.618 0.617 0.659
Fig. 17. Performance comparison of fgSpMSpV and SpMSpV kernels used in Ref. [26] on an NVIDIA Tesla
P100 GPU.
Fig. 18. Comparison of speedups of fgSpMSpV and the adaptive SpMV/SpMSpV framework [26] over cuS-
PARSE in BFS application on an NVIDIA Tesla P100.
by adapting to the varying sparsity of the input vectors at runtime. Therefore, Figure 18 compares the speedups of fgSpMSpV and the adaptive SpMV/SpMSpV framework over cuSPARSE in the BFS application on the 10 sparse matrices tested in both this article and Ref. [26]. Compared with cuSPARSE, the adaptive SpMV/SpMSpV framework achieves an average speedup of 4.51×, while fgSpMSpV achieves an average speedup of 19.94× (min: 3.48×, max: 56.38×). Compared with the adaptive SpMV/SpMSpV framework, fgSpMSpV achieves an average speedup of 6.69× (min: 0.43×, max: 21.68×). Thanks to the MPI+OpenMP+X model, fgSpMSpV still brings substantial performance improvements in real-world applications despite the pre-processing overhead. Furthermore, owing to the OpenMP-based re-collection method, fgSpMSpV remains flexible and efficient across the various sparsity patterns of the input vector that arise in real-world applications.
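To make the variation in input-vector sparsity concrete, the following Python sketch expresses level-synchronous BFS as repeated SpMSpV in the linear-algebraic style of Refs. [14, 34, 37]. It relies on SciPy's generic sparse kernels rather than fgSpMSpV and only illustrates the access pattern; the graph orientation (a push-style step computed as y = A^T x) is an assumption of the sketch, not a description of our implementation.

import numpy as np
import scipy.sparse as sp

def bfs_spmspv(A, source):
    """Level-synchronous BFS where every step is one SpMSpV: y = A^T * x."""
    n = A.shape[0]
    levels = np.full(n, -1, dtype=np.int64)
    levels[source] = 0
    # The frontier is a sparse n x 1 vector whose nnz changes every level.
    frontier = sp.csc_matrix(([1.0], ([source], [0])), shape=(n, 1))
    level = 0
    while frontier.nnz > 0:
        level += 1
        # One SpMSpV: only the columns of A^T selected by the frontier contribute.
        reached = (A.T @ frontier).tocoo()
        # Mask out already-visited vertices (the "mask" in GraphBLAS terms).
        fresh = np.array([v for v in np.unique(reached.row) if levels[v] < 0],
                         dtype=np.int64)
        levels[fresh] = level
        frontier = sp.csc_matrix(
            (np.ones(fresh.size), (fresh, np.zeros(fresh.size, dtype=np.int64))),
            shape=(n, 1))
    return levels

The frontier starts with a single vertex, typically grows to a sizable fraction of the graph within a few levels, and then shrinks again, which is exactly the wide range of nnz(x) that fgSpMSpV and the adaptive framework [26] must handle at runtime.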
10 CONCLUSION
SpMSpV is a fundamental operation in many scientific computing and industrial engineering applications. This article proposes fgSpMSpV on the Sunway TaihuLight supercomputer to alleviate the challenges of optimizing SpMSpV in large-scale real-world applications, which stem from the inherent irregularity and poor data locality of SpMSpV on HPC architectures. Using an MPI+OpenMP+X parallelization model that exploits the multi-stage and hybrid parallelism of heterogeneous HPC architectures, both the pre-/post-processing and the main SpMSpV computation in fgSpMSpV are accelerated. An adaptive parallel execution is devised for fgSpMSpV to reduce the pre-processing overhead and adapt to the Sunway architecture while taming the redundant and random memory accesses in SpMSpV, using a set of techniques that includes the fine-grained partitioner, the re-collection method, and the CSCV matrix format. fgSpMSpV further exploits the computing resources of the platform through several optimization techniques. On the Sunway TaihuLight, fgSpMSpV gains a noticeable performance improvement from the key optimization techniques across various sparsities of the input. Additionally, fgSpMSpV on an NVIDIA Tesla P100 GPU obtains a speedup of up to 134.38× over the state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV on a P100 GPU achieves a speedup of up to 21.68× over the state-of-the-art. The results on both the Sunway and GPU architectures show that the main SpMSpV computation of fgSpMSpV scales better for higher densities of x (especially when nnz(x) = 20% in our experiments), while the overall performance of fgSpMSpV outperforms the state-of-the-art for lower densities of x (especially when nnz(x) = 0.1% and nnz(x) = 0.01% in our experiments).
In our future work, we plan to build on fgSpMSpV to further accelerate real-world applications, such as BFS, on heterogeneous HPC platforms. For example, we plan to design an adaptive scheme
that switches among SpMSpV optimizations, such as the serial and parallel re-collection methods and the different SpMSpV kernels, according to the input data structures and the hardware architecture.
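A purely hypothetical sketch of what such a scheme could look like is given below; the thresholds, branch names, and selection rule are illustrative assumptions of ours and not part of fgSpMSpV or a committed design.

# Hypothetical dispatch logic; the thresholds and kernel names are invented
# for illustration and do not correspond to fgSpMSpV internals.
def select_spmspv_path(nnz_x, n_cols, n_threads):
    density = nnz_x / n_cols
    if nnz_x < 4 * n_threads:
        # Frontier too small to amortize parallel re-collection overhead.
        return ("serial_recollect", "sparse_kernel")
    if density < 0.05:
        # Moderately sparse input vector: parallel re-collection pays off.
        return ("parallel_recollect", "sparse_kernel")
    # Nearly dense input vector: skip re-collection, use an SpMV-like path.
    return ("no_recollect", "dense_kernel")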
ACKNOWLEDGMENTS
The authors would like to thank the three anonymous reviewers and editors for their valuable and
helpful comments on improving the manuscript.
REFERENCES
[1] Khalid Ahmad, Hari Sundar, and Mary W. Hall. 2020. Data-driven mixed precision sparse matrix vector multiplication
for GPUs. ACM Trans. Archit. Code Optim. 16, 4 (2020), 51:1–51:24.
[2] Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting locality in sparse matrix-matrix multiplication on many-core
architectures. IEEE Trans. Parallel Distrib. Syst. 28, 8 (2017), 2258–2271.
[3] Kadir Akbudak, R. Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-
matrix multiplication. ACM Trans. Parallel Computing 4, 3 (2018), 13:1–13:34.
[4] Michael J. Anderson, Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Theodore L. Willke, and
Pradeep Dubey. 2016. GraphPad: Optimized graph primitives for parallel and distributed platforms. In Proceedings of
the International Parallel and Distributed Processing Symposium. 313–322.
[5] Ariful Azad and Aydin Buluç. 2016. Distributed-Memory algorithms for maximum cardinality matching in bipartite
graphs. In Proceedings of the International Parallel and Distributed Processing Symposium. 32–42.
[6] Ariful Azad and Aydin Buluç. 2017. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. In
Proceedings of the International Parallel and Distributed Processing Symposium. 688–697.
[7] Ariful Azad and Aydin Buluç. 2019. LACC: A linear-algebraic algorithm for finding connected components in dis-
tributed memory. In Proceedings of the International Parallel and Distributed Processing Symposium. 2–12.
[8] Ariful Azad, Aydin Buluç, Xiaoye S. Li, Xinliang Wang, and Johannes Langguth. 2020. A distributed-memory algorithm
for computing a heavy-weight perfect matching on bipartite graphs. SIAM J. Sci. Comput. 42, 4 (2020), C143–C168.
[9] Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2016. Hypergraph partitioning for sparse matrix-
matrix multiplication. ACM Trans. Parallel Computing 3, 3 (2016), 18:1–18:34.
[10] Akrem Benatia, Weixing Ji, Yizhuo Wang, and Feng Shi. 2018. BestSF: A sparse meta-format for optimizing SpMV on
GPU. ACM Trans. Archit. Code Optim. 15, 3 (2018), 29:1–29:27.
[11] Thierry P. Berger, Julien Francq, Marine Minier, and Gaël Thomas. 2016. Extended generalized feistel networks us-
ing matrix representation to propose a new lightweight block cipher: Lilliput. IEEE Trans. Computers 65, 7 (2016),
2074–2089.
[12] Aysenur Bilgin, Hani Hagras, Joy van Helvert, and Daniyal M. Al-Ghazzawi. 2016. A linear general type-2 fuzzy-logic-
based computing with words approach for realizing an ambient intelligent platform for cooking recipe recommenda-
tion. IEEE Trans. Fuzzy Systems 24, 2 (2016), 306–329.
[13] Aydin Buluç, Erika Duriakova, Armando Fox, John R. Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, and
Samuel Williams. 2013. High-Productivity and high-performance analysis of filtered semantic graphs. In Proceedings
of the International Parallel and Distributed Processing Symposium. 237–248.
[14] Aydin Buluç and Kamesh Madduri. 2011. Parallel breadth-first search on distributed memory systems. In Proceedings
of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 65:1–65:12.
[15] Paolo Campigotto, Christian Rudloff, Maximilian Leodolter, and Dietmar Bauer. 2017. Personalized and situation-
aware multimodal route recommendations: The FAVOUR algorithm. IEEE Trans. Intelligent Transportation Systems 18,
1 (2017), 92–102.
[16] Yuedan Chen, Kenli Li, Wangdong Yang, Guoqing Xiao, Xianghui Xie, and Tao Li. 2019. Performance-Aware model
for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer. IEEE Trans. Parallel Distrib. Syst. 30,
4 (2019), 923–938.
[17] Yuedan Chen, Guoqing Xiao, M. Tamer Özsu, Chubo Liu, Albert Y. Zomaya, and Tao Li. 2020. aeSpTV: An adaptive
and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform. IEEE
Trans. Parallel Distrib. Syst. 31, 10 (2020), 2329–2345.
[18] Yuedan Chen, Guoqing Xiao, Fan Wu, and Zhuo Tang. 2019. Towards large-scale sparse matrix-vector multiplication
on the SW26010 manycore architecture. In Proceedings of the 21st International Conference on High Performance Com-
puting and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on
Data Science and Systems. IEEE, 1469–1476.
[19] Steven Dalton, Luke N. Olson, and Nathan Bell. 2015. Optimizing sparse matrix-matrix multiplication for the GPU.
ACM Trans. Math. Softw. 41, 4 (2015), 25:1–25:20.
[20] Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38,
1 (2011), 1:1–1:25.
[21] Fred G. Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM
Trans. Math. Softw. 4, 3 (1978), 250–269.
[22] Lixin He, Hong An, Chao Yang, Fei Wang, Junshi Chen, Chao Wang, Weihao Liang, Shao-Jun Dong, Qiao Sun, Wenting
Han, Wenyuan Liu, Yongjian Han, and Wenjun Yao. 2018. PEPS++: Towards extreme-scale simulations of strongly
correlated quantum many-particle models on Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 29, 12 (2018), 2838–
2848.
[23] Yong-Yeon Jo, Sang-Wook Kim, and Duck-Ho Bae. 2015. Efficient sparse matrix multiplication on GPU for large social
network analysis. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Manage-
ment. 1261–1270.
[24] Byeongho Kim, Jongwook Chung, Eojin Lee, Wonkyung Jung, Sunjung Lee, Jaewan Choi, Jaehyun Park, Minbok Wi,
Sukhan Lee, and Jung Ho Ahn. 2020. MViD: Sparse matrix-vector multiplication in mobile DRAM for accelerating
recurrent neural networks. IEEE Trans. Computers 69, 7 (2020), 955–967.
[25] Daniel Langr and Pavel Tvrdík. 2016. Evaluation criteria for sparse matrix storage formats. IEEE Trans. Parallel Distrib.
Syst. 27, 2 (2016), 428–440.
[26] Min Li, Yulong Ao, and Chao Yang. 2021. Adaptive SpMV/SpMSpV on GPUs for input vectors of varied sparsity. IEEE
Trans. Parallel Distrib. Syst. 32, 7 (2021), 1842–1853.
[27] Siqiang Luo. 2019. Distributed PageRank Computation: An improved theoretical study. In Proceedings of the AAAI
Conference on Artificial Intelligence. 4496–4503.
[28] Xin Luo, Mengchu Zhou, Shuai Li, Lun Hu, and Mingsheng Shang. 2020. Non-negativity constrained missing data
estimation for high-dimensional and sparse matrices from industrial applications. IEEE Trans. Cybernetics 50, 5
(2020), 1844–1855.
[29] Xin Luo, Mengchu Zhou, Shuai Li, Zhu-Hong You, Yunni Xia, and Qingsheng Zhu. 2016. A nonnegative latent factor
model for large-scale sparse matrices in recommender systems via alternating direction method. IEEE Trans. Neural
Netw. Learning Syst. 27, 3 (2016), 579–592.
[30] Asit K. Mishra, Eriko Nurvitadhi, Ganesh Venkatesh, Jonathan Pearce, and Debbie Marr. 2017. Fine-grained acceler-
ators for sparse machine learning workloads. In Proceedings of the 22nd Asia and South Pacific Design Automation
Conference. 635–640.
[31] Muhammet Mustafa Ozdal. 2019. Improving efficiency of parallel vertex-centric algorithms for irregular graphs. IEEE
Trans. Parallel Distrib. Syst. 30, 10 (2019), 2265–2282.
[32] Ryan A. Rossi and Nesreen K. Ahmed. 2015. The network data repository with interactive graph analytics and visual-
ization. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 4292–4293.
[33] Liang Sun, Shuiwang Ji, and Jieping Ye. 2009. A least squares formulation for a class of generalized eigenvalue problems
in machine learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 977–984.
[34] Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Subramanya Dulloor, Michael J. Anderson,
Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High performance graph analytics
made productive. PVLDB 8, 11 (2015), 1214–1225.
[35] Guoqing Xiao, Yuedan Chen, Chubo Liu, and Xu Zhou. 2020. ahSpMV: An auto-tuning hybrid computing scheme for
SpMV on the Sunway architecture. IEEE Internet of Things Journal 7, 3 (2020), 1736–1744.
[36] Guoqing Xiao, Kenli Li, Yuedan Chen, Wangquan He, Albert Zomaya, and Tao Li. 2021. CASpMV: A customized and
accelerative SpMV framework for the Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 32, 1 (2021), 131–146.
[37] Carl Yang, Yangzihao Wang, and John D. Owens. 2015. Fast sparse matrix and sparse vector multiplication algorithm
on the GPU. In Proceedings of the International Parallel and Distributed Processing Symposium. 841–847.
[38] Leonid Yavits and Ran Ginosar. 2018. Accelerator for sparse machine learning. IEEE Comput. Archit. Lett. 17, 1 (2018),
21–24.
[39] Yongzhe Zhang, Ariful Azad, and Zhenjiang Hu. 2020. FastSV: A distributed-memory connected component algorithm
with fast convergence. In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP).
46–57.
[40] Yunquan Zhang, Shigang Li, Shengen Yan, and Huiyang Zhou. 2016. A cross-platform SpMV framework on many-core
architectures. ACM Trans. Archit. Code Optim. 13, 4 (2016), 33:1–33:25.
[41] Weijie Zhou, Yue Zhao, Xipeng Shen, and Wang Chen. 2020. Enabling runtime SpMV format selection through an
overhead conscious method. IEEE Trans. Parallel Distrib. Syst. 31, 1 (2020), 80–93.
[42] Alwin Zulehner and Robert Wille. 2019. Matrix-Vector vs. matrix-matrix multiplication: Potential in DD-based simu-
lation of quantum computations. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition.
90–95.