
fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms
YUEDAN CHEN, GUOQING XIAO, and KENLI LI, College of Computer Science and Electronic
Engineering, Hunan University, and National Supercomputing Center in Changsha
FRANCESCO PICCIALLI, Department of Electrical Engineering and Information Technologies,
University of Naples Federico II
ALBERT Y. ZOMAYA, School of Information Technologies, University of Sydney

Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations
in many high-performance scientific and engineering applications. The inherent irregularity and poor data
locality lead to two main challenges to scaling SpMSpV over high-performance computing (HPC) systems:
(i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the ir-
regular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained
parallel SpMSpV (fgSpMSpV) framework on the Sunway TaihuLight supercomputer to alleviate these challenges
for large-scale real-world applications. First, fgSpMSpV adopts an MPI+OpenMP+X parallelization model
to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both
pre-/post-processing and the main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel exe-
cution to reduce the pre-processing and adapt to the parallelism and memory hierarchy of the Sunway system,
while still taming redundant and random memory accesses in SpMSpV, including a set of techniques such as the
fine-grained partitioner, re-collection method, and Compressed Sparse Column Vector (CSCV) matrix format.
Third, fgSpMSpV uses several optimization techniques to further utilize the computing resources. fgSpMSpV
on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization tech-
niques with various sparsity of the input. Additionally, fgSpMSpV is implemented on an NVIDIA Tesla P100
GPU and applied to the breadth-first search (BFS) application. fgSpMSpV on a P100 GPU obtains speedups of
up to 134.38× over state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves
speedups of up to 21.68× over state-of-the-art methods.

CCS Concepts: • Computing methodologies;

Additional Key Words and Phrases: Heterogeneous, HPC, manycore, optimization, parallelism, SpMSpV

The research was partially funded by the National Key R&D Programs of China (Grant No. 2020YFB2104000), the Programs
of National Natural Science Foundation of China (Grant Nos. 62172157, 61860206011, 61806077), the Programs of Hunan
Province, China (Grant Nos. 2020RC2032, 2021RC3062, 2021JJ40109, 2021JJ40121), the Programs of China Postdoctoral
Council (Grant Nos. PC2020025, 2021M701153), the Program of Zhejiang Lab (Grant No. 2022RC0AB03), and the General
Program of Fundamental Research of Shen Zhen (Grant No. JCYJ20210324135409026).
Authors’ addresses: Y. Chen, G. Xiao (corresponding author), and K. Li, College of Computer Science and Electronic Engi-
neering, Hunan University, and National Supercomputing Center in Changsha, Changsha, Hunan 410082, China; emails:
{chenyuedan, xiaoguoqing, lkl}@hnu.edu.cn; F. Piccialli, Department of Electrical Engineering and Information Technolo-
gies, University of Naples Federico II, Naples 80100, Italy; email: [email protected]; A. Y. Zomaya, School of
Information Technologies, University of Sydney, Sydney, NSW 2006, Australia; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
2329-4949/2022/04-ART8 $15.00
https://doi.org/10.1145/3512770


ACM Reference format:


Yuedan Chen, Guoqing Xiao, Kenli Li, Francesco Piccialli, and Albert Y. Zomaya. 2022. fgSpMSpV: A
Fine-grained Parallel SpMSpV Framework on HPC Platforms. ACM Trans. Parallel Comput. 9, 2, Article 8
(April 2022), 29 pages.
https://doi.org/10.1145/3512770

1 INTRODUCTION
Sparse matrices are a common kind of data source in a wide variety of high-performance scien-
tific and industrial engineering applications, such as recommender systems [29], social network
services [23], numerical simulation [42], business intelligence [24], cryptography [11], and algo-
rithms for least squares and eigenvalue problems [33].
Large-scale sparse matrices are frequently used in various applications for several reasons,
as follows [28]. On one hand, with the exploding numbers of commodities, netizens, and network
nodes, there is an increasing need to depict relationships among the involved entities with matri-
ces. Typically, the matrices obtained from the aforementioned applications contain very few non-zero
elements and, hence, can intuitively be modeled as sparse matrices. On the other hand, it is difficult
to completely observe relationships among a large number of entities. For instance, for the millions of users and
items in recommender systems, each entry in a rating matrix models the usage history of items
by users. Correspondingly, the matrix is usually very sparse with many missing values, since
each user can only touch a finite set of items.
SpMSpV is one of the most fundamental and important operations in all kinds of high-
performance scientific computing and engineering applications, such as machine learning [30, 38],
graph computations [31, 34], data analysis [12, 15], and so on. The mathematical formulation of
SpMSpV is y = A × x, where the input matrix A, the input vector x, and the output vector y are
sparse. As an example, in many graph computation algorithms that are implemented using matrix
primitives, such as bipartite graph matching [5, 8], breadth-first search (BFS) [14], connected
components [7, 39], and maximal independent sets [13], the computational essence is to convert
a set of active vertices (usually called “current frontier”) into a new set of active vertices (usually
called “next frontier”). The computational pattern of “frontier expansion” can be neatly captured
by SpMSpV, where input sparse matrix A represents the graph, input sparse vector x represents
the current frontier, and output sparse vector y represents the next frontier.
Moreover, SpMSpV can even be applied to some typical sparse matrix-dense vector multipli-
cation (SpMV) algorithms, such as PageRank [27], to achieve incremental convergence [4]. Different
from SpMV [35, 36], however, the sparsity of x in SpMSpV causes many floating-point
operations (multiplications and additions) on non-zeros of A to be omitted, which results in a large
amount of redundant data and irregular data accesses in SpMSpV. Moreover, SpMSpV can be consid-
ered as a specific case of general sparse matrix-sparse matrix multiplication (SpGEMM) [16],
where the second sparse matrix of SpGEMM only has one column. Some efficient SpGEMM algo-
rithms, such as Gustavson's SpGEMM algorithm [21], are suitable for sequential SpMSpV rather
than parallel SpMSpV. The reason is that each SpMSpV operation in SpGEMM involves much less
work, which requires novel approaches to scale SpMSpV to multiple threads.
Since SpMSpV plays a fundamental role in many scientific and real-world applications, there
has been significant research to scale SpMSpV for large-scale applications by utilizing the pow-
erful parallel computing capacity of state-of-the-art multi-core and manycore processors, such as
Central Processing Unit (CPU) [34], Graphics Processing Unit (GPU) [37], and so on. To ef-
fectively scale, it is necessary to both consider the algorithmic characteristics of SpMSpV and the


Fig. 1. The SW26010 processor.

architectural advantages of computing platforms. In this article, we study how to scale SpMSpV
over HPC systems as characterized by the Sunway TaihuLight.
The Sunway TaihuLight with 40960 heterogeneous manycore SW26010 chips [17, 22] had held
its top position as the fastest supercomputer in the world from June 2016 to June 2018. As shown
in Figure 1, the first stage of parallelism of Sunway comes from the four Core Groups (CGs)
in the SW26010. Within a CG, the Management Processing Element (MPE), also called host,
manages tasks on the CG, while the Computing Processing Element (CPE) cluster containing
64 CPEs, also called device, provides the second stage of parallelism. The MPE and CPEs share an
8 GB DDR3 memory. Each CPE adopts a no-cache memory structure: it has no
data cache but only a 64 KB fast scratch-pad memory (SPM), also known as local device memory
(LDM). The heterogeneous and multi-stage parallelism and no-cache memory structure of
the Sunway system present both challenges and opportunities for optimizing SpMSpV.
Challenges. Scaling SpMSpV over HPC systems faces two problems:
(i) A large amount of redundant data. Only non-zeros in x, non-zeros in the corresponding
columns in A, and non-zeros in y are the necessary data in SpMSpV. A large amount of redun-
dant data in A, x, and y causes excessive memory footprint, unnecessary data transmission, and
useless computation, which is unfavorable for utilization of bandwidth and parallel resources.
(ii) The irregular data access pattern. The sparsity of A, x, and y results in unpredictable mem-
ory references and random memory accesses in SpMSpV, so that there could be several problems
in exploiting the computing power of the platform, such as expensive latency of non-coalesced
memory accesses, possibility of parallel write collisions, and load imbalance. In addition, the paral-
lelization design of SpMSpV is required to adapt to the heterogeneous and multi-stage parallelism
and no-cache structure of the Sunway.
Therefore, in this article, the fgSpMSpV framework is devised to address the above-mentioned
problems and optimize SpMSpV on the Sunway architecture. The contributions made in this article
are summarized as follows:
(i) We investigate a hybrid MPI+OpenMP+X -based approach for fgSpMSpV on heteroge-
neous HPC architectures. fgSpMSpV exploits heterogeneous inter-node MPI communication
and OpenMP+X intra-node parallelism to accelerate both pre-/post-processing and main
SpMSpV computation in real-world applications.
(ii) We devise an adaptive parallel execution for fgSpMSpV to reduce the pre-processing of each
SpMSpV in real-world applications, leverage the heterogeneous and multi-stage parallelism
and memory hierarchy of the Sunway TaihuLight, while still taming redundant and random


memory accesses. fgSpMSpV adapts to and exploits the hardware architecture by adopting
the fine-grained partitioner, re-organizes the necessary data in SpMSpV and optimizes mem-
ory access behavior by utilizing the re-collection method, and better preserves the efficiency
of parallel SpMSpV by using the CSCV sparse matrix format.
(iii) We further propose performance optimization techniques for fgSpMSpV to fully utilize the
Single Instruction Multiple Data (SIMD) technique, take advantages of the transmission
bandwidth, and synchronize computation with communication of fgSpMSpV.
(iv) We evaluate fgSpMSpV and its key techniques on the Sunway TaihuLight supercomputer
using real-world datasets. In addition, we also implement fgSpMSpV on an NVIDIA Tesla
P100 and apply it to the BFS application to verify its generality, flexibility, and efficiency.

2 RELATED WORK
Several approaches have been proposed to parallelize the SpMSpV algorithm on various platforms.
GraphMat [34] optimizes parallel SpMSpV for large-scale graph analytics on CPUs, where the distri-
bution structure of non-zeros in the input matrix determines the SpMSpV computation (matrix-
driven). SpMSpV-bucket [6] uses a list of buckets to partition the necessary columns in the input
matrix on the fly and parallelize SpMSpV, where the distribution structure of non-zeros in the input
vector determines the computation (vector-driven). This article proposes a vector-driven SpMSpV
method that re-collects the necessary non-zeros of x, y, and A for a fine-grained and efficient
parallelization of SpMSpV on an HPC architecture.
SpMSpV can be interpreted as a specific case of SpGEMM, where the second sparse matrix of
SpGEMM only has one column. Akbudak et al. [2, 3] utilize the hypergraph model and bipartite
graph model to optimize parallel SpGEMM. Ballard et al. [9] devise a fine-grained hypergraph
model to reduce the sparsity-dependent communication bounds in SpGEMM. Dalton et al. [19]
decompose SpGEMM into three computational stages (expansion, sorting, and compression) to
exploit the intrinsic fine-grained parallelism of throughput-oriented processors, such as GPU.
Additionally, SpMSpV is likely to be considered as a special case of SpMV where the input vec-
tor x is sparse. Li et al. [26] devise the adaptive SpMV/SpMSpV framework on GPUs that uses
a machine-learning based kernel selector to select the appropriate SpMV/SpMSpV kernel based
on the computing pattern, workload distribution, and write-back strategy. Chen et al. [18] design
the large-scale SpMV on the Sunway TaihuLight that includes two parts. The first part performs
the parallel partial Compressed Sparse Row (CSR)-based SpMV, and the second part performs
the parallel accumulation on the results of the first part. Zhang et al. [40] propose the Blocked
Compressed Common Coordinate (BCCOO) format that uses bit flags to alleviate the band-
width problem of SpMV, and further design a partitioning method to divide the matrix into vertical
slices to improve data locality. MpSpMV [1] splits non-zeros in the input matrix into two parts:
a single-precision part and a double-precision part. The non-zeros within the range of (−1, 1) belong
to the single-precision part and are multiplied by the input vector x using single precision.
The other non-zeros belong to the double-precision part and are multiplied by x using double
precision. The final double-precision result is created by combining the calculation results of the
two parts.
The appropriate data structures for the sparse matrix and the sparse vector are decisive for
the performance of sparse matrix operations. Langr and Tvrdík [25] list the state-of-the-art com-
pressed storage formats for sparse matrices. Azad and Buluç [6] use the Compressed Sparse
Column (CSC) format to accelerate parallel SpMSpV. Benatia et al. [10] choose the most suitable
matrix format, from Coordinate (COO), CSR, the Blocked Compressed Sparse Row (BCSR),
ELLPACK (ELL), and Diagonal (DIA) formats, for SpMV by the proposed cost-aware classifica-
tion models on Weighted Support Vector Machines (WSVMs). Zhou et al. [41] select the most


proper matrix format for SpMV by focusing on the runtime prediction and format conversion
overhead.
There has been some work on designing the customized parallelization for sparse matrix opera-
tions, including SpGEMM [16] and SpMV [36], based on the heterogeneous manycore architecture
of Sunway. Especially, the parallel implementation of SpGEMM where the second sparse matrix
is stored in CSC format can be directly used in parallelization design for SpMSpV on the Sunway.
However, there are some performance bottlenecks of the proposed SpGEMM parallelization, i.e.,
redundant memory usage and unnecessary data transfer for the first input matrix, irregular ac-
cesses to columns of the first input matrix, and random accesses to and possible parallel writing
collisions on the results. Moreover, the parallel SpGEMM design on Sunway is not well suited to
the parallelization of SpMSpV. Therefore, this article designs a suitable and efficient parallelization
and optimization for SpMSpV with an adaptive column-wise sparse matrix format on the Sunway,
which solves the performance bottlenecks in [16] as well.

3 BACKGROUND
3.1 Notation of SpMSpV
We define nnz(·) as a function that outputs the number of non-zeros in its input. The input sparse
matrix A of SpMSpV contains M rows, N columns, and nnz(A) non-zeros. The input sparse vector
x has length of N and contains nnz(x ) non-zeros. The output sparse vector y has length of M and
contains nnz(y) non-zeros.
Operations on non-zeros of A may be omitted since the corresponding
elements in x are zero. Therefore, the number of operations in SpMSpV (including additions and
multiplications) must be less than 2×nnz(A). If A is stored by rows, there is a necessary step to find
the non-zeros in each row of A that correspond to the non-zeros in x for calculation. However, if A
is stored by columns, each non-zero of x only accesses the corresponding column in A, as shown
in Figure 2. The non-zero element of x multiplies each non-zero in the corresponding column of
A, and the multiplication results are accumulated to the corresponding elements of y. There is no
need to find the calculational non-zeros in A when A is stored by columns, which simplifies the
calculation of SpMSpV. In addition, only the corresponding nnz(x ) columns of A are accessed by
non-zeros of x, which reduces redundant data accesses in SpMSpV.
Consequently, in this article, we only discuss the column-wise SpMSpV. The CSC format is one
of the most popular column-wise compressed storage formats for sparse matrices. It uses three
arrays to store non-zeros in the sparse matrix A by columns, including an integer array storing the
pointers to each column (p[N +1]), an integer array storing the row id of each non-zero (r[nnz(A)]),
and a floating-point array storing the numerical value of each non-zero (v[nnz(A)]). As described
in Algorithm 1, SpMSpV based on CSC format is executed by columns.

3.2 The Sunway TaihuLight


There are 40960 SW26010 processors in the Sunway TaihuLight. As shown in Figure 1, there are
65 cores in each CG (an MPE core and 64 CPE cores) and 260 cores in each SW26010 processor.
Therefore, over 10 million cores of the Sunway provide computing capability. In each CG, the
MPE, also called host, usually takes care of pre- and post-processing, while the CPE cluster, also
called device, mainly executes parallel tasks. The Sunway system utilizes MPI to leverage the CG-
stage parallelism. To exploit the parallelism within each CG, the system supports OpenMP for the
multi-threading parallelism of MPE and equips with Athread, a customized API library, for the
fine-grained CPE-stage parallelism.


Fig. 2. An example of column-wise SpMSpV, where nnz(A) = 11, nnz(x ) = 4, nnz(y) = 7, N > 4, M > 7, and
there are only four columns of A are accessed.

ALGORITHM 1: Column-wise SpMSpV based on CSC format.


Require: x: x[N ];
A stored in CSC format: p[N + 1], r[nnz(A)], and v[nnz(A)].
Ensure: y: y[M].
1: Initiate y[M], j, q;
2: for j = 0; j < N ; j + + do
3: for q = p[j]; q < p[j + 1]; q + + do
4: y[r[q]] ← y[r[q]] + v[q] × x[j];
5: end for
6: end for
7: return y[M].
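For illustration, Algorithm 1 translates directly into the following minimal C sketch; the function name spmspv_csc is an illustrative placeholder, and y is assumed to be zero-initialized by the caller. Columns whose corresponding x entries are zero are skipped, reflecting the observation above that such operations are omitted.

#include <stddef.h>

/* Column-wise SpMSpV over a CSC-stored matrix A (Algorithm 1).
 * p[N+1]  : column pointers of A
 * r[nnzA] : row index of each non-zero
 * v[nnzA] : numerical value of each non-zero
 * x[N]    : input vector (sparse, stored densely here)
 * y[M]    : output vector, assumed zero-initialized by the caller
 */
void spmspv_csc(size_t N, const size_t *p, const size_t *r, const double *v,
                const double *x, double *y)
{
    for (size_t j = 0; j < N; j++) {
        if (x[j] == 0.0)              /* zero entries of x contribute nothing */
            continue;
        for (size_t q = p[j]; q < p[j + 1]; q++)
            y[r[q]] += v[q] * x[j];   /* scatter-accumulate into y */
    }
}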

Each CG is equipped with an 8 GB DDR3 shared memory. The MPE of each CG has a 32 KB L1
cache for instructions and a 256 KB L2 cache for instructions and data. Each CPE of a CG has a
16 KB L1 cache for instructions and a 64 KB LDM for data instead of a cache. The LDM of each CPE
provides higher bandwidth and lower access latency than that of the main memory. Bulk transfer
between the LDM of each CPE and the main memory of the CG uses the Direct Memory Access
(DMA) transmission. Nevertheless, each CPE accesses non-coalesced data from the main memory
by Global Load/Store (gld/gst) operations with high overhead.
The Sunway architecture brings three main challenges for designing and implementing com-
puting kernels, as follows:
— How to properly coordinate the heterogeneous computing architecture.
— How to fully develop the multi-stage parallelism (CG- and CPE-stage).
— How to fully utilize the “main memory + LDM” memory hierarchy.
Therefore, an appropriate heterogeneous parallelization is important for the computing kernel
to properly coordinate the MPE and CPEs in each CG. In addition, an adaptive parallel execution
is also critical for the computing kernel to utilize the computing capability and manage data based
on the memory structure of Sunway.

4 OVERVIEW
Figure 3 presents the overview of fgSpMSpV framework on a CG of each SW26010 processor.
fgSpMSpV uses the following three key designs to mitigate the challenges of SpMSpV on HPC
architectures:

Fig. 3. The overview of fgSpMSpV.

Hybrid MPI+OpenMP+X Model. To efficiently exploit multi-stage and hybrid parallelism of


heterogeneous HPC architectures for SpMSpV demanded by large-scale real-world applications,
fgSpMSpV adopts a hybrid MPI+OpenMP+X parallelization model to accelerate SpMSpV on het-
erogeneous HPC architectures, where “X ” represents the parallel programming interfaces provided
by the many-/multi-core device in each heterogeneous parallel compute node (Section 5).
Adaptive Parallel Execution. To achieve best performance for parallel SpMSpV on heteroge-
neous and multi-stage parallel architectures, fgSpMSpV utilizes the fine-grained partitioner to
adapt to and exploit the hardware architecture (Section 6.1), the re-collection method to optimize
memory access behavior and parallel computation (Section 6.2), and the CSCV sparse matrix format
to co-execute the adaptive parallel execution while still enjoying the efficient SpMSpV (Section 6.3).
Optimization Techniques. To preserve better execution efficiency for SpMSpV in the heteroge-
neous and parallel environment, fgSpMSpV uses the column vector (CV ) padding to optimize the
parallel computing efficiency, the DMA bandwidth optimization to improve the communication
efficiency, and the optimized prefetching to overlap data transfers with computation (Section 7).
As shown in Figure 3, the MPE first reads and partitions the input sparse matrix A and sparse
vector x using the fine-grained partitioner based on the multi-stage parallel architecture, and pre-
pares a fine-grained column-wise sparse matrix dataset on the main memory (❶). After that, the
MPE adopts the re-collection method to re-organize the necessary columns in the fine-grained tiles
of A and non-zeros in x and y (❷), and uses the CSCV format to store the re-collected fine-grained
tiles (❸), so that the redundant calculations in SpMSpV are eliminated and the memory accesses
are optimized. Next, the MPE leverages the CPE cores to perform parallel SpMSpV. Each CPE first
prepares a cache of the re-collected x and prefetches a fine-grained matrix tile of the CSCV-stored
A at a time in the LDM (❹). Then, each CPE executes parallel SpMSpV on the fetched data (❺).
The results are returned to main memory after all the computation tasks on the CPE have been
finished (❻).

5 HYBRID MPI+OPENMP+X PARALLELIZATION MODEL


To achieve desired performance of parallel SpMSpV on the CPE clusters while still enjoying the
efficient pre-processing in real-world applications, fgSpMSpV adopts the MPI+OpenMP+X -based
approach to fully utilize the multi-stage and hybrid parallelism. fgSpMSpV leverages the inter-
node parallelism by MPI. Within each heterogeneous multi-/many-core node, OpenMP is utilized
to leverage the host CPU for the parallelizable pre-processing tasks in fgSpMSpV, while “X ” rep-
resents the parallel programming interfaces supported by the many-/multi-core device, e.g.,


Athread for CPE cluster and CUDA for GPU. Therefore, fgSpMSpV uses MPI+OpenMP+Athread
parallelization model on the Sunway TaihuLight.
In each CG, the MPE parallelizes the re-collection method and CSCV storing using OpenMP
to prepare the CSCV-stored sparse matrix and the compressed input sparse vector, and allocates
parallel SpMSpV tasks to the CPE cluster. The CPE cluster parallelizes and optimizes SpMSpV via
Athread. However, the fast LDM on each CPE has only 64 KB of memory, which requires each CPE
to fetch an appropriate amount of data from the main memory at a time for parallel computation.
Therefore, there are several rounds of processing on each CPE. In each round of processing, the
CPE loads data of a size suitable for the LDM from main memory and then executes
parallel computation on the received data. Finally, the CPE sends results back to main memory.
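The overall MPI+OpenMP+X structure can be sketched as follows; this is a simplified skeleton for illustration only, in which recollect_columns and offload_spmspv_to_device are placeholder names standing for the OpenMP-parallel pre-processing on the host and the device-side ("X", e.g., Athread or CUDA) SpMSpV kernel, respectively, not actual fgSpMSpV routines.

#include <mpi.h>
#include <omp.h>

/* Illustrative stubs, not real APIs: host-side pre-processing and the
 * device ("X") kernel launch (e.g., an Athread spawn or a CUDA launch). */
static void recollect_columns(int rank)        { (void)rank; }
static void offload_spmspv_to_device(int rank) { (void)rank; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                    /* inter-node / inter-CG parallelism */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Host-side pre-processing (re-collection, CSCV construction) with OpenMP. */
    #pragma omp parallel
    {
        recollect_columns(rank);
    }

    /* Main SpMSpV computation offloaded to the many-core device ("X"). */
    offload_spmspv_to_device(rank);

    MPI_Barrier(MPI_COMM_WORLD);               /* synchronize before post-processing */
    MPI_Finalize();
    return 0;
}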

6 ADAPTIVE PARALLEL EXECUTION


This article defines α as the number of launched CGs and β as the number of launched CPEs per
CG. Thus, the SpMSpV kernel should be partitioned and assigned to α × β CPEs.
There are three main difficulties for parallelizing SpMSpV:
— Due to the irregular indices of non-zeros in x and A, the locations of accesses to columns
in A and of accumulations to y are random and unpredictable. The random memory accesses by
CPEs lead to expensive access latency in parallel SpMSpV.
— Parallel write collisions could occur when multiple CPEs simultaneously perform accumu-
lations to the same element of y. If A is partitioned by columns and assigned to CPEs, CPEs
may perform accumulations on the same element of y at the same time. These simultane-
ous accumulations will be executed as atomic operations, which results in weak parallel
performance of SpMSpV.
— The irregular positions of non-zeros in A lead to load imbalance on multi-stage parallel
computing architectures. The uneven task assignment can negatively affect the performance
of parallel SpMSpV in real-world applications on the Sunway TaihuLight.
To alleviate the expensive access latency problem, fgSpMSpV re-collects the necessary ele-
ments in A, x, and y, and caches the elements of y in the LDM, so that the expensive non-coalesced
memory accesses are avoided and the limited LDM is fully utilized. To surmount the atomic op-
erations problem, fgSpMSpV partitions A by rows for CGs and CPEs to avoid simultaneous write
operations on the same position of y from CPEs. To ease the load imbalance problem, fgSpMSpV
partitions A for CGs and CPEs according to the number of non-zeros.

6.1 Fine-grained Partitioner


We devise a fine-grained partitioner for SpMSpV to overcome these difficulties and fully develop
the parallel computing capacity of Sunway.
As shown in Figure 4, the fine-grained partitioner mainly includes three steps, as follows:
(1) CG-stage Partitioning. To develop the CG-stage parallelism, A is divided by rows into α
blocks, and each CG works on one. Each block has a more or less equal number of non-zeros, that is,
nnz(block) = nnz(A)/α. Therefore, each CG is assigned a block and the full x vector. The result is
the corresponding segment of y, denoted as CG-segy.
(2) CPE-stage Partitioning. To develop the CPE-stage parallelism, the block on each CG is
further divided by rows into β tiles for the CPEs. Each tile has a more or less equal number of non-
zeros, i.e., nnz(tile) = nnz(block)/β = nnz(A)/(α × β). Therefore, each CPE is assigned a tile and the full x
vector. The result is the corresponding segment of y, denoted as CPE-segy.


Fig. 4. The fine-grained partitioner for the example of fgSpMSpV presented in Figure 2 on the Sunway, where
α = 2 and β = 2.

There is an array of positions of all tiles in A, denoted as Pos[α × β + 1], which provides a fast,
accurate, and intuitive way for each CG and CPE to find the start and end positions of each block
and tile in A. Pos[i × β] and Pos[(i + 1) × β] − 1 are indices of the first and last rows of the block
on CGi in A, respectively, where i = {0, 1, 2, . . . , α − 1}. Pos[i × β + j] and Pos[i × β + j + 1] − 1 are
indices of the first and last rows of the tile on the CPEj of CGi in A, where j = {0, 1, 2, . . . , β − 1}.
Correspondingly, the length of CPE-segy on the CPEj of CGi is Pos[i × β + j + 1] − Pos[i × β + j].
(3) Fine-grained Partitioning. To make better use of the limited memory of LDM, each tile
is further partitioned by columns into a number of CVSs, where each CVS has a size suitable for
loading into the LDM. Each CVS has inc column vectors. Correspondingly, x is partitioned into N/inc
segments, denoted as LDM-segx, and each LDM-segx has inc elements.
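A minimal C sketch of the nnz-balanced, row-wise partitioning behind the Pos array is given below; it assumes A is available in CSC form (arrays r and nnz(A) as in Section 3.1), and the greedy sweep and the function name build_pos are illustrative simplifications rather than the exact fgSpMSpV partitioner.

#include <stdlib.h>

/* Build Pos[alpha*beta + 1]: Pos[t]..Pos[t+1]-1 are the rows of tile t,
 * chosen so that every tile holds roughly nnzA / (alpha*beta) non-zeros.
 * r[nnzA] is the CSC row-index array of A, M its number of rows. */
void build_pos(size_t M, size_t nnzA, const size_t *r,
               size_t alpha, size_t beta, size_t *Pos /* alpha*beta+1 entries */)
{
    size_t tiles = alpha * beta;
    size_t *row_nnz = calloc(M, sizeof(size_t));

    for (size_t q = 0; q < nnzA; q++)            /* per-row non-zero histogram */
        row_nnz[r[q]]++;

    size_t target = (nnzA + tiles - 1) / tiles;  /* ~nnz(A)/(alpha*beta) per tile */
    size_t t = 0, acc = 0;
    Pos[0] = 0;
    for (size_t i = 0; i < M; i++) {
        acc += row_nnz[i];
        /* close the current tile once it reaches the target, keeping
         * enough rows for the remaining tiles */
        if (acc >= target && t + 1 < tiles && (M - i - 1) >= (tiles - t - 1)) {
            Pos[++t] = i + 1;
            acc = 0;
        }
    }
    while (t < tiles)                            /* remaining boundaries end at row M */
        Pos[++t] = M;

    free(row_nnz);
}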

6.2 The Re-collection Method


As described in Section 3.1, only the nnz(x) columns of A that correspond to non-zeros of x are
accessed in SpMSpV. A large number of the columns in A and of the elements in x and y are redundant. More-
over, non-zero elements in x are irregularly distributed, which results in irregular accesses to
the corresponding columns in A. Therefore, to fully exploit the memory bandwidth, we propose a
re-collection method for SpMSpV to remove useless data and optimize coalesced memory accesses.
The details are as follows:
Re-collection for x—The input vector x is compressed into x′, where only non-zeros of x
are stored. There are two arrays storing x′: the integer array storing indices of non-zeros in x
(Xi[nnz(x)]), and the floating-point array storing numerical values of non-zeros in x (Xv[nnz(x)]).
For the example shown in Figure 2, the compressed x′ is stored by two arrays, i.e., Xi[4] =
{j1, j2, j3, j4} and Xv[4] = {x1, x2, x3, x4}, as shown in Figure 5.
Re-collection for A—The input matrix A, stored in CSC format, is further compressed and re-
collected into A′, where only the nnz(x) columns of A that correspond to the nnz(x) non-zeros
of x are stored. The nnz(x) columns are re-collected together in order of the indices of the corre-
sponding nnz(x) non-zeros of x, while the other redundant columns of A are eliminated. In addition,
the rows of A where all the elements are zeros are eliminated. Therefore, the number of columns
in A′ is nnz(x), the number of rows in A′ is M′, where M′ ≤ M, and the number of non-zeros of
A′ (nnz(A′)) can be expressed as

$nnz(A') = \sum_{j \in Xi} nnz(A(:, j))$, (1)


Fig. 5. The re-collection method for x, A, and y in the example presented in Figure 2.

ALGORITHM 2: The re-collected SpMSpV.


Require: Xi[nnz(x)], Xv[nnz(x)], Colp[nnz(x) + 1], Rows[nnz(A′)], and Vals[nnz(A′)].
Ensure: Yv[M′].
1: Initiate Yv[M′], j, q, Col_start, and Col_end;
2: for j = 0; j < nnz(x); j + + do
3: Col_start ← Colp[j];
4: Col_end ← Colp[j + 1];
5: for q = Col_start; q < Col_end; q + + do
6: Yv[Rows[q]] ← Yv[Rows[q]] + Vals[q] × Xv[j];
7: end for
8: end for
9: return Yv[M′].

where A(:, j) is the jth column of A. There are three arrays storing A′: the integer array storing
column pointers (Colp[nnz(x)+1]), the integer array storing row ids of non-zeros (Rows[nnz(A′)]),
and the floating-point array storing numerical values (Vals[nnz(A′)]).
For the example shown in Figure 2, M′ = 7 and the re-collected A′ is stored by three arrays, i.e.,
Colp[5] = {0, 3, 6, 7, 8}, Rows[8] = {1, 2, 4, 0, 4, 6, 3, 5}, and Vals[8] = {a1, a2, a3, a4, a5, a6, a7, a8},
as presented in Figure 5.
Re-collection for y—The output vector y is compressed into y′, where only non-zeros of y are
stored. There are two arrays storing y′: the integer array storing indices of non-zeros in y (Yi[M′]),
and the floating-point array storing numerical values of non-zeros (Yv[M′]).
For the example shown in Figure 2, y′ is stored by two arrays, i.e., Yi[7] = {i1, i2, i3, i4, i5, i6, i7}
and Yv[7] = {y1, y2, y3, y4, y5, y6, y7}, as shown in Figure 5.
Therefore, only computational data are stored and can be continuously accessed in re-collected
SpMSpV, which exploits the data locality of SpMSpV and improves the memory bandwidth utiliza-
tion. Algorithm 2 describes the sequential algorithm of re-collected SpMSpV.
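The re-collection of x and of the corresponding columns of A can be sketched in C as follows; for brevity the sketch keeps the original row indices (i.e., it omits the elimination of all-zero rows described above), and the function names are illustrative placeholders.

#include <stddef.h>

/* Compress x[N] into (Xi, Xv): indices and values of its non-zeros.
 * Returns nnz(x). Xi and Xv must have room for N entries. */
size_t recollect_x(size_t N, const double *x, size_t *Xi, double *Xv)
{
    size_t nnzx = 0;
    for (size_t j = 0; j < N; j++)
        if (x[j] != 0.0) { Xi[nnzx] = j; Xv[nnzx] = x[j]; nnzx++; }
    return nnzx;
}

/* Gather the nnz(x) columns of the CSC matrix (p, r, v) selected by Xi
 * into (Colp, Rows, Vals), in the order of Xi. Row indices are kept as
 * in A; the all-zero-row compaction of Section 6.2 is omitted here.
 * Returns nnz(A'). */
size_t recollect_A(const size_t *p, const size_t *r, const double *v,
                   const size_t *Xi, size_t nnzx,
                   size_t *Colp, size_t *Rows, double *Vals)
{
    size_t out = 0;
    Colp[0] = 0;
    for (size_t c = 0; c < nnzx; c++) {
        size_t j = Xi[c];                       /* column of A to keep */
        for (size_t q = p[j]; q < p[j + 1]; q++) {
            Rows[out] = r[q];
            Vals[out] = v[q];
            out++;
        }
        Colp[c + 1] = out;                      /* column pointer of A' */
    }
    return out;
}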

6.3 The CSCV Matrix Format


As shown in Figure 6, based on the proposed fine-grained partitioner and re-collection method,
fgSpMSpV on the Sunway TaihuLight contains three-stage parallelism:
(1) On each CG, the CG-stage re-collected SpMSpV multiplies a block′ in the re-collected A
(A′) by the re-collected x (x′), where the block′ contains the nnz(x) re-collected columns of the block of A,
and these column indices correspond to the indices in array Xi[nnz(x)]. The result is the corresponding
segment of the re-collected y (y′), denoted as CG-segy′.


Fig. 6. The parallelization of fgSpMSpV for the example presented in Figure 4 on the Sunway, where α = 2,
β = 2, and inc = 2.

(2) On each CPE of the CG, the CPE-stage re-collected SpMSpV multiplies a tile′ in the block′
by x′, where the tile′ contains the nnz(x) re-collected columns of the tile of the block, and these column indices
correspond to the indices in array Xi[nnz(x)]. The result is the corresponding segment of CG-segy′,
denoted as CPE-segy′.
An auxiliary array Pos′[α × β + 1] is built to mark the start and end positions of each block′ and
tile′ in A′. Pos′[i × β] and Pos′[(i + 1) × β] − 1 are indices of the first and last rows of the block′ on
CGi in A′, respectively, where i = {0, 1, 2, . . . , α − 1}. Pos′[i × β + j] and Pos′[i × β + j + 1] − 1 are
indices of the first and last rows of the tile′ on the CPEj of CGi in A′, where j = {0, 1, 2, . . . , β − 1}.
Correspondingly, the length of CPE-segy′ on the CPEj of CGi is Pos′[i × β + j + 1] − Pos′[i × β + j].
(3) Furthermore, the fine-grained re-collected SpMSpV multiplies a CVS in the tile′ by the
corresponding segment of x′ (LDM-segx′), where each CVS has inc column vectors in the tile′, and
each LDM-segx′ has inc non-zeros. The result is accumulated to the corresponding CPE-segy′.
Especially, as shown in Figure 6, each empty CVS, where all elements are zeros, and the corre-
sponding LDM-segx′ will not be fetched by the CPE. Additionally, to reduce non-coalesced mem-
ory accesses and atomic operations, an array is allocated in each LDM to store the numerical values
of CPE-segy′. The results on the CPE are accumulated to the corresponding elements in this array.
Until the calculations on all the CVSs of the tile′ are completed, the results cached in this array
are not returned to main memory.
To adapt to the fine-grained parallelization design of fgSpMSpV, we devise a Compressed
Sparse Column Vector (CSCV) format for A′ to store non-zeros by column vectors of each CVS.
The CSCV format is a fine-grained variant of the CSC format. There are also three arrays of CSCV to
store A′, i.e., the integer array of column vector pointers (SColp[nnz(x) × α × β + 1]), the integer
array of row indices of non-zeros of A′ (SRows[nnz(A′)]), and the floating-point array of numerical
values (SVals[nnz(A′)]). The details are as follows:
— SColp stores the position where each column begins and ends in tiles.
— SRows stores the row id of each non-zero in tiles.
— SVals stores the value of each non-zero in tiles.
For the example of parallelization shown in Figure 4, the CSCV format stores the re-collected
matrix A′ by three arrays:
— SColp[17] = {0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 7, 8};
— SRows[8] = {1, 0, 0, 1, 0, 0, 1, 0};
— SVals[8] = {a1, a4, a2, a7, a3, a5, a6, a8}.
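As a concrete data layout, a CSCV-stored matrix could be held in a struct such as the following; the struct and field names are shorthand for the three arrays described above and are given for illustration only.

#include <stddef.h>

/* CSCV storage of the re-collected matrix A' for alpha*beta tiles.
 * SColp has nnz(x)*alpha*beta + 1 entries: the column-vector pointers of
 * every tile laid out consecutively; SRows/SVals hold the non-zeros. */
typedef struct {
    size_t  nnzx;        /* number of re-collected columns per tile, nnz(x)    */
    size_t  tiles;       /* alpha * beta                                       */
    size_t *SColp;       /* [nnzx * tiles + 1] column-vector pointers          */
    size_t *SRows;       /* [nnz(A')] row index of each non-zero (tile-local)  */
    double *SVals;       /* [nnz(A')] value of each non-zero                   */
} cscv_t;

/* Non-zeros of column c (0 <= c < nnzx) of tile t occupy the half-open
 * range [SColp[t*nnzx + c], SColp[t*nnzx + c + 1]) in SRows/SVals. */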
Algorithm 3 presents the parallel algorithm for fgSpMSpV on a CPE. The data in the arrays of
tile′ and x′, i.e., SColp, SRows, SVals, and Xv, can be accessed continuously by each CPE. The
CPE allocates LDM spaces for these fetched arrays, i.e., scolpldm, srowsldm, svalsldm, and xvldm


ALGORITHM 3: fgSpMSpV on a CPE


Require: Arrays of the tile′: SColp, SRows, SVals;
An array of x′: Xv;
Other parameters: Pos′, CG_id, inc, nnz(x), β.
Ensure: CPE-segy′.
1: my_id ← get_thread_id(); // get id of the CPE
2: allocate and initiate yvldm[Pos′[CG_id × β + my_id + 1] − Pos′[CG_id × β + my_id]];
3: // loop over all the CVSs of the tile′
4: i ← nnz(x) × (CG_id × β + my_id);
5: p ← SColp[i];
6: while i < nnz(x) × (CG_id × β + my_id + 1) do
7: // fetch a CVS and the LDM-segx′ from main memory
8: DMA_get_data(SColp[i], inc, scolpldm, ...);
9: // check if all the elements of the CVS are zeros
10: int len = scolpldm[inc − 1] − p;
11: if len ≠ 0 then
12: DMA_get(SRows[p], len, srowsldm, ...);
13: DMA_get(SVals[p], len, svalsldm, ...);
14: DMA_get(Xv[i − nnz(x) × (CG_id × β + my_id)], inc, xvldm, ...);
15: for each element e in the LDM-segx′ do
16: for each non-zero nz in the corresponding column of the CVS do
17: yvldm[srowsldm[nz]] ← yvldm[srowsldm[nz]] + svalsldm[nz] × xvldm[e]; // perform SpMSpV
18: end for
19: end for
20: end if
21: i ← i + inc;
22: p ← scolpldm[inc − 1];
23: end while
24: DMA_send(yvldm, Pos′[CG_id × β + my_id + 1] − Pos′[CG_id × β + my_id], Yv, ...); // return the CPE-segy′
25: return CPE-segy′.

in Algorithm 3. The arrays of tile′s and x′ are fetched from main memory to LDM using DMA. As
described in Algorithm 3, after each CPE ensures the CVS is non-empty (len ≠ 0) according to
the array SColp, the required data in arrays SRows, SVals, and Xv are fetched. In addition, the
array storing numerical values of the CPE-segy′ on each CPE, denoted as yvldm[Pos′[i × β + j +
1] − Pos′[i × β + j]], is allocated and cached in the LDM. The results are cached in this array, and
the calculation results of each processing round update the array yvldm. Until all the tasks on the
CPE are finished, the data in array yvldm are not sent to main memory using DMA.

7 OPTIMIZATION FOR FGSPMSPV ON THE SUNWAY


7.1 CV Padding Technique
Each CPE on the Sunway SW26010 processor is equipped with a 256-bit vector unit, which enables
four floating-point operations to be performed simultaneously. Every four contiguous operands
should be vectorized into an SIMD variable before performing calculations. Additionally, each


Fig. 7. Misaligned vectorization.

SIMD variable must have the 32-byte boundary alignment (16-byte boundary alignment for single
floating-point data), as shown in Figure 7, otherwise it will impair the performance of SIMD.
On each CPE, non-zeros of each CV in the CVSs are successively multiplied by a corresponding
element in x′. However, the number of non-zeros of each CV is irregular, which makes it difficult
to vectorize the non-zeros of each column with 32-byte boundary alignment. Therefore, we
propose a padding technique for non-vanishing CVs. The CV padding technique pads each non-
vanishing CV so that its number of elements is a multiple of four. Therefore, all the non-vanishing
CVs on each CPE can be efficiently vectorized with 32-byte alignment.
For a CV with nnz(CV) non-zeros, if the CV is empty, i.e., nnz(CV) = 0, it does not need to be
padded, and the number of elements in the CV after padding, denoted as nnz(CV′), remains 0.
If the number of non-zeros in the CV is a multiple of four, i.e., nnz(CV) ≠ 0 and nnz(CV) mod 4 = 0,
it does not need to be padded, and nnz(CV′) remains nnz(CV). Otherwise, nnz(CV) ≠ 0
and nnz(CV) mod 4 ≠ 0, so the CV needs to be padded with 4 − (nnz(CV) mod 4) elements, i.e.,
nnz(CV′) = nnz(CV) + 4 − (nnz(CV) mod 4). Therefore, nnz(CV′) on a CPE can be expressed as

$nnz(CV') = \begin{cases} 0, & nnz(CV) = 0 \\ nnz(CV), & nnz(CV) \neq 0 \text{ and } nnz(CV) \bmod 4 = 0 \\ nnz(CV) + 4 - (nnz(CV) \bmod 4), & nnz(CV) \neq 0 \text{ and } nnz(CV) \bmod 4 \neq 0 \end{cases}$ (2)
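Equation (2) corresponds to the following small C helper (illustrative only):

/* Number of elements of a column vector (CV) after padding to a
 * multiple of four, following Equation (2). Empty CVs stay empty. */
static inline unsigned long pad_cv(unsigned long nnz_cv)
{
    if (nnz_cv == 0 || nnz_cv % 4 == 0)
        return nnz_cv;                    /* no padding needed */
    return nnz_cv + 4 - (nnz_cv % 4);     /* round up to a multiple of 4 */
}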

7.2 DMA Bandwidth Optimization
As shown in Algorithm 3, each CVS has inc column vectors, so that the number of transferred
elements in both the integer array SColp and the floating-point number array Xv is inc. The num-
ber of transferred elements in both the integer array SRows and the floating-point number array
SVals is len, where len is the number of non-zeros in the non-vanishing CVS and can be calculated
by SColp. Moreover, the number of transferred elements of the floating-point number array of the
CPE-segy′ on the CPEj of CGi is Pos′[i × β + j + 1] − Pos′[i × β + j]. All floating-point numbers are
set as doubles, each of which has 8 bytes. Each integer number has 4 bytes. The memory capacity
of the LDM is 64 KB, so the amount of the transferred data in each round must meet:

$4\,inc + 8\,inc + 4\,len + 8\,len + 8\left(Pos'[i \times \beta + j + 1] - Pos'[i \times \beta + j]\right) \le 64 \times 1024$. (3)
The above-mentioned arrays are transferred between main memory and LDM using DMA. Ac-
cording to the performance characteristics of DMA transmission, when the amount of data trans-
ferred is a multiple of 128 bytes and exceeds 512 bytes, the DMA bandwidth achieves an expected
performance. To improve the DMA bandwidth utilization, the amount of each transferred array in
each processing round should meet the optimization criteria.
The array Yv storing CPE-segy′ is transferred by the CPE only once, and its size is irregular
on each CPE. In addition, the data size in arrays SRows and SVals for storing each non-vanishing
CVS is irregular. So, it is challenging for these three arrays to meet the DMA optimization criteria.
However, for the amount of transferred data in array SColp in each round, we have 4inc = 128n
and 4inc ≥ 512, where n ∈ N + . For array Xv, we have 8inc = 128m and 8inc ≥ 512, where m ∈ N + .


Fig. 8. The workflow of fgSpMSpV using optimizations on a CPE.

We simplify the equations and obtain the solution

$\begin{cases} inc = 32l, & l \in \mathbb{N}^+ \\ inc \ge 128 \end{cases}$ (4)

Consequently, the DMA transmissions of SColp and Xv can achieve the optimized performance
when the parameter inc satisfies Equation (4).
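For illustration, a helper that picks the largest inc consistent with Equations (3) and (4) for a given tile might look as follows, where len_max denotes the largest number of non-zeros any CVS of the tile can reach and seg_y the length of the CPE-segy′; it is a sketch of the two criteria rather than the actual tuning code of fgSpMSpV.

/* Pick the largest inc = 32*l with inc >= 128 (Equation (4)) such that the
 * per-round LDM footprint of Equation (3), i.e.,
 * 12*inc + 12*len_max + 8*seg_y bytes, fits in the 64 KB LDM. */
static unsigned long choose_inc(unsigned long len_max, unsigned long seg_y)
{
    unsigned long best = 0;
    for (unsigned long inc = 128; inc <= 64UL * 1024UL; inc += 32) {
        unsigned long bytes = 12UL * inc + 12UL * len_max + 8UL * seg_y;
        if (bytes > 64UL * 1024UL)
            break;                      /* footprint grows monotonically with inc */
        best = inc;                     /* keep the largest feasible inc */
    }
    return best;                        /* 0 if even inc = 128 does not fit */
}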

7.3 Optimized Prefetching Technique


In the parallelization design of fgSpMSpV, each processing round on each CPE loads data to be used
in this round and performs calculations. The result CPE-segy  will be returned to main memory
until all the rounds on each CPE are completed. The data transmission and calculation in each
round are executed sequentially on each CPE.
fgSpMSpV utilizes the optimized prefetching technique to overlap the DMA data transmission
with the calculation on each CPE, as shown in Figure 8. Using the optimized prefetching
technique, each CPE loads a padded CVS in the first round. For the next round (the second round),
each CPE not only performs calculation on the CVS loaded in the previous round (the first round),
but also loads the data to be used in the next round (the third round) in advance. The remaining rounds on
each CPE proceed in the same way, i.e., each round not only performs calculations on the
CVS loaded in the previous round, but also loads a CVS to be used in the next round. In contrast, the
final round on the CPE only performs calculations on the CVS loaded in the previous round and
returns the result back to the main memory. As shown in Figure 8, therefore, except for the first
and the final rounds on each CPE, the DMA data transmission and calculation in other rounds are
performed simultaneously.
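The round structure can be summarized by the double-buffered loop below; dma_get_async, dma_wait, and compute_cvs are illustrative stand-ins (no-op stubs here) for the asynchronous DMA get, its completion wait, and the per-CVS computation of Algorithm 3, used only to show how loading the CVS of round k + 1 overlaps with computing the CVS of round k.

/* Illustrative no-op stubs standing in for the asynchronous DMA load into
 * an LDM buffer, the wait for its completion, and the per-CVS SpMSpV. */
static void dma_get_async(int cvs_id, void *ldm_buf) { (void)cvs_id; (void)ldm_buf; }
static void dma_wait(int cvs_id)                     { (void)cvs_id; }
static void compute_cvs(const void *ldm_buf)         { (void)ldm_buf; }

/* Double-buffered processing of n_cvs CVSs on one CPE: iteration k loads
 * CVS k (computed in the next iteration) while computing CVS k-1. */
void process_tile(int n_cvs, void *buf[2])
{
    if (n_cvs <= 0)
        return;
    dma_get_async(0, buf[0]);                    /* first round: load only   */
    for (int k = 1; k < n_cvs; k++) {
        dma_get_async(k, buf[k % 2]);            /* prefetch the next CVS    */
        dma_wait(k - 1);                         /* previous CVS has arrived */
        compute_cvs(buf[(k - 1) % 2]);           /* overlap with the DMA get */
    }
    dma_wait(n_cvs - 1);
    compute_cvs(buf[(n_cvs - 1) % 2]);           /* final round: compute,    */
                                                 /* then return CPE-segy'    */
}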

8 PERFORMANCE ANALYSIS
This section will detail the overhead analysis for re-collecting x, A, and y and the parallel runtime
analysis of fgSpMSpV. Table 1 lists the explanations of important symbols used in this article.

8.1 Re-collection Time Analysis


For re-collecting x, the input vector x (containing N elements) is compressed into x′ (containing
nnz(x) elements), in which Xi[nnz(x)] stores the index of each non-zero and Xv[nnz(x)] stores the

Table 1. Symbols and Their Descriptions

Symbol Description
α The number of CGs used in Sunway system
β The number of CPEs used in each CG
CGi The ith CG of the Sunway, where i ∈ {0, 1, 2, . . . , α − 1}
CPEj The jth CPE of each CG, where j ∈ {0, 1, 2, . . . , β − 1}
A The M × N input sparse matrix of SpMSpV
nnz(·) The number of non-zeros in its input
T The execution time of fgSpMSpV on each CPE
L The overhead of loading data by each CPE
Lf The overhead of loading data in the first processing round on each CPE
C The overhead of calculations on each CPE
Cp The overhead of calculations in the final round on each CPE
R The overhead of returning data to memory by each CPE

numerical value of each non-zero. Arrays Xi and Xv are obtained by going through N elements of
x. Therefore, the time complexity of re-collecting x is O(N ).
For re-collecting A, the input matrix A (M rows, N columns, and nnz(A) non-zeros) is com-
pressed and re-collected into A′ (M′ rows, nnz(x) columns, and nnz(A′) non-zeros). First of all,
the nnz(A′) non-zeros in the corresponding nnz(x) columns of A are traversed to obtain arrays
SColp[nnz(x) + 1] and SVals[nnz(A′)] that respectively store the position pointers of each column
in A′ and the numerical values of the nnz(A′) non-zeros. Besides, an array Mark[M] is obtained, in
which every element is zero except Mark[i], where i is the row index of a non-zero of A′.
This step has the time complexity of O(nnz(A′)). Second, the M elements of array Mark are traversed
to replace each non-zero element of Mark with the order in which it appears among all the
non-zero elements in Mark. This step has the time complexity of O(M).
Then, the nnz(A′) non-zeros of A′ are further traversed based on array Mark to obtain array
SRows[nnz(A′)] that stores the row indices of the nnz(A′) non-zeros, which has time complexity
of O(nnz(A′)). Therefore, the time complexity of re-collecting A is O(nnz(A′) + M).
For re-collecting y, the output vector y is compressed into y′, in which Yi[nnz(y)] stores the
index of each non-zero and Yv[nnz(y)] stores the numerical value of each non-zero. Array Yv
is obtained by performing the SpMSpV computation, while array Yi is obtained by going through
the corresponding nnz(A′) non-zeros in A; their row indices in A are the elements of array Yi.
Therefore, the time complexity of re-collection for y is O(nnz(A′)).
In conclusion, the re-collection method has the time complexity of O(N + nnz(A′) + M).

8.2 Parallel Execution Time Analysis


Figure 8 presents the runtime analysis of fgSpMSpV, where the computation and communication
are overlapped using optimizations. Define L as the overhead of loading data from main memory
by a CPE, C as the computation overhead on the CPE, and R as the overhead of returning data
to main memory by the CPE in the final processing round. So, the execution time of fgSpMSpV
without the optimized prefetching technique on a CPE, denoted as T , can be expressed as
T = L + C + R. (5)
However, the optimized prefetching technique allows the DMA data transmission to overlap
with the calculations in all processing rounds on each CPE, except for the first and the final rounds.
Define L f as the overhead of loading data by a CPE in the first round, Cp as the computation


overhead in the final round on the CPE. Using the optimized prefetching, the execution time of
fgSpMSpV on a CPE, denoted as T′, is reduced and can be expressed as
T′ = max{L − Lf, C − Cp} + Lf + Cp + R. (6)
In the following two subsections, we will detail the analysis of L, Lf, R, C, and Cp.
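For reference, the two cost models of Equations (5) and (6) translate into the following small helpers (illustrative only), with the five overheads passed in as plain numbers:

/* Execution time per CPE without (Eq. (5)) and with (Eq. (6)) the
 * optimized prefetching, given the overheads L, Lf, C, Cp, and R. */
static double time_no_prefetch(double L, double C, double R)
{
    return L + C + R;                               /* Equation (5) */
}

static double time_with_prefetch(double L, double Lf, double C, double Cp, double R)
{
    double overlap = (L - Lf > C - Cp) ? (L - Lf) : (C - Cp);
    return overlap + Lf + Cp + R;                   /* Equation (6) */
}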

8.3 Communication Time Analysis


For the communication time on the CPEj of CGi , the overhead of loading data by the CPE is the
ratio of the amount of loaded data to the DMA bandwidth, denoted as B.
There are three arrays of the padded tile′ to be loaded by the CPE, i.e., SColp[nnz(x) + 1],
SRows[nnz′], and SVals[nnz′], where nnz′ is the number of elements of the padded tile′ on
the CPE. There are nnz(x) padded CVs on the CPE, i.e., $\{CV_{(i\times\beta+j)nnz(x)}, CV_{(i\times\beta+j)nnz(x)+1}, \ldots, CV_{(i\times\beta+j+1)nnz(x)-1}\}$. Moreover, there are an array of x′, i.e., Xv[nnz(x)], and an array Pos′[2] to
be loaded by the CPE.
For the first round on the CPE, there are three arrays of the padded tile′ to be loaded, i.e.,
SColp[inc], SRows[len], and SVals[len], where len is the number of elements in the first CVS,
i.e., inc padded CVs. The inc padded CVs in the first round are $\{CV_{(i\times\beta+j)nnz(x)}, CV_{(i\times\beta+j)nnz(x)+1}, \ldots, CV_{(i\times\beta+j)nnz(x)+inc-1}\}$. In this section, we only discuss the case that all the padded CVSs are
non-empty; hence each CPE would load all the data of Xv. There is an array of x′, i.e., Xv[inc], to
be loaded in the first round on the CPE. Therefore, based on Equation (2), the overhead of loading
data in the first round on the CPEj of CGi (Lf) can be expressed as Equation (7):

$L_f = \frac{4\,inc + 12\,len + 8\,inc}{B} = \frac{12}{B}\left[inc + \sum_{k=\xi}^{\varphi} nnz(CV_k)\right]$, (7)

where ξ = (i × β + j)nnz(x) and φ = (i × β + j)nnz(x) + inc − 1.


Except for the overhead in the first round, the rest of the overhead of loading data on the CPEj of CGi,
i.e., L − Lf, can be expressed as

$L - L_f = \frac{4\left(nnz(x) + 1\right) + 12\,nnz' + 8\,nnz(x) + 4 \times 2}{B} - L_f = \frac{12}{B}\left[nnz(x) - inc + 1 + \sum_{k=\xi}^{\zeta} nnz(CV_k) - \sum_{k=\xi}^{\varphi} nnz(CV_k)\right] = \frac{12}{B}\left[nnz(x) - inc + 1 + \sum_{k=\varphi+1}^{\zeta} nnz(CV_k)\right]$, (8)

where ξ = (i × β + j)nnz(x), φ = (i × β + j)nnz(x) + inc − 1, and ζ = (i × β + j + 1)nnz(x) − 1.


For the final round on the CPE, there is an array of y′, i.e., Yv[Pos′[i × β + j + 1] − Pos′[i × β + j]],
to be returned back to main memory. So, the overhead of returning data in the final round on the
CPEj of CGi, i.e., R, can be expressed as

$R = \frac{8\left(Pos'[i \times \beta + j + 1] - Pos'[i \times \beta + j]\right)}{B}$. (9)


8.4 Computation Time Analysis



For the computation time, there are $2\sum_{k=\xi}^{\zeta} nnz(CV_k)$ floating-point calculations on the CPEj of
CGi. Additionally, the SIMD technique allows every four floating-point calculations to be per-
formed synchronously. Define S as the speed of a floating-point calculation on the SW26010. De-
fine Cs as the sequential execution time of the re-collected SpMSpV on the (i × β + j)th tile′ on a
single core. So, Cs can be expressed as
Cs = 2nnz(tile′)/S. (10)

The final round on the CPE executes computation on the final CVS. The inc padded CVs in
the final CVS are $\{CV_{(i\times\beta+j+1)nnz(x)-inc}, CV_{(i\times\beta+j+1)nnz(x)-inc+1}, \ldots, CV_{(i\times\beta+j+1)nnz(x)-1}\}$. According to
Equation (10), the calculation overhead of the final round, denoted as Cp, on the CPEj of CGi can
be expressed as

$C_p = \frac{1}{4S}\sum_{k=\lambda}^{\zeta} 2\,nnz(CV_k) = \frac{C_s}{4\,nnz(tile')}\sum_{k=\lambda}^{\zeta} nnz(CV_k)$, (11)

where λ = (i × β + j + 1)nnz(x) − inc and ζ = (i × β + j + 1)nnz(x) − 1. In addition, based on
Equations (10) and (11), C − Cp can be expressed as Equation (12):

$C - C_p = \frac{1}{4S}\sum_{k=\xi}^{\zeta} 2\,nnz(CV_k) - \frac{C_s}{4\,nnz(tile')}\sum_{k=\lambda}^{\zeta} nnz(CV_k) = \frac{C_s}{4\,nnz(tile')}\sum_{k=\xi}^{\lambda-1} nnz(CV_k)$, (12)

where ξ = (i × β + j)nnz(x) and λ = (i × β + j + 1)nnz(x) − inc.


Therefore, according to Equations (6), (7), (8), (9), (11), and (12), the execution time of fgSpMSpV
on the CPEj of CGi, denoted as T′_{i×β+j}, can be expressed as

$T'_{i\times\beta+j} = \max\left\{\frac{12}{B}\left[nnz(x) - inc + 1 + \sum_{k=\varphi+1}^{\zeta} nnz(CV_k)\right],\ \frac{C_s}{4\,nnz(tile')}\sum_{k=\xi}^{\lambda-1} nnz(CV_k)\right\} + \frac{12}{B}\left[inc + \sum_{k=\xi}^{\varphi} nnz(CV_k)\right] + \frac{C_s}{4\,nnz(tile')}\sum_{k=\lambda}^{\zeta} nnz(CV_k) + \frac{8\left(Pos'[i\times\beta+j+1] - Pos'[i\times\beta+j]\right)}{B}$, (13)

where ξ = (i × β + j)nnz(x), φ = (i × β + j)nnz(x) + inc − 1, λ = (i × β + j + 1)nnz(x) − inc, and
ζ = (i × β + j + 1)nnz(x) − 1.

9 EVALUATION
9.1 Evaluation Setup
fgSpMSpV is implemented in the C language. Most of the experiments are conducted on the Sun-
way TaihuLight supercomputer. The CPE-stage parallel efficiency of fgSpMSpV is exploited using
Athread and tested on one CG, and the CG-stage parallel efficiency is exploited using MPI and

Table 2. The Tested Sparse Matrices

Sparse Matrix A M N nnz(A) Sparse Matrix A M N nnz(A)


raefsky3 21K 21K 1.49M dielFilterV3real 1.10M 1.10M 45.20M
pdb1HYS 36K 36K 4.34M hollywood-2009 1.14M 1.14M 113.89M
rma10 47K 47K 2.37M kron_g500-logn21 2.10M 2.10M 182.08M
cant 62K 62K 4.01M soc-orkut 3.00M 3.00M 212.70M
2cubes_sphere 101K 101K 874K wikipedia-20070206 3.57M 3.57M 45.03M
cop20k_A 121K 121K 2.62M soc-LiveJournal1 4.85M 4.85M 68.99M
cage12 130K 130K 2.03M ljournal-2008 5.36M 5.36M 79.02M
144 145K 145K 1.07M indochina-2004 7.41M 7.41M 194.11M
scircuit 171K 171K 959K wb-edu 9.85M 9.85M 57.16M
mac_econ_fwd500 207K 207K 1.27M road_usa 23.95M 23.95M 57.71M

tested on multiple CGs. The Sunway system provides a C compiler that is customized for
compiling and linking programs on the MPE and CPEs of the heterogeneous SW26010 processor. The
compiling command for programs on the MPE is mpicc -host, for programs on the CPEs it is mpicc -slave, and the hybrid
linking command is mpicc -hybrid. The compiler switches used include -msimd (turning on the
SIMD function). One CG has a maximum calculation speed of 742.5 GFLOPS and a peak memory
bandwidth of 34 GB/s. Each MPE and each CPE of a CG run at 1.45 GHz. Additionally, we
also implement fgSpMSpV on the GPU and conduct experiments on a machine that is equipped
with two Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (14 cores each) with 256GB RAM and one
NVIDIA Tesla P100 GPU with 16 GB HBM2 stacked memory. Except for the comparison experiments
with the state-of-the-art methods on the GPU, where the floating-point numbers are single-precision, the
floating-point numbers in all the experiments are double-precision.
Table 2 presents the 20 sparse matrices that are widely used in various related work [6, 26, 34].
All the sparse matrices except soc-orkut are selected from the SuiteSparse Matrix Collection [20].
soc-orkut is selected from the Network Data Repository [32]. For experiments on the Sunway Tai-
huLight, due to the limitations of main memory (8GB) on each CG as well as LDM (64KB) on each
CPE, the first ten smaller-dimension sparse matrices in Table 2 are tested on up to one CG, and
the last 10 larger-dimension sparse matrices are tested on up to 16 CGs.
According to the examples of BFS and PageRank applications on the wikipedia-2007 dataset
provided in Ref. [26], the density of x in each SpMSpV operation varies from 2.80 × 10−7 to 0.67.
Therefore, in our experiments, we set the sparsity of the input vector x to five classes, i.e., nnz(x ) =
0.01%N , nnz(x ) = 0.1%N , nnz(x ) = 1%N , nnz(x ) = 20%N , and nnz(x ) = N .

9.2 Results Analysis


9.2.1 Parallel Scaling Results. Figure 9 shows the speedups of the main SpMSpV computation
in fgSpMSpV obtained on a CG with 64 CPEs over the sequential algorithm on an MPE with the
sparsity of x varying. In particular, there is no need to use the re-collection method in fgSpMSpV when nnz(x) = N, which corresponds to performing SpMV. On average, fgSpMSpV achieves a speedup of 0.15× (min: 0.07×, max: 0.21×) when nnz(x) = 0.01%N, 1.19× (min: 0.38×, max: 1.76×) when nnz(x) = 0.1%N, 8.40× (min: 6.29×, max: 10.28×) when nnz(x) = 1%N, 18.91× (min: 15.59×, max: 21.98×) when nnz(x) = 20%N, and 19.40× (min: 16.90×, max: 22.65×) when nnz(x) = N on a CG with 64 CPEs over the sequential algorithm on an MPE. fgSpMSpV on a CG achieves higher speedups as the input vector x becomes denser (from nnz(x) = 0.01%N to nnz(x) = 20%N). The reason is that as the input vector becomes sparser, fewer necessary columns are re-collected from the 10 tested sparse matrices and less data needs to be transferred; hence, the load balance across the 64 CPEs and the utilization of the DMA bandwidth are hard to improve.


Fig. 9. Speedups of fgSpMSpV obtained on a CG with different sparsity of x.

Fig. 10. Scalability of CPE-stage parallelism.

Especially for nnz(x) = 0.01%N, the number of re-collected necessary columns is very small (ranging from 2 to 20), which makes it difficult to improve the parallel execution efficiency.
Figure 10 presents the runtime of the main SpMSpV computation in fgSpMSpV varying the
number of CPEs in a CG (α = 1 and β = {1, 4, 16, 64}). On average, fgSpMSpV achieves the speedup
of 13.42× (min: 6.86×, max: 21.48×) when nnz(x ) = 20%N , 8.05× (min: 4.97×, max: 10.93×) when
nnz(x) = 1%N, 1.79× (min: 1.13×, max: 2.79×) when nnz(x) = 0.1%N,


Fig. 11. Scalability of CG-stage parallelism on the 10 smaller-dimension datasets.

and 1.29× (min: 1.09×, max: 1.54×) when nnz(x) = 0.01%N. From Figure 10, fgSpMSpV obtains higher speedups on a CG
with 64 CPEs over that with 1 CPE as x becomes sparser (from nnz(x ) = 20%N to nnz(x ) = 0.01%N ).
The size of array SColp[nnz(x ) × α × β + 1] scales linearly with the values of α and β. According to
Algorithm 3, despite the increase in the number of CPEs (β), the amount of data to be transferred
in arrays SColp and Xv remains unchanged, which impairs the CPE-stage scalability of fgSpMSpV.
fgSpMSpV shows better CPE-stage scalability on scircuit. When the number of CPEs is small, the CPE-seg of y requires a large amount of LDM memory due to the relatively large dimension of scircuit. Moreover, the maximum number of non-zeros per column of scircuit is relatively large. Given the limited LDM capacity, inc must therefore be set to a small value to ensure that each CVS fits in the LDM. However, as the number of CPEs increases, the size of each CVS and the length of the CPE-seg of y on each CPE decrease, so the LDM can accommodate a larger inc, which in turn makes better use of the DMA bandwidth.
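To make the role of inc concrete, the following C sketch estimates the largest inc for which one CVS, together with the local segment of y, still fits into the 64 KB LDM of a CPE. The byte accounting (4-byte indices, 8-byte double-precision values, and a fixed slack reserved for the stack and DMA buffers) is an assumption made for illustration and is not the exact sizing rule used by fgSpMSpV.

```c
/* Minimal sketch: choose inc so that inc re-collected columns plus the local
 * y segment fit into the usable LDM of one CPE. All byte counts below are
 * illustrative assumptions. */
static int choose_inc(long max_col_nnz,   /* max non-zeros per re-collected column */
                      long yseg_len,      /* length of the CPE-seg of y on this CPE */
                      long ldm_bytes)     /* usable LDM capacity, e.g. 64 * 1024    */
{
    const long reserved  = 8 * 1024;              /* slack for stack and DMA buffers */
    const long yseg_cost = 8 * yseg_len;          /* double-precision y segment      */
    /* Per column: up to max_col_nnz (row index, value) pairs plus one pointer. */
    const long col_cost  = max_col_nnz * (4 + 8) + 4;

    long budget = ldm_bytes - reserved - yseg_cost;
    if (budget < col_cost) return 1;              /* fall back to one column at a time */
    return (int)(budget / col_cost);
}
```

With more CPEs, both max_col_nnz per tile and yseg_len shrink, so the returned inc grows, which matches the behavior observed on scircuit.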
As we can see from Figure 10, there is almost no speedup when the CPE-stage scalability of
fgSpMSpV is tested for sparser inputs (nnz(x ) = 0.01%N and nnz(x ) = 0.1%N ); hence, the CG-
stage scalability on the 10 smaller-dimension datasets is only tested when nnz(x ) = 20%N and
nnz(x ) = 1%N . Figure 11 presents the CG-stage scalability on the 10 smaller-dimension sparse
matrices, where up to 16 CGs are used (α = {1, 2, 4, 8, 16} and β = 64). On average, fgSpMSpV
achieves the speedup of 1.83× (min: 1.38×, max: 2.52×) when nnz(x ) = 20%N and 1.64× (min:
1.45×, max: 1.92×) when nnz(x) = 1%N. The size of the data in arrays SColp and Xv to be transferred by each CPE remains unchanged as the value of α increases, which has a more negative impact on the scalability of CG-stage parallelism than on that of CPE-stage parallelism. This is because the number of non-zeros in each tile and the length of each CPE-seg of y keep decreasing as α increases, correspondingly reducing both the communication overhead of transferring arrays SRows, SVals, and Yv and the computation overhead on each CPE. Moreover, for smaller datasets, as α increases, the proportion of these overheads in the overall parallel SpMSpV overhead decreases considerably, while the communication overhead of transferring arrays SColp and Xv remains unchanged, which impairs the CG-stage scalability of fgSpMSpV.
Figure 12 shows the CG-stage scalability on the 10 larger-dimension sparse matrices, where the
number of CGs is scaled from 16 to 128 (α = {16, 32, 64, 128} and β = 64). Some experimental re-
sults on the three largest-dimension datasets (indochina-2004, wb-edu, and road_usa) are missing in
Figure 12. The runtime of indochina-2004 does not appear in the figure when nnz(x ) = 20%. When
nnz(x ) = 1%, the corresponding figure only shows the runtime of indochina-2004 on 128 CGs.
In addition, Figure 12 does not show the runtime of wb-edu on 16 CGs and that of road_usa on
16, 32, and 64 CGs. The reason is that the limited LDM capacity restricts the execution of fgSpMSpV on large-dimension sparse matrices when x has a relatively high density.


Fig. 12. Scalability of CG-stage parallelism on the 10 larger-dimension datasets.

Additionally, according to the proposed adaptive parallel execution, the more CGs are used, the finer the granularity of the CVSs becomes, so that they can fit in the LDM. We also observe that fgSpMSpV exhibits poor CG-stage scalability. Because SpMSpV involves only a small amount of computation, the communication overhead is the main performance bottleneck of parallel SpMSpV. The communication overhead on each CPE corresponds to the cost of loading nnz(x) + 1 column pointers from array SColp, nnz(A)/(α × β) row indices plus padded elements from array SRows, and nnz(A)/(α × β) non-zeros plus padded elements from array SVals. For the tested datasets, which have high sparsity, the overhead of transferring SColp accounts for a large share of the communication overhead of fgSpMSpV, which is why fgSpMSpV shows unsatisfactory CG-stage scalability on the tested datasets.
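As a rough illustration of this cost breakdown, the following C sketch estimates the per-CPE DMA traffic in bytes; the 4-byte index and 8-byte value sizes match the storage accounting used later in Table 3, while folding the padding into a single nnz_pad term is a simplification of ours.

```c
/* Rough model of the per-CPE DMA traffic described above, in bytes.
 * nnz_pad is the number of padded elements introduced by CV padding on this
 * CPE; 4-byte indices and 8-byte double-precision values are assumptions
 * consistent with the storage accounting in Table 3. */
static double cpe_comm_bytes(double nnz_A, double nnz_x,
                             int alpha, int beta, double nnz_pad)
{
    double per_cpe_nnz = nnz_A / (double)(alpha * beta) + nnz_pad;
    double colp_bytes  = 4.0 * (nnz_x + 1.0);   /* SColp: column pointers */
    double rows_bytes  = 4.0 * per_cpe_nnz;     /* SRows: row indices     */
    double vals_bytes  = 8.0 * per_cpe_nnz;     /* SVals: non-zero values */
    return colp_bytes + rows_bytes + vals_bytes;
}
```

Because the SColp term depends only on nnz(x), it does not shrink as α grows, so its relative share of the traffic, and hence the scaling bottleneck, increases with the number of CGs.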

9.2.2 Effects of the Key Techniques. Figure 13 presents the runtime of the main SpMSpV computation in fgSpMSpV on a CG when the proposed re-collection and optimization techniques are applied in turn. We first measure the SpMSpV runtime without any optimization techniques, shown by the bars labeled “SpMSpV without Re-collection and Optimizations”. We then re-collect only x, shown by the bars labeled “SpMSpV with Re-collecting x”. Comparing the “SpMSpV with Re-collecting x” bars with the “SpMSpV without Re-collection and Optimizations” bars, we observe that, on average, the re-collection of x contributes a performance improvement of 1.46% when the density of x is 20%, 14.44% when the density is 1%, 34.97% when the density is 0.1%, and 41.20% when the density is 0.01%. Re-collecting x removes its useless elements; hence, the overhead of fetching data from host memory to the LDM decreases. Moreover, the sparser x is, the more useless elements are removed. This is why the performance improvement is significant and keeps growing as the density of x decreases.


Fig. 13. Performance contribution from re-collection and optimizations on a CG.

Subsequently, we test the SpMSpV runtime of fgSpMSpV with re-collecting x, A, and y, shown by the “SpMSpV with Re-collecting x, A, and y” bars in Figure 13. Comparing the “SpMSpV with Re-collecting x, A, and y” bars with the “SpMSpV with Re-collecting x” bars, on average, the re-collection of A and y contributes a performance improvement of 75.98% when nnz(x) = 20%N, 50.37% when nnz(x) = 1%N, 45.53% when nnz(x) = 0.1%N, and 32.53% when nnz(x) = 0.01%N. If re-collection is not used or only x is re-collected, the memory accesses to the necessary columns of A are irregular, and each CPE loads the non-zeros of only one column of A at a time, which impairs the bandwidth utilization of fgSpMSpV. Re-collecting A and y addresses both problems.
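To illustrate what the re-collection produces, the following C sketch gathers the necessary columns of a CSC matrix into contiguous CSCV-style arrays driven by the non-zero pattern of x. The array names (SColp, SRows, SVals, Xv) follow the paper, but this single-partition version omits the tiling across CGs/CPEs, the re-collection of y, and the CV padding, so it is an illustrative sketch rather than the authors' implementation.

```c
/* Illustrative single-partition sketch of the re-collection step: gather the
 * columns of a CSC matrix A (colp/rows/vals) that correspond to the non-zeros
 * of x into contiguous CSCV-style arrays SColp/SRows/SVals, and compact x into
 * Xv. The caller sizes SColp to nnz_x + 1 entries and SRows/SVals to the total
 * number of non-zeros in the necessary columns. */
static void recollect_cscv(int nnz_x, const int *xidx, const double *xval,
                           const int *colp, const int *rows, const double *vals,
                           int *SColp, int *SRows, double *SVals, double *Xv)
{
    int out = 0;
    SColp[0] = 0;
    for (int j = 0; j < nnz_x; j++) {
        int c = xidx[j];                          /* a necessary column of A */
        Xv[j] = xval[j];
        for (int p = colp[c]; p < colp[c + 1]; p++) {
            SRows[out] = rows[p];
            SVals[out] = vals[p];
            out++;
        }
        SColp[j + 1] = out;
    }
}
```

After this gather, the necessary columns are stored contiguously, so each CPE can fetch several columns with a single DMA transfer instead of one irregular access per column.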
Finally, we test the SpMSpV runtime with re-collection and the three optimization strategies introduced in Section 7, shown by the “SpMSpV with Re-collection and Optimizations” bars in Figure 13; the effect of the three optimizations is obtained by comparing these bars with the “SpMSpV with Re-collecting x, A, and y” bars. On average, the performance improves by 12.24% when nnz(x) = 20%N, 13.41% when nnz(x) = 1%N, 14.78% when nnz(x) = 0.1%N, and 14.86% when nnz(x) = 0.01%N.
As shown in Figure 13, no significant difference is found between the effects of optimizations
for different sparsity of x. Therefore, Figure 14 further presents the performance contributions
on a CG from different optimization techniques, i.e., the CV padding, DMA bandwidth optimiza-
tion, and the optimized prefetching, when nnz(x ) = 20%N . Figure 14 first presents the execution
time of fgSpMSpV that does not use any optimization, labeled “Parallel Re-collected SpMSpV”. Sec-
ond, fgSpMSpV then uses the CV padding technique, and its performance is shown by the bar label
“Padding”. We can observe that the performance contribution of the CV padding technique is quite
limited (1.44% on average) by comparing label “Parallel Re-collected SpMSpV” and label “Padding”.
The CV padding technique enables the utilization of SIMD, which optimizes the computation part
of the parallel SpMSpV. However, the CV padding increases the amount of communication data
at the same time, which has negative impact on its optimization effect. Third, fgSpMSpV with CV
padding further uses the DMA bandwidth optimization, and the label “DMA” presents the perfor-
mance of fgSpMSpV that uses the CV padding technique and DMA bandwidth optimization. The
performance contribution of the DMA bandwidth optimization can be observed by comparing la-
bel “Padding” and label “DMA”, which is 9.58% on average. Finally, we further use the optimized
prefetching technique for fgSpMSpV, and label “Buffering” presents the performance of fgSpMSpV


Fig. 14. Effects of the three optimization techniques on a CG, where nnz(x ) = 20%N .

Fig. 15. The runtime comparison of serial re-collection, OpenMP-based re-collection, and the main parallel
SpMSpV computation in fgSpMSpV on a CG.

The performance contribution of the optimized prefetching technique is 1.52% on average. As shown in Figure 14, among the three techniques, the DMA bandwidth optimization plays the most important role in improving the performance of fgSpMSpV.
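The trade-off between SIMD-friendly computation and extra traffic can be seen in the sketch below, which pads every re-collected column to a multiple of the SIMD width. The width of four doubles (one 256-bit vector) and the use of a zero-valued scratch entry are assumptions about how the CV padding might be realized, not a verbatim reproduction of the technique.

```c
/* Pad each re-collected column so its length is a multiple of SIMD_W.
 * Padded row indices point to a scratch slot of y and padded values are zero,
 * so the vectorized multiply-add leaves the real result unchanged.
 * SIMD_W = 4 (four doubles per 256-bit vector) is an assumption. */
#define SIMD_W 4

static int pad_cscv(int nnz_x, const int *SColp, const int *SRows, const double *SVals,
                    int *PColp, int *PRows, double *PVals, int scratch_row)
{
    int out = 0;
    PColp[0] = 0;
    for (int j = 0; j < nnz_x; j++) {
        int len    = SColp[j + 1] - SColp[j];
        int padded = (len + SIMD_W - 1) / SIMD_W * SIMD_W;
        for (int p = 0; p < padded; p++) {
            if (p < len) {
                PRows[out] = SRows[SColp[j] + p];
                PVals[out] = SVals[SColp[j] + p];
            } else {
                PRows[out] = scratch_row;   /* harmless target row   */
                PVals[out] = 0.0;           /* contributes nothing   */
            }
            out++;
        }
        PColp[j + 1] = out;
    }
    return out;  /* total padded length, used to size the DMA transfers */
}
```

The padded entries are exactly the extra communication data mentioned above, which explains why the net gain of the CV padding alone is small.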
Either the serial re-collection or OpenMP-based re-collection is executed in fgSpMSpV. To an-
alyze the speedups of OpenMP-based re-collection over serial re-collection and the impact of the
re-collection overhead on the overall fgSpMSpV runtime, Figure 15 shows the overheads of serial
re-collection, OpenMP-based re-collection, and the main parallel SpMSpV computation, respec-
tively. The runtime overheads of serial re-collection and OpenMP-based re-collection are mea-
sured on the MPE of a CG, and the parallel SpMSpV is measured on the CPE cluster. As shown in
Figure 15, the average speedup of OpenMP-based re-collection is 3.07× (min: 2.20×, max: 4.47×)


when nnz(x ) = 20%, 2.40× (min: 2.04×, max: 2.66×) when nnz(x ) = 1%, 1.79× (min: 1.15×, max:
2.18×) when nnz(x ) = 0.1%, and 0.45× (min: 0.15×, max: 0.70×) when nnz(x ) = 0.01% over the
serial re-collection. When x has a higher density, the OpenMP-based re-collection on a CG achieves higher speedups over the serial version, but the re-collection still accounts for a large share of the overall fgSpMSpV overhead. When nnz(x) = 0.01%N, the serial re-collection runs faster than the OpenMP-based re-collection, because the very few necessary columns to be re-collected in the tested datasets limit the parallelism that the re-collection method can exploit; in this case, however, the overhead of either re-collection variant is only a small share of the overall overhead of fgSpMSpV. Therefore, the sparser the input vector, the smaller the expected share of the pre-processing overhead on a single CG in the overall overhead of fgSpMSpV. Despite these overheads, fgSpMSpV can still bring
substantial performance improvements in real-world applications, as shown in the following ex-
perimental results.
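A minimal sketch of how the re-collection could be parallelized with OpenMP is given below: a first parallel pass records the length of every necessary column and compacts x, a serial prefix sum builds SColp, and a second parallel pass copies the columns independently. This two-pass organization is our assumption; the paper's OpenMP-based re-collection may be structured differently.

```c
#include <omp.h>

/* Two-pass OpenMP variant of the re-collection sketch above. */
static void recollect_cscv_omp(int nnz_x, const int *xidx, const double *xval,
                               const int *colp, const int *rows, const double *vals,
                               int *SColp, int *SRows, double *SVals, double *Xv)
{
    SColp[0] = 0;
    #pragma omp parallel for
    for (int j = 0; j < nnz_x; j++) {
        int c = xidx[j];
        SColp[j + 1] = colp[c + 1] - colp[c];     /* length of column c */
        Xv[j] = xval[j];
    }
    for (int j = 0; j < nnz_x; j++)               /* serial prefix sum */
        SColp[j + 1] += SColp[j];

    #pragma omp parallel for schedule(dynamic, 64)
    for (int j = 0; j < nnz_x; j++) {
        int c = xidx[j], dst = SColp[j];
        for (int p = colp[c]; p < colp[c + 1]; p++, dst++) {
            SRows[dst] = rows[p];
            SVals[dst] = vals[p];
        }
    }
}
```

When only a handful of columns are necessary (nnz(x) = 0.01%N), the parallel loops carry too little work to amortize the threading overhead, which is consistent with the slowdown reported above.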
We implement fgSpMSpV on a CPU+GPU machine to test its performance on all the 20 sparse
matrices, where the MPI+OpenMP+CUDA parallelization model is used. Figure 16 shows the
runtime comparison of serial re-collection, OpenMP-based re-collection, and the main parallel SpMSpV computation on an NVIDIA Tesla P100 GPU. As can be seen from Figure 16, the OpenMP-based re-collection achieves higher speedups over the serial re-collection as the density of x and the dimension of A increase. However, unlike on the Sunway system, the main parallel SpMSpV computation dominates the runtime of fgSpMSpV on the CPU+GPU machine. On average, the OpenMP-based re-collection overhead is 10.34% (min: 4.02%, max: 20.93%) of the parallel SpMSpV computation overhead when nnz(x) = 20%N, 13.38% (min: 3.33%, max: 29.40%) when nnz(x) = 1%N, 19.72% (min: 1.86%, max: 42.80%) when nnz(x) = 0.1%N, and 21.27% (min: 2.63%, max: 66.58%) when nnz(x) = 0.01%N.
Table 3 shows the storage comparison of the CSC, CSCV, and padded CSCV formats, where an integer occupies 4 bytes and a double-precision floating-point number occupies 8 bytes. For the storage comparison, the columns whose indices i (i ∈ {0, 1, 2, . . . , N − 1}) satisfy (i + 1) mod 5 = 0 are selected when nnz(x) = 20%N, those satisfying (i + 1) mod 100 = 0 when nnz(x) = 1%N, those satisfying (i + 1) mod 1000 = 0 when nnz(x) = 0.1%N, and those satisfying (i + 1) mod 10000 = 0 when nnz(x) = 0.01%N. Compared to the CSC format, the CSCV format saves 84.28% of the memory footprint on average (40.38% when nnz(x) = 20%N, 97.06% when nnz(x) = 1%N, 99.71% when nnz(x) = 0.1%N, and 99.97% when nnz(x) = 0.01%N). Owing to the CV padding technique, the padded CSCV format increases the memory footprint by 60.08% on average over the CSCV format (62.09% when nnz(x) = 20%N, 60.50% when nnz(x) = 1%N, 60.54% when nnz(x) = 0.1%N, and 57.19% when nnz(x) = 0.01%N). In addition, as can be seen from Table 3, the CV padding technique pads relatively fewer elements when x is sparser.
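The footprints reported in Table 3 can be approximated directly from the format definitions. In the C sketch below, the 4-byte and 8-byte element sizes are taken from the table's setup, whereas writing the SColp length as nnz(x) · α · β + 1 follows our reading of the CSCV description and may differ from the exact implementation.

```c
/* Approximate memory footprints in bytes, assuming 4-byte indices/pointers and
 * 8-byte double-precision values (as in Table 3). nnz_sel is the number of
 * non-zeros in the selected (necessary) columns; nnz_pad is the number of
 * extra elements introduced by CV padding. */
static double csc_bytes(double N, double nnz_A)
{
    return 4.0 * (N + 1.0) + 4.0 * nnz_A + 8.0 * nnz_A;
}

static double cscv_bytes(double nnz_x, double alpha, double beta, double nnz_sel)
{
    return 4.0 * (nnz_x * alpha * beta + 1.0) + (4.0 + 8.0) * nnz_sel;
}

static double padded_cscv_bytes(double nnz_x, double alpha, double beta,
                                double nnz_sel, double nnz_pad)
{
    return cscv_bytes(nnz_x, alpha, beta, nnz_sel) + (4.0 + 8.0) * nnz_pad;
}
```

These formulas also show why the CSCV footprint shrinks so quickly as x becomes sparser: both the SColp term and the number of selected non-zeros scale with nnz(x).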
9.2.3 Relative Performance of fgSpMSpV. Figure 17 compares the performance of fgSpMSpV on
an NVIDIA Tesla P100 GPU with two efficient SpMSpV kernels used in the adaptive SpMV/SpMSpV
framework [26] (the sorted merge-based SpMSpV kernel and the merge-based SpMSpV kernel
without sorting). On average, compared to the sorted merge-based SpMSpV and the merge-based
SpMSpV, fgSpMSpV achieves the speedups of 0.22× (min: 0.03×, max: 0.63×) and 0.18× (min: 0.03×,
max: 0.60×) when nnz(x ) = 20%, 2.83× (min: 0.50×, max: 8.04×) and 2.68× (min: 0.50×, max: 7.99×)
when nnz(x ) = 1%, 14.91× (min: 2.80×, max: 53.275×) and 13.85× (min: 2.83×, max: 53.16×) when
nnz(x ) = 0.1%, and 32.61× (min: 5.43×, max: 134.38×) and 32.14× (min: 5.43×, max: 134.18×) when
nnz(x ) = 0.01%. We have two important observations from Figure 17. First, fgSpMSpV outperforms
the two benchmark kernels, except when nnz(x) = 20%. This is because the array SColp[nnz(x) × α × β + 1] in the CSCV format has a high memory footprint when the density of x is high, which incurs more memory-access overhead. Second, the runtime of fgSpMSpV drops much faster than

Fig. 16. The runtime comparison of serial re-collection, OpenMP-based re-collection, and the main parallel
SpMSpV computation in fgSpMSpV on an NVIDIA Tesla P100 GPU.

the two benchmark kernels as x becomes sparser. Owing to the re-collection method, a sparser x dramatically reduces the memory footprint of the CSCV format and the communication overhead, so the performance of fgSpMSpV improves significantly.
In many real-world applications, the sparsity of the input vector varies dramatically during
the program execution. To investigate the usefulness of fgSpMSpV in this case, this experiment is
carried out by applying it to the BFS real-world application on an NVIDIA Tesla P100. The adaptive
SpMV/SpMSpV framework [26] can automatically optimize the SpMV/SpMSpV kernel on GPUs
Table 3. Storage Comparison (MB) of CSC, CSCV, and Padded CSCV Formats
Dataset | CSC | nnz(x)=20%: CSCV, Padded CSCV | nnz(x)=1%: CSCV, Padded CSCV | nnz(x)=0.1%: CSCV, Padded CSCV | nnz(x)=0.01%: CSCV, Padded CSCV
raefsky3 17.128 4.443 8.986 0.222 0.449 0.022 0.045 0.002 0.004
pdb1HYS 25.208 6.787 13.464 0.342 0.680 0.035 0.070 0.003 0.005
rma10 27.347 7.718 14.960 0.393 0.766 0.038 0.073 0.004 0.007
cant 23.526 7.707 13.917 0.385 0.695 0.038 0.069 0.004 0.007
2cubes_sphere 10.394 6.958 9.628 0.347 0.478 0.034 0.047 0.003 0.005
cop20k_A 16.050 9.045 13.216 0.448 0.651 0.044 0.063 0.004 0.006
cage12 23.757 11.013 17.219 0.550 0.860 0.055 0.085 0.006 0.009
144 23.812 12.800 20.449 0.434 0.541 0.043 0.054 0.004 0.005
scircuit 11.626 10.542 13.466 0.527 0.674 0.052 0.065 0.005 0.006
mac_econ_fwd500 15.361 12.998 16.884 0.650 0.844 0.064 0.082 0.006 0.008
dielFilterV3real 521.530 157.325 295.294 7.853 14.735 0.783 1.468 0.076 0.143
hollywood-2009 662.562 187.216 362.624 9.366 18.143 0.911 1.755 0.109 0.218
kron_g500-logn21 1049.893 242.106 428.380 11.840 20.799 1.131 1.956 0.116 0.203
soc-orkut 1228.503 389.423 713.527 19.391 35.489 1.941 3.553 0.204 0.377
wikipedia-20070206 528.939 277.849 416.095 14.357 21.888 1.528 2.404 0.134 0.197
soc-LiveJournal1 808.063 395.505 607.249 19.822 30.472 1.953 2.979 0.198 0.304
ljournal-2008 924.807 442.368 683.021 22.191 34.320 2.198 3.384 0.178 0.241
indochina-2004 2249.690 807.884 1402.324 41.579 72.882 5.146 9.593 0.360 0.600
wb-edu 691.663 611.464 785.752 30.643 39.452 3.003 3.802 0.288 0.351
road_usa 421.563 1235.340 1323.389 61.774 66.186 6.177 6.618 0.617 0.659

Fig. 17. Performance comparison of fgSpMSpV and SpMSpV kernels used in Ref. [26] on an NVIDIA Tesla
P100 GPU.


Fig. 18. Comparison of speedups of fgSpMSpV and the adaptive SpMV/SpMSpV framework [26] over cuS-
PARSE in BFS application on an NVIDIA Tesla P100.

by adapting to the varying input vectors at runtime. Therefore, Figure 18 compares the speedups of fgSpMSpV and the adaptive SpMV/SpMSpV framework over cuSPARSE in the BFS application on the 10 tested sparse matrices common to this article and Ref. [26]. Compared with cuSPARSE, the adaptive SpMV/SpMSpV framework achieves an average speedup of 4.51×, while fgSpMSpV achieves an average speedup of 19.94× (min: 3.48×, max: 56.38×). Compared with the adaptive SpMV/SpMSpV framework, fgSpMSpV achieves an average speedup of 6.69× (min: 0.43×, max: 21.68×). Thanks to the MPI+OpenMP+X model, fgSpMSpV brings substantial performance improvements in real-world applications in spite of the pre-processing overhead. Furthermore, owing to the OpenMP-based re-collection method, fgSpMSpV remains flexible and efficient across the various sparsity patterns of the input vector that arise in real-world applications.
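To make the role of SpMSpV in BFS concrete, the following self-contained C sketch performs one BFS level as a sparse matrix-sparse frontier product over a CSC graph with a Boolean semiring, masking out already-visited vertices; fgSpMSpV would replace the inner gather loop with its parallel kernel. The serial driver is only meant to show how the frontier, i.e., the sparse input vector, changes density from one iteration to the next.

```c
#include <stdlib.h>
#include <string.h>

/* One BFS level as a sparse matrix-sparse frontier product: gather the
 * neighbours of the frontier columns and keep only unvisited vertices. */
static int bfs_level(const int *colp, const int *rows,
                     const int *frontier, int nf,
                     int *level, int cur_level, int *next_frontier)
{
    int nnext = 0;
    for (int f = 0; f < nf; f++) {
        int v = frontier[f];
        for (int p = colp[v]; p < colp[v + 1]; p++) {
            int u = rows[p];
            if (level[u] < 0) {                  /* unvisited: masked merge */
                level[u] = cur_level;
                next_frontier[nnext++] = u;
            }
        }
    }
    return nnext;                                /* size of the new frontier */
}

/* Driver: the frontier starts with one non-zero (the source vertex) and its
 * density changes from level to level. */
static void bfs(int n, const int *colp, const int *rows, int source, int *level)
{
    int *frontier = (int *)malloc(n * sizeof(int));
    int *next     = (int *)malloc(n * sizeof(int));
    memset(level, -1, (size_t)n * sizeof(int));  /* -1 marks unvisited */
    level[source] = 0;
    frontier[0] = source;
    int nf = 1, lvl = 1;
    while (nf > 0) {
        nf = bfs_level(colp, rows, frontier, nf, level, lvl++, next);
        int *t = frontier; frontier = next; next = t;   /* swap frontiers */
    }
    free(frontier);
    free(next);
}
```

Because the frontier may grow from a single vertex to a large fraction of the graph and shrink again, a kernel that stays efficient across the whole density range, as reported above for fgSpMSpV, is what determines the end-to-end BFS performance.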

10 CONCLUSION
SpMSpV is a fundamental operation in many scientific computing and industrial engineering
applications. This article proposes fgSpMSpV on the Sunway TaihuLight supercomputer to
alleviate the challenges of optimizing SpMSpV in large-scale real-world applications caused
by the inherent irregularity and poor data locality of SpMSpV on HPC architectures. Using an
MPI+OpenMP+X parallelization model that exploits the multi-stage and hybrid parallelism of
heterogeneous HPC architectures, both pre-/post-processing and main SpMSpV computation in
fgSpMSpV are accelerated. An adaptive parallel execution is devised for fgSpMSpV to reduce the pre-processing overhead and adapt to the Sunway architecture while still taming the redundant and random memory accesses in SpMSpV, using a set of techniques including the fine-grained partitioner, the re-collection method, and the CSCV matrix format. fgSpMSpV further exploits the computing resources of the platform
by utilizing several optimization techniques. fgSpMSpV on the Sunway TaihuLight gains a
noticeable performance improvement from the key optimization techniques with various sparsity
of the input. Additionally, fgSpMSpV on an NVIDIA Tesla P100 GPU obtains the speedup of up
to 134.38× over the state-of-the-art SpMSpV algorithms. The BFS application using fgSpMSpV
on a P100 GPU achieves the speedup of up to 21.68× over the state-of-the-arts. The results on
both Sunway and GPU architectures show that the main SpMSpV computation of fgSpMSpV scales better for higher densities of x (especially when nnz(x) = 20%N in our experiments), while fgSpMSpV as a whole outperforms the state-of-the-art approaches for lower densities of x (especially when nnz(x) = 0.1%N and nnz(x) = 0.01%N in our experiments).
In our future work, on the basis of fgSpMSpV, we plan to further accelerate real-world applica-
tions, such as BFS, on heterogeneous HPC platforms. For example, we plan to design an adaptive scheme

for applications that switches among SpMSpV optimizations, such as the serial and parallel re-collection methods and different SpMSpV kernels, according to the input data structures and the hardware architecture.

ACKNOWLEDGMENTS
The authors would like to thank the three anonymous reviewers and editors for their valuable and
helpful comments on improving the manuscript.

REFERENCES
[1] Khalid Ahmad, Hari Sundar, and Mary W. Hall. 2020. Data-driven mixed precision sparse matrix vector multiplication
for GPUs. ACM Trans. Archit. Code Optim. 16, 4 (2020), 51:1–51:24.
[2] Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting locality in sparse matrix-matrix multiplication on many-core
architectures. IEEE Trans. Parallel Distrib. Syst. 28, 8 (2017), 2258–2271.
[3] Kadir Akbudak, R. Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-
matrix multiplication. ACM Trans. Parallel Computing 4, 3 (2018), 13:1–13:34.
[4] Michael J. Anderson, Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Theodore L. Willke, and
Pradeep Dubey. 2016. GraphPad: Optimized graph primitives for parallel and distributed platforms. In Proceedings of
the International Parallel and Distributed Processing Symposium. 313–322.
[5] Ariful Azad and Aydin Buluç. 2016. Distributed-Memory algorithms for maximum cardinality matching in bipartite
graphs. In Proceedings of the International Parallel and Distributed Processing Symposium. 32–42.
[6] Ariful Azad and Aydin Buluç. 2017. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. In
Proceedings of the International Parallel and Distributed Processing Symposium. 688–697.
[7] Ariful Azad and Aydin Buluç. 2019. LACC: A linear-algebraic algorithm for finding connected components in dis-
tributed memory. In Proceedings of the International Parallel and Distributed Processing Symposium. 2–12.
[8] Ariful Azad, Aydin Buluç, Xiaoye S. Li, Xinliang Wang, and Johannes Langguth. 2020. A distributed-memory algorithm
for computing a heavy-weight perfect matching on bipartite graphs. SIAM J. Sci. Comput. 42, 4 (2020), C143–C168.
[9] Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2016. Hypergraph partitioning for sparse matrix-
matrix multiplication. ACM Trans. Parallel Computing 3, 3 (2016), 18:1–18:34.
[10] Akrem Benatia, Weixing Ji, Yizhuo Wang, and Feng Shi. 2018. BestSF: A sparse meta-format for optimizing SpMV on
GPU. ACM Trans. Archit. Code Optim. 15, 3 (2018), 29:1–29:27.
[11] Thierry P. Berger, Julien Francq, Marine Minier, and Gaël Thomas. 2016. Extended generalized feistel networks us-
ing matrix representation to propose a new lightweight block cipher: Lilliput. IEEE Trans. Computers 65, 7 (2016),
2074–2089.
[12] Aysenur Bilgin, Hani Hagras, Joy van Helvert, and Daniyal M. Al-Ghazzawi. 2016. A linear general type-2 fuzzy-logic-
based computing with words approach for realizing an ambient intelligent platform for cooking recipe recommenda-
tion. IEEE Trans. Fuzzy Systems 24, 2 (2016), 306–329.
[13] Aydin Buluç, Erika Duriakova, Armando Fox, John R. Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, and
Samuel Williams. 2013. High-Productivity and high-performance analysis of filtered semantic graphs. In Proceedings
of the International Parallel and Distributed Processing Symposium. 237–248.
[14] Aydin Buluç and Kamesh Madduri. 2011. Parallel breadth-first search on distributed memory systems. In Proceedings
of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 65:1–65:12.
[15] Paolo Campigotto, Christian Rudloff, Maximilian Leodolter, and Dietmar Bauer. 2017. Personalized and situation-
aware multimodal route recommendations: The FAVOUR algorithm. IEEE Trans. Intelligent Transportation Systems 18,
1 (2017), 92–102.
[16] Yuedan Chen, Kenli Li, Wangdong Yang, Guoqing Xiao, Xianghui Xie, and Tao Li. 2019. Performance-Aware model
for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer. IEEE Trans. Parallel Distrib. Syst. 30,
4 (2019), 923–938.
[17] Yuedan Chen, Guoqing Xiao, M. Tamer Özsu, Chubo Liu, Albert Y. Zomaya, and Tao Li. 2020. aeSpTV: An adaptive
and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform. IEEE
Trans. Parallel Distributed Syst. 31, 10 (2020), 2329–2345.
[18] Yuedan Chen, Guoqing Xiao, Fan Wu, and Zhuo Tang. 2019. Towards large-scale sparse matrix-vector multiplication
on the SW26010 manycore architecture. In Proceedings of the 21st International Conference on High Performance Com-
puting and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on
Data Science and Systems. IEEE, 1469–1476.
[19] Steven Dalton, Luke N. Olson, and Nathan Bell. 2015. Optimizing sparse matrix - matrix multiplication for the GPU.
ACM Trans. Math. Softw. 41, 4 (2015), 25:1–25:20.


[20] Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38,
1 (2011), 1:1–1:25.
[21] Fred G. Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM
Trans. Math. Softw. 4, 3 (1978), 250–269.
[22] Lixin He, Hong An, Chao Yang, Fei Wang, Junshi Chen, Chao Wang, Weihao Liang, Shao-Jun Dong, Qiao Sun, Wenting
Han, Wenyuan Liu, Yongjian Han, and Wenjun Yao. 2018. PEPS++: towards extreme-scale simulations of strongly
correlated quantum many-particle models on Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 29, 12 (2018), 2838–
2848.
[23] Yong-Yeon Jo, Sang-Wook Kim, and Duck-Ho Bae. 2015. Efficient sparse matrix multiplication on GPU for large social
network analysis. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Manage-
ment. 1261–1270.
[24] Byeongho Kim, Jongwook Chung, Eojin Lee, Wonkyung Jung, Sunjung Lee, Jaewan Choi, Jaehyun Park, Minbok Wi,
Sukhan Lee, and Jung Ho Ahn. 2020. MViD: Sparse matrix-vector multiplication in mobile DRAM for accelerating
recurrent neural networks. IEEE Trans. Computers 69, 7 (2020), 955–967.
[25] Daniel Langr and Pavel Tvrdík. 2016. Evaluation criteria for sparse matrix storage formats. IEEE Trans. Parallel Distrib.
Syst. 27, 2 (2016), 428–440.
[26] Min Li, Yulong Ao, and Chao Yang. 2021. Adaptive SpMV/SpMSpV on GPUs for input vectors of varied sparsity. IEEE
Trans. Parallel Distributed Syst. 32, 7 (2021), 1842–1853.
[27] Siqiang Luo. 2019. Distributed PageRank Computation: An improved theoretical study. In Proceedings of the AAAI
Conference on Artificial Intelligence. 4496–4503.
[28] Xin Luo, Mengchu Zhou, Shuai Li, Lun Hu, and Mingsheng Shang. 2020. Non-negativity constrained missing data
estimation for high-dimensional and sparse matrices from industrial applications. IEEE Trans. on Cybernetics 50, 5
(2020), 1844–1855.
[29] Xin Luo, Mengchu Zhou, Shuai Li, Zhu-Hong You, Yunni Xia, and Qingsheng Zhu. 2016. A nonnegative latent factor
model for large-scale sparse matrices in recommender systems via alternating direction method. IEEE Trans. Neural
Netw. Learning Syst. 27, 3 (2016), 579–592.
[30] Asit K. Mishra, Eriko Nurvitadhi, Ganesh Venkatesh, Jonathan Pearce, and Debbie Marr. 2017. Fine-grained acceler-
ators for sparse machine learning workloads. In Proceedings of the 22nd Asia and South Pacific Design Automation
Conference. 635–640.
[31] Muhammet Mustafa Ozdal. 2019. Improving efficiency of parallel vertex-centric algorithms for irregular graphs. IEEE
Trans. Parallel Distrib. Syst. 30, 10 (2019), 2265–2282.
[32] Ryan A. Rossi and Nesreen K. Ahmed. 2015. The network data repository with interactive graph analytics and visual-
ization. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 4292–4293.
[33] Liang Sun, Shuiwang Ji, and Jieping Ye. 2009. A least squares formulation for a class of generalized eigenvalue problems
in machine learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 977–984.
[34] Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Subramanya Dulloor, Michael J. Anderson,
Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High performance graph analytics
made productive. PVLDB 8, 11 (2015), 1214–1225.
[35] Guoqing Xiao, Yuedan Chen, Chubo Liu, and Xu Zhou. 2020. ahSpMV: An auto-tuning hybrid computing scheme for
SpMV on the Sunway architecture. IEEE Internet of Things Journal 7, 3 (2020), 1736–1744.
[36] Guoqing Xiao, Kenli Li, Yuedan Chen, Wangquan He, Albert Zomaya, and Tao Li. 2021. CASpMV: A customized and
accelerative SpMV framework for the Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 32, 1 (2021), 131–146.
[37] Carl Yang, Yangzihao Wang, and John D. Owens. 2015. Fast sparse matrix and sparse vector multiplication algorithm
on the GPU. In Proceedings of the International Parallel and Distributed Processing Symposium. 841–847.
[38] Leonid Yavits and Ran Ginosar. 2018. Accelerator for sparse machine learning. IEEE Comput. Archit. Lett. 17, 1 (2018),
21–24.
[39] Yongzhe Zhang, Ariful Azad, and Zhenjiang Hu. 2020. FastSV: A distributed-memory connected component algorithm
with fast convergence. In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP).
46–57.
[40] Yunquan Zhang, Shigang Li, Shengen Yan, and Huiyang Zhou. 2016. A cross-platform SpMV framework on many-core
architectures. ACM Trans. Archit. Code Optim. 13, 4 (2016), 33:1–33:25.
[41] Weijie Zhou, Yue Zhao, Xipeng Shen, and Wang Chen. 2020. Enabling runtime SpMV format selection through an
overhead conscious method. IEEE Trans. Parallel Distrib. Syst. 31, 1 (2020), 80–93.
[42] Alwin Zulehner and Robert Wille. 2019. Matrix-Vector vs. matrix-matrix multiplication: Potential in DD-based simu-
lation of quantum computations. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition.
90–95.

Received October 2020; revised December 2021; accepted December 2021
